This item was added on: 2003/07/01
Tokenizing is a common task, and there are many ways to do it.
A token is a group of characters (like a word) which is separated by a delimiter. A delimiter is a symbol which separates tokens from each other. The process of tokenizing is extracting tokens from a string or a stream.
"a b cde"
If you define your delimiter to be a single space, ' ', then you have three tokens: "a", "b", and "cde". Using whitespace as a delimiter makes tokenizing easy, because the >> operator of stringstreams and other input streams already skips leading whitespace and then reads up to the next whitespace character.
#include <string>
#include <iostream>
#include <sstream>

int main(void)
{
    std::string s = "A B CDE";
    std::stringstream ss(s);
    std::string temp;

    std::cout << "s is: " << s << std::endl;
    // Each >> extracts one whitespace-delimited token; the loop ends
    // when the stream runs out of input.
    while (ss >> temp)
        std::cout << temp << std::endl;
    return(0);
}
But what if your delimiter wasn't whitespace? The iostream get() function allows you to choose your own delimiter. get() is problematic, however, because it leaves the delimiter behind in the stream, so you have to remove it yourself before reading the next token.
If you want to tokenize a string contained inside the string class, you can use the find() and erase() functions of the string class.
#include <string>
#include <iostream>
#include <vector>

int main(void)
{
    std::string numbers_str = "zero,one,two,three,four,five,six,seven,eight,nine,ten";
    std::vector<std::string> numbers;
    size_t pos;

    // Peel tokens off the front of the string until no comma is left.
    while ((pos = numbers_str.find(",", 0)) != std::string::npos)
    {
        numbers.push_back(numbers_str.substr(0, pos));
        numbers_str.erase(0, pos + 1);   // remove the token and its comma
    }
    numbers.push_back(numbers_str);      // whatever remains is the last token

    std::cout << "Number " << 3 << " is " << numbers[3] << std::endl;
    return(0);
}
Sometimes you want to ignore some areas of your string. What if your delimiter was a dash?
numbers_str = "forty-five-forty-six-forty-seven-forty-eight";
(Pretend that these numbers need the dashes, and that you need to use the '-' as a delimiter)
You could take two tokens and combine them yourself. That often is enough. But sometimes there are situations where you need more flexibility.
equation = "3+f(x+y)+4"
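Simply parsing by plus signs won't do because it would tear apart your function f(x+y). In situations like these it's often easiest to grab the area in between parentheses before your tokenizing function can get into it. The example below uses the simpler input "3-(x+y)+4". It finds the matching closing parenthesis and hands the enclosed text to a recursive call of parse(), so the text inside the parentheses is tokenized on its own rather than together with the rest of the string.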
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> parse(std::string);

int main(void)
{
    std::vector<std::string> v = parse("3-(x+y)+4");
    for (size_t i = 0; i < v.size(); i++)
    {
        std::cout << v.at(i) << std::endl;
    }
    return(0);
}

// Splits equation on '+' and '-'. A parenthesized group is parsed
// recursively and its tokens are added to the result in order.
std::vector<std::string> parse(std::string equation)
{
    std::vector<std::string> tokens;
    std::string accumulator;   // characters of the token being built
    for ( ; ; )
    {
        if (equation.size() == 0)
        {
            if (!accumulator.empty()) tokens.push_back(accumulator);
            break;
        }
        switch (equation[0])
        {
        case '(':
        {
            equation.erase(0, 1);   // drop the opening parenthesis
            int depth = 1;          // number of unclosed parentheses
            size_t siz;
            // Stop one past the matching ')', or at the end of the string.
            for (siz = 0; depth && (siz != equation.size()); ++siz)
            {
                if (equation[siz] == '(') ++depth;
                if (equation[siz] == ')') --depth;
            }
            if (depth)
            {
                break;   // unbalanced: leave the switch and keep scanning
            }
            std::vector<std::string> temp_tokens = parse(equation.substr(0, siz - 1));
            equation.erase(0, siz);   // remove the group and its ')'
            for (size_t i = 0; i < temp_tokens.size(); ++i)
            {
                tokens.push_back(temp_tokens[i]);
            }
        }
        break;
        case '+':
        case '-':
            // A delimiter ends the current token, if there is one.
            equation.erase(0, 1);
            if (!accumulator.empty()) tokens.push_back(accumulator);
            accumulator.clear();
            break;
        default:
            accumulator += equation[0];
            equation.erase(0, 1);
            break;
        }
    }
    return(tokens);
}
Tokenizing can get very complex if you're not careful. Hopefully these few tips will help you out.
Credit: ygfperson