FAQ > Separate a string into tokens? (C++)



This item was added on: 2003/07/01

Tokenizing is a useful task. There are many ways to do it.

A token is a group of characters (like a word) that is separated from other tokens by a delimiter. A delimiter is a symbol that marks where one token ends and the next begins. Tokenizing is the process of extracting tokens from a string or a stream.

"a b cde"

If you define your delimiter to be a single space, ' ', then you have three tokens: "a", "b", and "cde". Using whitespace as a delimiter makes tokenizing easy: stringstreams and other input streams split on whitespace whenever you extract a string with operator>>.


#include <string> 
#include <iostream> 
#include <sstream> 
int main(void)
{
  std::string s = "A B CDE";        //a standard string
  std::stringstream os(s);          //a standard stringstream which parses 's'
  std::string temp;                 //a temporary string
  
  std::cout <<"s is: " <<s <<std::endl;
  
  while (os >> temp)                //each >> reads the next whitespace-separated token into temp
    std::cout <<temp <<std::endl;   //and moves the stream past that token
                                    //the token can now be
                                    //output to the console, put into an array,
                                    //or whatever you choose to do with it.
  return(0);
}


But what if your delimiter wasn't whitespace? The iostream get() function allows you to choose your own delimiter. get() is problematic, however, because it leaves the delimiter behind in the stream, so you have to remove it yourself before reading the next token.
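
One alternative that the text above does not cover is std::getline() from <string>. It takes a delimiter argument and, unlike get(), extracts and discards the delimiter. The following is only a minimal sketch; split_with_getline is a hypothetical helper name, not a standard function.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library):
// split 'text' on a single-character delimiter.
std::vector<std::string> split_with_getline(const std::string& text, char delim)
{
  std::vector<std::string> tokens;
  std::istringstream in(text);      // wrap the string in an input stream
  std::string token;

  // getline reads up to the delimiter, stores that text in token,
  // and throws the delimiter itself away
  while (std::getline(in, token, delim))
    tokens.push_back(token);

  return tokens;
}
```

Calling split_with_getline("zero,one,two", ',') should give the three tokens "zero", "one" and "two". Two delimiters in a row produce an empty token between them, which may or may not be what you want.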

If you want to tokenize a string held in a std::string object, you can use the find() and erase() member functions of the string class.


#include <string> 
#include <iostream> 
#include <vector> 
int main(void)
{
  std::string numbers_str = "zero,one,two,three,four,five,six,seven,eight,nine,ten";
  std::vector < std::string > numbers; //we'll put all of the tokens in here 
  std::string  temp;

  while (numbers_str.find(",", 0) != std::string::npos)
  { //does the string have a comma in it?
    size_t  pos = numbers_str.find(",", 0); //store the position of the delimiter
    temp = numbers_str.substr(0, pos);      //get the token
    numbers_str.erase(0, pos + 1);          //erase it from the source 
    numbers.push_back(temp);                //and put it into the array
  }

  numbers.push_back(numbers_str);           //the last token is all alone 
  std::cout << "Number " << 3 << " is " << numbers[3] << std::endl;

  return(0);
}
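
The program above destroys numbers_str as it goes. If you need to keep the source string, a variation is to pass find() a starting position instead of erasing. This is a sketch under that assumption; split is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split 'source' on 'delim' without modifying 'source'.
std::vector<std::string> split(const std::string& source, char delim)
{
  std::vector<std::string> tokens;
  std::string::size_type start = 0;                        // where the current token begins
  std::string::size_type pos = source.find(delim, start);  // next delimiter at or after start

  while (pos != std::string::npos)
  {
    tokens.push_back(source.substr(start, pos - start));   // the text between two delimiters
    start = pos + 1;                                       // step over the delimiter
    pos = source.find(delim, start);
  }

  tokens.push_back(source.substr(start));                  // the last token is all alone
  return tokens;
}
```

With this version, split("zero,one,two,three", ',')[3] should be "three", and the string you passed in is left untouched.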


Sometimes you want to ignore some areas of your string. What if your delimiter was a dash?

numbers_str = "forty-five-forty-six-forty-seven-forty-eight";

(Pretend that these numbers need the dashes, and that you need to use the '-' as a delimiter)

You could take two tokens and combine them yourself. That often is enough. But sometimes there are situations where you need more flexibility.
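
For the dash example, combining tokens yourself could look something like the sketch below. It assumes every number is made of exactly two words, which happens to be true for this input but not in general; join_pairs is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: glue tokens back together two at a time, so that
// "forty", "five", "forty", "six" becomes "forty-five", "forty-six".
std::vector<std::string> join_pairs(const std::vector<std::string>& tokens, char delim)
{
  std::vector<std::string> joined;

  for (std::vector<std::string>::size_type i = 0; i < tokens.size(); i += 2)
  {
    std::string item = tokens[i];
    if (i + 1 < tokens.size())      // an odd token left at the end is kept as it is
    {
      item += delim;
      item += tokens[i + 1];
    }
    joined.push_back(item);
  }

  return joined;
}
```

You would first tokenize on '-' as before, then hand the resulting vector to join_pairs with '-' as the second argument.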

equation = "3+f(x+y)+4"

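Simply parsing by plus signs won't do because it would tear apart your function f(x+y). In situations like these it's often easiest to grab the area in between parentheses before your tokenizing function can get into it. The example below does this for the simpler expression "3-(x+y)+4" and tokenizes the contents of the parentheses with a recursive call.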


#include <iostream> 
#include <string> 
#include <vector> 

std::vector<std::string>  parse(std::string);

int main(void)
{
  std::vector < std::string > v = parse("3-(x+y)+4");

  for (unsigned int i = 0; i < v.size(); i++)
  {
    std::cout << v.at(i) << std::endl;
  }

  return(0);
}

std::vector<std::string> parse(std::string equation)
{
  std::vector < std::string > tokens; //used to store the tokens
  std::string accumulator;

  for ( ; ; )
  {
    if (equation.size() == 0)
    {
      //end of string
      tokens.push_back(accumulator);  
      break;
    }

    switch (equation[0])
    { 
      //the first letter of equation
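    case '(':
      {
        if (!accumulator.empty())
        {
          //text directly before the '(' (such as the f in f(x+y)) is its own token
          tokens.push_back(accumulator);
          accumulator.clear();
        }

        equation.erase(0, 1);       //remove the '('
        int pos = 1;                //number of '(' not yet closed by a ')'
        size_t siz;
        for (siz = 0; pos && (siz != equation.size()); ++siz)
        {
          if (equation[siz] == '(') ++pos;
          if (equation[siz] == ')') --pos;
        }

        if (pos)
        {
          //error: mismatched parentheses, too many '('
          //add an error message if you want; this break only leaves the
          //switch, so the rest of the string is tokenized as usual
          break;
        }

        //siz is now one past the matching ')', so the text inside
        //the parentheses is the first siz - 1 characters
        std::string temp = equation.substr(0, siz - 1);
        std::vector < std::string > temp_tokens = parse(temp);

        equation.erase(0, siz);     //remove the group and its ')'
        if (!equation.empty() && (equation[0] == '+' || equation[0] == '-'))
          equation.erase(0, 1);     //and the operator right after it, if any

        for (size_t i = 0; i < temp_tokens.size(); ++i)
        {
          //iterators not used to simplify things
          tokens.push_back(temp_tokens[i]);
        }
      }
      break;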
    case '+':
      equation.erase(0, 1);
      tokens.push_back(accumulator);
      accumulator.clear();

      break;
    case '-': //for our intents and purposes this is the same as '+'
      equation.erase(0, 1);
      tokens.push_back(accumulator);
      accumulator.clear();
      break;
    default:
      accumulator += equation.substr(0, 1);
      equation.erase(0, 1);
      break;
    }
  }

  return(tokens);
}
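
If you would rather keep f(x+y) in one piece than tokenize what is inside it, another option is to count parentheses and split only on a '+' that sits outside all of them. This is a different technique from the recursive parse() above, shown only as a sketch; split_top_level is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split on '+' only when it is outside all parentheses.
std::vector<std::string> split_top_level(const std::string& equation)
{
  std::vector<std::string> tokens;
  std::string accumulator;          // characters of the token being built
  int depth = 0;                    // number of '(' not yet closed by a ')'

  for (std::string::size_type i = 0; i < equation.size(); ++i)
  {
    char c = equation[i];
    if (c == '(') ++depth;
    if (c == ')' && depth > 0) --depth;

    if (c == '+' && depth == 0)
    {
      tokens.push_back(accumulator); // a '+' outside parentheses ends the token
      accumulator.clear();
    }
    else
    {
      accumulator += c;             // anything else, including a '+' inside (), is kept
    }
  }

  tokens.push_back(accumulator);    // the last token
  return tokens;
}
```

For "3+f(x+y)+4" this should give the tokens "3", "f(x+y)" and "4". It does no error checking for unbalanced parentheses.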


Tokenizing can get very complex if you're not careful. Hopefully these few tips will help you out.

Credit: ygfperson
