STL copy with a delimiter???

**LarryChen** · December 12th, 2011, 06:42 PM

As we know, we can use copy to split a string and then insert the tokens into a vector like this,

Code:

// needs <algorithm>, <iterator>, <sstream>, <string>, <vector>
vector<string> vec;
istringstream iss(s);   // s is the string to split

copy(istream_iterator<string>(iss), istream_iterator<string>(), back_inserter(vec));

However, the method above works only when the string is separated by white space. What if the string is separated by some delimiter other than white space? Thanks.

**laserlight** · December 12th, 2011, 11:10 PM

Originally Posted by LarryChen

What if the string is separated by some delimiter other than white space?

I would probably opt for a loop with getline or use Boost.Tokenizer (which would then allow for the use of std::copy).

**nuzzle** · December 13th, 2011, 01:11 AM

Originally Posted by LarryChen

What if the string is separated by some delimiter other than white space? Thanks.

I would have a look at regular expressions, now part of the C++ standard.

**PredicateNormative** · December 13th, 2011, 03:13 AM

I would probably use boost split. But if you don't want to use boost, with not much effort, you could write your own version of the split functionality found in boost. Something like the following should do it:

Code:

#include <string>

// Predicate: true when a character is one of the given delimiter characters.
// (std::unary_function is not needed here and was removed in C++17.)
struct is_any_of
{
  explicit is_any_of(const std::string& values)
    :values_(values)
  {}

  bool operator()(char v) const
  {
    return values_.find(v) != std::string::npos;
  }

private:
  std::string values_;
};

template <typename OutputContainer, typename Predicate>
void split(OutputContainer& dst, const std::string& src, Predicate predicate)
{
  std::string::const_iterator first = src.begin();
  std::string::const_iterator last  = src.end();

  std::string item;

  while(first != last)
  {
    if(predicate(*first))
    {
      dst.push_back(item);
      item = "";
    }
    else
    {
      item.push_back(*first);
    }

    ++first;
  }
  dst.push_back(item);
}

Now assuming that you put the above in a header file called split.hpp then you could write something like:

Code:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>

#include "split.hpp"

int main()
{
  std::string linebuffer;
  std::ifstream ifile;
  std::vector<std::string> vec;

  //Load a file here
  //....
  std::getline(ifile, linebuffer);

  split(vec, linebuffer, is_any_of(";:, |\t"));
}

Anyway, that's what I would do.

**LarryChen** · December 13th, 2011, 11:54 AM

Thanks so much for your help, guys. I decided to use regular expressions to solve my problem, as nuzzle suggested. Here is my sample code:

Code:

int main()
{
	string s = "abc|def gh|ijk|lmn";
	regex pattern("\\w+|");
	sregex_token_iterator end;

	for(sregex_token_iterator i(s.begin(), s.end(), pattern); i!=end;++i)
	{
		cout<<*i<<endl;
	}
	
	return 0;
}

The problem I still have is that the retrieved tokens are "abc", "def", "gh", "ijk", "lmn", but what I expect is "abc", "def gh", "ijk", "lmn". How would I get around the whitespace issue here? Thanks.

**superbonzo** · December 13th, 2011, 12:39 PM

just write

Code:

int main()
{
	string s = "abc|def gh|ijk|lmn";
	regex pattern( "[|]");
	sregex_token_iterator end;

	for(sregex_token_iterator i(s.begin(), s.end(), pattern, -1 ); i!=end;++i)
	{
		cout<<*i<<endl;
	}
	
	return 0;
}

The "-1" basically tells the iterator to return the text between matches instead of the matches themselves, so the pattern acts as the delimiter.

**LarryChen** · December 13th, 2011, 06:22 PM

Thanks for your code. It works perfectly! Would you explain the meaning of "-1" used in sregex_token_iterator? I am not able to understand the explanation from MSDN.

**superbonzo** · December 14th, 2011, 05:25 AM

Originally Posted by LarryChen

Thanks for your code. It works perfectly! Would you explain the meaning of "-1" used in sregex_token_iterator? I am not able to understand the explanation from MSDN.

Well, a regex_token_iterator is built on top of regex_iterator, so let's look at that first.

A regex_iterator R basically wraps consecutive regex_search calls on a sequence S of characters. Each search starts from the end of the previous match, or from the beginning of the sequence if R has just been constructed.

Hence, the result of *R is a (const reference to a) match_results object storing the following ranges of iterators of S:
- a prefix range R->prefix(), going from the end of the previous match to the start of the current match
- a suffix range R->suffix(), going from the end of the current match to the end of S
- a match range (*R)[0], the current match
- a set of match ranges (*R)[j], representing marked submatches

For example, "\\d+" on "a 1 b 10 c 100 d 1000 e" will give the sequence of [prefix, match, suffix] ( there are no submatches in this case ):

["a ","1"," b 10 c 100 d 1000 e"]
[" b ","10"," c 100 d 1000 e"]
[" c ","100"," d 1000 e"]
[" d ","1000"," e"]

Then, a regex_token_iterator T wraps a regex_iterator R and a vector of indices V:={i1,...,iN}:

T represents the sequence of subranges (*R)[i1],(*R)[i2],...,(*R)[iN], ++R, (*R)[i1], ..., (*R)[iN], ++R, ... and so on until R becomes an end iterator. So, it's the same as a regex_iterator, but instead of returning a sequence of match_results objects it returns a sequence of iterator ranges of S, where the enumerated marked submatches ( index > 0 ) or the match itself ( index == 0 ) are specified by the supplied vector of indices.

Now, in theory, only non-negative indices make sense here; actually, the token iterator supports an extended semantics where, intuitively, an index of "-1" represents the prefix of the current match result.
So, if the Jth index is -1 the resulting sequence will be

(*R)[i1],(*R)[i2], ..., (*R)[iJ-1], R->prefix(), (*R)[iJ+1], ... ,(*R)[iN], ++R, ...

Moreover, whenever a -1 index appears in V, it further extends the semantics by adding a last element to the sequence represented by T, this time consisting of the suffix of the current ( and thus the last ) match result, provided that suffix is not empty.

So again, if the Jth index is -1 the resulting sequence will end with

..., (*R)[i1],(*R)[i2], ..., (*R)[iJ-1], R->prefix(), (*R)[iJ+1], ... ,(*R)[iN], R->suffix()

The rationale being that the remaining unmatched part of S can be considered the prefix of the "end" of S.

In this way, initializing T with a single -1 index will split S exactly into the substrings delimited by the specified pattern. In the example above, a token iterator over "a 1 b 10 c 100 d 1000 e" with the pattern "\\d+" and index -1 ( written loosely as sregex_token_iterator( "a 1 b 10 c 100 d 1000 e", "\\d+", -1 ) ) will give the sequence "a "," b "," c "," d "," e".

and that's it
