STL copy with a delimiter???


December 12th, 2011, 06:42 PM
LarryChen

STL copy with a delimiter???

As we know, we can use std::copy to split a string and insert the tokens into a vector, like this:

Code:

string s;              // the string to split
istringstream iss(s);  // istream_iterator needs a stream, not a plain string
vector<string> vec;
copy(istream_iterator<string>(iss), istream_iterator<string>(), back_inserter< vector<string> >(vec));

However, the method above works only when the string is separated by white space. What if the string is separated by some delimiter other than white space? Thanks.
December 12th, 2011, 11:10 PM
laserlight

Re: STL copy with a delimiter???

Quote:

Originally Posted by LarryChen

What if the string is separated by some delimiter other than white space?

I would probably opt for a loop with getline or use Boost.Tokenizer (which would then allow for the use of std::copy).
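
A minimal sketch of the getline loop mentioned above, assuming the delimiter is a single character. The helper name split_on is made up for illustration and does not come from the thread.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): split src on one delimiter character.
std::vector<std::string> split_on(const std::string& src, char delim)
{
    std::vector<std::string> tokens;
    std::istringstream iss(src);
    std::string item;
    // std::getline reads up to the next delim and discards the delimiter itself.
    while (std::getline(iss, item, delim)) {
        tokens.push_back(item);
    }
    return tokens;
}
```

Calling split_on("abc|def gh|ijk|lmn", '|') yields the four tokens "abc", "def gh", "ijk" and "lmn". Note that with this approach a trailing delimiter does not produce an empty last token.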
December 13th, 2011, 01:11 AM
nuzzle

Re: STL copy with a delimiter???

Quote:

Originally Posted by LarryChen

What if the string is separated by some delimiter other than white space? Thanks.

I would have a look at regular expressions, now part of the C++ standard (C++11, header <regex>).
December 13th, 2011, 03:13 AM
PredicateNormative

Re: STL copy with a delimiter???

I would probably use Boost's split. But if you don't want to use Boost, you could write your own version of its split functionality without much effort. Something like the following should do it:

Code:

#include <string>

// Predicate: true if the character is one of the given delimiter characters.
struct is_any_of
{
    is_any_of(const std::string& values) : values_(values) {}

    bool operator()(char v) const
    {
        return values_.find(v) != std::string::npos;
    }

private:
    std::string values_;
};

// Splits src at every character for which predicate returns true and
// appends the pieces (including empty ones) to dst.
template <typename OutputContainer, typename Predicate>
void split(OutputContainer& dst, const std::string& src, Predicate predicate)
{
    std::string::const_iterator first = src.begin();
    std::string::const_iterator last = src.end();
    std::string item;
    while (first != last)
    {
        if (predicate(*first))
        {
            dst.push_back(item);
            item.clear();
        }
        else
        {
            item.push_back(*first);
        }
        ++first;
    }
    dst.push_back(item);  // the final token, after the last delimiter
}

Now assuming that you put the above in a header file called split.hpp then you could write something like:

Code:

#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include "split.hpp"

int main()
{
    std::string linebuffer;
    std::ifstream ifile;
    std::vector<std::string> vec;

    //Load a file here
    //....

    std::getline(ifile, linebuffer);
    split(vec, linebuffer, is_any_of(";:, |\t"));
}

Anyway, that's what I would do.
December 13th, 2011, 11:54 AM
LarryChen

Re: STL copy with a delimiter???

Thanks so much for your help, guys. I decided to use regular expressions to solve my problem, as nuzzle suggested. Here is my sample code:

Code:

#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    string s = "abc|def gh|ijk|lmn";
    regex pattern("\\w+|");
    sregex_token_iterator end;
    for (sregex_token_iterator i(s.begin(), s.end(), pattern); i != end; ++i)
    {
        cout << *i << endl;
    }
    return 0;
}

The problem I still have is that the retrieved tokens are "abc" "def" "gh" "ijk" "lmn", but what I expect is "abc" "def gh" "ijk" "lmn". How do I get around the white space issue here? Thanks.
December 13th, 2011, 12:39 PM
superbonzo

Re: STL copy with a delimiter???

just write

Code:

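#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    string s = "abc|def gh|ijk|lmn";
    regex pattern("[|]");  // the delimiter, not the token
    sregex_token_iterator end;
    for (sregex_token_iterator i(s.begin(), s.end(), pattern, -1); i != end; ++i)
    {
        cout << *i << endl;
    }
    return 0;
}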

The "-1" tells the iterator to return the text between matches rather than the matches themselves, so the string is split wherever the pattern matches.
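
To tie this back to the original question, the same token iterator can be handed to std::copy. The following is only a sketch: the helper name regex_split is hypothetical, and it assumes the delimiter is given as a regular expression string.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): split s wherever delim_pattern matches.
std::vector<std::string> regex_split(const std::string& s, const std::string& delim_pattern)
{
    // The regex must be a named object: the token iterator refuses temporaries.
    const std::regex pattern(delim_pattern);
    std::vector<std::string> tokens;
    // Index -1 selects the text between matches instead of the matches themselves.
    std::copy(std::sregex_token_iterator(s.begin(), s.end(), pattern, -1),
              std::sregex_token_iterator(),
              std::back_inserter(tokens));
    return tokens;
}
```

With this, regex_split("abc|def gh|ijk|lmn", "[|]") fills the vector with "abc", "def gh", "ijk" and "lmn".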
December 13th, 2011, 06:22 PM
LarryChen

Re: STL copy with a delimiter???

Thanks for your code. It works perfectly! Would you explain the meaning of "-1" used in sregex_token_iterator? I am not able to understand the explanation from MSDN.

December 14th, 2011, 05:25 AM
superbonzo

Re: STL copy with a delimiter???

Quote:

Originally Posted by LarryChen

Thanks for your code. It works perfectly! Would you explain the meaning of "-1" used in sregex_token_iterator? I am not able to understand the explanation from MSDN.

Well, a regex_token_iterator is built on regex_iterator, so let's look at that first.

A regex_iterator R basically wraps consecutive regex_search calls on a character sequence S, each search starting from the end of the previous match, or from the beginning of S if R has just been constructed.

Hence, the result of *R is a (const reference to a) match_results object storing the following iterator ranges of S:
- a prefix range R->prefix(), going from the end of the previous match to the current match
- a suffix range R->suffix(), going from the end of the current match to the end of S
- a match range (*R)[0], the current match
- a set of match ranges (*R)[j], representing marked submatches

For example, "\\d+" on "a 1 b 10 c 100 d 1000 e" will give the sequence of [prefix, match, suffix] ( there are no submatches in this case ):

["a ","1"," b 10 c 100 d 1000 e"]
[" b ","10"," c 100 d 1000 e"]
[" c ","100"," d 1000 e"]
[" d ","1000"," e"]
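
The table above can be reproduced with a short sketch along these lines. The helper name match_triples and the "[prefix|match|suffix]" row format are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for each match of pattern in s, record "[prefix|match|suffix]".
std::vector<std::string> match_triples(const std::string& s, const std::regex& pattern)
{
    std::vector<std::string> rows;
    // A default-constructed sregex_iterator is the end-of-sequence iterator.
    for (std::sregex_iterator r(s.begin(), s.end(), pattern), end; r != end; ++r) {
        rows.push_back("[" + r->prefix().str() + "|" + r->str() + "|" + r->suffix().str() + "]");
    }
    return rows;
}
```

For the pattern "\\d+" and the string "a 1 b 10 c 100 d 1000 e" this gives four rows, the first being "[a |1| b 10 c 100 d 1000 e]" and the last "[ d |1000| e]".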

Then, a regex_token_iterator T wraps a regex_iterator R and a vector of indices V:={i1,...,iN}:

T represents the sequence of subranges (*R)[i1],(*R)[i2],...,(*R)[iN], ++R, (*R)[i1], ..., (*R)[iN], ++R, ... and so on until R becomes an end iterator. So it's the same as a regex_iterator, but instead of returning a sequence of match_results objects it returns a sequence of iterator ranges of S, where the supplied vector of indices selects the marked submatches ( index > 0 ) or the match itself ( index == 0 ).

Now, in theory, only non-negative indices make sense here; in practice, the token iterator supports an extended semantics where, intuitively, an index of "-1" represents the prefix of the current match result.
So, if the Jth index is -1, the resulting sequence will be

(*R)[i1],(*R)[i2], ..., (*R)[iJ-1], R->prefix(), (*R)[iJ+1], ... ,(*R)[iN], ++R, ...

Moreover, whenever a -1 index appears in V, the semantics is extended further: one last element is added to the sequence represented by T, namely the suffix of the last match result (provided that suffix is not empty).

So again, if the Jth index is -1 the resulting sequence will end with

..., (*R)[i1],(*R)[i2], ..., (*R)[iJ-1], R->prefix(), (*R)[iJ+1], ... ,(*R)[iN], R->suffix()

The rationale is that the remaining unmatched part of S can be considered the prefix of the "end" of S.

In this way, initializing T with a single -1 index will split S exactly into the substrings delimited by the specified pattern. In the example above, an sregex_token_iterator over "a 1 b 10 c 100 d 1000 e" with the pattern "\\d+" and index -1 will give the sequence "a "," b "," c "," d "," e".
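
The effect of the index vector can be checked with a small sketch like the one below. The helper name tokens_for is hypothetical; it only collects whatever the token iterator produces for a given set of indices.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the token sequence for the given submatch indices.
std::vector<std::string> tokens_for(const std::string& s, const std::regex& pattern,
                                    const std::vector<int>& indices)
{
    std::vector<std::string> out;
    // The token iterator cycles through indices for every match found in s.
    for (std::sregex_token_iterator t(s.begin(), s.end(), pattern, indices), end; t != end; ++t) {
        out.push_back(t->str());
    }
    return out;
}
```

With the pattern "\\d+" on "a 1 b 10 c 100 d 1000 e", the indices {-1} give the five pieces "a ", " b ", " c ", " d ", " e", while {-1, 0} interleave prefixes and matches and still end with the suffix: "a ", "1", " b ", "10", " c ", "100", " d ", "1000", " e".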

and that's it :)