counting consecutive null characters

Hi All,

Is there any function in C++ 's standard library or in STL which gives the number of consecutive characters in a stringstream object ?

std::count() and strspn() do not seem to be what I am looking for.

In my case, the sequence of consecutive characters may begin at any location in string stream as opposed to what is checked by strspn(), which returns the length of the "initial" portion of a string containing the consecutive characters.

Performance is my first priority while implementing this task.

Thanks.

Re: counting consecutive null characters

What is the criteria for the count?

0, 1,2,3,0,04,5,0,0,0,6,7

What do you want the above to count as? 2? 5? 6?

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

What is the criteria for the count?

0, 1,2,3,0,04,5,0,0,0,6,7

What do you want the above to count as? 2? 5? 6?

Sorry, I admit that my question was not clear.

If I have the following binary stream in hex representation:

Code:

00 01 02 03 00 00 04 05 00 00 00 06 07

what I want is to obtain the size of the "largest" sequence of consecutive NULL bytes, which is 3 in above example.

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

What is the criteria for the count?

0, 1,2,3,0,04,5,0,0,0,6,7

What do you want the above to count as? 2? 5? 6?

What do you think about the function below ?

Code:

int checkConNullBytes() { const char *BEGIN = dataStream.str().c_str(); const char *END = dataStream.str().c_str() + dataStream.str().length(); uint32_t totalNumberOfNullCharactersInStream = 0; map<uint32_t, int> myMap; for(char *i = const_cast<char*>(BEGIN); i != const_cast<char*>(END); i++) { totalNumberOfNullCharactersInStream = strspn(i, "\0"); myMap[totalNumberOfNullCharactersInStream] = 0;//assume that there is no two keys in map which are the same } //return last entry in myMap since myMap 's entries will be sorted in ascending order based on key }

This is probably not the most effective way of doing the job.

Re: counting consecutive null characters

That looks a bit over complicated. How about...

Code:

#include <string> #include <algorithm> int checkConNullBytes(const std::string &dataStream) { int maxCount = 0; int count = 0; std::string::const_iterator begin = dataStream.begin(); std::string::const_iterator end = dataStream.end(); while (begin != end) { if (*begin == 0) { ++count; maxCount = std::max(maxCount, count); } else { count = 0; } ++begin; } return maxCount; } int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; std::string dataStream(data, data + 13); int largest = checkConNullBytes(dataStream); }

Re: counting consecutive null characters

... and with C++0x

Code:

#include <algorithm> int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; int largest = std::accumulate( data, data + 13, []( int count, char value ){ return value ? 0 : ++count; }, 0 ); }

Re: counting consecutive null characters

Quote:

Originally Posted by superbonzo

... and with C++0x

Code:

#include <algorithm> int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; int largest = std::accumulate( data, data + 13, []( int count, char value ){ return value ? 0 : ++count; }, 0 ); }

Well you don't actually need c++0x to do that. The lambda function is just sugar coating. Nicer than writing a one shot functor though, I'll grant you that.

The big flaw though is that your function doesn't actually count the longest consecutive null chars, but just counts the null chars. And if we wanted to do that, I'm sure std::count could do a better job.

Also, but I'm not completely sure, the above could be undefined, as accumulate expects the predicate to "return the result of an accumulation operation", and since in this case lambda(count, value) != lambda(value, count), depending on the implementation, the result could be outrageously different. On my system, I had to swap count and value for it to work correctly.

I doubt an algorithm could do this, as it would require the predicate to have a state (remember the longest chain, for example), and in Scott Meyer's words "Make predicates pure functions". Anything else is undefined with std::algorithm, I believe.

Re: counting consecutive null characters

Quote:

... and with C++0x

Are you sure?

Isn't that equivalent to this?

Code:

#include <string> #include <numeric> int Function(int count, char value) { return value ? 0 : ++count; } int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; std::string dataStream(data, data + 13); int largest = std::accumulate(dataStream.begin(), dataStream.end(), int(0), Function); }

If so, it doesn't work!

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

I doubt an algorithm could do this, as it would require the predicate to have a state (remember the longest chain, for example), and in Scott Meyer's words "Make predicates pure functions". Anything else is undefined with std::algorithm, I believe.

I find that to be such an annoying restriction, as often I find something that could be achieved in a couple of lines ends up turning in to a 'home-made' algorithm or loop just so I can be 'pure' and not pass a stateful predicate to an STL algorithm. And when I look at the code for the said STL algorithm, it's written in the same way as my home-made one anyway!

For example:
I often write image processing algorithms that require the last few pixels to be remembered. std::find_if with a stateful predicate would be perfect for this, but it's 'not allowed' (despite the supplied find_if being perfectly compatible). So I write my own Find_If, just in case the STL library writer decides to do something different, one day, maybe.

Is it a case of the STL designers keeping their options just a little too open and making life difficult for the end users?

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

as accumulate expects the predicate to "return the result of an accumulation operation", and since in this case lambda(count, value) != lambda(value, count), depending on the implementation, the result could be outrageously different. On my system, I had to swap count and value for it to work correctly.

uhm, the standard says "Computes its result by initializing the accumulator acc with the initial value init and then modifies it with acc = acc + *i or acc = binary_op(acc, *i) for every iterator i in the range [first, last) in order."

so its behavior is prefectly defined. But ...

Quote:

Originally Posted by JohnW@Wessex

Isn't that equivalent to this?...If so, it doesn't work!

:blush: uhm... yes, that functor should have a state to store the maximal count:

Code:

int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; int largest = 0; std::accumulate( data, data + 13, int(0), [largest&]( int count, char value ) -> int { count = value ? 0 : ++count; largest = std::max(largest,count); return count; } ); }

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

That looks a bit over complicated. How about...

Code:

#include <string> #include <algorithm> int checkConNullBytes(const std::string &dataStream) { int maxCount = 0; int count = 0; std::string::const_iterator begin = dataStream.begin(); std::string::const_iterator end = dataStream.end(); while (begin != end) { if (*begin == 0) { ++count; maxCount = std::max(maxCount, count); } else { count = 0; } ++begin; } return maxCount; } int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; std::string dataStream(data, data + 13); int largest = checkConNullBytes(dataStream); }

Thanks.

This is much better.

Re: counting consecutive null characters

Quote:

Originally Posted by superbonzo

uhm, the standard says "Computes its result by initializing the accumulator acc with the initial value init and then modifies it with acc = acc + *i or acc = binary_op(acc, *i) for every iterator i in the range [first, last) in order."

Yeah, apparently I was wrong. I missread my code :eek:

Quote:

Originally Posted by superbonzo

:blush: uhm... yes, that functor should have a state to store the maximal count:

Code:

int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; int largest = 0; std::accumulate( data, data + 13, int(0), [largest&]( int count, char value ) -> int { count = value ? 0 : ++count; largest = std::max(largest,count); return count; } ); }

I can't give examples about it, but I remember Scott Meyers being very explicit about not doing this. Saying that functors should be pure functions.

I don't know where and when this applies, so I'm playing it safe, but yeah, just a warning. I hope you know what you are doing.

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

I don't know where and when this applies, so I'm playing it safe, but yeah, just a warning. I hope you know what you are doing.

The problem I see is that quite a lot of C++ coders aren't sure either, and when told, can't see the logic of why on earth it was done that way at all.

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

The problem I see is that quite a lot of C++ coders aren't sure either, and when told, can't see the logic of why on earth it was done that way at all.

I remember it made sense when I read it, which is why I remember it. But for me, right now, it is "maybe it will work, maybe it won't". That's undefined as far as I'm concerned. I'll try to re-read it if I find the time.

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

I can't give examples about it, but I remember Scott Meyers being very explicit about not doing this. Saying that functors should be pure functions.

I don't know where and when this applies, so I'm playing it safe, but yeah, just a warning. I hope you know what you are doing.

I suppose Meyers said that because there are no guarantees on how ( many times ) the functor is copyed and/or constructed nor when its operator() is invoked. Indeed, the standard also add "binary_op shall not cause side effects" at the and of that paragraph.

Quote:

The problem I see is that quite a lot of C++ coders aren't sure either, and when told, can't see the logic of why on earth it was done that way at all.

Agreed. BTW, when/how on earth lambda functors which have no state but induce sideeffects (like mine above) should be used with STL ?!?

Re: counting consecutive null characters

taken by a sudden spirit of rebellion I reread the paragraph on for_each and it (implicitly) allows side effects on its functor argument... mah... :confused:

therefore the following code should be standard correct

Code:

int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; int count = 0; int largest = 0; std::for_each( data, data + 13, [&]( char value ) { if( value ) count = 0; else largest = std::max( largest, ++count ); } ); }

a presumably correct alternative with accumulate would be

Code:

int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; struct MyCounter { MyCounter(): value(0), count(0) {} MyCounter( int value, int count ): value(value), count(count) {} MyCounter operator+( char a_char ) { if( a_char ) count = 0; else ++count; return MyCounter( std::max( value, count ), count ); } int value; int count; }; MyCounter count = std::accumulate( data, data + 13, MyCounter() ); }

Re: counting consecutive null characters

Why all the hoopla for such simple task:

Code:

int main() { #define BUFSZ 13 int CurRun=0,MaxRun=0; unsigned char buf[BUFSZ]={0,1,2,3,0,0,4,5,0,0,0,6,7}; for(size_t x=0;x<BUFSZ;x++){ buf[x]?(CurRun>MaxRun?MaxRun=CurRun:0,CurRun=0):CurRun++; } printf("Max: %d\n",MaxRun); return 0; }

(off the top of my head. It could probably be reduced even more).

Re: counting consecutive null characters

Quote:

Originally Posted by hoxsiew

Why all the hoopla for such simple task:

That looks like roughly the same algorithm as what JohnW@Wessex proposed in post #5, except that your implementation is more obfuscated. The posts in between have been attempts to answer (and discuss answers to) the original question by proposing things in the standard library beyond std::string and std::max.

Re: counting consecutive null characters

Quote:

Originally Posted by aryan1

Performance is my first priority while implementing this task.

Looks like this thread departed from stated “first priority”.
There are some obvious algorithmic optimizations to this process.
For example, in JohnW@Wessex’s code when you find a beginning of sequence of zeroes, you can first check if an element at *(begin + maxCount) position is 0; if it is – keep checking for all zeroes, but if it isn’t – you can adjust your current position to (begin + maxCount) skipping maxCount elements.
Then, for larger maxCount, you could replace byte-by-byte compare with 4-bytes int compare to 0 (taking care of alignment, of course). Depending on expected maxCount, this could be very beneficial.
If the performance's priority is really high, we could keep going…

Re: counting consecutive null characters

Quote:

Originally Posted by VladimirF

Looks like this thread departed from stated “first priority”.
There are some obvious algorithmic optimizations to this process.
For example, in JohnW@Wessex’s code when you find a beginning of sequence of zeroes, you can first check if an element at *(begin + maxCount) position is 0; if it is – keep checking for all zeroes, but if it isn’t – you can adjust your current position to (begin + maxCount) skipping maxCount elements.
Then, for larger maxCount, you could replace byte-by-byte compare with 4-bytes int compare to 0 (taking care of alignment, of course). Depending on expected maxCount, this could be very beneficial.
If the performance's priority is really high, we could keep going…

A binary search implementation of the "find the longest chain of a item" algorithm. Sounds like a challenge!

Re: counting consecutive null characters

Quote:

Originally Posted by aryan1

what I want is to obtain the size of the "largest" sequence of consecutive NULL bytes,

Isn't an optimization possible here?

Once you've found a null sequence of length N you only have to check whether every N+1'th byte is null after that (but when you do find a null you need to look both backwards and forwards to establish the length of that particular null sequence). Still the longer the null sequence you already have the faster it becomes to check the rest of the bytes because you can increase the stride.

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

That looks a bit over complicated. How about...

Code:

#include <string> #include <algorithm> int checkConNullBytes(const std::string &dataStream) { int maxCount = 0; int count = 0; std::string::const_iterator begin = dataStream.begin(); std::string::const_iterator end = dataStream.end(); while (begin != end) { if (*begin == 0) { ++count; maxCount = std::max(maxCount, count); } else { count = 0; } ++begin; } return maxCount; } int main() { char data[] = {0, 1, 2, 3, 0, 0, 4 ,5 ,0, 0, 0, 6, 7}; std::string dataStream(data, data + 13); int largest = checkConNullBytes(dataStream); }

If you expect the string to have mostly 0s, you should put the std::max call in the else rather than the if (prior to resetting count). If you expect the 0s to be fairly rare, it'll be faster as-is.

Re: counting consecutive null characters

Quote:

Originally Posted by nuzzle

Once you've found a null sequence of size N you only have to check whether every N+1'th byte is null after that (but when you find a null you need to look both backwards and forwards to establish the size of that particular null sequence). Still the larger the null sequence you already have the faster it becomes to check the rest of the bytes because you can increase the stride.

That has the taste of Boyer-Moore :)

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

A binary search implementation of the "find the longest chain of a item" algorithm. Sounds like a challenge!

I keep hinting to a programming challenge for awhile, with no support. Maybe to subtle?
If we only had sizeable data file with bunch of zeroes in it for benchmark… Would anybody accept such challenge?

Re: counting consecutive null characters

Quote:

Originally Posted by nuzzle

Isn't an optimization possible here?

Once you've found a null sequence of length N you only have to check whether every N+1'th byte is null after that (but when you find a null you need to look both backwards and forwards to establish the length of that particular null sequence). Still the longer the null sequence you already have the faster it becomes to check the rest of the bytes because you can increase the stride.

I am trying to find the difference between your suggestion an my post # 19 above. Don’t see it :)

Re: counting consecutive null characters

Quote:

Originally Posted by VladimirF

I keep hinting to a programming challenge for awhile, with no support. Maybe to subtle?
If we only had sizeable data file with bunch of zeroes in it for benchmark… Would anybody accept such challenge?

Why not? I'm already doing it anyways.

Actually, I'm implementing a different solution for each iterators type. You wouldn't want to skip too far ahead if you have input iterators...

Re: counting consecutive null characters

Quote:

Originally Posted by VladimirF

I am trying to find the difference between your suggestion an my post # 19 above. Don’t see it :)

There were so many replies I just browsed them before I replied. But now when I look at your suggestion I can see it's based on the same idea.

One difference thought seems to be that you make a longer jump forward only after you've found a zero. In my version you switch to N+1 increments for the remainder of the search when you've found a sequence of length N. If you have found a null sequence of 10 bytes you only need to check every 11'th byte ever after.

Re: counting consecutive null characters

Hmm, except that if you *do* find a null on one of those jumps, you'll need to look backward to find the start of the sequence. Not a problem for bidirectional iterators, but for forward iterators it would be.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

Hmm, except that if you *do* find a null on one of those jumps, you'll need to look backward to find the start of the sequence.

Yes I mentioned that in #21.

Quote:

Not a problem for bidirectional iterators, but for forward iterators it would be.

With a little effort you can make almost everything not work.

Re: counting consecutive null characters

I suppose for forward iterators you're stuck looking at everything anyway, so you may as well do a simpler algorithm. Bidirectional would depend on the relative speed of conditionals versus traversal. It's a good optimization for random access iterators though.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

I suppose for forward iterators you're stuck looking at everything anyway, so you may as well do a simpler algorithm. Bidirectional would depend on the relative speed of conditionals versus traversal. It's a good optimization for random access iterators though.

That's exactly how binary_search works. it does just as many iterations, but a fraction of the checks.

I was hoping to rely on distance and advance to write a common algorithm for both, but it creates a lot of problems. For instance, it is very hard to know if you have reached the end of your boundaries if you try to jump forward.

Re: counting consecutive null characters

Well, I had my fun.

I won't bore you with everything I did, but the conclusion is that the jump algorithm applied to bidirectional iterators proved some good performance.

After much tweaking, I got some good result for RA, BD, and forward iterators.

My test bench was a pre-filled char array of 8096 chars randomly distributed between '0' and '1'. I do the search 100 000 times. There are 13 consecutive 0.

My times:
Standard array:
Jump algorithm (RA): 0.73 seconds
Standard algo: 3.1 seconds

List:
Jump algorithm (BD): 2.2 seconds
Standard algo: 3.5 seconds

EDIT: mingw -O3 on a 3.0Ghz Phenom 9950

For those interested, here is my code:

Code:

template<typename iterator, typename T> typename std::iterator_traits<iterator>::difference_type findLongestSequence(iterator first, iterator last, T value) { return findLongestSequenceImplementation( first, last, value, typename std::iterator_traits<iterator>::iterator_category() ); } template<typename iterator, typename T> typename std::iterator_traits<iterator>::difference_type findLongestSequenceImplementation(iterator first, iterator last, T value, std::random_access_iterator_tag) { iterator middle = first; //quick check of first sequence, to never check bounds when iterating backwards. while(middle!=last && *middle==value) { ++middle; } //fast initialization typename std::iterator_traits<iterator>::difference_type maxCount = std::distance(first, middle); //potentially 0 typename std::iterator_traits<iterator>::difference_type step = maxCount+1; ++middle; while(middle<last) //This check makes things difficult for bidirectional iterator { if (*middle == value) { iterator low = middle; //No need to check bounds here while(*--low == value); ++low; while((++middle != last) && (*middle == value)); //since the if is wrong 90% of the time, evaluating distance twice is faster if(maxCount < std::distance(low, middle)) { maxCount = std::distance(low, middle); step = maxCount+1; } } std::advance(middle, step); } return maxCount; } template<typename iterator, typename T> typename std::iterator_traits<iterator>::difference_type findLongestSequenceImplementation(iterator first, iterator last, T value, std::bidirectional_iterator_tag) { typename std::iterator_traits<iterator>::difference_type maxCount = 0; iterator middle = first; //quick check of first sequence, to never check bounds when iterating backwards. while(middle!=last && *middle==value) { ++middle; ++maxCount; } if(middle==last) return maxCount; typename std::iterator_traits<iterator>::difference_type step = maxCount+1; while(1) { if (*middle == value) { iterator low = middle; typename std::iterator_traits<iterator>::difference_type aDistance = 1; //No need to check bounds here while(*--low == value) {++aDistance;} while((++middle != last) && (*middle == value)) {++aDistance;} if(maxCount < aDistance) { maxCount = aDistance; step = maxCount+1; } } for(typename std::iterator_traits<iterator>::difference_type i = 0; i!=step; ++middle, ++i) { if (middle==last) return maxCount; } } } template<typename iterator, typename T> typename std::iterator_traits<iterator>::difference_type findLongestSequenceImplementation(iterator first, iterator last, T value, std::forward_iterator_tag) { typename std::iterator_traits<iterator>::difference_type curCount=0; typename std::iterator_traits<iterator>::difference_type maxCount=0; while(first != last) { //search for the first occurence of value if (*first++ == value) { //get length of current occurence curCount = 1; while((first != last) && (*first++ == value)) { ++curCount; } //test max only once the current chain is counted maxCount = std::max(maxCount, curCount); } } return maxCount; }

Code:

int main() { char a[]= //see below std::list<char> aList(a, a+sizeof(a)); clock_t begTime = clock(); for(int i=0; i<100000; ++i) { findLongestSequence(a, a+sizeof(a), '0'); //findLongestSequence(aList.begin(), aList.end(), '0'); } std::cout << clock()-begTime << std::endl; }

And here is my char array

Code:

"11101000101110111001010111001111110101010000101110011011000100100010100111000110" "11001111111110011010011101010100010001110101101101000111110111000101010111110010" "00001001000000011111000100110101001010110110001100001111001111110111001101011111" "10001001001100100000111101011011001100010000010101011111001000111111000101001100" "11010000000111111101010011011001001001101000001111111000101000110100100010111001" "11011000001110000111000101001001101100011010001111010011111111101011000100011100" "00001100110011000001010000011110000010010001100011110101101010010011001100110111" "00100101111010110010101000100001100011101010110001010011000010000011010110101010" "01001100011110110111110010100000111111110111110111100011000010110010001011000000" "11100010001010011001110110100011101111111101111100000100011110100101110001101011" "10101001010110110111011011111011110110111011110011011101101100010101000001011011" "10001100010010010100101100011010010101000110011100000101010110010000000100011100" "11011010010111011011101101111000111110000111000011010010101001100000001010011010" "01101000101101011100011111101111011101100100010100000011000000110010000010000010" "10001101110101001010001000100001010101010111100010110001111011010010110100100000" "00010101110010100000000111000011111010001100110101101111111110110110110001101110" "00000101100111011110011000111111010101111000011100100100110100101110101100011111" "00110010111001111011110000101111010011110110100001010010111100010011110110110011" "11101011000001000100011011110100111000011011111001010110111010100000111000101100" "11000001000011010000101001111101000000100111011000111001101100111101011011000100" "00011010100000000110100111010110011110111110110011011000100110010101010010010011" "10011101011110111101000101111000000001100111001100011000001001010001110110010010" "10010011101001110100010001100100000011000001111010101011101110001011000110001111" "10010010011010001110001000111000000011101110000001001010011100110111010111010010" "01011011011101111010100111010101000111011101100011010100110001111110000000111001" "01001110101001111110000001101100001111111011110000101011010111101000111100101010" "11001111110011111000000110111111000110010010011111010100100101011000111110110001" "01011100101111100101110100110100011001000001011100100111110001001110111011011001" "00010001100001100011000010000100011001011001101010010010110111010011111110111101" "00000011111001000100010111111011110100011000100111101001101110111000011100111100" "10111000001010101010001000011111010110000101010011000000110011101110100010100000" "10101010010011101110100100011010101000111111000100010100000100010011110011010110" "00110000110110011110001001101100001101011101011000010011000101110001010001000100" "10011000000010010110000101101101000011110100011001011111100101110110101001000000" "11001000100110001111100111101000000100010000011111111000110000001001001101000010" "10000001001111100011011010011011010101101011001010011011101001100110000100110000" "11101100010110011110001101111010010001001011011010000000101011100101000111110101" "01001111100010100001011011111100011010011111110011101010101110011010111010111101" "11000010110011011100110001010111001100010111101001011010011110011101000100001010" "01100100101001000001110111000011001100010101111110010110000111011011110111101101" "01010110100101001000001101000100001111001100100000101011000000110001101000100000" "11010101000110110111111000011110011001111000011000111111000110001111101111011001" "01101101011111100011011000110010111001011111111001101010001011111011001111001010" "01010100100001000101000000010010001000111101010011100000111000100101100100010101" "11100011001111110101110000101010001111011101010111011010111011100011100000010001" "01010111100101110010011101110110111001100000110100011100011000000100111100110000" "01100010000000011111101001111100010011110111001010010111100110101100101010110110" "00001100100000110000101111000011000011110111111110001000011000100011010010100001" "00111110111100111011000101001101100110010100001010000100100010111111111011111101" "00010101000001010101110110010010010001110011110101001011100100011001101000010001" "00001000000110101000111000101100111110101101010001110001010011101100110110101001" "00111001010000000011001110010111101111011100001111011001001011101001111001111011" "00000110110010101100011001100100100111000101110101001000000101100100110011010001" "01111010110010010011010000101011110111100011110000000010010010000010001001100001" "01111110011100101000010001011000001101000101010000011100000010110111011100011110" "11001011011001001111111101010110100110011101000110101001110011101011010111011110" "01000110101100101101011000000111110111100100011110101100100010101111101111110011" "10111001000100000010001111110101001011111011000100110110001101001010100011000100" "00101101011110100000100111110010000011111100110110100010111000100110101010100111" "00011100101101110100000101010110011100111010010111010010011100111110111000010101" "00001001111011101110111100100100100000101010110011010011110100100110100101010111" "11100100010101101000000111000110010110011100010011110110101100011111000010011110" "01111111011100001011001001100110010001110001010001111111011010110010100000011011" "01111011011100111111000001110000100101000000001111010101001011110111110010011010" "01101111100101010100010110011110110110111110011001101011010010110000111000101100" "00101111101111111000000011111011101011110000111001001000011101011001110110110011" "00100110000011110011100111100111000011001100110111011111011100100111010011110001" "01110001011000110110100011110100111010010001000001111000101110100010100111111111" "11110101001100100100110010001010100000010011000101001000011000010100111010010111" "00101000001111001101011000111011011010011111100011111100110100110000100011001111" "00100011000100101100010000100010011001001000001001001101010010100111010010011111" "11111100111010010000010110001111010101101111110101110100101001101110010110100000" "01101011100111101010010001110110110110000101011100011011010011100000010101001110" "00110010101111101101001010001001001011110100101001100011000000100000010110000100" "10001110101001111110111000110110111111000001000000000001110011000110011101101101" "01011110110110000111010001111100100001001011010000110010111001111001000111001000" "11111101101101000110000110011001110111110100111010000011111000010110010101100001" "10100000001001011000000011000011101011100001000000010100000010001101100010100001" "10000010100111100111001011010001011100100001100101101000001010000010101010001011" "11000100111111110001111010101001001110001010100100001000110101011011110010101000" "01101010100000010001010011101111011110001111011111011100100001001110010110110111" "10100011001100111001001100010101101111110011001001110101010110111100110111100010" "11000100101111111001011001100011111001100011011011101110111111000011100110011110" "00110001010110000110110011000010100110100011000101010110100111110001111000110011" "00011101001010111011000100000110011111000100001000000000010110100111101110111011" "01000101011011001101011010110111110101101011010100011111000010101101010111000011" "11111100010011011101110000010001011010111010011011111011101110111001010000001011" "00001011001011011110010000110100111011100110011101101101110010011010110110010011" "01111100110110000110000110110101001110101100001101100100100000101110111110010101" "10010110111000101100111111100010011001110111111001111100010001000101011101110111" "10101000001001110101100110001011000101101100100100001101010100101001100000011010" "11000110111000100111100010011010110101000111111100110000111011100011011000001100" "00010110000010011100100010010111000000100011000101001011100111100111000000000101" "11110011101101010110001001010111001110010001110011101110110100110101011101100101" "10011111110101010111101010101010011111110111101100011100110110011100111010011110" "01110011001111001110111000110000001100110010011000110011111110101100010011000111" "01110111011001110111101000101010010110100111101111101011010001000001001101001110" "00000000000111010110111001101110111101110100000101110001000111011011101010010011" "11000110101110100110000110101011000101111010111000100000000000001110010001101101" "00111001111011100111000110001110000111100110001010010100000011111111001111110010" "11100011111100000001101100111000011000111010110101001000011010000111011000011101" "10010110010011110011100000000110100100010011010111100101001100110111111101110101" "11000101101001101001010000101110";

Re: counting consecutive null characters

Quote:

Originally Posted by superbonzo

taken by a sudden spirit of rebellion I reread the paragraph on for_each and it (implicitly) allows side effects on its functor argument... mah... :confused:

I feel your pain. I wonder why the C++ standard is so vague about this. I mean why else would for_each return a functor if not to allow the functor to accumulate data that can be retrieved after the algorithm finishes? Yet the std does not explicitly state anything about that. It reads,

Quote:

Notes: If f returns a result, the result is ignored.

What does that mean? Ignored by who or what? :confused:

Re: counting consecutive null characters

Well, keep in mind that you can always use the accumulate() algorithm if you need stateful behavior; it's allowed there.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

Well, keep in mind that you can always use the accumulate() algorithm if you need stateful behavior; it's allowed there.

I understand but it simply sums values with operator+. for_each appears to support far more complex behavior. What if I wanted to analyze a range of objects and build some metrics? Isn't that one of the reasons that for_each exists for? That is what I would have thought by looking at the implementations that I currently see but the standard is pretty vague and scott meyers in effective c++ writes as though he is rather perplexed about for_each as well. It shouldn't be confusing at all. The std should just tell us what the algorithm is for and what its restrictions are very concisely.

Re: counting consecutive null characters

Josuttis shows exactly that use of the for_each return value in his book. His example shows taking the mean of a sequence with a stateful functor and for_each.

Re: counting consecutive null characters

Quote:

Originally Posted by kempofighter

I understand but it simply sums values with operator+. for_each appears to support far more complex behavior.

Actually, accumulate() supports an arbitrary functor argument. std::plus is merely the default.

However, I actually got it backwards: according to Effective STL, for_each() is the one which is allowed to have a stateful functor; accumulate() is not. The two are similar otherwise.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

Hmm, except that if you *do* find a null on one of those jumps, you'll need to look backward to find the start of the sequence. Not a problem for bidirectional iterators, but for forward iterators it would be.

There's no reason why you maybe couldn't have multiple forward iterators to simulate the look back.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

However, I actually got it backwards

Further evidence that the algorithm limitations need to be better defined or, better still, remove as many limitations as possible and make them more compatible with stateful functors & predicates. This seems to be how most people expect them to work.

I read an article a while ago by one of the people on the C++ committee discussing the STL. Apparently the committee was mainly made up of compiler and library writers. Maybe a few coders who work at the application sharp end would give them a better grounding in what is really useful.

Re: counting consecutive null characters

Quote:

Originally Posted by JohnW@Wessex

Further evidence that the algorithm limitations need to be better defined or, better still, remove as many limitations as possible and make them more compatible with stateful functors & predicates. This seems to be how most people expect them to work.

I read an article a while ago by one of the people on the C++ committee discussing the STL. Apparently the committee was mainly made up of compiler and library writers. Maybe a few coders who work at the application sharp end would give them a better grounding in what is really useful.

On the other hand it's worth noting that the no-side effects requirement in the accumulate algorithm took me to the overloaded operator+ solution which better fits the semantics of std::accumulate... that is a good thing ( of course, you could say that it was my fault from the beginning :) ).

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

Actually, accumulate() supports an arbitrary functor argument. std::plus is merely the default.

However, I actually got it backwards: according to Effective STL, for_each() is the one which is allowed to have a stateful functor; accumulate() is not. The two are similar otherwise.

I told you (well, superbonzo actually) that it was dangerous and confusing lol :rolleyes:! Any chance you could enlighten us quickly as to why that is? What makes the two so different?

Quote:

Originally Posted by JohnW@Wessex

There's no reason why you maybe couldn't have multiple forward iterators to simulate the look back.

That's true, but for this algorithm, it would kind of defeat the purpose. No point "jumping ahead" if you are just going to iterate through everything just to see where the range starts later. Well, you might save a few comparisons, but nothing drastic.

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

Any chance you could enlighten us quickly as to why that is? What makes the two so different?

I'm just repeating what I read in Effective STL. Even Scott Meyers shrugs and dodges the question regarding why they're different.

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

My test bench was a pre-filled char array of 8096 chars randomly distributed between '0' and '1'. I do the search 100 000 times. There are 13 consecutive 0.

My times:
Standard array:
Jump algorithm (RA): 0.73 seconds
Standard algo: 3.1 seconds

1. This is very impressive!
2. I couldn't beat it with my ideas.
3. Wasted some time on a fight with Visual Studio's optimization.
4. I can't rate this post at the moment. Will have to spread some stuff around first. Is this some kind of redistribution of wealth? :)
5. I suspect that overhead from my ideas overweight the benefits for such a short longest sequence of 13.
6. Have some promising results from SSE. Needs some more work.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

I suppose for forward iterators you're stuck looking at everything anyway, so you may as well do a simpler algorithm. Bidirectional would depend on the relative speed of conditionals versus traversal. It's a good optimization for random access iterators though.

I'm not sure you understand the beauty of the algorithm I suggested. If you did why are you so preoccupied with the choise of iterators? It's just a technicality.

My suggestion, which now seems to be called the "jump algorithm" in this thread, increases the performance of the native approach (where every byte is checked) many times. But I can't take much credit for it because as laserlight noted, it's probably just a special case of the Boyer-Moore algorithm.

http://en.wikipedia.org/wiki/Boyer%E...arch_algorithm

Re: counting consecutive null characters

It's not a "choice" of iterators. The coder writing the algorithm does not necessarily control what type of iterator they're going to get when the algorithm is called. They could put a restriction on that, but the better option is to fall back to a simpler, less efficient algorithm if the iterator type provided doesn't work with the fast approach.

The proposed "jump" optimization can be applied best to random-access iterators, and for those I'm sure it works fantastically. I was simply speculating on how well it would extend to the other types, if at all.

And yes, I did grasp why it's such a good idea.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

It's not a "choice" of iterators. The coder writing the algorithm does not necessarily control what type of iterator they're going to get when the algorithm is called. They could put a restriction on that, but the better option is to fall back to a simpler, less efficient algorithm if the iterator type provided doesn't work with the fast approach.

The proposed "jump" optimization can be applied best to random-access iterators, and for those I'm sure it works fantastically. I was simply speculating on how well it would extend to the other types, if at all.

And yes, I did grasp why it's such a good idea.

Well the "jump optimization" can be applied to almost all iterator types. The less "random access" the iterator is, the slower the algorithm, but probably better than the "dumb" algorithm regardless.

RA works very fast.
Bidirectional requires many iterations for the jumps, but does less comparisons
Forward iterator also works, if you keep the "previous" iterator in memory.

I was surprised, but I tested it, and there are gains for all iterator types.

The only iterator where the "jump" algorithm does not work are input iterators*...

I did not realize it at first, but I guess it is a generalization of the Boyer Moore algorithm.

*I had said in some previous posts that the algorithm did not work on forward iterators, but I meant input.

Re: counting consecutive null characters

Quote:

Originally Posted by Lindley

The proposed "jump" optimization can be applied best to random-access iterators, and for those I'm sure it works fantastically. I was simply speculating on how well it would extend to the other types, if at all.

The "jump algorithm" doesn't work well with traditional sequential iterators. The reason is obvious. It's because they don't support the element skipping the "jump algorithm" is based on.

Compare with array sorting algorithms. If you enforce accesses to be via iterator even the fastest sorting algorithm becomes slow like a dinosaur in a tar pit. Just like array sorting the "jump algorithm" needs random access.

So you can stop speculating now. Introducing an obstacle like an iterator is guaranteed to spoil the performance of the "jump algorithm". The only kind of "iterator" that will work well is one that's not strictly sequential, like for example the C++ random-access iterator.

Re: counting consecutive null characters

Quote:

Originally Posted by monarch_dodra

I was surprised, but I tested it, and there are gains for all iterator types.

The reason for that probably is that although a sequential iterator must visit all elements during a skip, it doesn't have to access them. It just quickly passes the elements whereas the naive approach must stay and check whether each element is zero. Still, making a random access skip is faster.

Re: counting consecutive null characters

It's also worth mentioning that the "jump algorithm" performs better on a muticore processor. Let's say you have four cores. This means if you divide the array in four sections you can run four concurrent "jump algorithms" at the same time. This gives a theoretical speedup of four times.

But what's even better is that those four "jump algorithms" can share a skip length counter. This means that when one of the "jump algorithms" finds a longer zero sequence this finding is instantly shared with the three other "jump algorithms" allowing them to make longer jumps too.

So the concurrent "jump algorithms" don't just tug along each with their own part of the array. They cooperate so that all of them uses the longest jump found by any one of them.

A note. If someone plans to implement this make sure not to split the array in the middle of a zero sequence. Also handling of the shared skip length counter requires some care.

Re: counting consecutive null characters

Quote:

Originally Posted by nuzzle

So you can stop speculating now. Introducing an obstacle like an iterator is guaranteed to spoil the performance of the "jump algorithm". The only kind of "iterator" that will work well is one that's not strictly sequential, like for example the C++ random-access iterator.

There is a fundamental difference between working well, and working better. That is what this discussion has been about.

You probably missed my post (the discussion was long), so I'm just going to quote myself:

Quote:

My times:
Standard array:
Jump algorithm (RA): 0.73 seconds
Standard algo: 3.1 seconds

List:
Jump algorithm (BD): 2.2 seconds
Standard algo: 3.5 seconds

It's like doing a binary search on a sorted list. Not great, but beats the alternative.