Hmm.....you'd only enter this case when the post-jump iterator points to 0 and the pre-jump doesn't, and if you assume that *usually* you won't be looking at a new longest in this case, then you're likely to find a non-0 by advancing the post-jump before you would find a 0 by advancing the pre-jump.
However, after you do find a non-0 with the post-jump, you'd need to still advance the pre-jump by at least as much before you could definitively rule out the possibility that you're looking at a new longest sequence. And given that with a forward iterator you probably aren't dealing with contiguous memory, there's no cache benefit to doing all of the post-jump increments first. So I think it could be done alternately without any loss of efficiency, at least until the post-jump encounters a nonzero.
Hmm.....you'd only enter this case when the post-jump iterator points to 0 and the pre-jump doesn't, and if you assume that *usually* you won't be looking at a new longest in this case, then you're likely to find a non-0 by advancing the post-jump before you would find a 0 by advancing the pre-jump.
However, after you do find a non-0 with the post-jump, you'd need to still advance the pre-jump by at least as much before you could definitively rule out the possibility that you're looking at a new longest sequence. And given that with a forward iterator you probably aren't dealing with contiguous memory, there's no cache benefit to doing all of the post-jump increments first. So I think it could be done alternately without any loss of efficiency, at least until the post-jump encounters a nonzero.
I won't bore you with everything I did, but the conclusion is that the jump algorithm applied to bidirectional iterators proved some good performance.
After much tweaking, I got some good result for RA, BD, and forward iterators.
My test bench was a pre-filled char array of 8096 chars randomly distributed between '0' and '1'. I do the search 100 000 times. There are 13 consecutive 0.
My times:
Standard array:
Jump algorithm (RA): 0.73 seconds
Standard algo: 3.1 seconds
Since it took me awhile to come up with this, I *AM* going to bore you with details.
1. As I’ve mentioned, Visual Studio 2005 optimized away the 100,000 loop. So I had to use QueryPerformanceCounter() on a single path throug the data. The results are in milliseconds.
2. I’ve implemented “simple” algorithm, just for reference.
3. I am using SSE to produce a 16-bit mask for each 16 bytes of data where bit is set if data byte is equal to a sample byte.
4. I failed to implement ‘jump’ algorithm on bits (what I got was slower than other implementation).
5. I’ve implemented a lookup table (BTW, first choice in any high-performance algorithm), where for each 16-bit word (all 64K of them) I pre-calculate length of leftmost sequense of 1’s (in case that previous word ended with 1), longest middle sequence (surrounded by 0’s), and the rightmost sequence (for possible roll-over to the next word).
The result is about half the time of monarch_dodra's 'jump' algorithm.
Table initialization is expensive, but will pay off at 1,000 iterations.
Here are my results (on Core 2 Quad Q6600 @ 3GHz):
Attached is zipped solution. For non-Visual Studio users, just use the source files.
Would really appreciate any comments.
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio: FeinWindows - replacement windows manager for Visual Studio, and more...
I don't have time to look into it in detail right now, but i definitely want to sit down and see. I was planning to try to multi-thread the jump and see how fast it can get (in theory, up to 4x faster, which would beat your SSE )
I don't have time to look into it in detail right now, but i definitely want to sit down and see. I was planning to try to multi-thread the jump and see how fast it can get (in theory, up to 4x faster, which would beat your SSE )
At first I was going to say that I too can do multithreading. However, I am not having any luck with it. Even if I simply call one of my functions on a separate thread and wait for it to complete (no synchronization needed at all), it ads an overhead that is 10 times my total execution time. Doesn’t look like it depends on a sample data size, appears to be a fixed value ~80 microseconds. Does it seam reasonable?
Anyway, I suspect that you will hit the same wall. Please let me know if I am wrong. Good luck!
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio: FeinWindows - replacement windows manager for Visual Studio, and more...
Well, spawning a separate thread may not make sense to include in the timing. If you use a thread-pool design, simply dispatching tasks to be picked up by pre-existing threads, what then? Or maybe use something designed for fine-grained parallelism like OpenMP or ITBB.
Of course, another option that we haven't talked about at all yet would be writing a GPU implementation. That's probably serious overkill for the problem, but hell....why not.
The algorithm would be simple (assuming G80 architecture): Download the string of bytes to the GPU, then do a simple reduction, which can be computed in O(log n) relative to the size of the input string. This could probably all be done in one kernel using CUDA; it would take multiple renders using OpenGL+CG. I'm not sure about OpenCL, I need to read up on it. Then you pull the result back into main memory.
The slowest part of the whole thing would probably be the latency due to the GPU download.
The details of the reduction might be a bit tricky. I'll try to work them out.
Last edited by Lindley; April 23rd, 2010 at 07:44 PM.
Well, spawning a separate thread may not make sense to include in the timing.
Well, even if I simply do table initialization in multiple threads (it is timed separately, and does show 4-times performance improvement on 4-core processor), the main processing algorithm takes a LOT more time. I do not know what a connection here is. Possibly, optimizing compiler gets spooked by seeing access to the table from multiple threads. Does this sound reasonable?
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio: FeinWindows - replacement windows manager for Visual Studio, and more...
The details of the reduction might be a bit tricky. I'll try to work them out.
Okay, here's a first shot. For background, a GPU shader usually gives you 4 channels to work with (RGBA) even if your data is only 1 channel at minimal extra cost. So we'll leverage that.
In a reduction, you iteratively replace two values with one at every pair of positions in the array, effectively reducing its length by a factor of 2, all in parallel. You repeat until you've got only one value. We'll refer to the index of the current level of the reduction, and the original indices which reduced to it from the first level.
The r channel will store the length of the longest sequence of 0s which a given index knows about. If nonzero, the g channel stores the length of any sequence of 0s ending at the minimum original index this index knows about. Thus g <= r. Similarly, b stores the length of any sequence of 0s ending at the maximum original index this index knows about. a is a flag which is 1 if this index knows about at least one nonzero original index.
Then, each reduction step will look like:
Z = reduce(X,Y)
where X and Y are adjacent pixels of the previous level (ind(X) < ind(Y)) and Z is the new output,
In practice I expect this isn't the best GPU algorithm, since conditionals on a GPU are slow. But it could easily be rewritten as, for instance, Z.g = (1.0-X.a)*joincount + X.a*X.g;.
Code:
00324330000000204
Pass 1: convert to 4-channel format
{1,1,1,0}{1,1,1,0}{0,0,0,1}{0,0,0,1}{0,0,0,1}{0,0,0,1}{0,0,0,1}{1,1,1,0}{1,1,1,0}{1,1,1,0}{1,1,1,0}{1,1,1,0}{1,1,1,0}{1,1,1,0}{0,0,0,1}{1,1,1,0}{0,0,0,1}
Pass 2: reduce (note correct behavior on the odd boundary condition)
{2,2,2,0}{0,0,0,1}{0,0,0,1}{1,0,1,1}{2,2,2,0}{2,2,2,0}{2,2,2,0}{1,0,1,1}{0,0,0,1}
Pass 3: reduce
{2,2,0,1}{1,0,1,1}{4,4,4,0}{2,2,1,1}{0,0,0,1}
Pass 4: reduce
{2,2,1,1}{6,6,1,1}{0,0,0,1}
Pass 5: reduce
{7,2,1,1}{0,0,0,1}
Pass 6: reduce
{7,2,0,1}
Output: 7
I'm relying on the fact that an out-of-bounds access will produce {0,0,0,0}, which is easy enough to arrange for a GPU texture query.
Last edited by Lindley; April 24th, 2010 at 10:17 PM.
* The Best Reasons to Target Windows 8
Learn some of the best reasons why you should seriously consider bringing your Android mobile development expertise to bear on the Windows 8 platform.