OK, here are new results.
I have used Joeman’s adaptation of ninja9578’s asm. I don’t speak GCC’s asm, but the code looks very similar (to the naked eye).
Anyway, here is the machine code; see if there is anything you don’t like about it:
For lucky owners of Visual Studio 2005 (or above), I am attaching a zipped solution so that you can verify my findings.
My conclusion: it is VERY HARD (if not impossible) to beat an optimizing compiler with hand-coded asm. I bet it (at least a good one) knows everything about cache, pipeline, prefetch, etc.
So the bottom line (as I see it) is: in a fight for performance, choose the best algorithm and implement it as simply as possible, so the compiler doesn’t get confused and is able to optimize it nicely.
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio: FeinWindows - replacement windows manager for Visual Studio, and more...
I ran the test at least 4 times to make sure the numbers didn't fluctuate too much.
Seems like some of the assembly versions won by a good margin.
Just to be clear, I ran it in release mode without starting the debugger. I did not make any changes to the build configurations.
I even called the functions in different orders to make sure the order didn't affect the overall timings.
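For anyone who wants to repeat that procedure, here is a minimal sketch of it: time each candidate several times, keep the best time, then do the same again in reverse call order. The names `best_time_ms`, `time_in_both_orders` and `Candidate` are made up for illustration; this is not the harness from the attached solution.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: a label plus the function under test.
using Candidate = std::pair<std::string, std::function<void()>>;

// Runs `candidate` `runs` times and returns the fastest run in milliseconds.
double best_time_ms(const std::function<void()>& candidate, int runs) {
    using steady = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto start = steady::now();
        candidate();
        const std::chrono::duration<double, std::milli> elapsed = steady::now() - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}

// Times all candidates front to back, then back to front, and keeps the
// best time seen for each, so call order cannot favour one of them.
std::vector<double> time_in_both_orders(const std::vector<Candidate>& candidates, int runs) {
    std::vector<double> best(candidates.size(), std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        best[i] = std::min(best[i], best_time_ms(candidates[i].second, runs));
    }
    for (std::size_t i = candidates.size(); i-- > 0;) {
        best[i] = std::min(best[i], best_time_ms(candidates[i].second, runs));
    }
    return best;
}
```

Calling `time_in_both_orders(candidates, 4)` would match the "at least 4 times" above. Build it with the release configuration, otherwise the numbers say little.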
Overall, I don't recommend placing assembly in C++, just for the fact that it ties the code to a certain type of processor, compiler, etc.
If speed is a must and you know which platform and compiler you are targeting, I suppose this wouldn't be bad, as long as you make sure it really is faster.
Last edited by Joeman; May 15th, 2010 at 02:12 PM.
That definitely looks like it's getting close to the memory bandwidth limit. I have yet to write an inline asm function that outperforms the compiler at the same task.
I once thought that I could write an SSE memcpy function that would be lightning fast... but its results were identical to the std library's, which I'm pretty sure didn't compile to SSE.
I'm no expert on cache lines and such, but I've found that the way in which you access memory can make a world of difference.
I screwed up and said the last results were from 2010. I have now corrected my previous post; now for the Visual Studio 2010 Express results:
You had me going mad for almost half an hour!
I am even installing Express to verify your results.
Could it be that you screwed up more than once and ran a Debug build (or simply a non-optimized one) in your first test? Because your numbers are in line with my Debug numbers.
Anyway, I too screwed up (a little).
I can shave another 5% off my SSE implementation by replacing four calls to _mm_loadu_si128 (unaligned) with calls to _mm_load_si128 (aligned). I should have listened more carefully to Chris F in post #14. For some reason, I thought that having the memory aligned was enough.
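To make the difference concrete, here is a minimal sketch of the two load flavours side by side. The actual kernel from the attached solution is not reproduced here, so a plain copy loop stands in for it, and the names `is_aligned16`, `copy_unaligned` and `copy_aligned` are invented for this example. The point it illustrates: aligned data alone does not change which instruction is emitted; you have to ask for the aligned load explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// True when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Copies `bytes` bytes (must be a multiple of 64) with unaligned loads and
// stores. Works for any address, aligned or not.
void copy_unaligned(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 64 == 0);
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; i += 4) {
        const __m128i a = _mm_loadu_si128(s + i);
        const __m128i b = _mm_loadu_si128(s + i + 1);
        const __m128i c = _mm_loadu_si128(s + i + 2);
        const __m128i e = _mm_loadu_si128(s + i + 3);
        _mm_storeu_si128(d + i, a);
        _mm_storeu_si128(d + i + 1, b);
        _mm_storeu_si128(d + i + 2, c);
        _mm_storeu_si128(d + i + 3, e);
    }
}

// Same loop with aligned loads and stores. Both pointers must be 16-byte
// aligned, otherwise the aligned instructions fault at run time.
void copy_aligned(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 64 == 0);
    assert(is_aligned16(dst) && is_aligned16(src));
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; i += 4) {
        const __m128i a = _mm_load_si128(s + i);
        const __m128i b = _mm_load_si128(s + i + 1);
        const __m128i c = _mm_load_si128(s + i + 2);
        const __m128i e = _mm_load_si128(s + i + 3);
        _mm_store_si128(d + i, a);
        _mm_store_si128(d + i + 1, b);
        _mm_store_si128(d + i + 2, c);
        _mm_store_si128(d + i + 3, e);
    }
}
```

How much the aligned version gains will depend on the CPU; the 5% above is one measurement on one machine, so time both on yours.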
Also, as a benchmark, I call memcpy() between two 400,000,000-byte buffers, and in my test it takes 42 ms. THIS must be limited by memory speed.
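A rough sketch of that kind of baseline, assuming the buffers are 400,000,000 bytes each (the buffer size is an assumption made for this example). The helper names `memcpy_time_ms` and `gigabytes_per_second` are invented here. With those assumptions, 42 ms works out to 0.4 GB / 0.042 s, or roughly 9.5 GB/s of copy throughput.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Times a single memcpy of `bytes` bytes (bytes must be > 0), in milliseconds.
// Both buffers are filled first so page faults are not part of the timing.
double memcpy_time_ms(std::size_t bytes) {
    std::vector<unsigned char> src(bytes, 1);
    std::vector<unsigned char> dst(bytes, 0);
    const auto start = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    const auto stop = std::chrono::steady_clock::now();
    // Read one byte back so the optimizer cannot discard the copy.
    volatile unsigned char sink = dst[bytes - 1];
    (void)sink;
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count();
}

// Converts a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes).
double gigabytes_per_second(std::size_t bytes, double ms) {
    return (static_cast<double>(bytes) / 1e9) / (ms / 1e3);
}
```

Any hand-written routine over the same data that already runs close to this figure has little room left to improve, whatever instructions it uses.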
I once thought that I could write an SSE memcpy function that would be lightning fast... but its results were identical to the std library's, which I'm pretty sure didn't compile to SSE.
Actually, if you trace into memcpy (and I think the std library uses it), you'll see that it tests for alignment and for the presence of SSE2, and when both checks pass it does use SSE.
As soon as I realized I had messed up, I corrected it.
Originally Posted by VladimirF
Could it be that you screwed up more than once and ran a Debug build (or simply a non-optimized one) in your first test? Because your numbers are in line with my Debug numbers.
Well, I ran it in release mode without the debugger, BUT somehow my configurations weren't correct. I think they weren't transferred over from your 2005 solution. Now that I have checked and fixed them, here are the results:
Visual Studio 2008 Express with proper release configuration
I think the asm will run faster if you hard-code the size of the table. I think it's only fair, seeing that the compiler does that :P
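For what it's worth, here is what "the compiler knows the size" can look like on the C++ side. This is only an illustration: the function being benchmarked in this thread is not shown here, so a simple table sum stands in for it, and `sum_runtime` and `sum_fixed` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Size known only at run time: the compiler has to emit a general loop
// that handles any count.
std::uint32_t sum_runtime(const std::uint32_t* table, std::size_t count) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += table[i];
    }
    return total;
}

// Size baked into the type: N is a compile-time constant, so the compiler
// is free to unroll or vectorize without a remainder loop.
template <std::size_t N>
std::uint32_t sum_fixed(const std::uint32_t (&table)[N]) {
    std::uint32_t total = 0;
    for (std::size_t i = 0; i < N; ++i) {
        total += table[i];
    }
    return total;
}
```

Whether the fixed-size version actually comes out faster depends on the compiler and the flags, so check the generated code before counting on it.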
Looks like the compiler beat me here, though :-\ Perhaps we should come up with something more complicated for it. We did a big contest in college where we each had to write a piece of software that could solve some type of math problem. I wrote C with inline assembly and it beat all the others pretty easily; the only one that came close was Fortran, believe it or not. One guy used Java... not sure what he was thinking.
Last edited by ninja9578; May 15th, 2010 at 05:10 PM.
Perhaps a SURF feature extractor? First thing that comes to mind. I have a GPU implementation (proprietary) which runs at 30+ fps, 1280x1024 images, on an NVIDIA 8-series GPU. Might be interesting to see what highly optimized CPU approaches can do. On the other hand, that might be rather ambitious.
A GPU will usually beat a CPU implementation at this kind of work; it has special hardware for matrix and vector math. I'll bet that in software it would run at about 1 frame every couple of seconds :P
I have to say I am a bit disappointed with this thread. No interest in over two weeks! Has anyone seen Dave1024? Hope he is OK.
And I would *REALLY* like to see the GPU implementation!