OK, here are the new results.
I used Joeman’s adaptation of ninja9578’s asm. I don’t speak GCC’s asm, but the code looks very similar (to the naked eye).
Anyway, here is the machine code; see if there is anything you don’t like about it:
For lucky owners of Visual Studio 2005 (or later), I am attaching a zipped solution so that you can verify my findings.
My conclusion: it is VERY HARD (if not impossible) to beat an optimizing compiler with hand-coded asm. I bet it (at least a good one) knows everything about cache, pipeline, prefetch, etc.
So the bottom line (as I see it) is: in a fight for performance, choose the best algorithm and implement it as simply as possible, so that the compiler doesn’t get confused and is able to optimize it nicely.
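To illustrate what I mean by "simple", here is a rough sketch (not the code from this thread). The function names and the array-summing task are made up for the example. Both versions return the same result; the first is the plain loop I would hand to the compiler, the second is the kind of manual "cleverness" that an optimizer can often do on its own when optimizations are enabled. Which one is faster on your machine is something you would have to measure.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example task: sum an array of 32-bit integers.

// Plain version: one obvious loop with a single accumulator.
// Easy to read, and easy for an optimizer to analyze.
std::int64_t sum_simple(const std::int32_t* data, std::size_t count)
{
    std::int64_t total = 0;
    for (std::size_t i = 0; i < count; ++i)
        total += data[i];
    return total;
}

// "Clever" version: manually unrolled by four, with a cleanup loop
// for the leftover elements. More code, more places for bugs.
std::int64_t sum_unrolled(const std::int32_t* data, std::size_t count)
{
    std::int64_t t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }
    for (; i < count; ++i)  // remaining 0..3 elements
        t0 += data[i];
    return t0 + t1 + t2 + t3;
}
```

If you want to compare them, build both in a release configuration and look at the generated machine code before trusting any timing numbers.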
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com