OK, here are the new results.
I have used Joeman’s adaptation of ninja9578’s asm. I don’t speak GCC’s asm dialect, but the code looks very similar (to the naked eye).
Anyway, here is the machine code; see if there is anything you don’t like about it:
Code:
Asm();
004013F1 push edx
004013F2 push ebx
004013F3 push ecx
004013F4 mov edx,offset src (6362480h)
004013F9 mov ebx,offset dst (404380h)
004013FE mov ecx,dword ptr ds:[403158h]
00401404 mov ah,byte ptr [edx+0Ch]
00401407 mov al,byte ptr [edx+8]
0040140A shl eax,10h
0040140D mov ah,byte ptr [edx+4]
00401410 mov al,byte ptr [edx]
00401412 mov dword ptr [ebx],eax
00401414 add ebx,4
00401417 add edx,10h
0040141A loop main+374h (401404h)
0040141C pop edx
0040141D pop ebx
0040141E pop ecx
And the results:
Code:
Init array
291.013
Simple loop
88.0397
4-in-1 loop
85.2146
ninja9578's Asm
105.585
Chris F's Asm
90.5299
Lindley
90.5767
Vlad's SSE
81.3531
For lucky owners of Visual Studio 2005 (or later), I am attaching the zipped solution so that you can verify my findings.
My conclusion: it is VERY HARD (if not impossible) to beat an optimizing compiler with hand-coded asm. Both asm versions in the results above (105.585 and 90.5299) came in behind the 4-in-1 loop (85.2146). I bet the compiler (at least a good one) knows everything about cache, pipeline, prefetch, etc.
So the bottom line (as I see it) is: in a fight for performance, choose the best algorithm and implement it as simply as possible, so that the compiler doesn’t get confused and is able to optimize it nicely.
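To make "as simple as possible" concrete, here is a rough C++ sketch of what the loop in the machine code listing appears to compute: it takes the low byte of each of four consecutive 32-bit source elements and stores them as one dword. This is not the code from the attached solution. The function names narrow_simple and narrow_packed, the 32-bit element type, and the little-endian layout are my assumptions, read off the listing's offsets (0, 4, 8, 0Ch) and strides.

Code:
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical "simple loop": keep only the low byte of every source element.
// Assumes 32-bit source elements, as the listing's 4-byte offsets suggest.
void narrow_simple(const std::uint32_t* src, std::uint8_t* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = static_cast<std::uint8_t>(src[i]);
}

// Hypothetical "4-in-1" variant: pack four low bytes into one 32-bit store,
// which is what the listing does with ah/al, shl eax,10h and a dword write.
// count is the number of source elements and must be a multiple of 4.
void narrow_packed(const std::uint32_t* src, std::uint32_t* dst, std::size_t count)
{
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        dst[i / 4] =  (src[i]     & 0xFFu)
                   | ((src[i + 1] & 0xFFu) << 8)
                   | ((src[i + 2] & 0xFFu) << 16)
                   | ((src[i + 3] & 0xFFu) << 24);
    }
}
```

On a little-endian machine such as x86, both functions leave the same bytes in the destination buffer. Either one gives the optimizer a plain counted loop to work with, so it is free to unroll or vectorize as it sees fit.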

