Writing/Understanding Optimized Code


robione
December 28th, 2009, 04:11 AM
I started tinkering with image processing/computer vision and decided I wanted to get as much speed as I could. In this instance I started with completely unoptimized source from a website and began optimizing it in C++, as I use VC++ 98 EE. Now I'm in the process of converting it to x86 assembly using the inline assembler. I started timing different sections of code and was not so surprised that my initial version of the assembly ran slower than the C code, since it was written before I started trying to optimize it.

After getting rid of a bunch of stuff, I'm scratching my head as to why it's still slower. The C code is doing ~750k - 1000k more iterations a second (~10% faster) than the assembly and I'm kinda confused as to how. I was hoping that someone here might know a trick that I don't (as I'm pretty new to assembly). The code is pretty simple: it's part of a Laplace edge detection algorithm. The commented C code and the asm code perform the same actions.


// iSum = 0;
// iSum -= *(pbySrcOffset-2) + *(pbySrcOffset-1) + *(pbySrcOffset+2) + *(pbySrcOffset+1) - (24 * (*pbySrcOffset));

__asm mov esi,pbySrcOffset
while(time_f-time_i < 1000) {
    __asm {
        imul  ecx,byte ptr [esi+2],24   // ecx = 24 * center pixel
        movzx eax,byte ptr [esi]        // neighbour at offset 0
        movzx ebx,byte ptr [esi+1]      // neighbour at offset 1
        movzx edx,byte ptr [esi+3]      // neighbour at offset 3
        sub   ecx,eax
        sub   ecx,ebx
        movzx eax,byte ptr [esi+4]      // neighbour at offset 4
        sub   ecx,edx
        sub   ecx,eax
        mov   iSum,ecx                  // store the result
    }
    count++;
    time_f = ::timeGetTime();
}
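
For comparison, here is a minimal portable C++ sketch of the same row term. It assumes the centre pixel sits at p[2], matching the byte offsets the assembly reads from esi; the function names are made up for illustration and are not from the original source. It widens each byte to int before the multiply, which is worth comparing against the imul line above, since the three-operand imul normally takes a 32-bit source operand rather than a byte.

```cpp
#include <cstdint>

// Hypothetical helper: 1-D Laplace term for the pixel at p[2], using the
// neighbours p[0], p[1], p[3] and p[4] (the offsets the assembly reads).
inline int laplace_row_term(const std::uint8_t* p) {
    // Each byte is promoted to int before the arithmetic, like movzx.
    return 24 * p[2] - p[0] - p[1] - p[3] - p[4];
}

// The same value written the way the commented-out C line does it, with
// `center` pointing at the middle pixel (center == p + 2).
inline int laplace_row_term_centered(const std::uint8_t* center) {
    int sum = 0;
    sum -= *(center - 2) + *(center - 1) + *(center + 2) + *(center + 1)
           - (24 * (*center));
    return sum;
}
```

Both functions return the same value for the same five bytes, so either can serve as a reference when checking the assembly's output.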


Why don't I just let the compiler do it? I guess I like the challenge, and I like to know how stuff works at a deeper level than most. Also, I wrote a bunch of stuff with SSE2, and looking at the assembly is easier for me than intrinsics... and, if I remember right, more portable too. So I'd like to take what I learn here and hopefully squeeze some more speed out there. My SSE times are way faster than the C/x86 code, but I have a feeling that if the compiler can squeeze more time out of the simple code above, there is probably more time that can be squeezed out of my SSE code too.

If I could find my code when debugging the release build I could do this on my own, but that doesn't seem to be an option unfortunately. The first instruction I go to is always the beginning of the program :/

Thanks for any input guys. (There's more to the timing code than shown here, if the thought crossed your mind.... like updating time_i :) It's in an outer loop)

ninja9578
December 28th, 2009, 04:14 PM
Most IDEs have an option to output assembly for a specific chunk of code. Have VC++ give you what it's doing, then see if you can learn from that.

My guess as to why the compiler's version is faster than yours is that you are mixing assembly and C++. With G++, you can tell the assembly block which registers are going to be clobbered, so it only saves those. My guess is that VC++ is pushing all registers onto the stack and then popping them all off at the end of the asm block, which is a huge amount of overhead and could be causing your speed issues.
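
To illustrate the G++ clobber list mentioned above, here is a rough sketch using GCC-style extended asm in AT&T syntax. The function name is hypothetical, the five-byte layout (centre pixel at p[2]) is assumed from the assembly in the first post, and on compilers or targets without this syntax the sketch falls back to plain C++.

```cpp
#include <cstdint>

// Hypothetical helper: the same row term, written with GCC extended asm.
inline int laplace_row_term_gcc(const std::uint8_t* p) {
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int sum;
    __asm__(
        "movzbl 2(%1), %0\n\t"      // sum = centre pixel, zero-extended
        "imull  $24, %0, %0\n\t"    // sum *= 24
        "movzbl 0(%1), %%eax\n\t"
        "subl   %%eax, %0\n\t"
        "movzbl 1(%1), %%eax\n\t"
        "subl   %%eax, %0\n\t"
        "movzbl 3(%1), %%eax\n\t"
        "subl   %%eax, %0\n\t"
        "movzbl 4(%1), %%eax\n\t"
        "subl   %%eax, %0"
        : "=&r"(sum)                // output: any register, written early
        : "r"(p)                    // input: pointer in any register
        : "eax", "cc", "memory");   // clobbers: scratch register, flags,
                                    // and a note that memory is read
    return sum;
#else
    // Fallback for other compilers and targets: plain C++.
    return 24 * p[2] - p[0] - p[1] - p[3] - p[4];
#endif
}
```

The clobber list names only eax as a scratch register, so the compiler is free to keep other values live in the remaining registers across the block.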

robione
December 28th, 2009, 05:27 PM
Thanks ninja. As far as I know I can disassemble code in debug mode (and go wherever I want in the code), but there are no optimizations. In release mode I can only start at the beginning of the program no matter where I tell it to go. Hence the problem. I can only copy/paste what I see on the screen too.... Kinda annoying. Well, thanks for the advice, I'll look deeper into my documentation.

[Time passes.] An idea came to me. If I save the EIP and output it instead of count, I think I can find my code.... well, I thought I could. How do I translate the value I find into a valid memory address? It looks too short to be valid, not to mention that the EIP shown in the debugger is sooooo far from the number I got. Basically I read that:


__asm {
call label
label:
pop eax
mov count,eax
}


... will save the EIP register. I'm getting 40113F as an address, while at the point where the debugger attached to the process the EIP is 7C90120E. The address I'm getting just seems kinda low. Does the CS register somehow come into play here?
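
On the address arithmetic, a small worked sketch may help. It assumes the program is loaded at 0x00400000, the usual default image base for 32-bit Windows executables built with the Microsoft linker; the real base may differ. Under that assumption 40113F is an ordinary code address 0x113F bytes into the image, and since Win32 uses a flat memory model, no segment arithmetic with CS is needed. The names kAssumedImageBase and offset_from_base are made up for illustration.

```cpp
#include <cstdint>

// Assumption: the executable is loaded at 0x00400000, the usual default
// image base for 32-bit Windows EXEs. The actual base may be different.
constexpr std::uint32_t kAssumedImageBase = 0x00400000u;

// Offset of a captured EIP value from the assumed image base.
constexpr std::uint32_t offset_from_base(std::uint32_t address) {
    return address - kAssumedImageBase;
}
```

Applied to the value from the post, offset_from_base(0x0040113F) gives 0x113F.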