Writing/Understanding Optimized Code

**robione** · December 28th, 2009, 05:11 AM

I started tinkering with image processing/computer vision and decided I wanted to get as much speed as I could. I started with completely unoptimized source from a website and began optimizing it in C++, as I use VC++ 98 EE. Now I'm in the process of converting it over to x86 assembly using the inline assembler. I started timing different sections of code and was not so surprised that my initial version of the assembly ran slower than the C code, since it was written before I started trying to optimize it.

After getting rid of a bunch of stuff, I'm scratching my head as to why it's still slower. The C code is doing ~750k - 1000k more iterations a second (~10% faster) than the assembler, and I'm kinda confused as to how. I was hoping that someone here might know a trick that I don't (as I'm pretty new to assembly). The code is pretty simple: it's part of a Laplace edge detection algorithm. The commented C code and the asm code perform the same actions.

Code:

//		iSum = 0;
//		iSum -= *(pbySrcOffset-2) + *(pbySrcOffset-1) + *(pbySrcOffset+2) + *(pbySrcOffset+1) - (24 * (*pbySrcOffset));

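__asm	mov		esi,pbySrcOffset
while(time_f-time_i < 1000) {
	__asm {
		imul	ecx,byte ptr [esi+2],24		// ecx = 24 * centre pixel
		movzx	eax,byte ptr [esi]		// load the neighbours at offsets 0, 1 and 3
		movzx	ebx,byte ptr [esi+1]
		movzx	edx,byte ptr [esi+3]
		sub	ecx,eax				// subtract each neighbour from the total
		sub	ecx,ebx
		movzx	eax,byte ptr [esi+4]		// last neighbour, reusing eax
		sub	ecx,edx
		sub	ecx,eax
		mov	iSum,ecx			// store the result
	}
	count++;
	time_f = ::timeGetTime();
}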
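For comparison, here is a minimal sketch of the arithmetic both versions are meant to compute, written as plain C++. The function names are hypothetical and not from the original source. It assumes the asm block treats its pointer as the start of the five-byte window (so the centre pixel sits at offset 2), while the commented C line treats its pointer as the centre pixel itself.

```cpp
#include <cstdint>

// Hypothetical helper (not from the thread): Laplace response at the pixel
// `center` points to, following the commented C line. Reads center[-2]..center[2].
inline int laplace1d(const std::uint8_t* center) {
    int neighbours = center[-2] + center[-1] + center[1] + center[2];
    return 24 * center[0] - neighbours;
}

// Same arithmetic, indexed the way the __asm block reads memory: `window`
// points at the first of the five bytes, so the centre pixel is window[2].
inline int laplace1d_window(const std::uint8_t* window) {
    return 24 * window[2] - window[0] - window[1] - window[3] - window[4];
}
```

Under that assumption, `laplace1d(p + 2)` and `laplace1d_window(p)` give the same value for any five-byte window starting at `p`.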

Why don't I just let the compiler do it? I guess I like the challenge, and I like to know how stuff works at a deeper level than most. Also, I wrote a bunch of stuff with SSE2, and looking at the assembly is easier for me than intrinsics... and, if I remember right, more portable too. So I'd like to take what I learn here and hopefully squeeze some more speed out there. My SSE times are way faster than the C/x86 code, but I have a feeling that if the compiler can squeeze more time out of the simple code above, there is probably more time that can be squeezed out of my SSE code too.

If I could find my code while debugging the release build I could do this on my own, but that doesn't seem to be an option, unfortunately. The first instruction I go to is always the beginning of the program :/

Thanks for any input guys. (There's more to the timing code than shown here, if the thought crossed your mind, like updating time_i. It's in an outer loop.)
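One thing that may be worth checking in a loop like this: reading the clock on every iteration adds a function call to each pass, which can be comparable to the handful of instructions being measured. Below is a rough sketch of a harness that reads the clock once per batch instead. It uses `std::chrono` rather than `timeGetTime` purely so that it stays portable; the function name and the `batch` and `millis` parameters are assumptions for illustration, not part of the original code.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical harness: runs `batch` kernel evaluations between clock reads
// for at least `millis` milliseconds and returns iterations per second.
inline double iterations_per_second(const std::uint8_t* center,
                                    long batch, int millis) {
    using Clock = std::chrono::steady_clock;
    volatile int sink = 0;   // volatile store keeps the work from being optimized away
    long count = 0;
    const Clock::time_point start = Clock::now();
    const Clock::time_point limit = start + std::chrono::milliseconds(millis);
    Clock::time_point now = start;
    do {
        for (long i = 0; i < batch; ++i) {
            sink = 24 * center[0] - center[-2] - center[-1] - center[1] - center[2];
        }
        count += batch;
        now = Clock::now();  // one clock read per batch, not per iteration
    } while (now < limit);
    const std::chrono::duration<double> elapsed = now - start;
    return count / elapsed.count();
}
```

With a large enough `batch`, the clock overhead is spread over many iterations, so differences between two kernels should show up more clearly.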

**ninja9578** · December 28th, 2009, 05:14 PM

Most IDEs have an option to output assembly for a specific chunk of code. Have VC++ give you what it's doing, then see if you can learn from that.

My guess as to why the compiler's version is going faster than yours is that you are mixing assembly and C++. With G++, you can tell the assembly block which registers are going to be clobbered, so it only saves those. My guess is that VC++ is pushing all registers onto the stack, then popping them all off at the end of the asm block, which is a huge amount of overhead and could be causing your speed issues.
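To illustrate the G++ feature mentioned here, this is a small sketch of GCC-style extended asm with an explicit clobber list. The function `times24` is a made-up example, not code from the thread. It assumes GCC or Clang targeting x86 with the default AT&T syntax, and falls back to plain C++ elsewhere.

```cpp
// Hypothetical example: multiply by 24 using ECX as a scratch register.
inline int times24(int value) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    int result;
    __asm__(
        "movl  %1, %%ecx\n\t"            // copy the input into the scratch register
        "imull $24, %%ecx, %%ecx\n\t"    // ecx = ecx * 24
        "movl  %%ecx, %0"                // copy the product to the output
        : "=r"(result)                   // output: any general-purpose register
        : "r"(value)                     // input: any general-purpose register
        : "ecx", "cc");                  // clobbers: ECX and the flags
    return result;
#else
    return value * 24;                   // portable fallback for other compilers/targets
#endif
}
```

The clobber list tells the compiler that only ECX and the flags are overwritten, so it can keep other values in registers across the block.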

**robione** · December 28th, 2009, 06:27 PM

Thanks ninja. As far as I know, I can disassemble code in debug mode (and go wherever I want in the code), but there are no optimizations. In release mode I can only start at the beginning of the program, no matter where I tell it to go. Hence the problem. I can only copy/paste what I see on the screen too, which is kinda annoying. Well, thanks for the advice. I'll look deeper into my documentation.

[Time passes.] An idea came to me. If I save the EIP and output it instead of count, I think I can find my code... well, I thought I could. How do I translate the value I find into a valid memory address? It looks too short to be valid, not to mention that the EIP shown in the debugger is so far from the number I got. Basically, I read that:

Code:

__asm {
	call label
label:
	pop eax
	mov count,eax
}

... will save the EIP register. I'm getting 40113F as an address, while wherever the debugger attached to the process, the EIP is 7C90120E. The address I'm getting just seems kinda low. Does the CS register somehow come into play here?
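As a sanity check on the number itself: in a flat 32-bit Windows process the CS base is zero, so the value popped into EAX is already a usable address. 40113F only looks short because the leading zeros are not printed. The linker's default base address for an EXE is 0x00400000, so a value just above that is plausible for code in the program's own image. Whether 7C90120E belongs to a system DLL where the debugger happened to stop is an assumption here, not something the thread confirms. The helper names below are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format a 32-bit value as an 8-digit hex address.
inline std::string hex32(std::uint32_t value) {
    char buf[9];
    std::snprintf(buf, sizeof buf, "%08lX", static_cast<unsigned long>(value));
    return std::string(buf);
}

// Hypothetical helper: offset of an address from an assumed image base
// (0x00400000 is the default for a Win32 EXE; a real build may differ).
inline std::uint32_t offset_from_base(std::uint32_t address,
                                      std::uint32_t image_base = 0x00400000u) {
    return address - image_base;
}
```

For example, `hex32(0x40113F)` yields `0040113F`, and `offset_from_base(0x40113F)` is `0x113F`, which is the offset into the image where the `pop eax` sits under that base-address assumption.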
