robione
December 28th, 2009, 04:11 AM
I started tinkering with image processing/computer vision and decided I wanted to get as much speed as I could. In this instance I started with completely unoptimized source from a website and began optimizing it in C++, as I use VC++ 98 EE. Now I'm in the process of converting it over to x86 assembly using the inline assembler. I started timing different sections of code and was not so surprised that my initial version of the assembly ran slower than the C code, since it was written before I started trying to optimize it.
After getting rid of a bunch of stuff, I'm scratching my head as to why it's still slower. The C code is doing ~750k - 1000k more iterations a second (~10% faster) than the assembler, and I'm kinda confused as to how. I was hoping that someone here might know a trick that I don't (as I'm pretty new to assembly). The code is pretty simple: it's part of a Laplace edge detection algorithm. The commented C code and the asm code perform the same actions.
// iSum = 0;
// iSum -= *(pbySrcOffset-2) + *(pbySrcOffset-1) + *(pbySrcOffset+2) + *(pbySrcOffset+1) - (24 * (*pbySrcOffset));
__asm mov esi,pbySrcOffset
while(time_f-time_i < 1000) {
    __asm {
        imul ecx,byte ptr [esi+2],24
        movzx eax,byte ptr [esi]
        movzx ebx,byte ptr [esi+1]
        movzx edx,byte ptr [esi+3]
        sub ecx,eax
        sub ecx,ebx
        movzx eax,byte ptr [esi+4]
        sub ecx,edx
        sub ecx,eax
        mov iSum,ecx
    }
    count++;
    time_f = ::timeGetTime();
}
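For reference, here is a minimal standalone sketch of what the commented C expression works out to for one row of the kernel: 24 times the centre pixel minus the two neighbours on each side. The names LaplaceRow and CheckLaplaceRow are made up for this example and are not in my actual source, and it assumes the pointer passed in addresses the centre pixel. Note that the asm addresses the same five bytes as [esi] through [esi+4] with the centre at [esi+2], so the two only line up if esi points two bytes before the centre.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from the original source: one row of the
// 5-tap kernel, with p assumed to point at the centre pixel.
inline int LaplaceRow(const std::uint8_t* p)
{
    // Same value as "iSum = 0; iSum -= (neighbours - 24 * centre)".
    return 24 * p[0] - (p[-2] + p[-1] + p[1] + p[2]);
}

// Hypothetical sanity check with a hand-computed value.
inline void CheckLaplaceRow()
{
    const std::uint8_t row[5] = {1, 2, 10, 3, 4};
    // 24 * 10 - (1 + 2 + 3 + 4) = 230
    assert(LaplaceRow(row + 2) == 230);
}
```

Checking the asm result against a plain version like this on a few known inputs is a cheap way to confirm both paths really compute the same sum before comparing their timings.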
Why don't I just let the compiler do it? I guess I like the challenge, and I like to know how stuff works at a deeper level than most. Also, I wrote a bunch of stuff with SSE2, and looking at the assembly is easier for me than intrinsics and, if I remember right, more portable too. So I'd like to take what I learn here and hopefully squeeze some more speed out there. My SSE times are way faster than the C/x86 code, but I have a feeling that if the compiler can squeeze more time out of the simple code above, there is probably more time that can be squeezed out of my SSE code too.
If I could find my code when debugging the release build I could do this on my own, but that doesn't seem like an option unfortunately. The first instruction I go to is always the beginning of the program :/
Thanks for any input guys. (There's more to the timing code than shown here, if the thought crossed your mind, like updating time_i :) It's in an outer loop.)