Data Type Conversion

**VladimirF** · May 15th, 2010, 01:04 PM

OK, here are new results.
I have used Joeman’s adaptation of ninja9578’s asm. I don’t speak GCC’s asm, but the code looks very similar (to the naked eye).
Anyway, here is machine code, see if there is anything you don’t like about it:

Code:

	Asm();
004013F1  push        edx  
004013F2  push        ebx  
004013F3  push        ecx  
004013F4  mov         edx,offset src (6362480h) 
004013F9  mov         ebx,offset dst (404380h) 
004013FE  mov         ecx,dword ptr ds:[403158h] 
00401404  mov         ah,byte ptr [edx+0Ch] 
00401407  mov         al,byte ptr [edx+8] 
0040140A  shl         eax,10h 
0040140D  mov         ah,byte ptr [edx+4] 
00401410  mov         al,byte ptr [edx] 
00401412  mov         dword ptr [ebx],eax 
00401414  add         ebx,4 
00401417  add         edx,10h 
0040141A  loop        main+374h (401404h) 
0040141C  pop         edx  
0040141D  pop         ebx  
0040141E  pop         ecx
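
For anyone who doesn't read x86 listings, here is a rough C++ sketch of what the loop body appears to do: read four source dwords (stride 4 bytes), keep the low byte of each, and write them out as one packed dword. The name pack_low_bytes and the assumption that src holds 32-bit values are inferred from the listing, not taken from the attached solution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: keeps the low byte of every 32-bit source element.
// Like the listing, it consumes four source elements per iteration.
void pack_low_bytes(const std::uint32_t* src, std::uint8_t* dst, std::size_t count)
{
    assert(count % 4 == 0);  // elements are handled in groups of four
    for (std::size_t i = 0; i < count; i += 4) {
        dst[i + 0] = static_cast<std::uint8_t>(src[i + 0]);  // mov al,[edx]
        dst[i + 1] = static_cast<std::uint8_t>(src[i + 1]);  // mov ah,[edx+4]
        dst[i + 2] = static_cast<std::uint8_t>(src[i + 2]);  // mov al,[edx+8], then shl eax,10h
        dst[i + 3] = static_cast<std::uint8_t>(src[i + 3]);  // mov ah,[edx+0Ch], then shl eax,10h
    }
}
```

Because x86 is little-endian, storing eax writes the four bytes in the same order as the four assignments above.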

And results:

Code:

Init array
291.013

Simple loop
88.0397

4-in-1 loop
85.2146

ninja9578's Asm
105.585

Chris F's Asm
90.5299

Lindley
90.5767

Vlad's SSE
81.3531

For lucky owners of Visual Studio 2005 (or above), I am attaching a zipped solution so that you can verify my findings.
My conclusion: it is VERY HARD (if not impossible) to beat an optimizing compiler with hand-coded asm. I bet it (at least a good one) knows everything about cache, pipeline, prefetch, etc.
So the bottom line (as I see it) is: in a fight for performance, choose the best algorithm and implement it as simply as possible, so the compiler doesn’t get confused and is able to optimize it nicely.

**Joeman** · May 15th, 2010, 01:51 PM

Visual Studio 2008 Express (EDIT: it wasn't 2010):

Code:

Init array
598.937

Simple loop
250.653

4-in-1 loop
128.359

ninja9578's Asm
116.295

Chris F's Asm
94.1516

Lindley
311.069

Vlad's SSE
140.421

I ran the test at least 4 times to make sure the numbers didn't fluctuate too much.

It seems some of the assembly versions won by a good margin.
Just to be clear, I ran it in release mode without starting the debugger. I did not make any changes to the build configurations.

I even called the functions in different orders to make sure the order didn't affect the overall timings.
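
A minimal sketch of that kind of repeated measurement, assuming a C++11 compiler with std::chrono. TimingSummary and time_repeated are hypothetical names, not part of the attached solution. Reporting both the best and the mean run makes it easier to see how much the numbers fluctuate.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical result type: fastest run and average run, in milliseconds.
struct TimingSummary {
    double best_ms;
    double mean_ms;
};

// Calls fn() `runs` times and times each call with a monotonic clock.
template <typename Fn>
TimingSummary time_repeated(Fn&& fn, int runs)
{
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::max();
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = Clock::now();
        fn();
        const auto stop = Clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        best = std::min(best, ms);  // least disturbed run
        total += ms;                // accumulated for the mean
    }
    return TimingSummary{best, total / runs};
}
```

To check for order effects, call time_repeated once per candidate function, then repeat with the candidates in a different order and compare the summaries.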

Overall, I don't recommend placing assembly in C++, simply because it ties the code to a particular processor, compiler, etc.

If speed is a must and you know which platform and compiler your code targets, I suppose this wouldn't be bad, as long as you verify that it really is faster.

**Joeman** · May 15th, 2010, 02:17 PM

I screwed up and said the last results were from 2010. I have now corrected my previous post. And now for Visual Studio 2010 Express:

Code:

Init array
352.502

Simple loop
93.4141

4-in-1 loop
83.0259

ninja9578's Asm
104.253

Chris F's Asm
97.9349

Lindley
96.6014

Vlad's SSE
84.4739

The numbers did fluctuate a good bit though :S so this is the averaged run time

Seems like the simple loop is the best choice in this case

**Chris_F** · May 15th, 2010, 02:41 PM

That definitely seems as if it's getting close to the memory bandwidth limit. I have yet to write an inline asm function that outperforms the compiler at the same task.

I once thought that I could write an SSE memcpy function that would be lightning fast... but its results were identical to the std library which I'm pretty sure didn't compile to SSE.

I'm no expert on cache lines and such, but I've found that the way in which you access memory can make a world of difference.

**VladimirF** · May 15th, 2010, 02:50 PM

Originally Posted by Joeman

I screwed up and said the last results were from 2010. I have now corrected my previous post. And now for Visual Studio 2010 Express:

You had me going mad for almost half an hour!

I am even installing Express to verify your results.
Could it be that you screwed up more than once and ran a Debug build (or simply an unoptimized one) in your first test? Because your numbers are in line with my Debug.

Anyway, I too screwed up (a little).
I can shave another 5% off my SSE implementation by replacing the four calls to _mm_loadu_si128 (unaligned) with calls to _mm_load_si128 (aligned). I should have listened more carefully to Chris F in post #14. For some reason, I thought that having the memory aligned was enough.
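
The aligned/unaligned distinction can be sketched roughly as follows. This is not the code from the attached solution: narrow_sse2 and is_aligned16 are hypothetical names, and the and/pack sequence is just one way to keep the low byte of each 32-bit element. It assumes an x86 target with SSE2 and uses only standard intrinsics from emmintrin.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Narrows 16 32-bit values to 16 bytes per iteration, keeping each low byte.
void narrow_sse2(const std::uint32_t* src, std::uint8_t* dst, std::size_t count)
{
    assert(count % 16 == 0);
    const bool aligned = is_aligned16(src);  // stays true for every 64-byte step
    const __m128i low_byte = _mm_set1_epi32(0xFF);
    for (std::size_t i = 0; i < count; i += 16) {
        const __m128i* p = reinterpret_cast<const __m128i*>(src + i);
        __m128i a, b, c, d;
        if (aligned) {  // aligned loads fault on a misaligned address
            a = _mm_load_si128(p);      b = _mm_load_si128(p + 1);
            c = _mm_load_si128(p + 2);  d = _mm_load_si128(p + 3);
        } else {        // unaligned loads accept any address
            a = _mm_loadu_si128(p);     b = _mm_loadu_si128(p + 1);
            c = _mm_loadu_si128(p + 2); d = _mm_loadu_si128(p + 3);
        }
        // Mask to 0..255 so the saturating packs below cannot clamp anything.
        a = _mm_and_si128(a, low_byte);  b = _mm_and_si128(b, low_byte);
        c = _mm_and_si128(c, low_byte);  d = _mm_and_si128(d, low_byte);
        const __m128i ab = _mm_packs_epi32(a, b);      // 8 x 16-bit
        const __m128i cd = _mm_packs_epi32(c, d);      // 8 x 16-bit
        const __m128i out = _mm_packus_epi16(ab, cd);  // 16 x 8-bit
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), out);
    }
}
```

The point of the sketch is that alignment of the data and the choice of load intrinsic are separate things: the aligned intrinsic has to be requested explicitly, even when the buffer already sits on a 16-byte boundary.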

Also, as a benchmark, I call memcpy() between two 400,000,000-byte buffers, and in my test it takes 42 ms. THIS must be limited by memory speed.
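
To put the memcpy figure in bandwidth terms, here is a small worked calculation. It assumes each buffer is 400,000,000 bytes and counts only the bytes copied (the memory system moves roughly twice that, since every byte is read and then written). copy_rate_gb_per_s is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: bytes copied per second, in GB/s (1 GB = 1e9 bytes).
double copy_rate_gb_per_s(double bytes_copied, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;
    return bytes_copied / seconds / 1e9;
}
```

With 400,000,000 bytes and 42 ms this comes to about 9.5 GB/s copied.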

**VladimirF** · May 15th, 2010, 02:52 PM

Originally Posted by Chris_F

I once thought that I could write an SSE memcpy function that would be lightning fast... but its results were identical to the std library which I'm pretty sure didn't compile to SSE.

Actually, if you trace into memcpy (and I think the std library uses it), you'll see that it tests for alignment and the presence of SSE2, and in that case it does use SSE.

**Joeman** · May 15th, 2010, 03:04 PM

Originally Posted by VladimirF

You had me going mad for almost half an hour!

As soon as I realized I had messed up, I corrected it ASAP.

Originally Posted by VladimirF

Could it be that you screwed up more than once and ran a Debug build (or simply an unoptimized one) in your first test? Because your numbers are in line with my Debug.

Well, I ran it in release mode without the debugger, BUT somehow my configurations weren't correct. I think they weren't transferred over from your 2005 solution. Now that I have checked and fixed them, here are the results.

Visual Studio 2008 express with proper release configuration

Code:

Init array
383.338

Simple loop
85.3905

4-in-1 loop
82.2526

ninja9578's Asm
105.795

Chris F's Asm
97.2565

Lindley
92.2891

Vlad's SSE
80.3013

EDIT: I hope this is right now

**ninja9578** · May 15th, 2010, 04:59 PM

I think it will run faster if you hard-code the size of the table. I think it's only fair, seeing that the compiler does that :P

Looks like the compiler beat me here though :-\ Perhaps we should come up with something more complicated for it. We did a big contest in college where we each had to write a piece of software that could solve some type of math problem. I wrote C with inline assembly and it beat all the others pretty easily; the only one that came close was Fortran, believe it or not. One guy used Java... not sure what he was thinking.

**Lindley** · May 15th, 2010, 05:16 PM

Perhaps a SURF feature extractor? First thing that comes to mind. I have a GPU implementation (proprietary) which runs at 30+ fps, 1280x1024 images, on an NVIDIA 8-series GPU. Might be interesting to see what highly optimized CPU approaches can do. On the other hand, that might be rather ambitious.

**ninja9578** · May 15th, 2010, 07:43 PM

A GPU will always beat a CPU implementation; it has special hardware for matrix and vector math. I'll bet that in software it would run at about 1 frame every couple of seconds :P

**VladimirF** · June 4th, 2010, 12:33 PM

I have to say I am a bit disappointed with this thread. No interest in over two weeks! Has anyone seen Dave1024? Hope he is OK.

And I would *REALLY* like to see a GPU implementation!