How can I convert an integer buffer of size 176x144 into an unsigned char buffer of size 176x144?
All the data in the integer buffer is in the range [0,255].
Is there a faster method for this?
Using a for() loop causes a critical performance issue.
I can't see any way that a loop won't be involved somewhere.
Do you want the result in the same buffer or copied to a different one?
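For reference, here is a minimal sketch of the two variants that question distinguishes, assuming every value already fits in [0,255]. The function names narrow_to_bytes and narrow_in_place are made up for illustration and do not come from the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: copy the low byte of each int into a separate buffer.
void narrow_to_bytes(const int* src, unsigned char* dst, std::size_t count)
{
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = static_cast<unsigned char>(src[i]);
}

// Hypothetical helper: same conversion, reusing the front of the int buffer.
// Byte i lands at offset i, which never overlaps an int that is still unread
// (int i + 1 starts at offset 4 * (i + 1)), so a forward pass is safe.
unsigned char* narrow_in_place(int* buf, std::size_t count)
{
    assert(buf != nullptr);
    unsigned char* out = reinterpret_cast<unsigned char*>(buf);
    for (std::size_t i = 0; i < count; ++i)
    {
        const int value = buf[i];  // read before the slot can be overwritten
        out[i] = static_cast<unsigned char>(value);
    }
    return out;
}
```

For the original question, count would be 176 * 144 = 25344.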
"It doesn't matter how beautiful your theory is, it doesn't matter how smart you are. If it doesn't agree with experiment, it's wrong."
Richard P. Feynman
If it's time critical, then you can't use a for loop. For loops take too much memory and processing power.
pseudocode
Code:
allocate your new buffer
mov ecx, 6336 (176 x 144 / 4) //you'll see why divide by four below
mov the pointer to your int array in a register
mov the pointer to your char array in a register
label:
mov the first byte of the int array in a lolo register
mov the first byte of the next int in the hilo register
mov the first byte of the next next int into the lohi register
mov the first byte of the next next next int to the hihi register
write the entire register to the char array
increment the int array pointer by 16
increment the char array pointer by 4
LOOP label
In this pseudocode you are doing four items at a time. This will be many many times faster than doing it in C. It might be even faster to loop only 3168 times (176 x 144 / 8) and do 8 at a time using 2 registers.
This is for all you guys who occasionally tell me that an optimizing compiler will always beat an assembly programmer. For small pieces of code a good assembly engineer will always beat the compiler. :P Knowledge of how processor pipelines and caching work is the key.
Last edited by ninja9578; May 13th, 2010 at 10:27 AM.
...This is for all you guys who occasionally tell me that an optimizing compiler will always beat an assembly programmer. For small pieces of code a good assembly engineer will always beat the compiler. :P Knowledge of how processor pipelines and caching work is the key.
This sounds like a challenge, and I accept!
Let’s take a measurable buffer size of 1,000,000 integers and transfer them into an unsigned char array.
I suggest this benchmark (the timing is Windows-specific; you can substitute your OS’s favorite).
#pragma once
#ifndef _WIN32_WINNT // Allow use of features specific to Windows XP or later.
#define _WIN32_WINNT 0x0501 // Change this to the appropriate value to target other versions of Windows.
#endif
#include <windows.h>
#include <iostream>
Just add your code to YourFunction() and run it, then post your result here.
Please leave the Simple() function in so that we can eliminate differences in hardware.
Everybody is welcome to participate. Since Dave stated that this issue is performance-critical, I think that doing it in this thread is appropriate.
Vlad
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio: FeinWindows - replacement windows manager for Visual Studio, and more...
Time slices here are also critical in determining which function is faster. I've written a function in C++ for YourFunction() defined as follows.
Also, the frequency returned from QueryPerformanceFrequency shouldn't be divided by 1000.0 if you want seconds: that frequency IS the number of ticks in a second. Dividing it by 1000.0 gives the number of "performance ticks" in a millisecond, which I assume is what you want.
Code:
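void YourFunction()
{
    // Pack four source ints into one 32-bit word per iteration.
    const int nCount = TABLE_SIZE >> 2;
    int * pSource = (int *)&src[0];
    int * pDest = (int *)&dst[0];
    for (int i = 0; i < nCount; ++i)
    {
        // Build the word from the highest byte down, so that
        // pSource[0] ends up in the lowest byte (little-endian order).
        int nTemp = pSource[3];
        nTemp <<= 8;
        nTemp |= pSource[2];
        nTemp <<= 8;
        nTemp |= pSource[1];
        nTemp <<= 8;
        nTemp |= pSource[0];
        *pDest = nTemp;
        pSource += 4;
        ++pDest;
    }
}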
And I get different results, sometimes YourFunction() is faster, sometimes Simple() is faster. It happens when I reverse the order of the calls. Which would tell me something.
Last edited by CppCoder2010; May 13th, 2010 at 08:43 PM.
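On the QueryPerformanceFrequency point in the post above, the arithmetic can be written out as a small sketch. The helper name ticks_to_milliseconds is hypothetical; it only assumes the counter values and the frequency are 64-bit tick counts, as the Windows API reports them. Multiplying the elapsed ticks by 1000.0 and dividing by the frequency gives the same result as dividing the frequency by 1000.0 first.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a pair of performance-counter readings to
// milliseconds. ticks_per_second is the value QueryPerformanceFrequency
// reports (ticks in one second).
double ticks_to_milliseconds(std::int64_t start,
                             std::int64_t stop,
                             std::int64_t ticks_per_second)
{
    assert(ticks_per_second > 0);
    const double elapsed_ticks = static_cast<double>(stop - start);
    return elapsed_ticks * 1000.0 / static_cast<double>(ticks_per_second);
}
```

With a frequency of 10,000,000 ticks per second, an interval of 2,500,000 ticks works out to 250 ms.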
I'd recommend dropping the volatile qualifiers from the data. It won't do anything except kill possible optimizations.
I was only trying to prevent "optimizing out"...
But it looks like you are correct, it works fine (loops 1,000,000 times) without it.
Hey, if the compiler can optimize out the loop and still get the data where it needs to be, then mission accomplished. Optimizing out is really only a problem in truly trivial speed tests which probably won't mean much anyway.
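One common way to drop volatile and still keep the compiler from discarding the work is to consume the output after the timed section, for example by printing a checksum. This is only a sketch of that idea; the checksum helper is hypothetical and is not part of the benchmark posted in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sum every output byte so the stores into the
// destination buffer have an observable effect.
unsigned long checksum(const unsigned char* data, std::size_t count)
{
    assert(data != nullptr);
    unsigned long sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}
```

Calling checksum(dst, TABLE_SIZE) after stopping the timer and printing the result keeps the conversion loop alive without adding volatile accesses inside it.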
...I get different results, sometimes YourFunction() is faster, sometimes Simple() is faster. It happens when I reverse the order of the calls. Which would tell me something.
I too have noticed this variation. Looks like it has something to do with the memory being “touched”.
I’ve fixed it by calling Init() first thing from the main() function:
Code:
void Init()
{
    for(int i = 0; i < TABLE_SIZE; i++)
    {
        src[i] = i & 0xFF;
        dst[i] = 0;
    }
}
I can then call both functions repeatedly in different order but still get consistent results.
My first attempt at “4 elements in 1 iteration” looks almost like yours:
Code:
void FourInOne()
{
    // Write four packed bytes per iteration through an int pointer.
    int* p = (int*)dst;
    for(int i = 0; i < TABLE_SIZE; i += 4)
    {
        *p++ = src[i] | (src[i+1] << 8) | (src[i+2] << 16) | (src[i+3] << 24);
    }
}
But it only gets minimal benefit over the Simple() function – about 0.5% <edited> I meant - 5% </edited>
I am working on my “optimized” implementation, but interested to see the ASM results as well.
Last edited by VladimirF; May 14th, 2010 at 06:15 PM.
Reason: Correction: %5, NOT 0.5%!
Well, this is all pretty sad, actually.
Below are asm listings for four functions:
1. Simple – assigning one byte at a time, in a loop.
2. Lindley’s code (see above).
3. My 4-in-1 code.
4. My super-secret SSE implementation:
- looping over 16 elements at a time;
- load two groups of 4 ints into two XMM registers;
- pack into one XMM register (16-bit values);
- repeat for the third and fourth group of 4 ints;
- pack two XMM registers with 16-bit values into one with 8-bit values.
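The steps in that list can be sketched with SSE2 intrinsics roughly as follows. This is not the implementation that was benchmarked in this thread; the name narrow_sse2, the use of unaligned loads and stores, and the requirement that count be a multiple of 16 are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch: convert 16 ints to 16 bytes per iteration.
// Values outside [0,255] are saturated by the pack instructions,
// not truncated, so this only matches a plain cast for in-range data.
void narrow_sse2(const int* src, unsigned char* dst, std::size_t count)
{
    assert(count % 16 == 0);
    for (std::size_t i = 0; i < count; i += 16)
    {
        // Load four groups of 4 ints (unaligned loads, to be safe).
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 4));
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
        __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 12));

        // 32-bit -> 16-bit with signed saturation (PACKSSDW).
        __m128i ab = _mm_packs_epi32(a, b);
        __m128i cd = _mm_packs_epi32(c, d);

        // 16-bit -> 8-bit with unsigned saturation (PACKUSWB).
        __m128i bytes = _mm_packus_epi16(ab, cd);

        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), bytes);
    }
}
```

Both pack instructions keep their first operand in the low half of the result, so the element order is preserved without any extra shuffling.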
Your mileage may vary, but the ratio should be the same.
I *REALLY* had bigger hopes for SSE… I guess if there was an instruction to pack 32-bit values directly into 8-bit (bypassing 16-bit), we would get a little better results. Or did I miss such an instruction? Any SSE experts here?
<edited>
@ninja9578 - Looking at generated asm, I doubt that you will be able to shave anything off. But – good luck!
Last edited by VladimirF; May 14th, 2010 at 06:51 PM.
I *REALLY* had bigger hopes for SSE… I guess if there was an instruction to pack 32-bit values directly into 8-bit (bypassing 16-bit), we would get a little better results. Or did I miss such an instruction? Any SSE experts here?
I would guess that no matter how you try to do it, you are going to be limited by memory performance. Also, if the memory isn't aligned properly, SSE is going to run dog slow.
You might be right. Reading 400,000,000 bytes and writing 100,000,000 bytes must take some time.
And I think my arrays are aligned OK; the addresses end with 80h – what else could you wish for?
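An address whose low byte is 80h is a multiple of 128, so it also satisfies the 16-byte alignment that aligned SSE loads and stores require. A minimal way to check this in code, assuming a power-of-two alignment (is_aligned is a hypothetical helper, not something used in the benchmark above):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of alignment.
// alignment must be a non-zero power of two.
bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For example, is_aligned(src, 16) and is_aligned(dst, 16) could be asserted once at startup before running an SSE version.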