Data Type Conversion

**Dave1024** · May 13th, 2010, 02:23 AM

Hi,

How can I convert an integer buffer of size 176x144 into an unsigned char buffer of size 176x144?
All the data in the integer buffer is in the range [0,255].

Is there a faster method for this?
When using a for() loop there is a critical performance issue.

Is there a library function available for this?

Rgds
Dave

**JohnW@Wessex** · May 13th, 2010, 04:42 AM

I can't see any way that a loop won't be involved somewhere.

Do you want the result in the same buffer or copied to a different one?
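
On the "library function" question: a minimal sketch using std::transform from the standard library, assuming the values really are in [0,255] and the result goes to a separate buffer. The helper name NarrowToBytes is made up for illustration, and the algorithm still loops internally, so this is about convenience rather than a guaranteed speedup.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): narrow each int to one byte.
// Assumes every value is already in [0, 255]; larger values would wrap.
std::vector<unsigned char> NarrowToBytes(const std::vector<int>& src)
{
	std::vector<unsigned char> dst(src.size());
	// std::transform hides the loop, it does not remove it.
	std::transform(src.begin(), src.end(), dst.begin(),
	               [](int v) { return static_cast<unsigned char>(v); });
	return dst;
}
```

For the 176x144 case the input vector would hold 25344 ints and the returned vector 25344 bytes.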

**ninja9578** · May 13th, 2010, 06:39 AM

If it's time critical, then you can't use a for loop. For loops take too much memory and processing power.

pseudocode

Code:

allocate your new buffer
mov ecx, 6336 (176 x 144 / 4) //you'll see why divide by four below
mov the pointer to your int array in a register
mov the pointer to your char array in a register
label:
     mov the first byte of the int array into the lolo register
     mov the first byte of the next int into the hilo register
     mov the first byte of the next next int into the lohi register
     mov the first byte of the next next next int into the hihi register
     write the entire register to the char array
     increment the int array pointer by 16
     increment the char array pointer by 4
LOOP label

In this pseudocode you are doing four items at a time. This will be many many times faster than doing it in C. It might be even faster to loop only 3168 times (176 x 144 / 8) and do 8 at a time using 2 registers.

This is for all you guys who occasionally tell me that an optimizing compiler will always beat an assembly programmer. For small pieces of code a good assembly engineer will always beat the compiler. :P Knowledge of how processor pipelines and caching work is the key.

**VladimirF** · May 13th, 2010, 07:46 PM

Originally Posted by ninja9578

...This is for all you guys who occasionally tell me that an optimizing compiler will always beat an assembly programmer. For small pieces of code a good assembly engineer will always beat the compiler. :P Knowledge of how processor pipelines and caching work is the key.

This sounds like a challenge, and I accept!

Let’s take some measurable buffer size of 1,000,000 integers and transfer them into unsigned char array.
I suggest this benchmark (timing is Windows-specific, you can substitute with your OS’s favorite).

Code:

#include "stdafx.h"

double PCFreq = 0.0; 
__int64 CounterStart = 0; 

void StartCounter() 
{ 
	LARGE_INTEGER li; 
	if(!QueryPerformanceFrequency(&li)) 
		std::cout << "QueryPerformanceFrequency failed!\n"; 

	PCFreq = double(li.QuadPart)/1000.0; 

	QueryPerformanceCounter(&li); 
	CounterStart = li.QuadPart; 
} 
double GetCounter() 
{ 
	LARGE_INTEGER li; 
	QueryPerformanceCounter(&li); 
	return double(li.QuadPart-CounterStart)/PCFreq; 
} 


const int TABLE_SIZE = 1000000;
volatile int src[TABLE_SIZE];
volatile unsigned char dst[TABLE_SIZE];

void Simple()
{
	for(int i = 0; i < TABLE_SIZE; i++)
		dst[i] = src[i];
}

void YourFunction()
{
}

int main()
{
	std::cout << "Simple loop" << std::endl;
	StartCounter(); 
	Simple();
	std::cout << GetCounter() << std::endl << std::endl;

	std::cout << "Your function here" << std::endl;
	StartCounter(); 
	YourFunction();
	std::cout << GetCounter() << std::endl << std::endl;

	return 0;
}

And here is what I have in stdafx.h:

Code:

#pragma once

#ifndef _WIN32_WINNT		// Allow use of features specific to Windows XP or later.                   
#define _WIN32_WINNT 0x0501	// Change this to the appropriate value to target other versions of Windows.
#endif						

#include <windows.h>
#include <iostream>

Just add your code to YourFunction() and run, then post your result here.
Please leave Simple() function in so that we can eliminate differences in hardware.
Everybody is welcome to participate. Since Dave stated that this issue is performance-critical, I think that doing it in this thread is appropriate.

Vlad
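
For anyone not on Windows: a minimal sketch of a portable substitute for StartCounter()/GetCounter(), assuming a C++11 compiler with std::chrono. The class name StopWatch and its members are hypothetical and not part of the benchmark above.

```cpp
#include <chrono>

// Hypothetical portable stand-in for StartCounter()/GetCounter().
class StopWatch
{
public:
	StopWatch() : start_(std::chrono::steady_clock::now()) {}

	// Restart the measurement.
	void Start() { start_ = std::chrono::steady_clock::now(); }

	// Milliseconds elapsed since construction or the last Start().
	double ElapsedMs() const
	{
		const std::chrono::duration<double, std::milli> elapsed =
			std::chrono::steady_clock::now() - start_;
		return elapsed.count();
	}

private:
	std::chrono::steady_clock::time_point start_;
};
```

steady_clock is used here because it never goes backwards, which matters more for interval timing than wall-clock accuracy.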

**CppCoder2010** · May 13th, 2010, 08:06 PM

Consider though...

**CppCoder2010** · May 13th, 2010, 08:36 PM

Time slices here are also critical in determining which function is faster. I've written a function in C++ for YourFunction() defined as follows.

Also, the frequency returned from QueryPerformanceFrequency shouldn't be divided by 1000.0 if you want seconds: that frequency IS the number of ticks in a second. Dividing it by 1000.0 gives how many "performance ticks" are in a millisecond. Unless that's what you want? I assume it is.

Code:

void YourFunction()
{
	const int nCount = TABLE_SIZE >> 2;

	int * pSource = (int *)&src[0];
	int * pDest = (int *)&dst[0];

	for (int i = 0; i < nCount; ++i)
	{
		int nTemp = pSource[3];
		nTemp <<= 8;
		nTemp |= pSource[2];
		nTemp <<= 8;
		nTemp |= pSource[1];
		nTemp <<= 8;
		nTemp |= pSource[0];

		*pDest = nTemp;

		pSource += 4;
		++pDest;
	}
}

And I get different results, sometimes YourFunction() is faster, sometimes Simple() is faster. It happens when I reverse the order of the calls. Which would tell me something.
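
On the QueryPerformanceFrequency point above, the unit arithmetic can be sketched as follows. The function name TicksToMilliseconds is hypothetical; it only assumes that the frequency is reported in ticks per second, as stated in the post.

```cpp
// Hypothetical helper: convert a tick delta to milliseconds.
// ticks / (ticks per second) = seconds; times 1000 = milliseconds.
double TicksToMilliseconds(long long ticks, long long ticksPerSecond)
{
	return static_cast<double>(ticks) * 1000.0 / static_cast<double>(ticksPerSecond);
}
```

This is equivalent to dividing the tick delta by (frequency / 1000.0), which is what GetCounter() does with PCFreq.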

**Lindley** · May 13th, 2010, 09:42 PM

I'd recommend dropping the volatile qualifiers from the data. It won't do anything except kill possible optimizations.

**VladimirF** · May 13th, 2010, 10:06 PM

Originally Posted by Lindley

I'd recommend dropping the volatile qualifiers from the data. It won't do anything except kill possible optimizations.

I was only trying to prevent "optimizing out"...
But it looks like you are correct, it works fine (loops 1,000,000 times) without it.

**Lindley** · May 13th, 2010, 10:56 PM

Hey, if the compiler can optimize out the loop and still get the data where it needs to be, then mission accomplished. Optimizing out is really only a problem in truly trivial speed tests which probably won't mean much anyway.
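
One common way to keep a timed loop from being discarded without marking the data volatile is to consume the output afterwards, for example by printing a checksum. A minimal sketch, with the helper name Checksum being an assumption for illustration:

```cpp
#include <cstddef>

// Hypothetical helper: sum the converted bytes so the result is observably used.
unsigned long long Checksum(const unsigned char* data, std::size_t count)
{
	unsigned long long sum = 0;
	for (std::size_t i = 0; i < count; ++i)
		sum += data[i];
	return sum;
}
```

Calling it on the destination buffer after stopping the timer and printing the result also doubles as a cheap check that each variant produced the same output.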

**ninja9578** · May 14th, 2010, 08:28 AM

Oh fun

I will write my function tonight or tomorrow.

**VladimirF** · May 14th, 2010, 05:08 PM

Originally Posted by CppCoder2010

...I get different results, sometimes YourFunction() is faster, sometimes Simple() is faster. It happens when I reverse the order of the calls. Which would tell me something.

I too have noticed this variation. Looks like it has something to do with the memory being “touched”.
I’ve fixed it by calling Init() first thing from the main() function:

Code:

void Init()
{
	for(int i = 0; i < TABLE_SIZE; i++)
	{
		src[i] = i & 0xFF;
		dst[i] = 0;
	}
}

I can then call both functions repeatedly in different order but still get consistent results.
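
The same idea (touch the memory first, then measure) can be folded into the timing itself. The sketch below is only an assumption about how one might do it, using std::chrono instead of the Windows counters; BestOfMs is a hypothetical name.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical harness: run `fn` once untimed to touch memory and warm caches,
// then time it `runs` times and return the best (smallest) time in milliseconds.
template <typename Fn>
double BestOfMs(Fn fn, int runs)
{
	fn(); // warm-up call, not timed
	double best = std::numeric_limits<double>::max();
	for (int r = 0; r < runs; ++r)
	{
		const auto t0 = std::chrono::steady_clock::now();
		fn();
		const auto t1 = std::chrono::steady_clock::now();
		const std::chrono::duration<double, std::milli> ms = t1 - t0;
		best = std::min(best, ms.count());
	}
	return best;
}
```

Taking the minimum of several runs tends to reduce the noise from time slicing, so the order in which the functions are called matters less.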
My first attempt at “4 elements in 1 iteration” looks almost like yours:

Code:

void FourInOne()
{
	int* p = (int*)dst;
	for(int i = 0, j = 0; i < TABLE_SIZE; i += 4)
	{
		*p++ = src[i] | src[i+1] << 8 | src[i+2] << 16 | src[i+3] << 24;
	}
}

But it only gets minimal benefit over the Simple() function – about 0.5% <edited> I meant - 5% </edited>
I am working on my “optimized” implementation, but interested to see the ASM results as well.

**Lindley** · May 14th, 2010, 05:30 PM

If we're assuming that the input integers are already in the proper range [0,255], then I doubt it'll be easy to get much faster than this....

Code:

unsigned char *dstptr = dst;
const unsigned char *srcptr = reinterpret_cast<const unsigned char*>(src);// assumes little endian; +3 if BE.
for (int i = 0; i < TABLE_SIZE; i++, ++dstptr, srcptr += 4)
{
    *dstptr = *srcptr;
}
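
The little-endian assumption in the comment above can be checked at run time. A minimal sketch, with IsLittleEndian and LowByteOffset as hypothetical helper names:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if the least significant byte of an int is stored first.
bool IsLittleEndian()
{
	const unsigned int one = 1;
	unsigned char first = 0;
	std::memcpy(&first, &one, 1);
	return first == 1;
}

// Byte offset of the low-order byte inside an int: 0 on little-endian machines,
// sizeof(int) - 1 on big-endian ones (the "+3" case when int is 4 bytes).
std::size_t LowByteOffset()
{
	return IsLittleEndian() ? 0 : sizeof(int) - 1;
}
```

Starting the source byte pointer at LowByteOffset() would make the strided copy independent of byte order.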

**VladimirF** · May 14th, 2010, 06:42 PM

Well, this all is pretty sad, actually

Below are asm listings for four functions:
1. Simple – assigning one byte at a time, in a loop.
2. Lindley’s code (see above).
3. My 4-in-1 code.
4. My super-secret SSE implementation:
- looping over 16 elements at a time;
- load two groups of 4 ints into two XMM registers;
- pack into one XMM register (16-bit values);
- repeat for the third and fourth group of 4 ints;
- pack two XMM registers with 16-bit values into one with 8-bit values.
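
The pack() helper itself is not shown in this post. The steps listed above can be sketched with SSE2 intrinsics roughly as follows; this is a reconstruction under assumptions, not the original code. The name Pack16 is hypothetical, it needs an x86/x64 target, and it uses an unaligned store where the listing uses an aligned one.

```cpp
#include <emmintrin.h> // SSE2 intrinsics (x86/x64 only)

// Hypothetical reconstruction of pack(): narrow 16 ints to 16 bytes.
// Values outside [0, 255] saturate to 0 or 255 instead of wrapping.
inline void Pack16(const int* src, unsigned char* dst)
{
	// Load four groups of 4 ints each (unaligned loads).
	const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
	const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 4));
	const __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 8));
	const __m128i d = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 12));

	// 32-bit -> 16-bit with signed saturation (packssdw).
	const __m128i ab = _mm_packs_epi32(a, b);
	const __m128i cd = _mm_packs_epi32(c, d);

	// 16-bit -> 8-bit with unsigned saturation (packuswb).
	const __m128i bytes = _mm_packus_epi16(ab, cd);

	// Store all 16 result bytes (unaligned store, safe for any dst).
	_mm_storeu_si128(reinterpret_cast<__m128i*>(dst), bytes);
}
```

Element order is preserved, so calling Pack16(&src[i], &dst[i]) for i stepping by 16 matches the loop shape in the SSE() listing.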

Code:

	Simple();
00401436  xor         eax,eax 
00401438  jmp         main+3B0h (401440h) 
0040143A  lea         ebx,[ebx] 
00401440  mov         dl,byte ptr src (6362480h)[eax*4] 
00401447  mov         byte ptr dst (404380h)[eax],dl 
0040144D  add         eax,1 
00401450  cmp         eax,5F5E100h 
00401455  jl          main+3B0h (401440h)

Code:

	Lindley();
00401555  mov         ecx,offset src (6362480h) 
0040155A  mov         eax,offset dst (404380h) 
0040155F  mov         esi,5F5E100h 
00401564  mov         dl,byte ptr [ecx] 
00401566  mov         byte ptr [eax],dl 
00401568  add         eax,1 
0040156B  add         ecx,4 
0040156E  sub         esi,1 
00401571  jne         00401564

Code:

void FourInOne()
{
	int* p = (int*)dst;
	for(int i = 0, j = 0; i < TABLE_SIZE; i += 4)
00401050  xor         eax,eax 
	{
		*p++ = src[i] | src[i+1] << 8 | src[i+2] << 16 | src[i+3] << 24;
00401052  mov         ecx,dword ptr src+0Ch (636248Ch)[eax*4] 
00401059  shl         ecx,8 
0040105C  or          ecx,dword ptr src+8 (6362488h)[eax*4] 
00401063  add         eax,4 
00401066  shl         ecx,8 
00401069  or          ecx,dword ptr [eax*4+6362474h] 
00401070  shl         ecx,8 
00401073  or          ecx,dword ptr [eax*4+6362470h] 
0040107A  cmp         eax,5F5E100h 
0040107F  mov         dword ptr ___@@_PchSym_@00@UxlwvUgvhgDCglIUgvhgDCglIUivovzhvUhgwzucOlyq@+4 (40437Ch)[eax],ecx 
00401085  jl          FourInOne+2 (401052h) 
	}
}
00401087  ret

Code:

void SSE()
{
	for(int i = 0, j = 0; i < TABLE_SIZE; i += 16)
00401000  xor         ecx,ecx 
00401002  mov         eax,offset src+20h (63624A0h) 
00401007  jmp         SSE+10h (401010h) 
00401009  lea         esp,[esp] 
	{
		pack(&src[i], &dst[i]);
00401010  movdqu      xmm1,xmmword ptr [eax-10h] 
00401015  movdqu      xmm0,xmmword ptr [eax-20h] 
0040101A  movdqu      xmm2,xmmword ptr [eax+10h] 
0040101F  packssdw    xmm0,xmm1 
00401023  movdqu      xmm1,xmmword ptr [eax] 
00401027  packssdw    xmm1,xmm2 
0040102B  packuswb    xmm0,xmm1 
0040102F  movdqa      xmmword ptr dst (404380h)[ecx],xmm0 
00401037  add         eax,40h 
0040103A  add         ecx,10h 
0040103D  cmp         eax,offset ___onexitbegin (1E0DA8A0h) 
00401042  jl          SSE+10h (401010h) 
	}
}
00401044  ret

And here are the <sad> results:

Code:

Simple loop 88.9159

Lindley     91.1238

4-in-1 loop 85.364

Vlad's SSE  81.8267

Your mileage may vary, but the ratio should be the same.
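
To make "the ratio" concrete, the speedup relative to Simple() is just baseline time divided by candidate time. A small sketch, where Speedup is a hypothetical helper name:

```cpp
// Hypothetical helper: how many times faster the candidate is than the baseline.
// Values above 1 mean the candidate is faster, below 1 mean it is slower.
double Speedup(double baselineMs, double candidateMs)
{
	return baselineMs / candidateMs;
}
```

Plugging in the numbers above gives about 1.09 for the SSE version and about 1.04 for the 4-in-1 loop, while Lindley's loop comes out slightly below 1 (about 0.98).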

I *REALLY* had bigger hopes for SSE… I guess if there was an instruction to pack 32-bit values directly into 8-bit (bypassing 16-bit), we would get a little better results. Or did I miss such an instruction? Any SSE experts here?

<edited>
@ninja9578 - Looking at generated asm, I doubt that you will be able to shave anything off. But – good luck!

**Chris_F** · May 14th, 2010, 06:53 PM

Originally Posted by VladimirF

I *REALLY* had bigger hopes for SSE… I guess if there was an instruction to pack 32-bit values directly into 8-bit (bypassing 16-bit), we would get a little better results. Or did I miss such an instruction? Any SSE experts here?

I would guess that no matter how you try to do it, you are going to be limited by memory performance. Also, if the memory isn't aligned properly, SSE is going to run dog slow.

**VladimirF** · May 14th, 2010, 07:06 PM

Originally Posted by Chris_F

I would guess that no matter how you try to do it, you are going to be limited by memory performance. Also, if the memory isn't aligned properly, SSE is going to run dog slow.

You might be right. Reading 400,000,000 bytes and writing 100,000,000 bytes must take some time.
And I think my arrays are aligned OK; the addresses end with 80h – what else could you wish for?
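
An address ending in 80h is a multiple of 128, so it is also 16-byte aligned, which is what aligned SSE loads and stores need. A minimal sketch of checking this in code rather than by eye; IsAligned is a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is aligned to `alignment` bytes.
// `alignment` must be a power of two (16 for SSE aligned loads and stores).
inline bool IsAligned(const void* p, std::size_t alignment)
{
	return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For example, IsAligned(src, 16) and IsAligned(dst, 16) could be asserted once before the timed loop.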
