Data Type Conversion

**ninja9578** · May 14th, 2010, 10:10 PM

Does anyone see the bus error? I hate writing 32-bit assembly on a 64-bit machine:

Code:

inline void Assembly(){
    __asm__ __volatile__(
	    "  movl $1000000, %%ecx		  ;\n"	   //put the size of the table in here, don't reference it
	    "  myloop:				  ;\n"	   //beginning of my loop
	    "	  movb 12(%0), %%ah	  ;\n"	   //char 4
	    "	  movb 8(%0), %%al	  ;\n"	   //char 3
	    "	  shl $16, %%eax	  ;\n"	   //can't access high bits directly, so shift these there for now
	    "	  movb 4(%0), %%ah	  ;\n"	   //char 2
	    "	  movb (%0), %%al	  ;\n"	   //char 1
	    "	  movl %%eax, (%1)	  ;\n"	   //push it out to the destination
	    "	  add $4, %1		  ;\n"	   //move the dst ptr by 4 because we did 4 at a time
	    "	  add $16, %0		  ;\n"	   //move the src ptr by 16
	    "  loop myloop			  ;\n"	   //loop until ecx is zero
	    :							   //No output
	    :  "r" (src),					   //Let GCC decide what registers to assign these to
		  "r" (dst)					   //Let GCC decide what registers to assign these to
	    :  "eax", "ecx"					   //these two get explicitly clobbered
	    );
}

It runs fine for small arrays, but once I try doing one over 1000, it starts throwing bus errors.

**Chris_F** · May 14th, 2010, 10:40 PM

Oh god, I hate GCC's representation of inline asm, and AT&T syntax in general. I'm not even sure what that does.

**ninja9578** · May 15th, 2010, 12:03 AM

I kind of do too. It would be much nicer if they used Intel's, but oh well. A few years ago I was doing PPC assembly using AT&T syntax. That was a nightmare. It's not as bad as it looks; the forum software did some weird things with my tabs.

**Joeman** · May 15th, 2010, 01:36 AM

Originally Posted by ninja9578

Does anyone see the bus error? I hate writing 32-bit assembly on a 64-bit machine:

I thought it worked for me, but I have little experience with this matter. I updated your code a tad bit.

Code:

const int TABLE_SIZE_DIV_4 = TABLE_SIZE / 4;

inline void Assembly()
{
    __asm__ __volatile__
    (
        "  movl %2, %%ecx          \n"       //load the iteration count (table size / 4)
        "  myloop:                  \n"       //beginning of my loop
        "      movb 12(%0), %%ah      \n"       //char 4
        "      movb  8(%0), %%al      \n"       //char 3
        "      shl     $16, %%eax  \n"       //can't access high bits directly, so shift these there for now
        "      movb  4(%0), %%ah      \n"       //char 2
        "      movb  0(%0), %%al      \n"       //char 1
        "      movl  %%eax, (%1)      \n"       //push it out to the destination
        "      add      $4, %1      \n"       //move the dst ptr by 4 because we did 4 at a time
        "      add     $16, %0      \n"       //move the src ptr by 16
        " loop myloop "
        :                                   //No output
        :  "r" (src),                       //Let GCC decide what registers to assign these to
           "r" (dst),                       //Let GCC decide what registers to assign these to
           "r" (TABLE_SIZE_DIV_4)           //Let GCC decide what registers to assign these to
        :  "eax", "ecx"                       //these two get explicitly clobbered
    );
}

and converted to msvc

Code:

const int TABLE_SIZE_DIV_4 = TABLE_SIZE / 4;

inline void Assembly()
{
    __asm
    {
        push edx
        push ebx
        push ecx

        mov edx, offset src
        mov ebx, offset dst

        mov ecx, TABLE_SIZE_DIV_4
        myloop:
            mov ah, [edx] + 12
            mov al, [edx] + 8
            shl eax, 16
            mov ah, [edx] + 4
            mov al, [edx] + 0
            mov [ebx], eax
            add ebx, 4
            add edx, 16
        loop myloop

        pop ecx                 ; restore in reverse order of the pushes
        pop ebx
        pop edx
    }
}

I make no promises this is 100% right. You need to test these for yourself.

**Chris_F** · May 15th, 2010, 02:37 AM

dunno how the performance compares, but I believe it to work.

Code:

inline void int_to_char(int *pInts, char *pChars, int arrSize)
{
	_asm {

		mov esi, pInts
		mov edi, pChars
		mov ebx, arrSize
		xor ecx, ecx
	myloop:
		mov eax, [esi]
		mov byte ptr [edi], al
		add esi, 4
		inc edi
		inc ecx
		cmp ecx, ebx
		jne myloop
	}
}
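For anyone who does not read MSVC inline asm, here is a rough portable C++ sketch of what the routine above appears to do: keep the low byte of each int, which is what the store of `al` writes. The name `int_to_char_portable` and the `std::size_t` count are made up for this illustration and are not from the post.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portable equivalent of int_to_char: copy the low 8 bits of
// each int into the matching char slot.
void int_to_char_portable(const int *pInts, char *pChars, std::size_t arrSize)
{
    assert(pInts != nullptr && pChars != nullptr);
    for (std::size_t i = 0; i < arrSize; ++i)
        pChars[i] = static_cast<char>(pInts[i] & 0xFF);
}
```

Unlike the asm loop, this version does nothing when `arrSize` is 0.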

**ninja9578** · May 15th, 2010, 07:55 AM

The division by four, that's what was causing my bus error. Bloody hell, I can't believe that I missed that.
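The arithmetic behind that, as a small sketch: every pass of the loop consumes 4 ints, so the counter has to be the table size divided by 4. The helper names here are invented for illustration, and the sketch assumes the table size is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t TABLE_SIZE = 1000000;     // ints in src, as in the thread
const std::size_t INTS_PER_ITERATION = 4;   // the loop packs 4 ints per pass

// Bytes of src the loop walks over for a given iteration count:
// each pass advances the source pointer by 4 ints.
std::size_t src_bytes_touched(std::size_t iterations)
{
    return iterations * INTS_PER_ITERATION * sizeof(int);
}

// Iteration count that stays inside a table of `ints` elements.
std::size_t safe_iterations(std::size_t ints)
{
    assert(ints % INTS_PER_ITERATION == 0);  // this sketch ignores a ragged tail
    return ints / INTS_PER_ITERATION;
}
```

With the original count of 1,000,000 the loop walks four times the size of src. Small tables probably got away with it because the overrun stayed inside mapped memory.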

Well, I finished my code:

Code:

#include <iostream>
#include <ctime>

const int TABLE_SIZE = 1000000;
const unsigned int LOOPS = 0xFF;
volatile int src[TABLE_SIZE];
volatile unsigned char dst[TABLE_SIZE];

void Simple(){
    for(int i = 0; i < TABLE_SIZE; i++)
	   dst[i] = src[i];
}

inline void Assembly(){
    const volatile int *s = src;	   //local copies, because the asm advances both pointers
    volatile unsigned char *d = dst;
    __asm__ __volatile__(
	    "  movl $250000, %%ecx		  ;\n"	   //number of iterations (table size / 4), don't reference it
	    "  myloop:				  ;\n"	   //beginning of my loop
	    "	  movb 12(%0), %%ah	  ;\n"	   //char 4
	    "	  movb 8(%0), %%al	  ;\n"	   //char 3
	    "	  shl $16, %%eax	  ;\n"	   //can't access high bits directly, so shift these there for now
	    "	  movb 4(%0), %%ah	  ;\n"	   //char 2
	    "	  movb (%0), %%al	  ;\n"	   //char 1
	    "	  movl %%eax, (%1)	  ;\n"	   //push it out to the destination
	    "	  add $4, %1		  ;\n"	   //move the dst ptr by 4 because we did 4 at a time
	    "	  add $16, %0		  ;\n"	   //move the src ptr by 16
	    "  loop myloop			  ;\n"	   //loop until ecx is zero
	    :  "+r" (s),					   //read-write, the loop changes both pointers
		  "+r" (d)					   //Let GCC decide what registers to assign these to
	    :							   //No read-only inputs
	    :  "eax", "ecx", "memory", "cc"			   //clobbered registers, plus memory and flags
	    );
}


int main (int argc, char * const argv[]) {
    
    clock_t start = clock();
    for (unsigned int i = 0; i < LOOPS; ++i)
	   Simple();
    std::cout << clock() - start << std::endl;
	
    start = clock();
    for (unsigned int i = 0; i < LOOPS; ++i)
	   Assembly();
    std::cout << clock() - start << std::endl;
    
    return 0;
}

And sorry, guys betting on the optimizer:

Code:

Ninjas-MacBook-Pro:Release ninja9578$ ./AssemblyChallenge
1059456
366081

Yes, I used maximum optimizations, not the default release build in Xcode, and I ran it in the console, not the dev environment. Looks like I beat the compiler.

I know some of you guys wrote some more advanced routines, but you all said that they run either on par with or slightly faster than the simple one. No one posted that theirs ran 3x faster, so I didn't bother benchmarking them.

@Chris_F: Your code looks good. But I'm concerned about the registers that you use. I've never done inline with VC++, is the assembler smart enough to realize that you clobbered those registers? Because you didn't push their state. Also it won't run as fast as mine for two reasons:

1) You are only doing one integer at a time, whereas I'm doing 4. Registers are 32-bit, so use the whole thing; registers are far faster than RAM.
2) You and the compiler both increment a register, do a compare, then a jump. The processor has a built-in instruction that does all of that in one go: loop.
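The "4 at a time" idea from point 1 can also be written in plain C++. This is only a sketch, not the code that was benchmarked: the name `narrow4` is invented, it assumes a little-endian target such as x86, and it assumes the count is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Narrow `count` ints to bytes, four per iteration: pack the four low bytes
// into one 32-bit word, then write that word with a single 4-byte copy.
void narrow4(const int *src, unsigned char *dst, std::size_t count)
{
    assert(count % 4 == 0);  // no tail handling in this sketch
    for (std::size_t i = 0; i < count; i += 4) {
        std::uint32_t packed =
              (static_cast<std::uint32_t>(src[i])     & 0xFFu)
            | (static_cast<std::uint32_t>(src[i + 1]) & 0xFFu) << 8
            | (static_cast<std::uint32_t>(src[i + 2]) & 0xFFu) << 16
            | (static_cast<std::uint32_t>(src[i + 3]) & 0xFFu) << 24;
        std::memcpy(dst + i, &packed, sizeof packed);  // little-endian layout
    }
}
```

Whether this beats the simple loop depends on the compiler and the machine, so it needs to be measured like everything else in this thread.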

Another thing: my code above uses volatile so the compiler leaves the asm statement where it is. GCC never rewrites the instructions inside the template, but without volatile it is allowed to drop an asm statement whose outputs go unused, or move it out of a loop, which changes what you are timing. So when writing assembly it's worth benchmarking with and without the volatile keyword. Sometimes it comes out faster, sometimes slower, sometimes nothing changes.

**Russco** · May 15th, 2010, 10:03 AM

What happens when you swap the calls to simple and assembly? Do you notice the assembly getting slower and the simple getting faster?

**ninja9578** · May 15th, 2010, 10:38 AM

Uh oh. ***? I hate when weird things like that happen

Someone want to run the thing on Windows and use that magical process query function?

**Russco** · May 15th, 2010, 11:01 AM

It's because of the cache. The first function is paying to load the cache; the second is using the data already loaded. That makes your asm look much faster than the C, but much of that cost is cache loading.

**VladimirF** · May 15th, 2010, 11:18 AM

Originally Posted by Russco

It's because of the cache. The first function is paying to load the cache; the second is using the data already loaded. That makes your asm look much faster than the C, but much of that cost is cache loading.

Hmmm... Do you have 400MB cache on your processor?

**Chris_F** · May 15th, 2010, 11:23 AM

Originally Posted by VladimirF

Hmmm... Do you have 400MB cache on your processor?

Itanium 3???

**VladimirF** · May 15th, 2010, 11:56 AM

Originally Posted by Chris_F

Itanium 3???

Is it this one? Tukwila (processor)
Then it tops out at a “puny” 24 MiB, nowhere near 400 MB.

**Russco** · May 15th, 2010, 12:41 PM

1 mill ints = 4 mill bytes = 4 MB (well, a touch less).

My CPU has 4 MB of L2 cache. Isn't L2 used for data? I was under the impression L1 was for code, and L2/L3 were data caches.

Why would you need 420 MB to store 1 mill ints?

**Chris_F** · May 15th, 2010, 12:50 PM

Originally Posted by Russco

1 mill ints = 4 mill bytes = 4 MB (well, a touch less).

My CPU has 4 MB of L2 cache. Isn't L2 used for data? I was under the impression L1 was for code, and L2/L3 were data caches.

Why would you need 420 MB to store 1 mill ints?

L1 is Harvard model, which means data and code are separate. L2 is not, it's both. L3 is just a slower and larger L2.

**VladimirF** · May 15th, 2010, 01:01 PM

Originally Posted by Russco

1 mill ints = 4 mill bytes = 4 MB (well, a touch less).

My CPU has 4 MB of L2 cache. Isn't L2 used for data? I was under the impression L1 was for code, and L2/L3 were data caches.

Why would you need 420 MB to store 1 mill ints?

Sorry, this thread became too long. I thought I'd mentioned that I bumped the array size to 100,000,000 to reduce fluctuation in the results (at the time of post #13). Looks like I didn’t say it here.
Anyway, even with 100,000,000 ints the first pass through takes almost 3 times longer. I don’t know why; I think I read something about “hot” vs. “cold” memory. Are there electrical engineers here who can confirm / deny that?
Regardless, in the current code I run through both arrays before each measurement, so that difference is eliminated: each function runs on “hot” memory.
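A minimal sketch of that "run through both arrays before each measurement" idea, so neither function pays the first-touch cost (likely page faults plus cache misses on the first pass). The helpers `warm_up`, `timed` and `simple` are invented for this illustration and use `std::vector` rather than the global arrays from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

// Touch every element of both buffers once before any timing starts.
unsigned long long warm_up(const std::vector<int> &src,
                           std::vector<unsigned char> &dst)
{
    assert(src.size() == dst.size());
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        sum += static_cast<unsigned int>(src[i]);
        dst[i] = 0;
    }
    return sum;
}

// Warm the buffers, then time `loops` calls of `fn` with clock().
template <typename Fn>
std::clock_t timed(Fn fn, const std::vector<int> &src,
                   std::vector<unsigned char> &dst, unsigned loops)
{
    // Storing into a volatile keeps the warm-up reads from being discarded.
    volatile unsigned long long sink = warm_up(src, dst);
    (void)sink;
    std::clock_t start = std::clock();
    for (unsigned i = 0; i < loops; ++i)
        fn(src, dst);
    return std::clock() - start;
}

// The straightforward conversion from the thread, on vectors.
void simple(const std::vector<int> &src, std::vector<unsigned char> &dst)
{
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = static_cast<unsigned char>(src[i]);
}
```

Calling `timed(simple, src, dst, loops)` and then the same for the other routine gives both the same starting conditions, whichever one runs first.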
