Integer Aligned memcpy, and 64-bit integer speeds

**mkukli** · October 19th, 2009, 10:55 AM

Two questions in this one:

Is memcpy on some platform libraries integer-aligned, or is it always character aligned? I have to do a large batch copy of data that is integer aligned, meaning I know it is either aligned on 4 or 8 byte boundaries, and copying integer to integer SHOULD be faster than copying char to char.

As well, on 64-bit systems, is it 'faster' to use the native type (64-bit integer) for parameters than 32-bit integers, or would it not matter?

The second question is slightly related to the first, as the data being copied is aligned TECHNICALLY on 8-byte boundaries, so it would only be 2 64-bit integers to copy, or 4 32-bit, or 16 bytes.

This is for copying matrices, btw... if you know of a faster way to copy 4x4 matrices, I would love to know of it... I need to do batch copies from one system to another as fast as possible to free up semaphores for multiprocessing, and the faster this part is the better.

Thank you!

**mpauna** · October 20th, 2009, 12:28 PM

memcpy() must support copying from any source location to any destination location (with the exception that the behaviour is undefined if the source and destination overlap). This means that memcpy() has to accept character-aligned pointers, but it can be optimized to perform 64-bit copies when alignment permits. Indeed, many versions of memcpy() are highly optimized, and some are written in assembly language to take advantage of everything that the machine has to offer. I have seen implementations use DMA, register block save/restores (the PowerPC "lmw r4,0(r3)" instruction loads 28 32-bit words at once), special caching instructions, and even OS-supported shared page mappings to improve performance.

A "typical" memcpy() routine copying 0x1000 bytes from address 0x3 to address 0x10000003 might perform the following:
Copy the 8-bit byte from 0x3 to 0x10000003
Copy the 8-bit byte from 0x4 to 0x10000004
Copy the 8-bit byte from 0x5 to 0x10000005
Copy the 8-bit byte from 0x6 to 0x10000006
Copy the 8-bit byte from 0x7 to 0x10000007
At this point, both source and dest are aligned, so ...
... Enter a highly optimized loop copying 64-bit values from 0x8-0xFFF to 0x10000008-0x10000FFF
Copy the 8-bit value from 0x1000 to 0x10001000
Copy the 8-bit value from 0x1001 to 0x10001001
Copy the 8-bit value from 0x1002 to 0x10001002

Many implementations include special code for handling unaligned data at the start and end of the copy, with a highly optimized loop for handling aligned data in the middle of the copy. Some implementations even optimize the special case where the source and destination sit at different offsets within a word and so can never both be aligned at once (for instance, copying from 0x1 to 0x10000003) by reading in multiple aligned values from the source and then shifting them so that they are aligned for the destination.
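The byte/word/byte pattern described above can be sketched in portable C. This is only an illustration of the idea, not the code of any particular library: `sketch_memcpy` is a made-up name, the 8-byte word size is an assumption, and real implementations are usually hand-tuned assembly that does a good deal more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* sketch_memcpy is a hypothetical name for illustration, not a library function. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const uintptr_t mask = sizeof(uint64_t) - 1;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Word copies only help when both pointers have the same offset
       within an 8-byte word, so that they become aligned together. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
        /* Head: copy single bytes until the pointers are 8-byte aligned. */
        while (n > 0 && ((uintptr_t)s & mask) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: move one 64-bit word per iteration. The fixed-size
           memcpy calls avoid aliasing problems; compilers typically
           turn each one into a single load or store. */
        while (n >= sizeof(uint64_t)) {
            uint64_t word;
            memcpy(&word, s, sizeof word);
            memcpy(d, &word, sizeof word);
            s += sizeof word;
            d += sizeof word;
            n -= sizeof word;
        }
    }

    /* Tail: leftover bytes, or the whole copy if the offsets differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Note that this sketch simply falls back to a byte loop when the offsets differ; the shift-and-merge optimization mentioned above is not shown.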

The net result is that memcpy() is almost always the best way to copy large blocks of memory around, and will usually be as quick as or quicker than anything that can be written in a higher-level language, including C. However, your mileage may vary depending upon your implementation. If profiling indicates that there is an issue, then you might check your library for other options (such as an OS-supported page-to-page copy, or possibly an aligned_memcpy() which can avoid the alignment checks if you are calling memcpy() extremely frequently on aligned data).
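For the 4x4 matrix case, a minimal sketch might look like the following. The `mat4` type, its `float` element type, and both function names are assumptions made for illustration, since the original question does not say how the matrices are declared. The idea is to let the compiler and library see the whole copy at once: struct assignment for a single matrix, and one memcpy() call for a contiguous batch rather than one call per matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical matrix type; the element type is an assumption. */
typedef struct {
    float m[4][4];
} mat4;

/* Copy a single matrix. The size is known at compile time, so
   compilers typically expand this into a few wide moves. */
static inline void mat4_copy(mat4 *dst, const mat4 *src)
{
    *dst = *src;
}

/* Copy a contiguous, non-overlapping batch with one memcpy() call,
   so any per-call setup cost is paid once for the whole batch. */
static inline void mat4_copy_batch(mat4 *dst, const mat4 *src, size_t count)
{
    assert(count <= SIZE_MAX / sizeof *dst); /* guard the size multiply */
    memcpy(dst, src, count * sizeof *dst);
}
```

Whether this beats a hand-written loop on a given platform is something only profiling can tell you.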

Mark Pauna
