Multithreading on multi-CPU slower than on multi-core

**boettcher** · April 10th, 2010, 03:30 PM

I have written a program that uses multithreading for image processing. The main thread reads the image and stores it in memory. Worker threads are then created with AfxBeginThread, each receiving a pointer to the image in memory. The number of worker threads equals the number of available cores. Each thread processes only 1/(number of cores) of the image through that pointer. At the end, the main thread displays the processed image. This works fine on single-CPU multi-core systems, with a good speedup that scales with the number of cores.
For a few days now I have had access to a 32-core, 8-CPU Windows Server 2008 machine. However, when I run my program on that system (with 32 threads), it is slower than on my 4-core PC. After searching the web, I suspect it has something to do with how threads share memory on multi-CPU systems compared to multi-core systems. Unfortunately, I could not find a source explaining how to solve my problem. I thought of using MPI but wasn't able to set it up properly. My current plan is to create the threads, assign them to different cores (SetThreadAffinityMask), and pass them the pointer to the image in memory. Then each thread makes its own copy of the image, in the hope that this forces each thread to use the memory attached to the CPU it is running on, improving memory access speed overall.
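For readers following along, the band-per-thread scheme described above can be sketched roughly as follows. This is only an illustration, not the poster's code: it uses portable std::thread instead of AfxBeginThread, assumes an 8-bit greyscale image stored row by row, and the names RowBand, splitRows, invertBand and processImage as well as the pixel inversion are hypothetical placeholders for the real processing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Half-open row range [begin, end) handled by one worker thread.
struct RowBand {
    std::size_t begin;
    std::size_t end;
};

// Split `rows` into `workers` contiguous bands whose sizes differ by at most one row.
std::vector<RowBand> splitRows(std::size_t rows, std::size_t workers) {
    assert(workers > 0);
    std::vector<RowBand> bands;
    bands.reserve(workers);
    const std::size_t base = rows / workers;
    const std::size_t extra = rows % workers;  // the first `extra` bands get one more row
    std::size_t start = 0;
    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        bands.push_back({start, start + len});
        start += len;
    }
    return bands;
}

// Hypothetical per-pixel operation (assumption): invert an 8-bit grey value.
void invertBand(std::uint8_t* image, std::size_t width, RowBand band) {
    for (std::size_t i = band.begin * width; i < band.end * width; ++i)
        image[i] = static_cast<std::uint8_t>(255 - image[i]);
}

// Process the shared image in place, one thread per band, all through the same pointer.
void processImage(std::vector<std::uint8_t>& image, std::size_t width,
                  std::size_t height, std::size_t workers) {
    assert(image.size() == width * height);
    std::vector<std::thread> pool;
    for (const RowBand& band : splitRows(height, workers))
        pool.emplace_back(invertBand, image.data(), width, band);
    for (std::thread& t : pool)
        t.join();  // the main thread waits before displaying the result
}
```

Calling processImage(image, width, height, std::thread::hardware_concurrency()) would mirror the "one thread per core" setup, assuming hardware_concurrency() returns a non-zero value on the target machine.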

Will that work, or do I need a different approach?

Regards, Peter.

**sunnypalsingh** · April 11th, 2010, 05:30 AM

I believe you would also need to play with SetProcessAffinityMask.
From MSDN:

A thread affinity mask is a bit vector in which each bit represents a logical processor that a thread is allowed to run on. A thread affinity mask must be a subset of the process affinity mask for the containing process of a thread. A thread can only run on the processors its process can run on. Therefore, the thread affinity mask cannot specify a 1 bit for a processor when the process affinity mask specifies a 0 bit for that processor.
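The subset rule in the quote above comes down to plain bit arithmetic. Here is a small sketch of it; AffinityMask is a stand-in I chose for the mask type that SetThreadAffinityMask takes, and isSubsetOfProcessMask and singleProcessorMask are hypothetical helper names, not Windows API functions.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in (assumption) for a 64-bit affinity mask: bit i represents logical processor i.
using AffinityMask = std::uint64_t;

// True if the thread mask sets no bit that the process mask leaves at 0.
bool isSubsetOfProcessMask(AffinityMask threadMask, AffinityMask processMask) {
    return (threadMask & ~processMask) == 0;
}

// Mask that selects exactly one logical processor.
AffinityMask singleProcessorMask(unsigned index) {
    assert(index < 64);
    return AffinityMask{1} << index;
}
```

With a process mask of 0x0F (processors 0 to 3), singleProcessorMask(2) passes the check while singleProcessorMask(4) does not, so a thread of that process could not be pinned to processor 4.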

Caveat:

Setting an affinity mask for a process or thread can result in threads receiving less processor time, as the system is restricted from running the threads on certain processors. In most cases, it is better to let the system select an available processor.

Also, you can take a look at SetThreadIdealProcessor, which sets a preferred processor for a thread. The system schedules threads on their preferred processors whenever possible.

**boettcher** · April 16th, 2010, 06:01 AM

Unfortunately, the approach I intended to use to solve my problem did not speed up my application. I think it is all about multiple threads accessing memory through the same pointer in shared memory.
This is the current situation:
My application reads two images (imageA, imageB) into memory and creates a pointer for each: pA, pB. Then multiple threads (as many as there are cores available on the system) are created. Each thread processes a different portion of imageA via pA. First it reads the first pixel of its subimage, then it reads pixels of imageB using pB. Some calculations are performed and the result is stored back through pA. This is repeated until the complete subimage assigned to the thread has been recalculated.
This means that all cores access the same shared memory through two pointers that were assigned by the main thread. This way I get a good speedup when running on multi-core single-CPU systems. But when I switch to a multi-core multi-CPU system, the calculation is slower (32 cores) than on my 4-core, 1-CPU system.

My latest version creates the threads, and each thread makes its own copy of its subimage of imageA and a copy of imageB. This way each thread has its own images in memory that only this thread points to. After all threads have finished processing their subimages, the main thread puts those subimages back together. I hoped this would solve any memory access problem, but unfortunately the application is still much slower on a 32-core multi-CPU system than on a 4-core single-CPU system.
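To make the "private copy" variant concrete, here is a rough sketch of what is described above, again with std::thread rather than AfxBeginThread. It is an illustration under assumptions, not the actual program: Pixels, processBandOnCopies and processWithPrivateCopies are hypothetical names, both images are assumed to be 8-bit greyscale of the same size, and averaging the two images merely stands in for the real calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

using Pixels = std::vector<std::uint8_t>;

// Worker: copy its rows of imageA and all of imageB into buffers that this
// thread allocates itself, then work only on those private copies.
void processBandOnCopies(const Pixels& imageA, const Pixels& imageB,
                         std::size_t width, std::size_t rowBegin,
                         std::size_t rowEnd, Pixels& out) {
    Pixels localA(imageA.data() + rowBegin * width, imageA.data() + rowEnd * width);
    Pixels localB(imageB);
    for (std::size_t i = 0; i < localA.size(); ++i) {
        // Hypothetical calculation (assumption): average the two pixels.
        const unsigned sum = localA[i] + localB[rowBegin * width + i];
        localA[i] = static_cast<std::uint8_t>(sum / 2);
    }
    out = std::move(localA);
}

// Main thread: start one worker per band, then stitch the subimages back together.
Pixels processWithPrivateCopies(const Pixels& imageA, const Pixels& imageB,
                                std::size_t width, std::size_t height,
                                std::size_t workers) {
    assert(workers > 0);
    assert(imageA.size() == width * height && imageB.size() == imageA.size());
    std::vector<Pixels> parts(workers);  // one result slot per worker
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t rowBegin = height * w / workers;
        const std::size_t rowEnd = height * (w + 1) / workers;
        pool.emplace_back(processBandOnCopies, std::cref(imageA), std::cref(imageB),
                          width, rowBegin, rowEnd, std::ref(parts[w]));
    }
    for (std::thread& t : pool)
        t.join();
    Pixels result;
    result.reserve(imageA.size());
    for (const Pixels& part : parts)
        result.insert(result.end(), part.begin(), part.end());
    return result;
}
```

Note that in this sketch every worker still reads the original imageA and imageB once while copying, so the shared buffers are not avoided entirely, only the repeated accesses during the calculation.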

Peter.
