I have written a program that uses multithreading for image processing. The main thread reads the image and stores it in memory. It then creates worker threads with AfxBeginThread, passing each one a pointer to the image in memory. The number of worker threads equals the number of available cores. Each thread processes only 1/(number of cores) of the image, accessing it through that pointer. At the end, the main thread displays the processed image. This works fine on single-CPU multi-core systems, where the speedup scales well with the number of cores.
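To make the setup concrete, here is a simplified sketch of how the work is divided. It is not my actual code: std::thread stands in for AfxBeginThread so the example is portable, the image is assumed to be 8-bit grayscale stored row by row, and invertPixel, processRows and processImage are placeholder names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder per-pixel operation; stands in for the real processing step.
inline std::uint8_t invertPixel(std::uint8_t v) {
    return static_cast<std::uint8_t>(255 - v);
}

// Process rows [rowBegin, rowEnd) of the shared image in place.
void processRows(std::uint8_t* image, std::size_t width,
                 std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t y = rowBegin; y < rowEnd; ++y)
        for (std::size_t x = 0; x < width; ++x)
            image[y * width + x] = invertPixel(image[y * width + x]);
}

// Split the image into one horizontal strip per worker. All workers receive
// the same pointer and touch only their own strip.
void processImage(std::vector<std::uint8_t>& image, std::size_t width,
                  std::size_t height, unsigned workerCount) {
    assert(image.size() == width * height);
    if (workerCount == 0) workerCount = 1;

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        // Integer arithmetic makes the strips cover every row exactly once,
        // even when height is not divisible by workerCount.
        std::size_t rowBegin = height * i / workerCount;
        std::size_t rowEnd = height * (i + 1) / workerCount;
        workers.emplace_back(processRows, image.data(), width, rowBegin, rowEnd);
    }
    for (auto& t : workers) t.join();  // main thread waits, then displays
}
```

In the real program workerCount is the number of cores, so every thread reads and writes the single buffer that the main thread allocated.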
For a few days now I have had access to an 8-CPU, 32-core Windows Server 2008 machine. However, when I run my program on that system (with 32 threads), it is slower than on my 4-core PC. After searching the web, I suspect this has to do with how threads access shared memory on multi-CPU systems compared to single-CPU multi-core systems. Unfortunately, I was not able to find a source explaining how to solve my problem. I thought of using MPI but wasn't able to set it up properly. My current plan is to create the threads, assign them to different cores with SetThreadAffinityMask, and pass them the pointer to the image in memory. Then each thread makes its own copy of the image. My hope is that this forces each thread to use the memory attached to the CPU it is running on, improving overall memory access speed.
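A rough sketch of that plan, under several assumptions: std::thread again stands in for AfxBeginThread, each worker pins itself rather than being pinned by the main thread, and each worker copies only its own strip instead of the whole image. The helper names (pinCurrentThreadToCore, processStripLocally, processImagePinned) are made up for illustration. Whether the private copy really ends up in memory attached to the thread's CPU depends on how the OS places pages, which this sketch does not control.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Pin the calling thread to one logical core. Returns false if pinning fails
// or if this is not a Windows build (where the call is skipped).
bool pinCurrentThreadToCore(unsigned core) {
#ifdef _WIN32
    if (core >= sizeof(DWORD_PTR) * 8) return false;  // mask is too narrow
    DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core;
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)core;
    return false;
#endif
}

// Worker: pin first, then copy the strip into a buffer that this thread
// allocates and fills itself, process the copy, and write the result back.
void processStripLocally(std::uint8_t* shared, std::size_t width,
                         std::size_t rowBegin, std::size_t rowEnd,
                         unsigned core) {
    pinCurrentThreadToCore(core);
    std::uint8_t* strip = shared + rowBegin * width;
    std::size_t count = (rowEnd - rowBegin) * width;

    std::vector<std::uint8_t> local(strip, strip + count);  // private copy
    for (auto& v : local)
        v = static_cast<std::uint8_t>(255 - v);  // placeholder processing
    std::copy(local.begin(), local.end(), strip);
}

void processImagePinned(std::vector<std::uint8_t>& image, std::size_t width,
                        std::size_t height, unsigned workerCount) {
    assert(image.size() == width * height);
    if (workerCount == 0) workerCount = 1;
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // hardware_concurrency may report 0

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < workerCount; ++i) {
        std::size_t rowBegin = height * i / workerCount;
        std::size_t rowEnd = height * (i + 1) / workerCount;
        workers.emplace_back(processStripLocally, image.data(), width,
                             rowBegin, rowEnd, i % cores);
    }
    for (auto& t : workers) t.join();
}
```

The copy in and the copy back both still touch the shared image, so this can only pay off if each pixel is accessed many times during processing.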

Will that work, or do I need a different approach?

Regards, Peter.