My name is Chris. I have been developing all kinds of parallel applications for all kinds of microprocessor systems as well as high-performance parallel algorithms for many years.
There are several technologies available for implementing parallel processing in C++. It is possible to create parallel threads, and their portability should improve drastically with the advent of the threads specified in C++0x. It is also possible to use OpenMP for selected, specially prepared code sequences. And it is even possible to activate Intel's parallelizing options, with which the compiler automatically seeks out code sequences that might benefit from parallelization and parallelizes them.
So finally a few general questions for Clay and Aaron:
Do you have any guidelines for using different implementation methods for parallelization in specific situations?
Are there any clear cases for which one form of parallelization, such as creating dedicated parallel threads, clearly outperforms other methods?
Are there any methods available with which developers can judge the parallelization overhead, such as the overhead caused by creation of dedicated threads or event catching, and relate this overhead to the expected benefit in order to better select the right parallelization technology?
Excellent questions. I wish I had some brilliant and excellent answers for them.
I've found that the platform being used rather than the computations to be done may have more influence over what methods are "best" to be used. If you're in a distributed environment you're pretty much stuck with some kind of message-passing method (MPI, PVM). For shared-memory there are explicit threads (Pthreads, Windows threads) and implicit threads (OpenMP, TBB). For GPUs, there are a host of data-parallel languages out or soon to be released (CUDA, OpenCL, Ct). The language being used can affect your choice, too.
In the realm of shared-memory, I guess I would recommend that you use the easiest method that will give you the power to express the parallelism that you need. If you've got computations that are implemented in a threaded library, use the library. If you've got loops with independent iterations, use OpenMP or TBB. If your parallel algorithm needs more complex interaction between concurrent tasks/threads, then you likely need to go with an explicit threading method.
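To illustrate the "loops with independent iterations" case, here is a minimal sketch using OpenMP. The function names `scale_in_place` and `sum_elements` are made up for this example and do not come from the thread. Without OpenMP support enabled (for instance `-fopenmp` with GCC), the pragmas are ignored and the loops simply run serially.

```cpp
#include <vector>

// Hypothetical helper: multiplies every element by `factor`.
// Each iteration touches only data[i], so the iterations are independent
// and can be divided among threads without any synchronization.
void scale_in_place(std::vector<double>& data, double factor) {
    // A signed loop index keeps the loop acceptable to older OpenMP versions.
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Hypothetical helper: sums all elements.
// The iterations share `total`, so a reduction clause is needed: each
// thread accumulates a private copy, and the copies are combined at the end.
double sum_elements(const std::vector<double>& data) {
    double total = 0.0;
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Calling `scale_in_place(v, 2.0)` followed by `sum_elements(v)` needs no thread management in the calling code, which is the sense in which this is the "easiest" option for such loops.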
In one of the examples from The Art of Concurrency, I used both TBB and OpenMP. OpenMP was used to thread the loop iterations over the bulk of the computation and TBB was used to perform a reduction operation that was more complex than could be handled by OpenMP.
As for judging the overheads of different methods, I've not seen a study of that. I'm not sure if there are profiling tools that can give you the calling overhead for functions. To get an idea about execution time for things like creating and synchronizing threads, you could write your application with several different methods and time each one on the same data set. Setting aside the time involved in creating, debugging and tuning multiple versions of the same code, this would give you an idea of how long that particular algorithm and mix of threading calls might take in general. What you can't measure this way, at least not by timing the total execution time, is the amount of time a thread spends waiting on a contested sync object. Tools like Parallel Studio can help you focus on those kinds of timing issues.
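The "write it several ways and time each one" approach could look roughly like the following sketch, which uses C++11 `std::thread` and `std::chrono`. The names `elapsed_us`, `count_with_threads` and `count_serially` are invented for this example. The work per thread is deliberately trivial, so the gap between the two timings is dominated by thread creation and joining.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in microseconds.
template <typename F>
double elapsed_us(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

// Variant A: one dedicated thread per increment. The work itself is
// trivial, so nearly all of the time goes to creating and joining threads.
int count_with_threads(int n) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&counter] { counter.fetch_add(1); });
    }
    for (auto& t : workers) {
        t.join();
    }
    return counter.load();
}

// Variant B: the same result computed serially, as a baseline.
int count_serially(int n) {
    int counter = 0;
    for (int i = 0; i < n; ++i) {
        ++counter;
    }
    return counter;
}
```

Comparing `elapsed_us([] { count_with_threads(64); })` against `elapsed_us([] { count_serially(64); })` over several runs gives a rough feel for the creation and join cost on a given machine. As noted above, total-time measurements like this say nothing about how long threads wait on contested synchronization objects.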
Performance from one threading library to another is less of a concern for me. I assume that they will all be tuned as best they can. The bottom line is to use the easiest method of implementing the parallelism you need. If you're not getting the performance you think you should, try another library to see if there might be issues with the overheads of the functions being used. Another algorithmic approach with the original library might fix things up, too.