Re: Parallelization Methods
Excellent questions. I wish I had equally brilliant answers for them.
I've found that the platform being used, rather than the computations to be done, may have more influence over which methods are "best". If you're in a distributed environment, you're pretty much stuck with some kind of message-passing method (MPI, PVM). For shared memory, there are explicit threads (Pthreads, Windows threads) and implicit threads (OpenMP, TBB). For GPUs, there are a host of data-parallel languages out or soon to be released (CUDA, OpenCL, Ct). The language being used can affect your choice, too.
In the realm of shared memory, I guess I would recommend that you use the easiest method that gives you the power to express the parallelism you need. If you've got computations that are implemented in a threaded library, use the library. If you've got loops with independent iterations, use OpenMP or TBB. If your parallel algorithm needs more complex interaction between concurrent tasks/threads, then you likely need to go with an explicit threading method.
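As a rough sketch of the "loops with independent iterations" case, here is what the OpenMP route can look like. The function names (add_vectors, dot) are made up for illustration and do not come from any particular application. With GCC or Clang the pragmas take effect when compiling with -fopenmp; without that flag they are normally ignored and the loops simply run serially with the same results.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: c[i] = a[i] + b[i].
// Each iteration reads and writes only index i, so the iterations are
// independent and can be divided among threads without synchronization.
std::vector<double> add_vectors(const std::vector<double>& a,
                                const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> c(a.size());
    // Signed loop index, which older OpenMP versions require.
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Hypothetical example: dot product.
// The iterations share one accumulator, so a reduction clause is used:
// each thread keeps a private partial sum, and the partial sums are
// combined when the loop finishes.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += a[i] * b[i];
    }
    return total;
}
```

Calling add_vectors({1, 2, 3}, {4, 5, 6}) yields {5, 7, 9}, and dot on the same inputs yields 32. The point of the sketch is that the serial loop body is left untouched; only the pragma is added.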
In one of the examples from The Art of Concurrency, I used both TBB and OpenMP. OpenMP was used to thread the loop iterations over the bulk of the computation and TBB was used to perform a reduction operation that was more complex than could be handled by OpenMP.
As for judging the overheads of different methods, I've not seen a study of that. I'm not sure whether there are profiling tools that can give you the calling overhead for functions. To get an idea of the execution time for things like creating and synchronizing threads, you could write your application with several different methods and time each one on the same data set. Apart from the cost of creating, debugging and tuning multiple versions of the same code, this would give you an idea of how long that particular algorithm and mix of threading calls might take in general. What you can't measure this way, at least not by timing the total execution time, is the amount of time a thread spends waiting on a contested sync object. Tools like Parallel Studio can help you focus on those kinds of timing issues.
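A minimal sketch of that kind of side-by-side timing, assuming a simple array sum as the computation. The helper best_time_ms and the two versions sum_serial and sum_openmp are hypothetical names chosen for this example. As noted above, this measures total wall-clock time only; it says nothing about how long individual threads wait on a contested sync object.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Run `work` `repeats` times and return the best (smallest) wall-clock
// time in milliseconds. Taking the best of several runs reduces noise
// from other activity on the machine.
template <typename F>
double best_time_ms(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) {
            best = ms;
        }
    }
    return best;
}

// Version 1 of the hypothetical computation: plain serial loop.
double sum_serial(const std::vector<double>& data) {
    double total = 0.0;
    for (double x : data) {
        total += x;
    }
    return total;
}

// Version 2: the same loop with an OpenMP reduction. The pragma only
// takes effect when OpenMP is enabled at compile time (e.g. -fopenmp).
double sum_openmp(const std::vector<double>& data) {
    const long n = static_cast<long>(data.size());
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

To compare, call best_time_ms with a lambda wrapping each version on the same input, for example best_time_ms([&] { result = sum_openmp(data); }, 5), and store the result somewhere it will be used so the compiler does not optimize the work away. On small inputs the threaded version may well come out slower, which is exactly the thread creation and synchronization overhead this comparison is meant to expose.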
Performance from one threading library to another is less of a concern for me. I assume that they will all be tuned as well as they can be. The bottom line is to use the easiest method of implementing the parallelism you need. If you're not getting the performance you think you should, try another library to see whether the overheads of the functions being used are the issue. Another algorithmic approach with the original library might fix things up, too.