I do a lot of work with high-performance mathematecal algorithm design. Some of my work has significantly benefited from parallelization.
I have, at times,
- manually created dedicated threads for parallel matrix calculations
- manually created dedicated threads for parallel FFT operations
- manually separated chunks of large-number algorithms (those dealing with hundreds or thousands of digits or more) into parallelizable parts.
All of this work has involved tedious manual separation of algorithm chunks, and it has been limited to operations which are easily identified as parallelizable.
Are there any better ways? Are there methods to create automation mechanisms for parallelization?
Consider a big number class which performs many, many mathematical operations in a linear, sequential fashion, each operation --- particularly big-number multiplication --- being large-scale enough in scope to benefit from parallelization. Some operations like (a * b) + (c * d) can be intuitively parallelized. On the other hand, many algorithms such as series expansions have sums of terms for which the value of the n'th term depends on the value of the [n-1]'th term. These kinds of operations seem to be sequential --- that is unless the number of expansion terms is known in advance and the algorithm sets up something like a parallelized "for_each" mechanism.
Expanding on these ideas, I always think of parsers like EBNF and parser-generators like flex and bison. They parse syntaxes and place operations on a sequential operation stack and subsequently carry them out. Are there any technologies available for placing operations onto a parallelized operation stack? This seems quite difficult to me, almost like a research project (and I believe there is research being carried out in this field). The mechanism would have to determe which operations can be done in parallel, it must synchronize those with previous, pending sequential results, and wrap all synchronized and sequential results into a single answer at the end. It always sort of blows my mind.
Is there any kind of technology or research in this area? Does anyone know of research results or other literature in the area of automated parallelization of operation stacks? Is there any similar technology?
You're gonna go blind staring into that box all day.
As I see it, automatic parallelization is a dead end. There hasn't been a significant breakthrough since I first heard about the idea many, many years ago. The problem is that auto-parallelizers need to be conservative when examining code. Obvious loop iterations that are independent can be run in parallel. Throw in a pointer reference and the whole thing is unknown (without programmer intervention). General parallelism within an application is going to be too far out of reach for a compiler or parser.
Patterns, like loop iterations or your (a * b) + (c * d) example, once they are known, can be found in the algorithm and transformed to a parallel version. The serial code pattern is set by the programmer and is just the order of the instructions. Thus, sticking things on a stack or running instuctions out of order (with the obvious checks for read-write order to variables) is easy to do.
If a compiler, like the Intel compilers, that has an auto-parallelizing feature should have a report feature to denote possible parallelizable code and state the reasons that the code was rejected by the compiler. This gives the programmer the chance to add pragmas or restructure the code to make it more amenable to automatic parallelization. Now that the programmer has needed to step in, it seems less "automatic" than we would hope. However, I think that this is the state-of-the-art.
A related idea is to embed parallelism into the language. For example, map in Lisp or vector and matrix operations in APL or array syntax in Fortran 90. If the a, b, c, and d in the sum of products example are vectors, writing the one line would take the place of a hand-coded loop over the elements and the multiplication and addition operations of vector elements could be recognized as a parallel operation; if there were multiple cores available, those could be set up to run on as many as cores as needed. Still, not exactly automatic parallelization and only applicable to the set of algorithmic patterns that are known to be safe for such a transformation.