adding multicore and GPU support: thoughts and strategies
This year's GSoC efforts spurred some project proposals concerned
with accelerating Boost.uBLAS using a variety of techniques
(vectorization, parallelization, GPU acceleration).
As I have worked on similar efforts in the past (e.g.,
http://openvsip.org/), I'd like to share some of my experience, so
that future efforts can learn from it.
1) choice of optimisation
The first thing to note is that there are quite a few approaches
to accelerating BLAS routines, and which one to choose on a
particular hardware platform depends on several criteria. Some of
that information is available at compile time (operand types, the
specific operation, etc.), some only at runtime (the problem
size, the exact array dimensions, memory alignment, number of
cores available, etc.). Worse, the platform on which the code is
compiled may not even be the platform on which it is to run
(unless you want to end up in a situation like ATLAS, which
doesn't support cross-compilation precisely because it fine-tunes
generated code by measuring performance on hardware available
during the build).
This suggests a different approach, where multiple "backends"
coexist (SIMD, OpenMP, OpenCL, CUDA, etc.) in parallel, and may,
depending on the deployment context, be enabled individually.
Then, a user may either select one of the available backends
explicitly (using an appropriate API that needs to be added), or a
mechanism needs to be added that selects the "best" backend
automatically. This selection itself could be done in different ways,
either in-process ("just in time"), or out-of-process, in a
Note that, in case of GPU-based backends, it is crucial to
eliminate unnecessary data movements, as they will have a huge
impact on performance. Therefore, rather than naively moving data
from the host to the GPU, running the operation, and moving the
result back for every single operation, it's much better to move
data "lazily", i.e. to keep data on the GPU in case the next
operation is also performed there. All this suggests that a good
data model is crucial for
such acceleration work.
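To make the "lazy" data movement idea concrete, here is a minimal
sketch in C++. Everything in it is hypothetical and not part of
Boost.uBLAS: the names lazy_block, location and device_scale are
made up for illustration, and the "device" buffer is ordinary host
memory standing in for GPU memory so the sketch runs anywhere. It
also assumes, for simplicity, that every device access is a write.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Where the up-to-date copy of the data currently lives.
enum class location { host, device };

// Hypothetical data block that remembers where its valid copy is and
// counts transfers. A real implementation would hold a GPU allocation
// instead of the second std::vector.
class lazy_block {
public:
  explicit lazy_block(std::vector<float> v)
    : host_(std::move(v)), device_(host_.size()) {}

  // Make the data valid on the device; copy only if it is not there yet.
  // Simplification: any device access is treated as a write.
  float *on_device() {
    if (valid_ == location::host) {
      device_ = host_;  // stands in for a host-to-device copy
      ++transfers_;
      valid_ = location::device;
    }
    return device_.data();
  }

  // Make the data valid on the host; copy only if it is not there yet.
  float const *on_host() {
    if (valid_ == location::device) {
      host_ = device_;  // stands in for a device-to-host copy
      ++transfers_;
      valid_ = location::host;
    }
    return host_.data();
  }

  std::size_t size() const { return host_.size(); }
  int transfers() const { return transfers_; }

private:
  std::vector<float> host_, device_;
  location valid_ = location::host;
  int transfers_ = 0;
};

// Stand-in for a GPU kernel: scales the device-side data in place.
void device_scale(lazy_block &b, float alpha) {
  float *d = b.on_device();
  for (std::size_t i = 0; i != b.size(); ++i) d[i] *= alpha;
}
```

With this model, calling device_scale twice in a row on the same
block costs one host-to-device copy, not two, and the copy back
only happens when the host actually asks for the data via
on_host(). The naive scheme would have needed four transfers for
the same two operations.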
2) do-it-yourself versus using existing backends
At least for certain platforms there already exist optimised
"kernels", and it might be best to call those rather than
reimplement them. For example, for both CUDA and OpenCL
there exist freely distributable BLAS libraries. Thus, it might be
more efficient to add adapters that allow Boost.uBLAS to call
those, rather than implement its own.
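As a rough illustration of such adapters, and of the coexisting
backends from section 1, here is a sketch in C++. All names are
hypothetical (backend, registry, vendor_saxpy, and so on); none of
this is an existing Boost.uBLAS API. vendor_saxpy is a plain loop
standing in for an existing optimised kernel, and the size
threshold of 1024 is an arbitrary assumption, not a measured value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an existing optimised kernel with a C-style interface
// (hypothetical; a real adapter would call into a vendor BLAS instead).
void vendor_saxpy(int n, float alpha, float const *x, float *y) {
  for (int i = 0; i != n; ++i) y[i] += alpha * x[i];
}

// Common interface each backend implements: y = alpha * x + y.
struct backend {
  virtual ~backend() = default;
  virtual std::string name() const = 0;
  // Smallest problem size for which this backend is assumed to pay off.
  virtual std::size_t min_size() const = 0;
  virtual void axpy(float alpha, std::vector<float> const &x,
                    std::vector<float> &y) const = 0;
};

// Reference backend: a plain loop, applicable to any size.
struct host_backend : backend {
  std::string name() const override { return "host"; }
  std::size_t min_size() const override { return 0; }
  void axpy(float alpha, std::vector<float> const &x,
            std::vector<float> &y) const override {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i != x.size(); ++i) y[i] += alpha * x[i];
  }
};

// Adapter: forwards to the existing kernel rather than reimplementing it.
struct vendor_backend : backend {
  std::string name() const override { return "vendor"; }
  std::size_t min_size() const override { return 1024; }  // arbitrary
  void axpy(float alpha, std::vector<float> const &x,
            std::vector<float> &y) const override {
    assert(x.size() == y.size());
    vendor_saxpy(static_cast<int>(x.size()), alpha, x.data(), y.data());
  }
};

// Holds whichever backends are enabled in this deployment.
class registry {
public:
  void add(std::unique_ptr<backend> b) { backends_.push_back(std::move(b)); }

  // Explicit selection by name; returns nullptr if no such backend.
  backend const *find(std::string const &name) const {
    for (auto const &b : backends_)
      if (b->name() == name) return b.get();
    return nullptr;
  }

  // Heuristic selection: among backends whose min_size() does not
  // exceed n, pick the one with the largest min_size().
  backend const *best(std::size_t n) const {
    backend const *choice = nullptr;
    for (auto const &b : backends_)
      if (b->min_size() <= n &&
          (!choice || b->min_size() > choice->min_size()))
        choice = b.get();
    return choice;
  }

private:
  std::vector<std::unique_ptr<backend>> backends_;
};
```

A user could then either ask for a backend by name via find(), or
let best() choose based on the problem size known at runtime. A
real selection mechanism would need better criteria than a single
threshold (alignment, core count, where the data currently lives),
but the adapter shape would stay the same.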
All that being said, I don't think it's a good idea to let GSoC
students make their own choices, hoping that those will be in line
with what the Boost.uBLAS developers have planned for the future.
On the other hand, such an architectural vision may not even exist
as of yet, so it's hard to come up with a clear path forward,
without doing some actual prototyping. But with all the above open
questions, there is a real danger that any project turns out to be
over-ambitious and ends without tangible results that could be
integrated back into Boost.uBLAS. I'd thus like to
suggest that we scale down the expectations a bit, perhaps picking
one or two self-contained ideas from the above, which can be
relatively easily implemented and even validated.