To implement multi-GPU stencil computation using Accelerate

Takayuki Muranushi nushio at
Thu May 26 01:17:25 BST 2011

(continued from private communication)

Hello Manuel, nice to meet you. This is Takayuki Muranushi speaking.

Now that I know Accelerate supports stencil calculations on GPUs, I'm
interested in using it for our work: solving partial differential
equations on distributed, GPU-accelerated computers.

So, would you, as a developer of Accelerate, help me figure out how
hard it would be?

Q1. With Accelerate, can I keep the array data within each GPU's
device memory and only read/write margin regions out of GPUs?
Q2. Can I launch communication (cudaMemcpy) and calculation (GPU
kernel call) in parallel?
Q3. Does Accelerate support custom array storage orders, or can users
modify it to do so?

Let me put it in detail.

Simulating partial differential equations with explicit solvers
reduces to non-linear stencil computations. Usually the stencil is
applied thousands to millions of times. The entire array is divided
and distributed among the computation nodes, so communication is
needed. Since in a stencil calculation each cell requires information
from its neighbouring cells, only the margin regions, not the entire
region, have to be communicated.
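To illustrate the margin exchange described above, here is a minimal sketch in plain Python (not Accelerate, and not GPU code). It assumes a 1D array, a periodic boundary, a margin width of one cell, and a simple linear 3-point averaging stencil standing in for a real non-linear one. All function names are hypothetical.

```python
def stencil(left, centre, right):
    """Placeholder 3-point stencil; a real solver would be non-linear."""
    return (left + centre + right) / 3.0


def step_global(u):
    """Reference: one stencil step on the undivided array (periodic)."""
    n = len(u)
    return [stencil(u[i - 1], u[i], u[(i + 1) % n]) for i in range(n)]


def exchange_margins(chunks):
    """Pad every chunk with one margin cell taken from each neighbour.

    Only the edge cells of each chunk are read here; this is the only
    data that would have to leave a device in a multi-GPU setting.
    """
    p = len(chunks)
    padded = []
    for k, chunk in enumerate(chunks):
        left_margin = chunks[(k - 1) % p][-1]
        right_margin = chunks[(k + 1) % p][0]
        padded.append([left_margin] + chunk + [right_margin])
    return padded


def step_local(padded):
    """Update the owned cells of one chunk, using its margin cells."""
    return [stencil(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def step_distributed(chunks):
    """One step on the divided array: exchange margins, then compute."""
    return [step_local(c) for c in exchange_margins(chunks)]
```

Repeating `step_distributed` on the chunks and concatenating them gives the same values as repeating `step_global` on the whole array, while each chunk reads only two cells from its neighbours per step.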

If we compare communication and computation time, there are 4 cases.
Here, by 'communication' I mean copying data from device to host
memory, then optionally, sending the data via interconnect, then
copying from host to device memory.

0. Time for communicating margin regions is much greater than computation.
1. Time for communicating margin regions is comparable to computation.
2. Time for communicating margin regions is much smaller than computation, but
time for communicating the entire region is still greater than computation.
3. Time for communicating the entire region is much smaller than computation.
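The difference between communicating the margin regions and communicating the entire region can be estimated by counting cells. The sketch below assumes a cubic (or square, or linear) subdomain with `n` cells per side and a margin `w` cells wide on every face. These parameters and function names are hypothetical and serve only as an illustration.

```python
def margin_cells(n, w=1, dims=3):
    """Number of owned cells lying within w cells of a subdomain face.

    These are the cells a neighbour's stencil may need, so they are
    the ones that must be copied out of the device every step.
    """
    inner = max(n - 2 * w, 0)
    return n ** dims - inner ** dims


def margin_fraction(n, w=1, dims=3):
    """Margin cells as a fraction of all cells in the subdomain."""
    return margin_cells(n, w, dims) / float(n ** dims)
```

For a 256^3 subdomain with a one-cell margin, `margin_fraction(256)` is a few percent, so copying only the margins moves far less data than copying the whole array. Whether that is enough to reach case 2 or case 3 depends on the actual bandwidth and kernel time, which this count does not model.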

In case 0, there is little gain in using accelerators. My models
usually fall into case 2. Some of my colleagues use simpler models
with less computation, which fall into case 1.

In case 1, we have to launch computation and communication
simultaneously to achieve good performance. Changing the array layout
so that the margin regions reside in contiguous portions of memory
helps. Array layouts such as Morton ordering or Peano ordering are
also good for the cache. In case 2, we still have to take care not to
copy the entire array to the host memory. This is where my questions
come from.
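As a concrete example of a non-row-major layout, here is a minimal sketch of 2D Morton (Z-order) indexing in plain Python. It assumes non-negative integer coordinates that fit in `bits` bits. The function names are hypothetical and this is not an Accelerate API.

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of x and y into one Morton (Z-order) index.

    Bit b of x goes to position 2*b, bit b of y to position 2*b + 1.
    """
    code = 0
    for b in range(bits):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code


def morton_decode(code, bits=16):
    """Inverse of morton_encode: recover (x, y) from a Morton index."""
    x = y = 0
    for b in range(bits):
        x |= ((code >> (2 * b)) & 1) << b
        y |= ((code >> (2 * b + 1)) & 1) << b
    return x, y
```

With this indexing, every aligned 2x2 block of cells (and, recursively, every aligned 4x4 block, 8x8 block, and so on) occupies a contiguous range of indices. This is why cells that are close in 2D tend to be close in memory.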

I don't insist on doing everything in Haskell. Moreover, there's no
guarantee that I can build Haskell on every supercomputer I use. If we
can find a systematic way to use the CUDA code generated by Accelerate
and combine it with CUDA code written by hand or produced by other
code generators, that would be a solution, too.

>> Accelerate was prime candidate for my backend. I've been testing it
>> for a while. Two reasons I didn't use it for the time are: that
>> Accelerate doesn't support stencil operations for GPUs, and that I'm
>> not sure how to use multiple GPUs with it. Anyway, your mail reminded
>> me that I should look it carefully again, and follow the CUDA-code
>> generating process of Accelerate to the bottom.
> The current version of Accelerate does indeed have support for stencil computations.  I recently moved the code over to GitHub, where you can find the latest version:
> Here, as an example, is a Canny edge detection in Accelerate:
> The use of multiple GPUs is a more complex issue and depends on how you access the various GPUs.  Are they attached to one PC or in a distributed cluster where you need to use message passing for communication?  My recommendation would be to run one Accelerate computation per GPU and coordinate the GPUs with custom Haskell code running in multiple Haskell threads.  The Haskell MPI library
> might help coordinating the various threads.

Thanks in advance,

The Hakubi Center / Yukawa Institute for Theoretical Physics, Kyoto University
