To implement multi-GPU stencil computation using Accelerate

Mon May 30 08:47:52 BST 2011

Hi Takayuki,

> Q1. With Accelerate, can I keep the array data within each GPU's
> device memory and only read/write margin regions out of GPUs?

The management of array data in GPU memory is mostly automatic in Accelerate.  At the moment, we don't have a function that extracts a subarray (such as a margin) from a larger array — there is only 'slice', but that extracts all data along a number of dimensions (so it is not useful for margins).  I imagine that it would not be very hard to add such a function, though.

Similarly, we could add a function replacing the margins of an array.  In general, this will require copying the entire data array (locally on the GPU), though, to preserve the functional semantics.  (In principle, it would be possible to optimise this by using reference counts, but we don't do that at the moment.)
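
As a rough illustration of what such margin functions would do, here is a pure Haskell model on a grid stored as a list of rows.  It is not Accelerate code: `Grid`, `topMargin`, `bottomMargin`, `replaceTopMargin` and `replaceBottomMargin` are hypothetical names chosen for this sketch, and an Accelerate version would operate on `Acc` arrays instead.  Note that the replace functions return a new grid and leave their argument unchanged, which is the functional semantics that forces a copy in the general case.

```haskell
-- A pure model of margin extraction and replacement (not Accelerate code).
-- A grid is a list of rows; all names here are hypothetical.
type Grid a = [[a]]

-- Extract the top margin: the first w rows of the grid.
topMargin :: Int -> Grid a -> Grid a
topMargin w = take w

-- Extract the bottom margin: the last w rows of the grid.
bottomMargin :: Int -> Grid a -> Grid a
bottomMargin w g = drop (length g - w) g

-- Replace the top margin with new rows, producing a new grid.
-- The old grid is untouched, as functional semantics requires.
replaceTopMargin :: Grid a -> Grid a -> Grid a
replaceTopMargin margin g = margin ++ drop (length margin) g

-- Replace the bottom margin with new rows, producing a new grid.
replaceBottomMargin :: Grid a -> Grid a -> Grid a
replaceBottomMargin margin g = take (length g - length margin) g ++ margin
```

For example, `replaceTopMargin (bottomMargin 1 upper) lower` overwrites the first row of `lower` with the last row of `upper`, which is the kind of exchange two neighbouring subdomains would perform.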

> Q2. Can I launch communication (cudaMemcpy) and calculation (GPU
> kernel call) in parallel?

Yes, but that is not under programmer control.  While evaluating an Accelerate expression, the backend can overlap data transfers and independent computations.  This process is probably not optimal at the moment, though.

> Q3. Does Accelerate support customizing the array storage order, or
> can users modify it to do so?

There is no such support at the moment.  You could change the storage order by hacking Accelerate.  However, if you would like to use several different storage orders in a single program, you would need to extend the array type to indicate which storage order it is using.
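
To make the idea of an array type that records its storage order a little more concrete, here is a small sketch in plain Haskell.  It is not how Accelerate represents arrays, and every name in it (`Grid`, `RowMajor`, `Morton`, `Layout`, `offset`, `cellAt`) is hypothetical.  The layout is a phantom type parameter, and each layout supplies its own mapping from (row, column) to a position in flat storage; Morton order is used as the second layout because the quoted message mentions it.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))

-- Hypothetical layout tags; these are not Accelerate types.
data RowMajor
data Morton

-- A flat array whose type records the storage order of its cells.
data Grid layout a = Grid
  { gridWidth :: Int   -- number of columns
  , gridCells :: [a]   -- cells in the order given by 'layout'
  }

-- Each storage order supplies its own (row, column) -> offset mapping.
class Layout layout where
  offset :: Grid layout a -> (Int, Int) -> Int

instance Layout RowMajor where
  offset g (r, c) = r * gridWidth g + c

-- Morton (Z-order): interleave the bits of column and row.
-- Assumes a square grid whose side is a power of two, and
-- non-negative coordinates below 2^16.
instance Layout Morton where
  offset _ (r, c) = foldr (.|.) 0
    [ (bitAt c i `shiftL` (2 * i)) .|. (bitAt r i `shiftL` (2 * i + 1))
    | i <- [0 .. 15] ]
    where bitAt x i = (x `shiftR` i) .&. 1

-- Look up a cell.  This works for any layout, and passing a grid with
-- one layout where another is expected is a type error.
cellAt :: Layout layout => Grid layout a -> (Int, Int) -> a
cellAt g ix = gridCells g !! offset g ix
```

With this design, a function that requires a particular layout can say so in its type, for instance `Grid Morton Double -> ...`, while layout-agnostic code uses the `Layout` constraint.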

> Let me put it in detail.
> 
> Simulating partial differential equations (with explicit
> solvers) reduces to non-linear stencil computations. Usually the stencil
> is applied thousands to millions of times. The entire array is divided
> and distributed among computation nodes, so communication arises. Since
> in a stencil calculation each cell requires its neighbour cells'
> information, only the margin regions, not the entire region, need to be
> communicated.
> 
> If we compare communication and computation time, there are 4 cases.
> Here, by 'communication' I mean copying data from device to host
> memory, then optionally, sending the data via interconnect, then
> copying from host to device memory.
> 
> 0. Time for communicating margin regions is much greater than computation.
> 1. Time for communicating margin regions is comparable to computation.
> 2. Time for communicating margin regions is much smaller than computation, but
> time for communicating the entire region is still greater than computation.
> 3. Time for communicating the entire region is much smaller than computation.
> 
> In case 0. there is little gain in using accelerators. My models
> usually fall into case 2. Some of my colleagues are using simpler
> models with less computation; those fall into case 1.
> 
> In case 1., we have to launch computation and communication
> simultaneously to achieve good performance. Changing the array layout
> so that margin regions reside in a contiguous portion of memory helps.
> Array layouts such as Morton ordering or Peano ordering are also good
> for the cache. In case 2., we still have to take care not to copy the
> entire array to host memory. This is where my questions arise.
> 
> I don't insist on doing everything in Haskell. Moreover, there's no
> guarantee I can build Haskell on every supercomputer I use. If we can
> find a systematic way to utilize CUDA code generated by Accelerate,
> and combine it with CUDA code written by hand or by other code
> generators, that would be a solution, too.

We discussed a form of foreign function interface for Accelerate, where you can call native functions implemented in CUDA from within Accelerate expressions.  This is not implemented at the moment, but it shouldn't be hard to add.

Cheers,
Manuel

>>> Accelerate was the prime candidate for my backend. I've been testing it
>>> for a while. Two reasons I haven't used it so far are that
>>> Accelerate doesn't support stencil operations for GPUs, and that I'm
>>> not sure how to use multiple GPUs with it. Anyway, your mail reminded
>>> me that I should look at it carefully again, and follow the CUDA code
>>> generation process of Accelerate to the bottom.
>> 
>> The current version of Accelerate does indeed have support for stencil computations.  I recently moved the code over to GitHub, where you can find the latest version:
>> 
>> https://github.com/mchakravarty/accelerate
>> 
>> Here, as an example, is a Canny edge detection in Accelerate:
>> 
>> https://github.com/mchakravarty/accelerate/blob/master/accelerate-examples/tests/image-processing/Canny.hs
>> 
>> The use of multiple GPUs is a more complex issue and depends on how you access the various GPUs.  Are they attached to one PC, or are they in a distributed cluster where you need to use message passing for communication?  My recommendation would be to run one Accelerate computation per GPU and coordinate the GPUs with custom Haskell code running in multiple Haskell threads.  The Haskell MPI library
>> 
>> https://github.com/bjpop/haskell-mpi
>> 
>> might help with coordinating the various threads.
>> 
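
The recommendation above (one computation per GPU, coordinated by Haskell threads) might be sketched as follows.  This is only a model under several assumptions: each "GPU" is a Haskell thread holding a plain list, the per-GPU work is a three-point averaging step standing in for an Accelerate computation, and `Block`, `step` and `runTwo` are hypothetical names.  The threads exchange only their boundary cells through `MVar`s from `Control.Concurrent.MVar`; on a cluster that exchange would go through message passing instead, which is not shown here.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Stand-in for the subarray held by one GPU (hypothetical; a real
-- program would run an Accelerate computation per worker instead).
type Block = [Double]

-- One three-point averaging step on a non-empty block, given the
-- ghost cells supplied by the left and right neighbours.
step :: Double -> Block -> Double -> Block
step l xs r = zipWith3 avg (l : xs) xs (tail xs ++ [r])
  where avg a b c = (a + b + c) / 3

-- Run n >= 0 steps on two neighbouring blocks, one Haskell thread each.
-- Only the boundary cells cross between threads; the outer boundary
-- is held at 0.
runTwo :: Int -> Block -> Block -> IO (Block, Block)
runTwo n left right = do
  toLeft  <- newEmptyMVar   -- mailbox read by the left worker
  toRight <- newEmptyMVar   -- mailbox read by the right worker
  doneL   <- newEmptyMVar
  doneR   <- newEmptyMVar
  let leftWorker 0 xs = putMVar doneL xs
      leftWorker k xs = do
        putMVar toRight (last xs)      -- send my right margin
        ghost <- takeMVar toLeft       -- receive the neighbour's margin
        leftWorker (k - 1) (step 0 xs ghost)
      rightWorker 0 xs = putMVar doneR xs
      rightWorker k xs = do
        putMVar toLeft (head xs)       -- send my left margin
        ghost <- takeMVar toRight      -- receive the neighbour's margin
        rightWorker (k - 1) (step ghost xs 0)
  _ <- forkIO (leftWorker n left)
  _ <- forkIO (rightWorker n right)
  (,) <$> takeMVar doneL <*> takeMVar doneR
```

Each worker sends its margin before waiting for the neighbour's, so both sides always exchange values from the same time step, and the combined result matches running `step` on the undivided array.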
> 
> Thanks in advance,
> 
> -- 
> MURANUSHI Takayuki
> The Hakubi Center / Yukawa Institute of Theoretical Physics, Kyoto University