To implement multi-GPU stencil computation using Accelerate

Takayuki Muranushi nushio at
Sat Jul 9 14:01:13 BST 2011

I'm trying to benchmark Accelerate by applying some computationally
heavy functions to a large array. In short, I'd like to generate code
like this:

__device__ Real f (Real x){
  for (int i = 0; i < iteration; ++i) {
    x = 4*x*(1-x);
  }
  return x;
}

What I've tried is this:

However, looking at CUDA_PROFILE, each map is called as a separate
CUDA kernel, and I cannot get good benchmark results (no more than
20 GFLOPS in single precision on an M2050).

On the other hand, if I try the following expression

(foldl (.) id $ replicate iteration (\x -> 4*x*(1-x)))

Accelerate tries to expand every term. This causes the size of the
expression to grow exponentially in 'iteration', and I quickly get a
stack overflow.
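
The growth can be reproduced without Accelerate. The following is only a minimal sketch, assuming a toy expression type: `Expr`, `step`, `composed` and `size` are hypothetical names for illustration and are not Accelerate's actual AST or API. Because x occurs twice in 4*x*(1-x), a traversal that does not observe sharing sees the tree roughly double with each composition, which is one plausible reading of the blow-up described above.

```haskell
-- Toy expression tree (hypothetical; NOT Accelerate's real AST).
data Expr
  = Var               -- the input x
  | Lit Double        -- a constant
  | Mul Expr Expr
  | Sub Expr Expr
  deriving (Show, Eq)

-- One application of \x -> 4*x*(1-x), built as a syntax tree.
-- Note that the argument x is used twice.
step :: Expr -> Expr
step x = Mul (Mul (Lit 4) x) (Sub (Lit 1) x)

-- Same shape as (foldl (.) id $ replicate iteration f), applied to Var.
composed :: Int -> Expr
composed n = foldl (.) id (replicate n step) Var

-- Number of nodes seen by a traversal that ignores sharing.
size :: Expr -> Integer
size Var       = 1
size (Lit _)   = 1
size (Mul a b) = 1 + size a + size b
size (Sub a b) = 1 + size a + size b

-- Each step adds 5 nodes and duplicates its argument:
--   size (composed (n+1)) = 5 + 2 * size (composed n)
-- which gives size (composed n) = 6 * 2^n - 5.
```

Loading this in GHCi and evaluating `map (size . composed) [0..3]` gives sizes following the recurrence in the final comment, so even a modest 'iteration' yields a tree far too large to traverse recursively.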

Any good ideas?

The Hakubi Center, Kyoto University :
