generalized fold (and other newbie question)

Trevor McDonell tmcdonell at cse.unsw.edu.au
Sun Oct 10 03:58:00 EDT 2010

```Hi Johannes,

Firstly, sorry for the very late reply, I've been quite swamped lately.

> What is the accelerate example code for (dense, standard)
> array multiplication?

Good question. Of arbitrary rank? Short answer: no.

The slightly longer answer is that this piqued my interest, so I attach something for matrix multiplication. It pretty much follows what you outline below (and the repa paper, which it appears you have read).

The caveat, however, is that it tickles a currently unimplemented feature in the CUDA backend, namely 'shape' inside of array computations. So you could (a) pass the dimensionality as a separate Exp term, or (b) bug me to finish that (= . I have an idea how it can be done, it is just a matter of finding time (famous last words).

> Can I express something like the following,
> to get  c =  a * b  (both Array DIM2 Int)
>
> a', b' :: Array DIM3 Int   such that
> a'!(i,j,k) = a!(i,j) ; b'!(i,j,k) = b!(j,k)
>
> then obtain  c  as a  fold (+) 0  along the middle coordinate
> of  zipWith (+) a' b'
>
> Two questions from this:
>
> 1. I assume the first step can be done with "replicate" but exactly how?
> (the API doc is severely missing a formal spec here. and elsewhere)

I agree, replicate (and slice) are a bit difficult to use. The type checker needs its hand held a bit, and the errors it spits out aren't particularly helpful (at least to my eyes). This mostly derives from tuples, which are a similar story, and why you see lots of explicit type signatures near uses of 'Acc.untuple' and friends.

> 2. this would need a fold that goes from DIM3 to DIM2,
> but I don't see any such function.

Right, I also agree something like this would be useful in Accelerate, however since I haven't completed the CUDA backend as it is, figured I should do that before adding extra functionality (=

You can mimic the behaviour using a segmented reduction. I do this in the attached example.

> Background: I want to find out how to use accelerate+CUDA (on GTX285)
> to find matrix interpretations, as in the ICFP 2010 contest, cf.
> http://www.imn.htwk-leipzig.de/~waldmann/talk/10/icfp/

Cool! We just make the language tools, we are always looking for people who use them to do interesting things (=

> The matrices are small, and I want to do some kind of hill climbing.
> Ideally, all the "climbing" will be done inside the GPU, and I just want
> to start lots of these processes, and collect their results.
> (Then re-start, from Haskell land, using the previous results somehow.)
> What accelerate control structure would I use
> for doing a certain number of hill climbing steps?

One thing you may run into is the current lack of staging / sharing of array products, so building large expressions may involve a lot of recomputation. If you use Haskell-side control flow to generate a large expression, this can be bad... Manuel Chakravarty is currently working on this.

> In fact, the hill climbing process would compute a list of individuals,
> of which I only ever need the most recent/last one.
> Can this be done by accelerate? (Will it overwrite
> the earlier ones by some symbolic GC magic,
> or does it allocate space for all?)

CUDA.run returns a single array as the result. If your operation involved several steps, all of the intermediate stuff is handled by the backend (c.f. work distribution below).

> What about random numbers? Should I pass in (from Haskell land)
> a list/array of "randoms"?

This is what I generally do, yes, although that is purely for the purposes of feeding test programs "sample data". If you were doing something like monte carlo, maybe that wouldn't be such a good idea and we should look for ways to generate randoms on the device.

> I guess the alternative is to control the hill climbing from Haskell
> land. Then I would use accelerate only to compute matrix products.
> This would incur too much communication via the bus?

What kind of control flow? You probably noticed the (?) operator for branching on Exp terms, but there is no equivalent for Acc things, because that could introduce nested parallelism. I'd need to know more about what you want to do.

You are correct about communication costs. How large are the arrays for your problem?
Saying that, the example I attached isn't exactly bandwidth efficient, compared to the example in the CUDA documentation... The 'stencil' operation *might* be useful for a faster matrix multiply, but I have never tried this, and (again) isn't yet implemented in the CUDA backend )=

> I totally don't see how accelerate/CUDA will distribute the work.
> Each expression (Acc ...) is compiled to run in one kernel?
> Then how to I run several?

Generally speaking, each function in the library generates a single piece of executable device code (a few functions will generate more than one). The backend coordinates how and when to run each of them, as well as memory management for intermediate arrays, etc. Example:

dotp xs ys = fold (+) 0
\$ zipWith (*) (use xs) (use ys)

This generates two kernels. The zipWith kernel is run first and writes to an intermediate array, after that we can GC the device-side xs and ys arrays. Then that intermediate array is fed into the fold kernel, before also being GC-ed. The result of the fold is copied back to the host (and only that, none of the intermediate stuff).

> Yes, that's a lot of questions. I am very willing to invest
> some work here -  it's not just that I want matrix interpretations;
> I also plan to present this stuff (program CUDA from Haskell) in a
> lecture. So it should not only work - it should also look
> nice and easy ...

It would be excellent if Accelerate could fulfil this purpose for you. That is exactly what Accelerate should be (=

Does that help at all?

Regards,
-Trevor

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MM.hs
Type: application/octet-stream
Size: 1478 bytes
Desc: not available