generalized fold (and other newbie question)

Wed Oct 13 01:12:15 EDT 2010

Trevor McDonell:
>> What is the accelerate example code for (dense, standard)
>> array multiplication?
> 
> Good question. Of arbitrary rank? Short answer: no.
> 
> The slightly longer answer is that this piqued my interest, so I attach something for matrix multiplication. It pretty much follows what you outline below (and the repa paper, which it appears you have read).
> 
> The caveat, however, is that it tickles a currently unimplemented feature in the CUDA backend, namely 'shape' inside of array computations. So you could (a) pass the dimensionality as a separate Exp term, or (b) bug me to finish that (= . I have an idea how it can be done, it is just a matter of finding time (famous last words).

The problem with that algorithm is that, without delayed arrays (as in Repa), it's space requirements are n^3 for nxn matrices.

A more space-efficient version would, instead of using replicates, map over the left matrix and then use explicit indexing to pick the right values form the second matrix to do the multiplications. (In fact, you need a zipWith, not a map, together with an array containing a value of (i, j) at position (i, j).)

I believe what we *really* want is a binding to CUBLAS that we can use from within Accelerate.

>> I totally don't see how accelerate/CUDA will distribute the work.
>> Each expression (Acc ...) is compiled to run in one kernel?
>> Then how to I run several?

We just wrote a paper about Accelerate.  It's at

  http://www.cse.unsw.edu.au/~chak/papers/CKLM+10.html

This should answer some of the questions about how this all works.

Cheers,
Manuel