static code generation and open Accs
Manuel M T Chakravarty
chak at cse.unsw.edu.au
Tue Aug 24 08:22:23 EDT 2010
>>> The front-end AST conversion and lookup of a cached version of the code are slow (i.e. excluding compilation and code generation). With the current version of the front-end and CUDA back-end it takes ~170ms. With the proposed variant of 'run', won't this penalty still be incurred each time the 'run' variant is called? I believe there is plenty of scope for speeding this up by optimising the language front-end and CUDA back-end. However, my guess is that the penalty will always be an issue for applications that repeat the computation at a high frequency.
>> Trevor mentioned that you did some measurements. I'm surprised that the front-end takes so long, but on the other hand, I haven't looked at performance at all so far. It should be possible to improve the performance significantly.
> Agree. I haven't tried doing this yet - it has potential.
I can have a look at the performance of the front-end after I have tackled the outstanding implementation of sharing.
>>> I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.
>>> Is there a way to achieve this?
>> I'd suggest first understanding the reason for the bad performance. It is likely that we can improve performance to the point where the overhead is no longer significant.
>> In your case, I guess, the requirement for repeated invocations comes from processing a stream of data (eg, stream of video frames), where you want to execute the same Accelerate computation on each element of the stream. That may very well be an idiom for which we should add special support. After all, we know a priori that a particular set of computations needs to be invoked many times per second. Would that address your problem?
> Correct. Your proposal would address our problem. At this stage I am experimenting with adding a function 'step' to the CUDA backend that has the following signature:
> step :: Arrays a => Array dim e -> (Acc a, CUDAState) -> IO (a, CUDAState)
> This enables me to repeat the same Accelerate computation on a mutable 'input' array (identified by the first argument), where I unsafely mutate the input array between iterations. It is very ugly and a hack, but I have been able to get much better performance this way.
> Something like your stream proposal would be a much better solution. Would this require a new array type (e.g. array stream) to be defined for the language?
Depends on exactly what the requirements are.
* Given the kernels that you are looking at, is there always just one array that is being streamed in? (There may of course be other arrays involved that do not change on a frame by frame basis.)
* Is there also an array that you want to stream out? (If there is one in and one out, then we'd have a classic pipeline. If the output is a scalar, then that would still be a DIM0 array in Accelerate.)
We might be able to use the following interface:
stream :: (Array dim e -> Acc (Array dim' e')) -> [Array dim e] -> [Array dim' e']
The lazy list '[Array dim e]' would be fed to the kernel as list elements become available (which might be frames from a video camera); similarly, the result list would be incrementally produced as the kernel processes each frame. The result could, for example, be consumed by a function that displays the processed frames on the screen as they become available.
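To make the intended usage concrete, here is a minimal CPU-only analogue of the proposed 'stream' (a sketch only, not part of the proposal): the kernel is modelled as a pure function, and laziness alone provides the incremental behaviour. In the real backend, the function argument would be compiled once and each list element executed on the GPU.

    -- Sketch only: a pure, CPU-side analogue of the proposed 'stream'.
    -- In the CUDA backend each kernel application would run on the device.
    import Data.Array (Array, listArray, amap)

    streamCPU :: (Array Int e -> Array Int e') -> [Array Int e] -> [Array Int e']
    streamCPU kernel = map kernel   -- lazy: frames are processed as they arrive

    -- Hypothetical example: brighten each "frame" as it becomes available.
    frames :: [Array Int Int]
    frames = [ listArray (0, 2) [i, i + 1, i + 2] | i <- [0, 10, 20] ]

    brightened :: [Array Int Int]
    brightened = streamCPU (amap (+ 50)) frames

Because the result list is produced incrementally, a consumer (say, a display loop) can start rendering processed frames before the whole input stream has been seen.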
This approach would not only avoid re-analysing the kernel code in the front-end; an optimised implementation would also push incoming frames to the device (via CUDA's host->device copies) in parallel with the processing of the preceding frame, where the hardware has that capability (eg, on Tesla GPUs), and similarly transfer outgoing frames back to host memory while the device is kept busy with subsequent frames. (AFAIK high-end devices can simultaneously perform host->device copies, device->host copies, and local device computation. For best performance, you need to overlap all three.)
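The overlap of the three stages can be sketched on the host side as a small pipeline of concurrent workers (a sketch under assumptions: 'upload', 'execute', and 'download' are hypothetical stand-ins for the real backend's transfer and kernel-launch operations, not existing API):

    -- Sketch only: three pipeline stages connected by channels, so that
    -- uploading frame n+1 can proceed while frame n is still executing
    -- and frame n-1 is being downloaded.
    import Control.Concurrent (forkIO)
    import Control.Concurrent.Chan

    pipeline :: (a -> IO b) -> (b -> IO c) -> (c -> IO d) -> [a] -> IO [d]
    pipeline upload execute download xs = do
      uploaded <- newChan
      executed <- newChan
      _ <- forkIO $ mapM_ (\x -> upload x >>= writeChan uploaded) xs
      _ <- forkIO $ mapM_ (\_ -> readChan uploaded >>= execute >>= writeChan executed) xs
      mapM (\_ -> readChan executed >>= download) xs

In the actual backend the stages would map onto asynchronous CUDA copies and kernel launches rather than Haskell threads, but the scheduling structure is the same.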
What do you think about that approach?