static code generation and open Accs

Fri Aug 27 00:38:27 EDT 2010

Hi Rami,

>>>>> I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.
>>>>> 
>>>>> Is there a way to achieve this?
>>>> 
>>>> I'd suggest to first understand the reason for the bad performance.  It is likely that we can improve performance up to a point where it is no significant overhead anymore.
>>>> 
>>>> In your case, I guess, the requirement for repeated invocations comes from processing a stream of data (eg, stream of video frames), where you want to execute the same Accelerate computation on each element of the stream.  That may very well be an idiom for which we should add special support.  After all, we know a priori that a particular set of computations needs to be invoked many times per second.  Would that address your problem?
>>> 
>>> Correct.  Your proposal would address our problem.  At this stage I am experimenting with adding a function 'step' to the CUDA backend that has the following signature:
>>> 
>>> step :: Arrays a => Array dim e -> (Acc a, CUDAState) -> IO (a, CUDAState)
>>> 
>>> This enables me to repeat the same Accelerate computation on a mutable 'input' array (as identified by the first argument).  Where I am unsafely mutating the input array between each iteration.  It is very ugly and a hack, but I have been able to get much better performance by doing this.
>>> 
>>> Something like your stream proposal would be a much better solution.  Would this require a new array type (e.g. array stream) to be defined for the language?
>> 
>> Depends on exactly what the requirements are.
>> 
>> * Given the kernels that you are looking at, is there always just one array that is being streamed in?  (There may of course be other arrays involved that do not change on a frame by frame basis.)
>> 
>> * Is there also an array that you want to stream out?  (If there is one in and one out, then we'd have a classic pipeline.  If the output is a scalar, then that would still be a DIM0 array in Accelerate.)
>> 
>> We might be able to use the following interface:
>> 
>> stream :: (Array dim e -> Acc (Array dim' e')) -> [Array dim e] -> [Array dim' e']
>> 
>> The lazy list '[Array dim e]' would be fed to the kernel as list elements become available (which might be frames from a video camera); similarly, the result list would be incrementally produced as the kernel processes each frame.  The result could, for example, be consumed by a function that displays the processed frames on the screen as they become available.
>> 
>> This approach would ensure that we not only avoid re-analysing the kernel code in the frontend, but an optimised implementation would also ensure that incoming frames are pushed to the device (via CUDA's host->device copying) in parallel with the processing of the preceding frame (and where the hardware has that capability, eg, on Tesla GPUs), and similarly outgoing frames would be transferred back to host memory while the device is being kept busy with subsequent frames.  (AFAIK high-end devices can simultaneously perform host->device, device->host, and local device computations.  For best performance, you need to overlap all of them.)
>> 
>> What do you think about that approach?
> 
> The requirement of a single input is too restrictive for computer vision.  We need support for multiple input streams for each kernel.  For example a gradient orientation calculation would require the X and Y gradient images.  Maybe we could define stream2, stream3, and stream variants - these could then be combined by the user to process an arbitrary number if input streams.

An extension to multiple input and output streams should be quite simple *assuming* that the streams are in sync, are they?  If so, we can use the existing Arrays class

  stream :: (Arrays as, Arrays bs) => (as -> Acc bs) -> [as] -> [bs]

> The other two aspects of the approach which are not clear to me at the moment are:
> 
> 1) Would using lazy lists preclude a backend from using static memory allocation?  Allocating memory on the heap and performing GC for list elements may be too slow (since each element is a frame - they are likely to be large).

The GHC storage manager already handles large objects specially; in particular, large arrays of unboxed values —which is what we have got here— are not scanned during GC.  The GC'ed list in the heap just contains a pointer to each array, not the array itself.  In my experience, it is usually best to let the storage manager do its job unless there is concrete evidence that its policies are not appropriate for a particular program (and then there are many options with which it can be tuned).

> 2) Does the suggested approach allow the user to control the scheduling of computation steps (e.g. by using the Intel.Cnc library or similar)?

What exactly do you like to control?  The applications does have control over the scheduling in so far as Haskell's lazy evaluation allows the application to generate the input stream at its own pace.  Moreover, any buffering needed to handle a mismatch between the input frame rate and the execution speed of the kernel on the GPU will be automatic.

> It may be better to define an accelerate 'step' along the lines of your stream function:
> 
> runStep :: AccStep (Acc (Array dim e) -> Acc (Array dim' e')) -> Acc (Array dim e) -> Array dim' e
> 
> where the AccStep monad (would need a different monad type for each device type) denotes a compiled Accelerate function, in a similar way to the use function may trigger a host->device transfer:
> 
> prepareStep :: (Acc (Array dim e) -> Acc (Array dim' e')) -> AccStep (Acc (Array dim e) -> Acc (Array dim' e'))
> 
> Not sure if the above makes sense (my Haskell is still rudimentary), and not sure if this can be used to efficiently schedule host->device copying.

I don't think that this gives you any more control than using a lazy list.  In fact, it gives you less, as runStep would suspend (on the host) for the entire duration of the GPU computation and data transfer.  As a result, it is probably much harder to overlap transfer with computation.

Personally, I'd use lazy lists as streams and see how that goes.  If we identify a performance bottleneck and that requires more explicit control, sure, let us move to a more complicated API that breaks the whole process into more smaller components.  But then we know at least where the bottleneck is and can address that specifically.

Cheers,
Manuel