static code generation and open Accs
Rami.Mukhtar at nicta.com.au
Wed Aug 25 21:11:54 EDT 2010
On 24/08/2010, at 10:22 PM, Manuel M T Chakravarty wrote:
> Hi Rami,
>>>> The front end AST conversion and look up for a cached version of the code is slow (i.e. excluding compilation and code generation). With the current version of the front-end and CUDA back-end it takes ~170ms. With the proposed variant of 'run' won't this penalty still be incurred each time the 'run' variant is called? I believe that there is lots of scope for speeding this up by optimising the language front-end and CUDA back-end. However, my guess is that the penalty is always going to be an issue for applications that repeat the computation at a high frequency.
>>> Trevor mentioned that you did some measurements. I'm surprised that the front-end takes so long, but on the other hand, I haven't looked at performance at all so far. It should be possible to improve the performance significantly.
>> Agree. I haven't tried doing this yet - it has potential.
> I can have a look at the performance of the front-end after I have tackled the outstanding implementation of sharing.
Appreciate you prioritising sharing :) Having sharing would help us a lot with performance.
>>>> I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.
>>>> Is there a way to achieve this?
>>> I'd suggest to first understand the reason for the bad performance. It is likely that we can improve performance up to a point where it is no significant overhead anymore.
>>> In your case, I guess, the requirement for repeated invocations comes from processing a stream of data (eg, stream of video frames), where you want to execute the same Accelerate computation on each element of the stream. That may very well be an idiom for which we should add special support. After all, we know a priori that a particular set of computations needs to be invoked many times per second. Would that address your problem?
>> Correct. Your proposal would address our problem. At this stage I am experimenting with adding a function 'step' to the CUDA backend that has the following signature:
>> step :: Arrays a => Array dim e -> (Acc a, CUDAState) -> IO (a, CUDAState)
>> This enables me to repeat the same Accelerate computation on a mutable 'input' array (identified by the first argument), which I unsafely mutate between iterations. It is a very ugly hack, but I have been able to get much better performance by doing this.
>> Something like your stream proposal would be a much better solution. Would this require a new array type (e.g. array stream) to be defined for the language?
> Depends on exactly what the requirements are.
> * Given the kernels that you are looking at, is there always just one array that is being streamed in? (There may of course be other arrays involved that do not change on a frame by frame basis.)
> * Is there also an array that you want to stream out? (If there is one in and one out, then we'd have a classic pipeline. If the output is a scalar, then that would still be a DIM0 array in Accelerate.)
> We might be able to use the following interface:
> stream :: (Array dim e -> Acc (Array dim' e')) -> [Array dim e] -> [Array dim' e']
> The lazy list '[Array dim e]' would be fed to the kernel as list elements become available (which might be frames from a video camera); similarly, the result list would be incrementally produced as the kernel processes each frame. The result could, for example, be consumed by a function that displays the processed frames on the screen as they become available.
> This approach would ensure that we not only avoid re-analysing the kernel code in the frontend, but an optimised implementation would also ensure that incoming frames are pushed to the device (via CUDA's host->device copying) in parallel with the processing of the preceding frame (and where the hardware has that capability, eg, on Tesla GPUs), and similarly outgoing frames would be transferred back to host memory while the device is being kept busy with subsequent frames. (AFAIK high-end devices can simultaneously perform host->device, device->host, and local device computations. For best performance, you need to overlap all of them.)
> What do you think about that approach?
The requirement of a single input is too restrictive for computer vision. We need support for multiple input streams for each kernel. For example, a gradient orientation calculation would require the X and Y gradient images. Maybe we could define stream2, stream3, and stream variants - these could then be combined by the user to process an arbitrary number of input streams.
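A minimal sketch of what a two-input variant could look like, using a toy model in which a frame is a plain Haskell list rather than an Accelerate array. The names Frame and gradientOrientation, and this definition of stream2, are illustrative assumptions only and not part of the Accelerate API:

```haskell
-- Toy model only: a "frame" is a plain list of Floats,
-- not an Accelerate Array. Nothing here touches the GPU.
type Frame = [Float]

-- Hypothetical two-input stream: frames from both inputs are consumed
-- pairwise and lazily, and one result frame is produced per pair.
stream2 :: (Frame -> Frame -> Frame) -> [Frame] -> [Frame] -> [Frame]
stream2 kernel = zipWith kernel

-- Example kernel: per-pixel gradient orientation computed from the
-- X and Y gradient images.
gradientOrientation :: Frame -> Frame -> Frame
gradientOrientation = zipWith (\gx gy -> atan2 gy gx)
```

Loaded into GHCi, `stream2 gradientOrientation xs ys` yields one orientation frame per pair of input frames. Because zipWith is lazy, the inputs may be unbounded lists, and the output ends when the shorter input does.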
The other two aspects of the approach which are not clear to me at the moment are:
1) Would using lazy lists preclude a backend from using static memory allocation? Allocating memory on the heap and performing GC for list elements may be too slow (since each element is a frame - they are likely to be large).
2) Does the suggested approach allow the user to control the scheduling of computation steps (e.g. by using the Intel.Cnc library or similar)?
It may be better to define an accelerate 'step' along the lines of your stream function:
runStep :: AccStep (Acc (Array dim e) -> Acc (Array dim' e')) -> Acc (Array dim e) -> Array dim' e'
where the AccStep monad (a different monad type would be needed for each device type) denotes a compiled Accelerate function, in much the same way that the 'use' function may trigger a host->device transfer:
prepareStep :: (Acc (Array dim e) -> Acc (Array dim' e')) -> AccStep (Acc (Array dim e) -> Acc (Array dim' e'))
Not sure if the above makes sense (my Haskell is still rudimentary), and not sure if this can be used to efficiently schedule host->device copying.
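To make the intent of the prepare/run split concrete, here is a deliberately simplified, pure sketch. It assumes a toy model in which frames are plain lists, and a Prepared wrapper stands in for the AccStep monad. Frame, Prepared and processAll are hypothetical names, and the signatures differ from the ones proposed above:

```haskell
-- Toy model only: nothing here is Accelerate or CUDA API.
type Frame = [Float]

-- Stand-in for a compiled kernel. In a real backend this would hold
-- whatever the one-off compilation produced.
newtype Prepared = Prepared (Frame -> Frame)

-- One-off preparation. This is where AST conversion, cache lookup and
-- code generation would happen; in this model it only wraps the function.
prepareStep :: (Frame -> Frame) -> Prepared
prepareStep = Prepared

-- Per-frame step: reuses the prepared kernel, so none of the
-- preparation cost is paid again.
runStep :: Prepared -> Frame -> Frame
runStep (Prepared kernel) = kernel

-- Apply one prepared kernel to a whole (possibly lazy) list of frames.
processAll :: Prepared -> [Frame] -> [Frame]
processAll prepared = map (runStep prepared)
```

The point of the split is that prepareStep is called once per kernel, while runStep is called once per frame. This sketch does not address scheduling of host->device copies.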