static code generation and open Accs
Rami.Mukhtar at nicta.com.au
Tue Aug 24 04:31:11 EDT 2010
On 24/08/2010, at 6:13 PM, Manuel M T Chakravarty wrote:
> Hi Rami,
>> The front end AST conversion and look up for a cached version of the code is slow (i.e. excluding compilation and code generation). With the current version of the front-end and CUDA back-end it takes ~170ms. With the proposed variant of 'run' won't this penalty still be incurred each time the 'run' variant is called? I believe that there is lots of scope for speeding this up by optimising the language front-end and CUDA back-end. However, my guess is that the penalty is always going to be an issue for applications that repeat the computation at a high frequency.
> Trevor mentioned that you did some measurements. I'm surprised that the front-end takes so long, but on the other hand, I haven't looked at performance at all so far. It should be possible to improve the performance significantly.
Agreed. I haven't tried doing this yet, but it has potential.
>> I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.
>> Is there a way to achieve this?
> I'd suggest to first understand the reason for the bad performance. It is likely that we can improve performance up to a point where it is no significant overhead anymore.
> In your case, I guess, the requirement for repeated invocations comes from processing a stream of data (eg, stream of video frames), where you want to execute the same Accelerate computation on each element of the stream. That may very well be an idiom for which we should add special support. After all, we know a priori that a particular set of computations needs to be invoked many times per second. Would that address your problem?
Correct. Your proposal would address our problem. At this stage I am experimenting with adding a function 'step' to the CUDA backend that has the following signature:
step :: Arrays a => Array dim e -> (Acc a, CUDAState) -> IO (a, CUDAState)
This lets me repeat the same Accelerate computation on a mutable 'input' array (identified by the first argument), which I unsafely mutate between iterations. It is an ugly hack, but I have been able to get much better performance by doing this.
Something like your stream proposal would be a much better solution. Would this require a new array type (e.g. array stream) to be defined for the language?
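To make the idea concrete, here is a minimal sketch of the "prepare once, run on every stream element" idiom discussed above. It is plain Haskell and does not use Accelerate at all: `Prepared`, `prepare`, `runPrepared` and `runStream` are hypothetical names invented for illustration, the list function stands in for an `Acc` computation, and the counter only records how often the (here simulated) front-end work of AST conversion and cache lookup is performed.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- Hypothetical handle for a computation whose one-off set-up work
-- (AST conversion, cache lookup, kernel loading) is already done.
newtype Prepared a b = Prepared (a -> IO b)

-- Do the set-up once. The counter only records how often set-up runs;
-- a real backend would perform the conversion and lookup here.
prepare :: IORef Int -> (a -> b) -> IO (Prepared a b)
prepare setupCount f = do
  modifyIORef' setupCount (+ 1)
  return (Prepared (\x -> return $! f x))

-- Run an already-prepared computation on one new input.
runPrepared :: Prepared a b -> a -> IO b
runPrepared (Prepared k) = k

-- Apply the same prepared computation to every element of a stream.
runStream :: Prepared a b -> [a] -> IO [b]
runStream p = mapM (runPrepared p)

main :: IO ()
main = do
  setupCount <- newIORef 0
  p    <- prepare setupCount (map (* 2))  -- stand-in for an Acc computation
  outs <- runStream p [[1, 2], [3, 4], [5, 6 :: Int]]
  n    <- readIORef setupCount
  print outs
  print n
```

In this sketch the set-up counter stays at 1 however many stream elements are processed, which is the property a stream-aware 'run' would need. Unlike the 'step' experiment, nothing is mutated in place: each element is passed in as a fresh input.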