static code generation and open Accs

Tue Aug 24 04:13:58 EDT 2010

Hi Rami,

> The front-end AST conversion and lookup of a cached version of the code is slow (i.e. excluding compilation and code generation).  With the current version of the front-end and CUDA back-end it takes ~170 ms.  With the proposed variant of 'run', won't this penalty still be incurred each time the 'run' variant is called?  I believe that there is lots of scope for speeding this up by optimising the language front-end and CUDA back-end.  However, my guess is that the penalty is always going to be an issue for applications that repeat the computation at a high frequency.

Trevor mentioned that you did some measurements.  I'm surprised that the front-end takes so long, but on the other hand, I haven't looked at performance at all so far.  It should be possible to improve the performance significantly.

> I guess there needs to be a way of bypassing the overhead of AST conversion and code lookup - so that there is a way of directly invoking the computation again with new 'input data'.
> 
> Is there a way to achieve this?

I'd suggest first understanding the reason for the bad performance.  It is likely that we can improve performance to the point where it is no longer a significant overhead.

In your case, I guess, the requirement for repeated invocations comes from processing a stream of data (e.g., a stream of video frames), where you want to execute the same Accelerate computation on each element of the stream.  That may very well be an idiom for which we should add special support.  After all, we know a priori that a particular set of computations needs to be invoked many times per second.  Would that address your problem?
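
To make the idiom concrete, here is a minimal sketch in plain Haskell.  It does not use Accelerate at all: `Expr`, `prepare`, `runEach` and `runStream` are hypothetical names invented for illustration, and the toy `prepare` merely stands in for the front-end conversion and cache lookup.  The counter only records how often that step is performed.

```haskell
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- Hypothetical toy expression language standing in for an embedded
-- array computation (this is NOT the Accelerate AST).
data Expr
  = Input            -- the per-invocation input value
  | Lit Int
  | Add Expr Expr
  | Mul Expr Expr

-- Hypothetical "front end": convert the AST into an executable function.
-- The counter records how many times this conversion is performed.
prepare :: IORef Int -> Expr -> IO (Int -> Int)
prepare counter expr = do
  modifyIORef' counter (+ 1)
  return (compile expr)
  where
    compile Input     = id
    compile (Lit n)   = const n
    compile (Add a b) = let fa = compile a
                            fb = compile b
                        in \x -> fa x + fb x
    compile (Mul a b) = let fa = compile a
                            fb = compile b
                        in \x -> fa x * fb x

-- Pay the conversion cost on every invocation.
runEach :: IORef Int -> Expr -> [Int] -> IO [Int]
runEach counter expr = mapM (\x -> fmap ($ x) (prepare counter expr))

-- Pay the conversion cost once, then reuse the result for every
-- element of the stream.
runStream :: IORef Int -> Expr -> [Int] -> IO [Int]
runStream counter expr xs = do
  f <- prepare counter expr
  return (map f xs)

main :: IO ()
main = do
  let expr   = Add (Mul Input (Lit 2)) (Lit 1)   -- 2*x + 1
      frames = [1, 2, 3]                         -- stand-in for video frames
  c1  <- newIORef 0
  ys1 <- runEach c1 expr frames
  n1  <- readIORef c1
  c2  <- newIORef 0
  ys2 <- runStream c2 expr frames
  n2  <- readIORef c2
  print (ys1, n1)   -- ([3,5,7],3): one conversion per frame
  print (ys2, n2)   -- ([3,5,7],1): a single conversion for the whole stream
```

Both variants compute the same results; the only difference is that the second one performs the conversion once for the whole stream.  Whether special support in Accelerate would take this shape is an open question.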

Cheers,
Manuel