[Bf-committers] Patch 35559 submitted, Fix possible very slow startup time for OpenCL renderer

Wed May 29 10:36:56 CEST 2013

This patch implements a context and program cache for OpenCL. The OpenCL 
renderer can take a very long time to start up on some implementations. 
For example, Intel CPU OpenCL takes 9 seconds to start up on a Core i7 
990X Extreme Edition, even when loading a cached program binary. This 
startup cost is paid every time the user renders. With this patch, the 
context and program are reused, so startup is effectively instant once 
the cache has been populated.

The implementation maintains a single process-wide, thread-safe object 
that contains one map for contexts and one for programs. Since a program 
belongs to a context, the two need to be maintained together. The cache 
itself is lazily instantiated: no instance is constructed until the 
first time it is accessed.
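
A minimal sketch of such a lazily constructed, process-wide cache is 
below. It is not the code from the patch: FakeContext, FakeProgram, 
OpenCLCacheSketch and the string keys are hypothetical stand-ins for 
cl_context, cl_program and whatever key the real cache uses.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-ins for cl_context and cl_program handles.
struct FakeContext { int id; };
struct FakeProgram { int id; };

class OpenCLCacheSketch {
public:
    // Counts constructions so that laziness can be observed from outside.
    static int &construction_count() {
        static int count = 0;
        return count;
    }

    // Function-local static: constructed on the first call only. Since
    // C++11 this initialization is guaranteed to be thread-safe.
    static OpenCLCacheSketch &instance() {
        static OpenCLCacheSketch cache;
        return cache;
    }

    // Returns nullptr on a miss. The key is an assumed device identifier.
    FakeContext *find_context(const std::string &key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = contexts_.find(key);
        return it == contexts_.end() ? nullptr : it->second;
    }

    FakeProgram *find_program(const std::string &key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = programs_.find(key);
        return it == programs_.end() ? nullptr : it->second;
    }

    // Both store functions return true only if the key was not present yet.
    bool store_context(const std::string &key, FakeContext *context) {
        assert(context != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.insert(std::make_pair(key, context)).second;
    }

    bool store_program(const std::string &key, FakeProgram *program) {
        assert(program != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        return programs_.insert(std::make_pair(key, program)).second;
    }

private:
    OpenCLCacheSketch() { ++construction_count(); }
    OpenCLCacheSketch(const OpenCLCacheSketch &) = delete;
    OpenCLCacheSketch &operator=(const OpenCLCacheSketch &) = delete;

    std::mutex mutex_;  // one lock guards both maps, keeping them consistent
    std::map<std::string, FakeContext *> contexts_;
    std::map<std::string, FakeProgram *> programs_;
};
```

Nothing is constructed until instance() is first called, and every 
later call returns the same object. Guarding both maps with one mutex is 
a choice made for this sketch; it keeps a program and its context from 
being observed in an inconsistent state.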

OpenCL objects are reference counted, and the cache takes advantage of 
that by using "retain" calls. This allows the OpenCLDevice 
implementation to simply release an object when it is finished with it. 
Every fetch from the cache assumes that the caller will release the 
object (as usual) when finished with it, so the cache retains the object 
on each fetch.
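
The retain-on-fetch rule can be sketched as follows. This is an 
illustration, not the patch itself: FakeClObject, fake_retain and 
fake_release are hypothetical stand-ins that mimic the reference 
counting done by calls such as clRetainProgram and clReleaseProgram, and 
RetainingCacheSketch and render_once are invented names.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a reference-counted OpenCL object.
struct FakeClObject {
    int refcount = 1;        // the creator starts out holding one reference
    bool destroyed = false;  // set when the last reference is released
};

// Stand-ins for the OpenCL retain and release calls.
void fake_retain(FakeClObject *obj) { ++obj->refcount; }
void fake_release(FakeClObject *obj) {
    assert(obj->refcount > 0);
    if (--obj->refcount == 0) obj->destroyed = true;
}

class RetainingCacheSketch {
public:
    // On a hit, take one extra reference on behalf of the caller, who is
    // expected to release it as usual. Returns nullptr on a miss.
    FakeClObject *fetch(const std::string &key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = objects_.find(key);
        if (it == objects_.end()) return nullptr;
        fake_retain(it->second);
        return it->second;
    }

    // The cache keeps its own reference to whatever it stores.
    void store(const std::string &key, FakeClObject *obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (objects_.insert(std::make_pair(key, obj)).second) fake_retain(obj);
    }

private:
    std::mutex mutex_;
    std::map<std::string, FakeClObject *> objects_;
};

// Hypothetical caller: it releases unconditionally when done, without
// needing to know that the object came from a cache.
bool render_once(RetainingCacheSketch &cache, const std::string &key) {
    FakeClObject *program = cache.fetch(key);
    if (program == nullptr) return false;
    // ... kernels would be enqueued here ...
    fake_release(program);
    return true;
}
```

Because each fetch is paired with exactly one release by the caller, the 
count returns to the cache's own reference after every render and the 
cached object is never destroyed underneath the cache.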

A race condition is possible where two threads do a lookup and neither 
finds the object. Both threads then proceed with compilation, and both 
try to insert into the cache. Access to the cache data itself is 
protected by a mutex, so the race loser's map insert fails. No retain 
call is issued for the loser's object, so it is really destroyed when 
the loser releases it.
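
The handling of the race loser can be sketched as below, assuming the 
cache is backed by a std::map, whose insert leaves the map unchanged and 
reports failure when the key already exists. CompiledProgram, 
retain_program, release_program and ProgramCacheSketch are hypothetical 
names, not the ones used in the patch.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a reference-counted cl_program.
struct CompiledProgram {
    int refcount = 1;        // held by the thread that compiled it
    bool destroyed = false;  // set when the last reference is released
};

void retain_program(CompiledProgram *p) { ++p->refcount; }
void release_program(CompiledProgram *p) {
    assert(p->refcount > 0);
    if (--p->refcount == 0) p->destroyed = true;
}

class ProgramCacheSketch {
public:
    // Retains on a hit so that the caller can release as usual.
    CompiledProgram *lookup(const std::string &key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = programs_.find(key);
        if (it == programs_.end()) return nullptr;
        retain_program(it->second);
        return it->second;
    }

    // Returns true if the program was stored. If another thread stored a
    // program under the same key first, the insert fails and no retain is
    // issued, so the loser's later release destroys its duplicate.
    bool store(const std::string &key, CompiledProgram *program) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool inserted = programs_.insert(std::make_pair(key, program)).second;
        if (inserted) retain_program(program);
        return inserted;
    }

private:
    std::mutex mutex_;
    std::map<std::string, CompiledProgram *> programs_;
};
```

If two threads both miss on lookup and both compile, the first store 
succeeds and leaves that program with two references (the compiling 
thread's and the cache's). The second store returns false, its program 
keeps a single reference, and the normal release at the end of the 
render frees it. Neither thread needs a special code path for losing.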

Besides the render startup time fix, this patch also changes the 
clFinish issued after enqueuing the kernel to a clFlush. The clFinish is 
completely unnecessary, but clFlush ensures that the device begins 
working on the kernel as soon as possible, without waiting until a 
blocking copy. We definitely don't want the compute hardware to be idle 
when enqueue_kernel returns, which is how it is without this change. We 
already block at the next memory copy, which is enough to maintain full 
coherency.

I also partially changed the implementation of mem_alloc. That is part 
of a change I have in progress that uses unified memory (zero-copy) for 
capable OpenCL devices.