[Bf-committers] Patch 35559 submitted, Fix possible very slow startup time for OpenCL renderer
Doug Gale
doug65536 at gmail.com
Wed May 29 10:36:56 CEST 2013
This patch implements a context and program cache for OpenCL. The OpenCL
renderer can have a very slow startup time on some implementations. For
example, Intel CPU OpenCL takes 9 seconds to start up on a Core i7 990X
Extreme Edition, even when loading a cached program binary. This delay
is incurred every time the user renders. With this patch, startup is
instant once the context and program are in the cache.
The implementation maintains a single process-wide, thread-safe object
that contains one map for contexts and one for programs. Since a program
belongs to a context, the two need to be maintained together. The cache
itself is lazily instantiated: no instance is constructed until the
first time it is accessed.
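To illustrate the shape of such a cache, here is a minimal sketch, not
the code from the patch. DeviceCache, ContextHandle, ProgramHandle and
the string keys are hypothetical stand-ins (the real code would hold
cl_context and cl_program handles). It relies on a C++11 function-local
static for the lazy, thread-safe construction, and on one mutex guarding
both maps.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-ins for cl_context / cl_program handles.
using ContextHandle = int;
using ProgramHandle = int;

class DeviceCache {
public:
    // Function-local static: constructed on the first call only, and
    // that construction is thread-safe since C++11.
    static DeviceCache &instance() {
        static DeviceCache cache;
        return cache;
    }

    // Returns true and fills `out` if a context is cached for the device.
    bool find_context(const std::string &device_key, ContextHandle &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = contexts_.find(device_key);
        if (it == contexts_.end()) return false;
        out = it->second;
        return true;
    }

    // Returns true if stored, false if an entry already existed.
    bool store_context(const std::string &device_key, ContextHandle ctx) {
        std::lock_guard<std::mutex> lock(mutex_);
        return contexts_.emplace(device_key, ctx).second;
    }

    // Programs are keyed by (context, name) because a program only
    // makes sense together with the context it was built for.
    bool find_program(ContextHandle ctx, const std::string &name,
                      ProgramHandle &out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = programs_.find(std::make_pair(ctx, name));
        if (it == programs_.end()) return false;
        out = it->second;
        return true;
    }

    // Returns true if stored, false if an entry already existed.
    bool store_program(ContextHandle ctx, const std::string &name,
                       ProgramHandle prog) {
        assert(!name.empty());
        std::lock_guard<std::mutex> lock(mutex_);
        return programs_.emplace(std::make_pair(ctx, name), prog).second;
    }

private:
    DeviceCache() = default;

    std::mutex mutex_;  // guards both maps
    std::map<std::string, ContextHandle> contexts_;
    std::map<std::pair<ContextHandle, std::string>, ProgramHandle> programs_;
};
```

A caller would try find_context() first, and only create and store a
new context on a miss. Reference counting is left out of this sketch.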
OpenCL objects are reference counted, and the cache takes advantage of
that by using "retain" calls. This allows the OpenCLDevice
implementation to simply release the object when it is finished with
it. Each time an object is fetched from the cache, the caller is
assumed to release it (as usual) when finished with it, so the cache
retains the object on every fetch.
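The retain-on-fetch rule can be sketched with a mock reference-counted
object. This is an illustration under assumptions, not the patch itself:
MockObject, mock_retain and mock_release are hypothetical stand-ins for
an OpenCL handle and its retain and release calls, and the mock never
frees anything so that the counts can be inspected.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a reference-counted OpenCL object.
struct MockObject {
    int refcount = 1;  // the creator starts out holding one reference
};

// Stand-ins for the retain/release calls. A real release would free
// the object when the count reaches zero.
void mock_retain(MockObject *o) { ++o->refcount; }
void mock_release(MockObject *o) {
    assert(o->refcount > 0);
    --o->refcount;
}

class RetainingCache {
public:
    // Returns a retained object, or nullptr on a miss. The caller
    // releases the returned object when finished, as it would for an
    // object it had created itself.
    MockObject *get(const std::string &key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = map_.find(key);
        if (it == map_.end()) return nullptr;
        mock_retain(it->second);  // this reference belongs to the caller
        return it->second;
    }

    // Stores the object and retains it on success, so the cache holds
    // its own reference independent of the caller's.
    bool store(const std::string &key, MockObject *obj) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool inserted = map_.emplace(key, obj).second;
        if (inserted) mock_retain(obj);
        return inserted;
    }

private:
    std::mutex mutex_;
    std::map<std::string, MockObject *> map_;
};
```

With this arrangement the count returns to 1 (the cache's own
reference) after every caller has released, so the cached object stays
alive between renders.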
A race condition is possible, where two threads do a lookup and neither
finds the object. Both threads will proceed with compilation and both
will try to insert into the cache. Access to the cache data itself is
protected by a mutex, so the race loser's map insert fails and no
retain call is issued for its object. The loser's object is therefore
really freed when the loser releases it.
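The outcome of that race can be sketched by replaying the two inserts
one after the other, which is the order the mutex forces on them anyway.
This is a sketch under assumptions: MockProgram, ProgramCache and
finish_after_miss are hypothetical names, and the mock only records that
it would have been freed instead of actually freeing memory.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a reference-counted program object.
struct MockProgram {
    int refcount = 1;    // held by the thread that compiled it
    bool freed = false;  // set when the last reference is dropped
};

void mock_retain(MockProgram *p) { ++p->refcount; }
void mock_release(MockProgram *p) {
    assert(p->refcount > 0);
    if (--p->refcount == 0) p->freed = true;
}

class ProgramCache {
public:
    // Inserts and retains only if the key is not present yet.
    // Returns false for the race loser, whose object is left untouched.
    bool try_insert(const std::string &key, MockProgram *prog) {
        std::lock_guard<std::mutex> lock(mutex_);
        bool inserted = map_.emplace(key, prog).second;
        if (inserted) mock_retain(prog);
        return inserted;
    }

private:
    std::mutex mutex_;
    std::map<std::string, MockProgram *> map_;
};

// What each thread does after a cache miss and its own compilation:
// offer the result to the cache, use it, then release it as usual.
bool finish_after_miss(ProgramCache &cache, const std::string &key,
                       MockProgram *compiled) {
    bool won = cache.try_insert(key, compiled);
    // ... rendering with `compiled` would happen here ...
    mock_release(compiled);
    return won;
}
```

The winner's program ends with one reference left (the cache's), while
the loser's program drops to zero and is freed, so nothing leaks and
both threads run the same release code.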
Besides the render startup time fix, this patch also changes the
clFinish after issuing the kernel to a clFlush. clFinish is completely
unnecessary there, but clFlush ensures that the device begins working
on the kernel as soon as possible, without waiting until a blocking
copy. We definitely don't want the compute hardware to be idle when
enqueue_kernel returns, which is how it is without this change. We
currently block at the next memory copy, which already maintains full
coherency.
I also partially changed the implementation of mem_alloc, as part of a
change I have in progress that uses unified memory (zero-copy) on
capable OpenCL devices.