[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

Yves Poissant ypoissant2 at videotron.ca
Sat Dec 20 15:53:20 CET 2008


> Of course it is not very important to optimize (or replace), for instance, a
> function which is called only ONCE and whose execution time is 0.01% of the
> total. So, for instance, we have vlr_check_intersect() called 87 million
> times and testnode() 40 million times. That should also take into account
> the number of reciprocal (or recursive) calls.
> You may also do some speculation and "forecasting" about what happens if
> that function could be made 10 times faster or 100 times faster. But by
> looking at the same functions (or the functions from which they are called
> most of the time) you can also see whether they can be CUDA-ized (or
> CL-ized) or not. Of course the extreme is to "rewrite" a new independent
> CUDA|OpenCL (realtime) engine. But that in an "ideal world"...
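
The forecasting part is easy to put into numbers: it is just Amdahl's law. If
a function accounts for a fraction p of the total time and is made s times
faster, the whole render only speeds up by 1/((1 - p) + p/s), so making
something that takes 40% of the time 10 times faster buys only about 1.6x
overall. A back-of-the-envelope sketch (the fractions below are placeholders,
not measured Blender numbers):

/* Amdahl's law: overall speedup when a fraction `p` of the run time
 * is made `s` times faster and the rest is left untouched. */
#include <stdio.h>

static double amdahl(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void)
{
    printf("40%% of time, 10x faster  -> %.2fx overall\n", amdahl(0.4, 10.0));
    printf("40%% of time, 100x faster -> %.2fx overall\n", amdahl(0.4, 100.0));
    printf("80%% of time, 10x faster  -> %.2fx overall\n", amdahl(0.8, 10.0));
    return 0;
}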

That is exactly the part of the rendering pipeline I worked on. Even when 
using the most optimized algorithms known, I couldn't get very convincing 
improvements. There are a few ugly things going on in there; for instance, 
the code that checks for self-intersection is extremely inefficient. But 
avoiding that would require a rewrite of how the data is prepared upstream. 
Matt Ebb did a rewrite of that data preparation for v2.46, I think, which had 
the effect that those checks could be avoided. It caused other rendering 
artifacts, so I don't know if those changes are still there. This may not be 
an easy issue to solve, depending on how deep the preparation goes. If we 
could avoid having to check for self-intersection right in the traversal, we 
would already save significant time IMO. But the real culprit is not there.
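
On the self-intersection check specifically: the usual ways to get rid of it
are either to tag the ray with the face it was spawned from and reject only
that face, or to have the data preparation push the ray off the surface so
the check disappears from the inner loop. A minimal sketch, with made-up
names (this is not Blender's actual intersection API):

/* Illustrative only: Ray, skip_face and accept_hit are invented names. */
typedef struct Ray {
    float origin[3], dir[3];
    float tmin, tmax;   /* clipping the near end with a small epsilon is
                           the "offset the ray origin" variant */
    int   skip_face;    /* index of the face the ray started on, or -1 */
} Ray;

/* Called for every candidate face during traversal.  If the upstream data
 * preparation guarantees rays never start exactly on a face, the skip_face
 * compare can be dropped from this hot path entirely. */
static int accept_hit(const Ray *ray, int face_index, float t)
{
    if (face_index == ray->skip_face)
        return 0;                      /* reject self-intersection */
    return (t > ray->tmin && t < ray->tmax);
}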

For accelerating traversal, it has been demonstrated numerous times that 
using parallelism is the way to go. This is one area where SIMD is very 
efficient. CUDA could probably be made to perform the same way using more 
or less the same techniques. One interesting technique that is quite 
scalable in a CUDA or OpenCL environment is the multiway BVH, as per a paper 
from RT 2008. One difficulty with GPGPU and acceleration structures is that 
the acceleration structure must be stored in GPU memory to gain anything. 
There have been a few papers published just recently where an acceleration 
structure is built and traversed in real time on the GPU. See: 
http://www.cs.unc.edu/~lauterb/GPUBVH/ Once again, check ompf.org for more 
pointers in that direction.
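
To make the multiway BVH idea concrete: each node holds four (or more) child
boxes side by side, stored component-wise, so one ray can be tested against
all of them with packed min/max instructions on the CPU, or by one thread per
ray on the GPU. A rough sketch of the node layout and the box test, with
invented names (not taken from Blender or from the RT 2008 paper):

/* 4-wide BVH node in structure-of-arrays layout: the four minima (or
 * maxima) of one axis sit next to each other, which is what lets the
 * test map onto SSE, or onto one thread per ray in CUDA/OpenCL. */
typedef struct MBVHNode {
    float bb_min[3][4];    /* [axis][child] */
    float bb_max[3][4];
    int   child[4];        /* child node index; negative marks leaf/empty */
} MBVHNode;

/* Slab test of one ray against the four child boxes of a node.
 * inv_dir[] holds 1/dir per axis; bit i of the result is set on a hit
 * and hit_t[i] then receives the entry distance. */
static int intersect_node4(const MBVHNode *n,
                           const float org[3], const float inv_dir[3],
                           float t_near, float t_far, float hit_t[4])
{
    int mask = 0;
    for (int i = 0; i < 4; i++) {
        float t0 = t_near, t1 = t_far;
        for (int axis = 0; axis < 3; axis++) {
            float d0 = (n->bb_min[axis][i] - org[axis]) * inv_dir[axis];
            float d1 = (n->bb_max[axis][i] - org[axis]) * inv_dir[axis];
            if (d0 > d1) { float tmp = d0; d0 = d1; d1 = tmp; }
            if (d0 > t0) t0 = d0;      /* farthest entry over all axes */
            if (d1 < t1) t1 = d1;      /* nearest exit over all axes */
        }
        if (t0 <= t1) {
            mask |= 1 << i;
            hit_t[i] = t0;
        }
    }
    return mask;
}

On the CPU the inner loop collapses into a handful of packed min/max
operations per node; on the GPU the same layout keeps the memory accesses of
neighbouring rays coherent, which is what makes it scale there too.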

Yves 



