[Bf-committers] "Official" CUDA Benchmark/Implementation Thread
ypoissant2 at videotron.ca
Sat Dec 20 15:53:20 CET 2008
> Of course it is not very important to optimize (or replace) a
> function which is called only ONCE and whose
> execution time is 0.01% of the total. So for instance
> vlr_check_intersect() is called 87 million times and
> testnode() 40 million. That should also already take into account the
> number of reciprocal (or recursive) calls.
> You may also do some speculation and "forecasting" about what happens if
> such a function could be made
> 10 times faster or 100 times faster. But also, by looking at the same
> functions (or the functions they call most of the
> time), you can see whether they can be CUDA-ized (or CL-ized) or not. Of
> course the extreme is to "rewrite" a new
> independent CUDA|OpenCL (realtime) engine. But that is in an "ideal world"...
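The "forecasting" idea above is essentially Amdahl's law. A minimal sketch in C, with illustrative numbers (the 80% figure is an assumption for the example, not a profiled value from Blender):

```c
/* Amdahl's law: overall speedup when a fraction p of total run time
 * is accelerated by a factor s. Illustrative only. */
static double overall_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

If traversal were 80% of render time, making it 10x faster gives about 3.6x overall, and 100x faster only about 4.8x; even an infinitely fast kernel is capped at 1/(1-p) = 5x. That is why the fraction of time spent in a function matters more than its raw call count.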
That is exactly the part of the rendering pipeline I worked on. Even when
using the most optimized algorithms known, I couldn't get very convincing
improvements. There are a few ugly things going on in there; for
instance, the code that checks for self-intersection is extremely
inefficient. But avoiding that would require rewriting how data is
prepared upstream. Matt Ebb did a rewrite of that data preparation for
v2.46, I think, which had the effect that those checks could be avoided.
That caused other rendering artifacts, so I don't know if those changes
are still there. This may not be an easy issue to solve, depending on how
deep the preparation goes. If we could avoid having to check for
self-intersection right in the traversal, we would already save
significant time IMO. But the real culprit is not there.
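For illustration, one common way to keep a self-intersection check out of the hot traversal loop is to carry the ID of the face the ray originates from and reject it once, at the hit test. This is a hypothetical sketch (the names and struct layout are made up for the example, not Blender's actual code):

```c
/* Candidate hit record: nearest face found so far and its distance. */
typedef struct { int face_id; float t; } Hit;

/* Accept a candidate intersection only if it is not the originating
 * face and is nearer than the best hit so far. Returns 1 on accept. */
static int intersect_face(int face_id, float t_candidate,
                          int origin_face, Hit *best)
{
    if (face_id == origin_face)   /* skip the face the ray started on */
        return 0;
    if (t_candidate <= 0.0f || t_candidate >= best->t)
        return 0;
    best->t = t_candidate;
    best->face_id = face_id;
    return 1;
}
```

The point of the design is that the exclusion costs one integer compare per candidate, instead of an epsilon-based geometric test, but it only works if the upstream data preparation preserves stable face IDs — which is exactly the kind of upstream change discussed above.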
For accelerating traversal, it has been demonstrated numerous times that
parallelism is the way to go. This is one area where SIMD is very
efficient, and CUDA could probably be made to perform the same way using
more or less the same techniques. One interesting technique that scales
quite well in a CUDA or OpenCL environment is the multiway BVH, as per a
paper from RT 2008. One difficulty with GPGPU and acceleration structures
is that the structure must be stored in GPU memory to gain anything.
A few papers have just recently been published where an acceleration
structure is built and traversed in real time on the GPU. See:
http://www.cs.unc.edu/~lauterb/GPUBVH/ Once again, check ompf.org for more
pointers in that direction.
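To make the multiway-BVH idea concrete, here is a sketch of a 4-wide node test in plain C (scalar for clarity; the layout and names are illustrative, not from the RT 2008 paper's code). The four child bounding boxes are stored struct-of-arrays so that, with SSE or one CUDA thread per child, all four slab tests run in lockstep:

```c
/* A 4-wide BVH node: four child AABBs in struct-of-arrays layout. */
typedef struct {
    float bmin[3][4];   /* bmin[axis][child] */
    float bmax[3][4];
    int   child[4];     /* child node index, or -1 if the slot is unused */
} QBVHNode;

/* Slab test of one ray against all four child boxes.
 * Returns a 4-bit mask of the children the ray overlaps. */
static int intersect_node(const QBVHNode *n,
                          const float org[3], const float inv_dir[3],
                          float t_max)
{
    int mask = 0;
    for (int c = 0; c < 4; c++) {
        float t_near = 0.0f, t_far = t_max;
        for (int a = 0; a < 3; a++) {
            float t0 = (n->bmin[a][c] - org[a]) * inv_dir[a];
            float t1 = (n->bmax[a][c] - org[a]) * inv_dir[a];
            if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
            if (t0 > t_near) t_near = t0;
            if (t1 < t_far)  t_far  = t1;
        }
        if (n->child[c] >= 0 && t_near <= t_far)
            mask |= 1 << c;
    }
    return mask;
}
```

The wider branching factor is what helps on GPUs: the node data fits one coalesced fetch, the four lanes do identical arithmetic with no divergence, and the tree is shallower, so fewer dependent memory reads per ray.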