[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

Yves Poissant ypoissant2 at videotron.ca
Fri Dec 19 17:26:38 CET 2008


This is an interesting discussion that I find hard not to participate in. I 
have a lot to say and comment on here, so I'll try to organize my thoughts.

First, indeed, anyone interested in real-time ray-tracing (and thus in 
accelerating ray-tracing) should check the ompf.org forum. It is a must. And I 
invite you to take a particular look at Arauna by Jacco Bikker. This render 
engine can do real-time ray-tracing on the CPU alone. A runnable demo is 
available and it is truly impressive. You can also download the source code. 
Browsing through the source code is very revealing about what sort of 
programming techniques must be used to achieve high-speed rendering. In 
particular, the use of SSE is so heavy in some critical parts of the code 
that it does not look like C or C++ anymore.
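
To give a taste of what I mean, here is a rough sketch (my own illustration, 
not Arauna's actual code) of that SSE style: one ray tested against four 
bounding boxes at once, with the box bounds stored component-wise so each 
intrinsic works on four boxes in parallel:

#include <xmmintrin.h>  /* SSE intrinsics */

struct Ray {
    float ox, oy, oz;       /* origin */
    float idx, idy, idz;    /* precomputed 1/direction */
    float tmin, tmax;
};

/* Four AABBs, bounds laid out component-wise (structure of arrays). */
struct AABB4 {
    __m128 minx, miny, minz;
    __m128 maxx, maxy, maxz;
};

/* Returns a 4-bit mask: bit i is set if the ray hits box i. */
static inline int intersect4(const Ray& r, const AABB4& b)
{
    const __m128 ox = _mm_set1_ps(r.ox), ix = _mm_set1_ps(r.idx);
    const __m128 oy = _mm_set1_ps(r.oy), iy = _mm_set1_ps(r.idy);
    const __m128 oz = _mm_set1_ps(r.oz), iz = _mm_set1_ps(r.idz);

    /* Slab test per axis: t = (bound - origin) * invDir */
    __m128 t1 = _mm_mul_ps(_mm_sub_ps(b.minx, ox), ix);
    __m128 t2 = _mm_mul_ps(_mm_sub_ps(b.maxx, ox), ix);
    __m128 tnear = _mm_min_ps(t1, t2);
    __m128 tfar  = _mm_max_ps(t1, t2);

    t1 = _mm_mul_ps(_mm_sub_ps(b.miny, oy), iy);
    t2 = _mm_mul_ps(_mm_sub_ps(b.maxy, oy), iy);
    tnear = _mm_max_ps(tnear, _mm_min_ps(t1, t2));
    tfar  = _mm_min_ps(tfar,  _mm_max_ps(t1, t2));

    t1 = _mm_mul_ps(_mm_sub_ps(b.minz, oz), iz);
    t2 = _mm_mul_ps(_mm_sub_ps(b.maxz, oz), iz);
    tnear = _mm_max_ps(tnear, _mm_min_ps(t1, t2));
    tfar  = _mm_min_ps(tfar,  _mm_max_ps(t1, t2));

    /* Hit if the slab interval overlaps [tmin, tmax]. */
    tnear = _mm_max_ps(tnear, _mm_set1_ps(r.tmin));
    tfar  = _mm_min_ps(tfar,  _mm_set1_ps(r.tmax));
    return _mm_movemask_ps(_mm_cmple_ps(tnear, tfar));
}

Once the whole traversal inner loop is written this way, very little of it 
reads like ordinary C anymore.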

I worked on acceleration structures for Blender's ray-tracer some months ago 
and I found, at that time, that a SAH BVH was the most efficient structure 
most of the time. I still think that a SAH BVH is the way to go, but I now 
have a caveat. As some of you know, I program a render engine for a living. I 
can't divulge much because I'm under NDA. But I can say that we can 
ray-trace render a full room, fully furnished, with all the construction 
geometry details in the furniture, and fully decorated, with indirect 
illumination, at 800x450 with 5-sample AA in under 10 seconds on a single 
CPU. We are not even using SSE or multiple cores yet (but we will).
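
For those not familiar with it, the surface area heuristic simply scores a 
candidate split by the expected cost of the two children, weighted by the 
probability that a ray hitting the parent box also hits each child box. A 
minimal sketch of that cost function (illustrative constants, not our 
engine's code) looks like this:

struct Bounds {
    float min[3], max[3];
    float surfaceArea() const {
        float dx = max[0] - min[0], dy = max[1] - min[1], dz = max[2] - min[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

/* Cost of splitting a node into (left, leftCount) / (right, rightCount).
   The traversal/intersection constants are illustrative; real builders
   tune them. */
inline float sahCost(const Bounds& parent,
                     const Bounds& left, int leftCount,
                     const Bounds& right, int rightCount)
{
    const float traversalCost = 1.0f;
    const float intersectCost = 4.0f;
    const float invParentSA = 1.0f / parent.surfaceArea();
    return traversalCost
         + intersectCost * (left.surfaceArea()  * invParentSA * leftCount +
                            right.surfaceArea() * invParentSA * rightCount);
}

The builder evaluates this for candidate splits (for example, bucketed per 
axis) and keeps the cheapest; if no split beats the cost of making a leaf, 
it stops subdividing.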

I'm not mentioning that just to show off, but because I want to give a hint 
at what makes the difference between our render engine and Blender's render 
engine. For example, changing an aspect of the acceleration structure in our 
render engine has a very noticeable impact on rendering performance. That was 
not the case when I tried different acceleration structures and different 
optimization approaches in Blender. Improvements were difficult to notice, 
and I could only get tiny percentages of improvement that I had to tabulate 
in order to monitor my progress. At the time, I did not think much of it. But 
with my current experience, I know that this indicates that something else, 
elsewhere in the rendering pipeline, is taking a lot of time, and those other 
inefficient procedures need to be revised. That is where improvement efforts 
need to be put, IMO.

That "RE_ray_tree_intersect_check" takes the most time only tells a small 
fraction of the story. This function calls other functions that needs 
optimizing and some redundant and slow calculations are done there that 
could be avoided if data was better prepared before calling it. But the most 
important and invisible aspect is how this function is being called. The way 
the raytracing is directed takes no care about memory coherency and most 
importantly about cache coherency. If there is one single important hint 
that can be gathered from recent papers about accelerating ray-tracing since 
Havran thesis, it is that memory cache misses are extremely (I would even 
dare say excessively) costly. By the time RE_ray_tree_intersect_check is 
called a second time, the memory cache layout is so trashed that 
RE_ray_tree_intersect_check generated tons of cache misses. At least, when I 
compare traversal times in Blender with those times I get here, the numbers 
seem to point in that direction. Not only cache misses during traversal but 
cache misses in all the other render steps inbetween each traversals too. 
Rendering in strict scanline order is not optimal. Rendering in packet of 
rays is the way to go even if not using SSE.
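
Concretely, the idea is to walk the image in small tiles and trace each 
tile's rays back to back, so consecutive rays touch the same BVH nodes and 
triangles while they are still in cache. A minimal sketch (the function 
names are hypothetical, not Blender's) could look like this:

#include <algorithm>

struct Ray { float ox, oy, oz, dx, dy, dz; };   /* illustrative ray record */

/* Hypothetical hooks into the engine, assumed to exist for this sketch. */
Ray  generatePrimaryRay(int px, int py);
void tracePacket(Ray* rays, int count);

const int TILE = 8;   /* 8x8 = 64 coherent rays per packet */

void renderTile(int x0, int y0, int w, int h)
{
    Ray packet[TILE * TILE];
    int n = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            packet[n++] = generatePrimaryRay(x0 + x, y0 + y);
    /* All rays of the tile traverse the acceleration structure together,
       so node and triangle data fetched for one ray is still hot in the
       cache when the next ray needs it, even without any SSE. */
    tracePacket(packet, n);
}

void renderImage(int width, int height)
{
    for (int ty = 0; ty < height; ty += TILE)
        for (int tx = 0; tx < width; tx += TILE)
            renderTile(tx, ty,
                       std::min(TILE, width - tx),
                       std::min(TILE, height - ty));
}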

Cache misses are important, and the whole rendering pipeline must be 
optimized to improve memory access coherency. The Blender render engine is a 
first-generation render engine: like most render engines that have existed 
for several years, it was designed for CPUs where memory caches were not an 
issue. New CPUs require different programming approaches.

That's it for now.
Regards,
Yves

----- Original Message ----- 
From: "Matt Ebb" <matt at mke3.net>
To: "bf-blender developers" <bf-committers at blender.org>
Sent: Thursday, December 18, 2008 5:38 PM
Subject: Re: [Bf-committers] "Official" CUDA Benchmark/Implementation Thread


> On Fri, Dec 19, 2008 at 8:18 AM, Timothy Baldridge <tbaldridge at gmail.com> 
> wrote:
>>>
>>> As you can see the first 7 functions consume more than 90% of the total
>>> time during a rendering...
>>>
>>
>> That brings about an interesting idea. If really that much of the code
>> is spent in ray intersection, then the question is can that part be
>> pulled out on its own and turned somehow into batches?
>
> As far as I'm aware, this sort of thing isn't easy to do on the GPU.
> If you want to see what the 'state of the art' of realtime raytracing
> / ray intersection acceleration, have a look at the forums at
> http://ompf.org/forum/.
>
> Real benefits could be made in this area simply* by implementing an
> improved intersection acceleration structure to the current octree.
> Yves Poissant did a lot of work experimenting with different systems,
> finding that SAH_BVH was best in his opinion. His patch is here:
> https://projects.blender.org/tracker/index.php?func=detail&aid=
>
> cheers
>
> Matt
>
>
> *for large values of simple ;)
> _______________________________________________
> Bf-committers mailing list
> Bf-committers at blender.org
> http://lists.blender.org/mailman/listinfo/bf-committers
> 



