I&#39;m not sure how you&#39;d avoid cache misses, though... we simply have to deal with too much data. &nbsp;About the only thing I can think of is sorting faces/strands per tile (I actually do this in my DSM branch) and using a better render order than simply going over the scanlines. &nbsp;The ray tracing traversal could be made more efficient, but optimizing what the renderer does between traversals could be more difficult.<div>

<br></div><div>You know, I think the CodeAnalyst profiling tool from AMD can measure cache misses; I&#39;ll have to try to figure out how it works.<br><div><br></div><div>Joe</div><div><br><div class="gmail_quote">On Fri, Dec 19, 2008 at 9:26 AM, Yves Poissant <span dir="ltr">&lt;<a href="mailto:ypoissant2@videotron.ca">ypoissant2@videotron.ca</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">This is an interesting discussion that I find hard not to take part in. I<br>

have a lot to say here, so I&#39;ll try to organize my thoughts.<br>

<br>

First, indeed, anyone interested in real-time ray-tracing (and thus in<br>

accelerating ray-tracing) should check the <a href="http://ompf.org" target="_blank">ompf.org</a> forum. It is a must. And I<br>

invite you to take a particular look at Arauna by Jacco Bikker. This render<br>

engine can do real-time ray-tracing on the CPU alone. A runnable demo is<br>

available and it is truly impressive. You can also download the<br>

source code. Browsing through it is very revealing about what<br>

sort of programming techniques must be used to achieve high-speed rendering.<br>

In particular, the use of SSE is so heavy that in some critical parts of the<br>

code, it does not look like C or C++ anymore.<br>

<br>

I worked on acceleration structures for the Blender ray-tracer some months ago<br>

and I found, at that time, that a SAH BVH was the most efficient structure<br>

most of the time. I still think that SAH BVH is the way to go but I now have<br>

a caveat. As some of you know, I program a render engine for a living. I<br>

can&#39;t divulge much because I&#39;m under NDA. But I can say that we can<br>

ray-trace a full room, fully furnished, with all the construction<br>

geometry details in the furniture, and fully decorated, with indirect<br>

illumination, at 800x450 with 5-sample AA in under 10 seconds on a single<br>

CPU. We are not even using SSE or multiple cores (but we will).<br>

<br>

I mention this not to show off but because I want to give a<br>

hint at what makes the difference between our render engine and Blender&#39;s<br>

render engine. For example, changing an aspect of the acceleration structure<br>

in our render engine has a very noticeable impact on rendering<br>

performance. That was not the case when I tried different acceleration<br>

structures and different optimization approaches in Blender.<br>

Improvements were difficult to notice, and I could only get tiny percentage<br>

gains that I needed to tabulate in order to monitor my progress.<br>

At the time, I did not worry about it too much. But with my current experience, I<br>

know that this indicates that something else, elsewhere in the rendering<br>

pipeline, is taking a lot of time. Those other inefficient procedures<br>

need to be revised. This is where improvement efforts need to be put, IMO.<br>

<br>

That &quot;RE_ray_tree_intersect_check&quot; takes the most time only tells a small<br>

fraction of the story. This function calls other functions that need<br>

optimizing, and some redundant and slow calculations are done there that<br>

could be avoided if the data were better prepared before calling it. But the most<br>

important and invisible aspect is how this function is being called. The way<br>

the raytracing is directed pays no attention to memory coherency and, most<br>

importantly, to cache coherency. If there is one single important hint<br>

that can be gathered from recent papers about accelerating ray-tracing since<br>

Havran&#39;s thesis, it is that memory cache misses are extremely (I would even<br>

dare say excessively) costly. By the time RE_ray_tree_intersect_check is<br>

called a second time, the memory cache is so trashed that<br>

RE_ray_tree_intersect_check generates tons of cache misses. At least, when I<br>

compare traversal times in Blender with the times I get here, the numbers<br>

seem to point in that direction. Not only cache misses during traversal but<br>

cache misses in all the other render steps between traversals too.<br>

Rendering in strict scanline order is not optimal. Rendering in packets of<br>

rays is the way to go even if not using SSE.<br>

<br>
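
As a rough sketch of the scanline-versus-coherent-order point above (this is not<br>
Blender&#39;s code; the function tile_order and the Pixel struct are hypothetical<br>
names), primary rays could be generated tile by tile, so that consecutive rays<br>
start from neighbouring pixels and tend to touch the same acceleration-structure<br>
nodes and shading data:<br>

```c
/* Hypothetical illustration only; none of these names exist in Blender. */
typedef struct Pixel { int x, y; } Pixel;

/* Fill `out` with every pixel of a width x height image in tile order.
 * Tiles are visited left to right, top to bottom, and all pixels of one
 * tile are emitted before the next tile starts. Tiles on the right and
 * bottom edges are clipped to the image bounds.
 * `out` must have room for width * height entries.
 * Returns the number of pixels written (0 for invalid arguments). */
int tile_order(int width, int height, int tile, Pixel *out)
{
    int n = 0;

    if (width <= 0 || height <= 0 || tile <= 0 || out == 0)
        return 0;

    for (int ty = 0; ty < height; ty += tile) {
        for (int tx = 0; tx < width; tx += tile) {
            /* Clip the tile against the image edges. */
            int y_end = (ty + tile < height) ? ty + tile : height;
            int x_end = (tx + tile < width) ? tx + tile : width;

            for (int y = ty; y < y_end; y++) {
                for (int x = tx; x < x_end; x++) {
                    out[n].x = x;
                    out[n].y = y;
                    n++;
                }
            }
        }
    }
    return n;
}
```

With an 8x8 image and tile = 4, the first 16 entries cover the top-left 4x4<br>
block before any pixel of the next tile is visited; each tile could then be<br>
traced as one small packet of rays instead of one ray per scanline step.<br>

<br>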

Cache misses are important and the whole rendering pipeline must be<br>

optimized to improve memory access coherency. Blender&#39;s render engine is a<br>

first-generation render engine, like most render engines that have existed for<br>

several years: it was designed for CPUs where memory caches were not an issue.<br>

New CPUs require different programming approaches.<br>

<br>

That&#39;s it for now.<br>

Regards,<br>

<font color="#888888">Yves<br>

</font><div><div></div><div class="Wj3C7c"><br>

----- Original Message -----<br>

From: &quot;Matt Ebb&quot; &lt;<a href="mailto:matt@mke3.net">matt@mke3.net</a>&gt;<br>

To: &quot;bf-blender developers&quot; &lt;<a href="mailto:bf-committers@blender.org">bf-committers@blender.org</a>&gt;<br>

Sent: Thursday, December 18, 2008 5:38 PM<br>

Subject: Re: [Bf-committers] &quot;Official&quot; CUDA Benchmark/Implementation Thread<br>

<br>

<br>

&gt; On Fri, Dec 19, 2008 at 8:18 AM, Timothy Baldridge &lt;<a href="mailto:tbaldridge@gmail.com">tbaldridge@gmail.com</a>&gt;<br>

&gt; wrote:<br>

&gt;&gt;&gt;<br>

&gt;&gt;&gt; As you can see the first 7 functions consume more than 90% of the total<br>

&gt;&gt;&gt; time during a rendering...<br>

&gt;&gt;&gt;<br>

&gt;&gt;<br>

&gt;&gt; That brings about an interesting idea. If really that much of the code<br>

&gt;&gt; is spent in ray intersection, then the question is can that part be<br>

&gt;&gt; pulled out on its own and turned somehow into batches?<br>

&gt;<br>

&gt; As far as I&#39;m aware, this sort of thing isn&#39;t easy to do on the GPU.<br>

&gt; If you want to see the &#39;state of the art&#39; of realtime raytracing<br>

&gt; / ray intersection acceleration, have a look at the forums at<br>

&gt; <a href="http://ompf.org/forum/" target="_blank">http://ompf.org/forum/</a>.<br>

&gt;<br>

&gt; Real benefits could be made in this area simply* by implementing an<br>

&gt; improved intersection acceleration structure to the current octree.<br>

&gt; Yves Poissant did a lot of work experimenting with different systems,<br>

&gt; finding that SAH_BVH was best in his opinion. His patch is here:<br>

&gt; <a href="https://projects.blender.org/tracker/index.php?func=detail&amp;aid=" target="_blank">https://projects.blender.org/tracker/index.php?func=detail&amp;aid=</a><br>

&gt;<br>

&gt; cheers<br>

&gt;<br>

&gt; Matt<br>

&gt;<br>

&gt;<br>

&gt; *for large values of simple ;)<br>

&gt; _______________________________________________<br>

&gt; Bf-committers mailing list<br>

&gt; <a href="mailto:Bf-committers@blender.org">Bf-committers@blender.org</a><br>

&gt; <a href="http://lists.blender.org/mailman/listinfo/bf-committers" target="_blank">http://lists.blender.org/mailman/listinfo/bf-committers</a><br>

&gt;<br>

<br>

<br>


</div></div></blockquote></div><br></div></div>