<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=iso-8859-1">
</HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>You cannot avoid cache misses completely. But you
can structure your ray tracing so that the same blocks of memory are reused
for a larger proportion of samples than just one pixel. Think "bundles" and
"frusta" -> read Reshetov. There were a number of good approaches to
bundling at the last RT conference. That we have to deal with a lot of data is
unavoidable; the trick is to find a way to access that data coherently while
ray tracing. Meaning: once you have accessed a piece of data, do as much as you
can with it while you are there. Reuse it as much as possible. Preprocess as
much data as possible. Most of the time, that means organizing the data
differently from how it is organized right now: put data that will be used
together physically close together. That is a tall order, though, and may not
even be possible. Organizing the data according to a hierarchical model is
easy; organizing it for coherent and efficient access is much less easy to
do.</FONT></DIV>
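<DIV><FONT face=Arial size=2>To make the "bundle" idea concrete, here is a
minimal C sketch. The types and names are invented for illustration (they are
not Blender's actual ray or BVH structures): a node's bounding box is fetched
once and a whole packet of rays is tested against it while it is still hot in
cache, instead of re-fetching the node once per ray.</FONT></DIV>
<PRE>
```c
#include <stddef.h>

/* Hypothetical minimal types -- not Blender's actual structures. */
typedef struct { float ox, oy, oz; float dx, dy, dz; } Ray;
typedef struct { float min[3], max[3]; } AABB;

/* Classic slab test for one ray against one axis-aligned box. */
static int ray_hits_box(const Ray *r, const AABB *b)
{
    float tmin = 0.0f, tmax = 1e30f;
    const float o[3] = { r->ox, r->oy, r->oz };
    const float d[3] = { r->dx, r->dy, r->dz };
    for (int axis = 0; axis < 3; axis++) {
        float inv = 1.0f / d[axis];
        float t0 = (b->min[axis] - o[axis]) * inv;
        float t1 = (b->max[axis] - o[axis]) * inv;
        if (inv < 0.0f) { float tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmax < tmin) return 0;
    }
    return 1;
}

/* Packet version: the node's box is loaded once and stays hot in cache
 * while every ray in the bundle is tested against it.  Returns the
 * number of rays that hit, and marks each one in hit[]. */
size_t packet_hits_box(const Ray *rays, size_t n, const AABB *box, int *hit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        hit[i] = ray_hits_box(&rays[i], box);
        count += (size_t)hit[i];
    }
    return count;
}
```
</PRE>
<DIV><FONT face=Arial size=2>A full traversal would apply the same pattern at
every BVH node, descending only with the subset of rays that survived the
test; Reshetov's multi-level ray tracing papers develop this much
further.</FONT></DIV>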
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I fully agree with you. This is difficult to do.
But it will need to be addressed sooner or later. Even if we cannot possibly
consider changing the whole render pipeline for now, we need to start being
aware of the issue and thinking about possible solutions and implementation
alternatives. We need to start discussing this.</FONT></DIV>
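<DIV><FONT face=Arial size=2>One concrete way to put data that are used
together physically close together is a hot/cold split of the triangle
records. This is a hypothetical C sketch (the names and fields are invented,
not Blender's actual face layout): the fields touched on every intersection
test are packed in one array, while shading-only data lives in a parallel
array, so cache lines filled during traversal carry only data the traversal
actually uses.</FONT></DIV>
<PRE>
```c
#include <stdlib.h>
#include <stddef.h>

/* "Hot" data: read on every ray-triangle intersection test. */
typedef struct { float v0[3], v1[3], v2[3]; } TriHot;

/* "Cold" data: read only after a confirmed hit, during shading. */
typedef struct { float uv[3][2]; int material; } TriCold;

/* Parallel arrays: hot[i] and cold[i] describe the same triangle,
 * but traversal streams through hot[] without ever pulling cold
 * fields into the cache. */
typedef struct {
    size_t   count;
    TriHot  *hot;
    TriCold *cold;
} TriSoup;

int trisoup_init(TriSoup *s, size_t count)
{
    s->count = count;
    s->hot   = calloc(count, sizeof(*s->hot));
    s->cold  = calloc(count, sizeof(*s->cold));
    return s->hot != NULL && s->cold != NULL;
}

void trisoup_free(TriSoup *s)
{
    free(s->hot);
    free(s->cold);
    s->hot  = NULL;
    s->cold = NULL;
    s->count = 0;
}
```
</PRE>
<DIV><FONT face=Arial size=2>With an interleaved layout, every cache line
loaded during traversal would also drag in UVs and material indices that the
traversal never reads.</FONT></DIV>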
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>In the meantime, this situation needs to be kept
in mind. A colleague at a former employer had the task of using Gelato to
accelerate a render engine. He was not successful: the overhead was far too
heavy, and because of the way the render engine was programmed, the result was
like spoonfeeding Gelato with data. Using Gelato correctly would have required
rewriting the whole render pipeline. A couple of years ago, I saw the same
situation with multithreading. On ompf.org, some people were playing with CUDA
a few months ago and got really interesting results, but they wrote their ray
tracers from scratch. My experience is that you cannot get significant
performance improvements by adapting first-generation render engines to these
new technologies; the programming techniques are too different to work well
together. It might still be an interesting exercise to adapt the Blender
renderer to CUDA, if only to raise awareness in the developer community of the
kind of modifications the render engine might require, but keep your
expectations reasonable. Don't expect the kind of performance boost that is
published for CUDA-specific implementations.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Yves</FONT></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=joeedh@gmail.com href="mailto:joeedh@gmail.com">joe</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=ypoissant2@videotron.ca
href="mailto:ypoissant2@videotron.ca">Yves Poissant</A> ; <A
title=bf-committers@blender.org
href="mailto:bf-committers@blender.org">bf-blender developers</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Friday, December 19, 2008 2:29
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> Re: [Bf-committers] "Official"
CUDA Benchmark/Implementation Thread</DIV>
<DIV><BR></DIV>I'm not sure how you'd avoid cache misses, though... we simply
have to deal with too much data. About the only thing I can think of is
sorting faces/strands per tile (I actually do this in my DSM branch) and using
a more optimal render order than simply going over the scanlines. The
ray tracing traversal could be made more efficient, but optimizing what the
renderer does in between could be more difficult.
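<DIV>A render order better than plain scanlines can be sketched with Morton
(Z-order) tile indices. This is a hypothetical C example, not code from any
existing branch:</DIV>
<PRE>
```c
#include <stdint.h>

/* Spread the low 16 bits of v so there is a zero bit between each
 * original bit (0b1011 -> 0b1000101). */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000ffffu;
    v = (v | (v << 8)) & 0x00ff00ffu;
    v = (v | (v << 4)) & 0x0f0f0f0fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton (Z-order) code for the tile at (x, y).  Visiting tiles in
 * increasing Morton order keeps successive tiles spatially adjacent,
 * so geometry cached while rendering one tile is likely to be reused
 * by the next one. */
uint32_t tile_morton(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```
</PRE>
<DIV>Sorting the tile list by tile_morton(x, y) and rendering in that order
gives a Z-order sweep, so the per-tile sorted face/strand lists stay hot
across neighbouring tiles.</DIV>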
<DIV><BR></DIV>
<DIV>You know, I think the CodeAnalyst profiling tool from AMD can measure
cache misses; I'll have to figure out how it works.<BR>
<DIV><BR></DIV>
<DIV>Joe</DIV>
<DIV><BR>
<DIV class=gmail_quote>On Fri, Dec 19, 2008 at 9:26 AM, Yves Poissant <SPAN
dir=ltr><<A
href="mailto:ypoissant2@videotron.ca">ypoissant2@videotron.ca</A>></SPAN>
wrote:<BR>
<BLOCKQUOTE class=gmail_quote
style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">That
is an interesting discussion I find hard not to participate in. I<BR>have
a lot to say/comment here. I'll try to organize my thoughts.<BR><BR>First,
indeed, anyone interested in real-time ray-tracing (and thus
in<BR>accelerating ray-tracing) should check the <A href="http://ompf.org"
target=_blank>ompf.org</A> forum. It is a must. And I<BR>invite you to take
a particular look at Arauna by Jacco Bikker. This render<BR>engine does
real-time ray-tracing on the CPU alone. A runnable demo is<BR>available, and
it is truly impressive. You can also download the<BR>source code. Browsing
through the source code is very revealing about what<BR>sort of programming
techniques must be used to achieve high-speed rendering.<BR>In particular,
the use of SSE is so heavy that in some critical parts of the<BR>code, it
does not look like C or C++ anymore.<BR><BR>I worked on acceleration
structures for Blender ray-tracer some months ago<BR>and I found, at that
time, that a SAH BVH was the most efficient structure<BR>most of the time. I
still think that SAH BVH is the way to go but I now have<BR>a caveat. As
some of you know, I program a render engine for a living. I<BR>can't divulge
much because I'm under NDA. But I can say that we can<BR>ray-trace render a
full room, fully furnished, with all the construction<BR>geometry details in
the furniture, and fully decorated, with indirect<BR>illumination, at 800x450
with 5-sample AA, in under 10 seconds on a single<BR>CPU. We are not even
using SSE or multiple cores yet (but we will).<BR><BR>I'm not
mentioning this just to show off, but because I want to give a<BR>hint at
what makes the difference between our render engine and Blender<BR>render
engine. For example, changing an aspect of the acceleration structure<BR>in
our render engine does have a very noticeable impact on the
rendering<BR>performance. That was not the case when I tried different
acceleration<BR>structures for Blender and when I tried different
optimization approaches.<BR>Improvements were difficult to notice and I
could only get percentage improvements so tiny<BR>that I needed to tabulate them
in order to monitor my progress.<BR>At the time, I did not bother too much.
But with my current experience, I<BR>know that this indicates that something
else, elsewhere in the rendering<BR>pipeline is taking a lot of time. And
those other inefficient procedures<BR>need to be revised. This is where
improvement efforts need to be put IMO.<BR><BR>That
"RE_ray_tree_intersect_check" takes the most time only tells a
small<BR>fraction of the story. This function calls other functions that
need<BR>optimizing, and some redundant and slow calculations are done there
that<BR>could be avoided if data was better prepared before calling it. But
the most<BR>important and invisible aspect is how this function is being
called. The way<BR>the raytracing is directed takes no care about memory
coherency and most<BR>importantly about cache coherency. If there is one
single important hint<BR>that can be gathered from recent papers about
accelerating ray-tracing since<BR>Havran thesis, it is that memory cache
misses are extremely (I would even<BR>dare say excessively) costly. By the
time RE_ray_tree_intersect_check is<BR>called a second time, the memory
cache layout is so thrashed that<BR>RE_ray_tree_intersect_check generates
tons of cache misses. At least, when I<BR>compare traversal times in Blender
with the times I get here, the numbers<BR>seem to point in that direction.
There are not only cache misses during traversal but<BR>also cache misses in
all the other render steps in between traversals.<BR>Rendering in strict
scanline order is not optimal. Rendering in packets of<BR>rays is the way to
go even
if not using SSE.<BR><BR>Cache misses are important and the whole rendering
pipeline must be<BR>optimized to improve memory access coherency. The Blender
render engine, being a<BR>first-generation render engine like most engines
that have been around for<BR>several years, is designed for CPUs where memory
caches were not an issue.<BR>New CPUs require different programming
approaches.<BR><BR>That's it for now.<BR>Regards,<BR><FONT
color=#888888>Yves<BR></FONT>
<DIV>
<DIV></DIV>
<DIV class=Wj3C7c><BR>----- Original Message -----<BR>From: "Matt Ebb"
<<A href="mailto:matt@mke3.net">matt@mke3.net</A>><BR>To: "bf-blender
developers" <<A
href="mailto:bf-committers@blender.org">bf-committers@blender.org</A>><BR>Sent:
Thursday, December 18, 2008 5:38 PM<BR>Subject: Re: [Bf-committers]
"Official" CUDA Benchmark/Implementation Thread<BR><BR><BR>> On Fri, Dec
19, 2008 at 8:18 AM, Timothy Baldridge <<A
href="mailto:tbaldridge@gmail.com">tbaldridge@gmail.com</A>><BR>>
wrote:<BR>>>><BR>>>> As you can see the first 7 functions
consume more than 90% of the total<BR>>>> time during a
rendering...<BR>>>><BR>>><BR>>> That brings about an
interesting idea. If really that much of the time<BR>>> is spent in
ray intersection, then the question is can that part be<BR>>> pulled
out on its own and turned somehow into batches?<BR>><BR>> As far as
I'm aware, this sort of thing isn't easy to do on the GPU.<BR>> If you
want to see the 'state of the art' of realtime raytracing<BR>> / ray
intersection acceleration, have a look at the forums at<BR>> <A
href="http://ompf.org/forum/"
target=_blank>http://ompf.org/forum/</A>.<BR>><BR>> Real benefits
could be made in this area simply* by implementing an<BR>> improved
intersection acceleration structure in place of the current octree.<BR>> Yves
Poissant did a lot of work experimenting with different systems,<BR>>
finding that SAH_BVH was best in his opinion. His patch is here:<BR>> <A
href="https://projects.blender.org/tracker/index.php?func=detail&aid="
target=_blank>https://projects.blender.org/tracker/index.php?func=detail&aid=</A><BR>><BR>>
cheers<BR>><BR>> Matt<BR>><BR>><BR>> *for large values of
simple ;)<BR>> _______________________________________________<BR>>
Bf-committers mailing list<BR>> <A
href="mailto:Bf-committers@blender.org">Bf-committers@blender.org</A><BR>>
<A href="http://lists.blender.org/mailman/listinfo/bf-committers"
target=_blank>http://lists.blender.org/mailman/listinfo/bf-committers</A><BR>><BR><BR><BR>_______________________________________________<BR>Bf-committers
mailing list<BR><A
href="mailto:Bf-committers@blender.org">Bf-committers@blender.org</A><BR><A
href="http://lists.blender.org/mailman/listinfo/bf-committers"
target=_blank>http://lists.blender.org/mailman/listinfo/bf-committers</A><BR></DIV></DIV></BLOCKQUOTE></DIV><BR></DIV></DIV></BLOCKQUOTE></BODY></HTML>