[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

Yves Poissant ypoissant2 at videotron.ca
Fri Dec 19 21:37:17 CET 2008


You cannot avoid cache misses completely. But you can do your raytracing in such a way that you use the same blocks of memory most of the time for a larger proportion of samples than for just one pixel. Think "bundles" and "frustum" -> read Reshetov. There were a bunch of different good approaches to bundling at the last RT conference. That we have to deal with a lot of data is unavoidable. But the trick is to figure out a way to access this data in a coherent way while raytracing. Meaning, once you have accessed a piece of data, try to do as much as you can with it while you are there. Try to reuse it as much as possible. Preprocess as much data as possible. Most of the time, that means organizing the data in a different way than how it is organized right now. Put data that are to be used together physically close together. That is a tall task though and may not even be possible. Organizing the data according to a hierarchical model is easy, but it is different from organizing the data for coherent and efficient access, which is less easy to do.
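
To make the "bundles" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration of the grouping logic, not a performance demonstration. The binning rule (origin grid cell plus direction octant) is a simple heuristic chosen for this example, not Reshetov's method, and the names `direction_octant`, `bundle_rays` and `cell_size` are hypothetical.

```python
from collections import defaultdict
import math


def direction_octant(direction):
    """Return a 3-bit code built from the signs of the direction components."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)


def bundle_rays(rays, cell_size=1.0):
    """Group (origin, direction) rays that start in the same grid cell
    and point into the same octant, so each group can be traced together
    and is likely to touch the same scene data."""
    bundles = defaultdict(list)
    for origin, direction in rays:
        cell = tuple(math.floor(c / cell_size) for c in origin)
        bundles[(cell, direction_octant(direction))].append((origin, direction))
    return dict(bundles)
```

Each bundle would then be traced as a unit before moving on to the next one, so that the data it touches is reused while it is still in the cache.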

I fully agree with you. This is difficult to do. But this will need to be addressed sooner or later. Even if we cannot possibly think of changing the whole render pipeline for now, we need to start being aware of the issue and start thinking about possible solutions and implementation alternatives. We need to start discussing that.

In the meantime, this situation needs to be kept in mind. I have a colleague from a former employer who had the task of using Gelato to accelerate a render engine. He was not successful because the overhead was way too heavy and, because of the way the render engine was programmed, the result was like spoonfeeding Gelato with data. Using Gelato correctly would have required rewriting the whole render pipeline. A couple of years ago, I saw the same situation happen with multithreading. On ompf.org, there were some people playing with CUDA a few months ago and they got really interesting results, but they wrote the raytracing from scratch. My experience is that you cannot get significant performance improvements if you try to adapt first generation render engines to these new technologies. The programming techniques are too different for them to work well together. It might still be an interesting experience to adapt CUDA to the Blender renderer, if only to raise awareness in the developer community of the kind of modifications that might be required in the render engine, but keep your expectations reasonable. Don't expect the kind of performance boost that is published for CUDA-specific implementations.

Yves
  ----- Original Message ----- 
  From: joe 
  To: Yves Poissant ; bf-blender developers 
  Sent: Friday, December 19, 2008 2:29 PM
  Subject: Re: [Bf-committers] "Official" CUDA Benchmark/Implementation Thread


  I'm not sure how you'd avoid cache misses though... we simply have to deal with too much data.  About the only thing I can think of is sorting faces/strands (I actually do this in my DSM branch) per tile and using a more optimal render order than simply going over the scanlines.  The ray tracing traversal could be made more efficient, but optimizing what the renderer does in between could be more difficult.
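
As a rough illustration of a render order other than plain scanlines, the sketch below visits pixels tile by tile, so neighbouring samples (which tend to hit nearby geometry) are processed close together in time. It is written in Python for brevity and is not taken from the Blender source. The function name `tile_order` and the `tile` parameter are hypothetical.

```python
def tile_order(width, height, tile):
    """Yield (x, y) pixel coordinates tile by tile instead of scanline by
    scanline. Tiles on the right and bottom edges are clipped to the image."""
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield x, y
```

For example, `list(tile_order(4, 2, 2))` visits the left 2x2 block completely before moving on to the right one.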


  You know, I think the CodeAnalyst profiling tool from AMD can measure cache misses. I'll have to try and figure out how it works.



  Joe


  On Fri, Dec 19, 2008 at 9:26 AM, Yves Poissant <ypoissant2 at videotron.ca> wrote:

    That is an interesting discussion that I find hard not to participate in. I
    have a lot to say/comment here. I'll try to organize my thoughts.

    First, indeed, anyone interested in real-time ray-tracing (and thus in
    accelerating ray-tracing) should check the ompf.org forum. It is a must. And I
    invite you to take a particular look at Arauna by Jacco Bikker. This render
    engine can do real-time ray-tracing on the CPU only. A runnable demo is
    available and it is truly impressive. You can also download the
    source code. Browsing through the source code is very revealing about what
    sort of programming techniques must be used to achieve high speed rendering.
    In particular, the use of SSE is so heavy that in some critical parts of the
    code, it does not look like C or C++ anymore.

    I worked on acceleration structures for Blender ray-tracer some months ago
    and I found, at that time, that a SAH BVH was the most efficient structure
    most of the time. I still think that SAH BVH is the way to go but I now have
    a caveat. As some of you know, I program a render engine for a living. I
    can't divulge much because I'm under NDA. But I can say that we can
    ray-trace render a full room, fully furnished, with all the construction
    geometry details in the furniture, and fully decorated, with indirect
    illumination at 800x450 with 5-sample AA in under 10 seconds using one single
    CPU alone. We are not even using SSE nor multi-cores (but we will).

    I'm not mentioning that just for showing off but because I want to give a
    hint at what makes the difference between our render engine and Blender
    render engine. For example, changing an aspect of the acceleration structure
    in our render engine does have a very noticeable impact on the rendering
    performance. That was not the case when I tried different acceleration
    structures for Blender and when I tried different optimization approaches.
    Improvements were difficult to notice and I could only get tiny percentage
    improvements that I needed to tabulate in order to monitor my progress.
    At the time, I did not bother too much. But with my current experience, I
    know that this indicates that something else, elsewhere in the rendering
    pipeline is taking a lot of time. And those other inefficient procedures
    need to be revised. This is where improvement efforts need to be put IMO.

    That "RE_ray_tree_intersect_check" takes the most time only tells a small
    fraction of the story. This function calls other functions that need
    optimizing and some redundant and slow calculations are done there that
    could be avoided if data was better prepared before calling it. But the most
    important and invisible aspect is how this function is being called. The way
    the raytracing is directed takes no care about memory coherency and most
    importantly about cache coherency. If there is one single important hint
    that can be gathered from recent papers about accelerating ray-tracing since
    Havran's thesis, it is that memory cache misses are extremely (I would even
    dare say excessively) costly. By the time RE_ray_tree_intersect_check is
    called a second time, the memory cache layout is so trashed that
    RE_ray_tree_intersect_check generates tons of cache misses. At least, when I
    compare traversal times in Blender with the times I get here, the numbers
    seem to point in that direction. Not only cache misses during traversal but
    cache misses in all the other render steps in between traversals too.
    Rendering in strict scanline order is not optimal. Rendering in packets of
    rays is the way to go even if not using SSE.
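
A minimal sketch of what tracing a packet of rays can look like, written in Python purely to show the control flow (no SSE, no performance claims). The whole packet descends the hierarchy together, so each node is fetched once per packet rather than once per ray. The node layout (dicts with hypothetical keys `name`, `min`, `max`, `children`, `prims`) and the function names are assumptions made for this example, not Blender's data structures.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray (t >= 0) overlap the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse_packet(node, rays, active, visits):
    """Descend the tree once for the whole packet.

    Returns {ray index: [candidate primitives]} and appends the name of
    every visited node to `visits`."""
    visits.append(node["name"])
    alive = [i for i in active
             if ray_hits_box(rays[i][0], rays[i][1], node["min"], node["max"])]
    if not alive:
        return {}
    if "prims" in node:
        return {i: list(node["prims"]) for i in alive}
    result = {}
    for child in node["children"]:
        for i, prims in traverse_packet(child, rays, alive, visits).items():
            result.setdefault(i, []).extend(prims)
    return result
```

With two nearly identical rays in one packet, a three-node tree is visited three times in total, where tracing the rays one by one would visit it six times.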

    Cache misses are important and the whole rendering pipeline must be
    optimized to improve memory access coherency. The Blender render engine,
    being a first generation render engine like most render engines that have
    existed for several years, was designed for CPUs where memory caches were
    not an issue. New CPUs require different programming approaches.

    That's it for now.
    Regards,
    Yves


    ----- Original Message -----
    From: "Matt Ebb" <matt at mke3.net>
    To: "bf-blender developers" <bf-committers at blender.org>
    Sent: Thursday, December 18, 2008 5:38 PM
    Subject: Re: [Bf-committers] "Official" CUDA Benchmark/Implementation Thread


    > On Fri, Dec 19, 2008 at 8:18 AM, Timothy Baldridge <tbaldridge at gmail.com>
    > wrote:
    >>>
    >>> As you can see the first 7 functions consume more than 90% of the total
    >>> time during a rendering...
    >>>
    >>
    >> That brings about an interesting idea. If really that much of the code
    >> is spent in ray intersection, then the question is can that part be
    >> pulled out on its own and turned somehow into batches?
    >
    > As far as I'm aware, this sort of thing isn't easy to do on the GPU.
    > If you want to see what the 'state of the art' of realtime raytracing
    > / ray intersection acceleration, have a look at the forums at
    > http://ompf.org/forum/.
    >
    > Real benefits could be made in this area simply* by implementing an
    > improved intersection acceleration structure to the current octree.
    > Yves Poissant did a lot of work experimenting with different systems,
    > finding that SAH_BVH was best in his opinion. His patch is here:
    > https://projects.blender.org/tracker/index.php?func=detail&aid=
    >
    > cheers
    >
    > Matt
    >
    >
    > *for large values of simple ;)
    > _______________________________________________
    > Bf-committers mailing list
    > Bf-committers at blender.org
    > http://lists.blender.org/mailman/listinfo/bf-committers
    >



