[Soc-2009-dev] Ray-tracing & SIMD
ypoissant2 at videotron.ca
Mon Aug 10 19:54:17 CEST 2009
The build with SAH builds a 2-way tree.. but then the tree optimizations (done in O(N) and O(NLogN)) (which reduce the expected number of BB-tests) transform it into a tree where each node can have any number of childs.
This "any number of childs" make it harder to implement groupped childs.
I'd guess so. One thing I learned with SSE is that if the data is not already organized as SoA when needed, then any additional operation to organize it as SoA on demand will kill any little gain from processing SIMD way.
If your tree build algorithm does not allow to systematically pack 4 nodes as SoA then I'd say it is not worth pursuing a SIMD implementation with this tree build algorithm for speed improvement alone. For the acquired experience though, it is always goos to try that sort of thing.
Both Dammertz, Hanika & Keller, and Ernst & Greimer do the a priori 4-ary SoA systematically. Wald, Benthin & Boulos are not so systematic but they use 16-wide SIMD (we can guess it is the Larrabee) and they never get nodes larger than 16 anyway. But because their SIMD is so wide, they nevertheless get significant gains
eg.: some sample data of a subset of the overlap2 scene (witouth tree optimization / with tree optimization):
BB tests per ray: 96.979 / 76.210
BB hits per ray: 64.598 / 33.964
That is very interesting results. Have you implemented an algorithm that you picked from a paper or is that an algorithm you developped yourself? Is this based on SAH, an extension of it or something else? Can you give information? (I'd understand if you keep it for a paper though)
About ray-coherency, I decided to only exploit it using hints to build tree-cuts (like a LCTS), as other type of approach would be too troublesome and a waste of time considering the modifications on BI code that will probably be modified soon.
When I checked the ray generation code in Blender, it seemed difficult to break the code so we can generate bundles of rays. I agree with you there.
As so my initial approach to SIMD, was the simplest as possible, 1ray-4bvhs.
The current SoC tree structures was not organized for it, and so an initial try was implemented by poping 4nodes at a time from the dfs stack.
(This allowed to test simd ray-bb code, improve BLI_arena for doing 16aligned mallocs, and for me to get use to simd code)
Well, that is good experience anyway. And the addition to BLI_arena is cool too.
During this week I plan to optimize the data structure not only for grouping childs of BVH's, but also to yeild better memory access for non-SIMD builds.
So I think I am going on a good direction!
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Soc-2009-dev