[Bf-cycles] split kernel and CUDA

Stefan Werner swerner at smithmicro.com
Mon May 30 10:14:09 CEST 2016

It very much depends on the GPU model. You can try it yourself if you call nvcc directly to compile the kernel once for SM 3.x and once for SM 5.x. I’m not sure if nvcc does this on purpose, for performance reasons on SM 5.x, but I wish there was a way to shut off that behavior.
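
For anyone who wants to repeat the comparison, here is a rough sketch of how the two builds could be scripted. It assumes nvcc is on the PATH and that the kernel source path you pass in (for example a hypothetical kernel.cu) compiles standalone; a real Cycles build needs extra include paths and defines, which are left out here. The helper names parse_ptxas and compile_and_report are made up for illustration. The script only relies on the "bytes stack frame" lines that ptxas prints in verbose mode.

```python
import re
import subprocess

# Lines printed by "ptxas -v" for every function it compiles.
FUNC_RE = re.compile(r"Function properties for (\S+)")
STACK_RE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)


def parse_ptxas(log):
    """Map function name -> (stack frame, spill stores, spill loads), in bytes."""
    usage = {}
    current = None
    for line in log.splitlines():
        match = FUNC_RE.search(line)
        if match:
            # Remember the function; its numbers follow on the next line.
            current = match.group(1)
            continue
        match = STACK_RE.search(line)
        if match and current is not None:
            usage[current] = tuple(int(group) for group in match.groups())
            current = None
    return usage


def compile_and_report(source, arch):
    """Compile `source` for one architecture and return the parsed statistics.

    `source` is a placeholder path; add the include paths and defines
    your own build uses to the command line below.
    """
    cmd = [
        "nvcc",
        "-arch=" + arch,          # e.g. "sm_30" or "sm_52"
        "-cubin",                 # stop after producing the cubin
        "-Xptxas", "-v",          # ask ptxas for per-function statistics
        "-o", "kernel_" + arch + ".cubin",
        source,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # ptxas writes its verbose statistics to stderr.
    return parse_ptxas(result.stderr)
```

Calling compile_and_report once with "sm_30" and once with "sm_52" on the same source and comparing the two dictionaries should show whether the stack frame sizes differ between the architectures.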


>In principle the compiler can figure out that it can use e.g. the same
>memory for the SVM and BVH stacks and so decreasing the BVH stack size
>could have no impact overall.
>However we've seen that the compiler can't always do this, maybe it
>depends on the GPU model somehow.
>Stackless might help speed things up more on AMD if it has slower caches, I
>guess there is a reason FireRays uses it.
>On Fri, May 27, 2016 at 4:26 PM, Sergey Sharybin <sergey.vfx at gmail.com> wrote:
>> That is weird, ptxas reports stack for each of the functions here..
>> Anyway, got a quick implementation of stackless BVH traversal [1]. It's
>> currently quite a bit slower (like, 20%), but it totally lacks the
>> closest-child-traversal heuristic, so hopefully by bringing it back we can
>> compensate for the speed. However, what's much more weird, the memory usage
>> difference is next to nothing here (it's less than 1% even with a fully empty
>> scene).
>> Would be nice if someone ran additional tests on higher-end cards,
>> maybe there'll be a difference (I was testing 760 and C2075 so far).
>> [1] https://developer.blender.org/D2032
>> On Wed, May 25, 2016 at 10:29 AM, Brecht Van Lommel
>> <brechtvanlommel at pandora.be> wrote:
>>> Windows 10, CUDA toolkit 7.5, GTX 960 (sm_52).
>>> For the stack ptxas output, it reports "0 bytes stack frame" for all
>>> functions except the entry functions here. For the spills, if you add
>>> those up it seems to exceed the numbers in the entry function, so I
>>> guess it does not include all nested functions.
>>> On Wed, May 25, 2016 at 9:17 AM, Sergey Sharybin <sergey.vfx at gmail.com>
>>> wrote:
>>> > Brecht, that's a really nice breakdown! Question though: what OS and GPU
>>> > did you use? I've been playing a bit last night with setting the BVH stack
>>> > size to 4 and didn't see a measurable difference in either GPU memory
>>> > consumption or compiler output, which is rather weird.
>>> >
>>> > Side question: do the stack/spill stats reported by ptxas for the entry
>>> > function include all nested function calls, or are they only the stats
>>> > of the kernel function itself?
>>> >
>>> > On Wed, May 25, 2016 at 12:29 AM, Brecht Van Lommel
>>> > <brechtvanlommel at pandora.be> wrote:
>>> >>
>>> >> Stackless BVH traversal could be nice (I probably won't work on that
>>> >> though). If FireRays is using it then it must work well on AMD cards.
>>> >>
>>> >> I've tried to make a breakdown of stack memory usage here. Seems BVH
>>> >> stacks account for about 10-14% of stack memory now:
>>> >> https://developer.blender.org/D2023#46333
>>> >>
