[Bf-cycles] split kernel and CUDA

Brecht Van Lommel brechtvanlommel at pandora.be
Fri May 27 21:51:23 CEST 2016


In principle the compiler can figure out that it can reuse the same
memory for e.g. the SVM and BVH stacks, so decreasing the BVH stack size
could have no overall impact.

However, we've seen that the compiler can't always do this; maybe it
depends on the GPU model somehow.
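
If the compiler can't be trusted to do it, the sharing could also be
made explicit. A minimal sketch of the idea (sizes and names made up,
not the actual kernel code):

/* Hypothetical sketch: overlap the SVM and BVH stacks in one union,
 * since traversal and shading are never live at the same time within a
 * thread. Sizes are made up for illustration. */
#define BVH_STACK_SIZE 64
#define SVM_STACK_SIZE 32

typedef union KernelStacks {
    int bvh_stack[BVH_STACK_SIZE];    /* used during BVH traversal */
    float svm_stack[SVM_STACK_SIZE];  /* used during SVM shading */
} KernelStacks;

__device__ void kernel_path_trace_sketch(void)
{
    KernelStacks stacks;        /* one allocation serves both stacks */
    stacks.bvh_stack[0] = 0;    /* traversal phase */
    stacks.svm_stack[0] = 0.0f; /* shading phase reuses the same bytes */
}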

Stackless traversal might give a bigger speedup on AMD if its caches are
slower; I guess there is a reason FireRays uses it.
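
For reference, the usual stackless schemes backtrack through parent
links (or a bit trail) instead of popping a stack. A rough sketch along
the lines of the parent-link scheme (Hapala et al. 2011), with a
hypothetical node layout and intersection hooks; not necessarily what
FireRays does:

struct Node {
    int parent, left, right;  /* node indices; bounds omitted */
    int is_leaf;
};

enum { FROM_PARENT, FROM_SIBLING, FROM_CHILD };

/* Hypothetical hooks for the actual intersection work. */
__device__ bool intersect_bounds(const Node *node);
__device__ void intersect_primitives(const Node *node);

/* Stack-free traversal: the current node plus a 3-value state is enough
 * to know where to go next. Assumes the root is an interior node. */
__device__ void traverse_stackless(const Node *nodes, int root)
{
    int current = root;
    int state = FROM_PARENT;

    for (;;) {
        if (state == FROM_CHILD) {
            /* Returning upwards: visit the sibling if we came from the
             * left child, otherwise keep going up. */
            if (current == root)
                return;
            int parent = nodes[current].parent;
            if (current == nodes[parent].left) {
                current = nodes[parent].right;
                state = FROM_SIBLING;
            }
            else {
                current = parent;
                state = FROM_CHILD;
            }
        }
        else {
            /* First visit of this node (entered from parent or sibling). */
            bool hit = intersect_bounds(&nodes[current]);
            if (hit && !nodes[current].is_leaf) {
                current = nodes[current].left;  /* descend */
                state = FROM_PARENT;
            }
            else {
                if (hit)
                    intersect_primitives(&nodes[current]);
                /* Backtrack: a node entered from its parent is a left
                 * child, so its sibling comes next; otherwise go up. */
                if (state == FROM_PARENT && current != root) {
                    current = nodes[nodes[current].parent].right;
                    state = FROM_SIBLING;
                }
                else if (current == root) {
                    return;
                }
                else {
                    current = nodes[current].parent;
                    state = FROM_CHILD;
                }
            }
        }
    }
}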


On Fri, May 27, 2016 at 4:26 PM, Sergey Sharybin <sergey.vfx at gmail.com> wrote:
> That is weird, ptxas reports stack usage for each of the functions here...
>
> Anyway, I got a quick implementation of stackless BVH traversal [1]. It's
> currently quite a bit slower (around 20%), but it completely lacks the
> closest-child traversal heuristic, so hopefully bringing that back can
> compensate for the speed. What's much weirder is that the memory usage
> difference is next to nothing here (less than 1% even with a fully empty
> scene).
>
> Would be nice if someone ran additional tests on higher-end cards; maybe
> there'll be a difference (I was testing a GTX 760 and a C2075 so far).
>
> [1] https://developer.blender.org/D2032
>
> On Wed, May 25, 2016 at 10:29 AM, Brecht Van Lommel
> <brechtvanlommel at pandora.be> wrote:
>>
>> Windows 10, CUDA toolkit 7.5, GTX 960 (sm_52).
>>
>> For the stack, the ptxas output reports "0 bytes stack frame" for all
>> functions except the entry functions here. For the spills, if you add
>> those up they seem to exceed the numbers in the entry function, so I
>> guess the entry-function stats do not include all nested functions.
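>>
>> For reference, those stats come from passing the verbose flag through
>> to ptxas when building the kernel, something like:
>>
>>     nvcc -arch=sm_52 -cubin -Xptxas -v kernel.cu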
>>
>> On Wed, May 25, 2016 at 9:17 AM, Sergey Sharybin <sergey.vfx at gmail.com>
>> wrote:
>> > Brecht, that's a really nice breakdown! Question though: what OS and GPU
>> > did you use? I've been playing a bit last night with setting the BVH
>> > stack size to 4 and didn't see a measurable difference in either GPU
>> > memory consumption or compiler output, which is rather weird.
>> >
>> > Side question: do the stack/spill stats that ptxas reports for an entry
>> > function include all nested function calls, or are they only the stats
>> > of the kernel function itself?
>> >
>> > On Wed, May 25, 2016 at 12:29 AM, Brecht Van Lommel
>> > <brechtvanlommel at pandora.be> wrote:
>> >>
>> >> Stackless BVH traversal could be nice (I probably won't work on that
>> >> though). If FireRays is using it then it must work well on AMD cards.
>> >>
>> >> I've tried to make a breakdown of stack memory usage here. It seems BVH
>> >> stacks account for about 10-14% of stack memory now:
>> >> https://developer.blender.org/D2023#46333
>> >>
>> >> On Tue, May 24, 2016 at 12:46 PM, Sergey Sharybin
>> >> <sergey.vfx at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > Brecht, nice work indeed! :)
>> >> >
>> >> > Stefan, sharing the stack is an interesting idea indeed, but there are
>> >> > also techniques for stackless BVH traversal. If I read it correctly,
>> >> > that's exactly what was implemented in FireRays. It shouldn't be that
>> >> > hard to experiment with both shared-stack and stackless
>> >> > implementations. Let me know if you're up for those tests, or whether
>> >> > Brecht or I should look into this (so we don't do duplicated work).
>> >> >
>> >> > On Tue, May 24, 2016 at 12:32 PM, Stefan Werner
>> >> > <swerner at smithmicro.com>
>> >> > wrote:
>> >> >>
>> >> >> Impressive! That goes beyond what I've done so far. One thing we may
>> >> >> want to test is sharing the BVH traversal stack; my suspicion is that
>> >> >> nvcc for Maxwell also reserves memory for every possible instance of
>> >> >> the traversal function (triangle, hair, motion, SSS, etc.).
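>> >> >>
>> >> >> Roughly what I have in mind, as a sketch with hypothetical names:
>> >> >> allocate the stack once per thread and pass it into every traversal
>> >> >> variant, instead of each inlined copy declaring its own:
>> >> >>
>> >> >> #define BVH_STACK_SIZE 64
>> >> >>
>> >> >> /* Each traversal variant takes the caller's stack instead of
>> >> >>  * declaring `int stack[BVH_STACK_SIZE]` locally, so frame space is
>> >> >>  * reserved once per thread, not once per inlined instance. */
>> >> >> __device__ bool bvh_intersect_triangle(int *stack)
>> >> >> {
>> >> >>     stack[0] = 0;  /* real traversal would push/pop nodes here */
>> >> >>     return false;
>> >> >> }
>> >> >>
>> >> >> __device__ bool bvh_intersect_hair(int *stack)
>> >> >> {
>> >> >>     stack[0] = 0;
>> >> >>     return false;
>> >> >> }
>> >> >>
>> >> >> __global__ void kernel_path_trace(void)
>> >> >> {
>> >> >>     int traversal_stack[BVH_STACK_SIZE];  /* one shared instance */
>> >> >>     bvh_intersect_triangle(traversal_stack);
>> >> >>     bvh_intersect_hair(traversal_stack);
>> >> >> }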
>> >> >>
>> >> >> Next up in terms of breaking the memory barrier is using host memory
>> >> >> when CUDA runs out of device memory; we've tested it extensively for
>> >> >> 2D textures already in Poser. I'm working on a patch right now, it
>> >> >> will just take a little time to make it work with 3D and bindless
>> >> >> textures. When using host memory, I can throw GBs worth of textures
>> >> >> at an anemic GTX 460 (768MB VRAM).
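>> >> >>
>> >> >> The core of it is just mapped, pinned host memory. A stripped-down
>> >> >> sketch of the fallback path (error handling omitted, function name
>> >> >> hypothetical):
>> >> >>
>> >> >> #include <cuda_runtime.h>
>> >> >>
>> >> >> /* Fall back to mapped (zero-copy) host memory when the device
>> >> >>  * allocation fails. The GPU then reads over PCIe, which is slower,
>> >> >>  * but texture data can exceed VRAM. Requires the context to be
>> >> >>  * created with mapping enabled (cudaDeviceMapHost). */
>> >> >> void alloc_device_or_host(size_t size, void **device_ptr)
>> >> >> {
>> >> >>     if (cudaMalloc(device_ptr, size) == cudaSuccess)
>> >> >>         return;  /* normal device allocation */
>> >> >>
>> >> >>     void *host_mem = NULL;
>> >> >>     cudaHostAlloc(&host_mem, size, cudaHostAllocMapped);
>> >> >>     cudaHostGetDevicePointer(device_ptr, host_mem, 0);
>> >> >> }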
>> >> >>
>> >> >> -Stefan
>> >> >>
>> >> >> On 5/22/16, 6:53 PM, Brecht Van Lommel
>> >> >> <brechtvanlommel at pandora.be> wrote:
>> >> >>
>> >> >> >I've added some optimizations for reducing stack memory usage here:
>> >> >> >https://developer.blender.org/D2023
>> >> >> >
>> >> >> >On Wed, May 18, 2016 at 2:27 PM, Stefan Werner
>> >> >> > <swerner at smithmicro.com>
>> >> >> > wrote:
>> >> >> >> Don't get too excited too early. The more I work with it, the more
>> >> >> >> it looks like it's just an elaborate workaround for compiler
>> >> >> >> behavior. It appears that NVCC insists on inlining everything on
>> >> >> >> Maxwell, ignoring any __noinline__ hints. So far there are no
>> >> >> >> benefits whatsoever on Kepler, where NVCC appears to do the right
>> >> >> >> thing out of the box.
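>> >> >> >>
>> >> >> >> For context, the hint in question is just the standard CUDA
>> >> >> >> qualifier; on Kepler it is enough to keep a function's locals out
>> >> >> >> of the caller's frame, on Maxwell the function gets inlined
>> >> >> >> anyway. A trivial example (names made up):
>> >> >> >>
>> >> >> >> /* __noinline__ should make this a real call with its own stack
>> >> >> >>  * frame instead of being folded into the kernel. */
>> >> >> >> __device__ __noinline__ float shade(float u)
>> >> >> >> {
>> >> >> >>     float big_buffer[64];  /* locals that should not count
>> >> >> >>                             * against the caller's frame */
>> >> >> >>     big_buffer[0] = u;
>> >> >> >>     return big_buffer[0] * 2.0f;
>> >> >> >> }
>> >> >> >>
>> >> >> >> __global__ void kernel(float *out)
>> >> >> >> {
>> >> >> >>     out[0] = shade(out[0]);
>> >> >> >> }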
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> I submitted a bug report to Nvidia about the difference in stack
>> >> >> >> usage between Kepler and Maxwell last year, and it was marked as
>> >> >> >> resolved and to be shipped in the next CUDA update. So maybe I
>> >> >> >> shouldn't spend too much time on it until we see CUDA 8.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> -Stefan
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Thomas Dinges
>> >> >> >> <blender at dingto.org>
>> >> >> >> Reply-To: Discussion list to assist Cycles render engine developers
>> >> >> >> <bf-cycles at blender.org>
>> >> >> >> Date: Tuesday, May 17, 2016 at 4:45 PM
>> >> >> >> To: Discussion list to assist Cycles render engine developers
>> >> >> >> <bf-cycles at blender.org>
>> >> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> That sounds promising; feel free to submit a patch for this and we
>> >> >> >> can check. :)
>> >> >> >>
>> >> >> >> On 17.05.2016 at 16:40, Stefan Werner wrote:
>> >> >> >>
>> >> >> >> The patch is surprisingly clean. It removes some of the #ifdef
>> >> >> >> __SPLIT_KERNEL__ blocks and unifies CPU, OpenCL and CUDA a bit
>> >> >> >> more. I didn't run a speed benchmark, and I wouldn't even make
>> >> >> >> speed the ultimate top priority: right now, the problem we see in
>> >> >> >> the field is that people are unable to use high-end gaming GPUs
>> >> >> >> because the VRAM is so full of geometry and textures that the CUDA
>> >> >> >> runtime doesn't have room for kernel memory any more. On my
>> >> >> >> 1664-core M4000 card, I see a simple kernel launch already taking
>> >> >> >> ~1600MB of VRAM with almost empty scenes.
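>> >> >> >>
>> >> >> >> As a back-of-the-envelope check, assuming the runtime reserves the
>> >> >> >> full ptxas stack frame for every thread that can be resident at
>> >> >> >> once: the M4000's 1664 cores are 13 Maxwell SMs of 128 cores each,
>> >> >> >> with up to 2048 resident threads per SM, so
>> >> >> >>
>> >> >> >>     13 SMs * 2048 threads * 68416 bytes ~= 1.8GB
>> >> >> >>
>> >> >> >> which is in the same ballpark as what I see.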
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> It looks to me like the CUDA compiler reserves room for every
>> >> >> >> stack instance of ShaderData (or other structs) in advance, and
>> >> >> >> that sharing that memory instead of instantiating it separately is
>> >> >> >> an easy way to reduce VRAM requirements without changing the code
>> >> >> >> much.
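>> >> >> >>
>> >> >> >> In code, the change is essentially this; a sketch with made-up
>> >> >> >> names, where ShaderData stands in for the real, much larger
>> >> >> >> struct:
>> >> >> >>
>> >> >> >> struct ShaderData { float data[256]; };
>> >> >> >>
>> >> >> >> /* Before: every (inlined) call site gets its own stack instance,
>> >> >> >>  * and ptxas reserves frame space for each of them in advance. */
>> >> >> >> __device__ void shade_surface_stack(void)
>> >> >> >> {
>> >> >> >>     ShaderData sd;  /* per-call-site stack instance */
>> >> >> >>     sd.data[0] = 0.0f;
>> >> >> >> }
>> >> >> >>
>> >> >> >> /* After: one instance per thread in a preallocated global array
>> >> >> >>  * (sd_pool is hypothetical), shared by all call sites. */
>> >> >> >> __device__ ShaderData *sd_pool;
>> >> >> >>
>> >> >> >> __device__ void shade_surface_shared(void)
>> >> >> >> {
>> >> >> >>     int tid = blockIdx.x * blockDim.x + threadIdx.x;
>> >> >> >>     ShaderData *sd = &sd_pool[tid];  /* shared instance */
>> >> >> >>     sd->data[0] = 0.0f;
>> >> >> >> }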
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> -Stefan
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Sergey
>> >> >> >> Sharybin
>> >> >> >> <sergey.vfx at gmail.com>
>> >> >> >> Reply-To: Discussion list to assist Cycles render engine
>> >> >> >> developers
>> >> >> >> <bf-cycles at blender.org>
>> >> >> >> Date: Tuesday, May 17, 2016 at 9:20 AM
>> >> >> >> To: Discussion list to assist Cycles render engine developers
>> >> >> >> <bf-cycles at blender.org>
>> >> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> hi,
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Lukas Stockner was doing experiments with a CUDA split kernel.
>> >> >> >> With the current design of the split it was actually taking more
>> >> >> >> VRAM, AFAIR. Hopefully he'll read this mail and reply in more
>> >> >> >> detail.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Would be cool to have this front moving forward, but I fear we'll
>> >> >> >> have to step back and reconsider some things about how the split
>> >> >> >> kernel works together with the regular one.
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Those are interesting results on the stack memory! I can see the
>> >> >> >> number of spill loads go up though; did you measure whether it
>> >> >> >> gives a measurable render time slowdown? And how messy is the
>> >> >> >> patch, I wonder :)
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, May 17, 2016 at 8:47 AM, Stefan Werner
>> >> >> >> <swerner at smithmicro.com>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> Has anyone experimented with building a split kernel for CUDA? It
>> >> >> >> seems to me that this could lift some of the limitations on Nvidia
>> >> >> >> hardware, such as the high memory requirements on cards with many
>> >> >> >> CUDA cores, or the driver timeout. I just tried out what happens
>> >> >> >> when I take the shared ShaderData (KernelGlobals.sd_input) from
>> >> >> >> the split kernel into the CUDA kernel, as opposed to creating
>> >> >> >> separate ShaderData structs on the stack, and it looks like it has
>> >> >> >> an impact:
>> >> >> >>
>> >> >> >> before:
>> >> >> >> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
>> >> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
>> >> >> >>     68416 bytes stack frame, 1188 bytes spill stores, 3532 bytes spill loads
>> >> >> >>
>> >> >> >> after:
>> >> >> >> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
>> >> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
>> >> >> >>     58976 bytes stack frame, 1256 bytes spill stores, 3676 bytes spill loads
>> >> >> >>
>> >> >> >> -Stefan
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> --
>> >> >> >>
>> >> >> >> With best regards, Sergey Sharybin
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > With best regards, Sergey Sharybin
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > With best regards, Sergey Sharybin
>> >
>
>
>
>
> --
> With best regards, Sergey Sharybin
>
>

