[Bf-cycles] split kernel and CUDA

Sergey Sharybin sergey.vfx at gmail.com
Fri May 27 16:26:31 CEST 2016


That is weird; ptxas reports a stack frame for each of the functions here.

Anyway, I've got a quick implementation of stackless BVH traversal [1]. It's
currently quite a bit slower (around 20%), but it completely lacks the
closest-child-traversal heuristic, so hopefully bringing that back will
compensate for the speed. What's much stranger, though, is that the memory
usage difference is next to nothing here (less than 1%, even with a fully
empty scene).
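For context, the usual way a stackless scheme replaces the per-thread
traversal stack is with parent links plus a small "where did we come from"
state machine (in the style of Hapala et al.). The sketch below is purely
illustrative — it is not the D2032 code, and it uses 1D intervals in place of
real AABBs and rays:

```cpp
// Stackless BVH traversal sketch (parent-link state machine).
// All names are illustrative; 1D intervals stand in for AABB/ray tests.
#include <cassert>
#include <vector>

struct Node {
    float lo, hi;     // 1D "bounding box"
    int left, right;  // child indices, -1 for leaf
    int parent;       // -1 for root
};

// Collect every leaf whose interval contains q, using no traversal stack:
// the whole traversal state is just (current node, 3-state flag).
std::vector<int> traverse(const std::vector<Node>& bvh, int root, float q)
{
    std::vector<int> hits;
    auto miss = [&](int n) { return q < bvh[n].lo || q > bvh[n].hi; };
    auto leaf = [&](int n) { return bvh[n].left < 0; };

    if (miss(root)) return hits;
    if (leaf(root)) { hits.push_back(root); return hits; }

    enum { FROM_PARENT, FROM_SIBLING, FROM_CHILD } state = FROM_PARENT;
    int cur = bvh[root].left;  // always descend into the left child first

    for (;;) {
        switch (state) {
        case FROM_CHILD:
            if (cur == root) return hits;          // traversal finished
            if (cur == bvh[bvh[cur].parent].left) {
                cur = bvh[bvh[cur].parent].right;  // left done, go right
                state = FROM_SIBLING;
            } else {
                cur = bvh[cur].parent;             // both children done
                state = FROM_CHILD;
            }
            break;
        case FROM_SIBLING:
        case FROM_PARENT:
            if (miss(cur) || leaf(cur)) {
                if (!miss(cur))
                    hits.push_back(cur);           // record leaf hit
                // Subtree done: to right sibling if we came from the
                // parent, otherwise back up to the parent.
                if (state == FROM_PARENT) {
                    cur = bvh[bvh[cur].parent].right;
                    state = FROM_SIBLING;
                } else {
                    cur = bvh[cur].parent;
                    state = FROM_CHILD;
                }
            } else {
                cur = bvh[cur].left;               // descend
                state = FROM_PARENT;
            }
            break;
        }
    }
}
```

The point being that the per-thread traversal state shrinks from a fixed
stack array to a node index plus a flag, which is where the stack-frame
savings would have to come from.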

It would be nice if someone could run additional tests on higher-end cards;
maybe there will be a difference there (I've been testing a GTX 760 and a
C2075 so far).

[1] https://developer.blender.org/D2032

On Wed, May 25, 2016 at 10:29 AM, Brecht Van Lommel <
brechtvanlommel at pandora.be> wrote:

> Windows 10, CUDA toolkit 7.5, GTX 960 (sm_52).
>
> For the stack ptxas output, it reports "0 bytes stack frame" for all
> functions except the entry functions here. For the spills, if you add
> those up, the total seems to exceed the numbers in the entry function, so
> I guess it does not include all nested functions.
>
> On Wed, May 25, 2016 at 9:17 AM, Sergey Sharybin <sergey.vfx at gmail.com>
> wrote:
> > Brecht, that's a really nice breakdown! Question though: what OS and GPU
> > did you use? I've been playing a bit last night with setting the BVH
> > stack size to 4 and didn't see a measurable difference in either GPU
> > memory consumption or compiler output, which is rather weird.
> >
> > Side question: do the stack/spill stats that ptxas reports for an entry
> > function include all nested function calls, or are they only the stats
> > of the kernel function itself?
> >
> > On Wed, May 25, 2016 at 12:29 AM, Brecht Van Lommel
> > <brechtvanlommel at pandora.be> wrote:
> >>
> >> Stackless BVH traversal could be nice (I probably won't work on that
> >> though). If FireRays is using it then it must work well on AMD cards.
> >>
> >> I've tried to make a breakdown of stack memory usage here. Seems BVH
> >> stacks account for about 10-14% of stack memory now:
> >> https://developer.blender.org/D2023#46333
> >>
> >> On Tue, May 24, 2016 at 12:46 PM, Sergey Sharybin <sergey.vfx at gmail.com
> >
> >> wrote:
> >> > Hi,
> >> >
> >> > Brecht, nice work indeed! :)
> >> >
> >> > Stefan, sharing the stack is an interesting idea indeed, but there
> >> > are also techniques for stackless BVH traversal. If I read it
> >> > correctly, that's exactly what was implemented in FireRays. It
> >> > shouldn't be that hard to experiment with both shared-stack and
> >> > stackless implementations. Let me know if you're up for those tests,
> >> > or whether Brecht or I should look into this (so we don't do
> >> > duplicated work).
> >> >
> >> > On Tue, May 24, 2016 at 12:32 PM, Stefan Werner <
> swerner at smithmicro.com>
> >> > wrote:
> >> >>
> >> >> Impressive! That goes beyond what I’ve done so far. One thing we may
> >> >> want to test is sharing the BVH traversal stack; my suspicion is that
> >> >> nvcc for Maxwell also reserves memory for every possible instance of
> >> >> the traversal function (triangle, hair, motion, SSS, etc.).
> >> >>
> >> >> Next up in terms of breaking the memory barrier is using host memory
> >> >> when CUDA runs out of device memory; we’ve tested this extensively
> >> >> for 2D textures already in Poser. I’m working on a patch right now;
> >> >> it will just take a little time to make it work with 3D and bindless
> >> >> textures. When using host memory, I can throw GBs’ worth of textures
> >> >> at an anemic GTX 460 (768 MB VRAM).
> >> >>
> >> >> -Stefan
> >> >>
> >> >> On 5/22/16, 6:53 PM, "bf-cycles-bounces at blender.org on behalf of
> Brecht
> >> >> Van Lommel" <bf-cycles-bounces at blender.org on behalf of
> >> >> brechtvanlommel at pandora.be> wrote:
> >> >>
> >> >> >I've added some optimizations for reducing stack memory usage here:
> >> >> >https://developer.blender.org/D2023
> >> >> >
> >> >> >On Wed, May 18, 2016 at 2:27 PM, Stefan Werner
> >> >> > <swerner at smithmicro.com>
> >> >> > wrote:
> >> >> >> Don’t get too excited too early. The more I work with it, the
> >> >> >> more it looks like it’s just an elaborate workaround for compiler
> >> >> >> behavior. It appears that NVCC insists on inlining everything on
> >> >> >> Maxwell, ignoring any __noinline__ hints. So far, there are no
> >> >> >> benefits whatsoever on Kepler, where NVCC appears to do the right
> >> >> >> thing out of the box.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> I submitted a bug report to Nvidia about the difference in stack
> >> >> >> usage
> >> >> >> between Kepler and Maxwell last year, and it was marked as
> resolved
> >> >> >> and
> >> >> >> to
> >> >> >> be shipped in the next CUDA update. So maybe I shouldn’t spend too
> >> >> >> much
> >> >> >> time
> >> >> >> with it until we see CUDA 8.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -Stefan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Thomas Dinges
> >> >> >> <blender at dingto.org>
> >> >> >> Reply-To: Discussion list to assist Cycles render engine
> developers
> >> >> >> <bf-cycles at blender.org>
> >> >> >> Date: Tuesday, May 17, 2016 at 4:45 PM
> >> >> >>
> >> >> >>
> >> >> >> To: Discussion list to assist Cycles render engine developers
> >> >> >> <bf-cycles at blender.org>
> >> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> That sounds promising, feel free to submit a patch for this and we
> >> >> >> can
> >> >> >> check. :)
> >> >> >>
> >> >> >> Am 17.05.2016 um 16:40 schrieb Stefan Werner:
> >> >> >>
> >> >> >> The patch is surprisingly clean. It removes some of the #ifdef
> >> >> >> __SPLIT_KERNEL__ blocks and unifies CPU, OpenCL and CUDA a bit
> more.
> >> >> >> I
> >> >> >> didn’t run a speed benchmark, and I wouldn’t even make speed the
> >> >> >> ultimate
> >> >> >> top priority: Right now, the problem we see in the field is that
> >> >> >> people
> >> >> >> are
> >> >> >> unable to use high-end gaming GPUs because the VRAM is so full of
> >> >> >> geometry
> >> >> >> and textures that the CUDA runtime doesn’t have room for kernel
> >> >> >> memory
> >> >> >> any
> >> >> >> more. On my 1664-core M4000 card, I see a simple kernel launch
> >> >> >> already taking ~1600 MB of VRAM with almost empty scenes.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> It looks to me like the CUDA compiler reserves room for every
> stack
> >> >> >> instance
> >> >> >> of ShaderData (or other structs) in advance, and that sharing that
> >> >> >> memory
> >> >> >> instead of instantiating it separately is an easy way to reduce
> VRAM
> >> >> >> requirements without changing the code much.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -Stefan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Sergey
> Sharybin
> >> >> >> <sergey.vfx at gmail.com>
> >> >> >> Reply-To: Discussion list to assist Cycles render engine
> developers
> >> >> >> <bf-cycles at blender.org>
> >> >> >> Date: Tuesday, May 17, 2016 at 9:20 AM
> >> >> >> To: Discussion list to assist Cycles render engine developers
> >> >> >> <bf-cycles at blender.org>
> >> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> hi,
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Lukas Stockner was doing experiments with a CUDA split kernel.
> >> >> >> With the current design of the split it was actually taking more
> >> >> >> VRAM, AFAIR. Hopefully he'll read this mail and reply in more
> >> >> >> detail.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> It would be cool to have this front moving forward, but I fear
> >> >> >> we'll have to step back and reconsider some things about how the
> >> >> >> split kernel works together with the regular one.
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> Those are interesting results on the stack memory! I can see the
> >> >> >> number of spill loads go up, though; did you measure whether it
> >> >> >> gives a measurable render time slowdown? And how messy is the
> >> >> >> patch, I wonder :)
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Tue, May 17, 2016 at 8:47 AM, Stefan Werner
> >> >> >> <swerner at smithmicro.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> Has anyone experimented with building a split kernel for CUDA? It
> >> >> >> seems to me that this could lift some of the limitations on Nvidia
> >> >> >> hardware, such as the high memory requirements on cards with many
> >> >> >> CUDA cores, or the driver timeout. I just tried out what happens
> >> >> >> when I take the shared ShaderData (KernelGlobals.sd_input) from
> >> >> >> the split kernel into the CUDA kernel, as opposed to creating
> >> >> >> separate ShaderData structs on the stack, and it looks like it has
> >> >> >> an impact:
> >> >> >>
> >> >> >> before:
> >> >> >> ptxas info    : Compiling entry function
> >> >> >> 'kernel_cuda_branched_path_trace' for 'sm_50'
> >> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
> >> >> >>     68416 bytes stack frame, 1188 bytes spill stores, 3532 bytes spill loads
> >> >> >>
> >> >> >> after:
> >> >> >> ptxas info    : Compiling entry function
> >> >> >> 'kernel_cuda_branched_path_trace' for 'sm_50'
> >> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
> >> >> >>     58976 bytes stack frame, 1256 bytes spill stores, 3676 bytes spill loads
> >> >> >>
> >> >> >> -Stefan
> >> >> >>
> >> >> >> _______________________________________________
> >> >> >> Bf-cycles mailing list
> >> >> >> Bf-cycles at blender.org
> >> >> >> https://lists.blender.org/mailman/listinfo/bf-cycles
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >
> >> >
> >> >
> >> >
> >
> >
> >
> >
>



-- 
With best regards, Sergey Sharybin

