[Bf-cycles] split kernel and CUDA

Sergey Sharybin sergey.vfx at gmail.com
Wed May 25 09:17:07 CEST 2016


Brecht, that's a really nice breakdown! One question though: which OS and GPU
did you use? I was playing a bit last night with setting the BVH stack size to
4 and didn't see a measurable difference in either GPU memory consumption or
in the compiler output, which is rather weird.

Side question: do the stack/spill stats that ptxas reports for an entry
function include all nested function calls, or are they only the stats of the
kernel function itself?
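For reference, those stats come from ptxas's verbose output; a minimal way to reproduce them outside the Blender build (assuming a standalone kernel.cu, which is hypothetical here) would be something like:

```shell
# Compile one kernel for Maxwell (sm_50) with verbose ptxas output.
# ptxas prints a "Function properties" block per function that survives
# compilation: fully inlined callees are folded into the entry function's
# stack frame, while non-inlined device functions get their own entries.
nvcc -arch=sm_50 -cubin -Xptxas=-v kernel.cu -o kernel.cubin
```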

On Wed, May 25, 2016 at 12:29 AM, Brecht Van Lommel <
brechtvanlommel at pandora.be> wrote:

> Stackless BVH traversal could be nice (I probably won't work on that
> though). If FireRays is using it then it must work well on AMD cards.
>
> I've tried to make a breakdown of stack memory usage here. Seems BVH
> stacks account for about 10-14% of stack memory now:
> https://developer.blender.org/D2023#46333
>
> On Tue, May 24, 2016 at 12:46 PM, Sergey Sharybin <sergey.vfx at gmail.com>
> wrote:
> > Hi,
> >
> > Brecht, nice work indeed! :)
> >
> > Stefan, sharing the stack is an interesting idea indeed, but there are
> > also techniques for stack-less BVH traversal. If I read it correctly,
> > that's exactly what was implemented in FireRays. It shouldn't be that
> > hard to experiment with both shared-stack and stackless implementations.
> > Let me know if you're up for those tests, or whether Brecht or I should
> > look into this (so we don't duplicate work).
> >
> > On Tue, May 24, 2016 at 12:32 PM, Stefan Werner <swerner at smithmicro.com>
> > wrote:
> >>
> >> Impressive! That goes beyond what I’ve done so far. One thing we may
> >> want to test is sharing the BVH traversal stack; my suspicion is that
> >> nvcc for Maxwell also reserves memory for every possible instance of
> >> the traversal function (triangle, hair, motion, SSS, etc.).
> >>
> >> Next up in terms of breaking the memory barrier is using host memory
> >> when CUDA runs out of device memory; we’ve already tested this
> >> extensively for 2D textures in Poser. I’m working on a patch right now;
> >> it will just take a little time to make it work with 3D and bindless
> >> textures. When using host memory, I can throw GBs worth of textures at
> >> an anemic GTX 460 (768MB VRAM).
> >>
> >> -Stefan
> >>
> >> On 5/22/16, 6:53 PM, "bf-cycles-bounces at blender.org on behalf of Brecht
> >> Van Lommel" <bf-cycles-bounces at blender.org on behalf of
> >> brechtvanlommel at pandora.be> wrote:
> >>
> >> >I've added some optimizations for reducing stack memory usage here:
> >> >https://developer.blender.org/D2023
> >> >
> >> >On Wed, May 18, 2016 at 2:27 PM, Stefan Werner
> >> ><swerner at smithmicro.com> wrote:
> >> >> Don’t be too excited too early. The more I work with it, the more
> >> >> it looks like it’s just an elaborate workaround for compiler
> >> >> behavior. It appears that NVCC insists on inlining everything on
> >> >> Maxwell, ignoring any __noinline__ hints. So far, there are no
> >> >> benefits whatsoever on Kepler, where NVCC appears to do the right
> >> >> thing out of the box.
> >> >>
> >> >>
> >> >>
> >> >> I submitted a bug report to Nvidia about the difference in stack
> >> >> usage between Kepler and Maxwell last year, and it was marked as
> >> >> resolved and to be shipped in the next CUDA update. So maybe I
> >> >> shouldn’t spend too much time on it until we see CUDA 8.
> >> >>
> >> >>
> >> >>
> >> >> -Stefan
> >> >>
> >> >>
> >> >>
> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Thomas Dinges
> >> >> <blender at dingto.org>
> >> >> Reply-To: Discussion list to assist Cycles render engine developers
> >> >> <bf-cycles at blender.org>
> >> >> Date: Tuesday, May 17, 2016 at 4:45 PM
> >> >>
> >> >>
> >> >> To: Discussion list to assist Cycles render engine developers
> >> >> <bf-cycles at blender.org>
> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
> >> >>
> >> >>
> >> >>
> >> >> That sounds promising; feel free to submit a patch for this and we
> >> >> can check. :)
> >> >>
> >> >> Am 17.05.2016 um 16:40 schrieb Stefan Werner:
> >> >>
> >> >> The patch is surprisingly clean. It removes some of the #ifdef
> >> >> __SPLIT_KERNEL__ blocks and unifies CPU, OpenCL and CUDA a bit more.
> >> >> I didn’t run a speed benchmark, and I wouldn’t even make speed the
> >> >> top priority: right now, the problem we see in the field is that
> >> >> people are unable to use high-end gaming GPUs because the VRAM is so
> >> >> full of geometry and textures that the CUDA runtime doesn’t have
> >> >> room for kernel memory any more. On my 1664-core M4000 card, I see a
> >> >> simple kernel launch already taking ~1600MB of VRAM with almost
> >> >> empty scenes.
> >> >>
> >> >>
> >> >>
> >> >> It looks to me like the CUDA compiler reserves room for every stack
> >> >> instance
> >> >> of ShaderData (or other structs) in advance, and that sharing that
> >> >> memory
> >> >> instead of instantiating it separately is an easy way to reduce VRAM
> >> >> requirements without changing the code much.
> >> >>
> >> >>
> >> >>
> >> >> -Stefan
> >> >>
> >> >>
> >> >>
> >> >> From: <bf-cycles-bounces at blender.org> on behalf of Sergey Sharybin
> >> >> <sergey.vfx at gmail.com>
> >> >> Reply-To: Discussion list to assist Cycles render engine developers
> >> >> <bf-cycles at blender.org>
> >> >> Date: Tuesday, May 17, 2016 at 9:20 AM
> >> >> To: Discussion list to assist Cycles render engine developers
> >> >> <bf-cycles at blender.org>
> >> >> Subject: Re: [Bf-cycles] split kernel and CUDA
> >> >>
> >> >>
> >> >>
> >> >> Hi,
> >> >>
> >> >>
> >> >>
> >> >> Lukas Stockner was doing experiments with a CUDA split kernel. With
> >> >> the current design of the split it was actually taking more VRAM,
> >> >> AFAIR. Hopefully he'll read this mail and reply in more detail.
> >> >>
> >> >>
> >> >>
> >> >> It would be cool to have this front moving forward, but I fear
> >> >> we'll have to step back and reconsider some things about how the
> >> >> split kernel works together with the regular one.
> >> >>
> >> >>
> >> >>
> >> >> There are interesting results on the stack memory! I can see the
> >> >> number of spill loads go up though; did you measure whether it gives
> >> >> a measurable render time slowdown? And how messy is the patch, I
> >> >> wonder :)
> >> >>
> >> >>
> >> >>
> >> >> On Tue, May 17, 2016 at 8:47 AM, Stefan Werner
> >> >> <swerner at smithmicro.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> Has anyone experimented with building a split kernel for CUDA? It
> >> >> seems to me that this could lift some of the limitations on Nvidia
> >> >> hardware, such as the high memory requirements on cards with many
> >> >> CUDA cores, or the driver timeout. I just tried out what happens
> >> >> when I take the shared ShaderData (KernelGlobals.sd_input) from the
> >> >> split kernel into the CUDA kernel, as opposed to creating separate
> >> >> ShaderData structs on the stack, and it looks like it has an impact:
> >> >>
> >> >> before:
> >> >> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
> >> >>     68416 bytes stack frame, 1188 bytes spill stores, 3532 bytes spill loads
> >> >>
> >> >> after:
> >> >> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
> >> >> ptxas info    : Function properties for kernel_cuda_branched_path_trace
> >> >>     58976 bytes stack frame, 1256 bytes spill stores, 3676 bytes spill loads
> >> >>
> >> >> -Stefan
> >> >>
> >> >> _______________________________________________
> >> >> Bf-cycles mailing list
> >> >> Bf-cycles at blender.org
> >> >> https://lists.blender.org/mailman/listinfo/bf-cycles
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >> With best regards, Sergey Sharybin
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >>
> >
> >
> >
> >
> > --
> > With best regards, Sergey Sharybin
> >
> >
>



-- 
With best regards, Sergey Sharybin