[Bf-cycles] split kernel and CUDA

Sun May 22 18:53:07 CEST 2016

I've added some optimizations for reducing stack memory usage here:
https://developer.blender.org/D2023

On Wed, May 18, 2016 at 2:27 PM, Stefan Werner <swerner at smithmicro.com> wrote:
> Don’t get too excited too early. The more I work with it, the more it looks
> like it’s just an elaborate workaround for compiler behavior. It appears
> that NVCC insists on inlining everything on Maxwell, ignoring any
> __noinline__ hints. So far, there are no benefits whatsoever on Kepler;
> there, NVCC appears to do the right thing out of the box.
>
>
>
> I submitted a bug report to Nvidia about the difference in stack usage
> between Kepler and Maxwell last year, and it was marked as resolved, with
> the fix to be shipped in the next CUDA update. So maybe I shouldn’t spend
> too much time on it until we see CUDA 8.
>
>
>
> -Stefan
>
>
>
> From: <bf-cycles-bounces at blender.org> on behalf of Thomas Dinges
> <blender at dingto.org>
> Reply-To: Discussion list to assist Cycles render engine developers
> <bf-cycles at blender.org>
> Date: Tuesday, May 17, 2016 at 4:45 PM
>
>
> To: Discussion list to assist Cycles render engine developers
> <bf-cycles at blender.org>
> Subject: Re: [Bf-cycles] split kernel and CUDA
>
>
>
> That sounds promising, feel free to submit a patch for this and we can
> check. :)
>
> On 17.05.2016 at 16:40, Stefan Werner wrote:
>
> The patch is surprisingly clean. It removes some of the #ifdef
> __SPLIT_KERNEL__ blocks and unifies CPU, OpenCL and CUDA a bit more. I
> didn’t run a speed benchmark, and I wouldn’t even make speed the ultimate
> top priority: right now, the problem we see in the field is that people are
> unable to use high-end gaming GPUs because the VRAM is so full of geometry
> and textures that the CUDA runtime doesn’t have room for kernel memory any
> more. On my 1664-core M4000 card, I see a simple kernel launch already
> taking ~1600 MB of VRAM with almost empty scenes.
>
>
>
> It looks to me like the CUDA compiler reserves room for every stack instance
> of ShaderData (or other structs) in advance, and that sharing that memory
> instead of instantiating it separately is an easy way to reduce VRAM
> requirements without changing the code much.
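
As a rough back-of-the-envelope check of the reservation theory above, the
sketch below multiplies a per-thread stack frame by the number of threads that
can be resident on the card at once. The reservation model itself, the SM count
(1664 cores / 128 cores per Maxwell SM) and the 2048 resident threads per SM
are assumptions, not figures from this thread; only the stack frame sizes come
from the ptxas output quoted further down.

```cpp
#include <cstdint>

// Assumption: the driver reserves one stack frame for every thread that can
// be resident on the device at once, not just for the threads launched.
constexpr std::uint64_t reserved_stack_bytes(std::uint64_t stack_frame_bytes,
                                             std::uint64_t max_threads_per_sm,
                                             std::uint64_t sm_count)
{
    return stack_frame_bytes * max_threads_per_sm * sm_count;
}

// Assumed hardware figures for a 1664-core Maxwell card:
// 128 cores per SM gives 13 SMs, each with up to 2048 resident threads.
constexpr std::uint64_t kSmCount = 1664 / 128;
constexpr std::uint64_t kMaxThreadsPerSm = 2048;

// Stack frame sizes taken from the ptxas output quoted in this thread.
constexpr std::uint64_t kBefore =
    reserved_stack_bytes(68416, kMaxThreadsPerSm, kSmCount);
constexpr std::uint64_t kAfter =
    reserved_stack_bytes(58976, kMaxThreadsPerSm, kSmCount);

// Helper for reading the byte counts in MiB.
constexpr double to_mib(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Under these assumptions the two frames come to about 1.7 GiB and 1.46 GiB
respectively, which is at least in the same range as the ~1600MB observed on
the M4000. It is an estimate only, not a measurement.
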
>
>
>
> -Stefan
>
>
>
> From: <bf-cycles-bounces at blender.org> on behalf of Sergey Sharybin
> <sergey.vfx at gmail.com>
> Reply-To: Discussion list to assist Cycles render engine developers
> <bf-cycles at blender.org>
> Date: Tuesday, May 17, 2016 at 9:20 AM
> To: Discussion list to assist Cycles render engine developers
> <bf-cycles at blender.org>
> Subject: Re: [Bf-cycles] split kernel and CUDA
>
>
>
> Hi,
>
>
>
> Lukas Stockner was doing experiments with a CUDA split kernel. With the
> current design of the split it was actually taking more VRAM, AFAIR.
> Hopefully he'll read this mail and reply in more detail.
>
>
>
> Would be cool to have this front moving forward, but I fear we'll have to
> step back and reconsider some things about how the split kernel works
> together with the regular one.
>
>
>
> There are interesting results on the stack memory! I can see the number of
> spill loads go up though; did you measure whether it gives a measurable
> render time slowdown? And how messy is the patch, I wonder :)
>
>
>
> On Tue, May 17, 2016 at 8:47 AM, Stefan Werner <swerner at smithmicro.com>
> wrote:
>
> Hi,
>
> Has anyone experimented with building a split kernel for CUDA? It seems to
> me that this could lift some of the limitations on Nvidia hardware, such as
> the high memory requirements on cards with many CUDA cores or the driver
> timeout. I just tried out what happens when I take the shared ShaderData
> (KernelGlobals.sd_input) from the split kernel into the CUDA kernel, as
> opposed to creating separate ShaderData structs on the stack, and it looks
> like it has an impact:
>
> before:
> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
> ptxas info    : Function properties for kernel_cuda_branched_path_trace
>     68416 bytes stack frame, 1188 bytes spill stores, 3532 bytes spill loads
>
> after:
> ptxas info    : Compiling entry function 'kernel_cuda_branched_path_trace' for 'sm_50'
> ptxas info    : Function properties for kernel_cuda_branched_path_trace
>     58976 bytes stack frame, 1256 bytes spill stores, 3676 bytes spill loads
>
> -Stefan
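
The change described above, borrowing one shared ShaderData instead of
declaring a fresh one on the stack at each call site, can be sketched roughly
as follows. This is a host-side illustration only: ShaderDataSketch,
KernelGlobalsSketch, eval_with_local and eval_with_shared are hypothetical
stand-ins, not the actual Cycles types or functions, and in a real kernel the
shared storage would have to be per thread.

```cpp
// Hypothetical stand-in for Cycles' ShaderData; the real struct is much larger.
struct ShaderDataSketch {
    float P[3];
    float N[3];
    float closure_data[256];
};

// Hypothetical stand-in for KernelGlobals, holding one shared scratch instance.
struct KernelGlobalsSketch {
    ShaderDataSketch *sd_input;
};

// "Before": every call site owns a ShaderData in its own stack frame.
// When everything is inlined into one kernel, these frames can add up.
float eval_with_local(float x)
{
    ShaderDataSketch sd = {};
    sd.P[0] = x;
    return sd.P[0] * 2.0f;
}

// "After": call sites borrow the shared instance instead.
// Only safe if no two users need its contents at the same time.
float eval_with_shared(KernelGlobalsSketch *kg, float x)
{
    ShaderDataSketch *sd = kg->sd_input;
    sd->P[0] = x;
    return sd->P[0] * 2.0f;
}
```

Both functions compute the same result; the difference is only where the
scratch struct lives. The trade-off is that callers must not rely on the shared
instance keeping its contents across calls.
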
>
> _______________________________________________
> Bf-cycles mailing list
> Bf-cycles at blender.org
> https://lists.blender.org/mailman/listinfo/bf-cycles
>
>
>
>
>
> --
>
> With best regards, Sergey Sharybin