<div dir="ltr">Brecht, that's a really nice breakdown! Question though: what OS and GPU did you use? I was playing around last night with setting the BVH stack size to 4 and didn't see a measurable difference in either GPU memory consumption or compiler output, which is rather odd.<div><br></div><div>Side question: the stack/spill stats that ptxas reports for an entry function, do they include all nested function calls, or are they stats for the kernel function itself only?</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 25, 2016 at 12:29 AM, Brecht Van Lommel <span dir="ltr"><<a href="mailto:brechtvanlommel@pandora.be" target="_blank">brechtvanlommel@pandora.be</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Stackless BVH traversal could be nice (I probably won't work on that<br>
though). If FireRays is using it then it must work well on AMD cards.<br>
<br>
I've tried to make a breakdown of stack memory usage here. It seems BVH<br>
stacks account for about 10-14% of stack memory now:<br>
<a href="https://developer.blender.org/D2023#46333" rel="noreferrer" target="_blank">https://developer.blender.org/D2023#46333</a><br>
<div class="HOEnZb"><div class="h5"><br>
On Tue, May 24, 2016 at 12:46 PM, Sergey Sharybin <<a href="mailto:sergey.vfx@gmail.com">sergey.vfx@gmail.com</a>> wrote:<br>
> Hi,<br>
><br>
> Brecht, nice work indeed! :)<br>
><br>
> Stefan, sharing the stack is an interesting idea indeed, but there are also<br>
> techniques for stack-less BVH traversal. If I read it correctly, that's exactly<br>
> what was implemented in FireRays. It shouldn't be that hard to experiment with<br>
> both shared-stack and stackless implementations. Let me know if you're up for<br>
> those tests, or whether Brecht or I should look into this (so we don't<br>
> duplicate work).<br>
><br>
> On Tue, May 24, 2016 at 12:32 PM, Stefan Werner <<a href="mailto:swerner@smithmicro.com">swerner@smithmicro.com</a>><br>
> wrote:<br>
>><br>
>> Impressive! That goes beyond what I’ve done so far. One thing we may want<br>
>> to test is sharing the BVH traversal stack; my suspicion is that nvcc for<br>
>> Maxwell also reserves memory for every possible instance of the traversal<br>
>> function (triangle, hair, motion, SSS, etc.).<br>
>><br>
>> Next up in terms of breaking the memory barrier is using host memory when<br>
>> CUDA runs out of device memory; we’ve already tested this extensively for<br>
>> 2D textures in Poser. I’m working on a patch right now; it will just take a<br>
>> little time to make it work with 3d and bindless textures. When using host<br>
>> memory, I can throw GB’s worth of textures at an anemic GTX 460 (768MB<br>
>> VRAM).<br>
>><br>
>> -Stefan<br>
>><br>
>> On 5/22/16, 6:53 PM, "<a href="mailto:bf-cycles-bounces@blender.org">bf-cycles-bounces@blender.org</a> on behalf of Brecht<br>
>> Van Lommel" <<a href="mailto:bf-cycles-bounces@blender.org">bf-cycles-bounces@blender.org</a> on behalf of<br>
>> <a href="mailto:brechtvanlommel@pandora.be">brechtvanlommel@pandora.be</a>> wrote:<br>
>><br>
>> >I've added some optimizations for reducing stack memory usage here:<br>
>> ><a href="https://developer.blender.org/D2023" rel="noreferrer" target="_blank">https://developer.blender.org/D2023</a><br>
>> ><br>
>> >On Wed, May 18, 2016 at 2:27 PM, Stefan Werner <<a href="mailto:swerner@smithmicro.com">swerner@smithmicro.com</a>><br>
>> > wrote:<br>
>> >> Don’t be too excited too early. The more I work with it, the more it<br>
>> >> looks<br>
>> >> like it’s just an elaborate workaround for compiler behavior. It<br>
>> >> appears<br>
>> >> that NVCC insists on inlining everything on Maxwell, ignoring any<br>
>> >> __noinline__ hints. So far, there are no benefits whatsoever on Kepler,<br>
>> >> where NVCC appears to do the right thing out of the box.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> I submitted a bug report to Nvidia about the difference in stack usage<br>
>> >> between Kepler and Maxwell last year, and it was marked as resolved and<br>
>> >> to<br>
>> >> be shipped in the next CUDA update. So maybe I shouldn’t spend too much<br>
>> >> time<br>
>> >> with it until we see CUDA 8.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> -Stefan<br>
>> >><br>
>> >><br>
>> >><br>
>> >> From: <<a href="mailto:bf-cycles-bounces@blender.org">bf-cycles-bounces@blender.org</a>> on behalf of Thomas Dinges<br>
>> >> <<a href="mailto:blender@dingto.org">blender@dingto.org</a>><br>
>> >> Reply-To: Discussion list to assist Cycles render engine developers<br>
>> >> <<a href="mailto:bf-cycles@blender.org">bf-cycles@blender.org</a>><br>
>> >> Date: Tuesday, May 17, 2016 at 4:45 PM<br>
>> >><br>
>> >><br>
>> >> To: Discussion list to assist Cycles render engine developers<br>
>> >> <<a href="mailto:bf-cycles@blender.org">bf-cycles@blender.org</a>><br>
>> >> Subject: Re: [Bf-cycles] split kernel and CUDA<br>
>> >><br>
>> >><br>
>> >><br>
>> >> That sounds promising, feel free to submit a patch for this and we can<br>
>> >> check. :)<br>
>> >><br>
>> >> On 17.05.2016 at 16:40, Stefan Werner wrote:<br>
>> >><br>
>> >> The patch is surprisingly clean. It removes some of the #ifdef<br>
>> >> __SPLIT_KERNEL__ blocks and unifies CPU, OpenCL and CUDA a bit more. I<br>
>> >> didn’t run a speed benchmark, and I wouldn’t even make speed the top<br>
>> >> priority: right now, the problem we see in the field is that people are<br>
>> >> unable to use high-end gaming GPUs because the VRAM is so full of<br>
>> >> geometry<br>
>> >> and textures that the CUDA runtime doesn’t have room for kernel memory<br>
>> >> any<br>
>> >> more. On my 1664-core M4000 card, I see a simple kernel launch already<br>
>> >> taking ~1600MB of VRAM with almost empty scenes.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> It looks to me like the CUDA compiler reserves room for every stack<br>
>> >> instance<br>
>> >> of ShaderData (or other structs) in advance, and that sharing that<br>
>> >> memory<br>
>> >> instead of instantiating it separately is an easy way to reduce VRAM<br>
>> >> requirements without changing the code much.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> -Stefan<br>
>> >><br>
>> >><br>
>> >><br>
>> >> From: <<a href="mailto:bf-cycles-bounces@blender.org">bf-cycles-bounces@blender.org</a>> on behalf of Sergey Sharybin<br>
>> >> <<a href="mailto:sergey.vfx@gmail.com">sergey.vfx@gmail.com</a>><br>
>> >> Reply-To: Discussion list to assist Cycles render engine developers<br>
>> >> <<a href="mailto:bf-cycles@blender.org">bf-cycles@blender.org</a>><br>
>> >> Date: Tuesday, May 17, 2016 at 9:20 AM<br>
>> >> To: Discussion list to assist Cycles render engine developers<br>
>> >> <<a href="mailto:bf-cycles@blender.org">bf-cycles@blender.org</a>><br>
>> >> Subject: Re: [Bf-cycles] split kernel and CUDA<br>
>> >><br>
>> >><br>
>> >><br>
>> >> Hi,<br>
>> >><br>
>> >><br>
>> >><br>
>> >> Lukas Stockner was doing experiments with a CUDA split kernel. With the<br>
>> >> current design of the split it was actually taking more VRAM, AFAIR.<br>
>> >> Hopefully he'll read this mail and reply in more detail.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> Would be cool to have this front moving forward, but I fear we'll have to<br>
>> >> step back and reconsider some things about how the split kernel works<br>
>> >> together with the regular one.<br>
>> >><br>
>> >><br>
>> >><br>
>> >> There are interesting results on the stack memory! I can see the number<br>
>> >> of spill loads go up though; did you measure whether it gives a measurable<br>
>> >> render time slowdown? And how messy is the patch, I wonder :)<br>
>> >><br>
>> >><br>
>> >><br>
>> >> On Tue, May 17, 2016 at 8:47 AM, Stefan Werner <<a href="mailto:swerner@smithmicro.com">swerner@smithmicro.com</a>><br>
>> >> wrote:<br>
>> >><br>
>> >> Hi,<br>
>> >><br>
>> >> Has anyone experimented with building a split kernel for CUDA? It seems<br>
>> >> to<br>
>> >> me that this could lift some of the limitations on Nvidia hardware,<br>
>> >> such as<br>
>> >> the high memory requirements on cards with many CUDA cores or the driver<br>
>> >> timeout. I just tried out what happens when I take the shared<br>
>> >> ShaderData<br>
>> >> (KernelGlobals.sd_input) from the split kernel into the CUDA kernel, as<br>
>> >> opposed to creating separate ShaderData structs on the stack, and it<br>
>> >> looks<br>
>> >> like it has an impact:<br>
>> >><br>
>> >> before:<br>
>> >> ptxas info : Compiling entry function<br>
>> >> 'kernel_cuda_branched_path_trace'<br>
>> >> for 'sm_50'<br>
>> >> ptxas info : Function properties for kernel_cuda_branched_path_trace<br>
>> >> 68416 bytes stack frame, 1188 bytes spill stores, 3532 bytes spill<br>
>> >> loads<br>
>> >><br>
>> >> after:<br>
>> >> ptxas info : Compiling entry function<br>
>> >> 'kernel_cuda_branched_path_trace'<br>
>> >> for 'sm_50'<br>
>> >> ptxas info : Function properties for kernel_cuda_branched_path_trace<br>
>> >> 58976 bytes stack frame, 1256 bytes spill stores, 3676 bytes spill<br>
>> >> loads<br>
>> >><br>
>> >> -Stefan<br>
>> >><br>
>> >> _______________________________________________<br>
>> >> Bf-cycles mailing list<br>
>> >> <a href="mailto:Bf-cycles@blender.org">Bf-cycles@blender.org</a><br>
>> >> <a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank">https://lists.blender.org/mailman/listinfo/bf-cycles</a><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >> --<br>
>> >><br>
>> >> With best regards, Sergey Sharybin<br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>> >><br>
>><br>
><br>
><br>
><br>
><br>
> --<br>
> With best regards, Sergey Sharybin<br>
><br>
><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature"><div><span style="color:rgb(102,102,102)">With best regards, Sergey Sharybin</span></div></div>
</div>