[Bf-cycles] Cuda launch bounds for Pascal

Brecht Van Lommel brechtvanlommel at pandora.be
Wed Nov 15 18:28:22 CET 2017


It seems to be related to the CUDA version: 9.0.176 has a performance
regression compared to 9.0.102. Increasing the register count partially
compensates for that, but not entirely.
https://developer.blender.org/F1141999
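
For anyone following along, here is a rough sketch of how a per-thread register budget relates to the launch bounds discussed in this thread. It is not the Cycles code: the helper name, the macro name and the 256-thread block size are assumptions made for illustration, and the 64K registers per multiprocessor figure is the documented value for SM 5.x and 6.x. CUDA's `__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)` takes a minimum number of resident blocks, from which the compiler derives a register cap, so a register budget has to be converted into a block count first.

```cpp
#include <cassert>

// Assumed hardware figure: 64K 32-bit registers per multiprocessor
// (documented for SM 5.x and SM 6.x).
const int kRegistersPerSM = 65536;

// Hypothetical helper, not taken from Cycles: number of blocks of
// `threads_per_block` threads that fit on one multiprocessor when each
// thread may use up to `registers_per_thread` registers. This ignores
// register allocation granularity and every other occupancy limit.
inline int min_blocks_per_sm(int threads_per_block, int registers_per_thread)
{
    assert(threads_per_block > 0 && registers_per_thread > 0);
    return kRegistersPerSM / (threads_per_block * registers_per_thread);
}

// Hypothetical macro: under nvcc it expands to a __launch_bounds__
// qualifier using the same arithmetic; in a plain C++ build it expands to
// nothing so that this file still compiles.
#ifdef __CUDACC__
#  define SKETCH_LAUNCH_BOUNDS(threads, regs) \
      __launch_bounds__((threads), 65536 / ((threads) * (regs)))
#else
#  define SKETCH_LAUNCH_BOUNDS(threads, regs)
#endif
```

Under these assumptions, a 256-thread block with 48 registers per thread corresponds to 5 resident blocks, while 63 and 64 registers both correspond to 4.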

On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner <stewreo at gmail.com> wrote:

> Wow, those results are almost the complete opposite of what I'm seeing. I
> re-ran the tests on Linux:
>
> Nvidia 1080Ti, driver 384.90, installed as secondary GPU (no display
> attached)
> Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0
> master branch, 556b13f03e561b54d4f0186e207f080c786f8b66
>
> 48 registers:
> BMW: 1m28s
> Classroom: 3m12s
> Fishy Cat: 3m07s
> Koro: 5m40s
> Pavillion: 6m52s
> Victor: 15m01s
>
> 64 registers:
> BMW: 1m11s
> Classroom: 2m59s
> Fishy Cat: 2m51s
> Koro: 4m39s
> Pavillion: 5m32s
> Victor: 12m19s
>
> (Victor had a tile size of 32, all others were the *_gpu.blend files with
> the default 256 tile size)
>
> On Windows, all GTX cards are treated as display cards, regardless of
> whether a monitor is plugged in or not. Only Quadro, Tesla and Titan cards
> can be set to TCC; that mode is not available for my GTX.
>
> I wonder what's behind the difference we're seeing? The GPUs themselves
> shouldn't be that different: both are based on GP102, and the 1080Ti only
> has two SM units disabled.
>
> -Stefan
>
> On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel <
> brechtvanlommel at pandora.be> wrote:
>
>> Hi,
>>
>> The registers were set based on benchmarks with a GTX 1080 on Linux, when
>> we first optimized the code for Pascal. But that was more than a year ago.
>> Going from 63 to 64 registers should be fine if it's faster.
>>
>> Here are benchmarks with a Titan Xp, Linux, driver 384.90. Results are
>> not so good there:
>> CUDA 8.0.61: https://developer.blender.org/F1137606
>> CUDA 9.0.102: https://developer.blender.org/F1137502
>>
>> Which driver and CUDA version are you using?
>>
>> One difference between Windows and Linux is the compute preemption
>> support. It might be useful to test if that min_blocks *= 8 helps on
>> Windows, if your GTX 1080Ti is used for display.
>> https://developer.blender.org/rBe360d003e
>>
>> Regards,
>> Brecht.
>>
>>
>> On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner <stewreo at gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Currently the CUDA kernel uses the same launch bounds for Pascal (SM
>>> 6.x) as for Maxwell (SM 5.x) hardware: 63 registers for branched
>>> path tracing and 48 registers for path tracing. Are all of those derived
>>> from benchmarks, or is the value for Pascal just being carried over from
>>> Maxwell?
>>>
>>> The reason I'm asking is that I'm observing a performance increase on
>>> Pascal when I increase the number of registers to 64 for path tracing. Here
>>> are before/after benchmarks from a GTX 1080Ti/Win10:
>>>
>>> 48 registers (as is):
>>> BMW: 1m52s
>>> Classroom: 3m31s
>>> Fishy Cat: 4m33s
>>> Koro: 8m30s
>>> Pavillion: 7m39s
>>>
>>> 64 registers:
>>> BMW: 1m36s
>>> Classroom: 3m34s
>>> Fishy Cat: 3m57s
>>> Koro: 6m45s
>>> Pavillion: 6m39s
>>>
>>> With the exception of the classroom scene, all benchmarks show
>>> significantly better performance. If there are no objections, I'd like to
>>> commit that register increase for SM 6.x to master.
>>>
>>> Running the same test on a Quadro M4000 (Maxwell) shows much smaller
>>> differences, so I'd leave SM 5.x as is:
>>>
>>> 48 registers (as is):
>>> BMW: 4m38s
>>> Classroom: 12m32s
>>> Fishy Cat: 11m18s
>>> Koro: 20m38s
>>> Pavillion: 21m12s
>>>
>>> 64 registers:
>>> BMW: 4m38s
>>> Classroom: 13m07s
>>> Fishy Cat: 10m52s
>>> Koro: 18m51s
>>> Pavillion: 21m32s
>>>
>>> Another note: 63 registers was a hard limit for SM 2.x hardware. Is the
>>> limit of 63 rather than 64 registers for SM 3.x and higher kernels just
>>> carried over, or is there a reason not to go to 64 registers?
>>>
>>> -Stefan
>>> PS: I'd love it if someone would sacrifice the time to run 48/64
>>> register comparison benchmarks on other Pascal hardware and/or on Linux.
>>>
>>> _______________________________________________
>>> Bf-cycles mailing list
>>> Bf-cycles at blender.org
>>> https://lists.blender.org/mailman/listinfo/bf-cycles
>
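
To compare the 48 and 64 register timings quoted above on a common scale, the durations can be converted to seconds and divided. The following is a minimal sketch; the function names are hypothetical, and it assumes every timing is written as minutes followed by seconds, as in "1m28s".

```cpp
#include <cassert>
#include <string>

// Convert a duration written as "<minutes>m<seconds>s" (for example
// "1m28s") into a number of seconds.
inline int to_seconds(const std::string &text)
{
    const std::size_t m = text.find('m');
    const std::size_t s = text.find('s', m);
    assert(m != std::string::npos && s != std::string::npos);
    const int minutes = std::stoi(text.substr(0, m));
    const int seconds = std::stoi(text.substr(m + 1, s - m - 1));
    return minutes * 60 + seconds;
}

// Relative speedup of `after` over `before`. Values above 1.0 mean the
// `after` configuration rendered faster.
inline double speedup(const std::string &before, const std::string &after)
{
    return static_cast<double>(to_seconds(before)) / to_seconds(after);
}
```

For the Linux BMW result, `speedup("1m28s", "1m11s")` is 88 s / 71 s, or roughly 1.24. For the Windows Classroom result, `speedup("3m31s", "3m34s")` is 211 s / 214 s, slightly below 1.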