[Bf-cycles] Cuda launch bounds for Pascal

Wed Nov 15 22:37:32 CET 2017

It seems to depend not just on the CUDA version but also on the chip model.
I have now run my benchmarks on a GTX 1060 too; there the difference between
48 and 64 registers is close to nothing:

64 registers:
BMW: 2m41s
Classroom: 8m02s
Fishy Cat: 6m39s
Koro: 11m17s
Pavillion: 13m38s

48 registers:
BMW: 2m43s
Classroom: 7m56s
Fishy Cat: 6m52s
Koro: 12m17s
Pavillion: 13m50s
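
To put numbers on the two lists above, here is a minimal Python sketch that
converts the timings to seconds and prints the relative change when going
from 48 to 64 registers. The helper names (to_seconds, percent_change) are
made up for illustration; the timings are copied from the GTX 1060 lists.

```python
import re

def to_seconds(stamp):
    """Convert a render time such as '2m41s' to whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s?", stamp)
    if match is None:
        raise ValueError(f"unrecognized time: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

# GTX 1060 timings from the two lists above.
regs64 = {"BMW": "2m41s", "Classroom": "8m02s", "Fishy Cat": "6m39s",
          "Koro": "11m17s", "Pavillion": "13m38s"}
regs48 = {"BMW": "2m43s", "Classroom": "7m56s", "Fishy Cat": "6m52s",
          "Koro": "12m17s", "Pavillion": "13m50s"}

def percent_change(base, new):
    """Relative change in render time from base to new, in percent."""
    base_s, new_s = to_seconds(base), to_seconds(new)
    return (new_s - base_s) / base_s * 100.0

# Negative values mean the 64-register build rendered faster.
changes = {scene: percent_change(regs48[scene], regs64[scene])
           for scene in regs48}

for scene, change in changes.items():
    print(f"{scene}: {change:+.1f}%")
```

A negative percentage means 64 registers was faster for that scene, a
positive one means it was slower.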

Maybe here it's the ratio of bandwidth/core that makes register spilling
less costly on the 1060 than on the 1080Ti?
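
For context on the trade-off: a lower register cap allows more resident
warps per SM but can force spills to memory. The sketch below estimates only
the register-limited side of that, assuming the compute capability 6.1
limits of 65536 registers and 64 resident warps per SM, and assuming
registers are allocated per warp in units of 256. It ignores block size and
shared memory limits and says nothing about spill cost, so it is a rough
upper bound, not a measurement.

```python
REGISTERS_PER_SM = 65536  # 32-bit registers per SM (compute capability 6.1)
MAX_WARPS_PER_SM = 64     # 2048 resident threads / 32 threads per warp
WARP_SIZE = 32
ALLOC_UNIT = 256          # assumed per-warp register allocation granularity

def warps_limited_by_registers(regs_per_thread):
    """Upper bound on resident warps per SM imposed by register use alone."""
    per_warp = regs_per_thread * WARP_SIZE
    # Round up to the allocation granularity.
    per_warp = -(-per_warp // ALLOC_UNIT) * ALLOC_UNIT
    return min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // per_warp)

def occupancy(regs_per_thread):
    """Register-limited theoretical occupancy as a fraction of the maximum."""
    return warps_limited_by_registers(regs_per_thread) / MAX_WARPS_PER_SM

for regs in (48, 64):
    print(regs, warps_limited_by_registers(regs), occupancy(regs))
```

Under these assumptions, 48 registers per thread allows 42 resident warps
and 64 registers allows 32. Whether the extra warps or the reduced spilling
wins is what the benchmarks have to decide.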

Well, there go my dreams of a one-line commit that brings a 10-20%
performance boost.

-Stefan

On Wed, Nov 15, 2017 at 6:28 PM, Brecht Van Lommel <
brechtvanlommel at pandora.be> wrote:

> It seems to be related to the CUDA version, 9.0.176 has a performance
> regression compared to 9.0.102. Increasing the registers partially
> compensates for that, but not entirely.
> https://developer.blender.org/F1141999
>
> On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner <stewreo at gmail.com> wrote:
>
>> Wow, those results are almost the complete opposite of what I'm seeing. I
>> re-ran the tests on Linux:
>>
>> Nvidia 1080Ti, driver 384.90, installed as secondary GPU (no display
>> attached)
>> Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0
>> master branch, 556b13f03e561b54d4f0186e207f080c786f8b66
>>
>> 48 registers:
>> BMW: 1m28s
>> Classroom: 3m12s
>> Fishy Cat: 3m07s
>> Koro: 5m40s
>> Pavillion: 6m52s
>> Victor: 15m01s
>>
>> 64 registers:
>> BMW: 1m11s
>> Classroom: 2m59s
>> Fishy Cat: 2m51s
>> Koro: 4m39s
>> Pavillion: 5m32s
>> Victor: 12m19s
>>
>> (Victor had a tile size of 32, all others were the *_gpu.blend files with
>> the default 256 tile size)
>>
>> On Windows, all GTX cards are treated as display cards, regardless of
>> whether a monitor is plugged in or not. Only Quadro, Tesla and Titan cards
>> can be set to TCC; that mode is not available for my GTX.
>>
>> I wonder what's behind the difference we're seeing? The GPUs themselves
>> shouldn't be that different: both are based on GP102, where only the 1080Ti
>> has two SMX units disabled.
>>
>> -Stefan
>>
>> On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel <
>> brechtvanlommel at pandora.be> wrote:
>>
>>> Hi,
>>>
>>> The registers were set based on benchmarks with a GTX 1080 on Linux,
>>> when we first optimized the code for Pascal. But that was more than a year
>>> ago. Going from 63 to 64 registers should be fine if it's faster.
>>>
>>> Here are benchmarks with a Titan Xp, Linux, driver 384.90. Results are
>>> not so good there:
>>> CUDA 8.0.61: https://developer.blender.org/F1137606
>>> CUDA 9.0.102: https://developer.blender.org/F1137502
>>>
>>> Which driver and CUDA version are you using?
>>>
>>> One difference between Windows and Linux is the compute preemption
>>> support. It might be useful to test if that min_blocks *= 8 helps on
>>> Windows, if your GTX 1080Ti is used for display.
>>> https://developer.blender.org/rBe360d003e
>>>
>>> Regards,
>>> Brecht.
>>>
>>>
>>> On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner <stewreo at gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> currently the CUDA kernel uses the same launch bounds for Pascal (SM
>>>> 6.x) as for Maxwell (SM 5.x) hardware, that is 63 registers for branched
>>>> path tracing and 48 registers for path tracing. Are all of those derived
>>>> from benchmarks or is the value for Pascal just being carried over from
>>>> Maxwell?
>>>>
>>>> The reason I'm asking is that I'm observing a performance increase on
>>>> Pascal when I increase the number of registers to 64 for path tracing. Here
>>>> are before/after benchmarks from a GTX 1080Ti/Win10:
>>>>
>>>> 48 registers (as is):
>>>> BMW: 1m52s
>>>> Classroom: 3m31s
>>>> Fishy Cat: 4m33s
>>>> Koro: 8m30s
>>>> Pavillion: 7m39s
>>>>
>>>> 64 registers:
>>>> BMW: 1m36s
>>>> Classroom: 3m34s
>>>> Fishy Cat: 3m57s
>>>> Koro: 6m45s
>>>> Pavillion: 6m39s
>>>>
>>>> With the exception of the classroom scene, all benchmarks show
>>>> significantly better performance. If there are no objections, I'd like to
>>>> commit that register increase for SM 6.x to master.
>>>>
>>>> Running the same test on a Quadro M4000 (Maxwell) shows much smaller
>>>> differences, so I'd leave SM 5.x as is:
>>>>
>>>> 48 registers (as is):
>>>> BMW: 4m38s
>>>> Classroom: 12m32s
>>>> Fishy Cat: 11m18s
>>>> Koro: 20m38s
>>>> Pavillion: 21m12s
>>>>
>>>> 64 registers:
>>>> BMW: 4m38s
>>>> Classroom: 13m07s
>>>> Fishy Cat: 10m52s
>>>> Koro: 18m51s
>>>> Pavillion: 21m32s
>>>>
>>>> Another note: 63 registers was a hard limit for SM 2.x hardware. Is 63
>>>> instead of 64 as register limit for kernels SM 3.x and higher just carried
>>>> over or is there a reason to not go to 64 registers?
>>>>
>>>> -Stefan
>>>> PS: I'd love it if someone would sacrifice the time to run 48/64
>>>> register comparison benchmarks on other Pascal hardware and/or on Linux.
>>>>
>>>> _______________________________________________
>>>> Bf-cycles mailing list
>>>> Bf-cycles at blender.org
>>>> https://lists.blender.org/mailman/listinfo/bf-cycles