[Bf-cycles] Cuda launch bounds for Pascal

Thu Nov 16 00:11:21 CET 2017

Still, I suggest committing this change for CUDA 9, guarded by a check on
__CUDACC_VER_MAJOR__. We can ask NVIDIA to take a look and see if there's a
way to get back the performance of the early CUDA 9.0.102 release (which
was a beta, I think). But avoiding the major slowdown for now is good.
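
A minimal sketch of what such a guard could look like, written as plain C++
so it also compiles without nvcc. Only __CUDACC_VER_MAJOR__ and __CUDA_ARCH__
are real compiler macros; path_trace_registers and the SKETCH_* names are
hypothetical and not taken from the Cycles source, and treating every CUDA
version from 9 upward the same way is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: pick the per-thread register budget for the
// path tracing kernel.
//   cuda_major: value of __CUDACC_VER_MAJOR__ (0 when not built with nvcc)
//   sm_arch:    value of __CUDA_ARCH__ (e.g. 610 for SM 6.1, 0 on the host)
inline int path_trace_registers(int cuda_major, int sm_arch)
{
    assert(cuda_major >= 0 && sm_arch >= 0);
    // Assumption: only Pascal (SM 6.x) built with CUDA 9 or newer gets 64.
    const bool is_pascal = sm_arch >= 600 && sm_arch < 700;
    // 48 is the value this thread gives for SM 5.x and SM 6.x; other
    // architectures are outside the scope of this sketch.
    return (cuda_major >= 9 && is_pascal) ? 64 : 48;
}

// Fall back to 0 when the compiler does not define the CUDA macros.
#if defined(__CUDACC_VER_MAJOR__)
#  define SKETCH_CUDA_MAJOR __CUDACC_VER_MAJOR__
#else
#  define SKETCH_CUDA_MAJOR 0
#endif

#if defined(__CUDA_ARCH__)
#  define SKETCH_SM_ARCH __CUDA_ARCH__
#else
#  define SKETCH_SM_ARCH 0
#endif

// Register budget selected for the current compilation pass.
const int kPathTraceRegisters =
    path_trace_registers(SKETCH_CUDA_MAJOR, SKETCH_SM_ARCH);
```

In the real kernel the same condition would more likely be a preprocessor
#if around the launch bounds definition; the function form is used here only
to make the selection logic easy to check.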

Here's a graph relative to CUDA 8 for completeness.
https://developer.blender.org/F1142667

On Wed, Nov 15, 2017 at 10:37 PM, Stefan Werner <stewreo at gmail.com> wrote:

> It seems to depend not just on the CUDA version but also on the chip
> model. I have now run my benchmarks on a GTX 1060 too; there the
> difference between 48 and 64 registers is close to nothing:
>
> 64 registers:
> BMW: 2m41s
> Classroom: 8m02s
> Fishy Cat: 6m39s
> Koro: 11m17s
> Pavillion: 13m38s
>
> 48 registers:
> BMW: 2m43s
> Classroom: 7m56s
> Fishy Cat: 6m52s
> Koro: 12m17s
> Pavillion: 13m50s
>
> Maybe here it's the ratio of bandwidth/core that makes register spilling
> less costly on the 1060 than on the 1080Ti?
>
> Well, there go my dreams of a one-line commit that brings 10-20%
> performance boost.
>
> -Stefan
>
> On Wed, Nov 15, 2017 at 6:28 PM, Brecht Van Lommel <
> brechtvanlommel at pandora.be> wrote:
>
>> It seems to be related to the CUDA version, 9.0.176 has a performance
>> regression compared to 9.0.102. Increasing the registers partially
>> compensates for that, but not entirely.
>> https://developer.blender.org/F1141999
>>
>> On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner <stewreo at gmail.com>
>> wrote:
>>
>>> Wow, those results are almost the complete opposite of what I'm seeing.
>>> I re-ran the tests on Linux:
>>>
>>> Nvidia 1080Ti, driver 384.90, installed as secondary GPU (no display
>>> attached)
>>> Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0
>>> master branch, 556b13f03e561b54d4f0186e207f080c786f8b66
>>>
>>> 48 registers:
>>> BMW: 1m28s
>>> Classroom: 3m12s
>>> Fishy Cat: 3m07s
>>> Koro: 5m40s
>>> Pavillion: 6m52s
>>> Victor: 15m01s
>>>
>>> 64 registers:
>>> BMW: 1m11s
>>> Classroom: 2m59s
>>> Fishy Cat: 2m51s
>>> Koro: 4m39s
>>> Pavillion: 5m32s
>>> Victor: 12m19s
>>>
>>> (Victor had a tile size of 32, all others were the *_gpu.blend files
>>> with the default 256 tile size)
>>>
>>> On Windows, all GTX cards are treated as display cards, regardless of
>>> whether a monitor is plugged in or not. Only Quadro, Tesla and Titan cards
>>> can be set to TCC; that mode is not available for my GTX.
>>>
>>> I wonder what's behind the difference we're seeing? The GPUs themselves
>>> shouldn't be that different, both are based on GP102, where only the 1080Ti
>>> has two SMX units disabled.
>>>
>>> -Stefan
>>>
>>> On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel <
>>> brechtvanlommel at pandora.be> wrote:
>>>
>>>> Hi,
>>>>
>>>> The registers were set based on benchmarks with a GTX 1080 on Linux,
>>>> when we first optimized the code for Pascal. But that was more than a year
>>>> ago. Going from 63 to 64 registers should be fine if it's faster.
>>>>
>>>> Here are benchmarks with a Titan Xp on Linux, driver 384.90. Results are
>>>> not so good there:
>>>> CUDA 8.0.61: https://developer.blender.org/F1137606
>>>> CUDA 9.0.102: https://developer.blender.org/F1137502
>>>>
>>>> Which driver and CUDA version are you using?
>>>>
>>>> One difference between Windows and Linux is the compute preemption
>>>> support. It might be useful to test if that min_blocks *= 8 helps on
>>>> Windows, if your GTX 1080Ti is used for display.
>>>> https://developer.blender.org/rBe360d003e
>>>>
>>>> Regards,
>>>> Brecht.
>>>>
>>>>
>>>> On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner <stewreo at gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> currently the CUDA kernel uses the same launch bounds for Pascal (SM
>>>>> 6.x) as for Maxwell (SM 5.x) hardware, that is, 63 registers for branched
>>>>> path tracing and 48 registers for path tracing. Are all of those derived
>>>>> from benchmarks, or is the value for Pascal just being carried over from
>>>>> Maxwell?
>>>>>
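For readers who want the arithmetic behind these register counts: CUDA's
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) qualifier
lets the compiler derive a per-thread register cap from the number of blocks
that must stay resident on a multiprocessor. Below is a rough sketch of that
relationship, assuming 65536 32-bit registers per multiprocessor and a block
of 256 threads. Both numbers and all helper names are assumptions made for
illustration, and the estimate ignores register allocation granularity and
every other occupancy limit.

```cpp
#include <cassert>

// Assumption: 64K 32-bit registers per multiprocessor.
const int kRegistersPerSM = 65536;

// Blocks that fit on one multiprocessor if every thread may use up to
// `registers_per_thread` registers (integer division rounds down).
inline int min_blocks_per_sm(int threads_per_block, int registers_per_thread)
{
    assert(threads_per_block > 0 && registers_per_thread > 0);
    return kRegistersPerSM / (threads_per_block * registers_per_thread);
}

// Threads resident on one multiprocessor under the same simplification.
inline int resident_threads_per_sm(int threads_per_block, int registers_per_thread)
{
    return min_blocks_per_sm(threads_per_block, registers_per_thread) *
           threads_per_block;
}

// In CUDA source the result would feed a qualifier such as:
//   __global__ void __launch_bounds__(256, 5) kernel(...);
```

Under these assumptions a budget of 48 registers allows 5 resident blocks
(1280 threads), while 63 and 64 registers both allow 4 blocks (1024 threads),
so moving from 63 to 64 does not change the block count in this estimate.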
>>>>> The reason I'm asking is that I'm observing a performance increase on
>>>>> Pascal when I increase the number of registers to 64 for path tracing. Here
>>>>> are before/after benchmarks from a GTX 1080Ti/Win10:
>>>>>
>>>>> 48 registers (as is):
>>>>> BMW: 1m52s
>>>>> Classroom: 3m31s
>>>>> Fishy Cat: 4m33s
>>>>> Koro: 8m30s
>>>>> Pavillion: 7m39s
>>>>>
>>>>> 64 registers:
>>>>> BMW: 1m36s
>>>>> Classroom: 3m34s
>>>>> Fishy Cat: 3m57s
>>>>> Koro: 6m45s
>>>>> Pavillion: 6m39s
>>>>>
>>>>> With the exception of the classroom scene, all benchmarks show
>>>>> significantly better performance. If there are no objections, I'd like to
>>>>> commit that register increase for SM 6.x to master.
>>>>>
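To put numbers on "significantly better", the render times can be converted
to seconds and compared. The sketch below does that for times written in the
style used in this thread; the function names are hypothetical and the parser
only understands the "XmYs" form (with or without the trailing "s").

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Parse a render time such as "3m31s" or "1m52" into seconds.
// Returns -1 if the text does not match the expected form.
inline int parse_render_time(const std::string &text)
{
    int minutes = 0;
    int seconds = 0;
    if (std::sscanf(text.c_str(), "%dm%d", &minutes, &seconds) != 2) {
        return -1;
    }
    return minutes * 60 + seconds;
}

// Percent change in render time going from `before` to `after`.
// Negative values mean the render got faster.
inline double percent_change(const std::string &before, const std::string &after)
{
    const int b = parse_render_time(before);
    const int a = parse_render_time(after);
    assert(b > 0 && a >= 0);
    return 100.0 * (a - b) / b;
}
```

With the GTX 1080Ti/Win10 figures above, BMW goes from 112 s to 96 s, roughly
14% less render time, and Koro from 510 s to 405 s, roughly 21% less, while
Classroom is about 1% slower.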
>>>>> Running the same test on a Quadro M4000 (Maxwell) shows much smaller
>>>>> differences, so I'd leave SM 5.x as is:
>>>>>
>>>>> 48 registers (as is):
>>>>> BMW: 4m38s
>>>>> Classroom: 12m32s
>>>>> Fishy Cat: 11m18s
>>>>> Koro: 20m38s
>>>>> Pavillion: 21m12s
>>>>>
>>>>> 64 registers:
>>>>> BMW: 4m38s
>>>>> Classroom: 13m07s
>>>>> Fishy Cat: 10m52s
>>>>> Koro: 18m51s
>>>>> Pavillion: 21m32s
>>>>>
>>>>> Another note: 63 registers was a hard limit for SM 2.x hardware. Is 63
>>>>> instead of 64 as register limit for kernels SM 3.x and higher just carried
>>>>> over or is there a reason to not go to 64 registers?
>>>>>
>>>>> -Stefan
>>>>> PS: I'd love it if someone would sacrifice the time to run 48/64
>>>>> register comparison benchmarks on other Pascal hardware and/or on Linux.
>>>>>
>>>>> _______________________________________________
>>>>> Bf-cycles mailing list
>>>>> Bf-cycles at blender.org
>>>>> https://lists.blender.org/mailman/listinfo/bf-cycles
>>>>>