[Bf-cycles] Cuda launch bounds for Pascal

Brecht Van Lommel brechtvanlommel at pandora.be
Wed Nov 15 01:35:01 CET 2017


The register limits were set based on benchmarks with a GTX 1080 on Linux,
when we first optimized the code for Pascal. But that was more than a year
ago. Going from 63 to 64 registers should be fine if it's faster.
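
As a rough sketch of the arithmetic behind such a register budget (not
copied from the Cycles source): CUDA's __launch_bounds__ qualifier takes a
maximum thread count per block and a minimum number of resident blocks per
multiprocessor, and the compiler caps registers per thread so that many
blocks fit. The constants and the helper name min_blocks_for below are
assumptions for illustration: 65536 registers per multiprocessor (SM 5.x
and 6.x) and 16x16 threads per block.

```cpp
// Assumed hardware/launch parameters, not taken from this thread.
constexpr int kRegistersPerSM = 65536;    // 32-bit registers per multiprocessor
constexpr int kThreadsPerBlock = 16 * 16; // hypothetical block size

// Hypothetical helper: minimum resident blocks per multiprocessor that a
// per-thread register budget allows (integer division rounds down).
constexpr int min_blocks_for(int registers_per_thread)
{
    return kRegistersPerSM / (kThreadsPerBlock * registers_per_thread);
}

// Under these assumptions, 48 registers allow 5 blocks, while budgets of
// 63 and 64 registers both allow 4 blocks.
static_assert(min_blocks_for(48) == 5, "48 registers per thread");
static_assert(min_blocks_for(63) == 4, "63 registers per thread");
static_assert(min_blocks_for(64) == 4, "64 registers per thread");
```

If the limit is expressed this way, 63 and 64 registers translate to the
same minimum block count, so the two settings would differ only where the
limit is applied through some other mechanism.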

Here are benchmarks with a Titan Xp, Linux, driver 384.90. The results are
not so good there:
CUDA 8.0.61: https://developer.blender.org/F1137606
CUDA 9.0.102: https://developer.blender.org/F1137502

Which driver and CUDA version are you using?

One difference between Windows and Linux is the compute preemption support.
It might be useful to test if that min_blocks *= 8 helps on Windows, if
your GTX 1080Ti is used for display.


On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner <stewreo at gmail.com> wrote:

> Hello,
> currently the Cuda kernel uses the same launch bounds for Pascal (SM 6.x)
> as for Maxwell (SM 5.x) hardware, that is 63 registers for branched path
> tracing and 48 registers for path tracing. Are all of those derived from
> benchmarks or is the value for Pascal just being carried over from Maxwell?
> The reason I'm asking is that I'm observing a performance increase on
> Pascal when I increase the number of registers to 64 for path tracing. Here
> are before/after benchmarks from a GTX 1080Ti/Win10:
> 48 registers (as is):
> BMW: 1m52s
> Classroom: 3m31s
> Fishy Cat: 4m33s
> Koro: 8m30s
> Pavillion: 7m39s
> 64 registers:
> BMW: 1m36s
> Classroom: 3m34s
> Fishy Cat: 3m57s
> Koro: 6m45s
> Pavillion: 6m39s
> With the exception of the classroom scene, all benchmarks show
> significantly better performance. If there are no objections, I'd like to
> commit that register increase for SM 6.x to master.
> Running the same test on a Quadro M4000 (Maxwell) shows much smaller
> differences, so I'd leave SM 5.x as is:
> 48 registers (as is):
> BMW: 4m38s
> Classroom: 12m32s
> Fishy Cat: 11m18s
> Koro: 20m38s
> Pavillion: 21m12s
> 64 registers:
> BMW: 4m38s
> Classroom: 13m07s
> Fishy Cat: 10m52s
> Koro: 18m51s
> Pavillion: 21m32s
> Another note: 63 registers was a hard limit for SM 2.x hardware. Is the
> limit of 63 instead of 64 registers for kernels on SM 3.x and higher just
> carried over, or is there a reason not to go to 64 registers?
> -Stefan
> PS: I'd love it if someone would sacrifice the time to run 48/64 register
> comparison benchmarks on other Pascal hardware and/or on Linux.