[Bf-cycles] Cuda launch bounds for Pascal

Stefan Werner stewreo at gmail.com
Tue Nov 14 23:48:18 CET 2017


Currently the CUDA kernels use the same launch bounds for Pascal (SM 6.x)
as for Maxwell (SM 5.x) hardware, that is, 63 registers for branched path
tracing and 48 registers for path tracing. Are all of those values derived
from benchmarks, or is the Pascal value just carried over from Maxwell?
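
For context, a per-thread register budget is normally expressed through
CUDA's __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor).
The following is a minimal host-side C++ sketch of that arithmetic, not the
actual Cycles kernel configuration. The 16x16 block size, the 65536-entry
register file per multiprocessor and all names are assumptions made for
illustration.

```cpp
// Host-side sketch: how a per-thread register budget maps to the second
// argument of CUDA's __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM).
// All constants and names are illustrative assumptions, not Cycles code.

// Assumed size of the register file per multiprocessor (32-bit registers).
constexpr int kRegistersPerSM = 65536;
// Assumed block size of 16x16 threads.
constexpr int kThreadsPerBlock = 16 * 16;

// Number of blocks that fit on one multiprocessor if every thread may use
// up to `regs_per_thread` registers (integer division rounds down).
constexpr int min_blocks_per_sm(int regs_per_thread)
{
    return kRegistersPerSM / (kThreadsPerBlock * regs_per_thread);
}

// Resident threads per multiprocessor implied by that block count.
constexpr int resident_threads(int regs_per_thread)
{
    return min_blocks_per_sm(regs_per_thread) * kThreadsPerBlock;
}
```

Under these assumptions a 48-register budget corresponds to 5 resident
blocks (1280 threads) per multiprocessor, while 63 and 64 registers both
correspond to 4 blocks (1024 threads).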

The reason I'm asking is that I'm observing a performance increase on
Pascal when I increase the number of registers to 64 for path tracing. Here
are before/after benchmark times from a GTX 1080 Ti on Windows 10:

48 registers (as is):
BMW: 1m52s
Classroom: 3m31s
Fishy Cat: 4m33s
Koro: 8m30s
Pavillion: 7m39s

64 registers:
BMW: 1m36s
Classroom: 3m34s
Fishy Cat: 3m57s
Koro: 6m45s
Pavillion: 6m39s

With the exception of the Classroom scene, which is about 1% slower, all
benchmarks show significantly better performance, with render times roughly
13% to 21% shorter. If there are no objections, I'd like to commit that
register increase for SM 6.x to master.
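
The relative changes can be checked from the timings above with a small
helper like the one below. This is only an illustrative sketch. The function
names are made up for this example, and the BMW time at 48 registers is
assumed to be 1m52s.

```cpp
// Sketch for comparing two render times given as minutes and seconds.
// Function names are hypothetical and exist only for this example.

// Converts a "XmYYs" timing to whole seconds.
constexpr int seconds(int minutes, int secs)
{
    return minutes * 60 + secs;
}

// Percentage change in render time going from `before` to `after`.
// Negative values mean the render got faster.
constexpr double percent_change(int before, int after)
{
    return 100.0 * (after - before) / before;
}
```

For example, percent_change(seconds(8, 30), seconds(6, 45)) gives about
-20.6 for Koro, and percent_change(seconds(3, 31), seconds(3, 34)) gives
about +1.4 for Classroom.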

Running the same test on a Quadro M4000 (Maxwell) shows smaller and mixed
differences (two scenes faster, two slower, one unchanged), so I'd leave
SM 5.x as is:

48 registers (as is):
BMW: 4m38s
Classroom: 12m32s
Fishy Cat: 11m18s
Koro: 20m38s
Pavillion: 21m12s

64 registers:
BMW: 4m38s
Classroom: 13m07s
Fishy Cat: 10m52s
Koro: 18m51s
Pavillion: 21m32s

Another note: 63 registers was a hard limit for SM 2.x hardware. Is the
63-register limit used for kernels on SM 3.x and higher just carried over
from that, or is there a reason not to go to 64 registers?
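
One way to reason about 63 versus 64 is to look at how many registers a
warp actually reserves. The sketch below assumes that registers are
allocated per warp in units of 256 (8 per thread), which is what NVIDIA's
occupancy calculator lists for these architectures as far as I know. The
constants and the function name are illustrative only.

```cpp
// Sketch of per-warp register allocation, assuming registers are handed
// out per warp in units of 256. Names are hypothetical, for illustration.

constexpr int kWarpSize = 32;
// Assumed allocation granularity per warp, in registers.
constexpr int kAllocUnit = 256;

// Registers reserved for one warp at a given per-thread register count,
// rounded up to the next multiple of the allocation unit.
constexpr int registers_per_warp(int regs_per_thread)
{
    return ((regs_per_thread * kWarpSize + kAllocUnit - 1) / kAllocUnit) * kAllocUnit;
}
```

If that assumption holds, a warp reserves 2048 registers at both 63 and 64
registers per thread, so the 63-register limit would not save any register
file space compared to 64.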

PS: I'd love it if someone would sacrifice the time to run 48/64 register
comparison benchmarks on other Pascal hardware and/or on Linux.