[Bf-cycles] Cuda launch bounds for Pascal

Thu Nov 16 23:57:05 CET 2017

I am a little late, but I can confirm that with CUDA 8 on a GTX 1060 (6GB),
raising the register limit from 48 to 64 does not consistently increase
performance.

System:
Debian SID
Source compiled with gcc 7.2
CUDA compiled with clang-3.8
CUDA compilation tools: 8.0

CUDA_KERNEL_MAX_REGISTERS 48
BMW: 2:38
Classroom: 7:43
Fishy Cat: 7:20
Koro: 14:03
Pavillion: 15:12

CUDA_KERNEL_MAX_REGISTERS 64
BMW: 2:46
Classroom: 8:03
Fishy Cat: 7:10
Koro: 12:46
Pavillion: 16:06
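
For anyone who wants the relative change instead of eyeballing the times,
here is a small sketch that computes it from the two runs above. The helper
names (to_seconds, percent_change, compare) are my own and not part of
Blender or Cycles; the times are copied from this mail as min:sec.

```python
# Render times from the two runs above, as "minutes:seconds".
TIMES_48 = {
    "BMW": "2:38",
    "Classroom": "7:43",
    "Fishy Cat": "7:20",
    "Koro": "14:03",
    "Pavillion": "15:12",
}
TIMES_64 = {
    "BMW": "2:46",
    "Classroom": "8:03",
    "Fishy Cat": "7:10",
    "Koro": "12:46",
    "Pavillion": "16:06",
}


def to_seconds(stamp):
    """Convert a "minutes:seconds" string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_change(before, after):
    """Relative change in render time; negative means the second run is faster."""
    b = to_seconds(before)
    a = to_seconds(after)
    return (a - b) / b * 100.0


def compare(times_48, times_64):
    """Per-scene change when going from 48 to 64 registers."""
    return {
        scene: percent_change(times_48[scene], times_64[scene])
        for scene in times_48
    }


if __name__ == "__main__":
    for scene, change in compare(TIMES_48, TIMES_64).items():
        print(f"{scene}: {change:+.1f}%")
```

On the numbers above, Koro and Fishy Cat come out faster with 64 registers,
while BMW, Classroom and Pavillion come out slower.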

Greetings Knork

On 11/16/2017 12:11 AM, Brecht Van Lommel wrote:
> Still I suggest to commit this change for CUDA 9, checking with
> __CUDACC_VER_MAJOR__. We can ask NVidia to take a look and see if
> there's a way to get back the performance from the early CUDA 9.0.102
> release (which was a beta I think). But avoiding the major slowdown
> for now is good.
>
> Here's a graph relative to CUDA 8 for completeness.
> https://developer.blender.org/F1142667
>
>
> On Wed, Nov 15, 2017 at 10:37 PM, Stefan Werner <stewreo at gmail.com>
> wrote:
>
>     It seems to depend not just on the CUDA version but also on the
>     chip model. I now ran my benchmarks on a GTX 1060 too; there the
>     difference between 48 and 64 registers is close to nothing:
>
>     64 registers:
>     BMW: 2m41s
>     Classroom: 8m02s
>     Fishy Cat: 6m39s
>     Koro: 11m17s
>     Pavillion: 13m38s
>
>     48 registers:
>     BMW: 2m43s
>     Classroom: 7m56s
>     Fishy Cat: 6m52s
>     Koro: 12m17s
>     Pavillion: 13m50s
>
>     Maybe here it's the ratio of bandwidth/core that makes register
>     spilling less costly on the 1060 than on the 1080Ti?
>
>     Well, there go my dreams of a one-line commit that brings a 10-20%
>     performance boost.
>
>     -Stefan
>
>     On Wed, Nov 15, 2017 at 6:28 PM, Brecht Van Lommel
>     <brechtvanlommel at pandora.be> wrote:
>
>         It seems to be related to the CUDA version, 9.0.176 has a
>         performance regression compared to 9.0.102. Increasing the
>         registers partially compensates for that, but not entirely.
>         https://developer.blender.org/F1141999
>
>         On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner
>         <stewreo at gmail.com> wrote:
>
>             Wow, those results are almost the complete opposite of
>             what I'm seeing. I re-ran the tests on Linux:
>
>             Nvidia 1080Ti, driver 384.90, installed as secondary GPU
>             (no display attached)
>             Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0
>             master branch, 556b13f03e561b54d4f0186e207f080c786f8b66
>
>             48 registers:
>             BMW: 1m28s
>             Classroom: 3m12s
>             Fishy Cat: 3m07s
>             Koro: 5m40s
>             Pavillion: 6m52s
>             Victor: 15m01s
>
>             64 registers:
>             BMW: 1m11s
>             Classroom: 2m59s
>             Fishy Cat: 2m51s
>             Koro: 4m39s
>             Pavillion: 5m32s
>             Victor: 12m19s
>
>             (Victor had a tile size of 32, all others were the
>             *_gpu.blend files with the default 256 tile size)
>
>             On Windows, all GTX cards are treated as display cards,
>             regardless of whether a monitor is plugged in. Only
>             Quadro, Tesla and Titan cards can be set to TCC; that mode
>             is not available for my GTX.
>
>             I wonder what's behind the difference we're seeing? The
>             GPUs themselves shouldn't be that different: both are
>             based on GP102, and only the 1080Ti has two SMs
>             disabled.
>
>             -Stefan
>
>             On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel
>             <brechtvanlommel at pandora.be> wrote:
>
>                 Hi,
>
>                 The registers were set based on benchmarks with a GTX
>                 1080 on Linux, when we first optimized the code for
>                 Pascal. But that was more than a year ago. Going from
>                 63 to 64 registers should be fine if it's faster.
>
>                 Here are benchmarks with a Titan Xp on Linux, driver
>                 384.90. Results are not so good there:
>                 CUDA 8.0.61: https://developer.blender.org/F1137606
>                 CUDA 9.0.102: https://developer.blender.org/F1137502
>
>                 Which driver and CUDA version are you using?
>
>                 One difference between Windows and Linux is the
>                 compute preemption support. It might be useful to test
>                 if that min_blocks *= 8 helps on Windows, if your GTX
>                 1080Ti is used for display.
>                 https://developer.blender.org/rBe360d003e
>
>                 Regards,
>                 Brecht.
>
>
>                 On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner
>                 <stewreo at gmail.com> wrote:
>
>                     Hello,
>
>                     currently the CUDA kernel uses the same launch
>                     bounds for Pascal (SM 6.x) as for Maxwell (SM 5.x)
>                     hardware, that is 63 registers for branched path
>                     tracing and 48 registers for path tracing. Are all
>                     of those derived from benchmarks or is the value
>                     for Pascal just being carried over from Maxwell?
>
>                     The reason I'm asking is that I'm observing a
>                     performance increase on Pascal when I increase the
>                     number of registers to 64 for path tracing. Here
>                     are before/after benchmarks from a GTX 1080Ti/Win10:
>
>                     48 registers (as is):
>                     BMW: 1m52s
>                     Classroom: 3m31s
>                     Fishy Cat: 4m33s
>                     Koro: 8m30s
>                     Pavillion: 7m39s
>
>                     64 registers:
>                     BMW: 1m36s
>                     Classroom: 3m34s
>                     Fishy Cat: 3m57s
>                     Koro: 6m45s
>                     Pavillion: 6m39s
>
>                     With the exception of the classroom scene, all
>                     benchmarks show significantly better performance.
>                     If there are no objections, I'd like to commit
>                     that register increase for SM 6.x to master.
>
>                     Running the same test on a Quadro M4000 (Maxwell)
>                     shows much smaller differences, so I'd leave SM
>                     5.x as is:
>
>                     48 registers (as is):
>                     BMW: 4m38s
>                     Classroom: 12m32s
>                     Fishy Cat: 11m18s
>                     Koro: 20m38s
>                     Pavillion: 21m12s
>
>                     64 registers:
>                     BMW: 4m38s
>                     Classroom: 13m07s
>                     Fishy Cat: 10m52s
>                     Koro: 18m51s
>                     Pavillion: 21m32s
>
>                     Another note: 63 registers was a hard limit for SM
>                     2.x hardware. Is 63 instead of 64 as register
>                     limit for kernels SM 3.x and higher just carried
>                     over or is there a reason to not go to 64 registers?
>
>                     -Stefan
>                     PS: I'd love it if someone would sacrifice the
>                     time to run 48/64 register comparison benchmarks
>                     on other Pascal hardware and/or on Linux.
>
>                     _______________________________________________
>                     Bf-cycles mailing list
>                     Bf-cycles at blender.org <mailto:Bf-cycles at blender.org>
>                     https://lists.blender.org/mailman/listinfo/bf-cycles
>                     <https://lists.blender.org/mailman/listinfo/bf-cycles>

-- 
Jan Scheffczy
w: https://knork.org
