[Bf-cycles] Cuda launch bounds for Pascal
Jan Scheffczyk
knork at m1234.de
Thu Nov 16 23:57:05 CET 2017
I am a little late, but I can confirm that on CUDA 8 with a GTX 1060 (6GB),
raising the register limit from 48 to 64 does not consistently improve performance.
System:
Debian SID
Source compiled with gcc 7.2
CUDA compiled with clang-3.8
cuda-compile tools: 8.0
CUDA_KERNEL_MAX_REGISTERS 48
BMW: 2:38
Classroom: 7:43
Fishy Cat: 7:20
Koro: 14:03
Pavillion: 15:12
CUDA_KERNEL_MAX_REGISTERS 64
BMW: 2:46
Classroom: 8:03
Fishy Cat: 7:10
Koro: 12:46
Pavillion: 16:06
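To make the comparison easier to read, the two lists above can be turned into relative changes per scene. This is only a rough sketch and not part of Cycles: the helper names `to_seconds` and `percent_change` are made up for illustration, and the timings are single runs, so small differences may just be noise.

```python
def to_seconds(stamp):
    """Convert an 'M:SS' render time to whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_change(before, after):
    """Relative change in render time; negative means 'after' was faster."""
    b = to_seconds(before)
    a = to_seconds(after)
    return 100.0 * (a - b) / b


# Render times copied from the two lists above (hypothetical variable names).
times_48 = {
    "BMW": "2:38",
    "Classroom": "7:43",
    "Fishy Cat": "7:20",
    "Koro": "14:03",
    "Pavillion": "15:12",
}
times_64 = {
    "BMW": "2:46",
    "Classroom": "8:03",
    "Fishy Cat": "7:10",
    "Koro": "12:46",
    "Pavillion": "16:06",
}

# Change when going from 48 to 64 registers, per scene.
changes = {
    scene: percent_change(times_48[scene], times_64[scene])
    for scene in times_48
}

for scene, change in changes.items():
    print(f"{scene}: {change:+.1f}%")
```

A positive value means the 64-register build took longer on that scene, a negative value means it was faster.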
Greetings, Knork
On 11/16/2017 12:11 AM, Brecht Van Lommel wrote:
> Still, I suggest committing this change for CUDA 9, checking with
> __CUDACC_VER_MAJOR__. We can ask NVidia to take a look and see if
> there's a way to get back the performance from the early CUDA 9.0.102
> release (which was a beta I think). But avoiding the major slowdown
> for now is good.
>
> Here's a graph relative to CUDA 8 for completeness.
> https://developer.blender.org/F1142667
>
>
> On Wed, Nov 15, 2017 at 10:37 PM, Stefan Werner <stewreo at gmail.com> wrote:
>
> Seems to be not just the CUDA version but also the chip model. I
> now ran my benchmarks on a GTX 1060 too; there the difference
> between 48 and 64 registers is close to nothing:
>
> 64 registers:
> BMW: 2m41s
> Classroom: 8m02s
> Fishy Cat: 6m39s
> Koro: 11m17s
> Pavillion: 13m38s
>
> 48 registers:
> BMW: 2m43s
> Classroom: 7m56s
> Fishy Cat: 6m52s
> Koro: 12m17s
> Pavillion: 13m50s
>
> Maybe here it's the ratio of bandwidth/core that makes register
> spilling less costly on the 1060 than on the 1080Ti?
>
> Well, there go my dreams of a one-line commit that brings a 10-20%
> performance boost.
>
> -Stefan
>
> On Wed, Nov 15, 2017 at 6:28 PM, Brecht Van Lommel
> <brechtvanlommel at pandora.be> wrote:
>
> It seems to be related to the CUDA version, 9.0.176 has a
> performance regression compared to 9.0.102. Increasing the
> registers partially compensates for that, but not entirely.
> https://developer.blender.org/F1141999
>
> On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner
> <stewreo at gmail.com> wrote:
>
> Wow, those results are almost the complete opposite of
> what I'm seeing. I re-ran the tests on Linux:
>
> Nvidia 1080Ti, driver 384.90, installed as secondary GPU
> (no display attached)
> Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0
> master branch, 556b13f03e561b54d4f0186e207f080c786f8b66
>
> 48 registers:
> BMW: 1m28s
> Classroom: 3m12s
> Fishy Cat: 3m07s
> Koro: 5m40s
> Pavillion: 6m52s
> Victor: 15m01s
>
> 64 registers:
> BMW: 1m11s
> Classroom: 2m59s
> Fishy Cat: 2m51s
> Koro: 4m39s
> Pavillion: 5m32s
> Victor: 12m19s
>
> (Victor had a tile size of 32, all others were the
> *_gpu.blend files with the default 256 tile size)
>
> On Windows, all GTX cards are treated as display cards,
> regardless of whether a monitor is plugged in or not. Only
> Quadro, Tesla and Titan cards can be set to TCC; that mode
> is not available for my GTX.
>
> I wonder what's behind the difference we're seeing? The
> GPUs themselves shouldn't be that different, both are
> based on GP102, where only the 1080Ti has two SMX units
> disabled.
>
> -Stefan
>
> On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel
> <brechtvanlommel at pandora.be> wrote:
>
> Hi,
>
> The registers were set based on benchmarks with a GTX
> 1080 on Linux, when we first optimized the code for
> Pascal. But that was more than a year ago. Going from
> 63 to 64 registers should be fine if it's faster.
>
> Here are benchmarks with a Titan Xp on Linux,
> driver 384.90. Results are not so good there:
> CUDA 8.0.61: https://developer.blender.org/F1137606
> CUDA 9.0.102: https://developer.blender.org/F1137502
>
> Which driver and CUDA version are you using?
>
> One difference between Windows and Linux is the
> compute preemption support. It might be useful to test
> if that min_blocks *= 8 helps on Windows, if your GTX
> 1080Ti is used for display.
> https://developer.blender.org/rBe360d003e
>
> Regards,
> Brecht.
>
>
> On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner
> <stewreo at gmail.com> wrote:
>
> Hello,
>
> currently the Cuda kernel uses the same launch
> bounds for Pascal (SM 6.x) as for Maxwell (SM 5.x)
> hardware, that is 63 registers for branched path
> tracing and 48 registers for path tracing. Are all
> of those derived from benchmarks or is the value
> for Pascal just being carried over from Maxwell?
>
> The reason I'm asking is that I'm observing a
> performance increase on Pascal when I increase the
> number of registers to 64 for path tracing. Here
> are before/after benchmarks from a GTX 1080Ti/Win10:
>
> 48 registers (as is):
> BMW: 1m52s
> Classroom: 3m31s
> Fishy Cat: 4m33s
> Koro: 8m30s
> Pavillion: 7m39s
>
> 64 registers:
> BMW: 1m36s
> Classroom: 3m34s
> Fishy Cat: 3m57s
> Koro: 6m45s
> Pavillion: 6m39s
>
> With the exception of the classroom scene, all
> benchmarks show significantly better performance.
> If there are no objections, I'd like to commit
> that register increase for SM 6.x to master.
>
> Running the same test on a Quadro M4000 (Maxwell)
> shows much smaller differences, so I'd leave SM
> 5.x as is:
>
> 48 registers (as is):
> BMW: 4m38s
> Classroom: 12m32s
> Fishy Cat: 11m18s
> Koro: 20m38s
> Pavillion: 21m12s
>
> 64 registers:
> BMW: 4m38s
> Classroom: 13m07s
> Fishy Cat: 10m52s
> Koro: 18m51s
> Pavillion: 21m32s
>
> Another note: 63 registers was a hard limit for SM
> 2.x hardware. Is 63 instead of 64 as register
> limit for kernels SM 3.x and higher just carried
> over or is there a reason to not go to 64 registers?
>
> -Stefan
> PS: I'd love it if someone would sacrifice the
> time to run 48/64 register comparison benchmarks
> on other Pascal hardware and/or on Linux.
>
> _______________________________________________
> Bf-cycles mailing list
> Bf-cycles at blender.org <mailto:Bf-cycles at blender.org>
> https://lists.blender.org/mailman/listinfo/bf-cycles
> <https://lists.blender.org/mailman/listinfo/bf-cycles>
--
Jan Scheffczyk
w: https://knork.org