<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">It’s committed to master. Fingers crossed. I also submitted a bug report to NVIDIA.<div class=""><br class=""></div><div class="">-Stefan<br class=""><div class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 16. Nov 2017, at 00:11, Brecht Van Lommel <<a href="mailto:brechtvanlommel@pandora.be" class="">brechtvanlommel@pandora.be</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Still, I suggest committing this change for CUDA 9, checking with __CUDACC_VER_MAJOR__. We can ask NVIDIA to take a look and see if there's a way to get back the performance from the early CUDA 9.0.102 release (which was a beta, I think). But avoiding the major slowdown for now is good.<div class=""><br class=""></div><div class="">Here's a graph relative to CUDA 8 for completeness.</div><div class=""><a href="https://developer.blender.org/F1142667" class="">https://developer.blender.org/F1142667</a></div><div class=""><br class=""></div></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On Wed, Nov 15, 2017 at 10:37 PM, Stefan Werner <span dir="ltr" class=""><<a href="mailto:stewreo@gmail.com" target="_blank" class="">stewreo@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class=""><div class="">It seems to depend not just on the CUDA version but also on the chip model. 
I now ran my benchmarks on a GTX 1060 too; there the difference between 48 and 64 registers is close to nothing:<br class=""><br class="">64 registers:<br class="">BMW: 2m41s<br class="">Classroom: 8m02s<br class="">Fishy Cat: 6m39s<br class="">Koro: 11m17s<br class="">Pavillion: 13m38s<br class=""><br class="">48 registers:<br class="">BMW: 2m43s<br class="">Classroom: 7m56s<br class="">Fishy Cat: 6m52s<br class="">Koro: 12m17s<br class="">Pavillion: 13m50s<br class=""><br class=""></div>Maybe here it's the ratio of bandwidth per core that makes register spilling less costly on the 1060 than on the 1080Ti?<br class=""><div class=""><br class=""></div><div class="">Well, there go my dreams of a one-line commit that brings a 10-20% performance boost.</div><span class="HOEnZb"><font color="#888888" class=""><div class=""><br class=""></div><div class="">-Stefan<br class=""></div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br class=""><div class="gmail_quote">On Wed, Nov 15, 2017 at 6:28 PM, Brecht Van Lommel <span dir="ltr" class=""><<a href="mailto:brechtvanlommel@pandora.be" target="_blank" class="">brechtvanlommel@pandora.be</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="">It seems to be related to the CUDA version: 9.0.176 has a performance regression compared to 9.0.102. 
Increasing the register count partially compensates for that, but not entirely.<div class=""><a href="https://developer.blender.org/F1141999" target="_blank" class="">https://developer.blender.org/<wbr class="">F1141999</a><br class=""></div><div class=""><div class="m_-5574327464039304880h5"><div class="gmail_extra"><br class=""><div class="gmail_quote">On Wed, Nov 15, 2017 at 12:49 PM, Stefan Werner <span dir="ltr" class=""><<a href="mailto:stewreo@gmail.com" target="_blank" class="">stewreo@gmail.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class=""><div class="">Wow, those results are almost the complete opposite of what I'm seeing. I re-ran the tests on Linux:<br class=""><br class="">NVIDIA 1080Ti, driver 384.90, installed as secondary GPU (no display attached)<br class="">Xubuntu 17.04, CUDA 9.0.176, gcc 6.3.0<br class=""></div>master branch, 556b13f03e561b54d4f0186e207f08<wbr class="">0c786f8b66<br class=""><div class=""><div class=""><br class="">48 registers:<br class="">BMW: 1m28s<br class="">Classroom: 3m12s<br class="">Fishy Cat: 3m07s<br class="">Koro: 5m40s<br class="">Pavillion: 6m52s<br class="">Victor: 15m01s<br class=""><br class="">64 registers:<br class="">BMW: 1m11s<br class="">Classroom: 2m59s<br class="">Fishy Cat: 2m51s<br class="">Koro: 4m39s<br class="">Pavillion: 5m32s<br class="">Victor: 12m19s</div><div class=""><br class=""></div><div class="">(Victor had a tile size of 32; all others were the *_gpu.blend files with the default 256 tile size)<br class=""></div><div class=""><br class=""></div><div class="">On Windows, all GTX cards are treated as display cards, regardless of whether a monitor is plugged in or not. Only Quadro, Tesla and Titan cards can be set to TCC; that mode is not available for my GTX.</div><div class=""><br class=""></div><div class="">I wonder what's behind the difference we're seeing? 
The GPUs themselves shouldn't be that different; both are based on GP102, except that the 1080Ti has two SMs disabled.<span class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font color="#888888" class=""><br class=""></font></span></div><span class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font color="#888888" class=""><div class=""><br class=""></div><div class="">-Stefan<br class=""></div></font></span></div></div><div class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><div class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797h5"><div class="gmail_extra"><br class=""><div class="gmail_quote">On Wed, Nov 15, 2017 at 1:35 AM, Brecht Van Lommel <span dir="ltr" class=""><<a href="mailto:brechtvanlommel@pandora.be" target="_blank" class="">brechtvanlommel@pandora.be</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" class="">Hi,<div class=""><br class=""></div><div class="">The registers were set based on benchmarks with a GTX 1080 on Linux, when we first optimized the code for Pascal. But that was more than a year ago. Going from 63 to 64 registers should be fine if it's faster.</div><div class=""><br class=""></div><div class="">Here are benchmarks with a Titan Xp, Linux, driver 384.90. Results are not so good there:</div><div class="">CUDA 8.0.61: <a href="https://developer.blender.org/F1137606" target="_blank" class="">https://developer<wbr class="">.blender.org/F1137606</a></div><div class="">CUDA 9.0.102: <a href="https://developer.blender.org/F1137502" target="_blank" class="">https://developer.ble<wbr class="">nder.org/F1137502</a></div><div class=""><br class=""></div><div class="">Which driver and CUDA version are you using?</div><div class=""><br class=""></div><div class="">One difference between Windows and Linux is the compute preemption support. 
It might be useful to test whether that min_blocks *= 8 change helps on Windows, if your GTX 1080Ti is used for display.</div><div class=""><a href="https://developer.blender.org/rBe360d003e" target="_blank" class="">https://developer.blender.org/<wbr class="">rBe360d003e</a></div><div class=""><br class=""></div><div class="">Regards,</div><div class="">Brecht.</div><div class=""><br class=""></div></div><div class="gmail_extra"><br class=""><div class="gmail_quote"><div class=""><div class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5">On Tue, Nov 14, 2017 at 11:48 PM, Stefan Werner <span dir="ltr" class=""><<a href="mailto:stewreo@gmail.com" target="_blank" class="">stewreo@gmail.com</a>></span> wrote:<br class=""></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class=""><div class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5"><div dir="ltr" class="">Hello,<div class=""><br class=""></div><div class="">Currently, the CUDA kernel uses the same launch bounds for Pascal (SM 6.x) as for Maxwell (SM 5.x) hardware, that is, 63 registers for branched path tracing and 48 registers for path tracing. Are all of those derived from benchmarks, or is the value for Pascal just being carried over from Maxwell?</div><div class=""><br class=""></div><div class="">The reason I'm asking is that I'm observing a performance increase on Pascal when I increase the number of registers to 64 for path tracing. 
Here are before/after benchmarks from a GTX 1080Ti/Win10:</div><div class=""><br class=""></div><div class=""><div class="">48 registers (as is):</div><div class="">BMW: 1m52s</div><div class="">Classroom: 3m31s</div><div class="">Fishy Cat: 4m33s</div><div class="">Koro: 8m30s</div><div class="">Pavillion: 7m39s</div></div><div class=""><br class=""></div><div class=""><div class="">64 registers:</div><div class="">BMW: 1m36s</div><div class="">Classroom: 3m34s</div><div class="">Fishy Cat: 3m57s</div><div class="">Koro: 6m45s</div><div class="">Pavillion: 6m39s</div></div><div class=""><br class=""></div><div class="">With the exception of the Classroom scene, all benchmarks show significantly better performance. If there are no objections, I'd like to commit that register increase for SM 6.x to master.</div><div class=""><br class=""></div><div class="">Running the same test on a Quadro M4000 (Maxwell) shows much smaller differences, so I'd leave SM 5.x as is:</div><div class=""><br class=""></div><div class="">48 registers (as is):</div><div class=""><div class="">BMW: 4m38s</div><div class="">Classroom: 12m32s</div><div class="">Fishy Cat: 11m18s</div><div class="">Koro: 20m38s</div><div class="">Pavillion: 21m12s</div></div><div class=""><br class=""></div><div class=""><div class="">64 registers:<br class=""></div><div class="">BMW: 4m38s</div><div class="">Classroom: 13m07s</div><div class="">Fishy Cat: 10m52s</div><div class="">Koro: 18m51s</div><div class="">Pavillion: 21m32s</div></div><div class=""><br class=""></div><div class="">Another note: 63 registers was a hard limit for SM 2.x hardware. 
Is the limit of 63 rather than 64 registers for kernels on SM 3.x and higher just carried over, or is there a reason not to go to 64 registers?</div><span class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663m_-2063734483547205841HOEnZb"><font color="#888888" class=""><div class=""><br class=""></div><div class="">-Stefan</div></font></span><div class="">PS: I'd love it if someone would sacrifice the time to run 48/64 register comparison benchmarks on other Pascal hardware and/or on Linux.</div></div>
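<div class=""><br class=""></div><div class="">For reference, a rough sketch of the arithmetic behind the 48/63/64 register question. It assumes 65536 registers per SM and a 16x16 thread block, neither of which is stated in this thread, and the function names are hypothetical, not taken from the Cycles source. It is plain C++ so it can be checked with a host compiler:</div>
<pre class="">
```cpp
// Assumptions (not from this thread): 65536 registers per SM and
// 16x16 = 256 threads per block. All names here are hypothetical.
constexpr int kRegistersPerSM = 65536;
constexpr int kThreadsPerBlock = 16 * 16;

// Number of blocks that fit on one SM when every thread may use up to
// registers_per_thread registers. Integer division rounds down.
constexpr int min_blocks_per_sm(int registers_per_thread,
                                int threads_per_block = kThreadsPerBlock,
                                int registers_per_sm = kRegistersPerSM)
{
  return registers_per_sm / (threads_per_block * registers_per_thread);
}

// Register budget for the path tracing kernel as proposed above:
// 64 on SM 6.x (Pascal), unchanged at 48 elsewhere (e.g. SM 5.x).
constexpr int path_trace_registers(int sm_major)
{
  return (sm_major == 6) ? 64 : 48;
}

// Compile-time checks of the arithmetic.
static_assert(min_blocks_per_sm(48) == 5, "48 registers: 5 blocks per SM");
static_assert(min_blocks_per_sm(63) == 4, "63 registers: 4 blocks per SM");
static_assert(min_blocks_per_sm(64) == 4, "64 registers: 4 blocks per SM");
```
</pre>
<div class="">Under these assumptions, 63 and 64 registers give the same number of resident blocks, so moving from 63 to 64 would not lower the occupancy bound, while moving from 48 to 64 trades one resident block for more registers per thread.</div>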

<br class=""></div></div>______________________________<wbr class="">_________________<br class="">

Bf-cycles mailing list<br class="">

<a href="mailto:Bf-cycles@blender.org" target="_blank" class="">Bf-cycles@blender.org</a><br class="">

<a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank" class="">https://lists.blender.org/mail<wbr class="">man/listinfo/bf-cycles</a><br class="">

<br class=""></blockquote></div><br class=""></div>

<br class="">______________________________<wbr class="">_________________<br class="">

Bf-cycles mailing list<br class="">

<a href="mailto:Bf-cycles@blender.org" target="_blank" class="">Bf-cycles@blender.org</a><br class="">

<a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank" class="">https://lists.blender.org/mail<wbr class="">man/listinfo/bf-cycles</a><br class="">

<br class=""></blockquote></div><br class=""></div>

</div></div><br class="">______________________________<wbr class="">_________________<br class="">

Bf-cycles mailing list<br class="">

<a href="mailto:Bf-cycles@blender.org" target="_blank" class="">Bf-cycles@blender.org</a><br class="">

<a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank" class="">https://lists.blender.org/mail<wbr class="">man/listinfo/bf-cycles</a><br class="">

<br class=""></blockquote></div><br class=""></div></div></div></div>

<br class="">______________________________<wbr class="">_________________<br class="">

Bf-cycles mailing list<br class="">

<a href="mailto:Bf-cycles@blender.org" target="_blank" class="">Bf-cycles@blender.org</a><br class="">

<a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank" class="">https://lists.blender.org/mail<wbr class="">man/listinfo/bf-cycles</a><br class="">

<br class=""></blockquote></div><br class=""></div>

</div></div><br class="">______________________________<wbr class="">_________________<br class="">

Bf-cycles mailing list<br class="">

<a href="mailto:Bf-cycles@blender.org" class="">Bf-cycles@blender.org</a><br class="">

<a href="https://lists.blender.org/mailman/listinfo/bf-cycles" rel="noreferrer" target="_blank" class="">https://lists.blender.org/<wbr class="">mailman/listinfo/bf-cycles</a><br class="">

<br class=""></blockquote></div><br class=""></div>

</div></blockquote></div><br class=""></div></div></body></html>