<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>I am a little late, but I can confirm that on CUDA 8 with a GTX 1060
(6GB) the higher register limit does not consistently increase performance.</p>
System:<br>
Debian SID<br>
Source compiled with gcc 7.2<br>
CUDA compiled with clang-3.8<br>
cuda-compile tools: 8.0<br>
<p>
CUDA_KERNEL_MAX_REGISTERS 48<br>
BMW: 2:38<br>
Classroom: 7:43<br>
Fishy Cat: 7:20<br>
Koro: 14:03<br>
Pavillion: 15:12<br>
<br>
CUDA_KERNEL_MAX_REGISTERS 64<br>
BMW: 2:46<br>
Classroom: 8:03<br>
Fishy Cat: 7:10<br>
Koro: 12:46<br>
Pavillion: 16:06</p>
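<p>For anyone who prefers relative numbers over raw times, here is a minimal
C++ sketch that converts the results above to seconds and prints the change
from 48 to 64 registers. The helper names (<code>to_seconds</code>,
<code>percent_change</code>, <code>print_changes</code>) are made up for this
illustration and are not part of Cycles.</p>

```cpp
#include <cassert>
#include <cstdio>

// Convert a "minutes:seconds" render time to seconds.
constexpr int to_seconds(int minutes, int seconds) {
    return minutes * 60 + seconds;
}

// Percentage change in render time from `before` to `after`.
// Negative values mean the render got faster.
double percent_change(int before, int after) {
    assert(before > 0);
    return 100.0 * (after - before) / before;
}

struct Scene {
    const char* name;
    int t48;  // seconds with CUDA_KERNEL_MAX_REGISTERS 48
    int t64;  // seconds with CUDA_KERNEL_MAX_REGISTERS 64
};

// Times copied from the results above (CUDA 8, 1060 6GB).
const Scene kScenes[] = {
    {"BMW",       to_seconds(2, 38),  to_seconds(2, 46)},
    {"Classroom", to_seconds(7, 43),  to_seconds(8, 3)},
    {"Fishy Cat", to_seconds(7, 20),  to_seconds(7, 10)},
    {"Koro",      to_seconds(14, 3),  to_seconds(12, 46)},
    {"Pavillion", to_seconds(15, 12), to_seconds(16, 6)},
};

// Print one line per scene: the name followed by a signed percentage.
void print_changes() {
    for (const Scene& s : kScenes) {
        std::printf("%-10s %+6.1f%%\n", s.name, percent_change(s.t48, s.t64));
    }
}
```

<p>With these numbers, Koro is about 9% faster and Fishy Cat about 2% faster
with 64 registers, while BMW, Classroom and Pavillion are roughly 4% to 6%
slower.</p>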
<p><br>
</p>
<p>Greetings Knork<br>
</p>
<br>
<div class="moz-cite-prefix">On 11/16/2017 12:11 AM, Brecht Van
Lommel wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAKFUgC3DQM+4ZKmLuz642zj+Mv+xS6g2MaV5x_NO71trec=Eqg@mail.gmail.com">
<div dir="ltr">Still, I suggest committing this change for CUDA 9,
checking with __CUDACC_VER_MAJOR__. We can ask NVIDIA to take a
look and see if there's a way to get back the performance from
the early CUDA 9.0.102 release (which was a beta I think). But
avoiding the major slowdown for now is good.
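<div>A minimal sketch of such a compiler-version check, written as plain C++
so that it also compiles without nvcc. __CUDACC_VER_MAJOR__ is the macro named
above, which nvcc predefines; the names max_registers_for_cuda_major and
kExampleMaxRegisters are illustrative assumptions, and the 48/64 values are
simply the two limits benchmarked in this thread, not the actual Cycles
change.</div>

```cpp
#include <cassert>

// Hypothetical helper: choose the per-thread register cap for the path
// tracing kernel from the CUDA compiler's major version.
// 48 and 64 are the two values benchmarked in this thread.
constexpr int max_registers_for_cuda_major(int cuda_major) {
    return cuda_major >= 9 ? 64 : 48;
}

// nvcc predefines __CUDACC_VER_MAJOR__; a plain host compiler does not,
// in which case this sketch falls back to the lower cap.
#if defined(__CUDACC_VER_MAJOR__)
constexpr int kExampleMaxRegisters =
    max_registers_for_cuda_major(__CUDACC_VER_MAJOR__);
#else
constexpr int kExampleMaxRegisters = max_registers_for_cuda_major(0);
#endif

static_assert(kExampleMaxRegisters == 48 || kExampleMaxRegisters == 64,
              "register cap must be one of the two benchmarked values");
```

<div>How the chosen cap is then fed into the kernel's launch bounds is left
out of this sketch.</div>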
<div><br>
</div>
<div>Here's a graph relative to CUDA 8 for completeness.</div>
<div><a href="https://developer.blender.org/F1142667"
moz-do-not-send="true">https://developer.blender.org/F1142667</a></div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Nov 15, 2017 at 10:37 PM,
Stefan Werner <span dir="ltr"><<a
href="mailto:stewreo@gmail.com" target="_blank"
moz-do-not-send="true">stewreo@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div>Seems to be not just the CUDA version but also the
chip model. I now ran my benchmarks on a GTX 1060 too;
there the difference between 48 and 64 registers is close
to nothing:<br>
<br>
64 registers:<br>
BMW: 2m41s<br>
Classroom: 8m02s<br>
Fishy Cat: 6m39s<br>
Koro: 11m17s<br>
Pavillion: 13m38s<br>
<br>
48 registers:<br>
BMW: 2m43s<br>
Classroom: 7m56s<br>
Fishy Cat: 6m52s<br>
Koro: 12m17s<br>
Pavillion: 13m50s<br>
<br>
</div>
Maybe here it's the ratio of bandwidth/core that makes
register spilling less costly on the 1060 than on the
1080Ti?<br>
<div><br>
</div>
<div>Well, there go my dreams of a one-line commit that
brings 10-20% performance boost.</div>
<span class="HOEnZb"><font color="#888888">
<div><br>
</div>
<div>-Stefan<br>
</div>
</font></span></div>
<div class="HOEnZb">
<div class="h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Nov 15, 2017 at 6:28
PM, Brecht Van Lommel <span dir="ltr"><<a
href="mailto:brechtvanlommel@pandora.be"
target="_blank" moz-do-not-send="true">brechtvanlommel@pandora.be</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">It seems to be related to the CUDA
version, 9.0.176 has a performance regression
compared to 9.0.102. Increasing the registers
partially compensates for that, but not
entirely.
<div><a
href="https://developer.blender.org/F1141999"
target="_blank" moz-do-not-send="true">https://developer.blender.org/<wbr>F1141999</a><br>
</div>
<div>
<div class="m_-5574327464039304880h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, Nov 15,
2017 at 12:49 PM, Stefan Werner <span
dir="ltr"><<a
href="mailto:stewreo@gmail.com"
target="_blank"
moz-do-not-send="true">stewreo@gmail.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="ltr">
<div>Wow, those results are almost
the complete opposite of what I'm
seeing. I re-ran the tests on
Linux:<br>
<br>
Nvidia 1080Ti, driver 384.90,
installed as secondary GPU (no
display attached)<br>
Xubuntu 17.04, CUDA 9.0.176, gcc
6.3.0<br>
</div>
master branch,
556b13f03e561b54d4f0186e207f08<wbr>0c786f8b66<br>
<div>
<div><br>
48 registers:<br>
BMW: 1m28s<br>
Classroom: 3m12s<br>
Fishy Cat: 3m07s<br>
Koro: 5m40s<br>
Pavillion: 6m52s<br>
Victor: 15m01s<br>
<br>
64 registers:<br>
BMW: 1m11s<br>
Classroom: 2m59s<br>
Fishy Cat: 2m51s<br>
Koro: 4m39s<br>
Pavillion: 5m32s<br>
Victor: 12m19s</div>
<div><br>
</div>
<div>(Victor had a tile size of
32, all others were the
*_gpu.blend files with the
default 256 tile size)<br>
</div>
<div><br>
</div>
<div>On Windows, all GTX cards are
treated as display cards,
regardless of whether a monitor
is plugged in or not. Only
Quadro, Tesla and Titan cards
can be set to TCC; that mode is
not available for my GTX.</div>
<div><br>
</div>
<div>I wonder what's behind the
difference we're seeing? The
GPUs themselves shouldn't be
that different, both are based
on GP102, where only the 1080Ti
has two SMX units disabled.<span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font
color="#888888"><br>
</font></span></div>
<span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font
color="#888888">
<div><br>
</div>
<div>-Stefan<br>
</div>
</font></span></div>
</div>
<div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb">
<div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797h5">
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed,
Nov 15, 2017 at 1:35 AM,
Brecht Van Lommel <span
dir="ltr"><<a
href="mailto:brechtvanlommel@pandora.be"
target="_blank"
moz-do-not-send="true">brechtvanlommel@pandora.be</a>></span>
wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<div dir="ltr">Hi,
<div><br>
</div>
<div>The registers were
set based on benchmarks
with a GTX 1080 on
Linux, when we first
optimized the code for
Pascal. But that was
more than a year ago.
Going from 63 to 64
registers should be fine
if it's faster.</div>
<div><br>
</div>
<div>Here are benchmarks
with a Titan Xp,
Linux, driver 384.90.
Results are not so good
there:</div>
<div>CUDA 8.0.61: <a
href="https://developer.blender.org/F1137606"
target="_blank"
moz-do-not-send="true">https://developer<wbr>.blender.org/F1137606</a></div>
<div>CUDA 9.0.102: <a
href="https://developer.blender.org/F1137502"
target="_blank"
moz-do-not-send="true">https://developer.ble<wbr>nder.org/F1137502</a></div>
<div><br>
</div>
<div>Which driver and CUDA
version are you using?</div>
<div><br>
</div>
<div>One difference
between Windows and
Linux is the compute
preemption support. It
might be useful to test
if that min_blocks *= 8
helps on Windows, if
your GTX 1080Ti is used
for display.</div>
<div><a
href="https://developer.blender.org/rBe360d003e"
target="_blank"
moz-do-not-send="true">https://developer.blender.org/<wbr>rBe360d003e</a></div>
<div><br>
</div>
<div>Regards,</div>
<div>Brecht.</div>
<div><br>
</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">
<div>
<div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5">On
Tue, Nov 14, 2017 at
11:48 PM, Stefan
Werner <span
dir="ltr"><<a
href="mailto:stewreo@gmail.com"
target="_blank"
moz-do-not-send="true">stewreo@gmail.com</a>></span> wrote:<br>
</div>
</div>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
<div>
<div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5">
<div dir="ltr">Hello,
<div><br>
</div>
<div>Currently
the CUDA
kernel uses
the same
launch bounds
for Pascal (SM
6.x) as for
Maxwell (SM
5.x) hardware,
that is 63
registers for
branched path
tracing and 48
registers for
path tracing.
Are all of
those derived
from
benchmarks or
is the value
for Pascal
just being
carried over
from Maxwell?</div>
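<div>As a rough illustration of why the register limit matters: fewer
registers per thread allow more warps to be resident on a multiprocessor, at
the cost of possible register spilling. The C++ sketch below estimates the
register-limited number of resident warps. It assumes 65536 32-bit registers
and at most 64 resident warps per multiprocessor (the limits as I understand
them for compute capability 5.2 and 6.1), and it ignores allocation
granularity, shared memory and block-size limits. The name resident_warps is
made up for this example.</div>

```cpp
#include <cassert>

// Assumed hardware limits per multiprocessor (compute capability 5.2 / 6.1).
constexpr int kRegistersPerSM = 65536;  // 32-bit registers
constexpr int kWarpSize = 32;           // threads per warp
constexpr int kMaxWarpsPerSM = 64;      // 2048 resident threads

// Hypothetical helper: upper bound on resident warps when registers are the
// only limiting resource. Ignores allocation granularity and other limits.
constexpr int resident_warps(int registers_per_thread) {
    return (kRegistersPerSM / (registers_per_thread * kWarpSize)) < kMaxWarpsPerSM
               ? (kRegistersPerSM / (registers_per_thread * kWarpSize))
               : kMaxWarpsPerSM;
}

static_assert(resident_warps(48) == 42, "48 registers: 42 of 64 warps");
static_assert(resident_warps(63) == 32, "63 registers: 32 of 64 warps");
static_assert(resident_warps(64) == 32, "64 registers: 32 of 64 warps");
```

<div>Under these assumptions, 63 and 64 registers allow the same number of
resident warps, while 48 registers allow about 30% more.</div>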
<div><br>
</div>
<div>The reason
I'm asking is
that I'm
observing a
performance
increase on
Pascal when I
increase the
number of
registers to
64 for path
tracing. Here
are
before/after
benchmarks
from a GTX
1080Ti/Win10:</div>
<div><br>
</div>
<div>
<div>48
registers (as
is):</div>
<div>BMW: 1m52s</div>
<div>Classroom:
3m31s</div>
<div>Fishy
Cat: 4m33s</div>
<div>Koro:
8m30s</div>
<div>Pavillion:
7m39s</div>
</div>
<div><br>
</div>
<div>
<div>64
registers:</div>
<div>BMW:
1m36s</div>
<div>Classroom:
3m34s</div>
<div>Fishy
Cat: 3m57s</div>
<div>Koro:
6m45s</div>
<div>Pavillion:
6m39s</div>
</div>
<div><br>
</div>
<div>With the
exception of
the classroom
scene, all
benchmarks
show
significantly
better
performance.
If there are
no objections,
I'd like to
commit that
register
increase for
SM 6.x to
master.</div>
<div><br>
</div>
<div>Running the
same test on a
Quadro M4000
(Maxwell)
shows much
smaller
differences,
so I'd leave
SM 5.x as is:</div>
<div><br>
</div>
<div>48
registers (as
is):</div>
<div>
<div>BMW:
4m38s</div>
<div>Classroom:
12m32s</div>
<div>Fishy
Cat: 11m18s</div>
<div>Koro:
20m38s</div>
<div>Pavillion:
21m12s</div>
</div>
<div><br>
</div>
<div>
<div>64
registers:<br>
</div>
<div>BMW:
4m38s</div>
<div>Classroom:
13m07s</div>
<div>Fishy
Cat: 10m52s</div>
<div>Koro:
18m51s</div>
<div>Pavillion:
21m32s</div>
</div>
<div><br>
</div>
<div>Another
note: 63
registers was
a hard limit
for SM 2.x
hardware. Is
63 instead of
64 as register
limit for
kernels SM 3.x
and higher
just carried
over or is
there a reason
to not go to
64 registers?</div>
<span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663m_-2063734483547205841HOEnZb"><font
color="#888888">
<div><br>
</div>
<div>-Stefan</div>
</font></span>
<div>PS: I'd
love it if
someone would
sacrifice the
time to run
48/64 register
comparison
benchmarks on
other Pascal
hardware
and/or on
Linux.</div>
</div>
<br>
</div>
</div>
______________________________<wbr>_________________<br>
Bf-cycles mailing list<br>
<a
href="mailto:Bf-cycles@blender.org"
target="_blank"
moz-do-not-send="true">Bf-cycles@blender.org</a><br>
<a
href="https://lists.blender.org/mailman/listinfo/bf-cycles"
rel="noreferrer"
target="_blank"
moz-do-not-send="true">https://lists.blender.org/mail<wbr>man/listinfo/bf-cycles</a><br>
<br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
</div>
<br>
</blockquote>
</div>
<br>
</div>
</div>
</div>
<br>
</blockquote>
</div>
<br>
</div>
<br>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Jan Scheffczy
w: <a class="moz-txt-link-freetext" href="https://knork.org">https://knork.org</a>
</pre>
</body>
</html>