<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>I am a little late, but I can confirm that on CUDA 8 with a GTX 1060
      (6GB), raising the register limit from 48 to 64 does not consistently
      increase performance.</p>
    System:<br>
    Debian SID<br>
    Source compiled with gcc 7.2<br>
    CUDA compiled with clang-3.8<br>
    cuda-compile tools: 8.0<br>
    <p>
      CUDA_KERNEL_MAX_REGISTERS 48<br>
      BMW: 2:38<br>
      Classroom: 7:43<br>
      Fishy Cat: 7:20<br>
      Koro: 14:03<br>
      Pavillion: 15:12<br>
      <br>
      CUDA_KERNEL_MAX_REGISTERS 64<br>
      BMW: 2:46<br>
      Classroom: 8:03<br>
      Fishy Cat: 7:10<br>
      Koro: 12:46<br>
      Pavillion: 16:06</p>
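    <p>In case anyone wants to compare these against the other numbers in
      this thread, here is a minimal Python sketch that converts the timings
      to seconds and prints the relative change from 48 to 64 registers
      (negative means 64 registers is faster). The helper names to_seconds
      and relative_change are made up for this mail and are not part of
      Cycles.</p>
    <pre>
```python
import re


def to_seconds(stamp):
    """Convert a timing such as '7:43' or '2m41s' into whole seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)[:m](\d+)s?", stamp.strip()).groups()
    return int(minutes) * 60 + int(seconds)


def relative_change(base, other):
    """Percent change of `other` relative to `base`; negative means faster."""
    b, o = to_seconds(base), to_seconds(other)
    return 100.0 * (o - b) / b


# Timings from this mail (GTX 1060 6GB, CUDA 8), as scene: (48 regs, 64 regs)
results = {
    "BMW": ("2:38", "2:46"),
    "Classroom": ("7:43", "8:03"),
    "Fishy Cat": ("7:20", "7:10"),
    "Koro": ("14:03", "12:46"),
    "Pavillion": ("15:12", "16:06"),
}

for scene, (regs48, regs64) in results.items():
    print(f"{scene}: {relative_change(regs48, regs64):+.1f}%")
```
    </pre>
    <p>The same helper also accepts the 2m41s style used elsewhere in the
      thread, so the other result sets can be dropped into the dictionary
      unchanged.</p>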
    <p><br>
    </p>
    <p>Greetings Knork<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 11/16/2017 12:11 AM, Brecht Van
      Lommel wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAKFUgC3DQM+4ZKmLuz642zj+Mv+xS6g2MaV5x_NO71trec=Eqg@mail.gmail.com">
      <div dir="ltr">I still suggest committing this change for CUDA 9,
        guarded by a check on __CUDACC_VER_MAJOR__. We can ask NVIDIA to take a
        look and see if there's a way to get back the performance from
        the early CUDA 9.0.102 release (which was a beta, I think). But
        avoiding the major slowdown for now is good.
        <div><br>
        </div>
        <div>Here's a graph relative to CUDA 8 for completeness.</div>
        <div><a href="https://developer.blender.org/F1142667"
            moz-do-not-send="true">https://developer.blender.org/F1142667</a></div>
        <div><br>
        </div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Wed, Nov 15, 2017 at 10:37 PM,
          Stefan Werner <span dir="ltr">&lt;<a
              href="mailto:stewreo@gmail.com" target="_blank"
              moz-do-not-send="true">stewreo@gmail.com</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr">
              <div>Seems to be not just the CUDA version but also the
                chip model. I now ran my benchmarks on a GTX 1060 too;
                there the difference between 48 and 64 registers is close
                to nothing:<br>
                <br>
                64 registers:<br>
                BMW: 2m41s<br>
                Classroom: 8m02s<br>
                Fishy Cat: 6m39s<br>
                Koro: 11m17s<br>
                Pavillion: 13m38s<br>
                <br>
                48 registers:<br>
                BMW: 2m43s<br>
                Classroom: 7m56s<br>
                Fishy Cat: 6m52s<br>
                Koro: 12m17s<br>
                Pavillion: 13m50s<br>
                <br>
              </div>
              Maybe here it's the ratio of bandwidth/core that makes
              register spilling less costly on the 1060 than on the
              1080Ti?<br>
              <div><br>
              </div>
              <div>Well, there go my dreams of a one-line commit that
                brings a 10-20% performance boost.</div>
              <span class="HOEnZb"><font color="#888888">
                  <div><br>
                  </div>
                  <div>-Stefan<br>
                  </div>
                </font></span></div>
            <div class="HOEnZb">
              <div class="h5">
                <div class="gmail_extra"><br>
                  <div class="gmail_quote">On Wed, Nov 15, 2017 at 6:28
                    PM, Brecht Van Lommel <span dir="ltr">&lt;<a
                        href="mailto:brechtvanlommel@pandora.be"
                        target="_blank" moz-do-not-send="true">brechtvanlommel@pandora.be</a>&gt;</span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div dir="ltr">It seems to be related to the CUDA
                        version: 9.0.176 has a performance regression
                        compared to 9.0.102. Increasing the registers
                        partially compensates for that, but not
                        entirely.
                        <div><a
                            href="https://developer.blender.org/F1141999"
                            target="_blank" moz-do-not-send="true">https://developer.blender.org/<wbr>F1141999</a><br>
                        </div>
                        <div>
                          <div class="m_-5574327464039304880h5">
                            <div class="gmail_extra"><br>
                              <div class="gmail_quote">On Wed, Nov 15,
                                2017 at 12:49 PM, Stefan Werner <span
                                  dir="ltr">&lt;<a
                                    href="mailto:stewreo@gmail.com"
                                    target="_blank"
                                    moz-do-not-send="true">stewreo@gmail.com</a>&gt;</span>
                                wrote:<br>
                                <blockquote class="gmail_quote"
                                  style="margin:0 0 0
                                  .8ex;border-left:1px #ccc
                                  solid;padding-left:1ex">
                                  <div dir="ltr">
                                    <div>Wow, those results are almost
                                      the complete opposite of what I'm
                                      seeing. I re-ran the tests on
                                      Linux:<br>
                                      <br>
                                      Nvidia 1080Ti, driver 384.90,
                                      installed as secondary GPU (no
                                      display attached)<br>
                                      Xubuntu 17.04, CUDA 9.0.176, gcc
                                      6.3.0<br>
                                    </div>
                                    master branch,
                                    556b13f03e561b54d4f0186e207f08<wbr>0c786f8b66<br>
                                    <div>
                                      <div><br>
                                        48 registers:<br>
                                        BMW: 1m28s<br>
                                        Classroom: 3m12s<br>
                                        Fishy Cat: 3m07s<br>
                                        Koro: 5m40s<br>
                                        Pavillion: 6m52s<br>
                                        Victor: 15m01s<br>
                                        <br>
                                         64 registers:<br>
                                         BMW: 1m11s<br>
                                         Classroom: 2m59s<br>
                                         Fishy Cat: 2m51s<br>
                                         Koro: 4m39s<br>
                                         Pavillion: 5m32s<br>
                                         Victor: 12m19s</div>
                                      <div><br>
                                      </div>
                                      <div>(Victor had a tile size of
                                        32, all others were the
                                        *_gpu.blend files with the
                                        default 256 tile size)<br>
                                      </div>
                                      <div><br>
                                      </div>
                                      <div>On Windows, all GTX cards are
                                        treated as display cards,
                                        regardless of whether a monitor
                                        is plugged in or not. Only
                                        Quadro, Tesla and Titan cards
                                        can be set to TCC; that mode is
                                        not available for my GTX.</div>
                                      <div><br>
                                      </div>
                                      <div>I wonder what's behind the
                                        difference we're seeing? The
                                        GPUs themselves shouldn't be
                                        that different, both are based
                                        on GP102, where only the 1080Ti
                                        has two SMX units disabled.<span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font
                                            color="#888888"><br>
                                          </font></span></div>
                                      <span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb"><font
                                          color="#888888">
                                          <div><br>
                                          </div>
                                          <div>-Stefan<br>
                                          </div>
                                        </font></span></div>
                                  </div>
                                  <div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797HOEnZb">
                                    <div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797h5">
                                      <div class="gmail_extra"><br>
                                        <div class="gmail_quote">On Wed,
                                          Nov 15, 2017 at 1:35 AM,
                                          Brecht Van Lommel <span
                                            dir="ltr">&lt;<a
                                              href="mailto:brechtvanlommel@pandora.be"
                                              target="_blank"
                                              moz-do-not-send="true">brechtvanlommel@pandora.be</a>&gt;</span>
                                          wrote:<br>
                                          <blockquote
                                            class="gmail_quote"
                                            style="margin:0 0 0
                                            .8ex;border-left:1px #ccc
                                            solid;padding-left:1ex">
                                            <div dir="ltr">Hi,
                                              <div><br>
                                              </div>
                                              <div>The registers were
                                                set based on benchmarks
                                                with a GTX 1080 on
                                                Linux, when we first
                                                optimized the code for
                                                Pascal. But that was
                                                more than a year ago.
                                                Going from 63 to 64
                                                registers should be fine
                                                if it's faster.</div>
                                              <div><br>
                                              </div>
                                              <div>Here are benchmarks
                                                with a Titan Xp,
                                                Linux, driver 384.90.
                                                Results are not so good
                                                there:</div>
                                              <div>CUDA 8.0.61: <a
                                                  href="https://developer.blender.org/F1137606"
                                                  target="_blank"
                                                  moz-do-not-send="true">https://developer<wbr>.blender.org/F1137606</a></div>
                                              <div>CUDA 9.0.102: <a
                                                  href="https://developer.blender.org/F1137502"
                                                  target="_blank"
                                                  moz-do-not-send="true">https://developer.ble<wbr>nder.org/F1137502</a></div>
                                              <div><br>
                                              </div>
                                              <div>Which driver and CUDA
                                                version are you using?</div>
                                              <div><br>
                                              </div>
                                              <div>One difference
                                                between Windows and
                                                Linux is the compute
                                                preemption support. It
                                                might be useful to test
                                                if that min_blocks *= 8
                                                helps on Windows, if
                                                your GTX 1080Ti is used
                                                for display.</div>
                                              <div><a
                                                  href="https://developer.blender.org/rBe360d003e"
                                                  target="_blank"
                                                  moz-do-not-send="true">https://developer.blender.org/<wbr>rBe360d003e</a></div>
                                              <div><br>
                                              </div>
                                              <div>Regards,</div>
                                              <div>Brecht.</div>
                                              <div><br>
                                              </div>
                                            </div>
                                            <div class="gmail_extra"><br>
                                              <div class="gmail_quote">
                                                <div>
                                                  <div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5">On
                                                    Tue, Nov 14, 2017 at
                                                    11:48 PM, Stefan
                                                    Werner <span
                                                      dir="ltr">&lt;<a
                                                        href="mailto:stewreo@gmail.com"
                                                        target="_blank"
moz-do-not-send="true">stewreo@gmail.com</a>&gt;</span> wrote:<br>
                                                  </div>
                                                </div>
                                                <blockquote
                                                  class="gmail_quote"
                                                  style="margin:0 0 0
                                                  .8ex;border-left:1px
                                                  #ccc
                                                  solid;padding-left:1ex">
                                                  <div>
                                                    <div
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663h5">
                                                      <div dir="ltr">Hello,
                                                        <div><br>
                                                        </div>
                                                        <div>currently
                                                          the CUDA
                                                          kernel uses
                                                          the same
                                                          launch bounds
                                                          for Pascal (SM
                                                          6.x) as for
                                                          Maxwell (SM
                                                          5.x) hardware,
                                                          that is 63
                                                          registers for
                                                          branched path
                                                          tracing and 48
                                                          registers for
                                                          path tracing.
                                                          Are all of
                                                          those derived
                                                          from
                                                          benchmarks or
                                                          is the value
                                                          for Pascal
                                                          just being
                                                          carried over
                                                          from Maxwell?</div>
                                                        <div><br>
                                                        </div>
                                                        <div>The reason
                                                          I'm asking is
                                                          that I'm
                                                          observing a
                                                          performance
                                                          increase on
                                                          Pascal when I
                                                          increase the
                                                          number of
                                                          registers to
                                                          64 for path
                                                          tracing. Here
                                                          are
                                                          before/after
                                                          benchmarks
                                                          from a GTX
                                                          1080Ti/Win10:</div>
                                                        <div><br>
                                                        </div>
                                                        <div>
                                                          <div>48
                                                          registers (as
                                                          is):</div>
                                                          <div>BMW: 1m52s</div>
                                                          <div>Classroom:
                                                          3m31s</div>
                                                          <div>Fishy
                                                          Cat: 4m33s</div>
                                                          <div>Koro:
                                                          8m30s</div>
                                                          <div>Pavillion:
                                                          7m39s</div>
                                                        </div>
                                                        <div><br>
                                                        </div>
                                                        <div>
                                                          <div>64
                                                          registers:</div>
                                                          <div>BMW:
                                                          1m36s</div>
                                                          <div>Classroom:
                                                          3m34s</div>
                                                          <div>Fishy
                                                          Cat: 3m57s</div>
                                                          <div>Koro:
                                                          6m45s</div>
                                                          <div>Pavillion:
                                                          6m39s</div>
                                                        </div>
                                                        <div><br>
                                                        </div>
                                                        <div>With the
                                                          exception of
                                                          the classroom
                                                          scene, all
                                                          benchmarks
                                                          show
                                                          significantly
                                                          better
                                                          performance.
                                                          If there are
                                                          no objections,
                                                          I'd like to
                                                          commit that
                                                          register
                                                          increase for
                                                          SM 6.x to
                                                          master.</div>
                                                        <div><br>
                                                        </div>
                                                        <div>Running the
                                                          same test on a
                                                          Quadro M4000
                                                          (Maxwell)
                                                          shows much
                                                          smaller
                                                          differences,
                                                          so I'd leave
                                                          SM 5.x as is:</div>
                                                        <div><br>
                                                        </div>
                                                        <div>48
                                                          registers (as
                                                          is):</div>
                                                        <div>
                                                          <div>BMW:
                                                          4m38s</div>
                                                          <div>Classroom:
                                                          12m32s</div>
                                                          <div>Fishy
                                                          Cat: 11m18s</div>
                                                          <div>Koro:
                                                          20m38s</div>
                                                          <div>Pavillion:
                                                          21m12s</div>
                                                        </div>
                                                        <div><br>
                                                        </div>
                                                        <div>
                                                          <div>64
                                                          registers:<br>
                                                          </div>
                                                          <div>BMW:
                                                          4m38s</div>
                                                          <div>Classroom:
                                                          13m07s</div>
                                                          <div>Fishy
                                                          Cat: 10m52s</div>
                                                          <div>Koro:
                                                          18m51s</div>
                                                          <div>Pavillion:
                                                          21m32s</div>
                                                        </div>
                                                        <div><br>
                                                        </div>
                                                        <div>Another
                                                          note: 63
                                                          registers was
                                                          a hard limit
                                                          for SM 2.x
                                                          hardware. Is
                                                          63 instead of
                                                          64 as register
                                                          limit for
                                                          kernels SM 3.x
                                                          and higher
                                                          just carried
                                                          over or is
                                                          there a reason
                                                          to not go to
                                                          64 registers?</div>
                                                        <span
class="m_-5574327464039304880m_-7340134897853140633m_759512126984442797m_4961958422127873663m_-2063734483547205841HOEnZb"><font
color="#888888">
                                                          <div><br>
                                                          </div>
                                                          <div>-Stefan</div>
                                                          </font></span>
                                                        <div>PS: I'd
                                                          love it if
                                                          someone would
                                                          sacrifice the
                                                          time to run
                                                          48/64 register
                                                          comparison
                                                          benchmarks on
                                                          other Pascal
                                                          hardware
                                                          and/or on
                                                          Linux.</div>
                                                      </div>
                                                      <br>
                                                    </div>
                                                  </div>
______________________________<wbr>_________________<br>
                                                  Bf-cycles mailing list<br>
                                                  <a
                                                    href="mailto:Bf-cycles@blender.org"
                                                    target="_blank"
                                                    moz-do-not-send="true">Bf-cycles@blender.org</a><br>
                                                  <a
                                                    href="https://lists.blender.org/mailman/listinfo/bf-cycles"
                                                    rel="noreferrer"
                                                    target="_blank"
                                                    moz-do-not-send="true">https://lists.blender.org/mail<wbr>man/listinfo/bf-cycles</a><br>
                                                  <br>
                                                </blockquote>
                                              </div>
                                              <br>
                                            </div>
                                            <br>
                                          </blockquote>
                                        </div>
                                        <br>
                                      </div>
                                    </div>
                                  </div>
                                  <br>
                                </blockquote>
                              </div>
                              <br>
                            </div>
                          </div>
                        </div>
                      </div>
                      <br>
                    </blockquote>
                  </div>
                  <br>
                </div>
              </div>
            </div>
            <br>
          </blockquote>
        </div>
        <br>
      </div>
      <br>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Jan Scheffczy
w: <a class="moz-txt-link-freetext" href="https://knork.org">https://knork.org</a>
</pre>
  </body>
</html>