[Bf-cycles] CUDA performance tests

Brecht Van Lommel brechtvanlommel at pandora.be
Mon Apr 28 14:49:43 CEST 2014


Rolf, thanks for the tests; I hadn't looked into sm_50 yet. It
seems there is a big performance regression after the
__launch_bounds__ change, while the other performance differences are
more or less as expected.

I'm not sure why there is a difference there. You changed
CUDA_MULTIPROCESSOR_MAX_BLOCKS to 32 for sm_50, which is correct. I
also checked the ptxas output to verify that it is still using at most
32 registers, and that seems to be the case:

pre __launch_bounds__:

ptxas info    : Compiling entry function 'kernel_cuda_path_trace' for 'sm_50'
ptxas info    : Function properties for kernel_cuda_path_trace
    9248 bytes stack frame, 3764 bytes spill stores, 3676 bytes spill loads
ptxas info    : Used 32 registers, 764 bytes cmem[0], 392 bytes cmem[2], 100 textures

post __launch_bounds__:

ptxas info    : Compiling entry function 'kernel_cuda_path_trace' for 'sm_50'
ptxas info    : Function properties for kernel_cuda_path_trace
    9248 bytes stack frame, 3776 bytes spill stores, 3668 bytes spill loads
ptxas info    : Used 32 registers, 764 bytes cmem[0], 392 bytes cmem[2], 100 textures

If possible, could you show the ptxas output for your build, to
verify whether you are getting the right values? It is printed to the
console by default during builds.

Thanks,
Brecht.

On Sat, Apr 26, 2014 at 3:33 PM, Brecht Van Lommel
<brechtvanlommel at pandora.be> wrote:
> For the two cards I have here, it seems that after tweaking the code,
> CUDA 6.0 performance is similar to previous releases with 5.0. There's
> no improvement and even some small slowdown, but we actually had to do
> quite a bit of work since 5.5 was released to even get it to that point,
> and some features have been added in the meantime. Overall I think
> it's acceptable if other cards show similar results.
>
> The patches are from an earlier test I did with the CUDA 6.0 beta,
> based on feedback from NVidia. They help on sm_30 and sm_35, but not
> sm_20 and sm_21 (these changes were also tested on 480 GTX and Tesla
> K20m). So it seems we should enable them for sm_3x if we switch to
> CUDA 6.0.
>
> We need to test more cards and platforms, comparing latest master +
> CUDA 6.0 to the 2.69 and 2.70a releases, before we can make a
> decision.
>
> Brecht.
>
> On Fri, Apr 25, 2014 at 5:44 PM, Brecht Van Lommel
> <brechtvanlommel at pandora.be> wrote:
>> Hi all,
>>
>> We're having some trouble with CUDA performance at the moment,
>> including a performance regression on Titan cards compared to 2.69, and
>> unclear effects from recent changes. Further, there's the new CUDA 6.0
>> toolkit, which we would ideally move to, as sticking to 5.0 is a
>> problem with newer compilers.
>>
>> We could use some help testing all this, especially for Titan cards as
>> I don't have one of those.
>>
>>
>> There's a google doc here:
>> https://docs.google.com/spreadsheets/d/1IKvO6s7h_0oVGbGE3giowx2fBvyXzIfdM3qCSfROJfI
>>
>> For now I'm assuming someone with a Titan card can make their own
>> builds to test these git revisions. If not, we can build and upload
>> binaries to test.
>>
>> The test is:
>> * download: http://www.pasteall.org/blend/28679
>> * run: ./blender -b modified_bmw.blend -f 1
>>
>> If you've got a different .blend file that is giving performance
>> regressions, or you find different behavior when rendering in the UI
>> rather than command line, then results from that are welcome too.
>>
>>
>> The information that I'm looking for from this is especially:
>> * Which revision caused the Titan performance regressions?
>> * Is CUDA 6.0 performance acceptable to us for the next release?
>> * Were there significant performance regressions due to recent changes?
>> * What effect do the two provided patches have when building with CUDA 6.0?
>>
>> Thanks,
>> Brecht.

