[Bf-committers] CUDA performance analysis (with my experimental changes)
Doug Gale
doug65536 at gmail.com
Wed Jun 5 12:24:14 CEST 2013
I modified the CUDA renderer to accept environment variable overrides
so that a performance test can iterate through several combinations of
parameters.
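A minimal sketch of how such a sweep could be driven from a script. The
environment variable names and the command passed in are hypothetical
(the post does not name them), and treating a non-zero exit status as a
failed combination is an assumption about how the renderer reports errors.

```python
import itertools
import os
import subprocess
import time

# Hypothetical variable names; the real override names are not given in the post.
ENV_REGISTERS = "CUDA_TEST_REGISTERS"
ENV_WARP_SIZE = "CUDA_TEST_WARP_SIZE"
ENV_OPTIMIZATION = "CUDA_TEST_OPTIMIZATION"


def combinations(registers, warp_sizes, optimizations):
    """Yield one dict of environment overrides per parameter combination."""
    for r, w, o in itertools.product(registers, warp_sizes, optimizations):
        yield {ENV_REGISTERS: str(r), ENV_WARP_SIZE: str(w), ENV_OPTIMIZATION: str(o)}


def run_render(command, overrides):
    """Run one render with the overrides applied; return seconds, or None on failure."""
    env = dict(os.environ, **overrides)
    start = time.perf_counter()
    result = subprocess.run(command, env=env)
    elapsed = time.perf_counter() - start
    return elapsed if result.returncode == 0 else None


def sweep(command, registers, warp_sizes, optimizations, runner=run_render):
    """Time every combination; successful runs come first, fastest on top."""
    results = [
        (overrides, runner(command, overrides))
        for overrides in combinations(registers, warp_sizes, optimizations)
    ]
    results.sort(key=lambda item: (item[1] is None, item[1] or 0.0))
    return results
```

Calling `sweep(["blender", "-b", "scene.blend", "-f", "1"], [20, 24, 32, 42, 63],
[64, 256, 1024], range(5))` would cover a 5 x 3 x 5 grid of 75 combinations;
`scene.blend` is a placeholder file name.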
The code includes several performance optimizations that I have made:
eliminating redundant memory allocations, making all operations
asynchronous (all memory copies and kernel launches), using two streams
fed from a single thread (interleaved 1:1 between them), each processing
a separate tile, and using a RenderBuffers pool to eliminate
RenderBuffers allocations and frees.
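The scheduling idea can be modelled in a few lines. This is only a
language-neutral illustration in Python, not CUDA code and not the actual
patch: `BufferPool`, `assign_streams` and `render` are made-up names, and
returning a buffer to the pool stands in for "the previous tile on that
stream has finished".

```python
from collections import defaultdict


class BufferPool:
    """Reuse idle buffers of a given size instead of allocating one per tile."""

    def __init__(self, allocate):
        self._allocate = allocate       # called only when no idle buffer fits
        self._free = defaultdict(list)  # (width, height) -> idle buffers
        self.allocations = 0

    def acquire(self, width, height):
        idle = self._free[(width, height)]
        if idle:
            return idle.pop()
        self.allocations += 1
        return self._allocate(width, height)

    def release(self, width, height, buffer):
        self._free[(width, height)].append(buffer)


def assign_streams(tiles, stream_count=2):
    """Interleave tiles 1:1 across streams: tile 0 -> stream 0, tile 1 -> stream 1, ..."""
    return [(tile, index % stream_count) for index, tile in enumerate(tiles)]


def render(tiles, pool, tile_size=64):
    """Walk the tiles in order; each stream keeps one buffer in flight."""
    in_flight = {}
    for tile, stream in assign_streams(tiles):
        if stream in in_flight:
            # The stream's previous tile is done, so its buffer can be reused.
            pool.release(tile_size, tile_size, in_flight.pop(stream))
        in_flight[stream] = pool.acquire(tile_size, tile_size)
        # Asynchronous copies and the kernel launch for `tile` would be queued here.
    return pool.allocations
```

With two streams, `render(range(6), BufferPool(lambda w, h: bytearray(w * h * 4)))`
performs two allocations in total rather than six.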
Surprisingly, the largest register count with the smallest warp size had
the best time! The configuration with the best time has 33% occupancy.
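The 33% figure can be reproduced with a simplified occupancy model for
compute capability 2.0. The sketch assumes that the "warp size" parameter
is the number of threads per block, that registers are allocated per warp
in units of 64, and it ignores shared memory; the per-SM limits (32768
registers, 48 warps, 8 blocks) are the published Fermi values.

```python
# Simplified occupancy model for compute capability 2.0 (Fermi).
REGS_PER_SM = 32768      # 32-bit registers per multiprocessor
MAX_WARPS_PER_SM = 48    # 1536 resident threads / 32
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32           # hardware warp width
REG_ALLOC_UNIT = 64      # assumed per-warp register allocation granularity


def ceil_div(a, b):
    return -(-a // b)


def occupancy(regs_per_thread, threads_per_block):
    """Fraction of the SM's warp slots that can be resident at once."""
    warps_per_block = ceil_div(threads_per_block, WARP_SIZE)
    regs_per_warp = ceil_div(regs_per_thread * WARP_SIZE, REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    warps_by_registers = REGS_PER_SM // regs_per_warp
    blocks = min(
        MAX_BLOCKS_PER_SM,
        MAX_WARPS_PER_SM // warps_per_block,
        warps_by_registers // warps_per_block,
    )
    return blocks * warps_per_block / MAX_WARPS_PER_SM


print(round(occupancy(63, 64), 2))   # 0.33
```

Under this model, 63 registers per thread leaves room for 16 warps (512
threads) out of 48, which is one third. With 64 threads per block the
8-block limit gives the same 16 warps regardless of register count.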
Tested on SM 2.0 hardware (a single GTX 580), rendered from the command
line. The display driver is nvidia-experimental-310. The scene is the BMW
scene at the default size (1280x600) with percentage at 75% (result is
960x450) and tile size at the default 64x64. Using the CUDA 5.0 toolkit
on Linux Mint 14 MATE 64-bit. The Linux kernel is 3.5.0-28-generic. The
CPU is a 6-core Core i7 990X Extreme Edition with 12MB L3. 20GB of 1066
RAM in triple-channel configuration. The bus interface is 16x PCIe.
*registers* *warp size* *optimization* *time (s)*
63 64 3 70.80
63 64 2 70.81
63 64 1 70.87
63 64 0 70.96
63 64 4 71.03
42 64 3 72.10
42 64 0 72.17
42 64 4 72.21
32 64 0 72.44
42 64 2 72.47
32 64 4 72.49
63 256 4 72.61
63 256 3 72.65
63 256 1 72.70
32 64 2 72.77
42 64 1 72.78
32 64 1 72.82
63 256 2 73.09
63 256 0 73.27
32 64 3 73.55
42 256 3 74.16
42 256 0 74.26
42 256 2 74.34
42 256 4 74.39
42 256 1 74.42
32 256 2 74.57
32 256 4 74.57
32 256 0 74.60
24 64 3 74.62
32 256 3 74.70
24 64 1 74.75
24 64 2 75.09
32 256 1 75.15
24 64 4 75.30
24 64 0 75.39
24 256 3 76.90
20 64 1 76.92
20 64 2 77.12
20 64 0 77.14
20 64 3 77.14
24 256 1 77.42
24 256 2 77.64
20 64 4 77.65
24 256 0 78.31
20 256 1 79.21
20 256 4 79.23
20 256 2 79.43
*20* *256* *3* *79.75*
24 256 4 79.81
20 256 0 79.83
32 1024 2 100.93
32 1024 3 100.95
32 1024 0 101.25
32 1024 4 101.28
32 1024 1 101.47
24 1024 2 105.52
24 1024 0 105.62
24 1024 4 106.00
24 1024 3 106.39
24 1024 1 106.53
20 1024 1 111.51
20 1024 2 111.61
20 1024 3 111.69
20 1024 0 111.74
20 1024 4 111.84
The following combinations fail due to insufficient resources:
/42/ /1024/ /2/ /failed/
/63/ /1024/ /4/ /failed/
/42/ /1024/ /4/ /failed/
/63/ /1024/ /1/ /failed/
/63/ /1024/ /2/ /failed/
/42/ /1024/ /0/ /failed/
/42/ /1024/ /1/ /failed/
/63/ /1024/ /0/ /failed/
/42/ /1024/ /3/ /failed/
/63/ /1024/ /3/ /failed/
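A likely explanation is the register file: on compute capability 2.0 a
block must fit within the 32768 registers of one multiprocessor. The
check below assumes "warp size" means threads per block and uses the plain
product of registers and threads, ignoring allocation granularity.

```python
REGS_PER_SM = 32768  # 32-bit registers per multiprocessor on compute capability 2.0


def failing_combinations(register_counts, block_sizes):
    """Return (registers, threads per block) pairs whose single block exceeds the register file."""
    return [
        (regs, threads)
        for regs in register_counts
        for threads in block_sizes
        if regs * threads > REGS_PER_SM
    ]


print(failing_combinations([20, 24, 32, 42, 63], [64, 256, 1024]))
# [(42, 1024), (63, 1024)]
```

32 registers x 1024 threads is exactly 32768 and still fits, while 42 x
1024 = 43008 and 63 x 1024 = 64512 do not, which matches the list above.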
I am going to re-run the analysis with even smaller warp sizes to see
what happens.
-Doug