[Bf-committers] CUDA performance analysis (with my experimental changes)

Wed Jun 5 12:24:14 CEST 2013

I modified the CUDA renderer to accept environment variable overrides
so that a performance test can iterate through several combinations of
parameters.
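
A minimal sketch of how such an override could be read, not the actual patch. The helper name env_int_or and the variable name PERF_TEST_MAXREG used in the note below are hypothetical; the real variable names are not given in this post.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper: returns the integer stored in environment variable
// `name`, or `fallback` if the variable is unset, empty, out of int range,
// or not a complete base-10 integer.
int env_int_or(const char *name, int fallback)
{
    const char *text = std::getenv(name);
    if (text == nullptr || *text == '\0')
        return fallback;

    char *end = nullptr;
    errno = 0;
    long value = std::strtol(text, &end, 10);

    // Reject trailing garbage ("63abc") and values that do not fit in an int.
    if (*end != '\0' || errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return fallback;

    return static_cast<int>(value);
}
```

A test script could then export, say, PERF_TEST_MAXREG=63 before each run, while the renderer calls env_int_or("PERF_TEST_MAXREG", built_in_default) and behaves as before when nothing is set.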

The code includes many performance optimizations that I have made:
eliminating redundant memory allocations, making all operations
asynchronous (all memory copies and kernel launches), using two streams
fed from a single thread (interleaved 1:1 between them), each processing
a separate tile, and using a RenderBuffers pool to eliminate those
allocations and frees.
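
The pooling idea can be sketched as below. This is a generic free-list pool under stated assumptions, not the code from the patch: TileBuffer is a hypothetical stand-in for Blender's RenderBuffers, and BufferPool, acquire, release and allocations are illustrative names.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-tile render buffer.
struct TileBuffer {
    std::vector<float> pixels;
    explicit TileBuffer(std::size_t count) : pixels(count) {}
};

// Hands out buffers of one fixed size and keeps released ones for reuse,
// so steady-state rendering performs no further allocations or frees.
class BufferPool {
public:
    explicit BufferPool(std::size_t floats_per_buffer)
        : size_(floats_per_buffer), allocations_(0) {}

    std::unique_ptr<TileBuffer> acquire()
    {
        if (free_.empty()) {
            // Nothing to reuse: this is the only place that allocates.
            ++allocations_;
            return std::unique_ptr<TileBuffer>(new TileBuffer(size_));
        }
        std::unique_ptr<TileBuffer> buffer = std::move(free_.back());
        free_.pop_back();
        return buffer;
    }

    // Returns a buffer to the pool instead of freeing it.
    void release(std::unique_ptr<TileBuffer> buffer)
    {
        free_.push_back(std::move(buffer));
    }

    // Number of real allocations performed so far.
    std::size_t allocations() const { return allocations_; }

private:
    std::size_t size_;
    std::size_t allocations_;
    std::vector<std::unique_ptr<TileBuffer>> free_;
};
```

With two streams each working on one tile, a pool like this would settle at two live buffers after the first pair of tiles, and every later tile would reuse one of them.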

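Surprisingly, the largest register count combined with the smallest
"warp size" (here meaning threads per block, since the hardware warp is
fixed at 32 threads) gave the best time, even though that configuration
has only 33% occupancy.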
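
The 33% figure can be reproduced with a simplified occupancy model. The sketch below assumes the compute capability 2.0 per-multiprocessor limits (32768 registers, 48 resident warps, 8 resident blocks), assumes registers are allocated per warp in units of 64, treats the "warp size" value as threads per block, and ignores shared memory. The function names are illustrative.

```cpp
#include <algorithm>

// Assumed per-multiprocessor limits for compute capability 2.0.
const int kRegistersPerSM = 32768;
const int kMaxWarpsPerSM = 48;
const int kMaxBlocksPerSM = 8;
const int kThreadsPerWarp = 32;
const int kRegisterAllocUnit = 64; // assumed per-warp allocation granularity

// Number of warps resident on one multiprocessor under this model.
// A result of 0 means that not even one block fits.
int resident_warps(int regs_per_thread, int threads_per_block)
{
    int warps_per_block = (threads_per_block + kThreadsPerWarp - 1) / kThreadsPerWarp;

    // Round the per-warp register need up to the allocation unit.
    int regs_per_warp = regs_per_thread * kThreadsPerWarp;
    regs_per_warp = ((regs_per_warp + kRegisterAllocUnit - 1) / kRegisterAllocUnit)
                    * kRegisterAllocUnit;

    int blocks_by_regs = (kRegistersPerSM / regs_per_warp) / warps_per_block;
    int blocks_by_warps = kMaxWarpsPerSM / warps_per_block;
    int blocks = std::min(kMaxBlocksPerSM, std::min(blocks_by_regs, blocks_by_warps));

    return blocks * warps_per_block;
}

// Occupancy as a fraction of the maximum resident warps.
double occupancy(int regs_per_thread, int threads_per_block)
{
    return static_cast<double>(resident_warps(regs_per_thread, threads_per_block))
           / kMaxWarpsPerSM;
}
```

For 63 registers and 64 threads per block, the model gives 8 resident blocks of 2 warps each, that is 16 of 48 warps, which matches the 33% quoted above.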

Tested on SM 2.0 hardware (a single GTX 580), rendering from the command
line. The display driver is nvidia-experimental-310. The scene is the
BMW scene at the default size (1280x600) with percentage at 75% (result
is 960x450) and the default tile size of 64x64. Using the CUDA 5.0
toolkit on Linux Mint 14 MATE 64-bit. The Linux kernel is
3.5.0-28-generic. The CPU is a 6-core Core i7-990X Extreme Edition with
12MB L3. 20GB of 1066 RAM in triple-channel configuration. Bus interface
is PCIe x16.

*registers* 	*warp size* 	*optimization* 	*time (s)*
63 	64 	3 	70.80
63 	64 	2 	70.81
63 	64 	1 	70.87
63 	64 	0 	70.96
63 	64 	4 	71.03
42 	64 	3 	72.10
42 	64 	0 	72.17
42 	64 	4 	72.21
32 	64 	0 	72.44
42 	64 	2 	72.47
32 	64 	4 	72.49
63 	256 	4 	72.61
63 	256 	3 	72.65
63 	256 	1 	72.70
32 	64 	2 	72.77
42 	64 	1 	72.78
32 	64 	1 	72.82
63 	256 	2 	73.09
63 	256 	0 	73.27
32 	64 	3 	73.55
42 	256 	3 	74.16
42 	256 	0 	74.26
42 	256 	2 	74.34
42 	256 	4 	74.39
42 	256 	1 	74.42
32 	256 	2 	74.57
32 	256 	4 	74.57
32 	256 	0 	74.60
24 	64 	3 	74.62
32 	256 	3 	74.70
24 	64 	1 	74.75
24 	64 	2 	75.09
32 	256 	1 	75.15
24 	64 	4 	75.30
24 	64 	0 	75.39
24 	256 	3 	76.90
20 	64 	1 	76.92
20 	64 	2 	77.12
20 	64 	0 	77.14
20 	64 	3 	77.14
24 	256 	1 	77.42
24 	256 	2 	77.64
20 	64 	4 	77.65
24 	256 	0 	78.31
20 	256 	1 	79.21
20 	256 	4 	79.23
20 	256 	2 	79.43
*20* 	*256* 	*3* 	*79.75*
24 	256 	4 	79.81
20 	256 	0 	79.83
32 	1024 	2 	100.93
32 	1024 	3 	100.95
32 	1024 	0 	101.25
32 	1024 	4 	101.28
32 	1024 	1 	101.47
24 	1024 	2 	105.52
24 	1024 	0 	105.62
24 	1024 	4 	106.00
24 	1024 	3 	106.39
24 	1024 	1 	106.53
20 	1024 	1 	111.51
20 	1024 	2 	111.61
20 	1024 	3 	111.69
20 	1024 	0 	111.74
20 	1024 	4 	111.84

The following combinations fail due to insufficient resources:

/42/ 	/1024/ 	/2/ 	/failed/
/63/ 	/1024/ 	/4/ 	/failed/
/42/ 	/1024/ 	/4/ 	/failed/
/63/ 	/1024/ 	/1/ 	/failed/
/63/ 	/1024/ 	/2/ 	/failed/
/42/ 	/1024/ 	/0/ 	/failed/
/42/ 	/1024/ 	/1/ 	/failed/
/63/ 	/1024/ 	/0/ 	/failed/
/42/ 	/1024/ 	/3/ 	/failed/
/63/ 	/1024/ 	/3/ 	/failed/
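
One explanation consistent with these failures, offered as an assumption and not something measured here: a block cannot launch if its threads together need more registers than one multiprocessor has. The sketch below uses the 32768-register file of compute capability 2.0, treats the "warp size" value as threads per block, and ignores allocation rounding; the function name is illustrative.

```cpp
// Assumed register file size of one compute capability 2.0 multiprocessor.
const long kRegisterFileSize = 32768;

// True if a single block's register demand fits on one multiprocessor.
bool block_fits_register_file(int regs_per_thread, int threads_per_block)
{
    long needed = static_cast<long>(regs_per_thread) * threads_per_block;
    return needed <= kRegisterFileSize;
}
```

Under this model 32 x 1024 = 32768 registers just fits, while 42 x 1024 = 43008 and 63 x 1024 = 64512 do not, which lines up with the rows that ran and the rows that failed.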

I am going to re-run the analysis with even smaller warp sizes to see 
what happens.

-Doug