[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

Sat Dec 20 11:57:19 CET 2008

Timothy Baldridge wrote:
>> I'm not sure how you'd avoid cache misses though... we simply have to deal
>> with too much data.  About the only thing I can think of is sorting
>> faces/strands (I actually do this in my DSM branch) per tile and using a
>> more optimal render order than simply going over the scanlines.  The ray
>> tracing traversal could be made more efficient, but optimizing what the
>> renderer does in between could be more difficult.
>> You know, I think the CodeAnalyst profiling tool from AMD can measure cache
>> misses; I'll have to try and figure out how it works.
>>     
>
> You cannot avoid all cache misses, but it is possible to avoid many
> of them. Modern CPUs load cache lines in 64-byte segments. This
> means that if you read one byte from memory the CPU really loads
> 64 bytes. Thus, if you can arrange data in such a way that it can be
> read and processed sequentially, the performance will be greatly
> enhanced.
>
> I wish I could find it, but there is an excellent video on YouTube
> from a Google Tech Talk. In the talk the speaker explains these
> caches, and goes on to show that reading items from a linked list or
> vector can be (IIRC) up to an order of magnitude slower than reading
> items from an array. That is, if the entire set does not lie in memory.
> This is due to the fact that linked lists require a lot of jumping
> around in memory, which causes the cache to become less useful.
>
>   
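
To put a number on the 64-byte cache line argument quoted above, here is a
minimal sketch that only counts how many distinct cache lines a sequence of
reads touches. It is a model, not a measurement: the 16-byte element size, the
4096-byte spacing of the "scattered" nodes and the function name
cache_lines_touched are all assumptions chosen for illustration.

```python
CACHE_LINE_BYTES = 64  # line size mentioned in the quoted text


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Count the distinct cache lines hit by reads starting at these addresses.

    Only the starting address of each read is considered, which is exact
    here because every element is aligned and smaller than a line.
    """
    return len({addr // line_bytes for addr in addresses})


ELEMENT_BYTES = 16   # assumed element size (e.g. four floats)
N_ELEMENTS = 1024

# Contiguous array: element i starts at i * ELEMENT_BYTES.
sequential = [i * ELEMENT_BYTES for i in range(N_ELEMENTS)]

# Hypothetical worst case for a linked list: every node lives 4096 bytes
# away from the previous one, so no two nodes share a cache line.
scattered = [i * 4096 for i in range(N_ELEMENTS)]

lines_sequential = cache_lines_touched(sequential)
lines_scattered = cache_lines_touched(scattered)

print("sequential:", lines_sequential, "lines for", N_ELEMENTS, "elements")
print("scattered: ", lines_scattered, "lines for", N_ELEMENTS, "elements")
```

Under these assumptions the contiguous layout needs a quarter of the cache
line loads for the same number of elements. A real run also depends on
prefetching and on whether the data fits in the cache at all.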

Well, AFAIK the compiler also mitigates many of these things.
Coming back to the cache problems, consider also that most modern
multicore CPUs share the L2 or the L3 cache between cores (so even if
you have NUMA, e.g. in a dual-socket machine, you have two or four
cores accessing the same bank of memory rather than each its own).
IMHO those problems matter even more when parallelizing across
multiple threads.
It's worthwhile to read the Moscibroda paper (and also to run the
rdarray.c and stream.c tests from the article yourself). You'll find
some interesting things, such as that most modern multicore CPUs,
which are claimed to behave like independent CPUs, are in most cases
just vector units, which perform well only if used in such a way... ;-)

IMHO most of the boost that CUDA or OpenCL implementations apparently
give does not come from the parallelization... but from the fact that
you have a much more powerful, otherwise invisible "CPU" (the GPU) for
certain complex high-level tasks (not *every* task, otherwise we would
just replace our CPU with a GPU and be done) compared to the standard
general-purpose CPU. The important thing is to identify those tasks
(indeed you also have parallelization on CUDA, such as in SLI or Tesla
configurations, but that's another story).

Regarding the functions I cited, we could for instance sort them by
number of calls. In that case we would have:

 %time  cumul.s  self.s     calls  self/call  total/call  name
  7.85    73.40    7.58  87591281       0.00        0.00  vlr_check_intersect
 20.03    57.26   19.34  40151652       0.00        0.00  testnode
  4.70    82.72    4.54  36850240       0.00        0.00  calc_ocval_ray
  1.77    90.77    1.71  26767830       0.00        0.00  vlr_face_coords
  4.67    87.23    4.51  25437217       0.00        0.00  Normalize_d
  8.87    65.82    8.56  24297509       0.00        0.00  RE_ray_face_intersection
  0.41    92.05    0.40  20899588       0.00        0.00  vlr_get_transform
 39.28    37.92   37.92  10317691       0.00        0.00  RE_ray_tree_intersect_check
  0.17    94.39    0.16  10077327       0.00        0.00  RE_ray_tree_intersect
  0.28    93.20    0.27   3516407       0.00        0.00  Mat3MulVecfl
  0.04    96.04    0.04    703745       0.00        0.00  RE_vertren_get_rad
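
The re-sorting above can be scripted. The following is a minimal sketch,
assuming the profile rows have the seven whitespace-separated columns shown
(percent time, cumulative seconds, self seconds, calls, two per-call columns,
name). The names ProfileRow, parse_flat_profile and self_us_per_call are
hypothetical helpers, not part of any profiler; the three sample rows are
copied from the listing above.

```python
from typing import List, NamedTuple


class ProfileRow(NamedTuple):
    pct_time: float
    cumulative_s: float
    self_s: float
    calls: int
    name: str


def parse_flat_profile(text: str) -> List[ProfileRow]:
    """Parse rows with 7 columns; silently skip anything else.

    Header lines and rows without a call count do not parse as numbers
    or have fewer fields, so they are ignored.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue
        try:
            rows.append(ProfileRow(float(fields[0]), float(fields[1]),
                                   float(fields[2]), int(fields[3]),
                                   fields[6]))
        except ValueError:
            continue
    return rows


def self_us_per_call(row: ProfileRow) -> float:
    """Self time per call in microseconds (the 0.00 columns hide this)."""
    return row.self_s / row.calls * 1e6


SAMPLE = """\
 39.28     37.92    37.92 10317691     0.00     0.00  RE_ray_tree_intersect_check
 20.03     57.26    19.34 40151652     0.00     0.00  testnode
  7.85     73.40     7.58 87591281     0.00     0.00  vlr_check_intersect
"""

rows = parse_flat_profile(SAMPLE)
by_calls = sorted(rows, key=lambda r: r.calls, reverse=True)
by_self = sorted(rows, key=lambda r: r.self_s, reverse=True)

for r in by_calls:
    print(f"{r.calls:>10d} calls  {self_us_per_call(r):7.3f} us/call  {r.name}")
```

Sorting by calls and by self time gives different rankings, and the per-call
figure shows which functions are individually expensive rather than merely
called often.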

Of course it is not very important to optimize (or replace), for
instance, a function which is called only ONCE and whose execution
time is 0.01% of the total. Here, for instance, vlr_check_intersect()
is called 87 million times and testnode() 40 million. That should
already take into account the number of reciprocal (or recursive)
calls.
You may also do some speculation and "forecasting" about what happens
if such a function could be made 10 times or 100 times faster. Looking
at the same functions (or at the functions that call them most of the
time), you can also see whether they can be CUDA-ized (or CL-ized) or
not. Of course the extreme is to "rewrite" a new independent
CUDA|OpenCL (realtime) engine. But that's in an "ideal world"...
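
The "forecasting" can be done with Amdahl's law. This is a minimal sketch,
assuming the self-time percentage from the profile is the fraction of the run
that gets accelerated and ignoring any cost of moving data to and from the
GPU. The function name overall_speedup is only illustrative.

```python
def overall_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the run
    time (0..1) is made `factor` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Self-time share of RE_ray_tree_intersect_check in the profile: 39.28%.
TREE_CHECK_FRACTION = 0.3928

speedup_10x = overall_speedup(TREE_CHECK_FRACTION, 10.0)
speedup_100x = overall_speedup(TREE_CHECK_FRACTION, 100.0)
# Upper bound: the function costs nothing at all.
speedup_limit = 1.0 / (1.0 - TREE_CHECK_FRACTION)

print(f"10x faster  -> {speedup_10x:.2f}x overall")
print(f"100x faster -> {speedup_100x:.2f}x overall")
print(f"limit       -> {speedup_limit:.2f}x overall")
```

With that one function 10 times faster the whole render gains about 1.55x,
and going to 100 times faster adds little (about 1.64x, against a limit of
about 1.65x). So the choice of which functions to move matters more than how
much faster each one becomes.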

Apart from this, rendering is probably not the only task where
CUDA|OpenCL could apparently give a boost... maybe the sequencer could
also take a lot of advantage of it, and it could be an easier task
(Peter?). I also remember that it should be possible to access the
H.264 codec of some NVIDIA cards (VDPAU, see
ftp://download.nvidia.com/XFree86/vdpau/mplayer-vdpau-3076399.README.txt).
Of course this would make things dependent on the hardware (and in
this case on a particular brand) and not fully OSS, but on the other
hand we always have the software fallback.

Bye
Giuseppe.