[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

Fri Dec 19 21:09:32 CET 2008

> I'm not sure how you'd avoid cache misses though. . .we simply have to deal
> with too much data.  About the only thing I can think of is sorting
> faces/strands (I actually do this in my DSM branch) per tile and using a
> more optimal render order then simply going over the scanlines.  The ray
> tracing traversal could be made more efficient, but optimizing what the
> renderer does between could be more difficult.
> You know I think the CodeAnalyst profiling tool from AMD can measure cache
> misses, I'll have to try and figure out how it works.

You cannot avoid all cache misses, but it is possible to avoid many
of them. Modern CPUs load memory in cache lines, typically 64 bytes
each. This means that if you read one byte from memory, the CPU
really loads 64 bytes. Thus, if you can arrange data in such a way
that it can be read and processed sequentially, the performance will
be greatly enhanced.
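
To make that concrete, here is a minimal sketch in C. The function
names and the row-by-row float grid are made up for illustration and
are not taken from the Blender source. Both functions compute the
same sum; only the order in which memory is touched differs.

```c
#include <stddef.h>

/* Hypothetical example: sum a width*height grid stored row by row.
 * The inner loop walks adjacent addresses, so every 64-byte cache
 * line that gets loaded is fully used before the next one is needed. */
float sum_sequential(const float *grid, size_t width, size_t height)
{
    float total = 0.0f;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            total += grid[y * width + x];
        }
    }
    return total;
}

/* Same sum, but the inner loop jumps width floats at a time.
 * On a wide grid each access lands in a different cache line, so
 * most of every loaded line is wasted before it is evicted. */
float sum_strided(const float *grid, size_t width, size_t height)
{
    float total = 0.0f;
    for (size_t x = 0; x < width; x++) {
        for (size_t y = 0; y < height; y++) {
            total += grid[y * width + x];
        }
    }
    return total;
}
```

On a grid much larger than the cache I would expect the strided
version to be noticeably slower, but how much depends on the CPU and
the grid size, so it is worth timing on the real data.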

I wish I could find it, but there is an excellent Google Tech Talk
video on YouTube. In the talk the speaker explains these caches, and
goes on to show that reading items from a linked list or vector can
be (IIRC) up to an order of magnitude slower than reading items from
an array. That is, if the entire set does not fit in the cache. This
is due to the fact that linked lists require a lot of jumping around
in memory, which causes the cache to become less useful.
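
As a rough sketch of the difference (again in C, with a made-up node
type and function names, not code from the talk or from Blender),
compare the two traversals below. The array version reads
consecutive addresses. The list version has to load each node before
it knows where the next one lives, and the nodes can be scattered
anywhere on the heap.

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

/* Array traversal: consecutive elements share cache lines, and the
 * access pattern is predictable for the hardware prefetcher. */
long sum_array(const int *items, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        total += items[i];
    }
    return total;
}

/* List traversal: each step follows a pointer, so the address of the
 * next read is unknown until the current node has arrived from memory. */
long sum_list(const struct node *head)
{
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        total += n->value;
    }
    return total;
}
```

Both functions return the same total for the same values; any speed
difference comes purely from the memory layout.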

Timothy