[Bf-committers] "Official" CUDA Benchmark/Implementation Thread
joeedh at gmail.com
Sat Dec 20 12:45:21 CET 2008
Ah, which one of Moshibroda's papers is that?
On Sat, Dec 20, 2008 at 3:57 AM, Giuseppe Ghibò <ghibo at mandriva.com> wrote:
> Timothy Baldridge wrote:
> >> I'm not sure how you'd avoid cache misses though... we simply have to
> >> work with too much data. About the only thing I can think of is sorting
> >> faces/strands per tile (I actually do this in my DSM branch) and using a
> >> more optimal render order than simply going over the scanlines. The ray
> >> tracing traversal could be made more efficient, but optimizing what the
> >> renderer does in between could be more difficult.
> >> You know, I think the CodeAnalyst profiling tool from AMD can measure
> >> cache misses; I'll have to try to figure out how it works.
> > You cannot avoid all cache misses, but it is possible to avoid many
> > of them. Modern CPUs load cache lines in 64-byte segments. This
> > means that if you read one byte from memory, the CPU really loads
> > 64 bytes. Thus, if you can arrange data in such a way that it can be
> > read and processed sequentially, performance will be greatly
> > enhanced.
> > I wish I could find it, but there is an excellent video on YouTube
> > from a Google Tech Talk. In the talk the speaker explains these
> > caches, and goes on to show that reading items from a linked list or
> > vector can be (IIRC) up to an order of magnitude slower than reading
> > items from an array, at least when the entire data set does not fit
> > in the cache. This is because linked lists require a lot of jumping
> > around in memory, which makes the cache much less useful.
> Well, AFAIK many such issues are also mitigated by the compiler.
> Getting back to cache problems, consider also that most modern
> multicore CPUs share the L2 or L3 cache (so even with NUMA, e.g. in a
> dual-socket system, you have two or four cores accessing the same bank
> of memory rather than each its own). But IMHO those problems are even
> more important in the case of parallelization across multiple threads.
> It's worthwhile to read the Moshibroda paper (and also to try running
> the rdarray.c and stream.c tests from the article yourself). You'll
> find some interesting things, such as that most modern multicore CPUs,
> which are claimed to be independent CPUs, are in most cases just
> vector units, which perform well only if used that way... ;-)
> IMHO most of the boost that CUDA or OpenCL implementations could
> apparently give does not come from the parallelization..., but from
> the fact that you have a much more powerful invisible "CPU" (the GPU)
> for certain complex high-level tasks (not *every* task, otherwise we
> would simply replace our CPUs with GPUs and be done) versus a standard
> general-purpose CPU. The important thing is to identify those tasks
> (indeed you also have parallelization on CUDA, such as in SLI or
> Tesla configurations, but that's another story).
> Regarding the functions I cited, we could for instance sort them by
> number of calls. In that case we would have:
> %time  cumul.(s)  self(s)     calls  self ms/call  total ms/call  name
>  7.85      73.40     7.58  87591281          0.00           0.00  vlr_check_intersect
> 20.03      57.26    19.34  40151652          0.00           0.00  testnode
>  4.70      82.72     4.54  36850240          0.00           0.00  calc_ocval_ray
>  1.77      90.77     1.71  26767830          0.00           0.00  vlr_face_coords
>  4.67      87.23     4.51  25437217          0.00           0.00  Normalize_d
>  8.87      65.82     8.56  24297509          0.00           0.00
>  0.41      92.05     0.40  20899588          0.00           0.00  vlr_get_transform
> 39.28      37.92    37.92  10317691          0.00           0.00
>  0.17      94.39     0.16  10077327          0.00           0.00  RE_ray_tree_intersect
>  0.28      93.20     0.27   3516407          0.00           0.00  Mat3MulVecfl
>  0.04      96.04     0.04    703745          0.00           0.00  RE_vertren_get_rad
> Of course it is not very important to optimize (or replace), for
> instance, a function which is called only ONCE and whose execution
> time is 0.01% of the total. So here, for instance, vlr_check_intersect()
> is called 87 million times and testnode() 40 million times. That should
> also already take into account the number of reciprocal (or recursive)
> calls.
> You may also do some speculation and "forecasting" about what happens
> if a given function could be made 10 times or 100 times faster. Also,
> by looking at those same functions (or the functions they call most of
> the time) you can see whether they can be CUDA-ized (or CL-ized) or
> not. Of course the extreme would be to "rewrite" a new independent
> CUDA/OpenCL (realtime) engine. But that is for an "ideal world"...
> Apart from this, rendering is probably not the only task where
> CUDA/OpenCL (if anything) can give a boost...; maybe the sequencer
> could also take a lot of advantage, and it could be an easier task
> (Peter?). I remember also that there should actually be a way to
> access the H.264 codec of some NVIDIA cards (VDPAU). Of course this
> would lead to things dependent on the hardware (and in this case on a
> particular brand) and not fully OSS, but on the other hand we always
> have the software fallback.