[Bf-committers] "Official" CUDA Benchmark/Implementation Thread

joe joeedh at gmail.com
Sat Dec 20 12:45:21 CET 2008


Ah, which one of Moscibroda's papers is that?
Joe

On Sat, Dec 20, 2008 at 3:57 AM, Giuseppe Ghibò <ghibo at mandriva.com> wrote:

> Timothy Baldridge wrote:
> >> I'm not sure how you'd avoid cache misses though... we simply have to
> >> deal with too much data.  About the only thing I can think of is sorting
> >> faces/strands per tile (I actually do this in my DSM branch) and using a
> >> more optimal render order than simply going over the scanlines.  The ray
> >> tracing traversal could be made more efficient, but optimizing what the
> >> renderer does in between could be more difficult.
> >> You know, I think the CodeAnalyst profiling tool from AMD can measure
> >> cache misses; I'll have to try to figure out how it works.
> >>
> >
> > You cannot avoid all cache misses, but it is possible to avoid many of
> > them. Modern CPUs load cache lines in 64-byte segments. This means
> > that if you read one byte from memory, the CPU really loads 64 bytes.
> > Thus, if you can arrange data in such a way that it can be read and
> > processed sequentially, performance will be greatly enhanced.
> >
> > I wish I could find it, but there is an excellent video on YouTube
> > from a Google Tech Talk. In the talk the speaker explains these
> > caches, and goes on to show that reading items from a linked list
> > can be (IIRC) up to an order of magnitude slower than reading items
> > from an array, that is, if the entire set does not fit in cache.
> > This is due to the fact that linked lists require a lot of jumping
> > around in memory, which makes the cache much less useful.
> >
> >
>
> Well, AFAIK many such problems are also reduced by the compiler.
> Back to cache problems: consider also that most modern multicore CPUs
> share the L2 or L3 cache (so even with NUMA, e.g. in a dual socket,
> you have two or four cores accessing the same bank of memory rather
> than each its own). But IMHO those problems become even more
> important when parallelizing across multiple threads.
> It's worthwhile to read the Moscibroda paper (and also to try running
> the rdarray.c and stream.c tests from the article yourself). You'll
> find some interesting things, such as that most modern multicore
> CPUs, which are claimed to be independent CPUs, are in most cases
> just vector units, which perform well only if used that way... ;-)
>
> IMHO most of the boost that CUDA or OpenCL implementations apparently
> could give does not come from the parallelization, but from the fact
> that you have a much more powerful invisible "CPU" (the GPU) for
> certain complex high-level tasks (not *every* task, otherwise we
> could just replace our CPUs with GPUs and be done) versus the
> standard general-purpose CPU. The important thing is to identify
> those tasks (indeed you also have parallelization on the CUDA side,
> such as in SLI or Tesla configurations, but that's another story).
>
> Regarding the functions I cited, we could for instance sort them by
> number of calls. In that case we would have:
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  7.85     73.40      7.58 87591281     0.00     0.00  vlr_check_intersect
> 20.03     57.26     19.34 40151652     0.00     0.00  testnode
>  4.70     82.72      4.54 36850240     0.00     0.00  calc_ocval_ray
>  1.77     90.77      1.71 26767830     0.00     0.00  vlr_face_coords
>  4.67     87.23      4.51 25437217     0.00     0.00  Normalize_d
>  8.87     65.82      8.56 24297509     0.00     0.00  RE_ray_face_intersection
>  0.41     92.05      0.40 20899588     0.00     0.00  vlr_get_transform
> 39.28     37.92     37.92 10317691     0.00     0.00  RE_ray_tree_intersect_check
>  0.17     94.39      0.16 10077327     0.00     0.00  RE_ray_tree_intersect
>  0.28     93.20      0.27  3516407     0.00     0.00  Mat3MulVecfl
>  0.04     96.04      0.04   703745     0.00     0.00  RE_vertren_get_rad
>
> Of course it is not very important to optimize (or replace) a
> function which is called only ONCE and whose execution time is 0.01%
> of the total. So, for instance, we see that vlr_check_intersect() is
> called 87 million times and testnode() 40 million times. That should
> already take into account reciprocal (or recursive) calls.
> You may also do some speculation and "forecasting" about what would
> happen if a given function could be made 10 times or 100 times
> faster. And by looking at those same functions (or the functions they
> call most of the time) you can see whether they can be CUDA-ized (or
> CL-ized) or not. Of course the extreme would be to rewrite a new
> independent CUDA|OpenCL (realtime) engine. But that is in an "ideal
> world"...
>
> Apart from this, the renderer is probably not the only task where
> CUDA|OpenCL (if adopted) could give a boost... maybe the sequencer
> could also take a lot of advantage, and it could be an easier task
> (Peter?). I remember also that it should actually be possible to
> access the H264 codec of some NVIDIA cards (VDPAU, see
> ftp://download.nvidia.com/XFree86/vdpau/mplayer-vdpau-3076399.README.txt).
> Of course this would make things dependent on the hardware (and in
> this case on a particular brand) and not fully OSS, but on the other
> hand we always have the software fallback.
>
> Bye
> Giuseppe.
>
> _______________________________________________
> Bf-committers mailing list
> Bf-committers at blender.org
> http://lists.blender.org/mailman/listinfo/bf-committers
>