Ah, which one of Moshibroda's papers is that?<div><br></div><div>Joe<br><br><div class="gmail_quote">On Sat, Dec 20, 2008 at 3:57 AM, Giuseppe Ghibò <span dir="ltr"><<a href="mailto:ghibo@mandriva.com">ghibo@mandriva.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="Ih2E3d">Timothy Baldridge wrote:<br>
>> I'm not sure how you'd avoid cache misses though. . .we simply have to deal<br>
>> with too much data. About the only thing I can think of is sorting<br>
>> faces/strands (I actually do this in my DSM branch) per tile and using a<br>
>> more optimal render order than simply going over the scanlines. The ray<br>
>> tracing traversal could be made more efficient, but optimizing what the<br>
>> renderer does between could be more difficult.<br>
>> You know I think the CodeAnalyst profiling tool from AMD can measure cache<br>
>> misses, I'll have to try and figure out how it works.<br>
>><br>
><br>
> You cannot avoid all cache misses, but it is possible to avoid many<br>
> cache misses. Modern CPUs load cache lines in 64-byte segments. This<br>
> means that if you read one byte from memory, the CPU really loads<br>
> 64 bytes. Thus, if you can arrange data in such a way that it can be<br>
> read and processed as sequential data, the performance will be greatly<br>
> enhanced.<br>
><br>
> I wish I could find it, but there is an excellent video on YouTube<br>
> from a Google Tech Talk. In the talk the speaker explains these<br>
> caches, and goes on to show that reading items from a linked list<br>
> can be (IIRC) up to an order of magnitude slower than reading<br>
> items from an array or vector. That is, if the entire set does not fit in cache.<br>
> This is due to the fact that linked lists require a lot of jumping<br>
> around in memory, which causes the cache to become less useful.<br>
><br>
><br>
<br>
</div>Well, AFAIK many of these problems are also mitigated by the compiler.<br>
Coming back to cache problems,<br>
consider also that most modern multicore CPUs share even the L2 or<br>
the L3 cache (so even if you<br>
have NUMA, e.g. in a dual-socket system, you have two or four cores accessing<br>
the same bank of memory rather than each its own).<br>
But IMHO those problems are even more important in the case of<br>
parallelization across multiple threads.<br>
It's worthwhile to read the Moshibroda paper (and also to try<br>
running the rdarray.c and stream.c tests from the article yourself). You'll find<br>
some interesting things, such as that most modern multicore CPUs,<br>
which are claimed to be independent CPUs,<br>
are in most cases just vector units, which perform well only if used<br>
in such a way... ;-)<br>
<br>
IMHO most of the boost that CUDA or OpenCL implementations apparently<br>
could give does not come from the parallelization..., but<br>
from the fact that you have a much more powerful invisible "CPU" (the GPU)<br>
for certain complex high-level tasks (not *every* task, otherwise we would<br>
just replace our CPU with a GPU and be done) versus the standard<br>
general-purpose CPU. The important thing is to identify those tasks<br>
(indeed you also have parallelization on CUDA, such as in SLI or<br>
Tesla configurations, but that's another story).<br>
<br>
Regarding the functions I cited, we could for instance sort them by<br>
number of calls. In that case we would have:<br>
<br>
  %   cumulative    self               self    total<br>
 time    seconds  seconds      calls ms/call ms/call  name<br>
 7.85      73.40     7.58   87591281    0.00    0.00  vlr_check_intersect<br>
20.03      57.26    19.34   40151652    0.00    0.00  testnode<br>
 4.70      82.72     4.54   36850240    0.00    0.00  calc_ocval_ray<br>
 1.77      90.77     1.71   26767830    0.00    0.00  vlr_face_coords<br>
 4.67      87.23     4.51   25437217    0.00    0.00  Normalize_d<br>
 8.87      65.82     8.56   24297509    0.00    0.00  RE_ray_face_intersection<br>
 0.41      92.05     0.40   20899588    0.00    0.00  vlr_get_transform<br>
39.28      37.92    37.92   10317691    0.00    0.00  RE_ray_tree_intersect_check<br>
 0.17      94.39     0.16   10077327    0.00    0.00  RE_ray_tree_intersect<br>
 0.28      93.20     0.27    3516407    0.00    0.00  Mat3MulVecfl<br>
 0.04      96.04     0.04     703745    0.00    0.00  RE_vertren_get_rad<br>
<br>
Of course it is not very important to optimize (or replace), for instance, a<br>
function which is called only ONCE and whose<br>
execution time is 0.01% of the total. So, for instance, we have<br>
vlr_check_intersect() called 87 million times and<br>
testnode() 40 million times. That should already take into account<br>
reciprocal (or recursive) calls.<br>
You may also do some speculation and "forecasting" about what would happen if<br>
a given function could be made<br>
10 times or 100 times faster. But also, by looking at those same<br>
functions (or the functions they call most of the<br>
time), you can see whether they can be CUDA-ized (or CL-ized) or not. Of<br>
course the extreme would be to "rewrite" a new<br>
independent CUDA|OpenCL (realtime) engine. But that is for an "ideal world"...<br>
<br>
Apart from this, rendering is probably not the only task where apparently<br>
CUDA|OpenCL (if adopted) could give a boost...; maybe the<br>
sequencer could also take a lot of advantage, and it could be an easier<br>
task (Peter?). I also remember that<br>
there should actually be the possibility to access the H.264 codec of<br>
some NVIDIA cards (VDPAU, see<br>
<a href="ftp://download.nvidia.com/XFree86/vdpau/mplayer-vdpau-3076399.README.txt" target="_blank">ftp://download.nvidia.com/XFree86/vdpau/mplayer-vdpau-3076399.README.txt</a>).<br>
Of course this would make things<br>
dependent on the hardware (and in this case on a particular brand) and not<br>
fully OSS, but on the other hand we always have the<br>
software fallback.<br>
<br>
Bye<br>
<font color="#888888">Giuseppe.<br>
</font><div><div></div><div class="Wj3C7c"><br>
_______________________________________________<br>
Bf-committers mailing list<br>
<a href="mailto:Bf-committers@blender.org">Bf-committers@blender.org</a><br>
<a href="http://lists.blender.org/mailman/listinfo/bf-committers" target="_blank">http://lists.blender.org/mailman/listinfo/bf-committers</a><br>
</div></div></blockquote></div><br></div>