[Soc-2014-dev] Weekly Report #1 Cycles

Fri May 23 13:14:20 CEST 2014

Hey everyone,
here is my report for the first week on Cycles optimizations:
http://wiki.blender.org/index.php/User:DingTo/GSoC_2014/Weekly_Reports/Week1

Best regards,
Thomas

== Pre work ==
I started my GSoC work early, so I have already made progress on some 
of my goals.
* Calculate face normal on the fly: Instead of storing the face normal, 
we now calculate it during rendering. See commit (6d62837e5bb2). This 
costs only ~1-2% in performance and saves quite a bit of memory. I 
still hope to win that time back, but first I need to find the right 
place inside the BVH traversal to calculate the normal and then store 
it somewhere (Intersection struct?).

* AVX2 kernel: I added an AVX2 kernel for Intel Haswell CPUs (it can 
also be used on AMD CPUs as soon as they support AVX2). The AVX2 kernel 
makes rendering about 3-5% faster in several scenes. I tested this with 
clang on Mac OS with files from our test suite. The AVX2 kernel relies 
on the AVX2, FMA3, BMI and BMI2 instruction sets, and we already use 
some dedicated FMA3 intrinsics in the kernel. More improvements can 
probably be made here, but I think it's already a solid basis. See 
commits (ac908f6c1f6d, 3844b8f85c7d and caaf0e484da8).

* I also looked into Multi Lamp Sampling for Volumes, and submitted a 
first patch. This needs additional work for Equi-angular sampling 
though. https://developer.blender.org/D526
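To illustrate the on-the-fly face normal item above: the normal of a triangle can be derived from its three vertices with one cross product and a normalization. This is only a minimal sketch and not the code from the commit; `Vec3`, `sub`, `cross` and `face_normal` are hypothetical names (Cycles has its own `float3` type), and counter-clockwise winding is assumed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal vector type for this sketch (Cycles uses its own float3).
struct Vec3 {
    float x, y, z;
};

static inline Vec3 sub(Vec3 a, Vec3 b)
{
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

static inline Vec3 cross(Vec3 a, Vec3 b)
{
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Geometric normal of the triangle (v0, v1, v2), assuming counter-clockwise winding.
static inline Vec3 face_normal(Vec3 v0, Vec3 v1, Vec3 v2)
{
    Vec3 n = cross(sub(v1, v0), sub(v2, v0));
    float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    assert(len > 0.0f);  // a degenerate triangle has no well-defined normal
    return {n.x / len, n.y / len, n.z / len};
}
```

The trade-off is the one described above: a few extra arithmetic operations per intersection instead of one stored normal per face.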

== What I did this week ==
This week I spent most of my time on research and tests, but I also 
looked into the fast inverse sqrt instructions.
* Read some documentation on SIMD intrinsics and C++ code optimization, 
thanks to Marcos Sánchez-Dehes for pointing me to these! 
http://www.agner.org/optimize/

* I looked into high-performance timers for benchmarking purposes, but I 
don't have a working implementation yet. It looks like each OS might 
need its own implementation, e.g. QueryPerformanceCounter on Windows. 
Maybe there is a better solution here; some feedback on this would be 
appreciated! I should probably also look into profilers. I am mainly 
interested in benchmarking specific code parts or a single function, to 
see whether a change improves performance or not.

* I started to look into fast inverse sqrt instructions. Here is a 
simple patch: http://pasteall.org/51827/diff
I still need to do more performance tests with it, and the render 
result is slightly different with the patch. Maybe the result needs to 
be refined with one or more Newton-Raphson steps? Also, it looks like 
we only use 1/sqrt() in the Microfacet and Ward closure code, which are 
not really bottlenecks afaik.
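As a sketch of the Newton-Raphson idea from the last item: the SSE `rsqrtss` instruction only returns an estimate of 1/sqrt(x) (roughly 12 bits of precision), and one refinement step y' = y * (1.5 - 0.5 * x * y * y) brings it close to full float precision. This is not the linked patch; `rsqrt_estimate` and `rsqrt_refined` are hypothetical names, and an x86 CPU with SSE is assumed.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_rsqrt_ss, _mm_cvtss_f32

// Raw hardware estimate of 1/sqrt(x); only approximate (about 12 bits).
static inline float rsqrt_estimate(float x)
{
    assert(x > 0.0f);
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

// Estimate followed by one Newton-Raphson step on f(y) = 1/(y*y) - x.
static inline float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

Whether the extra multiplications still leave a net gain over a plain 1/sqrt() would need to be measured, and the refined result can still differ from the exact one in the last bits.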

== Next week ==
Continue to look into the Face Normal calculation code and start with 
uchar attribute support, for things like Vertex colors (to reduce memory 
usage).

== Questions ==
See above. Mainly, some input about profiling would be cool. :)
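On the timer question, one portable candidate might be `std::chrono::steady_clock` from C++11, which would avoid a separate implementation per OS. This is only a sketch under the assumption that a C++11 compiler is acceptable for such a benchmark helper; `time_per_call` is a hypothetical name, and the actual clock resolution depends on the platform and standard library.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: average wall-clock seconds per call of fn.
template <typename Fn>
double time_per_call(Fn fn, int iterations)
{
    assert(iterations > 0);
    typedef std::chrono::steady_clock Clock;  // monotonic, not affected by clock changes
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; i++) {
        fn();
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

It could be called with a lambda wrapping the function under test, e.g. `time_per_call([&]() { result = my_function(input); }, 100000)`, where `my_function` stands for whatever is being measured. Writing the result to a `volatile` variable helps keep the compiler from optimizing the call away.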

Thanks!