[Soc-2016-dev] Weekly Report #13 - Cycles Denoising

Sun Aug 21 17:40:28 CEST 2016

Hi!

This week, I mainly worked on three topics:

- Bug fixing:
 - The CUDA host side of the feature preprocessing did not apply the 16-byte width alignment, which caused weird channel separation artifacts if the tile size wasn't already a multiple of 4. (25df3ca1)
 - The filter debug output code was outdated, which caused the build to fail if it was enabled. (4ab88b4a)
 - If the user cancelled rendering with CUDA (or with CPU when the CPU Overscan was enabled), the tile would still be denoised. Now, this is skipped if the render was cancelled. (6a67f80a)
 - The feature matrix norm calculation still used the old offset, which has been invalid since the cross-frame filtering was added. (ece8cb8c)
 - The calculation of the variance of the sample variance missed a factor, which caused the shadow prefiltering to sometimes overblur edges. (5008072f)
 - The debug passes contained some invalid data, for example, the bandwidths were scaled with the global bandwidth before writing. (8dcf23bb)
 - The global bandwidth optimization is actually wrong in both the paper and the reference implementation. I have now implemented it correctly; details can be found in the commit message. (e1df90ad)
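
To illustrate the alignment fix in the first item: with 4-byte floats, a 16-byte row alignment means the buffer width has to be rounded up to a multiple of 4. The following is only a minimal sketch of that rounding, not the actual Cycles code; the helper names align_up and aligned_tile_width are made up for this example.

```cpp
#include <cassert>

// Hypothetical helper (not from Cycles): round `value` up to the next
// multiple of `alignment`.
inline int align_up(int value, int alignment)
{
    assert(alignment > 0);
    return ((value + alignment - 1) / alignment) * alignment;
}

// Hypothetical helper: row width, in floats, such that every row of a
// float buffer starts on a 16-byte boundary.
inline int aligned_tile_width(int tile_width)
{
    // 16 bytes hold 4 floats on platforms where sizeof(float) == 4.
    const int floats_per_16_bytes = 16 / static_cast<int>(sizeof(float));
    return align_up(tile_width, floats_per_16_bytes);
}
```

If the host side indexes rows with the unaligned width while the device side uses the aligned one, the two disagree whenever the width is not already a multiple of 4, which would be consistent with the channel separation artifacts described above.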

- Shadow prefiltering tweaking: Now the code handles some corner cases better (8771846b). Also, sometimes the even/odd filtering causes weird artifacts to appear at low sample counts - using Correlated Multi-Jittered sampling instead of Sobol solves that.

- CUDA redesign:
 - First of all, the huge first kernel was split into four kernels, one of which is executed 6 times - so 9 kernel launches in total. That should improve UI responsiveness while denoising larger tiles.
 - The transform matrices are now stored separately, which means that they can be marked as read-only in all kernels but the first one. That allows CUDA to cache them, which improves performance.
 - Also, the transforms are now stored interleaved (SoA layout instead of AoS), which means that all accesses to them are now coalesced.
 - The feature vectors are now stored in shared memory, which takes a lot of load off the device memory (since they were in local memory before).
 - I also added a new approach of creating the design rows, but it turned out to be a bad idea (I screwed up the benchmarking first), so I reverted it again. (fba2b77c, 8ad0423c)
 => In total, these changes reduce rendering time for a test file from 39sec to 21sec - 6sec of each is path tracing time, so denoising went from 33sec to 15sec.
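
The effect of the interleaved transform storage can be seen from the index math alone. This is a rough sketch under assumed names (aos_index, interleaved_index and their parameters are hypothetical and not taken from the Cycles source): in the old layout, neighbouring pixels reading the same matrix element are a whole matrix apart in memory, while in the interleaved layout they are adjacent, so neighbouring threads touch neighbouring addresses.

```cpp
#include <cstddef>

// Array-of-structures layout: all elements of one pixel's transform are
// contiguous. Neighbouring pixels are `elements_per_transform` apart.
inline std::size_t aos_index(std::size_t pixel,
                             std::size_t element,
                             std::size_t elements_per_transform)
{
    return pixel * elements_per_transform + element;
}

// Interleaved (structure-of-arrays) layout: element k of every pixel is
// contiguous. Neighbouring pixels reading the same element are 1 apart,
// which is the access pattern that can be coalesced on the GPU.
inline std::size_t interleaved_index(std::size_t pixel,
                                     std::size_t element,
                                     std::size_t num_pixels)
{
    return element * num_pixels + pixel;
}
```

With one thread per pixel, all threads of a warp read the same element index at the same time, so the stride between pixels is what decides whether those reads end up in a single memory transaction.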

So, for the last 2 days I'll be finishing the documentation and putting it on my Wiki page.

Lukas
