[Soc-2016-dev] Weekly Report #12 - Cycles Denoising

Tue Aug 16 05:03:35 CEST 2016

Hi!

The NLM prefiltering is now about 3 times faster: in a test file I used for profiling, the total CPU time spent in kernel_filter_non_local_means went down from 83 sec to 26 sec.
For the main denoising kernel, the result depends on the size of the half window, since only the per-window-pixel operations are vectorized, not the other numerical code. In that particular file, the time went down from 43 sec to 17 sec (about 2.5 times faster).
Note that the prefiltering time is actually higher than the denoising time. That's currently a problem: the denoiser requires data from around the tile, so the preprocessing has to be done on a larger area. For 8x8 tiles and a half window (HW) of 8, each prefiltering call covers a 24x24 window, so each area is prefiltered nine times!
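
The overlap arithmetic can be checked with a short sketch. This is only an illustration, not code from the branch: the function name is made up, and it assumes square tiles and a prefiltering call that covers the tile plus one half window on every side. The 256-pixel tile in the second call is an arbitrary example of a large tile.

```python
def prefilter_overlap(tile_size, half_window):
    """Return (window side length, redundancy factor) for per-tile prefiltering.

    Hypothetical helper: assumes square tiles and that each prefiltering
    call covers the tile plus `half_window` pixels on every side.
    """
    window = tile_size + 2 * half_window
    # How many times each pixel is prefiltered on average.
    redundancy = (window * window) / (tile_size * tile_size)
    return window, redundancy


print(prefilter_overlap(8, 8))    # 8x8 tiles, HW 8 -> (24, 9.0)
print(prefilter_overlap(256, 8))  # large tile (size is an assumption)
```

With larger tiles the redundancy factor approaches 1, so the extra work matters mostly for small tiles.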
There are three options here:
1. Ignore the problem - not good, too slow.
2. Denoise as soon as the tile is rendered - not good since that would make tile border artifacts show up.
3. First render, then prefilter as soon as all neighbors are rendered, then denoise as soon as all neighbors are prefiltered. This works fine, but it means the outermost 2 rows of tiles are not denoised during rendering, instead of only the outermost row as is the case now.
Option 3 is probably the best choice here: on CPUs, tiles are small anyway, and on GPUs the problem isn't noticeable due to the large tiles.
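
The dependency rule behind option 3 can be sketched as follows. This is a simplified model, not the scheduler in the branch: all names are hypothetical, tiles are identified by grid coordinates, and "neighbors" is assumed to mean the up to eight adjacent tiles that lie inside the image.

```python
from itertools import product


def neighbors(tile, grid_w, grid_h):
    """The up to 8 adjacent tiles of `tile` that lie inside the tile grid."""
    x, y = tile
    return {
        (x + dx, y + dy)
        for dx, dy in product((-1, 0, 1), repeat=2)
        if (dx, dy) != (0, 0) and 0 <= x + dx < grid_w and 0 <= y + dy < grid_h
    }


def ready(done, grid_w, grid_h):
    """Tiles in `done` whose in-image neighbors are all in `done` as well."""
    return {t for t in done if neighbors(t, grid_w, grid_h) <= done}


def pipeline_state(rendered, grid_w, grid_h):
    """Which tiles can be prefiltered and denoised, given the rendered tiles."""
    prefiltered = ready(rendered, grid_w, grid_h)    # all neighbors rendered
    denoised = ready(prefiltered, grid_w, grid_h)    # all neighbors prefiltered
    return prefiltered, denoised


# A 5x5 block of rendered tiles in the middle of a 9x9 tile grid.
rendered = {(x, y) for x in range(2, 7) for y in range(2, 7)}
prefiltered, denoised = pipeline_state(rendered, 9, 9)
print(len(prefiltered), sorted(denoised))
```

In this model, only the center tile of the 5x5 block can be denoised while rendering is still in progress, which corresponds to the two undenoised outer rows mentioned above. Once every tile is rendered, all tiles become eligible.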

As for blue noise, it's not directly related, but it might provide a significant advantage in denoising quality in some cases, since denoising artifacts tend to be low-frequency residuals, which blue-noise sampling aims to reduce. Also, as far as I can see, it actually provides a way to avoid the correlation issues with Sobol sampling and odd/even buffers.

I'll look more into the threaded EXR loading. I already noticed that ~8 threads are created and destroyed for every channel of every EXR that's loaded.

On 15.08.2016 at 09:24, Sergey Sharybin wrote:
> Hi,
> 
> When you mention using intrinsics to speed something up it's always interesting to know what kind of speedup this gives.
> 
> Also while blue noise is an interesting experiment, it's not a part of GSoC so not sure what it is doing in the report.
> 
> In Blender we use threaded EXR decoding, so that could be a reason why OIIO is poor in performance. It had some options for threading AFAIR, so make sure the IO is threaded there.
> 
> On Sat, Aug 13, 2016 at 5:47 AM, Lukas Stockner <lukas.stockner at freenet.de> wrote:
> 
>     Hi!
> 
>     This week, I started over the weekend by doing some minor changes and fixes:
>      - Now, Cycles Standalone has a command line option to set the tile size (ffea3f5a in the branch, ef27d8ec in master)
>      - Revert a "fix" that actually totally broke Glossy/Glass GGX (10cb9a19)
>      - Fix building after the debug_fpe commit (edcf60b4)
>      - Fix a wrong calculation of the feature matrix norm (d23f0003)
>      - Remove some useless files I added a while back by accident
>      - Moving denoise utility and prefiltering functions into separate files to clean up the main file (b7dc25cb)
> 
>     The next larger feature was standalone denoising of single frames: By rendering with denoising information enabled (no need to activate denoising itself) and saving the result as Multilayer EXR, that EXR can then be denoised by Cycles Standalone to produce a clean output file. In itself, that feature is mainly useful for development, since it allows pre-rendering once and then testing just the filter. (Commit: 3f94371a)
> 
>     After that, I decided to go for the SIMD kernel optimization next. That resulted in:
>      - A fix for a pretty longstanding hidden issue in master, where SSE4.1 function fallbacks were accidentally also used to override the native functions (cf017e81)
>      - A few SSE utility functions, like horizontal maximum and sum (741a2453)
>      - An SSE4.1-optimized version of the first kernel, which used to take up most of the time. The speedup in that function depends on a few factors, but it's usually about 2-3 times faster (dba99c49).
>      - An SSE4.1-optimized version of the NLM prefiltering kernel, which reduces prefiltering time by a factor of about 3 (95fa4836)
>     Together, these functions make denoising more than twice as fast on compatible processors (pretty much any processor since 2011).
> 
>     Next, I created a clean implementation of the Blue-Noise dithering patch - now under review at D2149. While doing so, I also fixed a problem in master regarding CUDA texture limits (bbbc079a), fixed the KernelIntegrator structure padding that was wrong since the Light Portal commit (82e65abf) and added a CTest that checks for problems like that in the future (7c3a06c3).
> 
>     One of the components of D2149, the simulated annealing tool used to precalculate the dither matrix, took a bit of time to optimize - but after a number of improvements, such as approximate math functions and yet another SSE4.1-optimized code path, it now runs 9 times as fast! A simulation pass with 3 billion iterations is running right now; I'll upload the result once it's done.
> 
>     After that, I finished the standalone denoising by finally adding the inter-frame denoising mode (1c675f1c, e0208200, 2af90268) - now, the denoiser can use previous and later frames to avoid flickering and produce a better result in general!
>     The filtering is a bit slow, though - one reason for that is that OIIO actually needs about 10 seconds to read five Multilayer EXRs from the disk, and I don't yet understand why (disk I/O isn't the bottleneck, I even tried a RAM disk). Also, it's just more pixels - but that could be improved by doing things like using a smaller half window for secondary frames.
> 
> 
> 
>     So, since next week is the final GSoC week, I'll spend most of my time on final documentation.
>     The project in general is in working shape now and I covered the main parts of the proposal (except for possibly adaptive sampling), but the branch isn't close to being finished and polished yet.
> 
>     Of course, though, I'll continue to work on it after the GSoC ends - my goal is to get the denoiser into master, after all!
> 
>     Lukas
> 
> 
>     _______________________________________________
>     Soc-2016-dev mailing list
>     Soc-2016-dev at blender.org
>     https://lists.blender.org/mailman/listinfo/soc-2016-dev
> 
> 
> 
> 
> -- 
> With best regards, Sergey Sharybin
