[Bf-blender-cvs] [170c70b] soc-2016-cycles_denoising: Cycles: Prefilter the shadow feature passes

Sun Jul 24 03:46:09 CEST 2016

Commit: 170c70b29072d3c83d3e3a1f55666855a47bc41d
Author: Lukas Stockner
Date:   Sun Jul 24 02:17:11 2016 +0200
Branches: soc-2016-cycles_denoising
https://developer.blender.org/rB170c70b29072d3c83d3e3a1f55666855a47bc41d

Cycles: Prefilter the shadow feature passes

The previous commit already generates the features, but they are quite noisy. That is unacceptable for an LWR feature, since noise in a feature leads to noise in the result.

The filtering algorithm is:
1. filter_divide_shadow:
 - Divide the R and G channels of both passes to get two noisy shadow passes
 - Scale the B channels and combine them to get the approximate Sample Variance
 - Compute the squared difference of the A and B divided passes to get the correct, but also noisy, Buffer Variance
 - Compute the squared difference of the A and B Sample Variances to get the variance of the Sample Variance estimate
2. filter_non_local_means:
 - Smooth the Buffer Variance using Non-Local Means with weights derived from the Sample Variance pass
3. filter_non_local_means:
 - Smooth the A and B shadow passes using Non-Local Means with weights derived from the other pass (B to smooth A, A to smooth B) and from the smoothed Buffer Variance.
4. filter_combine_halves:
 - Compute the squared difference of the A and B smoothed shadow passes to estimate the residual variance in the channels.
5. filter_non_local_means:
 - Smooth the two passes again using each other and the residual variance for weights.
6. filter_combine_halves:
 - Average the two double-smoothed passes to obtain the final shadow feature used for the LWR algorithm.
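The two filter_combine_halves steps (4 and 6) can be illustrated with a minimal sketch. It assumes two independent, unbiased half images A and B with equal per-pixel variance; under that assumption (A - B)^2 / 4 is an unbiased estimate of the variance of their average. The struct and function names below are hypothetical, and the scaling factor is an assumption, since the kernel itself is not visible in this truncated mail:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the combined result; not part of the commit.
struct Combined {
	std::vector<float> mean;
	std::vector<float> variance;
};

// Sketch only: combines two half images of equal size (b.size() == a.size()
// is assumed) into a mean and a variance estimate of that mean.
Combined combine_halves(const std::vector<float>& a, const std::vector<float>& b)
{
	Combined out;
	out.mean.resize(a.size());
	out.variance.resize(a.size());
	for(std::size_t i = 0; i < a.size(); i++) {
		float diff = a[i] - b[i];
		/* Average of the two halves. */
		out.mean[i] = 0.5f * (a[i] + b[i]);
		/* Squared difference, scaled so that it estimates the variance of the average. */
		out.variance[i] = 0.25f * diff * diff;
	}
	return out;
}
```

Step 4 would only need the variance output, while step 6 uses the mean as the final feature.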

Although the algorithm might sound rather slow, it is not. This can be seen by reducing the half window: doing so reduces the time LWR takes, while the prefiltering time stays the same. Since the total time drops drastically with a smaller half window, prefiltering can't be the bottleneck.
The amount of repeated smoothing also sounds like it would destroy fine details. However, that is not the case: because variance is taken into account and the NLM filter works remarkably well, details that only span a couple of pixels are still preserved without blurring.
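To illustrate why variance-aware NLM weights preserve edges, here is a simplified 1D sketch. It is not the kernel from this commit (the diff is truncated before the weight computation). The function name nlm_filter_1d and the variance-normalized patch distance are assumptions, chosen to match the documented parameters r, f, a and k_2:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical 1D illustration of NLM with a separate weight image.
// Returns the filtered value at position p; all three inputs must have the same size.
float nlm_filter_1d(const std::vector<float>& noisy,
                    const std::vector<float>& weight_img,
                    const std::vector<float>& variance,
                    int p, int r, int f, float a, float k_2)
{
	const int n = static_cast<int>(noisy.size());
	float sum_image = 0.0f, sum_weight = 0.0f;
	/* Loop over the q's in the search window around p. */
	for(int q = std::max(0, p - r); q < std::min(n, p + r + 1); q++) {
		/* Clamp the patch offsets so that p+d and q+d both stay in range. */
		int lo = std::max(std::max(-p, -q), -f);
		int hi = std::min(std::min(n - p, n - q), f + 1);
		float dist = 0.0f;
		for(int d = lo; d < hi; d++) {
			float diff = weight_img[p + d] - weight_img[q + d];
			float vp = variance[p + d], vq = variance[q + d];
			/* Assumed distance: subtract the difference expected from noise alone,
			 * then normalize by the variance. */
			dist += (diff*diff - a*(vp + std::min(vp, vq))) / (1e-7f + k_2*(vp + vq));
		}
		/* Average over the patch; differences explained by noise count as zero. */
		dist = std::max(0.0f, dist / (hi - lo));
		float w = std::exp(-dist);
		sum_image += w * noisy[q];
		sum_weight += w;
	}
	/* q == p always contributes weight 1, so sum_weight is never zero. */
	return sum_image / sum_weight;
}
```

In this sketch, a patch that overlaps a strong edge in the weight image gets a weight close to zero, so pixels across the edge do not bleed into the result, while patches that differ only by noise keep a weight of one.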

The final feature isn't used yet; that will be added in the next commit.

===================================================================

M	intern/cycles/device/device_cpu.cpp
M	intern/cycles/kernel/kernel_filter.h
M	intern/cycles/kernel/kernels/cpu/kernel_cpu.h
M	intern/cycles/kernel/kernels/cpu/kernel_cpu_impl.h
M	intern/cycles/kernel/kernels/cuda/kernel.cu

===================================================================

diff --git a/intern/cycles/device/device_cpu.cpp b/intern/cycles/device/device_cpu.cpp
index e03ccfe..4a8acfd 100644
--- a/intern/cycles/device/device_cpu.cpp
+++ b/intern/cycles/device/device_cpu.cpp
@@ -208,6 +208,148 @@ public:
 		}
 	};
 
+	float2* denoise_prefilter(int4 prefilter_rect, RenderTile &tile, KernelGlobals *kg, int sample, float** buffers, int* tile_x, int* tile_y, int *offsets, int *strides)
+	{
+		void(*filter_divide_shadow)(KernelGlobals*, int, float**, int, int, int*, int*, int*, int*, float*, float*, float*, float*, int4);
+		void(*filter_non_local_means)(int, int, float*, float*, float*, float*, int4, int, int, float, float);
+		void(*filter_combine_halves)(int, int, float*, float*, float*, float*, int, int4);
+
+#ifdef WITH_CYCLES_OPTIMIZED_KERNEL_AVX2
+		if(system_cpu_support_avx2()) {
+			filter_divide_shadow = kernel_cpu_avx2_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_avx2_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_avx2_filter_combine_halves;
+		}
+		else
+#endif
+#ifdef WITH_CYCLES_OPTIMIZED_KERNEL_AVX
+		if(system_cpu_support_avx()) {
+			filter_divide_shadow = kernel_cpu_avx_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_avx_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_avx_filter_combine_halves;
+		}
+		else
+#endif
+#ifdef WITH_CYCLES_OPTIMIZED_KERNEL_SSE41
+		if(system_cpu_support_sse41()) {
+			filter_divide_shadow = kernel_cpu_sse41_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_sse41_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_sse41_filter_combine_halves;
+		}
+		else
+#endif
+#ifdef WITH_CYCLES_OPTIMIZED_KERNEL_SSE3
+		if(system_cpu_support_sse3()) {
+			filter_divide_shadow = kernel_cpu_sse3_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_sse3_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_sse3_filter_combine_halves;
+		}
+		else
+#endif
+#ifdef WITH_CYCLES_OPTIMIZED_KERNEL_SSE2
+		if(system_cpu_support_sse2()) {
+			filter_divide_shadow = kernel_cpu_sse2_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_sse2_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_sse2_filter_combine_halves;
+		}
+		else
+#endif
+		{
+			filter_divide_shadow = kernel_cpu_filter_divide_shadow;
+			filter_non_local_means = kernel_cpu_filter_non_local_means;
+			filter_combine_halves = kernel_cpu_filter_combine_halves;
+		}
+
+		int w = (prefilter_rect.z - prefilter_rect.x), h = (prefilter_rect.w - prefilter_rect.y);
+		float2 *prefiltered = new float2[w*h];
+		float *unfiltered = new float[2*w*h], *sampleV = ((float*) prefiltered), *sampleVV = new float[w*h], *bufferV = ((float*) prefiltered) + w*h, *cleanV = new float[w*h];
+
+
+
+		/* Get the A/B unfiltered passes, the combined sample variance, the estimated variance of the sample variance and the buffer variance. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				filter_divide_shadow(kg, sample, buffers, x, y, tile_x, tile_y, offsets, strides, unfiltered, sampleV, sampleVV, bufferV, prefilter_rect);
+			}
+		}
+#ifdef WITH_CYCLES_DEBUG_FILTER
+#define WRITE_DEBUG(name, var, stride) debug_write_pfm(string_printf("debug_%dx%d_shadow_%s.pfm", tile.x, tile.y, name).c_str(), var, w, h, stride, w)
+		WRITE_DEBUG("unfilteredA", unfiltered, 1);
+		WRITE_DEBUG("unfilteredB", unfiltered + w*h, 1);
+		WRITE_DEBUG("bufferV", bufferV, 1);
+		WRITE_DEBUG("sampleV", sampleV, 1);
+		WRITE_DEBUG("sampleVV", sampleVV, 1);
+#endif
+
+
+
+		/* Smooth the (generally pretty noisy) buffer variance using the spatial information from the sample variance. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				//filter_prefilter_features(&kg, sample, x, y, filteredA, filteredB, prefilter_rect);
+				filter_non_local_means(x, y, bufferV, sampleV, sampleVV, cleanV, prefilter_rect, 3, 1, 4, 1.0f);
+			}
+		}
+#ifdef WITH_CYCLES_DEBUG_FILTER
+		WRITE_DEBUG("cleanV", cleanV, 1);
+#endif
+
+
+
+		/* Use the smoothed variance to filter the two shadow half images using each other for weight calculation. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				filter_non_local_means(x, y, unfiltered, unfiltered + w*h, cleanV, sampleV, prefilter_rect, 5, 3, 1, 0.25f);
+				filter_non_local_means(x, y, unfiltered + w*h, unfiltered, cleanV, bufferV, prefilter_rect, 5, 3, 1, 0.25f);
+			}
+		}
+		delete[] cleanV;
+#ifdef WITH_CYCLES_DEBUG_FILTER
+		WRITE_DEBUG("filteredA", sampleV, 1);
+		WRITE_DEBUG("filteredB", bufferV, 1);
+#endif
+
+
+
+		/* Estimate the residual variance between the two filtered halves. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				filter_combine_halves(x, y, NULL, sampleVV, sampleV, bufferV, 1, prefilter_rect);
+			}
+		}
+#ifdef WITH_CYCLES_DEBUG_FILTER
+		WRITE_DEBUG("residualV", sampleVV, 1);
+#endif
+
+		/* Use the residual variance for a second filter pass. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				filter_non_local_means(x, y, sampleV, bufferV, sampleVV, unfiltered      , prefilter_rect, 4, 2, 1, 0.25f);
+				filter_non_local_means(x, y, bufferV, sampleV, sampleVV, unfiltered + w*h, prefilter_rect, 4, 2, 1, 0.25f);
+			}
+		}
+		delete[] sampleVV;
+#ifdef WITH_CYCLES_DEBUG_FILTER
+		WRITE_DEBUG("finalA", unfiltered, 1);
+		WRITE_DEBUG("finalB", unfiltered + w*h, 1);
+#endif
+
+		/* Combine the two double-filtered halves to a final shadow feature image and associated variance. */
+		for(int y = prefilter_rect.y; y < prefilter_rect.w; y++) {
+			for(int x = prefilter_rect.x; x < prefilter_rect.z; x++) {
+				filter_combine_halves(x, y, (float*) prefiltered, ((float*) prefiltered)+1, unfiltered, unfiltered + w*h, 2, prefilter_rect);
+			}
+		}
+		delete[] unfiltered;
+#ifdef WITH_CYCLES_DEBUG_FILTER
+		WRITE_DEBUG("final", (float*) prefiltered, 2);
+		WRITE_DEBUG("finalV", ((float*) prefiltered) + 1, 2);
+#undef WRITE_DEBUG
+#endif
+
+		return prefiltered;
+	}
+
 	void thread_render(DeviceTask& task)
 	{
 		if(task_pool.canceled()) {
diff --git a/intern/cycles/kernel/kernel_filter.h b/intern/cycles/kernel/kernel_filter.h
index 895c68a..87b360b 100644
--- a/intern/cycles/kernel/kernel_filter.h
+++ b/intern/cycles/kernel/kernel_filter.h
@@ -115,6 +115,107 @@ ccl_device_inline bool filter_firefly_rejection(float3 pixel_color, float pixel_
 	return (color_diff > 3.0f*variance);
 }
 
+
+/* General Non-Local Means filter implementation.
+ * NLM essentially is an extension of the bilateral filter: It also loops over all the pixels in a neighborhood, calculates a weight for each one and combines them.
+ * The difference is the weighting function: While the Bilateral filter just looks at the two pixels (center=p and pixel in neighborhood=q) and calculates the weight from
+ * their distance and color difference, NLM considers small patches around both pixels and compares those. That way, it is able to identify similar image regions and compute
+ * better weights.
+ * One important consideration is that the image used for comparing patches doesn't have to be the one that's being filtered.
+ * This is used in two different ways in the denoiser: First, by splitting the samples in half, we get two unbiased estimates of the image.
+ * Then, we can use one of the halves to calculate the weights for filtering the other one. This way, the weights are decorrelated from the image and the result is smoother.
+ * The second use is for variance: Sample variance (generated in the kernel) tends to be quite smooth, but is biased.
+ * On the other hand, buffer variance, calculated from the difference of the two half images, is unbiased, but noisy.
+ * Therefore, by filtering the buffer variance based on weights from the sample variance, we get the same smooth structure, but the unbiased result.
+
+ * Parameters:
+ * - x, y: The position that is to be filtered (=p in the algorithm)
+ * - noisyImage: The image that is being filtered
+ * - weightImage: The image used for comparing patches and calculating weights
+ * - variance: The variance of the weight image (!), used to account for noisy input
+ * - filteredImage: Output image, only pixel (x, y) will be written
+ * - rect: The coordinates of the corners of the four images in image space.
+ * - r: The half radius of the area over which q is looped
+ * - f: The size of the patches that are used for comparing pixels
+ * - a: Can be tweaked to account for noisy variance, generally a=1
+ * - k_2: Squared k parameter of the NLM filter, general strength control (higher k => smoother image)
+ */
+ccl_device void kernel_filter_non_local_means(int x, int y, float *noisyImage, float *weightImage, float *variance, float *filteredImage, int4 rect, int r, int f, float a, float k_2)
+{
+	int2 low  = make_int2(max(rect.x, x - r),
+	                      max(rect.y, y - r));
+	int2 high = make_int2(min(rect.z, x + r + 1),
+	                      min(rect.w, y + r + 1));
+
+	float sum_image = 0.0f, sum_weight = 0.0f;
+
+	int w = rect.z-rect.x;
+	int p_idx = (y-rect.y)*w + (x - rect.x);
+	int q_idx = (low.y-rect.y)*w + (low.x-rect.x);
+	/* Loop over the q's, center pixels of all relevant patches. */
+	for(int qy = low.y; qy < high.y; qy++) {
+		for(int qx = low.x; qx < high.x; qx++, q_idx++) {
+			float dI = 0.0f;
+			int2  low_dPatch = make_int2(max(max(rect.x - qx, rect.x - x),  -f), max(max(rect.y - qy, rect.y - y),  -f));
+			int2 high_dPatch = make_int2(min(min(rect.z - qx, rect.z - x), f+1), min(min(rect.w - qy, rect.w - y), f+1));
+			int dIdx = low_dPatch.x + low_dPatch.y*w;
+			/* Loop over the pixels in the patch.
+			 * Note that the patch must be small enough to be fully inside the rect, both at p and q.
+			 * To avoid doing all the coordinate calculations twice, the code here computes both weights at once. */
+			for(int dy = low_dPatch.y; dy < high_dPatch.y; dy++) {
+				for(int dx = low_dPatch.x; dx < high_d

@@ Diff output truncated at 10240 characters. @@