[Bf-blender-cvs] [3e9265a31a7] cycles-x: Fix Cycles X adaptive sampling convergence check

Thu Jul 15 17:15:02 CEST 2021

Commit: 3e9265a31a7573e93dc90d821bf6429474c7eb77
Author: Sergey Sharybin
Date:   Thu Jul 15 11:21:12 2021 +0200
Branches: cycles-x
https://developer.blender.org/rB3e9265a31a7573e93dc90d821bf6429474c7eb77

Fix Cycles X adaptive sampling convergence check

The optimization of atomics and reduction was wrong: the warp voting
functions operate on the threads of a single warp, so the result of the
vote has to be accumulated once for every warp.

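The thread index, however, is measured within a block, not within a warp:
a block can have many threads (the number is GPU-dependent), while a warp
has only 32 threads. Checking `threadIdx.x == 0` therefore let only one
warp per block accumulate its vote.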
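
To illustrate the difference between the two indices, here is a minimal
host-side C++ sketch (not kernel code). The helper names `lane_of`,
`old_adders_per_block` and `new_adders_per_block` are hypothetical and the
block size of 256 used below is an assumption for illustration only; the
actual value of `CUDA_KERNEL_BLOCK_NUM_THREADS` is not stated here.

```cpp
#include <cassert>

// Lane index of a thread within its warp, mirroring `threadIdx.x % warpSize`.
// A warp size of 32 is assumed, as described in the commit message.
int lane_of(int thread_idx, int warp_size = 32) {
  assert(warp_size > 0);
  return thread_idx % warp_size;
}

// Number of threads in one block which pass the old check: `threadIdx.x == 0`.
int old_adders_per_block(int block_size) {
  int count = 0;
  for (int thread_idx = 0; thread_idx < block_size; ++thread_idx) {
    count += (thread_idx == 0) ? 1 : 0;
  }
  return count;
}

// Number of threads in one block which pass the new check: `lane_id == 0`.
int new_adders_per_block(int block_size, int warp_size = 32) {
  int count = 0;
  for (int thread_idx = 0; thread_idx < block_size; ++thread_idx) {
    count += (lane_of(thread_idx, warp_size) == 0) ? 1 : 0;
  }
  return count;
}
```

With an assumed block of 256 threads, `old_adders_per_block(256)` gives 1
while `new_adders_per_block(256)` gives 8, that is, one adder per warp.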

Now the code does the voting, and the first lane of every warp atomically
adds the vote count to the result.

This fixes sampling possibly stopping too early on the GPU. Because the
old code could have finished too soon, the absolute render time may go
up.

This is one of those things which is a bit hard to see on a real file,
but the same approach was giving wrong results during development of the
shadow catcher occupancy improvement. So the best visualization of the
problem so far was to force `converged` to be always false and print the
number of pixels and the number of active pixels after running the
kernel. Before this change the number of active pixels was much smaller
than the number of pixels, now those values match.
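
The experiment described above can be modelled on the CPU. The following
C++ sketch is only a simulation of the counting logic, not the kernel
itself: `Counts` and `simulate_active_pixel_count` are hypothetical names,
one thread per pixel is assumed, and the pixel count and block size passed
in are made-up values.

```cpp
#include <cassert>
#include <cstdint>

struct Counts {
  uint32_t old_check;  // Accumulated when `threadIdx.x == 0`.
  uint32_t new_check;  // Accumulated when `lane_id == 0`.
};

// Model the active pixel count with `converged` forced to false for every
// pixel, so every lane of every warp votes "active".
Counts simulate_active_pixel_count(int num_pixels, int block_size, int warp_size = 32) {
  assert(warp_size > 0 && block_size % warp_size == 0);
  assert(num_pixels % block_size == 0);

  Counts counts{0, 0};
  const int num_blocks = num_pixels / block_size;
  const int warps_per_block = block_size / warp_size;

  for (int block = 0; block < num_blocks; ++block) {
    for (int warp = 0; warp < warps_per_block; ++warp) {
      // Stand-in for `__popc(__ballot_sync(0xffffffff, !converged))`:
      // all lanes are active, so the population count is the warp size.
      const uint32_t vote_count = static_cast<uint32_t>(warp_size);

      for (int lane = 0; lane < warp_size; ++lane) {
        const int thread_idx = warp * warp_size + lane;
        if (thread_idx == 0) {
          counts.old_check += vote_count;  // Only the first warp of a block.
        }
        if (lane == 0) {
          counts.new_check += vote_count;  // Once for every warp.
        }
      }
    }
  }
  return counts;
}
```

For an assumed 1024 pixels and 256 threads per block, the model reports
128 active pixels with the old check and 1024 with the new one, which is
the same kind of mismatch as observed before the change.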

===================================================================

M	intern/cycles/kernel/device/cuda/kernel.cu

===================================================================

diff --git a/intern/cycles/kernel/device/cuda/kernel.cu b/intern/cycles/kernel/device/cuda/kernel.cu
index 14da02cc809..4e95dc513a0 100644
--- a/intern/cycles/kernel/device/cuda/kernel.cu
+++ b/intern/cycles/kernel/device/cuda/kernel.cu
@@ -421,7 +421,8 @@ extern "C" __global__ void CUDA_LAUNCH_BOUNDS(CUDA_KERNEL_BLOCK_NUM_THREADS,
 
   /* NOTE: All threads specified in the mask must execute the intrinsic. */
   const uint num_active_pixels_mask = __ballot_sync(0xffffffff, !converged);
-  if (threadIdx.x == 0) {
+  const int lane_id = threadIdx.x % warpSize;
+  if (lane_id == 0) {
     atomic_fetch_and_add_uint32(num_active_pixels, __popc(num_active_pixels_mask));
   }
 }