[Bf-blender-cvs] [e0716af1a4f] cycles-x: Cycles X: Align kernels of existing and new paths

Fri May 21 20:04:55 CEST 2021

Commit: e0716af1a4f43bc3bf9238556dcd44d35e830ed9
Author: Sergey Sharybin
Date:   Fri May 21 14:31:50 2021 +0200
Branches: cycles-x
https://developer.blender.org/rBe0716af1a4f43bc3bf9238556dcd44d35e830ed9

Cycles X: Align kernels of existing and new paths

Only enqueue new kernels when the existing wavefront is at the
intersect closest stage. This seems to improve coherency, which
gains performance:

```
                              new          cycles-x(1)    megakernel(2)
bmw27.blend                   10.198       10.6995        10.4269
classroom.blend               16.7821      17.2352        16.6609
pabellon.blend                9.39898      9.65984        9.14966
monster.blend                 10.5923      10.5799        12.0106
barbershop_interior.blend     11.777       11.8852        12.5769
junkshop.blend                16.085       16.2971        16.5213
pvt_flat.blend                16.5704      16.3189        17.4047

(1) cycles-x branch hash ad81074fab1
(2) cycles-x branch hash ef6ce4fa8ca (right before disabling megakernel)
```

While pvt_flat (with adaptive sampling) is about 1.5% slower than
(1), some other scenes have gained their performance almost all the
way back in comparison to Cycles-X before the megakernel removal.
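
As a quick check of the numbers above, the relative change between two
columns can be computed as below. This is only an illustrative sketch:
the helper name is hypothetical, and it assumes the table values are
render timings where lower is better (the table does not state units).

```cpp
// Hypothetical helper, not part of Cycles.
// Relative change of `value` against `baseline`, in percent.
// Positive means `value` is larger than `baseline`, which is slower under the assumption that
// the table lists timings.
inline double relative_change_percent(double value, double baseline)
{
  return (value - baseline) / baseline * 100.0;
}
```

For example, `relative_change_percent(16.5704, 16.3189)` compares the
"new" and "cycles-x(1)" columns for pvt_flat.blend, and
`relative_change_percent(10.198, 10.6995)` does the same for bmw27.blend.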

Note that improved coherency is only a hypothesis. The performance
gain might also be caused by fewer active paths array calculations.
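
The scheduling rule described above can be sketched in isolation as
follows. This is a simplified illustration, not the code from the
commit: the `Kernel` enum, `QueueCounts` and `may_enqueue_new_paths()`
are hypothetical stand-ins for `DeviceKernel`, `IntegratorQueueCounter`
and the early-out added to `enqueue_work_tiles()`, and the kernel list
is shortened.

```cpp
#include <array>

// Hypothetical stand-in for DeviceKernel; the real enum has more entries.
enum Kernel : int {
  KERNEL_INTERSECT_CLOSEST = 0,
  KERNEL_INTERSECT_SHADOW,
  KERNEL_SHADE_SURFACE,
  KERNEL_NUM  // Sentinel meaning "no kernel has queued paths".
};

// Hypothetical stand-in for the per-kernel queued path counters.
using QueueCounts = std::array<int, KERNEL_NUM>;

// Returns the kernel with the most queued paths, or KERNEL_NUM when all queues are empty.
// The comparison is strict, so ties resolve to the lowest kernel index.
inline Kernel most_queued_kernel(const QueueCounts &num_queued)
{
  int max_num_queued = 0;
  Kernel kernel = KERNEL_NUM;
  for (int i = 0; i < KERNEL_NUM; i++) {
    if (num_queued[i] > max_num_queued) {
      kernel = static_cast<Kernel>(i);
      max_num_queued = num_queued[i];
    }
  }
  return kernel;
}

// New paths may be added when nothing is queued, or when the most queued kernel is the
// closest-intersection one, so that new and existing paths start from the same stage.
inline bool may_enqueue_new_paths(const QueueCounts &num_queued)
{
  const Kernel kernel = most_queued_kernel(num_queued);
  return kernel == KERNEL_NUM || kernel == KERNEL_INTERSECT_CLOSEST;
}
```

With counts of `{5, 1, 2}` the most queued kernel is the closest
intersection one, so new paths may be added; with `{1, 0, 7}` most paths
are waiting for shading, so adding new paths is postponed.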

===================================================================

M	intern/cycles/integrator/path_trace_work_gpu.cpp
M	intern/cycles/integrator/path_trace_work_gpu.h

===================================================================

diff --git a/intern/cycles/integrator/path_trace_work_gpu.cpp b/intern/cycles/integrator/path_trace_work_gpu.cpp
index 615832dd443..6a50feab497 100644
--- a/intern/cycles/integrator/path_trace_work_gpu.cpp
+++ b/intern/cycles/integrator/path_trace_work_gpu.cpp
@@ -193,6 +193,23 @@ void PathTraceWorkGPU::render_samples(int start_sample, int samples_num)
   }
 }
 
+DeviceKernel PathTraceWorkGPU::get_most_queued_kernel() const
+{
+  const IntegratorQueueCounter *queue_counter = integrator_queue_counter_.data();
+
+  int max_num_queued = 0;
+  DeviceKernel kernel = DEVICE_KERNEL_NUM;
+
+  for (int i = 0; i < DEVICE_KERNEL_INTEGRATOR_NUM; i++) {
+    if (queue_counter->num_queued[i] > max_num_queued) {
+      kernel = (DeviceKernel)i;
+      max_num_queued = queue_counter->num_queued[i];
+    }
+  }
+
+  return kernel;
+}
+
 void PathTraceWorkGPU::enqueue_reset()
 {
   const int num_keys = integrator_sort_key_counter_.size();
@@ -210,7 +227,7 @@ void PathTraceWorkGPU::enqueue_reset()
 bool PathTraceWorkGPU::enqueue_path_iteration()
 {
   /* Find kernel to execute, with max number of queued paths. */
-  IntegratorQueueCounter *queue_counter = integrator_queue_counter_.data();
+  const IntegratorQueueCounter *queue_counter = integrator_queue_counter_.data();
 
   int num_paths = 0;
   for (int i = 0; i < DEVICE_KERNEL_INTEGRATOR_NUM; i++) {
@@ -222,17 +239,8 @@ bool PathTraceWorkGPU::enqueue_path_iteration()
   }
 
   /* Find kernel to execute, with max number of queued paths. */
-  int max_num_queued = 0;
-  DeviceKernel kernel = DEVICE_KERNEL_NUM;
-
-  for (int i = 0; i < DEVICE_KERNEL_INTEGRATOR_NUM; i++) {
-    if (queue_counter->num_queued[i] > max_num_queued) {
-      kernel = (DeviceKernel)i;
-      max_num_queued = queue_counter->num_queued[i];
-    }
-  }
-
-  if (max_num_queued == 0) {
+  const DeviceKernel kernel = get_most_queued_kernel();
+  if (kernel == DEVICE_KERNEL_NUM) {
     return false;
   }
 
@@ -390,6 +398,15 @@ void PathTraceWorkGPU::compute_queued_paths(DeviceKernel kernel, int queued_kern
 
 bool PathTraceWorkGPU::enqueue_work_tiles(bool &finished)
 {
+  /* If there are existing paths, wait for them to reach the intersect closest kernel, which will
+   * align the wavefront of the existing and newly added paths. */
+  /* TODO: Check whether counting new intersection kernels here will have a positive effect on
+   * the performance. */
+  const DeviceKernel kernel = get_most_queued_kernel();
+  if (kernel != DEVICE_KERNEL_NUM && kernel != DEVICE_KERNEL_INTEGRATOR_INTERSECT_CLOSEST) {
+    return false;
+  }
+
   const float regenerate_threshold = 0.5f;
   int num_paths = get_num_active_paths();
 
diff --git a/intern/cycles/integrator/path_trace_work_gpu.h b/intern/cycles/integrator/path_trace_work_gpu.h
index e3b67c08cac..3cd193e606f 100644
--- a/intern/cycles/integrator/path_trace_work_gpu.h
+++ b/intern/cycles/integrator/path_trace_work_gpu.h
@@ -54,6 +54,9 @@ class PathTraceWorkGPU : public PathTraceWork {
   void alloc_integrator_queue();
   void alloc_integrator_sorting();
 
+  /* Returns DEVICE_KERNEL_NUM if there are no scheduled kernels. */
+  DeviceKernel get_most_queued_kernel() const;
+
   void enqueue_reset();
 
   bool enqueue_work_tiles(bool &finished);