[Bf-blender-cvs] [17774afed1a] cycles-x: Cycles X: Remove usage of mega-kernel

Sergey Sharybin noreply at git.blender.org
Wed May 19 19:18:18 CEST 2021


Commit: 17774afed1a3f17f32eddef48155c6ed65e03099
Author: Sergey Sharybin
Date:   Wed May 19 16:37:30 2021 +0200
Branches: cycles-x
https://developer.blender.org/rB17774afed1a3f17f32eddef48155c6ed65e03099

Cycles X: Remove usage of mega-kernel

The usage of the mega-kernel is commented out with this change.

There are a few benefits to removing the mega-kernel:

- It takes extra time to compile and space to ship.
- It is not compatible with features like shadow catcher.

The rest of the changes are an attempt to avoid performance
loss in various scenes. Those changes include:

- Make the work tile smaller in size. This makes the work tile more
  friendly for greedy scheduling when adaptive sampling is used.
  Currently this is achieved by keeping the pixel size the same and
  lowering the number of samples per work tile. The idea behind this
  is to avoid a dramatic change in the order in which pixels are
  scheduled for sampling.

- Keep tile size dimensions a power of two.
  This lowers the number of unused path states, which can be monitored with

  ./bin/blender --debug-cycles --verbose 3 2>&1 | grep "Number of unused path states"

  In our own tests it seems that we barely "waste" path states now
  (see the sketch after this list).

- Make it so tiles are scheduled in the order of samples first.
  As in: keep pixel-space coherency, similar to how it is done
  in `get_work_pixel()`.

- Only keep extreme-case tests for the tile size calculation.
  This avoids some unnecessary test updates, while still ensuring
  correct behavior in the extremes.
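
For illustration, here is a small standalone sketch (not the actual Cycles
code, and with made-up numbers) of how square power-of-two tile dimensions and
a power-of-two number of samples per tile let the path-state pool divide
evenly into whole tiles, leaving few or no path states unused:
```
/* Standalone sketch of the tile sizing idea, with hypothetical numbers. */
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

/* Round x down to the nearest power of two (x > 0). */
static uint32_t round_down_to_power_of_two(uint32_t x)
{
  uint32_t r = 1;
  while (r * 2 <= x) {
    r *= 2;
  }
  return r;
}

/* Round x up to the nearest power of two (x > 0). */
static uint32_t round_up_to_power_of_two(uint32_t x)
{
  uint32_t r = 1;
  while (r < x) {
    r *= 2;
  }
  return r;
}

int main()
{
  /* Hypothetical configuration: ~1M path states, 128 samples to render. */
  const uint32_t max_num_path_states = 1u << 20;
  const uint32_t num_samples = 128;

  /* Small square power-of-two tile, sized as if a single tile had to cover
   * the entire sample range on its own. */
  const uint32_t states_per_sample = std::max<uint32_t>(max_num_path_states / num_samples, 1u);
  const uint32_t tile_width = round_down_to_power_of_two(
      (uint32_t)std::lround(std::sqrt((double)states_per_sample)));
  const uint32_t tile_height = tile_width;

  /* Prefer many smaller sample ranges over one big range, so that more tiles
   * can be scheduled greedily early on. */
  const uint32_t samples_per_tile = std::min(
      round_up_to_power_of_two((uint32_t)std::lround(std::sqrt(num_samples / 2.0))), num_samples);

  /* Because all three factors are powers of two, the path-state pool splits
   * into whole tiles with nothing left over. */
  const uint32_t states_per_tile = tile_width * tile_height * samples_per_tile;
  const uint32_t tiles_in_flight = max_num_path_states / states_per_tile;
  const uint32_t unused_states = max_num_path_states - tiles_in_flight * states_per_tile;

  std::printf("tile %ux%u, %u samples per tile\n", tile_width, tile_height, samples_per_tile);
  std::printf("%u tiles in flight, %u path states unused\n", tiles_in_flight, unused_states);
  /* With these numbers: 64x64 tiles, 8 samples each, 32 tiles in flight, 0 unused. */
  return 0;
}
```
The real heuristic lives in `tile_calculate_best_size()` in the diff below.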

The timings are as follows:
```
RTX 6000 (Turing)

                              new                           cycles-x
bmw27.blend                   10.8964                       10.4269
classroom.blend               17.4476                       16.6609
pabellon.blend                9.77167                       9.14966
monster.blend                 10.3662                       12.0106
barbershop_interior.blend     11.9445                       12.5769
junkshop.blend                16.3556                       16.5213
pvt_flat.blend                16.5317                       17.4047

RTX A6000 (Ampere)
                              new                           cycles-x
bmw27.blend                   7.74059                       7.65293
classroom.blend               10.775                        10.9143
pabellon.blend                6.00643                       5.85334
monster.blend                 6.79277                       8.0134
barbershop_interior.blend     8.39941                       8.47159
junkshop.blend                10.4258                       10.9882
pvt_flat.blend                10.2752                       10.8821
```

Not entirely happy with the results: there are some very nice speedups
interleaved with some slowdowns. However, the slowdowns are within 5%,
so the hope is that we can gain it back with more tricks up our sleeves.

Some things to try:
- Try lowering the tile size in pixels.
- Try better alignment of the tile size with the number of threads on a
  multiprocessor.

This change is the combined brain activity of Brecht and myself.

Differential Revision: https://developer.blender.org/D11311

===================================================================

M	intern/cycles/integrator/path_trace_work_gpu.cpp
M	intern/cycles/integrator/tile.cpp
M	intern/cycles/integrator/work_tile_scheduler.cpp
M	intern/cycles/test/integrator_tile_test.cpp

===================================================================

diff --git a/intern/cycles/integrator/path_trace_work_gpu.cpp b/intern/cycles/integrator/path_trace_work_gpu.cpp
index 889079ba98b..f7db88fe126 100644
--- a/intern/cycles/integrator/path_trace_work_gpu.cpp
+++ b/intern/cycles/integrator/path_trace_work_gpu.cpp
@@ -221,6 +221,7 @@ bool PathTraceWorkGPU::enqueue_path_iteration()
     return false;
   }
 
+#if 0
   /* Megakernel does not support state split, so disable for the shadow catcher.
    * It is possible to make it work, but currently we are planning to make the megakernel
    * obsolete for the GPU rendering, so we don't spend time on making shadow catcher to work
@@ -235,6 +236,7 @@ bool PathTraceWorkGPU::enqueue_path_iteration()
       return true;
     }
   }
+#endif
 
   /* Find kernel to execute, with max number of queued paths. */
   int max_num_queued = 0;
diff --git a/intern/cycles/integrator/tile.cpp b/intern/cycles/integrator/tile.cpp
index 3e2e994512b..87524f329af 100644
--- a/intern/cycles/integrator/tile.cpp
+++ b/intern/cycles/integrator/tile.cpp
@@ -37,6 +37,15 @@ ccl_device_inline uint round_down_to_power_of_two(uint x)
   return prev_power_of_two(x);
 }
 
+ccl_device_inline uint round_up_to_power_of_two(uint x)
+{
+  if (is_power_of_two(x)) {
+    return x;
+  }
+
+  return next_power_of_two(x);
+}
+
 TileSize tile_calculate_best_size(const int2 &image_size,
                                   const int num_samples,
                                   const int max_num_path_states)
@@ -55,31 +64,31 @@ TileSize tile_calculate_best_size(const int2 &image_size,
   }
 
   /* The idea here is to keep number of samples per tile as much as possible to improve coherency
-   * across threads. */
-
-  const int num_path_states_per_sample = max(max_num_path_states / num_samples, 1);
+   * across threads.
+   *
+   * Some general ideas:
+   *  - Prefer smaller tiles with more samples, which improves spatial coherency of paths.
+   *  - Keep values a power of two, for more integer fit into the maximum number of paths. */
 
   TileSize tile_size;
 
-  if (true) {
-    /* Occupy as much of GPU threads as possible by the single tile.
-     * This could cause non-optimal load due to "wasted" path states (due to non-integer division)
-     * but currently it gives better performance. Possibly that coalescing will help with. */
-    tile_size.width = max(static_cast<int>(lround(sqrt(num_path_states_per_sample))), 1);
-    tile_size.height = max(num_path_states_per_sample / tile_size.width, 1);
+  /* Calculate tile size as if it is the most possible one to fit an entire range of samples.
+   * The idea here is to keep tiles as small as possible, and keep device occupied by scheduling
+   * multiple tiles with the same coordinates rendering different samples. */
+  const int num_path_states_per_sample = max_num_path_states / num_samples;
+  tile_size.width = round_down_to_power_of_two(lround(sqrt(num_path_states_per_sample)));
+  tile_size.height = tile_size.width;
+
+  if (num_samples == 1) {
+    tile_size.num_samples = 1;
   }
   else {
-    /* Round down to the power of two, so that all path states are occupied. */
-    /* TODO(sergey): Investigate why this is slower than the scheduling based on the code above and
-     * use this scheduling strategy instead. */
-    tile_size.width = round_down_to_power_of_two(
-        max(static_cast<int>(lround(sqrt(num_path_states_per_sample))), 1));
-    tile_size.height = tile_size.width;
+    /* Heuristic here is to have more uniform division of the sample range: for example prefer
+     * [32 <38 times>, 8] over [1024, 200]. This allows to greedily add more tiles early on. */
+    tile_size.num_samples = min(round_up_to_power_of_two(lround(sqrt(num_samples / 2))),
+                                static_cast<uint>(num_samples));
   }
 
-  tile_size.num_samples = min(num_samples,
-                              max_num_path_states / (tile_size.width * tile_size.height));
-
   DCHECK_LE(tile_size.width * tile_size.height * tile_size.num_samples, max_num_path_states);
 
   return tile_size;
diff --git a/intern/cycles/integrator/work_tile_scheduler.cpp b/intern/cycles/integrator/work_tile_scheduler.cpp
index 6557b470164..4d5eeb9da20 100644
--- a/intern/cycles/integrator/work_tile_scheduler.cpp
+++ b/intern/cycles/integrator/work_tile_scheduler.cpp
@@ -88,9 +88,9 @@ bool WorkTileScheduler::get_work(KernelWorkTile *work_tile_, const int max_work_
     return false;
   }
 
-  const int sample_range_index = work_index / total_tiles_num_;
+  const int sample_range_index = work_index % num_tiles_per_sample_range_;
   const int start_sample = sample_range_index * tile_size_.num_samples;
-  const int tile_index = work_index - sample_range_index * total_tiles_num_;
+  const int tile_index = work_index / num_tiles_per_sample_range_;
   const int tile_y = tile_index / num_tiles_x_;
   const int tile_x = tile_index - tile_y * num_tiles_x_;
 
diff --git a/intern/cycles/test/integrator_tile_test.cpp b/intern/cycles/test/integrator_tile_test.cpp
index 7fffb6de06b..32a323683c7 100644
--- a/intern/cycles/test/integrator_tile_test.cpp
+++ b/intern/cycles/test/integrator_tile_test.cpp
@@ -32,19 +32,6 @@ TEST(tile_calculate_best_size, Basic)
             TileSize(1920, 1080, 1));
   EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 100, 1920 * 1080 * 100),
             TileSize(1920, 1080, 100));
-
-  /* Enough path states to only fit few samples of the entire image. */
-  EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 100, 1920 * 1080 * 10),
-            TileSize(455, 455, 100));
-
-  /* Typical non-stressed configuration. */
-  EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 1, 1024 * 1024),
-            TileSize(1024, 1024, 1));
-  EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 8, 1024 * 1024),
-            TileSize(362, 362, 8));
-
-  /* Number of samples is much higher than the state can handle. */
-  EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 10000, 10), TileSize(1, 1, 10));
 }
 
 CCL_NAMESPACE_END


