[Bf-blender-cvs] [17774afed1a] cycles-x: Cycles X: Remove usage of mega-kernel
Sergey Sharybin
noreply at git.blender.org
Wed May 19 19:18:18 CEST 2021
Commit: 17774afed1a3f17f32eddef48155c6ed65e03099
Author: Sergey Sharybin
Date: Wed May 19 16:37:30 2021 +0200
Branches: cycles-x
https://developer.blender.org/rB17774afed1a3f17f32eddef48155c6ed65e03099
Cycles X: Remove usage of mega-kernel
The usage of the mega-kernel is commented out with this change.
There are a few benefits of removing the mega-kernel:
- It takes extra time to compile and space to ship.
- It is not compatible with features like shadow catcher.
The rest of the changes are an attempt to avoid performance
loss in various scenes. Those changes include:
- Make work tile smaller in size. This makes the work tile more
friendly for greedy scheduling when adaptive sampling is used.
Currently this is achieved by keeping the pixel size the same and
lowering the number of samples per work tile. The idea behind this
is to avoid dramatic change in order in which pixels are
scheduled for sampling.
- Keep tile size dimensions a power of two.
This lowers the number of unused path states, which can be watched with
./bin//blender --debug-cycles --verbose 3 2>&1 | grep "Number of unused path states"
In our own tests it seems that we barely "waste" path states now.
- Make it so tiles are scheduled in the order of samples first.
As in: keep pixel-space coherency, similar to how it is done
in `get_work_pixel()`.
- Only keep extreme case tests for the tile size calculation.
Avoids some unnecessary updates, while still ensuring correct
behavior in extremes.
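The tile size and scheduling heuristics listed above can be sketched in isolation as follows. This is a simplified illustration, not the code from this commit: the `*Sketch` types and `sketch_*` functions are hypothetical names, the power-of-two helpers are stand-ins for the Cycles utilities, the image size and the special cases handled by the real `tile_calculate_best_size()` are ignored, and `num_sample_ranges` is assumed to be the number of sample chunks scheduled for each tile position.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in: largest power of two <= x (returns 1 for x == 0).
inline uint32_t round_down_to_power_of_two(uint32_t x)
{
  uint32_t result = 1;
  while (result <= x / 2) {
    result *= 2;
  }
  return result;
}

// Hypothetical stand-in: smallest power of two >= x (x assumed to be <= 2^31).
inline uint32_t round_up_to_power_of_two(uint32_t x)
{
  uint32_t result = 1;
  while (result < x) {
    result *= 2;
  }
  return result;
}

struct TileSizeSketch {
  int width, height, num_samples;
};

// Assumes num_samples >= 1 and a path state budget comfortably larger than num_samples.
inline TileSizeSketch sketch_tile_size(int num_samples, int max_num_path_states)
{
  const int states_per_sample = std::max(max_num_path_states / num_samples, 1);
  const double side = std::sqrt(static_cast<double>(states_per_sample));

  TileSizeSketch tile;
  // Square tile with a power-of-two side, so tiles pack evenly into the budget.
  tile.width = static_cast<int>(
      round_down_to_power_of_two(static_cast<uint32_t>(std::lround(side))));
  tile.height = tile.width;

  if (num_samples == 1) {
    tile.num_samples = 1;
  }
  else {
    // Few samples per tile: many small tiles are easier to schedule greedily.
    const double root = std::sqrt(static_cast<double>(num_samples / 2));
    tile.num_samples = static_cast<int>(
        std::min(round_up_to_power_of_two(static_cast<uint32_t>(std::lround(root))),
                 static_cast<uint32_t>(num_samples)));
  }

  // Mirrors the DCHECK_LE in the real function.
  assert(static_cast<long long>(tile.width) * tile.height * tile.num_samples <=
         max_num_path_states);
  return tile;
}

struct WorkItemSketch {
  int tile_x, tile_y, start_sample;
};

// Samples-first ordering: consecutive work indices stay on the same tile position
// and walk through its sample ranges before moving on to the next tile.
inline WorkItemSketch sketch_work_item(int work_index,
                                       int num_tiles_x,
                                       int num_sample_ranges,
                                       int samples_per_tile)
{
  const int sample_range_index = work_index % num_sample_ranges;
  const int tile_index = work_index / num_sample_ranges;

  WorkItemSketch item;
  item.tile_y = tile_index / num_tiles_x;
  item.tile_x = tile_index - item.tile_y * num_tiles_x;
  item.start_sample = sample_range_index * samples_per_tile;
  return item;
}
```

With a budget of 1024 * 1024 path states and 8 samples, `sketch_tile_size(8, 1024 * 1024)` gives a 256x256 tile with 2 samples. For 1224 samples it picks 32 samples per tile, which corresponds to the [32 <38 times>, 8] split mentioned in the code comment.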
The timings are as follows:
```
RTX 6000 (Turing)
                             new        cycles-x
bmw27.blend                  10.8964    10.4269
classroom.blend              17.4476    16.6609
pabellon.blend               9.77167    9.14966
monster.blend                10.3662    12.0106
barbershop_interior.blend    11.9445    12.5769
junkshop.blend               16.3556    16.5213
pvt_flat.blend               16.5317    17.4047

RTX A6000 (Ampere)
                             new        cycles-x
bmw27.blend                  7.74059    7.65293
classroom.blend              10.775     10.9143
pabellon.blend               6.00643    5.85334
monster.blend                6.79277    8.0134
barbershop_interior.blend    8.39941    8.47159
junkshop.blend               10.4258    10.9882
pvt_flat.blend               10.2752    10.8821
```
Not entirely happy with the results: there are some very nice speedups
interleaved with some slowdowns. The slowdowns are mostly within 5%, though, so
the hope is that we can gain them back with more tricks up our sleeves.
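To put numbers on the speedups and slowdowns, the relative change per scene can be computed from the timing table. The helper below is a small sketch with a hypothetical name. It assumes the "new" column is the timing with this change, the "cycles-x" column is the branch before it, and lower values are better.

```cpp
#include <cmath>
#include <cstdio>

// Relative change of the "new" timing against the "cycles-x" baseline, in percent.
// Positive values mean the new code is slower, negative values mean it is faster.
inline double relative_change_percent(double new_time, double baseline_time)
{
  return (new_time - baseline_time) / baseline_time * 100.0;
}

// Print one row of the comparison for a scene from the timing table.
inline void print_relative_change(const char *scene, double new_time, double baseline_time)
{
  std::printf("%-28s %+.1f%%\n", scene, relative_change_percent(new_time, baseline_time));
}
```

For example, `relative_change_percent(10.8964, 10.4269)` (bmw27.blend on the RTX 6000) is about +4.5, while `relative_change_percent(10.3662, 12.0106)` (monster.blend on the same card) is about -13.7.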
Some things to try:
- Try lowering tile size in pixels
- Try better alignment of tile size with number of threads on a
multiprocessor.
This change is a combined brain activity from Brecht and myself.
Differential Revision: https://developer.blender.org/D11311
===================================================================
M intern/cycles/integrator/path_trace_work_gpu.cpp
M intern/cycles/integrator/tile.cpp
M intern/cycles/integrator/work_tile_scheduler.cpp
M intern/cycles/test/integrator_tile_test.cpp
===================================================================
diff --git a/intern/cycles/integrator/path_trace_work_gpu.cpp b/intern/cycles/integrator/path_trace_work_gpu.cpp
index 889079ba98b..f7db88fe126 100644
--- a/intern/cycles/integrator/path_trace_work_gpu.cpp
+++ b/intern/cycles/integrator/path_trace_work_gpu.cpp
@@ -221,6 +221,7 @@ bool PathTraceWorkGPU::enqueue_path_iteration()
return false;
}
+#if 0
/* Megakernel does not support state split, so disable for the shadow catcher.
* It is possible to make it work, but currently we are planning to make the megakernel
* obsolete for the GPU rendering, so we don't spend time on making shadow catcher to work
@@ -235,6 +236,7 @@ bool PathTraceWorkGPU::enqueue_path_iteration()
return true;
}
}
+#endif
/* Find kernel to execute, with max number of queued paths. */
int max_num_queued = 0;
diff --git a/intern/cycles/integrator/tile.cpp b/intern/cycles/integrator/tile.cpp
index 3e2e994512b..87524f329af 100644
--- a/intern/cycles/integrator/tile.cpp
+++ b/intern/cycles/integrator/tile.cpp
@@ -37,6 +37,15 @@ ccl_device_inline uint round_down_to_power_of_two(uint x)
return prev_power_of_two(x);
}
+ccl_device_inline uint round_up_to_power_of_two(uint x)
+{
+ if (is_power_of_two(x)) {
+ return x;
+ }
+
+ return next_power_of_two(x);
+}
+
TileSize tile_calculate_best_size(const int2 &image_size,
const int num_samples,
const int max_num_path_states)
@@ -55,31 +64,31 @@ TileSize tile_calculate_best_size(const int2 &image_size,
}
/* The idea here is to keep number of samples per tile as much as possible to improve coherency
- * across threads. */
-
- const int num_path_states_per_sample = max(max_num_path_states / num_samples, 1);
+ * across threads.
+ *
+ * Some general ideas:
+ * - Prefer smaller tiles with more samples, which improves spatial coherency of paths.
+ * - Keep values a power of two, for more integer fit into the maximum number of paths. */
TileSize tile_size;
- if (true) {
- /* Occupy as much of GPU threads as possible by the single tile.
- * This could cause non-optimal load due to "wasted" path states (due to non-integer division)
- * but currently it gives better performance. Possibly that coalescing will help with. */
- tile_size.width = max(static_cast<int>(lround(sqrt(num_path_states_per_sample))), 1);
- tile_size.height = max(num_path_states_per_sample / tile_size.width, 1);
+ /* Calculate tile size as if it is the most possible one to fit an entire range of samples.
+ * The idea here is to keep tiles as small as possible, and keep device occupied by scheduling
+ * multiple tiles with the same coordinates rendering different samples. */
+ const int num_path_states_per_sample = max_num_path_states / num_samples;
+ tile_size.width = round_down_to_power_of_two(lround(sqrt(num_path_states_per_sample)));
+ tile_size.height = tile_size.width;
+
+ if (num_samples == 1) {
+ tile_size.num_samples = 1;
}
else {
- /* Round down to the power of two, so that all path states are occupied. */
- /* TODO(sergey): Investigate why this is slower than the scheduling based on the code above and
- * use this scheduling strategy instead. */
- tile_size.width = round_down_to_power_of_two(
- max(static_cast<int>(lround(sqrt(num_path_states_per_sample))), 1));
- tile_size.height = tile_size.width;
+ /* Heuristic here is to have more uniform division of the sample range: for example prefer
+ * [32 <38 times>, 8] over [1024, 200]. This allows to greedily add more tiles early on. */
+ tile_size.num_samples = min(round_up_to_power_of_two(lround(sqrt(num_samples / 2))),
+ static_cast<uint>(num_samples));
}
- tile_size.num_samples = min(num_samples,
- max_num_path_states / (tile_size.width * tile_size.height));
-
DCHECK_LE(tile_size.width * tile_size.height * tile_size.num_samples, max_num_path_states);
return tile_size;
diff --git a/intern/cycles/integrator/work_tile_scheduler.cpp b/intern/cycles/integrator/work_tile_scheduler.cpp
index 6557b470164..4d5eeb9da20 100644
--- a/intern/cycles/integrator/work_tile_scheduler.cpp
+++ b/intern/cycles/integrator/work_tile_scheduler.cpp
@@ -88,9 +88,9 @@ bool WorkTileScheduler::get_work(KernelWorkTile *work_tile_, const int max_work_
return false;
}
- const int sample_range_index = work_index / total_tiles_num_;
+ const int sample_range_index = work_index % num_tiles_per_sample_range_;
const int start_sample = sample_range_index * tile_size_.num_samples;
- const int tile_index = work_index - sample_range_index * total_tiles_num_;
+ const int tile_index = work_index / num_tiles_per_sample_range_;
const int tile_y = tile_index / num_tiles_x_;
const int tile_x = tile_index - tile_y * num_tiles_x_;
diff --git a/intern/cycles/test/integrator_tile_test.cpp b/intern/cycles/test/integrator_tile_test.cpp
index 7fffb6de06b..32a323683c7 100644
--- a/intern/cycles/test/integrator_tile_test.cpp
+++ b/intern/cycles/test/integrator_tile_test.cpp
@@ -32,19 +32,6 @@ TEST(tile_calculate_best_size, Basic)
TileSize(1920, 1080, 1));
EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 100, 1920 * 1080 * 100),
TileSize(1920, 1080, 100));
-
- /* Enough path states to only fit few samples of the entire image. */
- EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 100, 1920 * 1080 * 10),
- TileSize(455, 455, 100));
-
- /* Typical non-stressed configuration. */
- EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 1, 1024 * 1024),
- TileSize(1024, 1024, 1));
- EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 8, 1024 * 1024),
- TileSize(362, 362, 8));
-
- /* Number of samples is much higher than the state can handle. */
- EXPECT_EQ(tile_calculate_best_size(make_int2(1920, 1080), 10000, 10), TileSize(1, 1, 10));
}
CCL_NAMESPACE_END