[Bf-blender-cvs] [5c81d3bd469] master: Geometry Nodes: improve evaluator with lazy threading

Jacques Lucke noreply at git.blender.org
Tue Sep 20 11:08:21 CEST 2022


Commit: 5c81d3bd4691214164a7f071d239ef6c84dba8dc
Author: Jacques Lucke
Date:   Tue Sep 20 10:59:12 2022 +0200
Branches: master
https://developer.blender.org/rB5c81d3bd4691214164a7f071d239ef6c84dba8dc

Geometry Nodes: improve evaluator with lazy threading

In large node setups the threading overhead was sometimes very significant.
That's especially true when most nodes do very little work.

This commit improves the scheduling by not using multi-threading in many
cases unless it's likely to be worth it. For more details see the comments
in `BLI_lazy_threading.hh`.

Differential Revision: https://developer.blender.org/D15976

===================================================================

A	source/blender/blenlib/BLI_lazy_threading.hh
M	source/blender/blenlib/BLI_task.hh
M	source/blender/blenlib/CMakeLists.txt
A	source/blender/blenlib/intern/lazy_threading.cc
M	source/blender/blenlib/intern/task_range.cc
M	source/blender/functions/FN_lazy_function.hh
M	source/blender/functions/FN_lazy_function_execute.hh
M	source/blender/functions/intern/lazy_function.cc
M	source/blender/functions/intern/lazy_function_execute.cc
M	source/blender/functions/intern/lazy_function_graph_executor.cc
M	source/blender/nodes/NOD_geometry_nodes_lazy_function.hh
M	source/blender/nodes/geometry/nodes/node_geo_distribute_points_on_faces.cc
M	source/blender/nodes/intern/geometry_nodes_lazy_function.cc

===================================================================

diff --git a/source/blender/blenlib/BLI_lazy_threading.hh b/source/blender/blenlib/BLI_lazy_threading.hh
new file mode 100644
index 00000000000..61532fe24f0
--- /dev/null
+++ b/source/blender/blenlib/BLI_lazy_threading.hh
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#pragma once
+
+/** \file
+ * \ingroup bli
+ *
+ * The goal of "lazy threading" is to avoid using threads unless one can reasonably assume that it
+ * is worth distributing work over multiple threads. Using threads can lead to worse overall
+ * performance by introducing inter-thread communication overhead. Keeping all work on a single
+ * thread reduces this overhead to zero and also makes better use of the CPU cache.
+ *
+ * Functions like #parallel_for also solve this to some degree by using a "grain size". When the
+ * number of individual tasks is too small, no multi-threading is used. This works very well when
+ * there are many homogeneous tasks that can be expected to take approximately the same time.
+ *
+ * The situation becomes more difficult when:
+ * - The individual tasks are not homogeneous, i.e. they take different amounts of time to compute.
+ * - It is practically impossible to guess how long each task will take in advance.
+ *
+ * Given those constraints, a single grain size cannot be determined. One could just schedule all
+ * tasks individually but that would create a lot of overhead when the tasks happen to be very
+ * small. While TBB will keep all tasks on a single thread if the other threads are busy, idle
+ * threads will start stealing the work even if that's not beneficial for overall
+ * performance.
+ *
+ * This file provides a simple API that allows a task scheduler to properly handle tasks whose size
+ * is not known in advance. The key idea is this:
+ *
+ * > By default, all work stays on a single thread. If an individual task notices that it is
+ * > about to start a computation that will take a while, it notifies the task scheduler further
+ * > up on the stack. The scheduler then allows other threads to take over other tasks that were
+ * > originally meant for the current thread.
+ *
+ * This way, when all tasks are small, no threading overhead has to be paid. Whenever there is
+ * a task that keeps the current thread busy for a while, the other tasks are moved to a separate
+ * thread so that they can be executed without waiting for the long computation to finish.
+ *
+ * Consequently, the earlier a task knows during its execution that it will take a while, the
+ * better. That's because if it is blocking anyway, it's more efficient to move the other tasks to
+ * another thread earlier.
+ *
+ * To make this work, three things have to be solved:
+ * 1. The task scheduler has to be able to start single-threaded and become multi-threaded after
+ *    tasks have started executing. This has to be solved in the specific task scheduler.
+ * 2. There has to be a way for the currently running task to tell the task scheduler that it is
+ *    about to perform a computation that will take a while and that it would be reasonable to move
+ *    other tasks to other threads. This part is implemented in the API provided by this file.
+ * 3. Individual tasks have to decide when a computation is long enough to justify talking to the
+ *    scheduler. This is always based on heuristics that have to be fine-tuned over time. One could
+ *    assume that this means adding new work-size checks to many parts of Blender, but that's
+ *    actually not necessary, because these checks already exist in the form of grain sizes passed
+ *    to e.g. #parallel_for. The assumption here is that when a task considers its current workload
+ *    big enough to justify using threads, it is also big enough to justify moving the tasks that
+ *    are waiting on the current thread to another thread.
+ */
+
+#include "BLI_function_ref.hh"
+
+namespace blender::lazy_threading {
+
+/**
+ * Tell the task schedulers on the current thread that the current task is about to start a long
+ * computation and that other waiting tasks should be moved to another thread if possible.
+ */
+void send_hint();
+
+/**
+ * Used by the task scheduler to receive hints from current tasks that they will take a while.
+ * This should only be allocated on the stack.
+ */
+class HintReceiver {
+ public:
+  /**
+   * The passed in function is called when a task signals that it will take a while.
+   * \note The function has to outlive the #HintReceiver, so one must not pass a lambda directly
+   * into this constructor but store it in a separate variable on the stack first.
+   */
+  HintReceiver(FunctionRef<void()> fn);
+  ~HintReceiver();
+};
+
+}  // namespace blender::lazy_threading
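
The header above only declares the API. As a rough illustration of how a scheduler could hook into it, here is a minimal, hypothetical sketch that is not part of this commit; the function and variable names are invented, and the callback is deliberately stored in a named stack variable first, as the constructor comment requires.

#include <atomic>

#include "BLI_lazy_threading.hh"

/* Hypothetical scheduler-side sketch: start single-threaded and only allow work
 * stealing once some task reports that it is about to do a lot of work. */
static void run_pending_tasks_lazily()
{
  std::atomic<bool> allow_multi_threading{false};

  /* Store the callable on the stack first; #HintReceiver only keeps a #FunctionRef to it. */
  auto on_hint = [&]() { allow_multi_threading.store(true, std::memory_order_relaxed); };
  blender::lazy_threading::HintReceiver hint_receiver{on_hint};

  /* Execute queued tasks on this thread. Once one of them calls
   * blender::lazy_threading::send_hint(), the flag flips and the scheduler may hand the
   * remaining tasks over to worker threads. */
  /* ... scheduler loop elided ... */
}
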
diff --git a/source/blender/blenlib/BLI_task.hh b/source/blender/blenlib/BLI_task.hh
index 33a781d3749..9f9a57be634 100644
--- a/source/blender/blenlib/BLI_task.hh
+++ b/source/blender/blenlib/BLI_task.hh
@@ -31,6 +31,7 @@
 #endif
 
 #include "BLI_index_range.hh"
+#include "BLI_lazy_threading.hh"
 #include "BLI_utildefines.h"
 
 namespace blender::threading {
@@ -56,6 +57,7 @@ void parallel_for(IndexRange range, int64_t grain_size, const Function &function
 #ifdef WITH_TBB
   /* Invoking tbb for small workloads has a large overhead. */
   if (range.size() >= grain_size) {
+    lazy_threading::send_hint();
     tbb::parallel_for(
         tbb::blocked_range<int64_t>(range.first(), range.one_after_last(), grain_size),
         [&](const tbb::blocked_range<int64_t> &subrange) {
@@ -78,6 +80,7 @@ Value parallel_reduce(IndexRange range,
 {
 #ifdef WITH_TBB
   if (range.size() >= grain_size) {
+    lazy_threading::send_hint();
     return tbb::parallel_reduce(
         tbb::blocked_range<int64_t>(range.first(), range.one_after_last(), grain_size),
         identity,
@@ -114,6 +117,7 @@ template<typename... Functions>
 void parallel_invoke(const bool use_threading, Functions &&...functions)
 {
   if (use_threading) {
+    lazy_threading::send_hint();
     parallel_invoke(std::forward<Functions>(functions)...);
   }
   else {
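
On the caller side nothing changes: the grain size that callers already pass acts as the heuristic. A hypothetical example, with invented names and not taken from this commit:

#include "BLI_span.hh"
#include "BLI_task.hh"

/* Hypothetical node computation: with fewer than 1024 elements everything stays on the current
 * thread and no hint is sent; with more, the hint is sent right before TBB is invoked, so tasks
 * waiting on this thread can migrate elsewhere. */
static void compute_values(blender::MutableSpan<float> values)
{
  blender::threading::parallel_for(
      values.index_range(), 1024, [&](const blender::IndexRange range) {
        for (const int64_t i : range) {
          values[i] = float(i) * 0.5f;
        }
      });
}
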
diff --git a/source/blender/blenlib/CMakeLists.txt b/source/blender/blenlib/CMakeLists.txt
index 50dc11cbf0a..36acbd41ad7 100644
--- a/source/blender/blenlib/CMakeLists.txt
+++ b/source/blender/blenlib/CMakeLists.txt
@@ -85,6 +85,7 @@ set(SRC
   intern/kdtree_3d.c
   intern/kdtree_4d.c
   intern/lasso_2d.c
+  intern/lazy_threading.cc
   intern/length_parameterize.cc
   intern/listbase.c
   intern/math_base.c
diff --git a/source/blender/blenlib/intern/lazy_threading.cc b/source/blender/blenlib/intern/lazy_threading.cc
new file mode 100644
index 00000000000..803fd81a96d
--- /dev/null
+++ b/source/blender/blenlib/intern/lazy_threading.cc
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#include "BLI_lazy_threading.hh"
+#include "BLI_vector.hh"
+
+namespace blender::lazy_threading {
+
+/**
+ * This is a #RawVector so that it can be destructed after Blender checks for memory leaks.
+ */
+thread_local RawVector<FunctionRef<void()>, 0> hint_receivers;
+
+void send_hint()
+{
+  for (const FunctionRef<void()> &fn : hint_receivers) {
+    fn();
+  }
+}
+
+HintReceiver::HintReceiver(const FunctionRef<void()> fn)
+{
+  hint_receivers.append(fn);
+}
+
+HintReceiver::~HintReceiver()
+{
+  hint_receivers.pop_last();
+}
+
+}  // namespace blender::lazy_threading
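
Because the receivers live in a thread-local stack, they nest: a single send_hint() notifies every scheduler currently on the stack of this thread, from outermost to innermost. A small hypothetical illustration, not part of the commit:

#include <iostream>

#include "BLI_lazy_threading.hh"

static void nested_receiver_example()
{
  auto outer_cb = []() { std::cout << "outer scheduler notified\n"; };
  blender::lazy_threading::HintReceiver outer{outer_cb};
  {
    auto inner_cb = []() { std::cout << "inner scheduler notified\n"; };
    blender::lazy_threading::HintReceiver inner{inner_cb};

    /* Both callbacks run, in the order the receivers were pushed. */
    blender::lazy_threading::send_hint();
  }
  /* The inner receiver has been popped again; only the outer callback runs now. */
  blender::lazy_threading::send_hint();
}
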
diff --git a/source/blender/blenlib/intern/task_range.cc b/source/blender/blenlib/intern/task_range.cc
index 7e405529f03..181b760bea1 100644
--- a/source/blender/blenlib/intern/task_range.cc
+++ b/source/blender/blenlib/intern/task_range.cc
@@ -12,6 +12,7 @@
 
 #include "DNA_listBase.h"
 
+#include "BLI_lazy_threading.hh"
 #include "BLI_task.h"
 #include "BLI_threads.h"
 
@@ -104,6 +105,8 @@ void BLI_task_parallel_range(const int start,
     const size_t grainsize = MAX2(settings->min_iter_per_thread, 1);
     const tbb::blocked_range<int> range(start, stop, grainsize);
 
+    blender::lazy_threading::send_hint();
+
     if (settings->func_reduce) {
       parallel_reduce(range, task);
       if (settings->userdata_chunk) {
diff --git a/source/blender/functions/FN_lazy_function.hh b/source/blender/functions/FN_lazy_function.hh
index 59a3a90b0b0..4a539e7cbd1 100644
--- a/source/blender/functions/FN_lazy_function.hh
+++ b/source/blender/functions/FN_lazy_function.hh
@@ -43,6 +43,13 @@
 #include "BLI_linear_allocator.hh"
 #include "BLI_vector.hh"
 
+#include <atomic>
+#include <thread>
+
+#ifdef DEBUG
+#  define FN_LAZY_FUNCTION_DEBUG_THREADS
+#endif
+
 namespace blender::fn::lazy_function {
 
 enum class ValueUsage {
@@ -102,9 +109,13 @@ class Params {
    * The lazy-function this #Params has been prepared for.
    */
   const LazyFunction &fn_;
+#ifdef FN_LAZY_FUNCTION_DEBUG_THREADS
+  std::thread::id main_thread_id_;
+  std::atomic<bool> allow_multi_threading_;
+#endif
 
  public:
-  Params(const LazyFunction &fn);
+  Params(const LazyFunction &fn, bool allow_multi_threading_initially);
 
   /**
    * Get a pointer to an input value if the value is available already. Otherwise null is returned.
@@ -154,7 +165,7 @@ class Params {
    * Typed utility methods that wrap the methods above.
    */
   template<typename T> T extract_input(int index);
-  template<typename T> const T &get_input(int index);
+  template<typename T> const T &get_input(int index) const;
   template<typename T> T *try_get_input_data_ptr_or_request(int index);
   template<typename T> void set_output(int index, T &&value);
 
@@ -163,7 +174,15 @@ class Params {
    */
   void set_default_remaining_outputs();
 
+  /**
+   * Returns true when the lazy-function is now allowed to use multi-threading when interacting
+   * with this #Params. That means it is allowed to call non-const methods from different threads.
+   */
+  bool try_enable_multi_threading();
+
  private:
+  void assert_valid_thread() const;
+
   /**
    * Methods that need to be implemented by subclasses. Those are separate from the non-virtual
    * methods above to make it easy to insert additional debugging logic on top of the
@@ -176,6 +195,7 @@ class Params {
   virtual bool output_was_set_impl(int index) const = 0;
   virtual ValueUsage get_output_usage_impl(int index) const = 0;
   virtu

@@ Diff output truncated at 10240 characters. @@
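
The truncated remainder of the diff adds Params::try_enable_multi_threading() and the thread-assertion helpers. Based only on the declarations visible above, a lazy-function body might use the new method roughly as sketched below; this is not from the commit, MyLazyFunction and the compute_* helpers are hypothetical, the lf alias is assumed, and the execute_impl signature is assumed to be the usual LazyFunction override.

#include "BLI_task.hh"
#include "FN_lazy_function.hh"

namespace lf = blender::fn::lazy_function;

/* MyLazyFunction is a hypothetical lf::LazyFunction subclass; only its execute_impl override is
 * sketched here, and compute_first_output()/compute_second_output() are invented helpers. */
void MyLazyFunction::execute_impl(lf::Params &params, const lf::Context & /*context*/) const
{
  if (params.try_enable_multi_threading()) {
    /* Non-const methods on #Params may now be called from multiple threads. */
    blender::threading::parallel_invoke(
        true,
        [&]() { params.set_output(0, this->compute_first_output()); },
        [&]() { params.set_output(1, this->compute_second_output()); });
  }
  else {
    /* Multi-threading could not be enabled; keep everything on the current thread. */
    params.set_output(0, this->compute_first_output());
    params.set_output(1, this->compute_second_output());
  }
}
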



More information about the Bf-blender-cvs mailing list