[Bf-blender-cvs] [369cd24495b] temp-vse-proxies: Improve proxy building performance

Tue Feb 23 15:07:25 CET 2021

Commit: 369cd24495bdff070dd72c120d562f8695184afd
Author: Richard Antalik
Date:   Tue Feb 23 14:54:20 2021 +0100
Branches: temp-vse-proxies
https://developer.blender.org/rB369cd24495bdff070dd72c120d562f8695184afd

Improve proxy building performance

====Principle of operation====
The proxy rebuild job spawns 2 threads: one reads packets from the
source file and the other writes transcoded packets to the output file.
This is done by the functions `index_ffmpeg_read_packets()` and
`index_ffmpeg_write_frames()`. These threads work quickly and use
little CPU.
Transcoding of the read packets is done in a thread pool by the function
`index_ffmpeg_transcode_packets()`.

This scheme is used because transcoded packets must be read and written
in order, as if they were transcoded in a single loop. The transcoding
itself can happen fairly asynchronously (see next paragraph).

Because decoding must always start on an I-frame, each GOP is fed to a
transcoding thread as a whole. Some files may not have enough GOPs to
feed all threads. In that case, the performance gain won't be as great.
This is a relatively rare case though.
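
A minimal sketch of the GOP chunking idea, not the code in this patch.
`SketchPacket` and both functions are hypothetical names, and the
round-robin mapping of chunks to threads is an assumption made only for
illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a demuxed packet; only the key frame flag matters here. */
typedef struct SketchPacket {
  bool is_keyframe;
  int gop_chunk; /* Filled in by sketch_assign_gop_chunks(). */
} SketchPacket;

/* Number packets by GOP chunk: a new chunk starts at every key frame.
 * Returns the number of chunks found. */
static int sketch_assign_gop_chunks(SketchPacket *packets, size_t count)
{
  int chunk = -1;
  for (size_t i = 0; i < count; i++) {
    /* The first packet always opens chunk 0, even if it is not a key frame. */
    if (packets[i].is_keyframe || chunk < 0) {
      chunk++;
    }
    packets[i].gop_chunk = chunk;
  }
  return chunk + 1;
}

/* Assumed round-robin mapping of GOP chunks to transcode threads. */
static int sketch_thread_for_chunk(int gop_chunk, int num_threads)
{
  assert(num_threads > 0);
  return gop_chunk % num_threads;
}
```

With this mapping, a file with 3 GOPs keeps at most 3 threads busy no
matter how many threads the pool has, which is the case described above
where the performance gain is smaller.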

According to the FFmpeg docs, some packets may contain multiple frames,
but in that case only the first frame is decoded. I am not sure whether
this is a limitation of FFmpeg or whether it is possible to decode these
frames, but the previous proxy building implementation didn't handle
this case either.

Similarly, there is an assumption that decoding any number of packets
in a GOP chunk will produce the same number of output packets. This must
always be true, otherwise we couldn't map proxy frames to the original
perfectly. Therefore it should be possible to increment the input and
output packet containers independently, and one of them can be
manipulated "blindly". For example, decoded frames sometimes lag behind
packets by 1, 2 or more steps, and sometimes they are output
immediately. It depends on the codec. But the number of packets fed to
the decoder must match the number of frames received.
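
The counting assumption can be illustrated with a toy model. This is
only a sketch: `SketchDecoder` and its functions are hypothetical and do
not call FFmpeg. The model holds back `delay` frames and releases them
when flushed at the end of a GOP chunk:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy decoder model, not FFmpeg API. */
typedef struct SketchDecoder {
  int delay;    /* Frames the codec holds back before it starts outputting. */
  int buffered; /* Packets accepted but not yet output as frames. */
} SketchDecoder;

/* Feed one packet; returns true when a frame comes out in the same step. */
static bool sketch_send_packet(SketchDecoder *dec)
{
  dec->buffered++;
  if (dec->buffered > dec->delay) {
    dec->buffered--;
    return true;
  }
  return false;
}

/* Drain the decoder at the end of a GOP chunk; returns the remaining frames. */
static int sketch_flush(SketchDecoder *dec)
{
  int remaining = dec->buffered;
  dec->buffered = 0;
  return remaining;
}

/* Count the frames produced for one GOP chunk of num_packets packets. */
static int sketch_count_frames(int num_packets, int delay)
{
  assert(num_packets >= 0 && delay >= 0);
  SketchDecoder dec = {delay, 0};
  int frames = 0;
  for (int i = 0; i < num_packets; i++) {
    frames += sketch_send_packet(&dec) ? 1 : 0;
  }
  return frames + sketch_flush(&dec);
}
```

In this model the frame count equals the packet count for any delay.
That is why the output container can be advanced without inspecting
which packet a given frame came from.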

Transcoding contexts are allocated only when the building process
starts, because these contexts use a lot of RAM.
`avcodec_copy_context()` is used to allocate input and output codec
contexts. These have to be unique for each thread. The `SwsContext`
also needs to be unique, but it is not copied, because it is needed
only for transcoding.

====Job coordination====
If the output file cannot be written to disk fast enough, transcoded
packets will accumulate in RAM, potentially filling it up completely.
This isn't much of a problem on an SSD, but on an HDD it can easily
happen. Therefore packets are read in sync with packets written, with
a lookahead.
When building all 4 sizes for a 1080p movie, writing speed averages
80 MB/s.

During operation, packets are read in advance. The lookahead is the
number of GOPs to read ahead. This is needed because all transcoding
threads must have packets to decode, and each thread works on a whole
GOP chunk.
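
A sketch of how such a gate could look, assuming the reader compares
the next GOP chunk against `last_gop_chunk_written` from the builder
context. The function name and the exact comparison are assumptions,
not taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical gate: the reader may run at most `lookahead` GOP chunks
 * ahead of the last chunk the writer has finished. */
static bool sketch_reader_may_read(int next_gop_chunk, int last_gop_chunk_written, int lookahead)
{
  assert(lookahead > 0);
  return next_gop_chunk <= last_gop_chunk_written + lookahead;
}
```

With a lookahead of 4 and nothing written yet (`last_gop_chunk_written`
set to -1), chunks 0 to 3 may be read and the reader suspends before
chunk 4. This bounds the number of packets held in RAM while still
giving every transcoding thread a chunk to work on.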

Jobs are suspended when needed using thread conditions and wake signals.
Threads suspend on their own and are resumed in a ring scheme:

```
read_packets -> transcode -> write_packets
    ^                              |
    |______________________________|
```
In addition, when any of the threads above is done or cancelled, it will
resume the building job to free data and finish the building process.
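
A sketch of one suspend/wake pair, written with plain pthreads as a
stand-in for the `ThreadCondition` and `ThreadMutex` types used in the
patch. `SketchGate` and the `wake_requested` flag are hypothetical; the
flag is an assumption added so that a wake signal sent before the
thread suspends is not lost:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical suspend gate: one per thread in the ring. */
typedef struct SketchGate {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  bool wake_requested;
} SketchGate;

static void sketch_gate_init(SketchGate *gate)
{
  int err = pthread_mutex_init(&gate->mutex, NULL);
  assert(err == 0);
  err = pthread_cond_init(&gate->cond, NULL);
  assert(err == 0);
  (void)err;
  gate->wake_requested = false;
}

/* Called by a thread on itself: sleep until the previous stage wakes it. */
static void sketch_gate_suspend(SketchGate *gate)
{
  pthread_mutex_lock(&gate->mutex);
  /* Loop to guard against spurious wakeups. */
  while (!gate->wake_requested) {
    pthread_cond_wait(&gate->cond, &gate->mutex);
  }
  gate->wake_requested = false;
  pthread_mutex_unlock(&gate->mutex);
}

/* Called by the previous stage in the ring to resume the next one. */
static void sketch_gate_wake(SketchGate *gate)
{
  pthread_mutex_lock(&gate->mutex);
  gate->wake_requested = true;
  pthread_cond_signal(&gate->cond);
  pthread_mutex_unlock(&gate->mutex);
}
```

In the ring above, the reader would call `sketch_gate_wake()` on a
transcode gate, a transcode thread on the writer gate, and the writer on
the reader gate.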

====Performance====
On my machine (16 cores) the building process is about 9x faster.
Before I introduced job coordination, transcoding was 14x faster.
So there is still some room for optimization; perhaps the wakeup
frequency is too high or threads are put to sleep unnecessarily.

---------

====Code layout====
I am using `FFmpegIndexBuilderContext` as the "root" context for storing contexts.
The transcode job is wrapped in `TranscodeJob` because I need to pass the thread number, which determines which GOP chunks this job will work on.
`output_packet_wrap` and `source_packet_wrap` wrap `AVPacket` with some additional information like the GOP chunk number (currently `i_frame_segment`).
These 2 structs could be consolidated, which would simplify some auxiliary logic. This is a slightly tricky part, because sometimes `output_packet_wrap` must
lag one step behind `source_packet_wrap`, and this needs to be managed properly when jumping between GOP chunks.

Other than that, I am not super happy with the amount of code and setup this patch adds. But it doesn't look like anything could be simplified significantly.

====Problems / TODO====
I am not aware of any bugs currently.

Maniphest Tasks: T85469

Differential Revision: https://developer.blender.org/D10394

===================================================================

M	source/blender/imbuf/intern/indexer.c

===================================================================

diff --git a/source/blender/imbuf/intern/indexer.c b/source/blender/imbuf/intern/indexer.c
index c9581c108c0..aabe8fdec38 100644
--- a/source/blender/imbuf/intern/indexer.c
+++ b/source/blender/imbuf/intern/indexer.c
@@ -451,19 +451,87 @@ typedef struct IndexBuildContext {
 
 #ifdef WITH_FFMPEG
 
+struct input_ctx {
+  AVFormatContext *format_context;
+  AVCodecContext *codec_context;
+  AVCodec *codec;
+  AVStream *stream;
+  int video_stream;
+};
+
 struct proxy_output_ctx {
-  AVFormatContext *of;
-  AVStream *st;
-  AVCodecContext *c;
+  AVFormatContext *output_format;
+  AVStream *stream;
   AVCodec *codec;
-  struct SwsContext *sws_ctx;
-  AVFrame *frame;
+  AVCodecContext *codec_context;
   int cfra;
   int proxy_size;
-  int orig_height;
   struct anim *anim;
 };
 
+struct transcode_output_ctx {
+  AVCodecContext *codec_context;
+  struct SwsContext *sws_ctx;
+  int orig_height;
+} transcode_output_ctx;
+
+struct proxy_transcode_ctx {
+  AVCodecContext *input_codec_context;
+  struct transcode_output_ctx *output_context[IMB_PROXY_MAX_SLOT];
+};
+
+typedef struct FFmpegIndexBuilderContext {
+  /* Common data for building process. */
+  int anim_type;
+  struct anim *anim;
+  int quality;
+  int num_proxy_sizes;
+  int num_indexers;
+  int num_transcode_threads;
+  IMB_Timecode_Type tcs_in_use;
+  IMB_Proxy_Size proxy_sizes_in_use;
+
+  /* Builder contexts. */
+  struct input_ctx *input_ctx;
+  struct proxy_output_ctx *proxy_ctx[IMB_PROXY_MAX_SLOT];
+  struct proxy_transcode_ctx **transcode_context_array;
+  anim_index_builder *indexer[IMB_TC_MAX_SLOT];
+
+  /* Common data for transcoding. */
+  GHash *source_packets;
+  GHash *transcoded_packets;
+
+  /* Job coordination. */
+  ThreadMutex packet_access_mutex;
+  ThreadCondition reader_suspend_cond;
+  ThreadMutex reader_suspend_mutex;
+  ThreadCondition **transcode_suspend_cond;
+  ThreadMutex **transcode_suspend_mutex;
+  ThreadCondition writer_suspend_cond;
+  ThreadMutex writer_suspend_mutex;
+  ThreadCondition builder_suspend_cond;
+  ThreadMutex builder_suspend_mutex;
+  bool all_packets_read;
+  int transcode_jobs_done;
+  int last_gop_chunk_written;
+  bool all_packets_written;
+  short *stop;
+  short *do_update;
+  float *progress;
+
+  /* TC index building */
+  uint64_t seek_pos;
+  uint64_t last_seek_pos;
+  uint64_t seek_pos_dts;
+  uint64_t seek_pos_pts;
+  uint64_t last_seek_pos_dts;
+  uint64_t start_pts;
+  double frame_rate;
+  double pts_time_base;
+  int frameno, frameno_gapless;
+  int start_pts_set;
+} FFmpegIndexBuilderContext;
+
 // work around stupid swscaler 16 bytes alignment bug...
 
 static int round_up(int x, int mod)
@@ -471,174 +539,113 @@ static int round_up(int x, int mod)
   return x + ((mod - (x % mod)) % mod);
 }
 
-static struct proxy_output_ctx *alloc_proxy_output_ffmpeg(
-    struct anim *anim, AVStream *st, int proxy_size, int width, int height, int quality)
+static struct SwsContext *alloc_proxy_output_sws_context(AVCodecContext *input_codec_ctx,
+                                                         AVCodecContext *proxy_codec_ctx)
 {
-  struct proxy_output_ctx *rv = MEM_callocN(sizeof(struct proxy_output_ctx), "alloc_proxy_output");
+  struct SwsContext *sws_ctx = sws_getContext(input_codec_ctx->width,
+                                              av_get_cropped_height_from_codec(input_codec_ctx),
+                                              input_codec_ctx->pix_fmt,
+                                              proxy_codec_ctx->width,
+                                              proxy_codec_ctx->height,
+                                              proxy_codec_ctx->pix_fmt,
+                                              SWS_FAST_BILINEAR | SWS_PRINT_INFO,
+                                              NULL,
+                                              NULL,
+                                              NULL);
+  return sws_ctx;
+}
 
+static AVFormatContext *alloc_proxy_output_output_format_context(struct anim *anim, int proxy_size)
+{
   char fname[FILE_MAX];
-  int ffmpeg_quality;
 
-  rv->proxy_size = proxy_size;
-  rv->anim = anim;
-
-  get_proxy_filename(rv->anim, rv->proxy_size, fname, true);
+  get_proxy_filename(anim, proxy_size, fname, true);
   BLI_make_existing_file(fname);
 
-  rv->of = avformat_alloc_context();
-  rv->of->oformat = av_guess_format("avi", NULL, NULL);
-
-  BLI_strncpy(rv->of->filename, fname, sizeof(rv->of->filename));
-
-  fprintf(stderr, "Starting work on proxy: %s\n", rv->of->filename);
+  AVFormatContext *format_context = avformat_alloc_context();
+  format_context->oformat = av_guess_format("avi", NULL, NULL);
 
-  rv->st = avformat_new_stream(rv->of, NULL);
-  rv->st->id = 0;
+  BLI_strncpy(format_context->filename, fname, sizeof(format_context->filename));
 
-  rv->c = rv->st->codec;
-  rv->c->thread_count = BLI_system_thread_count();
-  rv->c->thread_type = FF_THREAD_SLICE;
-  rv->c->codec_type = AVMEDIA_TYPE_VIDEO;
-  rv->c->codec_id = AV_CODEC_ID_MJPEG;
-  rv->c->width = width;
-  rv->c->height = height;
-
-  rv->of->oformat->video_codec = rv->c->codec_id;
-  rv->codec = avcodec_find_encoder(rv->c->codec_id);
-
-  if (!rv->codec) {
+  /* Codec stuff must be initialized properly here. */
+  if (avio_open(&format_context->pb, fname, AVIO_FLAG_WRITE) < 0) {
     fprintf(stderr,
-            "No ffmpeg MJPEG encoder available? "
+            "Couldn't open outputfile! "
             "Proxy not built!\n");
-    av_free(rv->of);
+    av_free(format_context);
     return NULL;
   }
 
-  if (rv->codec->pix_fmts) {
-    rv->c->pix_fmt = rv->codec->pix_fmts[0];
-  }
-  else {
-    rv->c->pix_fmt = AV_PIX_FMT_YUVJ420P;
-  }
-
-  rv->c->sample_aspect_ratio = rv->st->sample_aspect_ratio = st->codec->sample_aspect_ratio;
+  return format_context;
+}
 
-  rv->c->time_base.den = 25;
-  rv->c->time_base.num = 1;
-  rv->st->time_base = rv->c->time_base;
+static struct proxy_output_ctx *alloc_proxy_output_ffmpeg(
+    struct anim *anim, AVStream *inpuf_stream, int proxy_size, int width, int height, int quality)
+{
+  struct proxy_output_ctx *proxy_out_ctx = MEM_callocN(sizeof(struct proxy_output_ctx),
+                                                       "alloc_proxy_output");
 
-  /* there's no  way to set JPEG quality in the same way as in AVI JPEG and image sequence,
-   * but this seems to be giving expected quality result */
-  ffmpeg_quality = (int)(1.0f + 30.0f * (1.0f - (float)quality / 100.0f) + 0.5f);
-  av_opt_set_int(rv->c, "qmin", ffmpeg_quality, 0);
-  av_opt_set_int(rv->c, "qmax", ffmpeg_quality, 0);
+  proxy_out_ctx->proxy_size = proxy_size;
+  proxy_out_ctx->anim = anim;
 
-  if (rv->of->flags & AVFMT_GLOBALHEADER) {
-    rv->c->flags |= CODEC_FLAG_GLOBAL_HEADER;
-  }
+  proxy_out_ctx->output_format = alloc_proxy_output_output_format_context(anim, proxy_size);
 
-  if (avio_open(&rv->of->pb, fname, AVIO_FLAG_WRITE) < 0) {
-    fprintf(stderr,
-            "Couldn't open outputfile! "
-            "Proxy not built!\n");
-    av_free(rv->of);
-    return 0;
-  }
+  proxy_out_ctx->stream = avformat_new_stream(proxy_out_ctx->output_format, NULL);
+  proxy_out_ctx->stream->id = 0;
 
-  avcodec_open2(rv->c, rv->codec, NULL);
+  proxy_out_ctx->codec_context = proxy_out_ctx->stream->codec;
+  proxy_out_ctx->codec_context->thread_count = BLI_system_thread_count();
+  proxy_out_ctx->codec_context->thread_type = FF_THREAD_SLICE;
+  proxy_out_ctx->codec_context->codec_type = AVMEDIA_TYPE_VIDEO;
+  proxy_out_ctx->codec_context->codec_id = AV_CODEC_ID_MJPEG;
+  proxy_out_ctx->codec_context->width = width;
+  proxy_out_ctx->codec_context->height = height;
 
-  rv->orig_height = av_get_cropped_height_from_codec(st->codec);
+  proxy_out_ctx->output_format->oformat->video_codec = proxy_out_ctx->codec_context->codec_id;
+  proxy_out_ctx->codec = avcodec_find_encoder(proxy_out_ctx->codec_context->codec_id);
 
-  if (st->codec->width != width || st->codec->height != height ||
-      st->codec->pix_fmt != rv->c->pix_fmt) {
-    rv->frame = av_frame_alloc();
-    avpicture_fill((AVPicture *)rv->frame,
-                   MEM_mallocN(avpicture_get_size(rv->c->pix_fmt, round_up(width, 16), height),
-                               "alloc proxy output frame"),
-                   rv->c->pix_fmt,
-                   round_up(width, 16),
-                   height);
-
-    rv->sws_ctx = sws_getContext(st->codec->width,
-                                 rv->orig_height,
-                                 st->codec->pix_fmt,
-                                 width,
-                                 height,
-                                 rv->c->pix_fmt,
-                                 SWS_FAST_BILINEAR | SWS_PRINT_INFO,
-                                 NULL,
-                                 NULL,
-                                 NULL);
-  }
-
-  if (avformat_write_header(rv->of, NULL) < 0) {
+  if (!proxy_out_ctx->codec) {
     fprintf(stderr,
-            "Couldn't set output parameters? "
+            "No ffmpeg MJPEG encoder available? "
             "Proxy not built!\n");
-    av_free(rv->of);
-    return 0;
+    av_free(proxy_out_ctx->output_format);
+    return NULL;
   }
 
-  return rv;
-}
-
-static int add_to_proxy_output_ffmpeg(struct proxy_output_ctx *ctx, AVFrame *frame)
-{
-  AVPacket packet = {0};
-  int ret, got_output;
-
-  av_init_packet(&packet);
-
-  if (!ctx) {
-    return 0;
+  if (proxy_out_ctx->codec->pix_fmts) {
+    proxy_out_ctx->codec_context->pix_fmt = proxy_out_ctx->codec->pix_fmts[0];
   }
-
-  if (ctx->sws_ctx && frame &&
-      (frame->data[0] || frame->data[1] || frame->data[2] || frame->data[3])) {
-    sws_scale(ctx->sws_ctx,
-              (const uint8_t *const *)frame->data,
-              frame->linesize,
-              0,
-              ctx->orig_height,
-              ctx->frame->data,
-              ctx->frame->linesize);
+  else {
+    proxy_out_ctx->codec_context->pix_fmt = AV_PIX_FMT_YUVJ420P;
   }
 
-  frame = ctx->sws_ctx ? (frame ? ctx->frame : 0) : frame;
+  proxy_out_ctx->codec_context->sample_aspect_ratio = proxy_out_ctx->stream->sample_aspect_ratio =
+      inpuf_stream->codec->sample_aspect_ratio;
 
-  if (frame) {
-    frame->pts = ctx->cfra++;
-  }
+  proxy_out_ctx

@@ Diff output truncated at 10240 characters. @@