[Bf-cycles] SIGGRAPH observations

Mon Jul 29 17:58:24 CEST 2013

Some notes I took during SIGGRAPH, mostly technical and render related
stuff that I found interesting. For more in-depth stuff check the
collection of links here:
http://blog.selfshadow.com/2013/07/24/siggraph-2013-links/

OPENSUBDIV

Pixar is open sourcing OpenSubdiv mainly to push it as *the*
subdivision surface standard, and will be proposing to include it in
OpenGL later this year. They have a lot invested in their workflow,
e.g. modeling with creases and fewer edge loops, and think it's worth
the investment to ensure they do not have to switch to some other
standard that could win later on.

There are still some things missing in the library that we need for
Blender, in particular non-manifold meshes and smooth UVs. These are
planned to be added by Dreamworks, along with some performance
optimizations.

For Blender we also still have to see how this all fits with
displacement and multires sculpting. My guess is that for display in
sculpt mode itself OpenSubdiv is not so useful, as you are editing the
tessellated vertex positions directly there and retessellating them
each time will not help. You'd need to bake down the displacement each
time which is just going to make things slower. Outside of sculpt
mode, doing the subdivision surfaces on the GPU with or without
displacement should be quite nice, especially for animators to see
realtime playback of the full subdivided and displaced mesh.

SUBD AND RENDERING

There is an interesting difference between the workflows for Pixar
Renderman and Arnold. It seems that Pixar is relying a lot on having
good lower poly subdivision surfaces with detail added through
subdivision and displacement. Ptex needs such base meshes to do
mipmapping well, when the base mesh is super high poly and every
primitive is a separate Ptex face it may not be as good at texture
filtering. The Renderman multiresolution shading cache also needs a
base mesh to store the coarse level of the cache, if that's too fine
performance would drop. On the other hand it was mentioned that Arnold
is happy to just get tessellated very high poly meshes, partially
because the subd isn't as good but also because it doesn't do caching
anyway. In Arnold reducing shading cost for secondary bounces seems to
be mostly done by having a simpler version of the shader / ray
differentials.

So this is an interesting difference that we might want to think about
for Blender/Cycles algorithm choices too. Things like dynamic topology
sculpting suit the Arnold workflow, whereas multires sculpting fits
the Pixar workflow. For the Pixar workflow with dynamic topology
sculpts it seems that good retopology and remeshing is needed. Is it
worth adding special optimizations for subd surfaces all the way
into the rendering kernel, or do we focus on handling triangle/quad
soups really well with compression, SIMD, etc? We can look into a few
higher level primitives to support in the BVH like triangle strips,
quad strips and grids, but try to make those things work also when not
using subd.

Further, automatically picking the right subdivision level for global
illumination is still an unsolved problem. There is usually a choice
per object between a fixed user-specified subd level, world space
subd refinement, or adaptive refinement based on screen projection
for certain special cases. It's basically up to the artists to do this
well; automatic tessellation in screen space in the style of Renderman
may be becoming a bit less useful with GI. I would like an automatic
solution here but couldn't find anyone who had something like that.

Also interesting is that for some studios the base meshes already have
enough resolution to render mostly without subdivision. Sometimes the
detail is needed in the base mesh for physics simulation to show
enough detail. It depends a lot on the modelling and rigging workflow,
some people like to add lots of edge loops and others not.

PIXAR LIGHT SAMPLING

The difference between Renderman and Arnold style sampling is quite
interesting, and you can't easily switch between the two. With Arnold
every AA sample will result in relatively few diffuse or glossy
samples, whereas for Renderman it will take enough samples to be noise
free for each point on the micropolygon grid.

Renderman can do this because they decouple visibility from shading
and use a shading cache for indirect bounces. The downside is that
you're potentially doing too much shading, especially if your base
meshes are quite fine, the upside is that you can use tricks like
importance resampling to reduce the number of shadow rays, and control
variates to get an analytic noise free result for area lights if there
is no shadowing. The control variates results look quite impressive
but I wonder how much it really helps to get unshadowed areas noise
free if your shadowed areas still have a lot of noise. Maybe with
adaptive sampling, but Pixar isn't using that as far as I know.
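
To make the control variates idea concrete, here is a minimal 1D sketch.
It is not Pixar's implementation; the function name, the toy integrand
and the visibility function are all made up for illustration. The
unshadowed integral is taken analytically and only the shadowed
difference is estimated with Monte Carlo, so fully lit points come out
noise free while partially shadowed points still have noise.

```python
import random

def estimate_with_control_variate(f, visibility, analytic_unshadowed, n, rng):
    """Estimate the integral of f(u) * visibility(u) over u in [0, 1).

    analytic_unshadowed is the exact integral of f alone (the control variate).
    Only the shadowed difference f * (V - 1) is estimated by Monte Carlo.
    """
    total = 0.0
    for _ in range(n):
        u = rng.random()
        total += f(u) * (visibility(u) - 1.0)
    return analytic_unshadowed + total / n

# Toy 1D "area light": f(u) = 2u, whose exact integral over [0, 1) is 1.
f = lambda u: 2.0 * u
rng = random.Random(1)

# Fully lit: every correction sample is zero, so the result is exactly analytic.
unshadowed = estimate_with_control_variate(f, lambda u: 1.0, 1.0, 16, rng)

# Half the light blocked (u < 0.5): the exact answer is 0.75, but now noisy.
shadowed = estimate_with_control_variate(
    f, lambda u: 0.0 if u < 0.5 else 1.0, 1.0, 20000, rng)
```

With only 16 samples the fully lit estimate is already exact, whereas
the half-shadowed one needs many samples to settle near 0.75.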

Note also that with the caching the diffuse bounces are cheaper than
the glossy ones, because the latter are view dependent and so generally
can't be cached. Pixar uses tricks like selectively disabling glossy
in secondary bounces when they aren't needed. Those are technically
caustics anyway.

A nice trick they showed is to give some basic texture to area lights,
like a white area light with a blue border, simple but gives nice
shading variation.

OSL

Simple OSL shaders have quite a bit of overhead, and shader compile
time is quite long. There are some things that we can do to improve
things, like making our internal struct match the ShaderGlobals
struct. In general SPI have the same sort of issue with e.g. volume
rendering where you have lots of small shading operations, and they
are also looking into this.

SPI is interested in OSL on the GPU but unsure about the right choice
of implementation and target to use (llvm nvidia/amd backends, opencl,
cuda, glsl, hlsl, ..). This was the second big reason to start OSL at
SPI, to have it e.g. seamlessly display in render and viewport without
manually writing GLSL shaders that match nodes. Some other companies
might be interested in working on this, discussion will happen on
osl-dev mailing list. There seems to be a consensus that a single
backend target isn't going to satisfy everyone, and that there will be
some system to easily plug in multiple backends.

There were also some mentions that OIIO image texture lookup can be
slow, as it's really designed to have good high quality texture
filtering. It's possible to improve things for the simpler cases but
Autodesk Beast and V-Ray just used their own texture filtering code.
We could have an option to use our quick/stupid SVM texture filtering
code too, or look into OIIO optimizations for common settings.

SUBSURFACE SCATTERING

Solid Angle / SPI had a talk on BSSRDF sampling. Their method seems to
have less noise than the line sampling we use now, mainly because it
can stratify samples better.

Cycles does some extra things though, which perhaps we should drop? We
use some tricks to normalize the influence to avoid light leaking
and color shifts; other render engines don't do this, so perhaps it's
acceptable to drop them. Cycles also does multiple tries to reduce
variance, as both the line sampling and the new technique will have
about 50% of samples miss, which is quite a lot. My implementation of
this is quite weak though and I suspect it is actually wrong; it may
not be possible to do this properly in an unbiased way.

The multiple importance sampling they use between falloff curves for
different color channels should also help. This color noise is
something I couldn't figure out when implementing SSS but their
solution should not be so hard to add. Especially for things like the
skin model with a sum of 6 gaussians this can be very useful.
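
As a rough sketch of what MIS between color channels means here (this
is my own toy version, not the code from the talk): pick one channel's
falloff at random to sample a radius, then weight every channel by the
average of all the channel pdfs. The Gaussian falloffs, their variances
and all function names below are hypothetical, and the quantity being
integrated is just the falloff itself.

```python
import math
import random

def gaussian_pdf(r, variance):
    """Normalized 2D Gaussian falloff, as a density per unit area at radius r."""
    return math.exp(-r * r / (2.0 * variance)) / (2.0 * math.pi * variance)

def sample_radius(variance, u):
    """Draw a radius whose area density is gaussian_pdf(r, variance)."""
    return math.sqrt(-2.0 * variance * math.log(1.0 - u))

def sample_channel_weights(variances, rng):
    """Pick one channel's falloff to sample, then weight all channels by MIS."""
    chosen = rng.randrange(len(variances))
    r = sample_radius(variances[chosen], rng.random())
    pdfs = [gaussian_pdf(r, v) for v in variances]
    # Balance heuristic with uniform channel selection: the combined pdf is
    # the average of the per-channel pdfs.
    combined = sum(pdfs) / len(pdfs)
    # Here the falloff to integrate equals the channel pdf, so each channel's
    # throughput is pdf / combined, which can never exceed the channel count.
    return [p / combined for p in pdfs]

rng = random.Random(7)
variances = [4.0, 1.0, 0.25]   # hypothetical red, green, blue falloff widths
n = 20000
sums = [0.0, 0.0, 0.0]
max_weight = 0.0
for _ in range(n):
    w = sample_channel_weights(variances, rng)
    sums = [s + x for s, x in zip(sums, w)]
    max_weight = max(max_weight, max(w))
means = [s / n for s in sums]
```

Each channel averages to about 1 and no single sample can weigh more
than 3, which is what keeps the colored spikes away. Sampling with one
channel's falloff only and dividing by that pdf gives no such bound.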

HAIR

Everyone seems to be using some variation of the Marschner model. The
Pixar/Disney importance sampling method is published, the SPI/Arnold
method by Alejandro Conty is not. The former uses full raytracing
of the hair, no deep shadow maps anymore. They do skip the TT term and
use some trickery to compensate for the missing multiple scattering
(using surface normals for short hair and point cloud blurring for
long hair). The SPI/Arnold method includes multiple scattering and
presumably does not rely on any precomputation but I don't know how it
works.

VOLUMES

The volume rendering in The Croods and Wizard of Oz is quite
impressive, things have really gotten more advanced this year.

Camera frustum volume shapes seem to be quite important in getting as
much detail as possible into the render. Tracing rays through such a
frustum turns out to be not so simple.

There were already some papers published by SPI/SolidAngle for single
scattering, now there is also a trick to emulate multiple scattering.
The idea is to combine multiple 'octaves' with different volume
settings. Each octave halves the density to let light scatter further
with just one bounce. This is only a trick but combining these octaves
lets light scatter far enough while still preserving detail, and it's
entirely unbiased with no need for precomputation.

OPENVDB

The OpenVDB file format and data structure is quite cool in that you
can store volumes with no predefined bounding boxes, the volume data
can grow as needed without the users needing to worry about setting
bounds. Even better though are the tools provided along with the
library. They've got production quality tools for conversion to/from
meshes and particles, various volume manipulations, etc. If someone is
interested in implementing a native Volume datablock in Blender this
sounds like a great way to do it as much of the important stuff is
already there, with more planned to be added.
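
To illustrate what "no predefined bounding boxes" means in practice,
here is a toy sparse grid. This is not the OpenVDB API or its actual
tree layout, just a dictionary of small tiles with a made-up class name
and tile size, showing how storage can grow wherever voxels are written.

```python
TILE = 8  # hypothetical tile edge length in voxels

class SparseGrid:
    """Toy unbounded voxel grid: tiles are allocated only where data is written."""

    def __init__(self, background=0.0):
        self.background = background
        self.tiles = {}  # (tx, ty, tz) -> {(lx, ly, lz): value}

    @staticmethod
    def _split(x, y, z):
        # Floor division and modulo keep negative coordinates consistent.
        return (x // TILE, y // TILE, z // TILE), (x % TILE, y % TILE, z % TILE)

    def set(self, x, y, z, value):
        tile_key, local = self._split(x, y, z)
        self.tiles.setdefault(tile_key, {})[local] = value

    def get(self, x, y, z):
        tile_key, local = self._split(x, y, z)
        return self.tiles.get(tile_key, {}).get(local, self.background)

    def bounds(self):
        """Tile-aligned bounding box of the written data, derived on demand."""
        if not self.tiles:
            return None
        lo = tuple(min(k[i] for k in self.tiles) * TILE for i in range(3))
        hi = tuple((max(k[i] for k in self.tiles) + 1) * TILE - 1 for i in range(3))
        return lo, hi

grid = SparseGrid()
grid.set(3, 4, 5, 1.0)
grid.set(-1000, 20, 7, 0.5)   # far away, no bounds were declared up front
```

Only two tiles exist after these writes, and the bounds are something
you query afterwards rather than something you set in advance.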

One example they showed is converting a mesh to a volume, giving the
walls some thickness with dilation, fracturing the mesh, and
converting back to an adaptive tessellated mesh. All while preserving
mesh attributes like UV maps. Another example was cloud modelling by
quickly deforming and placing some spheres, converting to a volume and
adding procedural noise.

EXPLOITING SIMD/SIMT

Embree is now using ISPC and has kernels that work with ray packets.
Restructuring the code to take advantage of that sort of thing is
hard, using ISPC to compile the kernel might help. It's unclear if the
extra memory usage of computing 4/8/16 ray paths at the same time is
really more efficient in the end for real use cases, you can then
optimally use CPU FLOPS but you're doing a lot more memory access
which is usually a bottleneck already.

At the same time for optimal GPU usage we should split our megakernel
into smaller parts, this will help getting OpenCL to work on AMD, but
will also benefit NVidia cards in more complex scenes with many
different materials. This is quite challenging to do in practice
though, especially with a kernel as complex as Cycles. NVidia showed
how to do this for a simple path tracer but Cycles is quite a bit more
complicated. What you need to do is to turn the code into a kind of
state machine. We should try this at some point.
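
A very rough sketch of the state machine idea, on the CPU and in
Python just to show the control flow. The states, the per-state
"kernels" and the fake intersection test are all invented for
illustration; this is neither the NVidia path tracer nor Cycles code.
Each path carries a state, paths are queued per state, and every small
kernel only processes the paths waiting in its own queue.

```python
from collections import defaultdict

# Hypothetical path states; a real kernel would have many more.
GENERATE, INTERSECT, SHADE, DONE = "generate", "intersect", "shade", "done"

def kernel_generate(path):
    path["depth"] = 0
    return INTERSECT

def kernel_intersect(path):
    # Pretend a path hits geometry until it has bounced max_depth times.
    path["hit"] = path["depth"] < path["max_depth"]
    return SHADE if path["hit"] else DONE

def kernel_shade(path):
    path["depth"] += 1
    return INTERSECT

KERNELS = {GENERATE: kernel_generate, INTERSECT: kernel_intersect, SHADE: kernel_shade}

def run_wavefront(paths):
    """Advance all paths by repeatedly launching one small kernel per state."""
    queues = defaultdict(list)
    queues[GENERATE] = list(range(len(paths)))
    launches = []
    while any(queues[s] for s in KERNELS):
        for state, kernel in KERNELS.items():
            batch, queues[state] = queues[state], []
            if not batch:
                continue
            launches.append((state, len(batch)))
            for index in batch:          # on a GPU this loop is one kernel launch
                queues[kernel(paths[index])].append(index)
    return launches

paths = [{"max_depth": d} for d in (0, 1, 3)]
launches = run_wavefront(paths)
```

The launch list shows the batches shrinking as paths terminate, which
is where a real implementation would have to compact or refill the
queues to keep the GPU busy.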

In general it's sort of interesting that when you look at the open
sourced libraries and talks from the studios, they're not actually
using SIMD that often, it's generally a pain to get working and adapt
your code to it. For a raytracing kernel it's quite important though.

MULTITHREADED DEPENDENCY GRAPH

Pixar right now uses basically one character per thread, then
background caches frames for playback to keep cores occupied. It's
kind of a workaround, but ensuring fast playback this way is quite
nice for animators. They showed their Presto animation software, with
fast playback, opensubdiv and hair deformation on the GPU, baked ptex
applied to the mesh and lighting and shadows from the key light. All
looked quite nice.

They are looking into finer granularity too but don't have it yet.
Dreamworks is very granular, graphs with 50k-150k nodes. Requires
careful design of rigs, but they have very good scaling. Overhead from
granularity is reduced by using TBB, and letting threads handle chains
of nodes without going through the task system.
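
I don't know how Dreamworks implements the chain handling, but one
plausible way to get that effect is to merge linear runs of nodes into
a single task before scheduling. The sketch below assumes an acyclic
graph, and the function name and node names are made up.

```python
def group_chains(edges, nodes):
    """Group nodes of a DAG into linear chains that one thread can run in order.

    A node continues its predecessor's chain when it is that predecessor's only
    successor and has no other predecessor.
    """
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)

    def starts_chain(n):
        return len(pred[n]) != 1 or len(succ[pred[n][0]]) != 1

    chains = []
    for n in nodes:
        if not starts_chain(n):
            continue
        chain = [n]
        while len(succ[chain[-1]]) == 1 and len(pred[succ[chain[-1]][0]]) == 1:
            chain.append(succ[chain[-1]][0])
        chains.append(chain)
    return chains

# a -> b -> c, then c fans out to d and e, which both feed f.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f"), ("e", "f")]
chains = group_chains(edges, nodes)
```

Here the six nodes become four tasks, with a, b and c evaluated back to
back on one thread without a trip through the task scheduler.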

Pixar uses a system where there is a very clear separation between
output data and the depsgraph, for evaluating multiple frames at the
same time and to reduce locking, this is something we want in Blender
too. They also compile the depsgraph in advance, and Dreamworks caches
networks for changing various values. It's unclear if this will help
in Blender, maybe with quite complex rigs. I think it's best to leave
this as an optimization to solve when it shows up as a performance
problem.