Thanks for taking a look at this. If I understand the problem correctly,
the slowdown is mostly due to simple nodes not caching their results at
all, which means that for a fanout > 1 on a socket the node has to be
calculated at least that many times. For simple nodes that is usually not
a problem until you string a lot of them together with multiple fanouts in
between; then it adds up very quickly. I would say breaking these execution groups
into pieces is likely to be most successful at points where nodes have
a large fanout; this should reduce the number of calculations significantly.
The longer I think about it the more I feel there is an interesting graph
theory problem/algorithm here that might help...
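
To make the fanout argument concrete, here is a rough sketch in Python. It is
not the compositor's actual code: the graph, the node names and both helper
functions are made up for illustration, and it assumes every consumer pulls
its inputs independently unless a node is explicitly buffered.

```python
def fanout(inputs):
    """Return how many consumers read each node's output."""
    out = {name: 0 for name in inputs}
    for sources in inputs.values():
        for src in sources:
            out[src] += 1
    return out


def count_evaluations(inputs, output, buffered=frozenset()):
    """Count how often each node is calculated when `output` is pulled.

    inputs:   dict mapping node name -> list of nodes it reads from.
    buffered: nodes whose result is stored after the first calculation.
    """
    counts = {name: 0 for name in inputs}
    stored = set()

    def pull(node):
        if node in stored:
            return  # result already sits in a buffer, nothing to redo
        counts[node] += 1
        if node in buffered:
            stored.add(node)
        for src in inputs[node]:
            pull(src)

    pull(output)
    return counts


# Hypothetical toy tree: two fanout-2 points ("a" and "c") in a row.
inputs = {
    "a": [],
    "b1": ["a"], "b2": ["a"],
    "c": ["b1", "b2"],
    "d1": ["c"], "d2": ["c"],
    "e": ["d1", "d2"],
}

uncached = count_evaluations(inputs, "e")
cut_at = {name for name, k in fanout(inputs).items() if k > 1}
with_buffers = count_evaluations(inputs, "e", cut_at)
print(sum(uncached.values()), sum(with_buffers.values()))  # 13 7
```

In this toy case node "a" is calculated 4 times without buffers, because the
two fanouts multiply. Buffering only at the two fanout points brings every
node back to a single calculation.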

(Maybe this is all very obvious to you already, in which case
disregard my input and do what you think is best ;)

till then, David.

> This is a proposal to solve speed issues in the compositor. It should 
> not be considered as the final solution, but should help the most common 
> issues.
> Problem statement.
> The compositor works best with a good mixture of simple and 
> complex nodes. If you have a lot of simple nodes the system is not able 
> to find a good balance when converting to execution groups (subprograms 
> that are each scheduled to a core of the CPU). This results in a few 
> execution groups with many simple operations and a small number of 
> buffers that store intermediate results. This slows down the system a 
> lot 
> [http://projects.blender.org/tracker/?func=detail&aid=33785&group_id=9&atid=498]. 
> A workaround for this slowdown was to add a complex node (that doesn't 
> do anything, like blur 0) in the setup.
> First tests show that good results depend on the node tree setup and the 
> available memory of the system. We propose to split up execution groups 
> into smaller ones if they get too big. The split will depend on two 
> variables:
> 1. amount of memory in the system (not free memory)
> 2. number of operations in an execution group
> As this mechanism makes a lot of guesses, the user should be able to 
> manually control the number of cuts.
> During tests we saw the following results:
> Used file: file attached to issue #33785
> Used system: Intel(R) Core(TM) i5 CPU M 580 @ 2.67GHz, with 8GB of 
> memory, Ubuntu 12.04 64-bit:
>  - Baseline (no changes to code): 861MB, 47.49 seconds
>  - Limit execution group size to 10: 3424MB, 7.267 seconds
>  - Limit execution group size to 15: 3289MB, 7.607 seconds
>  - Limit execution group size to 20: 2884MB, 9.393 seconds
>  - Limit execution group size to 25: 2884MB, 11.987 seconds
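
As a quick way to read the time/memory trade-off in the quoted numbers, the
ratios against the baseline can be worked out like this. The figures are
copied from the list above; the `tradeoff` helper is only an illustrative name.

```python
# Figures copied from the quoted measurements.
baseline_mb, baseline_s = 861, 47.49

# Group size limit -> (memory in MB, time in seconds).
runs = {
    10: (3424, 7.267),
    15: (3289, 7.607),
    20: (2884, 9.393),
    25: (2884, 11.987),
}


def tradeoff(mb, seconds):
    """Return (speedup, memory growth), both relative to the baseline run."""
    return baseline_s / seconds, mb / baseline_mb


for limit, (mb, seconds) in runs.items():
    speedup, memory = tradeoff(mb, seconds)
    print(f"limit {limit:2d}: {speedup:.1f}x faster, {memory:.1f}x memory")
```

At a limit of 10 this comes to roughly 6.5 times faster for about 4 times the
memory, so on this file the smaller groups buy their speed with extra buffers.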
