[Bf-cycles] correlation between GTX680 dual-precision FP and cycles performance?

Tue Mar 27 16:26:41 CEST 2012

Hi Brecht,

Hoping for your comment on my speculation as to why the GTX680 is not
the Cycles powerhouse a lot of people hoped it would be:

http://blenderartists.org/forum/showthread.php?239480-2.61-Cycles-render-benchmark/page17

It is possible the performance deficit results from nothing more than a
lack of software optimisation (no builds targeting compute capability
3.0 for the GTX680 yet?), but the AnandTech review of the card listed
quite a few examples where the 680 has been 'detuned' as far as compute
is concerned.

The broad-brush summary is that the GTX 680 is Nvidia's first
'efficient' architecture in a long time, at least for gaming, precisely
because a lot of the heavy-lifting silicon for compute purposes was
removed. That would not be unexpected on a mid-range GPU in the mould of
the 460/560 generation, which is what this chip was rumoured to be
before a spot of re-branding occurred.

A few quotes from the AnandTech review:

" The CUDA FP64 block contains 8 special CUDA cores that are not part of
the general CUDA core count and are not in any of NVIDIA’s diagrams.
These CUDA cores can only do and are only used for FP64 math. What's
more, the CUDA FP64 block has a very special execution rate: 1/1 FP32.
With only 8 CUDA cores in this block it takes NVIDIA 4 cycles to execute
a whole warp, but each quarter of the warp is done at full speed as
opposed to ½, ¼, or any other fractional speed that previous
architectures have operated at. Altogether GK104’s FP64 performance is
very low at only 1/24 FP32 (1/6 * ¼), but the mere existence of the CUDA
FP64 block is quite interesting because it’s the very first time we’ve
seen 1/1 FP32 execution speed. Big Kepler may not end up resembling
GK104, but if it does then it may be an extremely potent FP64 processor
if it’s built out of CUDA FP64 blocks."
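
To make the arithmetic in that quote concrete, here is a rough sketch of
where the 1/24 figure comes from. The warp size and the 8 FP64 cores are
taken from the quote; the 192 FP32 cores per SMX is my own assumption
about GK104 and is not stated in the quote, so treat the result as
illustrative only.

```python
from fractions import Fraction

WARP_SIZE = 32             # threads per warp
FP64_CORES_PER_SMX = 8     # from the AnandTech quote
FP32_CORES_PER_SMX = 192   # ASSUMPTION: GK104 SMX core count, not in the quote

# With 8 FP64 cores, a 32-thread warp needs 32 / 8 = 4 passes.
fp64_cycles_per_warp = WARP_SIZE // FP64_CORES_PER_SMX

# Throughput ratio as a plain core-count ratio.
ratio_from_cores = Fraction(FP64_CORES_PER_SMX, FP32_CORES_PER_SMX)

# The quote's own decomposition: 1/6 * 1/4.
ratio_from_quote = Fraction(1, 6) * Fraction(1, 4)

print("FP64 cycles per warp:", fp64_cycles_per_warp)
print("FP64:FP32 ratio from core counts:", ratio_from_cores)
print("FP64:FP32 ratio from the quote:", ratio_from_quote)
```

Both routes give the same ratio, which is why I read the quote as saying
FP64 work on this card runs at roughly one twenty-fourth of FP32 speed.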

"So NVIDIA has replaced Fermi’s complex scheduler with a far more
simpler scheduler that still uses scoreboarding and other methods for
inter-warp scheduling, but moves the scheduling of instructions in a
warp into NVIDIA’s compiler. In essence it’s a return to static
scheduling. Ultimately it remains to be seen just what the impact of
this move will be. Hardware scheduling makes all the sense in the world
for complex compute applications, which is a big reason why Fermi had
hardware scheduling in the first place, and for that matter why AMD
moved to hardware scheduling with GCN."

"What makes this launch particularly interesting if not amusing though
is how we’ve ended up here. Since Cypress and Fermi NVIDIA and AMD have
effectively swapped positions. It’s now AMD who has produced a higher
TDP video card that is strong in both compute and gaming, while NVIDIA
has produced the lower TDP part that is similar to the Radeon HD 5870
right down to the display outputs."

So my questions:
1. Does Cycles use double-precision floating point (FP64)?
2. If not, does the poor performance result from the scheduler and
other architectural deficiencies?
3. If yes, how much of the poor performance derives from the lack of
double-precision grunt?
4. Or, are we jumping the gun branding the GTX680 as poor, and optimised
builds will surprise?
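
To put a rough number on question 3, here is a toy model of how much a
small share of FP64 work could cost at a 1/24 rate. It assumes the
kernel is purely arithmetic-bound and that FP32 and FP64 work do not
overlap, neither of which I know to be true for Cycles; the function
name and the sample percentages are mine, purely for illustration.

```python
from fractions import Fraction

# GK104 FP64 throughput relative to FP32, per the AnandTech quote.
FP64_RATE = Fraction(1, 24)

def relative_runtime(fp64_fraction, fp64_rate=FP64_RATE):
    """Runtime relative to doing every operation at full FP32 rate.

    ASSUMPTION: arithmetic-bound kernel, no overlap of FP32 and FP64
    work, no memory or scheduling effects. A toy model only.
    """
    fp64_fraction = Fraction(fp64_fraction)
    fp32_time = 1 - fp64_fraction          # FP32 ops run at rate 1
    fp64_time = fp64_fraction / fp64_rate  # FP64 ops run 24x slower
    return fp32_time + fp64_time

# Hypothetical FP64 shares, not measurements of Cycles.
for percent in (0, 1, 5, 10):
    share = Fraction(percent, 100)
    print(f"{percent:>3}% FP64 ops -> {float(relative_runtime(share)):.2f}x runtime")
```

If the answer to question 1 is "no", this model says the FP64 rate costs
nothing and the explanation has to lie elsewhere.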

Many thanks

mjg