[Bf-committers] OpenGL optimisation

Sun, 23 May 2004 22:31:49 +0100

A few choice expletives and a fresh CVS later:

http://www.warwick.ac.uk/student/R.J.Berry/customblender.zip

This is exactly the same as the previous version but links against the OS
X Python framework instead of the Fink libraries. Thanks for the
clarification about the build systems, I had a suspicion that it was
something like this, but I'm new to scons...

Okay, a better explanation of what I'm doing. Firstly, this optimisation
only works when drawing subdivided meshes in "OpenGL Solid" mode (or
anything else that uses DispListMesh, but this is the only path that I'm
looking at for the moment).

The benchmark is a simple 50 frame animation of 3 cubes subdivided to
level 6. Using the original Blender code this animation (via Alt-A) was
nowhere near realtime on my system (PowerBook with ATI Mobility 9600),
whereas the version with OpenGL display lists is. If you want to see a
relative comparison then switch between "Wire" display mode and "OpenGL
Solid" mode.

Using the Ctrl-Alt-T benchmark I get:

Blender 2.33a:
	draw: 2216 ms
	draw+swap: 2286 ms
	displist: 2298 ms
Custom Blender:
	draw: 187 ms
	draw+swap: 222 ms
	displist: 2467 ms

The next bit might seem redundant to those of you who know OpenGL but I
thought I'd justify my ideas. There are several ways of submitting vertex
data to OpenGL:

"Immediate Mode":
This is currently what Blender uses, i.e. submitting vertices using
something like:
	glBegin(GL_TRIANGLES);
	glNormal3fv(normal1);
	glVertex3fv(vertex1);
	...
	glEnd();
While this is really flexible with formats etc., it is also the slowest
way to submit vertices: firstly because you have to make a function call
for each vertex (and extra ones for stuff like the vertex normal / texture
coordinates) and secondly because the data has to be uploaded to the card
each time. It is basically only useful for specifying small amounts of
dynamic geometry.
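
To put a number on the per-vertex call overhead, here is a minimal sketch
that just counts calls. It is not Blender code: the function names and
parameters are made up for illustration, and it assumes all faces are
drawn inside a single glBegin/glEnd pair.

```c
/* Count the OpenGL calls needed to draw `faces` faces in immediate mode
 * within a single glBegin/glEnd pair (hypothetical helper). */
unsigned long immediate_mode_calls(unsigned long faces,
                                   unsigned verts_per_face,
                                   int normal_per_vertex)
{
    /* One glVertex3fv per vertex of every face. */
    unsigned long vertex_calls = faces * verts_per_face;

    /* One glNormal3fv per vertex (smooth) or per face (flat). */
    unsigned long normal_calls = normal_per_vertex ? vertex_calls : faces;

    /* Plus the enclosing glBegin and glEnd. */
    return 2UL + vertex_calls + normal_calls;
}

/* Replaying a compiled display list is one glCallList, whatever the
 * amount of geometry stored in it. */
unsigned long display_list_calls(void)
{
    return 1UL;
}
```

For 1000 flat-shaded triangles, immediate_mode_calls(1000, 3, 0) gives
4002 calls per frame, against a single call for the display list.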

Check out:

http://www.warwick.ac.uk/student/R.J.Berry/without-displists.png

Nearly 21 million calls to glVertex3fv! For what I'm guessing is about 800
frames or less (from the number of calls to glFinish, but I think glFinish
is called more than once per frame...).

Compare this to when the objects are drawn using glCallList:

http://www.warwick.ac.uk/student/R.J.Berry/with-displists.png

Vertex Arrays:
Basically a way of submitting a large amount of vertex data with very
little function call overhead (i.e. one call to something like
glDrawElements). Also, depending on the implementation, this is probably
more efficient than immediate mode (vertex data is in an array in a
predictable format). There are also a number of array related extensions
that are specifically tailored for performance with AGP memory and DMA
transfers to the graphics card (e.g. APPLE_vertex_array_range,
ARB_vertex_buffer_object etc). Also, because OpenGL knows the size and
format of the data it is possible to cache the data on the graphics card
in some circumstances.

The problem with this is that the vertex data has to be in a format
optimised for the card (i.e. an array). For best performance the data
should be interleaved, e.g.
	normal 1
	texture coordinates 1
	position 1
	normal 2
	texture coordinates 2
	position 2
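
As a rough sketch, the interleaved layout above could be expressed as a C
struct. The struct and helper names are hypothetical, and it assumes the
compiler adds no padding between float members (true on common
platforms). The stride and offsets are the sort of values you would hand
to glNormalPointer, glTexCoordPointer and glVertexPointer.

```c
#include <stddef.h>

/* One interleaved vertex, in the order listed above (hypothetical type). */
typedef struct InterleavedVertex {
    float normal[3];
    float texcoord[2];
    float position[3];
} InterleavedVertex;

/* Byte distance from one vertex to the next in the array. */
size_t vertex_stride(void)
{
    return sizeof(InterleavedVertex);
}

/* Byte offset of each attribute from the start of a vertex. */
size_t normal_offset(void)
{
    return offsetof(InterleavedVertex, normal);
}

size_t texcoord_offset(void)
{
    return offsetof(InterleavedVertex, texcoord);
}

size_t position_offset(void)
{
    return offsetof(InterleavedVertex, position);
}
```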

Display Lists:
A way of encoding a number of OpenGL commands for efficient use by OpenGL.
In modern implementations these commands will get compiled into an
efficient format which is uploaded to the graphics card. In a best case
scenario these will probably be the fastest method for drawing geometry.
Older graphics cards will probably not upload into graphics card memory
but will optimise the data and put it into AGP memory for fast upload, so
this is still a win (probably at least as fast as vertex arrays). In the
worst case with a software renderer probably no optimisation is done and
it will be equivalent to immediate mode.

However, even though you can put a load of state changes in the command
list (e.g. changing textures, shading model, activating / deactivating
lights etc.), this can stall the graphics card, so it's better to create
display lists containing only geometry and put the state changes outside
the display list compilation.

Also, so that the display list compiler / optimiser doesn't get confused,
the geometry should be submitted in a "consistent" way, similar to the
vertex array interleaved formats.

I think the preference for the way that we submit data to the graphics
card should be (from fastest to slowest):
1) OpenGL display lists for static data (i.e. most things in object mode,
   and data that isn't being edited in edit mode),
2) vertex array extensions for dynamic data (possibly some static data),
3) standard vertex arrays (as a fall back for when the extensions aren't
   supported),
4) immediate mode (probably only useful for a few GUI elements and
   non-mesh type objects, such as cameras, lamps and possibly really small
   meshes where the overhead of setting up display lists / vertex arrays
   is bad for performance).
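
The preference order above could be sketched as a small selection
function. This is only an illustration of the decision, not existing
Blender code: the enum, the function and its three flags are all
hypothetical names.

```c
/* The four submission paths, fastest first (hypothetical names). */
typedef enum SubmitPath {
    SUBMIT_DISPLAY_LIST,      /* 1) static data */
    SUBMIT_VERTEX_ARRAY_EXT,  /* 2) dynamic data, extension available */
    SUBMIT_VERTEX_ARRAY,      /* 3) fallback without the extensions */
    SUBMIT_IMMEDIATE          /* 4) GUI elements, non-mesh, tiny meshes */
} SubmitPath;

/* Pick a path for one object from three yes/no properties. */
SubmitPath choose_submit_path(int is_tiny_or_non_mesh,
                              int is_static,
                              int has_array_extension)
{
    /* Setup overhead is not worth it for cameras, lamps, tiny meshes. */
    if (is_tiny_or_non_mesh)
        return SUBMIT_IMMEDIATE;

    /* Static geometry can be compiled once and replayed. */
    if (is_static)
        return SUBMIT_DISPLAY_LIST;

    /* Dynamic geometry: use the extensions when present, else fall back. */
    if (has_array_extension)
        return SUBMIT_VERTEX_ARRAY_EXT;

    return SUBMIT_VERTEX_ARRAY;
}
```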

Further optimisations would probably only be applicable to the game engine
(e.g. creating triangle / quad strips before drawing geometry, don't know
if the game engine does this already).

I think display lists are the way to go as they are the most compatible
(requiring no extensions) and require much less work on our part (we
basically need to wrap existing OpenGL draw commands in a list, whilst
also optimising to reduce state changes). I think the fact that it
probably only took about an hour for me to convert the code to use display
lists for such a massive performance benefit speaks for itself.

I admit that I have absolutely no idea what rendering method the game
engine uses, but if we want to make it competitive in any way then we have
to use display lists / vertex arrays.

In response to Ton:

> I rather look at a redesign for the Blender displaylists first... this
> will - even without ogl displaylists - improve performance quite some
> already.

Hmmm... I doubt it. Optimising Blender display lists will result in less
CPU work but won't make uploading vertex data to the graphics card any
faster. The bottleneck with immediate mode (and vertex arrays that don't
cache / upload data to the graphics card) is the bus bandwidth between the
CPU and the GPU. There is no way that this is ever going to be faster than
the GPU reading data out of graphics memory that is in an optimised
format.

Not only that, but using display lists etc. will allow the GPU to do the
work, not the CPU. Converting to OpenGL display lists will probably
relieve the CPU much better than any amount of Blender display list
optimisation, simply because the CPU doesn't have to do the work: it
simply says to the GPU "draw this list", which the GPU already has in its
memory, as opposed to uploading all the data again.

You can see this in the screenshots that I showed. Without display lists
the time to draw for the application and OpenGL (note the application is
involved because of calling the OpenGL functions) is the total of all the
times for glVertex3fv, glNormal3fv, glBegin and glEnd calls. With display
lists the time to draw is only that of glCallList.

> Is there any way to see what added memory usage is?

Yes, but I'll probably have to use the OpenGL Driver Monitor to figure
that out as the memory will be used by the graphics card, not the CPU.
Memory usage will probably be exactly the same as the size of the vertex
data that has been submitted plus a tiny bit of overhead, so I'm guessing
about 8 * 4 = 32 bytes per vertex (for floating point format with normal,
2 texture coordinates and position).
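
The estimate can be worked through in a few lines. This is a sketch under
the assumptions stated above (3 floats of normal, 2 of texture
coordinates, 3 of position, no driver overhead); the names are
hypothetical and the real figure depends on how the driver stores the
list.

```c
#include <stddef.h>

/* Floats stored per vertex attribute, as assumed in the estimate. */
enum {
    NORMAL_FLOATS   = 3,
    TEXCOORD_FLOATS = 2,
    POSITION_FLOATS = 3
};

/* Size of one vertex: 8 floats in total. */
size_t bytes_per_vertex(void)
{
    return (NORMAL_FLOATS + TEXCOORD_FLOATS + POSITION_FLOATS)
           * sizeof(float);
}

/* Raw vertex data held by a display list of `vertex_count` vertices,
 * ignoring whatever overhead the driver adds. */
size_t display_list_bytes(size_t vertex_count)
{
    return vertex_count * bytes_per_vertex();
}
```

With 4-byte floats, 100,000 vertices come to 3,200,000 bytes, roughly
3 MB, before any driver overhead.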

If the display list is uploaded to the card then I don't see how this is
going to be much of a problem for us. If the card does run low on memory
then OpenGL will do a much better job of juggling textures, display lists,
etc than we'll be able to do.

If memory usage is a concern then at the very least I think we should
convert to using vertex arrays and similar extensions. I think in many
cases these can draw straight out of the application's memory and then
cache data on the card if it's not altered. However, this is going to
involve using a lot of extensions and doing appropriate fallback if
they're not there.

r i c k