Home       About Me       Contact       Downloads       Part 10    Part 12   

Part 11: The Need for Speed

February 3, 2011

To render the large game world I've been designing, I need to use varying levels of detail. Up close, the world is rendered in full detail. Farther away, we have some reduced-detail version. And in the background, we have the type of scenery like the sun, stars, nebula and ring that I showed last time. Despite all the detail in the nebula and the rest of the sky, that material renders quickly, since it's just a few polygons. The real problem is with the cube world structures.

I started the week working on loading and unloading blocks of cube data as you move through the world. That's straightforward. What's more difficult is rendering the cube data quickly, and producing a reduced-detail version of it for use in the distance. That's the theme of this part.

Graphics Performance

As Florian and others have pointed out, my code for drawing cubes, which dates back to part 1, is all wrong. I implemented my drawing code the way you would have done it twenty years ago, when I first learned 3D computer graphics. Back then, the graphics card was just a piece of video memory without any intelligence. Everything was done by the CPU. Nowadays, the more you do in the CPU, and the less in the graphics card (GPU), the slower the code will be.

Fig 1: My speed test - 416,634 cubes
speed test

We have two types of data -- opaque and transparent. The opaque data can be drawn in any order. As mentioned in Part 4, the z-buffer is used by the graphics card to write nearby cubes on top of distant cubes. Transparent data is different -- it has to be sorted and drawn from back to front, so that distant cubes show through closer cubes (imagine the scene viewed through water or glass.)

The Part 4 demo code draws opaque and transparent in two passes. First, it runs through the cubes of the scene, kept in an Octree. For each cube, there are six flags telling us which faces are visible. Each visible face results in two triangles. Each face has a different texture, based on the type of the cube (stone, wood, etc.)

The simplest way to do this would be to draw each face of each cube as we pull them out of the Octree. I knew that switching textures between each face of each cube would hurt performance, so I didn't do this. Instead, I keep a buffer for each different texture. As cubes are read out, triangles are added to the various buffers. When a buffer is full, a chunk of triangles (all in the same texture) is sent to the graphics card for drawing.

Transparent data cannot use even this optimization though. Since the transparent data has to be drawn in order, from back to front, I can't buffer up them up to avoid switching textures. If there's a block of leaves behind a block of glass behind a block of water, I have to draw them in that order and switch textures three times. Drawing all the water first won't work.

This lack of optimization really shows. My computer is a AMD Phenom II six core CPU at 2.8 Ghz with an NVidia GTS 250 display card. My test case is Figure 1, a piece of Minecraft data 128 by 128 by 128 cubes. There are 416,634 cubes total, which after you account for the visibility flags on each face, generate 315,188 opaque triangles and 33,600 transparent ones. Some of the opaque triangles are the little spheres I draw instead of Minecraft torches.

So here's my timing on the Demo 4 code, which was my starting point this week:

Opaque: 31.19 ms;   Transparent: 39.83 ms;   Total: 71.02 ms;   Frame rate: 14 fps

Note that despite having only 33,600 transparent triangles, they take longer to draw than the 315,000 opaque ones. That's because I'm not buffering them at all. Each one is two calls to the display (set the texture, draw the triangle.)

I could continue my development with this performance. All it really means is that I have to draw less of the detailed cubes nearby and more of whatever low-res version I pick for distant cubes. But since this number is so bad, not even interactive speed, I decided to spend a few days optimizing the code.

Vertex and Index Buffers

You can create buffers on the graphics card instead of in system memory. In DirectX, these are called Vertex and Index buffers. We want the data on the graphics card so that the GPU can access it quickly. A vertex actually contains more data than just the location of the point. There is the location, the normal (which way the plane is facing at that point), and the texture coordinates. That's eight 4-byte floating point values per vertex.

You would think we could reuse vertexes, since a cube only has 8 points, but unfortunately, that's not the case. The normals of points on the top face of the cube point up, and the normals of points on the front point out of the plane. So there are in fact 6x4=24 unique vertexes on a single cube, used to generate the 12 triangles which make up the cube.

Since opaque cubes can be drawn in any order, my first optimization is to generate all the vertexes we need and put them in a Vertex Buffer Object (VBO) on the graphics card. Then to draw them, I just need to tell the GPU to draw this buffer, not generate all the points and move them to the graphics card 60 times a second.

The Index buffer is a list of vertex numbers. So instead of drawing a triangle by giving it vertex 0, vertex 1 and vertex 2, we create an Index buffer with the numbers 0, 1, 2 in it, and the graphics card will go find those vertexes in the VBO. That lets us easily reuse vertexes without duplicating them. Together, the Vertex Buffer and Index Buffer describe the list of triangles.

In fact, in this first version, I don't use a single Index Buffer for everything. I still have to switch textures when I draw the scene. So instead of generating triangles in a buffer the way Demo 4 does, I generate all the vertexes, and then a separate index buffer for each texture. So all the wood has one index buffer, and all the stone another. This means I'm drawing 20+ chunks of data, instead of one. They all share the same Vertex buffer though.

Later, I look at the DirectX API some more and realize I can keep all the indexes in a single buffer, since there's a call for drawing a subsection of the index buffer. Doing that gets me this improvement in timing:

Opaque: 1.87 ms;   Transparent: 49.44 ms;   Total: 51.31 ms;   Frame rate: 19 fps

The opaque triangles are now really fast, over 16 times faster than before! I'm very happy with this result, even though I'm making 40+ calls to the display to render it (set texture and draw triangles for each of the 20+ textures.) The transparent data is even worse, because of some changes I made to the interfaces. That's so bad, it's kept us from really getting any improvement in the frame rate. 19 fps is still unacceptable.

Faster Transparent Triangles

I still have to draw those transparent triangles in order, which means the list is potentially different each frame and can't just be kept on the display. In the last timing, the transparent triangle code is nearly unmodified, which means it is still drawing the triangles with individual system calls. I don't have to do that.

Since all the vertexes are out there already in a VBO, I can at least cut the transparent data back to just indexes. I will fill up an index buffer until it's time to change textures, then draw the triangles I have, and reset the buffer. That optimization gives me this:

Opaque: 1.87 ms;   Transparent: 26.51 ms;   Total: 28.38 ms;   Frame rate: 35 fps

Interestingly, my first implementation of this was horribly slow -- much slower than before -- and it took me a while to figure it out. I had used the same index buffer as for the opaque data, just reserving a section of it for the transparent triangles. However, opening that buffer and changing it each refresh cycle meant DirectX couldn't keep it in the display. It may even have been copying the entire thing to the display each time I modified it. Since there are over a million indexes used to draw the scene, at 4 bytes per index, this would have meant copying 4 megabytes each refresh cycle! To fix that, I had to create a separate index buffer just for the transparent data, and declare it "dynamic" to DirectX.

We've cut the time for the transparent pass almost in half by not doing individual system calls or generating the vertexes. But it's not obvious what the next step is. Switching textures so often means there are still 2390 changes to draw the scene. Each one is two calls -- set the texture and draw the triangles.

We still have to traverse the entire Octree, with its half a million nodes, each refresh. On each node, there's a bit of code to run. It's time to look at that.

Some people use profiling software to get the amount of time spent in each routine. I find those tools misleading. They don't get at what you are doing functionally and it's easy to get lost in the weeds. You end up trying to improve implementation, not algorithms. I prefer to just short out pieces of the code and see how the overall time changes.

The total time is 26 ms. If I generate all the triangles but don't draw them, it takes me 20 ms. That tells me most of the time is in my code, not DirectX or the display. If I just traverse the tree without doing anything with the nodes, it takes 3 ms. That tells me it's how I'm generating the index list that's the problem.

I stare at it for awhile and it seems like there's nothing to fix. I have to look at all six sides of the cube, and for each one tagged "visible", draw two triangles. I'm just generating a list of indexes for each triangle, which is the minimum amount of work I need. I fuss with the interfaces to my index buffer to use fewer method calls, and I look at my code to traverse the Octree in sorted order. Nothing makes any difference!

Then I have that "doh!" moment -- I'm checking all the faces of all the cubes in the tree, even when my flag byte ("vis") is zero for all faces, indicating a buried cube. As my computer science teacher used to say, "the fastest code is the code you never run." Doing that quick check cuts out a big percentage of the cubes.

Then I realize I'm even checking a cube for transparent sides when the type of the cube (wood, or stone) has no transparent faces. I add a summary flag on each cube type, and now I can discard those cubes quickly without checking the faces at all.

These two improvements give me this timing:

Opaque: 1.87 ms;   Transparent: 16.49 ms;   Total: 18.36 ms;   Frame rate: 54 fps

Just these obvious optimizations, which only added a few lines of code, have saved me 10 ms of the 26 ms total. Of the remaining 16 ms, how much is my code, how much is due to switching textures, and how much is draw time?

If I tell the code to use a single texture for all transparent blocks so that there's no switching, the time drops to 10.82 ms. If I tell it not to draw the graphics, the time drops to 10.14 ms. In other words, the time to actually draw the graphics is negligible. What really takes the time is generating the indexes and making all the DirectX calls.

Time for OpenGL?

To eliminate the texture switching calls, I need to use what are called Texture Arrays. Normally, your vertexes are given two texture coordinates, called u and v. These are the x and y indexes into the texture image. Using a texture array adds a third coordinate that selects which texture is used by the vertex.

With that feature, I would generate all my vertexes in advance, and each one would be tagged with the texture it used, both the opaque and transparent ones. Then I can generate all the indexes for the transparent triangles in one batch, never switching textures. That will save me the additional 5 ms, for a total time of 12.69 ms, and a frame rate of 78 fps.

Unfortunately, it doesn't look like DirectX 9 supports texture arrays. It has something called Texture Volumes, but the notes on it in the API make me nervous. For one thing, when it scales a texture volume (mip-mapping), it wants to scale the third depth dimension as well. That would average together my different textures for wood, water, etc., and just make a mess. I'm also not sure that when I render a triangle, it isn't going to try and average in "nearby" textures which are completely different.

I can leave the code alone, since it's reasonably fast even without texture arrays. I can move to DirectX 10, which supports texture arrays. That will lose me any Windows XP users. Or I can switch to OpenGL.

Discussing this with Florian, he makes the case that I will have more users with an OpenGL implementation even on Windows, since Texture Arrays are supported as an extension in earlier versions of OpenGL. And I will get MacOS and Linux users with an OpenGL implementation.

I've bought the OpenGL SuperBible and started to go through it. I'm still thinking about whether to switch. I'll convert the demo over to OpenGL and see how it goes. I also want to see what I can do with shaders to speed things up even more. There's a lot to learn.

Levels of Detail

I've mentioned in previous parts and in the comments that a low-resolution version of the Minecraft-style buildings is going to be a problem. Consider Figure 2. If you cut the resolution in half, you have to summarize 8 cubes with 1 coarser cube. You have to pick a color for it that represents the 8 cubes replaced. It's going to ruin the look of these structures.

Fig 2: How would you scale this down?
Minecraft image

Another way to summarize distant buildings would be to create a billboard or sprite. This is a flat rectangle with an image on it. The image is created by rendering the building at high resolution. From a distance, the flat billboard will look just like the real thing, until you get close, or change your angle on it. Then the billboard has to be replaced with the real thing, or updated to account for the new angle. With a lot of background processing to summarize pieces of the scene, I could implement this.

There might be some other shapes that would need to be updated less often. I was trying to think of ways to build blocks around the structure that would work. If instead of replacing 8 cubes with one, I replaced an 8x8x8 piece with a big cube that had the right images painted on the faces, perhaps this would work.

Reading the OpenGL documentation, I read over the description of the various primitives, including "point sprites". These are used for particle effects in the scene and draw quickly. Sometime later, my subconscious popped up the suggestion of using them to summarize the buildings. Instead of drawing the cubes, I would just draw a dot of the right color. At a distance, would this look convincing? I did a quick change to the demo to find out.

All I did to try this was scale all my textures to 1x1. This causes the image editor to average them. Then I changed the code to draw every block as a sphere instead, and turned off the lighting so everything is a flat color. The result is Figure 3.

Fig 3: A point-sprite LOD summary
point sprites

I think this will be usable, if it's fast enough. After I finish trying out OpenGL, I can get back to my large world implementation and try this out for real. I'll let you know.

Home       About Me       Contact       Downloads       Part 10    Part 12   

blog comments powered by Disqus