Part 95: The New Slab

November 6, 2014

Back in August, I received my Oculus Rift and discovered my graphics card was too slow to run any of the demos properly. I purchased a Radeon 270x card to upgrade my system. Normally, I go with Nvidia cards, since I've always found the drivers and installation to be better, but this time the corresponding Nvidia card was significantly more expensive. (Prices have come down since the latest card was released.)

I ran all the Oculus demos and ported my demos to the Rift. I immediately noticed a huge problem though. Every few seconds, my demos would freeze for a second or two, burning CPU, then start running again.

This only happened when running under Debug, and only seemed to happen on some of the demos, and only on the Radeon. The exact same project ran fine on my old NVidia machine. A friend verified that this happened on his new Radeon card as well, but not on some of his older Radeon cards.

I just put up with this for a while, but in order to do any real development I had to get to the bottom of it. I started with the TestCube demo, which is just a single rotating cube. I commented out chunks of the code until it was doing almost nothing (a cursor on the screen), but the effect persisted. I was really ticked off at this whole problem, since this is exactly the flaky behavior I had come to dread when doing any graphics programming. I almost decided to switch to DirectX!

I got my first clue when I happened to be looking at the process list in the Windows Task Manager. TestCube had an absurd amount of memory. What's more, the working set was steadily climbing to around 60 meg, then suddenly dropping back down to 40 meg. The CPU spike and the freeze of the program happened exactly when the working set dropped. So obviously, I was using up huge amounts of memory, and then some kind of garbage collection kicked in and the app froze while that was going on.

But where, and why only under Debug on Radeon? I continued hacking away at my code, and discovered that if I stopped drawing the cursor, the problem went away. But this is ridiculous! This is my oldest OpenGL code, and it couldn't be simpler. I create a buffer, put two triangles into it, set the cursor texture, do a draw. If this doesn't work, what does?

It turns out that it was a single call, glBufferData, that was causing the problem. True, in the OpenGL documentation it says that this allocates a new buffer. But I thought if the buffer already existed, it would just over-write it. And apparently it does on other drivers for other machines. But for this machine, it allocates a new buffer on every call.

This can't be the entire explanation, since I'm using a trivial amount of memory (96 bytes) and even if those weren't being released, it would take forever to add up to the 20 meg I was seeing. Of course, this same demo has a 24 meg working set on the NVidia machine instead of a 52 meg working set on the Radeon, so who knows? I can only guess that every time it allocates a buffer, it creates a new memory segment or something very large.

The solution was simple enough. When I create the buffer for the first time, I use glBufferData. On subsequent uses, I over-write it explicitly with glBufferSubData. The working set is still huge, but it doesn't grow and doesn't freeze periodically. I'm glad it's fixed, but this is the kind of thing that drives me insane. I could have happily developed my game on my NVidia machine and never known this was a problem.
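
In code, the pattern is roughly this (a minimal sketch, not my actual cursor code; the function, its parameters and the "loaded" flag are just for illustration):

void updateBuffer(GLuint buffer, bool& loaded, const void* vertexes, GLsizeiptr size)
{
  glBindBuffer(GL_ARRAY_BUFFER, buffer);

  if (!loaded)
  {
    // first use: allocate the buffer storage and copy the vertexes
    glBufferData(GL_ARRAY_BUFFER, size, vertexes, GL_DYNAMIC_DRAW);
    loaded = true;
  }
  else
  {
    // later uses: over-write the existing storage, no reallocation
    glBufferSubData(GL_ARRAY_BUFFER, 0, size, vertexes);
  }
}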

Oh, and the reason it failed on some demos and not others? The higher the frame rate, the more often it draws the cursor, and the quicker it uses up memory. The TestCube demo without vertical synch runs at hundreds of frames per second. On a slower demo, it took longer to trigger garbage collection and freeze. So I thought only some demos were affected.

Subclassing Vertexes

Before I started working on the game again, I wanted to make a change to my 3D library. I needed to handle vertexes differently.

Each vertex has a number of attributes such as the point coordinates, the normal vector and the texture coordinates. Back when I first started learning OpenGL, the books put each attribute in a different array. I didn't see the sense of that, since it's just more stuff to allocate. Instead, I would create a structure for my vertex, like:

class CubeVertex
{
public:
  mgPoint3 m_pt;         // position of the vertex
  mgPoint3 m_normal;     // normal vector
  mgPoint3 m_texcoords;  // texture coordinates
};

OpenGL allows me to set the offset of each attribute in this structure. Then I can allocate a single array of all of these attributes together, fill it with vertexes and draw.
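
For example, the attribute setup for CubeVertex is along these lines (a sketch only; the attribute locations and float components are assumptions, not taken from my actual library or shaders):

// assumes the vertex buffer is already bound to GL_ARRAY_BUFFER,
// and that the mgPoint3 fields hold floats
void setCubeVertexAttribs()
{
  // interleaved attributes: one stride, a different offset per attribute
  glEnableVertexAttribArray(0);  // position
  glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(CubeVertex),
    (const GLvoid*) offsetof(CubeVertex, m_pt));

  glEnableVertexAttribArray(1);  // normal vector
  glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(CubeVertex),
    (const GLvoid*) offsetof(CubeVertex, m_normal));

  glEnableVertexAttribArray(2);  // texture coordinates
  glVertexAttribPointer(2, 3, GL_FLOAT, GL_FALSE, sizeof(CubeVertex),
    (const GLvoid*) offsetof(CubeVertex, m_texcoords));
}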

Here's the problem. I have a huge class (called a Slab) that draws all of the bricks in the world. I want to do something very similar with avatars by making them out of small bricks. Normally, I would subclass the world drawing code and use it to draw avatar bricks. But if I want a different kind of vertex, with, say, pure colors on each cube instead of textures, I can't do it. I would have to subclass not just the cube drawing class, but also every use of the CubeVertex above. That's because I put all the vertex attributes in one array.

So now I have a new Mesh class which I can add attributes to. The superclass of all cube-drawing classes just sets position. A subclass for the world can add texture coordinates to the mesh. And a subclass for the avatar can add colors instead.
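
Something along these lines, though this is only a sketch of the idea; the Mesh class here and its addAttribute method are my own guesses for illustration, not the actual library code:

// hypothetical mesh that accumulates named vertex attributes
class Mesh
{
public:
  void addAttribute(const char* name, int components);
};

// superclass of all cube-drawing classes: position only
class CubeMesh
{
public:
  virtual void defineAttributes(Mesh& mesh)
  {
    mesh.addAttribute("position", 3);
  }
};

// the world adds texture coordinates
class WorldCubeMesh : public CubeMesh
{
public:
  virtual void defineAttributes(Mesh& mesh)
  {
    CubeMesh::defineAttributes(mesh);
    mesh.addAttribute("texcoords", 2);
  }
};

// the avatar adds colors instead
class AvatarCubeMesh : public CubeMesh
{
public:
  virtual void defineAttributes(Mesh& mesh)
  {
    CubeMesh::defineAttributes(mesh);
    mesh.addAttribute("color", 4);
  }
};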

Performance Improvements

With that done, I could have launched into writing the game again, but I wanted to try some performance improvements. These were not new ideas; I have played with improvements off and on during the three years since the original version of the Minecraft viewer was written. Also, I was surprised how slow the code was, even on this much faster display card.

Meshing Faces

One improvement is to combine adjacent faces of cubes when they have the same texture, and draw a single large quad instead of many smaller ones. I had originally thought that this would make no difference. I had it in my head that the majority of GPU time was spent filling in pixels, and that the number of vertexes was irrelevant. If I had actually looked at the images I was generating, I would have realized that at a distance, I have thousands of one- or two-pixel cubes, yet I am processing 24 vertexes per cube (four vertexes times six faces).

So obviously, vertexes were important. Combining adjacent faces using a "greedy meshing" algorithm reduces the count by almost 70%, which, for distant faces, should mean about the same reduction in draw time.
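
The core of the 2D pass looks something like this (a sketch of greedy meshing over one layer of same-facing faces; the texture mask, the Quad struct and the fixed 32-cube size are assumptions for illustration):

#include <vector>

const int SIZE = 32;  // cubes per slab edge (for this sketch)

struct Quad { int x, y, w, h, texture; };

// mask holds the texture id of each exposed face in one layer, 0 = no face.
// adjacent faces with the same texture are merged into maximal rectangles.
void greedyMesh(const int mask[SIZE][SIZE], std::vector<Quad>& quads)
{
  bool used[SIZE][SIZE] = {};

  for (int y = 0; y < SIZE; y++)
  {
    for (int x = 0; x < SIZE; x++)
    {
      if (used[y][x] || mask[y][x] == 0)
        continue;
      int tex = mask[y][x];

      // grow the quad to the right while the texture matches
      int w = 1;
      while (x+w < SIZE && !used[y][x+w] && mask[y][x+w] == tex)
        w++;

      // grow the quad downward while the entire row still matches
      int h = 1;
      bool grow = true;
      while (grow && y+h < SIZE)
      {
        for (int i = 0; i < w; i++)
        {
          if (used[y+h][x+i] || mask[y+h][x+i] != tex)
          {
            grow = false;
            break;
          }
        }
        if (grow)
          h++;
      }

      // mark the covered cells and emit one quad instead of w*h faces
      for (int j = 0; j < h; j++)
        for (int i = 0; i < w; i++)
          used[y+j][x+i] = true;

      quads.push_back({x, y, w, h, tex});
    }
  }
}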

Culling

Next, I had realized some time ago that it would be worth it for the cube drawing code to cull faces that weren't visible from the eye. Normally, you would never do this in the CPU, since the GPU is so much faster. But in the cube world, all the faces are aligned. If I am to the left of a slab of cubes, I know that the right sides are invisible. When I am above a slab, I know the bottoms are invisible. If I break the drawing up into six buffers, one each for x-, x+, y-, y+, z- and z+, then I can just skip the buffers I know can never be seen.
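
The skip test itself is just a comparison of the eye position against the slab bounds, something like this sketch (the SlabBounds struct, FaceDir enum and drawBuffer call are placeholders, not my actual classes):

enum FaceDir { XMINUS, XPLUS, YMINUS, YPLUS, ZMINUS, ZPLUS };

struct SlabBounds { double minX, maxX, minY, maxY, minZ, maxZ; };

// placeholder: issues the draw call for one face-direction buffer
void drawBuffer(const SlabBounds& slab, FaceDir dir);

void drawSlab(double eyeX, double eyeY, double eyeZ, const SlabBounds& slab)
{
  // an x- face can only be seen from a smaller x than the face itself,
  // so if the eye is past the right edge of the slab, skip that whole buffer
  if (eyeX < slab.maxX) drawBuffer(slab, XMINUS);
  if (eyeX > slab.minX) drawBuffer(slab, XPLUS);
  if (eyeY < slab.maxY) drawBuffer(slab, YMINUS);
  if (eyeY > slab.minY) drawBuffer(slab, YPLUS);
  if (eyeZ < slab.maxZ) drawBuffer(slab, ZMINUS);
  if (eyeZ > slab.minZ) drawBuffer(slab, ZPLUS);
}

For a slab far from the eye, only three of the six directions pass the test.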

Shapes

Minecraft has non-cube shapes like flowers, torches and railroad tracks. They are rendered very simply in the game. When I did my rendering code, I used much more elaborate models. None of these by themselves has a very high polygon count, but there are thousands of them in the scene. When I counted the total number of vertexes, there were more of these "shape" vertexes than cube vertexes.

The shapes have the same problem as the cubes -- at a distance, despite drawing maybe fifty vertexes, the shape ends up being only a couple of pixels.

I made two changes in this area. First, I rendered all the shapes with instancing, to cut down on the amount of display memory they use. Second, I replaced the complex shapes with much simpler ones at a distance. This dramatically cuts the total number of vertexes.
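
Roughly, the drawing ends up like this sketch (the ShapeModel struct, the distance threshold and the instanced draw setup are assumptions, not my actual shape code):

// one copy of the model geometry in display memory, drawn many times
struct ShapeModel
{
  GLuint vertexArray;   // model vertexes plus a per-instance position attribute
  GLsizei vertexCount;
};

void drawShapes(double distToEye, const ShapeModel& fullModel,
                const ShapeModel& simpleModel, GLsizei instanceCount)
{
  // far away, a torch or flower is only a couple of pixels,
  // so switch to the low-vertex version of the model
  const ShapeModel& model = (distToEye > 64.0) ? simpleModel : fullModel;

  glBindVertexArray(model.vertexArray);
  glDrawArraysInstanced(GL_TRIANGLES, 0, model.vertexCount, instanceCount);
}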

Another Mystery

With all these performance improvements complete, I was anxious to compare my new rendering code with the old McrView code. Unfortunately, there's a lot going on in McrView other than rendering. So I combined the framework of my new test case with the old rendering code and got a number. Then I compared that with the new code.

In this test case, I am rendering one whole Minecraft region (shown above), which is 512 by 512 by 128 cubes. I was rendering only the solid cubes, not the transparent ones. There are 16,670,646 non-air cubes. After meshing, this produces 2,464,256 vertexes.

   old solid only:         146 fps,  6.82 ms/frame
   new solid only:         203 fps,  4.91 ms/frame

This is clearly an improvement, but only about 30%, despite eliminating 70% of the vertexes by meshing the faces, and half of the remainder by culling. Something was wrong.

I complained to a friend and was told not to use any unnecessary OpenGL calls between my draws. So I spent some time rewriting my 3D layer to track the current state of all the OpenGL parameters like shader uniforms and textures, and not set them again when they were unchanged. This kicked my new time up:

   new 3D, mesh and cull:  226 fps, 4.41 ms/frame
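
The state tracking amounts to caching the last value set and skipping redundant calls, something like this (a sketch for textures only; my 3D layer does the same for shader uniforms and the other OpenGL parameters):

// cache the last texture bound so unchanged state never reaches the driver
static GLuint g_currentTexture = 0;

void setTexture(GLuint texture)
{
  if (texture != g_currentTexture)
  {
    glBindTexture(GL_TEXTURE_2D, texture);
    g_currentTexture = texture;
  }
}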

I was convinced that culling had to be a big win, so I turned it off.

   no cull:                239 fps, 4.18 ms/frame

As you can see, the code actually gets faster!! And there's no way that the GPU can test 800,000 triangles faster than the CPU can generate three draw calls instead of one! I then turned off the meshing as well:

 Radeon 270x, 32x32x32 slabs
   mesh and cull:          226 fps, 4.41 ms/frame
   no cull:                239 fps, 4.18 ms/frame
   no mesh:                224 fps, 4.46 ms/frame
   no mesh or cull:        230 fps, 4.34 ms/frame

So basically nothing I was doing was having any effect, and the timing for drawing 8 million or 2.4 million or 1.2 million vertexes was all the same. This says there's some huge overhead that is swamping everything. Disgusted, I went to bed.

The next day, I realized that I was not just drawing three buffers when I do my own culling. I was drawing three buffers per slab. A slab is a 32 by 32 by 32 chunk of scenery, and in this test case, there are 1024 of them. Around 800 of them are non-empty. So I was actually cutting up 2.4 million vertexes into 6 times 800 buffers, which leaves about 500 vertexes per buffer, or 170 triangles. And this Radeon display hardware probably can test 170 triangles in the GPU faster than the CPU can set up three draw calls.

To test this theory, I made the slab larger -- 64 by 64 by 64. The numbers changed dramatically:

 Radeon 270x, 64x64x64 slabs
   mesh and cull:          882 fps, 1.13 ms/frame
   no cull:                608 fps, 1.64 ms/frame
   no mesh:                375 fps, 2.66 ms/frame
   no mesh or cull:        229 fps, 4.37 ms/frame

Now, with more like 1,300 triangles per buffer, the GPU has something to do. And it does it amazingly fast -- 882 fps for that huge region of cubes is not too shabby. What's more, I can see the expected drop-off in performance when I stop meshing and culling. So I tried 128 by 128 by 128 slabs:

 Radeon 270x, 128x128x128 slabs
   mesh, cull:             700 fps, 1.43 ms/frame
   no cull:                572 fps, 1.75 ms/frame
   no mesh:                282 fps, 3.54 ms/frame
   no mesh or cull:        225 fps, 4.44 ms/frame

Oddly, the performance falls off again with these much larger slabs. There should be an average of 10K triangles in a buffer now, which should draw at least as fast as the smaller buffers. I have no idea why it doesn't.

I then tested this on my older Nvidia GT 640 hardware:

 NVidia GT 640, 32x32x32 slabs
   mesh, cull:             192 fps,  5.20 ms/frame
   no cull:                131 fps,  7.59 ms/frame
   no mesh:                 83 fps, 11.92 ms/frame
   no mesh or cull:         53 fps, 18.76 ms/frame

 NVidia GT 640, 64x64x64 slabs
   mesh, cull:             191 fps,  5.22 ms/frame
   no cull:                131 fps,  7.58 ms/frame
   no mesh:                 82 fps, 12.14 ms/frame
   no mesh or cull:         52 fps, 19.10 ms/frame

 NVidia GT 640, 128x128x128 slabs
   mesh, cull:             157 fps,  6.37 ms/frame
   no cull:                129 fps,  7.72 ms/frame
   no mesh:                 63 fps, 15.80 ms/frame
   no mesh or cull:         52 fps, 19.22 ms/frame

On this slower display hardware, even the small 150-triangle buffers keep the GPU busy for a bit, and you see the expected drop-off in performance when I don't mesh or cull. But larger buffers in the size=64 case don't help at all, and performance still drops in the size=128 case.

It is very good luck that I did not try this first. If I had seen the times behaving the way I expected on the Nvidia machine, but not the Radeon, I would have put this down to bad OpenGL drivers and never looked for another reason!

What's Next

Although they are fastest, 64x64x64 slabs have problems. They are a lot of data -- 256K bricks -- and they take a long time to rebuild. After each add or delete of a brick to the world, the rebuild time is around 30 ms. I can probably cut that down a bit, since none of that code is particularly optimized, but it's still going to be a significant factor. They will also take longer to send to the display, and longer to read from the database.

I don't like the fact that changing the display speed requires me to change the slab size. I'd like an architecture that just adapts to whatever display it has. I'm not sure I know enough about the inner workings of GPUs, drivers and OpenGL to build that though.

For now, this cube drawing code is good enough. I still have to integrate selection and highlighting, but I'm almost done. And then I will hopefully avoid OpenGL issues for a while.

It's time to build the game world.
