Home       About Me       Contact       Downloads       Part 15    Part 17   

Part 16: Fun with Shaders

April 10, 2011

In Part 15, I had chunks of Minecraft scenery smoothly loading and unloading as you moved about the world. I had speculated that with changes to the shaders, I could cut the display memory used by each chunk dramatically. This week, I took a crack at doing that.

The original vertex format was 3 floats for position (x,y,z), 3 floats for the normal vector, and 3 floats for the texture coordinates (u, v, and the texture array index.) This is a total of 36 bytes per vertex.

The cubes are all placed on integer coordinates and are small, so position only requires three bytes instead of three floats. Since the shapes are cubes, there are only six possible normals. I can indicate those with a code, taking only 3 bits. The same is true of the texture coordinates: since each cube face is completely covered by a texture, only four possible coordinates are used. Finally, I needed a byte to indicate the texture array index.
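To make the "code instead of vector" idea concrete, here is a minimal sketch of the two lookup tables, written in C++ rather than shader code. The ordering of the entries and the names (`kNormals`, `kCorners`, `decodeNormal`, `decodeCorner`) are assumptions for illustration, not the tables the demo actually uses. With only four corners, two bits would be enough for the texture coordinate code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Vec3 { float x, y, z; };
struct Vec2 { float u, v; };

// The six axis-aligned cube normals. A 3-bit code (0..5) selects one.
// The ordering here is an assumption made for this sketch.
constexpr std::array<Vec3, 6> kNormals = {{
    { 1.0f,  0.0f,  0.0f}, {-1.0f,  0.0f,  0.0f},
    { 0.0f,  1.0f,  0.0f}, { 0.0f, -1.0f,  0.0f},
    { 0.0f,  0.0f,  1.0f}, { 0.0f,  0.0f, -1.0f},
}};

// The four corners of a face that is fully covered by one texture.
// A 2-bit code (0..3) selects one.
constexpr std::array<Vec2, 4> kCorners = {{
    {0.0f, 0.0f}, {1.0f, 0.0f}, {1.0f, 1.0f}, {0.0f, 1.0f},
}};

// Turn a normal code back into a vector.
inline Vec3 decodeNormal(std::uint32_t code) {
    assert(code < kNormals.size());
    return kNormals[code];
}

// Turn a corner code back into (u, v) texture coordinates.
inline Vec2 decodeCorner(std::uint32_t code) {
    assert(code < kCorners.size());
    return kCorners[code];
}
```

A shader version would hold the same two tables as constant arrays and index them with the decoded codes.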

I don't see a way to declare a shader input as a single byte or even short integer, so I've had to pack these fields into integers and do the shift and mask logic to extract them. The vertex is now two integers, for 8 bytes instead of 36. Using the 200 meg limit on display memory, this means easily 1000 chunks of scenery fit in the display.
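The shift and mask logic looks roughly like the following, shown as a C++ sketch so it can be tested on the CPU. The bit layout (position in the low 24 bits, then the normal code, then the corner code, with the texture index in the second integer) is an assumption for illustration. The real layout in the shaders may differ, and `PackedVertex`, `VertexFields`, `pack` and `unpack` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Two 32-bit integers per vertex: 8 bytes instead of 36.
struct PackedVertex {
    std::uint32_t a;  // x, y, z, normal code, corner code
    std::uint32_t b;  // texture array index
};
static_assert(sizeof(PackedVertex) == 8, "packed vertex should be 8 bytes");

// Unpacked fields, one per value the shader needs.
struct VertexFields {
    std::uint32_t x, y, z;   // 0..255 each
    std::uint32_t normal;    // 0..5
    std::uint32_t corner;    // 0..3
    std::uint32_t texture;   // 0..255
};

// Assumed layout of 'a': bits 0-7 x, 8-15 y, 16-23 z, 24-26 normal, 27-28 corner.
inline PackedVertex pack(const VertexFields& f) {
    assert(f.x < 256 && f.y < 256 && f.z < 256);
    assert(f.normal < 6 && f.corner < 4 && f.texture < 256);
    PackedVertex p;
    p.a = f.x | (f.y << 8) | (f.z << 16) | (f.normal << 24) | (f.corner << 27);
    p.b = f.texture;
    return p;
}

// The same shifts and masks a vertex shader would apply to its integer inputs.
inline VertexFields unpack(const PackedVertex& p) {
    VertexFields f;
    f.x = p.a & 0xFFu;
    f.y = (p.a >> 8) & 0xFFu;
    f.z = (p.a >> 16) & 0xFFu;
    f.normal = (p.a >> 24) & 0x7u;
    f.corner = (p.a >> 27) & 0x3u;
    f.texture = p.b & 0xFFu;
    return f;
}
```

Only 29 bits of the first integer and 8 bits of the second are used in this layout, so there is room left over for something like a shape type code.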

I was afraid this extra work in the shader would slow things down, but there's no visible effect. I rendered a large scene (view distance out to 350, instead of the 150 I've been using) using both versions of the code. On my NVidia GeForce GTS 250, the scene takes 13.83 ms with the floating point vertexes, and is even faster (12.39 ms) with the compressed integer vertexes.

This is really amazing to me, considering all the work that gets done in the shaders. Since I'm enjoying myself, I haven't dared to try it on my other hardware. I'm not looking forward to timing the laptop integrated ATI graphics!

Since my desktop machine has a gigabyte of display memory, I decided to let it load the entire 3,379 chunks of the database into display memory and render it all. The slowness of sorting transparent blocks keeps this from being interactive, but this is what scenery out to the horizon would look like (click for full resolution version):

Fig 1: A fully-rendered world
full render

You may notice some broken scenery. The problem at this point is that I have some non-cube shapes: the torch globes, steps and columns. I'm not sure how I want to handle these. I could put a type code in the data and have a switch statement in the shader for handling each type. I could also write multiple shaders, one per shape. Then I would render all the cubes, all the spheres, all the columns, etc.

The problem with doing that will be transparent data. I have to render that in sorted order, and don't want to be switching between shaders in the scene. It's getting more and more tempting to just use alpha testing for transparency the way Minecraft does, and then have a single transparent texture for water. If there's only one transparent texture, it doesn't matter what order I render it in, as long as all the water is drawn last.
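For reference, the sorted-order requirement amounts to something like the sketch below: sort the transparent faces by distance from the eye, farthest first, and draw them in that order. This is a generic back-to-front sort under assumed data structures, not the code from the demo. `TransparentFace` and `sortBackToFront` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One transparent face: its center point and where its indexes start.
struct TransparentFace {
    float cx, cy, cz;
    unsigned firstIndex;
};

// Sort faces so the farthest from the eye comes first (back to front).
// Squared distance is enough for ordering, so no square root is needed.
inline void sortBackToFront(std::vector<TransparentFace>& faces,
                            float eyeX, float eyeY, float eyeZ) {
    auto distanceSquared = [=](const TransparentFace& f) {
        const float dx = f.cx - eyeX;
        const float dy = f.cy - eyeY;
        const float dz = f.cz - eyeZ;
        return dx * dx + dy * dy + dz * dz;
    };
    auto fartherFirst = [&](const TransparentFace& a, const TransparentFace& b) {
        return distanceSquared(a) > distanceSquared(b);
    };
    std::sort(faces.begin(), faces.end(), fartherFirst);
    assert(std::is_sorted(faces.begin(), faces.end(), fartherFirst));
}
```

Since the order depends on the eye position, this has to be redone as the viewer moves, which is why a single shader for all transparent shapes is attractive.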

For trees, alpha testing would be fine. I'm just reluctant to give up general transparency completely. I also don't want to use some kind of pattern for glass, rather than a real blended texture. And I don't intend to just render Minecraft shapes forever. We'll see.

You probably have also noticed the gridding in the water. This isn't a rendering issue. When I extracted the chunks from the Minecraft world file, I set the visibility flags on each chunk independently. So all the outside faces of each chunk, including the water, got set visible. I'll fix that one of these days.

Points instead of cubes

I also tested my idea from Part 11, of using points to summarize the chunks at a distance. In Figure 2, you can see the landscape rendered in points. In Figure 3, a closeup shows the individual points.

Fig 2: Rendered with points
distant points
Fig 3: A close-up
closeup of points

Finally, this gives me a strategy for speeding up rendering of the entire landscape. Up close, I can use the full rendering with sorted transparency. In the middle range, I can draw transparent as opaque, meaning I don't have to sort it or send a new version of the transparent indexes every frame. And in the distance, I can use points to summarize the landscape. The result is shown in Figure 4 (click for full-sized.)
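The three-range strategy can be sketched as a simple distance test per chunk. This is only an illustration: the names (`RenderMode`, `LodRanges`, `chooseRenderMode`) and any threshold values passed in are assumptions, not what the demo uses.

```cpp
#include <cassert>

// The three ways a chunk can be drawn, nearest to farthest.
enum class RenderMode {
    SortedTransparent,  // full rendering, transparent faces sorted each frame
    OpaqueOnly,         // transparent data drawn as opaque, no sorting
    Points              // chunk summarized as points
};

// Distances (in world units) where the rendering method changes.
// Both limits are hypothetical tuning values.
struct LodRanges {
    float nearLimit;
    float midLimit;
};

// Pick a rendering method from the distance between the eye and the chunk.
inline RenderMode chooseRenderMode(float distance, const LodRanges& r) {
    assert(r.nearLimit <= r.midLimit);
    if (distance < r.nearLimit) return RenderMode::SortedTransparent;
    if (distance < r.midLimit) return RenderMode::OpaqueOnly;
    return RenderMode::Points;
}
```

For example, with `LodRanges{150.0f, 350.0f}` a chunk 200 units away would be drawn with the opaque method and one 1000 units away as points.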

Fig 4: A world rendered three ways
mixed render

This still isn't quite interactive, but I have a lot of fiddling yet to do. It's encouraging that I can render this large a world even at poor frame rates. With a bit of fog and somewhat smaller view distances, even the current implementation would be acceptable.

Back in Part 8, I decided I wanted asteroids with radius around 630 units, or 1260 across. This block of scenery is 1024 by 1024, for a diagonal of 1448 units. So this is about the distance that would fit in one of my asteroids. I'd really like to get the view distance up to this, so that you'd never have any flickering of scenery. Distant objects would just switch rendering methods as you got close.

I am still experimenting with shaders and rendering, and don't have a finished demo. I will have to get an OpenGL 3.3 version working, get the shaders working on ATI displays, and then write an OpenGL 2.1 version. That will give me the Linux and Mac ports. If I want a DirectX 9 version for Windows, I'll need to learn DX9 shaders, something which I haven't touched yet. I'll probably put out a Windows demo without DX9 to start.

Dead Ends

I also spent a couple of days this week chasing a problem that wasn't at all what I thought it was.

Reader Klemens Friedl pointed out that the Part 15 demo just eats memory on his machine. Despite the code limiting memory use to 200 meg, he was seeing memory use continually grow, to over 500 meg.

I have my own debug versions of C++ new and delete (see Util/debugMemory.cpp in the source.) These just call malloc and free, but they allow me to track memory use, check for duplicate delete calls, and list all memory leaks at the end of a run.
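The general technique looks something like the sketch below. It is not the code in Util/debugMemory.cpp. It is a stripped-down illustration that wraps malloc and free in two hypothetical functions, `trackedAlloc` and `trackedFree`, and keeps a running total, a high-water mark and a live block count. It does not detect duplicate frees or record where each block was allocated, both of which a real debug allocator would want.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Running totals for all tracked allocations.
struct MemoryStats {
    std::size_t current = 0;     // bytes allocated and not yet freed
    std::size_t peak = 0;        // highest value 'current' has reached
    std::size_t liveBlocks = 0;  // blocks allocated and not yet freed
};

inline MemoryStats& memoryStats() {
    static MemoryStats stats;
    return stats;
}

// Header placed in front of every block so the size is known at free time.
// The max_align_t member keeps the user memory suitably aligned.
union BlockHeader {
    std::size_t size;
    std::max_align_t align;
};

inline void* trackedAlloc(std::size_t size) {
    void* raw = std::malloc(sizeof(BlockHeader) + size);
    if (raw == nullptr) return nullptr;
    BlockHeader* header = static_cast<BlockHeader*>(raw);
    header->size = size;
    MemoryStats& s = memoryStats();
    s.current += size;
    s.liveBlocks += 1;
    if (s.current > s.peak) s.peak = s.current;
    return header + 1;  // user memory starts just past the header
}

inline void trackedFree(void* p) {
    if (p == nullptr) return;
    BlockHeader* header = static_cast<BlockHeader*>(p) - 1;
    MemoryStats& s = memoryStats();
    assert(s.liveBlocks > 0 && s.current >= header->size);
    s.current -= header->size;
    s.liveBlocks -= 1;
    std::free(header);
}
```

At the end of a run, a nonzero `liveBlocks` means something leaked, and `peak` is the maximum total memory allocated.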

According to this code, there are no memory leaks in Part 15. Plus it keeps track of the maximum total memory allocated, and it was around 200 meg (the limit only applies to chunk storage, so there are other items that aren't counted.)

Since my code wasn't leaking memory, I assumed the problem was with the C++ memory management code. I thought perhaps it did some kind of quick allocation, followed periodically by a sweep to free all the deleted memory. The only way to avoid building up lots of memory use in this case is to manage it yourself.

This is simple enough. I can keep a linked list of blocks of memory (allocated with new), and just put deleted blocks back on the list. Then my memory use should never grow beyond the high-water mark of what I need. As old chunks are deleted and new ones loaded, blocks of memory are just put on the free list and then reissued.
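A minimal sketch of such a free list, assuming fixed-size blocks, is shown below. The class name `BlockPool` and its interface are hypothetical and are not taken from the Octree code. Blocks that are released go onto the list, and the next request reuses one before asking the system for more.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Pool of fixed-size memory blocks with a free list for reuse.
class BlockPool {
public:
    // Each block must be big enough to hold the free-list link.
    explicit BlockPool(std::size_t blockSize)
        : blockSize_(blockSize < sizeof(FreeNode) ? sizeof(FreeNode) : blockSize) {}

    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Frees only the blocks currently on the free list; blocks still
    // held by callers must be released before the pool is destroyed.
    ~BlockPool() {
        while (free_ != nullptr) {
            FreeNode* next = free_->next;
            ::operator delete(free_);
            free_ = next;
        }
    }

    // Reuse a released block if there is one, otherwise allocate a new one.
    void* acquire() {
        if (free_ != nullptr) {
            FreeNode* node = free_;
            free_ = node->next;
            return node;
        }
        ++allocated_;
        return ::operator new(blockSize_);
    }

    // Put a block back on the free list instead of deleting it.
    void release(void* block) {
        assert(block != nullptr);
        free_ = new (block) FreeNode{free_};
    }

    // Number of blocks ever requested from the system (the high-water mark).
    std::size_t allocated() const { return allocated_; }

private:
    struct FreeNode { FreeNode* next; };

    std::size_t blockSize_;
    FreeNode* free_ = nullptr;
    std::size_t allocated_ = 0;
};
```

With this scheme `allocated()` only grows when more blocks are in use at once than ever before, which is the high-water mark behavior described above.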

I implemented this for my chunk Octree memory. That was simple enough, since I already had code in there to manage blocks of Octree nodes (you don't really want to call new thousands of times per chunk.) It didn't make any difference.

The other big user of memory is the vertex and index objects created to render the chunks. These only exist briefly -- the code creates the buffer, fills it up, then transfers it to the display using OpenGL glBufferData. Then the system-memory copy of the buffer is freed.

Despite being short-lived, these memory allocations have the same problem as the Octree nodes -- I do a lot of allocations and then free them instead of reusing them. So I rewrote this code to use blocks of memory from a pool, rather than a single allocation for each vertex buffer. This was a nuisance, but I finally got it working.

This did make a difference -- it made things worse! Under OpenGL, I was using glBufferData to move the entire buffer in one call. Since it was now in blocks, I had to use a different interface. I called glMapBuffer to get a pointer to that memory, copied each of the blocks of vertex data, then called glUnmapBuffer to release the OpenGL memory. This seems to change the way OpenGL treats the buffer. On Windows, the Resource Monitor showed an additional 200 meg of memory, all marked "shareable".
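The copy step looked roughly like the sketch below. To keep it runnable without an OpenGL context, the destination is a plain pointer. In the OpenGL path, that pointer would be the one returned by glMapBuffer (for example `glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY)`, where the target and access arguments are assumptions), and glUnmapBuffer would be called once the copy is done. `Block` and `copyBlocks` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One pool block holding part of a vertex or index buffer.
struct Block {
    const void* data;
    std::size_t size;  // bytes actually used in this block
};

// Copy the blocks one after another into a contiguous destination.
// Returns the total number of bytes written.
inline std::size_t copyBlocks(void* dest, std::size_t capacity,
                              const std::vector<Block>& blocks) {
    unsigned char* out = static_cast<unsigned char*>(dest);
    std::size_t offset = 0;
    for (const Block& b : blocks) {
        assert(offset + b.size <= capacity);  // never write past the buffer
        std::memcpy(out + offset, b.data, b.size);
        offset += b.size;
    }
    return offset;
}
```

The original single-allocation code needed none of this, since glBufferData takes one pointer and one size.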

At that point, I suddenly realized I had made a stupid mistake. I was debugging the OpenGL version, and wasn't even seeing the huge memory growth that Klemens was seeing. I realized he was running the DirectX 9 version. And sure enough, that makes all the difference. Where my Part 15 OpenGL code was topping out at a 276 meg "working set", my DirectX version was hitting 577 meg. This is running the same test case, rendering the same chunks, etc.

I've poked around a bit on the DirectX version, but I can't find the problem. I'm freeing all the buffers once they are unused, and Release is returning a zero refcount. So I don't see why DirectX should be holding on to any memory.

If any DirectX programmers out there have ideas, let me know. You can see the code in Framework/Graphics3D/DX9/DX9Indexes.cpp. The other big buffer is in DX9VertexTA.cpp in the same directory. It's a very straightforward piece of code.

Blog Updates

For the last three parts, I've gotten into the habit of posting a description of what I'm doing, then updates as I get the demo working. From the server logs, it's hard to tell if this is costing me readers. I assume you are all following this via RSS and get notified of the updates. If not, let me know in the comments. I can add an "update" message to the top of a part each time I change it, or to the front page of the blog if that helps.

If you want each update to be a separate part, I can do that too, although that will make them shorter. I know that people are still coming in via earlier parts (even part 1 still gets a lot of traffic!) so I figured it would be nicer to read if each part was a self-contained topic, without a lot of small "still working on it" parts.

I can also write more code-free posts blathering on about programming and design issues, if you really think that's worthwhile. I got the impression that people would rather read about the actual development though.

Let me know what you think.
