
Part 17: Less Fun with Shaders

April 15, 2011

After the good results I got last week, I fully expected to produce a demo. In fact, I even thought of a better way to do what I was doing with shaders, so I was going to title this part "More Fun with Shaders." Unfortunately, it didn't work out that way.

In Part 16, I described cutting the display memory required for chunks of scenery by encoding them in my game, and decoding them in the shader. This dropped the memory use from 36 bytes per vertex to 8 bytes. With a slightly different arrangement of bits, I could have done it in 4 bytes. I wasn't paying any performance penalty either. The shader code seemed just as fast as before.
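
For the non-programmers, here is roughly what that difference looks like from the shader's point of view. The names and exact packing below are mine, not the real Part 16 layout, and the two versions would live in separate shaders -- think of it as a sketch of the idea:

// large-vertex version: everything arrives as full floating point values,
// adding up to 36 bytes per vertex
in vec3 position;
in vec3 normal;
in vec2 texture;

// compressed version: one attribute of four 16-bit integers (8 bytes),
// which the shader decodes back into position, normal and texture
in ivec4 packedVertex;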

The only remaining problem with that approach was that my format could only handle cubes. The game currently has four shapes -- cubes, steps, columns and spheres (which I use in place of Minecraft torches). In future versions, I wanted more shapes -- perhaps even user-definable shapes. I needed a solution for that. And I thought of the perfect way to do it!

I had been sending cube-related fields like the cube face (six possible values, 0-5) and the texture coordinate (four possible corners, 0-3), and then turning those back into the "normal x,y,z" and "texture u,v" in the shader. Instead, I realized I could just send a vertex number. There are 24 possible vertexes on a cube (6 faces times 4 corners), and I could just look up each vertex in a table.

The table entry for a vertex would have the full normal and texture coordinates. It could also have a delta position from the origin of the object, allowing me to do the surface of spheres, or any other shape I wanted. I could put all the different shapes in a big table. The vertexes for cubes would run from 0 to 23, the ones for steps from 24 to 47, and so on. Then a single shader could render all the different types of shape.
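
In other words, each vertex sent to the display carries one small integer, and the shader uses it as an offset into the big table. The base offsets below are only to illustrate the layout I just described:

// hypothetical layout of the combined shape table
const int CUBE_BASE   = 0;    // 24 cube vertexes:   0-23
const int STEP_BASE   = 24;   // 24 step vertexes:   24-47
const int COLUMN_BASE = 48;   // 24 column vertexes: 48-71

// for a cube, the index is just the face and corner packed together
int vertex = CUBE_BASE + face * 4 + corner;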

I coded this up. I'm going to have to show you some code here to get my points across, so bear with me, non-programmers. First, we define a structure to hold all the information:

struct VertexInfo
{
  vec3 position;
  vec3 normal;
  vec2 texture;
};

Then we define an array of these structures and initialize it to all the different types of vertex. For my first attempt, there were just 24 entries for a cube:

const VertexInfo SHAPES[24] = VertexInfo[](
  //         position             normal                  texture u,v

  // cube xmin face
  VertexInfo(vec3(0.0, 1.0, 1.0), vec3(-1.0,  0.0,  0.0), vec2(0.0, 0.0)), 
  VertexInfo(vec3(0.0, 1.0, 0.0), vec3(-1.0,  0.0,  0.0), vec2(1.0, 0.0)), 
  VertexInfo(vec3(0.0, 0.0, 1.0), vec3(-1.0,  0.0,  0.0), vec2(0.0, 1.0)), 
  VertexInfo(vec3(0.0, 0.0, 0.0), vec3(-1.0,  0.0,  0.0), vec2(1.0, 1.0)), 
   ...

For non-programmers, this is just defining rows of a table. Each row has three entries, "position", "normal" and "texture". The entries have three values, three values and two values respectively. So the first row is for a vertex on the XMin side of the cube (think of it as the left side). The position is (0,1,1), which is the top-back corner. The normal faces left (-1,0,0), and this corner is at (0,0) in the texture pattern for the face.

And so on for each of 24 vertexes.

To use this in the shader, we write a line like:

vec3 normal = SHAPES[vertex].normal;

This gets the "vertex" row of the table SHAPES, and pulls out the "normal" entry. I get the position and texture entries as well, use all of this to specify a vertex for the display to render, and we're good to go.
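
Putting it all together, the vertex shader ends up looking something like this. This is a simplified sketch -- the uniform and attribute names are mine, and I've left out lighting and everything else the real shader does:

#version 330

uniform mat4 mvpMatrix;     // combined model-view-projection matrix
uniform vec3 chunkOrigin;   // where this chunk sits in the world

in ivec4 packedVertex;      // .xyz = cube position within the chunk
                            // .w   = index into the SHAPES table

out vec3 fragNormal;
out vec2 fragTexture;

// ... struct VertexInfo and the SHAPES table, as defined above ...

void main()
{
  VertexInfo info = SHAPES[packedVertex.w];

  // corner offset within the cube, plus the cube's position in the
  // chunk, plus the chunk's position in the world
  vec3 worldPosition = chunkOrigin + vec3(packedVertex.xyz) + info.position;

  fragNormal = info.normal;
  fragTexture = info.texture;

  gl_Position = mvpMatrix * vec4(worldPosition, 1.0);
}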

And it worked! It rendered all my cubes without any cube-specific code in the shader. I had compiled the shader with a really large table (1024 entries) and that compiled fine. So I knew it would handle all the shapes I wanted. There was just one little problem....

Not So Fast

My speed test case for the shaders is a modified version of the game with the cursor and text overlays turned off, and no sky or transparent cubes. This week, I also cut out all the non-cube shapes, to make sure everything is comparable to the new shaders. What is left is just a series of calls, one per chunk, to render all the opaque data in the scene. It's as low overhead as I can make it. I render around 300 chunks, just to get the time long enough to measure accurately.

For the part 15 demo, which uses 36 bytes per vertex, I get a revised time of 11.79 ms. For the cube-only shader I mentioned last week, it was even faster -- 10.40 ms.

For the new code, with my fancy table of vertexes, I got... 301.26 ms.

A whopping three frames per second. Nearly thirty times slower. Useless.

The obvious question is: what is causing this? The code to index an array isn't slow, and the use of the data is the same as in the cube-only shader. It has to be a problem accessing the constant array data -- somehow it must not fit in the shader's local storage, forcing it into slower memory.

I tried making the SHAPES array a local variable and initializing it. That didn't help. I thought perhaps the compiler was just miserable at indexing arrays of structures, so I made it an array of floating point values and did the indexing myself. That didn't help either.
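
For the curious, that second attempt looked roughly like this (a sketch, not the exact code I used):

// same data as SHAPES, flattened to 8 floats per vertex:
// position x,y,z, then normal x,y,z, then texture u,v
const float SHAPE_DATA[24*8] = float[](
  0.0, 1.0, 1.0,   -1.0, 0.0, 0.0,   0.0, 0.0,   // cube xmin, vertex 0
  0.0, 1.0, 0.0,   -1.0, 0.0, 0.0,   1.0, 0.0,   // cube xmin, vertex 1
  ...
);

// index the flat array by hand instead of using a struct
int base = vertex * 8;
vec3 position = vec3(SHAPE_DATA[base],   SHAPE_DATA[base+1], SHAPE_DATA[base+2]);
vec3 normal   = vec3(SHAPE_DATA[base+3], SHAPE_DATA[base+4], SHAPE_DATA[base+5]);
vec2 texture  = vec2(SHAPE_DATA[base+6], SHAPE_DATA[base+7]);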

I knew that constants in the code (when you write something like "pi = 3.14") can't be that slow. Shaders would never get decent performance in that case. So I recoded the table as a big switch statement:

switch (vertex)
{
  case 0:
    position = vec3(0.0, 1.0, 1.0); 
    normal = vec3(-1.0,  0.0,  0.0); 
    texture = vec2(0.0, 0.0);
    break;

  case 1: 
    ...
}

And this did the trick. The time dropped back down to 11 ms. Yay! A big switch statement was kind of tacky, but it would work. I enlarged it to 48 entries and added the step shape. I enlarged it again to 72 entries to do the column shape. And... it stopped working. The compiler didn't give an error -- the shader just didn't run.

This was all with OpenGL 3.3, and shader language 3.3. I looked through the documentation for shader language 1.2, which is what I have to use on the Mac, and it didn't even have switch statements. So I had to rewrite this to use if statements. The trivial way to do this is to compare vertex to each of 24 values, and do the assignments when you hit the right one. For a table of 1000 entries, that would be slow. So I wrote it as a binary tree. For 8 entries, that would look like this:

if (vertex < 4)
{
  if (vertex < 2)
  {
    if (vertex < 1)
    {
      // vertex 0
      position = vec3(0.0, 1.0, 1.0); 
      normal = vec3(-1.0,  0.0,  0.0); 
      texture = vec2(0.0, 0.0);
    }
    else
    {
      // vertex 1
      position = vec3(0.0, 1.0, 0.0); 
      normal = vec3(-1.0,  0.0,  0.0); 
      texture = vec2(1.0, 0.0);
    }
  }
  else
  {
    if (vertex < 3)
      // vertex 2 ...
    else // vertex 3 ...
  }
}
else
{
  if (vertex < 6)
  {
    if (vertex < 5)
      // vertex 4 ...
    else // vertex 5 ...
  }
  else
  {
    if (vertex < 7)
      // vertex 6 ...
    else // vertex 7 ...
  }
}

As you can see, this is tedious, so I wrote a program to generate the if statements for me. For a table of 8 entries, it means three comparisons to find the value rather than an average of 4. Not a big deal. But for a table of 1024 entries, it means 10 comparisons instead of an average of 512. Definitely worth it!

This worked: the time was identical to the switch-statement version, it handled all 72 cases, and it would work on the Mac. So far, so good.

Time for Another Display

I decided to test my various compressed vertex shaders on my Linux machine, which has the integrated ATI Radeon 4200 graphics chip.

I run the same test case on the Linux machine, but with only 88 test chunks (view distance = 150) instead of 321 (view distance = 300). Here is a table of the times for various versions:

Version              NVidia 250 (ms)   ATI 4200 (ms)   Ratio
                     (321 chunks)      (88 chunks)     (ATI/NVidia time)
large vertexes       11.79             18.64           158%
cube-only vertexes   10.40             22.27           214%
shape table          301.26            135.33          45%
24-case switch       10.80             102.98          954%
72-case if tree      10.80             45.45           421%

What I'm seeing here is that the NVidia card handles the switch and if statements well -- almost as well as the simpler cube-only case (which itself has small 6-case and 4-case switch statements). But it just doesn't handle the shape-table constant array well. I knew all that.

What's interesting is the comparison between the two displays. While ATI also handles the shape table poorly, relative to its own base time the penalty is not nearly as bad as on NVidia. On the other hand, it's much worse at handling the switch statement.

In fact, ATI runs the 24-case switch even slower than the tree of 72 if statements. If we figure the if-tree executes about 6 tests per vertex, and the 24-case switch is implemented as a series of if-then-else statements averaging 12 tests per vertex, then the switch should be roughly twice as slow as the if-tree, and it is.

This creates yet another issue for portability. Not only can I not write the same shader code for all displays, I can't even be sure that shader algorithms will run as well on one display as they do on another. The large vertexes, cube-only, switch and if-tree versions are all very similar on NVidia, but are all over the map on ATI.

What's disappointing is that even the cube-only shader is slower on the ATI display than using large vertexes. The time saved by having a smaller amount of data to send to the display is used up by the extra code in the shader. None of the other techniques are anywhere near the time of just using large vertexes. On the integrated ATI display, I might as well use large vertexes and a dumb shader for everything.

One More Try

There is still a bit of hope left. The cube-only shader could have been slower because it has small switch statements in it. If I rewrite the shader to avoid these completely, perhaps I can still send short vertexes, which allows me to load more chunks of scenery into the display. Here's how:

In the cube-only shader, I did this kind of thing:

vec3 normal;
switch (nCode & 0x7)
{
  case 0: 
    normal = vec3(1.0, 0.0, 0.0);
    break;

  case 1: 
    normal = vec3(-1.0, 0.0, 0.0);
    break;
  etc ...
}

Here I'm taking 3 bits (values 0 - 5) and producing the correct normal based on the face, using a switch statement. Instead, I can do something like this:

vec3 normal;
normal.x = float((nCode & 0x3)-1);
normal.y = float(((nCode >> 2) & 0x3)-1);
normal.z = float(((nCode >> 4) & 0x3)-1);

Here I'm taking 2 bits for the normal x (using values 0, 1, 2) and subtracting 1, to get -1, 0, or 1, which is the range of possible values I want. I do that on y and z as well, taking 2 bits each in the coded integer nCode. Instead of using a single field of 3 bits, I'm using 3 fields of 2 bits, or 6 total. That uses more space in the compressed vertex, but as you can see, there are no if or switch statements. I can do the same kind of thing with the texture coordinates.
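
For example, with one bit each for u and v (the bit positions here are arbitrary -- the real ones depend on how the rest of nCode is packed), the texture coordinates come out the same way:

vec2 texture;
texture.x = float((nCode >> 6) & 0x1);   // u is 0.0 or 1.0
texture.y = float((nCode >> 7) & 0x1);   // v is 0.0 or 1.0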

When I do this, I get the same 10.40 ms in the NVidia case, but now I get a time of 17.49 ms for ATI. This is a bit faster than the large vertex case, and gets me the storage reduction I want. Yay!

The Bottom Line

There are shader programs I can write that will run well on all devices. The simple shaders I use for the large vertex (part 15) version work fine, and so does this last attempt, with no if or switch statements. It looks like the does-many-shapes shader is impossible to write portably. I will have to have a separate shader for each shape after all.

I asked Florian Bösch (who has a new post on his own blog) what I should do about shaders. His (edited) reply was:

  • Const arrays are really slow. It's weird.

  • Structs tend to be a bit slower than vanilla type arrays.

  • You should get better speed with uniform arrays. Unfortunately, most hardware limits them to 128 to 256 elements.

  • You should avoid doing things that would prevent the compiler from inlining your array index access.

  • You should not do too many bit-tricks (shifts and ANDs) on the GPU. They're not much good at them and hardware support for them is really poor.

  • You should avoid conditional statements. A lot of the time they get inlined, except when they don't, in which case they become really slow.

  • Shader cores have a really, really tiny stack (something like 2 KB per core or so). If you load a lot of data onto that stack, they constantly fetch data from main VRAM, which is a bus access that blocks all other cores from doing the same.
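
For reference, his uniform-array suggestion would mean declaring the shape table something like this, with the game filling it in once at startup with glUniform3fv and friends. The names are mine, and I haven't benchmarked it:

// shape table as uniform arrays instead of compile-time constants;
// most hardware limits these to a hundred or two entries
uniform vec3 shapePosition[128];
uniform vec3 shapeNormal[128];
uniform vec2 shapeTexture[128];

// used the same way as the constant table
vec3 normal = shapeNormal[vertex];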

My State of Mind

As I have worked on the low-level graphics code for this project, I've gradually developed a mental image of the PC graphics industry. The designers of OpenGL and DirectX, the 3D hardware companies and the writers of the reference books and the OpenGL SuperBible have all been rolled into a single character in my mind.

It looks like this:

[image: He looked so reliable at first.]

Hopefully, I'll have better news for you next week.
