FXAA and Environment Mapping

FXAA is really just magic. No other word for it. Antialiasing as a cheap post-process is pretty much solved thanks to Mr. Lottes. I dropped it into the engine last night and was amazed at the difference! It makes edges look great:

Those asteroids look a lot better without the jaggy edges. Another nice benefit is that FXAA can also handle other sources of aliasing, like shimmering specular highlights, which is fantastic considering that I just implemented environment mapping. Which leads me to...

Cubemaps / environment mapping in GL turned out a lot less painful than I was expecting. It's the first time I've ever touched cube maps, but everything went well! I've converted the background generator so that it now generates the background nebula directly into a cube map, which makes distortion less apparent. Plugging the background cube map into the deferred lighting pass for environment mapping also turned out to be relatively painless, with the exception of a nasty aliasing problem that I'm still working on. So here I am, flying next to an absurdly shiny behemoth:

It's amazing how much those reflections add to the graphics! I'm super-excited about them :)

Another bonus of environment maps is that you can use one of the mipmaps to approximate directional ambient lighting from space. Since the mipmaps are generated using a box blur, they can serve as a (very) rough approximation of the light scattered from a diffuse surface with the given normal. This means that we can use the environment map not only for reflections, but also to add a nice "global" light effect to diffuse surfaces. For example, if one part of the sky is a bright blue nebula, then some of that blue light will be scattered by the sides of asteroids that are facing that direction. Unfortunately, tweaking this to look natural has proved to be much more difficult than getting good reflections.
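In the deferred lighting shader, the idea boils down to something like this (a sketch with illustrative names; the mip level is just a tuning knob and depends on the cubemap resolution):

```glsl
// Sketch: use the environment cubemap twice - a sharp sample for
// reflections, and a heavily blurred mip sampled along the normal as a
// rough directional ambient term. Names here are illustrative.
uniform samplerCube envMap;

const float AMBIENT_MIP = 6.0;  // a low-res, heavily box-blurred level

vec3 envReflection(vec3 viewDir, vec3 normal) {
    return texture(envMap, reflect(viewDir, normal)).rgb;
}

vec3 envAmbient(vec3 normal) {
    // A blurred mip sampled along the normal ~ averaged light arriving
    // from that part of the sky, which we use to tint diffuse surfaces.
    return textureLod(envMap, normal, AMBIENT_MIP).rgb;
}
```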

One final screenshot, this time from some experiments with tweaking the background generator. I came across this system and loved it!

If only there were more to do other than fly around and harass dumb AI...hopefully soon!

TNB Frames, Reflections, and Rogue AI

I've always wondered why people fatten up their vertex structures with an extra vector (tangent). It just seems unnecessary to me. I figured I was missing something. I guess I still am. Why not just generate a local tangent and bitangent from your normal? You can even make it continuous...and there you go, you have a TNB frame for normal mapping. I guess the issue is that the tangent and bitangent won't align with the direction of increasing texture coordinates. Which I suppose would be a problem if you really wanted everything to line up perfectly with your albedo. But hey, I just want some modulation in the normal! I don't need all this fancy math. So here's what I do:
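(To be concrete: the snippet below is a sketch of one standard construction of this kind - Frisvad's method - rather than a verbatim copy of my shader, and the cutoff constant is arbitrary.)

```glsl
// Sketch: build a tangent and bitangent purely from the unit normal,
// continuous everywhere except in a tiny neighborhood of n = (0, 0, -1).
void buildTNB(in vec3 n, out vec3 t, out vec3 b) {
    if (n.z < -0.999999) {            // handle the lone singular direction
        t = vec3( 0.0, -1.0, 0.0);
        b = vec3(-1.0,  0.0, 0.0);
        return;
    }
    float a = 1.0 / (1.0 + n.z);
    float c = -n.x * n.y * a;
    t = vec3(1.0 - n.x * n.x * a, c, -n.x);
    b = vec3(c, 1.0 - n.y * n.y * a, -n.y);
}
```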

A low-cost, continuous TNB frame in your vertex shader. Works for me...

I spent several hours last night fighting my worst mathematical enemy: world-space position reconstruction. I'm not sure why I have such a hard time with it, because it's really not that hard...but every time I try to get world-space position using linear depth inside a shader, I inevitably screw up for hours before getting it right. After finally getting it, though, I'm in good shape to implement environment maps. For now, I'm just hacking with screen-space reflections, but even those (as incorrect as they are) add a lot to the graphics. Shiny things are definitely a must in space games!! :)
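(For the record, the reconstruction I keep fumbling amounts to something like the sketch below. It assumes the G-buffer stores view-space depth normalized by the far-plane distance, and that the fullscreen pass interpolates a camera-to-far-plane ray per pixel - conventions vary, which is exactly where I keep tripping up.)

```glsl
// Sketch: reconstruct world-space position from linear depth.
// Assumes depthTex stores view-space depth / farPlaneDistance, and
// vFarRay is the interpolated ray from the camera to the far plane
// for this pixel (set up per-vertex on the fullscreen quad).
uniform sampler2D depthTex;
uniform vec3 cameraPos;

in vec3 vFarRay;
in vec2 vTexCoord;

vec3 reconstructWorldPos() {
    float linearDepth = texture(depthTex, vTexCoord).r;  // in [0, 1]
    return cameraPos + vFarRay * linearDepth;
}
```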

The above image has an interesting story to accompany it. I was fooling around with AI again, and implemented a retaliatory AI strategy that runs in tandem with the movement strategy and detects when the AI's ship comes under attack, switching the necessary strategies to pursue and attack the assailant. So I flew up to a passive AI that was on his way to make a trade run, fired a missile, and started to run as he began to pursue me. Unfortunately for him, several other traders were nearby, and he wasn't a very good shot. The idiot ended up blasting several of his nearby AI friends with missiles in his attempts to hit me. And then came one of those lovely "emergent AI" moments. The friends immediately turned on the rogue gunman and started chasing and attacking him! Of course they did: there was nothing special in the code that specified that it had to be the player attacking; the strategy just told the AI to retaliate against any attacks. And so they did! But I hadn't even thought of this possibility, so it was a very cool thing to see. I watched from afar as an all-out space brawl erupted, with missiles streaking all over the place. Cool! It's nice to see that they've got my back.

I'm a sucker for pretty asteroid fields, so here's another one, just for the records...

And finally, a shot of the real-time profiler that I made today to help me rein in my frame time. Always a pleasure trying to do graphics programming with an Intel HD 3000...

Looks like my GPU particle system is really a problem. Particle allocation is just killing my performance! I'm not sure how to further optimize those TexSubImage2D calls...as far as I can tell, there's no other way to do it, and they have to be done every frame, which forces a premature CPU/GPU sync.

Understanding Half Texels

Last night I was revisiting terrain generation on the GPU. I ran into the same problem that always crops up when you try to generate heightmaps on the GPU: slight misalignments that cause cracks between neighboring terrain patches. The root cause of this is texture coordinates, but it's easy to overlook the problem, especially since it is well-known that chunked terrain WILL have cracks between neighboring chunks of different LOD levels. However, the cracks caused by LOD differences should only exist at every other vertex (i.e., every other vertex should line up perfectly). But if you don't take special care when thinking about texture coordinates, you won't have any vertices lining up perfectly between neighboring chunks. Skirts will pretty much fix the problem, especially at high resolutions, but it's still discomforting to know that your implementation is fundamentally wrong.

Here's what you're probably doing: float height = H(tex.x + offsetX, tex.y + offsetY, ...), where tex is the texture coordinates, and offsetX and offsetY are uniforms passed in to the shader that indicate the location of the chunk relative to the world. In a perfect world, where texture coordinates range between 0 and 1, this would work, because the border of one chunk, H(1 + 0, ...), for example, would exactly line up with the border of the neighboring chunk, H(0 + 1, ...). So when you see cracks in the terrain, you must immediately begin to suspect that the texture coordinates are doing something strange. And indeed, they are.

Try this: make a pixel shader that outputs the texture coordinates to a floating point texture. Then read it back and examine the results. They may surprise you (they surprised me): the texture coordinates do NOT range from 0 to 1. On the contrary, they range between [1/(2w), 1 - 1/(2w)] in u and [1/(2h), 1 - 1/(2h)] in v, where w and h are the width and height of the texture in pixels, respectively. Wait, what??? Yes. Believe it or not, this makes sense. Texels are addressed by their CENTER, so 0 is actually the upper-left corner of the upper-left texel. To get to the center of the first texel, you must add a half-texel offset in both dimensions, which is 1/(2w) in u and 1/(2h) in v. The same reasoning applies to all other texels. So why is the shader lying to you? Well, if the shader had handed you coordinates that actually ranged from 0 to 1 and you tried to do a texture lookup, then you would be accessing, for example, the texel at (0, 0), which would invoke filtering - probably not what you wanted. This is a big problem in DirectX, where the driver does NOT automatically offset the texture coordinates for you, so it's really easy to end up invoking a bilinear filter on your whole texture if you aren't specifically aware of this subtlety. Luckily, GL is nice enough to anticipate this problem and solve it for us. But it has the nasty side-effect of getting in the way when we try to do things like this where we want [0, 1].
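(The "try this" shader really is tiny - here's a sketch, assuming a floating-point color attachment is bound; read the result back with glReadPixels using GL_FLOAT and inspect the corner values.)

```glsl
#version 330 core
// Sketch: write the interpolated texture coordinates of a fullscreen
// quad into a floating-point render target so they can be read back.
in vec2 vTexCoord;
out vec4 fragColor;

void main() {
    fragColor = vec4(vTexCoord, 0.0, 1.0);
}
```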

Ok, hopefully I've made a convincing case that texture coordinates don't range from 0 to 1 in your fullscreen fragment shader, and that there's actually a good reason for that. Once we understand the issue, the solution is really simple. We want to map the range [1/(2x), 1 - 1/(2x)] to [0, 1]. Luckily, it doesn't take a lot of heavy math to realize that we can easily achieve this with the formula u' = b * (u + a), where a = -1/(2x) and b = x / (x - 1). Intuitively, this means we first subtract a half-texel, which gives us the range [0, 1 - 1/x], then we scale by x / (x - 1) to bring that second component back to 1: (1 - 1/x) * x / (x - 1) = (x - 1) / (x - 1) = 1. And that solves it!
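In shader terms, the fix is a one-line remap (a sketch; texSize is assumed to be passed in as a uniform holding the render target dimensions):

```glsl
// Sketch: remap fullscreen texture coordinates from
// [1/(2*size), 1 - 1/(2*size)] back to [0, 1], per axis.
uniform vec2 texSize;   // width and height of the render target in texels

vec2 removeHalfTexel(vec2 uv) {
    vec2 a = -0.5 / texSize;             // subtract the half-texel offset
    vec2 b = texSize / (texSize - 1.0);  // stretch the upper end back to 1
    return b * (uv + a);
}
```

With that remap applied before adding the chunk offset, H(1 + offsetX, ...) at the edge of one chunk evaluates exactly the same point as H(0 + offsetX + 1, ...) at the edge of its neighbor, so the shared border vertices finally line up.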

And now, the obligatory pretty picture of the day! I found this system today and really loved the background and the asteroid arrangement.

Finally, here's a great resource on the half texel issue, and it includes some nice images: http://drilian.com/2008/11/25/understanding-half-pixel-and-half-texel-offsets/.

GPU Particle System

Some recent profiling has made it very clear to me that a CPU-side particle system just isn't going to work out for large numbers of particles. It's more of a bottleneck than I anticipated, simply because it has to ship a vertex/index array to the GPU each frame.

Although there's no real conceptual difficulty in a particle system on the GPU, the real trick is getting it to run fast (no surprises there). With a vanilla GPU particle system, if you want to support 100K particles, then you have to draw the vertices for 100K particles each and every frame, barring any geometry shader fanciness. Not cool. Naturally, you can check in the vertex shader to see if the particle is active or not, and if it isn't, then you cull it. This saves pixel processing. Still, that's a lot of vertices burning through the GPU. When I got this far in implementing my GPU particle system, it wasn't doing much better than the CPU one.

Well, wouldn't it be great if we could draw only the active particle vertices? But that would require having all active particles coalesced in a linear region of the texture. Hmmm...defragmenter, perhaps? Wait a minute...there's nothing stopping us from using stack-based allocation! By that, I mean that we always allocate the next free block, and when we go to free a particle, we swap it with the last one in use. This is a classic memory scheme for maintaining a tightly-packed linear array. And it works perfectly in this situation, because the particles do not care about order! Using this strategy, we can specify the exact size of the active vertex buffer, so that no inactive particle vertices are drawn in the draw call.
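The allocator itself is almost nothing - here's a sketch with made-up names (moving the actual texture data when a swap happens is discussed further down):

```cpp
// Sketch: stack-style slot allocation for a tightly-packed particle pool.
// liveCount always equals the number of active particles, so the draw
// call can submit exactly liveCount particles' worth of vertices.
struct ParticlePool {
    int capacity  = 0;
    int liveCount = 0;

    // Returns the slot for a new particle, or -1 if the pool is full.
    int Allocate() {
        if (liveCount == capacity) return -1;
        return liveCount++;               // next free slot is always the top
    }

    // Frees 'slot' by swapping the topmost live particle into it.
    // Returns the index that moved (so its data can be copied into 'slot'
    // in both the CPU shadow and the GPU textures), or -1 if none moved.
    int Free(int slot) {
        --liveCount;
        if (slot == liveCount) return -1; // freed the top; nothing to move
        return liveCount;
    }
};
```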

To break up all this text, here's a somewhat cheesy picture of loads of particles! It'll look a lot better once I apply textures to them.

Another hidden benefit of stack-based allocation is that we can now save a LOT of time on particle allocation by coalescing allocations per frame. Without stack-based allocation, you end up making one or more TexSubImage2D calls per particle creation. TexSubImage2D is expensive, and will be a HUGE bottleneck if called for each particle creation (trust me, I've profiled it). In fact, it will quickly become the bottleneck of the whole system. However, if we know that particles are packed tightly in a stack, we can defer the allocation to a once-per-frame coalesced upload, pushing whole rows at a time to the data textures. This is actually easy and intuitive to implement, but yields massive savings on TexSubImage2D calls! With a 256x256 particle texture, you can now perform (in theory) 256 allocations with a single call. A little probability theory tells us that, on average, we'll make 128 times fewer calls to TexSubImage2D. Another big win for stack allocation!
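The per-frame upload then looks roughly like the sketch below (it assumes the new particles were staged into a CPU-side RGBA float buffer laid out to match the data texture; names are illustrative, not my exact code):

```cpp
#include <algorithm>  // std::min
// Assumes an OpenGL loader (glad, GLEW, ...) has already been included.

// Sketch: push this frame's newly-allocated particles to the data texture
// in whole-row chunks instead of one glTexSubImage2D call per particle.
void UploadNewParticles(GLuint tex, const float* stagedData,
                        int texWidth, int firstNew, int newCount) {
    if (newCount <= 0) return;
    glBindTexture(GL_TEXTURE_2D, tex);

    int index = firstNew;
    int remaining = newCount;
    while (remaining > 0) {
        int x = index % texWidth;
        int y = index / texWidth;
        int count = std::min(remaining, texWidth - x);  // rest of this row
        glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, count, 1,
                        GL_RGBA, GL_FLOAT,
                        stagedData + 4 * (y * texWidth + x));
        index += count;
        remaining -= count;
    }
}
```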

One final detail - when freeing a particle and performing the swap, you need to be able to swap the texture data as well. I couldn't find a way to read back a subregion of a texture in OGL, so I opted to shadow the GPU data with a CPU copy. One might argue that doing so defeats the purpose of having a GPU PS, but that's not true - the real saving (at least for my application) is the draw call: having the vertex/index data already on the GPU is a huge win. To avoid shadowing, you could use a shader and a pixel-sized quad (or the scissor test) to copy the data, but that's messier and requires double-buffering. So I opted for the easy way. As always, to back my decision, I profiled, and indeed found that keeping a CPU shadow of the data is virtually free. Even with having to perform Euler integration on both the CPU and the GPU, the cost is nowhere close to the cost of drawing the particles, which at this point should be the bottleneck.

Now, for something totally different, here's a (questionably) pretty picture of my dumb AI friends and me flying around planet Roldoni. Yes, the terrain looks terrible. But now that I've turned my attention back to planetary surfaces, it won't stay that way for long. I've also ported my sky shader over from the XDX engine, and it's still looking nice.

PS ~ A final note: you might think that you can save even more time on particle allocation by detecting the case in which multiple rows can be uploaded at a time (i.e., you have a massive number of particles to upload in one frame) and coalescing them into a single TexSubImage2D call covering a large rectangle. It turns out that this optimization really doesn't save much time at all, so there's no need to go that far unless you regularly expect to upload hundreds of thousands of particles (I was testing with allocations of ~10,000 at a time, and the saving was not noticeable). Any way you slice it, the vertex cache is going to get hit really hard if you allocate a huge number of new particles at once, and that, if anything, is going to cause a stall - not the uploading.

High Res at a High Price

I got around to generating proper, seamless 3D backdrops so that the space background no longer has glaring symmetry. Unfortunately, the improvement comes at a high cost. Before, I was able to get away with generating noise in a 2D space. To properly texture the inside of a sphere, however, you can't do that. Between the 3x cost of going from 2D to 3D Worley noise and the fact that I now need a texture for each hemisphere of the sky (instead of a single mirrored texture), it takes 6 times longer to generate the background. Ouch!

The result? Well, check out the lack of symmetry:

To push the resolution of my textures even higher, I had to split up the texture generation into multiple GPU submissions. Interestingly, Windows 7 has a feature that tries to recover the graphics driver if it hangs for more than 2 seconds. To do so, it kills the hanging application and reports a graphics driver "crash." Well, it's pretty easy to hit the 2 second mark when generating a 2048 texture with loads of 3D noise computations on a laptop with an Intel HD3000. Previously, the resolutions of both my planet and backdrop textures were limited by this fact, since both are generated on the GPU. The natural solution is to split the work up into multiple calls, so that the driver responds in under 2 seconds.

It turns out that there's an awesome little feature that makes this painless: the "scissor test." To spread the work out over multiple calls, one can just enable the scissor test, set the scissor rectangle to some fraction of the full texture dimensions, then draw a fullscreen quad as usual. Repeat the process, sliding the scissor rectangle as you go, until you've covered the whole texture. This works because pixels/fragments will get discarded by the scissor test before the pixel/fragment shader is ever invoked (and that's where all the work happens if you're generating a texture on the GPU). It's also less painful than trying to adjust the position/texture coordinates of the FSQ.
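In code it's only a handful of lines (a sketch; DrawFullscreenQuad stands in for whatever issues the generator pass, and the slice count is whatever keeps each submission comfortably under the watchdog limit):

```cpp
#include <algorithm>  // std::min
// Assumes an OpenGL loader has already been included.

void DrawFullscreenQuad();  // assumed: draws the FSQ with the generator shader bound

// Sketch: fill a large render target in several short GPU submissions by
// scissoring the same fullscreen pass into horizontal slices.
void GenerateInSlices(int texWidth, int texHeight, int numSlices) {
    glEnable(GL_SCISSOR_TEST);
    int sliceHeight = (texHeight + numSlices - 1) / numSlices;
    for (int i = 0; i < numSlices; ++i) {
        int y = i * sliceHeight;
        int h = std::min(sliceHeight, texHeight - y);
        if (h <= 0) break;
        glScissor(0, y, texWidth, h);
        DrawFullscreenQuad();
        glFinish();  // make sure this slice completes before queuing the next
    }
    glDisable(GL_SCISSOR_TEST);
}
```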

Here's another shot, showing off the high-resolution textures (backdrop and planets):

I really loved how well this surface went with the background:

What's next? Well, those trade lane gates sure are eyesores...

AI, Particles, Normal Maps

It's been a great week for progress. Using the pathfinding algorithm discussed in the last post, I've implemented some rudimentary AI players that simply navigate to a random planet using the trade lane network, then select a new destination and repeat upon arrival. Although simple, the AI sure livens up the bleak little star system. I also implemented a basic, CPU-side particle system for creating engine trails and explosions. Finally, I've implemented GPU-based normal map generation for heightmaps, and used it to make planets and asteroids look a whole lot better.

Here's the new and improved world. The streaks of yellow are AI players up ahead that have been instructed to pick a random asteroid and orbit it. I'm following them as they journey towards their respective rocks.

And here you can see me chilling with my AI entourage (they were told to protect me). The engine trails show that the AI paths are always smooth, thanks to some nice math that controls the AI movement. They look very natural while flying around me. Now all we need are some enemies for them to attack!

It's been a really exciting week for this project, and I can't wait to see what happens when I start playing with enemies/attacking, police, real traders, etc.! Oh, and I guess someday I should stop being lazy and fix the fact that the skydome is mirrored vertically...kind of an immersion-breaker.

Stars and A-Star

I implemented proper billboard stars last night, including a nice Gaussian falloff and angular rays for dramatic effect. It beats the heck out of my previous solution, which actually used spherical geometry (a very bad call when a quad can do the same job). For fun, I also threw together a "lo-fi" post-process, which just adds some grit and vertical lines.

Here's a new star with the lofi filter:

And one without the filter:

In other news, I finally got around to doing some algorithmic work. I implemented A* pathfinding and a procedure to build a navigation graph for systems. The algorithm was surprisingly easy to implement...I was expecting that pathfinding would be harder! Seems that many people have already thoroughly solved the problem. Works for me :) Some initial tests have shown that everything is working properly, and my ship is able to use trade lanes to navigate itself from planet to planet using the optimal route, including jumping into the middle of trade lanes (which is often the optimal solution when you're not right by a planet). This is a great step in the direction of AI players!

Deferred Shading, Trade Lanes, Asteroids

A lot of my recent work has been on internals (an unfortunate fact that I always hate admitting). But I'm really trying to nail down the cleanest possible abstractions and pipelines for a deferred renderer in GL. I think I've already come further than I ever have before. I've also got transparency integrated properly with the deferred engine, which is another first for me.

My current project is looking more and more like a game each day. There are now "trade lanes," a la Freelancer, for intra-system transport. Asteroids also made a first appearance yesterday.

As the screenshots suggest, I've implemented a vignetting postprocess filter (just for the sake of screenshots). I'm also using the filmic tonemapping operator from the classic Uncharted 2 (John Hable) presentation. It's really a magical little piece of math :)
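For reference, here's the curve itself with the constants commonly quoted from that presentation (the exposure and white-point handling below is a sketch, not necessarily how my shader wires it up):

```glsl
// Hable's filmic curve with the constants commonly quoted from the
// Uncharted 2 presentation.
vec3 hableCurve(vec3 x) {
    const float A = 0.15;  // shoulder strength
    const float B = 0.50;  // linear strength
    const float C = 0.10;  // linear angle
    const float D = 0.20;  // toe strength
    const float E = 0.02;  // toe numerator
    const float F = 0.30;  // toe denominator
    return ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F;
}

vec3 tonemap(vec3 hdrColor, float exposure) {
    const float W = 11.2;  // linear white point
    vec3 mapped = hableCurve(hdrColor * exposure);
    vec3 whiteScale = 1.0 / hableCurve(vec3(W));
    return mapped * whiteScale;  // gamma correction still happens afterwards
}
```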

The most serious challenge I'm currently facing is performance. Having done much of this stuff before in DX9, I'm seriously concerned that my frametime is already hitting 16ms at 1280x768 with only a reasonable amount of geometry on the screen, and reasonable post-processing. I think this is just a fundamental challenge with OpenGL. I've said it once, I'll say it again. DirectX is a better API. It's significantly harder to figure out how to "correctly" do something in GL such that it is as performant as the DX equivalent - which is precisely the problem I'm running into now. There are so many different ways to do something in GL that it's exceptionally hard to tell when you've done something the "right" way. This is in stark contrast to D3D, in which, generally, if something works, then you have done it the "right" way.

Trading and Such

I've finished a basic trading system. I can actually buy and sell from planets now, making profit where the deals are good. The GUI engine is great - it makes the interface really enjoyable to use. I'm glad that I invested so much time into it last year to make it sleek. Trading is pretty effortless and intuitive.

Unfortunately, there's not much more to do than that. There are no traders and no dynamic economy - so the goods stay where you leave them. Still, it's nice to see a slight bit of gameplay.

Planets are basic but somewhat attractive. I really like the atmosphere shader. The planetary surfaces are procedural and still need a lot of work; right now they're just very simple Worley functions. The background is also a simple Worley texture.

So far, I'm doing a really good job of sticking to the top-down methodology. Nothing is beautiful, but everything is decent, and, more importantly, I am exploring higher-level concepts that I have never had to touch before - modular game logic, component-based entities, and so on. I'm finally making a game. And it feels great :)