## Blurring a Cubemap

September 26, 2012

I was surprised by how few resources on blurring a cubemap are available online. I'll give a brief explanation of how to do so in case someone else with the same needs happens to stumble upon this.

First, it is a very reasonable thing to want to blur a cubemap. The most obvious reason one might want to do so, at least in my mind, is to simulate a reflection map for a glossy surface. The cheap alternative to doing so is using the mipmaps, but those will probably be generated using a box filter, which looks pretty bad as an approximation of a glossy surface. So you might want to generate a nice gaussian-blurred version of your environment map. But how?

Doing so is straightforward with a texture - we all know the algorithm. But a cubemap is a very different beast. Or is it? Here's the thing one must keep in mind when approaching this problem: stop thinking in terms of six textures. Instead, think of your cubemap as the unit sphere. Which it is, right? Or at least that's how you index into it. Now, if you have a ray pointing at that sphere, how might you find the "blurred" equivalent of that ray? The obvious answer is by summing contributions of "nearby" rays. What do you mean by "nearby"? Well, that totally depends on what kind of kernel you want! For a simulation of a gaussian blur, we will sample points on a disk around our ray, where the radius of each sample on the disk is drawn from a gaussian distribution.

Suppose that e1 is the normalized position that corresponds to the current cubemap pixel, and that e2 and e3 are the orthogonal vectors that form the tangent space of e1 (see the tangent/bitangent post from a while back if you need code to do that). Also suppose that "samples" is an int corresponding to how many samples you want to take (I suggest at least 1000 for HQ, but this will depend on your SD), and that "SD" is the standard deviation of the blur. Finally, suppose that you have a function Noise2D(x, y, s) that returns a random float in [0, 1] given two coordinates and a seed (although for this application, the noise doesn't really need to be continuous). Then, here you have it:
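
For concreteness, here's a minimal CPU-side sketch of that loop in Python (a real version would live in a shader). Everything named here is an assumption for illustration: `sample_env` stands in for the cubemap lookup, `tangent_frame` is one simple way to get e2 and e3, and Python's `random.Random` replaces Noise2D. The angle is stratified per sample; the radius is drawn from a gaussian with standard deviation `sd`.

```python
import math
import random

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def tangent_frame(e1):
    # Hypothetical helper: cross e1 with a world axis it is not parallel to.
    helper = (1.0, 0.0, 0.0) if abs(e1[0]) < 0.9 else (0.0, 1.0, 0.0)
    e2 = normalize(cross(helper, e1))
    e3 = cross(e1, e2)
    return e2, e3

def blur_direction(e1, sample_env, samples, sd, seed=0):
    """Monte-Carlo estimate of the blurred color seen along direction e1.

    sample_env(direction) -> (r, g, b) is an assumed stand-in for a
    cubemap texture fetch.
    """
    rng = random.Random(seed)  # stands in for Noise2D
    e1 = normalize(e1)
    e2, e3 = tangent_frame(e1)
    total = [0.0, 0.0, 0.0]
    for i in range(samples):
        # Stratify the angle: one jittered sample per angular slice.
        angle = 2.0 * math.pi * (i + rng.random()) / samples
        # Radius on the tangent disk, drawn from a gaussian.
        radius = abs(rng.gauss(0.0, sd))
        dx, dy = radius * math.cos(angle), radius * math.sin(angle)
        d = normalize(tuple(e1[k] + dx * e2[k] + dy * e3[k] for k in range(3)))
        color = sample_env(d)
        for k in range(3):
            total[k] += color[k]
    return tuple(c / samples for c in total)
```

To blur a whole cubemap you would call something like `blur_direction` once per output texel, passing the direction through that texel's center.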

This is a great example of a Monte-Carlo technique. We solved the problem in an elegant and asymptotically-correct fashion with minimal math and minimal code! If you want better results, look up "stratified sampling" and play around with that (note that the code above is already performing stratification on the angle, just not on the radius). For my purposes, this simple sampling strategy was sufficient.

## Dithering!

September 26, 2012

I'm really excited about this post. If you look carefully at my previous screenshots, especially those in grayish/dusty environments, you might notice a dirty little secret that I've been refusing to talk about until now: color quantization (in the form of color banding). Usually, color quantization is not really a problem that most games have to address. However, the dusty and monochromatic atmospheres that I've been producing as of late are particularly subject to the problem, since they use a very narrow spectrum of colors. Within this narrow spectrum, one starts to run into the problem that 32-bit color only affords 256 choices each for red, green, and blue intensity in a given pixel. When the entire screen is spanned by a narrow spectrum of colors, the discrete steps between these 256 levels become visible, especially in desaturated colors (since there are only 256 total choices for gray).

As an example, here's a shot that demonstrates the problem:

Notice the visible banding in the lower-left. If you open this image in Photoshop, you can see that the colors are only one unit away from one another in each component - so we really can't do any better with 8 bits per component! The solution? A very old technique called dithering! Dithering adds some element of "noise" to break up the perfect borders of discrete color steps, which results in a remarkably-convincing illusion of color continuity.

OpenGL supposedly supports dithering, but it's not at all clear to me how it would do so, and especially how that would interact with a deferred renderer. Luckily, it's actually quite easy to implement dithering in a given shader. The appropriate time to do so is when you have a shader that you know will be performing some color computations in floating-point, but will then be outputting to a render target with a lower bit depth (e.g., RGBA8). You'd like to somehow retain that extra information that you had while performing the FP color computations - that's where dithering comes in. Before you output the final color, just add a bit of screen-space noise to the result! In particular, take a random vector v = (x, y, z) where x, y, and z are each drawn from Uniform[-1, 1], using a pseudorandom function of the fragment coordinates (or the texture coordinates), which need not be of high quality. Then add v / 256 to your result. Why v / 256? Well, this will allow your pixel to go up or down a single level of intensity in each component, assuming you're using an 8-bit-per-component format. In my experiments, this worked well.
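
Here's a minimal sketch of that step, written in Python rather than GLSL so it can be run on its own. `hash01` is a made-up integer hash standing in for whatever cheap per-pixel noise your shader uses, and `dither_to_8bit` is a hypothetical name; the only parts taken from the description above are the Uniform[-1, 1] noise and the division by 256.

```python
def hash01(x, y, channel):
    """Cheap, low-quality hash of integer pixel coords -> float in [0, 1).

    Illustrative only; any per-pixel pseudorandom function would do.
    """
    h = (x * 374761393 + y * 668265263 + channel * 2147483647) & 0xFFFFFFFF
    h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
    h ^= h >> 16
    return h / 4294967296.0

def dither_to_8bit(color, x, y):
    """Quantize a float RGB color in [0, 1] to 8 bits per channel,
    adding v / 256 with v ~ Uniform[-1, 1) before rounding."""
    out = []
    for channel, value in enumerate(color):
        noise = 2.0 * hash01(x, y, channel) - 1.0
        dithered = value + noise / 256.0
        clamped = min(max(dithered, 0.0), 1.0)
        out.append(int(round(clamped * 255.0)))
    return tuple(out)
```

A float value that falls between two 8-bit levels ends up at one level in some pixels and at the neighboring level in others, which is what breaks up the band edges.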

Right now, I've implemented this the lazy way by switching my primary render targets to RGBA16, then dithering a single time immediately before presenting the result, which, if you think about it, is pretty much equivalent to the above, but requires 2x the bandwidth. I will switch to the more efficient way soon.

And here's the same scene, now with dithering:

As if by magic, the banding has totally disappeared, and we are left with the flawless impression of a continuous color space. Happy times 🙂

## Mining and Beam Weapons

September 25, 2012

I haven't given up! Not yet at least. I've got a five-week plan to get a Kickstarter ready for this project. I hope to flesh out a lot of cool features between now and then, and pull together some real cinematic footage so as to show off all the graphical polish on which I've been spending time.

Right now, my priority is all things mining. Unfortunately, I'm still not positive how the mining system will be implemented. I know that ore/mineable things will actually be physical objects attached to asteroids. I think what will happen is normal weapons will be able to break off shards of the ore, which can then be tractored in. Specialty mining lasers will also be available, and will provide far more efficient extraction of ore (i.e., higher yield than just blasting at it). Mining lasers will have built-in tractor capability, so they won't require a two-step mining process. Of course, this requires implementation of beam weapons, which I've been avoiding.

Here's my first attempt at mining beams.

The beams look pretty cool, and are animated using coherent temporal noise that flows along the direction of the beam, which breaks up the perfect gradient and provides the illusion that the beam is pulsing into the target object.

I can't wait to use this same functionality to slap some beam weapons on large ships and observe epic cap-ship battles (which, sadly, will require a good ship generator, which I'm still avoiding :/).

## Dust and Collision Detection

September 3, 2012

I made some great progress yesterday with volumetric effects, and am officially excited about the current look of dust clouds/nebulas! I've implemented a very cheap approximation to global illumination for volumes. Although the effect is really simple, it adds a lot to the realism. On top of illumination from background nebulae, there's also a fake illumination effect from stars. Here are some shots:

Definitely starting to feel like the game has a strong atmosphere!

The rest of my time lately has been devoted to finally getting real collision detection. I tried several options, including Bullet, OZCollide and Opcode. For now, I've opted (no pun intended...) to go with Opcode. I love the cleanliness of the interface, as well as the fact that it's really easy to use as just a narrow-phase collider, since I've already implemented my own broad-phase.

Once again, I started a fight that I shouldn't have started with someone way bigger than me. Luckily, thanks to the new, accurate collision-detection, I was able to sneakily take refuge in a depression in this large asteroid:

Opcode is very fast, so I'm pleased with that. And I haven't even set up temporal coherence caches yet, which supposedly can yield 10x speedups. Unfortunately, I've already run into a very serious problem that it seems is universal to collision detectors. They don't support scaling in world transformations!! I'm not sure why...I guess there's some fundamental challenge with that and collision detection algorithms that I'm not aware of. At any rate, this poses a big problem: I've got thousands of collidable asteroids drifting around systems. With scaling support, this wouldn't be a problem, because there are only about 15 unique meshes that I generate for a given system. With instancing and scaling, you can't tell that some of the asteroids are the same. Scaling is an absolutely critical part of the illusion! But to get the scaled asteroids to work with collision detectors, I have to bake the scale data into the meshes. I can't possibly generate a unique, scaled mesh and acceleration structure for each of the thousands of asteroids at load time. That would take way too long, and way too much memory.

To solve this, what I'm currently doing is trying to anticipate collisions, and launch another thread to build the acceleration structures for the objects that would be involved. To do so, I keep track of a "fattened" world-space AABB for all objects, and launch the builder thread when the fattened boxes intersect. The factor by which the world boxes are exaggerated affects how early the acceleration structures will be cached. So far, this method is working really well. Opcode is fast at building acceleration structures, so I haven't had any trouble with the response time. In theory, with a slow processor, collisions will be ignored if the other thread does not finish building the acceleration structure before the collision occurs. I've tested to see how far I can push the complexity of meshes before this starts happening. Indeed, if I use the full-quality asteroid meshes (~120k tris) for collision, my first missile flies right through them (admittedly, I also turned the speed way up on the missiles). But let's face it, 120k tris for a collision mesh is a little much! And the only real alternative is to stall and wait for the acceleration structure to be generated, which means the game would freeze for a moment as your missile is heading towards an asteroid or ship. I'd much prefer to run the risk of an occasional missed collision on a really low-end machine than to have the game regularly stalling when I shoot at things.
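
The fattened-box test itself is tiny. Here's a rough sketch in Python under some assumptions of my own: boxes are scaled about their centers by a single `factor`, and `pairs_to_prebuild` is a hypothetical brute-force stand-in for a real broad-phase. It only reports which pairs should have their acceleration structures built; launching the worker thread is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AABB:
    lo: tuple  # (x, y, z) minimum corner in world space
    hi: tuple  # (x, y, z) maximum corner in world space

    def fattened(self, factor):
        """Return a copy scaled about its center by `factor`."""
        center = tuple((a + b) * 0.5 for a, b in zip(self.lo, self.hi))
        half = tuple((b - a) * 0.5 * factor for a, b in zip(self.lo, self.hi))
        return AABB(tuple(c - h for c, h in zip(center, half)),
                    tuple(c + h for c, h in zip(center, half)))

    def intersects(self, other):
        # Boxes overlap only if their extents overlap on every axis.
        return all(self.lo[k] <= other.hi[k] and other.lo[k] <= self.hi[k]
                   for k in range(3))

def pairs_to_prebuild(boxes, factor):
    """Index pairs whose fattened boxes touch, i.e. candidates for
    building collision structures ahead of time (O(n^2) for clarity)."""
    fat = [b.fattened(factor) for b in boxes]
    return [(i, j)
            for i in range(len(fat))
            for j in range(i + 1, len(fat))
            if fat[i].intersects(fat[j])]
```

A larger `factor` flags pairs earlier, which buys the builder thread more time at the cost of building structures for near misses.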

I'm very impressed with how easy multithreading is with SFML. The thread and mutex structures provided by SFML are completely intuitive and work perfectly! It took less than an hour to implement the aforementioned collision structure worker thread.

## Engine Trails and Dust

September 2, 2012

Wasn't such a good idea to mess with this guy...he has a lot of guns. I've upgraded engine trails to a first-class subsystem; they no longer use the particle system. They look much better as continuous ribbons rather than bunches of particles.

Last night I started playing with fake volumetric effects for nebulas/dust clouds/etc. The effect is really cool, but isn't that apparent in screenshots. When flying around, the effect is pretty compelling. This was one of my favorite graphical aspects of Freelancer - the look and feel of dust clouds. They felt so distinct and really broke up the monotonous feeling of space.

## FXAA and Environment Mapping

August 29, 2012

FXAA is really just magic. No other word for it. Antialiasing as a cheap post-process is pretty much solved thanks to Mr. Lottes. I dropped it into the engine last night and was amazed at the difference! It makes edges look great:

Those asteroids look a lot better without the jaggy edges. Another nice benefit is that FXAA can also handle other types of aliasing, like specular highlights, which is fantastic considering that I just implemented environment mapping. Which leads me to...

Cubemaps / environment mapping in GL turned out a lot less painful than I was expecting. It's the first time I've ever touched cube maps, but everything went well! I've converted the background generator so that it now generates the background nebula directly into a cube map, which makes distortion less apparent. Plugging the background cube map into the deferred lighting pass for environment mapping also turned out to be relatively painless, with the exception of a nasty aliasing problem that I'm still working on. So here I am, flying next to an absurdly shiny behemoth:

It's amazing how much those reflections add to the graphics! I'm super-excited about them 🙂

Another bonus of environment maps is that you can use one of the mipmaps to approximate directional ambient lighting from space. Since the mipmaps are generated using a box blur, they can serve as a (very) rough approximation of the light scattered from a diffuse surface with the given normal. This means that we can use the environment map not only for reflections, but also to add a nice "global" light effect to diffuse surfaces. For example, if one part of the sky is a bright blue nebula, then some of that blue light will be scattered by the sides of asteroids that are facing that direction. Unfortunately, tweaking this to look natural has proved to be much more difficult than getting good reflections.

One final screenshot, this time from some experiments with tweaking the background generator. I came across this system and loved it!

If only there were more to do other than fly around and harass dumb AI...hopefully soon!

## TNB Frames, Reflections, and Rogue AI

August 28, 2012

I've always wondered why people fatten up their vertex structures with an extra vector (tangent). It just seems unnecessary to me. I figured I was missing something. I guess I still am. Why not just generate a local tangent and bitangent from your normal? You can even make it continuous...and there you go, you have a TNB frame for normal mapping. I guess the issue is that the tangent and bitangent won't align with the direction of increasing texture coordinates. Which I suppose would be a problem if you really wanted everything to line up perfectly with your albedo. But hey, I just want some modulation in the normal! I don't need all this fancy math. So here's what I do:

A low-cost, continuous TNB frame in your vertex shader. Works for me...
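
For anyone who wants something to start from, here is one possible way to build a tangent and bitangent from nothing but a unit normal. This is not necessarily the construction used in this engine; it's the branchless orthonormal-basis formula published by Duff et al., written in Python for readability (in a vertex shader it's the same handful of multiplies). The frame is continuous everywhere except across n.z = 0 in the lower hemisphere, where the sign flips.

```python
import math

def frame_from_normal(n):
    """Return (tangent, bitangent) orthonormal to the unit normal n.

    Uses the sign-based construction from Duff et al.; n must be
    normalized before calling.
    """
    nx, ny, nz = n
    sign = math.copysign(1.0, nz)
    a = -1.0 / (sign + nz)
    b = nx * ny * a
    tangent = (1.0 + sign * nx * nx * a, sign * b, -sign * nx)
    bitangent = (b, sign + ny * ny * a, -ny)
    return tangent, bitangent
```

As noted above, these vectors won't follow the direction of increasing texture coordinates, so this only suits cases where you just want some modulation in the normal.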

I spent several hours last night fighting my worst mathematical enemy: world-space position reconstruction. I'm not sure why I have such a hard time with it, because it's really not that hard...but every time I try to get world-space position using linear depth inside a shader, I inevitably screw up for hours before getting it right. After finally getting it, though, I'm in good shape to implement environment maps. For now, I'm just hacking with screen-space reflections, but even those (as incorrect as they are) add a lot to the graphics. Shiny things are definitely a must in space games!! 🙂

The above image has an interesting story to accompany it. I was fooling around with AI again, and implemented a retaliatory AI strategy that runs in tandem with the movement strategy and detects when the AI's ship comes under attack, switching the necessary strategies to pursue and attack the assailant. So I flew up to a passive AI that was on his way to make a trade run, fired a missile, and started to run as he began to pursue me. Unfortunately for him, several other traders were nearby, and he wasn't a very good shot. The idiot ended up blasting several of his nearby AI friends with missiles in his attempts to hit me. And then came one of those lovely "emergent AI" moments. The friends immediately turned on the rogue gunman and started chasing and attacking him! Of course they did: there was nothing special in the code that specified that it had to be the player attacking; the strategy just told the AI to retaliate against any attacks. And so they did! But I hadn't even thought of this possibility, so it was a very cool thing to see. I watched from afar as an all-out space brawl erupted, with missiles streaking all over the place. Cool! It's nice to see that they've got my back.

I'm a sucker for pretty asteroid fields, so here's another one, just for the record...

And finally, a shot of the real-time profiler that I made today to help me rein in my frame time. Always a pleasure trying to do graphics programming with an Intel HD 3000...

Looks like my GPU particle system is really a problem. Particle allocation is just killing my performance! I'm not sure how to further optimize those TexSubImage2D calls...as far as I can tell, there's no other way to do it, and they have to be done every frame, which forces a premature CPU/GPU sync.

## Understanding Half Texels

August 27, 2012

Last night I was revisiting terrain generation on the GPU. I ran into the same problem that always crops up when you try to generate heightmaps on the GPU: slight misalignments that cause cracks between neighboring terrain patches. The root cause of this is texture coordinates, but it's easy to overlook the problem, especially since it is well-known that chunked terrain WILL have cracks between neighboring chunks of different LOD levels. However, the cracks caused by LOD differences should only exist at every other vertex (i.e., every other vertex should line up perfectly). But if you don't take special care when thinking about texture coordinates, you won't have any vertices lining up perfectly between neighboring chunks. Skirts will pretty much fix the problem, especially at high resolutions, but it's still discomforting to know that your implementation is fundamentally wrong.

Here's what you're probably doing: float height = H(tex.x + offsetX, tex.y + offsetY, ...), where tex is the texture coordinates, and offsetX and offsetY are uniforms passed in to the shader that indicate the location of the chunk relative to the world. In a perfect world, where texture coordinates range between 0 and 1, this would work, because the border of one chunk, H(1 + 0, ...), for example, would exactly line up with the border of the neighboring chunk, H(0 + 1, ...). So when you see cracks in the terrain, you must immediately begin to suspect that the texture coordinates are doing something strange. And indeed, they are.

Try this: make a pixel shader that outputs the texture coordinates to a floating point texture. Then read it back and examine the results. They may surprise you (they surprised me): the texture coordinates do NOT range from 0 to 1. On the contrary, they span [1/(2w), 1 - 1/(2w)] in u and [1/(2h), 1 - 1/(2h)] in v, where w and h are the width and height of the texture in pixels, respectively. Wait, what??? Yes. Believe it or not, this makes sense. Texels are addressed by their CENTER, so 0 is actually the upper-left corner of the upper-left texel. To get to the center of the first texel, you must add a half-texel offset in both dimensions, which is 1/(2w) in u and 1/(2h) in v. The same reasoning applies to all other texels. So why is the shader lying to you? Well, if the shader had handed you coordinates that actually ranged from 0 to 1 and you tried to do a texture lookup, then you would be sampling, for example, at (0, 0), which is the corner of a texel rather than its center, and that would invoke filtering - probably not what you wanted. This is a big problem in DirectX, where the driver does NOT automatically offset the texture coordinates for you, so it's really easy to end up invoking a bilinear filter on your whole texture if you aren't specifically aware of this subtlety. Luckily, GL is nice enough to anticipate this problem and solve it for us. But it has the nasty side-effect of getting in the way when we try to do things like this where we want [0, 1].

Ok, hopefully I've made a convincing case that texture coordinates don't range from 0 to 1 in your fullscreen fragment shader, and that there's actually a good reason for that. Once we understand the issue, the solution is really simple. We want to map the range [1/(2x), 1 - 1/(2x)] to [0, 1], where x is the texture size along the axis in question (w for u, h for v). Luckily, it doesn't take a lot of heavy math to realize that we can easily achieve this with the formula u' = b * (u + a), where a = -1/(2x) and b = x / (x - 1). Intuitively, this means we first subtract a half-texel, which gives us the range [0, 1 - 1/x], then we scale by x / (x - 1) to bring the upper end of that range back to 1: (1 - 1/x) * x / (x - 1) = (x - 1) / (x - 1) = 1. And that solves it!
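
A quick numeric check of that formula, as a small Python sketch. The function names are mine; `texel_center` reproduces the coordinate the rasterizer hands you for texel i, and `remap_to_unit` is the u' = b * (u + a) mapping from above.

```python
def texel_center(i, size):
    """Texture coordinate at the center of texel i (0-based) along one axis."""
    return (i + 0.5) / size

def remap_to_unit(u, size):
    """Map [1/(2*size), 1 - 1/(2*size)] onto [0, 1] via u' = b * (u + a)."""
    a = -1.0 / (2.0 * size)
    b = size / (size - 1.0)
    return b * (u + a)

size = 256
first = remap_to_unit(texel_center(0, size), size)          # left edge of a chunk
last = remap_to_unit(texel_center(size - 1, size), size)    # right edge of a chunk
# With chunk offsets 0 and 1, the shared border now gets the same input to H:
seam_left_chunk = last + 0.0
seam_right_chunk = first + 1.0
```

With the remap in place, the last column of one chunk and the first column of its neighbor evaluate the height function at (up to rounding) the same world coordinate, which is what closes the cracks.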

And now, the obligatory pretty picture of the day! I found this system today and really loved the background and the asteroid arrangement.

Finally, here's a great resource on the half texel issue, and it includes some nice images: http://drilian.com/2008/11/25/understanding-half-pixel-and-half-texel-offsets/.

## GPU Particle System

August 25, 2012

Some recent profiling has made it very clear to me that a CPU-side particle system just isn't going to work out for large numbers of particles. It's more of a bottleneck than I anticipated, just because of the fact that it has to ship a vertex/index array to the GPU each frame.

Although there's no real conceptual difficulty in a particle system on the GPU, the real trick is getting it to run fast (no surprises there). With a vanilla GPU particle system, if you want to support 100K particles, then you have to draw the vertices for 100K particles each and every frame, barring any geometry shader fanciness. Not cool. Naturally, you can check in the vertex shader to see if the particle is active or not, and if it isn't, then you cull it. This saves pixel processing. Still, that's a lot of vertices burning through the GPU. When I got this far in implementing my GPU particle system, it wasn't doing much better than the CPU one.

Well, wouldn't it be great if we could draw only the active particle vertices? But that would require having all active particles coalesced in a linear region of the texture. Hmmm...defragmenter, perhaps? Wait a minute...there's nothing stopping us from using stack-based allocation! By that, I mean that we always allocate the next free block, and when we go to free a particle, we swap it with the last one in use. This is a classic memory scheme for maintaining a tightly-packed linear array. And it works perfectly in this situation, because the particles do not care about order! Using this strategy, we can specify the exact size of the active vertex buffer, so that no inactive particle vertices are drawn in the draw call.
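
In case the swap-with-last idea isn't familiar, here's a bare-bones sketch in Python. `ParticlePool` is a hypothetical name and the particle payload is just an opaque object here; on the GPU side the "slots" would be texels in the data textures, and `count` is what you'd pass as the vertex count for the draw call.

```python
class ParticlePool:
    """Tightly packed particle storage: live particles occupy [0, count)."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        self.count = 0

    def allocate(self, particle):
        """Place a particle in the next free slot and return its index."""
        if self.count == len(self.data):
            raise MemoryError("particle pool is full")
        index = self.count
        self.data[index] = particle
        self.count += 1
        return index

    def free(self, index):
        """Remove a particle by moving the last live particle into its slot."""
        if not 0 <= index < self.count:
            raise IndexError("not a live particle index")
        last = self.count - 1
        self.data[index] = self.data[last]
        self.data[last] = None
        self.count = last
```

Note that freeing changes the index of whichever particle was last, which is fine here precisely because nothing else depends on particle order.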

To break up all this text, here's a somewhat cheesy picture of loads of particles! It'll look a lot better once I apply texture to them.

Another hidden benefit of stack-based allocation is that we can now save a LOT of time on particle allocation by coalescing allocations per frame. Without stack-based allocation, you'll have to make several calls to TexSubImage2D on particle creation. TexSubImage2D is expensive, and will be a HUGE bottleneck if called for each particle creation (trust me, I've profiled it). In fact, it will quickly become the bottleneck of the whole system. However, if we know that particles are packed tightly in a stack, we can defer the allocation to a once-per-frame coalesced allocation, uploading whole rows at a time to the data textures. This is actually easy and intuitive to implement, but yields massive savings on TexSubImage2D calls! With a 256x256 particle texture, you can now perform (in theory) 256 allocations at once! A little probability theory tells us that, on average, we'll make 128 times fewer calls to TexSubImage2D. Another big win for stack allocation!
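
To make the row coalescing concrete, here's a small sketch in Python. It assumes particle index i lives at texel (i % width, i // width) and that the frame's new particles occupy the contiguous index range [start, end), which is what stack allocation gives you. `row_spans` is a hypothetical helper; each tuple it returns would correspond to one TexSubImage2D upload of a 1-texel-high strip.

```python
def row_spans(start, end, tex_width):
    """Split the index range [start, end) into per-row strips.

    Returns a list of (x, y, width) tuples, one per upload.
    """
    spans = []
    index = start
    while index < end:
        x, y = index % tex_width, index // tex_width
        # Take as much of this row as the range covers.
        run = min(tex_width - x, end - index)
        spans.append((x, y, run))
        index += run
    return spans

# 270 new particles in a 256-wide texture become 3 uploads instead of 270.
uploads = row_spans(250, 520, 256)
```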

One final detail - when freeing a particle and performing the swap, you need to be able to swap texture data. I couldn't find a way to read a subregion of a texture in OGL, so I opted to shadow the GPU data with a CPU copy. One might declare that doing so defeats the purpose of having a GPU PS, but that's not true - the real saving (at least for my application) is the draw call. Having the vertex/index data already on the GPU is a huge saving. To avoid shadowing, you could use a shader and pixel-sized quad (or scissor test) to copy the data. But that's messier and requires double-buffering. So I opted for the easy way. As always, to back my decision, I profiled, and indeed found that keeping a CPU shadow of the data is virtually free. Even with having to perform Euler integration on both the CPU and the GPU, the cost is nowhere close to the cost of drawing the particles, which at this point should be the bottleneck.

Now, for something totally different, here's a (questionably) pretty picture of my dumb AI friends and me flying around planet Roldoni. Yes, the terrain looks terrible. But now that I've turned my attention back to planetary surfaces, it won't stay that way for long. I've also ported my sky shader over from the XDX engine, and it's still looking nice.

PS ~ A final note: you might think that you can save even more time on particle allocation if you detect the case in which multiple rows can be uploaded at once (i.e., you have a massive number of particles to upload in one frame) and coalesce them into a single TexSubImage2D call covering a large rectangle. It turns out that this optimization really doesn't save much time at all, so there's no need to go that far unless you regularly expect to upload hundreds of thousands of particles (I was testing with allocations of ~10,000 at a time, and the saving was not noticeable). Any way you slice it, the vertex cache is going to get hit really hard if you allocate a huge number of new particles at once, and that, if anything, is going to cause a stall - not the uploading.

## High Res at a High Price

August 24, 2012

I got around to generating proper, seamless 3D backdrops so that the space background no longer has glaring symmetry. Unfortunately, the improvement comes at a high cost. Before, I was able to get away with generating noise in a 2D space. To properly texture the inside of a sphere, however, you can't do that. Thanks to the 3x cost of going from 2D to 3D Worley noise, as well as the fact that I need a texture for each hemisphere of the sky, it now takes 6 times longer to generate the background. Ouch!

The result? Well, check out the lack of symmetry:

To push the resolution of my textures even higher, I had to split up the texture generation into multiple GPU submissions. Interestingly, Windows 7 has a feature that tries to recover the graphics driver if it hangs for more than 2 seconds. To do so, it kills the hanging application and reports a graphics driver "crash." Well, it's pretty easy to hit the 2 second mark when generating a 2048 texture with loads of 3D noise computations on a laptop with an Intel HD3000. Previously, the resolutions of both my planet and backdrop textures were limited by this fact, since both are generated on the GPU. The natural solution is to split the work up into multiple calls, so that the driver responds in under 2 seconds.

It turns out that there's an awesome little feature that makes this painless: the "scissor test." To spread the work out over multiple calls, one can just enable the scissor test, set the scissor rectangle to some fraction of the full texture dimensions, then draw a fullscreen quad as usual. Repeat the process, sliding the scissor rectangle as you go, until you've covered the whole texture. This works because pixels/fragments will get discarded by the scissor test before the pixel/fragment shader is ever invoked (and that's where all the work happens if you're generating a texture on the GPU). It's also more painless than trying to adjust the position/texture coordinates of the FSQ.
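
The tiling itself is just a pair of loops. Here's a sketch in Python that produces the rectangles; the function name and the square `tile` size are my own choices. In GL, each rectangle would be passed to `glScissor(x, y, w, h)` with `GL_SCISSOR_TEST` enabled, followed by the usual fullscreen-quad draw (and, presumably, a flush or finish between tiles so each submission completes on its own).

```python
def scissor_tiles(width, height, tile):
    """Yield (x, y, w, h) rectangles that cover a width x height target
    exactly once, each at most tile x tile pixels."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(tile, width - x), min(tile, height - y))

# A 2048x2048 texture generated in 512x512 pieces: 16 draw calls.
tiles = list(scissor_tiles(2048, 2048, 512))
```

Smaller tiles mean more draw calls but a shorter worst-case time per submission, so the tile size is the knob to turn if the driver still times out.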

Here's another shot, showing off the high-resolution textures (backdrop and planets):

I really loved how well this surface went with the background:

What's next? Well, those trade lane gates sure are eyesores...