Ambient Occlusion II

Last time I talked about AO, but I left out a teensy little detail: although per-vertex AO is conceptually simple, and also extremely fast to render, it's extremely slow to compute during the pre-process.  Getting high-quality, noise-free AO requires somewhere in the vicinity of 1000 samples of the density field per vertex.  Not exactly a cheap operation!  On the CPU, it quickly becomes prohibitively expensive as either the complexity of the density field or the resolution of the mesh increases.
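To make the cost concrete, here is a minimal CPU sketch of one plausible way to sample per-vertex AO from a density field. It is not necessarily the scheme used in this project: the names `sphere_field`, `hemisphere_direction`, and `vertex_ao` are made up for illustration, the field is assumed to be negative inside solid matter, and each sample is simply a random point in the hemisphere above the vertex.

```python
import math
import random

def sphere_field(p, center, radius):
    """Stand-in density field: signed distance to a sphere (negative inside)."""
    return math.dist(p, center) - radius

def hemisphere_direction(normal, rng):
    """Uniform random unit vector in the hemisphere around `normal`."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        length = math.sqrt(sum(c * c for c in v))
        if 1e-6 < length <= 1.0:  # rejection-sample the unit ball
            break
    v = [c / length for c in v]
    if sum(a * b for a, b in zip(v, normal)) < 0.0:
        v = [-c for c in v]  # flip into the hemisphere the normal points to
    return v

def vertex_ao(field, position, normal, samples=1024, max_dist=1.0, seed=0):
    """Fraction of nearby hemisphere samples that are NOT inside the solid.

    Returns 1.0 for a fully open vertex and approaches 0.0 as it gets buried.
    """
    rng = random.Random(seed)
    occluded = 0
    for _ in range(samples):
        d = hemisphere_direction(normal, rng)
        t = rng.uniform(0.05, 1.0) * max_dist  # skip the surface itself
        p = [position[i] + t * d[i] for i in range(3)]
        if field(p) < 0.0:  # assumption: negative density means "inside"
            occluded += 1
    return 1.0 - occluded / samples
```

Every vertex costs `samples` evaluations of the field, so the total work is vertices × samples × cost of one field evaluation, which is why the CPU version falls over so quickly.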

Today, I moved the computation to the GPU, and have once again been blown away by the computational abilities of modern GPUs.  Now that I have every piece of the mesh computation process - the field evaluation, the gradient evaluation, and, finally, the AO evaluation - running on the GPU, it's simply mind-blowing how high I can push the quality of the mesh and the complexity of the density field.

Here's a mesh consisting of 50 unioned and subtracted round boxes (round boxes are very expensive compared to sharp-edged ones), contoured on a grid of 300 x 300 x 300 (that's an insane level of detail, FYI), resulting in half a million vertices, each of which takes 1024 AO samples.  The GPU performs this work in ~3 seconds.  Incredible.

[Image: High-Quality, Per-Vertex AO Computed on the GPU]


But that's not even the most amazing part.  The amazing thing is that, after profiling, it would seem that the GPU actually takes less than 1 second to complete this work.  It is OpenGL's shader compiler (which, of course, is running on the CPU) that takes the majority of the time.  This isn't too surprising, as the shaders to compute these things are massive, since I actually bake the field equation into the shader.  I'm sure GL spends a long time analyzing and optimizing the equation, which is a good thing, because the shader runs absurdly fast.
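For readers wondering what "baking the field equation into the shader" might look like, here is a rough sketch of a generator that turns a CSG tree into GLSL source text. The node classes (`RoundBox`, `Union`, `Subtract`) and the helper names are hypothetical and not the project's actual code. The round-box distance formula and the min/max combinators are the commonly used ones, assumed here rather than taken from the post.

```python
from dataclasses import dataclass

@dataclass
class RoundBox:
    center: tuple
    half_size: tuple
    radius: float

@dataclass
class Union:
    a: object
    b: object

@dataclass
class Subtract:
    a: object
    b: object

def vec3(v):
    """Format a 3-tuple as a GLSL vec3 literal with float components."""
    return "vec3({}, {}, {})".format(*(repr(float(c)) for c in v))

def emit(node):
    """Recursively emit one nested GLSL expression for the whole tree."""
    if isinstance(node, RoundBox):
        return "(length(max(abs(p - {}) - {}, 0.0)) - {})".format(
            vec3(node.center), vec3(node.half_size), repr(float(node.radius)))
    if isinstance(node, Union):
        return "min({}, {})".format(emit(node.a), emit(node.b))
    if isinstance(node, Subtract):
        return "max({}, -{})".format(emit(node.a), emit(node.b))
    raise TypeError("unknown node type: {!r}".format(node))

def field_shader_source(root):
    """Wrap the expression in a GLSL function the AO shader could call."""
    return "float field(vec3 p) {\n    return " + emit(root) + ";\n}\n"
```

Because every constant ends up as a literal in the source, the compiler is free to fold and simplify the expression, but the source text also grows with every primitive, which would fit the long compile times described above.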

Unfortunately, this brings up a few unwanted issues - I now have a CPU bottleneck that can't be easily offloaded to another thread.  Since the bottleneck is inside GL, I will need to explore multithreaded GL contexts in order to compile the shaders in another thread while the game runs, because I can't have the game stalling every time a new asset enters the region and the corresponding shaders have to be compiled.  Sadly, this probably won't be too easy, but I'm sure I'll learn a lot...!
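The shape of that solution can be sketched independently of GL. In the sketch below, `ShaderCache` and `compile_fn` are hypothetical names: `compile_fn` stands in for whatever actually compiles and links the program. In a real renderer it would have to run on a thread that has its own GL context, created to share objects with the main one. That part is assumed and not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

class ShaderCache:
    """Compiles shaders on one background thread; the game loop only polls."""

    def __init__(self, compile_fn):
        self._compile = compile_fn               # hypothetical compile+link step
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._pending = {}                       # key -> Future
        self._ready = {}                         # key -> compiled result

    def request(self, key, source):
        """Queue a compile unless this asset is already queued or finished."""
        if key not in self._pending and key not in self._ready:
            self._pending[key] = self._worker.submit(self._compile, source)

    def poll(self):
        """Call once per frame: collect finished compiles without blocking."""
        finished = [k for k, f in self._pending.items() if f.done()]
        for key in finished:
            self._ready[key] = self._pending.pop(key).result()

    def get(self, key):
        """Return the compiled shader, or None if it is not ready yet."""
        return self._ready.get(key)

    def shutdown(self):
        self._worker.shutdown(wait=True)
```

The game would call `request` when an asset enters the region, `poll` every frame, and fall back to a placeholder (or simply not draw the asset) while `get` still returns `None`.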

Another, less tractable problem is that the shader compiler flat-out crashes once the field passes a certain complexity.  I will need to explore this some more.  It might just be the fact that my field function dumps an incredibly ugly equation into the shader (it's literally a single line, with hundreds of functions wrapped together).  Perhaps breaking it up will prevent the crash.  Or maybe I've hit some kind of hard limit on the allowed complexity of pixel shaders.  If that's the case, I could explore a solution that uploads the equation as a texture, and create a shader that understands how to parse an equation from a texture.  But that would no doubt be significantly slower than baking the equation into the shader...probably at least an order of magnitude slower :/
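As a rough illustration of the "equation as a texture" idea, the sketch below flattens a CSG tree into fixed-width rows of 8 floats (two RGBA texels per instruction) and evaluates them with a small stack machine. It runs on the CPU in Python purely to show the data layout and the interpreter loop. The opcodes, row layout, and function names are all invented for this example, and a shader version would need a fixed-size stack and texel fetches in place of Python lists.

```python
import math

OP_BOX, OP_UNION, OP_SUBTRACT = 0.0, 1.0, 2.0  # hypothetical opcodes

def flatten(node):
    """Post-order flatten of nested tuples into rows of 8 floats each.

    Nodes are ("box", center, half_size, radius), ("union", a, b)
    or ("subtract", a, b).
    """
    kind = node[0]
    if kind == "box":
        _, center, half_size, radius = node
        return [(OP_BOX, *center, *half_size, radius)]
    _, a, b = node
    op = OP_UNION if kind == "union" else OP_SUBTRACT
    return flatten(a) + flatten(b) + [(op,) + (0.0,) * 7]  # pad to 8 floats

def round_box(p, center, half_size, radius):
    """Distance to a rounded box (same formula commonly used in shaders)."""
    q = [max(abs(p[i] - center[i]) - half_size[i], 0.0) for i in range(3)]
    return math.sqrt(sum(c * c for c in q)) - radius

def evaluate(program, p):
    """Interpret the flattened program at point p with a value stack."""
    stack = []
    for row in program:
        op = row[0]
        if op == OP_BOX:
            stack.append(round_box(p, row[1:4], row[4:7], row[7]))
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(min(a, b) if op == OP_UNION else max(a, -b))
    return stack[0]
```

The shader itself would then be compiled once and reused for every field, at the cost of a branch and a couple of fetches per instruction, which is presumably where the expected slowdown would come from.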

But for now, I will allow myself to be happy with these results, and am most definitely looking forward to working on ships again with this technology in hand!

7 thoughts on “Ambient Occlusion II”

  1. One thing you might consider is instead of doing the entire 300^3 block at once, break it down into smaller blocks, run them separately on the GPU, and merge the final sub meshes into one.

    You will get some extra verts along block edges of course but you could do a readback, and then remove those if you really want.

    I'm considering doing AO like you mentioned for my project also, but 1000 samples is probably more than I can afford-- perhaps I'll just stick with SSAO

    Anyway I like your blog!

    1. Hey dain, thanks for the suggestion! I have actually tried this. However, as I mentioned in the post, the primary bottleneck is linking the shader, which, unfortunately, is equally expensive regardless of how many chunks you break the mesh into. That said, I am considering chunking the mesh to perform better culling, and I may see if it helps at all with render times for large meshes.

      BTW, you can definitely get away with fewer than 1000 AO samples. 256 - 512 still looks quite good. And an easy way to make do with even fewer would be to perform a 3D blur on the results with a smaller sample size, just like they do in SSAO. Or, perhaps don't compute for each vertex, and interpolate for vertices that weren't computed exactly (like performing SSAO on a down-sampled buffer). But of course, if you already have a good SSAO implementation, no need to fix what isn't broken I suppose :)

  2. I'm quite surprised how little attention this has, this is incredible progress!

    Just flipping through some old entries, and my god you've been at this for some time, and it just keeps looking better and better. Absolutely phenomenal work~

    1. Thanks!! I've been at it for a while, but never publicized this blog until LT was announced, hence the relative secrecy. Hoping to continue many more decades of this progress :)

      1. Absolutely, I look forward to catching up on progress and seeing where it all goes!

        Should you ever need a musician or voice actor I'm always in the neighborhood and happy to help! :D
