Avoiding GPU Timeout via Dynamic Load Balancing

December 18, 2012

There's really nothing better in life than when you conceive of something, imagine that it's probably going to be quite difficult to do, then end up getting it to work in five minutes. Seriously. Best feeling ever. It's what just happened with one of the core pieces of the Limit Theory Engine: the GPU texture generator.

I've always known that the engine would crash on under-powered GPUs if it tried to generate really high-res textures (for example, 2048-resolution skyboxes), because the job would take too long, and Windows would decide that the driver had hung, then kill it and restart it. I believe the default timeout is 2 seconds. So if your texture can't generate in 2 seconds, you're dead. Needless to say, a really high-quality, procedural skybox needs more than 2 seconds to generate on integrated chips. So to alleviate the problem, I split the job up into several pieces (rendering only a certain portion of the texture at once), forcing a GPU sync (glFinish) after each one. In theory, this ensures that each piece takes less than 2 seconds to generate, so Windows doesn't get angry. But splitting the job up is inefficient, since every extra piece and forced sync adds overhead. For powerful GPUs that can generate it all in one go, you don't want to split the job at all.

The solution? Very simple, really: define job size n. Initialize it to 1. Now, generate n columns of the texture and time the operation. Use the stencil test to effortlessly select n columns of the texture for rendering without having to do tricky quad math. Make sure to force GPU sync after each job (glFinish), otherwise the timing doesn't reflect the work the GPU actually did. Now, use the elapsed time to adjust the job size (bigger if the job finished well under the target time, smaller if it ran over), then repeat until all columns have been generated. It's a no-brainer, really, but you'd still expect some complications to creep in during implementation.
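Here's roughly what that loop looks like, as a minimal Python sketch rather than the engine's actual code. The names generate_adaptive and render_columns are made up for illustration: render_columns(start, count) stands in for whatever draws those columns and then blocks until the GPU is done (the glFinish part). The update rule, which scales the job size in proportion to the measured time, is my assumption about what "adjust" means here.

```python
import time


def generate_adaptive(total_columns, render_columns,
                      target_seconds=1.0, clock=time.perf_counter):
    """Render total_columns columns in timed jobs and return the jobs issued.

    render_columns(start, count) is a hypothetical callback. It must draw
    the columns [start, start + count) and block until the GPU has finished
    (for example by calling glFinish), or the timing below is meaningless.
    """
    jobs = []
    n = 1        # job size in columns, initialized to 1
    start = 0
    while start < total_columns:
        count = min(n, total_columns - start)

        t0 = clock()
        render_columns(start, count)
        elapsed = clock() - t0

        jobs.append((start, count))
        start += count

        # Assumed update rule: resize so the next job should take about
        # target_seconds at the rate we just measured.
        if elapsed > 0:
            n = max(1, int(count * target_seconds / elapsed))
    return jobs
```

The clock parameter is only there so the loop can be exercised with a fake timer instead of a real GPU. With a fake that charges 1/512 of a second per column, a 2048-column texture comes out as one 1-column probe followed by jobs of 512, 512, 512 and 511 columns.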

Nope. Worked the first time. I'm now able to generate 2048-resolution skyboxes on my laptop without crashing! The scheme, in theory, will never crash, because it uses actual timings to adjust the load on the GPU. For now, I've set the target time at 1 second, just to be safe. So my powerful desktop machine will pretty much do the whole thing in one job (after it times the initial size-1 job), while my laptop determines that 609 columns is optimal, so it uses about 4 jobs per cubemap face.
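A quick back-of-the-envelope check on those laptop numbers, assuming the cost per column is roughly constant (which is my simplification, not something I measured). The helper name estimate_face is made up for this example.

```python
import math


def estimate_face(width, columns_per_job, target_seconds=1.0):
    """Estimate the cost of one face from the steady-state job size.

    Assumes every column costs about the same, so a job of columns_per_job
    columns takes target_seconds.
    """
    seconds_per_column = target_seconds / columns_per_job
    single_job_seconds = width * seconds_per_column   # whole face in one go
    steady_jobs = math.ceil(width / columns_per_job)  # jobs at full size
    return single_job_seconds, steady_jobs


single_job_seconds, steady_jobs = estimate_face(2048, 609)
```

Under that assumption, 609 columns per second puts a whole 2048-column face at a bit over 3.3 seconds in a single job, which is past the 2-second limit, and ceil(2048 / 609) gives the 4 jobs per face.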

Now, it scares me a bit to think that the first job, which is only 1 column wide, is used to make the initial guess at the maximal load. You might imagine that if the timer accuracy isn't great, we could end up overestimating the GPU's capability and crashing. So it might be wise to implement a gradual scale-up, such that the job size can't change too dramatically during the first few iterations. I did this at first, but the naive scheme has yet to crash on me, so I backed off and am happily using the naive scheme for now.
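For reference, a gradual scale-up can be as small as a cap on how much the job size may grow per step. This is a sketch under my own assumptions, not the version I originally wrote: the name next_job_size and the 4x growth cap are invented for the example, and the cap here applies on every step rather than only the first few.

```python
def next_job_size(current, elapsed, target_seconds=1.0, max_growth=4.0):
    """Return the next job size, growing by at most max_growth per step.

    current is the size of the job just timed (in columns) and elapsed is
    how long it took in seconds. Shrinking is deliberately not capped:
    if a job ran long, we want to back off right away.
    """
    if elapsed <= 0:
        # The timer gave us nothing usable, so don't trust it:
        # grow by the cap instead of jumping to some huge size.
        return max(1, int(current * max_growth))

    ideal = current * target_seconds / elapsed       # the naive estimate
    capped = min(ideal, current * max_growth)        # limit the jump upward
    return max(1, int(capped))
```

With a cap of 4, a 1-column probe that suggests 512 columns only gets you to 4, then 16, then 64, and so on, so a single bad timing on the tiny first job can't send one enormous job to the GPU.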