Some recent profiling has made it very clear to me that a CPU-side particle system just isn't going to work out for large numbers of particles. It's more of a bottleneck than I anticipated, because it has to ship a vertex/index array to the GPU each frame.
Although there's no real conceptual difficulty in a particle system on the GPU, the real trick is getting it to run fast (no surprises there). With a vanilla GPU particle system, if you want to support 100K particles, then you have to draw the vertices for 100K particles each and every frame, barring any geometry shader fanciness. Not cool. Naturally, you can check in the vertex shader to see if the particle is active or not, and if it isn't, then you cull it. This saves pixel processing. Still, that's a lot of vertices burning through the GPU. When I got this far in implementing my GPU particle system, it wasn't doing much better than the CPU one.
Well, wouldn't it be great if we could draw only the active particle vertices? But that would require having all active particles coalesced in a linear region of the texture. Hmmm...defragmenter, perhaps? Wait a minute...there's nothing stopping us from using stack-based allocation! By that, I mean that we always allocate the next free block, and when we go to free a particle, we swap it with the last one in use. This is a classic memory scheme for maintaining a tightly-packed linear array. And it works perfectly in this situation, because the particles do not care about order! Using this strategy, we can specify the exact size of the active vertex buffer, so that no inactive particle vertices are drawn in the draw call.
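To make the swap-on-free idea concrete, here's a minimal CPU-side sketch of the bookkeeping. The names (`ParticlePool`, `allocate`, `free`) are made up for illustration and aren't from my engine; each slot holds an opaque particle record, standing in for whatever lives in the data textures.

```python
class ParticlePool:
    """Tightly packed pool: live particles always occupy slots [0, count)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.count = 0                 # number of live particles
        self.data = [None] * capacity  # per-slot particle record (hypothetical)

    def allocate(self, particle):
        """Put a particle in the next free slot; return its index, or None if full."""
        if self.count == self.capacity:
            return None
        index = self.count
        self.data[index] = particle
        self.count += 1
        return index

    def free(self, index):
        """Free a live slot by moving the last live particle into it."""
        if not 0 <= index < self.count:
            raise IndexError("slot is not live")
        last = self.count - 1
        self.data[index] = self.data[last]  # order doesn't matter, so just overwrite
        self.data[last] = None
        self.count = last
        return last  # the slot whose contents now live at `index`


pool = ParticlePool(4)
for name in ("a", "b", "c"):
    pool.allocate(name)
pool.free(0)  # "c" moves into slot 0
print(pool.data[:pool.count])
```

After the free, `pool.count` is exactly the number of particles worth drawing, so the draw call only needs to cover the first `count` slots.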
To break up all this text, here's a somewhat cheesy picture of loads of particles! It'll look a lot better once I apply a texture to them.
Another hidden benefit of stack-based allocation is that we can now save a LOT of time on particle allocation by coalescing allocations per frame. Without stack-based allocation, you'll have to make several calls to TexSubImage2D on particle creation. TexSubImage2D is expensive, and will be a HUGE bottleneck if called for each particle creation (trust me, I've profiled it). In fact, it will quickly become the bottleneck of the whole system. However, if we know that particles are packed tightly in a stack, the particles created in a given frame land in consecutive slots, so we can defer the allocation to a once-per-frame coalesced allocation, uploading whole rows at a time to the data textures. This is actually easy and intuitive to implement, but yields massive savings on TexSubImage2D calls! With a 256x256 particle texture, you can now perform (in theory) 256 allocations at once! A little probability theory tells us that, on average, we'll make 128 times fewer calls to TexSubImage2D. Another big win for stack allocation!
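Here's a rough sketch of the row math behind the coalesced upload. It assumes slot `i` maps to texel `(i % width, i // width)` and that all of a frame's new particles occupy one contiguous run of slots (which is what stack allocation gives you). The function name `row_spans` is hypothetical, and each returned span stands for one TexSubImage2D call rather than issuing any actual GL calls.

```python
def row_spans(first, count, tex_width):
    """Split the slot range [first, first + count) into per-row (x, y, width) spans.

    Each span would become a single one-row-high TexSubImage2D upload.
    """
    spans = []
    index = first
    end = first + count
    while index < end:
        y, x = divmod(index, tex_width)          # row and column of this slot
        width = min(tex_width - x, end - index)  # stop at row end or range end
        spans.append((x, y, width))
        index += width
    return spans


# Ten new particles starting at slot 250 of a 256-wide texture:
# two uploads instead of ten.
print(row_spans(250, 10, 256))  # [(250, 0, 6), (0, 1, 4)]
```

The best case is a batch that fits inside a single row, where up to 256 particles go up in one call.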
One final detail - when freeing a particle and performing the swap, you need to be able to swap texture data. I couldn't find a way to read a subregion of a texture in OpenGL, so I opted to shadow the GPU data with a CPU copy. One might declare that doing so defeats the purpose of having a GPU particle system, but that's not true - the real saving (at least for my application) is the draw call. Having the vertex/index data already on the GPU is a huge saving. To avoid shadowing, you could use a shader and pixel-sized quad (or scissor test) to copy the data. But that's messier and requires double-buffering. So I opted for the easy way. As always, to back my decision, I profiled, and indeed found that keeping a CPU shadow of the data is virtually free. Even with having to perform Euler integration on both the CPU and the GPU, the cost is nowhere close to the cost of drawing the particles, which at this point should be the bottleneck.
Now, for something totally different, here's a (questionably) pretty picture of my dumb AI friends and me flying around planet Roldoni. Yes, the terrain looks terrible. But now that I've turned my attention back to planetary surfaces, it won't stay that way for long. I've also ported my sky shader over from the XDX engine, and it's still looking nice.
PS ~ A final note: you might think that you can save even more speed on particle allocation if you detect the case in which multiple rows can be uploaded at a time (i.e., you have a massive number of particles to upload in one frame), so you coalesce them into a single TexSubImage2D call covering a large rectangle. It turns out that this optimization really doesn't save much time at all, so there's no need to go that far unless you regularly expect to upload hundreds of thousands of particles (I was testing with allocations of ~10,000 at a time, and the saving was not noticeable). Any way you slice it, the vertex cache is going to get hit really hard if you allocate a huge number of new particles at once, and that, if anything, is going to cause a stall - not the uploading.
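For the curious, the multi-row version looks something like the sketch below. It makes the same assumptions as the rest of the post (contiguous slot range, slot `i` at texel `(i % width, i // width)`), and `upload_rects` is a made-up name. A contiguous run never needs more than three rectangles: a partial first row, a block of full rows, and a partial last row.

```python
def upload_rects(first, count, tex_width):
    """Cover slots [first, first + count) with at most three (x, y, w, h) rectangles."""
    rects = []
    index = first
    end = first + count
    if index >= end:
        return rects

    # Leading partial row, if the range doesn't start at a row boundary.
    x = index % tex_width
    if x != 0:
        width = min(tex_width - x, end - index)
        rects.append((x, index // tex_width, width, 1))
        index += width

    # All complete rows go up as one tall rectangle.
    full_rows = (end - index) // tex_width
    if full_rows > 0:
        rects.append((0, index // tex_width, tex_width, full_rows))
        index += full_rows * tex_width

    # Trailing partial row, if anything is left over.
    if index < end:
        rects.append((0, index // tex_width, end - index, 1))
    return rects


# 10,000 new particles starting at slot 250 of a 256-wide texture:
# 3 rectangles here versus 41 one-row uploads.
print(upload_rects(250, 10000, 256))
```

So at ~10,000 particles the rectangle version only trims the call count from a few dozen to three, which fits with the saving not being noticeable in my tests.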