The maximum standard deviation pushes too hard against the limits of the
filter width, giving discrete points a weird boxy blur. The filter
slice width needs to be expanded, but that means a lot of coefficient
debugging, so I'm putting it off by just reducing the maximum DE width
for now.
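As a rough illustration of the stopgap (the 3-sigma rule of thumb and the
names here are my own shorthand, not constants from the code):

```python
def clamp_de_sigma(sigma, slice_radius_px):
    """Stopgap: cap the DE kernel's standard deviation so its support
    (roughly 3 sigma) stays inside the filter slice, instead of letting
    the Gaussian get truncated into a boxy blur at the slice boundary."""
    return min(sigma, slice_radius_px / 3.0)
```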
Use the vertical and horizontal gradients to "detect" when a pixel is
part of an edge that has been softened by grid-shift AA, and avoid
blurring it further. This causes occasional 1px artifacts in stills, but
fixes the truly grotesque DE bleed-out for a net win. A better edge
detector is still needed.
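Roughly what the edge test looks like in spirit, sketched with NumPy; the
threshold, log scaling, and radius clamp are illustrative guesses, not the
kernel's actual math:

```python
import numpy as np

def clamp_de_radius(density, radius, edge_threshold=0.25):
    """Shrink the DE blur radius on pixels that already look like
    AA-softened edges, judged by their horizontal/vertical gradients."""
    logd = np.log1p(density)            # log-scale so bright regions don't dominate
    gy, gx = np.gradient(logd)          # vertical and horizontal gradients
    grad_mag = np.hypot(gx, gy)
    rel = grad_mag / (logd + 1e-6)      # gradient relative to local brightness

    # Pixels with a strong relative gradient were probably softened by
    # grid-shift AA already; don't blur them any further.
    return np.where(rel > edge_threshold, np.minimum(radius, 1.0), radius)
```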
The allocation pool was reallocating a frame's chunks as soon as the frame
left the current scope, before it had been copied; since the pool just
hands back the same chunks, the data got overwritten immediately. I don't
think the fix has any real performance impact, but that should be verified.
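A toy sketch of the hazard and the fix, assuming an asynchronous copy-out;
the class and method names are made up for illustration and aren't the
pool's real API:

```python
class ChunkPool:
    """Toy pool showing the hazard: chunks are reused as soon as they're
    freed, so releasing a frame before its copy-out finishes hands the
    same memory to the next frame and clobbers it."""

    def __init__(self, chunks):
        self.free_chunks = list(chunks)
        self.in_flight = []             # (chunk, copy_done) pairs awaiting copy-out

    def alloc(self):
        return self.free_chunks.pop()   # reuses the most recently freed chunk

    def release_after_copy(self, chunk, copy_done):
        # Fix: park the chunk until its asynchronous copy has completed,
        # instead of returning it to the free list immediately.
        self.in_flight.append((chunk, copy_done))

    def reclaim(self):
        still_pending = []
        for chunk, done in self.in_flight:
            if done():
                self.free_chunks.append(chunk)
            else:
                still_pending.append((chunk, done))
        self.in_flight = still_pending
```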
When the alpha channel is used in a color palette, the code now replaces
the blue channel in the accumulation buffer with a pair of U16s that
encode the blue and alpha values as fractions of the density. When the
alpha channel is always 1.0, the blue channel works as normal. Density is
now always the last element in the accumulation buffer.
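A sketch of the packing scheme as described above; the bit layout,
rounding, and function names are assumptions, not the renderer's actual
code:

```python
U16_MAX = 0xFFFF

def pack_blue_alpha(blue, alpha, density):
    """Encode blue and alpha as U16 fractions of the accumulated density,
    packed into the 32 bits that previously held the blue channel alone."""
    d = density if density > 0 else 1e-12           # avoid divide-by-zero
    b = min(int(blue / d * U16_MAX), U16_MAX)
    a = min(int(alpha / d * U16_MAX), U16_MAX)
    return (a << 16) | b                            # low half blue, high half alpha (assumed order)

def unpack_blue_alpha(packed, density):
    """Recover the accumulated blue and alpha values from the packed word."""
    blue = (packed & 0xFFFF) / U16_MAX * density
    alpha = ((packed >> 16) & 0xFFFF) / U16_MAX * density
    return blue, alpha
```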
Eliminating the separate I/O operations improved total runtime by more
than 30% on my card; the extra calculations cut that gain to 20% when
alpha was present (though that can be optimized further).