Create a map assigning two bits to every output bin. During the atomic
flush, compute a per-bin rate for discarding writes outright that keeps
us under 2% error - discard 1 of every 2 writes if we've already
accumulated 64 writes (hotspot value 1), 7 of 8 if we're above 256
(hotspot value 2), or 31 of 32 at 2048 (hotspot value 3).
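As a minimal sketch of that mapping (the function name is made up, and
I'm assuming the counts get read straight out of the integer
accumulator during the flush):

    // Map a bin's accumulated write count to its two-bit hotspot value.
    // The thresholds (64 / 256 / 2048) are the ones above; everything
    // else is a guess at the surrounding code.
    __device__ unsigned int hotspot_value(unsigned int count)
    {
        if (count >= 2048) return 3;   // discard 31 of 32 writes
        if (count >= 256)  return 2;   // discard 7 of 8
        if (count >= 64)   return 1;   // discard 1 of 2
        return 0;                      // keep every write
    }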
Pack this value into a read-only buffer that can often be cached in L2
and, for particularly concentrated flames (which historically choke
cuburn), in L1.
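Packing could be as simple as sixteen bins per 32-bit word - the layout
is an assumption, and since atomicOr can only set bits, the map would
have to be zeroed before each flush:

    // Pack a two-bit hotspot value into the map, 16 bins per 32-bit
    // word. Assumes the map was zeroed before the flush began.
    __device__ void pack_hotspot(unsigned int *map, unsigned int bin,
                                 unsigned int hv)
    {
        if (hv)
            atomicOr(&map[bin >> 4], hv << ((bin & 15) << 1));
    }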
During writeback, discard writes at the appropriate rate. During the
flush of the integer accumulator to the float accumulator, scale the
integer counts by the discard rate so the expected value comes out
right.
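Both halves might look something like this - a sketch only, with a
bare count standing in for the real accumulator and curand doing the
keep/drop draw; the names and layout are my assumptions:

    #include <curand_kernel.h>

    // Writeback: consult the bin's two-bit hotspot value and keep only
    // 1 in 2 / 1 in 8 / 1 in 32 writes for values 1 / 2 / 3.
    __device__ void write_point(unsigned int *iacc, const unsigned int *map,
                                unsigned int bin, curandState *rng)
    {
        unsigned int hv = (map[bin >> 4] >> ((bin & 15) << 1)) & 3;
        if (hv) {
            unsigned int k = 2 * hv - 1;        // log2(discard rate): 1, 3, 5
            if (curand(rng) & ((1u << k) - 1))  // drop 2^k - 1 of 2^k writes
                return;
        }
        atomicAdd(&iacc[bin], 1u);
    }

    // Flush: each surviving write stands in for 2^k discarded ones, so
    // scale the integer count back up before folding it into the float
    // accumulator.
    __global__ void flush_to_float(float *facc, unsigned int *iacc,
                                   const unsigned int *map,
                                   unsigned int nbins)
    {
        unsigned int bin = blockIdx.x * blockDim.x + threadIdx.x;
        if (bin >= nbins) return;
        unsigned int hv = (map[bin >> 4] >> ((bin & 15) << 1)) & 3;
        unsigned int scale = hv ? 1u << (2 * hv - 1) : 1u;  // 1, 2, 8, 32
        facc[bin] += (float) iacc[bin] * scale;
        iacc[bin] = 0;
    }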
This works because for most flames, there's not a lot of interesting
stuff in the middle regimes; either a bin is very well defined, in
which case we pretty much know exactly what its color is going to be
(remember, the max 2% relative error gets log-scaled as well), or it's
loosely defined, in which case it never crosses the first threshold and
stays at full accuracy.
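The "log-scaled" parenthetical is just the observation that, under the
usual log-density tone mapping,

    log(d * (1 + eps)) = log d + log(1 + eps) ≈ log d + eps,   eps ≤ 0.02,

so a 2% relative error in accumulated density lands as an additive
offset of at most about 0.02 in log space, no matter how hot the bin is.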
Of course, a 10x boost is best-case-ish - a long, high-res render. I
realized, though, that I really didn't care about low-quality output
and should go for broke optimizing for my actual use case, which is
ridiculously high-res HDR stuff. (On pathological flames, on the other
hand, 10x is conservative; this scheme easily gives us 100x.)