A bug in the hacky use of OrderedDict to function as an OrderedSet meant
that any value acceessed in a precalculation function which had already
been accessed from another precalculation function would use an
incorrect index.
Using one stream with two pagelocked host buffers allows us to keep the
GPU work queue full without pegging the CPU, and also reduces the
incidences where a host buffer will get overwritten before it can be
written. devtid() was flaky, so this patch also introduces a ringbuffer
to handle the 'slots' concept. It also introduces an adaptive number of
temporal samples, which improves efficiency but also killed the
assumption that (ntemporal_samples % 256 == 0), which required some
additional fixes.
When the alpha channel is used in a color palette, the code now replaces
the blue channel in the accumulation buffer with a pair of two U16s,
which encode the values of the blue and alpha channels as a fraction of
the value of the density. When the alpha channel is always 1.0, the blue
channel works as normal. Density is now always the last element in the
accumulation buffer.
Eliminating the separate IO operations improved total runtime by more
than 30% on my card, while the extra calculations reduced that to 20%
when alpha was present (though that can be optimized further).