The new toolkit generates filtering code that uses too many registers,
so this change splits filtering into its own module, which can then be
compiled with its own register usage limit. As a bonus, this should
improve startup time in general, since the filtering code is now fixed
and does not need to be recompiled.
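
A rough sketch of what a per-module register limit can look like, assuming the toolkit in question is CUDA and the modules are built with nvcc. The helper name, file names, architecture string, and the limit of 32 are all made up for illustration; only `--cubin`, `-arch`, `-o`, and `--maxrregcount` are real nvcc options.

```python
def nvcc_command(source, output, arch, max_registers=None):
    """Build an nvcc argument list that compiles one source file to a cubin."""
    cmd = ["nvcc", "--cubin", "-arch=" + arch, "-o", output]
    if max_registers is not None:
        # The cap applies only to the module compiled by this command.
        cmd.append("--maxrregcount=%d" % max_registers)
    cmd.append(source)
    return cmd


# Hypothetical names and values: only the filtering module gets a cap.
filter_cmd = nvcc_command("filtering.cu", "filtering.cubin", "sm_50",
                          max_registers=32)
render_cmd = nvcc_command("render.cu", "render.cubin", "sm_50")
```

Because the filtering module is a separate compilation unit, its cubin can be built once and reused, which is where the startup saving would come from.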

Sobel was giving too many false positives. This cross seems to detect
the kinds of edges we care about while ignoring the rest of the image,
and it does so on pretty much everything I've tried it on. Very satisfying.

Using one stream with two pagelocked host buffers allows us to keep the
GPU work queue full without pegging the CPU, and also reduces the
number of cases where a host buffer gets overwritten before it can be
written out. devtid() was flaky, so this patch also introduces a
ringbuffer to handle the 'slots' concept. It also introduces an adaptive
number of temporal samples, which improves efficiency but also killed
the assumption that (ntemporal_samples % 256 == 0), which required some
additional fixes.
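
For illustration, a ringbuffer of slots can be as small as the sketch below. This is not the code in the patch; the class and method names are hypothetical, and it only shows the idea of handing out a fixed set of slot indices in order and refusing to reuse one until the oldest has been released.

```python
class SlotRing:
    """Fixed pool of slot indices, acquired and released in FIFO order."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.head = 0      # index of the next slot to hand out
        self.in_use = 0    # number of slots currently held

    def acquire(self):
        """Return the next free slot index, or raise if every slot is held."""
        if self.in_use == self.nslots:
            raise RuntimeError("all slots busy; release the oldest first")
        slot = self.head
        self.head = (self.head + 1) % self.nslots
        self.in_use += 1
        return slot

    def release(self):
        """Free the oldest held slot and return its index."""
        if self.in_use == 0:
            raise RuntimeError("no slot is held")
        oldest = (self.head - self.in_use) % self.nslots
        self.in_use -= 1
        return oldest
```

With two slots, one per host buffer, a third acquire() fails until the oldest slot is released, so a buffer cannot be handed out again while it is still held.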

The maximum standard deviation pushes far too hard into the limits of
the filter width, giving discrete points a weird boxy blur. The filter
slice width needs to be expanded, but that's a whole lot of coefficient
debugging, and I'm putting it off by just reducing the maximum DE width
for now.
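
To see why a large standard deviation turns boxy, here is a small sketch that assumes the filter is a Gaussian truncated at a fixed radius (the radius of 8 below is a made-up number, and the function names are hypothetical). It compares the outermost tap to the center tap and measures how much of the Gaussian falls outside the window.

```python
import math


def edge_to_center_ratio(sigma, radius):
    """Weight of the outermost tap relative to the center tap."""
    return math.exp(-(radius * radius) / (2.0 * sigma * sigma))


def mass_outside(sigma, radius):
    """Fraction of a 1D Gaussian's mass lying beyond +/- radius."""
    return math.erfc(radius / (sigma * math.sqrt(2.0)))


radius = 8.0  # hypothetical filter half-width, in pixels
narrow = edge_to_center_ratio(2.0, radius)   # tiny: the cutoff is invisible
wide = edge_to_center_ratio(8.0, radius)     # about 0.61: a hard cutoff
clipped = mass_outside(8.0, radius)          # about 0.32 of the mass is lost
```

When the edge tap is still more than half the center weight, the kernel is closer to a box than to a bell, which matches the boxy blur on isolated points. Lowering the maximum width keeps the ratio small without touching the filter slice width.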

Use the vertical and horizontal gradients to "detect" when a pixel is
part of an edge that has been softened by grid-shift AA, and avoid
blurring it further. This causes occasional 1px artifacts in stills, but
fixes the truly grotesque DE bleed-out for a net win. A better edge
detector is still needed.
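
A minimal sketch of this kind of gradient test, written on the CPU for readability. The central differences, the threshold value, and the function names are assumptions for illustration, not the rule used in the patch.

```python
def is_soft_edge(img, x, y, threshold):
    """True if the horizontal or vertical central difference at (x, y)
    exceeds threshold. img is a list of rows of floats; neighbors are
    clamped at the image border."""
    h, w = len(img), len(img[0])
    left = img[y][max(x - 1, 0)]
    right = img[y][min(x + 1, w - 1)]
    up = img[max(y - 1, 0)][x]
    down = img[min(y + 1, h - 1)][x]
    gx = abs(right - left)
    gy = abs(down - up)
    return max(gx, gy) > threshold


def blur_radius(img, x, y, requested, threshold=0.25):
    """Skip the blur on detected edges, otherwise keep the requested radius."""
    return 0.0 if is_soft_edge(img, x, y, threshold) else requested


# A vertical edge softened over one pixel, as grid-shift AA might leave it.
row = [0.0, 0.0, 0.5, 1.0, 1.0]
image = [row[:] for _ in range(3)]
on_edge = blur_radius(image, 2, 1, requested=3.0)   # 0.0, blur suppressed
off_edge = blur_radius(image, 0, 1, requested=3.0)  # 3.0, blur kept
```

A hard threshold like this is also a plausible source of the occasional 1px artifacts: a pixel just over the threshold gets no blur while its neighbor gets the full radius.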