Status: passes rudimentary tests

Current goals:

- Start xforms
    - xform selection, plus pre- and post-transforms within each xform
    - The first of the variations

Things to do (rather severely incomplete):

- LaunchContext thread distribution based on generated code register count and
  shared memory size
- qlocal storage
    - Performance implications of the different state spaces
    - Projected shared memory / cache usage and its effect on the above
    - Implement qlocal storage, and hide the complexity
- The `Feature` class
    - Transform count and per-transform code layout
    - Filter size, oversample, final buffer size
- Buffer allocation, clearing, reading from device
- Preview window
    - When/how to sample?
    - OpenGL interop worth it?
    - Implement
- Implement xforms
- Shuffle
    - State space implications, you know the drill
    - Implement
    - Test effects on quality by masking off writes on all but one lane and
      boosting the sample density to compensate (much later on)
- Density estimation (DE)
- Clean up code (particularly DSL stuff incl. injector)
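
Rough sketch for the LaunchContext thread-distribution item above: estimate how
many blocks fit on one multiprocessor from the generated code's register count
and shared memory use, then pick a block size. Everything here is hypothetical.
The function names are made up, the per-SM limits are placeholder defaults that
should really come from querying the device, shared memory is assumed to scale
per thread, and allocation granularity is ignored, so treat the result as an
upper bound rather than what the hardware will actually do.

```python
def blocks_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                  regs_per_sm=16384, smem_per_sm=16384,
                  max_threads_per_sm=1024, max_blocks_per_sm=8):
    """Estimate how many blocks can be resident on one multiprocessor.

    The default limits are placeholders (assumptions), not queried values.
    Register and shared memory allocation granularity is ignored.
    """
    limits = [max_blocks_per_sm, max_threads_per_sm // threads_per_block]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


def pick_block_size(regs_per_thread, smem_per_thread,
                    candidates=(64, 128, 256, 512), **limits):
    """Return (threads_per_block, resident_threads) for the best candidate.

    "Best" means the most resident threads per multiprocessor; ties go to
    the smaller block size because candidates are tried in order.
    """
    best = None
    for tpb in candidates:
        nblocks = blocks_per_sm(tpb, regs_per_thread,
                                smem_per_thread * tpb, **limits)
        resident = tpb * nblocks
        if best is None or resident > best[1]:
            best = (tpb, resident)
    return best
```

With the placeholder limits, `pick_block_size(20, 16)` picks 128 threads per
block, since 20 registers per thread caps the smallest and largest candidates
harder than the middle ones.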

Things to test:

- Project-wide debug flag/dict/whatever
    - Iteration counters for IterThread

Things to benchmark:

- Kernel invocation and/or interrupt times (will high load freeze X?)
- MWC float conversion
- The entire scatter process
    - Radix sort of writeback coordinates
    - Log-copy-histogram approach
    - Direct reductions
    - Surface loads, stores, reductions
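
Host-side reference for the MWC float conversion item above, handy for checking
a device implementation before benchmarking it. This is a sketch under
assumptions: a plain 32-bit multiply-with-carry step that returns the low word,
and a conversion that keeps only the top 24 bits so the result is exactly
representable in single precision and stays in [0, 1). The multiplier is a
placeholder, not necessarily the one the project uses, and the function names
are made up.

```python
MASK32 = 0xFFFFFFFF

# Placeholder multiplier (assumption); substitute the project's own value.
EXAMPLE_MULTIPLIER = 4294883355


def mwc_next(x, c, a=EXAMPLE_MULTIPLIER):
    """One multiply-with-carry step on 32-bit state x and 32-bit carry c.

    Returns the new (x, c): the low and high words of a * x + c.
    """
    t = a * x + c
    return t & MASK32, t >> 32


def u32_to_unit_float(u):
    """Map a 32-bit integer to [0, 1) using only its top 24 bits.

    Dropping the low 8 bits keeps every output exactly representable as a
    float32 and guarantees the result never rounds up to 1.0.
    """
    return (u >> 8) * (1.0 / (1 << 24))


def mwc_floats(x, c, n, a=EXAMPLE_MULTIPLIER):
    """Generate n floats in [0, 1) from the given starting state."""
    out = []
    for _ in range(n):
        x, c = mwc_next(x, c, a)
        out.append(u32_to_unit_float(x))
    return out
```

The shift-and-scale conversion is only one option. Whether it beats an
integer-to-float instruction followed by a multiply on the device is exactly
what the benchmark would have to show.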