# -*- coding: utf-8 -*-

"""
Examples and documentation for the point shuffle. Run this file to produce the
full output with annotations.
"""

import numpy as np

w, t = 0, 0

def p(a):
    for i in range(0, t, w):
        print '    ' + ' '.join(['%2d' % a[j] for j in range(i,i+w)])

def make():
    a = np.ndarray(t, dtype=np.int32)
    for i in range(t/w):
        for j in range(w):
            a[i*w+j] = i
    return a

for i in range(w):
    a = shuf(a)
    p(a)

print """
The problem
-----------

If two points undergo the same transforms many times, they will begin to
converge. Since every thread in a warp selects the same transforms on each
iteration, leaving a given point in the same warp each time would effectively
cut the number of unique samples generated by 32.

One fix is to allow transforms to diverge, but that has a direct impact upon
performance. Another is to simply swap points between different warps.  This is
what we do here.

For performance reasons, we can't do the swap in global memory, or between
multiple CTAs; the cost of synchronization and the necessarily uncoalesced
write pattern would put an enormous (orders of magnitude) dent in throughput.
As a result, we have to keep things in shared memory, which is fast but has a
few restrictions, of which we note two:

    - There's not a lot of it. 1K is a good target; it gives the memory kernel
      enough room in a 16K shared configuration to use 8K while retaining the
      4-CTA-per-SM allocation we're shooting for. (More on this later.)

    - No two threads in a warp should write to a different bank. More
      specifically, for an address ``a``, the value ``a % 128`` should be
      unique for each thread. See the CUDA docs for details.

Mixing it up
------------

A simple algorithm for doing the swap which respects those constraints is given
below. In this algorithm, for each coordinate value, each warp writes the value
to memory, waits for all other warps to do the same, then reads back the new
value. The trick is in the addresses which are read:

    Let ``w`` be the warp size, ``n`` be the warp number within a CTA, ``l`` be
    the lane ID, and ``t`` be the total number of threads within a CTA. The
    address to which a point should be written is given by::

        a = (l + n * w) % t

    For reading, the address is given by::

        a = (l + (n + l) * w) % t

Note that the write address in this case is simply the thread ID.
"""

def shuf_simple(a):
    b = np.empty_like(a)
    for n in range(t/w):
        for l in range(w):
            b[(l+(n+l)*w)%t] = a[l+n*w]
    return b

print """
It may help to visualize this. Here's a representation of a "before" state. In
this block, rows represent different warps, while columns represent different
thread lanes; if addresses proceed from left to right, top to bottom, the
address would be the same as the thread ID as assigned by the GPU. Each value
in a particular cell represents the warp number of the cell that the point
started in, in each example.

For these examples, the warp size ``w`` is 16, and ``t`` is 16*16=256. ::
"""

w, t = 16, 256
a = make()
p(a)

print """
After one round of the simple shuffle given above, the matrix looks like this::
"""

a = shuf_simple(a)
p(a)

print """
This demonstrates two attractive properties:

    - Both reads and writes satisfy the constraints to avoid bank conflicts
      when writing to shared memory.

    - No point ends up in a warp alongside another from the same origin.

Of course, we should investigate its properties on continued iterations. Here's
round 2::
"""

a = shuf_simple(a)
p(a)

print "\nAnd 3::\n"

a = shuf_simple(a)
p(a)

print """
We've been doing farly good so far, but at the fourth iteration, things don't
look as nice::
"""

a = shuf_simple(a)
p(a)

print """
This *looks* bad, with a lot of points grouped together, but that may not be an
obvious indicator of poor performance in a real situation. The linear
relationship of the shuffle is obvious from the tables, and indeed some simple
modular algebra can identify the patterns, but since I am a better programmer
than a mathematician, I'm going to resort to empirical methods.

"Meet my good friend, Monte Carlo"
----------------------------------

Since the iterative part of the flame algorithm is itself essentially a Monte
Carlo simulation, it is appropriate that we use one to characterize its
implementation. In fact, the test algorithm is remarkably similar to the way
color values are calculated in the flame algorithm itself:

    - Define a list of affine transformations, each with a certain probability
      of being selected.

    - For several iterations, in each warp:

        - Choose one affine transformation from the list according to the
          associated probabilities.

        - Apply it to the current coordinate.

We'll use a 1D affine transformation (``x_n = A * x_{n-1} + B``) and equal
probabilities. After each iteration, we'll calculate the standard deviation of
the output of each transform seprately, and sum the total results [#]. If there
is more diversity among the input values, the final number should be larger; an
exceptionally small number indicates reduced variability among the inputs.

.. [#]  To be honest, I'm not entirely sure *what* this represents,
        statistically, but it *feels* right, and that's what counts.

The initial inputs will be obtained using the same method as the flame
algorithm: run random points through the algorithm a few times without
recording the results. If each affine matrix is convergent, this should quickly
produce suitable input points.

To have something to compare it to, we specify two alternative shuffles. The
first is no shuffle per warp, which is expected to be rather terrible. The
second is a completely random shuffle of the entire matrix in each round,
something that's untenable on the device (at least with any algorithm I've yet
thought of). I've also included the results of running this test with an
effective warp width of 1, which would simulate the results of running a
different transformation per thread regardless of warp configuration.
"""

def shuf_none(a):
    return a

def shuf_all(a):
    np.random.shuffle(a)
    return a

print """
Here's the set of affine transformations we'll be using:
"""

FUSE = 20
aff = [(0.5, 0, 0.3), (0.5, 0.5, 0.3), (0.5, 1, 0.3), (0.25, 0.75, 0.1)]
fmtstr = '    %-10s %-10s %s'
print fmtstr % ('A', 'B', 'probability of selection')
for k in aff:
    print fmtstr % k

print "\nAnd here are the results::\n"

def monte(a, shuf, rounds, xf_w):
    # Shuffle `a` with `shuf` for `rounds`, and for each group of `xf_w`
    # threads, pick a new transform, apply it, and accumulate the result
    stdsum = 0
    for round in range(-FUSE, rounds):
        xf_pts = [[] for i in aff]
        for wrp in range(0, t, xf_w):
            aff_id = 0
            sel = np.random.rand(1)[0]
            while sel > aff[aff_id][2]:
                sel -= aff[aff_id][2]
                aff_id += 1
            a[wrp:wrp+xf_w] = a[wrp:wrp+xf_w] * aff[aff_id][0] + aff[aff_id][1]
            xf_pts[aff_id].extend(a[wrp:wrp+xf_w])
        if round >= 0:
            stdsum += sum([np.std(x) for x in xf_pts])
        a = shuf(a)
    return stdsum / rounds

print '    With no shuffle:     %g' % monte(make(), shuf_none, 1000, 16)
print '    With full shuffle:   %g' % monte(make(), shuf_all, 1000, 16)
print '    With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 16)
print '    With warp_width=1:   %g' % monte(make(), shuf_none, 1000, 1)

print """
Failing to shuffle the points clearly has a considerable negative impact on our
ersatz "diversity" criterion. Fully shuffling the set of points each time
substantially increases this result. It's not enough to close the gap between
transform choice by warp and by thread, but it does improve things.

A simple shuffle actually beats the full shuffle consistently by a few points.
This may seem like a flawed result — "more shuffling means more diversity" —
but it's actually expected. The simple shuffle guarantees that two points
will never move together over one round, but the full shuffle happily reunites
these neighbors from the previous thread, lowering its performance.

As it turns out, the simple shuffle does quite well, despite its strong linear
dependence. If this was the configuration of the device, I would actually be
satisfied putting this into play directly.

But things are, of course, messier.

Hip to be square
----------------

The previous examples make use of a 16-thread warp size, while NVIDIA devices
actually make use of a 32-thread warp. Since we can realistically only fit 256
threads in a warp, this leaves us with a CTA geometry of 32*8, rather than
16*16. To see what this implies for our setup, check out the examples of simple
rotation on a 16*4 matrix, which has the same aspect ratio as the 32*8 device
would::
"""

t = 64
a = make()
for i in range(5):
    p(a)
    print
    a = shuf_simple(a)

print """
The simple shuffle was designed with a square grid in mind, and performs worse
when on the 32*8 device grid::
"""

t = 256
w = 32

print '    With no shuffle:     %g' % monte(make(), shuf_none, 1000, 32)
print '    With full shuffle:   %g' % monte(make(), shuf_all, 1000, 32)
print '    With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 32)
print '    With warp_width=1:   %g' % monte(make(), shuf_none, 1000, 1)

def shuf_better(a):
    b = np.empty_like(a)
    for n in range(t/w):
        r = np.random.randint(32)
        for l in range(w):
            b[(l+(n+l)*w)%t] = a[(l+r)%w+n*w]
    return b

print '    With better shuffle: %g' % monte(make(), shuf_better, 1000, 32)

print 'For 32*16:'
t = 512
print '    With no shuffle:     %g' % monte(make(), shuf_none, 1000, 32)
print '    With full shuffle:   %g' % monte(make(), shuf_all, 1000, 32)
print '    With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 32)
print '    With better shuffle: %g' % monte(make(), shuf_better, 1000, 32)
print 'For 32*32:'
t = 1024
print '    With no shuffle:     %g' % monte(make(), shuf_none, 1000, 32)
print '    With full shuffle:   %g' % monte(make(), shuf_all, 1000, 32)
print '    With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 32)
print '    With better shuffle: %g' % monte(make(), shuf_better, 1000, 32)


print """
Okay I actually intended this to be a blog post but I started writing before
having done any of the math. Actually the simple shuffle looks like it's
sufficient for now, and that's what I'm going to implement.
"""