mirror of https://github.com/stevenrobertson/cuburn.git

Shuffle points between threads of a CTA

helpers/shuf.py (new file, 287 lines)

# -*- coding: utf-8 -*-
"""
Examples and documentation for the point shuffle. Run this file to produce the
full output with annotations.
"""

import numpy as np

# Warp width and CTA thread count; each example below sets these globals
# before calling the helpers.
w, t = 0, 0

def p(a):
    # Print `a` as a matrix, one warp per row, one lane per column.
    for i in range(0, t, w):
        print ' ' + ' '.join(['%2d' % a[j] for j in range(i, i+w)])

def make():
    # Tag each point with the index of the warp it starts in.
    a = np.empty(t, dtype=np.int32)
    for i in range(t/w):
        for j in range(w):
            a[i*w+j] = i
    return a

print """
The problem
-----------

If two points undergo the same transforms many times, they will begin to
converge. Since every thread in a warp selects the same transforms on each
iteration, leaving a given point in the same warp each time would effectively
cut the number of unique samples generated by a factor of 32.

One fix is to allow transforms to diverge, but that has a direct impact on
performance. Another is to simply swap points between different warps. This is
what we do here.

For performance reasons, we can't do the swap in global memory, or between
multiple CTAs; the cost of synchronization and the necessarily uncoalesced
write pattern would put an enormous (orders of magnitude) dent in throughput.
As a result, we have to keep things in shared memory, which is fast but has a
few restrictions, of which we note two:

- There's not a lot of it. 1K is a good target; it gives the memory kernel
  enough room in a 16K shared configuration to use 8K while retaining the
  4-CTA-per-SM allocation we're shooting for. (More on this later.)

- No two threads in a warp should write to the same bank. More
  specifically, for an address ``a``, the value ``a % 128`` should be
  unique for each thread. See the CUDA docs for details.

Mixing it up
------------

A simple algorithm for doing the swap which respects those constraints is given
below. In this algorithm, for each coordinate value, each warp writes the value
to memory, waits for all other warps to do the same, then reads back the new
value. The trick is in the addresses which are read:

Let ``w`` be the warp size, ``n`` be the warp number within a CTA, ``l`` be
the lane ID, and ``t`` be the total number of threads within a CTA. The
address to which a point should be written is given by::

    a = (l + n * w) % t

For reading, the address is given by::

    a = (l + (n + l) * w) % t

Note that the write address in this case is simply the thread ID.
"""

def shuf_simple(a):
    # Simulate the CTA-wide shuffle: the point held by lane `l` of warp `n`
    # lands at address (l + (n + l) * w) % t.
    b = np.empty_like(a)
    for n in range(t/w):
        for l in range(w):
            b[(l + (n + l) * w) % t] = a[l + n*w]
    return b

print """
It may help to visualize this. Here's a representation of a "before" state. In
this block, rows represent different warps, while columns represent different
thread lanes; if addresses proceed from left to right, top to bottom, the
address would be the same as the thread ID as assigned by the GPU. In each
example, the value in a cell is the number of the warp in which that point
started.

For these examples, the warp size ``w`` is 16, and ``t`` is 16*16=256. ::
"""

w, t = 16, 256
a = make()
p(a)
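
print """
As a quick empirical check of the bank-conflict constraint (a sketch added
for illustration; it assumes the bank count equals the warp width, which
holds for both the 16-bank half-warp case and the 32-bank Fermi case), the
write addresses of each warp should all land in distinct banks::
"""

def check_banks():
    # Within each warp, write addresses must be distinct modulo the bank
    # count, or threads would serialize on a shared memory bank.
    for n in range(t/w):
        banks = set((l + (n + l) * w) % t % w for l in range(w))
        if len(banks) != w:
            return 'warp %2d has a bank conflict' % n
    return 'no bank conflicts in any warp'

print '  ' + check_banks()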

print """
After one round of the simple shuffle given above, the matrix looks like this::
"""

a = shuf_simple(a)
p(a)

print """
This demonstrates two attractive properties:

- Both reads and writes satisfy the constraints to avoid bank conflicts
  when writing to shared memory.

- No point ends up in a warp alongside another from the same origin.

Of course, we should investigate its properties over continued iterations.
Here's round 2::
"""

a = shuf_simple(a)
p(a)

print "\nAnd 3::\n"

a = shuf_simple(a)
p(a)

print """
We've been doing fairly well so far, but at the fourth iteration, things don't
look as nice::
"""

a = shuf_simple(a)
p(a)

print """
This *looks* bad, with a lot of points grouped together, but that may not be an
obvious indicator of poor performance in a real situation. The linear
relationship of the shuffle is obvious from the tables, and indeed some simple
modular algebra can identify the patterns, but since I am a better programmer
than a mathematician, I'm going to resort to empirical methods.

"Meet my good friend, Monte Carlo"
----------------------------------

Since the iterative part of the flame algorithm is itself essentially a Monte
Carlo simulation, it is appropriate that we use one to characterize its
implementation. In fact, the test algorithm is remarkably similar to the way
color values are calculated in the flame algorithm itself:

- Define a list of affine transformations, each with a certain probability
  of being selected.

- For several iterations, in each warp:

  - Choose one affine transformation from the list according to the
    associated probabilities.

  - Apply it to the current coordinate.

We'll use 1D affine transformations (``x_n = A * x_{n-1} + B``) with fixed
selection probabilities. After each iteration, we'll calculate the standard
deviation of the output of each transform separately, and sum the total
results [#]_. If there is more diversity among the input values, the final
number should be larger; an exceptionally small number indicates reduced
variability among the inputs.

.. [#] To be honest, I'm not entirely sure *what* this represents,
   statistically, but it *feels* right, and that's what counts.
"""
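
print """
For a bit of intuition about this metric (a contrived illustration, not part
of the test itself): a well-spread batch of outputs for a single transform
scores higher than a nearly converged one::
"""

print '  spread batch:    %.4f' % np.std([0.1, 0.4, 0.9])
print '  converged batch: %.4f' % np.std([0.50, 0.51, 0.50])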

print """
The initial inputs will be obtained using the same method as the flame
algorithm: run random points through the algorithm a few times without
recording the results. If each affine matrix is convergent, this should quickly
produce suitable input points.

To have something to compare it to, we specify two alternative shuffles. The
first is no shuffle at all, which is expected to be rather terrible. The
second is a completely random shuffle of the entire matrix in each round,
something that's untenable on the device (at least with any algorithm I've yet
thought of). I've also included the results of running this test with an
effective warp width of 1, which would simulate the results of running a
different transformation per thread regardless of warp configuration.
"""

def shuf_none(a):
    # Baseline: no shuffle at all.
    return a

def shuf_all(a):
    # Upper bound: a full random shuffle of every point in the CTA.
    np.random.shuffle(a)
    return a

print """
Here's the set of affine transformations we'll be using::
"""

FUSE = 20   # number of warm-up rounds discarded before recording results
aff = [(0.5, 0, 0.3), (0.5, 0.5, 0.3), (0.5, 1, 0.3), (0.25, 0.75, 0.1)]
fmtstr = ' %-10s %-10s %s'
print fmtstr % ('A', 'B', 'probability of selection')
for k in aff:
    print fmtstr % k

print "\nAnd here are the results::\n"

def monte(a, shuf, rounds, xf_w):
    # Shuffle `a` with `shuf` for `rounds` rounds; for each group of `xf_w`
    # threads, pick a transform at random, apply it, and accumulate the
    # per-transform standard deviations. The first FUSE rounds are discarded.
    a = a.astype(np.float64)    # the integer warp tags serve only as seeds
    stdsum = 0
    for rnd in range(-FUSE, rounds):
        xf_pts = [[] for i in aff]
        for wrp in range(0, t, xf_w):
            # Roulette-wheel selection of a transform by its probability
            aff_id = 0
            sel = np.random.rand(1)[0]
            while sel > aff[aff_id][2]:
                sel -= aff[aff_id][2]
                aff_id += 1
            a[wrp:wrp+xf_w] = a[wrp:wrp+xf_w] * aff[aff_id][0] + aff[aff_id][1]
            xf_pts[aff_id].extend(a[wrp:wrp+xf_w])
        if rnd >= 0:
            # Skip transforms that went unselected this round; np.std of an
            # empty list would poison the sum with a nan.
            stdsum += sum([np.std(x) for x in xf_pts if len(x)])
        a = shuf(a)
    return stdsum / rounds

print ' With no shuffle: %g' % monte(make(), shuf_none, 1000, 16)
print ' With full shuffle: %g' % monte(make(), shuf_all, 1000, 16)
print ' With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 16)
print ' With warp_width=1: %g' % monte(make(), shuf_none, 1000, 1)

print """
Failing to shuffle the points clearly has a considerable negative impact on our
ersatz "diversity" criterion. Fully shuffling the set of points each time
substantially increases this result. It's not enough to close the gap between
transform choice by warp and by thread, but it does improve things.

A simple shuffle actually beats the full shuffle consistently by a few points.
This may seem like a flawed result ("more shuffling means more diversity")
but it's actually expected. The simple shuffle guarantees that two points
will never move together over one round, but the full shuffle happily reunites
such neighbors from previous rounds, lowering its performance.
"""
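
print """
We can check that guarantee directly (a quick sketch added here): after one
round of the simple shuffle, no warp should hold two points from the same
origin warp::
"""

def count_reunions(a):
    # Count points sharing a warp with another point of the same origin.
    bad = 0
    for i in range(0, t, w):
        row = list(a[i:i+w])
        bad += len(row) - len(set(row))
    return bad

print '  reunited pairs after one round: %d' % count_reunions(shuf_simple(make()))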

print """
As it turns out, the simple shuffle does quite well, despite its strong linear
dependence. If this were the configuration of the device, I would actually be
satisfied putting this into play directly.

But things are, of course, messier.

Hip to be square
----------------

The previous examples use a 16-thread warp size, while NVIDIA devices actually
use a 32-thread warp. Since we can realistically only fit 256 threads in a
CTA, this leaves us with a CTA geometry of 32*8, rather than 16*16. To see
what this implies for our setup, watch the simple shuffle rotate a 16*4
matrix, which has the same aspect ratio as the 32*8 device grid would::
"""

t = 64      # w is still 16, giving a 16*4 grid
a = make()
for i in range(5):
    p(a)
    print
    a = shuf_simple(a)
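
print """
One way to quantify the trouble (a rough sketch added here) is to count how
many rounds the simple shuffle takes to repeat the starting arrangement
exactly; a short cycle means the same points keep meeting each other::
"""

def cycle_length(shuf, limit=100):
    # Apply `shuf` repeatedly until the starting arrangement recurs.
    start = make()
    a = shuf(start)
    for i in range(1, limit):
        if np.all(a == start):
            return i
        a = shuf(a)
    return limit

print '  simple shuffle on the 16*4 grid repeats after %d rounds' % \
        cycle_length(shuf_simple)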

print """
The simple shuffle was designed with a square grid in mind, and performs worse
on the 32*8 grid used by the device::
"""

t = 256
w = 32

print ' With no shuffle: %g' % monte(make(), shuf_none, 1000, 32)
print ' With full shuffle: %g' % monte(make(), shuf_all, 1000, 32)
print ' With simple shuffle: %g' % monte(make(), shuf_simple, 1000, 32)
print ' With warp_width=1: %g' % monte(make(), shuf_none, 1000, 1)
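
print """
One way to loosen the linear dependence (a sketch of the idea behind the
variant below) is to rotate the lanes within each warp by a random offset
before applying the simple scatter, so the read pattern is no longer a fixed
function of the thread ID. Its result on the same test::
"""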

def shuf_better(a):
    b = np.empty_like(a)
    for n in range(t/w):
        # Rotate this warp's lanes by a random offset before the scatter
        r = np.random.randint(w)
        for l in range(w):
            b[(l + (n + l) * w) % t] = a[(l + r) % w + n*w]
    return b

print ' With better shuffle: %g' % monte(make(), shuf_better, 1000, 32)

print """
Okay, I actually intended this to be a blog post, but I started writing before
having done any of the math. Actually, the simple shuffle looks like it's
sufficient for now, and that's what I'm going to implement.
"""