layout | title | description | category | tags
---|---|---|---|---
post | Release the GIL | Strategies for Parallelism in Python | |
Complaining about the Global Interpreter Lock (GIL) seems like a rite of passage for Python developers. It's easy to criticize a design decision made before multi-core CPUs were widely available, but the fact that it's still around indicates that it generally works Good Enough. Besides, there are simple and effective workarounds; it's not hard to start a new process and use message passing to synchronize code running in parallel.
Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of asynchronicity and M:N threading, Python seems lacking. The ideal scenario is to both take advantage of Python's productivity, and run code in true parallel.
Presented below are two strategies for releasing the GIL's icy grip without giving up on what makes Python a nice language to start with. Bear in mind: these are just the tools; no claim is made about whether it's a good idea to use them. Very often, unlocking the GIL is an XY problem; you want application performance, and the GIL seems like an obvious bottleneck. Remember that any gains from running code in parallel come at the expense of project complexity; messing with the GIL is ultimately messing with Python's memory model.
%load_ext Cython
from numba import jit
N = 1_000_000_000
Cython
Put simply, Cython is a programming language that looks a lot like Python, gets transpiled to C/C++, and integrates well with the CPython API. It's great for building Python wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. And when it comes to managing the GIL, there are two special features:
- The `nogil` function annotation asserts that a Cython function is safe to use without the GIL, and compilation will fail if it interacts with vanilla Python
- The `with nogil` context manager explicitly unlocks the CPython GIL while active

Whenever Cython code runs inside a `with nogil` block on a separate thread, the Python interpreter is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that demonstrates this principle in action:
%%cython
# Annotating a function with `nogil` indicates only that it is safe
# to call in a `with nogil` block. It *does not* release the GIL.
cdef unsigned long fibonacci(unsigned long n) nogil:
    if n <= 1:
        return n

    cdef unsigned long a = 0, b = 1, c = 0

    c = a + b
    for _i in range(2, n):
        a = b
        b = c
        c = a + b

    return c


def cython_nogil(unsigned long n):
    # Explicitly release the GIL before calling `fibonacci`
    with nogil:
        value = fibonacci(n)

    return value


def cython_gil(unsigned long n):
    # Because the GIL is not explicitly released, it implicitly
    # remains acquired when running the `fibonacci` function
    return fibonacci(n)
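A note on what `fibonacci` actually returns: the billionth Fibonacci number is far too large for an `unsigned long`, and unsigned C arithmetic wraps around on overflow, so the function is useful only as busy work. The pure-Python sketch below mirrors the loop above and mimics that wraparound. It assumes a 64-bit `unsigned long` (typical on 64-bit Linux and macOS, but not guaranteed everywhere); `fibonacci_u64` and `MASK_64` are names invented for this illustration.

```python
# Assumption: `unsigned long` is 64 bits wide on the target platform.
MASK_64 = (1 << 64) - 1


def fibonacci_u64(n: int) -> int:
    """Pure-Python mirror of the Cython `fibonacci`, wrapping at 64 bits."""
    if n <= 1:
        return n

    a, b = 0, 1
    c = a + b
    for _ in range(2, n):
        a = b
        b = c
        # Masking emulates unsigned 64-bit overflow
        c = (a + b) & MASK_64

    return c


print(fibonacci_u64(10))  # 55, small inputs are unaffected by the mask
# From n = 94 onwards the exact value no longer fits in 64 bits,
# so the result is the true Fibonacci number modulo 2**64.
print(fibonacci_u64(94))
```

This version is far too slow to run with `N = 1_000_000_000`; it is only meant for checking small inputs against the compiled function.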
First, let's time how long it takes Cython to calculate the billionth Fibonacci number:
%%time
_ = cython_gil(N);
CPU times: user 365 ms, sys: 0 ns, total: 365 ms Wall time: 372 ms
%%time
_ = cython_nogil(N);
CPU times: user 381 ms, sys: 0 ns, total: 381 ms Wall time: 388 ms
Both versions (with and without GIL) take effectively the same amount of time to run. If we run them in parallel without unlocking the GIL, even though two threads are used, we expect the time to double (only one thread can be active at a time):
%%time
from threading import Thread
# Create the two threads to run on
t1 = Thread(target=cython_gil, args=[N])
t2 = Thread(target=cython_gil, args=[N])
# Start the threads
t1.start(); t2.start()
# Wait for the threads to finish
t1.join(); t2.join()
CPU times: user 641 ms, sys: 5.62 ms, total: 647 ms Wall time: 645 ms
However, one thread releasing the GIL means that the second thread is free to acquire the GIL and perform its processing in parallel:
%%time
t1 = Thread(target=cython_nogil, args=[N])
t2 = Thread(target=cython_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 717 ms, sys: 372 µs, total: 718 ms Wall time: 358 ms
Because `user` time represents the sum of processing time on all threads, it doesn't change much. The "wall time" has been cut roughly in half because the code is now running in parallel.
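Outside of a notebook, the same wall-time versus CPU-time comparison can be made with the standard library. The helper below is a minimal sketch, not part of the original experiment: `time_threads` and `py_fibonacci` are hypothetical names, and `time.process_time` reports user plus system CPU time for the whole process rather than the separate figures that `%%time` prints.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_threads(*calls):
    """Run each (func, arg) pair on its own thread.

    Returns (results, wall_seconds, cpu_seconds).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()

    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(func, arg) for func, arg in calls]
        results = [future.result() for future in futures]

    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return results, wall, cpu


def py_fibonacci(n):
    """Pure-Python stand-in for the compiled functions; never releases the GIL."""
    if n <= 1:
        return n
    a, b = 0, 1
    c = a + b
    for _ in range(2, n):
        a, b = b, c
        c = a + b
    return c


results, wall, cpu = time_threads((py_fibonacci, 10), (py_fibonacci, 20))
print(results, f"wall={wall:.4f}s", f"cpu={cpu:.4f}s")
```

With the compiled functions from this post, a call such as `time_threads((cython_nogil, N), (cython_gil, N))` should show CPU time approaching twice the wall time when the threads really do run in parallel, and roughly equal figures when they don't.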
Keep in mind that the order in which threads are started makes a difference!
%%time
# Note that the GIL-locked version is started first
t1 = Thread(target=cython_gil, args=[N])
t2 = Thread(target=cython_nogil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 667 ms, sys: 0 ns, total: 667 ms Wall time: 672 ms
Even though the second thread releases the GIL while active, it can't start until the first has completed. Thus, the overall runtime is the same as running two GIL-locked threads.
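The ordering effect can be reproduced with a toy model in plain Python. In the sketch below an ordinary `threading.Lock` stands in for the GIL and `time.sleep` stands in for the busy work; `gil_worker`, `nogil_worker`, and `run_in_order` are hypothetical names, and the model deliberately ignores details of the real interpreter such as the switch interval.

```python
import threading
import time


def gil_worker(lock, entered, seconds):
    """Model of `cython_gil`: holds the lock for the whole computation."""
    with lock:
        entered.set()
        time.sleep(seconds)  # stand-in for the busy work


def nogil_worker(lock, entered, seconds):
    """Model of `cython_nogil`: needs the lock to enter and to return,
    but not while the work itself is running."""
    with lock:
        entered.set()        # entering the function requires the lock
    time.sleep(seconds)      # the work happens with the lock released
    with lock:
        pass                 # re-acquire before returning to Python


def run_in_order(first, second, seconds=0.1):
    """Start `first`, wait until it holds the lock, then start `second`."""
    lock = threading.Lock()
    first_entered = threading.Event()
    t1 = threading.Thread(target=first, args=(lock, first_entered, seconds))
    t2 = threading.Thread(target=second, args=(lock, threading.Event(), seconds))

    start = time.perf_counter()
    t1.start()
    first_entered.wait()     # guarantee the first thread got there first
    t2.start()
    t1.join()
    t2.join()
    return time.perf_counter() - start


nogil_first = run_in_order(nogil_worker, gil_worker)
gil_first = run_in_order(gil_worker, nogil_worker)
print(f"nogil first: {nogil_first:.2f}s, gil first: {gil_first:.2f}s")
```

When the lock-holding worker starts first, the second worker cannot even enter until the first is done, so the two sleeps run back to back. When the releasing worker starts first, the sleeps overlap.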
Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the interpreter, not just the thread attempting the unlock:
%%cython
cdef int cython_recurse(int n) nogil:
    if n <= 0:
        return 0

    with nogil:
        return cython_recurse(n - 1)

cython_recurse(2)
Fatal Python error: PyEval_SaveThread: NULL tstate

Thread 0x00007f499effd700 (most recent call first):
  File "/home/bspeice/.virtualenvs/release-the-gil/lib/python3.7/site-packages/ipykernel/parentpoller.py", line 39 in run
  File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
In practice, avoiding this issue is simple. First, `nogil` functions likely shouldn't contain `with nogil` blocks. Second, Cython can conditionally acquire/release the GIL, so synchronizing access shouldn't be problematic. Finally, Cython's documentation for external C code contains more detail on how to safely manage the GIL.
To conclude: use Cython's `nogil` annotation to assert that functions are safe for calling when the GIL is unlocked, and `with nogil` to actually unlock the GIL.
Numba
Like Cython, Numba is a "compiled Python." Where Cython works by compiling a Python-like language to C/C++, Numba compiles Python bytecode directly to machine code at runtime. Behavior is controlled with a special `@jit` decorator; calling a decorated function first compiles it to machine code, and then runs it. Calling the function a second time re-uses that machine code, but will recompile if the argument types change.
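The compile-once-per-argument-type behavior can be pictured with a toy decorator in plain Python. This is only a model of the idea, not how Numba is implemented: `toy_jit` and its `specializations` attribute are hypothetical, and the "compilation" step just records that a new specialization was needed.

```python
import functools


def toy_jit(func):
    """Toy model of compile-on-first-call, keyed by argument types."""
    compiled = {}  # maps a tuple of argument types to a "compiled" callable

    @functools.wraps(func)
    def dispatcher(*args):
        signature = tuple(type(arg) for arg in args)
        if signature not in compiled:
            # A real JIT would generate machine code for these types here;
            # the toy version simply remembers the specialization.
            compiled[signature] = func
        return compiled[signature](*args)

    dispatcher.specializations = compiled
    return dispatcher


@toy_jit
def double(x):
    return x + x


double(2)      # first call with an int: "compiles"
double(3)      # same argument type: re-uses the existing specialization
double(2.0)    # new argument type: "compiles" again
print(len(double.specializations))  # 2
```

This is also why timing experiments usually call each function once beforehand, so that the one-off compilation cost stays out of the measurements.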
Numba works best when a `nopython=True` argument is added to the `@jit` decorator; functions compiled in `nopython` mode avoid the CPython API and have performance comparable to C. Further, adding `nogil=True` to the `@jit` decorator unlocks the GIL while that function is running. Note that `nogil` and `nopython` are different arguments; while it is necessary for code to be compiled in `nopython` mode in order to release the lock, the GIL will remain locked if `nogil=False` (the default).
Let's repeat the same experiment, this time using Numba instead of Cython:
# The `int` type annotation is only for humans and is ignored
# by Numba.
@jit(nopython=True, nogil=True)
def numba_nogil(n: int) -> int:
    if n <= 1:
        return n

    a = 0
    b = 1

    c = a + b
    for _i in range(2, n):
        a = b
        b = c
        c = a + b

    return c


# Run using `nopython` mode to receive a performance boost,
# but GIL remains locked due to `nogil=False` by default.
@jit(nopython=True)
def numba_gil(n: int) -> int:
    if n <= 1:
        return n

    a = 0
    b = 1

    c = a + b
    for _i in range(2, n):
        a = b
        b = c
        c = a + b

    return c


# Call each function once to force compilation; we don't want
# the timing statistics to include how long it takes to compile.
numba_nogil(N)
numba_gil(N);
We'll perform the same tests as with Cython; first, figure out how long it takes to run:
%%time
_ = numba_gil(N)
CPU times: user 253 ms, sys: 258 µs, total: 253 ms Wall time: 251 ms
Aside: it's not immediately clear why Numba takes ~20% less time to run than Cython for code that should be effectively identical after compilation.
When running two GIL-locked threads in parallel, the result (as expected) takes around twice as long to compute:
%%time
t1 = Thread(target=numba_gil, args=[N])
t2 = Thread(target=numba_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 541 ms, sys: 3.96 ms, total: 545 ms Wall time: 541 ms
And if the GIL-unlocking thread runs first, both threads run in parallel:
%%time
t1 = Thread(target=numba_nogil, args=[N])
t2 = Thread(target=numba_gil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 551 ms, sys: 7.77 ms, total: 559 ms Wall time: 279 ms
Just like Cython, starting a GIL-locked thread first leads to overall runtime taking twice as long:
%%time
t1 = Thread(target=numba_gil, args=[N])
t2 = Thread(target=numba_nogil, args=[N])
t1.start(); t2.start()
t1.join(); t2.join()
CPU times: user 524 ms, sys: 0 ns, total: 524 ms Wall time: 522 ms
Finally, unlike Cython, Numba will unlock the GIL if and only if it is currently acquired; recursively calling `@jit(nogil=True)` functions is perfectly safe:
from numba import jit

@jit(nopython=True, nogil=True)
def numba_recurse(n: int) -> int:
    if n <= 0:
        return 0

    return numba_recurse(n - 1)

numba_recurse(2);
Conclusion
While unlocking the GIL is often a solution in search of a problem, both Cython and Numba provide simple means to manage the GIL when appropriate. This enables true parallelism (not just concurrency) that is impossible in vanilla Python.
Before finishing, it's important to address pain points that will show up if these techniques are used in a more realistic project:
First, code running in a GIL-free context will likely also need non-trivial data structures; GIL-free functions aren't useful if they're constantly interacting with Python objects that need the GIL for access. Cython provides extension types and Numba provides a `@jitclass` decorator to address this need.
Second, building and distributing applications that make use of Cython/Numba can be complicated. Cython packages require running the compiler, (potentially) linking/packaging external dependencies, and distributing a binary wheel. Numba is generally simpler because the code being distributed is pure Python that isn't compiled until being run. However, errors aren't detected until runtime and debugging can be problematic.