Final draft

timing
Bradlee Speice 2019-12-14 12:36:08 -05:00
parent 451b92db4a
commit 75dce1863a
1 changed files with 19 additions and 19 deletions

View File

@ -8,7 +8,7 @@ tags: [python]
Complaining about the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock)(GIL) seems like a rite of passage for Python developers. It's easy to criticize a design decision made before multi-core CPU's were widely available, but the fact that it's still around indicates that it generally works [Good](https://wiki.c2.com/?PrematureOptimization) [Enough](https://wiki.c2.com/?YouArentGonnaNeedIt). Besides, there are simple and effective workarounds; it's not hard to start a [new process](https://docs.python.org/3/library/multiprocessing.html) and use message passing to synchronize code running in parallel.
Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of asynchronicity and $M:N$ threading, Python seems lacking. The ideal scenario is to both take advantage of Python's productivity, and run code in true parallel.
Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of asynchronicity and $M:N$ threading, Python seems lacking. The ideal scenario is to take advantage of both Python's productivity and the modern CPU's parallel capabilities.
Presented below are two strategies for releasing the GIL's icy grip without giving up on what makes Python a nice language to start with. Bear in mind: these are just the tools, no claim is made about whether it's a good idea to use them. Very often, unlocking the GIL is an [XY problem](https://en.wikipedia.org/wiki/XY_problem); you want application performance, and the GIL seems like an obvious bottleneck. Remember that any gains from running code in parallel come at the expense of project complexity; messing with the GIL is ultimately messing with Python's memory model.
@ -23,7 +23,7 @@ N = 1_000_000_000
Put simply, [Cython](https://cython.org/) is a programming language that looks a lot like Python, gets [transpiled](https://en.wikipedia.org/wiki/Source-to-source_compiler) to C/C++, and integrates well with the [CPython](https://en.wikipedia.org/wiki/CPython) API. It's great for building Python wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. And when it comes to managing the GIL, there are two special features:
- The `nogil` [function annotation](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil) asserts that a Cython function is safe to use without the GIL, and compilation will fail if it interacts with vanilla Python
- The `nogil` [function annotation](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil) asserts that a Cython function is safe to use without the GIL, and compilation will fail if it interacts with Python in an unsafe manner
- The `with nogil` [context manager](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#releasing-the-gil) explicitly unlocks the CPython GIL while active
Whenever Cython code runs inside a `with nogil` block on a separate thread, the Python interpreter is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that demonstrates this principle in action:
@ -49,7 +49,7 @@ cdef unsigned long fibonacci(unsigned long n) nogil:
def cython_nogil(unsigned long n):
# Explicitly release the GIL before calling `fibonacci`
# Explicitly release the GIL while running `fibonacci`
with nogil:
value = fibonacci(n)
@ -86,7 +86,7 @@ _ = cython_nogil(N);
> </pre>
Both versions (with and without GIL) take effectively the same amount of time to run. If we run them in parallel without unlocking the GIL, *even though two threads are used*, we expect the time to double (only one thread can be active at a time):
Both versions (with and without GIL) take effectively the same amount of time to run. Even when running this calculation in parallel on separate threads, it is expected that the run time will double because only one thread can be active at a time:
```python
@ -108,7 +108,7 @@ t1.join(); t2.join()
> </pre>
However, one thread releasing the GIL means that the second thread is free to acquire the GIL and perform its processing in parallel:
However, if the first thread releases the GIL, the second thread is free to acquire it and run in parallel:
```python
@ -125,7 +125,7 @@ t1.join(); t2.join()
> Wall time: 358 ms
> </pre>
Because `user` time represents the sum of processing time on all threads, it doesn't change much. The ["wall time"](https://en.wikipedia.org/wiki/Elapsed_real_time) has been cut roughly in half because the code is now running in parallel.
Because `user` time represents the sum of processing time on all threads, it doesn't change much. The ["wall time"](https://en.wikipedia.org/wiki/Elapsed_real_time) has been cut roughly in half because each function is running simultaneously.
Keep in mind that the **order in which threads are started** makes a difference!
@ -144,7 +144,7 @@ t1.join(); t2.join()
> Wall time: 672 ms
> </pre>
Even though the second thread releases the GIL lock while active, it can't start until the first has completed. Thus, the overall runtime the same as running two GIL-locked threads.
Even though the second thread releases the GIL while running, it can't start until the first has completed. Thus, the overall runtime is effectively the same as running two GIL-locked threads.
Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the **interpreter**, not just the thread attempting the unlock:
@ -170,15 +170,15 @@ cython_recurse(2)
> File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap
> </pre>
In practice, avoiding this issue is simple. First, `nogil` functions likely shouldn't contain `with nogil` blocks. Second, Cython can [conditionally acquire/release](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#conditional-acquiring-releasing-the-gil) the GIL, so synchronizing access shouldn't be problematic. Finally, Cython's documentation for [external C code](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil) contains more detail on how to safely manage the GIL.
In practice, avoiding this issue is simple. First, `nogil` functions probably shouldn't contain `with nogil` blocks. Second, Cython can [conditionally acquire/release](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#conditional-acquiring-releasing-the-gil) the GIL, so these conditions can be used to synchronize access. Finally, Cython's documentation for [external C code](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil) contains more detail on how to safely manage the GIL.
To conclude: use Cython's `nogil` annotation to assert that functions are safe for calling when the GIL is unlocked, and `with nogil` to actually unlock the GIL.
To conclude: use Cython's `nogil` annotation to assert that functions are safe for calling when the GIL is unlocked, and `with nogil` to actually unlock the GIL and run those functions.
# Numba
Like Cython, [Numba](https://numba.pydata.org/) is a "compiled Python." Where Cython works by compiling a Python-like language to C/C++, Numba compiles Python bytecode *directly to machine code* at runtime. Behavior is controlled with a special `@jit` decorator; calling a decorated function first compiles it to machine code, and then runs it. Calling the function a second time re-uses that machine code, but will recompile if the argument types change.
Like Cython, [Numba](https://numba.pydata.org/) is a "compiled Python." Where Cython works by compiling a Python-like language to C/C++, Numba compiles Python bytecode *directly to machine code* at runtime. Behavior is controlled with a special `@jit` decorator; calling a decorated function first compiles it to machine code before running. Calling the function a second time re-uses that machine code unless the argument types have changed.
Numba works best when a `nopython=True` argument is added to the `@jit` decorator; functions compiled in [`nopython`](http://numba.pydata.org/numba-doc/latest/user/jit.html?#nopython) mode avoid the CPython API and have performance comparable to C. Further, adding `nogil=True` to the `@jit` decorator unlocks the GIL while that function is running. Note that `nogil` and `nopython` are different arguments; while it is necessary for code to be compiled in `nopython` mode in order to release the lock, the GIL will remain locked if `nogil=False` (the default).
Numba works best when a `nopython=True` argument is added to the `@jit` decorator; functions compiled in [`nopython`](http://numba.pydata.org/numba-doc/latest/user/jit.html?#nopython) mode avoid the CPython API and have performance comparable to C. Further, adding `nogil=True` to the `@jit` decorator unlocks the GIL while that function is running. Note that `nogil` and `nopython` are separate arguments; while it is necessary for code to be compiled in `nopython` mode in order to release the lock, the GIL will remain locked if `nogil=False` (the default).
Let's repeat the same experiment, this time using Numba instead of Cython:
@ -227,7 +227,7 @@ numba_nogil(N)
numba_gil(N);
```
We'll perform the same tests as Cython; first, figure out how long it takes to run:
We'll perform the same tests as above; first, figure out how long it takes the function to run:
```python
%%time
@ -244,7 +244,7 @@ Aside: it's not immediately clear why Numba takes ~20% less time to run than Cyt
effectively identical after compilation.
</span>
When running two GIL-locked threads in parallel, the result (as expected) takes around twice as long to compute:
When running two GIL-locked threads, the result (as expected) takes around twice as long to compute:
```python
%%time
@ -259,7 +259,7 @@ t1.join(); t2.join()
> Wall time: 541 ms
> </pre>
And if the GIL-unlocking thread runs first, both threads run in parallel:
But if the GIL-unlocking thread starts first, both threads run in parallel:
```python
%%time
@ -274,7 +274,7 @@ t1.join(); t2.join()
> Wall time: 279 ms
> </pre>
Just like Cython, starting a GIL-locked thread first leads to overall runtime taking twice as long:
Just like Cython, starting the GIL-locked thread first leads to poor performance:
```python
%%time
@ -306,10 +306,10 @@ numba_recurse(2);
# Conclusion
While unlocking the GIL is often a solution in search of a problem, both Cython and Numba provide simple means to manage the GIL when appropriate. This enables true parallelism (not just [concurrency](https://stackoverflow.com/a/1050257)) that is impossible in vanilla Python.
Before finishing, it's important to address pain points that will show up if these techniques are used in a more realistic project:
First, code running in a GIL-free context will likely also need non-trivial data structures; GIL-free functions aren't useful if they're constantly interacting with Python objects that need the GIL for access. Cython provides [extension types](http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html) and Numba provides a [`@jitclass`](https://numba.pydata.org/numba-doc/dev/user/jitclass.html) decorator to address this need.
First, code running in a GIL-free context will likely also need non-trivial data structures; GIL-free functions aren't useful if they're constantly interacting with Python objects whose access requires the GIL. Cython provides [extension types](http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html) and Numba provides a [`@jitclass`](https://numba.pydata.org/numba-doc/dev/user/jitclass.html) decorator to address this need.
Second, building and distributing applications that make use of Cython/Numba can be complicated. Cython packages require running the compiler, (potentially) linking/packaging external dependencies, and distributing a binary wheel. Numba is generally simpler because the code being distributed is pure Python that isn't compiled until being run. However, errors aren't detected until runtime and debugging can be problematic.
Second, building and distributing applications that make use of Cython/Numba can be complicated. Cython packages require running the compiler, (potentially) linking/packaging external dependencies, and distributing a binary wheel. Numba is generally simpler because the code being distributed is pure Python, but can be tricky since errors aren't detected until runtime.
Finally, while unlocking the GIL is often a solution in search of a problem, both Cython and Numba provide tools to directly manage the GIL when appropriate. This enables true parallelism (not just [concurrency](https://stackoverflow.com/a/1050257)) that is impossible in vanilla Python.