diff --git a/blog/2019-09-28-binary-format-shootout/index.mdx b/blog/2019-09-28-binary-format-shootout/index.mdx
index 6aad82b..044b4da 100644
--- a/blog/2019-09-28-binary-format-shootout/index.mdx
+++ b/blog/2019-09-28-binary-format-shootout/index.mdx
@@ -41,7 +41,7 @@ We'll discuss more in detail, but a quick preview of the results:
 - Flatbuffers: Has some quirks, but largely lived up to its "zero-copy" promises
 - SBE: Best median and worst-case performance, but the message structure has a limited feature set
 
-# Prologue: Binary Parsing with Nom
+## Prologue: Binary Parsing with Nom
 
 Our benchmark system will be a simple data processor; given depth-of-book market data from
 [IEX](https://iextrading.com/trading/market-data/#deep), serialize each message into the schema
@@ -119,7 +119,7 @@ Ultimately, because the `nom` code in this shootout was the same for all formats
 interested in its performance. Still, it's worth mentioning that building the market data parser
 was actually fun; I didn't have to write tons of boring code by hand.
 
-# Part 1: Cap'n Proto
+## Cap'n Proto
 
 Now it's time to get into the meaty part of the story. Cap'n Proto was the first format I tried
 because of how long it has supported Rust (thanks to [dwrensha](https://github.com/dwrensha) for
@@ -151,7 +151,7 @@ every read for the segment table.
 In the end, I put in significant work to make Cap'n Proto as fast as possible, but there were too
 many issues for me to feel comfortable using it long-term.
 
-# Part 2: Flatbuffers
+## Flatbuffers
 
 This is the new kid on the block. After a
 [first attempt](https://github.com/google/flatbuffers/pull/3894) didn't pan out, official support
@@ -191,7 +191,7 @@ that tag is nigh on impossible.
 Ultimately, I enjoyed using Flatbuffers, and had to do significantly less work to make it perform
 well.
 
-# Part 3: Simple Binary Encoding
+## Simple Binary Encoding
 
 Support for SBE was added by the author of one of my favorite
 [Rust blog posts](https://web.archive.org/web/20190427124806/https://polysync.io/blog/session-types-for-hearty-codecs/).
@@ -212,7 +212,7 @@ However, if you don't need union types, and can accept that schemas are XML docu
 worth using. SBE's implementation had the best streaming support of all formats I tested, and
 doesn't trigger allocation during de/serialization.
 
-# Results
+## Results
 
 After building a test harness
 [for](https://github.com/speice-io/marketdata-shootout/blob/master/src/capnp_runner.rs)
@@ -225,7 +225,7 @@ the benchmarks, and the raw results are below
 is the average of 10 runs on a single day of IEX data. Results were validated to make sure that
 each format parsed the data correctly.
 
-## Serialization
+### Serialization
 
 This test measures, on a
 [per-message basis](https://github.com/speice-io/marketdata-shootout/blob/master/src/main.rs#L268-L272),
@@ -239,7 +239,7 @@ buffer.
 | Flatbuffers | 355ns | 2185ns | 3497ns | 14.31s |
 | SBE | 91ns | 1535ns | 2423ns | 3.91s |
 
-## Deserialization
+### Deserialization
 
 This test measures, on a
 [per-message basis](https://github.com/speice-io/marketdata-shootout/blob/master/src/main.rs#L294-L298),
@@ -254,7 +254,7 @@ format implementation.
 | Flatbuffers | 173ns | 421ns | 1007ns | 6.00s |
 | SBE | 116ns | 286ns | 659ns | 4.05s |
 
-# Conclusion
+## Conclusion
 
 Building a benchmark turned out to be incredibly helpful in making a decision; because a "union"
 type isn't important to me, I can be confident that SBE best addresses my needs.
diff --git a/blog/2019-12-14-release-the-gil/_article.md b/blog/2019-12-14-release-the-gil/_article.md new file mode 100644 index 0000000..00b47a6 --- /dev/null +++ b/blog/2019-12-14-release-the-gil/_article.md @@ -0,0 +1,370 @@ +--- +layout: post +title: "Release the GIL" +description: "Strategies for Parallelism in Python" +category: +tags: [python] +--- + +Complaining about the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) +(GIL) seems like a rite of passage for Python developers. It's easy to criticize a design decision +made before multi-core CPU's were widely available, but the fact that it's still around indicates +that it generally works [Good](https://wiki.c2.com/?PrematureOptimization) +[Enough](https://wiki.c2.com/?YouArentGonnaNeedIt). Besides, there are simple and effective +workarounds; it's not hard to start a +[new process](https://docs.python.org/3/library/multiprocessing.html) and use message passing to +synchronize code running in parallel. + +Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of +asynchronicity and _M:N_ threading, Python seems lacking. The ideal scenario is to take advantage of +both Python's productivity and the modern CPU's parallel capabilities. + +Presented below are two strategies for releasing the GIL's icy grip without giving up on what makes +Python a nice language to start with. Bear in mind: these are just the tools, no claim is made about +whether it's a good idea to use them. Very often, unlocking the GIL is an +[XY problem](https://en.wikipedia.org/wiki/XY_problem); you want application performance, and the +GIL seems like an obvious bottleneck. Remember that any gains from running code in parallel come at +the expense of project complexity; messing with the GIL is ultimately messing with Python's memory +model. + +```python +%load_ext Cython +from numba import jit + +N = 1_000_000_000 +``` + +# Cython + +Put simply, [Cython](https://cython.org/) is a programming language that looks a lot like Python, +gets [transpiled](https://en.wikipedia.org/wiki/Source-to-source_compiler) to C/C++, and integrates +well with the [CPython](https://en.wikipedia.org/wiki/CPython) API. It's great for building Python +wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. And +when it comes to managing the GIL, there are two special features: + +- The `nogil` + [function annotation](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil) + asserts that a Cython function is safe to use without the GIL, and compilation will fail if it + interacts with Python in an unsafe manner +- The `with nogil` + [context manager](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#releasing-the-gil) + explicitly unlocks the CPython GIL while active + +Whenever Cython code runs inside a `with nogil` block on a separate thread, the Python interpreter +is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that +demonstrates this principle in action: + +```python +%%cython + +# Annotating a function with `nogil` indicates only that it is safe +# to call in a `with nogil` block. It *does not* release the GIL. 
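+#
+# A `nogil` function also can't interact with Python objects (compilation
+# fails if it tries), which is why the function below sticks to C integer types.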
+cdef unsigned long fibonacci(unsigned long n) nogil: + if n <= 1: + return n + + cdef unsigned long a = 0, b = 1, c = 0 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +def cython_nogil(unsigned long n): + # Explicitly release the GIL while running `fibonacci` + with nogil: + value = fibonacci(n) + + return value + + +def cython_gil(unsigned long n): + # Because the GIL is not explicitly released, it implicitly + # remains acquired when running the `fibonacci` function + return fibonacci(n) +``` + +First, let's time how long it takes Cython to calculate the billionth Fibonacci number: + +```python +%%time +_ = cython_gil(N); +``` + +>
+> CPU times: user 365 ms, sys: 0 ns, total: 365 ms +> Wall time: 372 ms +>+ +```python +%%time +_ = cython_nogil(N); +``` + +>
+> CPU times: user 381 ms, sys: 0 ns, total: 381 ms +> Wall time: 388 ms +>+ +Both versions (with and without GIL) take effectively the same amount of time to run. Even when +running this calculation in parallel on separate threads, it is expected that the run time will +double because only one thread can be active at a time: + +```python +%%time +from threading import Thread + +# Create the two threads to run on +t1 = Thread(target=cython_gil, args=[N]) +t2 = Thread(target=cython_gil, args=[N]) +# Start the threads +t1.start(); t2.start() +# Wait for the threads to finish +t1.join(); t2.join() +``` + +>
+> CPU times: user 641 ms, sys: 5.62 ms, total: 647 ms +> Wall time: 645 ms +>+ +However, if the first thread releases the GIL, the second thread is free to acquire it and run in +parallel: + +```python +%%time + +t1 = Thread(target=cython_nogil, args=[N]) +t2 = Thread(target=cython_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 717 ms, sys: 372 µs, total: 718 ms +> Wall time: 358 ms +>+ +Because `user` time represents the sum of processing time on all threads, it doesn't change much. +The ["wall time"](https://en.wikipedia.org/wiki/Elapsed_real_time) has been cut roughly in half +because each function is running simultaneously. + +Keep in mind that the **order in which threads are started** makes a difference! + +```python +%%time + +# Note that the GIL-locked version is started first +t1 = Thread(target=cython_gil, args=[N]) +t2 = Thread(target=cython_nogil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 667 ms, sys: 0 ns, total: 667 ms +> Wall time: 672 ms +>+ +Even though the second thread releases the GIL while running, it can't start until the first has +completed. Thus, the overall runtime is effectively the same as running two GIL-locked threads. + +Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the +**interpreter**, not just the thread attempting the unlock: + +```python +%%cython + +cdef int cython_recurse(int n) nogil: + if n <= 0: + return 0 + + with nogil: + return cython_recurse(n - 1) + +cython_recurse(2) +``` + +>
+> Fatal Python error: PyEval_SaveThread: NULL tstate +> +> Thread 0x00007f499effd700 (most recent call first): +> File "/home/bspeice/.virtualenvs/release-the-gil/lib/python3.7/site-packages/ipykernel/parentpoller.py", line 39 in run +> File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner +> File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap +>+ +In practice, avoiding this issue is simple. First, `nogil` functions probably shouldn't contain +`with nogil` blocks. Second, Cython can +[conditionally acquire/release](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#conditional-acquiring-releasing-the-gil) +the GIL, so these conditions can be used to synchronize access. Finally, Cython's documentation for +[external C code](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil) +contains more detail on how to safely manage the GIL. + +To conclude: use Cython's `nogil` annotation to assert that functions are safe for calling when the +GIL is unlocked, and `with nogil` to actually unlock the GIL and run those functions. + +# Numba + +Like Cython, [Numba](https://numba.pydata.org/) is a "compiled Python." Where Cython works by +compiling a Python-like language to C/C++, Numba compiles Python bytecode _directly to machine code_ +at runtime. Behavior is controlled with a special `@jit` decorator; calling a decorated function +first compiles it to machine code before running. Calling the function a second time re-uses that +machine code unless the argument types have changed. + +Numba works best when a `nopython=True` argument is added to the `@jit` decorator; functions +compiled in [`nopython`](http://numba.pydata.org/numba-doc/latest/user/jit.html?#nopython) mode +avoid the CPython API and have performance comparable to C. Further, adding `nogil=True` to the +`@jit` decorator unlocks the GIL while that function is running. Note that `nogil` and `nopython` +are separate arguments; while it is necessary for code to be compiled in `nopython` mode in order to +release the lock, the GIL will remain locked if `nogil=False` (the default). + +Let's repeat the same experiment, this time using Numba instead of Cython: + +```python +# The `int` type annotation is only for humans and is ignored +# by Numba. +@jit(nopython=True, nogil=True) +def numba_nogil(n: int) -> int: + if n <= 1: + return n + + a = 0 + b = 1 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +# Run using `nopython` mode to receive a performance boost, +# but GIL remains locked due to `nogil=False` by default. +@jit(nopython=True) +def numba_gil(n: int) -> int: + if n <= 1: + return n + + a = 0 + b = 1 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +# Call each function once to force compilation; we don't want +# the timing statistics to include how long it takes to compile. +numba_nogil(N) +numba_gil(N); +``` + +We'll perform the same tests as above; first, figure out how long it takes the function to run: + +```python +%%time +_ = numba_gil(N) +``` + +>
+> CPU times: user 253 ms, sys: 258 µs, total: 253 ms +> Wall time: 251 ms +>+ + +Aside: it's not immediately clear why Numba takes ~20% less time to run than Cython for code that should be +effectively identical after compilation. + + +When running two GIL-locked threads, the result (as expected) takes around twice as long to compute: + +```python +%%time +t1 = Thread(target=numba_gil, args=[N]) +t2 = Thread(target=numba_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 541 ms, sys: 3.96 ms, total: 545 ms +> Wall time: 541 ms +>+ +But if the GIL-unlocking thread starts first, both threads run in parallel: + +```python +%%time +t1 = Thread(target=numba_nogil, args=[N]) +t2 = Thread(target=numba_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 551 ms, sys: 7.77 ms, total: 559 ms +> Wall time: 279 ms +>+ +Just like Cython, starting the GIL-locked thread first leads to poor performance: + +```python +%%time +t1 = Thread(target=numba_gil, args=[N]) +t2 = Thread(target=numba_nogil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 524 ms, sys: 0 ns, total: 524 ms +> Wall time: 522 ms +>+ +Finally, unlike Cython, Numba will unlock the GIL if and only if it is currently acquired; +recursively calling `@jit(nogil=True)` functions is perfectly safe: + +```python +from numba import jit + +@jit(nopython=True, nogil=True) +def numba_recurse(n: int) -> int: + if n <= 0: + return 0 + + return numba_recurse(n - 1) + +numba_recurse(2); +``` + +# Conclusion + +Before finishing, it's important to address pain points that will show up if these techniques are +used in a more realistic project: + +First, code running in a GIL-free context will likely also need non-trivial data structures; +GIL-free functions aren't useful if they're constantly interacting with Python objects whose access +requires the GIL. Cython provides +[extension types](http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html) and Numba +provides a [`@jitclass`](https://numba.pydata.org/numba-doc/dev/user/jitclass.html) decorator to +address this need. + +Second, building and distributing applications that make use of Cython/Numba can be complicated. +Cython packages require running the compiler, (potentially) linking/packaging external dependencies, +and distributing a binary wheel. Numba is generally simpler because the code being distributed is +pure Python, but can be tricky since errors aren't detected until runtime. + +Finally, while unlocking the GIL is often a solution in search of a problem, both Cython and Numba +provide tools to directly manage the GIL when appropriate. This enables true parallelism (not just +[concurrency](https://stackoverflow.com/a/1050257)) that is impossible in vanilla Python. diff --git a/blog/2019-12-14-release-the-gil/index.mdx b/blog/2019-12-14-release-the-gil/index.mdx new file mode 100644 index 0000000..f0683c2 --- /dev/null +++ b/blog/2019-12-14-release-the-gil/index.mdx @@ -0,0 +1,372 @@ +--- +slug: 2019/12/release-the-gil +title: Release the GIL +date: 2019-12-14 12:00:00 +authors: [bspeice] +tags: [] +--- + +Complaining about the [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) +(GIL) seems like a rite of passage for Python developers. It's easy to criticize a design decision +made before multi-core CPU's were widely available, but the fact that it's still around indicates +that it generally works [Good](https://wiki.c2.com/?PrematureOptimization) +[Enough](https://wiki.c2.com/?YouArentGonnaNeedIt). Besides, there are simple and effective +workarounds; it's not hard to start a +[new process](https://docs.python.org/3/library/multiprocessing.html) and use message passing to +synchronize code running in parallel. + +Still, wouldn't it be nice to have more than a single active interpreter thread? In an age of +asynchronicity and _M:N_ threading, Python seems lacking. The ideal scenario is to take advantage of +both Python's productivity and the modern CPU's parallel capabilities. + + + +Presented below are two strategies for releasing the GIL's icy grip without giving up on what makes +Python a nice language to start with. Bear in mind: these are just the tools, no claim is made about +whether it's a good idea to use them. Very often, unlocking the GIL is an +[XY problem](https://en.wikipedia.org/wiki/XY_problem); you want application performance, and the +GIL seems like an obvious bottleneck. Remember that any gains from running code in parallel come at +the expense of project complexity; messing with the GIL is ultimately messing with Python's memory +model. 
+ +```python +%load_ext Cython +from numba import jit + +N = 1_000_000_000 +``` + +## Cython + +Put simply, [Cython](https://cython.org/) is a programming language that looks a lot like Python, +gets [transpiled](https://en.wikipedia.org/wiki/Source-to-source_compiler) to C/C++, and integrates +well with the [CPython](https://en.wikipedia.org/wiki/CPython) API. It's great for building Python +wrappers to C and C++ libraries, writing optimized code for numerical processing, and tons more. And +when it comes to managing the GIL, there are two special features: + +- The `nogil` + [function annotation](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil) + asserts that a Cython function is safe to use without the GIL, and compilation will fail if it + interacts with Python in an unsafe manner +- The `with nogil` + [context manager](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#releasing-the-gil) + explicitly unlocks the CPython GIL while active + +Whenever Cython code runs inside a `with nogil` block on a separate thread, the Python interpreter +is unblocked and allowed to continue work elsewhere. We'll define a "busy work" function that +demonstrates this principle in action: + +```python +%%cython + +# Annotating a function with `nogil` indicates only that it is safe +# to call in a `with nogil` block. It *does not* release the GIL. +cdef unsigned long fibonacci(unsigned long n) nogil: + if n <= 1: + return n + + cdef unsigned long a = 0, b = 1, c = 0 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +def cython_nogil(unsigned long n): + # Explicitly release the GIL while running `fibonacci` + with nogil: + value = fibonacci(n) + + return value + + +def cython_gil(unsigned long n): + # Because the GIL is not explicitly released, it implicitly + # remains acquired when running the `fibonacci` function + return fibonacci(n) +``` + +First, let's time how long it takes Cython to calculate the billionth Fibonacci number: + +```python +%%time +_ = cython_gil(N); +``` + +>
+> CPU times: user 365 ms, sys: 0 ns, total: 365 ms +> Wall time: 372 ms +>+ +```python +%%time +_ = cython_nogil(N); +``` + +>
+> CPU times: user 381 ms, sys: 0 ns, total: 381 ms +> Wall time: 388 ms +>+ +Both versions (with and without GIL) take effectively the same amount of time to run. Even when +running this calculation in parallel on separate threads, it is expected that the run time will +double because only one thread can be active at a time: + +```python +%%time +from threading import Thread + +# Create the two threads to run on +t1 = Thread(target=cython_gil, args=[N]) +t2 = Thread(target=cython_gil, args=[N]) +# Start the threads +t1.start(); t2.start() +# Wait for the threads to finish +t1.join(); t2.join() +``` + +>
+> CPU times: user 641 ms, sys: 5.62 ms, total: 647 ms +> Wall time: 645 ms +>+ +However, if the first thread releases the GIL, the second thread is free to acquire it and run in +parallel: + +```python +%%time + +t1 = Thread(target=cython_nogil, args=[N]) +t2 = Thread(target=cython_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 717 ms, sys: 372 µs, total: 718 ms +> Wall time: 358 ms +>+ +Because `user` time represents the sum of processing time on all threads, it doesn't change much. +The ["wall time"](https://en.wikipedia.org/wiki/Elapsed_real_time) has been cut roughly in half +because each function is running simultaneously. + +Keep in mind that the **order in which threads are started** makes a difference! + +```python +%%time + +# Note that the GIL-locked version is started first +t1 = Thread(target=cython_gil, args=[N]) +t2 = Thread(target=cython_nogil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 667 ms, sys: 0 ns, total: 667 ms +> Wall time: 672 ms +>+ +Even though the second thread releases the GIL while running, it can't start until the first has +completed. Thus, the overall runtime is effectively the same as running two GIL-locked threads. + +Finally, be aware that attempting to unlock the GIL from a thread that doesn't own it will crash the +**interpreter**, not just the thread attempting the unlock: + +```python +%%cython + +cdef int cython_recurse(int n) nogil: + if n <= 0: + return 0 + + with nogil: + return cython_recurse(n - 1) + +cython_recurse(2) +``` + +>
+> Fatal Python error: PyEval_SaveThread: NULL tstate +> +> Thread 0x00007f499effd700 (most recent call first): +> File "/home/bspeice/.virtualenvs/release-the-gil/lib/python3.7/site-packages/ipykernel/parentpoller.py", line 39 in run +> File "/usr/lib/python3.7/threading.py", line 926 in _bootstrap_inner +> File "/usr/lib/python3.7/threading.py", line 890 in _bootstrap +>+ +In practice, avoiding this issue is simple. First, `nogil` functions probably shouldn't contain +`with nogil` blocks. Second, Cython can +[conditionally acquire/release](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#conditional-acquiring-releasing-the-gil) +the GIL, so these conditions can be used to synchronize access. Finally, Cython's documentation for +[external C code](https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil) +contains more detail on how to safely manage the GIL. + +To conclude: use Cython's `nogil` annotation to assert that functions are safe for calling when the +GIL is unlocked, and `with nogil` to actually unlock the GIL and run those functions. + +## Numba + +Like Cython, [Numba](https://numba.pydata.org/) is a "compiled Python." Where Cython works by +compiling a Python-like language to C/C++, Numba compiles Python bytecode _directly to machine code_ +at runtime. Behavior is controlled with a special `@jit` decorator; calling a decorated function +first compiles it to machine code before running. Calling the function a second time re-uses that +machine code unless the argument types have changed. + +Numba works best when a `nopython=True` argument is added to the `@jit` decorator; functions +compiled in [`nopython`](http://numba.pydata.org/numba-doc/latest/user/jit.html?#nopython) mode +avoid the CPython API and have performance comparable to C. Further, adding `nogil=True` to the +`@jit` decorator unlocks the GIL while that function is running. Note that `nogil` and `nopython` +are separate arguments; while it is necessary for code to be compiled in `nopython` mode in order to +release the lock, the GIL will remain locked if `nogil=False` (the default). + +Let's repeat the same experiment, this time using Numba instead of Cython: + +```python +# The `int` type annotation is only for humans and is ignored +# by Numba. +@jit(nopython=True, nogil=True) +def numba_nogil(n: int) -> int: + if n <= 1: + return n + + a = 0 + b = 1 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +# Run using `nopython` mode to receive a performance boost, +# but GIL remains locked due to `nogil=False` by default. +@jit(nopython=True) +def numba_gil(n: int) -> int: + if n <= 1: + return n + + a = 0 + b = 1 + + c = a + b + for _i in range(2, n): + a = b + b = c + c = a + b + + return c + + +# Call each function once to force compilation; we don't want +# the timing statistics to include how long it takes to compile. +numba_nogil(N) +numba_gil(N); +``` + +We'll perform the same tests as above; first, figure out how long it takes the function to run: + +```python +%%time +_ = numba_gil(N) +``` + +>
+> CPU times: user 253 ms, sys: 258 µs, total: 253 ms +> Wall time: 251 ms +>+ + +Aside: it's not immediately clear why Numba takes ~20% less time to run than Cython for code that should be +effectively identical after compilation. + + +When running two GIL-locked threads, the result (as expected) takes around twice as long to compute: + +```python +%%time +t1 = Thread(target=numba_gil, args=[N]) +t2 = Thread(target=numba_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 541 ms, sys: 3.96 ms, total: 545 ms +> Wall time: 541 ms +>+ +But if the GIL-unlocking thread starts first, both threads run in parallel: + +```python +%%time +t1 = Thread(target=numba_nogil, args=[N]) +t2 = Thread(target=numba_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 551 ms, sys: 7.77 ms, total: 559 ms +> Wall time: 279 ms +>+ +Just like Cython, starting the GIL-locked thread first leads to poor performance: + +```python +%%time +t1 = Thread(target=numba_gil, args=[N]) +t2 = Thread(target=numba_nogil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 524 ms, sys: 0 ns, total: 524 ms +> Wall time: 522 ms +>+ +Finally, unlike Cython, Numba will unlock the GIL if and only if it is currently acquired; +recursively calling `@jit(nogil=True)` functions is perfectly safe: + +```python +from numba import jit + +@jit(nopython=True, nogil=True) +def numba_recurse(n: int) -> int: + if n <= 0: + return 0 + + return numba_recurse(n - 1) + +numba_recurse(2); +``` + +## Conclusion + +Before finishing, it's important to address pain points that will show up if these techniques are +used in a more realistic project: + +First, code running in a GIL-free context will likely also need non-trivial data structures; +GIL-free functions aren't useful if they're constantly interacting with Python objects whose access +requires the GIL. Cython provides +[extension types](http://docs.cython.org/en/latest/src/tutorial/cdef_classes.html) and Numba +provides a [`@jitclass`](https://numba.pydata.org/numba-doc/dev/user/jitclass.html) decorator to +address this need. + +Second, building and distributing applications that make use of Cython/Numba can be complicated. +Cython packages require running the compiler, (potentially) linking/packaging external dependencies, +and distributing a binary wheel. Numba is generally simpler because the code being distributed is +pure Python, but can be tricky since errors aren't detected until runtime. + +Finally, while unlocking the GIL is often a solution in search of a problem, both Cython and Numba +provide tools to directly manage the GIL when appropriate. This enables true parallelism (not just +[concurrency](https://stackoverflow.com/a/1050257)) that is impossible in vanilla Python. diff --git a/blog/2021-08-01-mdx-blog-post.mdx b/blog/2021-08-01-mdx-blog-post.mdx deleted file mode 100644 index db4736c..0000000 --- a/blog/2021-08-01-mdx-blog-post.mdx +++ /dev/null @@ -1,48 +0,0 @@ ---- -slug: mdx-blog-post -title: MDX Blog Post With An Extraordinarily Long Title -date: 2021-08-02 10:00:00 -authors: [bspeice] -tags: [] ---- - -## title - -Hello? - -Blog posts support [Docusaurus Markdown features](https://docusaurus.io/docs/markdown-features), such as [MDX](https://mdxjs.com/). - -