diff --git a/.gitignore b/.gitignore index ddf4d8b..095c115 100644 --- a/.gitignore +++ b/.gitignore @@ -3,4 +3,5 @@ _site/ .sass-cache/ .jekyll-metadata .bundle/ -vendor/ \ No newline at end of file +vendor/ +.vscode/ \ No newline at end of file diff --git a/_posts/2020-06-29-release-the-gil-pt.-2.md b/_posts/2020-06-29-release-the-gil-pt.-2.md new file mode 100644 index 0000000..cbd5f6f --- /dev/null +++ b/_posts/2020-06-29-release-the-gil-pt.-2.md @@ -0,0 +1,162 @@ +--- +layout: post +title: "Release the GIL: Part 2 - Pybind11, PyO3" +description: "More Python Parallelism" +category: +tags: [python] +--- + +I've been continuing experiments with parallelism in Python; while these techniques are a bit niche, +it's still fun to push the performance envelope. In addition to tools like +[Cython](https://cython.org/) and [Numba](https://numba.pydata.org/) (covered +[here](//2019/12/release-the-gil.html)) that attempt to stay as close to Python as possible, other +projects are available that act as a bridge between Python and other languages. The goal is to make +cooperation simple without compromising independence. + +In practice, this "cooperation" between languages is important for performance reasons. Code written +in C++ shouldn't have to care about the Python GIL. However, unless the GIL is explicitly unlocked, +it will remain implicitly held; though the Python interpreter _could_ be making progress on a +separate thread, it will be stuck waiting on the current operation to complete. We'll look at some +techniques below for managing the GIL in a Python extension. + +# Pybind11 + +The motto of [Pybind11](https://github.com/pybind/pybind11) is "seamless operability between C++11 +and Python", and they certainly deliver on that. My experience was that it was relatively simple to +set up a hybrid project where C++ (using CMake) and Python (using setuptools) were able to +peacefully coexist. 
We'll examine a simple Fibonacci sequence implementation to demonstrate how
+Python's threading model interacts with Pybind11.
+
+The C++ implementation is very simple:
+
+```c++
+#include <cstdint>
+
+#include <pybind11/pybind11.h>
+
+namespace py = pybind11;
+
+inline std::uint64_t fibonacci(std::uint64_t n) {
+    if (n <= 1) {
+        return n;
+    }
+
+    std::uint64_t a = 0;
+    std::uint64_t b = 1;
+    std::uint64_t c = 0;
+
+    c = a + b;
+    for (std::uint64_t _i = 2; _i < n; _i++) {
+        a = b;
+        b = c;
+        c = a + b;
+    }
+
+    return c;
+}
+
+std::uint64_t fibonacci_gil(std::uint64_t n) {
+    // The GIL is held by default when entering C++ from Python, so no
+    // manipulation is needed here. Interestingly enough, re-acquiring a held
+    // GIL is a safe operation (within the same thread), so feel free to
+    // scatter `py::gil_scoped_acquire` throughout the code.
+    return fibonacci(n);
+}
+
+std::uint64_t fibonacci_nogil(std::uint64_t n) {
+    // Because the GIL is held by default, we need to explicitly release it
+    // here. Note that like Cython, releasing the lock multiple times will
+    // crash the interpreter.
+
+    py::gil_scoped_release release;
+    return fibonacci(n);
+}
+```
+
+Admittedly, the project setup is significantly more involved than Cython or Numba. I've omitted
+those steps here, but the full project is available at [INSERT LINK HERE].
+
+```python
+# This number will overflow, but that's OK; our purpose isn't to get an accurate result,
+# it's simply to keep the processor busy.
+N = 1_000_000_000
+
+from fibonacci import fibonacci_gil, fibonacci_nogil
+```
+
+We'll first run each function independently:
+
+```python
+%%time
+_ = fibonacci_gil(N);
+```
+
>
+> CPU times: user 350 ms, sys: 3.54 ms, total: 354 ms
+> Wall time: 355 ms
+> 
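
For reference, the C++ loop above can be mirrored in pure Python. This is a hypothetical helper for checking results, not part of the extension; note that Python integers are arbitrary precision, so unlike `std::uint64_t` this version never overflows:

```python
def fibonacci_py(n: int) -> int:
    """Pure-Python mirror of the C++ `fibonacci` loop."""
    if n <= 1:
        return n
    a, b = 0, 1
    # Same iteration structure as the C++ version: after the loop,
    # `b` holds the n-th Fibonacci number.
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

# Because unsigned overflow in C++ wraps modulo 2**64, the extension's result
# should equal fibonacci_py(N) % (2**64).
print(fibonacci_py(10))  # → 55
```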
+ +```python +%%time +_ = fibonacci_nogil(N); +``` + +>
+> CPU times: user 385 ms, sys: 0 ns, total: 385 ms
+> Wall time: 384 ms
+> 
+ +There's some minor variation in how long it takes to run the code, but not a material difference. +When running the same function in multiple threads, we expect the run time to double; even though +there are multiple threads, they effectively run in serial because of the GIL: + +```python +%%time +from threading import Thread + +# Create the two threads to run on +t1 = Thread(target=fibonacci_gil, args=[N]) +t2 = Thread(target=fibonacci_gil, args=[N]) +# Start the threads +t1.start(); t2.start() +# Wait for the threads to finish +t1.join(); t2.join() +``` + +>
+> CPU times: user 709 ms, sys: 0 ns, total: 709 ms
+> Wall time: 705 ms
+> 
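
This serialization isn't specific to the extension; any CPU-bound Python code behaves the same way. As a standalone sanity check (assuming a standard CPython build with the GIL; exact timings will vary by machine), two busy-loop threads take roughly twice the wall time of one:

```python
import time
from threading import Thread

def busy(n: int) -> None:
    # CPU-bound pure Python: the GIL is only swapped between bytecodes,
    # so two of these threads interleave rather than run in parallel.
    total = 0
    for i in range(n):
        total += i

N_LOOP = 2_000_000  # arbitrary size; adjust for your machine

start = time.perf_counter()
busy(N_LOOP)
one = time.perf_counter() - start

start = time.perf_counter()
t1, t2 = Thread(target=busy, args=[N_LOOP]), Thread(target=busy, args=[N_LOOP])
t1.start(); t2.start()
t1.join(); t2.join()
two = time.perf_counter() - start

# Expect `two` to be roughly double `one`, since the GIL serializes the work.
print(f"one thread: {one:.3f}s, two threads: {two:.3f}s")
```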
+ +However, if one thread unlocks the GIL first, then the threads will execute in parallel: + +```python +%%time + +t1 = Thread(target=fibonacci_nogil, args=[N]) +t2 = Thread(target=fibonacci_gil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 734 ms, sys: 7.89 ms, total: 742 ms
+> Wall time: 372 ms
+> 
+ +While it takes the same amount of CPU time to compute the result ("user" time), the run time ("wall" +time) is cut in half because the code is now running in parallel. + +```python +%%time + +# Note that the GIL-locked version is started first +t1 = Thread(target=fibonacci_gil, args=[N]) +t2 = Thread(target=fibonacci_nogil, args=[N]) +t1.start(); t2.start() +t1.join(); t2.join() +``` + +>
+> CPU times: user 736 ms, sys: 0 ns, total: 736 ms
+> Wall time: 734 ms
+> 

+Finally, it's important to note that scheduling matters: in this example, the threads run in serial
+because the GIL-locked thread is started first. Since it holds the GIL for its entire computation,
+and the `nogil` thread still needs the GIL just to enter the extension function, no work can overlap
+until the first thread finishes.
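
The scheduling dependency can be modeled in pure Python with an ordinary lock standing in for the GIL. This is only a toy model (the sleeps stand in for the Fibonacci computation, and none of these names come from the extension), but it reproduces the behavior above: whether any overlap happens depends on which thread starts first:

```python
import time
from threading import Lock, Thread

toy_gil = Lock()  # toy stand-in for the GIL

def like_fibonacci_gil():
    # Models fibonacci_gil: holds the lock for the entire computation.
    with toy_gil:
        time.sleep(0.2)

def like_fibonacci_nogil():
    # Models fibonacci_nogil: needs the lock briefly to "enter the
    # extension", then releases it before doing the real work.
    with toy_gil:
        pass
    time.sleep(0.2)

def timed(first, second):
    start = time.perf_counter()
    t1, t2 = Thread(target=first), Thread(target=second)
    t1.start()
    time.sleep(0.01)  # ensure the first thread really starts first
    t2.start()
    t1.join(); t2.join()
    return time.perf_counter() - start

# nogil-style thread first: the lock frees up immediately, work overlaps (~0.2s)
parallel = timed(like_fibonacci_nogil, like_fibonacci_gil)
# gil-style thread first: the second thread can't start working until it finishes (~0.4s)
serial = timed(like_fibonacci_gil, like_fibonacci_nogil)
print(f"nogil first: {parallel:.2f}s, gil first: {serial:.2f}s")
```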