First draft of pybind11

Having issues with the Rust code taking *forever*. Going to break out the compiler explorer and see if it's doing something different from C++.
2025-11-03 18:10:32 -05:00 · 2020-06-29 18:26:03 -04:00
parent b8c12b9cc1
commit 5c13a8cf8d
2 changed files with 164 additions and 1 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -3,4 +3,5 @@ _site/
 .sass-cache/
 .jekyll-metadata
 .bundle/
-vendor/
+vendor/
 .vscode/
--- a/_posts/2020-06-29-release-the-gil-pt.-2.md
+++ b/_posts/2020-06-29-release-the-gil-pt.-2.md
@@ -0,0 +1,162 @@
 ---
 layout: post
 title: "Release the GIL: Part 2 - Pybind11, PyO3"
 description: "More Python Parallelism"
 category:
 tags: [python]
 ---
 I've been continuing experiments with parallelism in Python; while these techniques are a bit niche,
 it's still fun to push the performance envelope. In addition to tools like
 [Cython](https://cython.org/) and [Numba](https://numba.pydata.org/) (covered
 [here](//2019/12/release-the-gil.html)) that attempt to stay as close to Python as possible, other
 projects are available that act as a bridge between Python and other languages. The goal is to make
 cooperation simple without compromising independence.
 In practice, this "cooperation" between languages is important for performance reasons. Code written
 in C++ shouldn't have to care about the Python GIL. However, unless the extension explicitly releases
 the GIL, it remains held for the duration of the call; though the Python interpreter _could_ be making
 progress on a separate thread, it will be stuck waiting for the current operation to complete. We'll
 look at some techniques below for managing the GIL in a Python extension.
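To make the "stuck waiting" point concrete, here is a minimal pure-Python sketch that needs no extension code. It relies on `time.sleep` releasing the GIL while it blocks, so a worker thread keeps counting in the meantime. The helper name `count_while_blocked` is made up for illustration.

```python
import time
from threading import Event, Thread


def count_while_blocked(duration: float) -> int:
    """Count loop iterations on a worker thread while the caller sleeps."""
    stop = Event()
    count = 0

    def worker() -> None:
        nonlocal count
        while not stop.is_set():
            count += 1

    t = Thread(target=worker)
    t.start()
    time.sleep(duration)  # releases the GIL, so `worker` keeps running
    stop.set()
    t.join()
    return count


if __name__ == "__main__":
    print(f"worker iterations: {count_while_blocked(0.1)}")
```

If the main thread were instead inside an extension call that kept the GIL held, the worker would be expected to make little or no progress until that call returned.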
 # Pybind11
 The motto of [Pybind11](https://github.com/pybind/pybind11) is "seamless operability between C++11
 and Python", and they certainly deliver on that. My experience was that it was relatively simple to
 set up a hybrid project where C++ (using CMake) and Python (using setuptools) were able to
 peacefully coexist. We'll examine a simple Fibonacci sequence implementation to demonstrate how
 Python's threading model interacts with Pybind11.
 The C++ implementation is very simple:
 ```c++
 #include <cstdint>
 #include <pybind11/pybind11.h>
 // Alias used by the `py::` names below
 namespace py = pybind11;
 inline std::uint64_t fibonacci(std::uint64_t n) {
  if (n <= 1) {
    return n;
  }
  std::uint64_t a = 0;
  std::uint64_t b = 1;
  std::uint64_t c = 0;
  c = a + b;
  for (std::uint64_t _i = 2; _i < n; _i++) {
    a = b;
    b = c;
    c = a + b;
  }
  return c;
 }
 std::uint64_t fibonacci_gil(std::uint64_t n) {
  // The GIL is held by default when entering C++ from Python, so we need no
  // manipulation here. Interestingly enough, re-acquiring a held GIL is a safe
  // operation (within the same thread), so feel free to scatter
  // `py::gil_scoped_acquire` throughout the code.
  return fibonacci(n);
 }
 std::uint64_t fibonacci_nogil(std::uint64_t n) {
  // Because the GIL is held by default, we need to explicitly release it here.
  // Note that like Cython, releasing the lock multiple times will crash the
  // interpreter.
  py::gil_scoped_release release;
  return fibonacci(n);
 }
 ```
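When checking the extension's output, a pure-Python reference can be handy. The sketch below mirrors the loop above and masks every addition to 64 bits, since unsigned arithmetic on `std::uint64_t` wraps modulo 2^64. The name `fibonacci_reference` is hypothetical and not part of the project, and the function is only practical for small inputs.

```python
MASK_64 = (1 << 64) - 1  # emulate std::uint64_t wraparound


def fibonacci_reference(n: int) -> int:
    """Return the n-th Fibonacci number modulo 2**64."""
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        # Mask after every addition so values never exceed 64 bits.
        a, b = b, (a + b) & MASK_64
    return b


if __name__ == "__main__":
    print([fibonacci_reference(i) for i in range(10)])
```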
 Admittedly, the project setup is significantly more involved than Cython or Numba. I've omitted
 those steps here, but the full project is available at [INSERT LINK HERE].
 ```python
 # This number will overflow, but that's OK; our purpose isn't to get an accurate result,
 # it's simply to keep the processor busy.
 N = 1_000_000_000
 from fibonacci import fibonacci_gil, fibonacci_nogil
 ```
 We'll first run each function independently:
 ```python
 %%time
 _ = fibonacci_gil(N);
 ```
 > <pre>
 > CPU times: user 350 ms, sys: 3.54 ms, total: 354 ms
 > Wall time: 355 ms
 > </pre>
 ```python
 %%time
 _ = fibonacci_nogil(N);
 ```
 > <pre>
 > CPU times: user 385 ms, sys: 0 ns, total: 385 ms
 > Wall time: 384 ms
 > </pre>
 There's some minor variation in how long it takes to run the code, but not a material difference.
 When running `fibonacci_gil` in two threads, we expect the run time to double; even though
 there are multiple threads, they effectively run in serial because of the GIL:
 ```python
 %%time
 from threading import Thread
 # Create the two threads to run on
 t1 = Thread(target=fibonacci_gil, args=[N])
 t2 = Thread(target=fibonacci_gil, args=[N])
 # Start the threads
 t1.start(); t2.start()
 # Wait for the threads to finish
 t1.join(); t2.join()
 ```
 > <pre>
 > CPU times: user 709 ms, sys: 0 ns, total: 709 ms
 > Wall time: 705 ms
 > </pre>
 However, if the thread that starts first releases the GIL, then the two threads will execute in parallel:
 ```python
 %%time
 t1 = Thread(target=fibonacci_nogil, args=[N])
 t2 = Thread(target=fibonacci_gil, args=[N])
 t1.start(); t2.start()
 t1.join(); t2.join()
 ```
 > <pre>
 > CPU times: user 734 ms, sys: 7.89 ms, total: 742 ms
 > Wall time: 372 ms
 > </pre>
 While it takes the same amount of CPU time to compute the result ("user" time), the run time ("wall"
 time) is cut in half because the code is now running in parallel.
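Outside of a notebook, roughly the same numbers that `%%time` reports can be collected by hand. This is a sketch using only the standard library: `time.process_time` returns combined user and system CPU time for the whole process (all threads), while `time.perf_counter` tracks elapsed wall time. The helper `time_threads` is a name invented here, and `time.sleep` stands in for the extension functions so the example runs without the compiled module.

```python
import time
from threading import Thread


def time_threads(*targets, args=()):
    """Run each target in its own thread; return (cpu_seconds, wall_seconds)."""
    threads = [Thread(target=target, args=args) for target in targets]

    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


if __name__ == "__main__":
    # Stand-in workload; with the compiled module this could instead be
    # time_threads(fibonacci_nogil, fibonacci_gil, args=(N,)).
    cpu, wall = time_threads(time.sleep, time.sleep, args=(0.2,))
    print(f"CPU: {cpu:.3f}s, wall: {wall:.3f}s")
```

For two CPU-bound threads that really do run in parallel, the wall time should come out at about half of the CPU time; for serialized threads the two numbers should be close to each other.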
 ```python
 %%time
 # Note that the GIL-locked version is started first
 t1 = Thread(target=fibonacci_gil, args=[N])
 t2 = Thread(target=fibonacci_nogil, args=[N])
 t1.start(); t2.start()
 t1.join(); t2.join()
 ```
 > <pre>
 > CPU times: user 736 ms, sys: 0 ns, total: 736 ms
 > Wall time: 734 ms
 > </pre>
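 Finally, it's important to note that scheduling matters; in this example, the threads run in serial
 because the GIL-locked thread is started first. `fibonacci_gil` holds the GIL for its entire run, so
 the main thread most likely cannot get the GIL back to call `t2.start()` until the first computation
 has finished.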