Author: Sam Gross
Sponsor: Łukasz Langa
Discussions-To:
Status: Draft
Type: Standards Track
Created: 09-Jan-2023
Python-Version: 3.12
Post-History: 09-Jan-2023
Resolution:
Table of Contents
- Abstract
- Motivation
- Specification
- Build Configuration Changes
- Overview of CPython Changes
- Reference Counting
- Memory Management
- Garbage Collection (Cycle Collection)
- Container Thread-Safety
- Borrowed References
- Python Critical Sections
- Optimistically Avoiding Locking
- Mimalloc Changes for Optimistic list and dict Access
- Mimalloc Page Reuse
- Optimistic dict and list Access Summary
- Rationale
- Backwards Compatibility
- Distribution
- Performance
- How to Teach This
- Reference Implementation
- Alternatives
- Related Work
- Rejected Ideas
- Open Issues
- References
- Acknowledgments
- Copyright
Abstract
CPython’s global interpreter lock (“GIL”) prevents multiple threads
from executing Python code at the same time. The GIL is an obstacle
to using multi-core CPUs from Python efficiently. This PEP proposes
adding a build configuration (--without-gil) to CPython to let it
run Python code without the global interpreter lock and with the
changes necessary to make the interpreter thread-safe.
Motivation
The GIL is a major obstacle to concurrency. For scientific computing
tasks, this lack of concurrency is often a bigger issue than speed of
executing Python code, since most of the processor cycles are spent
in optimized CPU or GPU kernels. The GIL introduces a global
bottleneck that can prevent other threads from making progress if
they call any Python code. There are existing ways to enable
parallelism in CPython today, but those techniques come with
significant limitations (see Alternatives).
This section focuses on the GIL’s impact on scientific computing,
particularly AI/ML workloads, because that is the area with which this
author has the most experience, but the GIL also affects other users
of Python.
The GIL Makes Many Types of Parallelism Difficult to Express
Neural network-based AI models expose multiple opportunities for
parallelism. For example, individual operations may be parallelized
internally (“intra-operator”), multiple operations may be executed
simultaneously (“inter-operator”), and requests (spanning multiple
operations) may also be parallelized. Efficient execution requires
exploiting multiple types of parallelism [1].
The GIL makes it difficult to express inter-operator parallelism, as
well as some forms of request parallelism, efficiently in Python. In
other programming languages, a system might use threads to run
different parts of a neural network on separate CPU cores, but this is
inefficient in Python due to the GIL. Similarly, latency-sensitive
inference workloads frequently use threads to parallelize across
requests, but face the same scaling bottlenecks in Python.
The challenges the GIL poses to exploiting parallelism in Python
frequently come up in reinforcement learning. Heinrich Kuttler,
author of the NetHack Learning Environment and Member of Technical
Staff at Inflection AI, writes:
Recent breakthroughs in reinforcement learning, such as on Dota
2, StarCraft, and NetHack rely on running multiple
environments (simulated games) in parallel using asynchronous
actor-critic methods. Straightforward multithreaded implementations
in Python don’t scale beyond more than a few parallel environments
due to GIL contention. Multiprocessing, with communication via
shared memory or UNIX sockets, adds much complexity and in effect
rules out interacting with CUDA from different workers, severely
restricting the design space.
Manuel Kroiss, software engineer at DeepMind on the reinforcement
learning team, describes how the bottlenecks posed by the GIL lead to
rewriting Python codebases in C++, making the code less accessible:
We frequently battle issues with the Python GIL at DeepMind. In many
of our applications, we would like to run on the order of 50-100
threads per process. However, we often see that even with fewer
than 10 threads the GIL becomes the bottleneck. To work around this
problem, we sometimes use subprocesses, but in many cases the
inter-process communication becomes too big of an overhead. To
deal with the GIL, we usually end up translating large parts of our
Python codebase into C++. This is undesirable because it makes the
code less accessible to researchers.
Projects that involve interfacing with multiple hardware devices face
similar challenges: efficient communication requires use of multiple
CPU cores. The Dose-3D project aims to improve cancer
radiotherapy with precise dose planning. It uses medical phantoms
(stand-ins for human tissue) together with custom hardware and a
server application written in Python. Paweł Jurgielewicz, lead
software architect for the data acquisition system on the Dose-3D
project, describes the scaling challenges posed by the GIL and how
using a fork of Python without the GIL simplified the project:
In the Dose-3D project, the key challenge was to maintain a stable,
non-trivial concurrent communication link with hardware units while
utilizing a 1 Gbit/s UDP/IP connection to the maximum. Naturally,
we started with the multiprocessing package, but at some point, it
became clear that most CPU time was consumed by the data transfers
between the data processing stages, not by data processing itself.
The CPython multithreading implementation based on GIL was a dead
end too. When we found out about the “nogil” fork of Python it took
a single person less than half a working day to adjust the codebase
to use this fork and the results were astonishing. Now we can focus
on data acquisition system development rather than fine-tuning data
exchange algorithms.
Allen Goodman, author of CellProfiler and staff engineer at
Prescient Design and Genentech, describes how the GIL makes
biological methods research more difficult in Python:
Issues with Python’s global interpreter lock are a frequent source
of frustration throughout biological methods research.

I wanted to better understand the current multithreading situation
so I reimplemented parts of HMMER, a standard method for
multiple-sequence alignment. I chose this method because it
stresses both single-thread performance (scoring) and
multi-threaded performance (searching a database of sequences). The
GIL became the bottleneck when using only eight threads. This is a
method where the current popular implementations rely on 64 or
even 128 threads per process. I tried moving to subprocesses but
was blocked by the prohibitive IPC costs. HMMER is a relatively
elementary bioinformatics method and newer methods have far bigger
multi-threading demands.

Method researchers are begging to use Python (myself included),
because of its ease of use, the Python ecosystem, and because “it’s
what people know.” Many biologists only know a little bit of
programming (and that’s almost always Python). Until Python’s
multithreading situation is addressed, C and C++ will remain the
lingua franca of the biological methods research community.
The GIL Affects Python Library Usability
The GIL is a CPython implementation detail that limits multithreaded
parallelism, so it might seem unintuitive to think of it as a
usability issue. However, library authors frequently care a great
deal about performance and will design APIs that support working
around the GIL. These workarounds frequently lead to APIs that are
more difficult to use. Consequently, users of these APIs may
experience the GIL as a usability issue and not just a performance
issue.
For example, PyTorch exposes a multiprocessing-based API called
DataLoader for building data input pipelines. It uses fork() on
Linux because it is generally faster and uses less memory than
spawn(), but this leads to additional challenges for users: creating
a DataLoader after accessing a GPU can lead to confusing CUDA
errors. Accessing GPUs within a DataLoader worker quickly leads to
out-of-memory errors because processes do not share CUDA contexts
(unlike threads within a process).
Olivier Grisel, scikit-learn developer and software engineer at Inria,
describes how having to work around the GIL in scikit-learn related
libraries leads to a more complex and confusing user experience:
Over the years, scikit-learn developers have maintained ancillary
libraries such as joblib and loky to try to work around some
of the limitations of multiprocessing: extra memory usage partially
mitigated via semi-automated memory mapping of large data buffers,
slow worker startup by transparently reusing a pool of long
running workers, fork-safety problems of third-party native runtime
libraries such as GNU OpenMP by never using the fork-only
start-method, ability to perform parallel calls of interactively
defined functions in notebooks and REPLs in a cross-platform manner
via cloudpickle. Despite our efforts, this multiprocessing-based
solution is still brittle, complex to maintain and confusing to
data scientists with limited understanding of system-level
constraints. Furthermore, there are still irreducible limitations
such as the overhead caused by the pickle-based
serialization/deserialization steps required for inter-process
communication. A lot of this extra work and complexity would not be
needed anymore if we could use threads without contention on
multicore hosts (sometimes with 64 physical cores or more) to run
data science pipelines that alternate between Python-level
operations and calls to native libraries.
Ralf Gommers, co-director of Quansight Labs and NumPy and SciPy
maintainer, describes how the GIL affects the user experience of
NumPy and numeric Python libraries:
A key problem in NumPy and the stack of packages built around it is
that NumPy is still (mostly) single-threaded — and that has shaped
significant parts of the user experience and projects built around
it. NumPy does release the GIL in its inner loops (which do the
heavy lifting), but that is not nearly enough. NumPy doesn’t offer
a solution to utilize all CPU cores of a single machine well, and
instead leaves that to Dask and other multiprocessing solutions.
Those aren’t very efficient and are also more clumsy to use. That
clumsiness comes mainly in the extra abstractions and layers the
users need to concern themselves with when using, e.g.,
dask.array which wraps numpy.ndarray. It also shows up in
oversubscription issues that the user must explicitly be aware of
and manage via either environment variables or a third package,
threadpoolctl. The main reason is that NumPy calls into BLAS
for linear algebra – and those calls it has no control over, they
do use all cores by default via either pthreads or OpenMP.

Coordinating on APIs and design decisions to control parallelism is
still a major amount of work, and one of the harder challenges
across the PyData ecosystem. It would have looked a lot different
(better, easier) without a GIL.
GPU-Heavy Workloads Require Multi-Core Processing
Many high-performance computing (HPC) and AI workloads make heavy use
of GPUs. These applications frequently require efficient multi-core
CPU execution even though the bulk of the computation runs on a GPU.
Zachary DeVito, PyTorch core developer and researcher at FAIR
(Meta AI), describes how the GIL makes multithreaded scaling
inefficient even when the bulk of computation is performed outside of
Python:
In PyTorch, Python is commonly used to orchestrate ~8 GPUs and ~64
CPU threads, growing to 4k GPUs and 32k CPU threads for big models.
While the heavy lifting is done outside of Python, the speed of
GPUs makes even just the orchestration in Python not scalable. We
often end up with 72 processes in place of one because of the GIL.
Logging, debugging, and performance tuning are orders-of-magnitude
more difficult in this regime, continuously causing lower developer
productivity.
The use of many processes (instead of threads) makes common tasks more
difficult. Zachary DeVito continues:
On three separate occasions in the past couple of months
(reducing redundant compute in data loaders, writing model
checkpoints asynchronously, and parallelizing compiler
optimizations), I spent an order-of-magnitude more time figuring
out how to work around GIL limitations than actually solving the
particular problem.
Even GPU-heavy workloads frequently have a CPU-intensive component.
For example, computer vision tasks typically require
multiple “pre-processing” steps in the data input pipeline, like
image decoding, cropping, and resizing. These tasks are commonly
performed on the CPU and may use Python libraries like Pillow
or Pillow-SIMD. It is necessary to run the data input pipeline
on multiple CPU cores in order to keep the GPU “fed” with data.
The increase in GPU performance compared to individual CPU cores makes
multi-core performance more important. It is progressively more
difficult to keep the GPUs fully occupied. To do so requires efficient
use of multiple CPU cores, especially on multi-GPU systems. For
example, NVIDIA’s DGX-A100 has 8 GPUs and two 64-core CPUs in order to
keep the GPUs “fed” with data.
The GIL Makes Deploying Python AI Models Difficult
Python is widely used to develop neural network-based AI models. In
PyTorch, models are frequently deployed as part of multi-threaded,
mostly C++, environments. Python is often viewed skeptically
because the GIL can be a global bottleneck, preventing efficient
scaling even though the vast majority of the computations
occur “outside” of Python with the GIL released. The torchdeploy
paper [2] shows experimental evidence for these scaling
bottlenecks in multiple model architectures.
PyTorch provides a number of mechanisms for deploying Python AI
models that avoid or work around the GIL, but they all come with
substantial limitations. For example, TorchScript captures a
representation of the model that can be executed from C++ without any
Python dependencies, but it only supports a limited subset of Python
and often requires rewriting some of the model’s code. The
torch::deploy API allows multiple Python interpreters, each with its
own GIL, in the same process (similar to PEP 684). However,
torch::deploy has limited support for Python modules that use C-API
extensions.
Motivation Summary
Python’s global interpreter lock makes it difficult to use modern
multi-core CPUs efficiently for many scientific and numeric computing
applications. Heinrich Kuttler, Manuel Kroiss, and Paweł
Jurgielewicz found that multi-threaded implementations in Python did
not scale well for their tasks and that using multiple processes
was not a suitable alternative.
The scaling bottlenecks are not solely in core numeric tasks. Both
Zachary DeVito and Paweł Jurgielewicz described challenges with
coordination and communication in Python.
Olivier Grisel, Ralf Gommers, and Zachary DeVito described how current
workarounds for the GIL are “complex to maintain” and cause “lower
developer productivity.” The GIL makes it more difficult to develop
and maintain scientific and numeric computing libraries, as well as
leading to library designs that are more difficult to use.
Specification
Build Configuration Changes
The global interpreter lock will remain the default for CPython builds
and python.org downloads. A new build configuration flag,
--without-gil, will be added to the configure script that will
build CPython without the global interpreter lock.
When built with --without-gil, CPython will define the Py_NOGIL
macro in Python/patchlevel.h. The ABI tag will include the
letter “n” (for “nogil”).
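As a rough illustration (not part of the specification), an extension
module could detect the proposed build at compile time via the
Py_NOGIL macro; the fallback branch below is an assumption about how
code targeting the default build would coexist:

/* Build sketch: ./configure --without-gil && make
 * Compile-time check for the proposed no-GIL build. Py_NOGIL is the
 * macro described above, defined in Python/patchlevel.h. */
#include <Python.h>

#ifdef Py_NOGIL
/* --without-gil build: this extension must provide its own locking
 * instead of relying on the GIL for mutual exclusion. */
#else
/* Default build: the GIL still serializes execution of Python code. */
#endif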
Overview of CPython Changes
Removing the global interpreter lock requires substantial changes to
CPython internals, but relatively few changes to the public Python
and C APIs. This section describes the required changes to the
CPython implementation followed by the proposed API changes.
The implementation changes can be grouped into the following four
categories:
- Reference counting
- Memory management
- Container thread-safety
- Locking and atomic APIs
Reference Counting
Removing the GIL requires changes to CPython’s
reference counting implementation to make it thread-safe.
Furthermore, it needs to have low execution overhead and allow for
efficient scaling with multiple threads. This PEP proposes a
combination of three techniques to address these constraints. The
first is a switch from plain non-atomic reference counting to biased
reference counting, which is a thread-safe reference counting
technique with lower execution overhead than plain atomic reference
counting. The other two techniques are immortalization and a limited
form of deferred reference counting; they address some of the
multi-threaded scalability issues with reference counting by avoiding
some reference count modifications.
Biased reference counting (BRC) is a technique first described in 2018
by Jiho Choi, Thomas Shull, and Josep Torrellas [3]. It is based on the
observation that most objects are only accessed by a single thread,
even in multi-threaded programs. Each object is associated with an
owning thread (the thread that created it). Reference counting
operations from the owning thread use non-atomic instructions to
modify a “local” reference count. Other threads use atomic
instructions to modify a “shared” reference count. This design avoids
many atomic read-modify-write operations that are expensive on
contemporary processors.
The implementation of BRC proposed in this PEP largely matches the
original description of biased reference counting, but differs in
details like the size of reference counting fields and special bits
in those fields. BRC requires storing three pieces of information in
each object’s header: the “local” reference count, the “shared”
reference count, and the identifier of the owning thread. The BRC
paper packs these three things into a single 64-bit field. This PEP
proposes using three separate pointer-sized fields (i.e., three 64-bit
fields on 64-bit platforms) in each object’s header to avoid
potential issues due to reference count overflow.
The proposed PyObject struct (also called struct _object) is
below:

struct _object {
    _PyObject_HEAD_EXTRA
    uintptr_t ob_tid;          /* owning thread id */
    Py_ssize_t ob_ref_local;   /* local reference count */
    Py_ssize_t ob_ref_shared;  /* shared reference count */
    PyTypeObject *ob_type;
};
The details of the new fields are described in the following
sections.
Immortalization
Some objects, such as interned strings, small integers, statically
allocated PyTypeObjects, and the True, False, and None objects stay
alive for the lifetime of the program. These objects are marked as
immortal by setting the local reference count field (ob_ref_local)
to -1 and the thread id (ob_tid) to the unsigned equivalent
(UINTPTR_MAX). It’s sufficient to check either of these fields to
determine if an object is immortal, which enables slightly more
efficient Py_INCREF and Py_DECREF implementations.
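As a minimal sketch (the helper name is hypothetical, not part of
this PEP), such a check might look like:

/* Hypothetical helper for illustration: an immortal object has its
 * local reference count set to -1 (equivalently, ob_tid set to
 * UINTPTR_MAX), so inspecting a single field is enough. */
static inline int _Py_IsImmortal_sketch(PyObject *op)
{
    return op->ob_ref_local < 0;
}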
The Py_INCREF and Py_DECREF macros are no-ops for immortal
objects. This avoids contention on the reference count fields of
these objects when multiple threads access them concurrently.
This proposed immortalization scheme is very similar to PEP 683,
but with slightly different bit representation in the reference count
fields for immortal objects in order to work with biased reference
counting and deferred reference counting.
Biased Reference Counting
Biased reference counting has a fast-path for objects “owned” by the
current thread and a slow-path for other objects. Ownership is
indicated by the ob_tid field. Determining the thread id requires
platform-specific code [5]. Two special values for ob_tid are -1
and 0. A value of -1 indicates the object is immortal (see
Immortalization) and a value of 0 indicates that the object is not
owned by any thread.
Threads must give up ownership of an object before that object can be
destroyed. Ownership is most commonly given up when the local
reference count reaches zero, but also can be requested by other
threads. Threads give up ownership by setting ob_tid to zero, and
adding the local reference count to the shared reference count. If the
combined reference count is zero, the object can be deallocated.
Otherwise, only the shared reference count field is used from that
point onwards.
The ob_ref_local field stores the local reference count and two
flags. The two most significant bits are used to indicate the object
is immortal or uses deferred reference counting (see Deferred
reference counting).
The ob_ref_shared field stores the shared reference count and two
flags. The two least significant bits are used to indicate if the
object is “merged” or “queued.” The shared reference count is
therefore shifted left by two. The ob_ref_shared field uses the
least significant bits because the shared reference count can be
temporarily negative; increfs and decrefs may not be balanced between
threads.
If ob_ref_shared becomes negative, the current thread requests
that the owning thread merge the two fields. It atomically pushes
the object to the owning thread’s queue of objects to be merged and
sets the “queued” bit on ob_ref_shared (to prevent duplicate
queueings). The owning thread is notified via the eval_breaker
mechanism. In practice, this operation is rare. Most objects are
only accessed by a single thread and those objects accessed by
multiple threads rarely have negative shared reference counts.
The proposed Py_INCREF and Py_DECREF operations should behave
as follows (using C-like pseudo-code):
// low two bits of "ob_ref_shared" are used for flags
#define _Py_SHARED_SHIFT 2

void Py_INCREF(PyObject *op)
{
    Py_ssize_t new_local = op->ob_ref_local + 1;
    if (new_local == 0)
        return;  // object is immortal
    if (op->ob_tid == _Py_ThreadId())
        op->ob_ref_local = new_local;
    else
        atomic_add(&op->ob_ref_shared, 1 << _Py_SHARED_SHIFT);
}

void Py_DECREF(PyObject *op)
{
    if (op->ob_tid == -1) {
        return;  // object is immortal
    }
    if (op->ob_tid == _Py_ThreadId()) {
        op->ob_ref_local -= 1;
        if (op->ob_ref_local == 0) {
            _Py_MergeZeroRefcount();  // merge refcount
        }
    }
    else {
        _Py_DecRefShared();  // slow path
    }
}
The reference implementation [17] contains implementations of
_Py_MergeZeroRefcount and _Py_DecRefShared.
Note that the above is pseudocode: in practice, the implementation
should use “relaxed atomics” to access ob_tid and ob_ref_local
to avoid undefined behavior in C and C++.
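For example, a sketch of the ownership check using C11 relaxed
atomics (the cast and the atomic_load_explicit call are illustrative
assumptions; CPython would use its own atomic wrappers):

#include <stdatomic.h>

/* Illustrative only: read ob_tid with relaxed memory ordering before
 * comparing it to the current thread id, so that concurrent writes by
 * other threads do not constitute a data race. */
static int owned_by_current_thread(PyObject *op)
{
    uintptr_t tid = atomic_load_explicit(
        (_Atomic uintptr_t *)&op->ob_tid, memory_order_relaxed);
    return tid == _Py_ThreadId();
}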
Deferred Reference Counting
A few types of objects, such as top-level functions, code objects,
modules, and methods, tend to be frequently accessed by many threads
concurrently. These objects don’t necessarily live for the lifetime of
the program, so immortalization is not a good fit. This PEP proposes a
limited form of deferred reference counting to avoid contention on
these objects’ reference count fields in multi-threaded programs.
Typically, the interpreter modifies objects’ reference counts as they
are pushed to and popped from the interpreter’s stack. The
interpreter skips these reference counting operations for objects
that use deferred reference counting. Objects that support deferred
reference counting are marked by setting the second-most significant
bit in the local reference count field to one.
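As a sketch (the macro and helper names here are hypothetical, not
part of this PEP), marking an object this way might look like:

/* Hypothetical names for illustration: the second-most significant
 * bit of the local reference count field flags an object as using
 * deferred reference counting, as described above. */
#define _Py_REF_DEFERRED ((Py_ssize_t)1 << (8 * sizeof(Py_ssize_t) - 2))

static inline void _PyObject_SetDeferredRefcount_sketch(PyObject *op)
{
    op->ob_ref_local |= _Py_REF_DEFERRED;
}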
Because some reference counting operations are skipped, the reference count fields no
longer reflect the true number of references to these objects. The
true reference count is the sum of the reference count fields plus
any skipped references from each thread’s interpreter stack. The
true reference count can only be safely computed when all threads are