Debugging is difficult. Debugging across multiple languages is especially challenging, and debugging across devices often requires a team with varying skill sets and expertise to reveal the underlying problem.
Yet projects often require multiple languages to ensure high performance where necessary, a user-friendly experience, and compatibility where possible. Unfortunately, no single programming language offers all of the above, which demands that developers become versatile.
This post shows how a RAPIDS team approached debugging multiple programming languages, including the use of GDB to identify and resolve deadlocks. The team is dedicated to designing software to accelerate and scale data science solutions.
The bug featured in this post was part of the RAPIDS project; it was identified and resolved in the summer of 2019. It involves a complex stack with multiple programming languages, primarily C, C++, and Python, as well as CUDA for GPU acceleration.
Documenting this historical bug and its resolution serves a few goals, including:
- Demonstrating Python and C debugging with GDB
- Presenting ideas on how to diagnose deadlocks
- Developing a better understanding of mixing Python and CUDA
The content presented in this post should help you understand how such bugs manifest and how to address similar issues in your own work.
Bug description
To be efficient and performant, RAPIDS depends on a variety of libraries for a multitude of different operations. To name a few, RAPIDS uses CuPy and cuDF to compute arrays and DataFrames on the GPU, respectively. Numba is a just-in-time compiler that can be used for accelerating user-defined Python operations on the GPU.
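As a rough illustration of how these libraries fit together (not part of the reproducer in this post; the array sizes and column values below are arbitrary):

import cupy as cp
import cudf
from numba import cuda

# CuPy: NumPy-like arrays that live in GPU memory
a = cp.arange(10_000, dtype=cp.float64)

# cuDF: pandas-like DataFrames backed by GPU memory
df = cudf.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})
print(df.x.sum())

# Numba: JIT-compile a user-defined kernel and launch it on the GPU
@cuda.jit
def add_one(arr):
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] += 1.0

threads = 256
blocks = (a.size + threads - 1) // threads
add_one[blocks, threads](a)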
In addition, Dask is used to scale compute to multiple GPUs and multiple nodes. The last piece of the puzzle in the bug at hand is UCX, a communication framework used to leverage a variety of interconnects, such as InfiniBand and NVLink.
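The sketch below shows roughly how such a cluster is typically started from Python; the exact arguments accepted vary across dask-cuda and Dask versions, so treat this as an outline rather than the configuration used in the failing workflow:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU; protocol="ucx" routes worker-to-worker transfers
# through UCX-Py, which can use NVLink or InfiniBand where available
cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)
print(client)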
Figure 1 shows an overview of this stack. Although unknown at the time, a deadlock was occurring somewhere in this stack, preventing the workflow from completing.
This deadlock was first observed in August 2019, shortly after UCX was introduced in the stack. It turns out that the deadlock had previously manifested itself without UCX (using the Dask default TCP communicator), only less frequently.
Much time was spent exploring the conditions under which the deadlock occurred. Though unknown at the time, the bug could have been in a particular operation, such as group-by aggregations, merges/joins, or repartitioning, or in a particular version of any of the libraries, including cuDF, CuPy, Dask, UCX, and more. There were many facets to explore as a result.
Prepare for debugging
The next sections walk you through how to prepare for debugging.
Set up a minimal reproducer
Finding a minimal reproducer is key to debugging anything. This problem was initially identified in a workflow running on eight GPUs. Over time, we reduced it down to two GPUs. Having a minimal reproducer is critical for easily sharing a bug with others and getting the time and attention of a broader team.
Set up your environment
Before diving into the problem, set up your environment. The bug can be minimally reproduced with the 0.10 version of RAPIDS (released in October 2019). It is possible to set up the environment with either Conda or Docker (see the respective sections later in this post).
This entire process assumes the use of Linux. Because UCX is not supported on Windows or macOS, the bug is not reproducible on those operating systems.
Conda
First, install Miniconda. After the initial setup, we strongly recommend that you install mamba by running the following command:
conda install mamba -n base -c conda-forge
Then run the following commands to create and activate a conda environment with RAPIDS 0.10:
mamba create -n rapids-0.10 -c rapidsai -c nvidia -c conda-forge rapids=0.10 glog=0.4 cupy=6.7 numba=0.45.1 ucx-py=0.11 ucx=1.7 ucx-proc=*=gpu libnuma dask=2.30 dask-core=2.30 distributed=2.30 gdb
conda activate rapids-0.10
We recommend Mamba for speeding up environment resolution. Skipping that step and replacing mamba with conda should work as well, but may be considerably slower.
Docker
Alternatively, you can reproduce the bug with Docker. After you have the NVIDIA Container Toolkit set up, follow these instructions.
docker run -it --rm --cap-add sys_admin --cap-add sys_ptrace --ipc shareable --net host --gpus all rapidsai/rapidsai:0.10-cuda10.0-runtime-ubuntu18.04 /bin/bash
In the container, install mamba to speed up the environment resolution.
conda create -n mamba -c conda-forge mamba -y
Then, install UCX/UCX-Py and libnuma, which is a UCX dependency. Also, upgrade Dask to a version that has integrated UCX support. For debugging later, also install GDB.
/opt/conda/envs/mamba/bin/mamba install -y -c rapidsai -c nvidia -c conda-forge dask=2.30 dask-core=2.30 distributed=2.30 fsspec=2022.11.0 libnuma ucx-py=0.11 ucx=1.7 ucx-proc=*=gpu gdb -p /opt/conda/envs/rapids
Debugging
This section details how this particular problem was encountered and ultimately fixed, with a detailed step-by-step overview. You can also reproduce and practice a few of the described concepts.
Running (or hanging)
The debugging issue in question is definitely not limited to a single compute problem, but it is easier to use the same workflow that we used in 2019. That script can be downloaded to a local environment by running the following command:
wget https://gist.githubusercontent.com/pentschev/9ce97f8efe370552c7dd5e84b64d3c92/raw/424c9cf95f31c18d32a9481f78dd241e08a071a9/cudf-deadlock.py
To reproduce, execute the following:
OPENBLAS_NUM_THREADS=1 UCX_RNDV_SCHEME=put_zcopy UCX_MEMTYPE_CACHE=n UCX_TLS=sockcm,tcp,cuda_copy,cuda_ipc python cudf-deadlock.py
In just a few iterations (perhaps as few as one or two), you should see the preceding program hang. Now the real work begins.
The deadlock
A nice attribute of deadlocks is that the processes and threads (if you know how to investigate them) can show what they are currently trying to do, which lets you infer what is causing the deadlock.
The critical tool is GDB. However, much time was spent initially with PDB for investigating what Python was doing at each step. GDB can attach to live processes, so you must first find out what the processes and their associated IDs are:
(rapids) root@dgx13:/rapids/notebooks# ps ax | grep python
   19 pts/0    S      0:01 /opt/conda/envs/rapids/bin/python /opt/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token=
  865 pts/0    Sl+    0:03 python cudf-deadlock.py
  871 pts/0    S+     0:00 /opt/conda/envs/rapids/bin/python -c from multiprocessing.semaphore_tracker import main;main(69)
  873 pts/0    Sl+    0:08 /opt/conda/envs/rapids/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=70, pipe_handle=76) --multiprocessing-fork
  885 pts/0    Sl+    0:07 /opt/conda/envs/rapids/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=70, pipe_handle=85) --multiprocessing-fork
Four Python processes are relevant to this problem:
- Dask Client (865)
- Dask Scheduler (871)
- Two Dask workers (873 and 885)
Interestingly enough, significant progress has been made in debugging Python since this bug was initially investigated. In 2019, RAPIDS was on Python 3.6, which already had tools to debug the lower-level stack, but only when Python was built in debug mode. That required potentially rebuilding the entire software stack, which is prohibitive in complex cases like this.
Since Python 3.8, the debug builds use the same ABI as release builds, greatly simplifying debugging the C and Python stacks combined. We don’t cover that in this post.
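For reference, CPython ships a GDB extension (Tools/gdb/libpython.py in the CPython source tree) that, together with interpreter debug symbols, adds Python-aware commands to GDB. A hypothetical session (the path below is a placeholder) might include:

(gdb) source /path/to/cpython/Tools/gdb/libpython.py
(gdb) py-bt      # Python-level backtrace of the current thread
(gdb) py-list    # Python source around the current frame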
GDB exploration
Use gdb to attach to the last running process (one of the Dask workers):
(rapids) root@dgx13:/rapids/notebooks# gdb -p 885
Attaching to process 885
[New LWP 889]
[New LWP 890]
[New LWP 891]
[New LWP 892]
[New LWP 893]
[New LWP 894]
[New LWP 898]
[New LWP 899]
[New LWP 902]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f5494d48938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
(gdb)
Each Dask worker has several threads (communication, compute, admin, and so on). Use the gdb command info threads to inspect what each thread is doing.
(gdb) info threads
  Id   Target Id                                  Frame
* 1    Thread 0x7f5495177740 (LWP 885) "python"   0x00007f5494d48938 in pthread_rwlock_wrlock () from /lib/x86_64-linux-gnu/libpthread.so.0
  2    Thread 0x7f5425b98700 (LWP 889) "python"   0x00007f5494d4d384 in read () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f5425357700 (LWP 890) "python"   0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f5424b16700 (LWP 891) "python"   0x00007f5494d49f85 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f5411fff700 (LWP 892) "cuda-