
Our paper Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
is available on arXiv, and our code is available
on GitHub.
Why Sparsity?
Recent results suggest that overparameterized neural networks generalize well (Belkin et al. 2019). We’ve witnessed the rise and success of large models (e.g., AlphaFold, GPT-3, DALL-E, DLRM), but they are expensive to train and becoming economically, technically, and environmentally unsustainable (Thompson et al. 2020). For example, a single training run of GPT-3 takes a month on 1,024 A100 GPUs and costs $12M. An ideal model should use less computation and memory while retaining the generalization benefits of large models.
The simplest and most popular direction is to sparsify these models — sparsify matrix multiplications, the fundamental building blocks underlying neural networks. Sparsity is not new! It has a long history in machine learning (Lecun et al. 1990) and has driven fundamental progress in other fields such as statistics (Tibshirani et al. 1996), neuroscience (Foldiak et al. 2003), and signal processing (Candes et al. 2005). What is new is that in the modern overparameterized regime, sparsity is no longer used to regularize models — sparse models should behave as close as possible to a dense model, only smaller and faster.
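To make "sparse but behaving like a dense model" concrete, here is a minimal sketch (an illustration, not code from the paper's repository) of a PyTorch linear layer whose weight is restricted by a fixed binary mask. The pattern is static, so nothing about the mask needs to be learned or updated during training:

```python
# Minimal sketch (illustration only, not the paper's implementation): a linear
# layer whose weight is restricted to a fixed, static binary sparsity mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, mask: torch.Tensor):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # The mask is a buffer, not a parameter: it is fixed and never trained.
        self.register_buffer("mask", mask.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked-out weights stay zero, and their gradients are zero as well.
        return F.linear(x, self.weight * self.mask, self.bias)


# Example: keep a random 10% of the weights (a structured pattern would go here).
mask = torch.rand(256, 512) < 0.1
layer = MaskedLinear(512, 256, mask)
y = layer(torch.randn(8, 512))
```

Note that multiplying by a dense mask only simulates sparsity; an actual speedup requires a kernel that skips the zeros, which is exactly where the challenges below come in.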
What are the Challenges?
Sparse training is an active research area, but why has sparsity not been adopted widely? Below we summarize a few challenges that motivate our work:
- Choice of Sparse Parameterization: Many existing methods, e.g., pruning (Lee et al. 2018, Evci et al. 2020), lottery tickets (Frankle et al. 2018), and hashing (Chen et al. 2019, Kitaev et al. 2020), maintain dynamic sparsity masks. However, the overhead of evolving the sparsity mask often slows down (instead of speeds up!) training.
- Hardware Suitability: Most existing methods adopt unstructured sparsity, which may be efficient in theory but not on hardware such as GPUs (highly optimized for dense computation). An unstructured sparse model with 1% nonzero weights can be as slow as a dense model (Hooker et al. 2020); the sketch after this list makes this concrete.

Memory access for hardware with a block size of 16: accessing a single (red) location loads the full block of 16 (blue) locations.
- Layer-Agnostic Sparsity: Most existing work targets a single type of operation, such as attention (Child et al. 2019, Zaheer et al. 2020), whereas neural networks often compose different modules (attention, MLP). In many applications the MLP layers are the main training bottleneck (Wu et al. 2020).
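To make the hardware point concrete, the short calculation below (an illustration, not from the paper) counts how many 16x16 blocks of a weight matrix must be loaded at 1% density, first with the nonzeros scattered at random and then with the same budget packed into full blocks:

```python
# Illustration: fraction of 16x16 blocks touched at 1% density,
# unstructured vs. block-aligned sparsity.
import torch

n, block, density = 4096, 16, 0.01
nb = n // block  # number of blocks per dimension

# Unstructured: nonzeros scattered uniformly at random.
unstructured = torch.rand(n, n) < density
# A block must be loaded if it contains at least one nonzero.
per_block = unstructured.view(nb, block, nb, block).sum(dim=(1, 3))
frac_touched = (per_block > 0).float().mean().item()
print(f"unstructured: {frac_touched:.0%} of blocks loaded")        # ~92%

# Block-aligned: the same number of nonzeros grouped into full blocks.
num_dense_blocks = int(density * nb * nb)
print(f"block-aligned: {num_dense_blocks / (nb * nb):.0%} of blocks loaded")  # 1%
```

Both layouts store the same number of parameters, but the unstructured one forces the hardware to read almost the entire matrix, which is why block-aligned patterns matter.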
Therefore, we are looking for simple static sparsity patterns that are hardware-suitable and widely applicable to most NN layers.
Our Approach: Pixelated Butterfly
Intuition: In our early exploration, we observed that one sparsity pattern, butterfly + low-rank, consistently outperforms the others. This “magic” sparsity pattern closely connects to two lines of work on matrix structures (see Figure 1):
- Sparse + low-rank matrices (Candes et al. 2011, Udell et al. 2019, Chen et al. 2021): the sparse part captures local information and the low-rank part captures global information,
- Butterfly matrices (Parker et al. 1995, Dao et al. 2019): their products can tightly represent any sparse matrix with near-optimal space and time complexity. With butterfly, we can avoid the combinatorial problem of searching over all possible sparsity patterns!
Butterfly + low-rank is static and applicable to most NN layers.

Figure 1: Observation – Butterfly + low-rank is a simple and effective fixed sparsity pattern.
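As a rough sketch of what such a parameterization can look like (a simplification for illustration, not the exact parameterization in the paper), the layer below writes its weight as a sparse matrix supported on a flattened butterfly pattern (indices i and j that agree or differ in exactly one bit) plus a rank-r term:

```python
# Sketch (a simplified illustration, not the paper's exact parameterization):
# a weight matrix written as "butterfly-pattern sparse + low-rank",
#   W = S * M + U @ V.T
# where M is a fixed butterfly-style mask and U, V have small rank r.
import torch
import torch.nn as nn
import torch.nn.functional as F


def flat_butterfly_mask(n: int) -> torch.Tensor:
    """Support of a flattened butterfly: (i, j) is nonzero when i == j or
    i and j differ in exactly one bit (n must be a power of 2)."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    diff = i ^ j                         # bitwise XOR of row and column indices
    return (diff & (diff - 1)) == 0      # zero or a power of two


class SparseLowRankLinear(nn.Module):
    def __init__(self, n: int, rank: int = 8):
        super().__init__()
        self.register_buffer("mask", flat_butterfly_mask(n).float())
        self.sparse = nn.Parameter(torch.randn(n, n) * 0.02)
        self.U = nn.Parameter(torch.randn(n, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n, rank) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse (butterfly-pattern) term plus low-rank term.
        weight = self.sparse * self.mask + self.U @ self.V.T
        return F.linear(x, weight)


layer = SparseLowRankLinear(256, rank=8)
y = layer(torch.randn(4, 256))
```

In practice the sparse term would be stored and multiplied with a block-sparse kernel rather than a dense mask, which is precisely the hardware issue the next paragraph turns to.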
However, butterfly matrices are inefficient on modern hardware because (1) their sparsity patterns are not block-aligned, and (2) they are products of many factors, which must be multiplied sequentially and are hard to parallelize.
We address the issues with two