
Our paper Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
is available on arXiv, and our code is available
on GitHub.
Why Sparsity?
Recent results suggest that overparameterized neural networks generalize well (Belkin et al. 2019). We’ve witnessed the rise and success of large models (e.g., AlphaFold, GPT-3, DALL-E, DLRM), but they are expensive to train and becoming economically, technically, and environmentally unsustainable (Thompson et al. 2020). For example, a single training run of GPT-3 takes a month on 1,024 A100 GPUs and costs $12M. An ideal model should use less computation and memory while retaining the generalization benefits of large models.
The simplest and most popular direction is to sparsify these models — sparsify matrix multiplications, the fundamental building blocks underlying neural networks. Sparsity is not new! It has a long history in machine learning (Lecun et al. 1990) and has driven fundamental progress in other fields such as statistics (Tibshirani et al. 1996), neuroscience (Foldiak et al. 2003), and signal processing (Candes et al. 2005). What is new is that in the modern overparameterized regime, sparsity is no longer used to regularize models — sparse models should behave as close as possible to a dense model, only smaller and faster.
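To make "sparse but behaving like a dense model" concrete, here is a minimal sketch (an illustration, not code from the paper's repository) of a PyTorch linear layer whose weight is restricted by a fixed binary mask. The pattern is static, so nothing about the mask needs to be learned or updated during training:

```python
# Minimal sketch (illustration only, not the paper's implementation): a linear
# layer whose weight is restricted to a fixed, static binary sparsity mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, mask: torch.Tensor):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # The mask is a buffer, not a parameter: it is fixed and never trained.
        self.register_buffer("mask", mask.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Masked-out weights stay zero, and their gradients are zero as well.
        return F.linear(x, self.weight * self.mask, self.bias)


# Example: keep a random 10% of the weights (a structured pattern would go here).
mask = torch.rand(256, 512) < 0.1
layer = MaskedLinear(512, 256, mask)
y = layer(torch.randn(8, 512))
```

Note that multiplying by a dense mask only simulates sparsity; an actual speedup requires a kernel that skips the zeros, which is exactly where the challenges below come in.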
What are the Challenges?
Sparse training is an active research area, but why has sparsity not been adopted widely? Below we summarize a few challenges that motivate our work:
- Choice of Sparse Parameterization: Many existing methods, e.g., pruning (Lee et al. 2018, Evci et al. 2020), lottery tickets (Frankle et al. 2018), and hashing (Chen et al. 2019, Kitaev et al. 2020), maintain dynamic sparsity masks. However, the overhead of evolving the sparsity mask often slows down (instead of speeds up!) training.
- Hardware Suitability: Most existing methods adopt unstructured sparsity, which may be efficient in theory but not on hardware such as GPUs (highly optimized for dense computation). An unstructured sparse model with 1% nonzero weights can be as slow as a dense model (Hooker et al. 2020); the sketch after this list makes this concrete.

Memory access for hardware with a block size of 16: accessing a single (red) location loads the full block of 16 (blue) locations.
- Layer-Agnostic Sparsity: Most existing work targets a single type of operation, such as attention (Child et al. 2019, Zaheer et al. 2020), whereas neural networks often compose different modules (attention, MLP). In many applications the MLP layers are the main training bottleneck (Wu et al. 2020).
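To make the hardware point concrete, the short calculation below (an illustration, not from the paper) counts how many 16x16 blocks of a weight matrix must be loaded at 1% density, first with the nonzeros scattered at random and then with the same budget packed into full blocks:

```python
# Illustration: fraction of 16x16 blocks touched at 1% density,
# unstructured vs. block-aligned sparsity.
import torch

n, block, density = 4096, 16, 0.01
nb = n // block  # number of blocks per dimension

# Unstructured: nonzeros scattered uniformly at random.
unstructured = torch.rand(n, n) < density
# A block must be loaded if it contains at least one nonzero.
per_block = unstructured.view(nb, block, nb, block).sum(dim=(1, 3))
frac_touched = (per_block > 0).float().mean().item()
print(f"unstructured: {frac_touched:.0%} of blocks loaded")        # ~92%

# Block-aligned: the same number of nonzeros grouped into full blocks.
num_dense_blocks = int(density * nb * nb)
print(f"block-aligned: {num_dense_blocks / (nb * nb):.0%} of blocks loaded")  # 1%
```

Both layouts store the same number of parameters, but the unstructured one forces the hardware to read almost the entire matrix, which is why block-aligned patterns matter.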
Therefore, we are looking for simple static sparsity patterns that are hardware-suitable and widely applicable to most NN layers.
Our Approach: Pixelated Butterfly
Intuition: In our early exploration, we observed that one sparsity pattern, butterfly + low-rank, consistently outperforms the others. This “magic” sparsity pattern closely connects to two lines of work on matrix structures (see Figure 1):
- Sparse + low-rank matrices (Candes et al. 2011, Udell et al. 2019, Chen et al. 2021): the sparse part captures local information and the low-rank part captures global information,
- Butterfly matrices (Parker et al. 1995, Dao et al. 2019): their products can tightly represent any sparse matrix with near-optimal space and time complexity. With butterfly, we can avoid the combinatorial problem of searching over all possible sparsity patterns!
Butterfly + low-rank is static and applicable to most NN layers.

Figure 1: Observation – Butterfly + low-rank is a simple and effective fixed sparsity pattern.
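As a rough sketch of what such a parameterization can look like (a simplification for illustration, not the exact parameterization in the paper), the layer below writes its weight as a sparse matrix supported on a flattened butterfly pattern (indices i and j that agree or differ in exactly one bit) plus a rank-r term:

```python
# Sketch (a simplified illustration, not the paper's exact parameterization):
# a weight matrix written as "butterfly-pattern sparse + low-rank",
#   W = S * M + U @ V.T
# where M is a fixed butterfly-style mask and U, V have small rank r.
import torch
import torch.nn as nn
import torch.nn.functional as F


def flat_butterfly_mask(n: int) -> torch.Tensor:
    """Support of a flattened butterfly: (i, j) is nonzero when i == j or
    i and j differ in exactly one bit (n must be a power of 2)."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    diff = i ^ j                         # bitwise XOR of row and column indices
    return (diff & (diff - 1)) == 0      # zero or a power of two


class SparseLowRankLinear(nn.Module):
    def __init__(self, n: int, rank: int = 8):
        super().__init__()
        self.register_buffer("mask", flat_butterfly_mask(n).float())
        self.sparse = nn.Parameter(torch.randn(n, n) * 0.02)
        self.U = nn.Parameter(torch.randn(n, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(n, rank) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sparse (butterfly-pattern) term plus low-rank term.
        weight = self.sparse * self.mask + self.U @ self.V.T
        return F.linear(x, weight)


layer = SparseLowRankLinear(256, rank=8)
y = layer(torch.randn(4, 256))
```

In practice the sparse term would be stored and multiplied with a block-sparse kernel rather than a dense mask, which is precisely the hardware issue the next paragraph turns to.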
However, butterfly matrices are inefficient on modern hardware because (1) their sparsity patterns are not block-aligned, and (2) they are products of many factors, which must be multiplied sequentially and are hard to parallelize.
We address the issues with two