Disclaimer: This article took quite a while to prepare. Although I’ve
made every effort to fact-check and ensure the accuracy of the content, there
may still be errors. If you notice any mistakes, please feel free to reach
out and let me know!
I like writing programs that run fast (and accurately). Knowing when to
compromise accuracy for performance is a difficult task, but it can be
achieved with enough time, dedication, and a sprinkle of Zen. Over the course
of my career, I have had the pleasure of taking algorithms and CPU ISA documents
and mashing them together into something spectacular. By “spectacular,” I mean
innovation that led to longer battery life, cooler CPU temperatures, and the
satisfaction of fully utilizing every transistor on the chip. One algorithm
that I have always had to port across various CPU architectures is the IIR filter.
It is primarily used for audio equalization, a crucial algorithm in
most audio products today.
IIR, or Infinite Impulse Response, filters
are mathematically defined as:
$$
y(n) + \sum_{k=1}^{M-1} a_{k}\, y(n - k) = \sum_{k=0}^{N-1} b_{k}\, x(n - k)
$$
Note: Wikipedia uses different notation. I am redefining them to be
consistent with the rest of the post.
In the equation above:
- \(n\) is the discrete-time index: \(n \in \mathbb{Z}\)
- \(x(n)\) is the discrete-time input signal
- \(y(n)\) is the output
- \(a_k\) are the feedback, or denominator, coefficients
- \(b_k\) are the feed-forward, or numerator, coefficients
- \(a_k\) and \(b_k\) shape the filter’s behavior (lowpass, highpass, etc.)
I won’t dive too deep into IIR fundamentals here as there are plenty of
excellent resources available:
- Selesnick, Infinite-Impulse Response Digital Filters:
  Lecture notes covering the basics of IIR filter design.
- CCRMA, Elementary Filter Sections:
  A discussion of various elementary filter sections for digital filtering.
- WPI, L5: IIR Filters and Their Implementation:
  Lecture notes focusing on the practical aspects of implementing IIR filters.
In summary, here are some distilled guidelines for efficient IIR filter
implementations:
- Second-Order Sections (SOS):
  Break the filter into second-order stages to mitigate numerical instability.
  While other topologies exist, such as the State Variable Filter (SVF), this
  discussion focuses on SOS implementations.
- Floating-Point Implementations:
  Opt for the transposed direct form II topology to enhance numerical
  stability. There’s a wealth of literature on this topic, so further
  exploration can yield additional insights.
- Fixed-Point Implementations:
  Use the direct form I topology for fixed-point designs, as it has no
  internal overflows. Reserve direct form II for cases where memory
  constraints are particularly severe.
- Regular Coefficient Testing:
  IIR filters can (and will) drift into instability over time with certain
  inputs. It’s wise to periodically test, and if necessary reset, the filter
  coefficients to ensure continued stability.
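The stability test for a second-order section can be sketched with the classic stability-triangle condition: with \(a_0 = 1\), both poles of \(1 + a_1 z^{-1} + a_2 z^{-2}\) lie strictly inside the unit circle exactly when \(|a_2| < 1\) and \(|a_1| < 1 + a_2\). The helper name below is mine, not from any particular library:

```c
#include <math.h>
#include <stdbool.h>

/* Stability-triangle test for a normalized biquad denominator
 * 1 + a1*z^-1 + a2*z^-2: both poles are strictly inside the unit
 * circle iff |a2| < 1 and |a1| < 1 + a2. */
static bool biquad_is_stable(float a1, float a2) {
    return fabsf(a2) < 1.0f && fabsf(a1) < 1.0f + a2;
}
```

A check like this is cheap enough to run whenever coefficients are updated, rather than only at design time.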
Most modern amplifier SoCs include dedicated hardware accelerators for IIR
filters. Memory-mapped registers can be configured with the desired filter
coefficients. Some DSPs even offer dual accumulators to separately maintain
feed-forward and feedback states before merging the results. In contrast,
general-purpose CPUs are designed for parallel execution using pipeline stages
and SIMD instruction sets. Standard C implementations of IIR filters rarely
exploit this parallelism due to inherent data dependencies. To bridge this gap,
we must leverage compiler intrinsic functions and some DSP math, which is the
main focus of this post.
Biquads: 2nd Order IIR Filters (SOS)
Biquads are second-order
IIR filters commonly used as building blocks for higher-order designs. They are
named “biquad” because their transfer function in the Z-domain consists of two
quadratic polynomials:
$$
H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}}
$$
In the time domain, the filter is described by the difference equation:
$$
a_0 y(n) = b_0 x(n) + b_1 x(n - 1) + b_2 x(n - 2) - a_1 y(n - 1) - a_2 y(n - 2)
$$
Typically, (a_0) is normalized to (1), because divisions are computationally
expensive.
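Normalization is a one-time division at design time: divide every coefficient by \(a_0\) once, and the per-sample loop never divides. A minimal sketch (the layout and function name are my own assumptions):

```c
/* Normalize a biquad so that a0 == 1: divide the other five
 * coefficients by a0 once, up front, so the per-sample loop
 * needs no division. Layout: {b0, b1, b2, a0, a1, a2}. */
static void biquad_normalize(float c[6]) {
    const float inv_a0 = 1.0f / c[3];
    c[0] *= inv_a0; /* b0 */
    c[1] *= inv_a0; /* b1 */
    c[2] *= inv_a0; /* b2 */
    c[4] *= inv_a0; /* a1 */
    c[5] *= inv_a0; /* a2 */
    c[3] = 1.0f;    /* a0 */
}
```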
For floating-point implementations, the transposed direct form topology is
preferred over direct form I due to its superior numerical stability. Moreover,
the transposed direct form II structure reduces memory usage by consolidating
four delay elements into two. In this form, the filter equations can be
reformulated as:
$$
y(n) = b_0 x(n) + w_1 \\
w_1 = b_1 x(n) - a_1 y(n) + w_2 \\
w_2 = b_2 x(n) - a_2 y(n)
$$
Below is a sample C function that processes a block of samples using this
formulation:
void process_biquad(const float *coeff, float *state, const float *input,
                    float *output, size_t num_frames) {
  // Load coefficients
  const float b0 = coeff[0];
  const float b1 = coeff[1];
  const float b2 = coeff[2];
  const float a1 = coeff[3];
  const float a2 = coeff[4];
  // Load states
  float w1 = state[0];
  float w2 = state[1];
  float x, y;
  for (size_t i = 0; i < num_frames; ++i) {
    x = input[i];
    y = b0 * x + w1;
    w1 = (b1 * x) - (a1 * y) + w2;
    w2 = (b2 * x) - (a2 * y);
    output[i] = y;
  }
  // Save states
  state[0] = w1;
  state[1] = w2;
}
Memory Map for the Function Arguments:
- coeff: An array of floats representing the filter coefficients in the
  order \(\{b_0, b_1, b_2, a_1, a_2\}\). \(a_0\) is assumed to be \(1\) and is
  therefore omitted.
- state: The delay buffer containing the state variables \(\{w_1, w_2\}\).
- input: Input buffer containing num_frames samples.
- output: Output buffer that will hold num_frames processed samples.
- num_frames: The number of samples (or frames) to process.
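A quick sanity check on the routine: with \(b_0 = 1\) and all other coefficients zero, the filter is an identity and must reproduce its input exactly. The snippet repeats process_biquad so it compiles standalone:

```c
#include <stddef.h>

/* process_biquad from above, repeated here so the snippet is
 * self-contained. */
void process_biquad(const float *coeff, float *state, const float *input,
                    float *output, size_t num_frames) {
  const float b0 = coeff[0], b1 = coeff[1], b2 = coeff[2];
  const float a1 = coeff[3], a2 = coeff[4];
  float w1 = state[0], w2 = state[1];
  for (size_t i = 0; i < num_frames; ++i) {
    const float x = input[i];
    const float y = b0 * x + w1;
    w1 = (b1 * x) - (a1 * y) + w2;
    w2 = (b2 * x) - (a2 * y);
    output[i] = y;
  }
  state[0] = w1;
  state[1] = w2;
}

/* Identity filter: b0 = 1, everything else 0 -> output == input,
 * and the state variables stay at zero. */
void demo_identity(void) {
  float coeff[5] = {1.0f, 0.0f, 0.0f, 0.0f, 0.0f}; /* {b0,b1,b2,a1,a2} */
  float state[2] = {0.0f, 0.0f};                   /* {w1, w2} */
  float in[4] = {0.5f, -0.25f, 1.0f, 0.0f};
  float out[4];
  process_biquad(coeff, state, in, out, 4);
}
```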
SIMD Level 1: Process Multiple Channels
A straightforward SIMD optimization is to utilize 2-lane SIMD registers to
process multiple biquad channels concurrently. This technique is implemented in
my A5Eq.lv2 plugin project, where I use
double-precision SIMD with both SSE (for x86) and NEON (for ARM) to process two
channels simultaneously. The general idea is shown in the figure below:
graph LR; input_left --> H0L(H0,L) input_right --> H0R(H0,R) H0L --> H1L(H1,L) H0R --> H1R(H1,R) H1L --> HKL(...) H1R --> HKR(...) HKL --> HNL(HM,L) HKR --> HNR(HM,R) HNL --> output_left HNR --> output_right
Since LV2 audio buffers are non-interleaved, an extra step is required to
interleave and later de-interleave the buffer for SIMD processing. However, for
filters with more than four stages (i.e., 8th order filters or
higher), the overhead from interleaving is negligible compared to the
performance benefits. The basic idea of the implementation is shown below:
graph LR; input_left --> Interleave input_right --> Interleave Interleave --> H0(H0,LH0,R) H0 --> H1(H1,L
H1,R) H1 -.-> H2(HM,L
HN,R) H2 --> D(De-interleave) D --> output_left D --> output_right
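The interleave and de-interleave steps themselves are simple strided copies; a minimal sketch (helper names are mine):

```c
#include <stddef.h>

/* Pack two mono buffers into one L/R-interleaved buffer, as
 * required before the 2-lane SIMD biquad chain. */
static void interleave2(const float *l, const float *r,
                        float *lr, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        lr[2 * i + 0] = l[i];
        lr[2 * i + 1] = r[i];
    }
}

/* Unpack an L/R-interleaved buffer back into two mono buffers
 * after processing. */
static void deinterleave2(const float *lr, float *l, float *r,
                          size_t n) {
    for (size_t i = 0; i < n; ++i) {
        l[i] = lr[2 * i + 0];
        r[i] = lr[2 * i + 1];
    }
}
```

Both loops are trivially vectorizable, which is why their cost becomes negligible next to a multi-stage filter chain.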
For implementation details, check out the stereo biquad filter code in
A5Eq.lv2/src/dsp_biquad.hpp:77.
Although A5Eq uses the same set of
coefficients for both channels, it can be easily adapted into a true
multi-channel filter architecture by changing how the tuning coefficients are
stored. The memory map of the coefficients for
stereo SIMD is as follows. Note that the coefficients are interleaved:
coeff = {
    // Left, Right
    b00, b01,
    b10, b11,
    b20, b21,
    a10, a11,
    a20, a21,
};
state = {
    // Left, Right
    w10, w11,
    w20, w21,
};
SIMD Level 2: Crossover Filters
Modern CPUs feature multiple SIMD lanes. For example, on the x86/AMD64
architecture, SSE
offers 4 lanes of single-precision floats,
AVX offers 8, and
AVX-512 offers 16. On ARM,
NEON
typically offers 4 lanes, while the optional SVE extension can process up to 64
elements. We can exploit this to perform parallel filtering. Although the
underlying math remains unchanged, we can restructure the code and data into
parallel nodes that align with the SIMD lanes. One effective approach is to use
a crossover structure that scales to any power of two. For example, consider
this four-band crossover filter:
graph LR; input --> LP0 input --> HP0 LP0 --> LP1L LP0 --> HP1L HP0 --> LP1H HP0 --> HP1H LP1L --> output_ll HP1L --> output_lh LP1H --> output_hl HP1H --> output_hh
In this design, reusing the crossover section produces a tree-like hierarchy,
enabling a divide-and-conquer approach. This allows each branch of the tree to
be processed concurrently, effectively leveraging the full width of the SIMD
lanes.
For instance, suppose you have a biquad filter implementation that processes
four lanes simultaneously:
void crossover_kernel(const float *coeff, float *state,
const float *input, float *output,