[Submitted on 25 Feb 2025]
Abstract:Conventional self-attention mechanisms incur quadratic complexity, limiting their scalability on long sequences. We introduce FFTNet, an adaptive spectral filtering framework that leverages the Fast Fourier Transform (FFT) to achieve global
10 Comments
larodi
Man, this all really gets increasingly complex, with increasingly complex math…
avereveard
Isn't flash attention already n log n?
pointlessone
OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to the attention mechanism? What frequencies does it talk about? How does it encode positional relations between tokens?
xeonmc
Basically it leverages the convolution theorem[0]: expensive convolutions in direct space become simple multiplications in reciprocal space, and vice versa.
Wherever you have a convolution operation on your data, transform it to the conjugate domain to turn it into a multiplication.
In other words, work in the domain that is natural to your data.
[0] https://en.wikipedia.org/wiki/Convolution_theorem
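A minimal numpy sketch of that idea (circular convolution; the array length and names are just for illustration): the direct O(n²) sum and the FFT-based O(n log n) route give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # signal
h = rng.standard_normal(8)   # filter

# Direct circular convolution: O(n^2)
direct = np.array([sum(x[j] * h[(i - j) % 8] for j in range(8)) for i in range(8)])

# Same operation via the convolution theorem: pointwise multiply in the
# frequency domain, then transform back. O(n log n).
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

assert np.allclose(direct, spectral)
```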
yorwba
I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.
If the results were close to state-of-the-art, probably the author would've mentioned it?
yagizdegirmenci
Google introduced this idea in 2021 with "FNet: Mixing Tokens with Fourier Transforms" [0].
Later they found that their TPUs' matrix multiplication was faster than the FFT in most scenarios.
[0]: https://arxiv.org/abs/2105.03824
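For context, FNet's mixing sublayer is tiny: as I recall from the paper, it replaces self-attention with an FFT along the hidden dimension, another along the sequence dimension, and keeps only the real part. A rough PyTorch sketch (shapes assumed to be (batch, seq_len, hidden)):

```python
import torch

def fnet_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, hidden)
    # FFT over the hidden dim, then over the sequence dim; keep the real part.
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
```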
cs702
TL;DR:
1. Take FNet (https://arxiv.org/abs/2105.03824).
2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.
3. Apply non-linear functions to both real and imaginary components, before mapping the convolved data back to the time domain.
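A rough sketch of steps 1-3 in PyTorch; the module name, the MLP that produces the data-dependent filter, the mean-pooled context, and the GELU activations are my assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AdaptiveSpectralFilter(nn.Module):
    """Hypothetical sketch of an FFTNet-style layer, not the authors' code."""

    def __init__(self, d_model: int):
        super().__init__()
        # Small MLP that produces a per-channel spectral filter from the data (step 2).
        self.filter_mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.act_re = nn.GELU()
        self.act_im = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        X = torch.fft.rfft(x, dim=1)                    # to the frequency domain, O(n log n)
        ctx = x.mean(dim=1, keepdim=True)               # global summary of the sequence
        filt = self.filter_mlp(ctx)                     # data-dependent filter (step 2)
        X = X * filt                                    # multiplication here = convolution in time
        X = torch.complex(self.act_re(X.real),          # step 3: non-linearity on real
                          self.act_im(X.imag))          #         and imaginary parts
        return torch.fft.irfft(X, n=x.size(1), dim=1)   # back to the time domain

# Example usage:
# layer = AdaptiveSpectralFilter(64)
# y = layer(torch.randn(2, 128, 64))   # (batch=2, seq_len=128, d_model=64)
```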
A7C3D5
I'll never not read FFT as Final Fantasy Tactics.
DrNosferatu
Can someone confirm the big-O time complexities of
1. traditional self-attention;
2. FlashAttention;
3. any novel others?
nnnnico
It really is waves all the way down