
The FFT Strikes Back: An Efficient Alternative to Self-Attention by iNic

10 Comments

  • larodi
    Posted February 26, 2025 at 10:53 am

    Man, this all really keeps getting more complex, wrapped in increasingly complex math…

  • avereveard
    Posted February 26, 2025 at 10:56 am

    Isn't flash attention already n log n?

  • pointlessone
    Posted February 26, 2025 at 11:09 am

    OK, I admit that the math flies way over my head and I barely understand the text around the math. Can someone please explain (in basic English) how this is equivalent to the attention mechanism? What frequencies does it talk about? How does it encode positional relations between tokens?

  • xeonmc
    Posted February 26, 2025 at 11:22 am

    Basically it leverages the convolution theorem[0]: expensive convolutions in direct space become simple multiplications in reciprocal space, and vice versa.

    Wherever you have a convolution operation on your data, transform it to the conjugate domain to turn the convolution into a multiplication.

    In other words, work in the domain that is natural to your data.

    [0] https://en.wikipedia.org/wiki/Convolution_theorem
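
    For the curious, a minimal NumPy sketch (my addition, not part of the comment above) of what the theorem buys you: the O(n²) direct circular convolution and the O(n log n) FFT route give the same result.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n = 256
    x = rng.standard_normal(n)  # a signal, e.g. one channel of a token sequence
    h = rng.standard_normal(n)  # a filter of the same length

    # Circular convolution in direct space: O(n^2) multiply-adds.
    direct = np.zeros(n)
    for k in range(n):
        for m in range(n):
            direct[k] += x[m] * h[(k - m) % n]

    # Convolution theorem: FFT -> pointwise multiply -> inverse FFT, O(n log n).
    via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

    print(np.allclose(direct, via_fft))  # True, up to floating-point error
    ```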

  • yorwba
    Posted February 26, 2025 at 11:24 am

    I don't see how you could fit causal masking into this framework without having to do n different FFTs, and there's no mention of positional embeddings either, so I guess the self-attention implementation being compared against is noncausal NoPE, which would make this a case of baseline sandbagging and maybe not so impressive.

    If the results were close to state-of-the-art, probably the author would've mentioned it?

  • yagizdegirmenci
    Posted February 26, 2025 at 11:24 am

    Google introduced this idea in 2021 with "FNet: Mixing Tokens with Fourier Transforms" [0].

    Later they found that matrix multiplication on their TPUs was faster than the FFT in most scenarios.

    [0]: https://arxiv.org/abs/2105.03824

  • cs702
    Posted February 26, 2025 at 11:26 am

    TL;DR:

    1. Take FNet (https://arxiv.org/abs/2105.03824).

    2. Replace the fixed (frequency-domain) convolution filter with one that is dynamically computed from the data.

    3. Apply non-linear functions to both real and imaginary components, before mapping the convolved data back to the time domain.
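
    A rough PyTorch sketch of that recipe, as an illustration of the three steps above rather than the paper's actual code; the filter-generating layer, the mean-pooled context used to make it data-dependent, and the GELU non-linearity are all assumptions on my part.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpectralMixerSketch(nn.Module):
        """Illustrative token mixer: FFT over the sequence, a filter computed
        from the data (step 2), a non-linearity on real and imaginary parts
        (step 3), then an inverse FFT back to the token domain."""

        def __init__(self, d_model: int):
            super().__init__()
            # Produces a per-channel filter from a summary of the input;
            # one simple way to make the filter data-dependent.
            self.to_filter = nn.Linear(d_model, d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model)
            X = torch.fft.rfft(x, dim=1)                        # to the frequency domain
            ctx = x.mean(dim=1, keepdim=True)                   # global summary of the sequence
            filt = torch.sigmoid(self.to_filter(ctx))           # data-dependent, real-valued filter
            Y = X * filt                                        # convolution as a pointwise product
            Y = torch.complex(F.gelu(Y.real), F.gelu(Y.imag))   # non-linearity on both components
            return torch.fft.irfft(Y, n=x.size(1), dim=1)       # back to the token (time) domain

    # e.g. SpectralMixerSketch(64)(torch.randn(2, 128, 64)).shape == (2, 128, 64)
    ```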

  • A7C3D5
    Posted February 26, 2025 at 11:59 am

    I'll never not read FFT as Final Fantasy Tactics.

  • DrNosferatu
    Posted February 26, 2025 at 12:03 pm

    Can someone confirm the big-O time complexities of

    1. traditional Self-Attention;

    2. Flash-Attention;

    3. any novel others?

  • nnnnico
    Posted February 26, 2025 at 12:12 pm

    It really is waves all the way down
