[Submitted on 31 Aug 2023]
Abstract: Since its inception in “Attention Is All You Need”, the transformer architecture
has led to revolutionary advancements in NLP. The attention layer within the
transformer admits a sequence of input tokens $X$ and makes them interact
through pairwise similarities computed as softmax$(XQK^\top X^\top)$, where
$(K,Q)$ are the trainable key-query parameters. In this work, we establish a
formal equivalence between the optimization geometry of self-attention and a
hard-margin SVM problem that
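To make the quantity in the abstract concrete, here is a minimal sketch of the pairwise-similarity computation softmax$(XQK^\top X^\top)$ on random data. The dimensions, random inputs, and NumPy implementation are illustrative assumptions, not part of the paper.

import numpy as np

def softmax(Z, axis=-1):
    # Numerically stable row-wise softmax.
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

# Hypothetical sizes: n tokens, embedding dimension d (not from the paper).
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))   # sequence of input tokens
Q = rng.standard_normal((d, d))   # trainable query parameters
K = rng.standard_normal((d, d))   # trainable key parameters

# Pairwise token similarities as in the abstract: softmax(X Q K^T X^T).
A = softmax(X @ Q @ K.T @ X.T)    # (n, n) attention matrix; each row sums to 1
print(A.shape, A.sum(axis=1))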