TY - GEN
T1 - Attention is not all you need: pure attention loses rank doubly exponentially with depth
T2 - 38th International Conference on Machine Learning, ICML 2021
AU - Dong, Yihe
AU - Cordonnier, Jean-Baptiste
AU - Loukas, Andreas
N1 - Publisher Copyright:
Copyright © 2021 by the author(s)
PY - 2021/1/1
Y1 - 2021/1/1
N2 - Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.
AB - Attention-based architectures have become ubiquitous in machine learning. Yet, our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, or paths, each involving the operation of a sequence of attention heads across layers. Using this path decomposition, we prove that self-attention possesses a strong inductive bias towards “token uniformity”. Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the convergence results on standard transformer architectures.
UR - http://www.scopus.com/inward/record.url?scp=85161307462&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85161307462
T3 - Proceedings of Machine Learning Research
SP - 2793
EP - 2803
BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021
PB - ML Research Press
Y2 - 18 July 2021 through 24 July 2021
ER -