Seminars
Provably learning a multi-head attention layer
Series: Bangalore Theory Seminars
Speaker: Sitan Chen, Harvard University, Cambridge, MA
Date/Time: Feb 15, 18:30
Location: Online Talk (See Teams link below)
Abstract:
Despite the widespread empirical success of transformers, little is known about their learnability from a computational perspective. In practice these models are trained with SGD on a certain next-token prediction objective, but in theory it remains a mystery even to prove that such functions can be learned efficiently at all. In this work, we give the first nontrivial provable algorithms and computational lower bounds for this problem. Our results apply in a realizable setting where one is given random sequence-to-sequence pairs that are generated by some unknown multi-head attention layer. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning multi-layer perceptrons, which predominantly exploit fine-grained algebraic and rotation invariance properties of the input distribution.
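To make the realizable setting concrete, here is a minimal sketch of the data-generating process the abstract describes: random input sequences labeled by a fixed but unknown multi-head attention layer. The dimensions, the Gaussian input distribution, and the sum-of-heads parameterization below are illustrative assumptions for this sketch, not the exact model studied in the talk.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Qs, Ks, Vs):
    """Apply an m-head attention layer to one input sequence X of shape (k, d).

    Qs, Ks, Vs are lists of per-head projection matrices (each d x d here);
    the heads' outputs are summed to form the sequence-to-sequence output.
    """
    out = np.zeros((X.shape[0], Vs[0].shape[1]))
    for Q, K, V in zip(Qs, Ks, Vs):
        scores = (X @ Q) @ (X @ K).T / np.sqrt(Q.shape[1])  # attention scores, shape (k, k)
        out += softmax(scores, axis=-1) @ (X @ V)            # attention-weighted values
    return out

# Realizable setting: draw random input sequences and label them with a
# ground-truth attention layer (random here; unknown to the learner in the talk).
rng = np.random.default_rng(0)
d, k, m, n = 8, 5, 2, 100          # embedding dim, sequence length, heads, samples (illustrative)
Qs = [rng.standard_normal((d, d)) for _ in range(m)]
Ks = [rng.standard_normal((d, d)) for _ in range(m)]
Vs = [rng.standard_normal((d, d)) for _ in range(m)]
data = []
for _ in range(n):
    X = rng.standard_normal((k, d))            # random input sequence
    Y = multi_head_attention(X, Qs, Ks, Vs)    # sequence-to-sequence label
    data.append((X, Y))
```

The learner in the talk receives only pairs like (X, Y) and must recover (or compete with) the unknown projection matrices; the algorithm uses such examples to carve out a convex body containing those parameters.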
Microsoft Teams link:
Link
We are grateful to the Kirani family for generously supporting the theory seminar series.
Hosts: Rameesh Paul, Rahul Madhavan, Rachana Gusain, KVN Sreenivas