
Provably learning a multi-head attention layer

Series: Bangalore Theory Seminars

Speaker: Sitan Chen, Harvard University, Cambridge, MA

Date/Time: Feb 15, 18:30

Location: Online Talk (See Teams link below)

Abstract:
Despite the widespread empirical success of transformers, little is known about their learnability from a computational perspective. In practice these models are trained with SGD on a certain next-token prediction objective, but in theory it remains a mystery even to prove that such functions can be learned efficiently at all. In this work, we give the first nontrivial provable algorithms and computational lower bounds for this problem. Our results apply in a realizable setting where one is given random sequence-to-sequence pairs that are generated by some unknown multi-head attention layer. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning multi-layer perceptrons, which predominantly exploit fine-grained algebraic and rotation invariance properties of the input distribution.
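
For readers unfamiliar with the setting, the following is a minimal, illustrative sketch (in Python with NumPy) of the realizable data-generating process the abstract refers to: labeled sequence-to-sequence pairs produced by a fixed but unknown multi-head attention layer. The dimensions, the Gaussian input distribution, and the exact parameterization (e.g. no output projection) are assumptions made for illustration, not the precise model studied in the talk.

import numpy as np

def softmax(Z, axis=-1):
    # Numerically stable softmax along the given axis.
    Z = Z - Z.max(axis=axis, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d); Wq, Wk, Wv: (heads, d, d_head).
    # Each head computes softmax(X Wq (X Wk)^T / sqrt(d_head)) X Wv;
    # head outputs are concatenated along the feature dimension.
    outputs = []
    for q, k, v in zip(Wq, Wk, Wv):
        scores = (X @ q) @ (X @ k).T / np.sqrt(q.shape[1])
        outputs.append(softmax(scores) @ (X @ v))
    return np.concatenate(outputs, axis=-1)  # (seq_len, heads * d_head)

rng = np.random.default_rng(0)
heads, d, d_head, seq_len, n_samples = 2, 8, 4, 5, 100

# "Teacher" parameters: fixed but unknown to the learner.
Wq, Wk, Wv = (rng.standard_normal((heads, d, d_head)) for _ in range(3))

# The learner observes only the labeled pairs (X_i, Y_i), not the parameters;
# the goal is to recover a layer that matches these input-output examples.
samples = []
for _ in range(n_samples):
    X = rng.standard_normal((seq_len, d))
    samples.append((X, multi_head_attention(X, Wq, Wk, Wv)))

print(samples[0][1].shape)  # (5, 8)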


Microsoft Teams link:

Link


We are grateful to the Kirani family for generously supporting the theory seminar series.

Hosts: Rameesh Paul, Rahul Madhavan, Rachana Gusain, KVN Sreenivas