BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CMSA - ECPv6.15.18//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:CMSA
X-ORIGINAL-URL:https://cmsa.fas.harvard.edu
X-WR-CALDESC:Events for CMSA
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20240310T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20241103T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20250309T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20251102T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20260308T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20261101T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20251201T163000
DTEND;TZID=America/New_York:20251201T173000
DTSTAMP:20260421T214045Z
CREATED:20251007T152747Z
LAST-MODIFIED:20251201T144411Z
UID:10003807-1764606600-1764610200@cmsa.fas.harvard.edu
SUMMARY:Asymptotic Theory of Attention: In-Context Learning and Sparse Token Detection
DESCRIPTION:Colloquium \nSpeaker: Yue M. Lu\, Harvard University \nTitle: Asymptotic Theory of Attention: In-Context Learning and Sparse Token Detection \nAbstract: Attention-based architectures exhibit striking emergent abilities—from learning tasks directly from context to detecting rare\, weak features in long sequences—yet a rigorous theory explaining these behaviors remains limited. In this talk\, I will present two recent exactly solvable models that develop a high-dimensional asymptotic theory of attention. \n(i) In-context learning. For linear attention pretrained on linear regression tasks\, we derive sharp asymptotics in a regime where token dimension\, context length\, and task diversity all scale proportionally\, while the number of pretraining examples scales quadratically. The resulting learning curve exhibits double descent and a phase transition separating a low-diversity memorization regime from a high-diversity regime of genuine in-context generalization. These predictions closely track empirical behavior in both linear-attention models and nonlinear Transformer architectures. \n(ii) Sparse-token classification. For detecting weak signals embedded in a small\, randomly located subset of tokens\, we analyze a single-layer attention classifier and determine its representational and learnability thresholds. Attention succeeds with only logarithmic signal scaling in the sequence length L\, outperforming linear baselines that require √L scaling. In a proportional high-dimensional regime\, we prove that two gradient descent steps yield nontrivial alignment between the query vector and the hidden signal\, leading to signal-adaptive attention. Exact formulas for the test error\, training loss\, and separability capacity quantify this advantage.
URL:https://cmsa.fas.harvard.edu/event/colloquium-12125/
LOCATION:CMSA Room G10\, CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Colloquium
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/CMSA-Colloquium-12.1.2025-scaled.png
END:VEVENT
END:VCALENDAR