BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CMSA - ECPv6.15.20//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:CMSA
X-ORIGINAL-URL:https://cmsa.fas.harvard.edu
X-WR-CALDESC:Events for CMSA
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20250309T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20251102T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20260308T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20261101T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20270314T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20271107T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20260225T140000
DTEND;TZID=America/New_York:20260225T150000
DTSTAMP:20260506T232324Z
CREATED:20260210T192336Z
LAST-MODIFIED:20260210T194238Z
UID:10003894-1772028000-1772031600@cmsa.fas.harvard.edu
SUMMARY:Scaling Stochastic Momentum from Theory to LLMs
DESCRIPTION:New Technologies in Mathematics Seminar \nSpeaker: Courtney Paquette\, McGill University \nTitle: Scaling Stochastic Momentum from Theory to LLMs \nAbstract: Given the massive scale of modern ML models\, we now often get only a single shot to train them effectively. This limits our ability to sweep architectures and hyperparameters\, making it essential to understand how learning algorithms scale so insights from small models transfer to large ones. \nIn this talk\, I present a framework for analyzing scaling laws of stochastic momentum methods using a power-law random features model\, leveraging tools from high-dimensional probability and random matrix theory. We show that standard SGD with momentum does not improve scaling exponents\, while dimension-adapted Nesterov acceleration (DANA)—which explicitly adapts momentum to model size and data/target complexity—achieves strictly better loss and compute scaling. DANA does this by rescaling its momentum parameters with dimension\, effectively matching the optimizer’s memory to the problem geometry. \nMotivated by these theoretical insights\, I introduce logarithmic-time scheduling for large language models and propose ADANA\, an AdamW-like optimizer with growing memory and explicit damping. Across transformer scales (45M to 2.6B parameters)\, ADANA yields up to 40% compute savings over tuned AdamW\, with gains that improve at scale. \nBased on joint work with Damien Ferbach\, Elliot Paquette\, Katie Everett\, and Gauthier Gidel.
URL:https://cmsa.fas.harvard.edu/event/newtech_22526/
LOCATION:CMSA Room G10\, CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:New Technologies in Mathematics Seminar
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/CMSA-NTM-Seminar-2.25.2026.docx-scaled.png
END:VEVENT
END:VCALENDAR