
Mathematical foundations of AI

October 6, 2025 @ 9:00 am - October 10, 2025 @ 5:00 pm


Location: Harvard CMSA, Room G10, 20 Garden Street, Cambridge MA & via Zoom

Artificial intelligence (AI) has achieved unprecedented advances, yet our theoretical understanding lags far behind. This gap poses a significant obstacle to improving AI's safety and reliability. Since the classical tools of learning theory have proven insufficient for understanding AI, researchers are drawing insights from a wide array of fields, including functional analysis, probability theory, optimal transport, optimization, PDEs, information theory, geometry, statistics, electrical engineering, and ergodic theory. These interdisciplinary efforts are gradually shedding light on the underlying principles governing modern AI. This workshop centers on these mathematical and interdisciplinary developments. It will feature talks by researchers across the contributing subfields, and open-problem and small-group sessions will help foster new connections and new research avenues.

 

Registration required

In-person registration (This event is at capacity)

Zoom Webinar Registration

 

Speakers

  • Jason Altschuler, University of Pennsylvania
  • Guy Bresler, MIT
  • Sinho Chewi, Yale University
  • Lénaïc Chizat, EPFL
  • Nabarun Deb, University of Chicago
  • Edgar Dobriban, University of Pennsylvania
  • Ahmed El Alaoui, Cornell University
  • Zhou Fan, Yale University
  • Boris Hanin, Princeton University
  • Jason Klusowski, Princeton University
  • Tengyu Ma, Stanford University
  • Alexander Rakhlin, MIT
  • Yuting Wei, University of Pennsylvania
  • Tijana Zrnic, Stanford University

Organizer: Morgane Austern, Harvard Statistics

 

Schedule

Monday, Oct. 6, 2025

8:30–9:00 am Morning refreshments
9:00–10:00 am Yuting Wei, U Penn

To Intrinsic Dimension and Beyond: Efficient Sampling in Diffusion Models

The denoising diffusion probabilistic model (DDPM) has become a cornerstone of generative AI. While sharp convergence guarantees have been established for DDPM, the iteration complexity typically scales with the ambient data dimension of target distributions, leading to overly conservative theory that fails to explain its practical efficiency. This has sparked recent efforts to understand how DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. This talk explores two key scenarios: (1) For a broad class of data distributions with intrinsic dimension k, we prove that the iteration complexity of the DDPM scales nearly linearly with k, which is optimal under the KL divergence metric; (2) For mixtures of Gaussian distributions with k components, we show that DDPM learns the distribution with iteration complexity that grows only logarithmically in k. These results provide theoretical justification for the practical efficiency of diffusion models.
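As a rough illustration of the DDPM sampling loop discussed in the abstract, here is a toy ancestral sampler for a target whose noised score is known in closed form (a standard Gaussian, so the score is simply -x). The constant noise schedule and step count are illustrative assumptions; in practice the score is a learned network, and results like those above concern how the number of steps must scale with dimension.

```python
import numpy as np

def ddpm_sample(n_samples, n_steps=100, d=2, seed=0):
    """Toy DDPM ancestral sampler for an N(0, I) target, whose noised
    score is available in closed form: grad log p_t(x) = -x.
    In practice the score is a learned neural network."""
    rng = np.random.default_rng(seed)
    beta = 0.02                              # constant noise schedule (an assumption)
    x = rng.standard_normal((n_samples, d))  # initialize from pure noise
    for t in reversed(range(n_steps)):
        score = -x                           # exact score for this toy target
        x = (x + beta * score) / np.sqrt(1.0 - beta)
        if t > 0:                            # no noise injected on the final step
            x += np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

samples = ddpm_sample(20000)                 # empirically: mean ~ 0, variance ~ 1
```

With the exact score, each reverse step preserves the N(0, I) marginal, so the chain returns correct samples regardless of step count; the interesting theory begins once the score is estimated.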

10:00–10:30 am Break
10:30–11:30 am Jason Klusowski, Princeton

The Value of Side Information in Unlabeled Data

Practitioners often work in settings with limited labeled data and abundant unlabeled data. During training, they may even have access to extra side information (some labeled, some not) that won’t be available once the model is deployed. When can this side information actually improve performance? I’ll present a simple framework where a rich-view model that sees the extra features generates pseudo-labels on the large unlabeled data, and a deployment model that only sees the standard features is trained on both real and pseudo-labels. The two are trained iteratively: each deployment model update calibrates the next round of pseudo-labels, and those refined pseudo-labels in turn guide the deployment model. Our theory shows that side information helps precisely when the rich-view and deployment models make different kinds of errors. We formalize this with a decorrelation score that quantifies how independent those errors are; the more independent, the greater the performance gains.
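A minimal sketch of the two-model loop described above, using plain least squares; the data sizes, linear models, and blending rule for recalibrating pseudo-labels are all illustrative assumptions, not the talk's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (sizes and features are illustrative assumptions):
# x = standard features, s = side features available only at training time.
n_lab, n_unlab, d = 50, 500, 5
w_true, v_true = rng.normal(size=d), rng.normal(size=d)
X_lab, S_lab = rng.normal(size=(n_lab, d)), rng.normal(size=(n_lab, d))
X_un,  S_un  = rng.normal(size=(n_unlab, d)), rng.normal(size=(n_unlab, d))
y_lab = X_lab @ w_true + S_lab @ v_true + 0.1 * rng.normal(size=n_lab)

def fit(A, y):                         # ordinary least squares
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Rich-view model: trained on [x, s], used only to pseudo-label unlabeled data.
theta_rich = fit(np.hstack([X_lab, S_lab]), y_lab)
pseudo = np.hstack([X_un, S_un]) @ theta_rich

for _ in range(3):                     # iterative refinement
    # Deployment model sees only the standard features x.
    X_all = np.vstack([X_lab, X_un])
    y_all = np.concatenate([y_lab, pseudo])
    w_dep = fit(X_all, y_all)
    # Recalibrate pseudo-labels by blending rich-view and deployment
    # predictions (a simple averaging heuristic, not the paper's rule).
    pseudo = 0.5 * pseudo + 0.5 * (X_un @ w_dep)
```

The decorrelation intuition shows up here: if the rich-view and deployment models erred in the same way, the blended pseudo-labels would add nothing new.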

11:30 am–12:00 pm Break
12:00–1:00 pm Guy Bresler, MIT

Global Minimizers of Sigmoid Contrastive Loss

The meta-task of obtaining and aligning representations through contrastive pre-training is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call (m,b)-Constellations. (m,b)-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin m and relative bias b. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. We also propose a reparameterization of the sigmoid loss with explicit relative bias, which appears to improve training dynamics. Joint work with Kiril Bangachev, Iliyas Noman, and Yury Polyanskiy.

 

Tuesday, Oct. 7, 2025

8:30–9:00 am Morning refreshments
9:00–10:00 am Lénaïc Chizat, EPFL

The Hidden Width of Deep ResNets

We present a mathematical framework to analyze the training dynamics of deep ResNets that rigorously captures practical architectures (including Transformers) trained from standard random initializations. Our approach combines stochastic approximation of ODEs with propagation-of-chaos arguments. It yields three main insights:
– Depth begets width: infinite-depth ResNets of any hidden width behave throughout training as if they were infinitely wide;
– Unified phase diagram: the phase diagram of Transformers mirrors that of two-layer perceptrons, once the appropriate substitutions are made;
– Optimal shape scaling: for a given parameter budget P, a Transformer with optimal shape converges to its limiting dynamics at rate P^{-1/6}.
This talk is based on https://arxiv.org/abs/2509.10167.

10:00–10:30 am Break

 

10:30–11:30 am Boris Hanin, Princeton

Kernel Learning on Manifolds

This talk concerns the L_2 risk of minimum norm interpolation with n samples in the RKHS of a kernel K. Unlike most prior work in this space, our kernels are defined on any closed d-dimensional Riemannian manifold, and we require only that the kernels are trace class and elliptic. With these assumptions we obtain nearly sharp L_2 risk bounds with high probability over the data. Like prior work on round spheres, our results essentially say that the number of samples n, the dimension of the manifold, and some details of the kernel determine a natural spectral cutoff \lambda(n,d,K), and that minimum norm interpolation essentially learns exactly the projection of the data-generating process onto the eigenfunctions of the Laplacian with frequency at most \lambda(n,d,K). Joint work with Mengxuan Yang.
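For concreteness, minimum norm interpolation in an RKHS has the closed form f(x) = k(x, X) K^{-1} y. A small sketch on a toy closed manifold (the unit circle), with a Gaussian RBF kernel standing in for the general trace-class elliptic kernels of the talk; the bandwidth, jitter, and target function are illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian RBF kernel matrix (a stand-in for a generic trace-class kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

def min_norm_interpolator(X, y, bandwidth=1.0, jitter=1e-10):
    """Minimum-RKHS-norm interpolant: f(x) = k(x, X) @ K^{-1} y.
    A tiny jitter keeps the Gram matrix numerically invertible."""
    K = rbf_kernel(X, X, bandwidth) + jitter * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return lambda Z: rbf_kernel(Z, X, bandwidth) @ alpha

# Data on the unit circle, a toy closed 1-dimensional manifold.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=40)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)
y = np.sin(3 * theta)                # a smooth, low-frequency target
f = min_norm_interpolator(X, y)
```

Here the target lives on the first few Laplacian eigenfunctions of the circle, which is exactly the regime where the spectral-cutoff picture predicts the interpolant recovers it.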

11:30 am–12:00 pm Break
12:00–1:00 pm Zhou Fan, Yale

Dynamical mean-field analysis of adaptive Langevin diffusions

In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin trajectory via a maximum marginal-likelihood scheme. Using techniques of dynamical mean-field theory (DMFT), we provide a precise characterization of a high-dimensional asymptotic limit for the joint evolution of the prior parameter and law of the Langevin sample. We then carry out an analysis of the equations that describe this DMFT limit, under conditions of approximate time-translation-invariance which include, in particular, settings where the posterior law satisfies a log-Sobolev inequality. In such settings, we show that this adaptive Langevin trajectory converges on a dimension-independent time horizon to an equilibrium state that is characterized by a system of replica-symmetric fixed-point equations, and the associated prior parameter converges to a critical point of a replica-symmetric limit for the model free energy. We explore the nature of the free energy landscape and its critical points in a few simple examples, where such critical points may or may not be unique.

 

Wednesday, Oct. 8, 2025

8:30–9:00 am Morning refreshments
9:00–10:00 am Jason Altschuler, U Penn

Negative Stepsizes Make Gradient-Descent-Ascent Converge

Solving min-max problems is a central question in optimization, games, learning, and controls. Arguably the most natural algorithm is Gradient-Descent-Ascent (GDA); however, since the 1970s, conventional wisdom has held that it fails to converge even on simple problems. This failure spurred the extensive literature on modifying GDA with extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes. The key innovation is the proposal of unconventional stepsize schedules that are time-varying, asymmetric, and (most surprisingly) periodically negative. We show that all three properties are necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). The core intuition is that although negative stepsizes make backward progress, they de-synchronize the min/max variables (overcoming the cycling issue of GDA) and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps cancel to first order, yielding a second-order net movement in a new direction that leads to convergence and is otherwise impossible for GDA to move in. Joint work with Henry Shugart.
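To see the classical failure mode these schedules overcome, consider simultaneous GDA on the bilinear problem f(x, y) = xy: with any constant positive stepsize the iterates spiral away from the saddle at the origin. The convergent negative-stepsize schedules themselves are in the paper and not reproduced here; this sketch only demonstrates the divergence.

```python
import numpy as np

def gda(x, y, h, n_steps):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    for _ in range(n_steps):
        gx, gy = y, x               # grad_x f = y, grad_y f = x
        x, y = x - h * gx, y + h * gy
    return x, y

# Each step multiplies the distance to the saddle (0, 0) by sqrt(1 + h^2):
# ||(x', y')||^2 = (x - h*y)^2 + (y + h*x)^2 = (1 + h^2)(x^2 + y^2).
x0, y0 = 1.0, 1.0
x, y = gda(x0, y0, h=0.1, n_steps=200)
r0, r = np.hypot(x0, y0), np.hypot(x, y)
# r == r0 * (1.01)**100, roughly 2.7x further from the saddle.
```

The expansion factor is exact, which makes the cycling/divergence of vanilla GDA easy to verify numerically.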

10:00–10:30 am Break
10:30–11:30 am Nabarun Deb, U Chicago

Generative Modeling via Parabolic Monge-Ampère PDEs

We introduce a novel generative modeling framework based on a discretized parabolic Monge-Ampère PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror gradient descent step. We establish theoretical guarantees for generative modeling through the lens of no-regret analysis, demonstrating that the iterates converge to the optimal Brenier map under a variety of step-size schedules. As a technical contribution, we derive a new Evolution Variational Inequality tailored to the parabolic Monge-Ampère PDE, connecting geometry, transportation cost, and regret. Our framework accommodates non-log-concave target distributions, constructs an optimal sampling process via the Brenier map, and integrates favorable learning techniques from generative adversarial networks and score-based diffusion models.
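The Sinkhorn algorithm whose continuous limit the abstract refers to alternates simple marginal rescalings of a Gibbs kernel. A compact sketch (problem size, cost, and regularization strength are arbitrary illustrative choices):

```python
import numpy as np

def sinkhorn(mu, nu, C, eps=0.1, n_iters=1000):
    """Entropy-regularized optimal transport via Sinkhorn iterations:
    alternately rescale rows and columns of the Gibbs kernel
    K = exp(-C/eps) so the coupling's marginals match mu and nu."""
    K = np.exp(-C / eps)
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]   # the transport plan

rng = np.random.default_rng(0)
n = 30
x, y = np.sort(rng.uniform(size=n)), np.sort(rng.uniform(size=n))
C = (x[:, None] - y[None, :]) ** 2       # squared-distance cost
mu = nu = np.full(n, 1.0 / n)
P = sinkhorn(mu, nu, C)
```

As eps shrinks and the iteration count grows, these discrete rescalings approach the continuous-time dynamics that the parabolic Monge-Ampère PDE describes.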

11:30 am–12:00 pm Break
12:00–1:00 pm Sinho Chewi, Yale

Discretization and distribution learning in diffusion models

First, I will review some literature on discretization of diffusion models, focusing on the use of randomized midpoints for deterministic vs. stochastic samplers. Then, I will argue that such sampling guarantees reduce distribution learning, in the form of learning to generate a sample, to score matching. To complement this result, we reduce other forms of distribution learning (parameter estimation and density estimation) to score matching as well. This leads to new consequences for diffusion models, such as asymptotic efficiency of a DDPM-based parameter estimator and algorithms for Gaussian mixture density estimation, as well as to a general approach for establishing cryptographic hardness results for score estimation.

 

Thursday, Oct. 9, 2025

8:30–9:00 am Morning refreshments
9:00–10:00 am Ahmed El Alaoui, Cornell

How abundant are good interpolators?

We consider classifying labelled data in the interpolation regime, where there exist linear classifiers (with possibly negative margin) correctly classifying all points in the dataset. Under the logistic model with Gaussian features, we derive the large deviation rate function of the event that an interpolator chosen uniformly at random achieves a given generalization error. This describes the proportion of interpolators having any desired performance. We remark that in a wide regime of parameters, the vast majority of interpolators perform worse than the one found via a simple linear programming procedure, showing that the latter algorithm produces an atypically good classifier.
This is based on joint work with August Chen.
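The "simple linear programming procedure" can be illustrated by a feasibility LP: any w satisfying y_i <x_i, w> >= 1 for all i interpolates the data with positive margin. The data-generating setup below (a noiseless linear teacher) is purely illustrative; the talk's analysis concerns the logistic model.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 50, 100                       # overparameterized: d > n, so
w_star = rng.normal(size=d)          # interpolators exist almost surely
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)              # labels from a hypothetical teacher

# Encode y_i <x_i, w> >= 1 as -y_i <x_i, w> <= -1 and solve the
# feasibility LP (zero objective, free variables).
A_ub = -(y[:, None] * X)
b_ub = -np.ones(n)
res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d)
w = res.x                            # one interpolating classifier
```

The LP returns a single, algorithmically chosen interpolator; the talk's result is that this one is atypically good compared to an interpolator drawn uniformly at random.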

10:00–10:30 am Break
10:30–11:30 am Tengyu Ma, Stanford

Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

I will discuss some works on using RL for theorem proving, especially in the possible future regime where we run out of high-quality training data. To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), ProofNet-test (23.9%, pass@3200), and PutnamBench (8/644, pass@3200).

 

11:30 am–12:00 pm Break
12:00–1:00 pm Edgar Dobriban, U Penn

Leveraging synthetic data in statistical inference

The rapid proliferation of high-quality synthetic data — generated by advanced AI models or collected as auxiliary data from related tasks — presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction, and comparing large reasoning models on complex math problems.

 

Friday, Oct. 10, 2025

8:30–9:00 am Morning refreshments
9:00–10:00 am Tijana Zrnic, Stanford

Probably Approximately Correct Labels

Obtaining high-quality labeled datasets is often costly, requiring either extensive human annotation or expensive experiments. We propose a method that supplements such “expert” labels with AI predictions from pre-trained models to construct labeled datasets more cost-effectively. Our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. This solution enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold. This is joint work with Emmanuel Candès and Andrew Ilyas.

10:00–10:30 am Break
10:30–11:30 am Alexander Rakhlin, MIT

Elements of Interactive Decision Making

Machine learning methods are increasingly deployed in interactive environments, ranging from dynamic treatment strategies in medicine to fine-tuning of LLMs using reinforcement learning. In these settings, the learning agent interacts with the environment to collect data and necessarily faces an exploration-exploitation dilemma. We present a general framework for interactive decision making that subsumes multi-armed bandits, contextual bandits, structured bandits, and reinforcement learning. We focus on both the statistical aspect of learning—aiming to develop a tight characterization of sample complexity in terms of properties of the class of models—and on the basic algorithmic primitives.
