Conference on Geometry and Statistics

On February 27–March 1, 2023, the CMSA will host a Conference on Geometry and Statistics.

Location: G10, CMSA, 20 Garden Street, Cambridge MA 02138

This conference will be held in person. Directions and Recommended Lodging

Registration is required.

Register here to attend in-person.

Organizing Committee:
Stephan Huckemann (Georg-August-Universität Göttingen)
Ezra Miller (Duke University)
Zhigang Yao (Harvard CMSA and Committee Chair)

Scientific Advisors:
Horng-Tzer Yau (Harvard CMSA)
Shing-Tung Yau (Harvard CMSA)


Speakers:
  • Tamara Broderick (MIT)
  • David Donoho (Stanford)
  • Ian Dryden (Florida International University in Miami)
  • David Dunson (Duke)
  • Charles Fefferman (Princeton)
  • Stefanie Jegelka (MIT)
  • Sebastian Kurtek (OSU)
  • Lizhen Lin (Notre Dame)
  • Steve Marron (U North Carolina)
  • Ezra Miller (Duke)
  • Hans-Georg Mueller (UC Davis)
  • Nicolai Reshetikhin (UC Berkeley)
  • Wolfgang Polonik (UC Davis)
  • Amit Singer (Princeton)
  • Zhigang Yao (Harvard CMSA)
  • Bin Yu (Berkeley)

Moderator: Michael Simkin (Harvard CMSA)



Monday, Feb. 27, 2023 (Eastern Time)

8:30 am Breakfast
8:45–8:55 am Zhigang Yao Welcome Remarks
8:55–9:00 am Shing-Tung Yau* Remarks
Morning Session Chair: Zhigang Yao
9:00–10:00 am David Donoho Title: ScreeNOT: Exact MSE-Optimal Singular Value Thresholding in Correlated Noise

Abstract: Truncation of the singular value decomposition is a true scientific workhorse. But where to truncate?

For 55 years the answer, for many scientists, has been to eyeball the scree plot, an approach which still generates hundreds of papers per year.

I will describe ScreeNOT, a mathematically solid alternative deriving from the many advances in Random Matrix Theory over those 55 years. Assuming a model of low-rank signal plus possibly correlated noise, and adopting an asymptotic viewpoint with number of rows proportional to the number of columns, we show that ScreeNOT has a surprising oracle property.

It typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery, on each given problem instance – i.e., the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low-rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure.

The talk is based on joint work with Matan Gavish and Elad Romanov, Hebrew University.
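To make the thresholding step concrete, here is a minimal sketch of singular value hard thresholding at a user-supplied threshold. It does not reproduce ScreeNOT's threshold selection rule; the function name `hard_threshold_svd`, the simulated rank-3 signal, the white-noise model, and the value of `tau` are all assumptions chosen for illustration.

```python
import numpy as np

def hard_threshold_svd(Y, tau):
    """Return the truncated SVD of Y keeping singular values above tau, and the kept rank."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    keep = s > tau
    # Scale the kept left singular vectors by their singular values, then recombine.
    return (U[:, keep] * s[keep]) @ Vt[keep, :], int(keep.sum())

rng = np.random.default_rng(0)
n, p, r = 200, 100, 3
signal = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))  # rank-3 signal (simulated)
noisy = signal + rng.normal(size=(n, p))                    # white noise, for simplicity

# Hypothetical threshold: a little above the approximate largest singular value
# of an n-by-p white-noise matrix. This is NOT the ScreeNOT threshold.
tau = 1.2 * (np.sqrt(n) + np.sqrt(p))

X_hat, rank = hard_threshold_svd(noisy, tau)
err_truncated = np.linalg.norm(X_hat - signal)
err_raw = np.linalg.norm(noisy - signal)
print(rank, err_truncated, err_raw)
```

In this toy setting the noise level is known, so a threshold is easy to pick by hand. The question the talk addresses is how to choose it when the noise is unknown and possibly correlated.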

10:00–10:10 am Break
10:10–11:10 am Steve Marron Title: Modes of Variation in Non-Euclidean Spaces

Abstract: Modes of Variation provide an intuitive means of understanding variation in populations, especially in the case of data objects that naturally lie in non-Euclidean spaces. A variety of approaches to finding useful modes of variation are considered in several non-Euclidean contexts, including shapes as data objects, vectors of directional data, amplitude and phase variation, and compositional data.

11:10–11:20 am Break
11:20 am–12:20 pm Zhigang Yao Title: Manifold fitting: an invitation to statistics

Abstract: While classical statistics has dealt with observations which are real numbers or elements of a real vector space, nowadays many statistical problems of high interest in the sciences deal with the analysis of data which consist of more complex objects, taking values in spaces which are naturally not (Euclidean) vector spaces but which still feature some geometric structure. This manifold fitting problem can be traced back to H. Whitney’s work in the early 1930s (Whitney (1992)), and has finally been answered in recent years by C. Fefferman’s works (Fefferman, 2006, 2005). The solution to the Whitney extension problem leads to new insights for data interpolation and inspires the formulation of the Geometric Whitney Problems (Fefferman et al. (2020, 2021a)): Assume that we are given a set $Y \subset \mathbb{R}^D$. When can we construct a smooth $d$-dimensional submanifold $\widehat{M} \subset \mathbb{R}^D$ to approximate $Y$, and how well can $\widehat{M}$ estimate $Y$ in terms of distance and smoothness? To address these problems, various mathematical approaches have been proposed (see Fefferman et al. (2016, 2018, 2021b)). However, many of these methods rely on restrictive assumptions, which makes it challenging to extend them to efficient and workable algorithms. As the manifold hypothesis (non-Euclidean structure exploration) continues to be a foundational element in statistics, the manifold fitting problem merits further exploration and discussion within the modern statistical community. The talk will be partially based on recent work by Yao and Xia (2019), along with some ongoing progress.

12:20–1:50 pm Group Photo (12:20 pm), followed by Lunch

Afternoon Session Chair: Stephan Huckemann
1:50–2:50 pm Bin Yu* Title: Interpreting Deep Neural Networks towards Trustworthiness

Abstract: Recent deep learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This lecture first defines interpretable machine learning in general and introduces the agglomerative contextual decomposition (ACD) method to interpret neural networks. Extending ACD to the scientifically meaningful frequency domain, an adaptive wavelet distillation (AWD) interpretation method is developed. AWD is shown to both outperform deep neural networks and be interpretable in two prediction problems from cosmology and cell biology. Finally, a quality-controlled data science life cycle is advocated for building any model for trustworthy interpretation, and a Predictability Computability Stability (PCS) framework is introduced for such a data science life cycle.

2:50–3:00 pm Break
3:00-4:00 pm Hans-Georg Mueller Title: Exploration of Random Objects with Depth Profiles and Fréchet Regression

Abstract: Random objects, i.e., random variables that take values in a separable metric space, pose many challenges for statistical analysis, as vector operations are not available in general metric spaces. Examples include random variables that take values in the space of distributions, covariance matrices or surfaces, graph Laplacians representing networks, trees, and other spaces. The increasing prevalence of samples of random objects has stimulated the development of metric statistics, an emerging collection of statistical tools to characterize, infer and relate samples of random objects. Recent developments include depth profiles, which are useful for the exploration of random objects. The depth profile for any given object is the distribution of distances to all other objects (with P. Dubey, Y. Chen 2022).

These distributions can then be subjected to statistical analysis. Their mutual transports lead to notions of transport ranks, quantiles and centrality. Another useful tool is global or local Fréchet regression (with A. Petersen 2019) where random objects are responses and scalars or vectors are predictors and one aims at modeling conditional Fréchet means. Recent theoretical advances for local Fréchet regression provide a basis for object time warping (with Y. Chen 2022). These approaches are illustrated with distributional and other data.
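As a rough illustration of the depth profile described in this abstract, the sketch below computes, for each object in a sample, the empirical distribution function of its distances to all other objects. The function name `depth_profiles`, the choice of points on a circle with arc-length distance, and the evaluation grid are assumptions made for the example; transport ranks and Fréchet regression are not covered.

```python
import numpy as np

def depth_profiles(D, grid):
    """Row i holds the empirical CDF, evaluated on grid, of distances from object i to the others."""
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)          # mask that excludes each object's distance to itself
    profiles = np.empty((n, len(grid)))
    for i in range(n):
        d = D[i, off_diag[i]]                  # distances from object i to the other n - 1 objects
        profiles[i] = (d[:, None] <= grid[None, :]).mean(axis=0)
    return profiles

# Toy sample: random angles on the unit circle with geodesic (arc-length) distance.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, size=50)
diff = np.abs(theta[:, None] - theta[None, :])
D = np.minimum(diff, 2 * np.pi - diff)         # pairwise distance matrix

grid = np.linspace(0, np.pi, 25)               # arc-length distances never exceed pi
profiles = depth_profiles(D, grid)
print(profiles.shape)
```

Only the pairwise distance matrix `D` enters the computation, so the same function applies to any metric space for which distances can be evaluated.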

4:00-4:10 pm Break
4:10-5:10 pm Stefanie Jegelka Title: Some benefits of machine learning with invariances

Abstract: In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design. In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning. We will discuss targeted architectures that can universally express functions with the relevant invariances – sign flips and changes of basis – and their theoretical and empirical benefits.

Second, we will take a broader, theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to (almost) any group action, and use tools from differential geometry.

This is joint work with Derek Lim, Joshua Robinson, Behrooz Tahmasebi, Lingxiao Zhao, Tess Smidt, Suvrit Sra, and Haggai Maron.




Tuesday, Feb. 28, 2023 (Eastern Time)

8:30-9:00 am Breakfast
Morning Session Chair: Zhigang Yao
9:00-10:00 am Charles Fefferman* Title: Lipschitz Selection on Metric Spaces

Abstract: The talk concerns the problem of finding a Lipschitz map F from a given metric space X into R^D, subject to the constraint that F(x) must lie in a given compact convex “target” K(x) for each point x in X. Joint work with Pavel Shvartsman and with Bernat Guillen Pegueroles.

10:00-10:10 am Break
10:10-11:10 am David Dunson Title: Inferring manifolds from noisy data using Gaussian processes

Abstract: In analyzing complex datasets, it is often of interest to infer lower dimensional structure underlying the higher dimensional observations. It is common to focus on Riemannian manifolds as a flexible class of nonlinear structures. Most existing manifold learning algorithms replace the original data with lower dimensional coordinates without providing an estimate of the manifold in the observation space or using the manifold to denoise the original data. This article proposes a new methodology for addressing these problems, allowing interpolation of the estimated manifold between fitted data points. The proposed approach is motivated by novel theoretical properties of local covariance matrices constructed from noisy samples on a manifold. Our results enable us to turn a global manifold reconstruction problem into a local regression problem, allowing application of Gaussian processes for probabilistic manifold reconstruction. In addition to theory justifying the algorithm, we provide simulated and real data examples to illustrate the performance. Joint work with Nan Wu.

11:10-11:20 am Break
11:20 am-12:20 pm Wolfgang Polonik Title: Inference in topological data analysis

Abstract: Topological data analysis has seen a huge increase in popularity, finding applications in numerous scientific fields. This motivates the importance of developing a deeper understanding of benefits and limitations of such methods. Using this angle, we will present and discuss some recent results on large sample inference in topological data analysis, including bootstrap for Betti numbers and the Euler characteristic process.

12:20–1:50 pm Lunch
Afternoon Session Chair: Stephan Huckemann
1:50-2:50 pm Ezra Miller Title: Geometric central limit theorems on non-smooth spaces

Abstract: The central limit theorem (CLT) is commonly thought of as occurring on the real line, or in multivariate form on a real vector space. Motivated by statistical applications involving nonlinear data, such as angles or phylogenetic trees, the past twenty years have seen CLTs proved for Fréchet means on manifolds and on certain examples of singular spaces built from flat pieces glued together in combinatorial ways. These CLTs reduce to the linear case by tangent space approximation or by gluing. What should a CLT look like on general non-smooth spaces, where tangent spaces are not linear and no combinatorial gluing or flat pieces are available? Answering this question involves figuring out appropriate classes of spaces and measures, correct analogues of Gaussian random variables, and how the geometry of the space (think “curvature”) is reflected in the limiting distribution. This talk provides an overview of these answers, starting with a review of the usual linear CLT and its generalization to smooth manifolds, viewed through a lens that casts the singular CLT as a natural outgrowth, and concluding with how this investigation opens gateways to further advances in geometric probability, topology, and statistics. Joint work with Jonathan Mattingly and Do Tran.

2:50-3:00 pm Break
3:00-4:00 pm Lizhen Lin Title: Statistical foundations of deep generative models

Abstract: Deep generative models are probabilistic generative models where the generator is parameterized by a deep neural network. They are popular models for modeling high-dimensional data such as text, images and speech, and have achieved impressive empirical success. Despite demonstrated success in empirical performance, theoretical understanding of such models is largely lacking. We investigate statistical properties of deep generative models from a nonparametric distribution estimation viewpoint. In the considered model, data are assumed to be observed in some high-dimensional ambient space but concentrate around some low-dimensional structure such as a lower-dimensional manifold structure. Estimating the distribution supported on this low-dimensional structure is challenging due to its singularity with respect to the Lebesgue measure in the ambient space. We obtain convergence rates with respect to the Wasserstein metric of distribution estimators based on two methods: a sieve MLE based on the perturbed data and a GAN type estimator. Such an analysis provides insights into i) how deep generative models can avoid the curse of dimensionality and outperform classical nonparametric estimates, and ii) how likelihood approaches work for singular distribution estimation, especially in adapting to the intrinsic geometry of the data.

4:00-4:10 pm Break
4:10-5:10 pm Conversation session




Wednesday, March 1, 2023 (Eastern Time)

8:30-9:00 am Breakfast
Morning Session Chair: Ezra Miller
9:00-10:00 am Amit Singer* Title: Heterogeneity analysis in cryo-EM by covariance estimation and manifold learning

Abstract: In cryo-EM, the 3-D molecular structure needs to be determined from many noisy 2-D tomographic projection images of randomly oriented and positioned molecules. A key assumption in classical reconstruction procedures for cryo-EM is that the sample consists of identical molecules. However, many molecules of interest exist in more than one conformational state. These structural variations are of great interest to biologists, as they provide insight into the functioning of the molecule. Determining the structural variability from a set of cryo-EM images is known as the heterogeneity problem, widely recognized as one of the most challenging and important computational problems in the field. Due to the high level of noise in cryo-EM images, heterogeneity studies typically involve hundreds of thousands of images, sometimes even a few million. Covariance estimation is one of the earliest methods proposed for heterogeneity analysis in cryo-EM. It relies on computing the covariance of the conformations directly from projection images and extracting the optimal linear subspace of conformations through an eigendecomposition. Unfortunately, the standard formulation is plagued by the exorbitant cost of computing the N^3 x N^3 covariance matrix. In the first part of the talk, we present a new low-rank estimation method that requires computing only a small subset of the columns of the covariance while still providing an approximation for the entire matrix. This scheme allows us to estimate tens of principal components of real datasets in a few minutes at medium resolutions and under 30 minutes at high resolutions. In the second part of the talk, we discuss a manifold learning approach based on the graph Laplacian and the diffusion maps framework for learning the manifold of conformations. If time permits, we will also discuss the potential application of optimal transportation to heterogeneity analysis.
Based on joint works with Joakim Andén, Marc Gilles, Amit Halevi, Eugene Katsevich, Joe Kileel, Amit Moscovich, and Nathan Zelesko.

10:00-10:10 am Break
10:10-11:10 am Ian Dryden Title: Statistical shape analysis of molecule data

Abstract: Molecular shape data arise in many applications, for example high dimension low sample size cryo-electron microscopy (cryo-EM) data and large temporal sequences of peptides from molecular dynamics simulations. In both applications it is of interest to summarize the shape evolution of the molecules in a succinct, low-dimensional representation. However, Euclidean techniques such as principal components analysis (PCA) can be problematic as the data may lie far from a flat manifold. Principal nested spheres gives a fundamentally different decomposition of data from the usual Euclidean subspace based PCA. Subspaces of successively lower dimension are fitted to the data in a backwards manner with the aim of retaining signal and dispensing with noise at each stage. We adapt the methodology to 3D sub-shape spaces and provide some practical fitting algorithms. The methodology is applied to cryo-EM data of a large sliding clamp multi-protein complex and to cluster analysis of peptides, where different states of the molecules can be identified. Further molecular modeling tasks include resolution matching, where coarse resolution models are back-mapped into high resolution (atomistic) structures. This is joint work with Kwang-Rae Kim, Charles Laughton and Huiling Le.

11:10-11:20 am Break
11:20 am-12:20 pm Tamara Broderick Title: An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?

Abstract: One hopes that data analyses will be used to make beneficial decisions regarding people’s health, finances, and well-being. But the data fed to an analysis may systematically differ from the data where these decisions are ultimately applied. For instance, suppose we analyze data in one country and conclude that microcredit is effective at alleviating poverty; based on this analysis, we decide to distribute microcredit in other locations and in future years. We might then ask: can we trust our conclusion to apply under new conditions? If we found that a very small percentage of the original data was instrumental in determining the original conclusion, we might not be confident in the stability of the conclusion under new conditions. So we propose a method to assess the sensitivity of data analyses to the removal of a very small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide an approximation. We call our resulting method the Approximate Maximum Influence Perturbation. Our approximation is automatically computable, theoretically supported, and works for common estimators. We show that any non-robustness our method finds is conclusive. Empirics demonstrate that while some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 0.1% of the data — even in simple models and even when standard errors are small.
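The idea of approximating the effect of dropping a small data subset can be sketched for the simplest estimator, the sample mean. This is a toy illustration under stated assumptions, not the authors' implementation: the function name `approx_max_decrease`, the simulated data, and the choice of `k` are hypothetical, and the approximation used is a first-order expansion of the mean in per-observation weights.

```python
import numpy as np

def approx_max_decrease(x, k):
    """First-order estimate of the largest decrease in the mean from dropping k points."""
    n = len(x)
    # Linearized change in the mean when the weight of point i goes from 1 to 0.
    influence = -(x - x.mean()) / n
    drop = np.argsort(influence)[:k]           # the k points with the most negative influence
    return drop, influence[drop].sum()

rng = np.random.default_rng(2)
x = rng.normal(loc=0.05, scale=1.0, size=1000)  # simulated effects with a small positive mean
k = 10                                          # hypothetical budget: 1% of the data

drop, approx_change = approx_max_decrease(x, k)

# Compare with the exact change obtained by recomputing the mean without those points.
exact_change = np.delete(x, drop).mean() - x.mean()
print(x.mean(), approx_change, exact_change)
```

For the mean, the exact change equals the approximation multiplied by n / (n - k), so the two agree closely when k is small relative to n. For estimators that are expensive to refit, this kind of linearization avoids re-running the analysis on every candidate subset.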

12:20–1:50 pm Lunch
Afternoon Session Chair: Ezra Miller
1:50-2:50 pm Nicolai Reshetikhin* Title: Random surfaces in exactly solvable models in statistical mechanics.

Abstract: In the first part of the talk I will give an overview of a few models in statistical mechanics where a random variable is a geometric object such as a random surface or a random curve. The second part will be focused on the behavior of such random surfaces in the thermodynamic limit and on the formation of the so-called “limit shapes”.

2:50-3:00 pm Break
3:00-4:00 pm Sebastian Kurtek Title: Robust Persistent Homology Using Elastic Functional Data Analysis

Abstract: Persistence landscapes are functional summaries of persistence diagrams designed to enable analysis of the diagrams using tools from functional data analysis. They comprise a collection of scalar functions such that birth and death times of topological features in persistence diagrams map to extrema of functions and intervals where they are non-zero. As a consequence, variation in persistence diagrams is encoded in both amplitude and phase components of persistence landscapes. Through functional data analysis of persistence landscapes, under an elastic Riemannian metric, we show how meaningful statistical summaries of persistence landscapes (e.g., mean, dominant directions of variation) can be obtained by decoupling their amplitude and phase variations. This decoupling is achieved via optimal alignment, with respect to the elastic metric, of the persistence landscapes. The estimated phase functions are tied to the resolution parameter that determines the filtration of simplicial complexes used to construct persistence diagrams. For a dataset obtained under geometric, scale and sampling variabilities, the phase function prescribes an optimal rate of increase of the resolution parameter for enhancing the topological signal in a persistence diagram. The proposed approach adds to the statistical analysis of data objects with rich structure compared to past studies. In particular, we focus on two sets of data that have been analyzed in the past, brain artery trees and images of prostate cancer cells, and show that separation of amplitude and phase of persistence landscapes is beneficial in both settings. This is joint work with Dr. James Matuk (Duke University) and Dr. Karthik Bharath (University of Nottingham).

4:00-4:10 pm Break
4:10-5:10 pm Conversation session
5:10-5:20 pm Stephan Huckemann, Ezra Miller, Zhigang Yao Closing Remarks

* Virtual Presentation