Conference on Geometry and Statistics
February 27, 2023 @ 9:00 am – March 1, 2023 @ 5:30 pm
On Feb 27–March 1, 2023, the CMSA will host a Conference on Geometry and Statistics.
Location: G10, CMSA, 20 Garden Street, Cambridge MA 02138
This conference will be held in person. Directions and Recommended Lodging
Registration is required.
Register here to attend in person.
Organizing Committee:
Stephan Huckemann (Georg-August-Universität Göttingen)
Ezra Miller (Duke University)
Zhigang Yao (Harvard CMSA and Committee Chair)
Scientific Advisors:
Horng-Tzer Yau (Harvard CMSA)
Shing-Tung Yau (Harvard CMSA)
Speakers:
 Tamara Broderick (MIT)
 David Donoho (Stanford)
 Ian Dryden (Florida International University in Miami)
 David Dunson (Duke)
 Charles Fefferman (Princeton)
 Stefanie Jegelka (MIT)
 Sebastian Kurtek (OSU)
 Lizhen Lin (Notre Dame)
 Steve Marron (U North Carolina)
 Ezra Miller (Duke)
 Hans-Georg Mueller (UC Davis)
 Nicolai Reshetikhin (UC Berkeley)
 Wolfgang Polonik (UC Davis)
 Amit Singer (Princeton)
 Zhigang Yao (Harvard CMSA)
 Bin Yu (Berkeley)
Moderator: Michael Simkin (Harvard CMSA)
SCHEDULE
Monday, Feb. 27, 2023 (Eastern Time)
8:30 am  Breakfast  
8:45–8:55 am  Zhigang Yao  Welcome Remarks 
8:55–9:00 am  Shing-Tung Yau*  Remarks 
Morning Session Chair: Zhigang Yao  
9:00–10:00 am  David Donoho  Title: ScreeNOT: Exact MSE-Optimal Singular Value Thresholding in Correlated Noise
Abstract: Truncation of the singular value decomposition is a true scientific workhorse. But where to truncate? For 55 years the answer, for many scientists, has been to eyeball the scree plot, an approach which still generates hundreds of papers per year. I will describe ScreeNOT, a mathematically solid alternative deriving from the many advances in Random Matrix Theory over those 55 years. Assuming a model of low-rank signal plus possibly correlated noise, and adopting an asymptotic viewpoint with number of rows proportional to the number of columns, we show that ScreeNOT has a surprising oracle property. It typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery, on each given problem instance – i.e. the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low-rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure. The talk is based on joint work with Matan Gavish and Elad Romanov, Hebrew University. 
10:00–10:10 am  Break  
10:10–11:10 am  Steve Marron  Title: Modes of Variation in Non-Euclidean Spaces
Abstract: Modes of Variation provide an intuitive means of understanding variation in populations, especially in the case of data objects that naturally lie in non-Euclidean spaces. A variety of approaches to finding useful modes of variation are considered in several non-Euclidean contexts, including shapes as data objects, vectors of directional data, amplitude and phase variation, and compositional data. 
11:10–11:20 am  Break  
11:20 am–12:20 pm  Zhigang Yao  Title: Manifold fitting: an invitation to statistics
Abstract: While classical statistics has dealt with observations which are real numbers or elements of a real vector space, nowadays many statistical problems of high interest in the sciences deal with the analysis of data which consist of more complex objects, taking values in spaces which are naturally not (Euclidean) vector spaces but which still feature some geometric structure. This manifold fitting problem goes back to H. Whitney’s work in the early 1930s (Whitney (1992)), and has finally been answered in recent years by C. Fefferman’s works (Fefferman, 2006, 2005). The solution to the Whitney extension problem leads to new insights for data interpolation and inspires the formulation of the Geometric Whitney Problems (Fefferman et al. (2020, 2021a)): Assume that we are given a set $Y \subset \mathbb{R}^D$. When can we construct a smooth $d$-dimensional submanifold $\widehat{M} \subset \mathbb{R}^D$ to approximate $Y$, and how well can $\widehat{M}$ estimate $Y$ in terms of distance and smoothness? To address these problems, various mathematical approaches have been proposed (see Fefferman et al. (2016, 2018, 2021b)). However, many of these methods rely on restrictive assumptions, making extending them to efficient and workable algorithms challenging. As the manifold hypothesis (non-Euclidean structure exploration) continues to be a foundational element in statistics, the manifold fitting problem merits further exploration and discussion within the modern statistical community. The talk will be partially based on a recent work, Yao and Xia (2019), along with some ongoing progress. Relevant reference: https://arxiv.org/abs/1909.10228 
12:20–1:50 pm  Group Photo (12:20 pm), followed by Lunch  

Afternoon Session Chair: Stephan Huckemann  
1:50–2:50 pm  Bin Yu*  Title: Interpreting Deep Neural Networks towards Trustworthiness
Abstract: Recent deep learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This lecture first defines interpretable machine learning in general and introduces the agglomerative contextual decomposition (ACD) method to interpret neural networks. Extending ACD to the scientifically meaningful frequency domain, an adaptive wavelet distillation (AWD) interpretation method is developed. AWD is shown both to outperform deep neural networks and to be interpretable in two prediction problems from cosmology and cell biology. Finally, a quality-controlled data science life cycle is advocated for building any model for trustworthy interpretation, and a Predictability Computability Stability (PCS) framework is introduced for such a data science life cycle. 
2:50–3:00 pm  Break  
3:00–4:00 pm  Hans-Georg Mueller  Title: Exploration of Random Objects with Depth Profiles and Fréchet Regression
Abstract: Random objects, i.e., random variables that take values in a separable metric space, pose many challenges for statistical analysis, as vector operations are not available in general metric spaces. Examples include random variables that take values in the space of distributions, covariance matrices or surfaces, graph Laplacians to represent networks, trees and in other spaces. The increasing prevalence of samples of random objects has stimulated the development of metric statistics, an emerging collection of statistical tools to characterize, infer and relate samples of random objects. Recent developments include depth profiles, which are useful for the exploration of random objects. The depth profile for any given object is the distribution of distances to all other objects (with P. Dubey, Y. Chen 2022). These distributions can then be subjected to statistical analysis. Their mutual transports lead to notions of transport ranks, quantiles and centrality. Another useful tool is global or local Fréchet regression (with A. Petersen 2019) where random objects are responses and scalars or vectors are predictors and one aims at modeling conditional Fréchet means. Recent theoretical advances for local Fréchet regression provide a basis for object time warping (with Y. Chen 2022). These approaches are illustrated with distributional and other data. 
4:00–4:10 pm  Break  
4:10–5:10 pm  Stefanie Jegelka  Title: Some benefits of machine learning with invariances
Abstract: In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design. In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning. We will discuss targeted architectures that can universally express functions with the relevant invariances – sign flips and changes of basis – and their theoretical and empirical benefits. Second, we will take a broader, theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to (almost) any group action, and use tools from differential geometry. This is joint work with Derek Lim, Joshua Robinson, Behrooz Tahmasebi, Lingxiao Zhao, Tess Smidt, Suvrit Sra, and Haggai Maron. 
Tuesday, Feb. 28, 2023 (Eastern Time)
8:30–9:00 am  Breakfast  
Morning Session Chair: Zhigang Yao  
9:00–10:00 am  Charles Fefferman*  Title: Lipschitz Selection on Metric Spaces
Abstract: The talk concerns the problem of finding a Lipschitz map F from a given metric space X into R^D, subject to the constraint that F(x) must lie in a given compact convex “target” K(x) for each point x in X. Joint work with Pavel Shvartsman and with Bernat Guillen Pegueroles. 
10:00–10:10 am  Break  
10:10–11:10 am  David Dunson  Title: Inferring manifolds from noisy data using Gaussian processes
Abstract: In analyzing complex datasets, it is often of interest to infer lower dimensional structure underlying the higher dimensional observations. As a flexible class of nonlinear structures, it is common to focus on Riemannian manifolds. Most existing manifold learning algorithms replace the original data with lower dimensional coordinates without providing an estimate of the manifold in the observation space or using the manifold to denoise the original data. This article proposes a new methodology for addressing these problems, allowing interpolation of the estimated manifold between fitted data points. The proposed approach is motivated by novel theoretical properties of local covariance matrices constructed from noisy samples on a manifold. Our results enable us to turn a global manifold reconstruction problem into a local regression problem, allowing application of Gaussian processes for probabilistic manifold reconstruction. In addition to theory justifying the algorithm, we provide simulated and real data examples to illustrate the performance. Joint work with Nan Wu – see https://arxiv.org/abs/2110.07478 
11:10–11:20 am  Break  
11:20 am–12:20 pm  Wolfgang Polonik  Title: Inference in topological data analysis
Abstract: Topological data analysis has seen a huge increase in popularity finding applications in numerous scientific fields. This motivates the importance of developing a deeper understanding of benefits and limitations of such methods. Using this angle, we will present and discuss some recent results on large sample inference in topological data analysis, including bootstrap for Betti numbers and the Euler characteristics process. 
12:20–1:50 pm  Lunch  
Afternoon Session Chair: Stephan Huckemann  
1:50–2:50 pm  Ezra Miller  Title: Geometric central limit theorems on non-smooth spaces
Abstract: The central limit theorem (CLT) is commonly thought of as occurring on the real line, or in multivariate form on a real vector space. Motivated by statistical applications involving nonlinear data, such as angles or phylogenetic trees, the past twenty years have seen CLTs proved for Fréchet means on manifolds and on certain examples of singular spaces built from flat pieces glued together in combinatorial ways. These CLTs reduce to the linear case by tangent space approximation or by gluing. What should a CLT look like on general nonsmooth spaces, where tangent spaces are not linear and no combinatorial gluing or flat pieces are available? Answering this question involves figuring out appropriate classes of spaces and measures, correct analogues of Gaussian random variables, and how the geometry of the space (think “curvature”) is reflected in the limiting distribution. This talk provides an overview of these answers, starting with a review of the usual linear CLT and its generalization to smooth manifolds, viewed through a lens that casts the singular CLT as a natural outgrowth, and concluding with how this investigation opens gateways to further advances in geometric probability, topology, and statistics. Joint work with Jonathan Mattingly and Do Tran. 
2:50–3:00 pm  Break  
3:00–4:00 pm  Lizhen Lin  Title: Statistical foundations of deep generative models
Abstract: Deep generative models are probabilistic generative models where the generator is parameterized by a deep neural network. They are popular models for modeling high-dimensional data such as text, images and speech, and have achieved impressive empirical success. Despite demonstrated success in empirical performance, theoretical understanding of such models is largely lacking. We investigate statistical properties of deep generative models from a nonparametric distribution estimation viewpoint. In the considered model, data are assumed to be observed in some high-dimensional ambient space but concentrate around some low-dimensional structure such as a lower-dimensional manifold structure. Estimating the distribution supported on this low-dimensional structure is challenging due to its singularity with respect to the Lebesgue measure in the ambient space. We obtain convergence rates with respect to the Wasserstein metric of distribution estimators based on two methods: a sieve MLE based on the perturbed data and a GAN type estimator. Such an analysis provides insights into i) how deep generative models can avoid the curse of dimensionality and outperform classical nonparametric estimates, and ii) how likelihood approaches work for singular distribution estimation, especially in adapting to the intrinsic geometry of the data. 
4:00–4:10 pm  Break  
4:10–5:10 pm  Conversation session 
Wednesday, March 1, 2023 (Eastern Time)
8:30–9:00 am  Breakfast  
Morning Session Chair: Ezra Miller  
9:00–10:00 am  Amit Singer*  Title: Heterogeneity analysis in cryo-EM by covariance estimation and manifold learning
Abstract: In cryo-EM, the 3D molecular structure needs to be determined from many noisy 2D tomographic projection images of randomly oriented and positioned molecules. A key assumption in classical reconstruction procedures for cryo-EM is that the sample consists of identical molecules. However, many molecules of interest exist in more than one conformational state. These structural variations are of great interest to biologists, as they provide insight into the functioning of the molecule. Determining the structural variability from a set of cryo-EM images is known as the heterogeneity problem, widely recognized as one of the most challenging and important computational problems in the field. Due to the high level of noise in cryo-EM images, heterogeneity studies typically involve hundreds of thousands of images, sometimes even a few million. Covariance estimation is one of the earliest methods proposed for heterogeneity analysis in cryo-EM. It relies on computing the covariance of the conformations directly from projection images and extracting the optimal linear subspace of conformations through an eigendecomposition. Unfortunately, the standard formulation is plagued by the exorbitant cost of computing the N^3 x N^3 covariance matrix. In the first part of the talk, we present a new low-rank estimation method that requires computing only a small subset of the columns of the covariance while still providing an approximation for the entire matrix. This scheme allows us to estimate tens of principal components of real datasets in a few minutes at medium resolutions and under 30 minutes at high resolutions. In the second part of the talk, we discuss a manifold learning approach based on the graph Laplacian and the diffusion maps framework for learning the manifold of conformations. If time permits, we will also discuss the potential application of optimal transportation to heterogeneity analysis. 
Based on joint works with Joakim Andén, Marc Gilles, Amit Halevi, Eugene Katsevich, Joe Kileel, Amit Moscovich, and Nathan Zelesko. 
10:00–10:10 am  Break  
10:10–11:10 am  Ian Dryden  Title: Statistical shape analysis of molecule data
Abstract: Molecular shape data arise in many applications, for example high dimension low sample size cryo-electron microscopy (cryo-EM) data and large temporal sequences of peptides from molecular dynamics simulations. In both applications it is of interest to summarize the shape evolution of the molecules in a succinct, low-dimensional representation. However, Euclidean techniques such as principal components analysis (PCA) can be problematic as the data may lie far from a flat manifold. Principal nested spheres gives a fundamentally different decomposition of data from the usual Euclidean subspace based PCA. Subspaces of successively lower dimension are fitted to the data in a backwards manner with the aim of retaining signal and dispensing with noise at each stage. We adapt the methodology to 3D subshape spaces and provide some practical fitting algorithms. The methodology is applied to cryo-EM data of a large sliding clamp multi-protein complex and to cluster analysis of peptides, where different states of the molecules can be identified. Further molecular modeling tasks include resolution matching, where coarse resolution models are back-mapped into high resolution (atomistic) structures. This is joint work with Kwang-Rae Kim, Charles Laughton and Huiling Le. 
11:10–11:20 am  Break  
11:20 am–12:20 pm  Tamara Broderick  Title: An Automatic Finite-Sample Robustness Metric: Can Dropping a Little Data Change Conclusions?
Abstract: One hopes that data analyses will be used to make beneficial decisions regarding people’s health, finances, and wellbeing. But the data fed to an analysis may systematically differ from the data where these decisions are ultimately applied. For instance, suppose we analyze data in one country and conclude that microcredit is effective at alleviating poverty; based on this analysis, we decide to distribute microcredit in other locations and in future years. We might then ask: can we trust our conclusion to apply under new conditions? If we found that a very small percentage of the original data was instrumental in determining the original conclusion, we might not be confident in the stability of the conclusion under new conditions. So we propose a method to assess the sensitivity of data analyses to the removal of a very small fraction of the data set. Analyzing all possible data subsets of a certain size is computationally prohibitive, so we provide an approximation. We call our resulting method the Approximate Maximum Influence Perturbation. Our approximation is automatically computable, theoretically supported, and works for common estimators. We show that any nonrobustness our method finds is conclusive. Empirics demonstrate that while some applications are robust, in others the sign of a treatment effect can be changed by dropping less than 0.1% of the data — even in simple models and even when standard errors are small. 
12:20–1:50 pm  Lunch  
Afternoon Session Chair: Ezra Miller  
1:50–2:50 pm  Nicolai Reshetikhin*  Title: Random surfaces in exactly solvable models in statistical mechanics
Abstract: In the first part of the talk I will give an overview of a few models in statistical mechanics where a random variable is a geometric object such as a random surface or a random curve. The second part will be focused on the behavior of such random surfaces in the thermodynamic limit and on the formation of the so-called “limit shapes”. 
2:50–3:00 pm  Break  
3:00–4:00 pm  Sebastian Kurtek  Title: Robust Persistent Homology Using Elastic Functional Data Analysis
Abstract: Persistence landscapes are functional summaries of persistence diagrams designed to enable analysis of the diagrams using tools from functional data analysis. They comprise a collection of scalar functions such that birth and death times of topological features in persistence diagrams map to extrema of functions and intervals where they are nonzero. As a consequence, variation in persistence diagrams is encoded in both amplitude and phase components of persistence landscapes. Through functional data analysis of persistence landscapes, under an elastic Riemannian metric, we show how meaningful statistical summaries of persistence landscapes (e.g., mean, dominant directions of variation) can be obtained by decoupling their amplitude and phase variations. This decoupling is achieved via optimal alignment, with respect to the elastic metric, of the persistence landscapes. The estimated phase functions are tied to the resolution parameter that determines the filtration of simplicial complexes used to construct persistence diagrams. For a dataset obtained under geometric, scale and sampling variabilities, the phase function prescribes an optimal rate of increase of the resolution parameter for enhancing the topological signal in a persistence diagram. The proposed approach adds to the statistical analysis of data objects with rich structure compared to past studies. In particular, we focus on two sets of data that have been analyzed in the past, brain artery trees and images of prostate cancer cells, and show that separation of amplitude and phase of persistence landscapes is beneficial in both settings. This is joint work with Dr. James Matuk (Duke University) and Dr. Karthik Bharath (University of Nottingham). 
4:00–4:10 pm  Break  
4:10–5:10 pm  Conversation session  
5:10–5:20 pm  Stephan Huckemann, Ezra Miller, Zhigang Yao  Closing Remarks 
* Virtual Presentation