Big Data Conference 2025
Dates: Sep. 11–12, 2025
Location: Harvard University CMSA, 20 Garden Street, Cambridge & via Zoom
The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, mathematics, physics, and economics.
Invited Speakers
- Markus J. Buehler, MIT
- Yiling Chen, Harvard
- Jordan Ellenberg, UW Madison
- Yue M. Lu, Harvard
- Pankaj Mehta, BU
- Nick Patterson, Harvard
- Gautam Reddy, Princeton
- Trevor David Rhone, Rensselaer Polytechnic Institute
- Tess Smidt, MIT
Organizers:
Michael M. Desai, Harvard OEB | Michael R. Douglas, Harvard CMSA | Yannai A. Gonczarowski, Harvard Economics | Efthimios Kaxiras, Harvard Physics | Melanie Weber, Harvard SEAS
Schedule
Thursday, Sep. 11, 2025
| 9:00 am | Refreshments |
| 9:30 am | Introductions |
| 9:45–10:45 am | Gautam Reddy, Princeton
Title: Global epistasis in genotype-phenotype maps |
| 10:45–11:00 am | Break |
| 11:00 am–12:00 pm | Nick Patterson, Harvard
Title: The Origin of the Indo-Europeans Abstract: Indo-European is the largest family of human languages, with a very wide geographical distribution and more than 3 billion native speakers. How did this family arise and spread? This question has been debated for nearly 250 years, but with the availability of DNA from ancient fossils it is now largely understood, at least in broad outline. We will describe what we now know about these origins. |
| 12:00–1:30 pm | Lunch break |
| 1:30–2:30 pm | Markus Buehler, MIT
Title: Superintelligence for scientific discovery Abstract: AI is moving beyond prediction to become a partner in invention. While today’s models excel at interpolating within known data, true discovery requires stepping outside existing truths. This talk introduces superintelligent discovery engines built on multi-agent swarms: diverse AI agents that interact, compete, and cooperate to generate structured novelty. Guided by Gödel’s insight that no closed system is complete, these swarms create gradients of difference – much like temperature gradients in thermodynamics – that sustain flow, invention, and surprise. Case studies in protein design and music composition show how swarms escape data biases, invent novel structures, and weave long-range coherence, producing creativity that rivals human processes. By moving from “big data” to “big insight”, these systems point toward a new era of AI that composes knowledge across science, engineering, and the arts. |
| 2:30–2:45 pm | Break |
| 2:45–3:45 pm | Jordan Ellenberg, UW Madison
Title: What does machine learning have to offer mathematics? |
| 3:45–4:00 pm | Break |
| 4:00–5:00 pm | Pankaj Mehta, Boston University
Title: Thinking about high-dimensional biological data in the age of AI Abstract: The molecular biology revolution has transformed our view of living systems. Scientific explanations of biological phenomena are now synonymous with the identification of the genes and proteins involved. The preeminence of the molecular paradigm has only become more pronounced as new technologies allow us to make measurements at scale. Combining this wealth of data with new artificial intelligence (AI) techniques is widely viewed as the future of biology. Here, I will discuss the promise and perils of this approach. I will focus on our unpublished work with collaborators on two fronts: (i) transformer-based models for understanding genotype-to-phenotype maps, and (ii) LLM-based ‘foundational models’ for cellular identity, such as TranscriptFormer, which is trained on single-cell RNA sequencing (scRNAseq) data. While LLMs excel at capturing complex evolutionary and demographic structure in DNA sequence data, they are much less adept at elucidating the biology of cellular identity. We show that simple parameter-free models based on linear algebra outperform TranscriptFormer on downstream tasks related to cellular identity, even though TranscriptFormer has nearly a billion parameters. If time permits, I will conclude by showing how we can combine ideas from linear algebra, bifurcation theory, and statistical physics to classify cell fate transitions using scRNAseq data. |
Friday, Sep. 12, 2025
| 9:00–9:45 am | Refreshments |
| 9:45–10:45 am | Yiling Chen, Harvard
Title: Data Reliability Scoring Abstract: Imagine you are trying to make a data-driven decision, but the data at hand may be noisy, biased, or even strategically manipulated. Can you assess whether such a dataset is reliable—without access to ground truth? |
| 10:45–11:00 am | Break |
| 11:00 am–12:00 pm | Yue M. Lu, Harvard
Title: Nonlinear Random Matrices in High-Dimensional Estimation and Learning Abstract: In recent years, new classes of structured random matrices have emerged in statistical estimation and machine learning. Understanding their spectral properties has become increasingly important, as these matrices are closely linked to key quantities such as the training and generalization performance of large neural networks and the fundamental limits of high-dimensional signal recovery. Unlike classical random matrix ensembles, these new matrices often involve nonlinear transformations, introducing additional structural dependencies that pose challenges for traditional analysis techniques. In this talk, I will present a set of equivalence principles that establish asymptotic connections between various nonlinear random matrix ensembles and simpler linear models that are more tractable for analysis. I will then demonstrate how these principles can be applied to characterize the performance of kernel methods and random feature models across different scaling regimes and to provide insights into the in-context learning capabilities of attention-based Transformer networks. |
| 12:00–1:30 pm | Lunch break |
| 1:30–2:30 pm | Trevor David Rhone, Rensselaer Polytechnic Institute
Title: Accelerating the discovery of van der Waals quantum materials using AI Abstract: Van der Waals (vdW) materials are exciting platforms for studying emergent quantum phenomena, ranging from long-range magnetic order to topological order. A conservative estimate for the number of candidate vdW materials exceeds ~10^6 for monolayers and ~10^12 for heterostructures. How can we accelerate the exploration of this entire space of materials? Can we design quantum materials with desirable properties, thereby advancing innovation in science and technology? A recent study showed that artificial intelligence (AI) can be harnessed to discover new vdW Heisenberg ferromagnets based on Cr2Ge2Te6 [1], [2] and magnetic vdW topological insulators based on MnBi2Te4 [3]. In this talk, we will harness AI to efficiently explore the large chemical space of vdW materials and to guide the discovery of vdW materials with desirable spin and charge properties. We will focus on crystal structures based on monolayer Cr2I6 of the form A2X6, which are studied using density functional theory (DFT) calculations and AI. Magnetic properties, such as the magnetic moment, are determined. The formation energy is also calculated and used as a proxy for chemical stability. We also investigate monolayers based on MnBi2Te4 of the form AB2X4 to identify novel topological materials. Further, we study heterostructures based on MnBi2Te4/Sb2Te3 stacks. We show that AI, combined with DFT, can provide a computationally efficient means to predict the thermodynamic and magnetic properties of vdW materials [4], [5]. This study paves the way for the rapid discovery of chemically stable vdW quantum materials with applications in spintronics, magnetic memory, and novel quantum computing architectures. |
| 2:30–2:45 pm | Break |
| 2:45–3:45 pm | Tess Smidt, MIT
Title: Applications of Euclidean neural networks to understand and design atomistic systems Abstract: Atomic systems (molecules, crystals, proteins, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This poses a challenge for machine learning due to the sensitivity of coordinates to 3D rotations, translations, and inversions (the symmetries of 3D Euclidean space). Euclidean symmetry-equivariant Neural Networks (E(3)NNs) are specifically designed to address this issue. They faithfully capture the symmetries of physical systems, handle 3D geometry, and operate on the scalar, vector, and tensor fields that characterize these systems. E(3)NNs have achieved state-of-the-art results across atomistic benchmarks, including small-molecule property prediction, protein-ligand binding, and force prediction for crystals, molecules, and heterogeneous catalysis. By merging neural network design with group representation theory, they provide a principled way to embed physical symmetries directly into learning. In this talk, I will survey recent applications of E(3)NNs to materials design and highlight ongoing debates in the AI for atomistic sciences community: how to balance the incorporation of physical knowledge with the drive for engineering efficiency. |