Big Data Conference 2025
Dates: Sep. 11–12, 2025
Location: Harvard University CMSA, 20 Garden Street, Cambridge & via Zoom
The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, mathematics, physics, and economics.
Invited Speakers
- Markus J. Buehler, MIT
- Yiling Chen, Harvard
- Jordan Ellenberg, UW Madison
- Yue M. Lu, Harvard
- Pankaj Mehta, BU
- Nick Patterson, Harvard
- Gautam Reddy, Princeton
- Trevor David Rhone, Rensselaer Polytechnic Institute
- Tess Smidt, MIT
Organizers:
Michael M. Desai, Harvard OEB | Michael R. Douglas, Harvard CMSA | Yannai A. Gonczarowski, Harvard Economics | Efthimios Kaxiras, Harvard Physics | Melanie Weber, Harvard SEAS
Schedule
Thursday, Sep. 11, 2025
| 9:00 am | Refreshments |
| 9:30 am | Introductions |
| 9:45–10:45 am | Gautam Reddy, Princeton
Title: Global epistasis in genotype-phenotype maps |
| 10:45–11:00 am | Break |
| 11:00 am–12:00 pm | Nick Patterson, Harvard
Title: The Origin of the Indo-Europeans Abstract: Indo-European is the largest family of human languages, with a very wide geographical distribution and more than 3 billion native speakers. How did this family arise and spread? This question has been debated for nearly 250 years, but with the availability of DNA from ancient fossils it is now largely understood, at least in broad outline. We will describe what we now know about these origins. |
| 12:00–1:30 pm | Lunch break |
| 1:30–2:30 pm | Markus Buehler, MIT
Title: Superintelligence for scientific discovery Abstract: AI is moving beyond prediction to become a partner in invention. While today’s models excel at interpolating within known data, true discovery requires stepping outside existing truths. This talk introduces superintelligent discovery engines built on multi-agent swarms: diverse AI agents that interact, compete, and cooperate to generate structured novelty. Guided by Gödel’s insight that no closed system is complete, these swarms create gradients of difference – much like temperature gradients in thermodynamics – that sustain flow, invention, and surprise. Case studies in protein design and music composition show how swarms escape data biases, invent novel structures, and weave long-range coherence, producing creativity that rivals human processes. By moving from “big data” to “big insight”, these systems point toward a new era of AI that composes knowledge across science, engineering, and the arts. |
| 2:30–2:45 pm | Break |
| 2:45–3:45 pm | Jordan Ellenberg, UW Madison
Title: What does machine learning have to offer mathematics? |
| 3:45–4:00 pm | Break |
| 4:00–5:00 pm | Pankaj Mehta, Boston University
Title: Thinking about high-dimensional biological data in the age of AI Abstract: The molecular biology revolution has transformed our view of living systems. Scientific explanations of biological phenomena are now synonymous with the identification of the genes and proteins involved. The preeminence of the molecular paradigm has only become more pronounced as new technologies allow us to make measurements at scale. Combining this wealth of data with new artificial intelligence (AI) techniques is widely viewed as the future of biology. Here, I will discuss the promise and perils of this approach. I will focus on our unpublished work with collaborators on two fronts: (i) transformer-based models for understanding genotype-to-phenotype maps, and (ii) LLM-based ‘foundational models’ for cellular identity, such as TranscriptFormer, which is trained on single-cell RNA sequencing (scRNAseq) data. While LLMs excel at capturing complex evolutionary and demographic structure in DNA sequence data, they are much less adept at elucidating the biology of cellular identity. We show that simple parameter-free models based on linear algebra outperform TranscriptFormer on downstream tasks related to cellular identity, even though TranscriptFormer has nearly a billion parameters. If time permits, I will conclude by showing how we can combine ideas from linear algebra, bifurcation theory, and statistical physics to classify cell fate transitions using scRNAseq data. |
Friday, Sep. 12, 2025
| 9:00–9:45 am | Refreshments |
| 9:45–10:45 am | Yiling Chen, Harvard
Title: Data Reliability Scoring Abstract: Imagine you are trying to make a data-driven decision, but the data at hand may be noisy, biased, or even strategically manipulated. Can you assess whether such a dataset is reliable—without access to ground truth? |
| 10:45–11:00 am | Break |
| 11:00 am–12:00 pm | Yue M. Lu, Harvard
Title: Nonlinear Random Matrices in High-Dimensional Estimation and Learning Abstract: In recent years, new classes of structured random matrices have emerged in statistical estimation and machine learning. Understanding their spectral properties has become increasingly important, as these matrices are closely linked to key quantities such as the training and generalization performance of large neural networks and the fundamental limits of high-dimensional signal recovery. Unlike classical random matrix ensembles, these new matrices often involve nonlinear transformations, introducing additional structural dependencies that pose challenges for traditional analysis techniques. In this talk, I will present a set of equivalence principles that establish asymptotic connections between various nonlinear random matrix ensembles and simpler linear models that are more tractable for analysis. I will then demonstrate how these principles can be applied to characterize the performance of kernel methods and random feature models across different scaling regimes and to provide insights into the in-context learning capabilities of attention-based Transformer networks. |
| 12:00–1:30 pm | Lunch break |
| 1:30–2:30 pm | Trevor David Rhone, Rensselaer Polytechnic Institute
Title: Accelerating the discovery of van der Waals quantum materials using AI Abstract: Van der Waals (vdW) materials are exciting platforms for studying emergent quantum phenomena, ranging from long-range magnetic order to topological order. A conservative estimate for the number of candidate vdW materials exceeds ~10^6 for monolayers and ~10^12 for heterostructures. How can we accelerate the exploration of this entire space of materials? Can we design quantum materials with desirable properties, thereby advancing innovation in science and technology? A recent study showed that artificial intelligence (AI) can be harnessed to discover new vdW Heisenberg ferromagnets based on Cr2Ge2Te6 [1], [2] and magnetic vdW topological insulators based on MnBi2Te4 [3]. In this talk, we will harness AI to efficiently explore the large chemical space of vdW materials and to guide the discovery of vdW materials with desirable spin and charge properties. We will focus on crystal structures based on monolayer Cr2I6 of the form A2X6, which are studied using density functional theory (DFT) calculations and AI. Magnetic properties, such as the magnetic moment, are determined. The formation energy is also calculated and used as a proxy for chemical stability. We also investigate monolayers based on MnBi2Te4 of the form AB2X4 to identify novel topological materials. Further, we study heterostructures based on MnBi2Te4/Sb2Te3 stacks. We show that AI, combined with DFT, can provide a computationally efficient means to predict the thermodynamic and magnetic properties of vdW materials [4], [5]. This study paves the way for the rapid discovery of chemically stable vdW quantum materials with applications in spintronics, magnetic memory, and novel quantum computing architectures. |
| 2:30–2:45 pm | Break |
| 2:45–3:45 pm | Tess Smidt, MIT
Title: Applications of Euclidean neural networks to understand and design atomistic systems Abstract: Atomic systems (molecules, crystals, proteins, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This poses a challenge for machine learning due to the sensitivity of coordinates to 3D rotations, translations, and inversions (the symmetries of 3D Euclidean space). Euclidean symmetry-equivariant Neural Networks (E(3)NNs) are specifically designed to address this issue. They faithfully capture the symmetries of physical systems, handle 3D geometry, and operate on the scalar, vector, and tensor fields that characterize these systems. E(3)NNs have achieved state-of-the-art results across atomistic benchmarks, including small-molecule property prediction, protein-ligand binding, and force prediction for crystals, molecules, and heterogeneous catalysis. By merging neural network design with group representation theory, they provide a principled way to embed physical symmetries directly into learning. In this talk, I will survey recent applications of E(3)NNs to materials design and highlight ongoing debates in the AI for atomistic sciences community: how to balance the incorporation of physical knowledge with the drive for engineering efficiency. |