BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CMSA - ECPv6.15.18//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:CMSA
X-ORIGINAL-URL:https://cmsa.fas.harvard.edu
X-WR-CALDESC:Events for CMSA
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20240310T070000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20241103T060000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20250309T070000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20251102T060000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20260308T070000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20261101T060000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20250911T090000
DTEND;TZID=America/New_York:20250912T170000
DTSTAMP:20260411T121011
CREATED:20250502T175902Z
LAST-MODIFIED:20251026T044243Z
UID:10003743-1757581200-1757696400@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2025
DESCRIPTION:Big Data Conference 2025 \nDates: Sep. 11–12\, 2025 \nLocation: Harvard University CMSA\, 20 Garden Street\, Cambridge & via Zoom \nThe Big Data Conference features speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nInvited Speakers \n\nMarkus J. Buehler\, MIT\nYiling Chen\, Harvard\nJordan Ellenberg\, UW Madison\nYue M. Lu\, Harvard\nPankaj Mehta\, BU\nNick Patterson\, Harvard\nGautam Reddy\, Princeton\nTrevor David Rhone\, Rensselaer Polytechnic Institute\nTess Smidt\, MIT\n\nOrganizers: \nMichael M. Desai\, Harvard OEB |  Michael R. Douglas\, Harvard CMSA | Yannai A. Gonczarowski\, Harvard Economics | Efthimios Kaxiras\, Harvard Physics | Melanie Weber\, Harvard SEAS \n  \nBig Data Youtube Playlist \n  \nSchedule \nThursday\, Sep. 11\, 2025 \n  \n\n\n\n9:00 am\nRefreshments\n\n\n9:30 am\nIntroductions\n\n\n9:45–10:45 am\nGautam Reddy\, Princeton \nTitle: Global epistasis in genotype-phenotype maps\n\n\n10:45–11:00 am\nBreak\n\n\n11:00 am –12:00 pm\nNick Patterson\, Harvard \nTitle: The Origin of the Indo-Europeans \nAbstract: Indo-European is the largest family of human languages\, with very wide geographical distribution and more than 3 billion native speakers. How did this family arise and spread? This question has been discussed for nearly 250 years but with the advent of the availability of DNA from ancient fossils is now largely understood\, at least in broad outlines. We will describe what we now know about the origins.\n\n\n12:00–1:30 pm\nLunch break\n\n\n1:30–2:30 pm\nMarkus Buehler\, MIT \nTitle: Superintelligence for scientific discovery \nAbstract: AI is moving beyond prediction to become a partner in invention. While today’s models excel at interpolating within known data\, true discovery requires stepping outside existing truths. 
This talk introduces superintelligent discovery engines built on multi-agent swarms: diverse AI agents that interact\, compete\, and cooperate to generate structured novelty. Guided by Gödel’s insight that no closed system is complete\, these swarms create gradients of difference – much like temperature gradients in thermodynamics – that sustain flow\, invention\, and surprise. Case studies in protein design and music composition show how swarms escape data biases\, invent novel structures\, and weave long-range coherence\, producing creativity that rivals human processes. By moving from “big data” to “big insight”\, these systems point toward a new era of AI that composes knowledge across science\, engineering\, and the arts.\n\n\n2:30–2:45 pm\nBreak\n\n\n2:45–3:45 pm\nJordan Ellenberg\, UW Madison \nTitle: What does machine learning have to offer mathematics?\n\n\n3:45–4:00 pm\nBreak\n\n\n4:00–5:00 pm\nPankaj Mehta\, Boston University \nTitle: Thinking about high-dimensional biological data in the age of AI \nAbstract: The molecular biology revolution has transformed our view of living systems. Scientific explanations of biological phenomena are now synonymous with the identification of the genes and proteins. The preeminence of the molecular paradigm has only become more pronounced as new technologies allow us to make measurements at scale. Combining this wealth of data with new artificial intelligence (AI) techniques is widely viewed as the future of biology. Here\, I will discuss the promise and perils of this approach. I will focus on our unpublished work with collaborators on two fronts: (i) transformer-based models for understanding genotype-to-phenotype maps\, and (ii) LLM-based ‘foundational models’ for cellular identity\, such as TranscriptFormer\, which is trained on single-cell RNA sequencing (scRNAseq) data. 
While LLMs excel at capturing complex evolutionary and demographic structure in DNA sequence data\, they are much less adept at elucidating the biology of cellular identity. We show that simple parameter-free models based on linear algebra outperform TranscriptFormer on downstream tasks related to cellular identity\, even though TranscriptFormer has nearly a billion parameters. If time permits\, I will conclude by showing how we can combine ideas from linear algebra\, bifurcation theory\, and statistical physics to classify cell fate transitions using scRNAseq data.\n\n\n\n  \nFriday\, Sep. 12\, 2025  \n\n\n\n9:00–9:45 am\nRefreshments\n\n\n9:45–10:45 am\nYiling Chen\, Harvard \nTitle: Data Reliability Scoring \nAbstract: Imagine you are trying to make a data-driven decision\, but the data at hand may be noisy\, biased\, or even strategically manipulated. Can you assess whether such a dataset is reliable\, without access to ground truth?\nWe initiate the study of reliability scoring for datasets reported by potentially strategic data sources. While the true data remain unobservable\, we assume access to auxiliary observations generated by an unknown statistical process that depends on the truth. We introduce the Gram Determinant Score\, a reliability measure that evaluates how well the reported data align with the unobserved truth\, using only the reported data and the auxiliary observations. The score comes with provable guarantees: it preserves several natural reliability orderings. Experimentally\, it effectively captures data quality in settings with synthetic noise and contrastive learning embeddings.\nThis talk is based on joint work with Shi Feng\, Fang-Yi Yu\, and Paul Kattuman.\n\n\n10:45–11:00 am\nBreak\n\n\n11:00 am–12:00 pm\nYue M. Lu\, Harvard \nTitle: Nonlinear Random Matrices in High-Dimensional Estimation and Learning \nAbstract: In recent years\, new classes of structured random matrices have emerged in statistical estimation and machine learning. 
Understanding their spectral properties has become increasingly important\, as these matrices are closely linked to key quantities such as the training and generalization performance of large neural networks and the fundamental limits of high-dimensional signal recovery. Unlike classical random matrix ensembles\, these new matrices often involve nonlinear transformations\, introducing additional structural dependencies that pose challenges for traditional analysis techniques. \nIn this talk\, I will present a set of equivalence principles that establish asymptotic connections between various nonlinear random matrix ensembles and simpler linear models that are more tractable for analysis. I will then demonstrate how these principles can be applied to characterize the performance of kernel methods and random feature models across different scaling regimes and to provide insights into the in-context learning capabilities of attention-based Transformer networks.\n\n\n12:00–1:30 pm\nLunch break\n\n\n1:30–2:30 pm\nTrevor David Rhone\, Rensselaer Polytechnic Institute \nTitle: Accelerating the discovery of van der Waals quantum materials using AI \nAbstract: van der Waals (vdW) materials are exciting platforms for studying emergent quantum phenomena\, ranging from long-range magnetic order to topological order. A conservative estimate for the number of candidate vdW materials exceeds ~10^6 for monolayers and ~10^12 for heterostructures. How can we accelerate the exploration of this entire space of materials? Can we design quantum materials with desirable properties\, thereby advancing innovation in science and technology? A recent study showed that artificial intelligence (AI) can be harnessed to discover new vdW Heisenberg ferromagnets based on Cr2Ge2Te6 [1]\, [2] and magnetic vdW topological insulators based on MnBi2Te4 [3]. 
In this talk\, we will harness AI to efficiently explore the large chemical space of vdW materials and to guide the discovery of vdW materials with desirable spin and charge properties. We will focus on crystal structures based on monolayer Cr2I6 of the form A2X6\, which are studied using density functional theory (DFT) calculations and AI. Magnetic properties\, such as the magnetic moment\, are determined. The formation energy is also calculated and used as a proxy for the chemical stability. We also investigate monolayers based on MnBi2Te4 of the form AB2X4 to identify novel topological materials. Further to this\, we study heterostructures based on MnBi2Te4/Sb2Te3 stacks. We show that AI\, combined with DFT\, can provide a computationally efficient means to predict the thermodynamic and magnetic properties of vdW materials [4]\, [5]. This study paves the way for the rapid discovery of chemically stable vdW quantum materials with applications in spintronics\, magnetic memory and novel quantum computing architectures.\n[1] T. D. Rhone et al.\, “Data-driven studies of magnetic two-dimensional materials\,” Sci. Rep.\, vol. 10\, no. 1\, p. 15795\, 2020.\n[2] Y. Xie\, G. Tritsaris\, O. Granas\, and T. Rhone\, “Data-Driven Studies of the Magnetic Anisotropy of Two-Dimensional Magnetic Materials\,” J. Phys. Chem. Lett.\, vol. 12\, no. 50\, pp. 12048–12054.\n[3] R. Bhattarai\, P. Minch\, and T. D. Rhone\, “Investigating magnetic van der Waals materials using data-driven approaches\,” J. Mater. Chem. C\, vol. 11\, p. 5601\, 2023.\n[4] T. D. Rhone et al.\, “Artificial Intelligence Guided Studies of van der Waals Magnets\,” Adv. Theory Simulations\, vol. 6\, no. 6\, p. 2300019\, 2023.\n[5] P. Minch\, R. Bhattarai\, K. Choudhary\, and T. D. Rhone\, “Predicting magnetic properties of van der Waals magnets using graph neural networks\,” Phys. Rev. Mater.\, vol. 8\, no. 11\, p. 114002\, Nov. 
2024.\nThis work used the Extreme Science and Engineering Discovery Environment (XSEDE)\, which is supported by National Science Foundation Grant No. ACI-1548562. This research used resources of the Argonne Leadership Computing Facility\, which is a DOE Office of Science User Facility supported under Contract No. DE-AC02-06CH11357. This material is based on work supported by the National Science Foundation CAREER award under Grant No. 2044842.\n\n\n2:30–2:45 pm\nBreak\n\n\n2:45–3:45 pm\nTess Smidt\, MIT \nTitle: Applications of Euclidean neural networks to understand and design atomistic systems \nAbstract: Atomic systems (molecules\, crystals\, proteins\, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This poses a challenge for machine learning due to the sensitivity of coordinates to 3D rotations\, translations\, and inversions (the symmetries of 3D Euclidean space). Euclidean symmetry-equivariant Neural Networks (E(3)NNs) are specifically designed to address this issue. They faithfully capture the symmetries of physical systems\, handle 3D geometry\, and operate on the scalar\, vector\, and tensor fields that characterize these systems. \nE(3)NNs have achieved state-of-the-art results across atomistic benchmarks\, including small-molecule property prediction\, protein-ligand binding\, force prediction for crystals\, molecules\, and heterogeneous catalysis. By merging neural network design with group representation theory\, they provide a principled way to embed physical symmetries directly into learning. In this talk\, I will survey recent applications of E(3)NNs to materials design and highlight ongoing debates in the AI for atomistic sciences community: how to balance the incorporation of physical knowledge with the drive for engineering efficiency.\n\n\n\n 
URL:https://cmsa.fas.harvard.edu/event/bigdata_2025/
LOCATION:CMSA Room G10\, CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/jpeg:https://cmsa.fas.harvard.edu/media/Big-Data-2025_11x17.9-scaled.jpg
END:VEVENT
END:VCALENDAR