BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//CMSA - ECPv6.15.18//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:CMSA
X-ORIGINAL-URL:https://cmsa.fas.harvard.edu
X-WR-CALDESC:Events for CMSA
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/New_York
BEGIN:DAYLIGHT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
DTSTART:20140309T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
DTSTART:20141102T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20250911T090000
DTEND;TZID=America/New_York:20250912T170000
DTSTAMP:20260502T013323Z
CREATED:20250502T175902Z
LAST-MODIFIED:20251026T044243Z
UID:10003743-1757581200-1757696400@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2025
DESCRIPTION:Big Data Conference 2025 \nDates: Sep. 11–12\, 2025 \nLocation: Harvard University CMSA\, 20 Garden Street\, Cambridge & via Zoom \nThe Big Data Conference features speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nInvited Speakers \n\nMarkus J. Buehler\, MIT\nYiling Chen\, Harvard\nJordan Ellenberg\, UW Madison\nYue M. Lu\, Harvard\nPankaj Mehta\, BU\nNick Patterson\, Harvard\nGautam Reddy\, Princeton\nTrevor David Rhone\, Rensselaer Polytechnic Institute\nTess Smidt\, MIT\n\nOrganizers: \nMichael M. Desai\, Harvard OEB | Michael R. Douglas\, Harvard CMSA | Yannai A. Gonczarowski\, Harvard Economics | Efthimios Kaxiras\, Harvard Physics | Melanie Weber\, Harvard SEAS \n  \nBig Data YouTube Playlist \n  \nSchedule \nThursday\, Sep. 11\, 2025 \n  \n\n\n\n9:00 am\nRefreshments\n\n\n9:30 am\nIntroductions\n\n\n9:45–10:45 am\nGautam Reddy\, Princeton \nTitle: Global epistasis in genotype-phenotype maps\n\n\n10:45–11:00 am\nBreak\n\n\n11:00 am–12:00 pm\nNick Patterson\, Harvard \nTitle: The Origin of the Indo-Europeans \nAbstract: Indo-European is the largest family of human languages\, with very wide geographical distribution and more than 3 billion native speakers. How did this family arise and spread? This question has been discussed for nearly 250 years\, but with the availability of DNA from ancient fossils it is now largely understood\, at least in broad outlines. We will describe what we now know about the origins.\n\n\n12:00–1:30 pm\nLunch break\n\n\n1:30–2:30 pm\nMarkus Buehler\, MIT \nTitle: Superintelligence for scientific discovery \nAbstract: AI is moving beyond prediction to become a partner in invention. While today’s models excel at interpolating within known data\, true discovery requires stepping outside existing truths. 
This talk introduces superintelligent discovery engines built on multi-agent swarms: diverse AI agents that interact\, compete\, and cooperate to generate structured novelty. Guided by Gödel’s insight that no closed system is complete\, these swarms create gradients of difference – much like temperature gradients in thermodynamics – that sustain flow\, invention\, and surprise. Case studies in protein design and music composition show how swarms escape data biases\, invent novel structures\, and weave long-range coherence\, producing creativity that rivals human processes. By moving from “big data” to “big insight”\, these systems point toward a new era of AI that composes knowledge across science\, engineering\, and the arts.\n\n\n2:30–2:45 pm\nBreak\n\n\n2:45–3:45 pm\nJordan Ellenberg\, UW Madison \nTitle: What does machine learning have to offer mathematics?\n\n\n3:45–4:00 pm\nBreak\n\n\n4:00–5:00 pm\nPankaj Mehta\, Boston University \nTitle: Thinking about high-dimensional biological data in the age of AI \nAbstract: The molecular biology revolution has transformed our view of living systems. Scientific explanations of biological phenomena are now synonymous with the identification of the genes and proteins. The preeminence of the molecular paradigm has only become more pronounced as new technologies allow us to make measurements at scale. Combining this wealth of data with new artificial intelligence (AI) techniques is widely viewed as the future of biology. Here\, I will discuss the promise and perils of this approach. I will focus on our unpublished work with collaborators on two fronts: (i) transformer-based models for understanding genotype-to-phenotype maps\, and (ii) LLM-based ‘foundational models’ for cellular identity\, such as TranscriptFormer\, which is trained on single-cell RNA sequencing (scRNAseq) data. 
While LLMs excel at capturing complex evolutionary and demographic structure in DNA sequence data\, they are much less adept at elucidating the biology of cellular identity. We show that simple parameter-free models based on linear-algebra outperform TranscriptFormer on downstream tasks related to cellular identity\, even though TranscriptFormer has nearly a billion parameters. If time permits\, I will conclude by showing how we can combine ideas from linear algebra\, bifurcation theory\, and statistical physics to classify cell fate transitions using scRNAseq data.\n\n\n\n  \nFriday\, Sep. 12\, 2025  \n\n\n\n9:00-9:45 am\nRefreshments\n\n\n9:45–10:45 am\nYiling Chen\, Harvard \nTitle: Data Reliability Scoring \nAbstract: Imagine you are trying to make a data-driven decision\, but the data at hand may be noisy\, biased\, or even strategically manipulated. Can you assess whether such a dataset is reliable—without access to ground truth?\nWe initiate the study of reliability scoring for datasets reported by potentially strategic data sources. While the true data remain unobservable\, we assume access to auxiliary observations generated by an unknown statistical process that depends on the truth. We introduce the Gram Determinant Score\, a reliability measure that evaluates how well the reported data align with the unobserved truth\, using only the reported data and the auxiliary observations. The score comes with provable guarantees: it preserves several natural reliability orderings. Experimentally\, it effectively captures data quality in settings with synthetic noise and contrastive learning embeddings.\nThis talk is based on joint work with Shi Feng\, Fang-Yi Yu\, and Paul Kattuman.\n\n\n10:45–11:00 am\nBreak\n\n\n11:00 am –12:00 pm\nYue M. Lu\, Harvard \nTitle: Nonlinear Random Matrices in High-Dimensional Estimation and Learning \nAbstract: In recent years\, new classes of structured random matrices have emerged in statistical estimation and machine learning. 
Understanding their spectral properties has become increasingly important\, as these matrices are closely linked to key quantities such as the training and generalization performance of large neural networks and the fundamental limits of high-dimensional signal recovery. Unlike classical random matrix ensembles\, these new matrices often involve nonlinear transformations\, introducing additional structural dependencies that pose challenges for traditional analysis techniques. \nIn this talk\, I will present a set of equivalence principles that establish asymptotic connections between various nonlinear random matrix ensembles and simpler linear models that are more tractable for analysis. I will then demonstrate how these principles can be applied to characterize the performance of kernel methods and random feature models across different scaling regimes and to provide insights into the in-context learning capabilities of attention-based Transformer networks.\n\n\n12:00–1:30 pm\nLunch break\n\n\n1:30–2:30 pm\nTrevor David Rhone\, Rensselaer Polytechnic Institute \nTitle: Accelerating the discovery of van der Waals quantum materials using AI \nAbstract: van der Waals (vdW) materials are exciting platforms for studying emergent quantum phenomena\, ranging from long-range magnetic order to topological order. A conservative estimate for the number of candidate vdW materials exceeds ~10^6 for monolayers and ~10^12 for heterostructures. How can we accelerate the exploration of this entire space of materials? Can we design quantum materials with desirable properties\, thereby advancing innovation in science and technology? A recent study showed that artificial intelligence (AI) can be harnessed to discover new vdW Heisenberg ferromagnets based on Cr2Ge2Te6 [1]\, [2] and magnetic vdW topological insulators based on MnBi2Te4 [3]. 
In this talk\, we will harness AI to efficiently explore the large chemical space of vdW materials and to guide the discovery of vdW materials with desirable spin and charge properties. We will focus on crystal structures based on monolayer Cr2I6 of the form A2X6\, which are studied using density functional theory (DFT) calculations and AI. Magnetic properties\, such as the magnetic moment are determined. The formation energy is also calculated and used as a proxy for the chemical stability. We also investigate monolayers based on MnBi2Te4 of the form AB2X4 to identify novel topological materials. Further to this\, we study heterostructures based on MnBi2Te4/Sb2Te3 stacks. We show that AI\, combined with DFT\, can provide a computationally efficient means to predict the thermodynamic and magnetic properties of vdW materials [4]\,[5]. This study paves the way for the rapid discovery of chemically stable vdW quantum materials with applications in spintronics\, magnetic memory and novel quantum computing architectures.\n[1]        T. D. Rhone et al.\, “Data-driven studies of magnetic two-dimensional materials\,” Sci. Rep.\, vol. 10\, no. 1\, p. 15795\, 2020.\n[2]        Y. Xie\, G. Tritsaris\, O. Granas\, and T. Rhone\, “Data-Driven Studies of the Magnetic Anisotropy of Two-Dimensional Magnetic Materials\,” J. Phys. Chem. Lett.\, vol. 12\, no. 50\, pp. 12048–12054.\n[3]        R. Bhattarai\, P. Minch\, and T. D. Rhone\, “Investigating magnetic van der Waals materials using data-driven approaches\,” J. Mater. Chem. C\, vol. 11\, p. 5601\, 2023.\n[4]        T. D. Rhone et al.\, “Artificial Intelligence Guided Studies of van der Waals Magnets\,” Adv. Theory Simulations\, vol. 6\, no. 6\, p. 2300019\, 2023.\n[5]        P. Minch\, R. Bhattarai\, K. Choudhary\, and T. D. Rhone\, “Predicting magnetic properties of van der Waals magnets using graph neural networks\,” Phys. Rev. Mater.\, vol. 8\, no. 11\, p. 114002\, Nov. 
2024.\nThis work used the Extreme Science and Engineering Discovery Environment (XSEDE)\, which is supported by National Science Foundation Grant No. ACI-1548562. This research used resources of the Argonne Leadership Computing Facility\, which is a DOE Office of Science User Facility supported under Contract No. DE-AC02-06CH11357. This material is based on work supported by the National Science Foundation CAREER award under Grant No. 2044842.\n\n\n2:30–2:45 pm\nBreak\n\n\n2:45–3:45 pm\nTess Smidt\, MIT \nTitle: Applications of Euclidean neural networks to understand and design atomistic systems \nAbstract: Atomic systems (molecules\, crystals\, proteins\, etc.) are naturally represented by a set of coordinates in 3D space labeled by atom type. This poses a challenge for machine learning due to the sensitivity of coordinates to 3D rotations\, translations\, and inversions (the symmetries of 3D Euclidean space). Euclidean symmetry-equivariant Neural Networks (E(3)NNs) are specifically designed to address this issue. They faithfully capture the symmetries of physical systems\, handle 3D geometry\, and operate on the scalar\, vector\, and tensor fields that characterize these systems. \nE(3)NNs have achieved state-of-the-art results across atomistic benchmarks\, including small-molecule property prediction\, protein-ligand binding\, and force prediction for crystals\, molecules\, and heterogeneous catalysis. By merging neural network design with group representation theory\, they provide a principled way to embed physical symmetries directly into learning. In this talk\, I will survey recent applications of E(3)NNs to materials design and highlight ongoing debates in the AI for atomistic sciences community: how to balance the incorporation of physical knowledge with the drive for engineering efficiency.\n\n\n\n 
URL:https://cmsa.fas.harvard.edu/event/bigdata_2025/
LOCATION:CMSA Room G10\, CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/jpeg:https://cmsa.fas.harvard.edu/media/Big-Data-2025_11x17.9-scaled.jpg
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20240906T090000
DTEND;TZID=America/New_York:20240907T170000
DTSTAMP:20260502T013323Z
CREATED:20240325T141950Z
LAST-MODIFIED:20250415T154033Z
UID:10003287-1725613200-1725728400@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2024
DESCRIPTION:  \n \nYoutube Playlist \nOn September 6-7\, 2024\, the CMSA hosted the tenth annual Conference on Big Data. The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nLocation: Harvard University CMSA\, 20 Garden Street\, Cambridge & via Zoom \n  \nSpeakers: \n\nTianxi Cai\, Harvard Chan School\nRaj Chetty\, Harvard\nBianca Dumitrascu\, Columbia\nBoris Hanin\, Princeton\nPeter Hull\, Brown\nJamie Morgenstern\, U Washington\nKavita Ramanan\, Brown\nNeil Thompson\, MIT\nMelanie Weber\, Harvard\nKun-Hsing Yu\, Harvard Medical School\n\nOrganizers: \n\nRediet Abebe\, Harvard Society of Fellows\nMorgane Austern\, Harvard University Statistics\nMichael R. Douglas\, Harvard CMSA\nYannai Gonczarowski\, Harvard University Economics and Computer Science\nSam Kou\, Harvard University Statistics\n\nSCHEDULE (downloadable pdf) \nFriday\, Sep. 6\, 2024 \n9:00 am: Breakfast \n9:30 am: Introductions \n9:45–10:45 am\nSpeaker: Peter Hull\, Brown University\nTitle: Measuring Discrimination in Multi-Phase Systems\, with an Application to Child Protection\nAbstract: Large racial disparities have been documented in many high-stakes settings—such as employment\, health care\, housing\, and criminal justice—raising concerns of discrimination by individual decision-makers. At the same time\, there is growing understanding that a focus on individual decisions can yield an incomplete view of discrimination; an extensive theoretical literature shows how discrimination can arise and compound across multiple decision-makers in interconnected systems. We develop new empirical tools for studying discrimination in such multi-phase systems and apply them to the setting of foster care placement by child protective services. 
Leveraging the quasi-random assignment of two sets of decision-makers—initial hotline call screeners and subsequent investigators—we study how unwarranted racial disparities arise and propagate through this system. Using a sample of over 200\,000 maltreatment allegations\, we find that calls involving Black children are 55% more likely to result in foster care placement than calls involving white children with the same potential for future maltreatment in the home. Call screeners account for up to 19% of this unwarranted disparity\, with the remainder due to investigators. Unwarranted disparity is concentrated in cases with potential for future maltreatment\, suggesting that white children may be harmed by “underplacement” in high-risk situations. \n10:45–11:00 am: Break \n11:00 am –12:00 pm\nSpeaker: Jamie Morgenstern\, U Washington\nTitle: What governs predictive disparity in modern machine learning applications?\nAbstract: The deployment of statistical models in impactful environments is far from new—simple correlations have been used to guide decisions throughout the sciences\, health care\, political campaigns\, and in pricing financial instruments and other products for decades. Many such models\, and the decisions they supported\, were known to have different degrees of predictive power for different demographic groups. These differences had numerous sources\, including: limited expressiveness of the statistical models; limited availability of data from marginalized populations; noisier measurements of both features and targets from certain populations; and features with less mutual information about the prediction target for some populations than others.\nModern decision systems which use machine learning are more ubiquitous than ever\, as are their differences in performance for different populations of people. 
In this talk\, I will discuss some similarities and differences in the sources of differing performance in contemporary ML systems including facial recognition systems and those incorporating generative AI. \n12:00–1:30 pm: Lunch Break \n1:30–2:30 pm\nSpeaker: Kavita Ramanan\, Brown University\nTitle: Understanding High-dimensional Stochastic Dynamics on Realistic Networks\nAbstract: Large collections of randomly evolving particles that interact locally with respect to an underlying network model a variety of phenomena ranging from magnetism\, the spread of diseases\, neural and neuronal networks\, opinion dynamics and load balancing on computer networks. Due to their high-dimensional nature\, these systems are typically intractable to analyze exactly. Classical work\, falling under the rubric of mean-field approximations\, has mostly focused on the case when this interaction graph is dense.  However\, most real-world networks are sparse and often random. We describe a new approach to develop principled approximations for dynamics on realistic networks that beats the curse of dimensionality\, and illustrate its efficacy on a class of epidemiological models. This is based on joint works with Michel Davydov\, Ankan Ganguly and Juniper Cocomello. \n2:30–2:45 pm: Break \n2:45–3:45 pm\nSpeaker: Raj Chetty\, Harvard University\nTitle: The Science of Economic Opportunity: New Insights from Big Data\nAbstract: How can we improve economic opportunities for children growing up in low-income families? This talk will present findings from a recent set of studies that use various sources of big data — ranging from anonymized tax records to social network data — to understand the science of economic opportunity. 
Among other topics\, the talk will discuss how and why children’s chances of climbing the income ladder vary across neighborhoods\, the drivers of racial disparities in economic mobility\, how highly selective colleges may amplify the persistence of privilege\, and the role of social capital as a driver of upward mobility. The talk will conclude by giving examples of how academic research using big data is informing policy decisions from the local to federal level to expand opportunities for all. \n3:45–4:00 pm: Break \n4:00–5:00 pm\nSpeaker: Neil Thompson\, MIT\nTitle: How Algorithmic Progress is driving progress in Big Data and AI\nAbstract: Algorithm improvement is one of the purest forms of innovation: it allows the same computational task to be achieved with far fewer resources by proposing clever new ways to do that computation. In this talk\, I will discuss the work that my lab has done tracking and quantifying progress across decades of algorithm research and practice. As I will show\, this algorithmic progress has often outpaced hardware improvement as the most important driver of progress in Big Data and AI. \n  \nSaturday\, Sep. 7\, 2024 \n9:00 am: Breakfast \n9:30 am: Introductions \n9:45–10:45 am\nSpeaker: Tianxi Cai\, Harvard Chan School\nTitle: Crowdsourcing with Multi-institutional EHR to Improve Reliability of Real World Evidence – Opportunities and Challenges\nAbstract: The wide adoption of electronic health records (EHR) systems has led to the availability of large clinical datasets for discovery research. EHR data\, linked with biorepositories\, is a valuable new source for deriving real-world\, data-driven prediction models of disease risk and progression. Yet\, these data also bring analytical difficulties\, especially when aiming to leverage multi-institutional EHR data. Synthesizing information across healthcare systems is challenging due to heterogeneity and privacy. 
Statistical challenges also arise due to high dimensionality in the feature space. In this talk\, I’ll discuss analytical approaches for mining EHR data to improve the reliability and generalizability of real world evidence generated from the analyses. These methods will be illustrated using EHR data from Mass General Brigham and Veteran Health Administration. \n10:45–11:00 am: Break \n11:00 am–12:00 pm\nSpeaker: Bianca Dumitrascu\, Columbia Data Science Institute\nTitle: Statistical machine learning for learning representations of embryonic development\nAbstract: During embryonic development\, single cells read in local information from their environments and use this information to move\, divide and specialize. As a result\, the environments themselves change. However\, it remains unclear how gene expression programs interact with cell morphology and mechanical forces to orchestrate organogenesis in early embryos. Recent advances in single cell techniques and in toto imaging enable unique avenues for exploring this link between genomics and biophysics\, which dynamically maps cells to organisms.\nIn this talk\, I will describe statistical machine learning frameworks aimed at understanding how tissue-level mechanical and morphometric information impact gene expression patterns in spatio-temporal contexts. We use these tools to understand boundary formation in the early development of mouse embryos and to align data from light sheet recordings of pre-gastrulation development. \n12:00–1:30 pm: Lunch Break \n1:30–2:30 pm\nSpeaker: Melanie Weber\, Harvard Mathematics\nTitle: Data and Model Geometry in Deep Learning\nAbstract: Data with geometric structure is ubiquitous in machine learning. Often such structure arises from fundamental symmetries in the domain\, such as permutation-invariance in graphs and sets\, and translation-invariance in images. In this talk we discuss implications of this structure on the design and complexity of neural networks. 
Equivariant architectures\, which encode symmetries as inductive bias\, have shown great success in applications with geometric data\, but can suffer from instabilities as their depth increases. We propose a new architecture based on unitary group convolutions\, which allows for deeper networks with less instability. In the second part of the talk we discuss the impact of data and model geometry on the learnability of neural networks. We discuss learnability in several geometric settings\, including equivariant neural networks\, as well as learnability with respect to the geometry of the input data manifold. \n2:30–2:45 pm: Break \n2:45–3:45 pm\nSpeaker: Boris Hanin\, Princeton University\nTitle: Scaling Limits of Neural Networks\nAbstract: Neural networks are often studied analytically through scaling limits: regimes in which taking some structural network parameters (e.g. depth\, width\, number of training datapoints\, and so on) to infinity results in simplified models of learning. I will motivate and discuss recent results using several such approaches. I will emphasize both new theoretical insights into how model\, training data\, and optimizer impact learning and their practical implications for hyperparameter transfer. \n3:45–4:00 pm: Break \n4:00–5:00 pm\nSpeaker: Kun-Hsing Yu\, Harvard Medical School\nTitle: Foundation Models for Real-Time Cancer Diagnosis\nAbstract: Artificial intelligence (AI) is transforming the landscape of medical research and practice. Recent advances in microscopic image digitization\, foundation models\, and scalable computing infrastructure have opened new avenues for AI-enhanced cancer diagnosis. In this talk\, I will highlight recent breakthroughs in multi-modal AI systems for cancer pathology evaluation\, discuss integrative biomedical informatics methods that link cell morphology with molecular profiles\, and outline critical challenges in developing robust medical AI systems. 
\n  \n\nInformation about the 2023 Big Data Conference can be found here.
URL:https://cmsa.fas.harvard.edu/event/bigdata_2024/
LOCATION:20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2024_8.5x11-1.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20230831T090000
DTEND;TZID=America/New_York:20230901T170000
DTSTAMP:20260502T013323Z
CREATED:20230904T063654Z
LAST-MODIFIED:20251026T043812Z
UID:10000820-1693472400-1693587600@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2023
DESCRIPTION:On August 31-Sep 1\, 2023\, the CMSA hosted the ninth annual Conference on Big Data. The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nSpeakers: \n\nJacob Andreas\, MIT\nMorgane Austern\, Harvard\nAlbert-László Barabási\, Northeastern\nRachel Cummings\, Columbia\nMelissa Dell\, Harvard\nJianqing Fan\, Princeton\nTommi Jaakkola\, MIT\nAnkur Moitra\, MIT\nMark Sellke\, Harvard\nMarinka Zitnik\, Harvard Medical School\n\nOrganizers: \n\nMichael Douglas\, CMSA\, Harvard University\nYannai Gonczarowski\, Economics and Computer Science\, Harvard University\nLucas Janson\, Statistics and Computer Science\, Harvard University\nTracy Ke\, Statistics\, Harvard University\nHorng-Tzer Yau\, Mathematics and CMSA\, Harvard University\nYue Lu\, Electrical Engineering and Applied Mathematics\, Harvard University\n\nSchedule\n(PDF download) \nThursday\, August 31\, 2023 \n\n\n\n9:00 AM\nBreakfast\n\n\n9:30 AM\nIntroductions\n\n\n9:45–10:45 AM\nAlbert-László Barabási (Northeastern\, Harvard) \nTitle: From Network Medicine to the Foodome: The Dark Matter of Nutrition \nAbstract: A disease is rarely a consequence of an abnormality in a single gene but reflects perturbations to the complex intracellular network. Network medicine offers a platform to explore systematically not only the molecular complexity of a particular disease\, leading to the identification of disease modules and pathways\, but also the molecular relationships between apparently distinct (patho)phenotypes. As an application\, I will explore how we use network medicine to uncover the role of individual food molecules in our health. Indeed\, our current understanding of how diet affects our health is limited to the role of 150 key nutritional components systematically tracked by the USDA and other national databases in all foods. 
Yet\, these nutritional components represent only a tiny fraction of the over 135\,000 distinct\, definable biochemicals present in our food. While many of these biochemicals have documented effects on health\, they remain unquantified in any systematic fashion across different individual foods. Their invisibility to experimental\, clinical\, and epidemiological studies defines them as the ‘Dark Matter of Nutrition.’ I will speak about our efforts to develop a high-resolution library of this nutritional dark matter\, and efforts to understand the role of these molecules on health\, opening novel avenues by which to understand\, avoid\, and control disease. \nhttps://youtu.be/UmgzUwi6K3E\n\n\n10:45–11:00 AM\nBreak\n\n\n11:00 AM–12:00 PM\nRachel Cummings (Columbia) \nTitle: Differentially Private Algorithms for Statistical Estimation Problems \nAbstract: Differential privacy (DP) is widely regarded as a gold standard for privacy-preserving computation over users’ data. It is a parameterized notion of database privacy that gives a rigorous worst-case bound on the information that can be learned about any one individual from the result of a data analysis task. Algorithmically\, it is achieved by injecting carefully calibrated randomness into the analysis to balance privacy protections with accuracy of the results.\nIn this talk\, we will survey recent developments in DP algorithms for three important statistical problems\, namely online learning with bandit feedback\, causal inference\, and learning from imbalanced data. For the first problem\, we will show that Thompson sampling — a standard bandit algorithm developed in the 1930s — already satisfies DP due to the inherent randomness of the algorithm. For the second problem of causal inference and counterfactual estimation\, we develop the first DP algorithms for synthetic control\, which has been used non-privately for this task for decades. 
Finally\, for the problem of imbalanced learning\, where one class is severely underrepresented in the training data\, we show that combining existing techniques such as minority oversampling performs very poorly when applied as pre-processing before a DP learning algorithm; instead we propose novel approaches for privately generating synthetic minority points. \nBased on joint works with Marco Avella Medina\, Vishal Misra\, Yuliia Lut\, Tingting Ou\, Saeyoung Rho\, and Ethan Turok. \nhttps://youtu.be/0cPE6rb1Roo\n\n\n12:00–1:30 PM\nLunch\n\n\n1:30–2:30 PM\nMorgane Austern (Harvard) \nTitle: To split or not to split\, that is the question: From cross validation to debiased machine learning \nAbstract: Data splitting is a ubiquitous method in statistics\, with examples ranging from cross-validation to cross-fitting. However\, despite its prevalence\, theoretical guidance regarding its use is still lacking. In this talk\, we will explore two examples and establish an asymptotic theory for it. In the first part of this talk\, we study cross-validation\, a standard method for risk estimation\, and establish its asymptotic properties for a large class of models and with an arbitrary number of folds. Under stability conditions\, we establish a central limit theorem and Berry-Esseen bounds for the cross-validated risk\, which enable us to compute asymptotically accurate confidence intervals. Using our results\, we study the statistical speed-up offered by cross-validation compared to a train-test split procedure. We reveal some surprising behavior of the cross-validated risk and establish the statistically optimal choice for the number of folds. In the second part of this talk\, we study the role of cross-fitting in the generalized method of moments with moments that also depend on some auxiliary functions. 
Recent lines of work show how one can use generic machine learning estimators for these auxiliary problems\, while maintaining asymptotic normality and root-n consistency of the target parameter of interest. The literature typically requires that these auxiliary problems are fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties\, then sample splitting is not required. This allows for sample reuse\, which can be beneficial in moderately sized sample regimes. \nhttps://youtu.be/L_pHxgoQSgU\n\n\n2:30–2:45 PM\nBreak\n\n\n2:45–3:45 PM\nAnkur Moitra (MIT) \nTitle: Learning from Dynamics \nAbstract: Linear dynamical systems are the canonical model for time series data. They have wide-ranging applications and there is a vast literature on learning their parameters from input-output sequences. Moreover\, they have received renewed interest because of their connections to recurrent neural networks.\nBut there are wide gaps in our understanding. Existing works have only asymptotic guarantees or else make restrictive assumptions\, e.g.\, assumptions that preclude having any long-range correlations. In this work\, we give a new algorithm based on the method of moments that is computationally efficient and works under essentially minimal assumptions. Our work points to several missed connections\, whereby tools from theoretical machine learning\, including tensor methods\, can be used in non-stationary settings. \nhttps://youtu.be/UmgzUwi6K3E\n\n\n3:45–4:00 PM\nBreak\n\n\n4:00–5:00 PM\nMark Sellke (Harvard) \nTitle: Algorithmic Thresholds for Spherical Spin Glasses \nAbstract: High-dimensional optimization plays a crucial role in modern statistics and machine learning. I will present recent progress on non-convex optimization problems with random objectives\, focusing on the spherical p-spin glass. 
This model is related to spiked tensor estimation and has been studied in probability and physics for decades. We will see that a natural class of “stable” optimization algorithms gets stuck at an algorithmic threshold related to geometric properties of the landscape. The algorithmic threshold value is efficiently attained via Langevin dynamics or by a second-order ascent method of Subag. Much of this picture extends to other models\, such as random constraint satisfaction problems at high clause density. \nhttps://youtu.be/JoghiwiIbT8\n\n\n6:00 – 8:00 PM\nBanquet for organizers and speakers\n\n\n\n  \nFriday\, September 1\, 2023 \n\n\n\n9:00 AM\nBreakfast\n\n\n9:30 AM\nIntroductions\n\n\n9:45–10:45 AM\nJacob Andreas (MIT) \nTitle: What Learning Algorithm is In-Context Learning? \nAbstract: Neural sequence models\, especially transformers\, exhibit a remarkable capacity for “in-context” learning. They can construct new predictors from sequences of labeled examples (x\,f(x)) presented in the input without further parameter updates. I’ll present recent findings suggesting that transformer-based in-context learners implement standard learning algorithms implicitly\, by encoding smaller models in their activations\, and updating these implicit models as new examples appear in the context\, using in-context linear regression as a model problem. First\, I’ll show by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second\, I’ll show that trained in-context learners closely match the predictors computed by gradient descent\, ridge regression\, and exact least-squares regression\, transitioning between different predictors as transformer depth and dataset noise vary\, and converging to Bayesian estimators for large widths and depths. 
Finally\, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners’ late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms\, and that (at least in the linear case) learners may rediscover standard estimation algorithms. This work is joint with Ekin Akyürek at MIT\, and Dale Schuurmans\, Tengyu Ma and Denny Zhou at Stanford. \nhttps://youtu.be/UNVl64G3BzA\n\n\n10:45–11:00 AM\nBreak\n\n\n11:00 AM–12:00 PM\nTommi Jaakkola (MIT) \nTitle: Generative modeling and physical processes \nAbstract: Rapidly advancing deep distributional modeling techniques offer a number of opportunities for complex generative tasks\, from natural sciences such as molecules and materials to engineering. I will discuss generative approaches inspired from physical processes including diffusion models and more recent electrostatic models (Poisson flow)\, and how they relate to each other in terms of embedding dimension. From the point of view of applications\, I will highlight our recent work on SE(3) invariant distributional modeling over backbone 3D structures with ability to generate designable monomers without relying on pre-trained protein structure prediction methods as well as state of the art image generation capabilities (Poisson flow). Time permitting\, I will also discuss recent analysis of efficiency of sample generation in such models. \nhttps://youtu.be/GLEwQAWQ85E\n\n\n12:00–1:30 PM\nLunch\n\n\n1:30–2:30 PM\nMarinka Zitnik (Harvard Medical School) \nTitle: Multimodal Learning on Graphs \nAbstract: Understanding biological and natural systems requires modeling data with underlying geometric relationships across scales and modalities such as biological sequences\, chemical constraints\, and graphs of 3D spatial or biological interactions. 
I will discuss unique challenges for learning from multimodal datasets that are due to varying inductive biases across modalities and the potential absence of explicit graphs in the input. I will describe a framework for structure-inducing pretraining that allows for a comprehensive study of how relational structure can be induced in pretrained language models. We use the framework to explore new graph pretraining objectives that impose relational structure in the induced latent spaces—i.e.\, pretraining objectives that explicitly impose structural constraints on the distance or geometry of pretrained models. Applications in genomic medicine and therapeutic science will be discussed. These include TxGNN\, an AI model enabling zero-shot prediction of therapeutic use across over 17\,000 diseases\, and PINNACLE\, a contextual graph AI model dynamically adjusting its outputs to contexts in which it operates. PINNACLE enhances 3D protein structure representations and predicts the effects of drugs at single-cell resolution. \nhttps://youtu.be/hjt4nsN_8iM\n\n\n2:30–2:45 PM\nBreak\n\n\n2:45–3:45 PM\nJianqing Fan (Princeton) \nTitle: UTOPIA: Universally Trainable Optimal Prediction Intervals Aggregation \nAbstract: Uncertainty quantification for prediction is an intriguing problem with significant applications in various fields\, such as biomedical science\, economic studies\, and weather forecasts. Numerous methods are available for constructing prediction intervals\, such as quantile regression and conformal prediction\, among others. Nevertheless\, model misspecification (especially in high dimensions) or sub-optimal constructions can frequently result in biased or unnecessarily wide prediction intervals. 
In this work\, we propose a novel and widely applicable technique for aggregating multiple prediction intervals to minimize the average width of the prediction band along with a coverage guarantee\, called Universally Trainable Optimal Predictive Intervals Aggregation (UTOPIA). The method also allows us to directly construct predictive bands based on elementary basis functions. Our approach is based on linear or convex programming\, which is easy to implement. All of our proposed methodologies are supported by theoretical guarantees on the coverage probability and optimal average length\, which are detailed in this paper. The effectiveness of our approach is convincingly demonstrated by applying it to synthetic data and two real datasets on finance and macroeconomics. (Joint work with Jiawei Ge and Debarghya Mukherjee). \nhttps://youtu.be/WY6dr1oEOrk\n\n\n3:45–4:00 PM\nBreak\n\n\n4:00–5:00 PM\nMelissa Dell (Harvard) \nTitle: Efficient OCR for Building a Diverse Digital History \nAbstract: Many users consult digital archives daily\, but the information they can access is unrepresentative of the diversity of documentary history. The sequence-to-sequence architecture typically used for optical character recognition (OCR) – which jointly learns a vision and language model – is poorly extensible to low-resource document collections\, as learning a language-vision model requires extensive labeled sequences and compute. This study models OCR as a character-level image retrieval problem\, using a contrastively trained vision encoder. Because the model only learns characters’ visual features\, it is more sample-efficient and extensible than existing architectures\, enabling accurate OCR in settings where existing solutions fail. Crucially\, it opens new avenues for community engagement in making digital history more representative of documentary history. \nhttps://youtu.be/u0JY9vURUAs\n\n\n\n  \n\nInformation about the 2022 Big Data Conference can be found here.
URL:https://cmsa.fas.harvard.edu/event/bigdata_2023/
LOCATION:Harvard Science Center\, 1 Oxford Street\, Cambridge\, MA\, 02138
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2023_letter-1.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20220826T090000
DTEND;TZID=America/New_York:20220826T130000
DTSTAMP:20260502T013323
CREATED:20230705T044827Z
LAST-MODIFIED:20250328T145239Z
UID:10000058-1661504400-1661518800@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2022
DESCRIPTION:On August 26\, 2022 the CMSA hosted our eighth annual Conference on Big Data. The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nThe 2022 Big Data Conference took place virtually on Zoom. \nOrganizers: \n\nScott Duke Kominers\, MBA Class of 1960 Associate Professor\, Harvard Business School\nHorng-Tzer Yau\, Professor of Mathematics\, Harvard University\nSergiy Verstyuk\, CMSA\, Harvard University\n\nSpeakers: \n\nXiaohong Chen\, Yale\nMiles Cranmer\, Princeton\nJessica Jeffers\, University of Chicago\nDan Roberts\, MIT\n\nSchedule \n\n\n\n\n9:00 am\nConference Organizers\nIntroduction and Welcome\n\n\n9:10 am – 9:55 am\nXiaohong Chen\nTitle: On ANN optimal estimation and inference for policy functionals of nonparametric conditional moment restrictions \nAbstract: Many causal/policy parameters of interest are expectation functionals of unknown infinite-dimensional structural functions identified via conditional moment restrictions. Artificial Neural Networks (ANNs) can be viewed as nonlinear sieves that can approximate complex functions of high dimensional covariates more effectively than linear sieves. In this talk\, we present ANN optimal estimation and inference on policy functionals\, such as average elasticities or value functions\, of unknown structural functions of endogenous covariates. We provide ANN efficient estimation and optimal t-based confidence intervals for regular policy functionals such as average derivatives in nonparametric instrumental variables regressions. We also present ANN quasi likelihood ratio based inference for possibly irregular policy functionals of general nonparametric conditional moment restrictions (such as quantile instrumental variables models or Bellman equations) for time series data. 
We conduct intensive Monte Carlo studies to investigate computational issues with ANN-based optimal estimation and inference in economic structural models with endogeneity. For economic data sets that do not have very high signal-to-noise ratios\, there remain gaps between the theoretical advantages of ANN approximation theory and its inferential performance in finite samples.\nSome of the results are applied to efficient estimation and optimal inference for average price elasticity in consumer demand and BLP type demand. \nThe talk is based on two co-authored papers:\n(1) Efficient Estimation of Average Derivatives in NPIV Models: Simulation Comparisons of Neural Network Estimators\n(Authors: Jiafeng Chen\, Xiaohong Chen and Elie Tamer)\nhttps://arxiv.org/abs/2110.06763 \n(2) Neural network Inference on Nonparametric conditional moment restrictions with weakly dependent data\n(Authors: Xiaohong Chen\, Yuan Liao and Weichen Wang). \nView/Download Lecture Slides (pdf)\n\n\n10:00 am – 10:45 am\nJessica Jeffers\nTitle: Labor Reactions to Credit Deterioration: Evidence from LinkedIn Activity \nAbstract: We analyze worker reactions to their firms’ credit deterioration. Using weekly networking activity on LinkedIn\, we show workers initiate more connections immediately following a negative credit event\, even at firms far from bankruptcy. Our results suggest that workers are driven by concerns about both unemployment and future prospects at their firm. Heightened networking activity is associated with contemporaneous and future departures\, especially at financially healthy firms. Other negative events like missed earnings and equity downgrades do not trigger similar reactions. 
Overall\, our results indicate that the build-up of connections triggered by credit deterioration represents a source of fragility for firms.\n\n\n10:50 am – 11:35 am\nMiles Cranmer\nTitle: Interpretable Machine Learning for Physics \nAbstract: Would Kepler have discovered his laws if machine learning had been around in 1609? Or would he have been satisfied with the accuracy of some black box regression model\, leaving Newton without the inspiration to discover the law of gravitation? In this talk I will explore the compatibility of industry-oriented machine learning algorithms with discovery in the natural sciences. I will describe recent approaches developed with collaborators for addressing this\, based on a strategy of “translating” neural networks into symbolic models via evolutionary algorithms. I will discuss the inner workings of the open-source symbolic regression library PySR (github.com/MilesCranmer/PySR)\, which forms a central part of this interpretable learning toolkit. Finally\, I will present examples of how these methods have been used in the past two years in scientific discovery\, and outline some current efforts. \nView/Download Lecture Slides (pdf) \n\n\n11:40 am – 12:25 pm\nDan Roberts\nTitle: A Statistical Model of Neural Scaling Laws \nAbstract: Large language models with huge numbers of parameters\, trained on a near internet-sized number of tokens\, have been empirically shown to obey “neural scaling laws”: their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better\, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model — a joint generative data model and random feature model — that captures this neural scaling phenomenology. 
By solving this model using tools from random matrix theory\, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws\, (ii) how nonlinear feature maps\, i.e.\, the role played by the deep neural network\, enable scaling laws when trained on these datasets\, and (iii) how such scaling laws can break down\, and what their behavior is when they do. A key feature is the manner in which the power laws that occur in the statistics of natural datasets are translated into power law scalings of the test loss\, and how the finite extent of such power laws leads to both bottlenecks and breakdowns. \nView/Download Lecture Slides (pdf) \n \n\n\n12:30 pm\nConference Organizers\nClosing Remarks\n\n\n\n\n  \nInformation about last year’s conference can be found here.
URL:https://cmsa.fas.harvard.edu/event/big-data-conference-2022/
LOCATION:Virtual
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2022_web.png
END:VEVENT
BEGIN:VEVENT
DTSTART;VALUE=DATE:20210824
DTEND;VALUE=DATE:20210825
DTSTAMP:20260502T013323
CREATED:20230705T081718Z
LAST-MODIFIED:20250328T145235Z
UID:10000070-1629763200-1629849599@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2021
DESCRIPTION:On August 24\, 2021\, the CMSA hosted our seventh annual Conference on Big Data. The Conference features many speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nThe 2021 Big Data Conference took place virtually on Zoom. \nOrganizers: \n\nShing-Tung Yau\, William Caspar Graustein Professor of Mathematics\, Harvard University\nScott Duke Kominers\, MBA Class of 1960 Associate Professor\, Harvard Business School\nHorng-Tzer Yau\, Professor of Mathematics\, Harvard University\nSergiy Verstyuk\, CMSA\, Harvard University\n\nSpeakers: \n\nAndrew Blumberg\, University of Texas at Austin\nMoran Koren\, Harvard CMSA\nHima Lakkaraju\, Harvard University\nKatrina Ligett\, The Hebrew University of Jerusalem\n\n\n\n\n\nTime (ET; Boston time)\nSpeaker\nTitle/Abstract\n\n\n9:00AM\nConference Organizers\nIntroduction and Welcome\n\n\n9:10AM – 9:55AM\nAndrew Blumberg\, University of Texas at Austin\nTitle: Robustness and stability for multidimensional persistent homology \nAbstract: A basic principle in topological data analysis is to study the shape of data by looking at multiscale homological invariants. The idea is to filter the data using a scale parameter that reflects feature size. However\, for many data sets\, it is very natural to consider multiple filtrations\, for example coming from feature scale and density. A key question that arises is how such invariants behave with respect to noise and outliers. This talk will describe a framework for understanding those questions and explore open problems in the area.\n\n\n10:00AM – 10:45AM\nKatrina Ligett\, The Hebrew University of Jerusalem\nTitle: Privacy as Stability\, for Generalization \nAbstract: Many data analysis pipelines are adaptive: the choice of which analysis to run next depends on the outcome of previous analyses. 
Common examples include variable selection for regression problems and hyper-parameter optimization in large-scale machine learning problems: in both cases\, common practice involves repeatedly evaluating a series of models on the same dataset. Unfortunately\, this kind of adaptive re-use of data invalidates many traditional methods of avoiding overfitting and false discovery\, and has been blamed in part for the recent flood of non-reproducible findings in the empirical sciences. An exciting line of work beginning with Dwork et al. in 2015 establishes the first formal model and first algorithmic results providing a general approach to mitigating the harms of adaptivity\, via a connection to the notion of differential privacy. In this talk\, we’ll explore the notion of differential privacy and gain some understanding of how and why it provides protection against adaptivity-driven overfitting. Many interesting questions in this space remain open. \nJoint work with: Christopher Jung (UPenn)\, Seth Neel (Harvard)\, Aaron Roth (UPenn)\, Saeed Sharifi-Malvajerdi (UPenn)\, and Moshe Shenfeld (HUJI). This talk will draw on work that appeared at NeurIPS 2019 and ITCS 2020\n\n\n10:50AM – 11:35AM\nHima Lakkaraju\, Harvard University\nTitle: Towards Reliable and Robust Model Explanations \nAbstract: As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice\, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this talk\, I will present some of our recent research that sheds light on the vulnerabilities of popular post hoc explanation techniques such as LIME and SHAP\, and also introduce novel methods to address some of these vulnerabilities. 
More specifically\, I will first demonstrate that these methods are brittle\, unstable\, and vulnerable to a variety of adversarial attacks. Then\, I will discuss two solutions to address some of the vulnerabilities of these methods – (i) a framework based on adversarial training that is designed to make post hoc explanations more stable and robust to shifts in the underlying data; (ii) a Bayesian framework that captures the uncertainty associated with post hoc explanations and in turn allows us to generate explanations with user-specified levels of confidence. I will conclude the talk by discussing results from real-world datasets to demonstrate both the vulnerabilities in post hoc explanation techniques and the efficacy of our aforementioned solutions.\n\n\n11:40AM – 12:25PM\nMoran Koren\, Harvard CMSA\nTitle: A Gatekeeper’s Conundrum \nAbstract: Many selection processes contain a “gatekeeper”. The gatekeeper’s goal is to examine an applicant’s suitability to a proposed position before both parties endure substantial costs. Intuitively\, the introduction of a gatekeeper should reduce selection costs as unlikely applicants are sifted out. However\, we show that this is not always the case\, as the gatekeeper’s introduction inadvertently reduces the applicant’s expected costs and thus interferes with her self-selection. We study the conditions under which the gatekeeper’s presence improves the system’s efficiency and those conditions under which the gatekeeper’s presence induces inefficiency. Additionally\, we show that the gatekeeper can sometimes improve selection correctness by behaving strategically (i.e.\, ignoring her private information with some probability).\n\n\n12:25PM\nConference Organizers\nClosing Remarks
URL:https://cmsa.fas.harvard.edu/event/big-data-conference-2021/
LOCATION:Virtual
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/BD_21-Poster.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20200824T100000
DTEND;TZID=America/New_York:20200825T140500
DTSTAMP:20260502T013323
CREATED:20230707T104105Z
LAST-MODIFIED:20250305T185337Z
UID:10000137-1598263200-1598364300@cmsa.fas.harvard.edu
SUMMARY:2020 Big Data Conference (Virtual)
DESCRIPTION:On August 24-25\, 2020 the CMSA hosted our sixth annual Conference on Big Data. The Conference featured many speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. The 2020 Big Data Conference took place virtually. \n\nVideos of the talks are available in this YouTube playlist.\n  \nOrganizers: \n\nShing-Tung Yau\, William Caspar Graustein Professor of Mathematics\, Harvard University\nScott Duke Kominers\, MBA Class of 1960 Associate Professor\, Harvard Business School\nHorng-Tzer Yau\, Professor of Mathematics\, Harvard University\nSergiy Verstyuk\, CMSA\, Harvard University\n\nSpeakers:\n \n\nSanjeev Arora\, Princeton University\nJuan Camilo Castillo\, University of Pennsylvania\nJoseph Dexter\, Dartmouth College\nNicole Immorlica\, Microsoft\nAmin Saberi\, Stanford University\nVira Semenova\, University of California\, Berkeley\nVarda Shalev\, Tel Aviv University
URL:https://cmsa.fas.harvard.edu/event/2020-big-data-conference-virtual/
LOCATION:CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/jpeg:https://cmsa.fas.harvard.edu/media/Big-Data-2020-pdf.jpg
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20190819T083000
DTEND;TZID=America/New_York:20190820T164000
DTSTAMP:20260502T013323
CREATED:20230707T174003Z
LAST-MODIFIED:20250328T145128Z
UID:10000116-1566203400-1566319200@cmsa.fas.harvard.edu
SUMMARY:2019 Big Data Conference
DESCRIPTION:On August 19-20\, 2019 the CMSA hosted the fifth annual Conference on Big Data. The Conference featured many speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nThe talks took place in Science Center Hall D\, 1 Oxford Street. \nVideos can be found in the YouTube playlist.
URL:https://cmsa.fas.harvard.edu/event/2019-big-data-conference/
LOCATION:CMSA\, 20 Garden Street\, Cambridge\, MA\, 02138\, United States
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2019-Poster-5-2.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20180823T083000
DTEND;TZID=America/New_York:20180824T163000
DTSTAMP:20260502T013323
CREATED:20230715T083801Z
LAST-MODIFIED:20250415T154139Z
UID:10000086-1535013000-1535128200@cmsa.fas.harvard.edu
SUMMARY:Big Data Conference 2018
DESCRIPTION:On August 23-24\, 2018 the CMSA hosted the fourth annual Conference on Big Data. The Conference featured speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. \nThe talks were held in Science Center Hall B\, 1 Oxford Street. \nSpeakers: \n\nMohammad Akbarpour\, Stanford\nEmily Breza\, Harvard\nFrancesca Dominici\, Harvard\nChiara Farronato\, Harvard\nKobi Gal\, Ben Gurion\nJonah Kallenbach\, Reverie Labs\nSamuel Kou\, Harvard\nLaura Kreidberg\, Harvard\nDanielle Li\, MIT\nLibby Mishkin\, Uber\nJosh Speagle\, Harvard\nWilliam Stein\, University of Washington\nAlex Teytelboym\, University of Oxford\nSergiy Verstyuk\, CMSA/Harvard\n\nOrganizers: \n\nShing-Tung Yau\, William Caspar Graustein Professor of Mathematics\, Harvard University\nScott Duke Kominers\, MBA Class of 1960 Associate Professor\, Harvard Business School\nRichard Freeman\, Herbert Ascherman Professor of Economics\, Harvard University\nJun Liu\, Professor of Statistics\, Harvard University\nHorng-Tzer Yau\, Professor of Mathematics\, Harvard University
URL:https://cmsa.fas.harvard.edu/event/2018-big-data-conference-2/
LOCATION:Harvard Science Center\, 1 Oxford Street\, Cambridge\, MA\, 02138
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2018-4.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20170818T154700
DTEND;TZID=America/New_York:20170819T154700
DTSTAMP:20260502T013323
CREATED:20230717T172600Z
LAST-MODIFIED:20250328T144515Z
UID:10000034-1503071220-1503157620@cmsa.fas.harvard.edu
SUMMARY:2017 Big Data Conference
DESCRIPTION:The Center of Mathematical Sciences and Applications will be hosting a conference on Big Data from August 18 – 19\, 2017\, in Hall D of the Science Center at Harvard University.\nThe Big Data Conference features many speakers from the Harvard community as well as scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. This is the third conference on Big Data the Center will host as part of our annual events\, and is co-organized by Richard Freeman\, Scott Kominers\, Jun Liu\, Horng-Tzer Yau and Shing-Tung Yau. \nConfirmed Speakers: \n\nMohammad Akbarpour\, Stanford University\nAlbert-László Barabási\, Northeastern University\nNoureddine El Karoui\, University of California\, Berkeley\nRavi Jagadeesan\, Harvard University\nLucas Janson\, Harvard University\nTracy Ke\, University of Chicago\nTze Leung Lai\, Stanford University\nAnnie Liang\, University of Pennsylvania\nMarena Lin\, Harvard University\nNikhil Naik\, Harvard University\nAlex Peysakhovich\, Facebook\nNatesh Pillai\, Harvard University\nJann Spiess\, Harvard University\nBradly Stadie\, OpenAI\, University of California\, Berkeley\nZak Stone\, Google\nHau-Tieng Wu\, University of Toronto\nSifan Zhou\, Xiamen University\n\n  \nFollowing the conference\, there will be a two-day workshop from August 20-21. The workshop is organized by Scott Kominers\, and will feature: \n\nJörn Boehnke\, Harvard University\nNikhil Naik\, Harvard University\nBradly Stadie\, OpenAI\, University of California\, Berkeley\n\n  \nConference Schedule \nA PDF version of the schedule below can also be downloaded here. 
\nAugust 18\, Friday (Full day)\n\n\n\nTime\nSpeaker\nTopic\n\n\n8:30 am – 9:00 am\n\nBreakfast\n\n\n9:00 am – 9:40 am\nMohammad Akbarpour \nVideo\nTitle: Information aggregation in overlapping generations and the emergence of experts \nAbstract: We study a model of social learning with “overlapping generations”\, where agents meet others and share data about an underlying state over time. We examine under what conditions the society will produce individuals with precise knowledge about the state of the world. There are two information sharing regimes in our model: Under the full information sharing technology\, individuals exchange the information about their point estimates of an underlying state\, as well as their sources (or the precision of their signals) and update their beliefs by taking a weighted average. Under the limited information sharing technology\, agents only observe the information about the point estimates of those they meet\, and update their beliefs by taking a weighted average\, where weights can depend on the sequence of meetings\, as well as the labels. Our main result shows that\, unlike most social learning settings\, using such linear learning rules does not guide the society (or even a fraction of its members) to learn the truth\, and having access to\, and exploiting knowledge of\, the precision of a source signal are essential for efficient social learning (joint with Amin Saberi & Ali Shameli).\n\n\n9:40 am – 10:20 am\nLucas Janson \nVideo\nTitle: Model-Free Knockoffs For High-Dimensional Controlled Variable Selection \nAbstract: Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a nonlinear fashion\, such as when the response is binary. 
Although this modeling problem has been extensively studied\, it remains unclear how to effectively control the fraction of false discoveries even in high-dimensional logistic regression\, not to mention general high-dimensional nonlinear models. To address such a practical problem\, we propose a new framework of model-free knockoffs\, which reads from a different perspective the knockoff procedure (Barber and Candès\, 2015) originally designed for controlling the false discovery rate in linear models. The key innovation of our method is to construct knockoff variables probabilistically instead of geometrically. This enables model-free knockoffs to deal with arbitrary (and unknown) conditional models and any dimensions\, including when the dimensionality p exceeds the sample size n\, while the original knockoffs procedure is constrained to homoscedastic linear models with n greater than or equal to p. Our approach requires the design matrix be random (independent and identically distributed rows) with a covariate distribution that is known\, although we show our procedure to be robust to unknown/estimated distributions. As we require no knowledge/assumptions about the conditional distribution of the response\, we effectively shift the burden of knowledge from the response to the covariates\, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the covariates. To our knowledge\, no other procedure solves the controlled variable selection problem in such generality\, but in the restricted settings where competitors exist\, we demonstrate the superior power of knockoffs through simulations. Finally\, we apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom\, making twice as many discoveries as the original analysis of the same data. 
\nSlides\n\n\n10:20 am – 10:50 am\n\nBreak\n\n\n10:50 am – 11:30 am\nNoureddine El Karoui \nVideo\nTitle: Random matrices and high-dimensional statistics: beyond covariance matrices \nAbstract: Random matrices have played a central role in understanding very important statistical methods linked to covariance matrices (such as Principal Components Analysis\, Canonical Correlation Analysis\, etc.) for several decades. In this talk\, I’ll show that one can adopt a random-matrix-inspired point of view to understand the performance of other widely used tools in statistics\, such as M-estimators\, and very common methods such as the bootstrap. I will focus on the high-dimensional case\, which captures well the situation of “moderately” difficult statistical problems\, arguably one of the most relevant in practice. In this setting\, I will show that random matrix ideas help upend conventional theoretical thinking (for instance about maximum likelihood methods) and highlight very serious practical problems with resampling methods.\n\n\n11:30 am – 12:10 pm\nNikhil Naik \nVideo\nTitle: Understanding Urban Change with Computer Vision and Street-level Imagery \nAbstract: Which neighborhoods experience physical improvements? In this work\, we introduce a computer vision method to measure changes in the physical appearances of neighborhoods from time-series street-level imagery. We connect changes in the physical appearance of five US cities with economic and demographic data and find three factors that predict neighborhood improvement. First\, neighborhoods that are densely populated by college-educated adults are more likely to experience physical improvements. Second\, neighborhoods with better initial appearances experience\, on average\, larger positive improvements. Third\, neighborhood improvement correlates positively with physical proximity to the central business district and to other physically attractive neighborhoods. 
Together\, our results illustrate the value of using computer vision methods and street-level imagery to understand the physical dynamics of cities. \n(Joint work with Edward L. Glaeser\, Cesar A. Hidalgo\, Scott Duke Kominers\, and Ramesh Raskar.)\n\n\n12:10 pm – 12:25 pm\nVideo #1 \nVideo #2\nData Science Lightning Talks\n\n\n12:25 pm – 1:30 pm\n\nLunch\n\n\n1:30 pm – 2:10 pm\nTracy Ke \nVideo\nTitle: A new SVD approach to optimal topic estimation \nAbstract: In probabilistic topic models\, the quantity of interest—a low-rank matrix consisting of topic vectors—is hidden in the text corpus matrix\, masked by noise\, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However\, the connection between this low-rank matrix and the singular vectors of the text corpus matrix is usually complicated and hard to spell out\, so using SVD to learn topic models faces challenges. \nWe overcome the challenge by revealing a surprising insight: there is a low-dimensional simplex structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix\, and which allows us to conveniently reconstruct the former using the latter. Such an insight motivates a new SVD-based approach to learning topic models. \nFor asymptotic analysis\, we show that under a popular topic model (Hofmann\, 1999)\, the convergence rate of the l1-error of our method matches that of the minimax lower bound\, up to a multi-logarithmic term. In showing these results\, we have derived new element-wise bounds on the singular vectors and several large deviation bounds for weakly dependent multinomial data. Our results on the convergence rate and asymptotic minimaxity are new. We have applied our method to two data sets\, Associated Press (AP) and Statistics Literature Abstract (SLA)\, with encouraging results. 
In particular\, there is a clear simplex structure associated with the SVD of the data matrices\, which largely validates our discovery.\n\n\n2:10 pm – 2:50 pm\nAlbert-László Barabási \nVideo\nTitle: Taming Complexity: From Network Science to Controlling Networks \nAbstract: The ultimate proof of our understanding of biological or technological systems is reflected in our ability to control them. While control theory offers mathematical tools to steer engineered and natural systems towards a desired state\, we lack a framework to control complex self-organized systems. Here we explore the controllability of an arbitrary complex network\, identifying the set of driver nodes whose time-dependent control can guide the system’s entire dynamics. We apply these tools to several real networks\, unveiling how the network topology determines its controllability. Virtually all technological and biological networks must be able to control their internal processes. Given that\, issues related to control deeply shape the topology and the vulnerability of real systems. Consequently\, unveiling the control principles of real networks\, the goal of our research\, forces us to address a series of fundamental questions pertaining to our understanding of complex systems. \n\n\n2:50 pm – 3:20 pm\n\nBreak\n\n\n3:20 pm – 4:00 pm\nMarena Lin \nVideo\nTitle: Optimizing climate variables for human impact studies \nAbstract: Estimates of the relationship between climate variability and socio-economic outcomes are often limited by the spatial resolution of the data. As studies aim to generalize the connection between climate and socio-economic outcomes across countries\, the best available socio-economic data is at the national level (e.g. food production quantities\, the incidence of warfare\, averages of crime incidence\, gender birth ratios). 
While these statistics may be trusted from government censuses\, the appropriate metric for the corresponding climate or weather for a given year in a country is less obvious. For example\, how do we estimate the temperatures in a country relevant to national food production and therefore food security? We demonstrate that high-resolution spatiotemporal satellite data for vegetation can be used to estimate the weather variables that may be most relevant to food security and related socio-economic outcomes. In particular\, satellite proxies for vegetation over the African continent reflect the seasonal movement of the Intertropical Convergence Zone\, a band of intense convection and rainfall. We also show that agricultural sensitivity to climate variability differs significantly between countries. This work is an example of the ways in which in-situ and satellite-based observations are invaluable to both estimates of future climate variability and to continued monitoring of the earth-human system. We discuss the current state of these records and potential challenges to their continuity.\n\n\n4:00 pm – 4:40 pm\nAlex Peysakhovich\n Title: Building a cooperator \nAbstract: A major goal of modern AI is to construct agents that can perform complex tasks. Much of this work deals with single agent decision problems. However\, agents are rarely alone in the world. 
In this talk I will discuss how to combine ideas from deep reinforcement learning and game theory to construct artificial agents that can communicate\, collaborate and cooperate in productive positive-sum interactions.\n\n\n4:40 pm – 5:20 pm\nTze Leung Lai \nVideo\nTitle: Gradient boosting: Its role in big data analytics\, underlying mathematical theory\, and recent refinements \nAbstract: We begin with a review of the history of gradient boosting\, dating back to the LMS algorithm of Widrow and Hoff in 1960 and culminating in Freund and Schapire’s AdaBoost and Friedman’s gradient boosting and stochastic gradient boosting algorithms in the period 1999-2002 that heralded the big data era. The role played by gradient boosting in big data analytics\, particularly with respect to deep learning\, is then discussed. We also present some recent work on the mathematical theory of gradient boosting\, which has led to some refinements that greatly improve the convergence properties and prediction performance of the methodology.\n\n\n\nAugust 19\, Saturday (Full day)\n\n\n\nTime\nSpeaker\nTopic\n\n\n8:30 am – 9:00 am\n\nBreakfast\n\n\n9:00 am – 9:40 am\nNatesh Pillai \nVideo\nTitle: Accelerating MCMC algorithms for Computationally Intensive Models via Local Approximations \nAbstract: We construct a new framework for accelerating Markov chain Monte Carlo in posterior sampling problems where standard methods are limited by the computational cost of the likelihood\, or of numerical models embedded therein. Our approach introduces local approximations of these models into the Metropolis–Hastings kernel\, borrowing ideas from deterministic approximation theory\, optimization\, and experimental design. Previous efforts at integrating approximate models into inference typically sacrifice either the sampler’s exactness or efficiency; our work seeks to address these limitations by exploiting useful convergence characteristics of local approximations. 
We prove the ergodicity of our approximate Markov chain\, showing that it samples asymptotically from the exact posterior distribution of interest. We describe variations of the algorithm that employ either local polynomial approximations or local Gaussian process regressors. Our theoretical results reinforce the key observation underlying this article: when the likelihood has some local regularity\, the number of model evaluations per Markov chain Monte Carlo (MCMC) step can be greatly reduced without biasing the Monte Carlo average. Numerical experiments demonstrate multiple order-of-magnitude reductions in the number of forward model evaluations used in representative ordinary differential equation (ODE) and partial differential equation (PDE) inference problems\, with both synthetic and real data.\n\n\n9:40 am – 10:20 am\nRavi Jagadeesan \nVideo\nTitle: Designs for estimating the treatment effect in networks with interference \nAbstract: In this paper we introduce new\, easily implementable designs for drawing causal inference from randomized experiments on networks with interference. Inspired by the idea of matching in observational studies\, we introduce the notion of considering a treatment assignment as a “quasi-coloring” on a graph. Our idea of a perfect quasi-coloring strives to match every treated unit on a given network with a distinct control unit that has an identical number of treated and control neighbors. For a wide range of interference functions encountered in applications\, we show both by theory and simulations that the classical Neymanian estimator for the direct effect has desirable properties for our designs. This further extends to settings where homophily is present in addition to interference.\n\n\n10:20 am – 10:50 am\n\nBreak\n\n\n10:50 am – 11:30 am\nAnnie Liang \nVideo\nTitle: The Theory is Predictive\, but is it Complete? 
An Application to Human Generation of Randomness \nAbstract: When we test a theory using data\, it is common to focus on correctness: do the predictions of the theory match what we see in the data? But we also care about completeness: how much of the predictable variation in the data is captured by the theory? This question is difficult to answer\, because in general we do not know how much “predictable variation” there is in the problem. In this paper\, we consider approaches motivated by machine learning algorithms as a means of constructing a benchmark for the best attainable level of prediction.  We illustrate our methods on the task of predicting human-generated random sequences. Relative to a theoretical machine learning algorithm benchmark\, we find that existing behavioral models explain roughly 15 percent of the predictable variation in this problem. This fraction is robust across several variations on the problem. We also consider a version of this approach for analyzing field data from domains in which human perception and generation of randomness has been used as a conceptual framework; these include sequential decision-making and repeated zero-sum games. In these domains\, our framework for testing the completeness of theories provides a way of assessing their effectiveness over different contexts; we find that despite some differences\, the existing theories are fairly stable across our field domains in their performance relative to the benchmark. Overall\, our results indicate that (i) there is a significant amount of structure in this problem that existing models have yet to capture and (ii) there are rich domains in which machine learning may provide a viable approach to testing completeness (joint with Jon Kleinberg and Sendhil Mullainathan).\n\n\n11:30 am – 12:10 pm\nZak Stone \nVideo\nTitle: TensorFlow: Machine Learning for Everyone \nAbstract: We’ve witnessed extraordinary breakthroughs in machine learning over the past several years. 
What kinds of things are possible now that weren’t possible before? How are open-source platforms like TensorFlow and hardware platforms like GPUs and Cloud TPUs accelerating machine learning progress? If these tools are new to you\, how should you get started? In this session\, you’ll hear about all of this and more from Zak Stone\, the Product Manager for TensorFlow on the Google Brain team.\n\n\n12:10 pm – 1:30 pm\n\nLunch\n\n\n1:30 pm – 2:10 pm\nJann Spiess \nVideo\nTitle: (Machine) Learning to Control in Experiments \nAbstract: Machine learning focuses on high-quality prediction rather than on (unbiased) parameter estimation\, limiting its direct use in typical program evaluation applications. Still\, many estimation tasks have implicit prediction components. In this talk\, I discuss accounting for controls in treatment effect estimation as a prediction problem. In a canonical linear regression framework with high-dimensional controls\, I argue that OLS is dominated by a natural shrinkage estimator even for unbiased estimation when treatment is random; suggest a generalization that relaxes some parametric assumptions; and contrast my results with those for another implicit prediction problem\, namely the first stage of an instrumental variables regression.\n\n\n2:10 pm – 2:50 pm\nBradly Stadie\nTitle: Learning to Learn Quickly: One-Shot Imitation and Meta Learning \nAbstract: Many reinforcement learning algorithms are bottlenecked by data collection costs and the brittleness of their solutions when faced with novel scenarios.\nWe will discuss two techniques for overcoming these shortcomings. In one-shot imitation\, we train a module that encodes a single demonstration of a desired behavior into a vector containing the essence of the demo. This vector can subsequently be utilized to recover the demonstrated behavior. In meta-learning\, we optimize a policy under the objective of learning to learn new tasks quickly. 
We show meta-learning methods can be accelerated with the use of auxiliary objectives. Results are presented on grid worlds\, robotics tasks\, and video game playing tasks.\n\n\n2:50 pm – 3:20 pm\n\nBreak\n\n\n3:20 pm – 4:00 pm\nHau-Tieng Wu \nVideo\nTitle: When Medical Challenges Meet Modern Data Science \nAbstract: Adaptive acquisition of correct features from massive datasets is at the core of modern data analysis. One particular interest in medicine is the extraction of hidden dynamics from a single observed time series composed of multiple oscillatory signals\, which could be viewed as a single-channel blind source separation problem. The mathematical and statistical problems are made challenging by the structure of the signal\, which consists of non-sinusoidal oscillations with time-varying amplitude/frequency\, and by the heteroscedastic nature of the noise. In this talk\, I will discuss recent progress in solving this kind of problem by combining cepstrum-based nonlinear time-frequency analysis with manifold learning techniques. A particular solution will be given along with its theoretical properties. I will also discuss the application of this method to three medical problems – (1) the extraction of a fetal ECG signal from a single-lead maternal abdominal ECG signal; (2) the simultaneous extraction of the instantaneous heart/respiratory rate from a PPG signal during exercise; (3) (optional depending on time) an application to atrial fibrillation signals. If time permits\, the clinical trial results will be discussed.\n\n\n4:00 pm – 4:40 pm\nSifan Zhou \nVideo\nTitle: Citing People Like Me: Homophily\, Knowledge Spillovers\, and Continuing a Career in Science \nAbstract: Forward citation is widely used to measure the scientific merits of articles. 
This research studies millions of journal article citation records in life sciences from MEDLINE and finds that authors of the same gender\, the same ethnicity\, sharing common collaborators\, working in the same institution\, or being geographically close are more likely (and quicker) to cite each other than predicted by their proportion among authors working on the same research topics. This phenomenon reveals how social and geographic distances influence the quantity and speed of knowledge spillovers. Given the importance of forward citations in the academic evaluation system\, citation homophily potentially puts authors from minority groups at a disadvantage. I then show how it influences scientists’ chances to survive in academia and continue publishing. Based on joint work with Richard Freeman.\n\n\nTo view photos and video interviews from the conference\, please visit the CMSA blog. 
URL:https://cmsa.fas.harvard.edu/event/2017-big-data-conference-aug-18-19/
LOCATION:Harvard Science Center\, 1 Oxford Street\, Cambridge\, MA\, 02138
CATEGORIES:Big Data Conference,Conference,Event
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data-2017_2.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20160822T090000
DTEND;TZID=America/New_York:20160823T163000
DTSTAMP:20260502T013323Z
CREATED:20230717T171959Z
LAST-MODIFIED:20250328T144123Z
UID:10000017-1471856400-1471969800@cmsa.fas.harvard.edu
SUMMARY:2016 Big Data Conference & Workshop
DESCRIPTION:! LOCATION CHANGE: The conference will be in Science Center Hall C on Tuesday\, Aug. 23\, 2016.\nThe Center of Mathematical Sciences and Applications will be hosting a workshop on Big Data from August 12 – 21\, 2016\, followed by a two-day conference on Big Data from August 22 – 23\, 2016. \nThe Big Data Conference features many speakers from the Harvard Community as well as many scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics. This is the second conference on Big Data the Center will host as part of our annual events. The 2015 conference was a huge success. \nThe conference will be hosted at Harvard Science Center Hall A (Monday\, Aug. 22) & Hall C (Tuesday\, Aug. 23): 1 Oxford Street\, Cambridge\, MA 02138. \nThe 2016 Big Data conference is sponsored by the Center of Mathematical Sciences and Applications at Harvard University and the Alfred P. Sloan Foundation. \nConference Speakers:\n\nJörn Boehnke\, Harvard CMSA\nJoan Bruna\, UC Berkeley [Video]\nTamara Broderick\, MIT [Video]\nJustin Chen\, MIT [Video]\nYiling Chen\, Harvard University [Video]\nAmir Farbin\, UT Arlington [Video]\nDoug Finkbeiner\, Harvard University [Video]\nAndrew Gelman\, Columbia University [Video]\nNina Holden\, MIT [Video]\nElchanan Mossel\, MIT\nAlex Peysakhovich\, Facebook\nAlexander Rakhlin\, University of Pennsylvania [Video]\nNeal Wadhwa\, MIT [Video]\nJun Yin\, University of Wisconsin\nHarry Zhou\, Yale University [Video]\n\nPlease click Conference Program for a downloadable schedule with talk abstracts.\nConference Schedule:\n\n\n\nAugust 22 – Day 1\n\n\n8:30am\nBreakfast\n\n\n8:55am\nOpening remarks\n\n\n9:00am – 9:50am\nYiling Chen\, “Machine Learning with Strategic Data Sources” [Video]\n\n\n9:50am – 10:40am\nAndrew Gelman\, “Taking Bayesian Inference Seriously” [Video]\n\n\n10:40am – 11:10am\nBreak\n\n\n11:10am – 12:00pm\nHarrison Zhou\, “A General Framework for Bayes Structured Linear Models” 
[Video]\n\n\n12:00pm – 1:30pm\nLunch\n\n\n1:30pm – 2:20pm\nDouglas Finkbeiner\, “Mapping the Milky Way in 3D with star colors” [Video]\n\n\n2:20pm – 3:10pm\nNina Holden\, “Sparse exchangeable graphs and their limits” [Video]\n\n\n3:10pm – 3:40pm\nBreak\n\n\n3:40pm – 4:30pm\nAlex Peysakhovich\, “How social science methods inform personalization on Facebook News Feed” [Video]\n\n\n4:30pm – 5:20pm\nAmir Farbin\, “Deep Learning in High Energy Physics” [Video]\n\n\n\n\n\nAugust 23 – Day 2\n\n\n8:45am\nBreakfast\n\n\n9:00am – 9:50am\nJoan Bruna Estrach\, “Addressing Computational and Statistical Gaps with Deep Networks” [Video]\n\n\n9:50am – 10:40am\nJustin Chen & Neal Wadhwa\, “Smaller Than the Eye Can See: Big Engineering from Tiny Motions in Video” [Video]\n\n\n10:40am – 11:10am\nBreak\n\n\n11:10am – 12:00pm\nAlexander Rakhlin\, “How to Predict When Estimation is Hard: Algorithms for Learning on Graphs” [Video]\n\n\n12:00pm – 1:30pm\nLunch\n\n\n1:30pm – 2:20pm\nTamara Broderick\, “Fast Quantification of Uncertainty and Robustness with Variational Bayes” [Video]\n\n\n2:20pm – 3:10pm\nElchanan Mossel\, “Phylogenetic Reconstruction – a Rigorous Model of Deep Learning”\n\n\n3:10pm – 3:40pm\nBreak\n\n\n3:40pm – 4:30pm\nJörn Boehnke\, “Amazon’s Price and Sales-rank Data: What can one billion prices on 150 thousand products tell us about the economy?”\n\n\n\nWorkshop Participants:\nRichard Freeman’s Group: \n\nSen Chai\, ESSEC\nBrock Mendel\, Harvard University\nRaviv Murciano-Goroff\, Stanford University\nSifan Zhou\, CMSA\n\nScott Kominers’ Group: \n\nBradly Stadie\, UC Berkeley\nNeal Wadhwa\, MIT [Video]\nJustin Chen\n\nChristopher Rogan’s Group: \n\nAmir Farbin\, UT Arlington [Video]\nPaul Jackson\, University of Adelaide\n\nFor more information about the workshops\, please reach out directly to the individual group leaders. \n* This event is sponsored by CMSA at Harvard University and the Alfred P. Sloan Foundation. 
URL:https://cmsa.fas.harvard.edu/event/2016-big-data-conference-workshop/
LOCATION:Harvard Science Center\, 1 Oxford Street\, Cambridge\, MA\, 02138
CATEGORIES:Big Data Conference,Conference,Event,Workshop
ATTACH;FMTTYPE=image/png:https://cmsa.fas.harvard.edu/media/Big-Data_2016_2-1-2.png
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=America/New_York:20150824T084500
DTEND;TZID=America/New_York:20150826T160000
DTSTAMP:20260502T013323Z
CREATED:20230717T180044Z
LAST-MODIFIED:20250304T180628Z
UID:10000013-1440405900-1440604800@cmsa.fas.harvard.edu
SUMMARY:2015 Conference on Big Data
DESCRIPTION:The Center of Mathematical Sciences and Applications will be hosting a conference on Big Data August 24-26\, 2015\, in Science Center Hall B at Harvard University. This conference will feature many speakers from the Harvard Community as well as many scholars from across the globe\, with talks focusing on computer science\, statistics\, math and physics\, and economics.\n\nMonday\, August 24\n\n\n\nTime\nSpeaker\nTitle\n\n\n8:45am\nMeet and Greet\n\n\n\n9:00am\nSendhil Mullainathan\nPrediction Problems in Social Science: Applications of Machine Learning to Policy and Behavioral Economics\n\n\n9:45am\nMike Luca\nDesigning Disclosure for the Digital Age\n\n\n10:30am\nBreak\n\n\n\n10:45am\nJianqing Fan\nBig Data Big Assumption: Spurious discoveries and endogeneity\n\n\n11:30am\nDaniel Goroff\nPrivacy and Reproducibility in Data Science\n\n\n12:15pm\nBreak for Lunch\n\n\n\n2:00pm\nRyan Adams\nExact Markov Chain Monte Carlo with Large Data\n\n\n2:45pm\nDavid Dunson\nScalable Bayes: Simple algorithms with guarantees\n\n\n3:30pm\nBreak\n\n\n\n3:45pm\nMichael Jordan\nComputational thinking\, inferential thinking and Big Data\n\n\n4:30pm\nJoel Tropp\nApplied Random Matrix Theory\n\n\n5:15pm\nDavid Woodruff\nInput Sparsity and Hardness for Robust Subspace Approximation\n\n\n\nTuesday\, August 25\n\n\n\nTime\nSpeaker\nTitle\n\n\n8:45am\nMeet and Greet\n\n\n\n9:00am\nGunnar Carlsson\nPersistent homology for qualitative analysis and feature generation\n\n\n9:45am\nAndrea Montanari\nSemidefinite Programming Relaxations for Graph and Matrix Estimation: Algorithms and Phase Transitions\n\n\n10:30am\nBreak\n\n\n\n10:45am\nSusan Athey\nMachine Learning and Causal Inference for Policy Evaluation\n\n\n11:30am\nDenis Nekipelov\nRobust Empirical Evaluation of Large Competitive Markets\n\n\n12:15pm\nBreak for Lunch\n\n\n\n2:00pm\nLucy Colwell\nUsing evolutionary sequence variation to make inferences about protein structure and function: Modeling with Random Matrix 
Theory\n\n\n2:45pm\nSimona Cocco\nInverse Statistical Physics approaches for the modeling of protein families\n\n\n3:30pm\nBreak\n\n\n\n3:45pm\nRemi Monasson\nInference of top components of correlation matrices with prior informations\n\n\n4:30pm\nSayan Mukherjee\nRandom walks on simplicial complexes and higher order notions of spectral clustering\n\n\nA Banquet from 7:00 – 8:30pm will follow Tuesday’s talks. This event is by invitation only. \nWednesday\, August 26\n\n\n\nTime\nSpeaker\nTitle\n\n\n8:45am\nMeet and Greet\n\n\n\n9:00am\nAnkur Moitra\nBeyond Matrix Completion\n\n\n9:45am\nFlorent Krzakala\nOptimal compressed sensing with spatial coupling and message passing\n\n\n10:30am\nBreak\n\n\n\n10:45am\nPiotr Indyk\nFast Algorithms for Structured Sparsity\n\n\n11:30am\nGuido Imbens\nExact p-values for network inference\n\n\n12:15pm\nBreak for lunch\n\n\n\n2:00pm\nEdo Airoldi\nSome fundamental ideas for causal inference on large networks\n\n\n2:45pm\nRonitt Rubinfeld\nSomething for almost nothing: sublinear time approximation algorithms\n\n\n3:30pm\nBreak\n\n\n\n3:45pm\nLenka Zdeborova\nClustering of sparse networks: Phase transitions and optimal algorithms\n\n\n4:30pm\nJelani Nelson\nDimensionality reductions via sparse matrices
URL:https://cmsa.fas.harvard.edu/event/conference-on-big-data-august-24-26-2015/
LOCATION:Harvard Science Center\, 1 Oxford Street\, Cambridge\, MA\, 02138
CATEGORIES:Big Data Conference,Conference,Event
END:VEVENT
END:VCALENDAR