Big Data Conference 2024
September 6, 2024 @ 9:00 am - September 7, 2024 @ 5:00 pm
On September 6-7, 2024, the CMSA hosted the tenth annual Conference on Big Data. The Big Data Conference features speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, math and physics, and economics.
Location: Harvard University CMSA, 20 Garden Street, Cambridge & via Zoom
Speakers:
- Tianxi Cai, Harvard Chan School
- Raj Chetty, Harvard
- Bianca Dumitrascu, Columbia
- Boris Hanin, Princeton
- Peter Hull, Brown
- Jamie Morgenstern, U Washington
- Kavita Ramanan, Brown
- Neil Thompson, MIT
- Melanie Weber, Harvard
- Kun-Hsing Yu, Harvard Medical School
Organizers:
- Rediet Abebe, Harvard Society of Fellows
- Morgane Austern, Harvard University Statistics
- Michael R. Douglas, Harvard CMSA
- Yannai Gonczarowski, Harvard University Economics and Computer Science
- Sam Kou, Harvard University Statistics
SCHEDULE
Friday, Sep. 6, 2024
9:00 am: Breakfast
9:30 am: Introductions
9:45–10:45 am
Speaker: Peter Hull, Brown University
Title: Measuring Discrimination in Multi-Phase Systems, with an Application to Child Protection
Abstract: Large racial disparities have been documented in many high-stakes settings—such as employment, health care, housing, and criminal justice—raising concerns of discrimination by individual decision-makers. At the same time, there is growing understanding that a focus on individual decisions can yield an incomplete view of discrimination; an extensive theoretical literature shows how discrimination can arise and compound across multiple decision-makers in interconnected systems. We develop new empirical tools for studying discrimination in such multi-phase systems and apply them to the setting of foster care placement by child protective services. Leveraging the quasi-random assignment of two sets of decision-makers—initial hotline call screeners and subsequent investigators—we study how unwarranted racial disparities arise and propagate through this system. Using a sample of over 200,000 maltreatment allegations, we find that calls involving Black children are 55% more likely to result in foster care placement than calls involving white children with the same potential for future maltreatment in the home. Call screeners account for up to 19% of this unwarranted disparity, with the remainder due to investigators. Unwarranted disparity is concentrated in cases with potential for future maltreatment, suggesting that white children may be harmed by “underplacement” in high-risk situations.
10:45–11:00 am: Break
11:00 am–12:00 pm
Speaker: Jamie Morgenstern, U Washington
Title: What governs predictive disparity in modern machine learning applications?
Abstract: The deployment of statistical models in impactful environments is far from new—simple correlations have been used to guide decisions throughout the sciences, health care, political campaigns, and in pricing financial instruments and other products for decades. Many such models, and the decisions they supported, were known to have different degrees of predictive power for different demographic groups. These differences had numerous sources, including: limited expressiveness of the statistical models; limited availability of data from marginalized populations; noisier measurements of both features and targets from certain populations; and features with less mutual information about the prediction target for some populations than others.
Modern decision systems which use machine learning are more ubiquitous than ever, as are their differences in performance for different populations of people. In this talk, I will discuss some similarities and differences in the sources of differing performance in contemporary ML systems including facial recognition systems and those incorporating generative AI.
12:00–1:30 pm: Lunch Break
1:30–2:30 pm
Speaker: Kavita Ramanan, Brown University
Title: Understanding High-dimensional Stochastic Dynamics on Realistic Networks
Abstract: Large collections of randomly evolving particles that interact locally with respect to an underlying network model a variety of phenomena, including magnetism, the spread of diseases, neural and neuronal networks, opinion dynamics, and load balancing on computer networks. Due to their high-dimensional nature, these systems are typically intractable to analyze exactly. Classical work, falling under the rubric of mean-field approximations, has mostly focused on the case when this interaction graph is dense. However, most real-world networks are sparse and often random. We describe a new approach to develop principled approximations for dynamics on realistic networks that beats the curse of dimensionality, and illustrate its efficacy on a class of epidemiological models. This is based on joint work with Michel Davydov, Ankan Ganguly and Juniper Cocomello.
2:30–2:45 pm: Break
2:45–3:45 pm
Speaker: Raj Chetty, Harvard University
Title: The Science of Economic Opportunity: New Insights from Big Data
Abstract: How can we improve economic opportunities for children growing up in low-income families? This talk will present findings from a recent set of studies that use various sources of big data — ranging from anonymized tax records to social network data — to understand the science of economic opportunity. Among other topics, the talk will discuss how and why children’s chances of climbing the income ladder vary across neighborhoods, the drivers of racial disparities in economic mobility, how highly selective colleges may amplify the persistence of privilege, and the role of social capital as a driver of upward mobility. The talk will conclude by giving examples of how academic research using big data is informing policy decisions from the local to federal level to expand opportunities for all.
3:45–4:00 pm: Break
4:00–5:00 pm
Speaker: Neil Thompson, MIT
Title: How Algorithmic Progress is driving progress in Big Data and AI
Abstract: Algorithm improvement is one of the purest forms of innovation: it allows the same computational task to be achieved with far fewer resources by proposing clever new ways to do that computation. In this talk, I will discuss the work that my lab has done tracking and quantifying progress across decades of algorithm research and practice. As I will show, this algorithmic progress has often outpaced hardware improvement as the most important driver of progress in Big Data and AI.
Saturday, Sep. 7, 2024
9:00 am: Breakfast
9:30 am: Introductions
9:45–10:45 am
Speaker: Tianxi Cai, Harvard Chan School
Title: Crowdsourcing with Multi-institutional EHR to Improve Reliability of Real World Evidence – Opportunities and Challenges
Abstract: The wide adoption of electronic health record (EHR) systems has made large clinical datasets available for discovery research. EHR data, linked with biorepositories, are a valuable new source for deriving real-world, data-driven prediction models of disease risk and progression. Yet they also bring analytical difficulties, especially when aiming to leverage multi-institutional EHR data. Synthesizing information across healthcare systems is challenging due to heterogeneity and privacy. Statistical challenges also arise due to high dimensionality in the feature space. In this talk, I’ll discuss analytical approaches for mining EHR data to improve the reliability and generalizability of real world evidence generated from the analyses. These methods will be illustrated using EHR data from Mass General Brigham and the Veterans Health Administration.
10:45–11:00 am: Break
11:00 am–12:00 pm
Speaker: Bianca Dumitrascu, Columbia Data Science Institute
Title: Statistical machine learning for learning representations of embryonic development
Abstract: During embryonic development, single cells read in local information from their environments and use this information to move, divide and specialize. As a result, the environments themselves change. However, it remains unclear how gene expression programs interact with cell morphology and mechanical forces to orchestrate organogenesis in early embryos. Recent advances in single cell techniques and in toto imaging open unique avenues for exploring this link between genomics and biophysics, which dynamically maps cells to organisms.
In this talk, I will describe statistical machine learning frameworks aimed at understanding how tissue-level mechanical and morphometric information impacts gene expression patterns in spatio-temporal contexts. We use these tools to understand boundary formation in the early development of mouse embryos and to align data from light sheet recordings of pre-gastrulation development.
12:00–1:30 pm: Lunch Break
1:30–2:30 pm
Speaker: Melanie Weber, Harvard Mathematics
Title: Data and Model Geometry in Deep Learning
Abstract: Data with geometric structure is ubiquitous in machine learning. Often such structure arises from fundamental symmetries in the domain, such as permutation-invariance in graphs and sets, and translation-invariance in images. In this talk we discuss implications of this structure on the design and complexity of neural networks. Equivariant architectures, which encode symmetries as inductive bias, have shown great success in applications with geometric data, but can suffer from instabilities as their depth increases. We propose a new architecture based on unitary group convolutions, which allows for deeper networks with less instability. In the second part of the talk we discuss the impact of data and model geometry on the learnability of neural networks. We discuss learnability in several geometric settings, including equivariant neural networks, as well as learnability with respect to the geometry of the input data manifold.
2:30–2:45 pm: Break
2:45–3:45 pm
Speaker: Boris Hanin, Princeton University
Title: Scaling Limits of Neural Networks
Abstract: Neural networks are often studied analytically through scaling limits: regimes in which taking some structural network parameters (e.g. depth, width, number of training datapoints, and so on) to infinity results in simplified models of learning. I will motivate and discuss recent results using several such approaches. I will emphasize both new theoretical insights into how model, training data, and optimizer impact learning and their practical implications for hyperparameter transfer.
3:45–4:00 pm: Break
4:00–5:00 pm
Speaker: Kun-Hsing Yu, Harvard Medical School
Title: Foundation Models for Real-Time Cancer Diagnosis
Abstract: Artificial intelligence (AI) is transforming the landscape of medical research and practice. Recent advances in microscopic image digitization, foundation models, and scalable computing infrastructure have opened new avenues for AI-enhanced cancer diagnosis. In this talk, I will highlight recent breakthroughs in multi-modal AI systems for cancer pathology evaluation, discuss integrative biomedical informatics methods that link cell morphology with molecular profiles, and outline critical challenges in developing robust medical AI systems.