Big Data Conference 2021

On August 24, 2021, the CMSA hosted its seventh annual Conference on Big Data. The conference featured speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, mathematics and physics, and economics.

The 2021 Big Data Conference took place virtually on Zoom.

Organizers: 

  • Shing-Tung Yau, William Caspar Graustein Professor of Mathematics, Harvard University
  • Scott Duke Kominers, MBA Class of 1960 Associate Professor, Harvard Business School
  • Horng-Tzer Yau, Professor of Mathematics, Harvard University
  • Sergiy Verstyuk, CMSA, Harvard University

Speakers:

All times are ET (Boston time).
9:00 AM: Conference Organizers – Introduction and Welcome
9:10 – 9:55 AM: Andrew Blumberg, University of Texas at Austin
Title: Robustness and stability for multidimensional persistent homology

Abstract: A basic principle in topological data analysis is to study the shape of data by looking at multiscale homological invariants. The idea is to filter the data using a scale parameter that reflects feature size. However, for many data sets, it is very natural to consider multiple filtrations, for example coming from feature scale and density. A key question that arises is how such invariants behave with respect to noise and outliers. This talk will describe a framework for understanding those questions and explore open problems in the area.
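As a concrete illustration of the single-parameter case, here is a minimal sketch that computes a persistence diagram for a noisy point cloud. It assumes the ripser and numpy packages are installed, and it filters by scale only; the multiparameter (scale plus density) setting the talk concerns is beyond this example.

```python
# Minimal sketch: single-parameter persistent homology of a noisy circle.
# Assumes the `ripser` and `numpy` packages are installed; the talk's
# multiparameter (scale + density) filtrations are not attempted here.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)

# Sample 200 points from a circle with Gaussian noise, plus a few outliers.
theta = rng.uniform(0, 2 * np.pi, 200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
noisy = circle + rng.normal(scale=0.05, size=circle.shape)
outliers = rng.uniform(-2, 2, size=(10, 2))
X = np.vstack([noisy, outliers])

# Vietoris-Rips persistence: the scale parameter is the filtration.
dgms = ripser(X, maxdim=1)['dgms']

# dgms[1] lists (birth, death) pairs for 1-dimensional features (loops);
# one long-lived pair reflects the circle, short-lived pairs reflect noise.
lifetimes = dgms[1][:, 1] - dgms[1][:, 0]
print("longest H1 lifetime:", lifetimes.max())
```

Even in this single-parameter setting, the handful of outliers can distort the diagram, which is one motivation for adding a density filtration and for the robustness questions above.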

10:00 – 10:45 AM: Katrina Ligett, The Hebrew University of Jerusalem
Title: Privacy as Stability, for Generalization

Abstract: Many data analysis pipelines are adaptive: the choice of which analysis to run next depends on the outcome of previous analyses. Common examples include variable selection for regression problems and hyper-parameter optimization in large-scale machine learning problems: in both cases, common practice involves repeatedly evaluating a series of models on the same dataset. Unfortunately, this kind of adaptive re-use of data invalidates many traditional methods of avoiding overfitting and false discovery, and has been blamed in part for the recent flood of non-reproducible findings in the empirical sciences. An exciting line of work beginning with Dwork et al. in 2015 establishes the first formal model and first algorithmic results providing a general approach to mitigating the harms of adaptivity, via a connection to the notion of differential privacy. In this talk, we’ll explore the notion of differential privacy and gain some understanding of how and why it provides protection against adaptivity-driven overfitting. Many interesting questions in this space remain open.

Joint work with: Christopher Jung (UPenn), Seth Neel (Harvard), Aaron Roth (UPenn), Saeed Sharifi-Malvajerdi (UPenn), and Moshe Shenfeld (HUJI). This talk will draw on work that appeared at NeurIPS 2019 and ITCS 2020.
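For intuition about the privacy-to-generalization connection, the following is a minimal sketch of the Laplace mechanism answering adaptively chosen statistical queries. The function name and parameter values are illustrative assumptions, not taken from the papers above.

```python
# Minimal sketch of the Laplace mechanism for answering adaptive
# statistical queries with differential privacy. Names and parameters
# are illustrative, not from the papers discussed in the talk.
import numpy as np

rng = np.random.default_rng(1)

def private_mean(data, epsilon):
    """Answer a [0, 1]-bounded mean query with epsilon-differential privacy.

    The sensitivity of a mean over n records bounded in [0, 1] is 1/n,
    so Laplace noise with scale 1 / (n * epsilon) suffices.
    """
    n = len(data)
    return float(np.mean(data)) + rng.laplace(scale=1.0 / (n * epsilon))

# An adaptive analyst: each query may depend on earlier noisy answers,
# but the added noise limits how much any answer can overfit the sample.
data = rng.uniform(0, 1, size=1000)
answer1 = private_mean(data, epsilon=0.5)
threshold = answer1  # second query chosen adaptively from the first answer
answer2 = private_mean(data > threshold, epsilon=0.5)
print(answer1, answer2)
```

The point of the transfer theorems in this line of work is that answers produced this way remain close to their population values even under such adaptive re-use of the same dataset.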

10:50 – 11:35 AM: Hima Lakkaraju, Harvard University
Title: Towards Reliable and Robust Model Explanations

Abstract: As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this talk, I will present some of our recent research that sheds light on the vulnerabilities of popular post hoc explanation techniques such as LIME and SHAP, and also introduce novel methods to address some of these vulnerabilities. More specifically, I will first demonstrate that these methods are brittle, unstable, and vulnerable to a variety of adversarial attacks. Then, I will discuss two solutions that address some of these vulnerabilities: (i) a framework based on adversarial training designed to make post hoc explanations more stable and robust to shifts in the underlying data, and (ii) a Bayesian framework that captures the uncertainty associated with post hoc explanations and thereby allows us to generate explanations with user-specified levels of confidence. I will conclude by discussing results on real-world datasets that demonstrate both the vulnerabilities of post hoc explanation techniques and the efficacy of the aforementioned solutions.
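To make the object of study concrete, here is a minimal LIME-style sketch: a locally weighted linear surrogate fit around a single instance of a black-box model. It assumes numpy and scikit-learn and illustrates the idea rather than the LIME package itself; the toy model and parameter choices are assumptions.

```python
# Minimal LIME-style sketch: fit a locally weighted linear surrogate to
# a black-box model around one instance. Illustrative only; this is not
# the LIME package's implementation.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

def black_box(X):
    # Stand-in for any opaque model returning a scalar score per row.
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def local_explanation(x, n_samples=500, sigma=0.5):
    """Return surrogate coefficients approximating black_box near x."""
    # Perturb the instance and query the black box at the perturbations.
    Z = x + rng.normal(scale=sigma, size=(n_samples, x.shape[0]))
    y = black_box(Z)
    # Weight perturbations by proximity to x (Gaussian kernel).
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma ** 2))
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    return surrogate.coef_  # per-feature local importance

print(local_explanation(np.array([0.0, 1.0])))
```

Rerunning this with a different random seed can change the reported coefficients, which is exactly the kind of instability in perturbation-based explanations that the talk examines.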

11:40 AM – 12:25 PM: Moran Koren, Harvard CMSA
Title: A Gatekeeper’s Conundrum

Abstract: Many selection processes contain a “gatekeeper.” The gatekeeper’s goal is to examine an applicant’s suitability for a proposed position before both parties endure substantial costs. Intuitively, introducing a gatekeeper should reduce selection costs, as unlikely applicants are sifted out. However, we show that this is not always the case: the gatekeeper’s introduction inadvertently reduces the applicant’s expected costs and thus interferes with her self-selection. We study the conditions under which the gatekeeper’s presence improves the system’s efficiency and those under which it induces inefficiency. Additionally, we show that the gatekeeper can sometimes improve selection correctness by behaving strategically (i.e., by ignoring her private information with some probability).
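The self-selection effect can be seen in a back-of-the-envelope calculation. The payoff model and parameter values in the sketch below are assumptions made for exposition, not the paper’s specification.

```python
# Illustrative numeric sketch of the self-selection effect described in
# the abstract. The payoff model and parameter values are assumptions
# made for exposition, not the paper's model.
import numpy as np

V = 1.0  # applicant's payoff from a suitable match (assumed)
c = 0.4  # cost the applicant endures in the costly selection stage (assumed)
q = 0.8  # gatekeeper's screening accuracy (assumed)

# Without a gatekeeper, an applicant with suitability belief p applies
# iff p * V - c >= 0, i.e. iff p >= c / V.
threshold_no_gate = c / V

# With a gatekeeper who passes suitable applicants with probability q and
# unsuitable ones with probability 1 - q, the costly stage is reached only
# upon passing, so the applicant applies iff
#   p*q*V - (p*q + (1 - p)*(1 - q)) * c >= 0.
p = np.linspace(0.0, 1.0, 100001)
payoff_with_gate = p * q * V - (p * q + (1 - p) * (1 - q)) * c
threshold_gate = p[np.argmax(payoff_with_gate >= 0)]

print(f"apply threshold without gate: {threshold_no_gate:.3f}")
print(f"apply threshold with gate:    {threshold_gate:.3f}")
# The gate lowers the threshold (about 0.143 vs. 0.4 here): weaker
# applicants now apply, which is the interference with self-selection
# that the talk studies.
```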

12:25 PM: Conference Organizers – Closing Remarks