Learning and inference from sensitive data

02/02/2022 9:30 am - 10:30 am

Abstract: Consider an agency holding a large database of sensitive personal information, such as medical records, census survey answers, web searches, or genetic data. The agency would like to discover and publicly release global characteristics of the data while protecting the privacy of individuals’ records.

I will discuss recent (and not-so-recent) results on this problem, with a focus on the release of statistical models. I will first explain some of the fundamental limitations on the release of machine learning models: specifically, why such models must sometimes memorize training data points nearly completely. On the more positive side, I will present differential privacy, a rigorous definition of privacy in statistical databases that is now widely studied and increasingly used to analyze and design deployed systems. I will explain some of the challenges of sound statistical inference based on differentially private statistics, and lay out directions for future investigation.