The Center of Mathematical Sciences and Applications will be hosting a conference on Big Data on August 18 – 19, 2017, in the Science Center at Harvard University.
The Big Data Conference features many speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, math and physics, and economics. This is the third conference on Big Data the Center will host as part of our annual events, and is co-organized by Richard Freeman, Scott Kominers, Jun Liu, Horng-Tzer Yau and Shing-Tung Yau.
To register for this event, please click here. For a list of lodging options convenient to the Center, please visit our recommended lodgings page.
Confirmed Speakers:
 Mohammad Akbarpour, Stanford University
 Albert-László Barabási, Northeastern University
 Noureddine El Karoui, University of California, Berkeley
 Ravi Jagadeesan, Harvard University
 Lucas Janson, Harvard University
 Tracy Ke, University of Chicago
 Tze Leung Lai, Stanford University
 Annie Liang, University of Pennsylvania
 Marena Lin, Harvard University
 Nikhil Naik, Harvard University
 Natesh Pillai, Harvard University
 Jann Spiess, Harvard University
 Bradly Stadie, University of California, Berkeley
 Zak Stone, Google
 Hau-Tieng Wu, University of Toronto
 Sifan Zhou, Xiamen University
August 18, Friday (Full day)
Time  Speaker  Topic 
8:30 am – 9:00 am  Breakfast  
9:00 am – 9:40 am  Mohammad Akbarpour  Title: Big data is not good data: A theory of social learning in dynamic environments.
Abstract: We study a model of social learning with “overlapping generations”, where agents meet others and share data about an underlying state over time. We examine under what conditions the society will produce individuals with precise knowledge about the state of the world. There are two information sharing regimes in our model: Under the full information sharing technology, individuals exchange the information about their point estimates of an underlying state, as well as their sources (or the precision of their signals), and update their beliefs by taking a weighted average. Under the limited information sharing technology, agents only observe the information about the point estimates of those they meet, and update their beliefs by taking a weighted average, where weights can depend on the sequence of meetings, as well as the labels. Our main result shows that, unlike in most social learning settings, using such linear learning rules does not guide the society (or even a fraction of its members) to learn the truth, and that having access to, and exploiting, knowledge of the precision of a source signal is essential for efficient social learning (joint with Amin Saberi & Ali Shameli).
9:40 am – 10:20 am  Lucas Janson  Title: Model-Free Knockoffs for High-Dimensional Controlled Variable Selection
Abstract: Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively control the fraction of false discoveries even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new framework of model-free knockoffs, which reads from a different perspective the knockoff procedure (Barber and Candès, 2015) originally designed for controlling the false discovery rate in linear models. The key innovation of our method is to construct knockoff variables probabilistically instead of geometrically. This enables model-free knockoffs to deal with arbitrary (and unknown) conditional models and any dimensions, including when the dimensionality p exceeds the sample size n, while the original knockoffs procedure is constrained to homoscedastic linear models with n greater than or equal to p. Our approach requires the design matrix be random (independent and identically distributed rows) with a covariate distribution that is known, although we show our procedure to be robust to unknown/estimated distributions. As we require no knowledge/assumptions about the conditional distribution of the response, we effectively shift the burden of knowledge from the response to the covariates, in contrast to the canonical model-based approach which assumes a parametric model for the response but very little about the covariates. To our knowledge, no other procedure solves the controlled variable selection problem in such generality, but in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations.
Finally, we apply our procedure to data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
10:20 am – 10:50 am  Break  
10:50 am – 11:30 am  Noureddine El Karoui  Title: Random matrices and high-dimensional statistics: beyond covariance matrices
Abstract: Random matrices have played a central role in understanding very important statistical methods linked to covariance matrices (such as Principal Components Analysis, Canonical Correlation Analysis, etc.) for several decades. In this talk, I’ll show that one can adopt a random-matrix-inspired point of view to understand the performance of other widely used tools in statistics, such as M-estimators, and very common methods such as the bootstrap. I will focus on the high-dimensional case, which captures well the situation of “moderately” difficult statistical problems, arguably one of the most relevant in practice. In this setting, I will show that random matrix ideas help upend conventional theoretical thinking (for instance about maximum likelihood methods) and highlight very serious practical problems with resampling methods.
11:30 am – 12:10 pm  Tracy Ke  Title: A new SVD approach to optimal topic estimation
Abstract: In probabilistic topic models, the quantity of interest, a low-rank matrix consisting of topic vectors, is hidden in the text corpus matrix, masked by noise, and Singular Value Decomposition (SVD) is a potentially useful tool for learning such a low-rank matrix. However, the connection between this low-rank matrix and the singular vectors of the text corpus matrix is usually complicated and hard to spell out, so using SVD to learn topic models is challenging. We overcome the challenge by revealing a surprising insight: there is a low-dimensional simplex structure which can be viewed as a bridge between the low-rank matrix of interest and the SVD of the text corpus matrix, and which allows us to conveniently reconstruct the former using the latter. Such an insight motivates a new SVD-based approach to learning topic models. For asymptotic analysis, we show that under a popular topic model (Hofmann, 1999), the convergence rate of the ℓ1-error of our method matches that of the minimax lower bound, up to a multi-logarithmic term. In showing these results, we have derived new element-wise bounds on the singular vectors and several large deviation bounds for weakly dependent multinomial data. Our results on the convergence rate and asymptotic minimaxity are new. We have applied our method to two data sets, Associated Press (AP) and Statistics Literature Abstract (SLA), with encouraging results. In particular, there is a clear simplex structure associated with the SVD of the data matrices, which largely validates our discovery.
12:10 pm – 1:30 pm  Lunch  
1:30 pm – 2:10 pm  Nikhil Naik  TBA 
2:10 pm – 2:50 pm  Albert-László Barabási  Title: Taming Complexity: From Network Science to Controlling Networks
Abstract: The ultimate proof of our understanding of biological or technological systems is reflected in our ability to control them. While control theory offers mathematical tools to steer engineered and natural systems towards a desired state, we lack a framework to control complex self-organized systems. Here we explore the controllability of an arbitrary complex network, identifying the set of driver nodes whose time-dependent control can guide the system’s entire dynamics. We apply these tools to several real networks, unveiling how the network topology determines its controllability. Virtually all technological and biological networks must be able to control their internal processes. Given that, issues related to control deeply shape the topology and the vulnerability of real systems. Consequently, unveiling the control principles of real networks, the goal of our research, forces us to address a series of fundamental questions pertaining to our understanding of complex systems.
2:50 pm – 3:20 pm  Break  
3:20 pm – 4:00 pm  Marena Lin  Title: Optimizing climate variables for human impact studies
Abstract: Estimates of the relationship between climate variability and socioeconomic outcomes are often limited by the spatial resolution of the data. As studies aim to generalize the connection between climate and socioeconomic outcomes across countries, the best available socioeconomic data is at the national level (e.g. food production quantities, the incidence of warfare, averages of crime incidence, gender birth ratios). While these statistics may be trusted from government censuses, the appropriate metric for the corresponding climate or weather for a given year in a country is less obvious. For example, how do we estimate the temperatures in a country relevant to national food production and therefore food security? We demonstrate that high-resolution spatiotemporal satellite data for vegetation can be used to estimate the weather variables that may be most relevant to food security and related socioeconomic outcomes. In particular, satellite proxies for vegetation over the African continent reflect the seasonal movement of the Intertropical Convergence Zone, a band of intense convection and rainfall. We also show that agricultural sensitivity to climate variability differs significantly between countries. This work is an example of the ways in which in-situ and satellite-based observations are invaluable to both estimates of future climate variability and to continued monitoring of the earth-human system. We discuss the current state of these records and potential challenges to their continuity.
4:00 pm – 4:40 pm  Tze Leung Lai  Title: Gradient boosting: Its role in big data analytics, underlying mathematical theory, and recent refinements 
August 19, Saturday (Full day)
Time  Speaker  Topic 
8:30 am – 9:00 am  Breakfast  
9:00 am – 9:40 am  Natesh Pillai  TBA 
9:40 am – 10:20 am  Ravi Jagadeesan  Title: Designs for estimating the treatment effect in networks with interference
Abstract: In this paper we introduce new, easily implementable designs for drawing causal inference from randomized experiments on networks with interference. Inspired by the idea of matching in observational studies, we introduce the notion of considering a treatment assignment as a “quasi-coloring” on a graph. Our idea of a perfect quasi-coloring strives to match every treated unit on a given network with a distinct control unit that has an identical number of treated and control neighbors. For a wide range of interference functions encountered in applications, we show both by theory and simulations that the classical Neymanian estimator for the direct effect has desirable properties for our designs. This further extends to settings where homophily is present in addition to interference.
10:20 am – 10:50 am  Break  
10:50 am – 11:30 am  Annie Liang  Title: The Theory is Predictive, but is it Complete? An Application to Human Generation of Randomness
Abstract: When we test a theory using data, it is common to focus on correctness: do the predictions of the theory match what we see in the data? But we also care about completeness: how much of the predictable variation in the data is captured by the theory? This question is difficult to answer, because in general we do not know how much “predictable variation” there is in the problem. In this paper, we consider approaches motivated by machine learning algorithms as a means of constructing a benchmark for the best attainable level of prediction. We illustrate our methods on the task of predicting human-generated random sequences. Relative to an atheoretical machine learning algorithm benchmark, we find that existing behavioral models explain roughly 15 percent of the predictable variation in this problem. This fraction is robust across several variations on the problem. We also consider a version of this approach for analyzing field data from domains in which human perception and generation of randomness has been used as a conceptual framework; these include sequential decision-making and repeated zero-sum games. In these domains, our framework for testing the completeness of theories provides a way of assessing their effectiveness over different contexts; we find that despite some differences, the existing theories are fairly stable across our field domains in their performance relative to the benchmark. Overall, our results indicate that (i) there is a significant amount of structure in this problem that existing models have yet to capture and (ii) there are rich domains in which machine learning may provide a viable approach to testing completeness.
11:30 am – 12:10 pm  Zak Stone  TBA 
12:10 pm – 1:30 pm  Lunch  
1:30 pm – 2:10 pm  Jann Spiess  Title: (Machine) Learning to Control in Experiments
Abstract: Machine learning focusses on high-quality prediction rather than on (unbiased) parameter estimation, limiting its direct use in typical program evaluation applications. Still, many estimation tasks have implicit prediction components. In this talk, I discuss accounting for controls in treatment effect estimation as a prediction problem. In a canonical linear regression framework with high-dimensional controls, I argue that OLS is dominated by a natural shrinkage estimator even for unbiased estimation when treatment is random; suggest a generalization that relaxes some parametric assumptions; and contrast my results with those for another implicit prediction problem, namely the first stage of an instrumental variables regression.
2:10 pm – 2:50 pm  Bradly Stadie  TBA 
2:50 pm – 3:20 pm  Break  
3:20 pm – 4:00 pm  Hau-Tieng Wu  Title: When Medical Challenges Meet Modern Data Science
Abstract: Adaptive acquisition of correct features from massive datasets is at the core of modern data analysis. One particular interest in medicine is the extraction of hidden dynamics from a single observed time series composed of multiple oscillatory signals, which could be viewed as a single-channel blind source separation problem. The mathematical and statistical problems are made challenging by the structure of the signal, which consists of non-sinusoidal oscillations with time-varying amplitude/frequency, and by the heteroscedastic nature of the noise. In this talk, I will discuss recent progress in solving this kind of problem by combining cepstrum-based nonlinear time-frequency analysis and manifold learning techniques. A particular solution will be given along with its theoretical properties. I will also discuss the application of this method to the following medical problems: (1) the extraction of a fetal ECG signal from a single-lead maternal abdominal ECG signal; (2) the simultaneous extraction of the instantaneous heart/respiratory rate from a PPG signal during exercise; and (3) (optional, depending on time) an application to atrial fibrillation signals. If time permits, the clinical trial results will be discussed.
4:00 pm – 4:40 pm  Sifan Zhou  Title: Citing People Like Me: Homophily, Knowledge Spillovers, and Continuing a Career in Science
Abstract: Forward citation is widely used to measure the scientific merits of articles. This research studies millions of journal article citation records in life sciences from MEDLINE and finds that authors of the same gender, the same ethnicity, sharing common collaborators, working in the same institution, or being geographically close are more likely to cite each other, and to do so more quickly, than predicted by their proportion among authors working on the same research topics. This phenomenon reveals how social and geographic distances influence the quantity and speed of knowledge spillovers. Given the importance of forward citations in academic evaluation systems, citation homophily potentially puts authors from minority groups at a disadvantage. I then show how it influences scientists’ chances to survive in academia and continue publishing. Based on joint work with Richard Freeman.