On August 23-24, 2018, the CMSA will be hosting our fourth annual Conference on Big Data. The Conference will feature many speakers from the Harvard community as well as scholars from across the globe, with talks focusing on computer science, statistics, math and physics, and economics.
The talks will take place in Science Center Hall B, 1 Oxford Street.
For a list of lodging options convenient to the Center, please visit our recommended lodgings page.
Please note that lunch will not be provided during the conference, but a map of Harvard Square with a list of local restaurants can be found by clicking Map & Restaurants.
Thursday, August 23
|8:30 – 9:00 am||Breakfast|
|9:00 – 9:40 am||Alex Teytelboym||Title: A New Interpretation of the Economic Complexity Index (with Penny Mealy and J. Doyne Farmer)
Abstract: Analysis of properties of the global trade network has generated new insights into the patterns of economic development across countries. The Economic Complexity Index (ECI) in particular, has been successful at explaining cross-country differences in GDP per capita and economic growth. The ECI aims to infer information about countries’ productive capabilities by making relative comparisons across countries’ export baskets. Previous studies have emphasised the link between ECI and ‘diversity’ (the number of exports that a country has revealed comparative advantage in). We show that the ECI is equivalent to a spectral clustering algorithm, which partitions a similarity graph into two parts. When applied to country-export data, the ECI represents a ranking of countries that places countries with similar exports close together in the ordering. More generally, the ECI is a dimension reduction tool, which gives the optimal one-dimensional ordering that minimizes the distance between nodes in a similarity graph. We illustrate stark differences between the ECI and diversity with two empirical examples based on regional data. We discuss this new interpretation of the ECI with reference to the economic development literature.
|9:40 – 10:20am||Libby Mishkin|| Title: Using industry data for economics research
Abstract: Ever wondered what it’s like to do economics research in a data-rich industry like tech? I’ll describe Uber’s approach to research and academic collaborations and share some examples of past work, with an emphasis on the advantages and pitfalls of using data that is a byproduct of business processes.
|10:20 – 10:50am||Break|
|10:50 – 11:30am||Mohammad Akbarpour||Title: “Just a few seeds more: Value of network data for diffusion”
Abstract: Identifying the optimal set of individuals to first receive information (“seeds”) in a social network is a widely-studied question in many settings, such as the diffusion of information, microfinance programs, and new technologies. Numerous studies have proposed various network-centrality based heuristics to choose seeds in a way that is likely to boost diffusion. Here we show that, for some frequently studied diffusion processes, randomly seeding s plus x individuals can prompt a larger cascade than optimally targeting the best s individuals, for a small x. We prove our results for large classes of random networks, but also show that they hold in simulations over several real-world networks. This suggests that returns to collecting and analyzing network data to identify the optimal seeds may not be economically significant. Given these findings, practitioners interested in communicating a message to a large number of people may wish to compare the cost of network-based targeting to that of slightly expanding initial outreach.
|11:30 – 12:10pm||Emily Breza||Title: “Using Aggregated Relational Data to feasibly identify network structure without network data”
Abstract: Social network data is often prohibitively expensive to collect, limiting empirical network research. We propose an inexpensive and feasible strategy for network elicitation using Aggregated Relational Data (ARD) — responses to questions of the form “How many of your links have trait k?” Our method uses ARD to recover parameters of a network formation model, which permits the estimation of any arbitrary node- or graph-level statistic. We characterize both theoretically and empirically for which network features the procedure works. In both simulated and real-world graphs, the method performs well at matching a range of network characteristics. We replicate the results of two field experiments that used network data and draw similar conclusions with ARD alone.
(with Arun Chandrasekhar, Tyler McCormick, and Mengjie Pan)
|12:10 – 1:30pm||Lunch|
|1:30 – 2:10pm||Kobi Gal||Title: Intelligent Interventions in Peer Production
Abstract: Peer production sites such as Wikipedia, citizen science and e-learning platforms depend critically on maintaining the engagement of their participants. The vast majority of users in such systems exhibit casual and non-committed participation patterns, making very few contributions before dropping out and never returning to the system. We present a methodology for extending engagement and productivity in such systems by combining machine learning with intervention strategies. We demonstrate the efficacy of this approach in the wild, showing that it increased the contributions of thousands of volunteers in one of the largest citizen science platforms on the web.
Joint work with Avi Segal, Ece Kamar, and Eric Horvitz.
|2:10 – 2:50pm||Francesca Dominici||Abstract: What if I told you I had evidence of a serious threat to American national security – a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days. Thousands will continue to die unless we act now. This is the question before us today – but the threat doesn’t come from terrorists. The threat comes from climate change and air pollution.
We have developed an artificial neural network model that uses on-the-ground air-monitoring data and satellite-based measurements to estimate daily pollution levels across the continental U.S., breaking the country up into 1-square-kilometer zones. We have paired that information with health data contained in Medicare claims records from the last 12 years, and for 97% of the population ages 65 or older. We have developed statistical methods and computationally efficient algorithms for the analysis of over 460 million health records.
Our research shows that short- and long-term exposure to air pollution is killing thousands of senior citizens each year. This data science platform is telling us that federal limits on the nation’s most widespread air pollutants are not stringent enough.
This type of data is the sign of a new era for the role of data science in public health, and also for the associated methodological challenges. For example, with enormous amounts of data, the threat of unmeasured confounding bias is amplified, and causality is even harder to assess with observational studies. These and other challenges will be discussed.
Press coverage links:
Los Angeles Times
New York Times
Di Q, Wang Y, Zanobetti A, Wang Y, Koutrakis P, Dominici F, Schwartz J. (2017). Air Pollution and Mortality in the Medicare Population. New England Journal of Medicine, 376:2513-2522, June 29, 2017. DOI: 10.1056/NEJMoa1702747
Di Q, Dai L, Wang Y, Zanobetti A, Dominici F, Schwartz J. (2017). A Nationwide Case-crossover Study on Air Pollution and Mortality in the United States, 2000-2012. JAMA. 2017;318(24):2446-2456. DOI: 10.1001/jama.2017.17923
|3:20 – 4:00pm||Danielle Li||Title: Developing Novel Drugs, Financing Novel Drugs
Abstract: We analyze the economic tradeoffs associated with firms’ decisions to invest in incremental and radical innovation, in the context of pharmaceutical research and development. We develop a new, ex ante, measure of a drug candidate’s innovativeness by comparing its chemical structure to that of previously developed drug candidates: this allows us to better distinguish between novel and so-called “me-too” drugs. We show that, on average, novel drug candidates 1) generate higher private and social returns conditional on approval (as measured by revenues, stock market returns, clinical value added, and patent citations) but 2) are riskier in that they are less likely to be approved by the FDA. Using variation in the expansion of Medicare prescription drug coverage, we show that firms respond to a plausibly exogenous positive shock to their net worth by developing more chemically novel drug candidates, as opposed to more “me-too” drugs. This pattern suggests that, on the margin, firms perceive novel drugs to be more valuable ex-ante investments, but that financial frictions may hinder their willingness to invest in these riskier candidates.
|4:00 – 4:40pm||Jonah Kallenbach|| Title: Transfer learning for property prediction in drug discovery
Abstract: In drug discovery, the goal is to cure important diseases by developing molecules that inhibit a set of protein targets. However, there’s a small data problem: few examples exist for the problems most important in drug discovery because data acquisition is so expensive. At the same time, there are large molecular datasets for less important properties, and quantum chemistry can compute high quality approximations to other molecular properties. In this talk, we’ll examine many different techniques for developing effective models for important molecular properties when data is scarce.
Friday, August 24
|9:00 – 9:40 am||Laura Kreidberg||Title: Small Data, Big Ideas: Inferring the Presence of Extraterrestrial Life from a Few Photons
Abstract: Until recently, the search for extraterrestrial life has focused (unsuccessfully) on detecting radio signals from alien civilizations. I will discuss an alternative approach, which is the detection of biosignatures in the atmospheres of extrasolar planets. I will provide an update on the search for promising Earth-like planet candidates, discuss the state of the art in exoplanet atmosphere characterization, and speculate wildly about the possibility that other Earths are in fact inhabited.
|9:40 – 10:20am||Josh Speagle||Title: Revealing the Milky Way’s Dust-iny
Abstract: Just like on Earth, dust is everywhere in space. Dust particles absorb and scatter light, confounding astronomers that try to see through them to study stars within the Milky Way as well as the galaxies that lie beyond it. Although it is difficult to measure the impact of dust on a single star, combining the colors of ~1 billion stars has allowed us to start creating detailed 3-D maps of the dust within our galaxy. In the past year, new distances measured by the Gaia satellite have sharpened the resolution of this map, and improved theoretical stellar models have started to give us better insights into the underlying assembly history of our galaxy. Although dust can be a drag, the future looks bright.
|10:20 – 10:50am||Break|
|10:50 – 11:30am||Chiara Farronato||Title: “Consumer Reviews and Regulation: Evidence from NYC Restaurants” (joint with Georgios Zervas)
Abstract: We investigate how two signals of restaurant quality, hygiene grade cards and online reviews, affect consumer choice and restaurant hygiene. Unlike hygiene cards, online reviews contain information about multiple dimensions of restaurant quality. To extract signals of hygiene from online reviews, we exploit the fact that health inspectors look for different types of violations and we apply machine learning methods to predict the occurrence of individual violations from review text. Using out-of-sample prediction accuracy as a measure of signal informativeness, we find substantial heterogeneity in how informative reviews are about different violations. Reviews are more informative about food handling and pest violations than facilities and maintenance violations. Next, we estimate the effect of hygiene information contained in online reviews on consumer demand and restaurant hygiene choices. We find that consumer demand is more sensitive to more informative signals of hygiene. In addition, restaurants that are reviewed online are more likely to comply with hygiene standards for which online reviews provide a more informative signal. Our results have implications for the allocation of limited regulator resources when consumers rate service providers online.
|11:30 – 12:10pm||Sam Kou||Title: Learning from big medical data: statistical analysis on electronic health data
Abstract: Big data have attracted significant interest from business, government agencies, academic communities and the general public. They offer the potential to transform knowledge discovery and decision making. We consider in this talk big medical data, in particular electronic health insurance data, which have been widely adopted over the last two decades to give healthcare and insurance providers faster and easier access to record, retrieve and process patient information. The massive health insurance data also present opportunities for studying the causal relationship between diseases and treatments. We will use our analysis of the causal relationship between cancer immunotherapy and autoimmune diseases as an illustration. Immunotherapy is one of the most exciting cancer treatments developed in the last five years; it works by enhancing the body’s own immune system to fight cancer and has been shown to extend patients’ life expectancy. An unintended side effect observed anecdotally by several doctors is that immunotherapy seems to lead to more autoimmune diseases. We analyzed an electronic health insurance data system, which covers over 44 million members, to study the potential causal relationship between cancer immunotherapy and autoimmune diseases. Mining the massive data allows us to answer the causal question. We will also discuss the complications and lessons we learned from working on big medical data.
|12:10 – 1:30pm||Lunch|
|1:30 – 2:10pm||William Stein||Title: CoCalc: Making open source data analysis software more collaborative
Abstract: I launched https://CoCalc.com in 2013, as an easy web-based way for students and instructors to streamline their use of open source data analysis and presentation software such as R, SageMath, Octave, Jupyter notebooks, and LaTeX. Everything in CoCalc now fully supports real-time synchronized editing, and there is a huge preinstalled software stack. CoCalc now has tens of thousands of active users at hundreds of sites. In this talk, I will explain how you can use CoCalc to enhance your teaching, research and data sharing. I will also describe how CoCalc grew out of courses I taught and a software project I started (SageMath) at the Harvard Mathematics Department from 2000 to 2005.
|2:10 – 2:50pm||Sergiy Verstyuk||Title: Modeling multivariate time series in economics: Autoregressions versus Recurrent Neural Networks
Abstract: Non-structural empirical modeling is important in economics. It is used extensively for such tasks as forecasting and policy analysis. I apply vector autoregression and multivariate recurrent neural network methods to economic variables and compare their results.