In print and forthcoming

Goals that appear vaguely descriptive are often actually causal. Framing them as such points to new machine learning estimators and provides language for clear interpretation.

Disparities across race, gender, and class are important targets of descriptive research. But rather than only describe disparities, research would ideally inform interventions to close those gaps. The gap-closing estimand quantifies how much a gap (e.g., incomes by race) would close if we intervened to equalize a treatment (e.g., access to college). Drawing on causal decomposition analyses, this type of research question yields several benefits. First, gap-closing estimands place categories like race in a causal framework without making them play the role of the treatment (which is philosophically fraught for non-manipulable variables). Second, gap-closing estimands empower researchers to study disparities using new statistical and machine learning estimators designed for causal effects. Third, gap-closing estimands can directly inform policy: if we sampled from the population and actually changed treatment assignments, how much could we close gaps in outcomes? I provide open-source software (the R package gapclosing) to support these methods.
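As a rough illustration of what the gap-closing estimand measures, the sketch below uses an invented toy population in which both potential outcomes of every unit are known. The data, the category labels, and the helper `gap` are hypothetical and do not come from the paper or from the gapclosing package. Real analyses observe only one potential outcome per unit and must estimate the other under causal assumptions.

```python
from statistics import mean

# Hypothetical toy population (invented numbers, not data from the paper).
# Each row: (category, observed treatment, outcome if untreated, outcome if treated).
units = [
    ("A", 1, 40, 60),
    ("A", 1, 50, 70),
    ("A", 0, 45, 65),
    ("A", 1, 55, 75),
    ("B", 0, 35, 55),
    ("B", 0, 40, 60),
    ("B", 1, 30, 50),
    ("B", 0, 45, 65),
]


def gap(rows):
    """Mean outcome in category A minus mean outcome in category B."""
    mean_a = mean(y for category, y in rows if category == "A")
    mean_b = mean(y for category, y in rows if category == "B")
    return mean_a - mean_b


# Factual outcomes: each unit's outcome under the treatment it actually received.
factual = [(c, y1 if t == 1 else y0) for c, t, y0, y1 in units]
# Counterfactual outcomes: the intervention assigns the treatment to every unit.
equalized = [(c, y1) for c, t, y0, y1 in units]

factual_gap = gap(factual)
counterfactual_gap = gap(equalized)
share_closed = (factual_gap - counterfactual_gap) / factual_gap

print(factual_gap, counterfactual_gap, share_closed)
```

In this toy population the treatment effect is the same for everyone, so the whole gap reduction comes from equalizing access: category A is treated more often than category B, and assigning the treatment to all units closes half of the factual gap.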

Lundberg, Ian, Rebecca Johnson, and Brandon M. Stewart. 2021. “What is your estimand? Defining the target quantity connects statistical evidence to theory.” American Sociological Review. [open access] [replication]

Our framework grounds methodological choices in a clear statement of the estimand: the goal an empirical analysis hopes to achieve.

We make only one point in this article. Every quantitative study must be able to answer the question: what is your estimand? The estimand is the target quantity---the purpose of the statistical analysis. Much attention is already placed on how to do estimation; a similar degree of care should be given to defining the thing we are estimating. We advocate that authors state the central quantity of each analysis---the theoretical estimand---in precise terms that exist outside of any statistical model. In our framework, researchers do three things: (1) set a theoretical estimand, clearly connecting this quantity to theory, (2) link to an empirical estimand, which is informative about the theoretical estimand under some identification assumptions, and (3) learn from data. Adding precise estimands to research practice expands the space of theoretical questions, clarifies how evidence can speak to those questions, and unlocks new tools for estimation. By grounding all three steps in a precise statement of the target quantity, our framework connects statistical evidence to theory.
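One common way to instantiate the first two steps is sketched below for a binary treatment. The notation is an illustrative assumption rather than a quotation from the article: $Y(d)$ is a potential outcome, $D$ the treatment, and $X$ a set of measured pre-treatment variables.

```latex
% Theoretical estimand: the average causal effect in a target population,
% defined with potential outcomes and without reference to any model
\tau = \mathbb{E}\bigl[ Y(1) - Y(0) \bigr]

% Empirical estimand: a quantity involving observable variables only
\theta = \mathbb{E}_X\Bigl[ \mathbb{E}[Y \mid D = 1, X] - \mathbb{E}[Y \mid D = 0, X] \Bigr]

% Identification: under consistency, conditional ignorability
% (Y(d) independent of D given X), and positivity, \tau = \theta
```

The third step, learning from data, then amounts to choosing an estimator for $\theta$, such as regression, matching, or a machine learning method.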

Lundberg, Ian, Sarah L. Gold, Louis Donnelly, Jeanne Brooks-Gunn, and Sara S. McLanahan. 2021. “Government assistance protects low-income families from eviction.” Journal of Policy Analysis and Management. [open access] [replication]

America faces an affordable housing crisis. Eviction is alarmingly common. Public policies can help.

A lack of affordable housing is a pressing issue for many low-income American families and can lead to eviction from their homes. Housing assistance programs to address this problem include public housing and other assistance, including vouchers, through which a government agency offsets the cost of private market housing. This paper assesses whether the receipt of either category of assistance reduces the probability that a family will be evicted from their home in the subsequent six years. Because no randomized trial has assessed these effects, we use observational data and formalize the conditions under which a causal interpretation is warranted. Families living in public housing experience less eviction conditional on pre-treatment variables. We argue that this evidence points toward a causal conclusion that assistance, particularly public housing, protects families from eviction.

Salganik, Matthew J., Ian Lundberg, Alex Kindel, Sara McLanahan, and 108 others. 2020. “Measuring the predictability of life outcomes with a scientific mass collaboration.” Proceedings of the National Academy of Sciences (latest articles). [replication] [project website]

The Fragile Families Challenge is a scientific mass collaboration designed to measure and understand the predictability of life trajectories using the common task method.

Participants in the Challenge created predictive models of six life outcomes using data from the Fragile Families and Child Wellbeing Study, a high-quality birth cohort study. We evaluated these predictions on holdout data not available to participants. This paper reports the predictive performance of the Fragile Families Challenge and presents implications for scientists and policymakers. We are not posting results of this project online until the paper is published.

Lundberg, Ian. 2020. “Does opportunity skip generations? Reassessing evidence from sibling and cousin correlations.” Demography. [open access] [replication]

  • 2020 Graduate Student Paper Award, American Sociological Association Section on Inequality, Poverty, and Mobility.

What do various data generating processes imply for the similarities of siblings' and cousins' income attainments? This paper links empirical evidence to a series of candidate Markov processes unfolding over many generations.

Sibling and cousin correlations are empirically straightforward: they capture the degree to which siblings' or cousins' outcomes are similar. The meaning of these quantities, however, is complicated. A multitude of theoretical processes can produce any particular set of sibling and cousin correlations. Using multigenerational mobility as a substantive example, I show that sibling and cousin correlations in published research are equally consistent with several theoretical interpretations. While some prior authors have concluded that opportunity must skip parents to directly link the outcomes of grandparents and offspring, I show that this evidence is often consistent with alternative theories of latent transmission (measurement error) or of dynamic transmission (a parent-to-child transmission process that changes over generations). I clarify that point estimates which seem to contradict a given theory may also arise from estimation error. I develop a Bayesian procedure to estimate sibling and cousin correlations and quantify uncertainty about the statistic central to the argument. I conclude by outlining how future research might use sibling and cousin correlations as effective descriptive quantities while remaining cognizant that these quantities could arise from a variety of distinct theoretical processes.
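The latent-transmission interpretation can be illustrated with a small arithmetic sketch. It assumes a simplified setup that is not the paper's model: a latent trait follows a first-order Markov process across generations, observed income equals the latent trait plus independent noise, and variances are constant over generations. The parameter values and names are hypothetical.

```python
# Hypothetical parameter values chosen only for illustration.
latent_persistence = 0.6  # parent-to-child correlation in the latent trait
reliability = 0.5         # share of observed-income variance due to the latent trait


def observed_correlation(generations_apart):
    """Correlation of observed incomes for relatives k generations apart.

    Under the stated assumptions the noise terms are uncorrelated, so only
    the latent trait contributes to the covariance.
    """
    return reliability * latent_persistence ** generations_apart


parent_child = observed_correlation(1)
grandparent_child = observed_correlation(2)

# What an analyst would predict for grandparents and grandchildren if observed
# income itself (rather than the latent trait) followed a first-order Markov process.
naive_markov_prediction = parent_child ** 2

print(parent_child, grandparent_child, naive_markov_prediction)
```

The observed grandparent-grandchild correlation exceeds the naive prediction even though, by construction, nothing skips a generation. This is one way that measurement error alone can produce a pattern resembling a direct grandparent effect.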

Salganik, Matthew J., Ian Lundberg, Alexander T. Kindel, and Sara S. McLanahan. 2019. “Introduction to the special collection on the Fragile Families Challenge.” Socius. [replication]

Hundreds of social scientists and data scientists participated in a scientific mass collaboration to predict life outcomes. This paper introduces a special collection of articles written by some who participated.

Participants in the Challenge created predictive models of six life outcomes using data from the Fragile Families and Child Wellbeing Study, a high-quality birth cohort study. This Special Collection includes twelve articles describing participants' approaches to predicting these six outcomes, as well as three articles describing methodological and procedural insights from running the Challenge. This introduction will help readers interpret the individual articles and help researchers interested in running future projects similar to the Fragile Families Challenge.

Lundberg, Ian, Arvind Narayanan, Karen E.C. Levy, and Matthew J. Salganik. 2019. “Privacy, ethics, and data access: A case study of the Fragile Families Challenge.” Socius.

Data access is a key barrier to knowledge creation. Our framework for ethical data access, developed in a social science setting, could serve as a model for other data science settings.

Stewards of social science data face a fundamental tension. On one hand, they want to make their data accessible to as many researchers as possible to facilitate new discoveries. At the same time, they want to restrict access to their data as much as possible in order to protect the people represented in the data. In this paper, we provide a case study addressing this common tension in an uncommon setting: the Fragile Families Challenge, a scientific mass collaboration designed to yield insights that could improve the lives of disadvantaged children in the United States. We describe our process of threat modeling, threat mitigation, and third-party guidance. We also describe the ethical principles that formed the basis of our process. We are open about our process and the trade-offs that we made in the hopes that others can improve on what we have done.

Eviction is alarmingly common among American families, suggesting a failure of social policies.

A growing body of research suggests that housing eviction is more common than previously recognized and may play an important role in the reproduction of poverty. The proportion of children affected by housing eviction, however, remains largely unknown. We estimate that one in seven children born in large U.S. cities in 1998–2000 experienced at least one eviction for nonpayment of rent or mortgage between birth and age 15. Rates of eviction were substantial across all cities and demographic groups studied, but children from disadvantaged backgrounds were most likely to experience eviction. Among those born into deep poverty, we estimate that approximately one in four were evicted by age 15. Given prior evidence that forced moves have negative consequences for children, we conclude that the high prevalence and social stratification of housing eviction are sufficient to play an important role in the reproduction of poverty and warrant greater policy attention.

Killewald, Alexandra, and Ian Lundberg. 2017. “New evidence against a causal marriage wage premium.” Demography. [open access] [replication]

Marriage does not cause men's hourly wages to increase. They are already increasing prior to the marriage date.

Recent research has shown that men’s wages rise more rapidly than expected prior to marriage, but interpretations diverge on whether this indicates selection or a causal effect of anticipating marriage. We seek to adjudicate this debate by bringing together literatures on (1) the male marriage wage premium; (2) selection into marriage based on men’s economic circumstances; and (3) the transition to adulthood, during which both union formation and unusually rapid improvements in work outcomes often occur. Using data from the National Longitudinal Survey of Youth 1979, we evaluate these perspectives. We show that wage declines predate rather than follow divorce, indicating no evidence that staying married benefits men’s wages. We find that older grooms experience no unusual wage patterns at marriage, suggesting that the observed marriage premium may simply reflect co-occurrence with the transition to adulthood for younger grooms. We show that men entering shotgun marriages experience similar premarital wage gains as other grooms, casting doubt on the claim that anticipation of marriage drives wage increases. We conclude that the observed wage patterns are most consistent with men marrying when their wages are already rising more rapidly than expected and divorcing when their wages are already falling, with no additional causal effect of marriage on wages.


Lundberg, Ian, and Brandon M. Stewart. 2020. “Comment: Summarizing income mobility with multiple smooth quantiles instead of parameterized means.” Sociological Methodology. [open access] [replication]

Our methodological comment proposes a visualization to pack more information into summaries of economic mobility.

Studies of economic mobility summarize the distribution of offspring incomes for each level of parent income. Mitnik and Grusky (2020) highlight that the conventional intergenerational elasticity (IGE) targets the geometric mean and propose a parametric strategy for estimating the arithmetic mean. We decompose the IGE and their proposal into two choices: (1) the summary statistic for the conditional distribution and (2) the functional form. These choices lead us to a different strategy: visualizing several quantiles of the offspring income distribution as smooth functions of parent income. Our proposal solves the problems Mitnik and Grusky highlight with geometric means, avoids the sensitivity of arithmetic means to top incomes, and provides more information than is possible with any single number. Our proposal has broader implications: the default summary (the mean) used in many regressions is sensitive to the tail of the distribution in ways that may be substantively undesirable.
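The sensitivity of the arithmetic mean to top incomes can be seen in a minimal sketch. The income values below are invented for illustration, and the helper `summarize` is hypothetical. It compares three summaries of a single conditional distribution rather than reproducing the smooth-quantile visualization proposed in the comment.

```python
from statistics import geometric_mean, mean, median


def summarize(incomes):
    """Return three candidate summaries of one income distribution."""
    return {
        "arithmetic_mean": mean(incomes),
        "geometric_mean": geometric_mean(incomes),
        "median": median(incomes),
    }


# Hypothetical offspring incomes (in thousands) at one level of parent income.
baseline = [20, 30, 40, 50, 60]
# The same distribution after the top earner's income rises a hundredfold.
with_top_earner = [20, 30, 40, 50, 6000]

before = summarize(baseline)
after = summarize(with_top_earner)

for name in before:
    print(name, before[name], after[name])
```

The median is unchanged by the shift at the top, while the arithmetic mean rises from 40 to 1228. The geometric mean also moves, but by much less than the arithmetic mean.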

Working papers

Computational power and digital data have created new opportunities to explore and understand the social world. A special synergy is possible when social scientists combine human attention to certain aspects of the problem with the power of algorithms to automate other aspects of the problem. We review selected exemplary applications where machine learning amplifies researcher coding, summarizes complex data, relaxes statistical assumptions, and targets researcher attention. We then seek to reduce perceived barriers to machine learning by summarizing several fundamental building blocks and their grounding in classical statistics. We present a few guiding principles and promising approaches where we see particular potential for machine learning to transform social science inquiry. We conclude that machine learning tools are accessible, worthy of attention, and ready to yield new discoveries.

By applying the gap-closing estimand, this paper presents new evidence about the degree to which occupational segregation contributes to racial health disparities.

Racism causes racial disparities in health, and structural racism has many components. Focusing on one of those components, this paper addresses occupational segregation. I document high onset of work-limiting disabilities in occupations where many workers identify as non-Hispanic Black or as Hispanic. I then pivot to a causal question. Suppose we took a sample from the population and reassigned their occupations to be a function of education alone. To what degree would health disparities narrow for that sample? Using observational data, I estimate that the disparity between non-Hispanic Black and white workers would narrow by one-third. This estimate is credible because of adjustment for lagged measures of demographics, human capital, and health carried out under transparent causal assumptions. The result contributes to understanding about inequality and health by quantifying the contribution of occupational segregation to a disparity: if we took a sample and reassigned occupations, the disparity would narrow but would not disappear. The paper contributes to methodology by illustrating an approach to macro-level claims (how segregation affects a population disparity) that draws on explicitly causal micro-level analyses (potential outcomes for individuals) for which data are abundant.