Search EdWorkingPapers

Methodology, measurement and data

Daniel Rodriguez-Segura, Savannah Tierney.

While learning outcomes in low- and middle-income countries are generally low, the degree to which students and schools within education systems lag behind grade-level proficiency can vary considerably. A substantial portion of the existing literature advocates aligning curricula more closely with the proficiency level of the “median child” in each system. Yet amid considerable between-school heterogeneity in learning outcomes, choosing a single instructional level for the entire system may still leave behind students in schools far from that level. Establishing system-wide curriculum expectations in the presence of substantial between-school heterogeneity therefore poses a serious challenge for policymakers, especially as between-school heterogeneity has received relatively little research attention to date. This paper addresses that gap by leveraging a unique dataset on foundational literacy and numeracy outcomes, representative of six public education systems encompassing over 900,000 enrolled children in South Asia and West Africa. With this dataset, we examine the current extent of between-school heterogeneity in learning outcomes, identify potential predictors of this heterogeneity, and explore its implications for setting national curricula across grade levels and subjects. Our findings reveal that between-school heterogeneity can indeed present both a severe pedagogical hindrance and a policy challenge, particularly in contexts with relatively higher levels of performance and in the higher grades. In response to this meaningful between-school heterogeneity, we also demonstrate through simulation that more nuanced, data-driven targeting of curricular expectations for different schools within a system could enable policymakers to reach a broader spectrum of students through classroom instruction.


Reuben Hurst, Andrew Simon, Michael Ricks.

To understand the causes and consequences of polarized demand for government expenditure, we conduct three field experiments in the context of public higher education. The first two experiments study polarization in taxpayer demand. We provide information to shape beliefs about social returns on investment. Our treatments narrow the partisan gap in ideal policies (a reduction in ideological polarization) by up to 32%, with differences in partisan reasoning as a key mechanism. Providing information also affects how people communicate their ideal policies to elected officials, increasing their propensity to write a (positive) letter to an official of the other party (a reduction in affective polarization). In the third experiment, we send these letters to a randomized subset of elected officials to study how policymakers respond to constituent demand. We find that officials who receive their constituents' demands engage more with higher education issues in our correspondence.


Paul T. von Hippel.

Educational researchers often report effect sizes in standard deviation units (SD), but SD effects are hard to interpret. Effects are easier to interpret in percentile points, but converting SDs to percentile points involves a calculation that is not transparent to educational stakeholders. We show that if the outcome variable is normally distributed, we can approximate the percentile-point effect simply by multiplying the SD effect by 37 (or, equivalently, dividing the SD effect by 0.027). For students in the middle three-fifths of a normal distribution, this rule of thumb is always accurate to within 1.6 percentile points for effect sizes of up to 0.8 SD. Two examples show that the rule can be just as accurate for empirical effects from real studies. Applying the rule to empirical benchmarks, we find that the least effective third of educational interventions raise scores by 0 to 2 percentile points; the middle third raise scores by 2 to 7 percentile points; and the most effective third raise scores by more than 7 percentile points.
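The multiply-by-37 rule is easy to check against the exact normal-distribution calculation. A minimal sketch using Python's `statistics.NormalDist` (the starting percentiles and effect sizes below are illustrative choices, not values from the paper):

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal

def exact_percentile_gain(start_percentile, effect_sd):
    """Exact percentile-point gain for a student starting at
    `start_percentile` who improves by `effect_sd` standard deviations,
    assuming a normally distributed outcome."""
    z = norm.inv_cdf(start_percentile / 100)
    return 100 * norm.cdf(z + effect_sd) - start_percentile

def rule_of_thumb(effect_sd):
    """The approximation: multiply the SD effect by 37."""
    return 37 * effect_sd

for start, d in [(50, 0.2), (31, 0.5), (20, 0.8)]:
    exact = exact_percentile_gain(start, d)
    approx = rule_of_thumb(d)
    print(f"start={start}th pct, effect={d} SD: "
          f"exact={exact:.1f} pts, rule={approx:.1f} pts")
```

For a median student, for example, a 0.2 SD effect moves them from the 50th to roughly the 58th percentile (about 7.9 points), while the rule predicts 7.4 points, well inside the stated 1.6-point tolerance.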


Kenji Kitamura, Dana Charles McCoy, Sharon Wolf.

Children's approaches to learning (AtL) are widely recognized as a critical predictor of educational outcomes, especially in early childhood. Nevertheless, there remains a dearth of understanding regarding the dimensionality of AtL, the reciprocal dynamics between AtL and learning outcomes, and how AtL operates in non-Western contexts. This paper aims to extend the existing AtL literature by both conceptually and empirically investigating the dimensionality of the AtL scale of the International Development and Early Learning Assessment (IDELA) – a globally used measure of early childhood development – based on data from Ghanaian children newly enrolled in formal schooling. Additionally, our research explores reciprocal relationships between AtL subconstructs and academic skills over time. Our analysis identifies two dimensions within the IDELA AtL scale: Self-Regulation (SR) and Motivation. We found that children with higher levels of SR early in schooling demonstrated better literacy and numeracy skills in later grades compared to their peers with low early SR, whereas children's motivation did not predict subsequent literacy and numeracy skills. This study enhances understanding of AtL in non-Western contexts, with implications for culturally appropriate support for children’s engagement in learning.


D'Wayne Bell, John B. Holbein, Samuel Imlay, Jonathan Smith.

We study how colleges shape their students' voting habits by linking millions of SAT takers to their college-enrollment and voting histories. To begin, we show that the fraction of students from a particular college who vote varies systematically by the college's attributes (e.g. increasing with selectivity) but also that seemingly similar colleges can have markedly different voting rates. Next, after controlling for students' college application portfolios and pre-college voting behavior, we find that attending a college with a 10 percentage-point higher voting rate increases entrants' probability of voting by 4 percentage points (10 percent). This effect arises during college, persists after college, and is almost entirely driven by higher voting-rate colleges making new voters. College peers' initial voting propensity plays no discernible role.


Isaac M. Opper, Umut Özek.

We use a marginal treatment effect (MTE) representation of a fuzzy regression discontinuity setting to propose a novel estimator. The estimator can be thought of as extrapolating the traditional fuzzy regression discontinuity estimate, or as an observational study that adjusts for endogenous selection into treatment using information at the discontinuity. We show in a frequentist framework that it is consistent under weaker assumptions than existing approaches, and then discuss conditions in a Bayesian framework under which it can be considered the posterior mean given the observed conditional moments. We then use this approach to examine the effects of early grade retention. We show that the benefits of early grade retention policies are larger for students with lower baseline achievement and smaller for low-performing students who are exempt from retention. These findings imply that (1) the benefits of early grade retention policies are larger than those estimated using traditional fuzzy regression discontinuity designs, but that (2) retaining additional students would have a limited effect on student outcomes.


Joshua B. Gilbert.

When analyzing treatment effects on test scores, researchers face many choices and competing guidance for scoring tests and modeling results. This study examines the impact of scoring choices through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables (EIV) correction removes the bias from two-step models. An empirical application to data from a randomized controlled trial demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal scoring weights.
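The attenuation mechanism can be illustrated with a small stdlib-only simulation: adding measurement error to a latent outcome inflates the observed standard deviation, shrinking the standardized effect by roughly the square root of the reliability, which an errors-in-variables-style correction then undoes. The reliability and effect size below are hypothetical, and the correction shown is the textbook disattenuation formula, not necessarily the paper's exact estimator:

```python
import math
import random
import statistics

random.seed(0)
n = 20000
reliability = 0.7   # assumed score reliability (hypothetical)
true_effect = 0.5   # treatment effect in latent SD units (hypothetical)

treated = [i % 2 for i in range(n)]
theta = [random.gauss(true_effect * t, 1.0) for t in treated]  # latent ability

# Observed score = latent ability + measurement error sized to hit the reliability
err_sd = math.sqrt((1 - reliability) / reliability)
score = [th + random.gauss(0, err_sd) for th in theta]

def standardized_effect(y, t):
    """Mean difference divided by the pooled within-group SD."""
    y1 = [v for v, g in zip(y, t) if g]
    y0 = [v for v, g in zip(y, t) if not g]
    pooled_sd = math.sqrt((statistics.variance(y1) + statistics.variance(y0)) / 2)
    return (statistics.mean(y1) - statistics.mean(y0)) / pooled_sd

naive = standardized_effect(score, treated)    # attenuated by measurement error
corrected = naive / math.sqrt(reliability)     # errors-in-variables disattenuation
print(f"naive: {naive:.3f}  corrected: {corrected:.3f}  truth: {true_effect}")
```

With a reliability of 0.7, the naive standardized effect lands near 0.5 × √0.7 ≈ 0.42, and dividing by √0.7 recovers the latent-scale effect.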


Joshua B. Gilbert, Luke W. Miratrix, Mridul Joshi, Benjamin W. Domingue.

Analyzing heterogeneous treatment effects (HTE) plays a crucial role in understanding the impacts of educational interventions. A standard practice for HTE analysis is to examine interactions between treatment status and pre-intervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical patterns of HTE on test score outcomes can emerge either from variation in treatment effects due to a pre-intervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade, and show that any apparent HTE by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.
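The second scenario, where item-level effects correlate with item easiness but are identical for every student, can masquerade as pretest-by-treatment HTE on sum scores. A stdlib-only sketch of this data-generating process under a simple Rasch-style response model (item counts, easiness values, and effect sizes are invented for illustration):

```python
import math
import random

random.seed(1)

def logistic(x):
    return 1 / (1 + math.exp(-x))

n_students, n_items = 4000, 20
easiness = [1.0] * 10 + [-1.0] * 10          # half easy, half hard items
# Treatment effect varies by ITEM (larger on easy items), not by student
item_effect = [0.6 if b > 0 else 0.0 for b in easiness]

students = []
for i in range(n_students):
    theta = random.gauss(0, 1)               # latent ability
    t = i % 2                                 # alternating treatment assignment
    score = sum(
        random.random() < logistic(theta + b + t * d)
        for b, d in zip(easiness, item_effect)
    )
    students.append((theta, t, score))

# Apparent HTE: compare sum-score effects below vs above the median "pretest"
med = sorted(s[0] for s in students)[n_students // 2]

def effect(group):
    tr = [s[2] for s in group if s[1]]
    co = [s[2] for s in group if not s[1]]
    return sum(tr) / len(tr) - sum(co) / len(co)

low = effect([s for s in students if s[0] < med])
high = effect([s for s in students if s[0] >= med])
print(f"sum-score effect, low-pretest students:  {low:.2f} points")
print(f"sum-score effect, high-pretest students: {high:.2f} points")
```

Because high-ability students are already near ceiling on the easy items where the effect concentrates, the sum-score effect looks larger for low-ability students even though no student-level moderation exists, which is exactly why item-level data are needed to tell the two scenarios apart.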


Dorottya Demszky, Rose Wang, Sean Geraghty, Carol Yu.

Providing ample opportunities for students to express their thinking is pivotal to their learning of mathematical concepts. We introduce the Talk Meter, which provides in-the-moment automated feedback on student-teacher talk ratios. We conduct a randomized controlled trial on a virtual math tutoring platform (n=742 tutors) to evaluate the effectiveness of the Talk Meter at increasing student talk. In one treatment arm, we show the Talk Meter only to the tutor, while in the other arm we show it to both the student and the tutor. We find that the Talk Meter increases student talk ratios in both treatment conditions by 13-14%; this trend is driven by the tutor talking less in the tutor-facing condition, whereas in the student-facing condition it is driven by the student expressing significantly more mathematical thinking. Through interviews with tutors, we find the student-facing Talk Meter was more motivating to students, especially those with introverted personalities, and was effective at encouraging joint effort towards balanced talk time. These results demonstrate the promise of in-the-moment joint talk time feedback to both teachers and students as a low cost, engaging, and scalable way to increase students' mathematical reasoning.


Kelli A. Bird, Benjamin L. Castleman, Yifeng Song.

Predictive analytics are increasingly pervasive in higher education. However, algorithmic bias has the potential to reinforce racial inequities in postsecondary success. We provide a comprehensive and translational investigation of algorithmic bias in two separate prediction models: one predicting course completion, the other predicting degree completion. Our results show that algorithmic bias in both models could result in at-risk Black students receiving fewer success resources than White students at comparatively lower risk of failure. We also find that the magnitude of algorithmic bias varies within the distribution of predicted success. With the degree completion model, the amount of bias is nearly four times higher when we define at-risk using the bottom decile than when we focus on students in the bottom half of predicted scores. Between the two models, the magnitude and pattern of bias and the efficacy of basic bias mitigation strategies differ meaningfully, emphasizing the contextual nature of algorithmic bias and of attempts to mitigate it. Our results moreover suggest that algorithmic bias is due in part to currently available administrative data being less useful for predicting Black students' success than White students' success, particularly for new students; this suggests that additional data collection efforts have the potential to mitigate bias.