Search EdWorkingPapers

Search EdWorkingPapers by author, title, or keywords.

Methodology, measurement and data

Ashley Edwards, Justin C. Ortagus, Jonathan Smith, Andria Smythe.

Using data from nearly 1.2 million Black SAT takers, we estimate the impacts of initially enrolling in an Historically Black College and University (HBCU) on educational, economic, and financial outcomes. We control for the college application portfolio and compare students with similar portfolios and levels of interest in HBCUs and non-HBCUs who ultimately make divergent enrollment decisions - often enrolling in a four-year HBCU in lieu of a two-year college or no college. We find that students initially enrolling in HBCUs are 14.6 percentage points more likely to earn a BA degree and have 5 percent higher household income around age 30 than those who do not enroll in an HBCU. Initially enrolling in an HBCU also leads to $12,000 more in outstanding student loans around age 30. We find that some of these results are driven by an increased likelihood of completing a degree from relatively broad-access HBCUs and also relatively high-earning majors (e.g., STEM). We also explore new outcomes, such as credit scores, mortgages, bankruptcy, and neighborhood characteristics around age 30.

More →


Paul T. von Hippel.

Educational researchers often report effect sizes in standard deviation units (SD), but SD effects are hard to interpret. Effects are easier to interpret in percentile points, but conversion from SDs to percentile points involves a calculation that is not intuitive to educational stakeholders. We point out that, if the outcome variable is normally distributed, simply multiplying the SD effect by 37 usually gives an excellent approximation to the percentile-point effect. For students in the middle three-fifths of a normal distribution, the approximation is always accurate to within 1.6 percentile points (and usually accurate to within 1 percentile point) for effect sizes of up to 0.8 SD (or 29 to 30 percentile points). Two examples show that the approximation can work for empirical effects estimated from real studies.

More →


Brendan Bartanen, Aliza N. Husain, David D. Liebowitz.

School principals are viewed as critical actors to improve student outcomes, but there remain important methodological questions about how to measure principals’ effects. We propose a framework for measuring principals’ contributions to student outcomes and apply it empirically using data from Tennessee, New York City, and Oregon. As commonly implemented, value-added models misattribute to principals changes in student performance caused by unobserved time-varying factors over which principals exert minimal control, leading to biased estimates of individual principals’ effectiveness and an overstatement of the magnitude of principal effects. Based on our framework, which better accounts for bias from time-varying factors, we find that little of the variation in student test scores or attendance is explained by persistent effectiveness differences between principals. Across contexts, the estimated standard deviation of principal value-added is roughly 0.03 student-level standard deviations in math achievement and 0.01 standard deviations in reading.

More →


Joshua B. Gilbert, Luke W. Miratrix, Mridul Joshi, Benjamin W. Domingue.

Analyzing heterogeneous treatment effects plays a crucial role in understanding the impacts of educational interventions. A standard practice for heterogeneity analysis is to examine interactions between treatment status and pre-intervention participant char- acteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical observed patterns of heterogeneity on test score outcomes can emerge from entirely distinct data-generating processes. Specifically, we describe scenarios in which treatment effect heterogeneity arises from either variation in treatment effects along a pre-intervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone as such outcomes are insufficient to identify the relevant generating process. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade, and show that any apparent heterogeneity by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.

More →


Dorottya Demszky, Jing Liu, Heather C. Hill, Shyamoli Sanghi, Ariel Chung.

While recent studies have demonstrated the potential of automated feedback to enhance teacher instruction in virtual settings, its efficacy in traditional classrooms remains unexplored. In collaboration with TeachFX, we conducted a pre-registered randomized controlled trial involving 523 Utah mathematics and science teachers to assess the impact of automated feedback in K-12 classrooms. This feedback targeted “focusing questions” – questions that probe students’ thinking by pressing for explanations and reflection. Our findings indicate that automated feedback increased teachers’ use of focusing questions by 20%. However, there was no discernible effect on other teaching practices. Qualitative interviews revealed mixed engagement with the automated feedback: some teachers noticed and appreciated the reflective insights from the feedback, while others had no knowledge of it. Teachers also expressed skepticism about the accuracy of feedback, concerns about data security, and/or noted that time constraints prevented their engagement with the feedback. Our findings highlight avenues for future work, including integrating this feedback into existing professional development activities to maximize its effect.

More →


Jing Liu, Megan Kuhfeld, Monica Lee.

Noncognitive constructs such as self-e cacy, social awareness, and academic engagement are widely acknowledged as critical components of human capital, but systematic data collection on such skills in school systems is complicated by conceptual ambiguities, measurement challenges and resource constraints. This study addresses this issue by comparing the predictive validity of two most widely used metrics on noncogntive outcomes|observable academic behaviors (e.g., absenteeism, suspensions) and student self-reported social and emotional learning (SEL) skills|for the likelihood of high school graduation and postsecondary attainment. Our  ndings suggest that conditional on student demographics and achievement, academic behaviors are several-fold more predictive than SEL skills for all long-run outcomes, and adding SEL skills to a model with academic behaviors improves the model's predictive power minimally. In addition, academic behaviors are particularly strong predictors for low-achieving students' long-run outcomes. Part-day absenteeism (as a result of class skipping) is the largest driver behind the strong predictive power of academic behaviors. Developing more nuanced behavioral measures in existing administrative data systems might be a fruitful strategy for schools whose intended goal centers on predicting students' educational attainment.

More →


Joshua B. Gilbert, James S. Kim, Luke W. Miratrix.

Longitudinal models of individual growth typically emphasize between-person predictors of change but ignore how growth may vary within persons because each person contributes only one point at each time to the model. In contrast, modeling growth with multi-item assessments allows evaluation of how relative item performance may shift over time. While traditionally viewed as a nuisance under the label of “item parameter drift” (IPD) in the Item Response Theory literature, we argue that IPD may be of substantive interest if it reflects how learning manifests on different items at different rates. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) to assess IPD in a causal inference context. Simulation results show that when IPD is not accounted for, both parameter estimates and their standard errors can be affected. We illustrate with an empirical application to the persistence of transfer effects from a content literacy intervention on vocabulary knowledge, revealing how researchers can leverage IPD to achieve a more fine-grained understanding of how vocabulary learning develops over time.

More →


Paul T. von Hippel.

Longitudinal studies can produce biased estimates of learning if children miss tests. In an application to summer learning, we illustrate how missing test scores can create an illusion of large summer learning gaps when true gaps are close to zero. We demonstrate two methods that reduce bias by exploiting the correlations between missing and observed scores on tests taken by the same child at different times. One method, multiple imputation, uses those correlations to fill in missing scores with plausible imputed scores. The other method models the correlations implicitly, using child-level random effects. Widespread adoption of these methods would improve the validity of summer learning studies and other longitudinal research in education.

More →


Arielle Boguslav, Julie Cohen.

Teacher preparation programs are increasingly expected to use data on pre-service teacher (PST) skills to drive program improvement and provide targeted supports. Observational ratings are especially vital, but also prone to measurement issues. Scores may be influenced by factors unrelated to PSTs’ instructional skills, including rater standards and mentor teachers’ skills. Yet we know little about how these measurement challenges play out in the PST context. Here we investigate the reliability and sensitivity of two observational measures. We find measures collected during student teaching are especially prone to measurement issues; only 3-4% of variation in scores reflects consistent differences between PSTs, while 9-17% of variation can be attributed to the mentors with whom they work. When high scores stem not from strong instructional skills, but instead from external circumstances, we cannot use them to make consequential decisions about PSTs’ individual needs or readiness for independent teaching.

More →


Kirsten Slungaard Mumma.

The recent spike in book challenges has put school libraries at the center of heated political debates. I investigate the relationship between local politics and school library collections using data on books with controversial content in 6,631 public school libraries. Libraries in conservative areas have fewer titles with LGBTQ+, race/racism, or abortion content and more Christian fiction and discontinued Dr. Seuss titles. This is true even though most libraries have at least some controversial content. I also find that state laws that restrict curricular content are negatively related to some kinds of controversial books. Finally, I present descriptive short-term evidence that book challenges in the 2021-22 school year have had “chilling effects” on the acquisition of new LGBTQ+ titles.

More →