
Methodology, measurement and data

Brendan Bartanen, Aliza N. Husain, David D. Liebowitz.

School principals are viewed as critical actors in improving student outcomes, but there remain important methodological questions about how to measure principals’ effects. We propose a framework for measuring principals’ contributions to student outcomes and apply it empirically using data from Tennessee, New York City, and Oregon. As commonly implemented, value-added models misattribute to principals changes in student performance caused by unobserved time-varying factors over which principals exert minimal control, leading to biased estimates of individual principals’ effectiveness and an overstatement of the magnitude of principal effects. Based on our framework, which better accounts for bias from time-varying factors, we find that little of the variation in student test scores or attendance is explained by persistent effectiveness differences between principals. Across contexts, the estimated standard deviation of principal value-added is roughly 0.03 student-level standard deviations in math achievement and 0.01 standard deviations in reading.
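The bias mechanism the abstract describes can be illustrated with a toy simulation; all numbers and variable names below are illustrative assumptions, not the paper's estimator. When a principal's "value-added" is just the average of school-year mean scores, transient school-year shocks and sampling noise inflate the apparent spread of principal effects well beyond the true persistent component.

```python
import numpy as np

rng = np.random.default_rng(0)
n_principals, n_years, class_size = 200, 5, 50

true_sd = 0.03    # persistent principal effect, in student-level SDs (assumed)
shock_sd = 0.10   # transient school-year shocks outside principals' control (assumed)

principal = rng.normal(0.0, true_sd, n_principals)
shocks = rng.normal(0.0, shock_sd, (n_principals, n_years))
# sampling noise in each school-year mean score (students ~ N(0, 1))
sampling = rng.normal(0.0, 1.0, (n_principals, n_years, class_size)).mean(axis=2)

school_year_means = principal[:, None] + shocks + sampling

# naive "value-added": each principal's average school-year mean score
naive_va = school_year_means.mean(axis=1)
naive_sd = naive_va.std(ddof=1)   # folds shocks and noise into the principal variance
```

In this toy setup the naive standard deviation comes out several times larger than the true 0.03, which is the kind of overstatement the abstract attributes to common value-added implementations.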



Dorottya Demszky, Jing Liu, Heather C. Hill, Shyamoli Sanghi, Ariel Chung.

While recent studies have demonstrated the potential of automated feedback to enhance teacher instruction in virtual settings, its efficacy in traditional classrooms remains unexplored. In collaboration with TeachFX, we conducted a pre-registered randomized controlled trial involving 523 Utah mathematics and science teachers to assess the impact of automated feedback in K-12 classrooms. This feedback targeted “focusing questions” – questions that probe students’ thinking by pressing for explanations and reflection. Our findings indicate that automated feedback increased teachers’ use of focusing questions by 20%. However, there was no discernible effect on other teaching practices. Qualitative interviews revealed mixed engagement with the automated feedback: some teachers noticed and appreciated the reflective insights from the feedback, while others had no knowledge of it. Teachers also expressed skepticism about the accuracy of feedback, concerns about data security, and/or noted that time constraints prevented their engagement with the feedback. Our findings highlight avenues for future work, including integrating this feedback into existing professional development activities to maximize its effect.



Jing Liu, Megan Kuhfeld, Monica Lee.

Noncognitive constructs such as self-efficacy, social awareness, and academic engagement are widely acknowledged as critical components of human capital, but systematic data collection on such skills in school systems is complicated by conceptual ambiguities, measurement challenges, and resource constraints. This study addresses this issue by comparing the predictive validity of the two most widely used types of noncognitive measures, observable academic behaviors (e.g., absenteeism, suspensions) and student self-reported social and emotional learning (SEL) skills, for the likelihood of high school graduation and postsecondary attainment. Our findings suggest that, conditional on student demographics and achievement, academic behaviors are several-fold more predictive than SEL skills for all long-run outcomes, and adding SEL skills to a model with academic behaviors improves the model's predictive power only minimally. In addition, academic behaviors are particularly strong predictors of low-achieving students' long-run outcomes. Part-day absenteeism (as a result of class skipping) is the largest driver of the strong predictive power of academic behaviors. Developing more nuanced behavioral measures in existing administrative data systems might be a fruitful strategy for schools whose intended goal centers on predicting students' educational attainment.
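A sketch of the kind of predictive-validity comparison the abstract describes, on simulated data: the predictor weights, sample size, and the least-squares pseudo-R² helper are my own illustrative assumptions, not the study's models.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
behavior = rng.normal(size=n)   # e.g., a standardized academic-behavior index
sel = rng.normal(size=n)        # a self-reported SEL composite
# graduation depends mostly on behaviors (assumed weights, for illustration)
latent = 1.0 * behavior + 0.2 * sel + rng.logistic(size=n)
grad = (latent > 0).astype(float)

def pseudo_r2(X, y):
    """Squared correlation between a least-squares fit and the outcome,
    a rough stand-in for predictive power."""
    X = np.column_stack([np.ones(len(y)), X])
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(yhat, y)[0, 1] ** 2

r2_behavior = pseudo_r2(behavior[:, None], grad)
r2_sel = pseudo_r2(sel[:, None], grad)
r2_both = pseudo_r2(np.column_stack([behavior, sel]), grad)
gain = r2_both - r2_behavior    # adding SEL improves prediction only slightly
```

Under these assumed weights, behaviors dominate SEL in predictive power and the incremental gain from adding SEL is small, mirroring the pattern the abstract reports.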



Joshua B. Gilbert, James S. Kim, Luke W. Miratrix.

Longitudinal models of individual growth typically emphasize between-person predictors of change but ignore how growth may vary within persons because each person contributes only one point at each time to the model. In contrast, modeling growth with multi-item assessments allows evaluation of how relative item performance may shift over time. While traditionally viewed as a nuisance under the label of “item parameter drift” (IPD) in the Item Response Theory literature, we argue that IPD may be of substantive interest if it reflects how learning manifests on different items at different rates. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) to assess IPD in a causal inference context. Simulation results show that when IPD is not accounted for, both parameter estimates and their standard errors can be affected. We illustrate with an empirical application to the persistence of transfer effects from a content literacy intervention on vocabulary knowledge, revealing how researchers can leverage IPD to achieve a more fine-grained understanding of how vocabulary learning develops over time.
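The idea that item parameter drift can be substantive, with growth showing up faster on some items than others, can be sketched in a toy two-wave Rasch-style simulation. The drift size, sample sizes, and the crude logit-difference check below are illustrative assumptions, not the paper's EIRM.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_items = 2000, 10
theta_pre = rng.normal(0, 1, n)
theta_post = theta_pre + 0.5            # everyone gains between waves

easiness_pre = np.zeros(n_items)
easiness_post = easiness_pre.copy()
easiness_post[0] += 1.0                 # item 0 drifts (e.g., a directly taught word)

def simulate(theta, easiness):
    """Rasch-style responses: P(correct) = sigmoid(theta + easiness)."""
    logits = theta[:, None] + easiness[None, :]
    p = 1 / (1 + np.exp(-logits))
    return (rng.random((len(theta), n_items)) < p).astype(int)

pre = simulate(theta_pre, easiness_pre)
post = simulate(theta_post, easiness_post)

def logit(p):
    return np.log(p / (1 - p))

# change in each item's log-odds of success, relative to the average change:
# uniform growth shifts every item alike, so a drifting item stands out
delta = logit(post.mean(axis=0)) - logit(pre.mean(axis=0))
drift = delta - delta.mean()
flagged = int(np.argmax(np.abs(drift)))
```

Here the deliberately drifting item is the one flagged, which is the sense in which relative item performance over time can carry information about how learning manifests.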



Paul T. von Hippel.

Longitudinal studies can produce biased estimates of learning if children miss tests. In an application to summer learning, we illustrate how missing test scores can create an illusion of large summer learning gaps when true gaps are close to zero. We demonstrate two methods that reduce bias by exploiting the correlations between missing and observed scores on tests taken by the same child at different times. One method, multiple imputation, uses those correlations to fill in missing scores with plausible imputed scores. The other method models the correlations implicitly, using child-level random effects. Widespread adoption of these methods would improve the validity of summer learning studies and other longitudinal research in education.
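The correction the abstract describes can be sketched on simulated data. The summer-gain parameters and missingness model are assumed, and a single regression imputation stands in for full multiple imputation; the point is only that borrowing the correlation between a child's observed and missing scores removes most of the bias.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
spring = rng.normal(0, 1, n)
fall = 0.2 + 0.8 * spring + rng.normal(0, 0.6, n)   # true mean summer gain: 0.2

# lower-scoring children are likelier to miss the fall test (assumed mechanism)
p_miss = 1 / (1 + np.exp(2 + 1.5 * spring))
observed = rng.random(n) > p_miss

# complete-case estimate: biased, because fall scores are selectively observed
naive_gain = fall[observed].mean() - spring.mean()

# regression imputation: use the spring-fall correlation to fill in the gaps
# (residual SD 0.6 is taken as known here; in practice it would be estimated)
slope, intercept = np.polyfit(spring[observed], fall[observed], 1)
missing = ~observed
fall_imp = fall.copy()
fall_imp[missing] = (intercept + slope * spring[missing]
                     + rng.normal(0, 0.6, missing.sum()))
imputed_gain = fall_imp.mean() - spring.mean()      # close to the true 0.2
```

The complete-case estimate overstates the gain because the children who miss the fall test are disproportionately low scorers, the illusion the abstract warns about; the imputation-based estimate recovers the true value.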



Arielle Boguslav, Julie Cohen.

Teacher preparation programs are increasingly expected to use data on pre-service teacher (PST) skills to drive program improvement and provide targeted supports. Observational ratings are especially vital, but also prone to measurement issues. Scores may be influenced by factors unrelated to PSTs’ instructional skills, including rater standards and mentor teachers’ skills. Yet we know little about how these measurement challenges play out in the PST context. Here we investigate the reliability and sensitivity of two observational measures. We find measures collected during student teaching are especially prone to measurement issues; only 3-4% of variation in scores reflects consistent differences between PSTs, while 9-17% of variation can be attributed to the mentors with whom they work. When high scores stem not from strong instructional skills, but instead from external circumstances, we cannot use them to make consequential decisions about PSTs’ individual needs or readiness for independent teaching.
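The kind of variance decomposition behind the 3-4% (PST) and 9-17% (mentor) figures can be illustrated with a toy method-of-moments calculation on simulated scores; all variance shares, group sizes, and the one-mentor-per-group design below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_mentors, k, m = 400, 4, 6            # mentors, PSTs per mentor, obs per PST
pst_var, mentor_var, resid_var = 0.04, 0.13, 0.83   # assumed variance shares

mentor = rng.normal(0, np.sqrt(mentor_var), n_mentors)
pst = rng.normal(0, np.sqrt(pst_var), (n_mentors, k))
eps = rng.normal(0, np.sqrt(resid_var), (n_mentors, k, m))
scores = mentor[:, None, None] + pst[:, :, None] + eps

# method-of-moments decomposition from nested means
pst_means = scores.mean(axis=2)
mentor_means = pst_means.mean(axis=1)
within_pst = scores.var(axis=2, ddof=1).mean()        # ~ resid_var
within_mentor = pst_means.var(axis=1, ddof=1).mean()  # ~ pst_var + resid_var/m
between_mentor = mentor_means.var(ddof=1)             # ~ mentor_var + pst_var/k
                                                      #   + resid_var/(k*m)
resid_hat = within_pst
pst_hat = within_mentor - resid_hat / m
mentor_hat = between_mentor - pst_hat / k - resid_hat / (k * m)
```

In this setup the recovered mentor component dwarfs the PST component, the pattern that makes score-based decisions about individual PSTs hazardous.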



Kirsten Slungaard Mumma.

The recent spike in book challenges has put school libraries at the center of heated political debates. I investigate the relationship between local politics and school library collections using data on books with controversial content in 6,631 public school libraries. Libraries in conservative areas have fewer titles with LGBTQ+, race/racism, or abortion content and more Christian fiction and discontinued Dr. Seuss titles. This is true even though most libraries have at least some controversial content. I also find that state laws restricting curricular content are negatively related to the presence of some kinds of controversial books. Finally, I present descriptive short-term evidence that book challenges in the 2021-22 school year have had “chilling effects” on the acquisition of new LGBTQ+ titles.



Zachary Himmelsbach, Heather C. Hill, Jing Liu, Dorottya Demszky.

This study provides the first large-scale quantitative exploration of mathematical language use in U.S. classrooms. Our approach employs natural language processing techniques to describe variation in the use of mathematical language in 1,657 fourth and fifth grade lessons by teachers and students in 317 classrooms in four districts over three years. Students’ exposure to mathematical language varies substantially across lessons and between teachers. Students whose teachers use more mathematical language are more likely to use it themselves, and they perform better on standardized tests. These findings suggest that teachers play a substantial role in students’ mathematical language use.
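A minimal sketch of what measuring mathematical-language density in transcripts can look like, using a tiny hand-made term dictionary as a stand-in for the paper's NLP pipeline; the term list and the `math_density` helper are hypothetical.

```python
# assumed toy term list; the actual pipeline is a full NLP model
MATH_TERMS = {"denominator", "numerator", "fraction", "equivalent", "product"}

def math_density(utterance: str) -> float:
    """Share of words in an utterance that are mathematical terms."""
    words = utterance.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,?!") in MATH_TERMS for w in words)
    return hits / len(words)

teacher_line = "What is the denominator of this fraction?"
student_line = "We put the product over the denominator."
teacher_density = math_density(teacher_line)
student_density = math_density(student_line)
```

Aggregating a measure like this over many utterances per lesson is one simple way to quantify how much mathematical language teachers use and how much students echo it.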



Andreas de Barros.

Explaining the productivity paradox—the phenomenon where an introduction of information and communication technology (ICT) does not lead to improvements in labor productivity—is difficult, as changes in technology often coincide with adjustments to working hours and substitution of labor. I conduct a cluster-randomized trial in India to investigate the effects of a program that provides teachers with continuous training and materials, encouraging them to blend their instruction with high-quality videos. Teaching hours, teacher-to-student assignments, and the curriculum are held constant. Eleven months after its launch, I document negative effects on student learning in grades 9 and 10 in mathematics, and no effects in science. I also find detrimental effects on instructional quality, instructional practices, and student perceptions and attitudes towards mathematics and science. These findings suggest adjustment costs can serve as one explanation for the paradox.



Monnica Chan, Blake Heller.

Generally, need-based financial aid improves students’ academic outcomes. However, the largest source of need-based grant aid in the United States, the Federal Pell Grant Program (Pell), has a mixed evaluation record. We assess the minimum Pell Grant in a regression discontinuity framework, using Kentucky administrative data. We focus on whether and how year-to-year changes in aid eligibility and interactions with other sources of aid attenuate Pell’s estimated effects on post-secondary outcomes. This evaluation complements past work by assessing explanations for the null or muted impacts found in our analysis and other Pell evaluations. We also discuss the limitations of using regression discontinuity methods to evaluate Pell—or other interventions with dynamic eligibility criteria—with respect to generalizability and construct validity.
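The regression discontinuity logic can be sketched on simulated data; the running variable, cutoff effect, and bandwidth below are illustrative assumptions, not estimates from the Kentucky data.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
efc = rng.uniform(-1, 1, n)        # running variable, centered at the cutoff
eligible = efc < 0                 # minimum Pell eligibility on one side

# outcome is smooth in the running variable plus a jump at the cutoff (assumed)
latent = 0.5 - 0.1 * efc + 0.2 * eligible + rng.normal(0, 0.5, n)
persist = (latent > 0.5).astype(float)

h = 0.25                           # bandwidth around the cutoff (assumed)

def intercept_at_cutoff(mask):
    """Local linear fit on one side; the intercept is the limit at efc = 0."""
    slope, const = np.polyfit(efc[mask], persist[mask], 1)
    return const

left = (efc < 0) & (efc > -h)      # eligible side
right = (efc >= 0) & (efc < h)
rd_effect = intercept_at_cutoff(left) - intercept_at_cutoff(right)
```

The estimate is local to students near the cutoff, which is exactly the generalizability limitation the abstract raises, and it is further muddied in practice when eligibility changes from year to year.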
