
Andrew D. Ho

Ishita Ahmed, Masha Bertling, Lijin Zhang, Andrew D. Ho, Prashant Loyalka, Hao Xue, Scott Rozelle, Benjamin W. Domingue.

Researchers use test outcomes to evaluate the effectiveness of education interventions across numerous randomized controlled trials (RCTs). Aggregate test data—for example, simple measures like the sum of correct responses—are compared across treatment and control groups to determine whether an intervention has had a positive impact on student achievement. We show that item-level data and psychometric analyses can provide information about treatment heterogeneity and improve the design of future experiments. We apply techniques typically used in the study of Differential Item Functioning (DIF) to examine variation in the degree to which items show treatment effects. That is, are observed treatment effects due to generalized gains on the aggregate achievement measures or are they due to targeted gains on specific items? Based on our analysis of 7,244,566 item responses (265,732 students responding to 2,119 items) taken from 15 RCTs in low- and middle-income countries, we find clear evidence for variation in gains across items. DIF analyses identify items that are highly sensitive to the interventions—in one extreme case, a single item drives nearly 40% of the observed treatment effect—as well as items that are insensitive. We also show that the variation of item-level sensitivity can have implications for the precision of effect estimates. Of the RCTs that have significant effect estimates, 41% have patterns of item-level sensitivity to treatment that allow for the possibility of a null effect when this source of uncertainty is considered. Our findings demonstrate how researchers can gain more insight regarding the effects of interventions via additional analysis of item-level test data.
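As a simplified illustration of the item-sensitivity idea (a sketch, not the paper's actual DIF procedure), the code below simulates item responses in which one item is far more treatment-sensitive than the rest, then recovers that item from per-item treatment-minus-control gaps in proportion correct. All data and effect sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 2000, 10
ability = rng.normal(size=n_students)          # latent achievement
treated = rng.integers(0, 2, size=n_students)  # random assignment

# Item 0 is strongly treatment-sensitive; items 1-9 get only a small uniform bump.
sensitivity = np.zeros(n_items)
sensitivity[0] = 1.0
logits = ability[:, None] + treated[:, None] * (0.1 + sensitivity)
responses = (rng.random((n_students, n_items)) < 1 / (1 + np.exp(-logits))).astype(int)

# Per-item treatment-minus-control gap in proportion correct.
gaps = responses[treated == 1].mean(axis=0) - responses[treated == 0].mean(axis=0)
```

With this setup, `gaps` should flag item 0 as driving a disproportionate share of the aggregate effect, mirroring the extreme case the abstract describes.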


David M. Quinn, Andrew D. Ho.

The estimation of test score “gaps” and gap trends plays an important role in monitoring educational inequality. Researchers decompose gaps and gap changes into within- and between-school portions to generate evidence on the role schools play in shaping these inequalities. However, existing decomposition methods assume an equal-interval test scale and are a poor fit to coarsened data such as proficiency categories. This leaves many potential data sources ill-suited for decomposition applications. We develop two decomposition approaches that overcome these limitations: an extension of V, an ordinal gap statistic, and an extension of ordered probit models. Simulations show V decompositions have negligible bias with small within-school samples. Ordered probit decompositions have negligible bias with large within-school samples but more serious bias with small within-school samples. More broadly, our methods enable analysts to (1) decompose the difference between two groups on any ordinal outcome into portions within and between levels of some third categorical variable, and (2) estimate scale-invariant between-group differences that adjust for a categorical covariate.
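The ordinal gap statistic V mentioned above can be computed from proficiency-category counts alone. A minimal sketch, following the usual definition of V as √2 times the probit of the probability that a randomly chosen member of one group outscores a member of the other (with ties split evenly); the category counts in the usage note are invented:

```python
import numpy as np
from scipy.stats import norm

def v_gap(counts_a, counts_b):
    """Ordinal gap statistic V from ordered-category counts for two groups.

    V = sqrt(2) * Phi^{-1}( P(a above b) + 0.5 * P(tie) ), so identical
    distributions give V = 0 and V is invariant to monotone rescalings.
    """
    pa = np.asarray(counts_a, float) / np.sum(counts_a)
    pb = np.asarray(counts_b, float) / np.sum(counts_b)
    below = np.concatenate([[0.0], np.cumsum(pb)[:-1]])  # P(b in a lower category)
    auc = np.sum(pa * below) + 0.5 * np.sum(pa * pb)
    return np.sqrt(2.0) * norm.ppf(auc)
```

For example, with four proficiency categories, `v_gap([1, 2, 3, 4], [4, 3, 2, 1])` is about 0.95 (group a scores higher), while identical count vectors return 0. The paper's contribution is decomposing such gaps within and between schools, which this sketch does not attempt.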


Emma M. Klugman, Andrew D. Ho.

State testing programs regularly release previously administered test items to the public. We provide an open-source recipe for state, district, and school assessment coordinators to combine these items flexibly to produce scores linked to established state score scales. These scores would enable estimation of student score distributions and achievement levels. We discuss how educators can use resulting scores to estimate achievement distributions at the classroom and school level. We emphasize that any use of such tests should be tertiary, with no stakes for students, educators, and schools, particularly in the context of a crisis like the COVID-19 pandemic. These tests and their results should also be lower in priority than assessments of physical, mental, and social–emotional health, and lower in priority than classroom and district assessments that may already be in place. We encourage state testing programs to release all the ingredients for this recipe to support low-stakes, aggregate-level assessments. This is particularly urgent during a crisis where scores may be declining and gaps increasing at unknown rates.
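A sketch of the kind of scoring such a recipe enables, assuming released items come with known 2PL item parameters: an EAP (expected a posteriori) ability estimate from a student's responses, mapped onto a reporting scale. The item parameters and the linking constants (500, 50) below are hypothetical placeholders, not values from any state program.

```python
import numpy as np

# Hypothetical released-item parameters (2PL): discrimination a, difficulty b.
a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])
b = np.array([-1.0, -0.3, 0.0, 0.5, 1.2])

def eap_score(resp, a, b, grid=None):
    """EAP ability estimate under a 2PL model with a standard-normal prior."""
    if grid is None:
        grid = np.linspace(-4, 4, 81)
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))   # (grid points, items)
    like = np.prod(np.where(resp == 1, p, 1.0 - p), axis=1)
    post = like * np.exp(-0.5 * grid**2)                 # unnormalized posterior
    return np.sum(grid * post) / np.sum(post)

theta = eap_score(np.array([1, 1, 1, 0, 0]), a, b)
# Map theta to the reporting scale via (hypothetical) published linking constants.
scale_score = 500 + 50 * theta
```

In the workflow the abstract describes, a coordinator would swap in the actual released-item parameters and linking constants published by the state testing program.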
