Joshua B. Gilbert

Scaling up evidence-based educational interventions to improve student outcomes presents challenges, particularly in adapting to new contexts while maintaining fidelity. Structured teacher adaptations that integrate the strengths of experimental science (high fidelity) and improvement science (high adaptation) offer a viable way to bridge the research-practice divide. This preregistered randomized controlled trial examines the effectiveness of structured teacher adaptations in a Tier 1 content literacy intervention, delivered through asynchronous and synchronous methods during COVID-19, on Grade 3 students’ (N = 1,914) engagement in digital app and print-based reading activities, student-teacher interactions, and learning outcomes. Our structured teacher adaptations achieved higher average outcomes and minimal treatment heterogeneity across schools, enhancing the effectiveness of the intervention rather than undermining it.

Longitudinal models of individual growth typically emphasize between-person predictors of change but ignore how growth may vary within persons, because each person contributes only a single data point at each time point. In contrast, modeling growth with multi-item assessments allows evaluation of how relative item performance may shift over time. While traditionally viewed as a nuisance under the label of “item parameter drift” (IPD) in the Item Response Theory literature, we argue that IPD may be of substantive interest if it reflects how learning manifests on different items or subscales at different rates. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) to assess IPD in a causal inference context. Simulation results show that when IPD is not accounted for, both parameter estimates and their standard errors can be affected. We illustrate with an empirical application to the persistence of transfer effects from a content literacy intervention on vocabulary knowledge, revealing how researchers can leverage IPD to achieve a more fine-grained understanding of how vocabulary learning develops over time.
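The core idea of treating IPD as signal rather than nuisance can be illustrated with a minimal NumPy sketch (a toy simulation with illustrative names and values, not the paper's actual EIRM): generate Rasch responses at two waves where one item drifts easier on top of the shared growth, then recover the item-specific shift from wave-by-wave item logits.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons = 5000
theta = rng.normal(0, 1, n_persons)               # person abilities at time 1
easiness = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])  # item easiness at time 1

def simulate(theta, easiness):
    """Rasch model: P(correct) = sigmoid(theta + easiness)."""
    p = 1 / (1 + np.exp(-(theta[:, None] + easiness[None, :])))
    return (rng.uniform(size=p.shape) < p).astype(float)

# Time 2: everyone grows by 0.5 logits, but item 0 also drifts,
# becoming 0.8 logits easier relative to the other items.
growth, drift = 0.5, 0.8
easiness_t2 = easiness.copy()
easiness_t2[0] += drift

y1 = simulate(theta, easiness)
y2 = simulate(theta + growth, easiness_t2)

def item_logits(y):
    """Crude marginal item logits from proportions correct."""
    p = y.mean(axis=0)
    return np.log(p / (1 - p))

# Drift appears as an item-specific shift beyond the common growth.
shift = item_logits(y2) - item_logits(y1)
relative_shift = shift - np.median(shift)  # remove shared growth component
print(relative_shift.round(2))             # item 0 stands out
```

In the actual EIRM framing, the same comparison is made inside a single mixed model via item-by-time interaction terms rather than this two-step contrast of marginal logits.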

When analyzing treatment effects on test scores, researchers face many choices and competing guidance for scoring tests and modeling results. This study examines the impact of scoring choices through simulation and an empirical application. Results show that estimates from multiple methods applied to the same data will vary because two-step models using sum or factor scores provide attenuated standardized treatment effects compared to latent variable models. This bias dominates any other differences between models or features of the data generating process, such as the use of scoring weights. An errors-in-variables (EIV) correction removes the bias from two-step models. An empirical application to data from a randomized controlled trial demonstrates the sensitivity of the results to model selection. This study shows that the psychometric principles most consequential in causal inference are related to attenuation bias rather than optimal scoring weights.
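The attenuation mechanism and its correction can be sketched in a few lines (a toy simulation with illustrative values, not the paper's models): measurement error in the outcome inflates the outcome's standard deviation, shrinking the standardized effect by the square root of the reliability, and dividing the observed-score estimate by that factor recovers the latent effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
true_effect = 0.40   # standardized effect on the latent outcome
reliability = 0.64   # reliability of the observed score (illustrative)

treat = rng.integers(0, 2, n)
latent = true_effect * treat + rng.normal(0, 1, n)

# Observed score = latent + noise, with noise variance chosen so that
# var(latent) / var(observed) = reliability in the control group.
noise_sd = np.sqrt((1 - reliability) / reliability)
observed = latent + rng.normal(0, noise_sd, n)

def std_effect(y, t):
    """Standardized mean difference, scaled by the control-group SD."""
    return (y[t == 1].mean() - y[t == 0].mean()) / y[t == 0].std()

naive = std_effect(observed, treat)        # attenuated: 0.40 * sqrt(0.64) = 0.32
corrected = naive / np.sqrt(reliability)   # errors-in-variables disattenuation
print(round(naive, 2), round(corrected, 2))
```

The attenuation factor sqrt(reliability) is exactly why the choice between sum and factor scores matters less than whether any two-step score is corrected for measurement error at all.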

Analyzing heterogeneous treatment effects (HTE) plays a crucial role in understanding the impacts of educational interventions. A standard practice for HTE analysis is to examine interactions between treatment status and pre-intervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical patterns of HTE on test score outcomes can emerge either from variation in treatment effects due to a pre-intervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished if analysis is based on summary scores alone. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade, and show that any apparent HTE by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.
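One way to see the identification problem is a small simulation (all values illustrative, not drawn from the paper): let the treatment effect be constant across persons but concentrated on easy items, and the sum score nonetheless shows an apparent treatment-by-ability interaction, because easy items are near ceiling for high-ability students.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
theta = rng.normal(0, 1, n)    # latent ability; no true person-level moderator
treat = rng.integers(0, 2, n)

easiness = np.array([-1.5, -0.5, 0.5, 1.5, 2.0])
effect = np.where(easiness > 0, 0.6, 0.0)  # effect concentrated on easy items

logits = theta[:, None] + easiness[None, :] + treat[:, None] * effect[None, :]
y = (rng.uniform(size=logits.shape) < 1 / (1 + np.exp(-logits))).astype(float)
score = y.sum(axis=1)

def gain(mask):
    """Treatment-control difference in mean sum score within a subgroup."""
    return score[mask & (treat == 1)].mean() - score[mask & (treat == 0)].mean()

lo, hi = theta < 0, theta >= 0
# Apparent HTE: larger sum-score gains for low-ability students, even
# though every student has the same item-level effects.
print(round(gain(lo), 2), round(gain(hi), 2))
```

Only item-level data can distinguish this pattern from a genuine treatment-by-pretest interaction, which is the identification argument the paper develops.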

We investigated the effectiveness of a sustained and spiraled content literacy intervention that emphasizes building domain and topic knowledge schemas and vocabulary for elementary-grade students. The Model of Reading Engagement (MORE) intervention centers on thematic lessons that provide an intellectual structure for helping students connect new learning to a general schema in Grade 1 (animal survival), Grade 2 (scientific investigation of past events like dinosaur mass extinctions), and Grade 3 (scientific investigation of living systems). A total of 30 elementary schools (N = 2,870 students) were randomized to a treatment or control condition. In the treatment condition (i.e., full spiral curriculum), students participated in content literacy lessons from Grades 1 to 3 during the school year and wide reading of thematically related informational texts in the summer following Grades 1 and 2. In the control condition (i.e., partial spiral curriculum), students participated in lessons in only Grade 3. The Grade 3 lessons for both conditions were implemented online during the COVID-19 pandemic school year. Results reveal that treatment students outperformed control students on science vocabulary knowledge across all three grades. Furthermore, intent-to-treat analyses revealed positive transfer effects on Grade 3 science reading (ES = .14), domain-general reading comprehension (ES = .11), and mathematics achievement (ES = .12). Treatment impacts were sustained at 14-month follow-up on Grade 4 reading comprehension (ES = .12) and mathematics achievement (ES = .16). Findings indicate that a content literacy intervention that spirals topics and vocabulary across grades can improve students’ long-term academic achievement outcomes.

This simulation study examines the characteristics of the Explanatory Item Response Model (EIRM) for estimating treatment effects, compared to classical test theory (CTT) sum and mean scores and item response theory (IRT)-based theta scores. Results show that the EIRM and IRT theta scores provide generally equivalent bias and false positive rates compared to CTT scores, and superior calibration of standard errors under model misspecification. Analysis of the statistical power of each method reveals that the EIRM and IRT theta scores provide a marginal benefit to power and are more robust to missing data than other methods when parametric assumptions are met, and provide a substantial benefit to power under heteroskedasticity, but their performance is mixed under other conditions. The methods are illustrated with an empirical application examining the causal effect of an elementary school literacy intervention on reading comprehension test scores, which demonstrates that the EIRM provides a more precise estimate of the average treatment effect than the CTT or IRT theta score approaches. Tradeoffs of model selection and interpretation are discussed.

The current study explored the impact of COVID-19 on the reading achievement growth of Grade 3-5 students in a large urban school district in the U.S. and whether the impact differed by students’ demographic characteristics and instructional modality. Specifically, using administrative data from the school district, we investigated to what extent students made gains in reading during the 2020-2021 school year relative to the pre-COVID-19 typical school year in 2018-2019. We further examined whether the effects of students’ instructional modality on reading growth varied by demographic characteristics. Overall, students had lower average reading achievement gains over the 9-month 2020-2021 school year than over the 2018-2019 school year, with learning loss effect sizes of 0.54, 0.27, and 0.28 standard deviation units for Grades 3, 4, and 5, respectively. Substantially reduced reading gains were observed for Grade 3 students, students from high-poverty backgrounds, English learners, and students with reading disabilities. Additionally, findings indicate that among students with similar demographic characteristics, higher-achieving students tended to choose the fully remote instruction option, while lower-achieving students tended to opt for in-person instruction at the beginning of the 2020-2021 school year. However, students who received in-person instruction were more likely to demonstrate continuous growth in reading over the school year, whereas initially higher-achieving students who received remote instruction showed stagnation or decline, particularly in the spring 2021 semester. Our findings support the notion that in-person schooling during the pandemic may have served as an equalizer for lower-achieving students, particularly those from historically marginalized or vulnerable student populations.

Analyses that reveal how treatment effects vary allow researchers, practitioners, and policymakers to better understand the efficacy of educational interventions. In practice, however, standard statistical methods for addressing Heterogeneous Treatment Effects (HTE) fail to address the HTE that may exist within outcome measures. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) for assessing what we term “item-level” HTE (IL-HTE), in which a unique treatment effect is estimated for each item in an assessment. Results from data simulation reveal that when IL-HTE are present but ignored in the model, standard errors can be underestimated and false positive rates can increase. We then apply the EIRM to assess the impact of a literacy intervention focused on promoting transfer in reading comprehension on a digital formative assessment delivered online to approximately 8,000 third-grade students. We demonstrate that allowing for IL-HTE can reveal treatment effects at the item-level masked by a null average treatment effect, and the EIRM can thus provide fine-grained information for researchers and policymakers on the potentially heterogeneous causal effects of educational interventions.