Methods
ChatGPT vs. Machine Learning: Assessing the Efficacy and Accuracy of Large Language Models for Automated Essay Scoring
Automated Essay Scoring (AES) is a critical tool in education that aims to enhance the efficiency and objectivity of educational assessments. Recent advancements in Large Language Models (LLMs), such as ChatGPT, have sparked interest in their potential for AES. However, comprehensive comparisons…
A Framework for Building High-Quality Education Data for R&D in the Age of AI: The EDSI Dataset and Expert Insights
The Gates Foundation, the Walton Family Foundation, and the Chan Zuckerberg Initiative have launched a series of collaborative investments in building large-scale datasets that can support and accelerate data infrastructure for AI R&D efforts in education. In partnership with researchers…
Beg to DIFfer: Resolving Statistical Complications of Intersectional DIF Analyses
Modern test developers conduct differential item functioning (DIF) analyses to ensure fairness in educational and psychological testing. To address previously unrecognized biases, researchers have recently demonstrated the importance of conducting intersectional DIF analyses that attend to the…
Controlling For Measurement Error in Evaluation Models When Treatment Group Assignment is Based on Noisy Measures: Evaluation of an Achievement Gap-Closing Initiative
This paper develops new models to evaluate the effects of interventions and intervention-by-site heterogeneity when treatment group assignment is based on a fallible variable and the outcome of interest is determined in part by the corresponding true control variables (measured without error).…
Comparing Machine Learning Methods for Estimating Heterogeneous Treatment Effects in Randomized Trials: A Comprehensive Simulation Study
This study compares 18 machine learning methods for estimating heterogeneous treatment effects in randomized controlled trials, using simulations calibrated to two large-scale educational experiments. We evaluate performance across continuous and binary outcomes with diverse and realistic…
Measuring “Noncognitive” Skills at Scale: Building Longitudinal Student Behavior Composites Using Administrative Data
“Noncognitive” skills, especially student behavior, are critical predictors of academic and life outcomes. However, measuring student behavior at scale remains challenging, particularly for longitudinal research. This study uses a demographically diverse sample of students followed from…
A Critical Appraisal of the Evidence on Racial Disproportionality in Special Education
This essay provides a two-pronged critical assessment of a subset of the literature on racial disproportionality in special education: that which aims to estimate racial disparities among otherwise similar children. This body of research has shown that Black students are less likely than…
The Sensitivity of Value-Added Estimates to Test Scoring Decisions
Value-Added Models (VAMs) are both common and controversial in education policy and accountability research. While the sensitivity of VAMs to model specification and covariate selection is well documented, the extent to which test scoring methods (e.g., mean scores vs. IRT-based scores) may…
From Sentence-Corrections to Deeper Dialogue: Qualitative Insights from LLM and Teacher Feedback on Student Writing
Effective writing feedback is a powerful tool for enhancing student learning, encouraging revision, and increasing motivation and agency. Yet, teachers face many challenges that prevent them from consistently providing effective writing feedback. Recent advances in generative artificial…
Item-Level Heterogeneity in Value Added Models: Implications for Reliability, Cross-Study Comparability, and Effect Sizes
Value added models (VAMs) attempt to estimate the causal effects of teachers and schools on student test scores. We apply Generalizability Theory to show how estimated VA effects depend upon the selection of test items. Standard VAMs estimate causal effects on the items that are included on the…
Disparate Teacher Effects, Comparative Advantage, and Match Quality
Does student-teacher match quality exist? While prior research documents disparities in teachers' impacts across student types, it has not distinguished between sorting and causal effects as the drivers of these disparities. I develop a flexible disparate value-added model (DVA) and introduce a…
Count Me In? Identifying Factors That Predict Centers’ Application to Boston’s Mixed-Delivery Universal Pre-K Program
Universal prekindergarten (UPK) programs often expand through mixed-delivery systems by offering seats in public schools and community-based centers (CBOs). Although this approach aims to meet varied family needs, little is known about potential systematic differences between CBOs that apply to…
Educator Attention: How computational tools can systematically identify the distribution of a key resource for students
Educator attention is critical for student success, yet how educators distribute their attention across students remains poorly understood due to data and methodological constraints. This study presents the first large-scale computational analysis of educator attention patterns, leveraging over…
Combining Early Grade Assessments to Study Literacy Skills: Addressing the Variability in Tests Taken across Schools and Students
There is considerable variability in the literacy assessments taken in Kindergarten through second grade: across schools, between multilingual learners and other students, and within students over time. This makes it difficult to study changes in students’ acquisition of ELA skills in these…
The Correlated Proxy Problem: Why Control Variables can Obscure the Contribution of Selection Processes to Group-Level Inequality
Whether selection processes contribute to group-level disparities or merely reflect pre-existing inequalities is an important societal question. In the context of observational data, researchers, concerned about omitted-variable bias, assess selection-contributing inequality via a kitchen-sink…
Addressing Threats to Validity in Supervised Machine Learning: A Framework and Best Practices for Education Researchers
Given the rapid adoption of machine learning methods by education researchers, and growing acknowledgement of their inherent risks, there is an urgent need for tailored methodological guidance on how to improve and evaluate the validity of inferences drawn from these methods. Drawing upon an…
How Not to Fool Ourselves About Heterogeneity of Treatment Effects
Researchers across many fields have called for greater attention to heterogeneity of treatment effects, shifting focus from the average effect to variation in effects between different treatments, studies, or subgroups. True heterogeneity is important, but many reports of heterogeneity…
Making the Grade: Accounting for Course Selection in High School Transcripts with Item Response Theory
We apply Item Response Theory (IRT) to high-school transcript data, treating courses as items and grades as ordered responses, to estimate student transcript strength (θ̂) and course difficulty on a common scale. IRT estimation orders courses plausibly by difficulty, differentiates…
Integrating Open Science Principles into Quasi-Experimental Social Science Research
Quasi-experimental methods are a cornerstone of applied social science, providing critical answers to causal questions that inform policy and practice. Although open science principles have influenced experimental research norms across the social sciences, these practices are rarely implemented…
Examining the Relationship Between Randomization Strategies and Contamination in Higher Education Interventions
Randomized controlled trials are the reference method for causal inference, but field experiments in educational settings must balance statistical power with the risk of contamination. This study examines crossover and spillover contamination in a large-enrollment, in-person college course…