Methodology, measurement and data

William Delgado.

Does student-teacher match quality exist? Prior work has documented large disparities in teachers' impacts across student types but has not distinguished between sorting and causal effects as the drivers of these disparities. I propose a disparate value-added model and derive a novel measure of teacher quality, revealed comparative advantage, that captures the degree to which teachers affect student outcome gaps. Quasi-experimental changes in teaching staff show that the comparative advantage measure accurately predicts teachers’ disparate impacts: a teacher with a 1 standard deviation higher revealed comparative advantage for black students increases black students' test scores by 1 standard deviation and has no effect on non-black students' test scores. Teacher removal and teacher-to-classroom re-allocation simulations show substantial efficiency and equity gains from considering teachers’ comparative advantage.


Noam Angrist, Rachael Meager.

Targeted instruction is one of the most effective educational interventions in low- and middle-income countries, yet reported impacts vary by an order of magnitude. We study this variation by aggregating evidence from prior randomized trials across five contexts, and use the results to inform a new randomized trial. We find two factors explain most of the heterogeneity in effects across contexts: the degree of implementation (intention-to-treat or treatment-on-the-treated) and program delivery model (teachers or volunteers). Accounting for these implementation factors yields high generalizability, with similar effect sizes across studies. Thus, reporting treatment-on-the-treated effects, a practice which remains limited, can enhance external validity. We also introduce a new Bayesian framework to formally incorporate implementation metrics into evidence aggregation. Results show targeted instruction delivers average learning gains of 0.42 SD when taken up and 0.85 SD when implemented with high fidelity. To investigate how implementation can be improved in future settings, we run a new randomized trial of a targeted instruction program in Botswana. Results demonstrate that implementation can be improved in the context of a scaling program with large causal effects on learning. While research on implementation has been limited to date, our findings and framework reveal its importance for impact evaluation and generalizability.


Brian McManus, Jessica Howell, Michael Hurwitz.

The impact of test-optional college admissions policies depends on whether applicants act strategically in disclosing test scores. We analyze individual applicants’ standardized test scores and disclosure behavior to 50 major US colleges for entry in fall 2021, when Covid-19 prompted widespread adoption of test-optional policies. Applicants withheld low scores and disclosed high scores, including seeking admissions advantages by conditioning their disclosure choices on their other academic characteristics, colleges’ selectivity and testing policy statements, and the Covid-related test access challenges of the applicants’ local peers. We find only modest differences in test disclosure strategies by applicants’ race and socioeconomic characteristics.


Mei Tan, Dorottya Demszky.

Teachers’ attitudes and classroom management practices critically affect students’ academic and behavioral outcomes, contributing to the persistent issue of racial disparities in school discipline. Yet, identifying and improving classroom management at scale is challenging, as existing methods require expensive classroom observations by experts. We apply natural language processing methods to elementary math classroom transcripts to computationally measure the frequency of teachers’ classroom management language in instructional dialogue and the degree to which such language is reflective of punitive attitudes. We find that the frequency and punitiveness of classroom management language show strong and systematic correlations with human-rated observational measures of instructional quality, student and teacher perceptions of classroom climate, and student academic outcomes. Our analyses reveal racial disparities and patterns of escalation in classroom management language. We find that classrooms with higher proportions of Black students experience more frequent and more punitive classroom management. The frequency and punitiveness of classroom management language escalate over time during observations, and these escalations occur more severely for classrooms with higher proportions of Black students. Our results demonstrate the potential of automated measures and position everyday classroom management interactions as a critical site of intervention for addressing racial disparities, preventing escalation, and reducing punitive attitudes.


Wendy Castillo, David Gillborn.

‘QuantCrit’ (Quantitative Critical Race Theory) is a rapidly developing approach that seeks to challenge and improve the use of statistical data in social research by applying the insights of Critical Race Theory. As originally formulated, QuantCrit rests on five principles: 1) the centrality of racism; 2) numbers are not neutral; 3) categories are not natural; 4) voice and insight (data cannot ‘speak for itself’); and 5) a social justice/equity orientation (Gillborn et al., 2018). The approach has quickly developed an international and interdisciplinary character, including applications in medicine (Gerido, 2020) and literature (Hammond, 2019). Simultaneously, there has been ferocious criticism from detractors outraged by the suggestion that numbers are anything other than objective and scientific (Airaksinen, 2018). In this context it is vital that the approach establishes some common understandings about good practice in order to sustain rigor, make QuantCrit accessible to academics, practitioners, and policymakers alike, and resist widespread attempts to over-simplify and pillory. This paper is intended to advance an iterative process of expanding and clarifying how to ‘QuantCrit’.


Oded Gurantz, Yung-Yu Tsai.

Government programs impose eligibility requirements to balance the goals of improving welfare while minimizing waste. We study the impact of eligibility monitoring in the context of Free Application for Federal Student Aid (FAFSA) submissions, where students may be subject to “verification” requirements that require them to confirm the accuracy of the data. Using a matching-on-observables design, we do not find that students flagged for verification are less likely to enroll in college, which contrasts with prior research. Verification reduces grant aid received but average changes are small, raising questions about the benefits of this administrative process.


D. Betsy McCoach, Anthony J. Gambino, Scott J. Peters, Daniel Long, Del Siegle.

Teacher rating scales (TRS) are often used to make service eligibility decisions for exceptional learners. Although TRS are regularly used to identify student exceptionalism either as part of an informal nomination process or through behavioral rating scales, there is little research documenting the between-teacher variance in teacher ratings or the consequences of such rater dependence. To evaluate the possible benefits or disadvantages of using TRS as part of a gifted identification process, we examined the student-, teacher-, and school-level variance in TRS controlling for student ability and achievement to determine the unique information, consistency, and potential bias in TRS. Between 10% and 25% of a student’s TRS score can be attributed to the teacher doing the rating, and between-teacher standard deviations represent an effect size of one-third to one-half standard deviation unit. Our results suggest that TRS are not easily comparable across teachers, making it impossible to set a cut score for admission into a program (or for further screening) that functions equitably across teachers.


Sam Sims, Jake Anders, Matthew Inglis, Hugues Lortie-Forgues, Ben Styles, Ben Weidmann.

Over the last twenty years, education researchers have increasingly conducted randomised experiments with the goal of informing the decisions of educators and policymakers. Such experiments have generally employed broad, consequential, standardised outcome measures in the hope that this would allow decisionmakers to compare the effectiveness of different approaches. However, a combination of small effect sizes, wide confidence intervals, and treatment effect heterogeneity means that researchers have largely failed to achieve this goal. We argue that quasi-experimental methods and multi-site trials will often be superior for informing educators’ decisions on the grounds that they can achieve greater precision and better address heterogeneity. Experimental research remains valuable in applied education research. However, it should primarily be used to test theoretical models, which can in turn inform educators’ mental models, rather than attempting to directly inform decision making. Since comparable effect size estimates are not of interest when testing educational theory, researchers can and should improve the power of theory-informing experiments by using more closely aligned (i.e., valid) outcome measures. We argue that this approach would reduce wasteful research spending and make the research that does go ahead more statistically informative, thus improving the return on investment in educational research.


David Grissmer, Thomas White, Richard Buddin, Mark Berends, Daniel Willingham, Jamie DeCoster, Chelsea Duran, Chris Hulleman, William Murrah, Tanya Evans.

The Core Knowledge curriculum is a K-8 curriculum focused on building students’ general knowledge about the world they live in, which is hypothesized to increase reading comprehension and Reading/English-LA achievement. This study uses an experimental design to evaluate the long-term effects of attending charter schools teaching the Core Knowledge curriculum. Fourteen oversubscribed kindergarten lotteries for enrollment in nine Core Knowledge charter schools drew 2,310 student applicants from families in predominantly middle/high-income school districts. State achievement data were collected at 3rd-6th grade in Reading/English-LA and Mathematics and at 5th grade in Science. A new methodology addresses two previously undiscovered sources of bias inherent in kindergarten lotteries that include middle/high-income families. The unbiased confirmatory Reading/English-LA results show statistically significant ITT (0.241***) and TOT (0.473***) effects for 3rd-6th grade achievement, with statistically significant ITT and TOT effects at each grade. Exploratory analyses also showed significant ITT (0.15*) and TOT (0.300*) unbiased effects at 5th grade in Science. A CK-Charter school in a low-income school district also had statistically significant, moderate to large unbiased ITT and TOT effects in English Language Arts (ITT = 0.944**; TOT = 1.299**) and Mathematics (ITT = 0.735*; TOT = 0.997*), and positive but statistically insignificant Science effects (ITT = 0.468; TOT = 0.622), which eliminated achievement gaps in all subjects.


Shirin A. Hashim, Thomas Kelley-Kemple, Mary E. Laski.

We propose a new method for estimating school-level characteristics from publicly available census data. We use a school’s location to impute its catchment area by aggregating the nearest n census block groups such that the number of school-aged children in those n block groups is just over the number of students enrolled in that school. We then weight census data by the number of school-aged children in each block group to estimate school-level measures. We conduct several robustness checks to assess the quality of our estimates and find that our method is broadly successful in replicating known school-level characteristics and producing unbiased estimates for school-level income. This method expands the available set of school-level variables to the broader and richer set of characteristics measured in the census, which can then be used to conduct descriptive and observational research across a long time horizon.
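The catchment imputation described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `BlockGroup` fields, the function names, the choice of great-circle distance to a block group's centroid, and the toy numbers are all assumptions made for illustration. The sketch adds block groups in order of distance until the count of school-aged children first exceeds enrollment, then takes a child-weighted mean of one census variable.

```python
import math
from dataclasses import dataclass


@dataclass
class BlockGroup:
    # Hypothetical record layout; the abstract does not specify one.
    lat: float                  # centroid latitude (assumed distance reference)
    lon: float                  # centroid longitude
    school_age_children: int    # census count of school-aged children
    median_income: float        # example census variable to aggregate


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def impute_catchment(school_lat, school_lon, enrollment, block_groups):
    """Take the nearest block groups until their children just exceed enrollment."""
    ordered = sorted(
        block_groups,
        key=lambda bg: haversine_km(school_lat, school_lon, bg.lat, bg.lon),
    )
    catchment, children = [], 0
    for bg in ordered:
        catchment.append(bg)
        children += bg.school_age_children
        if children > enrollment:  # "just over" the school's enrollment
            break
    return catchment


def weighted_income(catchment):
    """Mean income weighted by each block group's school-aged children."""
    total = sum(bg.school_age_children for bg in catchment)
    if total == 0:
        raise ValueError("catchment contains no school-aged children")
    return sum(bg.median_income * bg.school_age_children for bg in catchment) / total


# Toy data: a school at (0, 0) with 250 enrolled students.
groups = [
    BlockGroup(0.0, 0.50, 300, 100_000.0),  # far away, should be excluded
    BlockGroup(0.0, 0.01, 100, 40_000.0),   # nearest
    BlockGroup(0.0, 0.02, 200, 70_000.0),   # second nearest
]
catchment = impute_catchment(0.0, 0.0, 250, groups)
estimate = weighted_income(catchment)
```

In the toy data the two nearest block groups hold 300 children, which is the first cumulative total above the enrollment of 250, so the third group is left out and the estimate is the child-weighted mean of the first two incomes.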
