Methodology, measurement and data
We estimate the longer-run effects of attending an effective high school (one that improves a combination of test scores, survey measures of socio-emotional development, and behaviors in 9th grade) for students who are more versus less educationally advantaged (i.e., likely to attain more years of education based on 8th-grade characteristics). All students benefit from attending effective schools, but the least advantaged students experience larger improvements in high-school graduation, college going, and school-based arrests. This heterogeneity is not solely due to less-advantaged groups being marginal for particular outcomes. Commonly used test-score value-added understates the long-run importance of effective schools, particularly for less-advantaged populations. Patterns suggest this partly reflects less-advantaged students being relatively more responsive to non-test-score dimensions of school quality.
Graduate education is among the fastest growing segments of the U.S. higher educational system. This paper provides up-to-date causal evidence on labor market returns to Master’s degrees and examines heterogeneity in the returns by field, student demographics, and initial labor market conditions. We use rich administrative data from Ohio and an individual fixed effects model that compares students’ earnings trajectories before and after earning a Master’s degree. Findings show that obtaining a Master’s degree increased quarterly earnings by about 12% on average, but the returns vary widely across graduate fields. We also find gender and racial disparities in the returns, with higher average returns for women than for men, and for White than for Black graduates. In addition, by comparing returns among students who graduated before and during the Great Recession, we show that economic downturns appear to reduce but not eliminate the positive returns to Master’s degrees.
Teachers' sense-making of student behavior determines whether students get in trouble and are formally disciplined. Status categories, such as race, can influence perceptions of student culpability, but the degree to which this contributes to racial disproportionality in discipline receipt is unknown. This study provides the first systematic documentation of teachers' use of office discipline referrals (ODRs) in a large, diverse urban school district in California that specifies the identity of both the referred and referring individuals in all ODRs. We identify teachers exhibiting extensive referral behavior (the top 5% of referrers, based on the number of ODRs they make in a given year) and evaluate their contributions to disciplinary disparities. We find that "top referrers" effectively double the racial gaps in ODRs for both Black-White and Hispanic-White comparisons. These gaps are mainly driven by higher numbers of ODRs issued to Black and Hispanic students for interpersonal offenses and defiance, and they partially convert to racial gaps in suspensions. Both the level and racial composition of the school sites where "top referrers" serve and their personal traits seem to explain some of their frequent referring behavior. Targeting supports and interventions to "top referrers" might afford an important opportunity to reduce racial disciplinary gaps.
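As a rough illustration of the "top 5% of referrers" definition above, the following sketch ranks teachers by ODR count and reports the share of all ODRs issued by the top group. It is not the authors' procedure: the function names, the tie-breaking rule, the choice to rank only teachers with at least one referral, and the toy records are all hypothetical.

```python
import math
from collections import Counter


def top_referrers(referrals, share=0.05):
    """Return (top_ids, counts) for the top `share` of teachers by ODR count.

    `referrals` is an iterable of (teacher_id, student_id) pairs for one year.
    Assumption: only teachers with at least one ODR are ranked, and ties are
    broken by teacher ID; the study may define the cutoff differently.
    """
    counts = Counter(teacher for teacher, _student in referrals)
    k = max(1, math.ceil(share * len(counts)))
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return set(ranked[:k]), counts


def share_of_referrals(top_ids, counts):
    """Fraction of all ODRs that were issued by teachers in `top_ids`."""
    total = sum(counts.values())
    return sum(counts[t] for t in top_ids) / total


# Hypothetical toy records: 30 teachers, two of whom refer far more often.
toy_referrals = (
    [("t00", f"s{i}") for i in range(30)]
    + [("t01", f"s{i}") for i in range(10)]
    + [(f"t{j:02d}", "s0") for j in range(2, 30)]
)

if __name__ == "__main__":
    top, counts = top_referrers(toy_referrals)
    print(sorted(top), round(share_of_referrals(top, counts), 3))
```

Comparing group-level referral gaps with and without the records from `top` would be the natural next step, which is the kind of comparison the abstract describes.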
School principals are viewed as critical mechanisms by which to improve student outcomes, but there remain important methodological questions about how to measure principals' effects. We propose a framework for measuring principals' contributions to student outcomes and apply it empirically using data from Tennessee, New York City, and Oregon. We find that using contemporaneous student outcomes to assess principal performance is flawed. Value-added models misattribute to principals changes in student performance caused by factors that principals minimally control. Further, little to none of the variation in average student test scores or attendance is explained by persistent effectiveness differences between principals.
Analyses that reveal how treatment effects vary allow researchers, practitioners, and policymakers to better understand the efficacy of educational interventions. In practice, however, standard statistical methods for addressing Heterogeneous Treatment Effects (HTE) fail to address the HTE that may exist within outcome measures. In this study, we present a novel application of the Explanatory Item Response Model (EIRM) for assessing what we term “item-level” HTE (IL-HTE), in which a unique treatment effect is estimated for each item in an assessment. Results from data simulation reveal that when IL-HTE are present but ignored in the model, standard errors can be underestimated and false positive rates can increase. We then apply the EIRM to assess the impact of a literacy intervention focused on promoting transfer in reading comprehension on a digital formative assessment delivered online to approximately 8,000 third-grade students. We demonstrate that allowing for IL-HTE can reveal treatment effects at the item level that are masked by a null average treatment effect, and the EIRM can thus provide fine-grained information for researchers and policymakers on the potentially heterogeneous causal effects of educational interventions.
What happens when employers would like to screen their employees but only observe a subset of output? We specify a model in which heterogeneous employees respond by producing more of the observed output at the expense of the unobserved output. Though this substitution distorts output in the short term, we derive three sufficient conditions under which the heterogeneous response improves screening efficiency: 1) all employees place similar value on staying in their current role; 2) the employees' utility functions satisfy a variation of the traditional single-crossing condition; 3) employer and worker preferences over output are similar. We then assess these predictions empirically by studying a change to teacher tenure policy in New York City, which increased the role that a single measure -- test score value-added -- played in tenure decisions. We show that, in response to the policy, teachers increased test score value-added and decreased output that did not enter the tenure decision. The increase in test score value-added was largest for the teachers with more ability to improve students' untargeted outcomes, increasing their likelihood of getting tenure. We find that the endogenous response to the policy announcement reduced the screening efficiency gap -- defined as the reduction of screening efficiency stemming from the partial observability of output -- by 28%, effectively shifting some of the cost of partial observability from the post-tenure period to the pre-tenure period.
Given recent evidence challenging the replicability of results in the social and behavioral sciences, critical questions have been raised about appropriate measures for determining replication success in comparing effect estimates across studies. At issue is the fact that conclusions about replication success often depend on the measure used for evaluating correspondence in results. Despite the importance of choosing an appropriate measure, there is still no widespread agreement about which measures should be used. This paper addresses these questions by describing formally the most commonly used measures for assessing replication success, and by comparing their performance in different contexts according to their replication probabilities – that is, the probability of obtaining replication success given study-specific settings. The measures may be characterized broadly as conclusion-based approaches, which assess the congruence of two independent studies’ conclusions about the presence of an effect, and distance-based approaches, which test for a significant difference or equivalence of two effect estimates. We also introduce a new measure for assessing replication success called the correspondence test, which combines a difference and equivalence test in the same framework. To help researchers plan prospective replication efforts, we provide closed formulas for power calculations that can be used to determine the minimum detectable effect size (and thus, sample sizes) for each study so that a predetermined minimum replication probability can be achieved. Finally, we use a replication dataset from the Open Science Collaboration (2015) to demonstrate the extent to which conclusions about replication success depend on the correspondence measure selected.
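The idea of combining a difference test and an equivalence test for two independent effect estimates can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formal definition: it assumes normally distributed estimates with known standard errors, uses a two-sided z-test for the difference and two one-sided tests (TOST) for equivalence, and the function name, the `tolerance` parameter, and the four outcome labels are hypothetical choices made for this example.

```python
from statistics import NormalDist


def correspondence_test(est1, se1, est2, se2, tolerance, alpha=0.05):
    """Classify the correspondence of two independent effect estimates.

    Assumptions (illustrative, not taken from the paper): estimates are
    approximately normal, standard errors are known, and `tolerance` is the
    largest absolute difference still regarded as equivalent.
    """
    norm = NormalDist()
    diff = est1 - est2
    se_diff = (se1 ** 2 + se2 ** 2) ** 0.5

    # Difference test: H0 is that the two effects are equal.
    p_diff = 2 * (1 - norm.cdf(abs(diff) / se_diff))

    # Equivalence test (TOST): H0 is that |difference| >= tolerance.
    p_lower = 1 - norm.cdf((diff + tolerance) / se_diff)  # H0: diff <= -tolerance
    p_upper = norm.cdf((diff - tolerance) / se_diff)      # H0: diff >= +tolerance
    p_equiv = max(p_lower, p_upper)

    different = p_diff < alpha
    equivalent = p_equiv < alpha
    if equivalent and not different:
        outcome = "equivalence"
    elif different and not equivalent:
        outcome = "difference"
    elif different and equivalent:
        outcome = "trivial difference"
    else:
        outcome = "indeterminate"
    return outcome, p_diff, p_equiv


if __name__ == "__main__":
    print(correspondence_test(0.30, 0.05, 0.28, 0.05, tolerance=0.2))
```

Running both tests together separates "no evidence of a difference" from "evidence of equivalence": with imprecise estimates, neither test rejects and the result is indeterminate rather than a replication success.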
Teachers are the most important school-specific factor in student learning. Yet, little evidence exists linking teacher professional learning programs and the various strategies or components that comprise them to student achievement. In this paper, we examine a teacher fellowship model for professional learning designed and implemented by Leading Educators, a national nonprofit organization that aims to bridge research and practice to improve instructional quality and accelerate learning across school systems. During the 2015-16 and 2016-17 school years, Leading Educators conducted its fellowship program for teachers and school leaders to provide educators ongoing, collaborative, job-embedded professional development and to improve student achievement. Relying on quasi-experimental methods, we find that a school’s participation in the fellowship model increased student proficiency rates in math and English language arts on state achievement exams. Further, student achievement benefitted from a more sustained duration of teacher participation in the fellowship model, and the impact on student achievement varied depending on the share of a school’s teachers who participated in the fellowship model and the extent to which teachers independently selected into the fellowship model or were appointed to participate by school leaders. Taken together, findings from this paper should inform professional learning organizations, schools, and policymakers on the design, implementation and impact of teacher professional learning.
How scholars name different racial groups has powerful salience for understanding what researchers study. We explored how education researchers used racial terminology in recently published, high-profile, peer-reviewed studies. Our sample included all original empirical studies published in the non-review AERA journals from 2009 to 2019. We found two-thirds of articles used at least one racial category term, with an increase from about half to almost three-quarters of published studies between 2009 and 2019. Other trends include the increasing popularity of the term Black, the emergence of gender-expansive terms such as Latinx, the popularity of the term Hispanic in quantitative studies, and the paucity of studies with terms connoting missing race data or including terms describing Indigenous and multiracial peoples.