Search EdWorkingPapers

Methodology, measurement and data

Julie Cohen, Anandita Krishnamachari, Vivian C. Wong.

Many novice teachers learn to teach “on-the-job,” leading to burnout and attrition among teachers and negative outcomes for students in the long term. Pre-service teacher education is tasked with optimizing teacher readiness, but there is a lack of causal evidence regarding effective ways for preparing new teachers. In this paper, we use a mixed reality simulation platform to evaluate the causal effects and robustness of an individualized, directive coaching model for candidates enrolled in a university-based teacher education program, as well as for undergraduates considering teaching as a profession. Across five conceptual replication studies, we find that targeted, directive coaching significantly improves candidates’ instructional performance during simulated classroom sessions, and that coaching effects are robust across different teaching tasks, study timing, and modes of delivery. However, coaching effects are smaller for a sub-population of participants not formally enrolled in a teacher preparation program. These participants differed from teacher candidates in multiple ways, including by demographic characteristics, as well as by their prior experiences learning about instructional methods. We highlight implications for research and practice.


Anjali Adukia, Alex Eble, Emileigh Harrison, Hakizumwami Birali Runesha, Teodora Szasz.

Books shape how children learn about society and social norms, in part through the representation of different characters. To better understand the messages children encounter in books, we introduce new artificial intelligence methods for systematically converting images into data. We apply these image tools, along with established text analysis methods, to measure the representation of race, gender, and age in children’s books commonly found in US schools and homes over the last century. We find that more characters with darker skin color appear over time, but "mainstream" award-winning books, which are twice as likely to be checked out from libraries, persistently depict more lighter-skinned characters even after conditioning on perceived race. Across all books, children are depicted with lighter skin than adults. Over time, females are increasingly present but are more represented in images than in text, suggesting greater symbolic inclusion in pictures than substantive inclusion in stories. Relative to their growing share of the US population, Black and Latinx people are underrepresented in the mainstream collection; males, particularly White males, are persistently overrepresented. Our data provide a view into the "black box" of education through children’s books in US schools and homes, highlighting what has changed and what has endured.


Matthew A. Lenard, Mikko Silliman.

We study the effects of informal social interactions on academic achievement and behavior using idiosyncratic variation in peer groups stemming from changes in bus routes across elementary, middle, and high school. In early grades, a one standard-deviation change in the value-added of same-grade bus peers corresponds to a 0.01 SD change in academic performance and a 0.03 SD change in behavior; by high school, these magnitudes grow to 0.04 SD and 0.06 SD. These findings suggest that student interactions outside the classroom—especially in adolescence—may be an important factor in the education production function.


Luke Keele, Matthew Lenard, Lindsay Page.

In education settings, treatments are often non-randomly assigned to clusters, such as schools or classrooms, while outcomes are measured for students. This research design is called the clustered observational study (COS). We examine the consequences of common support violations in the COS context. Common support violations occur when the covariate distributions of treated and control units do not overlap. Such violations are likely to occur in a COS, especially with a small number of treated clusters. One common technique for dealing with common support violations is trimming treated units. We demonstrate how this practice can yield nonsensical results in some COSs. More specifically, we show how trimming the data can result in an uninterpretable estimand. We use data on Catholic schools to illustrate concepts throughout.
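As a rough illustration of a common support violation and of why trimming changes the target of estimation, the sketch below uses a single hypothetical cluster-level covariate with made-up values. The range-overlap rule and the function name `outside_common_support` are assumptions made for illustration, not the procedure used in the paper.

```python
import numpy as np

def outside_common_support(treated_x, control_x):
    """Flag treated clusters whose covariate lies outside the control range.

    Illustrative rule only: real applications usually use several covariates
    or an estimated propensity score rather than one raw covariate.
    """
    treated_x = np.asarray(treated_x, dtype=float)
    control_x = np.asarray(control_x, dtype=float)
    return (treated_x < control_x.min()) | (treated_x > control_x.max())

# Hypothetical cluster-level covariate (for example, mean prior achievement per school).
treated = np.array([0.2, 0.5, 1.4, 1.9])
control = np.array([-0.8, -0.1, 0.3, 0.6, 0.9])

flagged = outside_common_support(treated, control)
kept = treated[~flagged]  # treated clusters that remain after trimming

print("trimmed treated clusters:", int(flagged.sum()), "of", treated.size)
print("covariate mean before trimming:", treated.mean())
print("covariate mean after trimming:", kept.mean())
```

In this toy example, trimming removes half of the treated clusters, and the clusters that remain have a noticeably lower covariate mean than the original treated group. Any effect estimated afterward describes that smaller, different set of clusters, which is one way an estimand can become hard to interpret when few clusters are treated.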


Christine Mulhern, Isaac M. Opper.

There is an emerging consensus that teachers impact multiple student outcomes, but it remains unclear how to summarize these multiple dimensions of teacher effectiveness into simple metrics that can be used for research or personnel decisions. Here, we discuss the implications of estimating teacher effects in a multidimensional empirical Bayes framework and illustrate how to appropriately use these noisy estimates to assess the dimensionality and predictive power of the true teacher effects. Empirically, our principal components analysis indicates that the multiple dimensions can be efficiently summarized by a small number of measures; for example, one dimension explains over half the variation in the teacher effects on all the dimensions we observe. Summary measures based on the first principal component lead to similar rankings of teachers as summary measures weighting short-term effects by their prediction of long-term outcomes. We conclude by discussing the practical implications of using summary measures of effectiveness and, specifically, how to ensure that the policy implementation is fair when different sets of measures are observed for different teachers.


Michael Gilraine, Jeffrey Penney.

An administrative rule allowed students who failed an exam to retake it shortly after, triggering strong ‘teach to the test’ incentives to raise these students' test scores for the retake. We develop a model that accounts for truncation and find that these students score 0.14 standard deviations higher on the retest. Using a regression discontinuity design, we estimate that thirty percent of these gains persist to the following year. These results provide evidence that test-focused instruction or ‘cramming’ raises contemporaneous performance, but a large portion of these gains fade out. Our findings highlight that persistence should be accounted for when comparing educational interventions.


Ishtiaque Fazlul, Todd R. Jones, Jonathan Smith.

Millions of high school students who take an Advanced Placement (AP) course in one of over 30 subjects can earn college credit by performing well on the corresponding AP exam. Using data from four metro-Atlanta public school districts, we find that 15 percent of students’ AP courses do not result in an AP exam. We predict that up to 32 percent of the AP courses that do not result in an AP exam would result in a score of 3 or higher, which generally commands college credit at colleges and universities across the United States. Next, we examine disparities in AP exam-taking rates by demographics and course-taking patterns. Most immediately policy relevant, we find evidence consistent with the positive impact of school district exam subsidies on AP exam-taking rates. In fact, students on free and reduced-price lunch (FRL) in the districts that provide a higher subsidy to FRL students than non-FRL students are more likely to take an AP exam than their non-FRL counterparts, after controlling for demographic and academic covariates.


Kelli A. Bird, Benjamin L. Castleman, Zachary Mabel, Yifeng Song.

Colleges have increasingly turned to predictive analytics to target at-risk students for additional support. Most of the predictive analytic applications in higher education are proprietary, with private companies offering little transparency about their underlying models. We address this lack of transparency by systematically comparing two important dimensions: (1) different approaches to sample and variable construction and how these affect model accuracy; and (2) how the selection of predictive modeling approaches, ranging from methods many institutional researchers would be familiar with to more complex machine learning methods, impacts model performance and the stability of predicted scores. The relative ranking of students’ predicted probability of completing college varies substantially across modeling approaches. While we observe substantial gains in performance from models trained on a sample structured to represent the typical enrollment spells of students and with a robust set of predictors, we observe similar performance between the simplest and most complex models.


David M. Houston, Michael B. Henderson, Paul E. Peterson, Martin R. West.

Do Americans hold a consistent set of opinions about their public schools and how to improve them? From 2013 to 2018, over 5,000 unique respondents participated in more than one consecutive iteration of the annual, nationally representative Education Next poll, offering an opportunity to examine individual-level attitude stability on education policy issues over a six-year period. The proportion of participants who provide the same response to the same question over multiple consecutive years greatly exceeds the amount expected to occur by chance alone. We also find that teachers offer more consistent responses than their non-teaching peers. By contrast, we do not observe similar differences in attitude stability between parents of school-age children and their counterparts without children.


Matthew D. Baird, John Engberg, Isaac M. Opper.

We consider the case in which the number of seats in a program is limited, such as a job training program or a supplemental tutoring program, and explore the implications that peer effects have for which individuals should be assigned to the limited seats. In the frequently-studied case in which all applicants are assigned to a group, the average outcome is not changed by shuffling the group assignments if the peer effect is linear in the average composition of peers. However, when there are fewer seats than applicants, the presence of linear-in-means peer effects can dramatically influence the optimal choice of who gets to participate. We illustrate how peer effects impact optimal seat assignment, both under a general social welfare function and under two commonly used social welfare functions. We next use data from a recent job training RCT to provide the first evidence of large peer effects in the context of job training for disadvantaged adults. Finally, we combine the two results to show that the program's effectiveness varies greatly depending on whether the assignment choices account for or ignore peer effects.
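The invariance result for the case in which every applicant is assigned to a group can be checked numerically. The following is a minimal sketch, not the authors' model: it assumes a hypothetical outcome equation y_i = a + b*x_i + c*(mean of x among i's groupmates, excluding i), with made-up coefficients and simulated data.

```python
import numpy as np

def average_outcome(x, groups, a=1.0, b=0.5, c=0.3):
    """Mean outcome under a hypothetical linear-in-means peer-effect model.

    y_i = a + b * x_i + c * (mean x of i's groupmates, excluding i).
    The coefficients a, b, c are illustrative values, not estimates.
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    y = np.empty_like(x)
    for g in np.unique(groups):
        members = groups == g
        n = int(members.sum())
        if n < 2:
            raise ValueError("each group needs at least two members")
        # Leave-one-out peer mean for every member of group g.
        peer_mean = (x[members].sum() - x[members]) / (n - 1)
        y[members] = a + b * x[members] + c * peer_mean
    return y.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=12)            # simulated individual characteristic
groups = np.repeat([0, 1, 2], 4)   # three groups of four; everyone is assigned

baseline = average_outcome(x, groups)
shuffled = average_outcome(x, rng.permutation(groups))

print("average outcome unchanged by reshuffling:", bool(np.isclose(baseline, shuffled)))
```

Within each group the leave-one-out peer means sum to the group's own total of x, so the overall average reduces to a + (b + c) times the mean of x, whatever the grouping. That cancellation relies on everyone being assigned; once some applicants are left out, who is selected changes the composition of the seated group, which is the case the abstract focuses on.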
