Methodology, measurement and data
Principals (policymakers) disagree as to whether U.S. student performance has changed over the past half century. To inform conversations, agents administered seven million psychometrically linked tests in math (m) and reading (rd) in 160 survey waves to national probability samples of cohorts born between 1954 and 2007. Estimated change in standard deviations (sd) per decade varies by agent (m: -0.10sd to 0.27sd, rd: -0.02sd to 0.12sd). Consistent with Flynn effects, median trends show larger gains in m (0.19sd) than rd (0.04sd), though rates of progress for cohorts born since 1990 have increased in rd but slowed in m. Greater progress is shown by students tested at younger ages (m: 0.31sd, rd: 0.08sd) than when tested in middle years of schooling (m: 0.17sd, rd: 0.03sd) or toward the end of schooling (m: 0.06sd, rd: 0.02sd). Young white students progress more slowly (m: 0.28sd, rd: 0.09sd) than Asian (m: 0.46sd, rd: 0.28sd), black (m: 0.36sd, rd: 0.19sd) and Hispanic (m: 0.29sd, rd: 0.13sd) students. These ethnic differences generally attenuate as students age. Young students in the bottom quartile of the SES distribution show greater progress than those in the top quartile (difference in m: 0.08sd, in rd: 0.15sd), but the reverse is true for older students. Moderators likely include not only changes in families and schools but also improvements in nutrition, health care, and protection from contagious diseases and environmental risks. International data suggest that subject and age differentials may be due to moderators more general than just the United States.
We consider the case in which the number of seats in a program is limited, such as a job training program or a supplemental tutoring program, and explore the implications that peer effects have for which individuals should be assigned to the limited seats. In the frequently studied case in which all applicants are assigned to a group, the average outcome is not changed by shuffling the group assignments if the peer effect is linear in the average composition of peers. However, when there are fewer seats than applicants, the presence of linear-in-means peer effects can dramatically influence the optimal choice of who gets to participate. We illustrate how peer effects impact optimal seat assignment, both under a general social welfare function and under two commonly used social welfare functions. We next use data from a recent job training RCT to provide evidence of large peer effects in the context of job training for disadvantaged adults. Finally, we combine the two results to show that the program's effectiveness varies greatly depending on whether the assignment choices account for or ignore peer effects.
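The contrast between the all-assigned case and the limited-seats case can be illustrated with a toy simulation. This is only a sketch under assumptions that are not taken from the paper: a leave-one-out linear-in-means outcome `A + B*x_i + G*peer_mean`, hypothetical parameter values, simulated characteristics `x`, and a simple total-outcome criterion in place of the welfare functions the authors study.

```python
import random
from statistics import mean

# Hypothetical parameters (assumptions for illustration only):
# A = intercept, B = effect of own characteristic, G = peer effect.
A, B, G = 0.0, 1.0, 0.5

def outcomes_in_groups(x, groups):
    """Outcome per person under a leave-one-out linear-in-means model."""
    y = {}
    for members in groups:
        total = sum(x[i] for i in members)
        for i in members:
            peer_mean = (total - x[i]) / (len(members) - 1)
            y[i] = A + B * x[i] + G * peer_mean
    return y

def shuffled_groups(n, size, rng):
    """Randomly partition applicants 0..n-1 into groups of `size`."""
    ids = list(range(n))
    rng.shuffle(ids)
    return [ids[k:k + size] for k in range(0, n, size)]

def total_outcome_limited_seats(x, seated):
    """Total outcome when only `seated` applicants form one group.

    Unseated applicants receive no peer term (a toy assumption).
    """
    seated = set(seated)
    y_seated = outcomes_in_groups(x, [sorted(seated)])
    y_unseated = [A + B * x[i] for i in range(len(x)) if i not in seated]
    return sum(y_seated.values()) + sum(y_unseated)

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(12)]  # simulated applicant characteristics

# Everyone assigned: two different shuffles give the same average outcome.
avg_1 = mean(outcomes_in_groups(x, shuffled_groups(12, 4, rng)).values())
avg_2 = mean(outcomes_in_groups(x, shuffled_groups(12, 4, rng)).values())

# Only 4 seats: who is seated now changes the total outcome.
ranked = sorted(range(12), key=lambda i: x[i])
total_low = total_outcome_limited_seats(x, ranked[:4])
total_high = total_outcome_limited_seats(x, ranked[-4:])
```

In this toy setup `avg_1` and `avg_2` agree up to floating-point error, while `total_high` and `total_low` differ by `G` times the difference in the summed characteristics of the seated applicants. Which applicants should be seated in practice depends on the sign of the peer effect and on the welfare function, which this sketch does not model.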
In a randomized trial that collects text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by human raters. An impact analysis can then be conducted to compare treatment and control groups, using the hand-coded scores as a measured outcome. This process is both time- and labor-intensive, which creates a persistent barrier for large-scale assessments of text. Furthermore, enriching one's understanding of a found impact on text outcomes via secondary analyses can be difficult without additional scoring efforts. Machine-based text analytic and data mining tools offer one potential avenue to help facilitate research in this domain. For instance, we could augment a traditional impact analysis that examines a single human-coded outcome with a suite of automatically generated secondary outcomes. By analyzing impacts across a wide array of text-based features, we can then explore what an overall change signifies, in terms of how the text has evolved due to treatment. In this paper, we propose several different methods for supplementary analysis in this spirit. We then present a case study of using these methods to enrich an evaluation of a classroom intervention on young children’s writing. We argue that our rich array of findings moves us from “it worked” to “it worked because” by revealing how observed improvements in writing were likely due, in part, to the students having learned to marshal evidence and speak with more authority. Relying exclusively on human scoring, by contrast, is a lost opportunity.
Can families in low-income contexts “pull themselves up by their bootstraps?” In rural Gambia, caregivers with high aspirations for their children, measured before the child starts school, invest substantially more in their children’s education. Despite this, essentially no children are literate or numerate three years later. In contrast, a bundled supply-side intervention administered in these same areas generated large literacy and numeracy gains. Crucially, conditional on receipt of this intervention, high-aspirations children are 25 percent more likely to attain literacy/numeracy than low-aspirations children. We also show how the test score SD metric can mislead when counterfactual learning levels are low.
Many novice teachers learn to teach “on-the-job,” leading to burnout and attrition among teachers and negative outcomes for students in the long term. Pre-service teacher education is tasked with optimizing teacher readiness, but there is a lack of causal evidence regarding effective ways for preparing new teachers. In this paper, we use a mixed reality simulation platform to evaluate the causal effects and robustness of an individualized, directive coaching model for candidates enrolled in a university-based teacher education program, as well as for undergraduates considering teaching as a profession. Across five conceptual replication studies, we find that targeted, directive coaching significantly improves candidates’ instructional performance during simulated classroom sessions, and that coaching effects are robust across different teaching tasks, study timing, and modes of delivery. However, coaching effects are smaller for a sub-population of participants not formally enrolled in a teacher preparation program. These participants differed from teacher candidates in multiple ways, including by demographic characteristics, as well as by their prior experiences learning about instructional methods. We highlight implications for research and practice.
Books shape how children learn about society and social norms, in part through the representation of different characters. To better understand the messages children encounter in books, we introduce new artificial intelligence methods for systematically converting images into data. We apply these image tools, along with established text analysis methods, to measure the representation of race, gender, and age in children’s books commonly found in US schools and homes over the last century. We find that more characters with darker skin color appear over time, but "mainstream" award-winning books, which are twice as likely to be checked out from libraries, persistently depict more lighter-skinned characters even after conditioning on perceived race. Across all books, children are depicted with lighter skin than adults. Over time, females are increasingly present but are more represented in images than in text, suggesting greater symbolic inclusion in pictures than substantive inclusion in stories. Relative to their growing share of the US population, Black and Latinx people are underrepresented in the mainstream collection; males, particularly White males, are persistently overrepresented. Our data provide a view into the "black box" of education through children’s books in US schools and homes, highlighting what has changed and what has endured.
We study the effects of informal social interactions on academic achievement and behavior using idiosyncratic variation in peer groups stemming from changes in bus routes across elementary, middle, and high school. In early grades, a one standard-deviation change in the value-added of same-grade bus peers corresponds to a 0.01 SD change in academic performance and a 0.03 SD change in behavior; by high school, these magnitudes grow to 0.04 SD and 0.06 SD. These findings suggest that student interactions outside the classroom—especially in adolescence—may be an important factor in the education production function.
In education settings, treatments are often non-randomly assigned to clusters, such as schools or classrooms, while outcomes are measured for students. This research design is called the clustered observational study (COS). We examine the consequences of common support violations in the COS context. Common support violations occur when the covariate distributions of treated and control units do not overlap. Such violations are likely to occur in a COS, especially with a small number of treated clusters. One common technique for dealing with common support violations is trimming treated units. We demonstrate how this practice can yield nonsensical results in some COSs. More specifically, we show how trimming the data can result in an uninterpretable estimand. We use data on Catholic schools to illustrate concepts throughout.
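A minimal sketch of why trimming can change the estimand when few clusters are treated. It assumes a single cluster-level covariate and a simple range-based definition of common support; the covariate values are made up for illustration and are not the Catholic school data used in the paper.

```python
def trim_to_common_support(treated, control):
    """Split treated clusters into those inside and outside the control range.

    `treated` and `control` hold one covariate value per cluster
    (a hypothetical cluster-level summary, such as a school average).
    """
    lo, hi = min(control), max(control)
    kept = [v for v in treated if lo <= v <= hi]
    dropped = [v for v in treated if v < lo or v > hi]
    return kept, dropped

# Made-up covariate values for three treated and four control clusters.
treated_clusters = [0.2, 0.8, 0.9]
control_clusters = [0.1, 0.3, 0.4, 0.5]

kept, dropped = trim_to_common_support(treated_clusters, control_clusters)
share_retained = len(kept) / len(treated_clusters)
```

Here two of the three treated clusters fall outside the control range, so any effect estimated after trimming describes a single treated cluster rather than the treated population originally of interest.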
There is an emerging consensus that teachers impact multiple student outcomes, but it remains unclear how to summarize these multiple dimensions of teacher effectiveness into simple metrics that can be used for research or personnel decisions. Here, we discuss the implications of estimating teacher effects in a multidimensional empirical Bayes framework and illustrate how to appropriately use these noisy estimates to assess the dimensionality and predictive power of the true teacher effects. Empirically, our principal components analysis indicates that the multiple dimensions can be efficiently summarized by a small number of measures; for example, one dimension explains over half the variation in the teacher effects on all the dimensions we observe. Summary measures based on the first principal component lead to similar rankings of teachers as summary measures weighting short-term effects by their prediction of long-term outcomes. We conclude by discussing the practical implications of using summary measures of effectiveness and, specifically, how to ensure that the policy implementation is fair when different sets of measures are observed for different teachers.
An administrative rule allowed students who failed an exam to retake it shortly after, triggering strong 'teach to the test' incentives to raise these students' test scores for the retake. We develop a model that accounts for truncation and find that these students score 0.14 standard deviations higher on the retest. Using a regression discontinuity design, we estimate that thirty percent of these gains persist to the following year. These results provide evidence that test-focused instruction or 'cramming' raises contemporaneous performance, but a large portion of these gains fade out. Our findings highlight that persistence should be accounted for when comparing educational interventions.
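As a back-of-the-envelope illustration, the two reported figures can be combined to gauge how much of the retest gain remains a year later. This assumes the thirty percent persistence rate applies directly to the 0.14 SD gain, which is a simplification since the two numbers come from different estimation strategies.

```python
# Figures reported in the abstract.
retest_gain_sd = 0.14      # gain on the retest, in standard deviations
persistence_share = 0.30   # share of the gain persisting to the following year

# Simplifying assumption: persistence applies proportionally to the full gain.
persistent_gain_sd = retest_gain_sd * persistence_share
faded_gain_sd = retest_gain_sd - persistent_gain_sd

print(f"persisting: {persistent_gain_sd:.3f} SD, faded: {faded_gain_sd:.3f} SD")
```

Under that assumption roughly 0.04 SD persists and roughly 0.10 SD fades out.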