Methodology, measurement and data
Principals (policymakers) disagree as to whether U.S. student performance has changed over the past half century. To inform these conversations, agents administered seven million psychometrically linked tests in math (m) and reading (rd) in 160 survey waves to national probability samples of cohorts born between 1954 and 2007. Estimated change in standard deviations (sd) per decade varies by agent (m: -0.10sd to 0.27sd; rd: -0.02sd to 0.12sd). Consistent with Flynn effects, median trends show larger gains in m (0.19sd) than rd (0.04sd), though rates of progress for cohorts born since 1990 have increased in rd but slowed in m. Greater progress is shown by students tested at younger ages (m: 0.31sd, rd: 0.08sd) than by those tested in the middle years of schooling (m: 0.17sd, rd: 0.03sd) or toward the end of schooling (m: 0.06sd, rd: 0.02sd). Young White students progress more slowly (m: 0.28sd, rd: 0.09sd) than Asian (m: 0.46sd, rd: 0.28sd), Black (m: 0.36sd, rd: 0.19sd), and Hispanic (m: 0.29sd, rd: 0.13sd) students. These ethnic differences generally attenuate as students age. Young students in the bottom quartile of the SES distribution show greater progress than those in the top quartile (difference in m: 0.08sd; in rd: 0.15sd), but the reverse is true for older students. Moderators likely include not only changes in families and schools but also improvements in nutrition, health care, and protection from contagious diseases and environmental risks. International data suggest that the subject and age differentials stem from moderators not specific to the United States.
We consider the case in which the number of seats in a program is limited, such as a job training program or a supplemental tutoring program, and explore the implications that peer effects have for which individuals should be assigned to the limited seats. In the frequently studied case in which all applicants are assigned to a group, the average outcome is not changed by shuffling the group assignments if the peer effect is linear in the average composition of peers. However, when there are fewer seats than applicants, the presence of linear-in-means peer effects can dramatically influence the optimal choice of who gets to participate. We illustrate how peer effects impact optimal seat assignment, both under a general social welfare function and under two commonly used social welfare functions. We next use data from a recent job training RCT to provide evidence of large peer effects in the context of job training for disadvantaged adults. Finally, we combine the two results to show that the program's effectiveness varies greatly depending on whether the assignment choices account for or ignore peer effects.
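The contrast this abstract draws can be made concrete with a minimal simulation. The sketch below assumes a stylized linear-in-means model (not the paper's estimated one), with made-up coefficients and data: when every applicant is assigned to some group, shuffling assignments leaves the average outcome unchanged, but with limited seats, the choice of who participates moves the group mean and hence outcomes.

```python
import random
random.seed(0)

# Hypothetical coefficients for a linear-in-means outcome model:
# y_i = a + b * x_i + g * mean(x of i's group peers).
a, b, g = 1.0, 0.5, 0.3

def mean_outcome(groups):
    """Average outcome across all individuals in all groups."""
    total, n = 0.0, 0
    for members in groups:
        xbar = sum(members) / len(members)
        for x in members:
            total += a + b * x + g * xbar
            n += 1
    return total / n

x = [random.gauss(0, 1) for _ in range(40)]

# Case 1: all 40 applicants are assigned to one of two equal groups.
# The overall average outcome is invariant to how they are shuffled.
def split(vals):
    return [vals[:20], vals[20:]]

m1 = mean_outcome(split(x))
shuffled = x[:]
random.shuffle(shuffled)
m2 = mean_outcome(split(shuffled))
print(abs(m1 - m2) < 1e-9)  # True: shuffling does not matter

# Case 2: only 20 seats. Now who is admitted changes the peer mean,
# so selection matters even though the peer effect is linear-in-means.
top = sorted(x)[-20:]
bottom = sorted(x)[:20]
print(mean_outcome([top]) > mean_outcome([bottom]))  # True
```

The invariance in Case 1 follows because, with equal-sized groups, the group-mean terms aggregate to the overall mean regardless of the partition; capacity constraints break that symmetry.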
In a randomized trial that collects text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by human raters. An impact analysis can then be conducted to compare treatment and control groups, using the hand-coded scores as a measured outcome. This process is both time- and labor-intensive, which creates a persistent barrier for large-scale assessments of text. Furthermore, enriching one's understanding of a found impact on text outcomes via secondary analyses can be difficult without additional scoring efforts. Machine-based text analytic and data mining tools offer one potential avenue to help facilitate research in this domain. For instance, we could augment a traditional impact analysis that examines a single human-coded outcome with a suite of automatically generated secondary outcomes. By analyzing impacts across a wide array of text-based features, we can then explore what an overall change signifies, in terms of how the text has evolved due to treatment. In this paper, we propose several different methods for supplementary analysis in this spirit. We then present a case study of using these methods to enrich an evaluation of a classroom intervention on young children’s writing. We argue that our rich array of findings moves us from “it worked” to “it worked because” by revealing how observed improvements in writing were likely due, in part, to the students having learned to marshal evidence and speak with more authority. Relying exclusively on human scoring, by contrast, is a lost opportunity.
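The idea of automatically generated secondary outcomes can be sketched in a few lines. This is a toy illustration, not the paper's feature set: the documents, the two features, and the comparison are all invented, and a real analysis would use many more features plus a formal impact model.

```python
import statistics

def features(doc):
    """Two simple machine-computable text features (hypothetical choices)."""
    words = doc.split()
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }

# Made-up student writing samples from treatment and control groups.
treatment_docs = [
    "the evidence in the text clearly shows the claim",
    "because the data support it we argue the point",
]
control_docs = ["i like the story", "it was good"]

def group_means(docs, name):
    return statistics.mean(features(d)[name] for d in docs)

# Compare group means feature by feature, as secondary outcomes
# alongside a single human-coded primary outcome.
for feat in ("n_words", "avg_word_len"):
    diff = group_means(treatment_docs, feat) - group_means(control_docs, feat)
    print(feat, round(diff, 2))
```

Each feature-level difference is then a candidate explanation for *how* the text changed, which is the "it worked because" step the abstract describes.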
Many novice teachers learn to teach “on-the-job,” leading to burnout and attrition among teachers and negative outcomes for students in the long term. Pre-service teacher education is tasked with optimizing teacher readiness, but there is a lack of causal evidence regarding effective ways to prepare new teachers. In this paper, we use a mixed reality simulation platform to evaluate the causal effects and robustness of an individualized, directive coaching model for candidates enrolled in a university-based teacher education program, as well as for undergraduates considering teaching as a profession. Across five conceptual replication studies, we find that targeted, directive coaching significantly improves candidates’ instructional performance during simulated classroom sessions, and that coaching effects are robust across different teaching tasks, study timing, and modes of delivery. However, coaching effects are smaller for a sub-population of participants not formally enrolled in a teacher preparation program. These participants differed from teacher candidates in multiple ways, including demographic characteristics and prior experiences learning about instructional methods. We highlight implications for research and practice.
Books shape how children learn about society and social norms, in part through the representation of different characters. To better understand the messages children encounter in books, we introduce new artificial intelligence methods for systematically converting images into data. We apply these image tools, along with established text analysis methods, to measure the representation of race, gender, and age in children’s books commonly found in US schools and homes over the last century. We find that more characters with darker skin color appear over time, but "mainstream" award-winning books, which are twice as likely to be checked out from libraries, persistently depict more lighter-skinned characters even after conditioning on perceived race. Across all books, children are depicted with lighter skin than adults. Over time, females are increasingly present but are more represented in images than in text, suggesting greater symbolic inclusion in pictures than substantive inclusion in stories. Relative to their growing share of the US population, Black and Latinx people are underrepresented in the mainstream collection; males, particularly White males, are persistently overrepresented. Our data provide a view into the "black box" of education through children’s books in US schools and homes, highlighting what has changed and what has endured.
We study the effects of informal social interactions on academic achievement and behavior using idiosyncratic variation in peer groups stemming from changes in bus routes across elementary, middle, and high school. In early grades, a one standard-deviation change in the value-added of same-grade bus peers corresponds to a 0.01 SD change in academic performance and a 0.03 SD change in behavior; by high school, these magnitudes grow to 0.04 SD and 0.06 SD. These findings suggest that student interactions outside the classroom—especially in adolescence—may be an important factor in the education production function.
In education settings, treatments are often non-randomly assigned to clusters, such as schools or classrooms, while outcomes are measured for students. This research design is called the clustered observational study (COS). We examine the consequences of common support violations in the COS context. Common support violations occur when the covariate distributions of treated and control units do not overlap. Such violations are likely to occur in a COS, especially with a small number of treated clusters. One common technique for dealing with common support violations is trimming treated units. We demonstrate how this practice can yield nonsensical results in some COSs. More specifically, we show how trimming the data can result in an uninterpretable estimand. We use data on Catholic schools to illustrate concepts throughout.
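The overlap problem and the trimming step can be illustrated with a small numeric sketch. The cluster-level covariate values below are invented, not the Catholic-school data: treated clusters whose covariates fall outside the control range violate common support, and dropping them changes whom the resulting estimate is about.

```python
# Hypothetical cluster-level covariate values (e.g., school size in 100s).
treated_x = [0.2, 0.4, 0.9, 1.5, 2.1]
control_x = [0.1, 0.3, 0.5, 0.8, 1.0]

# A crude common-support region: the range spanned by control clusters.
lo, hi = min(control_x), max(control_x)

# Treated clusters outside that range have no comparable controls.
on_support = [x for x in treated_x if lo <= x <= hi]
trimmed = [x for x in treated_x if not (lo <= x <= hi)]
print(on_support)  # [0.2, 0.4, 0.9]
print(trimmed)     # [1.5, 2.1]

# After trimming, an "effect on the treated" is estimated only for the
# retained clusters, so the estimand no longer describes the original
# treated population -- the interpretability problem the paper raises.
share_kept = len(on_support) / len(treated_x)
print(share_kept)  # 0.6
```

With only a handful of treated clusters, trimming even one or two can leave an estimand that refers to a small, unrepresentative remnant of the treated group.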
An administrative rule allowed students who failed an exam to retake it shortly after, triggering strong “teach to the test” incentives to raise these students' test scores for the retake. We develop a model that accounts for truncation and find that these students score 0.14 standard deviations higher on the retest. Using a regression discontinuity design, we estimate that thirty percent of these gains persist to the following year. These results provide evidence that test-focused instruction, or “cramming,” raises contemporaneous performance, but a large portion of these gains fades out. Our findings highlight that persistence should be accounted for when comparing educational interventions.
Millions of high school students who take an Advanced Placement (AP) course in one of over 30 subjects can earn college credit by performing well on the corresponding AP exam. Using data from four metro-Atlanta public school districts, we find that 15 percent of students’ AP courses do not result in an AP exam. We predict that up to 32 percent of the AP courses that do not result in an AP exam would result in a score of 3 or higher, which generally commands college credit at colleges and universities across the United States. Next, we examine disparities in AP exam-taking rates by demographics and course-taking patterns. Most immediately policy-relevant, we find evidence consistent with a positive impact of school district exam subsidies on AP exam-taking rates. Indeed, in districts that provide students on free and reduced-price lunch (FRL) a larger subsidy than non-FRL students, FRL students are more likely to take an AP exam than their non-FRL counterparts, after controlling for demographic and academic covariates.
Colleges have increasingly turned to predictive analytics to target at-risk students for additional support. Most predictive analytic applications in higher education are proprietary, with private companies offering little transparency about their underlying models. We address this lack of transparency by systematically comparing models along two important dimensions: (1) how different approaches to sample and variable construction affect model accuracy; and (2) how the selection of predictive modeling approaches, ranging from methods familiar to many institutional researchers to more complex machine learning methods, impacts model performance and the stability of predicted scores. The relative ranking of students’ predicted probability of completing college varies substantially across modeling approaches. While we observe substantial gains in performance from models trained on a sample structured to represent the typical enrollment spells of students and with a robust set of predictors, we observe similar performance between the simplest and most complex models.
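The ranking-instability point matters because support is typically targeted at the students a model ranks as least likely to complete. The sketch below uses invented predicted probabilities from two hypothetical models (not the paper's estimates) to show how models with broadly similar discrimination can still flag different students.

```python
# Hypothetical predicted probabilities of college completion
# from two different modeling approaches.
model_a = {"s1": 0.91, "s2": 0.62, "s3": 0.55, "s4": 0.40, "s5": 0.20}
model_b = {"s1": 0.88, "s2": 0.35, "s3": 0.70, "s4": 0.65, "s5": 0.25}

def ranking(scores):
    """Students ordered from most to least likely to complete."""
    return sorted(scores, key=scores.get, reverse=True)

rank_a = ranking(model_a)
rank_b = ranking(model_b)
print(rank_a)  # ['s1', 's2', 's3', 's4', 's5']
print(rank_b)  # ['s1', 's3', 's4', 's2', 's5']

# If an institution can support only the k lowest-ranked students,
# the two models select different students to target.
k = 2
at_risk_a = set(rank_a[-k:])
at_risk_b = set(rank_b[-k:])
print(at_risk_a == at_risk_b)  # False
```

This is why the paper evaluates not just overall accuracy but the stability of the predicted scores across modeling choices.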