Methodology, measurement and data
Many teacher education researchers have expressed concerns about the lack of rigorous impact evaluations of teacher preparation practices. I summarize these concerns as they relate to issues of internal validity, external validity, and measurement. I then assess the prevalence of these issues by reviewing 166 impact evaluations of teacher preparation practices published in peer-reviewed journals between 2002 and 2019. Although I find that very few studies address issues of internal validity, external validity, and measurement, I highlight some innovative approaches and present a checklist of considerations to assist future researchers in designing more rigorous impact evaluations.
Dual-enrollment courses are theorized to promote students' preparedness for college in part by bolstering their beneficial beliefs, such as academic self-efficacy, educational expectations, and sense of college belonging. These beliefs may also shape students' experiences and outcomes in dual-enrollment courses, yet few if any studies have examined this possibility. We study a large dual-enrollment program created by a university in the Southwest to examine these patterns. We find that mathematics self-efficacy and educational expectations predict performance in dual-enrollment courses, even when controlling for students' academic preparedness, while factors such as high school belonging, college belonging, and self-efficacy in other academic domains are unrelated to academic performance. However, we also find that students of color and first-generation students tend to have lower self-efficacy and educational expectations before enrolling in dual-enrollment courses, in addition to having lower levels of academic preparation. These findings suggest that students from historically marginalized populations may benefit from social-psychological as well as academic supports in order to receive maximum benefits from early postsecondary opportunities such as dual-enrollment. Our findings have implications for how states and dual-enrollment programs determine eligibility for dual-enrollment as well as how dual-enrollment programs should be designed and delivered in order to promote equity in college preparedness.
School closures induced by COVID-19 placed heightened emphasis on alternative ways to measure student learning besides in-person exams. We leverage the administration of phone-based assessments (PBAs) measuring numeracy and literacy for primary school children in Kenya, along with in-person standardized tests administered to the same students prior to school shutdowns, to assess the validity of PBAs. Compared to repeated in-person assessments, PBAs did not severely misclassify students' relative performance, but PBA scores did tend to be further from baseline in-person scores than repeated in-person assessments were from each other. As such, PBAs performed well at measuring aggregate, but not individual, learning levels. Administrators can therefore use these tools for aggregate measurement, such as in the context of impact evaluation, but should be wary of relying on PBAs for individual-level tracking or high-stakes decisions. Results also reveal the importance of making deliberate efforts to reach a representative sample and of selecting items that provide discriminating power.
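The aggregate-versus-individual distinction above can be illustrated with a small simulation. All scores and noise levels below are invented stand-ins, not the Kenya study's data: a minimal sketch assuming the phone-based mode simply adds more measurement noise per child than a repeated in-person test.

```python
import random

random.seed(1)
n = 500
# Hypothetical scores: baseline in-person, a repeated in-person test,
# and a noisier phone-based assessment of the same children.
baseline = [random.gauss(50, 10) for _ in range(n)]
repeat = [b + random.gauss(0, 3) for b in baseline]
pba = [b + random.gauss(0, 8) for b in baseline]

def mean(xs):
    return sum(xs) / len(xs)

# Aggregate level: both modes recover the baseline group mean closely.
agg_gap_repeat = abs(mean(repeat) - mean(baseline))
agg_gap_pba = abs(mean(pba) - mean(baseline))

# Individual level: the noisier phone-based scores sit much further
# from each child's baseline than the repeated in-person test does.
ind_gap_repeat = mean([abs(r - b) for r, b in zip(repeat, baseline)])
ind_gap_pba = mean([abs(p - b) for p, b in zip(pba, baseline)])
```

Per-child noise largely averages out of a group mean, which is why a mode can be adequate for impact evaluation yet unsuitable for high-stakes decisions about individuals.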
This paper compares and contrasts two building-level school violence measures required under NCLB, arrests and incidents of well-defined school misconduct, across 20 years of Pennsylvania's approximately 3,000 public school buildings. Generally, both arrests for school violence and incidents of school violence are rare events. Over 20 years, the third-quartile arrest rate was zero, and the third-quartile incident rate was 3.3%. Relatively few of Pennsylvania's school buildings, 4.1% overall, were persistently dangerous as defined and reported pursuant to Pennsylvania's state plan to the US Department of Education; however, these buildings enrolled about 7.8% of the student population statewide. When we measure whether a school building is dangerous based on reported school violence incidents alone, that is, without an arrest requirement, fully 36.9% of Pennsylvania's school buildings were dangerous, and they enrolled 46.7% of the students statewide. Both Philadelphia and Pittsburgh public school buildings were disproportionately unsafe, and both districts were among the 20 most unsafe districts in the state over the 20-year study period.
Exploratory regression analysis explained about 58% of the variation in mean building scale scores for math and language arts. As expected, household poverty, holding all else constant, has a strong negative effect on learning outcomes. A school building composed entirely of low-income students will score about 240 scale points lower, about 1.24 standard deviations, than a school building without any low-income students. A school building at the 90th percentile in terms of student misconduct and poverty rates would have test scores about 1 to 1.28 standard deviations lower. Were a school administrator to reduce student misconduct rates from the 90th percentile to the 50th percentile, our regression coefficients predict learning gains on the order of two-thirds of a standard deviation in mean scale scores.
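The scale-point-to-standard-deviation conversion implied by the figures above can be made explicit. This is illustrative arithmetic only, using the two numbers quoted in the abstract:

```python
# The abstract states that a 240 scale-point gap corresponds to about
# 1.24 standard deviations, which pins down the scale-score SD.
gap_points = 240
gap_sd = 1.24
implied_sd = gap_points / gap_sd            # about 193.5 scale points

# The 90th-percentile misconduct-and-poverty penalty of 1 to 1.28 SD,
# restated on the scale-score metric (roughly 194 to 248 points):
penalty_points = (1.0 * implied_sd, 1.28 * implied_sd)
```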
Half of kindergarten teachers split children into higher and lower ability groups for reading or math. In national data, we predicted kindergarten ability group placement using linear and ordinal logistic regression with classroom fixed effects. In fall, test scores were the best predictors of group placement, but there was bias favoring girls, high-SES (socioeconomic status) children, and Asian Americans, who received higher placements than their scores alone would predict. Net of SES, there was no bias against placing black children in higher groups. By spring, one third of kindergartners moved groups, and high-SES children moved up more than their score gains alone would predict. Teacher-reported behaviors (e.g., attentiveness, approaches to learning) helped explain girls’ higher placements, but did little to explain the higher placements of Asian American and high-SES children.
Underrepresented minority (URM) college students have been steadily earning degrees in relatively less-lucrative fields of study since the mid-1990s. A decomposition reveals that this widening gap is principally explained by rising stratification at public research universities, many of which increasingly enforce GPA restriction policies that prohibit students with poor introductory grades from declaring popular majors. We investigate these GPA restrictions by constructing a novel 50-year dataset covering four public research universities' student transcripts and employing a staggered difference-in-differences design around the implementation of 29 restrictions. Restricted majors' average URM enrollment share falls by 20 percent, which matches observational patterns and can be explained by URM students' poorer average pre-college academic preparation. Using first-term course enrollments to identify students who intend to earn restricted majors, we find that major restrictions disproportionately divert URM students from their intended majors toward less-lucrative fields, driving within-institution ethnic stratification and likely exacerbating labor market disparities.
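The staggered design described above can be sketched with synthetic data: units (say, majors) adopt a restriction in different years, and a two-way fixed-effects regression recovers the treatment effect. Everything below is invented for illustration, not the authors' transcript data, and the simple TWFE estimator shown is only unbiased here because the simulated effect is homogeneous; staggered adoption with heterogeneous effects requires more careful estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_years, true_effect = 30, 10, -0.20   # e.g., change in an enrollment share

adopt = rng.integers(3, 9, size=n_units)        # staggered adoption years
unit_fe = rng.normal(0, 0.5, n_units)
year_fe = rng.normal(0, 0.3, n_years)

# Build a unit-by-year panel with unit and year fixed effects plus
# a treatment effect that switches on at each unit's adoption year.
rows = []
for i in range(n_units):
    for t in range(n_years):
        treated = float(t >= adopt[i])
        y = unit_fe[i] + year_fe[t] + true_effect * treated + rng.normal(0, 0.05)
        rows.append((i, t, treated, y))

i_idx, t_idx, d, y = map(np.array, zip(*rows))

# Design matrix: treatment dummy plus unit and year dummies
# (one year dropped to avoid perfect collinearity).
X = np.column_stack([
    d,
    np.eye(n_units)[i_idx.astype(int)],
    np.eye(n_years)[t_idx.astype(int)][:, 1:],
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
twfe_estimate = beta[0]                         # close to true_effect here
```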
Principals (policymakers) disagree as to whether U.S. student performance has changed over the past half century. To inform conversations, agents administered seven million psychometrically linked tests in math (m) and reading (rd) in 160 survey waves to national probability samples of cohorts born between 1954 and 2007. Estimated change in standard deviations (sd) per decade varies by agent (m: -0.10sd to 0.27sd, rd: -0.02sd to 0.12sd). Consistent with Flynn effects, median trends show larger gains in m (0.19sd) than rd (0.04sd), though rates of progress for cohorts born since 1990 have increased in rd but slowed in m. Greater progress is shown by students tested at younger ages (m: 0.31sd, rd: 0.08sd) than when tested in middle years of schooling (m: 0.17sd, rd: 0.03sd) or toward end of schooling (m: 0.06sd, rd: 0.02sd). Young white students progress more slowly (m: 0.28sd, rd: 0.09sd) than Asian (m: 0.46sd, rd: 0.28sd), black (m: 0.36sd, rd: 0.19sd) and Hispanic (m: 0.29sd, rd: 0.13sd) students. These ethnic differences generally attenuate as students age. Young students in the bottom quartile of the SES distribution show greater progress than those in the top quartile (difference in m: 0.08sd, in rd: 0.15sd), but the reverse is true for older students. Moderators likely include not only changes in families and schools but also improvements in nutrition, health care, and protection from contagious diseases and environmental risks. International data suggest that the subject and age differentials may be due to moderators that are not specific to the United States.
We consider the case in which the number of seats in a program is limited, such as a job training program or a supplemental tutoring program, and explore the implications that peer effects have for which individuals should be assigned to the limited seats. In the frequently-studied case in which all applicants are assigned to a group, the average outcome is not changed by shuffling the group assignments if the peer effect is linear in the average composition of peers. However, when there are fewer seats than applicants, the presence of linear-in-means peer effects can dramatically influence the optimal choice of who gets to participate. We illustrate how peer effects impact optimal seat assignment, both under a general social welfare function and under two commonly used social welfare functions. We next use data from a recent job training RCT to provide evidence of large peer effects in the context of job training for disadvantaged adults. Finally, we combine the two results to show that the program's effectiveness varies greatly depending on whether the assignment choices account for or ignore peer effects.
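The paper's first observation can be demonstrated directly: with a linear-in-means peer effect and everyone assigned to some group, the average outcome is invariant to shuffling the groups, but with limited seats the choice of who participates matters. The outcome model and coefficients below are invented for illustration.

```python
import random

random.seed(0)
x = [random.gauss(0, 1) for _ in range(40)]     # applicant characteristics
a, b, c = 1.0, 0.5, 0.3                         # intercept, own effect, peer effect

def mean(v):
    return sum(v) / len(v)

def avg_outcome(groups):
    """Average outcome when y_i = a + b*x_i + c*mean(x over i's group)."""
    ys = []
    for g in groups:
        m = mean([x[i] for i in g])
        ys += [a + b * x[i] + c * m for i in g]
    return mean(ys)

# Everyone assigned: shuffling group membership leaves the average unchanged,
# because the mean of equal-sized group means is always the grand mean.
ids = list(range(40))
split = [ids[:20], ids[20:]]
random.shuffle(ids)
reshuffled = [ids[:20], ids[20:]]
full_gap = abs(avg_outcome(split) - avg_outcome(reshuffled))   # essentially zero

# Only 20 seats: whom you admit now changes the average outcome among
# participants, through both own and peer channels.
ranked = sorted(range(40), key=lambda i: x[i])
seat_gap = abs(avg_outcome([ranked[-20:]]) - avg_outcome([ranked[:20]]))
```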
In a randomized trial that collects text as an outcome, traditional approaches for assessing treatment impact require that each document first be manually coded for constructs of interest by human raters. An impact analysis can then be conducted to compare treatment and control groups, using the hand-coded scores as a measured outcome. This process is both time- and labor-intensive, which creates a persistent barrier for large-scale assessments of text. Furthermore, enriching one's understanding of an observed impact on text outcomes via secondary analyses can be difficult without additional scoring effort. Machine-based text analytic and data mining tools offer one potential avenue to help facilitate research in this domain. For instance, we could augment a traditional impact analysis that examines a single human-coded outcome with a suite of automatically generated secondary outcomes. By analyzing impacts across a wide array of text-based features, we can then explore what an overall change signifies, in terms of how the text has evolved due to treatment. In this paper, we propose several different methods for supplementary analysis in this spirit. We then present a case study of using these methods to enrich an evaluation of a classroom intervention on young children's writing. We argue that our rich array of findings moves us from "it worked" to "it worked because" by revealing how observed improvements in writing were likely due, in part, to the students having learned to marshal evidence and speak with more authority. Relying exclusively on human scoring, by contrast, is a lost opportunity.
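The "suite of automatically generated secondary outcomes" idea can be sketched as follows: derive simple machine-computable features from each document and compare treatment and control on every feature, not just a single hand-coded score. The texts and the three features below are toy stand-ins, not the authors' feature set.

```python
def features(text):
    """Compute a few illustrative machine-derived outcomes for one document."""
    words = text.lower().split()
    evidence_markers = {"because", "since", "therefore", "evidence"}
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "evidence_rate": sum(w in evidence_markers for w in words) / len(words),
    }

# Toy corpora standing in for student writing in each condition.
treatment = ["I think cats are best because they are quiet",
             "Dogs bark therefore parks get loud"]
control = ["Cats are nice", "Dogs are fun"]

def group_means(texts):
    feats = [features(t) for t in texts]
    return {k: sum(f[k] for f in feats) / len(feats) for k in feats[0]}

# Treatment-control contrast on every automatically generated feature.
impact = {k: group_means(treatment)[k] - group_means(control)[k]
          for k in features(treatment[0])}
```

Profiling impacts across many such features is what lets an evaluator move from "scores rose" toward a description of how the writing itself changed, for example a higher rate of evidence-marking connectives.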