Paul T. von Hippel
Educational researchers often report effect sizes in standard deviation units (SD), but SD effects are hard to interpret. Effects are easier to interpret in percentile points, but converting SDs to percentile points involves a calculation that is not transparent to educational stakeholders. We show that if the outcome variable is normally distributed, simply multiplying the SD effect by 37 usually gives an excellent approximation to the percentile-point effect. For students in the middle three-fifths of a normal distribution, this rule of thumb is always accurate to within 1.6 percentile points for effect sizes of up to 0.8 SD. Two examples show that the rule can be just as accurate for empirical effects from real studies. Applying the rule to Kraft’s empirical benchmarks, we find that the least effective third of educational interventions raise scores by 0 to 2 percentile points; the middle third raise scores by 2 to 7 percentile points; and the most effective third raise scores by more than 7 percentile points.
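The rule of thumb above can be checked against the exact normal-distribution conversion. The following is a minimal sketch, not the paper's own code: the function names are hypothetical, and it assumes that an effect of d SD moves a student from a given starting percentile to the percentile d SD higher on a standard normal curve. The paper may define the comparison differently.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, SD 1); assumes a normally distributed outcome.
STD_NORMAL = NormalDist()


def rule_of_37(effect_sd):
    """Approximate percentile-point gain: multiply the SD effect by 37."""
    return 37.0 * effect_sd


def exact_gain(effect_sd, start_percentile=50.0):
    """Exact percentile-point gain for a student who starts at start_percentile.

    Assumption (ours): the effect shifts the student's z-score upward by
    effect_sd, and the gain is the change in that student's percentile rank.
    """
    z_start = STD_NORMAL.inv_cdf(start_percentile / 100.0)
    return 100.0 * STD_NORMAL.cdf(z_start + effect_sd) - start_percentile


if __name__ == "__main__":
    # Compare the rule with the exact conversion for a student at the median.
    for d in (0.05, 0.2, 0.5, 0.8):
        approx = rule_of_37(d)
        exact = exact_gain(d)
        print(f"d = {d:.2f} SD: rule = {approx:5.2f}, exact = {exact:5.2f}, "
              f"difference = {approx - exact:+.2f} percentile points")
```

For a student starting at the median, the two conversions differ by less than one percentile point across this range of effect sizes. Passing a different `start_percentile` shows how the approximation behaves away from the median.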
Longitudinal studies can produce biased estimates of learning if children miss tests. In an application to summer learning, we illustrate how missing test scores can create an illusion of large summer learning gaps when true gaps are close to zero. We demonstrate two methods that reduce bias by exploiting the correlations between missing and observed scores on tests taken by the same child at different times. One method, multiple imputation, uses those correlations to fill in missing scores with plausible imputed scores. The other method models the correlations implicitly, using child-level random effects. Widespread adoption of these methods would improve the validity of summer learning studies and other longitudinal research in education.
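The idea of exploiting the correlation between a child's missing and observed scores can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the methods used in the paper: the data are simulated, the missingness pattern and all parameter values are invented for illustration, and single regression imputation stands in for full multiple imputation (which would also add random draws to reflect uncertainty).

```python
import random


def simulate(n=20000, rho=0.8, seed=0):
    """Simulate fall and spring scores (both standard normal, correlation rho).

    Assumption (ours): spring scores go missing more often for children with
    low fall scores, so missingness depends only on an observed score.
    """
    rng = random.Random(seed)
    fall, spring, observed = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        p_missing = 0.7 if x < 0 else 0.1  # hypothetical missingness rates
        fall.append(x)
        spring.append(y)
        observed.append(rng.random() >= p_missing)
    return fall, spring, observed


def mean(values):
    return sum(values) / len(values)


def compare(fall, spring, observed):
    """Return (complete-case mean, regression-imputed mean) of spring scores."""
    xs = [x for x, o in zip(fall, observed) if o]
    ys = [y for y, o in zip(spring, observed) if o]
    complete_case = mean(ys)

    # Ordinary least squares of spring on fall, fitted to the observed cases.
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx

    # Fill each missing spring score with its prediction from the fall score.
    filled = [y if o else intercept + slope * x
              for x, y, o in zip(fall, spring, observed)]
    return complete_case, mean(filled)


if __name__ == "__main__":
    cc, imputed = compare(*simulate())
    print(f"true mean = 0, complete-case mean = {cc:.3f}, "
          f"imputed mean = {imputed:.3f}")
```

In this setup the true mean spring score is zero. Dropping children with missing scores overstates it, while filling in missing scores from the correlated fall score brings the estimate back close to zero.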
Von Hippel & Cañedo (2021) reported that US kindergarten teachers placed girls, Asian-Americans, and children from families of high socioeconomic status (SES) into higher ability groups than their test scores alone would warrant. The results fit the view that teachers were biased.
This comment asks whether parents’ lobbying for higher placement might explain these results. The answer, for the most part, is no. Measures of parent-teacher contact explained little variation in children’s ability group placement, and did not account for the higher placement of girls, Asian-Americans, or high-SES children. In fact, Asian-American parents had less teacher contact than did white parents. It appears that the biases observed by von Hippel & Cañedo resided primarily in teachers, not in parents.
We also ask whether teachers who used more objective assessment techniques were less biased in placing children into higher and lower ability groups. The answer, again, is no. Unfortunately, biases persisted in the face of objective information about students’ skill. Fortunately, the biases were not terribly large.
Half of kindergarten teachers split children into higher and lower ability groups for reading or math. In national data, we predicted kindergarten ability group placement using linear and ordinal logistic regression with classroom fixed effects. In fall, test scores were the best predictors of group placement, but there was bias favoring girls, high-SES (socioeconomic status) children, and Asian Americans, who received higher placements than their scores alone would predict. Net of SES, there was no bias against placing black children in higher groups. By spring, one third of kindergartners moved groups, and high-SES children moved up more than their score gains alone would predict. Teacher-reported behaviors (e.g., attentiveness, approaches to learning) helped explain girls’ higher placements, but did little to explain the higher placements of Asian American and high-SES children.
At least sixteen US states have taken steps toward holding teacher preparation programs (TPPs) accountable for teacher value-added to student test scores. Yet it is unclear whether teacher quality differences between TPPs are large enough to make an accountability system worthwhile. Several statistical practices can make differences between TPPs appear larger and more significant than they are. We reanalyze TPP evaluations from six states (New York, Louisiana, Missouri, Washington, Texas, and Florida) using appropriate methods implemented by our new caterpillar command for Stata. Our results show that teacher quality differences between most TPPs are negligible (0.01 to 0.03 standard deviations in student test scores), even in states where larger differences were reported previously. While ranking all of a state’s TPPs may not be possible or desirable, in some states and subjects we can find a single TPP whose teachers stand out as significantly above or below average. Such exceptional TPPs may reward further study.
Enrollment in higher education has risen dramatically in Latin America, especially in Chile. Yet graduation and persistence rates remain low. One way to improve graduation and persistence is to use data and analytics to identify students at risk of dropout, target interventions, and evaluate interventions’ effectiveness at improving student success. We illustrate the potential of this approach using data from eight Chilean universities. Results show that data available at matriculation are only weakly predictive of persistence, while prediction improves dramatically once data on university grades become available. Some predictors of persistence are under policy control. Financial aid predicts higher persistence, and being denied a first-choice major predicts lower persistence. Student success programs are ineffective at some universities; they are more effective at others, but when effective they often fail to target the highest risk students. Universities should use data regularly and systematically to identify high-risk students, target them with interventions, and evaluate those interventions’ effectiveness.
Year-round school calendars take the usual 175-180 instruction days of the school year and redistribute them, replacing the usual schedule – nine months on, three months off – with a more “balanced” schedule of short instruction periods alternating with shorter breaks across all four seasons of the year. Over the past three decades, the number of schools using year-round calendars has increased ninefold, from 410 in 1985 to 3,700 in 2011-12 (Skinner, 2014). Over 2 million children now attend year-round schools – as many as attend charter schools – yet year-round schools have attracted relatively little attention from researchers and the public.
In this chapter, I review the evidence for the effects of year-round calendars on test scores. Once thought to be positive, these effects now appear to be neutral at best. Although year-round calendars do increase summer learning, they reduce learning at other times of year, so that the total amount learned over a 12-month period is no greater under a year-round calendar than under a nine-month calendar. I also review evidence that year-round calendars make it harder to recruit and retain experienced teachers, make it harder for mothers to work outside the home, and reduce property values. When students' schedules are staggered, year-round calendars do offer a way to reduce school crowding – an alternative to busing or portable classrooms, and a low-cost alternative to new school construction.
Evidence-based policy is the practice of basing policy decisions on rigorous research evidence, such as randomized experiments. But it is unclear how often evidence-based decisions produce more effective policy. We evaluate an evidence-based policy implemented in 1989-93, after the state of Tennessee completed the famous Project STAR randomized experiment, which showed that reducing average class sizes from 23 to 15 could raise test scores by nearly 0.2 standard deviations (SD). After Project STAR, the state launched Project Challenge, which tried to achieve similar score gains by earmarking $5 million to reduce class sizes in the state’s 17 poorest districts.
We evaluate the effects of Project Challenge by applying regression discontinuity and difference in differences analysis to data from district report cards. Our analysis offers no evidence that Project Challenge districts raised test scores, and even raises questions about whether districts reduced class sizes. After Project Challenge, Tennessee’s Basic Education Plan did reduce class sizes, but only by a token amount, from 26 to 25. In this example, it seems that a successful randomized experiment did not lead to successful policy.
Debates in education policy draw on different theories about how to raise children’s achievement. The school competition theory holds that achievement rises when families can choose among competing schools. The school resource theory holds that achievement rises with school spending and the resources that spending can buy. The family resource theory holds that children’s achievement rises with parental education and income. We test all three theories in Chile between 2002 and 2013, when reading and math scores rose by 0.2-0.3 standard deviations, while school competition, school resources, and family resources all increased. In a difference in differences analysis, we ask which Chilean municipalities saw the greatest increases in test scores. Test scores did not rise faster in municipalities with greater increases in competition, but did rise faster in municipalities with greater increases in school resources (teachers per student) and especially family resources (parental education, not income). Student grade point averages show similar patterns. Results contradict the school competition theory but fit the family resource theory and, to a lesser extent, the school resource theory.