How Not to Fool Ourselves About Heterogeneity of Treatment Effects

January 2025

Researchers across many fields have called for greater attention to heterogeneity of treatment effects—shifting focus from the average effect to variation in effects between different treatments, studies, or subgroups. True heterogeneity is important, but many reports of heterogeneity have proved to be false, non-replicable, or inflated. In this review, we catalog ways that past researchers fooled themselves about heterogeneity, and recommend steps to stop fooling ourselves about heterogeneity in the future.

We make 18 specific recommendations, which we illustrate with examples from education research. These are the most common themes: (1) seek heterogeneity only when the causal mechanism offers clear motivation and the data offer adequate power; (2) shy away from seeking “no-but” heterogeneity when there is no main effect; (3) separate the noise of estimation error from the signal of true heterogeneity; (4) shrink variation in estimates toward zero; (5) increase p values and widen confidence intervals when conducting multiple tests; (6) estimate interactions rather than subgroup effects; and (7) check whether findings of heterogeneity are sensitive to changes in model or measurement. We also resolve two longstanding debates: one about centering interactions in linear models and one about estimating and interpreting interactions in nonlinear models such as logistic, ordinal, and interval regression.
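Themes (3) and (4) can be made concrete with a small numerical sketch. The Python below is an illustration under stated assumptions, not the authors' procedure: it uses a simple method-of-moments estimate of the true heterogeneity variance (observed variance of the estimates minus the average squared standard error, floored at zero) and then shrinks each estimate toward the unweighted mean in proportion to its estimated reliability. The function name `shrink_estimates`, the symbol `tau2`, and the example numbers are hypothetical, chosen only for illustration.

```python
import numpy as np


def shrink_estimates(estimates, std_errors):
    """Illustrative empirical-Bayes-style shrinkage (a sketch, not the paper's method).

    Assumes independent estimates with known, strictly positive standard errors.
    Returns the estimated true-heterogeneity variance and the shrunken estimates.
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)

    # Observed variance of the estimates mixes signal (true heterogeneity)
    # with noise (estimation error).
    observed_var = est.var(ddof=1)

    # The average squared standard error approximates the noise component.
    noise_var = np.mean(se ** 2)

    # Method-of-moments estimate of the signal variance, floored at zero.
    tau2 = max(observed_var - noise_var, 0.0)

    # Reliability: the estimated share of each estimate's variance that is signal.
    reliability = tau2 / (tau2 + se ** 2)

    # Shrink each estimate toward the unweighted mean of the estimates.
    grand_mean = est.mean()
    shrunk = grand_mean + reliability * (est - grand_mean)
    return tau2, shrunk


# Hypothetical subgroup estimates that look heterogeneous but are noisy.
tau2, shrunk = shrink_estimates([0.10, 0.30, -0.05, 0.20], [0.15, 0.15, 0.15, 0.15])
print(tau2, shrunk)
```

In this made-up example the spread of the four estimates is no larger than their standard errors would produce by chance, so the estimated heterogeneity variance is floored at zero and every estimate is pulled all the way back to the mean. With more precise estimates, the same function shrinks much less.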

Following these recommendations will screen out many false, confusing, and non-replicable findings. Claims of heterogeneity that pass the screens should be more plausible, better calibrated, and more replicable.

Education level: K-12 Education

Topics: Methods