Methodology, measurement and data
Classroom discourse is a core medium of instruction; analyzing it can provide a window into teaching and learning and drive the development of new tools for improving instruction. We introduce the largest dataset of mathematics classroom transcripts available to researchers and demonstrate how these data can help improve instruction. The dataset consists of 1,660 4th and 5th grade elementary mathematics observations, each 45 to 60 minutes long, collected by the National Center for Teacher Effectiveness (NCTE) between 2010 and 2013. The anonymized transcripts represent data from 317 teachers across 4 school districts that serve largely historically marginalized students. The transcripts come with rich metadata, including turn-level annotations for dialogic discourse moves, classroom observation scores, demographic information, survey responses and student test scores. We demonstrate that our natural language processing model, trained on our turn-level annotations, can learn to identify dialogic discourse moves and that these moves are correlated with better classroom observation scores and learning outcomes. This dataset opens up several possibilities for researchers, educators and policymakers to learn about and improve K-12 instruction.
This simulation study examines the characteristics of the Explanatory Item Response Model (EIRM) for estimating treatment effects, compared to classical test theory (CTT) sum and mean scores and item response theory (IRT)-based theta scores. Results show that the EIRM and IRT theta scores provide generally equivalent bias and false positive rates compared to CTT scores and superior calibration of standard errors under model misspecification. Analysis of the statistical power of each method reveals that the EIRM and IRT theta scores provide a marginal benefit to power and are more robust to missing data than other methods when parametric assumptions are met, and provide a substantial benefit to power under heteroskedasticity, but their performance is mixed under other conditions. The methods are illustrated with an empirical data application examining the causal effect of an elementary school literacy intervention on reading comprehension test scores, which demonstrates that the EIRM provides a more precise estimate of the average treatment effect than the CTT or IRT theta score approaches. Tradeoffs of model selection and interpretation are discussed.
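To make the CTT baseline in the abstract above concrete, the following is a minimal sketch of the kind of simulation it describes. It is not the authors' code. It assumes a simple Rasch-type data-generating process, and the sample size, number of items, item difficulties, effect size, and the function name `simulate_sum_score_effect` are all illustrative choices. It estimates a treatment effect from CTT sum scores only; fitting the EIRM or IRT theta scores would require a dedicated modeling library and is not shown.

```python
import numpy as np


def simulate_sum_score_effect(n_per_arm=2000, n_items=20, effect=0.5, seed=0):
    """Simulate binary item responses and estimate a treatment effect
    from CTT sum scores (all parameter values are illustrative assumptions)."""
    rng = np.random.default_rng(seed)

    # Hypothetical item difficulties, evenly spread on the logit scale.
    difficulty = np.linspace(-2.0, 2.0, n_items)

    # Treatment indicator: first half control (0), second half treated (1).
    treat = np.repeat([0, 1], n_per_arm)

    # Latent ability: standard normal, shifted by `effect` for treated units.
    theta = rng.normal(0.0, 1.0, size=treat.size) + effect * treat

    # Rasch-type response probabilities and simulated binary responses.
    logits = theta[:, None] - difficulty[None, :]
    prob = 1.0 / (1.0 + np.exp(-logits))
    responses = (rng.random(prob.shape) < prob).astype(int)

    # CTT sum score: number of items answered correctly.
    sum_score = responses.sum(axis=1)
    treated, control = sum_score[treat == 1], sum_score[treat == 0]

    # Raw and standardized mean differences on the sum-score scale.
    diff = treated.mean() - control.mean()
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return diff, diff / pooled_sd


diff, std_diff = simulate_sum_score_effect()
print(f"raw difference in sum scores: {diff:.2f}")
print(f"standardized difference:      {std_diff:.2f}")
```

Because sum scores contain measurement error, the standardized difference on the sum-score scale will usually be somewhat smaller than the latent effect used to generate the data. That gap is one reason to compare score-based estimates against model-based ones.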
Districts nationwide have revised their educator evaluation systems, increasing the frequency with which administrators observe and evaluate teacher instruction. Yet, limited insight exists on the role of evaluator feedback for instructional improvement. Relying on unique observation-level data, we examine the alignment between evaluator and teacher assessments of teacher instruction and the potential consequences for teacher productivity and mobility. We show that teachers and evaluators typically rate teacher performance similarly during classroom observations, but with significant variability in teacher-evaluator ratings. While teacher performance improves across multiple classroom observations, evaluator ratings likely overstate productivity improvements among the lowest-performing teachers. Evaluators, but not teachers, systematically rate teacher performance lower in classrooms serving higher concentrations of economically disadvantaged students. And while teacher performance improves when evaluators provide more critical feedback about teacher instruction, teachers receiving critical feedback may seek alternative teaching assignments in schools with less critical evaluation settings. We discuss the implications of these findings for the design, implementation and impact of educator evaluation systems.
Books shape how children learn about society and norms, in part through representation of different characters. We introduce new artificial intelligence methods for systematically converting images into data and apply them, along with text analysis methods, to measure the representation of race, gender, and age in award-winning children’s books from the past century. We find that more characters with darker skin color appear over time, but the most influential books persistently depict a greater proportion of light-skinned characters than other books, even after conditioning on race; we also find that children are depicted with lighter skin than adults. Relative to their growing share of the U.S. population, Black and Latinx people are underrepresented in these same books, while White males are overrepresented. Over time, females are increasingly present but appear less often in text than in images, suggesting greater symbolic inclusion in pictures than substantive inclusion in stories. We then report empirical evidence for predictions about the supply of and demand for representation that would generate these patterns. On the demand side, we show that people consume books that center their own identities. On the supply side, we document higher prices for books that center non-dominant social identities and fewer copies of these books in libraries that serve predominantly White communities. Lastly, we show that the types of children’s books purchased in a neighborhood are related to local political beliefs.
Increasing numbers of students require internet access to pursue their undergraduate degrees, yet broadband access remains inequitable across student populations. Furthermore, surveys that currently show differences in access by student demographics or location typically do so at high levels of aggregation, thereby obscuring important variation between subpopulations within larger groups. Through the dual lenses of quantitative intersectionality and critical race spatial analysis, we use Bayesian multilevel regression and census microdata to model variation in broadband access among undergraduate populations at deeper interactions of identity. We find substantive heterogeneity in student broadband access by gender, race, and place, including between typically aggregated subpopulations. Our findings speak to inequities in students’ geographies of opportunity and suggest a range of policy prescriptions at both the institutional and federal level.
Community schools are an increasingly popular strategy used to improve the performance of students whose learning may be disrupted by non-academic challenges related to poverty. Community schools partner with community-based organizations (CBOs) to provide integrated supports such as health and social services, family education, and extended learning opportunities. With over 300 community schools, the New York City Community Schools Initiative (NYC-CS) is the largest of these programs in the country. Using a novel method that combines multiple rating regression discontinuity design (MRRDD) with machine learning (ML) techniques, we estimate the causal effect of NYC-CS on elementary and middle school student attendance and academic achievement. We find an immediate reduction in chronic absenteeism of 5.6 percentage points, which persists over the following three years. We also find large improvements in math and ELA test scores (increases of 0.26 and 0.16 standard deviations, respectively, by the third year after implementation), although these effects took longer to manifest than the effects on attendance. Our findings suggest that improved attendance is a leading indicator of success of this model and may be followed by longer-run improvements in academic achievement, which has important implications for how community school programs should be evaluated.
How much does family demand matter for child learning in settings of extreme poverty? In rural Gambia, families with high aspirations for their children’s future education and career, measured before children start school, go on to invest substantially more than other families in the early years of their children’s education. Despite this, essentially no children are literate or numerate three years later. When villages receive a highly impactful, teacher-focused supply-side intervention, however, children of these families are 25 percent more likely to achieve literacy and numeracy than other children in the same village. Furthermore, improved supply enables these children to acquire other higher-level skills necessary for later learning and child development. We also document patterns of substitutability and complementarity between demand and supply in generating learning at varying levels of skill difficulty. Our analysis shows that greater demand can map onto developmentally meaningful learning differences in such settings, but only with adequate complementary inputs on the supply side.
Teachers affect a wide range of students’ educational and social outcomes, but how they contribute to students’ involvement in school discipline is less understood. We estimate the impact of teacher demographics and other observed qualifications on students’ likelihood of receiving a disciplinary referral. Using data that track all disciplinary referrals and the identity of both the referred and referring individuals from a large and diverse urban school district in California, we find students are about 0.2 to 0.5 percentage points (7% to 18%) less likely to receive a disciplinary referral from teachers of the same race or gender than from teachers of different demographic backgrounds. Students are also less likely to be referred by more experienced teachers and by teachers who hold either an English language learners or special education credential. These results are mostly driven by referrals for defiance and violence infractions, Black and Hispanic male students, and middle school students. While it is unclear whether these findings are due to variation in teachers’ effects on actual student behavior, variation in teachers’ proclivities to make disciplinary referrals, or a combination of the two, these results nonetheless suggest that teachers play a central role in the prevalence of, and inequities in, office referrals and subsequent student discipline.
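The abstract above reports each effect twice, once in percentage points and once as a relative percentage. The short sketch below shows how the two figures relate, using only the rounded numbers quoted in the abstract. The implied baseline referral rates it computes are back-of-the-envelope values, not figures reported by the authors, and the helper name `implied_baseline_rate` is hypothetical.

```python
def implied_baseline_rate(pp_change, relative_change):
    """Baseline rate (in percent) implied by an absolute change in
    percentage points and the matching relative change (as a fraction)."""
    return pp_change / relative_change


# Rounded figures quoted in the abstract: 0.2 pp ~ 7% and 0.5 pp ~ 18%.
low = implied_baseline_rate(0.2, 0.07)
high = implied_baseline_rate(0.5, 0.18)

# Both pairs imply a baseline referral rate of a little under 3 percent,
# subject to rounding in the published numbers.
print(f"implied baseline from lower bound: {low:.1f}%")
print(f"implied baseline from upper bound: {high:.1f}%")
```

The two implied baselines are close to each other, which suggests the percentage-point and relative figures describe the same effects on a common baseline.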
Noncognitive constructs such as self-efficacy, social awareness, and academic engagement are widely acknowledged as critical components of human capital, but systematic data collection on such skills in school systems is complicated by conceptual ambiguities, measurement challenges and resource constraints. This study addresses this issue by comparing the predictive validity of the two most widely used metrics on noncognitive outcomes, namely observable academic behaviors (e.g., absenteeism, suspensions) and student self-reported social and emotional learning (SEL) skills, for the likelihood of high school graduation and postsecondary attainment. Our findings suggest that conditional on student demographics and achievement, academic behaviors are several-fold more predictive than SEL skills for all long-run outcomes, and adding SEL skills to a model with academic behaviors improves the model's predictive power minimally. In addition, academic behaviors are particularly strong predictors for low-achieving students' long-run outcomes. Part-day absenteeism (as a result of class skipping) is the largest driver behind the strong predictive power of academic behaviors. Developing more nuanced behavioral measures in existing administrative data systems might be a fruitful strategy for schools whose intended goal centers on predicting students' educational attainment.
Data science applications are increasingly entwined in students’ educational experiences. One prominent application of data science in education is to predict students’ risk of failing a course in, or dropping out of, college. There is growing interest among higher education researchers and administrators in whether learning management system (LMS) data, which capture very detailed information on students’ engagement in and performance on course activities, can improve model performance. We systematically evaluate whether incorporating LMS data into course performance prediction models improves model performance. We conduct this analysis within an entire state community college system. Among students with prior academic history in college, administrative data-only models substantially outperform LMS data-only models and are quite accurate at predicting whether students will struggle in a course. Among first-time students, LMS data-only models outperform administrative data-only models. We achieve the highest performance for first-time students with models that include data from both sources. We also show that models achieve similar performance with a small and judiciously selected set of predictors; models trained on system-wide data achieve similar performance to models trained on individual courses.