
Yifeng Song

Kelli A. Bird, Benjamin L. Castleman, Yifeng Song, Renzhe Yu.

Colleges have increasingly turned to data science applications to improve student outcomes. One prominent application is to predict students’ risk of failing a course. In this paper, we investigate whether incorporating data from learning management systems (LMS), which capture detailed information on students’ engagement in course activities, increases the accuracy of predicting student success beyond using administrative data alone. We use data from the Virginia Community College System to build random forest models based on student type (new versus returning) and data source (administrative-only, LMS-only, or full data). We find that among returning college students, models that use administrative data only outperform models that use LMS data only. Combining the two types of data results in minimal gains in accuracy. Among new students, LMS-only models outperform administrative-only models, and accuracy is significantly higher when both types of predictors are used. This pattern of results reflects the fact that community college administrative data contain little information about new students. Within the LMS data, we find that measures of students’ engagement during the first part of the course have the most predictive value.
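A minimal sketch of the kind of comparison the abstract describes, not the paper's actual pipeline: train one random forest per feature set (administrative-only, LMS-only, full) and compare held-out accuracy. The DataFrame `df`, the feature names, and the outcome column are hypothetical placeholders.

```python
# Sketch only: compare random forest models across feature sets.
# All column names below are hypothetical, not the paper's variables.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

ADMIN_FEATURES = ["prior_gpa", "credits_attempted", "pell_eligible"]          # hypothetical
LMS_FEATURES = ["logins_first_4_weeks", "assignments_submitted", "forum_posts"]  # hypothetical
FEATURE_SETS = {
    "administrative_only": ADMIN_FEATURES,
    "lms_only": LMS_FEATURES,
    "full": ADMIN_FEATURES + LMS_FEATURES,
}

def compare_feature_sets(df: pd.DataFrame, label: str = "passed_course") -> dict:
    """Train one random forest per feature set and report held-out AUC."""
    results = {}
    for name, cols in FEATURE_SETS.items():
        X_train, X_test, y_train, y_test = train_test_split(
            df[cols], df[label], test_size=0.3, random_state=0, stratify=df[label]
        )
        model = RandomForestClassifier(n_estimators=500, random_state=0)
        model.fit(X_train, y_train)
        results[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return results

# To mirror the new-versus-returning split, one would run this separately, e.g.:
# compare_feature_sets(df[df["is_new_student"]]); compare_feature_sets(df[~df["is_new_student"]])
```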



Kelli A. Bird, Benjamin L. Castleman, Yifeng Song.

Predictive analytics are increasingly pervasive in higher education. However, algorithmic bias has the potential to reinforce racial inequities in postsecondary success. We provide a comprehensive and translational investigation of algorithmic bias in two separate prediction models -- one predicting course completion, the second predicting degree completion. Our results show that algorithmic bias in both models could result in at-risk Black students receiving fewer success resources than White students at comparatively lower risk of failure. We also find the magnitude of algorithmic bias to vary within the distribution of predicted success. With the degree completion model, the amount of bias is nearly four times higher when we define at-risk status using the bottom decile of predicted scores than when we focus on students in the bottom half of predicted scores. Between the two models, the magnitude and pattern of bias and the efficacy of basic bias mitigation strategies differ meaningfully, emphasizing the contextual nature of algorithmic bias and attempts to mitigate it. Our results moreover suggest that algorithmic bias is due in part to currently available administrative data being less useful at predicting Black student success compared with White student success, particularly for new students; this suggests that additional data collection efforts have the potential to mitigate bias.
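One illustrative way to probe how bias varies with the at-risk cutoff, under assumptions of my own rather than the paper's exact metric: compare actual completion rates by race among students flagged as at-risk at two different score quantiles (bottom decile versus bottom half). The columns `predicted_score`, `completed`, and `race` are hypothetical.

```python
# Sketch only: an assumed bias check, not the paper's bias metric.
import numpy as np
import pandas as pd

def at_risk_gap(df: pd.DataFrame, quantile: float) -> float:
    """Difference in actual completion rates (Black minus White) among students
    flagged as at-risk, i.e., with predicted scores at or below the given quantile."""
    cutoff = df["predicted_score"].quantile(quantile)
    flagged = df[df["predicted_score"] <= cutoff]
    rates = flagged.groupby("race")["completed"].mean()
    return rates.get("Black", np.nan) - rates.get("White", np.nan)

# Usage with a hypothetical scored dataset `df`:
# at_risk_gap(df, 0.10)  # at-risk defined by the bottom decile
# at_risk_gap(df, 0.50)  # at-risk defined by the bottom half
```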



Kelli A. Bird, Benjamin L. Castleman, Zachary Mabel, Yifeng Song.

Colleges have increasingly turned to predictive analytics to target at-risk students for additional support. Most of the predictive analytic applications in higher education are proprietary, with private companies offering little transparency about their underlying models. We address this lack of transparency by systematically examining two important dimensions: (1) how different approaches to sample and variable construction affect model accuracy; and (2) how the choice of predictive modeling approach, ranging from methods many institutional researchers would be familiar with to more complex machine learning methods, affects model performance and the stability of predicted scores. The relative ranking of students’ predicted probability of completing college varies substantially across modeling approaches. While we observe substantial gains in performance from models trained on a sample structured to represent the typical enrollment spells of students and with a robust set of predictors, we observe similar performance between the simplest and most complex models.
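A minimal sketch of how one might compare a simple and a more complex model on the same features, looking at both held-out performance and the stability of student risk rankings across models; this is an assumed setup, not the paper's specification, and the feature matrix `X` and outcome `y` are placeholders.

```python
# Sketch only: simple vs. complex model comparison with rank-stability check.
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_models(X: pd.DataFrame, y: pd.Series) -> dict:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    models = {
        "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "boosted_trees": GradientBoostingClassifier(random_state=0),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.predict_proba(X_test)[:, 1]
    # Rank correlation of predicted scores: lower values indicate that students'
    # relative risk ordering shifts substantially between modeling approaches.
    rho, _ = spearmanr(scores["logistic"], scores["boosted_trees"])
    return {
        "auc_logistic": roc_auc_score(y_test, scores["logistic"]),
        "auc_boosted_trees": roc_auc_score(y_test, scores["boosted_trees"]),
        "rank_correlation": rho,
    }
```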
