Date of Completion


Embargo Period



early warning system; machine learning; dropout; random forests; modeling; prediction; on-track; at-risk students; student achievement; regularized logistic regression

Major Advisor

Christopher Rhoads

Associate Advisor

H. Jane Rogers

Associate Advisor

Hariharan Swaminathan

Associate Advisor

Suzanne M. Wilson

Associate Advisor

Charles Martie

Field of Study

Educational Psychology


Doctor of Philosophy

Open Access



In response to the high school dropout crisis, which carries great economic and social costs, early warning systems (EWSs) have been developed to systematically predict and improve student outcomes. The purpose of this study is to evaluate different statistical and machine learning methods for predicting high school student performance and thereby improve EWSs. By improving education EWSs, this study aims to better identify students in need of targeted support and to inform on-the-ground practitioners, who can then intervene long before students drop out.

The study explored these methods using a cohort of 40,008 Connecticut students. It used more than 100 predictors to develop models of each student’s probability of being on-track to graduate within four years, based on data collected before the student entered 9th grade. Random forest, classification and regression tree (CART, or decision tree), and regularized logistic regression (ridge, lasso, and elastic net) models were developed, and their performance was evaluated on a validation dataset by comparing classification accuracy measures.

The study found that random forest models trained on a set balanced by oversampling most accurately identified which students are at risk. These models captured complex interactions among covariates and performed best when classification thresholds were optimized using Youden’s index rather than left at the default 0.5 cut-off. The variable importance rankings showed that standardized test scores, attendance, and course performance were the top-ranking predictors of being on-track. Coefficients from elastic net models provided nuanced information that complemented the random forest results. In addition, incorporating detailed special education-related predictors improved classification accuracy, especially for students with disabilities.
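The two tuning steps named above, balancing the training set by oversampling and choosing a cut-off with Youden's index (J = sensitivity + specificity - 1), can be illustrated with a minimal sketch. This is not the study's code. The function names `oversample_minority` and `youden_threshold` and the toy data are hypothetical, and simple random duplication of minority-class rows is an assumption, since the abstract does not say which oversampling technique was used.

```python
import random


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes are equal in size.

    Assumes binary labels (0/1) with at least one case in each class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw minority indices with replacement to close the size gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def youden_threshold(y_true, y_prob):
    """Return (threshold, J) that maximizes J = sensitivity + specificity - 1.

    A case is classified as positive when its probability is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = 0.5, float("-inf")
    for t in sorted(set(y_prob)):  # every observed score is a candidate cut-off
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy data (hypothetical): 1 = on-track, 0 = off-track (the minority class).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_prob = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.30]

balanced_rows, balanced_labels = oversample_minority(list(range(8)), y_true)
threshold, j_stat = youden_threshold(y_true, y_prob)
```

In this toy example, a 0.5 cut-off would label the off-track student scored at 0.60 as on-track, while the cut-off that maximizes Youden's index separates the two groups. In practice the threshold would be chosen on held-out data rather than on the cases used to fit the model.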

This study fills a practical gap in education by supporting the development of more sophisticated predictive models. Researchers can use its approach to help ensure that future EWSs perform optimally. It also gives practitioners an opportunity to leverage new knowledge about at-risk students and to test interventions at many levels in an effort to improve graduation outcomes.