Date of Completion


Embargo Period



Keywords

Predictive modelling, Risk assessment, Episode Treatment Groups, Stop-loss pricing, Model averaging, Model selection, Random Forest, Health Insurance Pricing, Tweedie model, Two part model

Major Advisor

Brian Hartman

Associate Advisor

Jeyaraj Vadiveloo

Associate Advisor

James G. Bridgeman

Field of Study



Doctor of Philosophy

Open Access

Open Access


Risk assessment is essential for insurance pricing and risk management. This study develops several predictive models using data from a major national health insurer. Specifically, four models (lognormal, gamma, log-skew-t, and Lomax) for costs based on Episode Treatment Groups (ETGs) are compared using four different metrics (AIC weights, BIC weights, random forest feature classification, and Bayesian model averaging). Several case studies are provided for illustration. Experimental results show that random forest feature classification is preferred for large data sets because of its computational efficiency and sufficient accuracy. For small data sets, Bayesian model averaging is recommended for its better accuracy.
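As a rough illustration of the AIC and BIC weights mentioned above, the sketch below converts information-criterion values into normalized model weights using the standard relative-likelihood formula. The AIC values are hypothetical placeholders, not results from this study, and the function name is an assumption made for the example.

```python
import math

def information_criterion_weights(ic_values):
    """Turn AIC (or BIC) values into weights that sum to 1.

    Each weight is exp(-delta / 2) normalized over all candidates,
    where delta is the gap to the smallest (best) criterion value.
    """
    best = min(ic_values)
    relative = [math.exp(-0.5 * (value - best)) for value in ic_values]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical AIC values for the four candidate cost distributions.
aic = {"lognormal": 1002.3, "gamma": 1000.0, "log-skew-t": 1001.1, "Lomax": 1010.4}

weights = dict(zip(aic, information_criterion_weights(list(aic.values()))))
best_model = max(weights, key=weights.get)

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

The same function applies to BIC values unchanged, since only differences between criterion values enter the weights.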

Given that the target variable is semi-continuous, heavy-tailed, and clustered, nine candidate models are investigated, including the Tweedie GLM and GAM, several two-part models, quantile regression, and a finite mixture model. A comprehensive model selection strategy and framework are suggested for different goals. Several evaluation mechanisms are investigated, based on measures of distance, effectiveness, distribution similarity, or location. In particular, the minimal distance probability matrix is proposed as a robust model selection technique. Conclusions are drawn about the relationship between the transitivity of the relation matrix and the existence of a single robust best model among the candidates.

This research also develops a stop-loss coverage pricing model for self-funded health plans. Formulas for the net stop-loss premium are derived, and predictive analytics are used to capture the relationship between certain characteristics and the target variable. A case study of Specific Stop-Loss (SSL) only coverage is presented, and future work is summarized.
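As a minimal sketch of the idea behind a net specific stop-loss premium, the example below estimates the expected excess of a member's annual claims over a per-member deductible, E[max(X - d, 0)], from a small sample. The claim amounts, the deductible, and the function name are hypothetical; the formulas derived in the dissertation are not reproduced here and may include further terms.

```python
def net_specific_stop_loss_premium(annual_claims, deductible):
    """Empirical per-member net premium: the mean of max(claim - deductible, 0)."""
    excess = [max(claim - deductible, 0.0) for claim in annual_claims]
    return sum(excess) / len(excess)

# Hypothetical annual claim totals for five members of a self-funded plan.
claims = [12_000.0, 250_000.0, 0.0, 480_000.0, 95_000.0]
specific_deductible = 200_000.0  # hypothetical per-member attachment point

premium = net_specific_stop_loss_premium(claims, specific_deductible)
print(f"Estimated net SSL premium per member: {premium:,.2f}")
```

In practice the expectation would be taken under a fitted or predicted claim distribution instead of a raw sample average, which is where the predictive models enter.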