Date of Completion

8-7-2020

Embargo Period

8-7-2020

Keywords

Bayesian Analysis, Imbalanced Response Data, Hurdle Model, Skewed Link Binary Regression, K-prototype Clustering

Major Advisor

Dipak K. Dey

Co-Major Advisor

Emiliano Valdez

Associate Advisor

Victor Hugo Lachos Davila

Field of Study

Statistics

Degree

Doctor of Philosophy

Open Access

Campus Access

Abstract

Modeling imbalanced data sets, in which the classes contain disproportionate numbers of observations, is a common problem in regression and classification. Imbalanced data arise in many areas, such as mine safety operations and life insurance. The imbalanced distribution of the majority (non-event) and minority (event) classes can produce misleading output, which makes analysis a great challenge. Although the information contained in the majority class is very important, the hazard rate or the mortality rate is estimated and analyzed from the samples in the minority class. Overestimating or underestimating the probability of an event directly affects an individual's life and safety and a company's financial well-being, so the study of the imbalance problem is vital. This dissertation reviews different ways to handle the imbalanced class problem for count and binary response variables, along with techniques for Bayesian inference such as Markov chain Monte Carlo (MCMC) methods and the exchange algorithm. To analyze different types of response variables with imbalanced distributions, a zero-inflated model with a skewed link and a generalized type of count distribution, binary regression with skewed links, and a generalized clustering algorithm are developed using MCMC techniques. Three applications to real data sets, drawn from mine data and life insurance data, show how the proposed methods are employed to achieve accurate Bayesian inference.