Date of Completion

4-18-2019

Embargo Period

4-17-2020

Keywords

variational inference, hidden Markov model, autoregressive, methylation, EM algorithm, variable selection, missing data

Major Advisor

Haim Bar

Associate Advisor

Nalini Ravishanker

Associate Advisor

Dipak Dey

Field of Study

Statistics

Degree

Doctor of Philosophy

Open Access

Campus Access

Abstract

Currently popular methods for methylation data analysis rely on multiple testing, which requires the assumption that loci are independent. The effects of nearby sites in sequencing are usually ignored. Some methods use a hidden Markov model (HMM) to model the influence of neighbors. However, the standard HMM assumes locally homogeneous segments with constant variances (homoscedasticity) or constant autocorrelations, which is restrictive. When heterogeneity of variances or autocorrelations is introduced and missing values occur, the well-known Baum-Welch algorithm for HMMs can no longer be applied to estimate the model parameters. In this dissertation, we develop a generalized HMM, ARM-HMM, in which AutoRegression and Missing values are handled simultaneously. To provide fast and accurate inference, a modified expectation-maximization algorithm and variational inference are introduced as two alternative fitting procedures. Feature extraction and variable selection techniques are further developed and compared for adequacy and efficiency in detecting important biomarkers. Experiments with both simulated and real methylation data show that the proposed ARM-HMM yields precise parameter estimates and detects meaningful segments. With carefully chosen variable selection methods, biologically meaningful methylation regions can also be detected.
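As a rough illustration of why missing values and state-specific variances change the likelihood computation, the following minimal sketch evaluates the log-likelihood of a Gaussian HMM with the scaled forward recursion, treating NaN entries as missing by skipping the emission term at those positions. It is not the dissertation's ARM-HMM: the autoregressive component is omitted, and the function name `forward_loglik` and its parameters (`pi`, `A`, `means`, `sds`) are hypothetical names chosen for this example.

```python
import numpy as np

def forward_loglik(y, pi, A, means, sds):
    """Log-likelihood of y under a Gaussian HMM (illustrative sketch only).

    y     : 1-D sequence of observations; NaN marks a missing value
    pi    : initial state probabilities, shape (K,)
    A     : row-stochastic transition matrix, shape (K, K)
    means : state-specific emission means, shape (K,)
    sds   : state-specific emission standard deviations, shape (K,)
    """
    y = np.asarray(y, dtype=float)
    A = np.asarray(A, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)

    alpha = np.asarray(pi, dtype=float)  # filtered state distribution
    log_lik = 0.0
    for t, obs in enumerate(y):
        if t > 0:
            alpha = alpha @ A  # propagate the state distribution one step
        if np.isnan(obs):
            # Missing value: no emission term, so the observation is
            # marginalized out and alpha stays normalized.
            continue
        # One Gaussian density per state (a separate sd for each state).
        dens = np.exp(-0.5 * ((obs - means) / sds) ** 2) / (sds * np.sqrt(2.0 * np.pi))
        alpha = alpha * dens
        scale = alpha.sum()
        log_lik += np.log(scale)  # accumulate the log of the scaling constants
        alpha = alpha / scale
    return log_lik

# Hypothetical two-state example with one missing observation.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
means = np.array([0.0, 1.0])
sds = np.array([0.5, 1.5])
ll = forward_loglik([0.1, np.nan, 1.2], pi, A, means, sds)
```

Skipping the emission term is only valid when values are assumed to be missing at random; with autoregressive emissions, a missing value also affects the density of the next observation, which is part of what makes the combined problem harder than this sketch suggests.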
