Date of Completion
variational inference, hidden Markov model, autoregressive, methylation, EM algorithm, variable selection, missing data
Field of Study
Doctor of Philosophy
Popular methods for methylation data analysis rely on multiple testing, which requires the assumption that loci are independent; the effects of nearby sites in sequencing are usually ignored. Some methods use a hidden Markov model (HMM) to model the influence of neighbors. However, the standard HMM assumes locally homogeneous segments with constant variances (homoscedasticity) or constant autocorrelations, which is restrictive. When heterogeneous variances or autocorrelations are introduced and missing values occur, the well-known Baum-Welch algorithm for HMMs can no longer be applied to estimate the model parameters. In this dissertation, we develop a generalized HMM in which AutoRegression and Missing values are handled simultaneously (ARM-HMM). To provide fast and accurate inference, two fitting procedures are introduced: a modified expectation-maximization (EM) algorithm and variational inference. Feature extraction and variable selection techniques are then developed and compared for adequacy and efficiency in detecting important biomarkers. Experiments with both simulated and real methylation data show that the proposed ARM-HMM yields precise parameter estimates and detects meaningful segments. With carefully chosen variable selection methods, biologically meaningful methylation regions can also be detected.
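To make the idea of combining autoregression and missing values in an HMM concrete, the following is a minimal illustrative sketch and not the dissertation's ARM-HMM or its fitting procedures. It computes the log-likelihood of a sequence with a scaled forward pass, assuming Gaussian AR(1) emissions with state-specific mean, autoregressive coefficient and standard deviation. The function names, the parameterization, and the treatment of gaps are all assumptions made for illustration: a missing observation contributes no emission term, and an observation whose predecessor is missing falls back to the stationary AR(1) distribution of its state.

```python
import numpy as np


def gaussian_logpdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x (vectorized over states)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2


def ar_hmm_loglik(y, pi, A, mu, phi, sigma):
    """Scaled forward pass for a hypothetical AR(1)-Gaussian HMM with gaps.

    y     : 1-D sequence, missing values coded as NaN
    pi    : initial state probabilities, shape (K,)
    A     : row-stochastic transition matrix, shape (K, K)
    mu    : state means, shape (K,)
    phi   : state AR(1) coefficients with |phi| < 1, shape (K,)
    sigma : state innovation standard deviations, shape (K,)
    Returns the log-likelihood of the observed values and the final
    filtered state probabilities.
    """
    y = np.asarray(y, dtype=float)
    A = np.asarray(A, dtype=float)
    mu, phi, sigma = (np.asarray(v, dtype=float) for v in (mu, phi, sigma))
    alpha = np.asarray(pi, dtype=float)
    log_lik = 0.0
    for t in range(len(y)):
        if t > 0:
            alpha = alpha @ A  # propagate state probabilities one step
        if not np.isnan(y[t]):
            if t > 0 and not np.isnan(y[t - 1]):
                # Previous value observed: condition on it (AR(1) term).
                mean = mu + phi * (y[t - 1] - mu)
                std = sigma
            else:
                # Assumption: with no usable predecessor, use the
                # stationary AR(1) distribution of each state.
                mean = mu
                std = sigma / np.sqrt(1.0 - phi ** 2)
            alpha = alpha * np.exp(gaussian_logpdf(y[t], mean, std))
        # If y[t] is missing, the emission term is skipped (marginalized).
        c = alpha.sum()
        log_lik += np.log(c)
        alpha = alpha / c  # rescale to avoid numerical underflow
    return log_lik, alpha


if __name__ == "__main__":
    # Made-up parameters for two states and a short sequence with one gap.
    pi = [0.5, 0.5]
    A = [[0.9, 0.1], [0.2, 0.8]]
    y = [0.1, 0.2, np.nan, 0.9, 0.8]
    ll, post = ar_hmm_loglik(y, pi, A, mu=[0.2, 0.8], phi=[0.3, 0.5], sigma=[0.1, 0.1])
    print(ll, post)
```

In this sketch a gap simply lets the state probabilities evolve through the transition matrix, so segmentation can continue across missing loci. A full treatment would also estimate the parameters, for example with an EM-type or variational procedure, which is outside the scope of this illustration.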
Liu, Kangyan, "Segmentation, Feature Extraction and Selection in Sequential Data with Missing-Data Imputation" (2019). Doctoral Dissertations. 2134.
Available for download on Friday, April 17, 2020