Date of Completion

5-4-2015

Embargo Period

5-4-2015

Keywords

microarray, RNA-Seq, Gene Length, Classification

Major Advisor

Lynn Kuo

Associate Advisor

Ming-Hui Chen

Associate Advisor

Zhiyi Chi

Field of Study

Statistics

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

This thesis focuses on analyzing the data returned by two technologies: the older and less expensive microarray, and the next-generation sequencing technology RNA-Seq. Both return data that are extremely large in volume. Microarray analysis begins by finding genes of interest, which are called differentially expressed (DE). Genes are called DE while controlling for some criterion, such as the false discovery rate (FDR), and are then clustered into groups. A method unifying these two steps was suggested, using a mixture of normal distributions with an appropriate EM algorithm. We compare this unified method to a semi-parametric alternative, and we use simulation studies to compare these and other microarray analysis methods. We then look at next-generation RNA-Seq data, with a focus on accounting for gene length. We introduce a hierarchical, log-linear negative binomial count model that incorporates gene length into both the parameter estimation and the zero-count inflation for these data. This hierarchical model allows count information to be borrowed efficiently across genes and provides a Bayes factor criterion for screening for DE genes. We use real data to show a decrease in length bias when our method is compared to popular existing methods, and a simulation study to establish the effects of over- and under-fitting within our model and of fitting multiple DE types in a single model. We provide new methods for finding DE genes for microarray and RNA-Seq data, and illustrate their advantages using real and simulated data.

Classification and Multiple Hypothesis Testing in Microarray and RNA-Seq Experiments
