Date of Completion

7-26-2019

Embargo Period

1-22-2020

Keywords

Imputation, linear models, genomics, data visualization

Major Advisor

Yuping Zhang

Associate Advisor

Ming-Hui Chen

Associate Advisor

Zhiyi Chi

Associate Advisor

Lynn Kuo

Field of Study

Statistics

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

The rise of Big Data has enabled sophisticated analysis of the human genome in unprecedented detail. Large datasets are now collected as a matter of routine, and their scope spans multiple data types and multiple functional units at the molecular level of the cell. The breadth and depth of these data offer the opportunity for complex experiments and extensive structural modeling. But given the intricacies of these data and the nuanced challenges they pose, robust and rigorous methods are essential to ensure the value and validity of the resulting scientific research. In this dissertation, we consider statistical methods for networks, applied to signaling pathways in the human genome. We construct joint, integrative models that employ a variety of data types simultaneously. These pathway models provide a unified approach to the analysis of genetic, epigenetic, transcriptomic, and other types of genomic data, and incorporate functionally meaningful biological relationships. In particular, we propose a new pathway model that integrates non-coding microRNAs, small RNA molecules that play a regulatory role with respect to genes. We also propose methods to address obstacles that arise in the course of real-world research. We consider missing data, a fundamental reality of -omics Big Data due to variability in data quality and experimental design. We adapt a low-rank method for matrix completion to apply to bioinformatic datasets with arbitrary patterns of missing data. We apply the imputation and pathway methods to a large-scale research study that profiles more than 30 cancer types. We also propose an algorithm to identify important subnetworks within large signaling pathways, in order to hone our understanding of the drivers of complex diseases. Through the use of interactive data visualization and analysis, we promote access to -omics analyses. Taken together, these methods provide a suite of tools that empower biological research using -omics data.
Our methods span functional genomic models, address real-world problems in data analysis, and seek to make analysis of complex datasets more tractable, all while maintaining a statistically sound foundation.
