Date of Completion

1-31-2020

Embargo Period

1-30-2021

Keywords

Algorithms, Mitochondrial Genome Assembly, Mitochondrial Haplogroup Assignment, Low-Coverage, Whole-Genome Sequencing Data

Major Advisor

Ion Mandoiu

Associate Advisor

Mukul Bansal

Associate Advisor

Derek Aguiar

Field of Study

Computer Science and Engineering

Open Access

Open Access

Abstract

Mitochondria are cellular organelles present, with very rare exceptions, in all eukaryotic cells. In most animals, the mitochondria have their own genome. The small size, high copy number, and presence of both coding and regulatory regions that mutate at different rates make the mitochondrial genome an ideal genetic marker. Indeed, mitochondrial sequences have been used in applications ranging from maternal ancestry inference and tracing human migrations to forensic analysis. This thesis presents several novel bioinformatic tools enabling highly accurate mitochondrial genome reconstruction from low-coverage whole-genome sequencing (WGS) data. First, we describe the Statistical Mitogenome Assembly with Repeats (SMART) pipeline for assembly of complete circular mitochondrial genomes from WGS data. Experiments on WGS datasets from a variety of species show that the SMART pipeline produces complete circular mitochondrial genome sequences with a higher success rate than current state-of-the-art tools, particularly for low-coverage WGS datasets. Second, we present SMART2, an enhanced version of the SMART pipeline that can take advantage of multiple sequencing libraries when available and automatically selects the optimal number of read pairs to use for assembly. Experimental results on publicly available WGS datasets show that SMART2 can assemble high-quality mitochondrial genomes from low-coverage data with minimal user intervention. Notably, SMART2 succeeded in generating mitochondrial sequences for 27 metazoan species with no previously published mitogenomes in NCBI databases. Finally, we present efficient algorithms for highly accurate haplogroup assignment and mitochondrial-based forensic analysis of WGS data from mixed DNA samples.
