Date of Completion
Algorithms, Mitochondrial Genome Assembly, Mitochondrial Haplogroup Assignment, Low-Coverage, Whole-Genome Sequencing Data
Field of Study
Computer Science and Engineering
Mitochondria are cellular organelles present, with very rare exceptions, in all eukaryotic cells. In most animals, the mitochondria have their own genome. The small size, high copy number, and presence of both coding and regulatory regions that mutate at different rates make the mitochondrial genome an ideal genetic marker. Indeed, mitochondrial sequences have been used in applications ranging from maternal ancestry inference and tracing human migrations to forensic analysis. This thesis presents several novel bioinformatic tools enabling highly accurate mitochondrial genome reconstruction from low-coverage whole-genome sequencing (WGS) data. First, we describe the Statistical Mitogenome Assembly with Repeats (SMART) pipeline for assembly of complete circular mitochondrial genomes from WGS data. Experiments on WGS datasets from a variety of species show that the SMART pipeline produces complete circular mitochondrial genome sequences with a higher success rate than current state-of-the-art tools, particularly for low-coverage WGS datasets. Second, we present SMART2, an enhanced version of the SMART pipeline that can take advantage of multiple sequencing libraries when available and automatically selects the optimal number of read pairs used for assembly. Experimental results on publicly available WGS datasets show that SMART2 can assemble high-quality mitochondrial genomes from low-coverage data with minimal user intervention. Indeed, SMART2 succeeded in generating mitochondrial sequences for 27 metazoan species with no previously published mitogenomes in NCBI databases. Finally, we present efficient algorithms for highly accurate haplogroup assignment and mitochondrial-based forensic analysis of WGS data from mixed DNA samples.
Alqahtani, Fahad, "Algorithms for Mitochondrial Genome Assembly and Haplogroup Assignment from Low-Coverage Whole-Genome Sequencing Data" (2020). Doctoral Dissertations. 2416.