Authors

Chong Chu

Date of Completion

5-5-2017

Embargo Period

5-2-2017

Keywords

De novo repeats, mobile element insertions, transposable elements, closing gaps, genome assembly

Major Advisor

Yufeng Wu

Associate Advisor

Ion Mandoiu

Associate Advisor

Sanguthevar Rajasekaran

Associate Advisor

Dong-Guk Shin

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Repeat elements are important components of eukaryotic genomes. The falling cost of second- and third-generation sequencing technologies provides opportunities to study repeat elements in hundreds of species and in thousands of individuals of a single species. Depending on the quality of the assembled genomes, there are generally two obstacles to studying repeat elements: (1) For species with highly repetitive or complex genomes that lack high-quality assemblies, how do we construct repeat elements de novo? (2) For species with high-quality assembled genomes, how do we detect mobile element insertions (one type of repeat element) in different individuals of the species? It is known that most gaps in draft genomes are caused by repeat elements, so a follow-up question is: (3) With a better understanding of the repeats in a genome, can we better close the gaps in draft genomes?

To address the first problem, namely that reference genomes are incomplete and often contain missing data in highly repetitive regions, we propose a method (called REPdenovo) to construct repeats directly from short sequencing reads. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence among copies. We show that REPdenovo is substantially better than existing methods in terms of both the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from short sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations.

Next, we present an improved version of REPdenovo, which is able to reconstruct more divergent and lower-frequency repeats from short sequencing reads. Compared with the original REPdenovo, this improved approach uses more repeat-related k-mers. In addition, the new approach improves repeat assembly quality using a consensus-based k-mer processing method. We compare the performance of the new method with REPdenovo and RepARK on human and Arabidopsis thaliana short sequencing data. The results show that the improved REPdenovo can assemble more complete repeats than both the original REPdenovo and RepARK. We apply the improved REPdenovo to hummingbird, which has no known repeat library, and construct many repeat elements that are validated using PacBio long reads. Many of these repeats are likely to be genuine repeats that are absent from public repeat libraries.
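The abstract does not state how repeat-related k-mers are selected. As a minimal illustration of the general idea only, the sketch below assumes that such k-mers are those occurring far more often in the reads than the average k-mer does. The function names and the `min_fold` threshold are hypothetical and are not taken from REPdenovo.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequent_kmers(reads, k, min_fold):
    """Return k-mers whose count is at least min_fold times the mean count.

    min_fold is a hypothetical threshold chosen for illustration; it is an
    assumption, not the criterion used by REPdenovo.
    """
    counts = count_kmers(reads, k)
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {kmer: c for kmer, c in counts.items() if c >= min_fold * mean}
```

A real pipeline would also have to handle reverse complements and sequencing errors, and would typically use a dedicated k-mer counter for genome-scale data.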

To answer the second question, we develop a novel method (called REPdenovo-MEI) for detecting mobile element insertions (MEIs) given a reference genome and the alignments of different individuals. Unlike all existing tools, REPdenovo-MEI does not rely on any repeat library and can call MEIs efficiently and accurately. Besides calling insertion sites, REPdenovo-MEI has a local assembly step to construct the inserted copy and a classification-based approach for calling genotypes. In addition, third-generation sequencing technology generates reads thousands of bases long, which are usually long enough to contain whole repeat elements and can thus help to construct MEIs completely. Therefore, besides short reads, REPdenovo-MEI can also work with long reads to infer the inserted copies. Results on both simulated and real data show that REPdenovo-MEI outperforms existing tools in both accuracy and the number of highly divergent MEIs constructed.

To solve the third problem of closing gaps in draft genomes, we propose a new method (called GAPPadder) that can sensitively close gaps in large and complex genomes. Unlike existing approaches, GAPPadder collects more gap-originated reads, especially repeat-associated reads, and better utilizes the information in the different insert sizes of paired-end (PE) and mate-pair (MP) reads. Finally, GAPPadder provides higher-quality local assembly through an extra contig-merging step. We show that GAPPadder can close more gaps on one bacterial genome, human chromosome 14, and the human whole genome. Besides closing gaps in draft genomes assembled only from short sequence reads, GAPPadder can also be used to close gaps in draft genomes assembled with long reads. We show that GAPPadder can close gaps in the bed bug genome and the Asian sea bass genome, which are assembled partially and fully with long reads, respectively. We also show that GAPPadder is efficient in both time and memory usage.
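For context, gaps in draft assemblies are conventionally written as runs of N characters inside scaffold sequences. The sketch below only locates such runs, which is the usual starting point before any reads are collected around a gap. It is not GAPPadder's procedure, and `find_gaps` and `min_len` are hypothetical names introduced for illustration.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) coordinates of N runs of at least min_len bases."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]
```

Coordinates are 0-based and half-open, so `scaffold[start:end]` yields the gap itself and the flanking sequence can be sliced on either side.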
