Date of Completion
Circular RNA, High-throughput sequencing, Genomics, RNA-Seq, Structure variation, Deletion, Unsupervised learning
Field of Study
Computer Science and Engineering
Doctor of Philosophy
The genomes of most eukaryotes are large and complex, and large amounts of non-coding sequence are a general property of the genomes of complex eukaryotes. High-throughput sequencing is increasingly important for the study of such genomes. In this dissertation, we focus on two computational problems in high-throughput sequence data analysis: detecting circular RNA and calling structural variations (especially deletions).
Circular RNA (circRNA) is a kind of non-coding RNA that forms a closed loop, joined by a typical 5' to 3' phosphodiester bond through non-canonical splicing. CircRNA was originally thought to be a low-abundance byproduct of mis-splicing. Recently, however, circRNA has come to be regarded as a new class of functional molecule, and its importance in gene regulation and its biological roles in some human diseases have started to be recognized. In this research work, we propose two algorithms to detect potential circRNA. To reduce running time, we design an algorithm called CircMarker, which finds circRNA by building a k-mer table rather than by conventional read mapping. Furthermore, we develop an algorithm named CircDBG, which uses information from both the reads and the annotated genome to build a de Bruijn graph for circRNA detection, improving accuracy and sensitivity.
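To make the k-mer table idea concrete, the following is a minimal sketch, not the CircMarker implementation. It assumes a toy annotation given as a list of exon sequences in transcript order, and it flags a read as a circRNA candidate when its k-mers hit a downstream exon and then an upstream exon, the pattern a circular junction would produce. The function names (`build_kmer_table`, `exon_hits`, `is_back_splice_candidate`) and the rule for discarding ambiguous k-mers are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_table(exons, k):
    """Map each k-mer to the (exon_index, offset) positions where it occurs."""
    table = defaultdict(list)
    for idx, seq in enumerate(exons):
        for off in range(len(seq) - k + 1):
            table[seq[off:off + k]].append((idx, off))
    return table


def exon_hits(read, table, k):
    """For each k-mer of the read, return the exon it hits.

    A k-mer that is absent from the table, or that occurs in more than
    one exon, yields None (assumption: ambiguous k-mers are ignored).
    """
    hits = []
    for i in range(len(read) - k + 1):
        exon_ids = {idx for idx, _ in table.get(read[i:i + k], [])}
        hits.append(exon_ids.pop() if len(exon_ids) == 1 else None)
    return hits


def is_back_splice_candidate(read, table, k):
    """True if the read moves from a downstream exon to an upstream exon."""
    order = [h for h in exon_hits(read, table, k) if h is not None]
    # Collapse runs of identical exon indices, e.g. [2, 2, 0, 0] -> [2, 0].
    collapsed = []
    for h in order:
        if not collapsed or h != collapsed[-1]:
            collapsed.append(h)
    return any(b < a for a, b in zip(collapsed, collapsed[1:]))


# Toy exons (hypothetical sequences chosen so that no 4-mer is shared).
exons = ["ACGTACGGTC", "TTGACCATGA", "GGCATTCAGT"]
table = build_kmer_table(exons, k=4)

circular_read = "TTCAGT" + "ACGTAC"  # end of exon 2 joined to start of exon 0
linear_read = "ACGGTC" + "TTGACC"    # end of exon 0 joined to start of exon 1
```

In this sketch a table lookup replaces alignment, which is where the running-time benefit of a k-mer approach would come from; a real detector would also need to handle single-exon circles, sequencing errors, and repeated k-mers.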
Structural variation (SV), which ranges from 50 bp to ~3 Mb in size, is an important type of genetic variation. Deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. In this research work, we develop a new method called EigenDel for detecting genomic deletions. EigenDel first uses discordant read pairs and clipped reads to obtain initial deletion candidates. It then clusters similar deletion candidates together and calls true deletions from each cluster using an unsupervised learning method. EigenDel outperforms other major methods in balancing accuracy and sensitivity as well as in reducing bias.
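The candidate-generation and grouping steps can be illustrated with a minimal sketch under simplifying assumptions; it is not EigenDel itself. Each read pair is reduced to a hypothetical tuple `(left_mate_end, right_mate_start)`, a pair counts as discordant when that inner distance exceeds the library mean by `n_sd` standard deviations, and the unsupervised learning step is replaced here by a plain greedy grouping with a support threshold. Clipped reads are not modeled, and all names and default parameters are illustrative.

```python
def discordant_pairs(pairs, mean_insert, sd_insert, n_sd=3.0):
    """Keep pairs whose inner distance is unusually large for the library."""
    threshold = mean_insert + n_sd * sd_insert
    return [(left, right) for left, right in pairs if right - left > threshold]


def cluster_candidates(candidates, max_gap=100):
    """Greedily group candidates whose start and end are both within
    max_gap of the previous candidate in sorted order."""
    clusters = []
    for start, end in sorted(candidates):
        if clusters:
            last_start, last_end = clusters[-1][-1]
            if start - last_start <= max_gap and abs(end - last_end) <= max_gap:
                clusters[-1].append((start, end))
                continue
        clusters.append([(start, end)])
    return clusters


def call_deletions(clusters, min_support=2):
    """Report (start, end, support) for clusters with enough supporting pairs.

    The reported interval is the tightest one consistent with every pair
    in the cluster: the rightmost left-mate end and the leftmost
    right-mate start.
    """
    calls = []
    for cluster in clusters:
        if len(cluster) >= min_support:
            start = max(s for s, _ in cluster)
            end = min(e for _, e in cluster)
            calls.append((start, end, len(cluster)))
    return calls


# Hypothetical read pairs on one chromosome: (left_mate_end, right_mate_start).
pairs = [(1000, 1300), (5000, 6210), (5020, 6200),
         (5050, 6250), (9000, 9900), (2000, 2310)]
candidates = discordant_pairs(pairs, mean_insert=300, sd_insert=30)
deletions = call_deletions(cluster_candidates(candidates))
```

The fixed `min_support` cutoff stands in for the step that separates true deletions from false candidates within each cluster; a learned, per-cluster decision is what would replace it in a method that relies on unsupervised learning.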
Our results in this dissertation show that, with effective computational approaches, high-throughput sequencing data can be used to study complex genomes.
Li, Xin, "Complex Genome Analysis with High-throughput Sequencing Data: Methods and Applications" (2020). Doctoral Dissertations. 2436.