Date of Completion

1-28-2019

Embargo Period

1-31-2019

Keywords

Binding sites, DNA motif, Motif detection tool, Motif discovery pipeline, Motif similarity detection, Motif clustering

Major Advisor

Chun-Hsi Huang

Associate Advisor

Sanguthevar Rajasekaran

Associate Advisor

Dong-Guk Shin

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Binding site motifs are short, recurring sequence patterns found in DNA or protein. They play an important role in bioinformatics because they reveal the transcription factors that control gene expression, and many motif discovery tools have been developed as a result. We reviewed nine Web tools for finding binding site motifs in ChIP-Seq data and found that they reported different results for an identical dataset, largely because different tools use different strategies and have unique features for detecting motifs. Using multiple tools and methods is therefore generally advised, because motifs reported in common by several of them are more likely to be biologically significant. Numerous studies also show that using multiple tools and methods generally improves the accuracy of motif detection. However, the results from multiple tools and methods must be compared to identify the common significant motifs. Existing tools and methods for motif similarity comparison do not allow multiple datasets to be compared concurrently; they only support comparisons within a dataset or between two datasets. To compare more than two datasets, pair-wise comparisons are performed first and the results are then checked against each other manually. This process is time-consuming and becomes impractical for large datasets or a large number of datasets. Moreover, because the results from individual motif finders on the same datasets vary significantly, relying on the results of any single tool may not be reliable.
To address this issue, we developed the MOTIFSIM algorithm for comparing multiple DNA motif datasets concurrently to extract (1) the common (global) significant motifs from multiple tools, (2) the motifs reported by some tools but not by others (the global and local significant motifs), and (3) the best matches for each motif in the collection of motifs from multiple tools. We performed an extensive assessment of MOTIFSIM. The pair-wise comparison results show that it performs better than the un-gapped Smith-Waterman algorithm and four distance metrics, namely average Kullback-Leibler, average log-likelihood ratio, Chi-Square distance, and Pearson Correlation Coefficient. The clustering results also demonstrate that MOTIFSIM achieves similar or even better performance than the RSAT Matrix-clustering tool. Furthermore, the findings indicate that, when motif detection does not require a specialized tool for a specific type of motif, using multiple motif finders and combining them with MOTIFSIM to obtain the common significant motifs improves the results of DNA motif detection.
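
The idea of comparing several motif datasets at once and separating global from local motifs can be illustrated with a minimal sketch. This is not the MOTIFSIM algorithm, whose scoring is not described in this abstract. It assumes motifs are equal-length position frequency matrices, uses the mean column-wise Pearson Correlation Coefficient (one of the baseline metrics named above) as the similarity score, and ignores shifts and gaps. The function names, the 0.75 threshold, and the example matrices are all hypothetical.

```python
import math

def pearson(col_a, col_b):
    """Pearson correlation between two columns of base frequencies (A, C, G, T)."""
    n = len(col_a)
    mean_a, mean_b = sum(col_a) / n, sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a uniform column carries no signal to correlate
    return cov / math.sqrt(var_a * var_b)

def motif_similarity(m1, m2):
    """Mean column-wise correlation of two equal-length motifs (no shifts or gaps)."""
    if len(m1) != len(m2):
        return 0.0  # simplification: a real tool would align motifs of different lengths
    return sum(pearson(a, b) for a, b in zip(m1, m2)) / len(m1)

def classify_motifs(datasets, threshold=0.75):
    """Report, for every motif, which other datasets contain a match.

    datasets: dict mapping tool name -> dict mapping motif name -> matrix.
    A motif is flagged "global" when every other dataset holds a match
    scoring at or above the (hypothetical) threshold.
    """
    report = {}
    for tool, motifs in datasets.items():
        for name, matrix in motifs.items():
            matched_in = [
                other for other, other_motifs in datasets.items()
                if other != tool
                and any(motif_similarity(matrix, m) >= threshold
                        for m in other_motifs.values())
            ]
            report[(tool, name)] = {
                "matched_in": matched_in,
                "global": len(matched_in) == len(datasets) - 1,
            }
    return report

# Made-up example matrices: each inner list is one column (A, C, G, T).
strong = [[0.90, 0.03, 0.03, 0.04], [0.05, 0.05, 0.85, 0.05], [0.10, 0.70, 0.10, 0.10]]
near = [[0.85, 0.05, 0.05, 0.05], [0.04, 0.06, 0.80, 0.10], [0.10, 0.75, 0.10, 0.05]]
other = [[0.03, 0.90, 0.03, 0.04], [0.85, 0.05, 0.05, 0.05], [0.10, 0.10, 0.10, 0.70]]

datasets = {
    "toolA": {"a1": strong, "a2": other},
    "toolB": {"b1": near},
    "toolC": {"c1": near},
}
report = classify_motifs(datasets)
```

In this toy input, motif `a1` has a close counterpart in both other datasets and is flagged as global, while `a2` has none and would count as local to `toolA`.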

We implemented the MOTIFSIM algorithm in software tools that offer several benefits. First, they allow similarity to be found in multiple DNA motif datasets concurrently. Second, results are obtained faster than with manual comparison. Third, the results have been validated as more reliable than those from individual de novo motif finders. Fourth, they allow large datasets and a large number of datasets to be compared. Fifth, the results can be matched against a motif database to obtain similar motifs. Sixth, similar motifs found in the results can be merged into new motifs to reduce the number of redundant motifs. Lastly, the results can be visualized as motif trees.
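
Merging similar motifs to reduce redundancy can be sketched as follows. This is only an illustration under stated assumptions, not the merging procedure used by MOTIFSIM: it assumes the two motifs are already aligned, have the same length, and are given as position frequency matrices. The function name `merge_motifs` and the weights (which could stand for the number of sites supporting each motif) are hypothetical.

```python
def merge_motifs(m1, m2, weight1=1.0, weight2=1.0):
    """Merge two aligned, equal-length motifs by weighted column averaging."""
    if len(m1) != len(m2):
        raise ValueError("motifs must be aligned to the same length before merging")
    total = weight1 + weight2
    merged = []
    for col1, col2 in zip(m1, m2):
        # Weighted mean of the base frequencies at this position.
        col = [(weight1 * a + weight2 * b) / total for a, b in zip(col1, col2)]
        norm = sum(col)
        if norm == 0:
            raise ValueError("cannot merge columns whose frequencies are all zero")
        merged.append([x / norm for x in col])  # renormalise so the column sums to 1
    return merged

# Made-up example: two near-identical motifs, columns ordered A, C, G, T.
motif_x = [[0.90, 0.03, 0.03, 0.04], [0.05, 0.05, 0.85, 0.05]]
motif_y = [[0.85, 0.05, 0.05, 0.05], [0.04, 0.06, 0.80, 0.10]]
merged = merge_motifs(motif_x, motif_y)
```

Averaging with equal weights treats both input motifs as equally well supported; passing site counts as weights would let the better-supported motif dominate the merged matrix.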

The implementations were carried out as follows. First, the command-line MOTIFSIM was developed for comparing motifs locally in standalone mode. Second, the cluster-based MOTIFSIM was developed for comparing motifs on-line through a user-friendly interface. Users can save their datasets and results on-line for later retrieval, and Web traffic is balanced with the HAProxy load balancer. We performed three case studies in which we compared the cluster-based MOTIFSIM with the STAMP tool for pair-wise motif similarity detection. The results reveal that 83% or more of the global significant motifs found by MOTIFSIM were also detected by STAMP. Third, the cloud-based MOTIFSIM was developed for comparing large-scale motif datasets on the Amazon Web Services (AWS) cloud. It provides additional on-line storage space for datasets and results, it scales with the expandable services offered by AWS, and it performs better than the cluster-based tool. Version 2.0 of both the command-line and the cluster-based MOTIFSIM offers numerous technical improvements that further support motif comparison and analysis. Version 2.1 of both tools offers three new features: matching motifs against a reference database, merging similar motifs, and clustering motifs into motif trees.

Running several motif finders manually on an identical dataset is cumbersome. It may require installing and configuring several different tools on a local machine, or running different motif finders that reside on several different Web servers. To facilitate this process, numerous motif discovery pipelines have been developed, either as standalone applications for standalone servers or as pipelining Web servers. Recent development favors pipelining Web servers, which serve more users via the Web and eliminate the software installation and configuration that standalone applications require. Generally, these pipelines incorporate multiple algorithms or tools and are designed to complement individual motif finders in order to achieve better accuracy. The results can be clustered and ranked to obtain the top significant motifs, and some pipelines allow the results to be verified against reference databases. Although existing pipelines each have their own integrations and their own methods for ranking and selecting the significant motifs, they do not provide different comparison results for multiple tools and methods. They generally report the top-ranked results either from individual motif finders or from a combination of multiple predictive algorithms and tools. To address this issue, we developed MODSIDE, a motif discovery pipeline and similarity detector. MODSIDE was designed to deliver not only the predictive results from individual motif finders but also the comparison results for multiple tools. The pipeline integrates four de novo motif finders (ChIPMunk, MEME, Weeder, and XXmotif) and incorporates the motif similarity detection tool MOTIFSIM. We assessed MODSIDE in two respects. First, we evaluated MODSIDE and its adopted motif finders on sixteen benchmark datasets. The statistical results demonstrate that MODSIDE achieves better accuracy than the individual motif finders. Second, we compared MODSIDE with two popular motif discovery pipelines, MEME-ChIP and RSAT peak-motifs. The comparison results reveal that MODSIDE attains performance similar to that of RSAT peak-motifs and better accuracy than MEME-ChIP. In addition, MODSIDE delivers various comparison results that are not offered by MEME-ChIP, RSAT peak-motifs, or other existing motif discovery pipelines.
