Date of Completion

5-5-2017

Embargo Period

10-27-2017

Keywords

Closest Pair Problem (CPP), Error Correction, Feature Selection, Genome-wide Association Study (GWAS), Hierarchical Clustering, Metagenomics, Scaffolding, Sequence Compression, Spliced Junctions, Time Series Motifs

Major Advisor

Sanguthevar Rajasekaran

Associate Advisor

Chun-Hsi (Vincent) Huang

Associate Advisor

Ion Mandoiu

Associate Advisor

Mohammad Maifi Hasan Khan

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

In this dissertation we offer novel algorithms for big data analytics. Voluminous datasets are now generated in every walk of life, and new algorithms are essential for analyzing them and extracting useful information. We present generic data analytics algorithms and demonstrate their applications in various domains.

A number of fundamental problems arise in big data analytics, including clustering, data reduction, classification, feature selection, closest pair detection, data compression, sequence assembly, error correction, and metagenomic phylogenetic clustering. We have worked on some of these problems and developed algorithms that outperform the best prior algorithms. For example, we have developed a series of data compression algorithms for biological data that offer better compression ratios while drastically reducing compression and decompression times. As another example, we have devised an efficient algorithm for the closest pair problem, which has numerous applications. When applied to the two-locus problem in Genome-wide Association Studies, our algorithm runs two orders of magnitude faster than the best-known prior algorithm for that problem. We have also proposed a novel deterministic sampling technique that can be used to speed up any clustering algorithm. Empirical results show that this technique yields a speedup of more than an order of magnitude over exact hierarchical clustering algorithms while retaining excellent accuracy. In fact, on many datasets, its accuracy is better than that of exact hierarchical clustering algorithms.
