"Novel Algorithms for Big Data Analytics" by Subrata Saha

Doctoral Dissertations

Title

Novel Algorithms for Big Data Analytics

Authors

Subrata Saha, University of Connecticut

Date of Completion

5-5-2017

Embargo Period

10-27-2017

Keywords

Closest Pair Problem (CPP), Error Correction, Feature Selection, Genome-wide Association Study (GWAS), Hierarchical Clustering, Metagenomics, Scaffolding, Sequence Compression, Spliced Junctions, Time Series Motifs

Major Advisor

Sanguthevar Rajasekaran

Associate Advisor

Chun-Hsi (Vincent) Huang

Associate Advisor

Ion Mandoiu

Associate Advisor

Mohammad Maifi Hasan Khan

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Abstract

In this dissertation we offer novel algorithms for big data analytics. We live in a period when voluminous datasets get generated in every walk of life. It is essential to develop novel algorithms to analyze these and extract useful information. In this thesis we present generic data analytics algorithms and demonstrate their applications in various domains.

A number of fundamental problems arise in big data analytics, including clustering, data reduction, classification, feature selection, closest pair detection, data compression, sequence assembly, error correction, and metagenomic phylogenetic clustering. We have worked on some of these fundamental problems and developed algorithms that outperform the best prior algorithms. For example, we have come up with a series of data compression algorithms for biological data that offer better compression ratios while drastically reducing compression and decompression times. As another example, we have invented an efficient algorithm for the closest pair problem, which has numerous applications. When applied to the two-locus problem in Genome-wide Association Studies, our algorithm runs two orders of magnitude faster than the best-known prior algorithm for that problem. We have also proposed a novel deterministic sampling technique that can be used to speed up any clustering algorithm. Empirical results show that this technique yields a speedup of more than an order of magnitude over exact hierarchical clustering algorithms, and the accuracy obtained is excellent. In fact, on many datasets, its accuracy is better than that of exact hierarchical clustering algorithms.

Recommended Citation

Saha, Subrata, "Novel Algorithms for Big Data Analytics" (2017). Doctoral Dissertations. 1481.
https://digitalcommons.lib.uconn.edu/dissertations/1481
