Date of Completion

1-10-2019

Embargo Period

1-10-2019

Keywords

Metric Nets, Net-Trees, Randomized Incremental Construction, Hierarchical Structures, Topological Data Analysis, Computational Geometry

Major Advisor

Donald R. Sheehy

Associate Advisor

Thomas J. Peters

Associate Advisor

Sanguthevar Rajasekaran

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

The volume of data is not the only problem in modern data analysis, data complexity is often more challenging. In many areas such as computational biology, topological data analysis, and machine learning, the data resides in high dimensional spaces which may not even be Euclidean. Therefore, processing such massive and complex data and extracting some useful information is a big challenge. Our methods will apply to any data sets given as a set of objects and a metric that measures the distance between them.

In this dissertation, we first consider the problem of preprocessing and organizing such complex data into a hierarchical data structure that allows efficient nearest neighbor and range queries. There have been many data structures for general metric spaces, but almost all of them have construction time that can be quadratic in terms of the number of points. There are only two data structures with O(n log n) construction time, but both have very complex algorithms and analyses. Also, they cannot be implemented efficiently. Here, we present a simple, randomized incremental algorithm that builds a metric data structure in O(n log n) time in expectation. Thus, we achieve the best of both worlds, simple implementation with asymptotically optimal performance.

Furthermore, we consider the close relationship between our metric data structure and point orderings used in applications such as k-center clustering. We give linear time algorithms to go back and forth between these orderings and our metric data structure.

In the last part, we use metric data structures to extract topological features of a data set, such as the number of connected components, holes, and voids. We give an efficient algorithm for constructing a (1 + epsilon)-approximation to the so-called Nerve filtration of a metric space, a fundamental tool in topological data analysis.

COinS