Date of Completion

10-28-2019

Embargo Period

10-27-2020

Keywords

Incremental Record Linkage, Edit Distance, Blocking, K-mers, Parallel Computing, Hierarchical Clustering.

Major Advisor

Prof. Sanguthevar Rajasekaran.

Co-Major Advisor

Prof. Reda Ammar.

Associate Advisor

Prof. Song Han.

Associate Advisor

Prof. Sheida Nabavi.

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Campus Access

Abstract

In the biomedical domain, record linkage is a crucial problem. When the number of records is very large, existing record linkage algorithms take too much time. Often, a small set of new records must be linked with a large set of old records. One approach is to pool the old and new records and perform linkage on all of them, but this clearly calls for an enormous amount of time. An alternative is to develop algorithms that perform linkage incrementally. We refer to any such algorithm as an Incremental Record Linkage (IRL) algorithm.
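To make the incremental idea concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes records are plain strings, that two records match when their edit distance is at most a threshold, and that clusters follow single linkage. The names `edit_distance` and `link_incrementally` are hypothetical, and blocking is omitted for brevity.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def link_incrementally(clusters, new_records, threshold):
    """Add new records to existing clusters without re-linking old records.

    Illustrative assumption: under single linkage, a new record joins every
    cluster that contains at least one member within `threshold`, and all
    clusters it joins are merged into one.
    """
    clusters = [list(c) for c in clusters]  # leave the caller's data untouched
    for rec in new_records:
        hits = [c for c in clusters
                if any(edit_distance(rec, m) <= threshold for m in c)]
        merged = [rec]
        for c in hits:
            merged.extend(c)
        # Drop the clusters that were absorbed, then add the merged one.
        clusters = [c for c in clusters if not any(c is h for h in hits)]
        clusters.append(merged)
    return clusters


old_clusters = [["john smith", "jon smith"], ["mary jones"]]
new_records = ["john smyth", "mary jone", "alice wong"]
result = link_incrementally(old_clusters, new_records, threshold=1)
```

In this sketch, old records are never compared with one another again; only comparisons between new and existing records are made, which is where the saving over pooling all records would come from.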

In this thesis, we present an efficient IRL algorithm. In addition to taking large amounts of time, existing algorithms may also suffer from the chaining problem and hence introduce linking errors. As has been observed in the literature, the chaining problem can be solved by performing clustering under complete linkage.
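The chaining problem can be illustrated with a small sketch. This is an assumption-laden toy, not the thesis's method: it uses a greedy threshold-based agglomerative loop over strings, with hypothetical helper names `edit_distance` and `cluster`. Passing `min` as the linkage gives single linkage (closest pair decides), and passing `max` gives complete linkage (farthest pair decides).

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster(records, threshold, linkage):
    """Greedily merge clusters whose linkage distance is within `threshold`.

    `linkage` is `min` for single linkage or `max` for complete linkage.
    """
    clusters = [[r] for r in records]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(edit_distance(a, b)
                            for a in clusters[i] for b in clusters[j])
                if d <= threshold:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return clusters


# Each neighbour differs by one edit, but "cat" and "dog" differ by three.
records = ["cat", "cot", "dot", "dog"]
single = cluster(records, threshold=1, linkage=min)
complete = cluster(records, threshold=1, linkage=max)
```

With single linkage, the chain cat, cot, dot, dog collapses into one cluster even though its endpoints are three edits apart. With complete linkage, every pair inside a cluster must be within the threshold, so the chain is split.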

This thesis makes two main contributions. First, we offer novel sequential and parallel algorithms for the critical incremental record linkage problem using single linkage. Second, we offer novel sequential and parallel algorithms for incremental record linkage using complete linkage, which overcome the chaining problem.

Our algorithms can handle any number of datasets. In contrast, many existing algorithms can only link two datasets at a time. Our algorithms outperform previous algorithms and offer state-of-the-art solutions to the IRL problem. We have tested our algorithms on synthetic and real datasets containing millions of records and shown that they outperform the best-known RLA algorithms when the number of new records is up to around 20% of the total number of old records. Our algorithms also achieve nearly linear speedup in parallel.
