Date of Completion

5-5-2017

Embargo Period

5-4-2017

Keywords

Data Mining, Formal Methods, Grammars, CFG, Big Data

Major Advisor

Reda Ammar

Co-Major Advisor

Sanguthevar Rajasekaran

Associate Advisor

Swapna Gokhale

Associate Advisor

Yufeng Wu

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

As data collection technologies are advancing and memory storage costs are declining, volumes of data collected have soared. Scientists and investigators are collecting all possible data in fear of missing out on important information. With the merge of the data collection trend, researchers were studying data mining and analysis to find the most efficient way to data mine. There are various valuable data mining techniques that can be found in literature such as Support Vector Machine (SVM), Neural Networks (ANN), and Formal Methods (Grammars). Grammars are a very valuable in analyzing structured data and describing them in a condense matter. However, not many have used it for data mining even though it has many benefits. In this research we present an approach to data mine big data. First, a grammar is inferred to build a structural model that describes the data. Then, on the next phase, a probabilistic context-free grammar is inferred and a model for a more complex structures. Given an input sequence, the model parses and generates the probability of that data sequence being part of the class based on its structural characteristics. Grammatical concatenation is utilized in case of existing sub-structures within the class’s structural description. The model then accepts, or rejects, the input as part of the data’s class by comparing the probability to a pre-set threshold. Finally, this is applied on a heterogeneous large data set by inferring multiple grammars. After building grammatical model for each class, the algorithm parse multiple points in the large set. It then classifies these data into smaller sets where they share similar structural characteristics using probabilistic grammar. If more than one class accepts the data point, it is associated to the highest ranking class. Biological data, DNAs and Proteins, were used for experimentation in this research.

COinS