Date of Completion

5-11-2013

Embargo Period

5-11-2013

Keywords

Document Classification, Data Mining, Information retrieval

Major Advisor

Reda A. Ammar

Co-Major Advisor

Sanguthevar Rajasekaran

Associate Advisor

Chun-Hsi (Vincent) Huang

Associate Advisor

Yufeng Wu

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Voluminous data sets are being generated continually in many branches of science and engineering. As a result, the number of scholarly publications has also increased tremendously. For instance, PubMed carries millions of abstracts, and its size keeps growing at a rapid pace. Given such large repositories, one of the challenges for any biologist is to retrieve the information of interest in a short amount of time. In this research we propose novel solutions for such information retrieval problems.

One of the goals of this research has been to develop a computational tool that can quickly produce a short list of documents that are likely to contain the information of interest.

Information retrieval (IR) is the process of finding information (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored in databases). The term "unstructured data" refers to data that does not have a clear, semantically overt structure that is easy for a computer to process. Information retrieval tools are useful for people from different walks of life, including reference librarians and paralegals. Another popular application is web search. In this research we have developed information retrieval techniques that classify documents into two classes, namely, those that contain information pertinent to a specific topic and those that do not.

A typical tool that we envision takes as input a set of pre-classified documents (that characterize the information of interest), extracts all the keywords from them, and develops a learner model capable of classifying new (previously unclassified) documents into two classes. A class 1 document contains information of interest and a class 2 document does not. Tools similar to what we study in this research have been reported in the literature. Examples include the TextMine algorithm by Vyas et al. and the Gene Selection algorithm by Song and Rajasekaran. We have compared our algorithms with those in the literature and shown that our algorithms yield better results.
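
The workflow described above (pre-classified documents in, keywords extracted, a learner model out, new documents assigned to class 1 or class 2) can be illustrated with a minimal sketch. This is not the dissertation's algorithm, nor TextMine or Gene Selection; it swaps in a plain multinomial Naive Bayes classifier with Laplace smoothing as a stand-in learner. The function names (`tokenize`, `train`, `classify`) and the four toy training documents are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic keyword tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Build per-class keyword counts from (text, label) pairs.

    Labels follow the abstract's convention: 1 = of interest, 2 = not.
    Assumes each class has at least one training document.
    """
    word_counts = {1: Counter(), 2: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocabulary = set(word_counts[1]) | set(word_counts[2])
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return 1 (of interest) or 2 (not of interest) for a new document."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (1, 2):
        # Log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # ignore keywords never seen in training
                # Laplace-smoothed log likelihood of the keyword.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return 1 if scores[1] >= scores[2] else 2


# Hypothetical toy training set, invented for this sketch.
training = [
    ("gene expression in tumor cells", 1),
    ("protein binding and gene regulation", 1),
    ("stock market prices fell sharply", 2),
    ("football season opens with a win", 2),
]
model = train(training)
print(classify("regulation of gene expression", model))
print(classify("market prices for football tickets", model))
```

A real tool would need a much larger training set, and would likely add stop-word removal and keyword weighting; the sketch only shows the shape of the train-then-classify pipeline.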
