Date of Completion

1-17-2014

Embargo Period

1-17-2014

Keywords

transcription factor, machine learning, web tool

Major Advisor

Chun-Hsi Huang

Associate Advisor

Jinbo Bi

Associate Advisor

Sanguthevar Rajasekaran

Associate Advisor

Daniel Schwartz

Associate Advisor

Dong-Guk Shin

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

A transcription factor (TF) is a protein or protein complex that regulates the expression of its target genes by physically binding to their regulatory regions. The binding sites of a TF share a common pattern, or motif. Given the known binding sites of a TF, a model of the TF can be built and used to scan sequences for putative binding sites. This is known as the transcription factor binding site (TFBS) search problem. In this dissertation, we investigate the TFBS search problem using machine learning approaches.
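As a rough illustration of what "building a TF model and scanning a sequence" can look like, the sketch below uses a log-odds position weight matrix (PWM), a standard textbook model. It is not the method developed in this dissertation. The function names, the pseudocount of 1, the uniform background of 0.25, and the example sites are all assumptions made for the illustration, and the sites are assumed to be already aligned to equal length.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length binding sites.

    Assumption: uniform background and an additive pseudocount,
    chosen only for illustration.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned to the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((count / total) / background)
                    for base, count in counts.items()})
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])

# Hypothetical example sites, not taken from the dissertation.
example_sites = ["TATAAA", "TATAAT", "TATATA", "TTTAAA"]
example_pwm = build_pwm(example_sites)
start, score = best_hit("GGCGTATAAAGGC", example_pwm)
```

In practice a score threshold decides which windows are reported as putative binding sites. The sketch also assumes the sequence contains only the characters A, C, G, and T.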

In general, the known binding sites of a TF are of variable lengths and have to be aligned before a model can be built. TFBS alignment is considered an unsupervised learning problem since no other information about the unaligned binding sites is given. We propose an algorithm that considers both the lengths of TFBSs and the dependencies between nucleotide positions in a binding site. The novel method is named LASAGNA (Length-Aware Site Alignment Guided by Nucleotide Association).

Studies often use TFBS search tools to predict the binding sites of a TF in a DNA sequence when binding sites found by assays are not available. Such an analysis typically involves TF model collection, promoter sequence retrieval, and visualization, and therefore requires several separate tools. To accelerate TFBS analyses, we developed a novel integrated web tool named LASAGNA-Search. This user-friendly tool allows users to perform the entire analysis without leaving the site.

TFBS search methods are considered supervised learning algorithms since they learn from example binding sites of a TF. Most TFBS search methods consider only the known binding sites of a TF and hence solve a one-class classification problem. However, non-binding sites contain information about the TF as well. When non-binding sites are available, searching for TFBSs becomes a two-class classification problem. We propose two novel methods, named the negative-to-positive vector method and the optimal discriminating vector method, both of which utilize binding sites as well as non-binding sites.
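To make the two-class setting concrete, the sketch below trains a generic linear scorer on one-hot encoded binding and non-binding sites, using the difference between the two class means as the weight vector. This is a common baseline and is not a reconstruction of either method proposed in this dissertation, whose details are not given in this abstract. The function names, the encoding, and the example sites are assumptions made for the illustration.

```python
def one_hot(site):
    """Encode a DNA string as a flat 0/1 vector, four entries per position."""
    vec = []
    for base in site:
        vec.extend(1.0 if base == b else 0.0 for b in "ACGT")
    return vec

def mean_vector(sites):
    """Average the one-hot vectors of a set of equal-length sites."""
    encoded = [one_hot(site) for site in sites]
    return [sum(column) / len(encoded) for column in zip(*encoded)]

def fit_mean_difference(positives, negatives):
    """Fit a linear scorer (assumed baseline, not the dissertation's method).

    The weights point from the mean non-binding site to the mean binding
    site; the bias places the decision boundary at their midpoint.
    """
    pos_mean = mean_vector(positives)
    neg_mean = mean_vector(negatives)
    weights = [p - n for p, n in zip(pos_mean, neg_mean)]
    midpoint = [(p + n) / 2 for p, n in zip(pos_mean, neg_mean)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias

def score(site, weights, bias):
    """Positive scores lean toward 'binding site', negative toward 'non-binding'."""
    return sum(w * x for w, x in zip(weights, one_hot(site))) + bias

# Hypothetical training data, not taken from the dissertation.
binding = ["TATAAA", "TATAAT", "TATATA"]
non_binding = ["GCGCGC", "CCGGCC", "GGCCGG"]
weights, bias = fit_mean_difference(binding, non_binding)
```

Unlike a one-class model, the weights here depend on the non-binding examples: positions where binding and non-binding sites look alike receive weights near zero and contribute little to the score.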
