Date of Completion


Embargo Period



Sanguthevar Rajasekaran; Daniel Schwartz

Field of Study

Computer Science and Engineering


Master of Science

Open Access

Open Access


Background: Many consensus-based and Position Weight Matrix-based methods for recognizing transcription factor binding sites (TFBSs) are not well suited to the variability in the lengths of binding sites. In addition, many methods discard known binding sites while building the model. Moreover, the impact of Information Content (IC) and the positional dependence of nucleotides within an aligned set of TFBSs have not been well researched for modeling variable-length binding sites. In this paper, we propose ML-Consensus, a consensus model for variable-length binding sites which does not exclude any input binding sites. We consider Pairwise Score (PS) as a measure of positional dependence of nucleotides within an alignment of binding sites. We investigate how the prediction accuracy of ML-Consensus is affected by using IC, PS, and any particular binding site alignment strategy. We perform leave-one-out cross-validations on datasets of six species from the TRANSFAC public database, and analyze the results using ROC curves and the Wilcoxon matched-pair signed-ranks test.

Results: We observed that the incorporation of IC and PS in ML-Consensus results in a statistically significant improvement in the prediction accuracy. Moreover, any two positions in the multiple sequence alignment of the binding sites were found to be interdependent only when the distance between them was below a certain value. Lastly, configurations with state-of-the-art alignment strategies did not perform significantly better than configurations with a naive alignment strategy.

Conclusions: There exists a core region within a set of known binding sites, and positions in that core region are interdependent. Additionally, it is possible to improve the existing state-of-the-art multiple sequence alignment algorithms by using this information about the core region among the binding sites.

Availability: All source code (C#), results, supporting evidence, supplementary data and figures are available from .

Major Advisor

Chun-Hsi Huang