Date of Completion
Kevin Brown, Ph.D., Ion Mandiou, Ph.D., Yong-Jun Shin, M.D., Ph.D.
Field of Study
Master of Engineering
The ability to collect and store large amounts of data is transforming data-driven discovery; recent technological advances in biology allow systematic data production and storage at a previously unattainable scale. It is common for biological Big Data to have an order of magnitude or more features than samples. Feature scoring with selection is therefore an essential pre-processing step to finding meaningful clusters in these data. Many feature scoring algorithms have been proposed; they are based on dramatically different ideas about what constitutes a “good” or “important” feature. Motivated by studies in data classification, we use a rank aggregation (RANKAGG) method to combine estimates of feature importance from multiple sources and use a subset of the highest scoring features for subsequent clustering. We demonstrate the performance of RANKAGG on five real-world biological data-sets, and compare the clustering performance of RANKAGG to that of the thirteen individual feature scoring methods comprising RANKAGG. The rank aggregated features have a mean performance across the five data-sets equal to that of the best individual feature scoring method but with lower variance, indicating robust performance across a variety of data. We carefully consider whether there is any systematic way to remove rankers from RANKAGG to improve clustering performance. We demonstrate that rank aggregated feature selection yields excellent performance in clustering problems and, possibly more importantly, greatly limits the risk of choosing a method that is sub-optimal for a given data-set.
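The general idea of combining feature rankings before clustering can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not state the aggregation rule or name the thirteen scorers, so the sketch assumes a simple mean-rank (Borda-style) aggregation and uses two stand-in unsupervised scorers, variance and mean absolute deviation. All function names here are hypothetical.

```python
import numpy as np


def rank_scores(scores):
    """Convert scores to ranks; the highest score gets rank 1.

    Ties are broken by feature index (an arbitrary choice for this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks


def aggregate_ranks(score_lists):
    """Assumed aggregation rule: average each feature's rank across scorers."""
    ranks = np.vstack([rank_scores(s) for s in score_lists])
    return ranks.mean(axis=0)


def variance_score(X):
    """Stand-in scorer: per-feature variance (columns are features)."""
    return X.var(axis=0)


def mad_score(X):
    """Stand-in scorer: per-feature mean absolute deviation from the mean."""
    return np.abs(X - X.mean(axis=0)).mean(axis=0)


def select_top_features(X, scorers, k):
    """Return indices of the k features with the best (lowest) mean rank."""
    mean_rank = aggregate_ranks([scorer(X) for scorer in scorers])
    return np.argsort(mean_rank, kind="stable")[:k]
```

With a samples-by-features matrix `X`, a call such as `select_top_features(X, [variance_score, mad_score], k)` would return the column indices to keep, and `X[:, idx]` could then be passed to any clustering routine. Averaging ranks rather than raw scores keeps scorers with different scales comparable, which is one plausible reason an aggregated ranking might vary less across data-sets than any single scorer.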
Yankee, Tara N., "Rank Aggregation of Feature Scoring Methods for Unsupervised Learning" (2017). Master's Theses. 1123.
Kevin Brown, Ph.D.