Date of Completion

7-9-2018

Embargo Period

7-9-2018

Keywords

population genetics, inference problem, maximum likelihood, machine learning, species delimitation, ancestry inference, demographic history

Major Advisor

Yufeng Wu

Associate Advisor

Ion Mandoiu

Associate Advisor

Mukul Bansal

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Inference of population history is a central problem in population genetics. The advent of large genetic data sets brings not only opportunities to develop more accurate inference methods, but also computational challenges. We therefore aim to develop accurate methods and fast algorithms for problems in population genetics.

Inference of admixture proportions is a classical statistical problem. We focus in particular on inferring the ancestry of an individual's ancestors. Standard methods implicitly assume that both parents of an individual have the same admixture fraction. However, this is rarely the case in real data. We develop a Hidden Markov Model (HMM) framework for estimating the admixture proportions of the immediate ancestors of an individual, i.e., a way of apportioning an individual's admixture proportions among its ancestors. Based on a genealogical model for admixture tracts, we develop an efficient algorithm for computing the sampling probability of the genome of a single individual as a function of the admixture proportions of that individual's ancestors. We show that the distribution and lengths of admixture tracts in a genome contain information about the admixture proportions of the ancestors of an individual. This allows us to perform probabilistic inference of the admixture proportions of ancestors using only the genome of an extant individual.
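
As a rough illustration of how an HMM likelihood can be written as a function of an admixture proportion, the sketch below runs a scaled forward algorithm over a toy two-ancestry model and picks the proportion that maximizes the likelihood on a grid. This is not the genealogical tract model developed in the dissertation, and it estimates a single proportion for one haplotype rather than separate proportions for each ancestor. The function names, the per-marker `switch_prob`, and the allele-frequency inputs are all assumptions made for the example.

```python
import math


def forward_log_likelihood(alleles, freqs_a, freqs_b, m, switch_prob):
    """Log-likelihood of one haplotype under a toy two-ancestry HMM.

    Hypothetical model (an assumption, not the dissertation's):
      - hidden state at each marker is the local ancestry, A or B;
      - between adjacent markers, with probability `switch_prob` the
        ancestry is redrawn from (m, 1 - m), otherwise it is unchanged;
      - allele 1 is emitted with frequency freqs_a[i] or freqs_b[i].
    `m` must lie strictly between 0 and 1.
    """
    stationary = (m, 1.0 - m)
    alpha = list(stationary)  # state distribution before the first marker
    log_lik = 0.0
    for i, allele in enumerate(alleles):
        if i > 0:
            # Transition step; alpha sums to 1 after the previous rescaling.
            alpha = [(1.0 - switch_prob) * alpha[s] + switch_prob * stationary[s]
                     for s in (0, 1)]
        # Emission step for the observed allele at marker i.
        for s, f in enumerate((freqs_a[i], freqs_b[i])):
            alpha[s] *= f if allele == 1 else 1.0 - f
        # Rescale to avoid underflow and accumulate the log-likelihood.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


def estimate_admixture(alleles, freqs_a, freqs_b, switch_prob, grid_size=99):
    """Grid-search maximum-likelihood estimate of m over 0.01, ..., 0.99."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda m: forward_log_likelihood(
        alleles, freqs_a, freqs_b, m, switch_prob))
```

A grid search keeps the sketch short; a numerical optimizer would be the more usual choice when several proportions are estimated jointly.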

To better understand populations, we further study the species delimitation problem, that is, the problem of determining the boundary between populations and species. We propose a classification-based method that assigns a set of populations to a number of species. Our new method uses summary statistics generated from genetic data to classify pairs of populations as either 'same species' or 'different species'. We show that machine learning can be used for species delimitation and can be scaled to large genomic data. It can also outperform Bayesian approaches, especially when gene flow is involved in the evolutionary process.
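
The pairwise classification idea can be sketched in a few lines. The example below computes one summary statistic per pair of populations, labels the pair with a nearest-centroid rule trained on labeled examples, and then merges populations connected by 'same' labels. The abstract does not name the statistics or the classifier used, so the simplified Hudson-style F_ST (without sample-size correction), the nearest-centroid rule, the union-find merging step, and all function names here are stand-in assumptions for illustration only.

```python
def fst_like(p1, p2):
    """Simplified Hudson-style F_ST from two lists of allele frequencies
    (ratio of sums, omitting the sample-size correction terms)."""
    num = sum((a - b) ** 2 for a, b in zip(p1, p2))
    den = sum(a * (1 - b) + b * (1 - a) for a, b in zip(p1, p2))
    return num / den if den > 0 else 0.0


def train_centroids(features, labels):
    """Nearest-centroid 'training': mean feature value per class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def classify(x, centroids):
    """Return the label whose centroid is closest to feature value x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def delimit(pop_freqs, centroids):
    """Group populations into putative species by merging every pair
    that the classifier labels 'same' (union-find; an assumed step)."""
    names = list(pop_freqs)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            stat = fst_like(pop_freqs[a], pop_freqs[b])
            if classify(stat, centroids) == "same":
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

In practice the labeled training pairs would typically come from simulations under known species histories, and a feature vector of several statistics would replace the single value used here.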
