Gaussian Mixture Model Implementation for Population Stratification Estimation from Genomics Data

Genomic studies, as opposed to socio-anthropological ones, have proven to be an excellent tool for depicting biological relatedness and disease risk factors. For more than a decade, the Genome-wide Association Study (GWAS) has been the mainstay approach for analyzing the resulting genomic data. Selecting confounding variables, in particular ancestry or population stratification, is essential to maintaining the quality of a GWAS. Researchers have developed various methods for extracting population stratification information from high-dimensional genomic data, especially Single Nucleotide Polymorphism (SNP) data. In the present study, we propose an implementation of a Gaussian Mixture Model (GMM) complemented by Principal Component Analysis (PCA) as an unsupervised model to estimate population stratification from samples. The results of this approach were compared with those obtained from K-means and from the commonly used ancestry estimation software fastSTRUCTURE. We found that our approach outperformed both, as shown by the average cluster and population scores. Furthermore, it was able to generate a probability distribution for each sample across all populations, although the quality of this distribution was limited. These results warrant further investigation with more comprehensive population coverage and more advanced algorithms.
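
The general idea of a PCA-complemented GMM can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, uses purely synthetic genotype counts, and the helper names (`simulate_genotypes`, `pca_gmm`), the number of principal components, and the number of populations are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def simulate_genotypes(n_per_pop=50, n_snps=200, n_pops=3, seed=0):
    """Simulate 0/1/2 allele counts for hypothetical populations (synthetic data)."""
    rng = np.random.default_rng(seed)
    # Assumption: each population has its own allele frequency at every SNP.
    freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_snps))
    genotypes = np.vstack(
        [rng.binomial(2, freqs[p], size=(n_per_pop, n_snps)) for p in range(n_pops)]
    )
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    return genotypes.astype(float), labels


def pca_gmm(genotypes, n_components=2, n_populations=3, seed=0):
    """Project samples onto principal components, then fit a GMM on the projection."""
    # Step 1: reduce the high-dimensional SNP matrix to a few components.
    pcs = PCA(n_components=n_components, random_state=seed).fit_transform(genotypes)
    # Step 2: fit one Gaussian component per assumed population.
    gmm = GaussianMixture(
        n_components=n_populations, covariance_type="full", random_state=seed
    ).fit(pcs)
    # Hard cluster assignments and per-sample membership probabilities.
    return gmm.predict(pcs), gmm.predict_proba(pcs)


genotypes, true_labels = simulate_genotypes()
hard_labels, probabilities = pca_gmm(genotypes)
print(probabilities[:3].round(3))  # membership probabilities of the first three samples
```

Each row of `probabilities` sums to one and can be read as a soft assignment of a sample across the assumed populations, which is the kind of output that a hard-clustering method such as K-means does not provide directly.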

International Conference on Computer Science and Computational Intelligence 2020

Arif Budiarto, Bharuno Mahesworo, Alam Ahmad Hidayat, Ika Nurlaila, and Bens Pardamean
