Fast and Effective Clustering Method for Ancestry Estimation

Ancestry estimation, which provides family history information, is one of the most popular services in direct-to-consumer genomic testing. It is also an important step in association studies, where it is used to reduce confounding by ancestry in the relationship between genotypes and disease risk. Several methods have been developed to estimate ancestry, but some of them remain computationally slow. In this paper, a method combining K-means clustering with PCA is proposed to estimate ancestry from SNP genotyping data. The method was compared with a baseline model, fastSTRUCTURE, in terms of clustering quality and computation time. Public data from the 1000 Genomes Project were used to train and evaluate both the proposed model and the baseline model. The proposed model generated clusters with higher accuracy than fastSTRUCTURE (91.02% versus 90.39%). More importantly, it ran roughly 100 times faster than fastSTRUCTURE (4.86 seconds versus 490 seconds).
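The general idea of combining PCA with K-means on genotype data can be illustrated with a minimal sketch. This is not the authors' implementation: the simulated genotype matrix, the number of principal components (2), the number of clusters (3), the farthest-first centroid initialization, and all function names (`simulate_genotypes`, `pca_project`, `kmeans`, `cluster_accuracy`) are assumptions made for illustration only. The sketch uses NumPy alone and toy data rather than 1000 Genomes data.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_snps=500, n_pops=3, seed=0):
    """Toy data (assumption): each population has its own allele
    frequencies; genotypes are minor-allele counts in {0, 1, 2}."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_snps))
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    genotypes = rng.binomial(2, freqs[labels])
    return genotypes.astype(float), labels


def pca_project(X, n_components):
    """Project samples onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each SNP column
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm with farthest-first initialization
    (a simplification chosen here to keep the sketch deterministic)."""
    centers = [X[0]]
    for _ in range(1, k):
        # distance from every point to its nearest chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers


def cluster_accuracy(assign, labels):
    """Score each cluster by its majority true label."""
    correct = 0
    for j in np.unique(assign):
        correct += np.bincount(labels[assign == j]).max()
    return correct / len(labels)


if __name__ == "__main__":
    X, y = simulate_genotypes()
    Z = pca_project(X, n_components=2)   # reduce SNPs to 2 components
    assign, _ = kmeans(Z, k=3)           # cluster in the reduced space
    print("majority-vote accuracy:", cluster_accuracy(assign, y))
```

Clustering in the low-dimensional PCA space rather than on the full SNP matrix is what keeps each K-means iteration cheap, which is consistent with the speed advantage reported above, although the actual pipeline and parameters used in the paper may differ.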

Conference: 2019 International Conference on Computer Science and Computational Intelligence, Vol. 157, Yogyakarta, Indonesia

Arif Budiarto, Bharuno Mahesworo, James W Baurley, Teddy Suparyanto, Bens Pardamean
