Principal Component Analysis Implementation on Machine Learning in Diabetes Classification
Diabetes Mellitus, a global health burden linked to increased cancer risks, can be identified through variables such as BMI, age, blood sugar, and HbA1c. This study explores several machine learning techniques for diabetes prediction, with emphasis on the role of dimensionality reduction and feature selection in improving model accuracy. Our aim is to compare the performance measures of multiple machine learning algorithms on the original data against the same data after applying either a sampling method for handling class imbalance or principal component analysis (PCA). The study uses Kaggle’s “Diabetes Prediction Dataset”, which has 100,000 entries with eight features and one target variable indicating diabetes. In the experiment, the data were organized into three datasets: 1) the whole dataset, 2) a dataset containing males only, and 3) a dataset containing females only. Five machine learning models were trained on each dataset: K-Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machines (SVM), XGBoost (XGB), and Random Forest (RF). The findings show that XGB outperformed the other models on the imbalanced dataset, with an F1-score of 80.87. In the gender-based classification, the Random Forest model was better for males, with an F1-score of 80.34, while XGB was better for females, with an F1-score of 81.9.
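The comparison described above (the same classifier evaluated by F1-score with and without PCA) can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn, replaces the Kaggle data with a synthetic imbalanced dataset of eight features, and uses an arbitrary choice of five principal components. The positive-class ratio, the hyperparameters, and the `evaluate` helper are likewise assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption): 8 features,
# with roughly 9% of samples in the positive class.
X, y = make_classification(
    n_samples=2000,
    n_features=8,
    n_informative=5,
    n_redundant=2,
    weights=[0.91],
    random_state=0,
)

# A stratified split keeps the class ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def evaluate(use_pca: bool) -> float:
    """Fit a Random Forest, optionally on PCA-reduced features, and return the test F1."""
    steps = [StandardScaler()]  # PCA is scale-sensitive, so standardize first
    if use_pca:
        steps.append(PCA(n_components=5))  # 5 components is an arbitrary choice
    steps.append(RandomForestClassifier(n_estimators=100, random_state=0))
    model = make_pipeline(*steps)
    model.fit(X_train, y_train)
    # F1 of the positive class, which is more informative than accuracy
    # when the classes are imbalanced.
    return f1_score(y_test, model.predict(X_test))


scores = {"original": evaluate(False), "pca": evaluate(True)}
for name, score in scores.items():
    print(f"{name}: F1 = {score:.4f}")
```

Wrapping the scaler, PCA, and classifier in one pipeline means the PCA projection is fitted on the training split only, which avoids leaking test information into the comparison. The same helper could be called on a male-only or female-only subset to mirror the three-dataset design, and with other classifiers in place of the Random Forest.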
Authors:
Michael Tantowen, Krisna Putra, Mahmud Isnan, Bens Pardamean
Communications in Mathematical Biology and Neuroscience (CMBN)