A Review: Data Pre-processing Techniques Used for Diabetes Prediction

Datasets used for diabetes classification commonly suffer from large numbers of missing values, outliers, and dataset imbalance. To examine how these issues are handled, this study analyzed 18 studies from the past 5 years that applied machine learning algorithms to diabetes classification. The analysis revealed the important role of data pre-processing in building effective classification models: the same model can perform differently depending on the pre-processing techniques applied. The study identified K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) as superior methods for filling in missing values, with reported accuracies of 98.49% and 94.89%, respectively. These approaches outperformed traditional methods such as median or mean replacement. However, the challenge of imbalanced datasets remained in all of the studies reviewed. The evaluation metrics most commonly used in the reviewed studies were accuracy, precision, specificity, sensitivity/recall, and F1 score. Overall, this review showed that data pre-processing is no less important than algorithm selection for improving the performance of machine learning models in diabetes classification.
Authors:
Mahmud Isnan, Gregorius Natanael Elwirehardja, Bens Pardamean
2024 International Conference on Computer Science and Computational Intelligence (ICCSCI)
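
Illustrative sketch (not taken from the paper): the abstract contrasts KNN-based imputation with median replacement and lists the usual evaluation metrics. The Python below shows one minimal way these ideas could look in code. The function names, the toy glucose/BMI values, and the choice of k are assumptions made only for illustration; none of them come from the reviewed studies, and the code does not reproduce the reported accuracies.

```python
import math
from statistics import median


def median_impute(rows):
    """Replace each None with the median of the observed values in its column."""
    n_cols = len(rows[0])
    medians = [median(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the mean of that column over the k nearest rows.

    Distance is a root-mean-square difference over the columns both rows
    have observed. Assumes every column has at least one observed donor.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


def classification_metrics(y_true, y_pred):
    """Compute the five metrics named in the abstract for binary labels (1 = diabetic).

    Assumes both classes occur and at least one positive is predicted,
    so no denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "specificity": tn / (tn + fp),
        "sensitivity": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical toy records: [glucose, BMI]; None marks a missing value.
toy_rows = [
    [85.0, 22.0],
    [90.0, 23.0],
    [160.0, 35.0],
    [165.0, None],
    [170.0, 37.0],
]

if __name__ == "__main__":
    print("median:", median_impute(toy_rows)[3])
    print("knn:   ", knn_impute(toy_rows, k=2)[3])
```

On this toy data, the median fill uses a value pooled from all records, while the KNN fill draws only on the two records with the closest glucose readings. That is the intuition behind neighbor-based imputation, though whether it helps on a real dataset has to be checked empirically with metrics such as those computed by classification_metrics.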