Predicting Prediabetes Risk From Electronic Health Records Using Machine Learning
Abstract
Prediabetes, also known as ‘hidden sugar,’ is a public health priority because it can progress to diabetes if left untreated. Although diabetes has been studied extensively, research on the early detection of prediabetes remains limited, which makes early identification efforts important. This study develops a prediabetes prediction model by applying machine learning algorithms to electronic health record data. Data were retrieved from the Korean National Health and Nutrition Examination Survey (KNHANES) and used to examine associations between prediabetes and multiple factors among adults. The dataset comprised 16 attributes for 16,137 individuals, covering clinical health information, socioeconomic indicators, physical activity, and dietary habits. During preprocessing, non-contributory features were removed and the remaining values were normalized with a Standard Scaler. To evaluate model performance, the dataset was split into an 80% training set and a 20% test set. Four machine learning methods were applied: support vector machine (SVM), k-nearest neighbors (KNN), Logistic Regression, and Random Forest. Each model was trained on the training set and evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC. Among the four models, Random Forest achieved 68% accuracy and 61% precision, SVM achieved 75% recall, and Logistic Regression achieved an F1-score of 64% with a ROC-AUC of 75%. These results suggest that machine learning models are a promising approach to the detection of prediabetes. Future work will aim to improve prediction by using larger datasets and more advanced feature selection, including deep learning techniques.
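The evaluation protocol summarized in the abstract can be illustrated with a minimal sketch, assuming scikit-learn. This is not the study's implementation: the data below are synthetic (generated with `make_classification` as a stand-in for the KNHANES records), the use of 16 input features, the stratified split, the random seeds, and all hyperparameters are assumptions, and `evaluate_models` is a hypothetical helper name. The scores it prints therefore bear no relation to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the survey data (assumption): 16 features, binary label.
X, y = make_classification(n_samples=2000, n_features=16, n_informative=8,
                           random_state=0)

# 80% training / 20% test split; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Each entry: (estimator, method used to obtain a continuous score for ROC-AUC).
# Hyperparameters are library defaults, not the study's settings.
MODELS = {
    "SVM": (SVC(random_state=0), "decision_function"),
    "KNN": (KNeighborsClassifier(), "predict_proba"),
    "Logistic Regression": (LogisticRegression(max_iter=1000), "predict_proba"),
    "Random Forest": (RandomForestClassifier(random_state=0), "predict_proba"),
}


def evaluate_models(models, X_train, X_test, y_train, y_test):
    """Fit each model behind a StandardScaler and score it on the test set."""
    results = {}
    for name, (estimator, score_method) in models.items():
        # The scaler is fitted on the training data only, inside the pipeline.
        pipe = make_pipeline(StandardScaler(), estimator)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        scores = getattr(pipe, score_method)(X_test)
        if scores.ndim == 2:  # predict_proba returns one column per class
            scores = scores[:, 1]
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
            "roc_auc": roc_auc_score(y_test, scores),
        }
    return results


results = evaluate_models(MODELS, X_train, X_test, y_train, y_test)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Wrapping the scaler and classifier in one pipeline keeps the test set out of the normalization statistics, which is one reasonable way to realize the split described above.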