Predicting Prediabetes Risk From Electronic Health Records Using Machine Learning

Authors

DOI:

https://doi.org/10.5281/zenodo.18055607

Keywords:

Prediabetes, Machine Learning R, Health Data Analysis, Classification Algorithms, EHR

Abstract

Prediabetes, also known as ‘hidden sugar,’ is a public health priority because it can progress to diabetes if left untreated. Although diabetes has been studied extensively, research on the early detection of prediabetes remains limited, which makes efforts toward early identification especially important. This study develops a prediabetes prediction model by applying machine learning algorithms to electronic health record data. Data were retrieved from the Korean National Health and Nutrition Examination Survey (KNHANES) to examine associations between prediabetes and multiple factors among adults. The dataset consisted of 16 attributes covering clinical health information, socioeconomic indicators, physical activity, and dietary habits for 16,137 individuals. Non-contributory features were removed during preprocessing, and values were normalized with a standard scaler. To evaluate model performance, the dataset was split into an 80% training set and a 20% test set. Four machine learning methods were applied: support vector machine (SVM), k-nearest neighbors (KNN), Logistic Regression, and Random Forest. After training, each model was evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC. Among all models, Random Forest achieved 68% accuracy and 61% precision, while SVM reached 75% recall. Logistic Regression achieved an F1-score of 64% and a ROC-AUC of 75%. These results are promising for the detection of prediabetes. Future work will aim to improve prediction through larger datasets and more advanced techniques, including feature selection and deep learning.
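
The workflow described in the abstract (standard scaling, an 80/20 split, four classifiers, five metrics) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: it assumes scikit-learn, uses a synthetic 16-feature dataset as a stand-in for the survey data, and the stratified split, default hyperparameters, random seeds, and the helper name `evaluate` are all assumptions introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 16 numeric features,
# binary label. The study's actual features and sample are not reproduced.
X, y = make_classification(n_samples=2000, n_features=16, n_informative=8,
                           random_state=0)

# 80% training / 20% test split; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The four model families named in the abstract, with default settings.
models = {
    "SVM": SVC(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}


def evaluate(models, X_train, X_test, y_train, y_test):
    """Fit each model behind a StandardScaler and score it on the test set."""
    results = {}
    for name, clf in models.items():
        # The scaler is fitted on the training data only, inside the pipeline.
        pipe = make_pipeline(StandardScaler(), clf)
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_test)
        # ROC-AUC needs a continuous score: use class probabilities when the
        # model exposes them, otherwise the decision function (e.g. SVC).
        if hasattr(pipe, "predict_proba"):
            y_score = pipe.predict_proba(X_test)[:, 1]
        else:
            y_score = pipe.decision_function(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred),
            "recall": recall_score(y_test, y_pred),
            "f1": f1_score(y_test, y_pred),
            "roc_auc": roc_auc_score(y_test, y_score),
        }
    return results


results = evaluate(models, X_train, X_test, y_train, y_test)
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Wrapping the scaler and classifier in one pipeline keeps test-set statistics out of the normalization step. The numbers printed for the synthetic data are not comparable to the results reported above.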

Published

2025-12-30

How to Cite

Bilgin, N., Gülpınar, E., & Karakuş, Y. (2025). Predicting Prediabetes Risk From Electronic Health Records Using Machine Learning. International Journal of Digital Health & Patient Care, 2(2), 103–110. https://doi.org/10.5281/zenodo.18055607