Predicting Prediabetes Risk From Electronic Health Records Using Machine Learning

Nursena Bilgin; Ekber Gülpınar; Yasin Karakuş

doi:10.5281/zenodo.18055607

Open Access

Nursena Bilgin ¹, Ekber Gülpınar ², Yasin Karakuş ³

¹ Kutahya Health Sciences University

² Kutahya Health Sciences University

³ Kutahya Health Sciences University

Abstract

Prediabetes, also known as ‘hidden sugar,’ is a public health priority because it can progress to diabetes if left untreated. Although diabetes has been studied extensively, research on the early detection of prediabetes remains limited, which makes work on early identification especially important. This study develops a prediabetes prediction model by applying machine learning algorithms to electronic health record data. Data were retrieved from the Korean National Health and Nutrition Examination Survey (KNHANES) to examine associations between prediabetes and multiple factors among adults. The dataset consisted of 16 attributes covering clinical health information, socioeconomic indicators, physical activity, and dietary habits for 16,137 individuals. During preprocessing, non-contributory features were removed and values were normalized with a Standard Scaler. To evaluate model performance, the dataset was split into an 80% training set and a 20% test set. Four machine learning methods were applied: SVM, KNN, Logistic Regression, and Random Forest. After training, each model was evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC. Among all models, Random Forest achieved 68% accuracy and 61% precision, SVM achieved 75% recall, and Logistic Regression achieved an F1-score of 64% with a ROC-AUC of 75%. These results are promising for the detection of prediabetes. Future work will aim to improve prediction by using larger datasets, advanced feature selection, and deep learning techniques.
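The five evaluation metrics named in the abstract can be made concrete with a minimal sketch. This is not the authors' implementation: it uses plain Python, the labels and scores are made-up toy values, and the 0.5 decision threshold and all function names are assumptions chosen only to illustrate how each metric is computed for a binary prediabetes classifier.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 = prediabetes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy test set (illustrative values only, not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

# Assumed decision threshold of 0.5 turns scores into hard predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
metrics["roc_auc"] = roc_auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Accuracy, precision, recall, and F1-score depend on the chosen threshold, whereas ROC-AUC is computed from the raw scores and summarizes ranking quality across all thresholds. That is one reason different models can lead on different metrics, as reported in the abstract.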

Keywords

Prediabetes, Machine Learning, Health Data Analysis, Classification Algorithms, EHR

How to Cite

Bilgin, N., Gülpınar, E., & Karakuş, Y. (2025). Predicting Prediabetes Risk From Electronic Health Records Using Machine Learning. International Journal of Digital Health & Patient Care, 2(2), 103–110. https://doi.org/10.5281/zenodo.18055607
