A Comparative Study on Machine Learning Models to Classify Diseases Based on Patient Behaviour and Habits

Elham Musaaed1, 2, *, Nabil Hewah2, Abdulla Alasaadi3

Corresponding Author:

Elham Musaaed

Affiliation(s):

1 MSc. in Big Data Science and Analytics, College of Science, University of Bahrain, Al-Riffa, Kingdom of Bahrain

Email: [email protected]

2 Department of Information Systems, College of Information Technology, University of Bahrain, Al-Riffa, Kingdom of Bahrain

Email: [email protected]

3 Department of Information Systems, College of Information Technology, University of Bahrain, Al-Riffa, Kingdom of Bahrain

Email: [email protected]


Abstract:

In recent years, machine learning (ML) algorithms have proven useful for predicting diseases from health data, which makes disease modeling a promising application area for them. Most of these applications employ supervised rather than unsupervised ML algorithms. At the same time, the amount of data in medical science grows rapidly each year. These data include clinical and Patient-Related Factors (PRF), such as height, weight, age, other physical characteristics, blood sugar, lipids, and insulin, all of which change continually over time. Analysis of historical data can help identify disease risk factors and their interactions, which is useful for disease diagnosis and prediction. The information in these data can help doctors diagnose accurately and can make people more aware of risk factors and key indicators so that they can act proactively. This study uses six supervised ML approaches in a comprehensive experiment to investigate the correlation between PRF and Diabetes, Stroke, Heart Disease (HD), and Kidney Disease (KD). It also investigates how Diabetes, Stroke, and KD, together with PRF, are linked to HD. The research further compares and evaluates ML algorithms for classifying each disease based on PRF, and for classifying HD based on PRF together with Diabetes, Stroke, Asthma, Skin Cancer, and KD as attributes. Lastly, HD predictions are provided through a web-based application built on the most accurate classifier, which lets users enter their own values and obtain a prediction. The algorithms used were Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), K-Nearest Neighbor (KNN), Extreme Gradient Boosting (XGB), and Support Vector Machine (SVM). The dataset was obtained from the Kaggle repository. The attributes are divided into PRF and diseases.
The selected algorithms were applied to the dataset with hyperparameters tuned using the “GridSearchCV” method in order to obtain the best performance. The accuracy of the algorithms ranged from 70% to 76%. Based on accuracy, recall, precision, and F1-score, all ML algorithms predicted HD more accurately than Diabetes, Stroke, and KD. The algorithms performed even better when predefined diseases were combined with PRF to predict HD. Although there was no significant difference between the algorithms, LR achieved the highest accuracy: 75% when using only PRF and 76% when using a combination of disease attributes and PRF, using a 70/30 split. Furthermore, accuracy increased from 74.8% to 76% when using 10-fold cross-validation (CV). Two conclusions have been drawn. First, these features are more closely related to HD than to the other diseases and can be useful in predicting HD more proactively. Second, the risk of HD increases with the presence of predefined diseases, especially Diabetes and Stroke. In terms of performance, LR was consistently among the best classifiers and performed similarly to more complex ML algorithms, while NB performed the worst.
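
The evaluation measures and data splits named above (accuracy, precision, recall, F1-score, a 70/30 split, and 10-fold CV) can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the study reports using GridSearchCV, whereas the function names below (`classification_scores`, `holdout_split`, `k_fold_indices`) are hypothetical helpers, and the label vectors in the usage example are made up purely for illustration.

```python
import random


def classification_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def holdout_split(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into (train, test), 70/30 by default."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_rows, k=10, seed=0):
    """Return k (train, test) index pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Illustrative (made-up) labels: 1 = disease present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = classification_scores(y_true, y_pred)
train_idx, test_idx = holdout_split(100)
cv_splits = k_fold_indices(100, k=10)
```

In a 10-fold setting each model would be fitted on the `train` indices and scored on the `test` indices of every split, and the per-fold scores averaged. Averaging over folds tends to give a steadier estimate than a single hold-out split, which may be one reason a study would report both.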

Keywords:

Machine Learning, Classifiers, Patient-Related Factors, Heart Disease, Risk Factors

Cite This Paper:

Elham Musaaed, Nabil Hewah, Abdulla Alasaadi (2024). A Comparative Study on Machine Learning Models to Classify Diseases Based on Patient Behaviour and Habits. Journal of Artificial Intelligence and Systems, 6, 34–58. https://doi.org/10.33969/AIS.2024060103.
