PREDICTING BREAST CANCER RISK USING MACHINE LEARNING WITH FEATURE SELECTION THROUGH XGBOOST

Cahya Mutiara Al Azhar; Pujiono Pujiono

doi:10.33330/jurteksi.v11i2.3661

Cahya Mutiara Al Azhar, Universitas Dian Nuswantoro
Pujiono Pujiono, Universitas Dian Nuswantoro


Abstract

Abstract: Breast cancer is the leading cause of cancer death among women worldwide, and late detection worsens outcomes. This study proposes a breast cancer risk prediction framework that combines XGBoost with SelectKBest feature selection. The framework aims to improve the accuracy and efficiency of early detection through exploratory data analysis, data encoding, SMOTE oversampling to address class imbalance, and feature selection that retains k = 29 features. On the test data, the XGBoost model achieved 98.1% accuracy, 98.2% precision, 98.1% recall, and a 98.1% F1-score, which underscores the value of feature selection. These results are promising for patient prioritization (triage): they could help medical personnel identify high-risk patients for further examination and so allocate resources more efficiently. The findings support the use of SelectKBest and pave the way for machine learning-based clinical decision support systems in early breast cancer detection workflows. The study thereby contributes to the application of machine learning in support of early breast cancer detection.

Keywords: breast cancer; feature selection; machine learning; risk prediction; XGBoost.
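The feature selection step named in the abstract can be illustrated with a minimal sketch. The abstract does not state which scoring function SelectKBest was run with, so the code below assumes the one-way ANOVA F statistic (the scikit-learn default, f_classif) and reimplements it with the Python standard library only. The function names, the toy data, and k = 2 are hypothetical and are not taken from the study, which reports k = 29.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one feature column against class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(rows, labels, k):
    """Return the indices of the k highest-scoring features, plus all scores."""
    n_features = len(rows[0])
    scores = [anova_f_score([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Toy data, made up for illustration: 6 samples, 3 features, 2 classes.
rows = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.4],
    [3.1, 5.5, 0.3],
    [2.9, 4.5, 0.8],
    [3.0, 5.0, 0.5],
]
labels = [0, 0, 0, 1, 1, 1]

kept, scores = select_k_best(rows, labels, k=2)
print(kept)  # prints [0, 2]
```

In the toy data, feature 1 has identical class means, so its F score is zero and it is the one dropped. In a pipeline like the one the abstract describes, the retained columns would then be passed to the classifier.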


Author Biography

Cahya Mutiara Al Azhar, Universitas Dian Nuswantoro

Name: Cahya Mutiara Al Azhar
Affiliation: Universitas Dian Nuswantoro
Semester: 7
Expertise: Data Analysis, Machine Learning, Data Mining
Research Interests: Focused on predicting breast cancer risk using machine learning algorithms.
Publication: Predicting Breast Cancer Risk Using Machine Learning with Feature Selection through XGBoost
