Construction of a COPD risk prediction model based on machine learning and the COPD-SQ questionnaire

Acta Universitatis Medicinalis Anhui

Fund programs: National Natural Science Foundation of China (No. 82560019); National Science and Technology Major Project for Prevention and Treatment of Cancer, Cardiovascular and Cerebrovascular Diseases, Respiratory and Metabolic Diseases (No. 2023ZD0506100); Guiding Science and Technology Plan Project of the Xinjiang Production and Construction Corps (No. 2023ZD019)

Authors: Chen Lin1, Zhao Luna1, Zhou Yue2, Wang Panpan3, Li Jingkun2, Zhang Wenwen1, Zhang Xinxin1, Wu Chao1, Liu Dong1

Keywords: chronic obstructive pulmonary disease; risk prediction model; machine learning; logistic regression; class imbalance; screening

DOI:
Collection: Medicine and Health Science and Technology

〔Abstract〕 Objective To construct and evaluate several machine learning models for predicting individual risk of chronic obstructive pulmonary disease (COPD), providing data support for early screening and intervention. Methods A total of 823 subjects were enrolled, comprising 142 individuals in the COPD high-risk group and 681 in the low-risk group. Data collected included demographic characteristics, smoking history, symptoms (such as cough and shortness of breath), and scores from the COPD Screening Questionnaire (COPD-SQ). Four machine learning algorithms (Logistic Regression, Random Forest, Support Vector Machine, and XGBoost) were used to construct risk prediction models. Model performance was assessed by 5-fold cross-validation using accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (AUC), and average precision (AP). A feature importance analysis was also performed. Results The Logistic Regression model performed best, with an AUC of 0.982 and an AP of 0.939, followed closely by the Random Forest model (AUC 0.975, AP 0.890). Feature importance analysis identified smoking history, shortness of breath, and body weight as significant predictors. All models identified the low-risk population well, with precision above 0.93, but they differed in how well they identified the high-risk population. Conclusion Machine learning models can effectively identify individuals at high risk for COPD. Among them, the logistic regression model showed the best overall performance, efficiently identifying high-risk individuals, and can serve as an auxiliary clinical screening tool. Because the models have distinct performance characteristics, they suit different clinical screening scenarios and can offer targeted decision support for establishing a tiered, intelligent COPD screening pathway.
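The evaluation protocol summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names, the unshuffled round-robin fold assignment, and the eight toy scores are assumptions made purely for illustration. Only the class counts (142 high-risk, 681 low-risk), the choice of 5 folds, and the metric names come from the abstract, which does not state whether the folds were stratified.

```python
from typing import List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int = 5) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds so every fold
    keeps roughly the overall class ratio (no shuffling, for clarity)."""
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true positive
    (assumes at least one positive and no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, running = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            running += hits / rank
    return running / hits


def precision_recall_f1(labels: Sequence[int], preds: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (high-risk) class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Class counts from the abstract: 142 high-risk (1) and 681 low-risk (0).
cohort_labels = [1] * 142 + [0] * 681
folds = stratified_folds(cohort_labels, k=5)
fold_positives = [sum(cohort_labels[i] for i in fold) for fold in folds]

# Hypothetical scores for eight subjects, invented only to exercise the metrics.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed 0.5 cut-off

auc = roc_auc(toy_labels, toy_scores)
ap = average_precision(toy_labels, toy_scores)
precision, recall, f1 = precision_recall_f1(toy_labels, toy_preds)

print("fold sizes:", [len(f) for f in folds], "positives per fold:", fold_positives)
print(f"AUC={auc:.3f} AP={ap:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With about 17% of subjects in the high-risk group (142 of 823), a random ranking would give an AP near 0.17 but an AUC near 0.5, which is one reason to report both metrics when classes are imbalanced. In practice the indices would normally be shuffled before fold assignment, and a library implementation would be used rather than these hand-written helpers.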