Predicting anxiety in cancer survivors presenting to primary care – A machine learning approach accounting for physical comorbidity

Abstract Background The purpose of this study was to explore predictors for anxiety as the most common form of psychological distress in cancer survivors while accounting for physical comorbidity. Methods We conducted a secondary data analysis of a large study within the German National Cancer Plan which enrolled primary care cancer survivors diagnosed with colon, prostatic, or breast cancer. We selected candidate predictors based on a systematic MEDLINE search. Using supervised machine learning, we developed a prediction model for anxiety by splitting the data into a 70% training set and a 30% test set and further splitting the training set into 10 folds for cross-validating the hyperparameter tuning step during model selection. We fit six different regression models, selected the model that minimized the root mean square error (RMSE), and fit the selected model to the entire training set. Finally, we evaluated the model performance on the holdout test set. Results In total, data from 496 cancer survivors were analyzed. The LASSO model (α = 1.0) with weakly penalized model complexity (λ = 0.015) slightly outperformed all other models (RMSE = 0.370). Physical symptoms, namely, fatigue/weakness (β = 0.18), insomnia (β = 0.12), and pain (β = 0.04), were the most important predictors, while the degree of physical comorbidity was negligible. Conclusions Prediction of clinically significant anxiety in cancer survivors using readily available predictors is feasible. The findings highlight the need for considering cancer survivors' physical functioning regardless of the degree of comorbidity when assessing their psychological well-being. The generalizability of the model to other populations should be investigated in future external validations.


| INTRODUCTION
For most cancer survivors, coping with cancer and its treatment remains a challenge even years after diagnosis. 1 Coping is often complicated by physical comorbidity. Physical comorbidity is common among cancer survivors, because cancer and comorbidity may share common risk factors, non-malignant chronic conditions may increase the likelihood of cancer diagnoses, and oncological therapies in turn may contribute to chronic conditions. 2 A nationwide U.S. survey indicated that 30%-50% of all cancer survivors suffer from physical comorbidity, and another survey of breast, prostate, colorectal, and gynecological cancer survivors showed that cancer survivors had on average five comorbid medical diseases. 3,4 Another major concern in cancer survivors is psychological distress, as over one-third of all cancer survivors show clinically significant levels of anxiety and/or depression. 5 Previous research indicates a strong association between psychological distress and physical comorbidity in the general population. 6 However, it is unclear whether physical comorbidity affects the mental health of cancer survivors. On the one hand, evidence from prospective studies of the general and aging population has shown that adverse physical symptoms and impaired functional status are predictive of psychological distress. 7-9 Additionally, physical comorbidity may increase the financial burden, which can also result in psychological distress. 2 Further, the presence of physical comorbidity may lead to a perception of loss of control, which itself has been found to lead to psychological distress. 10 On the other hand, it is possible that cancer itself has such a large influence on psychological distress in cancer survivors that physical comorbidity may have virtually no additional impact. 2
In fact, the severity of physical symptoms, which can vary independently of physical comorbidity, may be a better predictor for psychological distress than the mere number of comorbidities. Evidence for the impact of physical comorbidity on anxiety, the most common type of psychological distress in cancer survivors, is scarce since comorbid conditions are often exclusion criteria in both observational and interventional studies. 11 In that regard, we conducted a systematic literature search in MEDLINE, from inception to May 9, 2021, using the search string in Appendix 1. During the screening of 1786 records and relevant cross-references, we identified 14 studies that examined the relation between physical comorbidity and anxiety in cancer survivors. Nine studies focused on specific tumor entities, 1,12-19 while five studies sampled patients with heterogeneous tumor entities. 20-24 These studies generally indicate some form of relation between physical comorbidity and anxiety. 20-22 However, each study usually included only a few predictors, often neglecting others such as actual physical symptoms or performance status. Moreover, patients with cancer were often assessed during acute treatment, frequently in tertiary academic centers, which limits the generalizability of the findings.
The purpose of this study is to explore potential predictors for anxiety in cancer survivors presenting to primary care while accounting for physical comorbidity. Specifically, we apply a machine learning approach using data from a large survey.

| Source of data
The data were obtained from a large prospective, cross-sectional observational study within the German National Cancer Plan, entitled "Comparison of two psychosocial cancer care models for rural areas: the P-O-LAND study." 25 This study was approved by the Ethics Committee of Heidelberg Medical School (Registration No. S-300/2013) and is reported in line with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) Statement. 26

| Participants
In the P-O-LAND study, we identified all physicians who practiced in the two study regions and provided cancer survivorship care from the mandatory registries of the regional Associations of Statutory Health Insurance Physicians. We initially surveyed these physicians (for results see Zimmermann-Schlegel et al. 25 ). From all responding physicians, we then randomly selected physicians who, in turn, identified and recruited eligible cancer survivors. Cancer survivors were reminded to participate up to four times. We included cancer survivors with a definitive diagnosis of colon, prostatic, or breast cancer and excluded those with cognitive impairment, addiction, psychotic episodes, or suicidality.

| Predictors
We based the selection of the candidate predictors on the systematic MEDLINE search described above and in more detail in Appendix 2.

| Physical comorbidity
We classified the comorbidity status for each cancer survivor applying the Charlson Comorbidity Index (CCI). 28 The CCI assigns weights of 1, 2, 3, or 6 (i.e., the relative risk of noncancer-related 1-year mortality rounded to the nearest integer) to each of the 13 included comorbid conditions. 29 Total scores for each patient are derived by summing the weights for each condition. A member of the research team blinded to the goal of this study stratified the cancer survivors into five comorbidity groups for between-group comparisons based on their individual CCI sum scores: no comorbidity, very mild comorbidity (comorbidity not listed in the CCI, e.g., hypertension), mild comorbidity (CCI sum score of 1), moderate comorbidity (CCI sum score of 2), and severe comorbidity (CCI sum score of 3 to 5). Due to the small number of cancer survivors with sum scores of 3 or higher (13 cancer survivors had a CCI sum score of 3, seven had a sum score of 4, and one had a sum score of 5), we collapsed those cancer survivors into one group.
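The grouping rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' code (the study's analyses were run in R); the function names and the `has_unlisted_condition` flag are hypothetical, and the per-condition weights are assumed to be supplied by the caller from the CCI.

```python
def cci_sum_score(weights):
    """Sum the CCI weights (1, 2, 3, or 6) of a survivor's listed conditions."""
    return sum(weights)


def comorbidity_group(cci_sum, has_unlisted_condition=False):
    """Map a CCI sum score to one of the five groups used in the study.

    has_unlisted_condition (hypothetical flag) marks survivors whose only
    comorbidity is one that the CCI does not list, e.g., hypertension.
    """
    if cci_sum < 0:
        raise ValueError("CCI sum score cannot be negative")
    if cci_sum == 0:
        return "very mild" if has_unlisted_condition else "none"
    if cci_sum == 1:
        return "mild"
    if cci_sum == 2:
        return "moderate"
    # Sum scores of 3 or higher were collapsed into a single group.
    return "severe"
```

For example, a survivor with two listed conditions weighted 1 and 2 has a sum score of 3 and falls into the severe group.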

| Physical symptoms: Fatigue/weakness, pain, and insomnia
From the psychosomatic complaint subscale of the German version of the Questionnaire on Distress in Cancer Patients-Short Form (QSC-R10), we assessed the items for physical symptoms fatigue/weakness, pain, and sleep disturbances. In the QSC-R10, participants rate fatigue/weakness, pain, and insomnia on a scale between 0 and 5. Since we intended to model the distinct relation of each physical symptom (fatigue/weakness, pain, and insomnia) with anxiety, we treated each item as a separate variable instead of calculating the QSC-R10 sum score.

| Additional features
We also included age, disease stage (metastatic disease yes/no), years of education, gender, performance status (WHO-ECOG), relationship status, time since diagnosis, treatment modality, and tumor location as additional candidate predictors for anxiety in cancer survivors identified in previous work.

| Data measurement/sources
For data collection, we asked all eligible cancer survivors to complete an anonymous paper-and-pencil self-reported questionnaire set. A member of the research team blinded to the goal of this study extracted information on physical comorbidity and other medical data (metastatic disease, tumor site, treatment modality, time since diagnosis) from the medical records kept in the primary care practices.

| Sample size
In the secondary analysis reported here, we were able to include data of N = 496 cancer survivors. Considering the 13 predictors in our model, this sample size allowed us to detect a minimum R² of approximately 0.06 as statistically significant with a power of 0.80. 30

| Statistical analysis
Our primary objective was to predict anxiety (GAD-2 sum scores) using a combination of all identified predictors in a supervised machine learning approach. 31 To minimize potential overfitting, we employed different methods, for example, regularization and penalization in LASSO regression, out-of-bag estimation in Random Forest models, and cross-validation in all models. We conducted all analyses in R 4.0.3 using the tidymodels ecosystem of packages. 32,33

| Developing the reference model
The reference model included all 13 candidate predictors described above.

| Feature engineering (data preprocessing)
First, we computed diagnostics for missing data. Among all predictors, the highest median fraction of missing information (FMI) was 6.5%; among cases, the highest FMI was 33.3% (see Table 1). Applying a 50% criterion, all 496 cases were amenable to imputation. 30,34 Since we did not find sufficient evidence to reject a missing completely at random (MCAR) process at the 0.05 significance level, we concluded that no systematic missing data process existed for any of the variables. Hence, we used a k-nearest neighbor imputation model to handle missing data for all predictors. Specifically, we selected Gower's distance measure and k = 5 contributing neighbors for each predictor. 35 Second, we converted categorical predictors into binary dummy variables for each level (one-hot encoding). Third, to correct skewness, we applied the Yeo-Johnson transformation to all numeric variables, which were then centered and scaled. Fourth, we removed any near-zero variance predictors. Fifth, we split the data randomly into a single training set and a single testing (hold-out sample) set applying a 70:30 split and the outcome (GAD-2 score) as a stratum. Finally, we further split the training data into 10 folds for cross-validating the hyperparameter tuning step in model selection.
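The stratified 70:30 split and the subsequent partition of the training set into 10 folds can be illustrated as follows. This is a minimal Python sketch using only the standard library, not the authors' R code; the function names, the seed, and the toy outcome vector are assumptions made for illustration. Outcomes are toy values in the 0 to 6 range of the GAD-2 sum score.

```python
import random
from collections import defaultdict


def stratified_split(outcomes, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), sampling within each outcome stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, score in enumerate(outcomes):
        by_stratum[score].append(idx)
    train, test = [], []
    for score in sorted(by_stratum):
        members = by_stratum[score]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def stratified_folds(indices, outcomes, k=10, seed=42):
    """Return k (analysis, assessment) index pairs with balanced outcomes."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    # A stable sort by outcome, then dealing round-robin, spreads each
    # outcome level roughly evenly across the folds.
    ordered = sorted(shuffled, key=lambda idx: outcomes[idx])
    folds = [ordered[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        analysis = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((analysis, folds[i]))
    return pairs


# Toy outcome vector (hypothetical): 496 scores cycling through 0..6.
outcomes = [i % 7 for i in range(496)]
train_idx, test_idx = stratified_split(outcomes)
folds = stratified_folds(train_idx, outcomes, k=10)
```

Each assessment fold is held out exactly once, so every training observation contributes to one out-of-fold performance estimate.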

| Model training, hyperparameter tuning, and within-model comparison
We constructed six machine learning models: Ordinary Least Squares (OLS), Ridge, LASSO, and Elastic Net regression as well as two tree-based algorithms, namely, Random Forest and XGBoost. To tune the parameters for each of these models, we chose hyperparameter values leading to the best predictive performance metric and performed 10-fold cross-validation (v = 10) with target variable stratification over the hyperparameter grid as a resampling method. Hyperparameter tuning supports the identification of the best value for the bias-variance trade-off. In each cross-validation, randomly selected 70% of the data were used to develop the sub-models for each model (analysis set) and the remaining 30% to estimate the performance during comparison of sub-models within a model. Accounting for uniform accuracy across the range of the outcome, we used the root mean squared error (RMSE) as a performance metric (with lower values indicating better accuracy) to determine the optimal hyperparameter configuration and finalize the best sub-model for each model.
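To make the tuning step concrete, the sketch below selects a LASSO penalty by cross-validated RMSE. It is an illustrative Python implementation under stated assumptions, not the authors' R/tidymodels code: the coordinate-descent solver, the simple modulo-based fold assignment (unstratified, for brevity), the toy data, and the penalty grid are all hypothetical.

```python
import math
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n) * RSS + lam * sum(|beta_j|)."""
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    beta = [0.0] * p
    resid = [yi - intercept for yi in y]
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        shift = sum(resid) / n  # unpenalized intercept update
        intercept += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of predictor j with the partial residual.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [resid[i] - delta * X[i][j] for i in range(n)]
                beta[j] = new_beta
    return intercept, beta


def predict(model, X):
    intercept, beta = model
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]


def rmse(observed, predicted):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def cv_rmse(X, y, lam, k=10):
    """Mean RMSE over k folds; fold membership is the row index modulo k."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        held = [i for i in range(len(y)) if i % k == fold]
        model = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
        scores.append(rmse([y[i] for i in held], predict(model, [X[i] for i in held])))
    return sum(scores) / k


# Toy data (hypothetical): two informative and two noise predictors.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(120)]
y = [0.5 * row[0] + 0.3 * row[1] + rng.gauss(0, 0.3) for row in X]

grid = [0.001, 0.015, 0.1, 1.0]  # hypothetical penalty grid
cv_results = {lam: cv_rmse(X, y, lam) for lam in grid}
best_lam = min(cv_results, key=cv_results.get)  # lowest mean RMSE wins
```

A very large penalty shrinks every coefficient to zero, so the model predicts the mean and its RMSE approaches the standard deviation of the outcome; the grid search trades this bias against the variance of a weakly penalized fit.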

| Between-model approach comparison and computation of predictions
We then trained the best sub-model for each model by fitting it to the entire training dataset, in order to compare the performance of the best sub-models across all models. Finally, to attain an independent assessment of each model, we assessed the out-of-sample performance by applying its best sub-model to the testing dataset and computing the performance of the full model on new (unseen) data.
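The hold-out metrics can be computed from observed and predicted scores as sketched below. This is an illustrative Python sketch, not the authors' code. The text does not state which definition of R² was used, so both common variants are shown as an assumption: the traditional coefficient of determination and the squared Pearson correlation between observed and predicted values.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error; lower values indicate better accuracy."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def rsq_traditional(observed, predicted):
    """1 - SSE/SST; penalizes systematic bias in the predictions."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst


def rsq_correlation(observed, predicted):
    """Squared Pearson correlation; insensitive to shifts and rescaling."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov ** 2 / (var_obs * var_pred)


# Toy hold-out scores (hypothetical): predictions are off by exactly one point.
observed = [0, 1, 2, 3]
shifted = [1, 2, 3, 4]
```

In the toy example the two R² variants disagree: a constant one-point bias leaves the correlation-based value at 1 while the traditional value drops, which is why the definition matters when comparing reported results.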

| Sample characteristics
Please see Figure 1 for the study flowchart. The sample for this study comprised 496 participants. Please see Table 1 for the descriptive characteristics. Notably, 90 participants (17.6%) had a GAD-2 score ≥ 3 indicating signs of clinically significant anxiety.

| Model selection and performance estimation
Cross-validation showed that the different model approaches for predicting anxiety (OLS, Ridge, LASSO, Elastic Net regression, Random Forest, and XGBoost) varied only slightly, with performance metrics ranging from RMSE = 0.370 to RMSE = 0.386 and from R² = 0.370 to R² = 0.427 (Table 2). However, the highly parametric LASSO regression with the regularization parameter λ = 0.015 slightly outperformed all other approaches both on the training and the testing dataset (RMSE = 0.370). The LASSO model performance in relation to the regularization parameter λ is depicted in Figure 2.
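For orientation, the role of the two tuning parameters reported for the selected model (α = 1.0, λ = 0.015) can be written out as a penalized least-squares objective. The form below follows the common elastic-net parameterization; that the authors' software used exactly this scaling is an assumption, and the symbols n, p, x_i, y_i, and β_0 are introduced here for illustration.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \bigl( y_i - \beta_0 - x_i^{\top}\beta \bigr)^2
+ \lambda \Bigl[ \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
+ \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \Bigr]
\]
```

With α = 1 the quadratic term vanishes and the penalty reduces to the LASSO (L1) penalty; with α = 0 it reduces to the Ridge (L2) penalty. The penalty strength λ controls how strongly coefficients are shrunk toward zero, and λ = 0 recovers OLS.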

| Importance of predictors
Importance of all 12 predictors in the LASSO regression model is shown in Figure 3. Fatigue/weakness (β = 0.181), insomnia (β = 0.122), and pain (β = 0.041) emerged as the most important predictors for anxiety. There were no notable correlations between physical symptoms (fatigue/weakness, insomnia, and pain) and the degree of comorbidity (all r ≤ 0.08). Notably, age and a moderate degree of comorbidity (CCI group) were predictors for less anxiety although of small magnitude.

| Model calibration
The calibration curve in Figure 4 illustrates the agreement between the observed and the predicted scores for anxiety (GAD-2 sum scores). Notably, higher scores of anxiety were more accurately predicted compared to lower scores.
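One simple way to tabulate the agreement behind such a calibration curve is to sort cases by predicted score, split them into equally sized bins, and compare the mean predicted with the mean observed score in each bin. The Python sketch below illustrates this under stated assumptions; it is not the procedure used to draw Figure 4, and the function name, bin count, and toy scores are hypothetical.

```python
def calibration_table(observed, predicted, n_bins=3):
    """Return (mean predicted, mean observed) per bin of sorted predictions."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) / n_bins
    rows = []
    for b in range(n_bins):
        chunk = pairs[round(b * size):round((b + 1) * size)]
        if not chunk:
            continue  # skip empty bins when there are fewer cases than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        mean_obs = sum(o for _, o in chunk) / len(chunk)
        rows.append((mean_pred, mean_obs))
    return rows


# Toy GAD-2 sum scores (hypothetical values in the 0..6 range).
observed = [0, 0, 1, 2, 4, 6]
predicted = [0.5, 0.7, 1.0, 1.8, 3.5, 5.5]
rows = calibration_table(observed, predicted, n_bins=3)
```

For a well-calibrated model the two means are close in every bin; a bin where the mean prediction exceeds the mean observed score indicates overprediction in that score range.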

| Key results
This study has been among the first to investigate predictors for anxiety in cancer survivors while controlling for physical symptoms and physical comorbidity. The profound advances in the field of oncology allow many cancer survivors to return to a relatively high level of functioning after having completed active cancer treatment. 36 However, one in six cancer survivors surveyed in this study experienced clinically significant anxiety, which underscores the importance of predictive models to tailor supportive care for this population. The performance of our model indicated that predicting clinically significant anxiety in cancer survivors is challenging, although our model did perform reasonably well for higher scores of anxiety. At any rate, we did show that adverse physical symptoms, such as fatigue/weakness and insomnia, seem to be linked to a higher likelihood of experiencing anxiety. The degree of physical comorbidity had no major role in our predictive model. Rather, our findings indicate that the presence of distressing physical symptoms (such as fatigue/weakness) may contribute to anxiety in cancer survivors to a much larger extent than the type or mere number of comorbid medical diseases. Future work may clarify a potentially protective role of moderate comorbidity, older age, and being in a relationship with respect to anxiety in cancer survivors.
FIGURE 1 Study flow chart
To the best of our knowledge, this is one of the first studies applying a machine learning approach to predict anxiety in cancer survivors after the acute treatment phase and explicitly accounting for comorbidity. The performance of our model was moderate. However, an older study using logistic regression for classification based on the Hospital Anxiety and Depression Scale found a comparable amount of variance explained. 37 Considering the absence of established theoretical models on the mechanism of anxiety in cancer survivors, our study adds a model based on a comprehensive review of prior work on potential predictors for anxiety. Specifically, we accounted for both sociodemographic and medical characteristics. In contrast to prior work, we did not draw on self-reports for assessing the medical characteristics, but directly obtained the medical data, including the severity of physical comorbidity, from the health records in the respective primary care practice. 21,22 We consider this an important strength, as patients tend to underreport their medical conditions. 38,39 To evaluate the severity of physical comorbidity, we deviated from the common strategy of counting the mere number of chronic medical conditions and instead applied the highly valid CCI. 40 With respect to the most important predictors, we did not find an association between the degree of physical comorbidity and the presence of anxiety, which somewhat contradicts the findings from two previous studies. 20,22 However, these studies did not model physical symptoms separately, and one may be limited in its cultural generalizability given distinctive aspects of anxiety in Asian populations. 20,41 In any case, it seems plausible that people with a larger number of comorbid diseases may have developed profound coping strategies enabling them to better mitigate the impact of a cancer diagnosis and the related treatment. 22
One prior study reporting associations between unhealthy lifestyles and somatic comorbidity did account for physical symptoms, but focused on elderly cancer survivors and applied the number of chronic medical conditions as a surrogate for the severity of physical comorbidity. 21 In summary, we provide a cross-validated, fully tuned model that is optimized with respect to the bias-variance trade-off and out-of-sample performance. Our findings indicate that the subjective experience of physical symptoms is of greater importance than the objective degree of physical comorbidity when evaluating the risk of anxiety in cancer survivors.
This study has some limitations. First, we analyzed cross-sectional data to identify predictors for anxiety in long-term cancer survivors accounting for physical comorbidity. Given the interplay of fatigue/weakness, insomnia, and pain, in certain cases anxiety may elicit these symptoms. Nevertheless, our results may facilitate setting up prospective cohort studies, which are desirable for disentangling the complex interplay between physical symptoms, physical comorbidity, and anxiety. In fact, we are aware of only one prospective cohort study addressing this issue. The ACTION Study Group followed patients with cancer from Southeast Asia for up to 12 months after their diagnoses and focused on health-related quality of life and psychological distress. 20 The study found that 37% of the participants had at least mild levels of anxiety. This higher rate compared to our sample may be explained by the fact that most of these participants were still in the acute treatment phase. Second, we used the CCI to measure the degree of comorbidity, leading to a small number of cancer survivors in the severe comorbidity group. Indeed, given that the CCI stratifies chronic conditions based on a rather serious criterion (the expected 1-year mortality), using this instrument may have affected the generalizability of our findings. Nevertheless, the CCI is still considered to be the comorbidity measure with the highest validity, and the distribution of physical comorbidity in our study is comparable to population-based studies in cancer survivors. 22,40,42
Third, given that this study included a secondary analysis, we cannot fully rule out left-out variables error, that is, the omission of predictors that covary with measured predictors but were not included in our model as they were not measured in the original study (e.g., functional status). 43 However, we have tried to minimize the number of left-out variables so that serious specification error seems unlikely. Fourth, we limited our sampling frame to cancer survivors with three highly prevalent cancer types (i.e., colon, prostatic, and breast cancer). Thus, our findings may be of limited generalizability to cancer survivors diagnosed with other cancer types (e.g., more aggressive ones such as brain, lung, or pancreatic cancer). Fifth, the cross-cultural generalizability of our findings may also be somewhat limited given the Western perspective we took when analyzing and interpreting the data. For instance, aspects of sexual functioning (including sexual satisfaction) may be a more private matter in Eastern cultures and therefore less reflected in the data. In fact, qualitative studies, preferably using one-on-one interviews, may be needed to further elucidate such aspects that potentially differ across cultures. Finally, validation by resampling in our study cannot replace the need for a genuine external validation of the model at another time, in other geographical regions, and in other health systems. However, resampling did validate the process that produced our model.
TABLE 2 Training and validation performance metrics obtained from 10-fold cross-validation for the six best models trained by six machine learning approaches

FIGURE 4 Calibration curve comparing the observed and the predicted Generalized Anxiety Disorder-2 Scale (GAD-2) sum scores

| CONCLUSIONS
To predict psychological distress in cancer survivors, machine learning-based approaches allow for the consideration of many predictors and more robust validation of the predictive models. In this study, we found that physical symptoms, namely fatigue/weakness and insomnia, were the predictors of highest practical significance for predicting anxiety. Consequently, clinicians should consistently prioritize the first-person perspective on physical functioning when evaluating psychological distress in cancer survivors. Indeed, the patients' subjective experience of their physical and psychological functioning may be the key factor in clarifying patient complexity (overall impact of the different diseases in an individual considering their severity and other health-related attributes) beyond the mere consideration of comorbidity. 44

ACKNOWLEDGMENTS
We would like to thank Max Kuhne, MD for assisting in the data collection and Justus Tönnies, MSc for proofreading the manuscript.

CONFLICT OF INTEREST
None declared.