Prediction of early‐onset colorectal cancer mortality rates in the United States using machine learning

Abstract
Introduction: The current study, focusing on the significant colorectal cancer (CRC) burden in the United States (US), employs machine learning to predict future mortality rates among the young population.
Methods: CDC WONDER data from 1999 to 2022 were analyzed for CRC‐related mortality in patients younger than 56 years. Temporal trends in age‐adjusted mortality rates (AAMRs) were assessed via Joinpoint software. Future mortality rates were forecasted using an optimal Autoregressive Integrated Moving Average (ARIMA) model.
Results: From 1999 to 2022, we observed 150,908 deaths with CRC listed as the underlying cause, predominantly in males, with an upward trend in AAMR. The ARIMA model projects an increase in CRC mortality by 2035, estimating an average annual percent change (AAPC) of 1.3% overall, 1% for females, and 1.5% for males.
Conclusion: Our findings emphasize the need for more robust preventive measures to reduce future CRC mortality among the younger population. These results have significant implications for public health policies, particularly for males under 56, and underscore the importance of early screening and lifestyle modifications.

and genetic predispositions may contribute to this rising trend. The lack of early detection often stems from less frequent screening and misdiagnosis, further exacerbating the issue.
Current research demonstrates that establishing a diagnosis of CRC takes 40% longer in patients under the age of 50, often due to prolonged symptom evaluation and misdiagnosis.2 The latest data show that CRC incidence and mortality rates have decreased overall.3,4 The US Preventive Services Task Force (USPSTF) recently updated its recommendation, lowering the age for screening colonoscopy from 50 to 45 years.5 This change is expected to lead to earlier CRC detection, thereby reducing the incidence and mortality rates.
Predicting future trends in CRC rates is complex due to factors such as demographic changes, risk factor prevalence, and healthcare practices. In this study, we apply machine learning techniques to predict future trends in CRC-related mortality rates in adult patients younger than 56 years of age in the US.

| METHODS
We retrieved deidentified data from the Centers for Disease Control and Prevention Wide-Ranging Online Data for Epidemiologic Research (CDC WONDER) multiple causes of death database (years 1999-2022) for CRC-related mortality, with a focus on the underlying cause of death. The study population consisted of patients with CRC as defined by the International Classification of Diseases, 10th Revision (ICD-10) codes C18, C19, and C20.4 All statistical analyses were performed using Python, utilizing the PyCharm Integrated Development Environment (IDE) and Generative Pre-trained Transformer 4 (GPT-4).6 Python and GPT-4 were chosen for their advanced statistical capabilities, which enhanced the validity of our study by allowing for more complex analyses and data visualization. The age-adjusted mortality rates (AAMRs) per 100,000 population among individuals aged 25-55 were examined. AAMRs were standardized to the 2000 US population. Joinpoint software was used to assess temporal trends as the average annual percent change (AAPC) with 95% confidence intervals (CI) and p-values, representing the change in mortality during a specific period. We chose to focus on the age range of 25-55 years to capture the emerging trend of rising CRC incidence in younger populations. This age range is particularly relevant as it represents a demographic that is increasingly being affected by CRC but is often overlooked in screening programs.
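The two summary measures named above can be illustrated with a minimal sketch. It assumes direct age standardization (a weighted sum of age-specific rates using standard population weights) and the usual Joinpoint definition of the AAPC as a length-weighted average of log-linear segment slopes. The function names, arguments, and any values passed to them are hypothetical and are not taken from the study's code or data.

```python
import math


def age_adjusted_rate(deaths, population, std_population, per=100_000):
    """Directly standardized rate: weighted sum of age-specific rates.

    deaths, population, std_population hold one entry per age group
    (hypothetical inputs; the study used the 2000 US standard population).
    """
    if not (len(deaths) == len(population) == len(std_population)):
        raise ValueError("all inputs must have one entry per age group")
    total_std = sum(std_population)
    rate = 0.0
    for d, p, s in zip(deaths, population, std_population):
        # age-specific rate times that age group's share of the standard population
        rate += (d / p) * (s / total_std)
    return rate * per


def aapc(slopes, segment_lengths):
    """Average annual percent change from log-linear segment slopes.

    slopes are the per-year coefficients of ln(rate) in each joinpoint
    segment; segment_lengths are the segment durations in years.
    """
    weighted = sum(b * w for b, w in zip(slopes, segment_lengths)) / sum(segment_lengths)
    return (math.exp(weighted) - 1.0) * 100.0
```

With a single segment, the AAPC reduces to the annual percent change of that segment, which is why a slope of 0.01 on the log scale corresponds to roughly a 1% annual increase.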
For predictive time series analysis, the autoregressive integrated moving average (ARIMA) model was used for non-stationary data in the context of the long-term trend to forecast mortality rates through 2035, as previously described.7 The ARIMA model was selected for its robustness in handling non-stationary time series data and its widespread use in healthcare forecasting. Compared to other models, ARIMA provides a more nuanced understanding of time-dependent patterns.7,8 An optimal ARIMA model, identified using the auto ARIMA function based on the Bayesian Information Criterion (BIC), was fitted to the data. The model's residuals were evaluated for white noise via the Ljung-Box test.8 The model's robustness was validated using time series cross-validation (n = 10), with Root Mean Squared Error (RMSE) reported for accuracy.9 The study adhered to STROBE reporting standards and did not require informed consent or institutional review board approval as it used anonymized public data, following the Common Rule.
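The validation step can be sketched as a rolling-origin (expanding-window) cross-validation that pools one-step-ahead errors into an RMSE. This is an illustration under stated assumptions, not the authors' pipeline: it assumes that "n = 10" refers to the number of splits, and it substitutes a random walk with drift, which corresponds to ARIMA(0,1,0) with a constant, for the model selected by auto ARIMA. All names are hypothetical.

```python
import math


def drift_forecast(history, horizon=1):
    """Random walk with drift: extend the last value by the mean past step.

    A stand-in forecaster (ARIMA(0,1,0) with a constant), not the model
    selected in the study.
    """
    if len(history) < 2:
        raise ValueError("need at least two observations")
    drift = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + drift * h for h in range(1, horizon + 1)]


def rolling_origin_rmse(series, n_splits=10, forecaster=drift_forecast):
    """Expanding-window cross-validation with one-step-ahead forecasts.

    For each of the last n_splits observations, the forecaster sees only
    the earlier observations; squared errors are pooled into one RMSE.
    """
    if n_splits < 1 or len(series) - n_splits < 2:
        raise ValueError("series too short for the requested number of splits")
    squared_errors = []
    for cutoff in range(len(series) - n_splits, len(series)):
        train, actual = series[:cutoff], series[cutoff]
        predicted = forecaster(train, horizon=1)[0]
        squared_errors.append((actual - predicted) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because every forecast uses only observations that precede the target year, the resulting RMSE reflects out-of-sample error rather than in-sample fit. Any other forecaster with the same call signature could be passed in place of `drift_forecast`.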

| DISCUSSION
This study highlights an upward trend in CRC mortality rates between 1999 and 2022, with our model predicting a continued increase through 2035. The rise in AAMR is more pronounced in males, suggesting the necessity of gender-focused preventive and therapeutic strategies. The increasing trend through 2035 was similar when only the 2015-2022 data were used, albeit with a higher error. These findings suggest that current CRC screening outreach must be widened to achieve a downward trend in CRC-related mortality through 2035. Compared to existing literature that primarily focuses on the epidemiology of CRC, our study adds a new dimension by employing machine learning techniques for predictive modeling.10 This is a significant strength of our study as it provides actionable insights for public health policies.
An important aspect to consider in the context of early-onset colorectal cancer is the role of rectal cancer. The literature has indicated that rectal cancer is a significant contributor to early-onset CRC, and its inclusion in the analysis is crucial for a comprehensive understanding of the disease.11 While our study looks at colorectal cancer as a whole, future studies are needed to predict colon and rectal cancer mortality separately to provide a more nuanced understanding of early-onset disease.
While the literature has focused on the epidemiology and biology of early-onset CRC,12 to the best of our knowledge there are limited studies specifically aimed at predicting early-onset CRC. This makes our study unique and adds strength to our findings. However, it is important to note that risk factors for early-onset CRC have been well-studied, including increasing age, higher BMI or obesity, and family history of cancer.13 Our study aligns with these findings, extends them by providing predictive models for future trends, and underscores the urgent need for interventions targeting the modifiable risk factors, as highlighted in existing literature.
The acceleration in mortality rates projected by our models could be associated with increased CRC risk factors such as a sedentary lifestyle, obesity, and poor diet. This highlights the importance of robust preventive measures, including regular screening, lifestyle modifications, and effective treatment strategies. In addition to increasing awareness of screening modalities, achieving compliance needs to be addressed.
The observed gender disparities in CRC mortality rates may be influenced by a variety of factors. Men may be exposed to different risk factors, such as occupational hazards or lifestyle choices, compared to women. Additionally, access to healthcare and biological differences could play a role in these disparities. Our findings align with previous studies that have also reported gender disparities in CRC mortality rates, but we extend this by providing future projections that can guide gender-specific interventions.14 It is worth noting that future changes in CRC prevention and treatment, such as advancements in targeted therapies or changes in screening guidelines, could impact the accuracy of our predictions. These potential shifts in medical practice could either mitigate or exacerbate the trends we have identified.
One of the major strengths of this study lies in its innovative approach to predicting future trends in CRC mortality rates using machine learning techniques, specifically the ARIMA model. This adds a new layer of depth to the existing body of research, which has primarily focused on epidemiological and biological aspects of CRC. Our study aligns with the well-studied risk factors for early-onset CRC and extends the current understanding by providing actionable predictive models. These models can be a valuable tool for healthcare policymakers and clinicians in planning targeted interventions. Additionally, our study is one of the few that specifically aims at predicting early-onset CRC, filling a significant gap in the literature. The gender-specific findings further add granularity to the study, enabling the development of gender-focused preventive and therapeutic strategies.
This study has certain limitations. Being retrospective, it does not allow for causal inference. The database does not allow for adjustments based on socioeconomic status, race, and comorbidities, which could be potential confounding factors. The predictive models assume that future trends will mirror historical patterns, which may not account for potential changes in CRC screening, prevention, and treatment. While the CDC WONDER database provides a comprehensive overview of mortality rates, it lacks granularity in demographic and clinical information. Additionally, there is a concern about misclassification of the cause of death.
Nevertheless, our projection based on 2015-2022 data demonstrated a similar trend in AAMR, albeit with a higher statistical error (Appendix S1). While the Ljung-Box test suggests a good fit of the model, some prediction error (RMSE) is present, which is inherent to ARIMA models and indicates the potential imprecision of the point forecasts. However, the overall trend would most likely remain the same, highlighting the importance of our study.
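For readers unfamiliar with the residual check mentioned above, the Ljung-Box statistic can be sketched as follows. This is a minimal illustration of the standard formula Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1..h, where r_k is the lag-k sample autocorrelation of the residuals. The function name and the choice of lags are assumptions and do not come from the study.

```python
def ljung_box_q(residuals, lags=5):
    """Ljung-Box Q statistic for the first `lags` residual autocorrelations.

    Large values of Q indicate remaining autocorrelation, i.e. residuals
    that do not look like white noise.
    """
    n = len(residuals)
    if lags < 1 or lags >= n:
        raise ValueError("lags must be between 1 and len(residuals) - 1")
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("residuals have zero variance")
    q = 0.0
    for k in range(1, lags + 1):
        # lag-k sample autocorrelation of the residuals
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q
```

Under the null hypothesis of white noise, Q approximately follows a chi-square distribution whose degrees of freedom equal the number of lags minus the number of fitted ARMA parameters. A value of Q below the corresponding critical value (about 11.07 at the 5% level for 5 degrees of freedom) is consistent with an adequate fit.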
In conclusion, the increasing trend in CRC mortality highlights the necessity for more focused efforts in prevention, treatment, and compliance with guideline-directed screening modalities. The gender disparities identified in our study warrant further investigation to understand their underlying causes and develop effective mitigation strategies. Finally, the prediction data suggest that an immediate call to action is needed to improve current efforts in CRC screening to shift the overall trend of CRC mortality downward by 2035. The novelty and applicability of our results lie in their potential to guide future public health interventions and policy decisions, thereby contributing significantly to the existing body of knowledge.

FIGURE 1 Actual and projected rates of colorectal cancer-related mortality up to 2035. The shaded areas represent the 95% confidence intervals.
FIGURE 2 Actual and projected rates of colorectal cancer-related mortality up to 2035, stratified by gender. The shaded areas represent the 95% confidence intervals.