Comparison of statistical models to predict age‐standardized cancer incidence in Switzerland

This study compares the performance of statistical methods for predicting age‐standardized cancer incidence, including Poisson generalized linear models, age‐period‐cohort (APC) and Bayesian age‐period‐cohort (BAPC) models, autoregressive integrated moving average (ARIMA) time series, and simple linear models. The methods are evaluated via leave‐future‐out cross‐validation, and performance is assessed using the normalized root mean square error, interval score, and coverage of prediction intervals. Methods were applied to cancer incidence from the three Swiss cancer registries of Geneva, Neuchatel, and Vaud combined, considering the five most frequent cancer sites: breast, colorectal, lung, prostate, and skin melanoma, and grouping all other sites together in a final group. Best overall performance was achieved by ARIMA models, followed by linear regression models. Prediction methods based on model selection using the Akaike information criterion resulted in overfitting. The widely used APC and BAPC models were found to be suboptimal for prediction, particularly in the case of a trend reversal in incidence, as was observed for prostate cancer. In general, we do not recommend predicting cancer incidence for periods far into the future but rather updating predictions regularly.

Comparisons among prediction models have been carried out in the literature. For example, Møller (2003) compared several variants of APC models for predicting the incidence of the most frequent cancer sites and concluded that APC models using a power link function were superior to those adopting the canonical logarithmic link. To predict the incidence of lung, bronchus, and trachea cancers, Riebler (2012) compared BAPC with the Lee-Carter model, a model used in the demographic field to predict mortality (Lee & Carter, 1992), showing the superiority of BAPC. ARIMA has been compared to neural networks, showing similar performance (Zheng et al., 2020). Using USA state-level data to predict the incidence of the most frequent cancer sites, Chen (2012) compared the performance of APC, BAPC, joinpoint regression, state space models, and the vector autoregressive Hilbert-Huang transform and concluded that BAPC was superior, although followed closely by APC and state space models. In another study (Clements, 2005), generalized additive models (GAM, a special case of GLM) showed better performance than BAPC in predicting the incidence of lung cancer.
These comparisons were in general limited to a few models only (e.g., BAPC vs. Lee-Carter, Riebler et al., 2012; or BAPC vs. GAM, Clements, 2005) or to different variants of the same class of models, for example, APC (Møller et al., 2003), with the notable exception of the U.S. study (H. S. Chen et al., 2012), which, however, omitted important methods such as GLM or ARIMA. In addition, most of these comparisons have been made in situations where the incidence was almost constantly increasing, providing a relatively easy setting for prediction. In recent years in Switzerland, we observed some trend reversals, as the incidence of some cancers (e.g., breast, prostate, and skin melanoma) has begun to stabilize or decrease (OFS, 2020). This phenomenon represents an additional difficulty in predicting incidence trends, and it is of interest to compare the different methods in this context.
In the present paper, we sought to compare a large number of methods for predicting the incidence of the most common cancers, including in situations with a trend reversal, based on cancer incidence data from Switzerland. The methods compared include Poisson GLM, of which the Lee-Carter model and joinpoint regression are special cases, APC and BAPC models, ARIMA, and simple linear regression models. We evaluated prediction performance by repeatedly applying leave-future-out cross-validation, that is, dividing the data into a training set (on which the models are fitted) and a test set (on which the predictions are evaluated) and considering all possible period partitions of the data into these two sets. This allowed us to produce many prediction settings (scenarios), including some with a trend reversal at the boundary of the training set.
This paper is organized as follows. The data used for our comparisons are presented in Section 2. Section 3 explains how we evaluated the performance of a prediction method by leave-future-out cross-validation using different criteria. All compared methods are described in Section 4. Our results are detailed in Section 5, and a discussion is proposed in Section 6.

Data sources
Our primary data sources were the population-based Swiss cancer registries of the cantons of Vaud, Geneva, and Neuchatel. Registries record information on all incident cases of malignant neoplasms occurring in their resident population according to international rules, such as the International Classification of Diseases for Oncology (Fritz et al., 2000). For each incident case, information on the incidence date, the patient's date of birth, and the cancer site is recorded, among others. We used combined data from these three registries because they are the oldest in Switzerland. Incidence records were available from 1982 to 2016, the latter year being the most recent with complete data. We considered the five most frequent cancer sites: breast, colorectal, lung, prostate, and skin melanoma, and formed a final group including all other sites. Considering the two sexes separately, we thus dealt with five sites for women (breast, colorectal, lung, skin melanoma, and others) and five sites for men (prostate, colorectal, lung, skin melanoma, and others).
Our second source of information was the Swiss Federal Statistical Office (FSO), which produces population figures for each Swiss canton, informing on how many people are alive at the beginning of a calendar year by age and sex.

Age, period, and cohort tabulation
Data were aggregated on the three dimensions of age, period, and cohort, using the well-known triangle representation of the Lexis diagram (Figure 1). For each triangle, data include the number of incident cases y_{a,p,c} for the corresponding combination of age a (a = 0, …, 98, 99+), period p (p = 1982, …, 2016), and birth cohort c (c = 1882, …, 2016), and the number of person-years n_{a,p,c} for the same combination in the general population, the latter being calculated from population data using the classical method presented in Carstensen (2007). APC-specific incidence rates are then defined by λ_{a,p,c} = y_{a,p,c} / n_{a,p,c}. Each ratio λ_{a,p,c} in the Lexis diagram acts as an "observation" in most models considered (see Section 4). Therefore, for each sex and cancer site, we worked with 7000 observations obtained by combining the following factors: 35 periods (p = 1982, …, 2016), 100 ages (a = 0, …, 98, 99+), and 2 triangles (upper/lower) separating people into two different birth cohorts (e.g., a person reaching the age of 67 in 2012 may have been born in either 1945 or 1944; see Figure 1). Some of the models below consider only the age and period tabulation, obtained by summing cases and person-years over the two triangles of each (age, period) square: y_{a,p} = Σ_c y_{a,p,c} and n_{a,p} = Σ_c n_{a,p,c}, with λ_{a,p} = y_{a,p} / n_{a,p}.

Standardized incidence rates
To compare incidence rates over periods using a single quantity, age-specific incidence rates should be combined taking into account changes in the population structure over time. This can be achieved by weighting the age-specific incidence rates λ_{a,p} of a period p by the age distribution w*_a (a = 0, …, 99+) of a reference population, resulting in so-called standardized incidence rates: λ*_p = Σ_a w*_a λ_{a,p} (3), where the weights w*_a sum to 1. It has been shown (Spiegelman & Marks, 1966) that when comparing incidence rates over time, the choice of the reference population has little impact on the results. The same is true when making predictions. In this study, we chose the population of the year 2000 of the aggregated cantons of Vaud, Geneva, and Neuchatel as the reference for standardization. A summary of the notations used for quantities related to the Lexis diagram is provided in Table 1.
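As a minimal sketch of this direct standardization (in Python rather than the R used in the study, and with invented illustrative numbers rather than registry data):

```python
import numpy as np

def standardized_rate(age_rates, ref_pop):
    """Direct age standardization: weight age-specific rates
    by the reference population's age distribution."""
    weights = np.asarray(ref_pop) / np.sum(ref_pop)  # w*_a, summing to 1
    return float(np.sum(weights * np.asarray(age_rates)))

# Illustrative age-specific rates (cases per 100,000) for three age bands
rates = [10.0, 50.0, 200.0]
ref_pop = [30_000, 50_000, 20_000]  # hypothetical reference population counts

print(standardized_rate(rates, ref_pop))  # weighted average of the rates
```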
The standardized incidence rates λ*_p (p = 1982, …, 2016) can be seen in Figure 2 for each sex and cancer site. While some rates show a monotonic trend, for example, lung cancer, which continues to increase for women and decrease for men at the present time, other rates show a recent stabilization (breast cancer, skin melanoma) or a trend reversal, in particular prostate cancer, whose incidence began to decrease in the 2000s after decades of increase.
While the majority of the models considered in this paper use data tabulated by age, period, and/or cohort, predicting specific incidence rates λ_{a,p}, λ_{a,c}, or λ_{a,p,c}, we followed the literature (H. S. Chen et al., 2012; Clements, 2005; Møller et al., 2003) and compared their prediction performance using the standardized incidence rates λ*_p (3). This is the most widely used choice when predicting the cancer burden (Møller et al., 2002; Rapiti et al., 2014) and allows the comparison of the predictions with those of ARIMA and simple linear regression models, which directly model standardized incidence rates (see Section 4).

Leave-future-out cross-validation
Models were compared by repeatedly applying the principle of leave-future-out cross-validation (Bürkner et al., 2020). The procedure can be summarized as follows. First, for a given sex/cancer site combination, choose a cutoff time t in the range t = 2001, …, 2015. Given the cutoff, leave-future-out cross-validation consists in (a) fitting a model on incidence rates from p0 = 1982 until year t (training set), (b) predicting incidence rates for the second part of the data (test set), that is, from year (t + 1) until the last available year T = 2016, and (c) evaluating the prediction performance from (t + 1) to T according to the criteria below. Second, repeat the procedure moving t in the range t = 2001, …, 2015 and for each sex/cancer site combination. The range for the cutoff t is chosen to ensure having at least 20 observations in the training set and at least one observation in the test set.
Because we had five cancer sites for each sex and 15 possible cutoffs for each sex/cancer site combination (t = 2001, …, 2015), we obtained 150 different scenarios (see Figure 3). All models presented in Section 4 will be fitted on each of these 150 scenarios.
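The splitting scheme can be sketched as follows (a Python illustration of the cutoff logic described above, not the study's R code; the years come from the text, the data themselves are not needed to show the partitioning):

```python
# Sketch of the leave-future-out splitting scheme: one split per cutoff t.
years = list(range(1982, 2017))            # observed periods 1982..2016

splits = []
for t in range(2001, 2016):                # cutoffs t = 2001..2015
    train = [y for y in years if y <= t]   # fit the model here
    test = [y for y in years if y > t]     # evaluate predictions here
    splits.append((train, test))

# Every split respects the constraints stated in the text
assert all(len(train) >= 20 and len(test) >= 1 for train, test in splits)
print(len(splits))  # 15 cutoffs per sex/cancer site combination
```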

Comparison criteria
Our main criterion of prediction performance was the normalized root mean squared error (NRMSE) (Hyndman & Koehler, 2006), a measure of point prediction accuracy. For a given cutoff time t (t = 2001, …, 2015) and for each sex/cancer site combination, this criterion takes the following form in each scenario:

NRMSE_t = sqrt( (1/(T − t)) Σ_{p=t+1}^{T} (λ*_p − λ̂*_p)² ) / ( (1/(T − t)) Σ_{p=t+1}^{T} λ*_p ). (4)

Here λ*_p is the observed (i.e., actual) standardized incidence rate for period p in the test set and for the considered sex/cancer site combination, and λ̂*_p the corresponding predicted standardized incidence rate. The denominator of (4) represents the mean standardized incidence rate in the test set, normalizing the statistic. The prediction performance (prediction accuracy) of each model was evaluated by the mean NRMSE (M-NRMSE) across the 150 scenarios. A smaller M-NRMSE value indicates better accuracy, with a minimum of 0. We also looked at three time horizons for prediction, assessing separately the short-term M-NRMSE (1-5 years, t + 1 ≤ p ≤ t + 5), medium-term M-NRMSE (6-10 years, t + 6 ≤ p ≤ t + 10), and long-term M-NRMSE (11-15 years, p ≥ t + 11). This represented 150, 100, and 50 scenarios, respectively.
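A small sketch of this criterion (Python; the rates are made-up numbers for illustration, not registry values):

```python
import numpy as np

def nrmse(actual, predicted):
    """RMSE of the predictions over the test set, normalized by the
    mean observed standardized incidence rate in the test set."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return float(rmse / np.mean(actual))

# Illustrative standardized rates per 100,000 over a 3-year test set
obs = [100.0, 102.0, 104.0]
pred = [101.0, 101.0, 101.0]
print(nrmse(obs, pred))  # small value: predictions are close to the mean rate
```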
In a first sensitivity analysis, we considered an alternative to (4), the normalized mean absolute error (NMAE), defined as follows for a given cutoff time t = 2001, …, 2015, and for each sex/cancer site combination (Hyndman & Koehler, 2006):

NMAE_t = ( (1/(T − t)) Σ_{p=t+1}^{T} |λ*_p − λ̂*_p| ) / ( (1/(T − t)) Σ_{p=t+1}^{T} λ*_p ),

and considered the mean NMAE (M-NMAE) across the 150 scenarios. In a second sensitivity analysis, we considered the median (instead of the mean) of the NRMSE and the NMAE across scenarios (Med-NRMSE and Med-NMAE).
As an additional comparison criterion, we evaluated the quality of the 95% prediction intervals provided by the different methods to inform on the prediction uncertainty. For this, we used the coverage rate (CR) and the interval score (IS) (Gneiting et al., 2007). The CR measures the empirical coverage of a prediction interval, that is, the probability that it contains the actual standardized incidence rate, and is defined as follows for a given cutoff time t = 2001, …, 2015, and for each sex/cancer site combination:

CR_t = (1/(T − t)) Σ_{p=t+1}^{T} I(l_p ≤ λ*_p ≤ u_p),

where l_p and u_p are the lower and upper bounds of a 95% prediction interval calculated for period p (p = t + 1, …, T) and I is an indicator function. The IS is a strictly proper scoring rule, which is related to the width of a prediction interval, with a penalty in cases where the interval does not contain the actual standardized incidence rate, and is defined as follows for a given cutoff time t = 2001, …, 2015, and for each sex/cancer site combination (Gneiting et al., 2007):

IS_t = (1/(T − t)) Σ_{p=t+1}^{T} [ (u_p − l_p) + (2/α)(l_p − λ*_p) I(λ*_p < l_p) + (2/α)(λ*_p − u_p) I(λ*_p > u_p) ], with α = 0.05.

The CR of a prediction interval should be as close to 95% as possible, while the IS should be as small as possible, with a minimum of 0. As for our primary criteria, CR and IS were averaged across the 150 scenarios, obtaining a mean CR and a mean IS (M-CR and M-IS).
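Both interval criteria can be sketched as follows (Python, α = 0.05 for 95% intervals; the observations and bounds are invented for illustration):

```python
import numpy as np

def coverage_rate(actual, lower, upper):
    """Fraction of test-set periods whose observed rate falls
    inside the prediction interval."""
    a, l, u = map(np.asarray, (actual, lower, upper))
    return float(np.mean((l <= a) & (a <= u)))

def interval_score(actual, lower, upper, alpha=0.05):
    """Mean interval score of central (1 - alpha) prediction intervals:
    interval width plus penalties for observations outside the interval."""
    a, l, u = map(np.asarray, (actual, lower, upper))
    score = (u - l) \
        + (2.0 / alpha) * (l - a) * (a < l) \
        + (2.0 / alpha) * (a - u) * (a > u)
    return float(np.mean(score))

# Illustrative values: the second observation falls outside its interval
obs = [100.0, 110.0]
lo, hi = [95.0, 95.0], [105.0, 105.0]
print(coverage_rate(obs, lo, hi))   # one of two observations covered
print(interval_score(obs, lo, hi))  # width 10 plus a penalty for the miss
```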
Finally, we reported the percentage of scenarios for which the fitting algorithms used by the different methods converged (i.e., they were able to produce predictions), which was not always 100%. Note that all the above criteria were calculated excluding the few cases where the computation did not converge.
All models below will be evaluated and compared according to all criteria (M-NRMSE, Med-NRMSE, M-NMAE, Med-NMAE, M-CR, M-IS, and convergence rate). All detailed results are given in the Appendix of the Supporting Information.

Generalized linear models
We considered the Poisson GLM as presented in Chapter 2.4 of A. C. Cameron and Trivedi (2013a). The number of cases y_{a,p} (or y_{a,c}) in a certain area of the Lexis diagram (see Figure 1) is assumed to follow a Poisson distribution whose mean λ_{a,p} · n_{a,p} (or λ_{a,c} · n_{a,c}) depends (usually via a log link) on the age and period (or age and cohort). The person-years n_{a,p} (or n_{a,c}) is an offset of the model. Due to the age = period − cohort relationship, the three variables cannot be introduced simultaneously in a model, so we considered data either in squares of the Lexis diagram (age and period) or in parallelograms (age and cohort, see Figure 1); that is, only two effects were considered at a time, with a possible interaction between the two:

log(λ_{a,x}) = s_k(a) + g(x) + f(a) · h(x),

with x ∈ {p, c}; f, g, h ∈ {∅, I, s_k}; k = 1, …, 4. The age a was introduced into the model with natural splines s_k, that is, cubic splines based on k knots, linear outside the extreme knots (Ruppert et al., 2003). The period (or the cohort) was either absent (g = ∅), introduced linearly, that is, via an identity function (g = I), or introduced by splines (g = s_k). Both age and period (or age and cohort) were introduced either linearly or by splines in the interaction term, with the constraint that the degrees of freedom for the period (or cohort) effect in the interaction term must be less than or equal to those in the main effect. So, if the period (or the cohort) was introduced linearly as a main effect (g = I), it was not introduced via splines in the interaction (h ≠ s_k), and if the period (or the cohort) was not introduced as a main effect (g = ∅), it did not appear in an interaction term (h = ∅). For a given number of knots k, this defined five models without interaction and 12 models with interaction. Varying the number of knots between k = 1 and k = 4 (beyond which we encountered problems of lack of convergence of the fitting algorithm), we obtained 4 × (5 + 12) = 68 GLM. The Lee-Carter model (1992), a well-established model used in demography to predict mortality rates, is one of those 68 models, where the age and the period are introduced with splines, with an interaction between the two. Finally, we considered a GLM obtained via a model selection strategy that consists in selecting among the 68 GLM the one that best fits the data in the training set according to the Akaike information criterion (AIC), so that a possibly different GLM is selected in each scenario. This corresponds to a 69th GLM that we call GLM AIC.
The well-known joinpoint regression (Lerman, 1980) was included as a 70th GLM. This is an age-period model with the period introduced via linear splines, whose knots represent times where a change of trend occurs. Unlike classical splines, where the knots are chosen a priori, knots in joinpoint regression are determined from the data according to the AIC. We only considered a single-knot joinpoint model.
Predictions of incidence rates were obtained by extrapolating period, respectively cohort, effects. For age-cohort models, cohort effects must be extrapolated for future periods, with the particularity of requiring the prediction of the effect of some new cohorts and the removal of the effect of old cohorts, as each year some cohorts exit and new cohorts enter. Predictions of λ_{a,p} (or λ_{a,c}), denoted λ̂_{a,p} (or λ̂_{a,c}), were then aggregated and standardized according to (3) to obtain predicted standardized incidence rates λ̂*_p. The variance of λ̂_{a,p} (or λ̂_{a,c}) was estimated by the delta method as in Chapter 3 of A. C. Cameron and Trivedi (2013b), and prediction intervals were obtained based on this variance following the implementation provided in the R package trending (Schumacher & Jombart, 2021).

Age-period-cohort models
APC models, introduced by Holford (1983), are Poisson GLM allowing the simultaneous inclusion of age, period, and cohort effects. In order to make the model identifiable despite the linear relationship between the three variables, some constraints are added (Carstensen, 2007). Whereas age, period, and cohort were originally considered as factors, forcing a coarse tabulation into 5-year groups (Clayton & Schifflers, 1987; Holford, 1983; Møller et al., 2002), more recent developments in APC, which we followed in our study, adopted splines to model the three effects (Carstensen, 2007; Carstensen et al., 2022). With the canonical log link of Poisson GLM, a spline APC model takes the following form:

log(λ_{a,p,c}) = f_A(a) + drift · p + f_P(p) + f_C(c). (9)

Here the drift is a linear trend effect, ascribed to both period and cohort, and f_P(p) (respectively, f_C(c)) are residual nonlinear effects specific to the period (respectively, the cohort). For these splines, we considered k = 3, 4, or 5 knots. In addition to the canonical log link, other link functions (Dunn & Smyth, 2018) have been compared in the literature for their prediction performance, where the 1/5 power link performed best (Møller et al., 2003). For this reason, we have adopted the 1/5 power link in our comparisons as an alternative to the log link (9).
Two options have been adopted in the literature to extrapolate the incidence rates λ_{a,p,c} for future periods and cohorts. The most common one consists in extrapolating only the drift (Yu et al., 2019); the alternative consists in adding a linear extrapolation of the nonlinear effects (Rutherford et al., 2012). Both options are compared in our study. In what follows, the two extrapolation strategies are referred to as drift only and all effects, respectively. Varying the number of knots between k = 3, 4, or 5, and considering two link functions and two extrapolation strategies, we obtained 3 × 2 × 2 = 12 APC models.
As with GLM, the predictions λ̂_{a,p,c} were finally aggregated and standardized as in (3) to produce predicted standardized incidence rates λ̂*_p and prediction intervals.

Bayesian age-period-cohort models
A BAPC model (Riebler et al., 2012; Riebler & Held, 2017; Schmid & Held, 2004) has the same formulation as an APC model, the difference being how age, period, and cohort effects are included in the model and how they are fitted. Instead of using splines, Bayesian models are based on a first- or second-order random walk (RW1 or RW2) specification for each effect of the model. Considering, for instance, the period effect (the same holds for age and cohort effects), a Bayesian first-order random walk takes the following form:

β_p = β_{p−1} + ε_p, ε_p ∼ N(0, σ²₁),

that is, the difference of the last two period effects, Δβ_p = β_p − β_{p−1}, is normally distributed. With the RW1 model, each (age, period, and cohort) effect is taken as being equal to the previous one plus a random number from a normal distribution centered on zero, with variance σ²₁. The latter acts as a smoothing parameter: with an infinite variance, the fitted model will pass through all the data points in the training set, while a small prior variance will give rise to a smoother fit. We considered different choices for the hyper-prior distribution of the parameter σ²₁. Our first option was the most commonly adopted distribution, that is, a flat (noninformative) gamma(1, 5e−5) for the three effects of age, period, and cohort (Riebler & Held, 2017). As alternative options, we considered a tighter gamma(1, 5e−3), a wider gamma(1, 5e−7), and PC(u = 1, α = 0.01) (Lindgren & Rue, 2015) priors for the three effects, as well as an option with a gamma(1, 9e−4) for the age effect and gamma(1, 2.5e−4) for the period and cohort effects, as in Riebler and Held (2017). Considering that we have 7000 observations in the Lexis diagram, the hyper-prior has a limited effect. When making predictions from a Bayesian RW1 model, the last fitted period and cohort effects are extrapolated as constant.
Considering again the period effect as an example, a Bayesian second-order random walk takes the form:

β_p = 2β_{p−1} − β_{p−2} + ε_p, ε_p ∼ N(0, σ²₂),

where Δ²β_p = β_p − 2β_{p−1} + β_{p−2} is the second difference (difference of the difference) involving the last three period effects. In an RW2 model, each age, period, and cohort effect is based on the two previous effects, adding a random number from a normal distribution centered on zero, with variance σ²₂. The latter acts again as a smoothing parameter: the smaller the prior variance, the smoother the fit. The same five priors adopted for σ²₁ were adopted for σ²₂. In a Bayesian RW2 model, the future period and cohort effects are predicted by a linear trend based on the last two fitted effects. Compared to an RW1 model, an RW2 model will generally be smoother.
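The contrasting predictive behaviors of the two random walks can be sketched as follows (a Python illustration of the predictive means only, not of the INLA fitting; the effect values are invented):

```python
def rw_extrapolate(effects, horizon, order):
    """Predictive mean of future effects under a random walk:
    RW1 carries the last effect forward as a constant; RW2 continues
    the linear trend implied by the last two effects."""
    effects = list(effects)
    out = []
    for _ in range(horizon):
        if order == 1:
            nxt = effects[-1]                    # beta_p = beta_{p-1}
        else:
            nxt = 2 * effects[-1] - effects[-2]  # beta_p = 2*beta_{p-1} - beta_{p-2}
        effects.append(nxt)
        out.append(nxt)
    return out

beta = [1.0, 2.0, 4.0]  # illustrative fitted period effects
print(rw_extrapolate(beta, 3, order=1))  # [4.0, 4.0, 4.0]  (constant)
print(rw_extrapolate(beta, 3, order=2))  # [6.0, 8.0, 10.0] (linear trend)
```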
Since we considered five priors for each of the RW1 and RW2 models, we included a total of 10 BAPC models in our comparison. As with GLM and APC models, the predicted rates λ̂_{a,p,c} from a BAPC model must be combined to obtain predictions of standardized incidence rates λ̂*_p. Bayesian RW1 and RW2 models were fitted using the INLA (Lindgren & Rue, 2015) and BAPC (Riebler & Held, 2017) packages of the R software (R Core Team, 2022) to get point predictions and prediction intervals, respectively. The INLA package is based on the works of Rue et al. (2009), Martins et al. (2013), and Lindgren et al. (2008, 2011, 2015).

ARIMA time series models
ARIMA time series models are econometric models defined by combining a differenced autoregressive model with a moving average model (Hamilton, 1994). Let Δᵈλ*_p be the standardized rates λ*_p differenced d times. The ARIMA(h, d, q) model is expressed as:

Δᵈλ*_p = θ₀ + φ₁ Δᵈλ*_{p−1} + ⋯ + φ_h Δᵈλ*_{p−h} + ε_p + ψ₁ ε_{p−1} + ⋯ + ψ_q ε_{p−q},

where ε_p are normally distributed residuals, φ₁, …, φ_h are the coefficients of the autoregressive (AR) part of the model, ψ₁, …, ψ_q are the coefficients of the moving average (MA) part, and θ₀ is a constant. In an ARIMA(h, d, q), the predictors are the h lagged data points for the autoregressive part and the q lagged residuals for the moving average part, all differenced d times. We considered all orders from (0,0,0) to (3,3,3) for (h, d, q), which represents 4³ = 64 different models. As for GLM, we finally considered an ARIMA obtained via a model selection strategy that consists in selecting among the 64 ARIMA models the one that best fits the data in the training set according to the AIC, so that a possibly different ARIMA is selected in each scenario. This corresponds to a 65th ARIMA method that we call ARIMA AIC. Unlike the models presented in the previous sections, the predictions obtained with an ARIMA model are based directly on an extrapolation of the standardized rates (without the need for any aggregation). We used the ARIMA function in R (R Core Team, 2022) to fit the models, predict the incidence rates, and compute the prediction intervals.

Linear models
In order to get another standard comparator for the above methods, we finally considered a simple linear model (LM) fitted on the last r data points of the training set:

λ*_p = α + β p + ε_p, p = (t − r + 1), …, t,

where ε_p are normally distributed residuals and α and β are the intercept and slope of the model, respectively. Letting r vary between 3 and 10 data points, we considered eight LM. In these models, predictions of standardized incidence rates are made by extrapolating the fitted trend, and prediction intervals are obtained in the classical framework of regression models. In summary, considering the five classes of models described in this section, we compared 165 different models: 70 GLM, 12 APC, 10 BAPC, 65 ARIMA, and 8 LM. An exhaustive list of all the models compared can be found in Figures A1-A4 of the Supporting Information Appendix.
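This simplest comparator can be sketched in a few lines (Python illustration with an invented linear decline, not registry data):

```python
import numpy as np

def lm_forecast(years, rates, r, horizon):
    """Fit a straight line to the last r standardized rates and
    extrapolate it for the requested number of future years."""
    x = np.asarray(years[-r:], dtype=float)
    y = np.asarray(rates[-r:], dtype=float)
    slope, intercept = np.polyfit(x, y, 1)       # least-squares line
    future = np.arange(years[-1] + 1, years[-1] + 1 + horizon)
    return intercept + slope * future

years = list(range(2004, 2011))
rates = [100.0, 98.0, 96.0, 94.0, 92.0, 90.0, 88.0]  # illustrative decline
print(lm_forecast(years, rates, r=7, horizon=3))  # continues the -2/year trend
```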

RESULTS
All models described in Section 4 were fitted to the 150 available scenarios detailed in Section 3.1 and evaluated according to the performance criteria presented in Section 3.2. For each model, the NRMSE distribution over the 150 scenarios is given in Figures A1-A8 of the Supporting Information Appendix. Most of these distributions show large variability and are highly skewed, with M-NRMSE > Med-NRMSE, as the M-NRMSE strongly penalizes the poor performance of some models in particularly difficult scenarios, for example, in the case of a trend reversal (and the same is true for the M-IS).

Illustration of two selected scenarios
To illustrate the heterogeneity of predictions among models, we have plotted in Figure 4 two selected scenarios, one for female lung cancer and one for (male) prostate cancer, both with a cutoff time of t = 2010. In the former, incidence rates are always increasing, representing a simple setting for predictions, while in the latter the incidence begins to decrease around 2005, giving a more challenging situation for predictions. It can be seen that in the former scenario almost all models predict similar incidence rates, whereas in the latter the heterogeneity among predictions is striking, as a trend reversal occurs towards the end of the training set. In this setting, some models adapt to the new trend (e.g., an APC extrapolating all effects, represented in green), while others fail to adapt (e.g., an APC extrapolating the drift only, represented in dark cyan). We can observe that highly flexible models, although fitting closely to the data points in the training set, are not necessarily the best for prediction in the test set (e.g., ARIMA(3,3,0) in red). The Supporting Information Appendix contains more detailed results for prostate cancer. Figure A9 shows the predictions of the different models for each of the 15 scenarios (i.e., varying the cutoff time t) for this cancer site. When the cutoff time occurs before the trend reversal, all models predict excessively high incidence rates, but some models adapt more quickly than others when the cutoff time occurs during or just after the trend reversal. For each model, Figures A10-A13 give the NRMSE distribution over the 15 scenarios for prostate cancer, showing a generally worse performance of the methods (higher NRMSE values and more skewed distributions) than in Figures A1-A4 (which consider all cancer sites). The NRMSE and IS distributions of each model over the 15 scenarios for all 10 sites considered in this study can be found in the Supporting Information. In what follows, we discuss and compare the prediction performance achieved by the different models.

Comparison within classes of models
In the class of GLM (Section 4.1), recall that we varied (a) the number of knots k of the splines between 1 and 4, and (b) the variables included in the model and the functional forms used. Models with k = 3 and k = 4 knots achieved smaller M-NRMSE (Figure 5a). Models including age and period performed better than models including only age or age and cohort, with better performance when the period was included via splines rather than just linearly, while models including an interaction term performed similarly to those without interaction (Figure 5b). The best GLM with k = 4 knots was a model including age and period, along with a linear interaction between the two variables (M-NRMSE = 0.138), while the best GLM with k = 3 knots was a model including age and period without interaction (M-NRMSE = 0.143), the latter being simpler and converging more often than the former (100% vs. 96%). Both models performed much better than the more complex Lee-Carter model (M-NRMSE = 0.180). On the other hand, the use of the AIC for selecting the GLM showed extremely poor performance, as did the joinpoint model (both M-NRMSE > 7; see also Figures A1 and A2 in the Supporting Information Appendix).
In the APC class of models (Section 4.2), we varied the number of knots k between 3 and 5, while considering two possible link functions (log or 1/5 power) and two different extrapolation strategies (drift only or all effects). Results are summarized in Figure 5c. We found that using the 1/5 power link improved the predictions compared to using the canonical log link, especially when extrapolating the drift only. The optimal number of knots was k = 3 when using the first extrapolation strategy (drift only) and k = 5 when using the second (all effects). In general, the second extrapolation strategy improved predictions compared to the first. The best APC model was thus a model using the 1/5 power link, with five knots and extrapolating all effects (M-NRMSE = 0.140; see also Figure A3 in the Supporting Information Appendix).

[Figure 5 caption: Cancer registries of Vaud, Geneva, and Neuchatel, 1982-2016. In the labels, "A" indicates GLM including only age; "A+P" (respectively, "A+C") indicates GLM including age and period (respectively, age and cohort) without interaction; "A⋅P" ("A⋅C") indicates GLM including age and period (respectively, age and cohort) with interaction. For APC, "drift only" and "all effects" refer to prediction strategies extrapolating only the drift, respectively the drift plus all nonlinear effects; "log" indicates the logarithmic link and "pw" the 1/5 power link. For BAPC, RW1 and RW2 refer to first- and second-order random walks. APC, age-period-cohort; BAPC, Bayesian age-period-cohort; GLM, generalized linear models; LM, linear model.]
In the BAPC class of models (Section 4.3), the choice of priors for the period, cohort, and age effects had almost no impact on the M-NRMSE, while (using a PC hyper-prior) the performance was better for the RW1 model (M-NRMSE = 0.127) than for the more complex RW2 model (M-NRMSE = 0.139), as summarized in Figure 5c (see also Figure A3 in the Supporting Information Appendix).
In the ARIMA class of models (Section 4.4), we varied the autoregressive, integrated, and moving average orders from 0 to 3. Results are summarized in Figure 6. Changing the autoregressive and moving average orders did not have much impact on the prediction performance. On the other hand, working with differenced data improved the performance up to order 1, but worsened it beyond. The smallest M-NRMSE was achieved by an ARIMA(2,1,1) (M-NRMSE = 0.078), with a convergence of 96%, while the best ARIMA with a convergence of 100% was ARIMA(1,1,0) (M-NRMSE = 0.084). As for GLM, using the AIC criterion to select a possibly different ARIMA model depending on the scenario resulted in a worse prediction performance (M-NRMSE = 0.119) than systematically opting for an ARIMA(2,1,1) or (1,1,0) (see also Figure A4 in the Supporting Information Appendix).
Finally, in the class of LM (Section 4.5), the best performance was achieved using the last r = 7 data points of the training set to fit the model (M-NRMSE = 0.117; see Figure 5c and Figure A3 of the Supporting Information Appendix). The Supporting Information Appendix also contains the same summary results as in Figures 5 and 6 for the M-IS criterion (Figures A14 and A15). Within-class comparisons in terms of M-IS are largely consistent with those obtained using the M-NRMSE. The best GLM was a model with three knots including age and period with an interaction (M-IS = 0.199; Figures A14, A5, and A6). The best APC was a model using the 1/5 power link, with five knots and extrapolating all effects (M-IS = 0.504; Figures A14 and A7). The best BAPC was an RW1 model using a gamma(1, 2.5e−4) prior (M-IS = 0.200; Figures A14 and A7). In the ARIMA class, the best performance was achieved by ARIMA(1,1,0) (M-IS = 0.098; Figures A15 and A8), while in the class of LM the best performance was obtained using the last r = 4 data points (M-IS = 0.181; Figures A14 and A7).

Comparison between classes of models
The performance of the best models from each class identified above, based either on M-NRMSE or on M-IS, as well as that of some other well-known models, is summarized in Table 2, where models are ordered by M-NRMSE. Based on this criterion, simple ARIMA models, such as (2,1,1) and (1,1,0), and LM using the last seven data points performed better than the best models among the more complex classes (GLM, APC, and BAPC). Looking at the detailed results in the Supporting Information Appendix (Figures A1-A4), one can see that ARIMA models with orders up to (3,1,3), as well as LM using the last 5-10 data points, were also ahead of the other methods. The AIC-based ARIMA outperformed the GLM, APC, and BAPC models, while being inferior to simple ARIMA and LM. Finally, the single-knot joinpoint and Lee-Carter models showed extremely poor performance. Largely similar conclusions were obtained in terms of M-IS, with simple models outperforming more complex ones (Table 2 and Figures A5-A8). Again, the best performance was achieved by simple ARIMA models, followed by LM (using the last four points). These simple models also presented an M-CR fairly close to 95%. BAPC models had in general wider prediction intervals, resulting in coverage rates that were too high (M-CR close to 100% instead of 95% for some of them). In contrast, coverage rates were clearly too low for GLM and APC, where M-CR could reach values as low as 50%.
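For clarity, the evaluation criteria on a single scenario can be sketched as follows, assuming normalization of the RMSE by the mean observed rate (an assumption about the exact definition used here) and the standard interval score of Gneiting and Raftery (2007) for a central (1 − α) prediction interval:

```python
import math

def nrmse(obs, pred):
    """Root mean square error, normalized by the mean observed rate
    (normalization convention assumed for illustration)."""
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return math.sqrt(mse) / (sum(obs) / len(obs))

def interval_score(obs, lower, upper, alpha=0.05):
    """Mean interval score: interval width plus a 2/alpha penalty for
    each observation falling outside its prediction interval."""
    scores = []
    for y, l, u in zip(obs, lower, upper):
        s = u - l
        if y < l:
            s += 2 / alpha * (l - y)   # missed below
        elif y > u:
            s += 2 / alpha * (y - u)   # missed above
        scores.append(s)
    return sum(scores) / len(scores)

def coverage(obs, lower, upper):
    """Empirical coverage rate of the prediction intervals."""
    hits = sum(l <= y <= u for y, l, u in zip(obs, lower, upper))
    return hits / len(obs)
```

The interval score rewards narrow intervals but penalizes misses heavily, which is why the overly wide BAPC intervals and the overly narrow GLM/APC intervals both score poorly.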
Figure 7 compares the M-NRMSE between model classes separately for each sex and cancer site. Prostate cancer is the site with the largest variability in model accuracy, as incidence rates showed a trend reversal (Figures 4 and A5), followed by skin melanoma for both sexes, because of a rate stabilization in recent years (Figure 2). The APC, BAPC, and GLM models performed particularly poorly in at least one of these difficult cases, while the ARIMA models were among the best methods regardless of the cancer site. Here also, similar results were obtained in terms of M-IS (Figure A16).
When evaluating separately the M-NRMSE of the short-term (1-5 years), medium-term (6-10 years), and long-term (11-15 years) predictions, we found that, despite inevitably poorer performance for long-term than for medium- and especially short-term predictions, the ranking of the models was largely the same in all three settings (Table 2). Results of our sensitivity analyses (Section 3.2) are also summarized in Table 2. In a first sensitivity analysis, we repeated all calculations using M-NMAE instead of M-NRMSE. The ranking of the models stayed mostly the same. In a second sensitivity analysis, we looked at the Med-NRMSE and Med-NMAE. While the performance of the different models was much closer, the best models remained the simple ARIMA models.
The last column of Table 2 provides the percentage of scenarios for which the fitting algorithm of the different methods converged, which was 100% for LM, above 95% for ARIMA and GLM, and slightly below (down to 85%) for the APC and BAPC models.The computational time of Bayesian and non-Bayesian methods is available in the Supporting Information Appendix (Table A1).

DISCUSSION
The objective of this paper was to compare the performance of statistical models for predicting annual age-standardized cancer incidence rates. The APC method was compared with its Bayesian counterpart BAPC, with simpler GLM approaches including only age and period or age and cohort in the model, and with ARIMA models based only on time series of incidence rates. A simple linear regression model (LM) extrapolating the trend fitted on the last few observed incidence rates was included in the comparison. The prediction performance of each method was evaluated in terms of point prediction accuracy, the quality of prediction intervals, and the convergence rate of the fitting algorithm on a large set of scenarios obtained by leave-future-out cross-validation using real data from Swiss cancer registries.
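The leave-future-out partitioning can be sketched as follows: each scenario keeps all years up to a moving cutoff as the training set and the remaining years as the test set (an illustrative sketch; the specific cutoffs are those illustrated in Figure 3):

```python
# Illustrative sketch of leave-future-out cross-validation: successive
# scenarios move the cutoff forward, so each training set extends the
# previous one and each test set lies entirely in the "future".

def lfo_splits(years, n_scenarios):
    """Yield (train, test) partitions for successively later cutoffs."""
    splits = []
    for k in range(n_scenarios, 0, -1):
        cutoff = len(years) - k
        splits.append((years[:cutoff], years[cutoff:]))
    return splits
```

Unlike ordinary cross-validation, this scheme never lets a model see observations later than those it must predict, which mimics how incidence predictions are produced in practice.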
In the class of GLM, including period led to better prediction performance than including cohort (in addition to age) in the model. Including cohort involves predicting new cohort effects mainly based on the most recent ones, which contain only a few observations. These models are therefore more prone to overfitting than those using period effects, which concern the whole population. On the other hand, the inclusion of a simple (linear) interaction term for age and period did not improve prediction performance, while including complex interactions (splines) actually worsened predictions due to overfitting. This was also the case for the Lee-Carter and joinpoint models, and for a GLM selected via the AIC. Prediction intervals were particularly poor for GLM, as previously found in another study (Møller et al., 2005), which estimated (as we did) coverage rates of about 50%. The APC approach, although considerably improved by the use of a power link (Møller et al., 2003), performed poorly when predictions were made by extrapolating the drift alone. The drift alone is, however, the most widely used strategy to date (Yu et al., 2019). Further extrapolating all nonlinear effects (Rutherford et al., 2012) improved the predictions in case of a trend reversal. The Bayesian APC models based on second-order random walks performed very similarly to the classical APC models with splines, while a simpler BAPC based on first-order random walks performed slightly better despite less smooth age, period, and cohort effects. For ARIMA, models with low-order autoregressive, integrated, and moving average components are to be preferred to more complex model structures. A strategy based on selecting the best ARIMA model via AIC overfitted the data and gave worse predictions than systematically choosing a simple structure. Finally, increasing the number of data points used to fit a simple linear regression model improved the prediction when including up to 7 years in the past, worsening after this lag.
Comparing models among classes, we obtained the best performance for simple ARIMA models such as (2,1,1) or (1,1,0). Moreover, most of the simple ARIMA models, with component orders up to (3,1,3), were ahead of the other methods. The second-best performing class was simple linear regression, where all alternatives using 5-10 data points to fit the trend outperformed the more complex GLM, APC, and BAPC models. Similar rankings of the models were obtained for short-, medium-, and long-term predictions, and when stratifying by cancer site, despite poorer performance for long-term versus short-term predictions and for cancers showing a trend reversal. Considering absolute errors (NMAE) instead of squared errors (NRMSE), and taking the median instead of the mean of the NRMSE (or NMAE) across scenarios, did not substantially change the ranking of the models, still placing simple ARIMA models at the top of the list. However, while the choice of the mean strongly penalizes the most difficult scenarios, the median simply excludes these particularly complicated situations from the evaluation. For this reason, we have focused mainly on the mean NRMSE, as it favors models and methods that do not show aberrant behavior in case of trend reversal or stabilization; such behavior is becoming increasingly common (e.g., prostate, breast, and skin melanoma) and will hopefully be observed in the future for other cancer sites. Finally, looking at the prediction intervals did not alter our conclusions, with simple ARIMA models still being the best performing.
Our first observation is that the models that offer the best fit to the data, for example, models selected using the AIC, are not necessarily the best for prediction. The use of the AIC may lead to the choice of overly complex models, for example, GLM or APC including complex spline interactions, which are likely to overfit the data and are indeed less efficient for prediction than simpler and smoother models. The second observation is in line with the first in advocating simplicity. Approaches such as APC or BAPC estimate the effects of age, period, and/or cohort separately, considered as proxies for epidemiological risk factors, such as smoking and other carcinogens. However, when it comes to prediction, we have shown that better performance is obtained by simply extrapolating the trends using models such as ARIMA (or even simple regression models). As Booth and Tickle (2008) have already noted in the context of mortality projection, an extrapolation-based approach is often preferable to an explanation- or interpretation-based approach.
Our study has some limitations, all of which could motivate future work. First, although we considered many models, we did not include overdispersion in our Poisson models. While this would affect the coverage of the prediction intervals, it would not greatly change the predictions and thus the NRMSE, our main criterion. More generally, many other prediction models and options can be formulated, for example, using quasi-Poisson or negative binomial distributions or other link functions, and it could become an interesting challenge for researchers to identify one that outperforms ARIMA in comparable settings. Second, we did not take into account the effects of prevention/screening programs and of changes in medical technology and practice on cancer incidence rates. These factors can strongly influence the (past and future) rates and help identify trend changes (Etzioni et al., 2013). Such effects are, however, challenging to analyze, as they evolve over time. For example, the start of a cancer screening program leads to a short-term initial increase in incidence, followed by a stabilization or sometimes a decrease compared to past trends. The magnitude of these temporal effects depends on adherence to these prevention programs. Considering the impact of screening when making predictions is a subject of ongoing work. Another limitation is that we did not account for data underreporting and its evolution over time, which may introduce artifacts into predictions. In fact, while a high degree of completeness has been documented for Swiss cancer registration (Lorez et al., 2017), a slight underreporting of some cancers cannot be discarded. Finally, we have only modeled the most common cancers. Although we assume that simpler methods, such as ARIMA, should be suitable for rare cancers, as they are less prone to overfitting, we have not studied the prediction performance for these cancers and cannot draw any conclusions on this point. A specific analysis of the most suitable prediction methods for rare cancers would be another important subject of future work. The same consideration applies to the prediction of age-specific incidence rates.
In conclusion, we recommend using low-order ARIMA models to predict cancer incidence, for example, the ARIMA(2,1,1) or (1,1,0) models, which achieved the best overall performance in our comparison study. We advise against using the AIC to select the model, as it often results in overfitting and is therefore not the best strategy for prediction. While APC and BAPC models remain the best tools to help interpret changes in cancer incidence trends, we recommend avoiding the widely used APC model for prediction, especially when the extrapolation is restricted to the drift. Finally, given the large uncertainty, particularly in case of a trend reversal, we do not recommend predicting cancer incidence for periods far into the future but rather updating predictions regularly.

FIGURE 1 Lexis diagram of an extract of the combined data from the Swiss registries of Vaud, Geneva, and Neuchatel. Selected period: January 1, 2010-December 31, 2015. Selected age range: 65th to 70th birthday. In each triangle, the top number represents the number of incident cases and the bottom number the person-years for an age (a), period (p), and birth cohort (c) combination. For example, during 2012 there were 21 new cases of breast cancer among women 67 years old who were born in 1944 (triangle in bold) and 18 new cases among women of the same age who were born in 1945 (dashed triangle). Corresponding person-years were 3663.5 and 3547, respectively.

FIGURE 2 Standardized incidence rates per 10,000 inhabitants over the period 1982-2016 for the most common cancer sites and the two sexes. Combined data from the Swiss registries of Vaud, Geneva, and Neuchatel. Reference for standardization: combined population of Vaud, Geneva, and Neuchatel in 2000.

FIGURE 3 Illustration of the 15 leave-future-out cross-validation scenarios for each sex and cancer site, based on partitioning the time range into a training set (blue dots) and a test set (red dots) according to a cutoff time t (working with five cancer sites for women and five for men resulted in 150 different scenarios).

FIGURE 4 Illustration of predictions made by the different models in one of the 15 leave-future-out cross-validation scenarios (training set 1982-2010; test set 2011-2016) for lung (female) and prostate (male) cancer standardized incidence rates. Combined data from the Swiss registries of Vaud, Geneva, and Neuchatel, 1982-2016. Three models are highlighted (APC extrapolating only the drift, APC extrapolating the drift + all nonlinear effects, and ARIMA(3,3,0)). All other models are in gray. APC, age-period-cohort; ARIMA, autoregressive integrated moving average.

FIGURE 5 Mean normalized root mean square error (M-NRMSE) for GLM (a, b) and for APC, BAPC, and LM (c) across 150 scenarios (15 leave-future-out cross-validation scenarios for each sex and five cancer locations per sex) obtained on combined data from the Swiss registries of Vaud, Geneva, and Neuchatel, 1982-2016.

FIGURE 6 Mean normalized root mean square error (M-NRMSE) for ARIMA with orders of each component (AR, I, and MA) between 0 and 3, and for an AIC-selected ARIMA, across 150 scenarios (15 leave-future-out cross-validation scenarios for each sex and five cancer locations per sex) obtained on combined data from the Swiss registries of Vaud, Geneva, and Neuchatel, 1982-2016. AIC, Akaike information criterion; AR, autoregressive; ARIMA, autoregressive integrated moving average; MA, moving average.

TABLE 2 Prediction performance for the best models within each class (GLM, APC, BAPC, ARIMA, and LM) according to the mean normalized root mean square error (M-NRMSE) and to the mean interval score (M-IS), and for some other selected models, across 150 scenarios (15 leave-future-out cross-validation scenarios for each sex and five cancer locations per sex) obtained on combined data from the Swiss registries of Vaud, Geneva, and Neuchatel, 1982-2016. Also given is the performance of short-term, medium-term, and long-term predictions, provided by M-NRMSE (1-5 years), M-NRMSE (6-10 years), and M-NRMSE (11-15 years). Alternative criteria include the median NRMSE (Med-NRMSE), the mean and median normalized mean absolute error (M-NMAE and Med-NMAE), and the mean coverage rate (M-CR). The last column indicates the percentage of convergence of the methods over the 150 scenarios. Models in this table are ordered according to M-NRMSE.

FIGURE 7 Mean normalized root mean square error (M-NRMSE) for one model per class, by sex and cancer site (M-NRMSE across 15 leave-future-out cross-validation scenarios on combined data from the Swiss registries of Vaud, Geneva, and Neuchatel, 1982-2016, per sex and cancer site). The number of scenarios for which the fitting algorithm of a method did not converge is indicated at the bottom of the graph for APC|BAPC. The other methods shown always converged. APC, age-period-cohort; ARIMA, autoregressive integrated moving average; BAPC, Bayesian age-period-cohort; GLM, generalized linear models; LM, linear model.