Validation and public health modelling of risk prediction models for kidney cancer using the UK Biobank

To externally validate risk models for the detection of kidney cancer, as early detection of kidney cancer improves survival and stratifying the population using risk models could enable an individually tailored screening programme.


Introduction
Kidney cancer is the seventh most common cancer in Europe [1], responsible for 50 000 annual deaths [2,3]. Incidence is projected to rise in coming years, due to changes in population demographics and lifestyle [1]. Early-stage diagnosis is strongly correlated with improved survival: the 5-year cancer-specific survival rates for patients diagnosed with Stage I and IV kidney cancer are 83% and 6%, respectively [4], and in the UK ~25% of patients who present with kidney cancer have evidence of metastases. In the UK, 60% of kidney cancers are diagnosed incidentally, with ~20% of these being late stage (III-IV) [5]. A lack of symptoms, even at late stages of the disease, makes the early detection of kidney cancer particularly challenging.
There are several risk factors that are known to be associated with kidney cancer and these have been combined to develop a number of risk models, as identified in a recent systematic review [6]. It has been suggested that such risk models could be used to identify high-risk individuals for inclusion in a screening programme [7], with the development and validation of risk models being identified as a research priority in a recent review of screening for kidney cancer [8]. Risk-stratified screening could improve screening efficiency, while reducing the associated costs and harms compared to a simple age-based approach [9,10]; this is crucial when considering diseases, such as kidney cancer, with a relatively low prevalence [11]. However, most of the published models that predict individual-level risk of developing kidney cancer have not been externally validated. Therefore, the potential of these models to assess eligibility is unclear and comparison with other approaches (such as age- and sex-based selection) is not possible.
In the present study, we externally validated 30 of these models in the UK Biobank (UKB), a cohort of >500 000 people. We then applied a public health modelling approach to estimate the population-level health benefit of incorporating the best performing models within a risk-stratified kidney cancer screening programme. The use of risk models to determine eligibility for screening was compared to age- and sex-based strategies.

Selection of Risk Prediction Models
We identified risk prediction models from our recent systematic review [6]. We found 60 studies describing models that predict the risk of an individual developing kidney cancer. Only models with phenotypic risk factors were included in this validation. Where there was insufficient detail about the models, the authors were contacted; these models were excluded if additional information was not obtained (n = 8). Models including more than one risk factor for which there is no comparable UKB variable were excluded (n = 8). A total of 16 studies (29 models) were included in this validation study [7,12-27] (Fig. S1). Following discussion with experts, an additional study published after the systematic review was included [7]. Details of the development studies, including study type, population, sample size and location, are given in Table S1.

Validation Cohort
The UKB was used as the validation cohort. It is the largest population-based cohort in the UK [28]; between 2006 and 2010, ~500 000 individuals aged 40-69 years were recruited. Demographic and lifestyle information was recorded at baseline assessment; however, no imaging was carried out. Details of how risk factors were used are provided in Table S5. Data on cancer incidence are available through linkage to national cancer registries (Methods S1).
Members of the cohort with a diagnosis of kidney cancer prior to baseline were excluded from analysis. If more than one diagnosis was recorded, the first occurrence was used. The most recent cancer diagnosis in the dataset is from December 2016. We censored all follow-up to 31 March 2016 (the study end date) to ensure that late registrations were not missed. A closed cohort analysis was conducted; individuals whose follow-up was censored before 6 years were excluded. Cases were those who developed kidney cancer within 6 years of baseline assessment. This follow-up time is longer than the estimated sojourn time of RCC (3.7-5.8 years [29,30]).
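The closed-cohort case definition described above can be sketched in a few lines. This is a minimal illustration rather than the study's code: the `Participant` fields and the `classify` function are hypothetical names, and the sketch assumes that a diagnosis recorded within the window counts as a case and that "within 6 years" is inclusive.

```python
from dataclasses import dataclass
from typing import Optional

HORIZON_YEARS = 6.0  # follow-up window used to define cases


@dataclass
class Participant:
    # All field names are hypothetical, chosen for illustration only.
    prior_kidney_cancer: bool            # diagnosis recorded before baseline
    years_followed: float                # baseline to censoring or study end
    years_to_diagnosis: Optional[float]  # baseline to first diagnosis, None if none


def classify(p: Participant) -> Optional[bool]:
    """Return True for a case, False for a non-case, None if excluded."""
    if p.prior_kidney_cancer:
        return None  # prevalent cases are excluded
    if p.years_to_diagnosis is not None and p.years_to_diagnosis <= HORIZON_YEARS:
        return True  # first diagnosis falls inside the 6-year window (assumed inclusive)
    if p.years_followed < HORIZON_YEARS:
        return None  # censored before 6 years without a diagnosis: excluded
    return False     # followed for at least 6 years with no diagnosis in the window
```

A participant diagnosed after the 6-year window but followed for at least 6 years is treated as a non-case in this sketch, which matches the binary outcome used in the primary analysis.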

Model Validation
For all included models, we first computed the score, relative risk, or absolute risk (depending on model type) for each eligible UKB participant at baseline. Models predicting the absolute risk over a specific period were re-scaled to predict 6-year risk (Methods S1). We calculated the performance of the models both separately for men and women and as a whole cohort. Models developed originally for single-sex cohorts were not validated for the other sex or the whole cohort. Where a study developed separate models for men and women, the sex-specific risk predictions were combined for validation in the whole cohort.
A complete case approach was used for the primary analysis with each model only computed for individuals with data for all the risk factors. This was performed on a model-by-model basis; hence, the cohort size varies slightly for each validation. The outcome (diagnosis of kidney cancer) was treated as a binary variable. The discriminative ability of each model was measured quantitatively using the area under the receiver-operating characteristic curve (AUROC). The calibration of the models was assessed graphically using deciles (or maximum number of possible groups) of predicted risk (Methods S1). The observed to expected ratio (O:E) and the Hosmer-Lemeshow statistic were calculated for absolute risk models.
Two sensitivity analyses were carried out. First, an open cohort analysis included all participants with <6 years of follow-up and cases of kidney cancer occurring after 6 years; in that analysis, we used Harrell's C-statistic to assess discrimination, as it accounts for censoring. Second, we performed a complete-case analysis using a subset of participants with complete data available for all models.
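The re-scaling method itself is described in Methods S1 and is not reproduced here. As an illustration only, one common way to convert an absolute risk from one time horizon to another is to assume a constant hazard over time. The function name `rescale_risk` and the constant-hazard assumption are ours, not taken from the study.

```python
def rescale_risk(risk: float, original_years: float, target_years: float = 6.0) -> float:
    """Convert an absolute risk over `original_years` to one over `target_years`.

    ASSUMPTION: a constant hazard over time. This is an illustrative choice and
    is not necessarily the method used in the study (see Methods S1).
    """
    if not 0.0 <= risk < 1.0:
        raise ValueError("risk must be in [0, 1)")
    if original_years <= 0 or target_years <= 0:
        raise ValueError("time horizons must be positive")
    survival = 1.0 - risk  # probability of remaining cancer-free over the original period
    return 1.0 - survival ** (target_years / original_years)
```

Under this assumption, a 10-year risk is mapped to a slightly smaller 6-year risk, and a risk already expressed over 6 years is returned unchanged.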
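The two simplest performance measures named above can be written out directly. The sketch below is a plain-Python illustration with hypothetical function names. It computes the AUROC as the proportion of case/non-case pairs in which the case has the higher predicted value (ties count as half), and the O:E ratio as observed cases divided by the sum of predicted risks. The pairwise loop is quadratic, so a rank-based implementation would be preferable for a cohort of this size.

```python
from typing import Sequence


def auroc(scores: Sequence[float], outcomes: Sequence[int]) -> float:
    """Probability that a random case scores higher than a random non-case."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    noncases = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not noncases:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in noncases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # tied predictions count as half
    return wins / (len(cases) * len(noncases))


def observed_to_expected(predicted_risks: Sequence[float], outcomes: Sequence[int]) -> float:
    """O:E ratio: observed number of cases over the sum of predicted absolute risks."""
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("sum of predicted risks must be positive")
    return sum(outcomes) / expected
```

An O:E ratio above 1 indicates that the model under-predicts risk in the validation cohort, and a ratio below 1 indicates over-prediction.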
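For readers unfamiliar with Harrell's C-statistic, the following sketch shows the basic idea of how censoring is handled: a pair of participants is only comparable if the one with the shorter follow-up time had an observed event. The function name is hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplification of the full definition.

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[bool], scores: Sequence[float]) -> float:
    """Concordance between predicted scores and time-to-event data with censoring.

    Simplification: pairs with identical follow-up times are not counted.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier member of a pair must have an observed event
        for j in range(n):
            if times[i] < times[j]:  # j was still event-free when i had the event
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

With no censoring and a fixed horizon, this quantity plays the same role as the AUROC used in the primary analysis.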

Public Health Modelling
Further analyses were carried out to assess the potential benefits of using these models in the UK general population. Five models predicting absolute risk, with good performance in validation (AUROC >0.60, 0.5<O:E<2) were selected for this analysis [31].
First, we used a re-calibration method (previously described [32,33]) to re-scale the risk distributions of the models to reflect the expected risk distribution in the general population (Methods S1). Second, we developed a set of scenarios using different risk thresholds (0.1-1.0% 6-year risk) as a cut-off for screening eligibility. For each model, members of the cohort with a predicted risk higher than the threshold were classified as high-risk and hence 'eligible' for screening. Cases of kidney cancer in the high-risk group are considered identifiable in that screening scenario.
The proportions of true positives, false positives, false negatives, and true negatives were calculated for each sex and 5-year age group. These were then summed with weighting that reflected the age and sex distribution and the incidence of kidney cancer in the UK (using Cancer Research UK [CRUK] and Office for National Statistics [ONS] data) [34]. Using this new representative population, the efficacy of the model was determined for each screening scenario. An overview of this analysis process is given in Methods S1.
We assessed accuracy of the models by calculating the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for each threshold. The efficiency of each screening scenario was assessed by calculating the proportion of the population invited to screening, the proportion of people subsequently diagnosed with kidney cancer within current routine care during the following 6 years who are invited, and the number of people needed to invite to identify one individual who is subsequently diagnosed with kidney cancer within 6 years (NNI). These screening scenarios were compared to age- and sex-based risk stratification. Sub-group analyses by sex and age group were carried out for one threshold (0.4%), selected pragmatically to represent a potential cut-off that selects 10% of the population. This modelling approach assesses the potential of the risk models to determine screening eligibility; however, the practicalities of screening (such as the performance of screening tests and attendance at screening) are not factored into this analysis and the results are not dependent on screening modality.
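One plausible reading of this weighting step is sketched below. It is an illustration under stated assumptions, not the study's implementation (which is summarised in Methods S1): every name is hypothetical, and the sketch assumes that, within each sex and age stratum, the cohort supplies the proportion of cases and of non-cases classified as high-risk, while external population data supply the stratum's population share and 6-year incidence.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Stratum:
    # Hypothetical inputs for one sex and 5-year age group.
    population_share: float        # share of the target population (external data)
    incidence: float               # 6-year kidney cancer risk (external data)
    flagged_among_cases: float     # cohort proportion of cases above the threshold
    flagged_among_noncases: float  # cohort proportion of non-cases above the threshold


def standardise(strata: Iterable[Stratum], population_size: float = 100_000) -> Dict[str, float]:
    """Sum stratum-level confusion-matrix counts over a standard population."""
    totals = {"TP": 0.0, "FP": 0.0, "FN": 0.0, "TN": 0.0}
    for s in strata:
        people = population_size * s.population_share
        cases = people * s.incidence
        noncases = people - cases
        totals["TP"] += cases * s.flagged_among_cases
        totals["FN"] += cases * (1.0 - s.flagged_among_cases)
        totals["FP"] += noncases * s.flagged_among_noncases
        totals["TN"] += noncases * (1.0 - s.flagged_among_noncases)
    return totals
```

When the population shares sum to 1, the four totals sum to the chosen population size, so they can be read as expected counts in a standard population of that size.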
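The accuracy and efficiency measures listed above all follow from the four cells of the confusion matrix for a given threshold, where "positive" means classified as high-risk and therefore invited. The sketch below uses a hypothetical function name and assumes at least one case, one non-case, one true positive and one person left uninvited, so that no denominator is zero.

```python
from typing import Dict


def screening_metrics(tp: float, fp: float, fn: float, tn: float) -> Dict[str, float]:
    """Accuracy and efficiency measures for one screening scenario."""
    if min(tp, tp + fn, tn + fp, tn + fn) <= 0:
        raise ValueError("metrics are undefined when a denominator is zero")
    invited = tp + fp  # everyone classified as high-risk is invited
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),  # proportion of future cases who are invited
        "specificity": tn / (tn + fp),
        "ppv": tp / invited,
        "npv": tn / (tn + fn),
        "proportion_invited": invited / total,
        "nni": invited / tp,            # number needed to invite per case identified
    }
```

Note that the NNI is the reciprocal of the PPV, so a scenario that invites a group with a higher prevalence of future cases has a lower NNI.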

Results
A cohort of 450 687 individuals, including 635 cases of kidney cancer, was included in the primary analysis (Fig. 1). Compared to those who did not develop kidney cancer, the cases were more likely to be male, White, overweight, an ever smoker (and have higher pack-years), hypertensive, and have a previous cancer diagnosis (Table 1).
These 10 models all include age as a risk factor. Several also include blood pressure as a risk factor, either a hypertension diagnosis [7,21] or a direct measurement [7,24]. The results of the sensitivity analyses were consistent with the primary analysis (Table S2).

Calibration
The calibration of the models in the UKB cohort is shown graphically in Fig. S2a-e for men, women, and the whole cohort. Some of the models show good calibration (Landsman and Graubard [21]

Public Health Modelling
The adjusted population used in this analysis was scaled to have 100 000 individuals aged 40-70 years. This population has the same age and sex distribution (within this age range) as the UK and the same incidence of kidney cancer. Over a 6-year period, 212 population members (139 men, 72 women) are expected to develop kidney cancer. If a screening programme included all 100 000 individuals and detected all cases (over 6 years), 471 individuals would be invited per case identified (NNI).
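The baseline NNI quoted above follows directly from the size of the modelling population and the expected number of cases. As a worked check using the rounded figures given in the text:

```latex
\mathrm{NNI}_{\text{invite all}}
  = \frac{\text{number invited}}{\text{cases identified}}
  = \frac{100\,000}{212}
  \approx 471.7
```

This agrees with the reported value of 471 to within rounding of the expected case count.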
The Usher-Smith et al. [27], Scelo et al. [25], Singleton et al. [7], Landsman and Graubard [21] 2013 (b) and (c) models were included in this analysis. Table 3  . For all models, as the risk threshold (and the proportion of the population screened) increases, the sensitivity of the models increases while the specificity (and NNI) decreases. Very few individuals were predicted to have a risk >1% for any model. No meaningful differences are seen between the five models.
At a risk threshold of 0.4% the NNI is <200 for all five models. The Landsman and Graubard [21] (b) model detects the highest number of kidney cancer cases, 32.5% (69/212), screening 12.7% of the population with a corresponding NNI of 183. The Scelo et al. [25] model detects the fewest cases at this threshold, 27.8% (59/212); however, a slightly lower proportion of the population is screened (11.7%), with a corresponding NNI of 199. As seen in Fig. 3, the models performed slightly better than the age- and sex-based screening strategies. If all men aged ≥60 years in the modelling population were screened (14.1%), then 29.7% (63/212) of cases would be identified (NNI = 206). Table S4 compares the model performance at a risk threshold of 0.4% (a) to the age- and sex-based strategies (b) broken down by sex and 5-year age group. For all five models, the subgroup with the oldest men (65-69 years) has the highest screening rate and the highest detection rate; the two Landsman and Graubard [21]

Discussion
We have performed the first comprehensive and systematic validation of risk models (n = 30) for kidney cancer in the largest available UK cohort (n = 502 536). In total, 10 models had adequate to good (0.61<AUROC<0.72) discrimination in men [7,21,25,27], women [7,21,24,25,27] or the whole cohort [7,21,25,27]. Measures of sex (n = 20), smoking (n = 22) and BMI (n = 29) are included in the majority of models. The highest performing models all included age as a risk factor and several included hypertension. This is consistent with current literature about risk factors for kidney cancer [30,36,37].
The best performing models show discriminative ability comparable to that of models for colorectal cancer [38], breast cancer [39] and melanoma [40] in similar studies. The performance of risk models in kidney cancer based on phenotypic risk factors is limited by the relatively low population prevalence (0.17, 95% CI 0.09-0.27 in Europe [11]) and relatively weak risk factor associations (e.g. heavy vs never smoker, hazard ratio [HR] 1.83, 95% CI 1.26-2.65 [36]). Models may be improved in the future by the addition of biomarkers or genetic risk factors; however, more research is required to identify suitable candidates [6].
Given the relatively low prevalence of kidney cancer in the general population, age-based screening, such as currently used for colorectal cancer screening in the UK, is unlikely to be implemented. In the public health modelling, the five risk models investigated had slightly better performance than the age- and sex-based strategies in a UK population. The use of any one of these models to identify individuals at higher risk of developing kidney cancer could provide a more nuanced selection process for screening. An ideal screening strategy would provide an efficient method of detecting early-stage kidney cancer while minimising the number of unaffected individuals screened. At a threshold of 0.4%, all of the models have an NNI of <200, screening between 11% and 13% of the population and detecting between 27% and 33% of cases. This is comparable to screening all men aged ≥60 years, where 14.1% of the population are screened and 29.7% of cases are identified (NNI = 206).
The Landsman and Graubard [21] and Singleton et al. [7] models additionally performed particularly well in younger men. At a threshold of 0.4%, the Landsman and Graubard [21] (b) model identified 38% of cases in men aged 45-49 years, although only 7% of this subgroup are classified as high-risk (NNI = 119). Within the context of a screening programme, these models offer the opportunity to identify younger individuals likely to benefit from screening while reducing unnecessary screening amongst low-risk older individuals.
A further important finding is that at a threshold of 0.4%, all of the models classified more men than women as high risk. As a result, using any of the models to determine eligibility for screening would result in a greater proportion of men being invited for screening than women. Given the higher incidence of kidney cancer in men than women (62% of cases are in men in the UK [41]) this may be a reasonable strategy, with men, who are on average at higher risk, being screened and women, who are on average at lower risk, avoiding screening or being screened at older ages. However, all the models also identified a smaller proportion of women who later developed kidney cancer; e.g. the Usher-Smith et al. [27] model detects 44.8% of men who developed kidney cancer but only 2.2% of women, although the NNI for men and women is similar (190 and 192). Using a strategy like this raises ethical considerations about missing more cases in certain subgroups of the population, and equitable distribution of the benefits and harms of screening. Previous research has shown that selection by a complex risk score is more acceptable to the public than age and sex alone [42].
Recent cost analysis work around kidney cancer screening (using ultrasonography screening) estimated the incremental cost-effectiveness ratio to be <£20 000 per quality-adjusted life year for men, but not women (when compared to existing practice in the UK) [43]. Using a risk model could result in reduced costs if the population selected for screening has a higher prevalence of kidney cancer. However, compared to age- and sex-based strategies, use of these risk models to determine eligibility for screening would require additional data about individuals, increasing the resources required. The Scelo et al. [25] and Usher-Smith et al. [27] models require minimal additional information (BMI and smoking status), often available in primary care records. However, the Landsman and Graubard [21] models require information that is not routinely collected, such as educational level.

Strengths and Weaknesses
To our knowledge, the present study is the first validation of multiple risk models predicting the development of kidney cancer. Previously only the models by Usher-Smith et al. [27] had been externally validated; our present results are similar to these prior validations [7,27].
A key strength of the present study is the use of the large UKB cohort. In the primary analysis there were 635 cases (408 in men and 227 in women), sufficient for a robust validation [44,45]. Although recruitment for the UKB cohort was designed to be wide reaching, there is evidence of selection bias [46]. The participants differ from the general population in demographics, lifestyle, and health outcomes; the rates of cancer are 11.8% lower than in the UK population [46]. This bias is mitigated in our present analysis by re-scaling predicted risk distributions and standardising the results to reflect the UK population age and sex distribution and kidney cancer incidence. This assumes that cases of kidney cancer in UKB are similar to cases in the UK population but does not address other differences between UKB and the general population. Further, we note that kidney cancer is most common in older individuals (median age at diagnosis is 65 years) and the UKB cohort only covers the age range 40-70 years (46- (Table S1) and most were developed in retrospective cohorts (not specifically designed for screening programmes) [6]. A potential issue with screening is the over-identification of early stage, low-grade cancer. Renal imaging was not carried out for the UKB cohort, so we do not have information about the presence of kidney cancer at baseline and rely on diagnoses in routine care to identify cases. However, as many kidney cancers are currently detected at a late stage (45% Stage III-IV in UK), it is likely that many cases could be detected earlier if a screening programme was implemented. Whether this would translate into a survival benefit is currently unknown and beyond the scope of the present study.
More generally, it is important to consider that variation in performance of the models may also reflect differences in their development populations. Models developed in populations similar to the UKB cohort are likely to perform better in this validation. For example, the Singleton et al. [7] model, developed in a pan-European cohort, has good discrimination and calibration in the present study. On the other hand, the Joh et al. [18] models, developed in a USA cohort, perform relatively poorly.
In the public health modelling analyses, we assumed that all eligible participants attend screening and that the programme detects all kidney cancers otherwise diagnosed within 6 years. Therefore, the estimates of accuracy are overestimates (and the NNI underestimates) of the real-life implementation of these scenarios. The potential of these models should be considered within the context of a best-practice screening programme and issues including the performance of screening tests, attendance at screening and existing diagnostic pathways should be included in any feasibility studies, as they are in the ongoing Yorkshire Kidney Screening Trial [47]. However, while underlying population prevalence and understanding of the speed with which kidney cancers progress from early to late stage remain uncertain [11], it is not possible to estimate the actual benefits and harms of a kidney cancer screening programme. Randomised controlled trials that compare organised screening to current practice are required to understand the role that screening could play in improving health outcomes.

Conclusions
The present validation study identified five models with good discrimination and calibration in a large UK cohort. These best-performing models are better at identifying individuals who are later diagnosed with kidney cancer than criteria based on age and sex alone, and so could potentially improve the efficiency of a kidney cancer screening programme if used to determine eligibility. However, the benefits of using any of the models over age or sex alone remain small. All of the models additionally struggled to identify women at high risk of kidney cancer and the number of people needed to invite to identify one individual who is subsequently diagnosed with kidney cancer within 6 years was >100 for most thresholds. Future research is needed to combine these phenotypic models with other biomarkers, including genetic risk factors, to improve performance.

Supporting Information
Additional Supporting Information may be found in the online version of this article: Figure S1. Model selection process. This flow diagram describes the process used to select models for inclusion in this validation study. Figure S2. Calibration plots comparing the observed and predicted risk of each model in deciles in the UK Biobank cohort. Figure S3. Histograms showing the risk distribution of the five best performing models (a: Usher-Smith, b: Scelo, c: Landsman and Graubard (b), d: Landsman and Graubard (c), e: Singleton) before and after rescaling. The risk is shown separately for men and women. Figure S4. Overview of public health modelling analysis process. Table S1. Summary of the development studies of the models included in the validation. Table S2. The discrimination (AUROC) values for all the validated models in the primary analysis and two sensitivity analyses (open cohort and missing data). Table S3. Calibration measures (observed to expected ratio and Hosmer-Lemeshow test) for all of the models estimating the absolute risk of developing kidney cancer. Table S4. The performance of screening strategies in different sex and age subgroups for a UK population of 100 000 individuals aged 40-70 years. The ability of (a) the five best performing models to identify people at high risk of developing kidney cancer is contrasted to (b) the use of age- and sex-based screening strategies. Table S5. Use of UKB variables. Methods S1. Supplementary Methods Section.