Predicting risk of endometrial cancer in asymptomatic women (PRECISION): Model development and external validation

Develop an endometrial cancer risk prediction model and externally validate it for UK primary care use.


| INTRODUCTION
Endometrial cancer is the fourth most common cancer in women in the UK, with 9700 diagnoses annually, an increase of 59% over the last 30 years. 1 Although overall prognosis is generally good, mortality rates are rising, with endometrial cancer anticipated to become the sixth most common cause of female cancer death by 2035. 2,3 Effective primary prevention strategies are needed to reduce the burden of disease. The strong association between endometrial cancer and modifiable risk factors means it is potentially amenable to intervention, with up to 60% of endometrial cancer cases thought to be preventable. 4,5 Endometrial cancer has a particularly strong association with obesity, with unopposed estrogen, chronic low-level inflammation and hyperinsulinaemia driving endometrial proliferation and the accumulation of cancer-forming genetic mutations. 6,7 Weight loss, increasing physical activity levels and exposure to progesterone could be used to reduce endometrial cancer risk by counteracting these effects.
For endometrial cancer prevention to be both clinically and cost-effective, however, it needs to be targeted at individuals at greatest disease risk. This was the most important unanswered research question identified by patients, carers and clinicians in a James Lind Alliance Priority Setting Partnership. 8 There have been two previous attempts to develop endometrial cancer multivariable prediction models for use in pre- and postmenopausal women. Both showed moderate to good discriminatory ability (C-statistic 0.68 and 0.77) but failed to include age as the most important variable. 1,9,10 Only one of the models was externally validated and significantly overestimated the number of endometrial cancers (expected to observed ratio [E/O] 1.20, 95% confidence interval [CI] 1.11-1.30). 10 The limited ethnic and health diversity of the datasets used for model development likely explains the poor generalisability observed. A third, recently developed model applicable to white postmenopausal women also exhibits the same risk of bias and poor calibration. 11 As a result, none of these models has been utilised in clinical or research practice despite women and GPs being overwhelmingly in favour of using such a tool. 12
The Predicting risk of endometrial cancer in asymptomatic women study (PRECISION) was designed to develop and externally validate a new endometrial cancer multivariable prediction model for use in asymptomatic women aged 45-60 years using robust methodology. 13 This age range would allow a 10-year interval before the age-related peak in endometrial cancer incidence for intervention to be implemented. 1,6 It also sought to compare the performance of the new model with those previously developed and hence determine its clinical utility as a decision-making tool for future endometrial cancer prevention trials.

| Population
A multivariable endometrial cancer prediction model was developed using data from the prospective UK Biobank cohort study (https://www.ukbiobank.ac.uk). This is a large-scale biomedical database containing in-depth information from half a million UK participants aged 39-71 years who were recruited from 22 centres throughout England, Scotland and Wales between 2006 and 2010. Participants were eligible for this study if female, aged 40-70 years and had not undergone a hysterectomy prior to study recruitment. Women diagnosed with endometrial cancer prior to or within 12 months of entering the cohort were excluded to reduce the risk of reverse causality. Prior history of other malignancies was not considered an exclusion criterion due to residual endometrial cancer risk. The eligibility criteria for the study were deliberately broad in order to utilise all available data within the dataset. Participants entered the cohort on the date of UK Biobank recruitment and were followed up until 31 March 2016 in England and Wales and 31 October 2015 in Scotland. Data were collected through nurse and self-administered standardised questionnaires and were linked to national cancer and death registries. Censoring occurred at date of endometrial cancer diagnosis, hysterectomy, death or last data collection.
External validation was undertaken in a retrospective observational cohort comprising data from Clinical Practice Research Datalink (CPRD) Gold and Aurum, linked to the National Cancer Registration and Analysis Service, Office for National Statistics mortality, Hospital Episode Statistics Admitted Patient Care and Index of Multiple Deprivation data. The CPRD contains clinical data inputted during routine appointments in primary care in the UK, engaging with GP practices utilising the Vision and EMIS electronic health record software. Eligibility criteria were similar to those used for model development, namely female, with an intact uterus and no history of endometrial cancer prior to or within 12 months of study entry. In addition, women had to be research-acceptable patients eligible for record linkage and have at least 12 months of follow-up prior to the study start date (1 January 2000), with at least 1 day of registration within the study period (1 January 2000-31 December 2018). The age range for inclusion in the external validation cohort was 45-60 years, as this is the target population for intervention to reduce endometrial cancer risk, allowing an approximate 10-year interval prior to the peak age of endometrial cancer diagnosis (60-70 years). Patients entered the cohort on 1 January 2000, the date of their 45th birthday or date of practice registration, whichever occurred later. Data were censored at the earliest occurrence of endometrial cancer diagnosis, hysterectomy, death, patient registration end date/transfer out date, practice last collection date or 10 years after cohort entry to mirror maximal follow-up in the UK Biobank.

| Outcomes
Incident endometrial cancers were identified in the UK Biobank using International Classification of Diseases (ICD) 9 or 10 or self-reported data. In the CPRD dataset, endometrial cancers were identified based on Read codes within the primary care record and ICD-10 codes documented in linked datasets (codes in Table S1). Histological sub-types of endometrial cancer were identified using ICD-O-3 codes (Table S1). Controls had no documented history of endometrial cancer.

| Model predictors
Predictors of endometrial cancer were identified through literature review and expert clinical opinion. 6 These included 16 variables covering patient demographics, body size, reproductive history, medical, family and medication history, smoking and glycaemic control (Table S2). Family history of endometrial cancer could not be included, as these data were not collected in the UK Biobank.
Each variable was separately coded as a data field within the UK Biobank at study recruitment. In the CPRD cohort, variables were defined by the occurrence of relevant Read or ICD-10 codes at any time point before study recruitment. The closest recorded measurement to study entry for height, weight and waist circumference was used. Duration of medication use was calculated based on the first and last recorded dates of prescription.

| Sample size
A sample size calculation for model development was performed in accordance with the methodology proposed by Riley et al. 14 and based on the external validation data of the PLCO/NIH-AARP model (C-statistic 0.68, event number 532, total cohort size 121 701). 10 Assuming a shrinkage factor of 0.9 and a maximum of 16 predictors in the final model, 88 889 participants would be required. For external validation, at least 200 events are recommended by Collins et al. 15 The actual sample sizes in both the UK Biobank and CPRD far exceeded these requirements.
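As a rough illustration of the shrinkage-based criterion in the Riley et al. approach, the sketch below computes the minimum sample size from the number of candidate predictors, the target shrinkage factor and an anticipated Cox-Snell R-squared. The function name and the R-squared value in the usage note are hypothetical; the text does not report the R-squared that was derived from the PLCO/NIH-AARP validation data, so this sketch does not attempt to reproduce the 88 889 figure.

```python
import math


def riley_min_sample_size(n_predictors: int, shrinkage: float, r2_cs: float) -> int:
    """Minimum n so that the expected uniform shrinkage is at least `shrinkage`.

    Uses n = p / ((S - 1) * ln(1 - R2_cs / S)), where p is the number of
    candidate predictor parameters, S the target shrinkage factor and
    R2_cs the anticipated Cox-Snell R-squared (an input assumed by the user).
    """
    if not 0.0 < shrinkage < 1.0:
        raise ValueError("shrinkage must lie strictly between 0 and 1")
    if not 0.0 < r2_cs < shrinkage:
        raise ValueError("r2_cs must lie strictly between 0 and shrinkage")
    n = n_predictors / ((shrinkage - 1.0) * math.log(1.0 - r2_cs / shrinkage))
    return math.ceil(n)
```

For example, `riley_min_sample_size(16, 0.9, 0.05)` uses the 16 predictors and shrinkage factor of 0.9 stated above together with a purely illustrative R-squared of 0.05. A smaller anticipated R-squared, as is typical for rare outcomes, gives a much larger required sample size.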

| Model development
Descriptive statistics were reported as medians, interquartile ranges and counts for non-parametric continuous and categorical data, respectively. All analyses were performed using STATA v.16. 16
Missing data in the UK Biobank were assessed to be missing at random and were managed with multiple imputation with chained equations. Ten imputations were generated including all potential model predictors and the Nelson-Aalen estimator for the baseline cumulative specific hazard. 17 The number of imputations was chosen to reflect the percentage of missing data (≤9%). 18 Complete predictors and the outcome (endometrial cancer of all histological sub-types) were included in the imputation model as auxiliary variables.
A multivariable prediction model was developed in each imputed UK Biobank dataset using Cox regression. Automated backward variable selection was used to minimise the Akaike information criterion (AIC). Continuous variables were centred about their mean and nonlinear terms were considered through natural logarithmic transformation and squaring. Interaction terms were also considered. A majority method approach was used for variable selection across multiple imputed data. 19
A flexible parametric survival model was built using the selected variables within each of the imputed datasets using the stpm2 package. 20 Variable coefficients and the linear predictor were pooled across imputed datasets using Rubin's rules.
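The two pooling steps described above can be sketched as follows. This is a minimal illustration, not the study code (the analysis was run in STATA): `majority_selected` keeps a variable if it was selected in at least a given share of imputed datasets, with the 0.5 threshold being an assumption about how the majority method was applied, and `pool_rubin` combines one coefficient across imputations using Rubin's rules. All function and variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, variance
from typing import Iterable, List, Sequence, Set, Tuple


def majority_selected(selections: Sequence[Set[str]], threshold: float = 0.5) -> List[str]:
    """Variables chosen in at least `threshold` of the imputed datasets."""
    m = len(selections)
    counts = Counter(v for chosen in selections for v in chosen)
    return sorted(v for v, c in counts.items() if c / m >= threshold)


def pool_rubin(estimates: Iterable[float], variances: Iterable[float]) -> Tuple[float, float]:
    """Pool one coefficient across imputations with Rubin's rules.

    Returns (pooled estimate, total variance), where total variance is
    within-imputation variance + (1 + 1/m) * between-imputation variance.
    """
    est = list(estimates)
    var = list(variances)
    m = len(est)
    pooled = mean(est)
    within = mean(var)
    between = variance(est)  # sample variance (divides by m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total
```

With ten imputations, as used here, `pool_rubin` would be called once per coefficient with ten estimates and their ten squared standard errors.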
Apparent performance of the model in predicting any endometrial cancer was assessed using calibration plots and the pmcalplot package 21 and averaged across all imputed datasets. Internal validation within the UK Biobank was undertaken with 100 bootstrap samples, with exact replication of model development. Variable coefficients in the original model were adjusted for optimism (average difference in apparent performance of each bootstrap model within the bootstrap sample and raw data) and an average adjusted baseline survival at 10 years determined across the original imputed datasets.
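The optimism calculation described above can be sketched for a single performance measure such as the C-statistic. This is a simplified illustration under the assumption that each bootstrap model has already been fitted and scored twice (once in its own bootstrap sample, once in the original data); the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def optimism_corrected(
    apparent: float,
    boot_in_sample: Sequence[float],
    boot_in_original: Sequence[float],
) -> float:
    """Subtract the average bootstrap optimism from the apparent performance.

    Optimism for one bootstrap model is its performance in the bootstrap
    sample minus its performance in the original (raw) data.
    """
    if len(boot_in_sample) != len(boot_in_original):
        raise ValueError("one pair of scores is needed per bootstrap sample")
    optimism = mean(s - o for s, o in zip(boot_in_sample, boot_in_original))
    return apparent - optimism
```

In the study, 100 bootstrap samples were used, so each list would hold 100 values.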

| External validation
Missing data in the CPRD and linked datasets were assumed to be missing not at random on the basis that the absence of a condition is rarely documented in primary care. A risk factor absent/mean imputation approach was taken. Women without documented height and weight measurements were assumed to have a 'normal' body mass index (BMI) 22 and were allocated the mean female BMI (26.7) according to the Health Survey for England (HSE) 2000. 23 Missing waist circumference values were replaced with the mean measurement for women collected in HSE 2012. 24 The mean age at menarche of women in the UK Biobank was used for all women (12.8 years), as this is only recorded in primary care records if pathological. Women with missing menopause data were assumed to be premenopausal if <51 years at study recruitment and postmenopausal if ≥51 years. Women with no documented deliveries or children were assumed to be nulliparous and those with no documented smoking status were assumed to be non-smokers. This imputation approach in CPRD matches how the model would be anticipated to be used in practice. 25 Sensitivity analyses replacing missing data with imputed measurements of BMI within the 'normal' range (20-25), the mean waist circumference of women in the UK Biobank and age at menarche adjusted by the standard deviation in the UK Biobank were undertaken.
For external validation, the final model was applied to the CPRD dataset. Model calibration was assessed by comparing expected and observed predictions ('calibration in the large'), determining the calibration slope and visually using calibration plots. Discrimination was assessed with the C-statistic. Model performance was compared for prediction of any endometrial cancer and non-endometrioid endometrial cancers (clear cell, serous, carcinosarcoma).
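The risk factor absent/mean imputation rules described for the CPRD cohort can be expressed as a small function. This is an illustrative sketch only: the `Record` class and its field names are hypothetical, waist circumference is omitted because the HSE 2012 mean is not quoted in the text, and the constants 26.7 and 12.8 are the values stated above.

```python
from dataclasses import dataclass, replace
from typing import Optional

MEAN_FEMALE_BMI = 26.7       # Health Survey for England 2000, as stated in the text
MEAN_AGE_AT_MENARCHE = 12.8  # UK Biobank mean, applied to all women
MENOPAUSE_CUTOFF_AGE = 51    # premenopausal if younger, postmenopausal otherwise


@dataclass(frozen=True)
class Record:
    age: float
    bmi: Optional[float] = None
    age_at_menarche: Optional[float] = None
    postmenopausal: Optional[bool] = None
    parous: Optional[bool] = None
    ever_smoker: Optional[bool] = None


def impute_risk_factor_absent(r: Record) -> Record:
    """Fill gaps using the 'risk factor absent / mean' rules; keep recorded values."""
    return replace(
        r,
        bmi=MEAN_FEMALE_BMI if r.bmi is None else r.bmi,
        # The text applies the mean age at menarche to every woman.
        age_at_menarche=MEAN_AGE_AT_MENARCHE,
        postmenopausal=(r.age >= MENOPAUSE_CUTOFF_AGE)
        if r.postmenopausal is None
        else r.postmenopausal,
        parous=False if r.parous is None else r.parous,
        ever_smoker=False if r.ever_smoker is None else r.ever_smoker,
    )
```

Recorded values are never overwritten except for age at menarche, which follows the text in being set to the same value for every woman.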
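The two headline validation measures can be illustrated with a deliberately simplified sketch. It treats the outcome as a binary indicator at 10 years and ignores censoring, which the study's survival-based estimates would have accounted for, so it is an approximation for intuition only. Function names are hypothetical.

```python
from typing import Sequence


def expected_observed_ratio(predicted_risk: Sequence[float], event: Sequence[int]) -> float:
    """Sum of predicted risks divided by the number of observed events."""
    return sum(predicted_risk) / sum(event)


def c_statistic(predicted_risk: Sequence[float], event: Sequence[int]) -> float:
    """Share of case/non-case pairs in which the case has the higher predicted risk.

    Ties in predicted risk count as half-concordant. Censoring is ignored.
    """
    concordant = 0.0
    pairs = 0
    for p_case, e_case in zip(predicted_risk, event):
        if e_case != 1:
            continue
        for p_ctrl, e_ctrl in zip(predicted_risk, event):
            if e_ctrl != 0:
                continue
            pairs += 1
            if p_case > p_ctrl:
                concordant += 1.0
            elif p_case == p_ctrl:
                concordant += 0.5
    return concordant / pairs
```

An E/O ratio above 1 indicates that the model predicts more events than were observed, and a C-statistic of 0.5 corresponds to no discrimination.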

| Alternative models
As primary care datasets frequently do not contain data for some of the variables included in the PRECISION model, an alternative (basic) model based on current age and BMI was developed for comparison.
The two previously developed models for use in both pre- and postmenopausal women (EPIC and PLCO/NIH-AARP models) were also externally validated within the UK Biobank and CPRD datasets. 9,10 The baseline survival functions and linear predictor equations for these models are provided in Table S3.

| Clinical utility
The clinical utility of the PRECISION model was determined through decision curve analysis, using the dca package. The net benefit of the model was compared against the strategies of offering endometrial cancer prevention to all or no women and visualised on dca graphs across a prespecified limited range of thresholds (0-5%). Clinical utility was also compared with the basic, EPIC and PLCO/NIH-AARP models. The clinical relevance of the four models was appraised by comparing their positive and negative predictive values across a range of risk thresholds in the CPRD cohort.
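The net benefit quantity underlying a decision curve can be sketched as below. This uses the standard definition (true positives minus false positives weighted by the odds of the threshold, both per person) on a binary outcome and ignores censoring, so it is a simplification of what the dca package computes for survival data. Function names are hypothetical.

```python
from typing import Sequence


def net_benefit(predicted_risk: Sequence[float], event: Sequence[int], threshold: float) -> float:
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(predicted_risk)
    tp = sum(1 for p, e in zip(predicted_risk, event) if p >= threshold and e == 1)
    fp = sum(1 for p, e in zip(predicted_risk, event) if p >= threshold and e == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(event: Sequence[int], threshold: float) -> float:
    """Net benefit of offering the intervention to every woman."""
    prevalence = sum(event) / len(event)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

The 'treat none' strategy has a net benefit of zero by definition, so a model is useful at a given threshold when its net benefit exceeds both zero and the 'treat all' value.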

| Patient and public involvement
A James Lind Alliance Womb Cancer Priority Setting Partnership was established in 2016 to identify the top 10 research priorities in the field. 8 Patients, carers, academics and clinicians proposed potential questions, gathered existing evidence and prioritised the outstanding questions during a consensus meeting using modified nominal group methodology. The top unanswered research question was 'is it possible to develop a personalised risk score which reflects a woman's individual risk of developing endometrial cancer?' This project was designed to address this research question.

| RESULTS
In total, 222 031 women were included in the UK Biobank cohort for model development, of whom 902 (0.41%) developed endometrial cancer during a median follow-up of 7.1 years (interquartile range [IQR] 6.4-7.7 years) across the whole cohort. Overall, 3 094 371 women were included in the CPRD cohort for external validation (512 083 CPRD Gold, 2 582 291 CPRD Aurum), with an incidence of endometrial cancer of 0.28% (n = 8585) during a median follow-up of 10.0 years (IQR 6.2-10.0 years). Of these, 397 (4.6%) were diagnosed with non-endometrioid endometrial cancers. Figure 1 describes the flow of study participants in both cohorts. Baseline characteristics are compared in Table 1. The CPRD cohort was more ethnically diverse, lived in more deprived areas of the country and had higher rates of overweight and obesity than the UK Biobank cohort, but was also younger.

| Model development
The final model included 12 variables (Table S4), with the final equation given in Equation 1. The raw coefficients and coefficients adjusted for optimism are provided in Table S4b.
where age is current age, late menopause is 0 if premenopausal or periods stopped <55 years, and 1 if still having periods and ≥ 55 years or ≥ 55 years at time periods stopped; HRT use is 0 if never used or prior use of HRT and 1 if current use of HRT; OCP use is 0 if never used or <5 years of oral contraceptive pill use and 1 if ≥5 years of oral contraceptive pill use (either combined or progesterone only); tamoxifen is 0 if never or prior use of tamoxifen and 1 if current use of tamoxifen; t2d is 0 if no diagnosis of type 2 diabetes and 1 if diagnosis of type 2 diabetes; smoking is 0 if current or former smoking and 1 if never smoked; fhx bowel cancer is 0 if no family history of bowel cancer and 1 if at least one first-degree relative diagnosed with bowel cancer.
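The indicator definitions above can be written as a small encoding function. This sketch covers only the binary predictors defined in this paragraph and makes no attempt to reproduce the model equation or its coefficients, which are given in Equation 1 and Table S4. The function and argument names are hypothetical.

```python
from typing import Dict, Optional


def encode_binary_predictors(
    age: float,
    age_periods_stopped: Optional[float],  # None if still having periods
    current_hrt: bool,
    ocp_years: float,
    current_tamoxifen: bool,
    type2_diabetes: bool,
    never_smoked: bool,
    first_degree_relatives_bowel_cancer: int,
) -> Dict[str, int]:
    """Map raw answers to the 0/1 indicators defined in the text."""
    if age_periods_stopped is None:
        # Still having periods: late menopause only if currently aged 55 or over.
        late_menopause = age >= 55
    else:
        late_menopause = age_periods_stopped >= 55
    return {
        "late_menopause": int(late_menopause),
        "hrt_use": int(current_hrt),               # 0 for never or prior use
        "ocp_use": int(ocp_years >= 5),            # 0 for never or <5 years
        "tamoxifen": int(current_tamoxifen),       # 0 for never or prior use
        "t2d": int(type2_diabetes),
        "smoking": int(never_smoked),              # 1 = never smoked, 0 = current/former
        "fhx_bowel_cancer": int(first_degree_relatives_bowel_cancer >= 1),
    }
```

Note that the smoking indicator is coded in the opposite direction to the others: 1 denotes never having smoked.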
No violations of the proportional hazards assumptions were detected. A significant interaction was detected between BMI and HRT use and this was forced into the final model. 7 The baseline hazard function was most accurately modelled using 5 degrees of freedom (df). An example calculation of 10-year endometrial cancer risk is shown in Material S1.
At 10 years, the apparent calibration plot in the model development data showed excellent calibration and good discrimination (calibration slope 1.000, C-statistic 0.741, Figure S1). The optimism-adjusted performance of the model in the UK Biobank resulted in a C-statistic of 0.74 (95% confidence interval [CI] 0.74-0.76) and a calibration slope of 1.03 (95% CI 1.01-1.06) (Figure S2).
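For orientation, a 10-year risk is usually obtained from a proportional hazards model by combining the baseline survival at 10 years with the individual's linear predictor. The sketch below assumes the flexible parametric model was fitted on the cumulative hazard scale, which the text does not state explicitly, and uses made-up values for the baseline survival and linear predictor rather than the figures in Table S4 or Material S1.

```python
import math


def ten_year_risk(baseline_survival_10y: float, linear_predictor: float) -> float:
    """Risk = 1 - S0(10) ** exp(LP) under a proportional hazards assumption."""
    if not 0.0 < baseline_survival_10y <= 1.0:
        raise ValueError("baseline survival must lie in (0, 1]")
    return 1.0 - baseline_survival_10y ** math.exp(linear_predictor)


# Hypothetical inputs for illustration only (not the PRECISION estimates).
example_risk = ten_year_risk(0.995, math.log(2.0))
```

A linear predictor of zero returns the baseline risk, and each unit increase multiplies the cumulative hazard by e.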

| External validation
The mean and standard deviation (SD) of the adjusted linear predictor in the CPRD cohort were −0.55 (SD 0.59) compared with −0.20 (SD 0.87) in the UK Biobank. Histograms of predicted risk in the two datasets are provided in Figure S3.
Model calibration for predicting all endometrial cancers remained very good, with an expected to observed ratio at 10 years of 1.03 (95% CI 1.01-1.05) in the CPRD cohort and a calibration slope of 1.14 (95% CI 1.11-1.17) (Figure 2). A small reduction in discrimination was observed (C-statistic 0.70, 95% CI 0.69-0.70). The model demonstrated similar performance in predicting non-endometrioid endometrial cancers (C-statistic 0.72, 95% CI 0.69-0.74; calibration slope 1.14, 95% CI 1.00-1.28) as it did all types of endometrial cancer, albeit with much wider confidence intervals given the small number of cases. Sensitivity analyses revealed no change in model performance with varying input of BMI, waist circumference and age at menarche.

| Alternative models
The basic model had a final equation: This had similar calibration and only slightly lower discriminatory ability than the full model (Figure S4; UK Biobank C-statistic 0.71, calibration slope 1.00; CPRD C-statistic 0.69, 95% CI 0.68-0.69, calibration slope 0.93, 95% CI 0.90-0.95). Using a ≥1% risk to denote high-risk women, the basic model re-classified 1.3% of women in the CPRD dataset as high risk, leading to an increase in discrepancy between predicted and observed risk (observed risk 3.1%; predicted risk 3.1% with the PRECISION model and 4.4% with the basic model).

| Clinical utility
Decision curve analysis showed that offering targeted endometrial cancer prevention based on the PRECISION model was superior to offering intervention to all or no women, between a predicted 10-year endometrial cancer risk of 0-1% (Figure 3A). The PRECISION model had greater net benefit than the EPIC model across all threshold probabilities and the PLCO/NIH-AARP model above a threshold of 0.4% (Figure 3B). The basic model has similar potential clinical utility to the PRECISION model despite slightly poorer discrimination.
While the PLCO/NIH-AARP model had the highest negative predictive value (NPV), this was at the expense of the lowest positive predictive value (PPV 0.005, NPV 1.000, Table S5). The PRECISION model also had an excellent negative predictive value (0.997-0.998) but improved positive predictive value (0.007-0.030) compared with both the NIH-AARP and basic models.

| Principal findings
A well-calibrated model based on routinely recorded information and with moderate to good discrimination has been developed to determine a woman's 10-year risk of endometrial cancer and has been externally validated in a population-based cohort. The model demonstrated a small improvement in ability to discriminate between those at high and low risk of disease compared with previously published models and improvement in calibration across the risk spectrum resulting in superior net benefit. It has excellent negative predictive value, offering appropriate reassurance to low-risk women and missing few endometrial cancer cases, an essential prerequisite for a cancer risk prediction tool. A basic model, based on age and BMI alone, had similar potential clinical utility despite poorer discrimination and would be a reasonable alternative in datasets containing large amounts of missing data on reproductive variables.

| Strengths and limitations
The strengths of this work lie in the cohorts selected for model development and external validation. The large number of events and adjustment for optimism reduced the risk of model over-fitting and use of the CPRD allowed us to investigate performance within the model's anticipated target population. Despite the optimism adjustment, there is evidence of model under-fitting in the CPRD, reflecting the differences in demographics of the two datasets used and 'healthy bias' of the UK Biobank cohort. 26 Notwithstanding this, the PRECISION model demonstrated superior calibration compared with existing published models. This is likely because of the use of a flexible parametric survival model, which allows more accurate modelling of the baseline hazard function compared with a Cox proportional hazards model. 27 The model has, however, been solely developed and tested within UK datasets and has not been calibrated for use outside of this population. It could easily be re-calibrated, though, for use in other countries and we would welcome the opportunity to do so. It would also be of interest to determine the additive value of incorporating a polygenic risk score into the model. 28 Importantly, the model is designed to predict the risk of any histological type of endometrial cancer and performs similarly well in predicting risk of non-endometrioid endometrial cancers, which have a much poorer prognosis.
Note: This relates to predictors for which an assumption has been made that the absence of documentation of their presence within the CPRD dataset implies that an individual does not have a particular diagnosis, has never used a specific drug or has never smoked. For these predictors, there are therefore no missing data within the CPRD.
Abbreviations: CPRD, Clinical Practice Research Datalink; n/a, not applicable.
While primary care datasets provide an excellent resource in terms of size, length of follow-up and representation of the population, they do present some limitations. 29 Data were originally collected by GPs for a clinical rather than academic purpose, which potentially influences their quality and completeness. Linkage of individual medical records to national datasets was performed to circumvent these issues and maximise data quality. We can be confident, therefore, that all endometrial cancer cases were correctly identified and the date of diagnosis accurately recorded. Widely inclusive Read code lists for predictors were developed to try to avoid mis-classification but may still have been inadequate. There were considerable missing data on reproductive variables and waist circumference, as these are not routinely recorded. Population-level average data were imputed to avoid excluding large numbers of participants, with sensitivity analyses suggesting minimal impact on overall model performance. Participants in the UK Biobank study may have contributed data to the CPRD dataset but could not be identified due to the use of anonymised records. The numbers are likely to be small for CPRD Gold (UK population coverage 4.6%) but could be higher for CPRD Aurum (coverage 19.9%).

| Interpretation
Both the EPIC and PLCO/NIH-AARP models demonstrated similar discriminatory ability to the PRECISION model when externally validated here, which is perhaps unsurprising given the number of shared predictors between them. The clinical utility of the PRECISION model is a reflection, therefore, of its better calibration. The use of a flexible parametric survival model and censoring of individuals at the time of hysterectomy when they are no longer at risk of the disease could explain this. The PLCO/NIH-AARP model, on the other hand, used national endometrial cancer incidence data to determine its baseline survival hazard and adjusted for population-level rates of hysterectomy. The EPIC model did not perform as well in the UK Biobank and CPRD datasets as previously reported; however, this is the first time this model has been externally validated. 9,30 The recently published E2C2 model was not validated in this project, as the target population (white, postmenopausal women) was felt to be too restrictive and could prevent premenopausal and non-white women from accessing endometrial cancer prevention strategies that would otherwise be of value to them. 11
It has become increasingly evident that primary prevention strategies are required to reduce the ever-increasing number of endometrial cancer diagnoses. Directing these towards women at greatest risk of the disease will maximise benefit whilst minimising harm caused by unnecessary intervention. The PRECISION model provides researchers and primary care doctors with a tool that determines a woman's 10-year endometrial cancer risk, allowing better-informed decision making about lifestyle and contraceptive choices that could reduce this risk. To establish the true acceptability and performance of the model, it will, initially, be used to determine eligibility into endometrial cancer prevention trials of weight loss and the Mirena coil. The impact of undergoing a risk assessment on cancer worry, anxiety and behaviour will also be assessed. These data are required for both the clinical and cost-effectiveness of the model to be determined, a necessary prerequisite for it to be incorporated into primary care.

| CONCLUSIONS
The PRECISION model is a well-calibrated tool able to discriminate between women at high and low risk of endometrial cancer and to quantify an individual woman's risk of developing the disease within the next 10 years. It could easily be rolled out into primary care and used to target endometrial cancer prevention resources at those at greatest risk.

AUTHOR CONTRIBUTIONS
SJK, EJC and DGE conceptualised the study. SJK, AL, KRM, DA, EK and GPM were involved in writing and implementing computer code. SJK, AL, EK and GPM undertook the formal analysis. SJK, AL, KRM, DA, EK and GPM provided computing resources. SJK and DA were project administrators and undertook data curation. EJC and DGE supervised the project. SJK acquired funding for the project and wrote the original draft. All authors were involved in developing the methodology employed in the project, reviewing and editing the final draft.

ACKNOWLEDGEMENTS
We would like to thank the UK Biobank, CPRD and NHS Digital teams for their assistance in linking and supplying the data used in this analysis. We are also grateful to the individuals who participated in the UK Biobank study and the NHS patients who consented to have their anonymised routine medical records used for research. This study is based on data from the Clinical Practice Research Datalink obtained under licence from the UK Medicines and Healthcare products Regulatory Agency. The data are provided by patients and collected by the NHS as part of their care and support. The Office for National Statistics (ONS) is the provider of the ONS data contained within the CPRD data. Hospital Episode Data and the ONS data, ©2020, have been re-used with the permission of The Health & Social Care Information Centre. Cancer register data have been used with the permission of Public Health England (2020). All rights reserved. The study was approved by the independent scientific advisory committee for CPRD research (Independent Scientific Advisory Committee approval 20_000087).

DATA AVAILABILITY STATEMENT
Data were obtained from the UK Biobank and CPRD upon approval of the project application by the relevant bodies and in the context of an institutional licence. Requests for data sharing should be made directly to the UK Biobank and CPRD. Software code is available from the corresponding author on request.

ETHICS STATEMENT
The UK Biobank received ethical approval from the North West Multi-Centre Research Ethics Committee (16/NW/0274), Patient Information Advisory Group (England and Wales) and the Community Health Index Advisory Group (Scotland). The CPRD was approved by East Midlands-Derby Research Ethics Committee (21/EM/0265). This study protocol was approved by the UK Biobank and CPRD independent scientific advisory committees before receipt of data, which were fully anonymised.

FIGURE 1
Flow chart of participants in both the model development and external validation cohorts.

TABLE 1
Descriptive statistics for the model development and external validation cohorts provided for cases and the full cohort. Data are presented as median (IQR) or number (percentage).

FIGURE 2
Calibration plots and performance statistics for the PRECISION endometrial cancer prediction model at 10 years on external validation in the CPRD. Groups represent 10ths of predicted risk with 95% confidence intervals. The spike plot illustrates the events and nonevents at 10 years according to predicted risk. CPRD, Clinical Practice Research Datalink.

FIGURE 3
Decision curve analysis demonstrating the net benefit of using the different prediction models to identify women at high risk of endometrial cancer for targeted intervention across different threshold probabilities. (A) PRECISION model only. (B) PRECISION, basic, EPIC and PLCO/NIH-AARP models. EC, endometrial cancer.

FUNDING INFORMATION
SJK is a National Institute for Health and Care Research (NIHR) Academic Clinical Lecturer and the recipient of a Wellbeing of Women Postdoctoral Research Fellowship (grant number PRF101). EJC is supported by a NIHR Advanced Fellowship (grant number NIHR300650) and the NIHR Manchester Biomedical Research Centre (grant number IS-BRC-1215-20007). AL and KM are supported by the European Union's funded Project iHELP (grant number 101017441). DMA is supported by the NIHR Greater Manchester Patient Safety Translational Research Centre (grant number PSTRC-2016-003). The views expressed are those of the author(s) and not necessarily those of the NIHR, MHRA or the Department of Health and Social Care.

CONFLICT OF INTEREST STATEMENT
None declared.