Updating M6 pregnancy of unknown location risk‐prediction model including evaluation of clinical factors

Ectopic pregnancy (EP) is a major high‐risk outcome following a pregnancy of unknown location (PUL) classification. Biochemical markers are used to triage PUL as high vs low risk to guide appropriate follow‐up. The M6 model is currently the best risk‐prediction model. We aimed to update the M6 model and evaluate whether performance can be improved by including clinical factors.


Methods
This prospective cohort study recruited consecutive PUL between January 2015 and January 2017 at eight units (Phase 1), with two centers continuing recruitment between January 2017 and March 2021 (Phase 2). Serum samples were collected routinely and sent for β-human chorionic gonadotropin (β-hCG) and progesterone measurement. Clinical factors recorded were maternal age, pain score, bleeding score and history of EP.

Based on transvaginal ultrasonography and/or biochemical confirmation during follow-up, PUL were classified subsequently as failed PUL (FPUL), intrauterine pregnancy (IUP) or EP (including persistent PUL (PPUL)). The M6 models with (M6 P ) and without (M6 NP ) progesterone were refitted and extended with clinical factors. Model validation was performed using internal-external cross-validation (IECV) (Phase 1) and temporal external validation (EV) (Phase 2). Missing values were handled using multiple imputation.
Results
Overall, 5473 PUL were recruited over both phases. A total of 709 PUL were excluded because maternal age was < 16 years or initial β-hCG was ≤ 25 IU/L, leaving 4764 (87%) PUL for analysis (2894 in Phase 1 and 1870 in Phase 2). For the refitted M6 P model, the area under the receiver-operating-characteristics curve (AUC) for EP/PPUL vs IUP/FPUL was 0.89 for IECV
Serum β-hCG and progesterone are used to triage PUL as high or low risk to decide appropriate follow-up [11][12][13]. Single discriminatory zones for β-hCG or progesterone have been described to designate high-risk PUL when marker levels exceed specific thresholds 8,14. β-hCG ratios, calculated using serial levels taken 48 h apart, define high risk 15 as values of 0.87-1.66. Whilst single β-hCG levels cannot discriminate between high- and low-risk cases, single progesterone measurements and β-hCG ratios have diagnostic ability, each having an area under the receiver-operating-characteristics curve (AUC) of 0.69 for the prediction of high-risk outcome 16,17.
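The ratio-based rule described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the ratio is taken as the 48-h level divided by the 0-h level (as defined later in the modeling section), and treating the bounds 0.87 and 1.66 as inclusive is an assumption, since the text does not say how boundary values are handled.

```python
def hcg_ratio(hcg_0h, hcg_48h):
    """Return the beta-hCG ratio: 48-h level divided by 0-h level."""
    if hcg_0h <= 0:
        raise ValueError("0-h beta-hCG must be positive")
    return hcg_48h / hcg_0h


def in_high_risk_ratio_zone(ratio, lower=0.87, upper=1.66):
    """True if the ratio lies in the high-risk range quoted in the text.

    Inclusive bounds are an assumption made for this sketch.
    """
    return lower <= ratio <= upper


# Hypothetical patient: 1000 IU/L at 0 h, 1200 IU/L at 48 h.
ratio = hcg_ratio(1000.0, 1200.0)
print(ratio, in_high_risk_ratio_zone(ratio))
```

A ratio well above the range (e.g. a doubling) or well below it (e.g. a halving) falls outside the zone under this rule.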
Multivariable risk-prediction models have superseded the abovementioned strategies. The M4 model, based on initial β-hCG and the β-hCG ratio, had an AUC of 0.82 when validated on 1962 cases of PUL 15,[17][18][19]. However, the M6 model, incorporating initial β-hCG, β-hCG ratio and optional initial progesterone, is currently the most accurate form of risk prediction. In a large validation study, AUCs of 0.86 (excluding progesterone) and 0.89 (including progesterone) were reported [19][20][21]. Moreover, risks estimated by M6 are also more reliable.
The M6 model is based on PUL data from two hospitals between 2003 and 2013. Given population heterogeneity between hospitals and over time, it is good practice to update models using more recent data collected from multiple hospitals 22,23. In addition, efforts to extend models with important new predictors can increase performance and utility. Previous EP is a risk factor for recurrent EP, whilst maternal age, pelvic pain and vaginal bleeding have been investigated previously, either independently or in addition to β-hCG [24][25][26][27][28][29].
Our objectives were to update M6 using a large prospective multicenter dataset and to evaluate the value of adding clinical factors to M6.

Study design, setting and participants
This prospective cohort study consecutively recruited women classified with PUL following initial transvaginal ultrasound (TVS) at eight early pregnancy assessment units (EPAU) in London, UK, between January 2015 and January 2017 for Phase 1 (n = 3266), and two units continued recruiting until March 2021 for Phase 2 (St Mary's Hospital (SMH) and Queen Charlotte's and Chelsea Hospital (QCCH)) (n = 2207). Women were eligible if they presented following spontaneous or assisted conception with a positive urine pregnancy test (UPT) and were classified with a PUL following TVS examination. Women were excluded if they were less than 16 years of age or if β-hCG was ≤ 25 IU/L (i.e. the level below which a UPT is negative).
Data were collected for clinical purposes and for use with the M6 model after receiving local audit approval at all sites. Following discussions with a research ethics committee, formal consent was not required as the M6 model is published and validated externally 20,30,31. Data were collated for research analysis as part of an ethically approved database study designed to evaluate and improve M6 modeling strategies using anonymized clinical, ultrasound and biochemical data. This was titled Improving pregnancy of unknown location risk and outcome prediction (IMPPREP) (reference: 21/HRA/0260) and was approved by the UK Health Research Authority and the Health and Care Research Wales. Patients were not involved in the study design.
We report this study according to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guideline and its extension for clustered data 32,33.

Data collection
Serum samples were collected routinely and sent for laboratory measurement of β-hCG and progesterone at the first EPAU visit when a classification of PUL was made. The recommendation to take a second serum sample to measure β-hCG 48 h later was issued according to the two-step strategy (Figure 1), i.e. if the initial progesterone was very low (≤ 2 nmol/L) at Step 1, a second β-hCG reading was often not taken and a UPT was advised for 2 weeks' time to confirm pregnancy resolution 20,21. Serum progesterone levels were not considered in women using progesterone supplements; therefore, in cases for which progesterone levels were not suitable or unavailable, Step 1 was skipped and the M6 model without progesterone (M6 NP) was used at Step 2 (Figure 1). Samples taken from women with PUL were analyzed by each hospital laboratory using calibrated and validated assays subject to rigid internal and external quality control and assurance. These were not standardized across all units; samples were usually processed using the Architect (Abbott, Abbott Park, IL, USA) or Access (Beckman Coulter, Brea, CA, USA) chemiluminescent immunoassays. Results were made available to clinicians through electronic reporting systems. Metadata were collected via clinical questionnaires completed by each patient at their first EPAU visit when a classification of PUL was made. Women documented their most extreme pain and bleeding using a numerical pain scale and pictographic bleeding score, in addition to their age and EP history. Datasheets containing the M6 model algorithm were used in each unit, with one of three management strategies (repeat β-hCG and TVS in 48 h, perform UPT in 2 weeks or perform TVS in 1 week) advised based on model risk calculation (Figure 1) [19][20][21]. All data were anonymized prior to collation into the single dataset utilized for this work focusing on risk-prediction performance.

Outcome
PUL outcome was confirmed by TVS and/or biochemistry, with the timing for each case dependent on when a pregnancy structure was visualized and/or levels of progesterone/β-hCG: (1) IUP, when a gestational sac was seen within the endometrial cavity on TVS; (2) EP, when a mass external to the endometrial cavity was noted on TVS (which was either solid or contained a gestational sac with or without a yolk sac, fetal pole or cardiac activity); (3) PPUL, when pregnancy location was never confirmed on TVS and at least three serial β-hCG values were reported that varied by < 15% each time; or (4) FPUL, when pregnancy location was never confirmed on TVS and spontaneous decline in serum β-hCG or a negative UPT 2 weeks after initial assessment was reported 1,[6][7][8][9][10]. PPUL and EP were deemed high-risk outcomes for the analysis, as we did not, by definition, know which PPUL were in fact small EP not visualized on TVS.

Predictors, modeling strategy and sample size
Models were developed on Phase-1 data using random-effects multinomial logistic regression. Six models were developed: M6 P update, M6 NP update, M6 P Extension 1, M6 NP Extension 1, M6 P Extension 2 and M6 NP Extension 2. Updated M6 P and M6 NP models used exactly the same predictor variables as the original versions: log (β-hCG at 0 h); log (β-hCG ratio) (defined as β-hCG at 48 h/β-hCG at 0 h); log (progesterone at 0 h); and the interaction term, log (β-hCG ratio) × log (progesterone at 0 h) 21. The last two variables were used only for M6 P. For Extension 1 of both models, the following clinical variables were added: patient age (years), patient age squared, vaginal bleeding score (0-4) and history of EP (yes/no). The addition of age squared was based on previous work 29. For Extension 2 of both models, pain score as a binary (> 0 vs 0) and numerical (0-10) variable were added in addition to the variables for Extension 1. Sample size calculation is discussed in Appendix S1.
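To make the predictor definitions concrete, the following Python sketch assembles the predictor values for the update and Extension 1 models from raw measurements. It is an illustration only: the function and key names are hypothetical, the natural logarithm is assumed (the text says only "log"), and no model coefficients are included because they are reported in Table S3, which is not reproduced here.

```python
import math


def m6_predictors(hcg_0h, hcg_48h, progesterone_0h=None,
                  age_years=None, bleeding_score=None, history_of_ep=None):
    """Build predictor values as described in the text (hypothetical helper).

    Natural log is an assumption. Progesterone terms are added only when a
    progesterone value is supplied (M6 P); otherwise the M6 NP set is returned.
    Clinical variables (Extension 1) are added only when all are supplied.
    """
    log_ratio = math.log(hcg_48h / hcg_0h)
    predictors = {
        "log_hcg_0h": math.log(hcg_0h),
        "log_hcg_ratio": log_ratio,
    }
    if progesterone_0h is not None:
        log_prog = math.log(progesterone_0h)
        predictors["log_progesterone_0h"] = log_prog
        predictors["log_ratio_x_log_progesterone"] = log_ratio * log_prog
    if None not in (age_years, bleeding_score, history_of_ep):
        predictors["age"] = float(age_years)
        predictors["age_squared"] = float(age_years) ** 2
        predictors["bleeding_score"] = float(bleeding_score)  # 0-4 scale
        predictors["history_of_ep"] = 1.0 if history_of_ep else 0.0
    return predictors


# Hypothetical inputs, chosen only to show the structure of the output.
print(m6_predictors(1000.0, 2000.0, progesterone_0h=10.0,
                    age_years=30, bleeding_score=2, history_of_ep=False))
```

Extension 2 would add the two pain-score variables (binary > 0 vs 0, and numerical 0-10) in the same way.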

Figure 1 Flowchart of two-step triage protocol for managing women classified with pregnancy of unknown location (PUL), consisting of progesterone analysis at Step 1 followed by M6 risk-prediction model, incorporating (M6 P) or excluding (M6 NP) progesterone, at Step 2. *If progesterone level not suitable (e.g. due to use of progesterone supplements) or unavailable (e.g. due to financial or guideline constraints on performing progesterone assay), Step 1 is skipped and M6 NP model is used at Step 2, once 0-h and 48-h β-human chorionic gonadotropin (β-hCG) levels are obtained. FPUL, failed PUL; IUP, intrauterine pregnancy; TVS, transvaginal ultrasound; UPT, urine pregnancy test.

Model validation and performance assessment
Model validation was carried out twice: once on the model development data, using a procedure called internal-external cross-validation (IECV); and once on the data from Phase 2, as a temporal external validation (EV) 34. IECV is a cross-validation procedure in which folds correspond to centers. For every center, models were retrained on data from all other centers and tested on the left-out center. In this mixture of standard k-fold cross-validation and EV, the expected performance of the modeling procedure on new centers was assessed. The second validation (EV) involved the testing of the models obtained from all Phase-1 data on more recent data from two of the eight centers.
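The fold structure of IECV can be illustrated with a short Python sketch. It generates only the train/test index splits, one per center; model fitting and performance estimation are omitted. The function name and the toy center labels are hypothetical and are not taken from the study, whose analyses were run in R.

```python
from collections import defaultdict


def iecv_splits(center_labels):
    """Yield (held_out_center, train_indices, test_indices), one per center.

    Each center in turn is left out for testing, and all remaining centers
    form the training set, as in leave-one-center-out cross-validation.
    """
    by_center = defaultdict(list)
    for index, center in enumerate(center_labels):
        by_center[center].append(index)
    for held_out in sorted(by_center):
        test_indices = by_center[held_out]
        train_indices = sorted(
            i
            for center, indices in by_center.items()
            if center != held_out
            for i in indices
        )
        yield held_out, train_indices, test_indices


# Toy example with three hypothetical centers.
centers = ["A", "A", "B", "C", "B"]
for held_out, train, test in iecv_splits(centers):
    print(held_out, train, test)
```

With eight centers, as in Phase 1, this yields eight splits, and the center-specific results can then be pooled.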
We assessed discrimination by calculating pairwise AUCs (or c-statistics) for IUP vs FPUL using the conditional-risk method, and binary c-statistic for EP/PPUL vs other 35. We also calculated the polytomous discrimination index (PDI), an extension of the binary c-statistic for more than two outcomes 36. Unlike the c-statistic, its value lies between 0.33 (models for three outcomes without any discriminatory ability) and 1 (perfect discrimination). Calibration performance for the risk of EP/PPUL was assessed by the calibration intercept, calibration slope and flexible calibration curves based on LOESS 37. Multinomial calibration was assessed using flexible multinomial calibration curves based on vector splines 38. In addition, we calculated positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity and the proportion of PUL classified as high risk, using a 5% threshold for the risk of EP/PPUL 15,[17][18][19][20][21]. The utility of the models to select women requiring close monitoring for a potential EP was assessed with decision-curve analysis, by plotting Net Benefit for threshold probabilities of EP/PPUL between 3% and 10%. Net Benefit can be compared between competing models and with the default strategies, treat all (closely monitor all women) and treat none (closely monitor no-one) 39.
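As an illustration of the threshold-based measures, the sketch below computes sensitivity, specificity, PPV, NPV, the proportion classified as high risk and Net Benefit from a list of estimated risks and observed outcomes. It uses the standard decision-curve definition, Net Benefit = TP/n − FP/n × pt/(1 − pt). The function name is hypothetical, classifying a risk equal to the threshold as high risk is an assumption, and the sketch assumes that both outcome classes and both predicted classes occur in the data (otherwise a division by zero results).

```python
def classify_at_threshold(risks, outcomes, threshold=0.05):
    """Classification measures and Net Benefit at a risk threshold.

    risks: estimated probabilities of the high-risk outcome (EP/PPUL).
    outcomes: 1 for a high-risk outcome, 0 otherwise.
    Risks >= threshold are classed as high risk (an assumption of this sketch).
    """
    tp = fp = tn = fn = 0
    for risk, outcome in zip(risks, outcomes):
        predicted_high = risk >= threshold
        if predicted_high and outcome:
            tp += 1
        elif predicted_high:
            fp += 1
        elif outcome:
            fn += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    odds = threshold / (1 - threshold)  # weight given to a false positive
    prevalence = (tp + fn) / n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "proportion_high_risk": (tp + fp) / n,
        "net_benefit": tp / n - fp / n * odds,
        "net_benefit_treat_all": prevalence - (1 - prevalence) * odds,
        "net_benefit_treat_none": 0.0,
    }


# Toy data, invented purely to exercise the function.
print(classify_at_threshold([0.01, 0.20, 0.03, 0.60, 0.10], [0, 1, 1, 1, 0]))
```

Repeating the call for thresholds between 0.03 and 0.10 would give the points of a decision curve.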
For IECV, we first calculated center-specific performance. We then performed a meta-analysis of center-specific performance to obtain an overall result. For the temporal validation, we reported center-specific results. The models and performance metrics are further detailed in Appendix S2.

Missing data and sensitivity analysis
Missing values were observed for pain score, history of EP, vaginal bleeding, maternal age and final PUL outcome when patients were lost to follow-up. The majority of those lost to follow-up had a predicted outcome of FPUL, but we could not confirm UPT results. We assumed a second β-hCG level was missing in those with an initial β-hCG > 25 IU/L if a sample was not taken 2 calendar days after the initial blood sample when PUL was classified. Progesterone level was also considered missing if the patient was using progesterone supplements. As missing data were assumed to be missing at random, we used multiple imputation by chained equations. Multiple imputation was performed separately for the data from Phase 1 and the data from Phase 2 (Appendix S3).
We carried out a prespecified sensitivity analysis using only PUL that were not lost to follow-up, and hence for which outcome was known.

Software
All analyses (model development and validation) were performed using R version 4.0.2 (https://www.r-project.org/). The only exception was the meta-analysis of Net Benefit for the IECV validation, which was performed using WinBUGS version 1.4.3 (https://www.mrc-bsu.cam.ac.uk/software/bugs/).

RESULTS
A total of 5473 consecutive PUL were recruited prospectively between January 2015 and March 2021, including 3266 in Phase 1 and 2207 in Phase 2 (Figure 2). Following 372 (11%) exclusions in Phase 1, 2894 PUL were used for model development and IECV. In Phase 2, 337 (15%) PUL were excluded, leaving 1870 for EV. The prevalence of EP/PPUL was 12% in Phase-1 and Phase-2 data. Patient characteristics are summarized in Tables 1 and S1.
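The cohort flow reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only the counts given in the text; the variable names are illustrative.

```python
# Counts as reported in the text.
recruited = {"Phase 1": 3266, "Phase 2": 2207}
analyzed = {"Phase 1": 2894, "Phase 2": 1870}

# Exclusions per phase and as a rounded percentage of those recruited.
excluded = {phase: recruited[phase] - analyzed[phase] for phase in recruited}
excluded_pct = {
    phase: round(100 * excluded[phase] / recruited[phase]) for phase in recruited
}

total_recruited = sum(recruited.values())
total_excluded = sum(excluded.values())
total_analyzed = sum(analyzed.values())
analyzed_pct = round(100 * total_analyzed / total_recruited)

print(excluded, excluded_pct)
print(total_recruited, total_excluded, total_analyzed, analyzed_pct)
```

The per-phase exclusions sum to the 709 exclusions and 4764 analyzed PUL stated in the abstract.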

Model coefficients and training set (apparent) performance
The odds ratios for all predictor variables obtained from all 2894 PUL from Phase 1 are provided in Table S2 and the coefficients are given in Table S3.
Training set PDIs ranged from 0.82 for the M6 NP update to 0.86 for both M6 P Extensions 1 and 2 (Figure S1). The AUCs for EP/PPUL vs IUP/FPUL were between 0.86 for M6 NP update and 0.90 for M6 P Extensions 1 and 2 (Figure S2). The AUCs to differentiate between IUP and FPUL were all between 0.97 and 0.98 (Figure S3).

Internal-external cross-validation
The AUC for EP/PPUL vs IUP/FPUL was 0.89 for M6 P update and both M6 P extensions, 0.85 for M6 NP update and 0.87 for both M6 NP extensions (Figure 3). Pairwise AUC for IUP vs FPUL was 0.97 for all models (Figure S4). The PDI was 0.85 for all M6 P models, 0.81 for M6 NP update and 0.83 for both M6 NP extensions (Figure 4). Center-specific PDIs and AUCs are provided in Figures S5-S22.
All six models were well calibrated for the risk of EP/PPUL, with calibration intercepts between −0.05 and −0.07, calibration slopes between 0.95 and 0.99 and calibration curves close to the diagonal (Table 2, Figures 5 and S23-S28).
Using a threshold of 5% for the estimated risk of EP/PPUL to classify PUL as high risk, the M6 P update model classified 1310 (45%) PUL as high risk, compared to 1491 (52%) for the M6 NP update model (Table 3). Sensitivity was 94% and 92%, specificity 62% and 55%, PPV 26% and 23%, and NPV 99% and 98% for the M6 P update and M6 NP update models, respectively. The addition of clinical variables increased specificity (+3 percentage points for M6 P, +5 percentage points for M6 NP), but not sensitivity. Center-specific findings for the proportion of PUL classified as high risk, sensitivity, specificity, PPV and NPV are presented in Tables S4-S8. Every model had higher Net Benefit than treat all and treat none at every threshold (Figure 6). All models had similar Net Benefit of 0.10 at the 5% risk threshold (Table S9).
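The reported classification results are mutually consistent, as a rough back-of-envelope check shows. The Python sketch below derives the proportion classified as high risk, PPV, NPV and Net Benefit from the rounded sensitivity (94%), specificity (62%) and prevalence (12%) reported for the M6 P update model. It is an approximation under stated assumptions: the published figures come from a meta-analysis of center-specific results with multiple imputation, so small differences from these derived values are expected. The function name is hypothetical.

```python
def derive_from_summary(prevalence, sensitivity, specificity, threshold=0.05):
    """Approximate classification measures from rounded summary values."""
    tp = prevalence * sensitivity                # true-positive fraction
    fn = prevalence * (1 - sensitivity)          # false-negative fraction
    fp = (1 - prevalence) * (1 - specificity)    # false-positive fraction
    tn = (1 - prevalence) * specificity          # true-negative fraction
    odds = threshold / (1 - threshold)
    return {
        "proportion_high_risk": tp + fp,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "net_benefit": tp - fp * odds,
    }


# Rounded values for the M6 P update model, as reported in the text.
check = derive_from_summary(prevalence=0.12, sensitivity=0.94, specificity=0.62)
for name, value in check.items():
    print(name, round(value, 3))
```

The derived values (about 45% classified as high risk, PPV about 25%, NPV about 99% and Net Benefit about 0.10) are close to the reported 45%, 26%, 99% and 0.10.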

Temporal external validation
The AUC for EP/PPUL vs IUP/FPUL was higher at SMH (range, 0.86-0.89) compared with QCCH (range, 0.82-0.84) (Table 4). Only for M6 NP at SMH did we observe an increase in AUC (+0.02) upon the addition of clinical variables. The AUC for IUP vs FPUL was 0.99 for all models at QCCH and varied between 0.97 and 0.98 at SMH (Table S10). PDI values were in line with the AUC results (Table 5). The calibration curves for the risk of EP/PPUL suggest that estimated risks up to 0.4 were underestimated (i.e. too low) at QCCH, in particular for the extended models (Table 6, Figures S29-S32). At SMH, these risks tended to be overestimated (i.e. too high), in particular for M6 NP models (Table 6, Figures S30-S33).
Using the 5% threshold to define high risk, more patients were classified as high risk by M6 NP models compared to M6 P models, and by models without clinical factors compared to models with clinical factors (Table 7). Sensitivity varied between 90% and 94%, depending on model and center, but models with clinical factors had a slightly lower sensitivity than did models without (−1 to −3 percentage points). Specificity varied between 48% and 68%, depending on model and center. Specificity was higher for models with clinical factors compared to models without (+4 to +9 percentage points). Every model had higher Net Benefit than treat all and treat none at every threshold in each center (Figure 7). At the 5% risk threshold, all models had Net Benefit of 0.10 at QCCH (Table S11). At SMH, Net Benefit was 0.10 for M6 P models and 0.09 for M6 NP models. In terms of net number of interventions avoided per 100 patients compared with treat all, M6 P models had values of 28-30 at QCCH and 39 at SMH, whereas

DISCUSSION
We have found that the updated M6 P and M6 NP models perform with a high level of accuracy (PDI/AUC) and Net Benefit, providing excellent NPV and sensitivity for high-risk PUL, whilst being well calibrated. The added value of bleeding score, history of EP and maternal age in further distinguishing PUL outcomes was small and not present in every center. These extended models are well calibrated, with similar PDI, AUC and Net Benefit, and improved specificity and PPV, but with minimal sensitivity reduction. When progesterone levels were not suitable or unavailable, the M6 NP extended model improved accuracy in some centers. The addition of pain score had no impact on model accuracy and performance.
There are several factors that contribute to the generalizability of these findings. First, this was a large, multicenter, prospective study involving consecutive patients. Inclusion and exclusion criteria were consistent with those of previous publications 15,[17][18][19][20][21]. Multiple forms of EV were used and a sensitivity analysis excluded potential bias due to patients lost to follow-up. Multiple analyses were performed to evaluate model calibration and clinical utility, as well as discrimination and classification.
Whilst Phase-1 data were collected from eight centers, Phase-2 data were collected from only two of these centers. Although two centers (the population of which was independent of the IECV process) do not allow between-center heterogeneity to be evaluated as well, the sample size was large enough for performance evaluation in each of these centers to be sufficiently precise. Although pictorial assessment was used to reduce subjectivity for bleeding scores, both bleeding and pain scores introduce variables that are subjective compared with clinical biochemistry (which itself is subject to variability due to, for example, assay differences), and previous EP was the only factor evaluated in terms of patient history. Whilst the magnitude of bias is difficult to quantify, measurement heterogeneity could be a reason why pain scores had little impact on model performance 40,41. Management according to Step 1 of the two-step protocol (i.e. progesterone ≤ 2 nmol/L) incurred missing data for the second β-hCG recording (usually collected 2 calendar days after the initial blood sample). We aimed to counter potential bias in performance by using thorough multiple imputation by chained equations. This is better than simple approaches such as complete case analysis, which would incur selection bias based on the initial progesterone level 19,20.
In the study of Ayim et al. 24, the presence or absence of vaginal bleeding, pelvic pain or diarrhea were used for logistic regression modeling. However, the AUC of 0.73 reported in their study may reflect the subjective nature of clinical symptoms 24. A scoring system developed using maternal age, history of EP/non-viable IUP, bleeding and β-hCG discriminatory zones carried high specificity for enhanced or reduced monitoring, so that follow-up of suspected non-IUP (enhanced) and non-EP (reduced) cases was appropriate, but at the cost of low sensitivity for outcome discrimination 10. Although the role for clinical factors in tandem with biochemical markers and within prediction modeling is small, it remains essential to collect these data as part of a clinical history, not only to validate patient priorities and experience, but also to build trust whilst deciding how best to provide treatment 26.
Whilst future research and quality assurance projects are merited to assess cost-effectiveness, patient and clinician satisfaction and strategies to help both groups, these models add minimal financial burden, provide an objective tool to support women and healthcare professionals, and can be used as part of a two-step strategy (Figure 1) 19,20,42. Clinical factors can be collected routinely, most units already use serial β-hCG monitoring (with or without progesterone measurement), and Step 1 of the two-step strategy safely reduces the number of 48-h reviews required 19,20.
Previous research suggested that UK-developed models are not as predictive when applied to American populations 36. However, studies in Sweden, Australia and Pakistan support both M4 and M6 as superior PUL triage techniques [43][44][45][46]. Although most of these publications are retrospective, the importance of proactive validation and revalidation is clear. Further prospective clinical implementation analysis, similar to previous work published by our group 20, involving more centers nationally and globally, is required to confirm model generalizability and that women are managed safely.
Machine learning is now playing an increasingly central role in analytics, given its ability to solve complicated problems 47,48. However, it can be subject to bias evolving from validation procedures, and calibration is often insufficient 47,48. In fact, publications comparing machine learning to statistical logistic regression have concluded equivocal performance [48][49][50]. Whilst the current statistical methodology (logistic regression) is sound, a machine-learning-based M6 model could be developed and its performance compared.
In conclusion, we have shown that the updated M6 model offers accurate diagnostic performance for women following a PUL classification, with excellent sensitivity for EP. Adding clinical factors to the model improved performance in some centers, especially when progesterone levels were not suitable or unavailable.

Table 1
Predictors and final diagnosis of women with pregnancy of unknown location (PUL) in development (Phase 1; eight centers) and external validation (Phase 2; two centers) datasets
Data are given as median (interquartile range) [range] or n (%), unless stated otherwise. *Descriptive comparison showing difference in median or percentage between development and validation data; bootstrapping was used to calculate 95% CI. β-hCG, β-human chorionic gonadotrophin; EP, ectopic pregnancy; FPUL, failed PUL; IUP, intrauterine pregnancy; PPUL, persistent PUL.

Figure 4
Polytomous discrimination index (PDI) for internal-external cross-validation on Phase-1 data. M6 NP, M6 model excluding progesterone; M6 P, M6 model including progesterone; PI, prediction interval.

Table 3
Classification performance for ectopic pregnancy/persistent pregnancy of unknown location using risk threshold of 5% for internal-external cross-validation on Phase-1 data

Table 4
Area under receiver-operating-characteristics curve for ectopic pregnancy/persistent pregnancy of unknown location vs intrauterine pregnancy/failed pregnancy of unknown location for temporal external validation on Phase-2 data

Table 5
Polytomous discrimination index for temporal external validation on Phase-2 data

Table 6
Calibration slope and intercept for risk of ectopic pregnancy/persistent pregnancy of unknown location for temporal external validation on Phase-2 data

Table 7
Classification performance for ectopic pregnancy/persistent pregnancy of unknown location, using risk threshold of 5% for temporal external validation on Phase-2 data