A comparison of the beta‐geometric model with landmarking for dynamic prediction of time to pregnancy

Abstract We conducted a simulation study to compare two methods that have been recently used in clinical literature for the dynamic prediction of time to pregnancy. The first is landmarking, a semi‐parametric method where predictions are updated as time progresses using the patient subset still at risk at that time point. The second is the beta‐geometric model that updates predictions over time from a parametric model estimated on all data and is specific to applications with a discrete time to event outcome. The beta‐geometric model introduces unobserved heterogeneity by modelling the chance of an event per discrete time unit according to a beta distribution. Due to selection of patients with lower chances as time progresses, the predicted probability of an event decreases over time. Both methods were recently used to develop models predicting the chance to conceive naturally. The advantages, disadvantages and accuracy of these two methods are unknown. We simulated time‐to‐pregnancy data according to different scenarios. We then compared the two methods by the following out‐of‐sample metrics: bias and root mean squared error in the average prediction, root mean squared error in individual predictions, Brier score and c statistic. We consider different scenarios including data‐generating mechanisms for which the models are misspecified. We applied the two methods on a clinical dataset comprising 4999 couples. Finally, we discuss the pros and cons of the two methods based on our results and present recommendations for use of either of the methods in different settings and (effective) sample sizes.

probability that a disease is present to justify the use of an invasive or expensive diagnostic test, or prediction of the expected course of a disease to aid in deciding whether treatment should be initiated. However, there is rarely solely one fixed moment in time when a medical decision has to be made. More often, clinicians make an initial decision based on information known at baseline, but wish to reassess this decision later on, since during the period between assessments often additional or updated information has become available (Schumacher, Hieke, Ihorst, & Engelhardt, 2019). Examples include (re)measured biomarkers or an intermediate event that the patient can experience. The fact that the patient has remained event-free until the time of reassessment is also important information that can be used to update the patient's prognosis (van Houwelingen & Putter, 2012).
Statistical methods for dynamic prediction have been introduced that allow predictions to be updated over time. This enables the clinicians and patients to make an informed decision at later time points based on the most recent information. Dynamic prediction thus has a high clinical utility (Schumacher et al., 2019). Dynamic prediction models are being used more often in recent years as shown by 86 search results for 'dynamic prediction' on PubMed in 2000 and 603 results in 2017. Further, several published models in oncology and cardiology are ready for clinical use via a web-based calculator or nomogram (Eichinger, Heinze, & Kyrle, 2014; Fontein et al., 2015).
One setting where there is an explicit need for dynamic prediction is in reproductive medicine, in particular the subspecialisation of fertility. Approximately 1 in 10 heterosexual couples who wish to have a child cannot conceive naturally within one year of trying and this is referred to as subfertility (Gnoth et al., 2005). Subfertile couples receive a fertility workup wherein the basic requirements to achieve a natural pregnancy are assessed, but in 40-50% of couples no barrier can be found, which is referred to as unexplained subfertility (Aboulghar et al., 2009). Treatment (medically assisted reproduction, MAR) is available, for example, intrauterine insemination or in vitro fertilisation, but since couples without a clear-cut diagnosis can still conceive naturally, the preferred treatment course is not evident (van Eekelen et al., 2017b).
To make decisions on whom to treat, clinicians may turn to statistical models to inform them on the chance of pregnancy under alternative scenarios. For instance, if a couple's prognosis of natural conception is relatively good, they might not need invasive MAR treatment but could instead keep trying to conceive naturally. If their prognosis is poor, the couple is expected to benefit from MAR and they can be treated. The initial decision is made upon completion of the diagnostic workup. However, if a couple choose to continue trying to conceive naturally for an additional period of time and do not succeed, again a decision has to be made when the couple return to the clinic.
'Static' models that predict over one fixed time horizon are unable to update predictions and will overestimate the probability of conception when reapplied at a later time (van Eekelen et al., 2017b). This is because the prognosis of natural conception decreases as time progresses due to a selection process that occurs. This is caused by couples who have a high chance of natural conception conceiving earlier and dropping out of the population, leaving behind a subset that, on average, has a lower probability of conception (van Eekelen et al., 2017b). Several statistical methods and models have been suggested to estimate the prognosis of (natural) conception in the presence of heterogeneity, but most focus on retrospective or cross-sectional studies regarding time-to-pregnancy since those are the most cost-efficient study designs (Ecochard, 2006; Scheike & Jensen, 1997; Scheike & Keiding, 2006; Zhou, 2006). These methods do not naturally extend to a setting of dynamic prediction.
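The selection mechanism described above can be illustrated with a short simulation (a sketch in Python rather than the R used in the paper; the Beta(2, 6) heterogeneity distribution is a hypothetical choice for illustration only):

```python
import random

# Illustrative sketch: unobserved heterogeneity in the per-cycle conception
# probability p induces selection. Couples with high p conceive early, so the
# mean p among couples still at risk decreases over time.
random.seed(1)
ps = [random.betavariate(2, 6) for _ in range(100_000)]

def mean_p_at_risk(ps, s):
    """Mean of p among couples still unconceived after s cycles.
    Under a geometric model, P(T > s | p) = (1 - p)**s serves as a weight."""
    weights = [(1 - p) ** s for p in ps]
    return sum(p * w for p, w in zip(ps, weights)) / sum(weights)

for s in (0, 6, 12):
    print(f"s = {s:2d}: mean p among couples at risk = {mean_p_at_risk(ps, s):.3f}")
```

A 'static' model fitted at baseline would keep predicting with the s = 0 mean, which is why it overestimates chances when reapplied at a later time.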
Recently, two dynamic prediction models have been developed in the setting of reproductive medicine, published in a high impact clinical journal in the field (McLernon et al., 2019; van Eekelen et al., 2017a). The statistical methods used in these two publications are quite different: one used landmarking in combination with Cox proportional hazards models and the other used the beta-geometric model (Bongaarts, 1975; van Houwelingen & Putter, 2012; Weinberg & Gladen, 1986).
It is unknown what the advantages or disadvantages are of these two methods and how their predictive accuracy compares. The aim of this paper is to identify which of these two methods performs best when used to develop a clinical dynamic prediction model. This cannot formally be assessed in an internal validation of the developed prediction models, as the truth that they are estimating is unknown. We conducted a simulation study based on the setting of predicting the chances of pregnancy to compare these methods and to advise which method should be preferred in a particular setting. We applied the two methods on a clinical data set comprising 4999 couples with unexplained subfertility.

Landmarking with Cox proportional hazard models
The first method to derive dynamic predictions of time to pregnancy is landmarking. This method was recently used by McLernon et al. (2019) to develop a clinical prediction model for time to pregnancy. Landmarking was first suggested by Anderson, Cain, and Gelber (1983) to prevent immortal time bias induced in survival modelling in oncology. In this original proposal, the authors suggested creating new datasets, the landmark datasets, at fixed time points after inclusion in the study using only patients who have survived at least until that time point. Covariates are updated based on information that has become available since the start of follow-up, for example, a surgery that was since performed, after which the effect of those covariates can be estimated in the landmarks in the absence of immortal time bias.
Landmarking for dynamic prediction was introduced by van Houwelingen (2007). We define s as the landmark time point, w as the fixed prediction window, that is, the time span that each landmark covers, and t as the follow-up time after inclusion in the study. The prediction window can be any time period over which one wants to predict; it is not related to the distance between landmarks. Van Houwelingen (2007) suggests to select a set of K + 1 landmark time points {s_0, s_1, …, s_j, …, s_K} and then to create a landmark dataset for each landmark time point by only selecting patients from the original dataset that are still at risk at s_j and censoring them at time s_j + w (often referred to as administrative censoring).
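The construction of a landmark dataset can be sketched as follows (illustrative Python, not the authors' R code; the record layout is a hypothetical minimal example, and covariates would be carried along unchanged):

```python
# Sketch of landmark dataset construction: keep subjects still at risk at s
# and apply administrative censoring at s + w.

def make_landmark(data, s, w):
    """data: list of dicts with 'time' (event or censoring time) and
    'event' (1/0). Returns the landmark dataset at time s with window w."""
    landmark = []
    for row in data:
        if row["time"] <= s:        # event/censoring at or before s: not at risk
            continue
        new = dict(row)
        if row["time"] > s + w:     # administrative censoring at s + w
            new["time"], new["event"] = s + w, 0
        landmark.append(new)
    return landmark

data = [{"time": 3, "event": 1}, {"time": 8, "event": 1},
        {"time": 20, "event": 0}, {"time": 15, "event": 1}]
lm = make_landmark(data, s=5, w=13)
# The subject with time 3 is excluded; the one with time 20 is censored at 18.
```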
After creating these landmark datasets, there are three approaches to estimate the dynamic predictions over time. The first and easiest option is to fit a separate Cox model on every landmark dataset and use these separate models to derive dynamic predictions. We get K + 1 Cox models:

S_j(s_j + w | X_ij) = S_0j(s_j + w)^exp(X_ij β_j),  (1)

with S_0j(s_j + w) = exp(−H_0j(s_j + w)), where S_j(s_j + w | X_ij) is the probability that subject i with covariate values X_ij at landmark time point s_j will survive until at least s_j + w and β_j are the estimated regression coefficients for covariates X. We assume that the β_j are constant over time, that is the assumption of proportional hazards is met, at least within landmarks. The cumulative hazard H_0j is estimated based on the follow-up information of subjects at risk from s_j.
Using separate Cox models in landmarking has the advantage of being semi-parametric and thus not reliant on the shape of the underlying distribution of the baseline cumulative hazard. The method can easily incorporate time-varying covariates; however, it might be sensitive to small sample sizes in later landmark datasets when estimating separate baseline hazards and separate regression coefficients based on limited sample sizes.
If it is not expected that effects of covariates will change over time, or if that does not align with the research question, one might wish to fit a simpler, more concise model. Van Houwelingen and Putter (2012) propose a second approach referred to as the ipl (integrated partial likelihood) 'super' model (van Houwelingen & Putter, 2012). The process is equivalent to the first approach in that it uses the constructed landmark datasets at time points s_j, but instead of analysing them separately, we now assume that there is only one fixed set of coefficients β. Thus, we assume that all β_j from (1) are equal to β and that not only do they remain constant over time within a landmark, but also across landmarks. β is estimated using likelihood contributions from all landmark sets, which might improve the stability of the estimations. In this second approach, the baseline hazards are still estimated separately for each landmark. In practice, the parameters for this second approach can be estimated by using landmark as a 'strata' variable when estimating one Cox model on a 'stacked' dataset consisting of all landmark datasets.
The third approach, referred to as the ipl* super model, is to simplify the model further by assuming that the baseline hazards across landmarks are related. One can assume a functional form of the baseline hazard by adding (functions of) s_j to the model. We get

S_j(s_j + w | X_ij) = exp(−H_00(s_j + w) exp(X_ij β + Γ(s_j))),  (2)

where exp(Γ(s_j)) is the assumed fixed ratio between the cumulative hazard at landmark s_j relative to the cumulative hazard at s_0, with Γ(s_0) = 0. This function can comprise multiple parameters that are usually incorporated using linear and quadratic terms for transformed values of s_j in landmarks, dividing by s_K such that the range of values for the terms becomes 0 to 1 (van Houwelingen & Putter, 2012). Alternatively, we can also formulate the ipl* super model by rearranging (2), yielding an expression where Γ(s_j) is modelled as additional coefficients in the regression part of the Cox model, with

S_j(s_j + w | X_ij) = S_00(s_j + w)^exp(X_ij β + Γ(s_j)),  (3)

where S_00(s_j + w) = exp(−H_00(s_j + w)). The advantage of this third ipl* approach is that it avoids estimation of many separate baseline hazards by pooling the baseline hazard of all landmarks at the cost of assuming a functional form of how the baseline hazard changes over landmarks. An additional advantage is that the ipl* model no longer depends on the choice of values for s_j: The model can predict over time horizons exceeding s_j + w or at time points not included in s_j, which is something the ipl model cannot do since there are no estimated baseline hazards between landmarks or after the prediction window. When applying landmarking in combination with Cox models in the fertility setting, the separate baseline hazards will accommodate the decrease in chances over time for the 'separate Cox models' and the ipl super model. For the ipl* super model, the decrease in chances is captured by the function Γ(s_j).
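The Γ(s_j) term of the ipl* super model can be sketched as follows (Python for illustration; gamma1 and gamma2 are hypothetical coefficients standing in for estimates from the stacked Cox fit):

```python
# Sketch of the landmark terms entering the ipl* super model: Gamma(s) is
# modelled with linear and quadratic terms in s / s_K, so the transformed
# covariates range over [0, 1].

s_K = 26.0  # last landmark time point in the simulation study

def landmark_terms(s):
    """Transformed landmark covariates: linear and quadratic in s / s_K."""
    u = s / s_K
    return u, u ** 2

def Gamma(s, gamma1, gamma2):
    """Log cumulative-hazard ratio of landmark s relative to s_0."""
    u1, u2 = landmark_terms(s)
    return gamma1 * u1 + gamma2 * u2

# Gamma(0) = 0 by construction: the cumulative hazard ratio at s_0 is exp(0) = 1.
```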

Beta-geometric model
The second method to derive dynamic predictions from time-to-pregnancy data is the parametric beta-geometric model introduced by Bongaarts (1975). This method was recently used by van Eekelen et al. (2017a) to develop a clinical prediction model for time to pregnancy. The beta-geometric model is based on the notion that given a per cycle probability to conceive p that holds for all couples, time to conception T will follow a geometric distribution (Bongaarts, 1975). In the context of conception and subfertility, the cycle refers to every menstrual cycle, that is every opportunity for a couple to conceive. The probability of a first pregnancy in cycle t is thus

P(T = t) = p(1 − p)^(t−1).  (4)

However, p is unlikely to be equal for all individual couples who wish to conceive and the spread in chances is modelled according to a beta distribution (Bongaarts, 1975; Leridon & Spira, 1984; te Velde, Eijkemans, & Habbema, 2000). The two shape parameters of the beta distribution, α and β, are reparametrized following Griffiths (1973) to

μ = α / (α + β) and θ = 1 / (α + β),  (5)

where μ is the mean probability of conception in the first cycle and θ is now the second shape parameter that represents the extent of heterogeneity (Griffiths, 1973; Weinberg & Gladen, 1986). After s failed cycles, there remains a subset of couples with on average a lower probability per cycle p, such that p will follow a beta distribution with updated

μ_s = μ / (1 + sθ) and θ_s = θ / (1 + sθ).  (6)
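The reparameterization and the update after s failed cycles can be sketched as follows (illustrative Python; the values of μ and θ are hypothetical):

```python
# Sketch of the (mu, theta) reparameterization of the beta distribution and
# its update after s failed cycles.

def to_shape(mu, theta):
    """Convert mean mu and heterogeneity theta back to beta shape parameters."""
    return mu / theta, (1 - mu) / theta

def update(mu, theta, s):
    """Conditional (mu_s, theta_s) among couples still unconceived after s cycles."""
    return mu / (1 + s * theta), theta / (1 + s * theta)

mu, theta = 0.25, 0.15
mu_s, theta_s = update(mu, theta, 6)
# Equivalent in shape parameters: alpha stays the same, beta increases by s = 6.
```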
The probability to conceive within s + w cycles given that a couple did not conceive within s cycles is

P(T ≤ s + w | T > s) = 1 − ∏_{j=s}^{s+w−1} (1 − μ / (1 + jθ)).  (7)

To incorporate baseline covariate information X, we follow Weinberg and Gladen (1986) and regress the logit of the population mean μ linearly on the covariates X. We use a log link for θ and get

logit(μ(X_i)) = β_0 + X_i β,  (8)

with θ estimated on the log scale to ensure positivity. Let n_p denote the subset of couples with observed pregnancies, n_c the subset of couples that were right censored, n = n_c + n_p the total sample size and T_i the last observed cycle for an individual couple i, either corresponding to pregnancy or censoring. We can estimate the coefficients β, the parameter θ and the intercept β_0 for X = 0 via maximum likelihood optimization. The likelihood contributions of individual couples can be assumed to be independent, thus the total log-likelihood L is the sum of the log-likelihood for couples with observed pregnancies (L_p) and the log-likelihood for couples who were right-censored (L_c). After rearrangement of the notation from Weinberg and Gladen (1986), we get

L = L_p + L_c = Σ_{i ∈ n_p} [ log(μ(X_i) / (1 + (T_i − 1)θ)) + Σ_{j=0}^{T_i−2} log(1 − μ(X_i) / (1 + jθ)) ] + Σ_{i ∈ n_c} Σ_{j=0}^{T_i−1} log(1 − μ(X_i) / (1 + jθ)).  (9)

After estimation of the parameters, dynamic predictions of the cumulative probability of pregnancy over prediction window w given s failed cycles can be calculated using Equation (7) and plugging in μ̂(X_i) and θ̂.
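Equation (7) translates directly into code; a minimal sketch in Python (the μ and θ values are hypothetical and would in practice be the maximum likelihood estimates μ̂(X_i) and θ̂):

```python
# Sketch of the beta-geometric dynamic prediction of Equation (7).

def p_within(mu, theta, s, w):
    """P(T <= s + w | T > s): the chance of conceiving within w further
    cycles given s failed cycles, with per-cycle conditional conception
    probability mu / (1 + j * theta) after j failures."""
    surv = 1.0
    for j in range(s, s + w):
        surv *= 1.0 - mu / (1.0 + j * theta)
    return 1.0 - surv

p0 = p_within(0.25, 0.15, s=0, w=13)    # prediction from baseline
p12 = p_within(0.25, 0.15, s=12, w=13)  # updated after 12 failed cycles
# p12 < p0: the remaining couples are a selected, less fertile subset.
```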
The beta-geometric model uses all data from inclusion to the end of follow-up in its fit and is thus expected to be less sensitive than landmarking to small samples at later s. Since it is parametric, it could be sensitive to misspecification, in addition to being limited to a discrete time-to-event outcome and baseline covariates. The closed-form prediction formula is convenient to apply since only a small number of parameters are needed to calculate a prediction.
The beta-geometric model can be extended to allow for a fraction of couples that have zero chance to conceive, that is absolutely sterile couples (Weinberg & Gladen, 1986). This fraction, denoted as the sterility parameter π, can be estimated in the beta-geometric model (Weinberg & Gladen, 1986). This is referred to as the mixture model. We extend the log-likelihood function from Equation (9) to

L = Σ_{i ∈ n_p} log[(1 − π) P(T = T_i | X_i)] + Σ_{i ∈ n_c} log[π + (1 − π) P(T > T_i | X_i)].  (10)

After estimation of the parameters, again using maximum likelihood, the calculation of a probability to conceive within s + w cycles given that the couple did not conceive within s cycles now becomes

P(T ≤ s + w | T > s, X_i) = (1 − π_s(X_i)) (1 − ∏_{j=s}^{s+w−1} (1 − μ(X_i) / (1 + jθ))),  (11)

with π_s(X_i) denoting the expected fraction of couples that are sterile after s failed cycles,

π_s(X_i) = π / (π + (1 − π) P(T > s | X_i)).  (12)
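A sketch of the mixture-model prediction, Equations (11) and (12), in illustrative Python (the values of π, μ and θ are hypothetical):

```python
# Sketch of the mixture model: a fraction pi of couples is sterile, and the
# expected sterile fraction among those still at risk grows with s because
# fertile couples conceive and drop out.

def bg_survival(mu, theta, t):
    """P(T > t) under the beta-geometric model, for fertile couples."""
    surv = 1.0
    for j in range(t):
        surv *= 1.0 - mu / (1.0 + j * theta)
    return surv

def sterile_fraction(pi, mu, theta, s):
    """Equation (12): expected sterile fraction after s failed cycles."""
    return pi / (pi + (1 - pi) * bg_survival(mu, theta, s))

def p_within_mixture(pi, mu, theta, s, w):
    """Equation (11): P(T <= s + w | T > s) allowing for a sterile fraction."""
    pi_s = sterile_fraction(pi, mu, theta, s)
    cond = 1.0 - bg_survival(mu, theta, s + w) / bg_survival(mu, theta, s)
    return (1.0 - pi_s) * cond
```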
We described landmarking with three approaches to build a prediction model: fitting separate Cox models (model A), using landmarks as strata (model B) and incorporating linear and quadratic terms of landmark numbers as covariates (model C). We also described the beta-geometric model with (model D, referred to as a mixture model) and without (model E) the sterility parameter. Table Supp-I in the Supporting Information summarizes the expected advantages and disadvantages of the five models. We continue with describing the framework for the simulation study where we compare the performance of models A to E.

SIMULATION STUDY
The parameters of the simulation scenarios were closely based on what was observed in two recent cohorts following unexplained subfertile couples for natural conception (van der Steeg et al., 2007; van Eekelen et al., 2017a, 2018). We are interested in the ability of the models to predict P(T ≤ s + w | T > s), not only from baseline (s = 0), but also after s failed cycles with s ∈ {0, 1, 2, …, 26}. We choose a prediction window w = 13 cycles, so we predict one-year pregnancy chances from 27 time points onwards.
We compare the accuracy of the predictions from models A to E where we prefer low bias and high precision summarized in terms of the lowest root mean squared error of average predictions (RMSE) and the lowest root mean squared error of individual predictions (RMSPE). In addition, we were interested in other commonly used measures to denote and compare clinical prediction model performance, namely those that describe (internal) prediction error and discrimination: the Brier score and the c-statistic (Steyerberg, 2009). We simulate data by 10 different data-generating mechanisms to assess the robustness of models over multiple scenarios.

Data generation
We chose a discrete time to event because the beta-geometric model is based on discrete cycles, not on calendar time. To generate the time to event for each of n = 6000 couples, we followed these steps: 1. We simulated two covariates: female age in years (X 1 ) drawn from a normal distribution N(34,5) and duration of subfertility (the number of years couples have been trying to conceive, X 2 ) from an exponential distribution Exp(0.8) + 1 such that the minimum for the latter was 1.
2. To obtain covariate-specific mean conception chances μ(X_i), we combined the covariates with assumed linear effects on logit scale and used the intercept logit(μ_0) of −0.60 such that

logit(μ(X_i)) = logit(μ_0) + β_1 X_1i + β_2 X_2i  (13)

and then backtransformed to means μ(X_i) using the inverse-logit.
3. To introduce heterogeneity, we assumed the individual probabilities followed a beta distribution with μ(X_i) and θ from Equation (5), with fixed θ = 0.15, to randomly draw an individual per cycle probability of pregnancy p(X_i) for each couple. We consider other assumptions for the heterogeneity distribution later on.
4. We drew individual time to event values from a geometric distribution with success probability p(X i ).
5. We assumed the censoring times were geometrically distributed with per cycle probability 0.1.
6. If the time of event was earlier or equal to the censoring time, the couple conceived during the study, otherwise they were followed until the time of censoring or the end of study, set at 52 cycles. Couples were followed for at least one cycle.
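The six steps above can be sketched as follows (Python rather than the R used in the paper; a reduced n, and the covariate effects beta1 and beta2 are hypothetical placeholders since the generating coefficients are not restated here):

```python
import math
import random

# Sketch of one replication of steps 1-6.
random.seed(2)

def simulate_couple(theta=0.15, logit_mu0=-0.60, beta1=-0.05, beta2=-0.2):
    age = random.gauss(34, 5)                  # step 1: X1 ~ N(34, 5)
    duration = random.expovariate(0.8) + 1     # step 1: X2 ~ Exp(0.8) + 1
    eta = logit_mu0 + beta1 * (age - 34) + beta2 * (duration - 1)
    mu = 1 / (1 + math.exp(-eta))              # step 2: covariate-specific mean
    p = random.betavariate(mu / theta, (1 - mu) / theta)  # step 3: heterogeneity
    p = min(max(p, 1e-9), 1 - 1e-9)            # numerical guard
    t_event = math.ceil(math.log(random.random()) / math.log(1 - p))  # step 4
    # steps 5-6: geometric censoring (per-cycle probability 0.1), study end at 52
    t_cens = min(math.ceil(math.log(random.random()) / math.log(0.9)), 52)
    return (t_event, 1, p) if t_event <= t_cens else (t_cens, 0, p)

sample = [simulate_couple() for _ in range(2000)]  # (time, event, true p)
```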

Model fitting
The two beta-geometric models (models D and E) were fitted using Equations (10) and (9), respectively, by optimizing the log-likelihood using the R command optim() and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, a quasi-Newton gradient method. Predictions were calculated using Equations (11) and (12) for model D and Equation (7) for model E.
For the landmarking-based models, we first derived all 27 landmark datasets by intervals of one cycle. We then followed the steps described in Section 2.1, selecting the subset still at risk at time s and censoring administratively at s + w.
We fitted separate Cox models using cph() on all landmarks (A) and derived predictions following Equation (1). After stacking all landmark datasets into one 'super' dataset, we fitted the ipl model (B) using s as strata and derived predictions as described before. We fitted the ipl* model (C) by removing the strata and adding linear and quadratic terms for s divided by 26 as covariates to the Cox model, then derived predictions following Equation (3).
In all models, we assumed a piecewise linear effect for age (differentiating between the effect of age below and above 33) and a linear effect for duration (X 2 ).
In every landmark dataset, we calculated individual cumulative predicted probabilities of pregnancy P̂(T ≤ s + 13 | T > s, X_i) using the fitted models A-E. We also calculated the true cumulative probability for every individual by using the generated individual probabilities p(X_i) in every landmark dataset as

P(T ≤ s + 13 | T > s, X_i) = 1 − (1 − p(X_i))^13.  (14)

In addition to models A-E, we fitted a Kaplan-Meier curve (model F) for the sake of tracking the observed pregnancy rates and their variance between simulation replications as opposed to what is expected based on true probabilities.
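Equation (14) is a one-liner; a sketch in Python:

```python
# Sketch of Equation (14): with a fixed individual per-cycle probability
# p(X_i), time to pregnancy is geometric and memoryless, so the true 13-cycle
# conditional probability is the same at every landmark s.

def true_cum_prob(p, w=13):
    """P(T <= s + w | T > s) for a couple with fixed per-cycle probability p."""
    return 1 - (1 - p) ** w
```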

Performance metrics
The following metrics were calculated at each landmark time point s where n_s couples are still at risk. We were interested in the expected error on the population level and the expected error on the individual level. For the first, to compare the average predictions based on a sample to the true population parameter, we averaged the individual cumulative predicted probabilities of pregnancy P̂_i into the average predicted probability P̂_avg. For each simulated dataset, we averaged the generated individual cumulative probabilities P_i from Equation (14) to calculate the true average probability. We view the average of these true averages over all simulation replications as the true (empirical) population parameter P_avg, given all different, possible samples that could have been drawn. Bias of average predictions is defined as any deviation of P̂_avg in a replication from that population parameter P_avg.
We then calculated the RMSE, that is the expected error in the average estimated probability, denoted in percentage points, compared to the true population parameter:

RMSE = √( (1/nsim) Σ_{r=1}^{nsim} (P̂_avg,r − P_avg)² ),  (15)

where nsim is the total number of simulation replications, P̂_avg,r is the average prediction in replication r and P_avg is the true population parameter. Next, we calculated the RMSPE, that is the expected error in individual predictions in percentage points, by comparing them to the true individual probabilities per simulation replication, following

RMSPE = √( (1/n_s) Σ_{i=1}^{n_s} (P̂_i − P_i)² ).  (16)

We derived Brier scores that represent the overall performance of the models, following Graf, Schmoor, Sauerbrei, and Schumacher (1999), and c statistics, that represent the discriminative ability of the models, following Harrell, Lee, and Mark (1996), both for right-censored data. Finally, to show how sample size changes over time, we calculated the average of the sample size per landmark n_s over all simulation replications. In addition to results for models A-F, simulation metrics were also calculated using the true probabilities to attain the best obtainable (i.e. lowest) Brier score and best obtainable (i.e. highest) c statistic to aid interpretation of the differences in performance measures between models.
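The two error metrics can be sketched as follows (illustrative Python):

```python
import math

# Sketch of the two error metrics in percentage points: RMSE compares the
# average prediction per replication with the true population parameter;
# RMSPE compares individual predictions with the true individual
# probabilities within one replication.

def rmse_average(avg_pred_per_rep, true_population_avg):
    """Root mean squared error of the average prediction over replications."""
    n = len(avg_pred_per_rep)
    return 100 * math.sqrt(sum((a - true_population_avg) ** 2
                               for a in avg_pred_per_rep) / n)

def rmspe_individual(pred, true):
    """Root mean squared error of individual predictions in one replication."""
    n = len(pred)
    return 100 * math.sqrt(sum((a - b) ** 2 for a, b in zip(pred, true)) / n)
```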
All metrics were evaluated on an external dataset, representing the out-of-sample performance of the models: For each development simulation set, a second validation set was generated using the same generating mechanism. The predicted probabilities in the validation set were calculated by applying the models that were fitted to the development set. True probabilities were based on the validation set only. We also evaluated the main scenario (scenario 1) on the training data, that is internal validation to see differences with out-of-sample performance.

Alternative simulation scenarios
We varied several aspects of the data generation algorithm that might influence the performance of the models.
First, the heterogeneity distribution could be different from beta: We used the compressed beta distribution and the logit-normal distribution as alternatives. For the first, true probabilities p(X_i) were drawn from a beta distribution as in step 3 of the main algorithm but multiplied with a compressor value of 0.6. For the second, individual log-odds for pregnancy were drawn from a normal distribution N(logit(μ(X_i)), 1) with adjusted logit(μ_0) = −1.5. These were backtransformed to individual per cycle probabilities p(X_i) using the inverse-logit.
We also considered a scenario where there was no (unobserved) heterogeneity at all, so where the individual true probabilities per cycle were equal to the covariate-specific means, that is p(X_i) = μ(X_i).
Second, we considered the possibility that a fraction of the couples in the cohort was sterile. To simulate this, we drew a sterility indicator for each couple from a Bernoulli distribution with probability 0.3 (assuming 5% sterile couples in the population and the simulated data represent a subfertile cohort who have been trying to conceive for at least one year) and set their p(X_i) to zero, meaning they are followed until the time of censoring or until the end of the study. The intercept μ_0 was adjusted such that cumulative probabilities were similar to scenarios without the sterile fraction.
Third, it is assumed in the main scenario that for each couple, the true probabilities per cycle p(X i ) are fixed over the entire follow-up of the study. This may not hold, in particular because women age over follow-up and their fecundability may decline. This is likely to have more impact in women that were older at the start of follow-up (Sozou & Hartshorne, 2012).
We have implemented ageing in some of the simulation scenarios by assuming that p(X_i) changed over time within the same couple. If the female age reached 33 years or above, their p(X_i) decreased by the same 0.15 on logit scale that is assumed to hold for differences between couples in terms of baseline female age.

Finally, we considered the possibility where there is no censoring, for instance, in a retrospective study when data linkage with maternity databases is performed. For this scenario, we set the probability of right censoring per cycle to zero.

Table 1 summarizes all 10 different scenarios we implemented. We consider scenario 1 as the main scenario because we view it as the most realistic given evidence from the literature (Hunault et al., 2004; van der Steeg et al., 2007; van Eekelen et al., 2017a, 2018).
All simulations were replicated 1000 times. All datasets were generated using a known seed. We tracked how often models failed to converge. If so, results were calculated using replications where models did converge.
All analyses were performed using R version 3.3.2 (R Core Team [2017], R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/) using the survival, dynpred, rms, xtable and ipred packages.
Source code for all data generation, model fitting, simulation replication and processing of simulation results is provided online at the Biometrical Journal web page.

Bias in the average predictions
Results for the average predictions over all landmarks are visualized in Figure 1 for scenarios 1-10. The ipl* model (C) overestimated the average cumulative probability of pregnancy for all scenarios up until landmark 10. This is likely due to an inaccurate approximation of the baseline hazard function using a linear/quadratic functional form in the model.
In the main scenario 1, all models except C (which overestimated by 6.9 percentage points in the baseline landmark) gave unbiased estimates of the average predictions until landmark 5, but both beta-geometric models overestimated chances in landmarks thereafter (D by 2.1 percentage points and E by 1.5 percentage points in landmark 26). This is likely because the presence of a sterile fraction led to inaccuracy in the estimation of the beta distribution. Model D often had difficulty estimating the sterile fraction: it estimated an average fraction of sterile couples at baseline of 0.5%, and in 24% of simulation replications this estimate was less than 5%, compared to the true value of 30% used in the simulation. Results when the average prediction was evaluated on internal data were similar to the main scenario 1, which was expected given we were evaluating against the population parameter, which is identical for internal and external datasets.
In the scenario without sterile couples (scenario 2), all models except model C were unbiased. Including ageing over follow-up (scenario 3) yielded similar results as scenario 1, so all models were robust with regard to ageing.
In the scenario without censoring (scenario 4), the beta-geometric mixture model (D) was better able to estimate the fraction of sterile couples (now 27% on average at baseline) but this did not improve accuracy of the average predictions of the model.
In the scenario where there was no heterogeneity (scenario 5), both beta-geometric models performed poorly since they falsely assumed some degree of heterogeneity (estimates for θ were substantially lower compared to scenario 1 but non-zero in all simulations). However, in the scenario where there was neither heterogeneity nor a fraction of sterile couples (scenario 6), we can see that both beta-geometric models no longer found spurious heterogeneity (estimates for θ were almost zero). This falsely assumed heterogeneity in scenario 5 thus seems to result from the inability of the model to accurately estimate both the sterile fraction and the beta distribution.

F I G U R E 1 Average predictions P̂(T ≤ s + 13 | T > s, X) and true values P̄(T ≤ s + 13 | T > s, X) for all landmarks s for scenarios 1-10
In scenarios where we used the logit-normal distribution for heterogeneity (scenarios 7 and 8, the latter with ageing) or the compressed beta distribution (scenarios 9 and 10, the latter with ageing), the results were similar to scenarios 1 and 3, although there was slightly more bias in the beta-geometric mixture model (D) in scenarios 9 and 10. This indicates that the beta-geometric models were robust to misspecification of the exact heterogeneity distribution.

RMSE in the average predictions

Table 2 denotes the RMSE (in percentage points) of average predictions for models A-F, or the empirical variance of true probabilities between simulation replications, for s ∈ {0, 13, 26} for scenarios 1-10. Sample sizes per landmark were similar for all scenarios except for scenario 4 in which no censoring is present. In the main scenario 1, the RMSE increased for all models in later landmarks due to decreasing sample sizes. RMSEs at baseline were around 0.7-0.8 percentage points for all models except ipl* (C), which had a prediction error of approximately 7 percentage points due to its large bias. The beta-geometric model without sterility parameter (E) performed the best in all landmarks except s = 0 in scenario 1 in terms of the lowest RMSE. This is because, albeit slightly biased upwards at the later landmarks, model E had the lowest empirical standard error, that is the highest precision. Results when the RMSE was evaluated on internal data were similar to the main scenario 1, which was expected given we were evaluating against the population parameter, which is identical for internal and external datasets.

In scenario 2 without sterile couples, both beta-geometric models vastly outperformed the landmarking models because now both approaches were unbiased but the beta-geometric models were more precise. In scenario 3 including ageing over follow-up, the beta-geometric models performed less well compared to scenario 1 but still outperformed the landmarking-based Cox models in most landmarks in terms of the lowest RMSE. In scenario 4 without censoring, landmarking-based Cox models A and B outperformed the beta-geometric models D and E in terms of a lower RMSE because there was much more data available in later landmarks (n_s = 3531 at s = 26, where it was approximately n_s = 228 for scenarios 1-3). In scenarios 5-10, the beta-geometric model without sterility parameter (E) performed best overall in terms of the lowest RMSE.

RMSPE in the individual predictions

Table 3 denotes the RMSPE for individual predictions for s ∈ {0, 13, 26} for scenarios 1-10. The RMSPE were very high (approximately 35 percentage points off for s = 0 in scenario 1) since the unobserved heterogeneity in individual chances of pregnancy far outweighed the information in the two simulated covariates, which also holds for the out-of-sample performance for scenario 1. The RMSPE decreased for all models in later landmarks due to a more homogeneous subset, and a larger sterile fraction, of couples that remain in the cohort. The only exception was scenario 6 with no heterogeneity and no sterile fraction of couples, where the RMSPE was very low (approximately 1.5 percentage points at s = 0) and increased over landmarks. Aside from the ipl* model (C), which generally performed the worst in the first landmark s = 0 in most scenarios, and the beta-geometric models, which performed the best in scenario 6 without heterogeneity, there were no relevant differences between models, suggesting that all models were similar in their (in)ability to estimate individual predictions in the presence of substantial unobserved heterogeneity.

Brier score
All models had similar Brier scores in scenarios 1-10 (Supporting Information, Table Supp-III), with no model clearly preferred over another. The Brier score for the Kaplan-Meier (F) represents the prediction error of a model without covariates and was close to the Brier scores of models A-E in all scenarios, since the unobserved heterogeneity in individual chances of pregnancy far outweighed the information in the two simulated covariates.
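In its simplest form, the Brier score at a landmark is the mean squared difference between the predicted probability of pregnancy within the prediction horizon and the observed binary outcome, over couples still at risk at that landmark. The sketch below (our own illustration; it ignores the censoring adjustment that a full evaluation would require) shows this basic form:

```python
def brier_score(preds, outcomes):
    """Basic Brier score at one landmark: mean squared difference
    between the predicted probability of pregnancy within the horizon
    and the observed binary outcome (1 = pregnant within horizon).
    Note: no inverse-probability-of-censoring weighting is applied."""
    n = len(preds)
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / n
```

A covariate-free reference model, such as the Kaplan-Meier (F), predicts the same probability for everyone, which is why its Brier score is close to that of the covariate models when heterogeneity dominates.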

c statistic
The c statistics (Supporting Information, Table Supp-IV) were also similar for all models in scenarios 1-10, except for the separate Cox approach (A). This approach yielded the poorest c statistic in nearly all scenarios and landmarks, except in scenario 1 when the c statistic was evaluated on internal data: there, model A yielded the highest c statistic beyond landmark s = 0, reaching 0.66 at landmark s = 26. This indicates that model A may discriminate well within the development data at later landmarks but does not transport well to external datasets at those landmark time points.
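One common way to compute a c statistic for time-to-event data is Harrell-type pairwise concordance; the exact definition used in a given study may differ, so the following is a hedged sketch of that general idea, not the paper's implementation:

```python
def c_statistic(preds, times, events):
    """Harrell-type concordance. Among comparable pairs (one couple
    conceives strictly before the other's follow-up ends), count the
    fraction where the couple with the earlier pregnancy received the
    higher predicted probability; tied predictions count as 0.5."""
    concordant = comparable = 0.0
    n = len(preds)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # pair is comparable if i has the event strictly before time j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if preds[i] > preds[j]:
                    concordant += 1
                elif preds[i] == preds[j]:
                    concordant += 0.5
    return concordant / comparable
```

A value of 0.5 indicates no discrimination, which is why values around 0.60 reflect the modest discriminative information in the two covariates.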

Convergence
The beta-geometric mixture model (D) failed to converge in 17.7% of simulation replications of the main scenario 1; at least one of the separate Cox landmark models (A) failed in 2.3%. The other models always converged. Model D failed due to insufficient information after censoring and an insufficient length of follow-up to estimate the sterile fraction; model A failed due to an insufficient number of events at later landmarks. Table Supp-II in the Supporting Information shows the proportion of convergence failures for models D and A in the remaining simulation scenarios.
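To illustrate what is being maximized, the sketch below writes out the beta-geometric log-likelihood, in which the per-cycle conception probability for a couple still at risk in cycle t is α/(α + β + t − 1). The fitting helper `fit_bg_grid` is our own hypothetical stand-in using a crude grid search rather than a proper optimizer, purely to keep the sketch self-contained; it does not include a sterile fraction, which is the component whose estimation drives the convergence problems described above:

```python
import math

def bg_hazard(alpha, beta, t):
    # conception probability in cycle t, given no pregnancy before cycle t
    return alpha / (alpha + beta + t - 1)

def bg_loglik(alpha, beta, times, events):
    """Log-likelihood of discrete times (in cycles) under the
    beta-geometric model; events[i] = 1 for pregnancy, 0 for censoring."""
    ll = 0.0
    for t, d in zip(times, events):
        for k in range(1, t):          # survived cycles 1 .. t-1
            ll += math.log(1 - bg_hazard(alpha, beta, k))
        if d:
            ll += math.log(bg_hazard(alpha, beta, t))      # event in cycle t
        else:
            ll += math.log(1 - bg_hazard(alpha, beta, t))  # censored after t
    return ll

def fit_bg_grid(times, events, grid):
    # crude, illustrative grid search over (alpha, beta)
    best = max((bg_loglik(a, b, times, events), a, b)
               for a in grid for b in grid)
    return best[1], best[2]
```

With heavy censoring and short follow-up, the likelihood surface becomes flat in the direction that separates strong heterogeneity from a sterile subgroup, which is consistent with the convergence failures of model D.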

APPLICATION
We applied the five models to clinical data obtained from 2002 to 2004 in a Dutch national prospective cohort study conducted in 38 fertility centres (van der Steeg et al., 2007; van Eekelen et al., 2017a). Couples for whom no barrier to pregnancy could be found during the fertility workup (unexplained subfertility) were followed for pregnancy from completion of the fertility workup onwards. Data on n = 4999 couples were available, decreasing to n = 151 in the final landmark. The median number of cycles of follow-up was 7. After fitting the models, we evaluated the average predictions, Brier scores and c statistics in all landmarks in an internal validation, with all methods and approaches similar to the simulation study.
To align with the simulation study, we only used the covariates female age and duration of subfertility in the models as these were shown to be the most important predictors of pregnancy in previous work, as was the piecewise linear effect of female age above and below approximately 33 years (Hunault et al., 2004). All models converged. The model parameters are shown in Table 4 for the landmarking-based models and Table 5 for the beta-geometric models.
The pooled coefficient estimates from the 'stacked' landmarking-based models (B and C) are easier to interpret than those from the separate Cox models, for which the coefficients change over time. We observed that some coefficients in models B and C pointed in counterintuitive directions, such as a positive effect on pregnancy of increasing female age over 33 years. The estimated sterile fraction in the beta-geometric mixture model (D) was close to zero.
The average model predictions per landmark in the application to clinical data are shown in the last (lower right-most) panel of Figure 1, decreasing from 0.27 at s = 0 to 0.07 at s = 26. Notably, the tendency of the super ipl* landmark model (C) to overestimate in the first six landmarks relative to the Kaplan-Meier estimates was also visible in this application to real data. The beta-geometric models underestimated relative to the Kaplan-Meier estimates from landmark s = 7 onwards, but the Kaplan-Meier estimates themselves were imprecise due to the rapidly decreasing sample size over time.

TABLE 4 Estimated parameters for the landmarking-based models in the data application

Brier scores were similar for all models, decreasing from 0.21 at s = 0 to 0.07 at s = 26. c statistics were also similar for all models and comparable to those of fixed, that is non-dynamic, models in the field, but showed more variability over landmarks than the Brier scores, with a median of 0.60: they increased from 0.60 at s = 0 to a maximum of 0.71 at s = 24 before falling back to 0.59 at s = 26.

DISCUSSION
In this paper, we have compared two methods for dynamic prediction of time to pregnancy and applied them to clinical data. In most of our simulation scenarios, although the beta-geometric models were slightly biased at later landmarks, the beta-geometric model without the sterility parameter had the lowest RMSE of average predictions of all models due to its higher precision. The landmarking-based Cox models performed better than the beta-geometric models in terms of RMSE when a large amount of information, in terms of sample size both in the baseline landmark and in later landmarks, was available. All models were comparable in RMSPE and Brier score due to the difficulty of estimating chances at the individual level in the presence of strong unobserved heterogeneity between couples. The separate Cox approach (A) yielded a high c statistic at later landmarks in internal validation but did not transport well to external data because it followed noise in the small, highly variable landmark risk sets later on. Our results were based on a simulation study using several data-generating mechanisms, but limitations remain when interpreting and generalizing these results for clinical data for which the data-generating mechanism is unknown.
The super ipl* landmark model was not able to accurately model the baseline hazard in the early landmarks, not even when adding cubic terms of landmark numbers or when using natural or restricted cubic splines of landmark numbers as a sensitivity analysis, so we advise against using this method. The tendency of this model to overestimate chances in early landmarks was also noted in the internal validation of our application to clinical data. The main challenge for the other two landmarking approaches is a low (effective) sample size, not only expressed as the risk set in the first landmark but in particular if one wishes to cover later landmarks in the presence of censoring. When estimating separate Cox models on all landmarks, the fluctuations in covariate effects over time are sensitive to noise in small risk sets. Another disadvantage of this method is that varying covariate effects can be difficult to report and communicate to clinicians and their patients.
We recommend landmarking in combination with fitting a super ipl Cox model that incorporates landmarks as strata when the effective sample size remains above 500 over follow-up since this model was both accurate and concise. At sample sizes around 500 or lower, the RMSE for the model increased considerably.
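The stacked dataset underlying such a super landmark model can be built by selecting, for each landmark s, the couples still at risk at s and adding the landmark as a stratum variable. The following is a minimal sketch of that construction under our own simplified data layout (a list of records with `time`, `event` and covariates), not the authors' code:

```python
def stack_landmarks(data, landmarks, horizon):
    """Build a stacked landmark dataset: for each landmark s, keep the
    couples still at risk at s (no pregnancy or censoring by cycle s)
    and record the landmark as a stratum variable.
    data: list of dicts with 'time' (cycle of pregnancy or censoring),
    'event' (1 = pregnancy, 0 = censored) and any covariates."""
    rows = []
    for s in landmarks:
        for rec in data:
            if rec["time"] > s:  # still at risk at landmark s
                row = dict(rec)
                # administratively censor at the prediction horizon s + w
                row["time"] = min(rec["time"], s + horizon)
                row["event"] = rec["event"] if rec["time"] <= s + horizon else 0
                row["landmark"] = s
                rows.append(row)
    return rows
```

A stratified Cox model fitted to this stacked dataset, with `landmark` as the stratum, then yields one set of pooled covariate effects with a separate baseline hazard per landmark.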
Both beta-geometric models were sensitive to the presence of a sterile fraction, which led to overestimated probabilities at later landmarks. Estimation of the sterile fraction is computationally challenging, and convergence requires sufficient information, both in terms of a large sample size and a long follow-up (Klebanov & Yakovlev, 2007; Shi & Yin, 2017). In our application to clinical data, the estimated sterility parameter was close to zero, which might be because couples in the data were followed for only a median of seven cycles. If the effective sample size drops below 500 in landmarks that researchers wish to cover, as was the case in the data used by van Eekelen et al. (2017a) to develop their clinical prediction model, the beta-geometric model without the sterility parameter is a more precise alternative to landmarking. The beta-geometric model was also robust in scenarios where the heterogeneity distribution was not strictly beta or where we introduced decreasing probabilities of conception per cycle due to ageing over follow-up.
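The dynamic prediction from a fitted beta-geometric model (without sterility parameter) has a simple closed form: under a Beta(α, β) distribution of per-cycle chances, the conception probability for a couple still at risk in cycle t is α/(α + β + t − 1), which declines over follow-up because couples with higher chances conceive, and leave the risk set, first. A minimal sketch of the resulting landmark prediction, assuming given parameter estimates:

```python
def bg_dynamic_pred(alpha, beta, s, w):
    """P(pregnancy within the next w cycles | no pregnancy in the
    first s cycles) under a beta-geometric model with Beta(alpha, beta)
    heterogeneity: one minus the product of per-cycle survival
    probabilities over cycles s+1 .. s+w."""
    surv = 1.0
    for t in range(s + 1, s + w + 1):
        surv *= 1 - alpha / (alpha + beta + t - 1)
    return 1 - surv
```

Because the whole landmark curve follows from two (or, with a sterile fraction, three) parameters estimated on all data, the model remains precise at late landmarks where landmarking must rely on small risk sets.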

SUPPORTING INFORMATION
Additional supporting information including source code to reproduce the results may be found online in the Supporting Information section at the end of the article.