Effects of duration of follow‐up and lag in data collection on the performance of adaptive clinical trials

Different combined outcome‐data lags (follow‐up durations plus data‐collection lags) may affect the performance of adaptive clinical trial designs. We assessed the influence of different outcome‐data lags (0–105 days) on the performance of various multi‐stage, adaptive trial designs (2/4 arms, with/without a common control, fixed/response‐adaptive randomisation) with undesirable binary outcomes under different inclusion rates (3.33/6.67/10 patients/day) in scenarios with no, small, and large differences. Simulations were conducted under a Bayesian framework, with constant stopping thresholds for superiority/inferiority calibrated to keep type‐1 error rates at approximately 5%. We assessed multiple performance metrics, including mean sample sizes, event counts/probabilities, probabilities of conclusiveness, root mean squared errors (RMSEs) of the estimated effect in the selected arms, and RMSEs between the analyses at the time of stopping and the final analyses including data from all randomised patients. Performance metrics generally deteriorated when the proportions of randomised patients with available data were smaller due to longer outcome‐data lags or faster inclusion, that is, mean sample sizes, event counts/probabilities, and RMSEs were larger, while the probabilities of conclusiveness were lower. Impairments of performance metrics with outcome‐data lags of ≤45 days were relatively small compared with those occurring with lags of ≥60 days. For most metrics, the effects of different outcome‐data lags and lower proportions of randomised patients with available data were larger than those of different design choices, for example, the use of fixed versus response‐adaptive randomisation. Increased outcome‐data lag substantially affected the performance of adaptive trial designs. Trialists should consider the effects of outcome‐data lags when planning adaptive trials.


| INTRODUCTION
Most randomised clinical trials are conducted with no or few interim analyses that usually employ very strict criteria for stopping early, 1 for example, O'Brien-Fleming monitoring boundaries that preserve most of the alpha for the final analysis (the final analysis thus employs more lenient criteria for declaring an intervention effective than the earlier analyses). 2 This approach is challenged by sample size calculations based on assumptions that often turn out to be optimistic or incorrect, [3][4][5][6] which may lead trials to ultimately be unable to firmly confirm or reject clinically important effects and to run longer than required. 1,4,7,8 Thus, adaptive trials with a higher number and frequency of interim analyses have received increased attention. 1,9,10 Such designs may employ adaptive dropping of inferior arms, adaptive stopping of the full trial, and response-adaptive randomisation according to pre-specified adaptation rules. 10,11 Regardless of design, whenever interim or adaptive analyses are conducted before inclusion, follow-up, and data collection for all patients have concluded, the proportion of randomised patients with available data at the time of analysis will be below 100%, except when overrunning is not allowed (i.e., when inclusion is paused during data-collection lag periods and while awaiting analysis results, which is often infeasible in practice, especially when the number of possible analyses is large). Consequently, effect estimates used for adaptations may later change in magnitude or even direction once data from all randomised patients are analysed. 12 Higher inclusion rates, longer follow-up durations, and longer data-collection lags will aggravate this risk, 13 especially with a relatively higher number of adaptive analyses. Similarly, the influence of longer outcome-data lags depends on the ratio of follow-up duration to inclusion period, 13 with a threshold of <0.25 previously suggested as a rule of thumb for when adaptive trial designs are useful. 14 While these factors should be considered when selecting follow-up durations and planning data collection and verification procedures in adaptive trials, studies that could inform these decisions by assessing the influence of different outcome-data lags (durations of follow-up plus data-collection lags) in complex adaptive trials under different inclusion rates are lacking, and previous methodological studies have been criticised for ignoring these factors and assuming that outcome data are always instantaneously available following inclusion. 13 Consequently, we undertook a simulation study assessing the influence of several different outcome-data lags on the performance of several adaptive trial designs under different inclusion rates. We hypothesised that longer outcome-data lags would affect the performance characteristics of adaptive trials to an extent that could sway trialists to prefer relatively shorter follow-up durations or less adaptation in such trial designs, where possible. 15

| METHODS
We conducted this simulation study according to a publicly pre-registered protocol and statistical analysis plan 15 adhering to recommendations for statistical simulation studies 16 and considering previous studies assessing the performance of different adaptive trial designs. 17-19

| Statistical methods and simulation
We conducted all simulations using R v4.2.2 and our adaptr package 20 (inceptdk.github.io/adaptr; see Appendix C). The adaptr package simulates adaptive trials with adaptive arm dropping, adaptive stopping, and/or response-adaptive randomisation using Bayesian statistical methods. The complete analysis code is included in Appendix C.
Using the adaptr package, we simulated trials in which patients were randomly allocated to the active treatment arms (described below) according to the current allocation ratios, immediately followed by simulation of random, undesirable binary outcomes (e.g., mortality) from Bernoulli distributions. Subsequently, the probability of experiencing the outcome in each trial arm was estimated using beta-binomial conjugate prior models [20][21][22] with flat beta(α = 1, β = 1) priors, which corresponds to all event probabilities being equally probable a priori and, information-wise, to two additional patients in each arm (one with the outcome, one without). 21 Importantly, while outcomes were randomly generated immediately following randomisation, outcome data were only used for patients who had completed the relevant combined follow-up duration/data-collection lag period at the time of each analysis.
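The conjugate beta-binomial update described above can be sketched as follows (an illustrative Python example, not the authors' R/adaptr code; the interim counts are hypothetical):

```python
# Beta-binomial conjugate update with a flat beta(1, 1) prior:
# the posterior for an arm's event probability is beta(1 + events, 1 + non-events).

def posterior_params(events: int, n: int,
                     prior_alpha: float = 1.0, prior_beta: float = 1.0):
    """Return (alpha, beta) of the posterior beta distribution for one arm."""
    return prior_alpha + events, prior_beta + (n - events)

# Hypothetical interim data: 55 events among 200 patients with available data.
alpha, beta = posterior_params(events=55, n=200)
post_mean = alpha / (alpha + beta)  # posterior mean event probability

print(alpha, beta)          # 56.0 146.0
print(round(post_mean, 3))  # 0.277
```

The flat prior contributes one "pseudo-patient" with the event and one without, matching the two-additional-patients interpretation in the text.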
During each analysis, comparisons used 5000 independent posterior draws from each trial arm. We simulated 100,000 trials for each combination of the parameters described above; this number ensures sufficient precision for all performance metrics, corresponds to the United States Food and Drug Administration's recommendations for the assessment of final, 'real' complex adaptive trials, 23 and is likely larger than necessary given the number and spacing of combined outcome-data lags considered. 10,23

| Adaptive trial designs assessed
We considered six different adaptive trial designs:
1. Two arms using fixed, equal randomisation (50%:50%)
2. Two arms using response-adaptive randomisation
3. Four arms all compared against each other (i.e., no common control arm) using fixed, equal randomisation (25%:25%:25%:25%)
4. Four arms compared against each other using response-adaptive randomisation
5. Four arms with three interventional arms compared pairwise against a common control arm using fixed, square root ratio-based randomisation 10,24 (square root of the number of non-control arms to 1-ratio for the allocation probabilities in the common control arm and the interventional arms; that is, 36.6% control arm allocation and 21.1% allocation to each interventional arm, with allocation probabilities re-calculated after arm dropping using the same approach, for example, 41.4% to the control arm and 29.3% to each interventional arm after dropping of one interventional arm)
6. Four arms with three interventional arms and a common control group using square root ratio-based initial randomisation, 10,24 with subsequent fixed control arm allocation (36.6% initially, with the control arm allocation probability re-calculated after arm dropping as described above) and response-adaptive randomisation in the interventional arms
We considered three different inclusion rates (3.33, 6.67, and 10 patients/day), based on recent trials conducted in our setting. [26][27][28][29] In these, inclusion rates have generally been reasonably constant after an initial period of site initiations, and the first adaptive analyses in this simulation study occur after a burn-in period (described below) long enough to ignore the initial non-constant inclusion rate.
We considered eight different outcome-data lags, defined as the follow-up durations plus data-collection lags: 0 (outcome data immediately available), 15, 30, 45, 60, 75, 90, and 105 days, which covers the range of periods (including data-collection lags) mostly used in clinical trials conducted in our setting (critical care). 30 The proportions of randomised patients with available outcome data at the time of each possible adaptive analysis according to the inclusion rate and outcome-data lag combinations are presented in Figure 1. Of note, while the same outcome-data lag period can consist of different combinations of follow-up duration and data-collection lag (e.g., an outcome-data lag of 30 days could equally consist of 15 days of follow-up plus 15 days of data-collection lag or 20 days of follow-up plus 10 days of data-collection lag), any influence on adaptive trial performance is due to the length of the combined period.
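Under a constant inclusion rate and with overrunning allowed, the proportion of randomised patients with available data (as shown in Figure 1) follows from simple arithmetic; a minimal illustrative sketch (Python, not the authors' R code; a simplification that ignores time spent on the analyses themselves):

```python
def proportion_with_data(n_randomised: int, inclusion_rate: float,
                         lag_days: float) -> float:
    """Proportion of randomised patients whose outcome-data lag has elapsed
    at the moment the n-th patient is randomised, assuming a constant
    inclusion rate (patients/day) and no pause for analyses."""
    # Patients randomised more than `lag_days` ago have complete data;
    # the most recent inclusion_rate * lag_days patients do not.
    with_data = max(0.0, n_randomised - inclusion_rate * lag_days)
    return with_data / n_randomised

# E.g., at the first analysis (400 randomised), 10 patients/day, 30-day lag:
# 10 * 30 = 300 patients are still within the lag window -> 100/400 with data.
print(proportion_with_data(400, 10, 30))  # 0.25
```

This is why faster inclusion and longer lags both reduce the information available at each adaptive analysis.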
Finally, we considered three different clinical scenarios, that is, three different sets of simulated event probabilities (including both beneficial and harmful effects compared to the control/standard-of-care arm), with the two-arm trials only using the first two event probabilities in each set:
1. No differences: 25% event probability in all arms
2. Small differences: arm A 25%; arm B 22.5%; arm C 27.5%; arm D 26.5%
3. Large differences: arm A 25%; arm B 20%; arm C 30%; arm D 28%.
Across all scenarios, arm A was considered to represent the standard of care and was used as the common control in designs employing a common control arm. In total, this yielded 432 configurations based on unique combinations of 6 trial designs, 3 inclusion rates, 8 outcome-data lags (combined follow-up durations and data-collection lags), and 3 scenarios.

| Adaptations
In all simulations, the first adaptive analysis was conducted after randomisation of 400 patients, followed by analyses after every 300 additional randomised patients, up to a maximum of 10,000 randomised patients. Adaptive analyses were thus conducted according to the number of patients randomised (and not the number of patients with outcome data available or fixed time-points, as may also be done), and early planned adaptive analyses were skipped in cases where no patients had outcome data available. Overrunning was allowed, that is, inclusion was not paused while awaiting data collection and adaptive analyses to complete. In each simulation, a final analysis including outcome data from all randomised patients was conducted after stopping and used to calculate a specific performance metric (described below), but superiority could not be declared at this analysis.

| Stopping rules
Based on comparisons of the posterior distributions, two-arm trials were stopped for superiority if the probability of one arm being the best exceeded a certain threshold. In the four-arm trials with all arms compared against each other, arms were dropped for inferiority if their probabilities of being the overall best were below a certain threshold, and trials were stopped for superiority when the probability of one arm being the overall best exceeded a certain threshold. In these cases, arm dropping was immediately followed by a new analysis; as the probabilities of each arm being the best must sum to 100%, dropping an arm and updating these probabilities may lead to another arm crossing the superiority threshold. In the four-arm trials with pairwise comparisons against a common control group, interventional arms were dropped for inferiority if their probability of being better than the control was below a certain threshold. If the probability of an intervention being better than the control group exceeded a certain threshold, that interventional arm was promoted to be the new control arm (with all subsequent pairwise comparisons of interventional arms conducted against the new control arm only), the previous control arm was dropped for inferiority, and all remaining interventional arms were immediately compared to the new control arm, as previously described. 10,20 In these designs, trials were stopped for superiority whenever only one arm remained after dropping all other arms for inferiority, regardless of whether the remaining arm was the control or an interventional arm (i.e., the overall trial status was not considered stopped for superiority if the original common control was dropped due to superiority of an interventional arm while other interventional arms remained for comparison with the new control arm). All stopping thresholds were constant during all analyses in the same set of simulations and symmetric, that is, the probability thresholds for inferiority were defined as 100% minus the probability thresholds for superiority. To ensure fair comparisons between designs, we calibrated stopping rules to keep the overall Bayesian type-1 error rates at approximately 5% (as is usually required for 'real' clinical trials), 23 defined as the probability of conclusiveness in the scenario with no differences between trial arms, 10 corresponding to two-sided family-wise type-1 error rates for the outcome across all arms compared. We used a sequential model-based Bayesian optimisation algorithm 31 with Gaussian process interpolation, 32 initially based on fewer simulations than the final 100,000. If the probability of conclusiveness was ≤4% or ≥6% after this initial calibration, we planned to repeat calibration with 100,000 simulations in each step until these criteria were met; this, however, proved unnecessary.

F I G U R E 1 Proportion of randomised patients with data available at each adaptive analysis. Proportion of randomised patients with data available at the time of each possible adaptive analysis (marked with dotted lines) according to inclusion rates and outcome-data lags (follow-up durations plus data-collection lags). Horizontal axis: % of patients included out of the maximum possible 10,000; vertical axis: proportion (%) of randomised patients with outcome data available at the time of each analysis.
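The superiority and inferiority decisions above rest on the probability of each arm being the overall best, estimated from independent posterior draws. A minimal illustrative sketch (Python with NumPy rather than the authors' R/adaptr code; interim counts are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_best(events, totals, draws=5000):
    """P(each arm is best), estimated as the share of joint posterior draws
    in which that arm has the LOWEST event probability (the outcome is
    undesirable). Flat beta(1, 1) priors, as in the text."""
    samples = np.column_stack([
        rng.beta(1 + e, 1 + n - e, size=draws) for e, n in zip(events, totals)
    ])
    best = samples.argmin(axis=1)  # index of the lowest draw in each row
    return np.bincount(best, minlength=len(events)) / draws

# Hypothetical two-arm interim data: 40/200 vs 60/200 events.
p = prob_best(events=[40, 60], totals=[200, 200])
print(p.sum())  # 1.0 -- the probabilities sum to 100%
# The trial stops for superiority if max(p) exceeds the calibrated
# threshold (roughly 0.98-0.99 in this study).
```

Because these probabilities sum to 100%, dropping an arm and renormalising can push another arm across the superiority threshold, as noted above.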

| Response-adaptive randomisation
Response-adaptive randomisation was based on the probabilities of each arm being the overall best regardless of the design (reflecting the aim of ultimately finding a single best arm), 18 'softened' by raising the raw probabilities to the power 0.7 and normalising, with minimum allocation ratios of 30% (two-arm trials) and 15% (four-arm trials) for all active arms. 10
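The softening and minimum-allocation restriction can be sketched as follows (an illustrative Python simplification; the exact restriction scheme used by adaptr may differ — here the floor is enforced by giving each arm its minimum and distributing the remainder in proportion to the softened probabilities):

```python
import numpy as np

def softened_allocation(p_best, power=0.7, min_alloc=0.15):
    """Response-adaptive allocation: raise each P(arm is best) to `power`,
    normalise, then guarantee a minimum allocation per active arm by
    assigning the floor first and splitting the rest proportionally."""
    soft = np.asarray(p_best, dtype=float) ** power
    soft /= soft.sum()                       # normalised softened probabilities
    k = len(soft)
    # Each arm gets min_alloc; the remaining mass follows the softened weights.
    return min_alloc + (1 - k * min_alloc) * soft

alloc = softened_allocation([0.70, 0.15, 0.10, 0.05])
print(alloc.round(3))  # every arm >= 15%; allocations sum to 1
```

Raising probabilities to a power below 1 pulls allocations towards equality, limiting how aggressively the randomisation reacts to interim data.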

| Performance metrics
For each configuration, the following performance metrics were calculated 10,17-20:
• Mean (expected) total (i.e., including all arms) sample sizes, event probabilities, and event counts across all simulated trials (calculated for all randomised patients, i.e., after concluding follow-up and data collection for all randomised patients); these were supplemented with standard deviations, medians, and interquartile ranges.
• Probability of conclusiveness: the percentage of simulated trials stopped for superiority at an adaptive analysis (i.e., the last time-point where superiority could be declared was the adaptive analysis conducted after randomisation of 10,000 patients, which only included those with available outcome data at that time); this corresponds to the type-1 error rate in the scenario with no differences and to the power (100% − the type-2 error rate) in the scenarios with differences. 10
• Probability of selecting the best arm (i.e., stopping with a superiority decision for the best arm).
• Root mean squared errors (RMSEs) of (1) the estimated event probabilities (median values of the posterior distributions from the last adaptive analysis, that is, when trials were stopped) in the selected arms compared to the true event probabilities and (2) the differences between the estimated event probabilities in the selected arms at the last adaptive analysis (i.e., after randomisation of a maximum of 10,000 patients and including only those with available outcome data at that time) and those from a final analysis including outcome data for all randomised patients.
• Ideal design percentages (IDPs): a measure combining arm selection probabilities, power, and the consequences of selecting inferior arms (i.e., selecting an arm that is slightly inferior to the best will affect the IDP less than selecting a substantially inferior arm), defined and calculated as previously described (with n denoting the number of arms in the trial). 10,17-20
Probabilities of selecting the best arm and IDPs were only calculated in scenarios with differences present. RMSEs of the estimated event probabilities in the selected arms and IDPs were calculated twice 10,20: first, for trials ending in superiority only, with the superior arm selected, and, second, assuming that the control or standard-of-care arm was selected in inconclusive trials (except in four-arm trial simulations where this arm was dropped at an earlier analysis), as this corresponds to clinical practice, where the standard-of-care arm is most commonly used after an inconclusive trial. 10
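The RMSE metrics above reduce to a standard computation across simulated trials; a minimal illustrative sketch (Python; the estimates and the true event probability of 0.20 are hypothetical):

```python
import math

def rmse(estimates, truths):
    """Root mean squared error between estimated and true event
    probabilities across simulated trials."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(estimates)
    )

# Hypothetical selected-arm estimates from four simulated trials,
# each compared against a true event probability of 0.20:
est = [0.18, 0.22, 0.19, 0.25]
print(round(rmse(est, [0.20] * 4), 4))  # 0.0292
```

The second RMSE variant in the text is computed the same way, but with the final-analysis estimates in place of the true probabilities.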

| RESULTS
Calibrated superiority stopping thresholds ranged from 98.09% to 99.75% and were largely similar across outcome-data lags (Table S1, Appendix B). Raw numerical results for all performance metrics are presented in Tables S2-S11 (Appendix B).

| Mean total sample sizes
Longer outcome-data lags substantially affected sample sizes in the scenarios with small or large differences (Figure 2), with the largest differences in the simulations with the fastest inclusion rates and in the four-arm trials without a common control group, for which differences of up to approximately 1150 (small-differences scenario) and 1440 (large-differences scenario) patients were observed. Mean total sample sizes were below the maximum of 10,000 patients in all simulation configurations, regardless of between-arm differences and outcome-data lags.

| Mean total event counts and probabilities
Longer outcome-data lags substantially affected total event counts in the small- and large-differences scenarios (Figure 3), especially when combined with higher inclusion rates, with differences of up to approximately 350 additional events. Mean total event probabilities were similarly affected, although differences were relatively small in the two-arm designs (approximately 0.2%-points in the scenario with large differences and substantially less in the other scenarios) and smaller in the four-arm trials with a common control than in those without (approximately 0.2%-points versus 0.6%-points in the scenarios with large differences, respectively; Figure S1, Appendix A). Designs using response-adaptive randomisation generally led to lower mean total event probabilities, especially with shorter outcome-data lags.

| Probability of conclusiveness and selecting the best arm
The probabilities of conclusiveness (Figure 4) in the scenarios without differences (corresponding to the type-1 error rates) ranged from 4.52% to 5.55% after calibration. In the scenarios with differences, the probabilities of conclusiveness (corresponding to power) were ≥99.3% in all scenarios with large differences. In the scenarios with small differences, we observed differences of up to 9.1%-points according to outcome-data lag, largest in the four-arm designs with fast inclusion rates. The probabilities of selecting the best arm showed a pattern similar to the probabilities of conclusiveness (Figure S2, Appendix A).

| Root mean squared errors
RMSEs for the estimated compared to the true event probabilities in the selected arm across simulations ending in superiority only (Figure S3, Appendix A) ranged from approximately 1.4%-points to 2.8%-points in the large- and small-differences scenarios, and from approximately 3.2%-points to 7.5%-points in the no-differences scenarios, with relatively large increases with longer outcome-data lags and the largest differences present in the four-arm trials. A similar pattern emerged when selecting the standard-of-care/control arm in inconclusive trials, although RMSEs, especially in the no-differences scenarios, were smaller in magnitude, and RMSEs in the no-differences scenarios were smaller than in the scenarios with differences (Figure S4, Appendix A). RMSEs for the estimated event probabilities at the last adaptive analysis compared to the final analysis including all patients in trials stopped for superiority increased substantially with longer outcome-data lags, from 0%-points to 5.8%-points (no-differences scenarios), to 2.0%-points (small-differences scenarios), and to 2.2%-points (large-differences scenarios), with the largest increases seen with faster inclusion (Figure 5).

| Ideal design percentages
IDPs decreased with longer outcome-data lags, both when calculated across trials ending in superiority only and when selecting the standard-of-care/control arm in inconclusive trials (Figures S5 and S6, Appendix A). IDPs were >98.4% in all trials in the large-differences scenario regardless of calculation method; in the small-differences scenarios, all IDPs were >97.9% when restricted to trials ending in superiority. When selecting the standard-of-care/control arm in inconclusive trials, IDPs in the small-differences scenario decreased substantially with longer outcome-data lags, with values ranging from 59.5% to 69.7% in the two-arm trials, 85.5% to 88.9% in the four-arm trials without a common control group, and 82.8% to 89.2% in the four-arm trials with a common control group.

| DISCUSSION
We assessed the influence of different outcome-data lags, that is, combined follow-up durations and data-collection lags, on the performance of several adaptive trial designs using statistical simulation under three different underlying inclusion rates and scenarios with no, small, or large differences present. As expected, we found that performance metrics deteriorated with diminishing proportions of randomised patients with outcome data available at the time of analysis due to longer outcome-data lags or faster inclusion. Deteriorations in performance metrics with outcome-data lags of up to 45 days were generally smaller than those occurring with lags of 60 days or more, which were substantial. This pattern was similar across inclusion rates but more pronounced with faster inclusion. Notably, important deteriorations with longer outcome-data lags occurred despite all outcome-data lag/maximum inclusion period ratios in our study being substantially below the previously suggested threshold of 0.25 for when adaptive trials become useful (all ≤0.105 in our study). 14 The effects of different outcome-data lags and lower proportions of randomised patients with available data on most metrics were generally larger than the effects of using fixed versus response-adaptive randomisation. Longer outcome-data lags affected performance metrics that may be prioritised for logistical reasons (expected sample sizes), to benefit patients external to the trial (probability of conclusiveness/selecting the best arm, IDPs), to benefit patients internal to the trial (total event counts/probabilities), and to yield accurate estimates (RMSEs). 10 As longer outcome-data lags mean that fewer patients contribute data to each analysis, total sample sizes increase, which in turn increases total event counts and leads to deteriorations in the other performance metrics due to adaptations being made on less available data. Our results add to the existing literature assessing errors in treatment estimates in more conventional adaptive trials using group sequential designs, 11 using simulations 33,34 and analytical methods, 35 by illustrating the influence of incomplete information on the accuracy of treatment effect estimates. While appropriate stopping rules control the overall type-1 error rate, treatment effects will generally be overestimated in trials stopped before reaching the maximum sample size, although the extent of this seems to vary from slight and typically unimportant 33,35 to potentially important. 34,35 Importantly, overestimation is typically smaller in large trials (comparable to those assessed in our simulations) 34 and in trials stopped early according to constant stopping rules (i.e., Pocock monitoring boundaries in conventional group sequential designs, akin to our constant stopping rules) than in those using stopping rules that are more conservative at earlier analyses (i.e., O'Brien-Fleming or Haybittle-Peto monitoring boundaries). 35 As expected, RMSEs calculated for simulations stopping for superiority were highest in the scenarios without differences present, as all superiority decisions were, by definition, type-1 errors in these simulations.

F I G U R E 4 Probability of conclusiveness. Probability of conclusiveness for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot. The probability of conclusiveness may be interpreted as the type-1 error rate in scenarios without differences and as the power (100% − the type-2 error rate) in scenarios with differences.
On a more general note, comparing designs using fixed randomisation to the corresponding designs using response-adaptive randomisation showed that (1) mean total event counts and probabilities were lower with response-adaptive randomisation regardless of the number of arms or the use of a common control, which is explained by allocation of more participants to better arms (i.e., arms with lower event probabilities), (2) expected sample sizes were only slightly larger with response-adaptive randomisation in some of the simulations of two-arm trials, and (3) response-adaptive randomisation led to slightly lower probabilities of conclusiveness in the two-arm trials, similar probabilities in the four-arm trials without a common control, and slightly higher probabilities in the four-arm trials with a common control arm. This suggests that response-adaptive randomisation, at least when restricted, is not detrimental in two-arm trials and may even be preferable for patients internal to the trials due to a potentially lower risk of undesirable outcomes. As such, previous discussions about whether response-adaptive randomisation is preferable or suboptimal from an ethical point of view in two-arm trials [36][37][38][39][40] should not cause trialists to abandon the use of response-adaptive randomisation in such trials before submitting such designs to proper evaluation.

F I G U R E 5 Root mean squared errors between last adaptive analysis and final analysis (superiority only). Root mean squared errors (RMSEs) of the estimated event probabilities in the selected arm for trials ending in superiority only, at the last adaptive analysis (where the trial was stopped for superiority) compared with the final analysis (including outcome data for all randomised patients), across all simulations for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.

| Strengths
This simulation study has multiple strengths. First, we conducted it according to a protocol and statistical analysis plan registered and made publicly available prior to conducting the analyses, 15 and we include all analysis code in Appendix C for full transparency and reproducibility. Second, we ran a large number of simulations for each configuration, ensuring adequate precision of the presented estimates. Third, the relatively broad range of realistic outcome-data lags in combination with different inclusion rates yielded a relatively fine grid of different proportions of randomised patients with outcome data available at each analysis. Further, the influence of outcome-data lags was assessed in multiple disparate trial designs, with the designs using response-adaptive randomisation employing sensible restrictions, and under different assumed clinical scenarios. While the results are likely not generalisable to all settings, the combination of multiple outcome-data lags, trial designs, and clinical scenarios means that these results should have reasonable external validity in at least relatively similar settings. Finally, the calibration of all trial designs to ensure type-1 error rates of approximately 5% resembles actual clinical trial design and ensures comparability of results across simulation configurations.

| Limitations
This study also has limitations. First, while a relatively broad range of outcome-data lags was assessed using various relevant trial designs, clinical scenarios, and inclusion rates, the results are not generalisable to all other settings, and preferences for specific designs and outcome-data lags may depend on other factors and practical/logistical considerations external to the trial design per se and, thus, not assessed here. We recommend that trialists who consider different follow-up durations assess their influence in a formalised manner using simulations such as those in this study, to guide choices in actual trials. Second, the clinical scenarios and maximum allowed sample sizes (of 10,000 patients) were somewhat arbitrary but chosen to resemble what may realistically be chosen in large, pragmatic phase-3 or -4 trials conducted in the clinical setting of the clinical co-authors, that is, critical care. Of note, the choice of similar maximum sample sizes in both two- and four-arm trials could be challenged, and our maximum sample sizes were substantially larger than in some other simulation studies of adaptive trial performance. 17,18 Finally, while conducted using a Bayesian framework, we used minimally informative priors in all analyses. Although beyond the scope of this study, the Bayesian framework enables the use of more informative priors conveying scepticism (to protect against early adaptations to chance through regularisation 19) or incorporating either previous results or actual a priori beliefs, and sceptical or informative priors may be considered in actual trial planning.

| Conclusions
We found that performance metrics in adaptive trials deteriorated with diminishing proportions of randomised patients with data available at the time of analysis due to longer outcome-data lags, especially when this proportion decreased further due to faster inclusion rates. The effects of longer outcome-data lags on performance metrics were generally larger than those of design characteristics such as fixed versus response-adaptive randomisation. The relative impairment of performance metrics increased substantially with lags of 60 days or more; consequently, trialists should consider the effects of outcome-data lags when planning adaptive trials.

F I G U R E 2 Mean sample sizes. Mean sample sizes across all simulations for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.

F I G U R E 3 Mean total event counts. Mean total event counts across all simulations for each design/inclusion pattern/scenario/outcome-data lag combination, with specific simulation settings outlined in the legend below the plot.