Required efficacy for novel therapies in BCG‐unresponsive non‐muscle invasive bladder cancer: Do current recommendations really reflect clinically meaningful outcomes?

Abstract

Background: Single-arm trials are currently an accepted study design to investigate the efficacy of novel therapies (NT) in non-muscle invasive bladder cancer (NMIBC) unresponsive to intravesical Bacillus Calmette-Guérin (BCG) immunotherapy, as randomized controlled trials are either unfeasible (comparator: early radical cystectomy; ERC) or unethical (comparator: placebo). To guide the design of such single-arm trials, expert groups have published recommendations for clinically meaningful outcomes. The aim of this study was to quantitatively verify the appropriateness of these recommendations.

Methods: We used a discrete event simulation framework in combination with a supercomputer to find the required efficacy at which a NT can compete with ERC in terms of quality-adjusted life expectancy (QALE). In total, 24 different efficacy thresholds (including the recommendations) were investigated.

Results: After ascertaining face validity with content experts, repeated verification, external validation, and calibration, we considered our model valid. Both recommendations rarely showed an incremental benefit of the NT over ERC. In the most optimistic scenario, an increase in the IBCG recommendation by 10% and an increase in the FDA/AUA recommendation by 5% would yield results at which a NT could compete with ERC from a QALE perspective.

Conclusions: This simulation study demonstrated that the current recommendations regarding clinically meaningful outcomes for single-arm trials evaluating the efficacy of NT in BCG-unresponsive NMIBC may be too low. Based on our quantitative approach, we propose increasing these thresholds to at least 45%-55% at 6 months and 35% at 18-24 months (complete response rates/recurrence-free survival) to promote the development of truly clinically meaningful NT.


Influence diagrams

Supplemental Figure 1: Influence diagram visualizing the concepts incorporated into the discrete event simulation model and their assumed interaction. Arrows represent influences but do not imply causality. ERC: early radical cystectomy; NT: novel therapy; RC: radical cystectomy


Supplemental Figure 2: Influence diagram visualizing the concepts incorporated into the discrete event simulation model and their assumed interaction (strategy: early radical cystectomy). Arrows represent influences but do not imply causality. ERC: early radical cystectomy; NT: novel therapy; RC: radical cystectomy


Supplemental Figure 3: Influence diagram visualizing the concepts incorporated into the discrete event simulation model and their assumed interaction (strategy: novel therapy). Arrows represent influences but do not imply causality. ERC: early radical cystectomy; NT: novel therapy; RC: radical cystectomy


Simulation logic
The simulation logic describes the technical steps that are processed at the different stages/events of the simulation. Parameters are described in the section "Input parameters". For technical reasons, some parameters are sampled for each patient even though they may never be used. As an example, our model samples a time to muscle invasion after failure of the novel therapy for each patient, although not all simulated patients eventually experience failure of the novel therapy.

At event "NT.Metastatic":
• Biological event that can occur in patients who failed novel therapy and progressed to muscle-invasive disease
• No associated tasks

At event "NT.Staging":
• Preoperative staging before radical cystectomy in patients who failed novel therapy
• Check if event "NT.Metastatic" has occurred before

[Fragments of the input parameter table: exact timing is allowed to vary between -7 and +7 days; individual-level data were reconstructed from Figure 1A of Karakiewicz et al. [9]; C.Threshold: threshold after which t_Recurrence is not allowed anymore.]
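The up-front sampling behaviour described above can be sketched as follows. This is a minimal Python illustration only; the actual model runs on the R simmer framework, and the exponential distributions, rates, and time horizon below are invented for demonstration, not taken from the model.

```python
import random

def sample_patient(rng: random.Random) -> dict:
    # All potential event times are drawn up front, even those on
    # pathways the patient may never reach: e.g. the time to muscle
    # invasion is sampled although the novel therapy may never fail.
    # Distributions and rates are purely illustrative.
    return {
        "t_failure_nt": rng.expovariate(1 / 24.0),       # months
        "t_muscle_invasion": rng.expovariate(1 / 12.0),  # months, used only after NT failure
    }

rng = random.Random(42)
patient = sample_patient(rng)
# t_muscle_invasion is consulted only on the failure pathway:
uses_invasion_time = patient["t_failure_nt"] < 60.0  # illustrative horizon
```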

External validation methodology
As part of the validation process, we compared the outputs of our model for the strategy "early radical cystectomy" to outcomes reported in the literature (validation targets). To meet the criteria for a formal external validation, the studies used to inform the validation targets were not allowed to be among the input sources. Furthermore, simulated patients, in contrast to regular study participants, are never lost to follow-up. To account for this important difference, the external validation approach had to mimic the censoring patterns observed in the studies that informed the validation targets. This was implemented by sampling censoring times from gamma distributions, as described by Wallis et al. [21]. The validation targets are listed in Supplemental Table 5.
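The censoring mimicry can be sketched as follows (Python for illustration; the gamma shape/scale values are hypothetical placeholders, not the parameters reported by Wallis et al. [21]):

```python
import random

def apply_censoring(event_times, shape, scale, seed=1):
    """Mimic study censoring: draw a gamma-distributed censoring time
    per simulated patient and observe the earlier of event/censoring."""
    rng = random.Random(seed)
    observed = []
    for t in event_times:
        c = rng.gammavariate(shape, scale)    # censoring time
        observed.append((min(t, c), t <= c))  # (observed time, event indicator)
    return observed

# Three simulated event times (months); shape/scale are placeholders.
obs = apply_censoring([12.0, 48.0, 90.0], shape=2.0, scale=30.0)
```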

Calibration methodology
We calibrated several input parameters because the initial model output deviated from the validation targets. The calibration process was divided into two steps: the first step calibrated the model to meet the proportions of pT3/pT4 disease and positive nodal disease reported in the literature, while the second step calibrated the model against cancer-specific survival at 5 and 10 years.
Step 1

The proportions of pT3/pT4 disease and positive nodal disease are determined by the corresponding multivariable logistic regression models. We therefore modified the output of the two regression models with calibration factors: at each calculation, the preliminary probabilities are multiplied by specific calibration factors (numerical values between 0 and plus infinity). A calibration factor of 1 means no calibration, while values below 1 and above 1 translate into a decrease and an increase of the preliminary probability, respectively. An optimal set of calibration factors was defined as the parameter set that minimizes the difference between model output and validation targets. We quantified this difference by a single numeric value, the weighted goodness of fit measure, which simultaneously evaluates the deviation of several model output/validation target pairs (see Vanni et al. [23]). The lower the weighted goodness of fit measure, the closer the set of calibration factors matches the validation targets.
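The calibration-factor mechanism and a weighted goodness-of-fit measure might look as follows in Python. This is a sketch: the squared-deviation form shown is one of several options discussed by Vanni et al. [23], and the exact form used in the model is not restated here.

```python
def calibrated_probability(p_raw: float, factor: float) -> float:
    """Multiply the preliminary probability by a calibration factor
    (0 < factor < infinity); cap at 1 to keep a valid probability."""
    return min(p_raw * factor, 1.0)

def weighted_gof(model_outputs, targets, weights):
    """Weighted goodness of fit across several output/target pairs:
    lower values mean a closer match to the validation targets."""
    return sum(w * (m - t) ** 2
               for m, t, w in zip(model_outputs, targets, weights))

# A factor > 1 raises the preliminary probability, a factor < 1 lowers it:
p_up = calibrated_probability(0.40, 2.0)      # 0.8
p_capped = calibrated_probability(0.80, 2.0)  # capped at 1.0
fit = weighted_gof([0.30, 0.12], [0.28, 0.10], [1.0, 1.0])
```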
To find an optimal set of calibration factors, we used the following optimization approach:
1. Define a plausibility range for each calibration factor
2. Sample several sets of calibration factors (usually hundreds to thousands); to enhance coverage of the whole parameter space, we used latin hypercube sampling [23]
3. Run the model with each sampled set and compute the weighted goodness of fit
4. Narrow the plausibility ranges around the best-performing parameter sets (goodness of fit below the lowest percentile) and repeat from step 2

After several iterations of this optimization algorithm, we identified a set of calibration factors that yielded a model output with only minimal deviation from the validation targets. We therefore considered our model valid with regard to the proportions of pT3/pT4 disease and positive nodal disease.

[Supplemental figure: parameter sets plotted by goodness of fit (log scale), distinguishing sets with a goodness of fit below the lowest percentile from the remainder.]
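Latin hypercube sampling of calibration-factor sets (step 2 above) can be sketched in Python as below. This is a self-contained standard-library illustration; the plausibility ranges are invented.

```python
import random

def latin_hypercube(n_samples, ranges, seed=0):
    """Latin hypercube sample: each parameter's plausibility range is
    split into n_samples equal strata; exactly one draw falls in each
    stratum, and the strata are shuffled independently per parameter."""
    rng = random.Random(seed)
    samples = [[0.0] * len(ranges) for _ in range(n_samples)]
    for j, (lo, hi) in enumerate(ranges):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        width = (hi - lo) / n_samples
        for i, s in enumerate(strata):
            samples[i][j] = lo + (s + rng.random()) * width
    return samples

# e.g. 1,000 sets for two calibration factors, each with range [0.5, 2.0]:
sets = latin_hypercube(1000, [(0.5, 2.0), (0.5, 2.0)])
```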

Step 2

Next, we calibrated our model against cancer-specific survival at 5 and 10 years as reported in the literature. We assumed these validation targets to be highly influenced by the background mortality and the time to recurrence after radical cystectomy (t_Recurrence). The former was calibrated with a simple calibration factor as described earlier, while the calibration of the latter was decomposed into:
• Calibration of the lambda parameter of the Weibull distribution
• Calibration of the gamma parameter of the Weibull distribution
• Calibration of the hazard ratio modifying the raw event time yielded by the Weibull distribution
• Calibration of C.Threshold (threshold after which t_Recurrence is not allowed anymore)

To find an optimal set of input parameters, we used the optimization approach described earlier.
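The decomposition of t_Recurrence can be illustrated as follows. This Python sketch assumes a proportional-hazards interpretation of the hazard-ratio modification (for a Weibull time this rescales the event time by hr^(-1/gamma)); the exact modification used in the model may differ, and all parameter values below are placeholders.

```python
import random

def calibrated_recurrence_time(lam, gamma, hr, threshold, rng):
    """Draw a raw Weibull event time (scale lam, shape gamma), modify it
    by a hazard ratio (under proportional hazards a Weibull time scales
    by hr ** (-1 / gamma)), and disallow recurrences beyond the
    threshold (C.Threshold): such patients never recur."""
    t = rng.weibullvariate(lam, gamma)    # raw event time
    t *= hr ** (-1.0 / gamma)             # hazard-ratio modification
    return t if t <= threshold else None  # None = recurrence not allowed

rng = random.Random(7)
t = calibrated_recurrence_time(lam=60.0, gamma=1.2, hr=1.5, threshold=120.0, rng=rng)
```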
After both calibration steps, our model output matched the results reported in the literature very closely (see Supplemental Table 6). We therefore considered it valid.

Supercomputer configuration
We used the following simulation architecture:
• Lower level: simulation of each strategy among a cohort of 100,000 patients (4 x 100,000 patients)
• Middle level: replication (1,000 times) of each lower-level run to reflect the uncertainty associated with some input parameters (expert opinions, health state utility values, and parameters derived through calibration)
• Upper level: simulation of the 24 efficacy thresholds

The cohort size of 100,000 patients was chosen empirically, as this number yielded highly stable results with a percentage deviation of less than 1% from the mean value (see Supplemental Figure 6). The reliable analysis of 24 efficacy thresholds required simulating the clinical course of 9.6 billion individuals (4 strategies x 100,000 patients x 1,000 probabilistic samples x 24 efficacy thresholds). From a computational perspective, the simmer simulation core [24] had to be fed with 24,000 input sets (1,000 probabilistic samples x 24 efficacy thresholds).
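The run counts quoted above can be verified with a few lines of arithmetic (Python):

```python
# Reproduce the run-count arithmetic of the three-level architecture.
strategies = 4     # lower level: 4 strategies x 100,000 patients
cohort = 100_000
samples = 1_000    # middle level: probabilistic replications
thresholds = 24    # upper level: efficacy thresholds

total_patients = strategies * cohort * samples * thresholds  # 9.6 billion
input_sets = samples * thresholds                            # 24,000
```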
The computations were performed on the Niagara supercomputer at the SciNet HPC Consortium [25,26]. All simulation runs were performed during a resource allocation window that provided 640 computation cores (on 16 nodes) for 24 hours (effective computation time: 19 hours and 12 minutes). Each node consisted of 40 Intel Skylake cores at 2.4 GHz and 202 GB RAM. We distributed each lower-level simulation run (4 x 100,000 patients) across 10 sub-runs (4 x 10,000 patients) to prevent memory overload, although this increased the number of times the simmer simulation core [24] had to be initialized to 240,000 (10 sub-runs x 1,000 probabilistic samples x 24 efficacy thresholds). The resulting 240,000 input sets were delivered in chunks of 160 to the 16 nodes. Within each node, the 160 input sets of a single chunk were distributed across the 40 cores so they could be processed in parallel.
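The chunked dispatch of input sets can be sketched as follows (a Python illustration of the bookkeeping only; the actual scheduling was handled by the cluster environment, not by code like this):

```python
def chunked(seq, size):
    """Split the flat list of input sets into fixed-size chunks,
    one chunk per node dispatch."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

sub_runs, samples, thresholds = 10, 1_000, 24
input_sets = list(range(sub_runs * samples * thresholds))  # 240,000 sets
chunks = chunked(input_sets, 160)  # chunks of 160 input sets
sets_per_core = 160 // 40          # each of a node's 40 cores handles 4 sets
```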