Similarity of competing risks models with constant intensities in an application to clinical healthcare pathways involving prostate cancer surgery

The recent availability of routine medical data, especially in a university-clinical context, may enable the discovery of typical healthcare pathways, i.e., typical temporal sequences of clinical interventions or hospital readmissions. However, such pathways are heterogeneous in a large provider such as a university hospital, and it is important to identify similar care pathways that can still be considered typical pathways. We understand the pathway as a temporal process with possible transitions from a single initial treatment state to hospital readmission of different types, which constitutes a competing risk setting. In this paper, we propose a multi-state model-based approach to uncover pathway similarity between two groups of individuals. We describe a new bootstrap procedure for testing the similarity of transition intensities from two competing risk models with constant transition intensities. In a large simulation study, we investigate the performance of our similarity approach with respect to different sample sizes and different similarity thresholds. The studies are motivated by an application from urological clinical routine and we show how the results can be transferred to the application example.

clinicians in context. This could enable to improve general standards of clinical care and thus overall health outcomes. Still, pathways of patients in a large provider as a university hospital are heterogeneous as many diagnostic and treatment options exist and patients are partly readmitted to the hospital after discharge for different reasons. A similar healthcare pathway could still be considered a typical healthcare pathway. For this purpose, key questions would be how to measure such similarity and how to decide whether two different paths are still similar and when they would be considered different. To date, very few methodological works on the measurement of healthcare similarity can be found and these are predominantly informatics-based. For instance, assuming that healthcare pathways depend on factors such as choices made by the treating physician, Huang et al 1 suggest a fully unsupervised algorithmic approach based on a probabilistic graphical model representing a mixture of treatment behaviors by latent features.
From a clinical and also patient-centered perspective, it is essential to keep the care pathway as short as possible and prevent complications or disease-related hospital readmissions. In this article, motivated by an application from urologic clinical practice, we would therefore like to focus on objective and universally recorded clinical event measures including main events "hospital treatment" and "hospital readmission." We understand the pathway as a temporal process with possible transitions from a single initial treatment state to hospital readmission of different type. We consider the time-to-first hospital readmission, whichever comes first, which constitutes a competing risks setting. 2 Specifically, we aim to judge similarity of such pathways for samples of two different populations: group (i) patients receiving specific inhouse diagnostics before hospital treatment, and group (ii) patients not receiving specific inhouse diagnostics before hospital treatment. Our interest in the similarity of these pathways has the following reasons: While a certain disease requires specific treatment that is often only offered in specialized clinics, diagnostical tools are often more diverse and partly offered in outpatient facilities. Therefore, treatment data including diagnostics performed are often not readily available from the non-clinical sector (at least in Germany) and can not yet or only insufficiently be used for the investigation of healthcare pathway similarities. From such a path perspective, one may ask whether it makes a difference in terms of hospital readmission if a particular diagnostic procedure was performed in the clinic or not. From the perspective of the clinical practitioner, it may be plausible to assume that the probability of hospital readmission differs only by treatment, not by different pre-treatment diagnostics. If we could statistically show a similarity of the pathways of both groups, the latter assumption would be confirmed and we may attribute similar pathways to typical pathways.
In this article, we propose a multi-state model-based approach to reveal such path similarities of two groups of individuals. Multi-state models based on counting processes for event history data have been successfully applied to analyze progression of a disease. [3][4][5] In the context of care pathways or similarity, however, they have been used only rarely so far and for other purposes. Gasperoni et al 6 investigated multi-state models for evaluating the impact of risk factors on heart failure care paths involving multiple hospital admissions, admissions to home care or intermediate care units or death. Gasperoni et al 7 considered potential similarities and differences among healthcare providers on the clinical path of heart failure patients.
Our approach differs from this work and we aim for testing the similarity of the transition intensities from two independent competing risks Markov models with constant intensities. Then, the problem is methodologically related to the meanwhile classical problem of bioequivalence, which aims at demonstrating the similarity between two pharmacokinetic profiles by considering the area under the curve or the maximum concentrations of the two curves. 8,9 However, none of these methods for establishing bioequivalence can be transferred to the comparison of transition intensities as they are usually developed under the assumption of normally distributed (independent) data. Further, although the asymptotic distribution could be derived for this case as well, an approach based on asymptotics would not yield satisfying power for small sample sizes or data with only few events. In fact in the following we will develop new bootstrap methodology to address this problem.
The article is structured as follows: Section 2 describes the clinical healthcare pathways in the application example involving prostate cancer surgery. Section 3.1 introduces the competing risks notation for samples of two different populations. In Section 3.2 we describe a novel bootstrap procedure for testing similarity of transition intensities from two competing risks models. In a large simulation study created on the basis of the numbers and estimates from the application example we investigate the performance of our similarity approach with respect to different sample sizes and different similarity thresholds (Section 4). In Section 5, we briefly discuss how the results from the simulation study translate to the application example. We close the article with a discussion in Section 6.

CLINICAL HEALTHCARE PATHWAYS INVOLVING PROSTATE CANCER SURGERY
The application example that drove our methodological development comes from the clinical practice of the Department of Urology at the Medical Center-University of Freiburg. The clinic covers the entire spectrum of urological diagnostics and therapy according to the current state of the art. As data basis, we use the German reimbursement claims dataset for inpatient healthcare, which was systematically integrated into a central database at the Medical Center-University of Freiburg as part of the German Medical Informatics Initiative. For each inpatient case, the admission and discharge diagnoses (main and secondary diagnoses) are coded in the form of ICD10 (10th revision of the International Statistical Classification of Diseases and Related Health Problems) codes; in addition, all applied and billing-relevant diagnostic and therapeutic procedures are coded together with a time stamp in the form of OPS ("Operationen-und Prozedurenschlüssel") codes.

Hospital readmission after surgery with and without prior fusion biopsy
One of the most frequent reasons for inpatient admission at the Department of Urology is prostate cancer. One possible treatment option is the open or robotic-assisted surgery with the resection of the prostatic gland along with the vesicular glands, also referred to as radical prostatectomy. 10 From our reimbursement claims database, we retrospectively identified all patients with prostate cancer who underwent open radical prostatectomy (ORP) at the Medical Center-University of Freiburg between 1 January 2015 and 1 February 2021. This includes all cases with OPS code 5-604 (radical prostatovesiculectomy) irrespective of the concrete surgical procedure-but without the additional OPS code 5-987 for robotic assistance-and resulted in a total of n=695 patients. Prior to surgical intervention, diagnostics are performed in a variety of ways both in the clinic or in an out-of-hospital setting. The current diagnostic standard is a multiparametric magnetic resonance imaging-based pathway with targeted fusion biopsy (FB; OPS code 1-465). However, only a part of the patients receives their biopsies at the Department of Urology, which often depends on the practice of the referring urologists in private practices. In our data n = 213 (31%) patients received FB diagnostic prior to ORP, while a larger part of the patients, n = 482 (69%), did not receive FB diagnostic at the Department of Urology prior to ORP. We did not place a time restriction on when exactly the FB diagnostic took place prior to ORP. Therefore, we distinguish the two populations based on the pre-surgery FB diagnostic obtained and are interested in their further healthcare paths regarding hospital readmission by means of the two independent competing risks models as illustrated in Figure 1.
Radical prostatectomy carries a risk of postoperative complications. One of the more common complications is lymphocele, which typically develops within a couple of weeks after surgery and can be treated at the Department of Urology. Patients may, however, be also readmitted to the clinic after radical prostatectomy for other reasons related to the initial surgery. A typical time window for surgery-related hospital readmission is 90 days after surgery. Therefore, competing outcomes of interest are different reasons for hospital readmission within 90-days. In the data, we identified the most frequent readmission diagnoses defined by the ICD10 main diagnosis codes "C61: Malignant neoplasm of prostate" (model 1: n = 18, 8%; model 2: n = 60, 12%), "I89.8: Other specified noninfective disorders of lymphatic vessels and lymph nodes" (model 1: n = 17, 8%; model 2: n = 29, 6%), and combined all other observed diagnoses into one state "any other diagnosis" (model 1: n = 6, 3%; model 2: n = 31, 6%). While I89.8 is typically coded for a complication after radical prostatectomy requiring specific treatment, a readmission with a C61 diagnosis may mask diagnostic procedures after surgery only or specific follow-up treatment. For all patients in our data set we have at least 90 days clinical follow-up information available, so censoring is only administrative at 90 days after ORP. All cases had to be complete, that is, a discharge date for the initial stay with ORP as well as a potential readmission stay had to be present at the time of data retrieval. We note here that the terms (l) 0j (t) in Figure 1 describe the transition intensities to move from the initial state (ORP) into any of the competing states (hospital readmission with ICD10:I89.8, ICD10:C61, or any other ICD10) and are central in this work. They are formally defined in Equation (4) in Section 3. Assuming the transition intensities to be constant over time, one may estimate them separately by dividing the sum of type-j-events through the sum of person-time at risk in the initial state (see Equation (10) in Section 3 for a formal definition). In our data, this yields the following estimates: The estimated constant intensities should be interpreted in the context of the scale in which the time was measured. In our case, the time was measured in days. Since the observation period for all patients was identical (90 days) and overall only few events were observed, the magnitude of the intensities can be appropriately converted into an approximate number of expected events using the formula: transition intensity estimate times observation period (in days) times sample size of the population under consideration (cf. Equation (10) for the precise definition of the estimate). For example, for transition intensity estimatê( 1) 01 this means 0.001 * 90 days * 213 patients ≈ 19 events. As the readmission intensities are overall low, from a pathway analytic perspective the question is whether they are sufficiently similar for patients with prior in-house FB diagnostics vs without prior in-house FB diagnostics w.r.t. the specific transition such that the two populations can be combined, for example, for a common analysis on hospital readmission due to complications.

Competing risk models
To model the event histories as competing risks for samples of two different populations, we use two independent Markov processes with state spaces {0, 1, … , k} following Andersen et al. 3 The processes have possible transitions from state 0 to state j ∈ {1, … , k} with transition probabilities Every individual starts in state 0 at time 0, that is, P(X(0) = 0) = 1. The time-to-first-event is defined as stopping time T = inf{t > 0|X(t) ≠ 0} and the type of the first event is X(T) ∈ 1, … , k. Let ( ) denote the cause-specific transition intensity from state 0 to state j ∈ {1, … , k} for the th model, ∈ {1, 2}. The transition intensities completely determine the stochastic behavior of the process. In our application example, the two competing risk models with the initial state ORP and three competing risks each are shown in Figure 1, in which the transition intensities are assigned to the transition arrows.

Similarity of competing risk models
We are interested in the similarity between the transition intensities in the two models. In other words, we want to test the hypotheses vs Here ||f − g|| ∞ = sup t∈ |f (t) − g(t)| denotes the maximal deviation between the functions f and g and Δ 1 , … , Δ k are pre-specified thresholds, defining for each pair of transition intensities the maximum deviation Δ j under which (1) 0j and (2) 0j are considered as similar.
In order to make the method easily understandable and to be able to provide closed form solutions for the estimates (for a discussion on that, see eg, Cube et al 11 ) we will assume constant transition intensities throughout this article. This assumption is frequently made in the literature. 12,13 For the same reason, for explaining the methodology we restrict ourselves to the case of no censoring. Additionally, we briefly explain in Section 3.3 how to adapt the method to the right-censored case and investigate this situation in the simulation study, where we address both administrative censoring and random right-censoring.
In the following, we describe a novel bootstrap procedure for testing the hypotheses (5) and (6) for competing risk models with constant transition intensities, which is motivated by the methodology developed in Dette et al 14 for comparing regression curves. To be precise, assume that two independent samples X (1) 1 , … , X (1) n 1 and X (2) 1 , … , X (2) n 2 from Markov processes (2) are observed over the interval  = [0, ], containing the state and transition time of an individuals i. We define as the indicator that a state transition of the individual i from 0 to j has occurred in the time interval [0, ] (note that N ( ),i 0j ( ) is either 0 or 1). We also denote by 0 < T ( ),i 0j ≤ the corresponding transition time (if N ( ),i 0j ( ) = 0 the transition time is undefined). Further we introduce the notation which indicates whether at time t the ith individual of the th group is at risk (here I{A} denotes the indicator of the event A). Under the assumption of constant transition intensities it then follows from Andersen and Keiding 4 that the corresponding likelihood function in the th model is given by where is the number of transitions from state 0 to state j in the th group, is the total observation time of all individuals in the th group, ( ) = ( ( ) 01 , … , ( ) 0k ) ⊤ is the vector of transition intensities in model = 1, 2. The logarithm of (7) is given by Taking the partial derivatives and equating to zero yields the maximum likelihood estimates (MLE) Via S ( ) 0 in (10) the intensity estimate depends on the time scale, as already pointed out at the end of Section 2. We now want to address the question of similarity as stated in the hypotheses (5) and (6). Due to the assumption of constant transition intensities the maximum deviation simplifies to that is we consider the absolute difference between these intensities for all states j = 1, … , k. In order to reject the null hypothesis in (5) the differences between transition intensities have to be smaller than the pre-specified margins Δ j for all states. Hence the test problem can be assessed by simultaneously testing the individual hypotheses vs for all j = 1, … , k. According to the intersection union principle 15 the global null hypothesis in (5) can be rejected at a significance level of if the individual null hypotheses in (11) are rejected at a significance level of for all j = 1, … , k. This means in particular that there is no adjustment of the level necessary. The following algorithm summarizes how these individual tests are performed.
As stated above, the global null hypothesis (5) is rejected if all individual null hypotheses are rejected. As a consequence of this procedure the power of the test decreases with an increasing size of states in the model as these are leading to a higher number of individual tests (see Berger 15 for theoretical arguments on this). More precisely, it is a well known fact that methods based on the intersection union principle can be rather conservative, 17 depending on the sample size, the variability of the data and the number of individual tests. It can be shown that the test is consistent and controls its level. The theoretical arguments for that follow from adapting the proofs of Dette et al 14 to the present situation. The details are omitted for the sake of brevity. We will investigate the finite sample properties by means of a simulation study in Section 4.

Right-censoring
Note that in case of right-censoring the methodology described above can be extended by adding corresponding factors from the distribution of the censoring times to the likelihood in (7). This requires the assumption of independence between censoring times and survival times. Under this assumption the MLE in (10) ( 2) as given in (10) and the corresponding test (a) In order to approximate the null distribution we define constrained estimates (1) , (2) of (1) and (2) maximizing the sum log  1 ( (1) ) + log  2 ( (2) ) of the log-likelihood functions definedin (9) under the additional restriction that is we estimate the transition intensities such that the models correspond to the margin of the (individual) null hypothesis (11) for state j 0 . Further definê and Consequently, if the test statisticd j 0 is larger or equal than the similarity threshold Δ j 0 , which reflects the null situation, the original (and hence unconstrained) estimateŝ( 1) and̂( 2) can be used. (b) Use the constrained estimateŝ( ) , = 1, 2, derived in (14), to simulate bootstrap data X * (1) Specifically we use the simulation approach as described in Beyersmann et al, 16 where at first for all individuals survival times are simulated with all-cause hazard ∑ k j=1̂( ) 0j and then a multinomial experiment is run to decide on state j with probabilitŷ calculate the MLÊ * (1) and̂ * (2) as in (10) and the test statistic for state j 0 as in Step (i), that isd * j 0 ∶= |̂ * (1) Repeat steps (iib) and (iic) B times to generate B replicates of the test statistic and letd * j 0 (1) , … ,d * j 0 (B) denote the corresponding order statistic. An estimate of the -quantile of the distribution of the statisticd * j 0 is then given by q * ∶=d * j 0 (⌊B ⌋) and the null hypotheses in (11) is rejected at the targeted significance level wheneverd j 0 <q * . Alternatively a test decision can be made based on the P-valueF

Design
In the following we will investigate the finite sample properties of the proposed methods by means of a simulation study, driven by the application example given in Section 2. According to the data mentioned therein we first assume that individuals of two groups ( = 1, 2) are observed regarding three different outcomes over a period of 90 days, hence we consider two competing risk models with each j = 3 states over the time range  = [0, 90]. If there is no transition to one of the three states, an individual is administratively censored after these 90 days. In a second setting we replace the administrative censoring by random right-censoring in order to investigate the effect of an increasing proportion of censored individuals on the procedure. A more precise description of that and the corresponding results can be found in Section 4.4. The data in all simulations is generated according to the algorithm described in Beyersmann et al. 16 Using an Intel Core i7 CPU with 32 GB RAM the computation time for one individual test with B = 500 bootstrap repetitions is about 25 s for all sample sizes under consideration, which results in approximately 3 × 25 = 75 s for one dataset comparing three states. Thus, the computational cost depends much more on the number of states to be compared than on the sample size.
We consider in total four different scenarios, which are summarized in Table 1. For the first three scenarios we choose the transition intensities of the application example in (1) (compare also Figure 1). This choice results in true absolute differences of which are also given in Table 1. In order to demonstrate the effect of different numbers of states, we start by testing for similarity of all three transition intensities simultaneously in the first scenario, whereas in the second and in the third scenario we only consider two states and hence only the difference of two transition intensities. Precisely, in Scenario 2 we only compare the transition intensities for State 1 and 3 and in Scenario 3 we only consider State 1 and 2, respectively. Finally, in the fourth scenario we choose identical models, that is (1) 01 = (2) 01 = 0.001, (1) 02 = (2) 02 = 0.0011 and (1) 03 = (2) 03 = 0.0004, respectively, resulting in a difference of 0 for all transition intensities.
In other applications the number of patients ending up in one of the three states might be even smaller than the ones found in our application example. To this end, we consider a broader range of different sample sizes given by In order to simulate both the type I error and the power of the procedure described in Algorithm 1, we consider different similarity thresholds Δ = (Δ 1 , Δ 2 , Δ 3 ) or, for scenarios 2 and 3, Δ = (Δ 1 , Δ 2 ), respectively. Precisely we choose Δ j ∈ {0.00015, 0.0002, 0.0005, 0.0006, 0.001, 0.0015, 0.002}, where for the first three scenarios, the first four choices correspond to the null hypothesis (5) and the other three to the alternative in (6) (note that for sake of brevity not all choices are presented in the tables). Regarding the fourth scenario, we only consider Δ j = 0.001, 0.0015 as in this case we only simulate the power of the test. Table 2 displays the type I errors for scenarios 1-3. It turns out that the proportions of rejections of the null hypothesis (5) for the global test are close to zero. These findings are in line with the theoretical arguments given after Algorithm 1, as tests based on the intersection union principle tend to be conservative in some situations. However it also becomes visible that this effect decreases when considering only two states instead of three (see the columns corresponding to Scenario 2 and 3, respectively). Moreover we note that the individual tests yield a very precise approximation of the nominal level, as the proportion of rejections is close to 0.05 in all scenarios under consideration. The difference between type I error rates of the individual tests and the global test become in particular visible when considering the first row of Figure 2, which yields a visualization of the results presented for Scenario 1 in Table 2. Whereas the proportion of rejections are all around 0.05 for the individual tests on all three states, the line indicating the results for the global test is close to zero.

Type I error
Finally, the points on the left of Figure 3, corresponding to the smallest threshold, namely Δ = (0.00015, 0.0002, 0.0002), display the type I errors for a sample size of n 1 = n 2 = 300 in a scenario which is not on the margin but in the interior of the null hypothesis. In this situation type I errors are smaller and well below the nominal level. Considering the individual test on the first state the proportion of rejection is close to as the absolute distance d 1 = | (1) 01 − (2) 01 | = 0.0002, which is rather close to the chosen threshold Δ 1 = 0.00015. For the other two states we have d 2 = 0.0006 and d 3 = 0.0005 and hence, regarding the similarity thresholds of Δ 2 = Δ 3 = 0.0002, these situations correspond even stronger to the null situation, resulting in lower type I errors of the individual tests, given by 0.017 for state 2 and 0.009 for state 3, respectively (compare Figure 3). Table 3 displays the simulated power of the global test and the individual tests for scenarios 1 to 4, respectively, as well as the two lower lines of Figure 2 visualize some of the results from Scenario 1 of Table 3. In general we observe that the test achieves a reasonable power in all scenarios under consideration and for increasing sample sizes the power approaches 1. For example, considering n 1 = n 2 = 300 in Scenario 4, the simulated power lies between 0.837 and 1.000, depending on the threshold under consideration (see Scenario 4 in Table 3). In particular keeping in mind the very small transition intensities (which result in only few cases in the several states-see the application example in Figure 1) these results are very promising. In addition, considering the first three scenarios it becomes obvious that the power increases significantly when just considering two instead of three states (compare for the same thresholds the results for Scenario 1 in Table 3 to Scenarios 2 and 3). This becomes also visible in the two lower rows of Figure 2 presenting the power of comparing states individually and simultaneously for the first scenario. When assuming the same similarity thresholds Δ 1 = Δ 2 = Δ 3 the power for the individual test on the second state is clearly below the observed values for the other two states. This results from the fact that the true absolute difference is given by d 2 = 0.006 and hence larger compared to d 1 and d 3 , which are given by 0.002 and 0.005, respectively. When comparing the scenarios with only two states, that is Scenarios 2 and 3 in Table 3, we observe that the power of the global test is higher in the first. This holds for all sample sizes and choices of the threshold Δ and results from the different power obtained for the individual tests, which is due to the underlying assumed transition intensities. However, as mentioned beforehand, this effect decreases with increasing sample sizes, where, for all scenarios under consideration, the power converges to 1.

Power
Finally, Figure 3 displays the proportion of rejections for Scenario 1 in dependence of the chosen similarity threshold Δ. We observe that for the first two choices all values, that is the proportion of rejections for the individual test and the global test, respectively, are below or close to as these situations correspond to the null hypothesis (see also the discussion at the end of Section 4.2). For the other three choices of Δ presented in the right part of the figure simulations correspond     to the alternative (6). Consequently, with increasing similarity thresholds, the proportion of rejections, which results in claiming similarity, increases.

Exponential right-censoring
In the following we will assume the data to be randomly right-censored and investigate the effect of several censoring rates on the performance of the procedure. Precisely, the censoring times now follow an exponential distribution with TA B L E 4 Censoring rates for data simulated with exponential right-censoring and the resulting mean proportions of censored TA B L E 5 Simulated level (first column of results) and power (second and third column, respectively) of the test on similarity described in Algorithm 1 for each scenario, considering different sample sizes, censoring rates, and thresholds Δ five different censoring rates, that is 0.001, 0.002, 0.003, 0.005, and 0.01, resulting in approximately 22% up to 80% of the individuals being censored, for a detailed overview see Table 4. We consider Scenario 1 as described in Table 2, that is, all three states are compared. For the sake of brevity we now only consider three different sample sizes, that is, n 1 = n 2 = 200, 300, and 500, resulting, together with the five censoring rates, in 15 combinations per choice of the threshold Δ. In order to investigate both, the type I error and the power, we simulate all choices of Δ as described in Section 4.1 such that we consider one setting under the null hypothesis (5) and three under the alternative (6), respectively. Table 5 displays the proportion of rejections of the individual tests and the global test, respectively, for three different choices of the threshold Δ. The first column of the results, corresponding to Δ = (0.0002, 0.0006, 0.0005), belongs to the null hypothesis (5) and hence displays the simulated type I errors. We observe that the type I errors of the individual tests are very close to the nominal level of = 0.05. The global test is again conservative and in general the results are in line with the simulations assuming administrative censoring presented in Table 2.
In the following two columns the power is presented. It turns out that the power still remains reasonably high, also in case of rather large censoring rates, albeit it decreases with increasing censoring rates, as expected. For instance, considering n 1 = n 2 = 200 and a threshold of Δ = (0.001, 0.0015, 0.001) we observe that in a setting with approximately 20% to 30% of individuals being censored the global power of the test is 0.852, whereas it reduces to 0.209 if 75% to  Figure 4 where the proportions of rejections are shown for all five censoring rates and in dependence of the similarity threshold. Compared to the results assuming administrative censoring (see Table 3) we note that in general the power for data with exponential censoring investigated in this section is larger for almost all configurations. This results from the fact that when assuming the end of the observation period after 90 days and the transition intensities as given in Table 2, the proportion of censored individuals is also rather high, that is on average about 80.7% in Group 1 and 75.1% in Group 2, respectively, and hence comparable with the highest censoring rate of 0.01 in the setting of exponential censoring. For instance, considering n 1 = n 2 = 300 and a threshold of Δ = (0.001, 0.0015, 0.001) we obtain a power of 0.457 in case of administrative censoring and 0.452 for exponential censoring with a censoring rate of 0.01. This is also underlined by Figure 4, where the two lines almost completely coincide over the total range of similarity thresholds. We therefore conclude that the impact of random right-censoring and administrative censoring is comparably large if the proportion of censored individuals is similar.

SIMILARITY OF HEALTHCARE PATHWAYS INVOLVING PROSTATE CANCER SURGERY
We now want to address the question whether the readmission intensities for patients with prior in-house FB diagnostic are similar to the ones of the patients without prior in-house FB diagnostic (Equation 1). Therefore we perform the test on similarity described in Algorithm 1 considering numerous different similarity thresholds Δ j , j = 1, 2, 3, on the given application example. The choice of these thresholds is motivated from the simulation studies presented in Section 4. In Table 6 we display the P-values of the individual tests on states 1, 2 and 3, respectively, for six different similarity thresholds. We observe that for the smallest threshold, that is Δ j = 0.0005, all individual P-values are far above the nominal level of = 0.05. For Δ j = 0.0007 the individual P-value of the test for the first state is now given by 0.044, which is below the nominal level and results in claiming similarity of transition intensities for the first state. Considering the same threshold for state 2 and 3, respectively, yields that for these states the individual null hypotheses cannot be rejected. Further, considering Δ j = 0.0008 we observe that now similarity of the corresponding readmission intensities can be claimed for state 1 and 3, as both individual P-values are below the nominal level. The same holds for Δ j = 0.001, as the P-values for state 1 and 3 are given by 0.006 and 0.004, respectively, whereas the P-value of the test for state 2 is given by 0.094.

DISCUSSION
In this article we developed a hypothesis test based on a constrained (parametric) bootstrap to assess the similarity of competing risk models with constant transition intensities. Specifically, we performed an individual test for each state and combined these individual tests by applying the intersection union principle. We examined the finite sample properties by numerous simulations motivated by an example application in urology, and demonstrated that the test properly controls its level and yields a reasonable power. We focused on testing the similarity of competing risks models with the same number of states. However, the test procedure can also be applied to the comparison of competing risks models with a differing number of states. In such a case, the corresponding equal risks from both models are compared, and observations from individuals that experience one of the remaining risks that do not coincide in both models are censored. The fact that our approach is based on the intersection union principle has the advantage that all states can be compared separately with different thresholds Δ 1 , … , Δ k . As a consequence the resulting overall comparison is conservative by construction.
As an alternative to testing the individual hypotheses in (11) and (12) one could also consider aggregated versions of the deviations between the transition intensities such as or At the moment it is not clear how to extend the constrained bootstrap algorithm for testing hypotheses of the form (16) or (17), and we postpone this problem as a future research topic. We mention again that a potential improvement in power will come at the cost of not being able to draw conclusions for each state individually as all information from the different states is summarized in one quantity. We proposed measuring similarity by the absolute difference between transition intensities. However, instead of considering differences a similar methodology can be developed for comparing the ratios of the transition intensities. In the case of the application example, this would mean examining the following ratios to test for similarity: (1) 01 ∕ (2) 01 = 1.25, (1) 02 ∕ (2) 02 = 0.65, and (1) 03 ∕ (2) 03 = 0.44. On the one hand, considering ratios would have the advantage that they are time-invariant, that is, not depending on the time scale anymore. On the other hand, for small transition intensities, that is, settings with few events, differences may better communicate situations where there is no large difference in terms of intensities or events as compared to ratios. For instance, while | (1) 03 − (2) 03 | = 0.0005 is in fact fairly small this may by far not be assumed when examining the ratio (1) 03 ∕ (2) 03 = 0.44. In general, the choice how to measure the deviation between the transition intensities depends on the goal of the study and should be carefully investigated by the researcher. This also applies to the corresponding equivalence thresholds which offer on the one hand a maximum of flexibility for our approach but on the other hand also provide the need of a very careful discussion in advance. Currently there are no guidelines fixing these thresholds in studies as considered in Section 2, which makes this decision an important topic for further research.
With respect to the application example, we were able to identify thresholds for which the global null hypothesis could be rejected and therefore the transition intensities are to be considered similar. The chosen thresholds were based on the results of the simulation study and were also chosen differently to illustrate the effect on the P-values. This extensive procedure is not necessary for future applications of the method to clinical data, instead a careful preliminary determination of plausible equivalence thresholds is required. This can be done in such a way that one considers which difference in number of events one would still like to allow and then calculates the corresponding threshold accordingly, taking into account the examined time span and sample size. We point out here that while very stringent thresholds are often expected from an equivalence test of a therapeutic study, 18 these would rather not be fulfilled in our application example. However, we can consider this somewhat less stringent and continue to assume similarity, since the actual goal is to use the overall data to examine an outcome that is supposed to be no longer directly related to the diagnostic procedure. Therefore, we would like to explicitly note here that our proposed testing procedure tests for similarities between two different populations and does not compare models that adjust for a possible difference in groups. Detecting sufficient similarity would allow the groups to be combined for outcome analysis without further adjustment. A further strength of the kind of data we used in our application is that it was drawn from sets of claims data having a standardized format used for quality assurance and for the calculation of the German Diagnosis Related Groups system. The process and quality assurance measures for providing this dataset are highly standardized. The data are easily accessible and therefore provide a good source of information for this investigation.
A limitation of the methodology proposed in this article is the assumption of constant transition intensities, which may not be met in real data applications. However, our proposed approach based on parametric bootstrap allows in principle an extension to different parametric distributions of event times. This requires further extensive investigations, which are beyond the scope of this work. We therefore leave it for future research.