Discussion on “Optimal test procedures for multiple hypotheses controlling the familywise expected loss” by Willi Maurer, Frank Bretz, and Xiaolei Xun

This comment builds on the familywise expected loss (FWEL) framework suggested by Maurer, Bretz, and Xun in 2022. By representing the populationwise error rate (PWER) as an FWEL, it illustrates how the FWEL framework can be extended to clinical trials with multiple and overlapping populations and how the PWER can be generalized to more general losses. The comment also addresses the question of how to deal with midtrial changes in the post-trial risks and related losses that are caused by data-driven decisions. Focusing on multiarm trials with the possibility of dropping treatments midtrial, we suggest switching from control of the unconditional expected loss to control of the conditional expected loss that is related to the actual risks and is conditional on the sample event that causes the change in the risks. The problem and the solution suggested here are also motivated with a sequence of independent trials for a hitherto incurable disease, which ends when an efficient treatment is found. No multiplicity adjustment is applied in this case, and we show how this can be justified by considering the changing out-trial risks and by controlling conditional type I error rates and losses.

The work I will comment on definitely addresses needs of modern clinical trials such as subgroup selection, basket and umbrella trials, multiarm multistage (MAMS) trials, and platform trials. It also contributes to the current discussion on methods for multiple type I error rate control. The authors' suggestion to extend some current concepts for multiple error rate control to the control of an FWEL is highly valuable because it makes it possible to account for the potential post-trial consequences of type I errors. A similar intention has led Brannath et al. (2023) to introduce the populationwise error rate (PWER) for clinical trials with multiple populations. Such trials are an important application also in Maurer et al. (2023). While control of the PWER can obviously be viewed as control of a particular expected loss, it is not covered by the additive and binary losses considered specifically in Maurer et al. (2023). As a consequence, the trial examples considered there involve only mutually disjoint subpopulations, while the focus in Brannath et al. (2023) is on subpopulations with nonempty intersections. I will therefore show in this comment how the additive and binary loss functions can be extended to cover the PWER and will then generalize it to what could be called a populationwise expected loss (PWEL). While introducing the common framework, I leave important questions for future research, like the identification of inadmissible and optimal test procedures with PWEL control. Maurer et al. (2023) have successfully addressed these issues for the additive and binary loss functions.
One intention of the PWER and (as I understand) the FWEL is to account for type I errors only so far as they are essential in the sense that they have consequences for future patients, like receiving an experimental treatment after a significant trial result. In this comment, I will pursue this intention and address the issue of changes in the post-trial risks within a single study and across different studies. I will thereby focus on the specific situation of a disease with no approved treatment (like COVID-19 in the past or Alzheimer's disease) where several experimental treatments are investigated in independent trials over time until superiority over standard care is demonstrated for at least one of them. I will also discuss the situation where, in a MAMS or platform trial, treatments are dropped midtrial at interim analyses. Since the post-trial risk associated with a dropped treatment is no longer present, the question arises of how to account for this in the control of the multiple type I error rate or, more generally, the FWEL. My intention is to illustrate the issue and to trigger a discussion as well as new research on this topic.
The rest of my comment is organized as follows: In Section 2, the PWER will be represented and generalized via the FWEL. Section 3 introduces the issue of changes in the post-trial risks and a first suggestion of how to deal with it in the above-mentioned specific situations. I conclude with a general discussion and final remarks in Section 4.

THE PWER AND ITS GENERALIZATION AS PWEL
Let me start with a brief review of the FWEL introduced in Maurer et al. (2023). The setup is a multiple testing problem for an $m$-dimensional parameter vector $\theta = (\theta_1, \dots, \theta_m)$ with the $m$ one-sided null hypotheses $H_i: \theta_i \le 0$, $i = 1, \dots, m$. The true state of nature is represented by binary indicators $\eta_i(\theta_i)$, $i = 1, \dots, m$, where $\eta_i(\theta_i) = 1$ if $\theta_i \notin H_i$ and $\eta_i(\theta_i) = 0$ if $\theta_i \in H_i$, summarized to a vector $\eta(\theta) = (\eta_1(\theta_1), \dots, \eta_m(\theta_m))$. The test decisions are represented by a binary decision vector $d = (d_1, \dots, d_m)$ where $d_i = 1$ if $H_i$ is rejected and $d_i = 0$ otherwise. Typically, $d$ is defined via test statistics or $p$-values. When each $d_i$ depends only on the test statistic or $p$-value for $H_i$, then $d$ is called separable.
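In general, the loss $l$ is defined as a function of $d$ and $\theta$, $l = l(d, \theta)$, or slightly more specifically as a function of $d$ and $\eta$, $l = l(d, \eta(\theta))$. Specifically, Maurer et al. (2023) consider the additive loss $l(d, \eta) = \sum_{i=1}^{m} c_i d_i (1 - \eta_i)$ and the binary loss $l(d, \eta) = \max_{i=1,\dots,m} d_i (1 - \eta_i)$, with $c_i$ for the unit loss associated with an erroneous rejection of $H_i$. We will see below that this generality does not help to cover the PWER.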
As an example, Maurer et al. (2023) consider a trial with $m$ disjoint subpopulations $\mathcal{P}_i$ and efficacy parameters $\theta_i$ such that $H_i: \theta_i \le 0$ is the nonefficacy null hypothesis for $\mathcal{P}_i$. They suggest using the additive loss with $c_i$ equal to the proportion of $\mathcal{P}_i$ in the study population. This choice is (among many others) in line with the per comparison error rate, that is, the testing of each $H_i$ at the unadjusted level $\alpha$. This is reasonable, since every future patient is affected by the rejection of at most a single null hypothesis, namely, the $H_i$ of the population $\mathcal{P}_i$ she or he belongs to.
The populationwise error rate introduced in Brannath et al. (2023) is defined as the probability that a future patient, who is randomly drawn after the study from the overall patient population, will receive an inefficient treatment. One easily sees that with disjoint subpopulations, this definition corresponds to the expectation of the previously mentioned additive loss with $c_i$ equal to the proportion of $\mathcal{P}_i$. However, with overlapping populations the definition of the PWER gives
$$\mathrm{PWER} = \sum_{J} \pi_J \, P_\theta\Bigl(\max_{i \in J} d_i (1 - \eta_i) = 1\Bigr),$$
where the sum is over the nonempty index sets $J \subseteq \{1, \dots, m\}$ and $\pi_J$ is the proportion of the population stratum $\mathcal{P}_J = \bigcap_{i \in J} \mathcal{P}_i \setminus \bigcup_{i \notin J} \mathcal{P}_i$. The corresponding loss function is
$$l(d, \eta) = \sum_{J} \pi_J \, l_J(d, \eta) \qquad (1)$$
with the stratum $\mathcal{P}_J$ specific binary loss $l_J(d, \eta) = \max_{i \in J} d_i (1 - \eta_i)$. Obviously, (1) matches neither the additive nor the binary loss function in Maurer et al. (2023). Note that (1) can also not be matched by considering the disjoint strata $\mathcal{P}_J$ instead of the overlapping $\mathcal{P}_i$, since we do not test null hypotheses for the up to $2^m - 1$ nonempty strata but for the $m$ overlapping populations $\mathcal{P}_i$ only.
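To make the difference between the three losses concrete, the following minimal Python sketch evaluates the PWER loss, the additive loss, and the binary loss for one test decision. The function names, the two overlapping populations, and the stratum proportions are hypothetical and chosen only for illustration; they are not taken from Maurer et al. (2023) or Brannath et al. (2023).

```python
# Hypothetical illustration: two overlapping populations P_1 and P_2.
# Strata are indexed by the set J of populations a patient belongs to.

def pwer_loss(pi, d, eta):
    """Sum over strata J of pi_J * max_{i in J} d_i * (1 - eta_i)."""
    return sum(p * max(d[i] * (1 - eta[i]) for i in J) for J, p in pi.items())

def additive_loss(c, d, eta):
    """Additive loss: sum_i c_i * d_i * (1 - eta_i)."""
    return sum(c[i] * d[i] * (1 - eta[i]) for i in c)

def binary_loss(d, eta):
    """Binary loss: 1 if at least one true null hypothesis is rejected."""
    return max(d[i] * (1 - eta[i]) for i in d)

# Assumed stratum proportions: only P_1, only P_2, and the intersection.
pi = {frozenset({1}): 0.4, frozenset({2}): 0.3, frozenset({1, 2}): 0.3}
# Population proportions implied by the strata: P_1 = 0.7, P_2 = 0.6.
c = {1: 0.7, 2: 0.6}
eta = {1: 0, 2: 0}           # both null hypotheses are true
d_both = {1: 1, 2: 1}        # both are (erroneously) rejected
d_first = {1: 1, 2: 0}       # only H_1 is (erroneously) rejected

if __name__ == "__main__":
    for name, d in [("both rejected", d_both), ("only H_1 rejected", d_first)]:
        print(name, pwer_loss(pi, d, eta), additive_loss(c, d, eta),
              binary_loss(d, eta))
```

In this assumed example, rejecting both true null hypotheses gives a PWER loss of 1, whereas the additive loss counts the intersection twice and exceeds 1. Rejecting only $H_1$ gives a PWER loss equal to the proportion of $\mathcal{P}_1$, whereas the binary loss is 1.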
Given loss function (1) for the PWER, the concept can be generalized by considering other stratum-specific losses. We may call this a populationwise expected loss (PWEL). We may, for example, use weighted maximum losses $l_J(d, \eta) = \max_{i \in J} w_{J,i} \cdot d_i (1 - \eta_i)$ with unequal weights $w_{J,i} > 0$ to account for differences in the safety profile or financial costs of the treatments. The weights may or may not depend on the stratum. The interpretation of the corresponding expected loss is the expected maximum loss that results from treating a randomly chosen future patient with at least one treatment that has erroneously been claimed effective for a population she or he belongs to. Finally, when choosing additive stratum-specific costs, $l_J(d, \eta) = \sum_{i \in J} c_{J,i} \cdot d_i (1 - \eta_i)$, we end up with an additive loss like in Maurer et al. (2023) where the costs $c_i = \sum_{J \ni i} \pi_J c_{J,i}$ depend not only on the treatment $i$ but also on the stratum-specific costs and the proportions of the strata $\mathcal{P}_J$ which compose the population $\mathcal{P}_i = \bigcup_{J \ni i} \mathcal{P}_J$ the treatment is approved for when rejecting $H_i$.
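The reduction of additive stratum-specific costs to treatment-specific costs can be checked numerically. The sketch below, with hypothetical stratum proportions and costs and with illustrative function names, aggregates stratum-specific costs into costs per treatment and compares the stratum-wise additive loss with the resulting additive loss over all test decisions.

```python
from itertools import product

def treatment_costs(pi, c_stratum):
    """c_i = sum over strata J containing i of pi_J * c_{J,i}."""
    costs = {}
    for J, p in pi.items():
        for i in J:
            costs[i] = costs.get(i, 0.0) + p * c_stratum[(J, i)]
    return costs

def pwel_additive(pi, c_stratum, d, eta):
    """Sum over strata J of pi_J * sum_{i in J} c_{J,i} * d_i * (1 - eta_i)."""
    return sum(
        p * sum(c_stratum[(J, i)] * d[i] * (1 - eta[i]) for i in J)
        for J, p in pi.items()
    )

def additive_loss(c, d, eta):
    """Additive loss with treatment-specific costs c_i."""
    return sum(c[i] * d[i] * (1 - eta[i]) for i in c)

# Assumed example: two overlapping populations, three strata.
S1, S2, S12 = frozenset({1}), frozenset({2}), frozenset({1, 2})
pi = {S1: 0.4, S2: 0.3, S12: 0.3}
# Hypothetical stratum-specific costs c_{J,i}.
c_stratum = {(S1, 1): 1.0, (S2, 2): 2.0, (S12, 1): 1.5, (S12, 2): 0.5}
c = treatment_costs(pi, c_stratum)

def max_discrepancy():
    """Largest absolute difference between both losses over all (d, eta)."""
    worst = 0.0
    for d1, d2, e1, e2 in product((0, 1), repeat=4):
        d, eta = {1: d1, 2: d2}, {1: e1, 2: e2}
        diff = abs(pwel_additive(pi, c_stratum, d, eta) - additive_loss(c, d, eta))
        worst = max(worst, diff)
    return worst

if __name__ == "__main__":
    print(c, max_discrepancy())
```

The two losses agree for every decision vector and state of nature because the double sum over strata and treatments is only reordered. No such reordering is available for the weighted maximum loss, which is why it does not reduce to an additive loss.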

ACCOUNTING FOR CHANGES IN POST-TRIAL RISKS AND LOSSES
Next, I discuss issues that are connected with changes in post-trial risks within a single study or between different trials and I illustrate how one could address them in specific situations.

Trial-and-error sequence of studies
As a first illustrative example, consider a sequence of clinical trials for a disease with no approved treatment, each with one experimental treatment. The intention is to approve one treatment that is better than standard care and to stop at the end of the last started trial whenever this has been achieved. We start by assuming that the trials are nonoverlapping (neither in time nor in patients). Usually, no control of the multiple type I error or an overall FWEL is required across different studies, even though there seems to be a multiplicity issue here. It is interesting to see that unadjusted efficacy testing can be justified by considering the time-changing risk for the patients outside the trials: At every time point, the risk for all out-study patients to receive an inefficient treatment is the same as for a single trial, because there is always at most one experimental treatment under investigation. The unadjusted testing cannot be justified with an overall expected loss that accounts for all trials simultaneously. However, it can be justified by considering the conditional expected loss at the beginning of each individual trial, conditional on the failure of previous trials.
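The contrast between the overall and the conditional view can be illustrated with a small calculation. The sketch below assumes independent trials, all with an inefficient treatment, each tested at the unadjusted level; the function names and the choice of level and number of trials are illustrative assumptions.

```python
def overall_false_approval_prob(alpha, n_trials):
    """Probability that a sequence of up to n_trials independent trials,
    all with inefficient treatments, ends with a false approval.
    Trial j is reached only if the j - 1 previous trials failed."""
    return sum((1 - alpha) ** (j - 1) * alpha for j in range(1, n_trials + 1))

def conditional_false_approval_prob(alpha):
    """Type I error rate of a trial given that all previous trials failed.
    By independence, this is the unadjusted level itself."""
    return alpha

if __name__ == "__main__":
    alpha = 0.025  # assumed one-sided level
    for n in (1, 2, 5, 10):
        print(n, overall_false_approval_prob(alpha, n),
              conditional_false_approval_prob(alpha))
```

The overall probability equals $1 - (1 - \alpha)^n$ and grows with the number of trials, while the conditional error rate at the start of each trial stays at $\alpha$. This is the sense in which unadjusted testing controls the conditional but not the overall expected loss.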
Let us now consider trials that are overlapping in time (but still nonoverlapping in patients) and start by assuming that there are only two. Since at the beginning of the second trial all out-trial patients have an increased risk of being exposed to inefficient treatments, we could ask for control of an overall FWEL (like the FWER) and use corresponding critical boundaries in the two trials. Assume now that the first trial fails. From this moment on, the risk for out-trial patients reduces to the risk from the second trial. This justifies unadjusted efficacy testing in the second trial, which means switching from overall to conditional control of the expected loss given the failure of the first trial, like in the case of nonoverlapping trials. Indeed, at the end of the first trial it makes no difference to the out-trial patients whether the two trials are overlapping or not; the risk of being exposed to an inefficient treatment is just the same in both cases. Assume now that the first trial is successful. In this case, it contributes to the final risk for the out-trial patients and should therefore be accounted for. Note that conditioning on success of the first trial would often make the error control impossible. This is, for example, the case with the binary loss, whose conditional expectation (i.e., the conditional FWER) becomes one. We therefore suggest not conditioning on the first trial's success and instead staying with control of the overall FWEL. This means testing efficacy of the second treatment at a multiplicity-adjusted critical value if and only if the first trial is successful. Similar arguments apply to the case of more than two overlapping trials, where a successive conditioning in the FWEL after, and only after, a failure of a trial makes it possible to adjust the critical values over time to the number of still open trials.
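The suggested rule for two overlapping trials can be sketched numerically. The example below assumes independent trials with two inefficient treatments and, purely for illustration, a Bonferroni split of the level as the multiplicity adjustment; the text does not prescribe a particular adjustment, and the function name is hypothetical.

```python
def two_trial_error_rates(alpha=0.025):
    """Error rates under the global null for the rule: test the first trial
    at the adjusted level; test the second trial at the unadjusted level if
    the first trial failed and at the adjusted level otherwise.
    The Bonferroni adjustment alpha / 2 is an assumption of this sketch."""
    adjusted = alpha / 2
    # At least one false rejection: the first trial rejects, or it fails
    # and the second trial then rejects at the unadjusted level.
    unconditional_fwer = adjusted + (1 - adjusted) * alpha
    # Given the failure of the first trial, only the second trial can
    # cause a loss, and it is tested at the unadjusted level.
    conditional_fwer_given_failure = alpha
    # Reference: both trials always tested at the adjusted level.
    fully_adjusted_fwer = 1 - (1 - adjusted) ** 2
    return unconditional_fwer, conditional_fwer_given_failure, fully_adjusted_fwer

if __name__ == "__main__":
    print(two_trial_error_rates(0.025))
```

Under these assumptions the unconditional FWER of the rule exceeds $\alpha$, while the conditional FWER given the failure of the first trial equals $\alpha$. The rule therefore trades overall control for conditional control exactly when the out-trial risk has dropped.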

Dropping treatments in MAMS and platform trials
Let us now consider a single clinical trial which starts with $m$ treatment groups and one control arm and where treatments can be dropped midtrial. Let us assume for simplicity that no early rejection of the null hypothesis is anticipated. Let us also assume that control of the expected loss with the binary loss function is anticipated, which corresponds to FWER control. Assume now that treatment 1 is dropped at interim analysis $j$ because the interim test statistic $Z_{j,1}$ for the nonefficacy hypothesis $H_1$ is below a futility threshold $b$, that is, with the interim event $A_j = \{Z_{j,1} < b\}$. It is clear that from this moment on no post-trial patient will be exposed to treatment 1, and thereby the post-trial risk (or loss) is changed from $l(d, \eta) = \max_{i=1,\dots,m} d_i (1 - \eta_i)$ to $\tilde{l}(d, \eta) = \max_{i=2,\dots,m} d_i (1 - \eta_i)$. Of course, the new loss is smaller, and the question arises of how to account for this reduction.
Given the arguments from the previous subsection, a natural approach to account for the change in the post-trial risk when dropping a treatment is to pass from control of the unconditional expected loss, $\sup_\theta E_\theta[\, l(d, \eta(\theta)) \,] \le \alpha$, to control of the conditional expected loss, $\sup_\theta E_\theta[\, \tilde{l}(\tilde{d}, \eta(\theta)) \mid A_j \,] \le \alpha$, where $\tilde{d}$ is the test decision function we use when the interim event $A_j$ has occurred. Note that I suggest switching to the conditional control only in cases where the post-trial risks change and staying with overall control when the risks remain unchanged. The switch from unconditional to conditional control if a treatment is dropped can further be justified by the often true assumption that the event of dropping treatment 1 (and only this information) is spread across a broad community (e.g., regulators, sponsors, investigators, clinicians, and patients in and outside the trial) before the trial ends. The community could then aim for conditional control of the FWEL instead of unconditional control, since the latter would mean ignoring the new information. If no treatment is dropped, then no information is spread and hence the unconditional control remains relevant. In particular, the conditional FWER, given that a specific treatment is dropped, provides an upper bound for the probability that a post-trial patient will receive one of the remaining treatments although they are inefficient. In reply to a comment of the associate editor, I would like to add that the given arguments for switching to conditional control also apply if the treatment is dropped late, for example, just a few patients before the trial ends. For consistency reasons, we may even apply it at the end of the study, when decision thresholds have been predefined that preclude the licensing of a treatment without further discussion. These may (but need not) be the unadjusted rejection boundaries.
Let me now discuss the computation of the conditional FWER, that is, the conditional expectation of the binary loss given that treatment 1 is dropped. Often
$$E_\theta[\, \tilde{l}(d, \eta(\theta)) \mid A_j \,] \le E_\theta[\, \tilde{l}(d, \eta(\theta)) \,] \le E_\theta[\, l(d, \eta(\theta)) \,]. \qquad (2)$$
This implies $\tilde{c} < c$ for the critical values $c$ and $\tilde{c}$ of $d$ and $\tilde{d}$, and thereby a gain in power by switching from $d$ to $\tilde{d}$ if $A_j = \{Z_{j,1} < b\}$ occurs. While the second inequality in (2) is obvious, the first requires some arguments. It follows from Pitt's inequality (e.g., Tong, 1990) if $Z_{j,1}, Z_1, \dots, Z_m$ are multivariate normal with nonnegative correlations, an assumption that is often used to determine critical values, justified by asymptotic approximations. The arguments can be found in the Appendix. It also follows intuitively, because a small $Z_{j,1}$ occurs if the interim mean of the first treatment is small, which has no influence on the final means of the other treatments and the control, and/or if the interim mean in the control group is large, which decreases the chance for the other test statistics to finally be above the threshold $c$. This is because the conditional expectation in (2) converges to its unconditional expectation when $\theta_1$ converges to $-\infty$.
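The ordering of the conditional and unconditional error rates can be illustrated by simulation. The following Monte Carlo sketch is not the author's computation; it assumes a many-to-one comparison with known variances and equal arm sizes under the global null hypothesis, a single interim analysis at half of the information, and arbitrarily chosen values for the critical value and the futility threshold. All names and parameter values are hypothetical.

```python
import math
import random

def simulate_error_rates(m=3, info_frac=0.5, crit=2.0, futility=0.0,
                         n_sim=100_000, seed=1):
    """Estimate FWERs under the global null (all theta_i = 0).

    Returns the FWER over all m treatments, the FWER over treatments
    2, ..., m, and the latter conditional on dropping treatment 1."""
    rng = random.Random(seed)
    w1, w2 = math.sqrt(info_frac), math.sqrt(1.0 - info_frac)
    root2 = math.sqrt(2.0)
    n_drop = rej_all = rej_rest = rej_rest_and_drop = 0
    for _ in range(n_sim):
        # Standardized stage-wise arm means; index 0 is the control arm.
        stage1 = [rng.gauss(0.0, 1.0) for _ in range(m + 1)]
        stage2 = [rng.gauss(0.0, 1.0) for _ in range(m + 1)]
        final = [w1 * s1 + w2 * s2 for s1, s2 in zip(stage1, stage2)]
        # Interim statistic of treatment 1 and final statistics of all arms.
        z_interim_1 = (stage1[1] - stage1[0]) / root2
        z_final = [(final[i] - final[0]) / root2 for i in range(1, m + 1)]
        dropped = z_interim_1 < futility          # interim event
        any_all = max(z_final) >= crit            # binary loss, all arms
        any_rest = max(z_final[1:]) >= crit       # binary loss, arms 2..m
        n_drop += dropped
        rej_all += any_all
        rej_rest += any_rest
        rej_rest_and_drop += dropped and any_rest
    return {
        "fwer_all": rej_all / n_sim,
        "fwer_remaining": rej_rest / n_sim,
        "fwer_remaining_given_drop": rej_rest_and_drop / n_drop,
    }

if __name__ == "__main__":
    print(simulate_error_rates())
```

In this setup the interim statistic of treatment 1 is positively correlated with the final statistics of the other treatments through the shared control arm. The estimated conditional FWER of the remaining treatments should therefore fall below their unconditional FWER, which in turn cannot exceed the FWER over all treatments. A critical value calibrated to the conditional error rate would consequently be smaller than the original one.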

SUMMARY AND DISCUSSION
In this comment, I illustrated how the PWER (Brannath et al., 2023) can be written as an FWEL (Maurer et al., 2023) and thereby generalized it to what I called a populationwise expected loss. This opens the possibility of applying the FWEL framework to clinical trials with overlapping populations. I also used the FWEL framework to discuss the question of how to deal with changes in the risks and costs during a single trial, like a MAMS or platform trial, or across a sequence of trials, when dropping a treatment or observing an unsuccessful trial. I suggested conditioning in this (and only this) case on the event that causes the change in the risk, in our examples the dropping of a treatment or the failure of a previous trial, and passing from unconditional to conditional control of the FWEL. I argued in several ways why the switch from unconditional to conditional control can be considered a natural action and showed that in many cases this just results in an adjustment of the critical boundaries for an overall control of the FWEL with the remaining risks (see also the Appendix). More generally, one could say that, when the risks for post- or out-trial patients change in the course of a single trial or a sequence of trials, it is natural to account for this change by conditioning on this event and accounting for the new and actual risks in the control of the FWEL. At least, to ignore the change and account for risks and losses that can definitely be excluded from some time point on appears unjustified and inefficient to me. I expect that my suggestions will be considered controversial and hope to initiate a discussion and further research on how to best deal with interim decisions that change the post-trial risks. Future research may also consider other examples like the adding of new treatments or the dropping/adding of subgroups.

A.1 Inequalities for the conditional expected loss
We will utilize here a result of Pitt as given in Tong (1990, Result 7.2.3): For a $k$-dimensional multivariate normal random vector $X$ with only nonnegative correlations and nondecreasing, Borel-measurable functions $f_1, f_2: \mathbb{R}^k \to \mathbb{R}$ we have that $E[\, f_1(X) f_2(X) \,] \ge E[f_1(X)] \, E[f_2(X)]$ whenever the expectations exist.
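As a plausibility check of this inequality, the sketch below estimates both sides by simulation for a bivariate normal vector with nonnegative correlation and two nondecreasing indicator functions. The correlation, the thresholds, and the function name are illustrative assumptions and not part of the cited result.

```python
import math
import random

def pitt_check(rho=0.5, a=1.0, b=1.0, n_sim=100_000, seed=1):
    """Monte Carlo estimates of E[f1(X) f2(X)] and E[f1(X)] * E[f2(X)]
    for f1(x) = 1{x_1 >= a} and f2(x) = 1{x_2 >= b}, both nondecreasing,
    where X is standard bivariate normal with correlation rho >= 0."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho ** 2)
    n1 = n2 = n12 = 0
    for _ in range(n_sim):
        x1 = rng.gauss(0.0, 1.0)
        x2 = rho * x1 + s * rng.gauss(0.0, 1.0)  # Corr(x1, x2) = rho
        f1, f2 = x1 >= a, x2 >= b
        n1 += f1
        n2 += f2
        n12 += f1 and f2
    return n12 / n_sim, (n1 / n_sim) * (n2 / n_sim)

if __name__ == "__main__":
    lhs, rhs = pitt_check()
    print(lhs, rhs, lhs >= rhs)
```

Replacing one of the two functions by a nonincreasing one, such as the indicator of a futility event, reverses the direction of the inequality, which is the form needed for conditional expectations given that a treatment is dropped.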
In the following, we assume a loss function $\tilde{l}(d, \eta)$ which is nondecreasing in $d$ for any given $\eta$ and decision functions $d_i$, $i = 1, \dots, m$, which are nondecreasing functions of the trial's final parameter estimators $\hat{\theta}_1, \dots, \hat{\theta}_m$. We further assume interim statistics $Z_{j,1}, \dots, Z_{j,m}$, for example, interim test statistics or some interim treatment group means, such that $Z_{j,1}, \dots, Z_{j,m}, \hat{\theta}_1, \dots, \hat{\theta}_m$ are multivariate normal with nonnegative correlations. The assumptions on $d$ and $\tilde{l}$ will be satisfied for all reasonable test decision and loss functions. The assumptions on the statistics are less general, but often satisfied at least asymptotically.
Let now $A$ be a nonincreasing event in $Z_{j,1}, \dots, Z_{j,m}$, meaning that its indicator function $1_A(\cdot)$ is a nonincreasing function of the interim statistics.