On the logic of collapsibility for causal effect measures

Liu et al. (2020) discuss the relation between efficacy measures within subgroups and efficacy measures on the population level, which can be obtained by merging the subgroups. They come to the conclusion that neither odds ratios (for binary endpoints) nor hazard ratios (for time-to-event endpoints) are suitable measures of efficacy in this context. This insight is not new, and more general settings have been considered previously (Daniel, Zhang, & Farewell, 2020; Greenland & Pearl, 2011; Greenland, Robins, & Pearl, 1999; Huitfeldt, Stensrud, & Suzuki, 2019; Martinussen & Vansteelandt, 2013; Pang, Kaufman, & Platt, 2013; Sjölander, Dahlqwist, & Zetterqvist, 2016). While we largely agree with their conclusion, we do so for different reasons and would like to point out a number of important subtleties that have perhaps not been appreciated by Liu et al. (2020). These should be carefully understood to avoid any further misleading interpretations. In particular, we want to emphasise, like many before, that confounding and non-collapsibility are separate issues (Didelez et al., 2010; Greenland, 1996; Greenland & Pearl, 2011; Greenland et al., 1999; Pang et al., 2013; Shrier & Pang, 2015); to cite Greenland (2011): ‘confounding may occur with or without non-collapsibility, and non-collapsibility may occur with or without confounding’. Moreover, in view of patients and investigators preferring contrasts in terms of absolute risks (Murray, Caniglia, Swanson, Hernández-Díaz, & Hernán, 2018), we are sceptical about the emphasis on relative median survival time proposed in Liu et al. (2020).

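Let $Y$ denote the outcome, $A$ the treatment and $M$ a covariate (such as a biomarker), and let $\phi = \phi(p(y, a))$ be a measure of marginal association between $Y$ and $A$; that is, $\phi$ is a functional of the joint distribution $p(y, a)$ (in particular, $\phi$ could be a parameter of $p(y \mid a)$). Let $\phi_m = \phi(p(y, a \mid M = m))$ be a measure of conditional association between $Y$ and $A$ given $M = m$; that is, $\phi_m$ is a functional of the conditional distribution $p(y, a \mid M = m)$ (in particular, $\phi_m$ could be a parameter of $p(y \mid a, M = m)$). The measure $\phi$ is called collapsible over $M$ if $\phi$ is a weighted average of the $\phi_m$, $m \in \mathcal{M}$ (Greenland, 2011; Huitfeldt et al., 2019). Note that a collapsible measure is thus 'logic-respecting' as defined in Liu et al. (2020). When $\phi_m = \phi_{m'}$ for all $m, m'$, strict collapsibility demands that $\phi = \phi_m$ (Greenland et al., 1999). In the context of odds ratios, 'collapsibility' is often used to mean 'strict collapsibility'.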
The parameters $\phi$ or $\phi_m$ are measures of association, which may or may not have a causal interpretation. It is well known that the regression coefficient of $A$ in a linear regression of $Y$ on $A$ alone, versus on $(A, M)$ jointly, is not collapsible when $A$ and $M$ are correlated, that is, when $M$ is unbalanced in the treatment arms. As a further example consider the associational risk difference $\phi = P(Y = 1 \mid A = Rx) - P(Y = 1 \mid A = C)$. Note that
$$P(Y = 1 \mid A = a) = \sum_{m} P(Y = 1 \mid A = a, M = m)\, P(M = m \mid A = a).$$
Hence, it can easily be seen that the marginal associational risk difference $\phi$ equals a weighted average of $\phi_m = P(Y = 1 \mid A = Rx, M = m) - P(Y = 1 \mid A = C, M = m)$ with weights $P(M = m \mid A = a)$ under the condition that $M \perp\!\!\!\perp A$, in which case these weights do not depend on $a$. Otherwise, $P(M = m \mid A = Rx)$ and $P(M = m \mid A = C)$ can be completely different distributions and $\phi$ is not a weighted average of the $\phi_m$, as illustrated by the example of tables 2 and 3 in Liu et al. (2020). Note that what the authors calculate as the 'true' marginal efficacies in these tables are really the crude (marginal) associations and not the causal efficacy (see below).
In both cases considered above (linear regression and risk differences), collapsibility holds when $M \perp\!\!\!\perp A$ (i.e. the covariate $M$ is balanced in the treatment arms), and this independence is implied when the treatment $A$ is randomised and causal effects of $A$ are identified. However, we do not generally expect collapsibility otherwise. These points hint at a relation between collapsibility and causal inference. Causally interpretable measures are also what Liu et al. (2020) are interested in, as indicated by the title of their paper. Thus, we will discuss the issue of collapsibility from a causal point of view and argue that a formal consideration of causal concepts is crucial.
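The role of balance in the associational risk difference can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from Liu et al. (2020): the subgroup risks and the two covariate distributions are hypothetical numbers chosen for illustration, and the function names are our own.

```python
# Hypothetical subgroup risks P(Y=1 | A=a, M=m); illustrative numbers only.
RISK = {
    "Rx": {"g+": 0.60, "g-": 0.30},
    "C": {"g+": 0.40, "g-": 0.20},
}

def marginal_risk(arm, p_m_given_arm):
    """P(Y=1 | A=arm) = sum_m P(Y=1 | A=arm, M=m) * P(M=m | A=arm)."""
    return sum(RISK[arm][m] * p_m_given_arm[m] for m in RISK[arm])

def marginal_rd(p_m_rx, p_m_c):
    """Crude (associational) risk difference between the two arms."""
    return marginal_risk("Rx", p_m_rx) - marginal_risk("C", p_m_c)

# Subgroup-specific risk differences: 0.20 in g+ and 0.10 in g-.
subgroup_rd = {m: RISK["Rx"][m] - RISK["C"][m] for m in ("g+", "g-")}

# Balanced arms: M has the same distribution in both arms.
balanced = marginal_rd({"g+": 0.5, "g-": 0.5}, {"g+": 0.5, "g-": 0.5})

# Unbalanced arms: 1/3 versus 2/3 of g+ patients (hypothetical imbalance).
unbalanced = marginal_rd({"g+": 1 / 3, "g-": 2 / 3}, {"g+": 2 / 3, "g-": 1 / 3})

print(subgroup_rd, balanced, unbalanced)
```

With balanced arms the crude risk difference is the prevalence-weighted average of the subgroup differences. With the unbalanced arms it falls below both subgroup differences, so it cannot be a weighted average of them.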

A CAUSAL POINT OF VIEW
Causal contrasts, comparing a treatment versus control, are commonly expressed in terms of potential outcomes or, more generally, their distributions. Thus, $Y(Rx)$ is the outcome if a person is assigned to treatment $Rx$ and $Y(C)$ is the outcome if the same person is assigned to control $C$. Note that this notation is oblivious as to whether the data are obtained from an observational study or a randomised controlled trial (RCT), and we use it to be formal and explicit about estimands and assumptions. In other scientific communities this is also expressed as interventional distributions such that $P(Y = y; \mathrm{do}(A = Rx))$ stands for the distribution of the endpoint $Y$ under an intervention that sets treatment $A$ to $Rx$, and similar for controls (Pearl, 2009). For our purposes we can regard $P(Y(a) = y)$ as equal to $P(Y = y; \mathrm{do}(A = a))$.

1.1 Causal collapsibility
Liu et al. (2020) state that their interest is 'true effects'. While they do not provide a formal definition of 'true effects', we assume that they refer to some causal notion and its population value as opposed to an estimated value. To define causal effects it is immaterial whether the biomarker $M$ is balanced between treatment arms, though of course, estimation may be affected by such an imbalance. Let $\nu(\cdot)$ be a summary measure of a distribution, for example, its mean or the odds. We denote this summary for the subgroups $g^+, g^-$ in each treatment arm as
$$\nu^{g}_{Rx} = \nu\big(P(Y(Rx) \mid M = g)\big), \qquad \nu^{g}_{C} = \nu\big(P(Y(C) \mid M = g)\big), \qquad g \in \{g^+, g^-\}.$$
The above can be interpreted as applying $Rx$ (or $C$) to each subgroup and then summarising the distribution using $\nu$ separately by subgroups. Similarly, over the entire patient population (or marginally) we have
$$\nu^{\{g^+, g^-\}}_{Rx} = \nu\big(P(Y(Rx))\big), \qquad \nu^{\{g^+, g^-\}}_{C} = \nu\big(P(Y(C))\big).$$
This can be interpreted as applying $Rx$ (or $C$) to the entire patient population and then summarising the distribution using $\nu$ (as in Liu et al. (2020), start of section 3). Further, let $\eta^{g}$ denote a contrast of $\nu^{g}_{Rx}$ versus $\nu^{g}_{C}$, for example, the difference or ratio; and let $\eta^{\{g^+, g^-\}}$ denote the corresponding contrast of $\nu^{\{g^+, g^-\}}_{Rx}$ versus $\nu^{\{g^+, g^-\}}_{C}$. In other words, $\eta^{g}$ is the causal counterpart to the conditional association measure and $\eta^{\{g^+, g^-\}}$ to the marginal association measure, respectively. In this context, we define collapsibility of a causal measure over $M$ if it holds that $\eta^{\{g^+, g^-\}}$ is a weighted average of $\eta^{g^+}$ and $\eta^{g^-}$ (Huitfeldt et al., 2019).
It can easily be seen that the causal risk difference and causal risk ratio are collapsible effect measures while the causal odds ratio is not. Huitfeldt et al. (2019) show that the marginal causal risk ratio (or relative response in Liu et al. (2020)) is a weighted average of subgroup causal risk ratios with weights $P(M = g \mid Y(C) = 1)$ (see also Miettinen, 1972).
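The two weighted-average representations can be verified numerically. The sketch below assumes hypothetical causal risks $P(Y(a) = 1 \mid M = g)$ and a hypothetical biomarker prevalence; it checks that the marginal risk difference is the prevalence-weighted average of the subgroup risk differences, and that the marginal risk ratio is the average of the subgroup risk ratios with weights $P(M = g \mid Y(C) = 1)$.

```python
# Hypothetical causal risks P(Y(a)=1 | M=g) and prevalence P(M=g).
RISK = {"Rx": {"g+": 0.8, "g-": 0.4}, "C": {"g+": 0.4, "g-": 0.1}}
PREV = {"g+": 0.5, "g-": 0.5}
GROUPS = ("g+", "g-")

# Marginal causal risks by mixing over subgroups: 0.6 under Rx, 0.25 under C.
marginal = {a: sum(RISK[a][g] * PREV[g] for g in GROUPS) for a in RISK}

# Subgroup and marginal contrasts.
rd_g = {g: RISK["Rx"][g] - RISK["C"][g] for g in GROUPS}
rr_g = {g: RISK["Rx"][g] / RISK["C"][g] for g in GROUPS}
marginal_rd = marginal["Rx"] - marginal["C"]
marginal_rr = marginal["Rx"] / marginal["C"]

# Risk difference: the weights are the prevalences P(M=g).
rd_weighted = sum(PREV[g] * rd_g[g] for g in GROUPS)

# Risk ratio: the weights are P(M=g | Y(C)=1), obtained by Bayes' rule.
rr_weights = {g: RISK["C"][g] * PREV[g] / marginal["C"] for g in GROUPS}
rr_weighted = sum(rr_weights[g] * rr_g[g] for g in GROUPS)

print(marginal_rd, rd_weighted)
print(marginal_rr, rr_weighted, rr_weights)
```

In this example the risk-ratio weights (0.8 and 0.2) differ from the prevalences (0.5 and 0.5) because the biomarker is associated with the outcome under control.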

1.1.1 Prognostic and predictive biomarkers
Liu et al. (2020) distinguish prognostic and predictive properties of a biomarker $M$. For our considerations here and for the causal claims made by the authors, we want to stress that the biomarker needs to be pre-treatment (or baseline), that is, known to not be affected by treatment. The causal assumptions detailed below do not generally hold when conditioning on any post-treatment variables which may be on some causal pathway from the treatment $A$ to the outcome $Y$. Under this premise, we interpret the definition in Liu et al. (2020) of a biomarker to be 'treatment-effect prognostic' as $Y(C) \not\perp\!\!\!\perp M$. In that case the risk-ratio weights $P(M = g \mid Y(C) = 1)$ deviate from the prevalence of the biomarker, $P(M = g)$, that is, when it is prognostic for the controls. In contrast, a biomarker is predictive if $\eta^{g^+} \neq \eta^{g^-}$, where $\eta^{g}$ denotes the causal contrast in subgroup $g$; that is, it is an effect modifier on the chosen scale. It is worth mentioning that effect modification depends on the scale: for example, if $M$ is prognostic but not predictive on the additive scale (risk difference), it will generally be predictive on the multiplicative scale (risk ratio) whenever the treatment effect is non-null. We recommend VanderWeele and Robins (2007) and VanderWeele (2015) for further insights into the causal interpretation of statistical interactions and their distinction from causal effect modification.

Structural assumptions
Inference on causal effects from any data always relies on certain structural assumptions (in addition to, say, parametric or other modelling assumptions). These structural assumptions refer to the relation between potential outcomes and observables and are the basis for using the observed data for inference on causal effects. Some of these assumptions can easily be derived from causal diagrams representing prior assumptions on the causal structure (Greenland & Pearl, 2011; Pearl, 2009). As we will explain, the key structural assumptions are often easier to justify when the data result from an RCT. The first assumption is that of positivity, demanding that $0 < P(A = a \mid M = m) < 1$; this is always satisfied in an RCT, at least for the population eligible for the trial. Further, note that the potential outcomes $Y(Rx)$ and $Y(C)$ can never be observed jointly (one of them will be 'counterfactual'); however, under the assumption of causal consistency, we can at least observe one of the potential outcomes. This states that when $A = a$, then we observe $Y = Y(a)$. Hence, under consistency we have for the observable outcome $Y$:
$$Y = \mathbb{1}\{A = Rx\}\, Y(Rx) + \mathbb{1}\{A = C\}\, Y(C),$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. There is no need for an upper index on $Y$ as used in Liu et al. (2020). Also note that in the standard RCT context, consistency will typically hold. It might be violated if the administration of treatment in the trial (under special medical supervision, say) is extremely different, in a way that substantially affects the outcome, from how it would ever be administered in real life.

Collapsibility and confounding
Besides positivity and consistency, a key assumption is that of 'ignorability' (aka 'exchangeability' or 'no unmeasured confounding').

Ignorability
We have ignorability if $Y(a) \perp\!\!\!\perp A$ for $a \in \{Rx, C\}$. This can be interpreted as the treatment $A$ being independent of any baseline characteristics that predict the potential outcomes (this ensures covariate balance in expectation). It is easily seen that ignorability (with consistency) implies
$$P(Y(a) = y) = P(Y(a) = y \mid A = a) = P(Y = y \mid A = a).$$
The above equality proves non-parametric identifiability of any marginal (i.e. population) causal measure of efficacy in an RCT. In particular it implies that, in an RCT, the distribution of $Y$ in the treatment arm can be seen as a sample from the distribution of $Y(Rx)$, and the control arm as a sample from the distribution of $Y(C)$. Note that, as always with finite samples, especially when small, these can happen to be 'bad' samples due to sampling variability, but they will not systematically be so. An accidental imbalance of 1/3 versus 2/3 as considered in table 3 of Liu et al. (2020) is unlikely in sufficiently large RCTs.

Conditional ignorability
A different assumption, which Liu et al. (2020) do not clearly distinguish from ignorability, is that of conditional ignorability given $M$: this demands that $Y(a) \perp\!\!\!\perp A \mid M$. Note that this is not the same as (unconditional) ignorability. Conditional ignorability is often assumed in the context of observational studies where $A \not\perp\!\!\!\perp M$ (i.e. treatment assignment does not balance baseline covariates), where $M$ is a set of covariates (not only a biomarker) that capture all confounding between the treatment $A$ and the outcome $Y$.

Randomisation
Under randomisation of the treatment $A$ we have the even stronger property that $(Y(a), M) \perp\!\!\!\perp A$, and this implies both ignorability and conditional ignorability. Under randomisation of $A$ (or under conditional ignorability) we have that
$$P(Y(a) = y \mid M = m) = P(Y(a) = y \mid A = a, M = m) = P(Y = y \mid A = a, M = m).$$
This proves that under randomisation of $A$ any conditional (i.e. subgroup) causal measure of efficacy is non-parametrically identified. In the case where $(Y, A, M)$ are binary, no further assumptions are required. Causally interpretable marginal or conditional risk differences, risk ratios or odds ratios can consistently be estimated; also note that under independent censoring, marginal and conditional survival curves can consistently be estimated, for example, by the Kaplan-Meier estimator.
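The identification argument can be illustrated by simulation. The sketch below is an assumption-laden toy example: it generates both potential outcomes for every simulated patient from hypothetical subgroup risks, randomises treatment independently of everything else, and applies consistency to obtain the observed outcome. The function name, sample size and seed are arbitrary choices of ours.

```python
import random

# Hypothetical causal risks P(Y(a)=1 | M=g); illustrative numbers only.
RISK = {"Rx": {"g+": 0.8, "g-": 0.4}, "C": {"g+": 0.4, "g-": 0.1}}

def simulate_rct(n=100_000, seed=0):
    """Return (mean potential outcomes over all patients, observed arm means)."""
    rng = random.Random(seed)
    po_total = {"Rx": 0, "C": 0}   # sums of Y(Rx) and Y(C) over all patients
    obs_total = {"Rx": 0, "C": 0}  # sums of observed Y within each arm
    arm_size = {"Rx": 0, "C": 0}
    for _ in range(n):
        m = "g+" if rng.random() < 0.5 else "g-"
        y_pot = {a: int(rng.random() < RISK[a][m]) for a in ("Rx", "C")}
        # Randomisation: A is drawn independently of (Y(Rx), Y(C), M).
        a = "Rx" if rng.random() < 0.5 else "C"
        # Consistency: the observed outcome is the potential outcome of the assigned arm.
        y = y_pot[a]
        for arm in ("Rx", "C"):
            po_total[arm] += y_pot[arm]
        obs_total[a] += y
        arm_size[a] += 1
    potential = {arm: po_total[arm] / n for arm in po_total}
    observed = {arm: obs_total[arm] / arm_size[arm] for arm in obs_total}
    return potential, observed

potential, observed = simulate_rct()
print(potential, observed)
```

Up to sampling variability, the observed arm-specific event proportions agree with the population means of the corresponding potential outcomes, which in a real trial are never jointly observable.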

Confounding
The amount of confounding, say in an observational study, is sometimes measured by comparing an estimate of the simple associational measure $\phi$ with an estimate of the causal marginal effect $\eta^{\{g^+, g^-\}}$, where the latter is obtained by suitable adjustment such as standardisation or inverse probability of treatment weighting (IPTW) (Greenland et al., 1999). Essentially, this boils down to comparing whether (a summary of) $P(Y = y \mid A = a)$ is different from (the same summary of) $P(Y(a) = y)$, though one would need to exclude other sources of structural bias as well.
The assumption of ignorability thus implies that there is no confounding of the effect of the treatment $A$ on the outcome $Y$; the assumption of conditional ignorability given $M$ implies that there is no confounding, other than possibly by $M$, of the effect of $A$ on $Y$.
Under either assumption, and in particular under randomisation of $A$, any of the above quantities $\eta^{g}$ and $\eta^{\{g^+, g^-\}}$ can consistently be estimated from an RCT as they are known functions of $P(Y(a) = y)$ or $P(Y(a) = y \mid M = g)$, for $a = Rx, C$, and these can be obtained by the observable $P(Y = y \mid A = a)$ or $P(Y = y \mid A = a, M = g)$, respectively.
Moreover, the relation between any marginal and conditional measures is mathematically determined: we have
$$P(Y(a) = y) = \sum_{g \in \{g^+, g^-\}} P(Y(a) = y \mid M = g)\, P(M = g), \tag{3}$$
which follows the subgroup mixable principle as stressed by Liu et al. (2020). The marginal causal odds ratio given by
$$\frac{P(Y(Rx) = 1)\,/\,P(Y(Rx) = 0)}{P(Y(C) = 1)\,/\,P(Y(C) = 0)}$$
can be re-expressed in terms of the probabilities in the subgroups via (3); all of these quantities can consistently be estimated with data from an RCT or with data on $(Y, A, M)$ from an observational study if conditional ignorability holds (see, e.g. Zhang, 2008). The same holds for the subgroup causal odds ratios. The fact that the marginal causal odds ratio is not a weighted average of the subgroup causal odds ratios, that is, its non-collapsibility, pertains to its mathematical properties and has nothing to do with absence or presence of confounding (Greenland, 1996, 2011; Greenland et al., 1999; Hernán et al., 2011).
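A small calculation makes the non-collapsibility of the odds ratio concrete. The sketch below uses hypothetical causal risks, chosen by us so that the causal odds ratio equals 4 in both subgroups; no confounding is involved, since only potential-outcome probabilities and the biomarker prevalence enter.

```python
# Hypothetical causal risks P(Y(a)=1 | M=g), chosen so that both subgroup
# odds ratios equal 4; the prevalence P(M=g) is also hypothetical.
RISK = {"Rx": {"g+": 0.8, "g-": 0.5}, "C": {"g+": 0.5, "g-": 0.2}}
PREV = {"g+": 0.5, "g-": 0.5}
GROUPS = ("g+", "g-")

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

def odds_ratio(p_treated, p_control):
    return odds(p_treated) / odds(p_control)

# Subgroup causal odds ratios.
or_g = {g: odds_ratio(RISK["Rx"][g], RISK["C"][g]) for g in GROUPS}

# Marginal causal risks by mixing the subgroup risks, then the marginal odds ratio.
marginal = {a: sum(RISK[a][g] * PREV[g] for g in GROUPS) for a in RISK}
marginal_or = odds_ratio(marginal["Rx"], marginal["C"])

print(or_g, marginal, marginal_or)
```

Here the marginal odds ratio is about 3.45 although both subgroup odds ratios are 4, so it lies outside the range of the subgroup values and cannot be any weighted average of them.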

Adjustment
In section 3.1 Liu et al. (2020) address adjustment for imbalances. Their motivation for this is somewhat unclear as they otherwise consider RCTs. Indeed, we do not consider it of much importance that, in RCTs, small sample sizes can result in random imbalances: 'confounding' is a source of structural bias which does not go away with increasing sample size, whereas such finite-sample imbalances under randomisation are not sources of bias; they are the random variation we expect with small samples. In the context of subgroups, one considers conditional effects. As emphasised by Daniel et al. (2020), 'adjusting' and 'conditioning' should not be confused. Adjusting typically means that the analysis takes observed confounders into account. This can be achieved by a number of different methods, such as regression adjustment, stratification, matching or IPTW (see, e.g. Goetghebeur, le Cessie, De Stavola, Moodie, & Waernbaum, 2020, for a review). Of these, regression and stratification use the principle of conditioning. However, when the aim is to estimate a marginal causal effect, then additional standardising, on the correct scale, with respect to the confounder distribution is required. Liu et al. (2020) are instead interested in subgroup-specific measures of efficacy, that is, they are conditional on the subgroup indicator. Their reason for the conditioning, in an RCT, is thus a different one than adjustment.
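To illustrate what adjustment for a marginal effect involves, the following sketch considers a hypothetical observational setting in which treatment assignment depends on the biomarker and conditional ignorability is assumed to hold. All probabilities are made-up population values, so the calculation is exact rather than an estimate; the function names are our own.

```python
# Hypothetical population quantities; conditional ignorability given M is assumed.
P_M = {"g+": 0.5, "g-": 0.5}           # P(M=m)
P_RX_GIVEN_M = {"g+": 0.8, "g-": 0.2}  # P(A=Rx | M=m): treatment depends on M
RISK = {                               # P(Y=1 | A=a, M=m)
    ("Rx", "g+"): 0.8, ("Rx", "g-"): 0.4,
    ("C", "g+"): 0.4, ("C", "g-"): 0.1,
}

def p_a_given_m(a, m):
    return P_RX_GIVEN_M[m] if a == "Rx" else 1 - P_RX_GIVEN_M[m]

def crude_risk(a):
    """P(Y=1 | A=a): mixes subgroup risks with the arm-specific weights P(M=m | A=a)."""
    num = sum(P_M[m] * p_a_given_m(a, m) * RISK[(a, m)] for m in P_M)
    den = sum(P_M[m] * p_a_given_m(a, m) for m in P_M)
    return num / den

def standardised_risk(a):
    """Standardisation: mixes subgroup risks with the population weights P(M=m)."""
    return sum(P_M[m] * RISK[(a, m)] for m in P_M)

def iptw_risk(a):
    """IPTW: E[1{A=a} Y / P(A=a | M)], evaluated cell by cell."""
    return sum(
        P_M[m] * p_a_given_m(a, m) * RISK[(a, m)] / p_a_given_m(a, m) for m in P_M
    )

crude_rd = crude_risk("Rx") - crude_risk("C")
standardised_rd = standardised_risk("Rx") - standardised_risk("C")
iptw_rd = iptw_risk("Rx") - iptw_risk("C")
print(crude_rd, standardised_rd, iptw_rd)
```

The crude risk difference (0.56) differs from the standardised and IPTW versions (both 0.35), which target the marginal causal risk difference under the stated assumptions. The subgroup-specific risk differences (0.4 and 0.3) are conditional quantities and answer a different question.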

TIME-TO-EVENT ENDPOINTS
While it has been known (at least) since the 1970s that the odds ratio is not collapsible (Whittemore, 1978), it has taken a little longer to appreciate the corresponding problem with hazard or rate ratios (Aalen, Cook, & Røysland, 2015; Greenland, 1996; Sjölander et al., 2016). The issue with rate (hazard) contrasts is more subtle than for odds ratios, and we want to address two particular aspects here.

Non-collapsibility of rate/hazard differences/ratios
It is curious that while (causal) risk differences and risk ratios are collapsible, rate (hazard) differences and rate (hazard) ratios are not. The key here is that rates (hazards) are based on conditional probabilities, namely, conditional on prior survival. A rate (hazard) can be converted into a risk, the probability of an event before a given time, through a non-linear transformation which 'destroys' the collapsibility of risk differences and risk ratios, as described in Daniel et al. (2020). As with odds ratios, these phenomena occur under ignorability or randomisation, and are therefore not linked to, or indicative of, any confounding.
However, additive hazard differences (in continuous time) satisfy the special condition of strict collapsibility (Daniel et al., 2020); in particular, when an additive Aalen model with no interaction terms fits the data-generating mechanism during the entire follow-up (which is a strong parametric restriction), then the marginal and conditional hazard differences are equal (Daniel et al., 2020; Sjölander et al., 2016). This strict collapsibility does not imply that the hazard difference is a collapsible effect measure in general (see also Section 2.4).
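Both statements can be checked in a simple mixture model. The sketch below assumes two subgroups with constant (exponential) hazards and hypothetical rates: in one scenario the treatment halves the hazard in each subgroup, in the other it lowers the hazard by 0.2 in each subgroup. The marginal hazard is obtained from the mixture of the subgroup survival functions.

```python
import math

PREV = {"g+": 0.5, "g-": 0.5}          # hypothetical subgroup prevalences
CONTROL = {"g+": 2.0, "g-": 0.5}       # hypothetical constant hazards under control
TREATED_MULT = {g: 0.5 * r for g, r in CONTROL.items()}  # hazard ratio 0.5 in each subgroup
TREATED_ADD = {g: r - 0.2 for g, r in CONTROL.items()}   # hazard difference -0.2 in each subgroup

def marginal_hazard(rates, t):
    """Hazard at time t of a mixture of exponential subgroups: density / survival."""
    density = sum(PREV[g] * rates[g] * math.exp(-rates[g] * t) for g in PREV)
    survival = sum(PREV[g] * math.exp(-rates[g] * t) for g in PREV)
    return density / survival

def marginal_hr(t):
    return marginal_hazard(TREATED_MULT, t) / marginal_hazard(CONTROL, t)

def marginal_hd(t):
    return marginal_hazard(TREATED_ADD, t) - marginal_hazard(CONTROL, t)

for t in (0.0, 0.5, 1.0, 2.0):
    print(t, round(marginal_hr(t), 4), round(marginal_hd(t), 4))
```

In this toy model the marginal hazard ratio equals 0.5 only at time zero and then drifts away from the common subgroup value, whereas the marginal hazard difference stays at -0.2 throughout, in line with the strict collapsibility of the additive model without interactions.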

'The hazards of hazard ratios'
The title of this subsection is a quote from Hernán (2010), who explains why hazard ratios are problematic as measures of causal contrasts. The issues pertain to the differential effects in (possibly latent) subgroups. Indeed, the causal interpretation of hazard ratios as well as other hazard-based contrasts is ambiguous, as we explain next (Martinussen, Vansteelandt, & Andersen, 2020; Stensrud, Aalen, Aalen, & Valberg, 2019a; Stensrud & Hernán, 2020). As mentioned in Section 2.1, rates and hazards are based on conditional probabilities, conditioning on prior survival. Using potential outcomes notation, this means that we consider
$$\lambda_a(t) = \lim_{h \downarrow 0} \frac{1}{h}\, P\big(t \le T(a) < t + h \mid T(a) \ge t\big),$$
where $T(a)$ is the time-to-event when assigned to treatment arm $a \in \{Rx, C\}$. The hazard ratio is therefore
$$\frac{\lambda_{Rx}(t)}{\lambda_{C}(t)} = \frac{\lim_{h \downarrow 0} \frac{1}{h}\, P\big(t \le T(Rx) < t + h \mid T(Rx) \ge t\big)}{\lim_{h \downarrow 0} \frac{1}{h}\, P\big(t \le T(C) < t + h \mid T(C) \ge t\big)}.$$
This hazard ratio, which is the target of inference in most randomised trials with time-to-event outcomes, is problematic because a causal contrast should compare the same population under the scenarios of treatment versus control. However, the numerator of the hazard ratio is a probability for the 'population' characterised by $\{T(Rx) \ge t\}$ while the denominator gives a probability for the 'population' $\{T(C) \ge t\}$; these can be very different groups.
To explain the distinction between conditioning on $\{T(Rx) \ge t\}$ and $\{T(C) \ge t\}$, assume $U$ is a latent frailty which interacts with treatment. While $U$ is balanced at baseline due to randomisation of the treatment $A$, the treated individuals who survive a given time may tend to be more 'frail' (if treatment is beneficial and lowers mortality) than the controls who survive the given time (Hernán, 2010). Thus, as $t$ increases, a contrast of hazards compares possibly increasingly differing groups of 'survivors'.
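The selection of survivors can be made explicit in a toy model. The sketch below assumes a binary latent frailty with hypothetical constant hazards within each frailty level and arm, and computes the share of frail individuals among those still event-free at time $t$ in each arm. For simplicity the treatment halves the hazard at both frailty levels, which already suffices to produce the selection effect.

```python
import math

P_FRAIL = 0.5  # hypothetical baseline share of frail individuals, equal in both arms
HAZARD = {     # hypothetical constant hazards by arm and frailty level
    "Rx": {"frail": 1.0, "robust": 0.25},
    "C": {"frail": 2.0, "robust": 0.5},
}

def frail_share_among_survivors(arm, t):
    """P(frail | T(arm) >= t) under exponential survival within each frailty level."""
    frail = P_FRAIL * math.exp(-HAZARD[arm]["frail"] * t)
    robust = (1 - P_FRAIL) * math.exp(-HAZARD[arm]["robust"] * t)
    return frail / (frail + robust)

for t in (0.0, 1.0, 2.0):
    print(
        t,
        round(frail_share_among_survivors("Rx", t), 4),
        round(frail_share_among_survivors("C", t), 4),
    )
```

At baseline both arms contain the same share of frail individuals, as guaranteed by randomisation. Among survivors at later times the treated arm retains a larger share of frail individuals than the control arm, so the two risk sets being compared by a hazard contrast are no longer exchangeable.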

Testing the null hypothesis
A special case deserves attention, especially in the context of clinical trials where testing the null hypothesis of 'no effect' is often described as the prime interest. When $Y \perp\!\!\!\perp A \mid M$, that is, when the endpoint $Y$ is independent of the treatment $A$ within every level of the biomarker $M$, then the causal odds ratio and hazard ratio are both strictly collapsible. In other words, we do not have to worry about the validity of statistical tests as the significance level is preserved; in an RCT, the marginal and the subgroup odds/hazard ratios all equal 1 under the null hypothesis. The issues discussed by Liu et al. (2020) are therefore relevant when the central aim is to quantify efficacy, which is more meaningful than hypothesis testing in many (if not most) practical settings. However, when null hypotheses are of interest, these should be formulated in terms of survival probabilities instead of hazard ratios (Stensrud et al., 2019a; Stensrud, Røysland, & Ryalen, 2019b).

Collapsibility of measures or models?
Collapsibility can be defined 'non-parametrically' as a property of a measure of efficacy (Huitfeldt et al., 2019). Nonparametric estimation is also feasible when we consider a restricted number of discrete variables. However, when extending the considerations to a continuous biomarker (which is not dichotomised), say, then models are used to impose some smoothness including smoothness over time for time-to-event endpoints. For example, the logistic link ensures that the probability remains inside (0,1), and the Cox model ensures the hazard remains positive (which the Aalen additive hazards model does not ensure). A measure of efficacy that is non-collapsible in general, can be collapsible under certain (restrictive) parametric model assumptions: for instance, the strict collapsibility of the hazard difference implies that the Aalen model without interaction terms is collapsible. However, this is a special case: for instance, if a simple Cox model without interaction terms fits the data-generating mechanism, then the hazard difference is no longer collapsible.

CONCLUSIONS
To summarise, we would like to emphasise the following points. For general clarity and to avoid misunderstandings, associational concepts of dependence should clearly and formally be distinguished from causal contrasts (or causal measures of efficacy). Formal frameworks and notation to do so have been available for over 40 years and keep being refined (Pearl, 2009; Robins, 1986; Rubin, 1974).
As explained above, and as has been pointed out many times before, confounding and non-collapsibility are separate issues and should be kept apart. It is not correct that the odds ratio or the hazard ratio somehow re-introduce confounding into an RCT as claimed by Liu et al. (2020). An RCT does guarantee 'no confounding' and this is not a 'belief' but a mathematical fact, which does not need amending. This has nothing to do with the choice of efficacy measure. Of course, RCTs can suffer from other problems, for instance, due to non-adherence or other intercurrent events.
We agree with Liu et al. (2020) that (causal) odds ratios and hazard ratios are problematic as causal contrasts. The non-collapsibility of these parameters is a mathematical property which makes their interpretation awkward, and this is amplified for hazards by their conditioning on survival. Thus, they are also unsuitable measures for transportability between different populations (Martinussen & Vansteelandt, 2013). It is particularly concerning that meta-analyses pool odds ratios or hazard ratios from different studies, each possibly using different variables for adjustment, where the issue of non-collapsibility is typically ignored. Careful causal considerations are in general required for transportability (see, e.g. Bareinboim & Pearl, 2013; Dahabreh, Robins, Haneuse, & Hernán, 2019).
However, a measure being collapsible does not automatically make it meaningful. Hazard differences derived from an additive Aalen model, for instance, are not easy to interpret and not really useful for decision making. In fact, there is empirical evidence that patients and investigators prefer contrasts in terms of absolute risk (Murray et al., 2018). This makes us sceptical whether the proposed ratio of median survival times will take hold; moreover, its estimation will often need to rely on parametric assumptions, even in a perfectly executed RCT, for instance, when less than 50% of the individuals experience the outcome during follow-up in either the treatment or control arm. Hence, the causal inference community is moving towards using contrasts of risk in time-to-event contexts, such as differences between suitably adjusted and standardised survival curves (Hernán & Robins, 2020; Robins, 1986). This has also increasingly been recommended in clinical contexts (Stensrud et al., 2019a; Stensrud & Hernán, 2020; Uno et al., 2014). In an RCT, differences between survival probabilities (marginally or in subgroups) can non-parametrically be estimated without a proportional hazards assumption at pre-specified times, such as 1-year, 5-year and 10-year survival probabilities, according to the context. As probabilities (i.e. risks), these parameters have the further advantage of complying with the subgroup mixable effects principle.
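As an illustration of such a contrast, the following sketch implements a plain Kaplan-Meier estimator and computes the difference in estimated survival probabilities between two arms at a pre-specified time. The tiny data set is entirely made up, the function is a minimal textbook version written by us (no variance estimation or handling of delayed entry), and independent censoring is assumed.

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of P(T > t).

    times  : follow-up times (event or censoring)
    events : 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    event_times = sorted({u for u, d in zip(times, events) if d == 1 and u <= t})
    for u in event_times:
        at_risk = sum(1 for s in times if s >= u)
        n_events = sum(1 for s, d in zip(times, events) if s == u and d == 1)
        survival *= 1 - n_events / at_risk
    return survival

# Made-up toy data (time in months; event indicator 1 = event, 0 = censored).
rx_times, rx_events = [2, 3, 5, 7, 8], [1, 0, 1, 0, 0]
c_times, c_events = [1, 2, 4, 6, 9], [1, 1, 1, 0, 0]

t_star = 6  # pre-specified time point
surv_rx = kaplan_meier(rx_times, rx_events, t_star)
surv_c = kaplan_meier(c_times, c_events, t_star)
print(surv_rx, surv_c, surv_rx - surv_c)
```

The same function can be applied within each biomarker subgroup to obtain subgroup-specific survival differences. In practice one would use an established survival analysis package and report confidence intervals alongside the point estimates.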