A Bayesian (meta‐)regression model for treatment effects on the risk difference scale

In clinical settings, the absolute risk reduction due to treatment that can be expected in a particular patient is of key interest. However, logistic regression, the default regression model for trials with a binary outcome, produces estimates of the effect of treatment measured as a difference in log odds. We explored options to estimate treatment effects directly as a difference in risk, specifically in the network meta‐analysis setting. We propose a novel Bayesian (meta‐)regression model for binary outcomes on the additive risk scale. The model allows treatment effects, covariate effects, interactions and variance parameters to be estimated directly on the linear scale of clinical interest. We compared effect estimates from this model to (1) a previously proposed additive risk model by Warn, Thompson and Spiegelhalter (“WTS model”) and (2) backtransforming the predictions from a logistic model to the natural scale after regression. The models were compared in a network meta‐analysis of 20 hepatitis C trials, as well as in the analysis of simulated single trial settings. The resulting estimates diverged, in particular for small sample sizes or true risks close to 0% or 100%. Researchers should be aware that modelling untransformed risk can yield very different results from default logistic models. The treatment effect in participants with such extreme predicted risks weighed more heavily on the overall treatment effect estimate from our proposed model compared to the WTS model. In our network meta‐analysis, this sensitivity of our proposed model was needed to detect all information in the data.

relation is often monotonic, but not necessarily so: in the case of effect modification, an increased effect on one scale may correspond to a decreased effect on another. 1 There are diverging views on the most appropriate scale of treatment effects in (meta-)analysis of binary outcomes. Arguments include mathematical properties, the presence of between-trial heterogeneity in treatment effects, clinical relevance, and ease of interpretation. There is typically less heterogeneity on ratio scales, [2][3][4] making effects on these scales more generalisable and, in the case of meta-analysis, more consistent. Some argue that an effect measure should be chosen based purely on the best fit to the data, or least between-trial heterogeneity in meta-analysis. 5 Others suggest that the choice of effect measure should be based on both the data and clinical considerations regarding the relation of treatment benefit to baseline risk. 2,6 Recently, guidelines for modelling between-patient HTE in RCT analyses were published in the form of the PATH (Predictive Approaches to Treatment effect Heterogeneity) statement. 7,8 The PATH authors recommend measuring benefits on the "natural" RD scale to guide clinical decision making. This is in line with the PRISMA-statement for reporting meta-analyses, 9 which proposes the absolute risk scale as the most easily interpretable.
In this paper, we focus on (meta-)regression methods to obtain treatment effects on the absolute risk scale: risk differences. One approach is to use logistic regression and transform treatment effects in the form of log odds ratios to risk differences if an estimate of baseline risk is available. This was recently done in a Bayesian network meta-analysis (NMA), 10 where both personalised risks and treatment effects were incorporated following the PATH risk modelling approach. The log odds scale has attractive mathematical properties, though it is not always suitable for situations with predicted risks very close to 0 or 1. This occurs, for example, in trial data where almost all patients experience the outcome in both treatment groups.
Recently, we performed a network meta-analysis where 100% success rates were an important feature of the data. In a logistic NMA model, the use of weakly informative priors was not sufficient to obtain convergence for all model parameters. In addition, all model parameters were defined on the logit scale, while the clinicians were only interested in the natural scale. This meant, for instance, that the between-study variance was not on the scale of interest to the clinicians. The same applied to any treatment-by-covariate interactions, which may influence the end results in different ways due to the noncollapsibility of the log odds. Since the original clinical focus of the NMA was on "personalised" treatment effects on the natural scale, we were motivated to explore options to model these directly.
This led to the development of an alternative model for benefit estimation directly on the natural risk scale. Any model that regresses on the natural risk scale must somehow keep the predicted risk within [0, 1]. Standard methods for the maximum likelihood (ML) estimation of binomial generalised linear (mixed) models (GL(M)M) with identity link have no such safety net and can fail to attain convergence. The Bayesian MCMC framework offers more flexibility to define bounds on the linear predictor at each iteration. Warn, Thompson and Spiegelhalter (WTS) bound the risk parameter domain using min() and max() functions in the meta-analysis of aggregated binary outcomes. 11 Their method has been applied in a few meta-analyses [12][13][14] and is referenced in meta-analysis guidelines. 15 It has been developed further for the analysis of cluster-randomised trials. 16 However, the model runs into difficulty with empirical success precentages close to 0 or 100%. As a remedy, WTS propose a small continuity correction on the aggregated data. It is not immediately obvious how such a continuity correction may be applied to individual patient data (IPD) in HTE modelling.
We aimed to develop an improved absolute risk model that is suitable for IPD MA and NMA, and that can deal with extreme risk estimates. We hereto propose an alternative Bayesian (meta-)regression model for binary outcomes on the natural risk scale. The model has a true identity link function and it deals well with empirical success in (almost) all patients of a trial arm. Our proposed model, as well as the logistic and WTS model are defined formally in Section 2. We compare the models' resulting estimates in an IPD NMA of the example dataset that motivated this research (Section 3). To further investigate the differences between the models, we examine simulated scenarios (Section 4). We then discuss our results and give recommendations in (Section 5).

Data setting
We will define all models very generally, so that they can be applied to a network meta-analysis. Simpler forms can be applied in direct pairwise meta-analysis and to the analysis of single trials. Here, assume we have data from n s studies, with a total of n p participants. In total, n t treatments are compared, though not all n s studies necessarily have arms for all n t treatments. For each study j ∈ {1, … , n s }, r j ∈ {1, … , n t } is defined as the study-specific reference treatment. For each participant, we assume to have observed a binary outcome and possibly covariate values of interest. With y ijk , we denote the outcome for participant i ∈ {1, … , n p }, who participated in study j ∈ {1, … , n s } and was given treatment k ∈ {1, … , n t }. In our notation above and model definitions below, we assume that individual participant data (IPD) are available. The models are easily adjusted to aggregate data (AD) by replacing the individual binary outcome y ijk with y jk , the aggregate binomial outcome in a trial arm that received treatment k and was part of study j.

NMA using generalised linear models
In all models below, y ijk is assumed to be randomly sampled from a binomial distribution with size 1 and success probability p ijk : y ijk ∼ Binomial(1, p ijk ), and p ijk = g −1 ( j + j,r j k ), where j,r j k = 0 for k = r j . Here, g is the link function, which is the logit or the identity in this paper. The regression equation for y ijk contains study-specific regression parameters: • j : the intercept of study j, representing baseline risk in models with identity link or baseline log odds in logistic models for study j under the reference treatment; • j,r j k : the study-specific treatment effect of treatment k compared to the reference treatment r j .
In the case of meta-analysis of multiple trials, study-specific treatment effects may be combined into overall effects in two ways. They can be assumed identical across studies and equal to the overall effect ( j,r j k = d r j k ), which is often referred to as a common effects (CE) meta-analysis model. Alternatively, study-specific effects may be assumed to be an exchangeable sample from a common distribution, usually Gaussian. In this case, the effects are independent conditional on their mean and between-study variance: j,r j k ∼  (d r j k , 2 ). In frequentist terms, this type of model is usually called a random effects (RE) model. Both the overall mean and between-study variance are defined on the scale of the link function. When the number of studies per comparison is small, as in the NMA of our example, it is convenient to assume the between-study variance 2 to be constant across treatment comparisons. When there are studies with more than two arms, the correlation between study arms can be accounted for by specifying a multivariate normal distribution with non-zero covariances. 15 The dependence structure of the study-specific intercepts j can be chosen freely. We chose to model the j as marginally independent across studies, so that the meta-analysis was fully stratified by study.
In the NMA, indirect comparisons of interest can be specified through consistency assumptions, again on the scale defined by the link function: In what follows, we denote the scale of the overall treatment effect (risk difference or difference in log odds) with a superscript (RD or log, respectively).

Obtaining risk difference estimates using logistic regression
Our objective is to estimate the risk difference that treatment k yields compared to the overall reference 1, denoted with RD 1k . The logistic model yields log odds ratios as treatment effects. Their transformation to the absolute scale requires 0 , the log odds (or logit risk) of the outcome for a patient who receives the reference treatment: The reference log odds 0 may be supplied to the model as a point estimate or in the form of a distribution. They may be estimated from internal data, for instance using the study-specific baseline log odds j . Alternatively, an estimate could be obtained from external data, such as observational studies. In our analysis of the Hepatitis C data, we used a point estimate based on all patients in the dataset that received the reference treatment. As an example, suppose we are analysing a single trial with two treatment arms. For each patient i, we record an outcome y i ∼ Binomial(1, p i ) and only one covariate x i ∈ {0, 1} that indicates the treatment assigned to patient i. An translation of (1) into JAGS model code is given below for this situation. Here, we have chosen a weakly informative  (0, 5 2 ) prior on the treatment effect on the log odds scale, and an uninformative  (0, 100 2 ) prior on the log odds of the outcome in the control group.

Modelling untransformed risk directly
In binomial models with an identity link, RD 1k is equal to the treatment effect d RD 1k , so no additional transformations are necessary. An important difficulty in the estimation of the additive model, though, is that the probabilities p ijk have unit interval support. The regression coefficients, by contrast, are not necessarily restricted to produce linear predictors within [0, 1]. This is not a problem with the logistic model, as the support of log odds comprises the entire real line. The Bayesian framework with MCMC estimation provides the flexibility to define additional constraints in the regression equations that keep the linear predictor in [0, 1].
We will discuss two Bayesian implementations of a binomial regression model with identity link. Both implementations essentially aim to fit the same theoretical additive risk model. When writing out the likelihood of a binomial GLM with identity link, an indicator function 1 {0<=linear predictor<=1} would make the bounds of the model parameters explicit in the expression. As such, the support for each model parameter depends on the data as well as on the other parameters. To fit the model, this indicator function needs to be translated into code. As the parameter bounds may vary for each observed patient, it is not generally possible to translate them into priors. Warn, Thompson and Spiegelhalter (WTS) 11,15 translate the parameter bounds into code using min and max functions (Section 2.4.1). We found that placing a logit function on both sides of the regression equation (Section 2.4.2) also translates the parameter bounds, with slightly different properties than the WTS approach.

The Warn, Thompson, and Spiegelhalter (WTS) model
One option is the model published by Warn, Thompson, and Spiegelhalter, 11,15 which we adapted slightly to IPD context: Equation (2) is akin to equation (4) in the WTS paper. 11 However, in equation (5) in WTS the analysis takes place on the treatment group level, and exactly two treatment arms are allowed per trial. Equation (2) above is a slightly generalised version with three index levels: i for individual patients, j for trials and k for treatments in the (N)MA. These indices allow for the application of the model to other settings than a two-treatment meta-analysis, such as NMA, IPD MA, and trial analysis with individual covariates. If a more elaborate linear predictor is of interest, its intercept may replace RD j in (2), and RD j,r j k may be replaced by the rest of the linear predictor. In Equation (2), the parameters RD j represent the "baseline" probability of the outcome in the reference arm of each trial j. This probability is by definition in the unit interval, hence priors with unit interval support are appropriate for RD j .
When such priors are used, Equation (2) reduces to p ijk = RD j for patients in the reference arm of trial j as follows: The parameters RD j,r j k in equation (2) represent the difference in risk of the outcome under treatment k versus the reference in trial j. As the total risk for each patient is in [0, 1], the risk difference is bounded between − RD j and 1 − RD j . Generally, in an additive risk model, the bounds for each parameter depend on the data as well as the other parameters. WTS translate this constraint to BUGS code using min and max functions: when RD j,r j k is such that the sum − RD j + RD j,r j k would drop below 0 or exceed 1, the second term of the sum is set to − RD j or 1 − RD j , respectively, so that the sum equals 0 or 1, respectively.
As such, the WTS model does not constrain the linear predictor RD j + RD j,r j k per se; instead, it constrains p ijk . The min and max functions set p ijk to 0 or 1 when the linear predictor reaches below 0 or above 1. Hence, (2) allows treatment effect estimates that yield a linear predictor outside [0, 1], though the predicted risks remain within [0, 1]. Neither p ijk nor the linear predictor is transformed to a nonlinear scale, so treatment effects from this model have an interpretation as risk differences. However, because of the min and max functions, p ijk and the linear predictor are not identical.
We illustrate (2) on the example trial of Section 2.3, where the data consists of a binary outcome y i and a binary treatment indicator x i for each patient i. JAGS model code for this simple case is provided below. As previously, we defined a weakly informative prior on the (risk difference) treatment effect and a uniform prior on the probability of the outcome in the control group.
When the WTS model is applied in a meta-analysis, the dependence structure of the within-study parameters within and between the studies may be defined in several ways, as explained in Section 2.2. In the examples in this paper, we assume RD j to be independent across studies, whereas RD j,r j k are assumed exchangeable with constant between-study variance 2 .

A proposed alternative
As an alternative to the WTS model, we also assume a linear risk model p ijk = RD j + RD j,r j k , where we propose using a logit transformation on both sides of the regression equation: The logit on the right-hand side of the equation constrains the linear predictor RD j + RD j,r j k 1 {k≠r j } within [0, 1]. Theoretically, this likelihood is only defined for parameter combinations that result in linear predictor values in [0, 1], as the logit of a number below 0 or above 1 is not on the real line. Therefore, no posterior mass is placed on parameter combinations resulting in probabilities outside [0,1]. In our implementation, this resulted in such parameter combinations not being sampled by the Gibbs samplers JAGS 17 and Stan, 18 as implemented in R 19 version 4.2.2. A transformation is also applied on the left-hand side of the equation. As the logit function is one-to-one, the transformation on both sides ensures an identity relation between p ijk and the linear predictor.
We revisit the example of a two-arm trial with binary outcome y i and a binary treatment indicator x i . Approach (3) can be translated into JAGS code as follows, using the same priors as in the previous section.
In the examples of (network) meta-analysis that follow, we assume RD j to be independent and RD j,r j k to be exchangeable across studies with constant between-study variance 2 . Other assumptions are of course possible (Section 2.2).

Priors and estimation
In the Bayesian context, the model is completed with prior distributions on the study-specific intercepts j , the overall effects d 1⋅ , and the between-study SD . For the intercepts and overall effects, (N)MA guidelines have recommended using vague priors such as the  (0, 100 2 )-distribution on log odds ratios. 10,15 A downside of vague default priors is that they can introduce an unrealistic amount of variance. For instance, a  (0, 100 2 ) prior encodes the assumption that the treatment effect is almost as likely to lie between 20 and 30 as it is to lie between 0 and 10. Yet treatment effects on the log odds scale exceeding [−5, 5] are quite extreme, corresponding to odds ratios exceeding [0.007,148]. Moreover, risk difference treatment effects mathematically cannot exceed [−1, 1]. Prior distributions with smaller variances encode this information. In addition, informative priors can help regularise the model. We therefore opt for normal priors with mean 0 for all models, and SD 5 for log odds and SD 1 for risk differences. These priors are very weakly informative, but do not add an unrealistic amount of variance. For each model, the posterior distribution of the model parameters can be estimated using MCMC, though there are some considerations to take into account. Firstly, the WTS model (2) and our proposed model (3) are nonstandard and require a large degree of flexibility in the model specification options of the MCMC method. In addition, some MCMC methods may be more appropriate to the shape of the parameter support and the correlation structure than others. In the additive models, the support of one parameter depends linearly on the data and on the other parameters, resulting in sharp edges of the likelihood. This can lead to problems with MCMC algorithms that use the shape of the posterior distribution to their advantage, such as the No U-Turn Sampler (NUTS) implemented in Stan. 18,20 We have been able to estimate several versions (single trial, meta-analysis, network meta-analysis) of the model in the Gibbs sampler JAGS, and some in Stan as well. Although we have not done a formal comparison, we noticed that the Metropolis-within-Gibbs sampler in JAGS 17 achieved convergence in all simulated scenarios of Section 4, while Stan did not. This may have to do with the "sharp edges" in the likelihood support of the additive risk models.
The convergence rate and autocorrelation of the chain will depend on the specific model and data structure at hand. Therefore, the parameters of the MCMC procedure (eg, the number of iterations, the number of burnin samples and the thinning rate) will need to be finetuned in each implementation to obtain decent effective sample sizes and to ensure convergence for the MCMC samples.

Other applications
As special cases of NMA, the models described above can be applied to the analysis of a single trial as well as MAs with direct comparisons. Continuous and categorical covariates could be added, as we will show in some of our simulations in Section 4. Interaction terms could also be added in single trial analyses and (N)MA, though in (N)MA their interpretation warrants caution. 21

NMA OF HEPATITIS C DATA
Our alternative model was motivated by in an individual participant data (IPD) network meta-analysis (NMA) of the largest connected component of the TherapySelector (TS) database. The database contains data from 20 studies, comprising 5,842 hepatitis C (HCV) patients and 15 therapy regimens (Table A1). The primary outcome was Sustained Virologic Response (SVR, binary) and four categorical covariates were available for all patients. In the past few years, antiviral agents have entered the market that are almost 100% successful in repressing the hepatitis C virus. As a result, the HCV data contains several trial arms with SVR proportions close to or equal to 1. This aspect of the data played an important role in our modelling process. The aim of the analysis was to estimate treatment effects with respect to overall reference treatment 1 (peginterferonplus ribavirin for 48 weeks), in the form of absolute risk differences, preferably conditional on patient covariates. To this end, we considered the three network meta-analysis approaches outlined above.

Descriptives
The connectivity within the network was generally low (Figure 1). The therapies 8, 12, 14, and 15 were given to relatively small numbers of patients, potentially leading to larger uncertainty about these therapies in our analysis. There were several trial arms where (almost) all patients attained SVR during the study (Table A2). All analyses were complete case, intention-to-treat. The study samples were heterogeneous with respect to HCV genotype and proportion of patients with cirrhosis (Supplement , Table S1).

Network meta-analysis
We applied the three models decribed in Section 2 in an NMA of the TS data. As in the single trial example described in Sections 2.3,2.4.1 and 2.4.2, we used uninformative priors on the baseline log odds or risk, while weakly informative priors were defined for the treatment effects. All models were estimated using JAGS accessed from R version 4.0.2 19 through R2jags. 22 After some finetuning of the MCMC parameters, we ran two parallel chains of 50 000 iterations each, with a burn-in period of 5000 samples. A thinning factor of 20 was used, resulting in a total of 4500 saved iterations for each model. For both additive risk models, we found no signs of lack of MCMC convergence in the Gelman-Rubin statistics (all ≤ 1.01) nor in the traceplots. The Markov Chain resulting from our proposed model did suffer from more autocorrelation than the WTS model as reflected in lower effective sample sizes (minimum ESS 210 vs 1000). For the logistic model, three risk difference estimates had Gelman-Rubin statistics above 1.01 (RD4, RD8, RD13), raising concerns about their convergence, despite the weakly informative priors. All effective sample sizes were below 500, with the minimum ESS at 72 (RD5), suggesting high autocorrelation within the chains. The posterior densities of the estimated risk differences for each treatment compared to the reference were summarised by their median and 95% central credible interval (Figure 2). Firstly, the credible intervals are generally wide, indicating large posterior uncertainty about the relative effectiveness of the treatments. Secondly, the estimated risk differences differ vastly depending on the model that was used. The between-model differences are most pronounced between the logistic model on the one hand and the two additive risk models on the other. Still, the two additive risk models disagree for almost half of the treatments.
The between-study SDs ( ) were estimated at 0. 19   Treatment effects compared to the overall reference F I G U R E 2 Graphical summaries of the posterior densities of the risk differences, after fitting the three models described in Section 2.
The horizontal lines in the plots span a central 95% posterior interval and the dots represent posterior medians. As the outcome (sustained virologic response) is beneficial, higher RD values mean higher treatment effectiveness.  The differences in between-study variance on the logit and natural scales of analysis, as well as the convergence problems of the logistic model may partly explain the variation in results across the models. To gain more insight into why and when the models obtain differing results, we simulated several scenarios, zooming in on less complicated settings in single trials and pairwise MA.

Simulated scenarios
To understand the differences in behaviour of the three models on the TS data, we studied their results in simpler simulated settings. As our simulations were motivated by a network meta-analysis of trials, our simulated scenarios were inspired by the setup of a single trial, with x 1i denoting the binary treatment indicator. An important characteristic of the Hepatitis C data is the perfect success rate in some trial arms. We therefore chose our true parameters such that the true treatment effect remained mostly constant on the scale of the link function, and increased the intercept, moving the predicted risks p i closer to 1 with every increase. Each simulation was performed with n = 1000, to limit the effect of sampling variability, as well as n = 100, to obtain an impression of the models' behavior in smaller samples. We simulated data from the following GLM, assuming i ∈ {1, 2, … , n}: The link function g was defined as either the logit or the identity. In addition, we varied the number of independent variables, the true parameters , , , the total sample size n, and the distribution of x.

One independent variable
We first examined scenarios with a single balanced treatment variable x i1 . In this case, the structure of the data is such that it can be fully captured in a crosstable, irrespective of the true link function. We therefore did not vary the true link and used the identity for ease of interpretation. The true risk difference between the treatment groups ( ) was first held fixed at 0.3, while varying the intercept ∈ {0.5, 0.6, 0.69}. In a second round of simulations, was held fixed at 0 while ∈ {0.8, 0.9, 0.99} varied.

Two independent variables
In all our simulations with two independent variables, we imitated the two-arm trial case with one balanced binary treatment variable and one covariate. The covariate x i2 was defined as x i2 ∼ Bin(n, 0.5) or x i2 ∼  [−1, 1], in both cases drawn independently of x i1 , mimicking treatment randomisation. For the binary and continous x i2 , respectively, we simulated data where g was the identity and data where g was the logit. The following parameter combinations were specified: We chose the regression parameters to create a comparable range (or comparable proximity to the edges of [0,1]) of true risks between the two link functions. We keep the effect of the covariates ( , ) constant and vary the intercepts to x i1 binary, x i2 binary, p i =0.39+0.3x i1 +0.3x i2 .
x i1 binary, x i2 binary, p i =expit(−1+x i1 +x i2 ) move risks over the [0,1] interval, from more in the middle towards the edges. Since the parameters are on different scales (logit vs natural) for the logistic and identity models, respectively, the parameter values that create this (comparable) range of true risks, differ, while the obtained risks are similar.

Simulation results
In each simulated scenario, we fitted the three models described in section 2 using JAGS from R (4.2.2). For each model fit, we ran two Markov chains of 10 000 iterations each, with a burn-in period of 2000 iterations and a thinning factor of 8. Compared to the real data NMA, the simulation scenarios were relatively simple in terms of model complexity and data structure. Therefore, we started with a (comparatively) shorter burnin period, fewer iterations and the default thinning factor of R2jags. This yielded satisfactory MCMC diagnostics, so we did not change this. For the resulting Markov chains, we assessed convergence based on the Gelman-Rubin statistic. We considered a chain converged when the statistic was equal to or below 1.01. In addition, we registered the time that it took for the models to fit. For full RMarkdown reports of the simulations, including further model diagnostics, we refer the reader to our GitHub repository https://github.com/ DThomassen/Bayesian-RD-Regression.
Our primary interest was in the estimated posterior distribution of the risk difference (RD) due to treatment, compared to the true risk difference. A "true" constant risk difference is only available in the scenarios where the true link function x i1 binary, x i2 binary, p i =0.39+0.3x i1 +0.3x i2 .
x i1 binary, x i2 binary, p i =expit(−1+x i1 +x i2 ) is the identity. Because we sampled each scenario only once, the data is subject to sampling variability with respect to the true risk difference. As additional comparators, we obtained the observed RD (ie, observed success proportion in treatment group minus observed success proportion in the control group) and the ordinary least squares (OLS) estimate of the risk difference.

One binary independent variable
In the single binary treatment variable scenarios, all three models provided very similar results in most cases (Figures 3  and 4). All Gelman-Rubin statistics were well below 1.01. In the smaller sample scenarios (Figure 3), the sampling variability was larger, resulting in higher posterior variance and larger deviations of the observed RD from the true RD. In all binary treatment scenarios, the fitting time of the logistic models was roughly 150% of the fitting time of the additive risk models. The latter were very similar in fitting time.
A difference between the additive risk model results arose in cases where n = 100 and the true risk for one of the treatment groups was very close to one ( Figure 3C,F). Here, the posterior RD resulting from the WTS model had large variance and most of its mass was above the observed risk difference. This results from the use of the min() and max() functions in the model. When the success rate in one treatment arm is 100%, any value of that exceeds the observed risk difference will result in a linear predictor larger than one. The min() function turns these linear predictors into predicted risks of 100% for that treatment arm, which is perfectly compatible with the data. For this reason, the Deviance Information Criterion (DIC) of the WTS model was similar to or even lower than the DIC of our proposed model in these scenarios. As a consequence, the WTS model can only learn from the data that the risk difference is likely equal to or x i1 binary, x i2~U [−1,1], p i =0.3+0.3x i1 +0.2x i2 . larger than the observed risk difference. Posterior mass is assigned to all risk difference values that exceed the observed risk difference, limited by the prior. The analogue holds for predicted risks of 0%.
In the scenario where n = 100 and the true risk was 99% in both trial arms ( Figure 3F), the logistic model differed slightly from the additive risk models. The logistic model showed extreme effects on the log odds scale and a sharp peak in density around the observed risk difference after transformation to the RD scale. For the single trial RD point estimate, this is not necessarily a problem. But such extreme effects on the log odds scale can influence the overall results of MA or NMA, as the effects are often combined (weighed, averaged, added and subtracted) on the log odds scale. This may have played a role in the Hepatitis C NMA.

Two independent variables
When we look at the results from the scenarios with one binary treatment variable and one additional independent variable ( Figures 5-8), the main differences appeared between the logistic model on one side and the two additive models on the other. There was a home advantage: where the true link was the logit (Figure panels D-F), the logistic models were better able to capture the treatment-covariate-outcome relation and where the true link was the identity (Figure panels A-C), the natural risk models performed better. This home advantage increased when variation of moved the true risks to the more nonlinear part of the logistic function. For the scenarios with x i2 ∼  [−1, 1], this was less clear, because when was lower, the minimum true risks got close to zero, whereas when was higher, the maximum true risks approached one (Figures 7 and 8). So in most scenarios, the "home advantage" of the models was visible on either the lower or higher end of x i2 . When the true link is the logit, the linear effect of treatment (ie, the first derivative of g −1 (linear predictor) with respect to the treatment variable) becomes x i1 binary, x i2~U [−1,1], p i =0.3+0.3x i1 +0.2x i2 . smaller as the predicted risks approach 0 or 1 from 0.5. This home adantage was also visible in the DICs of the models, with differences in DIC ranging from 1 to 19 in favor of the true model as the predicted risks got closer to 0 or 1.
In the smaller samples (n = 100) contrasts also emerged between the additive risk models (Figures 5 and 7). Our proposed model generally resulted in slightly more conservative treatment effect estimates that were closer to the observed risk difference. Where the maximum true risk approached one, the WTS model yielded larger posterior variance and assigned posterior mass to RD values corresponding to linear predictors exceeding one ( Figure 5C,F).
For binary x i2 , the two additive models were comparable in terms of fitting time, while the logistic model was roughly 15% slower. For uniform x i2 , the logistic model was the fastest, with the WTS model being about 30-60% slower, and our proposed model taking more than twice as long to fit.

Link with the data example
In scenarios with 100% success in one trial arm, we noted that the WTS model put posterior mass on all beta values above the observed RD, within the constraints of the prior. Our model assigned mass somewhat symmetrically around the observed RD. If such a trial were part of a meta-analysis, the differences in posterior variance could impact the weight that is given to the trial. In the Hepatitis C data, there is an example of such a meta-analysis with one study that has a 100% successful arm. To compare treatment 13 to 5, two studies are available: ASTRAL-2 and ASTRAL-3. The observed risk differences in these studies are 0.06 and 0.17, respectively. ASTRAL-2 has one trial arm where 133/133 patients reach SVR.

F I G U R E 9
Considerations in choosing between the three approaches discussed in this article.
When we analysed these trials individually, our proposed model as well as the logistic model resulted in a posterior mean close to 0.06 for ASTRAL-2, whereas the WTS model resulted in a flat posterior and a posterior mean around 0.4. The large posterior variance will lower the weight that is given to this trial in a meta-analysis. For ASTRAL-3, all three models give the same result.
When we look at the pairwise comparison within the network meta-analysis, we see that the three models yielded diverging summary estimates. Our model estimated the overall posterior RD around 0.12, so in between the estimates from the individual trials. The summary estimate from the WTS model was roughly 0.17, close to the estimate of the ASTRAL-3 trial. Apparently the weight of ASTRAL-2 in the MA was low. The summary estimate from the logistic model was around 0.22, exceeding the estimates from both individual trials, which is surprising. Factors such as heterogeneity and violations of the consistency assumption may be at play here. As the treatments 5 and 13 are very central in the network (Figure 1), these results may have seriously affected the other treatment comparisons in the NMA of Figure 2.

DISCUSSION
We compared three methods to obtain estimates of risk differences in the (meta-)analysis of trials: a logistic model with transformation, a natural risk model by WTS and our proposed natural risk model. In the NMA of the Hepatitis C data, the models' results were quite different. In most simulated single trial scenarios, however, all four models provided similar results. Still, differences emerged when the true risks approached 0 or 100%. Here, the sigmoidal shape of the logistic function is most nonlinear and the 'home advantage' of the logistic versus the additive risk models is most pronounced.
In some such cases, the logistic model became numerically unstable. This also occurred in our analysis of the Hepatitis C data.
In addition, our proposed model was more sensitive to what happened at the boundaries of the risk parameter support than the WTS model. This sensitivity can be both an advantage and a disadvantage. In the Hepatitis C data, there were several trials where much information about the treatment effect lay at these boundaries. The WTS model was not able to pick up this signal, whereas our proposed model was. In the case of a continuous variable with a few extreme values, the risk difference estimate for our model would be limited by these outliers: the linear predictor for the extreme data points should still lead to a linear predictor in the unit interval. The WTS model would be less affected by such outliers.
A limitation of our simulated scenarios is that they are by no means exhaustive. We primarily aimed to understand the diverging model results on the Hepatitis C data. Furthermore, we only explored scenarios where at least one of the models had the 'true' link function, while in reality, the structure of a dataset will almost never be exactly logistic or exactly linear. More systematic comparisons of these models are required to gain insight into their behaviour in a wider range of scenarios. To apply our proposed model in a different setting, we recommend examining simulated scenarios that are relevant to this setting.
We defined a first version of our proposed model in a trial and error process while working on a Bayesian NMA of the Hepatitis C data using JAGS. When developing it further, we decided to remain within this Bayesian MCMC framework, as it provided the flexibility to define our logit-logit model. We also tried fitting the model using Stan, but ran into problems with convergence. The angular shape of the likelihood likely plays a role here, as well as the specification of the priors. As the support of the likelihood for any additive risk model is data-dependent, it is not evident how to define a joint prior whose support perfectly fits the likelihood's support. We defined priors such that their support at least fully contains the likelihood support.
An often-cited downside of the Bayesian approach is the need for the specification of prior distributions, which introduces a degree of subjectivity. It is also possible to use the flexible MCMC framework to obtain maximum likelihood estimates. To this end, one can fit a "Bayesian" model with the desired likelihood function and a (possibly improper) uniform prior on the parameters. The posterior distribution is then proportional to the likelihood function and the posterior mode will be equal to the maximum likelihood estimate. In contrast to JAGS, Stan allows improper priors and may therefore be preferred for MCMC ML estimation.
On the other hand, the priors in a Bayesian model present a transparent way to restrict parameters and regularise the model. This is especially useful in the case of additive risk models, which can be computationally tricky. In our experience, regularising priors were necessary for convergence of the MCMC procedure in more complex settings like our network meta-analysis.
In our example NMA, we specified independent priors on the study-specific intercepts j , among j and j , and among the overall treatment effects d. Conditionally on the data, these parameters are not independent. The assumption of marginal independence is commonly made, for example in Reference 15, but some have argued that it may not always be realistic. 23 To salvage this, a multivariate prior can be employed, or the model can be reparametrised such that the assumption of marginal independence becomes more likely to hold. 24,25 The parametrisation of our proposed model could be optimised further, though this may come at the cost of the interpretability of the parameters.
The inclusion of covariates and interactions in the NMA model is necessary to obtain treatment effects conditional on patient covariates. 10 Attention is needed on separating within-trial from between-trial interactions. 21 Study-specific centering may be a pragmatic solution, which however complicates the interpretation of the summary treatment effect. Prediction of a treatment effect for a new patient that was not part of any study is an important direction for further investigation.
The compared models have their own properties and strengths. Which model is most suitable, depends on the specific modelling scenario. Considerations in choosing between the three approaches discussed above are summarised in Figure 9. Distinct criteria such as interpretability and best fit need not exclude each other. It is possible to model treatment effects on one scale for best fit and transform the results to another scale for interpretation. Mainly, a logistic model and an additive risk model imply different assumptions about the data. These concern, for instance, the scale on which the treatment-outcome relation is assumed to be linear and the scale on which interactions are presumed to take place. In network meta analysis, they also concern the scale on which consistency between the treatment effects across trials is assumed.
When one is interested in interactions or heterogeneity on the natural risk scale, an additive risk apprach can be applied. As we have discussed, there are at least two viable approaches: the WTS approach and our proposed approach. The choice between the two additive risk models depends on how sensitive the model should be to what happens at the extremes of the probability parameter support. We suggest comparing both approaches in a sensitivity analysis. Our proposed model is preferable when datapoints with predicted risk in the outer regions of the 0%-100% range carry important information about the parameters to be estimated.