Determining maximal achievable effect sizes of antidepressant therapies in placebo‐controlled trials

Antidepressants outperform placebo with an effect size of around 0.30. It has been suggested that effect sizes as high as 0.875 are necessary for a minimal clinically important difference. Whether such effect sizes are achievable in placebo‐controlled trials is unknown. Therefore, we aimed to assess what effect sizes are theoretically achievable in placebo‐controlled trials of antidepressants.


| INTRODUCTION
The magnitude of drug-placebo differences for antidepressant therapies has been extensively debated [1]. Some authors have argued that the statistically significant effects observed in clinical trials (an effect size of about 0.30 or an HDRS-17 difference of 2 points) are far below commonly championed cut-offs for clinical relevance, and that antidepressant therapies should therefore not be used [2][3][4][5]. In this vein, previous treatment guidelines from the National Institute for Health and Care Excellence (NICE) refrained from recommending a number of established treatments for major depression since their effect size (ES) as compared to placebo was below 0.50 [6]. Others have argued for even higher cut-offs. An analysis comparing scores on the Clinical Global Impression (CGI) and the 17-item Hamilton Depression Rating Scale (HDRS-17) [7] suggested that a CGI rating of minimally improved corresponds to an improvement of 7 points on the HDRS-17, or an ES of 0.875, hence promoting the perception that antidepressant therapies displaying treatment differences smaller than these lack clinical relevance [2][3][4][5]. While the cut-offs described above originate from within-patient observations (ie the difference between the baseline and endpoint rating), they have been interpreted as implying that HDRS-17 differences between treatments smaller than these are clinically imperceptible [2][3][4][5]. There are several problems with applying a metric derived from within-person change scores to between-treatment differences [1]. One issue that has received scarce attention is whether treatment-placebo differences of such magnitudes are at all achievable in antidepressant trials.
Given (i) that placebo-treated participants often display mean endpoint scores of less than 15 HDRS-17 points, with standard deviations (SDs) around 7.5 [7], (ii) that it is impossible to score less than zero, (iii) that healthy volunteers on average display HDRS-17 scores around 3 (SD 3.2) [8] and (iv) that dropout rates are often in the 30-40% range, with dropouts tending to display much higher average HDRS-17 scores than healthy volunteers or remitted depressed individuals, it is unclear whether between-treatment differences of 7 HDRS-17 points (or an ES of 0.875) are realistic targets.
The purpose of the present study was to determine the maximal effect size obtainable in placebo-controlled antidepressant trials. The analyses were based on HDRS-17 ratings from patients with major depression who had been treated with placebo in randomized controlled trials of selective serotonin reuptake inhibitors (SSRIs). Using the placebo population as baseline, we created a duplicate population for which we simulated treatment with effective antidepressant therapy. Effect sizes achieved under different assumptions of efficacy were then calculated by comparing simulated antidepressant treatment to real-world placebo outcomes. We also assessed how treatment non-completion and time to maximum effect impacted achievable effect sizes.

| Aims of the study
To provide an upper limit on the maximum achievable effect size for a theoretically perfect antidepressant and to assess what effect sizes can be expected in placebo-controlled trials of slow-acting antidepressants relative to their true underlying efficacy.

| METHODS
This is a simulation study based on outcome data from actual placebo-treated subjects who have participated in acute-phase trials of antidepressants for major depression. Using the placebo data as the control group, we investigate maximum achievable effect sizes for simulated antidepressant treatments with varying rates of efficacy.

| Data acquisition
The data set consisted of all placebo-treated patients who participated in acute-phase placebo-controlled trials in adult major depression during the development programs for citalopram, paroxetine and sertraline. These data are available to us for offline use due to data sharing agreements with the respective manufacturers. Data can be requested from these manufacturers and/or from clinical data sharing websites like ClinicalStudyDataRequest.com and Vivli.org. Participants receiving active treatment were included only to inform rates of treatment non-completion. Trials had to utilize the HDRS-17 scale and have a scheduled evaluation at week 6 to be eligible for inclusion. Additional details on the study population have been presented previously [9].

| Analyses
The last available observation up until the week 6 evaluation for all placebo-treated participants was included. This population served both as the control group (for the effect of placebo) and as baseline for simulated antidepressant treatment. For each simulation, the placebo population was duplicated and the effects of simulated antidepressant treatment were modelled onto the duplicated population. We first assessed an improvement-type effect of antidepressant treatment where patients undergoing simulated treatment received on average a 50% improvement over the endpoint score of their placebo counterparts. Specifically, endpoint HDRS-17 scores for patients undergoing simulated treatment were the endpoint scores of the corresponding patients in the placebo distribution multiplied by an improvement factor. Improvement factors were modelled using a normal distribution with a mean of 0.5 and a standard deviation of 0.1. To preclude changes too small or too large to fit with the clinical perception of response, improvement factors were bounded at 0.2 and 0.8. We considered models where 0%, 20%, 40%, 60%, 80% and 100% of participants were responsive to treatment, defined as mentioned above. In the first run, all patients were modelled as achieving this improvement regardless of time under treatment as long as the patient had at least one post-baseline HDRS-17 evaluation (since otherwise improvement is impossible to detect). Patients who only had a baseline observation retained the score of their placebo duplicate. In a second run, we accounted for dropouts by modelling improvement as linearly related to time under treatment. A participant who was randomized to obtain a 60% decrease in endpoint score (an improvement factor of 0.4) had they completed 6 weeks of treatment would thus have their score decreased by 10% if their last visit was at week 1, by 20% if their last visit was at week 2, etc.
Dropout rates for simulated treatment were taken from the active treatment arms of the included trials.
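The improvement-type procedure described above can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' R code. Several details are assumptions made for the example: the bounds on the improvement factor are implemented by redrawing out-of-range values, the effect size is computed as a mean difference over a pooled SD, and the placebo arm is a synthetic stand-in generated on the spot. All function names are hypothetical.

```python
import random
import statistics

WEEKS_FOR_FULL_EFFECT = 6  # trial length used in the primary analyses


def draw_improvement_factor(rng, mean=0.5, sd=0.1, low=0.2, high=0.8):
    """Draw a factor from N(mean, sd), redrawing until it lies in [low, high].

    Assumption: the text says factors were "bounded"; redrawing is one way to do so.
    """
    while True:
        factor = rng.gauss(mean, sd)
        if low <= factor <= high:
            return factor


def scaled_reduction(factor, last_week, linear_in_time):
    """Fractional score reduction applied to one patient."""
    full_reduction = 1.0 - factor  # a factor of 0.4 means a 60% reduction
    if not linear_in_time:
        return full_reduction
    weeks = min(last_week, WEEKS_FOR_FULL_EFFECT)
    return full_reduction * weeks / WEEKS_FOR_FULL_EFFECT


def simulate_improvement(placebo, responder_rate, rng, linear_in_time=False):
    """placebo: list of (endpoint_score, last_visit_week) pairs."""
    treated = []
    for score, last_week in placebo:
        responds = last_week > 0 and rng.random() < responder_rate
        if not responds:
            treated.append(score)  # baseline-only or non-responder keeps placebo score
            continue
        factor = draw_improvement_factor(rng)
        reduction = scaled_reduction(factor, last_week, linear_in_time)
        treated.append(score * (1.0 - reduction))
    return treated


def cohens_d(control, treated):
    """Mean difference over the pooled SD (equal group sizes; assumed ES definition)."""
    pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
    return (statistics.mean(control) - statistics.mean(treated)) / pooled_var ** 0.5


# Synthetic stand-in for the placebo arm; NOT the trial data.
rng = random.Random(0)
placebo = [
    (max(0.0, rng.gauss(15.5, 7.7)), rng.choice([0, 1, 2, 3, 4, 6, 6, 6, 6, 6]))
    for _ in range(2000)
]
placebo_scores = [score for score, _ in placebo]

for rate in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    treated = simulate_improvement(placebo, rate, rng, linear_in_time=True)
    print(f"responders {rate:.0%}: ES = {cohens_d(placebo_scores, treated):.2f}")
```

Because the placebo scores here are synthetic, the printed effect sizes only illustrate the mechanics and should not be compared with the reported results.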
In the second simulation, we assumed a remission-type effect of the simulated antidepressant treatment. Patients undergoing simulated treatment were thus given a new endpoint score drawn from a distribution approximating the HDRS-17 scores that can be expected in healthy volunteers. The score distribution for remitted patients was generated using an exponential distribution with a rate parameter of 0.3125, chosen to correspond to the mean and SD of 3.2 in healthy volunteers reported by Zimmermann and co-workers [8]. Since some patients receiving placebo displayed low HDRS-17 scores at endpoint, we introduced an additional criterion according to which simulated remitters whose new endpoint score was higher than the endpoint score of their placebo duplicate retained the placebo endpoint score (ie simulated treatment could never increase endpoint scores and thereby make it more difficult to achieve treatment-placebo separation). Using the same strategy as for the simulation of improvement-type mechanisms, we first simulated runs where all patients with at least one post-baseline evaluation remitted, defined as mentioned above, and then considered models in which the chance of remission was linearly related to time under treatment. For example, in a simulation where 60% of participants undergoing simulated treatment obtained remission, a subject dropping out after week 1 would have a 10% chance of remission, a patient dropping out after week 2 would have a 20% chance of remission, etc.
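A minimal Python sketch of the remission-type rule may make the steps concrete. It is not the authors' code. The function names and the example patients are hypothetical, and scores are left continuous, which is an assumption (the text does not say whether drawn scores were rounded to whole HDRS-17 points).

```python
import random

HEALTHY_RATE = 0.3125  # exponential rate; mean = SD = 1 / 0.3125 = 3.2
WEEKS_FOR_FULL_EFFECT = 6


def remission_probability(remission_rate, last_week, linear_in_time):
    """Chance that one simulated patient remits."""
    if last_week == 0:
        return 0.0  # baseline-only patients keep their placebo score
    if not linear_in_time:
        return remission_rate
    weeks = min(last_week, WEEKS_FOR_FULL_EFFECT)
    return remission_rate * weeks / WEEKS_FOR_FULL_EFFECT


def simulate_remission(placebo, remission_rate, rng, linear_in_time=False):
    """placebo: list of (endpoint_score, last_visit_week) pairs."""
    treated = []
    for score, last_week in placebo:
        p = remission_probability(remission_rate, last_week, linear_in_time)
        if rng.random() < p:
            healthy_score = rng.expovariate(HEALTHY_RATE)
            # Simulated treatment may never raise the score above placebo.
            treated.append(min(score, healthy_score))
        else:
            treated.append(score)
    return treated


# Hypothetical (endpoint score, last visit week) pairs, for illustration only.
example = [(22.0, 6), (9.0, 6), (1.0, 6), (18.0, 1), (20.0, 0)]
print(simulate_remission(example, 0.6, random.Random(0), linear_in_time=True))
```

Taking the minimum of the drawn score and the placebo score implements the criterion that remitters with a higher drawn score retain the endpoint score of their placebo duplicate.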

| Sensitivity analyses
In order to explore the influence of time under treatment and dropout rates, we ran simulations using placebo HDRS-17 scores obtained from the week one evaluation and from the week 12 evaluation (in trials with such duration), respectively. For week one data, only dropouts prior to any post-baseline visit are relevant and there were thus no analyses assessing the impact of dropouts. For week 12 data, we first ran the simulations using dropout figures from the full population, with maximum effect being achieved if the patient received treatment for ≥6 weeks. This was done to isolate the effect of lower endpoint scores in the placebo group in longer duration trials so that the impact of this factor could be assessed. To assess the impact of an antidepressant effect that was slower to develop than in the primary analyses, we ran the same analyses using dropout figures from 12-week trials, with the treatment effect being linearly related to time and a full effect only being achieved if the patient remained in the study for ≥12 weeks.
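The difference between an effect that is complete after 6 weeks and one that needs 12 weeks can be expressed as a single scaling function. The Python sketch below is illustrative only. The name `effect_fraction` is hypothetical, and the linear ramp before the full-effect week is taken from the description of the primary dropout analyses and assumed to apply here as well.

```python
def effect_fraction(last_week, weeks_for_full_effect):
    """Share of the full treatment effect realized at the last visit."""
    if weeks_for_full_effect <= 0:
        raise ValueError("weeks_for_full_effect must be positive")
    return min(last_week, weeks_for_full_effect) / weeks_for_full_effect


# Compare a 6-week and a 12-week time to full effect at selected last visits.
for week in (0, 1, 2, 4, 6, 8, 12):
    six = effect_fraction(week, 6)
    twelve = effect_fraction(week, 12)
    print(f"last visit week {week:2d}: 6-week model {six:.2f}, 12-week model {twelve:.2f}")
```

Under the 12-week model, a participant whose last visit is at week 6 receives only half of the full effect, which is why slower-acting treatments yield smaller achievable effect sizes for the same dropout pattern.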
All analyses were conducted in R version 3.6.3.

| RESULTS
Characteristics of the included studies are displayed in Table 1. Observations were available for 7170 participants, 2201 of whom had been allocated to placebo. The overall dropout rate for SSRI-treated participants was 31.9%, with the last available evaluation being the baseline evaluation for 4.6% (230) of cases, the week 1 evaluation for 8.1% (400), the week 2 evaluation for 5.7% (285), the week 3 evaluation for 4.9% (243), the week 4 evaluation for 8.4% (416) and the week 5 evaluation for 0.2% (9). The overall dropout rate was not significantly different (p = .483) from that of placebo-treated participants (32.7%).
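The reported counts can be turned into an empirical distribution of last visits from which dropout times could be sampled. The Python sketch below is an illustration under two assumptions that are not stated in the text: that all 7170 − 2201 non-placebo participants were SSRI-treated, and that every participant not listed as a dropout had a week 6 evaluation. How the authors actually assigned dropout times is not described, and the names used are hypothetical.

```python
import random

N_TOTAL, N_PLACEBO = 7170, 2201
N_ACTIVE = N_TOTAL - N_PLACEBO  # assumption: all non-placebo participants got an SSRI

# Number of SSRI-treated dropouts by week of last available evaluation (from the text).
DROPOUTS_BY_LAST_VISIT = {0: 230, 1: 400, 2: 285, 3: 243, 4: 416, 5: 9}

n_dropouts = sum(DROPOUTS_BY_LAST_VISIT.values())
n_completers = N_ACTIVE - n_dropouts  # assumption: all others were evaluated at week 6
dropout_rate = n_dropouts / N_ACTIVE


def sample_last_visits(rng, k):
    """Sample k last-visit weeks from the empirical distribution."""
    weeks = list(DROPOUTS_BY_LAST_VISIT) + [6]
    weights = list(DROPOUTS_BY_LAST_VISIT.values()) + [n_completers]
    return rng.choices(weeks, weights=weights, k=k)


print(f"dropouts: {n_dropouts} of {N_ACTIVE} ({dropout_rate:.1%})")
print(sample_last_visits(random.Random(0), 10))
```

The total recomputed this way agrees with the reported overall dropout rate of 31.9%.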

| Improvement- and remission-type mechanisms
The mean placebo endpoint HDRS-17 rating for trials with a week 6 evaluation was 15.5 with a standard deviation of 7.73. Figures 1A-F detail the results of simulating an improvement-type antidepressant effect (ie a 50% additional reduction over that obtained with placebo) with full effect of treatment being achieved as long as the individual has at least one post-baseline visit. As shown in Figure 1E-F, a rate of improvement above 80% would be needed to surpass an ES of 0.875 and a rate close to 100% would be needed to reach a mean difference of 7 HDRS-17 points. Figures 2A-F show the same improvement-type analysis but with full effect of treatment being achieved only if an individual stays in the study for six weeks. Under these assumptions, models where 100% of study completers experienced a 50% larger improvement than that obtained with placebo resulted in an ES of 0.76 and a mean difference of 5.55 (Figure 2F). Figures 3 and 4 detail the results of the corresponding simulations using a remission-type antidepressant effect. In the case of no dropouts, 100% of subjects remitting resulted in an ES of 1.75 and a mean difference of 11.6 (Figure 3F), whereas a 60% remission rate yielded an ES of 0.85 and a mean difference of 6.96 (Figure 3D), that is an outcome close to what has been suggested as cut-off for a clinically relevant effect. If a full effect of treatment was only achieved in participants who did not drop out prior to week 6, a remission rate of 80% would be required to obtain an ES of 0.85 and a mean difference of 7.04 (Figure 4E).
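The relation between the reported effect sizes and mean differences can be inspected with a short calculation. The Python sketch below assumes that the ES is a mean difference divided by a pooled SD, which the text does not state explicitly, and back-calculates the SD implied by each reported pair. The helper name `implied_sd` is hypothetical.

```python
# (ES, mean difference in HDRS-17 points) pairs as reported in the text.
REPORTED = {
    "improvement-type, Figure 2F": (0.76, 5.55),
    "remission-type, Figure 3F": (1.75, 11.6),
    "remission-type, Figure 3D": (0.85, 6.96),
    "remission-type, Figure 4E": (0.85, 7.04),
}


def implied_sd(effect_size, mean_difference):
    """SD implied by ES = mean difference / SD (assumed definition)."""
    return mean_difference / effect_size


for label, (es, diff) in REPORTED.items():
    print(f"{label}: implied SD = {implied_sd(es, diff):.2f}")

# The suggested cut-offs (ES 0.875, 7 points) correspond to an SD of 8 points.
print(f"cut-off pair: implied SD = {implied_sd(0.875, 7.0):.2f}")
```

Under this assumption the implied SD varies between scenarios, so the ES cut-off and the mean-difference cut-off are not necessarily crossed at the same rate of improvement or remission.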

| Sensitivity analyses
The mean HDRS-17 score at week one for placebo-treated participants was 19.8 with an SD of 5.53. Figures S1 and S2 show the results of simulating improvement- and remission-type effects fully realized at this timepoint. Due to the lower SD, an ES of 0.93 is achieved already at a 60% rate of improvement (Figure S1D), whereas a mean difference of >7 is first achieved at an 80% rate of improvement. For remission-type mechanisms, an ES of 0.81 is reached at 40% remission (Figure S2C) with a mean difference of 6.26. At week 12, the mean endpoint score in the placebo-treated group was 13.9 with an SD of 7.99. Figures S3 and S4 show simulations of improvement- and remission-type mechanisms with full efficacy if the patient had an evaluation at or after week 6. Under these conditions, the improvement-type mechanism achieved a maximum ES of 0.69 with a mean difference of 5.06 (Figure S3F) and the remission-type mechanism maxed out at an ES of 0.98 with a mean difference of 7.65 (Figure S4F). Figures S5 and S6 detail the results of simulations with an antidepressant effect that is slower to develop (reaching full strength after 12 weeks of treatment). The maximal ES for an improvement-type mechanism under these conditions was 0.59 and the maximum mean difference was 4.48 (Figure S5F). The corresponding remission-type models yielded a maximum ES of 0.83 with a mean difference of 6.76 (Figure S6F).

| DISCUSSION
The primary finding of this study is that previously suggested criteria for the lower limit of a clinically relevant difference between an antidepressant and placebo, that is an ES of 0.875 or a mean difference of 7 HDRS-17 points, are likely above what is possible to achieve with a slow-acting antidepressant therapy given the current design of antidepressant trials. Even under the unrealistic assumption of full efficacy in study non-completers, the simulations show that 80% of subjects on active treatment must display a 50% greater improvement than if treated with placebo, or that 60% must display remission, for these cut-offs to be surpassed more often than not. Given the dropout rate normally seen in antidepressant trials, and assuming the effect of active treatment to be only partial in non-completers, the maximum achievable ES is 0.76 for improvement-type simulations and 1.10 for remission-type models.
The assumption of full efficacy also in treatment non-completers, which was used in some simulations, is obviously unrealistic for slow-acting antidepressant therapies. Similarly, simulating treatment effects in dropouts as being linearly related to time is likely overly optimistic, as there is very little separation on the HDRS-17 sum-score during the first two weeks of treatment with, for example, an SSRI [10]. While it is likely more realistic to model treatment effects as being, for example, quadratically related to time under treatment, these conservative design choices were made in order not to overstate the difficulties in achieving large ESs. For the same reason, we did not account for other factors which may make treatment-placebo separation difficult to achieve, including (i) that trial participants in remission may display substantially higher average HDRS-17 scores than healthy volunteers [8][11][12], (ii) that the HDRS-17 captures symptoms which may be elicited as treatment-emergent side effects [7][13] and (iii) that some participants may in reality respond worse to antidepressant therapy than to placebo, whereas no participant did so in our simulations. Also in other regards, we applied a conservative approach; we hence did not consider placebo samples other than the intention-to-treat population and did not conduct analyses stratified by initial severity, since the thereby obtained subpopulations (ie completers and/or mildly depressed subjects) would have displayed lower endpoint scores than the full population, thus making it more difficult to achieve treatment-placebo separation under the simulated scenarios. It might be argued that our results reveal that most depressive episodes will be significantly improved by 6 weeks also when treated with placebo and that antidepressant interventions are thus of limited value, at least in the acute phase.
Such an interpretation is, however, overreliant on mean scores, neglecting that treatment outcomes vary widely also among placebo-treated patients. While a significant fraction of these display very low depression scores at endpoint, there is also a sizeable fraction who do not respond. Reducing the duration and severity of depressive episodes for these patients, as well as decreasing the risk of relapse, should remain a priority [14][15][16]. In this context, it should also be considered that few other treatments in medicine would be able to surpass the effect size cut-offs sometimes suggested as adequate for depression [17][18]. That antidepressants in fact seem to perform at or above average in reducing core depressive symptoms is noteworthy [7][17], especially given that trials in other fields should often be less marred by issues such as poor compliance, inclusion of misdiagnosed subjects and inexact outcome measures, which are all factors that tend to reduce the apparent effect of active treatment in depression trials (but were not addressed in the present analyses).
We also assessed the relative importance of the two major factors contributing to the difficulty in achieving a high degree of treatment-placebo separation in depression trials, that is the large and variable symptom reduction in subjects receiving placebo [19] (which can be the result of, for example, spontaneous remission, regression towards the mean or an actual placebo effect), and the high rate of dropouts [20][21]. To this end, the impact of the placebo response was inferred by contrasting different time points while holding the influence of dropouts constant. When contrasting simulations where the full effect of treatment was at hand in all patients who had at least one post-baseline observation, the maximum ES for improvement-type simulations changed from 1.84 using week 1 data to 1.08 using week 6 data (Figure 1F, Figure S1F), the corresponding figures being 2.90 and 1.75 for remission-type simulations (Figure 3F, Figure S2F). Similar patterns were seen when contrasting simulations utilizing outcome data at week 6 and week 12 (Figures 2 and 4, Figures S3 and S4). Needless to say, the major contributor to these differences is the mean placebo endpoint score being much higher at week 1 (19.8) than at week 6 (15.5), and slightly higher at week 6 than at week 12 (13.9); hence, there is more room for a treatment effect in earlier evaluations. Similarly, the impact of dropout can be assessed by comparing models that handle treatment non-completion differently while holding placebo responses constant. Contrasting models where the size of the simulated treatment effect was linearly related to time under treatment to those where full efficacy was assumed to be achieved in all patients showed this constraint to lower the maximally achievable ES at week 6 from 1.08 (Figure 1F) to 0.76 (Figure 2F) for improvement-type models, the corresponding decrease for remission-type models being a change from 1.75 (Figure 3F) to 1.10 (Figure 4F).
Similar patterns were seen when contrasting models where the full effect of treatment was realized only after 12 weeks of treatment to those where full efficacy was achieved after 6 weeks (Figures S3 and S6).
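The two contrasts above can be put on a common scale by expressing each as a relative reduction in maximum ES. The Python sketch below uses only the ES values quoted in the text; the relative-reduction summary itself is an illustration and not an analysis reported by the authors, and the names are hypothetical.

```python
# (maximum ES before, maximum ES after) for each contrast, as quoted in the text.
CONTRASTS = {
    "placebo response, improvement-type (week 1 vs week 6)": (1.84, 1.08),
    "placebo response, remission-type (week 1 vs week 6)": (2.90, 1.75),
    "dropout handling, improvement-type (full vs linear)": (1.08, 0.76),
    "dropout handling, remission-type (full vs linear)": (1.75, 1.10),
}


def relative_reduction(before, after):
    """Fractional drop in maximum achievable ES."""
    return (before - after) / before


for label, (before, after) in CONTRASTS.items():
    drop = relative_reduction(before, after)
    print(f"{label}: {before:.2f} -> {after:.2f} ({drop:.1%} lower)")
```

On these figures, moving from week 1 to week 6 placebo data lowers the maximum ES by roughly 40%, and modelling the effect as linearly related to time under treatment lowers it by roughly 30% to 37%.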
Though the debate on clinical relevance has primarily revolved around antidepressants, the difficulties in achieving meaningful separation between treatment arms should apply to all slow-acting therapies for depression (eg transcranial direct-current stimulation and psychotherapies) [22][23][24] and be especially pronounced when one attempts to detect differences between two treatments both of which may display some efficacy [10][25]. A notable observation is that SDs were often unrelated to the proportion of treatment responders in improvement-type models although endpoint score distributions for patients receiving simulated treatment were clearly distinguishable from those of patients receiving placebo (eg Figure 1A & C and 2A & D). The observation that comparatively simple transformations can leave standard deviations largely unaffected is in line with a recent publication arguing that meta-analytical results of variability ratios not significantly different from 1 do not imply that there is no treatment effect heterogeneity [26].
This study has several limitations. First, it is a simulation study considering a limited range of underlying treatment mechanisms. While this should not impact the general