Improving Access to Psychological Therapies (IAPT) in the United Kingdom: A systematic review and meta-analysis of 10-years of practice-based evidence

Objectives. Improving Access to Psychological Therapies (IAPT) is a national-level dissemination programme for provision of evidence-based psychological treatments for anxiety and depression in the United Kingdom. This paper sought to review and meta-analyse practice-based evidence arising from the programme. Design. A pre-registered (CRD42018114796) systematic review and meta-analysis. Methods. A random effects meta-analysis was performed only on the practice-based IAPT studies (i.e. excluding the clinical trials). Subgroup analyses examined the potential influence of particular methodologies, treatments, populations, and target conditions. Sensitivity analyses investigated potential sources of heterogeneity and bias. Results. The systematic review identified N = 60 studies, with N = 47 studies suitable for meta-analysis. The primary meta-analysis showed large pre-post treatment effect sizes for depression (d = 0.87, 95% CI [0.78-0.96], p < .0001) and anxiety (d = 0.88, 95% CI [0.79-0.97], p < .0001), and a moderate effect on functional impairment (d = 0.55, 95% CI [0.48-0.61], p < .0001). The methodological features of studies influenced effect sizes (ESs), for example whether intention-to-treat or completer analyses were employed.


Practitioner points
IAPT interventions are associated with large pre-post treatment effect sizes in depression and anxiety measures. IAPT interventions are associated with moderate treatment effect sizes with regard to work and social adjustment. Reducing dropout and preventing post-treatment relapse through the offer of follow-up support are important areas for future development.

In the United Kingdom, the National Institute for Health and Care Excellence (NICE) guidelines recommend evidence-based psychological interventions for common mental health problems organized in a stepped care model (NICE, 2011). These guidelines were implemented at a national level in 2008 in England through the Improving Access to Psychological Therapies (IAPT) programme. Historically, IAPT was founded on the premise that many patients receiving an evidence-based psychological therapy would likely recover and return to work, therefore reducing the welfare benefit cost burden (Clark, 2011). This national implementation was supported by positive results from two initial IAPT 'demonstration' sites which provided evidence of the feasibility and effectiveness of the IAPT model (Clark et al., 2009). Ten years later, there are over 200 IAPT services across England, making this the largest publicly funded and systematic implementation of evidence-based psychological care in the world. The IAPT programme has subsequently served as a model for the development of similar systems in other countries such as Australia (Cromarty, Drummond, Francis, Watson, & Battersby, 2016), Canada (Naeem, Pikard, Rao, Ayub, & Munshi, 2017), Norway (Knapstad, Nordgreen, & Smith, 2018), and Japan (Kobori et al., 2014). IAPT services have three distinctive features: a stepped care model of service provision, the implementation of evidence-based, highly standardized, and protocol-driven treatments, and the systematic use of routine outcome monitoring.
To date, approximately 7.5 million referrals have been received by IAPT services since national statistics were introduced in 2012, of whom approximately 4.9 million received psychological treatment. National statistical reports indicate that the IAPT programme now receives around 1.25 million annual referrals. IAPT services deliver psychological treatments following stepped care principles (Bower & Gilbody, 2005), an organizational model supported by evidence from controlled trials in which progressively intensive psychological treatments are made available to patients according to need. Patients are initially offered brief (≤8 sessions), low-cost, and low-intensity guided self-help (GSH) based on principles of cognitive behavioural therapy (CBT). GSH is psychoeducational in nature and can be delivered over the telephone, via computerized CBT, in large groups, or in a one-to-one format. GSH in IAPT services is delivered by psychological well-being practitioners (PWPs), who are trained and supervised to deliver highly standardized, evidence-based interventions guided by a national competency framework and associated assessment and treatment competency measures (Kellett et al., 2020). Patients who have not benefited from GSH are stepped up to high-intensity psychological therapies, which involve formal CBT and other therapies such as person-centred experiential counselling, interpersonal psychotherapy (IPT), dynamic interpersonal therapy (DIT), eye movement desensitization and reprocessing (EMDR), and couples counselling for depression. High-intensity interventions are delivered following evidence-based treatment protocols, are lengthier (i.e. typically around 16-20 sessions), and are mostly delivered one-to-one, in person. These interventions are delivered by qualified therapists, under weekly clinical supervision to ensure fidelity to associated competency frameworks (e.g., Roth & Fonagy, 2005).
IAPT services operate a routine outcome monitoring system in which patients complete a series of standardized questionnaires on a session-to-session basis, including self-reported measures of depression (Patient Health Questionnaire-9; PHQ-9; Kroenke, Spitzer, & Williams, 2001), anxiety (Generalized Anxiety Disorder Scale-7; GAD-7; Spitzer, Kroenke, Williams, & Lowe, 2006), and functional impairment (Work and Social Adjustment Scale; WSAS; Mundt, Marks, Shear, & Greist, 2002). Other disorder-specific questionnaires are also applied when relevant to the patient's problems (Mental Health Policy Team, 2018). This routine outcome monitoring system has enabled the large-scale evaluation of IAPT services around the country, yielding insights into the factors that distinguish more and less effective services (e.g., see Clark et al., 2018; Gyani, Shafran, Layard, & Clark, 2013). Furthermore, numerous studies have emerged from IAPT services, supported by practice research networks of IAPT therapists and researchers (e.g., see Lucock et al., 2017). The IAPT programme is also remarkable for its transparent and open-access reporting of clinical performance data at a national scale (Clark et al., 2018).
The present study is the first systematic review of practice-based studies arising from the first 10 years since the implementation of the IAPT programme. Its primary objective was to quantify the effectiveness of IAPT interventions delivered during routine practice. As such, this review focused specifically on quantitative, practice-based outcome research, excluding randomized controlled trials (RCTs). The rationale for excluding RCTs conducted within IAPT services (e.g., Richards et al., 2016) is that these studies often apply strict inclusion/exclusion criteria which render samples that are not typical of routine IAPT populations (e.g., excluding cases with comorbid disorders; Westen & Morrison, 2001). Furthermore, effects from RCT samples may not be realistic reflections of the effects of routine service delivery (e.g., see Baker, McFall, & Shoham, 2009). Because the IAPT programme has expanded to also include assessment and treatment of patients with psychological distress associated with physical health problems (IAPT, 2018), and in order to provide a more comprehensive evaluation of the effectiveness of the programme, studies including patients with long-term physical health conditions were included in this review. A secondary aim was to narratively synthesize the characteristics of the practice-based studies that constitute the IAPT evidence base.

Inclusion and exclusion criteria
Study inclusion criteria were as follows: (1) an outcome study with an adult clinical population (i.e., 18+ years); (2) quantitatively analysed standardized outcome measures with at least two points of outcome data collection; (3) published in a peer-reviewed journal and written in English; and (4) conducted in a UK-based IAPT service delivering group or individual interventions. Study exclusion criteria were as follows: (1) the focus of the study was on children/adolescent populations; (2) only assessment scores were reported on the outcome measures; (3) the methodology was an RCT design; and (4) qualitative studies/opinion pieces/editorials.

Literature search strategy
The study protocol was prospectively registered (PROSPERO ref: CRD42018114796).
Three databases were searched (Scopus, PsycINFO, and MEDLINE) up until 13-08-2018. The search terms utilized were as follows: 'Improving Access to Psychological Therapies' AND/OR IAPT OR 'stepped care' NOT 'International association for plant taxonomy'. As the IAPT initiative commenced in 2008, the search years were inclusive of 2007 to the current date. The process for capturing all relevant studies comprised three components: (1) a systematic search of the three databases using the predetermined search strings which were operationalized to capture all relevant articles; (2) hand-searching, which involved searching the reference lists of those articles that met inclusion criteria; and (3) of those articles meeting inclusion criteria from steps 1 and 2, a backward/reverse citation search was completed.

Eligibility of relevant articles and data extraction
Sixty studies met the inclusion criteria, with n = 29 reporting sufficient statistical information to calculate effect sizes (ESs). For those studies that did not report statistics that were eligible for the meta-analysis (n = 31), we contacted the corresponding author of the article by email and requested the relevant study statistics. This resulted in accessing data from n = 18 additional studies and enabled these studies to be included in the meta-analysis. A narrative synthesis was also carried out including all eligible studies. Figure 1 is a PRISMA diagram (Moher, Liberati, Tetzlaff, & Altman, 2009) detailing the process of study selection. This process followed two stages and was completed by one author in the first instance (SW). Queries about eligibility were discussed and ratified at subsequent research meetings, including three members of the research team (JD, SK, and SW). The eligibility process initially reviewed and removed inappropriate articles (i.e., duplicates), followed by the reviewing of the title and abstract, and finally by accessing and reviewing the full text. A bespoke data extraction tool was used and contained the following items: author/year, service, mental health condition, analysed N, dropout N, analysis (intention-to-treat [ITT] or completer analysis), intervention, main findings, and outcome measures. Any issues likely to introduce bias were also noted in the data extraction tool.

Quality assessment and risk of bias
The Critical Appraisal Skills Programme (CASP) tool was used to assess the quality of studies (see Table 1). One researcher completed quality assessments for all studies (SW), followed by blind rating by two other raters (rater 1 = accredited IAPT CBT therapist; rater 2 = clinical psychologist). Rater 1 rated 12 papers (which represented 20% of the studies), and rater 2 rated six papers that overlapped with rater 1 (which represented 10% of the studies). Papers for second (blind) rating were chosen by splitting the 60 included papers into study quality quartiles (i.e., 15 papers per quartile) and then randomly selecting from each quartile to ensure coverage across all study quality levels. Once completed, the ratings were compared and any discrepancies discussed. A consensus rating for each paper was agreed where possible. Where this was not possible, other members of the research team not involved in quality rating were consulted (JD and SK). Inter-rater reliability was calculated using the Kappa statistic (Cohen, 1960); the level of agreement was 'moderate' both between the original rater and rater 1 (k = 0.526, 95% CI: 0.430-0.662), and between the original rater and rater 2 (k = 0.546, 95% CI: 0.369-0.683).
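The Kappa statistic mentioned above can be illustrated with a minimal sketch. The ratings below are hypothetical and are not the review's data; the function name `cohens_kappa` and the use of three risk-of-bias labels as the rated categories are assumptions made for illustration only, since the text does not state at which level (item or overall rating) agreement was computed.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical risk-of-bias ratings for 12 papers (illustrative only).
original_rater = ["low", "low", "medium", "high", "low", "medium",
                  "low", "high", "medium", "low", "low", "medium"]
second_rater = ["low", "medium", "medium", "high", "low", "low",
                "low", "high", "medium", "low", "medium", "medium"]

kappa = cohens_kappa(original_rater, second_rater)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple proportion of matching ratings.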

Figure 1. PRISMA diagram of study selection. Studies included within narrative synthesis: 60 (13 studies were included in the narrative synthesis only, as data were not available for inclusion in the meta-analysis).
Table 1 footnotes: [...] Table S1; b Those studies included in the meta-analyses; c Those who completed treatment were recruited, and following this stage, the data were analysed using ITT (survival analysis) of all participants, even those lost to follow-up; d Clients' data only reported within this review; e Completer analysis used for the outcomes from CBT intervention. However, this study does compare those who dropped out with the rest of the sample on other variables, such as demographics; f Some missing data and not used, but analysis included dropouts; g Doncaster outcomes are reported in full in another paper (Richards [...]

Narrative review and meta-analysis
A narrative synthesis aimed to summarize key study characteristics. A random effects meta-analysis aimed to synthesize the available outcome data (i.e. pre-post treatment, within-group effect sizes derived from available statistics). Analyses were conducted using the R packages metafor (via MAVIS: Meta-analysis via Shiny) and forestplot (R version 3.6.3) (Gordon & Lumley, 2019; Hamilton, Aydin, & Mizumoto, 2016; Viechtbauer, 2010). Inclusion criteria for meta-analysis were as follows: (1) reporting pre- and post-means and SDs convertible into an ES (Cohen's d; Cohen, 1988) [...] reporting other ESs, but with sufficient additional information (i.e., means/SDs) to enable Cohen's d to be calculated, or (4) reporting the mean pre-post change and SD. The calculation for Cohen's d was $d = (M_1 - M_2)/SD_{pooled}$, where $SD_{pooled} = \sqrt{(SD_1^2 + SD_2^2)/2}$. Cohen's power primer definitions (Cohen, 1992) were used to interpret ESs: 'small' (d = 0.2), 'medium' (d = 0.5), or 'large' (d = 0.8), with anything < 0.2 classified as 'negligible'. Forest plots summarize the ES for each study, as well as the pooled (combined) depression, anxiety, and functioning ESs across studies. Numbers needed-to-treat (NNT) results are provided for each of the outcome measures to aid the clinical interpretation of the meta-analysis results. Publication bias was assessed using funnel plots (Egger, Davey Smith, Schneider, & Minder, 1997) and by using the fail-safe N (Orwin, 1983) and rank correlation tests (Begg & Mazumdar, 1994). Heterogeneity was examined using the I² statistic and Cochran's Q test. Moderator analyses examined potential sources of heterogeneity in between-study ES. Subgroup analysis investigated five categorical variables: methodological design (ITT/completer), step of care (step two/step three/steps two and three), primary condition (mental health only/comorbid physical health), format (individual/group), and risk of bias (low/medium/high). Meta-regression investigated four continuous variables: gender, age, mean baseline score, and treatment duration. The alpha threshold for significance was adjusted to p < .01 for subgroup and meta-regression analyses to account for multiple testing.
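The effect size calculation and pooling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the review's actual pipeline (which used metafor via MAVIS). The pooling function uses the DerSimonian-Laird estimator as a simple stand-in for whichever between-study variance estimator the software applied; the cut-points in `label_effect` are one reading of the Cohen (1992) bands; sampling variances are taken as inputs because the text does not say how they were derived; and all numbers are hypothetical.

```python
import math


def cohens_d(mean_pre, mean_post, sd_pre, sd_post):
    """Pre-post Cohen's d with SD_pooled = sqrt((SD_1^2 + SD_2^2) / 2)."""
    sd_pooled = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_pre - mean_post) / sd_pooled


def label_effect(d):
    """Descriptive label; band edges are an assumed reading of Cohen (1992)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling with Cochran's Q and I^2."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    w = [1.0 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "tau2": tau2, "Q": q, "I2": i2}


# Hypothetical single-study example: PHQ-9 mean falls from 15 to 8, SD 5 at both points.
d_example = cohens_d(15.0, 8.0, 5.0, 5.0)
print(round(d_example, 2), label_effect(d_example))

# Hypothetical pooling of two study-level effects with equal sampling variance.
summary = random_effects_pool([0.2, 0.8], [0.01, 0.01])
print(summary)
```

Because the random-effects weights add the between-study variance to each study's sampling variance, very large samples dominate the pooled estimate less than they would under a fixed-effect model.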

Results
Section one of the results presents the narrative synthesis and section two the meta-analysis. Table 1 describes the characteristics and risk of bias assessment of all included studies (n = 60). Tables 2 and 3 provide a summary of the moderator analyses performed on studies included in the meta-analysis (see Tables S1-S3 for summaries of the main findings from all included studies).

Demographics
Sample sizes ranged from a single-case study (n = 1; Mofrad & Webster, 2012) to data from 209 clinical commissioning groups (n = 537,131; Clark et al., 2018). One study included only male patients (Adamson, Gibbs, & McLaughlin, 2015), and 17 studies did not report the gender distribution of the patients. Of those studies that reported gender, the average percentage of females was 60.2%. Twenty-seven studies did not report ethnicity data; those studies that did report ethnicity varied in the depth of detail provided. With the exception of three studies, the category of 'White'/'White British'/'Caucasian' was the largest ethnic group. North of England services contributed the largest number of studies (n = 17), and London-based services contributed n = 11 studies.

Mental health conditions and populations
The majority of studies investigated outcomes for depression and anxiety. Six studies (9.8%) investigated outcomes for physical health conditions, with one study investigating outcomes for dementia (Cheston & Howells, 2016). Other target conditions included psychosis, relationship distress, and problematic alcohol use (one study each; 4.9% overall). One study (1.6%) was set in a prison for male offenders (Adamson, Gibbs, & McLaughlin, 2015), whilst two papers (3.3%) studied outcomes with veterans (Clarkson et al., 2016; Giebel, Clarkson, & Challis, 2014). One study explored the effectiveness of an [...]

Table 2 note. CI = confidence interval; COM = completer; GAD-7 = Generalized Anxiety Disorder Scale-7; ITT = intention to treat; k = number of comparisons per subgroup; PHQ-9 = Patient Health Questionnaire-9; WSAS = Work and Social Adjustment Scale. a Moderator analysis for 'primary condition' was not undertaken for the WSAS outcome measure as all studies included were deemed to be investigating mental health with none focusing purely on physical health; *Significant at p < .05 threshold; **Significant at p < .01 threshold; ***Significant at p < .0001 threshold; between-subgroup differences significant at the Bonferroni-adjusted p < .01 threshold for multiple testing are shown in bold.
Table 3 note. CI = confidence interval; GAD-7 = Generalized Anxiety Disorder Scale-7; k = number of comparisons; M = mean; PHQ-9 = Patient Health Questionnaire-9; SE = standard error; WSAS = Work and Social Adjustment Scale. *Significant at p < .05 threshold; **Significant at p < .01 threshold; ***Significant at p < .0001 threshold; effects significant at the Bonferroni-adjusted p < .01 threshold for multiple testing are shown in bold.

Follow-up
There were four studies that had a post-treatment follow-up period, and this ranged from 4 to 52 weeks.

Risk of bias assessment
Overall, the majority of studies (58%) were rated as having low risk of bias, 30% had medium risk, and 12% had high risk. Study quality was particularly affected by the lack of follow-up data.

Meta-analysis
Overall, n = 47 studies were included in the meta-analysis. The analyses were organized according to the outcome measures routinely used within IAPT services. Because studies differed in which measures they used and reported, the number of studies varies across analyses. Within the studies included here, 46 used the PHQ-9 as an outcome measure; 41 used the GAD-7 as an outcome measure; and 19 used the WSAS as an outcome measure. Some of the included studies reported more than one ES for independent samples contained within their original research (n = 8 studies). Where this occurred and the separate ESs reported did not contain overlapping patient data, the ESs were included as independent samples. This was consistently implemented across the whole meta-analysis and subgroup analyses. For example, in the paper by Delgadillo, Dawson, et al. (2017) a separate ES is reported for different patient groups and therefore each group is represented by the individually reported ES. This means that whilst the number of studies is given in each description below, this does not always match the number of ESs included in the meta-analysis. Both the number of studies and the number of ESs are reported for each analysis for clarity. A limited number of studies reporting pre-post outcomes also included follow-up data (n = 4; Clark et al., 2009; Kenwright, McDonald, Talbot, & Janjua, 2017; Meadows & Kellett, 2017; Pack & Condren, 2014). Due to the small number of these studies, follow-up outcomes have not been included within this meta-analysis.

Primary meta-analysis
Results for the PHQ-9 summarizing outcomes from 636,734 patients (mean n = 9,796; median n = 619) across 46 studies (n = 65 independent samples) are reported in Figure 2. The overall combined pre-post treatment PHQ-9 ES was large (d = 0.87, 95% CI [0.78-0.96], p < .0001, NNT = 2.17), indicating a statistically significant and large reduction in depression severity. There was evidence of considerable heterogeneity across PHQ-9 outcome studies: I² = 98%; Q(df = 64) = 3600.47, p < .0001. Funnel plot asymmetry (see Figure 3) suggested the presence of publication bias. However, there was a non-significant rank correlation test (p = .196) and non-significant regression test for funnel plot asymmetry (p = .083). The fail-safe N analysis indicated that 97 non-significant studies would need to be published to reduce the findings to a small, clinically non-significant effect. Results for the GAD-7 included outcomes from 598,166 patients (mean n = 9,969; median n = 541) across 41 studies (n = 60 independent samples) and are reported in Figure 4. The overall combined pre-post treatment GAD-7 ES was large (d = 0.88, 95% CI [0.79-0.97], p < .0001, NNT = 2.15), indicating a statistically significant and large reduction in anxiety severity. There was evidence of considerable heterogeneity across studies, I² = 98%; Q(df = 59) = 4239.30, p < .0001. There was some evidence of funnel plot asymmetry (see Figure 5); the funnel plot asymmetry regression test (p = .014) and the rank correlation test (p = .008) were significant, indicating some evidence for publication bias. However, the fail-safe N analysis indicated that 92 studies with null findings would be necessary to reduce the results to clinically non-significant.
The results for the WSAS included data from 478,693 patients (mean n = 19,946; median n = 1,351) from 19 studies (n = 24 independent samples) and are summarized in Figure 6. The overall combined WSAS ES was moderate (d = 0.55, 95% CI [0.48-0.61], p < .0001, NNT = 3.30), indexing a statistically significant treatment effect on work and social adjustment. There was evidence of significant heterogeneity across studies, I² = 95%; Q(df = 23) = 524.11, p < .0001. Funnel plots were visually inspected and suggested some asymmetry with missing studies demonstrating larger effects (see Figure 7). The statistical tests showed mixed evidence of publication bias; the funnel plot asymmetry regression suggested significant asymmetry (p = .027), and the fail-safe N indicated 13 null-finding studies would reduce the average ES to a small clinically non-significant pre-/post-improvement (d = 0.35); however, the rank correlation test was not significant (p = .572).
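The text does not state how the NNT and fail-safe N values were derived. As a hedged illustration, the sketch below uses two widely used formulas that are assumptions here rather than the review's documented method: a conversion from d to NNT of the form 1 / (2Φ(d/√2) − 1), and Orwin's fail-safe N with null studies assumed to average d = 0. The function names are hypothetical. Applied to the rounded pooled estimates, these formulas give values close to those reported, though small differences are expected because the original software would have worked from unrounded estimates.

```python
import math


def nnt_from_d(d):
    """Assumed conversion: NNT = 1 / (2 * Phi(d / sqrt(2)) - 1) = 1 / erf(d / 2)."""
    if d <= 0:
        raise ValueError("this conversion assumes a positive effect size")
    return 1.0 / math.erf(d / 2.0)


def orwin_failsafe_n(k, mean_d, criterion_d, missing_d=0.0):
    """Orwin's fail-safe N: studies averaging `missing_d` needed to pull the
    mean effect of k samples down from `mean_d` to `criterion_d`."""
    if criterion_d <= missing_d:
        raise ValueError("criterion effect must exceed the assumed missing-study effect")
    return k * (mean_d - criterion_d) / (criterion_d - missing_d)


# Pooled effect sizes as reported in the text (rounded to two decimals).
for outcome, d in [("PHQ-9", 0.87), ("GAD-7", 0.88), ("WSAS", 0.55)]:
    print(outcome, round(nnt_from_d(d), 2))

# WSAS: 24 independent samples, pooled d = 0.55, criterion d = 0.35 (from the text).
wsas_failsafe = orwin_failsafe_n(24, 0.55, 0.35)
print(round(wsas_failsafe, 1))
```

An NNT near 2 means that roughly one additional patient improves for every two treated, relative to the comparison implied by the effect size; with uncontrolled pre-post effects this should be read as a descriptive index rather than a causal estimate.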

Moderator and sensitivity analyses

Subgroup analyses of categorical variables
Significant between-study heterogeneity was explored using subgroup analyses to investigate five categorical moderators of treatment effects across the three outcomes (Table 2). For PHQ-9 outcomes, significant variations in ES by subgroups were evident for type of methodology used, primary condition, step of care, and level of study bias. Completer analyses produced significantly larger ES than ITT analyses. Studies of primary mental health conditions produced significantly larger effects than studies which included patients with a chronic physical illness as the primary condition. Studies with increased risk of bias produced larger treatment effects than studies with low risk of bias. Samples reporting outcomes for step 3 (high-intensity) interventions produced larger effects than those reporting outcomes for step 2 (low-intensity) interventions. However, the subgroup differences in the latter two comparisons were no longer significant after accounting for multiple testing. For GAD-7 outcomes, significant variations in ESs by subgroups were evident for type of methodology used (completer vs. ITT analysis), primary condition (mental health vs. physical illness), and risk of study bias, showing the same pattern as in the PHQ-9 outcomes. Effects for step of care were not significantly different for GAD-7 outcomes. The format of treatment did not explain variations in treatment effects for either PHQ-9 or GAD-7 outcomes, and no significant variation in effects across subgroups was found for WSAS outcomes.

Meta-regression analyses of continuous variables
Significant between-study heterogeneity was explored using meta-regressions to investigate four continuous moderators of treatment effects across the three outcome measures (Table 3). For GAD-7 and WSAS outcomes, between-study variations in ESs were not related to differences in the mean age or gender proportions of the study samples. PHQ-9 outcomes did show larger treatment effects when proportions of males increased; however, the effect did not remain significant after adjusting for multiple testing. Mean treatment duration was significantly associated with between-study ES variations for both PHQ-9 and GAD-7 outcomes, with larger effects evident when there were a greater mean number of sessions attended. Larger effects were also associated with higher baseline severity scores for PHQ-9 and GAD-7 outcomes, although the PHQ-9 effect did not remain significant after accounting for multiple testing. There was no association between intake score or treatment duration and variation in treatment effects for WSAS outcomes.
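A meta-regression of the kind described above can be sketched with a single continuous moderator. This is a simplified, fixed-effect version (inverse-variance weighted least squares) offered only as an illustration: it omits the between-study variance term that a mixed-effects model would include, so its standard errors would be too small under the heterogeneity reported in this review. The data, the function name `meta_regression`, and the choice of mean sessions attended as the moderator are hypothetical.

```python
import math


def meta_regression(effects, variances, moderator, alpha=0.01):
    """Inverse-variance weighted regression of effect size on one moderator."""
    if not (len(effects) == len(variances) == len(moderator)) or len(effects) < 3:
        raise ValueError("need at least three studies with matching inputs")
    w = [1.0 / v for v in variances]
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, moderator)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / total_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, moderator))
    if sxx == 0:
        raise ValueError("moderator has no variation across studies")
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, moderator, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    se_slope = math.sqrt(1.0 / sxx)  # valid when sampling variances are treated as known
    z = slope / se_slope
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return {"intercept": intercept, "slope": slope, "se_slope": se_slope,
            "z": z, "p_value": p_value, "significant": p_value < alpha}


# Hypothetical study-level data: mean sessions attended and pre-post effect sizes.
mean_sessions = [4.0, 6.0, 8.0, 10.0]
effect_sizes = [0.6, 0.8, 1.0, 1.2]
sampling_variances = [0.01, 0.02, 0.01, 0.02]

result = meta_regression(effect_sizes, sampling_variances, mean_sessions)
print(result)
```

The `alpha=0.01` default mirrors the adjusted threshold the review applied to moderator analyses; a positive slope in this sketch would correspond to larger effects in studies with more sessions attended.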

Sensitivity analysis excluding atypical studies
Sensitivity analyses investigated the aggregated ES for those studies that were more similar to each other, through excluding studies deemed to be atypical of routine IAPT services in terms of their population, target condition, or treatment type. There were eight studies excluded on this basis. The excluded studies focused on samples of male offenders (Adamson, Gibbs, & McLaughlin, 2015), two studies of veterans (Clarkson et al., 2016; Giebel et al., 2014), deaf patients (Young et al., 2017), two studies of systemic therapy (Kuhn, 2011; DIT, Wright & Abrahams, 2015), and two studies due to both the population and treatment delivered (couples and BCT-D, Baucom et al., 2018; psychosis and CBT-p, Jolley et al., 2015). Meta-analyses for each outcome were completed with the atypical studies excluded. Overall, and in comparison with the primary meta-analysis, there was only a minimal difference in the ES found in the sensitivity analyses. With regard to the PHQ-9, 57 separate comparisons contributed to the analysis producing a moderate-to- [...] the Tables S1-S3). Overall, this indicates that findings from the primary meta-analyses were stable and robust to sample selection across subgroup analyses.

Discussion
This systematic review has identified and synthesized all available, peer-reviewed, practice-based evidence generated by the IAPT programme, an initiative originally designed to increase rapid access to evidence-based psychological treatments for those experiencing common mental disorders (Clark, 2011; Clark et al., 2009). The narrative review summarized n = 60 studies that varied markedly in terms of the methods used, samples studied, and outcomes analysed. The meta-analysis aimed to quantify the overall impact of IAPT interventions using standardized outcome measures, including data from over 600,000 patients. RCTs were excluded from this review in order to gain a better understanding of outcomes achieved in routine practice, due to the common issues regarding generalizing from experimental studies to routine service delivery contexts (Lorenzo-Luaces, Johns, & Keefe, 2018). The main results from the primary meta-analysis found large pre-post treatment effect sizes for reductions in depression and anxiety, with a medium effect regarding improvements in work and social adjustment. The GAD-7 effect mirrors the results of the Stewart and Chambless (2009) meta-analysis of the effectiveness of CBT for adult anxiety disorders delivered in routine practice, which illustrated that pre-/post-treatment outcomes on disorder-specific measures were large and, when benchmarked against the outcomes achieved in RCTs, were equivalent. The PHQ-9 effect mirrors the Thimm and Antonsen (2014) meta-analysis of the treatment of depression in routine practice, in that the ES at post-treatment was large (d = 0.97), and 44% demonstrated a significant improvement in depression. The tests of heterogeneity throughout the current meta-analyses indicated high levels of variability across studies and there was some evidence of publication bias (for GAD-7 outcomes), so results should be interpreted cautiously.
The ES reported here therefore complements the recovery rates that are routinely reported by services (Clark, 2019) to assess the effectiveness of the IAPT programme, alongside other targets related to wait-times for assessment, entry into treatment, return to work rates, etc.

Moderator analyses
Studies using ITT analyses were compared with completer analyses (COM), which is an important and well-known methodological distinction (Kyrios, Hordern, & Fassnacht, 2015). ITT methods are recommended to minimize bias (Ranganathan, Pramesh, & Aggarwal, 2016), whereas COM tends to increase the rate of Type I errors (Fergusson, 2002). The ESs in COM studies were larger than those using ITT analysis across both anxiety and depression outcomes, and this is further evidence that study designs which employ COM approaches for routinely delivered psychological interventions risk yielding overoptimistic and biased results.
Significant differences were found in the magnitude of effect sizes observed for low and high intensity interventions for depression; however, these were no longer significant after accounting for multiple testing. Although differences between low and high intensity interventions were not significant for anxiety outcomes and functional impairment, there was a pattern of larger effects for high intensity interventions. This may have been due to the fact that when intake scores were assessed, there were no differences in initial assessment scores between the step 2 and step 3 studies. Psychological well-being practitioners delivering low-intensity interventions in IAPT are trained to post-graduate certificate level via a national curriculum to work with mild-to-moderate anxiety and depression, with the psychoeducational approaches used being originally designed for such presentations (Kellett et al., 2020). Therefore, ESs may possibly be attenuated in some patients with more complex problems, where the skill level of the practitioner or the content of the intervention may be insufficient. This finding is a challenge to stepped care principles, as low-intensity interventions are not assumed to be less effective, just less intense in format, and more flexible in terms of service delivery method (Firth, Barkham, Kellett, & Saxon, 2015). Recent studies suggest that 'complex cases' tend to have poor treatment outcomes when they are initially allocated to and receive low-intensity therapies, compared to high-intensity interventions (Delgadillo, Huey, et al., 2017; Delgadillo, Moreea, & Lutz, 2016). Other research has also investigated the use of predictive models to identify factors that may impact on outcomes at the various steps of IAPT, both at patient (e.g., demographic and clinical factors) and therapist levels (e.g., Delgadillo, Moreea, et al., 2016; [...]).
The average duration of IAPT treatments (mean = 6.7) was associated with larger treatment effects for depression and anxiety outcomes (although anxiety effects were not significant after controlling for multiple testing). This finding is in line with national evidence that suggests the average length of an IAPT treatment is seven sessions and that patients that move to recovery attend eight sessions on average (NHS England, 2019).

Study limitations
The absence of any control comparators means that the observed effects may be confounded by statistical phenomena such as regression to the mean and/or a possible natural recovery phenomenon (Posternak & Miller, 2001; Whiteford et al., 2013). The lack of any indices of treatment fidelity, integrity, or competency in the studies raises uncertainty as to whether the interventions described were actually delivered as intended. The moderate rate of agreement concerning risk of bias ratings could have created unreliable treatment effect estimates in the meta-analysis (Armijo-Olivo et al., 2014). The studies lacked precision in specifying the low- and high-intensity interventions delivered, so the interventions could not be described at a granular level. There were relatively few purely low-intensity or high-intensity outcome studies for inclusion, and this weakened the specificity of the moderator analyses conducted. The lack of studies with adequate post-treatment follow-up data means that the durability of IAPT interventions is still open to question.
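The regression-to-the-mean concern can be illustrated with a small simulation in which no treatment is applied at all. Every parameter below (the severity distribution, the measurement noise, and the entry cut-off) is hypothetical and chosen only for illustration; none is drawn from the included studies. Because cases are selected on a noisy baseline score, their average score falls at re-measurement even though underlying severity is unchanged.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N = 20000
TRUE_MEAN, TRUE_SD = 10.0, 4.0  # hypothetical stable underlying severity
NOISE_SD = 3.0                  # hypothetical measurement noise
CUTOFF = 14.0                   # hypothetical entry threshold at baseline

baseline, follow_up = [], []
for _ in range(N):
    true_score = random.gauss(TRUE_MEAN, TRUE_SD)
    first = true_score + random.gauss(0, NOISE_SD)   # baseline measurement
    second = true_score + random.gauss(0, NOISE_SD)  # re-measurement, no treatment
    if first >= CUTOFF:  # only cases above the threshold enter the sample
        baseline.append(first)
        follow_up.append(second)

mean_drop = statistics.mean(baseline) - statistics.mean(follow_up)
print(f"selected cases: {len(baseline)}")
print(f"mean baseline score: {statistics.mean(baseline):.2f}")
print(f"mean follow-up score: {statistics.mean(follow_up):.2f}")
print(f"apparent improvement without treatment: {mean_drop:.2f}")
```

An uncontrolled pre-post design cannot separate this artefact from a genuine treatment effect, which is why a control comparator matters for interpretation.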

Research, policy, and clinical implications
To continue to improve our understanding of the effects of routinely delivered interventions, there is a need for the following: (1) studies analysing outcomes on other disorder-specific measures; (2) studies describing the interventions in greater detail; (3) consistent use of measures of treatment fidelity and competency; (4) studies investigating moderators and mediators of depression and anxiety outcomes; (5) studies collecting longer-term follow-up outcome data; (6) more consistent reporting of dropout rates; and (7) studies modelling and exploring variability between therapists/services/regions. Future IAPT studies should apply ITT analyses and report the percentage of patients treated at each step, the stepping up rate, the dropout rate, pre- and post-treatment means (SDs), and ESs on the standard IAPT outcome measures as well as the disorder-specific outcome measures used in routine care.
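Reporting pre- and post-treatment means and SDs matters because they allow a pre-post ES to be recomputed by readers and meta-analysts. The sketch below shows two common conventions for a standardised mean change: dividing by the pre-treatment SD, or by the root mean square of the pre- and post-treatment SDs. The function name `pre_post_d`, both conventions, and the example values are assumptions for illustration and are not necessarily the formula applied in this review.

```python
import math


def pre_post_d(mean_pre, mean_post, sd_pre, sd_post=None):
    """Standardised pre-post mean change from summary statistics.

    If sd_post is omitted, the change is divided by the pre-treatment SD.
    Otherwise it is divided by sqrt((sd_pre**2 + sd_post**2) / 2).
    Positive values indicate a reduction in scores after treatment.
    """
    if sd_pre <= 0 or (sd_post is not None and sd_post <= 0):
        raise ValueError("standard deviations must be positive")
    if sd_post is None:
        denominator = sd_pre
    else:
        denominator = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_pre - mean_post) / denominator


# Hypothetical summary statistics on a symptom measure (not study data).
d_pre_sd = pre_post_d(mean_pre=15.0, mean_post=9.0, sd_pre=6.0)
d_pooled = pre_post_d(mean_pre=15.0, mean_post=9.0, sd_pre=6.0, sd_post=8.0)
print(f"d using pre-treatment SD: {d_pre_sd:.2f}")
print(f"d using pooled SD: {d_pooled:.2f}")
```

The two conventions give different values from the same data, so a study that reports only an ES, without the means and SDs behind it, cannot be re-expressed on a common metric.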
In terms of the policy implications, the following are of note: (1) the commissioning of routine follow-up support post-treatment; (2) identifying numbers of patients who are re-referred for IAPT treatment; and (3) open access to routinely collected patient-level IAPT data sets, to enable research to keep pace with the rapidly shifting IAPT policy context. National performance reports could be improved through the commissioning of rigorous meta-analytic evaluations, as exemplified in this study. In addition, it is clear that clinical outcomes are attenuated in populations with chronic and long-term illnesses, and multidisciplinary care is advisable for this population based on the wider evidence base (e.g., see Delgadillo, Dawson, et al., 2017). Furthermore, the extent to which the effects of IAPT interventions endure over time is largely unknown, and the limited available data on this topic indicate that relapse after low-intensity interventions is likely to be very common (Ali et al., 2017). A major area for improvement is the consistent implementation of evidence-based relapse prevention support, such as booster sessions (Gearing, Schwalbe, Lee, & Hoagwood, 2013) or mindfulness-based relapse prevention (Kuyken et al., 2016). A promising development in this regard concerns telephone-delivered relapse prevention support, which could be implemented at low cost to support IAPT patients to maintain their improvement after the acute phase of therapy (Lucock et al., 2018) and during the first 6 months after therapy, which is known to be the time of highest risk of relapse (Ali et al., 2017).
This broad review of routinely delivered IAPT interventions has some implications for clinical practice. First, the expansion of high-intensity treatment options (e.g., provision of interpersonal psychotherapy, dynamic interpersonal psychotherapy, person-centered experiential counselling, and couples therapy for depression) has not been mirrored for low-intensity interventions which are mainly based on CBT principles. An expansion of other evidence-based low-intensity treatment options could provide greater choice for the highly heterogeneous clinical populations treated by IAPT services (Meadows & Kellett, 2017). There is increasing evidence to support stratified models of treatment matching for more complex cases, who evidently have higher dropout rates and poorer outcomes when offered very brief interventions. The original aim of the IAPT programme was to increase access to evidence-based talking treatments and there is evidence that large numbers are being treated annually, and that recovery rates are slowly increasing and achieving the 50% target (IAPT, 2019). There is, however, considerable room for improvement, particularly for patients who do not attain clinically significant improvement and who may find themselves in a 'revolving door' scenario of repeated treatment episodes (Cotton, 2019). There is also evidence to suggest that a considerable proportion (~30%) of IAPT patients have complex presentations (e.g., severe symptoms, comorbidity, socioeconomic adversity, and personality disorder traits), and they derive less benefit from routinely delivered interventions (Delgadillo, Huey, et al., 2017). It is also evident that some complex cases do not benefit from low-intensity interventions, and therefore identifying complex cases early and signposting to high-intensity interventions is an important area for future development.

Conclusion
The IAPT programme is a notable example of psychological public health care transformation informed by scientific evidence (Clark et al., 2018). Analysis of the evidence accumulated over the last 10 years supports the effectiveness of the IAPT programme and also demonstrates that innovative research and practice development have flourished within this context. Substantial investment has been made to enable and maintain the IAPT programme, achieved via mental health service infrastructure change, human resource investment in recruiting a new therapies workforce, and overall organisational culture development/change. This transformation of the landscape of psychological services for people with anxiety and depression in the United Kingdom has served as a model for similar developments in other countries. This review has demonstrated that the systematic routine outcome monitoring implemented at scale in the IAPT programme also has huge scientific potential (Clark et al., 2018).

Supporting Information
The following supporting information may be found in the online edition of the article:
Table S1. Main findings of each study and quality assessment ratings.
Table S2. Subgroup analysis of pre-post treatment effects in the typical study sample (n = 8 atypical studies excluded).
Table S3. Meta-regression analysis of pre-post treatment effects in the typical study sample (n = 8 atypical studies excluded).