Evaluating narrative exposure therapy for post-traumatic stress disorder and depression symptoms: A meta-analysis of the evidence base

Narrative exposure therapy (NET) is an intervention for trauma spectrum disorders. Originally developed to treat refugee populations, NET has since been tested for efficacy across different settings. In this review, the NET evidence base is examined through a retrieval, synthesis and appraisal of randomized controlled trials (RCTs) published since 2002. Two independent reviewers (S. R. and N. S.) searched online databases including EMBASE, PsycINFO and PubMed. Twenty-four RCTs were selected for a meta-analysis of three outcomes: post-traumatic stress disorder (PTSD) diagnosis and PTSD and depression symptoms. All outcomes were analysed at short-term (3 – 4 months), midterm (6 – 7 months) and long-term ( ≥ 12 months) data points. A random-effects model was applied to yield standardized mean differences (SMDs) and odds ratios (ORs) as indicators of NET treatment effect. Subgroup analyses for type of trauma and type of control groups were conducted to examine potential heterogeneity. For the NET group, moderate effect sizes for PTSD symptom severity were observed at midterm and long term and at midterm for depression symptom severity. The number of PTSD diagnoses decreased significantly in the short term for the NET condition, but this was not sustained at the long term. Caution must be exercised when interpreting these results due to high heterogeneity estimates and low quality of evidence across trials. Potential small-study effects further complicate the interpretation of the findings. Recommendations are made for augmenting statistical significance research with qualitative analyses of NET efficacy to better inform clinical practice.

& Fazel, 2010). On the basis of the principles of cognitive behavioural therapy (CBT), testimony therapy and exposure therapy, NET follows a manualized treatment protocol that is aimed at constructing a consistent, coherent autobiographical representation of the traumatic event(s) within the context of a narrative account of one's life. This is thought to facilitate emotional processing of trauma memories to bring about improvement in emotional, cognitive and behavioural symptoms of trauma.
NET was developed specifically for use in low-resource settings and with victims of organized and family violence. It is highly accessible because it can be delivered by non-mental health professionals following a short training programme using the manual developed by Schauer, Neuner, and Elbert (2005) and Schauer, Elbert, and Neuner (2011). A detailed description of the NET procedure can be found in the treatment manual.
NET has been evaluated using randomized controlled trials (RCTs) primarily with refugee populations and asylum seekers, wherein the trauma experienced may be naturally caused or man-made. The focus in these situations is on the atrocities endured, usually at the hands of
NICE recommends NET as a first-line treatment option for PTSD along with CBT, cognitive processing therapy (CPT) and prolonged exposure (PE). In contrast, the APA guidelines issued a conditional recommendation for NET, stating that the evidence was insufficient for a strong recommendation (APA, 2017). Strong recommendations were made for CBT, CPT and PE. Similarly, VA/DoD found PE and eye movement desensitization and reprocessing (EMDR) to have the strongest evidence base, whereas NET was found to have 'sufficient' evidence for a strong recommendation (VA/DoD, 2017). ISTSS also provided a standard recommendation for NET (Berliner et al., 2019) and a strong recommendation for CPT, EMDR and trauma-focused CBT. As such, the evidence base for CBT, CPT, PE and EMDR appears to be stronger when compared with NET. A systematic evaluation of the NET evidence base is essential to better inform clinical and practice guidelines regarding its efficacy.

| Rationale for review
By publishing data about NET efficacy, one can advocate for its widespread implementation in low-resource settings to alleviate symptoms. Narrative reviews of NET efficacy have found NET favourable to controlled comparisons in reducing traumatic stress in a range of socio-economic and cultural contexts including low dropouts and sustained improvements over time (McPherson, 2012; Robjant & Fazel, 2010). In one meta-analysis of NET efficacy, authors found a medium effect size (g = 0.63) for PTSD symptom reduction postintervention (Gwozdziewycz & Mehl-Madrona, 2013). This estimate was interpreted as evidence of NET efficacy, although it is not clear how the average effect size was calculated in terms of follow-up time points. The control groups used for comparisons with NET are varying and not restricted to 'bona fide' or active treatments, that is, treatments that were intended to be 'therapeutic' (Wampold & Imel, 2015).
For example, RCTs used wait-list groups, no-treatment groups and supportive counselling as controlled comparisons to NET. Importantly, most reviews have not critically appraised the quality of the evidence, which makes these findings inconclusive regarding NET's true treatment efficacy.
NET trials have been included in large meta-analytic studies of a range of psychological therapies for traumatic stress (Bisson, Roberts, Andrew, Cooper, & Lewis, 2013; Patel, Kellezi, & Williams, 2014). In these reviews, the methodological rigour of the included NET trials was questioned. Further, both reviews included only a handful of NET trials each (N = 7, Bisson et al., 2013; N = 4, Patel et al., 2014), and since their publication, several recent trials investigating NET efficacy have been published. This suggests a need to conduct a comprehensive, up-to-date quality appraisal of the NET evidence base to inform researchers and practitioners.
Only one other comparable meta-analytic review was identified at the time of this review (Lely, Smid, Jongedijk, Knipscheer, & Kleber, 2019). The authors of this review concluded that despite methodological weakness in the included trials, there is empirical support for NET as a trauma intervention. However, at least seven published RCTs of NET efficacy have not been included in the review, and the authors have used 'last follow-up' as the uniform follow-up time point. NET trials vary significantly in their measurement of outcomes, especially concerning data points. In the Lely, Smid, et al. (2019) meta-analysis, this ranged from 9-52 weeks. They also do not include PTSD diagnostic status (i.e., meeting the criteria for a diagnosis of PTSD) as a measured outcome, which could potentially be an important consideration when making a recommendation for clinical efficacy.

Key Practitioner Message
• Narrative exposure therapy (NET) significantly alleviates post-traumatic stress disorder (PTSD) symptoms at midterm and long-term time points compared with control interventions.
• Diagnostic status of PTSD is statistically improved by NET only at the short-term and midterm time points.
• Depression symptoms are significantly lower for NET groups only at the midterm time point.
• Most trials are underpowered and are highly heterogeneous across outcomes and time points. Quality of evidence for NET efficacy is weak due to high risk of bias across domains.
• Caution is advised when interpreting pooled intervention effects from randomized controlled trials to determine the efficacy of NET in clinical settings.

In this paper, a meta-analysis of NET efficacy is attempted using all available data from NET clinical trials published to date. The term 'efficacy' is used unequivocally throughout this paper as NET RCTs have predominantly evaluated the intervention under ideal and controlled circumstances (Roland & Torgerson, 1998). This is differentiated from evaluations of 'effectiveness' (a measure of intervention benefit under 'real-world' clinical conditions). Of note, some of the included trials as well as a previously published meta-analytic review (Lely, Smid, et al., 2019) have used the term 'effectiveness' interchangeably with 'efficacy', although the design of the trial is in line with an efficacy evaluation.
For the meta-analysis, the pooled intervention effect of NET will be estimated against controlled comparisons on outcomes of PTSD (symptom severity and diagnostic status) and depression symptom severity. Due consideration will also be given to issues of heterogeneity and methodological quality of the included trials.

| OBJECTIVES
The objective was to evaluate the efficacy of the NET evidence base for the treatment of trauma-related psychopathology on outcomes of PTSD and depression across different sociocultural contexts and trauma exposure.

| METHODS
Search results that fulfilled the selection criteria were scrutinized at the abstract and full-text stage independently by two reviewers (the first and second authors: S. R. and N. S.). Appropriate data were extracted independently by both reviewers, and disputes, if any, were resolved by the third author, N. H. The risk of bias was assessed by S. R. and N. S. in accordance with version 1.0 of the Cochrane risk of bias tool for randomized trials (Higgins & Green, 2011).

| Types of studies
Original RCTs using NET (or an adaptation of NET) were included.
Inclusion criteria for the trial population were individuals with a history of exposure to trauma and reporting PTSD outcome measures (diagnostic status and/or symptom severity) following such exposure.
No restrictions were placed based on the type of trauma experienced by the population. The search was not limited to RCTs with participants over 18 years of age. Instead, we chose to exclude studies that used the version of NET adapted for children below the age of 18 years: KIDNET (Onyut et al., 2005). From an initial scoping review of the NET literature, it was anticipated that some trials may include a combination of both adults and underage (<18 years) participants, especially in the case of refugees and asylum seekers. Further, KIDNET was developed and tested for efficacy merely 3 years after the first publication of NET (Neuner et al., 2002), which would make it the preferred choice for studies strictly recruiting participants under 18 years of age. By not placing an explicit age restriction and instead using the KIDNET filter, we aimed to include all available data published on NET trials conducted with adult participants.
Studies from any part of the world implementing any control comparisons were included. No restrictions were placed on the number of control comparisons. Outcome measures included PTSD (scales or diagnostic interviews) and depression symptoms.

| Timing of outcome assessment
End point assessments were used. NET does not have a fixed number of sessions or intervention duration, and we anticipated some variations in timings of outcome assessments across studies. As a result, we estimated effects at (a) short term (3-4 months), (b) midterm (6-7 months) and (c) long term (12 months or above).

| Search strategy
The local databases searched were PubMed, CINAHL, PsycINFO, Medline, Cochrane Library and Embase. The search strategy for PubMed is presented in Table A1 as an example. Additionally, the original authors of NET were contacted and asked to provide a full publication list of NET research. The date last searched was 8 December 2019.

| Selection and screening
Two authors (S. R. and N. S.) independently screened all titles and abstracts. Full-text articles were reviewed for inclusion, and data were extracted independently by S. R. and N. S. Cohen's kappa statistic was calculated to indicate inter-rater agreement at the title and abstract screening stages (Cohen, 1960). Kappa values were interpreted using Landis and Koch's (1977) guidelines. Included publications were extracted for relevant data, and disagreements were resolved by the third author (N. H.).
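The agreement statistic described above can be illustrated with a minimal sketch. It assumes a two-rater, two-category (include/exclude) screening table; the function names and the example counts are hypothetical and are not taken from the review's data. The labels follow the bands commonly attributed to Landis and Koch (1977).

```python
def cohens_kappa(both_include, only_a, only_b, both_exclude):
    """Cohen's kappa for two raters making include/exclude decisions.

    Arguments are cell counts of the 2x2 agreement table (hypothetical layout):
    both_include / both_exclude are agreements, only_a / only_b are disagreements.
    """
    n = both_include + only_a + only_b + both_exclude
    if n == 0:
        raise ValueError("no screening decisions supplied")
    observed = (both_include + both_exclude) / n
    a_include = (both_include + only_a) / n   # rater A's inclusion rate
    b_include = (both_include + only_b) / n   # rater B's inclusion rate
    expected = a_include * b_include + (1 - a_include) * (1 - b_include)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def landis_koch_label(kappa):
    """Verbal label for a kappa value using the Landis and Koch bands."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    k = cohens_kappa(both_include=20, only_a=5, only_b=10, both_exclude=15)
    print(round(k, 3), landis_koch_label(k))
```

Under these bands, any kappa between 0.61 and 0.80 is labelled 'substantial'.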

| Risk of bias
Two authors (S. R. and N. S.) separately assessed the risk of bias by using criteria according to the Cochrane Collaboration's risk of bias tool (Higgins & Green, 2011). At the time of completion of this review, there was no mandate to use the updated version (RoB 2.0; Sterne et al., 2019), and it was still being pilot tested by the Cochrane Review teams (Cochrane Library, 2019). Additionally, assessment issues and therapist qualifications were included as domains of risk. For assessment issues, criteria for low risk of bias included the use of valid and reliable measures or the use of translated measures with sufficient psychometric properties. In the absence of such information for translated versions, the risk was deemed unclear. When trials failed to report whether measures were valid and reliable, or when there was a lack of clarity regarding attempts to translate and back-translate standardized measures, the trials were considered to have a high risk of bias. For therapist qualifications, trials that reported details of the therapist's NET training and qualifications qualified for a low risk of bias. If such information was found insufficient or lacking, the trials were rated as having an unclear risk of bias. If the intervention was delivered by untrained individuals with a lack of relevant qualifications, or if the study did not report any information at all, it was rated as having a high risk of bias.
Funnel plots were examined for comparisons with 10 or more trials to indicate publication bias in line with the rule of thumb recommended by the Cochrane Handbook for Systematic Reviews of Interventions. Disagreements were resolved by the third author, N. H.

| Data synthesis
A random-effects (RE) model was used to calculate the standardized mean differences (SMDs) and 95% confidence intervals (CIs) for continuous outcomes. The odds ratio (OR) and 95% CIs were calculated by using an RE model for dichotomous variables. When trials reported multiple treatment arms, the non-NET active treatment arm was treated as a control group and was combined with the non-active treatment control group (such as wait-list or no-treatment controls) to generate pooled mean and standard deviation (SD) values. Study authors were contacted for missing data, such as means and SDs. If these data could not be obtained, those trials were excluded from the meta-analysis. The desktop version of Review Manager (RevMan, version 5.3) was used for data analysis (RevMan, 2014).
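The arm-combining step and the SMD can be sketched as follows. This is an illustration under stated assumptions, not the review's actual computation: it uses the standard formula for merging two groups' summary statistics into one, and a small-sample-corrected SMD (Hedges' g) as the effect size. Function names and example numbers are hypothetical.

```python
import math


def combine_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Merge two groups' summary statistics into a single group.

    Returns (n, mean, sd) of the pooled sample, as if the raw data
    from both arms had been stacked together.
    """
    n = n1 + n2
    mean = (n1 * mean1 + n2 * mean2) / n
    sum_sq = ((n1 - 1) * sd1 ** 2
              + (n2 - 1) * sd2 ** 2
              + (n1 * n2 / n) * (mean1 - mean2) ** 2)
    return n, mean, math.sqrt(sum_sq / (n - 1))


def hedges_g(n_t, mean_t, sd_t, n_c, mean_c, sd_c):
    """Standardized mean difference with Hedges' small-sample correction."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    d = (mean_t - mean_c) / pooled_sd
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return correction * d


if __name__ == "__main__":
    # Hypothetical trial: one NET arm, one active control, one wait-list control.
    n_c, mean_c, sd_c = combine_groups(12, 30.0, 8.0, 11, 34.0, 9.0)
    print(n_c, round(mean_c, 2), round(sd_c, 2))
    print(round(hedges_g(12, 25.0, 8.5, n_c, mean_c, sd_c), 3))
```

Merging the two control arms first avoids counting the NET arm twice, which would happen if each control were entered as a separate comparison.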

| Assessment of heterogeneity
Visual inspection of graphs, a Mantel-Haenszel χ² statistic and the I² statistic were used to test for heterogeneity. An I² estimate ≥50% accompanied by a statistically significant χ² statistic was interpreted as evidence of substantial levels of heterogeneity.
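The relationship between the χ² heterogeneity statistic and I² can be sketched with the usual definition, I² = max(0, (χ² − df) / χ²) × 100. The decision rule below mirrors the criterion stated above; the significance threshold of .05 is an assumption, as the text does not state one, and the function names are hypothetical.

```python
def i_squared(chi2, df):
    """I-squared (percent) from a heterogeneity chi-square statistic and its df."""
    if chi2 <= 0:
        return 0.0
    return max(0.0, (chi2 - df) / chi2) * 100.0


def substantial_heterogeneity(chi2, df, p_value, alpha=0.05):
    """Apply the stated rule: I-squared >= 50% AND a significant chi-square.

    alpha = 0.05 is an assumed threshold, not one given in the text.
    """
    return i_squared(chi2, df) >= 50.0 and p_value < alpha


if __name__ == "__main__":
    # Values reported in the results of this review.
    print(round(i_squared(9.96, 2)))    # long-term PTSD diagnosis
    print(round(i_squared(21.62, 14)))  # 15-study short-term comparison
    print(substantial_heterogeneity(9.96, 2, 0.0071))
    print(substantial_heterogeneity(21.62, 14, 0.09))
```

The formula reproduces the I² values reported in the results (80% for χ² = 9.96 with df = 2, and 35% for χ² = 21.62 with df = 14).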

| Subgroup analysis
From a clinical perspective, combining studies that varied by the type of trauma experienced by the participants and the type of control conditions used in the analysis could impact the findings. To investigate the influence of such variability, two subgroup analyses were conducted. The first subgroup analysis was based on the type of control, that is, active treatment versus no treatment. Type of traumatic event was also used for a subgroup analysis between trauma with a perpetrator (such as war trauma, combat trauma, abuse and violence and torture) and trauma without a perpetrator (such as natural disasters and occupational trauma). Age is also a potential effect modifier.
If enough studies (N = 10; Higgins et al., 2019) with participants below 18 years of age were found, a subgroup analysis based on age was to be performed.

| Sensitivity analysis
Sensitivity analyses examined the effect of excluding trials that had a high risk of bias or that appeared to be outliers on visual inspection of forest plots. When the removal of these studies did not change the direction or significance of the treatment effect, they were retained in the final analysis.
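The retention rule can be made concrete with a minimal leave-one-out sketch. It assumes the DerSimonian-Laird estimator for the random-effects model and a 95% normal-approximation interval; the text does not name the estimator, so treat this as one plausible reading. Function names and the effect sizes in the example are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate (DerSimonian-Laird, assumed estimator).

    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re)), tau2


def summary(effects, variances):
    """Direction and significance of the pooled effect (95% CI excludes 0)."""
    pooled, se, _ = dersimonian_laird(effects, variances)
    lower, upper = pooled - 1.96 * se, pooled + 1.96 * se
    significant = lower > 0 or upper < 0
    return pooled, significant


def removal_changes_conclusion(effects, variances, index):
    """True if dropping one study flips the direction or the significance."""
    full_effect, full_sig = summary(effects, variances)
    rest_e = effects[:index] + effects[index + 1:]
    rest_v = variances[:index] + variances[index + 1:]
    reduced_effect, reduced_sig = summary(rest_e, rest_v)
    same_direction = (full_effect < 0) == (reduced_effect < 0)
    return not (same_direction and full_sig == reduced_sig)


if __name__ == "__main__":
    smds = [-0.9, -0.3, -0.25, -0.2, -0.35]      # hypothetical SMDs
    variances = [0.09, 0.04, 0.05, 0.03, 0.06]   # hypothetical variances
    for i in range(len(smds)):
        print(i, removal_changes_conclusion(smds, variances, i))
```

A study for which the check returns False would be retained under the rule described above.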

| RESULTS
A total of 306 results were retrieved from the database search, and 110 were chosen after screening titles. Inter-rater agreement for screening titles was substantial (k = 0.676). Six studies were added from the NET reference database received from NET authors. Twenty-six studies were then retrieved for full-text review.
Two studies had to be excluded at this stage; one study did not meet the randomization criteria (Crombach & Siehl, 2018), and the authors of the other study were unable to provide the required data for the meta-analysis (Hinsberger et al., 2017). Twenty-four trials were included in the final meta-analysis.
The PRISMA flow diagram in Figure B1 illustrates the selection of the studies included in the review.

| Included studies
A total of 1,391 participants were recorded in the trials included, with sample sizes ranging from 18 to 277. Studies were conducted across the world, including countries in Africa, Europe, Asia and North America. Most participants were survivors of war trauma and organized and/or personal violence, and in a few studies, participants were survivors of natural disasters. The mean age of the participants across trials ranged from 17 to 70 years. NET was administered in the intervention arm in 18 studies. Four trials used an adaptation of NET known as the Narrative Exposure Therapy for Forensic Offender Rehabilitation (FORNET; Hecker, Hermenau, Crombach, & Elbert, 2015) meant for persons with a history of perpetrated violence. In two trials, different versions of brief NET were used. In one trial, a combination of NET and interpersonal psychotherapy (IPT) was used. The most common control condition was wait-list controls (WLC; 10 trials; see Table A2). An average of 7.58 sessions of NET therapy was delivered across all 24 included trials with a range of 14.2. All studies provided details of attrition during treatment except one (Morath, Moreno-Villanueva, et al., 2014). The rate of attrition varied from 0% to 38.64%, with a mean attrition rate of 7.43% during treatment. Studies that reported dropout data highlighted the sensitive nature of refugee and asylum status (such as camp closures, receiving asylum, disappearance and transfers) as common reasons for dropout. Other reasons included a lack of motivation, trust, and psychosocial problems. One study with abuse victims reported emotional suffering over treatment as a reason for dropout (Orang et al., 2018).
Twenty-three trials reported the severity of PTSD symptoms as one of the primary outcomes, and 13 trials reported PTSD diagnos-  Sheehan et al., 1998) and Hopkins Symptoms Checklist (HSCL; Derogatis, Lipman, Rickels, Uhlenhuth, & Covi, 1974) were all used to measure this outcome in the form of self-report as well as structured interviews.
For a detailed description of the studies, please refer to Table A2. Figure B2 depicts the risk of bias assessment for the trials included in the meta-analysis.

| Random sequence generation
Recognized randomization procedures were used in 15 out of 24 trials, and these trials were rated as low risk. Nine trials were judged as reporting unclear randomization techniques. Some of these trials did not report the randomization technique, whereas others used techniques that may or may not ensure complete randomization (e.g., randomizing only a section of eligible participants and not the others and using restricted randomization, which could introduce selection bias). In two of these trials, assessment of baseline outcomes was used for the assignment of participants to treatment and control groups, which suggests that the sequence generation may not have been random (Hermenau, Hecker, Schaal, Maedl, & Elbert, 2013).

| Allocation concealment
Only two studies were judged to report adequate information about allocation concealment. The remaining 22 trials did not indicate information about allocation concealment. Most of these studies reported that groups did not significantly differ at baseline on outcomes or demographics, which could indicate adequate randomization. These studies were categorized as having unclear risk.

| Blinding of participants and personnel
Only one study reported blinding of participants to decrease the likelihood of further unblinding of outcome assessment. However, there was no mention of blinding personnel, as this is not possible in an RCT using a psychological intervention such as NET. This led to the trial being rated as unclear for performance bias (Jacob, Neuner, Maedl, Schaal, & Elbert, 2014). The remaining 23 trials were rated as having a high risk of performance bias.

| Blinding of outcome assessment
Seven trials were judged to be at high risk for detection bias due to not implementing appropriate blinding while assessing outcomes post-intervention and at follow-up. In four of these trials, patients accidentally revealed their treatment condition to the assessors. In two trials, the lead author, who was also one of the therapists, assessed outcomes, which placed these trials at high risk for detection bias. Four trials did not provide adequate information or suggested plausible, accidental unblinding and were rated as having unclear risk.
The remaining studies were rated as having a low risk of bias.

| Incomplete outcome data
Four trials were at high risk for attrition bias due to analysing only treatment completers. In one trial that analysed only treatment completers, authors found that demographics and study variables did not significantly predict treatment completers or dropouts in a logistic regression analysis (Orang et al., 2018). Further, the type of therapy did not significantly predict dropouts either. In a second trial, missing data from dropouts were replaced by estimation with a restricted maximum likelihood procedure, and no significant differences were found before and after data were replaced (Alghamdi, Hunt, & Thomas, 2015). Thus, the authors analysed only treatment completers. The authors in both trials did not publish these results in the final report. Therefore, these trials have been classified as having unclear risk. The remaining trials were classified as low risk due to either reporting no dropout or using some form of intention-to-treat (ITT) methods to account for missing data.

| Selective reporting
Nine trials were judged as low risk for selective reporting, as all primary and secondary outcomes reported in the protocol were matched with those reported in the final publication. A further six trials with published protocols were judged as having unclear risk due to not reporting secondary outcomes mentioned in the protocol in the final publication. Other trials judged as unclear risk were due to the lack of a protocol available. Two studies were categorized as high risk for selective reporting due to not reporting primary outcomes indicated in the protocol. Only four out of eight comparisons had 10 or more trials contributing to the analysis (PTSD symptoms at short term and midterm and depression symptoms at short term and midterm). A visual examination of the funnel plots (see Figure B3) for these comparisons did not clearly indicate asymmetry.

Therapist qualification
Seven trials reported that the interventions were delivered by clinical psychologists/counsellors with NET and trauma experience. Nine trials reported that the therapist was a PhD/graduate student or therapist who had explicit NET training per the manual. These studies were classified as low risk. Two trials did not specify the training or manual adherence procedures of their therapists (clinical psychologists and clinical psychology doctoral students). Hence, these trials have been classified as unclear in terms of risk. Three trials used trained 'lay counsellors' to deliver therapy. In these trials, the qualifications of the therapists were unclear. One trial used trained final year undergraduate psychology students, whereas another used clinical psychology undergraduate degree holders who were trained in NET. These trials were classified as having unclear risk because NET is a manualized technique that does not require clinical or medical qualifications for its administration. However, the lack of graduate qualification/healthcare training does not allow for these trials to be completely devoid of risk of bias.

Assessment issues
Only five studies used valid and reliable measures for the assessment of outcomes. These trials have been categorized as low risk. Eleven trials performed real-time translation and back-translation of outcome measures to capture the symptoms of PTSD and depression, which are primary outcomes of interest to this review. No information was provided about the psychometric properties of the translated measures. Further, translations and back-translations achieve linguistic equivalence, which often subsumes the importance of cultural meaningfulness and appropriateness to the context. One trial used structured interviews using interpreters to measure outcomes (Neuner et al., 2010). In these cases, one cannot rule out the risk of the assessor's influence in achieving the desired outcome. Therefore, these trials were classified as having an unclear risk of bias. A further eight trials did not report information regarding the translation of English language outcomes or the use of valid instruments to measure outcomes, which suggests that these trials are at a high risk of bias. Figure B4 depicts the risk of bias summary for the trials included.

| Effects of intervention
The main comparison was NET versus control for the treatment of PTSD and depressive symptoms and PTSD diagnostic status.

| Comparison: NET versus any control
Twenty-four full-text studies were included in the meta-analysis. Due to different scales of assessment being used to measure the outcomes discussed, results were combined as SMDs. All 24 included studies contributed to the above comparison. This comparison had nine analyses, and the results have been summarized in Table A3.

Short term (3-4 months).
Four studies relevant to this outcome were identified (total N = 173; see Figure B5). There was a significant effect of NET intervention on diagnosis compared with control, with an OR of 0.28 (95% CI 0.12 to 0.66, Z = 2.89, p = .004).

Long term (≥12 months).
Three relevant studies with a total of 190 participants were identified (see Figure B7). NET did not significantly outperform controls in this analysis (OR 0.68; 95% CI 0.16 to 2.87, Z = 0.51, p = .60). Further, significantly high levels of heterogeneity were found (χ² = 9.96; df = 2; p = .0071; I² = 80%). One study that appeared to be an outlier (Neuner, Schauer, Klaschik, Karunakara, & Elbert, 2004) was removed from the analysis. Although this reduced the heterogeneity estimate to 0% (χ² = 0.45, df = 1 [p = .45]; I² = 0%), the direction and significance of the effect were not altered, and the study was retained in the final analysis.

Short term (3-4 months).
Fifteen relevant studies contributed to this outcome, with a total of 813 participants (see Figure B8). NET had a small but statistically significant effect compared with controlled comparisons (SMD −0.30, 95% CI −0.49 to −0.11, Z = 3.05, p = .002). Moderate but non-significant heterogeneity was found (χ² = 21.62, df = 14 [p = .09]; I² = 35%). A sensitivity analysis was conducted in which an outlier study was removed from the analysis (Adenauer et al., 2011). This study also had a high risk of bias in three domains. Its removal reduced the heterogeneity (I² = 6%) and did not alter the significance of the treatment effect (SMD −0.21, 95% CI −0.37 to −0.06, Z = 2.78, p = .005). Therefore, the study was retained in the analysis because it did not alter the direction or significance of the treatment effect.
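As a rough consistency check on summary statistics of this kind, the Z statistic and two-sided p value can be approximated from a point estimate and its 95% CI. The sketch below assumes a normal approximation and, for ORs, a CI that is symmetric on the log scale; because the published estimates are rounded to two decimals, the recovered Z values will differ slightly from the reported ones. The function name is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def z_and_p_from_ci(estimate, lower, upper, ratio=False):
    """Approximate Z and two-sided p from an estimate and its 95% CI.

    Set ratio=True for odds ratios, which are analysed on the log scale.
    """
    if ratio:
        estimate, lower, upper = (math.log(x) for x in (estimate, lower, upper))
    se = (upper - lower) / (2 * Z_95)
    z = abs(estimate) / se
    p = math.erfc(z / math.sqrt(2))  # two-sided normal tail probability
    return z, p


if __name__ == "__main__":
    # Long-term PTSD diagnosis: OR 0.68 (95% CI 0.16 to 2.87)
    print(z_and_p_from_ci(0.68, 0.16, 2.87, ratio=True))
    # Short-term comparison: SMD -0.30 (95% CI -0.49 to -0.11)
    print(z_and_p_from_ci(-0.30, -0.49, -0.11))
```

Values recovered this way should land close to the reported Z and p; a large discrepancy would point to a transcription error in the estimate or its interval.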

Types of controls
To account for the clinical heterogeneity in the type of control conditions that NET was compared with across trials, a subgroup analysis of active controls (e.g., other psychological therapies, treatment-as-usual and psychoeducation) versus inactive control (e.g., no treatment and WLC) was performed at the three time points for PTSD and depression symptoms. For depression symptoms at midterm, NET significantly outperformed active interventions (SMD −0.50, 95% CI −0.84 to −0.16, p = .004), whereas a non-significant overall effect was found when compared with the no intervention subgroup. However, the subgroup differences did not reach statistical significance (p = .27). In this analysis, the active intervention group was characterized by significant heterogeneity (χ² = 21.28; df = 9; p = .0; I² = 58%), and the no intervention subgroup consisted of only three studies.
Subgroup differences for PTSD and depressive symptoms were not significant at any other time point based on the type of control groups (see Table A4).

Type of trauma
For the purpose of this analysis, types of trauma varied between trauma with a perpetrator and trauma without a perpetrator in trials, and these categories were compared in a subgroup analysis. Trauma without a perpetrator included singular events such as natural disasters or trauma induced by occupational stress. Trauma with a perpetrator consisted of either repetitive or consistent trauma or threat of trauma such as war, combat, torture and abuse.
Subgroup differences for PTSD and depressive symptoms were not significant at the short-term and midterm time points based on the type of trauma. PTSD and depressive symptoms in the long term could not be analysed for subgroup differences based on trauma due to a lack of relevant trials in the trauma-without-a-perpetrator subgroup (see Table A5).

Age
There were insufficient data to perform a subgroup analysis based on the age of participants.

Conversely, a statistically significant effect of NET was found on PTSD diagnostic status in the short-term and midterm time points, but these effects were not sustained at long-term follow-up. It is important to note that the outcome at long term was characterized by very high heterogeneity estimates. Further, only a small number of trials contributed to the analyses for PTSD diagnostic status at all time points (N < 5), which limits the reliability of the findings for this outcome. When RE models are used, including a substantial number of studies is necessary in order to make reliable inferences, especially in the presence of high between-study heterogeneity (Guolo & Varin, 2017; Seide, Röver, & Friede, 2019).

| Summary of main results
In terms of depression symptoms, NET emerged statistically superior to controlled comparisons only at midterm with a medium effect size (SMD = −0.49, 95% CI −0.79 to −0.20). Lely, Smid, et al. (2019) found that NET outperformed non-active controls with medium-to-large effect sizes (g = 0.79) for depression symptoms. In our study, this was not reflected in the subgroup analysis for types of control for depression symptoms at any time point. However, in line with their overall conclusion, it appears that NET is relatively less effective in treating depressive symptomatology.
Subgroup analyses for type of controls emerged significant only for PTSD symptoms at the short term, with NET performing better against no-intervention controls than against active interventions. Within the active intervention subgroup, a non-significant overall effect of NET was found (p = .62). Although this could imply that NET does not outperform active interventions as efficaciously as WLC and no-treatment controls, the wide 95% CI of the point estimate (−0.07, 95% CI −0.34 to 0.20) suggests caution in the interpretation of the summary statistic for this subgroup. The relatively high yet non-significant heterogeneity estimates (I² = 40%, p = .12) also suggest that within the active intervention group, there is potential for further subgroup analysis based on the type of the controls. Subgroup differences were not statistically significant for either outcome at any other time point. Lely, Smid, et al. (2019) found that NET significantly outperformed only non-active controls for both PTSD and depression symptoms. However, they do not specify the follow-up timings for controlled comparisons in their study, thereby limiting the comparability of these findings. Regarding type of trauma, the lack of subgroup differences suggests that NET is similarly efficacious across a range of trauma populations.
Conclusions about the evidence of treatment effect (especially when applied to clinical practice) must be made with caution due to high heterogeneity estimates when the data were statistically pooled.
An exploration of heterogeneity using the visual inspection method allowed us to identify outliers, but the removal of these trials did not reduce heterogeneity in every analysis. This is especially true in the case of PTSD symptoms at 6 months (significant effect) and PTSD diagnosis at 9-12 months (non-significant effect).

| Quality of evidence
In a systematic review by Gwozdziewycz and Mehl-Madrona (2013), the authors noted that the trials included in the review were validly designed and executed. However, we found varying degrees of bias across the included trials. Unclear randomization procedures were identified in a handful of trials, thereby potentially compromising the quality of these trials. The risk of allocation bias was mostly unclear across the review, because study authors did not provide enough detail to allow a definitive judgement. Almost all trials were considered to have a high risk of performance bias. This was due to treatment allegiance to a single psychotherapeutic intervention (NET), and the nature of the intervention made the blinding of personnel impossible. Moreover, while blinding of participants could be possible to some extent, the extensive use of WLC and no-treatment control groups made this impossible in most trials. Regarding the risk of bias in outcome assessment, most studies used self-report or assisted-report psychometric scales, translated and back-translated from English. No information was provided about the psychometric validity and reliability of the translated tools, and one cannot rule out the effect of interpreters or translators in achieving desired outcomes during assessments. As a result, several trials were judged as having an unclear risk of bias.
Studies varied in their assessment methods, including the scales used, the language of administration, the use of interpreters and the choice of diagnostic interview, self-report or assisted report. This could potentially explain the high heterogeneity among trials. Additionally, many RCTs did not clearly report methodological aspects such as blinding, allocation and the use of valid and reliable outcome measures, which led to several studies being judged as having an unclear risk of bias.
There were other methodological concerns to be considered.
Most of the trials reported WLC in addition to no-treatment conditions. Only a handful of studies used other active, bona fide interventions as controlled comparisons. As in Lely, Smid, et al. (2019), NET performed better against non-active comparators, but the difference was not statistically significant. The use of WLC in anxiety disorders research has been criticized (Patterson, Boyle, Kivlenieks, & Van Ameringen, 2016). Studies have identified ethical and humanitarian issues of delaying treatment for individuals who are undergoing acute distress or may be at risk for self-harm and suicide (Devilly & McFarlane, 2009). Additionally, the use of WLC may be associated with other methodological concerns related to increased risk of bias (Mohr et al., 2009) and larger effect sizes for the psychotherapy group (Furukawa et al., 2014). Specifically with regard to PTSD research, a meta-analysis of 20 studies and 418 participants demonstrated small-to-medium effect sizes (g = 0.34) for WLC, whereas bona fide trauma-focused treatments yielded very large effect sizes (g = 1.5; Devilly & McFarlane, 2009). Similar findings were demonstrated in a review of WLC effect sizes in social anxiety disorder research (Steinert, Stadter, Stark, & Leichsenring, 2017). In agreement with a narrative review of NET efficacy (Robjant & Fazel, 2010), it is problematic to draw conclusions regarding NET's superior efficacy compared with other trauma-focused treatments due to the lack of sufficient RCTs using such active treatments as controls.

| Limitations of the current review
The first limitation of this meta-analysis is that a protocol was not registered a priori. Preregistered protocols ensure methodological rigour and commitment, and the lack of a published protocol before the commencement of data collection and analysis is acknowledged as a shortcoming of this paper.
The risk of bias assessments were conducted with the Cochrane Collaboration's risk of bias tool available at the time of analysis (Higgins & Green, 2011). Since that time, a new version of the tool (RoB 2.0) was released by the Cochrane Collaboration (Sterne et al., 2019). The updated tool is not yet available on the RevMan software and is currently being piloted by the Cochrane Review team.
Future studies may employ RoB 2.0 to perform a comparable analysis of internal validity and quality appraisal.
Another limitation is that several trials used adaptations of NET ranging from NET specifically targeted to forensic offenders (FORNET) to brief versions of NET. No subgroup comparisons between RCTs that used NET and those that used adapted NET were planned, and therefore none were conducted. Targeted analyses of studies using specific adaptations would provide more conclusive data on their efficacy, but due to the small number of trials reporting homogeneous adaptations, it was not possible to determine whether NET can be successfully adapted to reduce symptomatology.
Other limitations of the paper are related to the included studies themselves. Some RCTs had very small sample sizes (in some cases below 10), with only two notably well-powered studies. Further, only four comparisons had over 10 studies contributing to the analysis. This raises the issue of small-study effects (i.e., the overestimation of intervention effects in trials with small to moderate trial size). Underpowered studies tend to show exaggerated treatment effects when compared with well-powered studies, as well as contributing to higher heterogeneity estimates (Turner, Bird, & Higgins, 2013). Further, one of the possible causes of small-study effects could be publication bias. Only published trials were included in this analysis, and the number of trials included in each comparison was low. It could be argued that with the inclusion of unpublished trials, smaller estimates of treatment effects for NET would emerge. Only four comparisons were eligible for funnel plot analyses, and a visual inspection did not reveal clear evidence of asymmetry. However, it is well known that asymmetry interpretations are subjective and that asymmetry may be caused by other issues such as methodological heterogeneity (Sedgwick, 2013). It must also be noted that tests for publication bias using funnel plots (such as Egger's test) are less reliable when small trials dominate the meta-analysis, as is the case here (Egger, Davey Smith, Schneider, & Minder, 1997). Therefore, the issue of small-study effects complicates the reliability of the NET evidence base considerably.
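For readers unfamiliar with the regression behind funnel plot asymmetry testing, the following is a minimal sketch of Egger's regression in its classic form: the standardized effect (effect divided by its standard error) is regressed on precision (the reciprocal of the standard error), and an intercept far from zero indicates asymmetry. This test was not run as part of the present review; the function name `egger_regression` and the numbers in the usage example are hypothetical. The sketch returns the t statistic and degrees of freedom only; obtaining a p value requires a t distribution from a statistics library.

```python
import math


def egger_regression(effects, std_errors):
    """Egger's regression for funnel plot asymmetry (ordinary least squares).

    Regresses effect/SE on 1/SE. The intercept estimates asymmetry;
    the slope estimates the underlying effect.
    """
    n = len(effects)
    if n < 3:
        raise ValueError("need at least three studies")

    x = [1.0 / s for s in std_errors]                 # precision
    y = [e / s for e, s in zip(effects, std_errors)]  # standardized effect

    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and the standard error of the intercept.
    df = n - 2
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = residual_ss / df
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")

    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "t": t_stat,
        "df": df,
    }


if __name__ == "__main__":
    # Hypothetical SMDs and standard errors (illustration only).
    smd = [-0.95, -0.70, -0.40, -0.55, -0.20, -0.30]
    se = [0.45, 0.38, 0.20, 0.30, 0.12, 0.15]
    print(egger_regression(smd, se))
```

Because the intercept is estimated from only as many points as there are trials, the test has little power when a comparison contains few studies, which is consistent with the caution expressed above.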

| Strengths and comparisons to other reviews
An attempt was made to access and include all relevant trials through the search strategy. Although it is possible that we may have missed some published or unpublished material, our correspondence with the NET authors suggests that our review is the most comprehensive and complete analysis to date.
In addition to the trials included in Lely, Smid, et al. (2019), we have reviewed a further eight trials as well as analysed PTSD diagnostic status, which makes our findings more comprehensive and more up-to-date. In terms of quality appraisal, we find our ratings to be more conservative, with several trials being rated as having a high risk of bias on domains rated by Lely, Smid, et al. (2019) as unclear. Subgroup analyses conducted in our review also attempted to address the issue of potential clinical heterogeneity in combining different trauma groups and trials using varied control groups. Our results suggest that other sources of heterogeneity must be explored to account for the high estimates in the analyses across comparisons. Lely, Smid, et al. (2019) concluded that despite methodological issues and high heterogeneity estimates, NET trials provide evidence of favourable treatment efficacy. However, we recommend greater caution in the interpretation of the meta-analytic findings in our comprehensive and more conservative overview of NET efficacy.

6 | CONCLUSION

| Implications for practice
Currently, low-quality evidence indicates the efficacy of NET over both active and non-active psychotherapeutic control treatments in primarily decreasing PTSD symptom severity and, to a small extent, depression severity. The data were significantly heterogeneous across most outcomes measured. Further, there was no impact of the type of trauma or type of controls on treatment effect, which leaves heterogeneity estimates unexplained. Low dropout rates from therapy are an indication of NET's acceptability and feasibility with sensitive populations. NET has been tested with a range of trauma groups, that is, with and without perpetrators. Although NET shows promise with both groups, the evidence base for NET is most strongly suited to the group it was originally intended for, that is, victims of war and organized violence. Although this suggests that NET may be suitable for diverse trauma groups, there is not enough data to draw conclusions regarding NET's applicability when there is no perpetrator involved. The highly heterogeneous treatment groups (in terms of trauma history, risk of a future threat and sociocultural-economic settings) further confound the pooling of effects.
NET is currently recommended by a range of clinical guidelines for PTSD treatment including APA, NICE, VA/DoD and ISTSS.
Regarding clinical efficacy, it is important to consider findings from reviews and meta-analyses. This involves considering the quality of the evidence base as opposed to focusing on data from individual RCTs or using narrative summaries. Although these trials may have found promising effects, the small sample sizes and methodological issues warrant caution when using the findings to inform policies and guidelines. Further, clinical guidelines have questioned the applicability of NET to non-refugee trauma populations. Because a majority of the trials so far have used NET with refugees and asylum seekers, we highlight this as a further limitation of the evidence base. In future research with NET, diverse trauma exposure across a range of sociocultural settings must be considered, specifically with non-refugee populations, to warrant its recommendation as a PTSD treatment approach.

| Implications for research
The NET evidence base includes RCTs conducted in diverse settings.
As previously discussed, Mundt et al. (2014) critiqued the NET evidence base as being inapplicable to low- and middle-income countries (LMIC) as well as being unable to address aspects of psychosocial well-being. In their review of interventions for torture survivors, Patel et al. (2014) raise similar concerns about the need to consider legal, contextual and psychosocial factors in the delivery of trauma interventions.
From a research perspective, RCTs of short-term interventions (of which NET is an example) have long been criticized for underestimating or altogether missing some of the most crucial aspects of psychotherapy, such as its self-correcting nature, its capacity to address multiple, interacting problems and co-morbidities, and therapist/practitioner-related moderators that impact outcomes and treatment effects (Fensterheim & Raw, 1996; Persons & Silberschatz, 1998; Seligman, 1995; Shean, 2014). However, as Rasmussen (2014) argues, RCT findings cannot be expected to address all aspects of psychosocial well-being but can augment components of care packages intended to effect recovery at multiple levels of well-being. The larger issue concerns the need for empirical health systems research, in that the scope of NET research needs to be widened to include the socio-economic, political and cultural contexts of the local communities in which the intervention is intended to be implemented. Holistic treatment and care systems must be evaluated for efficacy as opposed to a single component, that is, the intervention. This is especially applicable to NET research, a technique whose basis, that is, storytelling, is woven into the unique sociocultural fabric of diverse societies.

Mechanisms of change are an important, under-researched area when it comes to NET inquiry. A few included trials have isolated neurobiological and molecular correlates of recovery when NET is used (Adenauer et al., 2011; Morath, Gola, et al., 2014; Morath, Moreno-Villanueva, et al., 2014). Additionally, an understanding of psychological mechanisms of change can add rich detail for improving intervention protocols. A rigorous, in-depth qualitative analysis of the narratives or final testimony could provide new insight into change by identifying and isolating indicators of recovery or deterioration in accordance with treatment protocols in the manual. This evidence adds depth to statistical data, as testimonies are largely personal accounts of change and are bound to integrate the psychosocial, political, economic and cultural context of the treatment setting.
Homogeneous, well-powered trials using uniform methodological design as well as comparable outcome assessments (e.g., measures and follow-up time) are needed to substantiate NET's place among active trauma interventions. When complemented by deeper, qualitative analyses of mechanisms of change, NET's evidence base can be strengthened to reveal its true effect across trauma populations.

ACKNOWLEDGEMENT
The authors wish to extend their gratitude to Dr Farhad Shokraneh for his direction and guidance to author S. R. in carrying out the meta-analysis, as well as for making the time to clarify doubts at every step of the way. His expertise in conducting systematic reviews and meta-analyses was immensely valuable in undertaking this study. The authors received no specific funding for this work.

APPENDIX A.
TABLE A1 Search strategy example: PubMed