A systematic review and meta-analysis of gene therapy in animal models of cerebral glioma

Background: The development of therapeutics is often characterized by promising animal research that fails to translate into clinical efficacy; this holds for the development of gene therapy in glioma. We tested the hypothesis that this is because of limitations in the internal and external validity of studies reporting the use of gene therapy in experimental glioma. Method: We systematically identified studies testing gene therapy in rodent glioma models by searching three online databases. The number of animals treated and median survival were extracted and studies graded using a quality checklist. We calculated median survival ratios and used random effects meta-analysis to estimate efficacy. We explored effects of study design and quality and searched for evidence of publication bias. Results: We identified 193 publications using gene therapy in experimental glioma, including 6,366 animals. Overall, gene therapy improved median survival by a factor of 1.60 (95% CI 1.53-1.67). Study quality was low and the type of gene therapy … Conclusion: As the dysregulation of key molecular pathways is characteristic of gliomas, gene therapy remains a promising treatment for glioma. Nevertheless, we have identified areas for improvement in conduct and reporting of studies, and we provide a basis for sample size calculations. Further work should focus on genes of interest in paradigms recapitulating human disease. This might improve the translation of such therapies into the clinic.


Introduction
The prognosis for patients with malignant glioma remains poor despite extensive experimental and clinical research. 1 The most effective treatments tested in randomized controlled trials show a median survival of 14 months from diagnosis, 2 only marginal progress from the 9 months median survival reported in clinical trials in the 1970s. 3,4 This poor prognosis reflects the highly invasive nature of malignant glioma and resistance to conventional anticancer treatments. [5][6][7] These tumours are proliferating lesions within an otherwise quiescent organ and rarely metastasize, rather progressing by diffuse invasion along white matter tracts and by inducing oedema through generation of abnormal blood vessels. 8 Therefore they are an ideal target for local gene therapy, as such genes can be designed to target mitotic cells, 9,10 providing a basis for cell-selective cytotoxic therapy. Similarly, genes can be designed and introduced to modulate key processes in glioma growth and the body's response to it, such as angiogenesis and the host immune response, both of which play a key role in disease progression. 11,12 However, as is the case in many areas of biomedical research, the promising results from preclinical animal models have failed to be translated effectively into the clinic. [13][14][15] This is exemplified by two unsuccessful phase III randomized controlled trials of gene therapy 16,17 and variable responses in other smaller clinical studies. 6,[18][19][20][21] Several narrative reviews have described the promise of gene therapy in malignant glioma 7,[22][23][24] ; however, thus far this promise has remained unfulfilled in the clinical setting. Three complementary reasons for this failure have been proposed: (1) efficacy is overstated in animal models, (2) potential efficacy is understated in human clinical trials or (3) animal models simply do not recapitulate the human disease with sufficient fidelity to be useful.
Systematic review and meta-analysis can provide a more transparent and objective summary of a field of research than narrative reviews; they allow assessment of scientific rigour of included studies using standard instruments. In addition, stratified meta-analysis can explore the impact of independent study design variables (termed external validity) on reported outcome; assess the prevalence and impact of measures to reduce bias such as randomization, blinded outcome assessment and sample size calculations (termed internal validity); and can provide evidence of possible publication bias, a phenomenon in which comparative overreporting of small efficacious studies versus small ineffective studies leads to a false overestimation of the benefit of a given therapy. 25,26 Several previous studies on experimental models of neurological disease have demonstrated flaws in the internal validity of studies and have shown that reporting of such measures can significantly affect efficacy estimates. 15,[27][28][29] Consequently, assessing these features forms a critical domain of systematic review and meta-analysis in the preclinical literature.
We hypothesized that limitations in the internal and external validity in animal modelling lead to an overstatement of efficacy of gene therapy in animal models of glioma. Here we use systematic review and meta-analysis to describe the relationship between study design, study quality and the reported improvements in median survival, and the fidelity with which limitations in the animal data were taken into account in the design of human clinical trials.

Literature search and inclusion criteria
Relevant full publications and meeting abstracts were identified by electronic searching of three online databases (Pubmed, Embase and Web of Knowledge) using the search terms: <gene therapy> AND <<glioma> OR <brain tumour> OR <brain tumour> OR <brain neoplasm> OR <glioblastoma> OR <ependymoma> OR <astrocytoma> OR <oligodendroglioma>>. Results were limited to animal studies with no language or date limits. Following comments raised in the review, we tested whether the term <gene therapy> was overly restrictive by searching for <glioma> and <thymidine kinase> to determine whether additional studies would be identified.
The inclusion criteria were adapted from previously published criteria 30 and required studies to report: (1) a single form of gene therapy, (2) a rat or mouse model of glioma, (3) the glioma cell line used, (4) intracerebral implantation of the tumour, (5) median survival data reported within the text or which could be calculated from Kaplan-Meier survival graphs and (6) the number of animals in the control and treatment group(s). We defined a single gene therapy as the use of a single vector containing either one or multiple genes. To improve the sensitivity of identification of relevant studies, each publication identified in the electronic search was assessed individually against the inclusion and exclusion criteria by two of four independent reviewers (SC, ALM, TCH and MRM), with differences resolved by discussion.

Methodological quality
A 9-item quality checklist was adapted from the CAMARADES (Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Studies) published criteria 31 and the glioma-specific score previously described by our group. 30 The checklist comprised (1) publication in a peer-reviewed journal and the reporting of (2) the number of tumour cells implanted, (3) randomized allocation of tumour-bearing animals to treatment and control groups, (4) blinded assessment of outcome, (5) a sample size calculation, (6) compliance with animal welfare regulations, (7) a potential conflict of interest, (8) the number of animals originally inoculated with tumour cells and (9) an explanation of any treated animals excluded from survival analysis. While not detailed as a quality checklist item in the study protocol, in response to comments raised in review we have also considered whether the study provided evidence of successful transduction and gene expression in vitro, and whether the study provided evidence of infection, replication and expression in the tumour in vivo.
We extracted data for median survival time, the number of animals in both the treatment and control groups and details of study design characteristics (the gene therapy used, animal species, co-morbidities, tumour cell line, gene therapy vector, route of administration, number of doses, delay to treatment and method of determining survival and presentation of data (i.e. textual or graphical)). We grouped gene therapy into broad categories of angiogenesis, DNA repair, immunomodulation, oncolytic and "other". We calculated an effect size for each comparison by dividing the median survival in the treatment group by the median survival in the control group to give a median survival ratio.
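The effect size defined above is a simple ratio, and a minimal sketch may make the definition concrete. The function name is hypothetical; the example values are the medians across all control and treatment groups reported in the Results (25 and 40 days).

```python
def median_survival_ratio(median_treatment_days, median_control_days):
    """Effect size as described in the text: treatment median / control median."""
    if median_control_days <= 0:
        raise ValueError("control median survival must be positive")
    return median_treatment_days / median_control_days


# Medians across all treatment and control groups as given in the Results.
ratio = median_survival_ratio(40, 25)
print(f"median survival ratio = {ratio:.2f}")
```

A ratio above 1 indicates longer median survival in the treatment group; a ratio of 1 corresponds to the line of neutral treatment effect in the figures.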
A preliminary stratification identified that studies that used a prodrug in the treatment group (to activate the gene therapy) but not in the control group were associated with significantly larger effects than those using prodrug in both groups, suggesting biological activity of the prodrug alone. Therefore, studies using a prodrug were only included where the same prodrug was also used in the control group. Some studies reported more than one control group. We considered the most appropriate control group to be the one that was most similar to the treatment group, while offering no functional gene therapy, according to a hierarchy in which prodrug with non-functioning vector was preferred to prodrug with saline, which was preferred to prodrug only. Where a prodrug was not used, the hierarchy was non-functioning vector in preference to saline only in preference to no treatment. For studies in which more than 50% of animals survived until the end of the experiment, we used the last time point at which survival was reported to give a conservative measure of median survival.
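The control-group hierarchy and the conservative median rule can be sketched as follows. This is an illustration under stated assumptions, not the extraction procedure actually used: the string labels and function names are hypothetical, and the median is taken as the first time at which the surviving fraction falls to 50% or below, assuming no animals are censored before the end of the experiment.

```python
import math

# Preference order described in the text; earlier entries are preferred.
# The labels themselves are hypothetical names chosen for this sketch.
PRODRUG_HIERARCHY = [
    "prodrug + non-functioning vector",
    "prodrug + saline",
    "prodrug only",
]
NO_PRODRUG_HIERARCHY = ["non-functioning vector", "saline only", "no treatment"]


def pick_control(available, uses_prodrug):
    """Return the most preferred control among those a study reports."""
    hierarchy = PRODRUG_HIERARCHY if uses_prodrug else NO_PRODRUG_HIERARCHY
    for label in hierarchy:
        if label in available:
            return label
    return None  # assumption: no eligible control means no usable comparison


def conservative_median(death_days, n_animals, last_reported_day):
    """Median survival, falling back to the last reported time point
    when more than 50% of animals survive to the end of the experiment."""
    deaths = sorted(death_days)
    survivors = n_animals - len(deaths)
    if survivors > n_animals / 2:
        return last_reported_day
    # First death at which the surviving fraction is <= 50%.
    k = math.ceil(n_animals / 2)
    return deaths[k - 1]


print(pick_control({"prodrug + saline", "prodrug only"}, uses_prodrug=True))
print(conservative_median([20, 22, 25], n_animals=8, last_reported_day=60))
```

Using the last reported time point when the median is not reached understates the true median in the treatment group, which is why the text describes the measure as conservative.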
Individual study effect sizes were weighted by the number of animals for that comparison, as there is no inherent measure of variance available for median survival data. Where a control group served more than one treatment group we corrected the weighting of the study by dividing the number of animals in the control group by the number of treatment groups served. Effect sizes were calculated on log-transformed data 32 using the random effects model of DerSimonian and Laird, 33 as we expected significant heterogeneity between experiments. We estimated the standard error of the summary estimates from the inter-study variance (as described previously 27,34 ).
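A minimal sketch of this kind of pooling is given below. It applies the textbook DerSimonian and Laird estimator to log-transformed ratios and makes two assumptions that the text does not confirm: each comparison's within-study variance is taken as 1 divided by its number of animals (so that the fixed-effect weight equals the number of animals), and a comparison's animal count is the treatment group plus the corrected share of the control group. All function names and example values are hypothetical, and the standard error formula may differ from the one used in the cited methods.

```python
import math


def comparison_weight(n_treatment, n_control, groups_sharing_control):
    """Animals contributing to one comparison when a control group is shared."""
    return n_treatment + n_control / groups_sharing_control


def dersimonian_laird(ratios, n_animals):
    """Pool median survival ratios on the log scale (random effects).

    Assumption of this sketch: within-study variance = 1 / n_animals.
    Returns (pooled ratio, (95% CI low, 95% CI high), tau squared).
    """
    y = [math.log(r) for r in ratios]
    w = [float(n) for n in n_animals]
    k = len(y)
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study.
    w_star = [1.0 / (1.0 / wi + tau2) for wi in w]
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    low, high = pooled - 1.96 * se, pooled + 1.96 * se
    return math.exp(pooled), (math.exp(low), math.exp(high)), tau2


# Hypothetical comparisons (not data from the review).
ratios = [1.2, 1.5, 2.4, 1.8]
animals = [comparison_weight(8, 8, 1), comparison_weight(10, 10, 2), 12, 20]
estimate, (ci_low, ci_high), tau2 = dersimonian_laird(ratios, animals)
print(f"pooled ratio {estimate:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), tau^2 = {tau2:.3f}")
```

Working on the log scale keeps the ratio symmetric around 1, so that a doubling and a halving of median survival carry equal weight before back-transformation.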
We used stratified meta-analysis to estimate the significance of differences between groups of studies by partitioning heterogeneity and using the χ² distribution with n − 1 degrees of freedom (where n is the number of strata). We performed stratified analyses on the complete dataset that included all gene therapies reported. We also performed analyses on a more homogeneous subset of data that consisted of only the most common gene therapy, herpes simplex virus thymidine kinase activated by ganciclovir (GCV). To allow for multiple comparisons (we performed 26 comparisons; 15 on the complete dataset and 11 on the thymidine kinase subset of data) we adjusted our significance level to p < 0.0019 using Bonferroni correction for 26 tests of statistical significance in the same dataset. We used funnel plotting, 35 Egger regression 36 and "trim and fill" 37 to assess for the presence of publication bias.
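The partitioning of heterogeneity and the Bonferroni threshold can be illustrated with a short sketch. It assumes the between-strata statistic is the total heterogeneity minus the sum of the within-stratum heterogeneities, which is the usual way of partitioning; the function names and the example Q values are hypothetical. The χ² tail probability is computed with a standard-library series so that the block has no dependencies, and it is only suitable for moderate values of the statistic (it is not intended for values in the high hundreds).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, using the series
    expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def between_strata_test(q_total, q_within_strata, alpha=0.05, n_tests=26):
    """Test whether a stratification accounts for heterogeneity.

    Returns (Q_between, df, p, significant after Bonferroni correction).
    """
    q_between = q_total - sum(q_within_strata)
    df = len(q_within_strata) - 1
    p = chi2_sf(q_between, df)
    threshold = alpha / n_tests  # 0.05 / 26, about 0.0019
    return q_between, df, p, p < threshold


# Hypothetical heterogeneity values for four strata.
q_between, df, p, significant = between_strata_test(100.0, [30.0, 25.0, 15.0, 5.9])
print(f"Q_between = {q_between:.1f}, df = {df}, p = {p:.2g}, significant = {significant}")
```

Dividing 0.05 by 26 gives roughly 0.0019, which is the threshold quoted throughout the Results.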
To estimate the statistical power of a typical experiment, we calculated the median observed values for median survival in the control and treatment groups and the median numbers in each group and used the "stpower exponential" command in Stata. Data extracted from studies included in the review and the results of meta-analysis and publication bias assessment are available from the Dryad Digital Repository: http://doi.org/10.5061/dryad.bs8c4. 38

Results
We identified 3,860 publications, of which 208 met our inclusion criteria (Figure 1; Appendix S1, Supporting Information). Of these, 193 publications reported data suitable for meta-analysis; these described 427 comparisons using 6,366 animals (Appendices S1 and S2).
Overall, 127 different gene therapies were tested. A total of 101 used a single gene and 26 used two genes in a single vector (Appendices S2 and S3). Thymidine kinase was the most common gene therapy (61 comparisons, given as a single gene therapy in 49) followed by IL-4 (23 comparisons), IL-2 (21 comparisons) and tumour necrosis factor-related apoptosis inducing ligand (TRAIL, 18 comparisons). Across all 127 gene therapies there was a significant improvement in the median survival time (survival ratio 1.60, 95% CI 1.53-1.67), and there was significant between-study heterogeneity (χ² = 1,522; df = 426, p < 0.0019; I² = 72%). While the approaches to gene therapy were diverse, the broad category of gene therapy used did not account for a significant proportion of the observed heterogeneity (Figure 2).
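The I² statistic quoted above can be related to the heterogeneity statistic and its degrees of freedom. The sketch below assumes the standard definition, I² = (Q − df) / Q, which the text does not state explicitly; the function name is hypothetical.

```python
def i_squared(q, df):
    """Share of total variation attributed to between-study heterogeneity,
    assuming the standard definition (Q - df) / Q, floored at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


# Heterogeneity statistic and degrees of freedom reported for the complete dataset.
print(f"I^2 = {i_squared(1522, 426):.0%}")
```

With the reported values (1,522 and 426), this definition reproduces the quoted 72%.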

Risk of bias
The median number of quality checklist items reported was three of a possible nine (interquartile range (IQR) 3-4; Appendix S4); 193 (100%) publications were in peer-reviewed journals, 170 (88.1%) reported the number of tumour cells implanted, 23 (12.4%) randomly allocated animals to group, 7 (3.6%) blinded the assessment of outcome, 133 (68.9%) had a statement of compliance with animal welfare regulations, 15 (7.8%) had a statement of a potential conflict of interest, 24 (12.4%) reported the number of animals originally inoculated with the tumour and 41 (21.1%) gave an explanation of any treated animals excluded from the survival analysis. A total of 139 studies (72.0%) provided evidence of successful transduction and gene expression in vitro and 90 (46.6%) provided evidence of infection, replication or expression in the tumour in vivo. No publication reported a sample size calculation and the median number of animals in each of the control and treatment groups was eight (IQR 6-10). In 90 publications it was reported that animals were killed when they manifested signs reflecting disease of a certain severity (rather than allowing them to die of their disease), and in the remainder of studies the circumstances of death (euthanasia or spontaneous) were not reported.
The aggregate number of quality checklist items scored or the reporting of randomized group allocation did not account for between-study heterogeneity (Figure 3A). Only seven publications (9/427 comparisons) reported the blinded assessment of outcome, too few to allow further analysis. We did not identify any differences in treatment effects between studies that reported survival data within the text and those where data were extracted from a graph.
Bias introduced by an excess of small, imprecise studies was suggested by asymmetry in the funnel plot (Figure 3B) and by Egger regression (11.3 ± 0.301; t = 11.3, p < 0.001; Figure 3C) but not by "trim and fill".
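For readers unfamiliar with Egger regression, the idea can be sketched as an ordinary least-squares fit of the standardized effect (effect divided by its standard error) on precision (1 divided by the standard error); an intercept that differs from zero suggests small-study effects. The sketch below uses only the standard library. Because median survival data carry no inherent variance, the standard errors passed in are a hypothetical stand-in for whatever measure of precision is used, and the function name is an assumption.

```python
import math


def egger_regression(effects, standard_errors):
    """Least-squares fit of effect/SE on 1/SE.

    Returns (intercept, slope, SE of intercept, t statistic of intercept).
    Requires at least three comparisons.
    """
    x = [1.0 / se for se in standard_errors]
    y = [e / se for e, se in zip(effects, standard_errors)]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)
    se_intercept = math.sqrt(s2 * (1.0 / n + mean_x ** 2 / sxx))
    t = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, slope, se_intercept, t


# Hypothetical log effect sizes in which less precise studies report larger effects.
effects = [0.9, 0.7, 0.55, 0.5, 0.45]
ses = [0.50, 0.40, 0.25, 0.15, 0.10]
intercept, slope, se_int, t = egger_regression(effects, ses)
print(f"intercept = {intercept:.2f} (SE {se_int:.2f}), t = {t:.2f}")
```

When every study estimates the same effect regardless of its precision, the fitted line passes through the origin and the slope equals that common effect.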
The method of gene delivery (molecules, viruses, cells or virus-producing cells) had a significant impact on the reported effect size. Cells and virus-producing cells were associated with the largest treatment effects (χ² = 24.1, df = 3, p < 0.0019; Figure 4A).
Furthermore, the selection of viral delivery system accounted for a significant proportion of between-study heterogeneity (χ² = 53.0, df = 3, p < 0.0019). The greatest estimates of effect were observed where retroviruses and adeno-associated viruses were used. Similarly, selection of cellular delivery system also accounted for significant between-study heterogeneity (χ² = 56.2, df = 5, p < 0.0019). Bone marrow-derived stem cells and neural stem cells were associated with the largest effect sizes. Several gene therapy paradigms were reported that usually included the concomitant use of a prodrug; 67 comparisons reported such combinations, for instance thymidine kinase with GCV, or cytosine deaminase (CD) with 5-fluorocytosine (5-FC). Furthermore, these same gene therapies were sometimes used without the appropriate prodrug (35 comparisons). Gene therapies using a prodrug were associated with an increased effect size if that prodrug was used (1.90; 95% CI 1.67-2.14, n = 67) when compared with the same gene therapies where no prodrug was used (1.27; 1.14-1.41, n = 35; χ² = 45.6, df = 1, p < 0.0019), with the median survival ratio being 50% higher (95% CI 35-65%). A further 17 comparisons involved prodrugs not usually associated with the gene therapy used (i.e. prodrug used alongside that gene in fewer than 50% of comparisons; e.g. interferon with 5-FC, tumour necrosis factor with GCV). The most commonly used prodrugs were GCV (40 comparisons) and 5-FC (24 comparisons, see Appendix S2).

Figure 2. … therapies into those using a single gene or those using multiple genes in a single vector did not account for between-study heterogeneity (p > 0.0019, n = 387 and 40 respectively; black plots with bold labels). Furthermore, subcategorizing the single gene group by the broad mechanism of action of that gene did not account for between-study heterogeneity (p > 0.0019, grey plots). Plots represent mean ± 95% confidence interval (CI) and the diamond represents a measure of number of comparisons within each stratum. The dotted line represents the level of neutral treatment effect.

Figure 3. External validity. A. Stratification by aggregated quality score did not account for between-study heterogeneity, implying no variation in efficacy with quality score in this dataset (p > 0.0019). The grey band represents global 95% confidence intervals (CIs); columns represent mean ± 95% CI and column width a measure of number of comparisons within each stratum. The solid line represents the level of neutral treatment effect. B. Funnel plots showing effect size (x-axis) versus a measure of study precision (y-axis). The dataset appears to be skewed, with imprecise studies generally showing more efficacy than those with larger sample sizes. The solid line represents the line of neutral treatment effect and the dotted line marks the global efficacy estimate. C. Egger regression plot depicting effect size × precision (x-axis) versus precision (y-axis). Regression revealed a positive intercept (p < 0.001) implying an excess of small, imprecise studies. The vertical solid line represents the level of neutral treatment effect; dotted lines represent 95% CI of the regression.

Figure 4. … A. The vector used to deliver gene therapy accounted for a significant proportion of heterogeneity in the complete dataset, and there was evidence of differing efficacy between different cellular and virus vector paradigms (p < 0.0019). The grey band represents global 95% confidence intervals (CIs); plots represent mean ± 95% CI and the diamond represents a measure of number of comparisons within each stratum. The dotted line represents the level of neutral treatment effect. B. The number of doses accounted for between-study heterogeneity (p < 0.0019), the largest effect seen with five doses. The grey band represents global 95% CIs; columns represent mean ± 95% CI and column width a measure of number of comparisons within each stratum. The solid line represents the level of neutral treatment effect. C. The route of gene therapy delivery accounted for heterogeneity, with the largest efficacy associated with intra-arterial and intraperitoneal systemic delivery, and ipsilateral and contralateral intracranial therapy (p < 0.0019). There was no observed difference in efficacy between intracranial and systemic administration. The grey band represents global 95% CIs; plots represent mean ± 95% CI and the diamond represents a measure of number of comparisons within each stratum. The dotted line represents the level of neutral treatment effect. D. The delay to treatment (where "0" refers to the day of tumour inoculation, ">0" therapy initiation post-tumour inoculation and "<0" therapy initiated before tumour inoculation) accounted for heterogeneity (p < 0.0019), with therapy given concomitantly with tumour inoculation giving the greatest efficacy. The grey band represents global 95% CIs; columns represent mean ± 95% CI and column width a measure of number of comparisons within each stratum. The solid line represents the level of neutral treatment effect.
The number of gene therapy doses administered (ranging from 1 to 7) also accounted for a significant proportion of the between-study heterogeneity. We observed a direct relationship between the number of doses and effect size where up to five doses were given, followed by a fall in efficacy where six or seven doses were administered (χ² = 50.1, df = 6, p < 0.0019; Figure 4B). Intracranial gene therapy delivery was common (328/427 comparisons) and we stratified these into intratumoural, ipsilateral (gene therapy introduced into the same cerebral hemisphere as the tumour), contralateral (opposite hemisphere), coinoculation (tumour and vector inoculated together), pretransfection (glioma cells transfected before tumour inoculation) and unspecified intracerebral. While these groups contain information on both location and time of implantation, we used a single stratification as the variables display collinearity (i.e. coinoculated cells can only be implanted at the same time as the tumour, intratumoural injection requires an established tumour to be present). Of these routes, intratumoural was the most common (226/328), followed by coinoculation (36/328) and pretransfection (31/328). The remaining comparisons, except for one unknown, were systemic, the most commonly used routes being subcutaneous (49/98) and intravenous (32/98). There was no difference in observed effect size between treatments that were delivered centrally and those that had to cross the blood-brain barrier (delivered systemically); however, we did observe a significant portion of heterogeneity accounted for by more specific stratification of route of delivery (χ² = 45.0, df = 11, p < 0.0019; Figure 4C). The routes associated with greatest efficacy were ipsilateral and contralateral central delivery, which were more effective than intratumoural treatment.
Gene therapy was delivered from 1 month before to 1 month after tumour induction. We stratified the data into three groups (before, the same day as, or after the induction of tumour), and the timing of treatment had a significant impact on reported efficacy (χ² = 13.0, df = 2, p < 0.0019; Figure 4D). Where tumour cells were treated in vitro prior to implantation these were classified as therapy starting on the same day as implantation. The type of control group used did not account for any between-study heterogeneity.
Median survival across all control groups was 25 days and across all treatment groups was 40 days. With a median of eight animals in control and treatment groups we estimate that a typical study in this cohort has only 17% statistical power to detect the median change in median survival. This compares with around 30% in experimental stroke studies (CAMARADES group, data not published), and a convention of seeking power of 80 to 90% in well-conducted clinical trials. As a guide for future investigators, in Appendix S5 we present the relationship between statistical power and number needed per group for the above comparison and for a range of median survival ratios that might be sought in future experiments.
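A rough feel for these power figures can be obtained with a simple normal approximation. The sketch below is not the Stata "stpower exponential" routine used in the analysis and should not be expected to reproduce its output exactly. It assumes exponential survival, equal group sizes, a two-sided test at 5% and that every animal reaches the endpoint, so that the variance of the log hazard ratio is approximately 2 divided by the number of animals per group. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(median_control, median_treatment, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two exponential survival
    curves with equal group sizes and no censoring."""
    # Under exponential survival the hazard ratio is the inverse of the
    # median survival ratio, so the absolute log values coincide.
    log_ratio = abs(math.log(median_treatment / median_control))
    se = math.sqrt(2.0 / n_per_group)
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    z = log_ratio / se
    return normal.cdf(z - z_crit) + normal.cdf(-z - z_crit)


# Medians and group size reported in the text.
print(f"approximate power with 8 per group: {approx_power(25, 40, 8):.2f}")

# Smallest group size reaching 80% power under the same assumptions.
n = 2
while approx_power(25, 40, n) < 0.80:
    n += 1
print(f"animals per group for 80% power: {n}")
```

Under these assumptions the estimate for eight animals per group falls in the same low range as the reported 17%, which illustrates how far typical group sizes fall short of the 80 to 90% convention.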

Thymidine kinase dataset sub-analysis
We performed a sensitivity analysis for the most common gene therapy paradigm (thymidine kinase with GCV, 30 comparisons using 446 animals). Thymidine kinase gene therapy was associated with a significant increase in median survival (1.99; 95% CI 1.68-2.37) and between-study heterogeneity comparable with the complete dataset (χ² = 49.9, df = 29, p < 0.0019; I² = 69%).
The risk of bias appears to be similar between the two datasets. The median study quality score was again 3 (IQR 2-3.25) and we did not find an association between the number of study quality checklist items scored and efficacy (Figure 6A). Small study bias was suggested by asymmetry in the funnel plot and a positive intercept on Egger regression (11.56 ± 1.76; t = 6.61, p < 0.001; Figure 6B and C), but not in "trim and fill" analysis. Again, there were no differences between studies that reported survival data within the text and those where data had to be extracted from a graph, or between the types of control used.
Only one study used more than one dose of gene therapy, only three used a route of delivery other than intracranial inoculation and only three reported efficacy in animals with co-morbidity; these data were not analysed further. There was no effect of the method of gene delivery, the vector used (Figure 6D), the time to delivery, the species used or the species of origin of the tumour cell line.
Searching for <glioma> and <thymidine kinase> identified only one additional study. In a sensitivity analysis including data from this study, there were no changes of substance either in the point estimates of efficacy or in any of the conclusions drawn.

2015 | Volume 1 | Issue 1 | e00006 Page 26

Discussion
In this first systematic review and meta-analysis of gene therapy in experimental glioma we show substantial and significant prolongation of median survival across a range of experimental conditions. However, there is a high risk of bias in included studies and we observed substantial heterogeneity; consequently these results should be interpreted with caution. We observed influences on reported efficacy by vector type, control type, delay to treatment, glioma model selection, animal, immune status and the method of determining survival. Further high-quality preclinical investigation developed to better define the impact of the study design characteristics listed above, and in particular for those in which we identified substantial efficacy in conditions that reflect those seen in human disease, may provide promising avenues for clinical trial development.

Study quality
Overall the quality of studies was limited; the median number of quality checklist items scored was three of a possible nine (IQR 3-4). Our threshold for significance testing was conservative (Bonferroni correction for 26 stratifications), and while testing for an effect of study quality on treatment efficacy did not reach this significance threshold (p = 0.024), an association is likely. Only 12% of studies reported randomization and fewer than 4% reported the blinded assessment of outcome. As two thirds of studies met either three or four checklist items and none met more than seven, we have not been able to ascertain whether high-quality studies give lower estimates of efficacy. This contrasts with findings from other models of neurological disease, where the prevalence of reporting of such factors is higher and where there is evidence that high-quality studies report lower estimates of efficacy. 15,39,40
We found some evidence of publication bias, including a positive intercept using Egger regression, but the trim and fill approach did not impute any theoretical missing studies. It has been suggested that trim and fill is less powerful than Egger regression, 36 but the small study effects detected by Egger regression may have other causes, particularly where there is substantial heterogeneity between studies, as is the case here. 41 Our findings are consistent with the presence of publication bias of the same order as reported in a previous systematic review of temozolomide in experimental glioma. 27

Figure 5. Glioma model setup. A. Gene therapy in rats was associated with greater efficacy than in mice (p < 0.0019). B. Immune status was associated with heterogeneity: use of animals with severe combined immunodeficiency was associated with greater efficacy than athymic or normal counterparts (p < 0.0019). The grey bands represent global 95% confidence intervals (CIs); columns represent mean ± 95% CI and column width a measure of number of comparisons within each stratum. The solid line represents the level of neutral treatment effect. C. Glioma model was associated with between-study variance (p < 0.0019), with greatest efficacy seen with N32 and G203 lines in the complete dataset. Furthermore, the species of tumour origin was associated with heterogeneity (p < 0.0019). The grey band represents global 95% CIs; plots represent mean ± 95% CI and the diamond represents a measure of number of comparisons within each stratum. The dotted line represents the level of neutral treatment effect.

Study design features affecting external validity
We found no differences between broad categories of gene therapies so we analysed all therapies together, with a sensitivity analysis using only the most commonly used therapy. It is likely that certain individual therapies were substantially more or less effective than the overall estimate but 79% of these were tested in fewer than four experiments (Appendix S3), and in these circumstances meta-analysis contributes little that cannot be gleaned from an examination of primary data. Nonetheless, thymidine kinase was more efficacious than average (median survival ratio 1.99 vs. 1.60). While this supports there being differences in efficacy between treatments, there are alternative explanations, for instance differences in the animal and glioma models used. We found differences in efficacy associated with different vector-delivery mechanisms, routes of gene therapy delivery, numbers of doses and delays to treatment. Delivery of gene therapy using stem cells was associated with greatest efficacy in both datasets, and this might relate to more effective delivery to the required site of action, more sustained gene expression or other factors. In the complete dataset, intracranial delivery (rather than intralesional) and multiple dosing (particularly four or more doses) were most effective. Given the difficulties of transfecting tumours in humans with glioma, we were surprised that, taken together, systemic therapies were as efficacious as those delivered intracranially. 42,43 Over 90% of the thymidine kinase studies administered gene therapy intracranially with a single dose, analogous to clinical practice. 16,[44][45][46] We also found that features of the disease models used were widely variable and significantly affected the observed efficacy. A total of 40 different glioma cell lines were used, originating from humans, mice and rats.
No studies used mice with spontaneously occurring or induced glioma cells and only one reported the use of cells recently extracted from human glioma specimens. Median survival ratios for different cell lines ranged from 1 to 8.56 (median 1.57; IQR 1.34-1.71). This suggests that cell line selection is one of the most important factors for investigators to consider during experimental design. Equally prominent were the species used and their immune status. Rats and mice were used in both datasets, and roughly one third of the animals used were immune-suppressed; this was associated with greater efficacy. The immune system plays an important role in the body's response to glioma, so the immunocompromised mouse may not be an ideal model of human disease.
Consistent with the modelling of other neurological conditions, these data suggest that the efficacy of gene therapies in glioma is characterized by heterogeneity in both the disease model used and the treatment delivery. Further, we observed a low prevalence of reporting of measures to reduce the risk of bias (see Appendix S4 for details), possible overstatement of efficacy in studies at risk of bias, and publication bias. 15,29,40,[47][48][49] There are a number of potential weaknesses to this approach. Foremost, meta-analysis is essentially an observational technique. When we stratify by various study design characteristics and measures of study quality or bias, it may be that there are other unknown differences that are the cause of observed differences. For this reason we have been rigorous in only investigating sources of heterogeneity that we prespecified in a protocol. Our findings demonstrate association rather than causality; the observation that treatments delivered using cellular vectors are more effective than those using viruses or molecular approaches may be due to differences in gene delivery efficiency between these routes, or alternatively it may be that different types of interventions are more suited to different vector systems, and that these interventions differ as a class in their efficacy, or a difference in other study design characteristics shown to be associated with heterogeneity. For this reason our findings should be considered hypothesis-generating only. However, further high-quality preclinical studies can be used directly to test any hypothesis of interest.
Secondly, summarizing a field of research (as we have attempted here) requires by necessity the combination of data from experiments that are, to a greater or lesser extent, dissimilar. In these circumstances, the meaning of a summary estimate of efficacy across a range of studies has limited relevance other than to provide a yardstick of the magnitude of effect that might be expected of an intervention. However, we believe the statistical explanation of the differences between studies, especially in the face of the substantial heterogeneity observed here, is valid and important. This rationale is the basis on which we have deemed it appropriate, corroborated by the evidence that broad groups do not differ in efficacy, to collate all gene therapies into a single analysis-following this, heterogeneity was accounted for by measures of study design rather than the gene therapy paradigm itself. In support of these findings, we ran a separate analysis on the most commonly used gene therapy paradigm, thymidine kinase with GCV; in general we found that the same factors relating to study design and internal and external validity of these experiments had a significant impact on efficacy. Given the presence of such diversity in study design, our findings on study quality, randomization, controlling, experimental design and the consistency of these between the two datasets provide validation of our approach.
Our findings are only as reliable as the data on which they are based, and we have shown that these data are likely to be confounded by poor study quality and by publication bias. However, our search strategy was broad, accepting conference abstracts and publications in languages other than English, so our approach is likely to provide a better summation of what is known than narrative reviews (which are subject to the same potential biases), and the impact of selection bias is likely to have been reduced to the minimum possible. We used the term "gene therapy" in our search rather than detailing specific genes so that we might identify the largest number of studies, not just those where the use of that gene was already widely known. The term "gene therapy" may have been unduly restrictive, but searching for "thymidine kinase" and "glioma" in PubMed, without further limitations, identified only one additional study, inclusion of data from which had no impact on the overall estimate of efficacy. Finally, we have attempted to minimize false positives in our statistical tests by adjusting for multiple comparisons.
2015 | Volume 1 | Issue 1 | e00006 Page 29
The use of meta-analysis to summarize median survival data is not well established. Because we did not have access to data for individual animals we could not pool hazard ratios as has been suggested for clinical studies, 50 and instead have used methods reported previously, 27,34 based on the work of Simes et al., 32 as a summary estimate that is comparable with hazard ratio pooling.
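As an illustration of how median survival ratios might be pooled, the following is a minimal sketch and not the procedure used in this review. It computes each study's median survival ratio (treated median divided by control median), works on the log scale, and applies the DerSimonian-Laird random-effects estimator, which is swapped in here as a widely used method. Because median survival data carry no reported variance, the sketch assumes a within-study variance proxy of 1/n, where n is the number of animals in the comparison. That proxy, the function name and the input layout are all hypothetical.

```python
import math


def pool_median_survival_ratios(studies):
    """Pool median survival ratios with a DerSimonian-Laird random-effects model.

    studies: list of (median_treated, median_control, n_animals) tuples.
    Returns (pooled_ratio, ci_low, ci_high, tau_squared).
    ASSUMPTION: within-study variance of the log ratio is approximated by 1/n.
    """
    log_ratios = [math.log(t / c) for t, c, _ in studies]
    variances = [1.0 / n for _, _, n in studies]  # assumed variance proxy

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_ratios)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_ratios))

    # DerSimonian-Laird estimate of between-study variance (tau squared).
    df = len(studies) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau squared to each within-study variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, log_ratios)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * se),
        math.exp(pooled + 1.96 * se),
        tau2,
    )


if __name__ == "__main__":
    # Hypothetical data: (median survival treated, median survival control, n).
    example = [(32.0, 20.0, 16), (45.0, 30.0, 20), (28.0, 25.0, 12)]
    ratio, low, high, tau2 = pool_median_survival_ratios(example)
    print(f"pooled ratio {ratio:.2f} (95% CI {low:.2f} to {high:.2f}), tau^2 {tau2:.3f}")
```

Working on the log scale keeps ratios above and below 1 symmetric, and the back-transformed estimate can be read in the same way as a factor improvement in median survival.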

Within these limitations, are there any implications for future research or for the design of clinical trials?
In this systematic review and meta-analysis, we have presented substantial evidence that features relating to the risk of bias and experimental design of animal studies significantly affect the observed efficacy of gene therapy for experimental glioma. However, another issue yet to be addressed is that of construct validity; there is evidence from these data that translation of gene therapy from experimental to clinical glioma has failed because the experimental models do not recapitulate human disease.
The optimized conditions that are generally created for animal studies do not recapitulate the heterogeneity of human glioma patients, as these studies are all undertaken on homogeneous rodent populations. While techniques such as meta-analysis seek to counteract this homogeneity, the breadth achieved still does not reflect that of the human population. For example, many of these animals are immune-compromised, a feature that is uncommon in clinical practice. We have observed a wide variety of glioma models used, but those most commonly selected (GL261, U87 and C6) tend to grow quickly and relatively noninvasively into large discrete spheres, 51,52 contrasting sharply with irregularly shaped, poorly defined, infiltrative human glioblastoma multiforme tumours. 1,8 Each cell line has certain properties that do relate to human disease (for example, extensive capillary networks in U87 models, 53 white matter invasion and low immunogenicity in GL261, [54][55][56] and gene mutations in C6 that are comparable to human glioma 57 ). These tumours are therefore appropriate for the study of particular components of glioma biology (such as angiogenesis in U87 or immune therapies in GL261) but perhaps lack the robustness for survival studies preceding translation into clinical trials. The recent emergence of the glioma stem cell hypothesis (implicating a cell with stem-like features in the aetiology and pathogenesis of human glioma) has influenced the design of novel preclinical models, 58 but these, to our knowledge, have not yet been adopted into animal studies of gene therapy. Another novel practice is the use of patient-derived xenografts, where animals are inoculated with tumour cells prepared from fresh human surgical specimens rather than cells from established in vitro cultures. These models may be more characteristic of human disease and provide genetic heterogeneity not seen with traditional glioma models; 59 however, we identified only one relevant study using this approach.
Finally, in animal studies, gene transfection rates are evidently high enough to be therapeutic, even when vectors are delivered systemically. Indeed, GL261 tumours are transfected very efficiently by adenoviruses. 60 This contrasts with human therapy, where transfection rates are low; the blood-brain barrier is obstructive in glioma therapeutics, preventing the use of systemic vector delivery, 43,61 even when vectors are delivered distal to the ophthalmic artery (Dr Robin Grant, personal communication), as tumour penetration is poor and side effects high. When vectors are implanted locally, distribution throughout the tumour is difficult to achieve. 43 This may be attributable to differences in the central nervous system anatomy and the host immune system. 43 The large between-study heterogeneity observed in our data suggests that the efficacy of gene therapy is very variable, depending at least in part on the features we have described and perhaps to a greater degree than is observed in other glioma treatments. 27,30 This matches the so-called lack of "robustness" seen with gene therapy in phase II and III clinical trials 42 that has ultimately led to failure.
The statistical power of the experiments included in this meta-analysis was low, and no study reported a formal sample size calculation. Improving the statistical power of gene therapy experiments may help to reduce heterogeneity by reducing the chance of type II (false negative) errors, and may also increase the predictive value of positive studies where the prior probability of success was low. 62,63 We hope the community finds our guide to statistical power (Appendix S5) helpful in the design of future experiments.
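The relationship between statistical power and the predictive value of a positive result can be illustrated with a short calculation. The sketch below is illustrative only and is not taken from Appendix S5. It uses the standard relationship in which the positive predictive value equals the proportion of significant results that arise from true effects, given the power, the significance threshold and the prior probability that the tested hypothesis is true. The function name and the example values (a prior probability of 0.1 and a threshold of 0.05) are assumptions chosen for illustration.

```python
def positive_predictive_value(power, alpha, prior):
    """Probability that a statistically significant finding reflects a true effect.

    power: probability of detecting a true effect (1 - type II error rate).
    alpha: significance threshold (type I error rate).
    prior: prior probability that the tested hypothesis is true (assumed value).
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenario: only 1 in 10 tested hypotheses is true.
    for power in (0.2, 0.5, 0.8):
        ppv = positive_predictive_value(power, alpha=0.05, prior=0.1)
        print(f"power {power:.1f}: positive predictive value {ppv:.2f}")
```

With a fixed threshold and a low prior probability, raising the power raises the share of positive findings that are true, which is why underpowered experiments weaken the evidential value of the positive results they do report.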
In spite of these limitations, gene therapy treatment for experimental glioma appears to be effective when initiated at later time points, and efficacy was observed against cells of human origin. Both these features are pertinent to successful treatment for human disease, as tumours are only discovered after a period of growth. Gene therapy was effective when given either intracranially or systemically, although this does not seem to correlate with clinical experience. Efficacy appeared to be highest when five doses of the gene therapy were given but, given the difficulties of systemic administration in humans, it may not be practicable to implant locally more than once. The most effective number of treatments was not explored in any of the included publications and is an important topic for further animal study. As such, we recommend that future preclinical research focuses on genes ratified in both animal and human glioma cell biology, using orthotopic tumours and intracranial gene delivery over one or more doses; studies should ideally use stem-like cancer cells or patient-derived xenografts, or at least provide a rationale for tumour model selection, in non-immune-compromised animals where possible. These studies should be registered and randomized, should use blinded assessment of outcome, and should provide a sample size calculation in all but hypothesis-generating experiments, in accordance with ARRIVE guidelines. 64
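One way a sample size calculation for a survival experiment might be approached is sketched below. This is an illustration under stated assumptions and not the guide provided in Appendix S5. It uses Schoenfeld's approximation for the number of deaths required by a two-sided log-rank test with equal group sizes, and it assumes exponentially distributed survival times so that the hazard ratio is the reciprocal of the median survival ratio. That distributional assumption is a simplification that may not hold for rodent glioma models, and the function name and default values are hypothetical.

```python
import math
from statistics import NormalDist


def required_deaths(median_survival_ratio, alpha=0.05, power=0.80):
    """Total deaths needed across two equal groups (Schoenfeld approximation).

    median_survival_ratio: expected treated median divided by control median.
    ASSUMPTION: exponential survival, so hazard ratio = 1 / median survival ratio.
    """
    if median_survival_ratio <= 0 or median_survival_ratio == 1:
        raise ValueError("median_survival_ratio must be positive and not equal to 1")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    log_hazard_ratio = math.log(1.0 / median_survival_ratio)

    # Schoenfeld: d = 4 * (z_alpha + z_beta)^2 / (ln HR)^2 for 1:1 allocation.
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / log_hazard_ratio ** 2)


if __name__ == "__main__":
    # Example: planning around the overall median survival ratio of 1.60.
    deaths = required_deaths(1.60)
    print(f"total deaths required: {deaths}")
```

In experiments where every animal is followed until death, the required number of deaths equals the total number of animals, so the result can be halved to give a per-group figure. Smaller expected effects or higher target power increase the requirement.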

Conclusions
Gene therapies are associated with substantial increases in median survival in animal models of cerebral glioma, but because of concerns about the internal (study quality), external (study design) and particularly construct (recapitulation of human disease) validity of this literature, these findings should be interpreted with caution. Our analysis suggests that a strategy based on multiple treatments with viral or cellular vectors expressing genes of interest delivered locally, tested in the potentially more relevant tumour models described recently, represents a plausible approach to developing gene therapies for glioma. However, the issues of study quality and construct validity of existing models that we have identified here should be addressed in further animal studies if such strategies are to have the best chance of success.