Understanding in vivo modelling of depression in non-human animals: a systematic review protocol

The aim of this study is to systematically collect all published preclinical non-human animal literature on depression to provide an unbiased overview of existing knowledge. A systematic search will be carried out in PubMed and Embase. Studies will be included if they use non-human animal experimental model(s) to induce or mimic a depressive-like phenotype. Data that will be extracted include the model or method of induction; species and sex of the animals used; the behavioural, anatomical, electrophysiological, neurochemical or genetic outcome measure(s) used; risk of bias/quality of reporting; and any intervention(s) tested. There will be no exclusion criteria based on language or date of publication. Automation techniques will be used, where appropriate, to reduce human reviewer time. Meta-analyses will be conducted if feasible. This broad systematic review aims to gain a better understanding of the strengths and limitations of current approaches, models and outcome measures used. It also aims to provide insights into factors affecting the efficiency of model induction and the efficacy of interventions. Here, we outline the protocol for a systematic review and possible meta-analysis of the preclinical studies modelling depression-like behaviours and phenotypes in animals.

1 | BACKGROUND
1.1 | What is already known about this disease/model/intervention? Why is it important to do this review?
Depression is a mental illness characterized by "low mood, loss of interest and pleasure or loss of energy." 1 It is the leading cause of disability in the world 2 and is currently the brain disorder with the highest financial cost in Europe. 3 The number of people diagnosed with depression worldwide is estimated to be 400 million. 4 Depression places a huge burden on patients and poses a great cost to healthcare systems and governments. The rate of remission with antidepressant medication is, at best, 70% and may only be achieved after several levels of intervention. 5,6 Despite decades of investigation into depression, little is known about the biological mechanisms underpinning the disease. 7,8 With a better understanding of the mechanisms causing depression, the development of novel and more reliable treatments might be possible. There is therefore a solid rationale for further investigation into the mechanisms and factors that contribute to the development of depression. This is a highly important area to tackle, from both a clinical and a preclinical perspective.
Preclinical investigations contribute significantly to understanding the mechanisms underlying depression, which can, in turn, inform treatment development and increase translational success of clinical research. One example of this contribution is the systematic review and meta-analysis of antidepressants for the treatment of stroke. 9 Analysis of the evidence in the preclinical animal literature informed key aspects of the trial design in the subsequent FOCUS trial. 10 Preclinical experiments have the ability to model and dissect important mechanisms of depression and therefore provide insights into the neurobiology of the disorder. 11 Preclinical experiments can also investigate the safety and efficacy of proposed treatments prior to exposure in human cohorts. 12 The knowledge from preclinical investigations can aid prevention research, translating findings into the best and earliest interventions for the human disease, which are top research priorities recently identified by an MQ: Transforming Mental Health report. 13 Due to the sheer volume of preclinical investigations of depression, it is difficult to achieve an overview of what is already known and to assess the marginal contribution of new research. 23 In this context, a systematic review of the existing preclinical literature could provide an unbiased, collective overview of existing knowledge and allow the additional contribution of new research to be assessed. It could also provide better understanding of the laboratory methods used to induce the condition, the range of outcome measures used to assess depression phenotypes and the variables that might impact the efficacy of different treatments. 14
The findings from this systematic review and meta-analysis may also contribute to the refinement of methods used in animal investigations of depression, reducing the distress caused to animals by substitution with equally informative methods of lower severity, and contribute to the optimisation of the numbers of animals used in depression research by informing well-founded power calculations.
What is meant by an experimental non-human animal model of depression?
An experimental model in preclinical non-human animal neuropsychiatric research is defined as including both a dependent variable (an outcome measurement) and an independent variable (model induction or manipulation). 15 We differentiate between 3 broad experimental designs within the animal depression and antidepressant drug literature:
1. Studies which compare a control group with a group of animals that receive a lesion (model induction) on an outcome measure. These studies may also have a drug arm.
2. Studies which only compare a "lesion" group of animals with a group of lesioned animals that receive a drug intervention. Once a lesion method has been sufficiently established (known to be valid and reliable), experimenters know that the lesion will induce a depressive-like phenotype.
3. Studies which use an outcome measure to assess animals that receive a drug intervention. Once an outcome measure has been established (known to be valid and reliable), experimenters know that it can reliably measure a depressive-like phenotype.
As a starting point, in order to be thorough, we will look at experiments that investigate the differences between a control group and a lesion or model group. This will provide a basis for characterising a depression model. In these experiments, investigators are directly manipulating a variable intended to produce a depressive-like phenotype and measuring the effects of this manipulation on a given outcome measure. These experiments may or may not include the presence of a drug group/arm. On this basis, we will characterize all the known models/lesions. From this, drug interventions that have been tested on known models can be characterized. Secondly, the known outcome measures will be extracted from the control vs model investigations. Once the most commonly used outcome measures are known, there is further scope for characterizing the studies that investigate drug interventions with known outcome measures. We aim to unpick these different experiment design types and evaluate the evidence from all of these (Table 1).
Not all aspects of the human condition can be modelled. Some typically modelled phenotypes in non-human animals include anhedonia and disturbances in sleep and/or food consumption. However, we will not exclude the possibility that novel research has attempted to investigate other aspects not previously modelled in non-human animals, and therefore, there are no restrictions on what phenotypes are modelled, only that they are present in the manifestation of the condition in humans.

2.2 | Specify the population/species studied
All preclinical studies on any animal species at any stage of development will be included.

2.3 | Specify the intervention/exposure
This study will investigate any mode of inducing depressive behaviour or a model that seeks to mimic the human condition or symptoms of depression using genetic, surgical, pharmacological, developmental or behavioural interventions or a combination of interventions. We will include models induced acutely, chronically, genetically or through a combination of these methods. We will also consider experiments where the efficacy of a treatment or intervention is tested in such models.

2.4 | Specify the control population
Studies will be included in this review if they include a suitable control, defined as a cohort of animals that have not been exposed to the method of inducing depressive-like behaviour that was used to create the depressive model. The control cohort may have received an appropriate equivalent, for example, sham surgery instead of lesion or placebo without the active ingredient. For studies that investigate treatment efficacy, a suitable control is defined as a cohort of animals that have had the same exposure to model the disorder as those that are given a treatment but have not been exposed to the treatment tested and may instead receive a placebo by an equivalent route of administration. For studies investigating drug intervention on an outcome measure, a suitable control is defined as a cohort of animals that are not exposed to the drug treatment and may instead receive a placebo by an equivalent route of administration.

2.5 | Specify the outcome measures
2.5.1 | Primary outcome measure
The primary outcomes are the behavioural outcome measures reported in animal studies that induce a depressive-like phenotype.

2.5.2 | Secondary outcome measures
Secondary outcomes include anatomical outcomes, electrophysiological outcomes, neurochemical outcomes and prevalence of reporting of measures to reduce risk of bias.

2.5.3 | Tertiary outcome measures
Tertiary outcomes include drug efficacy; inter-rater agreement in the application of the inclusion criteria; and sample size, genomic, proteomic and metabolomic outcomes.
A machine learning approach is proposed to assist with the screening phase for inclusion and exclusion criteria. A seed set of papers will be screened by 2 independent human screeners, and the machine learning algorithm will be trained on this set. Any discrepancies will be resolved by a third human screener. We will pilot the most promising approaches in the context of an ongoing collaboration where we are developing machine learning tools for systematic review. The protocol for this approach is under development and will be uploaded to the CAMARADES website (camarades.info).

Pilot testing of the machine learning algorithm: A random sample of Using this approach can reduce the screening workload by at least an estimated 50%, reducing the number of papers needed to be screened by 2 independent human reviewers to less than 35 183 papers.
Quality assessment: A small subset of the papers that the machine learning algorithm classifies, both included and excluded, will be checked by a human screener to verify the performance of the algorithm.
Validation: We will validate the machine learning techniques for screening by sampling, as opposed to having 2 human screeners manually screen every record. A randomly selected proportion of records that are included and excluded by the algorithm will be double checked by human screeners to ensure the gold standard is maintained. We will continuously monitor the articles screened by the machine learning algorithm by sampling. The machine learning approaches must reach a level comparable to the human screening gold standard, defined as at least 95% sensitivity. Among the algorithms that meet this threshold, the one with the highest specificity will be chosen.
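The selection rule described above (a sensitivity threshold of at least 95%, then the highest specificity) can be illustrated with a minimal Python sketch. The class name, function name and the counts used in the example are hypothetical and are not part of the protocol; the human double-screened sample is assumed to be the reference standard.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ScreeningResult:
    """Confusion counts for one candidate algorithm against human screening (hypothetical structure)."""
    name: str
    true_pos: int   # included by humans and by the algorithm
    false_neg: int  # included by humans, excluded by the algorithm
    true_neg: int   # excluded by humans and by the algorithm
    false_pos: int  # excluded by humans, included by the algorithm

    @property
    def sensitivity(self) -> float:
        return self.true_pos / (self.true_pos + self.false_neg)

    @property
    def specificity(self) -> float:
        return self.true_neg / (self.true_neg + self.false_pos)


def choose_algorithm(results: Sequence[ScreeningResult],
                     min_sensitivity: float = 0.95) -> Optional[ScreeningResult]:
    """Keep algorithms reaching the sensitivity threshold, then pick the most specific one."""
    eligible = [r for r in results if r.sensitivity >= min_sensitivity]
    if not eligible:
        return None  # no candidate matches the human gold standard
    return max(eligible, key=lambda r: r.specificity)


# Illustrative counts only, not data from the review.
candidates = [
    ScreeningResult("A", true_pos=96, false_neg=4, true_neg=700, false_pos=300),
    ScreeningResult("B", true_pos=99, false_neg=1, true_neg=500, false_pos=500),
    ScreeningResult("C", true_pos=90, false_neg=10, true_neg=900, false_pos=100),
]
best = choose_algorithm(candidates)
```

In this illustration, candidate C has the highest specificity but is not eligible because its sensitivity is below the threshold.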

| Phase 2
Two independent screeners are responsible for full-text analysis and data extraction, with the aid of machine learning and text mining tools where appropriate, for example, risk of bias classification. A third independent screener will resolve any discrepancies.

3.3 | Study selection criteria
3.3.1 | Type of study design

| Inclusion criteria
Any article providing primary data of an animal model of depression or depressive-like phenotype with an appropriate control group (specified above).

| Exclusion criteria
Review articles, editorials, case reports, letters or comments, conference or seminar abstracts, and studies that provide primary data but lack an appropriate control group.

| Inclusion criteria
Animals of all ages, sexes and species, where a depressive-like phenotype intended to mimic the human condition has been induced. This includes animal models where depressive-like phenotypes are induced in the presence of a comorbidity (e.g. obesity or cancer).

| Exclusion criteria
Human studies and ex vivo, in vitro or in silico studies. Studies will be excluded if authors state an intention to induce or investigate only anxiety or anxious behaviour. Studies will be excluded if there is no experimental intervention on the animals (e.g. purely observational studies).

| Inclusion criteria
All studies that claim to model depression or depressive-like phenotypes in animals. Studies that induce depressive behaviour or model depression and that also test a treatment or intervention (prior or subsequent to model induction), with no exclusion criteria based on dosage, timing or frequency.

| Exclusion criteria
Studies that investigate treatments or interventions without inducing depressive behaviour or a model of depression (e.g. toxicity and side-effect studies).

| Inclusion criteria
Studies measuring behavioural, anatomical and structural, electrophysiological, histological and/or neurochemical outcomes. Studies measuring genomic, proteomic or metabolomic outcomes will be included only where these are measured in addition to behavioural, anatomical, electrophysiological, histological or neurochemical outcomes.

| Exclusion criteria
Studies in which metabolic outcome measures are the primary outcome measure. Studies in which genomic, proteomic, metabolic or metabolomic outcomes are the sole outcome measures.

| Inclusion criteria
All languages (using automated translations where required).

| Inclusion criteria
All publication dates.

| Inclusion criteria
Studies must investigate methods or models that induce depressive phenotype/s in vivo, or authors must claim that they investigate a model of depression.

| Exclusion criteria
Studies claiming to induce only anxiety behaviour or a model of anxiety. In cases where both models of anxiety and depression are investigated, the study will be included, and only the depression-related data will be extracted. In the case of data duplication (2 or more papers reporting the same data), the paper reporting the smallest dataset or fewest outcomes will be excluded. Studies will be excluded if they model aspects of bipolar disorder, manic symptoms, obsessive-compulsive behaviours, panic disorder or psychotic symptoms.

| Order of priority exclusion criteria per screening phase
Selection phase 1: screening based on title and abstract
1. Article must be a primary research article (excluding reviews, comments or letters).
3. Exclude ex vivo, in vitro or in silico investigations.
4. Exclude study if no depressive behaviour or model of depression has been induced.
Selection phase 2: full text screening
1. The above criteria from selection phase 1.
2. Exclude if no appropriate outcome is measured.
3. Exclude if no appropriate control group.
4. Exclude where sufficient data cannot be extracted and authors do not respond to requests for required information.
5. Exclude the study with the least information in the case of multiple publications describing the same work. 18

4 | STUDY CHARACTERISTICS TO BE EXTRACTED

| Study meta-data
The first author, corresponding author, year, title, journal name, source of funding and DOI will be extracted.

| Study design characteristics
The number of animals in the experimental and control groups will be extracted. If the number of animals is given as a range, the most conservative estimate will be extracted.

| Animal model characteristics
The species, strain, sex (male or female), age and/or weight of animal will be extracted.
2. Details of the outcome measure (e.g. the sub-type or name of the outcome measure and, e.g. in the case of food restriction, the length of time the animal was restricted for).

| Method of model induction/intervention characteristics
3. The number of times the outcome measure was assessed.
4. The number of different outcome measures the animal was tested on.
5. The category of the behaviour or biomarker the outcome measure is measuring (e.g. anhedonia, sleep or weight loss, markers of oxidative stress).
6. Any measures taken before the disease model induction will be extracted. The details of the before-and-after comparison will be extracted.
7. Has an appropriate outcome measure been selected for use? Studies that induce a depression model and investigate the effect of a subsequent drug intervention should select a suitable test to measure an outcome (e.g. an outcome measure that does not rely on the same mechanism/behaviour as behaviour that might be affected by side-effects of a given drug).

| Treatment characteristics
The following information regarding the treatments tested will be extracted: the dosage of treatment given, route of delivery, mode of delivery, how long the treatment was given for. The length of time between the administration of the treatment and outcome measurement will be extracted as well as the length of time between the model induction and any treatment given if applicable. This information will be extracted regardless of whether an experiment simply assesses an outcome measure for a given drug treatment or if an experiment has induced a model of depression and tests a drug treatment.

| Other
The number of excluded animals and, if reported, the reasons for their exclusion will be extracted.

| Criteria to assess the internal validity of included studies
An adjusted CAMARADES checklist will be used to assess risk of bias, including the following criteria:
1. Publication in a peer-reviewed journal.
2. Reporting of random allocation.
3. Reporting of blinding of the conduct of the experiment.
4. Reporting of blinded assessment of outcome.
5. Use of comorbid animals (refers to animals where depression is investigated in the presence of another medical condition, e.g. stroke or diabetes).
6. Reporting of a sample size calculation.
7. Reporting of compliance with animal welfare regulations.
8. Reporting of a potential conflict of interest.
9. Reporting of exclusions of animals.
10. Whether a study protocol is available dated before the experiments began. 19
We will report the median number of study checklist items scored and the interquartile range.
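As an illustration of the summary statistic described above, the following Python sketch counts the checklist items scored per study and reports the median and interquartile range. The function names and example scores are hypothetical, and the quartile convention (the "inclusive" method of the standard library) is an assumption, since the protocol does not specify one.

```python
import statistics
from typing import Dict, Sequence, Tuple


def checklist_score(items: Dict[str, bool]) -> int:
    """Number of checklist items a single study scored (True counts as 1)."""
    return sum(1 for met in items.values() if met)


def summarise_scores(scores: Sequence[int]) -> Tuple[float, Tuple[float, float]]:
    """Return (median, (Q1, Q3)) of per-study checklist scores.

    Assumption: quartiles use the 'inclusive' method; other conventions
    give slightly different values for small samples.
    """
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return median, (q1, q3)


# Illustrative scores for five hypothetical studies (out of 10 items).
example_scores = [2, 3, 4, 5, 6]
median_score, iqr = summarise_scores(example_scores)
```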
6 | COLLECTION OF OUTCOME DATA
6.1 | Methods for data extraction/retrieval
1. Numerical data will be extracted from the full text of publications (mean, SD or SEM, P values (exact P value where possible) and group sample size).
2. In studies where data are presented only graphically, the software Universal Desktop Ruler, or a similar tool, will be used to extract the data into numerical values. For certain PDF presentations, it may be possible to use data mining approaches to extract these data.
3. If any data are missing, the corresponding authors will be contacted.
4. In the absence of a response from authors (we will allow 2 months for a reply, with a follow-up email sent after the first month), data will be excluded from analysis.
If the screeners or extractors consider that 2 sources may describe the same data, we will contact the authors seeking clarification. If we receive no response, we will include only the most recent data source.

| Data gathering and combination
All data will be gathered and entered in the CAMARADES-SyRF database. We will provide a qualitative summary along with several separate meta-analyses, where feasible.
7.2 | How the decision as to whether a meta-analysis is appropriate will be made
of the total records are expected to be relevant to the research question and included in subsequent meta-analyses. This is similar to previous systematic reviews in models of psychiatric disorders conducted at CAMARADES where about 10% to 15% of the studies were included in the analysis. We expect high heterogeneity between studies due to differences in the study designs; therefore, a meta-analysis is proposed to investigate sources of this heterogeneity.
groups, the size of the control group used in the meta-analysis will be adjusted by dividing it by the number of intervention groups it serves. If the number of animals is presented in a range, the most conservative estimate will be extracted (e.g. if presented as n = 6-12, we will consider n = 6). P values (exact P value where possible) will be extracted from primary analyses between model and control and between intervention and model in order to conduct P-curve analysis.
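The two sample size rules stated above (dividing a shared control group by the number of intervention groups it serves, and taking the lower bound of a reported range) can be sketched in Python as follows. The function names are hypothetical, and the parsing assumes that the reported text contains the sample size as plain integers.

```python
import re


def conservative_n(reported: str) -> int:
    """Return the smallest integer found in a reported sample size such as 'n = 6-12'.

    Assumption: every integer in the string is a candidate sample size.
    """
    numbers = [int(match) for match in re.findall(r"\d+", reported)]
    if not numbers:
        raise ValueError(f"no sample size found in {reported!r}")
    return min(numbers)


def adjusted_control_n(control_n: int, groups_served: int) -> float:
    """Divide the control group size by the number of intervention groups it serves."""
    if groups_served < 1:
        raise ValueError("a control group must serve at least one comparison")
    return control_n / groups_served


n_range = conservative_n("n = 6-12")      # lower bound of the range
n_shared = adjusted_control_n(12, 3)      # one control group, three treatment arms
```

The adjusted control size is deliberately left as a fraction, since it is used only for weighting and not as a count of animals.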
Categorical or qualitative information relating to the outcome measures, such as the behavioural measure or the symptom the model is trying to elucidate, will be extracted into a text/comment field or into a form drop-down menu.
A decision will be made once the data has been extracted as to which effect size is the most appropriate to use. As most outcome measures are continuous variables, and outcome measures are not likely to be measured on the same scale, Normalized Mean Difference (NMD) effect sizes will be calculated where possible. This effect size calculation will be used where an appropriate "sham" or "control" group is present 20 or where it is possible to impute the outcome in a "normal" animal. If the data are unsuitable for calculating NMD, Standardized Mean Difference (SMD) will be used. NMD and SMD will be calculated using the equations outlined in Vesterinen and colleagues. 20
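To make the two effect size options concrete, the sketch below computes a normalized mean difference and a standardized mean difference (Hedges' g) in Python. The formulas are commonly used textbook forms and are an assumption here; they may differ in detail from the equations in reference 20, which should be taken as authoritative. Function and parameter names are hypothetical, and "untreated" denotes the comparison group that received the model but no treatment.

```python
from math import sqrt
from typing import Tuple


def nmd(mean_sham: float,
        mean_untreated: float, sd_untreated: float, n_untreated: float,
        mean_treated: float, sd_treated: float, n_treated: float) -> Tuple[float, float]:
    """Normalized mean difference (% of the untreated deficit) and its standard error."""
    deficit = mean_untreated - mean_sham
    if deficit == 0:
        raise ValueError("NMD is undefined when untreated and sham means are equal")
    effect = 100.0 * (deficit - (mean_treated - mean_sham)) / deficit
    # Express both SDs on the same normalized (percentage) scale.
    sd_u = 100.0 * sd_untreated / abs(deficit)
    sd_t = 100.0 * sd_treated / abs(deficit)
    se = sqrt(sd_u ** 2 / n_untreated + sd_t ** 2 / n_treated)
    return effect, se


def smd(mean_c: float, sd_c: float, n_c: float,
        mean_t: float, sd_t: float, n_t: float) -> Tuple[float, float]:
    """Hedges' g (small-sample corrected SMD) and its standard error."""
    total = n_c + n_t
    pooled_sd = sqrt(((n_c - 1) * sd_c ** 2 + (n_t - 1) * sd_t ** 2) / (total - 2))
    d = (mean_c - mean_t) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * total - 9.0)
    var_d = total / (n_c * n_t) + d ** 2 / (2.0 * total)
    return d * correction, correction * sqrt(var_d)


# Illustrative values only.
nmd_effect, nmd_se = nmd(10, 50, 8, 10, 30, 8, 10)
g, g_se = smd(10, 2, 10, 8, 2, 10)
```

Sample sizes are typed as floats so that an adjusted (fractional) control group size can be passed directly.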

| Statistical model of analysis
The data extracted will in all likelihood cover different species, ages and sexes, as well as different study designs and models of induction.
Therefore, the true effect size is likely to differ between studies, and a random effects model will be used. 21 Statistical analyses will be performed using Stata, Statistical Software (College Station, TX: StataCorp LP).
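As a rough illustration of random-effects pooling, the following Python sketch uses the DerSimonian-Laird estimator of between-study variance. This estimator is an assumption chosen for illustration; the protocol does not state which estimator will be used in Stata. The function name and the example inputs are hypothetical. The sketch also returns Cochran's Q and I², which are discussed in the next subsection.

```python
from math import sqrt
from typing import Dict, Sequence


def random_effects(effects: Sequence[float], ses: Sequence[float]) -> Dict[str, float]:
    """DerSimonian-Laird random-effects pooling of study effect sizes."""
    k = len(effects)
    if k < 2 or k != len(ses):
        raise ValueError("need at least two studies with matching standard errors")
    w = [1.0 / se ** 2 for se in ses]                     # inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0       # between-study variance
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]       # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {"pooled": pooled, "se": sqrt(1.0 / sum(w_star)),
            "q": q, "df": df, "tau2": tau2, "i2": i2}


# Illustrative inputs only.
result = random_effects([0.0, 2.0], [1.0, 1.0])
```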

| Statistical methods to assess heterogeneity
Cochran's Q will be used for assessing heterogeneity; Q is used to calculate the excess variance (Q - k, where k is the degrees of freedom). A P value can be calculated for Q, giving an indication of whether the studies are consistent with a common effect size (P > 0.05) or not (P < 0.05). I² will be used to report heterogeneity as this describes the proportion of observed variance that reflects true differences in effect size between studies. 18
8.4 | Specify which study characteristics will be examined as potential sources of heterogeneity (subgroup analysis)
Meta-regression will be used to investigate the impact of different study characteristics on the outcome, where the effect estimate (NMD or SMD) is the dependent variable. Categorical variables will be transformed into dummy variables. Where there are sufficient data, a multivariate meta-regression will be used for both model induction and drug models. At least 10 independent comparisons per covariate investigated are required. 21 If there are insufficient data for multivariate meta-regression, univariate analysis will be used, requiring a total of at least 25 independent comparisons.

| Model induction model
Sub-groups analyses: A separate model will be used to investigate the effect of drug intervention on outcomes.

| Drug model
Sub-group analyses:
1. Drug Treatment or Intervention.
4. Treatment or intervention dose.
5. Treatment or intervention route.
6. Number of times the treatment or intervention is administered.
9. Randomisation (yes/no).
10. Blinding:
a. Allocation concealment (yes/no)
b. Assessment of outcome (yes/no)
11. Source of funding (public vs industry)
Sensitivity analyses will be performed to assess how missing data from study characteristics and effect size might have affected the results. This will be presented in the form of a summary table.

| Correction for multiplicity of testing
Where there are more than 2 groups being compared in a univariate model, we will use the Holm-Bonferroni correction for multiplicity of testing.
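The Holm-Bonferroni procedure named above is a step-down method: P values are sorted in ascending order and the i-th smallest is compared with alpha divided by the number of hypotheses not yet rejected. A minimal Python sketch follows; the function name and the significance level of 0.05 are assumptions for illustration.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each P value in its original position, whether the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest P first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Threshold tightens the fewer hypotheses have been rejected so far.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger P values are retained
    return reject


# Illustrative P values only.
decisions = holm_bonferroni([0.01, 0.04, 0.03])
```

In this example only the first comparison is rejected: 0.01 is below 0.05/3, but 0.03 exceeds 0.05/2, so the procedure stops.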

| Method for assessment of risk of publication bias
Risk of publication bias will be assessed using funnel plots, P-curve analysis and Egger's regression. Trim and fill analysis will be used to identify potentially missing studies. Analyses will be carried out using SigmaPlot (SYSTAT Software Inc) and Stata (StataCorp LP).
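For orientation, Egger's regression can be sketched as an ordinary least-squares fit of the standardized effect (effect divided by its standard error) on precision (the reciprocal of the standard error); an intercept far from zero suggests funnel plot asymmetry. The Python below is a simplified sketch with a hypothetical function name. It reports the intercept and its standard error but omits the significance test, which statistical packages such as Stata provide.

```python
from math import sqrt
from typing import Sequence, Tuple


def egger_regression(effects: Sequence[float],
                     ses: Sequence[float]) -> Tuple[float, float, float]:
    """Return (intercept, slope, SE of intercept) of Egger's regression."""
    n = len(effects)
    if n < 3 or n != len(ses):
        raise ValueError("need at least three studies with matching standard errors")
    x = [1.0 / se for se in ses]                     # precision
    y = [es / se for es, se in zip(effects, ses)]    # standardized effect
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all studies have the same precision")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid_var = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_intercept = sqrt(resid_var * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, slope, se_intercept


# Illustrative symmetric data: the same effect at every level of precision.
intercept, slope, se_intercept = egger_regression([0.5, 0.5, 0.5, 0.5],
                                                  [0.1, 0.2, 0.25, 0.5])
```

With these symmetric inputs the intercept is zero and the slope recovers the common effect size.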