Situational judgement test validity for selection: A systematic review and meta‐analysis

Situational judgement tests (SJTs) are widely used to evaluate ‘non‐academic’ abilities in medical applicants. However, there is a lack of understanding of how their predictive validity may vary across contexts. We conducted a systematic review and meta‐analysis to synthesise existing evidence relating to the validity of such tools for predicting outcomes relevant to interpersonal workplace performance.


| INTRODUCTION
The process of selecting the best candidate for a job is a universal challenge across all industries. There are few such situations where the stakes are higher than when deciding on entrants to medical school. An offer of a place to study is not just an opportunity to gain a university degree, but is usually the gateway to a lifetime career, characterised by both power and responsibility. Ideally, effective medical selection must firstly be 'fair,' in a broad sense. 1 That is, certain under-represented groups should not be unduly disadvantaged by the process; for example, individuals without access to certain resources, such as additional coaching, to help their performance in a specific selection assessment. Second, selection should result in the recruitment of individuals who are both suited to a successful career in the field, and likely to make valuable contributions to society in this regard. 2 These two aims can be seen as complementary. 1

Though attempts to improve medical selection have often focused on measuring aspects of intellectual ability, 3 there is an increasing recognition that 'non-academic abilities' are important when selecting future doctors. 4 Indeed, the majority of disciplinary censures received by practising physicians relate to personal conduct rather than clinical skills or knowledge. 5 However, there are many more challenges to defining and measuring such qualities, in contrast to cognitive ability, which can be estimated relatively reliably and validated against academic or educational performance. In this context, we use the term 'non-academic abilities,' in a broad sense, to include qualities or traits relevant to interpersonal functioning, though not directly related to traditional concepts of intelligence, intellectual ability or educational achievement. However, we acknowledge the absence of a single satisfactory label to describe these individual characteristics. Indeed, terms such as 'emotional intelligence' and 'non-cognitive' traits, although sometimes employed, are somewhat contentious. 6,7

The increasing emphasis on non-academic abilities has led to the rapid development and implementation of assessments intended to evaluate such attributes. In contrast to face to face procedures, such as multiple mini interviews (MMIs), the use of situational judgement tests (SJTs) to measure non-academic abilities is viewed as advantageous, as they are relatively cheap and convenient to deliver at scale. 8 The SJTs, in this context, are an assessment format whereby the test-taker is presented with a series of scenarios depicting an interpersonal situation. The candidate must then usually evaluate several possible behavioural responses to each scenario shown or described. The response format for SJTs varies but commonly involves ranking or rating the potential behaviours in order of either appropriateness or perceived effectiveness. Other response formats also exist, such as the candidate choosing the 'best' and 'worst' behaviours depicted. In the context of personnel selection, SJTs can be considered a special kind of procedural knowledge test; that is, they ask a candidate what 'should' be done in response to a portrayed scenario. 9 Consequently, the test-taker either knows what should be done or does not and, by definition, such assessments are not considered prone to 'faking' effects. The procedural knowledge evaluated is assumed to be a necessary, though not sufficient, condition for such behaviours to take place in a similar, actual, workplace situation.
This is in contrast to self-report personality measures that are vulnerable to faking in high-stakes testing. 10 Moreover, face to face interviews are also open to different forms of bias, though structuring these processes may reduce some of these influences to some extent. 11 It should be emphasised that SJTs are a particular assessment format, and this review is concerned only with their use in selecting candidates on non-academic abilities. 12 Although SJTs have been applied in personnel selection for many decades, their popularity increased when they were framed as 'low-fidelity simulations.' 13,14 Such SJTs were conceptualised as employing representations of aspects of actual workplace situations likely to be encountered in the role being applied for. A published meta-analysis of the predictive validity of SJT scores for future workplace performance reported a pooled correlation coefficient of .26. 15 However, to date, there have been no systematic reviews or meta-analytic studies specifically relating to SJTs in medical selection. Given the recent rapid rollout of this approach for evaluating non-academic abilities in medical applicants, it seemed timely to conduct such a review.
The SJT test format, in the context of personnel selection, is known to generate scores that are sensitive to a range of design choices, implementation methods and settings. Therefore, a review is also needed to begin to understand which factors are most likely to be associated with the observed validity of the resulting scores from such tools. Such knowledge is essential if SJTs are to be optimally designed, validated and implemented in the various stages of medical recruitment. Crucially, the choice of criterion-relevant outcome may be as important as the qualities of the SJT in determining the validity coefficients observed. These are likely to vary according to how feasible they are to obtain, being at least partly dependent on the stage of training being selected into. The traditional approach to SJT development for personnel selection involves creating a series of critical incidents, often based on real-world experience, in order to evaluate how candidates respond, relative to expert consensus. It has been speculated that where such occupational experience is relatively lacking, such as in undergraduate selection, it may be more challenging to develop and validate SJT-based measures for relevant personal qualities. 9 However, key questions remain over the feasibility of using potential alternatives, such as 'construct-driven' SJTs, designed to tap into specific traits, as these may show some of the weaknesses of self-report personality measures. 16 Other factors could also influence the observed validity of SJTs. These may include the medium of delivery (eg, multimedia vs a text-based format) and the choice of outcome criterion against which to validate (eg, self-report vs face to face ratings of aspects of performance). 17 Finally, though it is likely that SJTs generally evaluate 'knowledge of interpersonal effectiveness,' the content of such assessments carries a wide variety of labels, such as 'integrity,' 'team working,' 'empathy,' etc. Thus, by observing which outcomes have the strongest relationship with the SJT scores, we would hope to gain a greater understanding of the constructs being evaluated in this context.

In addition, the Ottawa Consensus statement on medical selection gave several recommendations for future research, including a call for systematic approaches towards translating evidence into changes in policy and practice. 18 Thus, the primary aim of this review was to collate the existing evidence for the validity of SJT-format assessments used in medical selection for the evaluation of non-academic abilities. A secondary aim was to explore, where possible, the factors associated with the observed validity coefficients, via meta-regressions. Finally, the review was intended both to provide guidance for current practice and, by highlighting existing gaps in the literature, to set out an agenda for future research.

| METHODS
The protocol for this systematic review was registered prospectively on PROSPERO (CRD42019137761). 19

| Selection criteria
Studies were deemed eligible if they investigated any persons undergoing medical selection processes that included an SJT for the evaluation of 'non-academic' abilities. For inclusion, studies had to report on the relationship between SJT scores and an outcome measure that, at least partly, related to non-academic abilities, and was deemed relevant to future or current medical practice. Therefore, relevant outcomes ('validity criteria') would be expected to capture some aspect of interpersonal functioning in the candidate. The outcomes were thus expected to include (though not be limited to): supervisor or tutor ratings; objective structured clinical examination (OSCE) performance; other ratings of 'integrity' or conscientiousness, and 'success' as a doctor (eg, successful completion of a training stage). No restrictions were placed on study design, though purely qualitative studies were excluded. The inclusion and exclusion criteria are outlined in Table 1. English language limits were not placed on the searches, but any studies identified for which the full text was not available in English were excluded due to the lack of translation facilities. A decision was made to exclude any studies reported only in conference abstracts. This was for several reasons: such abstracts often provided few details on the study and were unlikely to have undergone the same rigour of peer review as studies published in scientific journals. Moreover, it was assumed that studies that were methodologically stronger would be more likely to be subsequently published in the peer-reviewed literature. In addition, our approach included an evaluation of the risk of publication bias. In the event, few relevant conference abstracts were identified, and their full texts could not be accessed; in some cases, unsuccessful attempts were made by one author (ESW) to obtain the full texts by contacting the study authors.

| Search strategy and study selection
It is also important to note that, for the final set of studies, we only retained those which included at least one outcome that involved some third-party rating of interpersonal functioning that could be considered directly or indirectly related to workplace performance. This deviated from our original registered search protocol, which also included studies using construct-relevant self-report measures as outcomes. This change was made as a result of feedback from peer review of an earlier version of this report. This highlighted that scores derived from self-report measures, such as personality questionnaires, tend to correlate only very modestly, at best, with ratings of workplace performance. 20 However, we observed that excluding the minority of studies that used only self-report measures as outcomes had minimal impact on our key findings.
Two authors (ESW and PESC) were involved in the process of selecting studies for inclusion, independently screening all titles and abstracts identified by the searches. Full text screening was conducted for potentially relevant papers, determining the final studies retained.
Disagreements at any stage were resolved through discussion with the other two authors (LWP and PAT) until a consensus was reached. Figure 1 displays the PRISMA (preferred reporting items for systematic reviews and meta-analyses) flow diagram outlining the process of study selection. Following this, data extraction was performed by one author (ESW) and checked for accuracy by another (PESC).

| Quality assessment
To assess study quality, the Quality In Prognosis Studies (QUIPS) tool was used, as it is well suited to evaluating the risk of biases (RoBs) seen in predictive and prognostic studies. 21 It was used to rate the studies across six domains: study participation; attrition; prognostic factor evaluation; outcome measurement; confounding, and statistical analysis and reporting.
A rating for study participation was given for identification, recruitment and description of the participants. The category of 'attrition' looked at whether there were any issues related to dropout or incomplete follow-up and what, if any, attempts were made to correct for these effects. In this review, where applicable, this rating included whether the authors corrected for the possible 'attenuation' effects on observed correlations due to the restriction of range when outcomes are only observed in selected candidates. 22 The SJT score was the 'prognostic factor' evaluated, and the rating for this domain considered whether the SJT scores were measured in a valid and reliable way. The six domains were each given a rating of 'low,' 'moderate' or 'high' RoB. An overall RoB for a study was rated as 'low' if 0 or 1 domains were coded as having a moderate to high RoB; 'moderate' if 2 or 3 domains were rated in this way, and 'high' if 4 or more domains were rated as presenting at least a moderate RoB. The overall RoB ratings, along with those domains rated as a moderate to high RoB for each study, are shown in the final column of Table S1.
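As an illustration only (this is our own sketch, not part of the QUIPS tool, and the function and input names are hypothetical), the overall rating rule described above can be expressed as a short R function:

```r
# Illustrative only: derive an overall RoB rating from six QUIPS domain ratings,
# following the rule described above (0-1 flagged domains = low; 2-3 = moderate;
# 4 or more = high). 'Flagged' means a domain rated as moderate or high RoB.
overall_rob <- function(domain_ratings) {
  flagged <- sum(domain_ratings %in% c("moderate", "high"))
  if (flagged <= 1) "low" else if (flagged <= 3) "moderate" else "high"
}

overall_rob(c("low", "moderate", "low", "high", "low", "low"))  # returns "moderate"
```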

| Data synthesis
As the literature was assumed to be relatively heterogeneous in nature, the results were synthesised narratively. 23 For this review, this involved assessing the papers in order to understand the themes underlying the rationale and contexts of the final included studies. The synthesis identified common features of the literature and examined the relative strengths and weaknesses of the findings, and the respective methods on which they were based. The analysis consisted of grouping papers into categories, appraising study quality, and producing a collective synthesis. This information was then summarised formally, as can be seen in Table S1. The narrative synthesis was also used to inform the inferences we drew from the data.
Additionally, validity coefficients (correlations) were pooled using a random effects meta-analysis, allowing for heterogeneity at the study level. Two authors (PAT and LWP) assessed the relevance of the outcomes reported in the identified studies and conferred in order to reach consensus where there was any doubt. As the published papers frequently reported on multiple construct-relevant outcomes relating to the same (or considerably overlapping) study population, we designated these as 'sub-studies.' Therefore, we introduced a second random effect into the meta-analytic model in order to accommodate the dependency of observations within each shared study population. Thus, there were three levels (ie, involving two random effects) in our meta-analysis, representing outcome, population and paper. We used this model to derive a pooled estimate of the validity of the SJTs used in the relevant studies. Similarly, multi-level meta-regressions were also performed, where applicable, to formally test for any association between sub-study characteristics and the magnitude of the validity coefficients reported. The characteristics formally tested were selected on the basis that: (a) all (or almost all) included studies reported these factors; (b) there was a sufficient proportion of studies of each type to make it likely that at least a trend would be observed, should it exist, and (c) there were prior empirical or theoretical reasons to expect some difference in the magnitude of the validity coefficients observed on the basis of the factor. Statistical heterogeneity was assessed using the I² statistic. 24 Meta-analyses and meta-regressions were performed in the statistical software R 25 using the metafor package. 26
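A minimal sketch of such a three-level model using metafor is shown below; the data frame, variable names and values are illustrative rather than taken from the included studies, and the exact specification used in our analysis may differ.

```r
# Illustrative sketch of a three-level random-effects meta-analysis of validity
# coefficients (correlations), with outcomes nested within study populations,
# nested within papers. All data values and names here are hypothetical.
library(metafor)

dat <- data.frame(
  paper      = c("A", "A", "B", "B", "C"),
  population = c("A1", "A1", "B1", "B2", "C1"),
  ri         = c(0.28, 0.35, 0.22, 0.40, 0.31),  # observed SJT-outcome correlations
  ni         = c(210, 210, 540, 380, 150)        # sample sizes
)

# Fisher r-to-z transformation and corresponding sampling variances
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)

# Two random effects (paper, and population within paper); the third level is
# the sampling error attached to each individual outcome
res <- rma.mv(yi, vi, random = ~ 1 | paper/population, data = dat)

# Pooled estimate back-transformed to the correlation metric
predict(res, transf = transf.ztor)
```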

| RESULTS
In total, the search identified 470 papers, which after removing duplicates left 218 papers to be screened. After title and abstract assessment, 174 studies that did not meet the eligibility criteria were excluded. This resulted in 44 full texts to be assessed, of which 30 were found to meet the inclusion criteria and were subsequently retained for analysis in the review (Figure 1). As noted in the published protocol (CRD42019137761), 19 all studies were expected to be observational and this was the case. A total of 10 were cross-sectional studies, 27-35,55 where the outcome was measured at the same time or in the same selection cycle as taking the SJT. A total of 17 were cohort studies 17,36-51 that had a follow-up period before the outcome of interest was measured. Three studies employed a mixture of cross-sectional and more distal outcomes. 52-54 The length of follow-up across the cohort studies varied from 1 to 9 years after taking the SJT. Full details of the included studies are listed in Table S1. A total of 11 studies 17,28,29,35,40-43,47,49,50 reported the reliability of the SJT used, and 16 studies 28-32,34,36,40-43,49,50,52,53,55 also provided information in relation to the reliability of the construct-relevant outcome.

FIGURE 1 PRISMA (preferred reporting items for systematic reviews and meta-analyses) flowchart for the systematic review. Abbreviations: CINAHL, Cumulative Index to Nursing and Allied Health Literature; EMBASE, Excerpta Medica Database; ERIC, Educational Resources Information Center; EThOS, Electronic Theses Online Service; UCAT, University Clinical Aptitude Test.
These assessments were generally considered to capture 'maximal' performance; that is, test-takers would be expected to put in maximum effort, with the aim of achieving as high a score as possible, when evaluated in a high-stakes setting.

| Risk of bias
Overall, according to the QUIPS tool, the results of the studies were deemed to be at a moderate RoB. The RoB ratings for the studies are summarised in the right-hand column of Table S1. A frequent source of potential bias was the restriction of range that is, by definition, common to selection studies, whereby outcomes are only observed for those recruited. Not all studies attempted to correct for this, or for other attenuating effects such as imperfect reliability in the SJT or the outcome measure. Other common potential sources of bias were unreported or relatively poor reliability (<0.7) of either the SJT used or the outcome measure. Moreover, some studies did not provide adequate descriptions of the population, SJT or outcome characteristics. Nevertheless, it should be noted that the potential sources of bias identified would tend to lead to a systematic underestimation of the relationship between the SJT scores and the construct of interest. Thus, the results of the studies could be considered likely to be relatively conservative, especially those at higher overall RoB.
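For reference, the standard corrections alluded to here take the following textbook forms (these are not necessarily the exact procedures applied in each primary study). Here the observed correlation is denoted r_xy, the reliabilities of the SJT and outcome are r_xx and r_yy, and u is the ratio of the unrestricted to the restricted standard deviation of the SJT scores.

```latex
% Classical correction for attenuation due to unreliability
r_{\text{true}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}

% Thorndike Case II correction for direct range restriction on the predictor,
% where u = SD_{unrestricted} / SD_{restricted}
r_{c} = \frac{u\, r_{xy}}{\sqrt{1 + r_{xy}^{2}\,(u^{2} - 1)}}
```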
Concerning potential 'confounding,' we focused on whether the influence of academic performance had been controlled for. That is, whether the scores from the SJTs were likely to add incremental value above and beyond the traditional measures of academic or intellectual ability that are already widely employed. A total of 17 studies 17,27,30-32,37,38,40-43,45,46,48,51,52,55 used a measure of cognitive or academic ability alongside the SJT and thus received a low RoB rating in this domain. For the purposes of this review, we did not consider demographic variables such as age and sex as potential confounders, as they are largely not used in medical selection.
The I² value reflects the percentage of variation in the results across studies that is due to heterogeneity between studies (eg, different designs, outcomes and populations) rather than chance. 56 In this case, the I² statistic was close to 100.0%, suggesting that only a small proportion of the variation was due to chance alone.
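For a conventional random-effects model, I² is defined from Cochran's Q statistic as follows; for the multilevel model used here, analogous variance decompositions can be computed, so the formula is shown only to clarify the interpretation.

```latex
% Higgins-Thompson I^2 for k effect sizes, where Q is Cochran's heterogeneity
% statistic with k - 1 degrees of freedom
I^{2} = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%
```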
There are many design features in an SJT validation study that may influence the magnitude of the validation coefficient observed. 57 However, in the present case there were a limited number of such characteristics that could be formally statistically evaluated.
This was because such features had to be explicitly described in almost all the studies, with sufficient variation across them to plausibly evaluate for the presence of any trends. Almost all studies reported: whether text or video was used to present the scenarios (ie, the stimulus); the setting (undergraduate vs postgraduate); whether the outcome was longitudinal in nature (ie, captured more than a year after the SJT was administered) as opposed to cross-sectional; whether the outcome was captured on a one-off occasion rather than via a more prolonged period of assessment, and whether or not any correction for attenuation of the observed correlation was applied. There were also a priori reasons for hypothesising that these factors might be related to the reporting of higher or lower values for the validity coefficients. Therefore, the potential associations between these factors and the magnitude of the validity coefficients were formally tested using meta-regression analyses. Both univariable and multivariable models were tested in this regard. The results are shown in Table 2.
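Continuing the illustrative sketch given in the Methods, a univariable meta-regression on one such moderator could be specified in metafor as follows; the moderator name ('stimulus_format') and its values are hypothetical and not drawn from the included studies.

```r
# Illustrative univariable meta-regression: does stimulus format (text vs video)
# moderate the validity coefficients? Continues the hypothetical 'dat' object
# from the earlier sketch.
dat$stimulus_format <- factor(c("text", "video", "text", "text", "video"))

res_mod <- rma.mv(yi, vi, mods = ~ stimulus_format,
                  random = ~ 1 | paper/population, data = dat)
summary(res_mod)  # the moderator coefficient tests whether video differs from text
```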
Publication bias was also considered. Studies with relatively few participants may lack the power to detect the true underlying relationship between a predictor and an outcome. Therefore, in small studies, both small and large effect sizes may be more likely to be due to chance, though the latter may be more likely to result in a study being published. In a funnel plot, this bias can manifest as asymmetry, with a relative paucity of studies in the lower left quadrant of the chart; that is, fewer published small studies reporting modest effect sizes than would be expected. As can be seen in Figure 3, the funnel plot is relatively symmetrical in this respect, providing no indication of publication bias.
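As an illustrative check (continuing the hypothetical model object from the earlier sketches, rather than reproducing Figure 3 itself), a funnel plot of this kind can be produced directly from a fitted metafor model:

```r
# Funnel plot of effect sizes against their standard errors; marked asymmetry,
# particularly a gap in the lower-left region, would suggest possible
# publication bias. 'res' is the illustrative model fitted earlier.
funnel(res, xlab = "Fisher z-transformed validity coefficient")
```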

| DISCUSSION
This is the first review to systematically collate and synthesise evidence relating to the use of SJT-format assessments for selecting candidates into medical training based on non-academic abilities. Other design issues, such as the choice of outcome measure, are also likely to have played a substantial role in determining the degree of incremental validity observed (Table S1).

| Strengths and potential limitations
We used a rigorous systematic review process, with a prospectively registered search strategy, which identified a substantial number of primary studies for inclusion. These strengths aside, there were several potential limitations to the conduct of the review. First, it may be that we failed to identify unpublished studies that observed weak or absent correlations between SJT scores and an outcome of interest (ie, publication bias). The review also excluded studies that only reported their findings in conference abstracts. However, we note that our estimated average reported validity coefficient of 0.32 is close to that reported by a previous meta-analysis of SJTs for personnel selection that did include unpublished studies. A previous similar meta-analysis that only included published studies reported a higher pooled validity coefficient of 0.34. 64 Moreover, our funnel plot (Figure 3) did not provide evidence of publication bias. Due to the lack of translation facilities, the included studies were restricted to those published in the English language. Nevertheless, only one study was excluded for this reason. 65 Not all of the final study results could be entered into the meta-analysis, as some did not report a correlation coefficient as validity evidence, having employed categorical outcomes. However, only four studies 39,44,45,51 were excluded from the meta-analysis on these grounds.
As with any systematic review, the primary limitation was the quality of the studies included. Overall, the studies were rated as being at a moderate RoB. Moreover, statistical heterogeneity was considerable, indicating that the differences in the validity coefficients observed are highly unlikely to be due to random variation in sampling alone. Some of these differences will be explained by aspects of study design, such as context and the outcomes selected, which were captured in our meta-regression. However, inevitably, there will have been other factors which were either not reported consistently in the studies, or not captured as part of our data extraction and analysis. Ideally, all the design features that may have been relevant to criterion validity would have been formally tested for their influence on the results. However, due to the number and nature of the studies identified, only five of these factors were formally evaluated in the meta-regression. Moreover, given the relatively small numbers of each study type, these tests may have been underpowered, and so some caution must be exercised in interpreting the results.

| Implications for policy and practice
The use of the SJT format to evaluate non-academic attributes is becoming increasingly widespread across medical selection, and the results of this review would support their general validity in this context. The majority of the studies reported moderate, rather than large, predictor-outcome correlation coefficients. However, as highlighted earlier, these are comparable to those frequently cited for other widely accepted medical selection tools. Having established that SJTs generally have both predictive, as well as incremental, validity in this context, there is a question about their optimal place within the selection process. A previous review of the evidence for personnel selection approaches in medicine suggested that SJTs, along with MMIs, (cognitive) aptitude tests, academic record and selection centres, were fairer and more effective than personal statements, references and traditional interviews. 66 Consequently, these latter three, relatively unreliable, methods of evaluating personal qualities in medical applicants are less preferable than SJTs and the structured observations employed by MMIs and selection centres. Structured interviews, such as MMIs, seem to demonstrate acceptable reliability and validity if implemented appropriately. 67 However, they are relatively resource intensive compared to SJT-format assessments. The SJTs used in personnel selection are generally experienced as relatively easy tests. 68 Therefore, they tend to discriminate most accurately between relatively poorly performing test-takers. This implies they may best serve as cost-effective 'screen outs' at an early selection stage when considering which applicants to progress to face to face interview processes. Our review also highlighted that, where context. 37 There may also be other circumstances in which selectors may wish to only interview candidates obtaining a mid-range score on an SJT. For example, this approach could possibly be justified where the numbers of both applicants and places are relatively large compared to the available resources to perform face to face assessments, and those performing well at an SJT are known to be at minimal risk of receiving poor interview-based ratings.
However, though such choices may be justified, they would have to be based on some preliminary evidence; the cost-effectiveness of SJTs in conjunction with more resource-intensive selection stages may be assumed to be reasonably sensitive to context.
As highlighted earlier, the low to moderate validity coefficients reported for most SJT evaluations are comparable with those cited for cognitive (problem-solving) ability to predict medical academic performance. Thus, it could be justifiable that a similar weight be placed on SJT performance as on the latter assessment scores.
However, some caution in this regard should be exercised. Due to their measurement properties and precision, the cognitive assessments used in medical selection are generally able to differentiate candidates even at the upper end of ability. 69 This is less true of SJTs, which tend to be superior at discriminating between average to low performers, hence their suitability for use as early stage screening assessments. 68 Indeed, the weight placed on SJTs, relative to other measures, in medical selection has been debated in relation to the allocation process for UK medical graduates to be placed on the country's Foundation training programme (the first 2 years of postgraduate medical training). 70,71 In this case, equal weight is currently given to the scores from a 2.4-hour-long SJT and the educational performance measure (EPM) derived from academic performance in the previous 5 to 6 years of medical school. 72 Nevertheless, it should be highlighted that, in this situation, the focus of discussion has not been the quality of the specific SJT used, but rather the relative weight placed on the scores. 73 Indeed, the authors of the validity study for the allocation process for the Foundation programme found evidence for the effectiveness of the SJT used in this context, though suggested that a relatively reduced weighting be placed on SJT performance in this situation. 51 To summarise, the current state of evidence supports the use of such SJTs within medical selection, usually in conjunction with tests of knowledge and/or cognitive ability. Such assessments may best serve as a way of deciding which candidates should progress to more resource-intensive assessment processes. However, in some circumstances it may be defensible, and more cost-effective, to limit face to face processes to those applicants that score within a certain (middle) range, or to dispense with such a final stage of selection altogether. Such situations may include those where there is a low applicant to vacancy ratio and a strong imperative to fill training places.

| Directions for future research
Collating and summarising the empirical evidence to date in this area enables us to describe a clear agenda for future research in the field.
First, the identified studies used a wide variety of outcomes in order to obtain evidence to support the validity of the SJTs. Many of these would have been expected to tap into a whole range of traits and abilities that might be presumed to be relevant to performance in a health care setting. However, ideally such outcomes would be more explicit in terms of the constructs that they were tapping into, and indeed their precision (reliability) in terms of their measurement ability. The importance of matching a test score to a criterion-relevant outcome has been previously emphasised. 41 Moreover, the development of a framework for outcome criteria in medical selection has previously been highlighted as a research priority in the field. 66 which to train such systems. 81 The use of more engaging and immersive technologies, using augmented and virtual reality 82 or gamification, 83 could also be harnessed in a way which makes it more feasible to capture more typical (rather than maximal) interpersonal performance, even in a high-stakes selection setting. Consequently, it may be possible to develop and enhance the SJT format in a way which renders such tests more effective, though still able to be delivered at scale. It has also previously been suggested that a more construct-driven approach, where particular traits are targeted, may be useful to SJT development in some circumstances. In particular, this framework has been raised as a possibility for SJTs used in selection situations where there are relatively few workplace situations to sample, and relatively little on-the-job experience for test-takers to draw on. 9 However, it is currently unknown if such tests, which, psychometrically, often behave more like traditional personality measures, would be valid in high-stakes situations, where faking effects may come into play. 9,16 Fourth, modelling or quantifying the impact of these selection tools on the actual demographics, and indeed the effectiveness, of the medical profession is required. In this regard, numerical simulation methods, such as those previously applied to medical selection situations, may be useful. 5,60 As the widespread use of SJTs in undergraduate medical selection is a relatively recent development, there is an opportunity to evaluate the footprint of such policy changes over the near future. This could be performed via tracked cohorts, such as those whose information is captured by the UK Medical Education Database. 84 If such SJTs are successful in their aims, then a footprint should be observed in terms of improved patient satisfaction and health outcomes. Similarly, rates of complaints and professionalism breaches should decline. In order to control for such effects being obscured by other secular trends, it may be necessary to employ quasi-experimental designs, or causal inference approaches to analyses of relevant observational data. It should also be possible to provide estimates of the likely impact of both the nature of the tests, and the manner in which they are implemented, on access to medicine for traditionally under-represented groups, such as those from ethnic minorities or less advantaged socio-economic backgrounds.
More generally, it should be highlighted that all the included studies were from high-income countries. Previous research has found that SJT methodology is typically transportable for use in recruitment settings in other countries. 85 However, there is a need for further research relating to the validity and impact of such tools across diverse settings and cultures. We also noted that most of the published research studies identified were led by, or involved, test developers; therefore, some degree of conflict of interest would be present. It would thus be desirable, where possible, for a greater number of independent evaluations to be conducted in the future. This may be challenging, as much of the testing expertise currently lies within the commercial sector. It is also the case that when SJTs need to be deployed at scale, commercial organisations generally have to be involved. Therefore, it is likely that evaluating the effectiveness of widely used SJTs would involve some degree of partnership with industry. Moreover, academics are themselves not free of potential conflicts of interest.

| CONCLUSIONS
Our findings suggest that SJTs used for evaluating non-academic abilities in medical selection generally demonstrate moderate validity for predicting outcomes relevant to interpersonal workplace performance.

AUTHOR CONTRIBUTIONS
PAT conceptualised the review. ESW prepared the first draft of this paper. ESW also led on data acquisition, with substantial contributions from PESC. LWP and PAT led on data synthesis. All authors (ESW, LWP, PESC and PAT) contributed to the development of the protocol, interpretation of findings, the critical appraisal and revision of the document, as well as approved the final manuscript for submission, and agreed to be accountable for all aspects of the work.

ACKNOWLEDGEMENTS
None.