Assessing context suitability (generalizability, external validity, applicability or transferability) of findings in evidence syntheses in healthcare—An integrative review of methodological guidance

Evidence syntheses provide the basis for evidence-based decision making in healthcare. To judge the certainty of findings for the specific decision context, evidence syntheses should consider context suitability (ie, generalizability, external validity, applicability or transferability). Our objective was to determine the status quo and to provide a comprehensive overview of existing methodological recommendations of Health Technology Assessment (HTA) and Systematic Review (SR) producing organizations in assessing context suitability of evidence on effectiveness of healthcare interventions. Additionally, we analyzed similarities and differences between the recommendations.


| INTRODUCTION
Evidence syntheses such as Systematic Reviews (SR) or Health Technology Assessments (HTAs) provide the basis for evidence-based decision making in healthcare. 1,2 Users of evidence syntheses may include practitioners, decision makers in health policy or patients. 3 When synthesizing evidence, researchers should consider context suitability to enable users to make a judgement about whether the evidence fits their decision context. 4 If there are reasons that results from an evidence synthesis may not fully apply to specific decision-making contexts, users of the evidence synthesis can incorporate these into their decision making process and make a judgement on how differences in context may affect the certainty in the evidence.
Various concepts relevant to context suitability have been suggested. Some of the most common concepts and terms used in the literature are external validity, generalizability, applicability and transferability, but no general definition of these concepts exists, and the terms are sometimes used interchangeably. 1,[5][6][7] According to the definitions of Burford and colleagues, 5 researchers may assess whether the results provide a correct basis for generalizations to other circumstances (generalizability/external validity), whether it is feasible to implement an intervention in a specific context (applicability) or whether a similar level of effectiveness could be achieved if an intervention were implemented in another specific context (transferability). To summarize these concepts and associated terms, we use context suitability as a generic term in the following.
Context suitability is not a binary concept (suitable vs nonsuitable), but should rather be regarded as a continuum (more or less context suitable). 8 Moreover, in contrast to internal validity, context suitability is not static (ie, it is not always the same for one body of evidence) but depends on the specific point of reference (ie, the situation to which the results should be applied). 9 Therefore, the same body of evidence might be more suitable for one decision context but less suitable for another decision context.
Context suitability of research findings can be affected by temporal changes (eg, technology changes), geographical factors (eg, healthcare infrastructure differences) or aspects of study design (eg, strict inclusion criteria). 3,10 Regarding study design, two study types can be distinguished, namely "pragmatic" and "explanatory". "Pragmatic trials" focus on clinical practice conditions, while "explanatory trials" focus on idealized settings to demonstrate proof of concept. It is assumed that "explanatory trials" tend to be less generalizable due to homogeneous populations or small sample sizes, while "pragmatic trials" tend to be more affected by local clinical practice. 8
Context suitability assessments in evidence syntheses (eg, the assessment of applicability or transferability) have received increasing interest in recent years. 1,5 Several tools for assessing context suitability have been developed and suggested, but there are no widely agreed approaches for systematic reviews or health technology assessments. 1,3,5,9,[11][12][13] There are several possible explanations for this. First, an assessment can be designed in several different ways, for example, regarding the assessment level (assessment of individual studies or a body of evidence) or the standardization of the tool (standardized vs nonstandardized approaches). Second, there is a lack of empirical evidence on factors that affect context suitability 14 (eg, in comparison with risk of bias), for example from meta-epidemiological studies. Third, context suitability highly depends on the decision context and thus should always be adapted to the research question at hand. [15][16][17]

• What is already known
In the literature, there exists a diverse terminology regarding context suitability (e.g. generalisability, applicability, transferability, external validity). Several tools to assess concepts of context suitability (e.g. applicability) in the field of healthcare have been developed and suggested, but there are no widely agreed approaches to assessing these concepts.
• What is new
We provide a comprehensive overview of the applied methods for assessing context suitability of evidence on effectiveness of healthcare interventions recommended by evidence synthesis producing organisations with national and international target audiences. We found little consistency in the terminology used between the organisations. There are various concepts to assess context suitability (e.g. assessing each study individually vs. a body of evidence as a whole). Moreover, we found heterogeneous recommendations regarding assessment criteria and a lack of guidance on how to apply them.
• Potential impact for Review Synthesis Methods readers outside the authors' field
Authors of evidence syntheses should be aware of the different concepts for assessing context suitability of evidence on effectiveness of healthcare interventions and use clearly described and transparent methods to advance the field and help readers in applying results to their context.

Our objective was to provide an overview of existing methodological recommendations of HTA organizations and Systematic Review organizations (evidence synthesis producing organizations) in assessing the context suitability of evidence on effectiveness of healthcare interventions. This includes all methodological recommendations for assessing aspects of context suitability, irrespective of type, scope or content. In addition, we analyze similarities and differences between the methodological recommendations.
For this purpose, we performed an integrative review 18 of applied methodological guidance on the assessment of context suitability in the field of healthcare.
This integrative review differs from previous research in that we did not review published checklists (regardless of whether they are used in practice or not) but methods for assessing context suitability as a whole that are actually recommended for use in practice. In addition, we provide a comprehensive overview of both assessment approach concepts (including assessment standardization, assessment level and integration of assessment in the quality of evidence rating or preparation process) and assessment criteria. This is the first part of a larger research project from our team on this topic. In a second part, we plan to review and compare methodological recommendations in assessing context suitability of evidence on cost-effectiveness in evidence syntheses. In a third part, we aim to provide an overview of how context suitability of evidence on effectiveness and cost-effectiveness is assessed in evidence syntheses and which recommended methods are applied in practice.

| METHODS
There is no published protocol for this integrative review. Unless otherwise indicated, all described methods were prespecified. We performed an integrative review 18 ; "a specific review method that summarizes past empirical or theoretical literature to provide a more comprehensive understanding of a particular phenomenon or healthcare problem." 19 We do not provide a quality assessment of included documents as, to our knowledge, no tool exists for assessing the quality of documents containing methodological recommendations.
This review is reported, as far as applicable to methodological research, according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). 20

| Search strategy
Our synthesis is based on the methodological recommendations of evidence synthesis producing organizations, defined as institutions that conduct evidence syntheses, including but not limited to SRs, HTAs or rapid reviews in the field of healthcare. We focused on organizations that make concrete recommendations for conducting evidence syntheses, meaning we excluded general methodological papers. Many evidence synthesis producing organizations exist worldwide, but it can be anticipated that the methodological guidance documents of these organizations (eg, Cochrane Handbook, manuals of HTA organizations) are mostly not listed in bibliographic databases. Therefore, the documents had to be searched manually.
We performed a structured hand search on webpages of evidence synthesis producing organizations with national (eg, AHRQ) and international (eg, EUnetHTA) target audiences to identify potentially relevant methodological guidance documents. We identified the organizations through publicly available member lists of HTA umbrella organizations. The member lists of the following HTA umbrella organizations were included: European Network of Health Technology Assessment Agencies (EUnetHTA), Health Technology Assessment international (HTAi), International Agency of Health Technology Assessment (INAHTA) and Red de Evaluación de Tecnologías en Salud de las Américas (RedETSA). In total we searched 158 webpages of identified organizations (see Data S1). In addition, we searched the webpages of the following Systematic Review organizations: The Cochrane Collaboration, the Joanna Briggs Institute (JBI), the Centre for Reviews and Dissemination (CRD) and the Institute of Medicine (IoM). We chose this approach for two reasons: First, it was not feasible to search for all organizations (universities, governmental organizations etc.) that could potentially provide a methodological guidance document for every country worldwide. Second, the chosen approach is structured and transparent.
Two reviewers independently searched for methodological guidance documents on the webpages. To allow for a thorough search on webpages with different and sometimes complex site structures, we checked each section of the webpages carefully to identify all potentially relevant documents. Moreover, we translated foreign-language websites using machine-based and browser translation tools to identify English or German language documents. We performed the searches from January to March 2019. Each reviewer saved the identified documents independently. The lists of saved documents were compared and synchronized manually using Excel. In cases where several versions of a document existed, we only considered the latest version. Duplicates were removed manually.
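The comparison of the two reviewers' document lists was done manually in Excel. Purely as an illustration, the same logic (merge both lists, keep only the latest version of each document, and flag documents found by only one reviewer) could be sketched as follows. The `Document` fields, the use of the publication year as a stand-in for "version", and the function name `reconcile` are assumptions made for this sketch and are not taken from the review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical fields chosen for illustration only.
    organization: str
    title: str
    year: int  # used here as a stand-in for the document version


def _key(doc):
    # Normalise case and whitespace so trivial spelling differences
    # do not create spurious duplicates.
    return (doc.organization.strip().lower(), " ".join(doc.title.lower().split()))


def reconcile(list_a, list_b):
    """Merge two reviewers' lists and keep the latest version per document.

    Returns (merged, only_a, only_b); only_a and only_b contain the keys of
    documents that just one of the two reviewers identified.
    """
    keys_a = {_key(d) for d in list_a}
    keys_b = {_key(d) for d in list_b}

    latest = {}
    for doc in list(list_a) + list(list_b):
        k = _key(doc)
        if k not in latest or doc.year > latest[k].year:
            latest[k] = doc

    merged = sorted(latest.values(), key=lambda d: (d.organization, d.title))
    return merged, keys_a - keys_b, keys_b - keys_a
```

Documents appearing in only one list would then be checked by both reviewers before screening; the sketch only surfaces them and does not decide anything.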

| Eligibility criteria and screening
We screened all identified and potentially relevant documents against the following predefined eligibility criteria:
1. Publication type: Methodological guidance documents for the preparation of evidence syntheses (eg, handbooks, manuals, guidelines, standard operating procedures)
2. Documents include recommendations for assessing the context suitability of quantitative studies (ie, not exclusively qualitative studies) on the effectiveness of health technologies (defined as interventions "developed to prevent, diagnose or treat medical conditions; promote health; provide rehabilitation; or organize healthcare delivery" 21 )
3. The process of context suitability assessment is specified, for example, in the form of concrete methods, questionnaires, tools or assessment criteria
4. Recommendations on context suitability related terms like external validity, generalizability, extrapolation, transferability or applicability are considered (according to the definitions of Burford et al 5 )
5. Publication language: English, German
All documents that did not fit our inclusion criteria were excluded. Two researchers screened all identified documents independently. Discrepancies were resolved in a discussion until consensus was reached. The terminology in the research field is highly heterogeneous. Therefore, we carefully considered all sections of the identified documents that could be potentially relevant for the assessment of context suitability, that is, we did not focus on specific terms.
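As an illustration of the dual independent screening described above, the following minimal sketch encodes the five eligibility criteria as yes/no judgements and lists the documents on which two reviewers reach different decisions, so that these can be discussed until consensus. The criterion labels, the dictionary layout and the function names are hypothetical; the review itself does not describe any software for this step.

```python
# Short, hypothetical labels for the five eligibility criteria listed above.
CRITERIA = (
    "publication_type",            # 1. methodological guidance document
    "quantitative_effectiveness",  # 2. quantitative studies on effectiveness
    "process_specified",           # 3. assessment process is specified
    "context_terms",               # 4. context suitability related terms
    "language",                    # 5. English or German
)


def is_eligible(judgements):
    """A document is included only if every criterion is judged as met."""
    return all(judgements.get(criterion, False) for criterion in CRITERIA)


def find_discrepancies(reviewer_a, reviewer_b):
    """Return ids of documents on which the two reviewers' decisions differ.

    Each argument maps a document id to a dict of criterion -> bool.
    """
    shared = reviewer_a.keys() & reviewer_b.keys()
    return sorted(
        doc_id
        for doc_id in shared
        if is_eligible(reviewer_a[doc_id]) != is_eligible(reviewer_b[doc_id])
    )
```

The sketch compares only the final include/exclude decision; comparing the individual criteria would additionally show why two reviewers disagree.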

| Data extraction and synthesis
We performed data extraction using a standardized piloted data extraction sheet. In a first step, we preliminarily reviewed the methods documents and developed the data extraction sheet inductively. In a second step, we piloted the data extraction sheet: One reviewer extracted data from a sample of included methods documents (from five organizations) and a second reviewer checked it for accuracy. Afterwards, we discussed and modified the data extraction sheet and re-checked methods documents that had already undergone data extraction. In a third step, one reviewer extracted data and a second reviewer checked it for accuracy. Discrepancies were resolved in a discussion until consensus was reached. When performing data extraction initially, we took the formulations as literally as possible to avoid interpretation bias. We extracted and aggregated data according to the following issues: the target audience of the evidence synthesis product (specific jurisdiction vs generic), the types of considered interventions, the underlying terminology (eg, applicability or external validity) and its definition, information on assessment approach concepts (assessment level, assessment standardization and integration of assessment in the preparation process and in the quality of evidence rating) and the relevant assessment criteria.
We expected substantial heterogeneity in the terminology and its definitions. Therefore, we decided to reassign the definitions used by the organizations to the corresponding definition/terminology by Burford et al 5 to distinguish heterogeneity related to different concepts from heterogeneity related to terminology.
We performed a thematic analysis 22 to identify and summarize the main themes (main topics to organize and structure the analysis) and related categories (lower hierarchical unit of themes used for further categorization) regarding context suitability assessments across the recommendations of included organizations. We developed the themes (terminology, assessment approach concepts and assessment criteria) for the analysis deductively. Related categories for the terminology (eg, applicability) were taken from Burford et al. 5 The categories for the assessment approach concepts (eg, assessment level) were developed inductively, using an iterative procedure. The categories for the assessment criteria (eg, population, setting) were largely taken from AHRQ's recommended assessment approach 16 and adapted inductively. To develop a set of category-specific items (lower hierarchical unit of categories, which capture details of those categories such as demographic characteristics and comorbidities) of the assessment criteria, we extracted all reported items literally and assigned them to the corresponding category. We then grouped the extracted items into these category-specific items (eg, we grouped "comorbidity," 23 "comorbidities" 16,[24][25][26] and "co-morbid conditions" 10 as "comorbidities"). One reviewer developed the themes, categories and category-specific items and a second reviewer checked them for plausibility. A glossary of developed themes and related categories can be found in Table 1. We describe our findings in a structured narrative manner and visually present extracted data using tabulations.

TABLE 1 Terms and themes with definition/explanation

| Terms and themes | Definition/Explanation |
| --- | --- |
| Target audience | Describes who is informed by the evidence synthesis product or to what context it relates (here: differentiation between national and international target audiences) |
| Type of considered intervention | Describes whether the guideline relates to a specific type of intervention (pharmaceuticals, medical devices, medical services/procedures, diagnostics, public health interventions) or whether it is generic |
| Original terminology and definition | Describes which context suitability term was used in the guideline (eg, external validity, generalizability, applicability, transferability) and how this was defined by the organization |
| Harmonized terminology and definition | Due to the high heterogeneity in the terminology we decided to reassign the different definitions in the guidelines to the corresponding term of Burford et al (2013) to achieve a uniform terminology |
| Transferability | "Whether when implementing an intervention in a particular setting or population, the level of effectiveness of the intervention (ie, the effect size) will be similar to that observed in the systematic review." |
| Assessment approach concepts | Describes how the context suitability assessment is operationalized. This includes the integration of assessment in the preparation process, the assessment level, the assessment standardization and the integration of assessment in the quality of evidence rating |
| Integration of assessment in the preparation process | Context suitability can be integrated at different steps of the evidence synthesis preparation process. It can be considered in study selection, in study assessment, or both |
| In study selection | Consideration of context suitability when defining and applying eligibility criteria |
| In study assessment | Assessment of context suitability of included evidence |
| Assessment level | Describes whether the context suitability assessment is applied to each individual study, only to the whole body of evidence, or both |
| Individual study assessment | Context suitability is assessed for each included study individually |
| Body of evidence assessment | Context suitability is assessed for all included studies as a whole |
| Assessment standardization | Context suitability assessment may follow a clear structure or not. We differentiate between standardized and nonstandardized approaches. The first is further divided into previously developed assessment approaches and approaches developed by the organization itself |
| Nonstandardized assessment approach | A nonstandardized assessment approach does not follow a clear structure |
| Own standardized assessment approach | An own standardized assessment approach follows a clear structure and may comprise different steps, checklists or questions that have to be followed or rated. It was developed by the guidance producer itself |
| Previously standardized assessment approach | A previously standardized assessment approach can be defined as an approach which was not developed by the guidance producer itself, but refers to a standardized assessment approach of other researchers |
| Integration of assessment in the quality of evidence rating | Context suitability can be assessed as a part of other quality of evidence appraisals or not. We differentiate between standalone and integrated assessment approaches |
| Standalone assessment | In a standalone assessment approach, context suitability is assessed independently from other aspects for assessing the quality of evidence, in particular the assessment of internal validity |
| Integrated assessment | In an integrated assessment approach, context suitability is combined with other criteria to assess the quality of the body of evidence, in particular the assessment of internal validity |
| Assessor | Context suitability can be assessed by the evidence synthesis producing organization itself, or by the reader of the evidence synthesis, or both. If the reader is the assessor, the information that is necessary for a judgement regarding context suitability should be provided in the evidence synthesis |
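The grouping of literally extracted items into category-specific items, described in the data synthesis above, can be pictured as a simple lookup. The following is a minimal sketch under stated assumptions: only the three comorbidity wordings and their harmonized label come from the text, while the lookup table `ITEM_MAP`, the function `group_items` and the "ref N" source labels are hypothetical and do not represent the authors' actual procedure, which relied on reviewer judgement.

```python
from collections import defaultdict

# Hypothetical lookup: literal wording -> (category, category-specific item).
# Only the comorbidity example is taken from the text.
ITEM_MAP = {
    "comorbidity": ("population", "comorbidities"),
    "comorbidities": ("population", "comorbidities"),
    "co-morbid conditions": ("population", "comorbidities"),
}


def group_items(extracted):
    """Group literally extracted items by category and harmonized item.

    `extracted` is an iterable of (source, literal_item) pairs. Items that
    have no mapping yet are returned separately for manual review.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    unmapped = []
    for source, literal in extracted:
        entry = ITEM_MAP.get(literal.strip().lower())
        if entry is None:
            unmapped.append((source, literal))
            continue
        category, item = entry
        grouped[category][item].add(source)
    return grouped, unmapped
```

Keeping unmapped items in a separate list mirrors the iterative, inductive development of the item set: a reviewer would inspect them and either extend the lookup or define a new category-specific item.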

| Literature search
We identified 155 potentially relevant methods documents from 63 organizations with a national target audience (some organizations published more than one document) through webpage searches. In addition, we identified five potentially relevant methods documents on the webpages of four organizations with an international target audience. After a manual duplicate removal (some documents were found on more than one webpage) a total of 155 methods documents from 67 organizations remained for screening. Finally, we included 14 methods documents from 12 organizations in our synthesis. 10,[15][16][17][23][24][25][26][27][28][29][30][31][32] The document selection process is illustrated in Figure 1. A list of excluded publications can be found in the appendix (Data S2). Ten eligible publications were obtained from nine organizations with a national target audience (Agency

Evidence Synthesis Producing Organizations
Evidence synthesis producing organizations are institutions that conduct evidence syntheses, including but not limited to SRs, HTAs or rapid reviews in the field of healthcare.

3.2 | Analysis of guidance for assessing context suitability

| Types of considered interventions
Most guidelines (9/14) are not restricted to a specific type of intervention. 10,16,17,23,25,26,[30][31][32] One guideline is specific to pharmaceuticals, 29 one to nonpharmaceuticals 27 and two to diagnostics, 24,28 while one guideline could not be assigned to a specific type of intervention and thus can be assumed to be generic. 15 Although they provide generic methodological recommendations regarding the assessment of context suitability, some organizations add specifications for certain interventions (eg, public health interventions [PHI], "a broadly defined set of activities that aim to protect, promote, and restore the health of all people" 31 ). Table 2 provides an overview of the terminology and definitions used as well as other guidance characteristics such as the organization, country and target audience.

| Terminology
The 12 included evidence synthesis producing organizations use the terms generalizability, applicability, external validity, transferability, transposability, relevance and/or directness with different, missing or sometimes conflicting definitions. According to the definitions of Burford et al 5 three organizations use the term applicability, 24,30,31 four the terms external validity/ generalisability 10,15,23,29 and four organizations the term transferability. 16

| Assessment approach concepts
All assessment approach concepts are defined in Table 1. An overview of the assessment approach concepts recommended by the organizations is provided in Table 3. Further details on the recommended assessment approaches are available in Data S3. We did not structure the results according to the different context suitability related terms, because we could not find a relation between the terminology used and the recommended assessment approach concepts. However, Table 3 depicts the results in accordance with the harmonized terminology of Burford et al. 5
Context suitability considerations can be applied in the study selection process, when defining and applying eligibility criteria, or in a later step, when assessing context suitability of included research findings. Moreover, a combination of both concepts is possible. Four organizations recommend such a combined approach 15,16,28,31 : According to AHRQ 16,28 and GÖG/LBI, 15 researchers should consider excluding studies with characteristics that would limit context suitability too much or make adaptations impossible. CRD 31 and GÖG/LBI 15 also point out that too narrowly defined eligibility criteria in evidence syntheses might affect context suitability.
Regarding the assessment level, five organizations recommend an individual assessment of context suitability for each included study (individual study assessment), 15,17,23,24,29 two recommend assessing context suitability only for the whole body of evidence (body of evidence assessment) 27,30 and five organizations recommend both. 10,16,25,28,31,32
Regarding the assessment standardization, seven organizations recommend a nonstandardized approach, 10,23,24,26,27,29,31 four developed a standardized approach [15][16][17]28,30 and two recommend a previously developed standardized approach. 25,31,32 CRD 31 proposes the "List of questions to ask in determining applicability and transferability" from Wang et al 2 and EUnetHTA 25,32 recommends AHRQ's 16 "Four Specific Steps in Assessing and Reporting Applicability."
Regarding the integration of assessment in the quality of evidence rating, six organizations recommend assessing context suitability independently from other aspects of quality of evidence (standalone assessment), 15 considering context suitability along with study quality, 26,27,29 evidence grading 26,30 and/or when formulating conclusions. 10 CRD 31 distinguishes between a generic context suitability assessment and a specific assessment for PHI. For the generic assessment an integrated assessment is recommended, while for PHI a standalone assessment should be performed.

| Assessment criteria
An overview of all relevant assessment criteria that are considered by the organizations, when assessing context suitability of evidence on effectiveness data, is provided in Table 4. More details regarding the assessment criteria can be found in the appendix (see Data S4). We did not structure the results for the assessment criteria according to the different context suitability related terms, because we could not find a relation between the used terminology and the recommended assessment criteria. However, Table 4 depicts the relevant assessment criteria in accordance with the associated terminology according to Burford et al. 5

| Population
All included organizations recommend taking population characteristics into consideration, when assessing context suitability of evidence on effectiveness of healthcare interventions. The most commonly recommended aspects are demographic characteristics (eg, age, sex, race, ethnicity), 10,16,23,25,26,28,31 severity or stage of illness, 10,16,23,25,26,28,31 comorbidities 10,16,23-26 and the study's eligibility criteria. 15,17,25,26,28,31 In addition, three organizations 16,25,26 recommend considering event rates and the exclusion rate within run-in period as a population-based item. Moreover, AHRQ, 16 [15][16][17]25 Co-interventions and the interventions' relevance to current practice are considered in the methodological recommendations of the AHRQ, 16 EUnetHTA 25 and IoM. 26 GÖG/LBI 15 also consider the treatment duration in their assessment. 15 In addition, mechanics of the intervention and implementation process (of PHI) are mentioned by the CRD, 31 while the Cochrane Collaboration 10 recommends evaluating the feasibility of the intervention implementation in practice settings.

| Comparator
Six out of 12 organizations include comparator interventions as a relevant aspect in their methodological recommendations. 10,16,23,25,26,28,30 AHRQ, 16,28 EUnetHTA 25 and the IoM 26 describe inadequate dosing of the comparator drug and the use of a substandard alternative therapy as conditions that may limit context suitability. The SBU 30 recommends assessing the control group's relevance. Moreover, the Cochrane Collaboration 10 recommends collecting relevant details of comparator interventions to help review users in assessing the context suitability to their target audience.

| Outcome
Seven out of 12 organizations include the outcome as a relevant aspect in their context suitability assessment approach. [15][16][17]23,25,26,28,29 All of them differentiate between surrogate endpoints and clinically/patient relevant endpoints, when assessing context suitability. In addition, the AHRQ, 16 EUnetHTA 25 and the IoM 26 define composite outcomes, which mix outcomes of different significance, as a condition that may affect context suitability. Moreover, AHRQ 28 recommends the consideration of treatment trends 28 and HIQA, 23 AHRQ 16 and LBI 17 the length of follow-up.

| Safety
Safety items are included in the methodological guidelines of EUnetHTA, 32 GÖG/LBI, 15 LBI 17 and CRD. 31 According to the CRD 31 several factors such as patient or disease characteristics may contribute to the occurrence of adverse events and therefore may limit the context suitability of study results. 31 EUnetHTA, 32 however, points out that safety data may not be suitable for vulnerable groups including people with comorbidities, children or pregnant women. Moreover, a valid evaluation of adverse events is a crucial point in the assessment methods of GÖG/LBI 15 and LBI. 17

| Setting
With the exception of AHTAPol, 29 all organizations make recommendations regarding the consideration of settings when assessing context suitability. Care pathways, 27 availability of health technologies, 27 care sector 15,17,23 or routine clinical practice 15,31 are some of the recommended items for the context suitability assessment. Others include aspects of study design features such as study design, 15,16 study conduct 15,23 or study country 15,24 in their assessment approaches. Moreover, the guidelines of AHRQ, 16 EUnetHTA 25 and IoM 26 stipulate a comparison of standards of care, level of care and specialty population between the study setting and the setting of interest. The Cochrane Collaboration 10 names many different contextual factors that might be relevant to the context suitability assessment. Some of these factors are context factors within host organizations, fee or payment structure, insurance system or variation in values and preferences. The recommendations of CRD 31 refer to PHI. Researchers should consider how closely a study reflects routine practice and should find out how the intervention could be implemented in the target setting. Here, relevant items are responsibilities, implementation barriers, provider skills or resource availability.

| Summary of findings and interpretation
Our integrative review summarizes the applied methodological recommendations on context suitability assessments of 12 organizations with national and international target audiences. It depicts the status quo and provides a comprehensive overview of methods for considering context suitability.
However, the few hits of our structured search for applied methodological recommendations of evidence synthesis producing organizations show that the topic still receives little attention in practical application. From 162 organizations searched and 155 documents screened only 12 organizations (14 documents) made recommendations on assessing context suitability of evidence on effectiveness. This is surprising considering that most of these organizations perform evidence syntheses to develop recommendations. In future work, we aim to investigate whether and how context suitability is addressed in evidence syntheses in practice.

| Terminology
Our analysis of different context suitability terms and definitions reassigned to the terminology of Burford et al 5 agreed with the inconsistent use of these terms in the literature. 1 Only 7 out of 16 terms matched the corresponding definition of Burford et al. 5 Moreover, three terms could not be reassigned to the terminology, because the corresponding definitions were not reported or inconclusive.
The terms generalizability, applicability and transferability (according to Burford et al 5 ) have different scopes. Therefore, one would expect organizations focusing on a specific target audience to recommend the assessment of applicability or transferability, as these concepts of context suitability can only be assessed in relation to a specific context or population. 12 In contrast, one would expect organizations with an international target audience to focus on generalizability, as they have no specific target audience and it is impossible to assess applicability or transferability for all imaginable contexts and populations. However, only one organization with an international target audience focused on generalizability. 10
Furthermore, there are different conditions that have to be satisfied to fulfil the respective definitions of generalizability, applicability and transferability. For example, the term applicability focuses on aspects which might affect the implementation of an intervention, while transferability also implies that comparable levels of effectiveness can be achieved (eg, effect sizes). 5 Therefore, it could be expected that the depth of the assessment depends on the underlying concept. However, we did not find such a relation. For example, of the three organizations focusing on applicability (according to Burford et al 5 ), only one specifically includes assessment criteria for implementation (eg, "Are essential resources for implementing the intervention available in the local setting?" 31 ). There are different possible reasons for this finding. First, the comprehensiveness of assessment criteria may depend more on the importance the organizations place on this topic than on the specific terms and definitions used. Second, the depth of assessment could also be related to the standardization of the approach rather than to the underlying terminology.

| Assessment approach concepts
Our comparison of the different methodological recommendations shows strong heterogeneity regarding the different assessment approach concepts, which confirms prior research. 1,3,9,33 The assessment approaches are integrated at different steps in the evidence synthesis producing pathway. Eight organizations exclusively recommend considering context along with study assessments, while four organizations additionally take aspects of context suitability into consideration when defining and applying eligibility criteria. However, the recommendations for considering context suitability in study selection are limited to basic considerations and do not include specific recommendations for operationalizations.
Half of the assessment approaches are based on integrated assessments, that is, as part of the assessment of study quality, the assessment of the quality of the body of evidence or the discussion and interpretation of results. The other half is based on standalone assessments. AHRQ even advises against using context suitability assessments as a part of the evidence grading. 16 This view is shared by other researchers. 3 According to Munthe-Kaas et al 3 the (exclusive) consideration of context suitability as a part of evidence grading (eg, GRADE 34 ) can be problematic, because of a lack of standardization and frequent use of ad hoc assessments. Interestingly, the GRADE system integrates context suitability (namely "[in]directness") with other criteria that affect the confidence in an effect estimate (risk of bias, imprecision, inconsistency, publication bias), thus mixing internal validity with context suitability concerns. 34 To a large extent it remains unclear, however, how findings from context suitability assessments should be considered within the GRADE framework. 15,17,23,26 Regarding the assessment level most organizations recommend a context suitability assessment of individual studies or a combined approach, while only two organizations recommend the exclusive assessment of a body of evidence.
In most cases, the context suitability assessment is based on a nonstandardized approach, while six organizations recommend a standardized approach. Few of the recommended assessment approaches seem to be systematically developed or tested. In particular, the nonstandardized approaches are less structured and provide few instructions for users, which may lead to unsystematic or unreproducible assessments. However, overly standardized assessments based on formalized checklists or questionnaires may not fit the specific purpose well (eg, they may contain irrelevant assessment criteria).
Assessing context suitability is particularly challenging for organizations that address international target audiences with a large range of readers in different contexts, because it seems impossible to address all imaginable contexts. In this case, one specific approach is the provision of sufficient information on PICO and setting, which allows end-users to conduct their own assessment. This approach is described in the literature as an alternative or addition to context suitability assessments. 14 However, of the four organizations with an international target audience, only the Cochrane Collaboration 10 recommends this approach.
Five organizations differentiate between explanatory and pragmatic studies instead of or as a part of the context suitability assessments. [15][16][17]23,32 Interestingly, the organizations make different judgements about the consequences of a pragmatic trial design. Four of them consider pragmatic trials to be more suitable to routine practice settings, [15][16][17]23 while EUnetHTA 32 considers pragmatic trials to have reduced context suitability, as they are affected by local clinical practices. This disagreement between the recommendations seems to rest on two opposing lines of argument regarding the context suitability of pragmatic trial designs. With their focus on clinical practice conditions, pragmatic studies may, on the one hand, be more context suitable than explanatory studies, as they better reflect reality. 8,35,36 On the other hand, this focus may reduce context suitability in cases where clinical practice conditions differ from the context of interest. 32 Similar opposing lines of argument can be found with regard to the a priori consideration of context suitability in study selection. On the one hand, broad eligibility criteria may better reflect reality. On the other hand, context suitability might decrease when studies with very specific characteristics that do not fit many other contexts are included. This suggests that evidence synthesists who apply broad eligibility criteria should consider specific methods, such as subgroup analyses, metaregressions or matrix based techniques, to disentangle the possible effects of context-relevant issues (eg, differences in population). 37

| Assessment criteria
Most organizations either provide examples or a list of relevant assessment criteria or specific questions to consider when assessing context suitability. Burchett et al, 9 who mapped and studied applicability assessment tools (for PHI), question the usability and usefulness of checklist-style tools and strict assessment criteria. The same is true for AHRQ 16 (which developed a formalized assessment approach that is also recommended by other organizations: EUnetHTA, 25,32 IoM 26 ), based on the rationale that context suitability considerations always depend on the research question and the clinical area, as well as the intervention of interest. Therefore, the first step of AHRQ's assessment approach is to determine the most important factors that might affect context suitability using the PICOS framework. AHRQ also provides a list of factors potentially affecting context suitability. Moreover, it recommends exclusively considering known effect modifiers, focusing on items that might affect outcomes. 16 The Cochrane Collaboration 10 and EUnetHTA 25,32 also focus on effect modification. 10,25 There is high heterogeneity regarding the content of recommended assessment criteria. While most organizations use a generic concept, some organizations refer to specific considerations for certain types of interventions such as PHI or diagnostics. Population, intervention and setting are the most frequently and most comprehensively reported categories. Similarities between study setting and target setting regarding population characteristics such as demographics or stage of illness and intervention characteristics including treatment regimen, intervention performance or adherence are frequently recommended assessment criteria. Furthermore, the healthcare setting is an important aspect in most of the recommended assessment approaches. Such healthcare setting characteristics include different standards of care or the availability of technologies. Furthermore, consideration of the patient-relevance of outcomes is often part of the assessment approaches.
We found no indication that the high heterogeneity regarding the recommended assessment criteria is related to the different terms (eg, applicability or generalizability) used by the organizations. The assessment criteria vary within and between the different terms and definitions used, and most assessment criteria do not seem to be term/concept-specific. Only the recommended assessment criteria of AHRQ, 16 EUnetHTA 25 and IoM 26 show substantial agreement. However, this can be attributed to the fact that EUnetHTA and IoM adopted a large part of the assessment criteria, as well as the "context suitability" definition, directly from AHRQ. Moreover, it was not possible to find other reasons for the heterogeneity regarding the assessment criteria, as only a few organizations (AHRQ, 16 Cochrane 10 or GÖG/LBI 15 ) explain why certain assessment criteria were selected or why they might be relevant to assessing context suitability. The lack of justifications for including certain assessment criteria might be attributed to the fact that little empirical evidence on assessment criteria for context suitability exists. 14 Accordingly, Dyrvig and colleagues, 33 who conducted an SR on checklists for external validity, found that none of the reported assessment criteria were justified by empirical data.
Moreover, in many cases it remains unclear how the recommended assessment criteria should be applied when judging context suitability. There are several reasons for this. First, most assessment approaches are generic and include relatively unspecific questions and criteria. Second, most methodological guidance documents do not provide any instructions or examples for users on how to use or operationalize the assessment tools. Missing instructions and guidance were also reported by Burchett et al. 9 Third, we could not find any recommendations on how to weight different context suitability issues. This lack of instructions on weighting issues is confirmed by Burford et al, 5 who analyzed context suitability assessment tools for complex interventions. All these restrictions and the missing information on how to apply context suitability tools could have a negative impact on the reliability of assessments.

| Other research
Prior research reviewing methodological guidelines for SR on health economic evaluations 13 and HTAs of public health interventions 12 also shows high heterogeneity regarding the proposed context suitability assessment. The heterogeneity includes aspects of terminology, assessment approach concepts and/or relevant assessment criteria. Moreover, a variety of instruments and checklists to assess context suitability have been developed and reviewed. 2,3,5,9,33,38,39 Recommendations vary even within the same scope (eg, context suitability of economic evaluations, public health interventions, systematic reviews) or the same terminology (eg, external validity, applicability), and there seems to be no tool with wide acceptance in any area. 1,[3][4][5]16 Our review confirms these studies, as none of the included evidence synthesis producing organizations (except CRD 31 for PHI) recommends one of these previously developed assessment approaches without adaptations. A promising solution might be a more flexible approach guided by stakeholders that is separated from other aspects of assessing the validity of the body of evidence. 3

| Limitations
Our integrative review has some limitations. First, in some cases, it was unclear whether the statements in the guidelines could be understood as recommendations or rather as general scientific statements on context suitability. Second, many aspects left plenty of scope for user interpretation, as they were not described in detail or the guidelines only contained some examples. Third, we excluded some documents due to language restrictions. Fourth, we had some difficulties with hand searching the webpages of organizations, as some had confusing website structures, and we did not contact HTA organizations or SR organizations for unpublished methods documents. Fifth, the methodological recommendations included in this review are restricted to those specific organizations that could be identified in a structured way by using the member lists of international HTA umbrella organizations and major SR organizations. Sixth, the methods of an integrative review 22 are not fully applicable to our purpose. Therefore, we had to adapt these methods.

| CONCLUSION
A large number of different approaches exist on how to consider concepts of context suitability when producing evidence syntheses. However, the few hits of our structured search for applied methodological recommendations of evidence synthesis producing organizations show that the topic still receives little attention in practical application. The included methodological recommendations vary widely regarding the underlying terminology, the assessment approach concepts and the assessment criteria. Further, the provision of nonstandardized assessment approaches and missing instructions on how to use the assessment tools or how to judge the different assessment criteria might lead to nontransparent and heterogeneous assessments. It appears justified that the assessment of context suitability is somewhat heterogeneous because of the need to tailor the assessment to different target audiences/decision contexts and the varying relevance of aspects in relation to the medical area. Consequently, a single universal rating system seems inappropriate. Nevertheless, many differences between the recommendations seem unexplained. More harmonization is desirable and appears possible.
Our work provides a basis for developing or adapting current context suitability assessment approaches that suit specific requirements. A strict set of defined assessment criteria and the use of rigid checklists or questionnaires seem inappropriate for the application to different research questions, medical areas, decision situations and concepts of context suitability. Instead, the consideration of "basic" categories that can be adapted for each research question is more appropriate. 9,16,28 One challenge in developing new tools is to find a good balance between standardized, reproducible procedures (apart from checklists and questionnaires) and flexible, adaptable assessment criteria. Such a "semi-standardized" assessment approach should be accompanied by clear user instructions. This may include a core/basic part for the context suitability assessment relevant for most intervention types, decision contexts and all concepts of context suitability (generalizability/external validity, applicability, transferability) and the possibility of adding supplementary parts, which might be necessary for specific intervention types, decision contexts or concepts of context suitability (eg, assessing implementation issues in case of applicability assessment; assessing potential impact on the effect size in case of transferability assessment). Here it would be important to focus less on differences and similarities between the study context and the decision context and more on the mechanism of action 9 or effect modifiers. This means focusing on differences that are likely to have an impact rather than on all that could possibly have an impact. Another challenge in the development of a flexible tool will be balancing validity (all context suitability problems identified and adequately assessed) and reliability, because the former will probably increase with additional flexibility while the latter will decrease.