Consensus on the definition and assessment of external validity of randomized controlled trials: A Delphi study

External validity is an important parameter that needs to be considered for decision making in health research, but no widely accepted measurement tool for the assessment of external validity of randomized controlled trials (RCTs) exists. One of the most limiting factors for creating such a tool is probably the substantial heterogeneity and lack of consensus in this field. The objective of this study was to reach consensus on a definition of external validity and on criteria to assess the external validity of RCTs included in systematic reviews. A three‐round online Delphi study was conducted. The development of the Delphi survey was based on findings from a previous systematic review. Potential panelists were identified through a comprehensive web search. Consensus was reached when at least 67% of the panelists agreed to a proposal. Eighty‐four panelists from different countries and various disciplines participated in at least one round of this study. Consensus was reached on the definition of external validity (“External validity is the extent to which results of trials provide an acceptable basis for generalization to other circumstances such as variations in populations, settings, interventions, outcomes, or other relevant contextual factors”), and on 14 criteria to assess the external validity of RCTs in systematic reviews. The results of this Delphi study provide a consensus‐based reference standard for future tool development. Future research should focus on adapting, pilot testing, and validating these criteria to develop measurement tools for the assessment of external validity.


Highlights
What is already known
• There is a lack of consensus regarding the concept of external validity as well as the domains, criteria, and methods necessary to assess it in systematic reviews.
• Existing measurement tools to assess the external validity of controlled trials could not be recommended due to a lack of rigorous development and validation.

What is new
• This study provides a consensus-based reference standard for future tool development, including a new definition of external validity, criteria to assess the external validity of RCTs and methodological considerations for the assessment of external validity.

Potential impact for Research Synthesis Methods readers
• The results of this study are a good starting point for future research. Future research should focus on adequate tool development and validation and on meta-epidemiological research.

| BACKGROUND
External validity is considered an important factor for decision making in health research.1,2 Although research on external validity has increased in the last decades,1 there are still many shortcomings and methodological issues in this regard.3,4 Research has focused on examining various aspects of the internal validity (rather than the external validity) of randomized controlled trials (RCTs) (e.g., risk of bias [RoB]),4,5 and there are recommended tools for critically assessing the internal validity of RCTs (e.g., the Cochrane RoB tool 2.0 recommended by the Cochrane Collaboration6). Although several tools for assessing the external validity of RCTs are available in the literature, no assessment tool has yet been exhaustively developed.3 These tools are heterogeneous with respect to the number of items or dimensions, response options, and their development processes.1,3 Furthermore, data on measurement properties are not reported or are considered unsatisfactory for most tools.3 In many evidence synthesis methods, external validity is currently evaluated at the level of the body of evidence, with the indirectness domain of the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach.7 However, the indirectness domain is not considered the perfect solution for assessing external validity8 and could not be recommended for the assessment of external validity of RCTs due to insufficient evidence regarding its development process and measurement properties.3
One of the most limiting factors is most likely the lack of consensus regarding the concept of external validity as well as the domains and criteria necessary to assess it. Hence, the development of new tools or revisions of available tools may not be appropriate before consensus-based standards have been developed. It therefore seems highly necessary to seek consensus to overcome the issues regarding the external validity of controlled trials in health research. This Delphi study responds to this gap in the literature.
External validity assessment is considered context dependent.1,9,10 Therefore, it depends not only on the research question but also on who states the question and which stakeholders are interested in the answer to the question. Since authors of systematic reviews, Health Technology Assessments (HTA), and Clinical Practice Guidelines (CPG) seem to pursue different aims, approaches, or interests when assessing the quality as well as the applicability of the evidence,11 a differentiation seems necessary.
RCTs are considered the most robust research design for investigating cause-and-effect mechanisms of interventions.12 However, the study design of an RCT is susceptible to a lack of external validity due to the randomization process and, consequently, the (often) poor willingness of eligible participants to participate, as well as the use of strict inclusion and exclusion criteria (in comparison to observational studies).13,14 In addition, due to differences in the information requested by reporting guidelines,17,18 the respective items used for assessing external validity can vary between research designs. Acknowledging the importance of RCTs in health care related fields, this project focused on the external validity of RCTs from the point of view of authors of systematic reviews of interventions.
The aims of the present study were as follows: (1) To reach consensus on a definition of external validity; (2) To reach consensus on the relevance of specific criteria (i.e., items) to assess the external validity of RCTs included in systematic reviews (using the population, intervention, comparator, outcomes and setting (PICOS) framework) as well as (3) to reach consensus on the comprehensiveness of the set of criteria.

| METHODS
When there is a lack of agreement, an incomplete state of knowledge, or a lack of evidence, a Delphi study is an appropriate research method to overcome methodological issues.19 Therefore, a three-round online Delphi study with feedback reports was conducted. The study was pre-registered on July 7, 2021 in the Open Science Framework (OSF) under the registration DOI https://doi.org/10.17605/OSF.IO/KSW58, and approved by the ethics committee of the University of Lübeck (file number: 22-018). Details on the development of the survey and the identification of panelists are described in the following sections. Before starting the survey, panelists received information on the background and aim of the study, the confidentiality of data, and the right to withdraw at any time. Panelists were asked to give their consent by clicking 'yes' to proceed with the survey (in each round). At the end of the survey, panelists were asked to give permission to be acknowledged in the publication report. All panelists provided informed consent.
This Delphi study was conducted and reported in accordance with the proposed guidelines from Hasson et al. 20 as well as Spranger et al. 21

| Delphi steering committee
The Delphi steering committee consisted of researchers with experience or expertise in conducting systematic reviews of interventions (AJ, TB, DC, SAO, KL), Delphi studies (SAO, KL), tool development/validation (TB, SAO, KL), trial quality assessment (AJ, TB, DC, SAO, KL), and meta-epidemiology (SAO, DC). One researcher (SAO) is an active member of the methodological committee of the Cochrane Rehabilitation group. None of the Delphi steering committee members participated in the Delphi surveys.

| Preparation of the survey
The development of the Delphi survey was based on the findings of a previous systematic review aiming to identify tools to assess the external validity of RCTs and to evaluate their quality regarding measurement properties.3 Information on the definition of external validity, methodology, and criteria to assess the external validity of RCTs was extracted from the included reports. In addition, the references of these reports were screened manually for further information on definitions or criteria.
The first (round) survey consisted of two parts: the first part addressed the definition of external validity and the second part addressed criteria for assessing the external validity of RCTs in systematic reviews.
There are plenty of definitions for external validity and a variety of different terms, e.g., generalizability, applicability, and transferability. 1 Two of the most cited definitions of external validity (by Campbell & Stanley 1963 22 and by Jüni et al. 2001 23 ) were proposed to the panelists, who were then asked to indicate which of the two definitions captures the construct of external validity best.
To generate the pool of proposed criteria, eligible questions ("items") were extracted from the tools identified in the previously conducted systematic review of measurement properties.3 Items were excluded if they were relevant only to the reporting quality of external validity (rather than external validity itself), to internal validity (e.g., methodological quality or risk of bias), or to other quality criteria (e.g., imprecision). The eligible questions were collated into 18 items and reworded by two independent researchers (AJ, KL). Subsequently, discrepancies were discussed within the Delphi steering committee to reach consensus on the wording of the proposed items and their descriptions for the Delphi study. More information on the preparation of the proposals can be found in the first-round survey (Additional file 2).
No response categories (e.g., dichotomous, open-ended, or Likert scales) were proposed for the individual items/criteria, since these should be considered and developed by tool developers in future research and are therefore beyond the scope of this Delphi study. Furthermore, it was assumed that the proposed items may need to be adapted by tool developers to fit their particular context of interest. Therefore, the items were worded to allow for modifications, adaptations, and different response categories. Proposed items were structured in accordance with the PICOS framework.24,25 To ensure comprehensiveness of the set of criteria, an open-ended question was added asking panelists whether they were aware of any additional potential items.
One member of the Delphi steering committee, who was not involved in the preparation of the survey, reviewed and tested the survey draft and made suggestions on wording and missing information. The final survey was created in LimeSurvey (https://www.limesurvey.org/de/) after revision and review by all members of the Delphi steering committee. An e-mail was sent to each (potential) panelist with information about the background and aim of the study, a link to the survey homepage, and a personalized access code.

| Identification of panelists
Authors of the reports included in the above-mentioned systematic review of measurement properties3 were invited to take part in the Delphi study. In addition, international organizations that aim to develop and update research synthesis methods were contacted, for example, the Cochrane Collaboration, the Joanna Briggs Institute (JBI), and the GRADE Working Group. Experts were also invited who had published in the fields of systematic review methodology, meta-epidemiology, and the development or validation of trial quality appraisal tools (especially tools assessing an external validity related construct), as well as researchers with years of experience in conducting systematic reviews of interventions from various health-related disciplines such as psychology, nursing, occupational therapy, physical therapy, and surgery, among others. Potential panelists were researchers who (1) had been involved in conducting evidence synthesis for at least 5 years and had contributed (as authors) to the conduct and publication of at least four systematic reviews of interventions; or (2) had been involved in the development or validation of a measurement tool to assess the external validity (or a similar construct) of RCTs (or various study designs including RCTs); or (3) had contributed (as authors) to the conduct and publication of a meta-epidemiological study or methodological review regarding trial quality assessment (especially considering external validity or a similar construct) of RCTs (or various study designs including RCTs). The editorial board of the journal BMC Systematic Reviews and the websites of the Cochrane Geographic Groups, Cochrane Review Groups, GRADE centers, and JBI were inspected for potential panelists. For each potential panelist identified, a brief search in PubMed was conducted to verify eligibility (in accordance with the eligibility criteria mentioned above).

| Targeted sample size
Most Delphi studies have a sample size between 11 and 25 participants in the final round,26 and there is evidence that a sample size of at least 20 participants is required to ensure stable results.27 Therefore, a minimum sample size of 20 participants in the last round of the present Delphi study was anticipated. From observations made in published Delphi studies, it is known that the response rate of invited experts can be as low as 16% or lower28,29 and that the response rate can drop to as low as 50% from round to round.30,31 Based on these considerations, it was planned to invite at least 500 potential panelists.
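The arithmetic behind this recruitment target can be made explicit. The following back-of-envelope sketch (not part of the original protocol; the rates are the figures cited above) works backwards from the required final-round sample size:

```python
import math

def invitations_needed(final_round_n, initial_response_rate, round_retention, n_rounds):
    """Back-calculate the number of invitations for a multi-round Delphi study.

    final_round_n: minimum panelists required in the last round (here 20)
    initial_response_rate: fraction of invitees expected to respond (here 0.16)
    round_retention: fraction retained from one round to the next (here 0.50)
    n_rounds: total number of rounds (here 3, i.e., two round-to-round drops)
    """
    # Panelists needed in round 1 so that, after (n_rounds - 1) drops,
    # at least final_round_n remain.
    round1_n = final_round_n / round_retention ** (n_rounds - 1)
    # Invitations needed to recruit that many round-1 panelists.
    return math.ceil(round1_n / initial_response_rate)

print(invitations_needed(20, 0.16, 0.50, 3))  # 80 round-1 panelists needed -> 500 invitations
```

With a 16% response rate and 50% retention across two transitions, 20 final-round participants indeed require about 500 invitations, matching the study's plan.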

| Analysis
The panelists were asked to rate their agreement or disagreement with proposed definitions and criteria using a five-point rating scale ("strongly agree" to "strongly disagree"). In health and social sciences there are various threshold values for reaching consensus, defined by percentage agreement, ranging from 50% to 97%.26 However, the most suitable threshold depends on the rating scale used to rate the panelists' agreement.32 There is evidence that high thresholds lead to no consensus when a five-point rating scale is used.32 For this Delphi study, consensus was defined as ≥67% of panelists agreeing with a proposal. Accordingly, consensus was reached if ≥67% of the participants rated "somewhat agree" or "strongly agree". This threshold value is commonly used in Delphi studies aiming to reach consensus with a five-point rating scale.30,33,34 In addition, panelists had the option to add open-ended comments to each proposal or question. Confidence intervals of consensus scores were calculated with the Agresti-Coull method recommended by Brown et al.35 using the GraphPad calculator (https://www.graphpad.com/).
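The consensus rule and the Agresti-Coull interval are straightforward to reproduce. The sketch below is illustrative only (the study used the GraphPad calculator; function names and the 41/41 example count are assumptions for demonstration): the Agresti-Coull method adds z²/2 pseudo-successes and z² pseudo-trials before applying the standard Wald formula.

```python
from statistics import NormalDist

def agresti_coull_ci(x, n, conf=0.95):
    """Agresti-Coull confidence interval for a proportion x/n (Brown et al.)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ~1.96 for a 95% interval
    n_adj = n + z ** 2                            # adjusted number of trials
    p_adj = (x + z ** 2 / 2) / n_adj              # adjusted proportion
    half_width = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

def consensus_reached(agree, n, threshold=0.67):
    """Consensus rule of this study: >=67% rate 'somewhat agree' or 'strongly agree'."""
    return agree / n >= threshold

# Illustrative example: 41 of 41 panelists agreeing (a 100% consensus score)
lo, hi = agresti_coull_ci(41, 41)
print(consensus_reached(41, 41), round(lo, 3), round(hi, 3))
```

Note that even a 100% observed consensus has a lower confidence bound below 1, which is why the study reports intervals alongside consensus percentages.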
If consensus was not reached (<67%), a modified proposal was presented in the next round based on the feedback from the panelists. All open-ended comments from panelists were independently reviewed by at least two members of the Delphi steering committee. Subsequently, the comments were collated, summarized, and included in the subsequent survey together with all results from the previous round as well as new proposals based on the panelists' feedback. Panelists also had access to the feedback reports with all results and all comments from the previous round.

| RESULTS
Invitations were sent to 500 potential panelists. Eighty-six panelists (17%) agreed to participate, 84 panelists (17%) participated (at least partially) in at least one round, and 66 panelists (13%) provided complete responses in at least one round. Details of the participation rate in each round are presented in Figure 1.
According to the answers from panelists who provided complete responses in at least one round, most panelists had between 11 and 20 years of experience conducting evidence syntheses (Table 1). Six percent (4/66) had published four (or fewer) systematic reviews of interventions, 35% (23/66) had published five to 10 systematic reviews, 32% (21/66) had published 11-20 systematic reviews, and 27% (18/66) had published more than 20 systematic reviews. One-third (22/66) had previously been involved in the development or validation of a quality assessment tool for RCTs, and 48% (32/66) of the panelists had conducted a meta-epidemiological study regarding RCT quality assessment. The panelists reported working in 23 different countries, and their areas of expertise varied widely, covering at least 20 different areas (Table 1). Most panelists were active members of at least one of the following organizations: Cochrane, JBI, or the GRADE working group.

| Definition of external validity
Consensus was reached on the definitions by Campbell & Stanley22 and Jüni et al.23 (77% and 84%, respectively) in the first round. Thirty-nine percent (32/82) of the panelists preferred Campbell & Stanley's definition, 56% (46/82) preferred Jüni et al.'s definition, and 5% (4/82) had no preference. However, many panelists expressed concerns about both definitions: in general, Campbell & Stanley's definition, although not incorrect, was considered too specific and not applicable to all RCTs. In addition, some alternative terms to the PICOS criteria included in this definition were suggested. Jüni et al.'s definition was considered easier to understand but rather vague and not as comprehensive as Campbell & Stanley's definition. Several panelists suggested combining both definitions. Based on the panelists' suggestions, a revised definition was proposed in the second round.
The following definition of external validity reached 100% consensus in the second round (Table 2): "External validity is the extent to which results of trials provide an acceptable basis for generalization to other circumstances such as variations in populations, settings, interventions, outcomes, or other relevant contextual factors." In general, panelists felt that the new definition is more precise and comprehensive, yet flexible and applicable beyond medicine. Three panelists considered the word "acceptable" rather vague and suggested removing or defining this term. In order to meet the panelists' request (from the first round) for a flexible definition, and since 100% consensus was reached for the revised definition, it was preferred not to revise the new definition further. As one panelist commented, "No definition will be perfect, but this one provides a good balance between detail and flexibility." More details on the consensus, discussion, and revision of the definition of external validity can be found in the second-round survey (Additional file 3).

| Criteria to assess the external validity of RCTs
Consensus was reached on the relevance of 14 of the 18 proposed criteria (Table 3). However, for some of the criteria for which consensus was reached, panelists expressed concerns about their feasibility. Some criteria may not be appropriate for all health care fields. In addition, it was noted that for some criteria, although relevant, there is insufficient information in most RCTs to evaluate them. Therefore, in the second and third rounds, it was proposed to include these items, about which feasibility concerns had been raised, in the set of criteria as 'optional items'. Panelists agreed to include five items as 'optional items' in the set of criteria (Table 3).
In line with panelists' suggestions, some minor revisions (rewording) were made or more information was added to some items. For example, two panelists recommended not focusing exclusively on differences and similarities between the characteristics of participants, settings, interventions, and outcomes of the study and the review's context of interest. One panelist suggested to "move away from looking for similarities and differences, and instead encourage reviewers to consider whether the differences are likely to affect results". Another panelist suggested focusing on effect modifiers instead. Consistent with these suggestions, a note was added to the description and elaboration of each proposed item: "If there are differences (e.g. between the RCT's characteristics and the review's context of interest), reviewers should consider whether there is any evidence or concern that suggests that results may differ for the review's context of interest, such as whether these differences can be considered effect modifiers" (Table S1).
Based on comments from panelists in the first round, major revisions were made to two of the proposed items (Item J and Item N). Therefore, panelists were asked to rate the relevance of these two reworded items again in the second round. Consensus was reached on one of these items (Item J, 93%), but not on the other (Item N, 53%). One item (Item A) was considered (by panelists) to measure two independent aspects. Therefore, this item was split (now Items A.1 and A.2), and panelists were asked to rate the relevance of these two items again in the second round. Both items reached consensus on relevance (93% and 77%, respectively).
For one item (Item L: "Was the follow-up time frame relevant for the review's context of interest?"), which reached consensus, some panelists argued that it may be more relevant to internal validity than to external validity. Therefore, panelists were asked in the second round whether they considered this item more important for external validity, internal validity, or both. Thirty-two percent (13/41) of panelists felt that this item is more relevant to external validity, 27% (11/41) felt that it is more relevant to internal validity, and most panelists (41%, 17/41) felt that it is relevant to both (Additional file 3). Comments from panelists indicated that, because this item is relevant to both aspects, it should be addressed for both. Hence, this item was retained.
The final set of criteria is presented in Table 3, and the full set of criteria, including descriptions, elaborations, and examples, is provided in Table S1 (see Additional file 1). During the study, items were labeled with letters from A to R. For reasons of comprehensibility, the final items in Table 3 are labeled with numbers 1-14.

| Structure and comprehensiveness of the set of criteria
Panelists were asked whether they agreed with the dimension structure based on the PICOS (population, intervention, comparison, outcome, and setting) format, as proposed by Atkins et al.24 and Bornhöfft et al.25 Consensus was reached for this dimension structure (95%). Although some panelists pointed out that this structure may not encompass all factors relevant to external validity, they agreed that it would be very helpful when assessing the external validity of RCTs included in systematic reviews and was currently the best option. In the first round, panelists were asked whether they were aware of any potential item that may be relevant for the set of criteria. Most panelists replied 'no'. A few suggestions were made, which in turn were used to reword or supplement the descriptions of items.
In the second round, panelists were asked whether they considered the set of criteria comprehensive enough to assess the external validity of RCTs included in systematic reviews. One hundred percent consensus was reached for the comprehensiveness of the set of criteria (see Additional file 4).

| Methodological considerations
In the first round, many panelists' comments indicated that a study-level assessment may not always be relevant or necessary. It was pointed out that the evaluation of the whole body of evidence (e.g., using the GRADE approach) may be sufficient in many cases. Therefore, in the second round, recommendations were proposed for when a study-level assessment of external validity should be considered. Consensus was reached (89%) for these recommendations (Table 2). In addition, to explore a trend in how the results of an external validity assessment should be used, seven scenarios were presented on how the results of an external validity assessment could be used in a systematic review. Panelists were asked to indicate which of the seven proposals they agreed with (Additional file 3). Two proposals were considered appropriate by ≥67% of the panelists. Two new scenarios were suggested by panelists in the comments section and proposed in round three. Consensus (≥67%) was reached for both. The four consensus-based recommendations on how to use the results of an external validity assessment are presented in Table 2.

TABLE 2 Methodological considerations in systematic reviews and definition of external validity (columns: Category; Description; Round, consensus, 95% CI).

Category: Recommendations for when to consider assessing the external validity of RCTs at the study level.
• When an assessment of indirectness is required at the study level to assess the certainty of evidence (e.g., with the CINeMA web app in NMA).
• When different levels of external validity are to be expected between the included RCTs, e.g., due to clinical heterogeneity between included RCTs; or if there is a potential mismatch between the included trials and the review's context of interest in PICOS, and there are concerns that these differences/mismatches may be potential effect modifiers.
• When conducting a systematic review specifically to evaluate the applicability of a particular intervention (e.g.

| DISCUSSION
This Delphi study intends to inform and guide future research on the adequate development and validation of measurement tools to assess the external validity of RCTs included in systematic reviews. To the best of our knowledge, this is the first Delphi study to reach consensus on a definition of external validity and on criteria for its assessment.
The results of the present study provide a consensus-based reference standard with some degree of content validity ('relevance' of items and 'comprehensiveness' of the set of criteria), one of the most important measurement properties of latent constructs.30,36,37 This is a good starting point for future research. However, it should be noted that this study does not replace all essential steps required to develop an assessment tool. The set of criteria should be adapted, pilot-tested, and validated. One important first step in tool development is the definition or description of the construct of interest. This is considered an essential requirement for tool development,30,38 and a poorly defined construct (of interest) is a significant threat to content validity.39 Many researchers may take this step for granted, but most reports on the development of external validity assessment tools have lacked a description of the construct or target population of interest.3 The consensus-based definition of external validity established in this study (Table 2) will be a good reference and starting point for tool developers to define their construct of interest. The PICOS approach may be used to supplement the description of the construct of interest.
Currently, there is no comprehensive methodological guidance for the development of trial quality assessment tools. Whiting et al.40 published a framework with recommendations for developing quality assessment tools. Their framework may provide some helpful advice for the initial development process, but it does not include information on the essential steps for testing measurement properties recommended in standard guidelines for tool development, such as those developed by de Vet et al.,37 Streiner et al.,36 and DeVellis.38 Although these guidelines were mainly developed for health outcome measurements, they may be used as references for the adaptation, pilot testing, and validation of these criteria.
The present Delphi study does not intend to guide the development of one particular tool, but rather any potential future tool. Although it is methodologically possible to develop a generic tool in this regard, it is unlikely that a single tool will be able to capture such a complex construct (external validity) in the same way for all health-related fields. Therefore, several tools or adaptations may be necessary in accordance with the construct of interest (e.g., external validity for a specific health-related field). This is one of the reasons why the proposed items were worded in a way that allows future tool developers to adapt them to their particular construct of interest. Furthermore, panelists were not asked for their opinion on the 'comprehensibility' of items (another important factor of content validity), as this should be done by future tool developers after adapting (including rewording) the items and during pilot testing of their set of criteria within the tool under development.
Some panelists suggested that some items may be redundant or overlapping. The steering committee of the current work believes that it is methodologically possible to merge items or to use some items as signaling questions for others. This depends on the methodology chosen by tool developers, and it is highly recommended to pilot-test whether merging items is feasible. Caution should be taken with statistical approaches such as factor analysis or item response theory: they depend on the measurement model of the tool under development and are only relevant for reflective measurement models (they are not recommended for formative measurement models).41,42 Hence, tool developers should specify the measurement model before using these statistical approaches.
In the first Delphi round, a few panelists suggested adding items related to trial design and external validity. However, to our knowledge, these aspects have been addressed in previous research. There are two current and validated tools that address RCT design in terms of applicability: the PRagmatic Explanatory Continuum Indicator Summary (PRECIS) 2 tool (from a trialist's perspective)43 and the Rating of Included Trials on the Efficacy-Effectiveness Spectrum (RITES) tool (from a reviewer's perspective).44 Delphi studies were performed for both tools, and the RITES tool turned out to have strong evidence for its reliability and validity.3 Therefore, it was considered redundant or unnecessary to address design-related items in the present Delphi study. Furthermore, these design-related items are predominantly used to classify RCTs as pragmatic or explanatory (effectiveness or efficacy). However, external validity is not a static construct but rather a context-dependent and consequently flexible construct.1 Hence, these design-related items are not suitable for assessing the external validity of RCTs. It would be possible to do both, classifying RCTs as pragmatic or explanatory and assessing the external validity of these RCTs in relation to the research question. However, this may be very time-consuming, and authors may consider focusing on one aspect. The indirectness of a design-related construct and external validity is discussed elsewhere.3
There are a few things that researchers and tool developers need to be aware of regarding the set of criteria. In line with our experience conducting the systematic review of measurement properties as well as this Delphi study, and considering some concerns raised by panelists, some suggestions are summarized in the following section: First, the study-level assessment of external validity should not replace, but rather supplement or facilitate, the assessment of the whole body of evidence. However, it should be kept in mind that panelists did not recommend using the external validity assessment to downgrade the certainty of evidence (see Additional file 4). Second, the set of criteria should not be used to assess the external validity of RCTs as it is currently presented here. Several steps are necessary (in accordance with tool development methodology) before this set of criteria can be used to systematically assess the external validity of RCTs. The set of criteria may currently be used to discuss the findings of systematic reviews in terms of external validity in the discussion section, but not to make an overall judgment of external validity. Third, it is generally not recommended to use or develop a conventional checklist approach,1,4,9 and a more comprehensive but flexible approach might be necessary. Dimensions and criteria may not always be equally relevant and therefore should not be equally weighted using a sum score for external validity. There are other methodological options, e.g., domain-based assessments or assessing external validity in a hierarchical manner (the revised Cochrane RoB tool 2.045 uses a comparable approach).
Fourth, although consensus was reached on the comprehensiveness of the set of criteria, it may not encompass all factors relevant to external validity in all health-related fields. Researchers may consider adding other relevant factors in relation to their construct of interest and pilot-testing them before use in evidence synthesis. Fifth, consensus was reached (74%) that the set of criteria may also be used for other study designs, such as nonrandomized controlled trials or prospective cohort studies (Table 2). Tool developers should take this into account when rewording and adapting their items. Sixth, some panelists acknowledged that the descriptions, elaborations, and examples of many items are essential to fully understand each item. It is therefore recommended to read and add the descriptions, elaborations, and examples of items as presented in Table S1. Adjustments to these descriptions and examples may be necessary in relation to the construct of interest.
Seventh, some panelists pointed out that the optional items are crucial to the comprehensiveness of the set of criteria. Tool developers should therefore only omit these items for compelling reasons and after pilot testing. Eighth, the set of criteria uses terms that were believed to be flexible enough for the Delphi study, such as "representative" or "review's context of interest". Tool developers are encouraged to replace these terms with more suitable or specific terms in line with the construct of interest. Ninth, it will be important for tool developers to provide guidance on how to deal with missing information on external validity-related factors in the primary reports. This is an issue of reporting quality, and an RCT should not automatically be rated as having low external validity because its reporting quality was low. Other appraisal tools, such as the revised Cochrane Collaboration RoB 2.0 tool45 or the COSMIN RoB tool,46 may offer some guidance in this regard. Tenth, while the items in the set of criteria place an emphasis on effect modifiers (among others), not much is known about the effect-modifying potential of external validity-related factors in clinical intervention studies.1 Future research is needed to investigate this question.
Statistical approaches have been proposed in health-related research3 as well as in other fields, such as computer science,47 political science,48 biostatistics,49 and educational science,50 to assess external validity in more objective ways, for example through causal inference or the generalizability index. Although not all of these approaches seem to take into account all relevant dimensions or criteria proposed in this Delphi study, they may help explore the causal relationship or even the effect-modifying potential of external validity-related factors in RCTs in future research.

| STRENGTHS AND LIMITATIONS
Experts from various disciplines and countries contributed to this study to reach international and interdisciplinary consensus on a definition of external validity and on criteria for the assessment of external validity. This Delphi study therefore provides the first international consensus-based reference standard for the assessment of external validity of RCTs included in systematic reviews.
The first limitation of this study is the low response rate of potential panelists (17%) and the decrease in responses per round. This seems to be a general design-related limitation. Based on observations from published Delphi studies, this was anticipated (see "Targeted sample size" above), and the minimal sample size stated in our preregistered protocol was reached. However, as suggested by other authors,51 the results might have differed if a different subset of the invited participants had taken part, as some of those who did not participate may hold different opinions. Another limitation might be that the focus was on external validity specifically from a systematic review author's perspective. The results may therefore not be applicable to authors of other evidence synthesis methods, such as HTAs or CPGs. Nevertheless, the results may be helpful, and adaptations (e.g., adding items related to cost-effectiveness) may be made in relation to the methodology implemented in these other evidence synthesis methods.
Due to the length of the surveys and the time required to complete them, it was decided not to address the various terms used for external validity, as originally stated in our preregistered protocol. Consensus on terminology may be investigated in future studies.

| CONCLUSION
The present Delphi study reached international consensus on a definition of external validity, a set of criteria to assess the external validity of RCTs in systematic reviews, the dimension structure and comprehensiveness of the set of criteria, and methodological considerations for the assessment of external validity. Future research should focus on adapting, pilot-testing, and validating the set of criteria to develop measurement tools for the external validity assessment of RCTs in systematic reviews. Additionally, since little is known about the effect-modifying potential of external validity-related factors, meta-epidemiologic studies investigating this question are desirable.

KEYWORDS
consensus, content validity, Delphi study, external validity, randomized controlled trial, trial quality assessment

TABLE 1
Abbreviation: RCT, randomized controlled trial.
a Panelists were free to give multiple answers regarding the field of expertise and the country in which they mainly work. Therefore, the mean number of panelists may differ per category.

TABLE 2
Set of criteria to assess the external validity of RCTs.