i Description of How Expert Panel Members Were Selected
NHLBI initiated a public call for nominations for panel membership to ensure adequate representation of key specialties and stakeholders and appropriate expertise among expert panel and work group members. A nomination form was posted on the NHLBI Web site for several weeks and distributed to a guidelines leadership group that had given advice to NHLBI on its guideline efforts. Information from nomination forms, including contact information and areas of clinical and research expertise, was entered into a database.
After closing the call for nominations, NHLBI staff reviewed the database and selected a potential chair and co-chair for each expert panel and work group. The potential chairs and co-chairs provided NHLBI with conflict of interest disclosures and copies of their curricula vitae. The NHLBI Ethics Office reviewed the disclosures and cleared or rejected individuals being considered as chairs and co-chairs. The selected chairs were then formed into a Guidelines Executive Committee (GEC), which worked with NHLBI to select panel members from the list of nominees.
NHLBI received 440 nominations for potential panel members with appropriate expertise for the task. Panel selection focused on creating a diverse and balanced composition of members. Panel members were selected based on their expertise in the specific topic area (e.g., high blood pressure, high blood cholesterol, and obesity) as well as in such specific disciplines as primary care, nursing, pharmacology, nutrition, exercise, behavioral science, epidemiology, clinical trials, research methodology, evidence-based medicine, guideline development, guideline implementation, systems of care, and informatics. The panels also included, as voting ex officio members, senior scientific staff from NHLBI and other Institutes of the National Institutes of Health (NIH) who are recognized experts in the topics being considered.
ii Description of How Expert Panels Developed and Prioritized Critical Questions
After panels were convened, members were invited to submit topic areas or questions for systematic review. Members were asked to identify topics of the greatest relevance and impact for the target audience of the guideline, which is primary care providers.
Panel members submitted proposed questions and topic areas over a period of several months. The number of critical questions (CQs) was scoped to available resources, and panel members then prioritized and ranked the CQs through group discussion and voting. The rationale for each priority CQ is addressed in the main report.
With support from the methodologist and systematic review team, panel members formulated priority CQs and developed inclusion and exclusion criteria (I/E criteria). I/E criteria define the parameters for selecting literature for a particular CQ; they were developed to be clear and precise so that they could be applied consistently across the literature identified in the search. I/E criteria were defined and formatted using the PICOTS framework, which structures a research question by specifying the following components in the CQ statement or in the question's I/E criteria:
- Population: the population of interest
- Intervention (or exposure): the intervention or exposure being evaluated
- Comparator: the alternative against which the intervention or exposure is compared
- Outcomes: the health outcomes of interest
- Timing: the timeframe of followup
- Setting: the setting in which the question applies (e.g., primary care)
The final CQs and criteria were submitted to the literature search team for search strategy development.
iii Literature Search Infrastructure, Search Strategy Development, and Validation
The literature search was performed using an integrated suite of search engines that explored a central repository of citations and full-text journal articles. The central repository, search engines, search results, and Web-based modules for literature screening and data abstraction were integrated within a technology platform called the Virtual Collaborative Workspace (VCW). The VCW was custom-developed for the NHLBI systematic evidence review initiative.
The central repository consisted of 1.9 million citations and 71,000 full-text articles related to cardiovascular disease (CVD) risk reduction. Citations were acquired from the following databases: PubMed, Embase, CINAHL®, Cochrane, PsycINFO, Wilson Science, and Biological Abstracts®. Literature searches were conducted using a collection of search engines that included TeraText®, Content Analyst, Collexis, and Lucene. The first three engines were used for executing search strategies, and Lucene was used to correlate the search with literature screening results.
For every CQ, the literature search and screening were conducted according to the panel's understanding of the question and the I/E criteria, which specified the characteristics of studies relevant to the question. Criteria were framed in the PICOTS format, and the question and PICOTS components were translated into a search strategy involving Boolean and conceptual queries.
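To make that translation concrete, the following minimal sketch assembles a PubMed-style Boolean query from PICOTS components. It is purely illustrative: the terms, synonyms, and concepts are hypothetical and are not drawn from any actual NHLBI search strategy, although the field tags ([mh] for MeSH heading, [tiab] for title/abstract, [pt] for publication type) are standard PubMed syntax.

```python
# Hypothetical PICOTS components, each with a list of synonyms/MeSH terms.
picots = {
    "population": ['"hypertension"[mh]', 'hypertensive[tiab]'],
    "intervention": ['"sodium, dietary"[mh]', '"salt intake"[tiab]'],
    "outcomes": ['"blood pressure"[mh]', '"blood pressure"[tiab]'],
}
exclusions = ['"case reports"[pt]', 'editorial[pt]']  # exclusion rules

def or_block(terms):
    """Join the synonyms for one PICOTS component with OR."""
    return "(" + " OR ".join(terms) + ")"

# Inclusion concepts are ANDed together; exclusion rules are subtracted with NOT.
query = " AND ".join(or_block(terms) for terms in picots.values())
query += " NOT (" + " OR ".join(exclusions) + ")"
print(query)
```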
A Boolean query encodes both inclusion and exclusion rules. It selects citations by matching words in titles and abstracts, as well as medical subject headings (MeSH) and subheadings, casting a net wide enough to retrieve the maximum number of potentially relevant citations. These citations were then analyzed by text analytics tools and ranked to produce a selection for literature screening, which two independent reviewers conducted in the VCW's Web-based module. The number of citations resulting from Boolean queries ranged from a few hundred to several thousand, depending on the question. The text analytics tools suite included:
- A natural language processing module for automated extraction of data elements to support the application of I/E criteria. Frequently extracted and utilized data elements were study size and intervention followup period.
- Content Analyst for automatically expanding query vocabulary, conceptual retrieval, and conceptual clustering. The conceptual query engine in Content Analyst leverages word frequency features and co-occurrence of words in similar contexts to index, select, and rank results. The indexing uses the singular value decomposition (SVD) algebraic method.
- TeraText for ranking search engine results and executing operations on literature collections.
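Content Analyst's SVD-based indexing is proprietary, but the general technique it names, latent semantic indexing, can be sketched in a few lines of numpy. The corpus, query, and rank below are toy assumptions for illustration only, not the product's implementation.

```python
import numpy as np

docs = [
    "sodium intake raises blood pressure",
    "salt reduction lowers blood pressure",
    "statin therapy lowers ldl cholesterol",
    "dietary cholesterol and ldl levels",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix of raw word counts.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# A rank-k SVD projects terms and documents into a shared "concept" space,
# so a query can match documents that share no literal keywords with it.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # documents in concept space

def query_vec(text):
    """Fold a free-text query into the same concept space."""
    q = np.array([text.split().count(w) for w in vocab], dtype=float)
    return np.linalg.pinv(np.diag(s[:k])) @ U[:, :k].T @ q

q = query_vec("salt and blood pressure")
# Rank documents by cosine similarity to the query.
scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
print(sorted(zip(scores, docs), reverse=True))
```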
Search strategy development was intertwined with the results of literature screening, which provided feedback on search quality and context. Screened literature was categorized into two subsets: relevant or not relevant to the question. Next, results were analyzed to determine the characteristics of relevant versus not relevant citations. Additional keywords and MeSH terms were used to expand or contract the scope of the query as driven by the characteristics of relevant citations. If the revised search strategy produced additional citations that had not yet been screened, then a new batch of citations was added for review. The search strategy refinement/literature review cycle was repeated until all citations covered by the most recent Boolean query had been screened.
Each search strategy was developed and implemented in the VCW. The methodologist and panel members reviewed the search strategy, which was available for viewing and printing at any time by panel members and staff collaborating on the systematic review. The search strategy was available for execution and supplying literature updates until the literature search and screening cut-off date.
An independent methodology team validated the search strategies for a sample of questions. As part of this validation process, the methodology team developed and executed a separate search strategy and screened a random sample of citations against I/E criteria. Then, these results were compared with the search and screening results developed by the systematic review team. Based on the validation process, the searches were considered appropriate. In addition, studies identified in systematic reviews and meta-analyses were cross-checked against a CQ's list of studies included in the evidence base to ensure completeness of the search strategy.
iv Process for Literature Review and Application of I/E Criteria
Using the results of the search strategy, reviewers applied criteria to screen literature into or out of the evidence base for the CQ. I/E criteria address the parameters in the PICOTS framework and determine what types of studies are eligible and appropriate to answer the CQ. When appropriate, the panel members added (with guidance from the methodology team) I/E criteria, such as sample size restrictions, to fit the context of the CQ. To enhance the quality of the abstracted literature, these criteria were applied uniformly (by the systematic review and methodology teams) within a given question.
a Pilot Literature Screening Mode
In the pilot literature screening mode, two reviewers independently screened the first 50 titles or abstracts in the search strategy results by applying I/E criteria. Reviewers voted to include or exclude the publication for full-text review. To ensure I/E criteria were applied consistently, reviewers compared their results. Discrepancies in votes were discussed, and clarification on criteria was sought from the panel when appropriate. For example, if criteria were not specific enough to be applied clearly to include or exclude a citation, then reviewers sought guidance to word the criteria more explicitly.
During this phase, reviewers provided feedback to the literature search team about the relevance of search strategy results; the team used this feedback to further refine and optimize the search.
b Phase 1: Title and Abstract Screening Phase
After completing the pilot mode phase, two reviewers independently screened search results at the title and abstract level by applying I/E criteria. Reviewers voted to include or exclude the publication for full-text review.
When at least one reviewer voted to include a publication based on the title and abstract review, the publication advanced to Phase 2, full-text screening. When both reviewers voted to exclude a publication, then it was excluded and not reviewed further. These citations are maintained in the VCW and marked as “excluded at title/abstract phase.”
c Phase 2: Full-Text Screening Phase
In Phase 2, two reviewers independently applied I/E criteria to the full-text article and voted “include,” “exclude,” or “undecided.” In this phase, the reviewer specified the rationale for exclusion (e.g., population, intervention).
Articles that both reviewers voted to include were moved to the “include” list. Similarly, articles that both reviewers voted to exclude were moved to the “exclude” list. These citations were maintained in the VCW and identified as “excluded at the full article phase,” and the rationale for exclusion was noted. Any articles with discrepant votes (i.e., one include and one undecided, one include and one exclude, and one exclude and one undecided) advanced to Phase 3.
d Phase 3: Resolution and Consultation Phase
In this phase, reviewers discussed their discrepant votes for “include,” “exclude,” or “undecided” and cited the relevant criteria for their decision. The two reviewers attempted to achieve consensus through collaborative discussion. If the reviewers could not reach consensus, then they consulted the methodologist. If they were still unable to reach a consensus, then they consulted the panel; however, the methodologist had the final decision. The final disposition of the article (“include” or “exclude”) was recorded in the VCW along with comments from the adjudication process.
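The three screening phases amount to simple decision logic. The following sketch summarizes that logic as described above; the function names and label strings are illustrative, not the VCW's actual interface.

```python
def title_abstract_phase(vote1, vote2):
    """Phase 1: advance to full-text review unless both reviewers exclude."""
    if vote1 == "exclude" and vote2 == "exclude":
        return "excluded at title/abstract phase"
    return "advance to full-text screening"

def full_text_phase(vote1, vote2):
    """Phase 2: votes are include/exclude/undecided; agreement is final,
    and any discrepancy goes to Phase 3 resolution."""
    if vote1 == vote2 == "include":
        return "include"
    if vote1 == vote2 == "exclude":
        return "excluded at the full article phase"
    return "phase 3: resolution and consultation"

def resolution_phase(consensus=None, methodologist_decision=None):
    """Phase 3: reviewer consensus stands if reached; otherwise the
    methodologist (after consulting the panel if needed) decides."""
    return consensus if consensus is not None else methodologist_decision

print(title_abstract_phase("include", "exclude"))  # advance to full-text screening
print(full_text_phase("include", "undecided"))     # phase 3: resolution and consultation
```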
Like the search strategies posted for viewing on the VCW, all citations screened for a CQ were maintained in the VCW along with their reviewer voting status and collected comments.
v Description of Methods for Quality Assessment of Individual Studies
Articles meeting the I/E criteria after the three-phase literature review process were then rated for quality. A separate quality rating tool was used for each study design.
a Design of the Quality Assessment Tools
Six quality assessment tools, developed by NHLBI and the methodology team, were used to evaluate the quality of individual studies. The tools were based on quality assessment methods, concepts, and other tools developed by researchers in the Agency for Healthcare Research and Quality's (AHRQ) Evidence-Based Practice Centers (EPCs), the Cochrane Collaboration, the U.S. Preventive Services Task Force (USPSTF), and the National Health Service Centre for Reviews and Dissemination, as well as by consulting epidemiologists and others working in evidence-based medicine. The methodology team and NHLBI staff adapted these tools for this project.
The tools were designed to help reviewers focus on concepts that are key for evaluating the internal validity of a study. The tools were not designed to provide a list of factors to be tallied into a numeric score; instead, each tool was specific to an individual type of study design. They are described in more detail below.
The tools included items for evaluating potential flaws in study methods or implementation, including sources of bias (e.g., patient selection, performance, attrition, and detection), confounding, study power, the strength of causality in the association between interventions and outcomes, and other factors. Quality reviewers could select “yes,” “no,” or “cannot determine/not reported/not applicable” in response to each item in the tool. For each item where “no” was selected, reviewers were instructed to consider the potential risk of bias that could be introduced by that flaw in the study design or implementation. “Cannot determine” and “not reported” were also noted as representing potential flaws.
Each of the six quality assessment tools, except the tool for case series studies, has a detailed guidance document that was also developed by the methodology team and NHLBI. The guidance documents are specific to each tool and provide detailed descriptions and examples of how to apply the items, as well as justifications for including each item. For some items, examples were provided to clarify the intent of the question and the appropriate rater response. The six quality assessment tools and five related guidance documents are included in Tables A-1 through A-6.
b Significance of the Quality Ratings of Good, Fair, or Poor
Using the quality assessment tools, reviewers rated each study as “good,” “fair,” or “poor” quality. Reviewers used the ratings on different items in the tool to assess the risk of bias in the study due to flaws in study design or implementation.
In general terms, a “good” study has the least risk of bias and results are considered to be valid. A “fair” study is susceptible to some bias deemed not sufficient to invalidate its results. The fair quality category is likely to be broad, so studies with this rating will vary in their strengths and weaknesses.
A “poor” rating indicates significant risk of bias. Studies rated poor were excluded from the body of evidence considered for each CQ. The only exception was when no other evidence was available; in that case, poor quality studies could be considered.
c Training for the Application of Quality Assessment Tools
The methodology team conducted a series of training sessions on using four of the quality assessment tools. Initial training consisted of 2-day, in-person training sessions. Reviewers trained in the quality rating were master's- or doctoral-level staff with a background in public health or health sciences. Training sessions included instruction on identifying the correct study designs, the theory behind evidence-based research and quality assessment, explanations and rationales for the items in each tool, and methods for reaching overall judgments on quality ratings of “good,” “fair,” or “poor.” Participants practiced evaluating multiple articles, both with the instructors and during group work. They were also instructed to refer to related articles on study methods if such papers were cited in the articles being rated.
Following the in-person training sessions, the methodology team assigned several articles with pertinent study designs to test the abilities of each reviewer. The methodology team asked reviewers to individually identify the correct study design, complete the appropriate quality assessment tool, and submit it to the team for grading against a methodologist-developed key. Next, the reviewers participated in a second round of training sessions, conducted by telephone, to review results and resolve any remaining misinterpretations. Based on the results of these evaluations, a third round of exercises and training sessions was sometimes convened.
The quality assessment tools for the before-after and case series studies were used only for the Obesity Panel's CQ5, which addresses bariatric surgery interventions. This CQ included those types of study designs and related issues specific to this surgical intervention. As a result, a formal training program for using these quality assessment tools was not conducted; instead, reviewers for CQ5 received individual training.
d Quality Assessment Process
The systematic review team or methodology team rated each article that met a CQ's inclusion criteria. Two reviewers independently rated the quality of each article, using the appropriate tool. If the ratings differed, then the reviewers discussed the article in an effort to reach consensus. If they were unable to reach consensus, then a methodologist judged the quality of the article.
Two methodologists independently rated systematic reviews and meta-analyses. If ratings differed, then the reviewers discussed the article in an effort to reach consensus. If they were unable to reach consensus, then a third methodologist judged the quality.
After the initial quality rating was assigned, panel members could appeal the rating of a particular study or publication. However, to enhance the objectivity of the quality rating process, the final decision on quality ratings was made by the methodology team, not the panel members.
vi Quality Assessment Tool for Controlled Intervention Studies
Table A-1 shows the quality assessment tool for controlled intervention studies along with the guidance document for that tool. The methodology team and NHLBI developed this tool based in part on criteria from the AHRQ EPCs, the USPSTF, and the National Health Service Centre for Reviews and Dissemination.
This tool addresses 14 elements of quality assessment. They include randomization and allocation concealment, similarity of compared groups at baseline, use of intent-to-treat (ITT) analysis (i.e., analysis of all randomized patients even if some were lost to followup), adequacy of blinding, the overall percentage of subjects lost to followup, differential rates of loss to followup between the intervention and control groups, and other factors.
Table A-1. Quality Assessment Tool for Controlled Intervention Studies
vii Guidance for Assessing the Quality of Controlled Intervention Studies
The guidance document below is organized by question number from the tool for quality assessment of controlled intervention studies.
Question 1. Described as randomized
Was the study described as randomized? A study does not satisfy the quality criterion for randomization simply because the authors call it randomized; however, such a description is a first step in determining whether a study was randomized.
Questions 2 and 3. Treatment allocation—two interrelated pieces
- Adequate randomization: Randomization is adequate if it occurred according to the play of chance (e.g., computer generated sequence in more recent studies, or random number table in older studies).
- Inadequate randomization: Randomization is inadequate if there is a preset plan (e.g., alternation, where every other subject is assigned to the treatment arm, or allocation by time or day of hospital admission or clinic visit, ZIP Code, phone number, etc.). In fact, this is not randomization at all—it is another method of assignment to groups. If assignment is not by the play of chance, then the answer to this question is no.
There may be some tricky scenarios that will need to be read carefully and considered for the role of chance in assignment. For example, randomization may occur at the site level, where all individuals at a particular site are assigned to receive treatment or no treatment. This scenario is used for group-randomized trials, which can be truly randomized, but often are “quasi-experimental” studies with comparison groups rather than true control groups. (Few, if any, group-randomized trials are anticipated for this evidence review.)
- Allocation concealment: This means that one does not know in advance, or cannot guess accurately, to what group the next person eligible for randomization will be assigned. Methods include sequentially numbered opaque sealed envelopes, numbered or coded containers, central randomization by a coordinating center, computer-generated randomization that is not revealed ahead of time, etc.
Questions 4 and 5. Blinding
Blinding means that one does not know to which group—intervention or control—the participant is assigned. It is also sometimes called “masking.” The reviewer assessed whether each of the following was blinded to knowledge of treatment assignment:
- the person assessing the primary outcome(s) for the study (e.g., taking the measurements such as blood pressure, examining health records for events such as myocardial infarction, reviewing and interpreting test results such as x ray or cardiac catheterization findings);
- the person receiving the intervention (e.g., the patient or other study participant); and
- the person providing the intervention (e.g., the physician, nurse, pharmacist, dietitian, or behavioral interventionist).
Generally, placebo-controlled medication studies are blinded to patient, provider, and outcome assessors. Behavioral, lifestyle, and surgical studies are examples of studies that are frequently blinded only to the outcome assessors, because blinding the persons providing and receiving the interventions is difficult in these situations. Sometimes the individual providing the intervention is the same person performing the outcome assessment. This was noted when it occurred.
Question 6. Similarity of groups at baseline
This question relates to whether the intervention and control groups have similar baseline characteristics on average, especially those characteristics that may affect the intervention or outcomes. The point of randomized trials is to create groups that are as similar as possible except for the intervention(s) being studied, in order to compare the effects of the interventions between groups. When reviewers abstracted baseline characteristics, they noted when there was a significant difference between groups. Baseline characteristics for intervention groups are usually presented in a table in the article (often Table 1).
Groups can differ at baseline without raising red flags if:
- the differences would not be expected to have any bearing on the interventions and outcomes; or
- the differences are not statistically significant.
When concerned about baseline differences between groups, reviewers recorded them in the comments section and considered them in their overall determination of the study quality.
Questions 7 and 8. Dropout
“Dropouts” in a clinical trial are individuals for whom there are no end point measurements, often because they dropped out of the study and were lost to followup.
Generally, an acceptable overall dropout rate is considered 20 percent or less of participants who were randomized or allocated into each group. An acceptable differential dropout rate is an absolute difference between groups of no more than 15 percentage points (calculated as the dropout rate of one group minus the dropout rate of the other). However, these are general guidelines: lower overall dropout rates are expected in shorter studies, whereas higher overall dropout rates may be acceptable for studies of longer duration. For example, a 6-month study of weight loss interventions should be expected to have nearly 100 percent followup (almost no dropouts—nearly everybody gets their weight measured regardless of whether or not they actually received the intervention), whereas a 10-year study testing the effects of intensive blood pressure lowering on heart attacks may be acceptable with a 20-25 percent dropout rate, especially if the dropout rate between groups was similar. The panels for the NHLBI systematic reviews may set different dropout caps.
Differential dropout rates, in contrast, are not flexible; the cap is 15 percentage points. If the differential dropout rate between arms is 15 percentage points or more, then there is a serious potential for bias. This constitutes a fatal flaw, resulting in a poor quality rating for the study.
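These thresholds amount to simple arithmetic. The following sketch, with hypothetical function and argument names, checks a two-arm trial against the caps described above.

```python
def dropout_flags(randomized_a, completed_a, randomized_b, completed_b,
                  overall_cap=0.20, differential_cap=0.15):
    """Flag the dropout thresholds described in Questions 7 and 8."""
    rate_a = 1 - completed_a / randomized_a
    rate_b = 1 - completed_b / randomized_b
    overall = 1 - (completed_a + completed_b) / (randomized_a + randomized_b)
    differential = abs(rate_a - rate_b)  # absolute difference between arms
    return {
        "overall_dropout": round(overall, 3),
        "differential_dropout": round(differential, 3),
        # General guideline only; acceptable levels depend on study duration.
        "high_overall": overall > overall_cap,
        # A differential of 15 percentage points or more is a fatal flaw.
        "fatal_flaw_differential": differential >= differential_cap,
    }

# Example: 100 randomized per arm, with 85 and 65 completers.
# Differential = |0.15 - 0.35| = 0.20, which exceeds the 0.15 cap.
print(dropout_flags(100, 85, 100, 65))
```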
Question 9. Adherence
Did participants in each treatment group adhere to the protocols for assigned interventions? For example, if Group 1 was assigned to 10 mg/day of Drug A, did most of them take 10 mg/day of Drug A? Another example is a study evaluating the difference between a 30-pound weight loss and a 10-pound weight loss on specific clinical outcomes (e.g., heart attacks), but the 30-pound weight loss group did not achieve its intended weight loss target (e.g., the group only lost 14 pounds on average). A third example is whether a large percentage of participants assigned to one group “crossed over” and got the intervention provided to the other group. A final example is when one group that was assigned to receive a particular drug at a particular dose had a large percentage of participants who did not end up taking the drug or the dose as designed in the protocol.
Question 10. Avoid other interventions
Changes that occur in the study outcomes being assessed should be attributable to the interventions being compared in the study. If study participants receive interventions that are not part of the study protocol and could affect the outcomes being assessed, and they receive these interventions differentially, then there is cause for concern because these interventions could bias results. The following scenario is another example of how bias can occur. In a study comparing two different dietary interventions on serum cholesterol, one group had a significantly higher percentage of participants taking statin drugs than the other group. In this situation, it would be impossible to know if a difference in outcome was due to the dietary intervention or the drugs.
Question 11. Outcome measures assessment
What tools or methods were used to measure the outcomes in the study? Were the tools and methods accurate and reliable—for example, have they been validated, or are they objective? This is important because it indicates the confidence one can have in the reported outcomes. Perhaps even more important is ascertaining that outcomes were assessed in the same manner within and between groups. One example of differing methods is self-report of dietary salt intake versus urine testing for sodium content (a more reliable and valid assessment method). Another example is using BP measurements taken by practitioners who use their usual methods versus BP measurements taken by individuals trained in a standard approach, which may include using the same instrument each time and taking an individual's BP multiple times. In each of these cases, the answer to this assessment question would be “no” for the former scenario and “yes” for the latter. In addition, a study in which an intervention group was seen more frequently than the control group, enabling more opportunities to report clinical events, would not be considered reliable and valid.
Question 12. Power calculation
Generally, a study's methods section will address the sample size needed to detect differences in primary outcomes. The current standard is at least 80 percent power to detect a clinically relevant difference in an outcome using a two-sided alpha of 0.05. Often, however, older studies will not report on power.
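For readers who want the arithmetic behind such statements, the following sketch computes the per-group sample size needed to compare two proportions at 80 percent power and a two-sided alpha of 0.05, using the standard normal-approximation formula; the event rates are invented for illustration.

```python
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group n to detect a difference between two proportions
    (normal approximation, equal allocation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Hypothetical example: detecting a reduction in 5-year event rate
# from 12 percent to 8 percent requires about 879 participants per group.
print(round(n_per_group(0.12, 0.08)))
```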
Question 13. Prespecified outcomes
Investigators should prespecify outcomes reported in a study for hypothesis testing—which is the reason for conducting an RCT. Without prespecified outcomes, the study may be reporting ad hoc analyses, simply looking for differences supporting desired findings. Investigators also should prespecify subgroups being examined. Most RCTs conduct numerous post hoc analyses as a way of exploring findings and generating additional hypotheses. The intent of this question is to give more weight to reports that are not simply exploratory in nature.
Question 14. Intention-to-treat analysis
Intention-to-treat (ITT) analysis means that everybody who was randomized is analyzed according to the original group to which they were assigned. This is an extremely important concept because an ITT analysis preserves the whole reason for doing a randomized trial: to compare groups that differ only in the intervention being tested. When the ITT philosophy is not followed, the groups being compared may no longer be the same. In this situation, the study would likely be rated poor. However, if an investigator used another type of analysis that could be viewed as valid, this would be explained in the “other” box on the quality assessment form. Some researchers use a completers analysis (an analysis of only the participants who completed the intervention and the study), which introduces significant potential for bias, because the characteristics of participants who do not complete a study are unlikely to be the same as those of participants who do. The likely impact of participants withdrawing from a study treatment must be considered carefully. ITT analysis provides a more conservative (potentially less biased) estimate of effectiveness.
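The bias a completers-only analysis can introduce is easy to demonstrate with a toy simulation. The sketch below assumes, purely for illustration, that outcomes are observed for all randomized participants and that sicker patients preferentially drop out of the treatment arm; it is not drawn from any study in this review.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
severity = rng.normal(size=n)                      # baseline risk factor
treated = rng.integers(0, 2, size=n).astype(bool)  # randomized 1:1

# True model: treatment lowers the outcome by 1 unit regardless of severity.
outcome = severity - 1.0 * treated + rng.normal(scale=0.5, size=n)

# Informative dropout: sicker treated patients are more likely to drop out.
dropout = treated & (rng.random(n) < 0.4 * (severity > 0))

# ITT: analyze everyone as randomized (outcomes assumed observed here).
itt_effect = outcome[treated].mean() - outcome[~treated].mean()

# Completers-only: drop the dropouts, breaking the randomized comparison.
completers = ~dropout
comp_effect = (outcome[treated & completers].mean()
               - outcome[~treated & completers].mean())

print(f"true effect: -1.0, ITT: {itt_effect:.2f}, completers: {comp_effect:.2f}")
# ITT recovers roughly -1.0; the completers analysis exaggerates the benefit,
# because the remaining treated patients are healthier than the controls.
```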
General Guidance for Determining the Overall Quality Rating of Controlled Intervention Studies
The questions on the assessment tool were designed to help reviewers focus on the key concepts for evaluating a study's internal validity. They are not intended to create a list that is simply tallied up to arrive at a summary judgment of quality.
Internal validity is the extent to which the results (effects) reported in a study can truly be attributed to the intervention being evaluated and not to flaws in the design or conduct of the study—in other words, the ability of the study to support causal conclusions about the effects of the intervention being tested. Such flaws can increase the risk of bias. Critical appraisal involves considering the potential for allocation bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. A high risk of bias translates to a rating of poor quality; a low risk of bias translates to a rating of good quality.
Fatal flaws: If a study has a “fatal flaw,” then the risk of bias is significant, and the study is of poor quality. Examples of fatal flaws in RCTs include a high dropout rate, a high differential dropout rate, and the lack of an ITT analysis or use of another unsuitable statistical analysis (e.g., completers-only analysis).
Generally, when evaluating a study, one will not see a “fatal flaw;” however, one will find some risk of bias. During training, reviewers were instructed to look for the potential for bias in studies by focusing on the concepts underlying the questions in the tool. For any box checked “no,” reviewers were told to ask: “What is the potential risk of bias that may be introduced by this flaw?” That is, does this factor cause one to doubt the results that were reported in the study?
NHLBI staff provided reviewers with background reading on critical appraisal, while emphasizing that the best approach to use is to think about the questions in the tool in determining the potential for bias in a study. The staff also emphasized that each study has specific nuances; therefore, reviewers should familiarize themselves with the key concepts.
viii Quality Assessment Tool for Systematic Reviews and Meta-Analyses
Table A-2 shows the quality assessment tool for systematic reviews and meta-analyses along with the guidance document for that tool. The methodology team and NHLBI developed this tool based in part on criteria from AHRQ's EPCs and the Cochrane Collaboration.
Table A-2. Quality Assessment Tool for Systematic Reviews and Meta-Analyses
This tool addresses eight elements of quality assessment. They include use of prespecified eligibility criteria, use of a comprehensive and systematic literature search process, dual review for abstracts and full-text articles, quality assessment of individual studies, assessment of publication bias, and other factors.
ix Guidance for Quality Assessment of Systematic Reviews and Meta-Analyses
A systematic review is a study that attempts to answer a question by synthesizing the results of primary studies while using strategies to limit bias and random error. These strategies include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles included in the review. Research designs and study characteristics are appraised, data are synthesized, and results are interpreted using a predefined systematic approach that adheres to evidence-based methodological principles.
Systematic reviews can be qualitative or quantitative. A qualitative systematic review summarizes the results of the primary studies but does not combine the results statistically. A quantitative systematic review, or meta-analysis, is a type of systematic review that employs statistical techniques to combine the results of the different studies into a single pooled estimate of effect, often given as an odds ratio.
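As a concrete illustration of quantitative synthesis, the following sketch pools odds ratios from three invented trials using standard inverse-variance (fixed-effect) weighting on the log scale; the numbers are hypothetical and not drawn from any review discussed here.

```python
import math

# (odds ratio, 95% CI lower bound, 95% CI upper bound) for three made-up trials
studies = [(0.80, 0.65, 0.98), (0.90, 0.70, 1.16), (0.75, 0.55, 1.02)]

weights, weighted_effects = [], []
for or_, lo, hi in studies:
    log_or = math.log(or_)
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # back out SE from the CI
    w = 1 / se**2                                    # inverse-variance weight
    weights.append(w)
    weighted_effects.append(w * log_or)

pooled_log_or = sum(weighted_effects) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5
print(f"pooled OR = {math.exp(pooled_log_or):.2f} "
      f"(95% CI {math.exp(pooled_log_or - 1.96 * pooled_se):.2f}"
      f" to {math.exp(pooled_log_or + 1.96 * pooled_se):.2f})")
```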
The guidance document below is organized by question number from the tool for quality assessment of systematic reviews and meta-analyses.
Question 1. Focused question
The review should be based on a question that is clearly stated and well-formulated. An example would be a question that uses the PICO (population, intervention, comparator, outcome) format, with all components clearly described.
Question 2. Eligibility criteria
The eligibility criteria used to determine whether studies were included or excluded should be clearly specified and predefined. It should be clear to the reader why studies were included or excluded.
Question 3. Literature search
The search strategy should employ a comprehensive, systematic approach in order to capture all of the evidence possible that pertains to the question of interest. At a minimum, a comprehensive review has the following attributes:
- Electronic searches were conducted using multiple scientific literature databases, such as MEDLINE, EMBASE, Cochrane Central Register of Controlled Trials, PsychLit, and others as appropriate for the subject matter.
- Manual searches of references found in articles and textbooks should supplement the electronic searches.
Additional search strategies that may be used to improve the yield include the following:
- Studies published in other countries
- Studies published in languages other than English
- Identification by experts in the field of studies and articles that may have been missed
- Search of grey literature, including technical reports and other papers from government agencies or scientific groups or committees; presentations and posters from scientific meetings, conference proceedings, unpublished manuscripts; and others. Searching the grey literature is important (whenever feasible) because sometimes only positive studies with significant findings are published in the peer-reviewed literature, which can bias the results of a review.
Reviewers checked that the literature search strategy was clearly described and could be reproduced by others with similar results.
Question 4. Dual review for determining which studies to include and exclude
Titles, abstracts, and full-text articles (when indicated) should be reviewed by two independent reviewers to determine which studies to include in and exclude from the review. Disagreements should be resolved through discussion and consensus or by a third party, and the review process, including the methods for settling disagreements, should be clearly stated.
Question 5. Quality appraisal for internal validity
Each included study should be appraised for internal validity (study quality assessment) using a standardized approach for rating the quality of individual studies. Ideally, at least two independent reviewers should appraise each study for internal validity. However, there is no single, commonly accepted, standardized tool for rating the quality of studies. Therefore, in the research papers, reviewers looked for an assessment of the quality of each study and a clear description of the process used.
Question 6. List and describe included studies
All included studies should be listed in the review, along with descriptions of their key characteristics, presented in either narrative or table format.
Question 7. Publication bias
Publication bias refers to the phenomenon in which studies with positive results have a higher likelihood of being published, being published rapidly, being published in higher impact journals, being published in English, being published more than once, or being cited by others [425, 426]. Publication bias can be linked to favorable or unfavorable treatment of research findings by investigators, editors, industry, commercial interests, or peer reviewers. To minimize the potential for publication bias, researchers can conduct a comprehensive literature search that includes the strategies discussed in Question 3.
A funnel plot—a scatter plot of the component studies in a meta-analysis—is a commonly used graphical method for detecting publication bias. In the absence of significant publication bias, the graph looks like a symmetrical inverted funnel.
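The following sketch draws such a funnel plot with matplotlib from simulated, bias-free data; the effect sizes and standard errors are invented for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
true_log_or = -0.2
se = rng.uniform(0.05, 0.5, size=60)   # simulated study standard errors
log_or = rng.normal(true_log_or, se)   # simulated study effect estimates

plt.scatter(log_or, se, s=12)
plt.axvline(true_log_or, linestyle="--")

# Pseudo 95% confidence limits tracing the expected funnel shape.
s = np.linspace(0.001, 0.5, 100)
plt.plot(true_log_or - 1.96 * s, s, "k:", true_log_or + 1.96 * s, s, "k:")

plt.gca().invert_yaxis()  # larger, more precise studies plotted at the top
plt.xlabel("log odds ratio")
plt.ylabel("standard error")
plt.show()
```

With publication bias, small negative studies go unpublished and one side of the lower funnel appears sparse or missing.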
Reviewers assessed and clearly described the likelihood of publication bias.
Question 8. Heterogeneity
Heterogeneity refers to important differences in the studies included in a meta-analysis that may make it inappropriate to combine the studies. Heterogeneity can be clinical (e.g., important differences between study participants, baseline disease severity, and interventions); methodological (e.g., important differences in the design and conduct of the study); or statistical (e.g., important differences in the quantitative results or reported effects).
Researchers usually assess clinical or methodological heterogeneity qualitatively by determining whether it makes sense to combine studies. For example:
- Should a study evaluating the effects of an intervention on CVD risk that involves elderly male smokers with hypertension be combined with a study that involves healthy adults ages 18 to 40? (Clinical Heterogeneity)
- Should a study that uses a randomized controlled trial (RCT) design be combined with a study that uses a case-control study design? (Methodological Heterogeneity)
Statistical heterogeneity describes the degree of variation in the effect estimates from a set of studies; it is assessed quantitatively. The two most common methods used to assess statistical heterogeneity are the Cochran Q test (a chi-square, or χ², test) and the I² statistic.
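Both measures are short computations. The sketch below applies Cochran's Q and I² to invented effect estimates; the data are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

effects = np.array([-0.30, -0.10, -0.45, 0.05])  # per-study log effect estimates
variances = np.array([0.02, 0.03, 0.05, 0.04])   # per-study variances

w = 1 / variances                                # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)         # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled) ** 2)          # Cochran's Q statistic
df = len(effects) - 1
p_value = chi2.sf(Q, df)                         # Q ~ chi-square under homogeneity
I2 = max(0.0, (Q - df) / Q) * 100                # % of variation beyond chance

print(f"Q = {Q:.2f} (df = {df}, p = {p_value:.3f}), I^2 = {I2:.0f}%")
```

An I² near 0 percent suggests the observed variation is consistent with chance, while values above roughly 50 percent are conventionally read as substantial heterogeneity.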
Reviewers examined studies to determine if an assessment for heterogeneity was conducted and clearly described. If the studies are found to be heterogeneous, the investigators should explore and explain the causes of the heterogeneity, and determine what influence, if any, the study differences had on overall study results.
x Quality Assessment Tool for Cohort and Cross-Sectional Studies
Table A-3 shows the quality assessment tool for cohort and cross-sectional studies along with the guidance document for that tool. The methodology team and NHLBI developed this tool based in part on criteria from AHRQ's EPCs, the USPSTF, consultation with epidemiologists, and other sources.
This tool addresses 14 elements of quality assessment. They include: clarity of the research question or research objective; definition, selection, composition, and participation of the study population; definition and assessment of exposure and outcome variables; measurement of exposures prior to outcome assessment; study timeframe and followup; study analysis and power; and other factors.
xi Guidance for Assessing the Quality of Cohort and Cross-Sectional Studies
The guidance document below is organized by question number from the tool for quality assessment of cohort and cross-sectional studies.
Question 1. Research question
To answer this question, reviewers asked: Did the authors describe their research goal? Is it easy to understand what they were looking to find? This issue is important for all types of scientific papers. Higher quality scientific research explicitly defines a research question.
Questions 2 and 3. Study population
Reviewers asked: Did the authors describe the group of individuals from which the study participants were selected or recruited, using demographics, location, and time period? If the authors conducted this study again, would they know whom to recruit, from where, and from what time period? Is the cohort population free of the outcome of interest at the time they were recruited?
An example would be men over 40 years old with type 2 diabetes who began seeking medical care at Phoenix Good Samaritan Hospital between January 1, 1990 and December 31, 1994. In this example, the population is clearly described in terms of:
- who (men over 40 years old with type 2 diabetes);
- where (Phoenix Good Samaritan Hospital); and
- when (between January 1, 1990 and December 31, 1994).
Another example is women in the nursing profession, ages 34 to 59, with no known coronary disease, stroke, cancer, hypercholesterolemia, or diabetes, recruited from the 11 most populous States, with contact information obtained from State nursing boards.
In cohort studies, it is crucial that the population at baseline be free of the outcome of interest. For example, the nurses' population above would be an appropriate group in which to study incident coronary disease. This information is usually found in descriptions of population recruitment, definitions of variables, or inclusion/exclusion criteria.
When needed, reviewers examined prior papers on methods in order to assess this question. They usually found the papers in the reference list.
If fewer than 50 percent of eligible persons participated in the study, then there is concern that the study population does not adequately represent the target population. This increases the risk of bias.
Question 4. Groups recruited from the same population and uniform eligibility criteria
Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same inclusion and exclusion criteria used for all of the subjects involved? This issue is related to the description of the study population in the section above, and reviewers may find information for both of these questions in the same section of the paper.
Most cohort studies begin with selection of the cohort; participants in this cohort are then measured or evaluated for their exposure status. However, some cohort studies recruit or select exposed participants from a different time or place than unexposed participants, especially retrospective cohort studies, in which data are obtained from the past but the analysis still examines exposures prior to outcomes. The following question illustrates the issue of similar populations: Are diabetic men with clinical depression at higher risk for cardiovascular disease than those without clinical depression? In such a study, diabetic men with depression might be selected from a mental health clinic and diabetic men without depression from an internal medicine or endocrinology clinic. Because this study recruits the groups from different clinic populations, the answer to Question 4 would be “no.” However, the women nurses described in Questions 2 and 3 were selected using the same I/E criteria, so in that case the answer to Question 4 would be “yes.”
Question 5. Sample size justification
Specifically, Question 5 asks: Did the authors present their reasons for selecting or recruiting the number of individuals included or analyzed? Did they note or discuss the statistical power of the study and provide a target value? This question addresses whether the study had enough participants to detect an association if one truly existed.
Reviewers examined methods sections of articles for an explanation of the sample size needed to detect a hypothesized difference in outcomes. Reviewers examined discussion sections of articles for information on statistical power (e.g., the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a two-sided alpha of 0.05). Instead of sample size calculations, an article sometimes gives estimates of variance and/or estimates of effect size. In all these cases, the answer to Question 5 would be “yes.”
However, observational cohort studies often do not report anything about power or sample size because the analyses are exploratory in nature. In this case, the answer to Question 5 would be “no.” A lack of a report on power or sample size is not a “fatal flaw”; instead, it may indicate that the researchers did not focus on whether the study was sufficiently sized to answer a prespecified question, and that the study was exploratory and hypothesis generating.
This question does not refer to a description of the manner in which different groups were included or excluded per the inclusion/exclusion criteria (e.g., “Final study size was 2,978 participants after exclusion of 756 patients with a history of MI,” is not considered a sample size justification for the purposes of this question.)
Question 6. Exposure assessed prior to outcome measurement
This question is important because in order to determine whether an exposure causes an outcome, the exposure must precede the outcome.
In some prospective cohort studies, investigators identify the cohort, then determine the exposure status of members of the cohort (large epidemiological studies like the Framingham Study use this approach). However, for other cohort studies, investigators select the cohort based on its exposure status, as in the example above of diabetic men with depression (the exposure being depression). Other examples include a cohort identified by its exposure to fluoridated drinking water and compared to a cohort living in an area without fluoridated water, or a cohort of military personnel exposed to combat in the Gulf War compared to a cohort of military personnel not deployed in a combat zone.
With either of these types of cohort studies, the investigator follows the cohort forward in time (i.e., prospectively) to assess the outcomes that occurred in the exposed compared to nonexposed members of the cohort. In other words, the investigator begins the study in the present by examining groups that were exposed or not exposed to some biological or behavioral factor, intervention, or other factor, then follows them forward in time to examine outcomes. If a cohort study is conducted properly, the answer to Question 6 should be “yes,” since the investigators determined the exposure status of members of the cohort at the beginning of the study, before the outcomes occurred.
For retrospective cohort studies, the same principle applies. The difference is that rather than identifying a cohort in the present and following it forward in time, investigators go back in time (i.e., retrospectively) and select a cohort based on its past exposure status. Then, they follow the cohort forward to assess the outcomes that occurred in the exposed and nonexposed members. In retrospective cohort studies, the exposure and outcomes may have already occurred (it depends on how long they follow the cohort); consequently, investigators need to ensure that the exposure preceded the outcome.
Sometimes in cross-sectional studies (or cross-sectional analyses of cohort study data), investigators measure exposures and outcomes during the same timeframe. As a result, cross-sectional analyses provide weaker evidence than regular cohort studies regarding a potential causal relationship between exposures and outcomes. For cross-sectional analyses, the answer to Question 6 would be “no.”
Question 7. Sufficient timeframe to see an effect
The intent of Question 7 is to determine whether the study allowed enough time for a sufficient number of outcomes to occur or be observed, or enough time for an exposure to have a biological effect on an outcome. For example, if clinical depression has a biological effect on increasing risk for cardiovascular disease, such an effect may take years to appear. Similarly, if higher dietary sodium increases BP, a short timeframe may be sufficient to assess its association with BP, but a longer timeframe would be needed to examine its association with heart attacks.
Investigators must consider timeframe to conduct a meaningful analysis of the relationship between exposures and outcomes. Often, they must conduct a study for at least several years, especially when examining health outcomes. However, the timeframe depends on the research question and outcomes being examined. Cross-sectional analyses allow no time to see an effect, since the exposures and outcomes are assessed at the same time. So with this type of analysis, the answer to Question 7 would be “no.”
Question 8. Different levels of the exposure of interest
If the exposure can be defined as a range (e.g., a range of drug dosages, amount of physical activity, or amount of sodium consumed), did the investigators assess multiple categories of that exposure? For example, for a particular drug: was the person not on medication, on a low dose, or on a high dose? For physical activity: did the person not exercise, exercise less than 30 minutes per day, or exercise more than 30 minutes per day? For dietary sodium: did the person consume less than 1,500 mg per day, between 1,500 mg and 3,000 mg per day, or greater than 3,000 mg per day? Sometimes exposures are measured as continuous variables (e.g., actual mg per day of dietary sodium consumed or actual minutes of exercise per day) rather than discrete categories (e.g., low sodium versus high sodium diet; normal blood pressure versus high blood pressure).
In any case, studying different levels of exposure, when possible, enables investigators to assess trends or dose-response relationships between exposures and outcomes (e.g., the higher the exposure, the greater the rate of the health outcome). Trends or dose-response relationships lend credibility to the hypothesis of causality between exposure and outcome.
However, for some exposures, Question 8 may not be applicable (e.g., when the exposure is a dichotomous variable like living in a rural setting versus an urban setting, or being vaccinated or not being vaccinated with a one-time vaccine). If there are only two possible exposures (yes/no), then reviewers would have answered this question “NA.” This answer should not negatively affect the quality rating.
Question 9. Exposure measures and assessment
Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable—for example, have they been validated, or are they objective? The answer to Question 9 influences confidence in the reported exposures: when investigators measure exposures with less accuracy or validity, it is difficult to observe an association between exposure and outcome, even if one exists. Equally important is whether exposures were assessed in the same manner within and between groups; if not, bias may result.
The following two examples illustrate how differing exposure measures can affect confidence in associations between exposure and outcome. The first addresses measurement of dietary salt intake. A study that prospectively uses a standardized dietary log and tests participants' urine for sodium content is more valid and reliable than one that retrospectively reviews self-reports of dietary salt intake. In this example, the reviewer would answer “yes” to Question 9 for the first method and “no” for the second. The second example addresses BP measurement. A study that uses BP measurements from a practice that follows certain standards (for example, trained BP assessors; standardized equipment, such as the same BP device, tested and calibrated; and a standardized protocol, such as seating the patient for 5 minutes with feet flat on the floor and averaging two measurements taken in each arm) is more reliable and valid than a study that uses measurements from a practice without such standards in place. Again, the reviewer would answer “yes” to Question 9 for the first method and “no” for the second.
This final example illustrates the importance of assessing exposures consistently across all groups. In a study comparing individuals with high BP (exposed cohort) with those with normal BP (nonexposed group), an investigator may note a higher incidence of CVD in those with high BP, concluding that high BP leads to more CVD events. Although this increase may be true, it also may be due to these individuals seeing their health care practitioners more frequently. With more frequent visits, there are increased opportunities for detecting and documenting changes in health outcomes, including CVD-related events. Thus, the increased number of visits can bias study results and lead to inaccurate conclusions.
Question 10. Repeated exposure assessment
Was the exposure for each person measured more than once during the course of the study period? Multiple measurements with the same result increase confidence that the exposure status was correctly classified. In addition, multiple measurements enable investigators to observe changes in exposure over time. For example, among individuals with high dietary sodium intake at baseline, some may have maintained a high intake throughout the followup period, others may have had a high intake initially and then reduced it, while still others may have had a low intake throughout the study. Once again, this question may not be applicable in all cases; in many older studies, exposure was measured only at baseline. However, multiple exposure measurements do make for a stronger study design.
Cross-sectional study design does not allow for repeated exposure assessment because there is no followup period, so the answer to Question 10 should be “no” for cross-sectional analyses.
Question 11. Outcome measures
Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable—for example, have they been validated, or are they objective? Answers to this question influence confidence in the reported outcomes. These answers also help determine whether the outcomes were assessed in the same manner within and between groups.
An example of an outcome measure that is objective, accurate, and reliable is death. But even with a measure as objective as death, differences can exist in the accuracy and reliability of how investigators assess it. For example, did they base outcomes on an autopsy report, death certificate, death registry, or report from a family member? A study on the relationship between dietary fat intake and blood cholesterol level in which fasting blood samples used to measure cholesterol were all sent to the same laboratory illustrates outcomes that would be considered objective, accurate, and reliable; this example would get a “yes.” However, outcomes in studies in which research participants self-reported that they had a heart attack or self-reported how much they weighed would be considered questionable and would get a “no.”
Similar to the final example in Question 9, results may be biased if one group (e.g., individuals with high BP) is seen more frequently than another group (individuals with normal BP), because more frequent encounters with the health care system increase the chances of outcomes being detected and documented.
Question 12. Blinding of outcome assessors
Blinding or masking means that outcome assessors did not know whether participants were exposed or unexposed. To answer this question, the reviewer examined the article for evidence that the person(s) assessing the study outcome(s) (outcome assessor) was masked to the exposure status of the research participants. An outcome assessor, for example, may examine medical records to determine outcomes that occurred in the exposed and comparison groups. Sometimes, the person measuring the exposure is the same person conducting the outcome assessment. In this case, the assessor would most likely not be blinded to exposure status. A reviewer would note such a finding in the comments section.
In assessing this criterion, the reviewers determined whether it was likely that the outcome assessors knew the exposure status of the study participants. If not, then blinding was adequate. The following example depicts how adequate blinding of the outcome assessors can be done. Investigators created a separate committee whose members were not involved in the care of the patient and had no information about the study participants' exposure status. Following a study protocol, committee members reviewed copies of participants' medical records, which had been stripped of any potential exposure information or personally identifiable information, for prespecified outcomes. If blinding was not possible, which is sometimes the case, the reviewers marked Question 12 “NA” and explained the potential for bias.
Question 13. Followup rate
Higher overall followup rates are always preferable to lower rates. Although higher rates are expected in studies of short duration, lower rates are often seen in studies of longer duration. An acceptable overall followup rate is usually considered to be 80 percent or more of participants whose exposures were measured at baseline; however, this is only a general guideline. For example, a 6-month cohort study examining the relationship between dietary sodium intake and BP level may have over 90 percent followup, whereas a 20-year cohort study examining the effects of sodium intake on stroke may have only a 65 percent followup rate.
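As a minimal illustration of this rule of thumb, the Python sketch below computes overall followup rates for two hypothetical cohorts; the counts are invented for illustration and are not drawn from any study in the review:

```python
def followup_rate(n_baseline: int, n_followed: int) -> float:
    """Fraction of participants with baseline exposure data who completed followup."""
    return n_followed / n_baseline

# Hypothetical counts for a short cohort and a long cohort.
for label, baseline, followed in [("6-month cohort", 1000, 930), ("20-year cohort", 1000, 650)]:
    rate = followup_rate(baseline, followed)
    flag = "meets" if rate >= 0.80 else "falls below"
    print(f"{label}: {rate:.0%} followup, {flag} the 80% guideline")
```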
Cross-sectional study design does not incorporate a followup period, so the answer to Question 13 should be “no” for cross-sectional analyses.
Question 14. Statistical analyses
Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Investigators often use logistic regression or other regression methods to account for the influence of variables not of interest.
This is a key issue in cohort studies: statistical analyses need to control for potential confounders, in contrast to RCTs in which the randomization process controls for potential confounders. In their analysis, investigators need to control for all key factors that may be associated with both the exposure of interest and the outcome and are not of interest to the research question. For example, a study of the relationship between cardiorespiratory fitness and CVD events (heart attacks and strokes) should control for age, BP, blood cholesterol, and body weight. All these factors are associated with both low fitness and CVD events. Well-done cohort studies control for multiple potential confounders.
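The sketch below illustrates such an adjusted analysis on simulated data, assuming the statsmodels library; the variable names and coefficients are hypothetical and chosen only to mimic the fitness example above, not taken from any study in the review:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Simulated cohort in which age and body weight confound the
# fitness-CVD association (all values are hypothetical).
age = rng.normal(55, 10, n)
weight = rng.normal(80, 15, n)
p_low_fitness = 1 / (1 + np.exp(-(0.05 * (age - 55) + 0.03 * (weight - 80))))
low_fitness = rng.binomial(1, p_low_fitness)
p_event = 1 / (1 + np.exp(-(-3 + 0.6 * low_fitness + 0.04 * (age - 55) + 0.02 * (weight - 80))))
cvd_event = rng.binomial(1, p_event)

# Logistic regression of CVD events on the exposure, adjusted for the
# measured confounders.
X = sm.add_constant(np.column_stack([low_fitness, age, weight]))
model = sm.Logit(cvd_event, X).fit(disp=0)
print(model.summary(xname=["const", "low_fitness", "age", "weight"]))
```

The coefficient on the exposure then reflects its association with the outcome after holding the measured confounders constant, which is the adjustment the question asks reviewers to look for.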
General Guidance for Determining the Overall Quality Rating of Cohort and Cross-Sectional Studies
The questions in the assessment tool were designed to help reviewers focus on key concepts for evaluating a study's internal validity, instead of being used as a list from which to add up items to judge a study's quality. Internal validity for cohort studies is the extent to which the results reported in a study can truly be attributed to the exposure being evaluated, rather than to flaws in the design or conduct of a study—in other words, the ability of the study to draw associative conclusions about the effects of the exposures being studied on outcomes. Such flaws can increase the risk of bias.
Critical appraisal involves considering the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a poor quality rating, while low risk of bias translates to a good quality rating. Again, the greater the risk of bias, the lower the quality rating of the study.
The more a study design addresses issues affecting a causal relationship between the exposure and outcome, the higher the quality of the study. These issues include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of exposure and outcome, a sufficient timeframe to see an effect, and appropriate control for confounding.
Generally, in evaluating a study, one will not see a “fatal flaw,” but will find some risk of bias. To assess potential for bias, reviewers focused on concepts underlying the questions in the quality assessment tool. For any box checked “no,” reviewers asked: “What is the potential risk of bias that may be introduced by this flaw in study design or execution?” That is, did this factor cause them to doubt the study results or doubt the ability of the study to accurately assess an association between exposure and outcome?
In summary, NHLBI staff stressed that the best approach was to examine the questions in the tool and assess how each one reveals something about the potential for bias in a study; specific rules were less useful because each study has its own nuances, and each study had to be assessed on its own.
xii Quality Assessment Tool for Case-Control Studies
Table A-4 shows the quality assessment tool for case-control studies along with the guidance document for that tool. The methodology team and NHLBI developed this tool based in part on criteria from AHRQ's EPCs, consultation with epidemiologists, and other factors. This tool includes 12 items for assessment of study quality. They include: clarity of the research objective or research question; definition, selection, composition, and participation of the study population; definition and assessment of case or control status, exposure, and outcome variables; use of concurrent controls; confirmation that the exposure occurred prior to the outcome; statistical power; and other factors.
xiii Guidance for Assessing the Quality of Case-Control Studies
The guidance document below is organized by question number from the tool for quality assessment of case-control studies.
Question 1. Research question
Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper. High-quality scientific research explicitly defines a research question.
Question 2. Study population
Did the authors describe the group of individuals from which the cases and controls were selected or recruited in terms of demographics, location, and time period? If the investigators conducted this study again, would they know exactly whom to recruit, from where, and from what time period?
Investigators identify case-control study populations by location, time period, and inclusion criteria for cases (individuals with the disease, condition, or problem) and controls (individuals without the disease, condition, or problem). For example, the population for a study of lung cancer and chemical exposure would be all incident cases of lung cancer diagnosed in patients ages 35 to 79, from January 1, 2003 to December 31, 2008, living in Texas during that entire time period, as well as controls without lung cancer recruited from the same population during the same time period. The population is clearly described in terms of who (men and women ages 35 to 79, with (cases) and without (controls) incident lung cancer), where (living in Texas), and when (between January 1, 2003, and December 31, 2008).
Other studies may use disease registries or data from cohort studies to identify cases. In these cases, the populations are individuals who live in the area covered by the disease registry or who are included in a cohort study (i.e., nested case-control or case-cohort). For example, a study of the relationship between vitamin D intake and myocardial infarction might use patients identified via the GRACE registry, a database of heart attack patients.
NHLBI staff encouraged reviewers to examine prior papers on methods (listed in the reference list) to make this assessment, if necessary.
Question 3. Target population and case representation
In order for a study to truly address the research question, the target population—the population from which the study population is drawn and to which study results are believed to apply—should be carefully defined. Some authors may compare characteristics of the study cases to characteristics of cases in the target population, either in text or in a table. When study cases are shown to be representative of cases in the appropriate target population, it increases the likelihood that the study was well-designed per the research question.
Table A-4. Quality Assessment Tool for Case-Control Studies
However, because these statistics are frequently difficult or impossible to measure, publications should not be penalized if case representation is not shown. For most papers, the response to Question 3 will be “NR.” The two subquestions are combined because the answer to the second subquestion—case representation—determines the response to this item; however, it cannot be determined without considering the response to the first subquestion. For example, if the answer to the first subquestion is “yes” and the second is “CD,” then the response for item 3 is “CD.”
Question 4. Sample size justification
Did the authors discuss their reasons for selecting or recruiting the number of individuals included? Did they discuss the statistical power of the study and provide a sample size calculation to ensure that the study is adequately powered to detect an association (if one exists)? This question does not refer to a description of the manner in which different groups were included or excluded using the inclusion/exclusion criteria (e.g., “Final study size was 1,378 participants after exclusion of 461 patients with missing data” is not considered a sample size justification for the purposes of this question.)
An article's methods section usually contains information on the sample size needed to detect differences in exposures and on statistical power.
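As a hedged illustration of what such a justification might look like, the sketch below uses statsmodels to approximate the number of participants needed per group to detect a difference in exposure prevalence; the planning values (30 percent exposure among cases versus 20 percent among controls) are hypothetical:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical planning values: 30% exposure among cases vs. 20% among
# controls, 80% power, two-sided alpha of 0.05.
effect = proportion_effectsize(0.30, 0.20)  # Cohen's h for two proportions
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, power=0.80, alpha=0.05, alternative="two-sided"
)
print(f"Approximate participants needed per group: {n_per_group:.0f}")
```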
Question 5. Groups recruited from the same population
To determine whether cases and controls were recruited from the same population, one can ask hypothetically, “If a control were to develop the outcome of interest (the condition that was used to select cases), would that person have been eligible to become a case?” Case-control studies begin with the selection of the cases (those with the outcome of interest, e.g., lung cancer) and controls (those in whom the outcome is absent). Cases and controls are then evaluated and categorized by their exposure status. For the lung cancer example, cases and controls were recruited from hospitals in a given region. One may reasonably assume that controls in the catchment area for the hospitals, or those already in the hospitals for a different reason, would attend those hospitals if they became a case; therefore, the controls are drawn from the same population as the cases. If the controls were recruited or selected from a different region (e.g., a State other than Texas) or time period (e.g., 1991-2000), then the cases and controls were recruited from different populations, and the answer to this question would be “no.”
The following example further explores selection of controls. In a study, eligible cases were men and women, ages 18 to 39, who were diagnosed with atherosclerosis at hospitals in Perth, Australia, between July 1, 2000 and December 31, 2007. Appropriate controls for these cases might be sampled using voter registration information for men and women ages 18 to 39, living in Perth (population-based controls); they also could be sampled from patients without atherosclerosis at the same hospitals (hospital-based controls). As long as the controls are individuals who would have been eligible to be included in the study as cases (if they had been diagnosed with atherosclerosis), then the controls were selected appropriately from the same source population as cases.
In a prospective case-control study, investigators may enroll individuals as cases at the time they are found to have the outcome of interest; the number of cases usually increases as time progresses. At this same time, they may recruit or select controls from the population without the outcome of interest. One way to identify or recruit cases is through a surveillance system. In turn, investigators can select controls from the population covered by that system. This is an example of population-based controls. Investigators also may identify and select cases from a cohort study population and identify controls from outcome-free individuals in the same cohort study. This is known as a nested case-control study.
Question 6. Inclusion and exclusion criteria prespecified and applied uniformly
Were the inclusion and exclusion criteria developed prior to recruitment or selection of the study population? Were the same underlying criteria used for all of the groups involved? To answer this question, reviewers determined whether the investigators developed the I/E criteria before recruiting or selecting the study population and whether they applied the same underlying criteria to all groups. The selection criteria should be identical except for the presence of the disease or condition, which by definition differs between cases and controls; that is, investigators should use the same age (or age range), gender, race, and other characteristics to select cases and controls. Information on this topic is usually found in a paper's description of the study population.
Question 7. Case and control definitions
Was a specific description of “case” and “control” provided? Is there a discussion of the validity of the case and control definitions and of the processes or tools used to identify study participants as such? Reviewers determined whether the tools or methods were accurate, reliable, and objective. For example, cases might be identified as “adult patients admitted to a VA hospital from January 1, 2000 to December 31, 2009, with an ICD-9 discharge diagnosis code of acute myocardial infarction and at least one of two confirmatory findings in their medical records: at least 2 mm of ST elevation changes in two or more ECG leads and an elevated troponin level.” Investigators might also use ICD-9 or CPT codes to identify patients. All cases should be identified using the same methods. Unless the distinction between cases and controls is accurate and reliable, investigators cannot use study results to draw valid conclusions.
Question 8. Random selection of study participants
If a case-control study did not use 100 percent of eligible cases and/or controls (e.g., not all disease-free participants were included as controls), did the authors indicate that random sampling was used to select controls? When it is possible to identify the source population fairly explicitly (e.g., in a nested case-control study, or in a registry-based study), then random sampling of controls is preferred. When investigators used consecutive sampling, which is frequently done for cases in prospective studies, then study participants are not considered randomly selected. In this case, the reviewers would answer “no” to Question 8. However, this would not be considered a fatal flaw.
If investigators included all eligible cases and controls as study participants, then reviewers marked “NA” in the tool. If 100 percent of cases were included (e.g., NA for cases) but only 50 percent of eligible controls, then the response would be “yes” if the controls were randomly selected, and “no” if they were not. If this cannot be determined, the appropriate response is “CD.”
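A minimal sketch of the sampling scenario described above, with an invented pool of eligible controls; the names and counts are purely illustrative:

```python
import random

random.seed(42)
eligible_controls = [f"control_{i}" for i in range(200)]  # hypothetical disease-free pool

# Randomly sample half of the eligible controls; all cases are enrolled.
sampled_controls = random.sample(eligible_controls, k=len(eligible_controls) // 2)
print(f"Enrolled {len(sampled_controls)} of {len(eligible_controls)} eligible controls")
```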
Question 9. Concurrent controls
A concurrent control is a control selected at the time another person became a case, usually on the same day. This means that one or more controls are recruited or selected from the population without the outcome of interest at the time a case is diagnosed. Investigators can use this method in both prospective and retrospective case-control studies. For example, in a retrospective study of adenocarcinoma of the colon using data from hospital records, if the records indicate that Person A was diagnosed with adenocarcinoma of the colon on June 22, 2002, then investigators would select one or more controls from the population of patients without adenocarcinoma of the colon on that same day. The investigators could also have conducted this study using patient records from a cohort study, in which case it would be a nested case-control study.
Investigators can use concurrent controls in the presence or absence of matching and vice versa. A study that uses matching does not necessarily mean that concurrent controls were used.
Question 10. Exposure assessed prior to outcome measurement
In case-control studies, investigators first determine case or control status (based on the presence or absence of the outcome of interest) and then assess the exposure history of the case or control; therefore, reviewers ascertained whether the exposure preceded the outcome. For example, if the investigators used tissue samples to determine exposure, did they collect them from patients prior to diagnosis? If hospital records were used, did investigators verify that the date a patient was exposed (e.g., received medication for atherosclerosis) occurred prior to the date the patient became a case (e.g., was diagnosed with type 2 diabetes)? For an association between an exposure and an outcome to be considered causal, the exposure must have occurred prior to the outcome.
Question 11. Exposure measures and assessment
Were the exposure measures defined in detail? Were the tools or methods used to measure exposure accurate and reliable—for example, have they been validated or are they objective? This is important, as it influences confidence in the reported exposures. Equally important is whether the exposures were assessed in the same manner within groups and between groups. This question pertains to bias resulting from exposure misclassification (i.e., exposure ascertainment).
For example, a retrospective self-report of dietary salt intake is not as valid and reliable as prospectively using a standardized dietary log plus testing participants' urine for sodium content because participants' retrospective recall of dietary salt intake may be inaccurate and result in misclassification of exposure status. Similarly, BP results from practices that use an established protocol for measuring BP would be considered more valid and reliable than results from practices that did not use standard protocols. A protocol may include using trained BP assessors, standardized equipment (e.g., the same BP device which has been tested and calibrated), and a standardized procedure (e.g., patient is seated for 5 minutes with feet flat on the floor, BP is taken twice in each arm, and all four measurements are averaged).
Question 12. Blinding of exposure assessors
Blinding or masking here means that the exposure assessors did not know whether participants were cases or controls. To answer this question, reviewers examined articles for evidence that the person(s) assessing exposure was masked to the case or control status of the research participants. An exposure assessor, for example, may examine medical records to determine the exposures that occurred in the case and control groups. Sometimes the person determining case or control status is the same person conducting the exposure assessment; in this case, the exposure assessor would most likely not be blinded. A reviewer would note such a finding in the comments section of the assessment tool.
One way to ensure good blinding of exposure assessment is to have a separate committee, whose members have no information about the study participants' status as cases or controls, review research participants' records. To help answer the question above, reviewers determined whether it was likely that the exposure assessor knew whether the study participant was a case or control; if it was unlikely, then blinding was considered adequate. Exposure assessors who used medical records should not have been directly involved in the study participants' care, since they probably would have known about their patients' conditions. If the medical records contained information on the patient's condition that identified him or her as a case (which is likely), that information would have had to be removed before the exposure assessors reviewed the records.
If blinding was not possible, which sometimes happens, the reviewers marked “NA” in the assessment tool and explained the potential for bias.
Question 13. Statistical analysis
Were key potential confounding variables measured and adjusted for, such as by statistical adjustment for baseline differences? Investigators often use logistic regression or other regression methods to account for the influence of variables not of interest.
This is a key issue in case-control studies: statistical analyses need to control for potential confounders, in contrast to RCTs, in which the randomization process controls for potential confounders. In the analysis, investigators need to control for all key factors that may be associated with both the exposure of interest and the outcome but are not of interest to the research question.
A study of the relationship between smoking and CVD events illustrates this point. Such a study needs to control for age, gender, and body weight; all are associated with smoking and CVD events. Well-done case-control studies control for multiple potential confounders.
Matching is a technique used to improve study efficiency and control for known confounders. For example, in the study of smoking and CVD events, an investigator might identify cases that have had a heart attack or stroke and then select controls of similar age, gender, and body weight. For case-control studies, if matching was performed during the selection or recruitment process, the variables used as matching criteria (e.g., age, gender, race) must be accounted for in the analysis, as illustrated in the sketch below.
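Matched case-control data are conventionally analyzed with conditional logistic regression, which conditions on the matched sets. The sketch below uses simulated pairs and statsmodels' ConditionalLogit; the pair counts and exposure probabilities are hypothetical and this is not the analysis of any study in the review:

```python
import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(1)
n_pairs = 500

# One case and one matched control per stratum; smoking (the exposure)
# is simulated to be more common among cases.
strata = np.repeat(np.arange(n_pairs), 2)
case = np.tile([1, 0], n_pairs)
smoking = rng.binomial(1, np.where(case == 1, 0.45, 0.30)).astype(float)

# Conditional logistic regression respects the matched design by
# conditioning on each case-control stratum.
result = ConditionalLogit(case, smoking[:, None], groups=strata).fit(disp=0)
print(f"Estimated odds ratio for smoking: {np.exp(result.params[0]):.2f}")
```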
General Guidance for Determining the Overall Quality Rating of Case-Control Studies
NHLBI designed the questions in the assessment tool to help reviewers focus on the key concepts for evaluating a study's internal validity, not to use as a list from which to add up items to judge a study's quality.
Internal validity for case-control studies is the extent to which the associations between disease and exposure reported in the study can truly be attributed to the exposure being evaluated rather than to flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the exposures on outcomes? Any such flaws can increase the risk of bias.
In critically appraising a study, the following factors need to be considered: the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a poor quality rating; low risk of bias translates to a good quality rating. Again, the greater the risk of bias, the lower the quality rating of the study.
In addition, the more attention in the study design to issues that can help determine whether there is a causal relationship between the outcome and the exposure, the higher the quality of the study. These include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, sufficient timeframe to see an effect, and appropriate control for confounding—all concepts reflected in the tool.
If a study has a “fatal flaw,” then risk of bias is significant; therefore, the study is deemed to be of poor quality. An example of a fatal flaw in case-control studies is a lack of a consistent standard process used to identify cases and controls.
Generally, when reviewers evaluated a study, they did not see a “fatal flaw,” but instead found some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers examined the potential for bias in the study. For any box checked “no,” reviewers asked, “What is the potential risk of bias resulting from this flaw in study design or execution?” That is, did this factor lead to doubt about the results reported in the study or the ability of the study to accurately assess an association between exposure and outcome?
By examining questions in the assessment tool, reviewers were best able to assess the potential for bias in a study. Specific rules were not useful, as each study had specific nuances. In addition, being familiar with the key concepts helped reviewers assess the studies. Examples of studies rated good, fair, and poor were useful, yet each study had to be assessed on its own.
xiv Quality Assessment Tool for Before-After Studies
Table A-5 shows the quality assessment tool for before-after (pre-post) studies along with the guidance document for that tool. The methodology team and NHLBI developed this tool based in part on criteria from AHRQ's EPCs, other papers addressing quality assessment of similar studies, and other factors.
This tool includes 12 items for assessment of study quality. They include: clarity of the research objective or research question; definition, selection, composition, and participation of the study population; definition and assessment of intervention and outcome variables; adequacy of blinding; statistical methods; and other factors.
xv Guidance for Assessing the Quality of Before-After (Pre-Post) Studies With No Control Group
The guidance document below is organized by question number from the tool for quality assessment of before-after (pre-post) studies with no control group.
Question 1. Study question
Did the authors describe their goal in conducting this research? Is it easy to understand what they were looking to find? This issue is important for any scientific paper. High-quality scientific research explicitly defines a research question.
Question 2. Eligibility criteria and study population
Did the authors describe the eligibility criteria applied to the individuals from whom the study participants were selected or recruited? In other words, if the investigators were to conduct this study again, would they know whom to recruit, from where, and from what time period?
Here is a sample description of a study population: men over age 40 with type 2 diabetes who began seeking medical care at Phoenix Good Samaritan Hospital between January 1, 2005, and December 31, 2007. The population is clearly described in terms of who (men over age 40 with type 2 diabetes), where (Phoenix Good Samaritan Hospital), and when (between January 1, 2005, and December 31, 2007). Another sample description is women in the nursing profession who were ages 34 to 59 in 1995, had no known CHD, stroke, cancer, hypercholesterolemia, or diabetes, and were recruited from the 11 most populous States, with contact information obtained from State nursing boards.
To assess this question, reviewers examined prior papers on study methods (listed in reference list) when necessary.
Question 3. Study participants representative of clinical populations of interest
The participants in the study should be generally representative of the population in which the intervention will be broadly applied. Studies on small demographic subgroups may raise concerns about how the intervention will affect broader populations of interest. For example, interventions that focus on very young or very old individuals may affect middle-aged adults differently. Similarly, researchers may not be able to extrapolate study results from patients with severe chronic diseases to healthy populations.
Table A-5. Quality Assessment Tool for Before-After (Pre-Post) Studies With No Control Group
Question 4. All eligible participants enrolled
To further explore this question, reviewers may need to ask: Did the investigators develop the I/E criteria prior to recruiting or selecting study participants? Were the same underlying I/E criteria used for all research participants? Were all subjects who met the I/E criteria enrolled in the study?
Question 5. Sample size
Did the authors present their reasons for selecting or recruiting the number of individuals included or analyzed? Did they note or discuss the statistical power of the study? This question addresses whether there was a sufficient sample size to detect an association, if one existed.
An article's methods section may provide information on the sample size needed to detect a hypothesized difference in outcomes and a discussion of statistical power (e.g., the study had 85 percent power to detect a 20 percent increase in the rate of an outcome of interest, with a two-sided alpha of 0.05). Sometimes estimates of variance and/or effect size are given instead of sample size calculations. In any case, if the reviewers determined that the power was sufficient to detect the effects of interest, they answered “yes” to Question 5.
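As a rough illustration of this kind of check, the sketch below uses statsmodels to estimate the power of a hypothetical study; the sample size (1,200 per period) and the baseline event rate (25 percent, with a 20 percent relative increase to 30 percent) are invented, and the two-sample approximation is a simplification of a true before-after design:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical check: with 1,200 participants per period and a baseline
# event rate of 25%, what power is there to detect a 20% relative
# increase (to 30%) at a two-sided alpha of 0.05?
effect = proportion_effectsize(0.30, 0.25)
power = NormalIndPower().solve_power(
    effect_size=effect, nobs1=1200, alpha=0.05, alternative="two-sided"
)
print(f"Estimated power: {power:.0%}")
```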
Question 6. Intervention clearly described
Was the intervention defined in detail in the study? Did the authors indicate that the intervention was applied consistently to the subjects? Did the research participants adhere closely to the requirements of the intervention? For example, if the investigators assigned a group to 10 mg/day of Drug A, did most participants in this group take that specific dosage, or did a large percentage end up not taking the dose indicated in the study protocol?
Reviewers ascertained that changes in study outcomes could be attributed to study interventions. If participants received interventions that were not part of the study protocol and could affect the outcomes being assessed, the results could be biased.
Question 7. Outcome measures clearly described, valid, and reliable
Were the outcomes defined in detail? Were the tools or methods for measuring outcomes accurate and reliable—for example, have they been validated or are they objective? This question is important because the answer influences confidence in the validity of study results.
An example of an outcome measure that is objective, accurate, and reliable is death—the outcome measured with more accuracy than any other. But even with a measure as objective as death, differences can exist in the accuracy and reliability of how investigators assessed death. For example, did they base it on an autopsy report, death certificate, death registry, or report from a family member? Another example of a valid study is one whose objective is to determine if dietary fat intake affects blood cholesterol level (cholesterol level being the outcome) and in which the cholesterol level is measured from fasting blood samples that are all sent to the same laboratory. These examples would get a “yes.”
An example of a “no” would be self-report by subjects that they had a heart attack, or self-report of how much they weighed (if body weight is the outcome of interest).
Question 8. Blinding of outcome assessors
Blinding or masking means that the outcome assessors did not know whether the participants received the intervention or were exposed to the factor under study. To answer the question above, the reviewers examined articles for evidence that the person(s) assessing the outcome(s) was masked to the participants' intervention or exposure status. An outcome assessor, for example, may examine medical records to determine the outcomes that occurred in the exposed and comparison groups. Sometimes the person applying the intervention or measuring the exposure is the same person conducting the outcome assessment. In this case, the outcome assessor would not likely be blinded to the intervention or exposure status. A reviewer would note such a finding in the comments section of the assessment tool.
In assessing this criterion, the reviewers determined whether it was likely that the person(s) conducting the outcome assessment knew the exposure status of the study participants. If not, then blinding was adequate. An example of adequate blinding of the outcome assessors is to create a separate committee whose members were not involved in the care of the patient and had no information about the study participants' exposure status. Using a study protocol, committee members would review copies of participants' medical records, which would be stripped of any potential exposure information or personally identifiable information, for prespecified outcomes.
Question 9. Followup rate
Higher overall followup rates are always preferable to lower rates, although higher rates are expected in shorter studies and lower rates in longer studies. An acceptable overall followup rate is usually considered to be 80 percent or more of participants whose interventions or exposures were measured at baseline. However, this is only a general guideline.
In the analysis, investigators may have accounted for those lost to followup by imputing values of the outcome or by other methods. For example, they may carry forward the baseline value or the last observed value of the outcome measure and use it as the imputed final outcome for participants lost to followup.
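A minimal sketch of last-observation-carried-forward imputation using pandas; the participants, visit names, and weights are hypothetical:

```python
import pandas as pd

# Hypothetical longitudinal weights (kg); missing values mark visits
# that were missed after dropout.
df = pd.DataFrame({
    "participant": ["A", "B", "C"],
    "baseline": [92.0, 88.5, 101.2],
    "month_6": [89.4, 87.0, None],
    "month_12": [87.1, None, None],
})

# Last observation carried forward: fill each missing visit with the
# most recent observed value, starting from baseline.
visits = ["baseline", "month_6", "month_12"]
df[visits] = df[visits].ffill(axis=1)
print(df)
```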
Question 10. Statistical analysis
Were formal statistical tests used to assess the significance of the changes in the outcome measures between the before and after time periods? The reported study results should present values for statistical tests, such as p values, to document the statistical significance (or lack thereof) for the changes in the outcome measures found in the study.
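For a continuous outcome measured on the same participants before and after an intervention, a paired test is one common choice. The sketch below, on simulated blood pressure values, uses scipy's paired t-test; all numbers are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical systolic BP (mmHg) for the same 40 participants before and
# after an intervention simulated to lower BP by roughly 6 mmHg on average.
before = rng.normal(150, 12, 40)
after = before - rng.normal(6, 8, 40)

# Paired t-test on the within-person before-after differences.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```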
Question 11. Multiple outcome measures
Were the outcome measures for each person measured more than once during the course of the before and after study periods? Multiple measurements with the same result increase confidence that the outcomes were accurately measured.
Question 12. Group-level interventions and individual-level outcome effects
Group-level interventions are usually not relevant for clinical interventions such as bariatric surgery, in which the intervention is applied at the individual patient level. In those cases, this question was coded as “NA” in the assessment tool.
General Guidance for Determining the Overall Quality Rating of Before-After Studies
The questions in the quality assessment tool were designed to help reviewers focus on the key concepts for evaluating the internal validity of a study. They are not intended to create a list from which to add up items to judge a study's quality.
Internal validity is the extent to which the outcome results reported in the study can truly be attributed to the intervention or exposure being evaluated, and not to biases, measurement errors, or other confounding factors that may result from flaws in the design or conduct of the study. In other words, what is the ability of the study to draw associative conclusions about the effects of the interventions or exposures on outcomes?
Critical appraisal of a study involves considering the potential for selection bias, information bias, measurement bias, or confounding (the mixture of exposures that one cannot tease out from each other). Examples of confounding include co-interventions, differences at baseline in patient characteristics, and other issues addressed in the questions above. High risk of bias translates to a rating of poor quality; low risk of bias translates to a rating of good quality. Again, the greater the risk of bias, the lower the quality rating of the study.
In addition, the more attention paid in the study design to issues that can help determine whether there is a causal relationship between the exposure and the outcome, the higher the quality of the study. These issues include exposures occurring prior to outcomes, evaluation of a dose-response gradient, accuracy of measurement of both exposure and outcome, and a sufficient timeframe to see an effect.
Generally, when reviewers evaluate a study, they will not see a “fatal flaw,” but instead will find some risk of bias. By focusing on the concepts underlying the questions in the quality assessment tool, reviewers should ask themselves about the potential for bias in the study they are critically appraising. For any box checked “no” reviewers should ask, “What is the potential risk of bias resulting from this flaw in study design or execution?” That is, does this factor lead to doubt about the results reported in the study or doubt about the ability of the study to accurately assess an association between the intervention or exposure and the outcome?
The best approach is to think about the questions in the assessment tool and how each one reveals something about the potential for bias in a study. Specific rules are not useful, as each study has specific nuances. In addition, being familiar with the key concepts will help reviewers be more comfortable with critical appraisal. Examples of studies rated good, fair, and poor are useful, but each study must be assessed on its own.
xvi Quality Assessment Tool for Case Series Studies
Table A-6 shows the quality assessment tool for case series studies. The methodology team and NHLBI developed this tool based in part on criteria from AHRQ's EPCs, other papers addressing quality assessment of similar studies, and other factors.
This tool includes nine items for assessment of study quality. They include: clarity of the research objective or research question; definition, selection, composition, and participation of the study population; definition and assessment of intervention and outcome variables; statistical methods; and other factors.
Data Abstraction and Review Process
Articles rated good or fair during the quality rating process were abstracted into the VCW using a Web-based data entry form. Requirements for abstraction were specified in an evidence table template that the methodologist developed for each CQ. The evidence table template included data elements relevant to the CQ such as study characteristics, interventions, population demographics, and outcomes.
The abstractor carefully read the article and entered the required information into the Web-based tool. Once abstraction was complete, an independent quality control review was conducted. During this review, data were checked for accuracy, completeness, and the use of standard formatting.
xvii Development of Evidence Tables and Summary Tables
a Evidence Tables
For each CQ, methodologists worked with the expert panel or work group members to identify the key data elements needed to answer the question. Using the PICOTS criteria as the foundation, expert panel or work group members determined what information was needed from each study to be able to understand the design, sample, and baseline characteristics in order to interpret the outcomes of interest. A template for a standard evidence table was created and then populated with data from several example studies for the expert panel or work group to review. This was done to ensure that all appropriate study characteristics were being considered. Once a final template was agreed upon, evidence tables were generated by pulling the appropriate data elements from the master abstraction database for those studies that met the inclusion criteria for the CQ.
Only studies rated “good” and “fair” were included in the evidence tables.
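As a rough sketch of this filtering step, assuming the abstracted records sit in a pandas DataFrame; the column names and studies are hypothetical and do not reflect the project's actual database schema:

```python
import pandas as pd

# Hypothetical slice of a master abstraction database.
abstractions = pd.DataFrame({
    "study": ["Trial A", "Trial B", "Cohort C", "Trial D"],
    "quality_rating": ["good", "poor", "fair", "good"],
    "meets_cq_criteria": [True, True, True, False],
    "overall_n": [1204, 310, 5521, 860],
})

# Keep only good- or fair-rated studies that meet the CQ inclusion criteria.
evidence_table = abstractions[
    abstractions["quality_rating"].isin(["good", "fair"])
    & abstractions["meets_cq_criteria"]
]
print(evidence_table)
```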
Templates varied by each individual CQ but generally provided the following information:
- Study characteristics: Author, year, study name, country and setting, funding, study design, research objective, year study began, overall study N, quality rating
- Criteria and end points: I/E criteria, primary outcome, secondary outcome, composite outcome definitions
- Study design details: Treatment groups, descriptions of interventions, duration of treatment, duration of followup, run-in, wash-out, sample size
- Baseline population characteristics: Demographics, biomarkers, other measures relevant to the outcomes
- Results: Outcomes of interest for the CQ with between-group p values or confidence intervals for risk ratios, adverse events, attrition, and adherence
Table A-6. Quality Assessment Tool for Case Series Studies
Studies are presented in alphabetical order by study name (if none, the first author's last name was used). Some expert panels combined all the articles for a study and presented them as a single entry; for panels that did not, the articles for the same study were presented together in chronological order.
b Summary Tables
To enable a more targeted focus on specific aspects of a CQ, methodologists developed summary tables (abbreviated evidence tables) in concert with the panels or work groups. A summary table might be designed to address a general population or a specific subpopulation, such as individuals with diabetes, women, or the elderly, but it presents only concise data elements. All available data in the evidence tables were reviewed so that the specific outcome of interest could be presented in a consistent format. For example, some lifestyle interventions have lengthy descriptions in the evidence tables, but only their key features were concisely stated in the summary tables. Within an outcome, the time periods were clearly identified, and the order of the different measures was consistently applied. For example, weight loss is always listed by percentage change in body weight, followed by kilogram change, and lastly by the proportion of subjects losing a certain percentage of their body weight. Templates varied by the aspect of the CQ being addressed but generally provided the following information:
- Study characteristics: Study name, author/year, design, overall study N, quality rating
- Sample characteristics: Relevant inclusion criteria
- Study design details: Intervention doses and duration
- Results: Change in outcomes by time periods, attrition, and adherence
Each panel or work group determined its own ordering of studies within each summary table. For some, trials were listed in chronological order; for others, by the type or characteristics of the intervention.
xviii Process for the Development of Evidence Statements and Expert Panel Voting
Using the summary tables (and evidence tables as needed), panel members collaboratively wrote the evidence statements with input from methodology staff and oversight of the process by NHLBI staff. Evidence statements aimed to summarize key messages from the evidence that could be provided to primary care providers and other stakeholders. In some cases, the evidence was too limited or inconclusive, so no evidence statement was developed, or a statement of insufficient evidence was made.
Methodology staff provided the expert panels with overarching guidance on how to grade the level of evidence (high, moderate, low), and the panels used this guidance to grade each evidence statement. This guidance is documented in the following section.
Beginning in September 2011, the GEC set up its own approach to manage relationships with industry and other potential conflicts of interest (see http://www.nhlbi.nih.gov/guidelines/cvd_adult/coi-rwi_policy.htm).
Panel members having relationships with industry (RWI) or other possible conflicts of interest (COI) were allowed to participate in discussions leading up to voting as long as they declared their relationships, but they recused themselves from voting on any issue relating to their RWI or potential COI. Voting occurred by a panel chair asking each member to signify his or her vote. NHLBI project staff, methodologists, and contractors did not vote.
Voting could be open, so that differing viewpoints could be identified easily, facilitating further discussion and revisions to address areas of disagreement (e.g., by refining language or dividing an evidence statement into more than one statement). Voting also could be by confidential ballot if the group so chose.
A record of the vote count (for, against, or recusal) was made without attribution. The ideal was 100 percent consensus, but a two-thirds majority was considered acceptable. When a two-thirds majority was not reached on the initial vote, further discussion and clarification were used to reach a consensus majority.
xix Description of Methods for Grading the Body of Evidence
NHLBI's Adult Cardiovascular Disease Systematic Evidence Review Project applied related but distinct processes for grading the bodies of evidence for CQs and for the bodies of evidence for different outcomes included within CQs. Each of these processes is described in turn below.
a Grading the Body of Evidence
In developing the system for grading the body of evidence, NHLBI reviewed the following systems: Grading of Recommendations, Assessment, Development, and Evaluation (GRADE); USPSTF; American College of Cardiology/American Heart Association (ACC/AHA); American Academy of Pediatrics; Strength of Recommendation Taxonomy; Canadian Task Force on Preventive Health Care; Scottish Intercollegiate Guidelines Network; and Center for Evidence-Based Medicine in Oxford. In particular, GRADE, USPSTF, and ACC/AHA were considered at length. However, none of those systems fully met the needs of the NHLBI project. NHLBI, therefore, developed its own hybrid version that incorporated features of those systems. The expert panel and work group members strongly supported the resulting system and, with the methodology team, used it to decide about evidence ratings.
Two approaches were used for summarizing the body of evidence for each CQ. The first was to conduct a de novo literature search and literature review for all of the individual studies that met a CQ's I/E criteria. This approach was used for most of the CQs. The second, developed in response to resource limitations for the overall project, was to focus the literature search on existing systematic reviews and meta-analyses that themselves summarized a broad range of the scientific literature. This approach was used for several CQs across expert panels and work groups. Additional information on the use of systematic reviews and meta-analyses is provided in the following section.
Once the panel and work group members reached consensus on the wording of an evidence statement, the next step was to assign a grade to the strength of the body of evidence, to provide guidance to primary care providers and other stakeholders about the degree of support the evidence provides for the evidence statement. Three grades for the strength of evidence were identified: high, moderate, and low.
Table A-7 describes the types of evidence that were used to grade the strength of evidence as high, moderate, or low by the expert panel and work group members, with assistance from methodologists.
The strength of the body of evidence represents the degree of certainty, based on the overall body of evidence, that an effect or association is correct. It is important to assess the strength of the evidence as objectively as possible. For rating the overall strength of evidence, the entire body of evidence for a particular summary table and its associated evidence statement was used.
Methodologists provided guidance to the panels and work groups for assessing the body of evidence for each outcome or summary table of interest using four domains: risk of bias, consistency, directness, and precision. Each domain was assessed and discussed, and the aggregate assessment was used to increase or decrease the strength of the evidence, as determined by the NHLBI Evidence Quality Grading System shown in Table A-7. The four domains are explained in more detail below.
b Risk of bias
Risk of bias refers to the likelihood that the body of included studies for a given question or outcome is biased due to flaws in the design or conduct of the studies. Risk of bias and internal validity are related concepts that are inversely correlated: a study with a low risk of bias has high internal validity and is more likely to provide correct results than one with high risk of bias and low internal validity. At the individual study level, risk of bias is determined by rating the quality of each study using standard rating instruments, such as the NHLBI study quality rating tools presented and discussed in the previous sections of this report. The overall risk of bias for the body of evidence regarding a particular question, summary table, or outcome is then assessed from the aggregate quality of the available studies. Panel and work group members reviewed the individual study quality ratings with methodologists to determine this aggregate quality. A low risk of bias increased the rating for the strength of the overall body of evidence; a high risk of bias decreased it.
Table A-7. Evidence Quality Grading System
c Consistency
Consistency is the degree to which reported effect sizes are similar across the included studies for a particular question or outcome. Consistency enhances the overall strength of evidence and is assessed by whether effect sizes are in the same direction (i.e., multiple studies demonstrate an improvement in a particular outcome) and whether the range of effect sizes across studies is narrow. Inconsistent evidence is reflected in effect sizes that are in different directions, a broad range of effect sizes, nonoverlapping confidence intervals, or unexplained clinical or statistical heterogeneity. Studies included for a particular question or outcome can have effect sizes that are consistent, inconsistent, or unknown (or not applicable); the latter occurs when there is only a single study. For the NHLBI project, consistent with the approach of AHRQ's EPCs, evidence from a single study generally should be considered insufficient for a high strength of evidence rating because a single trial, no matter how large or well designed, may not provide definitive evidence of a particular effect until confirmed by another trial. However, a very large, multicentered, well-designed, well-executed RCT that performs well in the other domains could, in some circumstances, be considered high quality evidence after thoughtful consideration.
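The following sketch shows one simple, illustrative screen for consistency: it checks that effect estimates point in the same direction and that all confidence intervals share a common region. The data are hypothetical, and this is a heuristic, not the formal NHLBI procedure:

```python
def consistent(studies):
    """Rough consistency screen over (effect, ci_low, ci_high) tuples.

    Checks that all effects share a direction and that the intervals
    share at least one common value (largest lower bound does not
    exceed smallest upper bound).
    """
    same_direction = all(s[0] > 0 for s in studies) or all(s[0] < 0 for s in studies)
    overlapping = max(s[1] for s in studies) <= min(s[2] for s in studies)
    return same_direction and overlapping

# Hypothetical log risk ratios with 95% CIs from three studies.
studies = [(-0.22, -0.35, -0.09), (-0.18, -0.30, -0.06), (-0.25, -0.41, -0.09)]
print("Consistent" if consistent(studies) else "Inconsistent")
```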
d Directness
Directness has two aspects: the direct line of causality and the degree to which findings can be extended from a specific population to a more general population. The first aspect concerns whether the evidence being assessed reflects a single, direct link between the intervention (or service, approach, exposure, etc.) of interest and the ultimate health outcome under consideration. Indirect evidence relies on intermediate or surrogate outcomes that serve as links along a causal pathway. Evidence that an intervention results in changes in important health outcomes (e.g., mortality, morbidity) increases the strength of the evidence. Evidence that an intervention results in changes limited to intermediate or surrogate outcomes (e.g., a blood measurement) decreases the strength of the evidence. However, the importance of each link in the chain should be considered, including existing evidence that a change in an intermediate outcome affects important health outcomes.
Another example of directness involves whether the bodies of evidence used to compare interventions are the same. For example, if Drug A is compared to placebo in one study and Drug B is compared to placebo in another study, using those two studies to compare Drug A with Drug B yields indirect evidence and provides a lower strength of the evidence than direct head-to-head studies of Drug A with Drug B.
The second aspect of directness refers to the degree to which participants or interventions in the study are different from those to whom the study results are being applied. This concept is referred to as “applicability.” If the population or interventions are similar, then the evidence is direct and strengthened. If they are different, then the evidence is indirect and weakened.
e Precision
Precision is the degree of certainty about an estimate of effect for a specific outcome of interest. Indicators of precision are statistical significance and confidence intervals. Precise estimates enable firm conclusions to be drawn about an intervention's effect relative to another intervention or control. An imprecise estimate is one whose confidence interval is so wide that the superiority or inferiority of an intervention cannot be determined. Precision is related to the statistical power of the study; an outcome that was not the primary outcome or was not prespecified will generally be estimated less precisely than the primary outcome of a study. In a meta-analysis, precision is reflected by the confidence interval around the summary effect size. For systematic reviews, which include multiple studies but no quantitative summary estimate, the quantitative information from each study should be considered in determining the overall precision of the body of included studies, because some studies may be more precise than others. Determining precision across many studies without conducting a formal meta-analysis is challenging and requires judgment. A more precise body of evidence increases the strength of evidence; less precision reduces it.
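To make the meta-analytic case concrete, the sketch below pools hypothetical log risk ratios with fixed-effect inverse-variance weighting and reports the confidence interval around the summary effect, which is the quantity the precision domain examines; the study estimates are invented:

```python
import numpy as np

# Hypothetical log risk ratios and standard errors from four studies.
effects = np.array([-0.22, -0.10, -0.31, -0.15])
ses = np.array([0.08, 0.12, 0.15, 0.06])

# Fixed-effect inverse-variance pooling: weight each study by 1/SE^2.
weights = 1 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
ci_low, ci_high = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se

print(f"Pooled log RR: {pooled:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"Pooled RR: {np.exp(pooled):.2f} ({np.exp(ci_low):.2f} to {np.exp(ci_high):.2f})")
```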
After discussing the four domains used to grade the strength of evidence, the expert panels and work groups in some cases also considered other factors. For example, the objectivity of an outcome measure needs to be assessed: total mortality (usually recorded accurately) is a more objective measure than angina; urinary sodium excretion is a more objective measure than dietary sodium intake reported by study subjects through recall; and measured height and weight, used to calculate a study subject's BMI, are more objective than self-reported weight and height.
After the panel and work group members reviewed and discussed this range of factors, they voted on the final grade for the strength of evidence for each evidence statement. Methodologists provided analysis and recommendations regarding strength of evidence grading but did not participate in the voting process. A simple majority vote was sufficient to set the strength of evidence grade; however, when there were dissenting opinions, the panels and work groups usually discussed the results until they achieved consensus or a large majority.
xx Policy and Procedures for the Use of Existing Systematic Reviews and Meta-Analyses
Systematic reviews and meta-analyses are routinely used in evidence reviews, and well-conducted systematic reviews and meta-analyses of RCTs are generally considered to be among the highest forms of evidence. As a result, systematic reviews and meta-analyses could be used to inform guideline development in the NHLBI CVD adult systematic evidence review project if certain criteria were met. AHRQ has published guidance on using existing systematic reviews, which helped to inform the development of the NHLBI criteria.
To use existing systematic reviews or meta-analyses to inform the NHLBI evidence report, the project needed to identify those relevant to the topic of interest, those with a low risk of bias, and those that were recent. The first item was addressed by examining the research question and component studies in the systematic reviews and meta-analyses as they related to the NHLBI CQs; the second, by using a quality assessment tool; and the third, by examining publication dates.
In general, the project followed the process below when using systematic reviews and meta-analyses. Three situations determined how an existing systematic review or meta-analysis could be used.
Situation #1—When a systematic review or meta-analysis addresses a topic relevant to the NHLBI CVD systematic evidence reviews that was not covered by an existing CQ (e.g., effects of physical activity on CVD risk):
- For a systematic review or meta-analysis to be examined for relevance to the topic of interest, the topic needed to be prespecified in the form of a CQ using the PICO structure (population, intervention/exposure, comparator, and outcome). If only portions of a systematic review were relevant, those portions could be used, provided they were reported separately. For example, in the Department of Health and Human Services' (HHS) 2008 systematic review on physical activity, the effects of physical activity on CVD were relevant and were used to make evidence statements because they were reported in a separate chapter. The effects of physical activity on mental health, however, were not relevant and therefore were not used in crafting NHLBI evidence statements.
- Systematic reviews or meta-analyses could be used if they were recent (i.e., published within 3 years of the December 31, 2009, end date of the NHLBI systematic review publication window), or if they were identified by the panel or work group and published after the end date of the project literature search but before the panel began to deliberate on evidence statements. If the end date of the systematic review or meta-analysis literature search was before December 31, 2009, panel or work group members could conduct a bridging literature search through December 31, 2009, under two conditions: (1) they believed it was necessary to review relevant studies published after the end date, and (2) the bridging literature search covered the period beginning up to 1 year before the literature search cut-off date of the systematic review or meta-analysis and extended no later than December 31, 2009 (the date arithmetic is sketched below).
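The bridging-window rule reduces to simple date arithmetic. The following is a minimal sketch, with a hypothetical function name, of how that window could be computed; leap-day edge cases are ignored.

```python
from datetime import date

PROJECT_CUTOFF = date(2009, 12, 31)  # end of the NHLBI search window

def bridging_window(review_search_end: date):
    """Permissible bridging-search window for an existing review.

    The bridging search may begin up to one year before the review's own
    literature cut-off and must extend no later than December 31, 2009.
    Returns None when the review already covers the project window.
    """
    if review_search_end >= PROJECT_CUTOFF:
        return None
    start = review_search_end.replace(year=review_search_end.year - 1)
    return start, PROJECT_CUTOFF

print(bridging_window(date(2008, 6, 30)))  # window from 2007-06-30 through 2009-12-31
print(bridging_window(date(2010, 3, 1)))   # None: no bridging search needed
```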
Situation #2—If the NHLBI literature review identified an existing systematic review or meta-analysis that could replace NHLBI's review of a CQ or subquestion:
- The systematic review or meta-analysis was examined for consistency between its component studies and the CQ I/E criteria. Component studies had to meet the I/E criteria; however, smaller sample sizes were allowed, as were studies published before the beginning of the NHLBI project's search date window, as long as a truly systematic approach was used. If the end date of the systematic review or meta-analysis literature search was before December 31, 2009, panel or work group members could conduct a bridging literature search through December 31, 2009, under the same two conditions as in Situation #1: (1) they believed it was necessary to review relevant studies published after the end date, and (2) the bridging literature search covered the period beginning up to 1 year before the literature search cut-off date of the systematic review or meta-analysis and extended no later than December 31, 2009.
Situation #3—If NHLBI's literature review identified an existing systematic review or meta-analysis that addressed the same or a similar CQ or subquestion as one undergoing NHLBI review:
- Systematic review or meta-analysis component articles that met all the I/E criteria for the CQ but were not identified in NHLBI's literature search could be added to the included studies in NHLBI's review and treated the same way as other included studies (i.e., abstracted, quality rated, and added to evidence and summary tables); a merge of this kind is sketched below.
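As an illustration only, the following sketch (with a hypothetical record structure keyed on PMID) shows how component articles that pass the I/E criteria could be merged into the included-study set without duplication.

```python
def merge_included(nhlbi_included, review_components, meets_ie):
    """Add component articles that meet all I/E criteria but were missed
    by the NHLBI search, skipping articles already included (by PMID)."""
    known_pmids = {article["pmid"] for article in nhlbi_included}
    additions = [a for a in review_components
                 if meets_ie(a) and a["pmid"] not in known_pmids]
    return nhlbi_included + additions  # additions then get abstracted and rated

included = [{"pmid": "111", "title": "Trial A"}]
components = [{"pmid": "111", "title": "Trial A"},
              {"pmid": "222", "title": "Trial B"}]
print(merge_included(included, components, meets_ie=lambda a: True))
```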
xxi Peer Review Process
A formal peer-review process was undertaken that included inviting several scientific experts and representatives from multiple Federal agencies to review and comment on the draft documents. NHLBI selected scientific experts with diverse perspectives to review the reports. Potential reviewers were asked to sign a confidentiality agreement, but NHLBI did not collect conflict of interest (COI) information from the reviewers. DARD staff collected reviewers' comments and forwarded them to the respective panels and work groups for consideration. Each comment received was addressed by a narrative response, a change to the draft document, or both. A compilation of the comments received and the panels' and work groups' responses was submitted to the NHLBI Advisory Council working group; individual reviewers did not receive responses.