Search strategies to identify diagnostic accuracy studies in MEDLINE and EMBASE

  • Review
  • Methodology

Abstract

Background

A systematic and extensive search for as many eligible studies as possible is essential in any systematic review. When searching for diagnostic test accuracy (DTA) studies in bibliographic databases, it is recommended that terms for disease (target condition) are combined with terms for the diagnostic test (index test). Researchers have developed methodological filters to try to increase the precision of these searches. These consist of text words and database indexing terms and would be added to the target condition and index test searches.

Efficiently identifying reports of DTA studies presents challenges because the methods are often not well reported in their titles and abstracts, suitable indexing terms may not be available and relevant indexing terms do not seem to be consistently assigned. A consequence of using search filters to identify records for diagnostic reviews is that relevant studies might be missed, while the number of irrelevant studies that need to be assessed may not be reduced. The current guidance for Cochrane DTA reviews recommends against adding a methodological search filter to the target condition and index test search as the only search approach.

Objectives

To systematically review empirical studies that report the development or evaluation, or both, of methodological search filters designed to retrieve DTA studies in MEDLINE and EMBASE.

Search methods

We searched MEDLINE (1950 to week 1 November 2012); EMBASE (1980 to 2012 Week 48); the Cochrane Methodology Register (Issue 3, 2012); ISI Web of Science (11 January 2013); PsycINFO (13 March 2013); Library and Information Science Abstracts (LISA) (31 May 2010); and Library, Information Science & Technology Abstracts (LISTA) (13 March 2013). We undertook citation searches on Web of Science, checked the reference lists of relevant studies, and searched the Search Filters Resource website of the InterTASC Information Specialists' Sub-Group (ISSG).

Selection criteria

Studies reporting the development or evaluation, or both, of a MEDLINE or EMBASE search filter aimed at retrieving DTA studies were eligible, provided they reported a measure of the filter's performance.

Data collection and analysis

The main outcome was a measure of filter performance, such as sensitivity or precision. We extracted data on the identification of the reference set (including the gold standard and, if used, the non-gold standard records), how the reference set was used and any limitations, the identification and combination of the search terms in the filters, internal and external validity testing, the number of filters evaluated, the date the study was conducted, the date the searches were completed, and the databases and search interfaces used. Where 2 x 2 data were available on filter performance, we used these to calculate sensitivity, specificity, precision and Number Needed to Read (NNR), and 95% confidence intervals (CIs). We compared the performance of a filter as reported by the original development study and any subsequent studies that evaluated the same filter.

Main results

Nineteen studies were included, reporting on 57 MEDLINE filters and 13 EMBASE filters. Thirty MEDLINE and four EMBASE filters were tested in an evaluation study, where the performance of one or more filters was tested against one or more gold standards. The reported outcome measures varied. Some studies reported specificity as well as sensitivity, where a reference set containing non-gold standard records in addition to gold standard records was used. In some cases, the original development study did not report any performance data on the filters: original performance data from the development study were not available for 17 filters that were subsequently tested in evaluation studies. All 19 studies reported the sensitivity of the filters that they developed or evaluated, nine studies reported specificity and 14 studies reported precision.

No filter that had original performance data from its development study, and was subsequently tested in an evaluation study, achieved what we defined a priori as acceptable sensitivity (> 90%) and precision (> 10%). In studies that developed MEDLINE filters that were evaluated in another study (n = 13), sensitivity ranged from 55% to 100% (median 86%) and specificity from 73% to 98% (median 95%). Estimates of performance were lower in the eight studies that evaluated the same 13 MEDLINE filters, with sensitivities ranging from 14% to 100% (median 73%) and specificities from 15% to 96% (median 81%). Precision ranged from 1.1% to 40% (median 9.5%) in studies that developed MEDLINE filters and from 0.2% to 16.7% (median 4%) in studies that evaluated these filters. Similar ranges of specificity and precision were reported amongst the evaluation studies for MEDLINE filters without an original performance measure: sensitivities ranged from 31% to 100% (median 71%), specificities from 13% to 90% (median 55.5%) and precision from 1.0% to 11.0% (median 3.35%).

For the EMBASE filters, the original sensitivities reported in two development studies ranged from 74% to 100% (median 90%) for three filters, and precision ranged from 1.2% to 17.6% (median 3.7%). Evaluation studies of these filters reported sensitivities from 72% to 97% (median 86%) and precision from 1.2% to 9% (median 3.7%). The performance of EMBASE search filters was thus more similar between development and evaluation studies than was the case for MEDLINE filters. None of the EMBASE filters in either type of study had a sensitivity above 90% combined with precision above 10%.

Authors' conclusions

None of the current methodological filters designed to identify reports of primary DTA studies in MEDLINE or EMBASE combine sufficiently high sensitivity, required for systematic reviews, with a reasonable degree of precision. This finding supports the current recommendation in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy that the combination of methodological filter search terms with terms for the index test and target condition should not be used as the only approach when conducting formal searches to inform systematic reviews of DTA.

Plain language summary

Search strategies to identify diagnostic accuracy studies in MEDLINE and EMBASE

A diagnostic test is any kind of medical test performed to help with the diagnosis or detection of a disease. A systematic review of a particular diagnostic test for a disease aims to bring together and assess all the available research evidence. Bibliographic databases are usually searched by combining terms for the disease with terms for the diagnostic test. However, depending on the topic area, the number of articles retrieved by such searches may be very large. Methodological filters consisting of text words and database indexing terms have been developed in the hope of improving the searches by increasing their precision when these filters are added to the search terms for the disease and diagnostic test. On the other hand, using filters to identify records for diagnostic reviews may miss relevant studies while at the same time not making a big difference to the number of studies that have to be assessed for inclusion. This review assessed the performance of 70 filters (reported in 19 studies) for identifying diagnostic studies in the two main bibliographic databases in health, MEDLINE and EMBASE. The results showed that search filters do not perform consistently, and should not be used as the only approach in formal searches to inform systematic reviews of diagnostic studies. None of the filters reached our minimum criteria of a sensitivity greater than 90% and a precision above 10%.


Background

As with Cochrane reviews of interventions, Cochrane diagnostic test accuracy (DTA) reviews should aim to identify and evaluate as much available evidence about a specific topic as possible within the available resources (DeVet 2008). Thus, a systematic and extensive search for eligible studies is an essential step in any review. Recommendations for searching for DTA studies are that electronic bibliographic databases, such as MEDLINE and EMBASE, should be searched by combining search terms for disease indicators (target condition) with terms for the diagnostic test (index test) (DeVet 2008). Depending on the topic area, the number of articles retrieved by such searches may be too large to be processed with the available resources. A number of methodological filters consisting of text words and database specific indexing terms (such as MEDLINE Medical Subject Headings (MeSH)) have been developed in an attempt to increase the precision of searches and reduce the resources required to process results. These search filters are typically added to a search strategy consisting of the target condition and index test(s).
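As a purely illustrative sketch of this structure (a hypothetical topic with hypothetical terms, not a strategy taken from the review or its included studies), a target condition plus index test search in Ovid MEDLINE syntax might look like:

```text
1. exp Venous Thrombosis/ or "deep vein thrombosis".ti,ab.    (target condition)
2. exp Ultrasonography/ or ultrasound*.ti,ab.                 (index test)
3. 1 and 2                                                    (condition AND test)
4. 3 and <methodological filter lines>                        (the optional filter step examined in this review)
```

Line 3 is the recommended baseline search; line 4 shows where a methodological filter would be combined with it.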

Methodological search filters have been developed for retrieving articles relating to many types of clinical question, including those about aetiology, diagnosis, prognosis and therapy. These filters are typically combinations of database indexing terms or text words, or both, that reflect the study design and statistical methods reported by the articles' authors. For example, Haynes and co-workers have developed a series of filters to help searchers retrieve articles according to aetiology, diagnosis, prognosis or therapy (Haynes 1994; Haynes 2005; Haynes 2005a; Wilczynski 2003; Wilczynski 2004). These are available as 'Clinical Queries' limits in PubMed and via the OvidSP interface for MEDLINE and EMBASE (NLM 2005; OvidSP 2013; OvidSP 2013a).

Methodological search filters have proved to be particularly effective in identifying intervention (therapy) studies. Within The Cochrane Collaboration, a highly sensitive search strategy is widely used for identifying reports of randomised trials in MEDLINE (Lefebvre 2011).

For DTA studies, however, the relevant methodology is often not well reported by authors in titles and abstracts. In addition, MEDLINE lacks a suitable publication type indexing term for DTA studies. EMBASE has recently introduced a check tag for DTA studies ('diagnostic test accuracy'), but this is only being applied prospectively. Some relevant indexing terms do exist in both EMBASE and MEDLINE, for example 'sensitivity and specificity'; however, these are inconsistently assigned by indexers to DTA studies (Fielding 2002; Wilczynski 1995; Wilczynski 2005a; Wilczynski 2007). A consequence of adding filters to subject and index term strategies to identify records for DTA reviews is that relevant studies might be missed without, at the same time, significantly reducing the number of studies that have to be assessed for inclusion (Doust 2005; Leeflang 2006; Whiting 2008; Whiting 2011).

To assess the value of adding methodological search filters to search strategies for identifying records for inclusion in DTA reviews, we conducted a methodology review of empirical studies that reported the development or evaluation of such filters in MEDLINE and EMBASE. To date, no comprehensive and systematic review of studies developing or evaluating diagnostic search filters has been published. The findings of this review help to elucidate how well these filters identify studies relevant to diagnostic systematic reviews and support a recommendation for (or against) their use when conducting literature searches.

Objectives

To systematically review empirical studies that report the development or evaluation, or both, of methodological search filters designed to retrieve diagnostic test accuracy (DTA) studies in MEDLINE and EMBASE.

Methods

Criteria for considering studies for this review

Types of studies

Primary studies of any design were included. Studies in which the main objective was the development or evaluation, or both, of a methodological filter for the purpose of searching for DTA studies in MEDLINE and EMBASE were eligible. We defined a development study as one in which a new filter was conceived, tested in a reference set of diagnostic studies, and the performance reported. An evaluation study was one in which a filter from a development study publication was tested in a new reference set and the performance reported. A study could be both a development and an evaluation study if it reported on the development and performance of a newly designed filter and evaluated a filter which had previously been published by a different development study. We also included filters assessed in evaluation studies for which there was no corresponding development study publication. We excluded studies that developed or evaluated filters designed to retrieve clinical prediction studies or prognostic studies.

Types of data

Eligible studies must have reported the performance of search filters using a recognised measure, such as sensitivity or precision.

Types of methods

Assessments of the performance of search strategies for identifying reports of DTA in MEDLINE and EMBASE.

Types of outcome measures

Eligible outcome measures were those that assessed the accuracy of the search.

Primary outcomes

Measures of search performance, including:

  • sensitivity (proportion of relevant reports correctly retrieved by the filter);

  • specificity (proportion of irrelevant reports correctly not retrieved by the filter);

  • accuracy (the proportion of all records in the reference set correctly classified by the filter);

  • precision (the number of relevant reports retrieved divided by the total number of records retrieved by the filter).

We defined a priori the levels of sensitivity (> 90%) and precision (> 10%) from the external validation of evaluation studies as the acceptable threshold for use when searching for DTA studies.

Secondary outcomes
  • Number Needed to Read (NNR) (also called Number Needed to Screen), which is the inverse of the precision (Bachmann 2002).

Search methods for identification of studies

Electronic searches

The following databases were searched to identify relevant studies: MEDLINE (1950 to week 1 November 2012); EMBASE (1980 to 2012 Week 48); the Cochrane Methodology Register (Issue 3, 2012); ISI Web of Science (11 January 2013); PsycINFO (13 March 2013); Library and Information Science Abstracts (LISA) (31 May 2010); and Library, Information Science & Technology Abstracts (LISTA) (13 March 2013). Three information specialists developed and conducted the searches. The search strategies are listed in the appendices (Appendix 1; Appendix 2; Appendix 3; Appendix 4; Appendix 5; Appendix 6; Appendix 7). No language restrictions were applied.

Searching other resources

We also undertook citation searches of the included studies on Web of Science. Furthermore, reference lists of all relevant studies were assessed (Horsley 2011) and the Search Filters Resource website of the InterTASC Information Specialists' Sub-Group (ISSG) was screened (InterTASC 2011). InterTASC is a collaboration of six academic units in the UK who conduct and critique systematic reviews for the National Institute for Health and Care Excellence.

Data collection and analysis

Selection of studies

Two authors independently screened the titles and abstracts of all retrieved records. Inclusion assessment of full papers was conducted by one author and checked by a second. Any disagreements were resolved through discussion or referral to a third author.

Data extraction and management

Data extraction was performed by one author and checked by a second; disagreements were resolved through discussion. The ISSG Search Filter Appraisal Checklist (Glanville 2008) was used to structure the data extraction and assessment of methodological quality. This checklist was developed using consensus methods and tested on several filters. It assesses the scope of the filter (limitations, generalisability and obsolescence), and the methods used to develop the filter, including the generation of the reference set.

Data were extracted on the characteristics of the reference set (inclusion of gold and non-gold standard records, years of publication of the records, journals covered, inclusion criteria, size); how search terms were identified; presence of internal and external validity testing; and any limitations or comparisons between studies. In the context of filter development, the reference set is the same as the reference standard or gold standard in DTA studies. In contrast, the gold standard in the context of filter development is equivalent to diseased individuals in diagnostic accuracy studies (that is the 'relevant' studies) and the non-gold standard is equivalent to non-diseased individuals (that is the non-relevant studies).

Data were also extracted on the date the study was conducted; the date the searches were completed; the database(s) and search interface(s) used; the outcome measures of performance (sensitivity, specificity, precision) and their definitions; and whether the search strategy was developed for specific clinical areas or to identify diagnostic studies over a broad range of topics. We assessed whether the search strategies were described in sufficient detail to be reproducible (that is were the search terms and their combination reported, were the dates of the search reported, and was the interface and database reported?).

Where studies reported data on multiple filters, results were extracted for each filter. For filter development studies that also presented the sensitivity and precision of every individual term tested, we extracted all multiple-term filters but only those single-term filters that the original authors identified as performing best.

Assessment of risk of bias in included studies

Bias occurs if systematic flaws or limitations in the design or conduct of a study distort the results. Applicability refers to the generalisability of results: can the results of the filter development or evaluation study be applied to other settings with different populations, index tests, reference standards or target conditions?

We identified three areas that we considered to have the potential to introduce bias or affect the applicability of the included studies.

1. Absence of DTA search strategy in reference set development: bias may be introduced when either a development or an evaluation study used a systematic review (or reviews) to provide studies for the reference set, and this systematic review used a search strategy containing diagnostic terms to find primary studies. This could introduce bias because the performance of a filter tested in this reference set will naturally be higher when the difficult-to-retrieve studies have been missed by the reference set search.

2. Choice of gold standard: concerns about applicability may be introduced in both development and evaluation studies in the generalisability of the filter to all diagnostic studies. Some filters have been developed or evaluated using a reference set that is composed of topic specific studies (such as studies on the diagnosis of deep vein thrombosis), whereas other reference sets will be generic (studies covering a wide range of diagnostic tests and conditions). Ideally, a filter will perform equally well across different topic areas but if it is only evaluated in one specific topic area its performance in other areas will be unclear.

3. Validation of filters in development studies: the process of validation can be split into two parts; the method of internal validation can have bias issues, while the method of external validation (if done) can have both applicability and bias issues. Internal validity is the ability of the filter to find studies from the reference set from which it was developed. A study could be at risk of bias if the internal validation set contained the references from which the filter terms were derived. External validity is the ability of the filter to find studies in a real-world setting (that is using a reference set composed of topic specific studies). This relates to how generalisable the results are to searching for diagnostic studies for different systematic review topics and most closely relates to how the filters would be used in practice by systematic reviewers. This issue only applies to development studies. A study which has used external validation in a real-world setting will be judged to have low levels of concern about applicability. However, a study that includes external validity testing could still be at risk of bias if the validity testing occurred in a validation set containing the references used to derive the terms.

Data synthesis

We synthesised performance measures of the filters separately for MEDLINE and EMBASE. We tabulated the performance measures reported by development and evaluation studies grouped by individual filters, so that a comparison could be made between the original reported performance of a filter and its performance in subsequent evaluation studies. If sensitivity, specificity or precision together with 95% confidence intervals (CIs) were not reported in the original reports, these were calculated from the 2 x 2 data, where possible.

Each of the performance measures can be calculated as shown by the formulae below (a further description of performance measures is available in Appendix 8).

                                        Reference set
                             Gold standard records    Non-gold standard records
Searches        Detected     a (true positive)        b (false positive)
incorporating
methodological  Not detected c (false negative)       d (true negative)
filter

Sensitivity = a/(a + c)

Precision = a/(a + b)

Specificity = d/(b + d)

Accuracy = (a + d)/(a + b + c + d)

Number needed to read = 1/(a/(a + b))

Reference set = gold standard + non-gold standard records = (a + b + c + d)

Gold standard = relevant DTA studies = a + c

NB. This is different to the gold (reference) standard in DTA studies, which is equivalent to the reference set in filter evaluations. The gold standard in DTA studies is able to correctly identify the true positives as well as the true negatives, unlike the gold standard in a filter evaluation study, which is limited to the true positives.
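The formulae above can be expressed as a short Python sketch. The Wilson score interval is used here as one common choice for the 95% CIs; the review does not state which CI method the authors used, so treat that part as an assumption.

```python
from math import sqrt


def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for the proportion k/n (one common CI
    method; an assumption here, not necessarily the review's method)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half, centre + half)


def filter_performance(a, b, c, d):
    """Performance measures for a search filter from its 2x2 counts:
    a = gold standard records retrieved         (true positives)
    b = non-gold standard records retrieved     (false positives)
    c = gold standard records missed            (false negatives)
    d = non-gold standard records not retrieved (true negatives)
    """
    precision = a / (a + b)
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "precision": precision,
        "accuracy": (a + d) / (a + b + c + d),
        "nnr": 1 / precision,  # number needed to read
        "sensitivity_ci": wilson_ci(a, a + c),
        "specificity_ci": wilson_ci(d, b + d),
        "precision_ci": wilson_ci(a, a + b),
    }
```

For example, a filter retrieving 90 of 100 gold standard records along with 810 of 9000 non-gold standard records has sensitivity 90%, specificity 91%, precision 10% and NNR 10.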

Paired results of either sensitivity and specificity or sensitivity and precision for each filter were displayed in receiver operating characteristic (ROC) plots. The original individual filter performance estimates from the development studies were plotted in the same ROC space as the individual filter performance estimates from the evaluation studies, to allow for visual inspection of disparities and similarities. We did not pool data due to heterogeneity across studies.
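The mapping of paired estimates into ROC space can be sketched as below; the filter names and numbers are hypothetical (the review's actual estimates are in the tables), and only the coordinate transformation is shown, since plotting the points is then straightforward.

```python
def roc_points(results):
    """Convert per-filter (sensitivity %, specificity %) estimates into
    ROC-space coordinates (x = 100 - specificity, y = sensitivity), so a
    filter's development and evaluation estimates can be plotted in the
    same ROC space and compared visually."""
    return {name: (100.0 - spec, sens) for name, (sens, spec) in results.items()}


# Hypothetical example: one filter's original (development) estimate and
# its estimate from a later evaluation study.
points = roc_points({"development": (86.0, 95.0), "evaluation": (73.0, 81.0)})
```

A point nearer the top-left corner (high sensitivity, high specificity) indicates better discrimination; a large gap between a filter's development and evaluation points signals the performance drop described above for many MEDLINE filters.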

Results

Description of studies

The searches retrieved 5628 records; 19 studies, reported in 21 papers, met the inclusion criteria (Figure 1). These studies assessed 57 MEDLINE filters and 13 EMBASE filters.

Figure 1.

Study selection process.

MEDLINE search filters

Description of development studies

Ten studies reported on the development of 40 MEDLINE filters (range 1 to 12 filters per study). Key features of each study are summarised in the Characteristics of included studies table and Table 1. Thirty-one filters were composed of multiple terms and nine filters were single-term strategies. Nine filters consisted of MeSH terms only, six filters had text words only, and 25 filters combined MeSH with text words. Full details of the methods used in each study and the size of the reference set are given in Table 2. A description of each filter and its performance is listed in Table 3.

Table 1. Summary of study designs of MEDLINE filter development studies
  1. *Noel-Storr derived filter terms by running published search filters in MEDLINE combined with a subject search, locating 10 papers that all filters missed and choosing a term from the title/abstract or keywords of each.

    ** Only external validation was carried out (no internal validation) in real-world topics.

    Abbreviations used: NR= not reported; N/A= not applicable

  Author (year)
  Astin 2008 Berg 2005 van der Weijden 1997 Deville 2002 Deville 2000 Haynes 2004 Haynes 1994 Bachmann 2002 Vincent 2003 Noel-Storr 2011
Method of identification of reference set records (one from list below selected for each study) 
Hand-searching for primary studies----
DTA systematic reviews-------
Personal literature database---------
If systematic reviews used in reference set development, did they include DTA search terms in search strategy?
 ---Unclear----X
Reference set also contained non-gold standard records
 XNRX
Description of non-gold standard records if used in reference setNR----NRNRNR--
All studies retrieved by search not classified as gold standard records -------
False positive papers selected by a previously published search strategy, exclusion of some publication types e.g. reviews and meta-analyses. --------
Generic gold standard records i.e. not topic specific
 XXXXXXX
Method of deriving filter terms (a combination of methods could be used) 
Analysis of reference set--✓*
Expert knowledge--------
Adaption of existing filter-------
Checking key publications for terms and language used---------
Internal validation in reference set independent from records used to derive filter terms
 XN/A**N/A**XXXXX
External validation in reference set independent from records used to derive filter terms and internal validation set
 XXXXXX
Table 2. Study characteristics and methods of MEDLINE development studies
  1. Abbreviations used: NR=Not reported; ref set= reference set

Author (year) / Study ID | Identification of reference set | How was reference set used | How were search terms identified for filter | Ref set years | # gold standard records | # non-gold standard records | # journals in ref set

Astin 2008 | Hand search. Articles reporting on imaging as a diagnostic test in imaging journals. 6 high impact journals used to find studies for development set and 6 lower impact journals used to find studies for validation set. Journals indexed in MEDLINE and were also selected to cover general radiology, specific modalities and specific systems. | Two independent sets of records developed. Test set used to derive terms and test strategies. Validation set used to test external validity. | Performed statistical analysis of terms in test set. | Development set 1985 Clin Radiol, 1988 Am J Neuroradiol; validation set 2000 | 333 in development set; 186 in validation set | 2222 in development set; 1070 in validation set | 12 (6 in development set; 6 in validation set)

Berg 2005 | Manual review of a certain set of articles found using a search (via PubMed) combining sensitive terms for nursing literature plus cancer-related fatigue diagnosis terms. Manual review of these articles carried out to find diagnostic studies. | To derive terms and test strategies. Did not validate in a separate set of references. | Existing PubMed Clinical Queries filter with extra terms from filters for CINAHL, medical publications, published recommendations & diagnosis definitions. Inductively collected terms derived from indexing of included citations: MeSH terms and frequently used text words in titles/abstracts. | NR | NR | 238 | NR

van der Weijden 1997 | Personal literature database compiled over 10 years 'by every means of literature searching' of studies reporting on erythrocyte sedimentation rate as a diagnostic test. | To test strategies. | Checking key publications for definitions & terms used. | 1985-1994 | 22 | 10 | NR

Deville 2002 | Studies included in two systematic reviews (relative recall). | To test strategies. | Adapted three published search strategies. | NR | NR | NR | NR

Deville 2000 | Reference set of publications found through handsearch of 9 highest rank family medicine journals available on MEDLINE for years 1992-95. A 'control' set of publications for testing validity of strategies was found by adapting Haynes 1991 most sensitive and most specific searches by adding terms, then run in MEDLINE to retrieve all diagnostic primary studies, then limited to the 9 journals. | To derive terms from reference set; to test strategies in control set; to test external validity the best performing filters were compared against Haynes filters in a systematic review (SR) of meniscal lesions in the knee. | Performed statistical analysis of terms in reference set. Univariate analysis to calculate sensitivity, specificity & diagnostic odds ratio (DOR) of all relevant MeSH terms & text words. Models developed by forward stepwise logistic regression analysis. | 1992-1995 | 75; 33 in meniscal lesions set | 2392; NR for meniscal lesions set | 9

Haynes 2004 | Manual review of 161 journals indexed on MEDLINE for year 2000. Journal titles regularly reviewed for appraisal for Evidence Based Medicine, Evidence Based Nursing, Evidence Based Mental Health and ACP Journal Club. | To test strategies and validate. The reference standard could not be divided into a test set and validation set. | MeSH terms and text words listed using expert knowledge of the field. | 2000 | 147 | 48881 | 161

Haynes 1994 | Manual review of 10 high impact journals for the years 1986 and 1991: American Journal of Medicine, Annals of Internal Medicine, Archives of Internal Medicine, BMJ, Circulation, Diabetes Care, Journal of Internal Medicine, JAMA, Lancet and NEJM. | To test strategies and validate. | MeSH terms and text words listed using expert knowledge of the field. | 1986 and 1991 | 92 in 1986 set; 111 in 1991 set | 426 in 1986 set; 301 in 1991 set | 10

Bachmann 2002 | Hand search of European Journal of Paediatrics, Gastroenterology, American Journal of Obstetrics and Gynecology, and Thorax for years 1989 and 1994. Four different journals searched in 1999: NEJM, JAMA, BMJ and Lancet. | 1989 set search used to derive terms and test strategies; 1994 and 1999 sets used to validate. | Word frequency analysis on titles, abstracts and subject indexes of all references in 1989 set. | 1989, 1994 and 1999 | 83 in 1989 test set; 53 in 1994 validation set; 61 in 1999 validation set | 1646 in 1989 test set; 1744 in 1994 validation set; 7875 in 1999 validation set | 8

Vincent 2003 | SRs retrieved from MEDLINE and EMBASE on OVID reporting on diagnostic tests for DVT. 16 SRs selected and all articles included that were indexed on MEDLINE became the reference set. Only English language articles included. | To test strategies. | Adapted from 5 published strategies: CASP, PubMed, Rochester, Deville, and North Thames. | 1969-2000 | 126 | 0 | NR

Noel-Storr 2011 | SR on the volume of evidence in biomarker studies in those with mild cognitive impairment, conducted by the authors. | To derive terms; to test strategies. | Published search filters applied in MEDLINE combined with a subject search (Southampton A, Van der Weijden, and Southampton E); 10 papers were missed by all filters. One term from the title/abstract or keywords of each of the 10 papers combined in the new filter. | 2000-2011 | 128 in Sept 2010 set; additional 16 found in update search, therefore 144 in August 2011 | 17266 in Sept 2010 set; additional 1654 found in update search, therefore 18920 in August 2011 | NR
Table 3. Performance of diagnostic filters from MEDLINE development studies
  1. NR=Not reported. Abbreviations: Se = sensitivity % (95% CI); Sp = specificity % (95% CI); Acc = accuracy (95% CI); Prec = precision % (95% CI); NNR = number needed to read (95% CI); DOR = diagnostic odds ratio; PLR = positive likelihood ratio.

Astin 2008 (Ovid)
Filter: 1. exp "sensitivity and specificity"/ 2. False positive reactions/ 3. False negative reactions/ 4. du.fs 5. sensitivity.tw 6. (predictive adj4 value$).tw 7. distinguish$.tw 8. differentiat$.tw 9. enhancement.tw 10. identif$.tw 11. detect$.tw 12. diagnos$.tw 13. accura$.tw 14. comparison.tw 15. or/1-14
  • Derivation set: Se 95.8 (93.1 to 97.5) | Sp 52.3 (50.2 to 54.3) | Prec 23.1 (21.0 to 25.4) | NNR 0.04*
  • Validation set: Se 96.8 (93.1 to 98.5) | Sp 43.9 (41.0 to 46.9) | Prec 23.1 (20.3 to 26.2) | NNR 0.04*

Berg 2005 (PubMed)
Filter: some search terms combined using "OR", thus increasing sensitivity and reducing specificity (e.g. nursing assessment [MeSH: noexp] OR questionnaire [Text Word]). Exemplary MeSH terms: Diagnosis, Differential; Psychological Tests; Likelihood Functions; Area Under Curve; Diagnostic Tests, Routine; diagnosis [MeSH subheading]; Diagnostic Techniques and Procedures; Nursing Assessment. Exemplary text words: sensitivity; specificity; predictive value; validity; reliability; likelihood ratio; questionnaire.
  • Se 87 | Sp 73 | PLR 3.2 | NNR 2.3
Filter: some search terms combined using "AND", thus increasing specificity and reducing sensitivity (e.g. nursing assessment [MeSH: noexp] AND questionnaire [Text Word]); exemplary MeSH terms and text words as above.
  • Se 76 | Sp 83 | PLR 6.3 | NNR 1.7

Haynes 2004 (all Ovid; year 2000 reference set)
  • sensitiv:.mp OR diagnos:.mp OR di.fs | Se 98.6 (96.8 to 100) | Sp 74.3 (73.9 to 74.7) | Acc 74.3 (74.0 to 74.7) | Prec 1.1 (1.0 to 1.3) | NNR 0.9*
  • High specificity: specificity.tw | Se 64.6 (56.9 to 72.4) | Sp 98.4 (98.2 to 98.5) | Acc 98.3 (98.1 to 98.4) | Prec 10.6 (8.6 to 12.6) | NNR 0.09*
  • High sensitivity: di.xs | Se 91.8 (87.4 to 96.3) | Sp 68.3 (67.9 to 68.7) | Acc 68.4 (68.0 to 68.8) | Prec 0.9 (0.7 to 1.0) | NNR 1.11*
  • sensitiv:.mp OR predictive value:.mp OR accurac:.tw | Se 92.5 (88.3 to 96.8) | Sp 92.1 (91.8 to 92.3) | Acc 92.1 (91.8 to 92.3) | Prec 3.4 (2.8 to 3.9) | NNR 0.29*
  • Optimising sensitivity and specificity: exp "diagnostic techniques and procedures" | Se 66.7 (59.1 to 74.3) | Sp 74.6 (74.2 to 75.0) | Acc 74.5 (74.2 to 74.9) | Prec 0.8 (0.6 to 0.9) | NNR 1.25*
  • Sensitive:.mp. OR diagnos:.mp. OR accuracy.tw. | Se 98.0 (95.7 to 100.0) | Sp 82.7 (82.4 to 83.1) | Acc 82.8 (82.5 to 83.1) | Prec 1.7 (1.4 to 2.0) | NNR 0.59*
  • Sensitive:.mp. OR diagnos:.mp. OR test:.tw. | Se 98.0 (95.7 to 100.0) | Sp 75.1 (74.8 to 75.5) | Acc 75.2 (74.8 to 75.6) | Prec 1.2 (1.0 to 1.4) | NNR 0.83*
  • Specificity.tw. OR predictive value:.tw. | Se 72.8 (65.6 to 80.0) | Sp 97.9 (97.8 to 98.1) | Acc 97.9 (97.7 to 98.0) | Prec 9.6 (7.9 to 11.3) | NNR 0.10*
  • Accuracy:.tw. OR predictive value:.tw. | Se 52.4 (44.3 to 60.5) | Sp 97.9 (97.8 to 98.1) | Acc 97.8 (97.7 to 97.9) | Prec 7.1 (5.6 to 8.6) | NNR 0.14*
  • Sensitive:.mp. OR diagnostic.mp. OR predictive value:.tw. | Se 92.5 (88.3 to 96.8) | Sp 91.8 (91.6 to 92.1) | Acc 91.8 (91.6 to 92.1) | Prec 3.3 (2.8 to 3.8) | NNR 0.30*
  • Exp sensitivity and specificity OR predictive value:.tw. | Se 79.6 | Sp 94.9 | Acc 94.8 | Prec 4.5 | NNR 0.22*

Haynes 1994 (interface NR)
  • Best sensitivity: diagnosis (subheading pre-explosion) OR specificity (tw) | Se 86 | Sp 73 | Acc 73 | Prec 7 | NNR 0.14*
  • Best accuracy: Exp sensitivity and specificity OR diagnosis (subheading) OR diagnostic use (subheading) OR specificity (tw) OR (predictive (tw) AND value (tw)) | Se 86 | Sp 84 | Acc 84 | Prec 13 | NNR 0.08*
  • Best specificity: specificity (tw) OR (predictive (tw) AND value (tw)) OR (false (tw) AND positive (tw)) | Se 49 | Sp 98 | Prec 36 | NNR 0.03*
  • Best specificity: Exp sensitivity and specificity OR (predictive (tw) AND value (tw)) | Se 55 | Sp 98 | Prec 40 | NNR 0.03*
  • Diagnosis (subheading pre-explosion) OR specificity (tw) | Se 86 | Sp 73 | Prec 7 | NNR 0.14*
  • Best sensitivity: Exp sensitivity and specificity OR diagnosis (subheading pre-explosion) OR diagnostic use (subheading) OR sensitivity (tw) OR specificity (tw) | Se 92 | Sp 73 | Prec 9 | NNR 0.11*; tested in the Haynes (2004) year 2000 reference set: Se 96.6 | Sp 65 | Prec 0.008 | Acc 65.7 | NNR 0.02*
  • Diagnostic use (sh) | 1986 set: Se 16 | Sp 96 | Prec 10 | NNR 0.10*; 1991 set: Se 26 | Sp 96 | Prec 18 | NNR 0.06*
  • Diagnosis (sh) | 1986 set: Se 62 | Sp 89 | Prec 9 | NNR 0.11*; 1991 set: Se 59 | Sp 88 | Prec 13 | NNR 0.08*
  • Diagnosis (px) | 1986 set: Se 79 | Sp 74 | Prec 60 | NNR 0.02*; 1991 set: Se 80 | Sp 77 | Prec 90 | NNR 0.01*
  • Exp Sensitivity and Specificity | 1991 set: Se 50 | Sp 98 | Prec 3 | NNR 0.33*
  • Specificity (tw) | 1991 set: Se 54 | Sp 96
  • Sensitivity (tw) | 1991 set: Se 57 | Sp 97; 1986 set: Se 43 | Sp 98 | Prec 3 | NNR 0.33*

van der Weijden 1997 (OVID)
  • MeSH short strategy (terms OR'd together): explode DIAGNOSIS/diagnosis; DIAGNOSIS-DIFFERENTIAL/all subheadings; explode SENSITIVITY-AND-SPECIFICITY; REFERENCE-VALUES/all subheadings; FALSE-NEGATIVE-REACTIONS/all subheadings; FALSE-POSITIVE-REACTIONS/all subheadings; explode MASS-SCREENING/all subheadings | Se 31 | Prec 34 | NNR 0.03*
  • MeSH extended strategy (terms OR'd together): explode DIAGNOSIS/all subheadings; explode SENSITIVITY-AND-SPECIFICITY; REFERENCE-VALUES/all subheadings; FALSE-NEGATIVE-REACTIONS/all subheadings; FALSE-POSITIVE-REACTIONS/all subheadings; explode MASS-SCREENING/all subheadings | Se 69 | Prec 11 | NNR 0.09*
  • MeSH extended and free text strategy: the extended strategy OR (diagnos* OR sensitivity OR specificity OR predictive value* OR reference value* OR ROC* OR likelihood ratio* OR monitoring) | Se 91 | Prec 10 | NNR 0.1*

Deville 2002 (interface NR)
  • Sensitivity and specificity [MeSH; exploded] OR mass screening [MeSH; exploded] OR reference values [MeSH] OR false positive reactions [MeSH] OR false negative reactions [MeSH] OR specificit$.tw OR screening.tw OR false positive$.tw OR false negative$.tw | Knee lesions SR: Se 70; Urine dipstick SR: Se 92

Bachmann 2002 (Datastar)
  • "SENSITIVITY AND SPECIFICITY"# OR predict* OR diagnos* OR sensitiv* | 1989 test set: Se 92.8 (84.9 to 97.3) | Prec 15.6 | NNR 6.4 (5.2 to 8.0); 1994 validation set: Se 98.1 | Prec 10.9 | NNR 9.2; 1999 validation set: Se 91.8 | Prec 4.7 | NNR 21.3
  • "SENSITIVITY AND SPECIFICITY"# OR predict* OR diagnos* OR accura* | 1989 test set: Se 95.2 (88.1 to 98.7) | Prec 16.9 | NNR 5.9 (4.8 to 7.3); 1994 validation set: Se 98.1 (89.9 to 99.9) | Prec 12 (9.1 to 1.4) | NNR 8.3 (6.7 to 11.3); 1999 validation set: Se 95.1 | Prec 5 | NNR 20.0

Vincent 2003 (Ovid)
  • Strategy A: 1. exp 'sensitivity and specificity'/; 2. (sensitivity or specificity or accuracy).tw.; 3. ((predictive adj3 value$) or (roc adj curve$)).tw.; 4. ((false adj positiv$) or (false adj negativ$)).tw.; 5. ((observer adj variation$) or (likelihood adj3 ratio$)).tw.; 6. likelihood function/; 7. exp mass screening/; 8. diagnosis, differential/ or exp Diagnostic errors/; 9. di.xs or du.fs; 10. or/1-9 | Se 100 | Prec 3* | NNR 0.33*
  • Strategy B: 1. exp 'sensitivity and specificity'/; 2. (sensitivity or specificity or accuracy).tw.; 3. (predictive adj3 value$); 4. exp Diagnostic errors/; 5. ((false adj positiv$) or (false adj negativ$)).tw; 6. (observer adj variation$).tw; 7. (roc adj curve$).tw; 8. (likelihood adj3 ratio$).tw.; 9. likelihood function/; 10. exp *venous thrombosis/di, ra, ri, us; 11. exp *thrombophlebitis/di, ra, ri, us; 12. or/1-11 | Se 98.4 | Prec 5* | NNR 0.2*
  • Strategy C: 1. exp 'sensitivity and specificity'/; 2. (sensitivity or specificity or accuracy).tw.; 3. ((predictive adj3 value$) or (roc adj curve$)).tw.; 4. ((false adj positiv$) or (false adj negativ$)).tw.; 5. (observer adj variation$); 6. likelihood function/; 7. exp Diagnostic errors/; 8. (likelihood adj3 ratio$).tw.; 9. or/1-8 | Se 79.4 | Prec 10* | NNR 0.1*

Deville 2000 (interface NR)
  • Strategy 4: SENSITIVITY AND SPECIFICITY (exp) OR specificity (tw) OR false negative (tw) OR accuracy (tw) OR screening (tw) | Se 89.3 (82.3 to 96.3) | Sp 91.9 (90.8 to 93) | DOR 95; meniscal lesion set: Se 61 (42.1 to 77.1) | Prec 4.7 | NNR 0.22*
  • Strategy 3: SENSITIVITY AND SPECIFICITY (exp) OR specificity (tw) OR false negative (tw) OR accuracy (tw) | Se 80.0 (71.0 to 89.1) | Sp 97.3 (96.6 to 97.9) | Prec 48 (40 to 56) | DOR 149
  • Strategy 2: SENSITIVITY AND SPECIFICITY (exp) OR specificity (tw) OR false negative (tw) | Se 73.3 (63.3 to 83.3) | Sp 98.4 (97.9 to 98.9) | DOR 170
  • Strategy 1: SENSITIVITY AND SPECIFICITY (exp) OR specificity (tw) | Se 70.7 (60.4 to 81.0) | Sp 98.5 (98.0 to 98.9) | DOR 158

Noel-Storr 2011 (Ovid)
Filter: 1. Disease progression/ 2. di.fs. 3. longitudinal*.ab. 4. Follow-up studies/ 5. conversion.ab. 6. transition.ab. 7. converters.ab. 8. progressive.ab. 9. "increased risk".ab. 10. "follow-up".ab.
  • 2000 to Sept 2010 set: Se 97 (92 to 99) | Sp 38 (37 to 39) | Prec 1.1 (0.95 to 1.4)
  • 2000 to Aug 2011 set: Se 98 (94 to 100) | Sp 38 (37 to 39) | Prec 1.2 (1.0 to 1.4)
Method of identification of reference set records

Different methods were used to compile the reference sets. Six studies handsearched journals to obtain a database of ‘gold standard’ references reporting relevant DTA studies (Astin 2008; Bachmann 2002; Berg 2005; Deville 2000; Haynes 1994; Haynes 2004).

Three studies used a relative recall reference standard, that is, the reference set was based on studies included in systematic reviews. Deville 2002 used references from two published systematic reviews (on diagnosing knee lesions and on the accuracy of urine dipstick testing) that had formed part of the first author's thesis. Noel-Storr 2011 used the references from a systematic review on the volume of evidence in biomarker studies in people with mild cognitive impairment. The third study used a validated filter to locate systematic reviews indexed in MEDLINE and EMBASE reporting on diagnostic tests for deep vein thrombosis, and used the studies included in these reviews as the reference set (Vincent 2003). Finally, one study (van der Weijden 1997) developed a reference set based on a personal literature database on erythrocyte sedimentation rate as a diagnostic test, compiled over 10 years ‘by every means of literature searching’.

Two of the 10 studies described above used, as the non-gold standard records in their reference set, all of the articles that were retrieved by the search for gold standard records but were subsequently rejected from the gold standard (Berg 2005; Noel-Storr 2011). A third study used the false positive articles retrieved by a previously published diagnostic search strategy as the non-gold standard records, further restricting them by excluding reviews, meta-analyses, comments, editorials and animal studies (Deville 2000). The remaining studies that included non-gold standard records in their reference sets did not report how these records were identified.

Composition of reference set

Seven studies included both gold standard and non-gold standard references in their reference sets (Astin 2008; Bachmann 2002; Berg 2005; Deville 2000; Haynes 1994; Haynes 2004; Noel-Storr 2011) and two studies used only gold standard studies (van der Weijden 1997; Vincent 2003). One study did not give any details on the composition of the reference set (Deville 2002). For studies whose reference set comprised both included DTA studies (gold standard references) and studies that did not meet the criteria of a DTA study (non-gold standard references), sensitivity, specificity, precision and NNR could be calculated, provided 2 x 2 data were available. However, specificity and precision could not be calculated from a reference set composed only of included DTA references: the percentage of correctly excluded studies cannot be determined when data for only half of the 2 x 2 table are available.
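The relationship between the 2 x 2 reference-set counts and the performance measures reported throughout this review can be illustrated with a short calculation. This is a sketch using illustrative counts, not data from any included study:

```python
# Performance of a search filter against a reference set, derived from a
# 2 x 2 table: filter retrieved (yes/no) x gold standard DTA record (yes/no).
# If the reference set contains only gold standard records (so fp and tn are
# unknown), specificity, precision and NNR cannot be computed.
def filter_performance(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)   # proportion of gold standard records retrieved
    specificity = tn / (tn + fp)   # proportion of non-gold standard records excluded
    precision = tp / (tp + fp)     # proportion of retrieved records that are relevant
    nnr = 1 / precision            # number needed to read: records screened per relevant record
    return sensitivity, specificity, precision, nnr

# Illustrative reference set: 100 gold standard and 4900 non-gold standard records,
# of which the filter retrieves 90 and 980 respectively.
se, sp, prec, nnr = filter_performance(tp=90, fp=980, fn=10, tn=3920)
print(f"sensitivity {se:.1%}, specificity {sp:.1%}, precision {prec:.1%}, NNR {nnr:.1f}")
```

Note how a filter with high sensitivity and moderate specificity can still have low precision (and hence a high NNR) when gold standard records make up only a small fraction of the reference set, which is the situation in most of the included studies.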

Of the six studies that used handsearching to develop the reference set, two concentrated on specific topic areas: Astin 2008 included records on imaging as a diagnostic test, and Berg 2005 included articles from the nursing literature on cancer-related fatigue diagnosis. The remaining handsearched reference sets were not topic specific. The studies that used published systematic reviews to compile the reference set, and the study that used a personal literature database, were all topic specific.

Where reported, the mean number of gold standard studies in the reference set was 128 (range 33 to 333) from a mean of 35 journals (range 9 to 161). Of the studies that used reference sets which included non-gold standard as well as gold standard records, the mean number of overall references included was 8582 (range 238 to 48,881).

Method of identification of search terms

Three studies used the reference set to derive search terms by performing statistical analysis on terms found in titles, abstracts and subject headings (Astin 2008; Bachmann 2002; Deville 2000). Three studies adapted existing search strategies (Berg 2005; Deville 2002; Vincent 2003), one of which expanded the existing filters by adding frequently occurring MeSH terms and text words found in titles and abstracts of the reference set (Berg 2005). Vincent 2003 also combined the use of existing filters with the results of reference set analysis. Of the remaining four studies, one used expert knowledge of the field to generate a list of terms (Haynes 1994), one used expert knowledge and analysis of the reference set (Haynes 2004), one checked key publications for the definitions and terms used (van der Weijden 1997), and one analysed terms in 10 studies missed by the three most sensitive published filters (Noel-Storr 2011).
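The univariate term-screening step described above can be sketched as follows: each candidate term is cross-tabulated against gold standard status, and its sensitivity, specificity and diagnostic odds ratio (DOR) are computed. The records and terms below are illustrative only, not drawn from any included study:

```python
# Univariate screening of candidate filter terms against a labelled reference
# set: for each term, count how often it appears in gold standard vs non-gold
# standard records, then compute sensitivity, specificity and the DOR.
def term_statistics(term, records):
    tp = sum(1 for text, is_gold in records if is_gold and term in text)
    fn = sum(1 for text, is_gold in records if is_gold and term not in text)
    fp = sum(1 for text, is_gold in records if not is_gold and term in text)
    tn = sum(1 for text, is_gold in records if not is_gold and term not in text)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Add 0.5 to each cell to avoid division by zero (a common continuity correction).
    dor = ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))
    return sensitivity, specificity, dor

# (title/abstract text, is this a gold standard DTA study?)
reference_set = [
    ("specificity of ultrasound for dvt", True),
    ("diagnostic accuracy of d-dimer", True),
    ("treatment of dvt with heparin", False),
    ("screening accuracy in primary care", False),
]
for term in ["specificity", "accuracy", "dvt"]:
    se, sp, dor = term_statistics(term, reference_set)
    print(term, round(se, 2), round(sp, 2), round(dor, 2))
```

Terms with the most favourable trade-off would then be carried forward, for example into a stepwise logistic regression model as in Deville 2000 and Wilczynski 2005.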

Description of studies that evaluated published MEDLINE filters

Ten evaluation studies that assessed 30 MEDLINE filters were included (Table 4; Table 5). Of these, three were development studies that also evaluated published filters and were therefore classed as both development and evaluation studies (Deville 2000; Noel-Storr 2011; Vincent 2003). Most filters (n = 23) were evaluated by at least two studies. The median number of filters evaluated in each study was 6, ranging from 1 (Deville 2000; Kastner 2009) to 22 (Noel-Storr 2011; Ritchie 2007; Whiting 2010).

Table 4. Summary of study design characteristics of MEDLINE filter evaluation studies
  Evaluation studies: Kastner 2009; Ritchie 2007; Leeflang 2006; Kassai 2006; Doust 2005; Whiting 2010; Vincent 2003; Deville 2000; Mitchell 2005; Noel-Storr 2011
Method of identification of reference set records (one from the list below for each study):
  • Handsearching for primary studies: Deville 2000; Mitchell 2005
  • Internet search for DTA systematic reviews: Kastner 2009; Leeflang 2006; Vincent 2003
  • Systematic reviews conducted by authors: Doust 2005; Noel-Storr 2011; Ritchie 2007; Whiting 2010
  • Primary studies identified through Internet search: Kassai 2006
If systematic reviews were used in reference set development, did they include DTA search terms in the search strategy? Yes: Doust 2005; Kastner 2009; Vincent 2003. No: Noel-Storr 2011; Ritchie 2007; Whiting 2010. Unclear: Leeflang 2006. Not applicable: Deville 2000; Kassai 2006; Mitchell 2005.
Reference set also contained non-gold standard records: Deville 2000; Kassai 2006; Mitchell 2005; Noel-Storr 2011; Ritchie 2007; Whiting 2010
Description of non-gold standard records if contained in reference set:
  • All studies retrieved by the search not classified as gold standard records: Kassai 2006; Ritchie 2007
  • False positive papers selected by a previously published search strategy, with exclusion of some publication types (e.g. reviews and meta-analyses): Deville 2000
  • NR: Mitchell 2005; Noel-Storr 2011; Whiting 2010
Generic gold standard records, i.e. not topic specific: Kastner 2009; Leeflang 2006; Whiting 2010
Table 5. Study characteristics and methods of included MEDLINE filter evaluation studies
  1. Abbreviations used: TR= Tympanometry review; NPR= Natriuretic peptides review; SR= systematic review; NA= not applicable; NR= not reported; ref set= reference set; Se= sensitivity; Sp= specificity.

    * Full strategy obtained from authors

    ** Number of gold-standard records obtained from authors

Study | Identification of reference set | Reference set selection criteria | Ref set years | # gold standard records | # non-gold standard records | # journals in ref set if handsearched | Definition of DTA study if handsearched gold standard identified | Description of filter allows reproducibility | Definitions of Se & Sp | Number of filters evaluated
Kastner 2009 | Included studies from 12 published SRs on the ACP Journal Club website and indexed on MEDLINE or EMBASE. | Eligibility criteria for including an SR were: published in 2006; incorporated a MEDLINE and EMBASE search as a data source; and available and downloadable in electronic format. In addition, the review could not have used the Clinical Queries filter, but other search filters were permissible. | 2006 (date SRs published) | 441 | 0 | Not given; the 12 SRs were from 9 journals. | The study compared at least two diagnostic test procedures with one another. | yes | no | 1
Ritchie 2007 | SR of DTA studies for UTI in young children carried out by the authors | Included studies that could be identified in Ovid MEDLINE | 1966-2003 | 160 | 27,804 | NA | NA | no | no | 22
Leeflang 2006 | Included studies from 27 published SRs. Reviews selected after an electronic search for SRs of DTA studies published between January 1999 and April 2002 in MEDLINE, EMBASE, DARE and Medion | Criteria for inclusion of SRs: assessment of DTA; inclusion of >10 original studies, with inclusion not based on design characteristics; and sufficient data to reproduce the contingency table. Reviews that reported the application of a diagnostic search filter were excluded. | 1999-2002 | 820 | 0 | NA | NA | yes | no | 12
Kassai 2006 | Used the PubMed interface to search MEDLINE, Science Citation Index, EMBASE and Pascal Biomed for relevant articles, using search strategies with terms (MeSH and free text for MEDLINE) related to venous thrombosis, venography and ultrasonography in all databases. | Any relevant article retrieved through topic search on MEDLINE, Science Citation Index, EMBASE and Pascal Biomed | 1966-2002 | 237 | 1236 | NR | NR | | yes | 3
Doust 2005 | Included studies from two SRs: tympanometry (TR) for the diagnosis of otitis media with effusion in children, and natriuretic peptides (NPR). The initial list of citations was generated from MEDLINE using the search strategy of the sensitivity option of the Clinical Queries filter for DTA in PubMed. Reference lists of potentially relevant papers and review articles were checked for further possible papers. | Included in two SRs conducted by the authors | TR 1966-2001; NPR 1994-2002 | TR n=33; NPR n=20 | TR n=0; NPR n=0 | TR n=22; NPR n=16 | NR | yes | yes | 5
Whiting 2010 | Test accuracy studies indexed on MEDLINE from 7 SRs carried out by the authors; relative recall reference set. | All included studies indexed on MEDLINE from 7 SRs of DTA; SRs that conducted extensive searches not limited by methodological filters or search terms relating to measures of test accuracy | NR | 506 | 25,880** | NR | Studies in which cross-tabulation data comparing the results of the index test with the reference standard were available. | yes | yes | 22
Vincent 2003 | SRs reporting on diagnostic tests for DVT, retrieved from MEDLINE and EMBASE on Ovid using a validated SR filter. 16 SRs were selected and all included articles indexed on MEDLINE became the reference set. Only English language articles included. | Studies included in 16 SRs that compared one of the specified diagnostic tests for DVT against a venogram. | 1969-2000 | 126 | 0 | NR | Compared specified diagnostic test to reference standard | yes | yes | 5
Deville 2000 | Adapted the Haynes 1991 most sensitive and most specific filters by adding terms, and ran the search in MEDLINE to retrieve all primary DTA studies. A second set of references on the diagnosis of meniscal lesions of the knee was selected for external validity testing; no further details on how this set was selected are provided. | Primary DTA studies indexed on MEDLINE; studies included on physical tests for the diagnosis of meniscal lesions of the knee. | 1992-1995 | 75; 33 in meniscal lesions set | 2392; NR in meniscal lesions set | | Diagnostic test was compared with a reference standard | yes | yes | 1
Mitchell 2005 | Handsearch of the 3 top ranking renal journals for the years 1990-1991 and 2002-2003. | Primary DTA studies that could be identified in MEDLINE on the diagnosis of kidney disease | 1991-1992 and 2002-2003 | 99 | 4409 | 3 | A test or tests being compared to a reference standard in a human population | yes* | NR | 6
Noel-Storr 2011 | SR on the volume of evidence in biomarker studies in those with mild cognitive impairment, conducted by the authors. | Primary DTA longitudinal studies indexed on MEDLINE with at least one follow-up period; at least one of the biomarkers of interest used as the test of interest; included subjects with objective cognitive impairment at baseline and no dementia. | 2000-Sept 2010; 2000-Aug 2011 | 128 Sept 2010; 144 Aug 2011 | 17,266 Sept 2010; 18,920 Aug 2011 | NR | NA | yes* | no | 22
Method of identification of reference set records

Seven studies used a relative recall reference set consisting of studies included in DTA systematic reviews (Doust 2005; Kastner 2009; Leeflang 2006; Noel-Storr 2011; Ritchie 2007; Vincent 2003; Whiting 2010). Of these, three studies located systematic reviews through electronic searches (Kastner 2009; Leeflang 2006; Vincent 2003) and four studies used a convenience sample of systematic reviews that either the authors or colleagues had undertaken themselves (Doust 2005; Noel-Storr 2011; Ritchie 2007; Whiting 2010). One study used references located through handsearching of the nine highest ranking family medicine journals available on MEDLINE (Deville 2000); one study handsearched three high ranking renal journals (as identified by the authors) for primary studies on the diagnosis of renal disease (Mitchell 2005); and one study used an electronic search for primary DTA studies related to venous thrombosis, venography and ultrasonography (Kassai 2006).

Three of the studies that used a relative recall reference set included reviews which used a methodological filter to find diagnostic studies in addition to terms for test and condition (Doust 2005; Kastner 2009; Vincent 2003). One of these studies supplemented the search, which had first used the Clinical Queries diagnostic filter in PubMed, by searching the reference lists of included studies (Doust 2005).

Two studies used, as the non-gold standard records in their reference set, all of the articles that were retrieved by the search for gold standard records but were subsequently rejected from the gold standard (Kassai 2006; Ritchie 2007). A third study used the false positive articles retrieved by a previously published diagnostic search strategy as the non-gold standard records, further restricting them by excluding reviews, meta-analyses, comments, editorials and animal studies (Deville 2000). The remaining studies that included non-gold standard records in their reference sets did not report how these records were identified.

Composition of reference set

Three of the seven studies that derived their reference set from systematic reviews included both gold standard and non-gold standard studies (Noel-Storr 2011; Ritchie 2007; Whiting 2010); the remaining four used a reference set comprising only gold standard studies (Doust 2005; Kastner 2009; Leeflang 2006; Vincent 2003). The three studies that used an electronic search or a handsearch to find primary studies also included non-gold standard studies in their reference sets (Deville 2000; Kassai 2006; Mitchell 2005).

The number of gold standard studies included in the reference sets ranged from 53, from two systematic reviews (Doust 2005), to 820, from 27 reviews (Leeflang 2006). In the studies that also included non-gold standard studies, the number of irrelevant studies ranged from 1236 to 27,804.

Description of evaluated filters

All but one of the evaluated search strategies combined MeSH terms and text words; the exception used the single-term strategy “specificity.tw” (Whiting 2010). Two of the evaluated filters were based on the same original strategy by Haynes 1994: Falck-Ytter 2004 presented an alternative interpretation of the original filter in PubMed format.

EMBASE search filters

Description of development studies

Two studies reported the development of 12 search filters for finding DTA studies indexed in EMBASE (Table 6; Table 7) (Bachmann 2003; Wilczynski 2005). Eleven of the filters were composed of multiple terms. Table 6 gives a summary of the study design characteristics of the included studies.

Table 6. Summary of study design characteristics of EMBASE filter development studies
  1. Abbreviations used: NR= not reported
Method of identification of reference set records (one from the list below for each study):
  • Handsearching for primary studies: Bachmann 2003; Wilczynski 2005
  • DTA systematic reviews: neither study
  • Personal literature database: neither study
Reference set also contained non-gold standard records: Bachmann 2003; Wilczynski 2005
Description of non-gold standard records if contained in reference set: Bachmann 2003 (all records retrieved by the search that were not classified as gold standard studies); Wilczynski 2005 NR
Generic gold standard records, i.e. not topic specific: Bachmann 2003; Wilczynski 2005
Method of deriving filter terms (a combination of methods could be used):
  • Analysis of reference set: Bachmann 2003; Wilczynski 2005
  • Expert knowledge: Wilczynski 2005
  • Adaptation of existing filter: neither study
  • Checking key publications for terms and language used: neither study
Internal validation in reference set independent from records used to derive filter terms: neither study
External validation in reference set independent from records used to derive filter terms and internal validation set: neither study
Table 7. Study characteristics and methods of EMBASE filter development studies
  1. Abbreviations used: ref set= reference set

Study | Identification of reference set | How was reference set used | How were search terms identified for filter | Ref set years | # gold standard records | # non-gold standard records | # journals ref set
Bachmann 2003 | Handsearching of all issues of NEJM, Lancet, JAMA and BMJ published in 1999. | To derive terms; to test strategies | Word frequency analysis on title, abstract and subject indexing of handsearched records | 1999 | 61 | 6082 | 4
Wilczynski 2005 | Handsearching of each issue of 55 journals in 2000. | To test strategies | Initial list of MeSH terms and text words compiled using knowledge of the field and input from librarians and clinicians; stepwise logistic regression used to improve performance of filters. | 2000 | 97 | 27,672 | 55
Method of identification of reference set records

In both studies the reference set was generated by handsearching journals, and included both gold standard and non-gold standard records. One study reported that the non-gold standard records were identified as all articles retrieved by the search that were not classified as gold-standard records (Bachmann 2003). The other study was not clear about how non-gold standard records were selected (Wilczynski 2005).

Composition of reference set

Both studies included both gold standard and non-gold standard records in the reference set.

Method of identification of search terms

One study used the reference set to derive filter terms using word frequency analysis (Bachmann 2003). The other study initially identified terms for the filter by consulting experts and then entered the terms into a logistic regression model to find the most frequently occurring terms (Wilczynski 2005).

Description of studies that evaluated published EMBASE filters

Three studies evaluated four filters designed to find DTA studies in EMBASE (Table 8; Table 9) (Kastner 2009; Mitchell 2005; Wilczynski 2005). One filter was evaluated by two studies, and three filters were evaluated by only one study. A summary of the study design characteristics of included studies is in Table 8.

Table 8. Summary of study design characteristics of EMBASE filter evaluation studies
  1. Abbreviations used: NR= not reported
Method of identification of reference set records (one from the list below for each study):
  • Handsearching for primary studies: Mitchell 2005; Wilczynski 2005
  • Internet search for DTA systematic reviews: Kastner 2009
  • Systematic reviews conducted by authors: none
  • Primary studies identified through Internet search: none
If systematic reviews were used in reference set development, did they include DTA search terms in the search strategy? Kastner 2009: yes (just over half of the included reviews); Mitchell 2005 and Wilczynski 2005: not applicable.
Reference set also contained non-gold standard records: Mitchell 2005; Wilczynski 2005
Description of non-gold standard records if contained in reference set: NR (Mitchell 2005; Wilczynski 2005)
Generic gold standard records, i.e. not topic specific: Kastner 2009; Wilczynski 2005
Table 9. Study characteristics and methods of studies evaluating EMBASE filters
  1. Abbreviations used: ref set= reference set; Se= sensitivity; Sp= specificity

Study | Identification of gold standard | Reference set selection criteria | Ref set years | # gold standard studies in ref set | # non-gold standard studies in ref set | # journals in ref set if handsearched | Definition of DTA study | Description of filter allows reproducibility | Definitions of Se & Sp | Number of filters evaluated
Kastner 2009 | Included studies from 12 published SRs on the ACP Journal Club website and indexed in MEDLINE or EMBASE. | Eligibility criteria for including an SR were: published in 2006; incorporated a MEDLINE and EMBASE search as a data source; and available and downloadable in electronic format. In addition, the review could not have used the Clinical Queries filter. | 2006 (date SRs published) | 441 | 0 | NA | The study compared at least two diagnostic test procedures with one another. | yes | no | 1
Wilczynski 2005 | Handsearch of each issue of 55 journals in 2000. | Studies indexed in EMBASE, found through handsearching, which met the methodological criteria for a diagnostic study | 2000 | 97 | 27,575 | 55 | Inclusion of a spectrum of participants; reference standard; participants received both the new test and the reference standard; interpretation of the index test without knowledge of the reference standard and vice versa; analysis consistent with study design. | yes | yes | 2
Mitchell 2005 | Handsearch of the 3 top ranking renal journals for the years 1990-1991 and 2002-2003 | Primary DTA studies that could be identified in EMBASE reporting on the accuracy of tests for kidney disease diagnosis | 1991-1992 and 2002-2003 | 96 | 3984 | 3 | A test or tests being compared to a reference standard in a human population | yes* | no | 4
Method of identification of reference set records

One study used the included studies from 12 published systematic reviews to construct the reference set (Kastner 2009). The other two EMBASE filter studies identified primary DTA studies through handsearching (Mitchell 2005; Wilczynski 2005). Neither of the two studies that included non-gold standard records described how those articles were identified.

Composition of reference set

Two studies included both gold standard and non-gold standard records in the reference set (Mitchell 2005; Wilczynski 2005). The number of gold standard records ranged from 96 to 441. The number of non-gold standard records ranged from 3984 to 27,575.

Description of evaluated filters

One evaluated filter consisted of MeSH terms and text words; the other three consisted of text words only. Every filter combined multiple terms.

Risk of bias in included studies

The methodological quality of the identified studies was not formally assessed using a validated tool, but we identified three areas that could affect the methodological quality of the studies in terms of the risk of bias and applicability as described above (see Assessment of risk of bias in included studies).

1. Use of systematic reviews to compile reference set search strategy

MEDLINE development and evaluation studies

Of the eight studies which used systematic reviews to compile their reference sets, three used reviews which did not include diagnostic terms in their search strategies and were at low risk of bias; one development and evaluation study and two evaluation studies specified that they only included systematic reviews which had not used a diagnostic search filter (Noel-Storr 2011; Ritchie 2007; Whiting 2010). The systematic reviews used by Whiting and Noel-Storr were conducted by the authors, therefore the reviewers could be sure that no such filter was applied. Ritchie also used a systematic review carried out by Whiting, which did not use a diagnostic filter.

Three studies used reviews with diagnostic terms in their search strategies and were therefore at high risk of bias. One was a development and evaluation study which contained the references from 16 systematic reviews and, of these, at least one used a diagnostic filter (Vincent 2003). Some of the other systematic reviews did not report whether they used a diagnostic filter or not, while the remaining reviews were not available. Two evaluation studies also used reviews with diagnostic filter terms. Kastner's reference set contained the studies from 12 systematic reviews and, of these, just over half used diagnostic terms in their search strategies (Kastner 2009). Doust 2005 conducted two systematic reviews which were used in reference set development, and the search strategy for these applied the PubMed Clinical Queries filter for diagnostic studies.

For one development and one evaluation study, it was not clear whether the systematic reviews used a diagnostic filter in their searches (Deville 2002; Leeflang 2006). The risk of bias for these studies was therefore unclear. The original source of the review used by Deville (the author's thesis) was not available, but a meta-analysis published by the same author on the same topic did describe the use of diagnostic terms in the search strategy. Leeflang stated in their discussion that, although they attempted to exclude any review which used a diagnostic filter in its literature search, seven of the 27 reviews whose studies were included did not describe their search in detail.

EMBASE development and evaluation studies

Only one evaluation study, reporting an EMBASE filter, used the studies from systematic reviews to compile the reference set, and just over half of the 12 systematic reviews used diagnostic terms in their search strategies (Kastner 2009). This study was, therefore, judged to be at high risk of bias.

2. Choice of gold standard records

MEDLINE development and evaluation studies

Of 17 studies, three development and three evaluation studies used generic gold standard records and caused a low level of concern regarding applicability (Bachmann 2002; Haynes 1994; Haynes 2004; Kastner 2009; Leeflang 2006; Whiting 2010). Of these, the development studies handsearched a broad range of general medical journals while the evaluation studies used the included studies from systematic reviews covering a range of diagnostic tests and conditions.

Four development studies used topic specific gold standard records to develop their filters (Astin 2008; Berg 2005; Deville 2002; van der Weijden 1997). In addition, the three studies which both developed and evaluated filters also used topic specific records (Deville 2000; Noel-Storr 2011; Vincent 2003). Four evaluation studies used topic specific gold standard records to test the performance of published filters (Doust 2005; Kassai 2006; Mitchell 2005; Ritchie 2007). These studies caused high levels of concern regarding applicability as they were only likely to be applicable to the particular topic area in which they were developed or evaluated. The topics included in these studies varied in their breadth, for example a very narrow topic was used by Kassai 2006 (limited to studies comparing ultrasound to venography for the diagnosis of deep vein thrombosis), whereas Deville 2000 included studies on diagnostic tests from nine family medicine journals. Other topics included diagnostic tests in radiology and biomarkers for mild cognitive impairment. Noel-Storr 2011 designed their filter to specifically retrieve longitudinal DTA studies and evaluated published filters for their ability to retrieve delayed cross-sectional DTA studies.

EMBASE development and evaluation studies

All but one of the four studies that developed or evaluated a diagnostic EMBASE filter used a set of gold standard records derived from a broad range of topics and tests. One evaluation study handsearched the three top-ranking renal journals for studies on the diagnosis of kidney disease (Mitchell 2005).

3. Validation of filters

MEDLINE development studies

Of the 10 studies reporting the development of a MEDLINE filter, two studies used discrete derivation and validation sets of references to test internal validity and were considered to be at low risk of bias (Astin 2008; Bachmann 2002). Astin handsearched six high ranking radiology journals to find studies for the derivation set and used a different set of six journals to compile studies for the validation set. Bachmann handsearched journals in different years; the studies found in 1989 comprised the set of references used to derive terms, while the studies from 1994 comprised the validation set.

Six of the remaining studies used an internal validation set which contained the references used to derive the terms for the filter and the studies were therefore judged to be at high risk of bias (Berg 2005; Deville 2000; Haynes 1994; Haynes 2004; Noel-Storr 2011; Vincent 2003). Of these studies, three independently selected terms to use as part of their filters, but the final strategies (made up of those terms) were derived from testing in the same set of references (Haynes 1994; Haynes 2004; Vincent 2003). Also of note, Noel-Storr 2011 derived filter terms by running published search filters in MEDLINE combined with a subject search, locating 10 papers that all filters missed and choosing a term from the title, abstract or keywords of each. These 10 papers were included in the reference set of 144 studies.

Two studies did not perform internal validity testing of the filters they had developed; instead, specific diagnostic topics (reviews) were used only for external validation (Deville 2002; van der Weijden 1997). These studies reported sensitivities > 90% for their most sensitive filters.

Four studies carried out external validation of their filters in a validation set that represented real-world settings, and the filters were judged to cause low levels of concern about applicability (Bachmann 2002; Deville 2000; Deville 2002; van der Weijden 1997). The remaining studies did not validate their filters in real-world settings and were considered to cause high levels of concern regarding applicability (Astin 2008; Berg 2005; Haynes 1994; Haynes 2004; Noel-Storr 2011; Vincent 2003).

EMBASE development studies

Both EMBASE development studies were at high risk of bias in this domain because neither study used a set of records independent from those used to derive the terms to internally validate their strategies (Bachmann 2003; Wilczynski 2005). Bachmann used word frequency analysis of all the titles and abstracts of studies included in the reference set to find and combine the 10 terms with the highest sensitivity and precision. Wilczynski first derived a list of potential diagnostic terms from clinical studies and then from clinicians and librarians. The individual search terms with sensitivity > 25% and specificity > 75%, when tested in the reference set, were then combined into the search strategies.
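
Wilczynski's two-step selection procedure, keeping individual terms that pass the sensitivity and specificity cut-offs before combining them with OR, can be sketched as follows (the term statistics below are hypothetical, chosen only to illustrate the cut-offs, not Wilczynski's data):

```python
# Sketch of single-term screening for filter development:
# keep terms with sensitivity > 25% and specificity > 75% when tested
# against the reference set, then OR the survivors into a strategy.

candidate_terms = {
    # term: (sensitivity, specificity) against the reference set (hypothetical)
    "specificity.tw": (0.40, 0.98),
    "predict:.tw": (0.30, 0.96),
    "di.fs": (0.80, 0.70),      # fails the specificity cut-off
    "sensitiv:.tw": (0.28, 0.90),
}

kept = [term for term, (sens, spec) in candidate_terms.items()
        if sens > 0.25 and spec > 0.75]
strategy = " OR ".join(sorted(kept))
print(strategy)   # predict:.tw OR sensitiv:.tw OR specificity.tw
```

Because the surviving terms are combined in the same reference set that produced the per-term statistics, this design gives no internal validation, which is why both EMBASE development studies were judged to be at high risk of bias.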

Neither study externally validated its newly developed filters; both were therefore judged to raise high concerns regarding applicability in this domain.

Effect of methods

1. Performance of MEDLINE filters as reported in development studies

Sensitivity ranged from 16% to 100% (median 86%; 39 filters, 10 studies), specificity ranged from 38% to 99% (median 88.5%; 30 filters, 6 studies) and precision ranged from 0.8% to 90% (median 9.3%; 32 filters, 8 studies) (Table 3).
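
These measures all derive from a 2x2 cross-tabulation of filter retrieval against the gold standard. A minimal worked example (the counts are hypothetical, not data from any included study):

```python
# Hypothetical reference set: 10,000 records, of which 100 are gold-standard
# DTA studies. A filter retrieves 580 records, 85 of them gold-standard.

gold_total = 100
gold_retrieved = 85                     # true positives
retrieved_total = 580
non_gold_total = 10_000 - gold_total    # 9900 non-gold-standard records
false_positives = retrieved_total - gold_retrieved

sensitivity = gold_retrieved / gold_total                       # 0.85
specificity = (non_gold_total - false_positives) / non_gold_total
precision = gold_retrieved / retrieved_total                    # ~0.147

print(f"sensitivity {sensitivity:.0%}, "
      f"specificity {specificity:.1%}, precision {precision:.1%}")
```

Note how a filter can have high specificity (95% here) yet low precision, because relevant studies are a small fraction of the database.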

2. Performance of evaluated MEDLINE filters

Performance data on each evaluated filter can be found in Table 10 and full search strategies can be found in Appendix 9. Thirteen of the 30 MEDLINE filters assessed by the evaluation studies had original performance data available from development studies. The other 17 filters were reported without any details on how they were developed or their performance.

Table 10. MEDLINE filters evaluated by two or more studies (values given in percentages)

* Doust combined each methodological filter with a content filter for a tympanometry systematic review (Doust TR) and a natriuretic peptides systematic review (Doust NPR), which is why two results are reported for each filter. Similarly, Deville 2000 used an independent set of references to externally validate their own filter and the Haynes 1994 sensitive filter: ALL = all references in the main reference set; ML = references on the diagnosis of meniscal lesions of the knee.

** The Falck-Ytter filter is an adaptation of the Haynes 1994 sensitive filter for Ovid into a PubMed format (an alternative to the PubMed Clinical Queries adaptation of the same filter).

$ Filter no longer available from the source cited by the evaluation studies.

Abbreviations: KSR = knee lesion systematic review; USR = urine dipstick systematic review; NR = not reported.

Filters whose original development study reported performance data:

Bachmann 2002 Sensitive
  Sensitivity: original 95; Ritchie 74; Whiting 87; Leeflang 88; Doust TR 70; Doust NPR 90; Mitchell 84; Noel-Storr 84
  Specificity: original NR; Whiting 37; Mitchell 80; Noel-Storr 36
  Precision: original 5.0; Ritchie 1.4; Whiting 3.0; Doust TR 5.0; Doust NPR 4.0; Mitchell 8.8; Noel-Storr 0.2

Haynes 2004 Sensitive
  Sensitivity: original 99; Ritchie 69; Whiting 80; Leeflang 87; Kastner 88; Doust TR 70; Doust NPR 100; Mitchell 67; Noel-Storr 69
  Specificity: original 74; Whiting 41; Mitchell 85; Noel-Storr 45
  Precision: original 1.1; Ritchie 1.3; Whiting 3.0; Doust TR 4.0; Doust NPR 5.0; Mitchell 9.1; Noel-Storr 0.9

Haynes 2004 Specific
  Sensitivity: original 65; Ritchie 21; Whiting 43; Leeflang 28; Noel-Storr 14
  Specificity: original 98; Whiting 94; Noel-Storr 95
  Precision: original 10.6; Ritchie 6.7; Whiting 15.0; Noel-Storr 2.0

Deville 2000 Strategy 4
  Sensitivity: original ALL 89, ML 61; Ritchie 46; Whiting 68; Leeflang 46; Doust TR 58; Doust NPR 100; Vincent 75; Mitchell 49; Noel-Storr 55
  Specificity: original 92; Whiting 81; Mitchell 95; Noel-Storr 82
  Precision: original ALL NR, ML 4.7; Ritchie 4.4; Whiting 7.0; Doust TR 9.0; Doust NPR 9.0; Mitchell 16.7; Noel-Storr 2.2

Haynes 1994 Specific
  Sensitivity: original 55; Ritchie 33; Whiting 55; Leeflang 29; Noel-Storr 51
  Specificity: original 98; Whiting 90; Noel-Storr 88
  Precision: original 40.0; Ritchie 7.4; Noel-Storr 3.0

Haynes 1994 Sensitive**
  Sensitivity: original 92; Ritchie 70; Whiting 85; Leeflang 81; Vincent 96; Deville ALL 73; Deville ML 45; Kassai 95; Mitchell 80; Noel-Storr 91
  Specificity: original 73; Whiting 23; Mitchell 80; Noel-Storr 32
  Precision: original 9.0; Ritchie 1.5; Deville ALL 29; Deville ML 3.4; Mitchell 5.3; Noel-Storr 1.0

Vincent 2003 Strategy C
  Sensitivity: original 79; Ritchie 87; Whiting 67; Leeflang 44; Noel-Storr 54
  Specificity: original NR; Whiting 85; Noel-Storr 83
  Precision: original 10.0; Ritchie 3.3; Whiting 9.0; Noel-Storr 2.3

van der Weijden 1997 Sensitive
  Sensitivity: original 91; Whiting 87; Leeflang 92; Doust TR 73; Doust NPR 100; Mitchell 96; Noel-Storr 93
  Specificity: original NR; Whiting 15; Mitchell 96; Noel-Storr 30
  Precision: original NR; Whiting 2.0; Doust TR 4.0; Doust NPR 4.0; Mitchell 5.6; Noel-Storr 1.0

Deville 2002 Accurate
  Sensitivity: original KSR 70, USR 92; Leeflang 51

Haynes 1994 Accurate
  Sensitivity: original 86; Leeflang 81
  Specificity: original 84
  Precision: original 13.0

Deville 2000 Strategy 3
  Sensitivity: original 80; Leeflang 41
  Specificity: original 97
  Precision: original 48.0

Deville 2000 Strategy 1
  Sensitivity: original 71; Kassai 76
  Specificity: original 99

Vincent 2003 Strategy A
  Sensitivity: original 100; Ritchie 87; Mitchell 81
  Specificity: original NR; Mitchell 81
  Precision: original 2.5; Ritchie 3.3; Mitchell 5.5

Filters whose original development study did NOT report performance data:

Falck-Ytter 2004**
  Sensitivity: Ritchie 74; Whiting 85; Noel-Storr 71
  Specificity: Whiting 39; Noel-Storr 51
  Precision: Ritchie 1.3; Whiting 3.0; Noel-Storr 1.1

CASP 2002 $
  Sensitivity: Ritchie 73; Whiting 83; Vincent 100; Kassai 95; Noel-Storr 67
  Specificity: Whiting 53; Noel-Storr 49
  Precision: Ritchie 1.2; Whiting 3.0; Noel-Storr 1.0

Deville 2002a Extended
  Sensitivity: Ritchie 52; Whiting 71; Doust TR 58; Doust NPR 100; Noel-Storr 60
  Specificity: Whiting 78; Noel-Storr 78
  Precision: Ritchie 3.9; Whiting 7.0; Doust TR 8.0; Doust NPR 6.0; Noel-Storr 2.0

Aberdeen (InterTASC 2011) $
  Sensitivity: Ritchie 69; Whiting 86; Noel-Storr 87
  Specificity: Whiting 39; Noel-Storr 33
  Precision: Ritchie 1.2; Whiting 3.0; Noel-Storr 1.0

Southampton A (InterTASC 2011) $
  Sensitivity: Ritchie 71; Whiting 86; Noel-Storr 93
  Specificity: Whiting 13; Noel-Storr 29
  Precision: Ritchie 1.0; Whiting 2.0; Noel-Storr 1.0

Southampton B (InterTASC 2011) $
  Sensitivity: Ritchie 45; Whiting 69; Noel-Storr 55
  Specificity: Whiting 80; Noel-Storr 81
  Precision: Ritchie 4.6; Whiting 7.0; Noel-Storr 2.1

Southampton C (InterTASC 2011)
  Sensitivity: Ritchie 31; Whiting 56; Noel-Storr 51
  Specificity: Whiting 90; Noel-Storr 88
  Precision: Ritchie 8.5; Whiting 11.0; Noel-Storr 3.0

Southampton D (InterTASC 2011)
  Sensitivity: Ritchie 66; Whiting 84; Noel-Storr 89
  Specificity: Whiting 21; Noel-Storr 42
  Precision: Ritchie 1.1; Whiting 2.0; Noel-Storr 1.1

Southampton E (InterTASC 2011) $
  Sensitivity: Ritchie 71; Whiting 87; Noel-Storr 92
  Specificity: Whiting 14; Noel-Storr 31
  Precision: Ritchie 1.0; Whiting 2.0; Noel-Storr 1.0

CRD A (InterTASC 2011)
  Sensitivity: Ritchie 53; Whiting 73; Noel-Storr 70
  Specificity: Whiting 62; Noel-Storr 58
  Precision: Ritchie 2.2; Whiting 4.0; Noel-Storr 1.2

CRD B (InterTASC 2011)
  Sensitivity: Ritchie 40; Whiting 64; Noel-Storr 67
  Specificity: Whiting 81; Noel-Storr 71
  Precision: Ritchie 4.1; Whiting 7.0; Noel-Storr 1.7

CRD C (InterTASC 2011)
  Sensitivity: Ritchie 69; Whiting 85; Noel-Storr 90
  Specificity: Whiting 24; Noel-Storr 43
  Precision: Ritchie 1.2; Whiting 2.0; Noel-Storr 1.2

HTBS (InterTASC 2011)
  Sensitivity: Ritchie 46; Whiting 69; Noel-Storr 56
  Specificity: Whiting 83; Noel-Storr 80
  Precision: Ritchie 3.7; Whiting 8.0; Noel-Storr 2.0

Shipley Miner 2002
  Sensitivity: Ritchie 48; Whiting 72; Noel-Storr 63
  Specificity: Whiting 73; Noel-Storr 73
  Precision: Ritchie 1.8; Whiting 5.0; Noel-Storr 1.7

Deville 2002a Accurate
  Sensitivity: Leeflang 88

University of Rochester 2002 $
  Sensitivity: Vincent 79

North Thames 2002 $
  Sensitivity: Vincent 53

None of the filters tested in development or evaluation studies had sensitivity > 90% and precision > 10%. The original studies reported sensitivities ranging from 55% to 100% (median 86%); evaluation studies reporting on the same 13 filters had sensitivities ranging from 14% to 100% (median 73%). Doust 2005 evaluated the two strategies with 100% sensitivity in a reference set composed of included studies from a systematic review of natriuretic peptides. The original searches for the two systematic reviews used the PubMed Clinical Queries filter (from Haynes 2004), supplemented by screening the reference lists of included studies. This might explain why the evaluated filters performed so well in this reference set. The sensitivities of the 17 evaluated filters that did not have accompanying original performance data ranged from 31% to 100% (median 71%).

Specificity was only reported in the original study and three evaluation studies (Mitchell 2005; Noel-Storr 2011; Whiting 2010) for four filters and ranged from 73% to 98% (median 94.5%) in the original study and from 15% to 96% (median 81%) in the evaluation studies. Similarly, precision was only reported in both the original study and evaluation studies for seven filters and ranged from 1.1% to 40% (median 9.5%) in the original study and from 0.2% to 16.7% (median 4%) in the evaluation studies. Similar ranges of specificities and precision were reported in the evaluation studies for the 17 filters without an original performance measure. Sensitivities ranged from 31% to 100% (median 71%), specificity ranged from 13% to 90% (median 55.5%) and precision from 1.0% to 11.0% (median 3.35%).

Original estimates of sensitivity were higher than those reported in the evaluation studies in 43 of 53 comparisons. (If an evaluation study had two reference sets, it contributed twice to the total number of comparisons for each filter evaluated.) Original estimates of specificity were higher in 10 of 14 comparisons, and precision was higher in 16 of 25 comparisons. None of the evaluated filters performed consistently well for any of the performance measures reported by evaluation studies (Table 10).

Seven filters had data on both sensitivity and specificity from the original development study and at least one evaluation study (Figure 2). Original estimates showed greater sensitivity and specificity than the estimates from the evaluation studies. The results from the development studies followed a more uniform pattern along a curve, whereas the estimates from the evaluation studies were more heterogeneous, especially for specificity. There were two outliers in the evaluation study results: Mitchell's (Mitchell 2005) measure of van der Weijden's sensitive filter (van der Weijden 1997), with very high sensitivity and specificity relative to the other estimates (96% sensitivity; 96% specificity); and Noel-Storr's (Noel-Storr 2011) measure of the Haynes 2004 specific filter (Haynes 2004), with very low sensitivity compared to the other estimates (14% sensitivity; 95% specificity). No apparent reason could be found for these anomalous results.

Figure 2.

ROC plot of sensitivity and specificity of MEDLINE search filters from development and evaluation studies.

Ten filters had data on both sensitivity and precision from the original development study and at least one evaluation study (Figure 3). The estimates from both development and evaluation studies showed a wide range in precision, and there was substantial variation in sensitivity in the evaluation studies. Precision was generally lower in the evaluation studies, but the pattern was not uniform. There were a number of outliers amongst both the development study and the evaluation study data points. Three outliers had much higher precision than the other estimates: the original performance estimate of the Haynes 1994 specific filter, the original estimate of Deville 2000 strategy 3, and Mitchell's (Mitchell 2005) evaluation of Deville 2000 strategy 4. It was not clear why these precision estimates were high.

Figure 3.

ROC plot of sensitivity and precision of MEDLINE search filters from development and evaluation studies.

3. Performance of EMBASE filters as reported by development studies

Table 11 shows the 12 filters and their performance data (Bachmann 2003; Wilczynski 2005). Sensitivity ranged from 46% to 100% (median 90%), and precision ranged from 1.2% to 27.7% (median 9%). Half the filters had a sensitivity greater than 90% (median 90.2%), but of these six filters only one had a precision greater than 10 (median 10.4) (Bachmann 2003).
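
The number needed to read (NNR) reported in Table 11 is simply the reciprocal of precision: how many retrieved records must be screened, on average, to find one relevant study. A quick check against the Bachmann 2003 values illustrates the relationship (the function name here is ours, for illustration only):

```python
# NNR (number needed to read) as the reciprocal of precision.

def nnr(precision_percent: float) -> float:
    """Records to screen per relevant study found, given precision in %."""
    return 100.0 / precision_percent

# Bachmann 2003, specific filter: precision 17.6% gives NNR ~5.7
print(round(nnr(17.6), 1))   # 5.7
# Bachmann 2003, sensitive filter: precision 3.7% gives NNR ~27
print(round(nnr(3.7), 1))    # 27.0
```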

Table 11. Performance of EMBASE filters from development studies (values in percentages; 95% CIs in parentheses where reported; specificity was not reported for the Bachmann 2003 filters)

Bachmann 2003 (filter interface: Datastar, Ovid and Silverplatter)
  sensitiv* OR detect* (specific filter): sensitivity 73.7 (60.9 to 84.2); precision 17.6; NNR 5.7 (4.4 to 7.6)
  sensitiv* OR detect* OR accura* OR specific* OR reliab* OR positive OR negative OR diagnos*: sensitivity 100 (94.1 to 100); precision 3.7; NNR 27.0 (21.0 to 34.8)
  sensitiv* OR detect* OR accura*: sensitivity 85.2; precision 14.2; NNR 7.0
  sensitiv* OR detect* OR accura* OR specific*: sensitivity 86.9; precision 10.4; NNR 9.6
  sensitiv* OR detect* OR accura* OR specific* OR reliab*: sensitivity 90.2; precision 10.4; NNR 9.6
  sensitiv* OR detect* OR accura* OR specific* OR reliab* OR positive: sensitivity 91.8; precision 9.2; NNR 10.9
  sensitiv* OR detect* OR accura* OR specific* OR reliab* OR positive OR negative: sensitivity 91.8; precision 8.5; NNR 11.8
  sensitiv*: sensitivity 45.9; precision 27.7; NNR 3.6

Wilczynski 2005 (filter interface: Ovid)
  Best sensitivity: di.fs OR predict:.tw OR specificity.tw: sensitivity 100 (100 to 100); specificity 70.4 (69.8 to 70.9); precision 1.2 (0.9 to 1.4)
  Small drop in sensitivity with substantive gain in specificity: diagnos:.mp OR predict:.tw OR specificity.tw: sensitivity 96.9 (93.5 to 100); specificity 78.2 (77.7 to 78.7); precision 1.5 (1.2 to 1.8)
  Small drop in specificity with a substantive gain in sensitivity: specificity.tw OR accurac:.tw: sensitivity 73.2 (64.4 to 82.0); specificity 97.4 (97.2 to 97.5); precision 8.8 (6.9 to 10.8)
  Best optimal strategy: sensitiv:.tw OR diagnostic accuracy.sh OR diagnostic.tw: sensitivity 89.7 (83.6 to 95.7); specificity 91.6 (91.3 to 91.9); precision 3.3 (2.9 to 4.4)

4. Performance of evaluated EMBASE filters

The original studies reported sensitivities ranging from 74% to 100% (median 90%); evaluation studies reporting on the same filters had sensitivities ranging from 72% to 97% (median 86%). The original studies reported precision ranging from 1.2% to 17.6% (median 3.7%); evaluation studies reporting on the same filters had precision ranging from 1.2% to 9% (median 3.7%) (Table 12). One of the evaluated filters did not have an original estimate of performance from the development study (Ovid 2010). Figure 4 shows that in general filters performed better in the original development studies than in the evaluation studies for both sensitivity and precision. None of the filters offered both high sensitivity (> 90%) and high precision (> 10%). The original development studies did not report specificity estimates for the filters that were also tested in evaluation studies, hence a ROC plot of sensitivity and specificity has not been prepared.

Figure 4.

ROC plot of sensitivity and precision of EMBASE search filters from development and evaluation studies.

Table 12. Performance of evaluated EMBASE filters (sensitivity and precision in percentages)

Abbreviations: NR = not reported.

PubMed Clinical Queries (Ovid 2010)
  Original (Ovid): sensitiv:.mp. OR diagnos:.mp. OR di.fs. Sensitivity NR; precision NR
  Kastner 2009 (Ovid): sensitiv:.mp. OR diagnos:.mp. OR di.fs. Sensitivity 88; precision NR

Bachmann 2003 Sensitive
  Original (Datastar, Ovid and Silverplatter): sensitiv* OR detect* OR accura* OR specific* OR reliab* OR positive OR negative OR diagnos*. Sensitivity 100; precision 3.7
  Wilczynski 2005 (Ovid): sensitiv:.tw. OR detect:.tw. OR accura:.tw. OR specific:.tw. OR reliab:.tw. OR positive:.tw. OR negative:.tw. OR diagnos:.tw. Sensitivity 97; precision 1.2; specificity 72%; accuracy 72%
  Mitchell 2005 (Ovid): sensitive* OR detect* OR accura* OR specific* OR reliab* OR positive OR negative OR diagnos*. Sensitivity 86; precision 4.4; specificity 60%

Bachmann 2003 Specific
  Original (Datastar, Ovid and Silverplatter): sensitiv* OR detect*. Sensitivity 74; precision 17.6; NNR 5.7
  Mitchell 2005 (Ovid): sensitiv*.tw. OR detect*.tw. Sensitivity 79; precision 3.0; specificity 91%; accuracy 91%

Wilczynski 2005 Sensitive
  Original (Ovid): di.fs OR predict:.tw OR specificity.tw. Sensitivity 100; precision 1.2; specificity 70%; accuracy 71%
  Mitchell 2005 (Ovid): di.fs OR predict*.tw. OR specificity.tw. Sensitivity 72; precision 9; specificity 83%

Discussion

Summary of main results

Nineteen studies, reporting 57 MEDLINE filters and 13 EMBASE filters, were eligible for this review. We pre-specified that filters should have a sensitivity > 90% and a precision > 10% to be considered acceptable when searching for studies for systematic reviews of diagnostic test accuracy. We acknowledge that other researchers may set alternative performance levels.

Reports of filter performance were available from studies using a variety of designs, ranging from authors' reports of their filter development process to evaluations of filters carried out by independent researchers using one or more different gold standards. The latter study design should provide the best evidence of the performance of filters outside the original authors' test environment, and of the consistency of a filter's performance across different sets of records.

Several filters reported performance levels in the development studies which met the pre-specified performance criteria. However, these performance levels typically declined when the filters were validated in the evaluation studies. Thirty MEDLINE filters and four EMBASE filters were tested in an evaluation study against one or more gold standards. In both the evaluation studies that developed their reference sets from studies included in several systematic reviews on a broad spectrum of topics, covering a number of publication years, and in those that developed reference sets by handsearching, no single filter achieved the sensitivity (> 90%) and precision (> 10%) that we pre-specified as 'acceptable'. This means that no filter is suitable for combination with the search terms for the target condition and index tests to create a single search strategy with which to identify studies for systematic reviews of diagnostic test accuracy.

As well as not reaching our pre-specified performance criteria, none of the evaluated filters for use in MEDLINE or EMBASE gave consistent sensitivity and precision measures. This may be caused by translation from one platform to another, or by mistakes made in the transcription of the filters. Another reason may be differences in the indexing and reporting of studies from different scientific fields. For these reasons, the degree of reduction in performance cannot be assessed consistently, making the filters unreliable tools for searching when sensitivity is an important consideration.

Overall completeness and applicability of evidence

The search filters were identified by extensive sensitive searches, checking reference lists of published filters and filter evaluations (Horsley 2011), and by searching a key website which identifies and collects search filters: the ISSG Search Filter Resource (InterTASC 2011). We are confident that we have identified the vast majority of published search filters, in particular those filters developed using a research method and those tested by independent researchers.

We did not, however, search for unpublished search filters, such as those which might have been developed by people conducting systematic reviews of diagnostic test accuracy studies. There are likely to be many unpublished filters reported in the search strategies of such reviews. These 'unpublished' filters could be identified and evaluated against gold standard sets of relevant records, in the same way that published filters have been evaluated. However, the evidence from the evaluations of the many published filters developed using research methods that we have compiled in this review suggests that unpublished filters may be subject to the same difficulties in achieving the pre-specified performance criteria if those filters consist of variants of the search terms used in the published search filters.

Quality of the evidence

The most reliable filter development studies are likely to be those where the authors used handsearched gold standards, tested their filters against internal validation record sets that are different from the record sets used to develop the filters, and externally validated the filters in a real-world topic. In the one study where this occurred, the MEDLINE filter maintained its performance, including a high sensitivity (Bachmann 2002).

The nature of the most reliable filter evaluation studies is a matter for debate. Testing filters against a handsearched gold standard set of records would seem to be the most reliable technique because it should yield a range of different DTA study types. However, handsearching is often limited to a small number of journals, a narrow range of topics and few publication years, which restricts judgements about the generalisability of the search filters to other topics and time periods. Only two evaluations of MEDLINE filters used handsearched reference sets, and both were topic specific (Deville 2000; Mitchell 2005). In those two studies, some filters maintained the sensitivity reported in their development papers while others experienced large drops in sensitivity.

Another method of reference set development is to use the studies included in systematic reviews. Whereas handsearching of journals for reference set studies is limited to a small number of journals, using systematic reviews broadens the journal base and the number of publication years covered. However, the primary diagnostic studies in systematic reviews may have been retrieved using a search strategy containing diagnostic terms, which could introduce bias. By including systematic reviews that used a methodological filter to find diagnostic studies, the performance of the evaluated filters in the reference set may be exaggerated. Precision is improved because irrelevant records will be removed, but sensitivity may suffer because 'difficult to find' studies may not be retrieved by the filter. This was discussed by Leeflang et al, who also used reviews to compile their reference set for the evaluation of 12 filters (Leeflang 2006). Seven of the reviews in their initial set did not report their search terms. If those seven reviews used one of the search filters evaluated by Leeflang et al, then the results are likely to be overestimates and the real percentage of missed studies could be even higher than reported. Three other evaluation studies used systematic reviews to compile the reference set, and some of these reviews had included a DTA methodological filter in the original search for eligible studies (Doust 2005; Kastner 2009; Vincent 2003).

How the reference set is used can be a source of bias. If the records used to derive the search terms for the filter are also included in the set of references used in the validation process, this can introduce bias by artificially inflating performance. A discrete set of derivation records and validation records should be used to avoid this. Only two MEDLINE development studies (but neither EMBASE development study) used this approach (Astin 2008; Bachmann 2002).
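
A minimal sketch of such a discrete derivation/validation design (the record identifiers are hypothetical): the records used to pick filter terms never appear in the set used to measure performance, so the estimate cannot be inflated by reuse.

```python
# Split a gold-standard reference set into disjoint derivation and
# validation subsets, as in the low-risk-of-bias development studies.

import random

random.seed(1)  # fixed seed so the split is reproducible
reference_set = [f"record_{i}" for i in range(200)]  # hypothetical records
random.shuffle(reference_set)

derivation = reference_set[:100]   # used only to derive candidate filter terms
validation = reference_set[100:]   # used only to estimate filter performance

# No overlap between the two sets, so performance is not artificially inflated.
assert not set(derivation) & set(validation)
```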

External validation relates to how generalisable (applicable) the results are to searching for diagnostic studies for different systematic review topics, and only applies to development studies. Four MEDLINE studies carried out external validation of their filters in a real-world setting and were judged to have low concerns about applicability (Bachmann 2002; Deville 2000; Deville 2002; van der Weijden 1997).

The date of the filter may also raise concerns. The problem of missed studies is greater in older studies, as shown by Haynes et al, whose filter did not perform as well in the 1986 reference set as it did in the 1991 reference set. This may be a feature of the reporting of DTA studies. The STARD statement, published in 2003, aimed to improve the standard of reporting of DTA studies (Bossuyt 2003). STARD's first recommendation is that authors should identify their publication as a study of diagnostic accuracy. If authors and editors support STARD, this alone will enhance the efficient retrievability of DTA studies.

There are concerns that the same filter may not have been implemented uniformly across evaluation studies and that this may hamper an evaluation of the consistency of filter performance. Some researchers have translated filters across searching platforms, for example from Ovid MEDLINE to PubMed. The translation process may influence the performance of the filters, although the likely effect of this is unclear. Translations may change the number of missed studies and may impact sensitivity and precision. PubMed, in particular, carries out automatic mapping of search terms and this factor needs to be taken into account when translating from PubMed to other interfaces and when translating a strategy to make it suitable for use in PubMed. An example is the different adaptations made by the Haynes team in translating the original Haynes 1994 sensitive search filter developed in Ovid into the PubMed Clinical Queries sensitive filter, and Falck-Ytter’s adaptation of the same filter for use in PubMed. Sensitivities reported by the evaluation studies varied between each of the three filters, which may be due to differences in translation. Furthermore, some evaluators report strategies with mistakes; the mistakes might have been made in the conduct of the strategies or might have been introduced at the reporting stage. This uncertainty leads to doubts about the performance data reported for some of the filters, and we were unable to make any judgement about whether the original filters were applied correctly in the evaluation studies.
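
As an illustration of why translation is error-prone, a naive Ovid-to-PubMed mapping of text-word terms might look like the sketch below (a toy illustration, not a validated translator; real translations must also account for PubMed's automatic term mapping, which a mechanical rewrite like this cannot reproduce):

```python
# Toy mapping of Ovid text-word terms to PubMed syntax: Ovid truncation
# ($) becomes PubMed truncation (*), and the Ovid .tw. (text word) field
# suffix becomes the PubMed [tiab] (title/abstract) tag.

def ovid_to_pubmed(term: str) -> str:
    """Rewrite a single Ovid text-word term into PubMed syntax."""
    term = term.replace("$", "*")        # truncation symbol
    if term.endswith(".tw."):
        term = term[:-4] + "[tiab]"      # field suffix -> field tag
    return term

ovid_filter = ["sensitiv$.tw.", "specificity.tw.", "predict$.tw."]
print(" OR ".join(ovid_to_pubmed(t) for t in ovid_filter))
# sensitiv*[tiab] OR specificity[tiab] OR predict*[tiab]
```

Even when such a rewrite is syntactically correct, the two interfaces can retrieve different record sets, which is one plausible source of the varying sensitivities reported for the Haynes 1994 filter and its PubMed adaptations.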

Potential biases in the review process

It can be difficult to identify the filters reported in evaluation studies because filters can be named differently and the filter used is not always listed in the paper or appendix (that is, only a reference may be provided). In those circumstances, it is unclear whether the strategy was used accurately or whether it was adapted. In some cases the original source of a filter has disappeared because of changes to websites. The Shipley Miner and University of Rochester filters evaluated by Ritchie, Whiting and Vincent are no longer available online, and we had to rely on the evaluators for a listing of the strategies rather than being able to visit the original website. This means that our review may have erroneously assigned some performance data to a named filter or to a filter which is a variant of a published filter.

Agreements and disagreements with other studies or reviews

Many of the search filters included in this systematic review have been extensively evaluated in other studies with different but relevant gold standards. This systematic review of evaluation studies draws the same conclusion as the most comprehensive evaluation study, by Whiting and colleagues, which found that filtered searches miss additional studies compared with searches based on index test and target condition alone (Whiting 2010). None of the filters evaluated by Whiting reduced the number needed to read while retaining acceptable sensitivity, and so they should not be used to identify studies for inclusion in systematic reviews (Whiting 2010). A key strength of the Whiting study is the size and homogeneity of its reference set: the team used seven systematic reviews published on a broad range of topics that had been conducted by the authors using extensive, rigorous and, for the first time, reproducible search methods without the inclusion of a methodological search filter. The inclusion criteria for each review produced sufficient data to allow cross-tabulation of results comparing index tests with a reference standard, which meant that only true test accuracy studies were included.

Authors' conclusions

Implications for methodological research

The information retrieval environment is not static: better reporting of DTA studies as advocated by STARD, additional indexing terms (such as those recently introduced by EMBASE) and more consistent indexing of diagnostic studies could make published methodological filters more sensitive, or create opportunities for the development of new filters.

Search filters that make more use of proximity operators and careful exclusion terms may also yield performance improvements in traditional database interfaces reliant on Boolean searching. Beyond Boolean approaches, developments in information retrieval such as semantic textual analysis may lead to filtering programs or record-matching rules that can better identify diagnostic test accuracy studies among batches of records retrieved by sensitive searches. The increasing availability of full-text journals may also improve the retrieval of DTA studies, as the whole paper will be available to search and DTA performance measures may be identified more consistently.

In the absence of suitable search filters, the impact of different search approaches could be investigated. The effectiveness of multi-strand searching is unexplored. In addition, the yield of restricted searching could be explored against the results of completed systematic reviews. A combination of search approaches, in which the results of strategies using filters are augmented with more extensive reference checking and citation searching, could also be investigated as an alternative way of identifying as many relevant DTA studies as possible.

Acknowledgements

We thank Marit Johansen for her help in designing the search strategies.

Data and analyses


This review has no analyses.

Appendices

Appendix 1. MEDLINE search strategy

MEDLINE ® OvidSP 1950 to week 1 November 2012

1 “Information Storage and Retrieval”/

2 ((information or literature) adj5 retriev$).tw.

3 Databases, Bibliographic/

4 ((bibliographic adj1 database$) or (electronic adj1 database$) or (online adj1 database$)).tw.

5 Medline/

6 PubMed/

7 Medlars/

8 Grateful Med/

9 (medline or pubmed or medlars or grateful-med or gratefulmed or embase$ or excerpta medica).tw.

10 or/1-9

11 (search$ adj5 (strateg$ or filter$ or hedge$ or technique$ or term$1)).tw.

12 (retriev$ adj5 (strateg$ or filter$ or hedge$ or technique$)).tw.

13 ((methodology or methodologic$) adj5 (strateg$ or filter$ or hedge$ or search$ or term$1)).tw.

14 (search$ adj5 (precision or recall or accura$ or sensitiv*)).tw.

15 (retriev$ adj5 (precision or recall or accura$ or sensitiv$)).tw.

16 or/11-15

17 (diagnos$ adj5 (strateg$ or filter$ or hedge$ or search$ or term$1)).tw.

18 exp Diagnosis/

19 diagnos$.tw.

20 "Sensitivity and Specificity"/

21 (sensitiv$ and specific$).tw.

22 or/18-21

23 10 and 16 and 22

24 10 and 17

25 23 or 24

26 "cochrane database of systematic reviews".so.

27 25 not 26
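The Boolean combination in lines 23 to 27 of this strategy can be read as set operations on record sets. The sketch below is purely illustrative: the record IDs are invented, and each variable simply stands in for one of the strategy's numbered result sets.

```python
# Hypothetical record-ID sets standing in for the strategy's numbered lines.
line10 = {1, 2, 3, 4}   # line 10: information retrieval / database terms (or/1-9)
line16 = {2, 3, 5}      # line 16: search filter methodology terms (or/11-15)
line22 = {3, 4, 5}      # line 22: diagnosis terms (or/18-21)
line17 = {4, 6}         # line 17: diagnos$ near search-filter terms
line26 = {4}            # line 26: Cochrane Database of Systematic Reviews records

line23 = line10 & line16 & line22   # 23: 10 and 16 and 22
line24 = line10 & line17            # 24: 10 and 17
line25 = line23 | line24            # 25: 23 or 24
line27 = line25 - line26            # 27: 25 not 26

print(sorted(line27))  # -> [3]
```

The final NOT step mirrors the strategy's exclusion of records published in the Cochrane Database of Systematic Reviews.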

Appendix 2. EMBASE search strategy

EMBASE OvidSP 1980 to 2012 Week 48

1 Information Retrieval/

2 ((information or literature) adj5 retriev$).tw.

3 Bibliographic Database/

4 ((bibliographic adj1 database$) or (electronic adj1 database$) or (online adj1 database$)).tw.

5 Medline/ or Embase/

6 (medline or pubmed or medlars or grateful-med or gratefulmed or embase$ or excerpta medica).tw.

7 or/1-6

8 (search$ adj5 (strateg$ or filter$ or hedge$ or technique$ or term$1)).tw.

9 (retriev$ adj5 (strateg$ or filter$ or hedge$ or technique$)).tw.

10 (search$ adj5 (precision or recall or accura$ or sensitiv*)).tw.

11 (retriev$ adj5 (precision or recall or accura$ or sensitiv$)).tw.

12 ((methodology or methodologic$) adj5 (strateg$ or filter$ or hedge$ or search$ or term$1)).tw.

13 or/8-12

14 (diagnos$ adj5 (strateg$ or filter$ or hedge$ or search$ or term$1)).tw.

15 exp "Diagnosis, Measurement and Analysis"/

16 diagnos$.tw.

17 "Sensitivity and Specificity"/

18 (sensitiv$ and specific$).tw.

19 or/15-18

20 7 and 13 and 19

21 7 and 14

22 20 or 21

23 "cochrane database of systematic reviews".so.

24 “cochrane database of systematic reviews (online)”.so.

25 23 or 24

26 22 not 25

Appendix 3. ISI Web of Science search strategy

ISI Web of Science searched 11 January 2013

ISI Web of Science Databases=SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, CCR-EXPANDED, IC Timespan=All Years

# 6 #3 AND #4 AND #5

# 5 #1 OR #2

# 4 TS=diagnos*

# 3 TS=(information retriev* OR literature retriev* OR bibliographic database OR
medline OR pubmed OR medlars OR grateful med OR gratefulmed OR embase* OR
psycinfo)

# 2 TS=(retriev* same (strateg* OR filter* OR hedge* OR technique*))

# 1 TS=(search* same (strateg* OR filter* OR hedge* OR technique* OR term*))

Appendix 4. PsycINFO search strategy

PsycINFO (OvidSP) searched 13 March 2013

1. exp Automated Information Retrieval/

2. Databases/

3. Information Seeking/

4. Computer Searching/

5. ((information or literature) adj2 retriev$).tw.

6. ((bibliographic adj1 database?) or (electronic adj1 database?)).tw.

7. (medline or pubmed or medlars or grateful med or gratefulmed or embase$ or excerpta medica).tw.

8. psycinfo.ti.

9. psycinfo.ab. /freq=2

10. or/1-9

11. (search$ adj2 (strateg$ or filter$ or hedge? or technique? or term$1)).tw.

12. (retriev$ adj2 (strateg$ or filter$ or hedge? or technique?)).tw.

13. (sensitiv$ or specific$ or recall or precision or precise or number needed to read or NNR).tw.

14. or/11-13

15. Diagnosis/

16. diagnos$.tw.

17. or/15-16

18. and/10,14,17

Appendix 5. Library, Information Science and Technology Abstracts (LISTA) search strategy

Library, Information Science and Technology Abstracts (LISTA) strategy searched 13 March 2013

S41 S37 or S40

S40 S16 and S29 and S39

S39 S28 or S38

S38 S30 or S31 or S32 or S33 or S34 or S35

S37 S16 and S28 and S36

S36 S29 or S30 or S31 or S32 or S33 or S34 or S35

S35 NNR

S34 "number needed to read"

S33 precision

S32 recall

S31 specificity

S30 sensitivity

S29 diagnos*

S28 (S17 or S18 or S19 or S20 or S21 or S22 or S23 or S24 or S25 or S26 or S27)

S27 retriev* N2 techniqu*

S26 retriev* N2 hedge*

S25 retriev* N2 filter*

S24 retriev* N2 strateg*

S23 search* N2 terms

S22 search* N2 term

S21 search* N2 techniqu*

S20 search* N2 hedge*

S19 search* N2 filter*

S18 search* N2 strateg*

S17 DE Search Algorithms

S16 (S1 or S2 or S3 or S4 or S5 or S6 or S7 or S8 or S9 or S10 or S11 or S12 or S13 or S14 or S15)

S15 medline OR pubmed or medlars or "grateful med" or gratefulmed or embase* or "excerpta medica"

S14 DE Electronic Information Resources

S13 DE Bibliographic Databases

S12 DE Databases

S11 DE PubMed

S10 DE EMBASE

S9 DE MEDLINE

S8 DE "Information Storage & Retrieval Systems"

S7 information N2 search*

S6 literature N2 search*

S5 literature N2 retriev*

S4 information N2 retriev*

S3 DE "electronic information resource searching"

S2 DE "database searching"

S1 DE "information retrieval"

Appendix 6. Cochrane Methodology Register search strategy

Cochrane Methodology Register 2012, Issue 3 in The Cochrane Library (Wiley InterScience Online)

#1 ("diagnostic test accuracy" NEXT "search strategies"):kw in Methods Studies

#2 ("study identification" next general) or ("study identification" next "prospective registration") or ("study identification" next "internet") or ("information retrieval" next general) or ("information retrieval" next "retrieval techniques") or ("information retrieval" next "comparisons of methods") or ("information retrieval" next indexing):kw in Methods Studies

#3 search*:ti NEAR/5 (strateg* or filter* or hedge* or technique* or term or terms or precision or recall or accura*):ti in Methods Studies

#4 retriev*:ti NEAR/5 (strateg* or filter* or hedge* or technique* or term or terms or precision or recall or accura*):ti in Methods Studies

#5 search*:ab NEAR/5 (strateg* or filter* or hedge* or technique* or term or terms or precision or recall or accura*):ab in Methods Studies

#6 retriev*:ab NEAR/5 (strateg* or filter* or hedge* or technique* or term or terms or precision or recall or accura*):ab in Methods Studies

#7 methodology:ti NEAR/5 (strateg* or filter* or hedge* or term or terms):ti in Methods Studies

#8 methodologic*:ti NEAR/5 (strateg* or filter* or hedge* or term or terms):ti in Methods Studies

#9 methodology:ab NEAR/5 (strateg* or filter* or hedge* or term or terms):ab in Methods Studies

#10 methodologic*:ab NEAR/5 (strateg* or filter* or hedge* or term or terms):ab in Methods Studies

#11 (medline or pubmed or medlars or "grateful med" or gratefulmed or embase* or excerpta medica):ti in Methods Studies

#12 (medline or pubmed or medlars or "grateful med" or gratefulmed or embase* or excerpta medica):ab in Methods Studies

#13 (diagnos* or sensitiv* or specific*):ti in Methods Studies

#14 (diagnos* or sensitiv* or specific*):ab in Methods Studies

#15 (#2 OR #3 OR #4 OR #5 OR #6 OR #7 OR #8 OR #9 OR #10) AND (#11 OR #12)

#16 (#2 OR #3 OR #4 OR #5 OR #6 OR #7 OR #8 OR #9 OR #10) AND (#13 OR #14)

#17 diagnos*:ti NEAR/5 (strateg* or filter* or hedge* or search* or term or terms):ti in Methods Studies

#18 diagnos*:ab NEAR/5 (strateg* or filter* or hedge* or search* or term or terms):ab in Methods Studies

#19 (#17 OR #18)

#20 (#1 OR #15 OR #16 OR #19)

Appendix 7. Library and Information Science Abstracts (LISA) search strategy

LISA: Library and Information Science Abstracts (Cambridge Scientific Abstracts) - searched 31 May 2010 

(((DE=("databases" or "bibliographic databases" or "cd rom databases" or "database producers" or "online databases" or "computerized information retrieval" or "multiple database searches" or "online information retrieval")) or (TI=((literature or information) within 2 retriev*)) or (AB=((literature or information) within 2 retriev*)) or (TI=((bibliographic or electronic) within 2 database*)) or (AB=((bibliographic or electronic) within 2 database*)) or (TI=(medline or medlars or pubmed or grateful med or gratefulmed or embase* or excerpta medica)) or (AB=(medline or medlars or pubmed or grateful med or gratefulmed or embase* or excerpta medica)))

and ((DE=("search strategies" or "searching" or "boolean strategies" or "non boolean strategies" or "term selection" or "free text searching" or "full text searching" or "ranking")) or (DE=("boolean strategies" or "non boolean strategies")) or (TI=(search* within 2 (strateg* or filter? or hedge? or technique? or term?))) or (AB=(search* within 2 (strateg* or filter? or hedge? or technique? or term?))) or (TI=(retriev* within 2 (strateg* or filter? or hedge? or technique?))) or (AB=(retriev* within 2 (strateg* or filter? or hedge? or technique?)))))

and ((TI=diagnos* or AB=diagnos*) or (DE=("recall" or "retrieval performance measures" or "exhaustivity" or "pertinence" or "relevance")) or (DE="retrieval performance measures") or (TI=(sensitivity or specificity or recall or precision or accuracy or (number within 3 read))) or (AB=(sensitivity or specificity or recall or precision or accuracy or (number within 3 read))))

 

Appendix 8. Definition of terms used in this review

Accuracy – proportion of all articles correctly categorised

Development study – a study which aims to develop and test a search strategy for locating diagnostic test accuracy studies

Diagnostic odds ratio – positive likelihood ratio/negative likelihood ratio

Diagnostic test accuracy study – a study which compares the results of the test of interest, the index test, to those of a reference standard, which should be the best available method of determining disease status

Evaluation study – a study which quantitatively evaluates existing search strategies for locating diagnostic test accuracy studies

Gold standard record – a record included in the reference set that meets the criteria for a diagnostic test accuracy study

Non-gold standard record – a record included in the reference set that does not meet the criteria for a diagnostic test accuracy study

Number Needed to Read – the number of articles needed to read to identify one relevant article, calculated as 1 divided by precision

Positive likelihood ratio – the ratio of the probability of retrieving a true positive record to the probability of retrieving a false positive record (sensitivity/(1 - specificity))

Precision/positive predictive value – the proportion of retrieved records meeting the diagnostic test accuracy criteria, that is, the proportion of gold standard records in the result set

Reference set – compilation of records which can be used to derive terms for search filter development and test the performance of search filters. The reference set can be composed of gold standard and non-gold standard records, or gold standard records alone

Sensitivity – percentage of correctly identified gold standard studies

Specificity – percentage of non-gold standard records correctly not retrieved.

                       Reference set
Search terms     | Gold standard records | Non-gold standard records
Detected         | a                     | b
Not detected     | c                     | d

Sensitivity = a/(a + c); precision = a/(a + b); specificity = d/(b + d); accuracy = (a + d)/(a + b + c + d). All included and excluded references in the reference set = (a + b + c + d)
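The measures defined above can be computed directly from the cells of the 2×2 cross-tabulation. The function below is an illustrative sketch only: the function name and the example counts are invented for demonstration and do not come from the review.

```python
def filter_performance(a, b, c, d):
    """Search filter performance from a 2x2 reference-set cross-tabulation.

    a: gold standard records detected (true positives)
    b: non-gold standard records detected (false positives)
    c: gold standard records not detected (false negatives)
    d: non-gold standard records not detected (true negatives)
    """
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    precision = a / (a + b)                   # positive predictive value
    accuracy = (a + d) / (a + b + c + d)
    nnr = 1 / precision                       # Number Needed to Read
    lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
    lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
    dor = lr_pos / lr_neg                     # diagnostic odds ratio = (a*d)/(b*c)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "nnr": nnr, "dor": dor}

# Invented example: 8 of 10 gold standard records retrieved,
# plus 2 false positives among 90 non-gold standard records.
print(filter_performance(a=8, b=2, c=2, d=88))
```

With these hypothetical counts, sensitivity and precision are both 0.8, so the Number Needed to Read is 1.25: a reviewer would read about five records to find four relevant ones.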

Appendix 9. MEDLINE filters with full strategies as used by evaluation studies

Columns: Filter (original reference) | Author (year) of evaluation study | Description of filter as it appears in evaluation study | Interface used by evaluation study | Sensitivity (95% CI) | Precision (95% CI) | Comments

Bachmann 2002

Sensitive

ORIGINAL | exp sensitivity and specificity or predict$ or diagnos$ or di.fs. or du.fs. or accura$
Ritchie 2007 | NR | Ovid | 74 | 1.36
Leeflang 2006 | "sensitivity and specificity"[MeSH] OR predict* OR diagnose* OR diagnosi* OR diagnost* OR accura* | PubMed | 88
Doust 2005 | Sensitivity and specificity [MeSH]; predict* [tw]; diagnos* [tw]; accura* [tw] | Datastar, Ovid, PubMed, Silverplatter
70 | 5 | Methodological & content filter for TSR
90 | 4 | Methodological & content filter for NPSR
88 |  | Methodological filter for TSR
90 |  | Methodological filter for NPSR
Whiting 2010 | Exp "sensitivity and specificity"/
Diagnos$ OR di.fs. or du.fs.
Predict$
Accura$
Ovid

87

(81-98)

3

(1-22)

NNR 36

(4-98)

Noel-Storr 2011 | NR | Ovid

84

(77-90)

0.17

(0.14-0.20)

 
Mitchell 2005

1. exp "Sensitivity and Specificity"/

2. predict$.tw.

3. diagnos$.tw.

4. di.fs.

5. du.fs.

6. accura$.tw.

7. or/1-6

Ovid | 84 | 8.8 | Strategy from Table 3

Haynes 2004

Sensitive

ORIGINAL | sensitiv:.mp. OR diagnos:.mp. OR di.fs.
Ritchie 2007 | NR | Ovid | 69 | 1.3
Leeflang 2006 | sensitiv*[Title/Abstract] OR sensitivity and specificity[MeSH Terms] OR diagnos*[Title/Abstract] OR diagnosis[MeSH:noexp] OR diagnostic * [MeSH:noexp] OR diagnosis, differential[MeSH:noexp] OR diagnosis[Subheading:noexp] | PubMed | 87
Whiting 2010

Sensitive$.ti,ab.

"sensitivity and specificity"/

Diagnos$.ti,ab.

Diagnosis/

Diagnostic$.hw.

Diagnosis, Differential/

di.fs.

Ovid | 82 | 3 | NNR 36
Noel-Storr 2011

Sensitive$.ti,ab.

"sensitivity and specificity"/

Diagnos$.ti,ab.

Diagnosis/

Diagnostic$.hw.

Diagnosis, Differential/

di.fs.

Ovid

69

(60-77)

0.92

(0.74-1.10)

 
Mitchell 2005

1. sensitiv$.mp.

2. diagnos$.mp.

3. di.fs.

4. or/1-3.

Ovid | 67 | 9.1
Kastner 2009 | sensitiv:.mp. OR diagnos:.mp. OR di.fs. | Ovid | 88
Doust 2005 | sensitiv:.mp. OR diagnos:.mp. OR di.fs. | Ovid
100 |  | Methodological filter for NPSR
100 | 5 | Methodological & content filter for NPSR
88 |  | Methodological filter for TSR
70 | 4 | Methodological & content filter for NPSR

Haynes 2004

Specific

ORIGINAL | Specificity.tw.
Ritchie 2007 | NR | Ovid | 21 | 6.7
Whiting 2010 | Specificity.ti,ab. | Ovid | 43 | 15 | NNR 7
Noel-Storr 2011 | Specificity.ti,ab. | Ovid

14

(9-21)

2.04

(1.22-3.21)

 

Haynes 1994

Accurate

ORIGINAL

Exp Sensitivity a#d

Specificity

Or Diagnosis (sh)

Or Diagnostic Use (sh)

Or Specificity (tw)

Or Predictive (tw) and Value: (tw)

    
Leeflang 2006 | ‘‘sensitivity and specificity’’[MeSH] OR ‘‘Diagnosis’’[MeSH] OR ‘‘diagnostic use’’[subheading] OR specificity[tw] OR (predictive[tw] AND value[tw]) | PubMed | 81

Haynes 1994

Specific

ORIGINAL | Exp Sensitivity a#d
Specificity
OR Predictive (tw) AND Value: (tw)
    
Ritchie 2007 | NR | Ovid | 33 | 7.4
Leeflang 2006 | ‘‘sensitivity and specificity’’[MeSH] OR (predictive[tw] AND value[tw]) | PubMed | 29
Whiting 2010

exp "sensitivity and specificity"/

(predictive and value$).ti,ab.

Ovid | 56 | 11 | NNR 9
Noel-Storr 2011

exp "sensitivity and specificity"/

(predictive and value$).ti,ab.

Ovid

51

(42-60)

3.04

(2.36-3.86)

 

Haynes 1994

Sensitive

ORIGINAL | Exp Sensitivity a#d Specificity
or Diagnosis& (px)
or Diagnostic Use (sh)
or Sensitivity (tw)
or Specificity (tw)
    
Ritchie 2007 | NR | Ovid | 70 | 1.5
Leeflang 2006 | "sensitivity and specificity"[MeSH] OR diagnosis[subheading:noexp] OR "diagnostic use"[subheading] OR sensitivity[tw] OR specificity[tw] | PubMed | 81
Kassai 2006 | NR | PubMed | 95
Whiting 2010

exp "sensitivity and specificity"/

di.xs.

Du.fs.

Sensitivity.ti,ab.

Specificity.ti,ab.

Ovid | 87 | 2 | NNR 45
Vincent 2003

1 exp ‘sensitivity and specificity’/

2 sensitivity.tw.

3 di.fs.

4 du.fs.

5 specificity.tw.

6 or/1-5

NR | 96
Deville 2000

Sensitivity and specificity (exploded) (sh)

Diagnosis& (sh)

Diagnostic use (sh)

Sensitivity (tw)

Specificity (tw)

NR

73

(63-8)

 Specificity=94.3 (93.3-95.2); DOR=45
Noel-Storr 2011

exp "sensitivity and specificity"/

di.xs.

Du.fs.

Sensitivity.ti,ab.

Specificity.ti,ab.

Ovid

91

(84-95)

0.98

(0.80-1.17)

 
Mitchell 2005

1. exp "Sensitivity and Specificity"/

2. di.xs.

3. du.fs.

4. sensitivity.tw.

5. specificity.tw.

6. or/1-5

Ovid | 80 | 5.3
Falck-Ytter 2004 ORIGINAL | sensitiv:.tw. or exp "sensitivity and specificity"/ or diagnos:.tw,ot,hw,rw. or (di or du).fs.
Ritchie 2007 | NR | Ovid | 74 | 1.3
Whiting 2010 | Sensitive:.tw.
exp "sensitivity and specificity"/
Diagnos:.tw,ot,hw,rw.
(di or du).fs.
Ovid

85

(80-93)

3

(1-19)

NNR 36

(5-106)

Noel-Storr 2011 | Sensitive:.tw.
exp "sensitivity and specificity"/
Diagnos:.tw,ot,hw,rw.
(di or du).fs.
Ovid

71

(62-79)

1.06

(0.86-1.31)

 

Deville 2000

Strategy 1

ORIGINAL

sensitivity and specificity (exploded)(sh)

specificity (tw)

    
Kassai 2006 | NR | PubMed | 75.5

Deville 2000

Strategy 3

ORIGINAL

sensitivity and specificity (exploded)(sh)

Specificity (tw)

False negative (tw)

Accuracy (tw)

    
Leeflang 2006 | "sensitivity and specificity"[MeSH] OR specificity[tw] OR false negative[tw] OR accuracy [tw] | PubMed | 41

Deville 2000

Strategy 4

ORIGINAL | sensitivity and specificity (exploded) (sh)
specificity (tw)
false negative (tw)
accuracy (tw)
screening (tw)
    
Ritchie 2007 | NR | Ovid | 46 | 4.4
Leeflang 2006 | ‘‘sensitivity and specificity’’[MeSH] OR specificity[tw] OR false negative[tw] OR accuracy[tw] OR screening[tw] | PubMed | 46
Doust 2005 | Sensitivity and specificity [MeSH]; Specificity [tw]; False negative [tw]; Accuracy [tw]; Screening [tw] | Ovid
58 | 9 | Methodological & content filter for TSR
100 | 9 | Methodological & content filter for NPSR
100 |  | Methodological filter for NPSR
73 |  | Methodological filter for TSR
Whiting 2010

exp "sensitivity and specificity"/

Specificity.ti,ab.

False negative.ti,ab.

Accuracy.ti,ab.

Screening.ti,ab.

Ovid | 68 | 7 | NNR 14
Vincent 2003

1 exp sensitivity and specificity/

2 specificit$.tw.

3 false negative$.tw.

4 Accuracy.tw.

5 screening.tw.

6 or/1-5

NR | 75 | Authors say they tested the Deville specific strategy; however, they have listed the Deville sensitive strategy in the appendix.
Noel-Storr 2011

exp "sensitivity and specificity"/

Specificity.ti,ab.

False negative.ti,ab.

Accuracy.ti,ab.

Screening.ti,ab.

Ovid

55

(46-64)

2.20

(1.70-2.77)

 
Mitchell 2005

1. exp "Sensitivity and Specificity"/

2. specificity.tw.

3. false negative.tw.

4. accuracy.tw.

5. screening.tw.

4. or/1-5

Ovid | 49 | 16.7

Deville 2002

Extended

ORIGINAL | (((((((((((("sensitivity and specificity"[All Fields] OR "sensitivity and specificity/standards"[All Fields]) OR "specificity"[All Fields]) OR "screening"[All Fields]) OR "false positive"[All Fields]) OR "false negative"[All Fields]) OR "accuracy"[All Fields]) OR (((("predictive value"[All Fields] OR "predictive value of tests"[All Fields]) OR "predictive value of tests/standards"[All Fields]) OR "predictive values"[All Fields]) OR "predictive values of tests"[All Fields])) OR (("reference value"[All Fields] OR "reference values"[All Fields]) OR "reference values/
standards"[All Fields])) OR ((((((((((("roc"[All Fields] OR "roc analyses"[All Fields]) OR "roc analysis"[All Fields]) OR "roc and"[All Fields]) OR "roc area"[All Fields]) OR "roc auc"[All Fields]) OR "roc characteristics"[All Fields]) OR "roc curve"[All Fields]) OR "roc curve method"[All Fields]) OR "roc curves"[All Fields]) OR "roc estimated"[All Fields]) OR "roc evaluation"[All Fields])) OR "likelihood ratio"[All Fields]) AND notpubref [sb]) AND "human"[MeSH Terms])
    
Ritchie 2007 | NR | Ovid | 52 | 3.9
Doust 2005 | (((((((((((("sensitivity and specificity"[All Fields] OR "sensitivity and specificity/standards"[All Fields]) OR "specificity"[All Fields]) OR "screening"[All Fields]) OR "false positive"[All Fields]) OR "false negative"[All Fields]) OR "accuracy"[All Fields]) OR (((("predictive value"[All Fields] OR "predictive value of tests"[All Fields]) OR "predictive value of tests/standards"[All Fields]) OR "predictive values"[All Fields]) OR "predictive values of tests"[All Fields])) OR (("reference value"[All Fields] OR "reference values"[All Fields]) OR "reference values/
standards"[All Fields])) OR ((((((((((("roc"[All Fields] OR "roc analyses"[All Fields]) OR "roc analysis"[All Fields]) OR "roc and"[All Fields]) OR "roc area"[All Fields]) OR "roc auc"[All Fields]) OR "roc characteristics"[All Fields]) OR "roc curve"[All Fields]) OR "roc curve method"[All Fields]) OR "roc curves"[All Fields]) OR "roc estimated"[All Fields]) OR "roc evaluation"[All Fields])) OR "likelihood ratio"[All Fields]) AND notpubref [sb]) AND "human"[MeSH Terms])
WebSpirs | 58 | 8 | Methodological & content filter for TSR
100 | 6 | Methodological & content filter for NPSR
100 |  | Methodological filter for NPSR
76 |  | Methodological filter for TSR
Whiting 2010

“sensitivity and specificity”.mp.

“sensitivity and specificity”/st

Specificity.mp.

Screening.mp.

False positive.mp.

False negative.mp.

Accuracy.mp.

Predictive value.mp.

Predictive values.mp.

Reference value.mp.

Reference values.mp.

Roc.mp.

Likelihood ratio.mp.

Humans/

Ovid | 71 | 7 | NNR 15
Noel-Storr 2011

“sensitivity and specificity”.mp.

“sensitivity and specificity”/st

Specificity.mp.

Screening.mp.

False positive.mp.

False negative.mp.

Accuracy.mp.

Predictive value.mp.

Predictive values.mp.

Reference value.mp.

Reference values.mp.

Roc.mp.

Likelihood ratio.mp.

Humans/

Ovid

60

(51-69)

1.99

(1.57-2.47)

 

Deville 2002a

Accurate

ORIGINAL

1. sensitivity and specificity[Mesh; exploded]

2. mass screening [Mesh; exploded]

3. reference values [Mesh]

4. false positive reactions [Mesh]

5. false negative reactions [Mesh]

6. specificit$.tw

7. screening.tw

8. false positive$.tw

9. false negative$.tw

10. accuracy.tw

11. predictive value$.tw

12. reference value$.tw

13. roc$.tw

14. likelihood ratio$.tw

or/1-14

    
Leeflang 2006 | ‘‘Sensitivity and Specificity’’[MeSH] OR ‘‘mass screening’’[MeSH] OR ‘‘Reference values’’[MeSH] OR specificit*[tw] OR screening[tw] OR false positive*[tw] OR false negative*[tw] OR accuracy[tw] OR predictive value*[tw] OR reference value*[tw] OR roc*[tw] OR likelihood ratio*[tw] | PubMed | 51

Vincent 2003

Strategy A

ORIGINAL

1. exp 'sensitivity and specificity'/

2. (sensitivity or specificity or accuracy).tw.

3. ((predictive adj3 value$) or (roc adj curve$)).tw.

4. ((false adj positiv$) or (false negativ$)).tw.

5. ((observer adj variation$) or (likelihood adj3 ratio$)).tw.

6. likelihood function/

7. exp mass screening/

8. diagnosis, differential/ or exp Diagnostic errors/

9. di.xs or du.fs

10. or/1-9

    
Ritchie 2007 | NR | Ovid | 87 | 3.3
Mitchell 2005

1. exp “Sensitivity and Specificity”/

2. (sensitivity or specificity or accuracy).tw.

3. ((predictive adj3 value$) or (roc adj curve$)).tw.

4. ((false adj positiv$) or (false negativ$)).tw.

5. ((observer adj variation$) or (likelihood adj3 ratio$)).tw.

6. Likelihood Function/

7. exp Mass Screening/

8. Diagnosis, Differential/ or exp Diagnostic Errors/

9. di.xs or du.fs

10. or/1-9

Ovid | 81 | 5.5

Vincent 2003

Strategy C

ORIGINAL

1. exp ‘sensitivity and specificity’/

2. sensitivity.tw. or specificity.tw.

3. (predictive adj3 value$).tw.

4. exp Diagnostic errors/

5. ((false adj positive$) or (false adj negative$)).tw.

6. (observer adj variation$).tw.

7. (roc adj curve$).tw.

8. (likelihood adj3 ratio$).tw.

9. likelihood function/

10. exp *venous thrombosis/di, ra, ri, us

11. exp *thrombophlebitis/di, ra, ri, us

12. or/1-11

    
Leeflang 2006 | ‘‘sensitivity and specificity’’[MeSH] OR sensitivity[tw] OR specificity[tw] OR predictive value*[tw] OR false positiv*[tw] OR false negativ*[tw] OR observer variation*[tw] OR roc curve*[tw] OR likelihood ratio*[tw] OR ‘‘Likelihood Functions’’[MeSH] | PubMed | 44
Whiting 2010

exp "sensitivity and specificity"/

Sensitivity.tw.

Specificity.tw.

(predictive adj3 value$).tw.

Exp diagnostic errors/

(false adj positiv$).tw.

(false adj negativ$).tw.

(observer adj variation$).tw.

(roc adj curve$).tw.

(likelihood adj3 ratio$).tw.

Likelihood functions/

Ovid | 67 | 9 | NNR 12
Noel-Storr 2011

exp "sensitivity and specificity"/

Sensitivity.tw.

Specificity.tw.

(predictive adj3 value$).tw.

Exp diagnostic errors/

(false adj positiv$).tw.

(false adj negativ$).tw.

(observer adj variation$).tw.

(roc adj curve$).tw.

(likelihood adj3 ratio$).tw.

Likelihood functions/

Ovid

54

(45-63)

2.30

(1.79-2.89)

 

van der Weijden 1997

Sensitive

ORIGINAL

MeSH terms

explode DIAGNOSIS/all.s

explode SENSITIVITY-AND-SPECIFICITY

REFERENCE-VALUES/all.s

FALSE-NEGATIVE-REACTIONS/all.s

FALSE-POSITIVE-REACTIONS/all.s

explode MASS-SCREENING/all.s

Freetext terms

diagnos*

sensitivity or specificity

predictive value*

reference value*

ROC*

Likelihood ratio*

monitoring

    
Leeflang 2006 | "Diagnosis"[MeSH] OR "sensitivity and specificity"[MeSH] OR "Reference values"[MeSH] OR "False Positive Reactions"[MeSH] OR "False Negative Reactions"[MeSH] OR "Mass Screening"[MeSH] OR diagnos* OR sensitvity OR specificity OR predictive value* OR reference value* OR ROC* OR likelihood ratio* OR monitoring | PubMed | 92
Doust 2005 | Diagnosis [subheading]
Sensitivity and Specificity [MeSH]
Sensitivity [tw]
Specificity [tw]
Diagnosis differential [MeSH]
Reference values [MeSH]
False negative reactions [MeSH]
False positive reactions [MeSH]
Mass screening [MeSH]
diagnos* [tw]
predictive value [tw]
reference value* [tw]
ROC* [tw]
CD-ROM Ovid | Error noted in strategy – original does not include Diagnosis differential [MeSH] and Doust has omitted to add likelihood ratio* and monitoring textwords
73 | 4 | Methodological & content filter for TSR
100 | 4 | Methodological & content filter for NPSR
91 |  | Methodological filter for TSR
100 |  | Methodological filter for NPSR
Whiting 2010

Exp diagnosis/

exp "sensitivity and specificity"/

Reference values/

False negative reactions/

False positive reactions/

Exp Mass screening/

Diagnos$.ti,ab.

Sensitivity.ti,ab.

Specificity.ti,ab.

Predictive value$.ti,ab.

Reference value$.ti,ab.

Roc$.ti,ab.

Likelihood ratio$.ti,ab.

Monitoring.ti,ab.

Ovid | 87 | 2 | NNR 50
Noel-Storr 2011

Exp diagnosis/

exp "sensitivity and specificity"/

Reference values/

False negative reactions/

False positive reactions/

Exp Mass screening/

Diagnos$.ti,ab.

Sensitivity.ti,ab.

Specificity.ti,ab.

Predictive value$.ti,ab.

Reference value$.ti,ab.

Roc$.ti,ab.

Likelihood ratio$.ti,ab.

Monitoring.ti,ab.

Ovid

93

(87-97)

0.98

(0.80-1.17)

 
Mitchell 2005

exp Diagnosis/

exp "Sensitivity and Specificity"/

Reference Values/

False Negative Reactions/

False Positive Reactions/

exp Mass Screening/

diagnos$.ti,ab.

sensitivity.ti,ab.

specificity.ti,ab.

predictive value$.ti,ab.

reference value$.ti,ab.

roc$.ti,ab.

likelihood ratio$.ti,ab.

monitoring.ti,ab.

Ovid | 96 | 5.6
CASP 2002 $ ORIGINAL

sensitivity-specificity (s)

sensitivity (t)

di.fs.

ri.fs

du.fs

specificity (t)

    
Ritchie 2007 | NR | Ovid | 73 | 1.2
Kassai 2006 | NR | PubMed | 95
Whiting 2010 | “sensitivity and specificity”/
Sensitivity.ti,ab.
di.fs.
Ri.fs.
Du.fs.
Specificity.ti,ab.
Ovid

83

(78-95)

3

(1-24)

NNR 29

(4-89)

Vincent 2003

1 exp ‘sensitivity and specificity/

2 sensitivity.tw.

3 di.xs.

4 du.fs.

5 specificity.tw.

6 or/1-5

NR | 100
Noel-Storr 2011 | “sensitivity and specificity”/
Sensitivity.ti,ab.
di.fs.
Ri.fs.
Du.fs.
Specificity.ti,ab.
Ovid

67

(58-75)

0.97

(0.77-1.19)

 
InterTASC 2011 Aberdeen$ ORIGINAL

MeSH
Exp sensitivity and specificity/
False positive reactions/
False negative reactions/
Du.fs
Text words .tw
Sensitivity
Distinguish$

Differentiat$
enhancement
Predictive adj4 value$
Identif$
Detect$
Diagnos$
Compar$

    
Ritchie 2007 | NR | Ovid | 69 | 1.2
Whiting 2010

exp "sensitivity and specificity"/
False positive reactions/
False negative reactions/
Du.fs.
Sensitivity.tw.
(Predictive adj4 value$).tw.
Distinguish$.tw.

Differential$.tw.

Enhancement.tw.
Identif$.tw.
Detect$.tw.
Diagnos$.tw.
Compare$.t

Ovid

86

(81-94)

3

(1-19)

NNR 35

(5-97)

Noel-Storr 2011

exp "sensitivity and specificity"/
False positive reactions/
False negative reactions/
Du.fs.
Sensitivity.tw.
(Predictive adj4 value$).tw.
Distinguish$.tw.

Differential$.tw.

Enhancement.tw.
Identif$.tw.
Detect$.tw.
Diagnos$.tw.
Compare$.t

Ovid

87

(80-92)

0.95

(0.78-1.14)

 

InterTASC 2011 Southampton A$

Unclear how terms combined

ORIGINAL MeSH
Exp sensitivity and specificity/
False positive reactions/
False negative reactions
Exp diagnosis/
Reference-values
Exp mass screening/
Text words
Diagnos*
Sensitivity
Specificity
‘sensitivity and specificity’
predictive value*
Reference value*
Roc
Roc in AD (NOT)
Likelihood ratio*
Monitoring
    
Ritchie 2007 | NR | Ovid | 71 | 1.0
Whiting 2010

Exp diagnosis/

exp "sensitivity and specificity"/

Reference values/

False negative reactions/

False positive reactions/

Exp mass screening/

Diagnos$.mp.

Sensitivity.mp.

Specificity.mp.

Predictive value$.mp.

Reference value$.mp.

Roc.mp. NOT roc.in.

Likelihood ratio$.mp.

Monitoring.mp.

Ovid | 86 | 2 | NNR 51
Noel-Storr 2011

Exp diagnosis/

exp "sensitivity and specificity"/

Reference values/

False negative reactions/

False positive reactions/

Exp mass screening/

Diagnos$.mp.

Sensitivity.mp.

Specificity.mp.

Predictive value$.mp.

Reference value$.mp.

Roc.mp. NOT roc.in.

Likelihood ratio$.mp.

Monitoring.mp.

Ovid

93

(87-97)

0.96

(0.80-1.15)

 

InterTASC 2011 Southampton B$

Unclear how terms combined

ORIGINAL MeSH
Exp sensitivity and specificity/
Text words
Specificity
False negative
Accuracy
screening
    
Ritchie 2007 | NR | Ovid | 45 | 4.6
Whiting 2010

exp "sensitivity and specificity"/

Specificity.mp.

False negative.mp.

Accuracy.mp.

Screening.mp.

Ovid | 69 | 7 | NNR 14
Noel-Storr 2011

exp "sensitivity and specificity"/

Specificity.mp.

False negative.mp.

Accuracy.mp.

Screening.mp.

Ovid

55

(46-64)

2.09

(1.63-2.63)

 

InterTASC 2011

Southampton C$

Unclear how terms combined

ORIGINAL MeSH
Exp sensitivity and specificity/
Text words ti,ab,mesh
Predictive and value
    
Ritchie 2007 | NR | Ovid | 31 | 8.5
Whiting 2010

exp "sensitivity and specificity"/

(Predictive and value$).ti,ab,sh.

Ovid | 56 | 11 | NNR 9
Noel-Storr 2011

exp "sensitivity and specificity"/

(Predictive and value$).ti,ab,sh.

Ovid

51

(42-60)

3.04

(2.36-3.86)

 

InterTASC 2011 Southampton D$

Unclear how terms combined

ORIGINAL MeSH
Exp sensitivity and specificity/
Exp diagnosis/
Exp pathology/
Text words
Sensitivity
Specificity
    
Ritchie 2007 | NR | Ovid | 66 | 1.1
Whiting 2010

exp "sensitivity and specificity"/

Exp diagnosis/

Exp pathology/

Sensitivity.mp.

Specificity.mp.

Ovid | 84 | 2 | NNR 48
Noel-Storr 2011

exp "sensitivity and specificity"/

Exp diagnosis/

Exp pathology/

Sensitivity.mp.

Specificity.mp.

Ovid

89

(82-94)

1.13

(0.93-1.35)

 

InterTASC 2011 Southampton E$

Unclear how terms combined

ORIGINAL MeSH
Exp Diagnosis/
Exp sensitivity and specificity
False positive reactions/
False negative reactions/
Text words ti,ab
Diagnos$ ti,ab hw
Specificit$
Sensitivit$
Predictive value$
Roc
Sroc
Receiver operat$ characteristic$
Receiver operat$ adj2 curve
False positiv$
False negative$
accuracy
    
Ritchie 2007 | NR | Ovid | sensitivity 71 | precision 1.0
Whiting 2010

Exp diagnosis/

exp "sensitivity and specificity"/

False positive reactions/

False negative reactions/

Diagnos$.ti,ab,hw.

Specificit$.ti,ab.

Sensitivit$.ti,ab.

Predictive value$.ti,ab.

Roc.ti,ab.

Sroc.ti,ab.

Receiver operat$ characteristic$.ti,ab.

(Receiver operat$ adj2 curve).ti,ab.

False positiv$.ti,ab.

False negative$.ti,ab.

Accuracy.ti,ab.

Ovid | sensitivity 87 | precision 2 | NNR 50
Noel-Storr 2011

Exp diagnosis/

exp "sensitivity and specificity"/

False positive reactions/

False negative reactions/

Diagnos$.ti,ab,hw.

Specificit$.ti,ab.

Sensitivit$.ti,ab.

Predictive value$.ti,ab.

Roc.ti,ab.

Sroc.ti,ab.

Receiver operat$ characteristic$.ti,ab.

(Receiver operat$ adj2 curve).ti,ab.

False positiv$.ti,ab.

False negative$.ti,ab.

Accuracy.ti,ab.

Ovid | sensitivity 92 (86-96) | 0.98 (0.81-1.17)

InterTASC 2011

CRD A

Unclear how terms combined

ORIGINAL MeSH
Exp sensitivity and specificity/ all subheadings
Exp diagnostic errors/ all subheadings
Text Words .ti,ab
Predictive value*
Reproducibility
Logistic regression
Ability near predict*
Logistic model*
Sroc
Roc
Positive rate
Positive rates
Likelihood ratio*
Negative rate
Negative rates
Receiver operating characteristic
Correlation
Correlated
Test or tests near accuracy
Curve
Curves
Test outcome
Pretest probabilities
Posttest probabilities
Roc-curve.mp
Logistic-models.mp
Likelihood-functions.mp
diagnosis
    
Ritchie 2007 | NR | Ovid | sensitivity 53 | precision 2.2
Whiting 2010

exp "sensitivity and specificity"/

Exp diagnostic errors/

Predictive value$.ti,ab.

Reproducibility.ti,ab.

Logistic regression.ti,ab.

(ability adj5 predict$).ti,ab.

Logistic model$.ti,ab.

Sroc.ti,ab.

Roc.ti,ab.

Positive rate.ti,ab.

Positive rates.ti,ab.

Likelihood ratio$.ti,ab.

Negative rate.ti,ab.

Negative rates.ti,ab.

Receiver operating characteristic.ti,ab.

correlation.ti,ab.

correlated.ti,ab.

((test or tests) adj5 accuracy).ti,ab.

curve.ti,ab.

curves.ti,ab.

Test outcome.ti,ab.

Pretest probabilities.ti,ab.

Posttest probabilities.ti,ab.

Roc curve.mp.

Logistic models.mp.

Likelihood functions.mp.

diagnosis.ti,ab.

Ovid | sensitivity 73 | precision 4 | NNR 26
Noel-Storr 2011

exp "sensitivity and specificity"/

Exp diagnostic errors/

Predictive value$.ti,ab.

Reproducibility.ti,ab.

Logistic regression.ti,ab.

(ability adj5 predict$).ti,ab.

Logistic model$.ti,ab.

Sroc.ti,ab.

Roc.ti,ab.

Positive rate.ti,ab.

Positive rates.ti,ab.

Likelihood ratio$.ti,ab.

Negative rate.ti,ab.

Negative rates.ti,ab.

Receiver operating characteristic.ti,ab.

correlation.ti,ab.

correlated.ti,ab.

((test or tests) adj5 accuracy).ti,ab.

curve.ti,ab.

curves.ti,ab.

Test outcome.ti,ab.

Pretest probabilities.ti,ab.

Posttest probabilities.ti,ab.

Roc curve.mp.

Logistic models.mp.

Likelihood functions.mp.

diagnosis.ti,ab.

Ovid | sensitivity 70 (62-78) | 1.23 (0.99-1.50)

InterTASC 2011

CRD B

Unclear how terms combined

ORIGINAL

MeSH
Exp sensitivity and specificity/
Predictive value of tests/
Logistic models/
Roc curve/
Likelihood functions/
Reference standards/
Reference values/
Severity of illness index/
Reproducibility of results/
Observer variation/
Decision making/
Text words ti,ab
Diagnos* near5 efficac*
Diagnos* near5 efficien*
Diagnos* near5 effective*
Diagnos* near5 accura*
Diagnos* near5 correct*
Diagnos* near5 reliable
Diagnos* near5 reliability
Diagnos* near5 error*
Diagnos* near5 mistake*
Diagnos* near5 inaccura*
Diagnos* near5 incorrect
Diagnos* near5 unreliable
Decision making
Sensitivity near5 test
Sensitivity near5 tests
Specificity near5 test

Specificity near5 tests
Predictive standard*
Predictive value*
Predictive model*
Predictive factor*
Roc
Reliability near2 standard*
Reliability near2 score*
Reliability near2 tool*
Reliability near2 aid
Reliability near2 aids
Performance near2 test
Performance near2 tests
Performance near2 testing
Performance near2 standard*
Performance near2 score*
Performance near2 tool*
Performance near2 aid
Performance near2 aids
Reference value*
sroc
Receiver operat* characteristic
Receiver operat* curve
Likelihood ratio*

    
Ritchie 2007 | NR | Ovid | sensitivity 40 | precision 4.1
Whiting 2010

exp "sensitivity and specificity"/

Predictive value of tests/

Logistic models/

Roc curve/

Likelihood functions/

Reference standards/

Reference values/

Severity of illness index/

Reproducibility of results/

Observer variation/

Decision making/

(Diagnos$ adj5 efficac$).ti,ab.

(Diagnos$ adj5 efficien$).ti,ab.

(Diagnos$ adj5 effective$).ti,ab.

(Diagnos$ adj5 accura$).ti,ab.

(Diagnos$ adj5 correct$).ti,ab.

(Diagnos$ adj5 reliable).ti,ab.

(Diagnos$ adj5 reliability).ti,ab.

(Diagnos$ adj5 error$).ti,ab.

(Diagnos$ adj5 mistake$).ti,ab.

(Diagnos$ adj5 inaccura$).ti,ab.

(Diagnos$ adj5 incorrect).ti,ab.

(Diagnos$ adj5 unreliable).ti,ab.

Decision making.ti,ab.

(sensitivity adj5 test).ti,ab.

(sensitivity adj5 tests).ti,ab.

(specificity adj5 test).ti,ab.

(specificity adj5 tests).ti,ab.

Predictive standard$.ti,ab.

Predictive value$.ti,ab.

Predictive model$.ti,ab.

Predictive factor$.ti,ab.

Roc.ti,ab.

Receiver operat$ characteristic.ti,ab.

Receiver operat$ curve.ti,ab.

Likelihood ratio$.ti,ab.

Likelihood function.ti,ab.

(false adj2 reaction$).ti,ab.

False positive$.ti,ab.

False negative$.ti,ab.

Gold standard$.ti,ab.

Reference test.ti,ab.

Reference tests.ti,ab.

Reference standard$.ti,ab.

Criter$ standard$.ti,ab.

Criter$ bias.ti,ab.

Criter$ test.ti,ab.

Criter$ tests.ti,ab.

Validat$ standard$.ti,ab.

Validat$ test.ti,ab.

Validat$ tests.ti,ab.

Validat$ bias.ti,ab.

Verificat$ bias.ti,ab.

Work?up bias.ti,ab.

Expectation bias.ti,ab.

Indeterminate result$.ti,ab.

(observer adj2 bias) .ti,ab.

(observer adj10 different) .ti,ab.

Observer variat$.ti,ab.

Interrater reliability.ti,ab.

Interater reliability.ti,ab.

Observer reliability.ti,ab.

(intra$ adj4 reliability) .ti,ab.

(accura$ adj2 test).ti,ab.

(accura$ adj2 tests).ti,ab.

(accura$ adj2 testing).ti,ab.

(accura$ adj2 standard$).ti,ab.

(accura$ adj2 score$).ti,ab.

(accura$ adj2 tool$).ti,ab.

(accura$ adj2 aid).ti,ab.

(accura$ adj2 aids).ti,ab.

(reliability adj2 test).ti,ab.

(reliability adj2 tests).ti,ab.

(reliability adj2 testing).ti,ab.

(reliability adj2 standard$).ti,ab.

(reliability adj2 score$).ti,ab.

(reliability adj2 tool$).ti,ab.

(reliability adj2 aid).ti,ab.

(reliability adj2 aids).ti,ab.

(performance adj2 test).ti,ab.

(performance adj2 tests).ti,ab.

(performance adj2 testing).ti,ab.

(performance adj2 standard$).ti,ab.

(performance adj2 score$).ti,ab.

(performance adj2 tool$).ti,ab.

(performance adj2 aid).ti,ab.

(performance adj2 aids).ti,ab.

Reference value$.ti,ab.

Sroc.ti,ab.

Ovid | sensitivity 64 | precision 7 | NNR 15
Noel-Storr 2011

exp "sensitivity and specificity"/

Predictive value of tests/

Logistic models/

Roc curve/

Likelihood functions/

Reference standards/

Reference values/

Severity of illness index/

Reproducibility of results/

Observer variation/

Decision making/

(Diagnos$ adj5 efficac$).ti,ab.

(Diagnos$ adj5 efficien$).ti,ab.

(Diagnos$ adj5 effective$).ti,ab.

(Diagnos$ adj5 accura$).ti,ab.

(Diagnos$ adj5 correct$).ti,ab.

(Diagnos$ adj5 reliable).ti,ab.

(Diagnos$ adj5 reliability).ti,ab.

(Diagnos$ adj5 error$).ti,ab.

(Diagnos$ adj5 mistake$).ti,ab.

(Diagnos$ adj5 inaccura$).ti,ab.

(Diagnos$ adj5 incorrect).ti,ab.

(Diagnos$ adj5 unreliable).ti,ab.

Decision making.ti,ab.

(sensitivity adj5 test).ti,ab.

(sensitivity adj5 tests).ti,ab.

(specificity adj5 test).ti,ab.

(specificity adj5 tests).ti,ab.

Predictive standard$.ti,ab.

Predictive value$.ti,ab.

Predictive model$.ti,ab.

Predictive factor$.ti,ab.

Roc.ti,ab.

Receiver operat$ characteristic.ti,ab.

Receiver operat$ curve.ti,ab.

Likelihood ratio$.ti,ab.

Likelihood function.ti,ab.

(false adj2 reaction$).ti,ab.

False positive$.ti,ab.

False negative$.ti,ab.

Gold standard$.ti,ab.

Reference test.ti,ab.

Reference tests.ti,ab.

Reference standard$.ti,ab.

Criter$ standard$.ti,ab.

Criter$ bias.ti,ab.

Criter$ test.ti,ab.

Criter$ tests.ti,ab.

Validat$ standard$.ti,ab.

Validat$ test.ti,ab.

Validat$ tests.ti,ab.

Validat$ bias.ti,ab.

Verificat$ bias.ti,ab.

Work?up bias.ti,ab.

Expectation bias.ti,ab.

Indeterminate result$.ti,ab.

(observer adj2 bias) .ti,ab.

(observer adj10 different) .ti,ab.

Observer variat$.ti,ab.

Interrater reliability.ti,ab.

Interater reliability.ti,ab.

Observer reliability.ti,ab.

(intra$ adj4 reliability) .ti,ab.

(accura$ adj2 test).ti,ab.

(accura$ adj2 tests).ti,ab.

(accura$ adj2 testing).ti,ab.

(accura$ adj2 standard$).ti,ab.

(accura$ adj2 score$).ti,ab.

(accura$ adj2 tool$).ti,ab.

(accura$ adj2 aid).ti,ab.

(accura$ adj2 aids).ti,ab.

(reliability adj2 test).ti,ab.

(reliability adj2 tests).ti,ab.

(reliability adj2 testing).ti,ab.

(reliability adj2 standard$).ti,ab.

(reliability adj2 score$).ti,ab.

(reliability adj2 tool$).ti,ab.

(reliability adj2 aid).ti,ab.

(reliability adj2 aids).ti,ab.

(performance adj2 test).ti,ab.

(performance adj2 tests).ti,ab.

(performance adj2 testing).ti,ab.

(performance adj2 standard$).ti,ab.

(performance adj2 score$).ti,ab.

(performance adj2 tool$).ti,ab.

(performance adj2 aid).ti,ab.

(performance adj2 aids).ti,ab.

Reference value$.ti,ab.

Sroc.ti,ab.

Ovid | sensitivity 67 (58-75) | 1.69 (1.40-2.10)

InterTASC 2011

CRD C

Unclear how terms combined

ORIGINAL

MeSH
Exp Sensitivity and specificity/
False positive reactions/
False negative reactions/
Logistic models/
Roc curve/
Likelihood functions/
diagnosis/
Exp diagnostic errors/
Exp diagnostic techniques and procedures/
Exp laboratory techniques and procedures/
Text words ti,ab
Specificit$
Sensitivit$
False negative$
False positive$
True negative$
True positive$
Positive rate$
Negative rate$
Screening
Accuracy
Reference value$
Likelihood ratio$
Sroc
Srocs
Roc
Rocs
Receiver operat$ curve$
Receiver operat$ character$
Diagnos$ adj3 efficac$
Diagnos$ adj3 efficien$
Diagnos$ adj3 effectiv$

Diagnos$ adj3 accura$
Diagnos$ adj3 correct$
Diagnos$ adj3 reliable
Diagnos$ adj3 reliability

Diagnos$ adj3 error$
Diagnos$ adj3 mistake$
Diagnos$ adj3 inaccura$
Diagnos$ adj3 incorrect
Diagnos$ adj3 unreliable
Diagnostic yield.mp
Misdiagnos$
Reproductivity.mp
Logistical regression.mp
Logistical model$
Ability adj2 predict$
Reliable adj3 test
Reliable adj3 tests
Reliable adj3 testing
Reliable adj3 standard
Reliability adj3 test
Reliability adj3 tests
Reliability adj3 testing
Reliability adj3 standard
Performance adj3 test
Performance adj3 tests
Performance adj3 testing
Performance adj3 standard$
Predictive adj value$
Predictive adj standard$
Predictive adj model$
Predictive adj factor$
Reference adj test
Reference adj tests
Reference adj testing
Index adj test
Index adj tests
Index adj testing

    
Ritchie 2007 | NR | Ovid | sensitivity 69 | precision 1.2
Whiting 2010

exp "sensitivity and specificity"/

False positive reactions/

False negative reactions/

Logistic models/

Roc curve/

Likelihood functions/

Diagnosis/

Exp diagnostic errors/

exp "Diagnostic Techniques and Procedures"/

exp "laboratory techniques and procedures"/

Specificit$.ti,ab.

Sensitivity$.ti,ab.

False negative$.ti,ab.

False positive$.ti,ab.

True negative$.ti,ab.

True positive$.ti,ab.

Positive rate$.ti,ab.

Negative rate$.ti,ab.

Screening.ti,ab.

Accuracy.ti,ab.

Reference value$.ti,ab.

Likelihood ratio$.ti,ab.

Sroc.ti,ab.

Srocs.ti,ab.

Roc.ti,ab.

Rocs.ti,ab.

Receiver operat$ curve$.ti,ab.

Receiver operat$ character$.ti,ab.

(Diagnos$ adj3 efficac$).ti,ab.

(Diagnos$ adj3 efficien$).ti,ab.

(Diagnos$ adj3 effectiv$).ti,ab.

(Diagnos$ adj3 accura$).ti,ab.

(Diagnos$ adj3 correct$).ti,ab.

(Diagnos$ adj3 reliable).ti,ab.

(Diagnos$ adj3 reliability).ti,ab.

(Diagnos$ adj3 error$).ti,ab.

(Diagnos$ adj3 mistake$).ti,ab.

(Diagnos$ adj3 inaccura$).ti,ab.

(Diagnos$ adj3 incorrect$).ti,ab.

(Diagnos$ adj3 unreliable).ti,ab.

Diagnostic yield.mp.

Misdiagnos$.ti,ab.

Reproductivity.mp.

Logistical regression.mp.

Logistical model$.ti,ab.

(ability adj2 predict$).ti,ab.

(reliable adj3 test).ti,ab.

(reliable adj3 tests).ti,ab.

(reliable adj3 testing).ti,ab.

(reliable adj3 standard).ti,ab.

(reliability adj3 test).ti,ab.

(reliability adj3 tests).ti,ab.

(reliability adj3 testing).ti,ab.

(reliability adj3 standard).ti,ab.

(performance adj3 test).ti,ab.

(performance adj3 tests).ti,ab.

(performance adj3 testing).ti,ab.

(performance adj3 standard$).ti,ab.

(Predictive adj value$).ti,ab.

(Predictive adj standard$).ti,ab.

(Predictive adj model$).ti,ab.

(Predictive adj factor$).ti,ab.

(Reference adj test).ti,ab.

(Reference adj tests).ti,ab.

(Reference adj testing).ti,ab.

(index adj test).ti,ab.

(index adj tests).ti,ab.

(index adj testing).ti,ab.

Ovid | sensitivity 85 | precision 2 | NNR 46
Noel-Storr 2011

exp "sensitivity and specificity"/

False positive reactions/

False negative reactions/

Logistic models/

Roc curve/

Likelihood functions/

Diagnosis/

Exp diagnostic errors/

exp "Diagnostic Techniques and Procedures"/

exp "laboratory techniques and procedures"/

Specificit$.ti,ab.

Sensitivity$.ti,ab.

False negative$.ti,ab.

False positive$.ti,ab.

True negative$.ti,ab.

True positive$.ti,ab.

Positive rate$.ti,ab.

Negative rate$.ti,ab.

Screening.ti,ab.

Accuracy.ti,ab.

Reference value$.ti,ab.

Likelihood ratio$.ti,ab.

Sroc.ti,ab.

Srocs.ti,ab.

Roc.ti,ab.

Rocs.ti,ab.

Receiver operat$ curve$.ti,ab.

Receiver operat$ character$.ti,ab.

(Diagnos$ adj3 efficac$).ti,ab.

(Diagnos$ adj3 efficien$).ti,ab.

(Diagnos$ adj3 effectiv$).ti,ab.

(Diagnos$ adj3 accura$).ti,ab.

(Diagnos$ adj3 correct$).ti,ab.

(Diagnos$ adj3 reliable).ti,ab.

(Diagnos$ adj3 reliability).ti,ab.

(Diagnos$ adj3 error$).ti,ab.

(Diagnos$ adj3 mistake$).ti,ab.

(Diagnos$ adj3 inaccura$).ti,ab.

(Diagnos$ adj3 incorrect$).ti,ab.

(Diagnos$ adj3 unreliable).ti,ab.

Diagnostic yield.mp.

Misdiagnos$.ti,ab.

Reproductivity.mp.

Logistical regression.mp.

Logistical model$.ti,ab.

(ability adj2 predict$).ti,ab.

(reliable adj3 test).ti,ab.

(reliable adj3 tests).ti,ab.

(reliable adj3 testing).ti,ab.

(reliable adj3 standard).ti,ab.

(reliability adj3 test).ti,ab.

(reliability adj3 tests).ti,ab.

(reliability adj3 testing).ti,ab.

(reliability adj3 standard).ti,ab.

(performance adj3 test).ti,ab.

(performance adj3 tests).ti,ab.

(performance adj3 testing).ti,ab.

(performance adj3 standard$).ti,ab.

(Predictive adj value$).ti,ab.

(Predictive adj standard$).ti,ab.

(Predictive adj model$).ti,ab.

(Predictive adj factor$).ti,ab.

(Reference adj test).ti,ab.

(Reference adj tests).ti,ab.

(Reference adj testing).ti,ab.

(index adj test).ti,ab.

(index adj tests).ti,ab.

(index adj testing).ti,ab.

Ovid | sensitivity 90 (83-94) | 1.15 (0.95-1.38)

InterTASC 2011 HTBS

Unclear how terms combined

ORIGINAL

MeSH
Exp Sensitivity and specificity/
Exp Diagnostic errors/
Likelihood functions/
Reproducibility of results/
Text words .tw
Sensitivit$
Specificit$
Accurac$
Predictive adj2 value$

False$ adj2 positive$
False$ adj2 negative$
False$ adj2 rate$
roc
Receiver operat$ adj2 curve$
Receiver operat$ characteristic$
Likelihood$ adj2 ratio$
Likelihood$ adj2 function$

    
Ritchie 2007 | NR | Ovid | sensitivity 46 | precision 3.7
Whiting 2010

exp "sensitivity and specificity"/

Exp diagnostic errors/

Likelihood functions/

Reproducibility of results/

Sensitivity$.tw.

Specificit$.tw.

Accuracy$.tw.

(Predictive adj2 value$).tw.

(False$ adj2 positive$).tw.

(false$ adj2 negative$).tw.

(false$ adj2 rate$).tw.

Roc.tw.

(receiver operat$ adj2 curve$).tw.

(receiver operat$ characteristic$).tw.

(likelihood$ adj2 ratio$).tw.

(likelihood$ adj2 function$).tw.

Ovid | sensitivity 69 | precision 8 | NNR 12
Noel-Storr 2011

exp "sensitivity and specificity"/

Exp diagnostic errors/

Likelihood functions/

Reproducibility of results/

Sensitivity$.tw.

Specificit$.tw.

Accuracy$.tw.

(Predictive adj2 value$).tw.

(False$ adj2 positive$).tw.

(false$ adj2 negative$).tw.

(false$ adj2 rate$).tw.

Roc.tw.

(receiver operat$ adj2 curve$).tw.

(receiver operat$ characteristic$).tw.

(likelihood$ adj2 ratio$).tw.

(likelihood$ adj2 function$).tw.

Ovid | sensitivity 56 (47-65) | 2.04 (1.60-2.57)
Shipley Miner 2002 $

ORIGINAL

1 exp "sensitivity and specificity"/
2 (sensitivity or specificity).ti,ab.
3 likelihood functions/
4 exp diagnostic errors/
5 area under curve/
6 reproducibility of results/
7 (predictive adj value$1).ti,ab.
8 (likelihood adj ratio$1).ti,ab.
9 (false adj (negative$1 or positive$1)).ti,ab.
10 diagnosis, differential/
11 random allocation/
12 random$.ti,ab.

13 ((single or double or triple) adj blind$3).ti,ab.
14 double blind method/ or single blind method/
15 (randomized controlled trial or controlled clinical trial).pt.
16 practice guideline.pt.
17 consensus development conference$.pt.
18 1 or 2 or 8 or 3
19 or/1-17

    
Ritchie 2007 | NR | Ovid | sensitivity 48 | precision 1.8
Whiting 2010

exp "sensitivity and specificity"/

(sensitivity or specificity).ti,ab.

Likelihood functions/

Exp diagnostic errors/

Area under curve/

Reproducibility of results/

(predictive adj value$1).ti,ab.

(likelihood adj ratio$1).ti,ab.

(false adj (negative$1 or positive$1)).ti,ab.

Diagnosis, differential/

Random allocation/

Random$.ti,ab.

((single or double or triple) adj blind$3).ti,ab.

Double blind method/

Single blind method/

Randomized controlled trial.pt.

Controlled clinical trial.pt.

Practice guideline.pt.

Consensus development conference$.pt.

Ovid | sensitivity 72 | precision 5 | NNR 19
Noel-Storr 2011

exp "sensitivity and specificity"/

(sensitivity or specificity).ti,ab.

Likelihood functions/

Exp diagnostic errors/

Area under curve/

Reproducibility of results/

(predictive adj value$1).ti,ab.

(likelihood adj ratio$1).ti,ab.

(false adj (negative$1 or positive$1)).ti,ab.

Diagnosis, differential/

Random allocation/

Random$.ti,ab.

((single or double or triple) adj blind$3).ti,ab.

Double blind method/

Single blind method/

Randomized controlled trial.pt.

Controlled clinical trial.pt.

Practice guideline.pt.

Consensus development conference$.pt.

Ovid | sensitivity 63 (54-72) | 1.71 (1.35-2.12)
University of Rochester 2002 $ ORIGINAL

Unable to access – website no longer valid
Vincent 2003

1 exp ‘sensitivity and specificity’/

2 false negative reactions/ or false positive reactions/

3 (sensitivity or specificity).ti,ab.

4 (predictive adj value$1).ti,ab.

5 (likelihood adj ratio$1).ti,ab.

6 (false adj (negative$1 or positive$1)).ti,ab.

7 or/1-6

NR | 79
North Thames 2002 ORIGINAL

Unable to access – website no longer valid
Vincent 2003

1 exp ‘sensitivity and specificity’

2 exp diagnostic errors

3 mass screening

4 or/1-3

NR | 53

Abbreviations used: TSR = Tympanometry systematic review; NPSR = Natriuretic peptides systematic review.

$ Filter no longer available from source cited by evaluation studies.
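The performance figures reported for each filter above (sensitivity, precision, and NNR, the number needed to read) are related in a simple way: NNR is the reciprocal of precision. A minimal sketch of the calculations, using hypothetical counts that are not taken from any of the studies above:

```python
# Sketch of the performance measures reported for each filter:
# sensitivity = share of gold-standard (DTA) records retrieved;
# precision = share of retrieved records that are gold standard;
# NNR = records that must be screened to find one relevant record.

def filter_performance(gold_retrieved: int, gold_total: int, total_retrieved: int):
    sensitivity = 100.0 * gold_retrieved / gold_total      # per cent
    precision = 100.0 * gold_retrieved / total_retrieved   # per cent
    nnr = total_retrieved / gold_retrieved                 # 1 / precision
    return sensitivity, precision, nnr

# Hypothetical example: a filter retrieves 86 of 100 gold-standard
# records, and 4300 records in total.
sens, prec, nnr = filter_performance(86, 100, 4300)
print(f"sensitivity {sens:.0f}%, precision {prec:.0f}%, NNR {nnr:.0f}")
# sensitivity 86%, precision 2%, NNR 50
```

A filter with high sensitivity but low precision therefore still leaves a heavy screening workload, which is why the NNR column is reported alongside sensitivity.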

What's new

Date | Event | Description
30 November 2011 | Amended | Updated the protocol and added an author
3 July 2009 | Amended | Updated the protocol with other authors and revised text
27 December 2007 | Amended | Converted to new review format

History

Protocol first published: Issue 2, 2006
Review first published: Issue 9, 2013

Date | Event | Description
20 February 2007 | New citation required and major changes | Substantive amendment

Contributions of authors

Rebecca Beynon designed the study, ran literature searches, screened literature searches, extracted data, synthesized data and drafted the manuscript. Julie Glanville devised and ran literature searches and drafted the manuscript. Mariska Leeflang designed the study, screened literature searches and edited the manuscript. Ruth Mitchell devised and ran literature searches, screened literature searches and edited the manuscript. Anne Eisinga devised and ran literature searches, and edited the manuscript. Steve McDonald devised and ran literature searches, screened literature searches and edited the manuscript. Penny Whiting edited the manuscript.

Declarations of interest

Julie Glanville, together with colleagues from the InterTASC Information Specialist Subgroup, developed the Search Filter Appraisal Checklist that is used in this review for the methodological assessment of the included studies and has published search filters. Julie Glanville, Mariska Leeflang, Ruth Mitchell, Rebecca Beynon and Penny Whiting have published performance evaluations of search filters.

Sources of support

Internal sources

  • National Institute for Health Research, UK.

    Incentive award for completion of review

External sources

  • No sources of support supplied

Characteristics of studies

Characteristics of included studies [ordered by study ID]

Astin 2008

Methods

Method of identification of reference set records - Handsearching

Method of deriving filter terms - Analysis of reference set

Data

Reference set years - Development set 1985 Clin Radiol, 1988 Am J Neuroradiol; validation set 2000

Number of gold standard records - 333 in development set; 186 in validation set

Number of non-gold standard records - 2222 in development set; 1070 in validation set

Comparisons

Reference set also contained non-gold standard records -Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 1

Notes

MEDLINE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | No | Filter developed to retrieve radiology DTA studies. High concerns about applicability
Independent internal validation? | Yes | Discrete set of references in a derivation set from six handsearched journals and a validation set from six different handsearched journals
Externally validated? | No | High concerns about applicability

Bachmann 2002

Methods

Method of identification of reference set records - Handsearching

Method of deriving filter terms - Analysis of reference set

Data

Reference set years - 1989, 1994 and 1999

Number of gold standard records - 83 in 1989 test set; 53 in 1994 validation set; 61 in 1999 validation set

Number of non-gold standard records - 1646 in 1989 test set; 1744 in 1994 validation set; 7875 in 1999 validation set

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 1

Notes

MEDLINE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant, systematic review not used
Generic gold-standard records? | Yes | Low concerns about applicability
Independent internal validation? | Yes | Terms derived from 1989 reference set; filter validated in 1994 validation set
Externally validated? | Yes | References from search of different journals and year to the derivation and internal validation set. Low concerns about applicability

Bachmann 2003

Methods

Method of identification of reference set records - Handsearching

Method of deriving filter terms - Analysis of reference set

Data

Reference set years - 1999

Number of gold standard records - 61

Number of non-gold standard records - 6082

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - All records retrieved by search that were not classified as gold standard studies

Outcomes

Number of filters developed - 8

Notes

EMBASE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | Yes | Low concerns about applicability
Independent internal validation? | No | Terms for filters derived through word frequency analysis of the same references as the validation set
Externally validated? | No |

Berg 2005

Methods

Method of identification of reference set records - Handsearching

Method of deriving filter terms - Analysis of reference set and adaption of existing filter

Data

Reference set years - Not reported

Number of gold standard records - Not reported

Number of non-gold standard records - 238

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 1

Notes

MEDLINE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | No | Cancer-related fatigue topic specific. High concerns about applicability
Independent internal validation? | No | Used the indexing of citations from gold standard references to derive terms; these references were also included in the validation set
Externally validated? | No | High concerns about applicability

Deville 2000

Methods

Method of identification of reference set records - Handsearching

Method of deriving filter terms - Analysis of reference set

Data

Reference set years - 1992-1995

Number of gold standard records - 75; 33 in meniscal lesions set

Number of non-gold standard records - 2392; meniscal lesions set not reported

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - False positive papers selected by a previously published search strategy, exclusion of some publication types (e.g. reviews and meta-analyses)

Outcomes

Number of filters developed - 4

Number of filters evaluated - 1

Notes

MEDLINE development and evaluation study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | No | Family medicine reference set; physical diagnostic tests for meniscal lesion validation set. High concerns about applicability
Independent internal validation? | No |
Externally validated? | Yes |

Deville 2002

Methods

Method of identification of reference set records - DTA systematic reviews

Method of deriving filter terms - Adaption of existing filter

Data

Reference set years - Not reported

Number of gold standard records - Not reported

Number of non-gold standard records - Not reported

Comparisons

Reference set also contained non-gold standard records - Not reported

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 1

Notes

MEDLINE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | The reference cited by the study for the systematic review that was used is unavailable. A meta-analysis published by the same author on the topic did use a search strategy containing diagnostic terms
Generic gold-standard records? | No | Studies from a systematic review of diagnostic tests for knee lesions and a systematic review of a urine dipstick test comprised the reference set. High concerns about applicability
Externally validated? | Yes | Real-world validation sets based on two systematic reviews. Low concerns about applicability

Doust 2005

Methods

Method of identification of reference set records - DTA systematic reviews conducted by authors
Data

Reference set years - Tympanometry 1966-2001; natriuretic peptides 1994-2002

Number of gold standard records - Tympanometry n=33; natriuretic peptides n=20

Number of non-gold standard records - 0

Comparisons

Reference set also contained non-gold standard records - No

Description of non-gold standard records if used in reference set - Not applicable

Outcomes

Number of filters evaluated - 5

Notes

MEDLINE evaluation study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | No | The authors conducted two systematic reviews whose studies comprised the reference set. The clinical queries filter for diagnostic studies available in PubMed was used
Generic gold-standard records? | No | Studies from a systematic review of tympanometry for the diagnosis of otitis media with effusion in children and a systematic review of natriuretic peptides comprised the reference standard. High concerns about applicability
Independent internal validation? | Unclear | Not relevant
Externally validated? | Unclear | Not relevant

Haynes 1994

Methods

Method of identification of reference set records - Handsearching for primary studies

Method of deriving filter terms - Expert knowledge and analysis of reference set

Data

Reference set years - 1986 and 1991

Number of gold standard records - 92 in 1986 set; 111 in 1991 set

Number of non-gold standard records - 426 in 1986 set; 301 in 1991 set

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 12

Notes

MEDLINE development study. All papers listed under Haynes 1994 used for data extraction

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | Yes | Low concerns about applicability
Independent internal validation? | No | Terms were collected through expert knowledge but their combination into a strategy was not independent of the references used for validation. The reference standard was used to eliminate terms with <10% sensitivity, and combinations with <40% sensitivity or <70% specificity
Externally validated? | No | High concerns about applicability
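The term-selection thresholds described for Haynes 1994 (single terms below 10% sensitivity eliminated; combinations below 40% sensitivity or 70% specificity eliminated) amount to a simple screening rule. A minimal sketch, with hypothetical terms and measured values:

```python
# Hypothetical sketch of threshold-based term screening, as described
# for Haynes 1994: a single term is kept only if its sensitivity in
# the reference set is at least 10%; a combination is kept only if it
# reaches 40% sensitivity and 70% specificity.

def keep_term(sensitivity: float) -> bool:
    return sensitivity >= 10.0

def keep_combination(sensitivity: float, specificity: float) -> bool:
    return sensitivity >= 40.0 and specificity >= 70.0

# Hypothetical per-term sensitivities (%) measured on a reference set.
candidates = {"diagnos$.tw.": 80.0, "sensitivity.tw.": 55.0, "roc.tw.": 6.0}
kept = sorted(t for t, s in candidates.items() if keep_term(s))
print(kept)  # ['diagnos$.tw.', 'sensitivity.tw.']
```

Because the thresholds are measured on the same reference set used for validation, the resulting filter is not independently validated, which is the basis of the "No" judgement above.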

Haynes 2004

Methods

Method of identification of reference set records - Handsearching for primary studies

Method of deriving filter terms - Expert knowledge and analysis of reference set

Data

Reference set years - 2000

Number of gold standard records - 147

Number of non-gold standard records - 48,881

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 11

Notes

MEDLINE development study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | Yes | Low concerns about applicability
Independent internal validation? | No | Individual search terms with a sensitivity >25% and a specificity >75% (when tested in the reference set) were incorporated into the development of search strategies
Externally validated? | No | High concerns about applicability

Kassai 2006

Methods

Method of identification of reference set records - Primary studies identified through Internet search
Data

Reference set years - 1966-2002

Number of gold standard records - 237

Number of non-gold standard records - 1236

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - All studies retrieved by search not classified as gold standard records

Outcomes

Number of filters evaluated - 3

Notes

MEDLINE evaluation study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | Unclear | Not relevant
Generic gold-standard records? | No | Venous thrombosis and ultrasonography topic specific. High concerns about applicability
Independent internal validation? | Unclear | Not relevant
Externally validated? | Unclear | Not relevant

Kastner 2009

Methods

Method of identification of reference set records - Internet search for DTA systematic reviews
Data

Reference set years - 2006

Number of gold standard records - 441

Number of non-gold standard records - 0

Comparisons

Reference set also contained non-gold standard records - No

Description of non-gold standard records if used in reference set - Not applicable

Outcomes

Number of filters evaluated - 1

Notes

MEDLINE evaluation study

Risk of bias

Item | Authors' judgement | Description
If relevant, systematic review did not use DTA strategy? | No | Five of the twelve systematic reviews that provided studies for the reference set used search strategies containing DTA search terms to find primary studies
Generic gold-standard records? | Yes | Low concerns about applicability
Independent internal validation? | Unclear | Not relevant
Externally validated? | Unclear | Not relevant

Leeflang 2006

Methods

Method of identification of reference set records - Internet search for DTA systematic reviews
Data

Reference set years - 1999-2002

Number of gold standard records - 820

Number of non-gold standard records - 0

Comparisons

Reference set also contained non-gold standard records - No

Description of non-gold standard records if used in reference set - Not applicable

Outcomes

Number of filters evaluated - 12

Notes - MEDLINE evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Unclear - Seven of the 27 systematic reviews whose studies made up the reference set did not describe their search strategy; it is therefore unclear whether diagnostic terms were used in those searches
Generic gold-standard records? - Yes - Low concerns about applicability
Independent internal validation? - Unclear - Not relevant
Externally validated? - Unclear - Not relevant

Mitchell 2005

Methods

Method of identification of reference set records - Handsearching for primary studies
Data

Reference set years - 1991-1992; 2002-2003

Number of gold standard records - 99

Number of non-gold standard records - 4409

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters evaluated - 6 MEDLINE filters and 6 EMBASE filters

Notes - MEDLINE and EMBASE evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Unclear - Not relevant
Generic gold-standard records? - No - Kidney disease topic specific. High concerns about applicability
Independent internal validation? - Unclear - Not relevant
Externally validated? - Unclear - Not relevant
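
When a reference set also contains non-gold-standard records, as in the entry above, a filter's precision (and hence the number of records that must be screened per relevant record found) can be estimated alongside its sensitivity. The sketch below is hypothetical; the function name and all counts are invented for illustration and do not come from the study.

```python
# Hypothetical sketch: estimating a filter's sensitivity, precision and
# "number needed to read" (NNR) from counts against a reference set that
# contains both gold-standard and non-gold-standard records.
# All counts below are invented.

def filter_metrics(gold_retrieved, gold_total, non_gold_retrieved):
    sensitivity = gold_retrieved / gold_total
    precision = gold_retrieved / (gold_retrieved + non_gold_retrieved)
    nnr = 1 / precision  # records screened per relevant record found
    return sensitivity, precision, nnr

sens, prec, nnr = filter_metrics(gold_retrieved=80,
                                 gold_total=100,
                                 non_gold_retrieved=320)
```

With these invented counts the filter retrieves 80 of 100 gold-standard records (sensitivity 0.8) along with 320 irrelevant ones, giving a precision of 0.2, i.e. five records screened per relevant record found.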

Noel-Storr 2011

Methods

Method of identification of reference set records - DTA systematic reviews conducted by the authors

Method of deriving filter terms - Analysis of reference set (authors ran published search filters in MEDLINE combined with a subject search, locating 10 papers that all filters missed and choosing a term from their title/abstract or keywords of each)

Data

Reference set years - 2000-2001

Number of gold standard records - 128 in the September 2010 set, with an additional 16 found in an update search (144 in the August 2011 set)

Number of non-gold standard records - 17,266 in the September 2010 set, with an additional 1654 found in an update search (18,920 in the August 2011 set)

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - All studies retrieved by search not classified as gold standard records

Outcomes

Number of filters developed - 1

Notes - MEDLINE development and evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Yes
Generic gold-standard records? - No - Studies included in a systematic review of biomarkers for diagnosing mild cognitive impairment comprised the reference set; the filter was designed to retrieve longitudinal DTA studies. High concerns about applicability
Independent internal validation? - No - The reference set was not fully independent of the set used to derive terms: it consisted of 144 gold-standard records and 18,920 non-gold-standard records, but the 10 studies used to derive terms for the new filter were included in the reference set during validation
Externally validated? - No - High concerns about applicability
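
The derivation step described for this entry (running published filters, then drawing new terms from the gold-standard records that every filter missed) reduces to simple set operations. The sketch below is illustrative only; the record IDs and filter names are invented.

```python
# Hypothetical sketch of the derivation approach described above: find
# gold-standard records missed by every published filter, so candidate
# terms can be drawn from their titles, abstracts or keywords.
# Record IDs and filter names below are invented.

gold = {101, 102, 103, 104, 105}   # gold-standard record IDs

filter_results = {                 # records each published filter retrieved
    "filter_a": {101, 102},
    "filter_b": {101, 103},
    "filter_c": {102, 103},
}

# Records retrieved by at least one filter
retrieved_by_any = set().union(*filter_results.values())

# Gold-standard records missed by all filters - the source of new terms
missed_by_all = gold - retrieved_by_any
```

Here records 104 and 105 escape every filter, so their titles, abstracts or keywords would be mined for new candidate terms.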

Ritchie 2007

Methods

Method of identification of reference set records - Primary studies identified through Internet search
Data

Reference set years - 1966-2003

Number of gold standard records - 160

Number of non-gold standard records - 27,804

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - All studies retrieved by search not classified as gold standard records

Outcomes

Number of filters evaluated - 22

Notes - MEDLINE evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Unclear - Not relevant
Generic gold-standard records? - No - Childhood urinary tract infection diagnosis topic specific. High concerns about applicability
Independent internal validation? - Unclear - Not relevant
Externally validated? - Unclear - Not relevant

van der Weijden 1997

Methods

Method of identification of reference set records - Personal literature database

Method of deriving filter terms - Checking key publications for terms and language used

Data

Reference set years - 1985-1994

Number of gold standard records - 221

Number of non-gold standard records - 0

Comparisons

Reference set also contained non-gold standard records - No

Description of non-gold standard records if used in reference set - Not applicable

Outcomes

Number of filters developed - 3

Notes - MEDLINE development study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Unclear - Not relevant
Generic gold-standard records? - No - Erythrocyte sedimentation rate as a diagnostic test topic specific. High concerns about applicability
Independent internal validation? - No - Filters were composed of terms derived from checking key publications for terms and language used; not judged to be internally validated, as only real-world external validation was carried out
Externally validated? - Yes - Two systematic reviews on ESR and dipstick testing provided references for validation testing. Low concerns about applicability

Vincent 2003

Methods

Method of identification of reference set records - DTA systematic reviews

Method of deriving filter terms - Adaptation of an existing filter and analysis of reference set

Data

Reference set years - 1969-2000

Number of gold standard records - 126

Number of non-gold standard records - 0

Comparisons

Reference set also contained non-gold standard records - No

Description of non-gold standard records if used in reference set - Not applicable

Outcomes

Number of filters developed - 3

Number of filters evaluated - 5

Notes - MEDLINE development and evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - No - At least one of the 16 systematic reviews used to provide studies for the reference set used diagnostic search terms in its search strategy. Many of the systematic reviews did not provide a full search strategy, so it is unclear whether they used a diagnostic filter.
Generic gold-standard records? - No - Deep vein thrombosis diagnosis topic specific. High concerns about applicability
Independent internal validation? - No - Published filters were adapted by adding and removing terms based on the results of searches of the reference set
Externally validated? - No - High concerns about applicability

Whiting 2010

Methods

Method of identification of reference set records - Systematic reviews conducted by the authors
Data

Reference set years - Not reported

Number of gold standard records - 506

Number of non-gold standard records - 25,880 (number obtained from authors)

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters evaluated - 22

Notes - MEDLINE evaluation study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Yes - The authors conducted the systematic reviews and state that their search strategies did not contain any diagnostic terms
Generic gold-standard records? - Yes - DTA studies from seven systematic reviews which covered a range of different types of diagnostic test and condition. Low concerns about applicability
Independent internal validation? - Unclear - Not relevant
Externally validated? - Unclear - Not relevant

Wilczynski 2005

Methods

Method of identification of reference set records - Handsearching for primary studies

Method of deriving filter terms - Analysis of reference set and expert knowledge

Data

Reference set years - 2000

Number of gold standard records - 97

Number of non-gold standard records - 27,672

Comparisons

Reference set also contained non-gold standard records - Yes

Description of non-gold standard records if used in reference set - Not reported

Outcomes

Number of filters developed - 4

Notes - EMBASE development study

Risk of bias
Item - Authors' judgement - Description
If relevant, systematic review did not use DTA strategy? - Unclear - Not relevant
Generic gold-standard records? - Yes - Low concerns about applicability
Independent internal validation? - No
Externally validated? - No - High concerns about applicability

Ancillary