In obstetrics and gynaecology, the technology behind existing tests is continuously improving and new tests are being developed at a fast rate. Whether these developments lead to improved diagnosis is addressed in test accuracy studies. The BJOG covered the proper evaluation of clinical tests in 2001 with two commentaries,1,2 which provided the basis for improvements in reporting and peer-review.3 Recently, an international initiative on the Standards for Reporting of Diagnostic Accuracy (STARD) has been reported.4 In this commentary, we introduce this initiative to authors, peer-reviewers and readers.
Evaluation of research depends on transparent, complete and accurate reporting. This allows readers to detect the potential for bias in a study and to evaluate the applicability of its results to their practice. Hence, guidelines on reporting randomised trials (CONSORT)5 and systematic reviews (QUOROM6 and MOOSE7) have been developed and widely accepted by both journal editors and authors. STARD is an initiative on similar lines, aimed at studies of diagnostic accuracy comparing ‘new’ tests with existing reference standards. STARD consists of a 25-item checklist (Table 1), a flow diagram (Fig. 1) and an explanatory document that describes the background, rationale and evidence for each of the 25 items. The checklist and flow diagram are available from http://www.consort-statement.org/stardstatement.htm. The explanatory document is available from the websites of Clinical Chemistry (http://www.clinchem.org) and the Annals of Internal Medicine (http://www.annals.org).
Table 1. The STARD checklist for the reporting of studies of diagnostic accuracy.

| Section | Item | Recommendation |
|---|---|---|
| Title, abstract and keywords | 1 | Identify the article as a study of diagnostic accuracy (recommend MeSH heading ‘sensitivity and specificity’). |
| Introduction | 2 | State the research questions or aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups. |
| Methods: participants | 3 | Describe the study population: the inclusion and exclusion criteria and the settings and locations where the data were collected. |
| | 4 | Describe participant recruitment: was this based on presenting symptoms, results from previous tests or the fact that the participants had received the index tests or the reference standard? |
| | 5 | Describe participant sampling: was this a consecutive series of participants defined by the selection criteria in items 3 and 4? If not, specify how participants were further selected. |
| | 6 | Describe data collection: was data collection planned before the index tests and reference standard were performed (prospective study) or after (retrospective study)? |
| Methods: test methods | 7 | Describe the reference standard and its rationale. |
| | 8 | Describe technical specifications of material and methods involved, including how and when measurements were taken, or cite references for index tests or reference standard or both. |
| | 9 | Describe definition of and rationale for the units, cutoff points or categories of the results of the index tests and the reference standard. |
| | 10 | Describe the number, training and expertise of the persons executing and reading the index tests and the reference standard. |
| | 11 | Were the readers of the index tests and the reference standard blind (masked) to the results of the other test? Describe any other clinical information available to the readers. |
| Methods: statistical methods | 12 | Describe methods for calculating or comparing measures of diagnostic accuracy and the statistical methods used to quantify uncertainty (e.g. 95% confidence intervals). |
| | 13 | Describe methods for calculating test reproducibility, if done. |
| Results: participants | 14 | Report when the study was done, including beginning and ending dates of recruitment. |
| | 15 | Report clinical and demographic characteristics (e.g. age, sex, spectrum of presenting symptoms, comorbidity, current treatments and recruitment centre). |
| | 16 | Report how many participants satisfying the criteria for inclusion did or did not undergo the index tests or the reference standard or both; describe why participants failed to receive either test (a flow diagram is strongly recommended). |
| Results: test results | 17 | Report the time interval from index tests to reference standard and any treatment administered between. |
| | 18 | Report the distribution of severity of disease (define criteria) in those with the target condition and other diagnoses in participants without the target condition. |
| | 19 | Report a cross tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, report the distribution of the test results by the results of the reference standard. |
| | 20 | Report any adverse events from performing the index test or the reference standard. |
| Results: estimates | 21 | Report estimates of diagnostic accuracy and measures of statistical uncertainty (e.g. 95% confidence intervals). |
| | 22 | Report how indeterminate results, missing responses and outliers of index tests were handled. |
| | 23 | Report estimates of variability of diagnostic accuracy between readers, centres, or subgroups of participants, if done. |
| | 24 | Report estimates of test reproducibility, if done. |
| Discussion | 25 | Discuss the clinical applicability of the study findings. |
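The cross tabulation recommended in the checklist, with indeterminate index-test results reported explicitly in their own row rather than silently discarded, can be sketched as follows. All counts here are invented for illustration:

```python
# Hypothetical cross tabulation of index-test results against the
# reference standard; indeterminate results get their own row rather
# than being dropped. All counts are invented.
rows = [
    ("Index test positive",       90,  30),
    ("Index test negative",       10, 170),
    ("Index test indeterminate",   4,   6),
]

print(f"{'':27s}{'Condition present':>19s}{'Condition absent':>18s}")
for label, present, absent in rows:
    print(f"{label:27s}{present:>19d}{absent:>18d}")

# Column totals show how many verified participants each column contains
tot_present = sum(r[1] for r in rows)
tot_absent = sum(r[2] for r in rows)
print(f"{'Total':27s}{tot_present:>19d}{tot_absent:>18d}")
```

Reporting the full table, totals included, lets readers recompute any accuracy measure themselves and see how indeterminate results were treated.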
STARD was developed after a search for published guidelines on diagnostic research, which yielded 33 previously published checklists. From these, a list of 75 potential items was compiled. At an international consensus meeting of researchers, editors and members of professional organisations, participants short-listed 25 items, considering empirical evidence of bias wherever available. The flow diagram provides information about the method of recruitment of participants, the order of tests and the numbers of participants undergoing the test under evaluation and the reference standard. It communicates vital information about the study design and the flow of participants in a transparent manner. It is anticipated that use of the checklist, in combination with the flow diagram, will enhance the quality of reporting of studies on diagnostic accuracy.
In studies of test accuracy, the information from one or more tests under evaluation is compared with a reference standard, measured in the same subjects suspected of having the condition of interest. The word ‘test’ refers to any method for obtaining additional information on a patient's health status (e.g. history, physical examination, laboratory tests, imaging tests, function tests and histopathology). The ‘condition of interest’ is usually a particular disease or diagnosis. The ‘reference standard’ is the best available method for verifying the presence or absence of the diagnosis; it may consist of one or more methods, including laboratory tests, imaging tests, histopathology and clinical follow up of subjects. STARD requires a complete description of the comparison between the test under scrutiny and the reference standard in an appropriate patient group, including whether the two were performed and read independently and with blinding. The term ‘accuracy’ refers to the degree of agreement between the information from the test under evaluation and the reference standard. Test accuracy can be expressed as sensitivity and specificity, likelihood ratios, the diagnostic odds ratio, the area under a ROC curve, etc. Details of how to estimate these have been provided previously.1
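As a minimal sketch of how these measures relate to one another, take an invented 2×2 table of index-test results against the reference standard (indeterminate results set aside for simplicity):

```python
# Hypothetical counts: true positives, false positives,
# false negatives, true negatives (illustrative only)
tp, fp, fn, tn = 90, 30, 10, 170

sensitivity = tp / (tp + fn)              # proportion of diseased detected
specificity = tn / (tn + fp)              # proportion of non-diseased cleared
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio
dor = lr_pos / lr_neg                     # diagnostic odds ratio = (tp*tn)/(fp*fn)

print(f"Sensitivity {sensitivity:.2f}, Specificity {specificity:.2f}")
print(f"LR+ {lr_pos:.2f}, LR- {lr_neg:.3f}, DOR {dor:.1f}")
```

With these counts the sensitivity is 0.90 and the specificity 0.85; the likelihood ratios and diagnostic odds ratio follow directly from the same four cells.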
To comprehend and use the results of test accuracy studies, readers must be able to understand their design, conduct, analysis and results. If readers have to speculate, there may be errors in interpretation with undesired clinical consequences. Critical appraisal, a key step in evidence-based practice, is only possible if the design, conduct and analysis of test accuracy studies are thoroughly described in published articles. Inadequate methodological approaches are often associated with erroneous conclusions, and authors themselves have a tendency to exaggerate conclusions about test accuracy.8 Biased and exaggerated inferences can trigger premature dissemination and mislead decision making in health care, for individual patients as well as for regional and national policies. A rigorous evaluation of test accuracy studies could help limit health care costs by preventing unnecessary testing and by reducing the number of unwanted clinical consequences related to false test results.
One of the issues deliberately left out of the STARD list at present is that of sample size. It is possible to perform a sample size (power) calculation at the start, but there is no consensus on methods of sample size estimation for diagnostic studies, particularly as many designs are possible. Yet most current diagnostic studies are too small and produce estimates of accuracy with a large degree of uncertainty. In STARD, therefore, the items under ‘statistical methods’ and ‘results’ refer to the level of statistical uncertainty around estimates of diagnostic accuracy (which is often lacking from test accuracy studies) and encourage authors to report it (e.g. using 95% confidence intervals).
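The effect of sample size on statistical uncertainty can be illustrated with a Wilson score interval, one common choice for a confidence interval around a proportion such as sensitivity (an exact binomial interval would serve equally well); the counts below are invented:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Same observed sensitivity (0.90) at two hypothetical sample sizes
for k, n in [(45, 50), (90, 100)]:
    lo, hi = wilson_ci(k, n)
    print(f"{k}/{n}: sensitivity {k/n:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Doubling the number of diseased participants leaves the point estimate unchanged but narrows the interval noticeably, which is exactly the information a reader needs to judge how seriously to take a reported accuracy figure.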
The items of the STARD checklist will help readers judge the potential for bias in a test accuracy study and evaluate the applicability of its findings. STARD has already been published in many journals, including Clinical Chemistry, Annals of Internal Medicine, Radiology, BMJ, Lancet, American Journal of Clinical Pathology, Clinical Biochemistry, Clinical Chemistry and Laboratory Medicine, Clinical Radiology, Academic Radiology, American Journal of Roentgenology, and Family Practice. It has been adopted and commented on by others, such as JAMA and Neurology. The advice to authors and readers on methods and analysis of studies evaluating the reliability2 and validity1 of clinical tests previously provided by the BJOG is reinforced by the STARD initiative. Use of this checklist, in combination with the flow diagram, will enhance the quality of reporting of studies on test accuracy in obstetrics and gynaecology.