To identify classification criteria for the rheumatic diseases and to evaluate their measurement properties and methodologic rigor using current measurement standards.
We performed a systematic review of published literature and evaluated criteria sets for stated purpose, derivation and validation sample characteristics, methods of criteria generation and reduction, and consideration of validity and reliability.
We identified 47 classification criteria sets encompassing 13 conditions. Approximately 50% of the criteria sets were developed based on expert opinion rather than patient data. Of the 47 criteria sets, control samples were derived from patients with rheumatic disease in 15 (32%) sets, from patients with nonrheumatic diseases in 4 (9%) sets, and from healthy participants in 2 (4%) sets. Where patient data were used, the number of cases ranged from 20 to 588 and the number of controls from 50 to 787. In only 1 (2%) criteria set was there a distinct separation between investigators who derived the criteria set and clinicians who provided cases and controls. Authors commented on the need for individual criteria to be reliable in 5 (11%) sets and precise in 5 (11%) sets; authors noted the importance of content validity in 12 (26%) sets and construct validity in 12 (26%) sets.
The variation in methodologic rigor used in sample selection affects the validity and reliability of the criteria sets in different clinical and research settings. Despite potential deficiencies in the methods used for some criteria development, the sensitivity and specificity of many criteria sets are moderate to strong.
Classification criteria for the rheumatic diseases are the basis for much research in clinical rheumatology. Classification criteria serve to define disease groups for clinical and epidemiologic studies. If they are not valid, participants without disease may be included in disease groups in studies, and participants with clear-cut disease may be excluded. Thus, the validity of classification criteria is critical to our ability to understand and treat rheumatic diseases. Valid classification criteria facilitate selection of similar patients for clinical trials and epidemiologic studies and allow for comparison of results across studies. Classification criteria are widely cited, indicating their important role.
Standards for the development and validation of classification criteria have changed substantially over time. Recommendations for the development and validation of criteria sets have been proposed based on the current standards of measurement science (1, 2). The classification criteria currently in use have been published over several decades, and many of them precede our current understanding of how to develop valid criteria; therefore, there is likely to be variation in the methodologies used in the development, validation, and performance characteristics of classification criteria. Furthermore, advances in our understanding of the pathophysiology of disease, diagnostic tests, patient populations, and treatments during the past 20 years suggest that some criteria sets need to be updated. Therefore, the methodology of the development and the subsequent validity and reliability of classification criteria for the rheumatic diseases deserve closer attention.
The objectives of this study were to identify classification criteria for ankylosing spondylitis, juvenile arthritis, dermatomyositis, fibromyalgia, gout, Kawasaki disease, osteoarthritis, polymyalgia rheumatica, polymyositis, psoriatic arthritis, rheumatoid arthritis, systemic lupus erythematosus, systemic sclerosis, and vasculitis; and to evaluate their measurement properties against current principles and standards of measurement.
Eligible articles were identified using the Ovid MEDLINE database (1966–2005) and the EMBASE database (1966–2005). The search strategy was limited to human studies, but not limited to studies reported in English. Additional articles were identified from the “Bibliography of Criteria, Guidelines, and Health Status Assessments Used in Rheumatology,” published on the American College of Rheumatology (ACR) website (http://www.rheumatology.org/publications/abbreviations/index.asp?aud=mem). The reference lists of eligible studies were hand searched.
The following keywords with mapping to subject heading were used in the database search: (disease name OR alternate disease name[s]) AND (criteria OR criteria development OR classification OR classification criteria OR classification tree OR diagnostic score OR diagnostic measures OR diagnostic criteria OR diagnostic assessment OR diagnostic index OR disease score OR disease criteria OR disease measures OR disease assessment OR disease index OR validity OR face validity OR content validity OR construct validity).
The titles and abstracts of search results were screened for relevance to criteria, classification or diagnosis of disease, or presence of any of the secondary search terms. Abstracts were also screened to identify articles that defined, updated, addressed, reviewed, or commented on classification or criteria for rheumatic disease.
Many sets of classification criteria have been proposed for the rheumatic diseases. We were interested in evaluating those that are commonly used to classify patients with disease. Web of Science (version 3.0, Thomson Scientific, Philadelphia, PA) was used to search the Science Citation Index Expanded (1945 through April 2, 2006), the Social Sciences Citation Index (1956 through April 2, 2006), and the Arts and Humanities Index (1975 through April 2, 2006) to identify the number of times each article was cited. By consensus, we chose a citation number of 100 or greater as the threshold for inclusion of articles regarding adult rheumatic diseases. For pediatric rheumatic diseases, we used a citation number threshold of 35 based on the clinical judgment of the investigators (DS-G and BMF). Articles cited fewer than 100 times were included in this study if they were deemed to provide important criteria sets for specific diseases.
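The inclusion rule described above can be sketched as a small function. This is only an illustration: the function and argument names are hypothetical, the pediatric threshold is assumed to be inclusive ("35 or greater"), and the investigators' judgment that an article is important is reduced to a simple flag.

```python
ADULT_THRESHOLD = 100      # "citation number of 100 or greater" (from the text)
PEDIATRIC_THRESHOLD = 35   # pediatric threshold (from the text; inclusivity assumed)


def is_included(citations: int, pediatric: bool = False,
                deemed_important: bool = False) -> bool:
    """Return True if an article passes the citation screen.

    deemed_important stands in for the investigators' judgment that a
    less-cited article provides an important criteria set.
    """
    threshold = PEDIATRIC_THRESHOLD if pediatric else ADULT_THRESHOLD
    return citations >= threshold or deemed_important


# Hypothetical articles: an adult-disease article just below the threshold,
# and the same article when the investigators deem it important.
print(is_included(99))
print(is_included(99, deemed_important=True))
print(is_included(35, pediatric=True))
```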
Criteria sets were evaluated using a standard abstraction form. Each criteria set was evaluated regarding its purpose, methods of item generation, derivation sample characteristics, statement of criterion characteristics, methods of criteria reduction, consideration of criteria set validity, and validation sample characteristics.
The stated purpose of criteria was categorized as being the differentiation of one rheumatic disease from others, its use in epidemiologic studies, both of the above, or unspecified. Item generation refers to the methods used to identify and collate items for inclusion in the criteria set. Methods of item generation were categorized as either literature search, expert opinion, or other. The derivation sample was evaluated based on the source(s) of the cases and control subjects (i.e., academic rheumatology practice[s], community rheumatology practice[s], nonrheumatology practice[s], unclear sources, or not applicable [indicating that no patients were involved]), the type of control sample population (patients with rheumatic disease, patients with nonrheumatic disease, healthy participants, type unclear, or not applicable [indicating that no patients were involved]), and the number of cases and controls that were recorded.
Circularity of reasoning may occur when the group of clinicians who derive the criteria sets are the same clinicians who provided the cases and control subjects, leading to a strong correlation between the criteria set and the diagnosis (1). Two separate groups are recommended to ensure the criteria are not validated by the same people who applied the criteria to their patients for inclusion in the study (1, 2). The relationship between investigators who derived the criteria sets and the clinicians who provided cases and controls was categorized as completely different groups, overlapping groups, the same group, unclear groups, or not applicable (indicating that no patients were involved).
Criteria characteristics were evaluated based on the presence or absence of a statement indicating that the criteria developers considered reliability, precision, and feasibility. Reliability refers to the reproducibility of a criterion for which there is no gold standard when applied by the same rater over time (intrarater) or among raters (interrater) (3). Precision refers to the degree to which results from laboratory tests can be duplicated from one measurement to the next (3). Feasibility refers to the ease of use of a criteria set (4). Consideration of reliability, precision, and availability of test(s) required to fulfill criteria was categorized as yes or no.
Once potential items for inclusion in a criteria set have been identified, statistical methods are often used for criteria ranking and for item reduction, i.e., to eliminate inappropriate or redundant items (5, 6). Tests used to rank criteria and statistical methods for item reduction were categorized as sensitivity and specificity; area under the receiver operating characteristic (ROC) curve; regression; recursive partitioning; frequency, chi-square, and t-tests; latent class analysis; factor analysis; Wilcoxon's rank sum test; cluster analysis; not specified; or not applicable (indicating that statistical methods were neither required nor used).
Content validity refers to the comprehensiveness of the criteria, and evaluates whether all the domains of the disease have been represented (1). Many domains do not have a gold standard test (e.g., pain, weakness, and disability). Therefore, a criterion is used to operationalize the domain based on a theoretical construct. Construct validity evaluates the relationship of a criterion or criteria set to other measures of the same construct. Two strongly correlated measures of the same construct have convergent construct validity (7). Face validity evaluates whether the criteria reflect the attributes of the disease and whether there is biologic coherence of the items (4). Consideration of content, construct, and face validity by the investigators was categorized as yes or no.
External validation of the criteria set was categorized as yes or no. The source or sources of case and control subject recruitment for the validation sample were categorized as academic rheumatology practice(s), community rheumatology practice(s), nonrheumatology practice(s), or an unclear source. Testing of the criteria set by other groups was categorized as yes or no.
Due to the lack of a single gold standard diagnostic test for all rheumatic diseases, convergent and divergent construct validity evaluate the ability of a criteria set to correctly identify patients with the clinical construct of the disease (convergent validity) and to not identify those with other diseases (divergent validity) (1). Sensitivity and specificity are often used as measures of convergent and divergent validity.
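As a reminder of how these two measures are computed, the following minimal sketch calculates sensitivity and specificity from a two-by-two table of cases and controls. The counts are hypothetical and are not taken from any criteria set in this review.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of cases (patients with the disease) who fulfill the criteria."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of controls (patients without the disease) who do not fulfill the criteria."""
    return true_neg / (true_neg + false_pos)


# Hypothetical derivation sample: 100 cases and 100 controls.
cases_fulfilling, cases_not_fulfilling = 90, 10
controls_not_fulfilling, controls_fulfilling = 85, 15

print(f"sensitivity = {sensitivity(cases_fulfilling, cases_not_fulfilling):.2f}")
print(f"specificity = {specificity(controls_not_fulfilling, controls_fulfilling):.2f}")
```

Because specificity depends entirely on who the controls are, the same criteria set can yield different values against rheumatic disease controls, nonrheumatic disease controls, and healthy participants.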
A systematic review of the literature identified 24,902 citations. Screening of titles and abstracts excluded 24,634 articles, and 162 articles were excluded because their citation number was low. Additional articles, some not identified in the search and some with a low citation number, were included for full review based on the discretion of the investigator. Overall, 171 articles underwent full review; 109 articles were excluded as they were not deemed relevant to the study goal, leaving 62 articles examining 47 criteria sets for inclusion in this study (see Figure 1).
The criteria set measurement characteristics are outlined in Table 1 (rheumatoid arthritis and systemic lupus erythematosus), Table 2 (osteoarthritis, polymyalgia rheumatica, psoriatic arthritis, and spondylarthropathy), Table 3 (gout, dermatomyositis/polymyositis, and fibromyalgia), Table 4 (ankylosing spondylitis, systemic sclerosis, vasculitis, and Kawasaki disease), and Table 5 (juvenile arthritis).
|Bennett 1956 (31), Ropes 1957 (32), Ropes 1957 (33)||Ropes 1958 (34)||Kellgren 1963 (35)||Lawrence 1968 (36)||Arnett 1988 (25)||Tan 1982 (37)||Hochberg 1997 (38)|
|Purpose of criteria*||A||A||A||A||A||A||A|
|Method(s) of item generation†||B||B||B||B||B||B||B|
|Source of cases and controls‡||D||NA||NA||NA||A + B||A||A|
|Number of cases||122 definite, 109 probable||NA||NA||NA||262||177||177|
|Number of controls||101||NA||NA||NA||262||162||162|
|Relationship between investigators and clinicians¶||D||NA||NA||NA||B||B||B|
|Authors commented on the need for criterion to be|
|Tests to rank potential criteria#||A||None||None||None||E||E||E|
|Content, construct, face validity were considered||No, No, No||No, No, No||No, No, No||No, No, No||No, No, No||Yes, Yes, Yes||Yes, Yes, Yes|
|Final criteria set externally validated||No||No||No||No||Yes||Yes||Yes|
|Source of case and control for validation‡||NA||NA||NA||NA||C||Yes||Yes|
|Final criteria tested by other groups (reference)||No||Yes (39)||No||Yes||Yes (39–42)||Yes (43–48)||Yes (43–48)|
|OA knee||OA hand||OA hip||Polymyalgia rheumatica||Psoriatic arthritis||Spondylarthritis|
|Altman 1986 (29)||Altman 1990 (49)||Altman 1991 (50)||Bird 1979 (51)||Jones 1981 (52)||Chuang 1982 (9)||Taylor 2006 (24)||Moll 1973 (12)||Dougados 1991 (26)|
|Purpose of criteria*||A||A||A||A||A||D||C||A||C|
|Method(s) of item generation†||B||B||B||B||A||B||A + B||A + B||A + B|
|Source of cases and controls‡||D||D||D||A + possibly B||A||D||A||NA||A|
|Control population§||A||A||A||A + B||NA||NA||A||NA||A|
|Number of cases||130||100||114||146||85||96||588||0||68|
|Number of controls||107||99||87||253||0||0||536||0||414|
|Relationship between investigators and clinicians¶||D||D||D||B||C||C||B||NA||D|
|Authors commented on the need for criterion to be|
|Tests to rank potential criteria#||E||E||E||A||NS||NA||C, D, F||NA||A,E|
|Content, construct, face validity were considered||No, No, No||No, No, No||No, No, No||Yes, Yes, Yes||NS, NS, NS||NA, NA, NA||Yes, Yes, Yes||Yes, No, Yes||Yes, Yes, Yes|
|Final criteria set externally validated||Yes||No||Yes||Yes||No||No||No||Yes (53)||Yes (53)|
|Source of case and control for validation‡||D||NA||NA||A + possibly B||NA||NA||NA||A||A|
|Final criteria tested by other groups (reference)||Yes (54)||No||No||Yes||Yes||Yes||No**||Yes (24, 53)||Yes (24, 53)|
|Brochner-Mortensen 1963 (55)||Decker 1968 (56)||Wallace 1977 (23)||Bohan 1975 (57,58)||Love 1991 (17)||DeVere 1975 (8)||Pearson 1963 (11)||Medsger 1970 (10)||Wolfe 1990 (22)||Yunus 1981 (20)|
|Purpose of criteria*||B||B||C||A||A||A||A||B||A||A|
|Method(s) of item generation†||B||B||B||A + B||A + B||NS||A + B||NS||A + B||A|
|Source of cases and controls‡||NA||NA||A + B||NA||A||D||A||D||A + B||A|
|Control population§||NA||NA||A + B||NA||NA||NA||NA||NA||A + B||C|
|Number of cases||NA||NA||178||0||181||118||48||124||293||50|
|Number of controls||NA||NA||528||0||0||0||0||0||265||50|
|Relationship between investigators and clinicians¶||NA||NA||D||NA||D||D||C||D||A||C|
|Authors commented on the need for criterion to be|
|Tests to rank potential criteria#||NA||NA||E||NA||E,H||NS||NA||NS||B, C, E||E, G|
|Content, construct, face validity were considered||No, No, No||No, No, No||No, No, No||Yes, Yes, Yes||No, No, No||Yes, Yes, Yes||No, No, No||No, No, No||Yes, Yes, Yes||Yes, Yes, Yes|
|Final criteria set externally validated||No||No||No||No||No||No||No||No||No||No|
|Source of case and control for validation‡||NA||NA||NA||NA||NA||NA||NA||NA||NA||NA|
|Final criteria tested by other groups||No||No||No||Yes||No||No||No||No||Yes||No|
|Ankylosing spondylitis||Systemic sclerosis||Vasculitis||Kawasaki disease|
|Rome criteria 1963 (59)||New York criteria 1968 (60)||van der Linden 1984 (19)||Masi 1979 (61), 1980 (62), 1981 (63)||Bloch 1990 (64)||Kawasaki 1967 (65)||MCLS guidelines 1970 (66), 1972 (67), 1974 (68), 1984 (69)||Morens 1978 (70)||AHA 1990 (71)||AHA 2001 (72)||Newburger 2004 (73,74)|
|Citation index||N/A||86||649||740||CSS 430, HSP 225, PAN 371, GCA 464, WG 584, TA 300, HS 96||7||N/A||65||22||5||30|
|Purpose of criteria†||A||A||A||A||A||D||C||A||A||A||A|
|Method(s) of item generation‡||C||C||C||A,B,C||B||B||B||B||A + B||A + B||A + B|
|Source of cases and controls§||D||D||D||A||A + B||NA||NA||NA||NA||NA||NA|
|Number of cases||NS||NS||34||264||20–214||50||0||0||0||0||0|
|Number of controls||NS||NS||>100||413||593–787||0||0||0||0||0||0|
|Relationship between investigators and clinicians#||NS||NS||C||D||B||NA||NA||NA||NA||NA||NA|
|Authors commented on the need for diagnostic tests to be|
|Tests to rank potential criteria**||NS||NS||E||E||B,E||NA||NA||NA||NA||NA||NA|
|Content, construct, face validity were considered||NS||NS||Yes, Yes, Yes||Yes, Yes, Yes||Yes, Yes, Yes||NS||No, No, Yes||No, No, Yes||NS||NS||No, No, Yes|
|Final criteria set externally validated||NS||NS||No||Yes||No||No||No||No||Yes||No||No|
|Source of case and control for validation§||NS||NS||NA||A||NA||NA||NA||NA||NA||NA||NA|
|Final criteria tested by other groups (reference)||Yes||Yes||Yes||Yes (16)||Yes (14)||Yes (66)||Yes (67–69)||Yes (75)||Yes (76)||No||No|
|Bywaters 1968 (77)||Brewer 1972 (21)||Brewer 1977 (78)||Wood 1978 (79)||Kvien 1982 (80)||Fink 1995 (81)||Petty 1998 (82)||Petty 2004 (83)||Lambert 1976 (84)||Southwood 1989 (85)||Rosenberg 1982 (86)|
|Purpose of criteria*||A||A||B||A||A||B||B||B||A||A||A|
|Method(s) of item generation†||A + B||B||B||B||A + B||B||A + B||A + B||B||A + B||B|
|Source of cases and controls‡||NA||A||NA||NA||NA||NA||NA||NA||NA||NA||NA|
|Number of cases||0||135||0||0||0||0||0||0||0||0||0|
|Number of controls||0||100||0||0||0||0||0||0||0||0||0|
|Relationship between investigators and clinicians¶||NA||B||NA||NA||NA||NA||NA||NA||NA||NA||NA|
|Authors commented on the need for diagnostic tests to be|
|Tests to rank potential criteria#||NA||No||NA||NA||NA||NA||NA||NA||NA||No||NA|
|Content, construct, face validity were considered||No, No, Yes||No, Yes, No||NA||No, No, Yes||No, No, Yes||No, No, Yes||No, No, Yes||No, No, Yes||No, No, Yes||No, No, Yes||No, No, Yes|
|Final criteria set externally validated||Yes||No||No||No||No||No||No||No||No||No||Yes|
|Source of case and control for validation‡||A + C||NA||NA||NA||NA||NA||NA||NA||NA||NA||A|
|Final criteria tested by other groups (reference)||No||Yes (78)||Yes (87)||Yes (80, 88–91)||No||Yes (92)||Yes (89–94)||Yes (95)||No||Yes (89, 96)||Yes (97)|
Of the 47 criteria sets, the stated purpose or purposes were differentiation of one rheumatic disease from another in 34 (72%) of the sets, use in epidemiologic studies in 7 (15%) of the sets, and both of these purposes in 4 (9%) of the sets. The purpose or purposes were unspecified in 2 (4%) of the sets.
The method or methods used for item generation in the 47 criteria sets were literature search only in 2 (4%) of the sets, expert opinion in 24 (51%) of the sets, a combination of literature search and expert opinion in 16 (34%) of the sets, and other methods in 3 (6%) of the sets. The methods used for item generation were unspecified in 2 of the sets (4%).
In the total 47 criteria sets, cases and control participants were recruited from academic practices in 14 (30%) of the studies, from community practices in 5 (11%) of the studies, and from unclear sources in 10 (21%) of the studies. There were no case and control participants in 23 (49%) of the studies. The control sample population consisted of patients with rheumatic disease in 15 (32%) of the studies, patients with nonrheumatic disease in 4 (9%) of the studies, and healthy participants in 2 (4%) of the studies. The control sample population was unclear in 2 (4%) of the studies, and a control sample population was not applicable in 29 (62%) of the studies. The number of cases and controls in each study ranged from 0 to 588, and 0 to 787, respectively. Investigators who derived the criteria sets and clinicians who provided cases and controls were completely different groups in 1 (2%) of the studies, the groups overlapped in 5 (11%) of the studies, the groups were the same in 6 (13%) of the studies, the relationship was unclear in 12 (26%) of the studies, and this concern was not applicable in 23 (49%) of the studies.
Authors commented on the need for each criterion to be reliable in 5 (11%) of the criteria sets, precise in 5 (11%) of the sets, and readily available in 4 (9%) of the sets.
To rank potential criteria, 19 (40%) of the studies used relatively simple tests (e.g., sensitivity, specificity, area under the ROC curve, frequency, chi-square, t-test, Wilcoxon's rank sum test, and/or regression modeling), and 3 (6%) of the studies used more sophisticated methods (e.g., latent class analysis, factor analysis, recursive partitioning, and/or cluster analysis). The tests used to rank criteria were not specified in 6 (13%) of the studies, and criteria ranking was not applicable in 20 (43%) of the studies. Seventeen (36%) of the studies used the same relatively simple statistical methods listed above for item reduction, and 6 (13%) of the studies used the more sophisticated methods. Methods of item reduction were not specified in 7 (15%) of the studies and were not applicable in 21 (45%) of the studies.
Of the 47 articles, the authors commented on the need for their criteria set to have content validity in 12 (26%) articles, construct validity in 12 (26%) articles, and face validity in 17 (36%) articles.
Of the 47 criteria sets, 10 (21%) sets had been externally validated. The source of case and control subject recruitment for the validation sample included academic rheumatology practices for 6 (60%) of those 10 sets, community rheumatology practices for 1 (10%) of those 10 sets, and nonrheumatology practices for 2 (20%) of those 10 sets. The source of case and control subject recruitment was unclear for 1 (10%) of the 10 validated sets. Of the 47 total criteria sets, 27 (57%) sets had been tested by other groups.
The sensitivity and specificity of the criteria sets were 45–95% and 75–99%, respectively. The sensitivity of 26 of the sets and the specificity of 27 of the sets were not specified. For 1 criteria set, only the sensitivity was reported. The sensitivity and specificity of individual criteria sets are reported in Table 6.
|Disease||Sensitivity, % (ref.)||Specificity, % (ref.)|
|New York criteria 1968 (60)||76 (26)||99 (26)|
|Rome criteria 1963 (35)||89 (26)||96 (26)|
|van der Linden et al 1984 (19)||83 (26)||98 (26)|
|Dougados et al 1991 (26)||94 (26)||87 (26)|
|Bywaters 1968 (77)||NS||NS|
|Brewer et al 1972 (21)||94||NS|
|Brewer et al 1977 (78)||NS||NS|
|Wood 1978 (79)||NS||NS|
|Kvien et al 1982 (80)||NS||NS|
|Fink 1995 (81)||NS||NS|
|Petty et al 1998 (82)||NS||NS|
|Petty et al 2004 (83)||NS||NS|
|Juvenile PsA and SpA|
|Lambert et al 1976 (84)||NS||NS|
|Southwood et al 1989 (85)||NS||NS|
|Rosenberg and Petty 1982 (86)||NS||NS|
|Dermatomyositis and polymyositis|
|Bohan and Peter 1975 (57, 58)||45–93 (98–101)||93 (62, 101)|
|DeVere and Bradley 1975 (8)||NS||NS|
|Love et al 1991 (17)||NS||NS|
|Medsger et al 1970 (10)||NS||NS|
|Pearson 1963 (11)||NS||NS|
|Wolfe et al 1990 (22)||88||81|
|Yunus et al 1981 (20)||96||100|
|Brochner-Mortensen et al 1963 (Rome criteria) (55)||NS||NS|
|Decker 1968 (New York criteria) (56)||NS||NS|
|Wallace et al 1977 (23)||85–88||96–97|
|Morens and O'Brien 1978 (70)||NS||NS|
|AHA 1990 (71)||NS||NS|
|AHA 2001 (72)||NS||NS|
|Newburger et al 2004 (73)||NS||NS|
|Altman et al 1986 (knee) (29)||91–94||86–88|
|Altman et al 1990 (hand) (49)||92–94||87–98|
|Altman et al 1991 (hip) (50)||89–91||89–91|
|Bird et al 1979 (51)||82–92||75–80|
|Jones and Hazleman 1981 (52)||NS||NS|
|Chuang et al 1982 (9)||NS||NS|
|Taylor et al 2006 (24)||91 (24)||99 (24)|
|Dougados et al 1991 (26)||56 (52)–81.6 (24,26)||91 (24)|
|Moll and Wright 1973 (12)||91–94 (24,53)||98–99 (24,53)|
|Bennett et al 1956 (31)||70–88||77–91|
|Ropes et al 1958 (34)||NS||NS|
|Kellgren and Bunim 1963 (Rome criteria) (35)||NS||NS|
|Lawrence and Allander 1968 (New York criteria) (36)||NS||NS|
|Arnett et al 1988 (25)||91–94||89|
|Dougados et al 1991 (26)||87 (26)||87 (26)|
|Masi et al 1979 (61), 1980 (62)||97 (61,62)||98 (61,62)|
|Systemic lupus erythematosus|
|Tan et al 1982 (37)||96 (37)||96 (37)|
|Churg-Strauss syndrome (102)||85 (102)||99 (102)|
|Wegener's granulomatosis (103)||88 (103)||92 (103)|
|Giant cell arteritis (104)||94 (104)||91 (104)|
|Henoch-Schönlein purpura (105)||87 (105)||88 (105)|
|Hypersensitivity (106)||71 (106)||84 (106)|
|Polyarteritis nodosa (107)||82 (107)||87 (107)|
|Takayasu arteritis (108)||91 (108)||98 (108)|
Classification criteria are a cornerstone of clinical trials and epidemiologic studies in rheumatology. Our study has identified a large number of classification criteria that are widely cited in rheumatic disease research. Many of these criteria sets have moderate to excellent sensitivity and specificity; however, some criteria sets were developed using methods that are not currently considered to be strong. Many were developed prior to recognition of standards of methodologic quality. Our intention was not to be overly critical of these older criteria sets; rather, we wish to make readers aware of the current state of criteria development, identify potential weaknesses, and highlight areas where additional research is needed.
The majority of classification criteria were developed to define a group of patients with disease for the purpose of clinical research. In a few instances, it was not the primary intent of the authors to develop classification criteria; rather, the criteria were developed within the context of a study, and over time have become widely cited as classification criteria (8–12). Most criteria sets were not developed for clinical decision making, and their performance as diagnostic criteria for rheumatic diseases in general practice has largely not been evaluated. Thus, the use of classification criteria for clinical diagnosis may be inappropriate. For example, research on systemic vasculitis demonstrated that the 1990 ACR classification criteria (13) function poorly as diagnostic criteria (14). When used for diagnostic purposes, the ACR criteria suggest homogeneity of clinical presentation for all subsets of a disease when in fact there is heterogeneity of presentation (14).
Clinicians should be cautious when adopting classification criteria for clinical diagnosis. Validation of criteria sets for diagnostic purposes usually requires very high specificity with good sensitivity. By comparison, validation of criteria sets for use in epidemiologic studies of prevalence and incidence requires a balance of sensitivity and specificity. The variability in the discriminant validity thresholds relates to their intended use. When a patient has been given a diagnosis using criteria with a high specificity, the clinician can be sure the patient has the disease. In contrast, in epidemiologic studies evaluating incidence and prevalence, overly specific criteria would result in underestimation of the true prevalence, and overly sensitive criteria would result in overestimation of the true prevalence. In this situation, researchers would prefer a balance of sensitivity and specificity.
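The effect of imperfect criteria on a prevalence estimate can be illustrated with a short calculation. The sketch below uses the standard relationship between true prevalence and the proportion of a population that would fulfill the criteria; it is not a method used in this review, and the sensitivity, specificity, and prevalence values are hypothetical.

```python
def apparent_prevalence(true_prevalence: float, sens: float, spec: float) -> float:
    """Proportion of a population expected to fulfill the criteria.

    True positives contribute sens * p; false positives contribute
    (1 - spec) * (1 - p), where p is the true prevalence.
    """
    return sens * true_prevalence + (1.0 - spec) * (1.0 - true_prevalence)


true_p = 0.01  # hypothetical true prevalence of 1%

# Highly sensitive but less specific criteria: false positives dominate.
print(f"{apparent_prevalence(true_p, sens=0.95, spec=0.90):.4f}")

# Perfectly specific but insensitive criteria: half of the true cases are missed.
print(f"{apparent_prevalence(true_p, sens=0.50, spec=1.00):.4f}")
```

Under these assumed values, the first criteria set overestimates the true prevalence roughly tenfold, and the second underestimates it by half.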
We found that many studies developing classification criteria depend on expert opinion and examination of published literature to generate items for inclusion in a criteria set. Few studies commented on the need to ensure content validity, and even fewer tested for it. Furthermore, the construct or conceptual framework of the disease was rarely explicitly stated. Thus, all domains of a given disease may not be represented in each classification criteria set. Since none of the rheumatic diseases have a single gold standard diagnostic test for the identification of cases, the explicit specification of the clinical characteristics or domains that characterize the construct of the disease is necessary. Many of the older criteria sets may benefit from reevaluation to reflect the current understanding of the pathophysiology and phenotypic expression of the disease. For example, in both systemic sclerosis (15, 16) and inflammatory myopathies (17, 18), advances in our understanding of autoantibody profiles and disease manifestations have led to proposals for new classification criteria. Future investigators should state the conceptual framework on which the classification criteria are based.
In the studies we examined, the methodologic rigor with which cases and controls were selected to derive the classification criteria sets was weak in several ways. First, early classification criteria used small numbers of patients (11, 19, 20), whereas more recent criteria sets have used appropriately large numbers of cases and control participants (generally over 100 of each) (21–25). Second, in the majority of situations, cases and control participants were obtained from academic rheumatology practices. If criteria are to be used in epidemiologic studies, greater effort should be made to include cases and control participants from the community setting (2). Third, the control groups from one-third of the criteria sets were composed of patients with a rheumatic disease; only 4 criteria sets used patients with nonrheumatic diseases as controls, and only 2 criteria sets used healthy control participants. The choice of an appropriate control group has important implications for the use and performance of the criteria set. If criteria are intended to distinguish patients with the disease from patients without the disease, or patients with the disease from patients with nonrheumatic diseases, a wider breadth of control groups should be included (2). The inclusion of appropriate control groups will affect the discriminative utility of the criteria in a generalized clinical setting. Fourth, in general, the investigators developing the criteria were also the ones contributing cases and controls. Due to the potential introduction of bias and circularity of reasoning, criteria should be validated by investigators and in subjects who are distinct from the original group of investigators and subjects. Fifth, nearly 50% of the criteria sets in this study were derived from expert opinion and/or literature search instead of patient data.
We observed a secular trend in the statistical methods used for criteria ranking and item reduction; more recently developed criteria sets have used more sophisticated methods. Less than 10% of the studies we examined contained comments on the need for individual items from the criteria set to be reliable, precise, and readily available. The strength of a criteria set is only as good as its weakest criterion. Thus, a lack of reliability, precision, or feasibility could be a critical flaw in a criteria set. Similarly, less than one-third of the studies' investigators commented on the need for the criteria set to have content, construct, or face validity. Much work is needed to evaluate these characteristics in the individual elements of a criteria set as well as for the set as a whole.
Validation of the final criteria set in an independent set of patients was not considered a requirement in the past and was rarely done. A few groups have compared the validity of older criteria sets with new criteria sets (24, 26), thus indicating the incremental value of one set over the other (27). For diseases where multiple classification criteria exist, independent comparative validation and evaluation is needed to inform researchers of the appropriate choice of criteria for clinical studies.
Overall, most rheumatic disease classification criteria achieve a good balance between sensitivity and specificity. However, the sensitivity and specificity of the criteria can change if the derivation population differs substantially from the population to which the criteria are applied. Most of the criteria were originally tested and designed for patients with a high pretest probability of disease. If such criteria are applied in a general clinical practice with a low disease prevalence, they may not have the same discriminative capacity. Similarly, the sensitivity and specificity of a criteria set can change in patients with early manifestations of a disease (28). For example, the performance of the ACR classification criteria for knee osteoarthritis (29) varies based on disease severity, with increased sensitivity for more severe disease (30). Unfortunately, many articles do not report the sensitivity and specificity of different combinations of positive and negative criteria, and valuable information is lost as a result. Performance should be reported as a matrix of results according to which combinations of criteria were fulfilled, so that criteria sets can be adapted to the differing objectives of clinical research studies.
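The point about pretest probability can be made concrete with Bayes' theorem. The sketch below holds sensitivity and specificity fixed, which is a simplifying assumption given that both can themselves shift between populations, and shows how the positive predictive value falls as prevalence falls. All numbers are hypothetical.

```python
def positive_predictive_value(prevalence: float, sens: float, spec: float) -> float:
    """Probability that a patient who fulfills the criteria has the disease."""
    true_pos = sens * prevalence
    false_pos = (1.0 - spec) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical criteria set with 90% sensitivity and 90% specificity,
# applied in a referral setting (50%) and in general practice (1%).
for prevalence in (0.50, 0.01):
    ppv = positive_predictive_value(prevalence, sens=0.90, spec=0.90)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.3f}")
```

With these assumed values, 9 of 10 patients fulfilling the criteria in the referral setting have the disease, compared with fewer than 1 in 10 in the low-prevalence setting.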
There are limitations to our study. Due to the constraints of feasibility, less frequently cited classification criteria have been excluded from our review. Our findings also have limited precision. Criteria developmental and methodologic characteristics may have been used in the studies we examined but not clearly stated in the published articles. The next generation of articles describing development/revision and validation of criteria should clearly specify developmental and methodologic characteristics so that scientific quality may be accurately assessed.
In conclusion, existing classification criteria sets for the rheumatic diseases suffer from methodologic issues related to the selection of cases and controls for the derivation sample. Classification criteria should be tested against both nonrheumatic disease control groups and rheumatic disease control groups, and their performance in these settings should be reported. Criteria sets should also be independently validated in cohorts of patients and control subjects different from the derivation cohort. Additional research is needed regarding the reliability, construct validity, content validity, and face validity of both individual elements in a criteria set and full criteria sets. The sensitivity and specificity of combinations of criteria should be reported to improve adaptability. Despite potential limitations in the methods used to create some classification criteria, the sensitivity and specificity of many of the classification criteria sets for rheumatic diseases are moderate to strong.
Drs. Johnson and Solomon had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Johnson, Vlad, Feldman, Felson, Hawker, Singh, Solomon.
Acquisition of data. Johnson, Goek, Singh-Grewal, Vlad, Feldman, Hawker, Singh, Solomon.
Analysis and interpretation of data. Johnson, Goek, Singh-Grewal, Feldman, Felson, Hawker, Singh, Solomon.
Manuscript preparation. Johnson, Goek, Singh-Grewal, Vlad, Feldman, Felson, Hawker, Singh, Solomon.
Statistical analysis. Johnson, Goek, Singh-Grewal, Feldman, Singh.