In this issue, Nagy and Naryshkin1 present an analysis of the statistical pitfalls of the cytology proficiency testing (PT) program mandated by the Clinical Laboratory Improvement Amendments of 1988 (CLIA 88).2 Their analysis is based on established statistical principles, and they suggest that experts in test design be engaged to assist in redesigning the testing program so that it is more sound statistically than the current protocol.

The CLIA 88 law was passed largely in response to a 1987 Wall Street Journal article3 that documented substandard practices in some cytology laboratories and related the unfortunate stories of 3 women who had had false-negative Papanicolaou (Pap) tests before a diagnosis of invasive cervical cancer. The regulations specifying the current PT exam are based on the following portion of the law:

The standards … shall include … periodic confirmation and evaluation of the proficiency of individuals involved in screening or interpreting cytological preparations, including announced and unannounced on site proficiency testing of such individuals with such testing to take place to the extent practicable, under normal working conditions.

This law makes no specific requirements regarding the parameters of the testing such as the number of questions, testing interval, or scoring system. Unfortunately, the law also has no requirements regarding the statistical validity or reliability of the program. The regulations written to implement the CLIA 88 law represent a sincere effort to reduce the likelihood of adverse patient outcomes attributable to pathologists or cytotechnologists with insufficient skills, or to substandard laboratory practices. It is therefore surprising that little or no attention appears to have been paid to principles of statistics and standardized test design in the PT section of the regulations. One would have expected that as much scientific and statistical rigor as possible would have been brought to this endeavor.

Although CLIA 88-mandated PT has been severely criticized from the time the regulations were published, other important aspects of the regulations have not been nearly so controversial. For example, although some strongly disagreed with the concept of government-imposed workload limits,3 today there is little dissent in the professional community regarding this issue. Similarly, the importance of cytology-histology correlation is widely accepted. The review of negative specimens preceding high-grade lesions is also generally accepted, although the specific details of the CMS-prescribed procedure have been questioned based on real-world evidence.4 There were relatively few rigorous, published scientific data proving that workload limitations, cytology-histology correlation, and review of prior negative specimens when there is a current high-grade lesion lead to improved outcomes. Nevertheless, these common-sense laboratory practices have been generally accepted by most experts and by the broader laboratory community, and in the absence of conclusive data to the contrary most believe that they are likely to be of some benefit to patients.

On the other hand, the disciplines of statistics and test design have been around for a long time, yet the fundamental statistical concept of reliability appears to have been given little or no consideration in the design of PT. In a study published after the final rule had been issued, Keenlyside et al.5 were able to show only a 0.24 correlation between scores on CLIA 88-like PT and work performance evaluations. Nagy and Naryshkin1 point out that the failure rates on the initial exam and the first retest are 9% and 10%, respectively. If the test measured competence reliably, one would expect a pass rate well below 90% on a retest taken after a brief interval by a group that had a 100% failure rate on the initial exam. Although different subgroups of examinees show different failure rates, within each subgroup the failure rates on the initial test and the first retest are similar, suggesting that many failures are due to random statistical variation, not variations in competence. A study of PT in the United Kingdom demonstrated that random failures will occur: 7 of 63 cytologists who had proven their competence on 6 consecutive 10-slide PT exams missed an abnormal slide on the seventh.6 In addition to the issue of misclassification of professionals discussed in detail by Nagy and Naryshkin, other technical and scientific issues with the PT have been pointed out over the past 16 years.7–12 These include the testing interval, the lack of any consideration of new technologies in cytology, and the need for field validation of slides. The highly complex nature of these issues may also be seen or inferred in several publications based on data from the College of American Pathologists Interlaboratory Comparison Program in Gynecologic Cytopathology over the last few years.13–15
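
The retest argument can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not an analysis from Nagy and Naryshkin or from the PT data: the two-group model, the function name retest_failure_rate, and every probability below are hypothetical, and the model assumes that an examinee's attempts are independent given his or her group. Both scenarios are tuned to give an initial failure rate of about 9%; they differ only in whether failure reflects chance or competence.

```python
def retest_failure_rate(frac_weak, p_fail_weak, p_fail_strong):
    """Return (initial failure rate, failure rate on retest among initial failers).

    Hypothetical two-group model: a fraction `frac_weak` of examinees fails any
    given exam with probability `p_fail_weak`; everyone else fails with
    probability `p_fail_strong`. Attempts are assumed independent within a group.
    """
    # Probability of failing the initial exam across the mixed population.
    initial = frac_weak * p_fail_weak + (1 - frac_weak) * p_fail_strong
    # Probability of failing both the initial exam and the retest.
    both = frac_weak * p_fail_weak**2 + (1 - frac_weak) * p_fail_strong**2
    # Conditional probability of failing the retest given an initial failure.
    return initial, both / initial


# Scenario A (assumed): failure is pure chance; every examinee fails 9% of the time.
chance = retest_failure_rate(frac_weak=0.0, p_fail_weak=0.0, p_fail_strong=0.09)

# Scenario B (assumed): failure mostly reflects competence; 9% of examinees
# fail 90% of the time and the remainder fail 1% of the time.
skill = retest_failure_rate(frac_weak=0.09, p_fail_weak=0.90, p_fail_strong=0.01)

print(f"chance-driven:     initial {chance[0]:.1%}, retest among failers {chance[1]:.1%}")
print(f"competence-driven: initial {skill[0]:.1%}, retest among failers {skill[1]:.1%}")
```

Under these assumed numbers, the chance-driven scenario reproduces the same 9% failure rate on the retest, whereas the competence-driven scenario yields a retest failure rate of roughly 81% among those who failed initially. The reported pattern of nearly equal initial and retest failure rates resembles the former far more than the latter.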

Field validation warrants additional discussion here, although both current PT providers use field-validated slides in their programs. Field validation is the process of determining statistically that a slide is a good example of the entity that it is supposed to represent. Without field validation the slides are being tested as much as the examinees. Without field validation some examinees will ‘miss’ some cases due to the well-documented interobserver variability in Pap test interpretation,13 not due to the inability to correctly categorize slides. Field validation identifies slides that the majority of competent practitioners will categorize correctly. It tends to identify groups of slides that are relatively homogeneous within each diagnostic category. In everyday practice the group of cases falling within any 1 diagnostic category is more heterogeneous morphologically than could be fairly included in a rigorously designed PT program with field-validated slides. In a high-stakes test such as PT, field validation is also imperative to make the test fair. Ironically, it has the effect of removing the more challenging (and hence educational and ultimately beneficial to patient care) cases from the test because only straightforward examples are likely to meet field validation criteria. The statistical issues discussed by Nagy and Naryshkin, the lack of field validation in the initial version of the test, and the well-documented interobserver variability in the interpretation of Pap tests13, 16 have probably all contributed to the misclassification, at least initially, of some of the examinees in the current PT program.

If PT in its current form is statistically unreliable, what are we to make of the difference in pass rates of primary screening pathologists versus other subgroups?17 Many of the primary screening pathologists work in smaller laboratories. This subgroup constitutes only 3.6% of all examinees and probably examines an even smaller percentage of all Pap tests, but obviously any deficits in performance are significant to the women whose slides are examined by these individuals. The difference in scores on PT between this subgroup and others likely has some statistical validity, indicating that there may be a difference in performance between the 2 groups as a whole. However, Nagy and Naryshkin show that the statistical reliability of the current test is poor, indicating that it is difficult to draw meaningful conclusions about individuals, many of whom are likely to be misclassified.1 This is particularly true for individuals with near-passing scores. According to the 2005 PT data, the failure rates on the initial test and the first retest for primary screening pathologists were 33% and 34%, respectively.17 Again, if the test were valid and reliable, one would expect a subgroup with a 100% failure rate on the first test to fail at a greater rate than the overall group on a retest after a short interval.

Nagy and Naryshkin's suggested solution to the statistical issues they have raised is to have a much longer ‘board-type’ exam, administered at 8- to 10-year intervals. The Cytopathology Education and Technology Consortium (CETC) and the Clinical Laboratory Improvement Advisory Committee (CLIAC) suggested testing intervals of 5 and 3 years, respectively, and a 20-question test, which would not be long enough to meet Nagy and Naryshkin's criteria.7, 18 Professional test designers undoubtedly could be of some assistance with this issue. It is unlikely that there are a sufficient number of glass slides to administer a long test such as that suggested by Nagy and Naryshkin any more often than every 8 to 10 years. Moreover, because newly certified pathologists and cytotechnologists have time-limited certification and will need to recertify, a long government-mandated test is likely to be redundant and expensive for these individuals, but of little benefit to patients.

Statistical issues with the current implementation of cytology PT should not cause us to think that adjustments to the current program are the only or even the best way to ensure high-quality practice. Our goal is to deliver the best patient care possible within the acknowledged limitations of the Pap test, not to adhere blindly to any particular model for PT. The approach put forth in a recently introduced bill in the US House of Representatives, H.R. 1237, would mandate annual continuing education of all individuals who examine gynecologic cytology specimens. Mandatory continuing education is also implied in the recent suggestion by the College of American Pathologists that cytology regulations be rewritten in a manner analogous to the Mammography Quality Standards Act (MQSA).18, 19 The concept is certainly worth considering, but regulations would need to be tailored specifically to the needs of cytology QA. If well executed, this approach could be more effective than the current PT program in remediating practitioners whose interpretive skills need improvement. The effectiveness of the continuing education approach has been demonstrated in an External Quality Assessment (EQA) program for breast histopathology in the British National Health Service.20 Those authors state, “Pathologists who joined the scheme improved over time, particularly those who did less well initially” (italics mine). An education-oriented approach to PT is probably also better for assuring that practitioners learn about new concepts and technologies in a timely manner, ultimately leading to better patient care.

The current approach to gynecologic cytology PT in the US is clearly not education-oriented. At best, a PT system based on periodic multiple-choice exams can be expected to detect the worst performers, and the current test cannot do even that reliably, although that is why it was created. Public health would probably be better served and the funds used more effectively if PT had a substantially greater focus on education, and a substantially decreased frequency of testing, as recommended by the CETC7 and CLIAC.18 In conjunction with the laboratory process controls mandated by CLIA 88, overall laboratory performance, and hence detection of precancerous cervical lesions, would likely be better than in a system in which PT has no educational component and consists only of an annual 10-question multiple-choice examination. Unfortunately, the work group convened by CLIAC to develop recommendations was not permitted to discuss education-oriented proposals.

At a minimum, the changes recommended by CLIAC18 should be implemented. The issues raised by Nagy and Naryshkin would still exist with a relatively short test, however. The potential benefits of an educationally oriented program of PT have been demonstrated in the British EQA for breast histopathology. Expertise from all relevant disciplines, including statistics, test design, quality assurance, and continuing medical education programs, should be utilized. Knowledge acquired and changes in practice since the CLIA 88 regulations were issued 15 years ago must also be taken into consideration. Regulations should be written in a manner that anticipates innovation, so that cytology professionals are not evaluated on obsolete practices while a multiyear process of updating the regulations runs its course. Our patients deserve no less.

