Proficiency testing in cytopathology (PTC), which was established in the 1991 regulations to implement the Clinical Laboratory Improvement Amendments of 1988 (CLIA'88),1 has only recently been enforced on a national scale. For more than a decade, during which logistical hurdles hampered the development of a national program for PTC, there was little incentive to think about the value and potential of PTC or its theoretical background or to worry that the test design was so poor. In 2004, however, the Centers for Medicare and Medicaid Services announced that a national PTC program developed by the Midwest Institute for Medical Education had been approved and that the regulations finally would be enforced on a national level. Suddenly, the shortcomings of the test were everyone's problem. What followed was a flurry of comments, articles, proposals, and Internet discussions about the PTC and its future.2–9 Although the testing has proceeded nationwide in conformity with the original regulations, the dust has not yet settled on the subject. The professional organizations agree that PTC, as prescribed in CLIA'88, is inadequate and is in great need of improvement if indeed it should remain in place at all. Regarding the projected revisions, it is a real impediment that some regulatory authorities that are in a position to make decisions about the implementation of PTC apparently are not familiar with most of the implications of test theory, which is an exceedingly complicated subject. So long as the test is mandatory for every practitioner of gynecologic cytopathology in the United States, it is in the best interest of all participants for PTC to become a scientifically well-founded, valid, and reliable quality-assurance method.8 In the current article, we have attempted to shed light on some gaps in the knowledge about the theoretical underpinnings of PTC that seem to endure in the cytopathology literature.

**Cancer Cytopathology**

# The dysfunctional federally mandated proficiency test in cytopathology^{†}

A statistical analysis

^{†}See related editorial on pages 000–000, this issue.

## Abstract

Proficiency testing in cytopathology and in other disciplines should be based on firm statistical and scientific foundations, because test theory in general is a heavily statistical subject. Statistical considerations have demonstrated that the design of “short” proficiency tests in cytopathology, including the current federally mandated test, fundamentally is unsound because of the lack of sufficient validity and reliability. Examinees too frequently are misclassified by such short-format tests: Competent examinees fail the test in surprisingly high numbers, whereas most of the examinees who have insufficient cytologic skills eventually pass the test after the allowed retakes. Only dichotomous tests are suitable for accurate computation of the effects of test design on reliability, but the statistical conclusions also are generalizable to nondichotomous tests. In conclusion, the current federally mandated proficiency test cannot reliably measure the level of expertise of cytologists and, thus, cannot assure that only adequately skilled individuals evaluate Papanicolaou test samples. To render the test suitable for its intended purpose, the authors believe that complete redesign of the test, with the participation of experts in modern test theory, would be advisable. Cancer (Cancer Cytopathol) 2007. © 2007 American Cancer Society.

## Test Theory Is Statistical

Test theory is a heavily statistical subject. Virtually all aspects of test theory have been investigated in depth almost exclusively by educators and psychologists, which is understandable, because testing is a central issue in their disciplines.10–13 Unfortunately, this valuable body of literature apparently has been disregarded completely by the federal authorities that are responsible for PTC regulations.

The statistical apparatus used in modern test theory is formidable.10–13 Many books and articles written about the subject use highly sophisticated mathematical tools, including differential and integral calculus and matrix algebra. One of the reasons for the high degree of mathematization of test theory in psychology and education science is that these disciplines deal largely with intangibles, like motivation, intelligence, understanding, and adaptability, which are not directly measurable.13 Such entities must be studied indirectly, through measurements of other quantities. That is why psychological test theory introduced the concept of “constructs” that can substitute for and represent the kinds of abstract attributes mentioned above. Even so, the highly complicated mathematical and statistical tools that have been promoted in educational and psychological test theory fulfill mainly academic purposes. Most actual problems in everyday testing can be solved on a practical level that does not use highly complicated mathematical methods but, at the same time, does not disregard basic statistical principles.

## Testing in the Physical and Biologic Sciences

Cytopathology, unlike educational science or psychology, is an applied natural science, and this is one of the reasons why PTC can be performed without the application of overly sophisticated mathematical tools. Interpretation of Papanicolaou smears, reproduction of cytologic diagnoses, and measurement of false-negative proportions,14 among others, are very complex tasks.15–17 By comparison, it is technically a straightforward matter to evaluate the examinees' ability to assign diagnostic categories to cytologic changes observed on a slide or computer screen. Thus, abstract constructs hardly are needed in PTC. Nevertheless, a certain level of mathematical and statistical understanding by the designers of the test is crucial if a fair and scientifically valid system of PTC is to be established. Most pathologists, including ourselves, do not have rigorous training in statistics; therefore, if PTC is to continue, then the regulatory authorities ought to contract with experts in statistics and test theory who, through interaction with knowledgeable cytopathologists and cytotechnologists, would design an equitable and scientifically well-founded system for the nationwide PTC.

We do not mean to suggest that statisticians have not participated in the design of cytology testing programs. In fact, the College of American Pathologists' (CAP) Interlaboratory Comparison Program for Cervicovaginal Cytology was designed, implemented, and monitored with the extensive help of statistical expertise.18 However, this educational endeavor was not intended to be a PTC program as envisioned in the federal regulations. In fact, its original, scientifically and statistically supported structure ironically prevented its use as a PTC program because of the specific requirements of the federal regulations.

## Short Tests and Reliability

One of the central problems in the practice of PTC is reliability,8, 13 and the reliability of PTC is related closely to the size of the test sets (the number of test items or challenges in 1 test set).11 “Short” tests, which require the evaluation of relatively small numbers of slides, are characterized by a high misclassification rate. (The pervasive effect of sample size on the reliability of statistical inference is the reason why pollsters use large samples: The larger the sample, the narrower are the confidence limits in relative terms. The statistical estimates inferred from a single sizable sample that has been chosen by randomization will approach the true parameters of the population.) Short tests will not prevent the frequent failure of competent examinees or the passing of examinees who have less than desirable skill levels. As early as 1991, one of us (G.K.N.), in a report that was written with D.C. Collins, emphasized that the expected misclassification rate of such short tests can be surprisingly high and that, in the case of dichotomous tests, this rate can be calculated (or approximated) through the use of the binomial theory of statistics.19 (A dichotomous test evaluates the responses to test items as “right” or “wrong,” without using intermediate results or weighting of answers. The PTC system used in New York State for 36 years was dichotomous and so was the original Interlaboratory Comparison Program in Cervicovaginal Cytology. The CLIA'88-mandated PTC is not dichotomous.) This so-called “simple binomial error model” was described in test theory initially in the 1950s.10, 11

The results of the CLIA'88 mandated national PTC in 2005 dramatically demonstrated the effect of misclassification during short tests, as described previously.20, 21 According to the data from the National Cytology Proficiency Testing Update, 9% of the examinees failed the test when they attempted it for the first time.22 However, when this group that supposedly had inferior skills retook the test, curiously, the failure rate for this second attempt was similar to that for the entire original group (10%). It appears that the cytologic skills among those examinees who had failed originally improved miraculously, allowing 90% of them to pass the examination, although all of them initially failed. It is hard to believe that a short remedial training between the first and second attempt could result in such an impressive real improvement. The only plausible scientific explanation is the well-known statistical phenomenon, the Galtonian “regression toward the mean.” The majority of failures during the first attempt were the consequence of misclassification because of the poor validity and reliability of the short test and were not caused by the insufficient skills of those who failed. The failure rate in all groups of examinees is about the same on the first attempt and on the second attempt, and previous failures do not seem to matter much. Essentially, the results of the CLIA'88-mandated PTC mostly mirror the statistical chances and not the examinees' skills.20, 21

Of course, multiple other variables beyond regression toward the mean, including experience gained in the technique of the test, differences in the difficulty of particular test sets, and even increased skills after remedial training, etc, also may play a role in the improvement of test results at the second attempt for individual examinees. However, to date, we do not have any data or even a plausible explanation concerning how any of these other factors, with the exception of regression toward the mean, could produce such a consistent result.
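The reasoning about retake results can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a dichotomous 10-slide test with a 90% passing level (the CLIA'88-mandated test is not dichotomous), independent attempts with unchanged skills, and 2 hypothetical populations whose composition is invented for the example. The function names are ours, not part of any published method.

```python
from math import comb

def fail_probability(n_slides: int, true_score: float, passing_percent: int = 90) -> float:
    """Chance of failing a dichotomous test under the simple binomial model."""
    min_correct = (n_slides * passing_percent + 99) // 100  # integer ceiling
    p_pass = sum(
        comb(n_slides, k) * true_score**k * (1 - true_score)**(n_slides - k)
        for k in range(min_correct, n_slides + 1)
    )
    return 1 - p_pass

def first_and_retake_failure(population: dict, n_slides: int = 10):
    """population maps true score -> share of examinees (shares sum to 1).

    Returns (failure rate on the first attempt,
             failure rate on the retake among those who failed first).
    """
    first = sum(w * fail_probability(n_slides, ts) for ts, w in population.items())
    twice = sum(w * fail_probability(n_slides, ts) ** 2 for ts, w in population.items())
    return first, twice / first

# Hypothetical populations, chosen only for illustration (not real data).
homogeneous = {0.95: 1.0}
mixed = {0.95: 0.6, 0.90: 0.3, 0.80: 0.1}
for name, pop in (("homogeneous", homogeneous), ("mixed", mixed)):
    first, retake = first_and_retake_failure(pop)
    print(f"{name}: first attempt {first:.1%}, retake among failers {retake:.1%}")
```

Under these assumptions, when all examinees share the same true score, failure is purely a matter of chance and the retake failure rate equals the first-attempt rate. When failures mainly reflect differences in skill, as in the mixed population, the group that failed is enriched with weaker examinees and fails the retake at a noticeably higher rate.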

## The Simple Binomial Error Model

Misclassification of examinees by any short test, including the CLIA'88-mandated PTC, can be demonstrated by means of an analogy. Strictly speaking, this analogy is applicable only to dichotomous testing systems. However, in this sense, dichotomous and nondichotomous systems are correspondent. For statistical or evaluation purposes, nondichotomous systems can be made dichotomous at any time, even after the tests have been carried out. For example, an answer can be evaluated as correct only if it falls into the appropriate single category (“success”) and all other answers are rated as wrong (“failure”). Another solution to this problem in PTC would be to restrict the number of diagnostic categories to 2, with 1 category, for instance, “negative for premalignant or malignant changes” and the second category “premalignant or malignant lesions are present.” This is the approach used in the original CAP PAP program with its “100 series” and “200 series.”

The CLIA'88 regulations concerning PTC, with their 4 diagnostic categories and complicated scoring system, do not fit into the dichotomous scheme.1 Despite this fact, the conclusions drawn from the binomial error model are applicable, to a large extent, to any short test, including PTC.

## Example of Simple Binomial Error Model

For the purpose of illustration, let us suppose that, in a large population (for instance, that of an entire country), the results from a scrupulous statistical survey using many thousands of questionnaires and proper randomization indicate that the proportion of individuals who like to watch television (TV) is 90%. Because the survey is conducted in a scientific way and the sample size is very large, this result is considered highly accurate. The basic question on which the analogy with PTC will be based is, “What can we expect if we ask 10 randomly selected individuals in this population about their attitude toward TV?” The most probable result will be that 9 of the 10 individuals like TV. However, it is reasonable to expect that, in many samples that consist of 10 individuals, all 10 individuals are TV fans; whereas, in other similar samples, there may be only 8, 7, or 6 such individuals. It is hardly conceivable, however, that we will identify as few as 1 or 2 fans in a sample of 10 individuals if the principle of random selection is followed.

Random selection is important. For example, a nonrandom sample, like one that consists exclusively of nuns in convents, would not yield a statistically valid reflection of the entire population; indeed, we may identify only 1 or 2 individuals in such a sample who like to watch TV. Exclusive selection of nuns or members of any other group with some special interest would not be compatible with the principle of randomness. However, to select a nun occasionally in a sample, with a frequency roughly corresponding to the proportion of nuns in the entire population, would be appropriate.

There is a statistical method that uses the so-called “binomial formula” for calculating the probability of encountering 10, 9, 8, 7, etc, TV fans in a sample of 10 individuals from our postulated population. (This method is not detailed in the current article, but an explanation can be found in any elementary statistical textbook.23, 24) The probabilities even can be looked up in tables that are found at the end of statistical books. Under the circumstances outlined above (with a 90% proportion of TV fans and a sample size of 10 individuals), the probabilities of identifying 10, 9, 8, 7, and 6 TV fans in a random sample of 10 individuals are 0.35, 0.39, 0.19, 0.06, and 0.01, respectively.
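For readers who prefer to compute rather than look up these values, the following minimal sketch evaluates the binomial formula directly with the Python standard library. The function and variable names are ours and serve only as an illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p (the binomial formula)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.9  # sample of 10 individuals, 90% TV fans in the population
distribution = {k: binomial_pmf(k, n, p) for k in range(n, -1, -1)}

for k, prob in distribution.items():
    print(f"{k:2d} fans: {prob:.2f}")
# The first five lines print 0.35, 0.39, 0.19, 0.06, and 0.01
```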

The probability of identifying ≤5 TV fans under the above-described circumstances in a truly random sample of 10 individuals is exceedingly small. The succession of numbers described above represents a “probability distribution,” which can be observed in a histogram (Fig. 1). This distribution is interpreted as follows: If, from this very large population, we take numerous random samples, each consisting of 10 individuals, and ask about their preferences for TV, then we will find that 35% of the samples include 10 fans, 39% of the samples include 9 fans, 19% of the samples include 8 fans, and so on.

If we change the size of the sample, then the individual probabilities and, along with them, the probability distribution also will change. If we choose sample sizes of 100 individuals instead of 10, then the probabilities will be clustered much more tightly around the value of 90% than was the case in the smaller samples. The larger the size of the sample, the more reliable is the estimation; in other words, the observed value in every sample approaches the real population parameter. It is virtually unimaginable that there will be only 50 or 60 TV fans among 100 randomly selected individuals from this population. (Distribution data for such large samples are not provided even in the tables of larger statistical reference books25: They are not needed, because the probability distribution for large samples can be found by the so-called “normal approximation of the binomial distribution.” To perform this method is mathematically simple, but the results may be slightly inaccurate. There are Web-based tools, however, that calculate these probabilities very accurately.26) Of course, this holds true only if the randomness principle is strictly observed.
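The tighter clustering for samples of 100 can be checked numerically. The sketch below compares the exact binomial sum with the normal approximation; the continuity correction of 0.5 and the chosen range of 85 to 95 fans are our assumptions for the illustration and are not taken from the cited references.

```python
from math import comb, erf, sqrt

def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact probability of at most k successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """Normal approximation of the same probability, with a continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    z = (k + 0.5 - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 100, 0.9
# Probability that a sample of 100 contains between 85 and 95 TV fans.
exact_within = binom_cdf(95, n, p) - binom_cdf(84, n, p)
approx_within = normal_approx_cdf(95, n, p) - normal_approx_cdf(84, n, p)
# Probability of finding 60 or fewer fans among 100.
exact_at_most_60 = binom_cdf(60, n, p)

print(f"85 to 95 fans: exact {exact_within:.3f}, normal approximation {approx_within:.3f}")
print(f"60 or fewer fans: {exact_at_most_60:.1e}")
```

By contrast, in samples of 10, the only outcome that falls within 5 percentage points of 90% is 9 fans, which occurs with a probability of only 0.39.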

How can we apply the reasoning described above to the issue of sample sizes in PTC? Fortunately, the results of these binomial calculations can be generalized. The reason why we can do this is that, if the “experiment” qualifies as binomial, then the specifics of the experiment, whether they are related to liking TV or to success in PTC, have no bearing on the values of the probabilities or on the probability distribution.

## True Scores

At this point, we need to review the term “true score,” a concept that is used widely in modern test theory. The true score of a hypothetical examinee is defined as the average of the observed or measured scores that would be obtained over an infinite number of repeated administrations of the same test, provided that the examinee's skills remain indefinitely stable. For actual examinees, the true score can be estimated with a small error margin, but its exact value is essentially unknowable. For instance, if a cytologist screens 100,000 cervical smears, and if his or her diagnoses are correct 98,000 times, then the approximation of his or her true score is 0.98. Because the accurate determination of the true score would require an infinite number of repeated tests, which is not feasible, this true score of 0.98 remains an approximation. Obviously, we can be rather sure that, when the same individual screens the next 100,000 preparations, the approximation of his or her true score will not remain exactly the same: The chances of this are infinitesimally small. The estimate of the true score will almost certainly change slightly, for instance to 0.97 or to 0.99, and so on, for each successive trial.

It has to be emphasized that assignment of an exact “true score” to a cytologist is somewhat arbitrary for further reasons. It cannot be expected that anybody's cytologic skills will remain invariant for a prolonged time. We can hope, of course, that the professional prowess of cytologists improves over time. Furthermore, everybody who has ever screened cytology specimens knows that screening performance depends on many factors, some of which are extraneous to the level of cytology skills. On a “good” day, a cytologist may function on a 0.98 score level; whereas, on a different, “bad” day, he or she might be less “proficient.” Even his or her experience with particular kinds of cytologic presentations on the previous day, for example, having seen an unusual presentation of high-grade squamous intraepithelial lesion on a quality-assurance review, could affect decision-making on the current day. Of course, these and other psychological variables (eg, the effects of anxiety or tiredness during tests or routine work) cannot be factored into the statistical considerations. Nagy and Collins, describing this concept, used the term “competence level” instead of “true score” in their 1991 article.19

Direct measurement of the true score is not possible. What we have after an evaluation of test results is the “observed score,” which is related to the true score but is not identical to it. It can be considered an estimate of the true score.

## Comparison of TV Preference and PTC Results

TV preference and PTC results can be compared as follows: The values derived by the binomial formula are determined only by the number of trials and the probability of success. If the “experiment” qualifies as binomial, then the specifics of the experiment have no bearing on the numerical results. (In statistical parlance, any methods or procedures that yield raw data are called experiments.) In our TV example, the number of trials (the sample size) is 10, and the probability of success is 0.9. These 2 values are sufficient to calculate the probability distribution for this specific case. Let us consider now an example of PTC in which these specifics are the same as described above. The PTC design prescribes 10-slide test sets (number of trials). A cytologist who performs routine screening and customarily renders accurate diagnoses 9000 times among 10,000 screened slides has an approximate true score of 0.9. (In other words, the probability of success is 0.9.) When this cytologist attempts to pass this particular PTC, then the probability distribution of the possible correct answers will be identical to the probability distribution observed in the TV example, because the number of trials and the probability of success are the same in the 2 experiments. If this hypothetical cytologist attempts the test many times, then he or she will read 10 slides correctly in 35% of the tests, 9 slides correctly in 39% of the tests, and so on, as illustrated on the histogram in Figure 2. The numerical values in the 2 experiments are identical, as illustrated in Figures 1 and 2.

We also should note that, if an examinee reads 10 slides or 9 slides correctly, which happens in 74% of events under the circumstances described above, then he or she passes the test. However, this individual, who essentially has an adequate true score, will fail a dichotomous PTC 26% of the time because of the low validity and reliability of the test. The phenomenon of failure in this case can be called “type 1 error.” (The null hypothesis is that “the cytoscreener is competent.”) A valid and reliable test is expected to pass virtually all cytoscreeners with true scores on the 0.9 level; however, any dichotomous test that consists of 10 slides or challenges will misclassify approximately 26% of such individuals. It is obvious that this test does not really meet the expectation to determine the competence of an examinee who had a true score of 0.9.
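The 74% and 26% figures follow from summing the binomial probabilities at or above the passing threshold. A minimal sketch, assuming a dichotomous 10-slide test in which at least 9 correct answers are required (the function name is ours):

```python
from math import comb

def pass_probability(n_slides: int, true_score: float, min_correct: int) -> float:
    """Probability of reading at least min_correct of n_slides correctly
    under the simple binomial error model."""
    return sum(
        comb(n_slides, k) * true_score**k * (1 - true_score) ** (n_slides - k)
        for k in range(min_correct, n_slides + 1)
    )

p_pass = pass_probability(n_slides=10, true_score=0.9, min_correct=9)
p_fail = 1 - p_pass  # failure of a competent examinee (type 1 error)
print(f"pass: {p_pass:.2f}, fail: {p_fail:.2f}")  # pass: 0.74, fail: 0.26
```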

It needs to be reiterated here that binomial calculations can be performed only for dichotomous tests. The probabilities for some well-ordered, nondichotomous tests may be calculated by the use of more complicated multinomial assessments.

## Limitations of the Simple Binomial Error Model

The binomial error model provides only a rough appraisal of the statistical factors that need to be taken into account in the design of PTC. One of the drawbacks of the model, as mentioned above, is that it is applicable only to dichotomous testing systems. However, the simplicity, transparency, and mathematical calculability of dichotomous setups counterbalance every other consideration. The dichotomous test design makes it possible to assess the impact of test set size on test validity and reliability and to calculate confidence intervals. Thus, the use of a dichotomous test would confer greater predictability and practicability to PTC. The effects on test validity and reliability of a haphazard design, like the CLIA'88-mandated PTC, hardly are calculable by scientific-statistical means. We do not state that dichotomous designs would solve every problem inherent in every type of test, including PTC. However, given that all other conditions of the testing are equal, dichotomous tests have decisive advantages over nondichotomous tests.

## Size of Test Sets and Rate of Misclassification

Figures 3 through 8 illustrate the probability distributions of correct diagnoses for variable test set sizes and for examinees with different theoretical “true scores” (see legends, Figs. 3–8). An ideal and flawless PTC would fail all examinees with true scores of 0.85, but no test design can fulfill such requirements. The reliability of the tests improves, however, as the test sets get larger. Figures 3 through 8 illustrate that, indeed, for examinees with true scores of 0.85 or 0.8, the accuracy of the test increases in parallel with the increasing size of the test sets. (The failure rates become larger for larger test sets.)

Visualization of the effect of sample size on misclassification also is possible by tabulation. Table 1 demonstrates that the more slides the test set contains, the lower the misclassification rate. There appear to be anomalies at the set sizes of 9 and 19, in which the misclassification rate decreases for examinees with low true scores and increases for the more competent examinees. A test set that consists of 9 or 19 slides would be a very impractical choice. If the passing level is set at 90% (eg, 9 correct answers for 10 slides in dichotomous tests), as is the general practice for PTCs, then 1 error is allowed for a 10-slide set. Under these circumstances, to pass a test based on 9-slide sets with a 90% passing grade would be incomparably more difficult than to pass a test based on a 10-slide set, because a single mistake would mean an error >10% and, consequently, a failure. The situation is similar for 19- or 29-slide sets. The greater grade of difficulty with a 9-slide test set is reflected in the smaller passing rates for both competent and less competent examinees. (This circumstance, paradoxically, improves the accuracy of the test for the participants with low true scores.) For these reasons, if the passing level is set at 90%, then only test set sizes that are multiples of 10 (10, 20, 30, etc, slides or challenges) should be used.

| No. of slides in 1 test set | Percent misclassified: incompetent examinees,* true score 80% | Percent misclassified: incompetent examinees,* true score 85% | Percent misclassified: competent examinees,* true score 95% |
|---|---|---|---|
| 1 | 80 | 85 | 5 |
| 2 | 64 | 72 | 10 |
| 3 | 51 | 61 | 14 |
| 9 | 13 | 23 | 37 |
| 10† | 38 | 54 | 9 |
| 19 | 8 | 20 | 25 |
| 20† | 20 | 41 | 8 |
| 30† | 12 | 31 | 6 |
| 40† | 8 | 26 | 5 |

\* Passing level is set to 90%.

† These are workable set sizes (for details, see text).

Another observable phenomenon in Table 1 is the “law of diminishing returns”: As the number of slides in the test sets increases, the misclassification rates decrease, but the rate of decrease is not constant; it trails off with increasingly larger set sizes. For instance, misclassification of examinees with a true score of 0.8 is almost halved, from 38% to 20%, when the number of slides in the sets increases from 10 to 20. The next step, from a 20-slide set to a 30-slide set, is accompanied by a smaller relative improvement, and so on.

An important conclusion that can be drawn from Table 1 is that, when the number of slides is increased in the test sets, the decrease in the misclassification rate is more precipitous if the true score is 0.8 or 0.85, ie, on the side of the table for less competent examinees, than if the true score is 0.95. From our viewpoint, this is an advantage. The basic purpose of PTC is not the confirmation of the proficiency of the average cytologist who performs well but the identification of individuals who may have problems with expertise and need remediation. The type 1 error, the failure of competent examinees, is less consequential than the type 2 error, the passing of less competent examinees. The simple binomial model is more suitable to investigate the latter than the former in the set-size ranges that are prevalent in the practice of PTC.
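The entries of Table 1 can be recomputed under the simple binomial error model. The sketch below is our own illustration, not the authors' original calculation: it assumes a dichotomous test and a 90% passing level, with the minimum number of correct answers rounded up to the next whole slide. A few cells may differ from the published table by about 1 percentage point because of rounding.

```python
from math import comb

def pass_probability(n_slides: int, true_score: float, passing_percent: int = 90) -> float:
    """Probability of reaching the passing level on a dichotomous test."""
    min_correct = (n_slides * passing_percent + 99) // 100  # integer ceiling
    return sum(
        comb(n_slides, k) * true_score**k * (1 - true_score) ** (n_slides - k)
        for k in range(min_correct, n_slides + 1)
    )

def misclassified_percent(n_slides: int, true_score: float, competent: bool) -> float:
    """Competent examinees are misclassified when they fail;
    incompetent examinees are misclassified when they pass."""
    p_pass = pass_probability(n_slides, true_score)
    return 100 * ((1 - p_pass) if competent else p_pass)

set_sizes = [1, 2, 3, 9, 10, 19, 20, 30, 40]
rows = [
    (
        n,
        misclassified_percent(n, 0.80, competent=False),
        misclassified_percent(n, 0.85, competent=False),
        misclassified_percent(n, 0.95, competent=True),
    )
    for n in set_sizes
]

print("slides  0.80  0.85  0.95")
for n, low, mid, high in rows:
    print(f"{n:6d}  {low:4.0f}  {mid:4.0f}  {high:4.0f}")
```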

## What Should Be the Minimal Number of Test Slides in Test Sets?

The question about the minimal number of test slides in test sets could be formulated more accurately as follows: What should be the minimal number of test slides so that we can be 90% confident that the test result is accurate? This type of calculation is relatively simple to perform if the test is dichotomous. In our calculations, we assumed a dichotomous test and 90% as the passing level for the observed score.

The minimum necessary number of test slides depends to a large extent on the competence of the individual examinee. For a cytologist with very poor skills, a relatively small test set would suffice. However, the discriminatory power of PTC decreases at the point where the skills of the examinee are almost satisfactory but still insufficient. Therefore, for such an individual, the test sets should be much larger if we want 90% confidence. It would be unrealistic to expect any test to differentiate easily between an “incompetent” cytologist whose true score is 0.89 and a “competent” cytologist with a true score of 0.9.

Just to illustrate a possible solution, we calculated the minimal size of test sets for examinees who had a true score of 0.8. We wanted to have 90% confidence in the accuracy of the test result. (This means that at least 90% of examinees with a true score of 0.8 will fail the test if the test set contains the calculated number of test slides.) Similar calculations were performed for examinees who had a true score of 0.85.

For the calculation, we used the algorithm written by the Vassar Education Department, which is in the public domain and may be found on the Internet.26 According to the results, a 40-slide set would provide >90% confidence (exactly, 92.409% confidence) in the accuracy of the results for examinees with a true score of 0.8. A 30-slide set would provide only an 87.729% confidence level for these individuals.

For examinees with a true score of 0.85, much larger test sets would be necessary to provide 90% confidence in the results. A test set consisting of 90 slides would provide 88.468% confidence, and only the use of a 100-slide test set would ensure >90% confidence (exactly, 90.055% confidence) in the test results. The extent of the confidence intervals can be easily visualized. Lord et al. presented the 90% confidence intervals for a 30-item dichotomous test on different true score levels.27
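The search for the minimal set size can be sketched in a few lines. This is not the Web-based calculator cited above but a direct evaluation of the binomial sum; the function names, the step of 10 slides, and the upper search limit of 200 slides are our assumptions for the illustration.

```python
from math import comb

def fail_probability(n_slides: int, true_score: float, passing_percent: int = 90) -> float:
    """Probability that an examinee with the given true score fails a dichotomous test."""
    min_correct = (n_slides * passing_percent + 99) // 100  # integer ceiling
    p_pass = sum(
        comb(n_slides, k) * true_score**k * (1 - true_score) ** (n_slides - k)
        for k in range(min_correct, n_slides + 1)
    )
    return 1 - p_pass

def minimal_set_size(true_score: float, confidence: float = 0.90,
                     step: int = 10, max_size: int = 200):
    """Smallest set size (in multiples of `step`) at which an examinee with this
    true score fails with at least the requested probability; None if not found."""
    for n in range(step, max_size + 1, step):
        if fail_probability(n, true_score) >= confidence:
            return n
    return None

for n in (30, 40):
    print(f"true score 0.80, {n} slides: {fail_probability(n, 0.80):.3%} confidence")
print("minimal set size for true score 0.80:", minimal_set_size(0.80))
print("minimal set size for true score 0.85:", minimal_set_size(0.85))
```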

The numbers provided above are given only for illustrative purposes. It is obvious that test sets consisting of 100 slides, or even 40 slides, could not be used under the generally accepted conditions of PTC. Evidently, only a board-type, full-day, or 2-day-long examination would satisfy the statistical requirements for an accurate and equitable test. Conversely, because such a board-type test would determine the capabilities of the examinees with a high level of accuracy, it would become safe to increase the intertest interval to 8 years or 10 years.9, 28

However, if most aspects of the current federal regulations for PTC remain in force—in other words, if a highly inaccurate and unreliable test also will be used in the future—then it will not be advisable to increase the yearly interval between tests very much. The main reason for this is that short tests are incapable of accurately identifying examinees with low professional skills. Competent examinees who fail the test (type 1 error) pass the test on the second or third attempt with a high probability. Most of these valuable professionals are not harmed much beyond the inconvenience of repeated testing. In contrast, examinees with questionable skills who pass the test (type 2 error) do not have to submit to repeat testing, and they continue to screen patient slides without censure at least until the next test. Of course, it may be argued that, if the test were totally useless, then increasing the interval between test events would not have any effect on public health. However, if the test were totally useless, then the only honest course to follow would be the complete abolishment of PTC. In our opinion, the test in its present form is not totally useless. The current test will force a certain number of cytologists with very poor professional skills (regardless of their low proportion in the entire cytopathology community) to recognize their deficiencies, to participate in remediation(s), and at least to attempt to improve their professional skills. However, as made obvious in the discussion above, the federally mandated PTC in its current form is not able to identify all cytologists with very poor skills. Allowing such individuals, unidentified by the test, to continue screening constitutes a certain danger for the public. If we try to make the current PTC useful at least to some degree, then we should not increase the time interval between tests to 3 or 4 years.

## The High Passing Rate of Less Skilled Professionals in Short Tests

Through the use of the simple binomial model, it also is possible to calculate the number of less-than-competent individuals who eventually will pass a short test after repeated attempts. For instance, as illustrated in Table 1, among 100 examinees who have a true score of 0.85, which lies in the less competent range, 54 individuals will pass a dichotomous test that consists of 10 test slides on the first attempt. The remaining 46 examinees will attempt the test a second time, and 54% of them (ie, 25 individuals) will pass on this second try. The remaining 21 examinees will attempt the test a third time, and 54% of them (ie, 11 individuals) will pass. In summary, 54 + 25 + 11 = 90 of these 100 less-skilled examinees, who were supposed to be identified by the system, will avoid serious consequences if a short, 10-slide dichotomous test with 3 permitted attempts is used.

A similar calculation illustrates that, among 100 examinees with true scores of 0.8, 76 individuals eventually will pass a 10-slide dichotomous PTC if 3 attempts are allowed.
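The arithmetic in the two preceding paragraphs can be reproduced with a short calculation. The following Python sketch is illustrative only. It assumes that passing the 10-slide dichotomous test requires at least 9 correct responses (a threshold that is not stated in this passage and was chosen because it reproduces the 54% single-attempt pass rate cited from Table 1) and that repeated attempts are independent. The function names are hypothetical.

```python
from math import comb


def pass_probability(true_score, n_slides=10, min_correct=9):
    """Binomial probability of at least `min_correct` correct slides.

    `min_correct=9` is an assumed passing threshold, not taken from the text.
    """
    return sum(
        comb(n_slides, k) * true_score**k * (1 - true_score) ** (n_slides - k)
        for k in range(min_correct, n_slides + 1)
    )


def expected_passes(true_score, attempts=3, cohort=100):
    """Expected number in the cohort who pass within `attempts` independent tries."""
    p = pass_probability(true_score)
    return cohort * (1 - (1 - p) ** attempts)


if __name__ == "__main__":
    for score in (0.85, 0.80):
        print(
            f"true score {score:.2f}: "
            f"single-attempt pass probability {pass_probability(score):.3f}, "
            f"expected passes within 3 attempts per 100: {expected_passes(score):.1f}"
        )
```

Under these assumptions, the single-attempt pass probabilities are approximately 0.54 and 0.38, and the expected numbers passing within 3 attempts are approximately 90.5 and 75.7 per 100 examinees. These values agree with the figures of 90 and 76 given above; the small difference for the true score of 0.85 arises because the text rounds to whole examinees at each attempt.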

These numbers indicate all too clearly how poorly short dichotomous PTCs identify less skilled cytologists. However, we do not go so far as to declare that short PTC systems, dichotomous or nondichotomous, are totally lacking in utility. Even a short test generates interest, creates opportunity for self-assessment, and possibly highlights deficiencies in some areas of the professional knowledge of the individual cytologist. This effect should be perceived as beneficial. Our personal experience indicates that very short educational tests, although they may not be suitable in themselves as statistical assessments of the professional knowledge of individuals, almost always provide a welcome impetus for continuing education. A short PTC, as an educational experience, may remain a valuable quality-assurance method, although it is limited in scope. In this regard, other valuable educational activities, such as the CAP Pap program, are fully justified. However, we in the cytopathology community should persevere in our attempts to prevent the deleterious situation in which PTC remains an expensive and rather meaningless ritual: a test that, on repeated attempts, can be passed by virtually all competent cytologists, as expected, and also by a very high percentage of those who would be adjudged incompetent if a more reliable testing process were available.

## Statistics Are Not Everything

A more intensive integration of statistical principles would be needed to make the current design of PTC more functional. However, we do not believe that, even if statistical principles were applied optimally to PTC, all of the inherent problems of testing could be eliminated. There are many nonstatistical facets of all tests, including PTC. For instance, because, in cytopathology, we are confronted with the morphologic manifestations of extremely complicated biologic systems, total equivalence in the difficulty of test challenges (that is, absolute conformity of corresponding slides in different test sets) cannot be achieved. Perhaps this can be overcome with computerized digital tests to some extent in the future.

## Lessons From the Simple Model of Dichotomous PTC That Can Be Applied to the Dysfunctional Federal Design

We emphasize once more that the discussions and calculations above are based on the relatively simple model of dichotomous proficiency testing. The current CLIA'88-mandated test, with its elaborate scoring system and multiple diagnostic categories, is much more complicated; therefore, our conclusions cannot be transferred to it in any straightforward way. The expected misclassification rates, the widths of confidence intervals, and other statistical parameters in nondichotomous systems cannot be calculated accurately by using the simple binomial model. In other words, the generalizability (“external validity”) of the foregoing statistical considerations to nondichotomous systems could be questioned. The Galtonian regression toward the mean in the results of the first year of the CLIA'88-mandated test, however, provides indirect evidence that misclassification by the federal test is substantial and that its magnitude is in the range indicated by the simple binomial model.20, 21 Therefore, it is plausible that the conclusions of the statistical considerations outlined above are applicable to the federally mandated PTC to a large extent.

We emphasize that the theoretical underpinnings of PTC are much more complex than may be perceived readily. We hope that, if mandatory nationwide PTC remains in any form, it will be redesigned as a valid and reliable proficiency testing system or possibly as a board-type examination. We believe that accomplishing this would require the engagement of both cytologists and experts who are well versed in the practical and theoretical aspects of modern test theory. This does not mean that more descriptive data from the existing results of the CLIA'88-mandated PTC should be collected. On the contrary, because the design of the CLIA'88-mandated test is flawed, little true insight may be gained by amassing and further studying descriptive data from such a source. Rather, we advocate the careful application of more inferential or theoretical statistics, which would allow a fairer conceptual design of PTC while leaving the final decisions in the hands of expert cytopathologists and cytotechnologists who are familiar with the wider aspects of our difficult discipline.

## Acknowledgements

We thank Adriana Verschoor, PhD, for valuable editorial assistance.