The impact of implementation of the Bethesda System for Reporting Thyroid Cytopathology on the quality of reporting, “risk” of malignancy, surgical rate, and rate of frozen sections requested for thyroid lesions
Amanda Crowe MD,
Division of Anatomic Pathology, Department of Pathology, University of Alabama at Birmingham, Birmingham, Alabama
The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) has been anticipated to improve communication between pathologists and clinicians and thereby patient outcomes. In the current study, the impact of TBSRTC on various quality and outcome measures was assessed.
The current study included all patients who underwent fine-needle aspiration (FNA) of the thyroid between April 2006 and April 2009. Before implementation, the authors used generic diagnostic categories; after implementation, TBSRTC was used. Quality of reporting, diagnostic categories, rate of surgery, rates of frozen section, the “risk” of malignancy after a cytologic diagnosis, and errors before and after implementation of TBSRTC were compared using the chi-square and Fisher exact tests. Multilevel likelihood ratios and the receiver operating characteristic were used to compare the accuracy of FNA before and after implementation.
A total of 1671 FNAs (957 obtained before and 714 obtained after implementation of TBSRTC) were obtained from 1339 patients. Of these, 301 patients (191 before and 110 after implementation) underwent subsequent surgical resection. Before implementation, the reports were more ambiguous (3.7% vs 0.5%; P < .05) and implicit (5.1% vs 2.7%; P < .05) than after implementation. The overall rate of surgery decreased after implementation of TBSRTC (24.5% vs 19.6%; P < .05). The overall risk of malignancy did not appear to be affected by implementation of TBSRTC, but it decreased significantly after a benign FNA diagnosis compared with a diagnosis of an atypical lesion or follicular neoplasm. The rate of frozen section remained unchanged. The diagnostic accuracy was not found to be significantly different before compared with after implementation of TBSRTC.
Fine-needle aspiration (FNA) of the thyroid provides information regarding thyroid nodules safely and rapidly to triage patients for appropriate management. Although FNA of the thyroid is a valuable test, until recently a lack of uniformity in terminology and reporting (between laboratories/institutions and even intradepartmental variability) has caused confusion among clinicians and partially limited its effectiveness. Although clinicians, including endocrinologists, radiologists, and surgeons, have previously reached consensus regarding the clinical aspects of FNA of the thyroid, until recently there was no consensus concerning the reporting of thyroid FNA results among pathologists. That changed with the National Cancer Institute Thyroid Fine-Needle Aspiration State of the Science Conference, which took place as a 2-day conference in Bethesda, Maryland in October 2007.1, 2 The participants (who included pathologists, endocrinologists, surgeons, and radiologists) acknowledged the importance of developing a uniform terminology for reporting thyroid FNA results. The Bethesda System for Reporting Thyroid Cytopathology (TBSRTC) includes recommendations regarding the format of the report, evaluation of adequacy, diagnostic categories, and suggested risk of malignancy as well as recommended clinical management.1, 2 TBSRTC has been suggested to make the cytology report “unambiguous, clear, succinct and clinically relevant.”1 In the current study, we attempted to assess the impact of TBSRTC on 1) quality of cytopathology reporting, 2) referral rate to surgery for patients with atypical diagnoses, 3) frequency of frozen sections performed during surgery, and 4) the “risk” of malignancy associated with each diagnostic category.
MATERIALS AND METHODS
The study included all patients who underwent FNA of the thyroid over a 3-year period from April 2006 to April 2009. We began implementation of TBSRTC in January 2008. A total of 1671 FNAs from 1339 patients were collected by clinicians with and without ultrasound guidance: 957 before and 714 after implementation of TBSRTC. The specimens were processed according to laboratory policy and typically comprised several Diff-Quik–stained and Papanicolaou-stained smears, which were evaluated by 5 pathologists at the study institution. During the study period, there was no change in the personnel who reviewed and signed the cytology or surgical pathology reports or the physicians and surgeons who managed these cases. Before implementation, we used generic, nonstandardized diagnostic categories for all nongynecological cytology: unsatisfactory, benign, reactive, atypical, and suspicious or positive for malignancy, with a generous use of free text by the pathologists. Our system was not radically different from TBSRTC, except for the intermediate diagnostic categories and the generous use of free text that retrospectively could be translated into 1 of the above categories or TBSRTC. During implementation of TBSRTC, the participating pathologists were asked to review TBSRTC and a template was used to standardize the reporting process. Clinicians were informed about TBSRTC, the risks associated with each category, and prospective management.
Briefly, there are 6 general categories in TBSRTC: 1) nondiagnostic or unsatisfactory; 2) benign; 3) atypia of undetermined significance or follicular lesion of undetermined significance (atypical lesion [AL]); 4) follicular neoplasm or suspicious for a follicular neoplasm (FN); 5) suspicious for malignancy (SM); and 6) malignant neoplasm (MN), with each category having further subcategories.1, 2 For the purpose of the current study and similar to previous reports, we reconstructed the diagnostic categories of thyroid FNA, before implementation of TBSRTC, to closely match TBSRTC diagnostic categories.3-6 TBSRTC diagnostic categories 1, 2, 5, and 6 were identical to our unsatisfactory, benign, suspicious, and malignant categories, respectively. In our system, the intermediate diagnostic categories (atypical with a comment) were accompanied by generous free text and were translated into TBSRTC diagnostic categories 3 and 4.
The quality of the cytology reports was evaluated as: 1) explicit when a specific diagnosis such as “nodular hyperplasia” was rendered, 2) implicit when a general term such as “benign” was used, or 3) ambiguous when a descriptive diagnosis or a list of diagnoses was rendered. Quality of reporting, diagnostic categories, rate of surgery, rates of frozen section, the “risk” of malignancy after a cytologic diagnosis, and errors before and after implementation of TBSRTC were compared using the chi-square and Fisher exact tests. The Mann-Whitney U test was used to compare the distribution of categories before and after implementation. Multilevel likelihood ratios with confidence intervals and receiver operating characteristic (ROC) curves (area under the curve and standard error) were determined using MedCalc statistical software (MedCalc Software, Mariakerke, Belgium). The multilevel likelihood ratio for a diagnostic category is the proportion of malignant cases assigned to that category divided by the proportion of benign cases assigned to it (the category-specific true-positive rate divided by the false-positive rate). The overall accuracy of FNA before and after implementation was compared using the methods of Hanley and McNeil.7, 8 A P value < .05 was considered statistically significant.
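The multilevel likelihood ratio described above can be illustrated with a minimal Python sketch. The function name `multilevel_likelihood_ratios` and the counts in `example_counts` are hypothetical and chosen only for illustration; they are not data from the current study, and the sketch is not the MedCalc implementation.

```python
def multilevel_likelihood_ratios(counts):
    """Return the likelihood ratio for each diagnostic category.

    counts maps a category name to (n_malignant, n_benign) on histology.
    LR = P(category | malignant) / P(category | benign).
    """
    total_malignant = sum(malignant for malignant, _ in counts.values())
    total_benign = sum(benign for _, benign in counts.values())
    ratios = {}
    for category, (malignant, benign) in counts.items():
        p_given_malignant = malignant / total_malignant
        p_given_benign = benign / total_benign
        # A category never seen in benign cases has an unbounded ratio.
        if p_given_benign > 0:
            ratios[category] = p_given_malignant / p_given_benign
        else:
            ratios[category] = float("inf")
    return ratios


# Hypothetical counts (NOT study data): (malignant, benign) on histology.
example_counts = {
    "benign": (4, 70),
    "AL": (4, 10),
    "FN": (4, 10),
    "SM": (8, 6),
    "MN": (20, 4),
}

example_ratios = multilevel_likelihood_ratios(example_counts)
for category, ratio in example_ratios.items():
    print(f"{category}: LR = {ratio:.2f}")
```

In this made-up example the intermediate categories (AL and FN) have a ratio of 1, meaning that such a result would leave the pretest probability unchanged, whereas ratios far from 1 shift it substantially.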
We received 1671 FNAs (957 obtained before and 714 obtained after implementation of TBSRTC) from 1339 patients, with 256 FNAs obtained from > 1 site and 86 follow-up FNAs. A total of 301 patients (191 from before and 110 from after implementation) underwent subsequent surgical resection. When there was > 1 site from which the FNA was obtained, we considered the worse diagnosis as the final diagnosis.
The overall distribution of diagnostic categories demonstrated a statistically significant difference after implementation of TBSRTC (Fig. 1). Specifically, before implementation, the combined frequency of the SM (2% before vs 0% after) and MN (4% before vs 3% after) categories (43 of 777; 5.5%) was higher than after implementation (17 of 562; 3.0%) (P < .05) (Fig. 1).
Before implementation, the reports were more ambiguous (29 of 777 [3.7%] vs 3 of 562 [0.5%]; P < .05) and implicit (40 of 777 [5.1%] vs 15 of 562 [2.7%]; P < .05) than after implementation. These results did not demonstrate a statistically significant difference when stratified by diagnostic categories.
The overall rate of surgery significantly decreased after implementation of TBSRTC (24.5% vs 19.6%; P < .05) (Fig. 2) because of the reduced number of patients with benign findings who underwent surgery (15% vs 8%; P < .05) (Fig. 2). Conversely, the rates of surgery after a diagnosis of AL remained the same before and after implementation of TBSRTC (57% vs 62%) (Fig. 2). The same was true after a diagnosis of FN (72% vs 76%) (Fig. 2). The rate for MN was 65% before and 87% after implementation of TBSRTC. Of all surgeries, 44% were performed for benign lesions before implementation compared with nearly 32% after implementation (Table 1). The rate of frozen section performed at the time of surgery before (26 of 191 patients; 13.6%) was not statistically different from the rate after (22 of 110 patients; 20%) implementation of TBSRTC.
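The comparison of overall surgical rates can be approximated with a short Python sketch of a Pearson chi-square test on a 2 x 2 table. The counts (191 of 777 patients before and 110 of 562 patients after implementation) are taken from the text; the function name `chi_square_2x2` is hypothetical, and the absence of a continuity correction is an assumption, because the text does not state whether one was applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction, an assumption here)
    for the 2 x 2 table [[a, b], [c, d]]; returns (statistic, P value)."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts reported in the text: patients with surgery / all patients.
surgery_before, patients_before = 191, 777
surgery_after, patients_after = 110, 562

chi2, p_value = chi_square_2x2(
    surgery_before,
    patients_before - surgery_before,
    surgery_after,
    patients_after - surgery_after,
)
print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

Under these assumptions the P value falls below .05, which is in keeping with the reported significance of the decrease.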
Table 1. FNA Diagnostic Category Before (191 Patients With Surgical Follow-Up) and After (110 Patients With Surgical Follow-Up) Implementation of TBSRTC for Patients Who Underwent Surgery
Abbreviations: FNA, fine-needle aspiration; TBSRTC, the Bethesda System for Reporting Thyroid Cytopathology.
Before implementation, the rates of confirmed MN, including microcarcinoma, after a cytologic diagnosis of benign, AL, or FN were similar (Table 2). These rates were significantly different when compared with the rates of confirmed MN after a cytologic diagnosis of SM (55.6%) or positive for malignancy (MN) (95%) (P < .05). Implementation of TBSRTC was associated with a decrease in the rate of MN after a benign FNA diagnosis (from 16.7% to 8.6%) compared with that after AL (from 23.1% to 18.2%) or FN (from 16.7% to 18.8%) diagnoses (Table 2). These rates did not change remarkably when microcarcinoma was excluded from the analysis (Table 3). In addition, the overall diagnostic accuracy, as measured by multilevel likelihood ratios and ROC area under the curve, was not found to differ significantly before compared with after implementation of TBSRTC (Table 4). The overall accuracy was not affected when unsatisfactory results were excluded from the analysis. The false-positive rate was 17.1% before and 9% after implementation of TBSRTC when microcarcinoma was included in the analysis and was 8.3% versus 0%, respectively, when it was excluded. However, these differences were not statistically significant.
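A comparison of two ROC areas from independent groups of patients, such as those before and after implementation, can be sketched in Python as follows. This sketch assumes the Hanley and McNeil standard error approximation and a z test for independent samples; the function names and the example areas and group sizes are hypothetical, are not the values in Table 4, and may not match the exact MedCalc procedure.

```python
import math


def hanley_mcneil_se(auc, n_malignant, n_benign):
    """Approximate standard error of an ROC area (Hanley and McNeil)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_malignant - 1) * (q1 - auc ** 2)
        + (n_benign - 1) * (q2 - auc ** 2)
    ) / (n_malignant * n_benign)
    return math.sqrt(variance)


def compare_independent_aucs(auc_1, se_1, auc_2, se_2):
    """Two-sided z test for areas from independent samples;
    returns (z, P value)."""
    z = (auc_1 - auc_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical inputs (NOT the values from Table 4).
auc_before, auc_after = 0.80, 0.85
se_before = hanley_mcneil_se(auc_before, n_malignant=60, n_benign=130)
se_after = hanley_mcneil_se(auc_after, n_malignant=30, n_benign=80)

z_score, p_auc = compare_independent_aucs(
    auc_before, se_before, auc_after, se_after
)
print(f"z = {z_score:.2f}, P = {p_auc:.3f}")
```

The independent-samples form is used here because the patients before and after implementation are different individuals; paired ROC curves derived from the same cases would require an additional correlation term.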
Table 2. Risk of Malignancy Associated With Each Diagnostic Category Before (191 Patients With Surgical Follow-Up) and After (110 Patients With Surgical Follow-Up) Implementation of TBSRTC (Microcarcinoma Included)
Abbreviation: TBSRTC, the Bethesda System for Reporting Thyroid Cytopathology.
Table 3. Risk of Malignancy Associated With Each Diagnostic Category Before (191 Patients With Surgical Follow-Up) and After (110 Patients With Surgical Follow-Up) Implementation of TBSRTC (Microcarcinoma Not Included)
Abbreviation: TBSRTC, the Bethesda System for Reporting Thyroid Cytopathology.
Table 4. Positive Likelihood (95% CI) Ratios of Malignancy and the Overall Diagnostic Accuracy (AUC [SE]) of Thyroid FNA for Each Diagnostic Category Before and After Implementation of TBSRTC
Abbreviations: 95% CI, 95% confidence interval; AUC, area under the curve; FNA, fine-needle aspiration; NA, not applicable; SE, standard error; TBSRTC, the Bethesda System for Reporting Thyroid Cytopathology.
Assessment of accuracy: with microcarcinoma, before vs after: P = .056; without microcarcinoma, before vs after: P = .26; before implementation, with vs without microcarcinoma: P = .24; after implementation, with vs without microcarcinoma: P = .007.
During the entire study period, 48 patients with thyroid FNA results underwent frozen section during intraoperative consultation. Approximately 27% (13 of 48 patients) were incorrectly diagnosed on cytology compared with 13% (6 of 48 patients) on frozen section examinations. The difference was not found to be statistically significant.
Until recently, thyroid FNA reporting has been an excellent example of discordance between pathologists and clinicians.9, 10 TBSRTC1, 2 was introduced to reduce this confusion. In the current study, we found that the implementation of TBSRTC lowered the number of ambiguous and implicit diagnoses. Had clinicians reviewed these reports, more would have been characterized as ambiguous or implicit. In general, surgeons/clinicians may misunderstand the pathology report with regard to up to 30% of the information items in it, partly due to format style or lack of familiarity with the report type.11 For example, diagnoses of “atypical” or “indeterminate,” which generally carry a low risk of malignancy, have been shown to prompt the surgeons surveyed to send the patient to surgery in 52% and 32%, respectively, of cases.9 To alleviate the discord between clinicians and pathologists, Redman et al,9 Ortel,10 and Attanoos et al12 called for standardization of reports and frequent communication feedback between clinicians and pathologists. The findings of the current study support the main premise of TBSRTC: it makes the cytology report “unambiguous, clear, succinct and clinically relevant.”1
The primary purpose of thyroid FNA is to reduce unnecessary surgeries because thyroid nodules, despite being very common, carry a low risk of malignancy. In the current study, we found that the overall rate of surgery was similar to previously reported rates.13-17 This rate was reduced significantly after implementation of TBSRTC. Because the benign group formed the major category within the current study population (Fig. 1), reduction of the surgical rate in this category explains the majority of the observed overall reduction. Another possible reason is the lower rate of SM and MN diagnoses after implementation. It is not clear why the frequency of these 2 diagnoses was reduced after TBSRTC implementation. It is possible that some of the cases we used to term SM (such as cases with nondiagnostic features of papillary carcinoma) are now grouped in with AL. Further analysis of these categories is needed to test this hypothesis.
The “risk” of malignancy associated with each thyroid FNA diagnostic category has been discussed in the literature, but it is difficult to assess the true value or meaning of these numbers. Marchevsky et al examined the previous literature and demonstrated that the percentages of malignancies range from 0% to 18%, 3% to 77%, 8% to 85%, 21% to 100%, and 50% to 100% for the benign, AL, FN, SM, and MN categories, respectively.17 It is also unclear in many publications whether thyroid microcarcinomas (those measuring < 10 mm) were included in the analysis. In the current study, when microcarcinoma was considered to be an incidental finding, the diagnostic performance was found to be slightly improved after implementation of TBSRTC and, although comparable to previous studies,3, 13, 15, 16 there were some variations noted among different categories (Table 4). As frequently presented, “risk” of malignancy as a measure of performance of thyroid FNA is no more than the positive predictive value or 1-negative predictive value (for negative cases) disguised as an epidemiologic term. The predictive values strongly vary with the prevalence of the disease in the study population. This most likely explains Renshaw's finding of an inverse relation between the rate of surgical follow-up (selection for surgery) and risk of malignancy after a benign thyroid FNA result.18 In some studies, risk depends on the denominator used and whether it is an absolute risk or a relative risk.17 Because thyroid FNA is a laboratory test, we prefer to use the term predictive value or post-test probability over the term “risk” of malignancy.
As with other laboratory tests, one can use prevalence and multilevel likelihood ratios for thyroid FNA to determine the post-test probability for each diagnostic level.19-21 In this study, the likelihood ratios for benign and malignant diagnoses essentially establish the diagnosis in both situations (Table 4). The intermediate results, AL and FN, had likelihood ratios close to unity with little influence on the pretest probability. However, despite the many merits of likelihood ratios over other diagnostic test parameters, to the best of our knowledge there are only a few reports published to date that present likelihood ratios calculated for TBSRTC categories.22, 23 Using likelihood ratios is better than other test parameters because they are largely unaffected by the prevalence of the disease; they can be determined for each diagnostic category; and they can be combined with clinical, sonographic, or other laboratory findings to calculate post-test probability. However, when comparing overall accuracy, the ROC performs better than overall likelihood ratios because the ROC does not ignore unsatisfactory results or collapse the indeterminate diagnoses into 1 category.8 In the current study, the overall ROC was not found to be different from that of previous reports (Table 4).8, 22 In addition, we found no change in the overall diagnostic accuracy before and after implementation of TBSRTC.
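The conversion from pretest to post-test probability follows from the odds form of Bayes theorem and can be sketched in Python. The function name `post_test_probability` and the example pretest probability and likelihood ratios are hypothetical and are not estimates from the current study.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Convert a pretest probability to a post-test probability:
    post-test odds = pretest odds x likelihood ratio."""
    if not 0 < pretest_probability < 1:
        raise ValueError("pretest_probability must be between 0 and 1")
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical pretest probability of malignancy (NOT a study estimate).
pretest = 0.20

# A ratio of 1 leaves the probability unchanged; larger ratios raise it.
unchanged = post_test_probability(pretest, 1.0)
raised = post_test_probability(pretest, 4.0)
print(f"LR 1.0 -> {unchanged:.2f}; LR 4.0 -> {raised:.2f}")
```

This makes explicit why intermediate categories with likelihood ratios close to unity add little diagnostic information: the post-test probability remains close to the pretest probability.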
Assessing the use of frozen section during thyroid surgery (which is typically recommended to clarify nondefinitive diagnoses or ambiguous FNA results) represents another indirect way of evaluating whether TBSRTC improves communication between pathologists and surgeons. The rates of frozen section use in the current study are similar to those previously reported24 and do not appear to be affected by implementation of TBSRTC.
The current study has several limitations. Surgical follow-up was not provided in a large number of patients, resulting in both verification and disease spectrum bias. The preimplementation reporting was evaluated by a cytopathologist and not a clinician, who was the final user of our reports. The translation of 1 system to another may not be a perfect match. The follow-up varied for patients both before and after implementation. Histology may not be ideal, but it is the best available gold standard. Pathologists were not blinded when reading the final histology or the frozen sections. However, given the nature of this retrospective analysis, some of these limitations are unavoidable.
In conclusion, implementation of TBSRTC improves the quality of reporting by lowering the number of ambiguous and implicit diagnoses and decreases overall surgery rates, particularly for benign lesions, but appears to have no effect on the accuracy of thyroid FNA, false-positive rates, or the frequency of intraoperative consultations. A prospective clinical trial with some elements of blinding is needed to define the pretest and post-test probabilities of malignancy, particularly for intermediate diagnostic categories. With the frequent use of high-resolution ultrasound, there is a need to develop a consensus regarding whether the discovery of microcarcinoma in thyroid resection specimens should be considered as a missed diagnosis (false-negative result).