Thyroid fine-needle aspiration (FNA) biopsy, the preoperative diagnostic standard of care for patients with thyroid nodules, has limitations. Spectral imaging captures visible light information that is beyond the capability of the human eye, potentially increasing the accuracy of FNA biopsy. In the current study, the authors demonstrated the feasibility of using spectral imaging in combination with automated spatial analysis based on trainable pattern recognition as an adjunct test for thyroid FNA classification by developing an algorithm that distinguishes between images of papillary thyroid carcinoma (PTC) and benign goiter (BG).
A multispectral camera was used to capture spectral images representing 100 cases of PTC and BG. Ten cases were used as a training set, in conjunction with commercial software, to develop a “classifier,” a classification algorithm that segments digitized multispectral images into regions of PTC, BG, and “nonfeature.” This algorithm was used to generate a screening test and a diagnostic test that were validated on an independent set of images representing 30 cases of PTC and 30 cases of BG.
The area under the receiver operating characteristic curve for the PTC/BG classifier was 0.90. The screening test had a sensitivity of 0.93 and a specificity of 0.73. The diagnostic test had a sensitivity of 0.70 and a specificity of 0.90.
Fine-needle aspiration (FNA), the diagnostic procedure of choice in the assessment of thyroid nodules for malignancy, has recently gained a standardized classification nomenclature in an effort to improve its capacity to determine the need for surgical intervention.1 The FNA technique has reduced the resection of benign thyroid tissue to the point that, in a recent study, < 50% of removed nodules were found to be benign.2 This number underscores both the value and the limitations of this procedure.
Thyroid FNA analysis does not definitively classify thyroid samples as “benign” versus “malignant”; it is not a bimodal (one vs the other) test because of inherent limitations.1-3 Cytopathologists cannot morphologically distinguish between follicular carcinoma, Hurthle cell carcinoma, and their benign counterparts.4, 5 FNA is also vulnerable to variations in the quality and quantity of the thyroid sample, leading to an even higher risk of diagnostic equivocation. Therefore, a national attempt at standardizing terminology via a probabilistic approach was made through the recently established Bethesda thyroid FNA classification system, which defines 6 categories corresponding to various degrees of certainty that a thyroid nodule is malignant.3 Its clinical value lies in providing the treating physician with a method that tiers management decisions based on risk of malignancy.1 Unfortunately, surgical management is still indicated in categories in which malignancy is uncertain, highlighting the need for improved diagnostic techniques. Our feasibility study initially concentrated only on the bimodal Bethesda categories of “positive” (for papillary thyroid carcinoma [PTC]) and “benign” (benign goiter [BG]).
One limitation of existing classification schemes is the human eye itself. Human vision occurs in 3 overlapping spectral bands that are specified by the long (L-), medium (M-), and short (S-) cone cells of the retina. Although these cells afford a great deal of color information, the human eye remains limited in its ability to perceive subtle differences in color.6, 7 These limitations may be addressed through spectral imaging, in which specialized cameras are used to capture complete spectral data in images.
Conventional cameras store data corresponding to red, green, and blue signals (RGB), approximating the range of color perception of the human eye. Analysis of wavelengths of light beyond RGB is accomplished via a specialized digital camera that incorporates a liquid crystal tunable filter for specifying desired wavelengths of light. A complete spectrum is captured and stored at each pixel of the image file, rather than only the limited data corresponding to RGB. When combined with the use of spatial-morphologic relations, spectral information can provide enhanced discriminatory capacity.7, 8 Beyond the potential gains in diagnostic accuracy, this modality also lends itself to standardization through semiautomated or fully automated analysis.7, 9 Furthermore, it may be used in conjunction with conventional Papanicolaou (Pap) staining, avoiding the time- or resource-consuming steps of immunoperoxidase staining for cancer antibodies7 or other molecular tests.
Previous work in multispectral analysis of thyroid FNAs has demonstrated improved accuracy in the classification of images when compared with analysis of standard RGB images.8 Other reports have addressed aspects of full automation feasibility.9 In our laboratory, Mansoor et al developed a machine-learning algorithm based on spectral images to differentiate between morphologically indistinguishable thyroid follicular adenoma and parathyroid adenoma on FNA cytologic specimens, proving clinical feasibility in this uncommon clinical circumstance in which a parathyroid gland has been sampled via FNA.7 Given the rare nature of this problem, the algorithm has not been clinically adopted. In the current study, we developed a spatial spectral image (SSI)-based classifier that can distinguish between PTC and BG on FNA and determined its value on a pilot validation set.
MATERIALS AND METHODS
A total of 50 PTC and 50 goiter FNA cases from the 2008 archive of the Yale Department of Pathology were collected. Each of the PTC cases had histologic confirmation, whereas not all of the cases of BG had histologic follow-up because of the relatively low number of patients with goiters who undergo surgical resection. Each case was represented by a single slide that was stained according to the Pap technique using standard methods and diagnosed according to the Bethesda thyroid FNA classification system.3 An integrated multispectral camera (Nuance; Cambridge Research & Instrumentation, Inc [CRi], Woburn, Mass) and software system (PerkinElmer, Hopkinton, Mass) were used in conjunction with a BH2 light microscope (Olympus America Inc, Center Valley, Pa) to obtain images of the thyroid FNAs. Spectral images were captured at 20-nanometer (nm) intervals from 400 nm to 700 nm, the visible spectrum. All images were taken using a × 60 objective lens.
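As an illustration of the acquisition scheme described above, the following minimal Python sketch lists the band centers implied by a 400-nm to 700-nm range sampled every 20 nm and contrasts the resulting per-pixel spectrum with a 3-channel RGB image. The image dimensions and array names are hypothetical; the actual sensor resolution and file format are not stated in this report.

```python
import numpy as np

# Band centers from 400 nm to 700 nm inclusive, in 20-nm steps (as stated in the text).
wavelengths_nm = np.arange(400, 701, 20)

# Hypothetical image size; the true sensor resolution is not given in the text.
height, width = 4, 6

# A spectral "cube" stores one intensity per band at every pixel.
cube = np.zeros((height, width, wavelengths_nm.size), dtype=np.float32)

# By comparison, a conventional RGB image stores only 3 values per pixel.
rgb = np.zeros((height, width, 3), dtype=np.float32)

print(wavelengths_nm.size, cube.shape, rgb.shape)
```

Under these assumptions, each pixel carries 16 spectral samples rather than 3.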
For the PTC/BG classifier, 20 images were taken from 5 PTC cases and 20 images were taken from 5 BG cases; together, these images constituted the “training set.” The set was loaded onto InForm software (PerkinElmer), which was first used to “unmix” the spectral images based on the 3 spectral signatures of the pure Pap stain component dyes (orange G, eosin Y, and light green plus hematoxylin). A classifier was then developed from the unmixed images. Specifically, regions of PTC, BG, background, and crowded/out-of-focus or “nonfeature” areas were selected to train the PTC/BG classifier (Fig. 1). The classifier divides the spectral image into areas representing PTC, BG, or a nonfeature. Classifier features are determined by the algorithm based on the intensity of each dye component after decomposition based on the spectral signatures. We tested the classifier on a threshold selection set, which comprised 30 PTC images and 30 BG images taken from 60 cases that were independent of the training set cases. PTC images that demonstrated classical cytopathologic features such as nuclear grooves and intranuclear pseudoinclusions were chosen, whereas images of BGs were free of these features.
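The unmixing step can be illustrated with a generic linear model. The sketch below is not the InForm implementation, which is not described in this report; it assumes that absorbance (optical density) is additive across dyes (the Beer-Lambert law) and recovers per-pixel dye amounts by ordinary least squares. The function name, the white-field reference, and the randomly generated signatures are all hypothetical.

```python
import numpy as np

def unmix(cube, white, signatures):
    """Least-squares unmixing of a transmitted-light spectral cube.

    cube:       (H, W, B) transmitted intensities
    white:      (B,) intensities of an empty, unstained field
    signatures: (B, K) absorbance spectrum of each of K dye components
    Returns an (H, W, K) array of estimated dye amounts.
    """
    # Convert transmittance to optical density, which is additive across dyes.
    od = -np.log10(np.clip(cube / white, 1e-6, None))
    h, w, b = od.shape
    # Solve signatures @ amounts = od for all pixels at once.
    amounts, *_ = np.linalg.lstsq(signatures, od.reshape(-1, b).T, rcond=None)
    return amounts.T.reshape(h, w, -1)

# Synthetic check with made-up spectra: 16 bands, 3 dye components.
rng = np.random.default_rng(0)
bands, k = 16, 3
signatures = rng.uniform(0.1, 1.0, size=(bands, k))
true_amounts = rng.uniform(0.0, 1.0, size=(2, 2, k))
white = np.full(bands, 1000.0)
cube = white * 10.0 ** (-(true_amounts @ signatures.T))

recovered = unmix(cube, white, signatures)
print(np.allclose(recovered, true_amounts, atol=1e-6))
```

In this noise-free synthetic case the dye amounts are recovered exactly; with real images, the quality of the unmixing would depend on how well the measured signatures represent the pure dyes.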
The performance of the PTC/BG classifier was initially assessed by generating its receiver operating characteristic (ROC) curve manually in Microsoft Excel (Microsoft Corp, Redmond, Wash). ROC curves were generated from analysis of image planes of the 3 dye components of the Pap stain, created by unmixing spectral absorption information in the wavelength range from 400 nm to 700 nm, in 20-nm increments. This range is believed to provide reliable unmixing and ROC analysis, although wavelength range optimization may be the subject of future studies. The spectral images constituting the “PTC/BG threshold selection set” were divided into regions of PTC, BG, and “nonfeature” by the PTC/BG classifier (Fig. 2). The cutpoint parameter used was the ratio of the area classified as PTC to the total area classified (ie, all areas excluding nonfeature areas), which we termed the PTC threshold ratio; an image was classified as PTC if this ratio exceeded the cutpoint.
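The ratio and the ROC construction described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the spreadsheet used in the study; the label codes, function names, and example values are hypothetical.

```python
import numpy as np

# Hypothetical label codes for the classifier's segmentation output.
NONFEATURE, BG, PTC = 0, 1, 2

def ptc_ratio(label_map):
    """Area classified as PTC divided by the total classified (PTC + BG) area."""
    ptc_area = np.count_nonzero(label_map == PTC)
    bg_area = np.count_nonzero(label_map == BG)
    classified = ptc_area + bg_area
    return ptc_area / classified if classified else 0.0

def roc_points(ratios, is_ptc):
    """Sweep the cutpoint over the observed ratios; return (FPR, TPR) arrays."""
    ratios = np.asarray(ratios, dtype=float)
    is_ptc = np.asarray(is_ptc, dtype=bool)
    # +inf calls no image PTC; -inf calls every image PTC.
    cutpoints = np.concatenate(([np.inf], np.unique(ratios)[::-1], [-np.inf]))
    # An image is called PTC only if its ratio exceeds the cutpoint.
    tpr = np.array([(ratios[is_ptc] > c).mean() for c in cutpoints])
    fpr = np.array([(ratios[~is_ptc] > c).mean() for c in cutpoints])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Made-up segmentation: 2 PTC pixels, 2 BG pixels, 2 nonfeature pixels.
example_map = np.array([[PTC, PTC, BG], [NONFEATURE, NONFEATURE, BG]])
example_ratio = ptc_ratio(example_map)

# Made-up ratios for 3 PTC images followed by 3 BG images.
ratios = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
is_ptc = [True, True, True, False, False, False]
example_auc = auc(*roc_points(ratios, is_ptc))
print(example_ratio, round(example_auc, 3))
```

Nonfeature pixels are excluded from the denominator, so an image dominated by background or out-of-focus regions is judged only on the area that the classifier actually assigned to PTC or BG.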
For potential use in a clinical setting, a screening test and a diagnostic test were derived from the PTC/BG classifier by selecting 2 specific PTC threshold ratios. Each test used a spectral image as input and output a bimodal classification of either “benign” or “malignant.” Both the screening and diagnostic tests were evaluated on a PTC/BG validation set composed of 30 PTC and 30 BG images. These images were distinct from the images used in the threshold selection set but were taken from the same cases/specimens (ie, different regions of the same slides were photographed).
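The relation between the two tests can be sketched as a single decision rule applied with 2 different cutpoints. The cutpoint values below are those reported in the Results (0.32 for screening and 0.84 for diagnosis); the function name and the example ratio are hypothetical.

```python
# Cutpoints reported in the Results for the two tests.
SCREENING_CUTPOINT = 0.32
DIAGNOSTIC_CUTPOINT = 0.84

def classify(ptc_ratio, cutpoint):
    """Return 'malignant' if the PTC threshold ratio exceeds the cutpoint, else 'benign'."""
    return "malignant" if ptc_ratio > cutpoint else "benign"

# A hypothetical image in which half of the classified area was labeled PTC.
ratio = 0.50
screening_call = classify(ratio, SCREENING_CUTPOINT)
diagnostic_call = classify(ratio, DIAGNOSTIC_CUTPOINT)
print(screening_call, diagnostic_call)
```

The lower cutpoint flags more images as malignant, favoring sensitivity, whereas the higher cutpoint flags fewer, favoring specificity; an image with an intermediate ratio, as in this example, is therefore called differently by the 2 tests.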
Two additional classifiers were developed that either increased the number of cases represented in the training set or used nuclear training regions alone as opposed to nuclear and cytoplasmic regions. In the former case, 20 images of PTC and 20 images of BG taken from 20 cases of PTC and 20 cases of BG that were independent of the PTC/BG validation set were used to retrain the classifier. In the latter case, the training set used was identical to that used in developing the PTC/BG algorithm, but only nuclear training regions were selected.
RESULTS
The ROC curve generated by running the PTC/BG classifier on the threshold selection set is shown in Figure 3, and the area under the curve (AUC) was 0.88. This classifier was validated in a new set of 30 PTC and 30 BG images (PTC/BG validation set). The AUC of the PTC/BG validation set was 0.90, which was similar to that of the PTC/BG threshold selection set.
Based on the divisions found using the classifier on the threshold selection set, we selected a PTC threshold ratio of 0.32 for a PTC/BG screening test and 0.84 for a diagnostic test. These correlated with a sensitivity of 0.97 and a specificity of 0.66 for the screening test and a sensitivity of 0.70 and a specificity of 0.91 for the diagnostic test. Both tests were validated on the PTC/BG validation set, in which the screening test was found to have a sensitivity of 0.93 and a specificity of 0.73 and the diagnostic test had a sensitivity of 0.70 and a specificity of 0.90. These results are summarized in Table 1.
Table 1. Results of PTC/BG Screening and Diagnostic Tests
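For readers who wish to relate the reported proportions to image counts, the following sketch computes sensitivity and specificity from a 2 × 2 tally. The counts are hypothetical: they were chosen only because, with 30 PTC and 30 BG images, they round to the validation figures reported above. The study reports proportions, not the underlying counts.

```python
def sensitivity(tp, fn):
    """Fraction of PTC images that the test called malignant."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of BG images that the test called benign."""
    return tn / (tn + fp)

# Hypothetical counts consistent with the reported validation results
# (30 PTC and 30 BG images); not taken from the study data.
screening = {"tp": 28, "fn": 2, "tn": 22, "fp": 8}
diagnostic = {"tp": 21, "fn": 9, "tn": 27, "fp": 3}

for name, c in (("screening", screening), ("diagnostic", diagnostic)):
    sens = sensitivity(c["tp"], c["fn"])
    spec = specificity(c["tn"], c["fp"])
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```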
We next examined the effect of varying and increasing the number of cases represented in the training set while keeping the total number of training images constant. Up to a maximum of 20 cases of PTC and 20 cases of BG (using only 1 image per case) were used in the training set. Increasing the number of cases represented in the training set did not appear to significantly improve classification of the PTC/BG validation set (Fig. 4), yielding an AUC of 0.90. Similarly, altering the algorithm training to use nuclear regions alone, rather than a combination of both cytoplasmic and nuclear regions, yielded a PTC/BG classifier with an AUC of 0.91 (Fig. 5).
DISCUSSION
We developed a PTC/BG classifier and 2 tests, a screening test and a diagnostic test, based on SSI of thyroid FNA cytologic samples. The screening test exhibited high sensitivity and the diagnostic test exhibited high specificity. Both tests underwent evaluation on a validation set.
The PTC/BG validation set was composed of FNA images that exhibited morphologically classic diagnostic features of PTC and BG. Human cytomorphologic classification of these images in the RGB spectrum would likely yield sensitivities and specificities of nearly 100%. However, the value of SSI lies in features that are not detectable by the human eye. Unlike human operators, the spectral tests do not classify entire groups of cells based on the presence or absence of classical morphologic features such as nuclear grooves or intranuclear pseudoinclusions. Rather, the spectral test proceeds by analyzing small localized patches of a field after decomposition of images into unmixed images that represent each dye component. Because these regions rarely include grooves or pseudoinclusions, it appears that the classifier is able to recognize diagnostic features that become “visible” to the computer software after being highlighted by specific dye components.
Optimization of the PTC/BG classifier was attempted via several methods. Neither an increase in the number of cases used in the training set nor an alteration of the case images used while maintaining the case number constant (ie, the introduction of greater case-to-case variation) appeared to improve the algorithm significantly. This suggests that the color wavelength features recognized by SSI do not vary significantly from case to case within PTC or within BG. Alternatively, the more complex the training set, the greater the imaging “noise” perceived by the classifier, an effect that nullifies potential improvements. We believe that standardizing the preparation of slides using commercially available semiautomated specimen preparation methods such as the liquid monolayer-based ThinPrep method may obviate this impediment to the use of SSI.
Our attempt at optimization of the PTC/BG classifier by limiting the algorithm to training regions composed solely of nuclei produced results similar to those obtained when cytoplasmic training regions were also included. This suggests that cytoplasmic features are unimportant in the SSI diagnosis of PTC or BG or, as above, that the cytoplasm introduces a degree of imaging “noise” when compared with the SSI classification based on nuclei alone.
The findings of the current study suggest that SSI may act as an adjunct to FNA classification by the human eye. It mimics the capacity of the human eye to detect classic PTC and BG, relying not on the familiar morphologic criteria but on channels of wavelengths of light in an expanded visual spectrum. Its true strength may lie in cases in which classic diagnostic criteria are absent, such as the Bethesda thyroid FNA classification category of “atypia of undetermined significance” or the “suspicious” categories. In addition, future studies will examine whether SSI may assist in the difficult diagnosis of entities such as follicular carcinoma or Hurthle cell carcinoma. Given the stated goal of eventual clinical application and the relative clinical frequency of such specimens, we propose a future collection of a new validation set rather than cross-validation or bootstrapping.
Furthermore, the incorporation of SSI into existing workflow parameters requires evaluation. One possibility is to incorporate SSI as an adjunct step to cytotechnologist screening of thyroid FNAs. Alternatively, SSI could be used in a reflex test after a cytopathologist's “clinically actionable” diagnosis (eg, a diagnosis, such as “suspicious,” in which surgical intervention is a possibility). This latter scenario would mimic the use of molecular diagnostics such as BRAF gene mutation testing on thyroid FNAs that are not definitive for malignancy. Costs would be limited to equipment/software including the algorithm, and employee time of a few minutes per case.
In summary, the current study presents the results of a preliminary classifier based on SSI that can distinguish between PTC and BG on standard Pap-stained thyroid FNAs. We intend to improve on this algorithm with the eventual goal of providing an adjunct test that aids classification in scenarios in which a diagnosis of thyroid malignancy is uncertain.
Funded in part by a grant of software and equipment from Caliper/PerkinElmer.
CONFLICT OF INTEREST DISCLOSURES
Summer stipend funds were provided to medical student L. Hahn by the Yale School of Medicine, Office of Student Research. Clifford Hoyt is employed by and owns stock in Caliper/PerkinElmer.