Accuracy of optical spectroscopy for the detection of cervical intraepithelial neoplasia: Testing a device as an adjunct to colposcopy

Authors


Abstract

Testing emerging technologies involves the evaluation of biologic plausibility, technical efficacy, clinical effectiveness, patient satisfaction, and cost-effectiveness. The objective of this study was to select an effective classification algorithm for optical spectroscopy as an adjunct to colposcopy and obtain preliminary estimates of its accuracy for the detection of CIN 2 or worse. We recruited 1,000 patients from screening and prevention clinics and 850 patients from colposcopy clinics at two comprehensive cancer centers and a community hospital. Optical spectroscopy was performed, and 4,864 biopsies were obtained from the sites measured, including abnormal and normal colposcopic areas. The gold standard was the histologic report of biopsies, read 2 to 3 times by histopathologists blinded to the cytologic, histopathologic, and spectroscopic results. We calculated sensitivities, specificities, receiver operating characteristic (ROC) curves, and areas under the ROC curves. We identified a cutpoint for an algorithm based on optical spectroscopy that yielded an estimated sensitivity of 1.00 [95% confidence interval (CI) = 0.92–1.00] and an estimated specificity of 0.71 [95% CI = 0.62–0.79] in a combined screening and diagnostic population. The positive and negative predictive values were 0.58 and 1.00, respectively. The area under the ROC curve was 0.85 (95% CI = 0.81–0.89). Per-patient and per-site performance were similar to each other, and both were poorer in the screening setting than in the diagnostic setting. Like colposcopy, the device performs best in a diagnostic population. Alternative statistical approaches demonstrate that the analysis is robust and that spectroscopy works as well as or slightly better than colposcopy for the detection of CIN 2 to cancer.

Cervical cancer remains a major cause of morbidity and mortality in the developing world, where 85% of cancers arise.1, 2 Identification of cancerous and precancerous lesions at earlier stages, when interventions are more likely to be effective, is critical for effective cancer control.3–6 Recent advances in fiber-optic and semiconductor technologies have enabled the development of a new generation of inexpensive, miniature optical sensors that can probe the interaction of light with potentially cancerous tissue in real-time.7, 8 Work to integrate this new technology with existing detection modalities and clinical screening efforts is ongoing.

Two other recent developments in cervical cancer diagnosis and prevention are also impacting disease detection. These are human papillomavirus (HPV) testing for screening and the HPV vaccine. The HPV test for screening is intended for the developing world.9 There is preliminary evidence that it may be more cost-effective than the Papanicolaou smear.10 However, this test requires electricity, laboratory access, and effective follow-up after a positive result. HPV vaccines face similar challenges. Although we expect that such vaccines will reduce cervical neoplasia incidence by approximately 70% in the developed world,11, 12 these results will likely take more than 20 years to achieve, and several hurdles must be overcome before this option is available in developing countries. Where the vaccination program is successful, fewer cases are likely to be discovered in screening and diagnostic settings. Consequently, cost-effective methods will be needed to address this reduced incidence.

Optical spectroscopy is a candidate technology to fill this role, particularly in areas lacking health care resources. The assessment of “emerging technologies” is a complicated process.13 Figures 1a and 1b depict the current clinical paradigms for cervical screening and diagnosis and show what optical technologies could do to transform the process to one that occurs in real-time, is more automated, and is less energy dependent, that is, battery powered.14

Figure 1.

(a) and (b). Current paradigm of clinical care and demonstration of optical technologies in the process.

Following the Littenberg paradigm for the assessment of emerging technologies, we examined biologic plausibility, technical efficacy, clinical effectiveness, patient and provider satisfaction, and cost-effectiveness of optical spectroscopy.13 Biologic plausibility studies demonstrated that (1) tissue fluorescence varied with age and menopausal status,15 (2) changes in NADH, FAD, and collagen occurring in lesions in studies of fresh tissue were likely responsible for these changes in variation16–20 and (3) convolution and deconvolution of data from fluorophores measured in the laboratory and spectra from patients could be used to model fluorophores to resemble tissue and tissue to resemble fluorophore measurements.17, 19, 21–29 Technical efficacy studies showed that these research grade devices were safe28 but performed differently from each other; however, each device did not perform differently over the course of a day or several days.30–35 We developed a quality assurance program and a database to hold the terabytes of data generated from patient measurements. We examined several aspects of clinical effectiveness to reassure ourselves that we were controlling for potential biases or confounders. 
We compared the screening and diagnostic populations we recruited to be certain that the screening patients were typical of low-risk patients and that the diagnostic patients had a higher number of risk factors for cervical cancer.36, 37 The patients could be measured any time during the cycle except during menses.38–40 The pathologists had good agreement.41 Our evaluation of patient satisfaction showed that colposcopy-directed biopsies were more painful and made patients more anxious than spectroscopy or the Papanicolaou smear, which were equally uncomfortable.42 We also studied provider satisfaction, and these findings are in a forthcoming manuscript.43 To address the final concept of the Littenberg paradigm, we modeled the cost effectiveness of spectroscopy in a see-and-treat scenario and showed that, if we were able to achieve sensitivities of at least 84% and specificities of at least 76%, there was indeed a huge savings of health care dollars in “biopsies avoided” and in being able to “see-and-treat” more accurately with a loop electrical excision procedure.44 We reviewed the literature carefully to learn from other investigators who have studied optical instruments for the cervix, noting the trade-off that may exist regarding device accuracy and sample size.45, 46

In this report, we present the use of fluorescence spectroscopy as an adjunct to colposcopy, one step in the multistep process of automating and replacing existing clinical tests in the screening and diagnosis of cervical cancer and precancer.

Material and Methods

Specific objectives

Our specific objective was to select an effective algorithm using optical spectroscopy and obtain preliminary estimates of its accuracy for the detection of cervical intraepithelial neoplasia (CIN) 2 or worse in both the screening and diagnostic settings. Because we intended to sample normal and abnormal sites, a secondary objective included the evaluation of colposcopy against the gold standard of doubly or triply read histopathologic biopsy.

Overview of study procedures

Women who were at least 18 years of age and not pregnant were eligible to participate in the study, which took place between October 1998 and November 2005. Patients were recruited either from a screening group of women who had no history of abnormal Papanicolaou smears, or a diagnostic group of women from colposcopy clinics who were referred with an abnormal Papanicolaou smear or previous treatment for CIN. Volunteers for the screening group were solicited from the community through advertising, media coverage, and participant word of mouth.47 A research nurse described the study to eligible patients and obtained informed consent from those agreeing to participate. All patients were asked to complete an interview conducted by the research nurse that covered demographic variables and aspects of sexual behavior. These data were entered into a database but were not made available to providers.

Each patient provided a complete history and received a physical exam, a Papanicolaou smear, and colposcopic examination of the vulva, vagina, and cervix. A blood sample for the measurement of follicle stimulating hormone, estrogen and progesterone was taken, and prevention recommendations (regarding tobacco use, sunscreen use, obtaining mammography, increasing calcium intake, and following American Cancer Society screening guidelines) were discussed with the patient. Two cervical smears were obtained using an Ayre's spatula and Cytobrush; the first sample was placed directly on a glass slide and fixed with fixative, and the second specimen was placed in Cytyc liquid-based medium for quantitative cytology and for HPV testing using the Hybrid Capture® II (HCII) test (Digene Corporation, Gaithersburg, MD). Two additional cytobrush specimens were obtained from the endocervical canal for detection of HPV DNA and mRNA by PCR.48–50 (Table 1 depicts all technologies we evaluated in this clinical trial.) Results from these technology assessments have been published.51–53 After examining the cervix with white light, acetic acid at 6% strength was placed on cotton balls against the cervix for 2 minutes. Following colposcopic examination using white and green light at 3.5×, 7× and 15× magnification, but before biopsy, a fiber optic probe 5.1 mm in diameter was advanced through the speculum and placed in gentle contact with the cervix. Spectroscopic measurements were obtained from one or two normal cervical sites covered with squamous epithelium and, when visible, one colposcopically normal cervical site with columnar epithelium. If abnormalities were present and visible, measurements were taken from two colposcopically abnormal sites. Thus, all patients had sampling of both abnormal and normal areas if colposcopic abnormalities were present.

Table 1. Technologies tested against the gold standard of clinical histopathology with blinded review in this clinical trial

The center of the probe, which is about the size of a pencil, has both the light emitting and light detecting systems. The probe interrogated an area 2 mm in diameter on the cervix and left a circular impression. Following the spectroscopic measurements, the biopsy was taken from the center of the circular impression so that, as much as possible, the biopsy site was the same as the spectroscopic site. The biopsies were obtained with forceps, yielding specimens that were 2 mm long by 1 mm wide by 1 mm deep. The biopsies were fixed in formalin and submitted for permanent section. Sections 4-micron thick were stained with hematoxylin and eosin for routine reading in patient care, and Feulgen staining was performed for research on the quantitative measurements of the epithelium.

Colposcopy was performed by five gynecologic oncologists, two generalist obstetrician gynecologists, one family practitioner, and six nurse practitioners specialized in colposcopy. Each colposcopy was recorded with a drawing detailing the cervical areas of squamous and columnar epithelium, the squamocolumnar junction, the transformation zone, the presence of white or erythematous areas, and the presence of vascular atypia (fine or coarse punctation, loose or tile-like mosaicism and atypical vessels). A study of interprovider and intraprovider variability in colposcopic technique showed excellent to outstanding agreement (by kappa statistic) on the findings of acetowhitening, erythema, fine and coarse punctation, mosaiform and mosaic vasculature, and atypical vessels (data not shown).

Study population

The study protocol was reviewed and approved by the Institutional Review Boards at The University of Texas MD Anderson Cancer Center (MDACC), The University of Texas Health Science Center at Houston, the Harris County Hospital District for the Lyndon Baines Johnson (LBJ) General Hospital, The University of Texas at Austin, Rice University, and the British Columbia Cancer Agency (BCCA). The study was carried out at three clinical sites including The University of Texas MDACC (Houston, TX), the Lyndon Baines Johnson General Hospital (Houston, TX), and the British Columbia Cancer Agency (Vancouver, BC). At MDACC, patients were self-referred or referred by private physicians. At the LBJ General Hospital, patients were referred by county health department clinics. At the BCCA, patients were referred from the network of physicians in British Columbia.

Laboratory tests

All routinely stained cytology and pathology specimens were submitted for diagnosis by an experienced pathologist with specialization in gynecology on call for the day and blinded to the results of the spectroscopy. Cytologic smears and histopathologic sections were initially reviewed clinically by the cytologist or pathologist at the respective cancer center institutions. The pool of physicians reading samples included study cytologists and histopathologists. Both the Bethesda System and the World Health Organization (WHO) classification were used for cytology samples; the WHO classification was used for biopsies. All specimens were independently reviewed a second time by a study cytopathologist (J.M., G.S.) or a study histopathologist (D.V.N., A.M.), who was blinded to the results of the first review, to colposcopy, and to all other clinical tests including the spectroscopy. When diagnoses were discrepant, the specimens were reviewed a third time by the study cytopathologist or histopathologist to resolve the discrepancy. To determine whether any bias was introduced by the histopathology being read at the BCCA or MDACC, a study was conducted to be certain that the readings of the biopsies at both institutions were similar. The kappa statistics reported in Malpica et al. for inter- and intrarater reliability for the expert group of pathologists were in the substantial and almost perfect ranges for the histopathology review of study samples with high-grade lesions. The histopathologic consensus diagnosis was used as the gold standard for the trial.41

HPV typing was performed using the Food and Drug Administration-approved HCII test. The test was performed by Laboratory Corporation of America®, an external clinical laboratory. The HPV HCII tests were performed according to the manufacturer's recommended protocol.

Each patient's menstrual history, hormone use, follicle stimulating hormone level, estrogen level, and progesterone level were reviewed with a reproductive endocrinologist (SNE). Menopausal status was classified into three categories: pre-, peri-, and post-menopausal based on a review of each patient's current and past menstrual history and consistent with the laboratory results.

Spectroscopic measurements

We developed four research-grade, fiber-optic spectrometers to measure fluorescence and reflectance spectra from cervical tissue in vivo at three clinical locations over the 5 years of the study. In fluorescence spectroscopy, the tissue is illuminated by light of a particular wavelength called the excitation wavelength, and the light is re-emitted at a longer wavelength called the emission wavelength. The devices measured fluorescence emission spectra at 16 different excitation wavelengths ranging from 330 nm to 480 nm and collected at a range of emission wavelengths from 360 nm to 800 nm. These data are referred to as an excitation-emission matrix.

Each measurement took approximately 1 minute or less and the total light exposure was less than the American National Standards Institute standard for tissue exposure to light.28 Details of the devices used during the trial can be found in Freeberg et al.34 Of note, there were two generations of the device used during the 7 years of the trial. The second-generation device was cheaper to construct and took measurements more quickly than its predecessor. The details of the processing of the data from the devices can be found in Marin et al.35

All spectra were reviewed three times, beginning with four independent investigators blinded to the histology. Spectra were independently reviewed a second and third time by two medical physicists who were blinded to the histology data. Spectra were excluded for several indications: it was inferred that the probe slipped during the measurement, blood obscured the site, a chipped probe led to altered reflectance measurements, autofluorescence from the probe was present, fluorescence or reflectance spectra were saturated, the signal was weak, or the device failed during a measurement. Spectra were discarded if both reviewers agreed that at least one of the aforementioned abnormalities was present. For the 4,864 fluorescence and reflectance spectra, the two medical physicists disagreed in only one situation, thus demonstrating outstanding agreement. Those spectra that were rejected correlated with device breakdowns and were not related to any specific patient characteristics. An ongoing analysis of the differences in accepted and rejected spectra helped us develop quality assurance software that eventually allowed us to evaluate in real time whether spectra were acceptable or required repetition (data not shown).

Development of an algorithm for optical spectroscopy

The objective was to find the most effective classification algorithm for detecting “disease” based on the spectroscopic data and other data that were available at the same time. We evaluated numerous classification algorithms and selected the one which had the best estimate of performance. Essentially, it was a contest among all the algorithms, with the winner being the one that achieved the highest specificity at a fixed sensitivity of 80%. The classification algorithms we tried were Bayesian variable selection, naïve Bayes, logistic regression with forward and backward variable selection, random forests, classification trees, neural nets, penalized logistic regression, linear discriminant analysis, nearest neighbor, linear support vector machines, kernel support vector machines, and others. There were several complications that needed to be addressed.54–56

One issue was whether the final output of the algorithm would be a prediction of whether the site of the measurement was diseased (per-site classification), or if a prediction needed to be made regarding whether the patient had disease anywhere in her cervix. All the algorithms use the per-site data, so we decided to select the winner of our algorithm competition by using these data. Once the best algorithm was selected, we reported per-patient results based on assigning to each patient the worst biopsy from among all sites measured for that patient. The actual outcome at a site was the histologic reading of the biopsy from that site. The actual per-patient outcome was the worst histologic reading for all biopsies from the patient.

We chose to simplify the classification by dichotomizing the outcome as either “diseased” or “non-diseased.” Our aim was to detect patients (or sites within patients) with high-grade squamous intraepithelial lesions or worse. Thus, “disease” (when applied to a measurement at a single site or to a patient) means “CIN 2 or worse,” including a histology reading of CIN 2, CIN 3, CIS and invasive squamous cancer. Patients (or sites) were classified as “non-diseased” if their histology reading was normal, atypia, inflammation, HPV-related changes, or CIN 1.
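The dichotomization and the per-patient "worst biopsy" rule described above can be illustrated with a short Python sketch. This is not the trial's code: the label strings are hypothetical stand-ins for whatever codes the study database used, and the relative ordering of the categories below CIN 1 is an assumption that does not affect the diseased/non-diseased split.

```python
# Hypothetical label strings, ordered from least to most severe.
# The ordering among the categories below CIN 1 is assumed for illustration.
SEVERITY = ["normal", "atypia", "inflammation", "HPV changes", "CIN 1",
            "CIN 2", "CIN 3", "CIS", "invasive cancer"]
RANK = {label: i for i, label in enumerate(SEVERITY)}
THRESHOLD = RANK["CIN 2"]  # "disease" means CIN 2 or worse


def is_diseased(histology):
    """Dichotomize one histology reading as diseased (True) or not (False)."""
    return RANK[histology] >= THRESHOLD


def per_patient_outcome(site_readings):
    """Map each patient id to the worst histology among that patient's biopsies.

    site_readings: iterable of (patient_id, histology) pairs, one per site.
    """
    worst = {}
    for patient_id, histology in site_readings:
        current = worst.get(patient_id)
        if current is None or RANK[histology] > RANK[current]:
            worst[patient_id] = histology
    return worst
```

For example, a patient with one normal site and one CIN 3 site is assigned CIN 3 and counted as diseased, whereas a patient whose worst site is CIN 1 is counted as non-diseased.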

Most classification algorithms have tuning parameters that must be estimated from the data. The process of obtaining the parameters leads to biased performance estimation on the data on which it was trained. For example, some algorithms have many more tuning parameters than others; this makes them more susceptible to overtraining, wherein their performance looks good on the data on which they are trained, yet they perform poorly on new data. To obtain an unbiased estimate of the performance of an algorithm, the performance needs to be estimated on an independent set of data. Therefore, we split the data into a training set and a test set, with 70% and 30% of available data in each set, respectively. We optimized the algorithm performance on the training data before applying it to the test data.

Classification algorithms require a sufficient number of cases of disease and non-disease. We were simultaneously evaluating potential screening devices in this trial, so spectroscopic measurements were taken on a screening population as well as the intended diagnostic population. Because we had access to the additional data, and due to the data's high-dimensionality, we used it to assist in the development of an algorithm for the diagnostic device.

Because we were considering multiple algorithms and were going to select the one with the best classification accuracy on the training data, we needed a reliable estimate of accuracy based only on the training data. Rather than splitting the training data into two sets (a training set and a validation set), we conducted 5-fold cross-validation within the training set for each algorithm. This allowed the algorithm to be trained on a larger subset and its performance estimated from the entire training set. The training set (70% of the total data set) was split into five subsets. Each of the algorithms was trained five times while omitting one of the subsets; thus 80% of the training data was used to train the algorithm each time. The algorithm was then applied to the remaining 20% of the training data. This produced a score for each of the remaining observations; the five sets of scores were then combined to evaluate the overall performance of the algorithm on the training set. Once we selected a final algorithm, it was trained on the entire training set and applied to the test set (the remaining 30% of the entire data) to obtain unbiased estimates of the sensitivity and specificity of the algorithm.
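The cross-validation loop described above can be sketched in a few lines of Python. The function below is only an illustration under stated assumptions: `fit` and `predict` are hypothetical callables standing in for any of the candidate algorithms, and each observation is assumed to be a `(features, label)` pair.

```python
def cross_validated_scores(folds, fit, predict):
    """Return pooled out-of-fold (score, label) pairs for the training set.

    folds:   list of k lists of (features, label) observations
    fit:     callable(list of observations) -> model        (hypothetical)
    predict: callable(model, features) -> numeric score     (hypothetical)
    """
    pooled = []
    for held_out_index, held_out in enumerate(folds):
        # Train on the other k-1 folds (80% of the training set when k = 5).
        train = [obs for i, fold in enumerate(folds) if i != held_out_index
                 for obs in fold]
        model = fit(train)
        # Score only the observations the model has not seen.
        pooled.extend((predict(model, x), y) for x, y in held_out)
    return pooled
```

Because every observation is scored exactly once by a model that did not see it, the pooled scores can be used to compare algorithms on the whole training set without touching the test set.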

In creating the training set, test set, and five subsets of the training set, we employed the following sampling strategy. We assigned all measurements for a given patient to the same subset. We approximately stratified the sampling that was used for creating the subsets by the demographic and biologic variables we thought might be important: menopausal status, the presence of high-grade histology (CIN 2 or worse), and the date of measurement (to account for possible time trends). The data within each stratum were ordered by diagnosis (“≥CIN 2” or “<CIN 2”), then by menopausal status within diagnosis, and then by study identification number, for which the highest digits designate population (screening or diagnostic), the second highest digits designate the clinic, and the remaining digits give the order of recruitment within the clinic. The algorithm took successive blocks of ten patients and randomly assigned seven patients from each block to the training set and three patients from each block to the test set. The patients in the training set (in the same order) were separated into blocks of five patients and were then randomly assigned to each of the five subsets of training data (one patient from the block to each of the five subsets). This method guaranteed that the training and test sets would be approximately balanced for each of the variables known to affect the spectra in previous studies of tissue biology.56–58
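As a rough illustration of the blocked assignment just described, the Python sketch below takes a list of patient identifiers that is assumed to be already sorted by the stratification variables, splits successive blocks of ten into seven training and three test patients, and then deals the training patients into five subsets in blocks of five. The seed and the handling of a short final block are assumptions made for the example, not details taken from the trial.

```python
import random


def blocked_split(ordered_patients, seed=0):
    """Assign patients to training/test sets and training patients to 5 folds.

    ordered_patients: unique patient ids, already sorted by the
    stratification variables (diagnosis, menopausal status, study id).
    """
    rng = random.Random(seed)  # seed is an illustrative choice
    train, test = [], []
    for start in range(0, len(ordered_patients), 10):
        block = ordered_patients[start:start + 10]
        shuffled = rng.sample(block, len(block))
        # 7 of every 10 patients go to training; a short final block is
        # split in roughly the same proportion (an assumption).
        n_train = round(0.7 * len(block))
        chosen = set(shuffled[:n_train])
        # Preserve the original ordering within each set.
        train.extend(p for p in block if p in chosen)
        test.extend(p for p in block if p not in chosen)

    folds = [[] for _ in range(5)]
    for start in range(0, len(train), 5):
        block = train[start:start + 5]
        # One patient from each block of five goes to each fold, at random.
        fold_order = rng.sample(range(5), len(block))
        for patient, fold_index in zip(block, fold_order):
            folds[fold_index].append(patient)
    return train, test, folds
```

Because whole patients are assigned, all measurements from a given patient stay in the same subset, and because blocks follow the sort order, each subset receives a similar mix of the stratification variables.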

The per-site data are complex high-dimensional arrays; therefore, dimension reduction was needed to analyze them. We primarily used principal components analysis for the purpose of dimension reduction, although we also investigated fast Fourier transforms, B-splines, and an inverse model.56 We tried various dimension reduction methods as inputs for each of the algorithms. Two strategies were employed for principal component analysis: concatenating all of the measured intensities into a single vector, or using principal components of the emission spectra for each excitation wavelength separately. The principal components were computed from the covariance matrix of the emission spectra from the individual excitation wavelengths. We kept only the principal components that accounted for at least 95% of the total variation. For the principal components computed for individual excitation wavelengths, this reduced the dimension of the measurements to between 1 and 3 principal components per excitation wavelength.
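The 95% retention rule can be made concrete with a small Python sketch. It assumes the rule means "keep the smallest number of leading components whose cumulative share of the variance reaches 95%", which is one common reading of the sentence above; the eigenvalues are taken as given (for example, from an eigendecomposition of the covariance matrix of the emission spectra at one excitation wavelength).

```python
def n_components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `target`.

    eigenvalues: eigenvalues of the covariance matrix (component variances).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative >= target * total:
            return k
    return len(ordered)
```

With eigenvalues of 8.0, 1.0, 0.6, and 0.4, for instance, the first two components explain 90% of the variance and the third brings the total to 96%, so three components would be kept.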

We developed the classification algorithms using the features obtained from the methods of dimension reduction, the colposcopic impression, and the biographical variables: age, menopausal status (pre-menopausal, peri-menopausal, and post-menopausal), hormone use (use of any of oral contraceptive pills (OCPs), hormone replacement therapy (HRT), or depo-provera), colposcopic tissue type (columnar or squamous), and bleeding (during measurement). We included these variables because this information was available at the time of the clinical visit and because several studies have shown that biological variables influence spectroscopy measurements.

We provide details on a logistic regression approach, as this demonstrated the most effective performance of all data reduction methods and all algorithms.53–55 The variable selection was applied to the biographical variables (menopausal status, hormonal use, colposcopic tissue type, and age), the colposcopic impression and to the principal components of the spectroscopic data computed for individual excitation wavelengths. We then fit a logistic regression model to the principal components and biographical variables and used Akaike's information criterion to perform a backwards stepwise selection of variables.
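Backward stepwise selection by Akaike's information criterion can be sketched generically in Python. The function below is a simplified illustration rather than the procedure used in the trial: `aic` is a hypothetical callable that fits a model with the given variables and returns its AIC, and the search stops as soon as no single removal lowers the AIC.

```python
def backward_stepwise(variables, aic):
    """Greedy backward elimination guided by AIC (lower is better).

    variables: iterable of candidate variable names
    aic:       callable(frozenset of names) -> AIC of the model fit with
               exactly those variables (hypothetical, supplied by the caller)
    """
    current = frozenset(variables)
    current_aic = aic(current)
    while current:
        # Evaluate every model that drops exactly one variable.
        candidates = [(aic(current - {v}), v) for v in sorted(current)]
        best_aic, dropped = min(candidates)
        if best_aic >= current_aic:
            break  # no single removal improves the AIC
        current, current_aic = current - {dropped}, best_aic
    return current, current_aic
```

In practice `aic` would wrap a logistic regression fit; a variable is dropped only when the reduction in the complexity penalty outweighs the loss in likelihood.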

We used receiver operating characteristic (ROC) analysis to summarize the performance of the selected algorithm. To account for differences in disease prevalence between the diagnostic and screening populations, covariate-adjusted ROC curves were computed.55 These can be viewed as a weighted average of the screening and diagnostic population ROC curves. To compare the covariate-adjusted ROC curves, bootstrap confidence intervals (CIs) were calculated at a fixed sensitivity of 80%. We wished to compare the specificities of colposcopy and spectroscopy when the sensitivity was set at 0.80. A challenge is that some patients in the study were evaluated by both technologies, whereas others were evaluated by colposcopy only (because of dropout, faulty readings, etc.). The p value was computed by modeling the subjects who received both measurements using a multinomial model and the subjects who received colposcopy only using a binomial model. We were then able to compute a likelihood ratio chi-square to determine whether the differences were statistically significant.
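Reading a specificity off an ROC curve at a fixed sensitivity can be illustrated with the following Python sketch. It covers only the simple, unadjusted empirical case: it does not reproduce the covariate adjustment or the bootstrap intervals described above, and the rule "score at or above the cutoff is positive" is an assumption about the direction of the score.

```python
import math


def specificity_at_sensitivity(diseased_scores, healthy_scores, sensitivity=0.80):
    """Return (cutoff, specificity) for the rule "score >= cutoff is positive",
    with the cutoff set to the highest value that still detects the target
    share of diseased cases."""
    ranked = sorted(diseased_scores, reverse=True)
    # Number of diseased cases that must be called positive; the small
    # tolerance guards against floating-point round-up.
    needed = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    cutoff = ranked[needed - 1]
    true_negatives = sum(1 for s in healthy_scores if s < cutoff)
    return cutoff, true_negatives / len(healthy_scores)
```

Applying the same function to two tests' scores gives the pair of specificities that would then be compared at the common sensitivity of 0.80.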

Statistical analysis was performed using the statistical packages R version 2.6.2 (R Foundation for Statistical Computing, Vienna, Austria), Matlab® (The MathWorks, Natick, MA), Mathematica (Wolfram Research, Champaign, IL), and Stata Statistical Software Release 10.1 (StataCorp LP, College Station, TX). Exact binomial CIs were calculated for sensitivity and specificity. Statistical significance was set at 0.05.
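For readers unfamiliar with exact binomial intervals, the following Python sketch computes a two-sided Clopper-Pearson interval by bisection using only the standard library. It is an illustration of the general method, not the routine from any of the packages listed above, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(x, n, alpha=0.05):
    """Two-sided Clopper-Pearson interval for x successes in n trials."""
    # Lower limit solves P(X >= x | p) = alpha/2, which increases with p.
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper limit solves P(X <= x | p) = alpha/2, which decreases with p.
    upper = 1.0 if x == n else _solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper
```

When every one of the n cases is detected (x = n), the upper limit is 1 and a one-sided lower limit has the closed form alpha**(1/n), because P(X = n | p) = p**n.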

Results

Figure 2 shows a flow diagram detailing how the initial enrollment resulted in the final sample size after data quality control. The demographic and diagnostic information concerning the 1,850 patients can be found in Tables 2 and 3. The 1,850 patients yielded 4,686 biopsies and spectra. Although the total number of patients entering our clinical trials was 1,850, we had usable spectroscopy and biopsies on 735 diagnostic patients (115 eliminated) and on 707 screening patients (293 eliminated). Approximately 30% of spectra, including repeated measures, were judged inadequate for analysis during the quality assurance review.

Figure 2.

Patient enrollment and study flow diagram showing final post-quality control sample size.

Table 2. Demographic, socioeconomic, and health-related factors for diagnostic and screening populations in Houston, TX and Vancouver, BC
Table 3. Histology, HPV infection, menopausal status, oral contraceptive use, and age for diagnostic and screening populations in Houston, TX and Vancouver, BC

As noted, the patients ranged in age from 18 to 85 years; the mean age for the entire sample of patients was 39 years. The majority of patients were born in the United States or Canada, had a college education, and were married. We were successful in recruiting a racially and ethnically diverse sample of women, reflective of the populations of Houston, Texas, and Vancouver, British Columbia. There were no adverse events throughout the course of the trial; specifically, no patients returned for bleeding or infection after the measurements and cervical biopsies. All of the Canadian patients have been seen again by their physicians and long-term follow-up shows no adverse effects of the spectroscopy on subsequent Papanicolaou smears. Approximately 70% of the diagnostic patients were followed for 2 years, and no adverse effects of the spectroscopy were noted.

According to tissue histology, 201 of 735 patients had biopsies that had CIN 2, CIN 3, CIS, or cancer in the diagnostic group (27%) compared with 12 of 707 patients (2%) in the screening group. HCII test results were high-risk type positive in 46% of patients in the diagnostic group, compared with 10% of the screening group.

As stated earlier, the best algorithm results as estimated from five-fold cross-validation with training data were obtained using logistic regression with variable selection applied to the biographical variables (menopausal status, hormonal use, colposcopic tissue type, and age), colposcopic call, and principal components of the fluorescence data only. The final variables in the model included all four biographical variables, the colposcopic impression, and 23 of the principal component variables from the spectroscopic data. Not only did this algorithm perform the best as determined by ROC curve analysis but also the algorithm would be easy to implement in the diagnostic device.

Figure 3 presents boxplots of the predicted scores from the classification algorithm. Using the point on the ROC curve closest to perfect classification (0.80 sensitivity, 0.84 specificity), we identified a cut point of 0.221 on the spectroscopy score in the training set. This yielded a by-patient sensitivity (correctly identifying those patients who had histology reading CIN 2 or worse) of 1.00 [one-sided 95% CI = 0.92–1.00] and a specificity (correctly identifying those patients who had histology reading CIN 1 or better) of 0.71 [95% CI = 0.62–0.79] in the test set using the second-generation device. Given the prevalence of 10% for CIN 2 or worse in our combined screening and diagnostic data, the positive predictive value was 0.58 and the negative predictive value was 1.00.
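One way to pick "the point on the ROC curve closest to perfect classification" is sketched below in Python. The text does not state how closeness was measured, so the Euclidean distance to the corner (sensitivity = 1, specificity = 1) used here is an assumption, as is the rule that a score at or above the cut point is called positive.

```python
def closest_to_perfect(diseased_scores, healthy_scores):
    """Return (cutoff, sensitivity, specificity) for the candidate cutoff whose
    ROC point lies nearest to (sensitivity, specificity) = (1, 1)."""
    best = None
    # Every observed score is a candidate cutoff.
    for cutoff in sorted(set(diseased_scores) | set(healthy_scores)):
        sens = sum(s >= cutoff for s in diseased_scores) / len(diseased_scores)
        spec = sum(s < cutoff for s in healthy_scores) / len(healthy_scores)
        distance = (1 - sens) ** 2 + (1 - spec) ** 2  # squared distance
        if best is None or distance < best[0]:
            best = (distance, cutoff, sens, spec)
    return best[1:]
```

A cut point chosen this way on the training set is then held fixed when sensitivity and specificity are estimated on the test set.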

Figure 3.

Boxplot of scores in test set by histologic grade.

The purpose of showing the data in a ROC curve format is to demonstrate how the device performs over a range of sensitivity and specificity pairs, emphasizing the trade-off between sensitivity and specificity. Figures 4a and 4b show the diagnostic classification algorithm applied to the training and test sets, respectively. The area under the ROC curve (AUC) for spectroscopy when applied to the test set for the second-generation device was 0.85 [95% CI = 0.81–0.89].
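The area under an empirical ROC curve equals the probability that a randomly chosen diseased case receives a higher score than a randomly chosen non-diseased case, with ties counted as one half. The Python sketch below computes that quantity directly; it is meant only to illustrate what the AUC measures and does not reproduce the confidence intervals reported here.

```python
def auc(diseased_scores, healthy_scores):
    """Empirical AUC: share of (diseased, non-diseased) pairs in which the
    diseased case scores higher, counting ties as one half."""
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))
```

An AUC of 0.5 corresponds to scores that carry no information about disease status, and an AUC of 1.0 to scores that separate the two groups perfectly.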

Figure 4.

(a) and (b). By-patient receiver operating characteristic (ROC) curve analysis for each device, and combined data of: (a) logistic regression cross-validated on training and validation data, and (b) final test set.

Figure 5 shows the results of the ROC curve analysis for both the screening and diagnostic populations. Figure 5a shows the results of a per-patient analysis and Figure 5b shows the results of a per-site analysis. The per-patient AUC for the diagnostic study in Figure 5a is 0.83 (95% CI = 0.77–0.88), and for the screening study the AUC is 0.58 (95% CI = 0.47–0.69). The per-site AUC for the diagnostic study in Figure 5b is 0.78 (95% CI = 0.73–0.83), and for the screening study the AUC is 0.60 (95% CI = 0.46–0.74). We assume, as with colposcopy, that the poor performance in the screening study is due to the lower prevalence of disease. The per-patient and per-site analyses are quite similar.

Figure 5.

(a) and (b). Receiver operating characteristic (ROC) test characteristics of the per-patient and per-site analysis of the data. Each graph shows the performance of the whole data set, the diagnostic trial, and the screening trial.

To provide some basis for showing that spectroscopy can add value even when colposcopically-directed, we performed two further analyses, shown in Figures 6a and 6b. In the first analysis, we controlled for the differences in prevalence between the screening and diagnostic trials by using whether the patient was enrolled in the screening or diagnostic study as a covariate in the analysis. This allowed us to compare spectroscopy in the combined population in the test set, allowing for the lower prevalence in the screening setting, to colposcopy in the whole data set. In the second analysis, we compared diagnostic spectroscopy with diagnostic colposcopy, excluding the screening patients. This analysis was intended to show how colposcopy and spectroscopy perform in the intended diagnostic setting. These analyses model what we hope to find in a randomized trial of colposcopy versus colposcopy plus spectroscopy. Figure 6a shows covariate-adjusted ROC curves comparing spectroscopy with colposcopy. At a sensitivity of 0.80, the specificity was 0.73 (CI = 0.64–0.82) for spectroscopy and 0.46 (CI = 0.42–0.50) for colposcopy, a statistically significant difference (p < 0.05). The classification performance of the tests in the high-sensitivity region of the ROC curve decreases somewhat, mostly because of the expected small number of cases in the screening population.

For Figure 6b, we compared the accuracy of spectroscopy in the test set with the accuracy of colposcopy in the whole data set. We chose to use the whole data set for colposcopy because it did not require training of an algorithm; therefore, we could use more data to obtain a more precise unbiased estimate of its accuracy. Using the whole data set for spectroscopy would have biased our spectroscopy accuracy estimates upward because the training data were used to select the best model. Figure 6b shows ROC curves comparing diagnostic spectroscopy with diagnostic colposcopy. At a sensitivity of 0.80, the specificity was 0.76 (CI = 0.69–0.82) for spectroscopy and 0.68 (CI = 0.64–0.72) for colposcopy. Using the likelihood ratio chi-square test [as described in the Material and Methods section], we determined that the difference in specificity between spectroscopy and colposcopy was not statistically significant (p = 0.5). For both the diagnostic and screening populations, we observed an increase in accuracy similar to that in the covariate-adjusted ROC curves. The colposcopy-naïve spectroscopy and colposcopy ROC curves are very similar to each other. We recognize that the colposcopy-naïve spectroscopy described here is, in fact, colposcopically-directed. However, we included these ROC curves to provide some basis for comparison. The areas under the covariate-adjusted ROC curves for diagnostic spectroscopy and diagnostic colposcopy were 0.77 and 0.78, respectively.
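
Comparing two tests at a fixed sensitivity, as in the paragraphs above, amounts to reading each ROC curve at the same operating point. The Python sketch below shows one simple, unadjusted way to do this for a single score: it takes the highest threshold that still reaches the target sensitivity and reports the specificity at that threshold. It does not reproduce the covariate adjustment or the confidence intervals used in the study. The function name and the toy data are hypothetical.

```python
import math

def specificity_at_sensitivity(scores, labels, target=0.80):
    """Return (threshold, sensitivity, specificity) at a target sensitivity.

    labels: 1 = CIN 2 or worse, 0 = CIN 1 or better.
    Assumption: a case is called positive when score >= threshold.
    The threshold is the highest one whose sensitivity is >= target.
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Smallest number of diseased cases that must be called positive.
    # The small tolerance guards against floating-point round-up.
    k = max(1, math.ceil(target * len(pos) - 1e-12))
    threshold = pos[k - 1]
    sens = sum(1 for s in pos if s >= threshold) / len(pos)
    spec = sum(1 for s in neg if s < threshold) / len(neg)
    return threshold, sens, spec

# Hypothetical toy scores, not study data.
demo_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
               0.60, 0.65, 0.70, 0.80, 0.90]
demo_labels = [0,    0,    0,    0,    1,    0,
               1,    0,    1,    1,    1]

thr, sens_at, spec_at = specificity_at_sensitivity(demo_scores, demo_labels)
print(thr, sens_at, round(spec_at, 3))  # 0.6 0.8 0.833 for this toy data
```

Running the same function on the scores of two tests measured in the same patients gives the pair of specificities to compare. A formal comparison would still need a significance test such as the one described in the Material and Methods section.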

Figure 6.

(a) and (b). Receiver operating characteristic (ROC) test characteristic comparisons of spectroscopy to diagnostic colposcopy using by-patient covariate-adjusted ROC curve analysis on the test set. The second curve compares diagnostic colposcopy to diagnostic spectroscopy.

Discussion

We developed an algorithm for point-probe optical spectroscopy that yielded operating characteristics with reasonable performance and that has the potential for use in real time. The data show that in a diagnostic setting, research-grade point-probe devices using colposcopically-directed optical spectroscopy perform similarly to colposcopy in expert hands. The role of this technology was to be an adjunct to colposcopy so that one could avoid biopsies of inflammatory lesions and see and treat with confidence that disease would be in the specimen. We continue to develop a multispectral digital colposcope (MDC) for screening and for eventual combination with the probe technology. The MDC has been through two pilot trials, each of which demonstrated sensitivities of 85% and specificities of 90% with automated algorithms in a small number of patients. We will begin testing a combined device based on this work.

In the original statistical plan of the protocol, we used a sensitivity of 84% and a specificity of 76% as parameters to calculate the sample size. These data were based on work using three wavelengths of light in 104 patients, many of whom had high-grade squamous intraepithelial lesions. We estimated that 200 patients with high-grade squamous intraepithelial lesions would allow us to study the diagnostic algorithm, and we hoped to stratify the diagnostic group for several demographic categories: premenopausal on OCPs, premenopausal not on OCPs, postmenopausal on HRT, and postmenopausal not on HRT. Then the results of the Women's Health Study were published and fewer women took OCP or HRT. In the end, we were able to examine these demographic variables in the analysis, but 200 patients alone were too few to develop a classifier in the multidimensional data.

The high-dimensional data we obtained required cross-validation for the algorithm development. We learned that developing a classifier would force us to combine the screening and diagnostic populations. We calculated that the development of a classifier in a screening population with a point probe would require 16,000 patients, for which obtaining funding would be nearly impossible. We also thought of the MDC as the instrument that would accompany the point probe in the screening setting. This was true of our studies of quantitative cytology, for which we also needed a large data set to develop a classifier.

Although we expected the per-site analysis to yield a higher AUC and higher sensitivities and specificities than the per-patient analysis, it did not. This may be because the spectroscopy assesses the epithelial/stromal interaction. Thus, one would expect that if any area of the cervix has high-grade dysplasia, there is a field effect such that the entire cervical epithelial-stromal interface is different from that of a patient with no disease. Biomarker data support the view that normal areas of a diseased cervix are not as genetically stable as normal areas in a normal cervix. We are actively investigating the stromal biology of these types of lesions.

How do our results compare to the literature? Table 4 shows the spectroscopic approach and modality as well as the sensitivity and specificity obtained in the trials.22, 25, 59–82 Our results compare favorably with those of other investigators.

Table 4. Performance of instrumentation and spectral approach from the literature

The main strengths of this study are that (1) each patient had several biopsies that underwent multiple blinded reviews and thus provided an excellent gold standard on which to judge all the technologies under study, (2) few registration problems occurred with the biopsied tissues, (3) robust analysis of the multiple algorithms yielded similar results using different approaches for both data reduction and data analysis, and (4) attention was paid to all aspects of technology assessment. Previous trials by Alvarez82 and DeSantis77 used multispectral technologies to view the whole cervix and compared the multispectral readings to both biopsies and areas of loop electrosurgical excision procedure specimens from the cervix.77, 82 However, registration, or linking, of the optical image to the area of histopathologic reading was difficult in their study designs. In our studies, each 2 mm area that was measured was biopsied, so registration was not an issue. Further, because we had spectroscopic measurements from both colposcopically positive (if any were present) and colposcopically negative sites on all patients, we were able to find false positive and false negative colposcopic lesions. This eliminated the problem of verification bias, something that was not done in other studies.

The weaknesses of the study are (1) the number of scrapes to the cervix before measurement, (2) the probe placement possibly not being precisely over the area biopsied, (3) the discarding of approximately 30% of spectrographs and (4) the use of a cutoff point for the classifier. The scrapes may have affected the epithelium, but we believe we sampled the epithelial stromal interface; however, in future studies of the probe we will not take any scrapes. Also, the probe placement may never be precisely over the area imaged. However, the microenvironment in a region may be similar over the microns that are measured. Discarding spectrographs comes with the territory of studying emerging technologies, as does the development of first- and second-generation devices. We resisted changes to the devices but instead quantified how the devices performed differently for our own understanding of how changes might impact the data. Finally, choosing a cut point is a complicated process. We have not focused on this in this paper, but the cut point changes the trade-off between the number of false negatives and the number of false positives. This will be the subject of a future cost-effectiveness analysis. Because we subjected the algorithmic analyses to a robust number of methods of data reduction and then analysis using the cut point of CIN 2 and above, we are certain the results reflect what is contained in the data set.

The endocervical canal presents challenges for two reasons: first, there may be squamous lesions high in the canal, and second, there is an increasing incidence of adenocarcinoma in situ and adenocarcinoma arising in the columnar epithelium. In general, patients referred for colposcopy with abnormal Papanicolaou smears (diagnostic patients) usually receive an endocervical curettage (scraping of the endocervix) along with their cervical biopsies, whereas those patients with a history of normal Papanicolaou smears (screening patients) receive only a cytologic evaluation of the endocervical canal. Many gynecologic oncologists have found an invasive cancer or lesions of adenocarcinoma in situ or adenocarcinoma in a patient for whom the only abnormality was a positive endocervical curettage or suspicious endocervical cytology. This is a particular concern in older patients, as the squamocolumnar junction moves farther up the endocervical canal with age and endocervical lesions could be missed without a vigilant approach. At present, we would still recommend an evaluation of the endocervical canal apart from or in addition to optical spectroscopy. This is a major limitation of this spectroscopy, but also of visual inspection of the cervix with acetic acid and other existing screening devices. We are researching ways to use spectroscopy to detect abnormalities in the canal and perhaps will have other solutions in the future. For now, the Papanicolaou smear and/or endocervical curettage remain critically important in the clinical evaluation of patients to diagnose all cervical cancers.

The device developed in this study would be an adjunct to colposcopy and probably would not be commercially viable by itself. As we found, a point probe cannot be used for screening; for that purpose we are developing the MDC,83–86 a device that sees the whole cervix. The combined MDC and point probe would be made for the developed world, where it is important to save health care dollars by eliminating unnecessary biopsies and treating only those patients with CIN 2 to cancer. For the developing world, the lack of sufficient electricity led us to develop a portable battery-powered device that we hope to test with our collaborators in developing nations.

Acknowledgements

The authors thank all the patients who selflessly participated in these studies for beneficence that is in the interest of the advancement of medicine. These patients are the leaders of change. This research would not have been possible without the contribution of Nick MacKinnon, who served as the lead device engineer for the project team, and Pierre Lane, who has been critical to device evaluation in our technical efficacy projects. We would like to recognize the efforts of the clinical staff at The University of Texas MD Anderson Cancer Center and the British Columbia Cancer Research Centre, who conducted the trials, including Karen Rabel, Alma Sbach, Judy Sandella, Karen Smith, Jessica Lara, Latira Chenevert, Irene Pabon, Rosa Morales, Veronica Perry, Glenda Dickerson, Patricia Trigo, Maria Theresa Arbalaez, and Abderrahim Zennouhi. The engineers who built and operated the devices during the study were Brian Pikkula, Roderick Price, Adrian Freeberg, Richard Swartz, Olga Shuhatovich, Yvette Mirabal, Stacy Crain, Mark Cardenas, and Sylvia Au. Rebecca Richards-Kortum evaluated fluorescence and reflectance spectra. Dan Serachitopol played a critical role throughout all the trials with programming, data analysis, data management, and quality assurance. Nan Earle was a superb data manager. Trey Kell managed the volumes of data collected during trials in all domains, built the current robust database, assured the quality of every entry, and participated in all phases of quality assurance and data analysis. Sun Young Park, Sung Chang, Hongxiao Zhu, and Crystal Redden Weber developed and tested valuable algorithm models. Brian Crain is the chief editor for the Program Project, and Rebecca Partida and Vicky Cervantes assisted in the editing and preparation of this manuscript.

The funding agreement ensured the authors' independence in designing the study, interpreting the data, and writing and publishing the report. Calum MacAulay, Dennis Cox, Michele Follen, and E. Neely Atkinson all hold some patents that Remicalm, L.L.C., has licensed from their respective past and present universities, including the British Columbia Cancer Research Centre, Rice University, and The University of Texas MD Anderson Cancer Center. Drs. Cantor, Yamal, Guillaud, Benedet, Miller, Ehlen, Matisic, Van Niekerk, Bertrand, Milbourne, Rhodes, Malpica, Staerkel, Nadar-Eftekhari, Adler-Storthz, Scheurer, Basen-Engquist, Shinn, West, Vlastos, Tao, and Beck have no agreements, jointly-held patents, nor other potential conflicts of interest with regard to this research.
