How to evaluate emerging technologies in cervical cancer screening?

Authors

  • Marc Arbyn,

    Corresponding author
    1. Unit of Cancer Epidemiology / Belgian Cancer Centre, Scientific Institute of Public Health, Brussels, Belgium
    2. ECCG (European Cooperation on development and implementation of Cancer screening and prevention Guidelines), IARC, Lyon, France
    • Unit of Cancer Epidemiology, Scientific Institute of Public Health, J. Wytsmanstreet 14, B1050 Brussels, Belgium
    • Fax: 0032/(0)2 642 54 10.

  • Guglielmo Ronco,

    1. Unit of Cancer Epidemiology, Centro per la prevenzione Oncologica, Turin, Italy
  • Jack Cuzick,

    1. Cancer Research UK Centre for Epidemiology, Department of Mathematics and Statistics, Wolfson Institute of Preventive Medicine and Cancer Research, UK, Queen Mary University of London, London, UK
  • Nicolas Wentzensen,

    1. Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS, Bethesda, MD
  • Philip E. Castle

    1. Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS, Bethesda, MD

  • Conflict of interest: JC: advisory board (Abbott, Gen-Probe, Qiagen, Roche), research support (Abbott, Gen-Probe, Qiagen, Roche); GR: advisory board (Gen-Probe), currently ceased.

Abstract

Excellent recommendations exist for studying therapeutic and diagnostic questions. We observe that good guidelines on assessment of evidence for screening questions are currently lacking. Guidelines for diagnostic research (STARD), involving systematic application of the reference test (gold standard) to all subjects of large study populations, are not pertinent in situations of screening for disease that is not yet present. A five-step framework is proposed for assessing the potential use of a biomarker as a screening tool for cervical cancer: i) correlation studies establishing a trend between the rate of biomarker expression and severity of neoplasia; ii) diagnostic studies in a clinical setting where all women are submitted to verification by the reference standard; iii) biobank-based studies with assessment in archived cytology samples of the biomarker in cervical cancer cases and controls; iv) prospective cohort studies with baseline assessment of the biomarker and monitoring of disease; v) randomised intervention trials aiming to observe reduced incidence of cancer (or its surrogate, severe dysplasia) in the experimental arm at subsequent screening rounds. The five-phase framework should guide researchers and test developers in planning assessment of new biomarkers and protect clinicians and stakeholders against premature claims for insufficiently evaluated products. © 2009 UICC.

Principle of cytology-based screening

The rationale of cervical cancer cytological screening is to identify and treat high-grade cervical intraepithelial neoplasiaa (CIN) (precancerous lesion) to prevent its progression to invasive cancer.4 Programme sensitivity is a convenient metric for assessing cancer incidence reduction and population effectiveness, although it does not account for the impact of false positives on cost-effectiveness, the negative consequences of over-screening, and the occurrence of side effects.5 Programme sensitivity depends on the sensitivity of the chosen screening test, the compliance with further follow-up and the sensitivity of triage and diagnostic work-up, the natural history of the disease, and the screening policy (the target age group, screening interval, clinical thresholds for follow-up and treatment).6 The essential elements in the natural evolution of the disease are the rates of onset of precursor lesions, the progression and regression rates of these precursor lesions and the distribution of their sojourn times. The mean sojourn time (period from detectability of a lesion until it develops into a clinically manifest cancer) is generally believed to be in the order of 10 years or more with cytology, and the probability of detection increases as the preclinical phase progresses.7, 8 Sojourn times of cancer precursors are usually not observable because of treatment and are therefore only estimable by modelling. A unique (unethical) experience in New Zealand, where CIN3 lesions were left untreated, allowed observation of the natural history.
The 30-year cumulative incidence of invasive cancer among women with CIN3 was 30% and among women with persistent CIN3 was 50%.9 Because of the long natural history of precursors, repetition of a moderately sensitive screen test, such as the Pap smear, can achieve high programme sensitivity and thereby reduce incidence of and mortality from cervical cancer to a low residual level.10 The International Agency for Research on Cancer estimated that well-organised cytological screening for cervical cancer precursors every 3–5 years between the ages of 35 and 64 years reduces the incidence of cervical cancer by 80% or more among the women screened.8, 11, 12b The success of screening depends essentially on the participation of the target population, the quality of the screening test and further on the compliance with follow-up and the efficacy of treatment of screen-detected lesions. The efficiency of screening decreases in subsequent rounds because successive sensitive screening followed by appropriate therapy reduces the endemicity of precursors over time. The lesions still found are smaller lesions with less invasive potential.
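
The effect of repeating a moderately sensitive test can be illustrated with a minimal sketch. It assumes, purely for illustration, that each screening round detects a prevalent lesion independently and with the same per-round sensitivity. In practice, errors are correlated between rounds and detectability changes as the preclinical phase progresses, so the figures are not estimates from the cited studies. The function name and the example values are hypothetical.

```python
def cumulative_sensitivity(per_round_sensitivity: float, rounds: int) -> float:
    """Probability that a lesion is detected in at least one of `rounds` screens.

    Assumes independent rounds with constant sensitivity (an illustrative
    simplification, not a claim made in the text).
    """
    if not 0.0 <= per_round_sensitivity <= 1.0:
        raise ValueError("per_round_sensitivity must lie in [0, 1]")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    # A lesion is missed overall only if it is missed in every round.
    return 1.0 - (1.0 - per_round_sensitivity) ** rounds


if __name__ == "__main__":
    # Hypothetical per-round sensitivity of 50%, screened one to three times
    # within the sojourn time.
    for k in (1, 2, 3):
        print(k, round(cumulative_sensitivity(0.5, k), 3))
```

Under these assumptions, a test that misses half of the lesions at a single round would miss only one in eight after three rounds, which is the sense in which a long sojourn time compensates for moderate cross-sectional sensitivity.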

Shortcomings of cytological screening

The cross-sectional test accuracy of cervical cytology is highly variable since it depends on the availability of an adequately collected and prepared sample taken from the transformation zone and well-trained and motivated cyto-technicians for microscopic interpretation of the morphologic changes. By good quality assurance, a reasonably high sensitivity for high-grade CIN can be reached (>70%) but low sensitivity values (<50%) are not exceptional.13

Because of low sensitivity, reported in several settings, alternative screening methods have been developed. We can distinguish four new methods of screening: a) alternative forms of cytology e.g., liquid-based cytology [see reference14 for a systematic review], automated or computer-assisted cytology; b) molecular detection of DNA or RNA of high-risk types of human papillomavirus (HPV), the virus causing cervical cancer;15 c) biomarkers associated with a progressive HPV infection such as immuno-staining of certain cell cycle regulating proteins whose expression has been altered, or maybe in the future, proteomic, transcriptomic or methylomic signatures of transforming HPV infections16, 17 and d) biophysical changes identifiable by spectroscopy.18, 19 In the rest of the paper, we discuss how such new techniques should be evaluated using established methods to assess evidence of efficacy.

We first propose a methodology to rank evidence from published studies already performed. Subsequently, we propose a comprehensive framework for setting up new studies through which evaluation of biomarkers should pass to generate evidence on their potential application as a cancer screening test.

Levels of evidence of efficacy derived from published studies

Strength of evidence of screening effectiveness

A list of indicators for screening effectiveness, assessed by different study methods, is enumerated in Table I and ranked from high to low according to the level of evidence that such studies provide.

Table I. Ranking of Indicators by Level of Decreasing Evidence for Effectiveness of Cervical Cancer Screening Methods According to the Studied Outcome and the Used Study Design (Adapted From Ref.6)

Outcome:
1. Reduction of mortality from cervical cancer, (quality-adjusted) life-years gained.
2. Reduction of morbidity due to cervical cancer: incidence of cancer (Ib+).
3. Reduction of incidence of cancer (including micro-invasive cancer).
4. Reduction of incidence of CIN3 or worse disease (CIN3+).
5. Increased detection rate of CIN3+ or CIN2+.
6. Increased test positivity with increased, similar or hardly reduced positive predictive value.

Study design (a):
1. Randomised clinical trial, randomised population based trial.
2. Cohort studies (possibly with imbedded case-control studies).
3. Case-control studies.
4. Trend studies, ecological studies on routinely collected data.

(a) Only controlled studies are considered, that is, studies in which two or more screening methods are compared.

Randomised clinical trials (RCTs) designed to demonstrate a reduction in invasive cervical cancer provide the highest level of evidence of efficacy of screening. Observation of a lower incidence of cervical cancer in the trial arm where a new screening test is applied provides the proof that the new method (including the management of screen positives) is more effective than the control method. Nevertheless, conducting such studies requires enormous financial resources and huge study populations to be followed for many years, and carries a high risk of contamination between the experimental and control arms.c Meanwhile, during the lengthy interval needed to validate the new method, it may no longer be available or may have become obsolete. Therefore, it is often proposed to study intermediate or surrogate outcomes (for instance outcomes 4 to 6 in Table I) and to simulate the most likely outcomes relevant to public health using mathematical models. CIN3 is the direct precursor of invasive cancer and therefore, reduced incidence of CIN3+ is considered an acceptable proxy outcome of trials evaluating new preventive strategies.20, 21 Prospective cohort studies do not yield results more rapidly than randomised trials and suffer from several potential biases. Retrospective evaluation of previously identified cohorts can speed evaluation but does not reduce bias. Case-control studies, comparing screening histories in women with and without cervical cancer, are appropriate to evaluate effectiveness retrospectively but are also prone to several selection and information biases. Changes over time (secular trends) or geographical differences in incidence or mortality can be interpreted as screening effects but can only be accepted as an indication of screening effectiveness when no other factors can plausibly explain the observed changes.

It must be stressed that the aim of screening is to prevent cervical cancer, not simply detect pre-invasive lesions. A new screen test allowing detection of more high-grade CIN does not necessarily result in more pronounced reduction of cancer incidence since just additional non-progressive lesions might be detected.

Cross-sectional test accuracy, threshold of disease

For screening, an accurate test is needed:22 this means that it is positive when CIN2/3 is present and negative when CIN2/3 is not present. In other words, a screen test must have a good clinical sensitivity and specificity. The severity of CIN must be explicitly defined when assessing the accuracy of a test. CIN1 is the histopathologic manifestation of a carcinogenic or non-carcinogenic HPV infection that rarely progresses on a per event basis to cancer.23, 24 Its detection is not clinically useful, possibly leading to over-treatment, and should not be targeted by any screening test. On the other hand, CIN2 and especially CIN3 indicate a considerable risk of developing cancer and should therefore not be missed by a screen test. CIN2 is an intermediate condition, which contains over-called CIN1 (caused by both carcinogenic and non-carcinogenic HPV), and under-called CIN3.25–29 CIN2 is a more regressive30 and less reproducible histological diagnosis than CIN3.29 Thus, while a CIN2 diagnosis is typically the clinical threshold for triggering excisional or ablative treatment, its inclusion as an endpoint for evaluation of a screening test may exaggerate the overall impact of a screening test. The observation that a new screen test is more sensitive than the conventional test in detecting CIN3 provides more convincing evidence that its use in screening will result in a higher reduction in cancer incidence than the detection of CIN2/3, which can be artificially elevated due to the detection of low-risk CIN2 destined to regress (over-diagnosis). Whether detection of more CIN2 with a new method corresponds (at least partly) with either progressive or regressive disease, cannot be assessed from cross-sectional studies. However, observing, at the second screening round among women with a negative first screen test, less CIN3+, in the experimental compared to the control arm of a trial, indicates that at least a part of the additionally detected CIN2 was not regressive. 
The ratio of the total number of CIN2+ cases detected over the first and second screening rounds in the experimental arm to that in the conventional arm represents a measure of over-diagnosis, not a measure of efficacy. Therefore, we recommend that future authors report cross-sectional accuracy separately for both CIN2+ and CIN3+.

Incomplete application of the gold standard, verification bias

The most comprehensive design for evaluating the cross-sectional accuracy of screen tests is the independent application of all the tests to a screening population followed by verification in all study subjects, irrespective of the screen test results, using a valid gold standard assessed without prior knowledge of the screen test results. Under these conditions unbiased estimation of the test sensitivity and specificity is possible. We invite readers to consult STARD guidelines31 for good diagnostic research and QUADAS guidelines for evaluation of the quality of individual studies included in systematic reviews of diagnostic studies.32

Often, even in a research context (because of cost and/or ethical concerns), only women with positive screen tests and none or only a few with negative screen tests are verified and this situation results in verification bias yielding inflated sensitivity and underestimated specificity. Nevertheless, if multiple tests are evaluated and at least one test is very sensitive, the extent of verification bias is reduced, because virtually all women with CIN2/3 or CIN3 undergo diagnostic evaluation. Verification bias can be adjusted for if a random fraction of screen-negatives are referred for the application of the gold standard.33–37 Also long term follow-up can be used to capture missed disease.30
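
The adjustment for verification bias can be sketched as follows. This is a minimal illustration, assuming that all screen-positives are verified and that a simple random fraction of screen-negatives is verified; verified screen-negatives are then weighted by the inverse of their sampling fraction. The function names and all counts are hypothetical, and the cited references describe more complete methods.

```python
def naive_accuracy(tp, fp, fn_verified, tn_verified):
    """Accuracy computed on verified subjects only (subject to verification bias)."""
    sensitivity = tp / (tp + fn_verified)
    specificity = tn_verified / (tn_verified + fp)
    return sensitivity, specificity


def corrected_accuracy(tp, fp, fn_verified, tn_verified, frac_neg_verified):
    """Accuracy corrected by inverse-probability weighting of screen-negatives.

    tp, fp: verified screen-positives with and without disease
        (assumes every screen-positive is verified).
    fn_verified, tn_verified: diseased and disease-free subjects among the
        randomly sampled screen-negatives that were verified.
    frac_neg_verified: sampling fraction of screen-negatives, in (0, 1].
    """
    if not 0.0 < frac_neg_verified <= 1.0:
        raise ValueError("frac_neg_verified must lie in (0, 1]")
    weight = 1.0 / frac_neg_verified
    fn_estimated = fn_verified * weight  # expected false negatives in all screen-negatives
    tn_estimated = tn_verified * weight  # expected true negatives in all screen-negatives
    sensitivity = tp / (tp + fn_estimated)
    specificity = tn_estimated / (tn_estimated + fp)
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical study: 500 screen-positives, all verified (80 diseased);
    # 10% of 9,500 screen-negatives verified (2 diseased among 950).
    print(naive_accuracy(80, 420, 2, 948))
    print(corrected_accuracy(80, 420, 2, 948, 0.1))
```

With these hypothetical counts the uncorrected sensitivity is higher and the uncorrected specificity lower than the weighted estimates, which is the direction of bias described above.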

When 2 screen tests are applied to the same study subjects and all subjects, positive for one or both tests, are verified with an acceptable gold standard, unbiased estimation of the test positive predictive value, the relative sensitivity and detection rate of true positives is possible.38, 39d,e Thus, while the true absolute sensitivity cannot be determined, test performance can be ranked in an unbiased fashion. The same is true for randomised clinical trials, where different tests are applied to subjects in two or more study arms. For this reason, we believe that the Cochrane Collaboration should consider including such studies in systematic reviews (see further below). The reader should be warned that correction for verification bias by additional verification of test negative cases can yield erroneous results (sometimes even more biased than the original verification bias) if subjects are not selected at random, see reference40 for an example.
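
The reason why relative sensitivity remains estimable can be shown with a short sketch. When both tests are applied to the same women and every woman positive on either test is verified, the number of diseased women negative on both tests is unknown, but it enters the denominator of both sensitivities and cancels in their ratio. The function names and counts below are hypothetical.

```python
def relative_sensitivity(cases_both, cases_only_new, cases_only_old):
    """Ratio of the sensitivity of the new test to that of the old test.

    Arguments are counts of confirmed disease cases positive on both tests,
    on the new test only, and on the old test only. The unknown total number
    of diseased women is common to both sensitivities and cancels out.
    """
    detected_new = cases_both + cases_only_new
    detected_old = cases_both + cases_only_old
    if detected_old == 0:
        raise ValueError("the old test detected no cases; ratio undefined")
    return detected_new / detected_old


def positive_predictive_value(true_positives, test_positives):
    """Proportion of test-positive women with confirmed disease."""
    if test_positives == 0:
        raise ValueError("no test-positive women")
    return true_positives / test_positives


if __name__ == "__main__":
    # Hypothetical paired study: 60 cases positive on both tests,
    # 30 on the new test only, 10 on the old test only.
    print(relative_sensitivity(60, 30, 10))
    # Hypothetical totals: 900 women positive on the new test, 400 on the old.
    print(positive_predictive_value(90, 900), positive_predictive_value(70, 400))
```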

When the prevalence of disease is low (which always is the case in a screening setting) and only test-positive cases are verified, an approximated test specificity can be computed, (see formula).

[Equation image not recovered: formula for the approximated test specificity]

This approximated test specificity does not suffer from verification bias.
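
The formula itself is not reproduced in this text. As a hedged illustration, one approximation consistent with the description treats all unverified screen-negatives as disease-free, which is reasonable when prevalence is low: specificity ≈ (N − TP − FP) / (N − TP), where N is the number of women screened and TP and FP are the verified true and false positives. This expression is an assumption made for the sketch and is not necessarily the authors' exact formula; the function name and counts are hypothetical.

```python
def approximate_specificity(n_screened, true_positives, false_positives):
    """Approximate specificity when only screen-positives are verified.

    Assumes the few diseased women among screen-negatives are negligible
    relative to the disease-free population (low prevalence), so that
    disease-free women are approximated by n_screened - true_positives.
    """
    disease_free_approx = n_screened - true_positives
    if disease_free_approx <= 0:
        raise ValueError("n_screened must exceed the number of true positives")
    test_negatives = n_screened - true_positives - false_positives
    if test_negatives < 0:
        raise ValueError("more test-positives than women screened")
    return test_negatives / disease_free_approx


if __name__ == "__main__":
    # Hypothetical screening study: 10,000 women, 500 screen-positives,
    # of whom 80 have confirmed disease.
    print(round(approximate_specificity(10_000, 80, 420), 4))
```

In this hypothetical example, 20 undetected cases among the screen-negatives would change the estimate only in the fourth decimal place, which illustrates why the approximation is insensitive to unverified negatives when disease is rare.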

Reproducibility

The reliability or reproducibility of a test, including intra-batch and inter-batch reproducibility as well as intra-laboratory and inter-laboratory reproducibility, expresses the capacity to obtain the same test result—correct or not—when the screening test is repeated on the same individual. The reliability depends on the definition of distinct test criteria that can be applied by skilled personnel. Poor reproducibility automatically yields low average sensitivity and specificity. Reproducibility can be enhanced by training. Evaluation of new screening tests requires reproducibility experiments, preferentially including field circumstances.
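
Reproducibility experiments are often summarised with a chance-corrected agreement statistic. The text does not name one; the sketch below uses Cohen's kappa for a binary test read twice on the same samples, as one commonly used option. The function name and counts are hypothetical.

```python
def cohen_kappa(both_pos, first_only, second_only, both_neg):
    """Cohen's kappa for two binary readings of the same samples.

    both_pos / both_neg: samples positive / negative at both readings.
    first_only / second_only: samples positive at one reading only.
    """
    n = both_pos + first_only + second_only + both_neg
    if n == 0:
        raise ValueError("no samples")
    observed = (both_pos + both_neg) / n
    p_first = (both_pos + first_only) / n    # positivity rate, first reading
    p_second = (both_pos + second_only) / n  # positivity rate, second reading
    # Agreement expected if the two readings were statistically independent.
    expected = p_first * p_second + (1 - p_first) * (1 - p_second)
    if expected == 1.0:
        raise ValueError("kappa undefined when expected agreement equals 1")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical reproducibility experiment on 100 samples.
    print(round(cohen_kappa(40, 10, 10, 40), 3))
```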

Quality of the gold standard

Assessment of the gold standard, knowing the screen test result, includes a serious risk of overestimation of both the sensitivity and specificity. Therefore, in diagnostic research, where the objective is to evaluate the cross-sectional accuracy of a screen test, verification should be performed independently. This can be difficult when the screen test and the gold standard are based on the same principle, for instance in case of VIA screening (visual inspection of the cervix after application of acetic acid), validated using colposcopy.41–43

It is usually assumed that histological examination of material obtained by colposcopically directed biopsy, loop excision or endocervical curettage, and—in absence of biopsy—a negative colposcopic impression provide a valid ascertainment of the true disease status. Recent data indicate that this assumption might not always be true.44, 45

Colposcopy performance has been challenged by results from prospective studies suggesting that up to 50% of prevalent precancers may be missed during colposcopy.46 The visual assessment of the cervix in colposcopy has a high inter-observer variability.47, 48 It has been demonstrated that the sensitivity of colposcopy is not related to the experience of the colposcopist, but to the number of biopsies taken.44 In random biopsies from normal appearing regions on the cervix, substantial disease has been identified.49 Again, follow-up can be used to compensate partially for the lack of sensitivity of colposcopy. As a consequence, one-time colposcopic-directed biopsy as it has been practiced should be considered an imperfect reference standard.

Studies are underway that aim at identifying better colposcopic procedures and at determining how many biopsies are necessary to improve disease ascertainment. Meanwhile, a combined endpoint including histology and cytology results can improve the disease ascertainment.50

Longitudinal sensitivity

Once again, it must be repeated that the observation of increased cross-sectional sensitivity of a new test for histologically confirmed CIN2/3 or CIN3 does not necessarily imply that its inclusion in a screening programme will yield a reduction in incidence of lethal cervical cancer with respect to conventional cytological screening.f Nevertheless, when biological and epidemiological arguments justify the assumption that the lesions detected in excess by the new method have a substantial chance of progression (acceptable longitudinal positive predictive value) and that screen negatives have a substantially lower chance to develop cancer in the future (higher longitudinal negative predictive value), planning of evaluating the new test in a randomised population-based trial in an organised setting can be considered.51 Audits of screening effectiveness, including linkages with screening and cancer registries, that allow picking up missed disease detected beyond the timelines of studies, are a particularly useful tool of evaluation.52, 53 Finally, simulation models must help in identifying best choices but also in orienting the most influential issues to be addressed in future studies.

Costs of screening

Until now we studied essentially programme effectiveness, stressing sensitivity. Cervical cancer screening involves large populations and therefore can be extremely costly. Costs are mostly determined by the test cost and specificity. An overview of the cost components attributed to screening is presented in Table II.

Table II. Overview of Cost Components of a Screening Programme
1. Cost price of the screen-test (investment and recurrent costs); fees of health professionals (time for preparation, interpretation of the screen test, documentation, training); logistical costs (transport, processing, storage); administrative costs (invitation, registration and analysis of data).
2. Specificity of the screen test: cost of follow-up and treatment of women with false-positive results or having non-progressive screen-detected lesions (over-diagnosis).
3. Sensitivity of the screen test (longitudinal): cost for follow-up and treatment of true positives; this cost may be off-set by cost savings in avoided treatment of advanced disease.
4. Human costs: time spent by women to be screened, anxiety and discomfort for follow-up and/or treatment of women with true and false-positive results, increased risk of adverse obstetrical outcomes in treated women; consequences of delay in detection of cancer in false-negative women.
5. Specificity of quality control, triage and diagnostic follow-up procedures, contributing to increased positive predictive value and savings by avoiding treatment of false-positive women.
6. Quality of screen test procedures; satisfactory rate influencing the need for repeat tests.

Since the prevalence of progressive cervical precursors is very low, the number of false positive cases results from the false positive rate applied to nearly the entire target population. Therefore even a small decrease in specificity can have serious consequences on costs, if the next step involves a complicated or invasive procedure. Nevertheless, the loss in specificity of a screen test can be limited by lengthening the screening interval, by increasing the age at onset of screening and by raising the cut-off for test positivity. Mathematical models can be used to estimate the final outcome per unit of cost, but should rely on accurate estimates of the screening performance, which are not always available.
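
The arithmetic behind this point can be sketched in a few lines. The population size, prevalence, sensitivity and specificities are hypothetical values chosen for illustration, not estimates from the cited literature.

```python
def screening_yield(population, prevalence, sensitivity, specificity):
    """Expected true positives, false positives and PPV for one screening round."""
    diseased = population * prevalence
    disease_free = population - diseased
    true_positives = diseased * sensitivity
    # The false positive rate applies to nearly the whole population
    # when prevalence is low.
    false_positives = disease_free * (1.0 - specificity)
    positives = true_positives + false_positives
    ppv = true_positives / positives if positives > 0 else float("nan")
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "ppv": ppv,
    }


if __name__ == "__main__":
    # Hypothetical programme: 100,000 women, 1% prevalence, 90% sensitivity.
    for spec in (0.95, 0.93):
        result = screening_yield(100_000, 0.01, 0.90, spec)
        print(spec, round(result["false_positives"]), round(result["ppv"], 3))
```

In this hypothetical setting, a two-point drop in specificity adds roughly two thousand women to the follow-up workload without detecting a single additional case.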

Comprehensive framework for setting up new studies for evaluation of biomarkers potentially applicable as a cancer screening test

The Cochrane Collaboration

The Cochrane Collaboration is a world-wide not-for-profit and independent organisation, dedicated to making up-to-date, accurate information about the effects of healthcare readily available worldwide. It produces and disseminates systematic reviews of healthcare interventions and promotes the search for evidence in the form of clinical trials and other studies of interventions. The Cochrane Collaboration essentially addresses therapeutic questions or effects of interventions, assessed by randomised clinical trials (conducted following the rules of good research practice: CONSORT guideline),54 and has developed a rigorous method for assessing and pooling of such trials (based on the QUOROM guidelines).55 In 2007, at the Cochrane Colloquium in Sao Paulo, the Cochrane Diagnostic Test Accuracy Working Group officially launched the implementation of systematic reviews of diagnostic test accuracy in its Library. The original studies should involve testing subjects for the presence of a target disease with two (or more) tests (for instance a conventional and a new test) and, subsequently, submitting all tested subjects to verification with a valid gold standard method (STARD guideline).31 All tests should be applied independently and nearly simultaneously, in a setting representative for the situation where the tests will be used. The hierarchical summary ROC curve analysis is an adequate statistical tool that allows summarizing accuracy estimates accounting for the intrinsic negative correlation between sensitivity and specificity corresponding with different test cutoffs.56 In the evaluation of a new biomarker as a potential screening method, it often is infeasible, impractical and even unethical to apply the gold standard (for instance excision biopsies). Moreover, it is possible that such ‘gold standard’ verification is unreliable when the target disease is not yet detectable or if the procedure detects lesions which have a high chance of spontaneous regression (over-diagnosis).

We agree that strict application of the Cochrane methodology for reviewing and the STARD guidelines31 for original diagnostic studies will result in tremendous improvements of the quality of the research on diagnosis for current clinical disease. Nevertheless, more appropriate methods and longitudinal study designs are needed for screening studies aimed at identifying cancer precursors, where the target disease is not yet developed and where management is restricted to screen-positive subjects. The conceptual five-step evaluation process (see Table III) will be of guidance as a paradigm for screen test evaluation.57 In particular, biobank-based case-control studies exploring presence of biomarkers in samples, collected years to decades before the outcome, can provide a powerful research tool, but still require investigations with respect to feasibility. We refer readers to a more extensive discussion of the use of stored cervical cytology samples as a resource for molecular epidemiology.58

Table III. Phases in the Evaluation of a Biomarker for Future Use in Cancer Screening
• Phase 1—“preclinical exploratory studies”: assessment of markers in biosamples of cancer patients and healthy individuals or in a series of biopsies of selected subjects with no dysplastic lesions, mild, moderate and severe dysplastic lesions or in a series of cervical cell samples reported as negative, equivocal, low- or high-grade intraepithelial lesions.
• Phase 2—“clinical assay development for clinical disease and assessment in non-invasive samples” (for instance Pap smears) in selected subjects with known outcome. Purpose: estimation of sensitivity and specificity in relation to test-cutoffs; ROC curve analysis. A typical example is a diagnostic study, in a colposcopy clinic, where the new test is applied and all women are verified with colposcopy and biopsies.
• Phase 3—“retrospective longitudinal repository studies”: for instance biobank-based case-control studies with cases selected (at random) from the cancer registry and controls selected according to appropriate matching variables from the population registry; assessment of biomarkers in samples years or decades before diagnosis of cancer in archival biosamples. Biosample degradation can be a major shortcoming. Its impact can be restricted by high-quality biobanking, and at least partially adjusted for by quality monitoring.
• Phase 4—“prospective screening studies” involving baseline assessment of healthy subjects for presence of biomarkers and follow-up over time. The results of baseline assessment can be concealed or not.
• Phase 5—“prospective intervention study” which preferentially should be a population-based randomised trial where screen-positive subjects are followed or managed according to established protocols. Main outcomes: reduction in cause-specific mortality and/or in incidence of invasive disease (beyond a certain stage).

Following Pepe,57 five phases can be distinguished in a straightforward evaluation of biomarkers with the purpose of use in screening (see Table III).

It is the intention of the authors to work out this conceptual model for cervical cancer screening including triage of screen-positive women. A major outcome would be a concept and guideline for the design and conduct of biobank case-control studies as also proposed recently by Pepe et al.59 This concept will require thorough discussion and levels of approval by international methodologists.

As one example, the triage of LSIL (and its equivalent, hr HPV-positive ASCUS) offers an interesting opportunity to evaluate the capacity of biomarkers to distinguish between regressing and progressing abnormalities using a biobank-based design. High-risk (hr) HPV testing is considered insufficiently specific.60, 61 One could select prior cases of LSIL archived in the biobank and follow these up with repeat testing and registration (different algorithms are possible). After two or more years certain cases will have progressed and others regressed. Subsequently, one can retrieve the stored original LSIL samples from cases that progressed to high-grade CIN and from matched disease-free controls and apply one or more biomarker assays. When the new biomarker assay requires fresh samples, such biobank-based studies must be designed prospectively with concealed testing at baseline.59

Two examples: high-risk (hr)HPV testing, over-expression of p16

hrHPV testing

Cervical cancer screening using detection of DNA of hrHPV types passed through all phases of evaluation (as listed in Table III), although some RCTs are still running. It was already known for many years that hrHPV testing is more sensitive but less specific than cervical cytology.62 More recently, randomised population-based trials have demonstrated that hrHPV-negative women older than 30–34 years, are at 47–71% lower risk of developing CIN3 or worse (CIN3+) than women who have a negative Pap smear over the next 5 years.63–65 This reduction in the CIN3+ burden can be regarded as a proxy for reduced incidence of invasive cancer.15 A large RCT, conducted in India, demonstrated lower incidence of and mortality from cervical cancer in women testing HPV-negative compared to not-screened women, in contrast to women screened with visual inspection or cytology.66

Triaging screen-positive women

HPV infection is common but usually transient. Reaching high sensitivity for detection of underlying high-grade CIN requires inclusion of all high-risk types in the assays, which inevitably reduces specificity because it includes weaker carcinogenic HPV genotypes.26 Therefore, when HPV-based screening for cervical cancer is considered, the challenge will be to identify appropriate triage algorithms that limit the burden of hrHPV positive women needing follow-up. Cytology triage is one possibility.67, 68 Biomarkers which are widely expressed in transforming infections could also fulfil this role.69, 70 Biomarkers can also be used to triage low-grade or borderline cytology,61, 71 when cytology is used for primary screening.

Overexpression of p16

A recent meta-analysis (including mainly phase 1 studies) summarised the correlation between p16INK4a (abbreviated as p16) over-expression and the severity of squamous cytological lesions, and demonstrated a high variation in the proportion of p16 positives (ranging between 10% and 100% in ASCUS [atypical squamous cells of undetermined significance] and between 24% and 86% in LSIL [low-grade squamous intraepithelial lesions]), underlining lack of standardisation in immuno-staining, interpretation and reporting.72 Nevertheless, in experienced hands and using clearly defined criteria, p16 immuno-staining has shown excellent results with sensitivities for CIN2+ similar to hrHPV testing,61 remarkably lower positivity rates (27% in ASCUS, 24% in LSIL) and consequently substantially higher specificities (84% and 81% in ASCUS and LSIL, respectively) (one phase 2 study).73

Currently, we must acknowledge the lack of good triage studies comparing p16 with currently used alternative strategies to triage equivocal cytological results. Concerning triage of hrHPV positive women, we note only one recent Italian study where hrHPV testing followed by p16-enhanced cytology showed a higher sensitivity for high-grade CIN and similar referral rate to colposcopy compared to primary screening by non-stained conventional cytology.74 Pepe did not include triage studies in the framework of ranking evidence for efficacy of screening (Table III). We propose to consider triage studies as providing evidence of level 2, if designed as a diagnostic study with concurrent gold standard assessment. Randomisation of two or more triage options including longitudinal outcome assessment (via screening and cancer registries, or via systematic gold standard assessment 2-3 years after triage testing) should be classified at a superior level (2+ level).

The question whether sufficient evidence exists to recommend p16-immunostaining as an alternative primary cervical cancer screening method must be answered negatively (many phase 1 studies,72 a small number of pending phase 2 studies [C. Bergeron, personal communication], and one trial targeting p16-triage of HPV positive women [phase 2+]74). Yet these promising results warrant further evaluation in more powerful and well-designed studies (of higher phases).

In order to explore the potential to use p16 over-expression as a progression marker in triage, we propose to set up an international workshop to standardise issues of sample processing and to define clear criteria for categorising levels of positivity.61 In Table IV, we propose a comprehensive set of studies, which are needed to demonstrate performance of p16 testing in screening.

Table IV. Studies Needed to Establish Evidence to Use p16-Overexpression as a Screening Test for Cervical Cancer
Phase	Study design
1.	Assessment of the p16 positivity rate in selected series of cervical cytology samples (ASCUS, LSIL, HSIL) and in biopsies without CIN, CIN1, CIN2, CIN3, AIS, squamous cancer, adenocarcinoma. A significant positive trend has been established.72 The test needs further standardisation. A systematic review is needed of the reproducibility of histological interpretation with versus without p16 staining.
2.	a) Diagnostic study in a colposcopy clinic, where biopsies are taken from all women referred for diagnostic work-up, with p16 immunostaining on the cell samples that triggered referral. Outcome: absolute accuracy.
	b) Triage studies, including women with ASCUS or LSIL (setting of cytology-based screening) or hrHPV-positive women (setting of HPV-based screening), with p16 immuno-staining versus other triage tests, followed by colposcopy and biopsy of all women [outcomes: absolute sensitivity, specificity] or of those with a positive triage test [outcomes: PPV, relative PPV, detection rate of CIN2+, relative sensitivity]. Allocation of p16 versus other triage tests can be randomised. Outcome assessment can be done via cancer (screening) registries [outcome: absolute risk of CIN3+ in p16+ and p16−].
2b.	Biobank-based case-control study, with as cases women with LSIL who subsequently progressed to CIN2+ and, as controls, women whose lesions regressed. If p16 is consistently over-expressed in the index LSIL samples of cases and hardly detectable in those of controls, the test can be used to predict prognosis.
3.	Biobank-based case-control study, including retrieval of archived Pap smears from women with cervical cancer selected from the cancer registry (cases) and from cancer-free women (age-matched controls). Biomarker assessment on retrieved Pap smears.
4.	Baseline immunostaining of screening samples and follow-up through cancer screening registries.
5.	RCT comparing cytology- or HPV-based screening with p16-based screening.
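
The outcomes listed for phase 2 in Table IV depend on which women are verified by colposcopy and biopsy. The following Python sketch is only an illustration of that distinction, not a method taken from the cited studies: the function names and all counts are hypothetical. It assumes that, when only triage-positive women are verified, relative sensitivity is estimated as the ratio of CIN2+ detection rates, which requires that disease prevalence can be taken as equal for the tests compared (same women, or randomised arms).

```python
def accuracy_all_verified(tp, fp, fn, tn):
    """Absolute accuracy when every triaged woman is verified (hypothetical helper).

    tp/fn: women with CIN2+ who are triage-positive/negative;
    fp/tn: women without CIN2+ who are triage-positive/negative.
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "positivity_rate": (tp + fp) / total,
    }


def outcomes_positives_only(tp, fp, n_triaged):
    """Outcomes available when only triage-positive women are verified."""
    return {
        "ppv": tp / (tp + fp),
        "detection_rate": tp / n_triaged,  # CIN2+ found per woman triaged
    }


def relative_sensitivity(detection_rate_new, detection_rate_comparator):
    """Ratio of detection rates; assumes equal disease prevalence for both tests."""
    return detection_rate_new / detection_rate_comparator


# Illustrative counts only (not data from any study).
full = accuracy_all_verified(tp=40, fp=60, fn=10, tn=390)
partial = outcomes_positives_only(tp=40, fp=60, n_triaged=500)
ratio = relative_sensitivity(partial["detection_rate"], 0.06)
print(full, partial, ratio)
```

With full verification both sensitivity and specificity can be computed; with verification of positives only, the false-negative count is unknown, so only the PPV, the detection rate and ratios between tests remain estimable.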

Which requirements must be fulfilled for new tests similar to clinically validated existing ones?

This question intrigues not only the developers of new assays but also the public and health policy makers who wish to avoid dependence on a single manufacturer. It is agreed that lower-level evidence can be accepted for systems similar to those for which sufficient evidence of efficacy is already available.

Alternative cytology systems

Liquid-based cytology and/or automated cytology could be accepted as an alternative to conventional cytology if at least equal sensitivity and/or specificity or, preferably, superior sensitivity with equal specificity or equal sensitivity with superior specificity, using CIN2+ as outcome, can be demonstrated in a screening population. This can be achieved through a cross-sectional study with double testing (conventional and new assay), blind interpretation of both assays and blind verification of subjects with cytological abnormality according to standard follow-up algorithms. A preferred alternative is the randomised trial, where colposcopists and histologists are blinded to the type of screen test. Examples are the RCT currently being conducted in the Netherlands, comparing liquid and conventional cytology,75 and that conducted in Italy.76 In case of comparable accuracy, other elements, such as the proportion of unsatisfactory preparations, reading time, possibility of ancillary testing and costs, should be considered, which can be done through a decision analysis.

hrHPV DNA testing assays

Accepting that screening using HC2 or GP5/6+ PCR significantly reduces the cumulative incidence of CIN3+,15, 65 experts proposed that a new high-risk HPV test should reach a relative sensitivity of at least 0.90 and a relative specificity of at least 0.98, using HC2 as comparator test and CIN2+ as threshold for disease. Moreover, the new test should be highly reproducible (agreement >87%, minimum 500 samples).77
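
As an illustration of the thresholds quoted above, the Python sketch below compares point estimates of relative sensitivity and relative specificity with the 0.90 and 0.98 limits. It is a simplified reading under stated assumptions: the class and function names and the example counts are hypothetical, both tests are assumed to be applied to the same women, and only point estimates are checked. A formal evaluation would also need a statistical test or confidence bounds, which are not shown here.

```python
from dataclasses import dataclass


@dataclass
class PairedCounts:
    """Counts from testing the same women with both assays (hypothetical layout)."""
    diseased_total: int    # women with CIN2+
    diseased_new_pos: int  # of those, positive on the new test
    diseased_ref_pos: int  # of those, positive on the comparator (e.g. HC2)
    healthy_total: int     # women without CIN2+
    healthy_new_neg: int   # of those, negative on the new test
    healthy_ref_neg: int   # of those, negative on the comparator


def relative_accuracy(c):
    """Return (relative sensitivity, relative specificity), new test vs comparator."""
    sens_new = c.diseased_new_pos / c.diseased_total
    sens_ref = c.diseased_ref_pos / c.diseased_total
    spec_new = c.healthy_new_neg / c.healthy_total
    spec_ref = c.healthy_ref_neg / c.healthy_total
    return sens_new / sens_ref, spec_new / spec_ref


def meets_thresholds(c, min_rel_sens=0.90, min_rel_spec=0.98):
    """Point-estimate check only; no allowance for sampling error."""
    rel_sens, rel_spec = relative_accuracy(c)
    return rel_sens >= min_rel_sens and rel_spec >= min_rel_spec


# Illustrative counts only (not data from any study).
example = PairedCounts(60, 57, 58, 800, 730, 736)
print(relative_accuracy(example), meets_thresholds(example))
```

Because both ratios are taken against the comparator rather than against a perfect reference, the check expresses similarity to an already validated assay, which is the lower-level evidence this section describes.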

The future of molecular progression markers

Other new markers, based on molecular processes associated with carcinogenesis, should undergo all phases of evaluation. Possible applications of p16 immuno-cytochemistry, mRNA testing and HPV genotyping to secondary cervical cancer prevention are passing through the hierarchical path of generating evidence, unfortunately not always following the logical framework outlined in Table III. Triage of women with LSIL is a particularly pertinent research field for molecular biomarkers since neither hrHPV testing nor repeated cytology appears to be sufficiently discriminatory to find underlying or incipient relevant disease.78

The expected reduction in background risk of several cancers brought about by future HPV vaccination will be an additional dimension that must be integrated in the search for screening methods with an acceptably high predictive value.79, 80 In fact, screen and follow-up strategies with high positive predictive value are also needed in well-screened populations, where over time, prevalent, large CIN3 with significant invasive potential will be preferentially detected and eliminated, leaving fewer CIN3 that have lower invasive potential. It is the intention of the authors to assist the research community by offering advice on future, straightforward study designs. The environment of the Cochrane Review Collaboration, involving cooperation with methodologists in diagnostic research, clinicians and clinical epidemiologists, could offer a fruitful forum to realise the ambition of assessing current and future evidence for cervical cancer prevention strategies.15

  • a

    In this paper “CIN” (cervical intraepithelial neoplasia) is used for histologically confirmed lesions, while the SIL (Bethesda) terminology is used to describe cytological findings, as recommended in recent international guidelines.1–3

  • b

    It must be remarked that this estimate implies 100% compliance of screened women and that cancers occurring in women with lesions when screening starts are excluded from the estimate of 80% reduction.

  • c

    Contamination means that study subjects enrolled to participate in a trial arm do not follow the procedures foreseen in the study protocol. For instance, women randomised to screening with cytology are screened with an HPV test in the context of opportunistic screening.

  • d

    The same is true when different tests are studied in different populations as long as the prevalence of disease can be assumed to be the same (e.g. in randomised trials).4

  • e

    When not all screen-positives are verified and the selection of verified positive cases is not random, verification bias still can occur at the level of the PPV, detection rate and relative sensitivity.

  • f

    It is important to distinguish cross-sectional and longitudinal accuracy parameters. Increased detection, with a new test, of CIN2 that will largely regress will result in a higher cross-sectional sensitivity which is clinically not useful (over-diagnosis). In contrast, a screen-positive woman who, currently, does not have colposcopically visible CIN can develop a high-grade CIN2 in the future. Such a case may initially be classified as false-positive, only to be re-classified subsequently as a true-positive with longitudinal surveillance.
