- Abstract
We evaluated interobserver reproducibility for the Response Evaluation Criteria in Solid Tumors (RECIST) guidelines and the influence of minimum lesion size (MLS) on reproducibility. One hundred and ten consecutive patients with non-small cell lung cancer were treated with platinum-based chemotherapy. Five observers measured target lesions according to both the World Health Organization (WHO) criteria and RECIST. The percentage changes for unidimensional measurements (UD; RECIST type) and bidimensional measurements (BD; WHO type) were calculated for each patient. Interobserver reproducibility among the five observers (10 pairs) was expressed as Spearman's correlation coefficient for the percentage changes, and as the proportion of agreement and the kappa statistic for response categories. The same analysis was carried out with the MLS applied. BD was more reproducible than UD (Spearman rank correlation coefficient, 0.84 vs 0.81; proportion of agreement, 84.4% vs 82.5%; kappa value, 0.69 vs 0.61). When the MLS was applied to UD, the number of eligible cases decreased by 6.4% and the number of target lesions by 44.6%, whereas interobserver reproducibility for UD improved (Spearman rank correlation coefficient, from 0.81 to 0.84; proportion of agreement, from 82.5% to 84.2%; kappa value, from 0.61 to 0.65). The introduction of the MLS to UD could also improve intercriteria reproducibility between WHO and RECIST. It is important to apply the MLS when using RECIST in order to attain interobserver reproducibility comparable with that of the WHO criteria. (Cancer Sci 2006; 97: 214–218)
- Introduction

Tumor response to chemotherapy was previously evaluated using the WHO criteria, which stipulate bidimensional measurement (BD; WHO type) of lesions.(1) With these standardized criteria for evaluating tumor response, valid and reproducible results could be obtained by all investigators. However, a number of modifications to the WHO criteria were developed by different institutions, making it difficult to compare response rates across investigators when screening new anticancer agents. This led to the introduction of a new system, the RECIST guidelines,(2) which have been widely accepted as the new standard.
In order to standardize the methodology for evaluating tumor response, RECIST simplified the evaluation through the use of unidimensional measurements (UD; RECIST type) in place of the BD used by the WHO criteria. Furthermore, the MLS allowable for measurement at baseline was defined as no less than double the slice thickness on CT or MRI.
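As an illustration of how the two measurement schemes differ, the following minimal sketch (not from the study; the lesion sizes are hypothetical, and the complete-response category is omitted for brevity) computes the percentage change in the UD sum and the BD sum for one patient and assigns response categories using the published RECIST and WHO thresholds:

```python
# Illustrative sketch of UD (RECIST) vs BD (WHO) response assessment.
# Thresholds follow the published criteria; lesion values are hypothetical.

def percent_change(baseline: float, follow_up: float) -> float:
    """Percentage change relative to baseline (negative = shrinkage)."""
    return 100.0 * (follow_up - baseline) / baseline

def ud_sum(lesions) -> float:
    """RECIST: sum of longest diameters (mm)."""
    return sum(longest for longest, _ in lesions)

def bd_sum(lesions) -> float:
    """WHO: sum of products of longest diameter and greatest perpendicular."""
    return sum(longest * perp for longest, perp in lesions)

def recist_category(change: float) -> str:
    # RECIST: PR = >=30% decrease in the UD sum, PD = >=20% increase.
    if change <= -30.0:
        return "PR"
    if change >= 20.0:
        return "PD"
    return "SD"

def who_category(change: float) -> str:
    # WHO: PR = >=50% decrease in the BD sum, PD = >=25% increase.
    if change <= -50.0:
        return "PR"
    if change >= 25.0:
        return "PD"
    return "SD"

# Hypothetical lesions: (longest diameter, perpendicular diameter) in mm.
baseline = [(35.0, 25.0), (22.0, 15.0)]
follow_up = [(21.0, 14.0), (15.0, 10.0)]

ud_change = percent_change(ud_sum(baseline), ud_sum(follow_up))
bd_change = percent_change(bd_sum(baseline), bd_sum(follow_up))
print(recist_category(ud_change), who_category(bd_change))  # PR PR
```

Note that the PR and PD cutoffs differ between the two systems precisely because a diameter change compounds when areas are measured; this is why UD and BD can disagree on borderline cases.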
The validity of, and the intercriteria reproducibility between, the new RECIST guidelines and the previous WHO criteria have been investigated.(2–7) However, to the best of our knowledge, no analysis of the influence of the MLS (the measurability criterion specified by the RECIST guidelines) on interobserver reproducibility in tumor response evaluation has been published in the literature.
The purpose of the present study was therefore to evaluate interobserver reproducibility in tumor response evaluation using RECIST, intercriteria reproducibility between WHO and RECIST, and whether this reproducibility is affected by the application of MLS.
- Discussion
Standardized tumor response evaluation systems are considered reliable in clinical trials when they are valid and reproducible among different observers. Although the intercriteria reproducibility between the new RECIST guidelines and the previous WHO criteria had been investigated,(2–7) little information was available concerning the interobserver reproducibility of tumor response evaluation. In addition, statistical analyses of the effect of the MLS on interobserver reproducibility had not been provided in previous reports. To our knowledge, this is the first study to investigate the interobserver reproducibility of the RECIST guidelines while evaluating the influence of the MLS.
The importance of interobserver reproducibility for any classification scheme has been discussed previously for other grading systems.(10–12) Clinical investigators must take into account interobserver reproducibility in tumor response evaluation, which can greatly affect the results in clinical trials. Our findings demonstrated that interobserver variability exists for bidimensional measurements, as in studies published previously.(13,14) For example, Hopper et al. showed considerable interobserver variability in CT tumor measurements between radiologists interpreting thoracic and abdominal/pelvic CT scans.(13) In another report, the impact of an evaluation committee on patients’ overall response status in a large multicenter trial in oncology was evaluated.(14) Major disagreements occurred in 40% of cases and minor disagreements occurred in 10.5% of the cases reviewed. The number of responders was reduced by 23.2% after review by the evaluation committee.
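The pairwise agreement statistics used in studies of this kind are straightforward to compute. The sketch below (illustrative only; the response categories are hypothetical, not the study's data) computes the proportion of agreement and an unweighted Cohen's kappa for one pair of observers; five observers yield 10 such pairs:

```python
from itertools import combinations

def proportion_agreement(a, b) -> float:
    """Fraction of patients assigned the same response category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b) -> float:
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    n = len(a)
    categories = set(a) | set(b)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal frequencies.
    p_chance = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical response categories from two of the five observers.
obs1 = ["PR", "SD", "PD", "SD", "PR", "SD"]
obs2 = ["PR", "SD", "SD", "SD", "PR", "PD"]

print(round(proportion_agreement(obs1, obs2), 3))  # 0.667
print(round(cohens_kappa(obs1, obs2), 3))          # 0.455

# Five observers give 10 distinct pairs, as in the study design.
pairs = list(combinations(range(5), 2))
print(len(pairs))  # 10
```

Kappa discounts the agreement expected by chance from the raters' marginal category frequencies, which is why it is lower than the raw proportion of agreement.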
The range of response rates among the five observers was clearly narrowed by the MLS (Table 2). Response rates assessed by UD varied from 23.6% to 32.7%, and those assessed by BD ranged from 27.3% to 33.6%. Response rates assessed with UD-MLS ranged from 27.2% to 33.0%, almost identical to the range obtained with BD.
The results of the present study also suggested that BD was more reproducible than UD. When the MLS was applied to UD, the mean values and ranges of Spearman's correlation coefficient, the proportion of agreement and the kappa statistic all improved (Table 4). To ensure interobserver reproducibility comparable with that originally achieved with the WHO criteria, it is essential that the MLS be used in combination with UD when using RECIST.
Because of the need to retain the ability to compare the results of future therapies with those available currently, no major discrepancy should exist between the old (WHO) and new (RECIST) criteria, even though their measurement methods differ. The mean values and ranges of intercriteria reproducibility in response rates between UD-MLS and BD were lower and narrower than those between UD and BD (Table 3). The introduction of the MLS to UD thus improved the intercriteria reproducibility between WHO and RECIST.
As for intercriteria reproducibility, the mean values and ranges for intraobserver reproducibility were better than those for interobserver reproducibility (Table 3). Erasmus et al. have suggested that consistency can be improved if the same reader carries out serial measurements for any one patient.(15)
When the MLS is included in the eligibility criteria, the number of patients with measurable lesions is smaller than that obtained with the previous WHO criteria, because patients with only small lesions are excluded from measurement. In the present study, applying the MLS decreased the number of eligible cases by 6.4% (from 110 to 103) and the number of target lesions by 44.6% (from 402 to 223). This reduction could affect the number of patients enrolled in clinical trials.
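The MLS rule itself is simple to express. The following sketch (hypothetical lesion diameters; it assumes the study's 10-mm slice thickness, and hence a 20-mm MLS) filters each patient's baseline lesions and drops patients left with no measurable lesion, illustrating how eligible-case and target-lesion counts shrink:

```python
# Sketch of the minimum lesion size (MLS) rule: a lesion is measurable at
# baseline only if it is at least twice the CT slice thickness.
SLICE_THICKNESS_MM = 10.0            # conventional CT, as in this study
MLS_MM = 2 * SLICE_THICKNESS_MM      # 20 mm

def measurable(lesion_diameters_mm):
    """Keep only baseline lesions meeting the MLS."""
    return [d for d in lesion_diameters_mm if d >= MLS_MM]

# Hypothetical patients with their baseline lesion diameters (mm).
patients = {
    "A": [35.0, 12.0, 24.0],   # two lesions survive the filter
    "B": [15.0, 18.0],         # no measurable lesion -> ineligible
}

eligible = {}
for patient, lesions in patients.items():
    kept = measurable(lesions)
    if kept:                   # drop patients with no measurable lesion
        eligible[patient] = kept

print(sorted(eligible))  # ['A']
```

With a 5-mm helical-CT slice thickness the same rule gives a 10-mm MLS, which is why thinner slices admit more patients and lesions.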
The present study had several limitations. First, the study cohort comprised only patients with non-small cell lung cancer, and measurement was limited to chest CT. Second, intraobserver variability between evaluations at different intervals was not investigated. Third, our reference was a 10-mm slice thickness, and therefore the minimum lesion size was defined as 20 mm. However, the RECIST guidelines allow a minimum lesion size of 10 mm when a 5-mm slice thickness on helical CT is used. Recently, multidetector CT, which produces thinner slices, has been developed and is being used in daily clinical practice. Therefore, adding the outcomes of patients who were ineligible for the present study but would qualify with a thinner slice thickness might change our results, and this should be evaluated in a further study.
In conclusion, the results of the present study suggest that UD yields poorer interobserver reproducibility of tumor response evaluation than BD; however, if the MLS is applied to UD, interobserver reproducibility improves to a level comparable with that obtained with BD. The introduction of the MLS to UD could also improve the intercriteria reproducibility between WHO and RECIST. It is therefore essential that investigators apply the MLS when using the RECIST guidelines to ensure interobserver reproducibility comparable with the WHO criteria.