We evaluated interobserver reproducibility for the response evaluation criteria in solid tumors (RECIST) guidelines and the influence of minimum lesion size (MLS) on reproducibility. The subjects were 110 consecutive patients with non-small cell lung cancer treated with platinum-based chemotherapy. Five observers measured target lesions according to both the World Health Organization (WHO) criteria and RECIST. The percentage changes for unidimensional measurements (UD; RECIST type) and bidimensional measurements (BD; WHO type) were calculated for each patient. Interobserver reproducibility among five observers, that is 10 pairs, was expressed as Spearman's rank correlation coefficient for the percentage changes, and as the proportion of agreement and the kappa statistic for response categories. The same analysis was repeated with the MLS applied. BD was more reproducible than UD (Spearman rank correlation coefficient, 0.84 vs 0.81; proportion of agreement, 84.4% vs 82.5%; kappa value, 0.69 vs 0.61). When MLS was applied to UD, eligible cases decreased by 6.4% and the number of target lesions by 44.6%, whereas interobserver reproducibility for UD improved (Spearman rank correlation coefficient, from 0.81 to 0.84; proportion of agreement, from 82.5% to 84.2%; kappa value, from 0.61 to 0.65). The introduction of MLS to UD could also improve intercriteria reproducibility between WHO and RECIST. Applying the MLS when using RECIST is important for attaining interobserver reproducibility comparable with that of the WHO criteria. (Cancer Sci 2006; 97: 214–218)
Tumor response to chemotherapy was previously evaluated using the WHO criteria, which stipulate bidimensional measurement (BD; WHO type) of lesions.(1) With these standardized criteria for evaluating tumor response, valid and reproducible results could be obtained by all investigators. However, a number of modifications to the WHO criteria have been developed by different institutions, which made it difficult to compare response rates for screening new anticancer agents across different investigators. This has led to the introduction of a new system, the RECIST guidelines,(2) which have been widely accepted as the new standard.
In order to standardize the methodology for evaluating tumor response, RECIST simplified the response evaluation through the use of unidimensional measurements (UD; RECIST type) instead of the BD used by the WHO criteria. Furthermore, the MLS allowable for measurement at baseline study was defined as being no less than double the slice thickness on CT or MRI.
The validity and intercriteria reproducibility between the new RECIST guidelines and the previous WHO criteria have been investigated.(2–7) However, to the best of our knowledge, no analysis of the influence of the MLS, which the RECIST guidelines specify as a condition of measurability, on interobserver reproducibility in tumor response evaluation has been published in the literature.
The purpose of the present study was therefore to evaluate interobserver reproducibility in tumor response evaluation using RECIST, intercriteria reproducibility between WHO and RECIST, and whether this reproducibility is affected by the application of MLS.
Materials and Methods
This is a retrospective study of the radiological findings for patients who underwent chemotherapy for advanced non-small cell lung cancer (NSCLC). The subjects were patients treated during clinical trials at the Medical Oncology Division of the National Cancer Center Hospital in Tokyo, Japan, between January 1999 and January 2001. All clinical trials were conducted in accordance with the Helsinki Declaration and the protocols were approved by the institutional review board. Written informed consent was obtained from each patient for the treatment protocols, which included the secondary use of treatment-associated documents. Patients were staged according to the Union Internationale Contre le Cancer TNM classification of malignant tumors.(8) The 110 eligible patients had histologically or cytologically diagnosed NSCLC. Patients were required to undergo CT scans periodically to evaluate tumor response prior to and once after treatment, to have at least one bidimensionally measurable lesion, and to be treated with chemotherapy in clinical trials.
Patients treated in clinical practice were considered to be unsuitable and excluded from this study as tumor response evaluation in the clinical practice of oncology is not always carried out according to predefined criteria, but rather is made by subjective medical judgment based on clinical and laboratory data. In addition, tumor response evaluation is not always carried out by CT examination, and the intervals between tumor evaluations can be irregular.
Almost all images were acquired with a TCT-900S Superhelix (Toshiba Medical, Tokyo, Japan), with the remainder scanned on an X-Vigor helical CT scanner (Toshiba Medical). Helical CT was carried out with fixed scanning parameters, including a table speed of 15 mm/s, a pitch ratio of 1:1.5 per rotation time 1 s, and the same contrast agent for both baseline and follow-up evaluations. Image reconstruction was carried out at intervals of 10 mm.
On chest CT obtained during baseline examination before the initiation of chemotherapy, target lesions up to a maximum of five lesions per patient with longest and perpendicular diameters that could be measured accurately were selected by one diagnostic radiologist. In addition, one follow-up chest CT examination, indicating tumors with the greatest response to chemotherapy, was selected retrospectively. Target lesions included primary lung lesion, pulmonary metastases and lymph nodes.
For the target lesions, the two parameters consisting of the longest diameter and the diameter perpendicular to it were measured with electronic calipers on digitized images. Five observers of different backgrounds, blinded to patient profiles, reviewed all patients independently and no attempt was made to arrive at a consensus. These observers included one diagnostic radiologist, one thoracic physician, two medical oncologists and one thoracic surgeon.
Tumor response evaluation
The sum of the longest diameters of all target lesions was calculated for pretreatment and post-treatment UD. Similarly, the sum of the products of the longest diameters and their perpendicular diameters of all target lesions was calculated for pretreatment and post-treatment BD. The baseline sum was used as the reference from which objective tumor response was calculated. For both UD and BD, the percentage change was calculated as the difference between the post-treatment and pretreatment values divided by the pretreatment value.
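The calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used in the study (which was run in SAS). It assumes that the percentage change is expressed relative to baseline, that is (post − pre) / pre × 100, so that shrinkage is negative. The `Lesion` class, the function names and the lesion diameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lesion:
    longest_mm: float        # longest diameter of the lesion
    perpendicular_mm: float  # diameter perpendicular to the longest one


def ud_sum(lesions):
    """Unidimensional (RECIST-type) sum: sum of the longest diameters."""
    return sum(l.longest_mm for l in lesions)


def bd_sum(lesions):
    """Bidimensional (WHO-type) sum: sum of the products of the two diameters."""
    return sum(l.longest_mm * l.perpendicular_mm for l in lesions)


def percent_change(pre, post):
    """Change relative to the baseline sum, in percent (negative = shrinkage)."""
    return (post - pre) / pre * 100.0


# Hypothetical patient with two target lesions; values invented for illustration.
baseline = [Lesion(40.0, 30.0), Lesion(20.0, 10.0)]
follow_up = [Lesion(30.0, 20.0), Lesion(12.0, 8.0)]

ud_change = percent_change(ud_sum(baseline), ud_sum(follow_up))
bd_change = percent_change(bd_sum(baseline), bd_sum(follow_up))
print(f"UD change: {ud_change:.1f}%")
print(f"BD change: {bd_change:.1f}%")
```

In this invented example the UD sum falls from 60 to 42 mm and the BD sum from 1400 to 696 mm², which shows how the same lesions give a larger relative change under BD than under UD.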
Percentage changes were then classified according to the tumor response classification systems of the current RECIST guidelines and the previous WHO criteria. Under both systems, tumor response was categorized as complete response (CR), partial response (PR), stable disease (SD) or progressive disease (PD). The RECIST PR was defined as a decrease of at least 30% in UD, and the WHO PR was defined as a decrease of at least 50% in BD. The RECIST PD was defined as an increase of at least 20% in the sum of the longest diameters, and the WHO PD was defined as an increase of at least 25% in the sum of the products of the two diameters of all lesions or in the product of the diameters of one lesion. For the present study, no minimum interval was required for the confirmation of either CR or PR.
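As an illustration of the categorization step, the following Python sketch maps a percentage change to a response category. It assumes the standard thresholds of the two systems (RECIST: at least 30% decrease for PR, at least 20% increase for PD; WHO: at least 50% decrease for PR, at least 25% increase for PD) and, as in the present study, uses the baseline sum as the only reference. The function names and the `all_lesions_gone` flag are hypothetical.

```python
def classify(change_pct, pr_threshold, pd_threshold, all_lesions_gone=False):
    """Map a percentage change from baseline to CR, PR, SD or PD."""
    if all_lesions_gone:
        return "CR"  # complete response: all target lesions disappeared
    if change_pct <= -pr_threshold:
        return "PR"  # partial response
    if change_pct >= pd_threshold:
        return "PD"  # progressive disease
    return "SD"      # stable disease: neither PR nor PD


def classify_recist(ud_change_pct, all_lesions_gone=False):
    """RECIST-type categories from the change in the sum of longest diameters."""
    return classify(ud_change_pct, 30.0, 20.0, all_lesions_gone)


def classify_who(bd_change_pct, all_lesions_gone=False):
    """WHO-type categories from the change in the sum of diameter products."""
    return classify(bd_change_pct, 50.0, 25.0, all_lesions_gone)


# A lesion whose two diameters both shrink by 30% loses 51% of its product,
# so the two PR thresholds roughly correspond for uniformly shrinking lesions.
print(classify_recist(-30.0), classify_who(-51.0))
print(classify_recist(-25.0), classify_who(-45.0))
```

Full RECIST assessment also involves non-target lesions, new lesions and confirmation intervals, none of which are modeled in this sketch.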
Analysis of intercriteria reproducibility
To examine intercriteria reproducibility, the mean and ranges of differences in the response rate between UD and BD were calculated. We then estimated those between UD-MLS and BD. Interobserver differences among the five observers yielded 20 pair comparisons. Intraobserver differences of the same observer yielded five pair comparisons.
Analysis of interobserver reproducibility
First, to examine the interobserver reproducibility of the percentage changes according to the two different dimensional measurements, we estimated the Spearman's correlation coefficient of the percentage changes among the five observers, calculated for each pair observed (five observers yielded 10 pair comparisons).
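The pairwise analysis can be sketched as follows. This is an illustrative, standard-library implementation of Spearman's rank correlation (Pearson correlation of average ranks), not the SAS procedure used in the study. The percentage changes for three observers are invented.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical percentage changes for six patients as measured by three observers.
changes = {
    "A": [-45.0, -10.0, 5.0, -32.0, 22.0, -60.0],
    "B": [-40.0, -12.0, 8.0, -28.0, 18.0, -55.0],
    "C": [-50.0, -5.0, -8.0, -35.0, 25.0, -58.0],
}

pairwise = {
    (a, b): spearman(changes[a], changes[b])
    for a, b in combinations(sorted(changes), 2)
}
mean_rho = sum(pairwise.values()) / len(pairwise)
for pair, rho in pairwise.items():
    print(pair, round(rho, 3))
print("mean:", round(mean_rho, 3))
```

With five observers, `combinations` yields the 10 unordered pairs used in the present study; the three invented observers here yield three pairs.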
Second, to examine the interobserver reproducibility of the two tumor response criteria, we estimated the proportion of agreement on the categories of CR, PR, SD and PD for both UD and BD among the five observers (10 pair comparisons). We then calculated the kappa statistic, a measure of agreement that corrects for the agreement expected by chance, to assess interobserver reproducibility for tumor response categories.(9)
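For one pair of observers, the two agreement measures can be sketched as below. The sketch assumes the unweighted (Cohen's) form of the kappa statistic, which is our reading of the cited method rather than a confirmed detail. The response categories are invented.

```python
from collections import Counter


def proportion_of_agreement(r1, r2):
    """Fraction of patients assigned the same category by both observers."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohen_kappa(r1, r2):
    """Unweighted kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement and p_e the agreement expected by chance
    from each observer's marginal category frequencies. Undefined when
    p_e equals 1 (both observers use a single, identical category).
    """
    n = len(r1)
    p_o = proportion_of_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical response categories for ten patients from two observers.
obs1 = ["PR", "PR", "SD", "SD", "SD", "SD", "PD", "SD", "PR", "SD"]
obs2 = ["PR", "SD", "SD", "SD", "SD", "SD", "PD", "SD", "PR", "PR"]

p_o = proportion_of_agreement(obs1, obs2)
kappa = cohen_kappa(obs1, obs2)
print(f"proportion of agreement: {p_o:.2f}")
print(f"kappa: {kappa:.2f}")
```

In this invented example the observers agree on 8 of 10 patients, but because most patients fall into SD, a large share of that agreement is expected by chance, so kappa is noticeably lower than the raw proportion of agreement.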
Third, we examined the influence of MLS on the number of eligible cases and target lesions, and repeated the same analyses of interobserver reproducibility with the MLS applied. The RECIST guidelines introduced the MLS and specify that a measurable lesion be no less than double the slice thickness on images. The slice thickness was 10 mm in the present study, so the MLS was set at 20 mm at baseline evaluation before treatment. Cases that had only tumors smaller than the MLS were excluded from the analyses applying the MLS. We defined the RECIST guidelines as the evaluation by UD for measurable cases and the WHO criteria as the evaluation by BD for all cases.
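The MLS rule can be sketched as a simple filter on baseline lesions. This is a minimal illustration: the patient identifiers, the diameters and the `apply_mls` helper are hypothetical, and the sketch assumes the longest baseline diameter is the size compared against the MLS.

```python
SLICE_THICKNESS_MM = 10.0
MLS_MM = 2 * SLICE_THICKNESS_MM  # no less than double the slice thickness


def apply_mls(patients, mls_mm=MLS_MM):
    """Keep only baseline lesions of at least mls_mm.

    patients maps a patient ID to the list of baseline longest diameters (mm).
    Patients left with no measurable lesion are dropped from the result.
    """
    eligible = {}
    for patient_id, diameters in patients.items():
        kept = [d for d in diameters if d >= mls_mm]
        if kept:
            eligible[patient_id] = kept
    return eligible


# Hypothetical baseline longest diameters (mm); values invented for illustration.
patients = {
    "P1": [45.0, 18.0, 12.0],  # one lesion survives the MLS
    "P2": [15.0, 19.5],        # no lesion reaches 20 mm, so the case is excluded
    "P3": [20.0, 33.0],        # both lesions are measurable
}

eligible = apply_mls(patients)
lesions_before = sum(len(d) for d in patients.values())
lesions_after = sum(len(d) for d in eligible.values())
print(f"eligible cases: {len(eligible)} of {len(patients)}")
print(f"target lesions: {lesions_after} of {lesions_before}")
```

As in the study, the filter removes a much larger fraction of lesions than of cases, because a patient remains eligible as long as one lesion meets the MLS.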
SAS version 8.02 (SAS Institute, Cary, NC, USA) was used for all analyses.
Results
The characteristics of the 110 patients were as follows: male/female = 80/30, median age = 59 years (range 36–72 years), stage IIIB/IV = 33/77. Chemotherapy regimens are listed in Table 1. A total of 220 CT images were reviewed, comprising 110 CT images each from the baseline study (pretreatment) and from the follow-up (post-treatment) study.
Table 1. Characteristics of the 110 patients enrolled in the present study
Disease stage at study entry
Unclassified non-small cell
Chemotherapy regimens: cisplatin and gemcitabine; cisplatin and paclitaxel; nedaplatin and paclitaxel; cisplatin and vinorelbine; carboplatin and paclitaxel; cisplatin and vindesine; cisplatin, docetaxel and ifosfamide; cisplatin and docetaxel
Tumor response evaluation between UD and BD
The tumor response evaluation was categorized into CR, PR, SD and PD without MLS. The response rate results are shown in Table 2. None of the patients were rated CR. The use of UD resulted in response categories by observers A, B, C, D and E of 35, 28, 26, 34 and 36 PR, 73, 79, 81, 73 and 71 SD, and 2, 3, 3, 3 and 3 PD, respectively. The response rate ranged from 23.6 to 32.7%. For BD, the corresponding response categories were 37, 30, 33, 36 and 36 PR, 67, 73, 68, 68 and 68 SD, and 6, 7, 9, 6 and 6 PD, respectively. The response rate ranged from 27.3 to 33.6%.
Table 2. Response rate (%) using four different measurements among five observers
Tumor response evaluation between UD-MLS and BD-MLS
When the MLS criteria were applied, the number of eligible cases decreased by 6.4% from 110 to 103, and the number of target lesions decreased by 44.6% from 402 to 223.
The response rate results are shown in Table 2. None of the patients were rated CR. When UD was used with MLS, the respective response evaluations made by observers A, B, C, D and E were 34, 28, 33, 31 and 33 PR, 68, 73, 67, 72 and 68 SD, and 1, 2, 3, 0 and 2 PD. The response rates of UD applying MLS ranged from 27.2 to 33.0%, showing a reduction in interobserver difference compared with those of UD not applying MLS. With BD using the MLS, the corresponding response categories were 36, 33, 34, 34 and 35 PR, 63, 66, 65, 63 and 64 SD, and 4, 4, 4, 6 and 4 PD. The response rate ranged from 32.0 to 35.0%.
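The response rate ranges quoted for the four measurement types can be recomputed from the PR counts given in the text. The sketch below assumes that the response rate is the PR count divided by the number of evaluated cases (110 without MLS, 103 with MLS), which follows from no patient being rated CR. The variable names are hypothetical.

```python
# PR counts for observers A to E and the number of evaluated cases,
# taken from the text of this section and the preceding one.
pr_counts = {
    "UD": ([35, 28, 26, 34, 36], 110),
    "BD": ([37, 30, 33, 36, 36], 110),
    "UD-MLS": ([34, 28, 33, 31, 33], 103),
    "BD-MLS": ([36, 33, 34, 34, 35], 103),
}


def response_rates(counts, n_cases):
    """Response rate (%) per observer, rounded to one decimal place."""
    return [round(100 * c / n_cases, 1) for c in counts]


rates = {name: response_rates(c, n) for name, (c, n) in pr_counts.items()}
for name, r in rates.items():
    spread = max(r) - min(r)
    print(f"{name}: {min(r)} to {max(r)}% (spread {spread:.1f} points)")
```

The spread between the lowest and highest observer is 9.1 points for UD and 5.8 points for UD-MLS, which is the reduction in interobserver difference referred to above.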
The intercriteria reproducibility in the response rates is shown in Table 3. Between UD and BD, the intraobserver difference in the response rates ranged from 0 to 6.4% with a mean of 2.36%, and the interobserver difference ranged from 0 to 10.0% with a mean of 4.25%. Between UD-MLS and BD, the intraobserver difference in the response rates ranged from 0.1 to 2.6% with a mean of 1.26%, and the interobserver difference ranged from 0.1 to 6.4% with a mean of 2.76%.
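The UD versus BD figures above can be recomputed from the PR counts reported earlier. The sketch rests on two assumptions of ours: that an intraobserver difference compares one observer's UD response rate with the same observer's BD rate (five pairs), while an interobserver difference compares one observer's UD rate with another observer's BD rate (5 × 4 = 20 ordered pairs); and that the differences are taken between rates already rounded to one decimal place.

```python
from itertools import permutations
from statistics import mean

N_PATIENTS = 110

# PR counts per observer without MLS, taken from the text.
ud_pr = {"A": 35, "B": 28, "C": 26, "D": 34, "E": 36}
bd_pr = {"A": 37, "B": 30, "C": 33, "D": 36, "E": 36}


def rate(count, n=N_PATIENTS):
    """Response rate (%) rounded to one decimal place."""
    return round(100 * count / n, 1)


ud_rate = {obs: rate(c) for obs, c in ud_pr.items()}
bd_rate = {obs: rate(c) for obs, c in bd_pr.items()}

# Same observer, UD versus BD: five comparisons.
intra = [abs(ud_rate[o] - bd_rate[o]) for o in ud_rate]
# UD of one observer versus BD of a different observer: 20 ordered pairs.
inter = [abs(ud_rate[a] - bd_rate[b]) for a, b in permutations(ud_rate, 2)]

for label, diffs in (("intraobserver", intra), ("interobserver", inter)):
    print(f"{label}: n={len(diffs)}, "
          f"range {min(diffs):.1f} to {max(diffs):.1f}%, mean {mean(diffs):.2f}%")
```

Under these assumptions the sketch gives an intraobserver range of 0 to 6.4% (mean 2.36%) and an interobserver range of 0 to 10.0% (mean 4.25%), matching the values reported for UD versus BD.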
Table 3. Intercriteria reproducibility: difference in the response rate (%) among five observers
The mean and ranges of interobserver reproducibility among five observers using the two dimensional measurements are shown in Table 4. The mean value of the Spearman rank correlation coefficient for the percentage changes when using UD (0.81) was lower than that using BD (0.85), and the same tendency was observed for the mean value of proportion of agreement for the tumor response categories (82.5%, 908/1100 vs 84.4%, 928/1100) and the mean kappa statistics for the tumor response categories (0.61 vs 0.69). The lowest kappa statistics among the 10 pair comparisons were 0.49 with UD and 0.61 with BD. The kappa statistics obtained with BD were higher than those with UD in nine out of 10 pair comparisons (Fig. 1).
Table 4. Interobserver reproducibility (10 pair comparisons) using four different measurements among five observers
The mean values and ranges of interobserver reproducibility when applying the MLS are shown in Table 4. The mean value of Spearman's correlation coefficient for UD-MLS (0.84) was higher than that for UD (0.81), and the same tendency was observed for the mean proportion of agreement for the tumor response categories (84.2%, 867/1030 vs 82.5%, 908/1100) and the mean kappa statistic for the tumor response categories (0.65 vs 0.61). The lowest kappa statistic among the 10 pair comparisons was 0.57 with MLS and 0.49 without. When MLS was used together with UD, the kappa statistic increased in eight out of 10 pair comparisons (Fig. 2).
Discussion
Standardized tumor response evaluation systems are considered reliable in clinical trials when they are valid and reproducible among different observers. Although the intercriteria reproducibility between the new RECIST guidelines and the previous WHO criteria had been investigated,(2–7) little information was available concerning interobserver reproducibility of tumor response evaluation. In addition, statistical analysis results regarding the effect of MLS on interobserver reproducibility had not been provided in previous reports. To the best of our knowledge, this is the first study to investigate the effect of the MLS on the interobserver reproducibility of the RECIST guidelines.
The importance of interobserver reproducibility for any classification scheme has been discussed previously for other grading systems.(10–12) Clinical investigators must take into account interobserver reproducibility in tumor response evaluation, which can greatly affect the results in clinical trials. Our findings demonstrated that interobserver variability exists for bidimensional measurements, as in studies published previously.(13,14) For example, Hopper et al. showed considerable interobserver variability in CT tumor measurements between radiologists interpreting thoracic and abdominal/pelvic CT scans.(13) In another report, the impact of an evaluation committee on patients’ overall response status in a large multicenter trial in oncology was evaluated.(14) Major disagreements occurred in 40% of cases and minor disagreements occurred in 10.5% of the cases reviewed. The number of responders was reduced by 23.2% after review by the evaluation committee.
The range of response rates among the five observers was clearly narrowed by the MLS (Table 2). The response rates assessed by UD varied from 23.6 to 32.7%. When assessed by BD, the response rates ranged from 27.3 to 33.6%. Response rates assessed with UD-MLS ranged from 27.2 to 33.0%, which was almost identical to the range obtained with BD.
The results of the present study also suggested that BD was more reproducible than UD. When MLS was applied to UD, the mean values and ranges of Spearman's correlation coefficient, proportion of agreement and the kappa statistics improved (Table 4). In order to ensure comparable interobserver reproducibility (as was originally achieved with the WHO criteria) it is essential that the MLS be used in combination with UD when using RECIST.
Because of the need to retain some ability to compare results of future therapies with those available currently, no major discrepancy should exist between the old (WHO) and new (RECIST) criteria, although measurement criteria would be different. The mean values and ranges of intercriteria reproducibility in the response rates between UD-MLS and BD were lower and narrower than those between UD and BD (Table 3). The introduction of MLS to UD improved the intercriteria reproducibility between WHO and RECIST.
As for intercriteria reproducibility, the mean values and ranges for intraobserver reproducibility were better than those for interobserver reproducibility (Table 3). Erasmus et al. have suggested that consistency can be improved if the same reader carries out serial measurements for any one patient.(15)
When MLS is included in the eligibility criteria, the number of patients with measurable lesions is less than that obtained with the previous WHO criteria because patients with only small lesions are excluded from measurement. In the present study, when MLS criteria were used the number of eligible cases decreased by 6.4% from 110 to 103 and the number of target lesions by 44.6% from 402 to 223. This reduction could affect the number of patients enrolled in clinical trials.
The present study had several limitations. First, the study cohort comprised NSCLC patients only and the application of the measurement modalities was limited to chest CT. Second, intraobserver variability between evaluations with different intervals was not investigated. Third, our reference was a 10 mm slice thickness and therefore the minimum lesion size was defined as 20 mm. However, the RECIST guidelines allow for a minimum lesion size of 10 mm when helical CT with a slice thickness of 5 mm is used. Recently, multidetector CT, which creates a thinner slice thickness, has been developed and is being used in daily clinical practice. With a thinner slice thickness, patients who were ineligible under the MLS in our study would become eligible; including their outcomes might change our results and should be evaluated in a further study.
In conclusion, the results of the present study suggest that UD yields poorer interobserver reproducibility of tumor response evaluation than BD; however, if MLS is applied to UD, interobserver reproducibility can improve and become comparable with that obtained with BD. The introduction of MLS to UD could also improve intercriteria reproducibility between WHO and RECIST. It is therefore essential that investigators include MLS when using RECIST guidelines to ensure interobserver reproducibility comparable with the WHO criteria.
This work was supported by Grants-in-Aid from the Ministry of Health, Labour and Welfare, Japan, and was presented in part at the 38th ASCO Annual Meeting, 19 May 2002.