Assessment of consistency in contouring of normal‐tissue anatomic structures

The purpose of this work is to estimate the uncertainty in the manual contouring of normal anatomical structures. The heart, esophagus, and spinal cord were contoured manually on six sets of computed tomography images by six dosimetrists whose experience ranged from 1 year to over 15 years. To determine the differences between inter‐ and intraobserver variations, each data set was contoured by one of the dosimetrists five times and once each by the five other dosimetrists. The magnitude of the discrepancies in delineating the contours was assessed. Intradosimetrist contouring discrepancies were as follows: esophagus, average 0.3 cm and maximum 2.9 cm; heart, average 0.5 cm and maximum 7.6 cm; and spinal cord, average 0.1 cm and maximum 0.7 cm. Interdosimetrist contouring discrepancies were as follows: esophagus, average 0.4 cm and maximum 3.1 cm; heart, average 0.7 cm and maximum 8.1 cm; and spinal cord, average 0.2 cm and maximum 0.9 cm. Significant discrepancies can occur when normal anatomic structures are contoured manually. Interdosimetrist discrepancies are typically slightly greater than intradosimetrist discrepancies. The magnitude of the discrepancies does not appear to be correlated with the experience of the dosimetrist. © 2003 American College of Medical Physics. PACS number(s): 87.53.–j, 87.66.–a


INTRODUCTION
The need for manual delineation of normal-tissue anatomic structures is a major stumbling block that impedes full automation of the radiation treatment planning process. Consequently, much effort has been put into the search for reliable ways to automate this task [1-8]. An important step in the development of any automatic contouring algorithm is assessing its accuracy. A reasonable expectation is that the algorithm be no worse than what is manually achievable. Knowledge of the inter- and intraobserver variations in manual contouring can help to quantify this expectation.
Prior studies [9-18] have examined the consistency in delineation of target volumes [gross tumor volume (GTV), clinical target volume (CTV), and planning target volume (PTV)] and normal tissue anatomic structures. Not surprisingly, intraobserver variation was found to be less than interobserver variation [9-13], for example, a 5.5% volume variation as compared with a 17.5% variation [11]. In these cases, however, delineation of the target volume was based on interpretation of the extent of tumor as displayed in an imaging study and not on the assessment of a particular anatomic structure. Prostate target volumes are notable exceptions, because the target volume is often the entire prostate [7,9,10]. For these cases, delineation of the target volume is tantamount to delineation of the prostate, and variations in the delineated target volume are the result of different interpretations of the extent of the anatomic structure.
The purpose of this study was to assess the variations in the manual delineation of normal anatomy. Unlike previous studies [7,12,16,18], which focused on uncertainties in the total volume of an anatomic structure, the present study assesses the spatial uncertainties that arise in locating the boundaries of anatomic structures. Because beam geometries and treatment portals determined during treatment planning are based partly on the location of one or more critical anatomic structures, errors in locating the boundaries of these structures may have profound effects on the design of a treatment plan [19].

A. Contour delineation
Six dosimetrists, with experience ranging from 1 year to over 15 years, participated in the study. The heart, esophagus, and spinal cord were outlined on computed tomography (CT) image data sets of six patients with thoracic cancer who were treated at The University of Texas M. D. Anderson Cancer Center. The CT images were acquired using a virtual simulator (AcQSim; Philips Radiation Oncology Systems, Highland Heights, OH) with a slice thickness of 0.3 cm and between 64 and 160 transverse slices per patient data set, and were then transferred to a treatment planning system (Pinnacle³; Philips Radiation Oncology Systems, Milpitas, CA). All patient-identifying information was removed from the system.
To determine the intraobserver variation, each dosimetrist was assigned one of the CT data sets and asked to contour the three structures five times. Intervals between contouring sessions ranged from back-to-back sessions to several days. To determine the interobserver variation, each CT data set was also contoured once by each of the remaining five dosimetrists. The dosimetrists were asked not to use the automatic contouring tools provided by the treatment-planning system, but rather to contour each structure manually.
To facilitate the analysis, the contours were transferred from Pinnacle³ to a treatment planning system that was developed in-house [20] using the Radiation Therapy Oncology Group (RTOG) transfer utility provided by the Pinnacle³ system. The contours for each of the six patients were then separated, and individual files were created for each contour of the heart, esophagus, and spinal cord. A program was written to read in the contour data and carry out the analysis.

B. Data analysis
If the variation in a set of contours is not too great, the contours will typically divide the plane (i.e., a particular CT slice) into three regions. These regions are an inner region of points enclosed by all contours, an outer region of points that are outside all contours, and an annular region of points that are inside some contours and not others. Figure 1 illustrates this situation for five hypothetical contours (thin lines) where, for emphasis, the inner and outer boundaries of the annular region are drawn with a thick line.
In our analysis, we examined the width of the annular region at points along its perimeter. These points corresponded to the points defining the anatomic regions on the treatment planning system and varied from around 10 to over 100 points per slice. The width can be determined from a map where each pixel is assigned a value equal to the number of contours that enclose it. A pixel is assigned a value of 5 (for a set of five contours) if it lies inside the inner region, a value of 0 if it lies in the outer region, or an intermediate value equal to the number of enclosing contours if it lies within the annular region. The map is then used to generate a histogram of the number of pixels, or area A(n), contained within n or more contours. Thus, A(5) is the number of pixels enclosed by the inner boundary, A(4) is A(5) plus the number of pixels with a value of 4, and so forth, with A(1) being the number of pixels enclosed by the outer boundary. The quantity A(5)/A(1) equals the ratio of areas enclosed by the inner and outer boundaries.
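The enclosure map and the cumulative areas A(n) can be illustrated with a minimal sketch. It is not the program used in the study: the function names, the even-odd ray-casting test at pixel centres, and the five square test contours are all assumptions made for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast from (x, y) towards +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def coverage_map(contours, width, height):
    """For each pixel centre, count how many contours enclose it."""
    return [
        [sum(point_in_polygon(col + 0.5, row + 0.5, c) for c in contours)
         for col in range(width)]
        for row in range(height)
    ]


def cumulative_areas(cmap, n_contours):
    """A(n): number of pixels enclosed by n or more contours."""
    return {
        n: sum(value >= n for row in cmap for value in row)
        for n in range(1, n_contours + 1)
    }


# Hypothetical data: five nested squares sharing a corner at the origin.
contours = [[(0, 0), (s, 0), (s, s), (0, s)] for s in (4, 5, 6, 7, 8)]
cmap = coverage_map(contours, width=10, height=10)
areas = cumulative_areas(cmap, n_contours=len(contours))
ratio = areas[5] / areas[1]  # ratio of inner to outer enclosed area, A(5)/A(1)
```

In this toy case the smallest square plays the role of the inner boundary and the largest the outer boundary, so A(5)/A(1) is simply the ratio of their pixel counts.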
To determine the width of the annular region from the map, we use an edge-detection algorithm [1] to trace out the boundary of a region with a specified pixel value. We select a threshold pixel value of 5 to trace the inner boundary and a threshold pixel value of 1 to trace the outer boundary. For each point on the outer boundary, we define the width of the annular region at that point to be the distance to the nearest inner boundary point. From this set of width "measurements" for a given anatomical structure, CT slice, and inter/intradosimetrist data set, we extract the maximum values, values corresponding to the 95th percentile width, and mean values.
Table I summarizes, for each dosimetrist, the maximum, 95th percentile, and mean discrepancies for the heart, esophagus, and spinal cord. The experience of each dosimetrist is also noted. The esophagus had a mean discrepancy of 0.3 cm and a maximum of 2.9 cm. For the heart, the mean discrepancy was 0.5 cm, and the maximum was 7.6 cm. The mean discrepancy for the spinal cord was 0.1 cm, whereas the maximum was 0.7 cm. Figure 2 summarizes the information in Table I on graphs of maximum discrepancy versus years of experience [Fig. 2(a)] and mean discrepancy versus years of experience [Fig. 2(b)]. It does not appear that contouring discrepancy can be correlated with the experience of the dosimetrist.
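As an illustration of the width measurement defined in the data analysis (the distance from each outer-boundary point to the nearest inner-boundary point), the following minimal sketch computes the widths and their summary statistics. The function names, the pixel size of 0.1 cm, the nearest-rank percentile rule, and the boundary points are assumptions; the source does not state the pixel size or how the 95th percentile was computed.

```python
import math


def annulus_widths(outer_points, inner_points, pixel_size_cm=1.0):
    """Width at each outer-boundary point: distance to the nearest inner point."""
    widths = []
    for ox, oy in outer_points:
        nearest = min(math.hypot(ox - ix, oy - iy) for ix, iy in inner_points)
        widths.append(nearest * pixel_size_cm)
    return widths


def summarize(widths):
    """Maximum, 95th percentile (nearest-rank rule, an assumption), and mean."""
    ordered = sorted(widths)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "max": ordered[-1],
        "p95": ordered[rank - 1],
        "mean": sum(ordered) / len(ordered),
    }


# Hypothetical boundary points in pixel coordinates for one CT slice.
outer = [(0, 0), (10, 0)]
inner = [(0, 3), (10, 4)]
widths = annulus_widths(outer, inner, pixel_size_cm=0.1)  # assumed pixel size
stats = summarize(widths)
```

The brute-force nearest-point search is quadratic in the number of boundary points, which is adequate for the 10 to 100 points per slice mentioned in the text.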

A. Comparison of intradosimetrist contours
In addition to the discrepancies in locating the boundaries of structures on individual transverse slices, discrepancies also arose in locating the boundaries of structures in the superior-inferior direction. In particular, the heart is a structure that is displayed in only a limited number of slices in the thorax. Significant inconsistency existed as to the extent of the heart, even when the same dosimetrist repeatedly contoured the heart on the same CT data set. Table II summarizes the intradosimetrist discrepancy in contouring the heart, for which variations of over 2 cm were found in the superior-inferior direction.
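One plausible way to quantify a superior-inferior discrepancy is to compare the number of slices on which the structure was contoured in each session and scale by the 0.3 cm slice thickness. The sketch below assumes that definition, which the source does not spell out; the function names and slice ranges are hypothetical.

```python
SLICE_THICKNESS_CM = 0.3  # slice thickness reported for the CT data sets


def slices_spanned(slice_indices):
    """Number of slices between the first and last contoured slice, inclusive."""
    return max(slice_indices) - min(slice_indices) + 1


def si_extent_discrepancy_cm(sessions, slice_thickness_cm=SLICE_THICKNESS_CM):
    """Largest minus smallest superior-inferior extent across sessions, in cm."""
    spans = [slices_spanned(s) for s in sessions]
    return (max(spans) - min(spans)) * slice_thickness_cm


# Hypothetical slice indices on which the heart was contoured in three sessions.
sessions = [range(40, 70), range(38, 75), range(41, 72)]
discrepancy_cm = si_extent_discrepancy_cm(sessions)
```

With these made-up sessions the extents span 30, 37, and 31 slices, so the discrepancy is 7 slices, or about 2.1 cm.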

B. Comparison of interdosimetrist contours
Generally, interdosimetrist discrepancies were slightly greater than intradosimetrist discrepancies. The distances between the innermost contour and outermost contour varied depending on the structure. The average distance was calculated for every slice contained in the image set. Table III summarizes the maximum and mean discrepancies for the three anatomic structures for each patient. Perhaps more remarkable than the magnitude of the discrepancies is the observation that, in some transverse planes for some patients, there was significant disagreement as to the location of the esophagus. Figure 3 shows an instance in which two different dosimetrists placed the esophagus in two disjoint locations. In addition, variations of over 5 cm were observed in the extent of the delineated heart in the superior-inferior direction.

DISCUSSION
Previous studies have demonstrated that the delineation of target volumes by radiation oncologists is highly subjective and may be significantly affected by the availability of additional information, e.g., from imaging studies. By contrast, delineation of normal anatomic structures is usually based on CT data alone and, therefore, should be somewhat less subjective. Nevertheless, this study found that contouring of normal anatomic structures depends on the judgment of the dosimetrist. Reducing these inconsistencies would be very desirable. Thus, an automated method should have a consistency at least as good as that which dosimetrists can achieve manually.
Agreement does not yet exist on a suitable metric for assessing consistency in contouring normal-tissue anatomic structures. Most of the earlier studies, which addressed contouring of tumor volumes, used the ratio of volumes as a metric [11,13,16]. This ratio may be a useful metric when applied to the evaluation of consistency in delineation of target volumes, but the relative volumes [...] [21] have used a standard statistical value that assesses the number of voxels contained within the intersections and complements between two or more structure volumes. The connection between this value and clinical consequences has not yet been established. We propose that a distance metric may be clinically relevant for assessing contouring uniformity. During the treatment-planning process, beam geometries and treatment portals are often determined to avoid normal-tissue anatomic structures for which the margins can be relatively small. If the boundary contour of a critical anatomic structure is displaced from its true location, it is possible that the margin between the portal edge and the structure would be insufficient to spare the anatomic structure from irradiation. This may have significant consequences for the potential morbidity of the treatment plan.
The level of disagreement among dosimetrists in contouring normal-tissue anatomy was somewhat surprising and may have significant consequences, especially for the esophagus, where, in several instances, contours were placed in disjoint locations. It is possible that, given the differ-

CONCLUSIONS
Discrepancies in the manual delineation of anatomic structures can vary substantially, up to several centimeters. Generally, the discrepancies do not appear to be correlated with the experience of the dosimetrist. An automated method could standardize the delineation of anatomical structures and thereby decrease the amount of variation from patient to patient. Automated delineation would need to be at least as accurate as that achievable manually. This study will help to set a guideline for the accuracy expected of an automated method.