Dr Gregory V. Goldmacher MD PhD, Medical and Scientific Affairs, ICON Medical Imaging, 2800 Kelly Road, Suite 200, Warrington, PA 18976, USA. Tel.: 1-267-482-6300. Fax: 1-267-482-6301. E-mail: firstname.lastname@example.org
Serial evaluations of tumour burden using imaging, mainly computed tomography and magnetic resonance imaging, form the basis for assessing treatment response in many clinical trials of anticancer therapeutics. Traditionally, these evaluations have been based on linear measurements of tumour size. Such measurements have limitations related to variability in technical factors, tumour morphology and reader decisions. Measurements of entire tumour volumes may overcome some of the limitations of linear tumour measurements, improving our ability to detect small changes reliably and increasing statistical power per subject in a trial. Certain technical factors are known to affect the accuracy and precision of volume measurements, and work is in progress to define these factors more thoroughly and to qualify tumour volume as a biomarker for the purposes of drug development.
In assessing the efficacy of novel cancer therapies, the gold standard for trial endpoints is a clinical outcome, such as overall survival. Such endpoints may not be practical, however, and endpoints based on imaging are often used as surrogates, especially because imaging is commonly used to monitor treatment response in clinical practice. Figure 1 illustrates the information flow in oncology studies. The patient is scanned at baseline to determine the initial tumour burden, and then at a series of subsequent time points or visits. A combination of quantitative and qualitative assessments is used to derive an overall treatment response at each visit. Visit responses are combined to yield an endpoint (such as progression-free survival or best overall response).
Measurement methods and instruments should ideally be both precise and accurate. An instrument is precise when its output exhibits low variance (the square of the standard deviation of a set of measurements). It is accurate when repeated measurements centre around a truth standard, without a systematic bias. If a method is accurate and precise, serial measurements of tumour burden represent true changes in disease state. An instrument with low precision will also have low sensitivity to changes, because changes in measurement are likely to be the result of random chance.
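As a minimal illustration of these two definitions, the following Python sketch computes the bias (a measure of accuracy) and the variance (a measure of precision) of repeated measurements against a known truth standard. The function name and the sample values are hypothetical and are not drawn from any study cited here.

```python
from statistics import mean, variance


def bias_and_variance(measurements, truth):
    """Return (bias, sample variance) for repeated measurements of one object."""
    bias = mean(measurements) - truth   # systematic offset from truth: accuracy
    spread = variance(measurements)     # random scatter around the mean: precision
    return bias, spread


# Hypothetical repeated measurements (mm) of a phantom whose true size is 10 mm.
bias, spread = bias_and_variance([10, 12, 14], truth=10)
print(f"bias = {bias} mm, variance = {spread} mm^2")
```

An instrument could score well on one quantity and poorly on the other, which is why both are reported separately.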
To measure tumour size reproducibly in clinical trials, the most common modality is computed tomography (CT), but magnetic resonance imaging (MRI) is also useful. Radiographs lack sensitivity, whereas ultrasound lacks reproducibility because it is highly operator dependent. Although CT and MRI have a different physical basis, they both generate an image of the body composed of picture elements (pixels), each of which contains information about the chemical and structural composition of the tissues present in a corresponding small volume of space (a voxel). Both types of images are acquired as ‘slices’ through the body, which can be assembled into a three-dimensional representation. Although there has been more extensive characterization of CT in the measurement of volumes, the general considerations discussed in this paper apply both to CT and to MRI.
Linear measurements of tumour size
Historically, tumour response assessment criteria have been based on unidimensional or bidimensional lesion measurements, which are summed to produce a quantitative estimate of tumour burden. The disease state at each time point is categorized into complete response, partial response (by comparison with the tumour at baseline), stable disease or progressive disease (by comparison with the nadir tumour burden during the trial). The earliest research evaluated the reproducibility of measurements made using physical examination and chest radiographs [1, 2]. Based on this work, the World Health Organization (WHO) criteria used the sum of the products of the longest perpendicular diameters of measured lesions, and defined partial response as a 50% decrease in this measurement. The Response Evaluation Criteria In Solid Tumours (RECIST), introduced by the US National Cancer Institute (NCI) and the European Organisation for Research and Treatment of Cancer (EORTC) in 2000 and revised in 2009, replaced bidimensional measurements with the single longest lesion diameters, added rules regarding the minimal size of lesions that can be tracked quantitatively, and specified image acquisition parameters for CT and other modalities that support such assessments. In RECIST, a partial response is achieved with a 30% decrease in the lesion diameter, which corresponds to a decrease of approximately 50% in the product of two diameters for simple spherical lesions. Since its release, RECIST has largely replaced WHO criteria for assessment of solid tumours in clinical trials. Modifications of RECIST to accommodate the needs of specific trials are common, and other systems have been developed for lymphoma (the International Working Group criteria), brain tumours (the Macdonald and more recent Response Assessment in Neuro-Oncology criteria), and other malignancies.
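The categorical logic described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not an implementation of RECIST: it uses only the 30% decrease (versus baseline) and 20% increase (versus nadir) thresholds quoted in this article, and it omits the additional rules of the published criteria, such as those for new lesions and non-measured disease. The function and argument names are hypothetical.

```python
def classify_visit(baseline, nadir, current):
    """Simplified categorical response from sums of longest diameters (mm).

    baseline: tumour burden at the baseline scan
    nadir:    smallest tumour burden recorded so far in the trial
    current:  tumour burden at the visit being assessed
    """
    if current == 0:
        return "complete response"
    # Progression is judged against the nadir, not the baseline.
    if nadir == 0 or current >= 1.20 * nadir:
        return "progressive disease"
    # Partial response is judged against the baseline.
    if current <= 0.70 * baseline:
        return "partial response"
    return "stable disease"


print(classify_visit(baseline=100, nadir=100, current=60))
print(classify_visit(baseline=100, nadir=40, current=60))
```

The second call shows why the reference point matters: a burden of 60 mm is well below baseline, yet it counts as progression once the nadir has fallen to 40 mm.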
Methods based on linear measurements offer several advantages. There has been a great deal of experience in their use and implementation. Regulatory agencies are familiar with them and their common variants, and accept their use in evaluating efficacy endpoints. Owing to the simplicity of the measurements, it is easy to train readers in their use. There is evidence from meta-analyses combining data from several large trials that shows a correlation between disease response by these criteria and clinical outcomes.
However, such methods face a number of challenges. One theoretical limitation is that the measurement of a single tumour diameter captures only a tiny proportion of the information available in a high-resolution image. Another limitation, which reduces the sensitivity for detection of changes in tumour burden, is that the ‘stable disease’ category covers a large range of tumour sizes. If a spherical tumour starts out measuring 4 cm in diameter and is composed of cells that are 7 µm in diameter, then according to RECIST any size between 2.8 and 4.8 cm, ranging from the death of 64 billion cells to the generation of 71 billion new tumour cells, would still be considered ‘stable’. Moreover, many tumours are not simple spheres, and linear measurements are quite insensitive for detecting growth and shrinkage in complex shapes (Figure 2). This limitation is particularly significant for trials that use response rate (the proportion of patients who achieve a response category of partial response or complete response) as a primary endpoint, and somewhat less so for trials that use time to progression or progression-free survival as primary endpoints.
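The cell counts quoted above can be reproduced with a short calculation. The sketch below assumes that each cell occupies a cube with 7 µm sides and that the tumour is a solid sphere with no stroma or extracellular space; this packing assumption is not stated in the text and is adopted here only because it reproduces the quoted figures.

```python
import math

CELL_SIDE_CM = 7e-4                  # 7 µm expressed in cm
CELL_VOLUME_CM3 = CELL_SIDE_CM ** 3  # assumption: each cell fills a 7 µm cube


def cell_count(diameter_cm):
    """Approximate number of cells in a solid spherical tumour."""
    sphere_volume = math.pi / 6 * diameter_cm ** 3
    return sphere_volume / CELL_VOLUME_CM3


baseline = cell_count(4.0)
cells_lost = baseline - cell_count(2.8)    # shrinkage to just above the response threshold
cells_gained = cell_count(4.8) - baseline  # growth to just below the progression threshold

print(f"baseline: {baseline / 1e9:.0f} billion cells")
print(f"lost at 2.8 cm: {cells_lost / 1e9:.0f} billion")
print(f"gained at 4.8 cm: {cells_gained / 1e9:.0f} billion")
```

Under these assumptions the baseline tumour contains just under 100 billion cells, so the ‘stable’ range spans roughly two-thirds of that number in either direction.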
The precision of linear tumour measurements is affected by a number of factors. Technical factors include slice thickness, contrast administration and the specific method of measuring lesions on the images. Patient factors include anatomy (including the presence of intra-abdominal fat to provide natural contrast), positioning (or state of respiration, for lung lesions), tumour shape (regular or spiculated) and tumour margin. Reader decisions can also contribute to variability. Several studies of inter- and intrareader variability in unidimensional and bidimensional measurement criteria [8–10] have found that inter-reader variability is higher than intrareader variability, and interobserver misclassifications are most common when deciding whether progression has occurred (up to 30%). The lower variability seen in measurements made by the same reader has supported the concept of centralized readings for cancer clinical trials.
A retrospective analysis of factors contributing to interobserver variability in response classification examined imaging data from 876 subjects in breast cancer trials. With two observers applying RECIST criteria, there were 459 instances of discordance in best overall response, response date, or date of progression. Overall, 77% of discordant cases resulted from justifiable perception differences between the two observers. While many of these differences in perception resulted from choices about which lesions to measure, new lesion detection and other factors, reasonable disagreements by readers about where in a lesion to make measurements did contribute significantly to measurement variability.
The advantages of volumetric assessment
Volumetric assessment can overcome some of the difficulties associated with unidimensional or bidimensional measurement criteria. Even with a simple shape, such as a sphere, that grows or shrinks in a uniform fashion, volume measurements show much greater changes than linear measurements. When a sphere grows in diameter by 20% (the threshold for progression in RECIST), its volume increases by 72.8%. When a sphere shrinks in diameter by 30% (the threshold for partial response), its volume decreases by 65.7%. These differences are even greater with more realistically complex lesions. Figure 3 compares serial changes in longest diameter and volume of a non-small cell lung cancer lesion and illustrates the greater power of volumetry in detecting changes in tumour size. In addition to capturing information about lesion growth in three dimensions, volumetry eliminates variability from reader decisions about where to make the measurement in a lesion, because the entire lesion is measured.
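The percentages for a uniformly scaling sphere follow directly from the cube law, and the short Python sketch below reproduces them. The function names are illustrative only; the second function applies the same reasoning to the product of two diameters used in bidimensional criteria.

```python
def volume_change(diameter_change):
    """Fractional volume change of a sphere for a fractional diameter change."""
    return (1 + diameter_change) ** 3 - 1


def product_change(diameter_change):
    """Fractional change in the product of two diameters (bidimensional measure)."""
    return (1 + diameter_change) ** 2 - 1


for change in (0.20, -0.30):
    print(f"diameter {change:+.0%}: "
          f"product {product_change(change):+.1%}, "
          f"volume {volume_change(change):+.1%}")
```

The same diameter change therefore produces a progressively larger signal as the measurement moves from one to two to three dimensions.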
It is reasonable to wonder whether the advantage of volumes in detecting changes is based only on elementary geometry (volume scales with the cube of the linear measurement) and whether simply cubing the longest diameter of a lesion might give the same sensitivity to changes in size as volumetric assessment. This question was addressed in a study of vestibular schwannoma assessment by MRI. Vestibular schwannoma arises in the cerebellopontine angle and is forced by the anatomy of this space into an irregular shape. In lesions that showed growth, the average annual increase in tumour size was 8 ± 6% for the longest diameter, 31 ± 26% for the diameter cubed and 61 ± 34% for volumetric measurements. Volumetric measurements showed a smaller intrareader coefficient of variation than linear measurements, resulting in a fivefold greater sensitivity to change relative to the coefficient of variation. Thus, volumetric assessments are more sensitive to changes in tumour size than either greatest linear measurements or the cube of the greatest linear measurement.
Volumetry, therefore, offers advantages over RECIST and similar instruments. Using volume as a continuous variable allows the use of more powerful statistical analysis tools, such as parametric tests. Even using categorical assessments, the greater sensitivity of volume to response and progression increases the statistical power of a trial per subject enrolled. This improves the ability to detect both response and progression, which permits a trial to be conducted using fewer subjects over a shorter period of time, with obvious benefits to trial sponsors in cost savings and simplified trial logistics. Moreover, there is a strong ethical reason to detect progression early (rather than waiting until progression is so unequivocal that precise quantitative assessment is not required), because it allows patients to stop an ineffective treatment and try something else. This is particularly true in trials of first-line therapies. Accurate assessment of when progression has occurred is also important to ensure that an effective treatment is not stopped prematurely. In clinical trials, this is sometimes assisted by having an independent reviewer confirm progression before a patient is taken off trial.
Image analysis in volumetric assessment
While lesion volumes may be estimated from measurements of lesion diameters, using simplifying assumptions, measurements based on more sophisticated methods called ‘planimetry’ or ‘segmentation’ are more accurate. These require defining which portions of an image represent tumour and which represent nonmalignant tissue. In manual segmentation, a reader draws a region of interest around the lesion boundaries on each slice where the lesion is seen. The area of the region of interest is calculated from the number of pixels enclosed by it. The area is multiplied by the slice thickness to calculate the volume of the lesion within that slice, and the volumes from all slices are added to produce the total lesion volume. This method is highly accurate, but extremely time consuming (and therefore expensive).
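The slice-by-slice summation described above can be sketched as follows. The example assumes that the reader's regions of interest have already been converted to binary masks (1 for tumour, 0 for background) and that slices are contiguous with no gaps or overlap; the function name, argument names and mask values are hypothetical.

```python
def lesion_volume_mm3(masks, pixel_spacing_mm, slice_thickness_mm):
    """Sum per-slice volumes: pixel count x pixel area x slice thickness.

    masks:              one 2D list of 0/1 values per slice showing the lesion
    pixel_spacing_mm:   (row spacing, column spacing) within a slice
    slice_thickness_mm: thickness of each slice (assumed contiguous)
    """
    pixel_area = pixel_spacing_mm[0] * pixel_spacing_mm[1]
    total = 0.0
    for mask in masks:
        n_pixels = sum(1 for row in mask for value in row if value)
        total += n_pixels * pixel_area * slice_thickness_mm
    return total


# Two hypothetical slices: 4 tumour pixels on the first, 2 on the second.
slices = [
    [[0, 1, 1, 0],
     [0, 1, 1, 0]],
    [[0, 1, 0, 0],
     [0, 1, 0, 0]],
]
volume = lesion_volume_mm3(slices, pixel_spacing_mm=(0.5, 0.5), slice_thickness_mm=2.0)
print(f"lesion volume = {volume} mm^3")
```

The arithmetic is trivial once the masks exist; the time and cost lie in drawing the boundaries on every slice.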
A variety of computerized segmentation methods can facilitate defining tumour boundaries and calculating volumes. These processes incorporate region growing, connectivity and edge-detection algorithms. In seed-based region growing, an initial matrix of pixels within a lesion (the ‘seed’) is sampled. The pixel values within the seed are used to establish a range of values (defined by a ‘z-score’ or other criteria). The pixels adjacent to the defined matrix are evaluated, and those with values within the defined range are added to the lesion. This process is iterated for each surrounding pixel, causing the initially defined region to grow until it covers the entire tumour. By changing the parameters of the algorithm, such as the size of the sampled matrix, the z-score and how robust the connections between adjacent regions of tumour must be, the algorithm can be made more or less ‘avid’. Edge-detection algorithms are used to analyse the signal characteristics of pixels, moving out from some initially selected point, convert the signal characteristics into a mathematical function, and then use inflection points (defined by the second derivative of the function) in the signal to define boundaries.
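A minimal sketch of seed-based region growing on a single slice is given below. It assumes that the accepted intensity range is the seed mean plus or minus a chosen number of standard deviations, and that pixels are connected only through their four edge neighbours; both choices, along with the function name and the toy image, are illustrative assumptions rather than a description of any particular software package.

```python
from collections import deque
from statistics import mean, pstdev


def grow_region(image, seed, z=2.5):
    """Grow a region outwards from seed pixels using an intensity range.

    image: 2D list of pixel values
    seed:  list of (row, col) positions sampled inside the lesion
    z:     half-width of the accepted range in seed standard deviations
    """
    values = [image[r][c] for r, c in seed]
    centre, spread = mean(values), pstdev(values)
    low, high = centre - z * spread, centre + z * spread

    region = set(seed)
    queue = deque(seed)
    n_rows, n_cols = len(image), len(image[0])
    while queue:
        r, c = queue.popleft()
        # Examine the four edge-connected neighbours of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < n_rows and 0 <= nc < n_cols
                    and (nr, nc) not in region
                    and low <= image[nr][nc] <= high):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


# Toy image: a bright 3 x 3 lesion plus one isolated bright pixel at (4, 5).
image = [
    [10, 10, 10, 10, 10, 10],
    [10, 98, 102, 100, 10, 10],
    [10, 101, 99, 103, 10, 10],
    [10, 100, 97, 101, 10, 10],
    [10, 10, 10, 10, 10, 100],
]
region = grow_region(image, seed=[(1, 1), (1, 2), (2, 1), (2, 2)])
print(len(region), "pixels in region")
```

Raising `z` widens the accepted range and makes the algorithm more ‘avid’. The isolated bright pixel is excluded in this example because it has no connected path to the seed, even though its value falls inside the accepted range.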
The automated algorithms described above allow the application of modern computational power to image segmentation. If applied without human guidance, however, they can result in significant errors. A semi-automated approach combines the accuracy of human measurements with the speed of automation. A reviewer initially identifies lesions to be measured and defines a broad boundary (such as a circle around the lesion on one slice, which is propagated through all the adjacent slices to define a cylinder). An automated segmentation algorithm is used to outline the lesion within that boundary. The reader then adjusts the outline to correct the results of the algorithm. Many software packages for carrying out this process are in active development.
Precision of volume measurements
Whether it is done manually or automatically, volumetry relies on distinguishing the boundaries between tumour and normal tissue. Errors in measuring volumes occur at the edges, where an analog reality must be converted into a digital representation. Volumetric measurements depend on the accurate modelling of the lesion surface. This has important implications for the factors that affect the precision of volumetric assessment.
First, measurement precision is inversely related to an object's surface area-to-volume ratio. This ratio decreases with increasing lesion size. Thus, ceteris paribus, smaller lesions are harder to measure accurately than larger lesions, and irregularly shaped lesions (lobulated or spiculated) are more problematic than simple shapes. Second, smaller voxels give more accurate surface models and thus more accurate volume measurements. This becomes intuitively clear when considering a shape built from children's blocks (Figure 4). Larger blocks yield crude approximations of complex objects, while smaller blocks yield more accurate depictions. Voxel size is determined by field of view and matrix size within the plane of a slice, and by slice thickness in the perpendicular axis. Therefore, thinner slices give more accurate volume measurements.
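The building-block intuition can be checked numerically. The sketch below approximates a sphere with cubic voxels, counting a voxel as part of the object when its centre lies inside the sphere, and compares the result with the analytical volume. Isotropic voxels, the centre-inclusion rule and the grid alignment are simplifying assumptions made for illustration; clinical voxels are usually thicker in the slice direction, and real segmentation is affected by partial volume averaging.

```python
import math


def voxelised_sphere_volume(diameter_mm, voxel_mm):
    """Volume of a sphere approximated by cubic voxels whose centres lie inside it."""
    radius = diameter_mm / 2
    n = math.ceil(diameter_mm / voxel_mm)  # voxels per axis covering the sphere
    centres = [-radius + (i + 0.5) * voxel_mm for i in range(n)]
    inside = sum(
        1
        for x in centres
        for y in centres
        for z in centres
        if x * x + y * y + z * z <= radius * radius
    )
    return inside * voxel_mm ** 3


diameter = 10.0
true_volume = math.pi / 6 * diameter ** 3
for voxel in (5.0, 2.5, 1.25, 0.25):
    estimate = voxelised_sphere_volume(diameter, voxel)
    error = (estimate - true_volume) / true_volume
    print(f"voxel {voxel:>4} mm: {estimate:8.1f} mm^3 ({error:+.1%})")
```

With 5 mm blocks the 10 mm sphere is represented by eight voxels, giving 1000 mm³ against a true volume of about 523.6 mm³. The error at intermediate voxel sizes depends on how the grid happens to align with the surface, so it need not shrink monotonically, but it becomes small once the voxels are much smaller than the object.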
These predictions have been confirmed by phantom studies. In one study, spherical and lobulated nodule phantoms of known sizes (from 3 to 15 mm in diameter) were imaged using a multidetector CT scanner. The precision and accuracy of semi-automated volume measurements were influenced by object size and geometry and by slice thickness. For example, nodule volume was overestimated by 57.2 and 58% at reconstruction slice thicknesses of 0.625 and 1.25 mm, respectively, for a spherical 3 mm nodule. The overestimation was 6.6 and 12.6%, respectively, for a 5 mm nodule. For a lobulated 3 mm nodule, the volume was underestimated by 4.7% at a slice thickness of 0.625 mm, but overestimated by 56 and 108.9% at slice thicknesses of 1.25 and 2.5 mm, respectively. In translating these results to clinical lesions, it must be noted that larger lesions may abut surrounding anatomical structures and may have edges that are difficult to discern.
For clinical lesions, the most extensive work has been done on lung tumours, where the precision of CT volumetry has been characterized in both phantom measurements and clinical scenarios [16–18]. A team at the US Food and Drug Administration (FDA) has studied nodules of varying size, density and geometry using a variety of acquisition and reconstruction parameters. Measurements made with one volumetric software tool on thin-slice CT acquisitions showed a relative bias of 88, 14 and 4%, respectively, when measuring 5, 8 and 10 mm low-density (−630 Hounsfield units) spherical nodules attached to the vasculature, and −3, −6 and −8%, respectively, for similar high-density (+100 Hounsfield units) nodules. The error for low-density nodules could be reduced by an order of magnitude using mathematical methods to account for vasculature attachment.
Improved image contrast (the difference between the signal intensity of normal and abnormal tissue) also leads to more accurate volumetric measurements, because it allows clearer definitions of edges. In CT, this means that using intravenous contrast appropriately is important if volumetric measurements are planned. This is particularly true for lesions in the liver, spleen and pancreas, where the timing of the scan after contrast administration can affect the appearance of certain types of lesions. For MRI, greater magnetic field strength generates greater image contrast, as well as allowing for smaller pixels (a larger matrix size within the same field of view).
Different tumour types pose different challenges, both in image acquisition and in image analysis. This makes the precision and accuracy of volumetric assessment in general difficult to summarize in statistically rigorous terms. An extensive review of the literature on CT volumetry indicates that appropriately selected acquisition and reconstruction parameters can lead to highly accurate and precise measurements. However, there is still a need for a better understanding of how to control volumetric accuracy as a function of various interrelated technical variables, and the data on MR volumetry are less well developed than on CT.
Limitations of volumes
Some confounders of size-based response assessment apply to volumetric analysis at least as much as to linear measurements. These include variability in the selection of a subset of lesions to follow, when numerous lesions are present. The accuracy of lesion measurement is limited by technical factors, such as partial volume averaging and the conspicuity of the lesion against the background. In fact, volume measurements are more dependent on accurate edge detection than linear measurements, because for a linear measurement the lesion has to be well defined on only one slice. Volume measurements can be confounded by inconsistent image acquisition parameters across time points, including variations in slice thickness and contrast administration. The more stringent technical requirements of volumetric imaging may result in a greater proportion of scans being unevaluable, resulting in loss of patients from trials. This is somewhat offset by the smaller number of data points required in a trial, because of the greater sensitivity of volumetric measurements to change. Training readers to segment lesions consistently can also be a challenge.
Some investigators have argued that neither linear measurements nor volumes are adequate predictors of clinical outcome [20–23]. These arguments have largely been based on studies of tumours that are internally heterogeneous. Assessment of tumour volume may need to be combined with segmentation methods that distinguish viable tumour from necrotic material and other nonmalignant tissue types, such as fibrosis. Size alone, even with optimized volumetric assessment, does not capture all relevant information about tumour biology. In the future, volumetrics may be coupled with structural and functional imaging methods that characterize tissue by its microanatomy or physiology (Figure 5). Microanatomy can be assessed in a crude way by tumour attenuation on CT, and this information is already incorporated into methods such as the Choi criteria for gastrointestinal stromal tumours. Computed tomography ‘texture analysis’, a measurement of microscopic heterogeneity within a mass, can be used to distinguish malignant from benign tissue. Diffusion-weighted MRI is used to assess the cellular density within a mass, which can be used to detect tumour cells undergoing involution and death and to predict the response to treatment.
Physiological imaging includes indicators of metabolism and vascularity. Tissue metabolic rate can be measured with positron emission tomography (PET) using 18F-fluorodeoxyglucose as a tracer. 18F-Fluorodeoxyglucose-PET has been a workhorse in diagnosing, staging and monitoring response to therapy for the last 20 years. 18F-Fluorothymidine is a novel PET tracer, which is a more specific indicator of cellular proliferation. Tumour tissue shows increased vascularity and microvascular permeability. Microvascular assessment with CT or MR perfusion imaging, or with dynamic contrast-enhanced MRI, can provide early measures of antitumour agent activity.
These methods are still evolving and are currently useful mainly in early-phase trials. Methods based on other molecular imaging techniques are in laboratory development. Ultimately, multispectral imaging, combining tumour volumetrics with microanatomical and physiological imaging, can give a more comprehensive assessment of treatment response.
Efforts to qualify volume as a biomarker
The Foundation for the National Institutes of Health (FNIH) Biomarkers Consortium Oncology Imaging Project Teams and Radiological Society of North America/Quantitative Imaging Biomarkers Alliance (RSNA/QIBA) have collaborated on the technical and clinical qualification of CT tumour volumetry as an improved biomarker of chemotherapy response. The objective is to engage the FDA through its biomarker qualification process, to review the evidence from literature and consensus and to collect data prospectively from fit-for-purpose study designs. Initially, lung cancer is the indication being targeted in this process. Studies are also underway by the FNIH, the RSNA/QIBA Quantitative CT Committee and the National Cancer Institute to gather additional data on the precision of volumetry systematically, using both phantom and clinical data sets, and to link volumetric assessment to clinical outcomes.
Radiological measurement of tumour burden in clinical trials has evolved with the development of imaging and computational technology. Volumetric assessment offers significant advantages for measuring disease burden over more traditional linear measurements. Together with centralized review of image data, volumetric assessment can significantly reduce measurement variance, increasing statistical power per subject and enabling earlier detection of progression. While questions remain about some technical aspects of volumetric assessment, a focused effort is underway to qualify tumour volume as a biomarker in specific settings. In addition, novel imaging biomarkers that evaluate microanatomy and physiology offer the promise of more rapid and accurate evaluation of the efficacy of anticancer drugs in future trials, though these methods need to be validated and compared with conventional studies before they gain wide acceptance.