Variability in commercially available deformable image registration: A multi‐institution analysis using virtual head and neck phantoms

Abstract

Purpose: The purpose of this study was to evaluate the performance of three common deformable image registration (DIR) software packages across algorithms and institutions.

Methods and Materials: The Deformable Image Registration Evaluation Project (DIREP) provides ten virtual phantoms derived from computed tomography (CT) datasets of head-and-neck cancer patients imaged over a single treatment course. Using the DIREP phantoms, 35 institutions submitted DIR results generated with Velocity, MIM, or Eclipse. Submitted deformation vector fields (DVFs) were compared to ground-truth DVFs to calculate target registration error (TRE) for six regions of interest (ROIs). Statistical analysis was performed to determine the variability between the DIR software packages and the variability of users within each algorithm.

Results: Overall mean TRE was 2.04 ± 0.35 mm for Velocity, 1.10 ± 0.29 mm for MIM, and 2.35 ± 0.15 mm for Eclipse. The MIM mean TRE was significantly different from both the Velocity and Eclipse mean TREs for all ROIs. The Velocity and Eclipse mean TREs were not significantly different except when evaluating the registration of the cord or mandible. Significant differences between institutions were found for the MIM and Velocity platforms; however, these differences could be explained by variations in Velocity DIR parameters and MIM software versions.

Conclusions: Average TRE was shown to be <3 mm for all three software platforms. However, maximum errors could be larger than 2 cm, indicating that care should be exercised when using DIR. While MIM performed statistically better than the other packages, all evaluated algorithms had an average TRE smaller than the largest voxel dimension. For the phantoms studied here, significant differences between users of the same algorithm were minimal, suggesting that the algorithm used may have more impact on DIR accuracy than the particular registration technique employed. A significant difference in TRE was discovered between MIM versions, showing that DIR QA should be performed after software upgrades, as recommended by TG-132.


1 | INTRODUCTION
Anatomical changes that occur over a course of treatment can have meaningful effects on radiation dose delivery. 1,2 To quantify the impact of these changes, it is important to identify corresponding physical points between multiple images of the same patient. Deformable image registration (DIR) has become available in many commercial treatment planning and contouring systems for this purpose, and is particularly useful for propagating contours from one image to another 3 or accumulating dose across a treatment course. 4 Understanding the accuracy of DIR is a critical responsibility for the clinician, as DIR accuracy may influence clinical decisions.
For example, the accuracy required of DIR for contour propagation may be less stringent, provided the propagated contours are reviewed and corrected, than the accuracy required for deformable dose accumulation, where decisions may be made concerning organ-at-risk (OAR) tolerances. This study aims to further the understanding of DIR accuracy for typical head and neck cancer patients over the treatment course.
Deformable image registration is a non-affine process that uses mathematical models to deform one image to match another. Unlike rigid or affine registration, DIR assigns each voxel a deformation vector that may be only loosely dependent on the vectors of neighboring voxels. The resulting deformed image is characterized by its deformation vector field (DVF), the matrix of vectors that defines the relationship between the original and deformed images. Because of the high number of degrees of freedom in DIR algorithms, the DVF output by an algorithm may describe an image deformation that is not biologically or geometrically plausible.
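To make the DVF concept concrete, the sketch below (Python/NumPy, not part of the study) maps a voxel through a hypothetical displacement field; the array shape and voxel dimensions are illustrative assumptions only.

```python
import numpy as np

# Hypothetical DVF: one 3-component displacement vector (mm) per voxel,
# stored as a (z, y, x, 3) array for a small 10 x 64 x 64 volume.
dvf = np.zeros((10, 64, 64, 3))
dvf[5, 32, 32] = [1.5, -0.8, 0.2]  # displacement assigned to one voxel

# Illustrative voxel dimensions (mm): 3 mm slices, 1 mm in-plane.
voxel_size = np.array([3.0, 1.0, 1.0])

def map_point(index, dvf, voxel_size):
    """Return the deformed physical position (mm) of a voxel index."""
    position = np.asarray(index) * voxel_size  # undeformed position
    return position + dvf[tuple(index)]        # add the voxel's vector

print(map_point((5, 32, 32), dvf, voxel_size))  # -> [16.5 31.2 32.2]
```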
Because of the freedom with which DIR may deform an image, there have been research efforts to validate the use of DIR in radiation therapy. Ground-truth models, either with a known DVF relating pre- and post-deformation images or with phantoms containing known landmarks, are often used for this purpose. 5-10 Each of these ground-truth models provides a framework that can help the clinical physicist validate the DIR implemented in their clinic. This study uses virtual phantoms provided by the Deformable Image Registration Evaluation Project (DIREP). 10 DIREP created publicly available phantoms based on computed tomography (CT) data from head and neck cancer patients. These phantoms provide a clinically based ground-truth model that encompasses the anatomical changes that occur over the course of a typical treatment.
As an additional tool to validate DIR accuracy, AAPM's Task Group 132 (TG-132) report provides a framework for commissioning and QA of DIR output. 11 The report provides several digital deformable phantoms for testing DIR, along with recommended tolerances. When using DIR, TG-132 recommends that relevant boundaries and anatomical features in the registered images agree to within 1-2 voxels, and that any additional error be fed into planning margins. However, the report does not provide information on how the digital phantoms compare to clinical cases, nor site-specific accuracy expectations.
Several studies have quantified DIR accuracy from different commercial or research algorithms, using data submitted by multiple institutions. This has been done using either contour-based 12,13 or landmark-based 14,15 analysis. Contour-based methods can be subjective, as they introduce variability based on the observer drawing the contours and contain little information about the accuracy of voxels within the contour. Landmark-based methods provide accuracy information in the vicinity of the landmark, but are limited to those regions. In both cases, a contour or landmark that is easy for a human to identify may also be easy for an algorithm to identify. This may produce biased accuracy data. Comparatively, the DIREP model assesses deformation accuracy for each voxel by comparing the entire registration DVF to the ground-truth DVF. Additionally, the previous studies referenced have been limited by sample size, with a maximum of 14 institutions submitting results for commercially available algorithms.
To establish a benchmark for DIR accuracy, several commercial algorithms were previously tested with the DIREP phantoms. 16 Following the benchmarking study, 35 institutions have submitted DIREP registrations for the complete phantom set. The aim of this work is to analyze the data submitted from these institutions using the DIREP ground-truth model in order to characterize the inter-algorithm and inter-institutional variability of three commercial DIR software packages: Velocity (Varian Medical Systems, Palo Alto, CA), MIM (MIM Software Inc., Cleveland, OH), and Eclipse (Varian Medical Systems, Palo Alto, CA). This is done to provide the clinical physicist with insight that can assist in implementing DIR clinically, and to enhance the understanding of the inherent accuracy and limitations of these DIR algorithms. Additionally, this study aims to use these results to augment the expectations set by the recommendations of TG-132.

2 | METHODS AND MATERIALS

2.A | Ground-truth model
DIREP provides ten virtual phantoms based on CT images acquired at the start of treatment (SOT) and near the end of treatment (EOT) for ten patients treated for head and neck cancer. All of the virtual phantom datasets had an in-plane resolution of 0.97-1.37 mm and a slice thickness of 3 mm. 10 A biomechanical algorithm 17,18 and a thin-plate splines algorithm 19 were used to create an anatomically representative pair of images for which the underlying "true" deformation field was known.
The thin-plate splines algorithm was available as a deformation tool within the ImSimQA software package (Oncology Systems Limited, Shrewsbury, Shropshire, UK). In all, these tools allowed for the modeling of the following anatomical changes: head rotation and translation, mandible rotation, spine flexion, shoulder movement, hyoid movement, tumor/node shrinkage, weight loss, and parotid shrinkage. Physician-drawn brainstem, spinal cord, mandible, left parotid, and right parotid contours are included in these phantoms to allow for the analysis of those structures. Figure 1 shows an example of one of the DIREP phantoms along with its associated ground-truth DVF.
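As an illustration of how a thin-plate spline deformation behaves, the following sketch uses SciPy's RBFInterpolator with a thin-plate spline kernel; it is a generic example with made-up control points, not the ImSimQA implementation or the actual DIREP deformations.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Made-up control points before (sources) and after (targets) a modeled
# anatomical change, e.g. landmarks on a shrinking structure (mm).
sources = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 30.0, 0.0],
                    [0.0, 0.0, 30.0], [30.0, 30.0, 30.0]])
targets = sources * 0.9  # uniform 10% shrinkage toward the origin

# A thin-plate spline maps each control point exactly and interpolates
# a smooth displacement field everywhere else.
tps = RBFInterpolator(sources, targets - sources, kernel="thin_plate_spline")

query = np.array([[15.0, 15.0, 15.0]])
print(query + tps(query))  # deformed position of an interior point
```

Because the spline matches every control point exactly and interpolates smoothly between them, a handful of landmark displacements suffices to define a dense, anatomically smooth deformation field.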

2.B | Basic DIR evaluation
To test the accuracy of a given DIR process, each EOT phantom is registered to its associated SOT phantom, as would be performed for dose accumulation. The DVFs created by each of these registrations can then be directly compared to the ground-truth DVFs provided by the DIREP phantoms. Complete datasets were submitted by 35 institutions. The algorithms used by the submitting institutions are listed in Table 1. The companies behind these DIR algorithms do not typically publish detailed information on their algorithms to protect intellectual property. Aside from the information given in Table 1, these algorithms will be treated as a "black box" for the purposes of this study.
Using target registration error (TRE) as the figure of merit, the DVFs submitted by each institution were compared to the ground-truth DVFs for all ten phantoms. Registration accuracy was evaluated for six regions of interest (ROIs): brainstem, spinal cord, mandible, left parotid, right parotid, and external contour. The TRE relative to the ground truth was calculated for each voxel in each ROI. These values were summarized as a mean TRE across each region for each phantom, and the maximum deviation in each region was also recorded. This information was averaged across all ten phantoms to produce a set of summary statistics representing the overall accuracy of the deformable registration for each institution.
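In this framework, the per-voxel TRE is the Euclidean norm of the difference between the submitted and ground-truth displacement vectors. A minimal NumPy sketch of the summary statistics described above follows; the array layout and ROI mask are assumptions for illustration, not the DIREP file format.

```python
import numpy as np

def roi_tre_stats(dvf_submitted, dvf_truth, roi_mask):
    """Mean and maximum TRE (mm) over one ROI of one phantom.

    dvf_submitted, dvf_truth: (z, y, x, 3) displacement fields in mm.
    roi_mask: boolean (z, y, x) array selecting the ROI's voxels.
    """
    # Per-voxel TRE: distance between the two displacement vectors.
    error = np.linalg.norm(dvf_submitted - dvf_truth, axis=-1)
    return error[roi_mask].mean(), error[roi_mask].max()

# Per-institution summary: average the per-phantom means over all phantoms.
# Illustrative values only; the study used ten phantoms per institution.
phantom_means = [1.2, 0.9, 1.4, 1.0, 1.1]
print(f"Overall mean TRE: {np.mean(phantom_means):.2f} mm")
```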

2.C | Statistical analysis

2.C.1 | Inter-algorithm variability
To determine inter-algorithm variability, the summary statistics for each institution were grouped by the registration software used for the deformation. Welch's t-tests were then performed to compare the mean TRE for each pair of DIR algorithms, for each contoured region, testing the null hypothesis that any two algorithms have the same mean TRE.
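A sketch of this pairwise comparison using SciPy is shown below; the per-institution mean TRE values are placeholders, not the study data.

```python
from scipy.stats import ttest_ind

# Mean TRE (mm) per institution for one ROI, grouped by algorithm.
velocity = [2.1, 1.9, 2.3, 2.0, 2.2]
mim = [1.1, 1.3, 0.9, 1.2, 1.0]

# equal_var=False selects Welch's t-test, which does not assume the
# two algorithm groups share a common variance.
t_stat, p_value = ttest_ind(velocity, mim, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```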

2.C.2 | Inter-institutional variability
To evaluate the variability within each algorithm, the institutional data were again grouped by the DIR algorithm used. Because the summary statistics for an institution are averaged across the ten DIREP phantoms, a one-way ANOVA was performed to test the null hypothesis that the mean TRE within an algorithm, for each contour, was not institutionally dependent.
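A sketch of this test using SciPy's one-way ANOVA is shown below; each group holds one institution's per-phantom mean TREs, with illustrative values only.

```python
from scipy.stats import f_oneway

# Per-phantom mean TRE (mm) for three institutions using the same
# DIR algorithm. Illustrative values only.
inst_a = [1.0, 1.2, 1.1, 0.9, 1.3]
inst_b = [1.1, 1.0, 1.2, 1.1, 1.2]
inst_c = [1.4, 1.3, 1.5, 1.2, 1.6]

# Null hypothesis: all institutions share the same mean TRE.
f_stat, p_value = f_oneway(inst_a, inst_b, inst_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```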

3 | RESULTS

3.A | Summary statistics
Table 2 shows the summary statistics for all institutions submitting data using Velocity, MIM, and Eclipse. On average, MIM performed the registrations with the smallest error, with mean TRE values consistently smaller than those of the other two algorithms.
However, the maximum error for MIM was greater than that of the other two algorithms in regions that tend to fluctuate in overall volume over the course of treatment, in particular the two parotid contours.

3.C | Inter-institutional variability
Although MIM initially showed an overall user dependence in registration accuracy, there appeared to be a difference between the mean TRE produced by MIM versions 6.5 and earlier and that produced by versions 6.6 and later. Table 5 shows the mean and maximum TRE for each region for the MIM data grouped by version. Once the data were grouped in this way, Welch's t-tests were performed to test for a significant difference in mean TRE between the version groups. These data show a clear improvement in mean TRE beginning with version 6.6. The maximum error is also notably lower for versions 6.6 and later, especially for the mandible, parotid, and external contours. Furthermore, the variability found both before and after version 6.6 is lower than the overall variability found when the MIM data were pooled; the true variability in MIM output is therefore lower than that shown in Table 2.
One-way ANOVAs, shown in Table 6, were performed to test for user variability before and after this change in version. No user dependence in registration accuracy was found for the pre-6.6 or post-6.6 registrations when they were treated as separate groups. Consequently, the user dependence found with MIM is likely the result of a difference in the algorithm between MIM versions, rather than a dependence on the user of the software. One outlier in the MIM data, in the right parotid results, is examined in the Discussion.

FIG. 2. Mean TRE for each institution in this study. The error bars represent one standard deviation of the mean registration error across the ten phantoms. Note that the y-axis scale is 4 mm and that the mean TRE difference between institutions is typically <1 mm.

4 | DISCUSSION
For eight of the MIM institutions, the standard deviation of the registration error for the right parotid contour was much greater than was seen in the other ROIs. This was the result of a failure of the MIM algorithm within the contour of the right parotid for one of the DIREP phantoms. As shown in a previous study, 16 MIM was unable to reproduce the correct registration of the right parotid for Phantom 9. This investigation found mean and maximum registration errors similar to the previous study (greater than 6 and 20 mm, respectively) for the right parotid of Phantom 9 prior to MIM version 6.6. With version 6.6, the mean and maximum TRE for the right parotid of Phantom 9 were reduced to just over 3 mm and 10.9 mm, respectively. This finding emphasizes the changes that may occur, in this case improvements, as updates are made to DIR algorithms.
When comparing algorithm performance, Table 3 shows that the difference between Velocity and Eclipse was not significant for most ROIs. The exceptions are the cord and mandible, where one algorithm consistently performed better than the other. This is consistent with the data observed in Fig. 2.
While clinical patient data were used for the creation of the virtual phantoms, only ten phantoms were created and evaluated, which may not be representative of the anatomic changes seen in the entire population of head and neck radiotherapy patients. These results should not be extrapolated to other sites or situations (e.g., retreatments) without further research to confirm their validity. Furthermore, while the authors found no significant difference between users of the same DIR algorithm in this study, the use of advanced tools was not investigated. As vendors develop new DIR tools, such as the ability to refine a registration or contour-guided DIR, these tools may provide more options and greater differentiation between users.

5 | CONCLUSIONS
Three hundred and fifty registrations from 35 institutions were evaluated for DIR accuracy in this study. While the average error was shown to be <3 mm for all three software platforms, care should be exercised when using DIR because localized or maximum errors can be much greater. The authors found that one algorithm performed statistically better than the others, but that all algorithms typically achieved an average TRE smaller than the largest voxel dimension. For the relatively small displacements between registered images studied here, no significant inter-institutional difference was found between users of the same algorithm. This suggests that, for head and neck DIR within a treatment course, the algorithm used may have more impact on registration accuracy than a trained user's DIR technique.
A significant difference between versions of one of the algorithms was reported. This finding supports the TG-132 recommendation that registration algorithms should be tested upon upgrade. Unfortunately, the current DIR testing options are often time-consuming and limited to academic centers. To enable more frequent testing and the use of appropriate DIR margins, vendors should provide analysis tools to simplify testing for various sites and situations.

CONFLICT OF INTEREST
Dr. Pukala reports grants from Accuray, Inc., outside the submitted work. Dr. Langen reports personal fees from Varian Medical, outside the submitted work.

AUTHOR CONTRIBUTIONS
All listed authors made substantial contributions to this work, assisted in the drafting or review of this work, will have the opportunity to approve the final version to be published (if revisions are required), and agree to be accountable for all aspects of the work.