Assessment of the Sun Nuclear ArcCHECK to detect errors in 6MV FFF VMAT delivery of brain SABR using ROC analysis

Abstract Institutions use a range of different detector systems for patient‐specific quality assurance (QA) measurements conducted to assure that the dose delivered by a patient’s radiotherapy treatment plan matches the calculated dose distribution. However, the ability of different detectors to detect errors from different sources is often unreported. This study contains a systematic evaluation of Sun Nuclear’s ArcCHECK in terms of the detectability of potential machine‐related treatment errors. The five investigated sources of error were multileaf collimator (MLC) leaf positions, gantry angle, collimator angle, jaw positions, and dose output. The study encompassed the clinical treatment plans of 29 brain cancer patients who received stereotactic ablative radiotherapy (SABR). Six error magnitudes were investigated per source of error. In addition, the Eclipse AAA beam model dosimetric leaf gap (DLG) parameter was varied with four error magnitudes. Error detectability was determined based on the area under the receiver operating characteristic (ROC) curve (AUC). Detectability of DLG errors was good or excellent (AUC >0.8) at an error magnitude of at least ±0.4 mm, while MLC leaf position and gantry angle errors reached good or excellent detectability at error magnitudes of at least 1.0 mm and 0.6°, respectively. Ideal thresholds, that is, gamma passing rates, to maximize sensitivity and specificity ranged from 79.1% to 98.7%. The detectability of collimator angle, jaw position, and dose output errors was poor for all investigated error magnitudes, with an AUC between 0.5 and 0.6. The ArcCHECK device’s ability to detect errors from treatment machine‐related sources was evaluated, and ideal gamma passing rate thresholds were determined for each source of error. The ArcCHECK was able to detect errors in DLG value, MLC leaf positions, and gantry angle. The ArcCHECK was unable to detect the studied errors in collimator angle, jaw positions, and dose output.


1 | INTRODUCTION
Intensity-modulated radiation therapy (IMRT) and volumetric modulated arc therapy (VMAT) treatments still frequently rely on patient-specific QA measurements to ensure that the dose delivered to the detector using a patient's treatment plan matches the expected dose distribution as calculated by the treatment planning system (TPS). 1 These measurements can be performed with a variety of different detectors, including ionization chambers, diode arrays, radiochromic film, and portal imaging. [2][3][4] The QA workflow is detector-specific but, for systems like the ArcCHECK (Sun Nuclear Corporation, Melbourne, Florida, USA), it generally consists of re-calculating the dose delivered by a patient's treatment plan on the detector system and comparing it to a measurement to ensure accurate dose delivery. 5,6 Patient-specific QA is essential for patient safety, especially in the case of a complex treatment delivery technique such as stereotactic ablative body radiotherapy (SABR), which delivers a high radiation dose in only a single or a few fractions and involves tight margins and often complex targets and beam geometries. 7,8 Systems like the ArcCHECK are useful for QA of conventional IMRT and VMAT plans as well as SABR treatments. 9 A methodology commonly used for dose distribution comparisons is the gamma analysis method, which combines a distance-to-agreement (DTA) with a dose difference criterion to avoid inaccuracies in high-gradient and low-gradient regions, respectively. 10,11 Patient-specific QA procedures use a previously set gamma passing rate threshold (e.g., 95%) to determine whether a sufficient percentage of points on the measured dose distribution agrees with the calculation. 1 If this is not the case, the treatment plan fails the patient-specific QA procedure, and the treatment cannot proceed with the plan in question before the reason for the failure has been determined and it has been established whether there is a need to revise the treatment plan.
One of the shortcomings of reducing the gamma analysis results to a few metrics such as the passing rate is that such an approach does not allow the detector's ability to identify errors originating from different sources to be taken into account. 11 Receiver operating characteristic (ROC) analysis has previously been used to investigate a detector's ability to detect treatment machine variations during plan delivery. In the case of the TrueBeam linear accelerator (Varian Medical Systems, Palo Alto, California, USA), such sources of error include the jaws which determine the size of the treatment field, the multileaf collimator (MLC) which conforms the radiation to the target, and the angle of the gantry. 12 Studies using ROC analysis can fully evaluate the capabilities of a detector, including defining its rate of false positives and false negatives, that is, a detector wrongfully marking a plan as passing or failing because of its inability to accurately detect certain errors. 13,14 ROC curves are particularly useful for evaluating detector performance because they are independent of biases in the decision threshold which determines whether a plan passes or fails the QA procedure. 15 Examples of ROC-based error detectability studies include research by Carlone et al., 15 McKenzie et al., 16 Bojechko & Ford, 17 Nithiyanantham et al., 18 Liang et al., 19 Sjölin & Edmund, 20 Maraghechi et al., 21 and Scarlet. 22 However, combining the findings even of studies investigating the same detector can prove difficult because of limitations such as a small dataset, no differentiation between different treatment sites or delivery techniques, or some sources of error not having been studied.
This study seeks to expand upon the aforementioned works by conducting a complete and systematic evaluation of the performance limits of a single detector, namely Sun Nuclear's ArcCHECK, in terms of its ability to detect expected machine-related treatment errors in a set of brain VMAT SABR treatment plans using a 6 MV flattening filter free (6FFF) beam. This study was based on the clinical treatment plans of patients treated with brain SABR, as these plans require particularly high precision and accuracy. The data set included the original treatment plans of 29 patients who received brain SABR at BC Cancer Kelowna.
These clinical plans used 6 MV or the 6FFF mode of the Varian TrueBeam system. Results for the 6FFF beam were of particular interest because its increased dose rates can shorten the treatment time, which is beneficial when treatment fields are small and high doses are required. 23,24 The clinical plans that used 6 MV were therefore re-planned using the 6FFF mode for this study.
An in-house tool was used to anonymize all patient data, and the Varian Eclipse (V13) analytical anisotropic algorithm (AAA) was used to calculate the dose distributions delivered to the detector. 25,26 All treatment plans were delivered to Sun Nuclear's ArcCHECK, which is a cylindrical polymethyl methacrylate (PMMA) phantom of a diameter of 21 cm with an array of 1386 SunPoint diodes on its surface. 27 The same TrueBeam system and ArcCHECK detector were used for all measurements to prevent slight differences between different machines from influencing the results, and for a given source of error, all versions of a treatment plan were measured in the same session to avoid variations in the detector set-up.

2.B | Gamma analysis
The calculated dose distributions were compared to the dose distributions measured on the surface of the ArcCHECK using the gamma analysis approach as implemented in Version 6.2.3 of Sun Nuclear's SNC Patient software. 10,28 All studies were repeated for three different sets of criteria: 2%/2 mm, 2%/1 mm, and 4%/1 mm. The 2%/2 mm criteria were chosen in accordance with the planning target volume (PTV) margin of 2 mm, while the other two criteria were added to study the effects of variations in the dose difference or the DTA criterion. A threshold of 10% was used below which dose values were disregarded.
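For illustration, the gamma comparison described above can be sketched for a one-dimensional dose profile. This is a minimal sketch and not the SNC Patient implementation: the function name `gamma_passing_rate`, the global normalization of the dose difference criterion to the maximum calculated dose, and the application of the 10% threshold relative to that same maximum are all assumptions made here.

```python
import numpy as np

def gamma_passing_rate(x_mm, measured, calculated,
                       dose_pct=2.0, dta_mm=2.0, low_dose_cutoff_pct=10.0):
    """Global 1D gamma passing rate (in %) of measured points against a calculated profile."""
    x_mm = np.asarray(x_mm, dtype=float)
    measured = np.asarray(measured, dtype=float)
    calculated = np.asarray(calculated, dtype=float)

    d_max = calculated.max()
    # Assumption: dose criterion expressed as a percentage of the maximum calculated dose.
    dose_tol = dose_pct / 100.0 * d_max
    # Assumption: measured points below the cutoff (relative to d_max) are disregarded.
    keep = measured >= low_dose_cutoff_pct / 100.0 * d_max

    # Distance and dose difference between every kept measured point
    # and every calculated point.
    dist = x_mm[keep, None] - x_mm[None, :]
    diff = measured[keep, None] - calculated[None, :]

    # Gamma index: minimum combined (distance, dose) deviation per measured point.
    gamma = np.sqrt((dist / dta_mm) ** 2 + (diff / dose_tol) ** 2).min(axis=1)
    return 100.0 * np.mean(gamma <= 1.0)

# Synthetic Gaussian profile (illustrative only, not study data).
x = np.arange(-50.0, 50.5, 0.5)
calc = np.exp(-x**2 / (2 * 10.0**2))
shifted = np.exp(-(x - 1.0)**2 / (2 * 10.0**2))  # 1 mm spatial shift
for dose_pct, dta_mm in [(2.0, 2.0), (2.0, 1.0), (4.0, 1.0)]:
    print(dose_pct, dta_mm, gamma_passing_rate(x, shifted, calc, dose_pct, dta_mm))
```

The search here is restricted to the calculation grid without interpolation, so a fine grid spacing relative to the DTA criterion is needed for meaningful results.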

2.C | Determination of the consensus optimal dosimetric leaf gap value
The Eclipse AAA model uses a dosimetric leaf gap (DLG) parameter to model the leakage through the curved edges of the MLC leaves. 29,30 However, the clinically-used DLG value is determined for a broad set of patients and treatment sites, and differences of up to 0.8 mm between the clinical and the plan-specific optimal DLG value have been reported. 22 Due to the high-precision requirements for the clinical 6FFF beam, the optimal DLG value for brain SABR treatment planning had to be determined.
Nine representative brain SABR treatment plans were delivered to a cylindrical ionization chamber (Scanditronix Wellhöfer Dosimetrie, Schwarzenbruck, Germany), EBT3 Gafchromic film (Ashland Inc., Covington, Kentucky, USA), and the ArcCHECK to determine as accurate an optimal DLG value as possible. For the ionization chamber, the difference between the measured dose and the calculated dose was plotted as a function of the DLG value and the position of the minimum difference was defined as the optimal DLG value. For film and the ArcCHECK, the optimal DLG value was defined as the value that maximized the gamma passing rates. 31,32 All three methods yielded the same consensus optimal DLG value of 1.47 mm. This value was very close to the clinical value used at our institution, which is 1.40 mm.
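The two selection rules described above (minimum dose difference for the ionization chamber, maximum gamma passing rate for film and the ArcCHECK) can be sketched as follows. This is a minimal sketch that only picks the best value among the tested DLG values; the function names and all numbers are hypothetical and are not data from this study.

```python
import numpy as np

def optimal_dlg_from_dose_difference(dlg_mm, dose_diff_pct):
    """Tested DLG value at which |measured - calculated| is smallest."""
    dlg_mm = np.asarray(dlg_mm, dtype=float)
    return float(dlg_mm[np.argmin(np.abs(dose_diff_pct))])

def optimal_dlg_from_passing_rate(dlg_mm, passing_rate_pct):
    """Tested DLG value at which the gamma passing rate is highest."""
    dlg_mm = np.asarray(dlg_mm, dtype=float)
    return float(dlg_mm[np.argmax(passing_rate_pct)])

# Hypothetical illustrative values (not measurements from this study).
dlg_values = [1.2, 1.3, 1.4, 1.5, 1.6]          # mm
chamber_diff = [-1.1, -0.6, -0.2, 0.1, 0.7]     # % dose difference
arccheck_rates = [93.0, 96.5, 98.0, 98.9, 95.2]  # % gamma passing rate

print(optimal_dlg_from_dose_difference(dlg_values, chamber_diff))
print(optimal_dlg_from_passing_rate(dlg_values, arccheck_rates))
```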

2.D | Implementation of machine-related treatment errors
The investigated sources of error were the DLG value, the MLC leaf positions, the gantry angle, the collimator angle, the jaw positions, and the dose output. The latter five were investigated because they were specifically mentioned as potential sources of error in the specifications of Varian's TrueBeam system, while the former was included because the plan-specific optimal DLG value is known to commonly differ from the value used in the clinical context, which is determined for and applied to a vast range of treatment sites. 12,33 Simultaneous errors from different sources lay outside the scope of this study because of the sheer number of possible permutations and because such studies would not help quantify the ArcCHECK's limits with respect to the detectability of errors from a given source.

2.D.3 | Collimator angle errors
To determine the detectability of collimator angle errors, random errors in the ranges 0.00° to ±0.25°, ±0.25° to ±0.50°, ±0.50° to ±0.75°, ±0.75° to ±1.00°, ±1.00° to ±1.25°, and ±1.25° to ±1.50° were introduced into the six sets of modified treatment plans. Collimator angle errors were forced to be within a range rather than being completely random because the low number of modifiable col- Beam specifications, which state a worse positional accuracy for the upper than the lower jaw. 12 As in the case of the collimator angle, jaw position errors were forced to be in a range rather than being completely random to assure that errors of the studied magnitudes were actually introduced into the treatment plans.
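Drawing random errors that are forced into a magnitude band, as described above, can be sketched as follows. This is a minimal sketch: the uniform distribution within each band, the equal probability of positive and negative signs, and the function name `sample_banded_errors` are assumptions, since the sampling procedure is not specified in more detail here.

```python
import numpy as np

def sample_banded_errors(n, lower, upper, rng):
    """Draw n signed errors whose magnitudes lie in [lower, upper)."""
    magnitude = rng.uniform(lower, upper, size=n)  # assumed uniform within the band
    sign = rng.choice([-1.0, 1.0], size=n)         # assumed equal chance of either sign
    return sign * magnitude

# The six collimator angle error bands (in degrees) listed in the text.
bands_deg = [(0.00, 0.25), (0.25, 0.50), (0.50, 0.75),
             (0.75, 1.00), (1.00, 1.25), (1.25, 1.50)]

rng = np.random.default_rng(0)
for lower, upper in bands_deg:
    errors = sample_banded_errors(4, lower, upper, rng)
    print(f"{lower:.2f} to {upper:.2f} deg:", np.round(errors, 3))
```

Forcing the magnitude into a band guarantees that every modified plan in a set actually carries an error of the studied size, which a single unconstrained draw from 0 to the maximum would not.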

2.D.5 | DLG errors
As the dosimetric leaf gap is solely a TPS parameter, determining the detectability of DLG value errors did not require additional measurements. Instead, four additional dose calculations were run for every unmodified treatment plan. These calculations used DLG values deviating from the previously determined consensus optimal DLG value by −0.4 mm, −0.2 mm, +0.2 mm, and +0.4 mm. These error magnitudes were deemed to be realistic because they were in line with DLG value errors which have been reported previously. 22

2.D.6 | Dose output errors
Dose output errors did not require further measurements either because their detectability was approximated using changes in the "dose per count" value of the *.txt file created by the measurement of an unmodified treatment plan. The six different maximum possible error magnitudes studied were 0.25%, 0.50%, 0.75%, 1.00%, 1.25%, and 1.50%. The mean values and standard deviations for the resulting distributions were 0.16% ± 0.08%, 0.36% ± 0.07%, 0.62% ± 0.08%, 0.85% ± 0.08%, 1.13% ± 0.06%, and 1.37% ± 0.07%.

2.E | Receiver operating characteristic curves
Receiver operating characteristic curves are analytical tools for the evaluation of a diagnostic test which outputs binary results. 36 value, that is, the value at which the distance between the ROC curve and the point of perfect sensitivity and specificity (0,1) was minimal, was determined for all sources of error which exhibited sufficient detectability. 15 The ways in which the gold standard and the evaluated data set were defined for the different sources of error are shown in Table 1.
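The ROC construction, the AUC, and the threshold closest to the point of perfect sensitivity and specificity (0,1) can be sketched as follows. This is a minimal sketch under a simplifying assumption: each plan is summarized by a single gamma passing rate and is flagged as erroneous when that rate falls below the threshold. The definitions actually used for the reference and evaluated data sets are those of Table 1 and may differ; the function name and the example values are hypothetical.

```python
import numpy as np

def roc_from_passing_rates(reference_rates, error_rates):
    """Return (fpr, tpr, auc, best_threshold) for two sets of gamma passing rates.

    A plan is flagged as containing an error when its passing rate is
    strictly below the threshold (an assumption of this sketch).
    """
    reference_rates = np.asarray(reference_rates, dtype=float)
    error_rates = np.asarray(error_rates, dtype=float)

    # Candidate thresholds: every observed rate, padded so that the curve
    # starts at (0, 0) (nothing flagged) and ends at (1, 1) (everything flagged).
    observed = np.unique(np.concatenate([reference_rates, error_rates]))
    thresholds = np.concatenate([[-np.inf], observed, [np.inf]])

    tpr = np.array([(error_rates < t).mean() for t in thresholds])      # sensitivity
    fpr = np.array([(reference_rates < t).mean() for t in thresholds])  # 1 - specificity

    # Area under the curve by the trapezoidal rule.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

    # Threshold whose ROC point lies closest to (0, 1).
    distance = np.hypot(fpr, 1.0 - tpr)
    best_threshold = float(thresholds[np.argmin(distance)])
    return fpr, tpr, auc, best_threshold

# Hypothetical passing rates in % (not study data).
reference = [99.1, 98.4, 97.9, 99.6, 98.8]
with_error = [96.2, 98.1, 94.7, 97.9, 95.5]
_, _, auc, threshold = roc_from_passing_rates(reference, with_error)
print(auc, threshold)
```

An AUC of 0.5 corresponds to no discrimination between plans with and without errors, and an AUC of 1.0 to perfect separation.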

3.A | DLG value errors
The results regarding the detectability of DLG value errors are depicted in Fig. 1. To emphasize how differences in the plan-specific optimal DLG values influenced the detectability of DLG errors, Fig. 2 depicts the gamma passing rate as a function of the DLG error for two cases: one in which the plan-specific optimal DLG value was equal to the consensus value, and one in which this was likely not the case.
On a single-plan basis and at low error magnitudes, error detectability can roughly be approximated by the difference between the gamma passing rate at the gold standard and at the error magnitude in question. In cases like the one shown in Fig. 2(a), in which the consensus value was equal to the plan-specific optimal DLG value, the gamma passing rates at error magnitudes of −0.2 mm and +0.2 mm were approximately equal. Such cases generally contributed to a similar detectability of negative and positive DLG errors of the same magnitude.
In the case shown in Fig. 2(b

3.B | MLC leaf position errors
The results of the MLC leaf position error detectability study are shown in Fig. 3. The data exhibited variations in the detectability of MLC leaf position errors of 0.75 mm or lower, especially for the 2%/1 mm and 4%/1 mm criteria. This was again caused by the planning system modeling of the MLC leaf ends, which uses a single DLG value.
TABLE 1. Definitions of reference (gold standard) and evaluated gamma indices. The ways in which the reference (gold standard) and evaluated sets of gamma indices from which the ROC curves were created were defined for the different sources of error.

Source of error Evaluated gamma indices Reference gamma indices
FIG. 1. The area under the ROC curve as a function of the error introduced into the dosimetric leaf gap value. The data points on the left, in the middle, and on the right of each column denote the 2%/1 mm, 2%/2 mm, and 4%/1 mm criterion, respectively. Error bars indicate the standard error.
FIG. 2. The gamma passing rate (2%/2 mm) as a function of the error introduced into the dosimetric leaf gap value for a case for which the plan-specific optimal DLG value was equal to the consensus optimal DLG value determined for a representative set of nine treatment plans (a) and a case for which the plan-specific optimal DLG value likely differed from the consensus value by −0.2 mm (b). The dotted lines indicate the gamma passing rate of the gold standard.

To elaborate on this point, Fig. 4 depicts the gamma passing rate as a function of the maximum MLC leaf position error for a case in which the plan-specific DLG value was equal to the consensus value and an example in which this was likely not the case. The leakage through the curved MLC leaf edges is simulated by the position of every MLC leaf being retracted by half of the DLG value. 33 In cases like the one depicted in Fig. 4(a), in which the plan-specific optimal DLG value was equal to the consensus value, the highest gamma passing rate was reached when the plan was unmodified. In such cases, the gamma passing rate decreased with increasing MLC leaf position errors, contributing to a higher AUC and better error detectability at higher MLC leaf position error magnitudes.
In cases like the one depicted in Fig. 4(b), in which the ArcCHECK measurement suggested that the plan-specific optimal DLG value differed from the consensus value, the gamma passing rate was not necessarily highest when the plan was unmodified. Instead, plans with errors in the MLC leaf positions were able to match the dose distribution calculated using a DLG value that was likely not the plan-specific optimal value more closely. In such cases, the gamma passing rate peaked when MLC leaf position errors were

3.C | Gantry angle errors
The results regarding the detectability of gantry angle errors are shown in Fig. 5. In this case, the expected trend of the area under the ROC curve increasing with higher gantry angle errors was observed for all criteria, with a generally good detectability in cases with a maximum possible gantry angle error of at least 0.6°.

3.D | Collimator angle, jaw position, and output errors
For errors in the collimator angle, jaw positions, and output, detectability was poor for all criteria and all investigated error magnitudes, with an AUC around or below 0.6 in all cases.

3.E | Optimal threshold values
For the sources of error for which the ArcCHECK exhibited an ability to detect errors of a given magnitude, namely the DLG value, the MLC leaf positions, and the gantry angle, the optimal threshold values (i.e., gamma passing rates) for all investigated error magnitudes and criteria are shown in Table 2. For the 2%/2 mm, the 2%/1 mm, and the 4%/1 mm criterion, the ideal threshold for errors in

FIG. 3. The area under the ROC curve as a function of the error introduced into the multileaf collimator leaf positions. The data points on the left, in the middle, and on the right of each column denote the 2%/1 mm, 2%/2 mm, and 4%/1 mm criterion, respectively. Error bars indicate the standard error.
FIG. 4. The gamma passing rate (2%/2 mm) as a function of the multileaf collimator leaf position error for two cases: (a) a plan for which the plan-specific optimal DLG value was equal to the consensus optimal DLG value determined for a representative set of nine treatment plans and (b) a plan for which the plan-specific optimal DLG value likely differed from the consensus value. The dotted lines indicate the gamma passing rate of the gold standard.

4.A | DLG value errors
The asymmetry in the results regarding DLG error detectability was due to the Eclipse AAA model using a single DLG value to model the MLC leaf ends. This is a deficiency in the MLC modeling of the AAA algorithm and can be corrected by determining a plan-specific DLG value for each plan, but doing so would be infeasible in the clinic. Due to some plan-specific optimal DLG values likely being lower than the determined consensus optimal DLG value, the detectability of DLG errors of −0.2 mm was poor. This aspect also caused DLG errors of −0.4 mm and +0.2 mm to exhibit a similar level of detectability, which was decent to good. Only DLG errors of +0.4 mm were detected excellently. This is to be considered in light of the magnitude of realistic DLG errors, and differences of up to 0.8 mm between the clinically used and the plan-specific optimal DLG value have been reported. 22 The ArcCHECK device is therefore able to detect medium to high DLG errors which may realistically be encountered in the clinical context. Detecting such errors in a number of cases may indicate that the clinical DLG value used is inaccurate for the cases to which it is applied and needs to be corrected.

4.B | MLC leaf position errors
The detectability of MLC leaf position errors of 0.75 mm and smaller was highly dependent on how close the plan-specific optimal DLG value was to the optimal DLG value used in the calculation. Since

4.C | Gantry angle errors
The detectability of gantry angle errors followed the expected trend of improving with increasing gantry angle error magnitudes, and the data for all three investigated criteria showed good agreement. For

TABLE 2. The determined optimal threshold values for errors in the DLG value, the MLC leaf positions, and the gantry angle as a function of error magnitude.

The ArcCHECK's advantage over systems like Delta 4 (ScandiDos, Uppsala, Sweden) and an EPID in terms of gantry angle error detectability has previously been reported in a study based on a set of VMAT plans for head and neck patients. 19 In the aforementioned study, the AUC of 0.78 for a gantry angle error magnitude of 1° was still associated with good error detectability but was lower than the error detectability determined as part of this work. The differences between the results of the two studies could be explained by factors such as the different treatment sites, the different types of treatment, and the different ways in which the errors were implemented, amongst others. For example, the error magnitudes in the VMAT study were a function of the gantry angle while the gantry angle errors investigated for this study were random.
Whether higher magnitudes of gantry angle errors are realistic is questionable. The specifications of Varian's TrueBeam system, for instance, state a rotational gantry accuracy of ≤0.3°, an error magnitude which the ArcCHECK would not be able to detect. 12 However, if a different delivery system were used or larger gantry angle errors were anticipated for other reasons, the ArcCHECK may be able to detect gantry angle errors relatively well.

4.D | Collimator angle errors
For all criteria, the ArcCHECK's ability to detect collimator angle errors of any of the studied magnitudes was poor. This is true despite the introduced collimator angle errors having been forced to be within a range to assure that errors of the studied magnitudes were actually introduced into the treatment plans. Because of the magnitude of the standard error, the small differences between the AUC values at different collimator angle error magnitudes were negligible.
The ArcCHECK's perceived inability to detect collimator angle errors was hinted at by a previous study, which showed that a collimator angle error of 1° only changed the gamma passing rate of a brain and a head and neck VMAT treatment plan by 0.3% and 1.6%, respectively, when a 2%/2 mm criterion was used. 40 Since systems like Varian's TrueBeam claim a rotational accuracy of ≤0.5° for the collimator and collimator angle error magnitudes of up to 1.5° were investigated, the ArcCHECK is unable to detect collimator angle errors of the magnitudes one may generally expect to encounter. 12

4.E | Jaw position errors
For all considered criteria, the detectability of jaw position errors of all investigated magnitudes was also poor. Once again, the small differences between data points at the different error levels were negligible compared to the size of the standard error. The ArcCHECK's poor detectability of jaw position errors has also been indicated by a previous study, which introduced an error of 3 mm into the Y1 jaw position of a brain and a head and neck VMAT treatment plan and only reported gamma passing rate decreases of 0.1% and 0.0%, respectively, when a 2%/2 mm criterion was used. 40 The specifications of Varian's TrueBeam system suggest an upper jaw positional accuracy of ±2 mm and a lower jaw positional accuracy of ±1 mm for static fields. 12 The highest investigated error magnitudes of ±3 mm and ±1.5 mm, respectively, exceeded these values, and errors in the upper and lower jaw were investigated together, meaning that the highest jaw position error level corresponded to the worst-case scenario regarding the accuracy of both the upper and lower jaw position. Despite these considerations, none of the criteria suggested even decent detectability at any error level. The ArcCHECK's ability to detect realistic jaw position errors in either jaw in the studied brain SABR treatment plans can therefore be regarded as being poor.

4.F | Dose output errors
Independently of the criterion used, the ArcCHECK was not able to detect dose output errors of any of the investigated magnitudes, with an AUC of approximately 0.6 or lower at all error magnitudes, even though the highest such errors were larger than the uncertainty of systems such as Varian's TrueBeam. 12 Dose output errors were simulated through modifications of the measurement files rather than being investigated through measurements of modified treatment plans. Prior to choosing this approach, sample measurements confirmed that the scaling of the "dose per count" value was equivalent to the measurement of a modified treatment plan, but comparisons to confirm this were not run for all 29 cases included in this study. However, the highest dose output error magnitude studied was 1.5%, whereas the dose difference criteria used were 2%, 2%, and 4%. As the dose output error magnitude was always within the dose difference criterion, the detectability of the studied dose output errors was not necessarily expected to be good, even though such errors affect the entire dose distribution. This is also in line with the results of a previous study, which reported the ArcCHECK's inability to detect even output errors of 5% in a set of VMAT head and neck treatment plans. 19 The same study also showed that Delta 4 and an EPID were equally unable to detect the same output errors. It was therefore concluded that the ArcCHECK does not detect output errors of up to 1.5% with any reliability when using the studied criteria.

4.G | Clinical implications
The ArcCHECK's capabilities with respect to the detectability of errors from different sources as determined by this study constitute its limits rather than what would necessarily be expected to be observed in clinical practice at every institution. This is because the gold standard dose calculations made use of the optimal DLG value determined specifically for brain SABR treatment plans rather than a compromise value, which is often used clinically. This approach was chosen because it allowed the ArcCHECK's limits with regard to error detectability to be established, rather than yielding results which are strictly dependent on the accuracy of the DLG value used at a given institution. The optimal DLG value determined as part of this study differed from the clinical value of 1.40 mm by only 0.07 mm, compared to deviations of up to 0.80 mm reported elsewhere. 22 The ArcCHECK's ability to detect errors in the clinical context of the institution at which the study was conducted would therefore be expected to be similar to the limits established in this study. If the deviation between the clinical and the optimal DLG value was larger, however, the ArcCHECK would be expected to exhibit poorer error detectability. Error detectability was generally found to be consistent for all three investigated gamma analysis criteria. The criteria used at a given institution are therefore generally not expected to affect error detectability. The ideal threshold values (i.e., the gamma passing rates maximizing sensitivity and specificity when analyzing 6FFF brain SABR plans) determined as part of this study may be used to improve the analysis of detector measurements.

5 | CONCLUSION
Of the investigated machine-related sources of error, the ArcCHECK Using a generalized DLG parameter in the underlying dose calculations is expected to negatively affect error detectability. Ideal threshold values (i.e., gamma passing rates) which may be used to optimize the analysis of detector measurements were also determined.

ACKNOWLEDGMENTS
The authors thank John Wolters, Aylin Yar-Uyaniker, and Jose Zayas for the equipment-related instructions.

CONFLICTS OF INTEREST
No conflicts of interest.

AUTHOR CONTRIBUTION
Because of his previous experience in the area, M.C. conceived the idea behind the work. All authors contributed to its design and the interpretation of the data. Measurements were conducted and the resulting data were analyzed by S.T. after extensive instruction by C.A. All authors contributed to the drafting of the work and subsequent revisions and approved the final version prior to submission.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.