Quantified VMAT plan complexity in relation to measurement‐based quality assurance results

Abstract

Volumetric-modulated arc therapy (VMAT) treatment plans that are highly modulated or complex may result in disagreements between the planned dose distribution and the measured dose distribution. This study investigated established VMAT complexity metrics as a means of predicting phantom-based measurement results for 93 treatments delivered on a TrueBeam linac and 91 treatments delivered on two TrueBeam STx linacs. The complexity metrics investigated showed weak correlations to gamma passing rate, with the exception of the Modulation Complexity Score for VMAT, which yielded moderate correlations. The Spearman's rho values for this metric were 0.502 (P < 0.001) and 0.528 (P < 0.001) for the TrueBeam and TrueBeam STx, respectively. Receiver operating characteristic analysis was also performed. The aperture irregularity on the TrueBeam achieved a 53% true-positive rate and a 9% false-positive rate in correctly identifying complex plans. Similarly, the average field width on the TrueBeam STx achieved a 60% true-positive rate and an 8% false-positive rate. If incorporated into the clinical workflow, these thresholds can identify highly modulated plans and reduce the number of dose verification measurements required.

complex treatments, and consequently increased uncertainty in delivery. [3][4][5][6][7][8][9][10][11][12] Complexity metrics can be used to characterize a treatment plan based on the parameters of the machine used as well as the properties of the treatment plan, such as fluence, MLC positions, gantry speed, and dose rate variations. Based on the sources of modulation, complexity metrics can be broadly categorized as fluence map-based metrics and aperture-based metrics. 13 Fluence map-based metrics consider the resulting fluence from a given beam or plan. However, these metrics are insensitive to the degeneracy of fluence maps. For example, a fluence map can be the result of a single large beam, or the sum of many small field beams.
While the latter may be more mechanically demanding on the linac, a fluence map-based metric may not always distinguish between these situations. 5 Aperture-based metrics generally focus on variations of the MLC positions during delivery. 13 These metrics can be used to describe the variations in the mechanical and dosimetric machine parameters, noted as deliverability metrics by Chiavassa et al. 3 Conversely, the MLC positions alone can be used to describe plan parameters that are likely to compromise accurate dose calculation in the treatment planning system, 3 or to result in disagreements between the treatment planning system and the delivered plan. 13 This study investigated the use of aperture-based complexity metrics as PTQA tools with consideration to the recommendations made by Miften et al. 1 and Chan et al. 14

2.A | Treatment plans

Table 1 describes the distribution of plans investigated by general treatment site. All treatments considered for this study were randomly selected, clinically approved plans delivered using coplanar beams and generated in Pinnacle3 (Version 9.10, Philips) using a collapsed cone convolution algorithm. Treatments were delivered at an angular gantry separation of 2°, a maximum dose rate of 600 MU/min, and a nominal energy of 6 MV.
Dose verification measurements and analysis followed recommendations made by Miften et al. 1 The measurements were performed using the IBA MatriXX Evolution ion chamber array with a spatial resolution of 7.6 mm to produce 2D planar dose measurements. The detector array was placed in a central cavity of an in-house polystyrene phantom along the coronal plane. Using the true composite setup, the phantom and detector remained stationary without rotation during measurements. An inclinometer fixed to the linac gantry head was used to correct for the angular dependence of the response of individual ion chamber detectors. Linac output variation was accounted for by delivering 200 MUs on a 10 × 10 cm² field before and after measurements. Isocenter shifts were made as deemed necessary to best represent the clinically relevant regions.
The OmniPro ImRT software was used to record measurements and compare the measured dose planes to the Pinnacle3 calculated dose planes via gamma index analysis.

2.B | Complexity metrics
The degree of complexity of VMAT treatment plans was evaluated using previously established complexity metrics. [3][4][5][6][8][9][10][11][12] Metrics were selected from those reported to have statistically significant correlations to quality assurance results in previous works, with an emphasis on those describing MLC behavior of the treatment. The following measures were considered:

1. MU Factor, defined as the ratio of the total monitor units to the prescribed dose in cGy. 4
2. Aperture Irregularity (AI), which describes the aperture shape in relation to a circle. 5 Irregularly shaped apertures, including off-central axis fields and small leaf gaps, may be more mechanically demanding for the linac to deliver as intended.
3. Modulation Complexity Score for VMAT (MCSv), which combines aperture area variability and leaf sequence variability across control points; lower values indicate more complex plans.
4. Small Aperture Score (SAS), the fraction of open leaf pairs separated by less than a set distance.
5. Average field width, the average width of the beam aperture over the course of delivery.
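As an illustration, the first two metrics can be computed directly from plan parameters. The following is a minimal sketch, assuming per-aperture monitor units, perimeters, and areas have already been extracted from the planning files; the function names are illustrative and not taken from the study's in-house script.

```python
import math

def mu_factor(total_mu, prescribed_dose_cgy):
    """MU Factor: ratio of total monitor units to the prescribed dose in cGy."""
    return total_mu / prescribed_dose_cgy

def aperture_irregularity(apertures):
    """MU-weighted aperture irregularity.

    Each aperture is a (mu, perimeter, area) tuple in consistent units.
    The per-aperture term perimeter**2 / (4 * pi * area) compares the
    aperture shape to a circle: 1.0 for a circle, larger for irregular
    shapes.
    """
    weighted = sum(mu * p ** 2 / (4 * math.pi * a) for mu, p, a in apertures)
    total_mu = sum(mu for mu, _, _ in apertures)
    return weighted / total_mu
```

For a circular aperture the irregularity term evaluates to 1.0, and a square aperture gives 4/π ≈ 1.27, so values well above 1 flag elongated or jagged MLC openings.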

2.C | Quality assurance analysis
Gamma index analysis was performed at the 3%/2 mm and 2%/2 mm dose difference and distance-to-agreement criteria, with a 10% low-dose threshold and global dose normalization. A tolerance limit of a 95% gamma passing rate (GPR) was used to distinguish plans that may be more likely to have dose disagreements between measurement and TPS calculation. Plans with GPRs above the tolerance limit are considered to pass, whereas plans with GPRs below the tolerance limit are considered to fail. The measured dose distribution was captured at the 7.6 mm spacing of the detector and was the reference distribution for gamma analysis. The dose distribution from Pinnacle3 was calculated at a resolution of 2.5 mm in all dimensions, and cubic spline interpolation was applied to yield a resulting spatial resolution of 0.5 mm in all dimensions to improve the gamma calculation accuracy. 1 The interpolated planned dose distribution was used as the evaluated dose distribution in gamma analysis.
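The gamma comparison described above can be sketched in brute-force form. This is an illustrative implementation of the global gamma index with a low-dose threshold, not the OmniPro ImRT algorithm, and it assumes the evaluated (planned) distribution has already been interpolated to its fine grid.

```python
import numpy as np

def gamma_pass_rate(ref, ref_xy, ev, ev_xy, dd=0.02, dta=2.0, low_thresh=0.10):
    """Brute-force global gamma passing rate (percent) for 2D dose planes.

    ref, ev: dose arrays; ref_xy, ev_xy: corresponding (N, 2) point
    coordinates in mm. dd is the dose-difference criterion as a fraction
    of the global maximum reference dose, dta the distance-to-agreement
    in mm, and low_thresh the low-dose threshold fraction.
    """
    dmax = ref.max()
    norm_dose = dd * dmax                       # global normalization
    ev_dose = ev.ravel()
    gammas = []
    for r_dose, r_pt in zip(ref.ravel(), ref_xy):
        if r_dose < low_thresh * dmax:          # skip low-dose points
            continue
        dist2 = np.sum((ev_xy - r_pt) ** 2, axis=1) / dta ** 2
        dose2 = (ev_dose - r_dose) ** 2 / norm_dose ** 2
        gammas.append(np.sqrt(np.min(dist2 + dose2)))
    gammas = np.array(gammas)
    return 100.0 * np.mean(gammas <= 1.0)
```

In practice the evaluated grid must be much finer than the reference grid (hence the 0.5 mm interpolation above), otherwise the discrete minimum overestimates gamma.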
Internal treatment planning system files containing plan parameters were used to determine complexity metrics for each treatment.
An in-house Python script was used to calculate complexity metrics from planning files as well as to perform statistical analysis. Spearman's rank correlation coefficient (r_s) was determined for each pair of GPR and complexity metric to test for the existence of correlations. Strong correlations are indicated as |r_s| ≥ 0.7, moderate as 0.7 > |r_s| ≥ 0.5, weak as 0.5 > |r_s| ≥ 0.3, and no correlation as 0.3 > |r_s|. Statistical significance of a correlation was taken by a two-tailed P value at P < 0.001.
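The correlation test and the strength bands above can be sketched as follows. This toy implementation ignores tied ranks; a production analysis would use scipy.stats.spearmanr, which handles ties and also supplies the two-tailed P value.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Illustrative only -- no tie handling.
    """
    rx = np.argsort(np.argsort(x))   # 0-based ranks of x
    ry = np.argsort(np.argsort(y))   # 0-based ranks of y
    return np.corrcoef(rx, ry)[0, 1]

def strength(rho):
    """Categorize |rho| using the bands stated in the text."""
    a = abs(rho)
    if a >= 0.7:
        return "strong"
    if a >= 0.5:
        return "moderate"
    if a >= 0.3:
        return "weak"
    return "none"
```

Under this scheme, the MCSv values reported in the abstract (0.502 and 0.528) fall in the "moderate" band.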
Receiver operating characteristic (ROC) curves were produced to determine whether complexity metrics can identify treatment plans with GPRs below the tolerance limit. For each complexity metric, the threshold value used to categorize a given plan as a pass or a fail was varied to determine the true-positive and false-positive rates used in the ROC curves, where a positive result is a failing plan. A true positive is then defined as a plan with a complexity value less than a given threshold value and a GPR below the tolerance limit. Similarly, a false positive is defined as a plan with a complexity value less than a given threshold value but a GPR above the tolerance limit.
For example, the MCSv is defined to indicate more complex plans with lower values. With a threshold value of 0.4, a treatment plan with an MCSv of 0.3 and a GPR below the tolerance limit will be considered a true positive occurrence. Similarly, this would be considered a false positive occurrence if the same plan yielded a GPR above the tolerance limit. The MU Factor, AI, and SAS are defined to indicate complex plans at higher values, and require that the complexity value be greater than the threshold value to indicate a plan with a GPR below tolerance as a true positive.
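The counting rule described above can be sketched as a small routine that sweeps candidate thresholds and respects the direction of each metric. The function and parameter names are illustrative.

```python
import numpy as np

def roc_points(values, gpr, tol=95.0, lower_is_complex=False):
    """Return (false-positive rate, true-positive rate) pairs over thresholds.

    A positive is a plan failing QA (GPR < tol). For metrics such as the
    MCSv, lower_is_complex=True flags plans BELOW the threshold; for the
    MU Factor, AI, and SAS, plans ABOVE the threshold are flagged.
    """
    values = np.asarray(values, float)
    fails = np.asarray(gpr, float) < tol
    points = []
    for t in np.unique(values):
        flagged = values < t if lower_is_complex else values > t
        tp = np.sum(flagged & fails)        # flagged and failing QA
        fp = np.sum(flagged & ~fails)       # flagged but passing QA
        tpr = tp / max(fails.sum(), 1)
        fpr = fp / max((~fails).sum(), 1)
        points.append((fpr, tpr))
    return points
```

With the example from the text (MCSv threshold 0.4, plan MCSv 0.3), the plan is flagged; whether that flag counts as a true or false positive then depends only on its GPR.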
The area under the curve (AUC) for each ROC curve was also determined as an indication of classification performance. The AUC takes values between 0.5 and 1.0, representing chance accuracy and perfect accuracy, respectively. Using the benchmarks presented by Nauta et al., 9 a value between 0.5-0.6 is considered poor performance, 0.6-0.7 is fair, 0.8-0.9 is good, >0.9 is excellent, and >0.95 is near perfect performance.

Fig. 1. Complexity metrics evaluated for plans delivered on the TrueBeam linac plotted against gamma passing rate using 2%/2 mm. Plans with gamma passing rates above the 95% tolerance limit are denoted as blue circles, and plans with gamma passing rates below the tolerance limit are denoted as red triangles.

3.C | Receiver operating characteristic curves

ROC curve analysis was performed to investigate the classification performance of each complexity metric, with results presented in Tables 5 and 6 for the TrueBeam and TrueBeam STx, respectively. The AUC is often used to represent classification performance as a single value, ranging from 0.5 to 1 to indicate random classification and perfect classification, respectively. The complexity metrics investigated in this work generally yielded AUCs between 0.7 and 0.8. In comparison, Park et al. 11 reported that the MCSv yielded an AUC of 0.527 using a 2%/2 mm criterion with a 90% tolerance limit, whereas the modulation index they presented yielded an average AUC of approximately 0.8.
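For reference, the AUC of an empirical ROC curve can be obtained by trapezoidal integration of the (false-positive rate, true-positive rate) points; a minimal sketch:

```python
def roc_auc(points):
    """Trapezoidal area under an ROC curve.

    `points` is an iterable of (fpr, tpr) pairs; the (0, 0) and (1, 1)
    endpoints are added so the curve spans the full FPR range.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

A classifier that is no better than chance traces the diagonal and yields 0.5, while a point at (0, 1) yields 1.0, matching the interpretation given in the text.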
For the purpose of using complexity metrics as substitutes for dose verification measurements, a threshold value can be used to determine whether the complexity of a treatment plan indicates a high dose uncertainty. As a result, the given treatment plan may be considered for re-planning. In this case, the threshold value should correspond to a low false-positive rate to avoid flagging clinically acceptable plans and a high true-positive rate to identify highly complex plans. However, any threshold value selected will be a compromise between the false-positive rate and the true-positive rate. The threshold values presented in Tables 5 and 6 were selected to ensure false-positive rates did not exceed 10%. Younge et al. 12 used the same constraint on the false-positive rate and found that their aperture complexity metric yielded a 44% true-positive rate with a 7% false-positive rate. In this work, the AI yielded a 53% true-positive rate with a 9% false-positive rate for the TrueBeam linac, whereas the average field width yielded a true-positive rate of 60% and a false-positive rate of 8% for the TrueBeam STx linacs.
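The threshold-selection rule used above (maximize the true-positive rate subject to a false-positive rate of at most 10%) can be sketched as follows; the tuple layout is an assumption for illustration.

```python
def pick_threshold(points, max_fpr=0.10):
    """Select the operating point with the highest true-positive rate
    whose false-positive rate does not exceed max_fpr.

    `points` is a list of (threshold, tpr, fpr) triples; returns the
    chosen triple, or None if no point satisfies the constraint.
    """
    feasible = [p for p in points if p[2] <= max_fpr]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[1])
```

Applied to a sweep of candidate thresholds, this reproduces the trade-off described in the text: operating points with higher sensitivity are discarded once they exceed the 10% false-positive constraint.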
Fig. 2. Complexity metrics evaluated for plans delivered on the TrueBeam STx linacs plotted against gamma passing rate using 2%/2 mm. Plans with gamma passing rates above the 95% tolerance limit are denoted as blue circles, and plans with gamma passing rates below the tolerance limit are denoted as red triangles.
The results of analyzing complexity metrics are highly institution dependent, thus making direct comparisons between institutions difficult. The values presented in Tables 5 and 6 are limited in their use to identify VMAT plans that may require re-planning. All plans investigated in this study had been deemed clinically acceptable for treatment.

Table 4. Correlations of complexity metrics to gamma passing rate (2%/2 mm).

The results reported in Table 5 may be more likely to be a product of chance than those reported in Table 4 for the TrueBeam linac. A larger sample size with a higher proportion of failing plans may result in a better indication of classification performance. 15 Using the 3%/2 mm criterion, all plans investigated yielded GPRs above the tolerance limit. As a result, the 2%/2 mm criterion was required to show a larger range of GPRs.
In addition, the gamma analysis was performed by comparing 2D measured dose planes to the corresponding calculated dose planes.

| CONCLUSION
This work investigated the potential use of complexity metrics as PTQA tools to complement measurement-based quality assurance at our institution. Complexity metrics can identify highly modulated plans that may require re-planning without the need for dose verification measurements. Furthermore, complexity metrics can be used as a means of plan evaluation prior to physics check.
Most complexity metrics had weak correlations to PSQA results, with the exception of the MCSv, which had a moderate correlation for both types of linacs considered. Using ROC analysis to investigate classification performance, the AI and the average field width were both found to have high true-positive rates in identifying highly modulated plans, with corresponding false-positive rates below 10%. The capacity of these complexity metrics to identify complex plans should be tested in future investigations. Treatment plans with artificial constraints on modulation, as well as those considered clinically unacceptable, should also be incorporated in validation studies.