Prediction of patient‐specific quality assurance for volumetric modulated arc therapy using radiomics‐based machine learning with dose distribution

Abstract
Purpose We sought to develop machine learning models to predict the results of patient-specific quality assurance (QA) for volumetric modulated arc therapy (VMAT), which were represented by several dose-evaluation metrics (including the gamma passing rates [GPRs]) and criteria, based on the radiomic features of the 3D dose distribution in a phantom.
Methods A total of 4,250 radiomic features of the 3D dose distribution in a cylindrical dummy phantom were extracted for 140 arcs from 106 clinical VMAT plans. We obtained the following dose-evaluation metrics: GPRs with global and local normalization, the dose difference (DD) 1% and 2% passing rates (DD1% and DD2%) for 10% and 50% dose thresholds, and the distance-to-agreement 1-mm and 2-mm passing rates (DTA1 mm and DTA2 mm) for 0.5%/mm and 1.0%/mm dose gradient thresholds, determined by measurement using a diode array in patient-specific QA. Machine learning regression models for predicting the values of the dose-evaluation metrics from the radiomic features were developed based on the elastic net (EN) and extra trees (ET) models. Feature selection and hyperparameter tuning were performed with nested cross-validation in which four-fold cross-validation is used within the inner loop, and the performance of each model was evaluated in terms of the root mean square error (RMSE), the mean absolute error (MAE), and Spearman's rank correlation coefficient.
Results The RMSE and MAE of the developed machine learning models ranged from <1% to nearly 10%, depending on the dose-evaluation metric, the criteria, and the dose and dose gradient thresholds used, for both machine learning models. It was advantageous to focus on the high-dose region for predicting the global GPR, DDs, and DTAs. For certain metrics and criteria, it was possible to create models applicable to patients' heterogeneity by training only with dose distributions in a phantom.
Conclusions The developed machine learning models showed high performance in predicting dose-evaluation metrics, especially for the high-dose region, depending on the metric and criteria. Our results demonstrate that the radiomic features of dose distribution can be considered good indicators of plan complexity and are useful in predicting measured dose-evaluation metrics.


INTRODUCTION
Volumetric modulated arc therapy (VMAT) is used frequently because it provides a conformal dose distribution to a target while reducing the dose to organs at risk around the target. Although VMAT has clinical benefits, including quick delivery of treatment, it is considered a complex treatment technique, mainly because of the complex motion of the multi-leaf collimator (MLC) concurrent with gantry rotation; in addition, the dosimetric accuracy of each VMAT treatment plan must be verified by patient-specific quality assurance (QA) prior to treatment delivery [1]. Gamma analysis is the most widely used method for comparing the measured and calculated dose distributions in patient-specific QA for VMAT, with the degree of agreement usually quantified by the gamma passing rate (GPR) [2,3]. Dosimetric measurement as a part of patient-specific QA is usually performed using multi-dimensional detectors. Since patient-specific QA is known to present a heavy workload, a much more efficient QA process is desirable, and such efficiency might be achievable if the results of QA could be predicted accurately, prior to the measurement, from the parameters of the created VMAT plan.
5][6][7][8][9] The modulation complexity score for VMAT (MCSv) is a well-known example that comprehensively quantifies the complexity of VMAT plans in terms of the MLC positions at each segment and the irregularity of the field shape [5]. Although numerous complexity metrics have been introduced, only a few studies have indicated that a single complexity metric had a strong correlation with the GPR [4,7]. 1][12][13][14][15][16][17][18][19][20][21][22][23] The machine learning models developed by Ono et al. showed high performance in predicting GPRs by combining multiple complexity metrics [16]. More recently, a synthesized gamma map generated by a generative adversarial network was used to accurately predict failing points in the map and GPRs [24]. 6][27] Hirashima et al. proposed machine learning models that accurately predict the GPRs for VMAT plans by using the dosiomic features (radiomic features of dose distribution) incorporated in the conventional plan complexity metrics [32]. Tomori et al. developed deep learning-based GPR prediction models using coronal and sagittal 2D planar doses that showed a strong or moderate correlation between the measured and predicted GPRs; interestingly, even though they used only "dummy target plans" created with a spherical phantom as the training dataset, the developed model also worked well for clinical target plans [19]. These studies suggested that the radiomic features of dose distributions alone could efficiently capture the complexity of each VMAT plan, in the sense that GPRs could be accurately predicted from the features.
Although several research groups have developed machine learning models to predict GPRs based on radiomic features of dose distributions, several points have not been sufficiently addressed. The first point regards the dimension of the dose distribution. Since VMAT by its nature produces a three-dimensional (3D) modulation of the dose in a patient or a phantom, the complexity of VMAT plans may be reflected more efficiently in a 3D dose distribution than in a 2D dose distribution. Moreover, the 3D gamma analysis (which is an extension of the 2D gamma analysis into another dimension) may be more suitable for evaluating the entire volumetric dose distribution. A second point concerns the object in which the dose distribution is evaluated. Since the complexity of a VMAT plan is expected to reflect primarily the complexity of the beam delivery according to MLC motion and/or the irregularity of the field shape, it is reasonable to speculate that the VMAT plan's complexity has no direct relationship with the patient inhomogeneity correction.
At present, a typical VMAT patient-specific QA procedure uses dose measurement with a phantom or multi-dimensional detectors, and it may thus be reasonable to develop models to predict GPRs based on the dose distribution in a homogeneous phantom, which is expected to be more directly related to GPRs. ][36][37] The use of a homogeneous phantom may help minimize such variation and yield a robust prediction. A third point regards the selection of the normalization method and the criteria of the dose evaluation metrics. The American Association of Physicists in Medicine (AAPM) Task Group (TG)-218 report stated that global normalization with the 3%/2 mm criterion is clinically relevant; however, one or more other metrics or criteria may be useful for predictions (e.g., local GPRs, passing rates based on the dose difference, or the distance-to-agreement alone), and there have been few systematic comparisons among multiple metrics and criteria.
In this study, we examined the predictive ability of several dose evaluation metrics and criteria based on the radiomic features of the 3D dose distribution of VMAT. We extracted the radiomic features from the dose distributions and created machine learning models based on those features to predict (i) 3D global and local GPRs, (ii) the passing rates based on the dose difference, and (iii) the passing rates based on the distance-to-agreement alone, all determined by measurement with a multi-dimensional detector array. We then evaluated the performance of each model. We used the so-called "hand-crafting" approach, in which the feature extraction and machine learning modeling are performed independently, as we speculated that doing so could enable us to observe a clear relationship between specific radiomic features and GPRs and the other dose evaluation metrics. We expected that, compared to deep learning-based prediction, the hand-crafting approach would make it much easier to interpret the results and compare them with the results of similar studies.

Workflow
As described in Figure 1, the overall workflow of our study was divided into the following five steps: (

Datasets
We used the cases of 68 patients who underwent VMAT at our institute between October 2019 and March 2020. The numbers of patients for each treatment site (with the total number of arcs) were 30 (30 arcs) for the prostate, 16 (51 arcs) for the head and neck, 18 (41 arcs) for the brain, 3 (12 arcs) for the whole pelvis, and 1 (6 arcs) for malignant pleural mesothelioma. The dose prescriptions of all of the treatments are summarized in Table 1. All of the clinical VMAT plans were created with the Eclipse treatment planning system (TPS) ver. 15.5 (Varian Medical Systems, Palo Alto, CA). The anisotropic analytic algorithm (AAA, ver. 15.5) was used for the dose calculations, in which the dose grid size was 2.5 mm for the prostate, pelvis, and malignant pleural mesothelioma, 2.0 mm for the head and neck, and 1.25 mm for the brain. All of the clinical VMAT plans were delivered with a 6 MV photon beam on a Novalis Tx radiosurgery system (Varian Medical Systems), in which the MLC leaf width is 2.5 mm in the central area and 5 mm in the peripheral area. This retrospective study was approved by the institutional review board at our institute.

FIGURE 1 The study workflow. DDx%: passing rate for dose difference evaluation with x% criteria, DTAx mm: passing rate for distance-to-agreement in x mm criteria, FO: first order, GLCM: gray-level co-occurrence matrix, GLDM: gray-level dependence matrix, GLRLM: gray-level run-length matrix, GLSZM: gray-level size-zone matrix, GPR: gamma passing rate, NGTDM: neighboring gray-tone difference matrix.

We obtained the GPRs with global and local normalization, the dose difference (DD) passing rates for 1% (DD1%) and 2% (DD2%), and the distance-to-agreement (DTA) passing rates for 1 mm (DTA1 mm) and 2 mm (DTA2 mm) by comparing the dose distributions measured with the Delta4 and the dose calculated with the TPS in a cylindrical, homogeneous numerical phantom (diameter 22 cm, length 40 cm; ScandiDos) [38]. The dose distribution in the phantom was calculated under the assumption that the phantom was irradiated by the clinical treatment beams, with a dose calculation grid size of 2.0 mm. The computed tomography (CT) number of the phantom was set to 217, corresponding to a relative electron density of 1.147, in accordance with the specifications from ScandiDos [38]. The 3D GPRs were obtained with a dose threshold of 10% for the following four criteria: 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm. Criteria with larger dose differences, such as 3%/2 mm, were not included in this study, because their GPRs increased and could be concentrated around 100% when we performed the 3D gamma analysis [39,40]. We also evaluated the 3D passing rates for the DD1% and DD2% with a 10% dose threshold and for the DTA1 mm and DTA2 mm with a 0.5%/mm dose gradient threshold. Additionally, the 3D GPRs, DD1%, and DD2% with a 50% dose threshold and the DTA1 mm and DTA2 mm with a 1.0%/mm dose gradient threshold were obtained and used for additional machine learning modeling for comparison. All of the values of the dose evaluation metrics were calculated by the Delta4 software [38].
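To make the role of the criteria, the normalization, and the dose threshold concrete, the following is a minimal one-dimensional sketch of a gamma passing rate calculation. It is an illustration only and not the Delta4 software's implementation: the function name `gamma_passing_rate`, the use of the maximum calculated dose for global normalization and for the threshold, the use of the measured point dose for local normalization, and the absence of interpolation between grid points are all assumptions made here.

```python
import math

def gamma_passing_rate(positions, measured, calculated, dd_percent, dta_mm,
                       local=False, threshold_percent=10.0):
    """Percentage of evaluated points with gamma <= 1 (1D grid, no interpolation)."""
    d_max = max(calculated)                      # assumed normalization dose
    cutoff = threshold_percent / 100.0 * d_max   # dose threshold for inclusion
    passed = evaluated = 0
    for x_m, d_m in zip(positions, measured):
        if d_m < cutoff:
            continue  # below the dose threshold: excluded from the analysis
        # Global: tolerance relative to the maximum dose.
        # Local (assumption): tolerance relative to the dose at this point.
        norm = d_m if local else d_max
        dose_tol = dd_percent / 100.0 * norm
        gamma = min(
            math.hypot((x_c - x_m) / dta_mm, (d_c - d_m) / dose_tol)
            for x_c, d_c in zip(positions, calculated)
        )
        evaluated += 1
        passed += gamma <= 1.0
    return 100.0 * passed / evaluated if evaluated else float("nan")

# Hypothetical profile on a 1 mm grid; values are illustrative only.
pos = [0.0, 1.0, 2.0, 3.0, 4.0]
calc = [0.0, 50.0, 100.0, 50.0, 0.0]
meas = [0.0, 51.5, 100.0, 50.0, 0.0]
global_gpr = gamma_passing_rate(pos, meas, calc, dd_percent=2, dta_mm=1)
local_gpr = gamma_passing_rate(pos, meas, calc, dd_percent=2, dta_mm=1, local=True)
```

In this toy profile the 1.5-unit deviation at the 50% dose level passes the 2% global tolerance but fails the 2% local tolerance, which mirrors the general observation that local GPRs tend to be lower than global GPRs for the same criterion.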

Extraction of the radiomic features of the 3D dose distributions
A total of 106 radiomic features were extracted from the calculated 3D dose distributions in the Delta4 phantom with the use of 3D Slicer ver. 4.8.1 with PyRadiomics ver. 2.0.1 [41,42]. The region of interest (ROI) used for the feature extraction was the region above the 10% or 50% dose threshold relative to the prescription dose. The 106 extracted features were classified into 18 first-order features, 13 shape features, and 75 texture features. The texture features were classified into five classes: gray-level dependence matrix (GLDM) (n = 14 features), gray-level co-occurrence matrix (GLCM) (n = 24), gray-level run-length matrix (GLRLM) (n = 16), gray-level size-zone matrix (GLSZM) (n = 16), and neighboring gray-tone difference matrix (NGTDM) (n = 5). The details of the calculated radiomic features are provided in Table 2.
All of the radiomic features were calculated for five different bin widths: 0.01, 0.1, 1.0, 10, and 100. We also obtained the features to which a 3D wavelet-filter was applied at the first decomposition level, consisting of 8 different blocks, namely the LLL, HLL, LHL, HHL, LLH, HLH, LHH, and HHH wavelet-filters along the x-, y-, and z-dimensions (where L = low-pass filter and H = high-pass filter). The effectiveness of wavelet-filtered features in predicting the gamma passing rate was demonstrated by Hirashima et al., in whose study almost all of the selected radiomic features were wavelet-filtered [32]. Since the wavelet-filter was not applied to the shape features, the total number of shape features was 65 (13 original features multiplied by 5 bin sizes), and the total number of the other features was 4,185 (93 original features multiplied by 5 bin sizes and 9 patterns, including 1 original and 8 wavelet-filtered). The total number of features was thus 4,250.
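The feature count above can be checked with a few lines of arithmetic. The sketch below only reproduces the bookkeeping stated in the text; the variable names are illustrative, and the order in which the wavelet block names are generated differs from the order listed above.

```python
from itertools import product

bin_widths = [0.01, 0.1, 1.0, 10, 100]
# The eight first-level 3D wavelet blocks: every L/H combination along x, y, z.
wavelet_blocks = ["".join(p) for p in product("LH", repeat=3)]
image_types = ["original"] + wavelet_blocks  # 9 patterns

n_shape = 13
n_first_order = 18
n_texture = {"GLDM": 14, "GLCM": 24, "GLRLM": 16, "GLSZM": 16, "NGTDM": 5}

n_non_shape = n_first_order + sum(n_texture.values())               # 93
total_shape = n_shape * len(bin_widths)                             # 65
total_non_shape = n_non_shape * len(bin_widths) * len(image_types)  # 4185
total_features = total_shape + total_non_shape                      # 4250
```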

Machine learning modeling
Figure 1 is a schematic of the workflow of the development of the machine learning models. We created the machine learning models based on the elastic net (EN) and the extra trees (ET) by using the PyCaret program (ver. Training and validation processes described below were performed only with the 'training/validation' dataset, and the 'test' dataset remained completely unknown to the created models. To avoid multi-collinearity, the number of features was reduced by removing the features for which the correlation coefficients were > 0.95 (Figure 1). The radiomic features with a non-zero coefficient were then selected using Lasso regression in PyCaret, and a ranking list of features was created. We created models based on the features from the top of the list and tuned the hyperparameters of the EN and ET models, searching for the optimal number of features that produced the minimum root mean squared error (RMSE) in a nested cross-validation (CV) in which four-fold CV was used within the outer and inner loops. The performance of the developed models for predicting the dose evaluation metrics with various criteria for the validation and test datasets was evaluated based on the RMSE, the mean absolute error (MAE), and Spearman's rank correlation coefficient (r).
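The correlation-based reduction step can be sketched as follows. This is an illustration rather than the PyCaret procedure used in the study: the text does not state which correlation coefficient was used or which member of a correlated pair was discarded, so the sketch assumes the Pearson coefficient and keeps the first feature encountered. The names `pearson` and `drop_correlated` and the toy feature values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0 here
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(features, threshold=0.95):
    """features: dict mapping feature name -> list of values (one per arc).

    Greedily keeps a feature only if its absolute correlation with every
    previously kept feature does not exceed the threshold.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features: "b" is almost a multiple of "a" and is removed.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.1], "c": [4, 1, 3, 2]}
kept_features = drop_correlated(toy)
```

A greedy pass such as this depends on the order of the features; a different tie-breaking rule would retain a different but equally uncorrelated subset.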
The machine learning modeling processes mentioned above were repeated for the datasets of the global and local GPRs (1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm), DD1%, and DD2% for the 50% dose threshold, and the DTA1 mm and DTA2 mm for the 1.0%/mm dose gradient threshold. Finally, all of the created models were applied to the test dataset of 3D dose distributions in patients' CT images with heterogeneity and were evaluated in predicting the dose evaluation metrics.
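The nested cross-validation used in this section can be sketched in a library-independent way. The code below is a simplified illustration, not the PyCaret workflow: `nested_cv`, `k_fold_indices`, and the `fit`/`predict` callables are hypothetical names, the candidate being tuned stands in for the number of features or any other hyperparameter, and the toy model is chosen only so that the example runs.

```python
import math
import random

def k_fold_indices(n_samples, n_folds=4, seed=0):
    """Shuffle sample indices and split them into n_folds nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def nested_cv(X, y, candidates, fit, predict, n_folds=4):
    """Return one (best_candidate, outer_rmse) pair per outer fold."""
    results = []
    outer = k_fold_indices(len(y), n_folds, seed=0)
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_fold_indices(len(train_idx), n_folds, seed=1)

        def inner_score(c):
            # Mean validation RMSE of candidate c over the inner folds.
            errors = []
            for v, val_pos in enumerate(inner):
                val = [train_idx[p] for p in val_pos]
                fit_idx = [train_idx[p] for f, fold in enumerate(inner)
                           if f != v for p in fold]
                model = fit([X[j] for j in fit_idx], [y[j] for j in fit_idx], c)
                pred = predict(model, [X[j] for j in val])
                errors.append(rmse([y[j] for j in val], pred))
            return sum(errors) / len(errors)

        best = min(candidates, key=inner_score)
        # Refit on the whole outer-training part, then score on the held-out fold.
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx], best)
        pred = predict(model, [X[j] for j in test_idx])
        results.append((best, rmse([y[j] for j in test_idx], pred)))
    return results

# Toy model (hypothetical): predicts scale * mean(y_train) for every sample.
def fit_scaled_mean(X_train, y_train, scale):
    return scale * sum(y_train) / len(y_train)

def predict_constant(model, X_new):
    return [model] * len(X_new)

X = [[float(i)] for i in range(16)]
y = [10.0] * 16
cv_results = nested_cv(X, y, [0.5, 1.0], fit_scaled_mean, predict_constant)
```

The point of the nesting is that the candidate is chosen using the inner folds only, so the outer-fold error is not biased by the tuning.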

Dose evaluation metrics
Table 3 presents the range, mean, standard deviation (SD meas), and median of the measured values of the global and local GPRs with the criteria of 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm, the DD1% and DD2% passing rates for the 10% and 50% dose thresholds, and the DTA1 mm and DTA2 mm passing rates for the 0.5%/mm and 1.0%/mm dose gradient thresholds. For more stringent criteria, smaller mean and median values and a larger range and SD meas were observed for the global and local GPRs. The mean and median values of the global GPRs were always larger than those of the local GPRs for the same gamma analysis criterion. For all the criteria except the local 2%/1 mm and 2%/2 mm, the SD meas values were larger for the 50% threshold.
The mean and median passing rates were the lowest and the range and SD meas were the largest for DD1% among all of the dose evaluation metrics.

Prediction of dose evaluation metrics by the machine learning models
The learning curves of the EN and ET models, in which the vertical axis represents the RMSE, for the global and local GPRs, DD1%, DD2%, DTA1 mm, and DTA2 mm passing rates are shown in Figures 2 and 3, respectively. The training errors and the test errors became almost stable by 60-80 samples for all of the metrics and criteria. The shapes of the training and test curves of the EN and ET were similar for all the metrics. Table 4 presents the mean values and SDs of the predicted passing rates (SD pred) and the RMSEs, MAEs, and r-values of the developed machine learning models based on the EN and ET for all of the criteria of the dose evaluation metrics for the 10% dose threshold (GPRs and DDs) and the 0.5%/mm dose gradient threshold (DTAs). For the global GPRs with the 10% dose threshold, the largest values for the training/validation dataset were found for the 1%/1 mm criterion: the RMSE was 5.45% (EN) and 5.50% (ET), and the MAE was 4.24% (EN) and 4.22% (ET). The largest values of the RMSE and MAE for the local GPRs were also identified for the 1%/1 mm criterion, and the values were larger than those for the global GPRs. As the criteria became more stringent and the SD pred became larger, the RMSE and MAE values became larger. Overall, the RMSE and MAE for the test datasets were larger than those for the training dataset for the EN and ET models. The highest correlation coefficient for training/validation was 0.81 (ET), for the DTA1 mm. For the training/validation of DD1%, the RMSE was 7.33% (EN) and 7.35% (ET) and the MAE was 6.07% (EN) and 6.10% (ET), which were the largest values among all of the dose evaluation metrics, and the corresponding values for DTA2 mm were the smallest.
The RMSE divided by SD meas (RMSE/SD meas), where SD meas is the standard deviation of the measured global and local GPRs, DDs, and DTAs, was in the range of 0.59-0.91 for the training/validation dataset and 0.84-1.42 for the test dataset across all of the dose evaluation metrics and machine learning models. The values of RMSE/SD meas were relatively small for the local GPRs with the test dataset compared to the other metrics. Scatterplots of the measured and predicted global GPRs, local GPRs, DDs, and DTAs of the training/validation and test datasets for the EN model are provided in Figure 4. The values were distributed around the diagonal line, which indicates perfect prediction, for all of the metrics. Figure 5 shows the results for the ET models in the same manner. In Figures 4 and 5, the scatterplots for the global GPRs and DDs revealed a pattern of predicting only specific values, which appeared as an alignment of points in the scatterplot, especially for the ET models. On the other hand, the points in the scatterplots for the local GPRs and DTA1 mm were distributed around the diagonal line. Thus, the finding that the values of RMSE/SD meas were relatively small for the local GPRs and DTA1 mm compared to the other metrics was confirmed graphically by these scatterplots.
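The evaluation metrics used above can be written out explicitly. The following is a minimal sketch using only the Python standard library; the function names are illustrative, the sample standard deviation is assumed for SD meas (the text does not say whether the sample or population form was used), and the passing rates in the example are hypothetical values, not data from this study.

```python
import math
import statistics

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)  # undefined if either input is constant

def rmse_over_sd(y_true, y_pred):
    """RMSE normalized by the (sample) SD of the measured values."""
    return rmse(y_true, y_pred) / statistics.stdev(y_true)

# Hypothetical measured and predicted passing rates [%].
measured = [90.0, 95.0, 98.0, 99.0]
predicted = [92.0, 94.0, 97.0, 100.0]
summary = {
    "RMSE": rmse(measured, predicted),
    "MAE": mae(measured, predicted),
    "r": spearman_r(measured, predicted),
    "RMSE/SD": rmse_over_sd(measured, predicted),
}
```

A ratio well below 1 means the model explains variation beyond simply predicting the mean passing rate, whereas a ratio near or above 1 means it does little better than a constant prediction.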
Table 5 shows the results for the 50% dose threshold (GPRs and DDs) and the 1.0%/mm dose gradient threshold (DTAs). Although the values of the SD pred, RMSE, and MAE were much larger than those in Table 4, the values of RMSE/SD meas were reduced, except for the local GPRs. For the DDs and DTAs in particular, the values of RMSE/SD meas were much smaller. Using a high-dose region was advantageous for accurately predicting the global GPRs, DDs, and DTAs.
Table 6 shows the results obtained when the created models were tested with the dataset originating from dose distributions in patients' CT images with heterogeneity, for the dose thresholds of 10% and 50% (GPRs and DDs) and the dose gradient thresholds of 0.5%/mm and 1.0%/mm (DTAs). Overall, the RMSE, MAE, and RMSE/SD meas increased and the r decreased for the same metric and criterion compared to the results for the phantom. The exceptions were the global 1%/1 mm, global 2%/2 mm, and local 2%/2 mm GPRs with the 50% dose threshold, for which the created models maintained a small RMSE/SD meas and a high r even with the test dataset with patients' heterogeneity.
The radiomic features selected in the machine learning modeling are summarized in Tables 7-9 for the global GPRs, the local GPRs, and the DDs and DTAs, respectively. The number of selected radiomic features ranged from 1 to 10 depending on the dose evaluation metric and criteria. The maximum number of features was 10 for the DTA1 mm with the 1.0%/mm dose gradient threshold, and the second-highest number was 9 for the DD1% with the 50% dose threshold. Some features were selected for multiple metrics and criteria: 28 features were selected for more than one criterion. In particular, the "zone entropy" was selected for 11 criteria, the "gray-level nonuniformity" for 10 criteria, the "dependence variance" for 9 criteria, and the "Imc1" for 7 criteria. The most frequently selected class was the GLSZM. No particular bin width was selected more often than the others. In addition, the great majority of the features listed in Tables 7-9 were features to which a wavelet-filter was applied.

DISCUSSION
The performance of our developed machine learning models for predicting global and local GPRs, DDs, and DTAs is summarized in 31 The EN and ET models showed comparable results overall, implying that the obtained results are robust and largely independent of the choice of machine learning algorithm. The RMSE and MAE values for the GPRs were higher for the more stringent criteria, that is, those with a smaller dose difference and DTA. This may be understood simply by noting that the SDs of the measured GPRs were larger for stringent criteria, as suggested by the data in Table 4. The ratio of the RMSE to the SD of the measured dose evaluation metric for the EN and ET models was in a narrow range, that is, 0.59-0.91 (training) and 0.84-1.42 (test), suggesting that the RMSE generally increases with the SD of the measured dose evaluation metric. The ratio can therefore be considered a good indicator of prediction accuracy, and the RMSE or MAE alone may not be sufficient.
The results for the 50% dose threshold (Table 5) showed superior prediction accuracy in terms of the ratio of the RMSE to the SD of the measured dose evaluation metric, except for the local GPRs. A possible reason is that the dose distribution above 50% is more likely to reflect the characteristics of the treatment plan and is more closely related to the complexity of the plan than the lower-dose region. Compared to the other metrics, the SD of the measured values of the local GPRs did not change much between the 10% and 50% dose thresholds, as shown in Table 3. This smaller variability of the GPRs may be the reason that the prediction accuracy for the local GPRs was inferior for the 50% dose threshold.

FIGURE 4 Scatterplots of the predicted and measured dose evaluation metrics in the elastic net (EN) model: (a) global GPRs for the 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm criteria, (b) local GPRs for the 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm criteria, (c) DD1% and DD2% criteria, and (d) DTA1 mm and DTA2 mm criteria. The dose threshold is 10% for GPRs, DD1%, and DD2%, and the dose gradient threshold is 0.5%/mm for DTA1 mm and DTA2 mm. Diagonal line: perfect prediction. Abbreviations are explained in Figure 1.

TABLE 4 The number of selected radiomic features, the mean and SD of the predicted values of metrics, the RMSE, MAE, Spearman's correlation coefficient, and RMSE/SD meas, where SD meas is the standard deviation of measured values of metrics for the validation and test datasets for the EN and ET models with the dose threshold of 10% (global and local GPRs and DDs) and the dose gradient threshold of 0.5%/mm (DTAs).

The poor accuracy of the models created from dose distributions in the phantom when applied to the test dataset originating from dose distributions in patients' CT images with heterogeneity, as shown in Table 6, suggested that the applicability of the models is limited. It implied that a dataset originating only from the phantom may not be sufficient to accurately predict all of the dose evaluation metrics from dose distributions in patients' CT images with heterogeneity. However, the global 1%/1 mm, global 2%/2 mm, and local 2%/2 mm GPRs with the 50% dose threshold maintained a prediction accuracy comparable to that of the pure phantom study, indicating that the created models can be applied to a calculated dose distribution with patients' heterogeneity for these criteria.
The selected radiomic features were distributed among the first-order features and the five texture feature classes, as shown in Tables 7-9. There was no prominently important feature or class, and we concluded that a combination of features was required to accurately predict the dose evaluation metrics. It is notable that no shape feature, a category that includes descriptors of the size and shape of the selected region of the dose distribution, was listed in Tables 7-9. This implied that the size and shape of the selected region of the dose distribution may not be an indicator of the complexity of the VMAT plan and may be irrelevant to the prediction of dose evaluation metrics.

TABLE 5 The number of selected radiomic features, the mean and SD of the predicted values of metrics, DD1%, DD2%, DTA1 mm, and DTA2 mm, the RMSE, MAE, Spearman's correlation coefficient, and RMSE/SD meas, where SD meas is the standard deviation of measured values of metrics for the validation and test datasets for the EN and ET models with the dose threshold of 50% (global and local GPRs and DDs) and the dose gradient threshold of 1.0%/mm (DTAs).

The performance of our developed models can be closely compared to the "dosiomics model" presented by Hirashima et al. and the deep learning-based models presented by Tomori et al. [19,31]. The RMSE for the global GPR of 2%/2 mm was 1.2% for our EN model; the value reported by Hirashima et al. for the same criterion was 5.7%. However, this superior accuracy may be due primarily to the difference in the SD of the measured GPRs. The SD of the measured GPR of 2%/2 mm reported by Hirashima et al. was 7.4%, resulting in an RMSE-to-SD ratio of 0.77, which is close to the range of our results presented in Table 4. The reason that the SDs in our study were significantly smaller than those reported by Hirashima et al. may be that we used a 3D gamma evaluation, which produces higher and more compactly distributed passing rates than a 2D gamma evaluation [38,39]. Our models showed slightly superior accuracy compared to the models developed by Tomori et al. for the similar SD values of the measured global GPRs [19]. This could also be attributed to their larger SDs of the measured GPRs, that is, 2.94% for the 2%/2 mm GPR. The RMSE-to-SD ratio for their results was 0.90. Collectively, these findings suggest that the ratio of the RMSE to the SD of measured GPRs is approximately constant regardless of the prediction model and the detector used for the GPR measurement. Therefore, the accuracy of prediction models must be evaluated by the conventional metrics (MAE and RMSE) in association with the variation (SD) of the measured objective.

TABLE 6 The mean and SD of the predicted values of metrics, DD1%, DD2%, DTA1 mm, and DTA2 mm, the RMSE, MAE, Spearman's correlation coefficient, and RMSE/SD meas, where SD meas is the standard deviation of measured values of metrics for the validation and test datasets for the EN and ET models, for the test dataset originating from dose distributions in patients with heterogeneity, with the dose thresholds of 10% and 50% (global and local GPRs and DDs) and the dose gradient thresholds of 0.5%/mm and 1.0%/mm (DTAs).
Abbreviations: DDx%: passing rate for dose difference evaluation with the x% criterion; DTAx mm: passing rate for distance-to-agreement evaluation with the x mm criterion; GPR: gamma passing rate; r: Spearman's rank correlation coefficient; SD meas: standard deviation of measured dose evaluation metrics; SD pred: standard deviation of predicted dose evaluation metrics.
Hirashima et al. reported that the class of GLDM was the most important parameter for predicting global GPRs [31]. Two of the top-ten ranked radiomic features in the Hirashima et al. study were also selected in our global GPR results, namely the "dependence entropy" and the "dependence variance." These features are thus thought to represent universal characteristics of dose distribution that are relevant to the global GPRs. In this work, we adopted the so-called hand-crafting approach rather than a deep learning-based approach. The step-by-step process of the hand-crafting approach enabled us to specify the effective features for each metric and criterion and to determine what they have in common with other criteria. It also enabled us to compare the effective features with those described in similar studies and to identify universally important features of VMAT dose distribution. 5][36][37] Our present results demonstrated that the radiomic features to which a wavelet-filter was applied were necessary to optimize the performance of the machine learning models, since these dominated the lists in Tables 7-9. The results also suggested that the bin size of each radiomic feature must be carefully selected in order to optimize the performance of the models. This point was not discussed in detail in previous studies.
Our study included the five categories of treatment sites as presented in
Abbreviations: FO: first order; GLCM: gray-level co-occurrence matrix; GLDM: gray-level dependence matrix; GLRLM: gray-level run-length matrix; GLSZM: gray-level size-zone matrix; GPR: gamma passing rate; LLL/HLL/LHL/HHL/LLH/HLH/LHH/HHH: wavelet-filters along x-, y-, and z-dimensions applied (L and H: low-pass and high-pass filters, respectively); NGTDM: neighboring gray-tone difference matrix; ORG: the original value of the feature with a wavelet-filter not applied; WF: wavelet-filter.
example, head-and-neck or whole pelvis plans are likely to be more complex than prostate plans. The created models can therefore be considered more comprehensive and versatile with respect to plan complexity than a treatment-site-specific model. Moreover, the results of our study indicated that the radiomic features of dose distribution are an indicator of plan complexity.
The improvement of VMAT plans by the use of models for predicting dose evaluation metrics may provide a clinical benefit in the sense that the dose uncertainty in VMAT delivery to patients can be minimized [43]. Our present findings also indicate the feasibility of improving VMAT plans by applying a radiomic analysis of dose distributions to predict the results of patient-specific QA in the treatment planning process. The approach we adopted in this study may have an important benefit in that the dose evaluation metrics can be accurately predicted from the dose distribution alone, so that it may be easily implemented on a TPS. For example, the radiomic features of dose distributions could be optimized in the VMAT optimization process so that the resulting dose distribution would provide a high passing rate for a dose evaluation metric of patient-specific QA.
There are several study limitations to address. First, our results may not be applicable to cases in which types of detector arrays other than the Delta4 are used. Several authors have indicated that differences in the geometry of detectors may lead to differences in GPR values even for the same VMAT plan. For example, Li et al. found that the correlation with plan complexity metrics was higher for GPRs measured by a Delta4 than for GPRs measured using an ArcCHECK2 array [44]. Steer et al. showed that the Delta4 had higher sensitivity for error detection than the ArcCHECK [45]. These studies suggested that, among the variety of detectors used in patient-specific QA, the Delta4 detector is at least non-inferior in predicting GPRs and in error detection sensitivity. Second, the performance of our developed models may have depended specifically on the patient cohort we used, meaning that the models may be site-specific and that the same approach could lead to different levels of performance at other institutes. To overcome this problem, it may be useful to carry out a multi-institutional study to improve the generalization capability of our developed machine learning models.

CONCLUSIONS
We developed machine learning models to predict the values of the gamma, dose-difference, and distance-to-agreement passing rates of patient-specific QA for VMAT based on the radiomic features of the 3D dose distribution in a dummy phantom. The developed models showed good performance overall, comparable to the findings of previous studies. It is advantageous to focus on a high dose region to improve the prediction accuracy. For certain metrics and criteria, creating a model applicable to patients' heterogeneity by training only with dose distributions in the phantom was possible by focusing on a high dose region. Our results demonstrate that the radiomic features of the dose distribution can be considered good indicators of plan complexity and are useful in predicting measured dose evaluation metrics. The results of our study can provide a useful method to reduce the dosimetric uncertainty of VMAT plans by evaluating only the dose distribution.

AUTHOR CONTRIBUTIONS
FIGURE 5 Scatterplots of the predicted and measured dose evaluation metrics in the extra trees (ET) model: (a) global GPRs for the 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm criteria, (b) local GPRs for the 1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm criteria, (c) DD1% and DD2% criteria, and (d) DTA1 mm and DTA2 mm criteria. The dose threshold is 10% for GPRs, DD1%, and DD2%, and the dose gradient threshold is 0.5%/mm for DTA1 mm and DTA2 mm. Diagonal line: perfect prediction. Abbreviations are explained in the Figure 1 legend.
TABLE 1 Treatment sites, number of patients, total number of arcs, and prescription dose of the VMAT plans used.

TABLE 2 Number and class of the extracted radiomic features. Column headers: Class, No. of features, Description.
2.3.10) with Python (ver. 3.8.16) to predict the global and local GPRs (1%/1 mm, 1%/2 mm, 2%/1 mm, and 2%/2 mm), DD1% and DD2% for the 10% dose threshold, and DTA1 mm and DTA2 mm for the 0.5%/mm dose gradient threshold. The EN model is a regression model combining the Lasso and Ridge regularizations. The ET model is similar to the random forest algorithm, but it uses the whole original samples and randomly selects cut points in order to split nodes. The dataset was divided into 54 patients including 108 arcs for training/validation and another 14 patients including 32 arcs for testing, so that plans from the same patient were not split between the training and test datasets, in order to avoid information leakage.
TABLE 3 The measured range, mean, SDmeas, and median of the measured global and local GPRs, DD passing rates with the dose thresholds of 10% and 50%, and DTA passing rates with the dose gradient thresholds of 0.5%/mm and 1.0%/mm.
Abbreviations: DDx%: passing rate for dose difference evaluation with the x% criterion; DTAx mm: passing rate for distance-to-agreement evaluation with the x mm criterion; GPR: gamma passing rate; SD: standard deviation; SDmeas: standard deviation of measured dose evaluation metrics.
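The patient-level division of the dataset described above can be illustrated with a minimal sketch. It uses only the Python standard library and toy patient labels; the function name `split_by_patient`, the identifiers, and the seed are hypothetical and are not taken from the study, whose actual splitting code is not described. The point shown is only that every arc of a given patient falls on one side of the split.

```python
import random
from collections import defaultdict


def split_by_patient(arc_patient_ids, n_test_patients, seed=0):
    """Return (train_idx, test_idx) such that all arcs of a patient share one side."""
    # Group arc indices by the patient they belong to.
    arcs_by_patient = defaultdict(list)
    for arc_index, patient_id in enumerate(arc_patient_ids):
        arcs_by_patient[patient_id].append(arc_index)

    # Shuffle patients (not arcs) so the split is made at the patient level.
    patients = sorted(arcs_by_patient)
    random.Random(seed).shuffle(patients)
    test_patients = patients[:n_test_patients]
    train_patients = patients[n_test_patients:]

    train_idx = sorted(i for p in train_patients for i in arcs_by_patient[p])
    test_idx = sorted(i for p in test_patients for i in arcs_by_patient[p])
    return train_idx, test_idx


# Toy data (hypothetical): six patients, some treated with two arcs.
arc_patient_ids = ["P1", "P1", "P2", "P3", "P3", "P4", "P5", "P5", "P6"]
train_idx, test_idx = split_by_patient(arc_patient_ids, n_test_patients=2)

train_patients = {arc_patient_ids[i] for i in train_idx}
test_patients = {arc_patient_ids[i] for i in test_idx}
print("train patients:", sorted(train_patients))
print("test patients:", sorted(test_patients))
```

Shuffling patients rather than arcs is the design choice that prevents leakage: two arcs from the same plan tend to share dose-distribution characteristics, so placing them on opposite sides of the split would inflate the apparent test performance.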
Table 4 and presented graphically in Figures 4 and 5 as scatterplots of the predicted and measured passing rates. Overall, the developed models predicted the dose evaluation metrics and criteria with an RMSE of approximately 1% to 10% for the test dataset. According to the learning curves shown in Figures 2 and 3 and the results in Table 4, the EN and ET models showed larger RMSE and MAE values for the test dataset than for the training dataset. Nevertheless, the models appear to have acquired a sufficient level of generalization performance when compared with the results of Hirashima et al., in which the RMSE for the test dataset was more than double that for the validation dataset.
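For readers who wish to reproduce this kind of evaluation, the following sketch computes the RMSE and MAE discussed above, together with Spearman's rank correlation coefficient, for a set of measured and predicted passing rates. It is a plain standard-library illustration under stated assumptions: the function names and the four toy values are hypothetical and are not data from this study.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(measured, predicted):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


def average_ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(measured, predicted):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rm, rp = average_ranks(measured), average_ranks(predicted)
    mean_m, mean_p = sum(rm) / len(rm), sum(rp) / len(rp)
    cov = sum((a - mean_m) * (b - mean_p) for a, b in zip(rm, rp))
    var_m = sum((a - mean_m) ** 2 for a in rm)
    var_p = sum((b - mean_p) ** 2 for b in rp)
    return cov / math.sqrt(var_m * var_p)


# Toy passing rates in percent (hypothetical values, not study data).
measured = [98.0, 95.0, 90.0, 99.0]
predicted = [97.0, 96.0, 92.0, 98.0]

rmse_value = rmse(measured, predicted)
mae_value = mae(measured, predicted)
rho = spearman(measured, predicted)
print(rmse_value, mae_value, rho)
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE, and a large gap between the two points to a few arcs with large prediction errors.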
Table 1, which are associated with various levels of complexity of VMAT plans. For
Selected radiomic features in descending order of the absolute value of the coefficients of Lasso regression with the dose thresholds of 10% and 50% are displayed with the bin width, the class, and the wavelet-filter for global GPRs.
Note: The number of selected features for each machine learning model is presented in Tables 4-5.
Selected radiomic features in descending order of the absolute value of the coefficients of Lasso regression with the dose thresholds of 10% and 50% are displayed with the bin width, the class, and the wavelet-filter for local GPRs.
Note: The number of selected features for each machine learning model is presented in Tables 4-5.
Selected radiomic features in descending order of the absolute value of the coefficients of Lasso regression with the dose thresholds of 10% and 50% (DDs) and the dose gradient thresholds of 0.5%/mm and 1.0%/mm (DTAs) are displayed with the bin width, the class, and the wavelet-filter for DDs and DTAs.
Note: The number of selected features for each machine learning model is presented in Tables 4-5.
Abbreviations: FO: first order; GLCM: gray-level co-occurrence matrix; GLDM: gray-level dependence matrix; GLRLM: gray-level run-length matrix; GLSZM: gray-level size-zone matrix; GPR: gamma passing rate; LLL/HLL/LHL/HHL/LLH/HLH/LHH/HHH: wavelet-filters along x-, y-, and z-dimensions applied (L and H: low-pass and high-pass filters, respectively); NGTDM: neighboring gray-tone difference matrix; ORG: the original value of the feature with a wavelet-filter not applied; WF: wavelet-filter.
TABLE 9