Effects of model size and composition on quality of head‐and‐neck knowledge‐based plans

Abstract

Purpose: Knowledge-based planning (KBP) aims to automate and standardize treatment planning. New KBP users are faced with many questions: How much does model size matter, and are multiple models needed to accommodate specific physician preferences? In this study, six head-and-neck KBP models were trained to address these questions.

Methods: The six models differed in training size and plan composition. KBP Full (n = 203 plans), KBP 101 (n = 101), KBP 50 (n = 50), and KBP 25 (n = 25) were trained with plans from two head-and-neck physicians. KBP A and KBP B each contained n = 101 plans from only one physician. An independent set of 39 patients treated to 6000–7000 cGy by a third physician was re-planned with all KBP models for validation. Standard head-and-neck dosimetric parameters were used to compare the resulting plans. KBP Full plans were compared to the clinical plans to evaluate overall model quality. Additionally, clinical and KBP Full plans were presented to another physician for blind review. Dosimetric comparison of KBP Full against KBP 101, KBP 50, and KBP 25 investigated the effect of model size. Finally, KBP A versus KBP B tested whether training KBP models on plans from only one physician influences the resulting output. Dosimetric differences were tested for significance using a paired t-test (p < 0.05).

Results: Compared to manual plans, KBP Full significantly increased PTV Low D95% and left parotid mean dose but decreased dose to the cochlea, constrictors, and larynx. The physician preferred the KBP Full plan over the manual plan in 20/39 cases. Dosimetric differences between KBP Full, KBP 101, KBP 50, and KBP 25 plans did not exceed 187 cGy on aggregate, except for the cochlea. Further, average differences between KBP A and KBP B were below 110 cGy.

Conclusions: Overall, all models were shown to produce high-quality plans. Differences between model outputs were small compared to the prescription. This indicates only small improvements when increasing model size and minimal influence of the physician when choosing treatment plans for training head-and-neck KBP models.


INTRODUCTION
Knowledge-based planning (KBP) has been proposed as an automated tool to reduce planning variability, improve plan quality, and decrease planning time. KBP models use previously treated plans to predict achievable dose-volume histograms (DVHs) for new patients.[7][8][9][10][11][12] These predictions can then be used to automatically generate optimization objectives, decreasing the amount of trial and error needed in manual treatment planning.[15][16]

To create high-quality plans using a KBP model, a database of "good" plans is required to train the model.[7,18] In this process, after initial training, the DVH estimations for each plan in the training set are evaluated. If a specific organ in a plan exceeds the predicted dose, this instance of the organ can be eliminated from the training set. The entire model is then retrained. Removing such outliers yields a more accurate DVH prediction. Further, the choice of optimization objectives has been shown to impact the quality of the resulting plans, and it is therefore recommended to carefully fine-tune optimization objectives.[14] As a result of these considerations, building an in-house KBP model requires substantial resources. Clinics looking to build their own models are faced with many questions: Does model size influence resultant plan quality? Do planning preferences of different treating physicians influence the resulting plans from a KBP model? Will a model be applicable for a new or different physician? How many plans are needed to make a reasonable model?
Building a model can be challenging if a clinic does not have access to a large number of previously treated plans, if personnel time is limited, or if the local expertise to tune such a model is lacking.[21][22][23][24] Institutional or physician bias can affect what is considered when making planning decisions for patients.[25] Several questions arise for a clinic, such as whether an outside model can match institutional or physician standards for treatment, whether such physician preferences affect the model built, and whether a model satisfies each physician's preferences. While these questions have partially been answered for a prostate model,[26] the applicability of the conclusions to more complex disease sites has yet to be tested. This study aims to shed light on these questions for head-and-neck (HN), which represents one of the most complicated disease sites, treated with multiple targets at different dose levels, complex target shapes, and proximity of many critical organs-at-risk (OAR).
In particular, this study seeks to answer the questions of whether increasing model size necessarily improves plan quality, and whether each physician requires their own KBP model to meet their planning preferences. Six HN KBP models trained on volumetric modulated arc therapy (VMAT) plans were built for this purpose and tested by re-optimizing an independent test set of VMAT plans. Models were built with differences in model size and physician plan composition to evaluate the effect of size and potential differences when using specific patient cohorts for training.

KBP training
Models were built in a commercial KBP solution (RapidPlan, ver. 16.1, Varian Medical Systems, Palo Alto, California). The makeup of the models is listed in Table 1. The "KBP Full" model was trained with 203 HN patients treated at our institution with VMAT between 2013 and 2019. Patients were treated with simultaneous integrated boost with 2−3 target levels, with the Planning Target Volume (PTV) High receiving 6000−7000 cGy in 30−35 fractions. The "KBP 101", "KBP 50", and "KBP 25" models were trained by selecting 101, 50, and 25 of these 203 patients, respectively, approximately half of which were from each of physician "A" and physician "B".
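The subset construction described above can be sketched as follows. This is a minimal illustration only: the text states the subset sizes and the approximately even split between physicians A and B, but not how plans were picked, so the random draw, the seed, the toy database, and the names `Plan` and `draw_balanced_subset` are all assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    plan_id: int
    physician: str  # "A" or "B"


def draw_balanced_subset(plans, size, seed=0):
    """Hypothetical helper: draw `size` plans, about half from each physician."""
    rng = random.Random(seed)
    from_a = [p for p in plans if p.physician == "A"]
    from_b = [p for p in plans if p.physician == "B"]
    n_a = size // 2      # floor of half from physician A
    n_b = size - n_a     # remainder from physician B
    if n_a > len(from_a) or n_b > len(from_b):
        raise ValueError("not enough plans from one physician")
    return rng.sample(from_a, n_a) + rng.sample(from_b, n_b)


# Toy database of 203 plans; physician labels are assigned arbitrarily here
# and do not reflect the study's actual case mix.
database = [Plan(i, "A" if i % 2 == 0 else "B") for i in range(203)]
subsets = {n: draw_balanced_subset(database, n) for n in (101, 50, 25)}
```

A real selection would likely also balance diagnosis, as the text notes the models were matched for makeup where possible.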
The "KBP A" and "KBP B" models were trained with 101 plans each, but exclusively consisted of patients treated by physician A and physician B, respectively. Where possible, the models were matched to have similar makeup in terms of diagnosis (see Table 1). Prescriptions by physicians A and B were evaluated, and it was found that physician A had stricter limits on cord Dmax by 700 cGy. Physician B had stricter constraints on brainstem, optics, oral cavity, constrictors, submandibulars, and lips by 500−1500 cGy. Quality filtering was performed on the models as described in the introduction.
As shown in previous literature, optimization objectives strongly influence resulting plans.[14] In this study, the aim was to determine plan differences based on the training set and not on optimization objectives. Therefore, one set of optimization objectives was developed and fine-tuned using the KBP Full model: 39 independent test plans were re-optimized using the initial optimization objectives of the KBP Full model. To avoid biasing the results, the 39 test patients were taken from a cohort of patients treated by a third head-and-neck physician at our institution. Differences to the clinical plans were compared in terms of DVH parameters for targets and OARs. The resulting optimization objectives were then set for all six KBP models and are shown in Table 2. In other words, optimization objectives were identical between the KBP models except for the generated line objectives. Because the line objectives are generated from the training data set, any differences in the resulting plans can be attributed to differences in the underlying training set.

Comparison of plans
To analyze the differences resulting from the different training sets, the 39 test patients from a third physician were then re-planned using each of the six KBP models. For comparability, all KBP plans were normalized to the same PTV High V100% as the clinical plan (typically, but not always, V100% = 95%). Dosimetric parameters were obtained for Body, PTV High/Int/Low, brainstem, cochlea, constrictors, cord, eyes, mandible, larynx, optic chiasm, optic nerves, oral cavity, parotids, and submandibular glands. For further plan evaluation, conformity index (CI) (prescription volume/target volume),[27] homogeneity index (HI) ((D2% − D98%)/D50%),[28] and body V100%, V50%, V20%, and V5% were calculated. These parameters were compared for clinical plans and KBP Full plans to evaluate the overall dosimetric quality of the KBP model.
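The indices and the normalization step above can be illustrated with a short sketch. It assumes each structure is represented by a flat list of voxel doses in cGy with equal voxel volumes, and it models normalization as a single global rescaling. The treatment planning system's actual implementation is not described in the text, and all function names and the toy dose values are hypothetical.

```python
import math


def dose_at_volume(doses, pct):
    """D_pct%: minimum dose received by the hottest pct% of equal-volume voxels."""
    ranked = sorted(doses, reverse=True)
    k = max(1, math.ceil(len(ranked) * pct / 100))
    return ranked[k - 1]


def volume_at_dose(doses, threshold):
    """Percentage of voxels receiving at least `threshold`."""
    return 100.0 * sum(d >= threshold for d in doses) / len(doses)


def homogeneity_index(target_doses):
    """HI = (D2% - D98%) / D50%, as defined in the text."""
    d2 = dose_at_volume(target_doses, 2)
    d98 = dose_at_volume(target_doses, 98)
    d50 = dose_at_volume(target_doses, 50)
    return (d2 - d98) / d50


def conformity_index(body_doses, target_voxel_count, prescription):
    """CI = prescription isodose volume / target volume (as voxel counts)."""
    prescription_voxels = sum(d >= prescription for d in body_doses)
    return prescription_voxels / target_voxel_count


def normalization_factor(target_doses, prescription, coverage_pct=95):
    """Scale factor that makes D_coverage% equal the prescription,
    so that V100% equals coverage_pct after rescaling."""
    return prescription / dose_at_volume(target_doses, coverage_pct)


# Toy PTV High: 100 voxels with doses 6901..7000 cGy (made-up values).
ptv_high = [float(d) for d in range(6901, 7001)]
hi = homogeneity_index(ptv_high)
factor = normalization_factor(ptv_high, prescription=7000.0)
normalized = [d * factor for d in ptv_high]
```

In the study, `coverage_pct` would be taken from the clinical plan of the same patient rather than fixed at 95%.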
Differences in the mean parameters averaged over all 39 plans were tested for significance using a paired t-test (p < 0.05). Differences in variance between the sets of plans were tested for significance using the F-test (p < 0.05). Clinical and KBP Full plans were also evaluated by a fourth independent physician. For each patient, the two plans were randomly assigned the blinded labels plan 1 and plan 2. The physician was asked to evaluate whether plans were clinically acceptable and which plan they preferred. If possible, the physician was asked to provide brief reasoning for their choice.
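As a rough illustration of the paired comparison, the sketch below computes the mean difference, the standard deviation of the differences, the t statistic, and a two-sided p-value using only the Python standard library. The p-value is obtained by numerically integrating the Student's t density with Simpson's rule; in practice a statistics package (for example `scipy.stats.ttest_rel`) would normally be used. The F-test is not covered, the function names are hypothetical, and the dose values are made up rather than taken from the study.

```python
import math
import statistics


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def paired_t_test(a, b, steps=20000):
    """Return (mean difference, SD of differences, t, two-sided p) for paired samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD with n - 1 in the denominator
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean / (sd / math.sqrt(n))
    df = n - 1
    # Simpson's rule (steps must be even) for the density integral from 0 to |t|.
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3
    p = max(0.0, 1.0 - 2.0 * central)
    return mean, sd, t, p


# Made-up mean doses in cGy for five patients (not study data).
kbp_plan = [2510.0, 2620.0, 2430.0, 2740.0, 2650.0]
clinical = [2500.0, 2600.0, 2400.0, 2700.0, 2600.0]
mean_diff, sd_diff, t_stat, p_value = paired_t_test(kbp_plan, clinical)
significant = p_value < 0.05
```

The pair (`mean_diff`, `sd_diff`) corresponds to the "average difference ± standard deviation of differences" format used for the reported results.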
To analyze the effects of increasing KBP model size, a dosimetric analysis was then performed for KBP Full versus KBP 101, KBP 50, and KBP 25. Finally, the influence of physician preferences when building KBP models was investigated by a dosimetric comparison of KBP A to KBP B generated plans. As described above, dosimetric differences were tested for significance using a paired t-test and F-test (p < 0.05).

RESULTS
An overview of all dosimetric results is given in Tables 3 and 4. Table 3 gives the average value and standard deviation for all dosimetric parameters across the entire patient cohort for the clinical plans and the KBP Full, KBP 101, KBP 50, KBP 25, KBP A, and KBP B plans. Table 4 gives the differences in the average DVH parameters between the clinical plans and KBP Full; between KBP Full and KBP 101, KBP 50, and KBP 25; and between KBP A and KBP B. Table 4 also gives the standard deviation of the differences and highlights whether differences were statistically significant.

Overall, the KBP Full model was demonstrated to produce plans of at least similar dosimetric quality to those of the human planners. Aggregate differences over 39 patients in the analyzed parameters ranged from −74 cGy (clinical plans were better) to +352 cGy (KBP Full plans were better).

Overall KBP model quality
The blind review by an independent physician showed that the physician preferred KBP Full plans over the manual clinical plans in 20/39 cases (51.3%). When KBP Full plans were preferred, the physician explained the choice with improved OAR sparing in 19/20 cases and a reduced hotspot in 1/20 cases. When the manual clinical plan was chosen, the physician noted OAR sparing as the reason in all 19 cases, with the submandibular glands being most common (11 cases) and the parotid glands second most common (three cases). The physician considered 5/39 manually generated clinical plans and 7/39 KBP Full plans not acceptable. The lack of OAR sparing was given as a reason for all but one case. Lack of PTV High coverage was given as the reason for the other case (both clinical and KBP Full).

Effect of KBP model size
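The dosimetric comparison of the KBP Full and KBP 101, KBP 50, and KBP 25 models is illustrated in Figure 2. The significant PTV High V105% difference of −1.9 ± 5.1% indicates the PTV High was on average slightly hotter in the KBP 101 model. For OARs, left cochlea mean dose and mandible max dose were significantly increased in the KBP 101 model by 34 ± 127 cGy and 29 ± 69 cGy, respectively. Mean larynx dose was reduced by 51 ± 91 cGy in the KBP 101 model. No other differences were statistically significant. With aggregate differences ranging from −34 cGy to +51 cGy, differences between KBP Full and KBP 101 seem minimal.

Compared to KBP Full, the KBP 50 plans significantly increased right cochlea Dmean by 61 ± 172 cGy and left submandibular Dmean by 134 ± 249 cGy, but significantly decreased larynx Dmean by 72 ± 157 cGy. No other changes were significant.

Several OAR doses were significantly different between KBP Full and KBP 25. KBP 25 plans significantly increased Dmean to constrictors, left submandibular, and lips by 109 ± 184, 187 ± 260, and 136 ± 78 cGy, respectively. Surprisingly, Dmean for left and right cochlea was significantly reduced by 355 ± 620 and 388 ± 639 cGy, respectively, in KBP 25.

Overall, it appears that training size differences between the different KBP models resulted in only small differences between plans when compared to the prescription dose. No model appeared to be clearly better or worse than the others. The cochlea in the KBP 25 model represents a notable exception.

Effect of physician preferences

Apart from the significant differences marked in Figure 3, differences between KBP A and KBP B plans were not significant. Aggregate differences between KBP A and KBP B ranged from −79 to +110 cGy and were thus relatively small compared to the target dose of 6000−7000 cGy.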

DISCUSSION
In this work, six KBP models were trained for head-and-neck to explore multiple questions. Models were trained with varied model sizes and plan compositions and then used to re-plan a set of 39 test plans. Optimization objectives were identical except for the line objectives, which depend on the training set. Plans resulting from the KBP models were compared to manually generated clinical plans and to each other.

Overall KBP model quality
The dosimetric comparison of clinical and KBP Full plans showed KBP was able to significantly reduce the dose to the cochlea and constrictors as well as the larynx mean dose. On the other hand, KBP Full increased left parotid mean dose by 74 ± 177 cGy on average. These results imply that KBP was able to match clinical plans dosimetrically but has the potential for further improvement with a refined model.
A reduction in the variability of the cord max dose in KBP Full plans was the only significant difference when looking at planning variability. In this study, variability for a DVH parameter was evaluated over the entire patient cohort and may thus be dominated by variability across patients. The variability among human planners for the same plan could not be assessed as part of this study.
The blind review of clinical and KBP Full plans showed the physician preferred KBP plans in 20/39 cases. The physician deemed 7/39 KBP Full plans unacceptable compared to 5/39 manual plans. It was suggested to improve the parotid and/or submandibular dose in the cases where the manual plan was preferred or KBP Full plans were deemed unacceptable. This suggestion is in line with the dosimetric results, which showed a small but significant increase of left parotid dose in KBP Full plans. Plans from this model also appear to be slightly hotter (in terms of V105%) than the clinical plans, representing another issue that could be addressed. Of note, the KBP Full plans represent the direct output without any human intervention. In a clinical environment, a human planner could refine the initial KBP output to improve specific DVH constraints.
Nevertheless, future iterations of our KBP Full HN model will aim at improving parotid and submandibular dose as well as hot spots. This could be achieved, for example, by increasing priorities on the existing objectives, adding optimization structures for parts that do not overlap targets, and changing the model to account for ipsilateral and contralateral OARs rather than left and right location. Further, retraining the KBP model with KBP generated plans could potentially enhance resulting plan quality, as has been demonstrated in previous literature.[29] Despite the potential for improvement, the dosimetric results and blinded physician review indicate the model is suitable for clinical application and produces plans that are at least of a similar quality to plans that were manually generated in previous treatments.

Effect of KBP model size
The effect of model size was evaluated by comparing the plans generated by the KBP Full model, consisting of 203 training plans, to models trained with 101, 50, and 25 plans. Average differences to KBP Full were at most 51, 134, and 187 cGy for the KBP 101, KBP 50, and KBP 25 models, respectively. A notable exception was observed in the KBP 25 model, where average cochlea mean dose across the 39-patient test cohort was reduced by up to 388 cGy. This is an unexpected result given the general assumption that increasing the number of training plans will result in better KBP generated plans. This behavior may be a consequence of only using line objectives for the cochlea in our models. We noticed two effects at play in the KBP 25 model. First, the DVH estimation was lower than in the KBP Full model, indicating this particular subset of the training cohort had a lower cochlea dose. Second, the standard deviation of the DVH estimation was larger. In the chosen KBP solution, both of these lead to the generation of a stricter line objective and thus a lower cochlea dose in the end. Although the KBP Full model already produces a significantly lower cochlea dose than the clinical plans, this indicates that it could be further improved for this OAR.
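The two effects described above can be made concrete with a toy sketch. It assumes, purely for illustration, that the line objective follows the lower edge of the DVH estimation band, modeled here as the mean estimate minus k standard deviations at each dose level. The vendor's actual algorithm is not described in the text, and the function name, the parameter `k`, and all numbers are hypothetical.

```python
def line_objective(mean_dvh, sd_dvh, k=1.0):
    """Hypothetical rule: objective volume (%) = mean estimate - k * SD, floored at 0."""
    return [max(0.0, m - k * s) for m, s in zip(mean_dvh, sd_dvh)]


# Estimated volume (%) at three increasing dose levels; made-up values.
full_mean, full_sd = [80.0, 50.0, 20.0], [5.0, 5.0, 5.0]
small_mean, small_sd = [75.0, 45.0, 15.0], [10.0, 10.0, 10.0]  # lower mean, wider band

full_objective = line_objective(full_mean, full_sd)     # [75.0, 45.0, 15.0]
small_objective = line_objective(small_mean, small_sd)  # [65.0, 35.0, 5.0]
```

Under this assumption, a lower mean estimate and a larger standard deviation both move the objective toward smaller volumes, which the optimizer treats as a stricter constraint.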
Besides the cochlea, aggregate dose differences across the different model sizes were small compared to the prescription dose. We therefore conclude that KBP model size did not have a substantial impact on the resulting plan quality in terms of dosimetric parameters. An important addendum to this conclusion is that KBP users need to understand the minimum requirements for training their KBP models. For example, the KBP solution chosen for this study needs at least 20 instances of an OAR to be able to create DVH estimations. Without this minimum number, no DVH estimations can be created, likely leading to an increase in OAR dose for KBP plans.
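A simple pre-training check for this requirement could look like the sketch below. The threshold of 20 comes from the text; the data layout (one list of contoured structure names per training plan), the structure names, and the helper `structures_below_minimum` are assumptions made for illustration.

```python
from collections import Counter

MIN_INSTANCES = 20  # minimum OAR instances stated in the text for the KBP solution used


def structures_below_minimum(training_plans, required, minimum=MIN_INSTANCES):
    """Return {structure: count} for required structures with too few instances."""
    # Count each structure at most once per plan.
    counts = Counter(name for plan in training_plans for name in set(plan))
    return {name: counts[name] for name in required if counts[name] < minimum}


# Toy training set of 25 plans: every plan has a cord contour,
# but only the first 12 include the left cochlea.
plans = [["cord", "cochlea_left"] if i < 12 else ["cord"] for i in range(25)]
too_few = structures_below_minimum(plans, required=["cord", "cochlea_left", "larynx"])
```

Such a check matters most for small models like KBP 25, where rarely contoured OARs can fall below the minimum even though the total number of plans exceeds it.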
The results from the KBP models of different sizes confirm the findings of another published study that investigated the effects of model size in KBP for the prostate with models comprising 31, 66, and 97 patients.[26] From the results of the present and the previous study, it seems a prudent conclusion that clinics looking to employ KBP can start by training a small model and add to the model over time without having to expect a drastic change in resulting plans after retraining.

Effect of physician preferences
Finally, the dosimetric comparison of KBP models generated by plans from two different physicians also showed minimal differences. Despite differences of 500−1500 cGy in prescribed OAR dose, KBP plans had aggregate differences of 110 cGy or less. As a limitation, only two physicians could be compared in this study. Further, the plans in the training sets were not fully independent of each other because of an overlap of planners. Future studies could include more physicians and categorize models additionally by the planner. Finally, full physician review of the plans generated by the different KBP models would strengthen the analysis but was beyond the scope of this study. As it stands, however, the results are interpreted as an argument for training larger KBP models that encompass a variety of patients, physicians, and planners rather than creating specific models for each physician and planner. As an added benefit, this reduces the effort needed for maintenance and continued improvement of KBP models in the clinic. If physicians have differing opinions on clinical tradeoffs, these could be incorporated manually by the planner as optimization objectives rather than through the arduous task of training and testing separate KBP models. Further, this also provides evidence that clinics lacking a sufficiently large database of previous plans could import a KBP model from an outside source and adjust optimization objectives according to their planning philosophy.

CONCLUSIONS
Several head-and-neck KBP models were developed and compared. The analysis showed that training models with different head-and-neck planning databases yielded only small dosimetric differences in the resulting KBP plans. Therefore, clinics looking to implement KBP are encouraged to train models that do not separate between different planners or physician preferences. Alternatively, KBP models can be imported from an outside source. With adequate optimization objectives, these models will be able to generate plans suited to each clinic's demands. As more planning data becomes available over time, these models can be retrained using the larger database without concerns of drastic changes in resulting plan quality. These findings reduce the burden of the initial roll-out of KBP into the clinic.

FIGURE 1 Boxplot of dosimetric parameters for organs-at-risk comparing manual clinical plans to plans generated by KBP Full. Boxes represent quartile groups 2 and 3 separated by the median line for the entire cohort of plans. Whiskers denote minimum/maximum within 1.5 times interquartile range. White squares show average values and black diamonds outliers. Asterisks (*) and brackets highlight statistical significance (p < 0.05) in paired t-test.

FIGURE 2 Boxplot of dosimetric parameters for organs-at-risk comparing plans generated by KBP Full to those generated by KBP 101, KBP 50, and KBP 25. Statistically significant differences (p < 0.05) in the paired t-test to KBP Full are highlighted with brackets and asterisks (*).

FIGURE 3 Boxplot of dosimetric parameters for organs-at-risk comparing plans generated by KBP A to those generated by KBP B. Statistically significant differences in paired t-test are shown with asterisks (*) and brackets (p < 0.05).

TABLE 1 Plan composition. Columns: Diagnosis, KBP Full, KBP 101, KBP 50, KBP 25, KBP A, KBP B, Test set.

TABLE 2 Optimization objectives. Note: These optimization objectives were set for all six KBP models. Therefore, the only changes in optimization objectives were due to the generated line objectives.

TABLE 3 Dosimetric parameters of all plans. Columns: Parameter, n, Clinical, KBP Full, KBP 101, KBP 50, KBP 25, KBP A, KBP B. Note: Overview of dosimetric parameters across the entire patient cohort for the clinical plans and plans re-planned with the KBP models.

TABLE 4 Dosimetric differences between plans. Note: The table compares the dosimetric parameters between the clinical plans and different KBP plans. Values shown represent the average difference across all 39 patients ± the standard deviation of all (up to) 39 differences. Negative average values thus indicate the former plan cohort having a lower dose value. Asterisk (*) and dagger (†) highlight significant differences (p < 0.05) from paired t-test and F-test, respectively. p-values are given in parentheses when significant.