Knowledge-based Radiation Treatment Planning: A Data-driven Method Survey

This paper surveys the data-driven dose prediction approaches introduced for knowledge-based planning (KBP) in the last decade. We classify these methods into two major categories according to how they utilize previous knowledge: traditional KBP methods and deep-learning-based methods. Studies that required geometric or anatomical features, either to find the best-matched case(s) in a repository of previously delivered treatment plans or to build prediction models, are placed in the traditional category, whereas the deep-learning-based category comprises studies that trained neural networks to predict dose. A comprehensive review of each category is presented, highlighting key parameters, methods, and their outlook in terms of dose prediction over the years. Within each category, the cited works are separated according to framework and cancer site. Finally, we briefly discuss the performance of both traditional KBP methods and deep-learning-based methods, and future trends of both data-driven KBP approaches.


Introduction
Cancer is the second-leading cause of death in North America, with the most common types being cancers of the lung, breast, and prostate [8]. Radiation therapy (RT), chemotherapy, surgery or a combination of these are used to control the disease. Approximately 50% of all cancer patients undergo RT during the course of their illness [53], which makes RT a crucial component of cancer treatment. In terms of the clinical usefulness and effectiveness of RT, the transition from conformal RT to intensity-modulated radiation therapy (IMRT) has significantly advanced the two-fold dosimetric goal of improving organ-at-risk (OAR) sparing while maintaining target dose homogeneity and conformity. Algorithmic advancements have also played major roles in enhancing the efficiency of RT treatments. These include the transition from forward to inverse treatment planning and the extension of IMRT to volumetric modulated arc therapy (VMAT). However, despite the use of complex inverse optimization algorithms, inverse planning typically demands a large amount of manual intervention to generate a high-quality treatment plan with the desired dose distribution, and it can take up to a few days before the patient receives the first fraction of RT. To further enhance treatment planning efficiency, there has been significant progress in the development of data-driven treatment planning approaches, which utilize knowledge from the past to predict the outcome of a similar, yet new, task. In treatment planning, this concept was introduced over a decade ago in the form of knowledge-based planning (KBP). KBP utilizes a large number of previously optimized plans to build a mathematical model or an atlas-based repository that can be used to predict the dosimetry (i.e., dose-volume metrics, dose-volume histogram (DVH), spatial dose distribution, etc.) for a new patient plan.
In 2014, one of the traditional KBP approaches was made commercially available as RapidPlan™ in the Varian Eclipse treatment planning system (Varian Medical Systems, Palo Alto, CA). In the past few years, another data-driven approach, namely deep learning (DL), has been gaining popularity in the field of radiation oncology for outperforming many state-of-the-art techniques [23,24,36,44,70,71,72,73,74,80,81,135,140,141,142,143]. For instance, convolutional neural networks (CNNs), a class of deep neural networks (DNNs) built on regularized multilayer perceptrons, have significantly enhanced the performance of imaging and vision tasks. U-Net [110], a complex architecture originally designed for image segmentation, has recently been shown to predict dose distributions without going through a treatment planning process [27,60,99,131]. Although there is a review paper summarizing the articles on traditional KBP methods published between 2011 and 2018 [38], to our knowledge there is no review paper specific to data-driven dose prediction approaches that covers both traditional and recently introduced DL-based KBP.
A key difference between traditional and DL-based KBP is the way in which previous knowledge is utilized. In general, traditional KBP methods require the user to supply geometric features, such as the overlapping volume between the planning target volume (PTV) and neighboring OARs, in order either to find the best-matched case(s) in a repository of previously delivered treatment plans or to build dose prediction models (i.e., machine learning (ML) or statistical models) [163]. DL methods, on the other hand, can learn patterns hidden within the raw data without any manual feature extraction, which makes DL a more appealing KBP technique than the traditional methods. It is important to note that ML-based approaches are included in the traditional KBP category in this review because they follow a framework similar to that of other traditional KBP methods in terms of inputs (geometric features) and outputs (dose-volume metrics or DVHs). The traditional KBP methods thus include atlas-based, statistical modelling and ML methods. While a previous review summarizes these traditional approaches by method, the current work reviews the recently emerging DL-based methods as well as the traditional KBP methods from the standpoint of various key parameters and their influence on dose prediction tasks. The goal of this review, therefore, is to present the success of traditional KBP methods thus far and to highlight the potential of recent DL-based methods in dose prediction tasks. We separate data-driven treatment planning approaches into two categories: traditional KBP methods and DL-based methods. For each category, we first present a review of key parameters and methods. Subsequently, we present a review of specific investigations and the influence of various parameters on dose prediction.
Finally, we discuss the advantages and challenges of each dose prediction technique followed by highlighting the potential future trends in data-driven dose prediction methods.

Literature search
We searched for papers using Elsevier Scopus, Web of Science, PubMed, Google Scholar and the medical physics category of arXiv.org, using logical statements that included the following keywords: knowledge-based treatment planning, ML, DL, dose prediction, RapidPlan, treatment planning automation, artificial neural network (ANN), convolutional neural network and generative adversarial network.

Article selection criteria
Only peer-reviewed research articles were included in this review. Each article found during the literature search was first screened manually on the basis of its abstract, followed by an in-depth review of the selected articles. Articles that described their methodology and reported comparable or improved dose prediction quality or efficiency were included. Retrospective studies based on the commercial KBP approach, RapidPlan™, were also considered. Articles on external beam radiation therapy (IMRT, VMAT, tomotherapy, proton therapy, etc.) were included, whereas articles on brachytherapy were excluded. Articles on the prediction of patient-specific quality assurance of a treatment plan are not reviewed. In this review, the term dose prediction covers prediction of the entire DVH curve, dose metrics (i.e., dose-volume parameters, mean or maximum dose), voxel dose, spatial dose distribution (either slice by slice in 2D or as a 3D dose distribution), and objective weights/constraints based on previous knowledge, as well as the transfer of any of these to a new case for generating an actual plan. Figure 1 shows the number of publications per year as well as cumulative publications for both traditional KBP and DL-based dose prediction. Between 2009 and 2014, there was a gradual increase in the number of publications on traditional KBP dose prediction, in what appears to be the initial development stage of data-driven treatment planning. The curve shows an uplift in the number of traditional KBP articles between 2015 and 2018. The majority of traditional KBP studies in the past few years have been based on the commercial RapidPlan™ rather than on further expansion of earlier ML or statistical methods.
This is certainly not because traditional methods have been fully explored and have exhausted their research potential, but presumably because of the recent emergence of DL-based methods, owing to their flexibility and superior performance compared to many state-of-the-art techniques. In the past few years, the number of publications on DL-based image processing has increased exponentially. To expand the horizons of DL-based applications, researchers have already begun to explore its potential for dose prediction tasks. In the last four years, the number of DL-based dose prediction publications has gone from 1 in 2016 to 15 in 2020, as can be seen in Figure 1. The trend appears to show a higher rate of publication on DL-based than on traditional KBP in the current year.

Knowledge Based Planning
This review includes over 90 articles on traditional dose prediction methods. These traditional KBP approaches can be classified into two categories: I) atlas-based and II) model-based. In atlas-based approaches, a physical parameter (i.e., overlap volume histogram (OVH), beam's-eye view projections, tumor location, etc.) is first identified to determine the similarity between previous patients' plans and a new patient plan. This is followed by a transfer of knowledge (i.e., dose constraints, DVH values, beam geometrical parameters, DVHs of best-matched cases) to predict achievable DVHs or to provide the treatment planner with a better starting point for further trial-and-error optimization. Within atlas-based methods, an indirect approach first predicts the dosimetric parameters through models and features, which are then used to select matching cases, whereas a direct approach predicts a similarity parameter directly from features of the plan, CT images or beam's-eye view (BEV) projections. In model-based approaches, statistical or ML models are built from previously approved treatment plans. These methods require manually handcrafted features, such as PTV-OAR overlap volume, OVH values and OAR distance-to-PTV, to predict the DVH using different regression models. In this review, we categorize traditional KBP dose prediction articles into three groups according to whether they predict: I) entire DVHs (Table 1), II) one or more dose-volume metrics (Table 2), or III) voxel doses (Table 3). The articles listed in Table 1 aim to predict the entire DVH for a new patient case and use the predicted DVHs to guide treatment planning for that patient. The commercially available RapidPlan™ module also estimates DVH metrics and generates objectives for a new plan; studies using it are also included in Table 1. Table 2 lists the articles that aim to predict one or more dose metrics in order to guide treatment planning for a new case.
Table 3 lists the publications that aim to predict voxel-level dose distributions, either to assist in optimizing a new plan or to automatically generate an actual new plan. Figure 2 shows the total number of investigations on traditional KBP methods for various treatment sites. As anticipated, prostate, head/neck and lung cancers were amongst the most frequently investigated cancer sites, whereas very few investigations have been conducted on complex sites such as abdominal, intracranial and thoracic.
In this section, we first provide an overview of key concepts involved in traditional KBP methods. Subsequently, we present a review of different metrics and their extension over the years. Finally, we summarize the influence of different parameters on the performance of traditional methods in dose prediction tasks.
3.1.1 Dimensionality reduction
Although it is desirable to have more data for implementing different models, too many variables can be redundant or irrelevant and may result in overfitting, reducing a model's generalizability. Therefore, dimensionality reduction methods were used in the majority of traditional KBP studies to decrease the number of variables. The two main components of dimensionality reduction are feature extraction and feature selection. Feature extraction begins with an initial set of features and redefines them with the intention of making them more informative. Principal component analysis (PCA) is one of the most commonly used reduced-order modeling techniques in model-based approaches. PCA determines the features that retain most of the variation among the data [106], so that the data can be represented with a smaller number of dimensions. For example, consider a binary classification problem in which an object A, represented by P features in a P-dimensional vector, must be assigned to one of two classes. If P is large, some features may be more valuable than others for the purpose of classification. The goal of PCA is to reduce the dimensionality of a dataset consisting of interrelated variables to a smaller set of mutually uncorrelated variables [106]. Feature selection involves choosing the valuable features from those at our disposal. In many traditional KBP studies, PCA is used in the process of feature selection [9,11,13,16,29,32,33,34,50,76,78,114,126,127,129,136,137,154,159,161,162,165,170].
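As a minimal sketch of the PCA step described above, the following NumPy example projects a small feature matrix onto its leading principal components. The function name `pca_reduce` and the toy data (six hypothetical plans with three geometric features, two of them strongly correlated) are assumptions made for illustration and are not taken from any of the cited studies.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project the rows of `features` onto the top principal components."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T  # mutually uncorrelated new variables
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return scores, components, explained[:n_components]

# Hypothetical toy data: 6 plans x 3 features, where feature 2 is almost
# a scaled copy of feature 1 (i.e., largely redundant).
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
toy = np.hstack([
    base,
    2.0 * base + 0.01 * rng.normal(size=(6, 1)),
    0.1 * rng.normal(size=(6, 1)),
])

scores, components, explained = pca_reduce(toy, n_components=2)
print(scores.shape)  # (6, 2)
print(explained)     # fraction of total variance kept by each component
```

Because the first two features are nearly collinear, most of the variance should fall on the first component, which is the situation in which reducing the number of variables loses little information.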
3.1.2 Various features/metrics
A common theme in the majority of traditional approaches is that the optimality of the desired plan is strongly influenced by the geometry of critical structures with respect to the PTV. Commonly reported geometric features include the OVH, the distance-to-target histogram (DTH) and the OAR distance-to-PTV. The influence of parotid size and proximity to the PTV on the dosimetric sparing of the parotid was first studied by Hunt et al. [49]. In addition to geometric features, plan features such as PTV-OAR volumes, mutual information including beam's-eye view projections, the number and angles of specified beams and the photon energy have also been utilized in traditional KBP studies. These key parameters are tabulated in Tables 1, 2 and 3 along with their corresponding references.
Overlap volume histogram (OVH) based methods
The OVH was introduced to study the influence of an OAR's proximity to the target on the dose it receives. It is one of the most frequently used metrics in both atlas-based and model-based approaches, as can be seen in Table 1 [59,148]. The OVH calculation involves uniform expansion and contraction of the target: the target is expanded until the OAR lies completely within it and contracted until there is no overlap between the target and the OAR [148]. In other words, the OVH gives the percentage of the OAR's volume that overlaps with a uniformly expanded or contracted target. In general, OVH-driven models assume that the dose to an OAR is inversely proportional to its distance from the target.
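The definition above can be sketched on voxel masks as follows. This is an illustrative approximation and not the implementation of [59] or [148]: instead of explicitly expanding and contracting the target, it assigns each OAR voxel a signed distance to the target (negative inside, positive outside) by brute force between voxel centres, and reports the fraction of OAR voxels within each distance. The function name, the isotropic `spacing` parameter and the toy masks are assumptions; the sketch also assumes a non-empty OAR and a target that does not fill the whole grid.

```python
import numpy as np

def overlap_volume_histogram(target_mask, oar_mask, distances, spacing=1.0):
    """Fraction of OAR voxels overlapped by the target expanded by r
    (r > 0) or contracted by |r| (r < 0), for each r in `distances`."""
    target_mask = np.asarray(target_mask, dtype=bool)
    oar_mask = np.asarray(oar_mask, dtype=bool)
    oar_idx = np.argwhere(oar_mask).astype(float)
    target_idx = np.argwhere(target_mask).astype(float)
    outside_idx = np.argwhere(~target_mask).astype(float)

    def min_distance(points, reference):
        # Brute-force nearest-voxel distance; fine for small toy grids only.
        diff = points[:, None, :] - reference[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1) * spacing

    inside = target_mask[oar_mask]  # same C-order as np.argwhere(oar_mask)
    signed = np.empty(len(oar_idx))
    if inside.any():
        signed[inside] = -min_distance(oar_idx[inside], outside_idx)
    if (~inside).any():
        signed[~inside] = min_distance(oar_idx[~inside], target_idx)
    return np.array([(signed <= r).mean() for r in distances])

# Hypothetical 1 x 12 grid: target in columns 0-3, OAR in columns 3-7.
target = np.zeros((1, 12), dtype=bool)
target[0, 0:4] = True
oar = np.zeros((1, 12), dtype=bool)
oar[0, 3:8] = True

ovh = overlap_volume_histogram(target, oar, distances=[-1, 0, 1, 2, 4])
print(ovh)
```

For clinical-size volumes a distance transform would replace the brute-force search; the quadratic pairwise computation is used here only to keep the sketch dependency-free beyond NumPy.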
A large array of studies have combined historical data with OVH methods to predict the entire DVH (Table 1) and one or more dose metrics (Table 2). Wu et al. applied the OVH to quality control of head/neck IMRT treatment plans to help planners with plan evaluation [148]. This was followed by the use of the OVH to generate achievable DVH objectives for head and neck cancer cases [149]. With a model based on OVH [59] and PCA [161,170], Wang et al. investigated the effect of interorgan dependency and the impact of data inconsistency [146]. Larger prediction errors were found for the head/neck region (<4 Gy for 83% of the cases) than for a similar model applied to prostate cases (<2 Gy for 96%), presumably due to interorgan dependency [146]. Moore et al. also used OVH information to predict OAR dose metrics for head/neck and prostate IMRT plans [97]. Yuan et al. used the OVH metric to quantify the effects of an array of patient anatomical features of the PTV and OARs, and of their spatial relationship, on interpatient OAR dose sparing in IMRT. They found the mean distance between OAR and PTV, the mean volume between OAR and PTV, the out-of-field volume of OARs and the geometric relationship between multiple OARs to be important factors contributing to organ dose sparing [161]. For multiple OARs, using separate OAR-specific prediction models was found to predict voxel doses more accurately than training a single model on all OAR voxels [64].
The success of OVH-based prediction primarily rests on the observation that the minimum achievable dose to an OAR depends on its distance and orientation relative to the PTV. The OVH-based model [149] has nevertheless also been investigated for pancreatic cancer, in which the OARs are large compared with the tumor, parts of the OARs can engulf the PTV, and highly deformable organs can cause beam configurations to vary among patients [107]. Petit et al. showed that the OVH-based predicted doses were achieved within 1 and 2 Gy for more than 82% and 94% of the patients, respectively, with improvements of 1.4 Gy and 1.7 Gy in mean dose to the liver and kidneys, respectively. To further investigate the capability of the OVH parameter, the global shift of the OVH after hydrogel injection was quantified to represent the efficacy of the injection in separating the rectum from the PTV. The OVH was found to be a better metric for rectum sparing than the hydrogel volume [158]. Wang et al. used the OVH to build a treatment planning QA model from consistently planned Pareto-optimal plans for prostate cancer, improving planning standardization and preventing validation against possibly suboptimal benchmark plans [145]. In earlier OVH-driven studies, large variations in IMRT dose at a given OVH distance for a specific fractional volume of an OAR were reported [152,161]. To address this disparity in the distance-to-dose correlation, Wall et al. studied inherent inter-planner variations in the plan quality of previous plans, together with second-order dosimetric and anatomical factors. Out of all factors, in-field bladder and rectal volume showed the strongest correlation (R = 0.86 and R = 0.76) with doses. Therefore, the in-field OAR volume was incorporated into the previously OVH-only metric [134]. The generic OVH introduced by Kazhdan et al. directly infers a DVH rather than a spatial dose distribution [59]. With a multi-patient atlas-based dose prediction approach, McIntosh and Purdie demonstrated that incorporating spatial information into the model can improve dose prediction accuracy in comparison to the generic OVH method. Incorporating spatial information was found to matter less for breast cavity and lung, whereas it improved prediction accuracy for whole breast, rectum and prostate cancer [89].

Figure 2. The total number of traditional KBP investigations on dose prediction for various cancer sites. EC = esophageal cancer; NC = nasopharyngeal carcinoma; HC = hepatocellular cancer.

Table 2. Traditional KBP studies that aimed to predict one or more dose metrics for providing a starting point for the plan optimization process.

OVH = overlap volume histogram; DTH = distance-to-target histogram; AB = atlas based; MB = model based.

Projection based methods
These algorithms typically rely on matching 2D images, namely the beam's-eye view (BEV) projections at each corresponding gantry angle, based on statistical properties of the image histograms. The best-matched case is generally identified from the sum of the mutual information values of the BEV projections over all beam angles involved. This method has been used for prostate [12] and head/neck cancer [112]. Good et al. calculated mutual information to identify the best match for the query case. The PTV projections of the matched case were deformed to the query case's PTV projections at each angle to adjust for shape differences between the PTVs of the query and matched cases. This approach reduced doses to the OARs and improved target dose conformity and homogeneity in KBP-generated plans compared to the original plans [40].
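The matching rule described above (sum the mutual information between corresponding projections over all beam angles and pick the library case with the largest sum) can be sketched as follows. The histogram-based estimator, the bin count, the function names and the random arrays standing in for BEV projections are all assumptions made for illustration; the cited studies may compute and normalize mutual information differently.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=8):
    """Histogram-based mutual information (in nats) between two images."""
    joint, _, _ = np.histogram2d(np.ravel(img_a), np.ravel(img_b), bins=bins)
    pxy = joint / joint.sum()            # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)  # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)  # marginal of img_b
    nonzero = pxy > 0                    # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def best_match(query_projections, library):
    """Return the library case whose projections share the most mutual
    information with the query, summed over beam angles."""
    scores = {
        case_id: sum(mutual_information(q, p)
                     for q, p in zip(query_projections, projections))
        for case_id, projections in library.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical stand-ins for BEV projections at two gantry angles.
rng = np.random.default_rng(1)
query = [rng.random((16, 16)), rng.random((16, 16))]
library = {
    "case_same": [q.copy() for q in query],                    # identical anatomy
    "case_other": [rng.random((16, 16)), rng.random((16, 16))],  # unrelated
}

best, scores = best_match(query, library)
print(best, scores)
```

With finite samples the histogram estimate is biased upward, so unrelated images do not score exactly zero; only the ranking between candidate cases is used here.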
Distance-to-target histogram (DTH) based methods
The distance-to-target histogram (DTH) is the fractional volume of the OAR within a certain distance from the PTV surface. This metric, along with the volumes of the PTV and OARs, is typically used as an input feature in ML approaches such as multivariable nonlinear regression (MVNLR) and support vector regression (SVR) [170]. It is important to note that the DTH is equivalent to the OVH [59] when the Euclidean form of the distance function is used. The DTH metric was extended to a generalized distance-to-target histogram (gDTH) by Zheng et al. in order to account for the relative shape distribution of multiple PTVs in head and neck cancer [166]. In comparison to the conventional model, the gDTH model improved DVH prediction accuracy for the brainstem, cord, larynx, mandible, parotid, oral cavity and pharynx [166]. This gDTH model identified similar plans with respect to an individual OAR. To develop a knowledge-based tradeoff hyperplane model that assists with clinical decisions, the gDTH concept was further extended to select similar plans with respect to all OARs by employing a case similarity metric that is a weighted sum of the gDTH Euclidean distances between two cases across all OARs [167]. Finally, the DTH has also been utilized with multivariate regression-based models, an approach that is commercially available as RapidPlan™ in the Eclipse treatment planning software.
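The case similarity metric mentioned above, a weighted sum over OARs of the Euclidean distance between two cases' gDTH curves, can be written in a few lines. The sketch assumes each gDTH is sampled at the same distance bins for both cases; the OAR names, curve values and weights below are hypothetical and are not the values used in [167].

```python
import math

def case_similarity(case_a, case_b, weights):
    """Weighted sum over OARs of the Euclidean distance between the two
    cases' gDTH curves (smaller means more similar)."""
    total = 0.0
    for oar, weight in weights.items():
        curve_a, curve_b = case_a[oar], case_b[oar]
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))
        total += weight * distance
    return total

# Hypothetical gDTH curves sampled at three distance bins per OAR.
query = {"parotid": [0.0, 0.2, 1.0], "cord": [0.0, 0.5, 1.0]}
library = {
    "plan_a": {"parotid": [0.0, 0.5, 1.0], "cord": [0.0, 0.5, 1.0]},
    "plan_b": {"parotid": [0.0, 0.2, 1.0], "cord": [0.0, 0.4, 1.0]},
}
weights = {"parotid": 2.0, "cord": 1.0}  # illustrative OAR weights

distances = {name: case_similarity(query, plan, weights)
             for name, plan in library.items()}
closest = min(distances, key=distances.get)
print(closest, distances)
```

Raising the weight of an OAR makes mismatches in that organ's geometry count more when ranking candidate plans, which is how a single metric can cover all OARs at once.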

Influence of various parameters
Outliers/Data inconsistency
Outlier detection is an important factor to consider when building a data-driven dose prediction model or repository that generalizes to new cases. Outliers can reduce the goodness of fit between geometry and dosimetry, which, in turn, can compromise model performance [97]. Two types of outliers are commonly reported in the literature: dosimetric and geometric. Dosimetric outliers are plans in which OARs are not actively spared or in which dose-volume criteria are violated. In other words, they are plans for which re-planning can significantly reduce OAR dose without compromising target coverage. Geometric outliers, on the other hand, entail large anatomical variations, including in the OAR distance to the PTV. An example of a geometric outlier is a prostate + nodes case included among prostate-only cases. Several studies have investigated the influence of outliers on model performance, as shown in Table 4. Appenzoller et al. described a model to identify outliers in the form of suboptimal plans and showed that excluding outliers in a refined model resulted in a strong correlation between predicted and realized gains after re-planning (r = 0.92 for rectum, r = 0.88 for bladder and r = 0.84 for parotid glands). For head/neck RapidPlan™-based KBP, Delaney et al. analyzed the influence of dosimetric outliers and showed a moderate degradation in model accuracy, attributed to decreased precision of the estimated DVHs [20]. For pelvic cases, Sheng et al. assessed the effectiveness of outlier identification by studying the impact of both geometric and dosimetric outliers. Their study suggested a greater impact of dosimetric outliers, which negatively affected both the bladder and rectum models, than of geometric outliers, which negatively affected only the bladder model [118]. Wang et al. studied the effect of data inconsistency with respect to planning prioritization using a) a mixed training dataset with a consistent validation dataset, b) a consistent training dataset with a mixed validation dataset, c) mixed training and validation datasets and d) consistent training and validation datasets, and found that data inconsistency led to a large increase in prediction error, with error_d < error_c < error_a < error_b [146]. In addition to removing the outliers (i.e., suboptimal plans) from the training cohort [2], an alternative reported in the literature is to re-plan the identified suboptimal plans, for prostate and head/neck cancer [1] and for lung cancer [58], and then include them in the training cohort. The clinically available RapidPlan™ provides different statistical evaluation metrics for identifying outliers, as shown in Table 4.
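As a generic illustration of how dosimetric outliers might be screened (not the model of Appenzoller et al. and not the RapidPlan™ statistics), the sketch below flags plans whose achieved OAR dose exceeds the model-predicted dose by an unusually large margin relative to the rest of the cohort. The z-score rule, the threshold `k` and the dose values are all assumptions.

```python
import statistics

def flag_dosimetric_outliers(achieved, predicted, k=2.0):
    """Indices of plans whose residual (achieved - predicted OAR dose)
    lies more than k standard deviations above the cohort mean residual."""
    residuals = [a - p for a, p in zip(achieved, predicted)]
    mean = statistics.mean(residuals)
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # all plans deviate identically; nothing stands out
    return [i for i, r in enumerate(residuals) if (r - mean) / spread > k]

# Hypothetical mean rectum doses in Gy for eight training plans.
achieved = [20.0, 21.0, 19.0, 20.0, 30.0, 20.5, 19.5, 20.0]
predicted = [20.0] * 8

flagged = flag_dosimetric_outliers(achieved, predicted)
print(flagged)  # [4]
```

A flagged plan is only a candidate for review; as noted above, the literature reports both removing such plans and re-planning them before they are returned to the training cohort.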
Diversities within traditional methods
Many retrospective studies were published after 2014, presumably due to the clinical implementation of a traditional KBP module in the form of RapidPlan™ in the Eclipse treatment planning software. These studies investigated the applicability of traditional methods with respect to variations in external parameters (i.e., multi-modality, multi-institution, sample size, etc.). Here, we present a review of these studies and their findings. Wu et al. (2013) used the DVH objectives derived from previous IMRT plans as optimization parameters for VMAT treatment planning in head/neck cancer, resulting in dosimetric quality similar to that of IMRT plans [151]. Wu et al. demonstrated that a supine VMAT model for rectal plans can optimize IMRT plans of prone patients, yielding better OAR sparing and quality consistency than the conventional treatment planning method [155]. When prediction models trained on helical tomotherapy plans for prostate cancer were used to predict constraints for optimizing new plans with the RapidArc™ technique, the result was comparable or increased bladder and rectum doses compared to the expert planners' plans. Delaney et al. demonstrated that a model based only on photon beam characteristics could make DVH predictions for proton therapy and can be used as a patient selection tool for protons [21]. McIntosh et al. studied the contextual atlas random forest (cARF) algorithm with and without OAR region of interest (ROI) features and found that the algorithm can pick better atlases without ROI features, but cannot then adequately map the dose distribution from those atlases onto a new patient [90]. Huang et al. demonstrated that a RapidPlan™ model for one energy (10 MV) can generate dose-volume objectives for plans with 6 and 10 MV; however, a RapidPlan™ model for flattened beams cannot optimize unflattened beams unless the target objectives are first adjusted [48]. A RapidPlan™ module also has the potential to generate high-quality treatment plans, compared to manually optimized plans, on newly implemented treatment planning software for prostate cancer [87].
For esophageal cancers, a RapidPlan™ model created from plans optimized using RayStation™ produced comparable lung doses [130]. For patients enrolled in Radiation Therapy Oncology Group (RTOG) 0617, Kavanaugh et al. showed the feasibility of a single-institution RapidPlan™ model as a quality control tool for multi-institutional clinical trials, to improve overall plan quality and provide decision support on the need for clinical trade-offs between target coverage and OAR sparing [58]. For prostate cancer, Schubert et al. demonstrated the possibility of sharing models among different institutes in a cooperative framework [114]. For prostate cancer RapidPlan™ models from five different institutions, Ueda et al. suggested that, before models are shared, it is critical to ensure that the DVH curves registered in the models are similar to the receiving institution's plan design. Also for prostate cancer, Good et al. applied a model trained at their institute to generate plans for patient datasets from outside institutions, with the potential for homogenizing plan quality by transferring planning expertise from more to less experienced institutions [40]. The resulting plans were superior or equivalent to the original plans in 95% of 55 test patients [40]. More recently, a disease-site-specific, multi-institutional RapidPlan™ model based on the NRG-HN001 clinical trial was built as an offline quality assurance tool; it improved OAR sparing in a large number of the reoptimized plans submitted to the trial [39].
Sample size
Figure 3 shows the average number of training and test cases for each cancer site in traditional KBP methods, with the standard deviation taken over the number of investigations listed on the top x-axis. In some publications, the training/test sample sizes were not directly stated or were not required by the method described. For RapidPlan™, the minimum number of plans required for model creation is stated to be 20, although adding further plans will usually help create a more robust model [133]. Numerous studies have evaluated the quality of plans generated by RapidPlan™ models trained on high-quality plans and found that 25 to 30 plans may produce clinically acceptable plans for prostate [31] and head/neck [127] cancer. For prostate cancer, Boutilier et al. analyzed the effect of training set size on the accuracy of four models from three different classes: DVH point prediction, DVH curve prediction and objective function weights. The authors concluded that the minimum required sample size depends on the specific model and the endpoint to be predicted [7]. Zhang et al. showed that approximately 30 plans were sufficient to predict dose-volume levels with less than 3% relative error in both head and neck and whole pelvis/prostate [164].
The required sample size also partially depends on the robustness of the model used. Yuan et al. used 64 and 82 cases for prostate and head/neck, respectively, in a support vector regression (SVR) model for DVH prediction [161]. Landers et al. demonstrated that statistical voxel dose learning (SVDL) is more robust to patient variability than spectral regression and SVR for noncoplanar IMRT and VMAT in head/neck, lung and prostate cancer, using 20 cases for each site in 4-fold cross-validation [64]. Atlas-based dose prediction [89] is a more sophisticated method in which each patient in the training set represents one atlas. Feature extraction and characterization are typically performed on the patients' CT images, resulting in probabilistic dose estimates from which the most likely voxel dose is found from similar atlases. In comparison to ANN and SVR methods, large training sample sizes were required for this method (58 for rectal, 77 for lung, 97 for breast cavity, 113 for central nervous system (CNS) brain, 144 for breast and 144 for prostate cancer). Overall, the review of traditional KBP dose prediction publications thus far suggests improved efficiency compared to manual optimization, sufficient flexibility of traditional KBP methods in terms of their applicability (i.e., multi-modality in EBRT), the need to extend these models to more complex sites, the need for an automated approach to handling outliers in order to further enhance treatment planning efficiency, and the potential for building site-specific universal RapidPlan™ models for multi-institutional adoption.

Deep Learning
DL offers numerous advantages and support to personnel of different disciplines in the different steps of radiotherapy treatment planning. An appealing feature of DL methods is that the layers of features are not manually designed but learned directly from raw data. Because DL methods are good at discovering intricate structures in high-dimensional data, they are applicable to a wide range of applications in science [66]. In this section, we provide an overview of the architectures and neural networks that have been applied to dose prediction tasks to date. DL was first applied to dose prediction in the form of ANNs [119]. In these earlier DL-based methods, organ volumes (including PTV and OARs), the number of fields and the distances from OARs to the PTV were used to train an ANN, which was then used to correlate the dose at a given voxel with a number of geometric and plan parameters similar to those used in traditional KBP methods. DNNs are the most commonly used networks in DL-based dose prediction. A DNN resembles a traditional ANN, but with a large number of layers. Therefore, ANN-based studies are included in the DL-based dose prediction category in Table 5, despite their framework being comparable to that of traditional KBP methods. The neurons within each layer are nodes connected to subsequent nodes via links that correspond to biological axon-synapse-dendrite connections, analogous to human nerve cells. The layers embedded between the initial input layer and the final output layer are called hidden layers. The number of layers determines the network's depth, whereas the number of neurons per layer determines its width. Each neuron applies a linear operation followed by a non-linear operation between its input and output. In a layered format, each neuron receives information from the neurons in the previous layer, processes it and passes it to the neurons of the next layer.
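The layered computation described above, in which each neuron applies a linear operation followed by a non-linear one and passes the result forward, can be sketched as a small fully connected network in NumPy. The weights, the choice of ReLU as the non-linearity and the layer sizes are hypothetical values chosen for illustration; the output layer is left linear, as is common for regression targets such as dose.

```python
import numpy as np

def relu(x):
    """Non-linear operation applied after each hidden layer."""
    return np.maximum(x, 0.0)

def forward(x, layers):
    """Feed-forward pass: each hidden layer computes relu(W @ a + b);
    the final layer is linear."""
    activation = np.asarray(x, dtype=float)
    for weights, bias in layers[:-1]:
        activation = relu(weights @ activation + bias)
    weights, bias = layers[-1]
    return weights @ activation + bias

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output.
# Depth is the number of layers; width is the number of neurons per layer.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -1.0])),  # hidden layer
    (np.array([[2.0, -2.0]]), np.array([0.5])),                    # output layer
]

out_a = forward([2.0, 1.0], layers)
out_b = forward([0.0, 1.0], layers)
print(out_a, out_b)  # [1.5] [0.5]
```

In the second call both hidden neurons receive negative pre-activations and are zeroed by ReLU, so only the output bias remains, which shows how the non-linearity makes the response depend on the input region rather than being a single linear map.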
On the other hand, residual connections can be added to connect neurons in non-adjacent layers, as in the ResNet proposed by He et al. [45]. The ResNet architecture has been presented with different numbers of layers: ResNet-18, -34, -50, -101, and -152. Many DNN architectures have been presented for various applications. For dose prediction, CNNs, namely the fully convolutional neural network (FCN) and the fully connected CNN (FCNN), have been used so far. A DL-based generative model, commonly known as the generative adversarial network (GAN), has also been employed to aid the main network (FCN) in predicting dose distributions.

3.2.1 Convolutional Neural Network
A multilayer perceptron is a fully connected network in which each neuron in one layer is connected to all the neurons in the next layer. It has now been succeeded by the CNN, a class of DNN that can be viewed as a regularized multilayer perceptron [65]. The CNN is by far the most widely used DNN for the dose prediction task, as can be seen in Table 5. The main components of a typical CNN are convolutional layers, max pooling layers, batch normalization, dropout layers, and a sigmoid or softmax layer. The convolutional layer consists of a set of convolutional kernels, where each kernel acts as a filter. The convolutional kernel divides the image into small patches, known as receptive fields, which aids in extracting features. The kernel convolves a specific set of weights with the corresponding elements of each receptive field. The weight-sharing property of the convolution operation allows different sets of features within an image to be extracted by sliding a kernel with the same set of weights over the image. This makes CNNs more parameter-efficient than fully connected networks. Convolution operations can be grouped based on the type and size of filters, the direction of convolution, and the type of padding [66]. As a result of the convolution operation, feature motifs can occur at different locations in the image.
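The weight sharing described above can be sketched in a few lines of pure Python. The image and kernel values are hypothetical, no padding and a stride of one are assumed, and, as in most DL frameworks, the operation is implemented as a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).
    The same kernel weights are reused at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Weighted sum over the receptive field anchored at (i, j).
            for di in range(kh):
                for dj in range(kw):
                    acc += kernel[di][dj] * image[i + di][j + dj]
            row.append(acc)
        output.append(row)
    return output

# Hypothetical 3x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to left-to-right intensity increases.
edge_kernel = [[-1, 1], [-1, 1]]
feature_map = conv2d_valid(image, edge_kernel)
```

Here four shared weights produce all six outputs, whereas a fully connected layer mapping the same 12 inputs to 6 outputs would need 72 weights.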
The goal of pooling is to preserve the approximate position of each motif relative to the others rather than its exact location. Pooling, or down-sampling, summarizes similar information in the neighborhood of the receptive fields and outputs the dominant response within this local region, helping to extract combinations of features that are invariant to translational shifts [68]. Commonly reported pooling formulations used in CNNs are max, average, L2, overlapping, and spatial pyramid pooling [46,139]. The nonlinear operation, also known as the activation function, helps in learning sophisticated patterns by serving as a decision function. Activation functions reported in the literature for introducing non-linear combinations of features include sigmoid, tanh, SWISH, and ReLU and its variants such as leaky-ReLU and PReLU [43,67,109,139,157]. A more recently proposed activation function is MISH, which has shown better performance than ReLU on benchmark datasets [95]. Batch normalization is applied to address internal covariate shift, a change in the distribution of hidden unit values within feature maps that can reduce the convergence speed. It essentially unifies the distribution of feature map values by setting them to zero mean and unit variance, which, in turn, improves the generalization of the network by smoothing the flow of the gradient [124]. Finally, weight regularization and dropout layers are used to alleviate overfitting. The difference between the predicted and the target output is calculated through a loss function. A CNN is generally trained by minimizing the loss via gradient backpropagation using optimization methods. Different architectures have been proposed in the literature to enhance the performance of CNNs. U-Net, originally introduced for segmentation of neuronal structures in electron microscopy stacks [110], is one of the most widely used CNN architectures.
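The pooling and batch normalization steps described above can be sketched as follows. This is a simplified illustration with hypothetical values: the pooling window is fixed at 2x2 with stride 2, and the learnable scale and shift parameters that normally follow the normalization step are omitted.

```python
def max_pool_2x2(feature_map):
    """Keep the dominant response in each non-overlapping 2x2 window."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            row.append(max(feature_map[i][j], feature_map[i][j + 1],
                           feature_map[i + 1][j], feature_map[i + 1][j + 1]))
        pooled.append(row)
    return pooled

def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# A motif shifted by one pixel inside the same window gives the same output,
# which is the translational invariance that pooling provides.
motif = max_pool_2x2([[0, 9], [0, 0]])
shifted = max_pool_2x2([[9, 0], [0, 0]])

pooled = max_pool_2x2([[1, 2, 0, 0], [3, 4, 0, 5], [0, 0, 9, 8], [0, 1, 7, 6]])
normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
```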
In addition to segmentation, U-Net is also used for image-to-image translation tasks that output an image with a one-to-one voxel correspondence to the input. U-Net permits effective feature learning even with a small number of training samples. Milletari et al. proposed a three-dimensional variant of U-Net known as V-Net [91]. A known issue in training DNNs is the vanishing gradient. Therefore, ReLU [66] and its variants are generally preferred as activation functions owing to their ability to overcome the vanishing gradient problem [102]. He et al. formulated the layers as learning residual functions instead of directly fitting a desired underlying mapping [45]. The densely connected neural network (DenseNet) by Huang et al. connects each layer to every other layer [47]. More recently, attention gates have been used in CNNs to suppress irrelevant features and highlight salient features useful for a given task [111].

3.2.2 Generative Adversarial Network
Generative adversarial network (GAN) is a widely used semi-supervised learning method in DL [41]. The two major components of a GAN are a generative network and a discriminator network that are trained concurrently to compete against each other. The goal of the generative network is to generate artificial data that approximate a target data distribution from a low-dimensional latent space, whereas the goal of the discriminator network is to examine the data presented to it and flag them as either real or fake. Both networks improve over the course of training until they reach a Nash equilibrium, which is the solution of the minimax loss of the aggregate training protocol [41]. Some popular variants of GAN include CycleGAN [169], conditional GAN (cGAN) [93] and StarGAN [17]. GANs are widely used in medical imaging [44,54,70,71].
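For reference, the minimax loss mentioned above is commonly written as follows in the original GAN formulation [41]. The symbols are those of that formulation and are not otherwise used in this survey: G is the generator, D the discriminator, p_data the target data distribution, and p_z the distribution over the latent space.

```latex
\[
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
\]
```

The discriminator maximizes V by assigning high probability to real samples and low probability to generated ones, while the generator minimizes V by producing samples that the discriminator scores as real.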
3.2.3 Reinforcement Learning
Reinforcement learning (RL) trains an agent, connected to its environment through perception and action, to make adjustments based on the interaction between the agent and the environment. At each step of the interaction, the agent receives an indication of the current state of the environment. Based on this indication, the agent chooses an action to generate as output. This action changes the state of the environment, and the value of this state transition is communicated to the agent through a reward function. Over time, the agent learns through trial and error to choose actions that increase the reward [116]. In other words, the goal of RL is to find the balance between exploration and exploitation of current knowledge. RL has been combined with DNNs to achieve human-level performance [96]. RL is a unique framework that resembles the workflow of treatment planning optimization. The potential scope of RL in the DL-based dose prediction task (Table 5) has been investigated in a recent study [115]. RL was used to train a DNN named the virtual treatment planner network, which, in place of a human treatment planner, decides how to change treatment planning parameters to improve plan quality, similar to the manual treatment planning process [115].

Table 3. Traditional KBP studies that aimed to predict voxel level doses for providing a starting point for the plan optimization process.
[Table body not recovered. Columns: Ref.; Method. Surviving method entry: converts a predicted per-voxel dose distribution into a complete radiotherapy plan through a fully automated pipeline using cARF-CRF.]
OVH = overlap volume histogram; DTH = distance-to-target histogram; AB = atlas based; MB = model based
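The agent-environment loop described in the reinforcement learning subsection above can be illustrated with a toy example. This is a minimal sketch using tabular Q-learning with epsilon-greedy exploration, not the deep RL virtual treatment planner network of [115]. The environment is entirely hypothetical: a single planning parameter with six settings, one of which (`TARGET`) gives the best plan quality, and a reward equal to the change in a made-up quality score.

```python
import random

TARGET = 3            # hypothetical best setting of one planning parameter
N_STATES = 6          # parameter settings 0..5
ACTIONS = (-1, +1)    # decrease or increase the parameter

def plan_quality(state):
    """Hypothetical quality score: highest at the target setting."""
    return -abs(state - TARGET)

def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(N_STATES - 1, max(0, state + action))
    reward = plan_quality(next_state) - plan_quality(state)
    return next_state, reward

def train(episodes=500, steps=10, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = rng.randrange(N_STATES)
        for _ in range(steps):
            if rng.random() < epsilon:       # explore: random action
                a = rng.randrange(len(ACTIONS))
            else:                            # exploit: best known action
                a = max(range(len(ACTIONS)), key=lambda k: q[state][k])
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted future value.
            q[state][a] += alpha * (reward + gamma * max(q[next_state]) - q[state][a])
            state = next_state
    return q

def greedy_action(q, state):
    """Action the trained agent would take in a given state."""
    return ACTIONS[max(range(len(ACTIONS)), key=lambda k: q[state][k])]

q_table = train()
```

After training, the greedy policy increases the parameter when it is below the target and decreases it when above, which mirrors how a planner iteratively adjusts a parameter based on the observed change in plan quality.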

Deep Learning in Dose Prediction
DL-based dose prediction methods can be categorized according to DL properties such as network architecture (CNN, GAN, etc.), training process (supervised, unsupervised, semi-supervised, deep reinforcement, etc.), input image type (CT only, CT + OAR + PTV contours, etc.), output type (2D or 3D dose distribution) and sample size (training, testing, etc.). As shown in Figure 1, DL-based dose prediction methods have gained popularity amongst researchers only in the past few years; there are nearly 30 publications on DL-based dose prediction so far. These publications are tabulated in Table 5 along with their network architectures and input and output characteristics. Figure 4 shows the total number of DL-based dose prediction investigations per treatment site. This follows a trend similar to that observed for traditional KBP dose prediction approaches, with the highest number of investigations being on prostate and head/neck cancer sites. Here, we categorize the DL-based dose prediction publications thus far into two groups based on network architecture: I) CNN, namely the U-Net architecture, and II) GAN. We first review the work for each network architecture, followed by its applicability to various dose prediction applications and its limitations. Subsequently, we discuss the influence of different parameters in DL-based dose prediction methods.
3.3.1 Overview of CNN-based works
As shown in Table 5, U-Net has been a widely used CNN architecture for predicting dose distributions. U-Net is effective at computing and combining global and local features because it consists of an encoding path and a decoding path. The decoding path concatenates the features from the previous layers in the encoding path with the features from the current layers in the decoding path. Many variants of U-Net, including 3D U-Net, have appeared in the literature for dose prediction purposes. Earlier work in DL-based dose prediction predicted doses in a 2D manner [27,99]. Sumida et al. used the U-Net model, initially proposed by Ronneberger et al. [110], to make 2D dose predictions. The two main parts of this network were the encoding and decoding parts. Each level of the encoding part consisted of a 2D convolution layer, batch normalization, a rectified linear unit (ReLU) and a max-pooling layer. The network was trained to predict the dose calculated by Acuros XB (AXB) from the CT and a low-resolution dose calculated with the AAA algorithm. Similarly, Nguyen et al. trained a seven-level hierarchy with a modified version of the original U-Net to make dose predictions for prostate cases [99].
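The encoding and decoding paths described above can be made concrete by tracing feature-map shapes through a U-Net-like network. The sketch below only does shape bookkeeping and performs no convolutions. It assumes common design choices that are not taken from the surveyed papers: filter counts double at each level, 2x2 pooling halves the resolution, and up-sampled features are reduced to the channel count of the matching encoder level before concatenation.

```python
def trace_unet(size, base_filters=16, levels=3):
    """Return (stage, channels, spatial_size) tuples for a U-Net-like network."""
    trace, skips = [], []
    # Encoding path: more channels, lower resolution at each level.
    for level in range(levels):
        channels = base_filters * 2 ** level
        trace.append((f"enc{level}", channels, size))
        skips.append(channels)        # kept for the matching decoder level
        size //= 2                    # 2x2 max pooling
    trace.append(("bottleneck", base_filters * 2 ** levels, size))
    # Decoding path: up-sample, concatenate encoder features, convolve.
    for level in reversed(range(levels)):
        size *= 2                     # up-sampling restores resolution
        skip = skips[level]
        trace.append((f"concat{level}", 2 * skip, size))  # skip + up-sampled
        trace.append((f"dec{level}", skip, size))
    return trace

shapes = trace_unet(64)
```

The final decoder stage has the same spatial size as the input, which is what gives the predicted dose a one-to-one pixel correspondence with the input image.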
More recent works have focused on predicting 3D dose distributions using DL methods. To overcome the increased computational load of 3D dose prediction, Nguyen et al. proposed the Hierarchically Densely Connected U-Net (HD U-Net), which was not only able to predict 3D dose distributions but also outperformed the dose predictions made by the standard U-Net model [100]. HD U-Net combines DenseNet's efficient feature propagation, achieved by connecting each layer to every other layer in a feed-forward fashion, with U-Net's ability to infer both local and global features, yielding better RAM usage and better generalization of the model. To further simplify the 3D dose prediction problem and increase prediction accuracy, Xing et al. projected the fluence maps to the dose distribution using a fast and inexpensive ray-tracing dose calculation algorithm and trained an HD U-Net to map the low-accuracy ray-tracing dose distribution (which does not consider scatter effects) into an accurate dose distribution calculated using a collapsed cone convolution/superposition algorithm [156]. DL-based methods have also been extended to predict Pareto optimal dose distributions so that physicians can explore the desired dosimetric trade-offs in real time and learn the viability of different dosimetric goals. Ma et al. constructed a 3D U-Net architecture to predict individualized dose distributions for different trade-offs [84]. In predicting Pareto dose distributions, the network should be able to map a single anatomy to many dose distributions. In doing so, it should be able to differentiate between the predicted dose distributions and their corresponding clinical consequences. To address these clinically relevant differences amongst dose distributions, Nguyen et al. proposed a differentiable loss function based on the DVH and an adversarial loss, in addition to the traditional voxel-wise mean square error (MSE) loss, to train the network [101]. Along the same line of work, Bohara et al. incorporated beam information to predict Pareto dose distributions using the anatomy-beam model proposed by Barragán-Montero et al. [6].
The U-Net architecture has also been used for internal radiation dose prediction [42,69], where the network was trained to predict 3D dose rate maps given the mass density distribution and radioactivity maps. Since the clinically available Medical Internal Radiation Dose Committee (MIRD) based dose estimations are the least precise, the long-term goal of these studies is to create a stable DL-based dose estimation model that achieves a precision close to that of Monte Carlo simulations. ResNet, proposed by He et al., has also been used for dose prediction [15,120]. Since networks with very deep layers are difficult to train due to vanishing gradients, such networks use shortcut connections whose outputs are added to the outputs of the stacked layers [45]. More recently, Liu et al. proposed a ResNet for dose prediction in nasopharyngeal cancers treated with helical tomotherapy. To achieve multi-scale feature learning, Liu et al. divided the ResNet into several parts without fully connected layers and combined each part with the input data to achieve pixel-wise feature abstraction and extraction in the structural image.
3.3.2 Overview of GAN-based works
A GAN entails a pair of neural networks: a generator and a discriminator. From the treatment planning standpoint, the generator can be regarded as the treatment planner who generates the plan, and the discriminator as the radiation oncologist who evaluates the plan generated by the treatment planner. Both the treatment planner and the radiation oncologist get better at performing their tasks as they gain experience over time. Only a handful of studies have investigated the performance of GANs for the dose prediction task, as shown in Table 5 [131]. All four studies [4,85,98,131] on GAN-based dose prediction constructed a generator and a discriminator network using the pix2pix architecture proposed by Isola et al. [52]. A U-Net generator was used, which passes a contoured CT image slice through consecutive layers, a bottleneck layer and subsequent deconvolution layers. U-Net also uses skip connections to easily pass high-dimensional information between the input (CT image slice or contoured structures) and the output (dose slice).
Figure 4. The total number of DL-based dose prediction investigations for various cancer sites. NC = Nasopharyngeal Cancer; PD = Personalized Dosimetry.
3.3.3 Overview of learning processes
In this section, we briefly review the learning processes, including supervised learning (SL), unsupervised learning (USL) and semi-supervised learning (SSL), that have been utilized so far in DL-based dose prediction tasks. Earlier approaches used SL, in which the network is trained on labeled data in the form of different geometrical parameters and distances to the target. In contrast, USL does not require such target information and relies solely on the input data to learn the patterns hidden within the raw data. A typical example of USL is training a deep autoencoder (DAE), which has a flexible network structure with an encoder and a decoder. These USL networks can be CNNs, fully connected networks, or hybrids [116]. It can be seen from Table 5 that USL is the most widely used learning strategy in DL-based dose prediction tasks. A category that falls between USL and SL is SSL. SSL is commonly used for tasks in which the target information is only partially available. GAN, a popular SSL method, has also been utilized for dose prediction tasks (Table 5).
Table 4. A list of articles with investigations on effects of outliers on plan quality and summary of evaluation metrics used by RapidPlanTM with threshold in parentheses.

Input parameters
In terms of the number of input parameters, Willems et al. studied the impact of four different inputs (Table 5) on dose prediction with and without normalization of the dose distribution. The order of the models in terms of performance was CT + isocenter + contours > CT + contours > CT + isocenter > CT only. While dose distribution normalization had more benefits for the CT + contours model, it was found to be less necessary for the CT + isocenter + contours model, and it produced hot and cold spots for the CT + isocenter model [147]. While many studies use only CT with anatomical information (i.e., PTV and OAR contours) as inputs to the CNN [5,100,99], as can be seen in Table 1, Barragán-Montero et al. included beam geometry information along with anatomical information as inputs. As a result, the model was able to learn from a database that was heterogeneous in terms of beam configurations (i.e., non-coplanar) [5], which was a limitation of the network proposed in the earlier studies [99]. For rectal cancer IMRT, Zhou et al. showed that including beam configurations as input to the network improved prediction accuracy compared to omitting them [168]. For head/neck cancer, Chen et al. investigated the influence of adding out-of-field labels to the network training to deal with the inability of a 2D network to account for radiation beam geometry. This resulted in better overall performance compared to the network trained without out-of-field labels [15]. For prostate cancer, Murakami et al. compared the performance of a CT-only GAN with a contour-based GAN in predicting target images (i.e., RT-dose images) and found the prediction performance of the contour-based GAN to be superior.

Loss functions
In terms of losses, MSE is one of the most widely used cost functions in DL methods as it has many desirable properties from an optimization standpoint. Owing to its simplicity, well-behaved gradient and convexity, the majority of previous studies, including the ones shown in Table 1, utilized only the MSE loss for dose prediction. Nguyen et al. trained a network with a domain-specific loss function by adding a non-convex DVH loss and an adversarial loss to the MSE loss. While this model outperformed the model trained with MSE alone in dose prediction, on the same computational system it increased the training time to 3.8 days for 100,000 iterations, compared to 1.5 days for the MSE-only network [101]. Lee et al. and Chen et al. utilized the mean absolute error (MAE) between the ground truth and the dose rate map predicted by the CNN as the cost function [15,69]. A key difference between MSE and MAE is that MAE is more robust to outliers but may be inefficient in finding the solution, whereas MSE provides a more stable, closed-form solution. Other candidate loss functions include the Huber loss (also known as smooth mean absolute error), quantile loss, and log-cosh loss. So far, the MSE loss has been the standard cost function used in DL-based dose prediction studies.
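The different sensitivity of these losses to outliers can be seen in a short pure-Python sketch. The dose values are hypothetical, and the Huber threshold `delta=1.0` is an arbitrary choice made for illustration.

```python
import math

def mse(pred, target):
    """Mean square error: penalizes large errors quadratically."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mae(pred, target):
    """Mean absolute error: penalizes all errors linearly."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def huber(pred, target, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def log_cosh(pred, target):
    """Smooth loss that behaves like MSE near zero and like MAE far away."""
    return sum(math.log(math.cosh(p - t)) for p, t in zip(pred, target)) / len(pred)

target = [60.0, 60.0, 20.0, 5.0]          # hypothetical voxel doses (Gy)
clean = [59.0, 61.0, 21.0, 4.0]           # every voxel off by 1 Gy
with_outlier = [59.0, 61.0, 21.0, 15.0]   # one voxel off by 10 Gy

mse_ratio = mse(with_outlier, target) / mse(clean, target)   # 25.75
mae_ratio = mae(with_outlier, target) / mae(clean, target)   # 3.25
```

A single 10 Gy error raises the MSE by a factor of 25.75 but the MAE by only 3.25, which illustrates why MAE is considered more robust to outliers.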
Sample size
In general, DL-based methods require a large amount of high-quality data to be effective. A small dataset can be challenging in DL as it can result in overfitting. Overfitting occurs when the model is trained to fit a set of training data exactly but fails to learn the hidden patterns needed to maintain model generality [116]. Data augmentation [122], dropout layers [121], monitoring of the training and validation curves [100], synthesis of new data based on physics principles [86] and regularization of model parameters [132] have been used in the literature to prevent overfitting. Data augmentation, the approach more commonly used in dose prediction, expands the dataset by synthesizing additional realistic samples from the available samples. It is important to note, however, that the choice of augmentation depends on its suitability for the context. For the purpose of dose prediction, we present the average training and testing sample sizes for each treatment site in Figure 5 for all DL-based dose prediction methods to date, which provides readers with an approximate range of training and testing dataset sizes for each cancer site.
As shown in Table 1, three investigations on prostate cancer have been reported so far for predicting Pareto optimal dose distributions [6,84,101]. For each patient in the training set, 10, 100, and 500 plans were generated by Ma et al., Nguyen et al., and Bohara et al., respectively, to sample the Pareto surface with different trade-offs. The optimal number of plans per patient in the training set is unknown, as it may vary from case to case. Nonetheless, when predicting Pareto optimal plans, it may be ideal to stay within the clinically relevant regime by including only those plans that cover the dosimetric trade-offs presented by a physician.
Kandalan et al. studied the generalizability of DL-based dose prediction models and used transfer learning to adapt a DL dose prediction model to different planning styles within the same institution and to planning practices at different institutions. A source model was adapted to four different planning styles with only 14-29 cases [57]. A long-term goal of these studies is to generate a universal model that can easily be transferred to different institutions for a similar task.

Table 5 (recovered rows).
Ref. | Network | Input | Output
[120] | ResNet-50 | CT + OAR + PTV + body contours | 3D
[122] | U-Net | Low resolution dose + CT | 2D
[156] | HD U-Net | CT + RT dose distribution | 3D
[123] | GAN | CT + PTV + OAR | 2D
[131] | Attention gated GAN | CT + PTV + OAR | 3D
[101] | GAN | PTV + OAR + Body | Pareto Dose Distribution
[4] | 3D GAN | Contoured CT images | 3D
[168] | 3D U-Net + Residual Network | CT + OAR + PTV contours + Beam + Dose | 3D
[57] | 3D U-Net | OAR + PTV contours | 2D
[77] | Dense-Res hybrid Network | Beam + structural information | Static field fluence prediction
[115] | Virtual

Discussion
With the aim of minimizing variations in treatment planning and improving treatment planning efficiency, namely by avoiding the time-consuming trial-and-error process of planning a treatment from scratch for every patient, researchers introduced the concept of using previously delivered treatment plans to guide treatment planning for a new patient. This concept is known today as knowledge-based planning. In the last decade, there has been rapid growth in the number of publications on traditional KBP dose prediction. On the other hand, the number of publications on DL has increased exponentially in recent years owing to its flexibility and superior performance compared to many state-of-the-art techniques. Over 90 papers were published on traditional KBP dose prediction methods between 2011 and August 2020, whereas over 15 papers have already been published on DL-based dose prediction so far this year. In general, most papers demonstrated improvements over manually optimized clinical plans in terms of both treatment planning quality and efficiency. A large number of manuscripts on traditional methods were published between 2015 and 2018, with the highest number of publications in 2017. This is presumably due to the commercialization of RapidPlanTM in the Eclipse treatment planning software in 2014, which allowed researchers from different centers to perform a range of retrospective studies investigating the influence of various parameters on the quality of plans generated through RapidPlanTM KBP. While the number of traditional KBP publications has been quite steady in the past 2 years, DL-based publications have been increasing rapidly since 2017. In terms of modality, both techniques were mostly applied to IMRT, VMAT and other non-coplanar intensity modulated external beam radiation therapy treatments.
Only a small number of data-driven dose prediction studies were reported for magnetic resonance imaging guided radiation therapy (MRgRT) [125]. The number of traditional KBP and DL-based publications on on-table adaptation may increase in the future owing to recent technical developments such as the MR-Linear Accelerator (MR-Linac). In terms of treatment sites, prostate, head/neck and lung were amongst the most investigated sites in both traditional KBP and DL-based methods compared to complex abdominal or cranial sites. This was anticipated, as both KBP techniques require a large training sample set and these three are commonly treated sites in external beam radiation therapy. Therefore, a large repository of previously treated plans is more likely to be available for building dose prediction models for these sites than for other complex sites. In KBP, the three dose prediction outputs commonly reported in the literature were the entire DVH curve (Table 1), one or more dose metrics (Table 2) and voxel-based dose (Table 3). A known limitation of DVH prediction is that DVHs are only predicted for contoured OARs, which may limit the ability to account for conformity and for hotspots that may occur outside the regions of interest. This was addressed through voxel-based dose prediction approaches, in which the models are built to predict the dose to individual voxels within the CT image. However, this approach relies heavily on the quality of the plans used to build the model, as the inclusion of outliers can compromise model performance. Even for RapidPlanTM-based KBP, several studies indicated the need to investigate the proper identification of outlier plans [31,32,127]. Outlier identification in RapidPlanTM involves statistics and regression plots for each structure, with Cook's distance > 10.0, studentized residual > 3.0, areal difference of estimate > 3, and modified Z-score > 3.5 suggested as indicators of potential outliers [133].
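As an illustration of one of these criteria, the modified Z-score threshold of 3.5 can be applied as sketched below. The sketch assumes the commonly used definition based on the median and the median absolute deviation (MAD) with the constant 0.6745; whether RapidPlanTM computes the statistic in exactly this way is not stated here, and the dose values are hypothetical.

```python
from statistics import median

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD (assumed definition)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified Z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

def flag_outliers(values, threshold=3.5):
    """Indices of values whose |modified Z-score| exceeds the threshold."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]

# Hypothetical mean OAR doses (Gy) across the plans of a training cohort.
mean_doses = [24.0, 25.5, 26.0, 23.5, 25.0, 41.0, 24.5]
outlier_indices = flag_outliers(mean_doses)   # [5]
```

Because the median and MAD are themselves insensitive to extreme values, the 41.0 Gy plan is flagged without distorting the scores of the remaining plans.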
To an extent, outlier handling also requires removing outliers in an iterative manner, either stopping the removal once no significant improvement is observed or identifying the outliers and replanning all of them so that they can be reused in the training cohort [1]. The time required to address outliers may vary from one institution to another, as institutions without standardized techniques can have many dosimetric outliers, presumably due to large variations in treatment planning, which, in turn, can result in a time-consuming process of eliminating outliers through either visual inspection or additional statistical analysis. In the literature, limited emphasis has been placed on establishing a systematic process for identifying dosimetric and geometric outliers. To our knowledge, there is currently no well-established workflow for outlier identification and mitigation during model creation for either KBP technique. Therefore, a standardized automated method of outlier identification and model creation could further enhance the treatment planning experience [165]. In contrast to previous reviews that presented the training and testing sample sizes per year [38,51], we separated the datasets per cancer site for traditional KBP and DL-based methods in Figures 3 and 5, respectively. This provides readers with a range of training sample sizes for each cancer site, as the required number of training cases depends not only on the prediction model but also on the complexity of the treatment site. For instance, more cases may be required to represent the case population when training a model for a complex site such as head/neck than for a simpler site such as prostate. A direct comparison of training sample size between traditional and the more recent DL-based KBP was not made, as DL-based dose prediction is a relatively new technique with fewer investigations per site than traditional KBP methods.
In contrast to DL, an inherent limitation of traditional methods is that they are unable to process raw data and extract the important features and patterns hidden within. Both the similarity measures in atlas-based methods and the input features to model-based methods require considerable effort to extract valuable features (e.g., overlap volume histogram, OAR distance to the PTV, projections) that transform raw data either to identify the best matched case or into a representation from which patterns within the input can be classified through a classifier. In traditional approaches, PCA has been widely used in the literature for feature selection owing to its simplicity. However, a major limitation of PCA is that it learns a low-dimensional representation of the data only through a linear projection. DNNs, in contrast, can be used to address this issue and learn non-linear projections. For instance, an autoencoder is a type of neural network that consists of an encoder, which encodes the input into a low-dimensional latent space, and a decoder, which restores the original input from the low-dimensional latent space [37]. This has been adopted in DL-based dose prediction methods (Table 5), and extension of such unsupervised methods is anticipated in the near future to further enhance dose prediction accuracy. In terms of DL-based dose prediction methods, the two most investigated networks thus far are CNN and GAN. From the results so far, it appears that GANs may be a better choice than conventional CNNs for dose prediction tasks for several reasons. First, GANs have been shown to perform well in lesion detection and data augmentation tasks [4,35]. In addition, a GAN does not rely on a pure spatial loss, such as the mean square error between dose volumes, which makes it a suitable candidate not only for dose prediction in conventional radiation therapy but also for SBRT, in which dose heterogeneity is prevalent. Furthermore, Babier et al. found that GAN models did not require significant parameter tuning or architecture modifications during implementation compared to other conventional methods [4]. However, in contrast to CNNs, one limitation of conventional GANs is that they are difficult to train and require the number of network parameters to be as low as possible. Future studies are anticipated to address such shortcomings by proposing network extensions such as the attention-gated GAN [131].

Method comparison of KBP dose predictions
For head/neck cancer, the difference between the traditional KBP predicted and actual median doses for the parotids ranged from -17.7% to 15.3% [78], whereas it ranged from 7.7% to 13.5% for DL-based dose prediction [15]. At the same level of prediction accuracy, DL-based KBP was able to predict the median dose for 80% of parotids compared to 63% for the traditional KBP method [15]. Kajikawa et al. directly compared the dose distribution predicted by a DL method with that generated by RapidPlanTM for prostate cancer [56]. This dosimetric comparison showed that the CNN predicted the DVH significantly more accurately for D98 in PTV-2 and for V35, V50 and V65 in the rectum. Given that the features automatically extracted by DL methods can include both geometric/anatomic features and the mutual trade-offs between OARs, DL methods have an edge in dose prediction accuracy over traditional KBP methods, which mainly rely on DVH and geometry-based expected dose. For oropharyngeal cancer, Mahmood et al. directly compared a GAN approach for generating predicted dose distributions with several traditional approaches, including bagging query [3,148], generalized PCA [161] and random forest [90]. Through gamma analysis [83], Mahmood et al. demonstrated that GAN plans were the most similar to the clinical plans and achieved 4.0% to 7.6% improvements in the frequency of clinical criteria satisfaction compared to traditional approaches [85]. For prostate SBRT, Vasant et al. compared the performance of the proposed attention-gated GAN with an earlier approach that used relative distance map information of neighboring input structures [119]. In contrast to conventional radiation therapy, SBRT produces hot spots within the target volume.
The mean absolute difference in V120 between the KBP-like approach and the actual plan was four-fold higher than that achieved by the attention-gated GAN technique, demonstrating the ability of a DL-based method to predict the cold spots and hotspots that are prevalent in SBRT dose volumes. Both traditional and DL-based KBP approaches use data from previously treated patients to make dose predictions for a new patient. DL-based approaches, however, have been shown to outperform traditional methods in dose prediction tasks in several studies. This is presumably because DL methods, unlike traditional KBP methods, are not limited to a small number of features.
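The DVH metrics quoted in this comparison (e.g., D98, V35, V120, median dose) can be computed from a list of voxel doses as sketched below. This is a minimal illustration that assumes all voxels of a structure have equal volume and that thresholds are given in the same units as the doses; the dose values are synthetic.

```python
import math

def volume_at_dose(doses, threshold):
    """V_x: percentage of the structure's voxels receiving at least `threshold`."""
    return 100.0 * sum(d >= threshold for d in doses) / len(doses)

def dose_at_volume(doses, volume_percent):
    """D_x: minimum dose received by the hottest `volume_percent` % of voxels."""
    ranked = sorted(doses, reverse=True)
    count = max(1, math.ceil(len(ranked) * volume_percent / 100.0))
    return ranked[count - 1]

# Synthetic structure with 100 equal-volume voxels receiving 1, 2, ..., 100 Gy.
voxel_doses = [float(d) for d in range(1, 101)]

v35 = volume_at_dose(voxel_doses, 35.0)      # 66.0 (% of volume with >= 35 Gy)
d98 = dose_at_volume(voxel_doses, 98.0)      # 3.0 Gy
median_dose = dose_at_volume(voxel_doses, 50.0)
```

The accuracy of a prediction method on such a metric is then the difference between the value computed from the predicted dose and the value computed from the delivered plan.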

Future trends
From the statistics of publications on data-driven dose prediction approaches in recent years, there is a clear shift from traditional methods to DL-based methods for KBP. This is presumably due to the flexibility and superior performance of DL-based approaches compared to traditional approaches. In terms of traditional KBP methods, future investigations are anticipated to be retrospective in nature, using clinically available tools (i.e., RapidPlan™). DL-based methods, on the other hand, appear to be in their initial development stage; hence, their potential will be explored in the near future in different dose prediction tasks within the treatment planning workflow, including adaptive radiotherapy. Adaptive radiotherapy (ART) involves adjusting the dose distribution based on anatomical changes observed on intra-procedural imaging such as CBCT. The standard approach requires a physician to recontour the OARs and tumor regions, followed by plan re-optimization, which is difficult to implement in an ART workflow. To date, only one study has reported adopting DL methods for ART of head/neck cancer [123]. The future trend will certainly be towards utilizing the flexibility and efficiency offered by DL-based methods to develop models that predict dosimetric changes and radiotherapy response for ART. After dose prediction, a main component of the treatment planning workflow is ensuring that the predicted dose is achievable, which often involves inverse treatment planning with manual intervention. Only a handful of studies, using traditional [26,55,89,153,171] and DL-based methods [4,77,85], have extended such data-driven approaches into a fully automated pipeline that not only predicts the dose distribution but also generates a complete treatment plan with minimal human interaction. The deliverability of the predicted plans is even more important, as a deliverable plan has to account for various mechanical and algorithmic constraints. 
It is important to note here that good predictions with low error may not necessarily lead to a final deliverable plan with the same performance on clinical criteria. For instance, five of the seven prediction methods investigated by Babier et al. resulted in significantly worse clinical criteria satisfaction in the final plans despite lower dose prediction error [4]. We therefore believe that synchronizing an inverse optimization engine with dose prediction methods holds great potential for improving treatment planning efficacy and efficiency. Alternatively, DL-based fluence prediction has also been proposed for real-time prostate treatment planning [77]. In this approach, the predicted fluence maps are converted to a deliverable treatment plan through delivery parameter generation and dose calculation directly in treatment planning software. Such approaches do not require an inverse optimization process and involve minimal human intervention. A subsequent task, after generating a deliverable plan, involves patient-specific quality assurance (QA) measurements, which are performed routinely prior to actual treatment delivery to ensure delivery and dosimetric accuracy. Several ML [63] and DL [105] approaches have been reported for predicting gamma passing rates for IMRT patient-specific QA. More effort is also anticipated in incorporating such approaches into the treatment planning pipeline to establish a fully automated workflow. One of the challenges of data-driven algorithms, including both ML and DL, is that they require a large set of high-quality data. Since the quality of data and radiotherapy practices vary from one center to another, the heterogeneity in previously treated plans becomes a major obstacle to the deployment of data-driven solutions in radiation oncology. To address this issue, transfer learning for adapting models to the different practice styles at different centers may be investigated further in the future. 
A long-term goal of this area of investigation is to incorporate data-driven predictive tools into the clinical pathway.
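The gamma passing rate referred to in the patient-specific QA discussion above combines a dose-difference criterion with a distance-to-agreement (DTA) criterion. The sketch below illustrates the idea on a one-dimensional dose profile. It is a simplified illustration under stated assumptions, not the implementation used in the cited studies: it uses global normalization to the maximum reference dose, searches only the sampled grid points without interpolation, applies no low-dose threshold, and defaults to the commonly used 3%/3 mm criteria. The function name and parameters are hypothetical.

```python
import numpy as np


def gamma_passing_rate(ref_dose, eval_dose, spacing_mm,
                       dose_tol_pct=3.0, dta_mm=3.0):
    """Percentage of reference points with gamma <= 1 on a 1D profile.

    Simplifications (assumptions of this sketch): global normalization,
    no interpolation between grid points, no low-dose threshold.
    """
    ref = np.asarray(ref_dose, dtype=float)
    ev = np.asarray(eval_dose, dtype=float)
    positions = np.arange(ref.size) * spacing_mm

    # Global dose criterion: a percentage of the maximum reference dose.
    dose_tol = dose_tol_pct / 100.0 * ref.max()

    # Pairwise terms: rows index reference points, columns evaluated points.
    dist_term = (positions[:, None] - positions[None, :]) / dta_mm
    dose_term = (ev[None, :] - ref[:, None]) / dose_tol

    # Gamma at each reference point is the minimum over evaluated points.
    gamma = np.sqrt(dist_term ** 2 + dose_term ** 2).min(axis=1)
    return float(100.0 * np.mean(gamma <= 1.0))


# Toy profiles (synthetic, purely illustrative): a Gaussian-shaped reference
# profile on a 1 mm grid and a copy shifted by one grid point.
x_mm = np.arange(-20, 21, dtype=float)
reference = 100.0 * np.exp(-x_mm ** 2 / (2.0 * 5.0 ** 2))
shifted = np.roll(reference, 1)
rate_identical = gamma_passing_rate(reference, reference, spacing_mm=1.0)
rate_shifted = gamma_passing_rate(reference, shifted, spacing_mm=1.0)
```

Clinical gamma analysis is performed on 2D or 3D dose distributions with interpolation and a low-dose threshold, so this sketch only conveys how the two criteria are combined.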

Conclusion
In the last decade, a tremendous amount of work has been done towards automation to improve treatment planning quality and efficiency. We have reviewed two major KBP approaches to dose prediction: traditional KBP methods, with over 90 articles, and the more recently introduced DL-based KBP, with nearly 30 articles. While traditional approaches are equivalent or superior to an experienced planner and offer greater efficiency, recent developments in DL hold greater potential for the dose prediction task. Both KBP approaches, however, need to be extended to more complex sites such as abdominal and intracranial sites. Given the commercial availability of the RapidPlan™ module, more retrospective studies are foreseen in the future. At the same time, new DL-based KBP approaches are actively being introduced, and their number is rising steeply. Various areas of future research, several of which have been highlighted in this review, must be addressed to achieve the ultimate goal of a fully automated treatment planning system.

Disclosures
The authors declare no conflicts of interest.