TrDosePred: A deep learning dose prediction algorithm based on transformers for head and neck cancer radiotherapy

Abstract

Background: Intensity-modulated radiation therapy (IMRT) has been the standard of care for many types of tumors. However, treatment planning for IMRT is a time-consuming and labor-intensive process.

Purpose: To alleviate this tedious planning process, a novel deep learning based dose prediction algorithm (TrDosePred) was developed for head and neck cancers.

Methods: The proposed TrDosePred, which generates a dose distribution from a contoured CT image, is a U-shaped network constructed with a convolutional patch embedding and several local self-attention based transformers. Data augmentation and an ensemble approach were used for further improvement. The model was trained on the dataset from the Open Knowledge-Based Planning Challenge (OpenKBP). The performance of TrDosePred was evaluated with the two mean absolute error (MAE) based scores used by the OpenKBP challenge (i.e., the dose score and the DVH score) and compared to the top three approaches of the challenge. In addition, several state-of-the-art methods were implemented and compared to TrDosePred.

Results: The TrDosePred ensemble achieved a dose score of 2.426 Gy and a DVH score of 1.592 Gy on the test dataset, ranking 3rd and 9th, respectively, on the CodaLab leaderboard as of writing. In terms of DVH metrics, the average relative MAE against the clinical plans was 2.25% for targets and 2.17% for organs at risk.

Conclusions: A transformer-based framework, TrDosePred, was developed for dose prediction. The results showed comparable or superior performance relative to previous state-of-the-art approaches, demonstrating the potential of transformers to accelerate the treatment planning procedure.

beamlets, and the intensity of each beamlet is modulated to achieve the prescribed dose for the planning target volumes (PTVs) while sparing the organs at risk (OARs). However, treatment planning for IMRT is a time-consuming and labor-intensive process, as planners need to repeatedly adjust a number of parameters in the treatment planning system (TPS) to determine the intensities of the beamlets. Moreover, the quality of a treatment plan depends heavily on the expertise and experience of the planners. 3 To streamline the treatment planning process, one solution is multi-criteria optimization (MCO), which computes a database of Pareto-optimal plans for each patient and then selects a plan that meets the clinical metrics. [4][5][6] Another well-documented solution is knowledge-based planning (KBP), which generates patient-specific treatment plans for new patients based on previously delivered plans. In the last decade, some KBP works aimed at predicting desirable dose-volume histograms (DVHs). [7][8][9] The main limitation of these methods is the lack of 3D information. To overcome this shortcoming, research has shifted toward voxel-level dose prediction. Traditional voxel-level methods predicted 3D dose distributions with artificial neural networks (ANNs) [10][11][12] trained on handcrafted features. Although these features are related to anatomical and plan parameters (e.g., PTV volume, distance from OARs), they fail to preserve the spatial relationship between voxels. Nowadays, convolutional neural networks (CNNs), including U-net variants, [13][14][15][16][17][18] residual networks 19 and generative adversarial networks, 20 have been widely applied to voxel-level dose prediction. Wang et al. 21 gave a detailed summary of these methods.
On the other hand, the transformer, which is well known for its self-attention mechanism and its capability of learning long-range dependencies, 22 has achieved great success in natural language processing (NLP). Motivated by this, several studies have introduced self-attention mechanisms into CNNs. 23,24 Recently, transformers have been applied to vision tasks. With large-scale pre-training, the Vision Transformer (ViT), which takes sequences of image patches and position encodings as input, obtained promising performance for image recognition. 25 Furthermore, many efforts have been devoted to applying transformers to object detection 26 and semantic segmentation. 27 PVT first constructed a four-stage pyramid structure and proposed a spatial-reduction attention. 28 The Swin Transformer computes attention within local windows. 29 CvT utilizes a convolutional projection to capture low-level features. 30 An in-depth survey of these vision transformers and self-attention mechanisms can be found in Ref. [31]. For dose calculation, a recent study framed the particle transport physics as sequence modeling via a transformer encoder and convolutional decoder, running 33 times faster than the clinical pencil beam algorithm. 32 In this study, a 3D transformer-based algorithm was proposed for the first time to predict treatment dose distributions for head and neck cancers and was evaluated quantitatively against CNN-based methods. With a database of only 200 patients, the algorithm can be trained from scratch and completes the 3D dose prediction for a patient in seconds.

Dataset
The proposed algorithm was trained and evaluated on the OpenKBP dataset. 33

2.2
Architecture of TrDosePred

Figure 1a shows the overall architecture of the proposed TrDosePred. With a three-channel feature of contoured CT as input, a patch embedding block first projected it into a sequence of patch tokens. A transformer-based encoder and decoder then built the relationship between the embedded input features and the dose maps. Finally, a simple patch expanding block was applied to generate the 3D dose distribution. These components are elaborated in Sections 2.2.1 and 2.2.2.

Patch embedding block and expanding block
Traditionally in ViT, the input image is first split and linearly mapped into non-overlapping patches by patchify convolutions (i.e., stride equal to kernel size) before being fed into the transformer encoder. However, recent research suggested that an overlapping convolutional stem (i.e., stride smaller than kernel size) used in the shallow layers of a vision transformer can remarkably improve optimization stability and performance. 35 Motivated by this, a patch embedding block composed of several stacked overlapping convolution layers was used to extract patches from the input volume. As shown in Figure 1b, the patch embedding block consisted of three submodules, each with a 3 × 3 × 3 convolution, an Instance Normalization 36 and a Gaussian Error Linear Units (GELU) activation function. 37 After the third submodule, a point-wise convolution with 96 filters was used to project the features to embedding tokens. After the patch embedding, the dimension of the features was reduced by a factor of 2 × 4 × 4.
Symmetrically, a patch expanding block, constructed with a 2 × 4 × 4 transpose convolution and a 3 × 3 × 3 convolution, was applied to recover the resolution of the feature maps after the decoder (illustrated in Figure 1c). In the end, a point-wise convolution was used to generate the dose prediction.
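The two blocks above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the intermediate channel widths and the per-layer strides (chosen here so that the three overlapping convolutions yield the stated 2 × 4 × 4 overall reduction) are assumptions; only the 3 × 3 × 3 kernels, Instance Normalization, GELU, the 96-filter point-wise projection, and the 2 × 4 × 4 transpose convolution come from the text.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Overlapping convolutional patch embedding (sketch)."""
    def __init__(self, in_ch=3, embed_dim=96):
        super().__init__()
        chans = [in_ch, 24, 48, 48]                   # hypothetical widths
        strides = [(1, 2, 2), (2, 2, 2), (1, 1, 1)]   # overall 2 x 4 x 4 reduction
        layers = []
        for i, s in enumerate(strides):
            layers += [
                nn.Conv3d(chans[i], chans[i + 1], 3, stride=s, padding=1),
                nn.InstanceNorm3d(chans[i + 1]),
                nn.GELU(),
            ]
        layers.append(nn.Conv3d(chans[-1], embed_dim, 1))  # point-wise projection
        self.proj = nn.Sequential(*layers)

    def forward(self, x):
        return self.proj(x)

class PatchExpanding(nn.Module):
    """Symmetric expanding block: 2 x 4 x 4 transpose conv + 3 x 3 x 3 conv,
    then a point-wise conv producing the single-channel dose map."""
    def __init__(self, embed_dim=96, out_ch=1):
        super().__init__()
        self.up = nn.ConvTranspose3d(embed_dim, 48, (2, 4, 4), stride=(2, 4, 4))
        self.conv = nn.Conv3d(48, 48, 3, padding=1)
        self.head = nn.Conv3d(48, out_ch, 1)

    def forward(self, x):
        return self.head(self.conv(self.up(x)))

x = torch.randn(1, 3, 16, 64, 64)    # three-channel contoured CT volume
tokens = PatchEmbedding()(x)         # -> (1, 96, 8, 16, 16)
dose = PatchExpanding()(tokens)      # -> (1, 1, 16, 64, 64)
```

The expanding block recovers exactly the resolution removed by the embedding, so the two can bracket any encoder-decoder in between.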

2.2.2
Transformer-based encoder and decoder

After the patch embedding, the extracted tokens were fed into a U-shaped encoder and decoder, in which several 3D swin transformer blocks were stacked. Compared with the vanilla transformer, 25 the computational complexity of the swin transformer is linear in the image size, which makes it more suitable for medical image analysis.
Two consecutive 3D swin transformer blocks are elaborated in Figure 1d. Each 3D swin transformer block consists of a window-based local multi-head self-attention (W-MSA) module and a multi-layer perceptron (MLP) module. To add locality, a depth-wise convolution was introduced between the fully connected layers of the MLP. In addition, Layer Normalization (LN) and a residual connection were adopted before and after each module, respectively. It should be noted that, to establish cross-window connections, the windows are cyclically shifted between two consecutive swin transformer blocks (i.e., SW-MSA). The computational procedure of two consecutive 3D swin transformer blocks can be formulated as:

Z′_l = 3D W-MSA(LN(Z_{l−1})) + Z_{l−1},
Z_l = MLP(LN(Z′_l)) + Z′_l,
Z′_{l+1} = 3D SW-MSA(LN(Z_l)) + Z_l,
Z_{l+1} = MLP(LN(Z′_{l+1})) + Z′_{l+1},

where Z′_l and Z_l denote the output of the 3D (S)W-MSA module and the MLP module for the l-th block.
The attention in each 3D local window can be computed as:

Attention(Q, K, V) = SoftMax(QKᵀ / √d_k + B) V,

where Q, K, V ∈ ℝ^(N_T × d_k) denote the query, key and value matrices; N_T represents the number of tokens in a 3D window; d_k is the dimension of the query and key.
The values in B are taken from a relative position bias matrix B̂ ∈ ℝ^((2W_D − 1) × (2W_H − 1) × (2W_W − 1)), where W_D × W_H × W_W is the dimension of the 3D local window. Details can be found in Liu et al. 29
Referring back to Figure 1a, down-sampling and up-sampling layers were inserted between the encoder blocks and between the decoder blocks, respectively. Specifically, 3 × 3 × 3 convolutions pre-activated by a GELU and LN were applied to halve the feature resolution and double the number of channels between encoder blocks, while 2 × 2 × 2 transpose convolutions were employed to progressively restore the feature resolution between decoder blocks. Moreover, residual connections were added between the features extracted by the encoder and their counterparts in the decoder.
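The windowed attention above can be sketched as a small function. This is an illustrative sketch, not the authors' code: multi-head splitting and the window size are omitted for brevity, and the bias tensor is passed in directly rather than gathered from the learned matrix B̂.

```python
import torch

def window_attention(q, k, v, bias):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_k) + B) V within each
    3D local window. q, k, v: (num_windows, N_T, d_k); bias: (N_T, N_T)."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5 + bias  # (num_windows, N_T, N_T)
    return torch.softmax(scores, dim=-1) @ v              # (num_windows, N_T, d_k)

# Example: 4 windows of 2 x 7 x 7 = 98 tokens each, 32-dim queries/keys/values.
q = torch.randn(4, 98, 32)
k = torch.randn(4, 98, 32)
v = torch.randn(4, 98, 32)
bias = torch.zeros(98, 98)  # stands in for the relative position bias
out = window_attention(q, k, v, bias)  # -> (4, 98, 32)
```

Because attention is computed per window, the cost grows linearly with the number of windows rather than quadratically with the whole volume, which is the complexity advantage noted above.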

Comparison with state-of-the-art methods
To evaluate the performance, the results of the top three methods reported in the official OpenKBP challenge paper 33 were retrieved and compared with TrDosePred, namely (1) C3D: 16 a cascaded 3D U-net; (2) 3D DCNN: 17 an ensemble of 3D patch-based densely connected U-nets with dilated convolutions; and (3) Unet-ResNet3D: 18 a U-net implementation of the pix2pix model with a feature-based loss.
Furthermore, several cutting-edge methods were implemented and compared with TrDosePred, including (1) DeepDose: 14 a 3D variant of U-net first used for dose calculation; (2) HD-Unet: 13 a hierarchically densely connected 3D U-net for dose prediction of HNCs; (3) 2D DCNN: 15 a 2D densely connected U-net with dilated convolutions; and (4) Swin-Unet: 39 a 2D pure transformer-based U-shaped architecture primarily used in medical image segmentation. The training of these models was carried out under the same conditions as C3D (batch size: 2, Adam optimizer with a cosine annealing scheduler, learning rate: 3 × 10⁻⁴, weight decay: 3 × 10⁻⁵). The same data augmentation and ensemble strategies were used for all models.
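The shared training configuration (Adam, learning rate 3 × 10⁻⁴, weight decay 3 × 10⁻⁵, cosine annealing) translates directly into PyTorch. A minimal sketch with a stand-in model and a placeholder epoch count, since the schedule length is not stated in this excerpt:

```python
import torch

model = torch.nn.Conv3d(3, 1, kernel_size=3, padding=1)  # stand-in for any compared model

# Optimizer settings reported for all models (batch size 2 during training).
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=3e-5)

num_epochs = 80  # placeholder: the actual schedule length is an assumption
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... forward/backward passes over the training set would go here ...
    optimizer.step()   # no-op here without gradients; shown for call ordering
    scheduler.step()   # anneal the learning rate once per epoch
```

By the end of the schedule, the learning rate has annealed from 3 × 10⁻⁴ down to (effectively) zero.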
In the end, the predictions of the best model were used to validate and compare the dose distributions and DVH criteria against the clinical plans. A two-sided Wilcoxon signed-rank test was applied to determine statistical differences between models. All experiments were conducted with Python 3.9, PyTorch 1.9 and an NVIDIA RTX 3090 GPU.
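The per-model comparison can be reproduced with SciPy's signed-rank test. A sketch with hypothetical paired per-patient scores (the arrays below are illustrative, not values from the paper):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-patient dose scores (MAE, Gy) for two models.
scores_a = np.array([2.41, 2.63, 2.28, 2.85, 2.52, 2.74, 2.39, 2.66])
scores_b = scores_a + np.array([0.11, 0.15, 0.19, 0.23, 0.27, 0.31, 0.35, 0.39])

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p = wilcoxon(scores_a, scores_b, alternative="two-sided")
significant = p < 0.05
```

The test is paired because both models are evaluated on the same test patients, which is why a signed-rank test is preferred over an unpaired one.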

Ablation study
To investigate the effectiveness of each component, we constructed a baseline based on the modules applied in the Swin Transformer model, 29 while keeping its architecture the same as that of the proposed model. The single TrDosePred was then re-assembled from the baseline step by step, and the performance was compared quantitatively. Table 1 summarizes the dose score and DVH score of each model over the entire test set. The dose score measures the mean absolute error (MAE) of the dose distributions between the predictions and the clinical plans. 33 The DVH score calculates the MAE over five DVH criteria (i.e., D99, D95 and D1 for the three PTVs, and Dmean and D0.1cc for the seven OARs).
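The two scores can be sketched as follows. This is a simplified illustration of the OpenKBP definitions: `dvh_point` approximates a D_x criterion on a voxelized structure, and the official evaluation code applies additional conventions (e.g., for D0.1cc) that are omitted here.

```python
import numpy as np

def dose_score(pred, target, mask):
    """Mean absolute voxel-wise dose error (Gy) inside the patient mask."""
    return np.abs(pred - target)[mask].mean()

def dvh_point(dose, structure_mask, percent):
    """D_x: the minimum dose received by the hottest x% of a structure."""
    voxels = np.sort(dose[structure_mask])[::-1]            # hottest first
    n = max(int(round(len(voxels) * percent / 100.0)), 1)
    return voxels[n - 1]

# Toy example: a 4-voxel structure with doses 10/20/30/40 Gy.
dose = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.ones(4, dtype=bool)
d50 = dvh_point(dose, mask, 50)           # hottest 50% = {40, 30} -> 30 Gy
err = dose_score(dose + 2.0, dose, mask)  # uniform 2 Gy offset -> MAE of 2 Gy
```

The DVH score then averages the absolute differences of such D_x values between predicted and clinical doses over the listed criteria.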

TABLE 1 Comparison with state-of-the-art methods on the test set through the dose score and the DVH score.

On average, for a patient in the test set, the TrDosePred ensemble predicted the dose distribution within 2.32 s (including pre-processing). It achieved a dose score of 2.426 Gy (3.5% of the prescription dose of PTV70), outperforming all implemented methods and the top three solutions of the OpenKBP Challenge. For the DVH score, the TrDosePred ensemble obtained 1.592 Gy (2.2% of the prescription dose of PTV70), ranking 3rd among the approaches listed in Table 1. Table 2 shows the MAE of all regions of interest (ROIs) for the ensemble models on the test set. The TrDosePred ensemble significantly outperformed DeepDose and HD-Unet on PTV70 and most of the OARs. The DeepDose ensemble achieved smaller MAEs on PTV63 and PTV56, and the HD-Unet ensemble achieved a smaller MAE on the larynx; no significant difference from the proposed model was found for these structures. Table 3 shows the MAE between the clinical plans and the predictions of the ensemble models on the test dataset. The TrDosePred ensemble predicted D99, D95 and D1 within 1.838 ± 2.383 Gy, 1.407 ± 1.964 Gy and 1.474 ± 1.269 Gy, respectively, and predicted Dmean and D0.1cc within 1.312 ± 1.442 Gy and 1.898 ± 2.185 Gy. Significant differences were found for all metrics except D0.1cc.

The differences between the clinical and predicted DVH metrics of the TrDosePred ensemble are presented in Figure 2. In general, the medians of all metrics were distributed between −1.208 and 0.854 Gy, and the means of all metrics were distributed between −1.022 and 1.010 Gy. A representative case is shown in Figure 4. Visually, the predicted dose distribution is consistent with the clinical one, as shown in Figure 4b and Figure 4c. Figure 4d shows the differences between the predicted and clinical dose distributions. The mean difference was −0.293 Gy on the PTV70 and 0.869 Gy on the PTV56. Table 4 presents the results of the ablation study. As the re-assembling proceeded, both the dose score and the DVH score improved clearly. In particular, a significant improvement was observed when the convolutional sampling strategy was used. ConvPE further boosted the performance by 0.105 Gy and 0.060 Gy for the dose score and the DVH score, respectively. DW-MLP improved the dose score by 0.041 Gy; however, it resulted in a slight decrease in the DVH score and a significant increase in the training time (34 h needed).

TABLE 2 Comparison with ensemble methods for all ROIs on the test set through the MAE (mean ± standard deviation).

DISCUSSION
In this study, a novel transformer-based framework, TrDosePred, was proposed and evaluated for the 3D dose prediction task, based on a group of HNC patients treated with IMRT. Experimental results indicated encouraging performance of TrDosePred for this regression task.
In our experiment, an ensemble of five TrDosePreds from cross-validation was used and improved the dose score by 3.4% over the best single model (from 2.512 Gy for the single model to 2.426 Gy for the ensemble). Ensembles of models were also applied in the top OpenKBP algorithms to improve performance. For example, 3D DCNN used an ensemble strategy similar to ours. 17 C3D employed a more complex ensemble algorithm: five single models were trained, and their averaged predictions on the training set served as teacher doses for training a new C3D. 16 Moreover, a cascaded strategy was explored during the development stage. However, no improvement was observed when two TrDosePred models were cascaded. We assume that the difference was caused by the relative scale of the latter model. Further experiments were not carried out because of the excessive GPU memory cost.
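The cross-validation ensemble described above amounts to averaging the per-voxel predictions of the trained models. A minimal sketch, where the five-model setup and simple averaging follow the text but the toy "models" are placeholders:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the voxel-wise dose predictions of several trained models."""
    return np.mean([m(x) for m in models], axis=0)

# Placeholder callables standing in for the five cross-validation networks.
models = [lambda x, s=s: x * s for s in (0.9, 0.95, 1.0, 1.05, 1.1)]
x = np.ones((2, 2, 2)) * 60.0        # toy 60 Gy dose volume
dose = ensemble_predict(models, x)   # element-wise mean of the five outputs
```

Averaging smooths out model-specific errors, which is consistent with the dose-score gain observed when moving from the best single model to the ensemble.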
Many of the recent studies on KBP are based on CNNs, which are limited by their receptive field. To increase the receptive field without loss of resolution, some studies used dilated convolutions as the basic units. 15,17 Our approach employed the transformer, whose receptive field can cover the whole input feature map. This allowed the proposed model to build long-range, global connections that CNNs cannot. Furthermore, transformers can make the model robust to perturbations owing to their flexible and dynamic receptive fields. 40 The proposed algorithm can predict 3D dose distributions precisely on the OpenKBP dataset. However, as mentioned in the OpenKBP paper, 33 synthetic dose distributions were used to augment the real clinical ones. Our further investigation indicated that few of the OpenKBP data met the dose constraints of the Radiation Therapy Oncology Group (RTOG) 1016 protocol. 41 Thus, extended evaluations are required to guarantee that the proposed algorithm works well on real clinical cases. Additionally, in order to generate deliverable plans, an optimization procedure will be required. A popular pipeline for KBP is prediction-mimicking, which attempts to generate treatment plans as similar as possible to the predicted dose distributions based on voxel-level, DVH-level or structure-level objectives. 19,42,43 In the future, we could shift our research from 3D dose prediction to deriving the optimization parameters of treatment plans. Some limitations are worth noting. First, it is possible that the capacity of the transformer has yet to be fully exploited due to the limited data. Recent studies indicated that combining a few convolutions with transformers can provide both good generalization and capacity, achieving comparable performance to CNNs on small datasets and even superior performance as the scale of the dataset grows. 30,44 A further study indicated that transformers can outperform CNNs as well when trained on thousands of medical images. 45 Therefore, we expect that the performance of TrDosePred could be further improved if a larger dataset were available (e.g., thousands of patient cases).
Second, a single global dose distribution based loss (i.e., MAE) may not be sufficient for optimal DVH metrics. Nguyen et al. suggested that a differentiable DVH loss can improve domain-relevant metrics. 46 Zimmermann et al. employed a feature-based loss network, resulting in good performance on the DVH score. 18 Further improvement may be achievable by incorporating such advanced losses.
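One common way to make a DVH criterion differentiable, in the spirit of Nguyen et al., 46 is to relax the hard step function in the volume computation with a sigmoid. A hedged sketch: the bin placement, the width σ, and the squared-error comparison below are illustrative choices, not the published formulation.

```python
import torch

def soft_dvh(dose, mask, bins, sigma=1.0):
    """Differentiable DVH: fraction of structure voxels receiving at least
    each threshold dose, with the hard step relaxed by a sigmoid of width sigma."""
    d = dose[mask].reshape(-1, 1)                         # (N_voxels, 1)
    return torch.sigmoid((d - bins.reshape(1, -1)) / sigma).mean(dim=0)

def dvh_loss(pred, target, mask, bins):
    """Squared error between the predicted and reference soft DVH curves."""
    return torch.mean((soft_dvh(pred, mask, bins) - soft_dvh(target, mask, bins)) ** 2)

bins = torch.linspace(0.0, 70.0, steps=8)      # dose thresholds in Gy
mask = torch.ones(4, 4, 4, dtype=torch.bool)
pred = torch.rand(4, 4, 4) * 70.0
loss = dvh_loss(pred, pred, mask, bins)        # identical inputs -> zero loss
```

Because the sigmoid is smooth, gradients flow from the DVH curve back to individual voxel doses, so such a term can be added to the MAE loss during training.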

CONCLUSION
In this study, a transformer-based network, TrDosePred, was proposed for the dose prediction task. Compared with several existing CNN-based approaches, TrDosePred achieved comparable or superior performance, demonstrating the potential of transformers to accurately and rapidly predict 3D dose distributions for head and neck cancer patients treated with IMRT.

AUTHOR CONTRIBUTIONS
Chenchen Hu and Haiyun Wang performed the experiments and the statistical analysis, and drafted the manuscript. Wenyi Zhang and Yaoqin Xie contributed to the design of the study and the review of the analysis. Ling Jiao and Songye Cui planned the study and reviewed the manuscript.