Error detection using a multi‐channel hybrid network with a low‐resolution detector in patient‐specific quality assurance

Abstract
Purpose: This study aimed to develop a hybrid multi-channel network to detect multileaf collimator (MLC) positional errors using dose difference (DD) maps and gamma maps generated from low-resolution detectors in patient-specific quality assurance (QA) for Intensity Modulated Radiation Therapy (IMRT).
Methods: A total of 68 IMRT plans with 358 beams were included in this study. The MLC leaf positions of all control points in the original IMRT plans were modified to simulate four types of errors: shift error, opening error, closing error, and random error. These modified plans were imported into the treatment planning system (TPS) to calculate the predicted dose, while the PTW seven29 phantom was utilized to obtain the measured dose distributions. Based on the measured and predicted dose, DD maps and gamma maps, both with and without errors, were generated, resulting in a dataset with 3222 samples. The network's performance was evaluated using various metrics, including accuracy, sensitivity, specificity, precision, F1-score, ROC curves, and the normalized confusion matrix. In addition, baseline methods, namely a single-channel hybrid network, ResNet-18, and Swin Transformer, were evaluated for comparison.
Results: The experimental results showed that the multi-channel hybrid network outperformed the other methods, demonstrating higher average precision, accuracy, sensitivity, specificity, and F1-scores, with values of 0.87, 0.89, 0.85, 0.97, and 0.85, respectively. The multi-channel hybrid network also achieved higher AUC values in the random error (0.964) and error-free (0.946) categories. Although the average accuracy of the multi-channel hybrid network was only marginally better than that of ResNet-18 and Swin Transformer, it significantly outperformed them regarding precision in the error-free category.
Conclusion: The proposed multi-channel hybrid network exhibits a high level of accuracy in identifying MLC errors using low-resolution detectors. The method offers an effective and reliable solution for promoting quality and safety of IMRT QA.


INTRODUCTION
Intensity Modulated Radiation Therapy (IMRT) is a critical modality in radiation therapy, offering significant dose modulation capabilities to deliver high doses to the tumor while minimizing dose to surrounding organs. 1 Since numerous control points are used in IMRT for dose modulation, ensuring patient safety and treatment efficiency necessitates thorough quality assurance (QA) procedures. 2 QA in IMRT encompasses several steps, including transferring the clinical treatment plan to a measurement device, delivering the planned dose to the measurement device, and comparing the measured dose with the planned dose. Gamma analysis, which takes both dose deviation and distance-to-agreement between measured and calculated points into account, is widely utilized in clinical practice to assess the accuracy of treatment plans. 3 Typically, gamma analysis employs criteria such as a 3% dose deviation and 2 mm spatial distance, with a 10% dose threshold. A gamma passing rate exceeding 95% indicates the clinical acceptability of the treatment plan. 4
Although gamma analysis is a commonly used QA method in clinical practice to determine the accuracy of treatment delivery, clinical physicists are unable to determine the cause of the errors if the QA results are deemed faulty. 5 To address this problem, a combination of radiomics and machine learning was employed to identify delivery errors in IMRT QA. This approach involved extracting numerous quantitative features (e.g., shape, texture, intensity) from dose difference (DD) maps or gamma maps generated by an electronic portal imaging device (EPID). Subsequently, machine learning algorithms, such as logistic regression or support vector machines, were applied to these radiomic features for error classification in IMRT QA. Wootton et al. utilized radiomics analysis to classify multileaf collimator (MLC) random and systematic errors based on the gamma distribution. 6 Nyflot et al. employed machine learning methods to analyze gamma images generated from EPID to identify different MLC errors. 7 Ma et al. also utilized machine learning methods to analyze the structural similarity index measure (SSIM) generated from EPID images to identify delivery errors, including MLC and monitor unit (MU) errors. 8 Leveraging the remarkable success of deep learning in machine vision classification tasks, another approach involved utilizing Convolutional Neural Networks (CNN) for feature extraction from DD maps generated from 3D detectors to facilitate error detection. Kimura et al. employed DD maps generated from the Delta4 phantom (ScandiDos, Sweden) and applied a CNN to predict two types of MLC errors. 9 Both methods yielded superior results in error detection compared to traditional gamma analysis.
In previous research, [6][7][8] the DD maps or gamma maps used for error detection were obtained from either EPID or 3D detectors. Although these two detectors have been widely applied in clinical practice, 2D dosimeters such as 2D diode arrays or 2D ion chamber arrays are more commonly used for patient QA. 10 Therefore, the direct utilization of 2D arrays for identifying machine delivery errors holds clinical significance. However, due to the superior spatial resolution of EPID compared to 2D detector arrays, and the high sensitivity of EPID in patient-specific QA, the application of low-resolution 2D arrays presents a considerable challenge for detecting errors in QA. [11][12] Therefore, one aim of our research is to apply deep learning methods in conjunction with a low-resolution 2D array for MLC error detection.
Furthermore, it is important to note that previous research efforts utilizing CNN methods for MLC error identification in patient-specific QA only used a single image type as input, either the DD map or the gamma image. Kimura et al. showed that using the DD map as input yielded superior results in CNN predictions compared to the gamma images, though this research was based on 3D detectors and did not involve a 2D array. 9 Wolfs et al. investigated the effect of many different dose comparison methods on the performance of AI models for error detection in pre-treatment QA. 13 In patient-specific QA, the DD map is highly sensitive in high-gradient dose regions, where small spatial discrepancies can result in significant DDs. However, these discrepancies may not be caused by MLC errors but rather by human operation errors or set-up errors, such as detector positioning errors. Although gamma images are not highly correlated with the spatial positions of failed points, they align better with clinical application guidelines. 4 Therefore, the other aim of this study is to determine the types of images to use as input for deep learning when utilizing a low-resolution 2D array, particularly to combine DD maps and gamma images as multi-channel inputs to enhance the performance of the deep learning network.
In this study, we developed a novel deep learning network for MLC error detection using a low-resolution 2D array. This network utilized DD maps and gamma maps as multi-channel inputs to improve network performance. The training parameters and source code can be found on GitHub (https://github.com/shijun18/MLC_CLS).

Treatment plan and delivered dose measurement
In this study, 68 IMRT plans of patients treated in our hospital were selected. The anatomic sites of the plans include head and neck, chest, abdomen, and pelvis. There were 358 beams in all plans, and the number of beams per plan ranged from 3 to 13. All treatment plans used 6 MV photons with a dose rate of 600 MU/min and were designed using the Monaco (Version 5.11.02, Elekta, Sweden) treatment planning system (TPS) with the Monte Carlo dose algorithm and the sliding window technique of fixed-gantry IMRT. The plans were performed on the Elekta Synergy linear accelerator with an MLC of 1 cm leaf width at the isocenter. 14 The perpendicular field-by-field method was utilized to measure these patient-specific IMRT plans. All IMRT plans were transferred to be delivered to the 2D ion chamber array (PTW seven29, PTW, Germany) with the gantry angle at 0 degrees for all beams, and the dose was recalculated using the TPS. In this way, the calculated dose is in the same plane as the detector. The measuring device was calibrated using a 10 × 10 cm field before measurement.

MLC error simulation
To facilitate a comparative analysis with previous literature and substantiate the effectiveness of our method, [6][7][8] four types of MLC position error plans (shift, opening, closing, and random), each at two magnitudes, were simulated in this study. The types of errors are described below.
Shift error: For each control point of the beams, all MLC leaves were shifted by 2 or 3 mm in the same direction, resulting in identical field sizes and a 2 or 3 mm shift relative to the error-free plans.
Opening error: For each control point of the beams, all MLC leaves were moved by 1 or 2 mm in opposite directions, away from the beam axis. Compared with the error-free plan, this error plan corresponds to an expansion of the field size by 2 or 4 mm in the X-direction for all control points of the beams.
Closing error: To simulate a closing error, the positions of all MLC leaves were adjusted by 1 or 2 mm towards the beam axis for each control point of the beams, resulting in a 2 or 4 mm reduction in the field size of all control points compared to the original error-free plan.
Random error: For each control point of the beams, all MLC leaves were displaced randomly according to a Gaussian distribution with a standard deviation δ of 1 or 2 mm.
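The four error types above can be illustrated with a minimal Python sketch for a single control point. This is not the in-house program used in the study: the function name `apply_mlc_error`, the representation of the two leaf banks as lists of positions in mm, the choice of the positive X-direction for the shift, and the independent per-leaf sampling for random errors are all assumptions made for illustration. The sketch does not check for leaf collisions.

```python
import random

def apply_mlc_error(left, right, kind, magnitude_mm, rng=None):
    """Return modified (left, right) leaf-bank positions (mm) for one control point.

    left/right are lists of leaf positions along X, with left[i] <= right[i].
    All names and conventions here are illustrative assumptions.
    """
    if kind == "shift":
        # Both banks move the same way, so the field size is unchanged.
        new_left = [x + magnitude_mm for x in left]
        new_right = [x + magnitude_mm for x in right]
    elif kind == "opening":
        # Banks move away from the beam axis: field grows by 2 * magnitude.
        new_left = [x - magnitude_mm for x in left]
        new_right = [x + magnitude_mm for x in right]
    elif kind == "closing":
        # Banks move towards the beam axis: field shrinks by 2 * magnitude.
        new_left = [x + magnitude_mm for x in left]
        new_right = [x - magnitude_mm for x in right]
    elif kind == "random":
        # Assumed: each leaf gets an independent zero-mean Gaussian offset
        # whose standard deviation is magnitude_mm.
        rng = rng or random.Random()
        new_left = [x + rng.gauss(0.0, magnitude_mm) for x in left]
        new_right = [x + rng.gauss(0.0, magnitude_mm) for x in right]
    else:
        raise ValueError(f"unknown error type: {kind}")
    return new_left, new_right

if __name__ == "__main__":
    left, right = [-10.0, -5.0], [10.0, 5.0]
    for kind, mag in [("shift", 2.0), ("opening", 1.0), ("closing", 2.0)]:
        print(kind, apply_mlc_error(left, right, kind, mag))
```

In a full workflow, a function like this would be applied to every control point of every beam before the modified plan is written back and recomputed in the TPS.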
To acquire the predicted dose of these MLC leaf error plans in the same plane as the 2D ion chamber array, error-free IMRT plans were exported from the TPS as DICOM RT-plan files. The MLC leaf positions of all control points in the RT-plan files were modified to the specified positions using an in-house program in MATLAB (Version R2016a, MathWorks, USA), and then the modified RT-plan files were re-imported into the TPS and recomputed to generate the predicted dose.
Measured dose maps for the error-free plans were acquired via delivered dose measurement, while predicted dose maps for both the error-free plans and the eight types of MLC leaf error plans were obtained through error simulation, as shown in Figure 1.

Data processing
Three hundred fifty-eight measured dose maps of error-free plans and 3222 predicted dose maps were obtained through QA measurement and simulation of MLC leaf error plans. The predicted dose maps include 358 error-free maps and 2864 error maps, with 716 maps for each of the four error types: random, opening, closing, and shift errors. These maps were processed to obtain DD maps and gamma maps, which were used as input to the deep learning algorithm for classifying MLC leaf delivery errors. Kimura's method 15 was employed to ensure consistency in processing the DD maps. The processing steps for the gamma maps were similar, except that the gamma maps were evaluated using two passing rate criteria of 3%/2 mm and 2%/1 mm, with a dose threshold of 10% and global normalization. 4 Positive and negative signs were introduced in the gamma evaluation to distinguish hot and cold spots in the predicted and measured dose maps. Specifically, negative gamma values represented cold spots, while positive gamma values represented hot spots. Finally, a normalization process was conducted to use the gamma values as the network input. The gamma values were clipped to an upper limit of 1.5 and a lower limit of −1.5, and a linear scaling was then applied to normalize the values to the range of −1 to 1, following the method of Kimura. 9
Before training, all pre-processed images were merged and linearly interpolated to 224 × 224 × 3. We interpolated the images from 27 × 27 to 224 × 224 to improve the model's performance for two reasons. The first reason is that the architecture of the network in our study is very deep and contains multiple downsampling processes. For each downsampling process, the image size undergoes a 50% reduction. Employing a 27 × 27 image size would entail a considerable loss of spatial information within the feature map after several downsampling processes, adversely affecting the model's overall performance. The second reason is based on Wolfs's research, which indicates that higher-resolution images can enhance the classification performance of deep learning networks. 13 It is recommended to use an image resolution of 128 × 128 with a pixel size of 0.7 mm × 0.7 mm for the input of the classification model. As the two-dimensional low-resolution detector (PTW seven29, PTW, Germany) used in our study has a resolution of only 27 × 27 (with a pixel size of 1 × 1 cm for the original dose images), we resized the images to 224 × 224 with a pixel size of 1.2 mm × 1.2 mm to improve the performance of the model. Gamma maps were processed using Pygamma, a Python package available on GitHub (https://github.com/christopherpoole/pygamma). 16 Figure 2 illustrates a typical processed DD map and gamma map.
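The gamma-map normalization and the pixel-size arithmetic described above can be sketched in a few lines of plain Python. The function names are hypothetical, and the sign convention in `signed_gamma` (taking the sign from the dose difference at the same point, with negative meaning a cold spot) is an assumption; the text only states that negative gamma values mark cold spots and positive values mark hot spots.

```python
def signed_gamma(gamma, dose_difference):
    """Attach a sign to a non-negative gamma value.

    Assumed convention: a negative dose difference marks a cold spot.
    """
    return -gamma if dose_difference < 0 else gamma

def normalize_signed_gamma(gamma_map, limit=1.5):
    """Clip signed gamma values to [-limit, limit], then scale to [-1, 1]."""
    return [[max(-limit, min(limit, g)) / limit for g in row] for row in gamma_map]

def resized_pixel_mm(n_detectors=27, pitch_mm=10.0, out_size=224):
    """Pixel size after interpolating an n x n detector grid to out_size x out_size."""
    return n_detectors * pitch_mm / out_size

if __name__ == "__main__":
    print(normalize_signed_gamma([[2.0, -3.0, 0.75, 0.0]]))
    print(round(resized_pixel_mm(), 2))  # 27 chambers at 1 cm pitch over 224 pixels
```

The last helper reproduces the pixel size quoted in the text: 270 mm spread over 224 pixels gives roughly 1.2 mm per pixel.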

Network architecture
As shown in Figure 3, the proposed method was a hybrid structure that combined the CNN and Vision Transformer (ViT). 17 There were two considerations behind this design. On one hand, due to the constraints of the local receptive field, the existing CNN-based classification methods [18][19][20][21][22] had limited performance in MLC error identification. On the other hand, although most ViT-based structures 17,23,24 could model global spatial dependencies, the data-hungry property of these methods led to the need for large-scale labeled data, which was an insurmountable challenge in our task. Therefore, to combine the advantages of the two structures, we proposed a novel multi-channel hybrid network (HybridNet-MC) for MLC error identification. HybridNet-MC comprised CNN and ViT structures connected in series, where the former extracted the semantic features of the merged image, and the latter learned global dependencies to obtain high-quality discriminative features and projected them into the decision space. Specifically, the CNN part of the network consisted of a stem module and multiple Residual Channel Attention (RCA) modules. The stem module performed feature encoding on the input image through a successive Convolution-Batch Normalization-ReLU (CBR) operator cluster with a kernel size of (7, 7), resulting in a low-level feature map of (224, 224, 32) resolution. Subsequently, multiple RCA modules extracted high-level semantic features with lower spatial resolution and larger feature dimensions from these low-level feature maps. To reduce the computation cost, the RCA module adopted a bottleneck structure, expanding the feature dimension by the CBR operator cluster with a kernel size of (1, 1). In the ViT part, the Swin Transformer (SwinT) module was employed to model the global context dependencies to enhance semantic representation. Unlike the standard ViT structure, the SwinT module introduced a non-overlapping local window mechanism and achieved cross-window information interaction through a shift operation, leading to lower computational complexity. During the decision-making stage, the discriminative features obtained from the network were mapped to predicted probabilities by the classifier head. The predicted result corresponded to the error category with the highest probability. Compared with pure CNN and ViT structures, our method could extract local and global semantic features and reduce the data-hungry attribute, making it more suitable for MLC error identification. The experimental results also supported the effectiveness of the proposed method.
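The non-overlapping window mechanism and the shift operation mentioned above can be illustrated on a toy grid. This is a plain-Python sketch of the general shifted-window idea, not the network code from the repository; `window_partition` and `cyclic_shift` are hypothetical helper names, and attention itself is omitted.

```python
def window_partition(grid, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows."""
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([row[left:left + m] for row in grid[top:top + m]])
    return windows

def cyclic_shift(grid, s):
    """Roll the grid up and to the left by s positions with wrap-around."""
    rolled_rows = grid[s:] + grid[:s]
    return [row[s:] + row[:s] for row in rolled_rows]

if __name__ == "__main__":
    # A 4 x 4 grid of token indices 0..15 with window size 2.
    grid = [[4 * r + c for c in range(4)] for r in range(4)]
    regular = window_partition(grid, 2)
    shifted = window_partition(cyclic_shift(grid, 1), 2)
    print(regular[0])  # tokens that share the first regular window
    print(shifted[0])  # tokens from four different regular windows
```

In the regular partition, self-attention is restricted to tokens inside each window. After the cyclic shift by half a window, the first window gathers tokens 5, 6, 9, and 10, which belonged to four different regular windows, so alternating the two partitions lets information cross window boundaries.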

Comparison models
We employed two types of networks to compare the performance of HybridNet-MC and validate the impact of multi-channel input and hybrid network architecture on the model. Three single-channel hybrid models (HybridNet-SC) were implemented to investigate the effect of multi-channel input on the model. These networks were named HybridNet-DD, HybridNet-Gamma32, and HybridNet-Gamma21 based on the input image types (DD maps, 3%/2 mm gamma maps, and 2%/1 mm gamma maps, respectively). On the same dataset, pure CNN and ViT models, namely ResNet18 18 and Swin Transformer, 17 were trained and tested to study the advantages of the hybrid architecture. To ensure experimental reliability, these networks adopted the same optimization method and training strategy as HybridNet-MC.

FIGURE 2
Examples of dose difference maps and gamma maps under two different criteria for a single beam. From top to bottom, the difference maps correspond to the following scenarios: error-free, random error, shift error, opening error, and closing error. These maps were generated based on predicted dose distributions with and without errors, where the reference dose distribution is the predicted dose distribution with and without errors, and the measured distribution corresponds to the error-free measured dose distribution.

Implementation details
The baselines and the proposed method were implemented in PyTorch. Each fold of all models was trained from scratch using one NVIDIA A100 graphics processing unit (GPU) with 40 GB of memory. The cross-entropy loss was adopted as the optimization objective, using the AdamW 25 optimizer with an initial learning rate of 1e-5 and a weight decay of 1e-4. In particular, the cosine annealing strategy 26 was applied to adjust the learning rate during the training process. The default batch size was 32, and the input sizes for the merged and single-channel images were 224 × 224 × 3 and 224 × 224 × 1 pixels, respectively. To mitigate the overfitting problem, an early stopping strategy was utilized with a tolerance of 50 epochs to search for the best model within 150 epochs. Moreover, online data augmentation was conducted, including random erasure, affine, rotation, flipping, and color jitter. 8][29][30] Dropout prevents overfitting by discarding (both hidden and visible) units of the CNN with a probability of p. Inspired by dropout, random erasure is somewhat similar to performing dropout on the image level. 27 Random erasure involves locally obscuring parts of an image, compelling the model to learn more diverse and descriptive features, thereby preventing the model from overfitting to specific visual characteristics. Similar to our work, Nakamura 28 created and evaluated deep learning models for the detection and classification of Transmission Factor (TF) and Dosimetric Leaf Gap (DLG) errors in volumetric modulated radiation therapy (VMAT). That study also employed random erasure for data augmentation on dose-difference images, demonstrating that random erasure does not affect the model's performance. Color jitter is a commonly used data augmentation technique in computer vision and image processing, aimed at increasing the diversity of training data and enhancing the robustness of deep learning models through subtle changes to the colors of images. 29 Feng et al. employed a siamese network, combining CT images and 3D dose distributions, to predict radiation pneumonitis. In their study, data augmentation included random flipping, rotation, contrast adjustment, color jitter, and affine transform, achieving a prediction accuracy of 0.818 and demonstrating the feasibility of applying color jitter data augmentation to dose image datasets. 30 Based on this research, we used the color jitter technique in our data augmentation. It should be noted that, in our study, color jitter only transformed the contrast and brightness of the images. This alteration represented a global change, and the differences between each pixel were preserved.
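The learning-rate schedule and early stopping rule described above can be sketched without any deep learning framework. The sketch assumes that the cosine schedule spans the full 150 epochs and decays to a minimum of zero, and that early stopping monitors a validation loss; the text does not state these details, and the function names are hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=150, base_lr=1e-5, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (period and min_lr assumed)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

def best_epoch_with_early_stopping(val_losses, patience=50):
    """Return the index of the best epoch, stopping once no improvement
    has been seen for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # tolerance exhausted, keep the best model found so far
    return best_epoch

if __name__ == "__main__":
    for epoch in (0, 75, 150):
        print(epoch, cosine_lr(epoch))
    print(best_epoch_with_early_stopping([1.0, 0.5, 0.6, 0.7, 0.4], patience=2))
```

With a tolerance of 2 in the toy example, training stops before the later improvement at index 4 is reached, which is the usual trade-off of a small tolerance; the study's tolerance of 50 within 150 epochs is far more permissive.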

Model evaluation
The One versus All methodology was utilized to compute the Receiver Operating Characteristic (ROC) curves and the Area Under the ROC Curve (AUC) to assess the networks' classification effectiveness. 31 The AUC ranges between 0 and 1, with higher values indicating superior classification capabilities. To further evaluate the classification performance of the networks, several metrics were computed, including accuracy, precision, sensitivity, specificity, F1-score, and a normalized confusion matrix, which helps to visually examine the relationship between the predicted results of the different networks and the true labels. Furthermore, the t-SNE method was applied to reduce the dimensionality of the network's feature maps to a two-dimensional space, facilitating the visualization of classification errors across diverse networks. 32
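As an illustration of the one-versus-all metrics listed above, the following sketch derives them for one class from a multi-class confusion matrix. The function name and the layout convention (rows are true labels, columns are predicted labels) are assumptions made for this example.

```python
def one_vs_all_metrics(confusion, k):
    """Per-class metrics for class k from a square confusion matrix.

    Assumed layout: confusion[i][j] counts samples of true class i
    that were predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp  # other classes predicted as k
    tn = total - tp - fn - fp

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, total),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }

if __name__ == "__main__":
    toy = [[8, 2], [1, 9]]
    print(one_vs_all_metrics(toy, 0))
```

Averaging these per-class values over the five categories gives macro-averaged scores; whether the reported averages were computed this way is not stated in the text.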

Classification results of HybridNet-MC
Table 1 presents the precision, accuracy, sensitivity, specificity, and F1-scores of the hybrid network for the five-category classification on the test dataset with five-fold cross-validation. The mean values for the precision, accuracy, sensitivity, specificity, and F1-score were 0.87, 0.85, 0.85, 0.97, and 0.89, respectively. The metrics were similar across the cross-validation iterations, indicating that HybridNet was stable and robust to variations in the input data.

TABLE 1
The precision, accuracy, sensitivity, specificity, and F1-score of the HybridNet-MC model, which employed the dose difference maps and two criteria gamma maps (3%/2 mm, 2%/1 mm), were evaluated through five-fold cross-validation.
Note: "Fold 1-5" refers to the results obtained from each of the five cross-validation iterations, while "All Folds" indicates the mean value from fold 1 to fold 5.

Classification results between HybridNet-MC and HybridNet-SC
A comparative analysis was conducted between the HybridNet-MC and the HybridNet-SC trained with DD maps, gamma maps (3%/2 mm), and gamma maps (2%/1 mm), respectively. The results revealed that the HybridNet-MC performed better in terms of accuracy, precision, sensitivity, specificity, and F1-score than the HybridNet-SC, as shown in Table 2. Regarding the classification of specific error types, all four networks exhibited similar performance levels for shift errors, opening errors, and closing errors. However, the HybridNet-MC exhibited enhanced performance compared to the three single-channel networks in detecting random errors and error-free cases, as illustrated in Figures 5 and 6. Specifically, the HybridNet-MC demonstrated higher AUC values for random errors (0.964) and no errors (0.946) compared to the other three networks.
The high-dimensional features of the HybridNet-DD, HybridNet-Gamma32, HybridNet-Gamma21, and HybridNet-MC networks were projected onto a two-dimensional scatter plot using the t-SNE method, as presented in Figure 7. In comparison to HybridNet-DD, HybridNet-Gamma32, and HybridNet-Gamma21, the HybridNet-MC network separated shift errors, opening errors, and closing errors into distinct clusters, with higher separation between clusters and stronger intra-cluster compactness.

Classification results between HybridNet-MC and CNN/ transformer networks
The HybridNet-MC, ResNet-18, and SwinT were trained using five-fold cross-validation on the same training dataset. The average precision of both ResNet-18 and SwinT was 0.86, slightly lower than that of the HybridNet-MC. For the classification of error-free cases, the precision values for ResNet-18 and SwinT were 0.56 and 0.57, respectively, markedly lower than the HybridNet-MC's precision value of 0.67. The precision, accuracy, sensitivity, specificity, and F1-score of the HybridNet-MC, ResNet-18, and SwinT are shown in Table 3.

DISCUSSION
In this paper, we proposed a hybrid network that employed DD maps and gamma maps obtained from a 2D ion chamber array as multi-channel inputs for detecting MLC errors in patient-specific IMRT QA. The results demonstrate that the proposed method can effectively identify MLC errors using measured low-resolution dose maps. 7][8] In our study, we replaced the high-resolution EPID with a low-resolution 2D ion chamber array for detecting MLC errors in patient-specific QA plans. To improve the accuracy of MLC error classification, we utilized multiple types of image inputs and a hybrid network. While our simulation methodology for inducing MLC errors was similar to that of Wootton, 6 Nyflot, 7 and Kimura, 9 the results cannot be directly compared due to the use of different detectors. Nonetheless, our results demonstrated that utilizing multiple types of inputs had higher sensitivity than using a single type of input, and the hybrid network outperformed the CNN network in terms of sensitivity. To the best of our knowledge, this is the first report suggesting that a hybrid network with multiple types of image inputs performs better than a CNN network with a single type of image input for MLC error classification.
Our proposed method achieved the best results on all evaluation metrics when taking merged images as input, as shown in Table 2. The reason behind this is that the merged image has richer semantic information than single-channel inputs such as DD maps, gamma maps (3%/2 mm), and gamma maps (2%/1 mm). HybridNet can effectively fuse these different types of information and extract high-quality decision-making features, resulting in better classification performance. Kimura's research revealed that the precise locations associated with error points can be directly determined using DD maps. 9 However, our findings indicated that when DD maps were used as stand-alone input, the classification results became overly sensitive, leading to the potential misclassification of measurement points with significant DDs in error-free plans as random errors. On the other hand, the use of gamma maps reduced this sensitivity, but the drawback is the lack of direct positional information regarding the error points. For instance, HybridNet-DD exhibited a sensitivity of 0 for error-free plans, while HybridNet-Gamma32 and HybridNet-Gamma21 achieved a sensitivity of 0.08 and 0.18, respectively. The advantages of a multi-input network lie in the ability to share information among different types of error images (DD maps, gamma maps) within the feature extraction path, thereby enhancing the performance of the multi-input network.

TABLE 3
The precision, accuracy, sensitivity, specificity, and F1-score of the HybridNet-MC, CNN and transformer networks were evaluated using five-fold cross-validation.

The HybridNet exhibited higher discriminability among shift, opening, and closing errors, as shown in Figure 7. However, the cluster of random errors overlapped with the cluster of error-free cases, partially explaining the confusion between the random error and error-free classes across the four networks. There may be two reasons. First, the IMRT error plans in this study were simulated by introducing fixed-value errors to the MLC leaf positions of the control points, and Gaussian errors with standard deviations of 1 and 2 mm were added to the MLC leaf positions to simulate random errors. Due to the small magnitude of MLC leaf position changes and the limited resolution of the detectors, the differences in DD maps or gamma maps between the cluster of random errors and the cluster of error-free cases were minimal. 33 Second, the measurement was integrated over time, and errors in different control points may cancel out when summed up. For example, within the same beam, if control point 1 exhibited a positive error at a measurement point and control point 2 showed a corresponding negative error at the same measurement point, the dose error in the beam at that specific measurement point might have diminished or disappeared. These reasons may explain the HybridNet's inability to perfectly differentiate between error-free cases and random errors.
5] Transformer was first applied to natural language processing (NLP) as a potential deep learning network and achieved state-of-the-art (SOTA) performance. 36 The transformer model was recently adopted in computer vision and performed excellently in numerous machine vision tasks. 23,37 This study employed a hybrid methodology to exploit the inherent complementarity between local feature extraction and global contextual representation in CNN and ViT. In a comparative analysis of the HybridNet-MC network with the conventional CNN and ViT models, the HybridNet-MC network demonstrated superior performance, particularly in classification accuracy for the random error and error-free categories, as evidenced by the results presented in Table 3. The HybridNet exhibited greater distances between clusters and higher intra-cluster cohesion, indicating that the hybrid network was more adept at recognizing differences between various MLC error categories and common features among errors within the same category in high-dimensional feature space, as shown in Figure 7. Thus, the hybrid network architecture allows for the effective integration of the strengths of both CNN and ViT, resulting in enhanced model performance.
In clinical practice, 2D detectors are commonly used in patient-specific IMRT QA. Due to the lower spatial resolution of 2D detectors and the comparative insensitivity of gamma analysis, these detectors are insensitive to MLC leaf position errors. Yan et al. introduced systematic errors ranging from 1 to 2 mm and random errors up to 2 mm in MLC leaf positions during patient-specific IMRT QA, revealing that only MLC systematic errors of 2 mm or more could be detected. 39 Shang et al. also found that employing the 2D chamber array could only detect MLC leaf shift errors larger than 2 mm and suggested more sensitive methods to detect subtle MLC errors. 40 Traditional gamma threshold analysis could not detect these subtle MLC leaf positioning errors, which caused clinically relevant dosimetric changes to the planning target volume (PTV) and organs at risk (OAR) of treatment plans. Bai et al. recommended controlling MLC leaf random errors to below 2 mm and systematic errors within 0.5 mm to minimize clinically relevant dose changes to the PTV and OAR for nasopharyngeal carcinoma. 41
In previous studies, [6][7][8][9] deep learning methods enhanced sensitivity in detecting MLC positioning errors compared to traditional gamma analysis. These studies focused on EPID or 3D detectors, lacking application to 2D low-resolution detectors. To improve the detection sensitivity and classification accuracy of MLC positioning errors with 2D low-resolution detectors, we employed a hybrid deep network architecture combining CNN and ViT, along with a multi-channel input using DD and gamma maps. The results demonstrated classification accuracy exceeding 0.8 for systematic and random errors ranging from 1 to 2 mm. Compared to the studies by Yan 39 and Shang, 40 our proposed method effectively enhanced the sensitivity and classification accuracy of MLC positioning error detection. This work advanced the sensitivity of low-resolution detectors in detecting MLC positioning errors and enabled the classification of MLC positioning error types, thereby helping to ensure delivery accuracy.
Our study was subject to several limitations that should be noted. First, our investigation focused on classifying MLC leaf position errors using a low-resolution 2D ion chamber detector in patient-specific QA. Although MLC leaf position errors represented a typical and well-recognized type of error, it was important to acknowledge the existence of other error sources within patient-specific QA, such as MU, collimator rotation, gantry rotation, and transmission factor. 42 Our study aimed to detect and classify MLC positioning errors. The detection and classification of other MLC errors, such as leaf transmission, jaw tracking positions, and multiple combined MLC errors, will be subjects of our future research. Another limitation pertained to the narrow focus of our method, which solely encompassed the classification of MLC errors without considering their impact on clinically relevant dose distributions within the PTV or OAR regions. Specifically, our method did not assess whether MLC errors introduced notable deviations in dose distribution within these critical anatomical structures. If the errors resulted in dose discrepancies in clinically insignificant regions, they might be deemed clinically acceptable. To address this limitation in future research, we propose to incorporate the delineation of PTV and OAR contours as additional input variables. By augmenting the existing classification framework in this way, we aim to classify MLC errors while simultaneously assessing whether the delivered plans meet the requirements of clinically relevant dose distributions.

CONCLUSION
In this study, we have developed a novel multi-channel hybrid network that combines CNN with ViT to address the detection of MLC errors in patient-specific IMRT QA. Our approach involves a comprehensive analysis of multi-channel DD maps and gamma maps generated from low-resolution two-dimensional dose matrices. Notably, the hybrid model trained on these maps exhibits a high level of accuracy in identifying MLC errors, surpassing models trained solely on single-channel DD maps or gamma maps. Furthermore, when compared to conventional CNN and transformer models, our hybrid model demonstrates superior precision in detecting MLC errors, highlighting its advantages in IMRT QA. By harnessing the combined capabilities of multi-channel DD maps and gamma maps, our approach offers an effective and reliable solution for detecting MLC errors in IMRT QA processes, especially when employing a low-resolution 2D ion chamber array. Future research can further explore the potential of our hybrid model in clinical applications, ultimately enhancing the overall quality and safety of IMRT treatments.

CONFLICT OF INTEREST STATEMENT
The authors have no conflicts to disclose.

FIGURE 1
Examples of MLC positions and predicted doses with four types of introduced errors. The blue solid line represents the MLC positions in the error-free plan. The yellow dashed line represents the MLC positions in the error plan. MLC, multileaf collimator.
dataset was randomly split into training and testing sets following an 80:20 ratio in the training and evaluation stages. This division was carried out at the plan level to ensure that all DD maps and gamma maps for beams within the same plan were assigned to the same set, either training or testing. A five-fold cross-validation approach was implemented during model training, resulting in five trained models. The test dataset was evaluated separately with each of the five trained models, and the mean value of the outputs was the model result, as shown in Figure 4. PyTorch, the popular deep learning framework, was used to implement

FIGURE 3
Overall architecture of our proposed HybridNet, which consists of CNN and ViT parts. CBR represents the successive operator cluster of Convolution-Batch Normalization-ReLU, while LN, patch embedding and merging, and MLP are the standard modules in Swin Transformer. W-MSA and SW-MSA denote multi-head self-attention modules with regular and shifted windows, respectively. CNN, convolutional neural networks; LN, layer normalization; MLP, multi-layer perceptron; ViT, vision transformer.
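The plan-level 80:20 split and five-fold cross-validation described in the text above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_by_plan` and `k_fold_by_plan`, the seed, and the choice to also group the cross-validation folds by plan are assumptions (the text states plan-level grouping only for the train/test split).

```python
import random
from collections import defaultdict


def split_by_plan(sample_plan_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that every sample of a plan lands on one side.

    sample_plan_ids[i] is the (hypothetical) plan identifier of sample i.
    """
    by_plan = defaultdict(list)
    for index, plan_id in enumerate(sample_plan_ids):
        by_plan[plan_id].append(index)

    plans = sorted(by_plan)
    random.Random(seed).shuffle(plans)
    n_test = max(1, round(len(plans) * test_fraction))
    test_plans = set(plans[:n_test])

    train = [i for p in plans if p not in test_plans for i in by_plan[p]]
    test = [i for p in plans if p in test_plans for i in by_plan[p]]
    return train, test


def k_fold_by_plan(sample_plan_ids, indices, k=5, seed=0):
    """Yield (train, validation) index lists for k folds.

    Assumption: folds are grouped by plan as well, so no plan is shared
    between the training and validation part of a fold.
    """
    plans = sorted({sample_plan_ids[i] for i in indices})
    random.Random(seed).shuffle(plans)
    folds = [set(plans[f::k]) for f in range(k)]
    for validation_plans in folds:
        train = [i for i in indices if sample_plan_ids[i] not in validation_plans]
        validation = [i for i in indices if sample_plan_ids[i] in validation_plans]
        yield train, validation
```

Grouping by plan keeps maps derived from the same plan out of both sides of a split, which would otherwise make test performance look better than it is.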

FIGURE 4
The process of model training and evaluation. A five-fold cross-validation approach was implemented during model training, resulting in five trained models. The test dataset was evaluated separately with each of the five trained models, and the mean value of the outputs was the model result. To mitigate the overfitting problem, an early stopping strategy was utilized with a tolerance of 50 epochs to search for the best model within 150 epochs.
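The early stopping rule and the averaging of the five fold models' outputs can be sketched in plain Python. This is a hedged illustration only: it assumes the "tolerance" is a patience counter on validation loss (the monitored quantity is not stated in the text), it takes a precomputed list of per-epoch losses in place of a real training loop, and the function names are hypothetical.

```python
def select_best_epoch(validation_losses, max_epochs=150, patience=50):
    """Return (best_epoch, best_loss) under a patience-based early stop.

    Scanning stops once `patience` epochs pass without a new best loss,
    or after `max_epochs` epochs, whichever comes first.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses[:max_epochs]):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss


def average_fold_outputs(fold_probabilities):
    """Average per-class outputs of the fold models for one test sample.

    fold_probabilities holds one list of class probabilities per model.
    """
    n_models = len(fold_probabilities)
    n_classes = len(fold_probabilities[0])
    return [
        sum(model[c] for model in fold_probabilities) / n_models
        for c in range(n_classes)
    ]
```

The predicted class for a sample would then be the index of the largest averaged value.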

FIGURE 6
Normalized confusion matrices (averaged over five-fold cross-validation) for (a) HybridNet-MC, (b) Hybrid-DD, (c) HybridNet-Gamma32, (d) HybridNet-Gamma21, (e) ResNet-18, and (f) SwinT networks. Each matrix represents the relative frequencies of correct classifications and misclassifications among the different classes, providing insights into the performance of the respective networks. The matrices are presented in a normalized format to facilitate comparison and analysis.
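As an illustration of how such a normalized confusion matrix can be computed, the sketch below counts predictions per true class and divides each row by its total. Row-wise normalization (by true class) is an assumption, since the caption does not state the normalization axis, and the function name is hypothetical.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, classes):
    """Return a row-normalized confusion matrix as a list of lists.

    Row i is the true class classes[i]; column j is the predicted class
    classes[j]. Each non-empty row sums to 1.
    """
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        counts[index[truth]][index[prediction]] += 1

    normalized = []
    for row in counts:
        total = sum(row)
        # Leave rows of classes with no samples as zeros.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized
```

With this convention, the diagonal entry of each row is the sensitivity (recall) for that class.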
Bing Yan conceived the experiments. Bing Yan and Jun Shi acquired and analyzed the data for the work. Bing Yan, Jun Shi, Hu Peng, Aidong Wu and Xiao Wang designed the study and analyzed the results. Bing Yan, Xudong Xue, Hu Peng, Jun Shi and Chi Ma participated in writing the manuscript. The final version of the manuscript has been reviewed and approved for publication by all authors.

ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (No. 62071165).
Note: "Average" signifies the average value obtained across the different categories, including shift, random, error-free, opening, and closing errors.

While CNNs possess robust capabilities in local feature extraction, they cannot model long-range dependencies or capture global contextual representations. Conversely, ViT excels at establishing global representations but exhibits limitations in local feature extraction. 38 Thus, our proposed methodology employed CNN at the network's input stage to extract low-level features, with a specific emphasis on local feature extraction. Subsequently, ViT was leveraged to model the long-range dependencies among features, thereby accentuating global contextual relationships.