4D‐CT deformable image registration using unsupervised recursive cascaded full‐resolution residual networks

Abstract A novel recursive cascaded full-resolution residual network (RCFRR-Net) for abdominal four-dimensional computed tomography (4D-CT) image registration was proposed. The entire network was end-to-end and trained in an unsupervised manner, meaning that the deformation vector field serving as the ground truth was not needed during training. The network was designed by cascading three full-resolution residual subnetworks with different architectures. The training loss consisted of an image similarity loss and a deformation vector field regularization loss, both calculated based on the final warped image and the fixed image, allowing all cascades to be trained jointly and to perform the progressive registration cooperatively. Extensive network testing was conducted using diverse datasets, including an internal 4D-CT dataset, the public DIRLAB 4D-CT dataset, and a 4D cone-beam CT (4D-CBCT) dataset. Compared with the iteration-based demon method and two deep learning-based methods (VoxelMorph and the recursive cascaded network), the RCFRR-Net achieved consistent and significant gains, demonstrating superior performance and generalization capability in medical image registration. The proposed RCFRR-Net was a promising tool for various clinical applications.

Four-dimensional computed tomography (4D-CT) referred to a technique whereby a three-dimensional computed tomography (3D-CT) volume containing a moving structure was imaged over a period, creating a dynamic volume dataset.1 4D-CT had been widely utilized in radiation oncology for planning purposes, especially for tumors located in the thoracic cavity, abdomen, and breast. In recent years, 4D-CT has opened avenues in the diagnostic arena.2 In the diagnosis of central nervous system diseases, 4D-CT could be utilized to evaluate vascular pathology in the brain and spine.
Additionally, 4D-CT provided the added benefits of perfusion maps and the visualization of structural changes, such as hydrocephalus or hematoma, which could aid in surgical planning. It was superior to current time-resolved MR angiography in terms of both spatial and temporal resolution.3 For cardiac imaging, 4D-CT could be used to visualize the heart and pulmonary vessels and to characterize and measure the in-vivo 3D deformations of both the right ventricular outflow tract and the pulmonary arteries throughout the cardiac cycle.4 Deformable image registration (DIR) was essential in the clinical application of 4D-CT. DIR was the process of establishing nonlinear correspondences between a pair of moving and fixed images. Though DIR had been extensively studied for decades, it remained an active area of research, since its current performance did not fully meet the increasingly high accuracy and speed requirements for registration. Furthermore, the large and complex lung motions and low image contrast of abdominal 4D-CT images posed additional difficulties for accurate registration.
The current DIR methods could be divided into two categories: the non-deep-learning-based methods and the deep-learning-based methods. Among the former, the DVF optimization-based methods included elastic matching,5,6 statistical parametric mapping,7 Demons,8,9 etc. Popular diffeomorphic transform-based methods included large deformation diffeomorphic metric mapping (LDDMM),10,11 diffeomorphic anatomical registration through exponentiated Lie algebra (DARTEL),12 and symmetric normalization (SyN).13 All these non-DL-based approaches optimized the transformation for each image pair, resulting in slow registration, especially for 3D and 4D images.
15,16,17 According to the output of the network, the deep learning-based DIR methods could be divided into two categories: (1) DL-based similarity calculation; and (2) DL-based DVF prediction. The DL-based similarity calculation method was normally developed for multi-modality image registration. The learned similarity was typically used within traditional registration methods for iterative optimization, or within another DL network. For example, Cheng et al.18 proposed a DL-based similarity method that trained a binary classifier to learn the correspondence of two image patches from CT and MR image pairs. The classification output was transformed into a continuous probability value, which was then used as the similarity score.
The DL-based DVF prediction method included supervised and unsupervised strategies. For the registration methods based on the supervised learning strategy, the true DVF (i.e., the ground truth) calculated from the moving image and fixed image was used to train the registration network. Three methods were normally used to generate the ground truth: (1) random DVF generation; (2) traditional registration method-based DVF generation; and (3) DVF simulation generation. For the random DVF generation method, the generated DVF was used as the ground truth. The original image (the fixed image) was warped based on the simulated DVF to generate the moving image.
However, the simulated DVF might introduce bias during training since it was unrealistic and quite different from actual physiological motion. For the traditional registration method-based DVF generation, traditional registration methods were used to register the image pair and generate the ground truth for registration network training. Using this method, the performance of the developed registration network was greatly affected by the DVF generation methods. Further, finite element (FE) simulations could also be used in DVF generation.19 Compared with the supervised learning strategy-based registration methods, the unsupervised learning strategy-based registration methods only required the image pairs to be registered, without the true DVF, during model training. In general, the unsupervised learning-based registration methods calculated the DVF using the input image pair and then performed interpolation and deformation on the moving image to generate the warped image.20,21 The unsupervised learning strategy-based registration methods could be classified into three categories: generative adversarial network (GAN) methods, reinforcement learning (RL) methods, and convolutional neural network (CNN) methods.
The GAN-based methods were mostly used to register multimodality images by mapping images from one modality to another. Mahapatra et al.22 trained a GAN to perform registration between retinal color fundus images and fluorescein angiography images. Tanner et al.23 employed CycleGAN for DIR between MR and CT images.
They first transformed the source domain to the target domain before calculating a monomodality image similarity. They discovered that this method could achieve, at best, performance similar to the traditional multi-modality deformable registration methods. Further, the accuracy of the absolute intensity mapping of GANs remained to be investigated. The combination of RL and CNNs had been applied to decompose the registration task into a sequence of classification problems. This method was mainly employed for rigid registration.
Liao et al.24 used RL to perform rigid cone-beam CT (CBCT)-CT image registration. Krebs et al.25 built a statistical deformation model (SDM) for prostate 2D and 3D MR image registration; the method achieved Dice similarity coefficient scores of 0.87 and 0.80 for 2D and 3D images, respectively. The unsupervised learning-based CNN methods had recently presented performance comparable to the traditional registration methods.21,26,27 The proposition of spatial transformer networks (STNs) promoted the development of unsupervised deep learning-based image registration methods, since the STN allowed the definition of loss functions without any manually aligned or pre-aligned image pairs.28 de Vos et al.21 applied the STN to unsupervised learning-based image registration for the first time. In that work, the STN was directly connected to the network, and the generated DVF was used to deform the moving image. During training, the deformed and fixed images were used to calculate the training loss, which was back-propagated through the differentiable warping operation and gradually minimized. Balakrishnan et al.27 presented a fast learning-based registration framework (called VoxelMorph) for medical image registration. In their study, the registration task was formulated as a function that mapped the input image pair to a DVF that aligned the input images.
The function was parameterized via the CNN and optimized based on the input image pairs. The studies of de Vos et al. and Balakrishnan et al. both indicated that the proposed unsupervised methods outperformed state-of-the-art iterative registration methods while operating a few orders of magnitude faster.21,27 These networks were forced to make straightforward one-shot predictions, which might be a burden when handling complicated deformations, especially large deformations.29 de Vos et al.30 proposed a DL image registration (DLIR) framework for unsupervised affine and deformable image registration. DLIR stacked multiple networks into a larger architecture to perform coarse-to-fine image registration. DLIR was trained one cascade at a time, that is, fixing the weights of previous cascades before training the next.
The volume tweening network (VTN) was an end-to-end cascading scheme that resolved large displacements.31 VTN jointly trained all cascades, and the similarity was measured based on the fixed image and all warped images.

2.2 | Recursive cascaded networks
The fixed and moving images were defined over a 3D spatial domain Ω. The work in this paper focused solely on 3D image volumes with grayscale data; theoretically, the proposed method was dimension-independent. The DVF was a mapping ϕ : Ω → Ω.
For DIR, a reasonable and plausible DVF should be continuously varying. The entire registration task was cascaded by recursively performing registration on the warped images. The R_θ included three cascaded registration subnetworks, denoted as r_1, r_2, and r_3. Following the recursion, the moving image was warped successively three times.
Each cascaded subnetwork was an independent image registration network. The n-th cascaded subnetwork (r_n) predicted the DVF (ϕ_n) from the input warped image (I_w^(n−1)) and the fixed image, as in Equation (2):

ϕ_n = r_n(I_w^(n−1), I_f),  I_w^(n) = I_w^(n−1) ∘ ϕ_n    (2)

The predicted DVF of the entire network (ϕ_RCFRR-Net) was the composition of the DVFs of the three cascaded subnetworks:

ϕ_RCFRR-Net = ϕ_1 ∘ ϕ_2 ∘ ϕ_3

The final warped image (I_w^3) was generated from the warped image (I_w^2) of r_2 and the predicted DVF (ϕ_3) of r_3, which could also be expressed using the input moving image and the final predicted DVF:

I_w^3 = I_w^2 ∘ ϕ_3 = I_m ∘ ϕ_RCFRR-Net

The DVF prediction network was implemented using a ResNet-like architecture, the full-resolution residual network (FRRN).32,33 An example of the abstract structure of the FRRN was provided in Figure 1b. The FRRN processed two feature streams: a residual stream and a pooling stream. The residual stream carried feature maps at full resolution with precise boundary information; its features were calculated by adding successive residual units. The pooling stream carried high-level feature maps with information on image texture; its features were the direct output of the multiple convolution and pooling operations applied to the input images. By fusing the two processing streams, both kinds of features could be calculated simultaneously.
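To make the recursion concrete, the following sketch mimics the cascade in 2D with NumPy/SciPy. The `warp` and `compose` helpers and the `toy_subnetwork` stand-ins are illustrative assumptions (the actual subnetworks are the learned FRRN models, and the paper operates on 3D volumes); the loop reproduces the recursion I_w^n = I_w^(n−1) ∘ ϕ_n and accumulates the overall DVF by composition:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, dvf):
    """Warp a 2D image by a displacement field dvf of shape (2, H, W):
    output(v) = image(v + dvf(v)), with linear interpolation."""
    grid = np.mgrid[0:image.shape[0], 0:image.shape[1]].astype(float)
    return map_coordinates(image, grid + dvf, order=1, mode='nearest')

def compose(u_a, u_b):
    """Displacement field of 'warp by u_a, then warp the result by u_b':
    u(v) = u_b(v) + u_a(v + u_b(v))."""
    grid = np.mgrid[0:u_b.shape[1], 0:u_b.shape[2]].astype(float)
    coords = grid + u_b
    u_a_resampled = np.stack([map_coordinates(u_a[c], coords, order=1, mode='nearest')
                              for c in range(2)])
    return u_b + u_a_resampled

# Toy stand-ins for the learned subnetworks r1, r2, r3: each maps the current
# warped image and the fixed image to a small smooth displacement field.
# (They ignore their inputs' content; they only exercise the recursion.)
def toy_subnetwork(warped, fixed, scale):
    yy, xx = np.mgrid[0:warped.shape[0], 0:warped.shape[1]].astype(float)
    return scale * np.stack([np.sin(xx / 10.0), np.cos(yy / 10.0)])

yy, xx = np.mgrid[0:32, 0:32].astype(float)
moving = np.sin(xx / 5.0) * np.cos(yy / 7.0)   # smooth synthetic "image"
fixed = np.roll(moving, 1, axis=0)

warped, total_dvf = moving, np.zeros((2, 32, 32))
for scale in (0.5, 0.25, 0.1):                 # r1, r2, r3
    u_n = toy_subnetwork(warped, fixed, scale) # phi_n = r_n(I_w^(n-1), I_f)
    warped = warp(warped, u_n)                 # I_w^n = I_w^(n-1) o phi_n
    total_dvf = compose(total_dvf, u_n)        # accumulate the overall DVF
```

Warping the moving image once by the accumulated field agrees with the three successive warps up to interpolation error, which is the sense in which ϕ_RCFRR-Net is the composition of ϕ_1, ϕ_2, and ϕ_3.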
The FRRN was constructed using a sequence of full-resolution residual units (FRRUs).32 The abstract structure of the FRRU was presented in Figure 1c. Since the FRRN processed the residual stream and pooling stream synchronously, each FRRU had two inputs and two outputs accordingly. Let x_{n−1} denote the residual-stream input to the nth FRRU and y_{n−1} the pooling-stream input. The outputs were defined through the functions of the residual stream (F_r) and pooling stream (F_p) of the nth FRRU as:

x_n = x_{n−1} + F_r(x_{n−1}, y_{n−1}; W_Fr)
y_n = F_p(x_{n−1}, y_{n−1}; W_Fp)

where W_Fr and W_Fp were the parameters of the residual stream and pooling stream, respectively. In Figure 1c, the red box indicated the function of the pooling stream, while the blue box indicated the function of the residual stream.
The FRRU first reduced the size of the residual-stream feature map using a max-pooling layer and then concatenated the feature maps of the two streams. Next, the concatenated feature maps underwent two 3 × 3 convolution layers, aiming to extract highly local and complex features. Each convolution layer was followed by a spatial batch normalization (BN) layer and a rectified linear unit (ReLU) activation.33 The outputs of the latter convolution layer were fed into the pooling stream and residual stream of the next FRRU. For the pooling stream, the outputs were used directly as inputs. For the residual stream, the number of feature channels was adjusted using a 1 × 1 convolution layer and the spatial dimensions were upscaled using an unpooling layer.
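The FRRU data flow above can be sketched as a minimal 2D NumPy version, under simplifying assumptions: batch normalization is omitted, unpooling is nearest-neighbour, and the kernel shapes are hypothetical (the paper's network is 3D and trained in PyTorch):

```python
import numpy as np
from scipy.ndimage import convolve

def max_pool2(x):           # 2x2 max pooling on a (C, H, W) feature map
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def unpool2(x):             # nearest-neighbour unpooling back to full resolution
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv3_relu(x, kernels): # 3x3 convolution + ReLU (BN omitted in this sketch)
    out = np.stack([sum(convolve(x[c], kernels[o, c], mode='nearest')
                        for c in range(x.shape[0]))
                    for o in range(kernels.shape[0])])
    return np.maximum(out, 0.0)

def frru(x_res, y_pool, k1, k2, k_1x1):
    """One FRRU: pool the residual stream, concatenate with the pooling stream,
    apply two 3x3 conv+ReLU layers, feed the result to the next pooling stream
    directly and, via a 1x1 conv and unpooling, back onto the residual stream."""
    z = np.concatenate([max_pool2(x_res), y_pool])  # join the two streams
    z = conv3_relu(conv3_relu(z, k1), k2)           # two 3x3 conv layers
    y_next = z                                      # pooling-stream output
    r = np.tensordot(k_1x1, z, axes=1)              # 1x1 conv: match channel count
    x_next = x_res + unpool2(r)                     # residual-stream output
    return x_next, y_next

rng = np.random.default_rng(0)
x_res = rng.standard_normal((2, 8, 8))     # residual stream: 2 channels, full res
y_pool = rng.standard_normal((4, 4, 4))    # pooling stream: 4 channels, half res
k1 = 0.1 * rng.standard_normal((4, 6, 3, 3))
k2 = 0.1 * rng.standard_normal((4, 4, 3, 3))
k_1x1 = 0.1 * rng.standard_normal((2, 4))
x_next, y_next = frru(x_res, y_pool, k1, k2, k_1x1)
```

The residual stream keeps its full 8 × 8 resolution across the unit, while the pooling stream stays at half resolution, mirroring the two-stream design described above.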
To use standard gradient-based optimization, a differentiable spatial-transformation operation was used to calculate I_m ∘ ϕ.28 For each voxel v of I_m, the (subvoxel) location ϕ(v) was calculated. Since image intensities were only defined at integer locations, the value of I_m ∘ ϕ at v was calculated from the neighboring voxels using linear interpolation:

I_m ∘ ϕ(v) = Σ_{q ∈ N(ϕ(v))} I_m(q) Π_{d ∈ {x,y,z}} (1 − |ϕ_d(v) − q_d|)

where N(ϕ(v)) denoted the eight integer voxel neighbors of ϕ(v) and d iterated over the spatial dimensions. The network architectures were constructed following design principles employed by state-of-the-art methods.32,34,35 The base channel number, which differed from Simonyan and He's studies, was set to 32 to keep the number of trainable parameters manageable.34,36 The channel number was doubled after each pooling operation (up to a certain maximum number). The encoder and decoder formulations were used according to Noh and Pohlen's studies.32,35 The network architectures of the three cascaded registration subnetworks were presented in Figure 2.
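A direct transcription of the linear-interpolation formula above for a single voxel, assuming the sampling location lies in the interior of the volume (border clamping is omitted); the test volume `I` is a hypothetical linear ramp:

```python
import numpy as np
from itertools import product

def trilinear_sample(I, p):
    """Evaluate I_m(phi(v)) at a non-integer location p: sum over the eight
    integer neighbours q of p, each weighted by prod_d (1 - |p_d - q_d|)."""
    base = np.floor(p).astype(int)
    value = 0.0
    for offset in product((0, 1), repeat=3):
        q = base + np.array(offset)
        w = np.prod(1.0 - np.abs(p - q))   # per-dimension linear weights
        value += w * I[tuple(q)]
    return value

I = np.arange(27, dtype=float).reshape(3, 3, 3)   # I[x, y, z] = 9x + 3y + z
```

Because `I` here is linear in the coordinates, the interpolated value at (0.5, 0.5, 0.5) is exactly the average of the eight surrounding voxels, 6.5.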

2.4 | Loss function
Three cascaded registration subnetworks were jointly trained by calculating the unsupervised loss (L_us) only between the fixed image and the final warped image.29 This allowed the recursive cascaded networks to learn to perform progressive registration cooperatively. The loss function consisted of two components: the image similarity loss L_sim(I_f, I_w^3) and the DVF regularization loss L_smooth(ϕ).27 The image similarity loss penalized differences in image appearance between I_f and I_w^3. The DVF regularization loss penalized local spatial variations in the DVF.
L_us(I_f, I_w^3, ϕ) = L_sim(I_f, I_w^3) + α L_smooth(ϕ)

where α indicated the regularization parameter, set to 0.01 according to the previous work of Balakrishnan et al.27 The L_sim(I_f, I_w^3) was computed using the mean squared error:

L_sim(I_f, I_w^3) = (1/|Ω|) Σ_{v∈Ω} [I_f(v) − I_w^3(v)]²

Minimizing L_sim(I_f, I_w^3) would enforce I_w^3 to approximate I_f, but on its own it might generate an unrealistic, non-smooth DVF. The DVF regularization loss therefore encouraged a smooth DVF using a diffusion regularizer on its spatial gradients:

L_smooth(ϕ) = Σ_{v∈Ω} ‖∇ϕ(v)‖²

2.5 | Experiments
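A minimal NumPy sketch of this loss, assuming a displacement-field representation of ϕ and forward finite differences for the spatial gradients (the actual implementation is a differentiable PyTorch loss):

```python
import numpy as np

def similarity_loss(fixed, warped):
    """L_sim: mean squared error between the fixed and final warped images."""
    return np.mean((fixed - warped) ** 2)

def smoothness_loss(dvf):
    """L_smooth: diffusion regularizer -- mean squared finite-difference
    gradient of the DVF, given as an array of shape (3, D, H, W)."""
    return sum(np.mean(np.diff(dvf, axis=ax) ** 2) for ax in (1, 2, 3))

def unsupervised_loss(fixed, warped, dvf, alpha=0.01):
    """L_us = L_sim + alpha * L_smooth, with alpha = 0.01 as in the paper."""
    return similarity_loss(fixed, warped) + alpha * smoothness_loss(dvf)

img = np.random.default_rng(1).random((4, 4, 4))   # toy normalized image
zero_dvf = np.zeros((3, 4, 4, 4))                  # identity deformation
```

A perfectly matched pair under an identity (or any constant) deformation gives zero loss; any intensity mismatch or spatially varying field increases it.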

| Dataset
The proposed RCFRR-Net was trained and tested using a retrospective dataset involving 13 patients' 4D-CT images. Each 4D-CBCT dataset also included 10 phases. The 3D images were resampled and normalized prior to testing. The 3D image of the first phase was used as the fixed image, and the 3D images of the other nine phases were used as the moving images, giving a total of nine CBCT image pairs for 4D-CBCT testing.

| Evaluation
To compare the registration performance of the proposed RCFRR-Net with other DIR methods, we used the iteration-based demon registration method as the first comparison method.9 In addition, two deep learning-based methods, VoxelMorph and the recursive cascaded network (RC-Net), were included as comparison methods.27,29 VoxelMorph and RC-Net were trained using the same training dataset and were both constructed based on the vm2double architecture, with hyperparameters consistent with the previous papers.27,29 We developed two RC-Nets with three and 10 cascades, abbreviated as RC-Net-3 and RC-Net-10, respectively. Root mean square error (RMSE), normalized cross-correlation (NCC), structural similarity index (SSIM), and Dice score were calculated for quantitative evaluation.37,38 Considering that abdominal organs, for example, the liver and stomach, had large motion amplitudes during respiration, the registration accuracy was also measured for these key organs. Four regions of interest (ROIs), covering the liver, stomach, lung, and surroundings, were selected from the global image to calculate the quantitative metrics for registration evaluation. The ROIs had the same size of 30 × 30 × 5 voxels.
The ROI-1 and ROI-3 were chosen in the stomach and lung regions.
The ROI-2 and ROI-4 were chosen in the liver and lung regions. The example slices of the ROIs were presented in Figure 3.
The RMSE indicated the squared differences of voxel intensities between two images and was defined as:

RMSE = sqrt[ (1/m) Σ_{i=1}^{m} (μ_w^i − μ_f^i)² ]

where i indicated the index of the voxel in the image, m the total number of voxels, μ_w^i the intensity value of the ith voxel in the warped image, and μ_f^i the intensity value of the ith voxel in the fixed image.
The NCC was used to measure the correlation between two images:

NCC = Σ_i (μ_w^i − μ̄_w)(μ_f^i − μ̄_f) / sqrt[ Σ_i (μ_w^i − μ̄_w)² · Σ_i (μ_f^i − μ̄_f)² ]

where μ̄_w and μ̄_f were the mean intensity values of the warped image and fixed image, respectively.
The SSIM indicated the similarity of the two images in terms of three components, namely the intensity term, the contrast term, and the structural term. It was proposed to provide a good approximation of perceptual image quality:

SSIM = [(2 μ̄_w μ̄_f + C1)(2 σ_wf + C2)] / [(μ̄_w² + μ̄_f² + C1)(σ_w² + σ_f² + C2)]

where C1 and C2 were small constants, defined from the dynamic range L of the input image, to avoid instability in the SSIM calculation. σ_w and σ_f were the standard deviations of the warped image and fixed image, respectively, and σ_wf was the cross-covariance between the warped and fixed images.
The Dice score was a spatial overlap index. In our study, the Dice score was calculated based on the selected ROIs:

Dice = 2 |ROI_W ∩ ROI_F| / (|ROI_W| + |ROI_F|)

where ROI_W and ROI_F were the ROIs of the warped image and fixed image, respectively.
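The four metrics can be sketched as follows. The SSIM here is a single-window (global) variant with the usual k1 = 0.01 and k2 = 0.03 stabilizing constants, which is an assumption since the paper does not state its window scheme; L = 1 matches images normalized to [0, 1]:

```python
import numpy as np

def rmse(warped, fixed):
    """Root mean square error of voxel intensities."""
    return np.sqrt(np.mean((warped - fixed) ** 2))

def ncc(warped, fixed):
    """Normalized cross-correlation of the two intensity arrays."""
    a = warped - warped.mean()
    b = fixed - fixed.mean()
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))

def ssim(warped, fixed, L=1.0, k1=0.01, k2=0.03):
    """Global (single-window) SSIM; C1, C2 stabilize the two ratios."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_w, mu_f = warped.mean(), fixed.mean()
    cov = np.mean((warped - mu_w) * (fixed - mu_f))
    return ((2 * mu_w * mu_f + c1) * (2 * cov + c2)) / (
        (mu_w ** 2 + mu_f ** 2 + c1) * (warped.var() + fixed.var() + c2))

def dice(roi_w, roi_f):
    """Dice overlap of two binary masks: 2|A∩B| / (|A|+|B|)."""
    inter = np.logical_and(roi_w, roi_f).sum()
    return 2.0 * inter / (roi_w.sum() + roi_f.sum())

x = np.linspace(0.0, 1.0, 64)   # toy intensity profile for a quick check
```

Identical inputs give RMSE 0, NCC 1, SSIM 1, and Dice 1, which is a convenient sanity check for any implementation of these metrics.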

| Implementation
The proposed algorithm was implemented in Python 3.6 and the PyTorch 1.7.0 framework.39 The proposed model was trained using a batch size of 1 on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. The training stage ran for 9 × 10⁴ iterations with the Adam gradient optimizer.40 The learning rate was initially 10⁻⁴ and was halved after 3 × 10⁴ iterations and again after 6 × 10⁴ iterations.
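The stated learning-rate schedule can be written as a small helper (an illustrative reconstruction, not the authors' code):

```python
def learning_rate(iteration, base_lr=1e-4):
    """Step schedule from the text: start at 1e-4, halve after 3e4
    iterations, and halve again after 6e4 iterations."""
    if iteration < 3 * 10**4:
        return base_lr
    if iteration < 6 * 10**4:
        return base_lr / 2
    return base_lr / 4
```

Over the 9 × 10⁴ training iterations this yields three equal-length phases at 1e-4, 5e-5, and 2.5e-5.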

3.1 | Registration evaluation of RCFRR-Net
Table 1 showed the global image and ROI-based comparison of the average registration results using different registration methods over the 10 image pairs of the testing patient. As seen in Table 1, the VoxelMorph performed better than the demon registration method according to the global image evaluation and comparably according to the ROI-based evaluation. The RC-Net-3 performed better than the VoxelMorph and demon registration methods, and the RC-Net-10 performed slightly better than the RC-Net-3. The RCFRR-Net showed the best performance among all registration methods. Figure 3 showed the axial, coronal, and sagittal planes and selected ROIs of the moving images, fixed images, and warped images registered using the different methods.
The absolute differences between the fixed images and the warped images of the different registration methods were shown in Figure 4.
The RMSE reductions, as percentages of the RMSE before registration, were 73.46%, 10.95%, and 2.4%, respectively. Both the RMSE difference and its percentage decreased significantly as the number of cascaded networks increased. This trend could also be observed in the comparison based on ROIs.

| Registration runtime
Table 4 showed the runtime results of VoxelMorph, RC-Net-3, RC-Net-10, and RCFRR-Net. Since the demon registration method required a few minutes to finish the registration, we only reported the runtimes of the DL-based methods in this study. All methods were run on the GPU. The VoxelMorph computed a registration with an average runtime of 1.334 ± 0.0219 s.
For the VoxelMorph method, the previously reported runtime was 0.45 s, while it had an average runtime of 1.334 ± 0.0219 s in this study.27 This might result from the difference in datasets: the dataset for this study was abdominal 4D-CT images, and the previous study was developed for brain MR datasets.

| Registration evaluation on DIRLAB datasets
Table 5 showed the global image and ROI-based comparison of the average RMSE, NCC, SSIM, and Dice score using different registration methods for the testing patients from the DIRLAB datasets. Consistent with the testing dataset from our institution, the RCFRR-Net showed the best performance among all registration methods. Figure 8 presented the axial, coronal, and sagittal planes and selected ROIs of the moving images, fixed images, and warped images registered using the different methods on images from the DIRLAB datasets. The absolute differences between the fixed images and the warped images of the different registration methods were shown in Figure 9. As shown in the residual images, the RCFRR-Net showed the smallest difference.
Using the quantitative metrics for the registration evaluation, the average RMSE between the moving image and the fixed image was 78.66 ± 13.04 HU, decreasing to 15.54 ± 2.47 HU between the warped image and the fixed image after the proposed RCFRR-Net registration. The NCC increased from 0.9835 ± 0.0050 to 0.9997 ± 0.0001, and the SSIM from 0.7326 ± 0.0430 to 0.8427 ± 0.0460, using the RCFRR-Net registration.
For the comparison based on ROIs, the RMSE was 336.00 ± 40.

The novelty of this study could be summarized in three aspects.
First, the RCFRR-Net was trained using the unsupervised approach.
The advantage of unsupervised training was that it could alleviate the problem of a lack of ground truth DVFs.Since ground truth DVF was not required, any 4D-CT dataset could be used for network training.
Second, the RCFRR-Net was constructed using the recursive cascaded method. The similarity was only measured based on the final warped image and the fixed image, allowing all cascades to be trained jointly and to perform the progressive registration cooperatively.

The current DIR methods could be divided into two categories: (1) the non-deep learning (DL)-based methods; and (2) the DL-based methods. Popular non-DL-based registration methods included two categories: the deformation vector field (DVF) optimization-based methods and the diffeomorphic transform-based methods.
Zhao et al.29 presented a general recursive cascaded network (RC-Net) architecture that enabled learning deep cascades for deformable image registration. The registration procedure was recursive, such that every cascade learned to perform a progressive deformation of the current warped image. The similarity was only measured based on the fixed image and the final warped image. The base network of the RC-Net had the simple architecture of VoxelMorph or VTN, with the same network depth and parameters in every cascade. One drawback of the RC-Net was that the number of parameters increased linearly with the number of cascades. The RC-Net with five cascades showed a Dice value of 0.949, and it achieved a Dice value of 0.953 to 0.956 from 10 to 20 cascades using the VTN base network; with a greater number of cascades, the improvement in registration performance might be limited. This article made several technical contributions towards the goal of developing neural networks for abdominal 4D-CT image registration. (i) We introduced a novel registration network with three recursive cascaded subnetworks to perform progressive registration, each subnetwork designed with a different architecture to capture different movement conditions. (ii) We proposed to calculate the DVF based on the full-resolution residual network, which could capture the precise boundary and texture information of the image simultaneously. (iii) We evaluated the proposed network using multiple datasets (an internal 4D-CT dataset, the public DIRLAB 4D-CT dataset, and a 4D-CBCT dataset).

F I G U R E 1
Figure 1a depicted the workflow of the proposed recursive cascaded full-resolution residual network (RCFRR-Net). Considering the large and complex motion patterns of abdominal 4D-CT images, the recursive cascaded method was used to conduct successive registration. The RCFRR-Net had three independent registration subnetworks with different network parameters and depths, aiming to perform global and refined registrations. Each subnetwork was directly connected and used as an independent network with its own input and output. The first subnetwork used the fixed and moving image pair as inputs. The second and third subnetworks used the fixed image and the warped image (the output of the preceding registration subnetwork) as the paired input images. To accurately capture the precise boundary information of image elements, each subnetwork was developed by merging a full-resolution residual network (FRRN) and a spatial transformation function. The loss of the RCFRR-Net was calculated based on the fixed image and the final warped image during network training. The trained network was applied to the input image data to calculate the final warped image and thereby achieve registration.
A DVF prediction function, R_θ(I_f, I_m), was developed in the proposed method, where θ indicated the network parameters, I_f the fixed image, and I_m the moving image. The function R_θ took I_f and I_m as inputs and predicted the DVF (ϕ) and the warped image I_w as outputs. The I_w was generated based on the predicted DVF and the moving image:

I_w = I_m ∘ ϕ

2.3 | Registration subnetworks

Each cascaded registration subnetwork consisted of a DVF prediction network and a spatial transformation function. Each subnetwork was designed to predict a DVF and a warped image based on the input moving image (or the warped image from the preceding registration subnetwork) and the fixed image. The DVF prediction network was designed to compute the DVF, and the spatial transformation function was used to warp the input moving image based on the predicted DVF.

The FRRN processed two distinct feature streams (Figure 1b). Since the spatial transformation function was differentiable, the errors could be backpropagated during the training process. The three cascaded registration subnetworks were designed with different network parameters and depths in this study. The top (first) cascade was designed to learn global registration with a shallow network architecture (FRRU blocks: 5, maximum number of FRRU channels: 128). The second cascade was designed with a deeper architecture than the first (FRRU blocks: 7, maximum number of FRRU channels: 256). The bottom cascade was designed to learn the refined registration with the deepest architecture (FRRU blocks: 9, maximum number of FRRU channels: 512).

All images were obtained from a Siemens SOMATOM Definition AS using a 120 kVp tube voltage. The original pixel spacings of the datasets ranged from 0.6836 to 0.9736 mm, and the slice thickness from 2 to 3 mm. All images were cropped and resampled to the same size of 96 × 96 × 96, and image gray values were normalized to [0, 1]. Each 4D-CT dataset included 11 phases of 3D-CT images throughout the respiratory cycle: one initial phase (In0%), five exhalation phases (Ex20%, Ex40%, Ex60%, Ex80%, and Ex100%), and five inhalation phases (In20%, In40%, In60%, In80%, and In100%). The 4D-CT datasets of 12 patients, comprising a total of 132 3D-CT images, were used as the training dataset. During training, image pairs between any two of the 11 phases were used as the moving image and the fixed image, giving 1320 pairs of 3D-CT images in total. For testing, the 3D-CT images of the exhale and inhale phases were registered to the initial-phase image; a total of 10 CT image pairs were used for testing.
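The three cascade configurations and the channel-doubling rule described in the text can be summarized as a hypothetical configuration (names and keys are illustrative, not taken from the authors' code):

```python
# Illustrative configuration mirroring the three cascades described above.
CASCADES = [
    {"frru_blocks": 5, "max_channels": 128},   # cascade 1: global registration
    {"frru_blocks": 7, "max_channels": 256},   # cascade 2: deeper refinement
    {"frru_blocks": 9, "max_channels": 512},   # cascade 3: finest refinement
]

def channels_at_level(level, base=32, cap=128):
    """Base channel count 32, doubled after each pooling operation,
    capped at the cascade's maximum channel number."""
    return min(base * 2 ** level, cap)
```

For the first cascade (cap 128), for example, the channel progression over pooling levels would be 32, 64, 128, 128, ...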

F I G U R E 2
The network architectures of the three cascade registration subnetworks. The notation Conv(n × n, m) indicated a convolution layer of m kernels, each of size n × n. The notations RU(v) and FRRU(v) indicated the RU and FRRU whose convolutions included v channels. The three cascade registration subnetworks differed in network depth. (a) Subnetwork-1; (b) Subnetwork-2; and (c) Subnetwork-3.

The performance of the proposed RCFRR-Net was tested using the public DIRLAB dataset (www.dir-lab.com). The 3D images were resampled to the same size of 96 × 96 × 96 and normalized to the range 0 to 1. Each 4D-CT of the DIRLAB dataset included 10 phases. The 3D image of the first phase was used as the fixed image, while the images of the other nine phases were used as the moving images, giving nine image pairs per case. Three cases (case 1, case 2, and case 3) from DIRLAB were included; thus, a total of 27 image pairs were used for testing based on the DIRLAB dataset. Furthermore, to test the generalization capability of the proposed RCFRR-Net, we tested it using 4D-CBCT images obtained from "The Sparse-view Reconstruction Challenge for Four-dimensional Cone-beam CT" (https://image-x.sydney.edu.au/spare-challenge/). The 4D-CBCT images had higher noise and artifact levels compared with the 4D-CT images.
F I G U R E 3 Comparison of the moving images, fixed images, and warped images of different registration methods. The first three rows showed the comparison of global images, and the last two rows showed the comparison of ROI-1 and ROI-2. The red box in the axial plane and the line in the coronal plane indicated the example slices of the ROIs. The red arrows indicated the organs with large motion. The images of the six columns were the fixed images, moving images, and warped images of demon-based registration, VoxelMorph, RC-Net-3, RC-Net-10, and RCFRR-Net, respectively. Display window: [−200, 300] HU.

F I G U R E 4
It could be seen from the residual images that the differences for the proposed RCFRR-Net were smaller than those for the comparison methods, especially in the selected ROIs. Using the quantitative metrics for the registration evaluation, the original average RMSE between the moving image and fixed image was 82.52 ± 23.75 HU. With the proposed RCFRR-Net, the average RMSE between the warped image and fixed image after the registration decreased to 10.88 ± 3.78 HU. The NCC increased from 0.9831 ± 0

F I G U R E 6
Comparison of the residual images between the fixed images and warped images of intermediate cascades. The first three rows showed the comparison of global images, and the last two rows showed the comparison of ROI-1 and ROI-2. The images of the four columns were the residual images between the fixed images and the moving images, and the warped images of Cascade-1, Cascade-2, and Cascade-3, respectively. Display window: [0, 100] HU.

The quantitative metrics were also calculated for the selected ROIs based on the warped image and fixed image after the registration using the RCFRR-Net. For the selected ROIs of the moving image and the fixed image, the ROI-based NCC and SSIM were lower than the values based on the overall image. The NCC increased from 0.6621 ± 0.1102 to 0.9991 ± 0.0002 for ROI-1 and from 0.7937 ± 0.0795 to 0.9993 ± 0.0001 for ROI-2. The SSIM also increased, from 0.3504 ± 0.1171 to 0.9695 ± 0.0091 for ROI-1 and from 0.4343 ± 0.1175 to 0.9503 ± 0.0183 for ROI-2. The Dice score increased from 0.5532 ± 0.0766 to 0.8681 ± 0.0201 for ROI-1 and from 0.7836 ± 0.0464 to 0.9579 ± 0.0111 for ROI-2. Furthermore, to assess the progressive registration process of the RCFRR-Net, the intermediate registration results, that is, the results of the first cascaded network (Cascade-1) and the second cascaded network (Cascade-2), were also evaluated.

3.2 | Registration evaluation of RCFRR-Net with different numbers of cascades

To measure the registration results of the RCFRR-Net with different numbers of cascades, we constructed RCFRR-Nets with two cascaded registration subnetworks and with one registration subnetwork. The maximal number of cascades was three in the current study, considering the trade-off between registration time and accuracy. In this section, the RCFRR-Nets with three cascades, two cascades, and one subnetwork were denoted as RCFRR-Net-3, RCFRR-Net-2, and RCFRR-Net-1, respectively. The network architecture and hyperparameters of the RCFRR-Net-2 were consistent with the second and third subnetworks of the RCFRR-Net-3, and those of the RCFRR-Net-1 with the third subnetwork of the RCFRR-Net-3.
study was developed for brain MR datasets. Compared with the brain dataset, 4D-CT images had more complex deformations. The runtime of the RC-Net-3 (1.446 ± 0.0207 s) was slightly longer than that of the VoxelMorph. The RC-Net-10 (1.878 ± 0.0216 s) had a longer runtime than the RC-Net-3. Since the subnetworks of the RCFRR-Net were relatively deeper than those of the RC-Net, the runtime of the RCFRR-Net-3 (1.940 ± 0.0158 s) was slightly longer than that of the RC-Net-10. The runtimes of the RCFRR-Net-1, RCFRR-Net-2, and RCFRR-Net-3 increased gradually.

T A B L E 5 Comparison of RMSE, NCC, SSIM, and Dice score using different registration methods based on DIRLAB datasets.
The RMSE values for the selected CBCT ROIs were higher than those for the selected CT ROIs using the RCFRR-Net registration. For the selected ROIs of the moving image and the fixed image, the RMSE decreased from 296.25 ± 97.10 HU to 57.80 ± 13.01 HU, from 282.10 ± 100.88 HU to 55.23 ± 14.42 HU, from 282.68 ± 102.69 HU to 56.25 ± 12.51 HU, and from 254.09 ± 81.85 HU to 54.91 ± 13.42 HU for ROI-1, ROI-2, ROI-3, and ROI-4, respectively. The NCC, SSIM, and Dice score all had the highest values for the selected ROIs.

4 | DISCUSSION

In this study, we proposed a novel unsupervised deep learning-based method for abdominal 4D-CT image registration. The developed method was tested using multiple datasets (an internal 4D-CT dataset, a public DIRLAB 4D-CT dataset, and a 4D-CBCT dataset). The RCFRR-Net was trained using only the internal 4D-CT dataset, without the DIRLAB 4D-CT dataset or the 4D-CBCT dataset. The RCFRR-Net outperformed the comparison methods on all testing datasets. This was strong evidence that the RCFRR-Net had a high generalization capability in medical image registration.
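The RMSE and Dice figures quoted throughout can be computed directly from image pairs and binary organ masks; a short numpy sketch (the ROI masks are assumed to be binary segmentations, and for CT/CBCT the RMSE is in HU):

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square intensity error; for CT/CBCT images this is in HU."""
    d = a.astype(np.float64) - b.astype(np.float64)
    return float(np.sqrt((d * d).mean()))

def dice(mask_a, mask_b):
    """Dice similarity coefficient of two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    return float(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum()))
```

A drop such as 296.25 HU to 57.80 HU for an ROI corresponds to `rmse(fixed_roi, moving_roi)` before registration versus `rmse(fixed_roi, warped_roi)` after.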

F I G U R E 1 0 Comparison of the moving images, fixed images, and warped images of different registration methods on 4D-CBCT datasets. The red dotted lines indicated the organs with large motion. Display window: [−800, 200] HU.

Second, the training loss was calculated based on the final warped image and the fixed image, allowing all cascades to be trained jointly and perform the progressive registration cooperatively. Third, the cascaded registration subnetwork was developed based on a novel ResNet-like architecture, the full-resolution residual network. The FRRN processed two distinct feature streams: a residual stream and a pooling stream. The residual stream carried feature maps at full resolution with precise boundary information, while the pooling stream carried high-level feature maps with information on image texture. By fusing the two streams, features carrying both image boundary and semantic information could be obtained simultaneously. Although the proposed RCFRR-Net could conduct 4D-CT image registration with high speed and accuracy, it still needed improvement in future work. The RCFRR-Net was trained based on the global image, which had a relatively small size: the 3D images were resampled to 96 × 96 × 96. The RCFRR-Net was not trained using image patches. Because image patches are usually randomly cropped from the global image, there might be differences in the anatomical structures within the paired image patches, which might introduce inevitable errors during model training. In the future, we will design a more appropriate patch extraction method for registration network training, so that the proposed method could achieve image registration at larger sizes without increasing the computational resources. Since the RCFRR-Net was trained in a completely unsupervised manner without any prior knowledge of the ground truth, the loss was essential for accurate DVF predictions. In the current study, the loss was composed of similarity in the image domain and smoothness in the DVF domain. The loss was
calculated equally for different regions of the global image, without considering the physiological motion variance among regions. In future studies, the physiological motion around the lung could be specifically modeled to accurately capture complex motion conditions and improve the DIR performance. For example, adaptive weights for the image similarity and DVF regularization terms could be incorporated into the loss function.

5 | CONCLUSIONS

This study presented an unsupervised recursive cascaded architecture for abdominal 4D-CT DIR. Experiments based on diverse datasets and evaluation metrics demonstrated that the proposed architecture achieved significant gains over state-of-the-art methods. Given its outstanding performance and the general applicability of the unsupervised training manner, we expected that the proposed architecture could be extended to more clinical DIR tasks.

AUTHOR CONTRIBUTIONS

Lei Xu: Formal analysis (lead); investigation (lead); methodology (lead); resources (lead); software (lead); validation (lead); visualization (lead); writing – original draft (lead); writing – review and editing (lead). Ping Jiang: Writing – review and editing (supporting). Tiffany Tsui:

F I G U R E 1 1 Comparison of the residual images between the fixed images and warped images of different registration methods on 4D-CBCT datasets. Display window: [0, 400] HU.

T A B L E 1 Comparison of RMSE, NCC, SSIM, and Dice score using different registration methods.
T A B L E 2 Comparison of RMSE, NCC, SSIM, and Dice score of intermediate cascades.

Table 2 presented the results of intermediate cascades based on the global image and the ROIs. As shown in the table, the RCFRR-Net achieved performance gains with a higher cascade number. Figure 5 showed the axial, coronal, and sagittal planes. For ROI-1, the RMSE decreased to 30.50 ± 4.02 and 17.79 ± 2.85 based on Cascade-2 and Cascade-3, respectively; for ROI-2, the RMSE decreased from 261.36 ± 71.35 (before registration) to 83.91 ± 17.32, 24.18 ± 2.31, and 14.58 ± 2.28 based on Cascade-1, Cascade-2, and Cascade-3, respectively. The NCC, SSIM, and Dice score all gradually increased with higher cascade numbers for both ROIs. Figure 7 plotted the results of intermediate cascades to illustrate the progressive registration.

F I G U R E 7 Box plots for the results of intermediate cascades. RMSE (a, b, c), NCC (d, e, f), and SSIM (g, h, i) comparison of intermediate cascades based on the global image, ROI-1, and ROI-2.

Table 3 presented the global image and ROI-based comparison of the RMSE, NCC, and SSIM of the RCFRR-Net with different numbers of cascades. As shown in the table, RCFRR-Net-3 outperformed RCFRR-Net-1 and RCFRR-Net-2, as expected. For the global

T A B L E 3 Comparison of the RMSE, NCC, SSIM, and Dice score using RCFRR-Net with different numbers of cascades.

Comparison of RMSE, NCC, SSIM, and Dice score using different registration methods based on 4D-CBCT datasets.
ROI-1 and from 0.5497 ± 0.086 to 0.9184 ± 0.0177 for ROI-2.

T A B L E 6

Table 6 showed the global image and ROI-based comparison of the average RMSE, NCC, SSIM, and Dice score using different registration methods for the 4D-CBCT testing patient. Due to the high level of image noise and artifacts in CBCT images, the registration accuracy was worse than that of the CT images. As could be seen from