A review on medical imaging synthesis using deep learning and its clinical applications

Abstract This paper reviews deep learning-based studies of medical imaging synthesis and their clinical applications. Specifically, we summarize recent developments of deep learning-based methods in inter- and intra-modality image synthesis by listing and highlighting the proposed methods, study designs, and reported performances, with related clinical applications, for representative studies. The challenges among the reviewed studies are then summarized and discussed.


1 | INTRODUCTION
Image synthesis across and within medical imaging modalities is an active area of research with broad applications in radiology and radiation oncology. Its primary purpose is to facilitate the clinical workflow by bypassing or replacing an imaging procedure when acquisition is infeasible due to constraints on time, labor, or expense; when exposure to ionizing radiation is contraindicated; or when image registration introduces unacceptable uncertainty between images of different modalities. These benefits have sparked growing interest in a number of exciting clinical applications, such as magnetic resonance imaging (MRI)-only radiation therapy treatment planning and positron emission tomography (PET)/MRI scanning.
Image synthesis and its potential applications have been investigated for decades. Conventional methods usually rely on models with explicit human-defined rules for the conversion of images from one modality to another and require case-by-case parameter tuning for optimal performance. These models are also application specific, depending upon the unique characteristics of the involved imaging modalities, resulting in a multitude of application-specific complex methodologies. It is difficult to build such models when the two imaging modalities considered provide distinct information, such as anatomical imaging and functional imaging. This, at least in part, is why the majority of these studies are limited to computed tomography (CT) synthesis from MRI. 1 Owing to the rapid progress in the fields of machine learning and computer vision over the last two decades, image synthesis across other imaging modalities such as PET and cone-beam CT (CBCT) is now viable, and a growing number of applications are benefitting from recent advancements in image synthesis techniques. [2][3][4] Deep learning, as a broad subdiscipline within machine learning and artificial intelligence, has dominated this field for the past several years.
Deep learning utilizes neural networks with many layers containing large numbers of neurons to extract useful features from images.
Various networks and architectures have been proposed for better performance on different tasks. Deep learning-based image synthesis methods usually share a common framework that uses a data-driven approach for image intensity mapping. The workflow typically consists of a training stage for the network to learn the mapping between the input and its target, and a prediction stage to synthesize the target from an input. Compared with conventional model-based methods, deep learning-based methods are more generalizable, since the same network and architecture for a pair of image modalities can be generalized to different pairs of image modalities with minimal adjustment. This allows rapid translation to a variety of imaging modalities whose synthesis is clinically useful.

2 | LITERATURE SEARCH
We defined the scope of this review study to include both inter- and intra-modality image synthesis using deep learning methods.
Inter-modality applications included studies of image synthesis between two different imaging modalities, whereas intra-modality applications included studies that transform images between two different protocols of the same imaging modality, such as between MRI sequences, or the restoration of images from a low-quality protocol to a high-quality protocol. Studies with a sole aim of image quality improvement, such as image denoising and artifact correction, were not included in this study. Conference abstracts and proceedings were not considered due to the lack of strict peer review in study design and reported results.
Peer-reviewed journal publications were searched on PubMed using the criteria in title or abstract as of February 2020: ("pseudo" OR "synth*" OR "reconstruct*" OR "transform" OR "restor*" OR "correct*" OR "generat*") AND "deep" AND "learning" AND ("CT" OR "MR" OR "MRI" OR "PET" OR "SPECT" OR "Ultrasound"). The search yielded 681 records. We manually screened each record, removing those ineligible by the previously defined criteria. The remaining 70 articles were included in this review. We also performed a citation search on the identified literature and an additional 41 articles were included. Therefore, 111 articles were included in this review. Compared with current review papers on this topic, 5 this review is more comprehensive, covering more articles using a systematic approach.

3 | DEEP LEARNING METHODS
The methodological frameworks of the reviewed studies can be grouped into three categories: Auto-encoder (AE), U-net, and generative adversarial network (GAN). These three groups of methods are not completely different from each other, but represent stepwise increases in architecture complexity. An AE is a basic network and can act as a basic component in advanced architectures such as U-net and GANs. For example, a U-net is composed of an encoder that down-samples images to feature maps and a decoder that up-samples the feature maps before finally mapping to targets; the encoder or decoder is usually a fully convolutional AE. Similarly, GANs are commonly viewed as a two-player zero-sum game between two neural network architectures: GANs are composed of a generator, which can be an AE or a U-net, and a discriminator, which is usually an AE.
Therefore, a hierarchy of complexity can be constructed ranging from the simplest AE to the most complex GAN, with U-net residing somewhere in between. Figure 2 indicates that U-net and GAN studies, which are close in total numbers, comprise the mainstream, accounting for about 90% of the considered articles. Figure 1 also demonstrates that the number of studies using U-net and GAN has been increasing since 2017, with GAN utilization increasing at a faster rate than U-net. While most of the 111 considered studies employ these methods in a supervised learning context, three used an unsupervised strategy, learning image translation from unpaired datasets. A review of methods within AEs, U-net, and GANs is provided in this section.

3.A | Auto-encoder
An Auto-encoder (AE) is a class of deep neural networks that uses convolution kernels to explore spatially local image patterns. It consists of an input, an output, and multiple hidden layers. The hidden layers contain a series of convolutional layers that convolve the input with trainable convolution kernels and pass the feature maps to the next layer. In order to restrict the input of each convolutional layer to a certain range, an activation layer is added between convolutional layers to map the output of previous layers to the input of the next layers by a predefined function. The Rectified Linear Unit (ReLU) layer, which has zero-valued output for all negative inputs and preserves the input otherwise, is the most commonly used activation layer due to its computational simplicity, representational sparsity, and linearity. To further standardize the input of each layer, batch normalization is usually applied to the activations of a layer, rescaling the input to have a standard distribution. This step has been shown to reduce internal covariate shift of the training datasets for improved robustness and faster convergence. Dropout layers are commonly used to reduce the chances of overfitting by intentionally and randomly ignoring some number of layer outputs; in this way, the prior layer is practically implemented with a different number of nodes and connectivity relative to its state before dropout was applied.

[FIG. 1. Number of peer-reviewed articles in medical imaging synthesis using deep learning with different neural networks. This study only covers the first 2 months of 2020. The dashed line predicting the total number of articles in 2020 is a linear extrapolation based on previous years.]

To save memory, the large size of images is typically reduced by pooling and convolution layers to allow a larger number of feature maps and, ultimately, deeper networks. The pooling layer is usually added after the activation layer and involves a pooling operation that uses a specified mathematical filter to downsample the feature map. With multiple hidden convolutional layers, a hierarchy of increasingly complex features with high-level abstraction is extracted.

The ultimate goal is to train the network to minimize the output of an objective loss or cost function: a mathematical representation of the goodness-of-fit of the model in matching its predictions to ground truth, where greater "loss" or "cost" is associated with poorer fit. During the training process, iterative adjustments are made to the weights and biases of the kernels of the convolutional layers until the loss function is minimized; these weights and biases are the trainable parameters of the network. Training typically relies on gradient descent, wherein the mathematical gradient (multi-dimensional derivative) of the cost function is used to minimize the function in a stepwise fashion and update the trainable parameters of the network. Several optimization algorithms, such as stochastic gradient descent (SGD), the adaptive gradient algorithm (AdaGrad), root mean square propagation (RMSProp), and Adaptive Moment Estimation (Adam), have been developed.

A basic AE is composed of several connected convolutional layers that map input to output; however, very few studies employ AEs in this basic form. Instead, most of the reviewed studies use variants of the basic AE architecture for better performance.
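To make these building blocks concrete, the following is a minimal sketch of a fully convolutional auto-encoder in PyTorch, not any specific reviewed model; the layer widths, kernel sizes, single-channel 2D input, and the choice of Adam with an L1 loss are illustrative assumptions drawn from the options discussed above.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Minimal convolutional auto-encoder: conv -> BN -> ReLU -> pool
    on the way down, transposed convolutions on the way up."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # downsample by 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Dropout2d(0.2),                    # regularization against overfitting
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()  # voxel-wise MAE between prediction and target

# one illustrative training step on a random (source, target) slice pair
src, tgt = torch.randn(4, 1, 128, 128), torch.randn(4, 1, 128, 128)
optimizer.zero_grad()
loss = loss_fn(model(src), tgt)
loss.backward()
optimizer.step()
```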
For example, the residual neural network (ResNet) was chosen in a few studies due to its shortcut connections that skip one or more layers, easing the training of the deep network without adding extra parameters or computational complexity. [6][7][8] ResNet also allows feature maps from the initial layers that usually contain fine details to be easily propagated to the deeper layers. AEs and their variants are commonly utilized as a basic component in advanced architectures such as those that follow.
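The residual shortcut just described can be sketched as below; the two-convolution body and equal input/output channel count (so the identity can be added directly, without extra parameters) are illustrative assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers whose output is added back to the input, so the
    block only has to learn a residual on top of the identity mapping."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # parameter-free shortcut connection
```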

3.B | U-net
In one of the first of several studies employing deep learning in image synthesis, Han used an AE to synthesize CT from MR images by adopting and modifying a U-net architecture. 9 The U-net model used in the study of Han has an encoding and a decoding part. In this case, an encoder extracts hierarchical features from an MR image input using convolutional, batch normalization, ReLU, and pooling layers, while a mirrored decoder replaces pooling layers with deconvolution layers, transforms the features, and reconstructs the predicted CT images from low- to high-resolution levels. The two parts are connected through shortcuts on multiple layers.
These shortcut connections are used to concatenate early layers to late layers such that late layers can also learn simple features captured in early layers. In the study of Han, these shortcuts enable high-resolution features from the encoding part to be used as extra inputs in the decoding part. Moreover, the original AE design includes several fully connected "hidden layers," so called because these fully connected layers connect every neuron in the previous layer to every neuron in the next, and neither inputs nor outputs of these layers are typically monitored during production. The fully connected layers correspond to global image features that are critical for image classification tasks but not very relevant for dense pixel-wise prediction. 9 Han's model removed the fully connected layers such that the number of parameters was greatly reduced. In their study, the model was trained using pairs of MR and CT two-dimensional (2D) slices. During the training process, a loss function of mean absolute error (MAE) between prediction and ground truth was minimized. Use of an L1-norm loss function such as MAE can improve robustness to noise, artifacts, and misalignment among the training images.
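A compact two-level U-net sketch in the spirit of, but not identical to, Han's model follows; the depths, widths, and single-channel 2D input are illustrative assumptions. The concatenation in `forward` is the skip connection that lets high-resolution encoder features reach the decoder directly.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(1, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)  # "deconvolution" in the decoder
        self.dec1 = conv_block(64, 32)                     # 64 = 32 upsampled + 32 skipped
        self.out = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)                # high-resolution features
        e2 = self.enc2(self.pool(e1))    # coarse features
        d1 = self.up(e2)                 # upsample back to input resolution
        d1 = torch.cat([d1, e1], dim=1)  # skip connection: concat early features
        return self.out(self.dec1(d1))

# MAE (L1) loss between synthetic and real CT slices, as in Han's study
loss_fn = nn.L1Loss()
```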
Most studies employing U-net generally followed the above architecture, with many variants and improvements proposed. 10,11 For example, instead of using CT images as ground truth in their MR-based CT study, Jang et al. 11 used discretized maps derived from CTs by labeling three materials, transforming CT synthesis into a segmentation problem. Finally, a multi-class soft-max classifier giving the probabilities of each material class within each voxel (e.g., 0.5 bone, 0.3 air, 0.1 soft tissue) was applied to the final layer of the decoder.
Another notable feature presented in Jang et al. 11 is the inclusion of a fully connected conditional random field, which considers neighboring voxels when generating label predictions, providing complementary information in addition to the base classifier, which only considers a single voxel at a time. In this application, the conditional random field provided 3D context to 2D image slices, building pairwise potentials between all pairs of voxels using the output of the model and the original 3D volume when predicting the label for each voxel.

[FIG. 2. Pie chart of numbers of articles in different categories of neural networks.]

A landmark advance in U-net architecture came when Dong et al. discovered that the information carried in the long skip connection of U-net from the encoding path is characterized by its high frequency, often including irrelevant components from noisy input images. In order to address this issue, they used a self-attention strategy that takes the coarse-scale feature maps extracted early in the encoder module to identify the most relevant emerging features, assign them attention scores, and use these scores to eliminate noise prior to concatenation. 12 In an alternative strategy, Hwang et al. only employed the skip connection in deeper layers. 13

The choice of building blocks within the encoding and decoding modules has also been investigated. Fu et al. made a few improvements based on the architecture of Han. For example, batch normalization layers, wherein normalization is applied across image subsets of the original sample to speed convergence, were replaced with instance normalization layers, wherein normalization occurs instead at the level of image channels, for further performance improvements when training with a small batch size. The unpooling layers in the decoder, which up-sample and therefore reverse pooling layers in the encoder and produce sparse feature maps, were also replaced with deconvolutional layers that produce dense feature maps, and the skip connections were replaced with residual shortcuts, inspired by ResNet, to further save computational memory. 14 Neppl et al. also replaced the ReLU layer with a generalized parametric ReLU (PReLU) to adaptively adjust the activation function. 15 Torrado-Carvajal et al. added a dropout layer before the first transposed convolution in the decoder to avoid overfitting. 16

Various loss functions have been investigated in the reviewed studies. In addition to the most commonly used L1- and L2-norms that enforce voxel-wise similarity, other functions that describe different image properties are usually combined into the total loss function. For example, Leynes et al. used a total loss function that was a sum of MAE loss, gradient difference loss, and Laplacian difference loss, the last two of which help improve image sharpness. 17 Similarly, Chen et al. combined the MAE loss with a structure dissimilarity loss to encourage whole-structure-wise similarity. 18 L2-regularization has also been incorporated into the loss function in a few studies to avoid overfitting. 19,20 Kazemifar et al. used mutual information, which has been widely implemented in loss functions applied to the task of image registration, in their loss function and demonstrated its advantages over MAE loss in better compensating for the misalignment between CT and MR images. Largent et al. introduced a perceptual loss, which can mimic human visual perception using similar features rather than only intensities, into their U-net.
The perceptual loss was proposed in three different implementations with increasing complexity: on a single convolutional layer, on multiple layers with uniform weights, and on multiple layers with different weights that give more importance to the layers yielding the lower MAE. 21

3.C | Generative adversarial networks

A generative adversarial network (GAN) is composed of a generative network and a discriminative network that are trained simultaneously. The generative network is trained to generate synthetic images, and the discriminative network is trained to classify an input image as real or synthetic. The training goal of a GAN is then to let the generative network produce synthetic images that are as realistic as possible to fool the discriminator, while the discriminative network attempts to distinguish the synthetic from real images. Network training occurs as the adversarial generative and discriminative networks compete against each other until equilibrium is reached.
When deployed in production, the trained generative network is applied to new incoming images.
Similar to AEs, GANs were also used in one of the earliest publications in medical image synthesis using deep learning. Nie et al. used a fully convolutional AE (an AE without fully connected layers) for the generative network and a standard AE for the discriminative network. 22 A binary cross-entropy loss function was employed for both networks with an important distinction: the discriminative network's loss is formulated to minimize the difference between assigned labels and ground truth in the usual fashion, while the generative network's loss is instead formulated to maximize the error of the discriminative network by minimizing the difference between the labels assigned by the discriminative network and an incorrect label.
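The adversarial scheme just described can be sketched as a training step like the one below. `G` and `D` stand for any generator/discriminator pair; the logits-based cross-entropy (for numerical stability), the L1 reconstruction term, and its weighting are assumptions for illustration, not a statement of Nie et al.'s exact formulation.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw discriminator logits

def train_step(G, D, opt_G, opt_D, source, real_target):
    # --- discriminator step: label real as 1, synthetic as 0 ---
    fake = G(source).detach()                      # stop gradients into G
    pred_real, pred_fake = D(real_target), D(fake)
    d_loss = bce(pred_real, torch.ones_like(pred_real)) + \
             bce(pred_fake, torch.zeros_like(pred_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- generator step: fool D by targeting the *incorrect* ("real") label ---
    fake = G(source)
    pred_fake = D(fake)
    g_adv = bce(pred_fake, torch.ones_like(pred_fake))  # deliberately wrong label
    g_rec = nn.functional.l1_loss(fake, real_target)    # reconstruction term
    g_loss = g_adv + 100.0 * g_rec                      # weighting is an assumption
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```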
Since the network in this study was trained in a patch-to-patch manner that may limit the context information available in the training samples, an auto-context model that integrates context information with low-level appearance features was employed to refine the results.
Many variants of the GAN have been designed and investigated.
Emami et al. adopted the conditional GAN (cGAN) in CT synthesis from MR. 7 Unlike a standard unconditional GAN, both the generative and discriminative networks of a cGAN observe the input images (e.g., the MR images in CT synthesis from MR). It can be formulated by conditioning the loss function of the discriminator on the input images and has been shown to be more suitable for image-to-image translation tasks. 23

Liang et al. implemented CycleGAN in their CBCT-based synthetic CT study. 24 The CycleGAN includes two generators, a CBCT-to-CT generator and a CT-to-CBCT generator, as well as two discriminators, a real CT/synthetic CT discriminator and a real CBCT/synthetic CBCT discriminator. In the first cycle, the input CBCT is fed into the CBCT-to-CT generator to synthesize a CT, then the synthetic CT is fed into the CT-to-CBCT generator to regenerate a cycle CBCT, which is ideally identical to the input CBCT. The cycle CBCT is compared to the original input CBCT to generate a CBCT cycle consistency loss. Meanwhile, the real CT/synthetic CT discriminator distinguishes between the real CT and the synthetic CT to generate a CT adversarial loss, similar to a standard GAN. To encourage one-to-one mapping between CT and CBCT, a second cycle transformation from CT to CBCT is performed. The second cycle is the same as the first, except the roles of CBCT and CT are swapped: the real CT is fed into the same CT-to-CBCT generator to synthesize a CBCT, and then the synthetic CBCT is fed into the same CBCT-to-CT generator to generate a cycle CT. The cycle CT is compared to the real CT to generate a CT cycle consistency loss, and the real CBCT/synthetic CBCT discriminator distinguishes between the real CBCT and the synthetic CBCT to generate a CBCT adversarial loss. Unlike a standard GAN, the CycleGAN couples an inverse mapping network by introducing a cycle consistency loss, which enhances network performance, especially when paired CT/CBCT training image sets are absent. As a result, CycleGAN can tolerate a certain level of misalignment in the paired training dataset. This property of CycleGAN is attractive for inter-modality synthesis because misalignment in the training datasets is often inevitable due to the difficulty of obtaining exactly matching image pairs. In many studies, training images are still paired by registration to preserve quantitative pixel values and reduce baseline geometric mismatch, allowing the network to focus on mapping details and accelerating training. 25

Varying structures of feature extraction blocks have proven useful for different applications. A group of studies showed that AEs with residual blocks can achieve promising results in image-transforming tasks where source and target images are largely similar, such as between CT and CBCT, non-attenuation-corrected (NAC) PET and attenuation-corrected (AC) PET, and low-count PET and full-count PET. [25][26][27][28][29] Since these pairs of images are similar in appearance but quantitatively different, residual blocks, composed of a residual connection in combination with multiple hidden layers, were integrated into the network to learn the differences between the pairs. An input bypasses these hidden layers via the residual connection; the hidden layers thus enforce minimization of a residual image between the source and ground truth target images, thereby minimizing noise and artifacts. In contrast, dense blocks concatenate outputs from previous layers rather than using feed-forward summation as in a standard AE block, capturing multifrequency (high and low frequency) information to better represent the mapping from the source image modality to the target image modality. Dense blocks are therefore commonly used in inter-modality image synthesis such as MR-to-CT and PET-to-CT. 12,[30][31][32][33][34]

Within GANs, AEs and their variants are commonly used for both the generative and discriminative networks. Emami et al. used ResNet for their generative network. 7 They removed the fully connected layers and added two transposed convolutional layers after the residual blocks as deconvolution. Kim et al. combined the U-net architecture and the residual training scheme in their generative network. 35 Olberg et al. proposed a deep spatial pyramid convolutional framework that includes an atrous spatial pyramid pooling (ASPP) module in a U-net architecture. The module performs atrous convolution at multiple rates in parallel such that multiscale features can be exploited to characterize a single pixel. 36 The encoder is then able to capture rich multi-scale contextual information, which aids image translation. Compared to the generator, the discriminator is typically implemented in a simpler form.
A common example consists of a few downsampling convolutional layers followed by a sigmoid activation layer to binarize the output, as proposed by Liu et al. 33 Generative adversarial networks and their variants incorporate adversarial loss functions in addition to the image quality and accuracy loss functions contained within U-net. The adversarial term, unlike the reconstruction term that represents image intensity accuracy, reflects the correct or incorrect decision that the discriminator makes on real or synthetic images. In addition to the binary cross-entropy loss mentioned above or a similar sigmoid cross-entropy loss, the negative log-likelihood functions outlined in the original computer vision publication describing GANs are also widely used.

However, the training process may suffer from divergence caused by vanishing gradients and mode collapse when the discriminator is trained to be optimal for a fixed generator. 37 To address this problem, Emami et al. proposed to use a least-squares loss, which has been shown to be more stable during training and to generate higher quality results. 7 The Wasserstein distance loss function is an alternative with even smoother gradient flow and faster convergence. 37 It has also been shown that, in GANs, simply providing the true or false labels output by the discriminator may not be sufficient for the generator to learn an accurate mapping.
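To close this section, the cycle consistency described earlier for the CBCT/CT case can be assembled as in the sketch below. This is a generic illustration, not Liang et al.'s implementation: the least-squares adversarial form is one common choice, the λ weighting is an assumption, and the identity loss some CycleGANs add is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cyclegan_losses(G_cbct2ct, G_ct2cbct, D_ct, D_cbct, cbct, ct, lam=10.0):
    # cycle 1: CBCT -> synthetic CT -> cycle CBCT (ideally identical to input)
    syn_ct = G_cbct2ct(cbct)
    cycle_cbct = G_ct2cbct(syn_ct)
    # cycle 2: roles swapped, CT -> synthetic CBCT -> cycle CT
    syn_cbct = G_ct2cbct(ct)
    cycle_ct = G_cbct2ct(syn_cbct)

    # adversarial terms for the two generators (least-squares form, one common choice)
    p_ct, p_cbct = D_ct(syn_ct), D_cbct(syn_cbct)
    adv = F.mse_loss(p_ct, torch.ones_like(p_ct)) + \
          F.mse_loss(p_cbct, torch.ones_like(p_cbct))
    # cycle-consistency terms: each regenerated image should match its input
    cyc = F.l1_loss(cycle_cbct, cbct) + F.l1_loss(cycle_ct, ct)
    return adv + lam * cyc  # lam weighting is an assumption
```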

3.D | Other
In addition to the above architectures, other designs have also been proposed to adapt to specific applications in the reviewed studies. For example, one study used multiple MR images with varying contrast as multichannel inputs in a multipath architecture that has three training paths in the encoder, with each channel possessing its own feature network. 8 The separate image feature extraction on different MR images avoids the loss of unique features that may otherwise be merged at a lower level.
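A multipath encoder of this kind might look like the sketch below; the three single-channel inputs (named t1, t2, and flair purely for illustration, not from the cited study), the path depths, and the fusion point are all assumptions.

```python
import torch
import torch.nn as nn

class MultiPathEncoder(nn.Module):
    """One feature-extraction path per MR contrast; features are
    fused only after each path has produced its own feature maps."""
    def __init__(self):
        super().__init__()
        def path():
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.paths = nn.ModuleList([path() for _ in range(3)])
        self.fuse = nn.Conv2d(3 * 32, 64, 1)  # merge after separate extraction

    def forward(self, t1, t2, flair):
        feats = [p(x) for p, x in zip(self.paths, (t1, t2, flair))]
        return self.fuse(torch.cat(feats, dim=1))
```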

4 | APPLICATION AREAS
The reviewed articles were categorized into two groups based on their objectives: inter-modality (56%) and intra-modality (44%) synthesis. Within each group, subgroups are described that specify the involved imaging modalities and their clinical applications.

4.A | Inter-modality
The group of inter-modality synthetic techniques includes studies of image synthesis from one image modality to another, such as from MR to CT, from CT to MR, and from PET to CT. We also consider the transformation between CT and CBCT to be inter-modality synthesis.

4.A.1 | MR-to-CT
Image synthesis from MR to CT is one of the first applications to utilize deep learning for medical image synthesis and remains the most commonly published topic in this field. The main clinical motivation of MR-based CT synthesis is to replace CT with MR acquisition. 41 The image quality and appearance of the synthetic CT in current studies is still considerably different from real CT, which prevents its direct diagnostic usage. However, many studies have demonstrated its utility in the nondiagnostic setting, such as treatment planning for radiation therapy and PET attenuation correction.
In the current radiation therapy workflow, both MR and CT imaging are frequently performed on many patients for the purpose of treatment planning (i.e., simulation). MR images feature excellent soft tissue contrast that is useful for delineation of gross tumor as well as organs at risk (OARs), 42 while CT images provide electron density maps for dose calculation and reference images for pretreatment positioning. The contours from MR images are propagated to CT images by image registration for treatment planning. However, using both imaging modalities not only leads to additional time and cost for the patient, but also introduces systematic positioning errors during the CT-MRI image fusion process. [43][44][45] Moreover, CT also subjects patients to exposure to a non-negligible dose of ionizing radiation. 46

Replacing CT with MRI is also preferable in current PET imaging applications, although CT is widely combined with PET in order to perform both imaging examinations simultaneously during a single encounter. The CT images acquired are then used to derive the 511 keV linear attenuation coefficient map to model photon attenuation by a piecewise linear scaling algorithm. 49,50 The linear attenuation coefficient map is then used to correct for the loss of annihilation photons by attenuation processes in the object on the PET images to achieve a satisfactory image quality. Magnetic resonance has been proposed to be incorporated with PET as a promising alternative to existing PET/CT systems for its advantages of superior soft tissue contrast and radiation dose sparing; however, a challenge similar to that encountered in radiation therapy applications remains: MR images cannot be directly used to derive the 511 keV attenuation coefficients used in the attenuation correction process. Therefore, MR-to-CT image synthesis could be useful to develop a PET/MR system capable of providing the necessary data for photon attenuation correction. 51

[FIG. 3. Pie chart of numbers of articles in different categories of applications. MR-to-CT: RT, MR-to-CT: PET, and MR-to-CT: Registration represent MR to CT image synthesis used in radiotherapy, PET, and image registration, respectively. PET: AC and PET: Low-count represent PET image synthesis used in attenuation correction and low-count to full-count, respectively.]

The absence of a one-to-one relationship between MR voxel intensity and CT HU values leads to a large difference in image appearance and contrast, which results in the failure of intensity-based calibration methods. For example, bone is bright and air is dark on CT imaging, while both are dark on MRI. Conventional methods proposed in the literature either segment MR images into several classes of materials (e.g., air, soft tissue, bone) and then assign corresponding CT HU values, [52][53][54][55][56][57] or register MR images to an atlas with known CT HU values. [58][59][60] These methods rely heavily on the performance of segmentation and registration, which introduces significant error due to, for instance, the ambiguous boundary between bone and air, and due to large inter-patient variation.
Tables 1 and 2 list the studies synthesizing CT from MR images for radiation therapy and PET attenuation correction, respectively.
For CT synthesis applications in radiation therapy, the MAE is the most common and well-defined metric, by which nearly every study reported the image quality of its synthetic CT. For synthetic CT in PET attenuation correction, synthetic CT quality is more commonly evaluated indirectly by assessing the quality of the PET attenuation correction than by direct evaluation of the synthetic CT itself. For studies presenting several variants of methods, we listed the variant with the best MAE for radiation therapy and the best PET quality for PET attenuation correction.
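As a concrete illustration, the body-masked MAE that most radiation therapy studies report can be computed as in the minimal sketch below; the array names, mask convention, and the HU window in the example comment are illustrative assumptions.

```python
import numpy as np

def mae_hu(synthetic_ct, real_ct, body_mask):
    """Mean absolute error in HU, restricted to voxels inside the body."""
    diff = np.abs(synthetic_ct.astype(np.float64) - real_ct.astype(np.float64))
    return diff[body_mask > 0].mean()

def mae_hu_tissue(synthetic_ct, real_ct, body_mask, lo, hi):
    """Per-tissue MAE, selecting voxels by an HU window on the real CT."""
    sel = (body_mask > 0) & (real_ct >= lo) & (real_ct < hi)
    return np.abs(synthetic_ct - real_ct)[sel].mean()

# e.g., soft tissue as roughly -200 to 250 HU (window is illustrative)
# mae_soft = mae_hu_tissue(syn, ct, mask, -200, 250)
```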

Synthetic CT image accuracy
In most of the studies, the MAE of the synthetic CT within the patient's body ranges from 40 to 70 HU, with some of the reported results approaching uncertainties observed in standard CT simulation. For example, the MAE of soft tissue reported in several studies 7,14,21,61-64 is <40 HU. In contrast, due to their indistinguishable contrast on MR images, the MAE of bone or air is more than 100 HU. Another common source of error is misalignment between CT and MR images in the patient datasets. The misalignment that occurs on bone not only causes intensity mapping error during training, but also leads to overestimation of error in evaluation, since the error from misalignment is counted as synthesis error. Two studies also reported much higher MAE for the rectum (~70 HU) than for other soft tissue, 21,65 which may also be attributed to mismatch between CT and MRI due to variable filling. Moreover, considering that the number of bone pixels is far fewer than that of soft tissue, the training process tends to map pixels to the low-HU region in the prediction stage.
Potential solutions may include assigning higher loss weights to bone (a sketch follows this paragraph) or adding bone-only images for training. 14 Compared with conventional methods, learning-based methods demonstrate superior performance in synthetic CT accuracy in multiple studies, indicating an advantage of the data-driven approach over model-based methods. 9,22,62,65 For example, synthetic CT generated by atlas-based methods was shown to be noisier and more prone to registration error, leading to significantly greater MAE than learning-based methods. However, atlas-based methods were shown to be more robust to image quality variation in some cases. 65 One of the limitations of learning-based methods is that performance can be unpredictable when applied to datasets that are very different from the training sets. These differences may be attributed to unusual or abnormal anatomy or to images with degraded quality due to severe artifacts and noise. Atlas-based methods, in contrast, generate a weighted average of templates from prior knowledge, and are thus less likely to fail on unexpected or unusual cases.
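A bone-weighted loss of the kind suggested above might look like this sketch; the HU threshold, the weight value, and the assumption that the target is in unnormalized HU are all illustrative.

```python
import torch

def bone_weighted_l1(pred, target, bone_weight=5.0, bone_hu=250.0):
    """L1 loss with a larger weight on bone voxels, identified by
    thresholding the ground-truth CT (threshold is illustrative;
    it would differ if intensities were normalized)."""
    weights = torch.where(target > bone_hu,
                          torch.full_like(target, bone_weight),
                          torch.ones_like(target))
    return (weights * (pred - target).abs()).mean()
```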
The results reported among these studies cannot be compared directly to determine a single best methodology for all applications because they utilize diverse datasets as well as training and testing strategies. However, some studies compared proposed methods with competing methods using the same datasets, which may reveal their relative advantages and limitations. For example, a GAN-based method was shown to better preserve detail and to be more similar to real CT, with less noise, compared to an AE-based method on a cohort of 15 brain cancer patients. 7 Specifically, GAN-based synthetic CT was more accurate at the bone/air interface and in rendering fine anatomic detail.

Among the reviewed studies, several different MR sequences have been adopted for synthetic CT generation. The specific sequence used in each study usually depends upon availability. The optimal sequence yielding the best performance has not, to our knowledge, been studied. T1-weighted and T2-weighted sequences are the two most common general diagnostic MR sequences. Due to their wide availability, models can be trained from a relatively large number of datasets with CT and accompanying co-registered T1- or T2-weighted MR images. T2 images may be preferable to T1 due to their intrinsically superior geometric accuracy within regions of great anatomic variability, such as the nasal cavity, and less chemical shift artifact. Studies have also evaluated synthetic CT in the context of proton therapy for prostate, liver, and brain cancer. 33

PET attenuation correction
For the studies of PET attenuation correction, the bias in PET quantification caused by synthetic CT error has been evaluated. Although it is difficult to specify an error tolerance beyond which clinical decision-making is impacted, the general consensus is that quantitative errors of 10% or less typically do not affect decisions in diagnostic imaging. 77 Based on the average relative bias reported by these studies, almost all of the proposed methods met this criterion. However, it should be noted that, due to variation among study subjects, the bias in some volumes-of-interest (VOIs) may exceed 10% for some patients, 17,68 suggesting that attention should be given to the standard deviation of the bias as well as its mean when interpreting results, since the proposed methods may have poor local performance that would affect some patients. Alternative reporting that lists or plots all data points, or at least their range, would ultimately be more informative than a mean and standard deviation alone in demonstrating the performance of the proposed methods.
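The reporting practice suggested above, examining the spread of per-VOI bias rather than only its mean, can be made concrete with a short sketch; the array and mask data structures are placeholders.

```python
import numpy as np

def voi_relative_bias(pet_ac_synth, pet_ac_ref, voi_masks):
    """Relative quantification bias (%) per volume-of-interest.

    pet_ac_synth / pet_ac_ref: PET volumes attenuation-corrected with
    synthetic and reference CT; voi_masks: dict of name -> binary mask.
    """
    biases = {}
    for name, mask in voi_masks.items():
        ref = pet_ac_ref[mask > 0].mean()
        syn = pet_ac_synth[mask > 0].mean()
        biases[name] = 100.0 * (syn - ref) / ref
    vals = np.array(list(biases.values()))
    # report the full range alongside mean +/- SD, per the discussion above
    return biases, vals.mean(), vals.std(), (vals.min(), vals.max())
```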
Since bone has the highest capacity for attenuation due to its high density and atomic number, 78 accurate synthesis of bone is particularly important for attenuation correction.

MR-CT image registration
In addition to radiation treatment planning and PET attenuation correction, MR-based CT synthesis has also proven promising in facilitating inter-modality image registration. Direct registration between CT and MR images is very challenging due to disparate image contrast and is even less reliable in deformable registration, wherein significant geometric distortion is allowed. McKenzie et al. proposed a CycleGAN-based method to synthesize CT images and used the synthetic CT to replace MRI in MR-CT registration in the head and neck, reducing an inter-modality registration problem to an intra-modality one. 80 As summarized in Table 3, they found that, using the same deformable registration algorithm, the average landmark error decreased from 9.8 ± 3.1 mm in direct MR-CT registration to 6.0 ± 2.1 mm using synthetic CT as a bridge. Similar results were also reported in the inverse CT-MR registration task.
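The bridging idea can be sketched with SimpleITK: once the MR has been converted to a synthetic CT, an ordinary intra-modality metric such as mean squares becomes usable. This is a generic sketch under stated assumptions, not McKenzie et al.'s implementation; only a rigid stage is shown, and the deformable stage they used is omitted for brevity.

```python
import SimpleITK as sitk

def register_via_synthetic_ct(synthetic_ct, real_ct):
    """Rigid intra-modality registration of synthetic CT (from MR) to real CT.
    Both images are assumed cast to sitk.sitkFloat32 beforehand."""
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMeanSquares()  # valid now that both images are CT-like
    reg.SetOptimizerAsRegularStepGradientDescent(
        learningRate=1.0, minStep=1e-4, numberOfIterations=200)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetInitialTransform(
        sitk.CenteredTransformInitializer(
            real_ct, synthetic_ct, sitk.Euler3DTransform(),
            sitk.CenteredTransformInitializerFilter.GEOMETRY))
    transform = reg.Execute(real_ct, synthetic_ct)
    # the resulting transform can then be applied to the original MR
    return transform
```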

4.A.2 | CT/CBCT-to-MRI
Due to the superior soft tissue contrast produced by MRI, it is attractive to generate synthetic MRI from CT or CBCT in applications that are sensitive to soft tissue contrast, such as segmentation. 81 Synthesizing MR from CT/CBCT may at first seem more challenging than synthesizing CT from MR, in part because MR contains greater contrast and detail that must be recovered but are not shown on CT; however, deep learning methods have proven quite competent at learning such highly nonlinear mappings, making the proposed application possible.
The reviewed studies synthesizing MR from CT/CBCT adopted networks similar to those employed in MR-to-CT synthesis.

4.A.3 | CBCT-to-CT
Cone-beam CT and CT image reconstruction are subject to the common physics principles of x-ray attenuation and back projection; however, they differ in the details of their implementation of acquisition and reconstruction as well as their clinical utility. Therefore, they are considered as two distinct imaging modalities in this review.
Cone-beam CT has been widely utilized in image-guided radiation therapy (IGRT) to determine the degree of patient setup error and inter-fraction motion by comparing the displacement of anatomic landmarks from the treatment planning CT images. 87 With increasing adoption of adaptive radiation therapy techniques, more demanding applications of CBCT have been proposed, such as daily dose estimation and auto-contouring based on a deformable image registration (DIR) with CT imaging obtained at simulation. 88,89 Unlike CT scanners using fan-shaped x-ray beams with multi-slice detectors, CBCT generates a cone-shaped x-ray beam incident on a flat panel detector. The flat panel detector features a high spatial resolution and wide coverage along the z-axis, but also suffers from increased scatter signal since the x-ray scatter generated from the entire body volume may reach the detector. The scatter signals cause severe streaking and cupping artifacts on the CBCT images and lead to significant quantitative CT errors. Such errors complicate the calibration process of CBCT Hounsfield Unit (HU) to electron density when images are used for dose calculation. 90 The degraded image contrast and suppression of bone can also cause large errors in DIR for contour propagation from planning CT to CBCT. 91 The significantly degraded image quality of CBCT prevents its use in advanced quantitative applications in radiation therapy.
Deep learning-based methods, as listed in Table 5, have been proposed to synthesize CT-quality images from CBCT. In the reviewed studies, the ground truth considered while training was typically the planning CT of the same patient.

4.B | Intra-modality
The group of intra-modality investigations includes studies that transform images between two different protocols within an imaging modality, such as among different MRI sequences, or the restoration of images from a low-quality protocol to a higher quality one. Studies solely aiming at image quality improvement, such as image denoising and artifact correction, are not included in this study. Studies within this group are further subdivided into CT, MR, and PET.

4.B.1 | CT
Computed tomography imaging delivers a non-negligible dose of ionizing radiation during acquisition, leading to a small, but real, increase in risk of radiation-induced cancer and genetic defects. [100][101][102] During diagnosis, treatment, and surveillance of many malignancies, it is common for patients to be subject to frequent CT imaging. In this setting, accumulated imaging dose is of even greater concern, particularly for pediatric patients, who are more sensitive to radiation and have longer life expectancy than adults throughout which secondary malignancies are more likely to develop. 103 Computed tomography dose can be lowered by either reducing x-ray exposure (mAs) [104][105][106][107] or reducing the number of x-ray projections. [104][105][106][107] However, if reconstructing an image with a conventional filtered backprojection (FBP) algorithm, image quality would be degraded with greater image noise and reduced signal-to-noise ratio for a low-exposure protocol, or with severe undersampling artifacts for a reduced-projection protocol. These low-quality images would make routine tasks requiring CT images difficult for clinicians. Hardware-based methods such as optimization of the data acquisition protocol (automatic exposure control) 108 and improvements in detector designs 109 have been shown to be effective in reducing imaging dose to some extent while maintaining clinically acceptable image quality. However, further dose reduction from these techniques is limited by detector physical properties and is therefore very costly.
For decades, iterative CT image reconstruction algorithms have been proposed to address the degraded image quality resulting from insufficient data acquisition. 110 These methods model the physical process of CT scanning with prior knowledge and are more robust to noise, requiring less radiation dose for the same image quality relative to FBP. [110][111][112] However, iterative reconstruction suffers from long computation time due to the large number of iterations with repeated forward and back projection steps.
Moreover, in the forward projection step, it requires knowledge of the energy spectrum, which is difficult to measure directly. [113][114][115][116] This is usually addressed by a monoenergetic forward projection matrix, or by obtaining an indirect simulation/estimation of the energy spectrum. 106,107,112,117,118

[TABLE 6. Summary of studies on PET-based synthetic CT for PET attenuation correction.]

Image synthesis by deep learning is attractive for low-dose CT (LDCT) restoration due to its data-driven approach to automatically learning image features and model parameters; these studies are listed in Table 7. The related studies are listed in Table 8.

Low-count PET has extensive applications in pediatric PET scanning and radiotherapy response evaluation, with the advantages of better motion control and lower radiation dose. However, low-count statistics result in increased image noise, reduced contrast-to-noise ratio, and significant bias in uptake measurement. The reconstruction of a standard- or full-count PET from low-count PET cannot be achieved by simple postprocessing operations such as denoising, since the diminished radiation dose changes the underlying biological and metabolic processes, leading not only to noise but also to changes in local uptake values. 129 Moreover, even given the same radiotracer injection dose, the uptake distribution and signal level can vary greatly among patients. The learning-based low-count PET reconstruction methods are summarized in Table 10. One of these studies reported that its model could be applied without an additional filter or postprocessing and without retraining the model. 130 They also compared results using original images and projections as input and found that projection-based results better reflect uptake patterns.

5 | SUMMARY AND OUTLOOK
Due to limitations of GPU memory, some of the deep learning approaches examined were trained on two-dimensional (2D) slices. Since the loss functions of 2D models do not account for continuity in the third dimension, slice discontinuities can be observed. Some studies trained models on three-dimensional (3D) patches to exploit 3D spatial information with even less memory burden, 31 while others trained different networks for all three combinations of orthogonal 2D planes to produce pseudo-3D information. 133

The reviewed studies illustrate the advantages of learning-based methods over conventional methods in performance as well as in clinical application. Learning-based methods generally outperform conventional methods in generating more realistic synthetic images with higher similarity to real images and better quantitative metrics.
Depending on hardware, training a model in development usually takes hours to days for learning-based methods. However, once the model is trained, it can be applied to new patients to generate synthetic images in seconds to minutes. Conventional methods vary widely in specific methodologies and implementations, resulting in a wide range of run times. Iterative methods such as compressed sensing (CS) were shown to be unfavorable due to significant costs in time and compute power.
Unlike conventional methods, learning-based methods require large training datasets. The size of training sets has been shown to affect the performance of machine learning in many challenging computer vision problems as well as medical imaging tasks. [134][135][136][137] Generally, a larger training set size with greater data variation can reduce overfitting of the model and enable better performance.
Compared with studies in some medical imaging applications where it is common to see thousands of patients enrolled, studies in medical image synthesis involve far fewer patients. As shown in Tables 1-10, a training size of dozens of patients is more common in these studies, while hundreds of patients per set are rare and can be considered a relatively "large" study. Moreover, it is very common to see the leave-X-out or N-fold cross validation strategy used in evaluating methods. The lack of an independent test set unseen by the model may complicate the generalization of results for broad clinical applications. The current small-sample norm arises from circumstances that vary from application to application. In radiation oncology, clinical patient volume is inherently lower than in other specialties such as radiology, so that fewer eligible patients are available for study. In addition to limitations in data collection, data cleaning further eliminates a portion of data that are low in quality or represent outliers, such as image pairs with suboptimal registration. In order to address the problem posed by limited training data, novel techniques have been proposed, such as transfer learning. 138
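A patient-level split that keeps an untouched independent test set, addressing the cross-validation concern raised above, might look like the following illustrative sketch; the cohort size and split fractions are assumptions.

```python
from sklearn.model_selection import KFold, train_test_split

patient_ids = [f"pt{i:03d}" for i in range(60)]  # illustrative cohort

# hold out an independent test set first, never seen during development
dev_ids, test_ids = train_test_split(patient_ids, test_size=0.2, random_state=0)

# then run N-fold cross validation on the development set only
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(kf.split(dev_ids)):
    train_ids = [dev_ids[i] for i in tr_idx]
    val_ids = [dev_ids[i] for i in va_idx]
    # ... train the model on train_ids, tune on val_ids ...

# the final model is evaluated once on test_ids
```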

CONFLICT OF INTEREST
None.