Deep learning-based synthetic-CT generation in radiotherapy and PET: a review

Recently, deep learning (DL)-based methods for the generation of synthetic computed tomography (sCT) have received significant research attention as an alternative to classical ones. We present a systematic review of these methods, grouped into three categories according to their clinical applications: I) to replace CT in magnetic resonance (MR)-based treatment planning; II) to facilitate cone-beam computed tomography (CBCT)-based image-guided adaptive radiotherapy; III) to derive attenuation maps for the correction of positron emission tomography (PET). A database search was performed on journal articles published between January 2014 and December 2020. The key characteristics of the DL methods were extracted from each eligible study, and a comprehensive comparison among network architectures and metrics was reported. A detailed review of each category was given, highlighting essential contributions, identifying specific challenges, and summarising the achievements. Lastly, the statistics of all the cited works were analysed from various aspects, revealing the popularity, future trends and potential of DL-based sCT generation. The current status of the field was evaluated, assessing the clinical readiness of the presented methods.


I. Introduction
Medical imaging's impact on oncological patients' diagnosis and therapy has grown significantly over the last decades 1 . Especially in radiotherapy (RT) 2 , imaging plays a crucial role in the entire workflow, from treatment simulation to patient positioning and monitoring 3,4,5,6 .
Traditionally, computed tomography (CT) is considered the primary imaging modality in RT. It provides an accurate, high-resolution representation of the patient's geometry and enables the direct electron density conversion needed for dose calculations 7 . X-ray based imaging, including planar imaging and cone-beam computed tomography (CBCT), is widely adopted for patient positioning and monitoring before, during or after dose delivery 4 . Along with CT, positron emission tomography (PET) is commonly acquired to provide functional and metabolic information, allowing tumour staging and improving tumour contouring 8 . Magnetic resonance imaging (MRI) has also proved its added value for the delineation of tumours and organs-at-risk (OARs), thanks to its superb soft-tissue contrast 9,10 .
To benefit from the complementary advantages offered by different imaging modalities, MRI is generally registered to CT 11 . However, residual misregistration and differences in patient set-up may introduce systematic errors that would affect the accuracy of the whole treatment 12,13 .
Recently, MR-only RT has been proposed 14,15,16 to eliminate residual registration errors. Furthermore, it can simplify and speed up the workflow, decreasing the patient's exposure to ionising radiation, which is particularly relevant for repeated simulations 17 or fragile populations, e.g. children. MR-only RT may also reduce overall treatment costs 18 and workload 19 . Additionally, the development of MR-only techniques can be beneficial for MR-guided RT 20 .
The main obstacle to the introduction of MR-only radiotherapy is the lack of tissue attenuation information required for accurate dose calculations 12,21 . Many methods have been proposed to convert MR to CT-equivalent representations, often known as synthetic CT (sCT), for treatment planning and dose calculation. These approaches are summarised in dedicated reviews on this topic 22,23,24 , in site-specific reviews 18,25,26 , or in broader reviews on MR-guided RT 27 or proton therapy 28 .
Additionally, similar techniques to derive sCT from a different imaging modality have been envisioned to improve the quality of CBCT 29 . Cone-beam computed tomography plays a vital role in image-guided adaptive radiation therapy (IGART) for photon and proton therapy. However, due to severe scatter noise and truncated projections, image reconstruction is affected by several artefacts, such as shading, streaking and cupping 30,31 . For this reason, daily CBCT has not commonly been used for online plan adaptation. CBCT-to-CT conversion would allow accurate dose computation and improve the quality of the IGART provided to patients.
Finally, sCT estimation is also crucial for PET attenuation correction. Accurate PET quantification requires a reliable photon attenuation correction (AC) map, usually derived from CT. In hybrid PET/MRI scanners, this step is not straightforward, and MRI-to-sCT translation has been proposed to solve the MR attenuation correction (MRAC) issue.
Moreover, standalone PET scanners can benefit from deriving sCT from uncorrected PET 32,33,34 .
In recent years, the derivation of sCT from MRI, PET or CBCT has attracted increasing interest, driven by artificial intelligence algorithms such as machine learning and deep learning (DL) 35 . This paper aims to systematically review and summarise the latest developments, challenges and trends in DL-based sCT generation methods. Deep learning is a branch of machine learning, a field of artificial intelligence, that uses neural networks to generate hierarchical representations of the input data and learn a specific task without hand-engineered features 36 . Recent reviews have discussed the application of deep learning in radiotherapy 37,38,39,40,41,42,43 and in PET attenuation correction 34 . Convolutional neural networks (CNNs), the most successful models for image processing 44,45 , have been proposed for sCT generation since 2016 46 , with a rapidly increasing number of papers published on the topic. However, DL-based sCT generation has not been reviewed in detail, except for applications in PET 47 . With this survey, we aim to summarise the latest developments in DL-based sCT generation, highlighting the contributions according to the applications and providing detailed statistics on trends in terms of imaging protocols, DL architectures and achieved performance. Finally, the clinical readiness of the reviewed methods will be discussed.

II. Material and Methods
A systematic review of techniques was carried out following the PRISMA guidelines. The PubMed, Scopus and Web of Science databases were searched for publications from January 2014 to December 2020 using defined criteria (for more details, see Appendix VII.). Studies related to radiation therapy, either with photons or protons, and to attenuation correction for PET were included when dealing with sCT generation from MRI, CBCT or PET. This review considered external beam radiation therapy, therefore excluding investigations focusing on brachytherapy.
Conversion methods based on classical machine learning techniques were not considered in this review; only deep learning-based approaches were retained. The generation of dual-energy CT was also not considered, along with the direct estimation of corrected attenuation maps from PET. Finally, conference proceedings were excluded: although proceedings can contain valid methodologies, the large number of relevant abstracts and their incomplete reporting of information made them unsuitable for this review. After the database search, duplicated articles were removed and records were screened for eligibility. A citation search of the identified articles was then performed.
Each included study was assigned to a clinical application category. The selected categories were: I) MR-only RT; II) CBCT-to-CT for image-guided (adaptive) radiotherapy; III) PET attenuation correction.
For each category, an overview of the methods was constructed in the form of tables 1 .
The tables were composed by capturing the salient information of DL-based sCT generation approaches, as schematically depicted in Figure 1.

Figure 1: Schematic of a deep learning-based sCT generation study. The input images/volumes, either MRI (green), CBCT (yellow) or PET (red), are converted by a convolutional neural network (CNN) into sCT. The CNN is trained to generate sCT similar to the target CT (blue). Several choices can be made in terms of network architecture, configuration and data pairing. After the sCT generation, the output image/volume is evaluated with image- and task-specific metrics. The configurations are labelled 2D+ when independently trained 2D networks in different views were combined during or after inference; multi-2D (m2D, also known as multi-plane) when slices from different views, e.g. transverse, sagittal and coronal, were provided to the same network; 2.5D when training was performed with neighbouring slices provided to multiple input channels of one network; and 3D when volumes were considered as input (the whole volume, 3D, or patches, 3Dp). The architectures generally considered are introduced in the next section (II.A.). The sCTs are generated by inferring the trained network on an independent test set or by combining an ensemble (ens) of trained networks. Finally, the quality of the sCT can be evaluated with image-based or task-specific metrics (II.B.).
For each sCT generation category, we compiled tables providing a summary of the published techniques, including the key findings of each study and other pertinent factors, namely: the anatomical site investigated; the number of patients included; relevant information about the imaging protocol; the DL architecture; the configuration chosen to sample the patient volume (2D, 2D+ or m2D, 2.5D or 3D); the use of paired/unpaired data during network training; the radiation treatment adopted, where appropriate; and the most popular metrics used to evaluate the quality of sCT (see II.B.).
The year of publication for each category was noted according to the date of the first online appearance. Statistics on the popularity of the mentioned fields were visualised with pie charts for each category. Specifically, we subdivided the papers according to the anatomical region they dealt with: abdomen, brain, head & neck (H&N), thorax, pelvis and whole body; where available, the tumour site was also reported. A discussion of the clinical feasibility of each methodology and of the observed trends follows.
The most common network architectures and metrics will be introduced in the following sections to facilitate the tables' interpretation.

II.A. Deep learning for image synthesis
Medical image synthesis can be formulated as an image-to-image translation problem, where a model that maps an input image (A) to a target image (B) has to be found 48 . Among all the possible strategies, DL methods have dramatically improved the state of the art 49 . The DL approaches mainly used to synthesise sCT belong to the class of CNNs, where convolutional filters are combined through weights (also called parameters) learned during training. The depth is provided by using multiple layers of filters 50 . The training is regulated by finding the "optimal" model parameters according to the search criterion defined by a loss function (L). Many CNN-based architectures have been proposed for image synthesis, the most popular being U-nets 51 and generative adversarial networks (GANs) 52 (see Figure 2). A U-net presents an encoding and a decoding path with additional skip connections to extract and reconstruct image features, thus learning to go from domain A to B. In the simplest GAN architecture, two networks compete: a generator (G) trained to produce synthetic images (B′) similar to the target set (loss L G ), and a discriminator (D) trained to classify whether B′ is real or fake (loss L D ), thereby improving G's performance.
GANs thus learn a loss that combines both tasks, resulting in realistic images 53 .
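As a concrete illustration, the combined generator objective can be sketched in a few lines of Python. The non-saturating form of the adversarial term, the `lam` weight and the toy intensity lists are illustrative assumptions (in the spirit of pix2pix-style translation GANs), not the formulation of any specific reviewed study.

```python
import math

def l1_loss(pred, target):
    """Mean absolute difference between synthetic and real CT intensities."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def adversarial_loss(d_on_fake):
    """Non-saturating generator term: -log D(G(A)), with D's output in (0, 1]."""
    return -math.log(d_on_fake)

def generator_loss(pred, target, d_on_fake, lam=100.0):
    # Total generator objective: adversarial term plus lambda-weighted L1 term.
    return adversarial_loss(d_on_fake) + lam * l1_loss(pred, target)
```

In practice both terms operate on network tensors rather than lists, but the trade-off is the same: the L1 term enforces voxel-wise fidelity to the target CT, while the adversarial term pushes the output towards realistic-looking images.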

II.B. Metrics
An overview of the metrics used to assess and compare the performances of the reviewed publications is given in Table 1, subdivided into image similarity, geometric accuracy and task-specific metrics, as suggested in 58 .

Table 1: Overview of the most popular metrics reported in the literature, subdivided into image similarity, geometric accuracy and task-specific metrics, per category.

Image similarity The most common metrics are the mean absolute error, MAE = (1/n) Σ|CT − sCT|, and the mean squared error, MSE = (1/n) Σ(CT − sCT)², with n the number of voxels in the ROI. PSNR is the ratio between the maximum intensity in an image and the intensity of the corrupting noise affecting the fidelity of its representation, calculated from the MSE; it evaluates the noise introduced by the CT synthesis relative to the ground truth CT. SSIM is a more sophisticated metric developed to exploit the known characteristics of the human visual system 59 , perceiving the loss of image structure due to variations in lighting.
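These image similarity metrics follow directly from their definitions. The pure-Python sketch below (flattened voxel lists and a caller-supplied maximum intensity for PSNR are assumptions for compactness) is a minimal illustration, not the implementation used by any reviewed study.

```python
import math

def mae(ct, sct):
    # Mean absolute error over the n voxels of the ROI
    return sum(abs(a - b) for a, b in zip(ct, sct)) / len(ct)

def mse(ct, sct):
    # Mean squared error over the same voxels
    return sum((a - b) ** 2 for a, b in zip(ct, sct)) / len(ct)

def psnr(ct, sct, max_val):
    # Peak signal-to-noise ratio in dB: maximum intensity vs MSE-based noise
    return 10.0 * math.log10(max_val ** 2 / mse(ct, sct))
```

For CT images in Hounsfield units, `max_val` is typically taken as the intensity range of the reference image.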
Geometric accuracy Along with voxel-based metrics, the geometric accuracy of the generated sCT can also be assessed by comparing corresponding segmented structures on CT and sCT, e.g. bones, fat, muscle, air and body. The segmentation can be performed manually or automatically; in the latter case, the delineations are typically obtained by applying a threshold to CT and sCT and, if necessary, morphological operations on the resulting binary masks. The metrics for geometric accuracy are, therefore, generally the same used for a segmentation task. For example, the Dice similarity coefficient (DSC) 60 is a common metric that assesses the accuracy of depicting specific tissue classes/structures. DSC is twice the ratio between the correctly classified voxels and the total number of voxels in the masks from CT and sCT (Seg CT and Seg sCT ). Additionally, metrics generally used to estimate the distance between segmentations can also be adopted, such as the Hausdorff distance (HD) 61 or the mean absolute surface distance, which measure the maximum and the average distance between two sets of contours, respectively.
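On binary masks, the DSC definition above reduces to a few lines; the 0/1 list representation is an assumption made here for compactness.

```python
def dice(seg_ct, seg_sct):
    """Dice similarity coefficient between two binary masks (iterables of 0/1):
    twice the overlap divided by the total number of foreground voxels."""
    inter = sum(a and b for a, b in zip(seg_ct, seg_sct))
    return 2.0 * inter / (sum(seg_ct) + sum(seg_sct))
```

A DSC of 1 indicates perfect overlap between the CT- and sCT-derived structures, while 0 indicates no overlap at all.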
Even if segmentation-based metrics are common, choosing the right metric for a specific task is non-trivial, as recently highlighted by Reinke et al. 62 , and should be assessed on a per-application basis.
Other image-based metrics can be subdivided according to the application and are presented in the appropriate sub-category of the following sections.
Task-specific metrics In MR-only RT and CBCT-to-CT for adaptive RT, the accuracy of dose calculations on sCT is generally compared to CT-based calculations in specific ROIs, for plans computed either with photons (x) or protons (p).
Last edited Date: August 30, 2021

The most common voxel-wise metric is the dose difference (DD), calculated as the average of D CT − D sCT in ROIs such as the whole body, the target or other structures of interest.
The dose difference can be expressed in absolute terms (Gy) or as a relative value (%), normalised either to the prescribed dose, the maximum dose or the voxel-wise reference dose. The dose pass rate (DPR) is directly correlated to DD and is calculated as the percentage of voxels with DD below a set threshold.
Gamma (γ) analysis combines dose and spatial criteria 63 and can be performed either in 2D or 3D. Several parameters need to be set to perform γ-analysis, including the dose criterion, the distance-to-agreement criterion, local or global analysis, and the dose threshold. Interpreting and comparing gamma index results between studies is challenging since they depend on the chosen parameters, the dose grid size and the voxel resolution 64,65 . Results of γ-analysis are generally expressed as the gamma pass rate (GPR), counting the percentage of voxels with γ < 1, or as the mean γ in an ROI generally defined based on a threshold of the reference dose distribution.
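To make the criteria concrete, a simplified 1D global gamma evaluation (exhaustive search over evaluated points, absolute dose criterion, no interpolation) might look as follows. Real implementations operate on 2D/3D dose grids with sub-voxel interpolation, so this is an illustrative sketch only.

```python
import math

def gamma_1d(ref_dose, eval_dose, spacing_mm, dose_crit, dta_mm):
    """Simplified 1D global gamma: for each reference point, take the minimum
    combined dose/distance deviation over all evaluated points."""
    gammas = []
    for i, dr in enumerate(ref_dose):
        best = min(
            math.sqrt(((de - dr) / dose_crit) ** 2
                      + ((j - i) * spacing_mm / dta_mm) ** 2)
            for j, de in enumerate(eval_dose))
        gammas.append(best)
    return gammas

def gamma_pass_rate(gammas):
    # Percentage of evaluated points with gamma < 1 (GPR)
    return 100.0 * sum(g < 1 for g in gammas) / len(gammas)
```

For example, a γ 2%,2mm analysis would set `dose_crit` to 2% of the global reference dose and `dta_mm` to 2 mm, which illustrates why GPR values are only comparable between studies that used the same criteria.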
Dose-volume histograms (DVHs) are among the most widespread tools in the clinical routine 66 . A DVH summarises the 3D dose distribution in a graphical 2D format, offering no spatial information.
For the evaluation of sCT, the differences at clinically relevant DVH points are generally reported.
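A cumulative DVH can be sketched as below; the flattened dose list and the caller-supplied dose levels are assumptions made for illustration.

```python
def cumulative_dvh(doses, dose_levels):
    """Cumulative DVH: for each dose level, the percentage of the ROI volume
    receiving at least that dose (assumes equal voxel volumes)."""
    n = len(doses)
    return [100.0 * sum(d >= level for d in doses) / n for level in dose_levels]
```

Evaluating this curve on CT- and sCT-based dose distributions and subtracting the values at clinically relevant points (e.g. D mean or D 95% of a target) yields the DVH differences typically reported.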
In proton RT, range shift (RS) analysis is also performed. Here, the ideal range (known as the prescribed range) is defined as the depth at which the dose has decreased to 80% of the maximum dose on the distal dose fall-off (R 80 ) 67 . The RS error (RSe) can be defined either as the absolute difference between the prescribed and the actual range (RSe = R 80,CT − R 80,sCT ) or as a relative RS error (%RS), expressed as the shift in % relative to the prescribed range along the beam direction 68 . For sCT-based PET attenuation correction, the relative error of the PET reconstruction (signed, PET err , and unsigned, PET |err| ) is usually reported along with the difference in standardised uptake values (SUV).
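The R 80 extraction from a sampled depth-dose curve can be sketched with linear interpolation on the distal fall-off; the sampled-profile representation below is an assumption made for illustration, not the implementation of any reviewed study.

```python
def r80(depths_mm, doses):
    """Depth at which the dose first drops to 80% of its maximum on the distal
    fall-off, found by linear interpolation beyond the dose peak."""
    peak = max(range(len(doses)), key=doses.__getitem__)
    level = 0.8 * doses[peak]
    for i in range(peak, len(doses) - 1):
        if doses[i] >= level >= doses[i + 1]:
            frac = (doses[i] - level) / (doses[i] - doses[i + 1])
            return depths_mm[i] + frac * (depths_mm[i + 1] - depths_mm[i])
    return None  # fall-off not sampled past the 80% level

def range_shift(depths_mm, doses_ct, doses_sct):
    # RSe = R80,CT - R80,sCT along the beam direction
    return r80(depths_mm, doses_ct) - r80(depths_mm, doses_sct)
```

Applied per beamlet along the beam direction, the difference of the two R 80 values gives the range shift error reported in the proton studies.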

Please note that, even if two papers calculate the same metric, differences may occur in the ROI where the metric is computed, making performance comparisons challenging.
For example, MAE can be computed on the whole predicted volume, in a volume of interest or in a cropped volume. In addition, the implementation of the metric computation can change. In gamma analysis, for example, different dose difference and distance-to-agreement criteria can be stated (γ 3%,3mm (γ 3 ), γ 2%,2mm (γ 2 ) and γ 1%,1mm (γ 1 )); moreover, it can be calculated on ROIs obtained from different dose thresholds, and with 2D or 3D algorithms. In the following sections, we will highlight such differences, speculating on their impact.

III. Results
Database searching led to 91 records on PubMed, 98 on Scopus and 218 on Web of Science.
After duplicates removal and content check, 83 eligible papers were found, assigned to MR-only RT (category I), CBCT-to-CT (category II), and sCT for PET attenuation correction (category III). The first conference paper appeared in 2016 46 . Given that we excluded conference papers from our search, the first included work was published in 2017. In general, the number of articles increased over the years, except for CBCT-to-CT and sCT for PET attenuation correction, which were stable in the last years. Figure 3 shows that the brain, pelvis and H&N were the most investigated anatomical regions. Most papers enrolled adult patients. Paediatric (paed) patients represent a more heterogeneous dataset for network training, and their feasibility has been investigated first for
attenuation correction in PET 74 (79 patients) and more recently for photon and proton RT 75,76 .
All the models were trained to perform a regression task from the input to sCT, except for two studies where networks were trained to segment the input image into a pre-defined number of classes, thus performing a segmentation task 77,78 .
In most of the works, training was implemented in a paired manner, with unpaired training investigated in 13/83 articles. Four studies compared paired against unpaired 71,79,80,81 .
Over all three categories, 2D networks were the most commonly adopted: 2D networks were used in about 61% of cases, 2D+ in 6%, 2.5D in 10%, and 3D configurations in 24%.
In some studies, multiple configurations were investigated, for example 79,82,83 . GANs were the most popular architectures (45 times), followed by U-nets (36) and other CNNs. Note that U-nets may be employed as the generator of a GAN; in such cases, the architecture was categorised as GAN.
All the investigations employed registration between sCT and CT to evaluate the quality of the sCT, except for Xu et al. 81 and Fetty et al. 84 , where metrics were defined to assess the quality of the sCT in an unpaired manner, e.g. the Fréchet inception distance (FID). An overview of the reviewed studies is given in Table 2 for studies on sCT for MR-only RT without dosimetric evaluations, in Tables 3a and 3b for studies on sCT for MR-only RT with dosimetric evaluations, in Table 4 for studies on CBCT-to-CT for IGART, and in Table 5 for studies on sCT for PET attenuation correction.

III.A. MR-only radiotherapy
The first work published in this category, and among all the categories, investigated a 2D paired GAN trained on prostate patients and evaluated on prostate, rectal and cervical cancer patients 90 .
Considering the imaging protocol, we can observe that most of the MRIs were acquired at 1.5 T (51.9%), followed by 3 T (42.6%), and the remaining 6.5% at 1 T or 0.35/0.3 T.
The most popular MRI sequence adopted depends on the anatomical site: T1 gradient recalled-echo (T1 GRE) for the abdomen and brain; T2 turbo spin-echo (TSE) for the pelvis and H&N. Unfortunately, for more than ten studies, either the sequence or the magnetic field strength was not adequately reported.
Generally, a single MRI sequence is used as input. However, eight studies investigated the use of multiple input sequences or Dixon reconstructions 73,76,90,98,99,102,112,125 , based on the assumption that more input contrasts may facilitate sCT generation. A relevant aspect related to MRI is the kind of pre-processing applied to the data before being fed to the network.
Generally, intensity normalisation techniques such as z-score are applied 126 . Some studies compared the performance of sCT generation depending on the acquired sequence. When focusing on the DL model configuration, we found that 2D models were the most popular, followed by 3D patch-based and 2.5D models. Only one study adopted a multi-2D (m2D) configuration 106 . Three studies also investigated the impact of combining sCTs from multiple 2D models after inference (2D+), showing that 2D+ is beneficial compared to a single 2D view 75,111,122 . When comparing the performances of 2D against 3D models, no conclusive results were found. If we now turn to the architectures employed, we can observe that GANs cover the majority of the studies (∼55%), followed by U-nets (∼35%) and other CNNs (∼10%). A detailed examination of different 2D paired GANs against U-nets with different loss functions by Largent et al. 117 showed that U-nets and GANs can achieve similar image- and dose-based performances.
Fetty et al. 119 focused on comparing different generators of a 2D paired GAN against the performance of an ensemble of models, finding that the ensemble was overall better than the single models, being more robust when generalising to data from different scanners/centres. Among the CNN architectures, it is worth mentioning the 2.5D dilated CNN by Dinkla et al. 106 , where the m2D training was claimed to increase the robustness of inference in a 2D+ manner while maintaining a large receptive field and a low number of weights.
An interesting aspect, investigated by several studies, is the impact of the training size 69,71,75,95,125 , which will be further reviewed in the discussion section.
Finally, when considering the metric performances, we found that 21 studies reported only image similarity metrics, while 30 also investigated the accuracy of sCT-based dose calculation for photon RT (19), proton RT (8), or both (3). Two studies performed treatment planning considering the contribution of magnetic fields 79,86 , which is crucial for MR-guided RT. Also, only four publications studied the robustness of sCT generation in a multi-centre setting 69,75,118,120 .
Overall, DL-based sCT resulted in a DD on average <1% and a γ 2%,2mm GPR >95%, except for one study 124 . For each anatomical site, the metrics on image similarity and dose were not always calculated consistently. This aspect will be detailed in the next section.

III.B. CBCT-to-CT for image-guided adaptive radiotherapy

Only three studies investigated unpaired training 88,132,137 ; in eleven cases, paired training was implemented by matching the CBCT and the ground truth CT through rigid or deformable registration. In Eck et al. 70 , however, CBCT and CT were not registered for the training phase, as the authors claimed that the first-fraction CBCT was geometrically close enough to the planning CT for the network. Deformable registration was then performed for the image similarity analysis. In this work, the quality of contours propagated to sCT from CT was compared to manual contours drawn on the CT to assess each step of the IGART workflow: image similarity, anatomical segmentation and dosimetric accuracy. The network, a 2D cycle-GAN implemented in a vendor-provided research software, was independently trained and tested on different sites, H&N, thorax and pelvis, leading to the best results for the pelvic region.
Other authors studied training a single network for different anatomical regions. Torrado et al. 146 pre-trained their U-net on 19 healthy brains acquired with T1 GRE MRI and subsequently trained the network using Dixon images of colorectal and prostate cancer patients. They showed that pre-training led to faster training with a slightly smaller residual error than random initialisation of the U-net weights.
Pozaruk et al. 149 proposed data augmentation over 18 prostate cancer patients by perturbing the deformation field used to match the MR/CT pairs fed to the network. In the effort to reduce the time for image acquisition and patient discomfort, some authors proposed to obtain the sCT directly from diagnostic images, T1- or T2-weighted, using images either from standalone MRI scanners 115,151,153 or from hybrid machines 78 . In particular, Bradshaw et al. 78 trained a combination of three CNNs with T1 GRE and T2 TSE MRI (single sequence or both) to derive an sCT stratified in classes (air, water, fat and bone). In the CBCT-to-CT category, a practical issue is the difference in field of view (FOV) between CBCT and CT: usually, the training is performed by registering, cropping and resampling the volume to the CBCT size, which is smaller than the planning CT.

IV. Discussion
Nonetheless, for replanning purposes, the limited FOV may hinder calculating the plan on the sCT. Some authors have proposed to assign water-equivalent density within the CT body contour to account for the missing information 134 . In other cases, the sCT patch has been stitched to the planning CT to cover the entire dose volume 88 . Ideally, an appropriate FOV coverage should be employed when recalculating the plan for online adaptive RT. Besides the dosimetric aspect, improved image quality may increase accuracy during image guidance for patient set-up and OAR segmentation. These are necessary steps for online adaptive radiotherapy, especially for anatomical sites prone to large movements, as speculated by Liu et al. 135 in the framework of pancreatic treatments.
CBCT-to-CT conversion resulted in accurate dose calculations both for photon and proton radiotherapy. For proton RT, set-up accuracy and dose calculation are even more relevant to avoid range shift errors that could jeopardise the benefit of the treatment 67 . Because there is an intrinsic error in converting HU to relative proton stopping power 182 , it has been shown that deep learning methods can translate CBCT directly to stopping power 183 . This approach has not been covered in this review, but it is promising and will probably lead to further investigations.
Interestingly, increasing the quality of CBCT can be tackled either as an image-to-image translation problem or as an inverse problem, i.e. from a reconstruction perspective.
Specifically, by having the raw data measurements (projections), DL could improve the tomographic reconstruction. In this sense, many investigations have been proposed but were considered out of the scope of this review; for the interested reader, we suggest the following resources 184,185,186,187,188 . Currently, it is unclear whether formulating (CB)CT quality enhancement as a synthesis or a reconstruction problem would be more beneficial. First attempts showed that training convolutional networks for reconstruction enhanced their
generalisation capability to other anatomy 189 ; however, research on such aspects is still ongoing.
III) PET attenuation correction. The sCT in this category is obtained either from MRI or from uncorrected PET. In the first case, the motivation is to overcome the current limitations in generating attenuation maps (µ-maps) from MR images in hybrid PET/MRI acquisitions, which miscalculate the bone contribution 190 . The reviewed results support the idea that DL-based sCT will substitute current AC methods, being also able to overcome most of the limitations mentioned above. These aspects seem to contradict the stable number of papers in this category in the last three years. Nonetheless, we have to consider that the recent trend has been to derive the µ-map directly from uncorrected PET via DL. Because this review considered only image-to-CT translation, these works were not included, but they can be found in a recent review by Lee 47 .
It is, however, worth mentioning a recent study by Shiri et al. 191 , where the largest patient cohort so far (1150 patients, split into 900 for training, 100 for validation and 150 for testing) was used for this purpose. Direct µ-map prediction via DL is a promising opportunity that may direct future research efforts in this context.

Deep learning considerations and trends
The number of patients used for training the networks is quite variable across the reviewed studies. Investigations of the training size can indicate the minimum number of patients necessary to include in the training set to achieve state-of-the-art performance. The optimal patient number may also depend on the anatomical site and its inter-fraction and intra-fraction variability. Besides, attention should be dedicated to balancing the training set, as performed in 69,75 ; otherwise, the network may overfit, as previously demonstrated for segmentation tasks 193 .
GANs were the most popular architecture, but we cannot conclude that they are the best network scheme for sCT generation. Indeed, some studies comparing U-nets or other CNNs against GANs found GANs to perform statistically better 89,143 ; others found similar results 149,150 or even worse performances 80,148 . We can speculate that, as demonstrated by 117 , a vital role is played by the loss function, which, despite being the effective driver of network learning, has been investigated less than the network architecture, as highlighted for image restoration 194 .
Another important aspect is the growing trend, except in category III, towards unpaired training (5 and 7 papers in 2019 and 2020, respectively). The quality of the registration when training in a paired manner influences the quality of deep learning-based sCT 126 . In this sense, unpaired training offers an option to alleviate the need for well-matched training pairs. When comparing paired vs unpaired training, we observed that paired training leads to slightly better performances; however, the differences were not always statistically significant 71,80,95 .
Focusing on the body sites, we observed that most of the investigations were conducted in the brain, H&N and pelvic regions. Fewer studies are available for the thorax and the abdomen, which represent a more challenging patient population due to organ motion 196 .
In MR-only RT, we found contradicting results regarding the best performing spatial configuration.

Benefits and challenges for clinical implementations
Deep learning-based sCT generation may reduce the need for additional or non-standard MRI sequences, e.g. UTE or ZTE. Avoiding additional sequences will shorten the total acquisition time and speed up the workflow, increasing patient throughput. As already mentioned, speed is particularly interesting for MR-guided RT and for adaptive RT in category II, where it is considered crucial for online correction. Concerning categories II and III, the generation of DL-based sCT may enable a dose reduction during imaging, by reducing the need for CT in case of anatomical changes (in II) or by possibly diminishing the amount of radioactive material injected (in III).
Finally, it is worth commenting on the current status of the clinical adoption of DL-based sCT. We could not find evidence that any of the considered methods is currently implemented and used clinically. We speculate that this is probably related to the fact that the field is still relatively young, with the first publications only from 2017, and that clinical implementation generally takes years, if not decades 200,201 . Additionally, as already mentioned, for categories I/II, the impact of sCT on position verification still needs to be thoroughly investigated.
The implementation may be easier for category III if the methods were directly integrated into the scanners. In general, the involvement of vendors may streamline the clinical adoption of DL-based sCT. In this sense, we can report that vendors are currently active in evaluating their methods in research settings, e.g. for brain 69 and pelvis 120 in I, and for H&N, thorax and pelvis in II 70 . Recently, Palmer et al. 202 also reported using a pre-released version of a DL-based sCT generation approach for H&N in MR-only RT.
Another essential aspect is compliance with the currently adopted regulations 203 , where vendors can offer vital support 204,205 .
A key aspect of clinical implementation is the precise definition of commissioning and quality assurance procedures; in this context, dedicated phantoms have been proposed, showing the potential of additive manufacturing.
Alternatively, it would be relevant if a CNN could automatically generate a metric to assess the quality of sCTs, as, for example, already presented for automatic segmentation 222 .
In this sense, Bragman et al. 223 introduced uncertainty estimation for such a task by adopting a multi-task network and a Bayesian probabilistic framework. More recently, two other works proposed to use uncertainty either from the combination of independently trained networks 75 or via dropout-based variational inference 224 . So far, the field of uncertainty estimation with deep learning 225 has only been superficially explored for sCT generation. It would be interesting to see future work focusing on developing criteria for automatically identifying failure cases using uncertainty prediction: patients with inaccurate synthetic CTs could be flagged for a CT rescan or for manual adjustment of the sCT, if deemed feasible.

Beyond sCT for radiotherapy
During the database search, we found other possible applications of DL-based image generation that are beyond the categories mentioned so far or the radiotherapy application.
For example, Kawahara et al. 226 proposed to generate synthetic dual-energy CT from CT using 2D paired GANs to assess the body material composition. Also, commercial solutions are starting to be evaluated for the generation of DL-based sCT from MRI for lesion detection in suspected sacroiliitis 227 or to facilitate surgical planning of the spine 228 .

V. Conclusion
Deep learning-based methods for sCT generation have been reviewed in the context of I) MR to replace CT in radiotherapy treatment planning, II) CBCT-based adaptive radiotherapy, and III) in generating attenuation maps for PET.
For each category, we presented a detailed comparison in terms of imaging protocols, DL architectures and performances according to the most popular metrics reported. We found that DL-based sCT generation is an active and growing area of research. For several anatomical sites, e.g. H&N/brain and pelvis, sCT generation seems feasible, with deep learning achieving dose differences to CT-based planning <1% in the radiotherapy context and better performance for PET attenuation correction than the standard MRAC methods.
We can conclude that deep learning-based generation of sCT has a bright future, with an extensive amount of research being done on the topic. To spread DL-based sCT techniques into the clinic, further steps will be necessary: evaluating their generalisation among multiple centres and proposing comprehensive commissioning and QA methods, to ensure treatment efficacy and patient safety.

VII. Conflict of interest
None of the authors has conflicts of interest to disclose.