Auxiliary Features-Guided Super Resolution for Monte Carlo Rendering

This paper investigates super resolution as a way to reduce the number of pixels to render and thus speed up Monte Carlo rendering algorithms. While great progress has been made in super resolution technology, it is essentially an ill-posed problem and by itself cannot recover high-frequency details in renderings. To address this problem, we exploit high-resolution auxiliary features to guide super resolution of low-resolution renderings. These high-resolution auxiliary features can be rendered quickly by a rendering engine while providing valuable high-frequency details to assist super resolution. To this end, we develop a cross-modality Transformer network that consists of an auxiliary feature branch and a low-resolution rendering branch. These two branches are designed to fuse high-resolution auxiliary features with the corresponding low-resolution rendering. Furthermore, we design residual densely-connected Swin Transformer groups to learn to extract representative features for high-quality super resolution. Our experiments show that our auxiliary features-guided super-resolution method outperforms both super-resolution methods and Monte Carlo denoising methods in producing high-quality renderings.


Introduction
Monte Carlo rendering algorithms are now widely used to generate photo-realistic computer graphics images for applications such as visual effects, video games, and computer animations. These algorithms generate a pixel's color by integrating over all the light paths arriving at a single point [CPC84]. To render a high-quality image, a large number of rays need to be cast for each pixel, which makes Monte Carlo rendering a slow process.
A great amount of effort has been devoted to speeding up Monte Carlo rendering. The core idea is to reduce the number of rays cast for each pixel.

For instance, numerous denoising algorithms are now available to reconstruct a high-quality image from a rendering produced at a low sampling rate. Such Monte Carlo denoising algorithms often use auxiliary features generated by the rendering algorithm to help denoise the noisy rendering. The recent deep neural network-based denoising algorithms can now generate very high-quality images at a fairly low sampling rate [BVM*17; CKS*17; KKR18; GLA*19].
Monte Carlo rendering can also be sped up by reducing the number of pixels to render. For example, pixels from frames that have already been rendered can be warped to generate frames in between existing frames to increase the frame rate [BDM*21] or to generate future frames to reduce latency [GFL*21]. Another approach is to render only one pixel for a block of neighboring pixels to further reduce the total number of pixels to render. This can be implemented by first rendering a low-resolution image and then applying super resolution to increase its resolution [XNC*20; HLM*21]. As super resolution is a fundamentally ill-posed problem, it alone often cannot recover high-frequency details from only the low-resolution rendering. To address this problem, Hou et al. render a high-resolution image at a low sampling rate and use it, together with high-resolution auxiliary features, to help super resolve the low-resolution rendering rendered at a high sampling rate. While this method produces a high-quality result, it needs to render the high-resolution image at a low sampling rate, which still takes a considerable amount of time [HLM*21].
Can we use only the fast-to-obtain high-resolution auxiliary features, without the high-resolution low-sample rendering, to effectively assist super resolution of the corresponding low-resolution rendering? If so, we can further speed up Monte Carlo rendering. We are encouraged by the recent work on neural frame synthesis showing that fast-to-obtain auxiliary features of the target frames can greatly help interpolate or extrapolate those frames [BDM*21; GFL*21]. On the other hand, Hou et al. showed that using a wide range of auxiliary features together with the high-resolution low-sample rendering helps super resolution more than using only a subset of auxiliary features within their deep neural network-based super resolution framework [HLM*21]. Therefore, if we only use a small number of fast-to-compute auxiliary features, we need a better super resolution method.

This paper presents a Cross-modality Residual Densely-connected Swin Transformer (XRDS) for super resolution of a Monte Carlo rendering guided by its auxiliary features. For the sake of speed, we only use two auxiliary features: albedo and normal. To effectively use these features, we design a super resolution network based on the Swin Transformer, which has recently been shown to be powerful for a wide variety of computer vision tasks. Our Transformer network has two branches, one for the low-resolution rendering and the other for the auxiliary features. These two branches are designed to perform cross-modality fusion to effectively use auxiliary features to assist super resolution of the low-resolution rendering. While the auxiliary feature branch consists of convolutional blocks, the branch for the low-resolution rendering consists of a sequence of residual densely-connected Swin Transformer blocks to extract effective features. The features from the two branches are combined using a cross-modality fusion module and are finally used to generate the high-resolution high-quality rendering.

This paper contributes to Monte Carlo rendering as follows. First, we present the first super resolution approach to Monte Carlo rendering that only uses fast-to-compute high-resolution auxiliary features to enable high-quality upsampling of a low-resolution rendering. Second, we design a dedicated cross-modality Swin Transformer-based super resolution network that can learn to effectively combine high-resolution auxiliary features with the corresponding low-resolution rendering to generate the final high-resolution high-quality image. Third, our experiments show that our method outperforms super-resolution and denoising methods in producing high-quality renderings.

Our Method
This paper proposes a super resolution method guided by fast-to-compute auxiliary features to speed up Monte Carlo rendering. Our method takes a low-resolution rendering I_LR and its high-resolution fast-to-compute auxiliary features A as input, and outputs the corresponding high-quality high-resolution result I_SR. The high-resolution auxiliary features provide the essential high-frequency information for super resolution. Different from the previous work [HLM*21], which leverages a wide range of auxiliary features, our method only employs auxiliary features that can be computed very quickly [BDM*21], namely albedo and normal. On the one hand, although our method does not leverage the shading layers, albedo and normal provide a lot of high-frequency information, e.g., the texture of the material, which is essential for super resolution. As we will show, they help improve the super-resolution results. On the other hand, albedo and normal can be computed quickly [BDM*21]. This not only reduces the rendering time but also enables us to render these high-resolution layers at a relatively higher sampling rate, which typically produces fewer artifacts, such as aliasing.
We design a cross-modality Transformer network to effectively fuse two categories of visual input, namely the low-resolution rendering and its corresponding high-resolution auxiliary features, to recover visual details. Figure 2 shows the architecture of our network. It contains two parallel branches, one for the low-resolution rendering and the other for the corresponding high-resolution auxiliary features.
Auxiliary feature branch. The auxiliary feature branch takes the auxiliary features as input, which provide essential high-frequency visual details. As discussed above, we select albedo and normal, which are relatively fast to acquire. Since this branch processes high-resolution input, we design a shallow architecture for the sake of memory and speed. As shown in Figure 2, we employ a convolutional layer and N = 3 residual blocks (RB) [HZRS16] in sequence to extract the auxiliary features, where f_conv(·) indicates the convolution operation and f_RB(·) indicates the operation of a residual block. In our experiments, we set the number of channels to 32 for the auxiliary feature branch.
We then obtain the downsampled features {D_i}, i = 0, ..., N−1, with a group of deshuffle layers [HLM*21], which downscale the features while keeping the high-frequency information.
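To make the branch concrete, the following is a minimal PyTorch sketch of one plausible instantiation of the auxiliary feature branch. It assumes the deshuffle layer corresponds to pixel unshuffling (space-to-depth) followed by a 1 × 1 convolution that keeps 32 channels, and a 6-channel albedo-plus-normal input; these layer choices and names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block (conv-ReLU-conv with an identity skip) [HZRS16]."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class AuxiliaryBranch(nn.Module):
    """Shallow branch for the high-resolution albedo/normal features.

    A conv layer and N = 3 residual blocks extract features; a group of
    deshuffle (pixel-unshuffle) layers produces the downsampled features
    D_0, ..., D_{N-1} used later by the XDG groups. This is a sketch of
    one plausible design, not the authors' exact implementation.
    """
    def __init__(self, in_channels=6, channels=32, num_blocks=3, scale=4):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.blocks = nn.ModuleList([ResidualBlock(channels) for _ in range(num_blocks)])
        # Deshuffle: space-to-depth by the SR scale, then a 1x1 convolution
        # to project back to `channels` feature maps at the low resolution.
        self.deshuffle = nn.ModuleList([
            nn.Sequential(nn.PixelUnshuffle(scale),
                          nn.Conv2d(channels * scale * scale, channels, 1))
            for _ in range(num_blocks)
        ])

    def forward(self, aux):
        x = self.head(aux)
        downsampled = []
        for block, down in zip(self.blocks, self.deshuffle):
            x = block(x)
            downsampled.append(down(x))  # D_i at the low resolution
        return downsampled

# Example: 6-channel auxiliary input (albedo + normal) at 1024x1024, scale 4.
aux = torch.randn(1, 6, 1024, 1024)
D = AuxiliaryBranch()(aux)
print([d.shape for d in D])  # three tensors of shape [1, 32, 256, 256]
```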
Low-resolution rendering branch. Following recent works on image restoration, we first extract the shallow feature F_0 from the low-resolution rendering I_LR and feed it to a sequence of cross-modality residual densely-connected Swin Transformer groups (XDG):
F_i = f_XDG(F_{i−1}, D_{i−1}),   i = 1, ..., N,

where f_XDG(·) indicates the XDG module and N indicates the number of XDGs. We choose N = 3 in our experiments. XDG is designed to fuse the auxiliary features D_{i−1} and the low-resolution rendering features F_{i−1}. It consists of a cross-modality module (XM) and a sequence of B = 5 residual densely-connected Swin Transformer blocks (RDST). Specifically, XM fuses the local information from the low-resolution rendering with the high-frequency information from the auxiliary features, while the RDST sequence learns more dedicated representations for super resolution from the fused features.
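The data flow through the XDG groups can be sketched as follows; the XM and RDST internals are replaced by simple convolutional stand-ins so that only the structure of the equation above is illustrated, not the actual modules.

```python
import torch
import torch.nn as nn

class XDG(nn.Module):
    """Placeholder for one cross-modality group: an XM fusion step followed
    by B = 5 RDST blocks. Both sub-modules are stand-ins (1x1 and 3x3
    convolutions) so that only the data flow F_i = f_XDG(F_{i-1}, D_{i-1})
    is shown."""
    def __init__(self, channels=64, aux_channels=32, num_rdst=5):
        super().__init__()
        self.xm = nn.Conv2d(channels + aux_channels, channels, 1)
        self.rdst = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1)
                                    for _ in range(num_rdst)])

    def forward(self, f, d):
        x = self.xm(torch.cat([f, d], dim=1))  # cross-modality fusion (stub)
        return x + self.rdst(x)                # RDST sequence with a short skip

# Data flow of the low-resolution branch with N = 3 groups.
groups = nn.ModuleList([XDG() for _ in range(3)])
f = torch.randn(1, 64, 256, 256)                      # F_0 from the low-res rendering
D = [torch.randn(1, 32, 256, 256) for _ in range(3)]  # D_0..D_2 from the auxiliary branch
for xdg, d in zip(groups, D):
    f = xdg(f, d)                                      # F_i = f_XDG(F_{i-1}, D_{i-1})
print(f.shape)  # torch.Size([1, 64, 256, 256])
```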

Cross-modality module (XM). As shown in Figure 3, XM takes the feature F from the low-resolution rendering branch and the feature D from the auxiliary feature branch, and outputs the fused feature X.
The fused feature X_{i−1} from XM is processed by the sequence of RDST blocks, where f_RDST(·) indicates the operation of an RDST block. We also use a short skip connection to combine the shallow feature X_{i−1} with the resulting deep feature.

Upscale. We adopt the pixel shuffle layer [SCH*16], denoted f_UP(·), to upscale the dense feature F_DF to a high-resolution feature. We then use a 3 × 3 convolutional layer with 3 output channels to predict the final high-resolution image I_SR.
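The upscaling head can be sketched in PyTorch as follows. The convolution that expands the channels before the pixel shuffle, as well as the channel counts, are our illustrative assumptions; only the pixel shuffle layer and the final 3 × 3, 3-channel convolution are taken from the description above.

```python
import torch
import torch.nn as nn

class UpscaleHead(nn.Module):
    """Sketch of the upscaling head: pixel shuffle [SCH*16] followed by a
    3x3 convolution that predicts the 3-channel image I_SR."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)               # rearranges channels into space
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)  # predicts I_SR

    def forward(self, f_df):
        return self.to_rgb(self.shuffle(self.expand(f_df)))

f_df = torch.randn(1, 64, 256, 256)
print(UpscaleHead()(f_df).shape)  # torch.Size([1, 3, 1024, 1024])
```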
Training details. We adopt the robust loss to handle predictions with a high dynamic range [HLM*21].
In the loss, I_HR indicates the ground truth image, M indicates the number of pixels, and β indicates the robust factor, which is set to 0.1.
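While the exact robust loss follows [HLM*21], one bounded form consistent with the description later in this paper (the per-pixel loss saturates at 1 for arbitrarily large errors) can be sketched as follows; the specific expression is an assumption for illustration only.

```python
import torch

def robust_loss(pred, target, beta=0.1):
    """Bounded per-pixel loss |e| / (|e| + beta), averaged over pixels.

    This is an illustrative stand-in for the robust loss of [HLM*21]: it
    saturates at 1 for very large errors (e.g., fireflies), so a few
    extreme pixels cannot dominate training.
    """
    err = (pred - target).abs()
    return (err / (err + beta)).mean()

pred = torch.rand(1, 3, 256, 256) * 10.0   # HDR-ish prediction
target = torch.rand(1, 3, 256, 256) * 10.0
print(robust_loss(pred, target).item())
```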
We implement our network in PyTorch. We train our super resolution network on examples of size 256 × 256. We select Adam [KB14] with β1 = 0.9 and β2 = 0.999 as the optimizer. The learning rate is set to 0.0001. We train the network for 400 epochs with a mini-batch size of 16 for our 4× super resolution models.
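For reference, these hyper-parameters translate into a training loop along the following lines, where the model, the data loader of 256 × 256 crops, and the robust loss are placeholders:

```python
import torch

# Sketch of the reported training configuration: Adam (beta1=0.9, beta2=0.999),
# learning rate 1e-4, 256x256 crops, mini-batch size 16, 400 epochs.
# `model`, `train_loader`, and `robust_loss` are placeholders for the network,
# a dataloader of 256x256 training crops, and the loss sketched above.
def train(model, train_loader, robust_loss, epochs=400, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    for epoch in range(epochs):
        for lr_img, aux, hr_img in train_loader:  # low-res rendering, aux features, ground truth
            lr_img, aux, hr_img = lr_img.to(device), aux.to(device), hr_img.to(device)
            sr_img = model(lr_img, aux)
            loss = robust_loss(sr_img, hr_img)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```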

Experiments
We evaluate our network by quantitatively and qualitatively comparing it with state-of-the-art image super resolution methods and Monte Carlo denoising methods on the BCR dataset [HLM*21] and the Gharbi dataset [GLA*19]. We also conduct an ablation study to examine our method. Following [HLM*21], we adopt Relative Mean Square Error (RelMSE) and PSNR to evaluate our methods in the scene-linear color space and the sRGB space, respectively. Please refer to the supplementary material for an interactive demo that provides more results.

Comparison with Super Resolution Methods
We compare our method with state-of-the-art super-resolution methods, including EDSR [LSK*17] and RCAN [ZLL*18]. As shown in Table 1, our method outperforms these super-resolution methods. This improvement can be largely attributed to the use of high-resolution auxiliary features to capture high-frequency visual details. For this experiment, we use the ground truth auxiliary features as they are fast to acquire. We also vary the number of samples used to generate these features to examine their effect on our method. As shown in Figure 5, while having more samples to generate these auxiliary features benefits our method, the features generated with only one sample per pixel already allow our method to outperform the standard super-resolution methods.
MSSPL takes both the low-resolution rendering and the high-resolution noisy rendering, as well as a wide variety of high-resolution auxiliary features, as input [HLM*21]. In this test, the high-resolution rendering and features are rendered with one sample per pixel. As shown in Figure 5 and Table 1, our method, when only using albedo and normal as auxiliary features obtained with one sample per pixel, can achieve 35.16 dB, which is higher than MSSPL (34.27 dB) for the ×4 task, even though our method takes much less information from the high-resolution input.
Speed and memory. As most denoising methods do not take high-resolution auxiliary features as input, we follow MSSPL [HLM*21] to compute the average spp for our method and MSSPL as spp_avg = spp_LR / s² + spp_HR, where s indicates the scale, and spp_LR and spp_HR indicate the sampling rates for the low-resolution and high-resolution inputs, respectively. The denoising methods take 1-spp RGB and 1-spp auxiliary buffers as inputs, while our method takes 4-spp low-resolution RGB (×2, effectively the same sampling rate as 1 spp at the high resolution) and 1-spp high-resolution auxiliary buffers. In our case, we take the sampling rate for the auxiliary features as spp_HR. We would like to note that this measurement of spp is unfair to our method, as our method only uses high-resolution albedo and normal features, which take much less time to render than all the shading layers needed to obtain the high-resolution rendering as done in MSSPL.
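As a concrete example of this accounting, the ×2 configuration above, with a 4-spp low-resolution rendering and 1-spp high-resolution auxiliary buffers, gives

\[ \mathrm{spp}_{\mathrm{avg}} = \frac{\mathrm{spp}_{LR}}{s^2} + \mathrm{spp}_{HR} = \frac{4}{2^2} + 1 = 2, \]

i.e., an average of 2 samples per pixel.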
As shown in Table 3, our method generates better results than the state-of-the-art methods on the BCR dataset [HLM*21]. Our ×2 model wins by 0.18 dB, 0.28 dB, and 0.35 dB in terms of PSNR at 2 spp, 4 spp, and 8 spp, respectively.
We also conduct experiments at the ×16 scale. On the one hand, at ×16 our method produces worse results than MSSPL because MSSPL uses the high-resolution RGB image as input, which is not available to our method. While the high-resolution RGB input to MSSPL is rendered at a low sampling rate, it still provides useful information; as shown in the existing literature on Monte Carlo denoising, even a rendering at 1 spp can be denoised to a reasonable quality. At such a high upsampling rate of ×16, super resolution is very difficult. On the other hand, in practice, given a target overall spp rate, our method can select an optimal combination of (spp rate, super-resolution scale) that outperforms MSSPL and other methods, as shown in Table 3. In practice, ×16 will not be used by either MSSPL or our method to achieve an overall target spp, as it produces the worst results among the alternative combinations of spp rate and super-resolution scale.
Figure 7 shows the visual comparisons. Our results are more visually plausible. Briefly, instead of working in the pixel color space, which can potentially cause color fidelity problems, our method fuses the low-resolution RGB and the high-resolution feature maps in the feature space and learns to fuse them into correct colors, thus alleviating color ambiguities and artifacts at fine details. For example, in Figure 7, the wall in our results is less noisy and more accurate than in the results from the other methods, which are either blurred or inconsistent with the ground truth. In the second example, our method produces high-frequency geometric details in the wine basket area that well differentiate the mesh color and the background color.

Following MSSPL [HLM*21], we directly test our models pretrained on the BCR dataset without fine-tuning, as the training set of the Gharbi dataset is not available. Our ×2 model wins by 0.27 dB and 0.17 dB in terms of PSNR at 4 spp and 8 spp, respectively. When the spp is 16, our PSNR is slightly lower than MSSPL [HLM*21]. We would like to point out that our method takes less high-resolution information than MSSPL: our input high-resolution auxiliary features only include the albedo and normal, while MSSPL also takes all the shading layers as inputs. When the high-resolution input is rendered at a high spp, the shading layers can contribute a lot of high-frequency information. Similar to the findings in MSSPL [HLM*21], our results on RelMSE are heavily affected by a small number of pixels with abnormally large errors; excluding these abnormal pixels greatly improves our RelMSE scores. As shown in Figure 8, our method produces high-quality results with much fewer artifacts when compared to the ground truth.

Discussions
Auxiliary feature sampling rates. As discussed above and shown in Figure 5, using more samples to generate the auxiliary features helps our method generate better super resolution results. However, even using one sample per pixel to generate the auxiliary features already enables our method to significantly outperform standard super resolution methods. Moreover, when we use 16 samples to generate these features, our results are already very close to those that use features generated with 4000 samples per pixel, denoted as Agt in the figure.
Input layers of auxiliary features. We examine how our method works with different auxiliary feature layers. The upsampling scale is set to ×4. We use 4000 spp for I_LR and A. As shown in Table 6, both albedo and normal improve the results significantly, as they provide the essential high-frequency visual details for super resolution. The performance of our network can be further improved if we take both of them as inputs. We also evaluated the performance of both our method and MSSPL [HLM*21] using fast-to-compute auxiliary features as well as full auxiliary features. In these experiments, the upsampling scale is set to ×4. As shown in Table 5, both our network and MSSPL benefit from using the full auxiliary features due to the richer high-resolution information they provide. However, our method with the fast-to-compute layers still outperforms MSSPL with the full auxiliary layers, which demonstrates the effectiveness of our network architecture.
Network Effectiveness. We examine how our network architecture compares with recent denoising networks AdvMC [XZW*19] and MCSA [YNL*21] under a comparable input budget. Specifically, we feed high-resolution 1-spp RGB and 1-spp auxiliary buffers to AdvMC and MCSA and fine-tune them on the BCR dataset. In this experiment, our method takes 4-spp low-resolution RGB (×2, effectively the same sampling rate as 1 spp at the high resolution) and 1-spp high-resolution auxiliary buffers. Table 7 shows that our method outperforms these methods, which demonstrates the effectiveness of our Transformer-based network architecture.
Network architecture components. We examine the effect of the network architecture components.

Our robust loss vs. SMAPE loss [Mea86]. We use the robust loss based on our observation that there are a very small number of pixels with abnormally large intensity values in our dataset, mostly due to firefly artifacts. These pixels often incur very large errors during training and thus compromise the performance of our model. The robust loss reduces the undesirable impact of these pixels, as it limits the maximal per-pixel loss value to 1 no matter how large the pixel error is. We compared these two loss functions. In our experiments, the upsampling factor is set to 4, and we set the sampling rate to (16-1). Models trained with the SMAPE loss showed slightly worse results: 33.96 vs. 34.12 in PSNR, and 0.0046 vs. 0.0035 in RelMSE.
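For reference, one common SMAPE formulation used in the Monte Carlo denoising literature is sketched below; the exact constant and normalization may differ from [Mea86] and from our implementation, so this is for illustration only.

```python
import torch

def smape_loss(pred, target, eps=1e-2):
    """Illustrative SMAPE-style loss: |pred - target| / (|pred| + |target| + eps),
    averaged over pixels. Unlike the bounded robust loss above, its per-pixel
    value depends on the magnitudes of the prediction and the ground truth."""
    return ((pred - target).abs() / (pred.abs() + target.abs() + eps)).mean()

pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)
print(smape_loss(pred, target).item())
```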
When comparing scales ×4 and ×2, ×4 takes less peak memory and is faster than ×2, but ×2 leads to better quality. To make a fair comparison, we maintain a consistent average sampling rate across different scales. Consequently, the low-resolution input of our ×1 model is rendered at a much lower average sampling rate than that of our ×2 model. This makes the input RGB image to our ×1 model very noisy and thus compromises the final quality of Ours ×1, as reported in the 2-spp column of Table 3. In the 4-spp column of the same table, the difference between Ours ×1 and Ours ×2 is less significant, as in this setting the average sampling rate of Ours ×1 is reasonably higher and provides more information for our model to synthesize higher-quality results.
In addition, we used the same training pipeline for our ×1 model as for the other scales, keeping the number of epochs consistent across all scales. However, due to the high memory requirement of training the ×1 model, we had to use a smaller mini-batch size. This could also impact the performance, but we believe it is not as significant as the first reason discussed above.
Perceptual quality. We examine the perceptual quality of our results using the LPIPS metric [ZIE*18]. Table 7 and Table 9 present the results for AdvMC [XZW*19], MCSA [YNL*21], and our method. Our approach outperforms the others in terms of both PSNR and LPIPS, thereby demonstrating its ability to generate images with high perceptual quality.

Limitations and Future Work
Fusion in highly reflective parts of a scene is challenging. Our method produces high-frequency visual details by two means: 1) training a neural network to learn to recover high-frequency information from the low-resolution input, and 2) using high-frequency information from the high-resolution albedo and normal maps. Our neural network can learn to produce visual details for many examples. However, super resolution from a low-resolution input alone is inherently an ill-posed problem. In highly reflective parts of a scene, such as the example shown in Figure 9, the high-resolution normal and albedo maps cannot, by their nature, provide high-frequency details, and our method may fail.
Compared to CNN-based methods, our method is slow. However, compared to another Transformer-based method [YNL*21], our method uses less peak memory (0.89 GB vs. 30.56 GB) and is faster (1.0 s vs. 2.5 s) when producing a 1024 × 1024 image on an Nvidia A40. Research on fast Transformers has been advancing quickly recently; Patro et al. [PA23] offer an extensive review of efficient vision Transformers. With the advancement of such efficient architectures, the speed of our method can potentially be further improved.

In this paper, we specifically explored albedo and normal as quick-to-compute auxiliary features. However, we acknowledge that other auxiliary features, such as a Whitted ray-traced layer, could offer valuable high-frequency information and be generated quickly. Incorporating such a layer can potentially improve the performance of our method. Unfortunately, the BCR dataset does not contain such layers. We plan to explore this in our future research.

Conclusion
This paper explored high-resolution fast-to-compute auxiliary features to guide super resolution of Monte Carlo renderings. We developed a dedicated cross-modality Transformer network to fuse high-resolution fast-to-compute auxiliary features with the corresponding low-resolution rendering. We designed a Transformer-based cross-modality module to fuse the features from the two modalities. We also developed a Residual Densely-connected Swin Transformer block to learn more representative features. Experimental results indicate that our proposed method surpasses existing state-of-the-art super-resolution and denoising techniques in producing high-quality images.

Figure 2: The architecture of our network. Our network takes a low-resolution rendering and its corresponding fast-to-compute high-resolution auxiliary features as input and predicts the final high-resolution high-quality image.

Figure 3: The cross-modality module. It takes the feature F from the low-resolution rendering branch and D from the auxiliary feature branch, and outputs the fused feature X.
We design RDST by combining the ideas of the Residual Densely-connected Network (RDN) [ZTK*18] and the Swin Transformer [LLC*21]. We are specifically inspired by SwinIR [LCS*21], which explores Swin Transformers for image restoration tasks. It replaces traditional convolutional layers with Swin layers in residual blocks, allowing for the learning of more descriptive features and delivering impressive results. Taking inspiration from RDN [ZTK*18], we introduce RDST, where the convolution layers in densely-connected blocks are replaced with Swin layers. As shown in Figure 4, RDST consists of a sequence of densely-connected Swin Transformer blocks and a local feature fusion block. For the densely-connected Swin Transformer blocks, we shift the windows. We also use a local skip connection to fuse the features from the shallow layer.
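Structurally, an RDST block can be sketched as follows; the Swin Transformer layers are replaced by simple convolutional stand-ins so that the sketch stays short and runnable, and only the dense connections, the local feature fusion, and the local skip connection are illustrated.

```python
import torch
import torch.nn as nn

class RDST(nn.Module):
    """Sketch of a residual densely-connected Swin Transformer block.

    Structure only: densely-connected sub-blocks, a 1x1 local-feature-fusion
    layer over the concatenated features, and a local skip connection. Each
    sub-block below is a 3x3 convolution standing in for a (shifted-)window
    Swin Transformer layer; it is not the actual Swin layer.
    """
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),  # Swin-layer stand-in
                nn.GELU(),
            ))
        # Local feature fusion: squeeze the concatenated dense features back.
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))     # local skip connection

print(RDST()(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 64, 64, 64])
```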

Figure 5: The effects of the sampling rates used to generate fast-to-compute auxiliary features on the performance of our method. Agt indicates the ground truth auxiliary features (4000 spp).

Figure 9: Failure example. The performance of our method is compromised in areas where the albedo and normal cannot provide high-frequency details.
This section briefly discusses work relevant to our paper, including Monte Carlo denoising, super resolution, and vision Transformers.

Monte Carlo Denoising. Monte Carlo rendering algorithms need numerous samples per pixel to generate a high-quality rendering [CPC84; Kaj86]. With insufficient samples, the rendering results suffer from noise. To address this problem, many Monte Carlo denoising methods have been developed to reconstruct high-quality renderings from only a small number of samples. Traditional methods reconstruct renderings in a way similar to general image denoising methods, by designing specific denoising kernels based on image variance or geometric features, or by directly regressing the output pixels. Recently, learning-based methods have been introduced; these methods learn to reconstruct high-quality renderings from a small number of samples. In their seminal work, Kalantari et al. estimated optimal filter parameters using a multi-layer perceptron neural network [KBS15]. Bako et al. estimated spatially adaptive kernels for denoising in a convolutional manner [BVM*17]. Vogels et al. extended the concept of kernel prediction methods to temporal denoising [VRM*18]; with asymmetric loss functions, their method can produce high-quality results for a sequence of frames. Chaitanya et al. developed a recurrent autoencoder to denoise a sequence of frames while maintaining temporal stability [CKS*17]. Xu et al. developed an adversarial approach to Monte Carlo denoising that can greatly reduce artifacts such as blur and unfaithful details in denoising results [XZW*19]. Gharbi et al. developed a kernel splatting network that reconstructs the final image by splatting samples to pixels according to estimated splatting kernels [GLA*19]. Munkberg et al. proposed to filter auxiliary layers of individual samples [MH20]; their method works well on outliers and complex visibility. Hasselgren et al. proposed a neural spatio-temporal sampling method for Monte Carlo video denoising [HMS*20]; their method first estimates a sampling map from temporal reprojection and auxiliary features and then denoises the resulting image.

Table 1: Comparison with super resolution methods at different upsampling scales on the BCR dataset [HLM*21].
Inspired by the success of the Swin Transformer [LLC*21; LCS*21] and the Transformer decoder [GLDZ22], we design XM based on the Swin Transformer, which can efficiently model long-range dependencies. Figure 3 shows the architecture of XM. It takes the features F from the low-resolution rendering branch and the features D from the auxiliary feature branch as input and outputs the fused feature X. We first extract intermediate features F_mid from F, which serve as the "query" Q. From D, which holds the high-resolution information, the "key" K and "value" V are extracted. Then, the cross-attention is calculated following [VSP*17] and combined with F_mid to generate F_cross. Finally, an MLP layer is used to integrate the features from the low-resolution branch and the cross-attention.

Residual Densely-connected Swin Transformer block (RDST). As shown in Figure 2, we feed the fused feature X from XM to the sequence of B = 5 RDST blocks.
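A rough PyTorch sketch of this cross-attention fusion is given below. It uses a standard multi-head attention layer with the query taken from F and the key/value taken from D, followed by an MLP; the attention here is global rather than windowed (Swin-style), and the projection that produces F_mid as well as the layer sizes are illustrative assumptions, not the exact design.

```python
import torch
import torch.nn as nn

class CrossModalityModule(nn.Module):
    """Sketch of XM: query from the rendering features F, key/value from the
    auxiliary features D, followed by an MLP to integrate the result."""
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        self.to_mid = nn.Linear(channels, channels)  # F -> F_mid (query source)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(channels),
                                 nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, f, d):
        # f, d: [B, C, H, W] feature maps at the low resolution.
        b, c, h, w = f.shape
        f_tokens = f.flatten(2).transpose(1, 2)           # [B, H*W, C]
        d_tokens = d.flatten(2).transpose(1, 2)
        f_mid = self.to_mid(f_tokens)                      # query Q
        f_cross, _ = self.attn(f_mid, d_tokens, d_tokens)  # K, V from D
        x = self.mlp(f_mid + f_cross) + f_tokens           # integrate and keep F
        return x.transpose(1, 2).reshape(b, c, h, w)

f = torch.randn(1, 64, 32, 32)
d = torch.randn(1, 64, 32, 32)
print(CrossModalityModule()(f, d).shape)  # torch.Size([1, 64, 32, 32])
```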

Table 2: Comparison of runtime cost and peak memory with super resolution methods to produce a 1024 × 1024 image on an Nvidia Titan XP.

Table 3: Comparison on the BCR dataset [HLM*21].

Table 2 reports the speeds and the peak memory of the above methods. As our method is based on the Transformer, it is slower than CNN-based methods, including EDSR [LSK*17], RCAN [ZLL*18], and MSSPL [HLM*21].

Table 4: Comparison on the Gharbi dataset [GLA*19]. We directly test our models pretrained on the BCR dataset without fine-tuning.

Table 6: The effects of input fast-to-compute auxiliary feature layers on the BCR dataset [HLM*21].

Table 4 reports the comparison on the Gharbi dataset [GLA*19].

Table 9: Comparison of the perceptual quality on the BCR dataset [HLM*21]. We use the LPIPS metric [ZIE*18] as the measure of perceptual quality.

These findings are consistent with previous denoising methods [BVM*17; GLA*19], where intermediate layers can improve the final results.

Table 11: The effects of the number of XDG blocks on the BCR dataset [HLM*21]. We measure the FLOPs and MACs for a single 1024 × 1024 image [RRRH20].
The upsampling scale is set to 4×. In this test, we remove the XM modules and replace our RDST with state-of-the-art blocks, including RDB from RDN [ZTK*18] and RSTB from SwinIR [LCS*21]. As shown in Table 8, our RDST greatly improves the results. These improvements can be attributed to the strong generalization capability of RDST. Besides, the XM modules further improve the results.