TRIPS: Trilinear Point Splatting for Real-Time Radiance Field Rendering

Point-based radiance field rendering has demonstrated impressive results for novel view synthesis, offering a compelling blend of rendering quality and computational efficiency. However, even the latest approaches in this domain are not without shortcomings. 3D Gaussian Splatting [KKLD23] struggles when tasked with rendering highly detailed scenes due to blurring and cloudy artifacts. ADOP [RFS22], on the other hand, produces crisper images, but its neural reconstruction network decreases performance, it grapples with temporal instability, and it is unable to effectively fill large gaps in the point cloud. In this paper, we present TRIPS (Trilinear Point Splatting), an approach that combines ideas from both Gaussian Splatting and ADOP. The fundamental concept behind our novel technique is to rasterize points into a screen-space image pyramid, with the pyramid layer selected according to the projected point size. This approach allows rendering arbitrarily large points using a single trilinear write. A lightweight neural network is then used to reconstruct a hole-free image including detail beyond the splat resolution. Importantly, our render pipeline is entirely differentiable, allowing for automatic optimization of both point sizes and positions. Our evaluation demonstrates that TRIPS surpasses existing state-of-the-art methods in terms of rendering quality while maintaining a real-time frame rate of 60 frames per second on readily available hardware. This performance extends to challenging scenarios such as scenes with intricate geometry, large-scale environments, and auto-exposed footage.


Introduction
Novel view synthesis methods have been a significant driver for computer graphics and vision, as they have revolutionized the way we perceive and interact with 3D scenes. Many of these methods rely on explicit representations, such as meshes or points. Typically, the explicit models are derived from 3D reconstruction processes and can be efficiently rendered through rasterization, which aligns well with contemporary GPU capabilities. Nevertheless, these reconstructed models often fall short of perfection and necessitate additional steps to mitigate artifacts.
A common strategy to handle these artifacts is to use scene-specific optimization methods, known as inverse rendering. This allows for the adjustment of the scene's texture, geometry, and camera parameters to align the renderings with the photographs. Prominent techniques in this domain incorporate per-point descriptors [RFS22, ASK*20, KPLD21], explicit optimization of point sizes via Gaussians [KKLD23], and learned neural refinement networks [TZN19, RFS22, KLR*22]. While this generally extends render times, it significantly enhances visual quality.
In the realm of point-based inverse and neural rendering techniques, two successful recent approaches are 3D Gaussian Splatting [KKLD23] and ADOP [RFS22]. The former method employs a unique strategy where each point is rendered as a 3D Gaussian distribution, allowing for direct optimization of the points' shape and size. This process effectively fills gaps in point clouds within the global coordinate space through the utilization of large splats. Remarkably, this approach yields high-quality images without necessitating the integration of a neural network for reconstruction. However, a drawback is the potential loss of sharpness, as Gaussians tend to introduce blurriness and cloudy artifacts, particularly when there are limited observations available.
In contrast, ADOP rasterizes radiance fields as one-pixel points with depth testing at multiple resolutions. Subsequently, it employs a neural network to address gaps and enhance texture details in screen space. This approach possesses the capability to reconstruct texture details that surpass the resolution of the original point cloud, although the neural network adds additional computational overhead and shows weaknesses in filling large holes.
In this paper, we introduce TRIPS, a novel approach that seeks to harness the strengths of both ADOP and 3D Gaussians without losing real-time rendering capabilities. Similar to 3D Gaussian Splatting, TRIPS rasterizes splats of varying size; however, like ADOP, it also applies a reconstruction network to generate hole-free and crisp images. More precisely, we first rasterize the point cloud as 2 × 2 × 2 trilinear splats into an image pyramid and blend them using front-to-back alpha blending. Subsequently, we feed the image pyramid through a compact and efficient neural reconstruction network, which harmonizes the various layers, addresses remaining gaps, and conceals rendering artifacts. To ensure the preservation of high levels of detail, particularly in challenging input scenarios, we incorporate spherical harmonics and a tone mapping module into our pipeline.
In our evaluations, we demonstrate that our approach yields crisper images than 3D Gaussians at almost the same performance. Furthermore, it surpasses ADOP in filling sizable gaps and maintaining temporal consistency throughout the rendering process. In summary, our contributions are:
• The introduction of TRIPS, a novel trilinear point splatting technique for radiance field rendering.
• A differentiable pipeline for the optimization of all input parameters, including point positions and sizes, creating a robust scene representation.
• An implementation of the method, resulting in high-quality real-time renderings under varying capturing conditions, at: https://github.com/lfranke/TRIPS

Related Work
In this section, we provide an overview of the field of novel view synthesis and choices for scene representations in this problem domain. For the scope of real-time radiance field rendering, however, Kerbl and Kopanas et al. [KKLD23] argue that ray-marching as a rendering concept is challenging on current GPU hardware.

Real-Time Rendering for Radiance Fields via Points
In the domain of real-time radiance field rendering, point clouds as an explicit proxy representation remain a great option. Point clouds are easily captured via LiDAR-based mapping [LXG22]. Another problem shared by point rendering techniques is how to fill holes in the unstructured data. Two main approaches have evolved over the years [KB04]: splatting (in world space) and screen-space hole filling.
In world-space hole filling, points are represented as oriented discs, often termed "splats" or "surfels", with disc radii precomputed based on point cloud density. To reduce artifacts between neighboring splats, these discs can be rendered using Gaussian alpha masks and combined with a normalizing blend function [AGP*04, PZVBG00, ZPVBG01]. Recent techniques optimize splat sizes [KKLD23, ZBRH22] or improve quality with neural networks [YCA*20]. Regarding performance, overdraw poses a major issue, as splats tend to overlap substantially; thus, special care has to be taken regarding the number of splats drawn. 3D Gaussian Splatting [KKLD23] can be considered the state of the art in this domain. It combines anisotropic Gaussians with a very fast tiled renderer and optimizes splat sizes via gradient descent. However, limiting the number of Gaussians is necessary to avoid performance hits, which in turn can lead to over-blurring of small detailed elements.
The second direction involves screen-space hole filling, where points, often rendered as tiny splats, are post-processed either through traditional methods [PGA11, MKC07, GD98] or using convolutional neural networks (CNNs) [ASK*20, MGK*19, SCCL20]. While these techniques bridge large point distances, their need for a large receptive field can result in artifacts or performance issues. A multi-resolution pyramid rendering approach mitigates this by assigning different network layers to varied resolutions [ASK*20, RALB22, HFF*23], albeit reintroducing overdraw issues at lower layers [RFS22]. Notably, ADOP [RFS22] excels in screen-space hole filling, enabling the rendering of hundreds of millions of points for sharp object visualization [SKW22], but it encounters challenges with temporal aliasing and substantial hole filling.
Our approach aims to take the best of both worlds. Using TRIPS, we can render large splats by optimizing their size, but avoid high rasterization costs. This allows rendering enormous point clouds and detailed textures, while still being real-time capable without aliasing or temporal instability.

Method
Fig. 2 provides an overview of our rendering pipeline. The input data consists of images with camera parameters and a dense point cloud, which can be obtained through methods like multi-view stereo [SZPF16] or LiDAR sensing. To render a specific view, we project the neural color descriptors of each point into an image pyramid using the TRIPS technique (as detailed in Sec. 3.1) and blend them (Sec. 3.2). Subsequently, a compact neural reconstruction network (described in Sec. 3.3) integrates the layered representation, followed by the application of a spherical harmonics module (discussed in Sec. 3.4) and a tone mapper that transforms the resulting features into RGB colors.
Core to our method is the trilinear point renderer, which splats points bilinearly onto their screen-space positions as well as linearly into two resolution layers, determined by the projected point size. The renderer thus takes as input the camera intrinsics C, the extrinsic pose (R,t) of the target view, the point positions x, an optional environment map E, the world-space point sizes s_w, the neural point descriptors τ, and the per-point transparency α.
In contrast to other approaches, we do not use multiple render passes with progressively smaller resolutions, as this causes severe overdraw in the lower-resolution layers. Instead, we compute the two layers that best match the point's projected size and render it only into these layers as a 2 × 2 splat. By doing so, we mimic varying splat sizes while effectively rendering only 2 × 2 splats. The layers are later merged into the final image by a small neural reconstruction network (Sec. 3.3), resembling the decoder part of a U-Net.

Differentiable Trilinear Point Splatting
Using the camera intrinsics C and pose (R,t), we project each point position (x_w, y_w, z_w) to continuous (non-rounded) screen-space coordinates (x, y, z) and each world-space point size s_w to a screen-space size s using the camera's focal length f, i.e. s = f · s_w / z. Next, we render these points as 2 × 2 × 2 splats bilinearly and handle the point size by splatting into two neighboring resolution layers L, as shown in Fig. 3. The resolution layers are selected to be the two closest in size to the projected size of the point, with L_lower = ⌊log2(s)⌋ and L_upper = ⌈log2(s)⌉.
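The projection and layer selection described above can be sketched as follows; this is a minimal illustration under our reading of the text (the size formula s = f · s_w / z, the log base 2, and the clamping of sub-pixel sizes are assumptions, and all names are hypothetical):

```python
import math

def project_point(p_cam, f, s_w):
    """Pinhole sketch: camera-space point -> screen coords, size, and layers.

    p_cam: point already transformed by (R, t) into camera coordinates;
    f: focal length in pixels; s_w: world-space point size.
    """
    x, y, z = p_cam
    # Perspective projection to continuous (non-rounded) screen coordinates.
    sx, sy = f * x / z, f * y / z
    # Projected screen-space point size (assumed formula: s = f * s_w / z).
    s = f * s_w / z
    # The two pyramid layers closest to the projected size; sub-pixel sizes
    # are clamped so both layers resolve to the finest level 0.
    s_clamped = max(s, 1.0)
    l_lower = math.floor(math.log2(s_clamped))
    l_upper = math.ceil(math.log2(s_clamped))
    return (sx, sy, z), s, l_lower, l_upper
```

For example, a point at depth 2 with world size 0.06 and f = 100 projects to a screen size of 3 px, which falls between pyramid layers 1 and 2.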
For each of the eight selected pixels, we compute the contribution of the point to that pixel and modulate it with the point's own transparency. The final opacity value γ written to the image pyramid for a pixel is γ = β · ι · α, where β is the bilinear weight inside the image layer, ι is the linear layer weight, and α the opacity value of the point. The layer weight ι is a standard linear interpolation if the point size s lies inside the image pyramid. The second case of Eq. (5) handles far-away points that have a pixel size smaller than one. In order not to miss these, we always add them to the finest level 0. To avoid that their weight disappears, we ensure that their contribution is at least ε = 0.25.
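A minimal Python sketch of the trilinear write, under our reading of the weights above (the exact interpolation of ι between layers and the handling of the ε-floor are assumptions; names are illustrative):

```python
import math

def trilinear_weights(sx, sy, s, alpha, eps=0.25):
    """Return (layer, px, py, gamma) for the 2x2x2 = 8 target pixels.

    gamma = beta * iota * alpha: beta is the bilinear weight within a
    layer, iota the linear weight between the two selected layers.
    """
    out = []
    if s < 1.0:
        # Sub-pixel points go only to the finest layer 0, with a minimum
        # contribution of eps so their weight never vanishes.
        layers = [(0, max(s, eps))]
    else:
        l_lo, l_hi = math.floor(math.log2(s)), math.ceil(math.log2(s))
        if l_lo == l_hi:
            layers = [(l_lo, 1.0)]
        else:
            t = math.log2(s) - l_lo          # position between the layers
            layers = [(l_lo, 1.0 - t), (l_hi, t)]
    for layer, iota in layers:
        # Pixel coordinates in this layer (each layer halves the resolution).
        lx, ly = sx / (2 ** layer), sy / (2 ** layer)
        x0, y0 = int(math.floor(lx)), int(math.floor(ly))
        fx, fy = lx - x0, ly - y0
        for dx, wx in ((0, 1 - fx), (1, fx)):
            for dy, wy in ((0, 1 - fy), (1, fy)):
                beta = wx * wy               # bilinear weight in the layer
                out.append((layer, x0 + dx, y0 + dy, beta * iota * alpha))
    return out
```

Because the bilinear weights in each layer sum to one and the two layer weights sum to one, the eight γ values of a regular point sum exactly to α.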

Multi Resolution Alpha Blending
Since each point is written to multiple pixels and multiple points can fall into the same pixel, we collect all fragments in per-pixel lists Λ_{l,x,y}. These lists are sorted by depth and clamped to a maximum size of 16 elements. Eventually, the color C_Λ is computed using front-to-back alpha blending (Fig. 4): C_Λ = Σ_i γ_i τ_i Π_{j<i} (1 − γ_j).

Neural Network
The result produced by our renderer consists of a feature image pyramid comprising n layers. These individual layers are finally consolidated into a single full-resolution image by a compact neural network, as depicted in Fig. 2. Our network architecture incorporates a single gated convolution [YLY*19] in each layer with a self-bypass connection and a feature size of 32. Additionally, we include a bilinear upsampling operation for all layers except the final one, merging the output with the subsequent level. This configuration is shown in Fig. 5 and resembles an efficient decoder network, due to its restrained number of features, pixels, and convolutional operations.
Unlike well-established hole-filling neural networks [ASK*20, RFS22, RALB22], our approach demands a significantly smaller and more efficient network. This reduced network size stems from the fact that our renderer is adept at filling gaps autonomously and generates smooth output by trilinearly splatting points. Consequently, the network's primary task is to learn minimal hole filling and outlier removal, allowing it to concentrate its efforts on high-quality texture reconstruction.

Spherical Harmonics Module and Tone Mapping
To model view-dependent effects and camera-specific capturing parameters (such as exposure time), we optionally interpret the network output as spherical harmonics (SH) coefficients, convert them to RGB colors, and finally pass the result to a physically-based tone mapper. This allows the system to make use of explicit view directions. The SH module uses spherical harmonics of degree 2, which corresponds to 27 input coefficients (9 coefficients per color channel). These coefficients are the output of the last convolution of our network. The tone mapper follows the work of Rückert et al. [RFS22], which models exposure time, white balance, sensor response, and vignetting.
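Evaluating degree-2 real spherical harmonics for a view direction can be sketched as below; the basis constants are the standard real SH values, but the coefficient ordering (channel-major) is our assumption:

```python
def sh_basis_deg2(d):
    """The 9 real SH basis values up to degree 2 for a unit direction d."""
    x, y, z = d
    return [
        0.28209479177387814,                        # l = 0 (constant)
        0.4886025119029199 * y,                     # l = 1
        0.4886025119029199 * z,
        0.4886025119029199 * x,
        1.0925484305920792 * x * y,                 # l = 2
        1.0925484305920792 * y * z,
        0.31539156525252005 * (3.0 * z * z - 1.0),
        1.0925484305920792 * x * z,
        0.5462742152960396 * (x * x - y * y),
    ]

def sh_to_rgb(coeffs, view_dir):
    """coeffs: 27 values (9 per color channel, channel-major, assumed layout);
    view_dir: unit view direction. Returns one value per color channel."""
    basis = sh_basis_deg2(view_dir)
    return [sum(c * b for c, b in zip(coeffs[9 * ch:9 * ch + 9], basis))
            for ch in range(3)]
```

With only the constant (l = 0) coefficient set, the result is view-independent, which is why scenes without reflective materials gain little from the SH module.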

Optimization Strategy
Before novel views can be synthesized, the rendering pipeline is optimized to reproduce the input photographs. This optimization includes point positions, sizes, and features, as well as the camera model and poses, neural network weights, and tone mapper parameters. We train for 600 epochs, which, depending on scene size, requires 2-4 hours to converge.
As training criterion, we use the VGG loss [JAF16], which has been shown to provide high-quality results [RFS22]. The VGG network, however, tends to be slow to evaluate, thus increasing training times significantly compared to an MSE loss. Therefore, we use a combination of MSE and SSIM [KKLD23] in the first 50 epochs, when the advantages of VGG are still negligible. This speeds up training time by about 5%. Similar to Kerbl and Kopanas [KKLD23], we use a "warm-up" period of 20 epochs, during which we train at half image resolution. Afterwards, we randomly zoom in and out each epoch, so that all convolutions (whose weights are not shared) are trained to contribute to the final result.
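The schedule described above can be sketched as a pair of helper functions; the relative loss weights and the zoom range are illustrative assumptions, only the epoch thresholds come from the text:

```python
import random

def training_loss_weights(epoch, vgg_start=50):
    """MSE + SSIM for the first 50 epochs, VGG loss afterwards.
    The 1.0 weights are placeholders, not the paper's values."""
    if epoch < vgg_start:
        return {"mse": 1.0, "ssim": 1.0, "vgg": 0.0}
    return {"mse": 0.0, "ssim": 0.0, "vgg": 1.0}

def render_resolution_scale(epoch, warmup=20, rng=None):
    """Half resolution during the 20-epoch warm-up, then a random zoom
    factor per epoch so all (non-shared) convolutions receive gradients.
    The [0.5, 2.0] zoom range is an assumption."""
    rng = rng or random.Random(epoch)
    if epoch < warmup:
        return 0.5
    return rng.uniform(0.5, 2.0)
```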

Implementation Details
Our implementation uses PyTorch as the auto-differentiation backend; however, the trilinear renderer is implemented in custom CUDA kernels, as these commonly provide better performance [KKLD23, RFS22]. Fast spherical harmonics encodings are provided by tiny-cuda-nn [Mül21].
The renderer is implemented in three stages: collection, splatting, and accumulation. Although this diverges from other state-of-the-art multi-layer blending strategies [FHSS18, LZ21, VVP20], it turned out to work best in our scenario. We first project each point (x_w, y_w, z_w) to the desired view and collect each point's (x, y, z) as well as its point size s in a buffer, while also counting how many elements are mapped to each pixel. These counts are then used in an offset scan to index into one continuous array for all layers. The following splatting pass duplicates each point and stores a pair (z, i) (with i an index to the stored information) in each pixel's list.
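The counting and offset-scan pattern above can be sketched in NumPy; this is a serial illustration of the idea (in the CUDA kernels the count, scan, and scatter would each be parallel), with illustrative names:

```python
import numpy as np

def build_pixel_lists(pixel_ids, num_pixels):
    """Counting pass + exclusive prefix scan + scatter, as in the
    collection/splatting stages.

    pixel_ids[i] is the flat target pixel of fragment i. Returns per-pixel
    start offsets and the fragment indices grouped contiguously, so pixel
    p's list is indices[offsets[p]:offsets[p + 1]].
    """
    counts = np.bincount(pixel_ids, minlength=num_pixels)   # counting pass
    offsets = np.concatenate(([0], np.cumsum(counts)))      # exclusive scan
    cursor = offsets[:-1].copy()
    indices = np.empty(len(pixel_ids), dtype=np.int64)
    for i, p in enumerate(pixel_ids):                       # scatter pass
        indices[cursor[p]] = i
        cursor[p] += 1
    return offsets, indices
```

For four fragments hitting pixels [2, 0, 2, 1] of a 3-pixel image, pixel 2's list ends up holding fragments 0 and 2 in one contiguous slice.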
Next, a combined sorting and accumulation pass is performed. This part is performance-critical, so we opt to use only the front-most 16 elements from each (sorted) list, a common practice when blending points [LZ21]. We could not identify any loss of quality caused by this approximation, as the blending contribution of later points is very low. This limitation allows us to use GPU-friendly sorting: we repeat warp-local (32 threads), shuffle-based bitonic sorts, always replacing the latter 16 elements with new unsorted ones, until the lists are empty. For the backward pass, the sorted per-pixel lists are stored, allowing fast backpropagation. The front-to-back alpha blending (see Sec. 3.2) is done in the same pass as the sorting, because all relevant elements are already in registers.
In contrast to Kerbl and Kopanas et al. [KKLD23], we use this per-pixel sorting, which proved to be faster for us than global sorting. This is mostly due to the larger number and smaller sizes of points in our approach.
For scenes with a large deviation in point density, we found that occlusion may not be correctly evaluated by the neural network in edge cases. Therefore, we include points from coarser layers during blending (in the usual way), whose additional cost is very small (< 0.5 ms).
Point sizes are initialized with the average distance to the four nearest neighbors and are then efficiently optimized during training (see Fig. 6).
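This initialization can be sketched as follows; a brute-force nearest-neighbor search is shown for clarity (in practice a KD-tree or GPU k-NN would be used for large clouds), with illustrative names:

```python
import numpy as np

def initial_point_sizes(points, k=4):
    """Initialize each point's size as the average distance to its k
    nearest neighbors (brute force, O(n^2) memory; a KD-tree would be
    used for real point clouds)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)             # exclude the point itself
    knn = np.sort(d, axis=1)[:, :k]         # k smallest distances per point
    return knn.mean(axis=1)
```

On a uniform grid this yields roughly the grid spacing, so neighboring splats initially just touch before the optimization adjusts them.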

Evaluation
Next, we compare our approach with prior art and showcase the effectiveness of our design decisions in ablation studies.

Setup and Datasets
We have evaluated our approach on several scenes from the Tanks&Temples [KPZK17] and MipNeRF-360 [BMV*22] datasets. Additionally, we use the BOAT and OFFICE scenes from Rückert et al. [RFS22] to evaluate robustness towards difficult input conditions. The former contains outdoor auto-exposed images, while the latter is an office floor with multiple distinct rooms and a large LiDAR point cloud, but sparsely placed cameras.
From Tanks&Temples, we use the intermediate set containing eight scenes: TRAIN, PLAYGROUND, M60, LIGHTHOUSE, FAMILY, FRANCIS, HORSE, and PANTHER. These are outdoor scenes captured under varying lighting conditions but with good spatial coverage, and they can be seen as a good baseline for robustness. The MipNeRF-360 dataset [BMV*22] consists of 5 outdoor and 4 indoor scenes. This dataset was captured with controlled setups and has capture positions well suited for volumetric rendering with a hemispherical setup [KD23]. We use half resolution for the images of this dataset, resulting in resolutions of around 2500 × 1600 px for outdoor and 1550 × 1030 px for indoor scenes. For results with the resolutions used in related works (outdoor: quarter resolution; indoor: half resolution), see the Appendix, Tabs. 10-13.
Point clouds of all scenes were acquired via COLMAP's MVS [SZPF16], except OFFICE which was captured by LiDAR.
For the quantitative evaluation, we use the LPIPS (VGG) [ZIE*18], PSNR, and SSIM metrics. We note, however, that none of these metrics always reflects the visual impression. Some approaches are trained with an MSE loss or SSIM and therefore naturally perform better in PSNR and SSIM. Our approach, on the other hand, is trained with a VGG loss and thus usually shows better LPIPS scores. For a fair comparison, we recommend considering all metrics and closely inspecting the provided image and video comparisons.
In all experiments, we leave every 8th view out for testing.This is the same train/test split as used in current related work [BMV * 22, KKLD23].
On the Tanks&Temples dataset, our approach achieves on average the best LPIPS score, with an improvement of 20% over the second best. In PSNR and SSIM, the scores are on par with the state of the art. On the MipNeRF-360 dataset, we again obtain the best LPIPS score; however, the volumetric methods and Gaussian Splatting show improved PSNR and SSIM. The difference can be inspected in Fig. 7. For example, in row 3, the TRIPS rendering provides better sharpness with more details, but the MipNeRF-360 and Gaussian output is overall cleaner with less noise. On the difficult BOAT and OFFICE scenes, we show that our rendering pipeline is robust to extreme input conditions.
Individual scores per scene can be found in the Appendix in Tabs. 14-19. Video results are showcased at https://youtu.be/Nw4A1tIcErQ.

Ablation Studies
In this section, we show the effect of our design choices.

Point-Size Optimization
With our trilinear splatting technique, point sizes can be optimized to fill large holes in the scene. We show this capability in Fig. 6, where the initial point cloud exhibits a large hole in the pedestal of the horse, producing artifacts in the rendering (top row). To combat this, our pipeline efficiently moves and enlarges the points to fill the hole (bottom row), thus providing high render quality.

Point Position Optimization
To test the efficiency of our trilinear point position optimization compared to the (cheaper) approximate gradients of ADOP, we added random noise (of magnitude 0.01) to the positions of all points after training, then optimized only the point positions for 100 epochs. The result can be seen in Fig. 8. Our pipeline is able to reconstruct the correct rendering, while ADOP's result barely improves.

Table 2: View dependency on different scenes. On scenes with strong view dependency (GARDEN), adding view-dependent configurations, either via our SH network module (SH-net) or optimized per point (SH-point), increases quality; however, the per-point setup severely impacts performance. Our module gives a balanced trade-off, which also avoids over-fitting on less view-dependent scenes (PLAYGROUND).

Number of Render Layers
Due to our trilinear point rendering algorithm, increasing the number of pyramid layers has almost no negative impact on render time.
As seen in Tab. 3, using 8 layers improves quality, especially in PSNR. For reference, other approaches use 4 [RFS22] or 5 [ASK*20] layers and describe significant performance impacts when increasing the number of layers [RFS22].

View Dependency
After the neural network, we optionally use a spherical harmonics module to model view-dependent effects of the scene. This improves the rendering quality for some scenes (GARDEN), while for others it makes little to no difference (see Tab. 2). Applying the spherical harmonics before the network achieves roughly the same quality, but reduces efficiency due to additional memory overhead. On scenes without reflective materials, skipping the spherical harmonics module is thus possible.

Feature Vector Dimensions
By default, our pipeline uses four feature descriptors per point. More features only marginally increase the quality, while requiring significantly more memory and slightly increasing rendering time, as shown in Tab. 4.

Networks
In our pipeline, we use a small decoder network made of gated convolutions, as presented in Sec. 3.3. ADOP [RFS22], on the other hand, uses a four-layer U-Net with double convolutions for encoder and decoder (thus around 6 times more parameters). As seen in Tab. 5, in our pipeline our network provides similar quality to ADOP's full network while being much faster at inference. With spherical harmonics, inference times slightly increase, but the system is then able to model view dependency. Adding the SH module to the second-finest layer (ours+SH L2) instead of the finest (ours+SH) improves efficiency but weakens the results.
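The gating mechanism at the heart of a gated convolution [YLY*19] can be illustrated with a minimal NumPy sketch (1 × 1 kernels for brevity; the actual network uses spatial kernels, a self-bypass, and 32 features):

```python
import numpy as np

def gated_conv_1x1(x, w_feat, w_gate):
    """Minimal gated-convolution sketch: features are modulated by a
    learned per-pixel soft mask, letting the network suppress pixels
    that received no splat contributions.

    x: (C_in, H, W); w_feat, w_gate: (C_out, C_in) weight matrices.
    """
    feat = np.einsum('oc,chw->ohw', w_feat, x)                  # feature branch
    gate = 1.0 / (1.0 + np.exp(-np.einsum('oc,chw->ohw', w_gate, x)))  # sigmoid mask
    return feat * gate                      # gated output, shape (C_out, H, W)
```

With a zero gate weight the mask is a uniform 0.5, i.e. the layer behaves like a half-strength linear convolution; during training the gate learns to approach 0 on empty pixels and 1 on valid ones.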

Time Scaling on Number of Points
As seen in Tab. 6, TRIPS is very efficient at rendering large numbers of points. Even for our largest scene with more than 70M points, the pipeline remains real-time capable, with only 15 ms required for rasterization.

Rendering Efficiency
In Tab. 7, we evaluate training and rendering times for all examined methods. Our method trains for around 2-4 h per scene on an Nvidia A100 and renders a novel view in around 11 ms on an RTX 4090. A finer breakdown of the steps involved can be found in Tab. 8.

Outlier Robustness
As seen in Fig. 9, our approach is robust to outlier measurements, for example, people walking through the scene. Volumetric approaches like MipNeRF-360 in particular suffer from severe artifacts in this case, due to their strong view-dependent over-fitting capability.

Comparison to Prior Work with Number of Points
We have seen in previous experiments that Gaussian Splatting [KKLD23] produces blurrier results than TRIPS, which is confirmed by its weak LPIPS scores. However, it starts with fewer point primitives (the SfM reconstruction) and is thus limited in the amount of detail it can display. To this end, we conducted an experiment in which the Gaussian Splatting pipeline is provided with the dense point cloud (i.e., the same input as for our pipeline). Gaussian Splatting has a pruning mechanism to remove unwanted Gaussians; thus, after full training, only around 8M of the initial 12.5M points survived.
The results of this experiment are presented in Tab. 9. It can be seen that LPIPS improves with more Gaussians (although PSNR declines), as fine details can be reconstructed better. The qualitative comparison paints the same picture (see Fig. 10): the quality of the grass improves drastically, however finer details such as the chains can still only be reconstructed by our method. Overall, the technique cannot reach the quality and scores of TRIPS, as we can keep more points while rendering efficiently, as well as use neural descriptors to encode more detailed information. Furthermore, our approach performs more efficiently in scenarios with large point clouds: in the dense setup, TRIPS outperforms Gaussian Splatting, as the resolution-dependent computation cost of our neural network (4.5 ms at 1920 × 1080) is offset by our more efficient point rasterizer (see Tab. 8).

Limitations
In the preceding sections, we have demonstrated TRIPS' effectiveness on commonly encountered real-world datasets. Nonetheless, we have also identified potential limitations. One such limitation arises from the requirement of an initial dense reconstruction (in contrast to Gaussian Splatting), which may not be practical in certain scenarios.
Additionally, our lack of an anisotropic splat formulation can create problems: when our method is tasked with strong hole filling around elongated, slender objects (such as poles), noisy artifacts surrounding their silhouettes can be observed. An example of this is depicted in Fig. 11. In such instances, the slightly blurred edges characteristic of Gaussian Splatting are often preferable. Furthermore, even though temporal consistency has been drastically improved compared to previous point rendering approaches [ASK*20, RFS22], slight flickering can still occur in areas with too many or too few points.
Our trilinear point splatting splits points up into distinct layers and as such loses depth information. Theoretically, this could create holes in solid geometry during recombination. In practice, we could not find instances of this happening except in extreme zoom-ins far outside the training data. We believe that the per-point descriptors, the point inclusion in coarse layers, and the network-based recombination are capable of combating this issue, as reflected in the rendering quality.

Conclusion
In this paper, we presented TRIPS, a robust real-time point-based radiance field rendering pipeline. TRIPS employs an efficient strategy of rasterizing points into a screen-space image pyramid, allowing the efficient rendering of large points. It is completely differentiable, thus allowing the automatic optimization of point sizes and positions. This technique enables the rendering of highly detailed scenes and the filling of large gaps, all while maintaining a real-time frame rate on commonly available hardware.
We highlight that TRIPS achieves high rendering quality even in challenging scenarios like scenes with intricate geometry, large-scale environments, and auto-exposed footage. Moreover, due to the smooth point rendering approach, a comparably simple neural reconstruction network is sufficient, resulting in real-time rendering performance.

Figure 2 :
Figure 2: Our pipeline: TRIPS renders and blends a point cloud trilinearly as 2×2×2 splats into multi-layered feature maps, with the results being passed through our small neural network, containing only a single gated convolution per layer. Subsequently, an optional spherical harmonics module and a tone mapper are used to produce the final image. The pipeline is completely differentiable, so that point descriptors (colors) and positions, as well as camera parameters, are optimized via gradient descent.

Figure 3 :
Figure 3: Trilinear point splatting: (left) all points and their respective sizes are projected into the target image. Based on this screen-space size, each point is written to the correct layer of the image pyramid using a trilinear write (right). Large points are written to layers of lower resolution and therefore cover more space in the final image.

Figure 4 :
Figure 4: In each pixel of the image pyramid, a depth-sorted list of colors and alpha values is stored.The final color of each pixel is computed using front-to-back alpha blending on the sorted list.

Figure 5 :
Figure 5: Our design of one gated convolution block that processes the features of the image pyramid with the number of channels passed through indicated at each step.

Figure 6 :
Figure 6: The initial COLMAP reconstruction lacks points on the pedestal of the statue (top left). Our approach distributes the few present points and increases their sizes (bottom left), thus rendering them also in lower layers (middle). As a result, our pipeline avoids distracting holes (right).

Figure 8 :
Figure 8: We added noise to the converged point clouds of ADOP and our method, then restarted the optimization for positions only. Ours is able to converge back to the correct result, while ADOP fails to do so.

Figure 9 :
Figure 9: Comparison of outlier robustness on the FAMILY scene. Only our method is able to remove floating artifacts while still retaining full color precision on the sidewalk.

Figure 10 :
Figure 10: Visual results of Gaussian Splatting with COLMAP's dense point cloud as input, compared to its normal setup as well as ours, which provides the sharpest results (PLAYGROUND scene).

Figure 11 :
Figure 11: Limitation: hole filling close to the camera exhibits fuzzy edges and shine-through.

Table 1 :
Results on the Tanks&Temples and MipNeRF-360 datasets, as well as BOAT and OFFICE. See also Fig. 7 for visual comparisons.

Table 3 :
Number of resolution layers used (HORSE scene).

Table 4 :
Features per point on the PLAYGROUND scene.

Table 6 :
Efficiency of our approach regarding point cloud sizes.

Table 8 :
Breakdown of the frame time for the PLAYGROUND scene. Our method's "Rasterize" step consists of: counting and memory allocation (1.9 ms), splatting (2.6 ms), and combined sorting and blending (1.7 ms).

Table 9 :
Performance of the methods on the PLAYGROUND scene. Gaussian (dense) starts with COLMAP's dense reconstruction of 12.5M points and prunes them to 8M; Gaussian (sparse) is the original sparse setup with about 2M points. Also see Fig. 10.