Real-Time Novel-View Synthesis for the Web Using 3D Gaussian Splatting
Exploring Mesh-Supervised 3D Gaussian Scene Optimization and Efficient Web Rendering for Product Visualization

Master’s Thesis in Computer Science and Engineering

BENJAMIN SANNHOLM

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

© BENJAMIN SANNHOLM, 2024.

Supervisor: Erik Sintorn, Department of Computer Science and Engineering
Advisor: Pontus Holmertz Liljekvist, Rapid Images
Examiner: Ulf Assarsson, Department of Computer Science and Engineering

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract

This thesis explores real-time novel-view synthesis for web applications using 3D Gaussian Splatting (3DGS), with a focus on enhancing product visualization.
The study investigates two primary research questions: the impact of utilizing classical scene representations (i.e., polygonal meshes) on the optimization process and results of 3D Gaussian Splatting, and the efficient rendering of 3D Gaussians within web constraints.

Firstly, a method for initializing a 3D Gaussian scene from existing scene geometry is proposed. Evaluation across various synthetic scenes suggests that while there is noticeable quality improvement in some cases, the average improvement is marginal.

Secondly, multiple WebGPU-based rendering methods for 3D Gaussian scenes are implemented and evaluated. Results indicate that using the original 3DGS architecture on the web is viable, with a geometry-based rendering method significantly outperforming the original renderer in terms of frame-time speed-up. An optimization technique to tighten 3D Gaussian screen-space bounding boxes further enhances performance.

Overall, the findings demonstrate that 3D Gaussian Splatting can be effectively applied to real-time web-based novel-view synthesis, offering a potential avenue for interactive and high-quality product visualization.

Keywords: 3D Gaussian Splatting, novel-view synthesis, web applications, real-time rendering, mesh-supervised optimization, product visualization, computer graphics.

Acknowledgments

Thank you to Rapid Images for allowing me to be in an inspiring and motivating workplace with such friendly and helpful colleagues, for letting me freely explore my topic of interest, and for providing the technical resources that enabled my thesis work. Furthermore, I would like to express my gratitude to my academic supervisor Erik Sintorn and my company advisor Pontus Holmertz Liljekvist for your guidance, our insightful technical discussions, and your constructive feedback.
Last but far from least, thank you to my mother, father, sister, and friends for your unwavering support, thoughtful encouragement, and feedback throughout my master’s program and this concluding academic milestone.

Benjamin Sannholm, Gothenburg, 2024-07-01

Contents

List of Figures
List of Tables
1 Introduction
2 Theory
  2.1 Novel-View Synthesis
    2.1.1 Neural Radiance Fields
  2.2 3D Gaussian Splatting (3DGS)
    2.2.1 Scene Representation
    2.2.2 Optimization
    2.2.3 Differentiable CUDA-Driven Tile-Based Renderer
3 Method
  3.1 Mesh-Supervised 3D Gaussian Optimization
  3.2 Web-Based Renderer
    3.2.1 Original 3DGS Architecture in WebGPU
    3.2.2 Geometry-Based Renderer
    3.2.3 Optimization
4 Results
  4.1 Mesh-Supervised 3D Gaussian Optimization
    4.1.1 Evaluation Methodology
    4.1.2 Reconstruction Quality
  4.2 Web-Based Renderer
    4.2.1 Evaluation Methodology
    4.2.2 Image Quality
    4.2.3 Run-Time Performance
5 Discussion
  5.1 Mesh-Supervised 3D Gaussian Optimization
    5.1.1 Limitations
    5.1.2 Future Work
  5.2 Web-Based Renderer
    5.2.1 Limitations
    5.2.2 Future Work
  5.3 Risks and Ethics
Bibliography
A Additional Evaluation Results
  A.1 Web-Based Renderer
    A.1.1 Run-Time Performance for Medium Views
    A.1.2 High-Resolution Output Images

List of Figures

2.1 An illustration exemplifying the characteristics of a 3D Gaussian.
2.2 An overview of the CUDA kernels executed by 3DGS’ differentiable CUDA-driven tile-based renderer.
3.1 An overview of the logical steps performed by the geometry renderer.
3.2 A comparison of the square axis-aligned bounding box used by the 3DGS renderer and our tight bounding box used by the 3DGS-web-opt and geometry-opt renderers.
4.1 An example of two scenes where our method exhibits the greatest reconstruction quality improvement compared to 3DGS.
4.2 An example of test cases with the three proximity levels used for each camera angle during evaluation.
4.3 The worst performing test-case, with regard to similarity, for the 3DGS-web renderer in comparison to the 3DGS renderer.
4.4 An example illustrating how images of scenes with sub-pixel Gaussians exhibit a noticeable difference when using the geometry-opt renderer, in comparison with the 3DGS renderer.

List of Tables

4.1 Reconstruction quality of our mesh-supervised 3D Gaussian optimization method compared to the original 3DGS optimization method.
4.2 Image similarity of rendered images from the web-based renderers compared to the 3DGS renderer averaged over all test cases.
4.3 Run-time performance of the web-based renderers compared to the 3DGS renderer.
A.1 Run-time performance of the web-based renderers compared to the 3DGS renderer for medium views of each scene.
A.2 Image similarity of rendered images from the web-based renderers compared to the 3DGS renderer averaged over all test cases, but using output image dimensions of 2000 × 2000 pixels.
A.3 Run-time performance of the web-based renderers compared to the 3DGS renderer, but using output image dimensions of 2000 × 2000 pixels.

1 Introduction

The use of computer graphics to render and display products to consumers in a web-based environment is presently ubiquitous. However, due to requirements of high quality and therefore use of complex geometry combined with intricate materials, the rendering process is often prohibitively computationally expensive to perform at interactive, let alone real-time, rates. In cases where multiple views of the same product need to be shown, a small set of images from a few fixed angles are typically rendered and can be cycled through by the user. For a more dynamic experience, an image sequence in which the camera moves around an object could be rendered and even interactively scrubbed through using user controls. However, both approaches produce a fixed set of observable views. They cannot allow the user to interactively observe a product from any desired viewpoint among the infinite set of possible views or to smoothly transition between viewpoints. Ideally, it would be possible to observe the scene from any viewpoint in real time without compromising quality.
Novel-view synthesis is the problem of generating previously unobserved views of an existing scene given a limited set of images depicting the scene. With the advent of neural radiance fields (NeRF) [1], it was shown that neural networks (in this case, used to represent a scene in the form of a radiance field) are an effective tool for achieving novel-view synthesis. Since NeRF, it has been shown more generally that neural fields [2] and other rendering methods that use machine learning for scene optimization are effective for novel-view synthesis, either as a direct scene representation or as a way of deriving a traditional scene representation from a set of images.

Early methods using machine learning for scene optimization were fairly restrictive, slow to train, and far from performing in real time. However, much research has been conducted to bring the methods closer to being useful in a broader context. Examples include improving performance, enabling scene editing, relighting, composition, dynamic scenes, and large-scale scenes, and generalizing training to multiple scenes. Successful usage in a real-world application is demonstrated by Google’s Immersive View [3], where fly-through videos of indoor environments are synthesized using neural fields. Additionally, recent works have shown methods for training and rendering that achieve novel-view synthesis in real time, such as Nvidia’s “Instant Neural Graphics Primitives” [4] and 3D Gaussian Splatting [5].

Seeing this successful innovation with methods using machine learning for scene optimization, we choose to use the methods presented in “3D Gaussian Splatting for Real-Time Radiance Field Rendering” (3DGS) [5] to perform novel-view synthesis in the context of a web-based environment to allow a more interactive product experience for the user without significant loss of visual quality compared to original still image renders.
3D Gaussian Splatting is one of the few novel-view synthesis methods that maintain state-of-the-art quality while producing frames at real-time rates [6]. Its use of an explicit-volume scene representation in the form of 3D Gaussians allows efficient GPU-accelerated rasterization, unlike previous methods based on radiance fields, which use an implicit-volume scene representation, requiring expensive integration along camera rays.

We address two aspects of 3DGS in parallel. Firstly, in a typical novel-view synthesis scenario, nothing but a set of 2D images depicting the scene is assumed to be known. However, since in our problem domain the set of input images that novel-view synthesis will be performed on originates from a virtual 3D scene, we recognize that additional information from the original scene representation could be used to improve the result. Secondly, the renderer presented in the 3DGS paper is implemented using Nvidia’s CUDA platform, meaning it is not immediately runnable in a portable way, let alone runnable on the web platform. Successful attempts to render 3D Gaussians on the web have been made [7]–[14]. However, we have not found a clear comparison of different methods targeted specifically at the web platform.

Accordingly, we explore the following research questions:

1. Given a classical scene representation (i.e., polygonal mesh), what effect does using the existing scene geometry to inform initialization have on the optimization process and results of 3D Gaussian Splatting?
2. How can 3D Gaussians be rendered efficiently within the constraints of the web?

We propose a method for initializing a 3D Gaussian scene from existing scene geometry. The method is evaluated across a variety of small-scale synthetic scenes. Our results suggest that the method provides a noticeable quality improvement for some scenes; however, only a marginal improvement is seen on average.
Furthermore, we implement multiple WebGPU-based methods for rendering 3D Gaussian scenes. Our evaluation across multiple small-scale synthetic scenes using a variety of camera angles shows that using the architecture of the original 3DGS renderer is viable on the web. Moreover, our geometry-based rendering method mostly outperforms the original 3DGS renderer significantly, with a frame-time “speed-up” of ∼0.5×, ∼1.5×, and ∼5.1× in the worst, average, and best case, respectively.

Finally, we recognize that the size of the 3D Gaussian screen-space bounding boxes used by the original 3DGS renderer is overly large, causing an unnecessarily large workload. We introduce an augmented method providing tighter screen-space bounding boxes, achieving an average frame-time speed-up of ∼1.2–2.0× and ∼1.2–2.1× for our WebGPU adaptation of the 3DGS renderer and our geometry-based renderer, respectively. With our optimization applied, compared to the original 3DGS renderer, our 3DGS-based and geometry-based WebGPU renderers achieve an average frame-time speed-up of ∼1.0–1.3× and ∼1.0–6.3×, respectively.

In summary, our main contributions are

• a method for initializing a 3D Gaussian scene from existing scene geometry along with an evaluation of the method,
• an implementation and comparison of the 3DGS renderer’s original architecture and our geometry-based rendering architecture on the web using WebGPU,
• a performance optimization by making 3D Gaussian screen-space bounding boxes tighter.

2 Theory

The following chapter gives an overview of the topics fundamental for understanding our method and our discussion. Section 2.1 introduces the problem of novel-view synthesis, and Section 2.2 covers how the paper “3D Gaussian Splatting for Real-Time Radiance Field Rendering” [5] approaches solving this problem.
2.1 Novel-View Synthesis

The problem of novel-view synthesis can be defined as follows: Given only a set of 2D raster images of the same scene taken from arbitrary viewpoints, how can novel views consistent with the previously observed ones be synthesized?

2.1.1 Neural Radiance Fields

The paper “Representing Scenes as Neural Radiance Fields for View Synthesis” (NeRF) [1] popularized a method for novel-view synthesis from which many state-of-the-art works are derived. This method uses a neural network to encode a continuous volumetric scene representation. To create an image, a ray is marched at fixed intervals for each pixel of the output image. The neural network is queried at each sample along the ray to determine the density and outgoing radiance in the camera’s direction. These samples are accumulated to determine the final incoming radiance to the camera. To train the neural network, gradient descent is used with a loss function that quantifies the reconstruction error between the input reference images and images rendered of the NeRF scene.

2.2 3D Gaussian Splatting (3DGS)

Following the ideas and success of NeRF [1] and its many derivatives, the paper “3D Gaussian Splatting for Real-Time Radiance Field Rendering” [5] (3DGS) proposes a more direct and simplified method to achieve novel-view synthesis. The key pieces of the 3DGS method are a novel explicit continuous volumetric scene representation based on 3D Gaussians, an optimization method for fitting 3D Gaussian properties to match input views, and a differentiable CUDA-driven tile-based renderer for efficient rasterization of 3D Gaussians. The following sections describe these three pieces in more detail.

2.2.1 Scene Representation

As opposed to previous approaches to novel-view synthesis or scene reconstruction, such as NeRF [1] or photogrammetry, which typically use an explicit surface (e.g., triangular meshes), an implicit surface (e.g., signed distance fields), or an implicit volume (e.g., neural fields) for scene representation, 3DGS instead uses an explicit volume representation to describe a scene [5]. While the smallest primitive for a triangular mesh scene representation is a triangle, the primitive of a 3DGS scene is a 3D Gaussian.

Figure 2.1: An illustration exemplifying the characteristics of a 3D Gaussian. (Labels in the figure: center µ, axes x, y, and z, and the “Cut-Off Boundary”.)

In this context, as illustrated in Figure 2.1, a 3D Gaussian can be thought of as an ellipsoid with an arbitrary position, potentially non-uniform scale, and rotation in 3D space [5]. Additionally, the ellipsoid is not solid; rather, it can be seen as being filled with gas where the center is the most dense, and the density decreases toward the ellipsoid’s surface. More specifically, the fall-off is proportional to a Gaussian function.

A Gaussian function is continuous and differentiable over its whole domain [5]. Therefore, a scene constructed using 3D Gaussians is a continuous and differentiable function describing volume density. This differentiability makes 3D Gaussians a good candidate for use with gradient descent techniques to optimize the scene toward some optimal configuration, which is further discussed in Section 2.2.2.

However, it should be noted that a Gaussian function only approaches zero in the limit toward positive and negative infinity [5]. In practice, this unlimited extent poses a problem for drawing 3D Gaussians since, in theory, all Gaussians contribute a non-zero value to every point in space and, therefore, every pixel. This problem is solved by bounding the Gaussian function at three standard deviations from its center, seen in Figure 2.1 as the “Cut-Off Boundary”.
Using this bound, less than 1% of the Gaussian’s contribution is lost [15], and the 3D Gaussians can be treated as ellipsoids.

Additionally, to facilitate modeling view-dependent appearance of objects in the scene, the color of a 3D Gaussian for a given viewing direction is described by spherical harmonic functions of degree 0 to 3 [5]. This description requires 48 real-valued spherical harmonic coefficients to be stored per 3D Gaussian.

To render an image from a scene constructed out of 3D Gaussians as volume elements, the density function for the scene needs to be integrated along each camera ray to determine the total light transmission through the volume [5]. Since each 3D Gaussian is artificially bounded, assuming no 3D Gaussians overlap, each Gaussian can be integrated separately in order along the ray. However, for efficiency, 3DGS instead approximates the transmission through a single 3D Gaussian as a screen-space 2D Gaussian. The practical details of this approximation are described in Section 2.2.3.

Formally, the configuration of a 3D Gaussian, centered at µ with rotation q, scale s, spherical harmonic coefficients SH, and opacity at the center α, can be defined as

\[ g = (\mu, s, q, SH, \alpha), \tag{2.1} \]

where $\mu \in \mathbb{R}^3$, $s \in \mathbb{R}^3$, q is a unit quaternion, $SH \in \mathbb{R}^{48}$, and $\alpha \in [0, 1]$. The scale and rotation of a 3D Gaussian can also be jointly described by a matrix $\Sigma_{3D} = R S S^T R^T$, where S and R are the corresponding scale and rotation matrices derived from s and q, respectively [5].

Furthermore, the density $D : \mathbb{R}^n \to [0, 1]$ at point $x \in \mathbb{R}^n$ for a Gaussian centered at $\mu \in \mathbb{R}^n$ with transformation matrix $\Sigma \in \mathbb{R}^{n \times n}$ and center opacity α is defined as

\[ D_\Sigma(x) = \alpha \cdot G_\Sigma(x), \tag{2.2} \]

where G is the Gaussian function for the Gaussian. $G : \mathbb{R}^n \to (0, 1]$ is defined as

\[ G_\Sigma(x) = e^{-\frac{1}{2} v^T \Sigma^{-1} v}, \tag{2.3} \]

where $v = x - \mu$.
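To make the density definition above concrete, the following is a minimal Python sketch that evaluates the density of a single 3D Gaussian at a point. It is an illustration under stated assumptions rather than code from 3DGS: the function names `quat_to_rotation` and `density` are hypothetical, and the quaternion is assumed to be stored in (w, x, y, z) order. Because the covariance is built as R S Sᵀ Rᵀ, the sketch rotates the offset v into the Gaussian's local frame and divides by the scale instead of inverting a matrix.

```python
import math

def quat_to_rotation(q):
    """Convert a unit quaternion, assumed to be in (w, x, y, z) order, to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def density(x, mu, s, q, alpha):
    """Evaluate D(x) = alpha * exp(-0.5 * v^T Sigma^-1 v) with v = x - mu."""
    R = quat_to_rotation(q)
    v = [x[i] - mu[i] for i in range(3)]
    # Sigma = R S S^T R^T implies Sigma^-1 = R S^-2 R^T, so the quadratic form
    # equals the squared length of (R^T v) after dividing each axis by its scale.
    local = [sum(R[r][c] * v[r] for r in range(3)) for c in range(3)]
    mahalanobis_sq = sum((local[c] / s[c]) ** 2 for c in range(3))
    return alpha * math.exp(-0.5 * mahalanobis_sq)
```

At the center the function returns the center opacity α, and at three standard deviations along any principal axis it returns α · e^(-4.5), which is where the cut-off boundary described above is placed.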
2.2.2 Optimization

The second key piece of the 3DGS method is its optimization process, which is responsible for configuring a set of 3D Gaussians such that they together resemble the shape and appearance of the scene observable in a given set of input images V [5]. The process is performed in two stages: the initialization stage and the iterative refinement stage.

Initialization. The optimization process begins by creating an initial set of Gaussians $S_0$, roughly representing the shape of the scene [5]. The positions of the Gaussians in this set are initialized in one of two ways:

1. For “Real-World Scenes”, such as found in the Mip-NeRF 360 [16] paper, the Tanks and Temples dataset [17], and the Deep Blending paper [18], COLMAP [19] (a widely used Structure-from-Motion library for estimating camera parameters and producing a 3D point cloud from a set of 2D images) is run with the images of V as input. COLMAP produces a sparse point cloud with points in places where common features are found in the views of the input set. A single Gaussian is placed at every point in the point cloud.
2. For “Synthetic Bounded Scenes”, such as in NeRF’s Realistic Synthetic 360° dataset [1], N positions are uniformly sampled in a cuboid with fixed dimensions covering the contents of all scenes in the dataset. A single Gaussian is placed at each sampled position.

No rotation is initially applied, and each Gaussian is uniformly scaled proportional to the mean distance to its three closest neighbors [5]. Each Gaussian’s first three spherical harmonic coefficients are randomly selected uniformly within valid ranges, and the rest are set to zero, making each Gaussian initially have the same color no matter which side it is observed from. Meanwhile, opacity is set to 0.1.

Finally, for later parts of the optimization process the camera parameters (e.g., position, orientation, and field-of-view) for each input view need to be known [5].
These parameters are determined in two different ways for the previously mentioned types of datasets. For the “Real-World Scenes” where the images typically come from real-world pictures, the camera parameters are usually not known ahead of time. However, as part of the COLMAP process, the camera parameters for each input view are estimated and used as is for this case. Meanwhile, for the Realistic Synthetic 360° dataset where the images are produced from Blender scenes, the camera parameters for each view are known ahead of time and can be used directly.

Iterative Refinement. The initial set of Gaussians ($S_0$) is a rough approximation of the scene observable in the input views and will typically not resemble it very well. To bring the set of Gaussians closer to a configuration that matches the scene observable in the input views, 3DGS uses an iterative approach based on stochastic gradient descent.

The iterative refinement process takes the current set of Gaussians $S_{i-1}$ and produces an augmented set of Gaussians $S_i$ at each step, where i is the number of iterations performed so far. For each iteration, a view is randomly selected from the set of input views V. Let $I_i$ be the input image, $D_i$ be the dimensions of the input image, and $C_i$ be the camera parameters for the selected view at iteration i. The set of Gaussians is then rendered using 3DGS’s differentiable CUDA-driven renderer (further detailed in Section 2.2.3) with dimensions $D_i$, Gaussians $S_{i-1}$, and camera parameters $C_i$ as input, producing an image $R_i$. The optimization process aims to minimize the difference between the rendered image and the ground truth image.
To quantify how much the images differ for the current view, 3DGS uses the following loss function:

\[ L(R_i, I_i) = 0.8 \cdot L_1(R_i, I_i) + 0.2 \cdot L_{\text{D-SSIM}}(R_i, I_i), \tag{2.4} \]

where $L_1$ is the widely known mean absolute error metric (computed per channel of each pixel), and $L_{\text{D-SSIM}}(X, Y) = 1 - \text{SSIM}(X, Y)$, with SSIM being the commonly used objective image similarity metric Structural Similarity Index [20].

Furthermore, in typical gradient descent fashion, the loss function is used to determine how the set of Gaussians should be augmented to approach a configuration where the ground truth image and the rendered image are as similar as possible. Since, as previously mentioned, the scene representation and the renderer are differentiable, the resulting colors of each pixel in the rendered image can be differentiated with respect to any input parameter. More importantly, the loss function L can be differentiated with respect to any Gaussian parameter. For example, if the partial derivative of L with respect to $\mu_x$ for some Gaussian is a positive number, this indicates that the x-component of the Gaussian’s position should be decreased for the loss function to decrease, which in turn, on average, makes the ground truth and the rendered image more similar.

Using the derivatives of the loss function (collectively called the function’s gradient), the update step producing a new set of Gaussians with augmented parameters can approximately be described by the following oversimplified relation:

\[ S_i = \{\, g - \lambda \nabla_g L(R_i, I_i) \mid g \in S_{i-1} \,\}, \tag{2.5} \]

where $i > 0$, λ is the step size (controlling how much the parameters change each step), and $\nabla_g L$ is the gradient of L with respect to the parameters of Gaussian g.

In reality, however, 3DGS uses the Adam optimizer [21], which employs a slightly more sophisticated approach to gradient descent optimization. Additionally, 3DGS uses separate step sizes for different Gaussian parameters, and exponential decay is used for the step size of Gaussian positions.
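The simplified update of Equation (2.5) can be illustrated with a toy example. The sketch below is not the 3DGS training loop: the rendering loss is replaced by a hypothetical stand-in (`toy_loss`, the squared distance between a Gaussian position and a made-up target), plain gradient descent is used instead of Adam, and all names are illustrative assumptions. It only shows the sign convention discussed above, where a positive partial derivative leads to a decrease of the corresponding parameter.

```python
def gradient_step(params, grad_fn, step_size):
    """One update in the spirit of Eq. (2.5): move each parameter against its gradient."""
    grads = grad_fn(params)
    return [p - step_size * g for p, g in zip(params, grads)]

# Hypothetical stand-in for the rendering loss: squared distance to a target position.
TARGET = [1.0, -2.0, 0.5]

def toy_loss(mu):
    return sum((m - t) ** 2 for m, t in zip(mu, TARGET))

def toy_loss_grad(mu):
    # Analytic gradient of toy_loss with respect to each position component.
    return [2.0 * (m - t) for m, t in zip(mu, TARGET)]

mu = [3.0, 0.0, 0.0]
for _ in range(100):
    mu = gradient_step(mu, toy_loss_grad, step_size=0.1)
```

Starting at x = 3.0 with a target of 1.0, the partial derivative with respect to the x-component is positive, so the first step lowers it, and repeated steps bring the position close to the target.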
Furthermore, some parameters are fixed until a configurable number of iterations have passed. Finally, not all Gaussian parameters are directly optimized as is. Rather, opacity and scale use a sigmoid and an exponential activation function, respectively, to facilitate easier optimization. For complete details regarding the optimization schedule, see the 3DGS paper [5].

The 3DGS optimization process additionally adapts to scenes of varying complexity and ensures not too many Gaussians are created by introducing a densification and pruning scheme. After a warm-up period, at regular intervals of 100 iterations, a densification step is performed to create new Gaussians in areas where there are too few Gaussians to reconstruct the scene well, and Gaussians that have a low α or are overly large in world- or view-space are pruned. The densification step is performed for Gaussians with large view-space position gradients (indicating a lack of Gaussians nearby). Such a Gaussian is either cloned, with the new Gaussian moved in the direction of the view-space position gradient, or split, with the original and the new Gaussian both scaled down and moved to cover the same area as the original Gaussian occupied. Finally, to ensure that only Gaussians with meaningful contribution are kept, every 3000th iteration, the α of all Gaussians is lowered to a small value. The optimization will then increase the α of Gaussians that are needed and, as previously described, the Gaussians whose α stays low will be pruned.

2.2.3 Differentiable CUDA-Driven Tile-Based Renderer

The third and final key contribution of 3DGS is a method for efficiently and differentiably producing an image depicting a set of 3D Gaussians. The 3DGS paper proposes a tile-based rasterizer that projects 3D Gaussians into screen space and draws them using differentiable operations.
To allow for GPU acceleration and ease of integration with PyTorch, which is used for the optimization process, the rasterizer is implemented as PyTorch modules using C++ and CUDA kernels. This method will hereafter be referred to as the 3DGS renderer.

Figure 2.2: An overview of the CUDA kernels executed by 3DGS’ differentiable CUDA-driven tile-based renderer. (Stages shown in the figure: Pre-Process 3D Gaussians, Allocate 2D Gaussian Instances, Create 2D Gaussian Instances, Sort 2D Gaussian Instances, Identify Tile Ranges, Accumulate 2D Gaussians.)

The inputs to the renderer are a set of 3D Gaussians S and camera parameters C. Let $S = \{g_0, g_1, \ldots, g_{N-1}\}$, where $g_i$ is a 3D Gaussian with ID i and N is the total number of 3D Gaussians. As seen in Figure 2.2, broadly, the rasterization is performed by projecting each 3D Gaussian into a screen-space 2D Gaussian using the given camera parameters and calculating view-dependent properties, such as color. Additionally, the 2D Gaussians are instantiated into tiles in a screen-space grid. The 2D Gaussian instances are thereafter grouped per tile and sorted by depth to allow for efficient front-to-back access per tile during drawing. Finally, tile-by-tile and pixel-by-pixel, the 2D Gaussians assigned to the current pixel’s corresponding grid tile are accumulated using alpha blending. The following sections detail how these logical operations are implemented in practice as consecutive executions of CUDA kernels.

Pre-Process 3D Gaussians. Rendering a given set of Gaussians S begins by executing a kernel with thread groups of dimensions 256 × 1 × 1 where each thread handles one 3D Gaussian. The 3D Gaussian’s position µ is transformed into camera space and then projected into continuous pixel coordinates $\mu_{2D}$ using the camera’s world-to-camera and camera-to-clip matrices. With the Gaussian’s camera-space coordinates now known, the Gaussian is culled if its center is behind the camera’s near plane.
Subsequently, the 3D Gaussian’s transformation matrix $\Sigma_{3D}$, jointly describing its rotation and scale in world space, must be transformed into screen space to describe the rotation and scale of a corresponding 2D Gaussian in pixel coordinates. To ensure the resulting transformation is affine, 3DGS combines the already affine world-to-camera transform W with a local affine approximation of the camera’s projective transformation, derived from the first two terms of the perspective transformation’s Taylor series, as proposed by Zwicker et al. [22]. The resulting transformation matrix for a 2D Gaussian becomes

\[ \Sigma_{2D} = J W \Sigma_{3D} W^T J^T, \tag{2.6} \]

where J is the Jacobian matrix of the local affine approximation.

With the Gaussian’s screen-space position and transform determined, the renderer can now calculate the 2D Gaussian’s opacity for any pixel in the image using function $D_{\Sigma_{2D}}$, described in Equation (2.2). However, it would be wasteful to iterate through all the scene’s Gaussians for every pixel. Therefore, a uniform screen-space grid with tiles occupying 16 × 16 pixels is introduced. To know which tiles the current 2D Gaussian overlaps, the kernel additionally determines the Gaussian’s screen-space extents. The extents are found by constructing a square axis-aligned bounding box centered on the Gaussian’s center. The width and height of the box are set to double of three standard deviations of $\Sigma_{2D}$’s largest eigenvalue, where the largest eigenvalue corresponds to the scale of the Gaussian’s longest axis, as follows:

\[ E_{\text{square}}(\Sigma_{2D}) = \begin{bmatrix} s & s \end{bmatrix}^T, \quad s = 2 \left\lceil 3 \sqrt{\max(\lambda_1, \lambda_2)} \right\rceil, \tag{2.7} \]

where $\lambda_1$ and $\lambda_2$ are the two eigenvalues of $\Sigma_{2D}$.

Finally, using the current 3D Gaussian’s spherical harmonic coefficients, its view-dependent color is calculated using the direction from the camera to the Gaussian’s center.
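As a rough illustration of the square extent in Equation (2.7), the sketch below computes the eigenvalues of a symmetric 2 × 2 matrix in closed form and derives the side length of the bounding box. The helper `overlapped_tiles` then counts how many 16 × 16 tiles such a box touches; its rounding and clamping rules are an assumption made for this example, since the exact rule used by the 3DGS renderer is not specified in the text. All function names are hypothetical.

```python
import math

TILE_SIZE = 16  # tile side length in pixels, as stated in the text

def square_extent(cov2d):
    """Side length s of the square box from Eq. (2.7), given Sigma_2D = [[a, b], [b, c]] as (a, b, c)."""
    a, b, c = cov2d
    mid = 0.5 * (a + c)
    det = a * c - b * b
    # Eigenvalues of a symmetric 2x2 matrix are mid +/- sqrt(mid^2 - det).
    root = math.sqrt(max(mid * mid - det, 0.0))
    lambda_max = mid + root
    return 2 * math.ceil(3.0 * math.sqrt(lambda_max))

def overlapped_tiles(center, side, width, height):
    """Count grid tiles touched by the box centered at `center` (assumed clamping to the image grid)."""
    half = side / 2
    tiles_x = math.ceil(width / TILE_SIZE)
    tiles_y = math.ceil(height / TILE_SIZE)
    x0 = min(max(math.floor((center[0] - half) / TILE_SIZE), 0), tiles_x)
    x1 = min(max(math.ceil((center[0] + half) / TILE_SIZE), 0), tiles_x)
    y0 = min(max(math.floor((center[1] - half) / TILE_SIZE), 0), tiles_y)
    y1 = min(max(math.ceil((center[1] + half) / TILE_SIZE), 0), tiles_y)
    return (x1 - x0) * (y1 - y0)
```

Note that a Gaussian with variances 4 and 1 yields the same side length of 12 pixels whether it is axis-aligned or rotated by 45 degrees, because only the largest eigenvalue enters the formula. This is why the square box can be much larger than the Gaussian's actual footprint for elongated Gaussians.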
Once this kernel has finished, screen-space position $\mu_{2D}$, camera-space depth d, inverse of the 2D transformation matrix $\Sigma_{2D}^{-1}$, number of tiles overlapped o, and view-dependent color c will have been stored, in global memory, in individual buffers with one slot for each Gaussian.

Allocate 2D Gaussian Instances. For each tile a 2D Gaussian overlaps, the renderer will create a 2D Gaussian instance stored in a single contiguous buffer. To determine how many instance slots should be allocated per 2D Gaussian, the previously written “number of tiles overlapped” buffer is used. Let this buffer be $B_o = [o_0, o_1, \ldots, o_{N-1}]$, where $o_i$ is the number of tiles overlapped by the Gaussian with ID i. Furthermore, to create the 2D Gaussian instances, it also needs to be known at what offset in the instances buffer each 2D Gaussian’s instances should be written. This is determined using an inclusive prefix sum over $B_o$, resulting in a buffer $B_{\text{offset}} = [\text{offset}_0, \text{offset}_1, \ldots, \text{offset}_{N-1}]$, where

\[ \text{offset}_k = \sum_{i=0}^{k} o_i. \]

In practice, the prefix sum is performed efficiently in parallel using “Single-pass Parallel Prefix Scan with Decoupled Look-back” [23] through Nvidia’s CUB library [24], [25]. Furthermore, the prefix sum is performed in place, so no additional memory is allocated for buffer $B_{\text{offset}}$.

Create 2D Gaussian Instances. With the offsets and total number of instances now known, two buffers, one for keys and one for values, are dynamically allocated at run-time with a capacity equal to the total number of instances. After that, the 2D Gaussian instances are created by executing a kernel with thread groups of dimensions 256 × 1 × 1 where each thread handles one 2D Gaussian. If the current 2D Gaussian overlaps no tiles, as indicated by the offsets buffer, no instances are created and the 2D Gaussian is effectively culled from all future steps. In any other case, for every overlapped tile a 64-bit integer sorting key and value are separately written into the two buffers.
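The allocation and creation steps can be sketched sequentially in Python as follows. This is an illustrative CPU version under stated assumptions, not the CUDA implementation: the parallel scan is replaced by `itertools.accumulate`, the function names and the `tile_lists` input (the tile IDs each Gaussian overlaps) are hypothetical, and the key layout used here places the tile ID in the upper 32 bits and the bit pattern of the 32-bit floating-point depth in the lower 32 bits, as this section describes for the sorting keys.

```python
import struct
from itertools import accumulate

def depth_bits(depth):
    """Reinterpret a 32-bit float as an unsigned 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", depth))[0]

def create_instances(tile_lists, depths):
    """Build key and value buffers from per-Gaussian tile lists and camera-space depths."""
    counts = [len(tiles) for tiles in tile_lists]
    offsets = list(accumulate(counts))  # inclusive prefix sum over the counts
    total = offsets[-1] if offsets else 0
    keys = [0] * total
    values = [0] * total
    for gid, tiles in enumerate(tile_lists):
        # With an inclusive scan, Gaussian gid writes to [offsets[gid - 1], offsets[gid]).
        start = 0 if gid == 0 else offsets[gid - 1]
        for j, tile_id in enumerate(tiles):
            keys[start + j] = (tile_id << 32) | depth_bits(depths[gid])
            values[start + j] = gid
    return offsets, keys, values
```

For non-negative depths, the integer bit pattern of a float preserves ordering, so sorting the combined keys groups instances by tile and orders them by depth within each tile. A Gaussian that overlaps no tiles contributes a count of zero and therefore writes nothing.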
The value is merely the ID of the Gaussian. Meanwhile, the key is a 64-bit integer where the 32 most significant bits equal the tile ID of the overlapped tile and the 32 least significant bits equal the 32-bit floating-point camera-space depth $d$ of the current 2D Gaussian.

Sort 2D Gaussian Instances

Next, the 2D Gaussian instance key and value buffers are sorted in ascending order with respect to the values in the key buffer. This sorting step is efficiently performed using Nvidia’s parallel radix sort Onesweep [26] through Nvidia’s CUB library [24], [27]. Since the keys are composed of the tile ID and depth of each instance, the instances will be arranged such that all instances belonging to the same tile are placed at consecutive indices, and within each group, the instances will be ordered from lowest to highest camera-space depth. Using this scheme, the relevant 2D Gaussians for each tile can efficiently be fetched in front-to-back order.

Identify Tile Ranges

Before the 2D Gaussians in each tile can be accumulated, it needs to be determined how many instances have been assigned to each tile and where each range of instances is located in the sorted instance buffers. To determine the ranges, a kernel is executed with thread groups of dimensions 256 × 1 × 1, where each thread handles one 2D Gaussian instance. Let $i$ be the index of the current instance and $t_i$ be the tile ID of the current instance. Collaboratively, the threads fill a buffer $B_{ranges} = [(rs_0, re_0), (rs_1, re_1), \ldots, (rs_{T-1}, re_{T-1})]$, where $T$ is the total number of screen-space tiles, and $rs_k$ and $re_k$ are the inclusive start and exclusive end indices in the instances buffers for the tile with ID $k$, respectively. The buffer is initially cleared to all zeroes. To fill the buffer, for each 2D Gaussian instance there are four cases to consider. Either the instance is the first overall, i.e., $i = 0$, or the instance is the first in a new tile, i.e., $i \geq 1$ and $t_i \neq t_{i-1}$.
In the first case, the start index of the current instance’s tile $rs_{t_i}$ is set to 0. In the second case, the start index of the new tile $rs_{t_i}$ and the end index of the previous tile $re_{t_{i-1}}$ are both set to $i$. Thirdly, if the instance is the last overall, i.e., $i = N - 1$ where $N$ is the total number of instances, the end index of the current instance’s tile $re_{t_i}$ is set to $N$. Finally, if none of the previous conditions hold, meaning the instance is not at the boundary of any tile’s instance range, the current thread writes no value.

Accumulate 2D Gaussians

Finally, to calculate and write the color of each pixel in the output image, a kernel is executed where each thread group handles one tile and each thread corresponds to a pixel in that tile. The thread groups have dimensions 16 × 16 × 1 to match the size of the grid tiles. Let $t$ be the ID of the current thread group’s corresponding tile. Firstly, the 2D Gaussian instance range $(rs_t, re_t)$ for the current tile is fetched. Let $N = re_t - rs_t$ be the number of 2D Gaussians in the current tile. Thereafter, all threads in the thread group alternate between two roles: fetching 2D Gaussians and accumulating 2D Gaussians. All threads synchronize using a barrier to wait for all other threads to finish their current work before switching roles. This process is repeated either until all $N$ Gaussians have been accumulated or until all threads in the thread group report themselves as having finished early. A thread can finish early if, for its corresponding pixel, the accumulated transmission of all Gaussians processed so far goes below a low threshold (0.0001). In this case, since the 2D Gaussians are processed from front to back, accumulating the color of more Gaussians would make no meaningful difference to the pixel’s color.
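The instance allocation, creation, sorting, and tile-range identification phases described above can be summarized in a small sequential sketch. It is a CPU stand-in under stated assumptions: `sorted` replaces the GPU radix sort, a Python loop replaces the one-thread-per-instance kernels, depths are assumed non-negative so that their bit patterns sort correctly, and all function names are illustrative rather than taken from the 3DGS implementation.

```python
import struct
from itertools import accumulate

def depth_bits(d):
    """Reinterpret a non-negative float32 depth as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", d))[0]

def build_instances(overlapped_tiles, depths):
    """overlapped_tiles[g] lists the tile IDs overlapped by Gaussian g."""
    counts = [len(tiles) for tiles in overlapped_tiles]   # buffer B_o
    offsets = list(accumulate(counts))                    # inclusive prefix sum
    total = offsets[-1] if offsets else 0
    keys, values = [0] * total, [0] * total
    for gid, tiles in enumerate(overlapped_tiles):
        start = offsets[gid - 1] if gid > 0 else 0        # first slot of this Gaussian
        for n, tile in enumerate(tiles):
            keys[start + n] = (tile << 32) | depth_bits(depths[gid])
            values[start + n] = gid
    return keys, values

def sort_instances(keys, values):
    """Sort both buffers by key (stand-in for the parallel radix sort)."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in order], [values[i] for i in order]

def identify_tile_ranges(sorted_keys, num_tiles):
    """Fill B_ranges with (inclusive start, exclusive end) per tile."""
    ranges = [[0, 0] for _ in range(num_tiles)]
    n = len(sorted_keys)
    for i, key in enumerate(sorted_keys):                 # one thread per instance on the GPU
        tile = key >> 32
        if i == 0:
            ranges[tile][0] = 0
        else:
            prev_tile = sorted_keys[i - 1] >> 32
            if tile != prev_tile:
                ranges[prev_tile][1] = i
                ranges[tile][0] = i
        if i == n - 1:
            ranges[tile][1] = n
    return [tuple(r) for r in ranges]
```

Note that the boundary cases are not mutually exclusive: an instance can be both the first of a new tile and the last instance overall, which is why the last check is performed independently of the others.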
In the role of fetching 2D Gaussians, all threads in the thread group collaboratively fetch, at most, the 256 frontmost remaining 2D Gaussian instances and the properties of their corresponding 2D Gaussians by loading one each into thread group shared memory. If there are fewer than 256 Gaussian instances left, a subset of the group’s 256 threads stays idle. The properties loaded into shared memory are the Gaussian’s ID, screen-space position $\mu_{2D}$, inverse of the 2D transformation matrix $\Sigma_{2D}^{-1}$, and opacity $\alpha$. The view-dependent color $c$ is loaded on demand by each individual thread during accumulation and not ahead of time.

In the role of accumulating 2D Gaussians, each thread in the thread group handles one pixel. Let $x_p$ be the center of the pixel. Each thread keeps track of its corresponding pixel’s current color and transmission. If not already marked as having finished early, the thread accumulates the contribution of the currently loaded chunk of 2D Gaussians using a standard alpha-blending model such that

$$C_0 = 0, \qquad C_i = C_{i-1} + T_{i-1} \alpha_i c_i, \tag{2.8}$$

$$T_0 = 1, \qquad T_i = T_{i-1} \cdot (1 - \alpha_i), \tag{2.9}$$

where $C_i$ and $T_i$ are the pixel’s color and transmission, respectively, after the $i$ frontmost Gaussians have been accumulated. Here $\alpha_i = D_{\Sigma_{2D}^i}(x_p)$, where $\Sigma_{2D}^i$ is the transformation matrix and $c_i$ is the view-dependent color of the $i$th 2D Gaussian in the current tile. Once a thread has finished loading and accumulating Gaussians, it writes the final color $C_N$ to its corresponding pixel’s location in the output image.

3 Method

The following chapter introduces and details our approach to answering the previously presented research questions. Sections 3.1 and 3.2 correspond to our methods for the first and second research questions, respectively.

3.1 Mesh-Supervised 3D Gaussian Optimization

To approach answering the first research question, 3D Gaussian Splatting is augmented with an extension to supervise the optimization process using the original geometry of a scene.
Our method creates an initial set of 3D Gaussians with parameters such that the Gaussians approximate the original geometry at the start of the optimization process. The method is implemented directly on top of the PyTorch and CUDA implementation [28] provided by the authors of the “3D Gaussian Splatting for Real-Time Radiance Field Rendering” [5] paper.

As described in Section 2.2.2, 3DGS begins the optimization process with an initialization stage in which an initial set of $N$ 3D Gaussians $S_0$ is created. Let $S_0 = \{g_0, g_1, \ldots, g_{N-1}\}$ where $g_i = (\mu_i, s_i, q_i, SH_i, \alpha_i)$ is the configuration of Gaussian $i$. Our method changes how the position $\mu_i$, scale $s_i$, and orientation $q_i$ are initialized. All other parts of the initialization stage are unmodified. The only additional input to our method is a triangular mesh whose coordinates are assumed to be in the same coordinate space as the given cameras.

To initialize the set of 3D Gaussians, for each 3D Gaussian, a point is first sampled uniformly at random on the surface of the mesh. Let $p$ be the sampled point. The sampled point is used as the position of the Gaussian, i.e., $\mu = p$.

Secondly, the Gaussian is oriented such that its major and minor axes are aligned with the major and minor axes of the face the point was sampled on. Let $v_1$, $v_2$, and $v_3$ be the positions of the face’s three vertices and let $V = \begin{bmatrix} v_1 - \bar{v} & v_2 - \bar{v} & v_3 - \bar{v} \end{bmatrix}^T$, where $\bar{v}$ is the mean of the three vertices. The major axis $x$ and minor axis $y$ in the plane defined by the three vertices are derived from the eigenvectors of the covariance matrix of $V$, in standard Principal Component Analysis (PCA) fashion [29]. To describe the final orientation of the 3D Gaussian, a local-to-world rotation matrix $R$ is constructed as follows:

$$R = \begin{bmatrix} x & y & n \end{bmatrix}, \tag{3.1}$$

where $n = x \times y$ is the normal of the face. Furthermore, the rotation matrix is converted into a quaternion which is used for the 3D Gaussian’s $q$ property.
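The position and orientation steps above can be sketched as follows. This is a minimal illustration and not the thesis implementation: it assumes that uniform surface sampling is done by picking a face with probability proportional to its area and then sampling uniformly inside the triangle, and it performs the PCA in a 2D basis of the face plane (which yields the same in-plane axes as the eigenvectors of the covariance matrix of $V$ for a non-degenerate triangle). All function names are hypothetical.

```python
import math
import random

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def triangle_area(v1, v2, v3):
    c = cross(sub(v2, v1), sub(v3, v1))
    return 0.5 * math.sqrt(dot(c, c))

def sample_surface_point(faces, rng):
    """Pick a face by area, then a uniform point inside it (assumed scheme)."""
    areas = [triangle_area(*f) for f in faces]
    face_index = rng.choices(range(len(faces)), weights=areas)[0]
    v1, v2, v3 = faces[face_index]
    s, r = math.sqrt(rng.random()), rng.random()
    w1, w2, w3 = 1.0 - s, s * (1.0 - r), s * r      # uniform barycentric weights
    p = tuple(w1 * a + w2 * b + w3 * c for a, b, c in zip(v1, v2, v3))
    return face_index, p

def face_axes(v1, v2, v3):
    """Return major axis x, minor axis y, and normal n = x × y of a face."""
    mean = scale(add(add(v1, v2), v3), 1.0 / 3.0)
    centered = [sub(v, mean) for v in (v1, v2, v3)]
    u = normalize(sub(v2, v1))                       # in-plane basis vector
    normal = normalize(cross(sub(v2, v1), sub(v3, v1)))
    w = cross(normal, u)                             # second in-plane basis vector
    pts = [(dot(c, u), dot(c, w)) for c in centered]
    sxx = sum(px * px for px, _ in pts) / 3.0        # 2x2 covariance entries
    sxy = sum(px * py for px, py in pts) / 3.0
    syy = sum(py * py for _, py in pts) / 3.0
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)   # angle of the major eigenvector
    x = add(scale(u, math.cos(theta)), scale(w, math.sin(theta)))
    y = add(scale(u, -math.sin(theta)), scale(w, math.cos(theta)))
    return x, y, cross(x, y)
```

The three returned vectors form the columns of $R$ in Equation (3.1); converting $R$ to a quaternion is omitted here.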
Finally, the scale of the 3D Gaussian is set such that it roughly covers the same area as the original face it is reconstructing. To determine how much the Gaussian should be scaled along its major and minor axes, the eigenvalues corresponding to the previously mentioned eigenvectors are used. Let $\lambda_x$ and $\lambda_y$ be the eigenvalues of the major and the minor axes, respectively. However, it has to be taken into account that many 3D Gaussians might be randomly placed on the same face. In this case, Gaussians that are not placed near the middle of the face are likely to extend outside the face if scaled to a size similar to the whole face. Therefore, all Gaussians whose positions ended up on the same face are scaled down in proportion to the total number of Gaussians on that face. Let $k$ be the total number of Gaussians placed on the same face as the current one. The final scale of the 3D Gaussian is then

$$s = \begin{bmatrix} \dfrac{\lambda_x}{k} & \dfrac{\lambda_y}{k} & \epsilon \end{bmatrix}, \tag{3.2}$$

where $\epsilon$ is a small value larger than 0 that keeps the 3D Gaussian flat. Note that a value of exactly 0 cannot be used: because the scale is optimized in logarithmic space (the stored parameter is the logarithm of the scale), a scale of 0 would result in the parameter used in the optimization process being $-\infty$, which cannot be further optimized.

3.2 Web-Based Renderer

To take steps toward answering the research question of how 3D Gaussians can efficiently be rendered on the web, two methods were implemented and examined. The first method, presented in Section 3.2.1, and the second method, presented in Section 3.2.2, will hereafter be referred to as 3DGS-web and geometry, respectively. Furthermore, variants of the two renderers with an additional optimization were implemented. These are referred to as 3DGS-web-opt and geometry-opt, and will be presented in Section 3.2.3.

All our methods are implemented in Rust and compile to a WebAssembly module that, together with a small amount of JavaScript for initialization, runs in the browser.
For hardware-accelerated graphics processing, WebGPU, through the wgpu Rust library, is used. A key benefit of WebGPU, as opposed to WebGL2, is the availability of compute shaders, which are required by all our methods.

3.2.1 Original 3DGS Architecture in WebGPU

Our first method of rendering 3D Gaussians on the web (the 3DGS-web renderer) directly takes inspiration from the architecture of the 3DGS renderer, presented in Section 2.2.3. The idea was to assess whether using this architecture in a web-based context is feasible and where compromises need to be made due to the limitations of the web platform. This section covers how the 3DGS renderer’s architecture was adapted to a WebGPU implementation and where the original and our implementation differ.

Similarly to the 3DGS renderer, our implementation renders a set of 3D Gaussians through six phases: Pre-Process 3D Gaussians, Allocate 2D Gaussian Instances, Create 2D Gaussian Instances, Sort 2D Gaussian Instances, Identify Tile Ranges, and Accumulate 2D Gaussians. All phases are implemented as one or more executions of a WebGPU compute shader, with work-group sizes equivalent to the ones used for the thread groups of the corresponding CUDA kernels. The following sections highlight notable differences for each phase compared to the 3DGS renderer, if any.

Allocate 2D Gaussian Instances

Since Nvidia’s CUB library is made specifically for the CUDA platform, it could not be reused for our WebGPU implementation. Therefore, the parallel prefix sum is performed using an implementation [30], by Raph Levien and Reese Levine, based on the state-of-the-art method “Single-pass Parallel Prefix Scan with Decoupled Look-back” [23]. This implementation was chosen because it is one of the few based on the same paper as the CUB implementation.
However, according to Levien [31], due to a lack of inter-workgroup synchronization primitives and a forward-progress guarantee for WebGPU compute shaders, the original method cannot be fully implemented in WebGPU and still be guaranteed to work cross-platform.

Create 2D Gaussian Instances

As mentioned in Section 2.2.3, during the rendering of a frame, the 3DGS renderer dynamically allocates appropriately sized buffers for 2D Gaussian instances in GPU memory. However, to perform this allocation, the total number of instances has to be copied from the offsets buffer (produced in the Allocate 2D Gaussian Instances phase) in GPU memory to main memory. For simplicity and to avoid this copy, we define an upper limit of 10 000 000 instances and allocate a single buffer for these before rendering a frame. Any 2D Gaussian instance that does not fit in the buffer is never created and, therefore, not rendered. With each instance occupying 8 bytes (as seen in the following paragraph), this buffer takes a constant 80 000 000 bytes (∼76 MiB) of GPU memory. This upper limit was chosen experimentally such that all test cases used for evaluation (see Section 4.2.1) render without artifacts. However, it should be noted that the number of instances can exceed the limit in extreme cases (not included in our evaluation set), such as when a large number of 2D Gaussians each cover a large portion of the screen, especially if a high-resolution image is rendered. Nevertheless, for most scenes and views, the 2D Gaussians are too small in screen space for this case to occur.

Additionally, in the original 3DGS implementation, the sorting key and value for each 2D Gaussian instance are 64-bit integers. However, due to a lack of support for 64-bit keys in the sorting algorithm implementation used (presented in the next section) and the lack of a 64-bit integer type in WGSL, 32-bit integers were used for simplicity. In our implementation, the key is therefore split into two 16-bit chunks.
In the 16 most significant bits, the ID of the instance’s tile is stored, just like in the 3DGS renderer. Consequently, due to the limited space of 16 bits, at most 65 536 tiles can be used. Assuming a square image is rendered, the maximum dimensions of the rendered image are 4096 × 4096 pixels (256 × 256 tiles of 16 × 16 pixels each).

In the 16 least significant bits, the 32-bit floating-point camera-space depth $d$ of the Gaussian is quantized and encoded. To mitigate issues due to the low precision of 16 bits when sorting 2D Gaussians by depth, unlike the 3DGS renderer, we linearly encode the camera-space depth into the 16-bit integer $d_{encoded}$ as follows:

$$d_{normalized} = \frac{d - z_{near}}{z_{far} - z_{near}}, \qquad d_{encoded} = \lfloor 0.5 + 65535 \cdot d_{normalized} \rfloor, \tag{3.3}$$

where $z_{near}$ and $z_{far}$ are the camera-space distances to the current view’s near and far plane, respectively.

Sort 2D Gaussian Instances

For sorting 2D Gaussian instances, ideally Nvidia’s state-of-the-art Onesweep algorithm [26] for parallel radix sorting would have been used to facilitate comparison of the 3DGS renderer and our renderer. Since Nvidia’s CUB library cannot be used outside the CUDA platform, a WebGPU implementation of the Onesweep algorithm would have had to be used. Unfortunately, Onesweep cannot be correctly implemented using WebGPU due to a lack of compute shader sub-group operations (sub-groups are called warps and wavefronts in Nvidia and AMD parlance, respectively) [32].

Instead, we used a hybrid parallel radix sort implementation [33] by Raph Levien. This implementation is primarily based on AMD’s FidelityFX radix sort [34] and additionally mixes in a technique called warp-level multi-split from Onesweep [26], [32], [35]. However, due to the lack of sub-group operations in WebGPU, the warp-level multi-split cannot be performed as efficiently as in Onesweep. Instead, it is implemented using work-group shared memory [32].
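The 32-bit sorting key described above (16-bit tile ID combined with the depth encoding of Equation (3.3)) can be illustrated with a short sketch. The clamping of the normalized depth to [0, 1] is an assumption added here to keep the encoded value within 16 bits; the text does not state how out-of-range depths are handled. Function names are illustrative.

```python
import math

def encode_depth16(d, z_near, z_far):
    """Equation (3.3): linear quantization of camera-space depth to 16 bits."""
    d_normalized = (d - z_near) / (z_far - z_near)
    d_normalized = min(max(d_normalized, 0.0), 1.0)   # assumption: clamp to [0, 1]
    return math.floor(0.5 + 65535 * d_normalized)

def pack_key(tile_id, d, z_near, z_far):
    """Tile ID in the 16 most significant bits, encoded depth in the 16 least."""
    if not 0 <= tile_id < 65536:
        raise ValueError("tile ID does not fit in 16 bits")
    return (tile_id << 16) | encode_depth16(d, z_near, z_far)

def max_square_image_side(tile_size=16, tile_bits=16):
    """Largest square image whose tile count fits in tile_bits bits."""
    tiles_per_side = math.isqrt(2 ** tile_bits)
    return tiles_per_side * tile_size
```

Because the tile ID occupies the upper bits, sorting the packed keys in ascending order groups instances by tile first and orders them by quantized depth within each tile. Two Gaussians whose depths fall into the same quantization step receive identical keys, so their relative order is then decided by the sort rather than by depth.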
Accumulate 2D Gaussians

As previously mentioned, while accumulating 2D Gaussian instances for a tile, each thread in the 3DGS renderer continues as long as not all threads have voted to finish early. This collaborative cross-thread vote on whether to stop is implemented in the original renderer using the CUDA synchronization function __syncthreads_count [36], [37, p. 173]. WebGPU lacks a corresponding synchronization function. An attempt was made to emulate this function using atomics, work-group barriers, and work-group shared memory. However, this turned out to be slower than omitting the check altogether. Therefore, our renderer does not check for this early termination condition, and all 2D Gaussian instances in the tile will always be loaded even if all pixels are finished.

3.2.2 Geometry-Based Renderer

Figure 3.1: An overview of the logical steps performed by the geometry renderer (Sort 3D Gaussians, Project 3D Gaussians, Rasterize 2D Gaussian Quads).

Our second method of rendering 3D Gaussians on the web (the geometry renderer) utilizes the traditional graphics rendering pipeline by drawing each 3D Gaussian through rasterizing geometry. The inputs to the renderer are a set of 3D Gaussians $S$ and camera parameters $C$. Let $S = \{g_0, g_1, \ldots, g_{N-1}\}$, where $g_i$ is a 3D Gaussian with ID $i$ and $N$ is the total number of 3D Gaussians.

As seen in Figure 3.1, the rasterization begins by sorting the given 3D Gaussians by camera-space depth. After that, each 3D Gaussian is projected into a screen-space 2D Gaussian using the given camera parameters, and the Gaussian’s view-dependent color is calculated. For each 2D Gaussian, a bounding screen-space quadrilateral is constructed. Finally, each quadrilateral is rasterized and alpha blended in back-to-front order. The following sections detail how these logical operations are implemented using compute shaders and the traditional graphics processing pipeline through WebGPU.
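The tile-based renderer accumulates Gaussians front to back while tracking transmission, whereas the geometry renderer blends them back to front. The following single-pixel sketch, using a standard pre-multiplied “over” blend for the back-to-front case, illustrates that both orders give the same color when no early termination is applied. The function names and the representation of a Gaussian as an `(alpha, rgb)` pair are illustrative assumptions.

```python
def accumulate_front_to_back(gaussians, threshold=1e-4):
    """Front-to-back accumulation with per-pixel early termination.

    gaussians: list of (alpha, (r, g, b)) sorted from nearest to farthest.
    """
    color = [0.0, 0.0, 0.0]
    transmission = 1.0
    for alpha, rgb in gaussians:
        if transmission < threshold:      # remaining Gaussians barely contribute
            break
        for ch in range(3):
            color[ch] += transmission * alpha * rgb[ch]
        transmission *= 1.0 - alpha
    return tuple(color)

def blend_back_to_front(gaussians):
    """Pre-multiplied alpha blending, drawing the farthest Gaussian first."""
    dest = [0.0, 0.0, 0.0]
    for alpha, rgb in reversed(gaussians):
        for ch in range(3):
            source = rgb[ch] * alpha      # pre-multiplied source color
            dest[ch] = source + dest[ch] * (1.0 - alpha)
    return tuple(dest)
```

A practical difference is that the back-to-front variant has no transmission value to test, so it cannot stop early once a pixel is saturated.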
Sort 3D Gaussians

For the 3D Gaussians to be rasterized and blended in back-to-front order, they need to be sorted. Therefore, rendering a given set of Gaussians $S$ begins by executing a compute shader with workgroups of dimensions 256 × 1 × 1, where each invocation (the WebGPU equivalent of a CUDA thread) handles one 3D Gaussian. Each 3D Gaussian’s world-space position $\mu$ is transformed into camera-space position $\mu_{cam}$ using the camera’s world-to-camera matrix, where the camera’s forward direction is along the positive z-axis. Let $z$ be the z-component of $\mu_{cam}$. Finally, the 32-bit floating-point number $z$ is reinterpreted as a 32-bit integer and stored along with the Gaussian’s ID in the sorting entries buffer $B_{unsorted} = [(z_0, ID_0), (z_1, ID_1), \ldots, (z_{N-1}, ID_{N-1})]$.

Thereafter, buffer $B_{unsorted}$ is sorted in ascending order by $z_i$, producing buffer $B_{sorted}$. The sorting is performed using the parallel radix sort WebGPU implementation by Raph Levien presented in Section 3.2.1.

Project 3D Gaussians

The next key step is to construct a 2D quadrilateral for each 3D Gaussian such that it bounds the 3D Gaussian in screen space. The purpose of the quad is to cause fragment shader invocations for, at least, every pixel to which the corresponding 3D Gaussian can contribute. This step is started by a single draw call that executes a WebGPU render pass with camera parameters, buffers for all 3D Gaussian properties, and the buffer $B_{sorted}$ as input. No vertex or index buffers are used for the draw call; instead, instanced drawing is used, where four vertices are constructed by the vertex shader for each of $N$ instances. Let $i_{instance} \in \{0, 1, \ldots, N - 1\}$ be the ID of the current geometry instance. To ensure the 3D Gaussians are drawn back to front, instance $i_{instance}$ corresponds to the sorting entry of $B_{sorted}$ at index $N - i_{instance} - 1$, and therefore, to the 3D Gaussian with ID $i = ID_{N - i_{instance} - 1}$.
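The depth-key construction and the back-to-front instance mapping described above can be sketched as follows. The sketch assumes non-negative camera-space depths, since the bit pattern of an IEEE 754 float only preserves ordering as an unsigned integer for non-negative values; how Gaussians behind the camera are handled is not covered here. `sorted` stands in for the GPU radix sort, and the function names are hypothetical.

```python
import struct

def float_bits(z):
    """Reinterpret a 32-bit float as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", z))[0]

def sort_entries(camera_z):
    """Build (key, Gaussian ID) entries and sort them ascending by key."""
    entries = [(float_bits(z), gid) for gid, z in enumerate(camera_z)]
    return sorted(entries)                  # stand-in for the parallel radix sort

def gaussian_for_instance(sorted_entries, i_instance):
    """Map a geometry instance to a Gaussian ID so that drawing is back to front."""
    n = len(sorted_entries)
    return sorted_entries[n - i_instance - 1][1]
```

Reading the ascending buffer from its end means that no separate descending sort is needed: instance 0 draws the farthest Gaussian and instance $N - 1$ the nearest.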
Using the same procedure as described in the “Pre-Process 3D Gaussians” phase of the original 3DGS renderer (presented in Section 2.2.3), the vertex shader calculates the corresponding 2D Gaussian’s center $\mu_{2D}$ in continuous pixel coordinates and its transformation matrix $\Sigma_{2D}$. Furthermore, the extents of the 2D Gaussian’s axis-aligned bounding box centered at $\mu_{2D}$ are determined as $e = E_{square}(\Sigma_{2D})$ (see Equation (2.7)). Using $\mu_{2D}$ and $e$, the framebuffer coordinates of the four vertices for the current instance are determined and then transformed into clip space.

Finally, using the current 3D Gaussian’s spherical harmonic coefficients, its view-dependent color is calculated using the direction from the camera to the Gaussian’s center.

Once the vertex processing stage has finished, the screen-space position $\mu_{2D}$, the inverse of the 2D transformation matrix $\Sigma_{2D}^{-1}$, and the view-dependent color $c$ are passed on to the fragment processing stage.

Rasterize 2D Gaussian Quads

With one screen-space quadrilateral now constructed for each 2D Gaussian, the fragment shader of the render pass will be invoked for every pixel overlapped by each Gaussian’s corresponding quadrilateral. Let $x_p$ be the center of the pixel handled by the current fragment shader invocation. The fragment shader outputs color $C$ and alpha $\alpha$ of the current pixel for the current 2D Gaussian as follows:

$$C = c \cdot \alpha, \qquad \alpha = D_{\Sigma_{2D}}(x_p), \tag{3.4}$$

where function $D$ is from Equation (2.2).

Finally, the rasterized 2D Gaussians are blended in back-to-front order by the GPU by appropriately setting the blending parameters of the WebGPU render pass.
The render pass is configured to achieve the following standard pre-multiplied alpha blending operation:

$$C_{result} = C_{source} + C_{destination} \cdot (1 - \alpha_{source}) \tag{3.5}$$

3.2.3 Optimization

A key observation regarding the workload of both the 3DGS-web renderer and the geometry renderer is that the total number of tiles (in 3DGS-web) and the total number of pixels (in geometry) for which a 2D Gaussian’s contribution has to be calculated is directly related to the size of each 2D Gaussian’s bounding box. As introduced in Section 2.2.1, a Gaussian never reaches zero anywhere in its domain and, therefore, contributes to every point in space and every pixel. Thus, to achieve efficient rendering, each Gaussian is limited to three standard deviations from its center. The elliptical boundary corresponding to this limit for a 2D Gaussian can be seen in Figure 3.2 as a dotted black line.

The 3DGS renderer approximates the bounding ellipse using an axis-aligned bounding box with square extents as described by $E_{square}$ (defined in Equation (2.7)). As seen in Figure 3.2a, the bounding box is dimensioned such that the bounding ellipse’s longest axis will fit regardless of orientation. This makes the extent of the bounding box unnecessarily large.

Therefore, we propose replacing 3DGS’s square bounding box with an axis-aligned bounding box with dimensions such that it tightly bounds a 2D Gaussian’s elliptical boundary, as seen in Figure 3.2b.

Figure 3.2 (panels: (a) 3DGS, (b) Ours): A comparison of the square axis-aligned bounding box used by the 3DGS renderer and our tight bounding box used by the 3DGS-web-opt and geometry-opt renderers.

The two axes of the bounding ellipse are $a$ and $b$, defined as

$$a = \frac{v_1}{\lVert v_1 \rVert} l_1, \quad l_1 = \left\lceil 3 \sqrt{\lambda_1} \right\rceil, \qquad b = \frac{v_2}{\lVert v_2 \rVert} l_2, \quad l_2 = \left\lceil 3 \sqrt{\lambda_2} \right\rceil, \tag{3.6}$$

where $v_1$ and $v_2$ are the two eigenvectors of $\Sigma_{2D}$ with eigenvalues $\lambda_1$ and $\lambda_2$, respectively. Finally, as shown by Quílez [38], using the axes of the bounding ellipse our extents function can be defined as

$$E_{tight}(\Sigma_{2D}) = 2 \begin{bmatrix} \sqrt{a_x^2 + b_x^2} & \sqrt{a_y^2 + b_y^2} \end{bmatrix}^T, \tag{3.7}$$

where $a = \begin{bmatrix} a_x & a_y \end{bmatrix}^T$ and $b = \begin{bmatrix} b_x & b_y \end{bmatrix}^T$. This new extents function for a 2D Gaussian’s axis-aligned bounding box is used in the 3DGS-web-opt and geometry-opt renderers.

4 Results

The following chapter details how the previously presented methods were evaluated and highlights key insights from the results. Sections 4.1 and 4.2 correspond to the methods presented in Sections 3.1 and 3.2, respectively.

4.1 Mesh-Supervised 3D Gaussian Optimization

4.1.1 Evaluation Methodology

To evaluate our method of initializing a 3D Gaussian scene using existing scene geometry, the following evaluation methodology was followed.

A training and evaluation set consisting of images depicting the eight scenes from NeRF’s Realistic Synthetic 360° dataset [1] was used. The exact selection of training and evaluation views provided in NeRF’s dataset was used. Additionally, the corresponding Blender scenes provided in the dataset were used to export triangular meshes for each scene.

Using the set of training views and meshes as input, the optimization process was run using the original 3DGS method and our method. Both methods were initialized with a set of 100 000 3D Gaussians. To facilitate comparison with the original 3DGS paper, the process was run for 30 000 iterations. The 3DGS scene was checkpointed at 7000 and 30 000 iterations.

Finally, objective reconstruction quality, i.e., the difference between an original image and a reconstructed image from the same viewpoint, was measured for every view in the evaluation set using three commonly used objective image similarity metrics: Structural Similarity Index (SSIM) [20], Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS) [39].
4.1.2 Reconstruction Quality

Table 4.1 shows the results of measuring objective reconstruction quality for 3DGS and our method. The metrics have been averaged per method and scene across all views of the evaluation set. Furthermore, following the approach of the original 3DGS paper, the optimized 3D Gaussian scene is evaluated at 7000 and 30 000 training iterations. The metrics at these two points are shown in Table 4.1a and Table 4.1b, respectively.

Table 4.1: Reconstruction quality of our mesh-supervised 3D Gaussian optimization method compared to the original 3DGS optimization method. For the three objective image similarity metrics SSIM, PSNR, and LPIPS, the upward and downward arrows indicate whether larger or smaller values, respectively, correspond to higher similarity. The colored backgrounds indicate the “best” method for each metric and scene. The “#” column shows the number of 3D Gaussians in the scene in kilo-Gaussians.

(a) At 7000 training iterations

| Scene | Method | SSIM↑ | PSNR↑ | LPIPS↓ | # |
|---|---|---|---|---|---|
| Chair | 3DGS | 0.980 | 33.19 | 0.020 | 317 |
| Chair | Ours | 0.983 | 33.57 | 0.016 | 490 |
| Drums | 3DGS | 0.947 | 25.41 | 0.050 | 247 |
| Drums | Ours | 0.948 | 25.46 | 0.045 | 390 |
| Ficus | 3DGS | 0.984 | 34.19 | 0.016 | 191 |
| Ficus | Ours | 0.984 | 34.20 | 0.015 | 228 |
| Hotdog | 3DGS | 0.981 | 35.86 | 0.030 | 145 |
| Hotdog | Ours | 0.982 | 36.08 | 0.025 | 253 |
| Lego | 3DGS | 0.975 | 33.66 | 0.027 | 263 |
| Lego | Ours | 0.979 | 34.28 | 0.020 | 370 |
| Materials | 3DGS | 0.950 | 28.75 | 0.051 | 136 |
| Materials | Ours | 0.950 | 28.72 | 0.049 | 227 |
| Mic | 3DGS | 0.985 | 33.17 | 0.014 | 133 |
| Mic | Ours | 0.989 | 34.72 | 0.009 | 274 |
| Ship | 3DGS | 0.898 | 30.57 | 0.129 | 210 |
| Ship | Ours | 0.902 | 30.80 | 0.112 | 365 |

(b) At 30 000 training iterations

| Scene | Method | SSIM↑ | PSNR↑ | LPIPS↓ | # |
|---|---|---|---|---|---|
| Chair | 3DGS | 0.988 | 35.58 | 0.010 | 489 |
| Chair | Ours | 0.987 | 35.30 | 0.011 | 645 |
| Drums | 3DGS | 0.955 | 26.28 | 0.037 | 390 |
| Drums | Ours | 0.954 | 26.25 | 0.036 | 526 |
| Ficus | 3DGS | 0.987 | 35.50 | 0.012 | 266 |
| Ficus | Ours | 0.986 | 35.39 | 0.012 | 299 |
| Hotdog | 3DGS | 0.985 | 38.06 | 0.020 | 188 |
| Hotdog | Ours | 0.985 | 37.80 | 0.018 | 291 |
| Lego | 3DGS | 0.983 | 36.06 | 0.016 | 344 |
| Lego | Ours | 0.983 | 36.09 | 0.015 | 438 |
| Materials | 3DGS | 0.960 | 30.50 | 0.037 | 160 |
| Materials | Ours | 0.958 | 30.42 | 0.036 | 237 |
| Mic | 3DGS | 0.993 | 36.74 | 0.006 | 196 |
| Mic | Ours | 0.993 | 37.03 | 0.006 | 285 |
| Ship | 3DGS | 0.906 | 31.69 | 0.106 | 278 |
| Ship | Ours | 0.905 | 31.66 | 0.096 | 411 |

At 7000 iterations, our method shows a minor improvement in all metrics for almost all scenes. On the other hand, at 30 000 iterations, the metrics are marginally better for some scenes and marginally worse for others. On average, both methods produce images of similar reconstruction quality after this many iterations. Finally, using our method the number of 3D Gaussians in each scene is significantly higher overall.

Figure 4.1 shows two scenes where the difference between our method’s and 3DGS’ reconstruction quality (PSNR) is large, i.e., where our method exhibits the greatest improvement. In the first scene at 7000 iterations (see Figure 4.1a), an improvement is seen in the microphone’s mesh. In this case, 3DGS could not reconstruct the mesh’s high-frequency details on the underside and front of the microphone, while our method gave a result much closer to the ground truth image. However, for the same scene at 30 000 iterations (see Figure 4.1b), little difference is seen using our method compared to 3DGS. This is consistent with the results shown in Table 4.1. Moreover, in the second scene at 7000 iterations (see Figure 4.1c), our method resulted in a reconstruction much closer to the ground truth image compared to 3DGS. The difference is most clearly seen in the long gray jagged Lego piece in the middle of the bulldozer. Furthermore, the same difference is also, surprisingly, seen in Figure 4.1d at 30 000 iterations.

Figure 4.1 (image columns: Ground Truth, Ours 7k, 3DGS 7k, Ours 30k, 3DGS 30k; panels (a) to (d)): An example of two scenes where our method exhibits the greatest reconstruction quality improvement compared to 3DGS.
The columns labeled “7k” and “30k” correspond to images of each scene after 7000 and 30 000 training iterations have passed, respectively.

4.2 Web-Based Renderer

4.2.1 Evaluation Methodology

Figure 4.2 (panels: (a) Wide view 0.25×, (b) Medium view 1×, (c) Close-up view 4×): An example of test cases with the three proximity levels used for each camera angle during evaluation.

To evaluate the web-based renderers (with and without optimizations) compared to the 3DGS renderer, two aspects were considered: image quality and run-time performance. For all renderers, a test set of 240 test cases was used. The test set is based on the eight scenes from NeRF’s Realistic Synthetic 360° dataset [1]. For all scenes, a single set of ten randomly selected camera angles was used, where the camera faces the scene’s center and the camera’s position is uniformly sampled from a sphere centered on the scene’s center. Additionally, three proximity variants of each camera angle were used, referred to as wide, medium, and close-up views with zoom levels of 0.25×, 1×, and 4×, respectively. Here, 1× roughly corresponds to the object of the scene being close but still fully visible with some margin around the edges of the frame. An example of test cases showing the three proximity levels can be seen in Figure 4.2. To facilitate comparison with the original 3DGS paper, all test cases were rendered using dimensions of 800 × 800 pixels.

To ensure that the performance comparison between the web-based renderers and the 3DGS renderer is as fair as possible, the images produced by the web-based renderers were compared to the images produced by the 3DGS renderer using three commonly used objective image similarity metrics: Structural Similarity Index (SSIM) [20], Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS) [39]. These similarity metrics were computed for all test cases individually, then averaged and grouped by renderer.
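The composition of the test set (8 scenes × 10 camera angles × 3 proximity levels = 240 test cases) can be sketched as follows. The sketch assumes that uniform sampling on the sphere is done by normalizing a vector of independent normal samples, and it records the zoom level only as a number; how the zoom level translates into camera distance or field of view is not specified here. The seed and function names are illustrative.

```python
import math
import random

SCENES = ["chair", "drums", "ficus", "hotdog", "lego", "materials", "mic", "ship"]
ZOOM_LEVELS = {"wide": 0.25, "medium": 1.0, "close-up": 4.0}

def sample_unit_sphere(rng):
    """Uniformly sample a direction on the unit sphere."""
    while True:
        v = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:                   # reject the (practically impossible) zero vector
            return tuple(c / norm for c in v)

def build_test_cases(seed=0, num_angles=10):
    """One shared set of camera directions, reused for every scene."""
    rng = random.Random(seed)
    directions = [sample_unit_sphere(rng) for _ in range(num_angles)]
    return [
        {"scene": scene, "angle": i, "direction": d, "view": name, "zoom": zoom}
        for scene in SCENES
        for i, d in enumerate(directions)
        for name, zoom in ZOOM_LEVELS.items()
    ]
```

Sharing one set of directions across all scenes keeps the viewpoints comparable between scenes.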
To evaluate the run-time performance of the 3DGS renderer and the web-based renderers, GPU frame time was measured and approximate GPU memory usage was estimated for all test cases. For all renderers, to minimize the influence of external factors on measured frame time, GPU frame time and GPU memory usage were collected during 90 consecutive frames after first rendering 30 warm-up frames following renderer start-up. Additionally, each renderer was fully restarted between test cases. This means restarting the operating system process for the 3DGS renderer and performing a full page reload for the web renderers. Finally, the collected performance metrics of the 90 rendered frames were aggregated into an average GPU frame time and the maximum GPU memory usage. All test cases were executed using an Nvidia GeForce RTX 3070 Ti graphics card.

GPU frame time and memory usage for the 3DGS renderer were captured using the real-time 3D Gaussian viewer [40] published by the authors of the original 3DGS paper. To measure GPU frame time, the viewer was modified to use CUDA events and cudaEventElapsedTime. Meanwhile, no built-in way was found to measure, using CUDA APIs, the GPU memory usage caused by a single operating system process. Therefore, using cudaMemGetInfo, total GPU memory usage is measured at the start of the application and once again after each finished frame. It should be noted that if any other process allocates or frees GPU memory between these measurements, the resulting renderer GPU memory usage could be incorrect. Hence, we refer to it as an estimation and the amount is not guaranteed to be exact.

GPU frame time and memory usage for the web-based renderers were captured by running each of the WebGPU-based renderers in Google Chrome (version 125 Beta) with the enable-webgpu-developer-features and enable-unsafe-webgpu feature flags enabled. To capture GPU frame time, the WebGPU timestamp-query [41] extension was used.
The enable-webgpu-developer-features flag was enabled to ensure that timestamps are not quantized [42] (to multiples of 100 µs) and remain as accurate as possible. The enable-unsafe-webgpu flag was enabled to allow usage of the GPUCommandEncoder.writeTimestamp [43] function, which was recently removed from the WebGPU specification [44] but is still available in Chrome behind the flag [45]. With regard to GPU memory usage, there is currently no way to query the exact usage on the web platform. Therefore, the GPU memory usage was estimated using the webgpu-memory JavaScript library, which keeps track of every resource allocated through the WebGPU API and estimates their total size based on the parameters used to construct them [46]. It should be noted that this does not necessarily match the exact memory usage on the GPU, since it is up to the GPU driver and GPU to decide how resources are laid out in memory.

(a) 3DGS (b) 3DGS-web (c) Absolute difference
Figure 4.3: The worst performing test case, with regard to similarity, for the 3DGS-web renderer in comparison to the 3DGS renderer.

4.2.2 Image Quality

Table 4.2: Image similarity of rendered images from the web-based renderers compared to the 3DGS renderer averaged over all test cases. For the three objective image similarity metrics SSIM, PSNR, and LPIPS, the upward and downward arrows indicate whether larger or smaller values, respectively, correspond to higher similarity. The colored backgrounds indicate the 1st, 2nd, and 3rd “best” method for each metric.

Method         SSIM↑   PSNR↑   LPIPS↓
3DGS-web       0.999   53.02   0.002
3DGS-web-opt   0.999   52.49   0.002
geometry       1.000   61.68   0.000
geometry-opt   0.999   51.41   0.001

The averaged similarity metrics for all test cases can be seen in Table 4.2. Since all SSIM and LPIPS values are close to one and zero, respectively, these results suggest that the web-based renderers on average produce images with high similarity to the corresponding images produced by the 3DGS renderer.
The geometry renderer produces the most similar images with effectively no noticeable difference, while the other three renderers all produce images containing some (minor) differences compared to the 3DGS renderer.

Figure 4.3 shows the image produced by the 3DGS-web (and 3DGS-web-opt¹) renderer for its test case with the lowest PSNR value, i.e., the worst case. As the absolute difference shows, the error is small and almost perceptually insignificant. In general, differences in images produced by 3DGS-web and 3DGS-web-opt, compared to 3DGS, are caused by the lower precision of the 16-bit floating-point depth values used for sorting the Gaussians, causing them to be drawn in a different order than in images produced by the 3DGS renderer.

(a) 3DGS (b) geometry-opt (c) Absolute difference
Figure 4.4: An example illustrating how images of scenes with sub-pixel Gaussians exhibit a noticeable difference when using the geometry-opt renderer, in comparison with the 3DGS renderer. The difference occurs in the mesh of the microphone and the detailed patterns of the chair where Gaussians are smaller than a pixel. Note that the chair is from a wide-view test case but magnified 4× for clarity.

Furthermore, Figure 4.4 exemplifies the difference in images for some of the worst-case test cases for the geometry-opt renderer. As can be seen in the absolute difference, the deviation occurs in the mesh of the microphone and the detailed patterns of the chair. In these high-frequency areas, there are many tiny screen-space Gaussians. Tiny Gaussians that previously used a square screen-space bounding box in the “non-optimized” renderers and were trained for this case are likely to become smaller than a pixel in the optimized version of the geometry-based renderer. The 3DGS renderer draws these sub-pixel Gaussians larger than a single pixel while the geometry-opt renderer clamps the bounding box to cover at least one pixel.
This differing approach is the cause of the slight but nearly imperceptible image difference.

¹ Since the two images are practically identical, only the image produced by 3DGS-web is shown.

4.2.3 Run-Time Performance

Table 4.3: Run-time performance of the web-based renderers compared to the 3DGS renderer. The “Time” metric denotes GPU frame time in milliseconds and the “Mem” metric denotes approximate maximum GPU memory usage in mebibytes. The colored backgrounds indicate the 1st, 2nd, and 3rd “best” method for each metric and scene.

Method             Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
               Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS           6.16  659   5.79  642   4.11  436   2.74  371   4.09  573   3.47  486   5.60  457   4.28  661   4.53  536
3DGS-web       8.09  398   6.97  419   4.07  326   3.25  300   5.53  369   3.80  367   5.71  346   7.06  388   5.56  365
3DGS-web-opt   5.41  398   5.15  419   3.15  326   2.37  300   3.75  369   3.00  367   3.98  346   4.09  388   3.87  365
geometry       3.80  158   3.03  177   1.78   91   2.61   68   3.42  131   1.80  129   1.85  110   5.12  149   2.93  127
geometry-opt   2.05  158   1.96  177   1.25   91   1.49   68   1.95  131   1.31  129   1.43  110   2.41  149   1.73  127
(a) All test cases

Method             Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
               Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS           3.18  659   2.77  642   2.16  436   2.18  371   2.83  573   1.95  484   1.76  457   3.69  661   2.56  536
3DGS-web       7.88  398   4.76  419   2.94  326   3.94  300   6.18  369   2.47  367   2.50  346  10.19  388   5.11  365
3DGS-web-opt   3.28  398   2.69  419   1.89  326   2.15  300   2.83  369   1.77  367   1.62  346   4.08  388   2.54  365
geometry       7.13  158   4.87  177   2.46   91   4.71   68   6.40  131   2.31  129   2.71  110  10.89  149   5.18  127
geometry-opt   3.01  158   2.56  177   1.53   91   2.07   68   2.85  131   1.50  129   1.58  110   4.14  149   2.41  127
(b) Only test cases with close-up views

Method             Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
               Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS          11.47  561  11.08  608   7.60  368   3.98  307   6.55  479   5.87  472  11.99  421   6.21  535   8.09  469
3DGS-web      11.35  398  11.53  419   6.71  326   3.37  300   6.61  369   6.16  367  11.37  346   6.73  388   7.98  365
3DGS-web-opt   8.99  398   8.94  419   5.41  326   3.10  300   5.48  369   4.87  367   7.72  346   5.09  388   6.20  365
geometry       1.91  158   2.00  177   1.53   91   1.04   68   1.53  131   1.56  129   1.43  110   1.59  149   1.57  127
geometry-opt   1.47  158   1.57  177   1.05   91   0.92   68   1.37  131   1.20  129   1.38  110   1.28  149   1.28  127
(c) Only test cases with wide views

Table 4.3 shows the results of measuring GPU frame time and estimating maximum GPU memory usage for the 3DGS renderer as well as all the web-based renderers. The metrics have been aggregated (average for frame time and maximum for memory usage) per renderer and scene, and the average across all scenes for a specific renderer is displayed in the rightmost column. Furthermore, different subsets of test cases are considered in the three sub-tables. Table 4.3a includes all test cases, while Table 4.3b and Table 4.3c only include test cases with close-up and wide views, respectively. Results for only medium views are omitted here for brevity, since they provide few additional insights, but can be found in Appendix A.1.1.

Based on Table 4.3a, the following insights concerning frame time can be derived: The 3DGS-web renderer is on average somewhat slower than the 3DGS renderer when considering all test cases (5.56 ms versus 4.53 ms), despite having roughly the same architecture. Meanwhile, the 3DGS-web-opt renderer is for most scenes on average on par with or marginally faster than the 3DGS renderer. Furthermore, the geometry renderer is on average considerably faster than the 3DGS renderer and its derivatives for almost all scenes. Finally, the geometry-opt renderer is on average roughly 1.7 times as fast as the geometry renderer (1.73 ms versus 2.93 ms).

Moreover, concerning frame time for close-up and wide views, the 3DGS, 3DGS-web, and 3DGS-web-opt renderers show an inverse trend compared to the geometry and geometry-opt renderers. On average, 3DGS, 3DGS-web, and 3DGS-web-opt perform significantly slower for wide views, while geometry and geometry-opt perform somewhat slower for close-up views. For close-up views, it can be seen that 3DGS and 3DGS-web-opt even significantly outperform the geometry renderer for many scenes. The geometry-opt renderer, on the other hand, is only marginally outperformed for a few scenes using close-up views.

Concerning memory usage, the 3DGS-web and 3DGS-web-opt renderers on average use at least 100 MiB less GPU memory for almost all scenes compared to the 3DGS renderer. Furthermore, the geometry and geometry-opt renderers use at least 200 MiB less GPU memory than the 3DGS-web and 3DGS-web-opt renderers for all scenes. Additionally, the 3DGS renderer requires noticeably more memory for close-up views than wide views for all scenes. Meanwhile, the amount of memory consumed by all the web-based renderers is independent of how Gaussians are distributed and sized in screen space. It should be noted that the reason 3DGS-web’s and 3DGS-web-opt’s GPU memory usage does not vary based on proximity, despite being based on the architecture of the 3DGS renderer, is the usage of a fixed upper limit for the number of 2D Gaussian instances, as detailed in Section 3.2.1.

5 Discussion

Our work is motivated by the use of recent novel-view synthesis methods, specifically 3D Gaussian Splatting (3DGS), to allow for real-time interaction with scenes that would typically be prohibitively expensive to render in real time. Specifically, our work targets the web to allow for applications such as product visualization on consumer hardware in a portable way. We have explored two questions that build on 3DGS.
Firstly, given a classical scene representation (i.e., polygonal mesh), what effect does using the existing scene geometry to inform initialization have on the optimization process and results of 3D Gaussian Splatting? Secondly, how can 3D Gaussians be rendered efficiently within the constraints of the web?

5.1 Mesh-Supervised 3D Gaussian Optimization

To assess the effect of using existing scene geometry to inform the 3DGS optimization process, we proposed and implemented a method for initializing a 3D Gaussian scene from a triangular mesh, as presented in Section 3.1. We evaluated our method compared to 3DGS across eight small-scale synthetic scenes and measured objective reconstruction quality for multiple views of each scene, as described in Section 4.1.1.

The evaluation results show a distinct difference in reconstruction quality improvement when observing the scene after only 7000 training iterations as opposed to 30 000 iterations. At 30 000 iterations, both our method and 3DGS on average result in roughly the same reconstruction quality. Presumably, since only the initialization stage was modified, the optimization processes converge after a large number of iterations. Consequently, both methods arrive at roughly the same 3D Gaussian scene and nearly the same reconstruction quality.

However, at 7000 iterations, our method exhibits an improvement for almost all scenes. These results suggest that our method somewhat accelerates the optimization, reaching higher reconstruction quality with fewer iterations. This can also be seen in Figure 4.1, where examples of our method’s largest improvement are shown. Here, our method was able to reconstruct certain small-scale details better at only 7000 iterations than 3DGS at 30 000 iterations. These examples should, however, be taken with a grain of salt since they are the very best cases. On average, our method gives only a marginal improvement after 7000 iterations.

Finally, despite both methods being initialized with 100 000 Gaussians, we observe that using our method results in scenes with more 3D Gaussians than 3DGS. The higher reconstruction quality at 7000 iterations could be partly due to this increase. It is unclear whether this increase in Gaussians is required to reconstruct the scenes well. For example, in Figure 4.1b with the microphone at 30 000 iterations, no significant improvement can be seen despite our method using almost 100 000 more Gaussians. Ideally, the number of 3D Gaussians should be as low as possible, since a greater number of Gaussians results in higher memory usage and typically slower rendering performance.

Overall, these results suggest that there is some merit in supervising the 3D Gaussian Splatting optimization process using existing scene geometry. However, our rather simple method for initialization is insufficient to create noticeable visual differences across a variety of scenes.

5.1.1 Limitations

A major limitation of our method is that it does not adapt to differing geometrical complexity, since it always creates a fixed number of 3D Gaussians. Furthermore, it does not utilize the fact that a 3D Gaussian has three dimensions, since each Gaussian is flattened to align with a face of the triangular mesh. Finally, the heuristic used to scale each Gaussian does not accurately cover each face’s area.

5.1.2 Future Work

For future work, it could be interesting to attempt a more sophisticated method for initializing the 3D Gaussian scene. For example, a different method could derive volume elements from the mesh and construct 3D Gaussians for each element. Alternatively, a signed-distance field could be constructed from the mesh and used to sample points inside the mesh’s volume.

Furthermore, it could be interesting to examine the effect of supervising the optimization process beyond initialization.
For example, the loss function could be augmented to “encourage” the Gaussians to follow the surface or volume of the mesh, or at least penalize Gaussians that are outside the volume of the mesh.

As a slightly different approach, it could also be interesting to supervise the Gaussians to follow the scene’s original geometry using additional per-view information (e.g., depth and normal render passes) rather than the mesh itself. For example, a depth render pass per input view could be used to sample points on surfaces for initialization, or in the loss function to “encourage” the Gaussians to stay near the surface or at least inside the scene’s volume. This approach to initialization could additionally use the color and potentially a normal render pass for each view to approximate an initial color and orientation for each Gaussian.

5.2 Web-Based Renderer

To assess different ways of rendering 3D Gaussians on the web, we implemented two methods (presented in Section 3.2), named 3DGS-web and geometry, and identified a key optimization that is applied to both methods, yielding the variants 3DGS-web-opt and geometry-opt (presented in Section 3.2.3). The first method takes direct inspiration from the architecture of the original 3D Gaussian renderer presented in the 3DGS paper (see the 3DGS renderer in Section 2.2.3). Meanwhile, the second method utilizes rasterization of proxy geometry through the traditional graphics pipeline. We evaluated image quality and run-time performance compared to the 3DGS renderer through a set of 240 test cases encompassing a variety of small-scale scenes from multiple camera angles with multiple proximity levels (close-up, medium, and wide views), as presented in Section 4.2.1.

The evaluation results suggest that using the architecture of the original 3DGS renderer to synthesize novel views in real time in a web-based environment is viable.
However, due to the limitations of the web platform, specifically WebGPU, the original architecture could not be implemented identically, and compromises were made. These compromises caused the images rendered with 3DGS-web to have a loss of quality compared to the 3DGS renderer in some cases. However, the differences are largely imperceptible. Furthermore, seemingly due to parts of the architecture that could not be implemented in WebGPU, the same run-time performance with regard to frame time was not achieved.

On the other hand, on average our geometry-based 3D Gaussian renderer noticeably outperforms the original 3DGS renderer, as well as our WebGPU adaptations of it, with regard to frame time and GPU memory usage. Presumably, due to the highly optimized hardware in contemporary GPUs for geometry rasterization and shading, it is in most cases more efficient to render 3D Gaussians as proxy geometry than to use compute passes as in the 3DGS renderer’s architecture. However, the geometry renderer performs worse than 3DGS for close-up views. In the case of a close-up view, many 2D Gaussians will have a large mutual screen-space overlap. In this case, as described in Section 2.2.3, the 3DGS renderer will terminate accumulation early for pixels whose transmission becomes low. The observed performance inversion is likely caused by the geometry renderer not having a way of discarding fragments from highly occluded 2D Gaussians in this situation. Similarly, the 3DGS-web renderer does not perform early termination either, and as seen in the results, it performs equally poorly for close-up views.

Additionally, with our rather straightforward optimization of using more accurate 2D Gaussian bounding boxes, the 3DGS-web-opt renderer is at least as performant as the original 3DGS renderer.
This can be explained by the fact that a smaller bounding box means that each 2D Gaussian is likely to overlap fewer screen-space tiles, and consequently, each thread corresponding to a pixel of a tile needs to load fewer Gaussians from global GPU memory and do less work to accumulate them.

Along the same lines, with our more accurate 2D Gaussian bounding boxes applied to our geometry-based renderer, it on average significantly outperforms all other renderers in our comparison with regard to frame time, while retaining the low GPU memory usage of the geometry renderer. In this case, the proxy geometry will have a smaller footprint in screen space, and naturally fewer fragments need to be computed and blended.

Overall, these results suggest that 3D Gaussian Splatting as a scene representation and rendering method can effectively be used on the web. When choosing between the original 3DGS architecture and the geometry-based method, we recommend the geometry-based method due to lower memory usage and frame time in almost all cases, as well as a simpler pipeline for implementation. Furthermore, the geometry-based renderer’s memory usage is constant, unlike the 3DGS rendering architecture, whose memory usage depends on the number of 2D Gaussian instances, usually determined by the camera’s proximity to the scene. Finally, we conclude that these methods could likely allow real-time interaction for scenes that would typically be prohibitively expensive to render in real time, offering a potential avenue for interactive and high-quality product visualization on the web.

5.2.1 Limitations

In our study of different ways of rendering 3D Gaussians on the web, we have only evaluated our methods for a single type of GPU, a single set of output image dimensions, and a single type of scene. The observations made might not hold under a different set of conditions.
Regarding our methods, one major limitation of the 3DGS-web and 3DGS-web-opt renderers is that a maximum number of 3D Gaussian instances has to be manually configured. If this limit is set too low, some Gaussians might not be rendered for some close-up views, and if it is set too high, a large amount of GPU memory will be allocated despite not being needed for all views.

5.2.2 Future Work

For future work, it would be interesting to see how the compared methods scale when evaluated using large-scale environmental scenes, image dimensions closer to what would be used for a real-world application¹, and lower-end devices.

Moreover, several ideas could be explored for the geometry-based renderer. Firstly, as previously discussed, the geometry-based renderer presumably performs poorly for close-up views due to the lack of early termination when blending 2D Gaussians. It would be interesting to address this issue to see if it can reduce the computation per pixel. Secondly, it would be interesting to examine the effect of using more vertices for a 2D Gaussian’s screen-space proxy geometry to more closely approximate its elliptical boundary. Hopefully, this would further decrease the number of fragments rasterized per 2D Gaussian.

¹ Results for this case were measured but not discussed due to lack of time. These results can be seen in Appendix A.1.2.

5.3 Risks and Ethics

When considering the risks and ethics of research, two aspects should be considered: the research methodology and the research results.

Our research methodology is isolated to the theoretical study of computer graphics with small-scale empirical experiments. Due to the scale of the study, and the lack of interaction with any human participants, we see no meaningful negative economic, legal, societal, privacy, security, ecological, or environmental ethical concerns.

Regarding our research results, we likewise see no reason to expect a negative ethical impact.
Our immediate results are not significant enough to cause any ground-breaking effects that lead to immediate ethical issues. However, in a broader sense, advancement in novel-view synthesis could be misused for malicious purposes, such as generating fake imagery with the intent of misleading.

Bibliography

[1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., ser. Lecture Notes in Computer Science, Cham: Springer International Publishing, 2020, pp. 405–421, isbn: 978-3-030-58452-8. doi: 10.1007/978-3-030-58452-8_24.
[2] Y. Xie, T. Takikawa, S. Saito, et al., “Neural fields in visual computing and beyond,” Computer Graphics Forum, 2022, issn: 1467-8659. doi: 10.1111/cgf.14505.
[3] M. Seefelder and D. Duckworth. “Reconstructing indoor spaces with NeRF,” Google Research. (Jun. 14, 2023), [Online]. Available: https://blog.research.google/2023/06/reconstructing-indoor-spaces-with-nerf.html (visited on 08/29/2023).
[4] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, 102:1–102:15, Jul. 2022. doi: 10.1145/3528223.3530127. [Online]. Available: https://doi.org/10.1145/3528223.3530127.
[5] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, 139:1–139:14, Jul. 26, 2023, issn: 0730-0301. doi: 10.1145/3592433. [Online]. Available: https://dl.acm.org/doi/10.1145/3592433 (visited on 11/06/2023).
[6] G. Chen and W. Wang, A survey on 3d gaussian splatting, Jan. 8, 2024. arXiv: 2401.03890[cs]. [Online]. Available: http://arxiv.org/abs/2401.03890 (visited on 01/09/2024).
[7] S. Niedermayr, J. Stumpfegger, and R. Westermann, Compressed 3d gaussian splatting for accelerated novel view synthesis, Jan. 22, 2024. doi: 10.48550/arXiv.2401.02436. arXiv: 2401.02436[cs]. [Online]. Available: http://arxiv.org/abs/2401.02436 (visited on 05/16/2024).
[8] K. Kwok, Antimatter15/splat, Jan. 9, 2024. [Online]. Available: https://github.com/antimatter15/splat (visited on 06/06/2024).
[9] A. Meißner, Lichtso/splatter, Oct. 19, 2023. [Online]. Available: https://github.com/Lichtso/splatter (visited on 06/06/2024).
[10] kishimisu, Kishimisu/gaussian-splatting-WebGL, Oct. 26, 2023. [Online]. Available: https://github.com/kishimisu/Gaussian-Splatting-WebGL (visited on 06/06/2024).
[11] Y. Sato, BladeTransformerLLC/gauzilla, May 14, 2024. [Online]. Available: https://github.com/BladeTransformerLLC/gauzilla (visited on 06/06/2024).
[12] M. Svensson, MarcusAndreasSvensson/gaussian-splatting-webgpu, Oct. 26, 2023. [Online]. Available: https://github.com/MarcusAndreasSvensson/gaussian-splatting-webgpu (visited on 06/06/2024).
[13] M. Kellogg, Mkkellogg/GaussianSplats3d, May 9, 2024. [Online]. Available: https://github.com/mkkellogg/GaussianSplats3D (visited on 06/06/2024).
[14] M. Tyszkiewicz and A. Islamov, Cvlab-epfl/gaussian-splatting-web, Sep. 22, 2023. [Online]. Available: https://github.com/cvlab-epfl/gaussian-splatting-web (visited on 06/06/2024).
[15] Wikipedia contributors, 68–95–99.7 rule, in Wikipedia, Page Version ID: 1214533040, Mar. 19, 2024. [Online]. Available: https://en.wikipedia.org/w/index.php?title=68%E2%80%9395%E2%80%9399.7_rule&oldid=1214533040#Table_of_numerical_values (visited on 04/24/2024).
[16] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-NeRF 360: Unbounded anti-aliased neural radiance fields,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), ISSN: 2575-7075, Jun. 2022, pp. 5460–5469. doi: 10.1109/CVPR52688.2022.00539. [Online]. Available: https://ieeexplore.ieee.org/document/9878829 (visited on 06/06/2024).
[17] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics, vol. 36, no. 4, 78:1–78:13, Jul. 20, 2017, issn: 0730-0301. doi: 10.1145/3072959.3073599. [Online]. Available: https://dl.acm.org/doi/10.1145/3072959.3073599 (visited on 04/30/2024).
[18] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow, “Deep blending for free-viewpoint image-based rendering,” ACM Transactions on Graphics, vol. 37, no. 6, 257:1–257:15, Dec. 4, 2018, issn: 0730-0301. doi: 10.1145/3272127.3275084. [Online]. Available: https://dl.acm.org/doi/10.1145/3272127.3275084 (visited on 04/30/2024).
[19] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004, Conference Name: IEEE Transactions on Image Processing, issn: 1941-0042. doi: 10.1109/TIP.2003.819861. [Online]. Available: https://ieeexplore.ieee.org/document/1284395 (visited on 04/12/2024).
[21] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, Jan. 29, 2017. doi: 10.48550/arXiv.1412.6980. arXiv: 1412.6980[cs]. [Online]. Available: http://arxiv.org/abs/1412.6980 (visited on 05/02/2024).
[22] M. Zwicker, H. Pfister, J. van Baar, and M. Gross, “EWA volume splatting,” in Proceedings Visualization, 2001. VIS ’01., Oct. 2001, pp. 29–538. doi: 10.1109/VISUAL.2001.964490. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/964490 (visited on 12/20/2023).
[23] D. Merrill and M. Garland, “Single-pass parallel prefix scan with decoupled look-back,” NVIDIA Corporation, NVR-2016-002, Mar. 1, 2016. [Online]. Available: https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back.
[24] CCCL Development Team, CCCL: CUDA c++ core libraries, Jun. 6, 2024. [Online]. Available: https://github.com/NVIDIA/cccl (visited on 06/06/2024).
[25] CCCL Development Team. “Cub::DeviceScan.” (Jun. 6, 2024), [Online]. Available: https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html (visited on 06/06/2024).
[26] A. Adinets and D. Merrill, Onesweep: A faster least significant digit radix sort for GPUs, Jun. 3, 2022. doi: 10.48550/arXiv.2206.01784. arXiv: 2206.01784[cs]. [Online]. Available: http://arxiv.org/abs/2206.01784 (visited on 11/06/2023).
[27] CCCL Development Team. “Cub::DeviceRadixSort.” (Jun. 6, 2024), [Online]. Available: https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceRadixSort.html (visited on 06/06/2024).
[28] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, Graphdeco-inria/gaussian-splatting, Nov. 1, 2023. [Online]. Available: https://github.com/graphdeco-inria/gaussian-splatting/tree/2eee0e26d2d5fd00ec462df47752223952f6bf4e (visited on 06/07/2024).
[29] I. T. Jolliffe and J. Cadima, “Principal component analysis: A review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 374, no. 2065, p. 20150202, Apr. 13, 2016, Publisher: Royal Society. doi: 10.1098/rsta.2015.0202. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202 (visited on 06/03/2024).
[30] R. Levine and R. Levien, Prefix-sum.wgsl, May 25, 2022. [Online]. Available: https://github.com/reeselevine/webgpu-litmus/blob/67e61fd6e6130a62a9f0af28d3932b0b9418c02c/shaders/prefix-sum.wgsl (visited on 06/07/2024).
[31] R. Levien. “Prefix sum on portable compute shaders,” Raph Levien’s blog. (Nov. 17, 2021), [Online]. Available: https://raphlinus.github.io/gpu/2021/11/17/prefix-sum-portable.html (visited on 06/07/2024).
[32] R. Levien and R. Dodd. “Sorting,” Sorting. (Jan. 28, 2024), [Online]. Available: https://github.com/linebender/linebender.github.io/blob/34ee60d6eecc08249c8930ed8e968ed39769492f/content/wiki/gpu/sorting.md (visited on 05/13/2024).
[33] R. Levien, Googlefonts/compute-shader-101, Dec. 27, 2023. [Online]. Available: https://github.com/googlefonts/compute-shader-101/blob/9f882d8d7d2fad98372d04350020c6cd672c1a72/compute-shader-hello/src/shader.wgsl (visited on 06/07/2024).
[34] T. Harada and L. Howes, “Introduction to GPU radix sort,” Advanced Micro Devices, Inc., 2011. [Online]. Available: https://gpuopen.com/download/publications/Introduction_to_GPU_Radix_Sort.pdf (visited on 05/13/2024).
[35] S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens, “GPU multisplit: An extended study of a parallel algorithm,” ACM Transactions on Parallel Computing, vol. 4, no. 1, 2:1–2:44, Aug. 23, 2017, issn: 2329-4949. doi: 10.1145/3108139. [Online]. Available: https://dl.acm.org/doi/10.1145/3108139 (visited on 05/13/2024).
[36] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, Graphdeco-inria/diff-gaussian-rasterization, Aug. 23, 2023. [Online]. Available: https://github.com/graphdeco-inria/diff-gaussian-rasterization (visited on 06/06/2024).
[37] NVIDIA Corporation, CUDA c++ programming guide, Release 12.5, May 20, 2024. [Online]. Available: https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf (visited on 06/07/2024).
[38] Í. Quílez. “Working with ellipses.” (2006), [Online]. Available: https://iquilezles.org/articles/ellipses/ (visited on 05/23/2024).
[39] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, ISSN: 2575-7075, Jun. 2018, pp. 586–595. doi: 10.1109/CVPR.2018.00068. [Online]. Available: https://ieeexplore.ieee.org/document/8578166 (visited on 04/12/2024).
[40] B. Kerbl, G. Kopanas, T. Leimkuehler, and G. Drettakis, SIBR gaussian viewer, Nov. 1, 2023. [Online]. Available: https://gitlab.inria.fr/sibr/sibr_core/-/tree/4ae964a/src/projects/gaussianviewer (visited on 06/07/2024).
[41] B. Jones, K. Ninomiya, and J. Blandy, “WebGPU: Timestamp query,” W3C, W3C Working Draft, Jun. 2024. [Online]. Available: https://www.w3.org/TR/2024/WD-webgpu-20240606/#timestamp.
[42] F. Beaufort. “What’s new in WebGPU (chrome 120),” Chrome for Developers. (Dec. 8, 2023), [Online]. Available: https://developer.chrome.com/blog/new-in-webgpu-120#timestamp_queries_quantization (visited on 06/07/2024).
[43] MDN contributors. “GPUCommandEncoder: writeTimestamp(),” Web APIs | MDN. (Mar. 30, 2024), [Online]. Available: https://developer.mozilla.org/en-US/docs/Web/API/GPUCommandEncoder/writeTimestamp (visited on 06/07/2024).
[44] F. Beaufort. “Remove GPUCommandEncoder.writeTimestamp,” GitHub. (Nov. 23, 2023), [Online]. Available: https://github.com/gpuweb/gpuweb/commit/6402899da70eed1379ec002e37d7e6e2273d09f9 (visited on 06/07/2024).
[45] F. Beaufort. “Gate GPUCommandEncoder.writeTimestamp behind allow_unsafe_apis,” Google Git - Dawn. (Nov. 9, 2023), [Online]. Available: https://dawn.googlesource.com/dawn/+/d61514719334478955a230b597f57efec273e983 (visited on 06/07/2024).
[46] G. Tavares, Webgpu-memory, version 1.4.2, Oct. 15, 2023. [Online]. Available: https://www.npmjs.com/package/webgpu-memory/v/1.4.2 (visited on 06/07/2024).

A Additional Evaluation Results

A.1 Web-Based Renderer

A.1.1 Run-Time Performance for Medium Views

Table A.1: Run-time performance of the web-based renderers compared to the 3DGS renderer for medium views of each scene. The “Time” metric denotes GPU frame time in milliseconds and the “Mem” metric denotes approximate maximum GPU memory usage in mebibytes.
The colored backgrounds indicate the 1st, 2nd, and 3rd “best” method for each metric and scene.

                    Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
Method          Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS            3.84  585   3.51  630   2.58  382   2.05  345   2.90  515   2.59  486   3.06  433   2.94  565   2.94  493
3DGS-web        5.04  398   4.63  419   2.54  326   2.44  300   3.80  369   2.78  367   3.26  346   4.27  388   3.59  365
3DGS-web-opt    3.96  398   3.81  419   2.17  326   1.87  300   2.95  369   2.37  367   2.60  346   3.11  388   2.85  365
geometry        2.35  158   2.23  177   1.34   91   2.08   68   2.34  131   1.52  129   1.41  110   2.89  149   2.02  127
geometry-opt    1.68  158   1.74  177   1.18   91   1.47   68   1.63  131   1.22  129   1.33  110   1.80  149   1.51  127

A.1.2 High-Resolution Output Images

The following results show the image quality (Table A.2) and run-time performance (Table A.3) for the web-based renderers and the 3DGS renderer when using output image dimensions of 2000 × 2000 pixels.

Table A.2: Image similarity of rendered images from the web-based renderers compared to the 3DGS renderer averaged over all test cases, but using output image dimensions of 2000 × 2000 pixels. For the three objective image similarity metrics SSIM, PSNR, and LPIPS, the upward and downward arrows indicate whether larger or smaller values, respectively, correspond to higher similarity. The colored backgrounds indicate the 1st, 2nd, and 3rd “best” method for each metric.

Method         SSIM↑   PSNR↑   LPIPS↓
3DGS-web       0.964   49.32   0.033
3DGS-web-opt   0.999   54.29   0.002
geometry       1.000   65.73   0.000
geometry-opt   1.000   57.82   0.000

Table A.3: Run-time performance of the web-based renderers compared to the 3DGS renderer, but using output image dimensions of 2000 × 2000 pixels. The “Time” metric denotes GPU frame time in milliseconds and the “Mem” metric denotes approximate maximum GPU memory usage in mebibytes. The colored backgrounds indicate the 1st, 2nd, and 3rd “best” method for each metric and scene.
                    Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
Method          Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS            6.43 1430   5.40 1211   3.89  988   4.52  960   5.86 1368   3.38  897   3.76  986   7.93 1680   5.15 1191
3DGS-web       10.99  449   9.85  470   6.15  377   8.14  352  10.31  420   5.50  418   6.49  398  11.89  440   8.67  416
3DGS-web-opt    6.90  449   5.72  470   3.54  377   4.19  352   6.39  420   3.28  418   3.34  398   8.84  440   5.27  416
geometry       14.85  222   9.93  241   5.10  156  12.03  132  14.56  195   4.67  193   5.74  175  23.38  213  11.28  191
geometry-opt    5.89  222   4.71  241   2.57  156   4.60  132   5.81  195   2.53  193   2.50  175   8.91  213   4.69  191

(a) All test cases

                    Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
Method          Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS            9.88 1430   7.44 1211   5.30  988   7.44  960   9.53 1368   4.00  897   3.85  986  14.85 1680   7.79 1191
3DGS-web       17.58  449  15.99  470  11.64  377  14.18  352  16.99  420   8.86  418  10.88  398  18.49  440  14.33  416
3DGS-web-opt   11.53  449   8.49  470   5.29  377   7.22  352  10.79  420   3.98  418   3.87  398  16.70  440   8.48  416
geometry       35.51  222  22.24  241  11.39  156  26.08  132  33.81  195   9.44  193  12.91  175  56.41  213  25.97  191
geometry-opt   12.98  222   9.59  241   4.71  156   8.98  132  12.64  195   4.27  193   4.37  175  20.47  213   9.75  191

(b) Only test cases with close-up views

                    Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
Method          Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS            4.47  952   4.22  979   2.90  702   3.72  762   4.63  914   3.22  807   2.77  748   5.49  982   3.93  856
3DGS-web        9.87  449   8.53  470   3.74  377   8.05  352  10.28  420   4.55  418   3.91  398  13.06  440   7.75  416
3DGS-web-opt    4.70  449   4.52  470   2.66  377   3.44  352   5.19  420   3.19  418   2.60  398   6.60  440   4.11  416
geometry        7.04  222   5.55  241   2.74  156   8.51  132   8.11  195   3.21  193   3.02  175  11.66  213   6.23  191
geometry-opt    3.12  222   2.89  241   1.76  156   3.62  132   3.41  195   2.01  193   1.80  175   4.80  213   2.93  191

(c) Only test cases with medium views

                    Chair       Drums       Ficus      Hotdog        Lego   Materials         Mic        Ship        Avg.
Method          Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem   Time  Mem
3DGS            4.95  844   4.55  891   3.48  650   2.39  598   3.42  768   2.91  753   4.67  700   3.44  824   3.73  754
3DGS-web        5.51  449   5.04  470   3.09  377   2.20  352   3.67  420   3.09  418   4.69  398   4.12  440   3.92  416
3DGS-web-opt    4.48  449   4.14  470   2.67  377   1.93  352   3.19  420   2.67  418   3.54  398   3.21  440   3.23  416
geometry        2.02  222   2.00  241   1.16  156   1.49  132   1.77  195   1.36  193   1.29  175   2.08  213   1.65  191
geometry-opt    1.58  222   1.66  241   1.25  156   1.21  132   1.39  195   1.30  193   1.32  175   1.46  213   1.40  191

(d) Only test cases with wide views
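
The frame times in Tables A.1 and A.3 can be turned into relative speed-ups with respect to the 3DGS renderer. The following Python sketch is only an illustration and not part of the evaluation tooling: it assumes that speed-up is defined as the baseline frame time divided by the method frame time, and the names AVG_FRAME_TIME_MS, speedup, and speedups_vs_baseline are hypothetical. The numbers are copied from the “Avg.” Time columns of the tables.

```python
# Average GPU frame times in milliseconds, copied from the "Avg." Time
# columns of Table A.1 and Table A.3(a)-(d). The dictionary name and
# layout are hypothetical and chosen only for this illustration.
AVG_FRAME_TIME_MS: dict[str, dict[str, float]] = {
    "A.1 medium views": {
        "3DGS": 2.94, "3DGS-web": 3.59, "3DGS-web-opt": 2.85,
        "geometry": 2.02, "geometry-opt": 1.51,
    },
    "A.3(a) all test cases": {
        "3DGS": 5.15, "3DGS-web": 8.67, "3DGS-web-opt": 5.27,
        "geometry": 11.28, "geometry-opt": 4.69,
    },
    "A.3(b) close-up views": {
        "3DGS": 7.79, "3DGS-web": 14.33, "3DGS-web-opt": 8.48,
        "geometry": 25.97, "geometry-opt": 9.75,
    },
    "A.3(c) medium views": {
        "3DGS": 3.93, "3DGS-web": 7.75, "3DGS-web-opt": 4.11,
        "geometry": 6.23, "geometry-opt": 2.93,
    },
    "A.3(d) wide views": {
        "3DGS": 3.73, "3DGS-web": 3.92, "3DGS-web-opt": 3.23,
        "geometry": 1.65, "geometry-opt": 1.40,
    },
}


def speedup(baseline_ms: float, method_ms: float) -> float:
    """Assumed definition: baseline time / method time (>1 means faster)."""
    if baseline_ms <= 0 or method_ms <= 0:
        raise ValueError("frame times must be positive")
    return baseline_ms / method_ms


def speedups_vs_baseline(
    times: dict[str, float], baseline: str = "3DGS"
) -> dict[str, float]:
    """Return the speed-up of every method relative to the baseline method."""
    base = times[baseline]
    return {method: speedup(base, t) for method, t in times.items()}


if __name__ == "__main__":
    # Print one summary line per table, e.g. "geometry-opt: 1.95x".
    for table, times in AVG_FRAME_TIME_MS.items():
        ratios = speedups_vs_baseline(times)
        summary = ", ".join(f"{m}: {r:.2f}x" for m, r in ratios.items())
        print(f"{table}: {summary}")
```

Under this definition, the averages in Table A.1 give geometry-opt a speed-up of roughly 1.95 over 3DGS, whereas the close-up averages in Table A.3(b) give a ratio of roughly 0.80, meaning geometry-opt is slower than 3DGS in that setting.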