Edge‐Friend: Fast and Deterministic Catmull‐Clark Subdivision Surfaces

We present edge‐friend, a data structure for quad meshes with access to neighborhood information required for Catmull‐Clark subdivision surface refinement. Edge‐friend enables efficient real‐time subdivision surface rendering. In particular, the resulting algorithm is deterministic, does not require hardware support for atomic floating‐point arithmetic, and is optimized for efficient rendering on GPUs. Edge‐friend exploits the fact that after one subdivision step, two edges can be uniquely and implicitly assigned to each quad. Additionally, edge‐friend is a compact data structure, adding little overhead. Our algorithm is simple to implement in a single compute shader kernel, and requires minimal synchronization, which makes it particularly suited for asynchronous execution. We easily extend our kernel to support relevant Catmull‐Clark subdivision surface features, including semi‐smooth creases, boundaries, animation and attribute interpolation. In case of topology changes, our data structure requires little preprocessing, making it amenable to a variety of applications, including real‐time editing and animations. Our method can process and render billions of triangles per second on modern GPUs. For a sample mesh, our algorithm generates and renders 2.9 million triangles in 0.58 ms on an AMD Radeon RX 7900 XTX GPU.


Introduction
Catmull-Clark subdivision [CC78] is a surface modeling algorithm that generates a dense and smooth quad mesh from a sparse polygon control mesh. Today, it is a widespread modeling technique. A subdivision step splits the control-mesh polygons into quads and transforms the quads' positions according to refinement rules. The resulting mesh is repeatedly subdivided until it is dense enough.
The use of graphics processing units (GPUs) has proved advantageous: GPU data-parallelism assists in achieving high subdivision speed. Moreover, the final mesh can be rendered directly from GPU memory. However, parallelizing subdivision is not trivial, and existing approaches suffer from drawbacks, which we improve on with the following contributions:
A novel quad-mesh-connectivity data structure. Subdivision requires neighbor information through suitable data structures. Existing ones suffer from a large memory footprint [PEO09, MWS*20, DV21]. We propose a more lightweight data structure that allows quick access to neighbor information. We show that it is sufficient to store only two references to neighbor edges per quad, called edge-friends, see Fig. 1. Since subdivision tends to be a memory-bound problem, this increases subdivision speed. Furthermore, we organize edge-friend in a spatially coherent memory layout, which makes memory access patterns beneficial for GPU performance.
An atomic-operation free gathering approach. Implementations using atomic floating-point operations [PEO09, MWS*20, DV21] come with a performance penalty and require vendor- and API-specific extensions. Moreover, the non-deterministic scheduling of atomic operations changes their order on a frame-by-frame basis. With floating-point arithmetic not being associative, flickering artifacts can occur, even between frames with identical inputs. We completely eliminate atomic operations by expressing subdivision with gather operations as opposed to atomic scatter operations.
A single synchronization barrier. Many existing GPU approaches [PEO09, MWS*20, DV21] need multiple dependent compute-kernel dispatches with expensive barrier synchronizations for a single subdivision step. We require only a single dispatch, and thus a single synchronization barrier per subdivision iteration, which provides additional performance benefits.
Low pre-processing cost. Some methods trade fast surface evaluation against expensive pre-processing [NLMD12, Pix22]. This slows down modeling, animation, and simulation tasks that require topology changes like face, edge, or point insertion and deletion. Our approach requires negligible pre-processing, enabling real-time geometric and topological edits.
Simple and extensible. Additionally, our subdivision compute kernel is simple and we demonstrate how to easily integrate relevant subdivision features like boundaries [Nas87], semi-smooth creases [DKT98], animation, and attribute interpolation, which makes our method attractive for production environments.
However, our approach has the following limitations: Surface evaluation with hardware tessellation [NLMD12, Pix22] remains faster, but requires substantially more pre-processing than our method. We obtain conformal, crack-free meshes with uniform subdivision; however, like other methods [MWS*20, DV21], we do not handle crack-free adaptive subdivision.
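The determinism argument of the atomic-operation free contribution rests on floating-point addition not being associative. The following minimal Python sketch illustrates the effect with made-up values (they are not taken from the paper): the same three contributions, accumulated in two different orders as a non-deterministic scheduler might do, yield two different sums.

```python
def accumulate(values, order):
    """Sum `values` in the given order, mimicking one possible scheduling."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Illustrative contributions only; any values with rounding error would do.
contributions = [0.1, 0.2, 0.3]

frame_a = accumulate(contributions, [0, 1, 2])  # (0.1 + 0.2) + 0.3
frame_b = accumulate(contributions, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# The two results differ in the last bit although the inputs are identical.
print(frame_a == frame_b, frame_a - frame_b)
```

A gather formulation fixes the summation order per output element, so identical inputs always produce identical results.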

Related Work
Catmull-Clark Subdivision Rules [CC78] update the input-mesh vertices and create one new point per face and edge. A new face-point is the centroid of the face. A new edge-point is the average of the edge's incident new face-points $f_0$, $f_1$ and vertices $v_0$, $v_1$:
$$e = \frac{f_0 + f_1 + v_0 + v_1}{4}. \tag{1}$$
With the valence $n$ of an old vertex $v$, the average $Q$ of all its adjacent face-points, and the average $R$ of the midpoints of all edges incident to $v$, we get the new vertex-point
$$v' = \frac{Q + 2R + (n - 3)\,v}{n}. \tag{2}$$
An extra-ordinary vertex has valence $n \neq 4$. One vertex-point and one face-point, as well as two edge-points, define a new quad.
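The three refinement rules above can be sketched in a few lines of Python. This is a minimal illustration on plain coordinate tuples; the function names are hypothetical and not part of the paper.

```python
def average(points):
    """Component-wise average of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def face_point(corners):
    """New face-point: centroid of the face's corner positions."""
    return average(corners)


def edge_point(f0, f1, v0, v1):
    """New edge-point, Eq. (1): average of two face-points and two endpoints."""
    return average([f0, f1, v0, v1])


def vertex_point(v, face_points, edge_midpoints):
    """New vertex-point, Eq. (2), for an old vertex v of valence n."""
    n = len(face_points)
    q = average(face_points)      # Q: average of adjacent face-points
    r = average(edge_midpoints)   # R: average of incident edge midpoints
    return tuple((q[k] + 2.0 * r[k] + (n - 3) * v[k]) / n for k in range(3))


# Example: the corner (0, 0, 0) of a unit cube has valence 3.
cube_corner = vertex_point(
    (0.0, 0.0, 0.0),
    [(0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)],
    [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5)],
)
```

For the cube corner, every coordinate of the result is 2/9, i.e. the corner is pulled towards the cube center.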
Patch-based Subdivision methods split the control mesh into patches. A patch consists of a face together with the context required for local subdivision [BS02]. While leading to data duplication and redundant computations, each patch can be treated independently, enabling both parallel and adaptive subdivision. Extra triangles close cracks between different levels. Floating-point rounding errors can still cause cracks, but a consistent operation order ensures identical results where patches meet [NLMD12].
Breadth-First Subdivision algorithms apply the subdivision rules on an entire mesh in parallel. Our method falls into this category. These methods build mesh data structures that contain neighbor information [DV21, PEO09, MWS*20]. For example, to compute an edge-point, the data structure must contain information about the faces connecting to an edge.
Direct Evaluation Catmull-Clark subdivision surfaces generalize uniform bi-cubic tensor-product B-Spline surfaces and overcome their topological limitations. Both surfaces are identical for quads with eight adjacent quads. There, direct B-Spline evaluation is more efficient than subdivision. As the refinement rules maintain the number of extra-ordinary vertices, the directly evaluable proportion of the surface grows with each subdivision. Specialized methods exist for direct evaluation of quads with one isolated extra-ordinary vertex [Sta98], at an extra-ordinary boundary [LB07], and with a single semi-sharp crease [NLG12]. For other configurations, we are only aware of approximations [LS08]. Hybrid methods [NLMD12, BFK*16] subdivide until direct evaluation is possible. These methods leverage hardware tessellation, minimizing I/O and enabling adaptive rendering.

Edge-Friend
We first describe the creation of our edge-friend data structure. Next, we derive edge-friend refinement rules and present a cache-coherent vertex-memory layout. Furthermore, we demonstrate how edge-friend handles important extensions, including boundaries. Finally, we discuss the rendering and attribute interpolation of our subdivided mesh.

Creation
We obtain an edge-friend mesh during a first subdivision iteration. This can be performed as a pre-process or every frame with any subdivision method supporting arbitrary polygons. After one subdivision, we obtain a quad-only mesh necessary for our data structure.
Let the index buffer of the $d$-th subdivision be $I_d$ (cf. Fig. 2b), where level $d = 0$ is the unprocessed input. We call an element of the index buffer a corner [RSS03], see red numbers in Fig. 2b. Each input corner of $I_{d-1}$ maps to a new quad in $I_d$. Here, we make our key observation: the corners inside a quad can be rotated arbitrarily without causing topological changes. We always start a new quad with the vertex index of the old corner, continue with the adjacent edge-point and face-point along the winding order, and conclude with the second edge-point. Given a quad with the corners $(v_0, v_1, v_2, v_3)$, we call the opposing edges $v_0v_1$ and $v_2v_3$ the on-edges of the quad. The opposing edges $v_1v_2$ and $v_3v_0$ are called the off-edges of the quad. Fig. 2a shows an example with on-edges marked by additional lines. Fig. 2b shows the converted index buffer $I_d$ and its corners $c$.
An on-edge of one quad is an off-edge in an adjacent quad. In Fig. 2, the edge between vertices 11 and 7 is an on-edge of quad 0, but an off-edge in quad 2.
Let $c$ be a corner index, $e$ be an edge index, and $\oplus$ be the bit-wise exclusive-or operation. Using bit-logic,
• $\lfloor c/4 \rfloor$ yields the quad index of a corner,
• EQUAD$(e) = \lfloor e/2 \rfloor$ yields the quad index of an edge,
• DIAG$(c) = c \oplus 2$ yields the corner index across the quad, and
• OFF$(c) = c \oplus 3$ yields the corner index along the off-edge.
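The bit-logic above maps directly to shifts and exclusive-or operations. A minimal Python sketch follows; `corner_quad` is a hypothetical name for the first, unnamed operation, and the remaining functions mirror EQUAD, DIAG, and OFF.

```python
def corner_quad(c):
    """Quad index of corner c (four corners per quad). Hypothetical name."""
    return c >> 2


def equad(e):
    """EQUAD: quad index of on-edge e (two on-edges per quad)."""
    return e >> 1


def diag(c):
    """DIAG: corner diagonally across the quad (0<->2, 1<->3)."""
    return c ^ 2


def off(c):
    """OFF: other corner on the same off-edge (1<->2, 3<->0)."""
    return c ^ 3


# Corner 21 is local corner 1 of quad 5; its off-edge partner is local corner 2.
example = (corner_quad(21), diag(21) & 3, off(21) & 3)
```

Because only the two lowest bits of a corner index change, DIAG and OFF never leave the quad of the input corner.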
To add neighborhood information, we assign both off-edges of a quad the corresponding on-edge indices. Thus, each quad consists of two opposing on-edges and references two on-edges of neighboring quads. We call these references the edge-friends of a quad. We store edge-friends in a friend buffer $G_d$, where $d$ is the subdivision level. An element of $G_d$ is the tuple $(g_0, g_1)$, where $g_0$ is the friend of off-edge $v_1v_2$ and $g_1$ the friend of off-edge $v_3v_0$. Fig. 2b shows an example of $G_d$.
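For each vertex, we select an arbitrary corner and make that corner index a vertex attribute. We call this attribute valence loop start, as it enables gathering points from adjacent faces to update a vertex position. This allows atomic-operation free vertex updates. We denote the respective buffer $L_d$, exemplified in Fig. 2c. We get the buffer sizes from the vertex and face counts $V_d$ and $F_d$: $I_d$ holds $4F_d$ corners, $G_d$ holds $F_d$ friend tuples, and $L_d$ holds $V_d$ corner indices. The buffer sizes increase exponentially with every subdivision step, since each quad is split into four and one new point is created per face and edge:
$$F_{d+1} = 4F_d, \qquad V_{d+1} = V_d + 3F_d.$$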
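The growth of the buffers can be tabulated with a short Python sketch. It assumes the recurrences that follow from the text (each quad yields four quads; one new point per face and per edge, with two uniquely assigned edges per quad) and counts sizes in elements, not bytes. The function name and dictionary keys are illustrative.

```python
def level_sizes(num_quads, num_vertices, levels):
    """Element counts of the index, friend, and loop-start buffers per level.

    Level 0 of the result is the given quad-only edge-friend mesh.
    """
    sizes = []
    f, v = num_quads, num_vertices
    for d in range(levels + 1):
        sizes.append({
            "level": d,
            "quads": f,
            "vertices": v,
            "index_buffer": 4 * f,   # four corners per quad
            "friend_tuples": f,      # one (g0, g1) tuple per quad
            "loop_starts": v,        # one corner index per vertex
        })
        # Assumed recurrences: F' = 4F, V' = V + F (face-points) + 2F (edge-points).
        f, v = 4 * f, v + 3 * f
    return sizes


# A cube after the pre-processing iteration has 24 quads and 26 vertices.
cube_levels = level_sizes(24, 26, 2)
```

For closed genus-0 meshes the relation V = F + 2 is preserved at every level, which the example reproduces.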

Edge-friend Refinement
We split edge-friend subdivision into two tasks: the quad task and the vertex task. Both tasks run in the same compute shader kernel.
Quad Task With one thread per quad, the quad task computes the face-point, the two edge-points of the off-edges of the quad, the indices for four new quads, and eight new edge-friends. First, we load both edge-friends $(g_0, g_1) \leftarrow G_d[q]$ of the current quad with index $q$. Remember that the $g_i$ are edge indices. Hence, we obtain the corresponding corner indices by $2g_i + 0$ and $2g_i + 1$. Then, we can read all required vertex indices $v_i$ and $v'_i$, as shown in Fig. 3a. With $v_i$, $v'_i$, we load the vertex positions and compute the face-point of the center quad and the edge-points of the off-edges according to Eq. (1). We store the generated face-point in the new vertex buffer at index $f = 4q + 1$, and the generated edge-points at
$$e_0 = 4\,\mathrm{EQUAD}(g_0) + 2 + (g_0 \bmod 2), \qquad e_1 = 4\,\mathrm{EQUAD}(g_1) + 2 + (g_1 \bmod 2),$$
i.e., next to the face-point of the friends of $q$. This improves caching, as shown in Sec. 3.3. Moreover, the quad task writes the four new quads emerging from the old quad. The locations in the new index buffer for the four new quads $q_i$, $i \in [0, 3]$, are $q_i = 4q + i$. Again, each new quad starts at an original vertex position, continues with an edge-point and the face-point, and concludes with the second edge-point, where $w_i$ denotes the index of $v_i$ in the new vertex buffer. To add the new friend relations, we need the indices $n_i$ of the neighboring new quads, where $n_i$ is adjacent to $q_i$, for example $n_0 = 4\,\mathrm{EQUAD}(g_1) + 2(g_1 \bmod 2) + 0$. Using these, we can compute and write the eight new friend indices. To conclude the quad task, we write the new valence loop start corners for the three newly generated vertices. It is valid to choose any of the connected corners. For an example, see Fig. 3b.
Figure 3: Quad Task. (a) Each quad task uses the edge-friends $g_i$ of quad $q$ to load the vertices $v_i$ and $v'_i$. (b) Next, the quad task computes the new face-point of $q$ and stores it at index $f$ in $V_{d+1}$. Additionally, it computes the new edge-points and stores them at $e_i$ in $V_{d+1}$. With $v_i$, $e_i$, $f$, $4q + 2$, and $4q + 3$ the new quads $q_i$ are written to $I_{d+1}$. For the friend relations, shown as arrows, we require the neighbor quads $n_i$. Finally, the quad task writes the three new valence loop start corners for the generated vertices to $L_{d+1}$. All elements generated by the quad task are highlighted in blue.
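The index bookkeeping of the quad task can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' kernel: edge-point slots follow the chunk layout of Sec. 3.3 (third and fourth slot of a chunk), corner $2g$ of a friend edge is assumed to hold the far end of the shared edge in winding order, the formulas for $n_1$, $n_2$, $n_3$ and the split of the eight friend writes between own and neighboring quads are derived by analogy to $n_0$, and `new_vertex_index` assumes surplus vertices are appended after the chunks. Position computation and the $L_{d+1}$ writes are omitted.

```python
def equad(e):
    return e >> 1


def new_vertex_index(v, num_quads):
    # ASSUMPTION: w = 4v while a chunk exists, surplus vertices follow the chunks.
    return 4 * v if v < num_quads else 3 * num_quads + v


def quad_task(q, I, G, I_new, G_new):
    """Write the four child quads of q and eight friend entries (indices only)."""
    num_quads = len(G)
    g0, g1 = G[q]
    w = [new_vertex_index(I[4 * q + i], num_quads) for i in range(4)]
    f = 4 * q + 1                      # face-point slot
    a, b = 4 * q + 2, 4 * q + 3        # edge-points of the own on-edges
    e0 = 4 * equad(g0) + 2 + (g0 % 2)  # edge-point of off-edge v1v2
    e1 = 4 * equad(g1) + 2 + (g1 % 2)  # edge-point of off-edge v3v0
    children = [(w[0], a, f, e1), (w[1], e0, f, a),
                (w[2], b, f, e0), (w[3], e1, f, b)]
    for i, child in enumerate(children):
        I_new[4 * (4 * q + i):4 * (4 * q + i) + 4] = child
    # Neighboring child quads across the two off-edges (n0 as in the text).
    n0 = 4 * equad(g1) + 2 * (g1 % 2)
    n3 = n0 + 1
    n2 = 4 * equad(g0) + 2 * (g0 % 2)
    n1 = n2 + 1
    q0, q1, q2, q3 = (4 * q + i for i in range(4))
    G_new[q0][0], G_new[q0][1] = 2 * q1 + 1, 2 * n0
    G_new[q1][0] = 2 * q2 + 1
    G_new[q2][0], G_new[q2][1] = 2 * q3 + 1, 2 * n2
    G_new[q3][0] = 2 * q0 + 1
    G_new[n1][1] = 2 * q1              # written into the neighbor's slot
    G_new[n3][1] = 2 * q3              # written into the neighbor's slot


# Hand-built edge-friend cube (vertex index = x + 2y + 4z), used as test input.
I0 = [2, 3, 1, 0, 4, 5, 7, 6, 1, 5, 4, 0, 2, 6, 7, 3, 4, 6, 2, 0, 1, 3, 7, 5]
G0 = [(10, 9), (11, 8), (2, 1), (3, 0), (6, 5), (7, 4)]
I1 = [None] * (4 * len(I0))
G1 = [[None, None] for _ in range(4 * len(G0))]
for quad in range(len(G0)):            # one "thread" per quad
    quad_task(quad, I0, G0, I1, G1)
```

In this sketch every friend slot of the new level is written exactly once, two of the eight writes of each thread going to child quads of its off-edge neighbors, so no atomics are needed.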
Vertex Task The vertex task updates each vertex position. In order for our algorithm to work in a single dispatch, we need to reformulate the vertex-point update rule to not depend on any points computed during the same subdivision step. After our required pre-processing iteration, the mesh is quad-only. According to the supplemental material of de Goes et al. [dGDMD16], given the valence $n$ of vertex-point $v^d$, the $n$ points $E_i$ connected to $v^d$ by an edge, and the $n$ points $F_i$ diagonally connected to $v^d$ across a quad, the new vertex-point $v^{d+1}$ is
$$v^{d+1} = \alpha\, v^d + \frac{\beta}{n} \sum_{i=1}^{n} E_i + \frac{\gamma}{n} \sum_{i=1}^{n} F_i, \tag{3}$$
where $\alpha = 1 - \beta - \gamma$, $\beta = \frac{3}{2n}$, and $\gamma = \frac{1}{4n}$. We collect the two sums by iterating over the faces adjacent to the input vertex index $v$, as shown in Algorithm 1. The loop starts at the corner $c \leftarrow L_d[v]$. We compute the required point locations using the previously established bit logic on corners, and add the points to their respective accumulators. To visit the next corner, one of the two friend references of the current quad is used, depending on which off-edge the current corner lies on. We loop around the vertex until we arrive at the starting corner. We then compute the new vertex-point position with Eq. (3). Typically, the new vertex index is $w = 4v$; the other case is discussed in Sec. 3.3. An old corner $c$ maps to a new corner $c'$ with $c' = 4c$. We can thus propagate the input valence loop start to $L_{d+1}$.
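The valence loop of the vertex task can be sketched in Python. This is a reconstruction from the prose, not the paper's Algorithm 1: it assumes that corners 1 and 2 use friend $g_0$ and corners 3 and 0 use friend $g_1$, and that the friend edge stores the shared off-edge in reversed direction, so that the next corner is $2g + (c \bmod 2)$. The cube buffers are hand-built test data with vertex index = x + 2y + 4z.

```python
def vertex_task(v, I, G, L, P):
    """Gather-only update of vertex v following Eq. (3); returns (position, valence)."""
    start = L[v]
    c = start
    n = 0
    sum_e = [0.0, 0.0, 0.0]   # edge-connected neighbors E_i
    sum_f = [0.0, 0.0, 0.0]   # diagonally connected neighbors F_i
    while True:
        pe = P[I[c ^ 3]]      # OFF(c): neighbor along the off-edge
        pf = P[I[c ^ 2]]      # DIAG(c): neighbor across the quad
        for k in range(3):
            sum_e[k] += pe[k]
            sum_f[k] += pf[k]
        n += 1
        q, local = c >> 2, c & 3
        # ASSUMPTION: corners 1, 2 lie on off-edge v1v2 (g0); corners 3, 0 on v3v0 (g1).
        g = G[q][0] if local in (1, 2) else G[q][1]
        c = 2 * g + (local & 1)   # corner of the same vertex in the next quad
        if c == start:
            break
    beta = 3.0 / (2.0 * n)
    gamma = 1.0 / (4.0 * n)
    alpha = 1.0 - beta - gamma
    p = P[v]
    new_p = tuple(alpha * p[k] + beta / n * sum_e[k] + gamma / n * sum_f[k]
                  for k in range(3))
    return new_p, n


# Hand-built edge-friend cube: index buffer, friend buffer, loop starts, positions.
I0 = [2, 3, 1, 0, 4, 5, 7, 6, 1, 5, 4, 0, 2, 6, 7, 3, 4, 6, 2, 0, 1, 3, 7, 5]
G0 = [(10, 9), (11, 8), (2, 1), (3, 0), (6, 5), (7, 4)]
L0 = [3, 2, 0, 1, 4, 5, 7, 6]
P0 = [(float(i & 1), float((i >> 1) & 1), float((i >> 2) & 1)) for i in range(8)]

updated = [vertex_task(v, I0, G0, L0, P0) for v in range(8)]
```

Each loop iteration reads only old positions, so the update can run concurrently with the quad task. On the cube, every vertex has valence 3 and moves to coordinates 2/9 or 7/9, matching Eq. (2).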
Task Merge For closed topology of genus 0 with the total number of vertices $V$, the number of edges $E$, and the number of faces $F$, the Euler characteristic states: $V - E + F = 2$. As we have uniquely assigned two edges to each quad, it holds that $E = 2F$. Thus, the number of quad and vertex tasks per closed topology is almost equal: $V = F + 2$. In addition, we do not require one task to finish before the other and thus can merge both tasks into a single compute shader. For the excess vertices, the kernel can just terminate early. This simplifies the implementation and is faster to run, as the algorithm only requires a single synchronization barrier between subdivision iterations.

Vertex Memory Layout
Methods such as the halfedge refinement by Dupuy and Vanhoey [DV21] store the new face- and edge-points behind the last memory address for the updated vertex positions. As a new quad always uses one vertex-point, one face-point, and two edge-points, the distance between the memory addresses accessed at once increases exponentially with every subdivision step. With our method, we can interleave the memory locations of the different point types to achieve better data locality. This speeds up both the next subdivision iteration and the final drawing of the generated geometry. Our vertex buffer is split into chunks of four positions. The first slot of a chunk $i$ is reserved for the updated vertex-point position $w_i$ of vertex $i$. The second slot is reserved for the new face-point $f_i$ of quad $i$. The third and fourth slots are reserved for the edge-points $e_{2i+0}$ and $e_{2i+1}$ of the on-edges of quad $i$. As previously established, the number of vertices differs from the number of faces. Therefore, we have to compensate for vertex indices that lie beyond the last chunk, i.e., beyond the number of faces $F_d$: only an old vertex whose index $v$ still has a chunk maps to the new index $w = 4v$ (cf. Sec. 3.2). For a mesh with a single closed topology, where $V = F + 2$, two vertices exceed the chunks. For other topological genera, the handling of exceeding buffer sizes works analogously.
Algorithm 1: Vertex Task.
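The chunked layout can be expressed as three slot functions. In the sketch below, `face_slot` and `edge_slot` follow directly from the chunk description, while the rule for surplus vertices in `vertex_slot` (appending them after the last chunk) is an assumption made for illustration, since the text only states that such vertices need special handling. All names are hypothetical.

```python
def vertex_slot(v, num_quads):
    """Slot of updated vertex v: first slot of chunk v, if that chunk exists."""
    if v < num_quads:
        return 4 * v
    # ASSUMPTION: surplus vertices are stored contiguously after all chunks.
    return 3 * num_quads + v


def face_slot(q):
    """Slot of the face-point of quad q: second slot of chunk q."""
    return 4 * q + 1


def edge_slot(e):
    """Slot of the edge-point of on-edge e: third or fourth slot of its quad's chunk."""
    return 4 * (e >> 1) + 2 + (e & 1)


def layout(num_quads, num_vertices):
    """Sorted list of (slot, (kind, old index)) for one refinement step."""
    slots = {}
    for v in range(num_vertices):
        slots[vertex_slot(v, num_quads)] = ("w", v)
    for q in range(num_quads):
        slots[face_slot(q)] = ("f", q)
    for e in range(2 * num_quads):
        slots[edge_slot(e)] = ("e", e)
    return sorted(slots.items())


# A closed genus-0 mesh with F = 6 quads and V = F + 2 = 8 vertices (a cube).
cube_layout = layout(6, 8)
```

With this assumption the 26 output positions of the cube fill slots 0 to 25 without gaps, and the four points referenced by a new quad stay close together in memory.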

Semi-Sharp Creases
A common extension of the Catmull-Clark subdivision rules is the use of semi-sharp creases [DKT98]. Individual edges of the control mesh can be assigned a sharpness value $\sigma \in \mathbb{R}_{\geq 0}$. The sharpness denotes whether to apply the original "smooth" rules from Sec. 2 ($\sigma = 0$), or to use additional "sharp", "crease" or "corner" rules in the current subdivision step ($\sigma \geq 1$). A value $0 < \sigma < 1$ denotes a blend between the rules. When a creased edge is subdivided, the two resulting edges receive the sharpness of the original edge minus one, thus $\sigma' = \mathrm{MAX}(0, \sigma - 1)$. Face-points remain the same as in Sec. 2.
Edge-point Given the two points $p_0$ and $p_1$ that connect to an edge, the sharp rule is
$$e_{\mathrm{sharp}} = \frac{p_0 + p_1}{2}.$$
Given the smooth point $e_{\mathrm{smooth}}$ as in Eq. (1), and the sharpness $\sigma$ of the edge, the resulting edge-point is
$$e = \mathrm{LERP}\left(e_{\mathrm{smooth}}, e_{\mathrm{sharp}}, \mathrm{MIN}(\sigma, 1)\right). \tag{4}$$
Vertex-point Given two points $p_0$ and $p_1$ connected to a vertex $v$ with two edges that have a sharpness $\sigma > 0$, the crease rule is
$$v_{\mathrm{crease}} = \frac{p_0 + 6v + p_1}{8},$$
and the corner rule is
$$v_{\mathrm{corner}} = v.$$
Given the smooth point $v_{\mathrm{smooth}}$ as in Eq. (2), the number of edges $m$ that are connected to vertex $v$ with $\sigma > 0$, and the average sharpness $\bar{\sigma}$ of all these edges, the resulting updated vertex-point $v'$ is obtained by selecting or blending among $v_{\mathrm{smooth}}$, $v_{\mathrm{crease}}$, and $v_{\mathrm{corner}}$, depending on $m$ and $\bar{\sigma}$.
We assign a sharpness value to each edge-friend reference. The required adjustments to the edge-point computation directly follow Eq. (4). In addition, we write out the new sharpness values of the generated friend relations. For the vertex task, we add two more accumulators to the loop: $m$ for the number of adjacent edges with sharpness $\sigma > 0$, and a second one for the sum of all their sharpness values. The average sharpness $\bar{\sigma}$ is this sum divided by $m$. For the crease rule, we require the two points $p_0$ and $p_1$. If the loop encounters a creased edge with a connected vertex $e$, we set $p_0 \leftarrow e$ if $m = 0$ and $p_1 \leftarrow e$ otherwise. Some assets require smooth subdivision of the sharpness values, which is known as the Chaikin rule [Cha74]. If this is desired, one could move the refinement of the sharpness values to the vertex task, because there we have access to neighboring sharpness values.
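The edge-point blend of Eq. (4) and the sharpness decay can be sketched in Python. The sharp rule is taken to be the edge midpoint, as in the standard semi-sharp crease formulation; the function names are hypothetical, and the vertex-point blend is intentionally left out.

```python
def lerp(a, b, t):
    """Component-wise linear interpolation from a (t = 0) to b (t = 1)."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def sharp_edge_point(p0, p1):
    """Sharp rule: midpoint of the edge's two endpoints."""
    return tuple((x + y) / 2.0 for x, y in zip(p0, p1))


def creased_edge_point(e_smooth, p0, p1, sigma):
    """Eq. (4): blend smooth and sharp edge-point by MIN(sigma, 1)."""
    return lerp(e_smooth, sharp_edge_point(p0, p1), min(sigma, 1.0))


def child_sharpness(sigma):
    """Sharpness handed to the two child edges: MAX(0, sigma - 1)."""
    return max(0.0, sigma - 1.0)


# An edge from (0,0,0) to (2,0,0) whose smooth edge-point would be (0,1,0).
smooth = (0.0, 1.0, 0.0)
half_sharp = creased_edge_point(smooth, (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), 0.5)
```

With sharpness 0.5 the edge-point lies halfway between the smooth point and the midpoint, and the child edges are fully smooth in the next iteration.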

Meshes with Boundaries
Although our data structure relies on a closed mesh topology, we support mesh boundaries. As semi-sharp creases are supported, we can simply close the geometry with ghost quads and mark previous boundary edges as infinitely sharp. Each boundary consisting of $k$ edges receives $k/2$ ghost quads arranged in a fan and one additional vertex in the center of this fan. Because the pre-processing subdivision iteration splits each edge into two, $k$ is even and $k/2$ is an integer. After subdivision, the ghost quads can simply be ignored for rendering. Some assets require boundary corners to follow the corner rule. To achieve this, we mark the ghost edges attached to this vertex as infinitely sharp.
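Closing a boundary with a fan of ghost quads can be sketched as follows. The function assumes the boundary is given as an ordered loop of vertex indices already oriented for the ghost side; choosing the starting offset so that the boundary edges satisfy the on/off-edge rule with the real quads is not handled here. The function name and the example indices are hypothetical.

```python
def ghost_fan(boundary_loop, center):
    """Return k/2 ghost quads, each covering two consecutive boundary edges."""
    k = len(boundary_loop)
    if k < 4 or k % 2 != 0:
        raise ValueError("expected an even number of at least four boundary edges")
    quads = []
    for i in range(0, k, 2):
        quads.append((boundary_loop[i],
                      boundary_loop[i + 1],
                      boundary_loop[(i + 2) % k],
                      center))
    return quads


# A boundary with k = 6 edges and a new center vertex with index 99.
fan = ghost_fan([10, 11, 12, 13, 14, 15], 99)
```

In this arrangement the spoke between a boundary vertex and the center is the second on-edge of one ghost quad and an off-edge of the next, so the fan is consistent with the on/off-edge rule internally.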

Rendering
After subdividing for the desired number of iterations, we employ a mesh shader to render the refined meshes to the screen. Here, we have to handle the problem of attribute interpolation, most commonly of texture coordinates, but our approach can also be used for other face-varying attributes. While subdivision requires a closed mesh, texture mapping has to slice open a mesh to be able to project it onto the texture. Usually this can be handled by duplicating the vertices at texture map seams on asset creation. One vertex receives the texture coordinate of one side of a texture map seam, and the other one the coordinate of the other side.
To do this duplication in real-time, we form a meshlet from each set of subdivided quads that originate from a single quad of our pre-processed control mesh. This comes with the benefit that the duplication of the vertices along the edges where texture seams can happen is done implicitly and without requiring any additional memory. On pre-processing, we associate one quad with four texture coordinates, thus one per corner, which are loaded and linearly interpolated by the mesh shader. Note that this only covers one possible method for interpolation. OpenSubdiv provides other interpolation methods for face-varying attributes, where the attribute itself is subject to smooth subdivision [Pix22]. If the number of iterations is too great for the output triangle limit of a mesh shader, we additionally employ an amplification shader. The amplification shader splits the data of an original quad into tiles that are within the bounds of the triangle output limit.
The surface normal vector of a vertex is usually not defined by attributes, but by finding the partial derivatives of the refined surface. We compute the normal vectors using the usual cross-product formula in the mesh shader. If desired, it is possible to compute the limit surface normal by using the equations of Halstead et al. [HKD93]. This limit surface projection can also be applied to the vertex positions. The computation of the normal vectors and texture coordinates also allows for normal and displacement mapping. The real-time interpolation of vertex-blend attributes for animation is not necessary, because the transformation of the positions is performed before real-time subdivision.
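The per-quad interpolation of corner attributes can be sketched as a bilinear blend. The parametrization below is an assumption for illustration: corner 0 sits at (s, t) = (0, 0), and the corners follow the winding order to (1, 0), (1, 1), and (0, 1). The function names are hypothetical and the code runs on the CPU, whereas the paper performs this step in a mesh shader.

```python
def interpolate_uv(corner_uvs, s, t):
    """Bilinearly interpolate four per-corner texture coordinates at (s, t)."""
    c0, c1, c2, c3 = corner_uvs
    weights = ((1.0 - s) * (1.0 - t), s * (1.0 - t), s * t, (1.0 - s) * t)
    return tuple(
        sum(w * c[k] for w, c in zip(weights, (c0, c1, c2, c3)))
        for k in range(2)
    )


def meshlet_uvs(corner_uvs, depth):
    """UVs for the (2^depth + 1)^2 vertices refined from one control quad."""
    n = 2 ** depth
    return [[interpolate_uv(corner_uvs, i / n, j / n) for i in range(n + 1)]
            for j in range(n + 1)]


# One control quad mapped to the lower-left quarter of the texture.
uvs = meshlet_uvs([(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)], 2)
```

Because every control quad carries its own four coordinates, vertices on a texture seam receive different values in the two adjacent meshlets without being duplicated in memory.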

Results and Discussion
We evaluate the performance of our method on the meshes of Fig. 4. The collection includes regular meshes, meshes with boundaries, and meshes with semi-sharp creases. Big Guy and Pig start at $d = 0$, as their initial control meshes already comply with our on-off-edge rule after rotating every second quad.
Consider the dual graph of the quad mesh. For our on-off-edge rule to function, we require a closed quad mesh where all dual chord rings [DSSC08] have an even length $2n$. This is true if the closed quad mesh is homeomorphic to a sphere. In other cases, the pre-processing iteration doubles the length of all dual chord rings, forcing all dual chord rings to be of length $2n$. This assures that we always have a consistent edge-friend data structure.
For non-quad-only meshes we perform a pre-processing iteration on the CPU when loading the model. Remember that any subdivision method that supports arbitrary face sizes can be applied for this step. Tab. 1 provides timings of our naïvely parallelized implementation of the pre-processing iteration. We use GPU timers to isolate performance measurements for both subdivision and rendering. All measurements were taken on an AMD Ryzen 9 5950X, together with an AMD Radeon RX 7900 XTX and an NVIDIA RTX 4080.
Table 1: Pre-processing Time. Each cell is the pre-processing time in milliseconds required to create the edge-friend data structure for the meshes from Fig. 4 on the CPU. The measurements include the creation of a hash-map for mesh connectivity.
Mesh       Ogre  Big Guy  Pig   Spot  Rook  Bishop  Car   Imrod
Time (ms)  1.28  0.99     0.26  0.42  0.78  0.82    1.81  2.93
We compare our method to the closely related halfedge refinement by Dupuy and Vanhoey [DV21].We refer to it as Halfedge.
For fair comparison, we ported the publicly available OpenGL Halfedge implementation to our Direct3D12 application. As the input meshes are quad-only, we always employ the quad-only optimization of Halfedge, where the Prev, Next, and Face references of each halfedge can be trivially computed. For simplicity, we always use an implementation that supports semi-sharp creases for both Halfedge and our method, even if a test mesh does not have any edges tagged as such. Furthermore, as we use the regular crease refinement in our implementation, we removed the additional compute dispatch from Halfedge for refining the sharpness values according to the Chaikin rule. Atomic floating-point arithmetic is not supported by all vendors or APIs. To take this case into account, we simulate atomic float addition similar to Patney et al. [PEO09].
Fig. 5 shows our subdivision benchmark results without rendering. As can be seen, our method outperforms Halfedge by a factor of about three. This does not change for meshes with boundaries, where our method has to additionally subdivide ghost quads. Furthermore, the figure reveals that simulating atomic float addition for Halfedge is noticeably slower. Fig. 6 shows the overall run-time required to subdivide and render a mesh. We achieve frame times well below the threshold required for real-time rendering, even when generating more triangles than framebuffer pixels.
Both Halfedge and our algorithm require two temporary blocks of memory for subdivision, one for input and one for output. The roles of both blocks are swapped each iteration: the old output becomes the new input and vice versa. The temporary memory required to hold a single iteration depends on the number of quads $F$ and the number of vertices $V$; we denote it $h(F, V)$ for Halfedge and $e(F, V)$ for our algorithm. Since $F \approx V$, Halfedge requires ca. 25% more memory than our method. Tab. 2 exemplifies this by measurements combining input and output buffer sizes.
Table 2: Temporary Memory Requirements for six subdivisions. Each cell is the temporary memory size in MiB required to subdivide the meshes from Fig. 4.
As expected from a breadth-first subdivision method, both Halfedge and our algorithm are memory bound. To achieve a higher performance, the memory footprint has to be reduced. Our edge-friend achieves this by reducing neighbor information compared to Halfedge, as shown in Tab. 2. Additionally, by combining the edge- and face-point computation, we can make better use of loaded memory. We further benefit from an improved vertex memory layout, increasing the chances of cache hits. Moreover, we use regular writes into global memory, which are faster than those with atomic additions. Unlike our approach, methods based on atomic operations may cause flickering artifacts because of their non-deterministic scheduling. Finally, we only need a single synchronization barrier per iteration, while Halfedge needs three.

Conclusion and Future Work
We introduced a novel quad-based data structure for GPU-parallel Catmull-Clark subdivision. Our algorithm is about three times faster than the latest related method. In future work, we want to expand our method to support adaptive subdivision based on surface flatness and distance to camera.

Bastian Kuth
Figure 1: (a) A control mesh after a pre-processing subdivision iteration is quad-only. (b) Our edge-friend data structure implicitly assigns two opposing edges (red) to each quad. Each quad stores only references to two edges in neighboring quads (blue). (c) We refine the edge-friend structure breadth-first down to level $d = 4$. (d) Refining and rendering the shown model takes under 40 µs on an AMD Radeon RX 7900 XTX GPU.

Figure 2: Edge-friend Data Structure. (a) During pre-processing, each corner of the original mesh (left) maps to a quad in the subdivided mesh (right). We mark the on-edges of each quad with extra edge lines. Vertices are shown as circled numbers and original mesh vertices are shown in bold. (b) Using corner indices $c$ (red numbers), we obtain quads and edges from the index buffer $I_d$. The edge-friend buffer $G_d$ is used to access neighborhood information. (c) To gather neighboring vertices, $L_d$ maps each vertex to a corner. This corner must in turn reference the vertex.

Figure 4: Test Meshes. $F_d$ denotes the number of faces and $V_d$ the number of vertices. Meshes with boundaries require $G_d$ ghost faces and $W_d$ ghost vertices. Big Guy and Pig do not require a pre-processing iteration.

Figure 6: Overall Performance. Our Direct3D12 implementation subdivides to level $d$ and renders each mesh of Fig. 4 to a framebuffer of size 1920 × 1080 using the Blinn-Phong reflection model.