Keywords:

  • ray tracing;
  • MBVH;
  • stackless traversal;
  • SIMD processors;
  • I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Ray tracing

Abstract

  1. Top of page
  2. Abstract
  3. 1. Introduction
  4. 2. Related Work
  5. 3. Algorithm Overview
  6. 4. MBVH2 Traversal
  7. 5. MBVH4 Traversal
  8. 6. Implementation
  9. 7. Results
  10. 8. Conclusions and Future Work
  11. Acknowledgments
  12. Appendix A
  13. Appendix B
  14. Appendix C
  15. References

Stackless traversal algorithms for ray tracing acceleration structures require significantly less storage per ray than ordinary stack-based ones. This advantage is important for massively parallel rendering methods, where there are many rays in flight. On SIMD architectures, a commonly used acceleration structure is the MBVH, which has multiple bounding boxes per node for improved parallelism. It scales to branching factors higher than two, for which, however, only stack-based traversal methods have been proposed so far. In this paper, we introduce a novel stackless traversal algorithm for MBVHs with up to four-way branching. Our approach replaces the stack with a small bitmask, supports dynamic ordered traversal, and has a low computation overhead. We also present efficient implementation techniques for recent CPU, MIC (Intel Xeon Phi) and GPU (NVIDIA Kepler) architectures.

1. Introduction

Ray shooting is an elementary operation in ray tracing that generally involves traversing a hierarchical acceleration structure such as a kd-tree or bounding volume hierarchy (BVH). Ray traversal algorithms can be divided into two main categories: stack-based and stackless algorithms.

Using a stack for the traversal is typically the most straightforward and efficient approach, especially if the traversal order is dynamic. However, if many rays are traced in parallel, the storage and bandwidth costs of maintaining a full stack for each ray can be very high (about 256–1024 bytes of memory per ray). Notable examples of this scenario are dynamic ray scheduling algorithms that improve memory access coherence for random ray distributions [PKGH97, NFLM07, AK10, KSS*13]. For such ray tracing methods, stackless traversal is better suited, particularly if the number of resident rays per core is on the order of thousands or more.

In recent years, the BVH has become the most popular acceleration structure thanks to its high performance [SFD09, ALK12], low memory footprint, fast construction [GPM11, KA13], and efficient dynamic updating [KIS*12]. However, all prior work on stackless BVH traversal has focused on traditional binary BVHs, which are suboptimal on some SIMD architectures (e.g. CPUs).

The multi-bounding volume hierarchy (Multi-BVH or MBVH) [WBB08, EG08, DHK08] is an N-ary tree that provides higher SIMD utilization for shooting incoherent rays. It stores N bounding boxes or triangles per node, organized into SIMD packets, which can be efficiently intersected in parallel with a single ray. Although the MBVH was originally designed for high branching factors, the same principles can be applied to binary trees as well, improving data- or instruction-level parallelism [Ern11, AL09].

In this paper, we propose a new efficient stackless ray traversal algorithm for MBVHs that supports distance-based ordered traversal without restarts. We add parent and sibling pointers to the tree without necessarily increasing the memory footprint, and we replace the regular stack with a compact bitstack, an integer that fits into one or two machine registers. In the bitstack we store skip codes that indicate which siblings of a node must be traversed.

Two variations of the algorithm are presented: one variation for four-way branching MBVHs (MBVH4) and one for binary BVHs having two child boxes per node. We call this kind of binary tree MBVH2 in order to distinguish it from classical BVHs that store a single box per node. Our method can be extended to even higher branching factors (e.g. 8, 16), but this requires operations on larger bitmasks.

The MBVH4 is primarily used on CPUs with 4-wide or 8-wide SIMD, and also on the recent Intel MIC architecture with 16-wide SIMD. MIC was introduced with Larrabee [SCS*08], and its latest implementation is the Xeon Phi coprocessor [Int13]. On the other hand, the MBVH2 is the preferred choice on current NVIDIA GPUs [AL09, ALK12]. We have optimized our method (Section 'Implementation') and evaluated its performance (Section 'Results') for all these hardware platforms.

2. Related Work

Most previous research on stackless ray traversal targeted either of two widely used acceleration structures: the kd-tree or the binary BVH.

2.1. Stackless kd-tree traversal

One approach for stackless kd-tree traversal is to store neighbour-links, also called ropes, in the leaf nodes for all six sides, which point to spatially adjacent nodes [MB90, HBŽ98, PGSS07]. During traversal, these links are used to directly jump to the next node (either inner or leaf node) that must be traversed after exiting a leaf node. This eliminates the need for a stack and also decreases the number of traversed inner nodes, but it has a substantial storage overhead.

Foley and Sugerman [FS05] introduced the kd-restart and kd-backtrack algorithms, which are based on shortening the ray from the start when a stack pop would be necessary. By advancing the starting point of the ray to the leaf exit point, the node will be skipped in subsequent traversal steps. Kd-restart continues the search by simply restarting the traversal from the root node using the shortened ray. To avoid restarting after every leaf intersection, the kd-backtrack algorithm adds parent pointers and bounding boxes to the tree. These are used for ascending in the tree after processing a leaf.

Horn et al. [HSHH07] proposed two improved algorithms based on kd-restart: kd-push-down and short-stack traversal. Kd-push-down identifies the deepest node that fully contains the valid intersection interval and then uses that node, instead of the root, as the starting point for the traversal restarts. Short-stack traversal reduces the number of necessary restarts by maintaining a small, fixed-size stack. The traversal must be restarted only when the short-stack underflows.

2.2. Stackless BVH traversal

Most of the stackless kd-tree traversal techniques, with the notable exception of short-stack traversal, cannot be directly applied to BVHs because the nodes of a BVH may overlap [Lai10].

Smits [Smi98] suggested the storage of a skip pointer in each BVH node, which points to the next node to process if the current node is missed by the ray. The downside of this approach is that it can traverse the BVH only in a predefined order, without taking into account ray directions or node distances, which incurs a major performance penalty. Torres et al. [TMG09] developed a GPU-optimized version for coherent rays using ray packets.

The restart trail method by Laine [Lai10] enables traversal restarts for BVHs (or other kinds of binary trees) by encoding which part of the tree has been visited so far in a 32- or 64-bit trail. This value stores one bit of information per tree level. When a restart is triggered, the trail guides the downward traversal from the root to the next unprocessed node. The advantage of this approach is that it supports ordered traversal according to the node distances, but due to the restarts, it visits more than twice as many nodes as stack-based traversal. This overhead can be alleviated, at the expense of increasing the traversal state size, by adding a short-stack.

Hapala et al. [HDW*11] add a parent pointer to each node to backtrack in the tree instead of restarting from the root. Their method determines the next node to traverse using simple state logic, and it performs the same box and triangle intersection tests as an equivalent stack-based version. It needs to store only two bits of state in addition to the current node pointer, which is significantly less than the size of a trail or bitstack. However, it has to re-evaluate the traversal order heuristic for all revisited nodes, which practically restricts the heuristic to a simple ray-direction-based technique. Using the intersection distances to determine the near and far children of a node would be too expensive because both would have to be re-intersected. Such distance-based sorting, though, would not necessarily lead to fewer traversal steps [Dam11].

Very recently, Barringer and Akenine-Möller [BAM13] introduced a stackless algorithm for binary trees that efficiently supports dynamic traversal order, based on the child distances, without restarting. They described three variants of the algorithm: two for implicit trees, and one for sparse trees with parent pointers. The traversal state is maintained using two integers: the current node's index (or address) and the left-first descent level index. The level index is the relative index of a node of an implicit tree with regard to the first node on the same level. However, this tree is not the actual tree stored in memory but a virtual implicit version of it with nodes sorted according to the dynamic traversal order. Traversing this virtual tree in left-first order is equivalent to traversing the original one in dynamic order. Thus, the left-first level index of the current node can be used to efficiently backtrack in the original tree, without re-intersecting the nodes.

3. Algorithm Overview

Our goal is to traverse the same sequence of nodes as a stack-based algorithm with a distance-based order heuristic, but using only a few state variables. Also, we want to avoid intersecting the same nodes multiple times.

A standard stack-based approach for N-way trees performs the following operations for each visited inner node: First, it intersects all N child bounding boxes, computing the intersection distances. It then selects the nearest child as the next node to traverse, and pushes the other (up to N − 1) children to the stack. If all children were missed by the ray, a node is popped from the stack, and the traversal continues with that node.

For each visited leaf node, the primitives stored in the respective node are intersected with the ray, and then the stack is popped to get the next node.

Our algorithm replaces the stack pop with backtracking in the tree from the current node. The purpose of this operation is to find the next unprocessed node. This is a node whose bounding box was hit by the ray while processing the parent, but which has not been traversed yet. It is a sibling of either the current node or one of its ancestors. To be able to ascend in the tree, we add a parent pointer to each node. We also store pointers to the siblings for accessing them without taking a round trip to the parent. These additional links do not necessarily increase the node size as the original layout is often padded with unused values.

The backtracking is guided by a bitmask that encodes which part of the tree needs to be traversed. It stores N − 1 bits for each visited tree level (except the root level), and is updated similarly to a stack, using bitwise push and pop operations. Hence, we call this special bitmask a bitstack. The per-level values in the bitstack are skip codes. These indicate which siblings of the most recently visited node on the respective level have already been processed and thus must be skipped.

In the following section, we describe in detail a simple version of our traversal algorithm for binary trees, which we later extend to support four-way trees (Section 'MBVH4 Traversal').

4. MBVH2 Traversal

The binary variant of our stackless algorithm is shown in Algorithm 1. We use two state variables: a pointer to the current node and the bitstack, a 32- or 64-bit integer. For binary trees, the skip codes pushed onto the bitstack are 1-bit flags, which have the following semantics:

  • 0: Skip the sibling of the current node; go to the parent.
  • 1: Traverse the sibling of the current node.

[Algorithm 1. Stackless MBVH2 traversal (pseudocode listing).]

The top of the bitstack is implicitly the least significant bit. This means that when pushing or popping an item, all the items in the stack must be shifted by one position, but this can be efficiently implemented using a simple bitwise shift. The initial value of the bitstack is 0, which is equivalent to an empty stack because it indicates that there are no nodes to process (i.e. all skip codes are 0). The advantage of this representation is that the traversal can be terminated earlier than returning to the root node, avoiding unnecessary backtracking steps.
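The bitstack operations described above can be sketched in C++ as follows. This is a minimal illustration, not the listing from Algorithm 1: the type `BitStack64` and its member names are hypothetical, and a 64-bit word is assumed, which limits the sketch to 64 levels below the root.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 1-bit-per-level bitstack (all names are hypothetical).
struct BitStack64 {
    std::uint64_t bits = 0;  // 0 == empty: no pending sibling on any level

    // Descend one level: push a skip code (0 = skip sibling, 1 = traverse it).
    void push(unsigned code) {
        assert(code <= 1u);
        assert((bits >> 63) == 0 && "bitstack overflow");
        bits = (bits << 1) | code;
    }
    // Ascend one level: discard the skip code of the level being left.
    void pop() { bits >>= 1; }
    // Skip code of the current level (the least significant bit).
    unsigned top() const { return static_cast<unsigned>(bits & 1u); }
    // Flip the top skip code from 1 to 0 when jumping to the sibling.
    void clearTop() {
        assert(top() == 1u);
        bits ^= 1u;
    }
    // True when no node is left to traverse on any level.
    bool empty() const { return bits == 0; }
};
```

Because an all-zero word means "nothing pending on any level", the `empty()` test is a single comparison, which is what allows the traversal to stop before it has climbed back to the root.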

The main traversal loop begins at line 4 with checking whether the current node is an inner node or a leaf node. If it is an inner node, its two child bounding boxes are tested for intersection with the ray (line 5).

If any of the children were hit, we first push a 0 bit to the bitstack by shifting the bits to the left (line 7). For a single hit, we only have to set the current node to the intersected child. The skip code of 0 that was just pushed ensures that the other child subtree, which was missed by the ray, will not be traversed later.

Lines 11 and 12 handle the less frequent case of two hits. We compare the intersection distances and set the current node to the near child. Furthermore, we change the skip code to 1 with a binary OR operation in order to enable the traversal of the far child (line 12).

After processing the node, we continue the downward traversal (line 14). If no children were hit, backtracking is triggered to find the next node to process.

When we visit a leaf node (lines 18 and 19), we intersect the primitives in the leaf and shorten the ray if we find an intersection closer than what we have previously recorded. Then, we start backtracking in the tree.

The backtracking is performed in lines 22–30. In a loop, we ascend in the tree until we find a non-zero skip code in the bitstack, if there is one. The top stack item is extracted using a binary AND (line 22). If the bitstack is equal to 0, the entire traversal is terminated (lines 23–25). Otherwise, we jump to the parent node, pop the bitstack, and continue the search. After exiting the loop, we jump to the sibling of the current node, which has not yet been traversed. Finally, we flip the skip code at the top of the bitstack to 0 with an XOR, in order to avoid revisiting the previous, already processed node (line 30).
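The control flow described in this section can be sketched in C++ as below. This is a simplified illustration under stated assumptions, not the authors' kernel: child bounding boxes are reduced to 1D intervals `[lo, hi]` of the ray parameter, each leaf holds a single precomputed primitive distance, and the names `Node2`, `Result`, `traverseStackless` and `makeExampleTree` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy MBVH2 node. Child "boxes" are 1D intervals of the ray parameter t
// (a simplification made for this sketch only).
struct Node2 {
    int parent, sibling;  // links added for stackless traversal
    int child[2];         // {-1, -1} marks a leaf
    float lo[2], hi[2];   // entry/exit distances of the two child boxes
    float prim;           // leaf only: primitive hit distance, < 0 = miss
};

struct Result {
    float tHit;                // closest hit found (or the initial tMax)
    std::vector<int> visited;  // node indices in visiting order
};

Result traverseStackless(const std::vector<Node2>& nodes, float tMax) {
    Result r{tMax, {}};
    int node = 0;                // start at the root
    std::uint64_t bitstack = 0;  // one skip bit per level below the root
    for (;;) {
        const Node2& n = nodes[node];
        r.visited.push_back(node);
        if (n.child[0] >= 0) {   // inner node: test both child boxes
            bool hit0 = n.lo[0] <= r.tHit && n.hi[0] >= 0.0f;
            bool hit1 = n.lo[1] <= r.tHit && n.hi[1] >= 0.0f;
            if (hit0 || hit1) {
                assert((bitstack >> 63) == 0);  // depth limit of the sketch
                bitstack <<= 1;                 // push skip code 0
                if (hit0 && hit1) {
                    node = n.child[n.lo[1] < n.lo[0] ? 1 : 0];  // near child
                    bitstack |= 1;              // far child is still pending
                } else {
                    node = n.child[hit0 ? 0 : 1];
                }
                continue;                       // keep descending
            }
        } else if (n.prim >= 0.0f && n.prim < r.tHit) {
            r.tHit = n.prim;                    // leaf: shorten the ray
        }
        // Backtrack: ascend until a level with a pending sibling is found.
        while ((bitstack & 1) == 0) {
            if (bitstack == 0) return r;        // nothing left on any level
            node = nodes[node].parent;
            bitstack >>= 1;                     // pop
        }
        node = nodes[node].sibling;
        bitstack ^= 1;                          // sibling is now handled
    }
}

// Example tree: 0 -> {1, 2}, 1 -> {3, 4}; nodes 2, 3 and 4 are leaves.
std::vector<Node2> makeExampleTree() {
    return {
        {-1, -1, {1, 2},   {1.0f, 2.0f}, {5.0f, 6.0f}, -1.0f},
        { 0,  2, {3, 4},   {3.0f, 1.0f}, {4.0f, 2.0f}, -1.0f},
        { 0,  1, {-1, -1}, {0.0f, 0.0f}, {0.0f, 0.0f},  2.5f},
        { 1,  4, {-1, -1}, {0.0f, 0.0f}, {0.0f, 0.0f},  3.5f},
        { 1,  3, {-1, -1}, {0.0f, 0.0f}, {0.0f, 0.0f},  1.5f},
    };
}
```

For the example tree and a ray with `tMax = 100`, the sketch visits nodes 0, 1, 4, 3, 2 and reports a hit at 1.5. Node 3 is still visited although its interval starts beyond that hit, because no node distances are kept in the traversal state.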

4.1. Comparison

A similar approach for sparse binary trees has been recently proposed by Barringer and Akenine-Möller [BAM13], but there are some important differences. Although the left-first descent level index in their algorithm is functionally similar to our bitstack, the semantics of the bits in these values are different. Our approach is slightly more efficient for two reasons:

  • The algorithm of Barringer and Akenine-Möller terminates the traversal only when it returns to the root node. In contrast, we exit the loop earlier if the bitstack becomes zero, which is cheap to test.
  • In many cases, the bitstack or level index must be 64 bits wide. On architectures with only limited native support for 64-bit integers (e.g. current GPUs), most full-width operations have a performance penalty. In their algorithm, one such operation is incrementing the level index before starting to backtrack. Our approach requires only a simple bit flip (at the end), which can be executed at full speed.

For the test scenes in Figure 1, our algorithm is faster by 1–11% on the NVIDIA Kepler GK110 architecture (Table 3). Another advantage is that it scales naturally to higher branching factors, as described in the next section.

Figure 1. Test scenes used for the performance measurements of the ray traversal algorithms. The images were rendered using simple 8-bounce diffuse path tracing.

5. MBVH4 Traversal

For binary trees, 1-bit skip codes are sufficient because each node has only one sibling. In order to traverse four-way trees, we extend the skip codes to 3 bits, where each bit corresponds to a sibling of the respective node (see Figure 3a for an example). These bits have the same semantics as 1-bit skip codes: a 0 bit means that the sibling must be skipped, whereas a 1 bit means that it must be traversed. The siblings of a particular node are indexed circularly starting from the next node in the sibling group, as shown in Figure 2. For example, if the index of a node is 1, its siblings in ascending order are nodes 2, 3 and 0. Nodes that have fewer than four children are padded with invalid or empty node references, thus every node has exactly three siblings.
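The circular sibling order can be written down as a small C++ sketch. The helper names `siblingSlot` and `siblingLinks` and the constant `kInvalid` are hypothetical, and node references are plain integers here for brevity.

```cpp
#include <array>
#include <cassert>

constexpr int kInvalid = -1;  // padding reference for missing children

// Slot (within the parent's child array) of the k-th sibling of child i,
// following the circular order: the next node in the group comes first.
constexpr int siblingSlot(int i, int k) { return (i + k + 1) % 4; }

// The three sibling references to store in child i, given the parent's
// four child references (missing children are padded with kInvalid).
std::array<int, 3> siblingLinks(const std::array<int, 4>& children, int i) {
    assert(i >= 0 && i < 4);
    std::array<int, 3> links = {{children[siblingSlot(i, 0)],
                                 children[siblingSlot(i, 1)],
                                 children[siblingSlot(i, 2)]}};
    return links;
}
```

With this ordering, bit k of a node's 3-bit skip code refers to the node stored in its k-th sibling link, so a sibling can be reached without going through the parent.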

Figure 2. The stackless MBVH4 traversal algorithm requires eight pointers in each node: four child pointers, three sibling pointers and a parent pointer. The siblings are referenced in circular order, starting from the first sibling after the node in question.

The limitation of the skip codes is that they do not encode the order in which the siblings should be processed. As for binary trees, we always descend into the nearest node first, but we cannot traverse its siblings in front-to-back order. This, however, has only a small impact on performance because in about 90% of the traversal steps only two or fewer children are hit by the ray [BWW*12]. For our scenes, the performance hit caused by disabling full sorting in a reference stack-based approach [Ern11] is 2–9%.

The full traversal algorithm is given in Algorithm 2. For a single intersected child (line 9), the skip code is 000. When more than one child is hit, we first determine the index of the nearest child, which we select as the next node (lines 12 and 13). Then, we compute the skip code for this node using the SkipCode function (line 14). The implementation of this function can be seen on line 36. The skip code is computed from the hit mask, a 4-bit mask indicating which children are hit, and the index of the selected child. The bit for the selected node is removed from the hit mask, and the remaining 3 bits are rotated so that their positions match the indices of the corresponding siblings. For example, in Figure 3(a) the hit mask is 0111 and the index of the nearest child is 1; thus, the resulting skip code is 101.
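One way to realize the skip code computation is a 4-bit rotation, sketched below in C++. The function name follows the text, but the body is an illustration rather than the listing from Algorithm 2; it assumes that bit i of the hit mask corresponds to child i, which agrees with the example from Figure 3(a).

```cpp
#include <cassert>

// hitMask: bit i is set if child i was hit (4 bits).
// index:   slot of the child selected as the next node.
// Returns the 3-bit skip code whose bit k refers to the k-th sibling of
// the selected child in circular order.
unsigned skipCode(unsigned hitMask, unsigned index) {
    assert(hitMask < 16u && index < 4u);
    unsigned rest = hitMask & ~(1u << index);  // drop the selected child
    unsigned s = index + 1u;                   // rotate right by index + 1
    unsigned rotated = ((rest >> s) | (rest << (4u - s))) & 0xFu;
    return rotated & 0x7u;  // bit 3 is the cleared bit of the selected child
}
```

For the example in the text, `skipCode(0x7, 1)` yields binary 101: siblings 0 and 2 of node 1 (that is, nodes 2 and 0) are pending, while node 3 was missed.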

[Algorithm 2. Stackless MBVH4 traversal (pseudocode listing).]

Figure 3. Example for stackless MBVH4 traversal. Only the steps where backtracking is triggered are depicted. Blue-coloured nodes represent unprocessed nodes, green nodes have been already processed, gray nodes have been culled (i.e. the ray does not intersect them), and the red node is the current node. The invalid/empty nodes used for padding are not shown. The bold arrows indicate the path from the root to the current node. The value of the bitstack can be seen below the tree in binary form. The skip codes in the bitstack are also shown on the corresponding tree levels. The top of the bitstack is highlighted in bold. The dotted arrows in (a) connect the bits in the skip code with the nodes which they refer to.

When backtracking is triggered, we first look for a node that has a non-zero skip code by following the parent pointers (lines 25–31). If this was successful, we have to jump to the next unprocessed sibling of this node. We determine the index of this sibling by scanning the skip code for the first set bit using the BitScan function (line 32). On most processors, BitScan can be implemented with a single instruction. Finally, we have to update the skip code at the top of the bitstack. This is done in line 34 by XORing the bitstack with a mask generated from the current skip code with the SkipCodeNext function. In this function (lines 37–38), we compute the next skip code by shifting out the trailing 0 bits and the next 1 bit. Then, we XOR this new skip code with the old one to produce the mask.
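The sideways step of the backtracking can be sketched in C++ as follows. The names `bitScan` and `skipCodeNext` mirror the functions mentioned in the text, but the bodies are illustrative; `jumpToSibling` is a hypothetical helper, and a 64-bit bitstack with 3 bits per level is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Position of the lowest set bit. A portable loop is used here; hardware
// usually offers a single instruction (C++20 exposes it as
// std::countr_zero).
unsigned bitScan(unsigned code) {
    assert(code != 0u);
    unsigned k = 0;
    while (((code >> k) & 1u) == 0u) ++k;
    return k;
}

// XOR mask that turns the current 3-bit skip code into the skip code of
// the sibling selected by bitScan(code).
unsigned skipCodeNext(unsigned code) {
    assert(code != 0u && code < 8u);
    unsigned next = code >> (bitScan(code) + 1u);  // drop zeros and the 1
    return next ^ code;
}

// Returns the sibling link to follow and updates the top skip code.
unsigned jumpToSibling(std::uint64_t& bitstack) {
    unsigned code = static_cast<unsigned>(bitstack & 7u);
    unsigned k = bitScan(code);
    bitstack ^= skipCodeNext(code);
    return k;
}
```

Shifting by k + 1 re-expresses the remaining pending siblings relative to the node that is jumped to, since that node's own sibling list starts one slot after itself. For a top skip code of 101, the first call follows sibling link 0 and leaves 010; the next call follows link 1 and leaves 000, after which the traversal has to ascend.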

An example traversal with this algorithm is illustrated in Figure 3.

6. Implementation

In this section, we provide implementation details for three different processor architectures: CPU (Intel Ivy Bridge), MIC (Intel Knights Corner) and GPU (NVIDIA Kepler).

6.1. CPU

Our CPU implementation is based on the MBVH4 traversal method introduced in the Intel Embree ray tracer [Ern11]. The simplified source code for the stackless traversal kernel is listed in Appendix A.

The child bounding boxes in the nodes are stored in structure-of-arrays (SoA) format to facilitate SIMD processing. The size of the node data structure in the original stack-based method is 112 bytes without any padding or 128 bytes with cache line padding. To support stackless traversal, we need to extend this layout with both parent and sibling pointers. These values can be fit inside the padding, thus, the total node size is 128 bytes. We have to add the pointers to the leaf nodes as well, which, depending on the chosen triangle representation, may or may not increase their size.

Computing the skip codes during traversal is relatively costly. We solve this problem by employing lookup tables for both the SkipCode and SkipCodeNext functions, which are small enough to easily fit into the L1 cache. The SkipCode table is addressed with a 6-bit value composed of the hit mask and the node index. Using one byte per entry, the size of the table is 64 bytes (i.e. a single cache line). The SkipCodeNext table can be even smaller as it contains only eight entries.
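A possible construction of the two lookup tables is sketched below in C++. The index layout (node index in the upper two bits, hit mask in the lower four) is an assumption, since the text only states that the 6-bit address is composed of both values; the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// 64 one-byte entries, addressed by (index << 4) | hitMask (assumed layout).
std::array<std::uint8_t, 64> buildSkipCodeTable() {
    std::array<std::uint8_t, 64> table{};
    for (unsigned index = 0; index < 4; ++index) {
        for (unsigned hitMask = 0; hitMask < 16; ++hitMask) {
            unsigned rest = hitMask & ~(1u << index);  // drop selected child
            unsigned s = index + 1;                    // rotate right by s
            unsigned code = ((rest >> s) | (rest << (4 - s))) & 7u;
            table[(index << 4) | hitMask] = static_cast<std::uint8_t>(code);
        }
    }
    return table;
}

// 8 entries: XOR mask that advances a 3-bit skip code to the next sibling.
std::array<std::uint8_t, 8> buildSkipCodeNextTable() {
    std::array<std::uint8_t, 8> table{};
    for (unsigned code = 1; code < 8; ++code) {
        unsigned k = 0;
        while (((code >> k) & 1u) == 0u) ++k;          // lowest set bit
        table[code] = static_cast<std::uint8_t>((code >> (k + 1)) ^ code);
    }
    return table;  // entry 0 stays unused: a zero code triggers an ascent
}

unsigned lookupSkipCode(const std::array<std::uint8_t, 64>& table,
                        unsigned hitMask, unsigned index) {
    assert(hitMask < 16u && index < 4u);
    return table[(index << 4) | hitMask];
}
```

In a traversal kernel the tables would be built once and kept as read-only data, so that each skip code computation reduces to a single indexed load.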

We use a 128-bit bitstack to be able to handle complex scenes that require deep trees. This way, the maximum permitted tree depth is 42. Since current CPUs do not have native 128-bit integer support, we implement the bitstack operations with 64-bit instructions. The most demanding operations are the bit shifts, which we implement using the special double precision shift instructions SHLD and SHRD. This approach is faster than using only regular shifts.
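The depth limit follows from 42 × 3 = 126 bits, the largest multiple of 3 that fits into 128 bits. A portable C++ sketch of such a two-word bitstack is given below; the type `BitStack128` is hypothetical, and the cross-word shifts are written out explicitly where the actual implementation uses SHLD and SHRD.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative 128-bit bitstack built from two 64-bit words, with 3-bit
// skip codes. The top of the stack is in the low bits of `lo`.
struct BitStack128 {
    std::uint64_t lo = 0, hi = 0;

    void push(unsigned code) {  // 128-bit shift left by 3, then insert
        assert(code < 8u);
        assert((hi >> 61) == 0 && "bitstack overflow");
        hi = (hi << 3) | (lo >> 61);  // high word takes 3 bits from lo
        lo = (lo << 3) | code;
    }
    void pop() {                // 128-bit shift right by 3
        lo = (lo >> 3) | (hi << 61);  // low word takes 3 bits from hi
        hi >>= 3;
    }
    unsigned top() const { return static_cast<unsigned>(lo & 7u); }
    bool empty() const { return (lo | hi) == 0; }
};
```

Each push or pop touches both words, which is why the cost of the shifts dominates the bitstack maintenance.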

6.2. MIC

The Xeon Phi implementation is built upon the MBVH4 single-ray traversal algorithm by Benthin et al. [BWW*12], which has many similarities to the CPU algorithm (see Appendix B for the source code). Our stackless approach can be applied to hybrid single/packet traversal as well, but we opted for single traversal because of its simplicity and close-to-optimal performance for highly incoherent rays.

The node bounding boxes are packed in an array-of-structures (AoS) layout to better exploit the 16-wide SIMD units. The SIMD vectors are divided into 4-wide lanes, each containing a 3D vector. We insert the node pointers into the unused slots in the SIMD vectors, so the extended node data structure does not occupy more space than the basic one (128 bytes).

Because of the AoS layout, the ray-box intersection routine produces a sparse 16-bit hit mask. Every fourth bit is a hit flag, whereas all the other bits are zero. Building a SkipCode table for such a large mask or directly compacting the mask would not be practical. Therefore, we also produce a 4-bit hit mask by permuting (with VPERMD) the near and far distance vectors, and then comparing them. We compute the lookup table index from this compact mask and a 2-bit code that identifies the closest node. For two hits, this code is either 0 or 1 (for the first or second hit, respectively), and for more hits it is the node index. Thus, the table has 64 entries, just like the one used on the CPU.

6.3. GPU

We have optimized the GPU kernels for the NVIDIA Kepler GK110 architecture [Nvi12]. Our baseline traversal method is the stack-based speculative while-while kernel by Aila et al. [AL09, ALK12], which traverses binary BVHs.

We implement the stackless MBVH2 traversal algorithm using the following while-if-while loop organization:

while true
   while node is inner
      go to nearest child or break
   if node is leaf
      intersect primitives
   while skip code is zero
      go to parent or return
   go to sibling

We do not postpone leaf intersections because it would be inefficient in combination with backtracking and would also increase the size of the traversal state. This way, the state consists of only a node pointer and a 64-bit bitstack.

Our node data structure has the same size as the original (64 bytes) because there is enough free space for the parent and sibling pointers. When intersecting a node, we fetch the node data (including triangles) through the texture cache, but for backtracking we use regular memory loads.

The source code for the kernel can be seen in Appendix C.

7. Results

We evaluated the performance of our stackless ray traversal algorithms and the corresponding stack-based algorithms using a simple but highly optimized diffuse path tracer on all three processor architectures.

The different types of BVHs were constructed using the same high-quality primitive partitioning techniques: we used both object and spatial partitioning optimized with the surface area heuristic (SAH), as proposed by Stich et al. [SFD09]. The MBVH4 nodes were generated using the top-down greedy splitting method by Wald et al. [WBB08]. The MIC tree leaves were limited to four triangles (i.e. a single multi-triangle), whereas the CPU and GPU ones were limited to eight triangles. All traversal algorithms used variants of the Möller-Trumbore ray-triangle intersection test [MT97].

The CPU used for the CPU benchmarks was an Intel Core i7-3770 (Ivy Bridge, 4 cores, 8 threads, 3.4 GHz, 8 MB L3 cache) with 16 GB RAM (DDR3-1600, dual channel). The code was compiled as 64-bit with AVX enabled. Both the CPU and MIC implementations were written in C++ with SIMD intrinsic functions and OpenMP, and they were compiled with Intel C++ Compiler XE 13.1.

The MIC card was an Intel Xeon Phi SE10P coprocessor (Knights Corner, B1-stepping, 61 cores, 244 threads, 1.1 GHz, 8 GB GDDR5, ECC on) installed in a compute node of the Stampede supercomputer at the Texas Advanced Computing Center (TACC).

The GPU was an NVIDIA Tesla K20c (Kepler GK110, 13 multiprocessors, 2496 CUDA cores, 0.7 GHz, 5 GB GDDR5, ECC on). The path tracer was implemented as a traditional megakernel in CUDA 5.0.

The performance results, including the ray tracing speeds and the number of box and triangle intersection tests, are shown in Table 1. Our stackless algorithms, similarly to previous methods, are somewhat slower than the reference stack-based ones when used for ordinary ray tracing; however, they maintain about 22–51× smaller traversal states (Table 2). For special rendering methods that trace a very large number of rays in parallel, a low memory footprint is essential and could lead to a much higher overall performance.

Table 1. Performance measurements for 8-bounce diffuse path tracing (no Russian roulette) with trivial shading (no colours or textures) on CPU, MIC and GPU architectures. The scenes were rendered from the views depicted in Figure 1, and the image resolution was 1024 × 768 pixels. We have compared the state-of-the-art stack-based ray traversal methods [Ern11, BWW*12, ALK12] with our stackless methods in terms of: ray tracing speed (including shading) in million rays per second (Mray/s), number of multi-box intersections (Box) and number of single- or multi-triangle intersections (Tri). Relative values are also listed (Δ). Column groups: CPU = MBVH4-CPU (Intel Core i7-3770), MIC = MBVH4-MIC (Intel Xeon Phi SE10P), GPU = MBVH2-GPU (NVIDIA Tesla K20c).

| Scene | Method | CPU Mray/s | CPU Box | CPU Tri | MIC Mray/s | MIC Box | MIC Tri | GPU Mray/s | GPU Box | GPU Tri |
|---|---|---|---|---|---|---|---|---|---|---|
| Conference | Stack | 23.4 | 10.4 | 2.5 | 118.3 | 10.8 | 2.1 | 142.7 | 24.6 | 4.3 |
| | Stackless | 21.3 | 11.2 | 2.7 | 103.1 | 11.6 | 2.3 | 101.0 | 24.0 | 4.0 |
| | Δ | −9% | +8% | +11% | −13% | +7% | +10% | −29% | −3% | −7% |
| Crytek Sponza | Stack | 14.9 | 16.6 | 3.0 | 70.3 | 18.6 | 2.5 | 93.5 | 39.5 | 5.6 |
| | Stackless | 12.4 | 19.6 | 3.9 | 60.3 | 20.4 | 2.7 | 64.2 | 37.6 | 4.6 |
| | Δ | −17% | +18% | +31% | −14% | +10% | +9% | −31% | −5% | −17% |
| Fairy | Stack | 19.1 | 13.5 | 3.2 | 92.1 | 14.6 | 2.5 | 73.1 | 30.3 | 7.8 |
| | Stackless | 17.2 | 14.7 | 3.5 | 78.0 | 15.6 | 2.8 | 58.2 | 29.9 | 7.6 |
| | Δ | −10% | +9% | +11% | −15% | +7% | +11% | −20% | −1% | −2% |
| Hairball | Stack | 7.6 | 25.6 | 5.4 | 37.3 | 27.6 | 4.7 | 24.1 | 58.8 | 15.3 |
| | Stackless | 6.5 | 29.5 | 6.3 | 31.7 | 30.8 | 5.2 | 18.6 | 58.6 | 15.0 |
| | Δ | −15% | +15% | +16% | −15% | +12% | +11% | −23% | −0% | −2% |
| Power Plant | Stack | 12.5 | 18.2 | 4.2 | 56.9 | 19.6 | 3.9 | 51.7 | 44.0 | 13.1 |
| | Stackless | 10.4 | 21.9 | 4.9 | 47.6 | 22.8 | 4.3 | 40.8 | 41.8 | 12.1 |
| | Δ | −17% | +20% | +17% | −16% | +17% | +9% | −21% | −5% | −7% |
| San Miguel | Stack | 7.8 | 25.0 | 4.6 | 38.5 | 26.6 | 4.1 | 33.3 | 56.9 | 9.8 |
| | Stackless | 6.5 | 29.3 | 5.6 | 33.2 | 30.4 | 4.6 | 26.3 | 55.1 | 9.1 |
| | Δ | −17% | +17% | +20% | −14% | +14% | +11% | −21% | −3% | −6% |

Table 2. Traversal state sizes (in bytes) for different types of BVHs and traversal methods. The listed methods are: simple stack-based, stack-based with node distances (stack+dist), and stackless traversal. All sizes include a 32-bit pointer to the current node.

| Tree | Max depth | Method | State size (B) |
|---|---|---|---|
| MBVH2 | 64 | Stack | 264 |
| | | Stack+dist | 520 |
| | | Stackless | 12 |
| MBVH4 | 42 | Stack | 512 |
| | | Stack+dist | 1016 |
| | | Stackless | 20 |
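
The sizes in Table 2 are consistent with the following accounting, sketched in C++. The decomposition is an inference from the numbers and is not spelled out in the text: N − 1 stack entries per level, 4 bytes per entry (8 bytes when a distance is stored), and a 4-byte stack pointer next to the 4-byte node pointer. The function names are hypothetical.

```cpp
#include <cassert>

// Assumed decomposition of a stack-based traversal state, in bytes:
// node pointer (4) + stack pointer (4) + (N - 1) entries per level.
constexpr int stackState(int branching, int maxDepth, bool withDist) {
    return 4 + 4 + (branching - 1) * maxDepth * (withDist ? 8 : 4);
}

// Stackless state, in bytes: node pointer (4) + bitstack with N - 1 bits
// per level, rounded up to whole 64-bit words.
constexpr int stacklessState(int branching, int maxDepth) {
    return 4 + 8 * (((branching - 1) * maxDepth + 63) / 64);
}

static_assert(stackState(2, 64, false) == 264, "MBVH2, stack");
static_assert(stackState(2, 64, true) == 520, "MBVH2, stack+dist");
static_assert(stacklessState(2, 64) == 12, "MBVH2, stackless");
static_assert(stackState(4, 42, false) == 512, "MBVH4, stack");
static_assert(stackState(4, 42, true) == 1016, "MBVH4, stack+dist");
static_assert(stacklessState(4, 42) == 20, "MBVH4, stackless");
```

Under this accounting, the smallest and largest reductions are 264 / 12 = 22 and 1016 / 20 ≈ 51, matching the 22–51× range quoted in the text.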

For our test scenes, stackless traversal is slower by 9–17% on the CPU, 13–16% on the MIC, and 20–31% on the GPU. This is caused by the more complex traversal logic, more irregular memory accesses, and, on the CPU and MIC, the slightly higher number of box and triangle intersections. One reason for the latter is that the stack-based versions store node distances in the stack, and thus can skip previously pushed nodes that no longer need to be traversed.

Also, on the CPU we always sort the pushed nodes by hit distance, which further decreases the number of intersections by a small amount. Without these two minor optimizations, the stack-based algorithms would visit the same (in the case of MBVH2) or a very similar sequence of nodes as the respective stackless ones.

On the GPU, our stackless algorithm has slightly fewer intersections than the stack-based version because it does not postpone leaf nodes; however, this results in lower SIMD efficiency. It outperforms the sparse traversal algorithm by Barringer and Akenine-Möller [BAM13] for all test cases, as can be seen in Table 3.

Table 3. Performance of stackless MBVH2-GPU traversal using [BAM13] versus our algorithm for 8-bounce diffuse path tracing (also see Table 1). The GPU used was an NVIDIA Tesla K20c
Scene           [BAM13] (Mray/s)   Our (Mray/s)   Δ
Conference      91.2               101.0          +11%
Crytek Sponza   60.8               64.2           +6%
Fairy           53.5               58.2           +9%
Hairball        18.0               18.6           +3%
Power Plant     39.2               40.8           +4%
San Miguel      26.0               26.3           +1%
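
The Δ column of Table 3 can be recomputed from the two throughput columns. The short sketch below assumes that Δ is the relative change of our throughput with respect to [BAM13], rounded to the nearest percent; this assumption reproduces all six listed values. The names `TABLE3` and `relative_change_percent` are hypothetical.

```python
# Throughput in Mray/s, copied from Table 3: (scene, [BAM13], ours).
TABLE3 = [
    ("Conference", 91.2, 101.0),
    ("Crytek Sponza", 60.8, 64.2),
    ("Fairy", 53.5, 58.2),
    ("Hairball", 18.0, 18.6),
    ("Power Plant", 39.2, 40.8),
    ("San Miguel", 26.0, 26.3),
]


def relative_change_percent(baseline, ours):
    """Relative throughput change versus the baseline, rounded to whole percent."""
    return round((ours / baseline - 1.0) * 100.0)


if __name__ == "__main__":
    for scene, baseline, ours in TABLE3:
        print(f"{scene}: {relative_change_percent(baseline, ours):+d}%")
```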

8. Conclusions and Future Work

We have presented a novel and efficient stackless ray traversal algorithm for the MBVH acceleration structure. Two algorithm variations have been discussed: one for four-way trees and one for binary trees. To our knowledge, this is the first published stackless method for wide BVHs.

The results show that on current architectures our algorithm is competitive with stack-based approaches, although it is not the fastest option for conventional ray tracers. Its main advantage is a much smaller traversal state, which can significantly enhance the efficiency of advanced, massively parallel in-core or out-of-core ray tracing schemes.

As future work, we would like to analyse the proposed algorithm in the context of out-of-core ray tracing, where the traversal of many rays must be suspended and later resumed. We also plan to investigate both stack-based and stackless MBVH4 traversal on the latest-generation GPU architectures.

Acknowledgments

This work was possible with the financial support of the Sectoral Operational Programme for Human Resources Development 2007-2013, co-financed by the European Social Fund, under the project number POSDRU/107/1.5/S/76841 with the title Modern Doctoral Studies: Internationalization and Interdisciplinarity. The research was also supported by OTKA K-104476 (Hungary). We would like to thank Paul Navrátil and the Texas Advanced Computing Center at The University of Texas at Austin for kindly providing us access to the Stampede supercomputer. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K20c GPU used for this research.

The test scenes are courtesy of Anat Grynberg and Greg Ward (Conference), Frank Meinl and Marko Dabrovic (Crytek Sponza), Ingo Wald (Fairy), Samuli Laine and Tero Karras (Hairball), University of North Carolina at Chapel Hill (Power Plant) and Guillermo M. Leal Llaguno (San Miguel).

References

  • [AK10] Aila T., Karras T.: Architecture considerations for tracing incoherent rays. In Proceedings of the Conference on High Performance Graphics (Aire-la-Ville, Switzerland, 2010), HPG '10, Eurographics Association, pp. 113–122.
  • [AL09] Aila T., Laine S.: Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009 (New York, NY, USA, 2009), HPG '09, ACM Press, pp. 145–149.
  • [ALK12] Aila T., Laine S., Karras T.: Understanding the Efficiency of Ray Traversal on GPUs – Kepler and Fermi Addendum. NVIDIA Technical Report NVR-2012-02, NVIDIA Corporation, June 2012.
  • [BAM13] Barringer R., Akenine-Möller T.: Dynamic stackless binary tree traversal. Journal of Computer Graphics Techniques (JCGT) 2, 2 (March 2013), 38–49.
  • [BWW*12] Benthin C., Wald I., Woop S., Ernst M., Mark W.: Combining single and packet-ray tracing for arbitrary ray distributions on the Intel MIC architecture. IEEE Transactions on Visualization and Computer Graphics 18, 9 (September 2012), 1438–1448.
  • [Dam11] Dammertz H.: Acceleration Methods for Ray Tracing based Global Illumination. PhD thesis, Ulm University, 2011.
  • [DHK08] Dammertz H., Hanika J., Keller A.: Shallow bounding volume hierarchies for fast SIMD ray tracing of incoherent rays. Computer Graphics Forum 27, 4 (2008), 1225–1233.
  • [EG08] Ernst M., Greiner G.: Multi bounding volume hierarchies. In Proceedings of the IEEE Symposium on Interactive Ray Tracing 2008 (2008), pp. 35–40.
  • [Ern11] Ernst M.: Embree: Photo-realistic ray tracing kernels. In ACM SIGGRAPH 2011 Exhibitor Tech Talks (2011).
  • [FS05] Foley T., Sugerman J.: KD-tree acceleration structures for a GPU raytracer. In Proceedings of the ACM SIGGRAPH/Eurographics Conference on Graphics Hardware (New York, NY, USA, 2005), HWWS '05, ACM Press, pp. 15–22.
  • [GPM11] Garanzha K., Pantaleoni J., McAllister D.: Simpler and faster HLBVH with work queues. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics (New York, NY, USA, 2011), HPG '11, ACM Press, pp. 59–64.
  • [HBŽ98] Havran V., Bittner J., Žára J.: Ray tracing with rope trees. In Proceedings of SCCG'98 (Spring Conference on Computer Graphics) (Budmerice, Slovak Republic, April 1998), pp. 130–139.
  • [HDW*11] Hapala M., Davidovič T., Wald I., Havran V., Slusallek P.: Efficient stack-less BVH traversal for ray tracing. In Proceedings of the 27th Spring Conference on Computer Graphics (New York, NY, USA, 2011), SCCG '11, ACM Press, pp. 7–12.
  • [HSHH07] Horn D. R., Sugerman J., Houston M., Hanrahan P.: Interactive k-d tree GPU raytracing. In Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games (New York, NY, USA, 2007), I3D '07, ACM Press, pp. 167–174.
  • [Int13] Intel: Intel Xeon Phi System Software Developer's Guide, June 2013.
  • [KA13] Karras T., Aila T.: Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference (New York, NY, USA, 2013), HPG '13, ACM Press, pp. 89–99.
  • [KIS*12] Kopta D., Ize T., Spjut J., Brunvand E., Davis A., Kensler A.: Fast, effective BVH updates for animated scenes. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (New York, NY, USA, 2012), I3D '12, ACM Press, pp. 197–204.
  • [KSS*13] Kopta D., Shkurko K., Spjut J., Brunvand E., Davis A.: An energy and bandwidth efficient ray tracing architecture. In Proceedings of the 5th High-Performance Graphics Conference (New York, NY, USA, 2013), HPG '13, ACM Press, pp. 121–128.
  • [Lai10] Laine S.: Restart trail for stackless BVH traversal. In Proceedings of the Conference on High Performance Graphics (Aire-la-Ville, Switzerland, 2010), HPG '10, Eurographics Association, pp. 107–111.
  • [MB90] MacDonald D. J., Booth K. S.: Heuristics for ray tracing using space subdivision. The Visual Computer 6, 3 (May 1990), 153–166.
  • [MT97] Möller T., Trumbore B.: Fast, minimum storage ray-triangle intersection. Journal of Graphics Tools 2, 1 (1997), 21–28.
  • [NFLM07] Navrátil P. A., Fussell D. S., Lin C., Mark W. R.: Dynamic ray scheduling to improve ray coherence and bandwidth utilization. In Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing (Washington, DC, USA, 2007), RT '07, IEEE Computer Society, pp. 95–104.
  • [Nvi12] Nvidia: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper, NVIDIA Corporation, 2012.
  • [PGSS07] Popov S., Günther J., Seidel H.-P., Slusallek P.: Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum 26, 3 (2007), 415–424.
  • [PKGH97] Pharr M., Kolb C., Gershbein R., Hanrahan P.: Rendering complex scenes with memory-coherent ray tracing. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 1997), SIGGRAPH '97, ACM Press/Addison-Wesley Publishing Co., pp. 101–108.
  • [SCS*08] Seiler L., Carmean D., Sprangle E., Forsyth T., Abrash M., Dubey P., Junkins S., Lake A., Sugerman J., Cavin R., Espasa R., Grochowski E., Juan T., Hanrahan P.: Larrabee: A many-core x86 architecture for visual computing. ACM Transactions on Graphics 27, 3 (August 2008), 18:1–18:15.
  • [SFD09] Stich M., Friedrich H., Dietrich A.: Spatial splits in bounding volume hierarchies. In Proceedings of the Conference on High Performance Graphics 2009 (New York, NY, USA, 2009), HPG '09, ACM Press, pp. 7–13.
  • [Smi98] Smits B.: Efficiency issues for ray tracing. Journal of Graphics Tools 3, 2 (February 1998), 1–14.
  • [TMG09] Torres R., Martín P. J., Gavilanes A.: Ray casting using a roped BVH with CUDA. In Proceedings of the 25th Spring Conference on Computer Graphics (New York, NY, USA, 2009), SCCG '09, ACM Press, pp. 95–102.
  • [WBB08] Wald I., Benthin C., Boulos S.: Getting rid of packets – efficient SIMD single-ray traversal using multi-branching BVHs. In Proceedings of the IEEE Symposium on Interactive Ray Tracing 2008 (2008), pp. 49–57.