RT Engine: An Efficient Hardware Architecture for Ray Tracing

Yan, Run; Huang, Libo; Guo, Hui; Lü, Yashuai; Yang, Ling; Xiao, Nong; Wang, Yongwen; Shen, Li; Lan, Mengqiao

doi:10.3390/app12199599

by Run Yan ¹, Libo Huang ¹,*, Hui Guo ¹, Yashuai Lü ², Ling Yang ¹, Nong Xiao ¹, Yongwen Wang ¹, Li Shen ¹ and Mengqiao Lan ¹

¹ School of Computer, National University of Defense Technology, Changsha 410005, China

² Huawei 2012 Labs, Beijing 100089, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(19), 9599; https://doi.org/10.3390/app12199599

Submission received: 30 August 2022 / Revised: 16 September 2022 / Accepted: 19 September 2022 / Published: 24 September 2022

Abstract: The realism that ray tracing brings to rendering is becoming increasingly apparent in computer vision and industrial applications. However, designing efficient ray tracing hardware is challenging because of memory access issues, divergent branches, and daunting computational intensity. This article presents a novel architecture, the RT engine (Ray Tracing engine), that accelerates ray tracing. First, we set up multiple stacks to store information for each ray so that the RT engine can process many rays in parallel. The information held in these stacks effectively improves system performance. Second, we adopt a three-phase break method in the triangle intersection test, which lets the loop exit earlier. Third, the reciprocal unit adopts an approximation method that combines Parabolic Synthesis and Second-Degree interpolation. Combining these strategies, we implement our system at the RTL level with agile chip development. Simulation and experimental results show that our architecture achieves a performance per area that is 2.4× greater than the best reported results for ray tracing on dedicated hardware.

Keywords:

machine vision; computer graphics; hardware architecture; rendering; ray tracing; graphics accelerators

1. Introduction

As a central topic of computer graphics, rendering is the process of synthesizing an image from a 3D scene. The dominant rendering algorithms fall into two branches: rasterization [1] and ray tracing [2]. Rasterization maps the geometry in the scene onto pixels and is very effective for hardware acceleration. Currently, most graphics processing units (GPUs) generate 3D images by rasterization. However, the image quality is limited because rasterization only deals with local illumination. Ray tracing, on the other hand, generates high-quality 3D images by simulating the optical behavior of light, such as reflection, refraction, and shadows. Many modern graphics applications, including high-end games [3], movie special effects [4], Virtual Reality (VR) [5], and computer-aided design, require a high-quality visual experience that is challenging for rasterization but straightforward for ray tracing.

The essence of ray tracing is to find the closest objects that intersect with rays in a 3D scene and then pass the results of the intersection test to a shader for shading. Researchers proposed the acceleration structure (AS) to speed up this test. The idea is to arrange the graphics primitives in hierarchical spatial structures so that the test of rays against the scene can quickly eliminate unrelated regions of space and identify the primitives closest to the rays. The most widely used acceleration structure is a tree. The leaf nodes of the tree contain all scene primitives, and the internal nodes either subdivide a sizeable spatial region into multiple smaller regions (KD-trees [6]) or decompose the objects in the scene into smaller object sets (Bounding Volume Hierarchy (BVH) [7]). Based on these structures, ray tracing is divided into two main parts: ray traversal of the acceleration structure and the intersection test between rays and primitives. This computation is one of the main bottlenecks in improving the performance of ray tracing [8] and occupies nearly 70–80% of the total calculation time. Therefore, many studies focus on achieving higher performance in ray traversal and intersection tests.

In the past few decades, researchers worldwide have carried out significant research into software and hardware acceleration of ray tracing. Various platforms are used for hardware acceleration, including central processing units (CPUs), GPUs, and dedicated hardware. Owing to their hardware features, GPUs and dedicated hardware show significant performance advantages. There is a lot of research on optimizing GPU acceleration. Ray tracing was initially difficult to implement on GPUs because early GPUs did not support general-purpose computation. Nathan A. Carr et al. [9] made the first attempt to implement ray tracing on a GPU. However, they only implemented the intersection test. Their work reconfigured the geometry engine into a ray engine that efficiently intersects caches of rays for many host-based rendering tasks. Timo Aila et al. [10] studied the mapping of elementary acceleration structure traversal and primitive intersection onto wide Single Instruction Multiple Data (SIMD)/Single Instruction Multiple Threads (SIMT) machines. Yashuai Lü et al. [11] proposed the Dynamic Ray Shuffling (DRS) architecture for GPUs to address control flow divergence in ray tracing. The critical insight was that the primary control flow divergences, caused by inconsistent ray traversal states within a warp, could be eliminated by DRS. Experimental results show that, for an estimated 0.11% area cost, DRS improves the SIMD efficiency for the tested benchmarks from 41.06% to 81.04% on average. Lufei Liu et al. [12] explored integrating a ray prediction strategy into existing GPU pipelines and improving predictor effectiveness by predicting nodes higher in the tree and by regrouping and scheduling traversal operations in a low-cost, reasonable manner. It can be seen that GPU platform optimization pays more attention to the computational part of the algorithm. Academic research has achieved good gains by reducing branch processing and redundant operations through architectural optimization strategies that make the most of the hardware's characteristics.

In addition, commercial GPUs have also introduced accelerated architectures for ray tracing. In 2018, NVIDIA launched the first ray tracing GPU with the first-generation RT Core in the Turing architecture [13]. In 2021, the Ampere architecture with the second-generation RT Core was introduced [14]. The RT Cores replace software emulation, performing the tree traversal and the ray/box and ray/triangle intersection tests. A ray query is sent from the streaming multiprocessor (SM) to the RT Core. The RT Core uses dedicated evaluators to test each ray against the boxes or, at the leaves of the tree, the triangles that make up the scene. It does this repeatedly, optionally keeping track of the closest intersection found. When the appropriate intersection point is determined, the result is returned to the SM for further processing. In 2021, Imagination added the Ray Acceleration Cluster (RAC) of the PowerVR Photon architecture to the C-series GPU to provide ray tracing IP technology for the mobile phone market [15]. RAC uses a highly parallel Dual Triangle Tester Unit to improve the efficiency of hardware computing. Its primary function is to perform the intersection test of two triangles simultaneously and send the results to the next stage. This GPU series is divided into single-core and multi-core models for mobile and beyond-mobile configurations.

In recent years, dedicated hardware has also been widely referenced in ray tracing studies. SaarCOR [16] is a ray tracing pipeline consisting of a ray generation/shading unit, a four-wide SIMD traversing unit, a list unit, a transformation, and an intersection test unit. The T&I engine [17] is a hardware acceleration architecture of ray traversal and intersection tests. This architecture adopts an order depth-first layout method to reduce memory bandwidth. It proposes the three-phase ray-triangle intersection and a latency hiding architecture defined as the ray accumulation unit. SGRT [18] is a real-time mobile ray tracing GPU. It mainly includes two key features: (1) an area-efficient parallel pipelined traversal unit; (2) flexible and high-performance kernels for shading and ray generation. RayCore [19] mainly includes ray-tracing units (RTUs) based on a unified traversal and intersection pipeline and a tree-building unit (TBU) for dynamic scenes. HART [20] utilizes heterogeneous hardware resources: dedicated ray-tracing hardware for BVH update and ray traversal and a CPU for BVH reconstruction. It also uses PrimAABB for traversal scheduling. Lee et al. [21] optimized the SGRT [18] and proposed two-AABB traversal architecture with two ray-AABB testing units. The experimental results showed that two-AABB was up to 2.9 times faster than the single-pipeline architecture. Kopta et al. proposed STRaTA (Streaming Treelet Ray Tracing Architecture) [22] to decrease energy consumption on massively parallel graphics processors. Viitanen et al. applied MBVH (Multi Bounding Volume Hierarchy) [23] to a fixed-function ray tracing accelerator architecture. With primary rays, energy efficiency improves by 15% and performance per area improves by 20%. Another implementation approach is multiple streams, such as a different approach to hardware-accelerated ray tracing, which begins by modifying the order of rendering operations, proposed by Konstantin Shkurko et al. [24]. 
The dual streaming approach organizes the memory accesses of ray tracing into two predictable data streams. E. Vasiou et al. [25] introduced Mach-RT (Many Chip-Ray Tracing), a new hardware architecture for accelerating ray tracing. The primary approach combines a ray ordering scheme that minimizes access to the scene data with a sizeable on-chip buffer, spread over multiple chips, that acts as near-compute storage.

Existing academic research and commercial GPUs show that ray tracing acceleration has gradually gained computing capability as algorithms and industrial technology have progressed. Because different designs adopt different algorithms and serve different purposes, their strategies differ considerably in performance and hardware resource overhead. However, these designs still have performance/area limitations, so we focus on a more efficient hardware architecture for ray tracing.

Our main contributions to the literature are as follows:

(1): Three optimization methods for ray tracing memory access, branches, and functional units, respectively;
(2): A new hardware architecture based on an area-efficient parallel pipelined ray traversal and intersection unit;
(3): Implementation of the whole system at the RTL level and an assessment of its hardware resource overhead.

The optimization methods are multiple stacks, three-phase break, and approximate method of the reciprocal unit. Multiple stacks store information during ray traversal to ensure system parallelism and reduce memory access. A three-phase break makes the intersection of rays and primitives more efficient and allows an earlier exit from the loop. The approximate method can significantly reduce the hardware overhead and can converge faster. This paper is organized as follows: in Section 2, we describe the overall architecture design and the data flow. Optimization and design methods for RT engine are discussed in Section 3. In Section 4, we mainly evaluate and analyze the proposed architecture and estimate the hardware consumption. The conclusion of the paper is presented in Section 5.

2. Overall System Architecture

In this section, we describe the overall architecture of the RT engine. We will illustrate our basic design decisions, present the BVH ray tracing algorithm, introduce our system architecture, and then describe the data flow of RT engine.

2.1. Basic Design Decision

There are many decisions to be made in our design, which we will describe next.

Domain-specific architectures: A more hardware-centric approach is to design architectures tailored to a specific problem domain and offer significant performance (and efficiency) gains for that domain. This design approach exploits more efficient parallelism for the specific domain [26]. This design can speed up some applications compared to running the entire application on a general-purpose CPU. DSAs can perform better because they are better suited to the application’s needs. In addition, DSAs can make more effective use of the memory hierarchy. Memory access has become much more costly than arithmetic computations, as Horowitz [27] noted.

Acceleration structure: Many types of acceleration structure research focus on kd-tree and BVH. In the last two decades, the bounding volume hierarchy (BVH) has become the de facto standard acceleration structure for ray-tracing-based rendering algorithms [7]. The BVH traversal algorithm usually occupies less memory bandwidth and has the characteristics of a compact traversal state. NVIDIA [14] and Imagination [15] GPUs also choose BVH as an acceleration structure in the industry. We choose binary-based BVH in our design. It can effectively reduce hardware overhead and design costs compared to other kinds.

Per-ray traversal: There are two ways to traverse rays through the tree. Packet traversal makes a group of rays follow the same tree path. This is achieved by sharing the traversal stack among the rays, which means that some rays will traverse nodes they do not intersect. The other is per-ray traversal, which allows each ray to traverse independently, visiting a node's children only when the node intersects the ray. Each ray requires a separate traversal stack to store its data. Our design chooses the second way [7].

Primitive type: We use only triangles as the primitive type. This choice improves system performance and simplifies the design because it eliminates branches for different primitive types. Therefore, other geometry should be converted into triangles before rendering, just as in rasterization-based GPU processing.

First-hit traversal: This approach is the most widely used and is indispensable for computing the radiance at a shading point, which finds the nearest object in the direction of a ray from its origin. In binary-based BVHs, this approach can efficiently push the farther intersected node onto a stack and visit the closer one first [7].

2.2. Ray Tracing Algorithm

The BVH ray traversal algorithm can be described as the while–while loop [10] in Algorithm 1. Initially, the node address is set to the root node of the BVH, and the bottom of the stack holds a mask value. When a ray enters this loop, it traverses from the root node. The outer while loop determines whether the current ray has finished: when the node address equals the stack's mask value, the current ray has completed its traversal. The first inner while loop is a depth-first traversal of the BVH. When the node is an internal node, the ray is tested against its child nodes. If both child nodes intersect the ray, the farther child is pushed onto the stack and the nearer child is passed to the next unit; if only one child node is intersected, that child is traversed; if no child node is intersected, the stack is popped. The second inner while loop handles leaf nodes. Its primary purpose is to test the ray against the primitives of the scene and calculate the intersection point. The shader can use the results of this loop to render an image. Once this part finishes, the current leaf node is complete and the stack is popped. We adopt a multiple-stacks approach to exploit the parallelism of ray tracing. In addition, we design two improved methods to accelerate the triangle intersection test. We introduce each of these methods in detail next.

Algorithm 1 is a standard ray tracing algorithm for BVHs. More recent work has introduced stackless BVH traversal, such as Hapala et al. [28]. This method uses simple logic and a parent pointer stored in each node to minimize the stack's use. However, it requires re-visiting internal nodes, which causes some redundant operations. Binder et al. [29] present a BVH backtracking algorithm that runs in constant time with two tables. The primary method is to prepare the bit trail and the current key for each level, which increases the BVH's storage space. In addition, the two tables occupy ample storage. In dedicated hardware, the full-stack method is easier and more efficient than these methods. Many later works have also studied multi-branching wide BVHs [23,30]. Compared with a wide BVH, a binary BVH reduces the cost of each operation and reduces redundant processing. So, we finally chose this standard algorithm.

Algorithm 1: BVH Ray Tracing Algorithm

Input: ray, rootNode of the BVH

Output: intersection results

hit ← false
curNode ← rootNode
Stack ← ϕ
while curNode ≠ ϕ do
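The while–while loop described in Section 2.2 can be illustrated with a minimal Python sketch. This is not the authors' listing: to stay short it uses 1D intervals as stand-ins for bounding boxes and points as stand-ins for triangles, and the names `Node`, `box_entry`, and `traverse` are hypothetical. An empty stack plays the role of the mask value at the bottom of the stack.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    lo: float                       # interval start (1D stand-in for an AABB)
    hi: float                       # interval end
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Tuple[float, ...] = ()   # leaf payload: 1D "primitives" (points)

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def box_entry(origin: float, node: Node) -> Optional[float]:
    """Entry distance of a ray travelling in +x into the node, or None on a miss."""
    if node.hi < origin:
        return None
    return max(node.lo - origin, 0.0)


def traverse(origin: float, root: Node) -> Optional[float]:
    """While-while first-hit traversal; returns the closest hit distance or None."""
    closest: Optional[float] = None
    stack: List[Node] = []          # empty stack == traversal finished
    cur = root if box_entry(origin, root) is not None else None
    while cur is not None:          # outer loop: has this ray finished?
        # First inner loop: depth-first descent through internal nodes.
        while cur is not None and not cur.is_leaf:
            t_left = box_entry(origin, cur.left)
            t_right = box_entry(origin, cur.right)
            if t_left is not None and t_right is not None:
                near, far = ((cur.left, cur.right) if t_left <= t_right
                             else (cur.right, cur.left))
                stack.append(far)   # push the farther child
                cur = near          # continue with the nearer child
            elif t_left is not None:
                cur = cur.left
            elif t_right is not None:
                cur = cur.right
            else:
                cur = stack.pop() if stack else None
        # Second part: test the primitives of the leaf, then pop.
        if cur is not None:
            for p in cur.prims:
                t = p - origin
                if t >= 0.0 and (closest is None or t < closest):
                    closest = t
            cur = stack.pop() if stack else None
    return closest
```

The sketch keeps the push-far/visit-near ordering of first-hit traversal but omits refinements such as culling nodes that lie beyond the current closest hit.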

2.3. System Architecture

Figure 1 shows the overall design of the RT engine, which mainly includes memory and five parts. These five parts are the Shader, Issue Ray, Ray Traversal, Stack Management, and Triangle Intersection.

The Shader’s (SHD) purpose is to provide our system with rays, BVHs, and primitives data. In addition, we can render our system results and generate expected rendering images. This part is in conjunction with the CPU or GPU existing shader.

The Issue Ray part determines which ray and node data enter the system according to the different access requests. It includes two units: Arbitration (ARB) and Ray Dispatch (DIS). Ray Dispatch handles the dispatch requests of rays: when a ray in the system completes, a new ray is issued. The ARB resolves data conflicts among three request sources: nodes returned from the Ray Traversal unit, nodes from the stacks, and newly issued rays. To avoid these conflicts, we set up arbitration as the preliminary processing step for data requests. So that earlier-issued rays finish sooner, the arbiter gives top priority to nodes from the stacks, second priority to nodes from the Ray Traversal unit, and lowest priority to newly issued rays.
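The priority order described above can be sketched as a fixed-priority arbiter over three queues. This is a behavioral illustration only, not the RTL: the class name `Arbiter`, the source labels, and the `request`/`grant` methods are assumptions, and the default FIFO depth of 25 is taken from the hardware setup in Section 4.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class Arbiter:
    """Fixed-priority arbiter: stack nodes first, then traversal nodes, then new rays."""

    PRIORITY = ("stack", "traversal", "issue")   # highest priority first

    def __init__(self, depth: int = 25) -> None:
        self.depth = depth
        self.fifos: Dict[str, Deque[object]] = {name: deque() for name in self.PRIORITY}

    def request(self, source: str, item: object) -> bool:
        """Enqueue a request; returns False when that source's FIFO is full."""
        fifo = self.fifos[source]
        if len(fifo) >= self.depth:
            return False
        fifo.append(item)
        return True

    def grant(self) -> Optional[Tuple[str, object]]:
        """Return the oldest request of the highest-priority non-empty FIFO."""
        for name in self.PRIORITY:
            if self.fifos[name]:
                return name, self.fifos[name].popleft()
        return None
```

With one pending request per source, `grant()` serves the stack node, then the traversal node, then the newly issued ray, which favors rays already in flight.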

The Ray Traversal (TRV) unit is essential to the entire system design. This part corresponds to Traverse in Algorithm 1. The module tests rays against BVH nodes. Its inputs are ray data and BVH data, and its output controls what happens next. If the ray intersects both child nodes of a BVH node, the farther node is pushed onto the stack and the nearer node is passed to the following stage. If the ray intersects only one child node, the unit checks whether that node is a leaf node. If the ray intersects neither child node, the stack is popped.
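The three outcomes of the TRV unit can be sketched in Python as follows. The ray/box test shown is the common slab test, which is an assumption here because the text does not specify how the box test is computed; `slab_test` and `trv_step` are hypothetical names, and the sketch assumes every direction component is non-zero so that the reciprocal direction is finite.

```python
import math
from typing import Optional, Sequence, Tuple


def slab_test(origin: Sequence[float], inv_dir: Sequence[float],
              box_min: Sequence[float], box_max: Sequence[float],
              t_max: float = math.inf) -> Optional[float]:
    """Return the entry distance of the ray into the box, or None on a miss."""
    t_near, t_far = 0.0, t_max
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:                 # ray runs in the negative direction on this axis
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near if t_near <= t_far else None


def trv_step(t_left: Optional[float],
             t_right: Optional[float]) -> Tuple[str, Optional[str], Optional[str]]:
    """Map the two child results to (action, child to visit next, child to push)."""
    if t_left is not None and t_right is not None:
        if t_left <= t_right:
            return "descend", "left", "right"
        return "descend", "right", "left"
    if t_left is not None:
        return "descend", "left", None
    if t_right is not None:
        return "descend", "right", None
    return "pop", None, None        # neither child hit: pop the stack
```

The leaf check that follows a single-child hit is left out; in this sketch the caller would inspect the returned child and route it either back to traversal or to the intersection stage.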

The Stack Management unit is the key to processing multiple rays in parallel. This part corresponds to Stack.pop and Stack.push in Algorithm 1. Its primary role is to operate the stacks according to the results of the Ray Traversal unit and the Triangle Intersection unit. To keep the stacks accurate and ordered, we added a LUT (Look Up Table), which ensures that the data of rays and stacks correspond.

The Triangle Intersection (IST) unit tests the ray against the triangles of a leaf node, which corresponds to Intersection in Algorithm 1. For the design of this part, we refer to the Woop algorithm [31]. This algorithm has three logical parts: the ray-plane test, the barycentric test, and the final hit-point calculation. IST1 completes the ray-plane test, IST2 completes the barycentric test, and IST3 completes the final hit-point calculation. The design of this unit uses two optimization methods to improve system performance, which we describe in detail in the subsequent sections.

Table 1 illustrates how rays move from unit to unit. The numbers in Table 1 correspond to the numbers in Figure 1.

3. Proposed RT Engine Architecture

3.1. Multiple Stacks

As shown in Algorithm 1, each ray is input to the TRV unit, and TRV sends node information to the stack based on the results. If the ray intersects both child nodes, the farther node is pushed onto the stack, and the closer node is sent to the next stage. If the ray intersects neither child node, TRV sends a pop request. In addition, when the triangle intersection test is completed, the IST requests a pop. The stack saves the pending nodes and ensures the correctness of the whole process.

However, only one ray can be processed when there is only one stack in the system. This design cannot achieve multiple ray parallel processing, resulting in a waste of hardware resources. In this case, we consider setting a group of stacks to save the nodes’ information of multiple rays. This kind of design allows multiple rays to process at the same time.

Stack Management handles stack operations in our system. To keep the system correct when multiple stacks store node information, we set up the LUT unit to manage the stacks. The data stored in a stack are node addresses, and the nodes may be internal nodes or leaf nodes. As shown in Figure 2, push requests come from the TRV, and pop requests come from the TRV and IST. After the MUX selects a request, the address of the corresponding ray is compared with the ray addresses in the LUT. Once the ray's address is matched, the request is sent to the corresponding stack. Such a design guarantees the parallelism of the system while ensuring accuracy, and it achieves a relatively high performance improvement with little hardware resource overhead.

Compared with a single stack, the performance of multiple stacks by LUT can improve by 9.99–20.78×.
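A behavioral sketch of LUT-managed stacks is given below, under stated assumptions: a Python dictionary stands in for the LUT that matches a ray to its stack, the class and method names (`StackManager`, `allocate`, `push`, `pop`) are hypothetical, and the defaults of 25 stacks with depth 64 follow the hardware setup in Section 4 (read here as 25 independent stacks). Releasing a stack when a pop finds it empty is one plausible policy, not something the text specifies.

```python
from typing import Dict, List, Optional


class StackManager:
    """One traversal stack per in-flight ray, selected through a ray-to-stack LUT."""

    def __init__(self, num_stacks: int = 25, depth: int = 64) -> None:
        self.depth = depth
        self.stacks: List[List[int]] = [[] for _ in range(num_stacks)]
        self.lut: Dict[int, int] = {}                 # ray id -> stack index
        self.free: List[int] = list(range(num_stacks))

    def allocate(self, ray_id: int) -> bool:
        """Bind a newly issued ray to a free stack; False if none is free."""
        if not self.free or ray_id in self.lut:
            return False
        self.lut[ray_id] = self.free.pop(0)
        return True

    def push(self, ray_id: int, node_addr: int) -> bool:
        """Push a node address for this ray; False if its stack is full."""
        stack = self.stacks[self.lut[ray_id]]
        if len(stack) >= self.depth:
            return False
        stack.append(node_addr)
        return True

    def pop(self, ray_id: int) -> Optional[int]:
        """Pop the next node for this ray; None means the ray is finished."""
        index = self.lut[ray_id]
        stack = self.stacks[index]
        if stack:
            return stack.pop()
        del self.lut[ray_id]                          # release the LUT entry
        self.free.append(index)                       # stack can serve a new ray
        return None
```

Because each request is routed by ray id, pushes and pops from different rays can interleave in any order without corrupting one another's traversal state.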

3.2. Three-Phase Break for Intersection

Figure 3 shows the method from Woop [31], which is a fundamental algorithm for triangle intersection tests. First, a ray-plane test checks whether the ray intersects the triangle's plane within the valid t interval. The next phase performs a barycentric test, which checks the barycentric coordinates (u, v) to determine whether the hit point lies inside or outside the triangle. If the ray passes these two tests, the values of t, u, and v can be calculated. The last phase uses the intermediate values calculated by the first two phases, following Cramer's rule.

We set up IST1 to complete the ray-plane test. The t value calculated by this unit is checked first; the result depends on whether the ray and the triangle's plane intersect at a valid t. If the test fails, a pop request is sent to the LUT. If the test passes, the t value is transmitted to the IST2 unit for the next step. The IST2 unit performs the barycentric test and obtains u, which is then checked. If the test fails, a pop request is issued to the LUT. If the test passes, the obtained u value is transmitted to IST3. The IST3 phase processes the v value, which determines whether the ray intersects the triangle, and returns the values of t, u, and v. The earlier a miss is detected, the sooner the stack operation reaches the LUT. This design allows the triangle intersection test to produce its output earlier.
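The early-exit structure of the three phases can be sketched in Python. Note the substitution: instead of the Woop transform from [31], this sketch uses a plain plane-then-barycentric formulation, and it performs two divisions rather than the single reciprocal of the hardware design. The function name `intersect` and the returned labels are illustrative, chosen to mirror the IST1/IST2/IST3 stages.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(orig: Vec, direction: Vec, v0: Vec, v1: Vec, v2: Vec,
              t_max: float = math.inf
              ) -> Tuple[str, Optional[Tuple[float, float, float]]]:
    """Return ("hit", (t, u, v)) or the phase at which the test broke out."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    n = cross(e1, e2)                       # plane normal (not normalized)

    # Phase 1 (IST1): ray-plane test, check t.
    denom = dot(n, direction)
    if denom == 0.0:                        # parallel ray or degenerate triangle
        return "IST1 break", None
    t = dot(n, sub(v0, orig)) / denom
    if t < 0.0 or t > t_max:
        return "IST1 break", None

    # Phase 2 (IST2): first barycentric coordinate, check u.
    p = (orig[0] + t * direction[0],
         orig[1] + t * direction[1],
         orig[2] + t * direction[2])
    w = sub(p, v0)
    nn = dot(n, n)
    u = dot(cross(w, e2), n) / nn
    if u < 0.0 or u > 1.0:
        return "IST2 break", None

    # Phase 3 (IST3): second barycentric coordinate, check v and u + v.
    v = dot(cross(e1, w), n) / nn
    if v < 0.0 or u + v > 1.0:
        return "IST3 break", None
    return "hit", (t, u, v)
```

Each `break` return corresponds to a point where a pop request could be issued without computing the remaining phases.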

The simulation results show this method can effectively improve performance. For the simulation results, see Table 2, where "IST 1 break" means the ray did not pass the ray-plane test and exited Intersection, etc. For the test scene, see Figure 4.

3.3. Approximation Method for Reciprocal

In Algorithm 1, Intersection needs one reciprocal operation. Our design uses an approximation method for the reciprocal that combines Parabolic Synthesis [32] and Second-Degree interpolation [33]. Compared to other methods, this approach converges faster and has a smaller chip area. Our floating-point calculation unit follows the IEEE 754 standard. In the first step, we extract the sign bit, the exponent bits, and the significand bits of the divisor. The significand bits are issued to the inverter, where the Harmonized Parabolic Synthesis (approximation) is performed. In this block, the coefficients of the synthesis method are stored in look-up tables.

To enable the entire process to handle the parallel processing, the processing of the exponent is performed as follows:

Removing the bias:

$$e' = e - 127 \tag{1}$$

Inverting the sign and adding the bias:

$$e'' = 127 - e' \tag{2}$$

Combining Equations (1) and (2):

$$e'' = 254 - e \tag{3}$$
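The exponent handling in Equations (1) to (3) can be checked with a small Python sketch. It splits a single-precision value into its IEEE 754 fields, applies Equation (3) to the exponent, and, as a plainly labeled substitution, uses an exact division for the significand where the hardware would use the Parabolic Synthesis approximation. The function name `reciprocal_via_fields` is hypothetical, and zero, subnormal, infinite, and NaN inputs are outside the sketch.

```python
import math
import struct


def reciprocal_via_fields(x: float) -> float:
    """Reciprocal of a normal float32 value computed field by field."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    e = (bits >> 23) & 0xFF             # biased exponent
    frac = bits & 0x7FFFFF              # 23 stored significand bits
    if e == 0 or e == 255:
        raise ValueError("zero, subnormal, inf and NaN are not handled here")

    e_out = 254 - e                     # Equation (3)

    significand = 1.0 + frac / 2 ** 23  # value in [1, 2)
    # Exact division stands in for the approximation unit (assumption).
    recip_significand = 1.0 / significand   # value in (0.5, 1]

    # Re-apply the bias; the result is a Python float, so the hardware's
    # renormalization of the significand is not modelled.
    value = math.ldexp(recip_significand, e_out - 127)
    return -value if sign else value
```

For example, 2.0 has biased exponent 128, so Equation (3) gives 126, which is the biased exponent of 0.5.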

The schematic diagram of the entire process is in Figure 5. The reciprocal unit has been tested for every possible input. The max error is $1.18358967 \times 10^{-7}$ ($\approx 2^{-23.01}$), which is smaller than the machine epsilon $\epsilon$ (upper bound for the error) that is commonly defined as $2^{-23}$ (by ISO C, Matlab, etc.) for the single-precision floating-point format.
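The relation between the reported maximum error and the single-precision machine epsilon can be verified numerically. The short check below only restates the two figures quoted above; the constant names are illustrative.

```python
import math

MAX_ERROR = 1.18358967e-7       # maximum error reported for the reciprocal unit
EPSILON = 2.0 ** -23            # single-precision machine epsilon

# Express the maximum error as a power of two and compare with epsilon.
error_log2 = math.log2(MAX_ERROR)
below_epsilon = MAX_ERROR < EPSILON
```

Here `error_log2` rounds to -23.01, which agrees with the exponent quoted in the text.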

Reduced precision has been studied extensively in ray tracing. For example, Vaidyanathan et al. [34] reduce the precision of the BVH to improve performance. Their method requires only a few hardware resources and reduces the complexity of traversal calculations while maintaining robust image quality. Our approach differs in that it leaves the precision of the BVH untouched and instead reduces precision and hardware resources in the reciprocal operation.

To evaluate the reciprocal unit used by the RT engine, we compare its area overhead with the same functional units in other academic research (see Table 3). The table lists the reciprocal-unit area reported by several academic ray tracing designs. SGRT and HART use a 65 nm process and a 500 MHz clock rate. For a better comparison, we convert the areas approximately to NAND2 equivalents under the same clock rate.

4. Simulation Results and Analysis

To achieve more accurate functional verification, we implement our system at the RTL level in the Chisel language [35], which generates Verilog HDL (Hardware Description Language). We build a cycle-accurate simulator with Verilator and Scala to verify the correctness of the design. The simulation reports the total number of cycles used to render a scene, hardware utilization, average steps, and hit results.

Our system uses BVHs built with the SAH [36]. Figure 4 shows the four test scenes: Conference (350K Tris), Fairyforest (186K Tris), Sibenik (97K Tris), and Sanmiguel (12.1M Tris). We use primary rays for the simulation. All scenes are rendered at 1920 × 1080 resolution. Finally, we use a synthesis tool to assess the hardware overhead.

The hardware setup is as follows. The RT engine has a total of 25 groups of stacks, and the depth of each stack is 64. The ARB contains three FIFOs, each with a depth of 25. Because this paper focuses on the computation of traversal and intersection tests, no cache is included.

4.1. Hardware Complexity and Area Estimation

In Table 4, we compare our architecture with other dedicated architectures. Our architecture significantly reduces the number of floating-point units and has several features. First, our algorithm is simpler and more efficient than those of other architectures. Second, we use multiply–accumulate units so that the design occupies less area than other architectures. Third, we use fewer reciprocal units, whereas the reciprocal or divider units in other architectures occupy a large amount of chip area. Therefore, we conclude that our architecture is more efficient than previous researchers' methods.

Table 5 summarizes the area estimation of the RT engine. Stacks, FIFOs, and functional units all require hardware resources. We estimate that arithmetic units account for 55.9% of the total area of this design. In total, the RT engine occupies 0.48 mm² in a 28 nm process at an 850 MHz clock rate.

4.2. Simulation Results

Table 6 shows our simulation results at the 850 MHz clock rate. The results show that resource utilization and stack utilization differ between test scenes. In general, the more complicated the test scene, the higher the stack utilization; the IST then stays at a relatively low utilization rate, and performance decreases. In a single-core configuration, the RT engine achieves 16.59–92.74 MRPS (million rays per second).
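To put the throughput range in context, the following back-of-the-envelope sketch converts MRPS into frame time and average cycles per ray. It assumes exactly one primary ray per pixel at 1920 × 1080, which is an assumption of this illustration rather than a figure reported in the paper; the function names are hypothetical.

```python
RAYS_PER_FRAME = 1920 * 1080        # one primary ray per pixel (assumption)
CLOCK_HZ = 850e6                    # clock rate used in the simulation


def frame_time_ms(mrps: float) -> float:
    """Time to trace one frame of primary rays, in milliseconds."""
    return RAYS_PER_FRAME / (mrps * 1e6) * 1e3


def cycles_per_ray(mrps: float) -> float:
    """Average number of clock cycles between completed rays."""
    return CLOCK_HZ / (mrps * 1e6)


fastest = (frame_time_ms(92.74), cycles_per_ray(92.74))
slowest = (frame_time_ms(16.59), cycles_per_ray(16.59))
```

Under this assumption, the two ends of the reported range correspond to roughly 22 ms and 125 ms per frame of primary rays on a single core.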

Table 7 shows that our architecture performs better than existing ray tracing architectures. For a fair comparison, most of the listed results use the same scenes as our architecture; where no common scene is available, we list the best reported performance. The area data in Table 7 cover the traversal and intersection computing units without cache, except for RayCore and Two-AABB, which did not mention the cache area.

In Table 7, the performance/area of our architecture is 193.21 MRPS/mm², which is 2.4× higher than the best results of other academic research. It can be found that the process (nm) significantly impacts the evaluation results. Our efficiency is also higher than RayCore [19] and Two-AABB [21] under the same process. One of the main reasons is that many architectures use SIMD or Multiple Instructions Stream Multiple Data Stream (MIMD) architectures, which are inefficient due to branching and memory access issues. Our single pipeline architecture is more efficient for ray tracing. Another key reason for the difference in performance/area is that the design goals of these architectures are different. For example, STRaTA [22], Dual Streaming [24], and Mach-RT [25] focus on the latency and conflict of memory access, so the method of multiple Thread Multiprocessors (TMs) is used. This design can effectively improve the overall performance, but there would be a lot of redundant hardware resources. RayCore [19] mainly focuses on dynamic scenes, so it is not dominant in this comparison. The focus of these designs is different from ours. RT engine’s low hardware consumption comes from our careful design, algorithm improvement, and fabrication technology innovations.
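The performance/area figure can be reproduced from numbers given elsewhere in the paper. The sketch below divides throughput by the 0.48 mm² area from Section 4.1; that the quoted 193.21 MRPS/mm² corresponds to the peak throughput of 92.74 MRPS is an inference from the arithmetic, not something the text states explicitly.

```python
AREA_MM2 = 0.48                 # RT engine area at 28 nm (Section 4.1)
PEAK_MRPS = 92.74               # upper end of the single-core range (Section 4.2)
LOW_MRPS = 16.59                # lower end of the single-core range


def perf_per_area(mrps: float, area_mm2: float = AREA_MM2) -> float:
    """Throughput normalized by silicon area, in MRPS/mm^2."""
    return mrps / area_mm2


peak_efficiency = perf_per_area(PEAK_MRPS)
low_efficiency = perf_per_area(LOW_MRPS)
```

The peak value agrees with the quoted figure to two decimal places, while the most demanding scene gives a considerably lower ratio.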

MRPS/mm² is a fair unit of measurement for ray tracing hardware because it is relatively independent of the resolution of the scene rendering [37]. As can be seen from the data in Table 7, our architecture is better than other researchers’ results in terms of efficiency. This result is encouraging because these designs could potentially be used as a co-processor for accelerating ray tracing performance on existing or future GPUs or CPUs.

5. Conclusions

In this article, we present the RT engine, an efficient ray tracing hardware architecture, developed by analyzing the relevant literature and algorithms and implemented at the RTL level. We adopt three optimization strategies to improve system efficiency with respect to memory access, branching, and heavy computation in ray tracing. Multiple stacks store ray traversal information. The three-phase break method lets the intersection loop exit earlier, and the approximation method for the reciprocal achieves hardware optimization and performance improvement. Based on these three optimization strategies, we use an agile chip development method to implement the design at the RTL level, verify the correctness of system functions, and evaluate performance through simulation. We use a synthesis tool to assess the chip area. The experimental results show that the performance/area (MRPS/mm²) of our architecture is about 2.4× higher than the best reported results of other academic research. These results indicate that our architecture can achieve efficient ray tracing.

Author Contributions

Writing—original draft, review and editing, R.Y.; Conceptualization, L.H.; hardware, H.G.; software, Y.L.; investigation, L.Y. and N.X.; analysis, L.S. and Y.W.; validation, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (No. 61872374/62102433).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank everyone who contributed to the realization of this research, either academically or through financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Catmull, E. A Subdivision Algorithm for Computer Display of Curved Surfaces; The University of Utah: Salt Lake City, UT, USA, 1974.
2. Whitted, T. An improved illumination model for shaded display. ACM Siggraph Comput. Graph. 1979, 13, 14.
3. Schmid, J.; Uludag, Y.; Deligiannis, J. It just works: Raytraced reflections in “Battlefield V”. In Proceedings of the GPU Technology Conference, San Francisco, CA, USA, 19–22 March 2019.
4. Christensen, P.; Fong, J.; Shade, J.; Wooten, W.; Schubert, B.; Kensler, A.; Friedman, S.; Kilpatrick, C.; Ramshaw, C.; Bannister, M. RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering. ACM Trans. Graph. 2018, 37, 1–21.
5. Velho, L.; da Silva, V.; Novello, T. Immersive visualization of the classical non-Euclidean spaces using real-time ray tracing in VR. In Proceedings of the Graphics Interface Conference 2020, Toronto, ON, Canada, 28–29 May 2020.
6. Cao, Y.; Zhang, X.; Duan, B.; Zhao, W.; Wang, H. An improved method to build the KD tree based on presorted results. In Proceedings of the 11th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 16–18 October 2020; pp. 71–75.
7. Meister, D.; Ogaki, S.; Benthin, C.; Doyle, M.J.; Guthe, M.; Bittner, J. A Survey on Bounding Volume Hierarchies for Ray Tracing. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2021; Volume 40, pp. 683–712.
8. Deng, Y.; Ni, Y.; Li, Z.; Mu, S.; Zhang, W. Toward Real-Time Ray Tracing: A Survey on Hardware Acceleration and Microarchitecture Techniques. ACM Comput. Surv. 2017, 50, 58.1–58.41.
9. Carr, N.A.; Hall, J.D.; Hart, J.C. The Ray Engine. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Saarbrücken, Germany, 2–3 September 2002; pp. 37–46.
10. Aila, T.; Laine, S. Understanding the efficiency of ray traversal on GPUs. In Proceedings of the Conference on High Performance Graphics, New Orleans, LA, USA, 1–3 August 2009; pp. 145–149.
11. Lü, Y.; Huang, L.; Shen, L.; Wang, Z. Unleashing the power of GPU for physically-based rendering via dynamic ray shuffling. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Boston, MA, USA, 14–17 October 2017; pp. 560–573.
12. Liu, L.; Chang, W.; Demoullin, F.; Chou, Y.H.; Saed, M.; Pankratz, D.; Nowicki, T.; Aamodt, T.M. Intersection Prediction for Accelerated GPU Ray Tracing. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual, 18–22 October 2021; pp. 709–723.
13. Burgess, J. RTX ON—The NVIDIA TURING GPU. In Proceedings of the IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, 18–20 August 2019; pp. 1–27.
14. NVIDIA Corporation. NVIDIA Ampere GA102 GPU Architecture. Available online: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf/ (accessed on 31 May 2022).
15. Beets, K. Introduction to the PowerVR Photon Architecture. Available online: https://www.imaginationtech.com/products/gpu/graphics-architecture/powervr-photon/ (accessed on 31 May 2022).
16. Schmittler, J. SaarCOR: A Hardware Architecture for Real-Time Ray Tracing; The Eurographics Association: Geneve, Switzerland, 2007.
17. Nah, J.H.; Park, J.S.; Park, C.; Kim, J.W.; Jung, Y.H.; Park, W.C.; Han, T.D. T&I engine: Traversal and intersection engine for hardware accelerated ray tracing. In Proceedings of the 2011 SIGGRAPH Asia Conference, Hong Kong, China, 12–15 December 2011; pp. 1–10.
18. Lee, W.J.; Shin, Y.; Lee, J.; Lee, S.; Ryu, S.; Kim, J. Real-time ray tracing on future mobile computing platform. In Proceedings of the SIGGRAPH Asia 2013 Symposium on Mobile Graphics and Interactive Applications, Hong Kong, China, 19–22 November 2013; pp. 1–5.
19. Nah, J.H.; Kwon, H.J.; Kim, D.S.; Jeong, C.H.; Park, J.; Han, T.D.; Manocha, D.; Park, W.C. RayCore: A ray-tracing hardware architecture for mobile devices. ACM Trans. Graph. (TOG) 2014, 33, 1–15.
20. Nah, J.H.; Kim, J.W.; Park, J.; Lee, W.J.; Park, J.S.; Jung, S.Y.; Park, W.C.; Manocha, D.; Han, T.D. HART: A hybrid architecture for ray tracing animated scenes. IEEE Trans. Vis. Comput. Graph. 2014, 21, 389–401.
21. Lee, J.; Lee, W.J.; Shin, Y.; Hwang, S.; Ryu, S.; Kim, J. Two-AABB traversal for mobile real-time ray tracing. In Proceedings of the SIGGRAPH Asia 2014 Mobile Graphics and Interactive Applications, Shenzhen, China, 3–6 December 2014; pp. 1–5.
22. Kopta, D.; Shkurko, K.; Spjut, J.; Brunvand, E.; Davis, A. Memory considerations for low energy ray tracing. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2015; Volume 34, pp. 47–59.
23. Viitanen, T.; Koskela, M.; Jääskeläinen, P.; Takala, J. Multi bounding volume hierarchies for ray tracing pipelines. In Proceedings of the SIGGRAPH ASIA 2016 Technical Briefs, Macao, China, 5–8 December 2016; pp. 1–4.
24. Shkurko, K.; Grant, T.; Kopta, D.; Mallett, I.; Yuksel, C.; Brunvand, E. Dual streaming for hardware-accelerated ray tracing. In Proceedings of High Performance Graphics, Vancouver, BC, Canada, 28–30 July 2017; pp. 1–11.
25. Vasiou, E.; Shkurko, K.; Brunvand, E.; Yuksel, C. Mach-RT: A many chip architecture for High Performance ray tracing. IEEE Trans. Vis. Comput. Graph. 2020, 28, 1585–1596.
26. Hennessy, J.L.; Patterson, D.A. A new golden age for computer architecture. Commun. ACM 2019, 62, 48–60.
27. Horowitz, M. 1.1 computing’s energy problem (and what we can do about it). In Proceedings of the International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14.
28. Hapala, M.; Davidovič, T.; Wald, I.; Havran, V.; Slusallek, P. Efficient stack-less bvh traversal for ray tracing. In Proceedings of the 27th Spring Conference on Computer Graphics, Smolenice Castle, Slovakia, 27–29 April 2011; pp. 7–12.
29. Binder, N.; Keller, A. Efficient stackless hierarchy traversal on GPUs with backtracking in constant time. In Proceedings of the High Performance Graphics, Dublin, Ireland, 20–22 June 2016; pp. 41–50.
30. Vaidyanathan, K.; Woop, S.; Benthin, C. Wide BVH traversal with a short stack. In Proceedings of the Conference on High-Performance Graphics, Strasbourg, France, 8–10 July 2019; pp. 15–19.
31. Woop, S. A Ray Tracing Hardware Architecture for Dynamic Scenes; Fachrichtung 6.2-Informatik Computer Graphik, Saarland University: Saarbrücken, Germany, 2004.
32. Hertz, E.; Svensson, B.; Nilsson, P. Combining the parabolic synthesis methodology with second-degree interpolation. Microprocess. Microsystems 2016, 42, 142–155.
33. Hertz, E. Methodologies for Approximation of Unary Functions and Their Implementation in Hardware. Ph.D. Thesis, Halmstad University Press, Halmstad, Sweden, 2016.
34. Vaidyanathan, K.; Akenine-Möller, T.; Salvi, M. Watertight ray traversal with reduced precision. In Proceedings of the High Performance Graphics, Dublin, Ireland, 20–22 June 2016; pp. 33–40.
35. Bachrach, J.; Vo, H.; Richards, B.; Lee, Y.; Waterman, A.; Avižienis, R.; Wawrzynek, J.; Asanović, K. Chisel: Constructing hardware in a scala embedded language. In Proceedings of the DAC Design Automation Conference 2012, San Francisco, CA, USA, 3–7 June 2012; pp. 1212–1221.
36. Wald, I. Fast construction of SAH BVHs on the Intel many integrated core (MIC) architecture. IEEE Trans. Vis. Comput. Graph. 2010, 18, 47–57.
37. Kopta, D.; Spjut, J.; Brunvand, E.; Davis, A. Efficient MIMD architectures for high-performance ray tracing. In Proceedings of the International Conference on Computer Design, Amsterdam, The Netherlands, 3–6 October 2010.

Figure 1. Overall system architecture of RT engine.

Figure 2. Multiple stacks design.

Figure 3. Triangle intersection test. (a) Ray-plane test, (b) barycentric test and (c) final hit point calculation.

Figure 4. Test scenes: Conference, Fairyforest, Sibenik and Sanmiguel with primary rays.

Figure 5. Block diagram of the reciprocal algorithm.

Table 1. Data flow.

No	From→To	Purpose	Next Process
1	SHD→MEM	Generate data	Input to MEM
2	DIS→ARB	Dispatch new ray	Determine priorities
3	LUT→ARB	Inter from Stack	Determine priorities
4	TRV→ARB	Inter from TRV	Determine priorities
5	ARB→MEM	Input address	Fetch data
6	MEM→TRV	Input to TRV	TRV with the BVH
7	TRV→MEM	Node is a leaf	Fetch leaf data
8	TRV→LUT	Pop or push	Stack process
9	LUT→MEM	Leaf to MEM	Fetch leaf
10	MEM→IST1	Stage1 of IST	Triangle process
11	IST1→IST2	Stage2 of IST	Triangle process
12	IST2→IST3	Stage3 of IST	Triangle process
13	TRI→SHD	$hit$ to SHD	Shader
14	TRI→LUT	Break from IST	Pop

Table 2. Three-phase break for intersection test results.

Test Scenes	Conference	Fairyforest	Sibenik	Sanmiguel
IST1 break	50.12%	34.86%	34.37%	28.96%
IST2 break	0	10.69%	21.66%	5.65%
IST3 break	49.88%	54.45%	43.97%	65.39%
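
Table 2 reports how often a triangle test ends in each of the three phases. As a minimal software sketch of that idea, the Python function below returns the phase at which a single ray-triangle test terminates. It assumes the phases map onto the three steps named in Figure 3 (ray-plane test, barycentric test, final hit point calculation); the function name `intersect_three_phase`, the tolerance `eps`, and the exact split of work between phases are illustrative assumptions and do not describe the RTL implementation.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_three_phase(orig: Vec3, direction: Vec3,
                          v0: Vec3, v1: Vec3, v2: Vec3,
                          t_max: float = float("inf"),
                          eps: float = 1e-9) -> Tuple[int, Optional[tuple]]:
    """Return (phase, hit); hit is None if the test broke before phase 3."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    n = cross(e1, e2)  # unnormalized plane normal

    # Phase 1: ray-plane test. Break if the ray is parallel to the plane
    # or the plane is hit behind the origin or beyond the closest hit so far.
    denom = dot(n, direction)
    if abs(denom) < eps:
        return 1, None
    t = dot(n, sub(v0, orig)) / denom
    if t <= eps or t >= t_max:
        return 1, None

    # Phase 2: barycentric test. Break if the plane hit lies outside the triangle.
    p = (orig[0] + t * direction[0],
         orig[1] + t * direction[1],
         orig[2] + t * direction[2])
    w = sub(p, v0)
    nn = dot(n, n)
    u = dot(cross(w, e2), n) / nn
    v = dot(cross(e1, w), n) / nn
    if u < 0.0 or v < 0.0 or u + v > 1.0:
        return 2, None

    # Phase 3: the test ran to completion; report the final hit record.
    return 3, (t, u, v, p)


tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
print(intersect_three_phase((0.25, 0.25, 1.0), (0.0, 0.0, -1.0), *tri)[0])  # hit
print(intersect_three_phase((0.25, 0.25, 1.0), (1.0, 0.0, 0.0), *tri)[0])   # parallel ray
print(intersect_three_phase((2.0, 2.0, 1.0), (0.0, 0.0, -1.0), *tri)[0])    # misses triangle
```

Counting the returned phase over all tests in a scene would give a breakdown of the same shape as Table 2, although the percentages depend on how the hardware actually partitions the work.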

Table 3. Comparison of different reciprocal units.

	SGRT [18]	HART [20]	Ours
Clock rate (MHz)	500	500	500
Process (nm)	65	65	28
Area (μm $^{2}$ )	0.11	0.11	0.0059
NAND2 ¹	76,389	76,389	12,258

Table 4. Complexity of our design in terms of the number of floating-point units (RCP: reciprocal unit, ADD: adder, MUL: multiplier, MAC: multiply accumulate, CMP: comparator, DIV: divider, SQR: square root).

	T&I [17]	SGRT [18]	RayCore [19]	HART [20]	RT Engine
FP CMP	108	49	71	211	26
FP ADD	86	36	73	107	13
FP MUL	92	34	110	107	16
FP MAC	0	0	0	0	17
FP RCP	2	1	0	4	1
FP DIV	0	0	11	0	0
FP SQR	0	0	4	0	0

Table 5. Area estimates of RT engine.

Functional Unit	Area (mm $^{2}$ )	Total Area (mm $^{2}$ )	Memory Unit	Area (mm $^{2}$ )	Total Area (mm $^{2}$ )
FP CMP	0.00052	0.0135	Stacks	0.00431	0.108
FP ADD	0.00295	0.0384	FIFO	0.00379	0.011
FP MUL	0.00512	0.0819
FP MAC	0.00758	0.129
FP RCP	0.00606	0.00606
Total					0.48
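
The functional-unit rows of Table 5 can be cross-checked against the unit counts for RT Engine in Table 4. The sketch below assumes that each "Total Area" entry is simply the per-unit area multiplied by the corresponding unit count; that relationship is an inference from the two tables rather than something stated in this section. The memory units are left out because their instance counts are not given here.

```python
# Unit counts for RT Engine, copied from Table 4.
UNIT_COUNT = {"FP CMP": 26, "FP ADD": 13, "FP MUL": 16, "FP MAC": 17, "FP RCP": 1}

# Per-unit area and reported total area (mm^2), copied from Table 5.
UNIT_AREA = {"FP CMP": 0.00052, "FP ADD": 0.00295, "FP MUL": 0.00512,
             "FP MAC": 0.00758, "FP RCP": 0.00606}
REPORTED_TOTAL = {"FP CMP": 0.0135, "FP ADD": 0.0384, "FP MUL": 0.0819,
                  "FP MAC": 0.129, "FP RCP": 0.00606}

# Assumption: total area = number of units x area of one unit.
computed_total = {name: UNIT_COUNT[name] * UNIT_AREA[name] for name in UNIT_COUNT}

for name, value in computed_total.items():
    print(f"{name}: computed {value:.5f} mm^2, reported {REPORTED_TOTAL[name]} mm^2")

functional_sum = sum(computed_total.values())
print(f"Sum over functional units: {functional_sum:.4f} mm^2")
```

Under this assumption the computed values agree with the reported totals to within rounding, and the functional units account for a little over half of the 0.48 mm$^{2}$ total.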

Table 6. Performance comparison of the proposed ray-tracing hardware.

Scene	Utilization of TRV	Utilization of IST	Utilization of Stacks	Average Steps (TRV)	Average Steps (IST)	Performance (MRPS)
Conference (350K Tris)	99.9%	44.3%	68%	9.16	4.06	92.74
Fairyforest (186K Tris)	99.9%	22.1%	84%	18.71	4.13	46.99
Sibenik (97.5K Tris)	99.9%	45.9%	100%	26.26	13.36	28.43
Sanmiguel (12.1M Tris)	99.9%	9.1%	100%	46.64	4.25	16.59

Table 7. Performance comparison against previous approaches.

	Clock Rate	Acceleration Structure	Performance (MRPS)	Area (mm $^{2}$ )	Process (nm)	Performance/ Area (MRPS /mm $^{2}$ )
T&I engine SIGGRAPH’11 [17]	500 MHz	Kd-tree	198	9.04	65	21.90
SGRT SIGGRAPH’13 [18]	500 MHz	BVH	184	7.2	65	25.56
RayCore TOG’14 [19]	500 MHz	Kd-tree	193	18	28	10.72
Two-AABB SIGGRAPH’14 [21]	500 MHz	BVH	297.6	6.82	28	43.63
HART TVCG’15 [20]	500 MHz	BVH	602	7.68	65	78.39
STRaTA CGF’15 [22]	1 GHz	BVH	365.6	57.1	65	6.40
MBVH SIGGRAPH’16 [23]	500 MHz	BVH	88	3.12	45	28.21
Dual Streaming HPG’17 [24]	1 GHz	BVH	345.6	57.1	65	6.05
Mach-RT TVCG’20 [25]	2 GHz	BVH	284.25	52	65	5.47
RT engine	850 MHz	BVH	92.74	0.48	28	193.21

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
