1. Introduction
SPAD-based solid-state LiDAR systems, which serve a broad array of applications, outperform traditional rotating LiDAR and microelectromechanical systems (MEMS) LiDAR. This superiority arises from their remarkable mechanical stability, high performance, and cost-effectiveness. In contrast to indirect time-of-flight (iToF) systems [
1,
2], which suffer from limited measurement distances (<20 m), substantial power consumption (tenfold that of direct time-of-flight, dToF), and convoluted calibration algorithms, the dToF imaging system offers a more promising solution. By combining the single-photon sensitivity of SPADs with the picosecond-level temporal resolution of TDCs, dToF systems enable long-range object measurement and centimeter-level depth resolution [
3]. Consequently, dToF establishes itself as the predominant technological trajectory for the forthcoming generation of 3D imaging. The capability of measuring depth enables solid-state LiDAR to excel in numerous applications, from floor-sweeping robots and facial recognition for consumers to autonomous driving and 3D building modeling in the industrial domain [
4]. As LiDAR technology advances towards more compact system designs, increasing the SPAD imaging sensor pixel count and fusing depth with RGB images are anticipated to become the primary issues to be addressed.
Three-dimensional depth imaging employing SPAD technology currently faces certain restrictions, including low spatial resolution, suboptimal effective optical efficiency (i.e., fill factor), and elevated costs. These limitations also narrow the range of applications for dToF LiDAR systems. Although recent research indicates that SPAD arrays can achieve QVGA (320 × 240) resolution and even higher (512 × 512), with prospective advancements targeting Video Graphics Array (VGA, 640 × 480) resolution, the pixel count remains markedly lower than that of traditional CMOS Image Sensor (CIS) technology [
5,
6,
7,
8]. Apple LiDAR (256 × 192) and Huawei P30 (240 × 180) exemplify the potential of dToF technology in the consumer market, yet the resolution disparity persists. To address this challenge, the 3D stacking technique has been proposed, which positions the back-illuminated SPAD array on the top tier and the readout control circuits on the bottom tier [
9]. This configuration allows researchers to enhance the resolution to 512 × 512 [
10]. However, the considerable costs associated with this technique have impeded its development and subsequent applications. At present, no effective solutions have been identified to surmount this obstacle in the realm of circuit design.
The inherent limitations of consumer-grade LiDAR technology, specifically its low resolution and high cost, have long been recognized within the industry. As 3D sensing technology advances, consumer-grade RGB-D cameras (e.g., Intel RealSense, Microsoft Kinect) have gained popularity owing to their ability to capture color images and depth information simultaneously and thereby recover 3D scenes at a lower cost. Motivated by this development, researchers have explored RGB-guided depth-completion algorithms that reconstruct dense depth maps from sparse depth data obtained by dToF depth sensors and color images captured by RGB cameras, with the goal of enabling low-cost LiDAR to produce high-resolution scene depth [
11]. The authors of [
12] introduce a normalized convolution layer that enables unguided scene depth completion on highly sparse data using fewer parameters than related techniques. Their proposed method treats validity masks as a continuous confidence field and presents a new criterion for propagating confidences between CNN layers. This approach allows for the generation of pointwise continuous confidence maps for the network output. The authors of [
13] propose convolutional spatial propagation networks (CSPNs), which demonstrate greater efficiency and accuracy in depth estimation compared to previous state-of-the-art propagation strategies, without sacrificing theoretical guarantees. Liu et al. [
14] propose the differentiable kernel regression network, a CNN that learns steering kernels to replace hand-crafted interpolation when performing coarse depth prediction from the sparse input.
Gaussian beams play a significant role in various applications, including LiDAR, optical communication, and laser welding [
15]. However, in flash LiDAR and laser TV projection, it is essential to homogenize Gaussian beams into flat-topped beams. Common techniques to generate flat-top beams include MLA, diffractive optics (DOE), and free-form mirror sets. Among these technologies, MLA-based beam homogenizers have garnered considerable interest, particularly in compact consumer devices, owing to their unique properties. An MLA can divide a non-uniform laser into multiple beamlets, which can subsequently be superimposed onto a microdisplay with the assistance of an additional lens [
16]. As a result, MLA diffusers are insensitive to the incident intensity profile and operate over a wide spectral range. In 2016, Jin et al. [
17] proposed enhancing homogeneity in the homogenizing system by substituting the first MLA with a free-form MLA. Each free-form surface within the MLA introduced suitable aberrations in the wavefront, redistributing the beam’s irradiance. In the same year, Cao et al. [
18] presented a laser beam homogenization method that employed a central off-axis random MLA. By adjusting the off-axis quantity of the center, the MLA’s periodicity was disrupted, eliminating the periodic lattice effect on the target surface. In 2019, Liu et al. [
19] designed a continuous-profile MLA featuring sublenses with random apertures and arrangements. This design disrupted the coherence between beamlets, achieving a simulated uniformity of 94.33%. Subsequently, in 2020, Xue et al. [
20] proposed a monolithic random MLA for laser beam homogenization. During this homogenization process, the coherence between sub-beams was entirely disrupted, resulting in a homogenized spot with a high energy utilization rate.
In an effort to enhance the resolution of low-pixel-count depth sensors and broaden the application scope of sparse-pixel, low-cost ranging sensors, we propose a SPAD-based LiDAR with a micro-optics device and an RGB-guided depth-completion neural network to upsample the resolution from 8 × 8 to 256 × 256 pixels, as shown in
Figure 1. This lightweight, SPAD-based dToF system is ideal for LiDAR applications. In this work, the developed sparse-pixel SoC was fabricated with the 130 nm BCD process. The high photon detection probability (PDP) of the SPAD and the picosecond-level TDC ensure that the system effectively achieves millimeter-level measurement accuracy in indoor scenarios. The sensor integrates a 16 × 16 SPAD array, which can be divided into 2 × 2 areas at high frame rates for ranging or 8 × 8 at low frame rates for imaging. Subsequently, we engineered an optical system to facilitate imaging at the hardware level. Micron-scale random MLAs were employed on the transmitter to homogenize the vertical-cavity surface-emitting laser (VCSEL) array’s Gaussian beam into a flat-topped source with a 45° FOV. Free-form lenses were applied at the receiver to align the SPAD array with a 45° FOV and enhance the resolution for sparse imaging using the 8 × 8 array. To meet consumer-grade imaging requirements, an RGB-guided depth-completion neural network was implemented on an NVIDIA Jetson Xavier NX (Deep Learning Accelerator) and PC, improving the resolution from 8 × 8 to 256 × 256 pixels (QVGA level). This cost-effective, lightweight imaging system has a wide range of applications in distance measurement, basic object recognition, and simple pose recognition. To validate the system, we conducted ranging verification on the SPAD-based SoC LiDAR, as well as FOV and resolution tests on the micro-optical modules. Our self-developed RGB-guided depth-completion neural network upsampled and completed the 8 × 8 depth information by a factor of 32 per axis to map the real world. The paper is structured as follows:
Section 2 presents the SoC implementation and key functional component design;
Section 3 discusses the optical system simulation and MLA design;
Section 4 introduces the proposed RGB-to-depth completion architecture;
Section 5 reports the results and discussion of
Section 2,
Section 3 and
Section 4; and
Section 6 provides conclusions and future perspectives.
2. SoC Implementation
Traditional LiDAR systems rely on a ranging framework in which the transmitter and receiver operate as separate entities. A short laser pulse is emitted by the transmitter, undergoes diffuse reflection upon encountering an object, and returns to the receiver, where the SPAD captures the corresponding photons and produces photon counts. This signal is conveyed through a multiplexed TDC, ultimately reaching the digital signal processor (DSP) for processing and generation of a time-dependent histogram. However, the conventional architecture is encumbered by several limitations, such as the challenge of achieving picosecond-level pulsing and the system’s inherently poor signal-to-noise ratio (SNR) [
21,
22]. We propose a highly integrated SoC based on a 130 nm BCD process, which integrates a SPAD readout circuit, TDC, charge pump circuit for high-voltage generation, and ranging algorithm processor, as illustrated in
Figure 2. This design offers superior temporal resolution and SNR compared to conventional architectures, as well as significant cost advantages. The ranging SoC integrates two SPAD pixel arrays: the signal SPAD array is utilized for depth imaging, while the reference SPAD array is employed for offset calibration between the sensor and the cover glass.
Figure 2 illustrates the signal path for both pixel arrays within the ranging SoC. The high voltage for all SPADs is generated by a charge pump, which is linked to the high-voltage supply (HVDD). Each region of the SPAD array corresponds to a TDC output, and the output from the TDC is subsequently stored in static random-access memory (SRAM) for histogram statistics. The SoC is interconnected with an external laser driver using low-voltage differential signaling (LVDS), enabling the adjustment of laser pulse width and power. Additionally, the SoC integrates a microcontroller unit (MCU) capable of performing on-chip operations, including the implementation of the ranging algorithm processor.
Figure 3a illustrates the SPAD pixel circuit. The cathode of the SPAD is connected to a high-voltage bias, which is provided via the HVDD connection. The passive quench circuit, implemented with a thick-gate transistor, enables the overbias voltage of the SPAD to reach up to 3.3 V. Following level conversion, the output signal of the SPAD pixel is reduced to a low-voltage range compatible with thin-gate transistors. The detection threshold of the SPAD avalanche signal is determined by the buffer in the stage following the SPAD; by adjusting the supply voltage of the pixel circuit, this threshold can be modified. In
Figure 3b, the design of the SPAD readout circuit is presented. Four SPADs are combined into a macropixel using an OR-tree structure. This macropixel configuration addresses the issue of low photon detection efficiency in individual SPADs. The probability of triggering an avalanche signal in this 2 × 2 pixel structure is four times that of a single pixel, and the avalanche signal from each pixel can be output through the OR-tree. Moreover, each SPAD can be individually enabled, allowing for flexible configuration of the SPAD array size.
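As an illustrative aside on the factor-of-four claim (our own approximation, not an expression from the original text): for a small per-pixel trigger probability p, the OR-combined macropixel trigger probability is
\[ P_{\mathrm{macro}} = 1 - (1 - p)^{4} \approx 4p \quad (p \ll 1), \]
so the 2 × 2 OR-tree provides roughly four times the single-pixel detection probability in the low-flux regime.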
The design structure of the TDC is shown in
Figure 4. The synchronization pulse of the laser serves as the start signal of the TDC, while the avalanche signal triggered by the SPAD serves as the stop signal; the TDC records the time interval between these two signals. First, a clock counter measures the number of clock cycles between the start and stop signals, producing a coarse count. The resolution of the coarse count is determined by the clock frequency, and pushing it further would require an impractically high clock frequency. To address this limitation, a fine-count structure is introduced to improve the resolution of the TDC. The fine count is based on a multi-phase clock scheme. In this scheme, an on-chip oscillator generates a 10.19 MHz clock source, which serves as the input to a phase-locked loop (PLL). The PLL produces a 234 MHz clock after 23× frequency multiplication. This clock is used directly for coarse counting to obtain the output TDC<7:3>, and it is also phase-shifted to derive eight clocks, denoted P0 to P7, with the same frequency but different phases; the phase of P0 aligns with that of the PLL output clock. These eight phases are then utilized for sampling and encoding, yielding the output TDC<2:0>. Finally, the output of the TDC is fed into the digital circuit for histogram processing. The temporal resolution of the TDC is specified as 300 ps.
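To make the mapping from TDC codes to distance concrete, the following Python sketch accumulates TDC output codes into a histogram and converts the peak bin into a range estimate using the 300 ps resolution stated above; the function and variable names are illustrative assumptions, and the simple peak-picking step merely stands in for the on-chip ranging algorithm.

import numpy as np

TDC_LSB_S = 300e-12   # TDC temporal resolution stated in the text: 300 ps
C_LIGHT = 2.998e8     # speed of light (m/s)

def distance_from_tdc_codes(tdc_codes, num_bins=256):
    """Accumulate 8-bit TDC codes into a histogram and convert the peak bin to range."""
    hist = np.bincount(np.asarray(tdc_codes, dtype=np.int64), minlength=num_bins)
    peak_bin = int(np.argmax(hist))   # simple peak pick; the SoC's DSP may differ
    tof = peak_bin * TDC_LSB_S        # time of flight corresponding to the peak bin
    return 0.5 * C_LIGHT * tof        # halve for the round trip

# Codes clustered around bin 44 correspond to roughly (44 * 300 ps / 2) * c ≈ 1.98 m.
print(distance_from_tdc_codes([43, 44, 44, 45, 44, 44]))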
3. Optical System and MLA
In active imaging applications, particularly for LiDAR systems, it is imperative to optimize not only the receiver components, such as lenses and sensors, but also the transmitter elements and the target object. This optimization requires an in-depth understanding of various factors, including the laser emission mode (flash or dot matrix) and the physical mechanism governing the object’s reflection. The optical model is constructed from the optical power at the transmitter, the single-pixel optical power budget at the receiver, and the lens system. This model can be readily extended to encompass the entire array once the illumination pattern is known. In the standard scenario, the illumination pattern along the detector’s field of view in both the horizontal and vertical directions is designed to match the FOV angle at the transmitter. Calculating the optical power density on the target surface then allows more detailed physical effects to be incorporated into the optical model.
Figure 5 presents a generic flash LiDAR model, in which the scene of interest is uniformly illuminated by an active light source, typically a wide-angle pulsed laser beam. The coverage area in the target plane is contingent upon the FOV of the beam. All signal counting calculations are predicated on the lens focusing the reflected light energy back to the SPAD array. To maximize the efficiency of returned energy throughout the field of view, it is crucial to ensure a well-matched FOV between the transmitter and sensor. Regarding the returned light, the target is assumed to be a diffuse surface, i.e., a Lambertian reflector, which exhibits consistent apparent brightness to the observer, irrespective of the observer’s viewpoint, and the surface brightness is isotropic. The active imaging system setup is illustrated in
Figure 5, with the distance to the target denoted as D, the lens aperture as d, the focal length as f, the sensor height as a, and the corresponding sensor area. The reflected light power emanating from the target is determined by the source power arriving at distance D and the reflectance of the object. The lens component encompasses the f-number (f# = f/d) and the FOV. Additionally, the light transmission rate of the lens and filter must be considered. When treating the illumination light as square, the optical power model can be articulated by Equation (1).
Transitioning from the optical power model to a photon-counting model proves more advantageous for system design and sensor evaluation. Consequently, the photon-counting model can be formulated as presented in Equation (2), which involves the laser repetition frequency shown in Figure 5, the light source wavelength λ, Planck’s constant h (6.62607004 × 10^−34 J·s), and the speed of light c (2.998 × 10^8 m/s). The resulting value defines the total number of photons per laser pulse that reaches the pixel array.
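As a rough numerical illustration of the photon-counting model referred to as Equation (2), the Python sketch below converts the optical power received at a pixel into photons per laser pulse; the received power, repetition rate, and 940 nm wavelength in the example are placeholder assumptions rather than values measured in this work.

H_PLANCK = 6.62607004e-34   # Planck's constant (J*s)
C_LIGHT = 2.998e8           # speed of light (m/s)

def photons_per_pulse(received_power_w, rep_rate_hz, wavelength_m):
    """Energy per pulse at the pixel divided by the energy of a single photon."""
    pulse_energy = received_power_w / rep_rate_hz       # average power over one pulse period
    photon_energy = H_PLANCK * C_LIGHT / wavelength_m   # E = h*c/lambda
    return pulse_energy / photon_energy

# Placeholder example: 1 nW at the pixel, 1 MHz repetition rate, 940 nm VCSEL
print(photons_per_pulse(1e-9, 1e6, 940e-9))             # ~4.7e3 photons per pulse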
The diffusion beam function of the laser diffuser is primarily achieved through the micro-concave spherical structure etched on its surface. This micro-concave spherical surface acts as a micro-concave lens, and the entire laser diffuser can be considered an array of micro-concave lenses. Essentially, the diffused laser surface light source is a superposition of the surface light sources emitted by each micro-concave lens. It is necessary to homogenize the Gaussian beam into a flat-topped beam to achieve a more uniform projected light source for active imaging as illustrated in
Figure 6e. As illustrated in
Figure 6d, the laser diffuser diffusion diagram shows the laser incident normally on the diffuser, with the beam diverging under the influence of the micro-concave lenses. This forms a surface light source with a specific diffusion angle, which is associated with the parameters of the micro-concave lenses on the diffuser. In accordance with this principle, this paper proposes the preparation of two types of MLAs: the random MLA, illustrated in
Figure 6b, suitable for generating a circular homogenized light spot; and the rectangular MLA, presented in
Figure 6c, designed to produce a rectangular homogenized light spot.
To elucidate the specific relationship between the characteristic parameters of the micro-concave spherical structure and the diffusion angle, a deductive argument is presented from the perspective of geometrical optics. Ideally, the overall diffusion angle of the laser beam after traversing the diffuser is equivalent to the diffusion angle of a single sublens. To simplify the analysis of the diffusion angle with respect to the diffuser structure parameters, the diffusion process of a single sublens is examined individually.
Figure 6a illustrates the diffusion diagram of the sublens, which involves two parameters: the aperture P and the radius of curvature R. By employing the lens focal-length formula and geometric principles, the set of Equations (3) is derived; further simplification yields Equation (4), where n is the refractive index of the micro-concave lens material and f is the focal length of the micro-concave lens. Once the parameters of a sublens are known, its diffusion angle can be derived, which in turn yields the diffusion angle of the laser diffuser. Equation (4) further reveals that the diffusion angle θ is directly proportional to the aperture P of the micro-concave lens and inversely proportional to the radius of curvature R.
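Although Equations (3) and (4) are not reproduced here, a plausible reconstruction from the thin-lens focal-length formula and the sublens geometry of Figure 6a, under the stated definitions of P, R, and n, is
\[ f = \frac{R}{n-1}, \qquad \tan\frac{\theta}{2} = \frac{P/2}{f} \;\;\Longrightarrow\;\; \theta = 2\arctan\!\left(\frac{(n-1)P}{2R}\right) \approx \frac{(n-1)P}{R} \quad \text{(small angles)}, \]
which is consistent with the stated proportionality of θ to P and its inverse proportionality to R; this is our reading of the derivation rather than the exact expressions printed in the original equations.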
4. Proposed RGB-Guided Depth Completion Neural Network
The primary aim of our research is to accurately estimate a dense depth map, designated as Ŷ, derived from sparse LiDAR data (Y′) in conjunction with the corresponding RGB image (I) serving as guidance, with Θ denoting the network parameters. The task under consideration can thus be expressed as Ŷ = F(Y′, I; Θ), where F is the end-to-end network we implemented to execute the depth-completion task; the network parameters Θ are obtained by minimizing a loss function L during training.
In our proposed methodology, the network is divided into two distinct components: (1) a depth-completion network designed to generate a coarse depth map from the sparse depth input, and (2) a refinement network aimed at optimizing sharp edges while enhancing depth accuracy. We introduce an encoder–decoder-based CNN network for estimating the coarse depth map, as previously demonstrated in [
13,
14]. However, the aforementioned networks encounter two primary challenges. First, networks such as [
12,
14] employ complex architectures to boost accuracy; nevertheless, their proposed networks are incompatible with most CNN accelerators or embedded DLAs. Second, networks like [
13] utilize ResNet-50 [
23] or other similar networks as the CNN backbone, resulting in a substantial number of parameters (~240 M) and posing implementation challenges on embedded hardware.
To make the network easier to implement on embedded systems, we explored two optimization strategies:
4.1. Depthwise Separable Convolution
The architecture of depthwise separable convolution is illustrated in
Figure 7. In contrast to the conventional 3 × 3 2D convolution, depthwise separable convolution employs a 3 × 3 group convolution, referred to as depthwise convolution, and a 1 × 1 standard convolution, known as pointwise convolution.
For a standard convolution with a filter K of dimensions D_K × D_K × M × N applied to an input feature map F of size D_F × D_F × M, the total number of parameters (ignoring biases) is D_K · D_K · M · N. In the case of depthwise separable convolution, the total number of parameters is D_K · D_K · M + M · N. Considering that the kernel size D_K for our network is 3 and the output channel count N is considerably larger than D_K², the standard convolution possesses a significantly higher number of parameters than the depthwise separable convolution.
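A minimal PyTorch sketch of the depthwise separable convolution block described above (3 × 3 depthwise convolution followed by a 1 × 1 pointwise convolution); the batch-normalization/ReLU ordering and the example channel counts are illustrative assumptions rather than the exact block used in our network.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise (group) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)    # normalization/activation are illustrative choices
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# For D_K = 3, M = 64 input channels, N = 128 output channels:
# convolution parameters = 3*3*64 + 64*128 = 8768 (a standard 3x3 convolution would need 3*3*64*128 = 73,728);
# the print below also counts 256 BatchNorm parameters, giving 9024 in total.
block = DepthwiseSeparableConv(64, 128)
print(sum(p.numel() for p in block.parameters() if p.requires_grad))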
4.2. Network
Different from a standard CNN, we substitute the standard 3 × 3 convolutions with our depthwise separable convolutions. Our network architecture is depicted in
Figure 8. Within the depth-completion network, we employ a U-Net [
25] network as our backbone. The encoder consists of five stages, each containing two types of depthwise separable convolution blocks. One block performs downsampling by a factor of 2× using a stride of 2, while the other extracts features with a stride of 1. The decoder also consists of five stages. In the first four stages, feature maps are upsampled by 2×, followed by a convolution operation with a stride of 1. These maps are then concatenated with the feature maps from the corresponding encoder stage outputs and processed by two subsequent convolution blocks with a stride of 1. The final stage comprises an upsampling layer and a single convolution block with a stride of 1. For the refinement network, we adopted the architecture from [
12], which has been demonstrated to possess the advantages of being lightweight and exhibiting high performance.
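For concreteness, a compact PyTorch sketch of the coarse depth-completion branch is given below; it keeps the U-Net encoder/decoder pattern with skip connections but uses only three stages, illustrative channel widths, and a simple RGB-plus-sparse-depth input packing, so it is a simplified stand-in for, not a reproduction of, the network in Figure 8.

import torch
import torch.nn as nn
import torch.nn.functional as F

def ds_conv(in_ch, out_ch, stride=1):
    """Depthwise separable block: 3x3 depthwise + 1x1 pointwise + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CoarseDepthCompletionNet(nn.Module):
    """U-Net-style coarse depth completion with depthwise separable blocks (3 stages for brevity)."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(ds_conv(4, 32, stride=2), ds_conv(32, 32))
        self.enc2 = nn.Sequential(ds_conv(32, 64, stride=2), ds_conv(64, 64))
        self.enc3 = nn.Sequential(ds_conv(64, 128, stride=2), ds_conv(128, 128))
        self.dec2 = nn.Sequential(ds_conv(128 + 64, 64), ds_conv(64, 64))
        self.dec1 = nn.Sequential(ds_conv(64 + 32, 32), ds_conv(32, 32))
        self.head = nn.Conv2d(32, 1, 1)   # single-channel coarse depth output

    def forward(self, rgb, sparse_depth):
        x = torch.cat([rgb, sparse_depth], dim=1)            # RGB guide + sparse depth
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d2 = self.dec2(torch.cat([F.interpolate(e3, scale_factor=2), e2], dim=1))
        d1 = self.dec1(torch.cat([F.interpolate(d2, scale_factor=2), e1], dim=1))
        return self.head(F.interpolate(d1, scale_factor=2))  # back to the input resolution

# Shape check with placeholder tensors (one 256 x 256 frame): output is (1, 1, 256, 256).
net = CoarseDepthCompletionNet()
print(net(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)).shape)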
4.3. Loss Function
We trained our network in two steps. In Step 1, we trained the network using a masked per-pixel depth error as the loss function, described by Equation (5), in which the predicted depth at each pixel is compared with its ground-truth value and M stands for the total number of pixels with ground-truth values greater than 0.
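A minimal sketch of the Step-1 loss, assuming a masked mean absolute (L1) error evaluated only on pixels with valid ground truth; the L1 form is our assumption, since Equation (5) is not reproduced here.

import torch

def masked_depth_loss(pred, gt):
    """Masked per-pixel depth error (L1 assumed), averaged over pixels with gt > 0."""
    mask = gt > 0                                  # M = number of valid ground-truth pixels
    return (pred[mask] - gt[mask]).abs().mean()

# Only the valid (non-zero) ground-truth pixels contribute to the average.
pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
gt   = torch.tensor([[1.5, 0.0], [2.0, 4.0]])
print(masked_depth_loss(pred, gt))                 # mean of |1-1.5|, |3-2|, |4-4| = 0.5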
Upon completing this initial training, we proceeded to train the entire network using a new loss function to optimize the boundary-region performance. This loss is a weighted combination of three terms whose weights are hyperparameters: the first term is defined similarly to Equation (5), the second represents the masked mean squared error (MSE), and the third stands for the gradient error. To achieve improved boundary performance, we incorporated depth gradient information into our loss function, following [26]. The gradient loss term penalizes the disagreement of depth edges in both the vertical and horizontal directions.
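Putting the pieces together, the sketch below shows one way to implement the Step-2 objective: a weighted sum of an Equation (5)-style term, a masked MSE term, and a finite-difference gradient term that penalizes depth-edge disagreement along the horizontal and vertical directions. The weighting values and the exact finite-difference formulation are illustrative assumptions.

import torch

def masked_l1(pred, gt):
    mask = gt > 0
    return (pred[mask] - gt[mask]).abs().mean()

def masked_mse(pred, gt):
    mask = gt > 0
    return ((pred[mask] - gt[mask]) ** 2).mean()

def gradient_loss(pred, gt):
    """Penalize disagreement of depth edges in the horizontal and vertical directions."""
    dx_p, dy_p = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dx_g, dy_g = gt[..., :, 1:] - gt[..., :, :-1], gt[..., 1:, :] - gt[..., :-1, :]
    return (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()

def total_loss(pred, gt, weights=(1.0, 1.0, 0.5)):
    """Weighted combination of the three terms; the weights here are placeholders."""
    w_depth, w_mse, w_grad = weights
    return (w_depth * masked_l1(pred, gt)
            + w_mse * masked_mse(pred, gt)
            + w_grad * gradient_loss(pred, gt))

pred, gt = torch.rand(1, 1, 8, 8), torch.rand(1, 1, 8, 8)
print(total_loss(pred, gt))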