Article

EDRNet: Edge-Enhanced Dynamic Routing Adaptive for Depth Completion

Tianjin Key Laboratory of Intelligent Control of Electrical Equipment, School of Control Science and Engineering, Tiangong University, Tianjin 300387, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(6), 953; https://doi.org/10.3390/math13060953
Submission received: 8 February 2025 / Revised: 5 March 2025 / Accepted: 11 March 2025 / Published: 13 March 2025

Abstract

Depth completion is a technique that densifies the sparse depth maps acquired by depth sensors (e.g., RGB-D cameras, LiDAR) to generate complete and accurate depth maps. It has important application value in autonomous driving, robot navigation, and virtual reality. Deep learning has become the mainstream approach to depth completion. We therefore propose an edge-enhanced dynamically routed adaptive depth completion network, EDRNet, which achieves efficient and accurate depth completion through lightweight design and boundary optimisation. Firstly, we introduce the Canny operator (a classical image processing technique) to explicitly extract object contour information and fuse the resulting edge maps with the RGB image and sparse depth inputs, providing the network with clear edge-structure information. Secondly, we design a Sparse Adaptive Dynamic Routing Transformer block, SADRT, which effectively combines the global modelling capability of the Transformer with the local feature extraction capability of the CNN. The dynamic routing mechanism introduced in this block dynamically selects key regions for efficient feature extraction, significantly reducing redundant computation compared with the traditional Transformer. In addition, we design a loss function that imposes additional penalties on the depth error at object edges, further strengthening the constraints on edges. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public KITTI DC and NYU Depth v2 datasets, especially in depth prediction accuracy in edge regions and in computational efficiency.

1. Introduction

Currently, depth maps have a wide range of applications in areas such as unmanned driving, virtual reality, and robot navigation. Although depth information can be obtained from LiDAR and commodity-grade RGB-D cameras, these sensors have obvious limitations in practice. LiDAR provides highly accurate depth measurements, but its output depth map is very sparse (e.g., the projected depth map of a Velodyne HDL-64e has only about 5.6% valid pixels) and the equipment is expensive. In contrast, commercial-grade RGB-D cameras (e.g., Kinect) can generate relatively dense depth maps, but owing to lighting variations, object occlusion, and other factors in the captured scene, the obtained depth maps usually contain many holes and inaccurate measurements, making them difficult to use for high-precision tasks. In addition, although monocular depth estimation can generate a complete depth map, it is an ill-posed problem. Therefore, achieving high-quality depth completion has become an essential topic of common concern in academia and industry.
Earlier depth completion was usually achieved with traditional methods such as image filters [1,2,3]. However, the depth maps generated by these methods often have limited accuracy due to noise, occlusion, and other factors. In recent years, advances in deep learning have drawn increasing attention to convolutional neural networks (CNNs). For example, Uhrig et al. [4] designed a sparsity-invariant network layer to adapt to sparse depth input; Huang et al. [5] proposed a sparsity-invariant multiscale encoder–decoder network (HMS-Net) together with sparse feature mapping on this basis; and Chodosh et al. [6] trained a recurrent auto-encoder and proposed an end-to-end multilayer dictionary learning algorithm. However, because traditional CNNs extract features with fixed convolutional kernels, they can usually capture only local information. To enhance depth completion performance, researchers have started to explore RGB images for assisted completion [7,8,9]. For example, Ma et al. [7] proposed fusing features from RGB images and the corresponding sparse depth maps to perform depth completion. However, RGB images sometimes cannot accurately provide geometric information about objects because of shadows and light reflections, and the completed depth maps often suffer from blurred edges. Therefore, some researchers have explored the Transformer [10,11,12] for the depth completion task. The Transformer demonstrates significant potential in depth completion because of its robust global modelling capability. Rho et al. [10] proposed a two-branch network based entirely on the Transformer. Nevertheless, the Transformer excessively prioritises global information while neglecting the local details of the image. Based on this, Zhang et al. [11] proposed combining the CNN and the Transformer to address the ill-posed nature of depth completion, which achieved exciting results, but it requires substantial computational resources and struggles to meet real-time demands.
Therefore, current deep-learning-based depth completion techniques still face several challenges: (1) The models usually process features of the whole image, and in boundary regions the network struggles to accurately capture boundary information because of large changes in object appearance. (2) The models have high computational complexity, large parameter counts, and long training times, placing heavy demands on computational resources. (3) In regions with sparse texture or poor lighting, the lack of sufficient visual cues makes it difficult for the depth completion model to extract enough feature information and predict depth accurately.
To address these problems, this paper proposes a Sparse Adaptive Dynamic Routing Transformer block (SADRT), which consists of a CNN and a Transformer in parallel. The Transformer part processes global image information, but this inevitably brings a large amount of computation, so we design a dynamic routing mechanism that lets the network adaptively select feature propagation paths, effectively capture multi-level features, and reduce computational overhead. In the CNN part, a channel–spatial attention mechanism (CBAM) and an adaptive sparse activation (ASA) algorithm make the network more flexible in handling the depth information of different scenes and objects. In addition, considering that RGB images provide limited object boundary and structure information, this paper designs a matching loss function for the depth error at object edges and introduces the edge information map extracted by the Canny operator in the encoder, fusing the edge information with the RGB image and the sparse depth map to achieve a more precise depth map.
In summary, our main contributions are as follows:
  • We propose a Sparse Adaptive Dynamic Routing Transformer block (SADRT), based on a CNN and the Transformer, as part of the basic unit of the encoder in EDRNet. By combining dynamic routing with a sparse adaptive activation mechanism, the block improves the computational efficiency of the model while maintaining high accuracy, providing a feasible solution for real-time applications.
  • We design a multi-guidance structure-aware network framework. In layer 4 of the encoder, we introduce the edge information map extracted by the Canny operator and fuse it with the RGB image and sparse depth map features. The method effectively combines the structural information of the RGB image and the edge image, further improving the accuracy and quality of the depth map.
  • We design a matching loss function for object edge depth error. This loss function comprises two components: edge strength loss and edge position matching loss. Experiments on two publicly accessible datasets demonstrate that this loss function enhances the edge quality of the depth map by explicitly modelling the depth error in the edge region.

2. Materials and Methods

2.1. Network Architecture

The network model we designed is based on the U-Net [13] architecture and consists of an encoder and a decoder. The model is trained end to end and learns features directly from the input images to generate high-quality depth maps. To reduce the computational burden, the lightweight MobileNet V2 [14] is used in the first two layers of the encoder, and a Sparse Adaptive Dynamic Routing Transformer (SADRT) block is designed for multi-scale feature extraction. To enhance edge details, we extract the edge information corresponding to the RGB image using the Canny operator and, as determined experimentally, feed it into the fourth layer of the encoder. In the decoder blocks, we combine convolutional layers with an attention mechanism to enhance information interaction across the spatial and channel dimensions, helping the network focus on the most important information during decoding. To ensure that the rich features extracted in the encoding stage are efficiently transferred to the decoding stage, we design skip connections between the encoder and decoder so that the features from each encoding layer are passed to the corresponding decoding layer, which better recovers image details and improves model performance. Finally, we adopt a Spatial Propagation Network (SPN) [15] to refine the depth map. The overall network architecture is shown in Figure 1, and a minimal structural sketch is given below.
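The following is a minimal structural sketch of this layout, written in PyTorch under the channel counts and strides listed in Appendix A.6. Every stage is a plain convolution block standing in for the MobileNetV2 layers, the SADRT blocks, the decoder attention layers, and the SPN refinement; the Canny map is added rather than concatenated, and the 1 × 1 lateral convolutions on the skip connections are our own simplification, so this only illustrates the data flow, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class EDRNetSkeleton(nn.Module):
    """Structural sketch only: placeholder conv blocks, channels/strides per Appendix A.6."""
    def __init__(self):
        super().__init__()
        cfg = [(64, 1), (64, 1), (128, 2), (64, 2), (128, 2), (256, 2), (512, 2)]  # Conv1-Conv7
        cin, stages = 4, []                                  # RGB (3) + sparse depth (1)
        for cout, s in cfg:
            stages.append(conv_bn_relu(cin, cout, s))
            cin = cout
        self.encoder = nn.ModuleList(stages)
        self.edge_proj = nn.Conv2d(1, 64, 1)                 # Canny map, injected before Conv5
        dec_cfg = [(512, 256), (256, 128), (128, 64), (64, 64), (64, 64)]   # Dec4-Dec0
        skip_ch = [256, 128, 64, 128, 64]                    # matching encoder outputs
        self.decoder = nn.ModuleList([conv_bn_relu(i, o) for i, o in dec_cfg])
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, o, 1) for c, (_, o) in zip(skip_ch, dec_cfg)])
        self.head = nn.Conv2d(64, 1, 3, padding=1)           # depth prediction

    def forward(self, rgb, sparse_depth, edge):
        x, feats = torch.cat([rgb, sparse_depth], dim=1), []
        for i, stage in enumerate(self.encoder):
            if i == 4:                                       # edge fusion (addition here; concat in the paper)
                x = x + self.edge_proj(F.interpolate(edge, size=x.shape[-2:]))
            x = stage(x)
            feats.append(x)
        skips = [feats[5], feats[4], feats[3], feats[2], feats[1]]
        for dec, lat, skip in zip(self.decoder, self.lateral, skips):
            x = F.interpolate(x, scale_factor=2.0, mode="bilinear", align_corners=False)
            x = dec(x) + lat(F.interpolate(skip, size=x.shape[-2:]))   # skip connection
        return self.head(x)                                  # SPN refinement would follow here

depth = EDRNetSkeleton()(torch.randn(1, 3, 64, 64),          # RGB
                         torch.randn(1, 1, 64, 64),          # sparse depth
                         torch.randn(1, 1, 64, 64))          # Canny edge map
```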

2.2. SADRT Block

During depth completion, a traditional CNN focuses on extracting information from local regions through convolutional operations. However, limited by the fixed convolutional kernel size and local receptive field, the CNN has difficulty capturing global features. On the other hand, the Transformer can capture global information through the self-attention mechanism, but its computational complexity is high. To combine the advantages of the two and compensate for their respective shortcomings, this paper takes inspiration from the JCAT block proposed by Zhang et al. [11] and designs the Sparse Adaptive Dynamic Routing Transformer block (SADRT) as part of the basic unit of the encoder in the depth completion model. In the Transformer part, a dynamic routing algorithm is proposed to reduce the computation generated by the self-attention mechanism when processing global information. In the convolution part, Adaptive Sparse Activation (ASA) is combined with channel–spatial attention (CBAM). Subsequently, the outputs of the Transformer layer and the convolutional adaptive attention layer are concatenated, and the output feature maps are generated by further convolutional operations. Our proposed SADRT block is shown in Figure 2, and a simplified sketch of the two-branch layout follows. Parameter details for each layer of the network architecture are given in Appendix A.6.
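The sketch below illustrates only the parallel two-branch layout. The dynamic routing attention and the ASA/CBAM components described in the following subsections are replaced by a vanilla multi-head attention and a plain convolution with a sigmoid gate, so this is a simplified stand-in rather than the authors' exact block.

```python
import torch
import torch.nn as nn

class SADRTBlockSketch(nn.Module):
    """Two parallel branches (Transformer-style and convolutional), concatenated and
    fused by a 1x1 convolution; stand-ins replace the DRT and ASA/CBAM components."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.conv_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))    # (B, HW, C)
        t, _ = self.attn(tokens, tokens, tokens)            # global branch
        t = t.transpose(1, 2).reshape(b, c, h, w)
        conv = self.conv_branch(x)
        conv = conv * self.gate(conv)                       # crude attention gating
        return self.fuse(torch.cat([t, conv], dim=1)) + x   # concatenate, fuse, residual

out = SADRTBlockSketch(64)(torch.randn(1, 64, 32, 32))      # quick shape check
```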

2.2.1. Dynamic Routing Transformer

Dynamic routing (DR), originally proposed in Capsule Networks [16], is a mechanism that can dynamically select the tokens participating in attention computation based on the importance of the input features. In this paper, we combine the self-attention layer (SRA), layer normalisation (LN), and a feed-forward network (FFN) to construct the Dynamic Routing Transformer (DRT). By introducing the dynamic routing algorithm, the architecture adaptively selects the most relevant feature paths during training so that the network focuses on the features most important for the depth completion task, minimising the redundant computation generated by the self-attention mechanism when processing global information and thus improving training and inference efficiency.
The dynamic routing computation in the DRT proceeds as follows (a simplified code sketch is given after the steps).
Assume the input feature map $F$ has size $H \times W \times C$ and each patch has size $P \times P$. Then:
① Feature embedding: Split the input feature map into multiple patches and embed them into tokens. The formula is shown in (1):
X = \mathrm{Embed}(F)
where $N$ denotes the number of tokens, $C$ denotes the number of channels, and $X \in \mathbb{R}^{N \times C}$.
② Attention calculation: Calculate the attention of each token to obtain the corresponding attention weight. The formula is shown in (2)–(4):
Q, K, V = \mathrm{LinearProj}(X)
where $Q, K, V \in \mathbb{R}^{N \times C}$ denote the query, key, and value matrices, respectively.
A = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{C}}\right) V
where $A \in \mathbb{R}^{N \times C}$ denotes the attention weight matrix.
X' = A X V^{T}
where $X' \in \mathbb{R}^{N \times C}$ denotes the updated tokens.
③ Importance scoring: Employ the softmax function to convert the attention weights to a probability distribution and use the entropy of the probability distribution as the importance score for each token. The formula is shown in (5) and (6):
P = \mathrm{Softmax}(A)
where $P \in \mathbb{R}^{N \times C}$ denotes the probability distribution of the tokens.
H(P)_{i} = -\sum_{j} P_{ij} \log P_{ij}
where $H(P)_{i}$ denotes the entropy of the probability distribution of token $i$ and serves as its importance score.
④ Dynamic selection: Select the top-$K$ tokens according to their importance scores.
⑤ Attention update: Attention is computed using the selected token and the representation of each token is updated. The formula is shown in (7):
X = \mathrm{Attention}(X'[K])
where $X'[K]$ denotes the selected tokens.
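The sketch below illustrates this routing idea under simplifying assumptions: a single attention head without the linear projections, FFN, or layer normalisation of the full DRT, and an assumed keep_ratio hyper-parameter that is not a value from the paper. Tokens are scored by the entropy of their attention distributions, and the second attention pass is restricted to the selected top-K subset, which is where the computational saving comes from.

```python
import torch

def dynamic_token_routing(x, keep_ratio=0.5):
    """Entropy-scored top-K token selection followed by attention restricted to
    the selected tokens (a simplified reading of steps 2-5 above)."""
    B, N, C = x.shape
    q = k = x                                                       # LinearProj omitted
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, N)
    # Importance score: entropy of each token's attention distribution.
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(dim=-1)      # (B, N)
    K = max(1, int(N * keep_ratio))
    idx = entropy.topk(K, dim=1).indices                            # top-K token indices
    sel = torch.gather(x, 1, idx.unsqueeze(-1).expand(B, K, C))     # selected tokens X'[K]
    # Every token now attends only to the routed subset: cost drops from NxN to NxK.
    attn_sel = torch.softmax(q @ sel.transpose(1, 2) / C ** 0.5, dim=-1)   # (B, N, K)
    return attn_sel @ sel                                           # updated tokens, (B, N, C)

tokens = torch.randn(2, 196, 64)            # e.g. 14 x 14 patches with 64 channels
updated = dynamic_token_routing(tokens)     # (2, 196, 64)
```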

2.2.2. Adaptive Sparse Activation

In depth completion tasks, convolutional neural networks (CNNs) extract features through dense convolutional operations, resulting in a large amount of computation and a large number of parameters. Traditional sparse activation selects some features for activation using a predefined sparsity strategy, which reduces the computational burden but fails to fully exploit the dynamic information of the input features. We therefore propose an Adaptive Sparse Activation (ASA) method, which dynamically adjusts the positions and proportion of sparse activation according to the statistics of the input features so that important features can be effectively selected for activation under different inputs; a brief code sketch follows the steps below.
  • Convolutional Operation
The convolution operation is used to extract local features by sliding a convolution kernel over the feature map. The standard operation is as follows:
Y = \sigma(W * X + b)
where $W$ is the convolution kernel, $X$ is the input feature map, $b$ is the bias, $\sigma$ is the activation function, and $*$ denotes the convolution operation.
  • Sparse Activation
The sparse activation method reduces computation by selecting some features for activation. Suppose the input feature map is $X \in \mathbb{R}^{C \times H \times W}$; the sparse activation is expressed as
X_{\mathrm{sparse}} = X \odot M
where $M$ denotes a sparse mask matrix with 1 at the selected positions and 0 elsewhere, and $\odot$ denotes element-wise multiplication.
  • Adaptive Sparse Activation
The adaptive sparse activation method dynamically generates a sparse mask matrix M informed by the statistical information of the input features. The steps are as follows:
① Feature statistics: Calculate the statistics of the input feature map. The formula is shown in (10):
\mu = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{ij}, \qquad \sigma = \sqrt{\frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left(X_{ij} - \mu\right)^{2}}
where $\mu$ denotes the mean and $\sigma$ denotes the standard deviation.
② Sparse threshold: Calculate the adaptive sparse threshold τ based on the statistical information. The formula is shown in (11):
\tau = \alpha \sigma + \beta \mu
where α and β are adjustable parameters.
③ Sparse mask: Generate the sparse mask matrix M . The formula is shown in (12):
M_{ij} = \begin{cases} 1, & \text{if } X_{ij} \geq \tau \\ 0, & \text{otherwise} \end{cases}
④ Sparse activation: Apply the sparse mask to the input feature map. The formula is shown in (13):
X_{\mathrm{sparse}} = X \odot M
where M denotes the dynamically generated sparse mask matrix.
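A minimal sketch of ASA is shown below. The statistics over the spatial dimensions follow Equations (10)–(13); the values of α and β are assumed examples, since the paper only states that they are tunable.

```python
import torch

def adaptive_sparse_activation(x, alpha=1.0, beta=0.5):
    """Adaptive sparse activation: threshold tau = alpha*sigma + beta*mu computed
    from the input statistics; activations below tau are zeroed."""
    mu = x.mean(dim=(2, 3), keepdim=True)                       # Eq. (10), mean
    sigma = x.std(dim=(2, 3), keepdim=True, unbiased=False)     # Eq. (10), std
    tau = alpha * sigma + beta * mu                             # Eq. (11)
    mask = (x >= tau).float()                                   # Eq. (12), sparse mask M
    return x * mask                                             # Eq. (13), X ⊙ M

feat = torch.randn(2, 64, 32, 32)
sparse_feat = adaptive_sparse_activation(feat)
print((sparse_feat != 0).float().mean())                        # fraction of activations kept
```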

2.3. Introducing Edge Image References

The depth completion task usually relies on RGB images for guidance, but the generated depth maps are often blurred in edge and detail regions due to insufficient feature information. This paper adopts a multi-modal feature fusion strategy that combines RGB images with edge images to address this problem. Extracting edge images from the input RGB images with the Canny algorithm, a classical image processing technique, helps the model accurately capture object boundaries and key geometric features, providing supplementary guidance for the missing parts of the sparse depth map. See Appendix A.2 for the experiments on edge-operator selection. In our model, the edge image is fed into layer 4 of the encoder and fused with the high-level features of the RGB image. This multi-modal fusion enhances the structural information of the image and significantly improves the edge detail of objects in the depth map. To strengthen the fused features, we apply several convolutional layers after feature concatenation to ensure a close integration of the RGB image's colour information with the edge image's structural information. The experimental results demonstrate that this approach enhances edge clarity and detail recovery in the depth map, significantly increasing the accuracy of the depth completion task. The visualisation results of the edge information are shown in Figure 3.
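As a minimal illustration, the edge map can be obtained with OpenCV's Canny implementation and packed as an extra single-channel input; the Gaussian pre-blur and the 50/150 thresholds below are common defaults and an assumption on our part, not values reported in the paper.

```python
import cv2
import numpy as np
import torch

def prepare_edge_input(bgr_image, low=50, high=150):
    """Return the image tensor and a 1-channel Canny edge tensor, ready to be fused
    with the sparse depth input (the fusion itself happens inside the encoder)."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)            # suppress texture noise first
    edges = cv2.Canny(gray, low, high)                  # uint8 edge map, values 0 or 255
    edge_t = torch.from_numpy(edges).float().div(255).unsqueeze(0).unsqueeze(0)
    img_t = torch.from_numpy(bgr_image).float().div(255).permute(2, 0, 1).unsqueeze(0)
    return img_t, edge_t                                # (1, 3, H, W), (1, 1, H, W)

frame = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)   # stand-in image
img_t, edge_t = prepare_edge_input(frame)
```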

2.4. Composite Depth Completion Loss Function

Considering the influence of object edges on the quality of the generated depth maps, this paper introduces a new loss component on top of the loss function proposed by Liu et al. [17]. We name the resulting loss the Composite Depth Completion Loss (a code sketch combining the four terms is given at the end of Section 2.4). The formula is as follows:
L_{CDC}(P, G) = \alpha L_{EM\_Loss}(P, G) + \beta L_{L1}(P, G) + \gamma L_{TV}(P) + \delta L_{Dice}(P, G)
where P denotes the prediction image, G denotes the real image, α, β, γ, and δ are the weight parameters, and the four weight parameters are set to 0.3, 0.2, 0.2, and 0.3, respectively, through experiments. See Appendix A.3 for details of the experiments.

2.4.1. Edge-Matching Loss Function

During model training, traditional loss functions mainly focus on the mean error over all pixels in the image and fail to fully account for the impact of salient features (e.g., edges) on the training result. To ensure that the predicted image matches the real image in edge location and shape, this paper proposes an edge-matching loss function that fully considers edge texture information and improves the precision of the model. Using an edge detection algorithm (the Sobel operator), a specialised loss is calculated for the edge regions of the predicted depth map. The loss consists of two parts: an edge strength loss and an edge position-matching loss; a code sketch follows their definitions.
  • Edge strength loss
It is used to calculate the difference between the edge strength of the predicted and real images. The horizontal and vertical gradients of the predicted image P and the real image G are computed using the Sobel operator. The formula is as follows:
P_{x} = P * S_{x}, \qquad P_{y} = P * S_{y}
G_{x} = G * S_{x}, \qquad G_{y} = G * S_{y}
where $*$ denotes the convolution operation and $S_{x}$, $S_{y}$ are the horizontal and vertical Sobel kernels.
Then, the edge strength is calculated:
P_{edge} = \sqrt{P_{x}^{2} + P_{y}^{2}}, \qquad G_{edge} = \sqrt{G_{x}^{2} + G_{y}^{2}}
The edge strength loss is defined as
L_{edge\_strength}(P, G) = \sum_{i,j} \left| P_{edge}^{i,j} - G_{edge}^{i,j} \right|
  • Edge position-matching loss
It is used to calculate how well the predicted image and the real image match in the edge position to ensure the similarity of the edge profile. The edge position-matching loss is defined as:
L_{edge\_position}(P, G) = \sum_{i,j} \left[ \mathbb{1}\!\left(P_{edge}^{i,j} > \varepsilon\right) \mathbb{1}\!\left(G_{edge}^{i,j} \leq \varepsilon\right) + \mathbb{1}\!\left(P_{edge}^{i,j} \leq \varepsilon\right) \mathbb{1}\!\left(G_{edge}^{i,j} > \varepsilon\right) \right]
where $\mathbb{1}(\cdot)$ is an indicator function that equals 1 when the condition is satisfied and 0 otherwise, and $\varepsilon$ is a small threshold for judging edges.
The edge-matching loss function is defined as
L_{EM\_Loss}(P, G) = L_{edge\_strength}(P, G) + L_{edge\_position}(P, G)
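A sketch of this loss in PyTorch follows. Sobel gradients are computed with fixed convolution kernels, the strength term is the summed absolute difference of the edge-strength maps, and the position term counts pixels that are an edge in one map but not the other; the value of ε is an assumed example, and the plain sums (rather than means) follow the formulas above.

```python
import torch
import torch.nn.functional as F

def edge_matching_loss(pred, gt, eps=0.1):
    """Edge strength + edge position-matching terms computed from Sobel gradients."""
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=pred.device).view(1, 1, 3, 3)
    sy = sx.transpose(2, 3)                                # vertical Sobel kernel

    def edge_strength(d):                                  # d: (B, 1, H, W)
        gx = F.conv2d(d, sx, padding=1)
        gy = F.conv2d(d, sy, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

    p_edge, g_edge = edge_strength(pred), edge_strength(gt)
    strength = (p_edge - g_edge).abs().sum()               # edge strength loss
    mismatch = ((p_edge > eps) & (g_edge <= eps)) | ((p_edge <= eps) & (g_edge > eps))
    position = mismatch.float().sum()                      # edge position-matching loss
    return strength + position

pred = torch.rand(2, 1, 64, 64, requires_grad=True)
gt = torch.rand(2, 1, 64, 64)
loss = edge_matching_loss(pred, gt)
```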

2.4.2. L1 Loss Function

It is a loss function commonly used in depth completion tasks to measure the overall difference between the predicted and true values to ensure the accuracy of the overall depth prediction. The formula is as follows:
L_{L1}(P, G) = \sum_{i,j} \left| P_{i,j} - G_{i,j} \right|

2.4.3. Total Variation Loss Function

This loss function was proposed by Liu et al. [17]. It is used to promote pixel-level connection regularisation and reduce the noise in the predicted depth map. The formula is as follows:
L_{TV}(P) = \sum_{i,j} \left( \left| P_{i+1,j} - P_{i,j} \right| + \left| P_{i,j+1} - P_{i,j} \right| \right)

2.4.4. Dice Loss Function

This loss function was proposed by Liu et al. [17]. It is used to measure the degree of overlap between the predicted and real images, and it can focus on the loss of the foreground region and avoid the interference of background pixels. The formula is as follows:
L_{Dice}(P, G) = 1 - \frac{2 \sum_{i,j} P_{i,j} G_{i,j} + \varepsilon}{\sum_{i,j} P_{i,j}^{2} + \sum_{i,j} G_{i,j}^{2} + \varepsilon}
where $\varepsilon$ is a small smoothing constant set manually.
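Putting the four terms together, a sketch of the composite loss with the reported weights (0.3, 0.2, 0.2, 0.3) is shown below; it reuses the edge_matching_loss sketch above, and the plain sums mirror the formulas (a real implementation might normalise by the number of valid pixels).

```python
def composite_depth_completion_loss(pred, gt, alpha=0.3, beta=0.2, gamma=0.2,
                                    delta=0.3, eps=1e-6):
    """L_CDC = alpha*EM + beta*L1 + gamma*TV + delta*Dice (weights from the paper)."""
    l1 = (pred - gt).abs().sum()                                           # L1 term
    tv = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().sum() + \
         (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().sum()              # total variation
    dice = 1 - (2 * (pred * gt).sum() + eps) / \
               ((pred ** 2).sum() + (gt ** 2).sum() + eps)                 # Dice term
    em = edge_matching_loss(pred, gt)                                      # defined above
    return alpha * em + beta * l1 + gamma * tv + delta * dice

total = composite_depth_completion_loss(pred, gt)    # tensors from the previous sketch
```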

3. Results and Discussion

3.1. Datasets

Our experiments are based on two public datasets, KITTI DC and NYU Depth V2, which cover outdoor and indoor scenes, respectively, and are the official datasets for depth completion.
The KITTI Depth Completion dataset [18] consists of outdoor scenes captured by multiple sensors, covering trees, roads, pedestrians, and vehicles. The resolution of the RGB images and depth maps is 352 × 1216; since the sky contains very little depth information, the images are cropped to 240 × 1216 in our experiments to facilitate training. The dataset contains more than 93,000 pairs of RGB images and raw sparse depth maps. We divide it into training, validation, and test sets, with 86,000 pairs used for training, 7000 for validation, and 1000 for testing.
NYU Depth V2 Dataset [19] is an indoor scene depth benchmark dataset released by New York University. The dataset uses Microsoft Kinect’s RGB-D camera to acquire monocular RGB images of indoor scenes and the corresponding true depth information. It contains 464 indoor scene photo sets and is a commonly used dataset for depth completion tasks. The images have a native resolution of 640 × 480 and a depth range of 0.5–10 m. In this paper, we use this dataset to evaluate the depth completion capability of EDRNet in indoor scenes. In the experiments, following the segmentation approach of previous work [20,21], we use 120,000 images of 249 scenes uniformly sampled from the training set for training and 654 images of 215 scenes for validation and testing.

3.2. Evaluation Metrics

The performance of depth completion is evaluated mainly by comparing the error and accuracy between the predicted depth P and the true depth G. We use the following common evaluation metrics to assess the performance of our method: (1) root mean square error (RMSE); (2) mean absolute error (MAE); (3) inverse root mean square error (iRMSE); (4) inverse mean absolute error (iMAE); (5) structural similarity index measure (SSIM); (6) learned perceptual image patch similarity (LPIPS); and (7) mean relative error (REL).
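For reference, the non-perceptual metrics can be computed as below on the pixels with valid ground truth; SSIM and LPIPS require dedicated libraries and are omitted. The validity threshold and the units of the inverse metrics depend on the dataset convention, so treat this as a sketch rather than the official evaluation code.

```python
import torch

def depth_metrics(pred, gt, valid_threshold=1e-3):
    """RMSE, MAE, iRMSE, iMAE, and REL over pixels with valid ground truth."""
    mask = gt > valid_threshold                     # ignore empty ground-truth pixels
    p, g = pred[mask], gt[mask]
    return {
        "RMSE": torch.sqrt(((p - g) ** 2).mean()),
        "MAE": (p - g).abs().mean(),
        "iRMSE": torch.sqrt(((1.0 / p - 1.0 / g) ** 2).mean()),
        "iMAE": (1.0 / p - 1.0 / g).abs().mean(),
        "REL": ((p - g).abs() / g).mean(),
    }

scores = depth_metrics(torch.rand(1, 1, 64, 64) + 0.5, torch.rand(1, 1, 64, 64) + 0.5)
```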

3.3. Implementation Details

We implement our model in PyTorch 2.1.0 on two NVIDIA GeForce RTX 4090 GPUs with 24 GB of memory each, running Ubuntu 20.04. After comparing the optimisation effects of Adam and AdamW [22], we chose the AdamW optimiser. The initial learning rate is 0.001, β1 = 0.9, β2 = 0.999, and the weight decay is 0.01. The number of training epochs is 100, and the batch size is 32. For efficient learning, we preprocess the data with flipping and colour jittering. On the KITTI DC and NYUv2 datasets, the batch size per GPU is set to 6 and 12, respectively. On NYUv2, we train the model for 100 epochs and decay the learning rate by a factor of 0.5 at epochs 40, 60, and 80. On KITTI DC, the model is also trained for 100 epochs, with the learning rate decayed by a factor of 0.5 at epochs 50, 60, 70, 80, and 90. To validate the performance of the proposed network, we performed qualitative analysis, quantitative analysis, and ablation studies.
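A minimal sketch of this optimisation setup in PyTorch is given below; `model` is a placeholder module standing in for EDRNet, and the training-loop body is elided.

```python
import torch

model = torch.nn.Conv2d(4, 1, 3, padding=1)              # placeholder for EDRNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.999), weight_decay=0.01)
# KITTI DC schedule: halve the learning rate at epochs 50, 60, 70, 80, 90
# (for NYUv2 the milestones would be [40, 60, 80]).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 60, 70, 80, 90], gamma=0.5)

for epoch in range(100):
    # ... forward/backward passes over the KITTI DC loader would go here ...
    optimizer.step()       # placeholder; normally called per batch after backward()
    scheduler.step()       # epoch-level learning-rate decay
```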

3.4. Discussion

3.4.1. Quantitative Analysis

We compare the experimental results of EDRNet with relevant results reported in published papers; the results are shown in Table 1.
This section systematically evaluates the performance of EDRNet. On the KITTI DC dataset, six evaluation metrics are used for benchmarking: MAE, iMAE, RMSE, iRMSE, SSIM, and LPIPS. The experimental results show that, except for the iRMSE and LPIPS metrics, where EDRNet is slightly inferior to CompletionFormer, EDRNet exhibits clear advantages on the other four metrics. On the NYU Depth V2 dataset, benchmarking uses RMSE and REL. The results show that although the RMSE of EDRNet is slightly inferior to that of CompletionFormer, its REL is significantly better than that of the other network models in the literature. In summary, EDRNet performs well across metrics in both indoor and outdoor environments; our edge fusion strategy and dynamic routing mechanism effectively preserve edge structure and details, making the model highly competitive in depth completion. See Appendix A.4 for the overfitting analysis.

3.4.2. Qualitative Analysis

The qualitative results on the KITTI DC dataset are shown in Figure 4.
Figure 4 shows the visualisation results of the proposed method in comparison with other methods. Each row shows (a) the colour image, (b) the ground-truth depth map, (c) the prediction of CSPN++ [24], (d) the prediction of RigNet [29], (e) the prediction of CompletionFormer [11], and (f) the prediction of the proposed EDRNet. Compared with the other methods, EDRNet produces clearer object boundaries thanks to the introduction of edge images, which largely avoids confusing the depth relationship between foreground and background objects caused by overly smooth transitions of depth values. We use arrows to indicate key locations. For example, in the left set of result maps, where the arrow points to the upper part of the car and the background trees, EDRNet distinguishes the edges of these two objects more clearly. At the tree trunks and backgrounds indicated by the arrows in the middle and right sets of result maps, the contours of the corresponding objects in the EDRNet results are also clearly visible, and the shapes of the edge regions are highly consistent with those in the colour images.
The qualitative results of the NYU DepthV2 dataset are shown in Figure 5.
Figure 5 shows the visualisation results of the proposed method compared with other methods, where each column is (a) the colour image, (b) the ground-truth depth map, (c) the prediction of CSPN++ [24], (d) the prediction of RigNet [29], (e) the prediction of CompletionFormer [11], and (f) the prediction of EDRNet, the model proposed in this paper. We use arrows to indicate key locations. Again, as shown by the regions marked by arrows in the four sets of comparison plots, EDRNet exhibits higher clarity in the boundaries between foreground and background objects.
The experimental results show that the depth maps generated by EDRNet exhibit finer detail reproduction capabilities in both indoor and outdoor scenes, especially outperforming other comparative methods in the processing of object edges and complex structures.

3.4.3. Visualisation and Analysis

We visualise the convergence of the loss function of the proposed network during training and validation.
From the experimental results in Section 3.4.1 and Section 3.4.2, CompletionFormer performs best among the compared network architectures other than EDRNet. We therefore plot the training and validation loss curves of EDRNet versus CompletionFormer (Figure 6 and Figure 7). These two plots visualise the convergence behaviour of the models on the KITTI and NYUv2 datasets.
As shown in Figure 6 and Figure 7, the training loss of this paper’s method quickly converges to a stable value within 100 epochs, and the validation loss is consistently lower than that of the comparison methods, indicating that it possesses higher learning efficiency and generalisation ability. This is consistent with the improvement of the final metrics in Table 1. Compared to the CompletionFormer architecture, the model in this paper converges faster and is more stable.

3.4.4. Performance Comparison of Hybrid CNN–Transformer Block

To verify the effectiveness of the SADRT block proposed in this paper, we compare its performance with other hybrid CNN–Transformer modules proposed in the published literature, and the experimental results are shown in Table 2.
The results show that the SADRT block significantly outperforms other hybrid CNN–Transformer blocks in all evaluation metrics on the KITTI DC dataset. The sparse dynamic routing mechanism is designed so that the SADRT block can adaptively process different region features, reducing the amount of redundant computation while improving the accuracy of the depth map. In addition, the inference speed of the SADRT block reaches 15 FPS with 364.5G FLOPs, which is significantly better than other hybrid models.

3.5. Ablation Results

3.5.1. Ablation Experiments for Each Block

This section conducts ablation experiments to validate the effectiveness of our proposed blocks in enhancing model performance. The results are shown in Table 3.
Table 3 shows that whether the edge image is introduced alone or together with the edge loss function, both the RMSE and MAE of the model decrease relative to the baseline, indicating that introducing edge information and the EM loss function effectively enhances the precision of the depth map. Furthermore, the SADRT block reduces the number of parameters and FLOPs of the model from 67.8 M and 374.7 G to 55.4 M and 359.9 G, respectively, while still extracting multi-level features. When all blocks work together, the RMSE and MAE drop to 86.3 mm and 34.1 mm, respectively, while maintaining a reasonable computational cost. The experimental results show that the individual modules proposed in this paper not only improve depth map accuracy but also effectively reduce the computational overhead brought by the Transformer, with the SADRT module playing the pivotal role in reducing computational complexity. Details of the curve fitting and significance analysis are given in Appendix A.1.

3.5.2. The Edge Image Introduces Positional Experiments

To analyse the impact of the location at which the edge image is introduced, we conduct ablation experiments on the original model architecture and compare the model's performance when the edge image is introduced at different locations in the encoder and decoder. The results are shown in Table 4.
The experimental results show that the two evaluation metrics are similar when the edge image is introduced at layer 4 of the encoder or at layer 1 of the decoder; however, the former yields the larger improvement. Based on this, we finally introduce the edge image at encoder layer 4 to improve the overall performance of the network architecture.

3.5.3. Comparisons with General Feature Backbones

To verify that the backbone designed in this paper has a performance advantage over other backbones, we performed the following comparative experiments. The results are shown in Table 5.
The experimental results show that the backbone designed in this paper delivers superior performance in terms of RMSE, MAE, FLOPs, and frames per second (FPS).

3.5.4. Comparison and Ablation Study of Loss Function

To verify the effectiveness of the proposed loss function, this study constructs a control experimental group based on the EDRNet architecture. It replaces the loss function with L1 Loss, Smooth L1 Loss, and Hybrid Loss sequentially for comparison and analysis while keeping the network structure and training strategy consistent. The experimental results are shown in Table 6.
This experiment quantitatively evaluates four loss functions on the KITTI dataset. The results show that the loss function proposed in this paper achieves the best performance on the core metrics. The root mean square error (RMSE) is 708.56 mm, which is 0.53% lower than that of the traditional L1 Loss (712.35 mm) and 0.08% lower than that of the Hybrid Loss, and the MAE is 202.56 mm, which is 0.6% lower than that of the sub-optimal Hybrid Loss (203.78 mm). Although the iRMSE after training with our proposed loss function is slightly higher than with the Hybrid Loss, all other metrics outperform it. It is worth noting that Smooth L1 Loss outperforms L1 Loss in both iRMSE and iMAE among the compared methods, and our loss function achieves consistent improvements across the metrics through multi-objective optimisation.
To demonstrate the effectiveness of each component of the proposed loss function, we set the corresponding coefficients to 0 in turn; the experimental results are shown in Table 7.
The experimental results show a significant decrease in depth accuracy (RMSE = 713.23 mm, MAE = 205.79 mm) when α (Method A) is removed, illustrating the importance of α for direct depth prediction. When γ (Method C) was removed, iRMSE performed optimally, but the overall depth accuracy was still slightly inferior to the entire model. Removing δ (Method D) then leads to significantly worse iMAE values, indicating that the Dice loss function has a key role in geometric consistency. Ultimately, the whole model (α = 0.3, β = 0.2, γ = 0.2, δ = 0.3) is optimal in most metrics by synergistically optimising all the loss terms, and is only slightly higher in iRMSE than method C. This result confirms the complementary nature of the loss functions and that a combined balancing of the weights achieves the optimal global performance.

4. Conclusions

In this paper, we propose an edge-enhanced dynamically routed adaptive depth completion network (EDRNet). The network takes RGB images and sparse depth maps as inputs, extracts edge maps with the Canny algorithm, a traditional image processing technique, and introduces them into the fourth layer of the encoder to provide finer edge information. Meanwhile, we design the Sparse Adaptive Dynamic Routing Transformer block (SADRT), in which a convolutional adaptive attention branch and a Dynamic Routing Transformer branch are integrated in parallel as part of the encoder layers. This design reduces the heavy parameter and computation cost associated with the traditional Transformer while efficiently capturing global and local features. In addition, we propose an edge-matching loss function to further optimise the edge quality and global accuracy of the depth map. Extensive experiments show that EDRNet delivers excellent performance and computational efficiency in the depth completion task. Building on previous research, our work pays particular attention to the acquisition and optimisation of edge information and provides new ideas for research in the field of depth completion. Although our method performs well in most scenarios, the precision of the depth map in transparent object regions is currently low due to light reflection and related effects. In future work, we will address these problems.

Author Contributions

Conceptualization, F.S. and B.L.; data curation, F.S.; formal analysis, F.S. and B.L.; funding acquisition, B.L.; investigation, F.S. and B.L.; methodology, F.S.; project administration, F.S.; resources, F.S., B.L. and Q.Z.; software, F.S.; supervision, B.L.; validation, F.S.; visualisation, F.S.; writing—original draft preparation, F.S.; writing—review and editing, F.S., B.L. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (grant numbers 61973234 and 62203326) and in part by the Tianjin Natural Science Foundation (grant number 20JCYBJC00180).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code presented in this study is available on request from the corresponding author.

Acknowledgments

The authors would like to thank Tiangong University for technical support and all members of our team for their contribution to the deep completion experiments. The authors acknowledge the anonymous reviewers for their helpful comments on the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Curve Fitting and Significance Analysis

Figure A1 shows scatter plots of depth predictions versus true values with fitted curves (Ours: R² = 0.904 vs. CompletionFormer: R² = 0.854). The results show that our method significantly improves the linear correlation between predicted and true values while maintaining a low RMSE (708.56 mm), validating the model's robustness to complex scenes.
Figure A1. Results on the KITTI dataset.
Then, we designed the corresponding ablation studies by sequentially removing the edge image input, the edge matching loss, and the SADRT block on the basis of the original EDRNet architecture. We plotted the scatter fit curves of the depth prediction values and the true values and recorded the changes in the corresponding R2 values. The experimental results are shown in Figure A2, Figure A3, Figure A4 and Figure A5.
Figure A2. Full architecture.
Figure A3. Remove edge image.
Figure A4. Remove EM loss.
Figure A5. Remove SADRT block.
Table A1. Comparison of R2 values for different configuration methods on KITTI.
| Method | Edge Image | EM Loss | SADRT Block | R²↑  | Change |
|--------|------------|---------|-------------|------|--------|
| A      | ✓          | ✓       | ✓           | 0.90 | -      |
| B      |            | ✓       | ✓           | 0.83 | −0.07  |
| C      | ✓          |         | ✓           | 0.85 | −0.05  |
| D      | ✓          | ✓       |             | 0.87 | −0.03  |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better.
In this study, the contribution of each block to the model performance was systematically analysed by ablation experiments (as shown in Table A1). From the experimental results, it is known that the complete model A achieves the optimal performance (R2 = 0.90), which verifies the effectiveness of the multi-module co-optimisation. The removal of the edge image processing block (Method B) triggered the maximum performance decay (R2 = 0.83, ΔR2 = 0.07), confirming the core position of this block in the depth completion task. When EM Loss was missing (Method C), the model performance decreased (R2 = 0.85, ΔR2 = 0.05), illustrating its key role in optimising the geometric constraints. Removal of the SADRT block (Method D) resulted in a decrease of 0.03 in the R2 value, indicating that this module has a significant effect on the accuracy of the depth completion task. In summary, the order of contribution of each block to the model performance is as follows: edge image processing module > EM loss function > SADRT module.

Appendix A.2. Experiments on the Selection of Edge Operators

This experiment systematically compares the performance of the Canny, Sobel, and Prewitt operators in depth completion; the results are shown in Table A2.
Table A2. Performance comparison of different edge detection methods.
| Edge Method | RMSE↓ (mm) | MAE↓ (mm) | Edge SSIM↑ |
|-------------|------------|-----------|------------|
| Canny       | 708.56     | 202.56    | 0.912      |
| Sobel       | 715.23     | 204.69    | 0.904      |
| Prewitt     | 719.54     | 208.12    | 0.897      |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
As shown in Table A2, the Canny operator achieves optimal results in all three evaluation metrics due to its multi-level noise suppression and accurate edge localisation capability. Specifically, the Canny algorithm enhances the structural consistency between the edge image and the RGB image by retaining the local maxima of the edges through non-maximum suppression and effectively filtering out the texture noise using double thresholding. In contrast, the Sobel and Prewitt operators are susceptible to noise interference due to the lack of a similar noise suppression mechanism, leading to the inclusion of more redundant features in the edge image, which in turn affects the depth completion performance of the network. Therefore, the input edge image in the network architecture of this paper is extracted by the Canny operator.

Appendix A.3. Experiments to Determine the Parameters of the Loss Function

Because the dataset is large, a grid search would consume substantial computational resources, so this paper uses univariate adjustment together with empirically oriented experiments to determine the parameter values of each loss term. Following parameter-setting practice in the previous literature and the characteristics of each loss term in this paper, we set up several groups of penalty parameters in the empirically oriented combination experiments and selected the best-performing group as the penalty parameters. Although the parameters determined in this way are not necessarily optimal, they achieve a good balance between computational cost and performance. The experimental procedure is as follows:
(1)
Univariate adjustment experiments
Based on the previous paper’s parameter setting experience, this paper first fixes β = 0.2, γ = 0.2, and δ = 0.3 and tests the performance when α ∈ [0.2,0.3, 0.4, 0.5], respectively.
The experimental results are shown in Table A3.
Table A3. Comparison of performance for different values of α.
| Method (KITTI DC) | RMSE↓ (mm) | MAE↓ (mm) | R²↑   |
|-------------------|------------|-----------|-------|
| α = 0.2           | 709.23     | 202.52    | 0.899 |
| α = 0.3           | 708.56     | 202.56    | 0.904 |
| α = 0.4           | 708.88     | 202.92    | 0.872 |
| α = 0.5           | 709.45     | 203.14    | 0.844 |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
From the above table, it can be seen that the performance of the system is optimal when α = 0.3, so we set the value of α to 0.3 and then adjust the other parameters in turn.
Similarly, we test the performance for β ∈ {0.2, 0.4, 0.6, 0.8} with α = 0.3, γ = 0.2, and δ = 0.3. The experimental results are shown in Table A4.
Table A4. Comparison of performance for different values of β.
| Method (KITTI DC) | RMSE↓ (mm) | MAE↓ (mm) | R²↑   |
|-------------------|------------|-----------|-------|
| β = 0.2           | 708.56     | 202.56    | 0.904 |
| β = 0.4           | 708.98     | 202.87    | 0.901 |
| β = 0.6           | 709.45     | 203.23    | 0.899 |
| β = 0.8           | 710.23     | 203.69    | 0.900 |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
From the above table, it can be seen that the performance of the system is optimal when β = 0.2, so we set the value of β to 0.2.
(2)
Empirically orientated experiments
From the univariate adjustment experiments above, α and β are chosen as 0.3 and 0.2, respectively. Based on prior knowledge from the literature and the properties of the corresponding loss terms, we set the parameters γ and δ, which weight the TV loss and the Dice loss, to four candidate combinations (γ: 0.2/0.3, δ: 0.2/0.3) and conduct the corresponding experiments. The experimental results are shown in Table A5.
Table A5. Multi-parameter performance comparison.
| Method (KITTI DC) | α   | β   | γ   | δ   | RMSE↓ (mm) | MAE↓ (mm) | R²↑   |
|-------------------|-----|-----|-----|-----|------------|-----------|-------|
| A                 | 0.3 | 0.2 | 0.3 | 0.2 | 708.69     | 202.74    | 0.902 |
| B                 | 0.3 | 0.2 | 0.2 | 0.3 | 708.56     | 202.56    | 0.904 |
| C                 | 0.3 | 0.2 | 0.3 | 0.3 | 709.63     | 203.17    | 0.898 |
| D                 | 0.3 | 0.2 | 0.2 | 0.2 | 709.02     | 202.95    | 0.897 |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
The experimental results show that Method B reaches the optimal value in all three indicators, so we set the parameter values of γ and δ to 0.2 and 0.3, respectively.
(3)
Comprehensive validation
Through the above series of experiments, we finally determined the four parameters as 0.3, 0.2, 0.2, and 0.3. To more intuitively illustrate that the penalty parameters specified by the above experiments perform best, we plotted the training and validation loss function curves during the model training process. The curves are shown in Figure A6 and Figure A7.
Figure A6. Training loss function curves.
Figure A7. Validation loss function curves.
From the training and validation loss curves above, Method B shows the best convergence behaviour within the first 20 epochs, indicating that the model can quickly capture data features in this phase. The loss curve of Method C oscillates visibly in the early and middle stages, indicating that its parameter settings lead to an unstable optimisation process. The curves of B and C indicate that increasing the TV loss weight (γ) can improve stability, but too large a value instead inhibits the model's expressive ability. In terms of generalisation, Method B has the smallest gap between training and validation loss (Δ = 0.1). In summary, the network performs best with the Method B parameters, which achieve the best trade-off between convergence speed, stability, and generalisation.

Appendix A.4. Overfitting Correlation Analysis

Due to the small size of the NYUv2 dataset, we applied enhancement operations such as random rotation (±15°), scale transformation (0.8–1.2 times), and colour perturbation (brightness and contrast adjustments ±20%) to the input data during the training phase, which effectively improved the data diversity. In addition, the model introduces stochastic channel dropout and weight normalisation in key blocks (SADRT, Edge Fusion) to further suppress the over-reliance on specific local features.
Meanwhile, Figure A8 and Figure A9 show the supplementary training and validation loss curves on the KITTI and NYUv2 datasets. The figures show that the training and validation loss curves of EDRNet, the network proposed in this paper, decrease and converge on both datasets. The validation loss is always slightly higher than the training loss, with no significant gap between the two, indicating that the model does not overfit.
Figure A8. Training and validation loss function curves on KITTI.
Figure A9. Training and validation loss function curves on NYU V2.

Appendix A.5. Correlation Analysis of Inference Speed and Power Consumption

  • Additional analysis of inference speed (FPS):
In the backbone comparison reported in the main text, the backbone in this paper reaches an inference speed of FPS = 15, significantly better than mainstream backbones such as PVT-Large (FPS = 9) and Swin-Tiny (FPS = 7). To further validate the inference speed of the overall network architecture, we conducted additional tests in an NVIDIA RTX 4090 environment (Table A6).
Table A6. Inference speed comparison.
| Method                 | KITTI DC (FPS)↑ | NYUv2 (FPS)↑ |
|------------------------|-----------------|--------------|
| SDformer [12]          | 8               | 12           |
| GuideFormer [10]       | 9               | 12           |
| CompletionFormer [11]  | 12              | 14           |
| Ours                   | 14              | 16           |
Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better.
From the experimental results, EDRNet achieves 14 FPS on the KITTI DC dataset and 16 FPS on NYU Depth v2, significantly faster than other Transformer-based methods such as CompletionFormer. By optimising the multi-scale feature fusion mechanism and streamlining the computational complexity of the Transformer module, EDRNet maintains high depth completion accuracy while offering faster inference.
  • Power consumption analysis:
To verify the energy efficiency of the model, we measured the average power consumption (in Watts, W) during training and inference using the NVIDIA-SMI tool. Samples were taken once per second, fluctuating values in the initialisation phase of the first 10 s were excluded, and the average value in the stabilisation phase was taken. The experimental results are presented below:
Table A7. Average power consumption (W) of different methods on KITTI DC and NYUv2.
| Method                 | KITTI DC Training (W)↓ | KITTI DC Inference (W)↓ | NYUv2 Training (W)↓ | NYUv2 Inference (W)↓ |
|------------------------|------------------------|-------------------------|---------------------|----------------------|
| SDformer [12]          | 290                    | 270                     | 290                 | 260                  |
| GuideFormer [10]       | 290                    | 280                     | 280                 | 260                  |
| CompletionFormer [11]  | 270                    | 250                     | 250                 | 230                  |
| Ours                   | 260                    | 230                     | 230                 | 210                  |
Note: Optimal results for each indicator are in bold. ↓ indicates that a smaller value is better.
As can be seen from the data, our proposed network requires the lowest average power consumption during training and inference, which indicates that the design of the dynamic routing technique effectively reduces the computational cost. In addition, the overall power consumption of the NYUv2 dataset is generally slightly lower than that of the KITTI DC dataset, which is related to the small size and simple environment of the NYUv2 dataset.

Appendix A.6. Network Details

To present the network hierarchy more systematically, Table A8 gives a layer-by-layer decomposition of EDRNet.
Table A8. Network parameters of EDRNet.
| Name             | Operator                                                                      | Input Dimension (H × W × D)                    | Output Dimension (H × W × D)                   |
|------------------|-------------------------------------------------------------------------------|------------------------------------------------|------------------------------------------------|
| Input            | RGB and Sparse Depth                                                          | RGB image: H × W × 3; sparse depth: H × W × 1  | RGB image: H × W × 3; sparse depth: H × W × 1  |
| Conv1            | Concat [RGB, Sparse Depth]; Conv 3 × 3 + BN + ReLU                            | H × W × 4                                      | H × W × 64                                     |
| Conv2            | MobileNetV2 Block × 3                                                         | H × W × 64                                     | H × W × 64                                     |
| Conv3            | MobileNetV2 Block × 3                                                         | H × W × 64                                     | 1/2 H × 1/2 W × 128                            |
| Conv4            | SADRT Block × 3                                                               | 1/2 H × 1/2 W × 128                            | 1/4 H × 1/4 W × 64                             |
| Conv5            | SADRT Block × 3; Canny image (H × W × 1) → downsampling → 1 × 1 Conv → concat | 1/4 H × 1/4 W × 64                             | 1/8 H × 1/8 W × 128                            |
| Conv6            | SADRT Block × 3                                                               | 1/8 H × 1/8 W × 128                            | 1/16 H × 1/16 W × 256                          |
| Conv7            | SADRT Block × 3                                                               | 1/16 H × 1/16 W × 256                          | 1/32 H × 1/32 W × 512                          |
| Dec4             | 3 × 3 Conv Attention Layer                                                    | 1/32 H × 1/32 W × 512                          | 1/16 H × 1/16 W × 256                          |
| Dec3             | 3 × 3 Conv Attention Layer                                                    | 1/16 H × 1/16 W × 256                          | 1/8 H × 1/8 W × 128                            |
| Dec2             | 3 × 3 Conv Attention Layer                                                    | 1/8 H × 1/8 W × 128                            | 1/4 H × 1/4 W × 64                             |
| Dec1             | 3 × 3 Conv Attention Layer                                                    | 1/4 H × 1/4 W × 64                             | 1/2 H × 1/2 W × 64                             |
| Dec0             | 3 × 3 Conv Attention Layer                                                    | 1/2 H × 1/2 W × 64                             | H × W × 64                                     |
| Prediction Depth | –                                                                             | H × W × 64                                     | H × W × 1                                      |
| Refine           | Spatial Propagation Network Refinement                                        | H × W × 1                                      | H × W × 1                                      |

References

  1. Levin, A.; Lischinski, D.; Weiss, Y. Colorization using optimization. ACM Trans. Graph. (TOG) 2004, 23, 689–694. [Google Scholar] [CrossRef]
  2. Tomasi, C.; Manduchi, R. Bilateral filtering for gray and color images. In Proceedings of the Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), Bombay, India, 7 January 1998; pp. 839–846. [Google Scholar] [CrossRef]
  3. Ku, J.; Harakeh, A.; Waslander, S.L. In Defense of Classical Image Processing: Fast Depth Completion on the CPU. In Proceedings of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018. [Google Scholar]
  4. Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity Invariant CNNs. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20. [Google Scholar] [CrossRef]
  5. Huang, Z.; Fan, J.; Cheng, S.; Yi, S.; Wang, X.; Li, H. HMS-Net: Hierarchical Multi-Scale Sparsity-Invariant Network for Sparse Depth Completion. IEEE Trans. Image Process. 2018, 29, 3429–3441. [Google Scholar] [CrossRef] [PubMed]
  6. Chodosh, N.; Wang, C.; Lucey, S. Deep Convolutional Compressed Sensing for LiDAR Depth Completion. arXiv 2018, arXiv:1803.08949. [Google Scholar] [CrossRef]
  7. Ma, F.; Karaman, S. Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1–8. [Google Scholar] [CrossRef]
  8. Ma, F.; Cavalheiro, G.V.; Karaman, S. Self-Supervised Sparse-to-Dense: Self-Supervised Depth Completion from LiDAR and Monocular Camera. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3288–3295. [Google Scholar] [CrossRef]
  9. Jaritz, M.; Charette, R.D.; Wirbel, É.; Perrotton, X.; Nashashibi, F. Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 52–60. [Google Scholar] [CrossRef]
  10. Rho, K.; Ha, J.; Kim, Y. GuideFormer: Transformers for Image Guided Depth Completion. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6240–6249. [Google Scholar]
  11. Zhang, Y.; Guo, X.; Poggi, M.; Zhu, Z.; Huang, G.; Mattoccia, S. CompletionFormer: Depth Completion with Convolutions and Vision Transformers. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18527–18536. [Google Scholar] [CrossRef]
  12. Qian, J.; Sun, M.; Lee, A.; Li, J.; Zhuo, S.; Chiang, P. SDformer: Efficient End-to-End Transformer for Depth Completion. arXiv 2024, arXiv:2409.08159. [Google Scholar]
  13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar] [CrossRef]
  14. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  15. Liu, S.; Mello, S.D.; Gu, J.; Zhong, G.; Yang, M.-H.; Kautz, J. Learning Affinity via Spatial Propagation Networks. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  16. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. arXiv 2017, arXiv:1710.09829. [Google Scholar] [CrossRef]
  17. Liu, X.; Liu, Y.; Fu, W.; Liu, S. RETRACTED ARTICLE: SCTV-UNet: A COVID-19 CT segmentation network based on attention mechanism. Soft Comput 2024, 28, 473. [Google Scholar] [CrossRef] [PubMed]
  18. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  19. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  20. Agarwal, A.; Arora, C. Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 5850–5859. [Google Scholar] [CrossRef]
  21. Lee, J.H.; Han, M.-K.; Ko, D.W.; Suh, I.H. From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation. arXiv 2019, arXiv:1907.10326. [Google Scholar] [CrossRef]
22. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  23. Cheng, X.; Wang, P.; Yang, R. Learning Depth with Convolutional Spatial Propagation Network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2361–2379. [Google Scholar] [CrossRef] [PubMed]
  24. Cheng, X.; Wang, P.; Guan, C.; Yang, R. CSPN++: Learning Context and Resource Aware Convolutional Spatial Propagation Networks for Depth Completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar]
25. Xu, Y.; Zhu, X.; Shi, J.; Zhang, G.; Bao, H.; Li, H. Depth Completion from Sparse LiDAR Data with Depth-Normal Constraints. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2811–2820. [Google Scholar]
  26. Tang, J.; Tian, F.-P.; Feng, W.; Li, J.; Tan, P. Learning Guided Convolutional Network for Depth Completion. IEEE Trans. Image Process. 2019, 30, 1116–1129. [Google Scholar] [CrossRef] [PubMed]
27. Zhao, S.; Gong, M.; Fu, H.; Tao, D. Adaptive Context-Aware Multi-Modal Network for Depth Completion. IEEE Trans. Image Process. 2021, 30, 5264–5276. [Google Scholar] [CrossRef] [PubMed]
  28. Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene From Sparse LiDAR Data and Single Color Image. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3308–3317. [Google Scholar] [CrossRef]
29. Yan, Z.; Wang, K.; Li, X.; Zhang, Z.; Xu, B.; Li, J.; Yang, J. RigNet: Repetitive Image Guided Network for Depth Completion. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  30. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16514–16524. [Google Scholar]
  31. Peng, Z.; Guo, Z.; Huang, W.; Wang, Y.; Xie, L.; Jiao, J.; Tian, Q.; Ye, Q. Conformer: Local Features Coupling Global Representations for Recognition and Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9454–9468. [Google Scholar] [CrossRef] [PubMed]
  32. Kim, S.; Gholami, A.; Shaw, A.E.; Lee, N.; Mangalam, K.; Malik, J.; Mahoney, M.W.; Keutzer, K. Squeezeformer: An Efficient Transformer for Automatic Speech Recognition. arXiv 2022, arXiv:2206.00888. [Google Scholar] [CrossRef]
  33. Yang, Y.; Pan, Y.; Yin, J.; Han, J.; Ma, L.; Lu, H. Hybridformer: Improving Squeezeformer with Hybrid Attention and NSR Mechanism. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. arXiv 2016, arXiv:1603.05027. [Google Scholar] [CrossRef]
  36. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar] [CrossRef]
  37. Lee, Y.; Kim, J.; Willette, J.; Hwang, S.J. MPViT: Multi-Path Vision Transformer for Dense Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7277–7286. [Google Scholar] [CrossRef]
  38. Van Gansbeke, W.; Neven, D.; Brabandere, B.D.; Gool, L.V. Sparse and Noisy LiDAR Completion with RGB Guidance and Uncertainty. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar] [CrossRef]
39. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
40. Liao, Y.; Huang, L.; Wang, Y.; Kodagoda, S.; Yu, Y.; Liu, Y. Parse geometry from a line: Monocular depth estimation with partial laser observation. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 5059–5066. [Google Scholar] [CrossRef]
Figure 1. The architecture of the proposed model.
Figure 2. SADRT block.
Figure 3. Edge information visualisation, where (a) is the RGB image, (b) is the ground truth, (c) is the RGB image of the boxed region, and (d) is the edge image of the boxed region.
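To make the edge branch shown in Figure 3 concrete, the following is a minimal sketch of how an edge map can be extracted with the Canny operator and stacked with the RGB image and sparse depth to form the network input. It assumes OpenCV and NumPy; the thresholds (50, 150) and the simple channel concatenation are illustrative choices, not necessarily the exact pre-processing used in EDRNet.

```python
import cv2
import numpy as np

def build_input(rgb_path, sparse_depth):
    """Stack RGB, a Canny edge map, and sparse depth into one 5-channel input.

    rgb_path     : path to the RGB image (read by OpenCV as H x W x 3, BGR)
    sparse_depth : H x W float array with zeros at missing pixels
    """
    rgb = cv2.imread(rgb_path)                      # H x W x 3, uint8
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    # Canny thresholds are illustrative; tune them per dataset.
    edges = cv2.Canny(gray, 50, 150).astype(np.float32) / 255.0

    rgb_f = rgb.astype(np.float32) / 255.0
    depth_f = sparse_depth.astype(np.float32)

    # Concatenate along the channel axis: 3 (RGB) + 1 (edge) + 1 (depth).
    return np.concatenate(
        [rgb_f, edges[..., None], depth_f[..., None]], axis=-1
    )
```

Concatenation at the input is only one option; Table 4 below examines the effect of injecting the edge map at deeper layers instead.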
Figure 4. Experimental results on the KITTI DC dataset.
Figure 5. Experimental results on the NYU Depth V2 dataset.
Figure 6. Training and validation loss function curves on KITTI.
Figure 7. Training and validation loss function curves on NYU V2.
Table 1. Quantitative evaluation on KITTI DC and NYUv2.

| Method | KITTI DC RMSE↓ (mm) | KITTI DC MAE↓ (mm) | KITTI DC iRMSE↓ (1/km) | KITTI DC iMAE↓ (1/km) | KITTI DC SSIM↑ | KITTI DC LPIPS↓ | NYUv2 RMSE↓ (m) | NYUv2 REL↓ |
|---|---|---|---|---|---|---|---|---|
| CSPN [23] | 1019.64 | 279.46 | 2.93 | 1.15 | 0.794 | 0.162 | 0.117 | 0.016 |
| CSPN++ [24] | 743.69 | 209.28 | 2.07 | 0.90 | 0.896 | 0.132 | - | - |
| Sparse-to-dense [7] | 814.73 | 249.95 | 2.80 | 1.21 | 0.864 | 0.101 | 0.230 | 0.044 |
| FCFR [25] | 735.81 | 217.15 | 2.20 | 0.98 | 0.913 | 0.090 | 0.106 | 0.015 |
| GuideNet [26] | 736.24 | 218.83 | 2.25 | 0.99 | 0.911 | 0.089 | 0.101 | 0.015 |
| ACMNet [27] | 744.91 | 206.09 | 2.08 | 0.90 | 0.895 | 0.098 | 0.105 | 0.015 |
| DeepLiDAR [28] | 758.38 | 226.50 | 2.56 | 1.15 | 0.875 | 0.115 | 0.115 | 0.022 |
| RigNet [29] | 713.44 | 204.55 | 2.16 | 0.92 | 0.922 | 0.078 | 0.090 | 0.013 |
| SDformer [12] | 809.78 | 222.32 | 2.32 | 0.93 | 0.905 | 0.095 | - | - |
| GuideFormer [10] | 721.48 | 207.76 | 2.14 | 0.97 | 0.917 | 0.086 | - | - |
| CompletionFormer [11] | 708.87 | 203.45 | 2.01 | 0.88 | 0.925 | 0.073 | 0.090 | 0.012 |
| Ours | 708.56 | 202.56 | 2.04 | 0.88 | 0.931 | 0.075 | 0.092 | 0.012 |

Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
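The KITTI DC metrics reported in Table 1 (RMSE and MAE in mm, iRMSE and iMAE in 1/km) and the NYUv2 REL metric can be computed as in the sketch below. This is only a reference implementation under the assumption that depth maps are given in metres and that invalid ground-truth pixels are zero; it is not the official evaluation script.

```python
import numpy as np

def depth_metrics(pred_m, gt_m):
    """RMSE/MAE in mm, iRMSE/iMAE in 1/km, and REL, over valid ground-truth pixels.

    pred_m, gt_m : H x W depth maps in metres; gt_m == 0 marks invalid pixels.
    """
    valid = gt_m > 0
    pred = np.maximum(pred_m[valid], 1e-6)   # guard against division by zero
    gt = gt_m[valid]

    diff_mm = (pred - gt) * 1000.0                 # metres -> millimetres
    inv_diff_km = 1000.0 / pred - 1000.0 / gt      # inverse depth, 1/m -> 1/km

    return {
        "RMSE_mm": float(np.sqrt(np.mean(diff_mm ** 2))),
        "MAE_mm": float(np.mean(np.abs(diff_mm))),
        "iRMSE_1/km": float(np.sqrt(np.mean(inv_diff_km ** 2))),
        "iMAE_1/km": float(np.mean(np.abs(inv_diff_km))),
        "REL": float(np.mean(np.abs(pred - gt) / gt)),
    }
```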
Table 2. Performance comparison of different hybrid CNN–Transformer blocks.

| Block | RMSE↓ (mm) | MAE↓ (mm) | FLOPs↓ (G) | FPS↑ |
|---|---|---|---|---|
| BoTNet [30] | 88.5 | 34.9 | 368.2 | 11 |
| Conformer [31] | 88.2 | 35.1 | 366.3 | 13 |
| Squeezeformer [32] | 87.9 | 34.8 | 365.7 | 12 |
| HybridFormer [33] | 87.4 | 34.1 | 366.2 | 13 |
| SADRT (Ours) | 86.3 | 34.1 | 364.5 | 15 |

Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
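The FPS figures in Tables 2 and 5 depend on hardware and measurement protocol. As a hedged example, the sketch below times repeated forward passes of a PyTorch model on a GPU; the input shape (a 5-channel, KITTI-sized tensor), warm-up count, and iteration count are assumptions, and FLOPs would additionally require a counter such as fvcore's FlopCountAnalysis or thop.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_shape=(1, 5, 352, 1216), n_warmup=10, n_iters=100):
    """Average frames per second for single-batch forward passes on a CUDA device."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    for _ in range(n_warmup):          # warm-up to stabilise GPU clocks and caches
        model(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()           # wait for all kernels before stopping the clock
    elapsed = time.perf_counter() - start

    return n_iters / elapsed
```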
Table 3. Quantitative results on KITTI DC for ablation study.

| Method | Edge Image | EM Loss | SADRT Block | RMSE↓ (mm) | MAE↓ (mm) | Params↓ (M) | FLOPs↓ (G) |
|---|---|---|---|---|---|---|---|
| Backbone Only | | | | 90.3 | 35.1 | 67.8 | 374.7 |
| A | | | | 90.1 | 35.0 | 68.9 | 383.6 |
| B | | | | 90.0 | 35.0 | 69.1 | 394.5 |
| C | | | | 90.1 | 34.9 | 55.4 | 359.9 |
| D | | | | 89.7 | 34.9 | 69.5 | 389.6 |
| E | | | | 88.5 | 34.6 | 62.1 | 362.7 |
| F | | | | 87.9 | 34.5 | 66.3 | 361.9 |
| G | | | | 86.3 | 34.1 | 67.5 | 364.5 |

Note: Optimal results for each indicator are in bold. ↓ indicates that a smaller value is better.
Table 4. Ablation results for the position at which the edge image is introduced.

| Method | Location of Edge Image | RMSE↓ (mm) | MAE↓ (mm) |
|---|---|---|---|
| A | Encoder Layer 1 | 722.41 | 204.28 |
| B | Encoder Layer 4 | 708.56 | 202.56 |
| C | Decoder Layer 1 | 715.23 | 201.87 |
| D | Decoder Layer 4 | 725.32 | 203.16 |

Note: Optimal results for each indicator are in bold. ↓ indicates that a smaller value is better.
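Table 4 studies where the edge image is injected. The fragment below is a purely illustrative sketch of one way to fuse a single-channel edge map with the features of a chosen encoder or decoder layer, via bilinear resizing, channel concatenation, and a 1×1 projection; the layer naming and the actual fusion module in EDRNet may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeFusion(nn.Module):
    """Fuse a single-channel edge map into a feature map of a chosen layer."""

    def __init__(self, feat_channels):
        super().__init__()
        # 1x1 convolution projects the concatenated tensor back to feat_channels.
        self.proj = nn.Conv2d(feat_channels + 1, feat_channels, kernel_size=1)

    def forward(self, feat, edge):
        # Resize the (N, 1, H, W) edge map to the spatial size of the features.
        edge_resized = F.interpolate(edge, size=feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
        return self.proj(torch.cat([feat, edge_resized], dim=1))
```

Under this reading, the best-performing setting in Table 4 (Encoder Layer 4) would correspond to applying such a module to the deepest encoder features.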
Table 5. Comparison of common backbone results.

| Backbone | RMSE↓ (mm) | MAE↓ (mm) | FLOPs↓ (G) | FPS↑ |
|---|---|---|---|---|
| Swin-Tiny [34] | 92.6 | 36.4 | 634.8 | 7 |
| ResNet34 [35] | 91.4 | 35.5 | 582.1 | 7 |
| PVT-Large [36] | 91.4 | 35.6 | 419.8 | 9 |
| MPViT-Base [37] | 91.0 | 35.5 | 1259.3 | 3 |
| CompletionFormer Tiny [11] | 90.9 | 35.3 | 389.4 | 9 |
| Ours | 90.3 | 35.1 | 374.7 | 15 |

Note: Optimal results for each indicator are in bold. ↑ indicates that a larger value is better and ↓ indicates that a smaller value is better.
Table 6. Quantitative comparison of loss functions on KITTI DC.

| Method | RMSE↓ (mm) | MAE↓ (mm) | iRMSE↓ (1/km) | iMAE↓ (1/km) |
|---|---|---|---|---|
| L1 Loss [38] | 712.35 | 204.87 | 2.11 | 1.02 |
| Smooth L1 Loss [39] | 710.23 | 204.34 | 2.06 | 0.93 |
| Hybrid Loss [40] | 709.13 | 203.78 | 2.02 | 0.88 |
| Ours | 708.56 | 202.56 | 2.04 | 0.88 |

Note: Optimal results for each indicator are in bold. ↓ indicates that a smaller value is better.
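The baselines in Table 6 are standard regression losses. The sketch below shows masked L1 and Smooth L1 losses evaluated only on pixels with valid ground truth, in the spirit of [38,39]; it does not include the edge-aware penalty of the proposed loss, and the beta value is an assumption.

```python
import torch
import torch.nn.functional as F

def masked_l1(pred, gt):
    """Mean absolute error over pixels with valid ground truth (gt > 0)."""
    mask = gt > 0
    return torch.abs(pred[mask] - gt[mask]).mean()

def masked_smooth_l1(pred, gt, beta=1.0):
    """Huber-style Smooth L1 loss over valid pixels (PyTorch >= 1.6 for beta)."""
    mask = gt > 0
    return F.smooth_l1_loss(pred[mask], gt[mask], beta=beta)
```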
Table 7. Comparison of evaluation metrics on KITTI DC with different parameter configurations.

| Method | α | β | γ | δ | RMSE↓ (mm) | MAE↓ (mm) | iRMSE↓ (1/km) | iMAE↓ (1/km) |
|---|---|---|---|---|---|---|---|---|
| A | 0 | 0.2 | 0.2 | 0.3 | 713.23 | 205.79 | 2.23 | 0.95 |
| B | 0.3 | 0 | 0.2 | 0.3 | 709.44 | 203.44 | 2.09 | 0.91 |
| C | 0.3 | 0.2 | 0 | 0.3 | 708.85 | 202.93 | 2.03 | 0.89 |
| D | 0.3 | 0.2 | 0.2 | 0 | 710.87 | 204.01 | 2.11 | 0.98 |
| Ours | 0.3 | 0.2 | 0.2 | 0.3 | 708.56 | 202.56 | 2.04 | 0.88 |

Note: Optimal results for each indicator are in bold. ↓ indicates that a smaller value is better.
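If α, β, γ, and δ are the weights of the individual terms of the training objective (the rows of Table 7 zero them out one at a time, as is typical for loss-weight ablations), the total loss is assembled as a simple weighted sum. The snippet below is only an illustration with placeholder term names; the actual components are those defined in the paper's loss formulation.

```python
def total_loss(term_a, term_b, term_c, term_d,
               alpha=0.3, beta=0.2, gamma=0.2, delta=0.3):
    """Weighted sum of four loss terms; the default weights match the 'Ours' row
    of Table 7, and the term names are placeholders, not the paper's notation."""
    return alpha * term_a + beta * term_b + gamma * term_c + delta * term_d
```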
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
