Article

Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion

Hang Li, Shuai Liu, Bin Wang and Yuanhao Wu
1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Dong Nanhu Road 3888, Changchun 130033, China
2 Navigation College, Dalian Maritime University, Linghai Road 1, Dalian 116026, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5833; https://doi.org/10.3390/app14135833
Submission received: 3 June 2024 / Revised: 28 June 2024 / Accepted: 1 July 2024 / Published: 3 July 2024

Abstract

Depth estimation represents a prevalent research focus within the realm of computer vision. Existing depth estimation methodologies utilizing LiDAR (Light Detection and Ranging) technology typically obtain sparse depth data and are associated with elevated hardware expenses. Multi-view image-matching techniques necessitate prior knowledge of camera intrinsic parameters and frequently encounter challenges such as depth inconsistency, loss of details, and the blurring of edges. To tackle these challenges, the present study introduces a monocular depth estimation approach based on an end-to-end convolutional neural network. Specifically, a DNET backbone has been developed, incorporating dilated convolution and feature fusion mechanisms within the network architecture. By integrating semantic information from various receptive fields and levels, the model’s capacity for feature extraction is augmented, thereby enhancing its sensitivity to nuanced depth variations within the image. Furthermore, we introduce a loss function optimization algorithm specifically designed to address class imbalance, thereby enhancing the overall predictive accuracy of the model. Training and validation conducted on the NYU Depth-v2 (New York University Depth Dataset Version 2) and KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) datasets demonstrate that our approach outperforms other algorithms in terms of various evaluation metrics.

1. Introduction

The issue of depth estimation in images is a prominent subject within the realm of computer vision [1,2]. Traditional methods [3] for depth estimation primarily rely on LiDAR technology to acquire depth data. However, this approach is associated with high costs, offers limited depth information, and necessitates stringent environmental conditions [4]. Existing methods for depth estimation are categorized into multi-view depth estimation and monocular depth estimation, depending on the quantity of input images [5]. Binocular depth estimation techniques ascertain depth information through the pairing and alignment of multiple input images, necessitating prior knowledge of camera parameters. The outcomes are significantly impacted by the selection of feature points. Monocular depth estimation [6], which involves predicting depth information from a single input image, is widely preferred for its simplicity. Nevertheless, the estimation results frequently prove unsatisfactory due to the absence of supplementary depth sources and parallax information. In recent years, researchers have started to investigate the application of deep learning methods to tackle the challenge of monocular depth estimation [7], inspired by the remarkable advancements of deep learning technology in areas such as object detection and semantic segmentation.
In 2014, Eigen et al. [8] introduced a multi-scale depth network aimed at enhancing the precision and robustness of monocular depth estimation. This network forecasts global depth information through a coarse-grained network branch, captures local detail information via a fine-grained network branch, and subsequently integrates the outcomes. In 2015, Liu et al. [9] introduced a new architecture known as the Deep Convolutional Neural Field (DCNF), which integrates Deep Convolutional Neural Networks (DCNNs) with Conditional Random Fields (CRFs) [10]. This approach does not necessitate the use of geometric priors or any additional information. Instead, it precisely estimates depth information from a single image by acquiring unary and pairwise potentials of the random field within the convolutional neural network. In 2018, Fu et al. [11] introduced the concept of Depth-Ordered Regression. By incorporating a Spacing-Increasing Discretization (SID) strategy for the discretization of depth values, the researchers transformed the training of depth networks into a structured regression task. Atrous Spatial Pyramid Pooling (ASPP) was additionally employed to augment the network’s feature extraction capabilities. This approach offers a novel method for monocular depth estimation tasks. In 2020, a method was proposed by Huynh et al. [12] to guide depth estimation towards favoring indoor environmental plane structures. Non-local coplanarity constraints were incorporated into the network, along with the utilization of a novel attention mechanism known as the Depth Attention Volume (DAV). This approach yielded favorable outcomes on indoor datasets like the NYU Depth Dataset [13] and the ScanNet Dataset [14]. In 2023, Ji et al. [15] proposed a simple, efficient, yet powerful framework for dense visual predictions based on a conditional diffusion pipeline. Their approach follows a “noise-to-map” generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. In contrast to conventional monocular image depth estimation algorithms, the methods mentioned above have demonstrated notable success due to the utilization of extensive datasets and the robust feature extraction capabilities inherent in convolutional neural networks. Nevertheless, they encounter challenges related to the unified optimization of models, indistinct boundaries, and incomplete details. A summary of monocular depth estimation algorithms is shown in Table 1.
To tackle the aforementioned issues, this study introduces a monocular depth estimation approach that relies on an end-to-end convolutional neural network. The methodology comprises three components: a feature extraction module, a feature fusion module, and a depth perception module. In the feature extraction stage, a Dense Convolutional Network (DNET) is employed for extracting features from the input image. This approach helps mitigate the issue of gradient vanishing and improves feature propagation. In the feature fusion stage, dilated convolutions are employed to extract and integrate feature maps with varying receptive fields. This augmentation enhances the effective receptive field of the convolution kernels without escalating the parameter count, enabling the network to capture more extensive contextual semantic information. During the depth estimation stage, the monocular depth estimation problem is converted into an ordinal multi-class classification problem, and a model tailored for ordinal multi-class problems maps the input data to depth data. In addition, a loss function optimization algorithm targeting class imbalance is proposed, which improves the model’s prediction accuracy.
The paper presents several key contributions as follows:
(1) A novel convolutional neural network (CNN) model, designated DNET, is proposed. This model exhibits several advantageous characteristics, including dense connectivity, effective gradient propagation, efficient parameter sharing, and the suppression of feature sparsity.
(2) A feature fusion module is introduced, which is based on dilated convolution. This module enhances the effective receptive field of convolutional kernels without increasing the network parameters, thereby enabling the extraction of more comprehensive contextual semantic information.
(3) An optimization algorithm designed for addressing class imbalance in the loss function establishes a correlation between depth categories and continuous depth, thereby enhancing the accuracy of depth estimation within the network model.

2. Related Works

2.1. Dense Convolutional Networks

To attain favorable outcomes in computer vision tasks, there is a trend towards designing networks with greater depth. Nonetheless, this phenomenon has resulted in a challenge referred to as the vanishing gradient problem. In response to this concern, Gao Huang and collaborators introduced an innovative convolutional network structure named DenseNet [17] (Densely Connected Convolutional Networks), described in Figure 1, in 2017. By integrating dense convolutional modules into conventional CNNs, DenseNet improves the feature extraction capacities of deep networks and effectively addresses the issue of vanishing gradients that arise during backpropagation.
Traditional convolutional neural networks (CNNs) receive input only from the feature maps of the preceding layer. Inspired by the ResNet [18] architecture, each layer in DenseNet receives the outputs of all preceding layers as input. This practice enhances the feature extraction capabilities of the network and mitigates challenges such as gradient vanishing and exploding.
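For readers who prefer code, the dense-connectivity idea can be sketched in a few lines of PyTorch. The layer count, channel widths, and growth rate below are illustrative assumptions, not values taken from DenseNet or from this paper.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: 1x1 bottleneck + 3x3 conv, output concatenated with its input."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, 4 * growth_rate, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth_rate),
            nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth_rate, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # Channel-wise concatenation: every layer sees the features of all preceding layers.
        return torch.cat([x, self.body(x)], dim=1)

class DenseBlock(nn.Sequential):
    def __init__(self, num_layers, in_channels, growth_rate):
        layers = [DenseLayer(in_channels + i * growth_rate, growth_rate) for i in range(num_layers)]
        super().__init__(*layers)

# Example: a 4-layer dense block on a 64-channel feature map.
x = torch.randn(1, 64, 32, 32)
block = DenseBlock(num_layers=4, in_channels=64, growth_rate=32)
print(block(x).shape)  # torch.Size([1, 192, 32, 32]) -> 64 + 4 * 32 channels
```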

2.2. Dilated Convolution

Dilated convolution, also referred to as atrous convolution [19], was introduced during the International Conference on Learning Representations (ICLR) in 2016 as a solution to the problem of feature information loss induced by pooling layers in image segmentation algorithms [20].
In image segmentation tasks, the spatial scale of the network’s output must match that of its input. Following the application of multiple pooling layers, the resolution of feature maps diminishes, necessitating upsampling to align with the spatial dimensions of the input image. This process results in a partial loss of feature information. In contrast to conventional convolution, dilated convolution enlarges the effective kernel by inserting gaps between the kernel elements used in the convolution operation. This expansion enhances the receptive field without necessitating an increase in the number of parameters.
Figure 2 illustrates the schematic diagrams depicting traditional convolution [21] and dilated convolution. In conventional convolution operations, the connections are dense, with each element of the convolution kernel interacting with adjacent elements on the input feature map. Dilated convolution, in contrast, introduces “gaps” between the elements of the convolution kernel, enabling each pixel of the kernel to interact with spaced elements on the input feature map. This spaced sampling expands the “receptive field” of the convolutional kernel without augmenting the parameter count.
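The trade-off described above is easy to verify directly: a dilated 3 × 3 convolution has exactly the same number of weights as a standard one, yet covers a larger area of the input. The following PyTorch snippet, with arbitrary channel counts chosen only for illustration, shows this.

```python
import torch
import torch.nn as nn

# Standard vs. dilated 3x3 convolutions. Both layers have the same number of parameters
# (9 weights per input/output channel pair), but the dilated kernel samples the input with gaps.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1, dilation=1, bias=False)
dilated  = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2, bias=False)

x = torch.randn(1, 1, 16, 16)
print(standard(x).shape, dilated(x).shape)            # both: torch.Size([1, 1, 16, 16])
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))   # 9 9

# Effective kernel size of a 3x3 kernel with dilation rate r: k_eff = r * (3 - 1) + 1,
# i.e., 3 for r=1, 5 for r=2, 13 for r=6.
for r in (1, 2, 6):
    print(r, r * (3 - 1) + 1)
```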

2.3. Feature Pyramid Network

The Feature Pyramid Network (FPN) is an innovative network architecture introduced by Lin et al. [22] in 2017 with the aim of enhancing the detection performance of current object detection models, focusing on the detection of small objects. An increase in the number of network layers has been shown to enhance detection accuracy. Nevertheless, the performance in detecting small objects remains inadequate. The primary factor is the growing number of pooling operations due to the deepening of network layers, which results in the loss of feature information from small objects. This ultimately affects the network model’s ability to effectively detect small objects. Figure 3 is the architecture of FPN.
The primary concept behind the FPN involves the construction of a feature pyramid aimed at extracting multi-scale feature information from images. This approach integrates features from various scales to enhance object detection capabilities. By integrating low-level features with high-level features, the resultant high-level feature maps obtained after downsampling preserve certain low-level feature details, consequently enhancing the network model’s capacity to identify small objects. This methodology has been implemented in various traditional object detection networks, offering valuable insights for multi-scale object detection tasks.
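A minimal sketch of the FPN top-down pathway with lateral connections is given below; the channel sizes and upsampling mode are illustrative assumptions rather than the configuration used in [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal FPN-style top-down pathway with 1x1 lateral connections (illustrative channel sizes)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c3, c4, c5):
        # Top-down: upsample the coarse map and add the lateral projection of the finer map.
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]

c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in ((256, 80), (512, 40), (1024, 20)))
for p in TinyFPN()(c3, c4, c5):
    print(p.shape)   # three 256-channel maps at 80x80, 40x40, 20x20
```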

3. Method Design and Implementation

The network architecture proposed in this study, utilizing dilated convolutions and feature fusion, is illustrated in Figure 4. The system primarily comprises three key components: the feature extraction module, the multi-receptive field feature fusion module, and the depth perception module. The feature extraction module is tasked with the initial extraction of features from the input image data to acquire preliminary feature information. The information is subsequently processed by the multi-receptive field feature fusion module to produce enhanced feature representations. Finally, the feature information is transmitted to the depth perception module, where a correlation is established between the feature maps and depth categories, culminating in the generation of the image’s depth estimation result.

3.1. Backbone of the Feature Extraction Modules

Drawing upon the concepts of the DenseNet architecture, the backbone of the feature extraction network, denoted as DNET in Figure 5, is composed of dense blocks, transition layers, and convolutional pooling layers.
This study incorporates dense blocks into the conventional CNN architecture. As illustrated in Table 2, the network receives image data with dimensions of 320 × 240 × 3 as input. After traversing the CONV1 layer, comprising convolution and pooling operations, the result is a feature map of size 160 × 120 × 64. Subsequently, it passes through the DEN1 dense block. The dense block DEN1 is composed of conventional 1 × 1 and 3 × 3 convolutions integrated with channel-wise concatenation operations: the input feature map is subjected to 1 × 1 and 3 × 3 convolution operations, and the resulting feature maps are then merged with the input feature map at the channel level to generate the final feature map. This process is iterated twice, leading to the generation of the ultimate feature map following the DEN1 module.
Subsequently, the feature map undergoes processing by the TRAN1 transition layer, which comprises a 1 × 1 convolution and a 2 × 2 average pooling operation. The feature map of size 160 × 120 × 256, following processing by the transition layer TRAN1, is transformed into a feature map of dimensions 80 × 60 × 128. Then, the feature map is forwarded through the DEN2 dense block, resulting in an 80 × 60 × 512 feature map. Subsequently, this feature map undergoes processing by the TRAN2 transition layer, yielding a 40 × 30 × 256 feature map. Finally, the feature map undergoes processing in the DEN3 dense block, leading to the generation of a 40 × 30 × 1024 feature map.
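The stage sizes above can be mirrored in a compact PyTorch skeleton. The snippet below is a simplified sketch that reproduces the channel counts and spatial sizes of Table 2; the growth rate, activation placement, and other hyperparameters are assumptions rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class DenseStage(nn.Module):
    """A DEN block: repeated 1x1 + 3x3 convolutions whose outputs are concatenated with their input."""
    def __init__(self, in_ch, out_ch, num_pairs):
        super().__init__()
        growth = (out_ch - in_ch) // num_pairs          # assumed uniform growth per pair
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_pairs):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 1, bias=False), nn.ReLU(inplace=True),
                nn.Conv2d(growth, growth, 3, padding=1, bias=False), nn.ReLU(inplace=True)))
            ch += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)          # channel-level concatenation
        return x

class Transition(nn.Module):
    """A TRAN layer: 1x1 conv to reduce channels, followed by 2x2 pooling to halve the resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):
        return self.pool(self.conv(x))

class DNetSketch(nn.Module):
    """Backbone skeleton following the stage sizes of Table 2 (illustrative, not the released model)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                       # CONV1: 7x7 conv stride 2 + 3x3 max pool
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=1, padding=1))
        self.den1, self.tran1 = DenseStage(64, 256, 3), Transition(256, 128)
        self.den2, self.tran2 = DenseStage(128, 512, 6), Transition(512, 256)
        self.den3 = DenseStage(256, 1024, 12)

    def forward(self, x):
        x = self.stem(x)
        x = self.tran1(self.den1(x))
        x = self.tran2(self.den2(x))
        return self.den3(x)

out = DNetSketch()(torch.randn(1, 3, 240, 320))
print(out.shape)  # torch.Size([1, 1024, 30, 40]) -- matches the 40 x 30 x 1024 map described above
```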
The design of the DEN dense blocks in the network structure enables the connection of each layer’s feature information to all subsequent feature layers through “shortcut connections”. High-level feature maps have the capability to readily access the information that has been extracted by low-level feature maps. This enables each layer to directly tap into feature maps from all preceding layers. This contributes to the efficiency of reusing features, facilitates gradient propagation, and enhances the overall performance of the model. The configuration of dense blocks serves to elevate the model’s utilization rate of features, diminish the parameter count, and bolster the transmission of features and information flow throughout the network.

3.2. Multi-Receptive Field Feature Fusion

The design of the multi-receptive field feature fusion module draws inspiration from the FPN architecture commonly used in object detection tasks. FPN enhances the detection accuracy of small objects through the integration of feature maps from various scales at the channel level. The multi-receptive field feature fusion module proposed in this study comprises two components: dilated convolution and channel-level feature fusion. By enlarging the effective receptive field of the convolution kernel and creating a multi-scale feature representation pyramid, this module improves the network’s capacity to capture contextual semantic information and its sensitivity to detailed features.
The specific structure is depicted in Figure 6. First, the input feature map χ undergoes convolution by applying a 1 × 1 convolution kernel, resulting in the feature map χ1. Next, the input feature map χ undergoes convolution by applying a 3 × 3 dilated convolution kernel with a dilation rate of 6, resulting in the feature map χ2. Then, a 3 × 3 dilated convolution kernel with a dilation rate of 12 is applied to χ to obtain the feature map χ3. Likewise, a 3 × 3 dilated convolution kernel with a dilation rate of 18 is applied to the input feature map χ to generate the output feature map χ4. The feature maps χ1, χ2, χ3, and χ4 are subsequently concatenated at the channel level to produce the feature map γ. Finally, the feature map γ is subjected to 1 × 1 convolution and upsampling operations to generate a prediction outcome that aligns with the dimensions of the input image data. In this procedure, the 1 × 1 convolutional operation modifies the channel dimensions of the resulting feature map to align with the predetermined number of predicted classes, which is established as 240 in this study. The process of upsampling is implemented to guarantee that the resultant feature map aligns with the dimensions of the input image, which have been specified as 640 × 480 in this study. The convolutions utilized in the multi-receptive field feature fusion module are exclusively dilated convolutions. Dilated convolutions enhance the effective receptive field of the convolutional kernel compared to standard convolutions by employing varying dilation rates. This approach reduces the parameter count while preserving resolution.
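A possible PyTorch rendering of this fusion module is sketched below. The dilation rates (6, 12, 18), the 1 × 1 branch, the channel-level concatenation, the projection to 240 classes, and the upsampling to 640 × 480 follow the description above; the per-branch width of 256 channels is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiReceptiveFieldFusion(nn.Module):
    """Sketch of the fusion module of Figure 6: parallel 1x1 and dilated 3x3 branches,
    channel concatenation, 1x1 projection to depth classes, and upsampling."""
    def __init__(self, in_ch=1024, branch_ch=256, num_classes=240, out_size=(480, 640)):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, branch_ch, 1)
        self.branch2 = nn.Conv2d(in_ch, branch_ch, 3, padding=6, dilation=6)
        self.branch3 = nn.Conv2d(in_ch, branch_ch, 3, padding=12, dilation=12)
        self.branch4 = nn.Conv2d(in_ch, branch_ch, 3, padding=18, dilation=18)
        self.project = nn.Conv2d(4 * branch_ch, num_classes, 1)
        self.out_size = out_size

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        fused = torch.cat(feats, dim=1)                       # chi1..chi4 -> gamma
        logits = self.project(fused)                          # one channel per depth category
        return F.interpolate(logits, size=self.out_size, mode="bilinear", align_corners=False)

x = torch.randn(1, 1024, 30, 40)                              # backbone output from Section 3.1
print(MultiReceptiveFieldFusion()(x).shape)                   # torch.Size([1, 240, 480, 640])
```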
During the feature fusion stage, the feature maps with varying receptive fields are combined with the original feature map at the channel level. This process results in a feature map that encompasses both low-level detailed features and high-level contour information. This augmentation enhances the network model’s capability to extract features from targets of varying scales and increases its sensitivity to intricate information.

3.3. Depth Perception

The module concerning depth perception represents the ultimate component within the network architecture discussed in this paper. It plays a primary role in establishing the mapping relationship between the feature maps and pixel-level depth categories. The specific relationship is illustrated in Figure 7.
Most monocular image depth estimation algorithms that rely on deep learning transform the depth estimation problem into a regression problem. These methodologies treat depth information in images as a continuous variable and employ appropriate regression models to forecast the depth values of pixels, thereby producing the ultimate depth estimation outcomes. By examining experimental findings from reputable studies in the contemporary domain of monocular depth estimation, this paper suggests that the task of depth estimation becomes more challenging when aiming to predict continuous depth values. This paper simplifies the depth estimation issue by transforming the depth regression problem into a depth classification problem, considering the limited sensitivity of the human eye to subtle changes in depth information. By discretizing continuous depth values, distinct depth intervals are assigned to specific classes, thereby creating an association between continuous depth values and depth categories. Ideally, the depth intervals corresponding to different depth categories should be equal. Given that variations in depth should be more acceptable in depth intervals with greater depth values, we employ the magnitude of the depth values as the criterion for segmenting the depth interval ranges. The specific mapping model is illustrated in Figure 8.
In Figure 8, we segment the continuous depth values ranging from 0 to 25 m into six depth intervals, each representing a distinct depth category. By modifying the span of each depth interval based on the depth values’ magnitude, a greater tolerance for depth discrepancies is attained in intervals characterized by larger depth values in contrast to those with smaller depth values. It should be noted that Figure 8 is a schematic diagram of the mapping relationship. The actual number of depth intervals and the values of the depth range are determined by different datasets as detailed in the experimental section. The formula illustrating the mapping relationship between continuous depth values and depth categories is presented in Formulas (1) and (2):
c = K \sqrt{\dfrac{d - \alpha}{\beta - \alpha}} \quad (1)
d = \alpha + (\beta - \alpha) \left( \dfrac{c}{K} \right)^{2} \quad (2)
Here, K denotes the number of depth categories, d represents the continuous depth value, c indicates the depth category, α stands for the starting depth of the continuous depth values, and β represents the ending depth of the continuous depth values.
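The quadratic spacing implied by Formula (2) can be illustrated numerically. The sketch below uses the six-interval, 0–25 m example of Figure 8; the category assignment in Formula (1) is interpreted here as c = K·sqrt((d − α)/(β − α)) with flooring to an integer bin, which should be read as an assumption rather than the authors’ exact implementation.

```python
import numpy as np

def category_to_depth(c, K, alpha, beta):
    # Formula (2): depth grows quadratically with the category index, so intervals widen with depth.
    return alpha + (beta - alpha) * (c / K) ** 2

def depth_to_category(d, K, alpha, beta):
    # Assumed inverse of Formula (2), floored to an integer depth category in [0, K).
    c = K * np.sqrt((d - alpha) / (beta - alpha))
    return np.clip(np.floor(c).astype(int), 0, K - 1)

K, alpha, beta = 6, 0.0, 25.0                       # the six-interval example of Figure 8
edges = category_to_depth(np.arange(K + 1), K, alpha, beta)
print(np.round(edges, 2))                           # [ 0.    0.69  2.78  6.25 11.11 17.36 25.  ]
print(depth_to_category(np.array([0.5, 5.0, 20.0]), K, alpha, beta))  # [0 2 5]
```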

3.4. Loss Function

The conventional method for addressing multi-class classification problems commonly involves the utilization of the cross-entropy loss function to quantify the disparity between predicted outcomes and true labels as demonstrated explicitly in Formula (3):
L_{CE} = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c} \quad (3)
In the given context, L_{CE} represents the cross-entropy loss, N denotes the total number of samples, and C represents the total number of classes. The indicator y_{i,c} takes the value 1 if sample i belongs to the true class c and 0 otherwise, while p_{i,c} signifies the probability assigned by the model to sample i belonging to class c.
The label categories utilized for the multi-class problem as described in Formula (3) are discrete and independent. The depth classification issue examined in this paper necessitates the consideration of both the imbalanced distribution of depth categories and the ordinal relationships among these categories. The handling of class imbalance has been introduced in Section 3.3. For example, when the true depth value lies within the range of [10, 20], a forecasted depth range of [5, 10] is more proximate to the actual label compared to [1, 2]. Consequently, as the predicted category approaches the actual depth category, the associated penalty should decrease. When formulating the loss function, it is essential to thoroughly contemplate the influence of this matter on the overall optimization of the model. Based on the aforementioned considerations, the imbalance-aware ordinal loss function proposed in this study is presented in Formula (4):
L(w, h, \chi, \Theta) = -\sum_{k=0}^{l_{w,h}-1} \left(1 - P_{w,h}^{k}\right)^{\gamma} \log P_{w,h}^{k} - \sum_{k=l_{w,h}}^{K-1} \left(P_{w,h}^{k}\right)^{\gamma} \log\left(1 - P_{w,h}^{k}\right), \quad P_{w,h}^{k} = P\left(\hat{l}_{w,h} > k \mid \chi, \Theta\right) \quad (4)
In Formula (4), \chi represents the feature map corresponding to the input image I; w and h denote the horizontal and vertical coordinates of a pixel; l_{w,h} denotes the ground-truth depth category; P_{w,h}^{k} is the predicted probability that the depth category of the pixel exceeds k; \hat{l}_{w,h} is the predicted depth category; and \gamma is the focusing exponent that down-weights well-classified categories to mitigate class imbalance.
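A sketch of such an imbalance-aware ordinal loss is given below. It combines the ordinal “greater than k” decomposition with a focal-style weighting term; the focusing value γ = 2 and the per-pixel flattening are assumptions, not settings reported in the paper.

```python
import torch

def ordinal_focal_loss(prob_gt, labels, gamma=2.0, eps=1e-6):
    """Sketch of the imbalance-aware ordinal loss of Formula (4).

    prob_gt: (N, K) tensor, prob_gt[:, k] = predicted P(depth category > k) per pixel.
    labels:  (N,) tensor of true depth categories l in [0, K).
    """
    N, K = prob_gt.shape
    ks = torch.arange(K, device=prob_gt.device).unsqueeze(0)     # (1, K)
    below = (ks < labels.unsqueeze(1)).float()                    # 1 where k < l
    p = prob_gt.clamp(eps, 1.0 - eps)
    # Focal-style weights down-weight bins the model already predicts well, easing the
    # imbalance between frequent (near) and rare (far) depth categories.
    term_below = (1.0 - p) ** gamma * torch.log(p)                # encourage P(>k) -> 1 for k < l
    term_above = p ** gamma * torch.log(1.0 - p)                  # encourage P(>k) -> 0 for k >= l
    loss = -(below * term_below + (1.0 - below) * term_above).sum(dim=1)
    return loss.mean()

# Toy usage: 4 pixels, 8 depth categories.
logits = torch.randn(4, 8)
prob_gt = torch.sigmoid(logits)           # one independent "greater than k" probability per bin
labels = torch.tensor([0, 2, 5, 7])
print(ordinal_focal_loss(prob_gt, labels))
```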

4. Experiments and Results

4.1. Experimental Dataset

To mitigate the substantial difference in depth range between indoor and outdoor environments in depth estimation tasks, we assessed the generalization capacity of our proposed neural network model by utilizing the NYU Depth V2 dataset (https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html#raw_parts, accessed on 5 February 2024) for indoor scenes and the KITTI [23] dataset (https://www.cvlibs.net/datasets/kitti/, accessed on 29 February 2024) for outdoor scenes.

4.1.1. NYU Depth-v2 Dataset

The NYU Depth-v2 Dataset is a frequently utilized dataset for indoor scenes, which was developed and launched by New York University in 2012, mainly intended for tasks related to depth estimation and semantic segmentation. This dataset represents an enhanced version of the NYU Depth Dataset, encompassing a broader range of scenes, improved quality annotations, and expanded data information. The NYU Depth-v2 Dataset comprises 120,000 raw training images. In the course of our experiments, we selected 20,000 images for use as the training set, 1000 images for the validation set, and 654 images for the test set. To maintain the detailed information, the images were not resized. The initial image data underwent preprocessing procedures, such as cropping and rotation, before being directly inputted into the network for training.
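As an illustration of such paired preprocessing, the snippet below applies the same random rotation and crop to an RGB image and its depth map; the angle range and crop size are assumed values, not those used in the paper.

```python
import random
import torch
import torchvision.transforms.functional as TF

def paired_augment(image, depth, angle_range=5.0, crop_size=(228, 304)):
    """Illustrative paired augmentation (rotation + random crop) for RGB/depth training pairs."""
    angle = random.uniform(-angle_range, angle_range)
    image, depth = TF.rotate(image, angle), TF.rotate(depth, angle)
    _, h, w = image.shape
    ch, cw = crop_size
    top, left = random.randint(0, h - ch), random.randint(0, w - cw)
    return TF.crop(image, top, left, ch, cw), TF.crop(depth, top, left, ch, cw)

img, dep = torch.rand(3, 240, 320), torch.rand(1, 240, 320)
img_a, dep_a = paired_augment(img, dep)
print(img_a.shape, dep_a.shape)   # torch.Size([3, 228, 304]) torch.Size([1, 228, 304])
```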

4.1.2. KITTI Dataset

The KITTI Dataset was founded in 2012 through a collaboration between the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTI-C) to support research in the area of autonomous driving. This dataset includes diverse sensor data types, including images, LiDAR point clouds, GPS, and IMU data. It functions as a fundamental dataset for computer vision applications such as 3D object detection and depth estimation. In the context of depth estimation tasks, the KITTI Dataset offers a collection of 90,000 RGB-D images. In order to ensure consistency with the size of the indoor scene dataset, this study randomly obtained 20,000 images from 90,000 raw images for use as the training set, 1000 images for use as the validation set, and 1000 images for use as the test set.

4.2. Evaluation Measures

For tasks related to depth estimation, the primary evaluation metrics used to assess the performance of models typically include RMSE, RMSE Log , Abs Rel, Sq Rel, and Accuracy. The formulas for these evaluation metrics are presented in the following Formula (5):
RMSE = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} \left(d_{i} - d_{i}^{*}\right)^{2}}
RMSE_{Log} = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} \left(\log d_{i} - \log d_{i}^{*}\right)^{2}}
Abs\,Rel = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{\left|d_{i} - d_{i}^{*}\right|}{d_{i}^{*}}
Sq\,Rel = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{\left(d_{i} - d_{i}^{*}\right)^{2}}{d_{i}^{*}}
Acc: \% \text{ of } d_{i} \text{ s.t. } \max\left(\dfrac{d_{i}}{d_{i}^{*}}, \dfrac{d_{i}^{*}}{d_{i}}\right) = \delta < thr \quad (5)
where N denotes the number of pixels in the image, d_i denotes the predicted depth value at pixel i, and d_i^* denotes the ground-truth depth value at pixel i. The threshold thr is typically set to 1.25, 1.25^2, or 1.25^3. The metrics RMSE, RMSE Log, Abs Rel, and Sq Rel assess the model’s performance by quantifying the discrepancy between the predicted and true values at each pixel; smaller values indicate better performance. Acc is the percentage of pixels for which the ratio between the predicted and ground-truth values falls within the defined threshold range; higher values indicate better performance. Table 3 provides a comparative analysis of the benefits and drawbacks associated with the various evaluation metrics.
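These metrics are straightforward to compute; the snippet below evaluates all of them over the valid (non-zero ground truth) pixels, a masking convention that is assumed here rather than stated in the paper.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth-estimation metrics of Formula (5), computed over pixels with ground truth."""
    pred, gt = np.asarray(pred, dtype=np.float64), np.asarray(gt, dtype=np.float64)
    mask = gt > 0                                    # ignore pixels without ground-truth depth
    d, d_star = pred[mask], gt[mask]
    ratio = np.maximum(d / d_star, d_star / d)
    return {
        "RMSE":        np.sqrt(np.mean((d - d_star) ** 2)),
        "RMSE_log":    np.sqrt(np.mean((np.log(d) - np.log(d_star)) ** 2)),
        "AbsRel":      np.mean(np.abs(d - d_star) / d_star),
        "SqRel":       np.mean((d - d_star) ** 2 / d_star),
        "d<1.25":      np.mean(ratio < 1.25),
        "d<1.25^2":    np.mean(ratio < 1.25 ** 2),
        "d<1.25^3":    np.mean(ratio < 1.25 ** 3),
    }

# Toy example with random positive depths.
rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(480, 640))
pred = gt * rng.uniform(0.9, 1.1, size=gt.shape)
print({k: round(v, 3) for k, v in depth_metrics(pred, gt).items()})
```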

4.3. Experimental Environment and Parameter Settings

All experiments were run with the PyTorch deep learning framework on a 64-bit Intel Core i9-9900K CPU (Santa Clara, CA, USA) at 3.60 GHz with 32 GB of memory and an NVIDIA RTX 2070 SUPER GPU (Santa Clara, CA, USA) under Windows 10 (Redmond, WA, USA). The network parameters were updated with the Stochastic Gradient Descent (SGD) optimization strategy, initialized with a learning rate of 0.0001 and a momentum value of 0.9. The network adopted a polynomial-decay learning rate schedule, with a batch size of three and 10,000 training epochs. The KITTI Dataset was divided into 80 depth intervals, while the NYU dataset was divided into 68 depth intervals.
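A minimal sketch of this training configuration is shown below; the decay power of 0.9 for the polynomial schedule and the placeholder model are assumptions.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# SGD (lr 1e-4, momentum 0.9) with a polynomial learning-rate decay, as described above.
model = nn.Conv2d(3, 240, 1)                        # placeholder for the full network
optimizer = optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

max_iter, power = 10_000, 0.9                       # assumed decay power
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power)

for it in range(3):                                 # training-loop skeleton (3 dummy steps)
    optimizer.zero_grad()
    loss = model(torch.randn(3, 3, 8, 8)).mean()    # stand-in for the ordinal loss of Formula (4)
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(it, scheduler.get_last_lr())
```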

4.4. Experimental Results and Analysis

4.4.1. Ablation Experiment

The ablation experiment was performed on the NYU Depth-v2 Dataset from New York University to evaluate the effects of various enhancement strategies on the model.
In the feature extraction phase of the algorithm, a feature extraction framework was developed utilizing the DNET backbone as a foundation. To assess its efficacy, we conducted a comparative analysis of the DNET backbone against the established ResNet50 [18] and DenseNet121 [17] backbones. Furthermore, the efficacy of the FPN module was corroborated. The experimental results are depicted in Figure 9.
The experimental results indicate that the approach utilizing the DNET backbone outperforms the ResNet50 and DenseNet121 backbones. In terms of the RMSE metric, the DNET backbone achieves a prediction loss of 0.481, outperforming ResNet50 with 0.663, DenseNet121 with 0.508, and DNET without FPN with 1.22. For Abs Rel, the DNET backbone achieves 0.192, surpassing the values of 0.294 for ResNet50, 0.228 for DenseNet121, and 0.352 for DNET without FPN. In terms of Sq Rel, the DNET backbone achieves 0.152, outperforming ResNet50 with 0.207, DenseNet121 with 0.187, and DNET without FPN with 0.289. This performance can be credited to the design of the DNET backbone, which enables the model to effectively capture nuanced and essential feature information across levels, from low-level to high-level feature maps.
The comparative outcomes of depth estimation between the proposed DNET backbone and the ResNet50 and DenseNet121 backbone are illustrated in Figure 10. In the near-field scenario of Scene 1, it is evident that the DNET backbone adeptly isolates the object situated at the lower-right corner of the table and captures the edge details of the chair, thereby producing a distinct outline of the object. In contrast, the ResNet50, DNET without FPN and DenseNet121 backbones suffer from the loss of pertinent information. In Scene 2, it is evident that the DNET backbone, characterized by dense convolutions, continues to exhibit significant feature extraction capabilities for remote scenes, thereby demonstrating a distinct superiority over the ResNet50, DNET without FPN and DenseNet121 backbones. The edge information of the distant bookshelf is accurately extracted, revealing a clear structure. In Scene 3, the DNET backbone successfully captures the internal structural information of the nearby chair, whereas the ResNet50, DNET without FPN and DenseNet121 backbones fail to retain this internal structural detail.
To validate the effectiveness of the improved loss function Formula (4), based on the DNET backbone, ablation experiments were performed on the NYU Depth-v2 Dataset. The results are depicted in Figure 11.
The improved loss function reduced the RMSE loss value from 0.443 to 0.408, a reduction of roughly 8%. The accuracy metric δ increased from 0.9094 to 0.9198, an improvement of about one percentage point. This enhancement can be credited to the refined loss function, which tackles the imbalance in class distribution among the samples. It establishes a correlation between depth categories and continuous depth values, leading to varied penalty terms for distinct depth intervals despite the same prediction deviation. This reduces the impact of distant scene samples on the training trajectory of the model, thus improving the predictive capacity of the network model in various depth scenarios.
Figure 12 presents the results of the ablation study conducted before and after implementing the improved loss function. In Scene 4, it is apparent that the algorithm employing the improved loss function successfully captures the depth of distant windows and objects on the table in the foreground. In contrast, the algorithm that relies on the original cross-entropy loss function fails to preserve these specific details. In Scene 5, during the process of depth estimation, the enhanced algorithm extracts more precise object edge details, outperforming the original algorithm. In Scene 6, the effects of the ablation experiment are notably prominent, especially in the upper-left and lower-right corners of the image. The enhanced algorithm, which considers class imbalance, dynamically adapts penalty terms across various categories. This approach ensures accurate estimation of depth for distant objects and improves estimation for nearby targets.

4.4.2. Results and Analyses

We conducted a comparative analysis of our proposed method against established depth estimation techniques and assessed the algorithm’s resilience by utilizing the NYU Depth-v2 Dataset, which comprises indoor scenes, and the KITTI Dataset, which includes outdoor scenes. It is noteworthy that the raw depth within the depth estimation dataset exhibits sparsity, a characteristic attributed to the constraints of depth cameras and LiDAR technology. The ground truth in the dataset was interpolated for visualization purposes, which is only used for comparative analysis and does not represent authentic training label data. The results are presented in Table 4.
The results demonstrate that the proposed method surpasses mainstream depth estimation methods such as Eigen, Make3D, DORN, BTS, and DDP on nearly all evaluation metrics on the NYU Depth-v2 Dataset. Relative to the Eigen, Make3D, DORN, and BTS baselines, the accuracy metrics improve by 5% to 50%, while the error metrics are reduced by 10.8% to 79%.
Figure 13 illustrates the depth estimation results of our approach on the NYU Depth-v2 Dataset. In Scene 7, our algorithm demonstrates a high level of accuracy in discerning the variations in depth of the box suspended on the wall and the chair. This is particularly evident in its ability to differentiate the distance between the backrest and the seat cushion of the chair from the perspective of the camera. Conversely, alternative algorithms may exhibit indistinct boundary information while processing the depth of the chair. In Scene 9 and Scene 13, it is evident that our method is capable of extracting edge features of small and delicate objects in the scene, such as chandeliers and projectors, enabling the distinction of depth variances between the foreground and background. In Scene 12, characterized by a complex scene depth with various objects placed on the table, our algorithm effectively categorizes the depths of objects in close proximity and those farther away. It successfully identifies clear boundaries and provides accurate depth information, a capability that sets it apart from other algorithms that struggle to differentiate between these depths. Notably, our method demonstrates the ability to distinguish the depth between the bracket on the wall and the wall itself in Scene 8. In Scene 13, the design on the bag is deceptive, leading to inaccurate depth estimations by other algorithms, whereas our approach effectively manages this issue.
The outstanding results achieved on the NYU Depth-v2 Dataset served as a catalyst for our decision to proceed with experiments on the KITTI Dataset, yielding impressive outcomes. Table 5 presents a comparison of the results obtained by our method with those of state-of-the-art depth estimation algorithms on the KITTI Dataset.
It is evident that the proposed method remains effective for outdoor scenes. The algorithm’s accuracy metrics are comparable to those of the BTS algorithm and superior to those of the other algorithms. In terms of error metrics, our approach outperforms all other assessed algorithms: it reduces the RMSE from the previous best value of 2.355 to 1.951 and the RMSE Log from 0.120 to 0.095, improvements of approximately 17% and 21%, respectively. Figure 14 illustrates the depth estimation results of different approaches on the KITTI Dataset.
In Scene 15 and Scene 16, our algorithm effectively discerns the disparity in depth between the sky and the adjacent houses and trees, whereas other algorithms fail to retain the boundary details of the sky area. In Scene 17, the billboards flanking the road exhibit relatively small dimensions and are characterized by a distinct depth in contrast to the background sky. In contrast to the subpar performance exhibited by other algorithms, our algorithm adeptly captures the contour information of small targets. In Scene 18, there are several billboards and traffic lights. The accuracy of the BTS algorithm in estimating the overall boundary depth is compromised by the influence of background trees when processing the traffic light. Our algorithm effectively estimates the depth information of the traffic light by accurately extracting precise boundary depths. In Scene 20, our algorithm effectively estimates the depth at the boundary between the sky and the buildings, whereas the Make3D and BTS algorithms are unsuccessful in predicting the depth value at this boundary.

5. Conclusions

This study introduces a monocular depth estimation approach that relies on dilated convolution and feature fusion. By integrating the DNET backbone with the dilated convolution feature fusion module and implementing a method to address class imbalance in the loss function, this methodology successfully accomplishes depth estimation from monocular images. The efficacy of our approach was validated on the NYU Depth-v2 Dataset and the KITTI Dataset, with comparisons made against other state-of-the-art algorithms. Our method demonstrated a reduction in RMSE of over 10% and an improvement in accuracy metrics of over 5%. Despite the promising results obtained, our method has certain inherent limitations: it is sensitive to sensor noise and faces challenges in integrating data from multiple sources, such as cameras and LiDAR. In future work, we will investigate the potential of updated depth estimation frameworks for practical applications, including new algorithms and architectures aimed at improving accuracy in a variety of conditions. We aim to integrate these advances into operational systems, thereby enhancing their performance and reliability.

Author Contributions

Conceptualization, H.L. and S.L.; methodology, H.L.; software, H.L. and S.L.; validation, B.W. and Y.W.; formal analysis, S.L.; investigation, S.L.; resources, H.L.; data curation, H.L.; writing—original draft preparation, H.L. and S.L.; writing—review and editing, Y.W. and B.W.; visualization, B.W. and Y.W.; supervision, S.L.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RMSE: Root Mean Square Error
RMSE Log: Root mean squared logarithmic error
Abs Rel: Absolute relative error
Sq Rel: Squared relative error
FPN: Feature Pyramid Network
CNN: Convolutional neural network

References

1. Blanche, P.-A. Holography, and the Future of 3D Display. Light Adv. Manuf. 2021, 2, 446–459.
2. Situ, G. Deep Holography. Light Adv. Manuf. 2022, 3, 278–300.
3. Liu, S.; Li, Y.; Li, H.; Wang, B.; Wu, Y.; Zhang, Z. Visual Image Dehazing Using Polarimetric Atmospheric Light Estimation. Appl. Sci. 2023, 13, 10909.
4. Dong, L.; Wang, B. Research on the New Detection Method of Suppressing the Skylight Background Based on the Shearing Interference and the Phase Modulation. Opt. Express 2020, 28, 12518–12528.
5. Liu, S.; Li, H.; Zhao, J.; Liu, J.; Zhu, Y.; Zhang, Z. Atmospheric Light Estimation Using Polarization Degree Gradient for Image Dehazing. Sensors 2024, 24, 3137.
6. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G. Digging into Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019.
7. Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 19–25 June 2021.
8. Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image Using a Multi-Scale Deep Network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
9. Liu, F.; Shen, C.; Lin, G. Deep Convolutional Neural Fields for Depth Estimation from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015.
10. Lafferty, J.; McCallum, A.; Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, 28 June–1 July 2001.
11. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018.
12. Huynh, L.; Nguyen-Ha, P.; Matas, J.; Rahtu, E.; Heikkilä, J. Guiding Monocular Depth Estimation Using Depth-Attention Volume. In Proceedings of the European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020.
13. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision (ECCV 2012), Firenze, Italy, 7–13 October 2012.
14. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017.
15. Ji, Y.; Chen, Z.; Xie, E.; Hong, L.; Liu, X.; Liu, Z.; Lu, T.; Li, Z.; Luo, P. DDP: Diffusion Model for Dense Visual Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2023), Paris, France, 4–6 October 2023.
16. Lee, J.H.; Han, M.-K.; Ko, D.W.; Suh, I.H. From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation. arXiv 2019, arXiv:1907.10326.
17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017.
18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016.
19. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
20. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv 2017, arXiv:1704.06857.
21. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
22. Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017.
23. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA, 16–21 June 2012.
24. Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D Scene Structure from a Single Still Image. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 824–840.
Figure 1. Densely connected convolutional networks.
Figure 2. Dilated convolutional networks. (a) Standard convolution. (b) Dilated convolution.
Figure 3. Architecture of Feature Pyramid Network (FPN).
Figure 4. Architecture of the proposed approach.
Figure 5. Architecture of DNET backbone.
Figure 6. Multi-receptive field feature fusion.
Figure 7. Relationship between feature maps and depth categories.
Figure 8. Mapping between continuous depth values and depth categories.
Figure 9. DNET ablation experiment comparative result. Ablation experiments with FPN have been supplemented.
Figure 10. Depth estimation results on NYU Depth-v2 Dataset using different backbones: ResNet50 [18] backbone, DenseNet121 [17] backbone, DNET without FPN, and our DNET backbone.
Figure 11. The ablation experiment results with the original and improved loss functions.
Figure 12. Depth estimation results on NYU Depth-v2 Dataset using different loss functions.
Figure 13. Depth estimation results of different approaches on the NYU dataset. The ground truth is interpolated for visualization. The DDP algorithm, as a recent model, is included in the comparison results.
Figure 14. Depth estimation results of different approaches on the KITTI Dataset.
Table 1. Summary of monocular depth estimation algorithms.

Method | Main Contribution | Year
Eigen | Introduces multi-scale deep networks for monocular depth estimation, improving accuracy by capturing global and local depth information. | 2014
DCNF | Combines deep neural networks with Conditional Random Fields to significantly improve image segmentation and object recognition accuracy. | 2015
DORN | Introduces ordinal regression to depth estimation, leading to more precise and robust depth predictions in complex scenes. | 2018
BTS [16] | Enhances depth estimation accuracy by using a “big to small” architecture for progressive refinement. | 2019
DAV | Integrates a dynamic attention mechanism with a variational model to enhance depth estimation accuracy. | 2020
DDP | Employs a generative paradigm via a conditional diffusion pipeline, gradually removing random Gaussian noise guided by the image in order to achieve precise and dense visual predictions. | 2023
Table 2. Configurations of DNET backbone.

Network Layer | Convolutional Architecture | Output Channels | Output Size
CON1 | 7 × 7 conv, stride 2; 3 × 3 max pool, stride 1 | 64 | 160 × 120
DEN1 | (1 × 1 conv, 3 × 3 conv) × 3 | 256 | 160 × 120
TRAN1 | 1 × 1 conv; 2 × 2 max pool, stride 2 | 128 | 80 × 60
DEN2 | (1 × 1 conv, 3 × 3 conv) × 6 | 512 | 80 × 60
TRAN2 | 1 × 1 conv; 2 × 2 max pool, stride 2 | 256 | 40 × 30
DEN3 | (1 × 1 conv, 3 × 3 conv) × 12 | 1024 | 40 × 30
Table 3. Comparative analysis of model performance metrics.

Metric | Advantages | Disadvantages
RMSE | Quantifies the absolute discrepancy between the predicted values and the ground truth | Highly influenced by outliers and unable to quantify relative error
RMSE Log | The logarithmic transformation of the RMSE diminishes the influence of outliers | Unable to quantify the relative error
Abs Rel | Assesses the relative error between the predicted values and the ground truth | Unable to quantify the absolute error
Sq Rel | Squaring the Abs Rel metric enhances the distinction between predictions | Unable to quantify the absolute error
Acc | Gives a direct representation of the accuracy of the predicted values | Unable to quantify the deviation between the predicted value and the ground truth
Table 4. Performance on NYU Depth-v2 Dataset. δ: threshold. Bold numbers: best performance under the respective conditions. Numbers with an underscore: the second-best performance under the respective condition. The DDP algorithm, as a recent model, is included in the table.

Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | RMSE | RMSE Log | Abs Rel | Sq Rel
(the three δ columns: higher is better; the four error columns: lower is better)
Eigen [8] | 0.692 | 0.899 | 0.967 | 1.156 | 0.270 | 0.190 | 1.515
Make3D [24] | 0.601 | 0.820 | 0.926 | 1.734 | 0.261 | 0.280 | 3.012
DORN [11] | 0.850 | 0.966 | 0.981 | 0.394 | 0.160 | 0.137 | 0.080
BTS [16] | 0.866 | 0.972 | 0.992 | 0.398 | 0.153 | 0.121 | 0.075
DDP [15] | 0.920 | 0.980 | 0.989 | 0.369 | 0.140 | 0.129 | 0.077
Ours | 0.912 | 0.983 | 0.990 | 0.355 | 0.132 | 0.120 | 0.066
Table 5. Performance on KITTI Dataset. δ: threshold. Bold numbers: best performance under the respective conditions. Numbers with an underscore: the second-best performance under the respective condition.

Method | δ < 1.25 | δ < 1.25² | δ < 1.25³ | RMSE | RMSE Log | Abs Rel | Sq Rel
(the three δ columns: higher is better; the four error columns: lower is better)
Eigen [8] | 0.800 | 0.889 | 0.922 | 2.755 | 0.144 | 0.137 | 0.355
Make3D [24] | 0.851 | 0.903 | 0.944 | 2.502 | 0.131 | 0.107 | 0.309
BTS [16] | 0.945 | 0.988 | 0.990 | 2.355 | 0.120 | 0.091 | 0.288
Ours | 0.930 | 0.985 | 0.998 | 1.951 | 0.095 | 0.072 | 0.156
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
