Article

Depth Estimation Using Feature Pyramid U-Net and Polarized Self-Attention for Road Scenes

1 Key Laboratory of Metallurgical Equipment and Its Control, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China
2 Hubei Key Laboratory of Mechanical Transmission and Manufacturing Engineering, Wuhan University of Science and Technology, Wuhan 430081, China
3 Precision Manufacturing Institute, Wuhan University of Science and Technology, Wuhan 430081, China
4 Research Center for Biomimetic Robot and Intelligent Measurement and Control, Wuhan University of Science and Technology, Wuhan 430081, China
5 Hubei Key Laboratory of Hydroelectric Machinery Design and Maintenance, China Three Gorges University, Yichang 443005, China
* Author to whom correspondence should be addressed.
Photonics 2022, 9(7), 468; https://doi.org/10.3390/photonics9070468
Submission received: 15 May 2022 / Revised: 17 June 2022 / Accepted: 30 June 2022 / Published: 4 July 2022
(This article belongs to the Special Issue Optical 3D Sensing Systems)

Abstract

Studies have shown that observed image texture details and semantic information are of great significance for depth estimation in road scenes. However, previous methods produce ambiguous and inaccurate boundary information for observed objects. We therefore designed a new depth estimation method that achieves higher accuracy and more precise boundaries for detected objects. Based on polarized self-attention (PSA) and a feature pyramid U-net, we propose a new self-supervised monocular depth estimation model that extracts more accurate texture details and semantic information. First, we add a PSA module at the end of the depth encoder and the pose encoder so that the network can extract more accurate semantic information. Then, building on the U-net, we feed the multi-scale images produced by the object detection module FPN (Feature Pyramid Network) directly into the decoder. This guides the model to learn semantic information and thus sharpens object boundaries. We evaluated our method on the KITTI 2015 and Make3D datasets, where our model achieved better results than previous studies. To verify the generalization of the model, we conducted monocular, stereo, and monocular-plus-stereo experiments. The results show that our model achieves better scores on several main evaluation metrics and produces clearer boundary information. To compare different forms of the PSA mechanism, we performed ablation experiments: adding the PSA module improved the evaluation metrics compared with omitting it. We also found that our model benefits more from monocular training than from stereo or monocular-plus-stereo training.

1. Introduction

Depth estimation is a fundamental problem in computer vision. It is used in many fields, such as 3D model reconstruction, scene understanding, autonomous driving, and computational photography. Generally, depth information can be collected by various hardware, such as the Kinect V1 and Kinect V2 (using stereo matching and time-of-flight methods, respectively [1]), laser scanners, and structured light sensors. Among these options, depth estimation based on a monocular camera is the cheapest, and the monocular camera is also the most commonly used type of camera.
Training strategies for monocular depth estimation are widely used in real-world applications such as shadow detection and removal [2], 3D reconstruction [3,4], and augmented reality [5]. The training methods of monocular depth estimation can be classified into supervised, semi-supervised, unsupervised, and self-supervised methods.
Supervised monocular depth estimation methods [4,5,6,7] can be trained with ground truth and produce good results. However, ground truth is difficult to obtain in most scenarios. Semi-supervised monocular depth estimation uses large amounts of relatively inexpensive unlabeled data to improve learning performance effectively. Such methods introduce additional information, such as synthetic data, surface texture, and LiDAR. As a semi-supervised learning approach, this reduces the model's reliance on ground-truth depth maps, enhances scale consistency, and improves the accuracy of the estimated depth map.
By contrast, self-supervised monocular depth estimation [8,9,10,11,12], which relies only on stereo image pairs or monocular video for supervision, has attracted increasing attention from both industry and academia. State-of-the-art (SOTA) self-supervised monocular depth estimation methods [8,9,10,11] can successfully estimate relative depth. However, existing methods handle image edges poorly, and the estimation of edge contour details still needs to be improved. To address this problem, we propose a new self-supervised monocular depth estimation method based on polarized self-attention (PSA) and a feature pyramid U-net, which estimates image depth accurately while preserving the contour lines of the image. The PSA mechanism combines channel self-attention and spatial self-attention and connects them in parallel or in series; we add it directly after the 512-channel layer of the encoder, so it can be plugged in without changing the main structure of the network. With this structure, the model can learn pixel-level semantic features by convolution without significantly increasing the size of the model. Inspired by the U-net model and the object detection module FPN, we also pass the original image, processed by max pooling, to the decoder. Experiments show that our method achieves good results on the depth estimation task.
The main contributions of this research are as follows:
(1)
PSA is used in a monocular self-supervised depth estimation model. It guides the model to learn pixel-level semantic information, so the model produces depth maps with more accurate boundaries.
(2)
We design a new decoder splicing method that combines the skip connections of U-net with FPN. This approach achieves better results without significantly increasing the computational cost.

2. Related Work

In this section, we describe self-supervised monocular depth estimation methods, the network combining FPN and U-net, and the application of the self-attention mechanism in the task of depth estimation.

2.1. Self-Supervised Monocular Depth Estimation

By combining deep learning [13,14,15,16] with depth estimation, an increasing number of strategies have become available for monocular depth estimation. Among them, self-supervised monocular depth estimation [17,18,19,20,21,22,23] has become a hot research topic in industry and academia because it exploits the network's learning ability. The training data for such models are left–right image pairs [19,21,22,23] or video sequences [17,18,20], without real depth information; a joint loss function guides the convergence of the model. At prediction time, the trained model and the camera matrix are used to compute a depth map.
In [24], Xie et al. proposed discrete disparity estimation for stereo image pairs and used the reconstructed images to compute the training loss; their stereo image pair dataset was extracted from 3D movies. In [8] (MonoDepth), Godard et al. added a left–right consistency objective so that the left and right images are predicted with better depth consistency. Chen et al. [25] constructed a loss function using relative depth relationships and predicted pixel-level depth directly through a multi-scale neural network; they also argued that predicting left–right disparity maps from a single input image is not reasonable. Zhou et al. [26] combined a depth network with a pose network, using the depth network to predict the depth of the target image and the pose network to predict the camera transformation matrix. In addition, a motion explanation mask [20,27] is predicted to encourage the exclusion of non-rigid scene motion regions. Godard et al. [9] (monodepth2) proposed a per-pixel minimum reprojection error to handle occlusions, together with an auto-masking loss that ignores training pixels violating the camera motion assumption. At the same time, a multi-scale estimation method [10,28] was used to upsample the outputs of the depth decoder to the size of the original image, and the projected images were then used to compute the loss. Although these methods achieve good results, their predicted depth does not convey accurate semantic information, and it can even be difficult to judge what the detected objects are from the color depth map. This is because the models cannot effectively learn the relationship between object boundary information and different textures.

2.2. The Network Combining FPN and U-Net

The object detection module FPN [29] can guide a model to learn more semantic information. Researchers found that images at different scales, obtained by max pooling, can improve object detection accuracy after convolution, upsampling, and concatenation operations. In [30], Song et al. proposed a new monocular depth estimation method in which the decoding process is decomposed into different components in order to make full use of good encoder features. In [31], Lai et al. proposed a densely connected pyramid network for monocular depth estimation. Through dense connection modules, they fuse features not only between adjacent levels but also between non-adjacent levels, unlike traditional pyramid networks that only fuse features between adjacent levels of the pyramid. In [26], Zhou et al. built their model on a U-net and achieved better results than previous studies. This motivated us to build a new model by combining FPN with U-net.

2.3. Self-Attention Mechanism

When dealing with semantic segmentation tasks, researchers found that inserting a self-attention mechanism [32,33,34,35,36,37,38] into a model can effectively improve its performance. Because of its plug-and-play nature, this approach is used in various tasks. In CBAM [39], Woo et al. developed a channel-plus-spatial self-attention mechanism that achieved better results than channel-only self-attention; they found that the spatial attention branch plays a key role in deciding "where" to attend. In [40], Huang et al. used a self-attention mechanism and boundary consistency to build a model that improved performance on the depth estimation task; the self-attention mechanism allows the network to refine depth boundaries and image quality, resulting in a clearer structure. In [41], a self-attention-based depth and motion network was proposed; this framework can capture long-range context information, resulting in a clearer depth map. We chose PSA [42] because it achieved good results on recent semantic segmentation tasks. We believe that the self-attention mechanism and the object detection module can together guide the model to learn semantic information.

3. Methods

In this section, we describe our self-supervised monocular depth estimation method, based on PSA and an object detection module, in detail. We consider three forms of training data: monocular image sequences, stereo image pairs, and monocular sequences plus stereo pairs.
We construct two end-to-end networks based on U-net: a depth network and a pose network. Both networks have an encoder–decoder structure. The depth encoder extracts features from a single color image, from which the network learns semantic information. We run the object detection module in parallel with the depth encoder; it passes the extracted boundary information of detected objects to the depth decoder, which then constructs a clearer depth image. At the same time, the pose encoder extracts pose information from successive frames and computes the camera matrix by learning the pose relationships between consecutive images. Finally, we add a PSA module to the end of the depth encoder and the pose encoder, which helps extract pixel-level semantic information.

3.1. Network Architecture

The depth estimation task is highly related to the object detection and semantic segmentation tasks: we want to see the outline of the detected object clearly in the depth map generated by the model. For this purpose, we run the object detection module in parallel with the depth encoder and then pass the result of the object detection module into the depth decoder, as shown in Figure 1.
After the original image passes through the FPN, four images of different scales are obtained, downsampled by factors of 2, 4, 8, and 16, respectively. They are exactly the same size as the skip-connection feature maps, so we concatenate them with the upsampled decoder features and use the result to extract multi-scale features. We do not need real semantic labels or anchors for the detected objects; we only need the model to learn the different details of the detected objects and the environment. This helps generate more accurate object boundary information.
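The following PyTorch sketch shows one plausible way to realize this splicing: the input image is max-pooled to the decoder scales and concatenated with the upsampled features and the encoder skip connection. The module and function names (FPNPyramid, fuse_scale) are our own illustration under these assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPyramid(nn.Module):
    """Builds 1/2, 1/4, 1/8 and 1/16 resolution copies of the input image
    by repeated max pooling, matching the sizes of the U-net skip connections."""
    def forward(self, image):                       # image: (B, 3, H, W)
        pyramid, x = [], image
        for _ in range(4):                          # 2x, 4x, 8x, 16x downsampling
            x = F.max_pool2d(x, kernel_size=2, stride=2)
            pyramid.append(x)
        return pyramid

def fuse_scale(upsampled_feat, skip_feat, pooled_image):
    """Concatenates upsampled decoder features, the encoder skip connection,
    and the max-pooled image of the same spatial size along the channel axis."""
    return torch.cat([upsampled_feat, skip_feat, pooled_image], dim=1)
```

At each decoder stage, one would call fuse_scale on the bilinearly upsampled features, the corresponding encoder skip connection, and the pyramid level of matching resolution before the next convolution.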
For self-attention, we chose the lighter PSA mechanism. PSA has two characteristics: (1) polarized filtering: completely collapsing the features along one dimension while maintaining high resolution along its orthogonal dimension; (2) HDR enhancement: performing softmax normalization on the smallest feature tensor in the attention block (of size HW × 1 × 1) and then applying a sigmoid function for the final mapping, which increases the dynamic range of the attention. Formally, the PSA mechanism can be instantiated as the following two modules (PSA_p, PSA_s).

3.1.1. Channel-Only Branch

In Figure 2 and Figure 3, the content inside the red dotted box is the channel-only self-attention. H and W are the height and width of the image, and C is the number of channels. The shapes of the cubes show intuitively how the features change inside the PSA module. The green solid block $A^{ch}$ is obtained by the following formula:
$$A^{ch}(X) = F_{SG}\left[ W_{z|\theta_1}\left( \sigma_1\left(W_v(X)\right) \times F_{SM}\left(\sigma_2\left(W_q(X)\right)\right) \right) \right] \quad (1)$$
where $A^{ch}(X) \in \mathbb{R}^{C \times 1 \times 1}$; $W_q$, $W_v$ and $W_z$ are $1 \times 1$ convolutional layers; and $\sigma_1$ and $\sigma_2$ are two tensor reshape operators that change the dimensions of a tensor. $F_{SM}(X) = \sum_{j=1}^{N_p} \frac{e^{x_j}}{\sum_{m=1}^{N_p} e^{x_m}} x_j$ is the softmax operator, and $F_{SG}(\cdot)$ is the sigmoid function.
With $\odot^{ch}$ denoting the channel-wise multiplication operator, the output $Z^{ch}$ of the channel-only branch is:
$$Z^{ch} = A^{ch}(X) \odot^{ch} X \in \mathbb{R}^{C \times H \times W} \quad (2)$$
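A minimal PyTorch sketch of the channel-only branch in Equations (1) and (2) is given below. It follows the published PSA formulation; the class name, the LayerNorm placement, and the internal C/2 width are assumptions for illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ChannelOnlyBranch(nn.Module):
    """Channel-only polarized self-attention (Equations (1)-(2))."""
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2                               # internal C/2 width (assumption)
        self.wq = nn.Conv2d(channels, 1, kernel_size=1)     # W_q: C -> 1
        self.wv = nn.Conv2d(channels, inner, kernel_size=1) # W_v: C -> C/2
        self.wz = nn.Conv2d(inner, channels, kernel_size=1) # W_z: C/2 -> C
        self.ln = nn.LayerNorm([channels, 1, 1])            # LN, as in Figures 2 and 3
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.softmax(self.wq(x).reshape(b, h * w, 1))   # F_SM(sigma_2(W_q X)): (B, HW, 1)
        v = self.wv(x).reshape(b, c // 2, h * w)            # sigma_1(W_v X): (B, C/2, HW)
        z = torch.bmm(v, q).reshape(b, c // 2, 1, 1)        # matrix product over spatial positions
        a_ch = torch.sigmoid(self.ln(self.wz(z)))           # F_SG[W_z(...)]: (B, C, 1, 1)
        return a_ch * x                                     # Z_ch = A_ch applied channel-wise to X
```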

3.1.2. Spatial-Only Branch

The yellow solid block $A^{sp}$ is obtained by the following formula:
$$A^{sp}(X) = F_{SG}\left[ \sigma_3\left( F_{SM}\left(\sigma_1\left(F_{GP}\left(W_q(X)\right)\right)\right) \times \sigma_2\left(W_v(X)\right) \right) \right] \quad (3)$$
where $A^{sp}(X) \in \mathbb{R}^{1 \times H \times W}$; $W_q$ and $W_v$ are $1 \times 1$ convolutional layers; $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three tensor reshape operators that change the dimensions of a tensor; and $F_{GP}(\cdot)$ is global pooling. Denoting the spatial multiplication operator by $\odot^{sp}$, the output of the spatial-only branch is:
$$Z^{sp} = A^{sp}(X) \odot^{sp} X \in \mathbb{R}^{C \times H \times W} \quad (4)$$
From Equations (1)–(4), two forms of PSA modules can be generated.
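As a companion to the channel-only sketch, the snippet below shows the spatial-only branch of Equations (3) and (4) and how the two branches can be composed in parallel (PSA_p) or in series (PSA_s). It reuses the ChannelOnlyBranch class from the previous sketch; the names and the internal C/2 width are again illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialOnlyBranch(nn.Module):
    """Spatial-only polarized self-attention (Equations (3)-(4))."""
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2                               # internal C/2 width (assumption)
        self.wq = nn.Conv2d(channels, inner, kernel_size=1) # W_q: C -> C/2
        self.wv = nn.Conv2d(channels, inner, kernel_size=1) # W_v: C -> C/2
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = F.adaptive_avg_pool2d(self.wq(x), 1)            # F_GP(W_q X): (B, C/2, 1, 1)
        q = self.softmax(q.reshape(b, 1, c // 2))           # F_SM(sigma_1(...)): (B, 1, C/2)
        v = self.wv(x).reshape(b, c // 2, h * w)            # sigma_2(W_v X): (B, C/2, HW)
        a_sp = torch.sigmoid(torch.bmm(q, v).reshape(b, 1, h, w))  # F_SG[sigma_3(...)]
        return a_sp * x                                     # Z_sp = A_sp applied spatially to X

class PSA(nn.Module):
    """Parallel (PSA_p) or sequential (PSA_s) composition of the two branches."""
    def __init__(self, channels, mode="p"):
        super().__init__()
        self.ch = ChannelOnlyBranch(channels)               # from the previous sketch
        self.sp = SpatialOnlyBranch(channels)
        self.mode = mode

    def forward(self, x):
        if self.mode == "p":
            return self.ch(x) + self.sp(x)                  # PSA_p: sum of the two branch outputs
        return self.sp(self.ch(x))                          # PSA_s: channel branch, then spatial branch
```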
We use the self-supervised monocular depth estimation network monodepth2 [9] as our baseline. It is a U-net model with an encoder–decoder architecture, allowing end-to-end output. We use three consecutive frames as input, and images at different scales from the depth decoder are used to compute the loss. Figure 1 shows the basic framework of the model. We use the ResNet18 [43] network as the encoder backbone for both the depth network and the pose network. To facilitate the acquisition of accurate semantic information, we insert the PSA module after the 512-channel layer of both the depth encoder and the pose encoder. Inspired by residual networks and U-net, we also design a parallel object detection module: it scales the input image to the corresponding decoder resolution through max-pooling downsampling and then concatenates the two. This allows the network to learn boundary information and complex texture information effectively.
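The sketch below illustrates, under our reading of the text, where the PSA block sits in the pipeline: a ResNet-18 backbone whose final 512-channel feature map is refined by PSA before decoding. torchvision's ResNet-18 is used here as a stand-in for the encoder, and the wrapper class is an illustrative assumption rather than the released model.

```python
import torch.nn as nn
from torchvision.models import resnet18

class DepthEncoderWithPSA(nn.Module):
    """ResNet-18 backbone whose last (512-channel) feature map is refined by PSA."""
    def __init__(self, psa_mode="p"):
        super().__init__()
        backbone = resnet18()                 # randomly initialized backbone (weights optional)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.psa = PSA(512, mode=psa_mode)    # PSA inserted after the 512-channel layer

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                   # kept as skip connections for the U-net decoder
        feats[-1] = self.psa(feats[-1])       # refine the deepest features with PSA
        return feats
```

The same wrapping would apply to the pose encoder, which shares the ResNet-18 backbone structure.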

3.2. Loss Function

As in [44], we upsample the small-scale images using bilinear sampling, which is locally sub-differentiable. The photometric error function $pe$ between a pair of images is built from the $L_1$ norm [45,46] and SSIM [9]:
$$pe(I_a, I_b) = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_a, I_b)\right) + (1 - \alpha)\left\| I_a - I_b \right\|_1 \quad (5)$$
where $\alpha = 0.85$ and $I_a$, $I_b$ are two consecutive frames.
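A hedged sketch of Equation (5) is given below, assuming a standard 3 × 3 average-pooling SSIM in the style of monodepth2; the simplified ssim_distance helper and the per-pixel averaging over channels are our own choices, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def ssim_distance(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM distance (1 - SSIM) / 2 over a 3x3 window, clamped to [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(i_a, i_b, alpha=0.85):
    """Equation (5): alpha/2 * (1 - SSIM) + (1 - alpha) * L1, averaged over channels."""
    l1 = torch.abs(i_a - i_b).mean(1, keepdim=True)
    return alpha * ssim_distance(i_a, i_b).mean(1, keepdim=True) + (1 - alpha) * l1
```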
Inspired by monodepth2 [9], we also use the per-pixel minimum of the reprojection error, so the photometric loss at each pixel is:
$$L_p = \min_{t'} \, pe\left(I_t, I_{t' \to t}\right) \quad (6)$$
where $I_{t' \to t}$ denotes the source frame $I_{t'}$ warped into the view of the target frame $I_t$.
As described in [46], we use an edge-aware smoothness loss:
$$L_s = \left| \partial_x d_t^{*} \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t^{*} \right| e^{-\left| \partial_y I_t \right|} \quad (7)$$
where $d_t^{*} = d_t / \overline{d_t}$ is the mean-normalized inverse depth, with $d_t$ the predicted inverse depth and $\overline{d_t}$ its mean over the image. This normalization discourages shrinking of the estimated depth.
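A possible PyTorch rendering of Equation (7) follows; the mean-normalization of the inverse depth (disparity) mirrors the monodepth2 convention, and the function name is our own.

```python
import torch

def edge_aware_smoothness(disp, image):
    """Equation (7): gradients of the mean-normalized inverse depth,
    down-weighted where the input image has strong edges."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)        # d* = d / mean(d)
    grad_disp_x = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    grad_disp_y = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    grad_img_x = torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]).mean(1, keepdim=True)
    grad_img_y = torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]).mean(1, keepdim=True)
    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```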
The final loss function is a composite of Equations (6) and (7):
$$L = \mu L_p + \lambda L_s \quad (8)$$
where $\lambda = 0.001$ and $\mu$ is a binary per-pixel mask that selectively weights each pixel: a pixel is included in the loss ($\mu = 1$) only when the reprojection error of the warped image is smaller than the reprojection error of the unwarped source image.
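The per-pixel minimum of Equation (6), the auto-mask μ, and the composite loss of Equation (8) could be combined roughly as in the sketch below, which reuses photometric_error from the earlier sketch; the function and argument names are illustrative, not the authors' training code.

```python
import torch

def total_loss(target, warped_sources, identity_sources, smooth_term, lam=1e-3):
    """Equations (6) and (8): per-pixel minimum reprojection error with
    auto-masking, plus the weighted edge-aware smoothness term.

    warped_sources   -- list of source frames warped into the target view
    identity_sources -- list of unwarped source frames, used to build the mask mu
    """
    reproj = torch.cat([photometric_error(target, w) for w in warped_sources], dim=1)
    identity = torch.cat([photometric_error(target, s) for s in identity_sources], dim=1)

    min_reproj, _ = reproj.min(dim=1, keepdim=True)       # Eq. (6): minimum over source frames
    min_identity, _ = identity.min(dim=1, keepdim=True)

    mu = (min_reproj < min_identity).float()              # keep pixels the warping explains better
    l_p = (mu * min_reproj).sum() / (mu.sum() + 1e-7)
    return l_p + lam * smooth_term                        # Eq. (8): L = mu * L_p + lambda * L_s
```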
Our models were implemented in PyTorch [47]. For monocular training, we used Adam [48] for 25 epochs with a batch size of 12 and an input/output resolution of 640 × 192. For the first 15 epochs we used a learning rate of $10^{-5}$, after which the learning rate dropped to $10^{-6}$. We trained on an AMD Ryzen 5 5600X 6-core CPU and an NVIDIA GeForce RTX 3060 (12 GB) GPU. Our monocular model took 13 h to train, the stereo model 14 h, and the monocular plus stereo model 17 h.
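This schedule could be reproduced roughly as follows, assuming the learning rates are $10^{-5}$ and $10^{-6}$ as printed and that the model exposes a standard parameters() iterator; this is a configuration sketch with placeholder arguments, not the authors' training script.

```python
import torch

def train(model, train_loader, compute_losses, epochs=25):
    """Training schedule mirrored from the text: Adam, 25 epochs, batch size 12,
    640x192 inputs, learning rate divided by 10 after epoch 15."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)
    for _ in range(epochs):
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_losses(model, batch)   # photometric + smoothness terms (Eq. (8))
            loss.backward()
            optimizer.step()
        scheduler.step()
```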

4. Results and Discussion

In this section, we discuss the results achieved by our model on publicly available datasets and compare them with the results of SOTA monocular depth estimation models.
We trained our model on the KITTI dataset. It has continuous frames taken by cameras set up on moving vehicles. Then, we evaluated our models on KITTI and Make3D [49]. Finally, we did an ablation study of different PSA modules.

4.1. KITTI Results

The evaluation results on the KITTI dataset are shown in Table 1. The KITTI 2015 Eigen split [50] was chosen as the evaluation set. We compared our results with those of other monocular depth estimation models trained with self-supervised monocular, self-supervised stereo, and self-supervised monocular plus stereo data, respectively.
From Table 1, we find that our method performs well compared with the other self-supervised depth estimation models. Our model achieves the best results in monocular training. For stereo training, our model achieves the best results except for the absolute relative error. For monocular plus stereo training, our model achieves the best results except for the absolute relative error and the squared relative error.
Direct observation also shows that our method recovers the texture of the important detected objects well, as shown in Figure 4. The model with the PSA module therefore improves the quality of the depth estimation map and obtains more accurate boundaries by extracting pixel-level semantic information.

4.2. Make3D Results

Similar to monodepth2 [9], our model was also evaluated on the Make3D benchmark. Make3D contains single-frame RGB-D images without stereo image pairs or video sequences. Therefore, we trained our monocular depth model on the KITTI 2015 [59] dataset. We used the evaluation criterion proposed in monodepth [8]; the comparison results are shown in Table 2.
We ran the other authors' models directly to obtain their error metrics [60]. From Table 2, we find that our method achieves the best results on all evaluation metrics when compared with the other self-supervised monocular depth estimation models.

4.3. Generalizing to Other Datasets

We also applied our method to other publicly available road-scene datasets, such as Cityscapes [45], DIODE [61], and IDD [62]. We fed images from these datasets into our model for prediction, and the results show that our model can be applied to new scenes.
The predicted depth maps are shown in Figure 5. It is evident that our model can also achieve better results on other datasets. However, due to differences in lighting, environment, and objects, the displayed texture boundaries are not as accurate as those on the KITTI dataset.

4.4. Ablation Study

We conducted an ablation study to compare the effects of adding PSA_p, PSA_s, and both together. They were added at the 512-channel layer of the depth encoder and the pose encoder [63].
The results are summarized in Table 3. Compared with the model without a PSA module, adding the PSA module achieves better evaluation metrics in monocular training. For stereo training, our model did not achieve the best absolute relative error. For monocular plus stereo training, our model did not achieve the best absolute relative error or squared relative error.

4.5. Discussion

The results of this research show that using PSA together with the feature pyramid U-net extracts semantic information better than traditional methods. Our method not only gives better scores on the evaluation metrics but also provides better boundary information in the predicted depth maps. Another key point is the comparison of monocular, stereo, and monocular plus stereo training across the different PSA modules. Table 3 shows that adding PSA_p and PSA_s together leads to better results, especially in monocular training, although our models did not achieve the best results on all evaluation metrics. Therefore, the PSA modules are better suited to monocular training. Finally, we also verified our model on other datasets, where it achieved good results, demonstrating that the model generalizes to a certain extent.

5. Conclusions

In this study, we proposed a new self-supervised monocular depth estimation method. The method is intended to produce a more accurate depth estimation map from which a detected object can be understood intuitively. Compared with previous models, its predicted depth maps achieve better results in both object boundary detection and depth estimation accuracy. Finally, we verified our model on other datasets, where it also achieved good results, demonstrating that the model generalizes to a certain extent.

Author Contributions

Conceptualization, B.T.; methodology, Y.S.; software, Y.S.; validation, B.T., Y.S. and X.T.; formal analysis, Y.S.; investigation, Y.S.; resources, D.J.; data curation, B.T.; writing—original draft preparation, Y.S.; writing—review and editing, B.T.; visualization, B.C.; supervision, B.T.; project administration, B.T.; funding acquisition, B.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number (51505349, 51575407), Hubei Provincial Department of Education, grant number D20201106, and the Open Fund of Hubei Key Laboratory of Hydroelectric Machinery Design and Maintenance in China Three Gorges University (2021KJX13).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pagliari, D.; Pinto, L. Calibration of Kinect for Xbox One and Comparison between the Two Generations of Microsoft Sensors. Sensors 2015, 15, 27569–27589.
  2. Fan, X.; Wu, W.; Zhang, L.; Yan, Q.; Fu, G.; Chen, Z.; Long, C.; Xiao, C. Shading-aware shadow detection and removal from a single image. Vis. Comput. 2020, 36, 2175–2188.
  3. Fu, Y.; Yan, Q.; Liao, J.; Chow, A.L.H.; Xiao, C. Real-time dense 3D reconstruction and camera tracking via embedded planes representation. Vis. Comput. 2020, 36, 2215–2226.
  4. Fu, Y.; Yan, Q.; Liao, J.; Xiao, C. Joint Texture and Geometry Optimization for RGB-D Reconstruction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5949–5958.
  5. Hao, Z.; Li, Y.; You, S.; Lu, F. Detail Preserving Depth Estimation from a Single Image Using Attention Guided Networks. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 304–313.
  6. Klodt, M.; Vedaldi, A. Supervising the New with the Old: Learning SFM from SFM. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; Volume 11214.
  7. Yang, N.; Wang, R.; Stückler, J.; Cremers, D. Deep Virtual Stereo Odometry: Leveraging Deep Depth Prediction for Monocular Direct Sparse Odometry. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; Volume 11212.
  8. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  9. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into Self-Supervised Monocular Depth Estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3827–3837.
  10. Ye, X.; Fan, X.; Zhang, M.; Xu, R.; Zhong, W. Unsupervised Monocular Depth Estimation via Recursive Stereo Distillation. IEEE Trans. Image Processing 2021, 30, 4492–4504.
  11. Klingner, M.; Termöhlen, J.A.; Mikolajczyk, J.; Fingscheidt, T. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In Proceedings of the ECCV, 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 2619–2627.
  12. Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R. LEGO: Learning Edge with Geometry all at Once by Watching Videos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 225–234.
  13. Jiang, D.; Li, G.; Sun, Y.; Hu, J.; Yun, J.; Liu, Y. Manipulator grabbing position detection with information fusion of color image and depth image using deep learning. J. Ambient Intell. Humaniz. Comput. 2021, 12, 10809–10822.
  14. Tao, B.; Liu, Y.; Huang, L.; Chen, G.; Chen, B. 3D reconstruction based on photoelastic fringes. Concurr. Comput. Pract. Exp. 2022, 34, e6481.
  15. Tao, B.; Wang, Y.; Qian, X.; Tong, X.; He, F.; Yao, W.; Chen, B.; Chen, B. Photoelastic Stress Field Recovery Using Deep Convolutional Neural Network. Front. Bioeng. Biotechnol. 2022, 10, 818112.
  16. Jiang, D.; Li, G.; Tan, C.; Huang, L.; Sun, Y.; Kong, J. Semantic segmentation for multiscale target based on object recognition using the improved Faster-RCNN model. Future Gener. Comput. Syst. 2021, 123, 94–104.
  17. Yang, Z.; Wang, P.; Xu, W.; Zhao, L.; Nevatia, R. Unsupervised learning of geometry from videos with edge-aware depth-normal consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2018; pp. 7493–7500.
  18. Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5667–5675.
  19. Yin, Z.; Shi, J. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992.
  20. Wang, C.; Miguel Buenaposada, J.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030.
  21. Zou, Y.; Luo, Z.; Huang, J.B. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 38–55.
  22. Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12232–12241.
  23. Luo, C.; Yang, Z.; Wang, P.; Wang, Y.; Xu, W.; Nevatia, R.; Yuille, A. Every pixel counts ++: Joint learning of geometry and motion with 3d holistic understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2624–2641.
  24. Xie, J.; Girshick, R.; Farhadi, A. Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9908.
  25. Chen, P.-Y.; Liu, A.H.; Liu, Y.-C.; Wang, Y.-C.F. Towards Scene Understanding: Unsupervised Monocular Depth Estimation with Semantic-Aware Representation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2619–2627.
  26. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
  27. Xing, X.; Cai, Y.; Wang, Y.; Lu, T.; Yang, Y.; Wen, D. Dynamic Guided Network for Monocular Depth Estimation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5459–5465.
  28. Phan, M.H.; Phung, S.L.; Bouzerdoum, A. Ordinal Depth Classification Using Region-based Self-attention. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3620–3627.
  29. Zhang, Y.; Han, J.H.; Kwon, Y.W.; Moon, Y.S. A New Architecture of Feature Pyramid Network for Object Detection. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu, China, 11–14 December 2020; pp. 1224–1228.
  30. Song, M.; Lim, S.; Kim, W. Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4381–4393.
  31. Lai, Z.; Tian, R.; Wu, Z.; Ding, N.; Sun, L.; Wang, Y. DCPNet: A Densely Connected Pyramid Network for Monocular Depth Estimation. Sensors 2021, 21, 6780.
  32. Ng, M.Y.; Chng, C.B.; Koh, W.K.; Chui, C.K.; Chua, M.C.H. An enhanced self-attention and A2J approach for 3D hand pose estimation. Multimed. Tools Appl. 2021, 9, 124847–124860.
  33. Yang, J.; Yang, J. Aspect Based Sentiment Analysis with Self-Attention and Gated Convolutional Networks. In Proceedings of the 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 16–18 October 2020; pp. 146–149.
  34. Wang, J.; Zhang, G.; Yu, M.; Xu, T.; Luo, T. Attention-Based Dense Decoding Network for Monocular Depth Estimation. IEEE Access 2020, 8, 85802–85812.
  35. Zhang, W.; Wang, G.; Huang, M.; Wang, H.; Wen, S. Generative Adversarial Networks for Abnormal Event Detection in Videos Based on Self-Attention Mechanism. IEEE Access 2021, 9, 124847–124860.
  36. Miyazaki, K.; Komatsu, T.; Hayashi, T.; Watanabe, S.; Toda, T.; Takeda, K. Weakly-Supervised Sound Event Detection with Self-Attention. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 66–70.
  37. Johnston, A.; Carneiro, G. Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4755–4764.
  38. Wang, C.; Deng, C. On the Global Self-attention Mechanism for Graph Convolutional Networks. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 8531–8538.
  39. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521v1.
  40. Huang, Y.-K.; Wu, T.-H.; Liu, Y.-C.; Hsu, W.H. Indoor Depth Completion with Boundary Consistency and Self-Attention. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 1070–1078.
  41. Mathew, A.; Patra, A.P.; Mathew, J. Self-Attention Dense Depth Estimation Network for Unrectified Video Sequences. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual Conference, 25–28 October 2020; pp. 2810–2814.
  42. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized Self-Attention: Towards High-quality Pixel-wise Regression. arXiv 2021, arXiv:2107.00782. Available online: https://arxiv.org/abs/2107.00782 (accessed on 14 May 2022).
  43. Aziz, S.; Bilal, M.; Khan, M.; Amjad, F. Deep Learning-based Automatic Morphological Classification of Leukocytes using Blood Smears. In Proceedings of the 2020 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), Istanbul, Turkey, 12–13 June 2020; pp. 1–5.
  44. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Pixel-Wise Crowd Understanding via Synthetic Data. Int. J. Comput. Vis. 2021, 129, 225–245.
  45. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
  46. Wang, H.; Sang, X.; Chen, D.; Wang, P. Self-Supervised Learning of Monocular Depth Estimation Based on Progressive Strategy. IEEE Trans. Comput. Imaging 2021, 7, 375–383.
  47. Zhou, S.; Wu, J.; Zhang, F.; Sehdev, P. Depth occlusion perception feature analysis for person re-identification. Pattern Recognit. Lett. 2020, 138, 617–623.
  48. Pillai, S.; Ambrus, R.; Gaidon, A. SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 9250–9256.
  49. Li, Z.; Snavely, N. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050.
  50. Goldman, M.; Hassner, T.; Avidan, S. Learn Stereo, Infer Mono: Siamese Networks for Self-Supervised, Monocular, Depth Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 15–20 June 2019; pp. 2886–2895.
  51. Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 8001–8008.
  52. Garg, R.; VijayKumar, B.G.; Carneiro, G.; Reid, I. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In Proceedings of the ECCV, 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756.
  53. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the CVPR, 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
  54. Mehta, I.; Sakurikar, P.; Narayanan, P.J. Structured adversarial training for unsupervised monocular depth estimation. In Proceedings of the 3DV, 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 314–323.
  55. Poggi, M.; Tosi, F.; Mattoccia, S. Learning monocular depth estimation with unsupervised trinocular assumptions. In Proceedings of the 3DV, 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 324–333.
  56. Watson, J.; Firman, M.; Brostow, G.; Turmukhambetov, D. Self-Supervised Monocular Depth Hints. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 2162–2171.
  57. Li, R.; Wang, S.; Long, Z.; Gu, D. Undeepvo: Monocular visual odometry through unsupervised deep learning. In Proceedings of the ICRA, 2018 IEEE International Conference on Robotics and Automation, Brisbane, QLD, Australia, 21–25 May 2018; pp. 7286–7291.
  58. Masoumian, A.; Rashwan, H.; Abdulwahab, S.; Cristiano, J. GCNDepth: Self-supervised Monocular Depth Estimation based on Graph Convolutional Network. arXiv 2021, arXiv:2112.06782.
  59. Godet, P.; Boulch, A.; Plyer, A.; Le Besnerais, G. STaRFlow: A SpatioTemporal Recurrent Cell for Lightweight Multi-Frame Optical Flow Estimation. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2462–2469.
  60. Tao, B.; Huang, L.; Zhao, H.; Li, G.; Tong, X. A time sequence images matching method based on the siamese network. Sensors 2021, 21, 5900.
  61. Vasiljevic, I.; Kolkin, N.; Zhang, S.; Luo, R.; Wang, H.; Dai, F.Z.; Daniele, A.F.; Mostajabi, M.; Basart, S.; Walter, M.R.; et al. DIODE: A Dense Indoor and Outdoor Depth Dataset. arXiv 2019, arXiv:1908.00463.
  62. Varma, G.; Subramanian, A.; Namboodiri, A.; Chandraker, M.; Jawahar, C. IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1743–1751.
  63. Hao, Z.; Wang, Z.; Bai, D.; Tao, B.; Tong, X.; Chen, B. Intelligent detection of steel defects based on improved split attention networks. Front. Bioeng. Biotechnol. 2022, 9, 810876.
Figure 1. Self-supervised monocular depth estimation U-net model with the FPN and the attention mechanism PSA.
Figure 2. The attention mechanism PSA_p module. Channel-only Self-attention in the left half and Spatial-only Self-attention in the right half are parallelized and summed. LN is layer normalization. ⨂ is matrix multiplication.
Figure 3. The attention mechanism PSA_s module. Channel-only Self-attention in the left half and Spatial-only Self-attention in the right half are connected in series. LN is layer normalization. ⨂ is matrix multiplication.
Figure 4. We train our model on the KITTI dataset and compare it with the experimental results of monodepth2 [9] and GCNDepth [58]. The main differences in the experiment are marked with red boxes, and our model achieves better results.
Figure 5. We train our model on KITTI and then test our model by the images from cityscapes, DIODE, IDD. In the first row, the first three pictures are from cityscapes; the fourth picture is from DIODE. The pictures in the second row are from IDD.
Table 1. Validated with Eigen split on the KITTI 2015 dataset and then compared with existing models. M(K): a self-supervised monocular model trained on KITTI, S(K): a self-supervised stereo model trained on KITTI, MS(K): a self-supervised monocular plus stereo model trained on KITTI, K: KITTI. The best results are identified in bold.
| Method | Train | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Zhou, et al. [26] | M(K) | 0.183 | 1.595 | 6.709 | 0.270 | 0.734 | 0.902 | 0.959 |
| Yang, et al. [17] | M(K) | 0.182 | 1.481 | 6.501 | 0.267 | 0.725 | 0.906 | 0.963 |
| Mahjourian, et al. [18] | M(K) | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |
| GeoNet [19] | M(K) | 0.149 | 1.060 | 5.567 | 0.226 | 0.796 | 0.935 | 0.975 |
| DDVO [20] | M(K) | 0.151 | 1.257 | 5.583 | 0.228 | 0.810 | 0.936 | 0.974 |
| DF-Net [21] | M(K) | 0.150 | 1.124 | 5.507 | 0.223 | 0.806 | 0.933 | 0.973 |
| Ranjan, et al. [22] | M(K) | 0.148 | 1.149 | 5.464 | 0.226 | 0.815 | 0.935 | 0.973 |
| EPC++ [23] | M(K) | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| Struct2depth [51] | M(K) | 0.141 | 1.026 | 5.291 | 0.215 | 0.816 | 0.945 | 0.979 |
| Monodepth2 [9] | M(K) | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.980 |
| Klingner, et al. [11] | M(K) | 0.113 | 0.870 | 4.720 | 0.187 | 0.876 | 0.958 | 0.978 |
| Johnston, et al. [37] | M(K) | 0.110 | 0.872 | 4.714 | 0.189 | 0.878 | 0.958 | 0.980 |
| Current study | M(K) | 0.110 | 0.838 | 4.706 | 0.180 | 0.878 | 0.960 | 0.982 |
| Garg, et al. [52] | S(K) | 0.152 | 1.226 | 5.489 | 0.246 | 0.784 | 0.921 | 0.967 |
| Monodepth R50 [53] | S(K) | 0.133 | 1.142 | 5.533 | 0.230 | 0.830 | 0.936 | 0.970 |
| StrAT [54] | S(K) | 0.128 | 1.019 | 5.403 | 0.227 | 0.827 | 0.935 | 0.971 |
| 3Net(R50) [55] | S(K) | 0.129 | 0.996 | 5.281 | 0.223 | 0.831 | 0.939 | 0.974 |
| 3Net(R18) [55] | S(K) | 0.112 | 0.953 | 5.007 | 0.207 | 0.862 | 0.949 | 0.976 |
| Monodepth2 [9] | S(K) | 0.109 | 0.873 | 4.960 | 0.209 | 0.864 | 0.948 | 0.975 |
| Hint-Monodepth [56] | S(K) | 0.111 | 0.912 | 4.977 | 0.205 | 0.862 | 0.950 | 0.977 |
| Current study | S(K) | 0.111 | 0.870 | 4.917 | 0.205 | 0.866 | 0.952 | 0.977 |
| UnDeepVO [57] | MS(K) | 0.183 | 1.730 | 6.571 | 0.268 | | | |
| EPC++ [23] | MS(K) | 0.128 | 0.936 | 5.011 | 0.209 | 0.831 | 0.945 | 0.979 |
| Monodepth2 [9] | MS(K) | 0.107 | 0.819 | 4.751 | 0.198 | 0.873 | 0.955 | 0.977 |
| Current study | MS(K) | 0.110 | 0.857 | 4.741 | 0.190 | 0.882 | 0.960 | 0.981 |
Table 2. Predicted results of our monocular depth estimation model and other current monocular depth estimation models on Make3D. M(K): a self-supervised monocular model trained on KITTI. The best results are identified in bold.
| Method | Train | Abs Rel | Sq Rel | RMSE | RMSE log |
|---|---|---|---|---|---|
| Zhou, et al. [26] | M(K) | 0.386 | 5.328 | 10.472 | 0.478 |
| Monodepth2 [9] | M(K) | 0.324 | 3.586 | 7.415 | 0.164 |
| Johnston, et al. [37] | M(K) | 0.306 | 3.100 | 7.126 | 0.160 |
| Current study | M(K) | 0.284 | 2.903 | 7.011 | 0.149 |
Table 3. Ablation study. ✓: PSA module added, ×: no PSA module added. If neither form of PSA module is added, we also do not add the object detection module FPN. M(K): a self-supervised monocular model trained on KITTI, S(K): a self-supervised stereo model trained on KITTI, MS(K): a self-supervised monocular plus stereo model trained on KITTI. The best results are identified in bold.
| Train | PSA_p | PSA_s | Abs Rel | Sq Rel | RMSE | RMSE log | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|---|
| M(K) | ✓ | × | 0.116 | 0.865 | 4.816 | 0.194 | 0.874 | 0.958 | 0.980 |
| M(K) | × | ✓ | 0.117 | 0.908 | 4.878 | 0.192 | 0.870 | 0.959 | 0.982 |
| M(K) | ✓ | ✓ | 0.110 | 0.838 | 4.706 | 0.180 | 0.878 | 0.960 | 0.982 |
| M(K) | × | × | 0.115 | 0.903 | 4.863 | 0.193 | 0.877 | 0.959 | 0.980 |
| S(K) | ✓ | × | 0.110 | 0.919 | 5.000 | 0.207 | 0.865 | 0.950 | 0.976 |
| S(K) | × | ✓ | 0.110 | 0.893 | 4.958 | 0.206 | 0.886 | 0.950 | 0.977 |
| S(K) | ✓ | ✓ | 0.111 | 0.870 | 4.917 | 0.205 | 0.866 | 0.952 | 0.977 |
| S(K) | × | × | 0.109 | 0.873 | 4.960 | 0.209 | 0.864 | 0.948 | 0.975 |
| MS(K) | ✓ | × | 0.110 | 0.857 | 4.741 | 0.190 | 0.882 | 0.960 | 0.981 |
| MS(K) | × | ✓ | 0.117 | 0.857 | 4.877 | 0.191 | 0.863 | 0.956 | 0.982 |
| MS(K) | ✓ | ✓ | 0.111 | 0.862 | 4.742 | 0.190 | 0.883 | 0.960 | 0.981 |
| MS(K) | × | × | 0.107 | 0.819 | 4.751 | 0.198 | 0.873 | 0.955 | 0.977 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
