Article

SPDepth: Enhancing Self-Supervised Indoor Monocular Depth Estimation via Self-Propagation

1 School of Instrumentation and Optoelectronic Engineering, Key Laboratory of Precision Opto-Mechatronics Technology, Ministry of Education, Beihang University, Beijing 100191, China
2 Qingdao Research Institute, Beihang University, Qingdao 266104, China
3 School of Artificial Intelligence, Beihang University, Beijing 100191, China
4 Aerospace Optical-Microwave Integrated Precision Intelligent Sensing, Key Laboratory of Ministry of Industry and Information Technology, Beihang University, Beijing 100191, China
5 School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
6 Hangzhou Research Institute, Beihang University, Hangzhou 310051, China
7 Nanchang Institute of Technology, Nanchang 330044, China
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(10), 375; https://doi.org/10.3390/fi16100375
Submission received: 25 August 2024 / Revised: 8 October 2024 / Accepted: 15 October 2024 / Published: 16 October 2024
(This article belongs to the Special Issue Machine Learning Techniques for Computer Vision)

Abstract

Due to the existence of low-textured areas in indoor scenes, some self-supervised depth estimation methods have specifically designed sparse photometric consistency losses and geometry-based losses. However, some of these loss terms cannot supervise all the pixels, which limits the performance of such methods. Other approaches introduce an additional optical flow network to provide dense correspondence supervision, but this overloads the loss function. In this paper, we propose to perform depth self-propagation based on feature self-similarities, where high-accuracy depths are propagated from supervised pixels to unsupervised ones. The enhanced self-supervised indoor monocular depth estimation network is called SPDepth. Since depth self-similarities are significant in a local range, a local window self-attention module is embedded at the end of the network to propagate depths within a window. The depth of a pixel is weighted using its feature correlation scores with the other pixels in the same window. The effectiveness of the self-propagation mechanism is demonstrated in experiments on the NYU Depth V2 dataset. The root-mean-squared error of SPDepth is 0.585 and the δ₁ accuracy is 77.6%. Zero-shot generalization studies are also conducted on the 7-Scenes dataset and provide a more comprehensive analysis of the application characteristics of SPDepth.

1. Introduction

Depth estimation plays an important role in various online computer vision applications, including robot navigation, autonomous driving and scene reconstruction. Compared with capturing depth using active devices, it is much easier to integrate a single camera into existing wireless network devices. Obtaining depth values from a single camera is called monocular depth estimation. Self-supervised monocular depth estimation methods have received widespread attention because they do not require ground-truth data.
Without labeled data for supervision, the design of the loss function is of vital importance. The most commonly used loss for depth estimation is based on photometric consistency: the image synthesized by homographic transformation needs to be consistent with the real image. Indoor scenes contain many low-textured regions such as walls, ceilings and tables. It is hard to find correct matches in these regions, which makes the photometric consistency assumption invalid. To solve this problem, many self-supervised works [1,2] specifically designed loss terms including a patch-based photometric consistency loss and geometry-related losses. However, these loss terms are usually not dense enough to cover all the pixels. For example, P2Net [1] proposed a sparse patch-based photometric loss and a plane fitting loss as its two main loss terms to supervise the training of low-textured and planar regions. Figure 1 presents the areas supervised by the two losses.
The white areas in Figure 1d are not covered by either of the two losses, which means that depth ambiguities exist in these areas. Some indoor depth estimation approaches [3,4] introduced an optical flow network to provide dense correspondences during training. Yet an additional optical flow network also needs training and overloads the loss function. In contrast to these works, we do not burden the loss function with dense correspondence computation. Instead, we adopt a self-propagation approach.
Since direct supervision is absent in these regions, propagating depths from well-supervised pixels to those without supervision is a natural choice. The depth map shares high structural similarity with the image itself, so feature self-similarities should be well exploited to perform depth self-propagation. The optical flow estimation method GMFlow [5] has taken advantage of feature self-similarities to propagate optical flow from matched pixels to unmatched ones by adding a simple self-attention layer at the end of the network. Following GMFlow, UniMatch [6] performed similar depth propagation. However, UniMatch conducted supervised stereo depth estimation; the effectiveness of self-propagation on the self-supervised monocular depth estimation task has not been explored.
Inspired by the abovementioned works, we accomplish depth self-propagation by introducing local window self-attention. Feature self-similarity scores are used to weight the depths, so the weighted depths aggregate extra depth information from similar pixels. In simple terms, pixels with similar features are likely to have similar depths.
Depth estimation is a dense prediction task, and global self-attention brings immense computational cost. Moreover, depth changes across the global range cannot be ignored, so there is no need to perform global self-propagation: local depth self-similarities are stronger than global ones. Therefore, we limit self-propagation to a local window range to perform more effective depth enhancement.
Briefly, our contributions are as follows:
  • We propose SPDepth, an enhanced self-supervised indoor monocular depth estimation network. A novel depth self-propagation mechanism addresses the challenge of insufficient self-supervision information, with feature self-similarities providing important cues for the propagation.
  • Considering that depth self-similarities are stronger in the local range than in the global range, local window self-attention is introduced at the end of the network. The proposed strategy limits self-propagation to a local range, which makes the propagation much more effective and substantially reduces computational cost.
  • The experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of SPDepth, which performs well on both fine details and object edges. Zero-shot generalization experiments on the 7-Scenes dataset provide an analysis of the characteristics of SPDepth.

2. Related Work

2.1. Monocular Depth Estimation

Supervised monocular depth estimation requires ground truth data for training. Early supervised approaches [7,8,9,10,11,12,13] tended to formulate depth estimation as pixel-wise regression. A recent method, IEBins [14], used a classification-regression scheme with iterative elastic bins to predict depth. As a research focus of generative learning, diffusion models have also been introduced to monocular depth estimation [15,16]. To fully utilize the geometric regularities of indoor scenes, many works added geometry-related loss functions to improve their performance. Some approaches [17,18] designed a surface normal loss, since the surface normal is an important local feature. There are also classic works such as NDDepth [19,20], P3Depth [21], PlaneReg [22] and PlaneNet [23] that leverage plane-aware supervision to achieve significantly better accuracy.
Without ground truth as supervision, self-supervised depth estimation usually relies on an image synthesis loss as the main supervision. Inspired by structure-from-motion, SfMLearner [24] first proposed the image synthesis loss for self-supervised depth learning. The classic work Monodepth2 [25] focused on the issues of occlusion and moving objects in outdoor driving scenes. Although Monodepth2 performed well on KITTI [26], it still could not handle the challenge of low-textured regions in indoor scenes, where the original image synthesis loss can cause serious matching errors. To enhance the estimation results in indoor scenes, P2Net [1] designed a patch-based photometric consistency constraint to avoid nondiscriminative matches, together with a planar consistency loss for better predictions in planar low-textured regions. Other effective geometry-aware loss terms have also been proposed, such as the co-planar loss and the Manhattan normal loss in StructDepth [2].
However, the current geometry-based losses cannot provide sufficiently dense supervision: not all pixels are covered. To handle this issue, F2Depth [3] introduced a self-supervised optical flow network to provide extra pixel motion supervision for depth learning. Yet it is still hard to compute high-accuracy optical flow for all pixels, since some pixels are occluded or lie beyond the image boundaries and cannot be matched. DistDepth [27] transferred structure knowledge from a supervised expert network, DPT [28,29], to the student network DepthNet. Although DistDepth learnt densely supervised structural depth, it did not completely eliminate the reliance on ground truth.

2.2. Self-Attention

The self-attention mechanism has achieved great success in natural language processing. It has also been widely applied in many computer vision tasks, such as semantic segmentation [30], image classification [31] and object detection [32]. In monocular depth estimation [35,36,37,38,39,40], CNNs have frequently been replaced by the self-attention-based Transformer [33] and its variant, the Swin Transformer [34]. A Transformer was used for feature enhancement by self-supervised works such as GasMono [41] and Lite-Mono [42]. Similar to the work in [43], ADAADepth [44] and CADepth-Net [45] introduced a self-attention module after the ResNet encoder to explore global information and predict depth for noncontiguous regions. Although self-attention modules have been embedded in the mentioned works, few self-supervised methods feed the depth itself into the self-attention layer, and the feature self-similarities of the image are not well exploited to perform a weighted refinement of depth.
Similar to depth, the optical flow field also shares high structural similarities with the image. Inspired by the Transformer, a global motion aggregation module based on self-attention was introduced in [46]; the globally aggregated motion features helped propagate flow from matched pixels to unmatched ones. GMFlow [5] added a self-attention layer after the estimated optical flow, which also improved the results for unmatched pixels. Based on GMFlow, UniMatch [6] extended the optical flow task to stereo matching and stereo depth estimation and validated the effectiveness of self-propagation on stereo depth estimation. However, UniMatch performs supervised depth estimation from posed stereo images rather than a single image. Few works have explored the effect of depth self-propagation on a single image in a self-supervised manner.
Inspired by the abovementioned optical flow estimation works, we propose to enhance self-supervised indoor monocular depth estimation with self-propagation. The proposed framework is called SPDepth.

3. Methods

The proposed self-supervised indoor depth estimation framework SPDepth is introduced in this section. Section 3.1 provides an overview of SPDepth. In Section 3.2, the local window self-propagation is explained in detail. Section 3.3 presents the overall loss function.

3.1. Overview of SPDepth

We introduce a self-propagation module to spread the high-quality depth from the well-supervised regions to those lacking supervision. The pipeline of SPDepth is presented in Figure 2.
It can be noticed that the depth map and the target image share high structural similarities. To improve the performance in areas that are not supervised, we exploit the self-similarity of features by introducing a self-attention mechanism. The general self-attention is formulated as follows:
$\tilde{D} = \mathrm{softmax}\!\left( \dfrac{F_Q F_K^{T}}{\sqrt{C}} \right) D$  (1)
where $C$ denotes the number of feature channels, $D$ represents the depth produced by the DepthCNN and $\tilde{D}$ denotes the propagated depth. The softmax function normalizes the self-attention weights. $F_Q$ and $F_K$ stand for the projected query and key features, and the projections are written as follows:
$F_Q = W_{qry} F$  (2)
$F_K = W_{key} F$  (3)
where $F \in \mathbb{R}^{H \times W \times C}$ denotes the up-sampled features, $H$ and $W$ are the height and width of the image, and $W_{qry} \in \mathbb{R}^{C \times C}$ and $W_{key} \in \mathbb{R}^{C \times C}$ are linear transformation matrices. The propagated depth $\tilde{D}$ integrates depth information from other pixels.
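To make the formulation concrete, the following is a minimal PyTorch sketch of the general (global) self-attention propagation in Equations (1)-(3). The module, tensor layout and the scaled dot-product are our assumptions for illustration and do not reproduce the authors' implementation.

```python
# Minimal sketch of global depth self-propagation (Equations (1)-(3)).
import torch
import torch.nn as nn


class GlobalDepthPropagation(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Linear projections W_qry and W_key (C x C), Equations (2)-(3).
        self.w_qry = nn.Linear(channels, channels, bias=False)
        self.w_key = nn.Linear(channels, channels, bias=False)
        self.channels = channels

    def forward(self, feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        """feat: (B, H, W, C) up-sampled features; depth: (B, H, W, 1)."""
        b, h, w, c = feat.shape
        f = feat.view(b, h * w, c)
        f_q = self.w_qry(f)                                   # (B, HW, C)
        f_k = self.w_key(f)                                   # (B, HW, C)
        # Feature self-similarity scores, Equation (1); the (HW x HW) attention
        # matrix is what makes the global variant so expensive at full resolution.
        attn = torch.softmax(f_q @ f_k.transpose(1, 2) / c ** 0.5, dim=-1)
        d = depth.view(b, h * w, 1)
        d_prop = attn @ d                                     # weighted depth aggregation
        return d_prop.view(b, h, w, 1)
```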
In previous optical flow estimation works [5,46], global propagation was mostly adopted because optical flow has high self-similarities over the global range. However, since depth values vary across different regions of the image, it is improper to perform propagation globally; moreover, global depth self-propagation leads to excessive computational cost. Instead, we use local window self-propagation to weight the depths within a neighborhood.

3.2. Local Window Self-Propagation

To better utilize the local self-similarities provided by pixels with solid supervision, a local window self-propagation strategy is designed. The strategy guarantees that the propagation is performed within a reliable range. A detailed diagram of the self-propagation module is provided in Figure 3.
The split key feature windows together with the projected query features are used for computing correlations. Figure 4 illustrates the window splitting process. Each key feature window is centered on the pixel $P_{ij}$ to be predicted, so the total number of windows is $H \times W$. With window radius $R$ and $K = 2R + 1$, the windows are of size $K \times K$. $F_K^{WIN} \in \mathbb{R}^{H \times W \times C \times K^2}$ denotes the partitioned and reshaped key feature windows.
$F_Q \in \mathbb{R}^{H \times W \times 1 \times C}$ denotes the projected and reshaped query features. The correlations are computed through a simple matrix dot-product operation, and the normalized window attention scores $S \in \mathbb{R}^{H \times W \times 1 \times K^2}$ are calculated as follows:
$S = \mathrm{softmax}\!\left( \dfrac{F_Q F_K^{WIN}}{\sqrt{C}} \right)$  (4)
where $S$ reveals the similarity distribution of each feature vector with respect to its $K^2$ neighboring feature vectors.
The depth $D \in \mathbb{R}^{H \times W \times 1}$ is partitioned into overlapping windows $D^{WIN} \in \mathbb{R}^{H \times W \times K^2 \times 1}$ in the same way as the key features. The propagated depth $\tilde{D} \in \mathbb{R}^{H \times W \times 1}$ is calculated as follows:
$\tilde{D} = S \, D^{WIN}$  (5)
As the output of the self-attention module, the propagated depth is weighted by local feature self-similarities. This operation helps propagate high-accuracy depths from supervised regions to those without supervision. Following the above calculation process, the computational complexity of the local window self-propagation is
$\Omega_{WIN} = 2HWC^2 + (1 + C)HWK^2$  (6)
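The sketch below shows how the local window self-propagation could be implemented in PyTorch with zero padding and unfolding, following the window construction of Figure 4 and Equations (4)-(5). Function and variable names are illustrative and not taken from the authors' code.

```python
# Sketch of local window self-propagation (Equations (4)-(5), Figure 4).
import torch
import torch.nn.functional as F


def local_window_propagation(f_q: torch.Tensor, f_k: torch.Tensor,
                             depth: torch.Tensor, radius: int = 1) -> torch.Tensor:
    """f_q, f_k: (B, C, H, W) projected query/key features; depth: (B, 1, H, W)."""
    b, c, h, w = f_q.shape
    k = 2 * radius + 1                                          # window size K = 2R + 1
    # Split key features into overlapping K x K windows, zero-padded at the edges.
    f_k_win = F.unfold(f_k, kernel_size=k, padding=radius)      # (B, C*K^2, H*W)
    f_k_win = f_k_win.view(b, c, k * k, h * w)                  # (B, C, K^2, HW)
    # Correlate each query vector with its K^2 neighbouring key vectors.
    q = f_q.view(b, c, 1, h * w)
    scores = (q * f_k_win).sum(dim=1) / c ** 0.5                # (B, K^2, HW)
    attn = torch.softmax(scores, dim=1)                         # Equation (4)
    # Partition the depth map into the same windows and take the weighted sum.
    d_win = F.unfold(depth, kernel_size=k, padding=radius)      # (B, K^2, HW)
    d_prop = (attn * d_win).sum(dim=1, keepdim=True)            # Equation (5)
    return d_prop.view(b, 1, h, w)
```

As a rough check of Equation (6) with the 288 × 384 training resolution of Section 4.1, C = 64 and R = 1 (K = 3), the projection term 2HWC² is about 9.1 × 10⁸ operations and the window term (1 + C)HWK² about 6.5 × 10⁷, so the linear projections account for most of the compute at this window size.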

3.3. Overall Loss Functions

For the proposed depth estimation network SPDepth, the complete loss function is formulated as follows:
$L = L_{ph} + \lambda_1 L_{sm} + \lambda_2 L_{spp}$  (7)
where $L_{ph}$ is the patch-based photometric loss, $L_{sm}$ is the depth smoothness loss and $L_{spp}$ denotes the plane fitting loss. All the loss terms are defined in the same way as in P2Net [1]. Following the baseline [1], $\lambda_1$ is set to 0.001 and $\lambda_2$ is set to 0.05 in the experiments.
To be specific, let $I_t$ denote the target image and $I_s$ denote the source image; the patch-based photometric loss $L_{ph}$ is written as
$L_{L1} = \left\| I_t[P_i^{t}] - I_s[P_i^{t \to s}] \right\|_1$  (8)
$L_{SSIM} = \mathrm{SSIM}\!\left( I_t[P_i^{t}],\ I_s[P_i^{t \to s}] \right)$  (9)
$L_{ph} = \alpha L_{SSIM} + (1 - \alpha) L_{L1}$  (10)
where $\alpha$ is set to 0.85, SSIM denotes the structural similarity and $P$ represents the $3 \times 3$ patches defined as follows:
$P = \left\{ (x_i + x_k,\ y_i + y_k) \mid x_k \in \{-N, 0, N\},\ y_k \in \{-N, 0, N\},\ 1 \le i \le n \right\}$  (11)
where $(x_i, y_i)$ is the coordinate of the $i$-th extracted key point, $n$ denotes the number of key points and $N$ is set to 2.
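As an illustration of Equations (8)-(10), the sketch below computes the patch-based photometric loss given target and warped source patch intensities that have already been sampled at the key points. The `ssim_patch` helper and the use of the common (1 - SSIM)/2 dissimilarity for the SSIM term are our assumptions; the paper states the loss more compactly and P2Net's exact implementation may differ.

```python
# Hedged sketch of the patch-based photometric loss (Equations (8)-(10)).
import torch


def ssim_patch(x: torch.Tensor, y: torch.Tensor, c1=0.01 ** 2, c2=0.03 ** 2):
    """x, y: (B, n, P) patch intensities, P = 9 pixels per 3 x 3 patch."""
    mu_x, mu_y = x.mean(-1), y.mean(-1)
    var_x, var_y = x.var(-1, unbiased=False), y.var(-1, unbiased=False)
    cov = ((x - mu_x[..., None]) * (y - mu_y[..., None])).mean(-1)
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim.clamp(-1, 1)


def patch_photometric_loss(target_patches, warped_patches, alpha=0.85):
    """Combine an SSIM term and an L1 term over the sampled patches (Equation (10))."""
    l1 = (target_patches - warped_patches).abs().mean()
    # SSIM term written as the usual (1 - SSIM) / 2 dissimilarity (our assumption).
    l_ssim = ((1 - ssim_patch(target_patches, warped_patches)) / 2).mean()
    return alpha * l_ssim + (1 - alpha) * l1
```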
The depth smoothness loss $L_{sm}$ is calculated as
$L_{sm} = \left| \partial_x d_t^{*} \right| e^{-\left| \partial_x I_t \right|} + \left| \partial_y d_t^{*} \right| e^{-\left| \partial_y I_t \right|}$  (12)
where $d_t^{*} = d_t / \bar{d}_t$ denotes the mean-normalized depth.
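A minimal sketch of this edge-aware smoothness term (Equation (12)), written in the usual Monodepth-style form; the tensor layout is assumed.

```python
# Edge-aware depth smoothness loss (Equation (12)).
import torch


def smoothness_loss(depth: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
    """depth: (B, 1, H, W); image: (B, 3, H, W)."""
    # Mean-normalized depth d_t* = d_t / mean(d_t).
    d = depth / (depth.mean(dim=[2, 3], keepdim=True) + 1e-7)
    dx_d = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy_d = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # Down-weight depth gradients where the image itself has strong edges.
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```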
The plane fitting loss $L_{spp}$ is formulated as follows:
$L_{spp} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left| D(p_n) - D'(p_n) \right|$  (13)
where $M$ is the number of extracted planar super-pixels and $N$ is the number of pixels within each super-pixel. $D(p_n)$ is the predicted depth used to back-project the pixel into 3D space to fit the plane parameters, and $D'(p_n)$ is the depth recovered from the fitted plane parameters and the camera intrinsic parameters.
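The plane fitting loss of Equation (13) can be sketched as below for a single super-pixel: the predicted depths are back-projected with the camera intrinsics, a plane is fitted, and the plane-induced depth is compared with the prediction. The least-squares plane model and all variable names are our assumptions, not the authors' implementation.

```python
# Hedged sketch of the plane fitting loss (Equation (13)) for one super-pixel.
import torch


def plane_fitting_loss(depth: torch.Tensor, pixels: torch.Tensor,
                       K_inv: torch.Tensor) -> torch.Tensor:
    """depth: (N,) predicted depths of one super-pixel;
    pixels: (N, 3) homogeneous pixel coordinates; K_inv: (3, 3) inverse intrinsics."""
    rays = pixels @ K_inv.T                       # back-projection directions
    points = rays * depth[:, None]                # 3D points of the super-pixel
    # Fit the plane n^T X = 1 by least squares (assumes the plane misses the origin).
    ones = torch.ones(points.shape[0], 1).to(points)
    n = torch.linalg.lstsq(points, ones).solution
    # Depth induced by the fitted plane: D'(p) = 1 / (n^T K^{-1} p).
    plane_depth = 1.0 / ((rays @ n).squeeze(-1) + 1e-7)
    return (depth - plane_depth).abs().mean()
```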

4. Experiments

4.1. Implementation Details

Datasets: We trained and evaluated SPDepth on the NYU Depth V2 dataset [47], which was recorded with a Microsoft Kinect and contains 582 indoor scenes in total. The same training set as in previous works [1,4] was used. The training split is composed of 21,483 RGB images captured in 283 scenes, sampled at intervals of 10 frames. For each target frame $I_t$, its source frames $\{I_{t-1}, I_{t+1}\}$ were used for training. We evaluated SPDepth on the official test set containing 654 RGB-D images.
We performed zero-shot generalization studies on the 7-Scenes dataset [48], which is made up of 7 indoor scenes. We conducted experiments on the official test image sequences, sampled at intervals of 10 frames, giving a final test set of 1700 images.
Experimental setup: The architecture of SPDepth was based on P2Net [1]. During training, images were randomly left-right flipped and color-augmented; the color augmentation includes brightness, contrast, saturation and hue changes. The size of the training images was 288 × 384. The local attention window radius $R$ was set to 1. Adam [49] was adopted as the optimizer. We trained SPDepth on a single NVIDIA GeForce RTX 3090 GPU for 37 epochs, which took about 15 h with a batch size of 10. The learning rate was set to 1 × 10−4 for the first 27 epochs and 1 × 10−5 for the last 10 epochs.
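For reference, the training hyper-parameters stated above can be summarized as follows; this is a plain summary for the reader, not the authors' configuration file.

```python
# Summary of the stated training setup (values taken from the text above).
train_config = {
    "image_size": (288, 384),          # H x W of training images
    "window_radius": 1,                # local attention window radius R
    "optimizer": "Adam",
    "batch_size": 10,
    "epochs": 37,
    "lr_schedule": {"epochs_1_to_27": 1e-4, "epochs_28_to_37": 1e-5},
    "augmentation": ["left-right flip", "brightness", "contrast",
                     "saturation", "hue"],
    "gpu": "NVIDIA GeForce RTX 3090",
}
```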
Evaluation metrics: Because of scale ambiguity, the median scaling strategy used in previous methods [1,3,4] was adopted. The evaluation metrics were calculated according to
$\mathrm{Abs\ Rel} = \dfrac{1}{N} \sum_{i \in N} \dfrac{\left| D_i - \hat{D}_i \right|}{\hat{D}_i}$  (14)
$\mathrm{RMS} = \sqrt{ \dfrac{1}{N} \sum_{i \in N} \left\| D_i - \hat{D}_i \right\|^2 }$  (15)
$\mathrm{Mean}\ \log 10 = \dfrac{1}{N} \sum_{i \in N} \left| \log_{10} D_i - \log_{10} \hat{D}_i \right|$  (16)
$\mathrm{Accuracies} = \%\ \text{of}\ D_i\ \text{s.t.}\ \max\!\left( \dfrac{D_i}{\hat{D}_i},\ \dfrac{\hat{D}_i}{D_i} \right) = \delta < thr$  (17)
where $\hat{D}$ stands for the ground truth, $D$ denotes the predicted depth, $N$ is the number of pixels and $thr$ represents the thresholds (1.25, 1.25², 1.25³) for calculating accuracies. The accuracies in Equation (17) stand for the percentage of predicted depths within a certain ratio of the ground truth depths, and the thresholds limit that ratio.
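A sketch of how the metrics in Equations (14)-(17) and the median scaling step could be computed over the valid pixels; array names are illustrative.

```python
# Evaluation metrics with median scaling (Equations (14)-(17)).
import numpy as np


def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: 1-D arrays of valid predicted and ground-truth depths."""
    pred = pred * np.median(gt) / np.median(pred)               # median scaling
    abs_rel = np.mean(np.abs(pred - gt) / gt)                   # Equation (14)
    rms = np.sqrt(np.mean((pred - gt) ** 2))                    # Equation (15)
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt)))      # Equation (16)
    ratio = np.maximum(pred / gt, gt / pred)                    # Equation (17)
    acc = {thr: np.mean(ratio < thr) for thr in (1.25, 1.25 ** 2, 1.25 ** 3)}
    return abs_rel, rms, log10, acc
```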

4.2. Results

4.2.1. NYU Depth V2 Results

We evaluated SPDepth on the official test split of NYU Depth V2 and compared it with existing supervised and self-supervised methods; Table 1 shows the results. Compared with the baseline P2Net [1], SPDepth improved on almost all metrics, regardless of whether postprocessing was performed at test time. The postprocessing refers to the left-right flipping augmentation proposed in [50]: the image is fed into the model twice, once normally and once flipped left-right, and the final depth is the average of the two predictions. The results in Table 1 demonstrate that the proposed self-propagation mechanism is effective for indoor monocular depth estimation.
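A minimal sketch of this left-right flipping post-processing, assuming `model` is a placeholder for the trained depth network and that the two predictions are simply averaged as described above:

```python
# Test-time left-right flipping post-processing.
import torch


def predict_with_pp(model, image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W); returns the averaged depth prediction."""
    depth = model(image)
    depth_flipped = model(torch.flip(image, dims=[3]))        # flip along width
    depth_from_flip = torch.flip(depth_flipped, dims=[3])      # flip prediction back
    return 0.5 * (depth + depth_from_flip)
```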
For inference on a 640 × 480 image, P2Net requires 1.68 GB of GPU memory and 9.4 ms, while SPDepth needs 2.30 GB and 15.1 ms. This shows that the addition of self-propagation does not consume excessive computing or time resources.
Figure 5 compares the depth visualization results of P2Net and SPDepth. SPDepth preserves more details and produces much sharper results at object boundaries. In the framed areas, P2Net failed to make correct predictions, while SPDepth predicted these areas more accurately. The addition of the self-attention layer achieves effective self-propagation of depth, and the enhanced feature encoding compensates for the limited self-supervision information to a certain extent. Benefiting from the self-propagation mechanism, the overall depth results are enhanced.

4.2.2. 7-Scenes Results

Zero-shot generalization experiments were performed on the 7-Scenes dataset. The model trained on NYU Depth V2 was directly tested without being fine-tuned. The generalization results of our SPDepth and P2Net are compared in Table 2.
In terms of the average values in the last row, SPDepth achieved generalization performance similar to P2Net. In the three scenes Heads, Pumpkin and RedKitchen, SPDepth outperformed P2Net, while in the remaining four scenes it generalized slightly worse. There was a notable performance gap between the two methods in the scene Stairs. To analyze the cause of the gap, we selected a representative test image from Stairs; its predicted depth is visualized in Figure 6.
In Figure 6, the regions framed in black and green should in theory show obvious depth changes. P2Net correctly predicted the depth difference, whereas SPDepth could not distinguish it clearly. The results reveal that P2Net is better at separating the depth of an entire plane from its neighboring regions, and that SPDepth can incorrectly propagate depth from planar regions to neighboring areas with similar appearance. In other words, the propagation can introduce mistakes.

4.2.3. Ablation Studies

Ablation studies were conducted on the NYU Depth V2 dataset to explore the impact of the local attention window radius and of using the encoder output feature $F_{ori}$ without up-sampling. Table 3 compares the model performance, training time and inference GPU memory use for different local window radii. Setting the radius to 1 led to the best performance, the shortest training time and the smallest memory use.
Table 4 presents the results of using different features $F_{ori}$. Although the number of feature channels increased from 64 to 128, the performance still decreased due to the much lower resolution of $F_{ori}$.

5. Conclusions and Discussion

In this paper, an enhanced self-supervised indoor monocular depth estimation network, SPDepth, is proposed. To address the challenge of insufficient self-supervision information, we introduce a learnable module instead of burdening the loss function. A self-propagation module is placed at the end of SPDepth to propagate high-quality depth results from supervised pixels to unsupervised ones. The computation of self-propagation is based on local window self-attention: the depth results from the DepthCNN are partitioned into windows and weighted by the window attention scores. The propagation is limited to a local window because local depth similarities are much more significant than global ones. The contributions are proven effective through experiments on the NYU Depth V2 dataset. In the generalization study, SPDepth achieves nearly the same performance as the baseline. It is also observed that SPDepth cannot clearly distinguish the depth differences between planar regions and their adjacent regions, so the precision of depth self-propagation needs further improvement. For future work, a relative position bias will be added to the computation of self-propagation to provide relative position information for better handling of the propagation process. A Transformer module will also be exploited to enhance the features used for computing self-similarities.

Author Contributions

Conceptualization, H.Z., X.G. and B.Z.; methodology, S.S.; software, X.G.; formal analysis, X.G. and S.S.; investigation, X.G.; resources, H.Z. and X.L.; data curation, X.G.; writing—original draft preparation, X.G.; writing—review and editing, S.S., N.L. and X.L.; visualization, X.G.; supervision, H.Z. and N.L.; project administration, N.L., X.L. and B.Z.; funding acquisition, H.Z. and B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Application Innovation Project of CASC (grant number: KZ36006202); National Key Research and Development Program of China (grant number: 2023YFC3300029); Zhejiang Provincial Natural Science Foundation of China (grant number: LD24F020007); “One Thousand Plan” projects in Jiangxi Province (grant number: Jxsg2023102268).

Data Availability Statement

The data were derived from the following resources available in the public domain: https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html (accessed on 14 October 2024) and https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/ (accessed on 14 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, Z.; Jin, L.; Gao, S. P2Net: Patch-match and plane-regularization for unsupervised indoor depth estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV; pp. 206–222. [Google Scholar]
  2. Li, B.; Huang, Y.; Liu, Z.; Zou, D.; Yu, W. StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12663–12673. [Google Scholar]
  3. Guo, X.; Zhao, H.; Shao, S.; Li, X.; Zhang, B. F2Depth: Self-supervised indoor monocular depth estimation via optical flow consistency and feature map synthesis. Eng. Appl. Artif. Intell. 2024, 133, 108391. [Google Scholar] [CrossRef]
  4. Zhou, J.; Wang, Y.; Qin, K.; Zeng, W. Moving Indoor: Unsupervised video depth learning in challenging environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8618–8627. [Google Scholar]
  5. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Tao, D. GMFlow: Learning optical flow via global matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8121–8130. [Google Scholar]
  6. Xu, H.; Zhang, J.; Cai, J.; Rezatofighi, H.; Yu, F.; Tao, D. Unifying flow, stereo and depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13941–13958. [Google Scholar] [CrossRef] [PubMed]
  7. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar]
  8. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
  9. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
  10. Li, J.; Klein, R.; Yao, A. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3372–3380. [Google Scholar]
  11. Zhang, S.; Yang, L.; Mi, M.B.; Zheng, X.; Yao, A. Improving deep regression with ordinal entropy. arXiv 2023, arXiv:2301.08915. [Google Scholar]
  12. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  13. Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 2024–2039. [Google Scholar] [CrossRef] [PubMed]
  14. Shao, S.; Pei, Z.; Wu, X.; Liu, Z.; Chen, W.; Li, Z. IEBins: Iterative elastic bins for monocular depth estimation. Adv. Neural Inf. Process. Syst. 2024, 36, 53025–53037. [Google Scholar]
  15. Zhao, W.; Rao, Y.; Liu, Z.; Liu, B.; Zhou, J.; Lu, J. Unleashing text-to-image diffusion models for visual perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5729–5739. [Google Scholar]
  16. Ji, Y.; Chen, Z.; Xie, E.; Hong, L.; Liu, X.; Liu, Z.; Lu, T.; Li, Z.; Luo, P. DDP: Diffusion model for dense visual prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 21741–21752. [Google Scholar]
  17. Hu, J.; Ozay, M.; Zhang, Y.; Okatani, T. Revisiting single image depth estimation: Toward higher resolution maps with accurate object boundaries. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1043–1051. [Google Scholar]
  18. Yin, W.; Liu, Y.; Shen, C.; Yan, Y. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5684–5693. [Google Scholar]
  19. Shao, S.; Pei, Z.; Chen, W.; Chen, P.C.; Li, Z. NDDepth: Normal-distance assisted monocular depth estimation and completion. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–17. [Google Scholar] [CrossRef] [PubMed]
  20. Shao, S.; Pei, Z.; Chen, W.; Wu, X.; Li, Z. NDDepth: Normal-distance assisted monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7931–7940. [Google Scholar]
  21. Patil, V.; Sakaridis, C.; Liniger, A.; Van Gool, L. P3Depth: Monocular depth estimation with a piecewise planarity prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1610–1621. [Google Scholar]
  22. Yu, Z.; Zheng, J.; Lian, D.; Zhou, Z.; Gao, S. Single-image piece-wise planar 3D reconstruction via associative embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1029–1037. [Google Scholar]
  23. Liu, C.; Yang, J.; Ceylan, D.; Yumer, E.; Furukawa, Y. PlaneNet: Piece-wise planar reconstruction from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2579–2588. [Google Scholar]
  24. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1851–1858. [Google Scholar]
  25. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
  26. Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant cnns. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20. [Google Scholar]
  27. Wu, C.-Y.; Wang, J.; Hall, M.; Neumann, U.; Su, S. Toward practical monocular indoor depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3814–3824. [Google Scholar]
  28. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12179–12188. [Google Scholar]
  29. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]
  30. Zhang, F.; Panahi, A.; Gao, G. FsaNet: Frequency self-attention for semantic segmentation. IEEE Trans. Image Process. 2023, 32, 4757–4772. [Google Scholar] [CrossRef] [PubMed]
  31. Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3286–3295. [Google Scholar]
  32. Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Agarwal, A.; Arora, C. Attention attention everywhere: Monocular depth prediction with skip attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5861–5870. [Google Scholar]
  36. Ning, J.; Li, C.; Zhang, Z.; Wang, C.; Geng, Z.; Dai, Q.; He, K.; Hu, H. All in tokens: Unifying output space of visual tasks via soft token. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19900–19910. [Google Scholar]
  37. Shao, S.; Pei, Z.; Chen, W.; Li, R.; Liu, Z.; Li, Z. URCDC-Depth: Uncertainty rectified cross-distillation with cutflip for monocular depth estimation. IEEE Trans. Multimed. 2023, 26, 3341–3353. [Google Scholar] [CrossRef]
  38. Piccinelli, L.; Sakaridis, C.; Yu, F. iDisc: Internal discretization for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21477–21487. [Google Scholar]
  39. Li, Z.; Wang, X.; Liu, X.; Jiang, J. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Trans. Image Process. 2024, 33, 3964–3976. [Google Scholar] [CrossRef] [PubMed]
  40. Li, Z.; Chen, Z.; Liu, X.; Jiang, J. Depthformer: Exploiting long-range correlation and local information for accurate monocular depth estimation. Mach. Intell. Res. 2023, 20, 837–854. [Google Scholar] [CrossRef]
  41. Zhao, C.; Poggi, M.; Tosi, F.; Zhou, L.; Sun, Q.; Tang, Y.; Mattoccia, S. GasMono: Geometry-aided self-supervised monocular depth estimation for indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16209–16220. [Google Scholar]
  42. Zhang, N.; Nex, F.; Vosselman, G.; Kerle, N. Lite-Mono: A lightweight cnn and transformer architecture for self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18537–18546. [Google Scholar]
  43. Johnston, A.; Carneiro, G. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4756–4765. [Google Scholar]
  44. Kaushik, V.; Jindgar, K.; Lall, B. ADAADepth: Adapting data augmentation and attention for self-supervised monocular depth estimation. IEEE Robot. Autom. Lett. 2021, 6, 7791–7798. [Google Scholar] [CrossRef]
  45. Yan, J.; Zhao, H.; Bu, P.; Jin, Y. Channel-wise attention-based network for self-supervised monocular depth estimation. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 464–473. [Google Scholar]
  46. Jiang, S.; Campbell, D.; Lu, Y.; Li, H.; Hartley, R. Learning to estimate hidden motions with global motion aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9772–9781. [Google Scholar]
  47. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 746–760. [Google Scholar]
  48. Shotton, J.; Glocker, B.; Zach, C.; Izadi, S.; Criminisi, A.; Fitzgibbon, A. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2930–2937. [Google Scholar]
  49. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  50. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar]
  51. Liu, M.; Salzmann, M.; He, X. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 716–723. [Google Scholar]
  52. Li, B.; Shen, C.; Dai, Y.; Van Den Hengel, A.; He, M. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1119–1127. [Google Scholar]
  53. Jun, J.; Lee, J.-H.; Lee, C.; Kim, C.-S. Depth map decomposition for monocular depth estimation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part II; pp. 18–34. [Google Scholar]
  54. Zhao, W.; Liu, S.; Shu, Y.; Liu, Y.-J. Towards better generalization: Joint depth-pose learning without posenet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9151–9161. [Google Scholar]
  55. Zhang, Y.; Gong, M.; Li, J.; Zhang, M.; Jiang, F.; Zhao, H. Self-supervised monocular depth estimation with multiscale perception. IEEE Trans. Image Process. 2022, 31, 3251–3266. [Google Scholar] [CrossRef] [PubMed]
  56. Song, X.; Hu, H.; Liang, L.; Shi, W.; Xie, G.; Lu, X.; Hei, X. Unsupervised monocular estimation of depth and visual odometry using attention and depth-pose consistency loss. IEEE Trans. Multimed. 2023, 26, 3517–3529. [Google Scholar] [CrossRef]
  57. Bian, J.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.-M.; Reid, I. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
Figure 1. The areas being supervised in P2Net [1]: (a) RGB image; (b) pixels supervised by the patch-based photometric loss; (c) pixels supervised by the plane fitting loss; (d) all the pixels being supervised. The pixels in black are supervised.
Figure 2. The pipeline of SPDepth. Features with a resolution half the size of the training images are encoded from the target image. The features are then up-sampled to the same size as the image. The depths from DepthCNN and up-sampled features F are together input to the self-propagation module. Based on self-attention, the depths are weighted with the self-similarity scores to perform propagation.
Figure 3. Details of the self-propagation module. The projected key features are split into feature windows. The window attention scores are calculated through the self-attention operation. The depths are then weighted in respective windows.
Figure 4. Zeros are padded on both sides of the input to form windows at the edges. The number of zeros for each dimension equals the window radius. Both key feature windows and depth windows are partitioned in this way.
Figure 5. Visualization results on the NYU Depth V2 dataset. (a) RGB image; (b) P2Net [1]; (c) our SPDepth; (d) ground truth.
Figure 6. Visualization of generalized results in the scene Stairs of the 7-Scenes dataset. (a) RGB image; (b) P2Net [1]; (c) our SPDepth; (d) ground truth.
Table 1. The performance comparison of SPDepth and other methods on the NYU Depth V2 dataset. PP represents postprocessing. The best results are shown in bold. ↓ means the lower the value, the better. ↑ means the higher the value, the better.
| Methods | Supervision | REL ↓ | RMS ↓ | Log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
| Liu [51] | √ | 0.335 | 1.060 | 0.127 | - | - | - |
| Li [52] | √ | 0.232 | 0.821 | 0.094 | 0.621 | 0.886 | 0.968 |
| Liu [13] | √ | 0.213 | 0.759 | 0.087 | 0.650 | 0.906 | 0.976 |
| Eigen [9] | √ | 0.158 | 0.641 | - | 0.769 | 0.950 | 0.988 |
| Li [10] | √ | 0.143 | 0.635 | 0.063 | 0.788 | 0.958 | 0.991 |
| PlaneNet [23] | √ | 0.142 | 0.514 | 0.060 | 0.827 | 0.963 | 0.990 |
| PlaneReg [22] | √ | 0.134 | 0.503 | 0.057 | 0.827 | 0.963 | 0.990 |
| Laina [8] | √ | 0.127 | 0.573 | 0.055 | 0.811 | 0.953 | 0.988 |
| DORN [7] | √ | 0.115 | 0.509 | 0.051 | 0.828 | 0.965 | 0.992 |
| VNL [18] | √ | 0.108 | 0.416 | 0.048 | 0.875 | 0.976 | 0.994 |
| P3Depth [21] | √ | 0.104 | 0.356 | 0.043 | 0.898 | 0.981 | 0.996 |
| Jun [53] | √ | 0.100 | 0.362 | 0.043 | 0.907 | 0.986 | 0.997 |
| DDP [16] | √ | 0.094 | 0.329 | 0.040 | 0.921 | 0.990 | 0.998 |
| Moving Indoor [4] | × | 0.208 | 0.712 | 0.086 | 0.674 | 0.900 | 0.968 |
| TrianFlow [54] | × | 0.189 | 0.686 | 0.079 | 0.701 | 0.912 | 0.978 |
| Zhang [55] | × | 0.177 | 0.634 | - | 0.733 | 0.936 | - |
| Monodepth2 [25] | × | 0.170 | 0.617 | 0.072 | 0.748 | 0.942 | 0.986 |
| ADPDepth [56] | × | 0.165 | 0.592 | 0.071 | 0.753 | 0.934 | 0.981 |
| SC-Depth [57] | × | 0.159 | 0.608 | 0.068 | 0.772 | 0.939 | 0.982 |
| P2Net [1] | × | 0.159 | 0.599 | 0.068 | 0.772 | 0.942 | 0.984 |
| SPDepth | × | 0.159 | 0.585 | 0.067 | 0.776 | 0.946 | 0.986 |
| P2Net + PP [1] | × | 0.157 | 0.592 | 0.067 | 0.777 | 0.944 | 0.985 |
| SPDepth + PP | × | 0.157 | 0.579 | 0.066 | 0.781 | 0.947 | 0.986 |
Table 2. Zero-shot generalization results of our SPDepth and P2Net [1] on the 7-Scenes dataset. PP represents postprocessing. The best results are shown in bold. ↓ means the lower the value, the better. ↑ means the higher the value, the better.
| Methods | Our SPDepth + PP | | | | | | P2Net [1] + PP | | | | | |
| Scene | REL ↓ | RMS ↓ | Log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | REL ↓ | RMS ↓ | Log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ |
| Chess | 0.190 | 0.420 | 0.082 | 0.667 | 0.939 | 0.993 | 0.183 | 0.408 | 0.081 | 0.669 | 0.940 | 0.993 |
| Fire | 0.163 | 0.305 | 0.070 | 0.741 | 0.960 | 0.995 | 0.157 | 0.291 | 0.068 | 0.767 | 0.965 | 0.994 |
| Heads | 0.187 | 0.194 | 0.079 | 0.707 | 0.927 | 0.982 | 0.187 | 0.197 | 0.079 | 0.701 | 0.924 | 0.982 |
| Office | 0.159 | 0.360 | 0.067 | 0.768 | 0.966 | 0.996 | 0.156 | 0.351 | 0.065 | 0.775 | 0.970 | 0.997 |
| Pumpkin | 0.132 | 0.361 | 0.058 | 0.831 | 0.978 | 0.996 | 0.141 | 0.380 | 0.062 | 0.797 | 0.977 | 0.995 |
| RedKitchen | 0.163 | 0.399 | 0.070 | 0.749 | 0.953 | 0.993 | 0.165 | 0.404 | 0.072 | 0.735 | 0.951 | 0.994 |
| Stairs | 0.174 | 0.486 | 0.075 | 0.748 | 0.899 | 0.964 | 0.158 | 0.454 | 0.068 | 0.767 | 0.911 | 0.971 |
| Average | 0.164 | 0.370 | 0.070 | 0.750 | 0.953 | 0.992 | 0.162 | 0.367 | 0.070 | 0.747 | 0.955 | 0.993 |
Table 3. Ablation studies of the local attention window radius. The best results are shown in bold. ↓ means the lower the value, the better. ↑ means the higher the value, the better.
| Radius | REL ↓ | RMS ↓ | Log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Time ↓ | Memory ↓ |
| 1 | 0.159 | 0.585 | 0.067 | 0.776 | 0.946 | 0.986 | 15.05 h | 2.30 G |
| 2 | 0.167 | 0.621 | 0.071 | 0.757 | 0.935 | 0.983 | 24.77 h | 3.15 G |
| 3 | 0.188 | 0.661 | 0.078 | 0.717 | 0.921 | 0.977 | 120.89 h | 4.41 G |
Table 4. Ablation studies of the feature $F_{ori}$. $C$ is the number of feature channels; $H$ and $W$ are the height and width of the feature. The best results are shown in bold. ↓ means the lower the value, the better. ↑ means the higher the value, the better.
| ($C$, $H$, $W$) | REL ↓ | RMS ↓ | Log10 ↓ | δ < 1.25 ↑ | δ < 1.25² ↑ | δ < 1.25³ ↑ | Time ↓ | Memory ↓ |
| (64, 144, 192) | 0.159 | 0.585 | 0.067 | 0.776 | 0.946 | 0.986 | 15.05 h | 2.30 G |
| (64, 72, 96) | 0.161 | 0.593 | 0.068 | 0.771 | 0.943 | 0.986 | 16.07 h | 2.30 G |
| (128, 36, 48) | 0.162 | 0.596 | 0.068 | 0.768 | 0.944 | 0.985 | 24.22 h | 2.90 G |