Nested DWT–Based CNN Architecture for Monocular Depth Estimation
Abstract
1. Introduction
- A nested DWT–based CNN architecture is proposed for monocular depth estimation;
- Dense skip connections are implemented with an attention function to improve the learning of local features;
- Dense convolution blocks are incorporated for richer feature extraction and learning.
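The attention on the skip paths can be illustrated with a minimal numpy sketch of an additive attention gate in the style of Attention U-Net (the paper's exact gate may differ; the function and weight names here are illustrative, not from the source):

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as a channel mixing: x is (C_in, H, W), w is (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def attention_gate(skip, gate, w_s, w_g, w_psi):
    # Additive attention on a skip path:
    #   alpha = sigmoid(psi(relu(W_s * skip + W_g * gate)))
    #   output = alpha * skip
    # so the decoder's gating signal decides how much of each skip
    # feature location is passed through.
    f = np.maximum(conv1x1(skip, w_s) + conv1x1(gate, w_g), 0.0)  # ReLU
    alpha = 1.0 / (1.0 + np.exp(-conv1x1(f, w_psi)))              # sigmoid, (1, H, W)
    return alpha * skip                                           # broadcast over channels
```

Because `alpha` lies in (0, 1), the gated skip features are an attenuated copy of the original skip tensor, never an amplified one.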
2. Wavelets
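The building block of this section, a single-level 2D DWT and its inverse (IWT), can be sketched for the Haar basis in plain numpy; this is a standard orthogonal decomposition into LL/LH/HL/HH subbands, not the paper's specific implementation:

```python
import numpy as np

def haar_dwt2(x):
    # Single-level 2D Haar DWT of an array with even height and width.
    # Rows first: pairwise average (low-pass) and difference (high-pass).
    a = (x[0::2, :] + x[1::2, :]) / 2.0
    d = (x[0::2, :] - x[1::2, :]) / 2.0
    # Then columns, yielding the four half-resolution subbands.
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0   # approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0   # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0   # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    # Inverse transform (IWT): undo the column step, then the row step.
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x
```

The round trip is lossless, which is what makes DWT/IWT attractive as a drop-in replacement for pooling/unpooling in an encoder-decoder: resolution is halved without discarding information.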
3. Nested DWT Net Architecture
- NDWT: Basic NDWTN;
- NADWT: NDWTN with attention on skip paths;
- NRDWT: NDWTN with residual blocks;
- NARDWT: NRDWT with attention on skip paths.
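The residual blocks distinguishing the NRDWT/NARDWT variants can be sketched as a standard identity-shortcut block (again a generic sketch in numpy with 1x1 channel mixing; the paper's blocks use full spatial convolutions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # Identity-shortcut residual block: out = x + F(x),
    # where F is two channel-mixing layers with ReLU activations.
    # x: (C, H, W); w1, w2: (C, C)
    h = np.einsum('oc,chw->ohw', w1, relu(x))
    h = np.einsum('oc,chw->ohw', w2, relu(h))
    return x + h
```

The shortcut guarantees that the block can fall back to the identity mapping (e.g., when the residual branch contributes nothing), which eases optimization of deeper nested decoders.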
4. Loss Function
5. Datasets
6. Standard Performance Metrics
- Root mean squared error (RMSE): $\sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(d_i - \hat{d}_i)^2}$
- Average relative error (REL): $\tfrac{1}{N}\sum_{i=1}^{N} |d_i - \hat{d}_i| / d_i$
- Logarithmic error (log10): $\tfrac{1}{N}\sum_{i=1}^{N} |\log_{10} d_i - \log_{10} \hat{d}_i|$
- Threshold accuracy ($\delta_k$): fraction of pixels with $\max(d_i/\hat{d}_i,\ \hat{d}_i/d_i) < 1.25^k$, $k = 1, 2, 3$

where $d_i$ is the ground-truth depth, $\hat{d}_i$ the predicted depth, and $N$ the number of valid pixels.
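These standard monocular-depth metrics are straightforward to compute; a compact numpy implementation (function name is my own) is:

```python
import numpy as np

def depth_metrics(pred, gt):
    # Standard monocular depth-estimation metrics.
    # pred, gt: positive depth maps of the same shape (valid pixels only).
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    rmse = np.sqrt(np.mean((pred - gt) ** 2))              # RMSE (lower is better)
    rel = np.mean(np.abs(pred - gt) / gt)                  # REL  (lower is better)
    log10 = np.mean(np.abs(np.log10(pred) - np.log10(gt))) # log10 (lower is better)
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]  # delta_1..3 (higher is better)
    return rmse, rel, log10, deltas
```

A perfect prediction drives the three error terms to zero and all three threshold accuracies to 1.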
7. Experiments and Ablation Studies
- NDWT (3C, 3R, 3Bs) + Bs
- NADWT (3C, 3LR, 1Bs)
- NADWT (3C, 3LR, 1Bs) + Bs
- NADWT (3C, 3Bs, 3R) + Bs
- NADWT (3C, 3R, 3Bs) + Bs
- NRDWT (3C, 3R, 3Bs) + Bs
- NRDWT (3C, 3Bs, 3R) + Bs
- NARDWT(3C, 3LR, 3Bs) + Bs
- NARDWT (3C, 3R, 3Bs) + Bs
- NARDWT (3C, 3Bs, 3LR) + Bs
- NARDWT (3C, 3Bs, 3LR)
- NARDWT (3C, 3LR)
- NARDWT (4C, 4Bs, 4LR) + 1Bs
- Where C: convolution layer, R: ReLU, LR: Leaky ReLU, Bs: batch normalization, and the leading number gives the number of layers of that type.
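Under one plausible reading of this shorthand (each token expands to a run of identical layers, in the order written), the configurations above can be expanded mechanically; the helper below is illustrative, not from the paper:

```python
def build_stage(spec):
    # Expand a shorthand such as "3C, 3R, 1Bs" into an ordered layer list.
    # Key (from the ablation table): C=Conv, R=ReLU, LR=LeakyReLU, Bs=BatchNorm;
    # an omitted count defaults to 1.
    names = {'C': 'Conv', 'R': 'ReLU', 'LR': 'LeakyReLU', 'Bs': 'BatchNorm'}
    layers = []
    for token in spec.split(','):
        token = token.strip()
        i = 0
        while i < len(token) and token[i].isdigit():
            i += 1
        count = int(token[:i]) if i else 1
        layers += [names[token[i:]]] * count
    return layers
```

This makes the ordering differences explicit: for instance `"3C, 3Bs, 3R"` places all batch-norm layers before the activations, whereas `"3C, 3R, 3Bs"` places them after, which is exactly the placement contrast probed in the ablations.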
8. Results and Observation
- Batch normalization: improves the depth range and the loss; placing batch normalization after the activation layer degrades the loss, and it adds extra computation and trainable parameters.
- Activation: among the activation layers tested, Leaky ReLU offered the best evaluation performance, while training and validation performance was better with ReLU.
- Attention: yields higher training, validation, and evaluation scores.
- Residual: gives lower training and validation accuracy, but the evaluation score is moderately better; it also requires more training.
- Convolution: additional convolution layers do not improve the metrics, but visually give a better representation.
- Loss function: replacing the MAE loss with the BerHu loss did not show improvement.
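For reference, the BerHu (reversed Huber) loss compared against MAE here behaves as L1 for small residuals and as a scaled L2 above a threshold c; a standard numpy sketch (the default c follows the common heuristic of Laina et al., 20% of the maximum residual):

```python
import numpy as np

def berhu_loss(pred, gt, c=None):
    # Reversed Huber (BerHu) loss:
    #   |r|                 if |r| <= c
    #   (r^2 + c^2) / (2c)  otherwise
    # The two branches meet with matching value and slope at |r| = c.
    r = np.abs(np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float))
    if c is None:
        c = max(0.2 * r.max(), 1e-8)  # common heuristic; guard against c = 0
    l2 = (r ** 2 + c ** 2) / (2.0 * c)
    return np.mean(np.where(r <= c, r, l2))
```

Relative to MAE, BerHu penalizes large residuals quadratically while keeping the robust L1 behavior near zero.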
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| 2D | Two-dimensional |
| 3D | Three-dimensional |
| ADAM | Adaptive Moment Estimation |
| BerHu | Reversed Huber loss |
| CNN | Convolutional Neural Network |
| DWT | Discrete Wavelet Transform |
| GPU | Graphics Processing Unit |
| IWT | Inverse DWT |
| LIDAR | Light Detection and Ranging |
| LR | Leaky ReLU |
| MAE | Mean Absolute Error |
| MSE | Mean Squared Error |
| mins | Minutes |
| NDWTN | Nested Discrete Wavelet Transform Net |
| NYU | New York University |
| RADAR | Radio Detection and Ranging |
| ReLU | Rectified Linear Unit |
| RGB | Red, Green, and Blue |
| RMSE | Root Mean Squared Error |
| SONAR | Sound Navigation and Ranging |
| SSIM | Structural Similarity Index |
References
- Ens, J.; Lawrence, P. An investigation of methods for determining depth from focus. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 97–108.
- Xian, T.; Subbarao, M. Performance evaluation of different depth from defocus (DFD) techniques. Proc. SPIE 2005, 6000, 87–99.
- Lee, S.; Hayes, M.H.; Paik, J. Distance estimation using a single computational camera with dual off–axis color filtered apertures. Opt. Express 2013, 21, 23116–23129.
- Mather, G. The Use of Image Blur as a Depth Cue. Perception 1997, 26, 1147–1158.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image using a Multi–Scale Deep Network. arXiv 2014, arXiv:1406.2283.
- Harsányi, K.; Kiss, A.; Majdik, A.; Sziranyi, T. A Hybrid CNN Approach for Single Image Depth Estimation: A Case Study. In Proceedings of the Multimedia and Network Information Systems (MISSI 2018), Wroclaw, Poland, 12–14 September 2018; Choroś, K., Kopel, M., Kukla, E., Siemiński, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 372–381.
- Alhashim, I.; Wonka, P. High Quality Monocular Depth Estimation via Transfer Learning. arXiv 2018, arXiv:1812.11941.
- Shivakumar, S.S.; Nguyen, T.; Miller, I.D.; Chen, S.W.; Kumar, V.; Taylor, C.J. DFuseNet: Deep Fusion of RGB and Sparse Depth Information for Image Guided Dense Depth Completion. arXiv 2019, arXiv:1902.00761.
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. arXiv 2016, arXiv:1606.00373.
- Zhao, C.; Sun, Q.; Zhang, C.; Tang, Y.; Qian, F. Monocular depth estimation based on deep learning: An overview. Sci. China Technol. Sci. 2020, 63, 1612–1627.
- He, L.; Wang, G.; Hu, Z. Learning Depth From Single Images With Deep Neural Network Embedding Focal Length. IEEE Trans. Image Process. 2018, 27, 4676–4689.
- Chi, J.; Gao, J.; Qi, L.; Zhang, S.; Dong, J.; Yu, H. Depth estimation of a single RGB image with semi–supervised two–stage regression. In Proceedings of the 5th International Conference on Communication and Information Processing, Chongqing, China, 15–17 November 2019; pp. 97–102.
- Masoumian, A.; Rashwan, H.A.; Cristiano, J.; Asif, M.S.; Puig, D. Monocular Depth Estimation Using Deep Learning: A Review. Sensors 2022, 22, 5353.
- Zhu, J.; Liu, L.; Liu, Y.; Li, W.; Wen, F.; Zhang, H. FG–Depth: Flow–Guided Unsupervised Monocular Depth Estimation. arXiv 2023, arXiv:2301.08414.
- Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left–Right Consistency. arXiv 2016, arXiv:1609.03677.
- Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation Using Adaptive Bins. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021.
- Li, B.; Zhang, H.; Wang, Z.; Liu, C.; Yan, H.; Hu, L. Unsupervised monocular depth estimation with aggregating image features and wavelet SSIM (Structural SIMilarity) loss. Intell. Robot. 2021, 1, 84–98.
- Zhao, S.; Fu, H.; Gong, M.; Tao, D. Geometry–Aware Symmetric Domain Adaptation for Monocular Depth Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
- Antensteiner, D.; Štolc, S.; Huber-Mörk, R. Depth Estimation with Light Field and Photometric Stereo Data Using Energy Minimization. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision and Applications (CIARP 2016), Lima, Peru, 8–11 November 2016; Beltrán-Castañón, C., Nyström, I., Famili, F., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 175–183.
- Woodham, R.J. Photometric Method For Determining Surface Orientation From Multiple Images. Opt. Eng. 1980, 19, 191139.
- Chen, G.; Han, K.; Wong, K.Y.K. PS–FCN: A Flexible Learning Framework for Photometric Stereo. arXiv 2018, arXiv:1807.08696.
- Chen, G.; Han, K.; Shi, B.; Matsushita, Y.; Wong, K.Y.K. Deep Photometric Stereo for Non–Lambertian Surfaces. arXiv 2020, arXiv:2007.13145.
- Ju, Y.; Jian, M.; Guo, S.; Wang, Y.; Zhou, H.; Dong, J. Incorporating Lambertian Priors Into Surface Normals Measurement. IEEE Trans. Instrum. Meas. 2021, 70, 1–13.
- Van Dijk, T.; de Croon, G.C.H.E. How do neural networks see depth in single images? arXiv 2019, arXiv:1905.07005.
- Yue, H.; Zhang, J.; Wu, X.; Wang, J.; Chen, W. Edge Enhancement in Monocular Depth Prediction. In Proceedings of the 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, 9–13 November 2020; pp. 1594–1599.
- Xie, J.; Feris, R.S.; Sun, M.T. Edge–Guided Single Depth Image Super Resolution. IEEE Trans. Image Process. 2016, 25, 428–438.
- Zhang, C.; Tian, Y. Edge Enhanced Depth Motion Map for Dynamic Hand Gesture Recognition. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 500–505.
- Paul, S.; Jhamb, B.; Mishra, D.; Kumar, M.S. Edge loss functions for deep–learning depth–map. Mach. Learn. Appl. 2022, 7, 100218.
- Wolter, M.; Garcke, J. Adaptive wavelet pooling for convolutional neural networks. Proc. Mach. Learn. Res. 2021, 130, 1936–1944.
- Ferrà, A.; Aguilar, E.; Radeva, P. Multiple Wavelet Pooling for CNNs. In Proceedings of the Computer Vision–ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Leal-Taixé, L., Roth, S., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 671–675.
- Yang, H.H.; Yang, C.H.H.; James Tsai, Y.C. Y–Net: Multi–Scale Feature Aggregation Network With Wavelet Structure Similarity Loss Function For Single Image Dehazing. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2628–2632.
- Ramamonjisoa, M.; Firman, M.; Watson, J.; Lepetit, V.; Turmukhambetov, D. Single Image Depth Estimation using Wavelet Decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
- Yu, B.; Wu, J.; Islam, M.J. UDepth: Fast Monocular Depth Estimation for Visually–guided Underwater Robots. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023.
- Zioulis, N.; Albanis, G.; Drakoulis, P.; Alvarez, F.; Zarpalas, D.; Daras, P. Hybrid Skip: A Biologically Inspired Skip Connection for the UNet Architecture. IEEE Access 2022, 10, 53928–53939.
- Luo, C.; Li, Y.; Lin, K.; Chen, G.; Lee, S.J.; Choi, J.; Yoo, Y.F.; Polley, M.O. Wavelet Synthesis Net for Disparity Estimation to Synthesize DSLR Calibre Bokeh Effect on Smartphones. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 2404–2412.
- Li, Q.; Shen, L.; Guo, S.; Lai, Z. Wavelet Integrated CNNs for Noise–Robust Image Classification. arXiv 2020, arXiv:2005.03337.
- Liu, P.; Zhang, H.; Lian, W.; Zuo, W. Multi-level Wavelet Convolutional Neural Networks. IEEE Access 2019, 7, 74973–74985.
- Ronneberger, O.; Fischer, P.; Brox, T. U–Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer–Assisted Intervention, MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U–Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
- Zhang, Z.; Liu, Q.; Wang, Y. Road Extraction by Deep Residual U–Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753.
- Yang, H.H.; Fu, Y. Wavelet U–Net and the Chromatic Adaptation Transform for Single Image Dehazing. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2736–2740.
- Wang, Y.; Zhu, X.; Zhao, Y.; Wang, P.; Ma, J. Enhancement of Low–Light Image Based on Wavelet U–Net. J. Phys. Conf. Ser. 2019, 1345, 022030.
- Li, Y.; Wang, Y.; Leng, T.; Zhijie, W. Wavelet U–Net for Medical Image Segmentation. In Proceedings of the Artificial Neural Networks and Machine Learning—ICANN 2020: 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, 15–18 September 2020; Part I. Springer: Berlin/Heidelberg, Germany, 2020; pp. 800–810.
- Chuter, J.L.; Boullanger, G.B.; Saez, M.N. U-N.o.1T: A U–Net exploration, in Depth. 2018. Available online: https://cs229.stanford.edu/proj2018/report/34.pdf (accessed on 9 March 2023).
- Sharma, M.; Sharma, A.; Tushar, K.R.; Panneer, A. A Novel 3D–Unet Deep Learning Framework Based on High–Dimensional Bilateral Grid for Edge Consistent Single Image Depth Estimation. In Proceedings of the 2020 International Conference on 3D Immersion (IC3D), Brussels, Belgium, 15 December 2020; pp. 1–8.
- Liu, P.; Zhang, Z.; Meng, Z.; Gao, N. Monocular Depth Estimation with Joint Attention Feature Distillation and Wavelet–Based Loss Function. Sensors 2021, 21, 54.
- Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U–Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165.
- Peng, D.; Zhang, Y.; Guan, H. End–to–End Change Detection for High Resolution Satellite Images Using Improved UNet++. Remote Sens. 2019, 11, 1382.
- Gur, S.; Wolf, L. Single Image Depth Estimation Trained via Depth from Defocus Cues. arXiv 2020, arXiv:2001.05036.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the Computer Vision—ECCV 2012, Florence, Italy, 7–13 October 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760.
- Ladicky, L.; Shi, J.; Pollefeys, M. Pulling Things out of Perspective. In Proceedings of the CVPR ’14: 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE Computer Society: Washington, DC, USA, 2014; pp. 89–96.
- Wang, Y. MobileDepth: Efficient Monocular Depth Prediction on Mobile Devices. arXiv 2020, arXiv:2011.10189.
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167.
- Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. arXiv 2018, arXiv:1806.02446.
- Patil, V.; Sakaridis, C.; Liniger, A.; Van Gool, L. P3Depth: Monocular Depth Estimation with a Piecewise Planarity Prior. arXiv 2022, arXiv:2204.02091.
- Yuan, W.; Gu, X.; Dai, Z.; Zhu, S.; Tan, P. NeW CRFs: Neural Window Fully–connected CRFs for Monocular Depth Estimation. arXiv 2022, arXiv:2203.01502.
- Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. ZoeDepth: Zero–shot Transfer by Combining Relative and Metric Depth. arXiv 2023, arXiv:2302.12288.
Params | DWT | ADWT | UNET++ | DenseNet [7] | AdaBins [16] | NDWTN |
---|---|---|---|---|---|---|
Total | 13.39 | 14.88 | 13.23 | 53.99 | 78.0 | 42.82 |
Trainable | 13.39 | 14.87 | 13.22 | 53.97 | – | 42.66 |
Models | δ1↑ | δ2↑ | δ3↑ | REL↓ | RMSE↓ | log10↓ |
---|---|---|---|---|---|---|
db4 | 0.33 | 0.61 | 0.81 | 0.39 | 0.16 | 0.18 |
Haar | 0.34 | 0.62 | 0.82 | 0.39 | 0.15 | 0.17 |
Models | δ1↑ | δ2↑ | δ3↑ | REL↓ | RMSE↓ | log10↓ | Year |
---|---|---|---|---|---|---|---|
DWT | 0.27 | 0.52 | 0.73 | 0.54 | 1.76 | 0.21 | 2023 * |
ADWT | 0.27 | 0.51 | 0.70 | 0.80 | 1.57 | 0.23 | 2023 * |
UNET++ | 0.29 | 0.55 | 0.75 | 0.66 | 1.69 | 0.21 | 2023 * |
DenseNet [7] | 0.85 | 0.97 | 0.99 | 0.12 | 0.52 | 0.05 | 2018 |
DORN [54] | 0.83 | 0.97 | 0.99 | 0.12 | 0.51 | 0.05 | 2018 |
P3Depth [55] | 0.898 | 0.98 | 0.996 | 0.1 | 0.36 | 0.04 | 2022 |
NewCRFs [56] | 0.92 | 0.99 | 0.998 | 0.095 | 0.33 | 0.04 | 2022 |
ZoeD–M12–N [57] | 0.96 | 0.995 | 0.999 | 0.075 | 0.27 | 0.03 | 2023 |
NADWT(3) | 0.33 | 0.61 | 0.81 | 0.39 | 0.16 | 0.18 | 2023 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Paul, S.; Mishra, D.; Marimuthu, S.K. Nested DWT–Based CNN Architecture for Monocular Depth Estimation. Sensors 2023, 23, 3066. https://doi.org/10.3390/s23063066