Energies
  • Article
  • Open Access

16 August 2024

Improving Accuracy and Efficiency of Monocular Depth Estimation in Power Grid Environments Using Point Cloud Optimization and Knowledge Distillation

1 State Grid Hunan Electric Power Corporation Ltd., Research Institute, Changsha 410007, China
2 Hunan Provincial Engineering Research Center for Multimodal Perception and Edge Intelligence in Electric Power, Research Institute, Changsha 410007, China
3 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
This article belongs to the Section F1: Electrical Power System

Abstract

In the context of distribution networks, constructing 3D point cloud maps is crucial, particularly for UAV navigation and path planning. Methods that rely on surface reflections, such as laser scanning and structured light, to obtain depth point clouds for surface modeling and environmental depth estimation are already mature in some professional scenarios. However, acquiring dense and accurate depth information in this way typically comes at very high cost. In contrast, monocular image-based depth estimation does not require expensive equipment or specialized personnel, making it suitable for a much wider range of applications. To achieve both high precision and high efficiency for UAVs operating in distribution networks, we draw on knowledge distillation and employ a teacher–student architecture to enable efficient inference, maintaining high-quality depth estimation while optimizing the resulting point cloud to obtain more precise results. In this paper, we propose KD-MonoRec, which integrates knowledge distillation into the semi-supervised MonoRec framework for UAV distribution network inspection. Our method demonstrates excellent performance on the KITTI dataset and performs well in collected distribution network environments.

1. Introduction

A point cloud is a 3D representation of an object, environment, or surface, consisting of large numbers of points in space. These points are typically captured using techniques such as laser scanning, ground scanning, or photogrammetry, which record each point's position relative to a reference system. Early research relied on LiDAR to construct these point clouds, as demonstrated by Thrun et al. [], who fused LiDAR data for localization and mapping. However, the high cost and complexity of LiDAR have driven a shift toward more affordable visual sensors, which capture richer textures. Multi-View Stereo (MVS) [] is one such method that generates dense 3D point clouds from multiple images. In the 1980s and 1990s, researchers like Tomasi and Kanade [] and Scharstein and Szeliski [] began estimating disparity information from multiple perspectives to reconstruct 3D scenes. These methods typically relied on calculating disparity maps, matching corresponding pixels from different viewing angles to estimate depth. Later, Barnes et al. [] introduced the PatchMatch algorithm for 3D scene reconstruction, and Campos et al. [] advanced SLAM technologies to generate sparse point clouds using monocular or RGB-D cameras. Despite these advancements, challenges remain in managing model complexity and computational demands. This study explores how integrating deep learning, knowledge distillation, and point cloud optimization can enhance 3D scene understanding.
With the rapid advancement of deep learning in recent years, especially the success of convolutional neural networks (CNNs) such as VGGNet [] and ResNet [] on large-scale datasets such as ImageNet [], MVS-based 3D point cloud reconstruction methods have increasingly incorporated deep learning techniques. However, these methods often require complex models and substantial computational resources, making real-time inference on standard GPUs challenging. There is therefore growing interest in simplifying neural networks and reducing the number of parameters. Knowledge distillation is a technique that improves the efficiency of neural networks by transferring knowledge from a larger, more complex model (the teacher) to a smaller, simpler model (the student). Hinton et al. [] introduced the concept of knowledge distillation, where the student model learns not only from the hard labels of the training data but also from the soft labels generated by the teacher model. These soft labels provide additional information about the class probabilities, enabling the student model to generalize more effectively. Building on this foundation, Wu et al. [] developed a practical distillation framework that strikes a balance between model accuracy and efficiency. Their method dynamically adjusts the distillation process based on the complexity of the input data, allowing the student model to maintain high performance across various tasks without incurring unnecessary computational costs. We adopt a similar strategy in our own model, achieving higher accuracy with a more compact network, which is particularly advantageous in resource-constrained environments.
Another crucial aspect of point cloud processing is optimization, which aims to improve the quality and accuracy of point cloud data. For example, Zhao et al. [] discussed the challenges of generating point clouds in complex environments, such as around utility poles, where noise levels can be high. By applying a series of filtering and optimization steps, such as spatial filtering, normal estimation, and color consistency checks, it is possible to remove noise, outliers, and unnecessary details, resulting in a clearer and more reliable 3D representation. These optimizations ensure that the point cloud data more accurately reflect the true structure and features of the scene, thereby enhancing the effectiveness and accuracy of 3D reconstruction and scene understanding.
In this paper, we investigate the combination of knowledge distillation methods with state-of-the-art, deep learning-based MVS approaches, such as MonoRec []. We adopt the basic architecture of MonoRec and design a teacher–student framework, conducting experiments on public datasets and in distribution network environments to evaluate both accuracy and efficiency. Furthermore, we incorporate point cloud optimization techniques into our framework to improve the quality of the reconstructed 3D representations. These optimizations, including spatial filtering, normal estimation, and color consistency checks, are essential for aligning the point cloud data more closely with the actual structure and features of the scene. This integration significantly contributes to the overall effectiveness and accuracy of our approach to 3D reconstruction and scene understanding tasks.

3. Methods

We build on the previous MonoRec [] work to further study depth estimation, as illustrated in Figure 1. We improve that algorithm by incorporating knowledge distillation to increase the accuracy of depth estimation. The original MonoRec framework consists of a MaskModule that predicts a mask of dynamic objects and a DepthModule that estimates the depth of the masked image; on top of this framework, we adopt knowledge distillation to make depth estimation inference more efficient, especially in power distribution network environments. In this section, the MaskModule and the depth prediction module are first introduced, followed by a description of the knowledge distillation method and the optimization of the generated depth map. Finally, the training strategy of the framework is discussed.
Figure 1. The KD-MonoRec architecture first constructs a photometric cost volume based on multiple input frames using the SSIM metric to measure photometric consistency. MaskModule detects inconsistencies between different input frames to determine moving objects. The multi-frame cost volume is then multiplied by the prediction mask and passed to the DepthModule, which predicts a dense inverse depth map. In the decoder, the cost volume features are concatenated with pre-trained ResNet-18 features. In order to train the student model to predict the depth map, we introduce the concept of knowledge distillation and use the trained teacher model to guide the training of the student model to improve prediction accuracy and generalization ability.

3.1. MaskModule

The goal of MaskModule is to predict a mask $M_t$, where $M_t(x) \in [0, 1]$ represents the probability that pixel $x$ in frame $I_t$ belongs to a moving object. Determining moving objects from frames alone is an ambiguous task that is difficult to generalize. Therefore, we propose to use a set of cost volumes $C_{t'}$ ($t' \in \{1, 2, \ldots, N\} \setminus \{t\}$), each of which encodes a geometric prior between the key frame $I_t$ and another frame $I_{t'}$. We use the individual $C_{t'}$ instead of the combined volume $C$ because inconsistent geometric information across the different $C_{t'}$ is a strong prior for predicting moving objects: dynamic pixels produce inconsistent optimal depth steps in different $C_{t'}$. However, geometric priors alone are not sufficient to predict moving objects, as poorly textured or non-Lambertian surfaces may also cause inconsistencies. Furthermore, for objects moving at a constant speed, the cost volumes often agree on a wrong depth that is inconsistent with the context of the scene. Therefore, in addition to the geometric priors, we utilize the pre-trained ResNet-18 [] features of frame $I_t$ to encode semantic priors. The network is designed using a U-Net architecture [] with skip connections. All cost volumes pass through an encoder with shared weights. Features from the different cost volumes are aggregated using max pooling and then passed through the decoder. In this way, MaskModule can be applied to a different number of frames without retraining.
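As an illustration of the aggregation scheme above, the following minimal PyTorch sketch encodes each per-frame cost volume with shared weights, fuses the resulting features by max pooling over frames, and decodes a per-pixel mask. The layer sizes and the omission of the ResNet-18 semantic branch are simplifications for illustration; this is not the exact MonoRec architecture.

import torch
import torch.nn as nn

class TinyMaskModule(nn.Module):
    """Sketch of MaskModule-style aggregation over per-frame cost volumes."""
    def __init__(self, depth_steps=32, feat=16):
        super().__init__()
        # Shared-weight encoder applied to every per-frame cost volume C_t'
        self.encoder = nn.Sequential(
            nn.Conv2d(depth_steps, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # Decoder maps the aggregated features to a per-pixel probability in [0, 1]
        self.decoder = nn.Sequential(nn.Conv2d(feat, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, cost_volumes):
        # cost_volumes: list of (B, depth_steps, H, W) tensors, one per frame t'
        feats = torch.stack([self.encoder(c) for c in cost_volumes], dim=0)
        fused = feats.max(dim=0).values   # max pooling over frames -> frame-count agnostic
        return self.decoder(fused)        # (B, 1, H, W) moving-object probability M_t

# Example with three per-frame cost volumes for a 64 x 64 keyframe
volumes = [torch.rand(1, 32, 64, 64) for _ in range(3)]
print(TinyMaskModule()(volumes).shape)    # torch.Size([1, 1, 64, 64])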

3.2. DepthModule

DepthModule predicts the dense per-pixel inverse depth map $D_t$ of the key frame $I_t$. To achieve this, the module receives the complete cost volume $C$ concatenated with the key frame $I_t$. Different from MaskModule, here we use $C$ instead of the individual $C_{t'}$ because aggregating multiple frames into one volume usually improves depth accuracy and enhances robustness to photometric noise []. To eliminate false depth predictions for moving objects, we perform a pixel-wise multiplication of $M_t$ with the cost volume $C$ at every depth step $d$. In this way, moving-object regions no longer leave any maxima (i.e., a strong prior), so the DepthModule must rely on image features and the surrounding context to infer the depth of moving objects. We adopt a U-Net architecture [] whose decoder produces multi-scale depth outputs. Finally, the DepthModule outputs an interpolation factor between $d_{min}$ and $d_{max}$. In practice, we use $s = 4$ depth prediction scales.
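The sketch below illustrates the two operations just described: weighting the cost volume at every depth step with a static-pixel weight derived from the mask, so that moving regions leave no maxima, and mapping the predicted interpolation factor to a depth between d_min and d_max by linear interpolation in inverse depth. The sign convention for the mask weight, the inverse-depth interpolation, and the d_min/d_max values are assumptions for illustration.

import torch

def mask_cost_volume(cost_volume: torch.Tensor, static_weight: torch.Tensor) -> torch.Tensor:
    # cost_volume: (B, D, H, W); static_weight: (B, 1, H, W), e.g. 1 - M_t,
    # so pixels likely to be moving contribute no maxima at any depth step d.
    return cost_volume * static_weight

def interpolation_to_depth(alpha: torch.Tensor, d_min: float = 1.0, d_max: float = 80.0) -> torch.Tensor:
    # alpha in [0, 1] interpolates inverse depth between 1/d_max and 1/d_min.
    inv_depth = alpha * (1.0 / d_min - 1.0 / d_max) + 1.0 / d_max
    return 1.0 / inv_depth

cost = torch.rand(1, 32, 64, 64)
weight = torch.ones(1, 1, 64, 64)          # all pixels treated as static in this toy example
alpha = torch.rand(1, 1, 64, 64)
print(mask_cost_volume(cost, weight).shape, interpolation_to_depth(alpha).max().item())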

3.3. Knowledge Distillation Architecture

In depth estimation, knowledge distillation is used to guide the student model: by learning the knowledge of the teacher model, the student can complete the moving object segmentation task more effectively. The core of this method is to pass the complex representations and prior information of the teacher model to the student model, thereby improving the student's performance on this task.
First, the teacher model is built on a set of cost volumes that encode geometric priors between the current frame and the other frames. Each cost volume corresponds to a different viewpoint and makes a distinct contribution to the depth information of moving objects. The teacher model combines the information from these volumes to generate strong priors for the segmentation of moving objects.
Next, the student model is designed to leverage this complex prior information, along with semantic priors from pre-trained ResNet-18 features, to more accurately predict whether each pixel belongs to a moving object. During the knowledge distillation process, the output of the teacher model serves as an additional supervision signal and enters the loss calculation together with the output of the student model. This joint loss captures two aspects: on the one hand, the prediction error of the student model itself, and on the other hand, the discrepancy between the student's output and the teacher's output, i.e., the complex geometric prior.
In this way, the student model not only learns the basic information of the task during training but also acquires deeper and more critical knowledge for the task from the teacher model. This helps improve the understanding and generalization ability of the student model for the task of depth estimation of moving objects. The idea of knowledge distillation plays the role of transmitting, refining, and strengthening knowledge here, allowing the model to better understand complex scene geometry and semantic information, thereby more accurately segmenting moving objects.
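A minimal sketch of such a joint loss is given below. The binary cross-entropy task term, the mean-squared distillation term, and the weight lambda_kd are illustrative choices, not the exact loss used in our framework.

import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, target, lambda_kd=0.5):
    # student_out, teacher_out: (B, 1, H, W) probabilities; target: (B, 1, H, W) labels in {0, 1}
    task_loss = F.binary_cross_entropy(student_out, target)      # student's own prediction error
    kd_loss = F.mse_loss(student_out, teacher_out.detach())      # match the teacher's soft output
    return task_loss + lambda_kd * kd_loss

student = torch.rand(1, 1, 64, 64)
teacher = torch.rand(1, 1, 64, 64)
labels = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(distillation_loss(student, teacher, labels).item())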

3.4. Depth Optimization

To further optimize the depth map generated by our method in a networked UAV environment, we introduce depth map filtering as a key step in depth map optimization. This method aims to improve the accuracy and stability of the depth map by processing the depth map corresponding to the original image.
We adopt the depth image optimization procedure illustrated in Algorithm 1. We first obtain the depth map generated by KD-MonoRec. We then use statistics-based filtering to spatially filter the corresponding point cloud: by analyzing the distribution characteristics of the points, we can effectively remove outliers caused by sensor errors or occlusions, thereby improving the overall quality of the depth map. Next, in order to better retain the detailed information of object surfaces, we introduce normal vector filtering. By computing normal vectors from adjacent points, we can identify and preserve the main features of the object surface, further reducing artifacts and discontinuities in the depth map.
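Before turning to the normal and color steps in detail, a minimal NumPy/SciPy sketch of the statistics-based spatial filter might look as follows; the neighborhood size k and the standard-deviation ratio are illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def statistical_filter(points: np.ndarray, k: int = 8, std_ratio: float = 2.0) -> np.ndarray:
    # points: (N, 3) coordinates back-projected from the depth map.
    # A point is an outlier if its mean distance to its k nearest neighbours
    # exceeds the global mean by more than std_ratio standard deviations.
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=k + 1)     # the first neighbour is the point itself
    mean_dists = dists[:, 1:].mean(axis=1)
    threshold = mean_dists.mean() + std_ratio * mean_dists.std()
    return points[mean_dists < threshold]

cloud = np.vstack([np.random.rand(1000, 3), [[10.0, 10.0, 10.0]]])   # one obvious outlier
print(statistical_filter(cloud).shape)                               # ~(1000, 3): the outlier is removed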
Specifically, given a point cloud $P$, where each point $p_i$ has three-dimensional coordinates $(x_i, y_i, z_i)$ and a corresponding normal vector $N_i = (n_{x_i}, n_{y_i}, n_{z_i})$, we compute the normal vector $N_i$ of each point $p_i$ as
$$N_i = \mathrm{ComputeNormal}\big(p_i, \mathrm{Neighbors}(p_i)\big).$$
The depth map is then filtered using the computed normal vectors:
$$D_{\mathrm{filtered}} = \mathrm{FilterByNormal}(D, N),$$
where $D_{\mathrm{filtered}}$ is the filtered depth map, $D$ is the original depth map, and $N$ denotes the normal vectors of the points in the point cloud.
To enhance the detail recovery ability of the depth map, we use the color information in the point cloud for color filtering. First, we initialize an empty filtered depth map $D_{\mathrm{filtered}}$ with the same dimensions as the input depth map $D$. Then, for each pixel $(x, y)$ in $D$, we extract the corresponding color $C(x, y)$ from the point cloud $P$ and define a local neighborhood $N_{\mathrm{local}}(x, y)$ of size $N$ centered at $(x, y)$. Next, we compute the average color $\bar{C}(x, y)$ of the local neighborhood $N_{\mathrm{local}}(x, y)$ and the color consistency score $\Delta(x, y)$ between the current pixel color $C(x, y)$ and this average color using the Euclidean distance:
$$\Delta(x, y) = \left\lVert C(x, y) - \bar{C}(x, y) \right\rVert.$$
If $\Delta(x, y)$ is below the color consistency threshold $\tau$, we copy the depth value of pixel $(x, y)$ from the original depth map $D(x, y)$ into the filtered depth map $D_{\mathrm{filtered}}(x, y)$. Otherwise, we apply a smoothing operation, such as averaging or median filtering, to $D(x, y)$ within the local neighborhood $N_{\mathrm{local}}(x, y)$ to enforce color consistency. We repeat these steps for all pixels in the depth map $D$ and finally return the filtered depth map $D_{\mathrm{filtered}}$.
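As a concrete illustration of this color-consistency step, the following NumPy sketch computes $\Delta(x, y)$ against the local mean color and falls back to a median-filtered depth wherever the threshold $\tau$ is exceeded; the neighborhood size, the value $\tau = 20$ (matching the threshold in Algorithm 1 below), and the choice of a median fallback are illustrative assumptions.

import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def color_consistency_filter(depth: np.ndarray, color: np.ndarray,
                             size: int = 5, tau: float = 20.0) -> np.ndarray:
    # depth: (H, W) depth map; color: (H, W, 3) image aligned with the depth map.
    mean_color = np.stack([uniform_filter(color[..., c].astype(float), size=size)
                           for c in range(3)], axis=-1)                # local average colour C_bar(x, y)
    delta = np.linalg.norm(color.astype(float) - mean_color, axis=-1)  # Delta(x, y)
    smoothed = median_filter(depth, size=size)                         # smoothing fallback
    return np.where(delta < tau, depth, smoothed)                      # keep depth where colours agree

depth = np.random.rand(64, 64).astype(np.float32) * 50.0
color = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
print(color_consistency_filter(depth, color).shape)                    # (64, 64)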
Algorithm 1 Depth Image Optimization
Require: depth_map (the depth map produced by KD-MonoRec)
1: function DepthImageOptimization(depth_map)
  filtered_depth_map ← spatialFilter(depth_map)
  filtered_depth_map ← normalFilter(filtered_depth_map)
  filtered_depth_map ← colorFilter(filtered_depth_map)
  return filtered_depth_map
2: function spatialFilter(depth_map)
  filtered_depth_map ← statistical_filter(depth_map)
  return filtered_depth_map
3: function normalFilter(depth_map)
  point_cloud ← depth_map_to_point_cloud(depth_map)
  normals ← calculate_normals(point_cloud)
  filtered_depth_map ← filter_with_normals(depth_map, normals)
  return filtered_depth_map
4: function colorFilter(depth_map)
  point_cloud ← depth_map_to_point_cloud(depth_map)
  colors ← extract_colors(point_cloud)
  threshold ← 20  {threshold value used to define color consistency}
  filtered_depth_map ← filter_with_colors(depth_map, colors, threshold)  {retain the original depth where colors are consistent; smooth it otherwise}
  return filtered_depth_map

4. Experiments

4.1. Datasets

KITTI [] is a public dataset widely used in computer vision and autonomous driving research. It was created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It provides data from multiple sensor modalities, including stereo cameras, LiDAR, and inertial measurement unit (IMU) data. The scenes cover vehicle driving, pedestrian movement, and road conditions in urban environments. The KITTI dataset provides rich annotation information, such as camera calibration parameters, LiDAR point clouds, and vehicle motion trajectories, and can be used for the research and evaluation of computer vision tasks such as object detection, object tracking, stereo vision, and semantic segmentation.
EuRoC [] is a public dataset used for Visual Inertial Odometry (VIO) and visual SLAM (Simultaneous Localization and Mapping) research. It was collected with micro aerial vehicles by the Autonomous Systems Lab at ETH Zurich as part of the European Robotics Challenges (EuRoC) project. The dataset provides data from multiple sensor modalities, including stereo cameras and inertial measurement units (IMUs), together with laser- and motion-capture-based ground truth, and contains sequences recorded in industrial and indoor laboratory environments. The camera images and IMU data can be used for the development, evaluation, and comparison of VIO and SLAM algorithms. The dataset also provides annotation information such as sensor calibration parameters and ground-truth camera and IMU poses.
Distribution network drone dataset. Our custom drone dataset is designed for computer vision and perception tasks in distribution grid environments. Through professional drones equipped with high-resolution cameras, inertial measurement units (IMUs), LiDAR, and GPS modules, we collect data in a variety of scenarios, including power tower inspections, transmission line monitoring, and substation inspections.
The dataset provides rich annotation information, including UAV attitude data, equipment calibration parameters, and environmental conditions, to support computer vision tasks such as target detection, trajectory tracking, and three-dimensional reconstruction. Relevant application areas include power equipment fault detection, environmental modeling, and other distribution network monitoring and maintenance tasks.

4.2. Evaluation Metrics

We employ the following metrics to assess the quality of point cloud generation by drones in the context of power grid distribution environments.
Absolute Relative Difference (AbsRel): This measures the absolute relative error between the predicted depth map and the true depth map. For each pixel, the absolute difference between the predicted and true depth is normalized by the true depth, and these values are averaged over all pixels. A smaller AbsRel indicates more accurate depth prediction by the model.
$$\mathrm{AbsRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| d_i - d_{gt} \right|}{d_{gt}}$$
Squared Relative Difference (SqRel): Similar to AbsRel, SqRel quantifies the squared relative error between the predicted depth map and the true depth map. It also evaluates the relative difference between depth prediction and true depth values, but in squared form. SqRel pays more attention to pixels with larger errors.
The formula is expressed as:
$$\mathrm{SqRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left( d_i - d_{gt} \right)^2}{d_{gt}}$$
Root Mean Square Error (RMSE): RMSE measures the root mean square error between the predicted depth map and the true depth map. It is the square root of the mean of the squared differences between predicted and true depth values across all pixels and indicates the average magnitude of the model's prediction errors.
The formula is expressed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( d_i - d_{gt} \right)^2}.$$
Root Mean Square Logarithmic Error (RMSLE): RMSLE calculates the root mean square error in the logarithmic domain. It offers robustness when dealing with nonlinear relationships and is typically used when the target value range is wide.
The formula is expressed as follows:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \log(d_i + 1) - \log(d_{gt} + 1) \right)^2}.$$
These metrics collectively provide insights into the accuracy and performance of drone-generated point clouds in power grid distribution environments. Here, $d_i$ is the predicted depth value, $d_{gt}$ is the true depth value, and $N$ is the total number of pixels. These evaluation metrics play a crucial role in depth estimation. First, the absolute relative error (AbsRel) provides an intuitive measure of depth estimation accuracy by averaging the absolute relative difference between the predicted and true depth values; it weighs every pixel equally and normalizes the error by the depth value, making the results more universal and comparable. The squared relative error (SqRel) further emphasizes sensitivity to large errors, drawing attention to pixels with large deviations in the depth map, which helps provide a more complete picture of how a model performs at different depth ranges. The root mean square error (RMSE) quantifies the overall average magnitude of the depth prediction error, and its simple, intuitive computation makes it one of the most commonly used indicators in depth estimation. Finally, the root mean square logarithmic error (RMSLE) is more robust when dealing with nonlinear relationships, especially when target values span a wide range. Together, these metrics constitute a multifaceted evaluation of the performance of the depth estimation model.
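For reference, the four metrics defined above can be computed directly with a small NumPy helper; d_pred and d_gt are flattened arrays of valid predicted and ground-truth depths.

import numpy as np

def depth_metrics(d_pred: np.ndarray, d_gt: np.ndarray) -> dict:
    abs_rel = np.mean(np.abs(d_pred - d_gt) / d_gt)
    sq_rel = np.mean((d_pred - d_gt) ** 2 / d_gt)
    rmse = np.sqrt(np.mean((d_pred - d_gt) ** 2))
    rmsle = np.sqrt(np.mean((np.log(d_pred + 1.0) - np.log(d_gt + 1.0)) ** 2))
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse, "RMSLE": rmsle}

d_gt = np.random.uniform(1.0, 80.0, 10000)
d_pred = d_gt * np.random.uniform(0.9, 1.1, 10000)   # predictions with ~10% relative noise
print(depth_metrics(d_pred, d_gt))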

4.3. Training Strategy

Neural network initialization. We use pre-trained ResNet-18 to initialize the MaskModule and DepthModule. This helps speed up model convergence and enables the model to better capture semantic information from frames.
Knowledge distillation. During training, the MaskModule output of the trained teacher model serves as the teacher signal and enters the loss calculation together with the output of the student model. The loss function therefore includes both the prediction error of the student model itself and the discrepancy between the student's output and the teacher's output. Through back-propagation, the student model gradually absorbs the knowledge of the teacher model and improves its performance on the moving object segmentation task.
Depth map optimization. In the bootstrapping stage, the depth map generated by the model is optimized: depth maps are converted to point cloud data, and spatial, normal, and color filtering are applied to remove noise, artifacts, and discontinuities. This step helps improve the quality and accuracy of the depth map.
Parameter adjustment. The network is implemented in PyTorch, and the images are resized to 512 × 256. During the bootstrapping phase, we train the DepthModule for 70 epochs with a learning rate of $1 \times 10^{-4}$ for the first 65 epochs and $1 \times 10^{-5}$ thereafter. The MaskModule is trained for 60 epochs with a learning rate of $1 \times 10^{-4}$. During MaskModule refinement, we train with a learning rate of $1 \times 10^{-4}$ for 32 epochs, while during DepthModule refinement, we train with $1 \times 10^{-4}$ for 15 epochs, followed by $1 \times 10^{-5}$ for 4 epochs.
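For illustration, the staged schedule used for the DepthModule during bootstrapping ($1 \times 10^{-4}$ for the first 65 epochs, then $1 \times 10^{-5}$) can be expressed with PyTorch's MultiStepLR scheduler; the linear layer and the empty epoch body below are placeholders.

import torch

model = torch.nn.Linear(8, 1)                                   # placeholder for the depth network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[65], gamma=0.1)

for epoch in range(70):
    # ... one training epoch would run here ...
    optimizer.step()                                            # placeholder step (no gradients)
    scheduler.step()                                            # drops the LR to 1e-5 after epoch 65

print(optimizer.param_groups[0]["lr"])                          # ~1e-5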
Periodic training. The hyperparameters $o$, $\alpha$, and $\gamma$ are set to 4, $10^{-3} \times 2^{s}$, and 4, respectively.

4.4. Results

The experiments conducted in the UAV-based environment demonstrate that the KD-MonoRec model significantly outperforms other depth estimation models, as shown in Table 1. KD-MonoRec achieves the best results across all evaluation metrics, showcasing its enhanced accuracy and robustness in predicting depth. The improvements in absolute relative error, squared relative error, root mean square error, and root mean square logarithmic error highlight the effectiveness of the proposed model enhancements.
Table 1. Evaluation results. The best results are bolded.
Furthermore, KD-MonoRec maintains the smallest model size, which is particularly beneficial for UAV applications where computational resources and memory are limited. The combination of superior performance and efficiency makes KD-MonoRec a highly suitable choice for depth estimation tasks in resource-constrained environments.

4.5. Ablation Study

First, a baseline model, that is, MonoRec without any of our improvements, is established for comparison with the other experimental settings in Table 2. Distillation-only experiment: the distillation framework is applied to the MonoRec algorithm, using the output of the MaskModule as the teacher signal, which enters the loss calculation together with the output of the student model. By comparing the baseline model with the distillation-only model, the impact of distillation on depth estimation is evaluated. Depth-map-optimization-only experiment: based on the baseline model, only point cloud filtering is applied as the key step of depth map optimization. By comparing the baseline model with the depth-map-optimization-only model, the impact of the depth map optimization strategy on depth estimation is evaluated. Distillation plus depth map optimization experiment: the two strategies are combined and applied to the MonoRec algorithm to evaluate their joint impact on depth estimation. The resulting point cloud comparison is shown in Figure 2.
Table 2. Results of ablation study. KDM refers to KD-MonoRec. The best results are bolded.
Figure 2. Raw images and estimated depth images. The leftmost column represents the raw image, followed by the MonoDepth2 and KD-MonoRec depth estimation results.

5. Conclusions

In conclusion, our study presents a comprehensive framework for enhancing depth estimation algorithms in networked UAV environments. By refining the MonoRec algorithm, introducing key modules for moving object segmentation and depth prediction, and employing knowledge distillation techniques, we achieve significant improvements in accuracy and stability, particularly in dynamic object detection. The integration of point cloud filtering as part of our depth map optimization strategy further enhances the quality of depth maps by reducing noise and artifacts. Through a meticulously designed training strategy, our algorithm demonstrates both real-time performance and robustness in distributed drone environments. Our findings not only contribute to advancing depth estimation in specialized settings but also offer valuable insights into knowledge distillation and depth map optimization methodologies. These advancements provide strong support for practical applications in UAV power distribution networks, showcasing the potential of our approach in addressing real-world challenges in autonomous systems.

Author Contributions

Conceptualization, J.X. and K.Z.; methodology, X.X.; validation, X.X., S.W. and Z.H.; formal analysis, J.X.; investigation, J.X.; resources, L.L.; data curation, X.X.; writing—original draft preparation, J.X.; writing—review and editing, K.Z.; visualization, S.L.; supervision, Z.H.; project administration, L.L.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hunan Electric Power Company Ltd. under Project 5216A5220027 and Hunan Provincial Science and Technology Innovation Platform and Talent Program (2023TP2180).

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Conflicts of Interest

Authors Jian Xiao, Keren Zhang, Xianyong Xu, Shuai Liu, Sheng Wu and Zhihong Huang were employed by the company (the State Grid Hunan Electric Power Corporation Ltd.). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Thrun, S.; Burgard, W.; Fox, D. A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping. In Proceedings of the 2000 ICRA. Millennium Conference, IEEE International Conference on Robotics and Automation, Symposia Proceedings (Cat. No.00CH37065), San Francisco, CA, USA, 24–28 April 2000; Volume 1, pp. 321–328. [Google Scholar] [CrossRef]
  2. Kutulakos, K.N.; Seitz, S.M. A theory of shape by space carving. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 1, pp. 307–314. [Google Scholar]
  3. Tomasi, C.; Kanade, T. Shape and motion from image streams: A factorization method. Proc. Natl. Acad. Sci. USA 1993, 90, 9795–9802. [Google Scholar] [CrossRef] [PubMed]
  4. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  5. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  6. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1106–1114. [Google Scholar] [CrossRef]
  10. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  11. Wu, C.Y.; Wang, J.; Hall, M.; Neumann, U.; Su, S. Toward Practical Monocular Indoor Depth Estimation. arXiv 2022, arXiv:2112.02306. [Google Scholar]
  12. Zhao, Q.; Gao, X.; Li, J.; Luo, L. Optimization algorithm for point cloud quality enhancement based on statistical filtering. J. Sens. 2021, 2021, 1–10. [Google Scholar] [CrossRef]
  13. Wimbauer, F.; Yang, N.; von Stumberg, L.; Zeller, N.; Cremers, D. MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera. arXiv 2021, arXiv:2011.11814. [Google Scholar]
  14. Longuet-Higgins, H.C. A computer algorithm for reconstructing a scene from two projections. Nature 1981, 293, 133–135. [Google Scholar] [CrossRef]
  15. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  16. Curless, B.; Levoy, M. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, 1 August 1996; pp. 303–312. [Google Scholar]
  17. Seitz, S.M.; Curless, B.; Diebel, J.; Scharstein, D.; Szeliski, R. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 519–528. [Google Scholar]
  18. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohli, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. Kinectfusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 127–136. [Google Scholar]
  19. Kim, H.; Leutenegger, S.; Davison, A.J. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 349–364. [Google Scholar]
  20. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  21. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part I 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  22. Lee, L.H.; Braud, T.; Zhou, P.; Wang, L.; Xu, D.; Lin, Z.; Kumar, A.; Bermejo, C.; Hui, P. All one needs to know about metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda. arXiv 2021, arXiv:2110.05352. [Google Scholar]
  23. Shen, S. Accurate multiple view 3d reconstruction using patch-based stereo for large-scale scenes. IEEE Trans. Image Process. 2013, 22, 1901–1914. [Google Scholar] [CrossRef]
  24. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  25. Huang, P.H.; Matzen, K.; Kopf, J.; Ahuja, N.; Huang, J.B. DeepMVS: Learning Multi-view Stereopsis. arXiv 2018, arXiv:1804.00650. [Google Scholar]
  26. Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet: Depth Inference for Unstructured Multi-view Stereo. arXiv 2018, arXiv:1804.02505. [Google Scholar]
  28. Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. PatchmatchNet: Learned Multi-View Patchmatch Stereo. arXiv 2020, arXiv:2012.01411. [Google Scholar]
  29. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv 2020, arXiv:2003.08934. [Google Scholar] [CrossRef]
  30. Wang, C.; Miguel Buenaposada, J.; Zhu, R.; Lucey, S. Learning Depth From Monocular Videos Using Direct Methods. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  32. Pilzer, A.; Lathuilière, S.; Sebe, N.; Ricci, E. Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation. arXiv 2019, arXiv:1903.04202. [Google Scholar]
  33. Wang, Y.; Li, X.; Shi, M.; Xian, K.; Cao, Z. Knowledge distillation for fast and accurate monocular depth estimation on mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2457–2465. [Google Scholar]
  34. Ren, W.; Wang, L.; Piao, Y.; Zhang, M.; Lu, H.; Liu, T. Adaptive co-teaching for unsupervised monocular depth estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 89–105. [Google Scholar]
  35. Feng, Z.; Yang, L.; Jing, L.; Wang, H.; Tian, Y.; Li, B. Disentangling object motion and occlusion for unsupervised multi-frame monocular depth. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 228–244. [Google Scholar]
  36. Li, H.; Gordon, A.; Zhao, H.; Casser, V.; Angelova, A. Unsupervised monocular depth learning in dynamic scenes. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: London, UK, 2021; pp. 1908–1917. [Google Scholar]
  37. Hui, T.W. RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes. arXiv 2023, arXiv:2303.04456. [Google Scholar]
  38. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  39. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar] [CrossRef]
  40. Godard, C.; Aodha, O.M.; Firman, M.; Brostow, G. Digging Into Self-Supervised Monocular Depth Estimation. arXiv 2019, arXiv:1806.01260. [Google Scholar]
  41. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  42. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
  43. Mallya, A.; Lazebnik, S. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. arXiv 2018, arXiv:1711.05769. [Google Scholar]
  44. Wu, Z.; Li, Z.; Fan, Z.G.; Wu, Y.; Wang, X.; Tang, R.; Pu, J. ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2023; PMLR: London, UK, 2023; pp. 3167–3179. [Google Scholar]
  45. Feng, C.; Chen, Z.; Zhang, C.; Hu, W.; Li, B.; Lu, F. Iterdepth: Iterative residual refinement for outdoor self-supervised multi-frame monocular depth estimation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 329–341. [Google Scholar] [CrossRef]
  46. Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv 2023, arXiv:2302.12288. [Google Scholar]
  47. Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 9492–9502. [Google Scholar]
  48. Li, Z.; Bhat, S.F.; Wonka, P. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 10016–10025. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
