Article

Improving Accuracy and Efficiency of Monocular Depth Estimation in Power Grid Environments Using Point Cloud Optimization and Knowledge Distillation

Jian Xiao, Keren Zhang, Xianyong Xu, Shuai Liu, Sheng Wu, Zhihong Huang and Linfeng Li
1 State Grid Hunan Electric Power Corporation Ltd., Research Institute, Changsha 410007, China
2 Hunan Provincial Engineering Research Center for Multimodal Perception and Edge Intelligence in Electric Power, Research Institute, Changsha 410007, China
3 College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Energies 2024, 17(16), 4068; https://doi.org/10.3390/en17164068
Submission received: 3 June 2024 / Revised: 5 August 2024 / Accepted: 7 August 2024 / Published: 16 August 2024
(This article belongs to the Section F1: Electrical Power System)

Abstract

In the context of distribution networks, constructing 3D point cloud maps is crucial, particularly for UAV navigation and path planning tasks. Methods that rely on reflections from surfaces, such as laser scanning and structured light, to obtain depth point clouds for surface modeling and environmental depth estimation are already quite mature in some professional scenarios. However, acquiring dense and accurate depth information in this way typically comes at very high cost. In contrast, monocular image-based depth estimation methods do not require expensive equipment or specialized personnel, making them applicable to a wider range of scenarios. To achieve both high precision and high efficiency in UAV inspection of distribution networks, we draw on knowledge distillation and employ a teacher–student architecture to enable efficient inference. This approach maintains high-quality depth estimation while optimizing the resulting point cloud to obtain more precise results. In this paper, we propose KD-MonoRec, which integrates knowledge distillation into the semi-supervised MonoRec framework. Our method demonstrates excellent performance on the KITTI dataset and performs well in the distribution network environments we collected.

1. Introduction

A point cloud is a 3D representation of an object, environment, or surface, consisting of quantities of points in space. These points are typically captured using techniques such as laser scanning, ground scanning, or photogrammetry, which record each point’s position relative to a reference system. Early research relied on LiDAR to construct these point clouds, as demonstrated by Thrun et al. [1], who fused LiDAR data for localization and mapping. However, the high cost and complexity of LiDAR have driven a shift toward more affordable visual sensors, which capture richer textures. Multi-View Stereo (MVS) [2] is one such method that generates dense 3D point clouds from multiple images. In the 1980s and 1990s, researchers like Tomasi and Kanade [3] and Scharstein and Szeliski [4] began estimating disparity information from multiple perspectives to reconstruct 3D scenes. These methods typically relied on calculating disparity maps, matching corresponding pixels from different viewing angles to estimate depth. Later, Barnes et al. [5] introduced the PatchMatch algorithm for 3D scene reconstruction, and Campos et al. [6] advanced SLAM technologies to generate sparse point clouds using monocular or RGB-D cameras. Despite these advancements, challenges remain in managing model complexity and computational demands. This study explores how integrating deep learning, knowledge distillation, and point cloud optimization can enhance 3D scene understanding.
With the rapid advancement of deep learning in recent years, especially the success of convolutional neural networks (CNNs) like VGGNet [7] and ResNet [8] on large-scale datasets such as ImageNet [9], MVS-based 3D point cloud reconstruction methods have increasingly incorporated deep learning techniques. However, these methods often require complex models and substantial computational resources, making real-time inference on standard GPUs challenging. Therefore, there is a growing interest in simplifying neural networks and reducing the number of parameters. Knowledge distillation is a technique that improves the efficiency of neural networks by transferring knowledge from a larger, more complex model (the teacher) to a smaller, simpler model (the student). Hinton et al. [10] introduced the concept of knowledge distillation, where the student model learns not only from the hard labels of the training data but also from the soft labels generated by the teacher model. These soft labels provide additional information about the class probabilities, enabling the student model to generalize more effectively. Building on this foundation, Wu et al. [11] developed a practical distillation framework that strikes a balance between model accuracy and efficiency. Their method dynamically adjusts the distillation process based on the complexity of the input data, allowing the student model to maintain high performance across various tasks without incurring unnecessary computational costs. We have also implemented this method in an external model, achieving higher accuracy with a more compact model, which is particularly advantageous in resource-constrained environments.
Another crucial aspect of point cloud processing is optimization, which aims to improve the quality and accuracy of point cloud data. For example, Zhao et al. [12] discussed the challenges of generating point clouds in complex environments, such as around utility poles, where noise levels can be high. By applying a series of filtering and optimization steps, such as spatial filtering, normal estimation, and color consistency checks, it is possible to remove noise, outliers, and unnecessary details, resulting in a clearer and more reliable 3D representation. These optimizations ensure that the point cloud data more accurately reflect the true structure and features of the scene, thereby enhancing the effectiveness and accuracy of 3D reconstruction and scene understanding.
In this paper, we investigate the combination of knowledge distillation methods with state-of-the-art, deep learning-based MVS approaches, such as MonoRec [13]. We adopt the basic architecture of MonoRec and design a teacher–student framework, conducting experiments on public datasets and in distribution network environments to evaluate both accuracy and efficiency. Furthermore, we incorporate point cloud optimization techniques into our framework to improve the quality of the reconstructed 3D representations. These optimizations, including spatial filtering, normal estimation, and color consistency checks, are essential for aligning the point cloud data more closely with the actual structure and features of the scene. This integration significantly contributes to the overall effectiveness and accuracy of our approach to 3D reconstruction and scene understanding tasks.

2. Related Work

2.1. Multi-View Stereo

The study of Multi-View Stereo (MVS) [14] can be traced back to the 1980s. The goal of MVS is to use images taken from multiple camera perspectives at different positions to perform 3D point cloud generation or reconstruction. Before the appearance of deep learning, the early traditional MVS methods were mainly based on triangulation, epipolar matching, photometric consistency, and other principles [15] for reconstruction. By calculating the triangulation between corresponding points in multiple viewing angles, the 3D coordinate points of a scene are obtained. However, triangulation-based methods require accurate corresponding point matching and have high computational complexity for large-scale and complex scenes. On the basis of the traditional MVS method, voxel-based traditional methods, point cloud-based traditional methods, and deep learning-based methods have been developed.
Voxel-based traditional methods. To address the problems of corresponding point matching and computational complexity, Curless [16], Seitz [2], and others introduced voxel representations in the 1990s. Voxel-based MVS methods divide the 3D space of a scene into a voxel grid, in which each voxel stores geometric and color information, and estimate the presence of surfaces in each voxel. Through consistency testing and optimization of voxels across multiple perspectives, a more accurate 3D reconstruction is obtained. In 2006, Seitz et al. [17] proposed a patch-based voxel shading method, which extracts image boundary information from multiple views through MVS and fills voxels with image patches to reconstruct scene details; the texture information helps determine voxel visibility and improves reconstruction quality. In 2011, KinectFusion [18] introduced a voxel-based real-time 3D reconstruction technique that acquires depth images in real time and combines pose estimation, 3D reconstruction, and fusion algorithms. The fusion step projects the point clouds of the depth images into the corresponding voxels of the 3D model and updates the voxel information, fusing consecutive depth images into a continuous 3D model. In 2016, Kim et al. [19] introduced a real-time 3D reconstruction and 6-DoF tracking method based on event cameras, which achieves real-time reconstruction and camera tracking by combining the characteristics of event cameras with voxel hashing. Voxel-based MVS reconstruction methods have certain advantages in simplifying the correspondence matching problem, but they still face challenges in capturing fine geometric details.
Point cloud-based traditional methods. In order to further improve the accuracy and efficiency of reconstruction, some work began to use sparse point cloud representation by extracting sparse feature points from images from multiple perspectives, such as SIFT [20], SURF [21], and other local feature descriptors [22], and to estimate the 3D position of the point through triangulation or optimization methods applied to the extracted feature points. Sparse point cloud representation methods have relatively high reconstruction accuracy and can handle larger-scale scenes. Compared to voxel-based methods, they can handle large-scale scenes and complex geometries more flexibly. However, sparse point clouds representing relatively few points may result in a loss of detail in the reconstruction model; therefore, in some applications, such as virtual reality and augmented reality, further densification may be required to obtain more refined results. With the improvement in computing power and algorithms, the MVS method gradually developed from sparse point clouds to dense point cloud representation. Shen et al. [23] developed methods to extract denser feature points or pixels from images taken from multiple perspectives, estimating their 3D positions through optimization or energy minimization techniques. Dense point cloud representation methods can provide more sophisticated geometric details and have better reconstruction effects in terms of surface details and textures.
Deep learning-based methods. In recent years, the widespread application of deep learning in the field of computer vision has had an impact on the traditional MVS method. The success of the convolutional neural network (CNN) [8] in image classification tasks [24] prompted researchers to apply deep learning methods to MVS. A neural network is trained to estimate the depth of each pixel in the image, and a dense 3D reconstruction is produced by integrating the network output from multiple views. DeepMVS [25] is an end-to-end MVS method based on deep learning that uses CNN to predict dense depth maps or point clouds directly from input images, trained and optimized through geometric consistency between multiple views. COLMAP [26] combines traditional visual geometry methods and deep learning technology, using CNN to extract features, and combines image matching and geometric optimization to generate dense point clouds and reconstruction results. MVSNet [27] uses CNN to predict the depth or disparity information of each pixel from the input image, using a pyramid structure to handle features of different scales, taking into account visual and geometric consistency. PatchMatchNet [28] uses CNN to learn the matching relationship between image patches and optimizes it through the PatchMatch algorithm. It can generate dense depth maps or point clouds with high computational efficiency. In 2020, Mildenhall et al. [29] proposed Neural Radiance Fields (NeRF). NeRF can learn the radiance and line-of-sight direction of each point in the scene, as well as the geometric structure in the scene, generating high-quality images and videos. Deep learning methods have achieved certain breakthroughs in reconstruction accuracy and efficiency but have high demands for training data and computing resources.

2.2. Monocular Depth Estimation

Early supervised learning methods are trained using sparsely annotated images with depth labels or dense point clouds from 3D sensors. These methods aim to learn a direct image-to-depth mapping, for example by using convolutional neural networks (CNNs) for depth regression. In contrast, self-supervised learning methods avoid the need for expensive depth labels or sensor information by designing suitable loss functions. A common approach is to use stereo image pairs and learn depth by minimizing the photometric consistency loss between the left and right images; this allows models to learn depth structure automatically from images. Some studies employ multiple frames to improve the accuracy of depth estimation. These methods can exploit optical flow or disparity by combining image information from different times or perspectives [30]. Multi-frame methods are generally better suited to depth estimation in dynamic scenes.
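As an illustration of the photometric consistency objective described above, the following is a minimal PyTorch sketch (not taken from any cited work; the 0.85/0.15 SSIM/L1 weighting and the 3 × 3 window are common choices and assumptions here) of how a source frame warped into the target view is compared against the target frame:

import torch
import torch.nn.functional as F

def ssim_dissimilarity(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Per-pixel structural dissimilarity (1 - SSIM) / 2, computed with a 3x3 averaging window.
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)

def photometric_loss(target: torch.Tensor, warped: torch.Tensor, alpha: float = 0.85) -> torch.Tensor:
    # target, warped: (B, 3, H, W); warped is the source frame re-projected into the target view.
    l1 = (target - warped).abs().mean(1, keepdim=True)
    return (alpha * ssim_dissimilarity(target, warped).mean(1, keepdim=True) + (1 - alpha) * l1).mean()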
In recent years, the development of convolutional neural networks and the introduction of attention mechanisms have brought significant improvements to monocular depth estimation. The attention mechanism enables the network to focus more on key areas in the image, thereby improving the accuracy of depth estimation. Recently, some research has begun to introduce vision-based attention mechanisms (Vision Transformer or ViT) [31] to deal with the monocular depth estimation problem. This method improves the performance of depth estimation by introducing a self-attention mechanism to better capture global and local information in images. In response to the requirements of real-time and computational efficiency, some research focuses on how to reduce the computational burden of depth estimation models while maintaining accuracy to adapt to embedded systems or practical application scenarios. The above methods have achieved some results in different directions, but in specific application scenarios, especially in the UAV power distribution network environment, accuracy, real-time performance, and robustness in dynamic environments need to be comprehensively considered.

2.3. Knowledge Distillation

Knowledge distillation (KD) is a method of transferring learned knowledge from a large teacher model to a lightweight student model. In the field of depth estimation, KD has been successfully applied to improve model performance. For example, Pilzer et al. [32] attempted to use distillation to transfer knowledge from a depth-estimated optimized teacher network to a self-supervised student network to overcome network size limitations in real-time applications. Wang et al. [33] transferred the knowledge of powerful and complex deep models to lightweight student models through knowledge distillation for fast depth estimation on mobile devices. Recent work such as DistDepth [11] uses knowledge distillation to transfer structural knowledge of depth estimation to a self-supervised student model to obtain more accurate indoor depth maps.
In the UAV power distribution network environment, self-supervised monocular deep learning can solve the challenges of complex scenes such as lighting changes, occlusions, and moving objects. Traditional photometric loss training methods have certain shortcomings, so researchers have proposed strategies such as minimum photometric loss of multi-source frames, utilization of segmentation information, geometry-based occlusion masks, lighting alignment, and object motion detection to improve model performance. However, these methods usually require additional information, such as camera intrinsics, semantic labels, or support from additional networks. Some methods attempt to alleviate self-supervised deep learning’s reliance on static object assumptions. Ren et al. [34] proposed an adaptive collaborative teaching framework for unsupervised depth estimation, which leverages the advantages of knowledge distillation and ensemble learning, enabling more accurate depth estimation, while DynamicDepth [35] uses single-frame depth as the prior depth to mitigate the impact of dynamic objects in multi-frame depth networks. Other methods such as Li et al. [36] and RM-Depth [37] are also attempts to solve this problem by predicting the motion of dynamic objects and separating static scenes from dynamic objects on reprojection loss. In this context, knowledge distillation emerges as an effective approach by training a teacher model in self-supervised learning and transferring its knowledge to a small student model, providing strong support for deploying deep learning models in resource-constrained UAV environments.

3. Methods

We follow the previous MonoRec [13] work and study depth estimation further, as shown in Figure 1. We improve that algorithm by incorporating knowledge distillation to increase the accuracy of depth estimation. Building on the original MonoRec framework, which consists of a MaskModule that predicts a mask of dynamic objects and a DepthModule that estimates the depth of the masked image, we adopt knowledge distillation to make inference more efficient, especially in power distribution network environments. In this section, the MaskModule and the DepthModule are first introduced, followed by a description of the knowledge distillation scheme and of the optimization applied to the generated depth maps. Finally, the training strategy of the framework is discussed.

3.1. MaskModule

The goal of MaskModule is to predict a mask $M_t$, where $M_t(x) \in [0, 1]$ represents the probability that pixel $x$ in frame $I_t$ belongs to a moving object. Determining moving objects from frames alone is an ambiguous task that is difficult to generalize. Therefore, we use a set of cost volumes $C_{t'}$ ($t' \in \{1, 2, \ldots, N\} \setminus \{t\}$), each of which encodes the geometric prior between the key frame $I_t$ and one other frame $I_{t'}$. We use the individual $C_{t'}$ instead of the combined volume $C$ because inconsistent geometric information across the different $C_{t'}$ is a strong prior for predicting moving objects: dynamic pixels produce inconsistent optimal depth steps in different $C_{t'}$. However, geometric priors alone are not sufficient to predict moving objects, as poorly textured or non-Lambertian surfaces may also cause inconsistencies. Furthermore, for objects moving at a roughly constant speed, the cost volumes often agree on a wrong depth that is inconsistent with the context of the scene. Therefore, in addition to the geometric priors, we use pre-trained ResNet-18 [8] features of frame $I_t$ to encode semantic priors. The network uses a U-Net architecture [38] with skip connections. All cost volumes pass through an encoder with shared weights, their features are aggregated using max pooling, and the result is passed through the decoder. In this way, MaskModule can be applied to a different number of frames without retraining.
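To make the permutation-invariant aggregation concrete, the following is a minimal PyTorch sketch (module names, channel sizes, and tensor shapes are illustrative assumptions, not the published implementation): each per-frame cost volume is passed through a shared encoder, and the features are fused by an element-wise maximum over the frame axis, so the number of frames can change at inference time without retraining.

import torch
import torch.nn as nn

class SharedCostVolumeEncoder(nn.Module):
    # Shared-weight encoder applied to every per-frame cost volume C_{t'}.
    def __init__(self, depth_steps: int, feat_channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(depth_steps, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, cost_volumes: list) -> torch.Tensor:
        # cost_volumes: list of N tensors, each of shape (B, D, H, W), one per non-key frame.
        feats = torch.stack([self.encoder(c) for c in cost_volumes], dim=1)  # (B, N, F, H, W)
        fused, _ = feats.max(dim=1)  # max pooling over the frame dimension -> (B, F, H, W)
        return fused  # fed to the mask decoder together with ResNet-18 image features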

3.2. DepthModule

DepthModule predicts the dense pixel-wise inverse depth map $D_t$ of the key frame $I_t$. To achieve this, the module receives the full cost volume $C$ concatenated with the key frame $I_t$. Unlike MaskModule, we use $C$ instead of the individual $C_{t'}$ here because aggregating multiple frames into one volume usually improves depth accuracy and increases robustness to photometric noise [39]. To eliminate false depth predictions for moving objects, we perform a pixel-wise multiplication of $(1 - M_t)$ with the cost volume $C$ at every depth step $d$. In this way, the moving object regions do not leave any maxima (i.e., no strong prior), so the DepthModule must rely on image features and the surrounding context to infer the depth of moving objects. We adopt a U-Net architecture [40] whose decoder produces multi-scale depth outputs. Finally, the DepthModule outputs an interpolation factor between $d_{\min}$ and $d_{\max}$. In practice, we use $s = 4$ depth prediction scales.
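The masking step and the final inverse-depth interpolation can be summarized by the short sketch below (tensor shapes are assumptions for illustration; the sign convention follows the description above, attenuating regions predicted as moving):

import torch

def mask_cost_volume(cost_volume: torch.Tensor, moving_mask: torch.Tensor) -> torch.Tensor:
    # cost_volume: (B, D, H, W) photometric cost over D discrete depth steps.
    # moving_mask: (B, 1, H, W) MaskModule output in [0, 1], where 1 means "moving object".
    return cost_volume * (1.0 - moving_mask)  # broadcasts over the depth dimension

def to_inverse_depth(factor: torch.Tensor, d_min: float, d_max: float) -> torch.Tensor:
    # Map the DepthModule's [0, 1] output to an inverse depth between 1/d_max and 1/d_min.
    return 1.0 / d_max + factor * (1.0 / d_min - 1.0 / d_max)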

3.3. Knowledge Distillation Architecture

In depth estimation, we apply knowledge distillation to the MaskModule so that the student model completes the task better by learning from the teacher model. The core of this method is to pass the complex representations and prior information of the teacher model to the student model, improving the student model's performance on the moving object segmentation task.
First, the teacher model is built on a set of cost volumes that encode geometric priors between the current image and the other frames. Each cost volume corresponds to a different viewpoint and makes its own contribution to the depth information of moving objects. The teacher model combines the information of these volumes to generate strong priors for segmenting moving objects.
Next, the student model is designed to leverage this prior information, along with semantic priors from pre-trained ResNet-18 features, to predict more accurately whether each pixel belongs to a moving object. During the knowledge distillation process, the output of the teacher model is used as an additional supervision signal and is combined with the output of the student model in the loss calculation. This joint loss contains two components: the prediction error of the student model itself, and the discrepancy with respect to the output of the teacher model, that is, the complex geometric prior.
In this way, the student model not only learns the basic information of the task during training but also acquires deeper and more critical knowledge for the task from the teacher model. This helps improve the understanding and generalization ability of the student model for the task of depth estimation of moving objects. The idea of knowledge distillation plays the role of transmitting, refining, and strengthening knowledge here, allowing the model to better understand complex scene geometry and semantic information, thereby more accurately segmenting moving objects.
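A minimal sketch of this joint objective is shown below (the binary cross-entropy distillation term and the weight alpha are assumptions used for illustration; the published loss may be weighted differently):

import torch
import torch.nn.functional as F

def joint_distillation_loss(student_mask: torch.Tensor,
                            teacher_mask: torch.Tensor,
                            student_task_loss: torch.Tensor,
                            alpha: float = 0.5) -> torch.Tensor:
    # student_mask, teacher_mask: (B, 1, H, W) moving-object probabilities in [0, 1].
    # student_task_loss: the student's own supervision term (e.g., its self-supervised loss).
    distill_term = F.binary_cross_entropy(student_mask, teacher_mask.detach())  # teacher is frozen
    return (1.0 - alpha) * student_task_loss + alpha * distill_term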

3.4. Depth Optimization

To further optimize the depth map generated by our method in a networked UAV environment, we introduce depth map filtering as a key step in depth map optimization. This method aims to improve the accuracy and stability of the depth map by processing the depth map corresponding to the original image.
We adopted the depth image optimization algorithm, as illustrated in Algorithm 1. We first obtain the depth map generated by KD-MonoRec. Subsequently, we use statistics-based filtering methods to spatially filter the point cloud. By analyzing the distribution characteristics of the point cloud in the depth map, we are able to effectively remove outliers caused by sensor errors or occlusions, thereby improving the overall quality of the depth map. Secondly, in order to better retain the detailed information of the object’s surface, we introduce normal vector information for normal vector filtering. By computing normal vectors between adjacent points, we are able to identify and preserve the main features of the object surface, further reducing artifacts and discontinuities in the depth map.
Specifically, given a point cloud containing a set of points $P$, where each point $p_i$ has three-dimensional coordinates $(x_i, y_i, z_i)$ and a corresponding normal vector $N_i = (n_{x_i}, n_{y_i}, n_{z_i})$, for each point $p_i$ we calculate its normal vector $N_i$ as follows:
$$N_i = \mathrm{ComputeNormal}\left(p_i, \mathrm{Neighbors}(p_i)\right).$$
We then filter the depth map using the computed normal vectors:
$$D_{\mathrm{filtered}} = \mathrm{FilterByNormal}(D, N).$$
Here, $D_{\mathrm{filtered}}$ is the filtered depth map, $D$ is the original depth map, and $N$ represents the normal vectors of the points in the point cloud.
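One possible realization of ComputeNormal and FilterByNormal is sketched below, assuming Open3D is available; the neighborhood size and the angular threshold are illustrative choices, not values from the paper. Depth pixels whose normals deviate strongly from the mean normal of their neighborhood are discarded.

import numpy as np
import open3d as o3d

def filter_by_normals(depth: np.ndarray, points: np.ndarray, pixel_idx: np.ndarray,
                      k: int = 30, angle_thresh_deg: float = 60.0) -> np.ndarray:
    # points: (M, 3) back-projected 3D points of the valid depth pixels.
    # pixel_idx: (M,) flat index of the pixel each point came from.
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamKNN(knn=k))  # N_i = ComputeNormal(p_i, Neighbors(p_i))
    normals = np.asarray(pcd.normals)
    tree = o3d.geometry.KDTreeFlann(pcd)
    filtered = depth.ravel().copy()
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    for i in range(len(points)):
        _, idx, _ = tree.search_knn_vector_3d(pcd.points[i], k)
        mean_n = normals[np.asarray(idx)].mean(axis=0)
        mean_n /= np.linalg.norm(mean_n) + 1e-8
        if abs(float(np.dot(normals[i], mean_n))) < cos_thresh:
            filtered[pixel_idx[i]] = 0.0  # drop depth values with inconsistent surface normals
    return filtered.reshape(depth.shape)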
To enhance the detail recovery ability of the depth map, we use the color information in the point cloud for color filtering. First, we initialize an empty filtered depth map $D_{\mathrm{filtered}}$ with the same dimensions as the input depth map $D$. Then, for each pixel $(x, y)$ in the depth map $D$, we extract the corresponding color $C(x, y)$ from the point cloud $P$ and define a local neighborhood $N_{\mathrm{local}}(x, y)$ of size $N$ centered at pixel $(x, y)$. Next, we compute the average color $\bar{C}(x, y)$ of the local neighborhood $N_{\mathrm{local}}(x, y)$ and the color consistency score $\Delta(x, y)$ between the current pixel color $C(x, y)$ and this average, using the Euclidean distance:
$$\Delta(x, y) = \left\| C(x, y) - \bar{C}(x, y) \right\|.$$
If $\Delta(x, y)$ is below the color consistency threshold $\tau$, we copy the depth value of pixel $(x, y)$ from the original depth map $D(x, y)$ into the filtered depth map $D_{\mathrm{filtered}}(x, y)$. Otherwise, we apply a smoothing operation, such as averaging or median filtering, to $D(x, y)$ within the local neighborhood $N_{\mathrm{local}}(x, y)$ to enforce color consistency. We repeat these steps for all pixels in the depth map $D$ and finally return the filtered depth map $D_{\mathrm{filtered}}$.
Algorithm 1 Depth Image Optimization
Require: depth_map (the depth map produced by KD-MonoRec)
function DepthImageOptimization(depth_map)
    filtered_depth_map ← spatial_filter(depth_map)
    filtered_depth_map ← normal_filter(filtered_depth_map)
    filtered_depth_map ← color_filter(filtered_depth_map)
    return filtered_depth_map
function spatial_filter(depth_map)
    filtered_depth_map ← statistical_filter(depth_map)
    return filtered_depth_map
function normal_filter(depth_map)
    point_cloud ← depth_map_to_point_cloud(depth_map)
    normals ← calculate_normals(point_cloud)
    filtered_depth_map ← filter_with_normals(depth_map, normals)
    return filtered_depth_map
function color_filter(depth_map)
    point_cloud ← depth_map_to_point_cloud(depth_map)
    colors ← extract_colors(point_cloud)
    threshold ← 20    {threshold value used to define color consistency}
    filtered_depth_map ← retain depth values whose colors are consistent within threshold; smooth or discard the rest    {simple color consistency check}
    return filtered_depth_map
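The following Python sketch shows one way Algorithm 1 could be implemented (function names, neighborhood sizes, and thresholds are illustrative assumptions, not the published implementation); the spatial step uses Open3D's statistical outlier removal, and the normal-filtering sketch given earlier in this section would slot in between the two steps shown.

import numpy as np
import open3d as o3d

def depth_to_point_cloud(depth: np.ndarray, K: np.ndarray):
    # Back-project valid depth pixels to 3D camera coordinates; also return their flat pixel indices.
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1), np.flatnonzero(valid.ravel())

def spatial_filter(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    # Statistics-based removal of isolated points (sensor noise, occlusion artifacts).
    points, pixel_idx = depth_to_point_cloud(depth, K)
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    _, kept = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    keep_mask = np.zeros(depth.size, dtype=bool)
    keep_mask[pixel_idx[np.asarray(kept)]] = True
    return np.where(keep_mask.reshape(depth.shape), depth, 0.0)

def color_filter(depth: np.ndarray, colors: np.ndarray, window: int = 5, tau: float = 20.0) -> np.ndarray:
    # Keep a depth value if its color agrees with the local mean color; otherwise smooth it.
    out, r = depth.copy(), window // 2
    h, w = depth.shape
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = colors[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3).astype(np.float64)
            delta = np.linalg.norm(colors[y, x].astype(np.float64) - patch.mean(axis=0))
            if delta >= tau:  # color-inconsistent pixel: replace depth with the local median
                out[y, x] = np.median(depth[y - r:y + r + 1, x - r:x + r + 1])
    return out

def depth_image_optimization(depth: np.ndarray, colors: np.ndarray, K: np.ndarray) -> np.ndarray:
    filtered = spatial_filter(depth, K)
    # filtered = filter_by_normals(...)  # normal-filtering step, see the sketch in Section 3.4
    return color_filter(filtered, colors)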

4. Experiments

4.1. Datasets

KITTI [41] is a public dataset widely used in the fields of computer vision and autonomous driving. The dataset was created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It provides data from multiple sensor modalities, including stereo cameras, LiDAR, and GPS/inertial measurement unit (IMU) data. The scenes in the dataset include vehicle driving, pedestrian walking, and road conditions in urban environments. The KITTI dataset provides rich annotation information, such as camera calibration parameters, LiDAR point clouds, and vehicle motion trajectories, and can be used for the research and evaluation of computer vision tasks such as object detection, object tracking, stereo vision, and semantic segmentation.
EuRoC [42] is a public dataset used for Visual-Inertial Odometry (VIO) and visual SLAM (Simultaneous Localization and Mapping) research. The sequences were recorded with a micro aerial vehicle by the Autonomous Systems Lab at ETH Zurich and released in the context of the European Robotics Challenges (EuRoC) project. The dataset provides stereo camera images and inertial measurement unit (IMU) data collected in an industrial machine hall and in indoor rooms, together with ground-truth trajectories from a laser tracker and a motion-capture system. The camera images, IMU data, and ground-truth poses can be used for the development, evaluation, and comparison of VIO and SLAM algorithms, and the dataset also provides annotation information such as sensor calibration parameters and ground-truth camera and IMU poses.
Distribution network drone dataset. Our custom drone dataset is designed for computer vision and perception tasks in distribution grid environments. Through professional drones equipped with high-resolution cameras, inertial measurement units (IMUs), LiDAR, and GPS modules, we collect data in a variety of scenarios, including power tower inspections, transmission line monitoring, and substation inspections.
The dataset provides rich annotation information, including UAV attitude data, equipment calibration parameters, and environmental conditions, to support computer vision tasks such as target detection, trajectory tracking, and three-dimensional reconstruction. Relevant application areas include power equipment fault detection, environmental modeling, and other distribution network monitoring and maintenance tasks.

4.2. Evaluation Metrics

We employ the following metrics to assess the quality of point cloud generation by drones in the context of power grid distribution environments.
Absolute Relative Difference (AbsRel): This measures the absolute relative error between the predicted depth map and the true depth map. For each pixel, the absolute difference between the predicted and true depth is divided by the true depth, and these values are averaged over all pixels. A smaller AbsRel indicates more accurate depth prediction by the model.
$$\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left| d_{\mathrm{gt}} - d_i \right|}{d_{\mathrm{gt}}}$$
Squared Relative Difference (SqRel): Similar to AbsRel, SqRel quantifies the squared relative error between the predicted depth map and the true depth map. It also evaluates the relative difference between depth prediction and true depth values, but in squared form. SqRel pays more attention to pixels with larger errors.
The formula is expressed as:
$$\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left( d_{\mathrm{gt}} - d_i \right)^2}{d_{\mathrm{gt}}}$$
Root Mean Square Error (RMSE): RMSE measures the root mean square error between the predicted depth map and the true depth map. It is the square root of the sum of the squared differences between predicted and true depth values averaged across all pixels. RMSE indicates the average magnitude of prediction errors by the model.
The formula is expressed as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( d_i - d_{\mathrm{gt}} \right)^2}.$$
Root Mean Square Logarithmic Error (RMSLE): RMSLE calculates the root mean square error in the logarithmic domain. It offers robustness when dealing with nonlinear relationships and is typically used when the target value range is wide.
The formula is expressed as follows:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( \log(d_i + 1) - \log(d_{\mathrm{gt}} + 1) \right)^2}.$$
These metrics collectively provide insights into the accuracy and performance of drone-generated point clouds in power grid distribution environments. Here, $d_i$ is the predicted depth value, $d_{\mathrm{gt}}$ is the true depth value, and $N$ is the total number of pixels. These evaluation metrics play a crucial role in the field of depth estimation. First, the absolute relative error (AbsRel) provides an intuitive measure of depth estimation accuracy by averaging the absolute value of the relative difference between the depth prediction and the true depth value. Its advantage is that it considers the error of every pixel evenly and normalizes it by the true depth, making the evaluation results more universal and comparable. Next, the squared relative error (SqRel) further emphasizes sensitivity to large errors, drawing attention to pixels with large errors in the depth map; this helps provide a more complete understanding of how a model performs across different depth ranges. The root mean square error (RMSE) quantifies the overall average magnitude of the depth prediction error, and its simple and intuitive calculation makes it one of the most commonly used evaluation indicators in the field of depth estimation. Finally, the root mean square logarithmic error (RMSLE) is more robust when dealing with nonlinear relationships, especially when target values are spread over a wide range. Together, these evaluation metrics constitute a multifaceted evaluation of the performance of the depth estimation model.
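The four metrics can be computed directly from their definitions; a small NumPy sketch is given below (restricting the evaluation to pixels with valid ground truth is an assumption about the evaluation protocol):

import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    valid = gt > 0                      # evaluate only pixels with ground-truth depth
    d, d_gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(d_gt - d) / d_gt)
    sq_rel = np.mean((d_gt - d) ** 2 / d_gt)
    rmse = np.sqrt(np.mean((d - d_gt) ** 2))
    rmsle = np.sqrt(np.mean((np.log(d + 1) - np.log(d_gt + 1)) ** 2))
    return {"AbsRel": abs_rel, "SqRel": sq_rel, "RMSE": rmse, "RMSLE": rmsle}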

4.3. Training Strategy

Neural network initialization. We use pre-trained ResNet-18 to initialize the MaskModule and DepthModule. This helps speed up model convergence and enables the model to better capture semantic information from frames.
Knowledge distillation. During training, the MaskModule output of the teacher model is used as the supervision signal and enters the loss calculation together with the output of the student model. The loss function includes both the prediction error of the student model itself and the discrepancy with respect to the output of the teacher model. Through back-propagation, the student model gradually learns the knowledge of the teacher model and improves its performance on the moving object segmentation task.
Depth map optimization. In the bootstrapping stage, the depth maps generated by the model are optimized: they are converted to point cloud data, and spatial, normal, and color filtering are then applied to remove noise, artifacts, and discontinuities. This step helps improve the quality and accuracy of the depth maps.
Parameter adjustment. The neural network is implemented in PyTorch, and the images are resized to 512 × 256. During the bootstrapping phase, we trained the DepthModule for 70 epochs with a learning rate of $1 \times 10^{-4}$ for the first 65 epochs and $1 \times 10^{-5}$ thereafter; a minimal sketch of this schedule is given at the end of this subsection. The MaskModule was trained for 60 epochs with a learning rate of $1 \times 10^{-4}$. During MaskModule refinement, we train with a learning rate of $1 \times 10^{-4}$ for 32 epochs, while during module refinement, we train with $1 \times 10^{-4}$ for 15 epochs, followed by $1 \times 10^{-5}$ for 4 epochs.
Periodic training. The hyperparameters $o$, $\alpha$, and $\gamma$ are set to 4, $10^{-3} \times 2^{-s}$, and 4, respectively.
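A minimal PyTorch sketch of the staged learning-rate schedule used for the DepthModule during bootstrapping follows (the stand-in network and the omitted data loop are simplifications; only the optimizer/scheduler wiring is meant to be illustrative):

import torch
import torch.nn as nn

depth_module = nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the U-Net DepthModule of Section 3.2

optimizer = torch.optim.Adam(depth_module.parameters(), lr=1e-4)
# Drop the learning rate from 1e-4 to 1e-5 after epoch 65, as in the bootstrapping stage.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[65], gamma=0.1)

for epoch in range(70):
    # ... one full training pass over the 512x256 keyframes would go here ...
    scheduler.step()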

4.4. Results

The experiments conducted in the UAV-based environment demonstrate that the KD-MonoRec model significantly outperforms other depth estimation models, as in Table 1. KD-MonoRec achieved the best results across all evaluation metrics, showcasing its enhanced accuracy and robustness in predicting depth. The improvements in absolute relative error, squared relative error, root mean square error, and root mean square logarithmic error highlight the effectiveness of the proposed model enhancements.
Furthermore, KD-MonoRec maintains the smallest model size, which is particularly beneficial for UAV applications where computational resources and memory are limited. The combination of superior performance and efficiency makes KD-MonoRec a highly suitable choice for depth estimation tasks in resource-constrained environments.

4.5. Ablation Study

First, a baseline model, i.e., MonoRec without any of our improvements, is established for comparison with the other experimental results in Table 2. In the distillation-only experiment, the distillation framework is applied to the MonoRec algorithm, using the output of the MaskModule as the teacher signal for loss calculation together with the output of the student model; comparing the baseline with this model shows the impact of distillation on depth estimation. In the depth map optimization-only experiment, only point cloud filtering is applied on top of the baseline as the key step of depth map optimization; comparing the baseline with this model shows the impact of the depth map optimization strategy. Finally, the combined experiment applies both distillation and depth map optimization to the MonoRec algorithm to evaluate their joint impact on depth estimation. The resulting comparison is shown in Figure 2.

5. Conclusions

In conclusion, our study presents a comprehensive framework for enhancing depth estimation algorithms in networked UAV environments. By refining the MonoRec algorithm, introducing key modules for moving object segmentation and depth prediction, and employing knowledge distillation techniques, we achieve significant improvements in accuracy and stability, particularly in dynamic object detection. The integration of point cloud filtering as part of our depth map optimization strategy further enhances the quality of depth maps by reducing noise and artifacts. Through a meticulously designed training strategy, our algorithm demonstrates both real-time performance and robustness in distributed drone environments. Our findings not only contribute to advancing depth estimation in specialized settings but also offer valuable insights into knowledge distillation and depth map optimization methodologies. These advancements provide strong support for practical applications in UAV power distribution networks, showcasing the potential of our approach in addressing real-world challenges in autonomous systems.

Author Contributions

Conceptualization, J.X. and K.Z.; methodology, X.X.; validation, X.X., S.W. and Z.H.; formal analysis, J.X.; investigation, J.X.; resources, L.L.; data curation, X.X.; writing—original draft preparation, J.X.; writing—review and editing, K.Z.; visualization, S.L.; supervision, Z.H.; project administration, L.L.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hunan Electric Power Company Ltd. under Project 5216A5220027 and Hunan Provincial Science and Technology Innovation Platform and Talent Program (2023TP2180).

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Conflicts of Interest

Authors Jian Xiao, Keren Zhang, Xianyong Xu, Shuai Liu, Sheng Wu and Zhihong Huang were employed by the company (the State Grid Hunan Electric Power Corporation Ltd.). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Thrun, S.; Burgard, W.; Fox, D. A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping. In Proceedings of the 2000 ICRA. Millennium Conference, IEEE International Conference on Robotics and Automation, Symposia Proceedings (Cat. No.00CH37065), San Francisco, CA, USA, 24–28 April 2000; Volume 1, pp. 321–328. [Google Scholar] [CrossRef]
  2. Kutulakos, K.N.; Seitz, S.M. A theory of shape by space carving. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 1, pp. 307–314. [Google Scholar]
  3. Tomasi, C.; Kanade, T. Shape and motion from image streams: A factorization method. Proc. Natl. Acad. Sci. USA 1993, 90, 9795–9802. [Google Scholar] [CrossRef] [PubMed]
  4. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
  5. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  6. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  9. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1106–1114. [Google Scholar] [CrossRef]
  10. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  11. Wu, C.Y.; Wang, J.; Hall, M.; Neumann, U.; Su, S. Toward Practical Monocular Indoor Depth Estimation. arXiv 2022, arXiv:2112.02306. [Google Scholar]
  12. Zhao, Q.; Gao, X.; Li, J.; Luo, L. Optimization algorithm for point cloud quality enhancement based on statistical filtering. J. Sens. 2021, 2021, 1–10. [Google Scholar] [CrossRef]
  13. Wimbauer, F.; Yang, N.; von Stumberg, L.; Zeller, N.; Cremers, D. MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments from a Single Moving Camera. arXiv 2021, arXiv:2011.11814. [Google Scholar]
  14. Longuet-Higgins, H.C. A computer algorithm for reconstructing a scene from two projections. Nature 1981, 293, 133–135. [Google Scholar] [CrossRef]
  15. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  16. Curless, B.; Levoy, M. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, 1 August 1996; pp. 303–312. [Google Scholar]
  17. Seitz, S.M.; Curless, B.; Diebel, J.; Scharstein, D.; Szeliski, R. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 519–528. [Google Scholar]
  18. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. Kinectfusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 127–136. [Google Scholar]
  19. Kim, H.; Leutenegger, S.; Davison, A.J. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 349–364. [Google Scholar]
  20. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–25 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157. [Google Scholar]
  21. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part I 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  22. Lee, L.H.; Braud, T.; Zhou, P.; Wang, L.; Xu, D.; Lin, Z.; Kumar, A.; Bermejo, C.; Hui, P. All one needs to know about metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda. arXiv 2021, arXiv:2110.05352. [Google Scholar]
  23. Shen, S. Accurate multiple view 3d reconstruction using patch-based stereo for large-scale scenes. IEEE Trans. Image Process. 2013, 22, 1901–1914. [Google Scholar] [CrossRef]
  24. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  25. Huang, P.H.; Matzen, K.; Kopf, J.; Ahuja, N.; Huang, J.B. DeepMVS: Learning Multi-view Stereopsis. arXiv 2018, arXiv:1804.00650. [Google Scholar]
  26. Schönberger, J.L.; Frahm, J.M. Structure-from-Motion Revisited. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Yao, Y.; Luo, Z.; Li, S.; Fang, T.; Quan, L. MVSNet: Depth Inference for Unstructured Multi-view Stereo. arXiv 2018, arXiv:1804.02505. [Google Scholar]
  28. Wang, F.; Galliani, S.; Vogel, C.; Speciale, P.; Pollefeys, M. PatchmatchNet: Learned Multi-View Patchmatch Stereo. arXiv 2020, arXiv:2012.01411. [Google Scholar]
  29. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. arXiv 2020, arXiv:2003.08934. [Google Scholar] [CrossRef]
  30. Wang, C.; Miguel Buenaposada, J.; Zhu, R.; Lucey, S. Learning Depth From Monocular Videos Using Direct Methods. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  32. Pilzer, A.; Lathuilière, S.; Sebe, N.; Ricci, E. Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation. arXiv 2019, arXiv:1903.04202. [Google Scholar]
  33. Wang, Y.; Li, X.; Shi, M.; Xian, K.; Cao, Z. Knowledge distillation for fast and accurate monocular depth estimation on mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2457–2465. [Google Scholar]
  34. Ren, W.; Wang, L.; Piao, Y.; Zhang, M.; Lu, H.; Liu, T. Adaptive co-teaching for unsupervised monocular depth estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 89–105. [Google Scholar]
  35. Feng, Z.; Yang, L.; Jing, L.; Wang, H.; Tian, Y.; Li, B. Disentangling object motion and occlusion for unsupervised multi-frame monocular depth. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 228–244. [Google Scholar]
  36. Li, H.; Gordon, A.; Zhao, H.; Casser, V.; Angelova, A. Unsupervised monocular depth learning in dynamic scenes. In Proceedings of the Conference on Robot Learning, London, UK, 8–11 November 2021; PMLR: London, UK, 2021; pp. 1908–1917. [Google Scholar]
  37. Hui, T.W. RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes. arXiv 2023, arXiv:2303.04456. [Google Scholar]
  38. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  39. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327. [Google Scholar] [CrossRef]
  40. Godard, C.; Aodha, O.M.; Firman, M.; Brostow, G. Digging Into Self-Supervised Monocular Depth Estimation. arXiv 2019, arXiv:1806.01260. [Google Scholar]
  41. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  42. Burri, M.; Nikolic, J.; Gohl, P.; Schneider, T.; Rehder, J.; Omari, S.; Achtelik, M.W.; Siegwart, R. The EuRoC micro aerial vehicle datasets. Int. J. Robot. Res. 2016, 35, 1157–1163. [Google Scholar] [CrossRef]
  43. Mallya, A.; Lazebnik, S. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. arXiv 2018, arXiv:1711.05769. [Google Scholar]
  44. Wu, Z.; Li, Z.; Fan, Z.G.; Wu, Y.; Wang, X.; Tang, R.; Pu, J. ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2023; PMLR: London, UK, 2023; pp. 3167–3179. [Google Scholar]
  45. Feng, C.; Chen, Z.; Zhang, C.; Hu, W.; Li, B.; Lu, F. Iterdepth: Iterative residual refinement for outdoor self-supervised multi-frame monocular depth estimation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 329–341. [Google Scholar] [CrossRef]
  46. Bhat, S.F.; Birkl, R.; Wofk, D.; Wonka, P.; Müller, M. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv 2023, arXiv:2302.12288. [Google Scholar]
  47. Ke, B.; Obukhov, A.; Huang, S.; Metzger, N.; Daudt, R.C.; Schindler, K. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 9492–9502. [Google Scholar]
  48. Li, Z.; Bhat, S.F.; Wonka, P. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 10016–10025. [Google Scholar]
Figure 1. The KD-MonoRec architecture first constructs a photometric cost volume based on multiple input frames using the SSIM metric to measure photometric consistency. MaskModule detects inconsistencies between different input frames to determine moving objects. The multi-frame cost volume is then multiplied by the prediction mask and passed to the DepthModule, which predicts a dense inverse depth map. In the decoder, the cost volume features are concatenated with pre-trained ResNet-18 features. In order to train the student model to predict the depth map, we introduce the concept of knowledge distillation and use the trained teacher model to guide the training of the student model to improve prediction accuracy and generalization ability.
Figure 2. Raw images and estimated depth images. The leftmost column represents the raw image, followed by the MonoDepth2 and KD-MonoRec depth estimation results.
Table 1. Evaluation results. The best results are bolded.

Model              AbsRel    SqRel    RMSE     RMSLE    Model Size (MB)
PackNet [43]       0.080     0.331    2.914    0.124    328.0
MonoRec [13]       0.050     0.295    2.266    0.082    473.0
ADU-Depth [44]     0.077     0.290    2.723    0.113    341.0
IterDepth [45]     0.103     1.160    3.968    0.166    502.0
ZoeDepth [46]      0.061     0.726    3.218    0.189    522.1
Marigold [47]      0.057     0.662    1.028    0.206    419.2
PatchFusion [48]   0.046     0.560    2.656    0.126    312.9
KD-MonoRec         0.043     0.281    2.259    0.073    289.0
Table 2. Results of ablation study. KDM refers to KD-MonoRec. The best results are bolded.

Model                               AbsRel    SqRel    RMSE     RMSLE
MonoRec                             0.050     0.295    2.266    0.082
KDM w/o KD                          0.048     0.291    2.237    0.077
KDM w/o Point Cloud Optimization    0.053     0.302    2.287    0.093
KD-MonoRec                          0.043     0.281    2.259    0.073
