1. Introduction
Precise dense depth maps are fundamental to various computer vision applications, including facilitating autonomous navigation [1], augmented reality interactions [2], 3D modeling techniques [3], and supporting real-time localization and mapping systems [4]. Despite advancements in depth-sensing technology, hardware limitations often prevent depth sensors from generating dense depth maps directly. To address this limitation, the depth completion technique [5] converts sparse depth inputs into rich, dense depth maps, enabling the derivation of enhanced depth information and corresponding color visuals from limited sparse datasets.
Recent advancements in deep neural networks (DNNs) have substantially enhanced depth completion tasks. The sparse-to-dense method [6] leverages a single deep regression network to handle raw RGB-D data, addressing the challenge of producing dense depth maps from sparse depth samples and raw RGB images. Meanwhile, the convolutional spatial propagation network (CSPN) [7] employs an efficient linear propagation model to effectively capture relationships among pixel neighborhoods. This mechanism refines depth outputs obtained from state-of-the-art methods while converting sparse depth samples into dense maps through an integrated propagation mechanism. Building upon this foundation, the non-local spatial propagation network (NLSPN) [8] further advances this approach by introducing a learnable affinity normalization mechanism. This method estimates non-local neighboring pixels and their respective affinities, integrating them with an initial depth map and pixel confidence values to enhance affinity learning and resilience against inconsistencies at depth boundaries. However, most existing DNN-based depth completion techniques rely heavily on extensive supervised data, yet gathering and processing large numbers of sparse depth maps is both challenging and expensive [5]. To address this limitation, this study employs a few-shot learning framework to predict dense depth from the image.
Figure 1d,e illustrate the testing results obtained using the advanced NLSPN [8] architecture. When trained on a randomly selected 10% of the NYUv2 dataset, the model produces outputs that exhibit more noise and blurriness compared to those trained on the full NYUv2 dataset. An inadequate dataset therefore jeopardizes the model's generalization ability, making it crucial to explore the impact of limited training data on depth completion.
DNN-based depth completion methods typically depend on extensive supervised data samples. However, acquiring and processing large-scale sparse depth maps is both challenging and costly. Moreover, current real-world datasets lack pixel-accurate ground truth due to depth sensor limitations. Furthermore, semi-dense annotations are often imprecise and affected by occlusions, dynamic objects, and other distortions.
To address data scarcity, several researchers are exploring self-supervised frameworks [9,10] to compensate for limited ground-truth depth information. These methods derive supervisory signals from the dataset itself, eliminating the need for extensive labeled datasets. By leveraging intrinsic data patterns and correlations, self-supervised learning enables depth estimation without requiring manual annotations. This strategy mitigates the difficulty of obtaining extensive labeled depth datasets while enhancing the model's capacity to deduce depth from partial or corrupted inputs. Knowledge distillation (KD) plays a crucial role in self-supervised learning, boosting performance, stability, and efficiency. KD [11] extracts essential information from a complex teacher model and transfers it to a more compact student model, enhancing efficiency without compromising performance. KD techniques have been extensively studied for image classification tasks; however, their direct application to pixel-level tasks through classification-based KD methods often yields suboptimal results. Strictly aligning the coarse feature maps of the teacher and student models imposes restrictive constraints, neglecting structured relationships among pixels and reducing overall efficiency.
Unlike previous studies that ignore dataset insufficiency, we introduce a few-shot learning framework that integrates self-training with noise and dense-pixel KD. Our approach yields superior performance in both smoothness and detail sharpness while utilizing the same limited dataset (Figure 1f). We train a teacher model on limited labeled data, ensuring a noise-free training process to produce reliable pseudo-labels for the unlabeled data, which retain high accuracy and remain closely aligned with the true labels. To enhance the student model's training, we introduce noise augmentation, incorporating both labeled and unlabeled data. We apply Gaussian blur to RGB images and introduce stochastic perturbations to sparse depth data. The training process follows an iterative scheme, where the student model transitions into a teacher role to generate pseudo-labels, while a newly initialized student model is trained to improve performance. To sustain the teacher model's high performance, dense-pixel KD is employed, leveraging its acquired knowledge as a supervisory signal for the student model's training [11].
The primary contributions of our study are as follows:
- (1) We propose a novel few-shot learning framework for depth completion. This framework utilizes advanced depth completion techniques to produce dense depth maps from datasets that are either incomplete or contain missing data.
- (2) We employ noise augmentation and KD techniques, utilizing the deep regression network NLSPN to capture and infer depth-related features from limited datasets. Our approach improves the student model's resistance to noise, strengthening its overall performance and generalization abilities.
- (3) Our experimental results on the NYUv2 dataset show that the proposed method substantially enhances the performance of the student model compared to current techniques. Even when limited to only 10% of the training data, our method improves the accuracy from 97.1% to 99.7%, matching the performance achieved with the complete training set.
The remainder of this paper is organized as follows. Section 2 reviews related research, Section 3 details the methodology, Section 4 presents the experimental results and analysis, and Section 5 concludes the study.
3. The Proposed Method
This study proposes a novel few-shot learning framework based on self-training with noise and pixel-wise knowledge distillation (FSLNKD) for depth completion (Figure 2). This framework is built upon a self-training paradigm that leverages a teacher–student architecture. By incorporating noise samples and pixel-level distillation techniques, our approach effectively enhances the student model's training process.
First, we introduce the teacher–student model, a conventional machine learning paradigm that effectively handles few-shot datasets. The teacher model, characterized by its expressiveness and superior performance, is typically trained on a larger or more complex dataset. It often possesses a greater number of parameters and a deeper structure. The teacher model identifies and transfers the intricate data structures and patterns to the student model. We choose a high-precision depth completion network as our teacher model, which is trained using labeled data, i.e., data containing ground-truth annotations. After training, the pre-trained teacher model is utilized to generate pseudo-labels for the unlabeled dataset. The student model, typically less complex with fewer parameters or a shallower architecture, is designed for data-constrained environments. To enhance its performance and generalization capabilities, it leverages knowledge distilled from a more complex teacher model, even with minimal training data. By combining labeled, pseudo-labeled, and noisy samples, the student model is effectively trained.
Second, to enable the student model to derive granular insights from the teacher model, we implement pixel-level distillation to ensure precise knowledge transfer at the pixel scale. This approach proves particularly advantageous in image segmentation and synthesis tasks, where preserving intricate details and spatial consistency is crucial. Within this framework, we adapt KD techniques for depth completion, enabling the student model to generate outputs with fine-grained details comparable to those of the teacher model. This approach allows for the deployment of more compact and computationally efficient student models while maintaining high performance.
Finally, the student model is reconfigured to function as a teacher, enabling an iterative training process, where the teacher model guides the student in re-labeling previously unlabeled data, which are then used to train a new student model. This iterative teacher–student framework improves both the model efficiency and generalization by facilitating effective knowledge transfer. Moreover, it mitigates the challenge of limited data, facilitates model compression, and lays a solid groundwork for future learning tasks.
3.1. Self-Training with Noise for Depth Completion
Algorithm 1 presents the self-training process incorporating noise for depth completion, inspired by the noisy-student approach originally developed to improve ImageNet classification [39].
Algorithm 1: Self-Training with Noise for Depth Completion

Input: Labeled data, $\{(x_i^{rsd}, y_i)\}_{i=1}^{n}$, and unlabeled data, $\{\tilde{x}_j^{rsd}\}_{j=1}^{m}$, where rsd denotes the combination of RGB images and sparse depth, and $y_i$ represents the corresponding ground truth.
Output: Student model $\theta_*^{s}$
1. For k = 1 to K do
2. Train the teacher model, $\theta_*^{t}$, to minimize the loss function that is applied to the annotated dataset:
$$\theta_*^{t} = \arg\min_{\theta^{t}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f\left(x_i^{rsd}; \theta^{t}\right)\right) \quad (1)$$
3. Generate pseudo-labels for the unannotated dataset using the trained teacher model:
$$\tilde{y}_j = f\left(\tilde{x}_j^{rsd}; \theta_*^{t}\right), \quad j = 1, \ldots, m \quad (2)$$
4. Train the student model, $\theta_*^{s}$, by leveraging both labeled and pseudo-labeled data to minimize the loss function while introducing noise into the student model's training process:
$$\theta_*^{s} = \arg\min_{\theta^{s}} \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f^{noised}\left(x_i^{rsd}; \theta^{s}\right)\right) + \frac{1}{m} \sum_{j=1}^{m} \ell\left(\tilde{y}_j, f^{noised}\left(\tilde{x}_j^{rsd}; \theta^{s}\right)\right) \quad (3)$$
5. The student becomes the teacher for the next round, $\theta_*^{t} \leftarrow \theta_*^{s}$; k = k + 1;
6. End for
The algorithm introduces an advanced self-training framework in which the teacher model generates high-quality pseudo-labels using clean data. The student model is trained to replicate labels from both annotated and pseudo-labeled datasets, even when exposed to noisy inputs. This approach significantly boosts the student model’s ability to generalize effectively, surpassing the teacher model’s capabilities.
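For concreteness, a minimal PyTorch-style sketch of this loop is given below. The `make_model` factory, the loader formats, and the `noise` callback are hypothetical placeholders standing in for the NLSPN network and the paper's actual data pipeline, not the authors' implementation:

```python
import copy
import torch

def train_model(model, batches, loss_fn, epochs, noise=None):
    # Generic supervised loop; `noise` perturbs inputs when given (student only).
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.train()
    for _ in range(epochs):
        for rgb, sparse, target in batches:
            if noise is not None:
                rgb, sparse = noise(rgb, sparse)
            loss = loss_fn(model(rgb, sparse), target)
            opt.zero_grad(); loss.backward(); opt.step()

def self_train(make_model, labeled, unlabeled, noise, loss_fn,
               iterations=3, epochs=25):
    # Step 2: train the teacher on clean labeled data (Equation (1)).
    teacher = make_model()
    train_model(teacher, labeled, loss_fn, epochs)
    for _ in range(iterations):
        # Step 3: pseudo-label the unlabeled data with the clean teacher
        # (Equation (2)).
        teacher.eval()
        with torch.no_grad():
            pseudo = [(rgb, sd, teacher(rgb, sd)) for rgb, sd in unlabeled]
        # Step 4: train a fresh, noised student on labeled + pseudo-labeled
        # data (Equation (3)).
        student = make_model()
        train_model(student, list(labeled) + pseudo, loss_fn, epochs, noise=noise)
        # Step 5: the student assumes the teacher role for the next round.
        teacher = copy.deepcopy(student)
    return teacher
```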
Two types of noise augmentation are introduced to the student: Gaussian blur is applied to RGB images [40], and stochastic perturbation is used for sparse depth [41]. Applying noise augmentation to the two input data types enhances the model's ability to utilize the limited labeled data more effectively, promoting the extraction of more generalized features. Expanding the training dataset for the student model is crucial. Therefore, random noise is injected into the input variables of the student model during training, ensuring variations in the data at every time step. Incorporating noise into the student model generates new samples in proximity to existing samples, effectively smoothing the structure of the input space. This smoothing effect simplifies the learning process by making the mapping function easier to model. Furthermore, the injected noise mimics potential errors encountered in real-world depth information collection, enhancing the model's robustness. The two types of noise used in this process are as follows:
(1) Gaussian blur on RGB images: Gaussian blur is a commonly utilized image processing technique that applies a Gaussian function to modify pixel values within the image. This technique reduces sharp edges and fine details, producing a visually softened effect. The Gaussian function, known for its bell-shaped distribution, computes the weight of each pixel based on its distance from the kernel's center. Pixels closer to the center have a greater influence on the resulting blurred value. Incorporating Gaussian blur into RGB images during self-training introduces controlled noise, which strengthens the model's capacity to generalize and extract robust features for depth completion. This technique enhances the model's resilience to variations in input data, boosting performance and precision in depth estimation tasks. The 1D Gaussian function is expressed as follows:
$$G(r) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{r^2}{2\sigma^2}} \quad (4)$$
In 2D space, the Gaussian function is defined as the product of two independent 1D Gaussian functions, each corresponding to a distinct dimension:
$$G(r, d) = \frac{1}{2\pi\sigma^2} e^{-\frac{r^2 + d^2}{2\sigma^2}} \quad (5)$$
where r denotes the horizontal distance from the origin, d implies the vertical distance from the origin, and σ signifies the standard deviation of the Gaussian distribution. When extended to two dimensions, Equation (5) creates a surface characterized by concentric circular contours, indicating a Gaussian distribution that radiates symmetrically from the central point.
(2) Stochastic perturbation on sparse depth: To improve the model's resilience and generalization capability, we introduce stochastic noise perturbation into the sparse depth data. This process involves randomly selecting a subset, with 1–k% of the available depth points, and modifying each chosen point by a small random noise within a range of ±ε, where ε = 0–5 units. By introducing these controlled perturbations, we generate a diverse set of slightly modified depth maps during training, effectively simulating the natural variability and uncertainties in real-world depth measurements. This method fosters greater model resilience and mitigates overfitting by reducing the model's reliance on specific sparse data patterns, allowing adaptation to a broad spectrum of depth scenarios. This approach substantially improves both the performance and the precision of our framework in depth completion tasks; a sketch covering both noise types follows.
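As a concrete illustration, the sketch below applies both noise types with NumPy and OpenCV. The perturbed fraction is an illustrative choice, while σ = 1.5 and ε = 3 match the ablation settings in Section 4.5; none of these values are claimed to be the paper's exact pipeline:

```python
import cv2
import numpy as np

def noise_augment(rgb, sparse_depth, sigma=1.5, frac=0.05, eps=3.0, seed=None):
    """Gaussian blur on the RGB image (Equations (4)-(5)) plus stochastic
    perturbation of a random subset of sparse depth points."""
    rng = np.random.default_rng(seed)

    # (1) Gaussian blur: ksize=(0, 0) lets OpenCV derive the kernel from sigma.
    rgb_blur = cv2.GaussianBlur(rgb, ksize=(0, 0), sigmaX=sigma)

    # (2) Stochastic perturbation: pick a random subset of the valid
    # (non-zero) depth points and shift each by uniform noise in [-eps, eps].
    depth = sparse_depth.copy()
    ys, xs = np.nonzero(depth)
    k = max(1, int(frac * len(ys)))
    idx = rng.choice(len(ys), size=k, replace=False)
    depth[ys[idx], xs[idx]] += rng.uniform(-eps, eps, size=k)
    np.clip(depth, 0.0, None, out=depth)  # keep depth non-negative

    return rgb_blur, depth
```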
Our improvement strategy involves incorporating noise into the student model while retaining a level of complexity equal to or greater than that of the teacher model. This knowledge expansion technique enables the student to outperform the teacher by equipping it with adequate capacity and subjecting it to demanding conditions through controlled noise.
Figure 3 evaluates noise-processing effects through visual comparisons. Figure 3a,b contrast the original RGB image with its Gaussian-blurred counterpart, where edge sharpness and fine details are attenuated to create a smoothed appearance. Subsequent columns analyze prediction outcomes across varying inputs: Figure 3c shows baseline results using unmodified data; Figure 3d shows predictions with Gaussian-blurred RGB inputs, highlighting the impact of detail loss; and Figure 3e shows outputs combining the original RGB image with stochastically perturbed depth data. Incorporating noise into the student model during training establishes a more rigorous learning environment, promoting the acquisition of robust and generalized features. This regularization technique mitigates overfitting on training data and enhances the model's adaptability to diverse input scenarios. Furthermore, the student model, possessing a complexity level equal to or greater than that of the teacher model, fully leverages the knowledge transferred from the teacher. By expanding the student model's capacity, learning becomes more efficient, allowing it to capture and retain more features and patterns. Our approach surpasses the constraints of the teacher model by equipping the student model with both the capability and the challenging conditions necessary for superior learning. This approach promotes the development of a more robust and precise depth completion model and highlights the effectiveness of well-designed student–teacher training frameworks in performance improvements.
3.2. Pixel-Wise Knowledge Distillation for Depth Completion
Depth completion, unlike conventional image classification, requires dense predictions at the pixel level. Inspired by Hinton's KD [11], this approach matches the probability distribution of classes for each pixel between the student and teacher models, as follows:
$$\mathcal{L}_{KD} = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} \mathrm{KL}\left( p_{h,w}^{t} \,\|\, p_{h,w}^{s} \right) \quad (6)$$
where $p_{h,w}^{s}$ and $p_{h,w}^{t}$ indicate the soft class probabilities at pixel position (h, w), generated by the student and teacher models, respectively; S denotes the associated similarity matrix, formulated in Equation (7); KL refers to the Kullback–Leibler divergence [42]; and T signifies the temperature parameter used to soften the probabilities before the softmax.
The similarity matrix represents the pixel-to-region similarity matrix, quantifying the extent to which individual pixels correspond to predefined regions or clusters within the image. This similarity or dissimilarity can be determined by various factors, including color similarity (such as color histograms or color distances, e.g., the Euclidean distance), texture-based similarity (using texture features), or an integrated approach combining both elements. Within the depth completion task, similarity measures determine how closely the inpainted region aligns with its surrounding known regions, deducing the missing or damaged parts based on the available surrounding information. For a given input, x, the resulting pixel embeddings are denoted as F. To simplify the notation, let V denote the matrix obtained by concatenating the elements of F along the row dimension. Consequently, S is formulated as follows:
$$S_{ij} = \frac{v_i^{\top} v_j}{\left\| v_i \right\|_2 \left\| v_j \right\|_2} \quad (7)$$
where $v_i$ and $v_j$ denote the i-th and j-th rows of V.
We utilize pixel-wise class probability distillation for the depth completion task, as illustrated in Figure 4.
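To make the distillation terms concrete, here is a hedged PyTorch sketch of Equations (6) and (7). The assumption that the networks expose per-pixel class logits (e.g., over discretized depth bins) and a pixel-embedding map is illustrative rather than taken from the paper's implementation:

```python
import torch.nn.functional as F

def pixelwise_kd_loss(student_logits, teacher_logits, T=2.0):
    # Per-pixel class distributions softened by temperature T; the KL
    # divergence is averaged over all pixels (Equation (6)).
    log_p_s = F.log_softmax(student_logits / T, dim=1)   # (B, C, H, W)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # (B, H, W)
    return (T * T) * kl.mean()

def similarity_matrix(embeddings):
    # Pixel embeddings F: (B, C, H, W) -> rows of V: (B, H*W, C); S holds
    # cosine similarities between every pair of pixels (Equation (7)).
    V = embeddings.flatten(2).transpose(1, 2)  # (B, H*W, C)
    V = F.normalize(V, dim=2)                  # unit-norm rows
    return V @ V.transpose(1, 2)               # (B, H*W, H*W)
```

The T² factor compensates for the gradient scaling introduced by the softened distributions, following standard KD practice [11].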
Our approach enhances depth predictions by utilizing pixel-wise class probability distillation, thereby significantly boosting the accuracy, robustness, and reliability of depth estimates, especially in areas with sparse or uncertain data (Figure 5). This method produces more precise and superior outcomes across diverse applications. Following the preliminary depth estimation or the inpainting of missing depth values, a subsequent pixel-wise class probability distillation process refines these predictions further. By leveraging learned probability distributions, this stage adjusts depth values while incorporating contextual information from adjacent pixels and regions. In depth completion tasks, this technique facilitates more informed decisions when assigning depth values to pixels in inpainted or incomplete areas. By incorporating probability distributions, the approach effectively handles uncertainty and ambiguity in depth estimates. Moreover, it quantifies the uncertainty or confidence level for each pixel's depth prediction, which is critical for applications demanding reliable depth information. By utilizing intricate pixel depth features, this approach enhances adjustments based on local context and neighboring pixels. Consequently, the outcome demonstrates improved smoothness and structural consistency. By integrating probability distributions with contextual information for each pixel, the approach refines both the accuracy and dependability of depth predictions. This refined methodology proves highly beneficial for depth completion systems, producing high-quality depth maps for advanced applications across various fields.
3.3. Loss Function
The combined loss function employed for training the student network optimizes performance through a balanced integration of multiple objectives:
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \, \mathcal{L}_{KD} \quad (8)$$
where $\mathcal{L}_{task}$ denotes the supervised task loss of Equation (3), $\mathcal{L}_{KD}$ denotes the pixel-wise distillation loss of Equation (6), and $\lambda$ indicates the scaling parameter employed to harmonize the impact of the depth and structure distillation losses.
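In code, Equation (8) reduces to a single weighted sum. Reusing `pixelwise_kd_loss` from the sketch above, with an illustrative λ and an $\ell_1$ placeholder for the task loss (both assumptions, not the paper's tuned settings):

```python
import torch.nn.functional as F

def total_loss(pred, target, student_logits, teacher_logits, lam=0.5):
    # Equation (8): task loss (Equation (3)) plus the lambda-weighted
    # pixel-wise KD term (Equation (6)).
    task = F.l1_loss(pred, target)  # placeholder supervised task loss
    kd = pixelwise_kd_loss(student_logits, teacher_logits, T=2.0)
    return task + lam * kd
```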
4. Experiments
4.1. Datasets
Our methodology is assessed using the NYUv2 dataset (https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html, accessed on 1 January 2025) [43] and the KITTI Depth Completion (KITTI DC) dataset (https://www.cvlibs.net/datasets/kitti/eval_depth.php, accessed on 1 January 2025) [44]. The NYUv2 dataset [43] consists of RGB and depth images collected from 464 indoor scenes captured by a Kinect sensor. By the official split, 249 scenes are allocated for training, while the remaining 215 are reserved for testing, with the test set comprising a total of 654 images. From the raw training data, 46,000 images are extracted. To address missing depth values, interpolation is performed using the cross-bilateral filter provided in the official toolkit. All the images are resized to 320 × 240 and center-cropped to 304 × 228. For each image, a random selection of 500 sparse depth samples is made. The model undergoes training for 25 epochs using the loss function described in Section 3.3, with the learning rate decreasing by a factor of 0.2 every five epochs after the first 10 epochs. The training is conducted with a batch size of 24. To evaluate the model's few-shot learning capabilities, approximately 20 data points are extracted from each sequence, forming a 20-shot training set comprising around 5000 data points. The model is trained using the reduced dataset, while evaluation is carried out on the designated validation set. The KITTI DC dataset [44] contains over 90,000 synchronized RGB images and sparse LiDAR point clouds captured in outdoor driving scenarios. Following the standard protocol, we exclude the top 100 pixels, where LiDAR projections are absent due to the sensor mounting position. All images are center-cropped to 1216 × 240 patches for training consistency. The model is trained for 25 epochs with an initial learning rate that decays by a factor of 0.4 every 5 epochs after the first 10 epochs, using a batch size of 25.
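As an illustration of the sparse input construction described above, the following sketch samples 500 random valid depth points per image; treating zero as the invalid-depth marker is an assumption about the preprocessed maps, not a detail specified by the datasets:

```python
import numpy as np

def sample_sparse_depth(dense_depth, n_samples=500, seed=None):
    # Randomly keep n_samples valid depth pixels and zero out the rest,
    # mimicking the sparse input used for NYUv2 training.
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(dense_depth)  # valid depth locations
    idx = rng.choice(len(ys), size=min(n_samples, len(ys)), replace=False)
    sparse = np.zeros_like(dense_depth)
    sparse[ys[idx], xs[idx]] = dense_depth[ys[idx], xs[idx]]
    return sparse
```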
4.2. Evaluation Metrics
We utilize well-established evaluation metrics [6,7,43], where $d_i^{gt}$ implies the actual depth value at a specific pixel position, $i$, $d_i$ signifies the predicted depth, and N refers to the total number of pixels. The metrics are defined as follows:

Root mean square error (RMSE): $\sqrt{\frac{1}{N} \sum_{i} \left( d_i - d_i^{gt} \right)^2}$

Mean absolute relative error (REL): $\frac{1}{N} \sum_{i} \frac{\left| d_i - d_i^{gt} \right|}{d_i^{gt}}$

Mean absolute error (MAE): $\frac{1}{N} \sum_{i} \left| d_i - d_i^{gt} \right|$

Improved root mean square error (iRMSE): $\sqrt{\frac{1}{N} \sum_{i} \left( \frac{1}{d_i} - \frac{1}{d_i^{gt}} \right)^2}$

Improved mean absolute error (iMAE): $\frac{1}{N} \sum_{i} \left| \frac{1}{d_i} - \frac{1}{d_i^{gt}} \right|$

Threshold $\delta_{\tau}$: this metric determines the percentage of $d_i$ that satisfies the condition $\max\left( \frac{d_i}{d_i^{gt}}, \frac{d_i^{gt}}{d_i} \right) < \tau$, where $\tau = 1.25, 1.25^2, 1.25^3$.
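These metrics translate directly into a few lines of NumPy; masking to valid (gt > 0) pixels is a common evaluation convention assumed here rather than stated above:

```python
import numpy as np

def depth_metrics(pred, gt, tau=1.25):
    # Standard depth completion metrics over valid (gt > 0) pixels.
    mask = gt > 0
    d, g = pred[mask], gt[mask]
    rmse = np.sqrt(np.mean((d - g) ** 2))
    rel = np.mean(np.abs(d - g) / g)
    mae = np.mean(np.abs(d - g))
    irmse = np.sqrt(np.mean((1.0 / d - 1.0 / g) ** 2))
    imae = np.mean(np.abs(1.0 / d - 1.0 / g))
    delta = np.mean(np.maximum(d / g, g / d) < tau)  # fraction under threshold
    return dict(RMSE=rmse, REL=rel, MAE=mae, iRMSE=irmse, iMAE=imae, delta=delta)
```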
4.3. Comparison to State-of-the-Art Techniques
We assess our methodology by comparing it against state-of-the-art techniques in guided depth completion.
The sparse-to-dense method [6] processes sparse depth measurements as input, setting unobserved areas to an initial value of zero. By leveraging a guidance image, it systematically generates a dense depth map through an end-to-end process. The sparse-to-dense (SS) method [9] refines the approach in [6] by integrating self-supervision through photometric loss derived from consecutive frames. Meanwhile, CSPN [7] further enhances the sparse-to-dense method [6] by employing a recurrent spatial propagation mechanism to refine depth estimation.
Nconv-CNN [45] integrates a normalized convolutional layer to effectively process sparse inputs while employing a confidence map for unguided depth completion. The output generated by this unguided network is subsequently fused with the RGB image to enhance the performance of guided depth completion. In contrast, DeepLiDAR [12] employs sparse data as its primary input and incorporates surface normal estimation as an intermediary process to generate dense depth maps. This approach relies on ground-truth surface normals for supervised training. D³-Random is a modified version of deep depth densification (D³) [15] that generates an intermediate dense depth map from sparse data. It employs nearest neighbor interpolation to enhance the initial sparse depth measurements before passing them into a depth completion network. In its initial form, D³ utilizes a uniform grid with over 500 sampled points as sparse measurements, ensuring a well-balanced distribution and higher density. For an equitable assessment, we retrained and reassessed D³ using the same sparse measurement pattern as our approach, referring to this modified version as D³-Random. Finally, the NLSPN [8] enhances depth completion by leveraging non-local spatial information to iteratively refine and enhance the depth map.
Table 1 demonstrates that the proposed few-shot learning framework substantially enhances the NLSPN performance [8] across all three evaluation criteria using only one-tenth of the labeled data required for the same task. Table 2 reports our method's performance on the KITTI benchmark. By leveraging both labeled and pseudo-labeled data through our self-training framework, we achieve competitive results compared to existing supervised approaches while using only 10% of the labeled training samples. The integration of pixel-wise knowledge distillation effectively preserves structural details in challenging outdoor scenarios with dynamic objects and large depth ranges.
The student model significantly outperforms the teacher model by leveraging pseudo-labels strategically. Although the student model emulates the teacher model's behavior, noise is incorporated into the student model's training while being excluded from the teacher model's. This asymmetry keeps the teacher's pseudo-labels accurate while improving the generalization of the student model.
Moreover, our proposed dense-pixel-wise KD module retains the advanced capabilities demonstrated in previous iterations. It preserves and progressively enhances the knowledge acquired in earlier stages, strengthening its overall robustness and effectiveness.
Consequently, our few-shot learning framework enhances the performance of the original NLSPN model while substantially reducing the need for labeled data. By strategically incorporating noise and leveraging KD, the student model acquires a more robust understanding, yielding improved generalization. Therefore, the student model surpasses the teacher model, showcasing the effectiveness and scalability of our approach for efficient depth completion.
4.4. Iterative Effect Analysis
We conduct a detailed analysis of the effects of iterative training on the performance of our depth completion frameworks. Our approach conducts the supervised training of the NLSPN model using labeled data, designating this initial model as the “teacher.” Once trained, the teacher model serves as a knowledge source for training a subsequent “student” model, effectively facilitating the transfer of learned insights. Upon reaching an adequate level of training, the student model assumes the role of the teacher for the successive training iterations.
This iterative training process operates cyclically, where each newly trained student model assumes the role of the teacher in subsequent iterations. This continuous cycle of role reversal and training drives a progressive improvement in the model's capabilities. The performance metrics in Table 3 indicate substantial improvements in each iteration, demonstrating a consistent upward trend. Figure 6 demonstrates that the results improve progressively with each iteration.
The consistent and progressive performance improvement demonstrates the effectiveness of our iterative training approach. In every cycle, the model’s capacity for depth completion undergoes gradual refinement, ensuring that the knowledge transferred and acquired in each iteration is expanded and optimized. This iterative strategy enhances the precision and overall quality of depth estimation and underscores the robustness and scalability of our training framework, advancing depth completion technology.
4.5. Ablation Study
To evaluate the impact of the proposed self-training with noise and the pixel-wise knowledge distillation module, we conduct a comprehensive ablation study on the NYUv2 dataset. We systematically examine three critical aspects of the methodology.
4.5.1. Contribution of Self-Training with Noise
To evaluate noise-enhanced self-training, three configurations were analyzed:
Baseline: standard NLSPN model without self-training or noise augmentation.
Self-training only: teacher–student iterative training framework implemented, excluding noise perturbations.
Full framework: the integration of self-training with dual noise strategies, namely RGB Gaussian blur and stochastic perturbation of sparse depth data.
The experimental outcomes are summarized in Table 4. Implementing the self-training framework (Configuration 2) notably elevated model performance over the baseline (Configuration 1), reducing the RMSE from 0.240 to 0.184. Incorporating noise augmentation (Configuration 3) further decreased the RMSE to 0.090, lowered the REL from 0.048 to 0.011, and achieved a $\delta_{1.25}$ of 99.7%. These improvements suggest that simulating real-world input variations through noise strategies strengthens the model robustness and generalizability.
4.5.2. Contribution of Pixel-Wise Knowledge Distillation
To evaluate the pixel-wise knowledge distillation module's efficacy, we removed it from the full framework (self-training + noise augmentation) and assessed two variants:
Full framework: we employed the combined task and KD losses defined in Equation (8).
KD ablation: we relied solely on the task loss specified in Equation (3), omitting KD components.
Table 5 reveals notable performance deterioration when eliminating KD, with the RMSE rising from 0.090 to 0.163 and the REL increasing from 0.011 to 0.024. This demonstrates pixel-wise KD’s critical role in transferring the teacher model’s structural knowledge for dense pixel prediction, particularly preserving edge precision and spatial coherence through targeted supervision.
4.5.3. Independent Effects of Noise Variants
To isolate individual noise impacts, we performed component-wise ablations:
RGB Gaussian blur only: Gaussian blur (σ = 1.5) was applied exclusively to RGB inputs with unmodified depth data.
Depth perturbation only: random displacement (ε = 3) was introduced solely to sparse depth measurements.
Dual noise: the concurrent implementation of both modalities.
Table 6 demonstrates that individual noise strategies—RGB Gaussian blur or sparse depth perturbation—each enhance performance, while their combined implementation achieves the peak efficacy (RMSE 0.090). This dual-modality approach leverages complementary perturbations across visual and geometric data domains, synergistically improving the robustness against heterogeneous input variations.
4.5.4. Discussion
The ablation studies conducted on the NYUv2 dataset validate three critical observations:
Self-training: iterative pseudo-label generation mitigates overfitting in limited-data regimes.
Noise augmentation: simulated input perturbations boost the robustness to real-world data artifacts.
Pixel-wise knowledge distillation: the teacher–student knowledge transfer preserves structural patterns, proving essential for edge-detail reconstruction in dense prediction tasks.
While these experiments conclusively demonstrate the module efficacy in indoor scenarios, their generalizability across diverse environments remains to be verified. Future work will extend the ablation analysis to outdoor benchmarks like KITTI DC, investigating two critical dimensions:
Domain-specific adaptation: outdoor environments exhibit fundamentally different noise characteristics—sensor ranges exceed 80 m in KITTI versus 10 m in NYUv2, and the LiDAR sparsity patterns differ from those of Kinect depth sensors. We plan to develop adaptive noise scheduling that dynamically adjusts the perturbation intensity based on depth distribution statistics and scene semantics.
Cross-domain knowledge transfer: the pixel-wise distillation mechanism currently operates within single domains. We propose cross-domain knowledge bridging where teacher models pretrained on NYUv2 guide student training on KITTI pseudo-labels, potentially reducing annotation dependency. This would require addressing domain gaps through attention-guided feature alignment between indoor structural priors and outdoor geometric regularities.