1. Introduction
Autonomous driving technology is one of the most extensively researched areas in recent times, owing to its wide applicability and significant impact across industries. Such vehicles are equipped to perceive and analyze their surrounding environment without a human driver, facilitating safe navigation to designated destinations. However, one of the key challenges these technologies continue to face is the ability to comprehend and process three-dimensional space accurately and quickly enough for real-time driving decisions [1,2]. Because three-dimensional spatial perception plays a crucial role in autonomous driving systems, research efforts have pursued depth estimation with expensive equipment such as LiDAR through sensor fusion techniques [3,4,5,6,7,8]. For effective implementation and widespread commercial application, however, it is preferable to estimate the depth of objects and scenes quickly and cost-effectively using only a single camera [9,10]. This technique, monocular depth estimation (MDE), maximizes the safety and efficiency of autonomous vehicles by providing the three-dimensional information needed to understand the surrounding vehicular environment, avoid obstacles, and plan safe routes [9,10]. A variety of MDE methods have been developed [9,10,11], but they often struggle with real-world variability, such as complex lighting conditions, diverse meteorological circumstances, and the variety of objects and textures. These performance limitations can be substantially reduced by leveraging rapidly advancing deep learning technologies, provided that comprehensive datasets covering such diverse scenarios are available and that advanced data augmentation techniques are applied. In this study, we introduce synthetic-based data augmentation techniques that account for data diversity. Specifically, we propose the Mask method, which segments objects of interest from one image and synthesizes them onto another. This approach is further extended by Mask-Scale, which adds resizing adjustments, and CutFlip, which is based on image flipping, to maximize the utilization of natural features and textures within existing datasets. Furthermore, we derive optimal data augmentation strategies for contemporary MDE technologies by combining these techniques with data augmentation methods previously suggested in other studies. Building on this, we propose the RMS algorithm, which integrates the latest MDE techniques, loss functions for MDE training, network compression methods, and system deployment considerations; it is designed to derive optimal application strategies tailored to specific autonomous driving applications.
The remainder of this paper is organized as follows. Section 2 introduces the research related to the proposed techniques and outlines the major contributions. Section 3 details the technical aspects of the MDE methods. Section 4 describes the proposed data augmentation techniques, specifically the Mask, Mask-Scale, and CutFlip methods. Section 5 elaborates on the three operational stages of the proposed RMS algorithm. In Section 6, the performance of the proposed techniques is validated, and Section 7 concludes the discussion.
2. Related Work
Prior to the advent of deep learning, early research in MDE primarily revolved around depth-cue-based approaches [12,13,14]. Study [12] utilized an approach based on the vanishing point, study [13] focused on depth perception derived from focus and defocus techniques, and study [14] employed a shadow-based approach. However, these studies could perform MDE only under limited conditions, rendering them inadequate for real-world settings with diverse variations.
With the advent and progression of deep learning [15,16,17], research in the field of MDE also began to incorporate deep learning methodologies [18]. This approach is characterized by an encoder–decoder structure that receives RGB input and produces depth maps. Subsequently, numerous studies [18,19,20,21,22,23,24,25] emerged, adopting a similar encoder–decoder framework. Further advancements were made as research [26,27,28] explored the generation of depth maps based on probabilistic combinations of sequential images using CRFs (Conditional Random Fields) applied to the output feature maps of the encoder. In study [26], depth maps were derived by extracting feature maps of various sizes from consecutive images and combining them using an attention-based mechanism. Additionally, the application of CRFs was diversified, with study [29] implementing multiple cascade CRFs, study [27] using continuous CRFs, study [28] applying hierarchical CRFs, and study [30] employing FC-CRFs (Fully Connected CRFs) for performing MDE.
However, the application of supervised learning to MDE incurs high data-labeling costs. To mitigate this, attempts have been made to employ unsupervised learning methodologies [31,32,33,34,35,36]. These studies, predominantly based on image reconstruction techniques, stereo matching, and depth extraction through camera pose estimation from consecutive video frames, introduce additional complexity without achieving significant advancements in accuracy. Meanwhile, as an alternative approach to overcoming the issue of insufficient data in MDE, several research attempts have been made to generate variant data to supplement scarce training datasets. In studies [37,38,39,40,41,42,43,44,45,46,47], attempts were made to augment and utilize existing data through various methods such as data augmentation techniques, style transfer, and data synthesis. The studies [48,49,50] explored copy-and-paste-based data augmentation techniques to enhance performance in various tasks, specifically object detection and segmentation. This approach integrates elements from one image into another to enrich the dataset and improve the robustness of the models trained for these applications. Similarly, the study [37] explored data augmentation for MDE in its CutDepth approach by pasting rectangular regions from one image onto the original image. This method was analyzed to improve the REL performance by approximately 1.5%, demonstrating its efficacy in improving depth estimation accuracy. Based on the CutDepth framework, variant composite techniques have emerged, specifically the vertical- and perpendicular-orientation techniques referred to as vertical CutDepth [51] and perpendicular CutDepth [52], respectively; these have demonstrated performance improvements comparable to those achieved with the original CutDepth approach. In studies [38,39,40,41,42,43,44,45,46,47], classical methods involving noise, brightness, and contrast adjustments as well as multi-scale and geometric transformations were applied to MDE techniques, improving the REL performance metric on the KITTI dataset according to the analysis. The study [53] analyzed the performance of a simple encoder–decoder-based MDE model under data augmentation techniques such as scale, rotation, color jitter, color normalization, and flips, utilizing geometric variations and filtering methods, and likewise reported an improved REL performance on the KITTI dataset. Several studies [54,55,56,57] have utilized Generative Adversarial Network (GAN) technology to generate data or perform style transformations for data augmentation purposes. In particular, research [55,56,57] focused on implementing data augmentation techniques tailored to various weather conditions to achieve robust performance. Additionally, the study [58] explored the creation of image data and corresponding depth labels within a virtual environment for use as data resources. The study [59] introduced a reliable data augmentation that minimizes the loss between disparity maps generated from the original and augmented images, enhancing robustness to color fluctuations. Similarly, the study [60] attempted to enhance depth estimation performance by applying augmentation at the level of the feature representations produced by the image encoder. Research [61] implemented a data augmentation technique based on a supervisory loss, improving depth at occluded edges and image boundaries while making the model more resilient to changes in illumination and image noise. The study [62] generated multi-perspective views and corresponding depth maps based on NeRFs (Neural Radiance Fields), utilizing interpolation- and angle-variation-based data augmentation methods, and conducted performance evaluations for AdaBins [63], DepthFormer [64], and BinsFormer [65].
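As an illustration of the copy-and-paste family of augmentations discussed above, the following is a minimal NumPy sketch in the spirit of CutDepth: a random rectangle is copied from a second training sample, and the identical edit is applied to the depth label. The rectangle-sampling scheme and the function name are our own illustrative choices, not the published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def cutdepth_style(img_a, depth_a, img_b, depth_b):
    """Paste a random rectangle from sample B onto sample A, applying the
    identical edit to the depth label so image and ground truth stay aligned."""
    h, w = depth_a.shape
    # Illustrative sampling: top-left corner in the upper-left quadrant,
    # rectangle no larger than half the image in each dimension.
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    rh, rw = rng.integers(1, h // 2 + 1), rng.integers(1, w // 2 + 1)
    img, depth = img_a.copy(), depth_a.copy()
    img[y0:y0 + rh, x0:x0 + rw] = img_b[y0:y0 + rh, x0:x0 + rw]
    depth[y0:y0 + rh, x0:x0 + rw] = depth_b[y0:y0 + rh, x0:x0 + rw]
    return img, depth
```

The essential property, shared by all methods in this family, is that the RGB input and the depth label undergo exactly the same geometric edit.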
In parallel, substantial research has been undertaken to enable dependable, real-time MDE on devices with limited resources, such as autonomous vehicles, robots, and embedded systems. These advancements focus on model compression, lightweight architectures, and acceleration techniques, which are typically grouped into pruning [66,67], the development of efficient architectures [68,69,70], the application of knowledge distillation [71,72], and real-time operation [66,67,68,69,70,71,72,73]. This body of work aims to adapt MDE to the computational constraints of various hardware platforms, enhancing operational efficiency across multiple application domains. The studies [66,67] explored pruning techniques and similar methods with the objective of energy conservation through targeted weight-training approaches; these methods reduce the computational demands of models by selectively pruning less important network weights, thereby enhancing energy efficiency during operation. The study [68] addressed a lightweight design for the encoder–decoder network in MDE. Specifically, it utilized MobileNet to reduce the weight of the encoder–decoder structure and, by replicating this streamlined architecture twice, aimed to mitigate the loss of accuracy typically associated with reductions in model complexity. The study [69] employed visual domain adaptation to minimize accuracy degradation within a lightweight, MobileNet-based network structure. The research [70] aimed to enhance prediction accuracy through a lightweight design that incorporates elements of the biological visual system and self-attention mechanisms. Meanwhile, the studies [71,72] explored KD (Knowledge Distillation) to streamline the traditional encoder–decoder architecture in MDE. However, although these technologies can reduce latency significantly (by up to a factor of ten), accuracy degradation remains a substantial limitation. Finally, technologies for the real-time operation of autonomous driving computations can utilize the previously described model-lightweighting techniques, namely pruning [66,67], efficient architectures [68,69,70], and knowledge distillation [71,72]. However, while these techniques can reduce the size of a model, they do not always decrease operational latency, because they may require additional computations at run time. Therefore, it is essential to deploy the models on actual embedded devices and analyze their performance to verify their effectiveness.
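As one concrete instance of the pruning idea referenced above, the following is a minimal magnitude-based sketch in NumPy; the surveyed methods use more elaborate, energy-aware criteria and retraining schedules, so this should be read only as an illustration of the basic mechanism.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the fraction `sparsity` of weights with the smallest
    absolute value -- the basic magnitude-pruning mechanism underlying many
    of the compression methods surveyed above (details differ per method)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

Note that zeroing weights shrinks the model only if the deployment runtime exploits the resulting sparsity, which is exactly why, as stated above, on-device measurement is needed to confirm any latency benefit.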
Despite advancements in various data augmentation techniques, current REL performance still falls short of what commercial deployment requires. One reason is that existing data augmentation techniques do not necessarily guarantee performance improvements. Specifically, methods based on color filters (e.g., color jitter, color normalization, brightness control, contrast control) exhibit more variable gains than geometric-variation techniques. Moreover, the benefit can vary with the MDE model, making it difficult to ensure improvements on recent MDE models.
Consequently, there is a demand for developing geometric-variation-based data augmentation techniques that can consistently yield performance enhancements across all MDE technologies. Furthermore, it is essential to identify combinations of data augmentation techniques that can effectively enhance the performance of recent MDE models through the integrated use of traditional augmentation methods.
In this research, we have developed data augmentation techniques based on geometric variations, specifically Mask, Mask-Scale, and CutFlip, that can reliably enhance the accuracy of MDE. We particularly investigated the optimal combinations of these techniques with traditional data augmentation methods such as scaling, rotation, translation, noise, and brightness control by analyzing their performance synergies. Additionally, we conducted experimental analyses to determine the most effective strategies for maximizing MDE accuracy across various loss functions and optimized these strategies for operational latency and memory efficiency through network lightweighting techniques. This approach is applicable to supervised, unsupervised, and semi-supervised learning, offering a viable method for enhancing the accuracy of monocular depth prediction in robotic and autonomous driving environments.
The contributions of this paper are summarized as follows:
Proposal of Novel Synthetic-Based Data Augmentation Techniques for MDE Performance Enhancement: This paper proposes new synthetic-based data augmentation methods, such as Mask, Mask-Scale, and CutFlip, to improve monocular depth estimation performance and derive the optimal combination of data augmentation techniques.
Proposal of Network Compression Methods for Enhanced Efficiency in Real-Time MDE: Strategies to minimize the size and operational time of real-time monocular depth estimation models through quantization and pruning techniques are suggested.
Optimal Application Strategies for Autonomous Driving Systems Considering Performance: This paper presents the RMS algorithm, an optimal strategy tailored for commercial autonomous driving applications, taking into account the current MDE performances on high-end servers and on-device systems. This strategic approach is designed to harness the capabilities of different deployment environments effectively.
4. Proposed Data Augmentation Techniques for MDE
MDE faces a significant challenge due to the high costs associated with data acquisition and labeling, resulting in substantially fewer training data compared to other image recognition tasks. Consequently, the application of data augmentation is essential to compensate for the insufficient quantities of training data for MDE. However, traditional data augmentation techniques such as flipping, scaling, noise addition, brightness adjustment, and rotation encounter limitations in enhancing performance due to a lack of data diversity. In this section, we propose techniques that go beyond variations within a single image, introducing methods that synthesize data across multiple images, namely Mask, Mask-Scale, and CutFlip.
Table 1 outlines the definitions of these techniques, Figure 1 presents their illustrative diagrams, and Figure 2 depicts examples of applying Mask, Mask-Scale, and CutFlip.
In this study, we conducted experiments applying the proposed data augmentation techniques to the KITTI dataset as the original data source [81]. The primary reason for utilizing the KITTI dataset is that it not only provides depth map data for MDE but also encompasses classes such as cars, pedestrians, bicycles, and people, which are crucial for autonomous driving in outdoor environments.
When augmenting data using the Mask, Mask-Scale, and CutFlip techniques, the corresponding depth map must be synthesized in the same manner as the altered image. As described in Figure 1a, when augmenting data with Mask, the pasted object's depth values are written into the depth map at exactly the same location as the object itself. Conversely, as explained in Figure 1b, when applying Mask-Scale, the depth values are adjusted inversely proportional to the scale ratio. For CutFlip-L and CutFlip-R, as illustrated in Figure 1c, the depth map from either the left or right side is directly copied to the opposite side. CutFlip-D combines several images by selectively applying the flip to the left or right image, and the corresponding depth maps are combined and structured in the same way.
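Under these rules, the image/depth-consistent synthesis can be sketched as follows. This is a minimal NumPy illustration with hypothetical helper names and array shapes, not the exact implementation used in our experiments; it assumes the copied CutFlip half is mirrored, as the flip operation implies.

```python
import numpy as np

def mask_paste(base_img, base_depth, src_img, src_depth, obj_mask):
    """Mask: paste a segmented object and its depth into the base sample.
    obj_mask is a boolean (H, W) array marking the object's pixels; the same
    pixels are overwritten in both the image and the depth map."""
    img, depth = base_img.copy(), base_depth.copy()
    img[obj_mask] = src_img[obj_mask]
    depth[obj_mask] = src_depth[obj_mask]
    return img, depth

def mask_scale_depth(obj_depth, scale):
    """Mask-Scale: when the pasted object is resized by `scale`, its depth is
    adjusted inversely proportional to the scale ratio (an object enlarged
    2x appears half as far away)."""
    return obj_depth / scale

def cutflip(img, depth, side="L"):
    """CutFlip-L / CutFlip-R: mirror one half of the image, and identically
    of the depth map, onto the opposite half."""
    h, w = depth.shape
    img2, depth2 = img.copy(), depth.copy()
    half = w // 2
    if side == "L":   # mirror the left half onto the right half
        img2[:, w - half:] = img[:, :half][:, ::-1]
        depth2[:, w - half:] = depth[:, :half][:, ::-1]
    else:             # mirror the right half onto the left half
        img2[:, :half] = img[:, w - half:][:, ::-1]
        depth2[:, :half] = depth[:, w - half:][:, ::-1]
    return img2, depth2
```

In every case the depth map receives exactly the same geometric edit as the RGB image, which keeps the augmented sample a valid (image, label) pair for supervised training.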
The traditional data augmentation methods mentioned above, namely flip, scale, noise, brightness, and rotation, artificially create a variety of environmental changes that could be encountered in real driving scenarios. Integrating them during training yields better REL performance on actual test datasets and enhances the model's generalization capability. However, these methods face limits in improving object-based depth prediction due to insufficient variability and the scarcity of object data itself caused by class imbalance. Conversely, the synthetic-based Mask method proposed in this paper employs segmentation techniques to extract objects from different images and synthesizes them onto the base image, including their depth information. This approach addresses the lack of high-quality, object-based depth prediction data, thereby contributing to enhanced performance. Furthermore, the Mask-Scale method overcomes a limitation of the Mask method, which is restricted to augmenting data at the existing size of objects and their corresponding depth information in the original image: by proportionally varying the size of the objects and their depth information during augmentation, Mask-Scale synthesizes and augments not only the objects themselves but also their associated depth information. Lastly, CutFlip represents one of the most efficient and straightforward data augmentation methods. Unlike traditional techniques, it possesses photorealistic qualities that closely mimic actual data, thereby enabling significant REL performance gains in real testing scenarios.