1. Introduction
Recent technical innovations in deep learning have led to a quantum leap in robot technology and autonomous driving technology [1]. In particular, various sensors, such as cameras, lidar, radar, GPS, ultrasonic sensors, and IMUs, are used to acquire and process diverse information related to vehicle situational awareness in order to make driving judgments and control the vehicle [1,2,3].
However, to apply the information gathered from these various sensors to autonomous driving in real time, the corresponding computations must be lightweight and accelerated [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]. Among these sensors, the tasks that demand the highest computation and incur the greatest latency are 2D and 3D context-aware computations, which primarily involve cameras, lidar, and radar. The studies in [6,7,8,13,14,15] pursued network weight reduction and acceleration for camera-based 2D object detection and 2D segmentation. The studies in [5,14,15] investigated quantization, pruning, and knowledge distillation methods for lightweighting deep learning models. The study in [18] performed acceleration research for camera-based lane detection.
However, because all these studies focus on single tasks, the corresponding operations must be combined in a real environment where all of them are needed. For this reason, research on MTL (multi-task learning) was initiated in [9,10,11,12,19,20,21], allowing as many of the tasks noted above as possible to be performed simultaneously. MTL is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to improve the generalization performance of all the tasks [22]. Owing to the parallel execution characteristics of MTL, essential image recognition tasks for autonomous driving, such as 2D object detection, lane detection, and drivable area segmentation, have been conducted using MTL.
However, among these studies on multi-task learning, there has been no research addressing optimal design methodologies for three-task configurations. In fact, most MTL studies feature complex structures and intricate training processes, making their performance challenging to reproduce. In MTL, determining an optimal combination of components is critical. These components include the shared backbone and its lightweight version, the subnets for each task, the loss functions governing subnet training performance, and the task-specific optimizers and training details, all of which significantly impact the safety of autonomous driving.
From an accuracy perspective, concurrently executing multiple tasks can lead to improper training of the shared backbone weights, potentially degrading each task's performance and adversely affecting the safety of autonomous driving. In terms of latency, if the latency of each task exceeds the required threshold, the decision-making and control stages of autonomous driving cannot be executed within an appropriate time frame, leading to potentially severe accidents. Regarding memory size, if each task consumes an increasing proportion of the limited hardware memory in an autonomous vehicle, it places additional load on overall system operation, compromising stability. Therefore, in this study, we experimented with various combinations of the details that determine the performance of each task in MTL and proposed a solution through the MDO (multi-task decision and optimization) algorithm to find the optimal configuration.
2. Related Work
In the field of situational awareness for self-driving technology, it is crucial to execute image recognition tasks with high precision in real time. In particular, information from various sensors should be utilized to enable safe and reliable driving decisions.
Among these, representative camera-based image recognition tasks include 2D tasks such as object detection, semantic segmentation, and lane detection, as well as 3D tasks such as 3D object detection and 3D segmentation. First, the single-stage 2D detection studies in [8,23] achieve an accuracy of 52 AP at over 30 FPS. Recently, anchor-free approaches, as explored in [24,25], have achieved 280 FPS or more while also enhancing accuracy. In the field of 2D semantic segmentation, the studies in [26,27] reported a technology that delivers an accuracy of 82.4 mAP. In 3D object detection, studies based on cameras [28] (18.69% AP), lidar [29] (81.8% AP), and camera–lidar sensor fusion [30] (82.4% AP) have been reported. In 3D semantic segmentation, lidar-based research [31] has achieved 74% mIoU. The studies in [32,33] investigated the acceleration of camera-only depth estimation, whereas [34,35] explored camera-based 3D object detection. The research in [36,37] addressed the acceleration of camera-based 3D reconstruction. Lidar-based 3D object detection and 3D segmentation were studied in [38,39,40]. Finally, [41,42] investigated radar-based 3D object detection.
However, we need to note that the technologies discussed above pertain to studies of individual tasks. In practice, when all the corresponding recognition models are loaded and executed simultaneously within an autonomous vehicle, complications can arise from synchronization issues among the various technologies and potential system overloads. In other words, if only some of the image recognition tasks in autonomous driving meet the accuracy or latency requirements while others do not, the safety of autonomous driving is negatively impacted. As a result, the exploration of MTL (multi-task learning) was initiated specifically for autonomous driving applications [19,20,21].
MTL aims to leverage useful information contained within related tasks to enhance the generalization performance of all tasks. MTL can be categorized into five technical approaches: the feature learning approach [43], the low-rank approach [44], the task clustering approach [45], the task relation learning approach [46], and the decomposition approach [47]. These approaches are utilized in various domains of deep learning, including natural language processing [48], reinforcement learning [49], medicine [50], and computer vision [43].
Additionally, in the field of autonomous driving, extensive research is being conducted to improve the performance of related tasks using MTL. In HybridNet, MTL was investigated with respect to three tasks: drivable area segmentation, lane detection, and object detection [19]. Additionally, YOLOP demonstrated potential by enhancing the performance of HybridNet for MTL, focusing on the aforementioned three tasks [20,21].
From the foregoing, it is evident that the accuracy performance of each MTL task is influenced by the efficiency of the underlying backbone network. However, as indicated by the studies referenced in [51,52], using a complexly structured backbone network, such as the ViT (Vision Transformer) [53], does not necessarily ensure high accuracy across all tasks, which makes achieving the ultimate objective of safe driving quite challenging. These findings underscore the importance of designing image recognition technology that takes into account the mutually complementary relationships among the relevant tasks.
The principal contributions of this study are as follows:
This study proposes an optimal neural network architecture incorporating backbone and loss functions for triple-task learning of drivable area segmentation, object detection, and lane detection. It achieves improvements in all aspects, including accuracy, latency, and size, compared to traditional individual tasks and previous three-task learning methods.
The integration of depth estimation within the MTL framework for 2D image recognition was explored and shown to be unsatisfactory due to low ITC (inter-task correlation).
For the performance optimization of MTL, a three-step MDO algorithm was applied, along with additional training techniques based on SSL (semi-supervised learning). This approach enables enhancements in all aspects, including accuracy, latency, and memory size.
5. Simulation Results
The evaluation of the various tasks is conducted under the umbrella of MTL, analyzing them both individually and in integrated configurations, ranging from single-task scenarios to combinations of up to four tasks. Furthermore, the results were benchmarked against previous MTL techniques, notably YOLOP [20] and HybridNet [19]. This comparison was extended to traditional OD (object detection) strategies such as RetinaNet [8], LD (lane detection) methods such as UFLD [18] and CLRNet [57], DE (depth estimation) methods such as DepthFormer, and DAS (drivable area segmentation) approaches utilizing architectures such as FPN, PFPN, BiFPN, and Transformer (SegFormer), as mentioned in Section 3.
To assess the MDO algorithm, the optimal multi-task set, SBM B, and TSM S are determined within parameters exceeding the targeted accuracy of 95% for LD and DAS, surpassing the targeted mAP of 0.80 for OD, and falling below the targeted absolute REL (relative error) of 0.06 for DE. Subsequently, the accuracy and latency performances of these optimized sets are evaluated.
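The threshold test described above can be sketched as a simple filter over candidate configurations. The candidate entries and their metric values below are hypothetical placeholders, and the helper structure is an illustration rather than the paper's implementation; only the threshold values follow the targets stated in the text.

```python
# Hypothetical sketch of the MDO accuracy-threshold filter described above.
# Thresholds follow the text; candidate metric values are placeholders.
TARGETS = {"ld_acc": 0.95, "das_acc": 0.95, "od_map": 0.80}  # must be met or exceeded
MAX_DE_REL = 0.06  # DE must fall BELOW this absolute relative-error target

def meets_targets(cfg):
    """Return True if a candidate (backbone, subnet, loss) set passes all targets."""
    ok = all(cfg[k] >= v for k, v in TARGETS.items())
    if "de_rel" in cfg:                 # DE is optional in the multi-task set
        ok = ok and cfg["de_rel"] <= MAX_DE_REL
    return ok

candidates = [
    {"name": "BiFPN",     "ld_acc": 0.96, "das_acc": 0.97, "od_map": 0.82},
    {"name": "SegFormer", "ld_acc": 0.94, "das_acc": 0.96, "od_map": 0.81},
]
passing = [c["name"] for c in candidates if meets_targets(c)]
```

Only configurations surviving this filter proceed to the latency and size comparisons of the later MDO steps.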
The BDD 100K and KITTI datasets [60,61] were employed for training and evaluation. Specifically, whereas the BDD 100K dataset contains labels for DAS, such labels are absent from the KITTI dataset. To address this, SSL (semi-supervised learning) was applied to the KITTI dataset for additional training. More precisely, pseudo labels were created using an InternImage model [51] pretrained on Cityscapes [62], which were then used to apply semi-supervised learning to the DAS task. Experiments were executed using TensorFlow implementations on a 2-way NVIDIA RTX 4090 GPU setup. A piecewise constant decay strategy was adopted for the learning rate schedule. Model performances were assessed over 50 epochs, with the best outcome within this range chosen for further analysis. AdamW was employed as the optimizer. Each task-specific loss in MTL was trained through the summation of the loss values derived in Equation (3), using the functions mentioned in Section 4. The weight for each loss function was set to 2 for object detection and 1 for the remaining tasks.
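The weighted summation of task losses described above can be sketched as follows. The per-task loss values in the example are placeholders; only the weights (2 for object detection, 1 for the remaining tasks) follow the text.

```python
# Minimal sketch of the weighted loss summation described above.
# Weights follow the text: 2 for object detection, 1 for the remaining tasks.
LOSS_WEIGHTS = {"od": 2.0, "ld": 1.0, "das": 1.0}

def total_mtl_loss(task_losses):
    """Weighted sum of per-task losses used to train the shared backbone."""
    return sum(LOSS_WEIGHTS[task] * value for task, value in task_losses.items())

# Placeholder per-task loss values, purely for illustration:
loss = total_mtl_loss({"od": 0.5, "ld": 0.3, "das": 0.2})  # 2*0.5 + 0.3 + 0.2 = 1.5
```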
For the performance analysis of each task (OD, LD, DAS, and DE) in MTL, experimental groups were set up with a dedicated 1-task model, a 2-task model (DAS + LD, DAS + OD, OD + LD, DE + DAS) and a 3-task model (OD + LD + DAS), and their respective performances were compared.
Table 2 shows the performance of DAS, Table 3 presents the performance of OD, Table 4 illustrates the performance of LD, and Table 5 presents the performance of DE. Additionally, Figure 3, Figure 4 and Figure 5 provide visual examples of the 2-task and 3-task scenarios.
Based on the results of all experimental groups presented in Table 3, Table 4 and Table 5, it is evident that applying depth estimation is insufficient for ensuring safe autonomous driving. As evidenced by Table 2 and Table 5, the results for the 2-task (DAS + DE) setup indicate that DAS does not meet its target accuracy of 95% and, similarly, DE falls short of the target REL of 0.06. In contrast, the other experimental sets excluding DE, such as the 1-task, 2-task, and 3-task configurations, generally satisfy their target performance.
This can be attributed to the fact that tasks such as DAS and LD have high ITC, leading to their backbone weights being trained to exhibit similar distributions, which in turn enhances their collective performance. Conversely, tasks such as OD and DE have lower ITC, resulting in their being trained with different backbone weight distributions and ultimately leading to mutual performance degradation.
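The text does not specify how ITC is quantified. One common proxy, assumed here purely for illustration and not taken from the paper, is the cosine similarity between the per-task gradients on the shared backbone weights: tasks whose gradients point in similar directions update the backbone compatibly, whereas opposed gradients interfere.

```python
import numpy as np

# Illustrative ITC proxy (an assumption, not the paper's definition):
# cosine similarity between two tasks' gradient vectors on the shared backbone.
# Near +1 suggests compatible updates (high ITC); near -1 suggests interference.
def gradient_cosine(g1, g2):
    g1 = np.asarray(g1, dtype=float).ravel()
    g2 = np.asarray(g2, dtype=float).ravel()
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

aligned = gradient_cosine([1.0, 0.5], [0.9, 0.6])    # close to +1: high ITC
opposed = gradient_cosine([1.0, 0.5], [-1.0, -0.4])  # close to -1: low ITC
```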
Therefore, it can be inferred that tasks for multi-task learning can be readily trained to assist each other in improving accuracy, whereas some MTS (multi-task set) configurations may not offer such benefits. Consequently, constructing an MTS with such complementary tasks is instrumental in enhancing the safety of autonomous driving. Additionally, the multi-task learning examples presented in this study reveal that operating with only the three tasks OD, LD, and DAS, excluding DE, provides a more secure and efficient approach to an autonomous driving image recognition model.
Moreover, although the backbone models generally exhibit similar performance, it is noteworthy that in the 3-task configuration, BiFPN demonstrates the best performance. This surpasses even the Transformer-based SegFormer and PFPN, which have the highest number of parameters. This suggests that for the DAS, OD, and LD tasks on the KITTI dataset, the BiFPN model, with fewer parameters than the SegFormer and PFPN, is less prone to overfitting and offers better generalization.
Furthermore, it can be observed that this parallel processing approach in multi-task learning offers significant advantages in terms of latency. As indicated in Table 2, Table 3 and Table 4, the 2-task configuration yields an approximate 50% reduction in latency, while the 3-task setup achieves a latency reduction of around 60% compared to individual task learning.
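The savings of roughly 50% and 60% quoted above are consistent with a simple shared-backbone cost model: the backbone runs once for all tasks instead of once per task. The sketch below is illustrative only; the backbone and head latencies are assumed numbers, not measurements from the paper.

```python
# Illustrative latency model for the backbone-sharing effect described above.
# The millisecond figures are hypothetical: a shared backbone dominates cost.
BACKBONE_MS, HEAD_MS = 45.0, 5.0

def single_task_total(n_tasks):
    """Run each task as a separate model: the backbone cost repeats per task."""
    return n_tasks * (BACKBONE_MS + HEAD_MS)

def mtl_total(n_tasks):
    """Shared backbone runs once; only the task heads are paid per task."""
    return BACKBONE_MS + n_tasks * HEAD_MS

saving_2 = 1 - mtl_total(2) / single_task_total(2)  # 45% saved with these numbers
saving_3 = 1 - mtl_total(3) / single_task_total(3)  # 60% saved with these numbers
```

Under this toy model, the saving grows with the number of tasks sharing the backbone, matching the trend reported in the tables.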
Table 6 selectively compares the system load of 3-task learning in experimental sets that meet the performance requirements of each task. The results presented are after the application of step 3 of the MDO algorithm, which is TensorFlow-Lite-based FP16 quantization [63]. The rationale for employing FP16 quantization is that, compared to other quantization techniques, it incurs the least accuracy loss and can halve the memory size while having no impact on operational latency [5]. This demonstrates that the application of the MDO algorithm can achieve optimal adjustments suitable for autonomous driving in terms of accuracy, latency, and memory size.
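The memory-halving effect of FP16 storage noted above can be illustrated with a plain NumPy cast. This is a simplification: TensorFlow-Lite FP16 quantization converts the stored model weights to half precision, and the sketch below shows only the storage and rounding effect, not the TFLite conversion itself.

```python
import numpy as np

# Illustration of the FP16 storage effect described above: casting FP32
# weights to FP16 halves memory with only a small representation error.
weights_fp32 = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

ratio = weights_fp16.nbytes / weights_fp32.nbytes  # 0.5: half the memory
max_err = float(np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32))))
```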
Next, let us analyze the loss functions utilized for the DAS and LD tasks, as presented in Table 2 and Table 4. Traditionally, in image segmentation problems, the binary cross-entropy (BCE) function is predominantly employed; however, to address class imbalance issues, the Dice and Tversky functions are utilized [52]. As can be discerned from Figure 3 and Figure 4, the proportion of the foreground area is relatively small compared to the entirety of the image. Consequently, based on the results presented in Table 2 and Table 4, the proposed DAS and LD techniques demonstrate superior performance with the Dice function rather than BCE. Notably, conventional methods such as YOLOP and HybridNet also employ the similar Tversky function as their loss function. Given this context, it is prudent to utilize the Dice or Tversky loss functions in autonomous driving applications, taking into account the size of the foreground areas.
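For reference, minimal NumPy versions of the Dice and Tversky losses discussed above are sketched below. The smoothing constant and the Tversky alpha/beta values are common defaults assumed here, not values taken from the paper.

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Dice loss: robust to foreground/background class imbalance."""
    pred, target = pred.ravel(), target.ravel()
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(target) + smooth)

def tversky_loss(pred, target, alpha=0.7, beta=0.3, smooth=1.0):
    """Tversky loss: generalizes Dice by weighting false negatives (alpha)
    against false positives (beta)."""
    pred, target = pred.ravel(), target.ravel()
    tp = np.sum(pred * target)
    fn = np.sum((1 - pred) * target)
    fp = np.sum(pred * (1 - target))
    return 1.0 - (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)

# With alpha == beta == 0.5 and smooth == 0, Tversky reduces exactly to Dice.
p = np.array([0.9, 0.8, 0.1, 0.0])  # placeholder predicted foreground probabilities
t = np.array([1.0, 1.0, 0.0, 0.0])  # placeholder ground-truth mask
```

Raising alpha above beta penalizes missed foreground pixels more heavily, which is why Tversky-style losses suit small foreground regions such as lane markings.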
From the aforementioned results, it can be observed that the REL of the dedicated model for the DE task, i.e., DepthFormer, is superior to that of the other experimental groups. Notably, even the single-task configuration shows a suboptimal REL, which deteriorates further under multi-task operation, as exemplified by the 2-task model. From the foregoing analysis, it becomes apparent that implementing depth estimation via an MTL framework is suboptimal, given the intrinsically low ITC between the DE task and the other tasks. Furthermore, as explained in [64], this task is not well suited to integration within MTL frameworks because it depends heavily on supplementary operations external to the backbone structure (e.g., T-Net) rather than on the shared backbone, which renders it a technically misaligned group for MTL applications. Additionally, an examination of the performance metrics for the 2-task (DAS + DE) configuration in Table 2 reveals a concurrent degradation in the performance of DAS, the task paired with DE. This observation underscores the inappropriateness of sharing a common backbone between the DE and DAS tasks. However, as shown in Table 6, the dedicated model approach exemplified by DepthFormer requires additional parameters compared to the MTL methodology, increasing the cost of securing the requisite resources. Consequently, both the additional resource costs and accuracy must be considered when deciding whether to apply MTL to DE.
6. Conclusions
This study explores MTL for maximizing the efficiency of the various image recognition tasks performed in autonomous driving, considering the task characteristics and the given hardware conditions. Additionally, the MDO algorithm, an optimal configuration algorithm for this purpose, is proposed. The MDO algorithm targets drivable area segmentation, object detection, lane detection, and depth estimation as the recognition tasks and comprises three stages: minimizing latency, maximizing accuracy, and minimizing size. Through the MDO algorithm, an optimal neural network design, including the backbone and loss functions, is achieved. Additional SSL-based training led to improvements in all aspects, including accuracy, latency, and size, compared to traditional single-task methods and existing three-task learning approaches.
The experimental results reveal that integrated accuracy performance is crucial in the configuration and optimization of MTL, and this integrated accuracy is determined by the ITC. Considering these characteristics, it proved important to design multi-task sets comprising tasks with high ITC. The proposed MDO algorithm facilitated approximately a 12% improvement in object detection mAP, a 15% enhancement in lane detection accuracy, and a 27% reduction in execution time. Additionally, depth estimation was found to have a low ITC with tasks such as drivable area segmentation, object detection, and lane detection, and forming a multi-task set with these tasks could lead to mutual performance degradation. Therefore, to achieve stable performance in depth estimation, it should be implemented either through a dedicated independent neural network or with additional sensors such as lidar.
In the future, research is planned to extend MTL based on sensor fusion, incorporating not only single-sensor camera inputs but also lidar inputs. This expansion aims to enhance the currently limited performance of depth estimation. Additionally, the scope of research will be extended to encompass the entire process of perception, decision-making, and control in autonomous driving, achieving an end-to-end learning approach. This will facilitate both horizontally and vertically integrated optimization in the field.