1. Introduction
Many real-world computer vision applications (e.g., autonomous driving) benefit from large field-of-view (FoV) cameras [1,2]. These cameras use fish-eye (wide-angle) lenses to capture a broad view through strongly non-linear mapping instead of regular perspective projection. However, this introduces radial distortion in the captured images, causing stretching and information loss at the edges and ultimately a blurry effect. One potential solution to the distortion issue is to transform fish-eye images into regular perspective images [3]. However, the conversion unwraps the image and discards a portion of it at the edge, compromising the rich information contained at the image boundary and hindering detection. Another possible solution is to feed the distorted fish-eye images directly to the model, as suggested by some existing research [4,5]. However, the current computer vision literature is not well optimized for learning from fish-eye cameras, motivating further research in this area.
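To make the distortion concrete, the following minimal sketch contrasts the regular perspective (pinhole) projection with the equidistant model, one common fish-eye approximation; the exact lens model is an assumption for illustration, not one specified in this paper:

```python
import math

def perspective_radius(theta, f):
    """Pinhole model: image radius r = f * tan(theta).
    Diverges as the incidence angle approaches 90 degrees."""
    return f * math.tan(theta)

def equidistant_radius(theta, f):
    """Equidistant fish-eye model (one common approximation): r = f * theta.
    Stays finite at wide angles, compressing the periphery and
    producing the radial distortion discussed above."""
    return f * theta
```

At wide incidence angles the perspective radius grows without bound while the equidistant radius grows only linearly, which is why unwrapping a fish-eye image to a perspective view must crop or stretch the boundary.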
Semantic segmentation, an important computer vision task, classifies each pixel in an image and groups the pixels belonging to a specific object [6]. It thus segments a scene into regions or masks containing different objects, facilitating a proper understanding of the scene. Its significance in autonomous driving lies in enabling the vehicle's sensor systems to precisely comprehend and differentiate the various objects and areas in the surroundings [7]. This supports well-informed decisions based on specific information about the road, obstacles, pedestrians, traffic signs, and other crucial elements, ultimately ensuring safer and more reliable autonomous navigation. While semantic segmentation is well explored for 2D perspective images [8], point clouds [9], depth images [10], and thermal images [11], it is relatively less explored for fish-eye images [12].
The limited research on fish-eye image segmentation has explored diverse methods for converting the radial distortion in fish-eye images into perspective images [3]. Some work trains models on synthetic fish-eye images obtained by applying different fish-eye transformations to perspective images [13,14]. However, these approaches do not perform well on real-world fish-eye data. Some more recent works on semantic segmentation of fish-eye images use multi-modal, multi-task learning [1,4] to exploit additional information from different modalities and tasks. Nonetheless, multi-modal and multi-task data are hard to collect and annotate in the quantities needed to further improve performance, and these models are computationally expensive due to their large number of learnable parameters. A single-task, uni-modal model is therefore desirable for fish-eye image segmentation. In addition, the current semantic segmentation literature focuses on supervised learning [3,13,15], which requires a large amount of labeled data to train deep learning models effectively. Since semantic segmentation classifies every pixel in an image, annotation is more costly and laborious than for other computer vision tasks, such as object classification [16]. Consequently, there is a lack of publicly available semantically annotated fish-eye segmentation datasets. Semi-supervised learning (SSL) has emerged as a potential solution to this problem, allowing models to learn from a small amount of labeled data while leveraging a larger amount of unlabeled data. The most effective semi-supervised methods predict pseudo-labels for unlabeled samples and learn from these predictions in an unsupervised learning setting. To the best of our knowledge, there is no existing work on SSL-based semantic segmentation for fish-eye data.
In this work, we focus on learning semantic segmentation from real-world fish-eye data in a semi-supervised setting. First, we explore and adapt various existing SSL methods from the perspective image domain, such as Mean Teacher [17] and Cross-Pseudo-Supervision (CPS) [18]. However, these methods were not originally optimized for fish-eye data and therefore perform poorly. To this end, we propose a new semantic segmentation method for fish-eye images, named FishSegSSL, that learns effective representations from uni-modal segmentation data (single task) and outperforms the existing methods, including fully supervised training, Mean Teacher, CPS, and CPS with CutMix. More specifically, our method adds to the best existing method (CPS) three key concepts inspired by the semi-supervised image classification literature: pseudo-label filtering, dynamic thresholding, and robust strong augmentation. Pseudo-label filtering discards low-confidence pseudo-labels to reduce the noisy signal in the unsupervised learning component. Dynamic thresholding further improves pseudo-label filtering by adapting the threshold to the learning difficulty of each class in the dataset. Finally, robust strong augmentation introduces an additional regularizer on easy samples so that the model does not overfit them and continues learning from all samples as training progresses.
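The pseudo-label filtering and dynamic thresholding ideas can be sketched as follows; the per-class scaling rule and function names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dynamic_thresholds(confidences, pseudo_labels, num_classes, base_tau=0.9):
    """Scale a base threshold per class by that class's mean confidence,
    so harder classes (lower average confidence) get a lower bar.
    Illustrative rule, not the paper's exact schedule."""
    taus = np.full(num_classes, base_tau)
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            taus[c] = base_tau * confidences[mask].mean()
    return taus

def filter_pseudo_labels(probs, base_tau=0.9, ignore_index=255):
    """probs: (H, W, C) softmax outputs for one unlabeled image.
    Returns pseudo-labels with low-confidence pixels set to ignore_index
    so they contribute no unsupervised loss."""
    conf = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    taus = dynamic_thresholds(conf, labels, probs.shape[-1], base_tau)
    keep = conf >= taus[labels]
    return np.where(keep, labels, ignore_index), keep
```

Pixels mapped to `ignore_index` are simply excluded from the cross-entropy term, which is how the filtering reduces noise in the unsupervised signal.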
While there has been some research on the WoodScape dataset, existing works are either (1) fully supervised, requiring a large amount of labeled data, or (2) multi-modal or multi-task models, requiring an additional data modality or annotation. To the best of our knowledge, this is the first work on fish-eye image segmentation in a semi-supervised setting. We make two main contributions in terms of novelty: (1) we present the first semi-supervised learning method specialized for fish-eye image segmentation, and (2) our method learns from fish-eye images only, without requiring multi-modal or multi-task data. We evaluate the proposed method on the WoodScape dataset [1], which consists of real-world fish-eye images. Our extensive study shows the effectiveness of the proposed method, and we perform an in-depth ablation and sensitivity analysis of its modules. Our experimental analysis shows a 10.49% improvement over fully supervised learning with the same amount of labeled data. Overall, we make the following key contributions in this work:
We present a comprehensive study on fish-eye image segmentation by adopting conventional semi-supervised learning (SSL) methods from the perspective image domain. Our study reveals the suboptimal performance of such methods since they were not originally designed for fish-eye images.
We propose a new SSL-based semantic segmentation method that incorporates three key concepts: pseudo-label filtering to reduce noisy pseudo-labels, dynamic thresholding for adaptive filtering based on class difficulty, and robust strong augmentation as a regularizer on easy samples.
This is the first work on semi-supervised fish-eye image segmentation that is uni-modal and single-task.
We perform an extensive experimental analysis with a detailed ablation and sensitivity study on the WoodScape dataset, revealing a 10.49% improvement over fully supervised learning with the same amount of labeled data and showcasing the efficacy of the introduced method.
The subsequent sections of the paper are structured as follows: Section 2 reviews prior research pertinent to our proposed model. Section 3 provides an in-depth presentation of the proposed method. Section 4 and Section 5 discuss the experimental results. Finally, Section 6 concludes with closing remarks and directions for future research.
4. Experiments
This section presents the results of this study. First, we discuss the dataset and implementation details. Then, we evaluate the performance of existing methods on fish-eye image segmentation, which we consider the baselines of this study. Finally, we discuss the results of our proposed method.
4.1. Dataset and Implementation Details
Dataset Description. WoodScape [1] is a dataset specifically designed for autonomous driving tasks, captured with multiple fish-eye cameras. It provides extensive ground-truth annotations for nine essential tasks, including semantic segmentation, covering classes such as roads, pedestrians, and traffic signs. The main purpose of WoodScape is to offer a comprehensive platform for evaluating computer vision algorithms on fish-eye images, and it stands out as the first labeled fish-eye dataset specifically designed for semantic segmentation. The data were collected from three distinct geographical regions (the US, Europe, and China) using four surround-view cameras, and cover nine distinct tasks, including depth estimation, 3D bounding-box detection, and soiling detection. Overall, WoodScape provides high-quality data and opens avenues for developing unified multi-task and multi-camera models to advance autonomous driving technology.
Implementation Details. We follow OmniDet [27] and SemiSeg [18] for the training setup and default hyper-parameters. The visual encoder backbone of OmniDet is a ResNet50 network. For semi-supervised training, we first divide the WoodScape [1] training samples into supervised and unsupervised splits. More specifically, we used 20% of the 8029 samples as our labeled set and the remaining samples as the unlabeled set. We trained the model for 125 epochs with the Adam optimizer, a learning rate of 0.0001, and a batch size of 16. A multi-step scheduler with a decay factor of 0.1 and milestones at epochs [100, 110] was used for learning rate adjustment. The model was trained on an NVIDIA RTX A6000 GPU.
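The learning-rate schedule described above behaves as in the following sketch, which mirrors a standard multi-step scheduler (e.g., PyTorch's `MultiStepLR`) without depending on any framework:

```python
def lr_at_epoch(epoch, base_lr=1e-4, milestones=(100, 110), gamma=0.1):
    """Multi-step schedule matching the setup above: starting from
    base_lr, multiply the learning rate by gamma at each milestone
    epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

With the paper's settings, the rate stays at 0.0001 for the first 100 epochs, drops to 0.00001 at epoch 100, and to 0.000001 at epoch 110 for the final stretch of the 125-epoch run.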
4.2. Baseline
For this and all subsequent experiments, we consider supervised learning with the same amount of labeled data (20% of the total data) used in the semi-supervised methods as the baseline. Any improvement from a semi-supervised method therefore comes purely from using the unlabeled data, as the number of labeled samples is identical in both settings. The first row of Table 1 shows the baseline mIoU for our work, which is 54.32. In the rest of the paper, we discuss the improvement of the different semi-supervised methods over this baseline.
4.3. Main Results
Table 1 summarizes the results for all the existing methods as well as our proposed semi-supervised training framework. We followed the evaluation protocol described in [18]; more precisely, the evaluation is uni-modal, with single-scale testing. The results are reported using the mean intersection over union (mIoU) metric, which evaluates the model's accuracy by computing the intersection-to-union ratio between the predicted and ground-truth masks for each class. The mIoU averages segmentation performance across all classes, with higher values denoting better performance. As mentioned in Section 3, we re-implemented three popular semi-supervised segmentation methods from the regular image literature, namely Mean Teacher [17], CPS [18], and CPS with CutMix augmentation [18]. Among the existing methods, CPS with CutMix augmentation performed best, obtaining an mIoU of 62.47, an 8.15% improvement over supervised learning with the same number of labeled samples. The next best performance came from the original CPS, with an mIoU of 60.31. Mean Teacher did not perform well; its result was worse than learning from the supervised loss alone. Our proposed method, FishSegSSL, outperformed all previously explored methods with an mIoU of 64.81, a 2.34% improvement over the existing best method and a 10.49% improvement over the fully supervised baseline.
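The mIoU metric described above can be computed as in the following self-contained sketch; the `ignore_index` convention is an assumption borrowed from common segmentation toolkits:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Per-class IoU = |pred ∩ gt| / |pred ∪ gt| over valid pixels,
    averaged across classes that appear in either mask. Classes absent
    from both prediction and ground truth are skipped."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union == 0:
            continue  # class absent from both masks; do not penalize
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```

For example, a prediction that matches the ground truth on three of four pixels of a two-class image yields per-class IoUs of 1/2 and 2/3, i.e., an mIoU of about 0.583.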
4.4. Ablation Study
In this section, we conduct a detailed ablation study to assess the impact of our proposed components on the model’s performance. Each component is individually removed, and the mean intersection over union (mIoU) for each setting is reported.
In Table 2, the first configuration incorporates all three components (pseudo-label filtering, dynamic confidence thresholding, and robust strong augmentation) into the existing best-performing method, CPS with CutMix augmentation, yielding an mIoU of 64.81, which is also the highest performance reported in this study. Subsequently, we excluded the dynamic confidence thresholding component, replacing it with a fixed threshold value (0.90) for pseudo-label filtering, which resulted in a 1.92% drop in performance. This underscores the superiority of class-wise, confidence-based dynamic thresholds over fixed threshold values for filtering pseudo-labels.
We further removed the robust strong augmentation module while retaining pseudo-label filtering and dynamic confidence thresholding. This resulted in lower performance than the complete three-component setup, i.e., the first configuration of the ablation study. Nevertheless, pseudo-label filtering combined with dynamic confidence thresholding performs better than pseudo-label filtering combined with robust strong augmentation, providing the interesting insight that dynamic confidence thresholding has a more substantial impact on performance than robust strong augmentation. Note that it is not viable to remove pseudo-label filtering while keeping only dynamic confidence thresholding: dynamic confidence thresholding merely adjusts the threshold value based on class-wise confidence rather than relying on a fixed value, and it is the pseudo-label filtering component that uses this dynamic threshold to retain high-quality pseudo-labels. Keeping only the robust strong augmentation component leads to a further decline in performance, emphasizing the efficacy of retaining high-quality pseudo-labels over applying hard augmentation to easy samples. Omitting all three proposed components reverts to the best-performing existing method, CPS with CutMix augmentation, which sits 2.34% below our proposed FishSegSSL, demonstrating the effectiveness of our framework.
4.5. Sensitivity Study
In this section, we discuss the results of the sensitivity study on the existing methods that we find important for fish-eye segmentation performance.
4.5.1. Hyperparameter Sensitivity
Table 3 shows the performance of the CPS method for different combinations of CPS weight (1.5, 1, and 0.5) and learning rate. Varying the CPS weight at a fixed learning rate does not change the model's performance significantly. However, decreasing the learning rate from 0.01 to 0.0001 at a constant CPS weight increases performance substantially. The highest mIoU, 60.40, is achieved with a learning rate of 0.0001 and a CPS weight of 0.5. We therefore conclude that the model's performance is more sensitive to the learning rate than to the CPS weight.
4.5.2. Confidence Thresholding
Table 4 shows the performance of CPS with confidence thresholding for different threshold values. The highest mIoU, 60.66, is obtained with a threshold of 0.75, whereas the lowest, 59.96, is obtained with a threshold of 0.9. Since the threshold acts as a consistency regularizer by ignoring low-confidence pseudo-labels, a higher threshold induces a stronger regularization effect and can hurt performance by discarding a larger share of the predicted pseudo-labels available for supervision.
Table 5 shows the performance of CPS with CutMix and confidence thresholding at threshold values of 0.5, 0.75, 0.9, and 0.95. The highest mIoU is achieved with a threshold of 0.5, while the lowest is observed with a threshold of 0.95. We believe the reason for this behavior is similar to that explained above.
To understand the performance of CPS with CutMix augmentation, shown in Table 6, we conduct experiments with different combinations of the number of CutMix boxes and the size range of these boxes, aiming to determine the combination that yields the highest mIoU. The default setting uses 3 boxes with a size range of 0.25–0.5 and achieves an mIoU of 62.47. The highest performance, 62.71, is achieved with four CutMix boxes and a range of 0.4–0.75. Note that as the CutMix range increases, the model's performance improves because the mixed area of the two images increases, giving the model a wider view of the images from which to capture the underlying information.
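The box-count and box-range settings varied in this experiment can be sketched as follows; the exact sampling rule and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cutmix_masks(shape, num_boxes=3, size_range=(0.25, 0.5), rng=None):
    """Build a binary mask with num_boxes rectangles whose side lengths
    are drawn from size_range as fractions of the image size; pixels
    inside a box will be taken from the second image."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = shape
    mask = np.zeros((h, w), dtype=bool)
    for _ in range(num_boxes):
        bh = int(h * rng.uniform(*size_range))
        bw = int(w * rng.uniform(*size_range))
        y = rng.integers(0, h - bh + 1)
        x = rng.integers(0, w - bw + 1)
        mask[y:y + bh, x:x + bw] = True
    return mask

def cutmix(img_a, img_b, mask):
    """Paste the masked regions of img_b onto img_a."""
    return np.where(mask[..., None], img_b, img_a)
```

A wider `size_range` enlarges the pasted regions, which matches the observation above that a larger CutMix range exposes the model to a wider mixed view of the two images.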
5. Discussion
In this study, we present an adaptation of traditional semi-supervised learning methods to real-world fish-eye data, exposing their suboptimal performance and lack of optimization for fish-eye characteristics. To address this challenge, we propose a novel SSL-based semantic segmentation method featuring pseudo-label filtering, dynamic thresholding, and robust strong augmentation. We first establish a supervised learning baseline with 20% labeled data, yielding an mIoU of 54.32, and then compare different semi-supervised methods, identifying CPS with CutMix augmentation as the top performer with an 8.15% improvement over supervised learning.
We further propose a novel semantic segmentation framework with three SSL components inspired by the existing SSL literature in the perspective image domain: pseudo-label filtering, dynamic thresholding, and robust strong augmentation. We combine these components with CPS and CutMix augmentation, the best-performing existing setting. The significance of the proposed pseudo-label filtering is that it ensures superior supervision by retaining high-quality pseudo-labels, using either a fixed threshold value or dynamic confidence thresholding. The robust strong augmentation module imposes hard augmentation on easy samples, creating a more varied sample set and ensuring proper utilization of the unlabeled data. Our experimental analysis shows that the proposed framework outperforms all existing methods, achieving a notable 2.34% improvement over the best existing method and a substantial 10.49% improvement over the fully supervised baseline. Detailed ablation studies underscore the significance of the proposed components, showing that pseudo-label filtering with dynamic confidence thresholding and robust strong augmentation outperforms the alternatives. Sensitivity analyses further reveal the model's sensitivity to both model-specific hyperparameters and threshold values. Overall, the findings showcase the effectiveness of combining our confidence thresholding and robust strong augmentation modules with CPS and CutMix augmentation to significantly enhance semi-supervised segmentation for fish-eye images.