Article

Fisheye Object Detection with Visual Prompting-Aided Fine-Tuning

College of Software, Kyunghee University, Yongin 17104, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(12), 2054; https://doi.org/10.3390/rs16122054
Submission received: 3 April 2024 / Revised: 28 May 2024 / Accepted: 5 June 2024 / Published: 7 June 2024
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Fisheye cameras play a crucial role in various fields by offering a wide field of view, enabling the capture of expansive areas within a single frame. Nonetheless, the radial distortion characteristics of fisheye lenses lead to notable shape deformation, particularly at the edges of the image, posing a significant challenge for accurate object detection. In this paper, we introduce a novel method, ‘VP-aided fine-tuning’, which harnesses the strengths of the pretraining–fine-tuning paradigm augmented by visual prompting (VP) to bridge the domain gap between undistorted standard datasets and distorted fisheye image datasets. Our approach involves two key elements: the use of VPs to effectively adapt a pretrained model to the fisheye domain, and a detailed 24-point regression of objects to fit the unique distortions of fisheye images. This 24-point regression accurately defines the object boundaries and substantially reduces the impact of environmental noise. The proposed method was evaluated against existing object detection frameworks on fisheye images, demonstrating superior performance and robustness. Experimental results also showed performance improvements with the application of VP, regardless of the fine-tuning method applied.

1. Introduction

Fisheye cameras, characterized by their expansive field of view, provide a broad hemispherical view [1,2]. Their unique optical properties provide significant advantages, including capturing wide areas within a single frame and reducing the number of cameras required for panoramic views. Owing to these features, they have been extensively adopted across various fields. From autonomous vehicles [3] that rely on comprehensive environmental perception, to simultaneous localization and mapping (SLAM) [4] systems targeting efficient mapping, to virtual reality (VR) [5] applications seeking immersive experiences, the applications of fisheye cameras are diverse and wide-ranging.
With the progression of deep neural networks and the expansion of datasets, performance in object detection tasks has significantly improved [6]. However, the inherent characteristics of fisheye lenses induce significant radial distortions into these datasets [7]. The shape of objects becomes notably distorted, and as one moves toward the edge, i.e., away from the lens’s center, this distortion becomes even more exaggerated. As a result, the same object can have varying appearances and shapes depending on its position.
To address these distortion issues, numerous research initiatives have been undertaken. Some studies proposed methods that rectify fisheye distortion and then applied conventional object detection methodologies [3,8]. These approaches, however, require knowledge of the fisheye camera’s intrinsic parameters and lead to a loss of edge information from the raw image during the undistortion process. Another method designed convolution techniques based on the spherical shape of the image [9]. This method heavily relies on position-based convolution kernels and fails to fully harness the boundary continuity advantage inherent to fisheye images while focusing on omnidirectional images. Furthermore, certain studies focused on directly implementing object detection techniques for fisheye datasets [10,11]. Due to the distortion in fisheye datasets, the shape of an object changes based on its position within the image. This phenomenon can be observed in Figure 1, where it is evident that as objects move toward the edges of the image, their shapes appear increasingly distorted, exhibiting alterations in orientation and size. Consequently, a substantial volume of fisheye image data is required to effectively train models to detect objects under variations across different positions in the fisheye images.
It has been recognized that the rectangular bounding boxes typically used in object detection are not optimal for use in fisheye images. Conventional bounding boxes can only provide an approximate location of an object and do not contain information about its shape. In order to address the distortion issue in fisheye images, recent studies have focused on modeling object representations using 24 points [10,12,13]. This method achieves clear boundaries and minimizes environmental noise, making it robust against distortion. These advantages grant the technique the ability to be distortion-invariant, which is particularly suitable for fisheye objects.
The recently prominent pretraining–fine-tuning paradigm [14,15,16,17] is widely utilized as a training method, in which large-scale models are pretrained on extensive datasets and then fine-tuned for adaptation to various downstream tasks [18,19]. By providing task-specific information to the pretrained model through parameter-efficient tuning methods [20,21,22,23,24], promising results have been demonstrated across a multitude of downstream tasks. Among these parameter-efficient tuning methods, visual prompting (VP) [24] operates by learning padding-shaped perturbations at the pixel level and applying them to the input image. Inspired by this paradigm, we took models pretrained on large-scale standard datasets, fine-tuned them on the fisheye dataset, and generated and applied VPs.
To effectively conduct object detection in fisheye images, we propose ‘VP-aided fine-tuning’, as illustrated in Figure 2. The two key elements of this approach are as follows: (1) employing the pretraining–fine-tuning paradigm effectively through VP, and (2) regressing the object to 24 points. With VP, we bridge the domain gap between the standard dataset used to train the pretrained model and the downstream task dataset of fisheye images, enabling efficient adaptation. By regressing the object to 24 points, we achieve robustness against image distortion, clearly delineate the object boundaries, and reduce environmental noise.
The main contributions of this paper are as follows:
  • We propose a new approach called ‘VP-aided fine-tuning’, which applies VP to various fine-tuning strategies, thereby enhancing the performance of 24-point object detection.
  • We attempt to enhance performance by reducing the distance between the pretraining dataset and the fisheye dataset through learnable VP.
  • The experimental results demonstrated that our proposed approach with VP enhanced the performance across traditional and recent tuning techniques. Furthermore, it was shown that the performance was enhanced when using convolution-based models and transformer-based models as the backbone.
The source code can be found at https://github.com/AIRLABkhu/FOD_VP-Aided_FT.

2. Related Works

2.1. Visual Prompting

Inspired by the success in the natural language processing (NLP) field [25,26], the concept of prompt tuning [27,28] has been applied in the domain of computer vision, particularly in the form of visual prompt tuning (VPT) [23]. This technique repurposes the vision transformer (ViT) [29] architecture for downstream tasks by prepending a series of learnable tokens to the model’s input. This modification facilitates a more refined and task-specific initialization, paving the way for more effective and targeted learning. Moreover, an approach known as VP has emerged, which learns a single image transformation in the form of padding at the pixel level and integrates it with the input image. Unlike VPT, which generally adds trainable tokens to the model’s input sequence, VP interacts with pretrained models purely through the input pixels and can therefore be easily applied not only to ViT-based frameworks but also to traditional convolutional neural networks (CNNs).

2.2. Fine-Tuning Pretrained Models

We applied the pretraining–fine-tuning mechanism to the fisheye dataset. Over time, several studies have been conducted to ensure effective and efficient adaptation during fine-tuning. Therefore, we explored and compared various fine-tuning methods and introduced prompts into these approaches. The most standard methods for adapting a pretrained model are linear probing (LP) [30], full fine-tuning (FT), and bias tuning [31]. In LP, only a linear classifier is trained on top of the fixed representations extracted from a pretrained model. Unlike LP, FT involves training the entire pretrained model on a target task. This means all the parameters of the model, from early to deep layers, are adjusted during the training on the downstream task. Bias tuning is a more targeted approach to fine-tuning, where only the bias terms of the model are adjusted, keeping the weights fixed. Additionally, we adopted some recently proposed enhancements to the fine-tuning process. L2 starting point (L2-SP) [32] is a method where the fine-tuned weights are regularized to the pretrained weights, and linear probing then full fine-tuning (LP-FT) [33] is a method that performs linear probing and then proceeds to full fine-tuning.

2.3. Object Detection in Fisheye Images

Fisheye datasets exhibit severe radial distortion, unlike standard datasets. Fisheye lenses, being ultra-wide-angle lenses, capture images wherein objects at different angles from the optical axis appear vastly different. Accordingly, extensive research has been conducted to effectively carry out object detection in fisheye images. On the one hand, there have been several methods involving rectifying the distortion in fisheye images to achieve undistortion. Yogamani et al. [3] employed a fourth-order polynomial method, while Khomutenko et al. [8] utilized a unified camera model for undistortion. During undistortion, edge information and details captured at close range are lost, and the computational complexity increases.
Other methods designed convolutional techniques based on the spherical shape. SphereNet [9] constructs CNNs on spherical surfaces to encode invariance against distortion, enabling the transfer of existing perspective CNN models to omnidirectional scenarios. Additionally, SpherePHD introduces new convolution and pooling techniques to apply CNNs on spherical polyhedral surfaces [34]. However, such methods heavily rely on location-based convolution kernels and, while focusing on omnidirectional images, fail to fully harness the edge continuity advantage inherent in fisheye images.
In addition, attempts have been made to minimize the influence of the background and enhance robustness against distortion by representing objects in a form beyond rectangular bounding boxes. Rashed et al. [10] proposed representations fitting the shape of objects, like rotated boxes, ellipses, and polygons. Xu et al. [12] introduced a method for representing and regressing objects using 24 points, as depicted in Figure 1b, with the objective of identifying and localizing objects in fisheye images with greater precision and robustness.

3. Methods

3.1. Dataset Generation

Representing objects in fisheye images, which are characterized by radial distortion, with traditional rectangular bounding boxes is not ideal. Therefore, we opted for a more generic shape representation of objects to diminish the background’s influence, choosing to depict them as polygons. This approach aligns with the method introduced by Rashed et al. [10], which employs a 24-point annotation system. We applied this technique to represent objects as polygons in the Woodscape fisheye dataset.
First, we identified the centroid of the object utilizing the provided rectangular bounding box annotations. We then transformed the instance segmentation annotations into object masks and stored them accordingly. Given that the number of annotations and classes surpassed that of the bounding box annotations, we aligned them based on the bounding box criteria. The objects were categorized into five distinct classes: pedestrians, vehicles, bicycles, traffic lights, and traffic signs. Next, as shown in Figure 3, we emitted 24 rays at 15-degree intervals from the center point and annotated the intersections with the mask’s boundary as the 24 points. These intersections were marked, forming a 24-point annotation for each object. Finally, the results labeled with 24 points are shown in Figure 4. Furthermore, annotating objects with 24 points offers several advantages over segmentation. First, it has a lower computational cost, since it does not require storing a class label for every pixel, which is advantageous for real-time processing. Moreover, it generally delivers better generalization performance. Segmentation operates at the pixel level, leading to underperformance when object boundaries are ambiguous or when objects overlap.
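To make the annotation procedure concrete, the sketch below casts 24 rays at 15-degree intervals from the bounding-box center and records, for each ray, the outermost pixel that still lies inside the instance mask. It is a minimal illustration under assumed inputs (a per-object binary mask and its center), not the exact annotation tool used to build the dataset.

```python
import numpy as np

def annotate_24_points(mask, center, num_rays=24):
    """Cast rays at 15-degree intervals from the object center and keep, for
    each ray, the outermost pixel that still lies inside the instance mask.
    'mask' is a binary (H, W) array for one object; 'center' is (cx, cy)."""
    h, w = mask.shape
    cx, cy = center
    max_radius = int(np.hypot(h, w))               # longest possible ray
    points = []
    for k in range(num_rays):
        theta = np.deg2rad(k * 360.0 / num_rays)   # 0, 15, 30, ... degrees
        last_inside = (float(cx), float(cy))
        for r in range(1, max_radius):
            x = int(round(cx + r * np.cos(theta)))
            y = int(round(cy + r * np.sin(theta)))
            if x < 0 or y < 0 or x >= w or y >= h:
                break
            if mask[y, x] > 0:
                last_inside = (float(x), float(y))
        points.append(last_inside)
    return np.array(points, dtype=np.float32)      # (24, 2) polygon vertices
```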
However, representing objects in polygonal form makes the training process more challenging than the conventional bounding box approach. The increase in the number of regression items slows down the convergence speed and compromises accuracy. Additionally, the computation of intersection over union (IOU) becomes more intricate and challenging. Xu et al. [12] considered irregular bounding boxes and employed an overarching framework using the anchor-free YOLOX [35]. They proposed a loss function based on the concentric circle distribution, termed GIOU. This loss function had the advantage of eliminating the need for direct IOU computation for irregular bounding boxes. The loss function is defined as
$$\mathrm{Loss} = \frac{1}{24}\sum_{i=0}^{23}\bigl(1 - \mathrm{GIOU}_{\mathrm{cir}}(pd_i,\, gt_i)\bigr),$$
where $pd_i$ and $gt_i$ represent the predicted and ground truth concentric circles, respectively. The $\mathrm{GIOU}_{\mathrm{cir}}$ is defined as
$$\mathrm{GIOU}_{\mathrm{cir}} = \mathrm{CIOU} - \frac{S_c - \bigl(S_{pd} \cup S_{gt}\bigr)}{S_c},$$
where $S_{pd}$ and $S_{gt}$ mean the area of the predicted circle and label circle, respectively. $S_c$ is the maximum circumcircle area of the predicted circle and label circle. The circular IOU (CIOU) is computed as
$$\mathrm{CIOU} = \begin{cases} 0, & \text{if } r_{\max} + r_{\min} \le \mathrm{dist}_{\mathrm{cen}} \\ \pi r_{\min}^2, & \text{if } r_{\max} - r_{\min} > \mathrm{dist}_{\mathrm{cen}} \\ \mathrm{SIOU}, & \text{else}. \end{cases}$$
In Equation (3), SIOU is calculated as
$$\mathrm{SIOU} = t_1 r_{\min}^2 + t_2 r_{\max}^2 - r_{\min}\,\mathrm{dist}_{\mathrm{cen}}\sin t_1,$$
where $t_1$ and $t_2$ are given by
$$t_1 = \arccos\frac{r_{\min}^2 + \mathrm{dist}_{\mathrm{cen}}^2 - r_{\max}^2}{2\, r_{\min}\, \mathrm{dist}_{\mathrm{cen}}},$$
$$t_2 = \arccos\frac{r_{\max}^2 + \mathrm{dist}_{\mathrm{cen}}^2 - r_{\min}^2}{2\, r_{\max}\, \mathrm{dist}_{\mathrm{cen}}}.$$
Here, $\mathrm{dist}_{\mathrm{cen}}$ is the Euclidean distance between the centers of the two circles. $r_{\max}$ represents the radius of the larger circle, while $r_{\min}$ denotes the radius of the smaller circle. SIOU represents the area of the intersecting part between the two circles.
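The piecewise definition above can be implemented directly; the sketch below computes the intersection area of the two concentric-circle representations following Equations (3)–(6). It is a schematic reimplementation, with a small guard for coincident centers added as an assumption.

```python
import math

def circular_intersection_area(r_pd, r_gt, dist_cen):
    """CIOU from Equations (3)-(6): area of intersection of two circles with
    radii r_pd and r_gt whose centers are dist_cen apart."""
    r_max, r_min = max(r_pd, r_gt), min(r_pd, r_gt)
    if r_max + r_min <= dist_cen:        # circles do not overlap
        return 0.0
    if r_max - r_min > dist_cen:         # smaller circle lies inside the larger one
        return math.pi * r_min ** 2
    if dist_cen == 0:                    # coincident centers, equal radii (guard)
        return math.pi * r_min ** 2
    t1 = math.acos((r_min ** 2 + dist_cen ** 2 - r_max ** 2) / (2 * r_min * dist_cen))
    t2 = math.acos((r_max ** 2 + dist_cen ** 2 - r_min ** 2) / (2 * r_max * dist_cen))
    # SIOU, Equation (4): two circular sectors minus the kite spanned by the centers
    return t1 * r_min ** 2 + t2 * r_max ** 2 - r_min * dist_cen * math.sin(t1)
```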

3.2. Overview of Visual Prompting

Inspired by the success of the prompt paradigm in the field of NLP, Bahng et al. [24] introduced the concept of VP for pretrained vision and vision-language models for the first time. VP aims to learn a task-specific visual prompt, denoted as $v_\phi$ and parameterized by $\phi$, which is a set of pixel-level modifications directly added to the input image. This method is designed to assist in the adaptation of the pretrained model to downstream tasks, while the model parameters remain fixed. For a given input image x and its corresponding label y, the learning objective of VP is defined as follows:
$$\arg\min_{\phi}\bigl(-\log P_{\theta;\phi}(y \mid x + v_\phi)\bigr).$$
Here, θ and ϕ represent the parameters of the pretrained model and the visual prompt, respectively. During the training phase, the likelihood of the correct label y is maximized. In the inference phase, the trained prompt is added to test images, which are then fed through the pretrained model for classification or detection. This addition of the prompt does not require substantial computational resources, making VP a resource-efficient alternative to full model fine-tuning.
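A minimal sketch of how the objective in Equation (7) could be optimized in practice is shown below, assuming a PyTorch-style pretrained classifier and a cross-entropy loss; the data loader, loss choice, and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def train_visual_prompt(model, prompt, loader, epochs=10, lr=1.0):
    """Learn only the visual prompt (phi) while the pretrained model (theta)
    stays frozen, following the objective in Equation (7)."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)            # theta is kept fixed
    opt = torch.optim.SGD([prompt], lr=lr) # 'prompt' is a tensor with requires_grad=True
    for _ in range(epochs):
        for x, y in loader:
            logits = model(x + prompt)             # x + v_phi added at the pixel level
            loss = F.cross_entropy(logits, y)      # -log P_(theta;phi)(y | x + v_phi)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return prompt
```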

3.3. VP-Aided Fine-Tuning

Our research, as illustrated in Figure 5, employed the pretraining-fine-tuning paradigm for fisheye object detection and explored methodologies to enhance fine-tuning performance using VP. The pretrained model was trained on a standard dataset and then fine-tuned for the downstream task on a fisheye dataset. By adding VPs to the input image, we provided downstream task-specific information to the model, making the adaptation more efficient. The visual prompt is designed to match the resolution of the original fisheye image, ensuring seamless integration when it is added to the image. The prompt, consisting of learned padding values, is applied directly at the pixel level to the fisheye image. During the training process, padding is learned and strategically placed only along the edges of the image. This placement is vital, because it helps to correct the typical distortions found in fisheye images, while leaving the central part of the image unchanged, preserving essential visual information. We describe the method of incorporating and continuously updating the visual prompt with the original fisheye image as follows:
$$\delta = \delta - \eta \cdot \nabla_{\delta} L(\theta,\, x + \delta,\, y).$$
Here, $\delta$ represents the visual prompt, $\eta$ the learning rate, $\nabla_{\delta} L$ the gradient of loss $L$ as defined in Equation (1), $\theta$ the model parameters, $x$ the input, and $y$ the true label. After updating the prompt, the resultant image $x + \delta$ is clipped to the valid pixel range [0, 255], and this is represented by the following equation:
$$x = \mathrm{clip}(x + \delta,\, 0,\, 255).$$
Consequently, both the prompt and the model are updated together through the loss function. In fine-tuning tasks like ours, where the distribution of pretrained data and input images is different, the deployment of prompts can alleviate this difference and enable effective learning. We have discovered that, even with a reduced number of learning parameters, the application of VP to fine-tuning methods can achieve enhanced detection performance. Moreover, VP can be easily applied to any fine-tuning method, since it simply involves adding to the input image.
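The sketch below illustrates one VP-aided fine-tuning step under these rules: the prompt lives only along the image border, both the prompt and the tunable model parameters receive gradients from the same detection loss, and the prompted image is clipped as in Equation (9). The tensor shapes, the optimizer objects, the detection loss, and the assumption that images are stored in the [0, 255] range are placeholders, not the authors' exact code.

```python
import torch

def make_border_mask(h, w, pad=30):
    """Binary mask that keeps the learnable prompt only along the image edges,
    leaving the central region untouched (pad=30 follows the paper's default)."""
    mask = torch.ones(1, 1, h, w)
    mask[:, :, pad:h - pad, pad:w - pad] = 0
    return mask

def vp_aided_step(model, delta, mask, images, targets, det_loss,
                  model_opt, prompt_opt):
    """One VP-aided fine-tuning step: the prompt delta and the tunable model
    parameters are updated with the same detection loss (Equations (1) and (8)),
    and the prompted image is clipped to [0, 255] as in Equation (9)."""
    prompted = torch.clamp(images + delta * mask, 0, 255)
    loss = det_loss(model(prompted), targets)
    model_opt.zero_grad()
    prompt_opt.zero_grad()
    loss.backward()
    model_opt.step()      # backbone/neck/head, depending on the PT or FT setting
    prompt_opt.step()     # gradient-descent update of delta
    return loss.item()
```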
We applied VP to standard fine-tuning methods including partial fine-tuning (PT), FT, and bias tuning [31]. PT, inspired by the VPT’s head-tuning setup, involves freezing the backbone and tuning the neck and head. Additionally, we adopted recently proposed fine-tuning enhancement techniques like PT-FT and L2-SP [32]. These fine-tuning methods are illustrated in Figure 6. PT-FT was inspired by the LP-FT [33] setup, conducting FT after PT instead of LP. The L2-SP approach regularizes fine-tuned weights towards pretrained weights, employing a regularization term, as follows:
$$\Omega(w) = \frac{\alpha}{2}\,\bigl\lVert w_S - w_S^0 \bigr\rVert_2^2 + \frac{\beta}{2}\,\bigl\lVert w_{\bar{S}} \bigr\rVert_2^2.$$
Here, $w_S^0$ denotes the weights pretrained on source data (standard images), acting as the starting point. $w_S$ represents the weights of the source network, that is, the weights of frozen parameters, and $w_{\bar{S}}$ pertains to the weights of tunable parameters. $\alpha$ and $\beta$ are regularization parameters that determine the strength of the penalty.
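A sketch of the L2-SP penalty in Equation (10) is given below. Here, the split between the two terms is driven by which parameter names exist in the pretrained checkpoint, and the α and β values are placeholders; this is an assumed, simplified reading of the regularizer, not the authors' code.

```python
import torch

def l2_sp_penalty(model, pretrained_state, alpha=0.01, beta=0.01):
    """Equation (10): pull weights that have a pretrained starting point back
    towards it, and apply a plain L2 penalty to the remaining weights."""
    toward_start, plain_l2 = 0.0, 0.0
    for name, param in model.named_parameters():
        if name in pretrained_state:                      # w_S, with starting point w_S^0
            ref = pretrained_state[name].to(param.device)
            toward_start = toward_start + (param - ref).pow(2).sum()
        else:                                             # w_S-bar (e.g., a new head)
            plain_l2 = plain_l2 + param.pow(2).sum()
    return 0.5 * alpha * toward_start + 0.5 * beta * plain_l2
```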

3.4. Theoretical Analysis of Visual Prompting

To elaborate on why the visual prompting method enhances performance in fisheye object detection, we draw a parallel to adversarial reprogramming [36], a concept traditionally associated with adversarial attacks. Adversarial attacks, such as the fast gradient sign method (FGSM) [37] and projected gradient descent (PGD) [38], utilize input perturbations to deceive a model. These attacks are characterized by updating perturbations using a method contrary to gradient descent, typically termed gradient ascent on the loss landscape, to degrade model performance on its task:
$$\delta = \delta + \epsilon \cdot \mathrm{sign}\bigl(\nabla_{\delta} L(\theta,\, x + \delta,\, y)\bigr).$$
Here, $\delta$ represents the perturbation, $\epsilon$ the learning rate, $\nabla_{\delta} L$ the gradient of loss $L$ with respect to the perturbation, $\theta$ the model parameters, $x$ the input, and $y$ the true label. In contrast, VP employs gradient descent to iteratively refine these perturbations, referred to as ‘prompts’, which are aimed at enhancing rather than degrading model performance.
$$\delta = \delta - \eta \cdot \nabla_{\delta} L(\theta,\, x + \delta,\, y).$$
Here, η is the learning rate used for gradient descent, promoting an optimization that enhances model performance. Both adversarial reprogramming and visual prompting effectively modify the input to adapt the model for a new or refined task. However, while adversarial reprogramming seeks to subvert or repurpose the model, visual prompting through methods such as ours aims to harness this adaptability to improve accuracy and robustness in fisheye object detection. By modifying the input slightly with learned perturbations (prompts), our method consistently outperformed standard training approaches without prompts, confirming the practical utility and theoretical soundness of our approach.
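The contrast can be stated in two lines of code: the adversarial update of Equation (11) steps up the loss surface using the gradient sign, while the VP update of Equation (12) steps down it. The snippet below is purely schematic; `grad` stands for an already-computed gradient tensor ∇δL.

```python
def adversarial_ascent_step(delta, grad, eps):
    # Equation (11): FGSM/PGD-style update that *increases* the loss
    return delta + eps * grad.sign()

def visual_prompt_step(delta, grad, lr):
    # Equation (12): gradient-descent update that *decreases* the loss
    return delta - lr * grad
```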

4. Experiments and Results

4.1. Setup

4.1.1. Dataset

We used the Woodscape dataset [3] for object detection evaluation. The Woodscape dataset is a pioneering fisheye dataset for autonomous driving, comprising images captured from four different camera orientations: front, left, right, and rear. The official release consists of 8233 images, which were randomly divided into training and testing datasets in a ratio of 8:2. Instance segmentation annotations are provided for more than 40 classes, and 2D bounding box annotations are provided for 5 classes. We generated 24-point labels using the provided segmentation masks, aligned based on 5 classes: pedestrians, vehicles, bicycles, traffic lights, and traffic signs.

4.1.2. Models

Our overall network architecture is based on YOLOX [35], selected for its efficiency in real-time object detection and its anchor-free characteristic, which enables efficient processing of 24-point regression. To assess the robustness and adaptability of different architectural approaches, we evaluated a variety of models as the backbone of the YOLOX framework, including transformer models and traditional CNNs. The backbone includes five pretrained vision models: Swin Transformer-tiny (Swin-T) and Swin Transformer-small (Swin-S) [19], pretrained on the ImageNet-22k [39] dataset; and Darknet-53 [11], ResNet-50 [40], and VGG-19 [41], pretrained on the COCO 2017 [42] dataset.

4.1.3. Evaluation Metric

The objective of our research was to effectively detect objects in fisheye images by representing them with 24 points. To evaluate the performance of our proposed 24-point prediction, we measured the IOU of the regions generated by the prediction and the ground truth polygons. To differentiate this from the conventional IOU, which typically measures the similarity of a rectangular boxed region, we termed this metric $\mathrm{IOU}_{poly}$ and defined it as follows:
$$\mathrm{IOU}_{poly} = \frac{A_{pred} \cap A_{gt}}{A_{pred} \cup A_{gt}},$$
where $A_{pred}$ and $A_{gt}$ mean the areas of polygons formed by connecting the 24 points of the prediction and the ground truth, respectively. This metric can also be applied to the bounding box of a rectangle, where the polygon is formed from 4 points instead of 24.
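For reference, a compact way to evaluate Equation (13) for two 24-point polygons is sketched below; the use of the shapely library is an implementation choice assumed for illustration, not necessarily what the authors used.

```python
from shapely.geometry import Polygon

def iou_poly(pred_points, gt_points):
    """IOU_poly from Equation (13): ratio of the intersection area to the union
    area of the two polygons formed by the predicted and ground truth points."""
    pred = Polygon(pred_points).buffer(0)   # buffer(0) repairs self-intersections
    gt = Polygon(gt_points).buffer(0)
    union = pred.union(gt).area
    return pred.intersection(gt).area / union if union > 0 else 0.0
```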

4.1.4. Implementation Details

Our model was trained on three Nvidia A4000 GPUs, with a batch size of 20 over 300 epochs. We used a learning rate of 0.0001 and employed the stochastic gradient descent (SGD) [43] optimizer. The prompt was updated using the SGD optimizer with a learning rate of 1, decayed following a cosine schedule [44].
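As a concrete sketch of this optimizer setup, the snippet below uses the stated learning rates and epoch count; the placeholder detector, the prompt shape, and the momentum value are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

# Placeholder detector and prompt; in practice these are the YOLOX model and
# the border-shaped prompt described in Section 3.3.
model = nn.Conv2d(3, 5, kernel_size=3, padding=1)
delta = torch.zeros(1, 3, 640, 640, requires_grad=True)

model_opt  = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)  # model lr = 0.0001
prompt_opt = torch.optim.SGD([delta], lr=1.0)                            # prompt lr = 1
# cosine decay of the prompt learning rate over the 300 training epochs
prompt_sched = torch.optim.lr_scheduler.CosineAnnealingLR(prompt_opt, T_max=300)
```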

4.2. VP-Aided Fine-Tuning Results

4.2.1. Results of Applying VP to Various Backbone Models

In this section, we explored the impact of VP on the performance of various backbone models. Our experimental setup involved a comparative analysis of three tuning methods: training from scratch, PT, and FT, under the influence of VP. Training from scratch means initializing the network weights randomly without any pretrained weights, and starting the learning process anew. Zero-shot refers to a scenario where the model is directly applied using pretrained weights without any further modifications or training adjustments. As a result of integrating VP, both PT and FT methods showed performance improvements. Specifically, the PT method improved within a range of 0.1% to 3.09%, and the FT method demonstrated a larger range of improvement, from 0.11% to 5.78%. As shown in Table 1, the FT method with VP consistently outperformed the other approaches across all backbone models.
There was a significant improvement compared to training from scratch for every backbone model. When VP was applied to Darknet53, there was a notable increase in IOU score from 0.5276 to 0.5904. For the ResNet50 model, the IOU improved from 0.4042 to 0.5811, and for VGG19, it increased from 0.5325 to 0.6050. The Swin Transformer variants also showed enhancements, with Swin-T improving from 0.3170 to 0.5038 and Swin-S from 0.4243 to 0.5082. These results indicate that applying VP during the FT phase effectively reduced the distance between pretrained data and fisheye data, thereby enhancing performance.
To provide a more comprehensive evaluation, we also included performance metrics based on the mean average precision (mAP). This additional evaluation further demonstrated the effectiveness of the VP method. As detailed in Table 2, the use of VP consistently improved mAP across the various backbone models. Specifically, the FT method with VP achieved the highest mAP scores, outperforming the other methods. For instance, when VP was applied to Darknet53, the mAP increased from 0.3678 to 0.5135. For ResNet50, the mAP improved from 0.3016 to 0.4785, and for VGG19, it increased from 0.4008 to 0.5456. Similarly, the Swin-T and Swin-S models saw improvements, with Swin-T mAP increasing from 0.3708 to 0.4823 and Swin-S from 0.3789 to 0.5122. These results confirm that VP not only enhances IOU but also significantly improves mAP, demonstrating its robust capability for enhancing the performance of object detection models across various backbones.
In Table 3, the number of tunable parameters is shown based on the presence or absence of VP in PT and FT. When the padding size was 30, the number of VP parameters was 191,340, demonstrating that the tuning performance could be efficiently improved with a relatively small number of parameters. Table 4 presents an analysis of the training time and the number of tunable parameters based on varying padding sizes. The results were based on using Darknet53 as the backbone model with the PT setting. As observed, the training time slightly increased with larger padding sizes. However, this increase was minimal and did not significantly impact the overall training process. This minimal increase can be attributed to the relatively small number of additional learnable parameters introduced by the prompts. For example, padding sizes of 30, 50, and 70 added only 0.19 M, 0.31 M, and 0.41 M parameters, respectively, which is negligible compared to the total number of parameters in typical deep learning models. Consequently, the computational overhead introduced by these additional parameters is minor, and the training time increase remained insignificant. These results indicate that the use of visual prompts did not substantially affect the training time, allowing the method to maintain efficiency, while enhancing model performance.

4.2.2. Results of Applying VP to Various Fine-Tuning Methods

In additional experiments conducted within this study, we evaluated the performance enhancements when VP was applied to a range of fine-tuning methods. These methods encompassed PT, FT, Bias tuning, L2-SP, and PT-FT. Our goal was to validate the consistent efficacy of VP in improving performance across different fine-tuning methodologies. Table 5 presents the results for the Darknet53 backbone model, demonstrating that the implementation of VP consistently led to performance improvements across all fine-tuning methods compared to scenarios without VP. Specifically, for PT, the application of VP increased the performance from 0.5188 to 0.5365, marking an improvement of approximately 1.77%. For FT, there was an enhancement from 0.5842 to 0.5904, indicating an increase of roughly 1.06%. The bias adjustment method saw an increase from 0.5356 to 0.5396 with VP, L2-SP regularization from 0.5775 to 0.5887, and the PT-FT from 0.5683 to 0.5693.

4.2.3. Qualitative Results

The qualitative results of the fine-tuning methods are shown in Figure 7. These results demonstrate the detection outcomes when applying VP to FT, specifically for images from the front and right cameras. Our method exhibited superior performance in object detection, more accurate class prediction, and precise 24-point regression of objects. Given the inherent distortion towards the edges in fisheye images, our 24-point regression approach proved to be robust against such distortions, successfully avoiding occlusion effects.

4.2.4. The Efficacy of 24-Point Representation

We conducted an additional experiment to validate the efficacy of using a 24-point polygon representation for object detection in fisheye images, as opposed to the conventional rectangular bounding box approach. We utilized the YOLOX architecture, experimenting with various backbone models such as Darknet53, ResNet50, VGG19, Swin-T, and Swin-S. The backbones were trained in two distinct manners: one set was trained to estimate the rectangular bounding box as provided by the dataset, and the other set was trained to estimate the proposed 24 points (24-point representation). It is important to note that in this experiment, our evaluation metric was the IOU with the dataset-provided bounding box. To facilitate this comparison, we transformed the 24-point representation into a bounding box by calculating the minimum and maximum values on the x and y axes. This allowed us to directly compare the performance of the 24-point representation with the traditional bounding box method. As shown in Figure 8, the 24-point polygon representation showed higher IOU performance than the bounding box for all backbone models. When using the Darknet53 model, the IOU improved significantly from 0.3791 with bounding box to 0.5904. Similar performance improvements were observed in the ResNet50 and VGG19 models, with IOUs of 0.5811 and 0.6050, respectively. With the application of Swin-T and Swin-S models, the traditional bounding box method recorded IOUs of 0.3021 and 0.3235, whereas the 24-point polygon representation achieved higher IOUs of 0.5038 and 0.5082. These results demonstrate that the 24-point polygon representation provided greater precision and effectiveness than the conventional bounding box method in object detection for fisheye images. This indicates that the 24-point polygon representation is more suitable for accurate object detection for complex-shaped objects or distorted fisheye images.
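The conversion used for this comparison reduces to taking per-axis extremes of the 24 points; a minimal sketch is shown below (function name and types are illustrative).

```python
import numpy as np

def points_to_bbox(points):
    """Convert a (24, 2) polygon into an axis-aligned box (x_min, y_min, x_max, y_max)
    by taking the minimum and maximum values along the x and y axes."""
    pts = np.asarray(points, dtype=np.float32)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return float(x_min), float(y_min), float(x_max), float(y_max)
```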

4.2.5. Effectiveness of Padding Size

In this study, we analyzed the impact of VP by varying the padding size during the PT and FT phases across the various backbone models. According to the results presented in Table 6, it is evident that changes in padding size did not lead to consistent performance improvements across all models. For instance, in the case of the Darknet53 model, the highest IOU was achieved at a padding size of 50 during PT, but the performance decreased at 70. In the FT phase, a padding size of 70 recorded the highest IOU. Similarly, other models also exhibited inconsistent IOU changes with increasing padding sizes. These findings indicate a complex interaction between padding size and model performance, suggesting that this may vary depending on the specific characteristics of a model or dataset.

4.2.6. Performance Analysis According to Prompt Design

We evaluated four distinct designs for the shape of the prompt, as illustrated in Figure 9. Design 1 involved padding according to the proportions of the fisheye image. Design 2 entailed padding around the input image. Design 3 added padding over the zero-valued region beneath the fisheye image. Lastly, Design 4 was specifically designed for the unique characteristics of fisheye images. The results of these prompt designs are presented in Table 7, where the backbone model was set to Darknet53 using PT and FT. We found that Design 1, which involved padding based on the aspect ratio of fisheye images, yielded the optimal results. While Design 4 also demonstrated impressive performance, Design 1 generally outperformed it across the various backbone models. It seems that training the prompt in areas not corresponding to the fisheye image did not contribute to performance improvement, as these areas were not updated in correlation with the fisheye image information. Therefore, we standardized the prompt design to the format of Design 1 and proceeded with the experiments.

4.2.7. Performance Analysis of Prompt Initialization

We adopted an initialization strategy where the prompt parameters were set using values drawn from a standard normal distribution (torch.randn). This choice was based on the assumption that a Gaussian distribution provides a balanced start, allowing prompts to contribute both positive and negative adjustments symmetrically during the training process. To empirically validate the efficacy of the different initialization strategies, we conducted experiments comparing standard normal random initialization with uniform random and zero initialization. As shown in Table 8, the use of standard normal random initialization under our visual prompting framework (w/VP) resulted in superior performance (0.5904) compared to the other methods when using Darknet53 as the backbone model. Based on the empirical evidence, we ultimately chose to initialize the prompt parameters randomly using a standard normal distribution.
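For reference, the three initializations compared in Table 8 differ only in how the prompt tensor is created; the prompt shape below is an arbitrary illustrative choice.

```python
import torch

shape = (1, 3, 640, 640)                                    # illustrative prompt resolution
delta_normal  = torch.randn(shape, requires_grad=True)      # standard normal (chosen)
delta_uniform = torch.rand(shape).requires_grad_(True)      # uniform random in [0, 1)
delta_zero    = torch.zeros(shape, requires_grad=True)      # zero initialization
```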

4.2.8. Performance Analysis of Optimizer Setting

We conducted extensive testing with three optimizers: SGD, Adam [45], and RMSprop [43]. The objective was to determine the most effective optimizer for both the model components (backbone, head, neck) and the VP. As illustrated in Table 9, our experimental results indicated that using SGD as the optimizer for both the model backbone and the visual prompt yielded the highest mIOU of 0.5904. These results were based on using Darknet53 as the backbone model with the FT setting. Although Adam and RMSprop provided competitive results in certain configurations, they did not consistently outperform SGD across all settings. This analysis clarified our choice of SGD as the primary optimizer for our VP-aided fine-tuning method.

5. Conclusions

This paper addressed the unique challenges posed by fisheye images, particularly the severe radial distortion affecting object detection tasks. We proposed a ‘VP-aided fine-tuning’ approach, leveraging the strengths of the pretraining–fine-tuning paradigm, augmented with VP and a 24-point object regression method to enhance object detection in fisheye images. Our method effectively bridges the domain gap between standard and fisheye datasets, enabling the pretrained model to adapt more efficiently to the distortion-specific characteristics of fisheye images. The employment of a 24-point representation for object delineation proved to be significantly advantageous in terms of robustness against image distortion, precision in boundary delineation, and the reduction of environmental noise. Recent research has actively explored regression methods for efficiently performing object detection in fisheye images, including approaches that use 24-point regression. By applying our method within the pretraining–fine-tuning paradigm to these studies, we could further enhance performance. Our approach is to integrate a VP consisting of few parameters into the input image. This method can be easily applied across various fine-tuning methods and diverse backbone models. In addition to being simple to implement, this approach enhances the efficacy of object detection models. This adaptability and improved performance underscore its potential as a valuable tool for advancing research in object detection.

Author Contributions

Conceptualization, M.J. and H.H.; methodology, M.J.; software, M.J.; validation, G.-M.P. and M.J.; supervision, G.-M.P. and H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00167169), by an IITP grant funded by the Korean government (MSIT) (No. RS-2022-00155911, Artificial Intelligence Convergence Innovation Human Resources Development (Kyung Hee University)), and by the Convergence security core talent training business support program (IITP-2023-RS-2023-00266615).

Data Availability Statement

The Woodscape dataset was presented by Yogamani et al. [3], and the dataset is available at https://github.com/valeoai/WoodScape, which was accessed on 20 March 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, H.; Wang, X.; Zhou, J.; Wang, X. Approximate model of fisheye camera based on the optical refraction. Multimed. Tools Appl. 2014, 73, 1445–1457. [Google Scholar] [CrossRef]
  2. Choi, K.H.; Kim, Y.; Kim, C. Analysis of Fish-Eye Lens Camera Self-Calibration. Sensors 2019, 19, 1218. [Google Scholar] [CrossRef] [PubMed]
  3. Yogamani, S.; Hughes, C.; Horgan, J.; Sistu, G.; Varley, P.; O’Dea, D.; Uricár, M.; Milz, S.; Simon, M.; Amende, K.; et al. Woodscape: A multi-task, multi-camera fisheye dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9308–9318. [Google Scholar]
  4. Ji, S.; Qin, Z.; Shan, J.; Lu, M. Panoramic SLAM from a multiple fisheye camera rig. ISPRS J. Photogramm. Remote Sens. 2020, 159, 169–183. [Google Scholar] [CrossRef]
  5. Xiong, Y.; Turkowski, K. Creating image-based VR using a self-calibrating fisheye lens. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 237–243. [Google Scholar] [CrossRef]
  6. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep Learning for Generic Object Detection: A Survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  7. Gochoo, M.; Otgonbold, M.E.; Ganbold, E.; Hsieh, J.W.; Chang, M.C.; Chen, P.Y.; Dorj, B.; Al Jassmi, H.; Batnasan, G.; Alnajjar, F.; et al. FishEye8K: A Benchmark and Dataset for Fisheye Camera Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–23 June 2023; pp. 5305–5313. [Google Scholar] [CrossRef]
  8. Khomutenko, B.; Garcia, G.; Martinet, P. An enhanced unified camera model. IEEE Robot. Autom. Lett. 2015, 1, 137–144. [Google Scholar] [CrossRef]
  9. Coors, B.; Condurache, A.P.; Geiger, A. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 518–533. [Google Scholar]
  10. Rashed, H.; Mohamed, E.; Sistu, G.; Kumar, V.R.; Eising, C.; El-Sallab, A.; Yogamani, S. Generalized Object Detection on Fisheye Cameras for Autonomous Driving: Dataset, Representations and Baseline. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2271–2279. [Google Scholar] [CrossRef]
  11. Lei, X.; Sun, B.; Peng, J.; Zhang, F. Fisheye Image Object Detection Based on an Improved YOLOv3 Algorithm. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; pp. 5801–5805. [Google Scholar] [CrossRef]
  12. Xu, X.; Gao, Y.; Liang, H.; Yang, Y.; Fu, M. Fisheye object detection based on standard image datasets with 24-points regression strategy. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 9911–9918. [Google Scholar] [CrossRef]
  13. Wang, X.; Xu, X.; Gao, Y.; Yang, Y.; Yue, Y.; Fu, M. CRRS: Concentric Rectangles Regression Strategy for Multi-point Representation on Fisheye Images. arXiv 2023, arXiv:2303.14639. [Google Scholar]
  14. Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  15. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, PMLR, Vienna, Austria, 9–11 November 2020; pp. 1597–1607. [Google Scholar]
  16. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9620–9629. [Google Scholar]
  17. Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar]
  18. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  20. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
  21. Hu, E.J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  22. Zhang, Y.; Zhou, K.; Liu, Z. Neural Prompt Search. arXiv 2022, arXiv:2206.04673. [Google Scholar]
  23. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 709–727. [Google Scholar]
  24. Bahng, H.; Jahanian, A.; Sankaranarayanan, S.; Isola, P. Exploring visual prompts for adapting large-scale models. arXiv 2022, arXiv:2203.17274. [Google Scholar]
  25. Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual, 1–6 August 2021; Volume 1: Long Papers. pp. 4582–4597. [Google Scholar]
  26. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  27. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  28. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  30. Kornblith, S.; Shlens, J.; Le, Q.V. Do better imagenet models transfer better? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2661–2671. [Google Scholar]
  31. Cai, H.; Gan, C.; Zhu, L.; Han, S. TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 11285–11297. [Google Scholar]
  32. Xuhong, L.; Grandvalet, Y.; Davoine, F. Explicit inductive bias for transfer learning with convolutional networks. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2825–2834. [Google Scholar]
  33. Kumar, A.; Raghunathan, A.; Jones, R.; Ma, T.; Liang, P. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  34. Lee, Y.; Jeong, J.; Yun, J.; Cho, W.; Yoon, K.J. SpherePHD: Applying CNNs on a Spherical PolyHeDron Representation of 360° Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9173–9181. [Google Scholar] [CrossRef]
  35. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  36. Elsayed, G.F.; Goodfellow, I.; Sohl-Dickstein, J. Adversarial reprogramming of neural networks. arXiv 2018, arXiv:1806.11146. [Google Scholar]
  37. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
  38. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  39. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  41. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  42. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  43. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  44. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Examples of representing objects in fisheye images: (a) using rectangular bounding boxes, (b) using 24 points. (a) demonstrates that due to distortion of objects, the bounding boxes contain a significant amount of background, failing to clearly delineate overlapping objects. On the other hand, representing objects with 24 points, as shown in (b), reduces the influence of the background, while also enabling clear differentiation of overlapping objects.
Figure 2. Full fine-tuning (FT) updates the entire model, and partial fine-tuning (PT) freezes the backbone model while updating the remaining parts. The fire icon represents tunable parameters, and the ice icon represents frozen parameters. In VP-aided fine-tuning, both the model and the prompt are updated simultaneously. This integrated input is subsequently fed into the model, aiming to enhance the performance.
Figure 3. Process of generating 24-point annotations. (a) Original fisheye image. (b) Visualization of the given rectangular bounding box annotation, composed of width, height, and center point. (c) Rays are emitted at 15-degree intervals from the center coordinates of the bounding box. (d) Annotation of 24 intersection points where the emitted rays intersect with the boundaries of the instance segmentation.
Figure 4. The 24-point labeling results. Labeling is distinguished by five colors according to class. As objects move towards the edges of the image, the distortion in their shape becomes more evident, and their direction and size change. (a) Front, (b) left, (c) right, and (d) rear camera case examples.
Figure 5. Architecture of the proposed ‘VP-aided fine-tuning’. (a) The pretrained model is trained using a standard image without distortion. (b) The process of adapting the pretrained backbone for the fisheye object detection task using VP-aided fine-tuning. VP is added to the fisheye image before being input into the model, where it is tuned together with the model using the loss function defined in Equation (1). Whether the backbone is updated depends on the fine-tuning setting; typically, it is updated during FT and frozen during PT. By adding prompts to the fisheye image, the difference between the standard image and the fisheye image can be reduced, allowing for effective fine-tuning.
Figure 6. Overview of fine-tuning methods.
Figure 7. Results of the qualitative evaluation based on the ground truth and the various methods. Qualitative evaluation is distinguished by five colors according to the predicted class. From the top, the results are for images captured by the front, left, right, and rear cameras.
Figure 8. Performance comparison between rectangular bounding box and 24-point representation across various backbone models.
Figure 9. Types of prompt designs.
Table 1. mIOU based on various backbone models with tuning settings. The best scores are in bold.

| Method | Darknet53 | ResNet50 | VGG19 | Swin-T | Swin-S |
|---|---|---|---|---|---|
| Scratch | 0.5276 | 0.4042 | 0.5325 | 0.3170 | 0.4243 |
| Zero-shot | 0.2304 | 0.2222 | 0.2357 | 0.1850 | 0.1814 |
| PT w/o VP | 0.5188 | 0.4791 | 0.5136 | 0.4160 | 0.4355 |
| FT w/o VP | 0.5842 | 0.5694 | 0.5472 | 0.4919 | 0.5071 |
| PT w/VP | 0.5365 | 0.5100 | 0.5344 | 0.4247 | 0.4365 |
| FT w/VP | **0.5904** | **0.5811** | **0.6050** | **0.5038** | **0.5082** |
Table 2. mAP based on various backbones with tuning settings. The best scores are in bold.

| Method | Darknet53 | ResNet50 | VGG19 | Swin-T | Swin-S |
|---|---|---|---|---|---|
| Scratch | 0.3678 | 0.3016 | 0.4008 | 0.3708 | 0.3789 |
| PT w/o VP | 0.4435 | 0.4094 | 0.4507 | 0.4627 | 0.4786 |
| FT w/o VP | 0.4994 | 0.4715 | 0.5128 | 0.4767 | 0.5051 |
| PT w/VP | 0.4546 | 0.4252 | 0.4578 | 0.4672 | 0.4891 |
| FT w/VP | **0.5135** | **0.4785** | **0.5456** | **0.4823** | **0.5122** |
Table 3. Comparison of PT and FT tunable parameters for various backbones.

| Backbone | PT w/o VP | PT w/VP | FT w/o VP | FT w/VP |
|---|---|---|---|---|
| Darknet53 | 27.15 M | 27.34 M | 54.23 M | 54.42 M |
| ResNet50 | 27.15 M | 27.34 M | 35.94 M | 36.14 M |
| VGG19 | 27.15 M | 27.34 M | 47.71 M | 47.90 M |
| Swin-T | 18.50 M | 18.69 M | 46.02 M | 46.21 M |
| Swin-S | 18.50 M | 18.69 M | 67.34 M | 67.53 M |
Table 4. Computational time and number of tunable parameters based on padding size.

| Padding Size | Training Time | Number of Parameters |
|---|---|---|
| 0 | 29,055 s | 0.00 M |
| 30 | 29,297 s | 0.19 M |
| 50 | 29,344 s | 0.31 M |
| 70 | 29,423 s | 0.41 M |
Table 5. mIOU comparison of various fine-tuning methods with and without VP. The best scores are in bold.

| Method | PT | FT | Bias | L2-SP | PT-FT |
|---|---|---|---|---|---|
| w/o VP | 0.5188 | 0.5842 | 0.5356 | 0.5775 | 0.5683 |
| w/VP | **0.5365** | **0.5904** | **0.5396** | **0.5887** | **0.5693** |
Table 6. PT and FT with VP results based on VP padding size.

| Type | Padding Size | Darknet53 | ResNet50 | VGG19 | Swin-T | Swin-S |
|---|---|---|---|---|---|---|
| PT | 0 | 0.5188 | 0.4791 | 0.5136 | 0.4160 | 0.4355 |
| PT | 30 | 0.5138 | 0.4982 | 0.5048 | 0.4132 | 0.4365 |
| PT | 50 | 0.5365 | 0.5030 | 0.5272 | 0.4247 | 0.4331 |
| PT | 70 | 0.5304 | 0.5100 | 0.5344 | 0.4001 | 0.4248 |
| FT | 0 | 0.5842 | 0.5694 | 0.5472 | 0.4919 | 0.5071 |
| FT | 30 | 0.5619 | 0.5706 | 0.6050 | 0.4852 | 0.5003 |
| FT | 50 | 0.5692 | 0.5811 | 0.5639 | 0.4919 | 0.4928 |
| FT | 70 | 0.5904 | 0.5635 | 0.5073 | 0.5038 | 0.5082 |
Table 7. mIOU comparison based on VP type.

| Tuning Method | Design 1 | Design 2 | Design 3 | Design 4 |
|---|---|---|---|---|
| PT | 0.5365 | 0.5355 | 0.4921 | 0.5261 |
| FT | 0.5904 | 0.5831 | 0.5794 | 0.5929 |
Table 8. mIOU comparison based on VP initialization. The best scores are in bold.

| Method | Normal Random | Uniform Random | Zero |
|---|---|---|---|
| w/o VP | 0.5842 | 0.5652 | 0.5566 |
| w/VP | **0.5904** | **0.5788** | **0.5809** |
Table 9. mIOU comparison based on backbone optimizer and VP optimizer. The best scores are in bold.

| Model Optimizer | VP: SGD | VP: Adam | VP: RMSprop |
|---|---|---|---|
| SGD | **0.5904** | 0.5496 | 0.5068 |
| Adam | 0.5817 | 0.5521 | 0.5886 |
| RMSprop | 0.5874 | 0.5478 | 0.5516 |