1. Introduction
Traffic in coastal waters is highly complex, with numerous vessels traveling to and from port daily. This complexity is particularly evident when the sea is opened for fishing, as ports and nearshore waterways are prone to congestion, resulting in frequent ship traffic accidents. Furthermore, foggy weather significantly diminishes visibility, heightening the risks to maritime safety and substantially impeding normal navigation; these adverse conditions may precipitate collisions between ships. To mitigate these challenges, it is imperative to employ onboard sensors to collect data on a vessel’s position and its immediate aquatic surroundings, which can then be processed to inform subsequent maneuvers. Consequently, ship environmental sensing technology is paramount in facilitating the safe passage of ships navigating coastal waters in foggy conditions.
The primary methods for perceiving the maritime environment include navigation radar, visual observation, auditory perception, AIS/VHF communication, and weather radio. Target detection is a crucial task for comprehending the surrounding environment. Traditional object detection relies on manually designed features, resulting in poor recognition accuracy, high computational overheads, and other limitations. In recent years, significant progress has been made in object detection due to the continuous development of deep learning, which eliminates the need for manual feature design and surpasses traditional object detection technology in terms of accuracy and real-time performance. As a result, deep learning-based object detection has been widely applied in various domains. In terms of computation speed and accuracy, computer vision based on deep learning goes far beyond traditional detection algorithms based on manual features [1]. Object detection based on deep learning can be categorized into two types: one-stage algorithms and two-stage algorithms. Two-stage models are more complex, making them challenging to train and deploy, and their slower detection speeds render them unsuitable for real-time applications. One-stage algorithms, on the other hand, directly detect and classify targets within input images; they offer faster detection speeds, lighter and simpler models, and easier training and deployment. The YOLO [2] series is a family of one-stage object detection algorithms that maintains detection accuracy while offering fast detection speeds. YOLOv8 [3] was proposed by the YOLOv5 [4] team, and YOLOv7 [5] was proposed by the YOLOv4 [6] team. YOLOv7 is comparable to YOLOv5: its accuracy exceeds that of YOLOv5, but it is slower and occupies more memory, so the two versions have their own advantages and disadvantages. The introduction of YOLOv8, however, represents a significant improvement over YOLOv5 and surpasses YOLOv7. It adopts an anchor-free approach, eliminating the substantial computational overhead associated with anchors and resulting in faster convergence and better detection effectiveness. Therefore, we have chosen YOLOv8 as the base model for this study.
Most studies of target detection in foggy conditions focus primarily on defogging, which aims to repair the obscured areas of an image and enhance the clarity of the image background so that subsequent vision algorithms can make better use of it. Traditional dehazing algorithms include the wavelet transform algorithm [7], based on image enhancement, and the dark channel dehazing algorithm [8], based on image restoration; the latest dehazing methods are based on convolutional neural networks. Although the dehazing operation is helpful for target detection in restored images, it does not address the root problem and may lead to image distortion and poor restoration, resulting in less than ideal detection outcomes. Moreover, because the sea surface hydrological and weather environment is complex and changeable, the real-time performance of the detection system is paramount. Traditional defogging pipelines dehaze the image before conducting detection, which is time consuming and fails to meet real-time requirements. Therefore, we opt to extract features directly from foggy ship images, rather than restoring a fog-free background, to achieve target detection.
In this paper, we present a novel approach for detecting ships in coastal waters during foggy conditions. The contributions of this paper are outlined as follows:
We propose a novel approach to detect ship targets in foggy coastal waters by incorporating the coordinate attention (CA) mechanism into the neck layer and integrating deformable convolution into the C2f module following the addition of the attention mechanism;
We compare the detection performance of the model with different attention mechanisms at the same improvement position;
We compare the detection performance of the model with the CA mechanism introduced at the same location and deformable convolution introduced in different parts;
We assess the proposed method against the YOLOv8 model on the same foggy nearshore vessel dataset and compare it with several other object detection models;
We compare the detection results of the proposed method and the YOLOv8 model using real images of ships in foggy nearshore waters;
We demonstrate that the proposed method achieves outstanding detection performance.
The remaining sections of the paper are structured as follows. Section 2 reviews related work. Section 3 describes how the dataset was generated. Section 4 covers YOLOv8s, the YOLOv8s-Fog algorithm, deformable convolution, the CA mechanism, and the atmospheric scattering model. In Section 5, various experiments are conducted using the dataset and the results are analyzed. The findings of this paper are summarized in Section 6 and Section 7.
5. Experiments and Analysis
5.1. Implementation Details
Using the YOLOv8s.pt model, based on the PyTorch framework, CUDA 11.1, and Python 3.8, the input image size was set to 640 × 640, the IoU threshold to 0.7, the batch size to 16, the initial learning rate to 0.01, the momentum to 0.937, and the weight decay to 0.0005, and we trained the network for 200 epochs. All the experiments were conducted on a Linux server with an Intel(R) Xeon(R) Gold 6330 CPU, 96 GB of RAM, and an NVIDIA RTX 3090 GPU (Shanghai Gpushare Cloud Network Technology Co., Ltd., Shanghai, China). The dataset was divided into three subsets: the training set contains 3875 images (70%), the validation set contains 1130 images (20%), and the test set contains 574 images (10%). Each subset contains images of ships in nearshore waters across the various ship categories and fog concentrations, but the three subsets are completely disjoint and derive from different original images. To improve the diversity of the data and the robustness of the model, we use Mosaic data augmentation to randomly crop, rotate, scale, and splice the images; Mosaic augmentation is turned off for the final 10 epochs to help the model converge and stabilize.
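Assuming the ultralytics YOLOv8 training API (the argument names below follow its train() signature as we understand it, and the dataset YAML path is a hypothetical placeholder), the setup above can be sketched as follows.

```python
# Hyperparameters as reported above; close_mosaic=10 disables Mosaic
# augmentation for the final 10 epochs of training.
HYP = dict(
    imgsz=640,            # input image size 640 x 640
    iou=0.7,              # IoU threshold
    batch=16,             # batch size
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay
    epochs=200,           # total training epochs
    close_mosaic=10,      # epochs at the end without Mosaic
)

def train(data_yaml="fog_ships.yaml"):  # hypothetical dataset config path
    # Requires the `ultralytics` package and a CUDA-capable GPU.
    from ultralytics import YOLO
    model = YOLO("yolov8s.pt")
    return model.train(data=data_yaml, **HYP)
```

This sketch is illustrative of the reported configuration, not the authors' exact training script.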
5.2. Evaluation Metrics
To assess the performance of the YOLOv8s-Fog model in foggy nearshore waters, we used a variety of evaluation metrics commonly used in object detection tasks, namely precision (P), recall (R), accuracy, the F1 score, and the mean average precision (mAP). The formulas are given in Equations (8)–(12), where TP is the number of ship samples present and correctly detected, FP is the number of ship samples not present but incorrectly detected, FN is the number of ship samples present but not correctly detected, and C is the number of classes. In addition, some of the experiments also report GFLOPs (giga floating-point operations, i.e., billions of floating-point operations, a measure of a model’s computational cost) and Params (the total number of parameters trained in the model).
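As a minimal sketch of these metrics (the standard definitions; the helper names are ours, not from the paper):

```python
def detection_metrics(tp, fp, fn):
    # Precision, recall, and F1 from detection counts
    # (the standard definitions used by Equations (8)-(12)).
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1

def average_precision(recalls, precisions):
    # Area under the precision-recall curve (trapezoidal rule);
    # averaging this value over the C classes gives the mAP.
    pts = sorted(zip(recalls, precisions))
    return sum(0.5 * (pts[i][1] + pts[i + 1][1]) * (pts[i + 1][0] - pts[i][0])
               for i in range(len(pts) - 1))
```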
5.3. Ablation Study
We conducted a comparison between the YOLOv8s-Fog model and the unimproved YOLOv8s base model. Both models were trained on the same dataset and validated using the same test set. In Figure 12, we present the test-set results for the unimproved YOLOv8s model, for two variations with individual improvements (introducing only deformable convolution and only the CA mechanism), and for all the improvements combined (YOLOv8s-Fog). The results clearly demonstrate that the improved model shows significant enhancements over the unimproved model in terms of both the mAP@50 and precision metrics.
We tested the effect of adding the CA mechanism at different locations in the neck or backbone, using the same self-made foggy nearshore vessel dataset; the results are shown in Table 1. Specifically, bonev1 adds CA after C2f in the backbone, bonev2 adds CA before C2f in the backbone, neck_C2front adds CA before C2f in the neck, neck_Cofront adds CA before the Concat in the neck, and neck_C2behind adds CA after the last three C2f modules in the neck. The second half of the table shows the results after additionally introducing deformable convolution at the same position in the neck as in the first half. The results show that the YOLOv8s-Fog model achieves better accuracy with almost the same number of parameters and GFLOPs.
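To illustrate the deformable convolution idea, here is a single-channel, single-location sketch (not the paper's exact module; function names are ours): each kernel tap samples the input at its regular grid position plus a learned 2D offset, via bilinear interpolation.

```python
import math

def bilinear(img, y, x):
    # Sample the image at a fractional location, with zero padding outside.
    H, W = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0
    def px(r, c):
        return img[r][c] if 0 <= r < H and 0 <= c < W else 0.0
    return ((1 - wy) * (1 - wx) * px(y0, x0) + (1 - wy) * wx * px(y0, x0 + 1)
            + wy * (1 - wx) * px(y0 + 1, x0) + wy * wx * px(y0 + 1, x0 + 1))

def deform_conv_at(img, kernel, cy, cx, offsets):
    # One output of a k x k deformable convolution centered at (cy, cx):
    # each tap samples at its regular grid position plus a learned (dy, dx)
    # offset, so the receptive field can adapt to the object's shape.
    k = len(kernel)
    r = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(img, cy + i - r + dy, cx + j - r + dx)
    return out
```

With all offsets set to zero, this reduces to an ordinary convolution; in the real module, the offsets are predicted by an auxiliary convolutional layer.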
We tested the performance of YOLOv8s detectors constructed using four different attention mechanisms (with the same hyperparameters, for fairness) on the same self-made foggy nearshore vessel dataset, including GAM [39], ECA [40], EMA [41], and CA, adding each attention mechanism (with or without DCN) at the same location. As shown in Table 2, the positions at which the different attention mechanisms were added coincide with the positions used in the YOLOv8s-Fog model, and the second half of the table shows the results after additionally introducing deformable convolution at the same position in the neck as in the first half. The results show that the YOLOv8s-Fog model achieves the highest detection accuracy, with similar parameters and GFLOPs.
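A minimal numpy sketch of the coordinate attention computation may clarify how it differs from channel-only attention (batch normalization and the exact non-linearity are omitted; the weight matrices stand in for the 1×1 convolutions and are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w1, w_h, w_w):
    # x: (C, H, W) feature map. Unlike global average pooling, CA pools along
    # each spatial axis separately, preserving position along the other axis.
    C, H, W = x.shape
    pool_h = x.mean(axis=2)                       # (C, H): average over width
    pool_w = x.mean(axis=1)                       # (C, W): average over height
    y = np.concatenate([pool_h, pool_w], axis=1)  # (C, H + W)
    y = np.maximum(w1 @ y, 0.0)                   # shared reduction + ReLU, (Cr, H + W)
    a_h = sigmoid(w_h @ y[:, :H])                 # (C, H) attention along height
    a_w = sigmoid(w_w @ y[:, H:])                 # (C, W) attention along width
    return x * a_h[:, :, None] * a_w[:, None, :]  # reweight the input
```

Because the attention factors lie in (0, 1), the module selectively suppresses positions along both axes rather than re-scaling whole channels uniformly.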
5.4. Comparison with State-of-the-Art Detection Models
The confusion matrix for YOLOv8s-Fog’s results is presented in Figure 13, with the true category on the horizontal axis and the predicted category on the vertical axis. Each cell represents the proportion of the corresponding true class predicted as the corresponding class. Larger values along the diagonal of the confusion matrix indicate better performance.
Accuracy represents the proportion of correctly classified samples among the total number of samples. Table 3 shows the number of correctly predicted samples and the total number of samples for each class in the test set (TP indicates the number of samples that were correctly classified); we obtain an accuracy of 63.56% over all the samples. However, accuracy alone does not fully reflect the test results, so the PR curve and F1 score should be examined further.
The PR curves for all classes of the YOLOv8s-Fog outputs are presented in Figure 14. For a PR curve, the Area Under the Curve (AUC) is indicative of the overall performance of the classifier, with a larger AUC signifying superior performance. The F1 curves for all classes of the YOLOv8s-Fog outputs are depicted in Figure 15; the F1 score is the harmonic mean of the precision and recall. From these two figures, it can be observed that the YOLOv8s-Fog model outperforms the YOLOv8s model in terms of detection effectiveness.
We conducted a comparison of the detection performance between YOLOv8s-Fog and other object detection models within each category using the same test set. The results, presented in Table 4, demonstrate that the YOLOv8s-Fog model achieves the highest detection accuracy and exhibits superior performance across multiple categories.
5.5. Comparison with Retinex + YOLOv8s
Retinex [44] is an image enhancement algorithm whose core idea is to adjust the contrast and brightness of an image while preserving its details. In this experiment, the foggy dataset is first dehazed with Retinex; the dehazing effect is shown in Figure 16c (Figure 16a is a fog-free image, Figure 16b is a foggy image generated from Figure 16a, and Figure 16c is the enhanced image obtained by applying Retinex to Figure 16b). The original YOLOv8s model is then used for detection, with the results shown in Table 5. It can be seen that the YOLOv8s-Fog model still outperforms the simple dehaze-then-detect pipeline.
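For reference, a minimal single-scale Retinex (SSR) sketch is shown below. This is one common Retinex variant; the paper does not specify which variant was used, and the sigma value is an assumption. The reflectance is estimated as the log-domain difference between the image and a Gaussian-blurred illumination estimate.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # 1D Gaussian kernel truncated at 3 standard deviations, normalized.
    r = int(3 * sigma)
    t = np.arange(-r, r + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable Gaussian blur with edge padding (illumination estimate).
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    pad = np.pad(img, ((r, r), (0, 0)), mode="edge")
    tmp = np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, pad)
    pad = np.pad(tmp, ((0, 0), (r, r)), mode="edge")
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="valid"), 1, pad)

def single_scale_retinex(img, sigma=80.0, eps=1.0):
    # SSR: reflectance = log(image) - log(illumination), with the
    # illumination approximated by a Gaussian-blurred copy of the image.
    img = np.asarray(img, dtype=float)
    return np.log(img + eps) - np.log(gaussian_blur(img, sigma) + eps)
```

For a uniformly lit region, the blurred estimate equals the image and the reflectance is zero; haze-induced low-frequency brightness is similarly suppressed.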
5.6. Comparison with the Fog-Free Dataset
We also compared the YOLOv8s-Fog and YOLOv8s benchmark models when both were trained on the original fog-free dataset and validated on the same fog-free test set; the results are presented in Table 6. The analysis reveals that the enhanced network structure, YOLOv8s-Fog, does not match the effectiveness of the original YOLOv8s network structure when trained under fog-free conditions. This suggests that the enhancements specifically target the characteristics of foggy weather and may not be suitable for other weather scenarios.
5.7. Analysis of the Detection Results
The diagrams below illustrate the comparative results for the YOLOv8s and YOLOv8s-Fog models. It is evident that YOLOv8s-Fog surpasses the original algorithm in terms of detection performance, mitigating both the missed detections and misdetections present in the original algorithm. For example, as shown in Figure 17, improvements (2) to (12) address missed detections, while improvements (13) to (15) rectify misdetections. Overall, YOLOv8s-Fog also raises the confidence of accurate detections, culminating in a more precise detection outcome.
Figure 18 shows a comparison between the Grad-CAM [45] visualizations generated by YOLOv8s and YOLOv8s-Fog. Grad-CAM highlights the regions of the image the network attends to, which in turn allows us to analyze whether the network has learned the correct features. It can be seen that the YOLOv8s-Fog model pays more attention to the target itself, whereas the original YOLOv8s model either also focuses on areas beyond the target, such as the sea surface and the sky, or focuses on only part of the ship. These figures show that the proposed model improves the detection of coastal sea targets in foggy conditions and better attends to the characteristics of the detected targets.
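Conceptually, Grad-CAM weights each feature map by its spatially averaged gradient and keeps only the positive contributions; a minimal numpy sketch (our own helper, not the reference implementation of [45]):

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (K, H, W) arrays holding the K feature maps of a
    # convolutional layer and d(class score)/d(activation) for each of them.
    alpha = gradients.mean(axis=(1, 2))              # channel weights (alpha_k)
    cam = np.tensordot(alpha, activations, axes=1)   # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                       # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                        # normalize to [0, 1] for display
    return cam
```

The resulting (H, W) map is upsampled to the input resolution and overlaid on the image to produce heatmaps like those in Figure 18.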
We took field photographs at Xinghai Bay and Donggang Marina in Dalian, China, yielding 158 samples, supplemented by 42 images of ships in nearshore waters on real foggy days found on the Internet. The test results are shown in Table 7; the training set remains the dataset on which we generated the fog. Figure 19 shows the detection results of YOLOv8s and YOLOv8s-Fog on this dataset of ships in coastal waters in foggy conditions. The results not only prove that the self-made foggy dataset can be used for the detection of ship targets in foggy conditions, but also show that the improved YOLOv8s-Fog model has better detection performance than the YOLOv8s model in fog, reducing undesirable effects such as false detections and redundant detection boxes.
6. Discussion
As a new target detection method, the YOLOv8s-Fog model proposed in this paper can be applied to target detection tasks involving ships in coastal waters in foggy weather, with high speed and accuracy. YOLOv8s-Fog does not perform a dehazing operation, but directly improves the network structure based on a foggy-day dataset. Other advanced models [46,47], which usually dehaze first and then detect, often suffer from problems such as image distortion, poor image restoration, and lengthy dehazing operations; YOLOv8s-Fog does not have such problems. YOLOv8s-Fog introduces the CA mechanism to better perceive direction and position and thereby capture more important local features. Since, to our knowledge, attention mechanisms have not previously been applied to ship detection in foggy coastal waters, we drew on [48,49] and other models that introduce attention mechanisms such as ECA, GAM, and EMA to explore and compare these alternatives with the CA mechanism. We found that, with all other conditions held equal, adding the CA mechanism has the best effect, and the characteristics of different ships are better learned. In addition, YOLOv8s-Fog introduces deformable convolution, and we explored the most suitable position at which to replace traditional convolution with deformable convolution under the same attention mechanism, so as to better focus on the important feature areas of ships with variable and complex shapes. Compared with other models, the excellent detection performance of YOLOv8s-Fog demonstrates a new possibility for target detection applications involving ships in coastal waters in foggy conditions.
However, although the accuracy of the method has improved, the number of parameters and the memory required have not decreased, and we will explore more efficient ways to reduce this additional resource consumption in the future. In addition, this paper is limited to the detection of ships in coastal waters in foggy weather; the detection of ships in other adverse weather conditions, such as rain and hail, remains for future research.