1. Introduction
Traffic in coastal waters is highly complex, with numerous vessels traveling to and from port daily. This complexity is particularly evident when the sea is opened for fishing, as ports and nearshore waterways are prone to congestion, resulting in frequent ship traffic accidents. Furthermore, foggy weather significantly diminishes visibility, heightening the risks to maritime safety and substantially impeding normal navigation; these adverse conditions may precipitate collisions between ships. To mitigate these challenges, it is imperative to employ onboard sensors to collect data on a vessel’s position and its immediate aquatic surroundings, which can then be processed to inform subsequent maneuvers. Consequently, ship environmental sensing technology is paramount in facilitating the safe passage of ships navigating coastal waters in foggy conditions.
The primary methods for perceiving the maritime environment include navigation radar, visual observation, auditory perception, AIS/VHF communication, and weather radio. Target detection is a crucial task for comprehending the surrounding environment. Traditional object detection relies on manually designed features, resulting in poor recognition accuracy, high computational overheads, and other limitations. In recent years, significant progress has been made in object detection due to the continuous development of deep learning, which eliminates the need for manual feature design and surpasses traditional object detection technology in terms of accuracy and real-time performance. As a result, deep learning-based object detection has been widely applied in various domains. In terms of computation speed and accuracy, computer vision based on deep learning goes far beyond traditional detection algorithms based on manual features [1]. Object detection based on deep learning can be categorized into two types: one-stage algorithms and two-stage algorithms. Two-stage models are more complex, making them challenging to train and deploy, and their slower detection speeds render them unsuitable for real-time applications. One-stage algorithms, on the other hand, directly detect and classify targets within input images; they offer faster detection speeds, lighter and simpler models, and easier training and deployment. The YOLO [2] series is a family of one-stage object detection algorithms that maintains detection accuracy while offering fast detection speeds. YOLOv8 [3] was proposed by the YOLOv5 [4] team, and YOLOv7 [5] was proposed by the YOLOv4 [6] team. YOLOv7 is comparable to YOLOv5: its accuracy exceeds that of YOLOv5, but it is slower and occupies more memory, so the two versions have their own advantages and disadvantages. The introduction of YOLOv8, however, represents a significant improvement over YOLOv5 and surpasses YOLOv7. It adopts an anchor-free approach, eliminating the substantial computational overhead associated with anchors and resulting in faster convergence and better detection effectiveness. Therefore, we have chosen YOLOv8 as the base model for this study.
Most studies of target detection in foggy conditions focus primarily on defogging, which aims to repair the obscured areas of an image and enhance the clarity of the image background so that subsequent vision algorithms can make better use of it. Traditional dehazing algorithms include the wavelet transform algorithm [7], based on image enhancement, and the dark channel dehazing algorithm [8], based on image restoration; the latest dehazing methods are based on convolutional neural networks. Although the dehazing operation is helpful for target detection in restored images, it does not address the root problem and may lead to image distortion and poor restoration, resulting in less than ideal detection outcomes. Moreover, because the sea surface hydrological and weather environment is complex and changeable, the real-time performance of the detection system is paramount. Traditional defogging pipelines dehaze the image before conducting detection, which is time consuming and fails to meet real-time requirements. Therefore, we opt to extract features directly from foggy ship images, rather than restoring a fog-free background, to achieve target detection.
In this paper, we present a novel approach for detecting ships in coastal waters during foggy conditions. The contributions of this paper are outlined as follows:
We propose a novel approach to detect ship targets in foggy coastal waters by incorporating the coordinate attention (CA) mechanism into the neck layer and integrating deformable convolution into the C2f module following the addition of the attention mechanism;
We compare the detection performance of the model with different attention mechanisms at the same improvement position;
We compare the detection performance of the model with the CA mechanism introduced at the same location and deformable convolution introduced in different parts;
We assess the proposed method against the YOLOv8 model on the same foggy nearshore vessel dataset and compare it with several other object detection models;
We compare the detection results of the proposed method and the YOLOv8 model using real images of ships in foggy nearshore waters;
We demonstrate that the proposed method achieves outstanding detection performance.
The remaining sections of the paper are structured as follows. Section 2 reviews related work. Section 3 describes how the dataset was generated. Section 4 covers YOLOv8s, the YOLOv8s-Fog algorithm, deformable convolution, the CA mechanism, and the atmospheric scattering model. In Section 5, various experiments are conducted using the dataset and the results are analyzed. The findings of this paper are summarized in Section 6 and Section 7.
5. Experiments and Analysis
5.1. Implementation Details
Using the YOLOv8s.pt model, based on the PyTorch framework, CUDA 11.1, and Python 3.8, the input image size was set to 640 × 640, the IoU threshold to 0.7, the batch size to 16, the initial learning rate to 0.01, the momentum to 0.937, and the weight decay to 0.0005, and we trained the network for 200 epochs. All the experiments were conducted on a Linux server with an Intel(R) Xeon(R) Gold 6330 CPU, 96 GB of RAM, and an NVIDIA RTX 3090 GPU (Shanghai Gpushare Cloud Network Technology Co., Ltd., Shanghai, China). The dataset was divided into three subsets: the training set contains 3875 images (70%), the validation set contains 1130 images (20%), and the test set contains 574 images (10%). Each subset contains images of ships in nearshore waters across the various ship categories and fog concentrations, but the three subsets are completely disjoint and derive from different original images. To improve the diversity of the data and the robustness of the model, we use Mosaic data augmentation to randomly crop, rotate, scale, and splice the images; Mosaic augmentation is turned off for the final 10 epochs to help the model converge and stabilize.
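Assuming the ultralytics YOLOv8 training API (the argument names below follow its train() signature as we understand it, and the dataset YAML path is a hypothetical placeholder), the setup above can be sketched as follows.

```python
# Hyperparameters as reported above; close_mosaic=10 disables Mosaic
# augmentation for the final 10 epochs of training.
HYP = dict(
    imgsz=640,            # input image size 640 x 640
    iou=0.7,              # IoU threshold
    batch=16,             # batch size
    lr0=0.01,             # initial learning rate
    momentum=0.937,       # SGD momentum
    weight_decay=0.0005,  # weight decay
    epochs=200,           # total training epochs
    close_mosaic=10,      # epochs at the end without Mosaic
)

def train(data_yaml="fog_ships.yaml"):  # hypothetical dataset config path
    # Requires the `ultralytics` package and a CUDA-capable GPU.
    from ultralytics import YOLO
    model = YOLO("yolov8s.pt")
    return model.train(data=data_yaml, **HYP)
```

This sketch is illustrative of the reported configuration, not the authors' exact training script.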
5.2. Evaluation Metrics
To assess the performance of the YOLOv8s-Fog model in foggy nearshore waters, we used a variety of evaluation metrics commonly used in object detection tasks, namely precision (P), recall (R), accuracy, the F1 score, and the mean average precision (mAP). The formulas are given in Equations (8)–(12), where TP is the number of ship samples present and correctly detected, FP is the number of ship samples not present but incorrectly detected, FN is the number of ship samples present but not correctly detected, and C is the number of classes. In addition, some of the experiments also report GFLOPs (giga floating-point operations, i.e., billions of floating-point operations, a measure of a model’s computational cost) and Params (the total number of parameters trained in the model).
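As a minimal sketch of these metrics (the standard definitions; the helper names are ours, not from the paper):

```python
def detection_metrics(tp, fp, fn):
    # Precision, recall, and F1 from detection counts
    # (the standard definitions used by Equations (8)-(12)).
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1

def average_precision(recalls, precisions):
    # Area under the precision-recall curve (trapezoidal rule);
    # averaging this value over the C classes gives the mAP.
    pts = sorted(zip(recalls, precisions))
    return sum(0.5 * (pts[i][1] + pts[i + 1][1]) * (pts[i + 1][0] - pts[i][0])
               for i in range(len(pts) - 1))
```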
5.3. Ablation Study
We conducted a comparison between the YOLOv8s-Fog model and the unimproved YOLOv8s base model. Both models were trained on the same dataset and validated using the same test set. In Figure 12, we present the test-set results for the unimproved YOLOv8s model, for two variations with individual improvements (introducing only deformable convolution and only the CA mechanism), and for all the improvements combined (YOLOv8s-Fog). The results clearly demonstrate that the improved model shows significant enhancements over the unimproved model in terms of both the mAP@50 and precision metrics.
We tested the effect of adding the CA mechanism at different locations in the neck or backbone, using the same self-made foggy nearshore vessel dataset; the results are shown in Table 1. Specifically, bonev1 adds CA after C2f in the backbone, bonev2 adds CA before C2f in the backbone, neck_C2front adds CA before C2f in the neck, neck_Cofront adds CA before the Concat in the neck, and neck_C2behind adds CA after the last three C2f modules in the neck. The second half of the table shows the results after additionally introducing deformable convolution at the same position in the neck as in the first half. The results show that the YOLOv8s-Fog model achieves better accuracy with almost the same number of parameters and GFLOPs.
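To illustrate the deformable convolution idea, here is a single-channel, single-location sketch (not the paper's exact module; function names are ours): each kernel tap samples the input at its regular grid position plus a learned 2D offset, via bilinear interpolation.

```python
import math

def bilinear(img, y, x):
    # Sample the image at a fractional location, with zero padding outside.
    H, W = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    wy, wx = y - y0, x - x0
    def px(r, c):
        return img[r][c] if 0 <= r < H and 0 <= c < W else 0.0
    return ((1 - wy) * (1 - wx) * px(y0, x0) + (1 - wy) * wx * px(y0, x0 + 1)
            + wy * (1 - wx) * px(y0 + 1, x0) + wy * wx * px(y0 + 1, x0 + 1))

def deform_conv_at(img, kernel, cy, cx, offsets):
    # One output of a k x k deformable convolution centered at (cy, cx):
    # each tap samples at its regular grid position plus a learned (dy, dx)
    # offset, so the receptive field can adapt to the object's shape.
    k = len(kernel)
    r = k // 2
    out = 0.0
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear(img, cy + i - r + dy, cx + j - r + dx)
    return out
```

With all offsets set to zero, this reduces to an ordinary convolution; in the real module, the offsets are predicted by an auxiliary convolutional layer.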
We tested the performance of YOLOv8s detectors constructed using four different attention mechanisms (with the same hyperparameters, for fairness) on the same self-made foggy nearshore vessel dataset, including GAM [39], ECA [40], EMA [41], and CA, adding each attention mechanism (with or without DCN) at the same location. As shown in Table 2, the positions at which the different attention mechanisms were added coincide with the positions used in the YOLOv8s-Fog model, and the second half of the table shows the results after additionally introducing deformable convolution at the same position in the neck as in the first half. The results show that the YOLOv8s-Fog model achieves the highest detection accuracy, with similar parameters and GFLOPs.
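A minimal numpy sketch of the coordinate attention computation may clarify how it differs from channel-only attention (batch normalization and the exact non-linearity are omitted; the weight matrices stand in for the 1×1 convolutions and are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w1, w_h, w_w):
    # x: (C, H, W) feature map. Unlike global average pooling, CA pools along
    # each spatial axis separately, preserving position along the other axis.
    C, H, W = x.shape
    pool_h = x.mean(axis=2)                       # (C, H): average over width
    pool_w = x.mean(axis=1)                       # (C, W): average over height
    y = np.concatenate([pool_h, pool_w], axis=1)  # (C, H + W)
    y = np.maximum(w1 @ y, 0.0)                   # shared reduction + ReLU, (Cr, H + W)
    a_h = sigmoid(w_h @ y[:, :H])                 # (C, H) attention along height
    a_w = sigmoid(w_w @ y[:, H:])                 # (C, W) attention along width
    return x * a_h[:, :, None] * a_w[:, None, :]  # reweight the input
```

Because the attention factors lie in (0, 1), the module selectively suppresses positions along both axes rather than re-scaling whole channels uniformly.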
5.4. Comparison with State-of-the-Art Detection Models
The confusion matrix for YOLOv8s-Fog’s results is presented in Figure 13, with the true category on the horizontal axis and the predicted category on the vertical axis. Each cell represents the proportion of the corresponding true class predicted as the corresponding class. Larger values along the diagonal of the confusion matrix indicate better performance.
Accuracy represents the proportion of correctly classified samples among the total number of samples. Table 3 shows the number of correctly predicted samples and the total number of samples for each class in the test set (TP indicates the number of samples that were correctly classified); we obtain an accuracy of 63.56% over all the samples. However, accuracy alone does not fully reflect the test results, so the PR curve and F1 score should be examined further.
The PR curves for all classes of the YOLOv8s-Fog outputs are presented in Figure 14. For a PR curve, the Area Under the Curve (AUC) is indicative of the overall performance of the classifier, with a larger AUC signifying superior performance. The F1 curves for all classes of the YOLOv8s-Fog outputs are depicted in Figure 15; the F1 score is the harmonic mean of the precision and recall. From these two figures, it can be observed that the YOLOv8s-Fog model outperforms the YOLOv8s model in terms of detection effectiveness.
We conducted a comparison of the detection performance between YOLOv8s-Fog and other object detection models within each category using the same test set. The results, presented in Table 4, demonstrate that the YOLOv8s-Fog model achieves the highest detection accuracy and exhibits superior performance across multiple categories.
5.5. Comparison with Retinex + YOLOv8s
Retinex [44] is an image enhancement algorithm whose core idea is to adjust the contrast and brightness of an image while preserving its details. In this experiment, the foggy dataset is first dehazed with Retinex; the dehazing effect is shown in Figure 16c (Figure 16a is a fog-free image, Figure 16b is a foggy image generated from Figure 16a, and Figure 16c is the enhanced image obtained by applying Retinex to Figure 16b). The original YOLOv8s model is then used for detection, with the results shown in Table 5. It can be seen that the YOLOv8s-Fog model still outperforms the simple dehaze-then-detect pipeline.
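For reference, a minimal single-scale Retinex (SSR) sketch is shown below. This is one common Retinex variant; the paper does not specify which variant was used, and the sigma value is an assumption. The reflectance is estimated as the log-domain difference between the image and a Gaussian-blurred illumination estimate.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # 1D Gaussian kernel truncated at 3 standard deviations, normalized.
    r = int(3 * sigma)
    t = np.arange(-r, r + 1, dtype=float)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Separable Gaussian blur with edge padding (illumination estimate).
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    pad = np.pad(img, ((r, r), (0, 0)), mode="edge")
    tmp = np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, pad)
    pad = np.pad(tmp, ((0, 0), (r, r)), mode="edge")
    return np.apply_along_axis(lambda row: np.convolve(row, k, mode="valid"), 1, pad)

def single_scale_retinex(img, sigma=80.0, eps=1.0):
    # SSR: reflectance = log(image) - log(illumination), with the
    # illumination approximated by a Gaussian-blurred copy of the image.
    img = np.asarray(img, dtype=float)
    return np.log(img + eps) - np.log(gaussian_blur(img, sigma) + eps)
```

For a uniformly lit region, the blurred estimate equals the image and the reflectance is zero; haze-induced low-frequency brightness is similarly suppressed.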
5.6. Comparison with the Fog-Free Dataset
We also compared the YOLOv8s-Fog and YOLOv8s benchmark models when both were trained on the original fog-free dataset and validated on the same fog-free test set; the results are presented in Table 6. The analysis reveals that the enhanced network structure, YOLOv8s-Fog, does not match the effectiveness of the original YOLOv8s network structure when trained under fog-free conditions. This suggests that the enhancements specifically target the characteristics of foggy weather and may not be suitable for other weather scenarios.
5.7. Analysis of the Detection Results
The diagrams below illustrate the comparative results for the YOLOv8s and YOLOv8s-Fog models. It is evident that YOLOv8s-Fog surpasses the original algorithm in terms of detection performance, mitigating both the missed detections and misdetections present in the original algorithm. For example, as shown in Figure 17, improvements (2) to (12) address missed detections, while improvements (13) to (15) rectify misdetections. Overall, YOLOv8s-Fog also raises the confidence of accurate detections, culminating in a more precise detection outcome.
Figure 18 shows a comparison between the Grad-CAM [45] visualizations generated by YOLOv8s and YOLOv8s-Fog. Grad-CAM highlights the regions of the image the network attends to, which in turn allows us to analyze whether the network has learned the correct features. It can be seen that the YOLOv8s-Fog model pays more attention to the target itself, whereas the original YOLOv8s model either also focuses on areas beyond the target, such as the sea surface and the sky, or focuses on only part of the ship. These figures show that the proposed model improves the detection of coastal sea targets in foggy conditions and better attends to the characteristics of the detected targets.
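Conceptually, Grad-CAM weights each feature map by its spatially averaged gradient and keeps only the positive contributions; a minimal numpy sketch (our own helper, not the reference implementation of [45]):

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (K, H, W) arrays holding the K feature maps of a
    # convolutional layer and d(class score)/d(activation) for each of them.
    alpha = gradients.mean(axis=(1, 2))              # channel weights (alpha_k)
    cam = np.tensordot(alpha, activations, axes=1)   # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                       # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                        # normalize to [0, 1] for display
    return cam
```

The resulting (H, W) map is upsampled to the input resolution and overlaid on the image to produce heatmaps like those in Figure 18.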
We took field photographs at Xinghai Bay and Donggang Marina in Dalian, China, yielding 158 samples, supplemented by 42 images of ships in nearshore waters on real foggy days found on the Internet. The test results are shown in Table 7; the training set remains the dataset on which we generated the fog. Figure 19 shows the detection results of YOLOv8s and YOLOv8s-Fog on this dataset of ships in coastal waters in foggy conditions. The results not only prove that the self-made foggy dataset can be used for the detection of ship targets in foggy conditions, but also show that the improved YOLOv8s-Fog model has better detection performance than the YOLOv8s model in fog, reducing undesirable effects such as false detections and redundant detection boxes.
6. Discussion
As a new target detection method, the YOLOv8s-Fog model proposed in this paper can be applied to target detection tasks involving ships in coastal waters in foggy weather, with high speed and accuracy. YOLOv8s-Fog does not perform a dehazing operation, but directly improves the network structure based on a foggy-day dataset. Other advanced models [46,47], which usually dehaze first and then detect, often suffer from problems such as image distortion, poor image restoration, and lengthy dehazing operations; YOLOv8s-Fog does not have such problems. YOLOv8s-Fog introduces the CA mechanism to better perceive direction and position and thereby capture more important local features. Since, to our knowledge, attention mechanisms have not previously been applied to ship detection in foggy coastal waters, we drew on [48,49] and other models that introduce attention mechanisms such as ECA, GAM, and EMA to explore and compare these alternatives with the CA mechanism. We found that, with all other conditions held equal, adding the CA mechanism has the best effect, and the characteristics of different ships are better learned. In addition, YOLOv8s-Fog introduces deformable convolution, and we explored the most suitable position at which to replace traditional convolution with deformable convolution under the same attention mechanism, so as to better focus on the important feature areas of ships with variable and complex shapes. Compared with other models, the excellent detection performance of YOLOv8s-Fog demonstrates a new possibility for target detection applications involving ships in coastal waters in foggy conditions.
However, although the accuracy of the method has improved, the number of parameters and the memory required have not decreased, and we will explore more efficient ways to reduce this additional resource consumption in the future. In addition, this paper is limited to the detection of ships in coastal waters in foggy weather; the detection of ships in other adverse weather conditions, such as rain and hail, remains for future research.