Article

Lightweight Water Surface Object Detection Network for Unmanned Surface Vehicles

School of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150009, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3089; https://doi.org/10.3390/electronics13153089
Submission received: 3 July 2024 / Revised: 30 July 2024 / Accepted: 2 August 2024 / Published: 4 August 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract
The detection algorithms for water surface objects considerably assist unmanned surface vehicles in rapidly perceiving their surrounding environment, providing essential environmental information and evaluating object attributes. This study proposes a lightweight water surface target detection algorithm called YOLO-WSD (water surface detection), based on YOLOv8n, to address the need for real-time, high-precision, and lightweight target detection algorithms that can adapt to rapid changes in the surrounding environment during specific tasks. First, we design the C2F-E module, which has a richer gradient flow than the conventional C2F module, enabling the backbone network to extract richer multi-level features while remaining lightweight. Additionally, this study redesigns the feature fusion network structure by introducing low-level features and achieving multi-level fusion to enhance the network’s capability to integrate features across levels. Meanwhile, it investigates the impact of channel number differences in Concat-module fusion on model performance, thereby optimizing the neural network structure. Lastly, it introduces the WIOU localization loss function to bolster model robustness. Experiments demonstrate that YOLO-WSD achieves a 4.6% and 3.4% improvement in mAP0.5 on the water surface object detection dataset and the Seaship public dataset, respectively, with recall rates improving by 5.4% and 8.5% relative to the baseline YOLOv8n model. The model’s parameter size is 3.3 M. YOLO-WSD exhibits superior performance compared to other mainstream lightweight algorithms.

1. Introduction

Unmanned surface vehicles (USVs) equipped with high-performance modules are increasingly becoming the primary tool for marine exploration [1]. This trend is mainly attributable to the advantageous characteristics of unmanned surface vehicles, such as low cost, low risk, and high efficiency [2]. In executing tasks such as target detection and environmental monitoring [3], these vehicles rely heavily on their ability to perceive the environment [4]. Presently, the perception technology of unmanned surface vehicles predominantly depends on various radar systems and the Automatic Identification System (AIS) [5]. However, the results obtained from radar scans cannot determine object properties, making them unsuitable for precise operations. The AIS cannot locate small hazardous objects near the shore, which at times makes it unreliable. Traditional vision technology is also one of the most popular solutions, and extensive prior knowledge allows it to perform excellently under ideal conditions [6]. However, the visual system still encounters a range of complex sea conditions, including waves, sea fog, and water surface light reflections [7]. These environmental issues result in technical problems such as limited target pixel coverage, significant variations in target scale, and weak texture information, which can cause the visual system to fail [8]. All of these factors hinder the development of visual systems for unmanned surface vehicles [9].
In recent years, the rapid advancement of deep learning has provided innovative solutions for the visual systems of unmanned surface vehicles. Mature algorithms emerging from this progress, such as Faster R-CNN [10], SSD [11], and the YOLO [12] series, have surpassed traditional visual algorithms in target recognition. Their exceptional generalization capabilities have led to widespread applications. Significant enhancements in accuracy and real-time performance have marked considerable advancements in deep learning for surface target detection in maritime applications.
However, in the practical application of USV surface target detection, deep learning algorithms still present some safety issues. For instance, a failure to recall targets could lead to maritime accidents, and poor generalization could cause the model to fail, leaving the USV unable to move. These issues hinder the development of intelligent unmanned boats. Additionally, achieving excellent real-time detection, lightweight characteristics, and deployability within limited hardware computing power while maintaining detection performance remains a persistent challenge. To address this, this study proposes the YOLO-WSD model, which achieves a fine balance between model performance and real-time capability without incorporating cutting-edge modules while ensuring deployability. By optimizing the network structure and adjusting gradient strategies, this model maintains the inference speed and lightweight characteristics of YOLOv8n while surpassing the model performance of YOLOv8s. Moreover, in experiments, YOLO-WSD demonstrated a high recall rate and excellent generalization ability, effectively reducing the occurrence of the aforementioned issues.
This study presents the following primary contributions:
  • The C2F-E module is designed and employed to reconstruct the backbone network using the gradient path design strategy and the ELAN structure. The introduction of more enriched feature extraction levels not only reduces the parameters and computational load of the backbone network but also enhances the overall detection performance of the model.
  • The feature fusion network structure has been redesigned, incorporating the bidirectional feature pyramid design and introducing lower-level features for multi-level fusion. The improved structure significantly enhances the fusion capability of features at different scales, thereby greatly improving the model detection accuracy.
  • A comparative study is conducted on the impact of different channel numbers at high and low levels on the feature fusion of the Concat module, reaching preliminary conclusions.
  • The weighted intersection over union (WIOU) loss accelerates model convergence, enhancing model accuracy without increasing computational load or inference time.
The remaining sections of this paper are organized as follows: Section 2 reviews related work on surface target detection. Section 3 provides a detailed introduction to the proposed improvement scheme and its principles. Section 4 introduces the dataset and experimental environment, while Section 5 discusses the experiments and the results. Section 6 summarizes these findings.

2. Related Work

2.1. Surface Target Detection Based on Deep Learning

The rapid advancement of deep learning in surface target detection on water is primarily attributed to the development of numerous deep learning-based models and datasets for surface detection by researchers. These advancements have enabled deep learning to quickly surpass traditional algorithms. However, the available data often cover relatively uniform weather conditions, making it challenging to achieve high generalization.
Moosbauer et al. [13] publicly released the Singapore Maritime Dataset, which is based on videos and significantly enriches the available dataset resources in the field. The dataset supports object detection and tracking tasks, but its content is relatively simple. Shao et al. [14] proposed the Seaships dataset, which contains over thirty thousand images and covers multiple scenes and time periods. Zhou et al. [15] introduced the water surface object detection dataset (WSODD), which includes fourteen surface target categories in various scenes and weather conditions. With its comprehensive data distribution and high generalization, this dataset has valuable applications in surface detection.
The Zhou team also proposed the CRB-Net target detection algorithm, achieving a 65% mAP value on the WSODD dataset. CRB-Net surpasses a series of lightweight mainstream algorithms, such as YOLOv4 and EfficientDet, through an improved K-means algorithm and an SPP attention structure. However, because the algorithm was proposed early, its detection performance is difficult to match with current state-of-the-art models. Zou et al. [16] presented an improved SSD algorithm based on MobileNetV2 and CNN, which performs better in detecting ship images than the baseline algorithm. However, the SSD algorithm struggles with unbalanced matching of positive and negative samples, making it difficult to apply to detecting small marine targets, which is a crucial task. Liu and Li [17] used YOLOv3 as the baseline model and designed a loss function, E-IOU, dependent on bounding box regression; their model accelerated convergence while improving detection accuracy. Han et al. [18] proposed ShipYOLO based on YOLOv4, utilizing reconstructed convolution techniques and residual connections and significantly increasing small target detection accuracy; it achieved a 0.9% improvement over YOLOv4 on the Seaship dataset. Zhang et al. [19] enhanced detection accuracy by incorporating the GHOST module and a Transformer-based feature pyramid structure. The proposed algorithm exhibited strong robustness in sea fog environments. In addition, the study conducted clustering on surface objects to determine optimal anchor boxes. However, since the algorithm is designed for specific marine conditions, its ability to maintain excellent performance under general conditions still needs to be proven through experiments.
The performance of the aforementioned models was impressive at the time of their proposal, but there are still two important technical directions that need to be addressed. The first is to rapidly integrate cutting-edge computer vision techniques to enhance detection performance, and the second is to improve the model’s ability to generalize across multiple scenarios. These are also the research objectives of this paper.

2.2. Guidelines for Target Detection Model Design

In the long course of deep learning development, many teams have investigated structural design principles to enhance the overall network performance. Radosavovic et al. [20] summarized various network design methods and proposed design principles, such as the low correlation between FLOPs and detection speed and the optimal performance achieved when the bottleneck is set to 1. These design principles belong to the layer-level perspective. The RegNet model, designed based on these principles, outperforms EfficientNet in terms of performance and is also five times faster on GPUs. Wang et al. [21] analyzed the DenseNet Block and introduced CSPNET, which is designed to adhere to the principle of maximizing gradient combinations. This model carefully examines module gradient paths and the role of gradient flow, providing new insights for subsequent structural designs. Its performance quickly surpassed that of the DarkNet and VGG series. Wang et al. [22] designed the ELAN structure, allowing for more effective propagation of deep network gradient information in layer-aggregated models compared to the CSP structure, with richer gradient flow. The design concept of ELAN can be seen in many SOTA models for object detection, such as ELAN-W and ELAN in YOLOv7. Wang et al. [23], by introducing the concepts of gradient sources and gradient timestamps, conducted a concrete analysis of the gradient flow process in VoVNet, CSPNet, and ELAN modules and deeply summarized the design logic of these modules. Based on the experimental results, they proposed a model gradient path design strategy to design efficient and high-quality neural networks. The gradient path design strategy is currently one of the optimal design strategies. It helps the model reduce the deterioration caused by increased network depth and ensures the reasonable distribution of gradient flow. One of the core ideas of this design strategy is to increase the richness of gradients within the module to improve parameter utilization.

3. Methods

Surface target detection often encounters complex weather conditions and scene environments. An outstanding target detection algorithm must balance detection speed, accuracy, and robustness across various scenarios, especially in scenes where object scales differ greatly, such as water surface objects. YOLOv8, a state-of-the-art detector, achieves an excellent balance between detection speed and accuracy, with the lightweight YOLOv8n excelling in lightweight design and real-time capability. In this paper, YOLOv8n is selected as the baseline network, and its overall performance is enhanced without compromising real-time capability. The entire YOLO network is restructured to achieve this objective, as depicted in Figure 1. In the backbone network, the C2F-E module designed in this study is utilized, combined with the CBAM (Convolutional Block Attention Module) attention mechanism, to enable the backbone network to retain more effective shallow-layer information while extracting deeper-layer information. The concept of a weighted bidirectional feature pyramid is introduced for the structural design of the feature fusion network, enriching feature fusion and overcoming the limitations of uneven feature scale fusion. WIOU is introduced to improve the model’s robustness to low-quality anchor boxes. The specific design details of each module are outlined as follows:

3.1. C2F-E Module and the Redesigned Backbone Network

The primary function of the backbone network is to extract features at various levels with specific attributes, and its inter-module structure design is part of the data path design strategy. The C2F module, an enhanced version of the C2 module, incorporates the ELAN structure into the CSP structure. This integration of the module’s internal structure design with the gradient path design strategy leads to improved overall efficiency. The YOLO series modifies the backbone network’s size through width coefficients, depth coefficients, and the parameter ‘n’. Although this modification provides model flexibility, it can create a complex design for individual models.
Referring to the gradient path design strategy, this paper identifies some design issues in the YOLO backbone network. Firstly, the extensive use of 1 × 1 and 3 × 3 convolutions as translation layers [24] in the YOLO backbone leads to a shortest gradient path of 13 layers. Hence, the original information must traverse at least 13 translation layers before reaching the SPPF module. This does not align with our design philosophy, as it complicates the retention of detailed original information. Among these, only five translation layers are absolutely essential, since they perform the resizing of image dimensions. A backbone built around these five essential translation layers can provide convolutional features at depths ranging from 5 to 13 layers while keeping the deepest features unchanged, which undoubtedly offers more feature map options for subsequent networks.
Secondly, in the n-scale model, the C2F module implements the ELAN concept, maintaining all input information through two cross-stage connections. As a result, the original information in the Concat module constitutes two-thirds of the total, which indicates that the parameter utilization efficiency of the module itself is relatively low. Lastly, the Conv layers in the backbone network, excluding the bottleneck, lack supervision from residual or skip connections. Therefore, if harmful gradients emerge in a Conv layer, they cannot be mitigated in subsequent propagation using the original information.
In order to mitigate these issues, this study introduces the C2F-E module, depicted in Figure 2A, to redesign the backbone. The design goal of the C2F-E module is to maximize gradient timestamps within the module, thereby providing richer gradient combinations. By reducing transition layers between modules, the backbone network gains the ability to provide more hierarchical feature maps. The C2F-E module’s design follows the gradient path concept. In contrast to the C2F module, the C2F-E module employs MaxPooling to retain the original information from the preceding module rather than the information after the Conv translation layer. This approach provides supervision for the Conv translation layer and diminishes the potential for harmful gradient formation. In addition, this design addresses the insufficiency of information in the Concat module by increasing the number of fused features from different distributions from two to three, enhancing the gradient flow within the module, as shown in Figure 2B. Compared to the C2F module, the gradient design strategy of the C2F-E is more effective.
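To make the gradient-source idea concrete, a minimal PyTorch sketch of a C2F-E-style block is given below: one branch keeps the pooled original information, one branch passes through the Conv translation layer, and a deeper branch contributes a third gradient timestamp before the Concat. The channel split, kernel sizes, and depth of the deeper branch are illustrative assumptions; the authoritative structure is the one shown in Figure 2A.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Standard Conv + BatchNorm + SiLU block used throughout YOLOv8-style networks."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class C2FE(nn.Module):
    """Sketch of a C2F-E-style block: a MaxPool branch retains the un-translated input,
    a Conv translation layer downsamples, and a deeper path adds a third gradient source
    before the three differently distributed feature sets are concatenated."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)       # keeps original information
        self.pool_proj = ConvBNSiLU(c_in, c_mid, k=1)
        self.translate = ConvBNSiLU(c_in, c_mid, k=3, s=2)       # Conv translation layer
        self.deep = nn.Sequential(ConvBNSiLU(c_mid, c_mid, k=3),
                                  ConvBNSiLU(c_mid, c_mid, k=3))
        self.fuse = ConvBNSiLU(3 * c_mid, c_out, k=1)            # Concat + 1x1 fusion

    def forward(self, x):
        kept = self.pool_proj(self.pool(x))   # branch 1: pooled original information
        trans = self.translate(x)             # branch 2: translated features
        deep = self.deep(trans)               # branch 3: deeper features, extra gradient timestamp
        return self.fuse(torch.cat((kept, trans, deep), dim=1))
```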
The reconstructed backbone, illustrated in Figure 3B, substitutes the C2F + Conv structure with the C2F-E module and incorporates the CBAM module. Its shortest gradient path is reduced to 11 from the original 13, consisting of seven 1 × 1 convolutions and four MaxPooling operations. This new backbone network design mitigates performance degradation associated with increased network depth. The integration of the CBAM [25] attention module extends the longest gradient path, facilitating deeper feature extraction and aligning with an enhanced gradient path design strategy. Relative to the baseline backbone network, the restructured backbone network offers an advanced optimization in design strategies and principles.
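For reference, the CBAM block integrated here follows Woo et al. [25]: channel attention computed from globally pooled descriptors, followed by spatial attention computed from channel-wise pooled maps. The sketch below uses the common default reduction ratio (16) and a 7 × 7 spatial kernel, which are assumptions rather than settings stated in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM sketch: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP applied to the global average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7 convolution over the concatenated channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=(2, 3), keepdim=True)
        mx = torch.amax(x, dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))                     # channel attention
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        x = x * torch.sigmoid(self.spatial(torch.cat((avg_map, max_map), 1)))   # spatial attention
        return x
```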

3.2. Multi-Scale Feature Fusion Network

In object detection algorithms, the feature pyramid structure, as introduced in [26], effectively fuses the features extracted by the backbone network. For instance, the baseline network utilizes the PAN-FPN structure derived from PA-Net [27], which merges features from bottom to top on top of the FPN network. Tan et al. [28] of the Google team introduced the weighted bidirectional feature pyramid network (BiFPN), incorporating learnable weights to signify the importance of different-level feature maps during their fusion.
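The learnable-weight fusion that BiFPN introduces can be sketched in a few lines: each fusion node holds one non-negative weight per input feature map and normalizes the weights before summing (the fast normalized fusion of [28]). This is an illustrative sketch of the borrowed idea only; the fusion layout actually used in this paper is the Concat-based structure shown in Figure 4C.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """BiFPN-style fusion node: fuse same-shaped feature maps with learnable,
    fast-normalized weights."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):
        w = F.relu(self.w)                    # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalized fusion
        return sum(wi * fi for wi, fi in zip(w, features))

# Example: fuse two feature maps of identical shape.
fuse = WeightedFusion(num_inputs=2)
out = fuse([torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)])
```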
This study combines both ideas and optimizes them; the structure is shown in Figure 4. First, based on the bidirectional feature pyramid concept and the optimal gradient path idea, the study introduces residual connections to enhance the information hierarchy fused in the Concat module. This approach allows a single module to retain as much information from different levels as possible. Simultaneously, the shortest gradient path between the backbone network and the detection head is reduced to prevent model overfitting. The integration of low-level features boosts the detection capability for small targets.
Second, the study introduces 1 × 1 convolution to reconstruct the overall channel numbers, setting the output channel numbers for the entire Neck part at 128, and for the downsampling convolution, the output channel number is set at 64.
During this process, a hypothesis is proposed: when fusing features from different levels in the Concat module, having similar channel numbers for the features from different levels can enhance the performance of the feature fusion network. This hypothesis draws on the behavior of the add operation used for feature fusion. If the two features to be fused belong to different dimensions or distributions, a linear combination should be applied to project them into the same space to ensure effective fusion. If the features belong to different distributions, directly fusing them with the add operation introduces prior weights [29], which can negatively affect the model.
Similarly, when concatenating multi-channel high-level features with fewer channel low-level features and inputting them into the Conv module, the multi-channel high-level features, having a greater total amount of information, contribute a relatively larger amount of helpful information. Since multi-channel high-level features inherently possess a high correlation, the model considers such data distributions more critical, granting them additional weight gain, as illustrated in Figure 5. This negatively affects the model’s ability to learn and fuse features from different distributions. Furthermore, since the weights of convolutional kernels are greater than zero, the side with more channels will inevitably suppress the other data distribution during normalization, reducing the effectiveness of feature fusion.
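The channel-matching idea can be illustrated with a short sketch: in the similar-channel setting, 1 × 1 convolutions project the high- and low-level maps to the same channel count before Concat, whereas the different-channel setting concatenates unequal channel counts and lets the wider feature dominate the following Conv layer. All shapes and channel numbers below are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical feature maps already resized to the same spatial resolution.
high = torch.randn(1, 256, 20, 20)   # high-level features, many channels
low = torch.randn(1, 64, 20, 20)     # low-level features, few channels

# Similar-channel fusion: project both inputs to 128 channels before Concat.
proj_high = nn.Conv2d(256, 128, kernel_size=1)
proj_low = nn.Conv2d(64, 128, kernel_size=1)
scn = torch.cat((proj_high(high), proj_low(low)), dim=1)   # shape (1, 256, 20, 20)

# Different-channel fusion: the 256-channel side supplies most of the information
# entering the subsequent Conv layer.
dcn = torch.cat((high, low), dim=1)                        # shape (1, 320, 20, 20)
```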
The study concludes with preliminary experimental validation, confirming the general accuracy of the proposed hypothesis. Detailed experimental results are available in Section 5.

3.3. WIOU Loss Function

The YOLOv8n baseline model employs the CIOU loss function [30] for computing localization loss. The expression of this loss function is as shown in Equations (1) and (2):
$$\mathrm{CIOU} = \mathrm{IOU} - \frac{\rho^{2}(B, B^{\mathrm{gt}})}{c^{2}} - \alpha\upsilon \tag{1}$$
$$L_{\mathrm{CIOU}} = 1 - \mathrm{IOU} + \frac{\rho^{2}(B, B^{\mathrm{gt}})}{c^{2}} + \alpha\upsilon \tag{2}$$
where $\rho^{2}(B, B^{\mathrm{gt}})$ is the squared Euclidean distance between the center points of the predicted box and the ground truth box, $c$ is the diagonal length of the smallest enclosing rectangle covering both the predicted and ground truth boxes, and $\alpha\upsilon$ jointly assesses aspect ratio consistency, as defined in Equations (3) and (4):
$$\alpha = \frac{\upsilon}{(1 - \mathrm{IOU}) + \upsilon} \tag{3}$$
$$\upsilon = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{\mathrm{gt}}}{h^{\mathrm{gt}}} - \arctan\frac{w}{h}\right)^{2} \tag{4}$$
In these formulas, $w^{\mathrm{gt}}$ and $h^{\mathrm{gt}}$ represent the width and height of the ground truth box, while $w$ and $h$ represent the width and height of the predicted box.
CIOU largely addresses the issue of being unable to calculate LOSS due to no overlap between the true and predicted boxes, and it also resolves the aspect ratio problem. However, it does not consider the differences between the predicted box’s width, height, and confidence in reality, hindering effective optimization of the model’s similarity. Additionally, CIOU does not account for low-quality annotated boxes, which can lead to incorrect model learning. WIOU addresses and resolves these issues effectively.
WIOU [27,31] is a loss function that employs a dynamic, non-linear focusing technique across three iterations: v1, v2, and v3, with the paper implementing the WIOUv3 variant. The fundamental principle behind WIOU involves a strategic allocation of gradient gains to evaluate the accuracy of predicted bounding boxes, thus mitigating the influence of both inferior and superior quality predictions on the overall training process. The equation for WIOUv1 is presented below as shown in Equations (5) and (6):
$$L_{\mathrm{WIOUv1}} = R_{\mathrm{WIOU}} \times L_{\mathrm{IOU}} \tag{5}$$
$$R_{\mathrm{WIOU}} = \exp\!\left(\frac{(x - x^{\mathrm{gt}})^{2} + (y - y^{\mathrm{gt}})^{2}}{W_{g}^{2} + H_{g}^{2}}\right) \in [1, e) \tag{6}$$
$R_{\mathrm{WIOU}}$ serves as an indicator of the degree of overlap between the predicted and annotated boxes. When this overlap is already significant, the loss function should avoid striving excessively for further improvements in it; this restraint is aimed at bolstering the model’s generalization ability across instances within the same category.
WIOUv3 defines an outlier degree β to describe the quality of the annotated box and constructs a non-monotonic focusing factor using β . A smaller outlier degree β indicates higher annotation box quality. In this case, the gradient gain r is also smaller, reducing the adjustment weight of the network for this predicted box. When the outlier degree β is large, a smaller gradient gain r is also assigned to prevent low-quality examples from generating a large amount of harmful gradient. α and δ are two hyperparameters used to help modify the gradient gain, thereby adjusting the convergence speed and target of the localization loss function. The formula for WIOUv3 is as follows, as shown in Equations (7) and (8):
$$\beta = \frac{L_{\mathrm{IOU}}^{*}}{\overline{L}_{\mathrm{IOU}}} \in [0, +\infty) \tag{7}$$
$$L_{\mathrm{WIOUv3}} = r\,L_{\mathrm{WIOUv1}}, \qquad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}} \tag{8}$$
where $L_{\mathrm{IOU}}^{*}$ is the current IOU loss detached from the computation graph and $\overline{L}_{\mathrm{IOU}}$ is its running mean over training.
WIOU overcomes the constraints of CIOU by taking into account the real disparities between two bounding boxes. Additionally, WIOU implements a non-linear focusing strategy, significantly boosting the model’s ability to generalize across similar categories. It dynamically adjusts the loss weighting for smaller objects, thus enhancing the model’s capability to detect such objects. Consequently, WIOU is chosen as the localization loss function for this model and is incorporated into three detection heads.
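A compact sketch of how a WIOUv3-style localization loss can be computed from Equations (5)–(8) is given below. The hyperparameter values and the use of a running mean as the normalizing IOU loss follow the Wise-IoU formulation [31] and are assumptions, not values reported in this paper.

```python
import torch

def wiou_v3_loss(iou, pred_xy, gt_xy, wg, hg, iou_loss_mean, alpha=1.9, delta=3.0):
    """Sketch of a WIOUv3-style loss.
    iou:           IoU between predicted and ground-truth boxes, shape (N,)
    pred_xy/gt_xy: box center coordinates, shape (N, 2)
    wg, hg:        width/height of the smallest enclosing box, shape (N,)
    iou_loss_mean: running mean of the IoU loss maintained during training
    """
    l_iou = 1.0 - iou                                          # base IoU loss
    dist2 = ((pred_xy - gt_xy) ** 2).sum(dim=-1)               # squared center distance
    r_wiou = torch.exp(dist2 / (wg ** 2 + hg ** 2).detach())   # Eq. (6)
    l_wiou_v1 = r_wiou * l_iou                                 # Eq. (5)
    beta = l_iou.detach() / iou_loss_mean                      # Eq. (7), outlier degree
    r = beta / (delta * alpha ** (beta - delta))               # Eq. (8), non-monotonic gain
    return (r * l_wiou_v1).mean()
```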

4. Dataset and Training Strategy

4.1. Dataset

This study utilized the WSODD dataset for training and testing. The dataset was captured by Hikvision industrial cameras and comprises 7467 images with a total of 21,911 examples. The dataset encompasses 14 distinct categories of images, such as ships, boats, balls, bridges, rocks, humans, trash, masts, buoys, platforms, harbors, trees, grass, and wildlife. Ships constitute more than 30% of the dataset, which is accessible on GitHub. The images were split between a training set and a validation set following a 72:28 ratio, with 5402 images allocated for training purposes and 2065 for validation. An assortment of sample images from the dataset is showcased in Figure 6A.
The dataset was captured under different lighting conditions, including noon (strong light), dusk (dim light), and evening (very dim light) at various locations. The data distribution is diverse, and by observing the dataset distribution in Figure 6C, it is evident that small target objects constitute a significant proportion of the dataset. Instance count statistics in Figure 6B reveal substantial variations among different categories, posing challenges for the model in terms of learning discrepancies across categories. Overall, the dataset presents a considerable challenge with a wide coverage of sea surface target categories and sizes, making it highly practical.

4.2. Experiment Environment and Parameters

For the sake of uniformity and the ability to replicate the findings, this research utilized the Kaggle platform for all experimental work. The computational work, including model training and evaluation, was carried out on a Tesla P100-PCIE-16GB GPU. The experiments were conducted using specific software versions: Python version 3.10.12, Pytorch version 2.0.0, and Ultralytics YOLO version 8.1.6.
The configuration of model hyperparameters was as follows: the learning rate was initiated at 0.01, momentum was set to 0.9, weight decay was determined to be 0.0005, and the batch size was established at 16. The IOU threshold was fixed at 0.7. The SGD (Stochastic Gradient Descent) optimization algorithm was employed for training, along with a predetermined random seed to guide the initialization process. Default settings were maintained for any hyperparameters not specifically mentioned. The dimension of input images was standardized to 512 × 512 pixels, and the training extended over 150 epochs.
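This configuration maps directly onto the Ultralytics training API (version 8.1.6 was used here). The sketch below shows one way to express it; the dataset YAML path, model config, and seed value are placeholders rather than the exact files used in this study.

```python
from ultralytics import YOLO

# Train a YOLOv8n-style model from scratch with the hyperparameters listed above.
model = YOLO("yolov8n.yaml")          # build from config, i.e., no pre-trained weights
model.train(
    data="wsodd.yaml",                # hypothetical dataset config with the 14 WSODD classes
    imgsz=512,
    epochs=150,
    batch=16,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.9,
    weight_decay=0.0005,
    iou=0.7,                          # IOU threshold
    seed=0,                           # fixed random seed (actual value not stated)
)
```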

4.3. Model Evaluation Metrics

In this segment, we utilize a range of performance metrics to gauge the effectiveness of diverse object detection models. These metrics include Mean Average Precision (mAP), Precision (P), Recall (R), Inference Speed, and Model Size. Mean Average Precision (mAP) serves as a comprehensive indicator of an object detection algorithm’s performance. Precision (P) assesses the algorithm’s accuracy in correctly classifying objects. Recall (R) evaluates the algorithm’s proficiency in identifying relevant objects. Inference Speed: in this study, the inference speed was assessed using the “Speed” parameter provided by the Ultralytics framework in conjunction with the FPS (Frames Per Second) parameter. Speed denotes the time required for the model to infer one image under the Ultralytics framework with a batch size of 16. The FPS parameter indicates the number of images inferred per second with the batch size set to 1 (the FPS testing was conducted on an additional RTX 3060 graphics card). The former facilitates a more intuitive comparison of model speeds within the same framework, while the latter offers a more general approach to calculating model inference speed. The formulas for these evaluation metrics are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i}$$
In the equations used, TP denotes the count of true positive samples correctly identified, FP stands for the false positive samples that were incorrectly identified, and FN indicates the missed positive samples. AP represents the area under the curve in the Precision–Recall (P-R) graph, with Recall (R) plotted along the x-axis and Precision (P) on the y-axis. This area reflects the space bounded by the curve and both axes. A greater AP value suggests a higher level of recall (or precision) for a given level of precision (or recall). The mAP metric averages the AP values for all categories, offering a holistic measure of the model’s overall effectiveness.
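A short NumPy sketch of the AP computation described above (area under the P-R curve, then averaged over classes to obtain mAP) is shown below; it assumes the precision and recall arrays have already been accumulated from confidence-sorted detections of a single class.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve using all-points interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # enforce a non-increasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is the mean of the per-class AP values (toy numbers for illustration).
ap_per_class = [average_precision(np.array([0.2, 0.5, 0.8]), np.array([1.0, 0.8, 0.6]))]
mAP = float(np.mean(ap_per_class))
```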

5. Experiment

5.1. Comparison of Lightweight Object Detection Algorithms

To highlight the superiority of the proposed algorithm, we selected six mainstream lightweight algorithms for comparison at the same level: YOLOv8n, YOLOv8s, YOLOv5n, YOLOv5s, YOLOv7-tiny, and EfficientDet-D0. The first five models are among the most commonly applied in industrial systems, while the last is the native network of BiFPN. During training, no pre-trained weights were used for the YOLO series. Due to the excessively long training time of EfficientDet, this study adopts a transfer learning approach along with fine-tuning to achieve the best results. Table 1 presents the complete comparative experimental data.
The experimental results indicate that, compared to the baseline model, our model achieved a significant improvement in mAP (mean Average Precision) and recall with only a slight increase in parameter count and inference time, specifically a 4.6% and 5.4% increase, respectively. In comparison with YOLOv5s, YOLOv7-tiny, and YOLOv8s, our model attained higher mAP0.5 and recall values using roughly one-third of their parameters while substantially reducing the image inference time. Figure 7 illustrates the comparison of model performance under different metrics. It is evident that YOLO-WSD is the most comprehensive among these algorithms. Although YOLOv5n, YOLOv8n, and YOLOv7-tiny exhibit excellent real-time performance and parameter efficiency, they perform poorly on the various accuracy metrics. On the other hand, YOLOv5s and YOLOv8s achieve higher accuracy but are significantly larger and have lower FPS. YOLO-WSD achieves nearly optimal results in terms of inference speed, model performance, and parameter count, and it does not introduce complex modules, maintaining its versatility for deployment. Therefore, this paper suggests that YOLO-WSD is a more suitable object detection algorithm for unmanned ship vision systems.
Comparing the category-wise Average Precision (AP) values between Yolov8n and YOLO-WSD in Figure 8 reveals that the improved model has achieved significant enhancements across almost all categories. Notably, there is an increase of over 20% in the ‘person’ category, providing substantial assistance in surface target detection. The considerable differences in the number of labels for each category reflect the model’s outstanding robustness and generalization ability. It demonstrates the model’s capacity to adapt to varying quantities of category data and make accurate predictions of target objects. This also validates the correctness of our approach in preserving original information and enhancing generalization by reducing the shortest gradient path.
To provide a more intuitive comparison between YOLO-WSD and YOLOv8n, this article selects various images for object detection under different weather and environmental conditions. Each set of comparison images represents different sea surface characteristics, as illustrated in Figure 9. Under normal weather and varying environmental conditions, YOLO-WSD demonstrates significantly better recall and precision than the baseline. It is noteworthy that, in the near-sea comparison images, YOLO-WSD, relying on its excellent category generalization, correctly detects person and harbor instances even in cases where the dataset has missing annotations.
In dense target detection scenarios, such as docks where boundaries are difficult to distinguish, YOLO-WSD remains consistent with the ground truth, while YOLOv8n exhibits a significant number of omissions. When facing different weather conditions, both models perform similarly, failing to detect targets in strong sunlight but performing well in heavy fog. For monochromatic and small target objects, both algorithms successfully achieve recall, but YOLO-WSD exhibits much higher classification accuracy. In summary, YOLO-WSD outperforms YOLOv8n in most scenarios, showing superior detection performance.

5.2. Comparative Experiment of the Neck Section

This study conducted verification experiments for the feature fusion hypothesis of Section 3.2. The validation approach involves selecting a feature fusion network utilizing multiple Concat modules. Within the network, all Concat modules are set to merge high-level feature channels with low-level feature channels at ratios of 1:1 and 2:1, respectively, and the overall performance of the model is then observed. The specific experimental procedure is as follows: firstly, based on the difference in channel numbers, the feature fusion network was divided into a DCN group (Different Channel Neck) and an SCN group (Similar Channel Neck). To ensure the fairness of the experiment, the backbone used in both groups was the original backbone of YOLOv8n, and both employed 1 × 1 convolutions to adjust the number of channels after feature extraction in the backbone. In the DCN group, the feature dimensions for the upper, middle, and lower levels were 256, 128, and 64, respectively, while in the SCN group, they were all set to 128. Slight adjustments were made to the above channel numbers to mitigate the impact of differences in parameter volume; these adjustments did not compromise the overall experimental principle. Comparative experiments were conducted using PAN-FPN and BiFPN-YOLO, each paired with CIOU and WIOU.
The comparative data analysis from Table 2 and Table 3 reveals consistent experimental results. Under the conditions of similar model sizes and comparable parameter volumes in the Neck section, the SCN feature fusion networks exhibit performance that is not inferior to the DCN ones. In fact, in the Neck section of this model, the mAP values of the SCN fusion approach significantly surpass those of the DCN fusion method. This aligns with the hypothesis in Section 3.2, emphasizing that the Concat module should ensure that input channels from different dimensions have approximately the same number, to avoid negative weight gains arising from differences in information content across dimensions. This design principle can enhance network performance without any loss.
In Section 3, the model’s backbone was restructured, the feature fusion network was enhanced, and the WIOU loss function was introduced. This study utilized a controlled variable method for separate experimental analysis of each module to ascertain the individual effectiveness of each component. Experiment one was the baseline model, while experiment two integrated the WIOU loss function. Experiment three implemented BiFPN-YOLO, and experiment four applied the C2F-E-reconstructed backbone network. Experiment five combined WIOU with BiFPN-YOLO, and experiment six was the full YOLO-WSD model.
Table 4 lists experimental data demonstrating significant improvements due to these three modules. For instance, the backbone network achieved a 0.7% mAP increase, a 6% reduction in parameters, and a 10% boost in inference speed. Despite rising parameter count and computational complexity, incorporating the CBAM mechanism into the original backbone network resulted in only a 0.3% mAP enhancement. These results underscore the efficacy of the backbone network’s design strategy and the superiority of C2F-E. BiFPN-YOLO led to a 3.6% mAP increase, albeit with a marginal decrease in inference speed, attributable to the augmented feature fusions and parameters. The introduction of WIOU accounted for a 0.4% mAP improvement.
Ultimately, YOLO-WSD significantly enhances performance while keeping the parameters and inference speed on par with YOLOv8n. Thus, the study concludes that the proposed model is superior.
Remarkably, the mAP improvements in experiments M5 and M6 are nearly equivalent to the cumulative enhancements of experiments M2, M3, and M4, suggesting a linear synergy of the enhancements from each module. The research infers that this phenomenon is due to the distinct attributes of the features extracted by the Backbone, Neck, and Detect sections, with each phase imparting advanced attributes that benefit the subsequent layer.

5.3. Cross-Dataset Validation

The significant improvement of YOLO-WSD on the WSODD dataset illustrates the effectiveness of our improvements. Cross-dataset validation was conducted using the Seaship7000 dataset to validate the detection performance of YOLO-WSD for marine targets across different scenarios. Figure 10 shows that the Seaships dataset has instance categories and distributions vastly different from the WSODD dataset, which simulates well the real-world target detection scenarios faced by YOLO-WSD in different maritime areas. In addition, to assess the effectiveness of the improved modules in this study, the influence of the loss function was removed, and CIOU, consistent with the baseline model, was selected for validation.
Table 5 reveals that YOLO-WSD significantly improves over YOLOv8n in various data environments. Specifically, the mAP0.5:0.95 increased by 6.3%, and the recall by 8.5%. For unmanned boats navigating on the sea surface, the substantial increase in recall suggests that these boats can avoid more potential collisions, securing a better navigation path. Furthermore, YOLO-WSD approaches YOLOv8s in terms of mAP0.5 while having significantly fewer parameters and faster inference speed compared to YOLOv8s. This dataset effectively verifies the excellent generalization ability of YOLO-WSD. Thus, the structural design of YOLO-WSD is considered effective and successful.

6. Conclusions

Deploying a high-precision and real-time object detection algorithm in a challenging and dynamic sea environment with limited resources poses a significant challenge. This study proposes a lightweight object detection algorithm, YOLO-WSD, based on YOLOv8n, to address this issue. Firstly, based on the gradient path principle, we designed the C2F-E module, which reduces the number of parameters while improving detection accuracy. Secondly, we redesigned the feature fusion network, significantly enhancing the efficiency and information content of feature fusion. Thirdly, we introduced the WIOU loss function to accelerate the localization of prediction boxes. Finally, we validated the effectiveness of the proposed algorithm through experiments. The experimental results show that YOLO-WSD achieves mAP values of 75.3% and 77.6% on the WSODD and Seaship datasets, respectively, which represents improvements of 4.6% and 3.4% compared to the original model. The model size is 3.3 M, and the FPS is 76.9. We also conducted experimental comparisons with other state-of-the-art algorithms, and the results demonstrate that YOLO-WSD achieves a better balance between detection performance and computational efficiency.
In addition, the study proposes a hypothesis concerning the distribution of information in the Concat module and provides preliminary experimental verification, fostering improvements in the Concat module’s usage strategy and the design optimization of the feature fusion network.

Author Contributions

C.L.: Conceptualization, methodology, writing—original draft. L.W.: Data curation, software, visualization, writing—review and editing. Y.L.: Supervision, validation, writing—review and editing. S.Z.: Supervision, validation, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Access to the experimental data presented in this article can be obtained by contacting the corresponding author.

Acknowledgments

We acknowledge the contributions of the works referenced in this field. We are grateful for the valuable resources offered by the WSODD dataset and the Seaship dataset. Additionally, we appreciate the journal’s provision of valuable and accessible resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qi, Q.; Li, K.; Zheng, H.; Gao, X.; Hou, G.; Sun, K. SGUIE-Net: Semantic attention guided underwater image enhancement with multi-scale perception. IEEE Trans. Image Process. 2022, 31, 6816–6830. [Google Scholar] [CrossRef] [PubMed]
  2. Liu, Z.; Zhang, Y.; Yu, X.; Yuan, C. Unmanned surface vehicles: An overview of developments and challenges. Annu. Rev. Control. 2016, 41, 71–93. [Google Scholar] [CrossRef]
  3. Specht, C.; Świtalski, E.; Specht, M. Application of an autonomous/unmanned survey vessel (ASV/USV) in bathymetric measurements. Polish Marit. Res. 2017, 24, 36–44. [Google Scholar] [CrossRef]
  4. Tanakitkorn, K. A review of unmanned surface vehicle development. Marit. Technol. Res. 2019, 1, 2–8. [Google Scholar] [CrossRef]
  5. Campbell, S.; Naeem, W.; Irwin, G.W. A review on improving the autonomy of unmanned surface vehicles through intelligent collision avoidance manoeuvres. Annu. Rev. Control. 2012, 36, 267–283. [Google Scholar] [CrossRef]
  6. Fefilatyev, S.; Goldgof, D.; Shreve, M.; Lembke, C. Detection and tracking of ships in open sea with rapidly moving buoy-mounted camera system. Ocean Eng. 2012, 54, 1–12. [Google Scholar] [CrossRef]
  7. Liu, P.; Wang, G.; Qi, H.; Zhang, C.; Zheng, H.; Yu, Z. Underwater image enhancement with a deep residual framework. IEEE Access 2019, 7, 94614–94629. [Google Scholar] [CrossRef]
  8. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  9. Jiang, Z.; Wang, R. Underwater Object Detection Based on Improved Single Shot Multibox Detector. In Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence, Sanya, China, 24–26 December 2021; pp. 1–7. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  13. Moosbauer, S.; König, D.; Jäkel, J.; Teutsch, M. A Benchmark for Deep Learning Based Object Detection in Maritime Environments. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 916–925. [Google Scholar]
  14. Shao, Z.; Wu, W.; Wang, Z.; Du, W.; Li, C. Seaships: A large-scale precisely annotated dataset for ship detection. IEEE Trans. Multimed. 2018, 20, 2593–2604. [Google Scholar] [CrossRef]
  15. Zhou, Z.; Sun, J.; Yu, J.; Liu, K.; Duan, J.; Chen, L.; Chen, C.L.P. An image-based benchmark dataset and a novel object detector for water surface object detection. Front. Neurorobot. 2021, 15, 723336. [Google Scholar] [CrossRef] [PubMed]
  16. Zou, Y.; Zhao, L.; Qin, S.; Pan, M.; Li, Z. Ship Target Detection and Identification based on SSD_MobilenetV2. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1676–1680. [Google Scholar]
  17. Liu, C.; Li, J. Self-correction ship tracking and counting with variable time window based on YOLOv3. Complexity 2021, 2021, 2889115. [Google Scholar] [CrossRef]
  18. Han, X.; Zhao, L.; Ning, Y.; Hu, J. ShipYOLO: An enhanced model for ship detection. J. Adv. Transp. 2021, 2021, 1060182. [Google Scholar] [CrossRef]
  19. Zhang, J.; Jin, J.; Ma, Y.; Ren, P. Lightweight object detection algorithm based on YOLOv5 for unmanned surface vehicles. Front. Mar. Sci. 2023, 9, 1058401. [Google Scholar] [CrossRef]
  20. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing Network Design Spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436. [Google Scholar]
  21. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone that Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 13–19 June 2020; pp. 390–391. [Google Scholar]
  22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  23. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H. Designing network design strategies through gradient path analysis. J. Inf. Sci. Eng. 2023, 39, 975–995. [Google Scholar] [CrossRef]
  24. Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Darrell, T.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869. [Google Scholar] [CrossRef]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  26. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  27. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-shot Image Semantic Segmentation with Prototype Alignment. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9196–9205. [Google Scholar]
  28. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  30. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  31. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Figure 1. YOLO-WSD Network Architecture and Components.
Figure 2. A comparative diagram between the C2F-E and C2F structures. (A) The structure of the C2F-E module; (B) comparison of the Gradient Flow between C2F-E and C2F.
Figure 3. A comparison between backbone networks. (A) Original backbone network of YOLOv8n; (B) custom-designed backbone network of YOLO-WSD proposed in this paper.
Figure 4. Different feature fusion network structures: (A) model proposed in the PA-Net paper; (B) model proposed in the EfficientDet paper; (C) structure used in this paper.
Figure 5. Concat + Conv feature fusion process: (A) fusion of high- and low-level features with the same number of channels; (B) fusion of high- and low-level features with different numbers of channels.
Figure 6. WSODD Dataset: (A) example images from the dataset; (B) instance distribution; (C) instance size distribution.
Figure 7. Comparison of Model Performance Under Different Metrics.
Figure 8. Comparison of the category-wise average precision (AP) values between YOLO-WSD and YOLOv8n: (A) represents YOLOv8n and (B) represents YOLO-WSD.
Figure 9. Comparison of the detection results between YOLO-WSD and YOLOv8n in typical weather and environmental conditions.
Figure 10. Seaship dataset: (A) instance distribution; (B) instance size distribution.
Table 1. Comparison between the detection performance of YOLOv5n, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv8s, Efficientdet-D0, and YOLO-WSD on the WSODD dataset.

| Methods | mAP0.5 (%) | mAP0.5:0.95 (%) | P (%) ↑ | R (%) ↑ | Parameters (M) | Speed (ms) | FPS |
|---|---|---|---|---|---|---|---|
| Efficientdet-D0 | 48.5 | 25.5 | 74.0 | 32.5 | 3.9 | - | 35.6 |
| YOLOv5n | 67.2 | 36.1 | 74.0 | 61.5 | 2.5 | 1.7 | 78.1 |
| YOLOv5s | 74.7 | 42.1 | 75.6 | 70.8 | 9.1 | 2.5 | 76.9 |
| YOLOv7-tiny | 65.9 | 33.3 | 67.9 | 63.2 | 6.2 | - | 93.4 |
| YOLOv8s | 75.0 | 42.7 | 83.3 | 68.2 | 11.1 | 2.7 | 74.6 |
| YOLOv8n (baseline) | 70.7 | 39.2 | 76.2 | 65.5 | 3.15 | 1.9 | 83.3 |
| YOLO-WSD | 75.3 | 41.3 | 76.3 | 70.9 | 3.3 | 2.1 | 76.9 |
↑/↓ In the indicator, the arrows mean that the higher/lower the value, the better.
Table 2. Detection performance of different channel necks.

| Methods | mAP0.5 (%) | mAP0.5:0.95 (%) | P (%) | R (%) | Parameters (M) | Speed (ms) |
|---|---|---|---|---|---|---|
| PAN-FPN | 70.7 | 39.0 | 76.2 | 65.5 | 3.2 | 1.9 |
| PAN-FPN + WIOU | 71.1 | 39.5 | 74.0 | 66.9 | 3.2 | 1.9 |
| BiFPN-YOLO | 70.7 | 38.9 | 75.8 | 64.9 | 3.7 | 2.0 |
| BiFPN-YOLO + WIOU | 73.2 | 41.4 | 76.7 | 68.1 | 3.7 | 2.1 |
Table 3. Detection performance of similar channel necks.

| Methods | mAP0.5 (%) | mAP0.5:0.95 (%) | P (%) | R (%) | Parameters (M) | Speed (ms) |
|---|---|---|---|---|---|---|
| PAN-FPN | 70.8 ↑ | 39.0 ↑ | 74.2 | 66.0 | 3.2 | 2.0 |
| PAN-FPN + WIOU | 73.0 ↑ | 40.4 ↑ | 77.6 | 67.6 | 3.2 | 2.0 |
| BiFPN-YOLO | 74.3 ↑ | 40.8 ↑ | 73.5 | 69.9 | 3.4 | 2.1 |
| BiFPN-YOLO + WIOU | 74.6 ↑ | 41.3 ↓ | 80.0 | 68.6 | 3.4 | 2.1 |
↑/↓ represent if the model’s performance has improved/declined compared to Table 2.
Table 4. Results of the ablation experiment.

| Methods | C2F-E | BiFPN-YOLO | WIOU | mAP0.5 (%) | mAP0.5:0.95 (%) | Parameters (M) | Speed (ms) |
|---|---|---|---|---|---|---|---|
| M1 |  |  |  | 70.7 | 39.0 | 3.2 | 1.9 |
| M2 |  |  | ✓ | 71.1 | 39.5 | 3.2 | 1.9 |
| M3 |  | ✓ |  | 74.3 | 40.8 | 3.4 | 2.1 |
| M4 | ✓ |  |  | 71.4 | 39.4 | 3.0 | 1.8 |
| M5 |  | ✓ | ✓ | 74.6 | 41.3 | 3.4 | 2.0 |
| M6 | ✓ | ✓ | ✓ | 75.3 | 41.3 | 3.3 | 2.1 |
Table 5. Comparison of the detection performance for YOLOv8n, YOLOv8s, and YOLO-WSD based on the Seaship dataset.

| Methods | mAP0.5 (%) | mAP0.5:0.95 (%) | P (%) | R (%) | Parameters (M) | Speed (ms) |
|---|---|---|---|---|---|---|
| YOLOv8n (baseline) | 74.2 | 44.1 | 77.3 | 67.3 | 3.15 | 1.9 |
| YOLOv8s | 78.3 | 51.9 | 79.3 | 68.7 | 11.1 | 2.8 |
| YOLO-WSD (CIOU) | 77.6 | 50.4 | 75.5 | 75.8 | 3.3 | 2.1 |
