1. Introduction
Compared to the maritime environment, the navigation conditions for inland ships are more complex, with higher ship density and more obstacles. In addition, there are numerous small objects, which affect navigation safety [1]. At present, ships commonly use radar and AIS devices to perceive their navigation environment. However, radar has blind spots, small objects produce poor radar reflections, and radar echoes are easily affected by weather or obstructed, resulting in decreased detection effectiveness. AIS provides higher navigation accuracy than radar, but it is a passive sensor, and safety issues arise when some ships turn off their AIS devices [2]. Therefore, relying solely on radar, AIS, or visual observation by ship operators does not provide a comprehensive understanding of the navigation environment. In recent years, object-detection technology based on visible light images has made significant progress and has gradually been applied to the intelligent perception of ships’ navigation environments [3]. Compared to radar and remote-sensing images, visible light images have advantages such as high resolution, rich color information, and clear textures, providing a more intuitive observation of the navigation environment [4]. Additionally, visible light image-acquisition devices are relatively affordable, stable, and easy to install, and can serve as a complement to radar and AIS devices, enhancing navigation safety [5].
Traditional object-detection algorithms are divided into feature-based and segmentation-based methods. By 2010, the performance of traditional object-detection algorithms had reached a plateau. In 2012, AlexNet, a convolutional neural network built by Hinton and his research team, significantly outperformed SVM-based methods and won the ImageNet competition; object-detection algorithms based on deep learning gradually became the main research direction in the object-detection field [6]. Researchers began to use large-scale convolutional neural networks with deeper architectures and more parameters to solve complex object-detection tasks. However, as the number of parameters and the model size increase, so do the performance requirements for the corresponding devices. Therefore, there is a need to develop a lightweight object-detection model that can be effectively deployed on ships to achieve intelligent perception.
The quality of a dataset has a very important impact on the performance of a model, so researchers have produced and published several water-surface object datasets. In 2018, Shao et al. [7] published SeaShips, a public ship dataset covering six common ship types, designed for training and evaluating ship object-detection algorithms. In 2019, Moosbauer et al. [8] published the SMD (Singapore Maritime Dataset) and performed experiments based on Faster R-CNN [9] and Mask R-CNN [10], respectively, to verify the applicability of the dataset in the maritime field. In 2020, Shin et al. [11] proposed a data-expansion method to automatically enlarge the dataset; this method uses segmentation techniques to extract foreground objects from the SMD and combines them with new maritime backgrounds to obtain high-quality data. In 2021, Zhou et al. [12] published a water-surface object dataset named WSODD (Water Surface Object Detection Dataset); the scenes in the dataset include oceans, lakes, and rivers, under varying weather conditions such as sunny, foggy, and cloudy. At the same time, they proposed a new object detector named CRB-Net, which achieves better detection accuracy for water-surface objects. Cheng et al. [13] published the first floating-garbage dataset, named FloW, and performed baseline experiments based on vision and radar, though with low accuracy.
These datasets provide a basis for training water-surface object-detection algorithms. However, SeaShips contains only ship objects, with few object categories; the SMD is provided in video form, which is not suitable for the model training in this paper; and FloW is a floating-garbage dataset, also with few object categories. WSODD contains 14 kinds of objects in addition to ships, captured under different weather conditions, and covers almost all the objects encountered during navigation. Considering that the model developed in this paper is intended to detect various objects during navigation to improve navigation safety, the model was trained on WSODD.
This study aims to address the issues of high ship and obstacle density on the water surface, as well as the predominance of small objects, in the navigation of inland ships. For this purpose, a lightweight object-detection algorithm named LWS-YOLOv7, based on deep learning, was developed, using YOLOv7 as the baseline model and the water-surface object dataset WSODD for training. According to the characteristics of the water surface, small objects, and multi-scale objects encountered during navigation, several targeted improvements were made, which balance detection accuracy, inference speed, and computational cost. The main contributions of this research are as follows:
The normalized Wasserstein distance is combined with the CIoU loss to propose the w-CIoU loss, which replaces the CIoU loss of the baseline model and improves the model’s recall rate and accuracy for small objects.
A new receptive field amplification module named GSPPCSPC is proposed to reduce the model’s parameters and enhance the multi-scale object detection capability of the model.
A P2 feature fusion layer is added to reduce the feature loss of small objects and improve the model’s detection performance for small objects.
The model is pruned, which significantly reduces the parameters and computational costs without sacrificing the overall detection accuracy and speed of the model, facilitating the deployment of the model on shipborne devices.
2. Related Work
Based on the detection process, deep-learning-based object-detection algorithms can be divided into one-stage and two-stage methods. The use of two-stage object-detection algorithms has been studied previously [14]. With these methods, the first stage focuses on obtaining proposal boxes for the object locations; in the second stage, these proposal boxes are classified and the object positions are accurately determined. Compared to one-stage algorithms, these algorithms achieve higher detection accuracy at the expense of detection speed. Representative algorithms in this category include the R-CNN series. R-CNN [15] was the first application of deep learning in the field of object detection. It utilizes a CNN to extract features from generated candidate regions and then performs classification and bounding-box regression, significantly improving the performance of object-detection algorithms. Fast R-CNN [16] combines the SS (selective search) algorithm with R-CNN, improving both the detection speed and accuracy. Faster R-CNN replaces the SS algorithm with the region proposal network (RPN) to obtain feature maps for regions of interest (ROI), further improving the detection accuracy of the algorithm. One-stage algorithms skip the step of generating proposal boxes and directly provide the class probabilities and coordinates of objects in the image [17,18,19]. The advantage of one-stage algorithms is their faster detection speed, which meets the requirements of real-time detection, although at the cost of some detection accuracy. Representative algorithms in this category include the YOLO series and SSD [20]. In order to achieve real-time object detection, Redmon et al. proposed YOLOv1 [21] in 2016, which abandoned the generation of candidate regions and significantly improved the detection speed. Through several iterations, the detection accuracy of the algorithm has also been greatly improved, and it has been widely applied in various domains. The SSD algorithm draws inspiration from the idea of anchors in Faster R-CNN and sets prior boxes of different scales, giving it good multi-scale object-detection capabilities; however, its detection performance for small objects is relatively poor.
One-stage and two-stage object-detection algorithms each have their own advantages and disadvantages; the former are more suitable for real-time detection scenes, while the latter are more suitable for non-real-time detection scenes that demand higher detection accuracy. In the context of ship navigation environment perception, real-time performance is highly demanded; therefore, scholars mostly focus on the study of one-stage object-detection algorithms. In 2021, Du et al. [22] developed a navigation-mark-recognition algorithm based on YOLOv3, which provides accurate navigation-mark information for ships. The data-augmentation methods proposed in that article significantly enhance the algorithm’s performance, and subsequent versions of the YOLO algorithm also incorporate data-augmentation components. Chen et al. [23] improved the YOLOv3 algorithm by modifying the model’s backbone and pruning the model. They proposed a lightweight ship detector to solve the deployment problem of real-time ship detection using synthetic aperture radar. In addition, knowledge distillation was employed to compensate for the performance degradation caused by network pruning. In 2023, Gao et al. [24] proposed a lightweight infrared image-detection algorithm for small ships. They preprocessed the data using gamma transformation to make ship objects more salient and then replaced the YOLOv5 model’s backbone with the MobileNetV3 network to achieve a more lightweight model that can be easily deployed on shipborne devices. By improving the backbone network, this algorithm dramatically reduces the model’s parameters while maintaining excellent detection performance for infrared ship images.
Although deep-learning-based object-detection algorithms can provide good detection performance, various problems remain in water-surface object detection. In 2020, Zou [25] researched lightweight ship-detection algorithms, improving the SSD algorithm with the MobileNetV2 network and combining it with an improved Faster R-CNN to enhance detection efficiency. However, this method performs poorly in detecting small objects and occasionally misses them, and its large number of model parameters and slow detection speed made real-time detection impossible. In 2021, Han et al. [26] optimized YOLOv4 and proposed the ShipYOLO model. The model improved the backbone network by incorporating a parameter-reconstruction technique, enhancing both the model’s accuracy and inference speed. Secondly, a new amplified-receptive-field module named DSPP was designed, which improved the model’s ability to capture the spatial information of small-scale ships and its robustness to ship displacement. Finally, a new feature-pyramid structure, AtFPN, was proposed, which improved the detection performance for objects of different scales and reduced the model’s parameters. Although this model improves performance and reduces the parameters, it still requires high computational power, making it unsuitable for real-time detection on shipborne devices. Yang et al. [27] designed ship-classification experiments and verified that convolutional neural networks significantly outperform traditional machine-learning methods in this task; however, their model has a large number of parameters, resulting in slow detection speed, which makes real-time detection unachievable. Yu et al. [28] proposed an enhanced YOLOv7 for the rapid detection of small objects on the water surface. Firstly, a detection branch was designed to address difficulties in recognizing targets of different sizes. Secondly, an LVC module was added to give more attention to small objects. Additionally, an L-ELAN module was integrated to enhance computational efficiency. Lastly, Wise-IoU was used to optimize the loss function. However, this method only improves the model’s modules and, compared to model-pruning techniques, achieves a less effective model simplification.
Although researchers have conducted extensive research on ship detection based on deep learning, the navigation environment of inland waters is complex, and two main detection problems remain. Firstly, there are numerous multi-scale objects, including a large number of small objects, which can affect navigation safety; it is therefore necessary to design an algorithm that improves ships’ multi-scale object-detection ability and solves the problem of false and missed detections of small objects. Secondly, shipborne devices have limited computing power, requiring a reduction in algorithm complexity and computational demand. Existing research has not fully addressed these two issues. Therefore, the authors of this paper improved YOLOv7 [29] to enhance the model’s detection ability for small objects, and used model pruning to simplify the network and reduce its computational requirements, while ensuring the model’s accuracy and real-time detection capability and reducing its parameters for deployment on shipborne devices.
3. Methodology
Compared to YOLOv5, YOLOv7 has four main modifications: the introduction of model reparameterization; the improvement of the training method of auxiliary heads; the proposal of a more efficient network architecture, ELAN; and the combination of YOLOv5’s cross grid search and YOLOX’s matching strategy for label allocation. With the above improvements, YOLOv7 outperformed all previous versions on official datasets, and its characteristics of high-precision detection and fast inference make it particularly suitable for real-time navigation environment detection tasks.
The structure of YOLOv7 is shown in Figure 1; it is divided into four parts, namely the input, backbone, neck, and output. Firstly, in the input section, the model performs preprocessing operations, such as data augmentation, on the dataset images and feeds them into the backbone network. Secondly, the backbone network extracts features from the objects, and the extracted features are fused in the neck network to obtain features of three sizes: large, medium, and small. Finally, the fused features are passed to the detection head, which generates the results.
Unlike previous versions of YOLO, the backbone network of YOLOv7 is mainly composed of the ELAN module, the MP module, and the SPPCSPC module. The ELAN module has two branches of different lengths. By controlling the shortest and longest gradient paths, it enables the network to learn more features while being more efficient and robust. The SPPCSPC module builds upon the structure of SPP (spatial pyramid pooling) and incorporates the concept of the CSP module. It uses max pooling to obtain features with different receptive fields, allowing for better adaptation to detection tasks with varying resolutions and improving the inference speed. The MP module combines convolutional networks with max pooling to achieve downsampling while minimizing feature loss. The neck of YOLOv7 adopts the same PAFPN structure as YOLOv5, and the output part introduces the RepConv module, which further improves the model’s inference speed.
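For illustration, the following is a minimal PyTorch sketch of the MP downsampling idea described above; the Conv-BN-SiLU composition and channel widths are simplified assumptions rather than the exact YOLOv7 configuration.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, s=1):
    # Standard Conv-BN-SiLU block used throughout YOLOv7-style networks.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )

class MP(nn.Module):
    """Downsampling block: a max-pooling branch and a strided-convolution
    branch are concatenated, halving resolution while limiting feature loss."""
    def __init__(self, c_in):
        super().__init__()
        c_half = c_in // 2
        self.branch1 = nn.Sequential(nn.MaxPool2d(2, 2),
                                     conv_bn_act(c_in, c_half, 1))
        self.branch2 = nn.Sequential(conv_bn_act(c_in, c_half, 1),
                                     conv_bn_act(c_half, c_half, 3, 2))
    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)

# x: (1, 256, 80, 80) -> (1, 256, 40, 40)
x = torch.randn(1, 256, 80, 80)
print(MP(256)(x).shape)
```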
YOLOv7 provides three basic models with different depths and widths, namely YOLOv7-tiny, YOLOv7, and YOLOv7-W6, to cope with different usage scenarios [30]. Considering the requirements of accuracy, speed, and device computing power for ship navigation environment detection, we selected YOLOv7 as the baseline network in this study.
3.1. Introducing the w-CIoU Loss in Place of the CIoU Loss
The loss function of YOLOv7 can be divided into three categories: classification, localization, and confidence loss. The loss function formula is shown in Equation (1):

$$Loss = L_{cls} + L_{loc} + L_{conf} \quad (1)$$

Among them, $L_{cls}$ is the classification loss, used to correct the model’s ability to identify objects; $L_{loc}$ is the localization loss, used to correct the model’s ability to locate objects; and $L_{conf}$ is the confidence loss, used to correct the reliability of the model’s detection results. The classification loss and confidence loss in YOLOv7 are both binary cross-entropy losses, while the localization loss is CIoU.
The quality of training samples has a significant impact on the final performance of the model; therefore, the allocation of positive and negative labels during the training process is crucial. However, the sensitivity of IoU to objects of different scales varies greatly. As shown in Figure 2, for the same positional deviation, large objects and small objects exhibit significantly different declines in IoU. Small objects show larger fluctuations in IoU, leading to inaccurate label assignments for positive and negative samples. Therefore, using CIoU as the localization loss in scenarios with numerous small objects, such as inland water navigation, reduces the model’s recall rate and accuracy and results in missed detections of small objects.
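To make this scale sensitivity concrete, the short sketch below shifts a small box and a large box by the same four pixels and compares the resulting IoU; the box sizes are illustrative.

```python
def iou(a, b):
    # Boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

# Shift each box 4 px to the right; same deviation, very different IoU.
small = (0, 0, 8, 8)
large = (0, 0, 128, 128)
print(iou(small, (4, 0, 12, 8)))     # ~0.33 for an 8x8 object
print(iou(large, (4, 0, 132, 128)))  # ~0.94 for a 128x128 object
```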
Based on the issues posed by small objects in label assignment and localization accuracy, we drew inspiration from the normalized Wasserstein distance (NWD) [31] and combined it with CIoU to propose the w-CIoU loss. The goal was to reduce the sensitivity of the original loss function to positional deviations of small objects and improve the accuracy of positive and negative sample assignment for small objects. The composition and formulas of the w-CIoU loss function are as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$

$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right)$$

$$L_{w\text{-}CIoU} = \lambda L_{CIoU} + (1 - \lambda)\left(1 - NWD(\mathcal{N}_a, \mathcal{N}_b)\right)$$

Among these, $b$ represents the prediction box, $b^{gt}$ represents the real box, $\rho$ is the Euclidean distance between the center points of the two boxes, and $c$ represents the diagonal distance of the minimum closure region that can simultaneously contain both the predicted box and the true box. $\alpha$ is the equilibrium parameter and $v$ is the consistency coefficient. $W_2^2$ is the Wasserstein distance, $\mathcal{N}_a$ and $\mathcal{N}_b$ represent the Gaussian distributions used for modeling bounding boxes $A$ and $B$, and $C$ is a normalization constant. $\lambda$ is the weight factor, which is used to control the optimization proportion of the two types of loss function to cope with the presence of objects of different proportional sizes; in this paper, a value of $\lambda = 0.44$ was taken.
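As a hedged numerical sketch of the NWD term (following the definition in [31] and the combined loss reconstructed above), the code below models each box as a 2-D Gaussian and shows that equal center offsets yield equal NWD regardless of box scale; the constant C = 12.8 is an illustrative choice, not the value used in this paper.

```python
import math

def wasserstein2(box_a, box_b):
    # Boxes given as (cx, cy, w, h); squared 2-Wasserstein distance between
    # the Gaussians N_a, N_b used to model the two boxes.
    (ax, ay, aw, ah), (bx, by, bw, bh) = box_a, box_b
    return (ax - bx) ** 2 + (ay - by) ** 2 + \
           ((aw - bw) / 2) ** 2 + ((ah - bh) / 2) ** 2

def nwd(box_a, box_b, C=12.8):
    # Normalized Wasserstein distance; C is a dataset-dependent constant
    # (12.8 here is an illustrative value).
    return math.exp(-math.sqrt(wasserstein2(box_a, box_b)) / C)

# Equal 4-px center offsets: NWD is the same for small and large boxes,
# unlike IoU in the sketch above.
print(nwd((4, 4, 8, 8), (8, 4, 8, 8)))              # ~0.73
print(nwd((64, 64, 128, 128), (68, 64, 128, 128)))  # ~0.73

def w_ciou_loss(l_ciou, box_a, box_b, lam=0.44):
    # Weighted combination as reconstructed in the text above.
    return lam * l_ciou + (1 - lam) * (1 - nwd(box_a, box_b))
```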
In order to verify the improvement brought by the w-CIoU function to the YOLOv7 model in small-object detection, two experiments were conducted on the WSODD dataset; the data were split into a training set and a test set at an 8:2 ratio before training. The model in experiment 1 was the original YOLOv7, and the model in experiment 2 was an improved YOLOv7 with w-CIoU as the localization loss. The hardware device used in the experiments was a Tesla V100 graphics card (Nvidia, Santa Clara, CA, USA), the operating system was Ubuntu 22.04, the deep-learning framework was PyTorch 1.10.1, and the CUDA version was 10.2. After 300 iterations of training, the model was verified on the test set, and the experimental results are shown in Table 1. Three indicators are used here: accuracy, recall rate, and mAP, where higher values signify superior model performance. It was found that, compared to the CIoU function, the accuracy of the model with w-CIoU increased by 4.5%, the recall rate increased by 1.1%, and the mAP value increased by 1.4%, which indicates that, with appropriate parameter settings, the w-CIoU loss function can effectively handle the localization and label-assignment issues posed by small objects, improving the model’s performance for inland water navigation scenes with numerous small objects.
3.2. Spatial Pyramid Pooling Module Based on GhostConv (GSPPCSPC)
The SPPNet (spatial pyramid pooling network) [32] was proposed by the research team led by Kaiming He in 2015. This module effectively avoids the information loss or distortion caused by crop or warp processing and obtains fixed-length output vectors through pooling operations, allowing the network to retain rich spatial information and to adapt to object-detection and image-classification tasks at different scales. This improves the flexibility and generalization ability of the model, increases the speed of generating candidate boxes, and reduces computational costs.
The SPPCSPC module in YOLOv7 builds upon the SPPNet module and incorporates ideas from the CSP module. It consists of two branches. In the first branch, the SPP structure is applied to obtain four receptive fields of different scales, which helps to distinguish objects of different sizes within the module. In the second branch, the module performs routine convolution processing and merges with the first branch to form a structure similar to the CSP module, improving the module’s performance. However, compared to the SPP module of YOLOv5, the SPPCSPC module involves far more convolution operations, which not only increases the model’s parameters but also raises the computational costs on the terminal device. Therefore, we propose a new receptive-field amplification module named GSPPCSPC, which combines the GhostConv convolutional network and the Depthwise (DW) convolutional network with the SPPCSPC module.
GhostNet [33] is a lightweight neural network architecture designed specifically for mobile devices and embedded systems, proposed by a research team at Huawei’s Noah’s Ark Lab in 2020. The key idea of GhostNet is to utilize an efficient feature-reuse strategy, significantly reducing computational resource consumption, which allows the model to maintain high accuracy while greatly reducing its complexity. GhostNet is particularly suitable for scenes with limited computing resources, such as mobile devices, or scenes that require real-time processing. The DW (Depthwise) convolutional network consists of depthwise convolution and pointwise convolution. This network effectively separates the spatial filtering and cross-channel information-mixing steps of standard convolution, thereby significantly reducing the computational complexity and model size while maintaining the model’s recognition performance, which is conducive to model miniaturization and accelerated operation.
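A minimal PyTorch sketch of these two building blocks follows, assuming a half-and-half channel split in GhostConv and SiLU activations; the exact configurations used in GSPPCSPC may differ.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Generates half of the output channels with a regular convolution and
    the other half with a cheap depthwise operation on those features."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_mid = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_mid, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
        self.cheap = nn.Sequential(  # depthwise "ghost" feature generation
            nn.Conv2d(c_mid, c_mid, 5, 1, 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU())
    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class DWConv(nn.Module):
    """Depthwise separable convolution: per-channel spatial filtering
    followed by 1x1 pointwise channel mixing."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, k, 1, k // 2, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))
```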
A diagram of the SPPCSPC module is shown in Figure 3, while the GSPPCSPC module is shown in Figure 4. Firstly, this module replaces the original 3 × 3 convolutions with GhostConv convolutions to reduce computational resource consumption. Secondly, a DW convolution module is introduced before the branches of the 5 × 5 pooling layers for spatial feature extraction and cross-channel information fusion, enhancing the feature-representation capability of the module. The GSPPCSPC module reduces the parameters and computational power requirements of the original module and reduces spatial feature loss, further improving the model’s performance.
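Reusing the GhostConv and DWConv sketches above, the following is a hedged approximation of how GSPPCSPC could be assembled; the exact channel widths, pooling arrangement, and branch wiring are given in Figure 4, so treat this as a sketch rather than the definitive implementation.

```python
import torch
import torch.nn as nn
# Reuses GhostConv and DWConv from the sketch earlier in Section 3.2.

class GSPPCSPC(nn.Module):
    """Sketch of the proposed module: the SPPCSPC layout with its 3x3
    convolutions swapped for GhostConv and a DWConv inserted before the
    pooling branches (wiring approximated from the text)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        # Branch 1: SPP-style receptive-field amplification.
        self.cv1 = GhostConv(c_in, c_mid, 1)
        self.dw = DWConv(c_mid, c_mid, 3)  # spatial/channel fusion before pooling
        self.pools = nn.ModuleList(nn.MaxPool2d(5, 1, 2) for _ in range(3))
        self.cv2 = GhostConv(4 * c_mid, c_mid, 1)
        # Branch 2: CSP-style shortcut.
        self.cv3 = GhostConv(c_in, c_mid, 1)
        self.out = GhostConv(2 * c_mid, c_out, 1)
    def forward(self, x):
        y = self.dw(self.cv1(x))
        # Cascaded 5x5 poolings emulate 5/9/13 kernels: four receptive-field
        # scales in total (an assumption borrowed from the SPPF trick).
        ys = [y]
        for p in self.pools:
            ys.append(p(ys[-1]))
        y1 = self.cv2(torch.cat(ys, dim=1))
        return self.out(torch.cat([y1, self.cv3(x)], dim=1))
```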
3.3. P2 Small-Object Feature-Fusion Network
The neck network is used for feature fusion, and the YOLOv7 network has three detection layers. For input images with a size of 640 × 640, the minimum detectable object size is 8 × 8 pixels. During ship navigation, distant ships and other small obstacles tend to be small, resulting in a high proportion of small objects. In addition, some objects appear against partially camouflaged backgrounds and occupy a low proportion of the image, which causes low detection accuracy and a high false-alarm rate for YOLOv7 in such object-detection tasks. Therefore, we introduced a P2 layer with a scale of 160 × 160 to the feature-fusion network to detect small objects down to a scale of 4 × 4 pixels. In order not to significantly increase the parameters, no RepConv convolutional layer was added for this branch. The overall network structure of the improved model is shown in Figure 5.
By incorporating the P2 layer, the model can effectively detect small objects without significantly increasing the parameters. This enhancement improved the overall performance of the model in detecting small objects and objects against partially camouflaged backgrounds.
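The scale arithmetic behind this choice can be sketched in a few lines; the mapping assumes an object must cover at least one grid cell of the corresponding feature map.

```python
# Detection scales for a 640 x 640 input: feature-map size -> stride ->
# smallest object that maps onto one grid cell.
for fmap in (20, 40, 80, 160):  # P5, P4, P3, and the added P2 layer
    stride = 640 // fmap
    print(f"{fmap}x{fmap} feature map: stride {stride}, "
          f"minimum object about {stride}x{stride} px")
```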
3.4. Model Pruning
The YOLOv7 is composed of numerous neural network connections, but not all connections contribute equally to the model’s performance. By pruning the connections with smaller contributions to the results, the model can be compressed. Model pruning is an effective technique for model compression, allowing the network to be streamlined by identifying and removing redundant or less influential parameters or connection structures without significantly impacting or having a minimal impact on the model’s performance. In this study, the LAMP (layer-wise adaptive magnitude pruning) method [
34] was selected to prune the YOLOv7 network model.
Research on network pruning has shown that amplitude-based network pruning, after carefully selecting layer-wise sparsity, can strike a balance between network performance and lightweight design, but there is no consensus on the selection method. The LAMP method proposes a new global pruning importance score that calculates the importance scores for weights to guide the network pruning. This method does not require any hyperparameter adjustment or heavy calculation, and it is superior to manual heuristics or extensive hyperparameter search methods in hierarchical sparsity selection. By applying the LAMP method to prune the YOLOv7 network model, we aimed to achieve a more compact and efficient model while maintaining the performance or minimizing the impact on the model’s performance.
As shown in Figure 6, the vectors $W^{(i)}$ and $W^{(j)}$ represent the weight vectors of different layers of a model, flattened to one dimension. The LAMP score is calculated by normalizing each squared weight by the sum of the squared weights that have not yet been pruned. Specifically, assume there is a weight tensor in the model, denoted as $W$. Such a tensor is associated with every fully connected layer and convolutional layer in the model and can be two-dimensional or four-dimensional. In order to calculate the LAMP score uniformly between the fully connected layers and the convolutional layers, all weight tensors are flattened into one-dimensional vectors before calculation. During the flattening process, the entries are sorted in ascending order according to their magnitudes; that is, when $u < v$, $|W[u]| \le |W[v]|$ holds, where $W[u]$ represents the weight value mapped to index $u$.
At this point, the formula for calculating the LAMP score of the $u$-th index of tensor $W$ is:

$$\operatorname{score}(u; W) = \frac{W[u]^2}{\sum_{v \ge u} W[v]^2}$$
Once the LAMP scores of the weight vectors are calculated, it is possible to globally prune the connections with the smallest LAMP scores until the specified threshold condition is met. From the LAMP score calculation formula, it can be observed that there is at least one connection with a LAMP score of 1 in each layer. Therefore, at least one connection will be retained in each layer, which distinguishes it from the approach of solely pruning connections based on the weight magnitudes.
Compared to other pruning methods, the method based on the LAMP score has the following advantages. Firstly, it does not require any hyperparameters to be tuned, as only the pruning rate needs to be set. Secondly, the LAMP score of each connection can be obtained through elementary tensor operations, with minimal computational costs.
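A minimal sketch of the LAMP score computation for a single weight tensor, following the formula above; the 50% sparsity in the usage lines is an illustrative choice.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """LAMP score per weight: W[u]^2 normalized by the sum of W[v]^2 over all
    weights v >= u in ascending-magnitude order."""
    w2 = weight.flatten() ** 2
    w2_sorted, idx = torch.sort(w2)  # ascending magnitude
    # Denominator for sorted index u: suffix sum over all v >= u.
    denom = torch.flip(torch.cumsum(torch.flip(w2_sorted, dims=[0]), dim=0),
                       dims=[0])
    scores = torch.empty_like(w2)
    scores[idx] = w2_sorted / denom  # scatter back to original positions
    return scores.view_as(weight)

# The largest-magnitude weight in every tensor scores exactly 1, so at least
# one connection per layer survives any global threshold below 1.
w = torch.randn(4, 4)
s = lamp_scores(w)
print(s.max())  # tensor(1.)

# Illustrative global pruning at 50% sparsity for this single tensor:
mask = s >= torch.quantile(s.flatten(), 0.5)
pruned = w * mask
```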
5. Conclusions
In inland waterways, there is a high density of ships and various objects, with a predominance of small objects, which can easily affect navigation safety. The use of computer vision methods to detect objects in visible light images can help ship drivers better grasp the navigation environment and improve navigation safety.
This paper proposes a lightweight water-surface object-detection model named LWS-YOLOv7 to improve the obstacle-detection performance of inland ships and enhance navigation safety. Firstly, the localization loss function of the model was optimized by combining the normalized Wasserstein distance with CIoU, yielding the w-CIoU function, which reduces the sensitivity to position deviations of small objects and improves the recall rate of the model. Secondly, the GSPPCSPC module was introduced to reduce the parameters and enlarge the receptive field, improving the multi-scale object-detection capability of the model. Thirdly, a feature-fusion layer P2 with a scale of 160 × 160 was added to detect objects larger than 4 × 4 pixels, thereby improving the accuracy and recall rate for small objects. Finally, the LAMP-based method was used to prune the model, significantly reducing its parameters and GFLOPS. The experimental results demonstrate that, compared to the original YOLOv7 model, LWS-YOLOv7 achieved a 3.1% improvement in mAP and a 68% increase in FPS, while reducing the parameters to 61.2% and the GFLOPS to 71.2% of the original values. Therefore, the LWS-YOLOv7 proposed in this paper has excellent detection performance for inland ship navigation scenarios with a high proportion of small objects. At the same time, it is suitable for various meteorological conditions and can be deployed on ships to perform real-time object-detection tasks.
Future research can focus on the following aspects: (1) continuing to collect water-surface object datasets under various scenarios to produce higher-quality datasets, and (2) combining knowledge-distillation techniques to further reduce the network parameters and improve the model’s performance.