1. Introduction
With the advent of the information age, the rapid development of computers and related technologies has driven breakthroughs in machine vision. Machine vision, a fusion technology that uses computers as tools and combines image processing with sensing, spans multiple research directions such as object detection, semantic segmentation, motion tracking, 3D reconstruction, and action recognition. Object detection has important practical value in many application scenarios, covering fields such as video surveillance, autonomous driving, unmanned ship detection, and navigation [1]. The world has abundant ocean and inland river resources, so research on water target detection is receiving increasing attention [2], especially in the field of ship detection at sea. Initially, ship detection relied mainly on visible light imaging, but with the continuous development of infrared thermal imaging technology, infrared-based ship detection has gradually become a research hotspot. Compared to visible light, infrared detection offers all-weather operation, long detection range, and strong anti-interference ability, enabling effective detection of ships at night and in adverse weather conditions. Compared to other detection tasks, nearshore ship detection based on infrared images has distinct characteristics:
- (1) Owing to the physical properties of infrared imaging, infrared images have lower resolution than visible light images. This leaves limited information for small target ships, and crucial details are easily lost during feature extraction, degrading detection.
- (2) Viewed from the shore toward the ocean, ships of the same category appear at different scales in the images, so small targets are easily lost. Detection algorithms therefore need a stronger capability for multi-scale target detection.
- (3) Coastal areas and ports often harbor a substantial number of ships. Mutual occlusion between ships poses a significant challenge for accurate localization and classification in nearshore environments.
Deep learning-based object detection methods have gradually become mainstream, giving rise to numerous algorithm frameworks. These encompass anchor-based two-stage algorithms (such as RCNN [3] and Faster RCNN [4]), single-stage algorithms (such as YOLO [5] and SSD [6]), and anchor-free algorithms (such as CenterNet [7] and FCOS [8]). As detection research has deepened, the focus of detection tasks has shifted from large and medium-sized targets to enhancing the detection of small targets while preserving performance on larger ones.
In response to this challenge, researchers have introduced various methods, including multi-scale feature fusion and context learning. These approaches aim to further refine the detection performance of small targets without compromising the accuracy of larger and medium-sized targets. This trend underscores the pursuit of a more comprehensive performance in object detection algorithms, enabling them to excel in handling diverse target sizes and scenes.
Shi et al. [9] proposed a strategy for fusing deep and shallow features to enhance detection probability. This strategy captures the low-level structure and texture features of small targets, along with high-level semantic information, to prevent missed detections. Ye et al. [10] added a small object detection layer to the original network model and adopted a new connection scheme based on BiFPN to address the issues caused by multi-scale variation in ships. With continuing research on attention mechanisms, they have been widely applied across deep learning in recent years, including in object detection. Si et al. [11] proposed an improved YOLO-RSSD algorithm that embeds an enhanced bidirectional feature pyramid network in the feature fusion stage, enabling cross-layer multi-scale weighted feature fusion, and introduces a channel attention mechanism in the convolutional units to further enhance the detection of small ship targets in infrared images. Guo et al. [12] proposed a nearshore ship detection method based on the FCOS network, incorporating a bidirectional attention feature pyramid network to enhance the detection accuracy of small targets. Although these improved methods have enhanced small target detection to some degree, there is still room for further improvement.
Wang et al. [13] designed a CNeB2 module to enhance spatial correlation in the encoding, reducing interference from redundant information and improving the model's ability to recognize dense targets. As feature extraction and fusion in network models have matured, some scholars have turned to the post-processing stage. Shi et al. [14] proposed an improved YOLOv5s_SE to address the insufficient performance of existing algorithms on small targets, integrating Soft-NMS and EIOU_Loss and replacing the non-maximum suppression (NMS) step of the original network, thereby improving the detection of occluded objects.
Infrared ships near the coast exhibit distinctive characteristics. Firstly, due to equipment limitations, infrared images have lower resolution, appear more blurred compared to visible light, and are often accompanied by significant noise. Secondly, the coastal perspective of the infrared images, facing the ocean, results in a broader field of view for detection scenes. This leads to uneven distribution of target sizes, and in coastal areas, the density of ships is relatively high, making occlusion incidents more likely.
Considering these characteristics of infrared ship detection near the coast, and after comparing various network models, a GT_YOLO infrared ship detection algorithm based on YOLOv5s is proposed. YOLO, as a single-stage detection algorithm, has significant advantages in speed, accuracy, and ease of deployment, making it more adaptable to diverse and complex scenes than other detection algorithms. The proposed algorithm not only performs well in small target detection but also effectively addresses the challenges introduced by dense scenes: mAP0.5 on infrared ships is improved by 1%, mAP0.75 by 5.7%, and mAP0.5:0.95 by 5%. The main contributions of this research can be summarized as follows.
To address prominent challenges in the detection of infrared ships near the coast, this paper proposes a GT_YOLO algorithm for infrared ship detection based on YOLOv5s. The algorithm introduces an attention mechanism to allow the network model to focus more on crucial features to improve performance in specific scenarios near the coast.
- (1) To capture long-range contextual information about the targets, this paper introduces a feature fusion module based on a fused attention mechanism. The module enhances feature fusion while suppressing the noise introduced by shallow feature layers.
- (2) To counter the loss of detail caused by low resolution and its impact on small object detection, the SPD-Conv module is introduced to improve small object detection accuracy.
- (3) To address the dense occlusion that frequently occurs near the coast, this paper introduces Soft-NMS so that the detection model retains excellent performance in dense occlusion scenes.
2. GT_YOLO
2.1. YOLOv5
YOLO [5] (You Only Look Once) is a classic single-stage detection algorithm; in this paper, the mature YOLOv5 version of the YOLO series is selected. The network structure is divided into three parts: the backbone, neck, and head. The backbone uses the BottleneckCSP structure for feature extraction; the neck introduces a feature pyramid network (FPN) and PANet to enhance feature fusion and extraction; and the head classifies and predicts targets based on the learned features. Through the collaboration of these three parts, YOLOv5 achieves rapid and accurate detection of targets in various scenarios, demonstrating outstanding performance.
YOLOv5 adopts mosaic data augmentation at the input, synthesizing new images through random cropping, scaling, and composition. This strategy not only enriches the training data but also helps improve inference speed. For loss calculation, YOLOv5 uses CIOU_Loss, which assesses the error of detection results more accurately. Finally, the output is refined through NMS to obtain the final detections. Together, these data augmentation and loss calculation mechanisms allow YOLOv5 to achieve more robust and accurate detection in complex scenarios. The YOLOv5 network architecture is illustrated in Figure 1.
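As an illustration of the CIoU loss mentioned above, the following is a minimal NumPy sketch for a single pair of axis-aligned boxes in (x1, y1, x2, y2) format. CIoU augments IoU with a normalized center-distance term and an aspect-ratio consistency term; all function and variable names here are illustrative, not taken from the YOLOv5 codebase.

```python
import numpy as np

def ciou_loss(box_a, box_b, eps=1e-7):
    """CIoU loss between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)

    # Squared distance between box centers, normalized by the squared
    # diagonal of the smallest enclosing box
    rho2 = ((ax1 + ax2) - (bx1 + bx2)) ** 2 / 4 + ((ay1 + ay2) - (by1 + by2)) ** 2 / 4
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2 + eps

    # Aspect-ratio consistency term
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    v = (4 / np.pi ** 2) * (np.arctan(wb / (hb + eps)) - np.arctan(wa / (ha + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - (iou - rho2 / c2 - alpha * v)
```

For identical boxes the loss is near zero, and it grows as the boxes drift apart or their aspect ratios diverge.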
2.2. Feature Fusion Enhancement
In the original YOLOv5 network, the FPN structure simply fuses shallow features with other multi-scale features along the channel dimension. This approach struggles to reflect the relative importance of different channel features and may allow noise to diffuse throughout the network, impairing fusion effectiveness. To address this problem, this paper introduces a feature fusion module, GT, based on a fusion-style attention mechanism, as illustrated in Figure 2.
This module effectively enhances the capabilities of global context modeling and local feature extraction, not only providing better attention to small targets but also suppressing the propagation of noise. The features of the GT module come from the high-level features of the backbone network, features at the same level, and features from the previous layer, generating new features through clever fusion. This feature fusion module based on the fusion-style attention mechanism contributes to improving the network’s perception of global and local information, thereby more effectively addressing the challenges in small target detection.
For the high-level features of the backbone network, this paper employs GCNet [15] for effective global context modeling, capturing long-range dependencies while extracting a global understanding of the visual scene. This attention module integrates the advantages of the simplified non-local (SNL) block and SENet, providing long-range dependency modeling with lightweight characteristics. The input feature tensor of size C1 × H1 × W1 is processed through the SNL and SE blocks and transformed into C × H × W, thereby capturing more useful features.
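The core of the SNL block is its context-modeling step: a softmax-weighted pooling of feature vectors over all spatial positions. Below is a simplified NumPy sketch of just that step; the transform and fusion stages of the GC block are omitted, and `w_k` stands in for the 1 × 1 context convolution (names are illustrative).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_context_pool(feat, w_k):
    """Simplified context-modeling step of a GC block.

    feat: (C, H, W) input feature map.
    w_k:  (C,) weights standing in for the 1x1 context convolution.
    Returns a (C,) global context vector: a softmax-weighted sum of the
    feature vectors over all H*W spatial positions.
    """
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)      # (C, HW)
    attn = softmax(w_k @ flat)         # (HW,) attention over positions
    return flat @ attn                 # (C,) pooled global context

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
ctx = global_context_pool(feat, rng.standard_normal(8))
```

Because the attention weights form a convex combination over spatial positions, each channel of the context vector lies within the range of that channel's feature values.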
For features at the same level, the module adopts triplet attention [16] to make full use of the spatial and channel information in the features. This attention mechanism consists of three branches, each processing the input feature tensor of size C2 × H2 × W2. The first branch is the channel attention branch, where the input undergoes pooling and a 7 × 7 convolution, and finally a Sigmoid function generates spatial attention weights. The second branch establishes interaction between the channel dimension C2 and the spatial dimension H2 by rotating the feature tensor along the W2 axis, transforming it into H2 × C2 × W2; after Z-Pool and a 7 × 7 convolution, a Sigmoid function generates attention weights over the spatial dimension H2 and channel dimension C2. The third branch is analogous, establishing interaction between the channel dimension C2 and the spatial dimension W2 and generating the corresponding attention weights. Finally, the outputs of the three branches are added and averaged, yielding a C × H × W output. Through the triplet attention mechanism, this module not only exploits the rich shallow-layer features more comprehensively but also suppresses a significant amount of the noise within them, enabling the network to focus better on infrared ship targets and enhancing infrared ship detection performance.
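To make the branch mechanics concrete, here is a minimal NumPy sketch of one rotated branch under simplifying assumptions: Z-Pool is implemented as stacked max- and mean-pooling, and the 7 × 7 convolution is replaced by a plain mean for brevity, so the resulting weights are only schematic.

```python
import numpy as np

def z_pool(x, axis=0):
    """Z-Pool: stack max- and mean-pooling along `axis`,
    reducing that dimension to size 2 (as in triplet attention)."""
    return np.stack([x.max(axis=axis), x.mean(axis=axis)], axis=axis)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One rotated branch, sketched: permute (C2, H2, W2) -> (H2, C2, W2),
# apply Z-Pool along the leading axis, then derive attention weights
# with a sigmoid (the 7x7 convolution is omitted for brevity).
x = np.random.default_rng(1).standard_normal((8, 16, 16))  # (C2, H2, W2)
rotated = np.transpose(x, (1, 0, 2))                       # (H2, C2, W2)
pooled = z_pool(rotated, axis=0)                           # (2, C2, W2)
weights = sigmoid(pooled.mean(axis=0))                     # (C2, W2)
attended = rotated * weights                               # broadcast over H2
```

In the full module the same pattern runs in all three branches (with different axis permutations), and the re-rotated branch outputs are averaged.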
The feature fusion module designed in this paper processes the high-level and low-level features transmitted by the backbone network through attention mechanisms. The fusion is performed by addition, and the resulting features are combined with the features propagated by the FPN, ultimately producing the fused feature map.
2.3. SPD-Conv
Due to hardware limitations, infrared images exhibit lower resolution and blurrier pixels than visible light images, making it challenging for networks to extract detailed features; small infrared targets (less than 32 × 32 pixels) are especially likely to be overlooked in multi-scale detection scenarios. To address this, researchers have designed BiFPN based on the FPN+PAN structure, introducing shorter paths to improve feature fusion, while other scholars have introduced a scale-invariant subspace (SAN) [17] to map multi-scale features and enhance detection performance. However, these methods often rely on strided convolutions and max pooling, which can cause information loss in infrared target detection and are particularly unfriendly to small targets.
To address this issue, this article introduces SPD-Conv [18] into the YOLOv5 network architecture. By replacing the strided convolution and pooling layers of the original network model, the detection performance for small targets and low-resolution images is improved. Assuming an intermediate feature map X of size S × S × C1, the SPD layer slices it into a series of sub-feature maps

f(x, y) = X[x : S : scale, y : S : scale], 0 ≤ x, y < scale,

where each sub-feature map f(x, y) consists of the entries X(i, j) whose indices i and j are congruent to x and y, respectively, modulo the scale factor. Each sub-map thus downsamples X by the scale factor. When scale = 2, as shown in Figure 3, four sub-feature maps f(0, 0), f(1, 0), f(0, 1), and f(1, 1) are obtained, each of size (S/2) × (S/2) × C1, corresponding to a downsampling factor of 2.

The sub-feature maps are then concatenated along the channel dimension to form a new feature map X′: the spatial resolution is reduced by the scale factor while the channel dimension is expanded by scale². In other words, the original S × S × C1 feature map is transformed into an (S/2) × (S/2) × 4C1 feature map.

After the SPD layer, a non-strided (stride 1) convolution with C2 filters is applied, producing an (S/2) × (S/2) × C2 feature map. Non-strided convolution preserves feature information better. Otherwise, with an odd stride such as 3, the feature map is downsized proportionally but each pixel is sampled only once; with an even stride such as 2, the sampling becomes uneven, producing inconsistency between even and odd rows (and columns).
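The slicing step can be sketched in a few lines of NumPy. This is an illustrative sketch of the space-to-depth operation only; the follow-up non-strided convolution is omitted.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """SPD slicing: (S, S, C1) -> (S/scale, S/scale, scale**2 * C1).
    Every pixel is kept: spatial resolution is traded for channels,
    so no information is discarded (unlike strided conv or pooling)."""
    s = x.shape[0]
    assert s % scale == 0, "spatial size must be divisible by scale"
    subs = [x[i::scale, j::scale, :] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=-1)

x = np.arange(4 * 4 * 3).reshape(4, 4, 3).astype(float)
y = space_to_depth(x, scale=2)   # (2, 2, 12): half the resolution, 4x channels
```

Note that the output contains exactly the same set of values as the input, only rearranged, which is the property that makes SPD-Conv friendly to small, low-resolution targets.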
2.4. Soft-NMS
In detection tasks, the same object may be detected multiple times, producing numerous overlapping candidate boxes. NMS is an effective remedy. NMS first sorts all candidate boxes in descending order of confidence score. In each round, the candidate box with the highest score is selected and retained, and all remaining candidate boxes that overlap it heavily are suppressed; the retained box is not considered in subsequent rounds. Repeating this operation yields the highest-scoring candidate boxes while suppressing heavily overlapping ones.
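The greedy procedure described above can be sketched as follows, with boxes given as (x1, y1, x2, y2); this is an illustrative pure-NumPy implementation, not the exact YOLOv5 routine.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-7)

def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it
    above iou_thr, and repeat. Returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps < iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # the second box heavily overlaps the first and is dropped
```

Here the second box is suppressed because its IoU with the top-scoring box exceeds the threshold, while the distant third box survives.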
However, NMS may encounter issues of missed detections when dealing with dense situations of infrared ships near the coast. This is because NMS directly excludes candidate boxes when their overlap exceeds a certain threshold, and this more aggressive approach may lead to the incorrect exclusion of some important targets. Faced with this challenge, there is a need to seek a more flexible and adaptive approach to enhance the robustness of detection.
This paper addresses this issue by introducing Soft-NMS [19], which significantly improves the algorithm's performance in dense infrared ship scenarios. Soft-NMS still follows the idea of NMS, suppressing candidate boxes that overlap with the highest-scoring candidate box. However, for densely occluded scenes, Soft-NMS shifts its attention gradually toward candidate boxes with greater overlap, attenuating their scores more strongly. The revised pruning rule is as follows:

s_i = s_i,                        IoU(M, b_i) < N_t
s_i = s_i · (1 - IoU(M, b_i)),    IoU(M, b_i) ≥ N_t

Here, M represents the candidate box with the highest score, b_i and s_i represent the i-th candidate box and its score, and N_t represents the overlap threshold.

Under this rule, overlaps above the threshold are penalized by a linear decay function of the overlap with M. Candidate boxes far from M are unaffected, while those closer to it receive a larger penalty. However, the penalty is not continuous: it is applied abruptly once the threshold is reached. Ideally, the penalty function should be continuous, imposing no penalty when there is no overlap and a very heavy penalty when the overlap is high; at the same time, it should not affect the scores of boxes with low overlap, with the penalty increasing gradually as the overlap grows. Building on this idea, Soft-NMS introduces a Gaussian penalty function:

s_i = s_i · exp(-IoU(M, b_i)² / σ)

Here, σ is a hyperparameter. With this function, the higher the overlap between two boxes, the smaller the resulting score becomes. Compared with traditional NMS, Soft-NMS assigns a very small score instead of removing the box outright, which significantly improves the detection of infrared ships under dense occlusion.
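The Gaussian variant can be sketched as follows. It reuses the greedy selection loop of standard NMS but decays scores instead of discarding boxes; this is an illustrative pure-NumPy implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-7)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Soft-NMS with a Gaussian penalty: instead of removing a box whose
    overlap with the current best box M is high, decay its score by
    exp(-IoU(M, b_i)**2 / sigma). Returns (selection order, final scores)."""
    idx = np.arange(len(scores))
    s = scores.astype(float).copy()
    keep = []
    while idx.size > 0:
        best = idx[np.argmax(s[idx])]
        keep.append(int(best))
        idx = idx[idx != best]
        if idx.size == 0:
            break
        s[idx] *= np.exp(-iou(boxes[best], boxes[idx]) ** 2 / sigma)
        idx = idx[s[idx] > score_thr]   # prune boxes whose score decayed away
    return keep, s

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
keep, final = soft_nms(boxes, scores)
```

In this example, the heavily overlapping second box is not discarded (as hard NMS would do) but survives with a decayed score, which is precisely what preserves genuine detections in densely occluded nearshore scenes.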
This paper presents an improved GT-YOLO based on YOLOv5s, as shown in Figure 4. By incorporating the SPD-Conv module and the designed feature fusion module, the model not only achieves excellent detection of small infrared ships but also enhances the fusion of multi-scale ship features, performing well on multi-scale targets in the same scene. Finally, Soft-NMS is applied at the end of the network to address the dense occlusion of infrared ships near the coast. Together, these improvements greatly enhance the network's detection performance.
4. Conclusions
This paper introduces a feature fusion module that better integrates high-level and low-level features in the network model while suppressing noise interference from the input. By incorporating SPD-Conv into the network architecture, the model achieves improved accuracy for low-resolution images and small target detection. The experiments are conducted on a dataset comprising a diverse range of infrared ships.
In scenes with dense occlusion, the original network model is prone to interference, affecting the localization and detection accuracy of ships. To address this, this paper replaces the traditional NMS with Soft-NMS, significantly enhancing the model’s performance in dense occlusion scenarios.
The improved algorithm demonstrates a 1% increase in mAP0.5 over the original algorithm and a 5% improvement in mAP0.5:0.95. Although FPS decreases slightly, the model still achieves over 150 frames per second. The proposed algorithm exhibits higher detection accuracy, enabling better monitoring of ships in nearshore detection tasks. Extensive experiments and comparisons with other benchmark algorithms show that GT-YOLO holds a notable advantage in detection performance with a considerably smaller parameter count. Nevertheless, limitations remain: the introduction of SPD-Conv and the enhanced NMS adds 1.7 million parameters and slightly reduces detection speed, requiring the network to handle larger computational loads and impacting real-time performance. Future work will focus on further optimizing the network to reduce parameters and computational overhead, and on refining Soft-NMS to mitigate its impact on the network's detection speed.