Article

YOLO-IRS: Infrared Ship Detection Algorithm Based on Self-Attention Mechanism and KAN in Complex Marine Background

College of Information and Communication Engineering, Harbin Engineering University (HEU), Harbin 150001, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(1), 20; https://doi.org/10.3390/rs17010020
Submission received: 22 November 2024 / Revised: 23 December 2024 / Accepted: 24 December 2024 / Published: 25 December 2024

Abstract

Infrared ship detection technology plays a crucial role in ensuring maritime transportation and navigation safety. However, infrared ship targets at sea exhibit characteristics such as multi-scale, arbitrary orientation, and dense arrangements, with imaging often influenced by complex sea–sky backgrounds. These factors pose significant challenges for the fast and accurate detection of infrared ships. In this paper, we propose a new infrared ship target detection algorithm, YOLO-IRS (YOLO for infrared ship target), based on YOLOv10, which improves detection accuracy while maintaining detection speed. The model introduces the following optimizations: First, to address the difficulty of detecting weak and small targets, the Swin Transformer is introduced to extract features from infrared ship images. By utilizing a shifted window multi-head self-attention mechanism, the window field of view is expanded, enhancing the model’s ability to focus on global features during feature extraction, thereby improving small target detection. Second, the C3KAN module is designed to improve detection accuracy while also addressing issues of false positives and missed detections in complex backgrounds and dense occlusion scenarios. Finally, extensive experiments were conducted on an infrared ship dataset: compared to the baseline model YOLOv10, YOLO-IRS improves precision by 1.3%, mAP50 by 0.5%, and mAP50–95 by 1.7%. Compared to mainstream detection algorithms, YOLO-IRS achieves higher detection accuracy while requiring relatively fewer computational resources, verifying the superiority of the proposed algorithm and enhancing the detection performance of infrared ship targets.

1. Introduction

Machine vision has advanced rapidly alongside the growth of computing and related technologies. The field, which couples image processing and sensors with computers, has accelerated progress in research directions such as semantic segmentation, target detection and tracking, and super-resolution image reconstruction. Owing to its broad application value and research importance, infrared target detection has garnered considerable attention in machine vision. It has been extensively applied in both civil and military domains, such as ground target detection [1] and infrared guidance systems [2], with infrared ship target detection being particularly valuable for sea rescue and maritime safety.
Current ship target detection systems mainly rely on visible light, synthetic aperture radar (SAR), hyperspectral, and infrared imaging technologies [3]. Visible light offers high resolution, rich texture detail, and a broad sensing range, but it is difficult to identify targets precisely under poor illumination, such as in bad weather. SAR is capable of long-distance and all-weather detection, but it is susceptible to electromagnetic interference, islands, harbors, and atmospheric effects when observing the sea surface. Unlike visible light imaging, infrared imaging can penetrate smoke, resists interference, and does not depend on illumination; because it passively receives radiation, it is also more covert than SAR [4]. The infrared imaging-based surface ship detection method is therefore versatile under complex sea conditions. However, the resolution of infrared imaging is much lower than that of visible light images, so crucial information is lost for weak targets (a target is considered weak if its contrast is below 15%, its signal-to-noise ratio is below 1.5, and it occupies fewer than 80 pixels, i.e., about 0.15% of a 256 × 256 image) [5], and this loss during ship feature extraction degrades the detection outcome. Additionally, the large number of ships docked in coastal harbors obscure one another, posing a significant challenge to the accurate recognition and classification of infrared ship targets.
To summarize, weak targets are a key factor constraining the detection performance of infrared imaging systems [6]. Since 2012, when deep learning reshaped the field of computer vision [7], increasingly advanced detection algorithms have brought new ideas to weak target detection and spawned numerous detection frameworks [8]. General-purpose target detection methods fall into two primary categories: single-stage detectors, exemplified by YOLO [9], which are distinguished by fast detection speed and lightweight design, and two-stage detectors, dominated by the Faster R-CNN [10] series, which have an advantage in multi-scale detection. There are also Transformer-based detectors, such as the DETR [11] series. These algorithms focus on modeling global relationships within the image, but their ability to perceive local details is insufficient, leading to poor performance on small targets. Current research therefore concentrates on enhancing the performance of weak and small target detection algorithms.
To overcome this challenge, researchers have proposed several methods to increase the detection effectiveness of small and weak objects. An attention-guided bidirectional feature pyramid network, created by Yi et al. [12], enhances the ability to recognize tiny objects in remote sensing images. An enhanced anchor-free method was presented by Sun et al. [13], with the goal of improving the efficiency of small target ship recognition in high-resolution synthetic aperture radar (HRSAR) images. Xie et al. [14] employed a coordinate attention module to aggregate positional information, thereby enhancing the capacity to recognize small objects in remote sensing images with intricate distributions. Yu et al. [15] designed a visually salient lightweight ship detector (VS-LSDet) to identify small ship targets in SAR images. Additionally, Sun et al. [16] proposed a YOLO-based SAR ship detection algorithm utilizing angle classification and bidirectional feature fusion. Li et al. [17] proposed a cross-layer attention network to obtain stronger features for weak targets in visible light images. In recent years, weak target detection for infrared images has also emerged. Dai et al. [18] proposed asymmetric contextual modulation to modulate both high-level and low-level semantic features. Ye et al. [19] designed a mixed attention mechanism to improve feature fusion and created a feature fusion technique to gather remote contextual information about small objects. To enhance feature representation capability, Zhang et al. [20] created a spatial and frequency attention-based decoder to extract spatial context and frequency domain context. Si et al. [21] designed an improved bidirectional feature fusion pyramid network structure to attain multi-scale weighted feature fusion across layers and increase the detection rate of remote sensing ships. Guo et al. [22] designed a bidirectional attention-based feature pyramid network (BAFPN) for offshore vessel detection, enhancing the detection performance of tiny vessels. Wang et al. [23] improved the accuracy of weak target detection by introducing SPD-Conv and proposing a feature fusion network with a fusion attention mechanism. Zhang et al. [24] designed a feature enhancement, fusion, and context-aware detector to increase the likelihood that small targets in remote sensing images will be detected. Gong et al. [25] proposed an ASDet detector, which optimizes the loss function to mitigate the discontinuous boundary issue caused by angular periodicity, thereby improving the detection accuracy of small objects. Yuan et al. [26] introduced a two-stage small object detection framework based on a coarse-to-fine pipeline and feature imitation learning, which enhances the detection accuracy of small objects.
Recent research has identified several factors contributing to the low efficiency of infrared target detection, such as the following: (1) Small targets occupy a small proportion of pixels in the entire image and carry fewer target features. During the downsampling process of feature extraction, small target information is often lost, leading to low detection accuracy for small targets. (2) Convolutional neural network (CNN)-based object detection algorithms tend to lose contextual information during feature extraction, further decreasing the detection accuracy for small targets. (3) Interference from background information. To address the challenges of infrared ship target detection under complex sea–sky backgrounds and the weak small target extraction capability that results in low detection accuracy, this paper proposes a novel infrared ship target detection algorithm, YOLO-IRS. First, to tackle the challenge of weak small target detection ability, we introduce the Swin Transformer to extract features from infrared ship images. By employing a shifted window multi-head self-attention mechanism, the window’s receptive field is expanded, enhancing the model’s ability to focus on global features during the feature extraction process and thereby improving small target detection performance. Secondly, we design the C3KAN module, which improves the Bottleneck module in the neck of the baseline model’s C2f. In this module, the Conv layer is replaced by the KAN network, which better adapts to multi-scale variations in infrared target detection. This not only improves detection accuracy but also addresses the issues of false positives and missed detections in complex backgrounds and under dense occlusion conditions.
The sea surface infrared ship target is characterized by low contrast, a low signal-to-noise ratio, and so on. The shape and texture characteristics are not obvious, making it easy for the target to be overwhelmed by background clutter noise and missed in detection. Additionally, the sea surface is full of islands, waves, cumulus clouds, and other interferences similar in brightness amplitude to the ship, which can easily lead to false detection. Furthermore, near the ports in coastal areas, the high density of ships increases the likelihood of occlusion, resulting in missed detection. Due to these characteristics of infrared ship detection, traditional methods struggle with poor anti-interference capabilities and insufficient generalization, while existing detection algorithms based on deep learning lack specificity, making it difficult to achieve the accurate identification, classification, and localization of infrared ship targets.
By contrasting several network models, this research proposes an infrared ship target detection method, YOLO-IRS, with the aforementioned obstacles and difficulties in mind. The approach effectively solves the issue of target loss due to occlusion in high-density scenarios by exhibiting strong anti-interference capabilities and outstanding performance in small target detection. Through experimental comparisons, the efficacy of the approach presented in this study is confirmed: precision is increased by 1.3%, mean average precision (mAP50) is increased by 0.5%, and mAP50–95 is increased by 1.7%. The primary contributions of this article are as follows:
  • To capture target context information and enhance the recognition accuracy of small targets, this paper introduces the Swin Transformer module, which can expand the window field of view through a shifted-window multi-head self-attention mechanism, thereby enhancing the ability to focus on global features during the feature extraction process.
  • Addressing the problem of false detection and missed detection in complex backgrounds and dense occlusion, this paper designs the C3KAN module, which effectively improves the recognition accuracy of infrared ship targets.
  • A large number of experiments have been conducted on the public infrared ship dataset. The effects of the Swin Transformer module and the C3KAN module on the experimental results were tested through ablation experiments. Comparative experiments verify the anti-interference capability of the proposed YOLO-IRS algorithm and its effectiveness in detecting infrared ships in complex backgrounds.

2. Proposed Method

2.1. YOLOv10

In 2024, a research team from Tsinghua University proposed the YOLOv10 [27] model, which builds on the YOLO series to produce a high-performance, real-time, end-to-end target detector. Compared with its predecessor, YOLOv9, it has been optimized significantly with regard to both efficiency and accuracy. YOLOv10 was selected as the baseline network for the infrared ship target detection task in this article for the following reasons: First, it proposes consistent dual assignments for NMS-free training, i.e., it utilizes both one-to-many and one-to-one label assignments during training, which allows the model to maintain efficient training while eliminating reliance on non-maximum suppression (NMS) during inference. Second, it introduces a comprehensive efficiency–accuracy-driven model design strategy, which optimizes the components of YOLOs from the perspectives of both efficiency and accuracy, reducing the computational overhead of the model. It is appropriate for real-time applications or situations with constrained processing resources because it strikes a reasonable balance between speed and precision.
As the smallest version of the YOLOv10 model framework, the YOLOv10n model chosen for this study has a faster inference speed than other variations (e.g., YOLOv10s, YOLOv10m, YOLOv10b, YOLOv10l, and YOLOv10x). This makes it more appropriate for deployment and operation in contexts with limited resources and shows a notable benefit in real-time scenarios. Figure 1 depicts the YOLOv10 network structure.
YOLOv10 introduces an efficient partial self-attention (PSA) module in the backbone, as shown in Figure 1. After a 1 × 1 convolution, the features are evenly divided into two parts, one of which is fed into N_PSA blocks consisting of a multi-head self-attention (MHSA) module and a feed-forward network (FFN). The two parts are then fused by a 1 × 1 convolution. The PSA block is placed only after the fourth stage, which has the lowest resolution, avoiding the excessive overhead caused by the quadratic complexity of self-attention while enhancing model performance.
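For clarity, the following is a minimal PyTorch sketch of a PSA-style block as described above (1 × 1 convolution, channel split, MHSA plus FFN on one half, 1 × 1 fusion). It illustrates the structure only; it is not the official YOLOv10 implementation, and names such as PSASketch and the choice of num_heads are assumptions.

import torch
import torch.nn as nn

class PSASketch(nn.Module):
    # c must be even and c // 2 divisible by num_heads
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.pre = nn.Conv2d(c, c, 1)             # 1x1 conv before the channel split
        self.attn = nn.MultiheadAttention(c // 2, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                 # feed-forward network on the attended half
            nn.Conv2d(c // 2, c, 1), nn.SiLU(), nn.Conv2d(c, c // 2, 1)
        )
        self.post = nn.Conv2d(c, c, 1)            # 1x1 conv that fuses the two halves

    def forward(self, x):
        b, c, h, w = x.shape
        a, skip = self.pre(x).chunk(2, dim=1)     # split channels into two halves
        t = a.flatten(2).transpose(1, 2)          # (B, HW, C/2) tokens for self-attention
        t = t + self.attn(t, t, t)[0]             # multi-head self-attention with residual
        a = t.transpose(1, 2).reshape(b, c // 2, h, w)
        a = a + self.ffn(a)                       # FFN with residual
        return self.post(torch.cat((a, skip), dim=1))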

2.2. Swin Transformer

The network finds it more challenging to extract features from weak infrared targets (less than 32 × 32 pixels) because infrared images typically have blurred pixels and lower resolution than visible-light images. Additionally, infrared ship targets vary in scale, and the YOLO model still struggles to detect multi-scale targets in real time. Numerous researchers have put forward strategies to improve detector performance in order to overcome this issue. For example, Li et al. [3] used DWConv, dilated convolution, and SELayer modules to accomplish infrared ship detection in intricate scenarios by extracting multi-scale information at various network stages. In this paper, based on the YOLOv10 model, the Swin Transformer [28] module is introduced, which allows cross-window information flow between consecutive layers by shifting the window, making the computation more efficient while retaining the model's capacity to capture long-distance dependencies. This improves the detection of small infrared targets and low-resolution images.
The Swin Transformer gradually builds hierarchical feature maps at deeper Transformer levels by merging neighboring patches of an image. To achieve linear computational complexity, the Swin Transformer computes self-attention within non-overlapping local windows created by partitioning the image. The standard Transformer architecture computes self-attention globally, i.e., it calculates how each token relates to every other token. This global computation results in a computational complexity that is quadratic in the number of tokens and is not suitable for vision problems that require processing large-scale, high-dimensional data. Unlike traditional MSA modules, the Swin Transformer module is built on shifted windows. Figure 2 schematically depicts the Swin Transformer module's structure. It consists of a two-layer MLP with GELU (Gaussian Error Linear Unit) nonlinearity, LayerNorm (LN) layers, a multi-head self-attention module, and residual connections. Swin Transformer modules employ the window-based multi-head self-attention module (W-MSA) and the shifted-window-based multi-head self-attention module (SW-MSA) in succession. Consecutive Swin Transformer modules can therefore be expressed as
$$\hat{z}^{l} = \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1}$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l}$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$$
where $\hat{z}^{l}$ and $z^{l}$ denote the outputs of the W-MSA and MLP modules of the first block, respectively, and $\hat{z}^{l+1}$ and $z^{l+1}$ denote the outputs of the SW-MSA and MLP modules of the second block, respectively.
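As a concrete illustration of these equations, the following is a simplified PyTorch sketch of two consecutive Swin Transformer blocks (W-MSA followed by SW-MSA, each with LayerNorm, an MLP, and residual connections). Relative position bias and the attention mask for shifted windows are omitted, so this is a structural sketch under those simplifying assumptions rather than a faithful re-implementation of [28].

import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(win, ws, H, W):
    # inverse of window_partition
    B = win.shape[0] // ((H // ws) * (W // ws))
    x = win.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class SwinBlock(nn.Module):
    def __init__(self, dim, num_heads, ws=7, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by ws
        B, H, W, C = x.shape
        shortcut = x
        x = self.norm1(x)
        if self.shift:                                    # SW-MSA: cyclically shift the feature map
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)                # self-attention within local windows
        win, _ = self.attn(win, win, win)
        x = window_reverse(win, self.ws, H, W)
        if self.shift:                                    # undo the cyclic shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        x = shortcut + x                                  # residual around (S)W-MSA
        return x + self.mlp(self.norm2(x))                # residual around the MLP

# Two consecutive blocks: regular windows, then windows shifted by ws // 2.
pair = nn.Sequential(SwinBlock(96, 3, ws=7, shift=0), SwinBlock(96, 3, ws=7, shift=3))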

2.3. Kolmogorov–Arnold Network (KAN)

To address the target loss caused by multi-target occlusion near coastal ports in infrared ship target detection tasks, the Kolmogorov–Arnold network (KAN) [29] module is introduced in this paper.
According to the Kolmogorov–Arnold representation theorem, a multivariate continuous function can be expressed as a superposition of continuous functions of a single variable. Earlier work viewed neural networks through the lens of this theorem [30,31], but those networks were fixed at a depth of 2 and a width of 2n + 1 and were not trained via backpropagation, so their performance on high-dimensional data was subpar. Reference [29] built KANs with adjustable widths and deeper layers, analogous to MLPs, and demonstrated that KANs can fit functions more accurately than MLPs. For completeness, we give a brief overview of KANs here; a more thorough explanation can be found in [29].
A KAN module consists of multiple layers, and Figure 3 shows a two-layer KAN block. A KAN layer with $n_{\text{in}}$-dimensional inputs and $n_{\text{out}}$-dimensional outputs can be represented by a matrix of 1D functions:
$$\Psi = \left\{\phi_{q,p}\right\}, \quad p = 1, 2, \ldots, n_{\text{in}}, \quad q = 1, 2, \ldots, n_{\text{out}}$$
where the functions $\phi_{q,p}$ have learnable parameters. The shape of a KAN can be expressed as an integer array $[n_0, n_1, \ldots, n_L]$, where $n_l$ is the number of nodes (neurons) in the $l$-th layer (see Figure 3). The $i$-th neuron in the $l$-th layer is denoted by $(l, i)$, and the activation value of the $(l, i)$-neuron is denoted by $x_{l,i}$. There are $n_l n_{l+1}$ activation functions between consecutive layers $l$ and $l+1$. The activation function connecting the neurons $(l, i)$ and $(l+1, j)$ is denoted by
$$\phi_{l,i,j}, \quad l = 0, \ldots, L-1, \quad i = 1, \ldots, n_l, \quad j = 1, \ldots, n_{l+1}.$$
Unlike MLPs, as Figure 3 illustrates, the activation functions sit on the edges rather than on the nodes. Specifically, the pre-activation of $\phi_{l,i,j}$ is $x_{l,i}$, and its post-activation is $\tilde{x}_{l,i,j}$. The post-activations $\tilde{x}_{l,i,j}$ are then summed to obtain $x_{l+1,j}$, the activation value at the $(l+1, j)$-neuron:
$$x_{l+1,j} = \sum_{i=1}^{n_l} \tilde{x}_{l,i,j} = \sum_{i=1}^{n_l} \phi_{l,i,j}\left(x_{l,i}\right)$$
In matrix form, this can be written as
$$x_{l+1} = \underbrace{\begin{pmatrix} \phi_{l,1,1}(\cdot) & \phi_{l,1,2}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\ \phi_{l,2,1}(\cdot) & \phi_{l,2,2}(\cdot) & \cdots & \phi_{l,2,n_l}(\cdot) \\ \vdots & \vdots & & \vdots \\ \phi_{l,n_{l+1},1}(\cdot) & \phi_{l,n_{l+1},2}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot) \end{pmatrix}}_{\Psi_l} x_l,$$
where $\Psi_l$ denotes the function matrix of the $l$-th KAN layer. A general KAN is then a composition of $L$ layers:
$$\text{KAN}(x) = \left(\Psi_{L-1} \circ \Psi_{L-2} \circ \cdots \circ \Psi_{1} \circ \Psi_{0}\right)(x)$$
The original Kolmogorov–Arnold representation is a special case of a KAN with two layers and shape $[n_0, 2n_0 + 1, 1]$. By contrast, the well-known MLPs alternate between linear transformations $W$ and nonlinear activation functions $\sigma$:
$$\text{MLP}(x) = \left(W_{L-1} \circ \sigma \circ W_{L-2} \circ \sigma \circ \cdots \circ W_{1} \circ \sigma \circ W_{0}\right)(x)$$
MLPs handle the linear transformations $W$ and the nonlinearities $\sigma$ separately, whereas KANs combine both within $\Psi$.
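To make the layer composition above concrete, the following is a minimal sketch of a KAN layer in which each edge function $\phi_{l,i,j}$ is approximated by a weighted sum of Gaussian radial basis functions plus a SiLU base term. The reference implementation in [29] parameterizes the edge functions with B-splines, so this simplified parameterization (and names such as KANLayerSketch) should be read as an assumption for illustration only.

import torch
import torch.nn as nn

class KANLayerSketch(nn.Module):
    def __init__(self, n_in, n_out, num_basis=8, grid_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(grid_range[0], grid_range[1], num_basis)
        self.register_buffer("centers", centers)              # fixed basis-function centers
        self.h = (grid_range[1] - grid_range[0]) / (num_basis - 1)
        # one set of basis coefficients and one base weight per edge (i -> j)
        self.coef = nn.Parameter(torch.randn(n_out, n_in, num_basis) * 0.1)
        self.base = nn.Parameter(torch.randn(n_out, n_in) * 0.1)

    def forward(self, x):                                      # x: (B, n_in)
        # evaluate the basis functions for every input coordinate
        rbf = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.h) ** 2)   # (B, n_in, K)
        # post-activations, summed over the input index i to give the next-layer activations
        spline_part = torch.einsum("bik,oik->bo", rbf, self.coef)
        base_part = nn.functional.silu(x) @ self.base.t()
        return spline_part + base_part                         # (B, n_out)

# A two-layer KAN of shape [4, 8, 1], composed layer by layer.
kan = nn.Sequential(KANLayerSketch(4, 8), KANLayerSketch(8, 1))
out = kan(torch.randn(16, 4))                                  # (16, 1)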
The KAN network offers several advantages over the traditional Conv layer: its activation function is learnable and can adaptively adjust to better capture the features of the input data. The learnable activation function helps mitigate overfitting, especially when training samples are limited or the class distribution is imbalanced. In the context of the infrared ship target detection task in this paper, the key advantage of the KAN network is its ability to improve small target ship detection through more refined feature extraction. This is because the KAN network can learn more complex function mappings, which aids in the detection of small-sized ship instances.
In this paper, we improve the Bottleneck module in the neck of the baseline model’s C2f by replacing the Conv layer with the KAN network. The specific structure is shown in Figure 4. The Bottleneck module is typically used to reduce the dimensionality of the feature map and then restore it to the original dimension. This design helps to reduce computational load and increases the depth of the network. The KAN network can effectively learn low-dimensional feature representations, and using it in the Bottleneck module improves the overall efficiency and performance of the network.
The Bottleneck module achieves feature fusion through residual connections, which helps retain more detailed information. Due to its learnable activation function, the KAN network is better at capturing and integrating these features, particularly in the case of small targets. This ability enhances the expression of small target features. By reducing and then increasing the number of channels, the Bottleneck module helps the network to learn more generalized features. The learnable activation function of the KAN network further strengthens this generalization capability.
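Since the exact C3KAN wiring follows Figure 4, which is not reproduced here, the following is only a hedged structural sketch of the modified Bottleneck: the channel-mixing convolutions are replaced by a per-pixel (1 × 1) KAN mapping over the channel dimension, reusing the KANLayerSketch defined above, with the residual connection retained. Names and the hidden-width choice are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class KANBottleneckSketch(nn.Module):
    def __init__(self, c, hidden=None):
        super().__init__()
        hidden = hidden or c // 2                        # reduce, then restore, the channel count
        self.reduce = KANLayerSketch(c, hidden)          # KANLayerSketch defined above
        self.expand = KANLayerSketch(hidden, c)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.permute(0, 2, 3, 1).reshape(-1, c)         # treat every pixel as a c-dimensional sample
        t = self.expand(self.reduce(t))                  # KAN-based channel mixing
        t = t.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return x + t                                     # residual connection of the Bottleneck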

2.4. YOLO-IRS

In this paper, an algorithm based on YOLOv10, called YOLO-IRS, is proposed for infrared ship targets, as shown in Figure 4. YOLO-IRS adjusts the backbone and neck of YOLOv10. First, the C2f module is replaced by the C3KAN module designed in this paper, which achieves better results in the feature extraction of weak infrared ship targets and enhances the feature fusion of multi-scale ship targets, successfully identifying infrared ship targets of different scales within the same image. Second, to address the missed detections caused by the dense occlusion of infrared ships near coastal harbors, the Swin Transformer module is introduced before the C2fCIB module in the neck of the model. With these improvements, the detection rate of YOLO-IRS on infrared ship targets is further enhanced.

3. Experiments

The experimental dataset, apparatus, setup, and evaluation metrics of YOLO-IRS are first presented in this section. Furthermore, numerous studies are carried out to confirm the YOLO-IRS algorithm’s efficacy. All experimental settings and parameters are maintained constant, and no pre-training weights are employed in the training procedure in order to guarantee the impartiality and fairness of the ablation and comparison experiments.

3.1. Experimental Settings

The experiments in this study were run on Ubuntu 18.04 with an Intel(R) Core(TM) i9-9920X CPU at 3.50 GHz and an NVIDIA RTX A5000 (24 GB) GPU. The experiments were implemented in the PyTorch framework (version 2.0.1), and the main training parameters are listed in Table 1; all other parameters were set to the YOLOv10 defaults.
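For reference, a training call consistent with Table 1 might look as follows, assuming the Ultralytics-style training API used by the YOLOv10 reference code; the dataset YAML path is a placeholder, and argument names should be checked against the installed version.

from ultralytics import YOLO

model = YOLO("yolov10n.yaml")          # build the model from scratch, no pre-trained weights
model.train(
    data="infrared_ship.yaml",         # placeholder dataset config (train/val/test paths, 7 classes)
    epochs=500,
    imgsz=640,
    batch=64,
    workers=8,
    lr0=0.01,                          # initial learning rate
    mosaic=1.0,                        # MOSAIC data augmentation
    pretrained=False,
)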

3.2. Infrared Ship Dataset

The dataset used in this study comes from the InfiRay Infrared Open Platform [32], an open dataset devoted to infrared ships. It contains 8326 infrared ship images of different categories in various scenarios, gathered with infrared equipment of varying focal lengths and resolutions, with the targets in each image labeled. The dataset covers seven target categories: liners, bulk carriers, warships, sailboats, canoes, container ships, and fishing boats, in several scenes such as open sea, harbor, and seaside, captured at different times and at various resolutions. Most of the images are of low resolution: 4298 images at 384 × 288, 3073 at 1280 × 1024, 627 at 960 × 576, 163 at 640 × 512, 135 at 768 × 576, and 30 at 704 × 576. The images are concentrated around harbors and marinas, with the main targets near the marina being sailboats and canoes and the main targets near the harbor being fishing boats. The dataset is split into a training set (6660 images), a validation set (832 images), and a test set (834 images) in an 8:1:1 ratio. The large number of weak targets in this dataset makes the detection problem more challenging.
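A simple way to reproduce such an 8:1:1 split is sketched below; the directory layout, file names, and the fixed random seed are assumptions for illustration, not part of the released dataset.

import random
from pathlib import Path

random.seed(0)                                      # fix the split for reproducibility
images = sorted(Path("infrared_ships/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],                      # ~6660 images
    "val": images[n_train:n_train + n_val],         # ~832 images
    "test": images[n_train + n_val:],               # ~834 images
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in files))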
Figure 5 shows some instances in the dataset. The distribution of targets in the dataset is shown in Figure 6. Figure 6a shows the number of instances of each type of target in the dataset: Liner 1430, Bulk carrier 1877, Warship 2547, Sailboat 5777, Canoe 4935, Container ship 683, and Fishing boat 9118; Figure 6b shows the proportional distribution of anchor frames in the dataset; x, y in Figure 6c is the location of the target centroid; width and height in Figure 6d are the width and height of the target after normalization; the darker the color, the more centralized the target frame centroid is at that point. As seen in Figure 5, the targets in the dataset are categorized into large targets in the near and small targets in the far distance, and the aspect ratios have large differences.

3.3. Evaluation Metrics

The evaluation criteria employed in this work are mean average precision (mAP), recall, and precision, which are standard measures for assessing model performance in image target detection.
Based on the IoU between the predicted boxes and the ground-truth boxes, detection results can be categorized into four cases. Table 2 summarizes these cases for the infrared ship target detection task and provides a description of each.
(1) Precision and Recall
Precision (P) and recall (R) are calculated as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
Precision and recall are generally in tension: as the confidence threshold is raised, P tends to increase while R decreases, and vice versa.
(2) AP and mAP
According to the following formula, AP is the average precision of a single category, where R is the horizontal axis and P is the vertical axis. The P-R curve for each type of ship can be drawn, and the area surrounded by the curve and the horizontal and vertical axes is the value of AP:
$$AP = \int_{0}^{1} p(r)\,dr$$
The following equation illustrates how the mAP, which is the mean of the average precision of all the categories, can be used to gauge the model’s overall performance:
$$mAP = \frac{1}{N}\sum_{n=1}^{N} AP(n)$$
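The following sketch shows how these metrics can be computed from ranked detections: precision and recall from cumulative TP/FP counts, AP as the area under the P-R curve (here by a simple rectangle sum), and mAP as the mean over classes. Practical evaluation protocols add IoU matching and interpolation details that are omitted here.

import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-scores)                     # rank detections by confidence
    tp = np.cumsum(is_tp[order].astype(float))
    fp = np.cumsum((~is_tp[order]).astype(float))
    recall = tp / num_gt                            # R = TP / (TP + FN)
    precision = tp / (tp + fp)                      # P = TP / (TP + FP)
    # area under the P-R curve via recall increments
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

def mean_average_precision(per_class):
    # per_class: list of (scores, is_tp, num_gt) tuples, one per ship category
    return float(np.mean([average_precision(*c) for c in per_class]))

# Hypothetical example for one class:
# ap = average_precision(np.array([0.9, 0.8, 0.3]), np.array([True, False, True]), num_gt=2)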

3.4. Feature Fusion Network Performance

The impacts of the C3KAN and Swin Transformer modules on detection performance are examined in this section, and thorough trials show that the enhancements proposed in this study work effectively.
In this paper, the C3KAN module is constructed by fusing the C3 and KAN modules, and it replaces the C2f module in the network. A Swin Transformer module is also added to the network's neck. Because of the nature of infrared imaging, infrared images carry strong global context, and the Swin Transformer captures this global context information more effectively. The YOLO-IRS model presented in this study thus obtains a more thorough understanding of the entire infrared image and increases ship detection accuracy.
Three representative images from the dataset, covering weak ship targets, poor weather (sea fog), and complex backgrounds, were chosen for model testing in order to show how robust the proposed model is at identifying infrared ship targets. The detection results are compared with those of YOLOv10 and with the ground truth. Figure 7 displays the outcomes of the experiment.
Due to the imaging characteristics of infrared devices, small target ships at sea are usually more blurred, and during network feature extraction, as the convolutional layers become deeper, the target’s feature information could be lost, making it more challenging to locate the small target ships. This presents considerable challenges to the detection of small target ships, as demonstrated in Figure 7a. The enhanced network proposed in this research not only increases the ship recognition rate but also lowers the misdetection rate, whereas the original YOLOv10 has a low recognition rate for small target ships and even causes misdetections. The experimental findings demonstrate that the approach put forward in this study offers a practical answer for the ship recognition task in real-world settings.
In practical applications, infrared ship target detection is also affected by bad weather. Due to the frequent occurrence of foggy sea conditions, the ship images collected by infrared devices are usually more blurred and lack details, creating a challenge for quickly and accurately identifying the targets at sea. The detection results in this paper are shown in Figure 7b. It is evident that the approach proposed in this paper has clear benefits and produces superior detection results under the same weather conditions, demonstrating the method’s stability in sea fog.
The original network may have worse recognition accuracy in coastal regions due to land target interference, a more complex IR ship detection background, and other factors. This study successfully addresses this issue by altering the model structure to increase recognition accuracy. As shown in Figure 7c, compared to the original YOLOv10, the YOLO-IRS algorithm proposed in this article demonstrates better detection performance and anti-interference ability in complex marine environments.
Table 3 presents a comparison of detection performance for various ship targets under several typical scenarios, including weak small ship targets, harsh weather conditions (sea fog), and complex backgrounds, between the proposed YOLO-IRS network and the original YOLOv10 algorithm. It can be observed that the YOLO-IRS network outperforms YOLOv10 in detection accuracy across different scenarios for various target types. Specifically, mAP50 increased by 1.2% in the small target scenario, by 1% in the sea fog scenario, and by 1.1% in the complex background scenario.
Figure 8 illustrates the comparison of PR curves for YOLOv10 and YOLO-IRS. Table 4 lists the average accuracies of YOLOv10 and YOLO-IRS for each type of target. Since sailboats and canoes are usually located near docks with complex backgrounds, the background interference is strong, increasing the detection difficulty. Therefore, the average recognition accuracy of these two classes of targets is lower compared to other classes. Compared to YOLOv10, the YOLO-IRS network proposed in this article boosts detection accuracy (P) by 1.3% for all classes, 0.5% for mAP50, and 1.7% for mAP50–95.
To illustrate the effectiveness of the feature fusion module designed in this article, experiments were conducted to compare YOLOv10 with other fusion modules under the same dataset and parameter configurations; no pre-trained weights were used, to ensure fairness. BoTNet [33], C3_faster, C2f_faster, and ShuffleNet [34] were fused with YOLOv10 for the comparison experiments. Comparing P, parameter count, GFLOPS, and mAP shows that, with only a slight increase in computation and parameters, the proposed YOLO-IRS algorithm achieves a notable improvement in average detection accuracy. Table 5 displays the detection results, and Figure 9 displays the visualization results.

3.5. Ablation Experiment

Ablation experiments were conducted to verify the effectiveness of the improvement strategy involving the Swin Transformer network and the C3KAN module fused in this paper on model detection. By ensuring the same experimental equipment and parameter settings, an experimental comparison with a single improvement strategy was carried out. By comparing parameters like P, mAP50, number of parameters, GFLOPS, etc., the efficacy of the proposed approach is illustrated. Table 6 displays the outcomes of the experimental comparison.
Table 6 demonstrates that, with only a slight increase in computation and parameter count, the YOLO-IRS technique proposed in this paper significantly improves detection performance when compared to a single network enhancement. The Swin Transformer module in the fusion strategy extracts features from global information, which increases the computational load due to the more complex structure. Meanwhile, the C3KAN module focuses on channel spatial information features, enhancing detection accuracy and successfully lowering the missed detection rate. The detection accuracy of YOLO-IRS is 93.4%, which is significantly better than that of the single improvement strategy network. Compared to the baseline network YOLOv10, the detection accuracy is improved by 1.3%.

3.6. Comparison of Mainstream Detection Algorithms

To objectively evaluate the detection capability of the YOLO-IRS algorithm and further confirm its improvement in detection accuracy, the same parameter configuration and the same infrared ship dataset were used for all methods, and no pre-trained weights were employed, to guarantee fair experiments. Table 7 displays the results of the comparative trials against classical target detection algorithms, including RetinaNet, YOLOv8, and RT-DETR.
As shown in Table 7, YOLO-IRS outperforms the other comparison algorithms in terms of mAP50 while requiring only 3.96 M parameters and 11.6 G floating-point operations, significantly fewer computational resources than all compared models except YOLOv11. In small target and dense scene detection, the RetinaNet algorithm tends to lose target information. Moreover, in lower-resolution images, both RetinaNet and RT-DETR may miss some targets, and in dense ship scenes, recognition errors may occur. Additionally, a comparison with algorithms proposed in recent years, including YOLOv11, Dim2Clear, and GT-YOLO, was conducted. Compared to YOLOv11, the proposed algorithm achieved higher infrared ship detection accuracy with a limited increase in parameter count and computational load. Compared to Dim2Clear and GT-YOLO, our algorithm requires fewer computational resources and delivers higher recognition rates. The visualization results of infrared ship detection are shown in Figure 10.

4. Discussion

In this study, we proposed a target detection algorithm for infrared ships, named YOLO-IRS. The algorithm introduces the Swin Transformer module and designs the C3KAN module, which enhances the feature fusion of multi-scale ship targets compared to the baseline network and improves small target detection performance through attention mechanisms. Although the proposed algorithm has achieved promising results in detecting infrared ship targets in complex maritime environments, it still has certain limitations and its application scenarios are somewhat restricted. For instance, the detection performance on multi-modal images, such as visible light and SAR images, remains unknown, which will be a direction for future research.

5. Conclusions

Aiming at the problems of the low detection rates of infrared ship targets in complex ocean backgrounds and dense scenes and the inefficient feature extraction of weak ship targets in blurred scenes, which leads to insufficient detection capability, this paper proposes an algorithmic network, YOLO-IRS, using YOLOv10 as the algorithmic baseline. First, the paper introduces the Swin Transformer module, which solves the problem of missed target detection caused by the dense occlusion of infrared ships near coastal ports by extracting global image features. Second, the paper designs the C3KAN module, which achieves better results in the feature extraction of weak and small infrared ship targets, enhances the feature fusion of multi-scale ship targets, effectively detects infrared ship targets of different scales in the same scene, and improves the model’s detection performance for small targets. Compared with the original algorithm, YOLO-IRS improves the detection accuracy for all target classes, with detection accuracy (P) improved by 1.3%, mAP50 by 0.5%, and mAP50–95 by 1.7%. This paper also compares YOLO-IRS with other mainstream detection algorithms, and the results demonstrate the superiority of the algorithm proposed in this paper. The experimental results show that YOLO-IRS can effectively avoid the problems of misdetection and omission when dealing with small targets, dense scenes, blurred backgrounds, and complex backgrounds, thereby improving the accuracy of infrared ship detection. However, infrared ship targets in distant sea scenes become even weaker, making it challenging for the program to recognize them accurately, and future research will concentrate on resolving these problems.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, L.G., Y.W. and M.G.; formal analysis, Y.W. and X.Z.; investigation, Y.W. and X.Z.; resources, L.G. and Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W.; visualization, Y.W.; supervision, L.G.; project administration, L.G. and M.G.; funding acquisition, L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Strengthening Project of National Defense Science and Technology, grant number 2019-JCJQ-ZD-067-00.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the pioneer researchers in infrared ship detection and other related fields.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, X.; Xu, Y.; Wu, F.; Niu, J.; Cai, W.; Zhang, Z. Ground infrared target detection method based on a parallel attention mechanism (Invited). Infrared Laser Eng. 2022, 51, 20210290. [Google Scholar]
  2. Xie, F.; Dong, M.; Wang, X.; Yan, J. Infrared Small-Target Detection Using Multiscale Local Average Gray Difference Measure. Electronics 2022, 11, 1547. [Google Scholar] [CrossRef]
  3. Li, L.; Jiang, L.; Zhang, J.; Wang, S.; Chen, F. A Complete YOLO-Based Ship Detection Method for Thermal Infrared Remote Sensing Images under Complex Backgrounds. Remote Sens. 2022, 14, 1534. [Google Scholar] [CrossRef]
  4. Kou, R.; Wang, C.; Peng, Z.; Zhao, Z.; Chen, Y.; Han, J.; Huang, F.; Yu, Y.; Fu, Q. Infrared small target segmentation networks: A survey. Pattern Recognit. 2023, 143, 109788. [Google Scholar] [CrossRef]
  5. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  6. Guan, X.; Zhang, L.; Huang, S.; Peng, Z. Infrared small target detection via non-convex tensor rank surrogate joint local contrast energy. Remote Sens. 2020, 12, 1520. [Google Scholar] [CrossRef]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105. [Google Scholar]
  8. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 123, 13467–13488. [Google Scholar] [CrossRef]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  11. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  12. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small Object Detection Algorithm Based on Improved YOLOv8 for Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1734–1747. [Google Scholar] [CrossRef]
  13. Sun, Z.; Dai, M.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. An anchor-free detection method for ship targets in high-resolution SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7799–7816. [Google Scholar] [CrossRef]
  14. Xie, S.; Zhou, M.; Wang, C.; Huang, S. CSPPartial-YOLO: A Lightweight YOLO-Based Method for Typical Objects Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 388–399. [Google Scholar] [CrossRef]
  15. Yu, H.; Yang, S.; Zhou, S.; Sun, Y. Vs-lsdet: A multiscale ship detector for spaceborne sar images based on visual saliency and lightweight cnn. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1137–1154. [Google Scholar] [CrossRef]
  16. Sun, Z.; Leng, X.; Lei, Y.; Xiong, B.; Ji, K.; Kuang, G. BiFA-YOLO: A novel YOLO-based method for arbitrary-oriented ship detection in high-resolution SAR images. Remote Sens. 2021, 13, 4209. [Google Scholar] [CrossRef]
  17. Li, Y.; Huang, Q.; Pei, X.; Chen, Y.; Jiao, L.; Shang, R. Cross-Layer Attention Network for Small Object Detection in Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2148–2161. [Google Scholar] [CrossRef]
  18. Dai, Y.; Wu, Y.; Zhou, F.; Barnard, K. Asymmetric contextual modulation for infrared small target detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 950–959. [Google Scholar]
  19. Ye, J.; Yuan, Z.; Qian, C.; Li, X. Caa-yolo: Combined-attention-augmented yolo for infrared ocean ships detection. Sensors 2022, 22, 3782. [Google Scholar] [CrossRef]
  20. Zhang, M.; Zhang, R.; Zhang, J.; Guo, J.; Li, Y.; Guo, X. Dim2Clear network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–14. [Google Scholar] [CrossRef]
  21. Si, J.; Song, B.; Wu, J.; Lin, W.; Huang, W.; Chen, S. Maritime Ship Detection Method for Satellite Images Based on Multiscale Feature Fusion. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 2023, 16, 6642–6655. [Google Scholar] [CrossRef]
  22. Guo, H.; Gu, D. Closely arranged inshore ship detection using a bi-directional attention feature pyramid network. Int. J. Remote Sens. 2023, 44, 7106–7125. [Google Scholar] [CrossRef]
  23. Wang, Y.; Wang, B.R.; Huo, L.L.; Fan, Y.S. GT-YOLO: Nearshore Infrared Ship Detection Based on Infrared Images. J. Mar. Sci. Eng. 2024, 12, 213. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  25. Gong, M.; Zhao, H.; Wu, Y.; Tang, Z.; Feng, K.; Sheng, K. Dual Appearance-Aware Enhancement for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  26. Yuan, X.; Cheng, G.; Yan, K.; Zeng, Q.; Han, J. Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6294–6304. [Google Scholar]
  27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  28. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  29. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
  30. Sprecher, D.A.; Draghici, S. Space-filling curves and Kolmogorov superposition-based neural networks. Neural Netw. 2002, 15, 57–67. [Google Scholar] [CrossRef]
  31. Leni, P.-E.; Fougerolle, Y.D.; Truchetet, F. The kolmogorov spline network for image processing. In Image Processing: Concepts, Methodologies, Tools, and Applications; IGI Global: Hershey, PA, USA, 2013; pp. 54–78. [Google Scholar]
  32. InfiRay Dataset [OL]. Available online: http://openai.iraytek.com/apply/Sea_shipping.html/ (accessed on 15 March 2023).
  33. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
  34. Zhang, X. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
Figure 1. YOLOv10 network structure diagram.
Figure 2. Two consecutive Swin Transformer blocks.
Figure 3. Illustration of the activation functions flowing through the network.
Figure 4. YOLO-IRS network structure diagram.
Figure 5. Some instances in the dataset.
Figure 6. Characteristics of dataset distribution. (a) Number of instances of each type of target in the dataset; (b) proportional distribution of anchor frames in the dataset; (c) x, y: location of the target centroid; (d) width and height of the target after normalization.
Figure 7. Comparison of ship detection effect in three typical scenarios.
Figure 8. PR curve.
Figure 9. Visualization comparison of detection results under different scenes. (a) Detection results for weak and small targets; (b) detection results in a multi-scale dense scene; (c) detection results for small targets in a complex background.
Figure 10. Visualization comparison of detection results under different scenes. (a) Detection results for weak and small targets; (b) detection results in a multi-scale dense scene; (c) detection results for small targets in a blurred scene; (d) detection results in a complex background.
Table 1. Experiment main training parameter settings.

Name                Configuration
Learning rate       0.01
Data enhancement    MOSAIC
Epochs              500
Image size          640
Batch size          64
Workers             8
Table 2. Infrared ship target detection results.

Result          Ship                    Background
Ship            TP (True Positive)      FP (False Positive)
Background      FN (False Negative)     TN (True Negative)
Table 3. Comparison of detection accuracy across three typical scenarios. All values are mAP50 (%).

Class             Dim Target              Sea Fog                 Complex Background
                  YOLOv10    YOLO-IRS     YOLOv10    YOLO-IRS     YOLOv10    YOLO-IRS
Liner             88.5       91.1         90.1       92.2         90.1       92.4
Bulk carrier      94.9       96.4         95.9       96.3         95.7       96.8
Warship           98.3       98.9         98.7       99.1         98.2       99.1
Sailboat          89.7       90.6         88.1       90.5         89.3       90.8
Canoe             84.7       84.9         84.2       84.7         83.9       85.1
Container ship    96.3       97.2         97.1       97.9         97.1       97.5
Fishing boat      83.1       84.9         84.2       84.9         84.7       85.2
All               90.8       92.0         91.2       92.2         91.3       92.4
Table 4. Comparison results of average accuracy for each type of target.

                  YOLOv10                               YOLO-IRS
Class             P (%)   mAP50 (%)   mAP50–95 (%)      P (%)   mAP50 (%)   mAP50–95 (%)
Liner             93      90.7        55.9              96.9    92.6        58.4
Bulk carrier      91.5    96.7        80.1              94.4    97.1        82.5
Warship           96.9    99.2        83.2              98.8    99.4        85.6
Sailboat          87.7    90.2        56.5              90      91.3        58.5
Canoe             86.4    85.6        50.4              86.7    85.6        52.5
Container ship    97.7    98.4        81.4              95.2    98.1        80.2
Fishing boat      91.3    85.3        44.6              92      85.4        46.6
All               92.1    92.3        64.6              93.4    92.8        66.3
Table 5. Comparison of experimental results for different fusion models.

Model                                  P (%)   mAP50 (%)   GFLOPS   Parameters (M)
YOLOv10+BoTNet                         91.6    86.4        8.4      2.91
YOLOv10+BoTNet+C3                      90.6    86.2        8.0      2.79
YOLOv10+ShuffleNet+Swin Transformer    88      89          10.2     3.13
YOLO-IRS                               93.4    92.8        11.6     3.96
Table 6. Results of ablation experiments.

Model                         P (%)   mAP50 (%)   GFLOPS   Parameters (M)
YOLOv10                       92.1    92.3        8.2      2.69
YOLOv10+Swin Transformer      92.5    92.4        8.6      3.02
YOLOv10+C3KAN                 92.4    92.5        11.3     3.64
YOLO-IRS                      93.4    92.8        11.6     3.96
Table 7. Comparison of detection results of mainstream algorithms.

              mAP50 (%)
Model         Liner   Bulk Carrier   Warship   Sailboat   Canoe   Container Ship   Fishing Boat   All     GFLOPS   Parameters (M)
RetinaNet     76.21   66.45          76.73     86.87      87.68   89.21            75.69          79.83   70       32
YOLOv8        91.2    96.3           96.6      86.4       74.1    97.5             76.4           88.4    8.2      2.7
YOLOv11       88      97.2           99.3      90.5       85.7    98.3             85             92      6.3      2.58
RT-DETR       81.2    95.3           97.1      79.8       72.8    92.8             67.9           83.9    103.5    32
Dim2Clear     91.8    96.5           95.9      89.7       84.3    96.6             84.8           91.4    78.9     24.6
GT-YOLO       92.1    96.6           98.1      89.4       85.1    97.3             83.9           91.8    34.4     8.7
YOLO-IRS      92.6    97.1           99.4      91.3       85.6    98.1             85.4           92.8    11.6     3.96
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
