Article

E-WFF Net: An Efficient Remote Sensing Ship Detection Method Based on Weighted Fusion of Ship Features

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network, Hubei University of Technology, Wuhan 430068, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 985; https://doi.org/10.3390/rs17060985
Submission received: 10 January 2025 / Revised: 6 March 2025 / Accepted: 7 March 2025 / Published: 11 March 2025

Abstract

Ships are the main carriers of maritime transportation. Real-time detection of ships through remote sensing satellites is of great significance for ocean rescue, maritime traffic, border management, and related applications. In remote sensing ship detection, the complexity and diversity of ship shapes, along with scenarios involving ship aggregation, often lead to false negatives and false positives. The diversity of ship shapes can cause detection algorithms to fail to accurately identify different types of ships, and when ships are clustered together, a detection algorithm may mistakenly merge multiple ships into a single target or miss ships that are partially obscured. These factors reduce detection accuracy and robustness, increasing the challenges of remote sensing ship detection. In view of this, we propose a remote sensing ship detection method, E-WFF Net, based on YOLOv8s. Specifically, we introduced a data enhancement method based on elliptical rotating boxes, which increases sample diversity during the network training stage. We also designed a dynamic attention mechanism feature fusion module (DAT) to make the network pay more attention to ship characteristics. To improve the speed of network inference, we designed a residual weighted feature fusion method: by adding a feature extraction branch while simplifying the network layers, the inference speed of the network is accelerated. We evaluated our method on the HRSC2016 and DIOR datasets, and the results show improvements over YOLOv8 and YOLOv10, especially on the HRSC2016 dataset, where E-WFF Net achieves a detection accuracy of 96.1%, a 1% improvement over YOLOv8s and a 1.1% improvement over YOLOv10n. Its detection speed is 175.90 FPS, a 3.2% improvement over YOLOv8 and a 9.9% improvement over YOLOv10n.

1. Introduction

Remote sensing technology, as a non-contact information acquisition method, has important application value in ocean monitoring, shipping management, maritime safety, and other fields [1]. With the booming development of the shipping industry and the continuous growth of global trade, the number of ships at sea continues to increase and routes are becoming busier [2]. Therefore, timely and accurate monitoring of ship locations, navigation status, ship density distribution, and other information has become an important task for maritime management and safety assurance.
Traditional methods can detect objects effectively in certain specific scenes. They rely on handcrafted features for target classification and location regression [3], which results in low recognition accuracy and high computational complexity and limits their practical applications. Therefore, the focus of research on target recognition methods has gradually shifted to deep learning techniques.
In recent years, with the rapid development of deep learning [4] technology, some scholars have proposed object detection methods based on convolutional neural networks (CNNs) [5], and deep learning-based ship detection has increasingly become a research hotspot. Deep learning models can learn richer and more abstract feature representations from remote sensing images, enabling the efficient detection and recognition of ship targets. Examples include R-CNN [6], Faster R-CNN [7], SPPNet [8], and others. However, these two-stage object detection methods [9] typically have high computational complexity and poor real-time performance. Therefore, Redmon et al. (2016) proposed a one-stage approach, YOLO [10], to address this issue; it directly regresses bounding boxes and their classes in the output layer without a candidate-box screening process. The YOLO series of algorithms offer rapid computation and high accuracy, bringing a new breakthrough to remote sensing ship detection. Shao et al. [11] proposed a novel saliency-aware CNN framework based on the YOLOv2 [12] pipeline, utilizing CNNs to predict ship classes and approximate locations, thereby enhancing the accuracy of ship detection and its robustness under complex coastal monitoring conditions. Wang et al. [13] constructed a new dataset and employed data augmentation techniques for ship target detection based on YOLOv3 [14], but faced challenges in dealing with complex ship backgrounds. Zhang et al. [15] improved YOLOv5 [16] by using dilated convolution and optimized feature pyramids to enhance the network model's ability to represent features at different scales, but faced challenges in terms of model complexity and computational resource consumption. The sizes and shapes of ship objects in remote sensing images are diverse, and some ships may be very small or relatively slender within the image, which requires the model to have multi-scale detection capabilities. CNN models [17] are good at extracting local features but find it difficult to capture global relationships. Moreover, when fusing different input features, most of the above frameworks simply add them up without distinction, even though input features at different resolutions and scales usually contribute unequally to the fused output features.
To solve the above problems, Wu et al. [18] added a transformer for ship detection, which has strong global capture capability and can dynamically focus on the region of interest, making object detection more efficient. Cao et al. [19] integrated an efficient multi-scale attention (EMA) mechanism into the backbone feature extraction network of YOLOv7 [20] to enhance the model's perception of targets with large multi-scale differences in positional information. Zhang et al. [21] introduced the CBAM attention mechanism into YOLOv8 [22] to fuse spatial and channel feature information, improving the extraction accuracy of detection bounding boxes. Wang et al. [23] integrated the large selective kernel attention mechanism (LSK) into YOLOv8 to focus on key ship features. Li et al. [24] improved the backbone network of YOLOv8 by using a multi-head attention mechanism and enhanced the network's ability to extract diverse features. Guo et al. [25] introduced the Swin transformer into YOLOv10, enhancing the model's ability to focus on global features during feature extraction. Additionally, the Swin transformer [26] provides hierarchical feature maps in intelligent vehicle detection, effectively alleviating the multi-scale feature extraction problem. Sun et al. [27] proposed a blurred image feature recovery module based on the Swin transformer, which extracts multi-scale features through hierarchical construction and multi-stage processing, achieving more accurate vehicle detection. Huang et al. [28] used a Swin transformer reconstruction model to extract and reconstruct global image features, addressing the issue of limited samples in fabric pattern defect detection in actual production. Since the Swin transformer has shown exceptional performance in many fields, including natural language processing, vehicle detection, and defect detection, we also integrate the Swin transformer into E-WFF Net to improve ship detection accuracy.
Ship features mainly refer to the geometric features and context features of the ship. Geometric features refer to the length, width, and direction of the ship. Context features refer to the spatial relationship between the ship and its surroundings. The ship is distinguished from the background by the water scene (ocean, river, port) where the ship appears, thus improving the accuracy of detection. Due to the diverse types of ships, there are considerable differences in their sizes and aspect ratios, as shown in Figure 1. The images in Figure 1 come from the HRSC2016 and DIOR datasets, where Figure 1a–c,g,h are from the HRSC2016 dataset, and Figure 1d–f are from the DIOR dataset. The groupings in the figure are as follows: Figure 1a,b are images with clouds and fog; Figure 1c,d show ships arranged densely; Figure 1d,e show small sample ships; Figure 1f,g show ships with large differences in length-to-width ratios. Achieving accurate ship detection requires extracting multi-scale information from ships in various sea areas. Ships near ports are very similar to the surrounding buildings, and in complex backgrounds, small islands and nearby sea structures can easily lead to false detections. The dense distribution of ships at docks and at sea results in multiple targets overlapping, which reduces the accuracy of the model in detecting targets. Additionally, since ships float on the water’s surface, clouds and fog formed by water evaporation and condensation can interfere with ship detection based on remote sensing images. Suppressing harmful and non-informative background interference and focusing on key information are crucial for ship detection.
Based on the above analysis, this paper proposes a new ship detection method, E-WFF Net. It mainly includes three parts. The Backbone is responsible for extracting features from the input image. We added a dynamic attention mechanism module (DAT) to the penultimate layer of the Backbone so that the model can learn sparse attention patterns and model geometric transformations in a data-dependent manner, which highlights important information in the feature map and suppresses secondary information, thereby improving the robustness of the detection method. We added a feature extraction structure to the Neck and simplified the feature fusion layer to address the slow speed and low accuracy of ship detection in remote sensing images. Finally, the position and shape of the bounding box are predicted in the Head to generate the target box. In addition, we also introduced a data enhancement method based on an elliptical rotation box, which provides more training samples during the network training stage. The overall framework diagram of E-WFF Net is shown in Figure 2. The specific work is as follows:
(1)
Rotation augmentation method: To increase the data volume during the sample training phase and address the variation of ship targets at different angles, this paper introduces a rotation augmentation method based on elliptical bounding boxes to improve the network’s training effectiveness.
(2)
Dynamic attention mechanism (DAT): This paper designs a dynamic attention mechanism (DAT) module, which can adaptively adjust based on the characteristics of the input data, helping the network to automatically focus on important feature information, such as ships, thereby significantly improving detection accuracy.
(3)
Residual weighted feature fusion method: The method increases the feature extraction branches and simplifies the feature fusion layers, further improving the efficiency of feature fusion. Additionally, the network uses learnable weights to dynamically determine the importance of different input features, optimizing the feature fusion process and enhancing the overall performance of the model.

2. Methods

In this section, we introduce the overall architecture of E-WFF Net, which consists of three parts: Backbone, Neck, and Head. Specifically, the Backbone is composed of 5 CBS blocks, 4 C2f blocks, 1 DAT module, and 1 SimSPPF module. Its main role is to extract multi-scale feature information from the input remote sensing images, providing the foundation for subsequent detection tasks. CBS consists of a convolutional layer (Conv), a batch normalization layer (BN), and a SiLU activation function, and is responsible for extracting meaningful low-level feature information from the input images. C2f effectively handles and fuses information from different scales through channel splitting and channel concatenation. The DAT module is described in detail in Section 2.2. The SimSPPF module enhances the ability to detect ship targets of various sizes by applying pooling operations at different scales, providing stronger adaptability to targets of different dimensions. The Neck has four branches and uses the residual weighted feature fusion method described in detail in Section 2.3. The role of UpSample is to perform 2× up-sampling. The Fusion module assigns appropriate weights to each layer and module based on their contributions, which are learned automatically during training, thus optimizing detection performance. The Head is responsible for outputting the final detection results. It predicts the category, location, and confidence of the targets from the features extracted by the network and ensures the accuracy and robustness of the detection results through steps such as non-maximum suppression.
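As a concrete illustration of the CBS block described above, the following minimal PyTorch-style sketch stacks a convolution, batch normalization, and SiLU activation; the kernel size, channel counts, and input resolution are placeholder assumptions, not the exact configuration used in E-WFF Net.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic building block described above."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # keep spatial size when stride == 1
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Example: a stride-2 CBS halves the spatial resolution, as in the Backbone.
x = torch.randn(1, 3, 640, 640)
print(CBS(3, 32, stride=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```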
The Backbone provides the basic feature maps for the Neck by extracting features at different scales. The Neck further processes these feature maps using weighted fusion, optimizing features from different layers, allowing the network to better integrate features from various levels. The output from the Neck is passed to the Head, where final object detection is performed based on these features, generating outputs such as classification, bounding boxes, and confidence scores. The three modules work together to form an efficient object detection system, ensuring the network can extract useful features from information at different scales and levels, ultimately improving detection performance.

2.1. Rotation Enhancement Based on Elliptical Bounding Box

In this paper, we mainly use rotation augmentation to enhance the dataset samples while still using horizontal bounding boxes. The purpose of rotation augmentation is to generate new training samples by rotating images, thereby improving the generalization ability of the model, especially when dealing with ships at different angles or poses.
When collecting datasets and creating labels, we found that ships have many different shapes and orientations, so the coordinates of the target's horizontal box cannot be obtained accurately after the remote sensing image is rotated if only the four annotation parameters of the horizontal box are used [29]. As shown in Figure 3, rotating the original ground truth box by the same angle and taking its largest circumscribed horizontal rectangle yields an overly large ground truth box. This paper therefore proposes a rotation enhancement method based on the target's ellipse equation and a five-parameter annotation. The five parameters of the rotated box are the target center coordinates (abscissa and ordinate), the target width and height, and the target rotation angle, where 0 degrees corresponds to the positive half of the Y axis. According to the width and height of the target, the minimum circumscribed ellipse of the target is constructed based on the ellipse equation. The formula is as follows:
y = \sqrt{\frac{H^2}{4} - \frac{H^2 x^2}{W^2}}
A set of boundary points (x_i, y_i) of the target is obtained from the minimum circumscribed ellipse equation; the target rotation angle θ is then applied through the affine transformation below to compute the rotated ellipse and obtain the coordinate set Ω of the rotating ellipse. The formula is as follows:
\Omega = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}
Figure 4 illustrates this process and shows how the five parameters are labeled. Compared with the rectangular box in Figure 3, the rotation enhancement based on the elliptical box marks the ship more accurately and reduces the network's loss. Figure 5 shows HRSC2016 dataset labels after adding rotation enhancement: the original labels provided on the dataset's official website are shown as red rectangular boxes, and the blue rectangles indicate the labels after rotation enhancement.
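To make the procedure above concrete, the following sketch (assuming NumPy and the five-parameter convention described above, with the rotation applied about the box center) samples points on the minimum circumscribed ellipse, rotates them by θ through the affine transform, and takes the axis-aligned box of the rotated point set as the new horizontal label. The function name, argument order, and sampling density are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def rotated_hbox_from_ellipse(cx, cy, w, h, theta_deg, n_points=360):
    """Approximate the horizontal box of a target rotated by theta.

    The target is modeled by its minimum circumscribed ellipse
    (semi-axes w/2 and h/2); the ellipse points are rotated with the
    affine transform Omega = R(theta) [x, y]^T and the new axis-aligned
    box is taken over the rotated point set.
    """
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    # Points on the ellipse centered at the origin: x^2/(W/2)^2 + y^2/(H/2)^2 = 1
    pts = np.stack([(w / 2.0) * np.cos(t), (h / 2.0) * np.sin(t)], axis=0)  # (2, N)

    theta = np.deg2rad(theta_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = rot @ pts  # coordinate set of the rotated ellipse

    xs, ys = rotated[0] + cx, rotated[1] + cy
    return xs.min(), ys.min(), xs.max(), ys.max()  # (xmin, ymin, xmax, ymax)

# A slender ship (w >> h) rotated by 30 degrees.
print(rotated_hbox_from_ellipse(cx=200, cy=150, w=180, h=40, theta_deg=30))
```

Because the ellipse is inscribed in the original box, the resulting horizontal label is tighter than the circumscribed rectangle of the rotated rectangular box shown in Figure 3.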

2.2. DAT Module

The existing hierarchical visual attention mechanism, the Swin transformer, adopts a hierarchical design and shifted windows to focus on features. However, in traditional attention mechanisms the spatial structure of the input data is fixed, and all positions are treated equally. In remote sensing images, ships are typically slender, and their sizes and shapes vary widely; when multiple ship targets lie side by side, accurate detection becomes more difficult. To enhance ship detection capabilities, particularly in terms of adaptability to ship deformation, cross-scale detection, and local feature representation, this paper proposes a dynamic attention mechanism module (DAT) based on the deformable attention transformer [30], as shown in Figure 6a. The DAT module is composed of an offset network and a multi-head attention mechanism, shown as the offset and multi-head attention parts of Figure 6, which illustrates the information flow of the DAT module. Figure 6b is the offset network (θoffset(·)), which corresponds to the offsets in Figure 6a. The overall process of the DAT module is as follows: first, a set of evenly placed reference points is generated on the feature map, and their offsets are learned from the query vectors by the offset network; then, features are sampled at the dynamically offset points and projected to obtain new key and value vectors; finally, the output ship features are computed by the multi-head attention mechanism. The overall design aims to enhance the model's perception and utilization of local spatial information in the input image, especially when positional changes occur, enabling the model to better handle deformed or altered input data. The specific process is as follows:
We used an input feature map x ∈ R^{H×W×C}, where H was the height of the feature map, W was the width of the feature map, and C was the number of feature channels at each position. Our goal was to generate a uniform grid of reference points used to sample the input feature map and calculate attention.
(1) Process of generating grid reference points: We first generated a uniform grid point set p ∈ R^{H_G×W_G×2}, where H_G and W_G were the height and width of the grid, respectively, and 2 means that each point has two coordinates (horizontal and vertical). The grid size was down-sampled from the input feature map size by a factor of r, H_G = H/r, W_G = W/r. The reference points were 2D coordinates spaced linearly and uniformly from (0, 0) to (H_G − 1, W_G − 1), representing the coordinate of each position in the grid. The coordinates were then normalized to the range [−1, +1] according to the grid shape H_G × W_G, where (−1, −1) represents the upper left corner and (+1, +1) represents the lower right corner.
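A minimal sketch of this step, assuming PyTorch tensors; the grid is built at the down-sampled resolution H/r × W/r and normalized to [−1, +1] as described above.

```python
import torch

def reference_points(H, W, r, device="cpu"):
    """Uniform grid of reference points, normalized to [-1, +1]."""
    HG, WG = H // r, W // r
    ys = torch.linspace(0, HG - 1, HG, device=device)
    xs = torch.linspace(0, WG - 1, WG, device=device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    ref = torch.stack([grid_y, grid_x], dim=-1)          # (HG, WG, 2)
    ref[..., 0] = ref[..., 0] / max(HG - 1, 1) * 2 - 1   # rows -> [-1, +1]
    ref[..., 1] = ref[..., 1] / max(WG - 1, 1) * 2 - 1   # cols -> [-1, +1]
    return ref

print(reference_points(H=32, W=32, r=8).shape)  # torch.Size([4, 4, 2])
```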
(2) Get the offset: We used a weight matrix Wq to perform linear projection to map the input feature map x to the query tag q. That is:
q = W_q x
This query vector q was then used as the basis for calculating the offset of the reference point. The query vector q was input into the lightweight sub-network θoffset(⋅) to generate the offset Δp of the reference point:
\Delta p = \theta_{\mathrm{offset}}(q)
The lightweight sub-network θoffset(⋅), as shown in Figure 6b, first uses a 5 × 5 depthwise separable convolution [31] to obtain local features, because each convolution kernel only focuses on the 5 × 5 range area in the input feature map, and each channel can retain its features. Then, three parallel RELU [32] activation functions (σ(⋅)) and 1 × 1 convolution [33] are used to obtain the two-dimensional offset, and the average of these three two-dimensional offsets is taken as the final offset value. The main reason for using three branches is that three parallel branches can extract features from different perspectives. Each branch focuses on different image regions or local information, which increases the network’s ability to perceive diverse features. Compared to a single branch, three branches can capture more varied feature representations, thereby improving the accuracy of offset prediction. Additionally, the three parallel branches make the network more robust when facing various changes, such as variations in the target’s angle or scale. The features learned by each branch are sensitive to different input transformations, so multiple branches can complement each other, allowing the network to better adapt to different types of inputs and improve its performance in complex scenarios. One of the operations is shown below:
\theta_{\mathrm{offset}}(\cdot) = \sigma\!\left( \sum_{m=-1}^{1} \sum_{n=-1}^{1} I(i-m,\, j-n) \cdot K(m,\, n) \right)
Among them, I · K represents the result of the convolution operation, (i, j) is the pixel coordinate in the output image, (m, n) is the index within the convolution kernel, I(i − m, j − n) represents the pixel value at the corresponding position in the input image, and K(m, n) indicates the corresponding weight in the convolution kernel. Through the lightweight sub-network θoffset(⋅), the locations of the reference points change, and the model focuses more on ship characteristics.
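The following hedged PyTorch sketch illustrates one possible form of the offset sub-network θoffset(·): a 5 × 5 depthwise convolution followed by three parallel ReLU + 1 × 1 convolution branches whose two-dimensional offsets are averaged. The channel count and tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """theta_offset(.): depthwise 5x5 conv, then three parallel ReLU + 1x1 conv
    branches producing 2-channel offsets that are averaged."""
    def __init__(self, channels):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=5,
                                padding=2, groups=channels, bias=False)
        self.branches = nn.ModuleList([
            nn.Sequential(nn.ReLU(inplace=True),
                          nn.Conv2d(channels, 2, kernel_size=1))
            for _ in range(3)
        ])

    def forward(self, q):
        local = self.dwconv(q)                           # local 5x5 context per channel
        offsets = [branch(local) for branch in self.branches]
        return torch.stack(offsets, dim=0).mean(dim=0)   # (B, 2, Hg, Wg)

q = torch.randn(1, 64, 16, 16)    # query features on the reference grid
print(OffsetNet(64)(q).shape)     # torch.Size([1, 2, 16, 16])
```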
(3) Feature sampling: The feature map is sampled at the reference point after dynamic shift. The input of the sampling function is the feature map x and the dynamically changed position p + Δp, and the output is the feature map x′:
x' = \phi(x;\, p + \Delta p)
The sampling process uses a bilinear interpolation function ϕ (·; ·) to ensure that the operation is differentiable. Bilinear interpolation computes a weighted average of the features at the four closest grid points:
\phi(z;\, (p_x, p_y)) = \sum_{(r_x, r_y)} g(p_x, r_x)\, g(p_y, r_y)\, z[r_y, r_x, :]
Among them, g(a, b) = max(0, 1 − |a − b|) is a smooth interpolation weight and (r_x, r_y) ranges over all positions on z ∈ R^{H×W×C}.
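As a small worked example of the interpolation weight g and the sampling function φ above, the following NumPy sketch evaluates φ at a single 2D position; only the four nearest grid points receive non-zero weight, so the loop is restricted to them. Practical implementations would typically rely on torch.nn.functional.grid_sample instead.

```python
import numpy as np

def g(a, b):
    """Smooth interpolation weight g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def phi(z, px, py):
    """phi(z; (px, py)): weighted average over the four nearest grid points."""
    H, W, C = z.shape
    out = np.zeros(C)
    for ry in (int(np.floor(py)), int(np.floor(py)) + 1):
        for rx in (int(np.floor(px)), int(np.floor(px)) + 1):
            if 0 <= ry < H and 0 <= rx < W:
                out += g(px, rx) * g(py, ry) * z[ry, rx, :]
    return out

z = np.arange(2 * 2 * 1, dtype=float).reshape(2, 2, 1)  # tiny 2x2 feature map
print(phi(z, px=0.5, py=0.5))  # average of the four corners -> [1.5]
```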
(4) Multi-head attention mechanism: We performed weight matrix projections on the sampled feature x′ to obtain a new key k′ and a new value v′:
k' = W_k x', \quad v' = W_v x'
After that, a multi-head attention operation was performed on the query q, key k′, and value v′:
Z^{(n)} = \sigma\!\left( \frac{q^{(n)} {k'}^{(n)\top}}{\sqrt{d}} + \phi(\hat{B};\, P) \right) v'^{(n)}, \quad n = 1, 2, \ldots, N
In the transformer model, a relative position bias table \hat{B} \in \mathbb{R}^{(2H-1) \times (2W-1)} was constructed to store the values of the relative position bias. As in previous work, the relative position bias value can be obtained by indexing this table. Finally, the features of the N heads were concatenated and projected through W_0 to obtain the final output Z:
Z = \mathrm{Concat}\left(Z^{(1)}, \ldots, Z^{(N)}\right) W_0
The DAT module dynamically adjusts the attention range by combining offsets and multi-head attention mechanisms, allowing the model to more flexibly adapt to ship deformation, cross-scale target detection, and local feature extraction, thereby improving the accuracy and robustness of ship detection tasks.
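Putting steps (1)–(4) together, the sketch below shows one plausible forward pass in PyTorch: data-dependent offsets shift a normalized reference grid, features are bilinearly sampled at the shifted points with torch.nn.functional.grid_sample, and multi-head attention is computed between the dense queries and the sampled keys and values. This is a simplified illustration under stated assumptions (single sampling group, relative position bias omitted), not the exact DAT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedDAT(nn.Module):
    """Simplified deformable-attention block: dynamic sampling + multi-head attention."""
    def __init__(self, dim, num_heads=4, r=2):
        super().__init__()
        self.num_heads, self.r = num_heads, r
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)
        # offset network: depthwise 5x5 conv -> ReLU -> 1x1 conv -> 2-channel offset
        self.offset = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, 2, 1),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x)                                             # dense queries

        # (1) uniform reference grid at reduced resolution, normalized to [-1, 1]
        Hg, Wg = H // self.r, W // self.r
        ys = torch.linspace(-1, 1, Hg, device=x.device)
        xs = torch.linspace(-1, 1, Wg, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack([gx, gy], dim=-1).expand(B, Hg, Wg, 2)  # (x, y) order

        # (2) data-dependent offsets predicted from the (pooled) queries
        q_small = F.adaptive_avg_pool2d(q, (Hg, Wg))
        dp = self.offset(q_small).permute(0, 2, 3, 1)             # (B, Hg, Wg, 2)
        pos = (ref + torch.tanh(dp)).clamp(-1, 1)

        # (3) bilinear sampling of the feature map at the shifted points
        x_s = F.grid_sample(x, pos, mode="bilinear", align_corners=True)

        # (4) multi-head attention between dense queries and sampled keys/values
        k = self.k(x_s).flatten(2)                                # (B, C, Hg*Wg)
        v = self.v(x_s).flatten(2)
        q = q.flatten(2)                                          # (B, C, H*W)
        d = C // self.num_heads
        q = q.view(B, self.num_heads, d, -1).transpose(-2, -1)    # (B, h, HW, d)
        k = k.view(B, self.num_heads, d, -1)                      # (B, h, d, Ns)
        v = v.view(B, self.num_heads, d, -1).transpose(-2, -1)    # (B, h, Ns, d)
        attn = torch.softmax(q @ k / d ** 0.5, dim=-1)
        out = (attn @ v).transpose(-2, -1).reshape(B, C, H, W)
        return self.proj(out)

print(SimplifiedDAT(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```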

2.3. Residual Weighted Feature Fusion Method

This section primarily addresses the issue of improving the ship detection speed while ensuring detection accuracy. To maintain accuracy, we used a “normalization-based weighting method” in the Fusion module, focusing on key branches that contained ship information. Secondly, to improve detection speed, we employed “multi-branch residual fusion” in the Neck of the network, reducing branches with only a single output node. To prevent significant differences between the input image and deeper network branches, which may lead to the loss of detailed information, we added an extra branch at the “P2 layer” for feature extraction. The details are as follows:
To further reduce the network inference time, we adopted a residual weighted feature fusion method in the Neck of the E-WFF Net network. Considering that input features of different resolutions contribute unequally to the output features, we added an extra weight to each input feature in the Fusion module to enable the network to learn the importance of each input feature. The weight calculation method in the Fusion module is as follows:
O = \sum_{i} \frac{\delta(W_i)}{\varepsilon + \delta\!\left(\sum_{j} W_j\right)}
We used the above equation for normalized fusion, where δ is the ReLU function, which has strong nonlinear fitting capability and effectively prevents vanishing gradients, and ε is a small constant that keeps the denominator from approaching zero.
In Figure 7, P_2^{in}, P_4^{in}, P_6^{in}, and P_{10}^{in} represent the inputs of the P2, P4, P6, and P10 layers, respectively; P_2^{m1}, P_2^{m2}, P_4^{m}, and P_6^{m} represent the corresponding intermediate layers; and P_2^{out}, P_6^{out}, and P_{10}^{out} represent the outputs of the P2, P6, and P10 layers, respectively. As an example, we describe the feature fusion of the P6 layer in E-WFF Net. First, P_6^{in} undergoes a 1 × 1 convolution and is fused with P_{10}^{in} after 2× up-sampling to obtain P_6^{m}. Then, P_2^{m1} undergoes a 3 × 3 convolution and is fused with P_6^{m} and P_6^{in} to obtain P_6^{out}. All other features are constructed in a similar manner:
P_6^{m} = \mathrm{Conv}\!\left( \frac{W_1 \cdot \mathrm{Conv}(P_6^{in}) + W_2 \cdot \mathrm{UpSample}(P_{10}^{in})}{W_1 + W_2 + \varepsilon} \right)
P_6^{out} = \mathrm{Conv}\!\left( \frac{W_1 \cdot P_6^{in} + W_2 \cdot P_6^{m} + W_3 \cdot \mathrm{Conv}(P_2^{m1})}{W_1 + W_2 + W_3 + \varepsilon} \right)
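A minimal sketch of the normalization-based weighted fusion used in the Fusion module, assuming PyTorch and that the input feature maps have already been brought to a common resolution (e.g., by the Conv and UpSample operations in the equations above); the weight initialization and ε value are illustrative.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse n same-shape feature maps with learnable, ReLU-normalized weights."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per branch
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)               # keep weights non-negative (ReLU, as above)
        w = w / (w.sum() + self.eps)         # normalize by (W_1 + ... + W_n + eps)
        return sum(wi * f for wi, f in zip(w, feats))

# e.g. fusing the two branches that form P6_m in the equations above
fuse = WeightedFusion(n_inputs=2)
a, b = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
print(fuse([a, b]).shape)  # torch.Size([1, 256, 40, 40])
```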
Following the research of Tan et al. [34], we found that the feature extraction effect in the Neck could be improved by removing nodes with only one input edge, because such nodes contribute little to the fusion of different features. Therefore, we removed nodes with only one input edge in the Neck and added a skip residual structure in the P6 layer in Figure 2. Figure 7 shows the residual fusion method, and the Neck in Figure 2 uses the residual fusion method from Figure 7. We also found that ships appear small and slender in remote sensing images, and as the number of convolutional layers increases, ship features are more easily lost. In YOLOv8s, the spatial resolution of the input image features is gradually reduced by layer-by-layer down-sampling. For example, by the time the features reach the P4 layer, they have been down-sampled by a factor of 8, which leads to the loss of image details. Especially for small target detection, excessive down-sampling blurs target edge information and affects the final detection performance. To alleviate this problem, we added the P2 layer to the network, whose feature map is down-sampled by a factor of only 4 relative to the original input image and retains more detailed information than the P4 layer. The introduction of the P2 layer provides richer local information during feature fusion, enhances the network's perception of small targets, and enables the detection network to extract target features at different scales more effectively, thereby improving detection performance.

3. Experimental Results and Analysis

This study compared the proposed E-WFF Net with other comparative methods using the HRSC2016 [35] and DIOR datasets [36].

3.1. Datasets and Evaluation Metrics

HRSC2016 dataset: The HRSC2016 dataset is a remote sensing object detection dataset characterized by high resolution, diversity, and specificity. It consists of 1061 high-resolution remote sensing images collected by Northwestern Polytechnical University from Google Earth. The image sizes range from 300 × 300 pixels to 1500 × 900 pixels. The dataset contains 2976 instances of ship objects from three major categories and defines three task levels (L1, L2, and L3), which contain 1 class, 4 classes, and 19 classes, respectively; we evaluated our method on task L1. We used 617 training images and 444 validation images.
DIOR dataset: The DIOR dataset is a remote sensing object detection dataset with diversity, richness, and pre-processing completion. It contains 23,463 remote sensing images and 190,288 target instances, covering 20 object categories including airplanes, train stations, highways, ports, ships, etc. The image size in the dataset is 800 × 800 pixels, with a spatial resolution of 0.5~30 m. For ship detection, we selected 11,725 ship images as our ship dataset. Among them, 8208 images were used as the training set, and 3517 images were used as the validation set.
Evaluation metrics: We used an evaluation protocol in which a detection was counted as correct if the intersection over union (IoU) of the predicted box and the ground truth box exceeded a given threshold. The number of correctly detected ships was denoted TP, the number of false detections FP, and the number of undetected ships FN. The ratio of correctly detected ships to the total number of detected ships was denoted as Precision, and the ratio of correctly detected ships to the total number of real ships was denoted as Recall. The F1 score is defined as the harmonic mean of Precision and Recall, taking both into consideration.
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
F1 = \frac{2PR}{P + R}
We drew the P–R curve based on this value and obtained the AP and mAP. AP is defined as the area between the P–R curve and the coordinate axis:
AP = \int_{0}^{1} P(R)\, dR
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
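The following hedged NumPy sketch shows how Precision, Recall, F1, and AP (as the area under the P–R curve) can be computed from confidence-ranked detections; the all-point interpolation used for the integral is a common convention and may differ in detail from the evaluation tool actually used in the experiments.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(tp_flags, n_gt):
    """AP over detections sorted by confidence (descending).

    tp_flags[i] is 1 if the i-th highest-confidence detection matched a
    ground-truth ship (IoU above the threshold), else 0; n_gt is the
    total number of ground-truth ships.
    """
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)              # cumulative true positives
    fp = np.cumsum(1.0 - tp_flags)        # cumulative false positives
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # make precision monotonically non-increasing, then integrate P(R) dR
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate(([0.0], recall))
    return float(np.sum(np.diff(recall) * precision))

print(precision_recall_f1(tp=90, fp=5, fn=10))     # Precision, Recall, F1
print(average_precision([1, 1, 0, 1, 0], n_gt=4))  # AP for a toy confidence ranking
```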

3.2. Experimental Environment and Hyperparameters

The experimental environment and parameter settings are shown in Table 1. We set the total number of iterations to 300, adjusted the learning rate with the cosine annealing algorithm, and used SGD as the optimizer. These parameter settings speed up the convergence of the algorithm. To enable fair comparisons, the ablation experiments in this article were carried out with the same network configuration parameters, while the comparative experiments followed the initial configurations of the other methods.
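For reference, a minimal PyTorch sketch of the training schedule described above (SGD with cosine-annealing learning-rate decay over 300 iterations); the learning rate, momentum, and weight decay are placeholders rather than the exact values in Table 1.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,  # placeholder values
                            momentum=0.937, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # ... forward pass, loss computation, and loss.backward() over the training set ...
    optimizer.step()        # update weights (gradients produced inside the epoch loop)
    optimizer.zero_grad()
    scheduler.step()        # cosine-annealed learning rate, as described above
```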

3.3. Comparison with Other Algorithms

We compared and analyzed the proposed method against several advanced methods on the HRSC2016 and DIOR datasets to verify its effectiveness. The compared methods included Center Net [37], Efficient Net [38], YOLOv5s [39], YOLOv6 [40], YOLOv7-Tiny [41], YOLOv8s [22], and YOLOv10n [42], as well as the ship detection methods B-RSD Net, CHPDet [43], and CRAS-YOLO [44]. To use the same evaluation metrics, we added elliptical rotation enhancement to the HRSC2016 and DIOR datasets. The quantitative comparison results of the various methods on the HRSC2016 dataset are presented in Table 2, and the corresponding results on the DIOR dataset are shown in Table 3. Bold values in the tables indicate the best results.
The results showed that E-WFF Net achieved the highest average precision (mAP) on both the HRSC2016 and DIOR datasets.
Table 2 presents the results on the HRSC2016 dataset, where E-WFF Net has the fastest FPS and the best mAP value. Compared to popular object detection methods, the FPS is increased by up to 131.79 FPS and the accuracy by up to 40.3%. Compared to the latest method (YOLOv10), the improvement is 1.1%. Compared to the ship detection methods, the FPS increased by 12.33 FPS, 8.32 FPS, and 5.26 FPS, respectively, and the accuracy improved by 6.2%, 5.5%, and 1.9%, respectively. In addition, our method has the fewest network parameters, with the residual weighted feature fusion playing a key role. In terms of F1 scores, YOLOv8s has a higher recall value but may produce more false detections, while our improved DAT module extracts ship features more accurately. In general, our approach makes it possible to accurately separate ship targets even against complex terrain backgrounds. Furthermore, Table 2 shows the training times of E-WFF Net and the other models on the HRSC2016 dataset. Center Net, Efficient Net, and YOLOv6 are more complex models that typically require more computational resources, resulting in longer training times. YOLOv7-Tiny and YOLOv5s are lightweight models with shorter training times. B-RSD Net, CHPDet, and CRAS-YOLO are also improved lightweight detection models, comparable to YOLOv8s. The E-WFF Net model in this paper balances detection speed and accuracy, with a training time at a moderate level.
Table 3 presents the results on the DIOR dataset, where E-WFF Net achieved the best detection performance. Compared to the baseline YOLOv8s, E-WFF Net showed a 0.2% increase in mAP and a 5.94 FPS increase in FPS. Quantitative results indicate that E-WFF Net can effectively handle complex ship features.
In Table 2 and Table 3, we conducted extensive experiments to demonstrate that the improvement in mAP is not caused by fluctuations. The mAP values we report are average precision values calculated from multiple experiments, accounting for the randomness in each training process. These results have been validated through repeated experiments, and the mAP consistently shows an overall improvement within a relatively stable range.
Table 4 shows the comparison results after reducing 30% of the HRSC2016 dataset, with 432 images in the training set and 311 images in the validation set. E-WFF Net still achieves the best mAP value of 93.3%, and performs well in terms of P, R, and FPS. Therefore, the model exhibits good generalization capability.
Figure 8 displays the feature visualization results of different models on various images. The red areas indicate higher feature values corresponding to target features, while the blue areas represent lower feature values associated with background features. The red areas have a greater influence on the prediction outcomes. The most ideal feature map visualization would clearly show the target area in red and other areas in blue, which facilitates more accurate prediction results. Figure 8a shows the feature visualization results of the Efficient Net model, where some ship areas are undetected, particularly for smaller ships and in situations where ships are aligned side by side, making it difficult to identify features. Figure 8b presents the feature map visualization results of the YOLOv6 model. We can observe that smaller ships are generally detected, but there are still undetected ships in side-by-side scenarios, indicating that clustering and alignment have a significant impact on the model. Figure 8c displays the feature visualization results of the YOLOv7-Tiny model. Smaller ships and ships aligned side by side are basically detected, but ships that resemble ports may be easily overlooked. Figure 8d shows the feature visualization results of our E-WFF Net model. Ship areas are highlighted in red, while the surrounding areas are in blue, effectively capturing the characteristics of ships. This demonstrates that our model is more accurate in predicting ship positions.
Figure 9 displays the detection results of several models on the HRSC2016 dataset. Specifically, (a) represents Efficient Net, (b) represents YOLOv5s, (c) represents YOLOv7-Tiny, (d) represents YOLOv10n, (e) represents YOLOv8s, and (f) represents E-WFF Net. Efficient Net misses a significant number of detections. Both YOLOv5s and YOLOv7-Tiny fail to accurately determine whether features belong to ships, leading to false detections of structures similar in shape to ships. YOLOv8s and YOLOv10n may have overfitted during training, resulting in missed detections when multiple targets are clustered together and false detections for small targets. For the missed ship images, for instance, all four images in the first row contain undetected ships, which are marked with yellow dashed circles. For the falsely detected ship images, such as those in the third column of the second row and the third column of the third row, where the rectangular boxes enclose objects that are not ships, these are marked with blue dashed circles.
Figure 10 shows the detection results of several models on the DIOR dataset. In this figure, column (a) shows the detection results of Efficient Net, column (b) those of YOLOv5s, column (c) those of YOLOv8s, and column (d) those of our E-WFF Net model. Yellow dashed boxes mark undetected targets, blue dashed boxes mark falsely detected targets, and red solid rectangles represent prediction results. Compared with our method, the second, third, and fourth images in (a) and (b) all contain missed detections, and in the second image more ships are missed than with our model. The third image in (c) also has undetected ships. In the ocean port scenes, (a), (b), and (c) all mistakenly detect houses near the port as ships. Especially for densely arranged ships, there are many false detections and missed detections, and our model has a clear advantage.

3.4. Ablation Experiment

Our model improvements mainly include rotation enhancement, the DAT module, and the residual feature fusion method. To verify the effectiveness of each proposed improvement, we performed ablation experiments on the HRSC2016 dataset. Among them, the Neck of E-WFF Net involves the residual feature fusion. In addition, '✓' indicates that the method in the corresponding column was added. The experimental results are shown in Table 5.
The backbone is modified based on EfficientRep, and the E-WFF Net in this paper shares the same backbone network as YOLOv8s. We first conducted experiments with the baseline YOLOv8s network, whose mAP is 1% lower than that of the full model. The relatively low FPS observed for YOLOv8 could be attributed to random errors or fluctuations during the experiment. We then added the elliptical rotation enhancement method and the DAT module, respectively. After adding the elliptical rotation enhancement method, the mAP increased by 0.4%; this method increases the number of samples and allows the network to learn more feature information. Elliptical rotation enhancement is performed online during the data preprocessing stage: in each training iteration, the input images are dynamically augmented with different rotation angles, so the model receives a more diverse set of training samples and its generalization ability improves. Because rotation enhancement only alters the image data and does not modify the network architecture, it does not change the network's parameters. After adding DAT, the mAP increased to 95.7% while the speed remained essentially unchanged, indicating that the DAT module increases the network's attention to key features and improves detection accuracy without affecting detection speed. After adding DAT and the residual weighted feature fusion method, the mAP increased by 0.8%. With all three improvements added, the mAP reached 96.1% and the speed reached 175.90 FPS. The mAP values we report are averages over multiple experiments, accounting for the randomness of each training run; the improvement is consistent within a relatively stable range, so the effectiveness of the proposed method is reliable and not due to random fluctuations. In summary, the experiments show that the three proposed modules reduce the complexity of the network, speed up inference, and achieve a good balance between accuracy and speed, meeting the high timeliness requirements of ship detection.
Figure 11 compares our method with YOLOv8s in terms of mAP. E-WFF Net shows a relatively stable improvement during the training stage, and the model has better generalization ability. For a clearer presentation, we show a snapshot at the 270th epoch.
Table 5 shows the training times of E-WFF Net and other models on the HRSC2016 dataset. It can be seen that YOLOv7-Tiny and YOLOv5s are lightweight models, typically having shorter training times. More complex models, such as Center Net, Efficient Net, B-RSD Net, and YOLOv10n, usually require more computational resources, thus resulting in longer training times. B-RSD Net, CHPDet, and CRAS-YOLO are also improved complex ship detection models, with training times similar to that of Center Net. The E-WFF Net model in this paper takes detection speed and accuracy into consideration, and its training time is at a moderate level.

4. Discussion

Remote sensing images have complex scenes and cluttered backgrounds, which poses great challenges in real-time object detection. Object detection algorithms can accurately identify ships in images. When multiple ships appear in the same field of view at the same time, object detection technology can simultaneously identify multiple ships, track their trajectories, and determine their location information and bounding boxes. This positioning information is crucial for maritime traffic monitoring, waterway management, maritime safety, and other fields. In addition, object detection technology can quickly identify ships in real-time monitoring data and can use satellite images to provide the real-time location and dynamic information of ships. From Figure 8, we can see that the Efficient Net model rarely focuses on the key areas of ships in the scenario of clustered ships, resulting in the worst detection accuracy. From Figure 9 and Figure 10, we found that the baseline model YOLOv8s detected the ship, but did not distinguish between the ship and the port in the port background resembling the ship, which was prone to false detection and reduced detection accuracy. In contrast, the DAT in the E-WFF Net we proposed can dynamically adjust the feature offset according to the ship characteristics, accurately identify the ship information in the clustered ships, and integrate the multi-branch structure of the feature layer to speed up the detection speed.
We designed a dynamic attention mechanism module to improve detection accuracy. However, the multi-head attention module has relatively many parameters and a large computational cost, which consumes considerable time. In addition, in real scenarios, the images captured by remote sensing satellites cover a wide area and ship targets are relatively small, so we should continue to explore small-target detection.

5. Conclusions

In this paper, we proposed a remote sensing ship detection method, E-WFF Net, designed to address issues such as the complex and diverse shapes of ships, large scale variations, and diverse geometric size differences in remote sensing images. The core aims of the E-WFF Net design were to increase attention to key information, enhance the network's information extraction capability, and simplify complex deep learning networks. We first introduced a data enhancement method based on elliptical rotation boxes, which provided more training samples during the network training phase and enriched the semantic information available to the network. To address the problem of slow and inaccurate ship detection in optical remote sensing images, we added a feature extraction structure while simplifying the feature fusion layer, constructing the E-WFF Net network to improve detection speed. We also incorporated a dynamic attention mechanism module that highlights important information in the feature map and suppresses secondary information, thus improving the robustness of the detection method. Experiments showed that E-WFF Net has significant advantages in detection accuracy and overall performance and can meet the real-time processing needs of embedded devices. Our method plays a crucial role in ensuring maritime traffic safety.
The paper also has the following limitations:
(1)
Strong dependency on training data: Although the HRSC2016 and DIOR datasets used in the experiments are representative, they are still limited by the diversity of these datasets. For ship detection in different maritime regions and weather conditions, the current method may perform poorly in handling some extreme or special cases. Future work could consider incorporating more diverse training data to enhance the model’s generalization ability.
(2)
Limited detection capability for small sample targets: Although data augmentation through elliptical rotation enhances the diversity of training samples, the detection accuracy for small ship targets may still be insufficient, especially for very small or distant ships. There may be issues with missed detections or false positives.
In future research, we will explore the performance of the proposed method on wider datasets with more samples and compare it with the latest methods; we will also improve detection accuracy for clustered small ship targets. In addition, the rotation enhancement in this paper does not use rotated bounding boxes; we will also explore constructing OBB (oriented bounding box) annotations for the HRSC2016 and DIOR datasets and conducting experiments with rotated object detection algorithms.

Author Contributions

Conceptualization, Z.Z. and G.X.; methodology, Z.Z. and G.X.; software, Q.W.; formal analysis, Q.W.; investigation, Q.W.; resources, Z.Z.; data curation, G.X.; writing—original draft preparation, Q.W.; writing—review and editing, Z.Z. and G.X.; visualization, Q.W.; supervision, Z.Z.; project administration, G.X.; funding acquisition, Z.Z. and G.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62301214), Scientific Research Foundation for Doctoral Program of Hubei University of Technology (XJ2022005901).

Data Availability Statement

The data presented in this study are openly available at https://universe.roboflow.com/thesis-ev7v6/hrsc2016-boys9 (accessed on 6 March 2025) (High-Resolution Ship Collections 2016, HRSC2016) and https://universe.roboflow.com/class-dvpyb/dior-ship (accessed on 6 March 2025) (DetectIon in Optical Remote sensing images, DIOR).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z.; Zhang, L.; Wang, Y.; Feng, P.; He, R. ShipRSImageNet: A Large-Scale Fine-Grained Dataset for Ship Detection in High-Resolution Optical Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8458–8472. [Google Scholar] [CrossRef]
  2. Paladin, Z.; Bauk, S.; Mujalović, R.; Kapidani, N.; Lukšić, Ž. Blockchain Technology’s Effects on Big Data in Maritime Transportation. In Proceedings of the 2024 28th International Conference on Information Technology (IT), Zabljak, Montenegro, 21–24 February 2024; pp. 1–7. [Google Scholar] [CrossRef]
  3. Li, L.; Ren, K.; Yuan, Z.; Feng, C. A polarization HRRP target classification method based on one-dimensional convolutional attention neural network. In Proceedings of the IET International Radar Conference (IRC 2023), Chongqing, China, 21–23 November 2023; pp. 901–905. [Google Scholar] [CrossRef]
  4. You, G.; Zhu, Y. Target Detection Method of Remote Sensing Image Based on Deep Learning. In Proceedings of the 2020 Cross Strait Radio Science & Wireless Technology Conference (CSRSWTC), Fuzhou, China, 11–14 October 2020; pp. 1–2. [Google Scholar] [CrossRef]
  5. Song, J.; Gao, S.; Zhu, Y.; Ma, C. A survey of remote sensing image classification based on CNNs. Big Earth Data 2019, 3, 232–254. [Google Scholar] [CrossRef]
  6. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  9. Varghese, R.; Sambath, M. A Comprehensive Review on Two-Stage Object Detection Algorithms. In Proceedings of the 2023 International Conference on Quantum Technologies, Communications, Computing, Hardware and Embedded Systems Security (iQ-CCHESS), Kottayam, India, 15–16 September 2023; pp. 1–7. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  11. Shao, Z.; Wang, L.; Wang, Z.; Du, W.; Wu, W. Saliency-Aware Convolution Neural Network for Ship Detection in Surveillance Video. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 781–794. [Google Scholar] [CrossRef]
  12. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  13. Wang, Y.; Wang, D. Ship target detection based on YOLOv3 algorithm. In Proceedings of the 2023 International Conference on Computers, Information Processing and Advanced Education, CIPAE, Ottawa, ON, Canada, 26–28 August 2023; pp. 721–725. [Google Scholar] [CrossRef]
  14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  15. Zhang, Y.; Li, W.; Guo, P.; Liu, J.; Hu, Q. Ship-YOLOv5: Ship Target Detection Based on Enhanced Feature Fusion. In Proceedings of the 2024 5th International Conference on Computing, Networks and Internet of Things, Tokyo, Japan, 24–26 May 2024. [Google Scholar]
  16. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Fang, J.; Michael, K.; Montes, D.; Nadar, J.; Skalski, P.; et al. ultralytics/yolov5: v6. 1-tensorrt, tensorflow edge tpu and openvino export and inference. Zenodo 2022. [Google Scholar] [CrossRef]
  17. Wu, X.; He, X.; Tian, S.; Wang, B.; Lin, W. Review of ship target detection based on SAR images. In Proceedings of the 2023 7th International Conference on Transportation Information and Safety (ICTIS), Xi’an, China, 4–6 August 2023; pp. 2106–2112. [Google Scholar] [CrossRef]
  18. Wu, K.; Zhang, Z.; Chen, Z.; Liu, G. Object-Enhanced YOLO Networks for Synthetic Aperture Radar Ship Detection. Remote Sens. 2024, 16, 1001. [Google Scholar] [CrossRef]
  19. Cao, H.; Wu, J. A detection method for ship based on the improved YOLOv7-tiny. In Proceedings of the SPIE—The International Society for Optical Engineering, San Diego, CA, USA, 20–22 August 2024; p. 13178. [Google Scholar] [CrossRef]
  20. Wang, C.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  21. Zhang, J. Ship target detection based on CBAM-YOLOv8. In Proceedings of the SPIE—The International Society for Optical Engineering, San Diego, CA, USA, 20–22 August 2024; p. 13071. [Google Scholar] [CrossRef]
  22. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  23. Wang, S.; Li, Y.; Qiao, S. ALF-YOLO: Enhanced YOLOv8 based on multiscale attention feature fusion for ship detection. Ocean. Eng. 2024, 308, 118233. [Google Scholar] [CrossRef]
  24. Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato Maturity Detection and Counting Model Based on MHSA-YOLOv8. Sensors 2023, 23, 6701. [Google Scholar] [CrossRef]
  25. Guo, L.; Wang, Y.; Guo, M.; Zhou, X. YOLO-IRS: Infrared Ship Detection Algorithm Based on Self-Attention Mechanism and KAN in Complex Marine Background. Remote Sens. 2024, 17, 20. [Google Scholar] [CrossRef]
  26. Deshmukh, P.; Satyanarayana, G.S.R.; Majhi, S.; Sahoo, U.K.; Das, S.K. Swin transformer based vehicle detection in undisciplined traffic environment. Expert Syst. Appl. 2023, 213 Pt B, 118992. [Google Scholar] [CrossRef]
  27. Sun, Z.; Liu, C.; Qu, H.; Xie, G. A Novel Effective Vehicle Detection Method Based on Swin Transformer in Hazy Scenes. Mathematics 2022, 10, 2199. [Google Scholar] [CrossRef]
  28. Huang, Y.; Xiong, W.; Zhang, H.; Zhang, W. Defect detection of color woven fabrics based on U-shaped Swin Transformer autoencoder. Laser Optoelectron. Prog. 2023, 60, 1215001. [Google Scholar]
  29. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. Int. J. Comput. Vis. 2018, 128, 642–656. [Google Scholar] [CrossRef]
  30. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar]
  31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  32. Eckle, K.; Schmidt-Hieber, J. A comparison of deep networks with ReLU activation function and linear spline-type methods. Neural Netw. Off. J. Int. Neural Netw. Soc. 2018, 110, 232–242. [Google Scholar] [CrossRef] [PubMed]
  33. Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
  34. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  35. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A High Resolution Optical Satellite Image Dataset for Ship Recognition and Some New Baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017. [Google Scholar]
  36. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and A New Benchmark. arXiv 2019, arXiv:1909.00133. [Google Scholar] [CrossRef]
  37. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar] [CrossRef]
  38. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar]
  39. Zheng, J.; Sun, S.; Zhao, S. Fast ship detection based on lightweight YOLOv5 network. IET Image Process. 2022, 16, 1585–1593. [Google Scholar] [CrossRef]
  40. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  41. Liu, Y.; Wang, X. SAR Ship Detection Based on Improved YOLOv7-Tiny. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Chengdu, China, 9–12 December 2022; pp. 2166–2170. [Google Scholar] [CrossRef]
  42. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  43. Zhang, F.; Wang, X.; Zhou, S.; Wang, Y.; Hou, Y. Arbitrary-Oriented Ship Detection Through Center-Head Point Extraction. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5612414. [Google Scholar] [CrossRef]
  44. Zhao, W.; Syafrudin, M.; Fitriyani, N.L. CRAS-YOLO: A Novel Multi-Category Vessel Detection and Classification Model Based on YOLOv5s Algorithm. IEEE Access 2023, 11, 11463–11478. [Google Scholar] [CrossRef]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Wang, C.; Liao, H.M.; Yeh, I.; Wu, Y.; Chen, P.; Hsieh, J. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  47. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar]
  48. Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  49. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An Efficient Repvgg-style ConvNets with Hardware-aware Neural Network Design. arXiv 2023, arXiv:2302.00386. [Google Scholar]
  50. Cheng, S.; Zhu, Y.; Wu, S. Deep learning based efficient ship detection from drone-captured images for maritime surveillance. Ocean Eng. 2023, 285 Pt 2, 115440. [Google Scholar] [CrossRef]
Figure 1. Sample images illustrating ship characteristics in the HRSC2016 and DIOR datasets: (a,b) interference from clouds; (c,d) densely packed ships; (e,f) port scenes in which ships resemble the surrounding buildings, with red circles marking ships and yellow circles marking ship-like objects; (g,h) ships with large differences in aspect ratio.
Figure 2. The overall framework of E-WFF Net. The detailed convolution feature maps of P2, P4, P6, P8, and P10 are shown on the left side of the backbone. “Fusion” performs weighted fusion of features from different layers, assigning each branch a weight according to its importance; “UpSample” performs 2× up-sampling; “MaxPool” uses a 2 × 2 sliding window for pooling; “c” abbreviates Concat (concatenation); “conv/2” denotes a convolution with stride 2, while all other convolutions use stride 1.
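To make the operator names in the caption concrete, the PyTorch lines below sketch the “UpSample”, “MaxPool”, and “c” (Concat) steps on two dummy feature maps; the channel counts and spatial sizes are illustrative assumptions, not the actual E-WFF Net configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the neck operations named in Figure 2 (illustrative shapes only).
up = nn.Upsample(scale_factor=2, mode="nearest")  # "UpSample": 2x up-sampling
pool = nn.MaxPool2d(kernel_size=2, stride=2)      # "MaxPool": 2 x 2 sliding window

p_deep = torch.randn(1, 256, 20, 20)     # deeper, lower-resolution feature map
p_shallow = torch.randn(1, 128, 40, 40)  # shallower, higher-resolution feature map

# Top-down path: up-sample the deep map, then concatenate ("c") along channels.
top_down = torch.cat([up(p_deep), p_shallow], dim=1)     # shape (1, 384, 40, 40)

# Bottom-up path: down-sample the shallow map before merging with the deep map.
bottom_up = torch.cat([pool(p_shallow), p_deep], dim=1)  # shape (1, 384, 20, 20)
```

A stride-2 convolution (“conv/2”) plays the same down-sampling role as the max-pooling branch, but with learnable weights.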
Figure 3. Rotation-augmented image based on the horizontal rectangular box.
Figure 4. Rotation-augmented image based on the elliptical box.
Figure 5. Comparison of the rotation-augmented labels and the original labels. The red rectangular box represents the original label, and the blue rectangular box represents the label after rotation augmentation.
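One way to read Figures 3–5 is that, when an image is rotated for augmentation, regenerating the horizontal label from the rotated corners of the original rectangle inflates the box for elongated ships, whereas regenerating it from the rotated ellipse inscribed in that rectangle keeps the label tighter. The sketch below illustrates only this geometric effect; the helper functions are hypothetical and are not the paper’s augmentation code.

```python
import math

def rotated_rect_aabb(w, h, theta_deg):
    """Width/height of the axis-aligned box enclosing a w x h rectangle
    rotated by theta about its center."""
    t = math.radians(theta_deg)
    bw = w * abs(math.cos(t)) + h * abs(math.sin(t))
    bh = w * abs(math.sin(t)) + h * abs(math.cos(t))
    return bw, bh

def rotated_ellipse_aabb(w, h, theta_deg):
    """Width/height of the axis-aligned box enclosing the ellipse inscribed
    in a w x h rectangle, rotated by theta about its center."""
    t = math.radians(theta_deg)
    a, b = w / 2.0, h / 2.0  # semi-axes of the inscribed ellipse
    bw = 2.0 * math.sqrt((a * math.cos(t)) ** 2 + (b * math.sin(t)) ** 2)
    bh = 2.0 * math.sqrt((a * math.sin(t)) ** 2 + (b * math.cos(t)) ** 2)
    return bw, bh

# A ship-like box with a large aspect ratio, rotated by 45 degrees:
print(rotated_rect_aabb(300, 60, 45))     # ~ (254.6, 254.6): label inflated with background
print(rotated_ellipse_aabb(300, 60, 45))  # ~ (216.3, 216.3): noticeably tighter label
```

For a square box the two labels coincide; the gap grows with the aspect ratio, which is exactly the large-aspect-ratio case highlighted in Figure 1g,h.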
Figure 6. Overall framework of the DAT module. (a) Information flow of the dynamic attention mechanism module. A set of reference points is placed uniformly on the feature map, and the offset values are learned by the offset network; the dynamic offset points mark the dynamically shifted regions. The dynamic keys and values are then projected from the features sampled at the dynamic offset points, which are also used to compute the position bias offsets. For clarity, only 4 reference points are shown for one ship; in practice there are more. (b) The offset network (θ_offset(·)), annotated with the input and output feature map sizes of each layer. “Conv 1 × 1” is a convolution layer with a 1 × 1 kernel; “DWConv 5 × 5” is a depthwise separable convolution with a 5 × 5 kernel.
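For readers unfamiliar with deformable attention [30], the following PyTorch sketch shows how uniformly placed reference points, a small offset network (depthwise 5 × 5 convolution followed by a 1 × 1 convolution), and grid sampling can produce the deformed features from which dynamic keys and values are projected. The layer sizes, the tanh offset scaling, and the module name are assumptions made for illustration; this is not a reproduction of the exact DAT module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Hypothetical sketch of offset-based sampling, cf. Figure 6a,b."""

    def __init__(self, dim=128):
        super().__init__()
        # Offset network: DWConv 5x5 -> GELU -> Conv 1x1 producing (dx, dy) per point.
        self.offset_net = nn.Sequential(
            nn.Conv2d(dim, dim, 5, padding=2, groups=dim),  # depthwise 5 x 5
            nn.GELU(),
            nn.Conv2d(dim, 2, 1),                           # 1 x 1 -> two offset channels
        )
        self.proj_k = nn.Conv2d(dim, dim, 1)
        self.proj_v = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Uniform reference points in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        ref_y, ref_x = torch.meshgrid(ys, xs, indexing="ij")
        ref = torch.stack((ref_x, ref_y), dim=-1).expand(b, h, w, 2)

        # Predict per-point offsets and shift the reference points (tanh scaling is an assumption).
        offsets = self.offset_net(x).permute(0, 2, 3, 1)    # (b, h, w, 2)
        pos = (ref + torch.tanh(offsets)).clamp(-1, 1)

        # Sample features at the shifted points; project dynamic keys and values.
        sampled = F.grid_sample(x, pos, align_corners=True)  # (b, c, h, w)
        return self.proj_k(sampled), self.proj_v(sampled)

# Example: a 128-channel, 40 x 40 feature map from the neck.
k, v = DeformableSampling(dim=128)(torch.randn(1, 128, 40, 40))
```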
Figure 7. Schematic diagram of the residual fusion method; “3 × 3/2 Conv” denotes a convolution with a 3 × 3 kernel and a stride of 2.
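Reading Figure 7 together with the weighted “Fusion” nodes of Figure 2, a minimal sketch of the residual weighted fusion idea is given below: the shallower branch is downsampled by the 3 × 3, stride-2 convolution and combined with the deeper branch using normalized learnable weights in the spirit of EfficientDet’s fast normalized fusion [34], which the paper cites. The exact weighting scheme and channel counts are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class ResidualWeightedFusion(nn.Module):
    """Sketch: downsample a shallow feature with a 3x3/2 conv and fuse it
    with a deeper feature using two learnable, normalized branch weights."""

    def __init__(self, in_ch, out_ch, eps=1e-4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # "3 x 3/2 Conv"
        self.w = nn.Parameter(torch.ones(2))  # one importance weight per branch
        self.eps = eps

    def forward(self, shallow, deep):
        w = torch.relu(self.w)            # keep weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the weights sum to ~1
        return w[0] * self.down(shallow) + w[1] * deep

# Usage: inject an 80 x 80 shallow map into a 40 x 40 deep map.
fusion = ResidualWeightedFusion(in_ch=64, out_ch=128)
out = fusion(torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40))  # (1, 128, 40, 40)
```

Giving the downsampled shallow branch its own learned weight lets the network decide how much fine spatial detail to inject at each fusion node, which matches the caption’s description of importance-dependent branch weights.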
Figure 8. Feature visualization outcomes of various methods: (a) Efficient Net; (b) YOLOv6; (c) YOLOv7-Tiny; (d) E-WFF Net.
Figure 9. Visualization of different detection methods: (a) Efficient Net; (b) YOLOv5s; (c) YOLOv7-Tiny; (d) YOLOv10n; (e) YOLOv8s; (f) E-WFF Net. Yellow dashed circles mark missed detections, blue dashed circles mark false detections, and red solid boxes mark predicted results.
Figure 10. Results of various methods on the DIOR dataset: (a) Efficient Net; (b) YOLOv5s; (c) YOLOv8s; (d) E-WFF Net. Yellow dashed boxes mark missed detections, blue dashed boxes mark false detections, and red solid boxes mark predicted results.
Figure 11. Comparison of model training mAP_0.5 results with YOLOv8s. For a clearer presentation, (b) is a cropped version of (a), showing the region where mAP_0.5 lies between 0.8 and 1.0.
Table 1. Experimental environment and hyperparameter settings.
Configuration | Setting
Hardware | CPU: Intel i9-10900K; GPU: NVIDIA GeForce RTX 3060
Software | PyCharm 2021 + Python 3.7.0 + CUDA 11.4 + PyTorch 1.10.1
Parameters | Batch size: 8; Learning rate: 0.01; Epochs: 300; Input size: (640, 640); Optimizer: SGD; Mosaic: True
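For orientation, the settings in Table 1 roughly correspond to a training call of the following form in the Ultralytics YOLOv8 Python API. This is a hedged sketch: “hrsc2016.yaml” is a placeholder dataset description file, and mapping “Mosaic: True” to mosaic=1.0 is an assumption, not a detail from the paper.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the Table 1 settings; not the authors' training script.
model = YOLO("yolov8s.yaml")
model.train(
    data="hrsc2016.yaml",  # placeholder dataset config
    epochs=300,
    batch=8,
    imgsz=640,
    lr0=0.01,
    optimizer="SGD",
    mosaic=1.0,            # mosaic augmentation enabled
)
```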
Table 2. Comparison results of different methods on the HRSC2016 dataset; mAP refers to the overall performance across all ship categories.
Model | Backbone | P (%) | R (%) | F1 (%) | mAP (%) | FPS | Training Time (h)
Center Net | ResNet50 [45] | 47.8 | 45.7 | 46.7 | 55.8 | 63.17 | 5.97
Efficient Net | Efficient Net0 [38] | 62.7 | 71.3 | 66.7 | 77.6 | 44.11 | 5.63
YOLOv5s | CSPDarknet53 [46] | 91.3 | 90.2 | 90.7 | 94.0 | 111.11 | 4.94
YOLOv6 | CSPDarknet53 [46] | 89.1 | 81.0 | 84.9 | 91.2 | 77.70 | 5.71
YOLOv7-Tiny | Darknet53 [14] | 89.7 | 85.3 | 87.4 | 92.9 | 96.75 | 5.14
YOLOv10n | Darknet53 [14] | 94.3 | 86.0 | 89.9 | 95.0 | 160.00 | 4.83
B-RSD Net | RepVGG [47] | 90.1 | 84.5 | 87.2 | 89.9 | 163.57 | 4.36
CHPDet | Hourglass104 [48] | 89.6 | 91.7 | 90.6 | 90.6 | 167.58 | 4.19
CRAS-YOLO | CSPDarknet53 [46] | 90.5 | 89.6 | 90.1 | 94.2 | 170.64 | 4.15
YOLOv8s | Efficient Rep [49] | 90.9 | 92.8 | 91.8 | 95.1 | 169.44 | 4.21
E-WFF Net | Efficient Rep [49] | 93.7 | 90.3 | 92.0 | 96.1 | 175.90 | 4.17
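The F1 column in Tables 2–4 is the harmonic mean of precision and recall; as a quick sanity check against two rows of Table 2:

```python
def f1(p, r):
    """Harmonic mean of precision (P) and recall (R), as reported in Tables 2-4."""
    return 2 * p * r / (p + r)

print(round(f1(90.9, 92.8), 1))  # 91.8 -> matches the YOLOv8s row of Table 2
print(round(f1(93.7, 90.3), 1))  # 92.0 -> matches the E-WFF Net row of Table 2
```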
Table 3. Comparison results of different methods on the DIOR dataset; mAP refers to the overall performance across all ship categories.
Model | P (%) | R (%) | F1 (%) | mAP (%) | FPS
Efficient Net | 61.8 | 70.3 | 65.78 | 76.5 | 107.70
YOLOv5s | 91.5 | 89.9 | 90.69 | 93.9 | 117.67
YOLOv5-ODconvNeXt [50] | 91.9 | 91.5 | 91.70 | 94.2 | 116.25
YOLOv6 | 77.8 | 79.4 | 78.59 | 80.8 | 94.30
YOLOv7-Tiny | 87.1 | 88.6 | 87.84 | 89.8 | 118.04
YOLOv8s | 90.3 | 91.0 | 90.65 | 94.1 | 112.70
YOLOv10n | 91.9 | 90.5 | 91.19 | 93.8 | 108.26
B-RSD Net | 85.6 | 82.7 | 84.13 | 86.6 | 90.60
CHPDet | 88.4 | 91.3 | 89.83 | 89.8 | 104.81
CRAS-YOLO | 86.3 | 87.6 | 86.95 | 90.1 | 112.74
E-WFF Net | 92.7 | 90.8 | 91.74 | 94.3 | 118.64
Table 4. Comparison results after reducing 30% of the HRSC2016 dataset.
Model | P (%) | R (%) | F1 (%) | mAP (%) | FPS
Center Net | 45.2 | 43.5 | 44.33 | 53.9 | 61.46
Efficient Net | 60.3 | 68.4 | 64.10 | 75.9 | 47.23
YOLOv5s | 89.1 | 88.9 | 89.00 | 92.3 | 103.76
YOLOv6 | 85.6 | 80.7 | 83.08 | 89.6 | 75.50
YOLOv7-Tiny | 87.2 | 82.0 | 84.52 | 90.0 | 93.52
YOLOv10n | 92.5 | 84.4 | 88.26 | 92.2 | 151.39
B-RSD Net | 87.8 | 80.6 | 84.05 | 87.7 | 132.13
CHPDet | 87.3 | 89.1 | 88.19 | 88.2 | 167.98
CRAS-YOLO | 88.2 | 86.2 | 87.19 | 91.1 | 172.02
YOLOv8s | 89.4 | 91.7 | 90.54 | 93.1 | 168.45
E-WFF Net | 91.9 | 88.6 | 90.22 | 93.3 | 170.93
Table 5. Ablation experiment.
Module | Elliptical Rotation Enhancement | DAT Module | Residual Weighted Feature Fusion Method | Parameter (M) | mAP (%) | FPS
YOLOv8s | – | – | – | 3.01 | 95.1 | 164.44
Ours | ✓ | – | – | 3.01 | 95.5 | 164.81
Ours | – | ✓ | – | 4.01 | 95.7 | 165.84
Ours | – | – | ✓ | 2.03 | 95.5 | 174.52
Ours | ✓ | ✓ | – | 4.01 | 95.9 | 164.93
Ours | ✓ | ✓ | ✓ | 3.03 | 96.1 | 175.90
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
