Article

A Deformable and Multi-Scale Network with Self-Attentive Feature Fusion for SAR Ship Classification

1 Navigation College, Dalian Maritime University, Dalian 116026, China
2 School of Computer and Software, Dalian Neusoft Information University, Dalian 116023, China
3 Environmental Information Institute, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(9), 1524; https://doi.org/10.3390/jmse12091524
Submission received: 9 August 2024 / Revised: 22 August 2024 / Accepted: 27 August 2024 / Published: 2 September 2024
(This article belongs to the Section Ocean Engineering)

Abstract

The identification of ships in Synthetic Aperture Radar (SAR) imagery is critical for effective maritime surveillance. The advent of deep learning has significantly improved the accuracy of SAR ship classification and recognition. However, distinguishing features between different ship categories in SAR images remains a challenge, particularly as the number of categories increases. The key to achieving high recognition accuracy lies in effectively extracting and utilizing discriminative features. To address this, we propose DCN-MSFF-TR, a novel recognition model inspired by the Transformer encoder–decoder architecture. Our approach integrates a deformable convolutional module (DCN) within the backbone network to enhance feature extraction. Additionally, we introduce multi-scale self-attention processing from the Transformer into the feature hierarchy and fuse these representations at appropriate levels using a feature pyramid strategy. This enables each layer to leverage both its own information and synthesized features from other layers, enhancing feature representation. Extensive evaluations on the OpenSARShip-3-Complex and OpenSARShip-6-Complex datasets demonstrate the effectiveness of our method. DCN-MSFF-TR achieves average recognition accuracies of 78.1% and 66.7% on the three-class and six-class datasets, respectively, outperforming existing recognition models and showcasing its superior capability in accurately identifying ship categories in SAR images.

1. Introduction

Synthetic Aperture Radar (SAR) is a high-resolution active microwave remote sensing technology. Because it is unaffected by clouds and atmospheric obscuration, it can observe at any time of day and in all weather conditions, and it therefore captures rich information about the target [1]. It has been widely used in fields such as Earth observation and military reconnaissance. SAR ship classification is a crucial step in Automatic Target Recognition (ATR) [2], enabling precise identification of ship types or categories [3]. This detailed maritime information is invaluable for enhancing marine surveillance, streamlining trade management, monitoring transportation activities, and supporting sustainable fisheries management [4]. Ship categories are frequently defined by the specific purpose or function of the vessel: bulk carriers transport large quantities of industrial and commercial goods, container ships handle vital international trade cargo, oil tankers carry industrial oils, and fishing vessels harvest marine fish [5].
In the early stages of classification research, manual feature engineering was the dominant paradigm [6,7]. These features include geometric, texture, scattering intensity, and orientation gradient histogram features. The extracted feature set is then fed into a traditional machine learning classifier, such as a Support Vector Machine (SVM), Decision Tree, Random Forest, Perceptron, or Naive Bayes classifier, for ship category identification [8]. For instance, Zhao et al. [9] employed a hierarchical analysis method, utilizing a feature set comprising geometric, transformational, and local invariant features, to identify the most effective feature subset. They then leveraged a K-Nearest Neighbor (KNN) classifier to achieve successful classification of cargo ships, oil tankers, and container ships from TerraSAR-X imagery. Wu et al. [10] delved into the joint optimization of feature selection and classifier design, proposing a BDA-KELM algorithm that automatically performs feature selection and parameter optimization to determine the optimal feature–classifier combination. Zhou et al. [11] introduced MIMR, an optimization framework designed to maximize information content while minimizing redundancy. This framework allows for the extraction of a non-redundant subset of complementary features, enhancing feature representativeness and ultimately improving the classification performance of ships with limited pixel information in medium-resolution SAR images. The reliance on manual expertise, while effective in specific contexts, can become a bottleneck in multi-sensor fusion and diverse scene understanding, limiting the scalability and adaptability of these approaches.
Deep-learning-based ship classification methods for SAR imagery have been garnering significant interest from scholars in recent years. Researchers have leveraged deep convolutional neural network models to achieve SAR ship classification. These models have demonstrated superior accuracy and speed compared to traditional manual feature recognition models [12]. Convolutional Neural Networks (CNNs) have garnered significant attention in SAR ship classification due to their ability to automatically learn discriminative features from labeled data, eliminating the need for laborious manual feature extraction. These learned features represent hierarchical representations of the target objects, and the end-to-end training paradigm simplifies the workflow and enhances efficiency. For instance, Li et al. [13] developed a dense residual network (DRNet) incorporating upsampling data augmentation and in-batch balanced sampling to address the challenge of class imbalance in ship classification. They validated the efficacy of their approach on the publicly available OpenSARShip dataset. However, the large size of DRNet can impact training and testing efficiency. Bentes et al. [14] proposed a multi-resolution input CNN to enhance multi-scale feature representation. They integrated four distinct CNN classifiers to distinguish between cargo ships, oil tankers, offshore platforms, and port facilities. Their experimental results demonstrated superior recognition accuracy compared to traditional machine learning classifiers like Support Vector Machines (SVMs). Dong et al. [15] designed a residual learning network for fine-grained ship type recognition in Gaofen-3 SAR imagery, achieving high-accuracy classification of cargo ships, container ships, and oil tankers. Huang et al. [16] explored the application of transfer learning in SAR ship classification. They transferred knowledge learned from the MSTAR and ImageNet datasets to the OpenSARShip dataset for ship classification. Furthermore, they proposed the Deep SAR-Net model to facilitate knowledge transfer from optical to SAR imagery, ultimately improving recognition accuracy for bulk carriers, container ships, and oil tankers. Despite the strong feature extraction capabilities of CNNs, their inherent inductive bias limits their ability to capture global context, potentially hindering further performance improvements.
Discriminative features for ship classification often manifest as subtle variations within localized image regions. Accurately identifying and harnessing these non-obvious cues is paramount to achieving robust classification performance. The deep self-attention network, Transformer, originally introduced in the seminal paper “Attention Is All You Need” [17], revolutionized the field of natural language processing. It quickly became the dominant model in the field and formed the foundation for large-scale language models like GPT-3. Since 2020, Transformer’s influence has extended to computer vision, marking a significant advancement in the field. Following the pioneering work of Google’s Vision Transformer (ViT) [18], several noteworthy vision Transformers have emerged, including Facebook’s DeiT [19] and Microsoft Research Asia’s Swin Transformer [20]. These models have demonstrated remarkable success in image classification tasks, highlighting the potential of Transformer to match or even surpass the performance of CNNs in computer vision. In 2022, Li et al. [21] introduced the Transformer model to SAR ship detection, aiming to enhance ship feature representation. By leveraging a non-local neural network, the Transformer model establishes long-range spatial dependencies, enabling a broader global receptive field and allowing the network to focus on global contextual information. Their results demonstrated that the Transformer achieved superior detection performance compared to CNNs. The Transformer architecture excels at extracting target features by considering both global and local information. However, despite its high recognition accuracy, the model’s performance in detecting small objects remains a challenge, primarily due to the increased computational complexity associated with high-resolution features. To address these limitations, this paper introduces Deformable Convolution Network and Multi-scale Feature Fusion with Transformer architecture (DCN-MSFF-TR), a novel ship classification model specifically tailored for SAR imagery. Inspired by the Transformer encoder–decoder paradigm, our model incorporates the DCN deformable convolution module into the backbone network. This strategic integration enables the model to focus on sparse spatial information, thereby enhancing small target recognition in SAR images. Furthermore, to meet the demanding feature extraction requirements of ship classification, the model employs multi-scale attention processing of feature layers within the Transformer framework. This process is followed by feature fusion at appropriate locations within the model, guided by a feature pyramid. Consequently, each layer benefits not only from its own information but also from a comprehensive integration of features from other layers. Extensive evaluations conducted on the publicly available OpenSARShip-3-Complex (three-class) and OpenSARShip-6-Complex (six-class) datasets demonstrate the effectiveness of our proposed method. The experimental results confirm that DCN-MSFF-TR achieves superior performance in accurately identifying ship categories within SAR images.

2. The DCN-MSFF-TR Model for Ship Classification

2.1. Transformer Encoder–Decoder Architecture

Our end-to-end classification model leverages a Transformer-based architecture to effectively process image data. As depicted in Figure 1, the model consists of four main modules: a backbone network for extracting image features, a Transformer-based encoder for capturing global context, a decoder for selectively attending to relevant information, and a prediction head for generating the final output. The backbone network utilizes a traditional Convolutional Neural Network (CNN) to extract representative features from the input image. These extracted features, along with positional encodings that preserve spatial information, are then fed into the Transformer-based encoder. This encoder leverages its self-attention mechanism to capture long-range dependencies and global context within the image. Subsequently, the decoder, also employing a self-attention mechanism, refines the encoded features and selectively attends to salient information. This selective attention mechanism proves crucial for accurately identifying and localizing target markers within the image, effectively distinguishing them from background noise and other irrelevant features. The decoder output is passed as input to a Feed-Forward Neural Network (FFN). The FFN’s output is subsequently fed into a fully connected layer, which performs the final classification and outputs the predicted category label. The Transformer encoder–decoder architecture excels in capturing global dependencies within data through its multi-head self-attention mechanism. This architecture, comprising multiple stacked encoders and decoders, allows for an intricate understanding of relationships between different elements within the input. Crucially, in the context of target recognition within SAR images, the self-attention mechanism proves particularly powerful. By calculating attention weights for all objects within the entire image, the model effectively highlights the target of interest (e.g., a ship) while suppressing irrelevant background information. This selective attention, achieved through the self-attention mechanism’s ability to weigh the importance of different image regions, significantly contributes to the efficacy of Transformer-based end-to-end models for tasks like ship classification.
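To make the data flow concrete, the sketch below (not the authors' released code) wires the four modules together in PyTorch: a ResNet-50 backbone, a 1 × 1 projection with a learned positional encoding, a Transformer encoder–decoder driven by learned queries, and an FFN prediction head. The backbone choice, feature-map size, query count, and layer widths are illustrative assumptions only.

```python
# Minimal sketch of the four-module pipeline in Figure 1 (illustrative, not the
# authors' implementation): CNN backbone -> positional encoding -> Transformer
# encoder/decoder with learned queries -> FFN classification head.
import torch
import torch.nn as nn
import torchvision

class TransformerClassifier(nn.Module):
    def __init__(self, num_classes: int = 3, d_model: int = 256, num_queries: int = 16):
        super().__init__()
        backbone = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # keep the C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)        # channel reduction
        self.pos_embed = nn.Parameter(torch.randn(1, 400, d_model))      # assumed max 20x20 feature grid
        self.query_embed = nn.Parameter(torch.randn(1, num_queries, d_model))
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=4, num_decoder_layers=4,
                                          batch_first=True)
        self.ffn_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                      nn.Linear(d_model, num_classes))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.input_proj(self.backbone(images))           # (B, d_model, H, W)
        b, c, h, w = feats.shape
        src = feats.flatten(2).transpose(1, 2)                    # (B, H*W, d_model) tokens
        src = src + self.pos_embed[:, : h * w, :]                 # add positional information
        tgt = self.query_embed.expand(b, -1, -1)                  # learned object queries
        dec = self.transformer(src, tgt)                          # (B, num_queries, d_model)
        return self.ffn_head(dec.mean(dim=1))                     # pooled queries -> class logits
```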

2.2. Deformable Convolutional Networks (DCN) Module

The original backbone network employed a convolutional neural network for feature extraction. However, standard convolution lacks a sufficiently flexible receptive field for target recognition, resulting in diminished efficiency [22], as illustrated in Figure 2a. In contrast, deformable convolution samples the input over irregular, adaptive shapes to address this limitation [23], as depicted in Figure 2b. To enhance the original recognition model, conventional convolutional kernels are first used to generate the Conv1, Conv2, Conv3, and Conv4 feature maps from the input image. The standard convolutional kernels within layers Conv2, Conv3, and Conv4 are then replaced with deformable convolutional kernels. Unlike standard kernels, which sample at fixed offsets in the x and y directions relative to the center pixel, deformable kernels add a parallel convolution branch that, after each standard convolution operation, predicts the x and y offsets for every sampling point. This offset field has 2N channels, where N is the number of sampling points in the kernel (e.g., N = 9 for a 3 × 3 kernel); the doubling arises from the need to record offsets in both the x and y directions. Figure 3 illustrates the deformable convolution learning process. During training, bilinear interpolation and backpropagation are employed to concurrently learn the convolution kernel that generates the output features and the offsets. Specifically, for each position p_0 of the output feature map, the standard convolution produces the output y(p_0):
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n),
The deformable convolution can then be expressed as follows, where w represents the network weight parameters, \mathcal{R} denotes the regular sampling grid of the convolution kernel, p_n enumerates the positions in \mathcal{R}, and \Delta p_n is the additional learned offset applied to each standard sampling position:
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n).
The DCN enhances feature extraction by adaptively adjusting the convolutional kernel's shape and offsets based on the content of the input image. This adaptability allows the DCN to capture more representative features, leading to improved performance in image recognition tasks. As shown in Figure 2, deformable convolution employs a distinctive sampling process, and Figure 3 further demonstrates how the DCN facilitates learning within our backbone network.
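As a concrete illustration of how such a deformable layer can replace a standard convolution in the backbone, the following PyTorch sketch (an assumption-based example relying on torchvision.ops.DeformConv2d, not the paper's implementation) uses a plain 3 × 3 convolution to predict the 2N offset channels and then samples the input at the shifted positions, as in the deformable convolution equation above.

```python
# Hedged sketch of a deformable convolution block (assumes torchvision >= 0.9).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, padding: int = 1):
        super().__init__()
        # 2 offset channels (x and y) per kernel sampling point -> 2N channels
        offset_channels = 2 * kernel_size * kernel_size
        self.offset_conv = nn.Conv2d(in_ch, offset_channels, kernel_size, padding=padding)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)             # learned offsets for every output location
        return self.deform_conv(x, offsets)       # sample at the shifted positions

# Example: drop-in replacement for a standard 3x3 conv on a 256-channel feature map
block = DeformableConvBlock(256, 256)
y = block(torch.randn(1, 256, 32, 32))            # -> torch.Size([1, 256, 32, 32])
```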

2.3. A Multi-Block Self-Attention Mechanism for Feature Fusion

Transformer-based recognition models incorporate a self-attention mechanism within their encoder stage. In this process, the feature vectors from the backbone network (denoted Feature) are linearly transformed using distinct weight matrices to generate the query (Q), key (K), and value (V) matrices:
Q = \mathrm{Feature} \cdot W_q, \quad K = \mathrm{Feature} \cdot W_k, \quad V = \mathrm{Feature} \cdot W_v,
Learnable weight matrices W_q, W_k, and W_v are applied to generate the query (Q), key (K), and value (V) matrices, respectively. The attention matrix is then calculated by performing a softmax operation on the dot product of the query (Q) and key (K) matrices. This attention matrix, representing the weighted relationships between elements, is subsequently multiplied by the value matrix (V) to obtain a contextually enriched representation that captures the interdependencies between targets within the SAR image:
\mathrm{Self\_Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_k}}\right) \cdot V,
where d_k represents the dimensionality of K. During the feature extraction phase of the backbone network, the self-attention module effectively captures global contextual information, transforming the input feature maps into richer, more expressive subfeature maps. Subsequently, each encoder layer undergoes Layer Normalization (LN) before being processed by a Feed-Forward Neural Network (FFN).
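The computation above can be written compactly; the single-head PyTorch sketch below (an illustrative simplification, not the model's multi-head implementation) projects the flattened backbone features to Q, K, and V and applies softmax(QKᵀ/√d_k)·V.

```python
# Minimal single-head self-attention over flattened feature-map tokens (a sketch).
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, feature: torch.Tensor) -> torch.Tensor:
        # feature: (batch, tokens, d_model), i.e. a flattened feature map
        q, k, v = self.w_q(feature), self.w_k(feature), self.w_v(feature)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(k.size(-1)), dim=-1)
        return attn @ v                            # context-enriched token features
```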
The Multi-scale Self-Attention Layer Feature Fusion (MSFF) module comprises two main parts. First, each block represents one feature scale and feeds a fusion module responsible for merging the current block's output with features from the other blocks. Specifically, the features of the ith (i = 1, 2, 3) block are input to the (i + 1)th block, while the output features of each block are also passed to the fusion module. This allows each layer to use its own information while incorporating features from other layers, enhancing the overall representation capability. Second, the multi-scale features are fused layer by layer via concatenation, and the MSFF module performs feature selection and refinement by learning an optimal representation from the concatenated feature maps. The impact of fusing multi-scale attention features on the feature representation can be observed in the heat map shown in Figure 4, and the resulting improved model architecture is detailed in Figure 5.
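A hedged sketch of this fusion scheme is given below; the block structure, channel width, and use of nn.MultiheadAttention are assumptions made for illustration rather than the authors' exact design. Each cascaded block downsamples and self-attends at its own scale, and the fusion module concatenates the rescaled per-block outputs and refines them with a 1 × 1 convolution.

```python
# Illustrative multi-scale self-attention feature fusion (MSFF-style) module.
import torch
import torch.nn as nn

class MSFF(nn.Module):
    def __init__(self, channels: int = 256, num_blocks: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                          nn.BatchNorm2d(channels), nn.ReLU())
            for _ in range(num_blocks))
        self.attn = nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
        self.fuse = nn.Conv2d(channels * num_blocks, channels, kernel_size=1)  # refine concatenation

    def _self_attend(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)          # per-scale self-attention
        return out.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs, target_size = [], None
        for block in self.blocks:
            x = block(x)                                    # block i feeds block i+1
            attended = self._self_attend(x)
            if target_size is None:
                target_size = attended.shape[-2:]           # fuse at the first block's resolution
            outputs.append(nn.functional.interpolate(attended, size=target_size,
                                                     mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(outputs, dim=1))         # concatenate and refine
```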

2.4. Optimizing Loss Functions for DCN-MSFF-TR

Traditionally, the cross-entropy loss L_{ce}(p_i) = -\ln p_i, computed from the predicted probability distribution p = (p_0, p_1, \ldots, p_k) over the k ship target types (p_i being the probability assigned to the true class), is used for classification tasks. However, the ship classification problem, particularly with the six-class dataset, suffers from class imbalance, and directly applying the cross-entropy loss in this scenario proves ineffective. To address this, the model's classification loss is enhanced with a balanced cross-entropy loss, which introduces a weighting factor for each class to mitigate the imbalance. The updated loss function is defined as:
L_{wce}(p_i, \alpha_i) = -\alpha_i \ln p_i,
where \alpha_i represents the weight assigned to each category. While the balanced cross-entropy loss uses \alpha_i to adjust the importance of different classes, it does not differentiate between the difficulty levels of individual samples. Focal loss, a popular loss function designed to address class imbalance, builds upon the balanced cross-entropy framework by introducing a modulating factor \gamma to diminish the influence of easily classified samples and prioritize those that are harder to categorize. The focal loss function is defined as follows:
L_{Focal}(p_i, \alpha_i, \gamma) = -\alpha_i (1 - p_i)^{\gamma} \ln p_i.
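For reference, a minimal PyTorch implementation of this weighted focal loss is sketched below; the per-class weights and \gamma = 2.0 are illustrative values, not those tuned in the paper.

```python
# Sketch of the weighted focal loss -alpha_i * (1 - p_i)^gamma * ln(p_i).
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """logits: (B, K) raw scores, targets: (B,) class indices, alpha: (K,) class weights."""
    log_p = F.log_softmax(logits, dim=-1)                        # ln p for every class
    log_p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # ln p_i of the true class
    p_t = log_p_t.exp()
    loss = -alpha[targets] * (1.0 - p_t) ** gamma * log_p_t      # down-weights easy samples
    return loss.mean()

# Example: six classes with hypothetical weights favoring under-represented types
alpha = torch.tensor([0.5, 1.0, 1.0, 1.5, 1.5, 1.5])
loss = focal_loss(torch.randn(8, 6), torch.randint(0, 6, (8,)), alpha)
```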

3. Results and Analysis

3.1. Dataset Collections

The OpenSARShip dataset, which includes 11,364 SAR ship images integrated with AIS messages, possesses five key characteristics: specificity, large scale, diversity, reliability, and public accessibility [24]. We constructed the OpenSARShip-3-Complex dataset by extracting a total of 3990 samples, comprising the Cargo, Fishing, and Tanker ship categories, from the OpenSARShip dataset for training and testing. To mitigate the potential adverse effects of class imbalance on training, the number of training and testing samples for each ship category was balanced. OpenSARShip-3-Complex is thus a three-class SAR ship classification dataset containing image samples from each category; examples are displayed in Figure 6. To improve data quality, data augmentation techniques such as rotation, mirror inversion, and brightness adjustment were applied, yielding almost 4000 ship image samples. Details are provided in Table 1.
To further validate the proposed method's efficacy, we applied it to the publicly available OpenSARShip-6-Complex dataset. This dataset, comprising 6000 samples for six-class classification, is detailed in Table 2, and sample images are shown in Figure 7.

3.2. Experimental Procedure

The experimental platform utilized Ubuntu 16.0 with an NVIDIA Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA). Ablation experiments were conducted separately on the OpenSARShip-3-Complex (three-category) and OpenSARShip-6-Complex (six-category) datasets. Figure 8 illustrates the convergence of the model loss function. The primary performance metrics for ship recognition are precision (P) and recall (R), where P = TP/(TP + FP) and R = TP/(TP + FN). Precision (P) is defined as the ratio of correctly predicted ships (True Positives, TP) to the total number of ships predicted as belonging to that category (TP + False Positives, FP). Recall (R) is the ratio of correctly predicted ships (TP) to the total number of ships actually belonging to that category (TP + False Negatives, FN). In addition to precision and recall, mean Average Precision (mAP) and mean Average Recall (mAR) are also used to evaluate overall performance. mAP represents the average precision across all ship categories, while mAR represents the average recall across all categories.
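The following NumPy snippet shows how these per-class metrics and their means follow directly from a confusion matrix, using the definition above of mAP/mAR as per-category averages; the example matrix is hypothetical, not taken from the experiments.

```python
# Per-class precision/recall and their category means from a confusion matrix.
import numpy as np

def precision_recall(confusion: np.ndarray):
    """confusion: rows = true category, columns = predicted category."""
    tp = np.diag(confusion).astype(float)
    precision = tp / confusion.sum(axis=0)        # TP / (TP + FP), per predicted column
    recall = tp / confusion.sum(axis=1)           # TP / (TP + FN), per true row
    return precision, recall, precision.mean(), recall.mean()

# Hypothetical 3-class confusion matrix for illustration
C = np.array([[90,  5,  5],
              [10, 80, 10],
              [ 8, 12, 80]])
p, r, mAP, mAR = precision_recall(C)
```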
The experimental results, presented in Table 3 and Table 4, demonstrate the effectiveness of our proposed model enhancements. Specifically, our approach achieves a notable improvement in mean Average Precision (mAP) for both three-classification (+5.8%) and six-classification (+4.3%) recognition tasks.
This improvement can be attributed to addressing the limitations of the original Model A’s Transformer architecture. While advantageous for large target recognition, the Transformer’s self-attention mechanism struggles with the small target sizes characteristic of SAR imagery. Our enhancements target this issue.
Model B: By incorporating deformable convolution, Model B demonstrates a superior ability to capture small target features. Compared to Model A, this translates to mAP improvements of 0.3% and 0.9% for three- and six-classification tasks, respectively.
Model C: Multi-scale feature fusion (MSFF) enables Model C to leverage information across different scales. This results in even more pronounced mAP gains compared to Model A: +2.3% for three-classification and +1.3% for six-classification.
Model D: Our final model, Model D, integrates both deformable convolution and multi-scale feature fusion. This combination synergistically leverages the strengths of both approaches, achieving the most significant improvement in deep learning-based target recognition performance.
These findings highlight the importance of tailoring model architectures to the specific challenges posed by SAR imagery. Our proposed enhancements effectively address the limitations of standard Transformer-based approaches, leading to substantial improvements in small target recognition accuracy.
Figure 9 showcases the enhanced recognition capabilities of the improved model compared to its predecessor. The baseline model exhibits misclassifications for Tanker, Tug, and Dredging vessel types, and fails to detect Passenger vessels altogether. In contrast, the improved model accurately identifies all vessel types, demonstrating its superior performance.
Table 5 and Table 6 present the classification accuracy achieved on the OpenSAR Ship three-category and six-category datasets, respectively. The results clearly indicate that DCN-MSFF-TR outperforms the original TR architecture, achieving higher ship recognition accuracy across all categories. This improvement underscores the efficacy of integrating the DCN and MSFF modules for enhanced ship recognition.
Specifically, the “Cargo” category consistently achieves the highest accuracy in both datasets, likely due to its greater representation in the training data. While “Fishing” vessels are accurately classified in the three-category scenario, their accuracy noticeably declines in the six-category dataset. This suggests potential difficulties in distinguishing between finer-grained ship types.
This observation highlights the challenge of class imbalance. Despite employing data balancing techniques to ensure uniform data distribution, categories with limited training instances still exhibit lower classification accuracy. Nonetheless, the strong performance of DCN-MSFF-TR, particularly with well-represented categories, demonstrates its capacity to effectively extract discriminative features from sufficient data.
Table 7 and Table 8 present the per-category recall rates on the OpenSARShip-3-Complex and OpenSARShip-6-Complex datasets, respectively. A comparison of the experimental results demonstrates that the DCN-MSFF-TR architecture effectively mitigates the issue of missed detections for single-category ships compared to the original TR architecture. The tables reveal an improvement in recall rates across all categories in both datasets, with an average recall improvement of 9.23%. Notably, the Cargo ship type exhibits the highest recall rate at 85.5%, performing well in both the three-category and six-category classifications. This superior performance is likely attributed to the relatively stable image features inherent to the Cargo ship type. Furthermore, the Fishing ship type demonstrates the most significant improvement in recall rate within the six-category classification. These results underscore the effectiveness of the DCN-MSFF-TR model in reducing the missed-detection rate across all categories within the given datasets.
Figure 10 and Figure 11 illustrate the classification results obtained on the Sentinel-1 OpenSARShip-3-Complex (three-class) and OpenSARShip-6-Complex (six-class) datasets, respectively.
Table 9 and Table 10 present the confusion matrices for DCN-MSFF-TR on the OpenSARShip-3-Complex (three-class) and OpenSARShip-6-Complex (six-class) datasets, respectively. The dominant values along the diagonals of both matrices indicate that DCN-MSFF-TR correctly classifies the majority of ships. This observation highlights the model’s effectiveness in accurately recognizing different ship types.

3.3. Benchmarking against Other Models

Table 11 clearly demonstrates that DCN-MSFF-TR achieves superior recognition accuracy compared to other state-of-the-art methods. Specifically, on the OpenSARShip-3-Complex dataset, DCN-MSFF-TR achieves an accuracy of 78.1%, significantly surpassing the second-best performing model, SqueezeNet, at 72.2%. On the OpenSARShip-6-Complex dataset, DCN-MSFF-TR achieves a precision of 66.7%, substantially outperforming the second-best model, DenseNet, by 13.2%. Furthermore, DCN-MSFF-TR achieves recall rates of 78.0% and 75.0% on the OpenSARShip-3-Complex and OpenSARShip-6-Complex datasets, respectively, again exceeding the performance of the second-best model, DenseNet.
This enhanced performance is attributed to the synergistic integration of key components within the DCN-MSFF-TR architecture. Specifically, the deformable convolution network (DCN) allows for more effective feature extraction by adapting to the varying shapes and orientations of ships in SAR images. Simultaneously, the multi-scale feature fusion (MSFF) module within the Transformer architecture facilitates the aggregation of information from different scales, enriching the ship feature representation. This combination enables DCN-MSFF-TR to achieve higher accuracy in SAR ship recognition.

4. Conclusions

This paper introduces DCN-MSFF-TR, a novel deep learning model for SAR ship recognition. Designed for precise ship type classification, DCN-MSFF-TR leverages an end-to-end Transformer-based architecture. The model incorporates a Deformable Convolutional Network (DCN) within its backbone to enhance feature extraction by effectively capturing ship characteristics despite their sparse distribution in SAR images. Concurrently, feature layers processed by the Transformer’s multi-scale attention mechanism are fused with features at corresponding levels of the model’s feature pyramid to enhance recognition performance. Experiments conducted on the publicly available OpenSARShip-3-Complex (three-category) and OpenSARShip-6-Complex (six-category) datasets demonstrate that the DCN-MSFF-TR model significantly improves SAR image ship recognition. The model achieves average precisions of 78.1% and 66.7% and average recalls of 78.0% and 75.0% on the three-category and six-category datasets, respectively. Therefore, our model outperforms the current mainstream models in both mean Average Precision (mAP) and mean Average Recall (mAR). However, the results also indicate that the proposed model does not completely eliminate misclassifications and missed detections. Further analysis and research are required to enhance the model’s feature extraction capabilities and address these limitations. Furthermore, the inherent complexity of SAR images, which often require expert interpretation, poses challenges for dataset expansion. Consequently, exploring semi-supervised or unsupervised learning approaches for SAR ship feature extraction, with a focus on maintaining robust performance in fine-grained ship classification, presents a compelling avenue for future research.

Author Contributions

P.C. was responsible for the construction of the ship detection dataset, constructed the outline of the manuscript, and prepared the first draft of the manuscript; H.Z. conceived and designed the algorithm and contributed to the manuscript and experiments; Y.L. supervised the experiments and was also responsible for the dataset; B.L. performed ship detection using deep learning methods; B.L. and P.L. provided supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Scientific Research Project for Liaoning Education Department, LJKMZ20222006, and the National Natural Science Foundation of China, 52271359.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Owing to the nature of this research, the participants in this study did not agree that their data can be publicly shared; therefore, supporting data are not available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xiao, X.; Zhou, Z.; Wang, B.; Li, L.; Miao, L. Ship Detection under Complex Backgrounds Based on Accurate Rotated Anchor Boxes from Paired Semantic Segmentation. Remote Sens. 2019, 11, 2506. [Google Scholar] [CrossRef]
  2. Liu, G.; Zhang, X.; Meng, J. A small ship target detection method based on polarimetric SAR. Remote Sens. 2019, 11, 2938. [Google Scholar] [CrossRef]
  3. Li, Y.; Du, L.; Wei, D. Multiscale CNN based on component analysis for SAR ATR. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 5211212. [Google Scholar] [CrossRef]
  4. Fu, Q.; Luo, K.; Song, Y.; Zhang, M.; Zhang, S.; Zhan, J.; Duan, J.; Li, Y. Study of Sea Fog Environment Polarization Transmission Characteristics. Appl. Sci. 2022, 12, 8892. [Google Scholar] [CrossRef]
  5. Chen, P.; Li, Y.; Zhou, H.; Liu, B.; Liu, P. Detection of Small Ship Objects Using Anchor Boxes Cluster and Feature Pyramid Network Model for SAR Imagery. J. Mar. Sci. Eng. 2020, 8, 112. [Google Scholar] [CrossRef]
  6. Lyu, H.; Shao, Z.; Cheng, T.; Yin, Y.; Gao, X. Sea-Surface Object Detection Based on Electro-Optical Sensors: A Review. IEEE Intell. Transp. Syst. Mag. 2023, 15, 190–216. [Google Scholar] [CrossRef]
  7. Graziano, M.D.; Renga, A.; Moccia, A. Integration of Automatic Identification System (AIS) Data and Single-Channel Synthetic Aperture Radar (SAR) Images by SAR-Based Ship Velocity Estimation for Maritime Situational Awareness. Remote Sens. 2019, 11, 2196. [Google Scholar] [CrossRef]
  8. Xiong, B.; Sun, Z.; Wang, J.; Leng, X.; Ji, K. A Lightweight Model for Ship Detection and Recognition in Complex-Scene SAR Images. Remote Sens. 2022, 14, 6053. [Google Scholar] [CrossRef]
  9. Zhao, Z.; Ji, K.; Xing, X.; Chen, W.; Zou, H. Ship Classification with High Resolution TerraSAR-X Imagery Based on Analytic Hierarchy Process. Int. J. Antennas Propag. 2013, 2013, 698370. [Google Scholar] [CrossRef]
  10. Wu, J.; Zhu, Y.; Wang, Z.; Song, Z.; Liu, X.; Wang, W.; Zhang, Z.; Yu, Y.; Xu, Z.; Zhang, T.; et al. A novel ship classification approach for high resolution SAR images based on the BDA-KELM classification model. Int. J. Remote Sens. 2017, 38, 6457–6476. [Google Scholar] [CrossRef]
  11. Zhou, G.; Zhang, G.; Xue, B. A maximum-information-minimum-redundancy-based feature fusion framework for ship classification in moderate-resolution SAR image. Sensors 2021, 21, 519. [Google Scholar] [CrossRef] [PubMed]
  12. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
  13. Li, J.; Qu, C.; Peng, S. Ship classification for unbalanced SAR dataset based on convolutional neural network. J. Appl. Remote. Sens. 2018, 12, 035010. [Google Scholar] [CrossRef]
  14. Bentes, C.; Velotto, D.; Tings, B. Ship classification in TerraSAR-X images with convolutional neural networks. IEEE J. Ocean. Eng. 2017, 43, 258–266. [Google Scholar] [CrossRef]
  15. Dong, Y.; Zhang, H.; Wang, C.; Wang, Y. Fine-grained ship classification based on deep residual learning for high-resolution SAR images. Remote Sens. Lett. 2019, 10, 1095–1104. [Google Scholar] [CrossRef]
  16. Huang, Z.; Pan, Z.; Lei, B. What, where, and how to transfer in SAR target recognition based on deep CNNs. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2324–2336. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 261–272. [Google Scholar]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  19. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Li, K.; Zhang, M.; Xu, M.; Tang, R.; Wang, L.; Wang, H. Ship detection in SAR images based on feature enhancement Swin transformer and adjacent feature fusion. Remote Sens. 2022, 14, 3186. [Google Scholar] [CrossRef]
  22. Sun, Z.; Meng, C.; Cheng, J.; Zhang, Z.; Chang, S. A Multi-Scale Feature Pyramid Network for Detection and Instance Segmentation of Marine Ships in SAR Images. Remote Sens. 2022, 14, 6312. [Google Scholar] [CrossRef]
  23. Chen, P.; Zhou, H.; Li, Y.; Liu, P.; Liu, B. A novel deep learning network with deformable convolution and attention mechanisms for complex scenes ship detection in SAR images. Remote Sens. 2023, 15, 2589. [Google Scholar] [CrossRef]
  24. Huang, L.; Liu, B.; Li, B.; Guo, W.; Yu, W.; Zhang, Z.; Yu, W. OpenSARShip: A dataset dedicated to Sentinel-1 ship interpretation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 11, 195–208. [Google Scholar] [CrossRef]
  25. Ao, W.; Xu, F.; Qian, Y.; Guo, Q. Feature clustering based discrimination of ship targets for SAR images. J. Eng. 2019, 2019, 6920–6922. [Google Scholar] [CrossRef]
  26. Li, Y.; Ding, Z.; Zhang, C.; Wang, Y.; Chen, J. SAR ship detection based on resnet and transfer learning. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1188–1191. [Google Scholar]
  27. Hsia, S.C.; Wang, S.H.; Chang, C.Y. Convolution neural network with low operation FLOPS and high accuracy for image recognition. J. Real-Time Image Process. 2021, 18, 1309–1319. [Google Scholar] [CrossRef]
  28. Liu, S.; Kong, W.; Chen, X.; Xu, M.; Yasir, M.; Zhao, L.; Li, J. Multi-scale ship detection algorithm based on a lightweight neural network for spaceborne SAR images. Remote Sens. 2022, 14, 1149. [Google Scholar] [CrossRef]
  29. Zhang, T.; Zhang, X. Squeeze-and-excitation Laplacian pyramid network with dual-polarization feature fusion for ship classification in SAR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4019905. [Google Scholar] [CrossRef]
  30. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
  31. Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Luo, P.; Chen, K. Dense distinct query for end-to-end object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7329–7338. [Google Scholar]
Figure 1. Schematic diagram of the Transformer encoder–decoder-based recognition model.
Figure 2. Comparison of sampling processes: standard vs. deformable convolutions. (a) Standard convolution sampling and (b) deformable convolution sampling.
Figure 3. An exploration into the learning mechanisms of deformable convolutional kernels.
Figure 4. Feature heat map after fusion of multi-scale attention features.
Figure 5. Improved DCN-MSFF-TR architecture diagrams.
Figure 6. Ship samples from the OpenSARShip-3-Complex dataset.
Figure 7. Ship samples from the OpenSARShip-6-Complex dataset.
Figure 8. Convergence of the loss function over epochs, with (a) original loss and (b) improved loss.
Figure 9. Classification results for baseline and improved models.
Figure 10. Ship classification performance on the OpenSARShip-3-Complex dataset.
Figure 11. Analysis of ship classification results using the OpenSARShip-6-Complex dataset. Red elliptical boxes indicate missed detections.
Table 1. OpenSARShip-3-Complex dataset for SAR ship triple classification.
Categories | Number of Training Samples | Number of Test Samples | Total
Cargo | 930 | 400 | 1330
Fishing | 930 | 400 | 1330
Tanker | 930 | 400 | 1330
Table 2. OpenSARShip-6-Complex dataset for SAR six-class ship classification.
Categories | Number of Training Samples | Number of Test Samples | Total
Cargo | 700 | 300 | 1000
Fishing | 700 | 300 | 1000
Tanker | 700 | 300 | 1000
Passenger | 700 | 300 | 1000
Tug | 700 | 300 | 1000
Dredging | 700 | 300 | 1000
Table 3. Modeling of tri-categorical ablation experiments using DCN-MSFF-TR.
Model No. | Models | mAP (%)
A | resnet101 + TR | 72.3
B | resnet101 + TR + DCN | 72.6
C | resnet101 + TR + MSFF | 74.6
D | resnet101 + TR + DCN + MSFF | 78.1
Table 4. Modeling of six-categorical ablation experiments using DCN-MSFF-TR.
Model No. | Models | mAP (%)
A | resnet101 + TR | 62.4
B | resnet101 + TR + DCN | 63.3
C | resnet101 + TR + MSFF | 63.7
D | resnet101 + TR + DCN + MSFF | 66.7
Table 5. Single-category accuracy in tri-categorical classification.
Categories | TR Precision (%) | DCN-MSFF-TR Precision (%)
Cargo | 82.9 | 85.5
Fishing | 74.6 | 79.5
Tanker | 59.5 | 69.4
Table 6. Single-category accuracy in six-categorical classification.
Categories | TR Precision (%) | DCN-MSFF-TR Precision (%)
Cargo | 75.9 | 79.0
Passenger | 70.7 | 73.7
Tanker | 53.0 | 61.9
Fishing | 43.2 | 52.2
Tug | 64.1 | 64.7
Dredging | 68.2 | 68.4
Table 7. Single-category recall in tri-categorical classification.
Categories | TR Recall (%) | DCN-MSFF-TR Recall (%)
Cargo | 82.9 | 85.5
Fishing | 74.6 | 79.2
Tanker | 59.5 | 69.3
Table 8. Single-category recall in six-categorical classification.
Categories | TR Recall (%) | DCN-MSFF-TR Recall (%)
Cargo | 74.3 | 80.7
Passenger | 70.1 | 79.2
Tanker | 65.2 | 72.5
Fishing | 43.2 | 66.2
Tug | 70.5 | 74.2
Dredging | 60.4 | 77.0
Table 9. Confusion matrix of the tri-categorical test dataset.
True Category | Forecast Cargo | Forecast Fishing | Forecast Tanker
Cargo | 342 | 30 | 80
Fishing | 33 | 318 | 49
Tanker | 49 | 74 | 277
Table 10. Confusion matrix of the six-categorical test dataset.
True Category | Forecast Cargo | Forecast Passenger | Forecast Tanker | Forecast Fishing | Forecast Tug | Forecast Dredging
Cargo | 237 | 10 | 13 | 15 | 13 | 12
Passenger | 12 | 221 | 15 | 19 | 16 | 17
Tanker | 18 | 22 | 185 | 27 | 25 | 23
Fishing | 26 | 27 | 33 | 156 | 30 | 28
Tug | 16 | 19 | 23 | 25 | 194 | 23
Dredging | 14 | 18 | 21 | 23 | 19 | 205
Table 11. Comparative accuracy of DCN-MSFF-TR against existing methods.
Models | Three-Category mAP (%) | Three-Category mAR (%) | Six-Category mAP (%) | Six-Category mAR (%)
VGG [25] | 70.1 | 69.5 | 51.9 | 64.6
ResNet [26] | 74.6 | 72.1 | 48.3 | 66.1
DenseNet [27] | 74.7 | 74.6 | 53.5 | 68.7
MobileNet [28] | 69.9 | 70.4 | 48.8 | 65.1
SqueezeNet [29] | 72.2 | 72.3 | 53.1 | 66.3
DINO [30] | 68.4 | 72.0 | 49.9 | 70.4
DDQ [31] | 67.5 | 73.9 | 49.3 | 72.1
DCN-MSFF-TR | 78.1 | 78.0 | 66.7 | 75.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
