Article

Complex-Valued 2D-3D Hybrid Convolutional Neural Network with Attention Mechanism for PolSAR Image Classification

by Wenmei Li 1, Hao Xia 1, Jiadong Zhang 1, Yu Wang 2, Yan Jia 1 and Yuhong He 3,*

1 School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2 College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
3 Department of Geography, Geomatics and Environment, University of Toronto, Mississauga, ON L5L 1C6, Canada
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2908; https://doi.org/10.3390/rs16162908
Submission received: 5 June 2024 / Revised: 25 July 2024 / Accepted: 5 August 2024 / Published: 9 August 2024
(This article belongs to the Special Issue Advances in Synthetic Aperture Radar Data Processing and Application)

Abstract

The recently introduced complex-valued convolutional neural network (CV-CNN) has brought considerable advances to polarimetric synthetic aperture radar (PolSAR) image classification by effectively incorporating both magnitude and phase information. However, a 2D or 3D CNN used alone either fails to fully extract features along the scattering-channel dimension or requires an excessive number of parameters. Moreover, these networks treat all information as equally important by default, spending substantial resources on processing irrelevant information. To address these issues, this study presents a new hybrid CV-CNN with an attention mechanism (CV-2D/3D-CNN-AM) for classifying PolSAR ground objects, offering both excellent computational efficiency and strong feature extraction capability. In the proposed framework, multi-level discriminative features are extracted from preprocessed data through hybrid networks in the complex domain, and a dedicated attention block filters feature importance along both the spatial and channel dimensions. Experimental results on three PolSAR datasets demonstrate the superiority of our approach over existing ones. Furthermore, ablation experiments confirm the validity of each module, highlighting our model’s robustness and effectiveness.

Graphical Abstract

1. Introduction

The classification of objects using polarimetric synthetic aperture radar (PolSAR), which transmits electromagnetic waves and receives scattering echoes under all-weather, day-and-night conditions, has proven to be a significant asset in the remote sensing field. In contrast to conventional synthetic aperture radar (SAR), PolSAR captures four distinct polarimetric states, leveraging electromagnetic wave properties for an enhanced understanding of the target. Image classification is a key step in interpreting PolSAR images: it determines the true category of each pixel by utilizing the available information. It is widely used in urban planning [1], agricultural production [2], geological survey [3], target detection [4], ocean exploration [5], and so on.
Traditional PolSAR image classification is primarily enhanced by scattering feature extraction and classifier structure improvement. There are two categories of feature extraction methods: those reliant on simple combinations and transformations, and those reliant on polarimetric target decomposition. The former includes the polarimetric scattering matrix and its corresponding vector expansion [6], polarimetric coherence matrix, and covariance matrix [7]. The latter decomposes these matrices into distinct components, each conveying specific scattering or geometric structure information of targets. Commonly used decomposition methods include Cloude decomposition [8], Krogager decomposition [9], Cameron decomposition [10], and Freeman–Durden decomposition [11]. The above methods provide more parameters of targets, but a lot of manpower and more experience are required for feature extraction and selection in a specific task. Regarding the design of classifiers, the main approach evolved from machine learning, including support vector machine (SVM) [12], random forest [13] and decision tree [14]. These methods excel in independent learning and tackling complex non-linear problems. However, the major limitation is the dependence on previous experience for artificial feature selection.
Deep learning (DL) techniques have recently shown impressive potential in computer vision by autonomously acquiring high-level abstract features through incremental learning [15]. Compared to machine learning, they eliminate the reliance on manual feature extraction. Typical DL methods mainly include deep belief network (DBN) [16], sparse auto-encoder (SAE) [17], and convolutional neural network (CNN) [18], where CNN is often used for image classification due to the characteristics of parameter sharing and sparse connections [19]. Multiple CNN models have been developed for PolSAR image classification with excellent results, such as 2D-CNN [20], 3D-CNN [21], and fully convolutional network (FCN) [22]. Generally, the network’s feature extraction capability improves as the computational dimension of the convolutional kernel expands. Specifically, networks based on 3D convolution work better than 1D and 2D in most cases. However, an increase in convolutional dimension also leads to a corresponding rise in model complexity, which means that the classification task requires more time and equipment cost for training. This is consistently deemed undesirable in practical applications.
Notably, all the above approaches entail projecting the complex-valued PolSAR data onto the real domain, serving as input for the model. Nevertheless, the projection will lead to some significant information being omitted. It is necessary to design a complex-valued CNN (CV-CNN) instead of a real-valued CNN (RV-CNN) to fully utilize the abundant information contained in complex data. To address this challenge, Zhang et al. [23] pioneered a CV-CNN that expands all elements to the complex domain, effectively exploiting the phase information in the coherence matrix of PolSAR data and achieving higher accuracy than the RV-CNN. Subsequently, methods such as CV-3D-CNN [24] and CV-FCN [25] have also been proposed in succession, achieving superior results to the real-domain methods. Nevertheless, the augmented quantity of parameters required for expanding the model from the real to the complex domain calls for more labeled samples for training. Unfortunately, this poses a challenge for PolSAR data, which are arduous to interpret manually, making the situation less favorable.
Attention mechanisms are inspired by the human ability to suppress unimportant stimuli and allocate neural computational resources to critical parts when processing complex visual information [26]. In CNN, the importance of each feature map is different. The central concept of attention mechanisms is to give larger weight to the important features and devote more computational resources of the network to more important tasks. Common attention mechanisms include SE [27], CBAM [28], ECA [29], CA [30], etc. Currently, several PolSAR classification studies have noticed the effects of attention mechanisms. They applied attention mechanisms for preprocessing input data and network architecture improvements [31,32,33,34]. Although these approaches acknowledge the efficacy of the attention mechanism in boosting model performance, most of them simply employ it to enhance model inputs or integrate it with sequence networks. How to apply the attention mechanism to the feature maps inside a CNN, especially CV-CNN, is a major challenge.
To tackle the aforementioned issues, this article explores a new complex-valued 2D-3D hybrid convolutional neural network with attention mechanism (CV-2D/3D-CNN-AM). The model is capable of directly processing phase information in a complex form, using a combination of 2D and 3D convolution to extract effective discriminative features and reduce model complexity, along with a specific attention block to improve classification performance.
The major contributions of our article can be summarized as follows:
  • An innovative hybrid CV-CNN for PolSAR image classification is presented, which effectively incorporates the advantages of both 2D and 3D convolution. Specifically, the model first extracts spatial features using complex-valued 2D convolutions with fewer parameters and then integrates spatial and channel dimensional features using 3D convolutions. Compared with a single-dimensional CNN, this hybrid network effectively reduces the number of parameters to avoid overfitting while maintaining better performance.
  • We design an attention block for CV-CNN to increase the model’s efficiency and accuracy. The proposed attention block utilizes both the real and imaginary parts of features to calculate weights that represent feature importance. The classification results can effectively be improved through the utilization of recalibrated features supplemented with appropriate weights. It is worth noting that the inputs and outputs of the block are in complex form.
  • Experiments are performed on three authentic PolSAR datasets to confirm the advancement of our proposed method. Under the same conditions, CV-2D/3D-CNN-AM provides a competitive classification outcome, which effectively extracts discriminative features and focuses on more significant information.
The remainder of this article is as follows: Section 2 provides an overview of the related work. Section 3 delves into the detailed description of the proposed method. Section 4 analyzes the experimental results conducted on various datasets, and lastly, the conclusions are provided in Section 5.

2. Related Work

2.1. PolSAR Data Processing

PolSAR captures the polarimetric state of the targets under examination through multiple polarization combinations. The scattering characteristics are represented by the complex two-dimensional scattering matrix $\mathbf{S}$ as follows:

$$\mathbf{S} = \begin{bmatrix} S_{HH} & S_{HV} \\ S_{VH} & S_{VV} \end{bmatrix} \tag{1}$$

where H and V denote the horizontal and vertical modes, respectively. $S_{pq}$ denotes that the transmitting wave is $p$ and the receiving wave is $q$, with $p, q \in \{H, V\}$. When the reciprocity condition is fulfilled, the equation $S_{HV} = S_{VH}$ holds in the monostatic case.
The scattering matrix $\mathbf{S}$ can be vectorized as $\mathbf{K}$ by the Pauli decomposition [7]:

$$\mathbf{K} = \frac{1}{\sqrt{2}}\left[\, S_{HH} + S_{VV},\; S_{HH} - S_{VV},\; 2S_{HV} \,\right]^{T} \tag{2}$$
where the superscript $T$ denotes the transpose operation. The multi-look coherence matrix $\mathbf{T}$ is calculated as:

$$\mathbf{T} = \frac{1}{L}\sum_{i=1}^{L}\mathbf{K}_i\mathbf{K}_i^{H} = \begin{bmatrix} T_{11} & T_{12} & T_{13} \\ T_{21} & T_{22} & T_{23} \\ T_{31} & T_{32} & T_{33} \end{bmatrix} \tag{3}$$

where $L$ is the number of looks and the superscript $H$ denotes the conjugate transpose operation. According to Equation (3), the coherence matrix $\mathbf{T}$ is a Hermitian matrix with real diagonal elements and complex off-diagonal elements. Therefore, it is sufficient to feed only the upper triangular elements into the model.
For real-valued models, the input eigenvector is expressed as a vector of nine elements:

$$\left[\, T_{11},\, T_{22},\, T_{33},\, \Re(T_{12}),\, \Im(T_{12}),\, \Re(T_{13}),\, \Im(T_{13}),\, \Re(T_{23}),\, \Im(T_{23}) \,\right] \tag{4}$$

where $\Re(\cdot)$ and $\Im(\cdot)$ represent the real and imaginary parts of a complex number, respectively.
For complex-valued models, the real and imaginary parts of the elements do not need to be separated, and the eigenvector is represented as a vector of six elements:

$$\left[\, T_{11} + 0\cdot j,\, T_{22} + 0\cdot j,\, T_{33} + 0\cdot j,\, T_{12},\, T_{13},\, T_{23} \,\right] \tag{5}$$

where $j$ is the imaginary unit satisfying $j^2 = -1$.
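As a concrete illustration, the following minimal sketch assembles this six-channel complex input from a per-pixel coherence matrix. The array name and layout (a hypothetical `T` of shape `(H, W, 3, 3)`) are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np

def coherence_to_complex_features(T):
    """Build the 6-element complex eigenvector of Equation (5) for every pixel.

    Assumes T is a (H, W, 3, 3) array of per-pixel Hermitian coherence matrices."""
    return np.stack([T[..., 0, 0],   # T11 (real-valued diagonal, stored as complex)
                     T[..., 1, 1],   # T22
                     T[..., 2, 2],   # T33
                     T[..., 0, 1],   # T12
                     T[..., 0, 2],   # T13
                     T[..., 1, 2]],  # T23
                    axis=-1).astype(np.complex64)
```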
Polarimetric feature extraction methods aim to provide more scattering properties of the target, among which the Cloude decomposition is widely used and effective [8]. In this decomposition, eigenvalues and eigenvectors are obtained from the coherence matrix $\mathbf{T}$ by spectral decomposition:

$$\mathbf{T} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{H} = \mathbf{U}\begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix}\mathbf{U}^{H} \tag{6}$$

where $\boldsymbol{\Lambda}$ is a diagonal matrix consisting of the three eigenvalues $\lambda_1$, $\lambda_2$, and $\lambda_3$ of $\mathbf{T}$, and $\mathbf{U}$ is the matrix whose columns are the corresponding eigenvectors. The entropy $H$, anisotropy $A$, and mean scattering angle $\alpha$ are defined as follows, respectively:
$$H = -\sum_{i=1}^{3} P_i \log_{3} P_i, \qquad P_i = \frac{\lambda_i}{\sum_{j=1}^{3}\lambda_j} \tag{7}$$

$$A = \frac{\lambda_2 - \lambda_3}{\lambda_2 + \lambda_3} \tag{8}$$

$$\alpha = \sum_{i=1}^{3} P_i \alpha_i \tag{9}$$
where $\alpha_i$ represents the scattering angle corresponding to $\lambda_i$. Taking the above decomposition features into account, the real-valued input eigenvector expands to:

$$\left[\, T_{11},\, T_{22},\, T_{33},\, \Re(T_{12}),\, \Im(T_{12}),\, \Re(T_{13}),\, \Im(T_{13}),\, \Re(T_{23}),\, \Im(T_{23}),\, H,\, A,\, \alpha \,\right] \tag{10}$$
The complex-valued input eigenvector expands to:

$$\left[\, T_{11} + 0\cdot j,\, T_{22} + 0\cdot j,\, T_{33} + 0\cdot j,\, T_{12},\, T_{13},\, T_{23},\, H + 0\cdot j,\, A + 0\cdot j,\, \alpha + 0\cdot j \,\right] \tag{11}$$
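A minimal sketch of how these Cloude decomposition features could be computed is given below; it assumes the same hypothetical `(H, W, 3, 3)` coherence array as above and takes each $\alpha_i$ from the first component of the corresponding eigenvector, as in [8].

```python
import numpy as np

def cloude_features(T):
    """Compute entropy H, anisotropy A, and mean alpha (Equations (6)-(9)) per pixel."""
    eigval, eigvec = np.linalg.eigh(T)                 # ascending eigenvalues of Hermitian T
    eigval = np.clip(eigval[..., ::-1], 1e-10, None)   # sort descending, guard against log(0)
    eigvec = eigvec[..., ::-1]                         # reorder eigenvectors accordingly
    P = eigval / eigval.sum(axis=-1, keepdims=True)    # pseudo-probabilities P_i
    H = -(P * np.log(P) / np.log(3)).sum(axis=-1)      # entropy, Equation (7)
    A = (eigval[..., 1] - eigval[..., 2]) / (eigval[..., 1] + eigval[..., 2])  # Equation (8)
    alpha_i = np.arccos(np.abs(eigvec[..., 0, :]))     # alpha angle of each eigenvector
    alpha = (P * alpha_i).sum(axis=-1)                 # mean alpha, Equation (9)
    return H, A, alpha
```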

2.2. CV-CNN-Based PolSAR Classification Methods

The distinctive phase information of PolSAR is important in their applications, and the CV-CNN has been extensively researched for its capacity to fully utilize the phase information present in the complex values of PolSAR. The CV-CNN was first employed by Zhang et al. [23] for PolSAR tasks. Recently, CV-CNN-based approaches have been continuously updated, which are broadly classified into driving features-based improvement, model structure-based improvement, and updating strategies-based improvement according to their main contributions.
(1) Driving features-based improvement: Utilizing the proficient understanding of the targeted scattering mechanism interpretation and efficient polarimetric feature extraction in guiding the model inputs contributes to enhancing the classification performance of the convolutional neural network. Qin et al. [35] fused the six-dimensional artificial features and the upper half elements of the coherence matrix into a twelve-dimensional complex vector as input for CV-CNN, which achieves better and more stable classification performance. Barrachina et al. [36] concluded that the data representation of the correlation matrix has limitations in the pixel-level task, and changed to use Pauli representation as input.
However, blindly increasing the dimension of polarimetric features will result in feature redundancy. Redundant inputs not only increase extra computation but also make the model prone to local overfitting. Han et al. [37] employed random forest to select the manually extracted seventeen-dimensional feature vectors and filter the six-dimensional most important features as the optimal set. Yang et al. [38] designed a 1D-CNN feature selection method and combined it with the Kullback–Leibler distance to determine the final feature combination.
(2) Model structure-based improvement: The upper limit of model performance and learning capacity primarily depends on the model structure. Mullissa et al. [39] designed a more stable PolSARNet model that adopts a real-valued kernel function to process complex data and produce complex-valued outputs. Tan et al. [24] proposed CV-3D-CNN, which combines the advantages of 3D-CNN and CV-CNN and makes a breakthrough in classification accuracy but also consumes more time for model training. Ren et al. [40] made improvements in the non-convolutional layers of the CV-CNN to achieve better performance. Tan et al. [41] designed a triplet CV-CNN for acquiring the complex-valued representation by enlarging interclass distance and reducing intraclass distance.
Because a conventional CNN predicts the category of a central pixel from all pixels within its surrounding patch, which lowers computational efficiency [42], several studies have focused on developing pixel-to-pixel methods. Cao et al. [25] designed a deep CV-FCN for precise pixel-level annotation, together with a novel loss function to optimize the network. Mullissa et al. [43] explored a multi-branch CV-FCN, which is excellent for dealing with scattering noise in PolSAR images. Liu et al. [44] proposed a polarization coding that effectively preserves structural information in the scattering matrix and combined it with a CV-FCN to enhance its performance and achieve greater resilience.
(3) Updating strategies-based improvement: Regular CNN models usually require a supervised learning strategy to model parameter updating. It is challenging to obtain a substantial amount of labeled samples for training in this scenario, particularly for PolSAR images that are intricate to decode manually. Qin et al. [45] presented a weakly supervised CV-CNN, which trains the model by iteratively selecting the labeled samples simultaneously and uses it directly for the classification task.
In recent times, scholars have shown significant interest in semi-supervised learning owing to its capacity to effectively handle a vast amount of unlabeled samples alongside limited labeled samples [46]. This strategy empowers the model to attain favorable classification performance. Xie et al. [47] designed a recurrent CV-CNN, which continuously expands the number of training sample sets through recurrent training and still achieves high accuracy with fewer labels. Zhu et al. [48] used multiple classifiers to generate strong and weak datasets and then used the former as a label to reclassify the latter. Zeng et al. [49] integrated the semi-supervised learning framework with data features to reduce the dependence on label annotation.

2.3. Attention Mechanisms

For image classification, attention mechanisms are mainly classified into the following categories depending on the data area of action: channel attention, spatial attention, and channel and spatial attention [50]. Concretely, Figure 1 shows their operations on features.
Channel attention assigns weights to each channel of feature maps, thereby letting the network know what it needs to focus on. Hu et al. [27] pioneered the channel attention concept and designed the SE network, which improves the model’s expressive power by squeezing and exciting the input features. Li et al. [51] introduced a selective kernel to improve the recognition efficiency based on channel attention.
Spatial attention transforms the spatial information in the feature map into another space with appropriate transformations and retains the key information, enabling the neural network to learn where it should pay attention. Mnih et al. [52] devised a spatial attention model that employs the inner state of the model to determine the location of attention and is capable of producing control signals in a dynamic context. Jaderberg et al. [53] designed a spatial transformer that can be used to obtain regions of interest in features through spatial transformations such as translation, rotation, and scaling.
Channel and spatial attention adaptively chooses representative objects and positions, which is more holistic than single-dimensional attention. Wang et al. [54] designed a residual attention network using channel and spatial attention for the first time. The model generates 3D attention maps by stacking attention modules that produce attention-aware features. However, it is restricted by the high computational workload and limited perceptual domains. To address the above limitations, Refs. [28,55] improved the calculation speed by importing global average pooling and decoupling attention. Refs. [56,57] used self-attentive modules to coordinate the two dimensions of channel and space. Refs. [58,59] improved the validity of attentional outcomes by increasing the receptive field.
In the PolSAR image domain, Dong et al. [31] employed the attention mechanism in polarimetric feature selection to assure the efficiency of high-dimensional data and inputted the selected data into a CNN for classification. Hua et al. [32] proposed an attention-based multi-scale sequential model with a new hybrid loss function to improve classification performance. Yang et al. [33] designed a polarization orientation angle (POA) attention to enable the network to focus on a more physically meaningful range of critical POA to enhance performance. Zhang et al. [34] explored a texture-based attention module that considers both channel and positional importance to learn enhanced features.

3. Proposed Method

3.1. Framework of CV-2D/3D-CNN-AM

In this article, we design a CV-2D/3D-CNN-AM for PolSAR image classification tasks, and Figure 2 shows its detailed structure. Overall, the framework architecture developed in this paper mainly consists of a 2D convolutional block, a specially designed attention mechanism module, a 3D convolutional block, and fully connected layers. The first three parts mainly act as feature extractors, while the last one is used as a classifier.
In terms of process, rich spatial features are first extracted from the processed PolSAR data by a 2D convolutional block consisting of two complex-valued 2D convolutional layers. The extracted features are dimensionality reduced by a complex-valued pooling layer and then fed into the attention module for computation, highlighting the more important information in the features. The 3D convolutional block is then used for feature extraction and integration in both scattering and spatial dimensions. Finally, the extracted feature tensor is sent to the dense linear layer to obtain the classified result. The following section presents more details of the CV-2D/3D-CNN-AM framework.

3.2. Complex-Valued 2D-3D Hybrid CNN

The hybrid 2D and 3D CNN, which is adept at processing complex-valued inputs, is employed for PolSAR image classification tasks to obtain ample discriminative information and valuable multi-scale features. The designed hybrid CNN is a feed-forward neural network with a deep structure: features are extracted by the feature extractor in a data-driven manner and then fed into the classifier, with no manual feature design in the whole process.
All elements of the feature extraction part including the parameters and operations in this network are extended from the real field to the complex field. The calculation implementation process for each layer is as follows.
Convolutional layer: Assume that the feature maps of the $l$th convolutional layer are $F_m^{(l)} \in \mathbb{C}^{W_l \times H_l \times M}$ and those of the subsequent $(l+1)$th layer are $F_n^{(l+1)} \in \mathbb{C}^{W_{l+1} \times H_{l+1} \times N}$, where $\mathbb{C}$ denotes the complex field and the superscripts give the corresponding dimensions. The set of filters between the $l$th and $(l+1)$th layers is defined as $w_{nm}^{(l+1)} \in \mathbb{C}^{S \times S \times M \times N}$ in the 2D case, which gains one dimension in the 3D case, $w_{nm}^{(l+1)} \in \mathbb{C}^{S \times S \times S \times M \times N}$. The additional bias is defined as $b_n^{(l+1)} \in \mathbb{C}^{N}$. The convolution operation proceeds as follows:

$$F_n^{(l+1)} = f\big(\Re(V_n^{(l+1)})\big) + j f\big(\Im(V_n^{(l+1)})\big) = \max\big(0, \Re(V_n^{(l+1)})\big) + j \max\big(0, \Im(V_n^{(l+1)})\big) \tag{12}$$

$$\begin{aligned} V_n^{(l+1)} &= \sum_{m=1}^{M} w_{nm}^{(l+1)} \ast F_m^{(l)} + b_n^{(l+1)} \\ &= \sum_{m=1}^{M}\Big(\Re\big(w_{nm}^{(l+1)}\big)\ast\Re\big(F_m^{(l)}\big) - \Im\big(w_{nm}^{(l+1)}\big)\ast\Im\big(F_m^{(l)}\big)\Big) \\ &\quad + j\sum_{m=1}^{M}\Big(\Re\big(w_{nm}^{(l+1)}\big)\ast\Im\big(F_m^{(l)}\big) + \Im\big(w_{nm}^{(l+1)}\big)\ast\Re\big(F_m^{(l)}\big)\Big) + b_n^{(l+1)} \end{aligned} \tag{13}$$

where $\max(\cdot)$ denotes taking the maximum value, the operator $\ast$ refers to the convolution operation, and $f(\cdot)$ stands for the ReLU non-linear activation function, which offers computational simplicity as well as sparse representation [60]. The difference between 2D and 3D convolution is illustrated directly in Figure 3, where the latter handles channel-dimension information better than the former but with more parameters. Figure 4 explains how complex-valued convolution is performed.
In this model, the 2D convolutional block consists of two convolutional layers. The number of convolution kernels in each layer is 9, the kernel size is 3 × 3, and the stride is 1. The padding pattern is set to “same”. The 3D convolutional block contains two 3D convolutional layers; the number of 3D convolution kernels in each layer is 32, the kernel size is 3 × 3 × 3, and the stride is set to 1. The padding pattern is likewise set to “same”.
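For concreteness, the following is a minimal TensorFlow sketch of one complex-valued 2D convolutional layer implementing Equations (12) and (13) with the configuration above (3 × 3 kernels, “same” padding); the class name and the channels-last real/imaginary tensor layout are assumptions for illustration, not the authors’ implementation.

```python
import tensorflow as tf

class ComplexConv2D(tf.keras.layers.Layer):
    """Complex-valued 2D convolution realized with two shared real-valued convolutions."""
    def __init__(self, filters, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        self.conv_r = tf.keras.layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)
        self.conv_i = tf.keras.layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)
        # complex bias b = b_r + j*b_i, one pair per output channel
        self.b_r = self.add_weight(name="bias_real", shape=(filters,), initializer="zeros")
        self.b_i = self.add_weight(name="bias_imag", shape=(filters,), initializer="zeros")

    def call(self, x_real, x_imag):
        # Equation (13): (W_r + jW_i) * (x_r + jx_i) = (W_r*x_r - W_i*x_i) + j(W_r*x_i + W_i*x_r)
        v_real = self.conv_r(x_real) - self.conv_i(x_imag) + self.b_r
        v_imag = self.conv_r(x_imag) + self.conv_i(x_real) + self.b_i
        # Equation (12): ReLU applied separately to the real and imaginary parts
        return tf.nn.relu(v_real), tf.nn.relu(v_imag)
```

A complex-valued 3D convolutional layer follows the same pattern with `Conv3D` in place of `Conv2D`.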
Pooling layer: The pooling layer downsamples features in the spatial dimension by averaging or taking the maximum over local windows, which merges locally similar features and reduces the feature dimension. In addition, pooling enlarges the receptive field of the DL network and makes it more robust to shifts in the input feature positions.
This model adopts average pooling for data dimension reduction. The average pooling operation in complex form is defined as:

$$F_n^{(l+1)}(x, y) = \operatorname*{ave}_{a,b=0,\ldots,d-1}\Big(\Re\big(F_i^{(l)}(x\cdot s + a,\; y\cdot s + b)\big)\Big) + j \operatorname*{ave}_{a,b=0,\ldots,d-1}\Big(\Im\big(F_i^{(l)}(x\cdot s + a,\; y\cdot s + b)\big)\Big) \tag{14}$$

where $F_n^{(l+1)}(x, y)$ is the cell of the feature map $F_n^{(l+1)}$ at location $(x, y)$, $d$ and $s$ represent the size and stride of the pooling kernel, respectively, and $\operatorname{ave}(\cdot)$ refers to averaging.
Since the pooling operation inevitably loses some information, the pooling kernel size and stride should not be too large. A suitable choice is a kernel size of 2 × 2 with a stride of 2.
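In TensorFlow, this amounts to pooling the real and imaginary parts independently; a minimal sketch under the same assumed channels-last layout is:

```python
import tensorflow as tf

def complex_avg_pool(x_real, x_imag):
    """Complex average pooling (Equation (14)): each part is pooled independently
    with a 2x2 kernel and stride 2, as recommended above."""
    pool = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2)
    return pool(x_real), pool(x_imag)
```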
Fully Connected Layer: After the convolution and pooling computations, the learned features are typically mapped from the feature space to the label space through one or more fully connected layers. The output of a fully connected layer is represented as:

$$F_n^{(l+1)} = f\big(\Re(V_n^{(l+1)})\big) + j f\big(\Im(V_n^{(l+1)})\big) \tag{15}$$

$$V_n^{(l+1)} = \sum_{k=1}^{K} w_{nk}^{(l+1)} \cdot F_k^{(l)} + b_n^{(l+1)} \tag{16}$$

where $K$ refers to the number of neurons in the $l$th layer.
However, feeding the complex tensors of the fully connected layer into the softmax classifier most commonly used in CNNs no longer yields probabilities between 0 and 1. In this paper, we borrow the approach of flattening the real and imaginary parts of the complex features [24] before inputting them into the fully connected layer, so that the softmax classifier can determine the category. Between the flatten layer and the final layer, the fully connected layer contains 128 neurons.
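A minimal sketch of this classification head, assuming Keras functional-style tensors for the real and imaginary feature maps and a hypothetical `num_classes`, is:

```python
import tensorflow as tf

def classification_head(f_real, f_imag, num_classes):
    """Flatten and concatenate the real/imaginary parts, then apply the 128-unit dense
    layer and a softmax output, following the flattening strategy of [24]."""
    flat = tf.keras.layers.Flatten()
    x = tf.concat([flat(f_real), flat(f_imag)], axis=-1)   # real-valued feature vector
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    return tf.keras.layers.Dense(num_classes, activation="softmax")(x)
```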
Output layer: The output layer sits at the end of the network and consists of as many neurons as there are categories. As is usual with neural networks, the softmax function is used to determine categories, and the cross-entropy loss $L$ is applied to optimize the model. Specifically, a loss term is computed for each class and the terms are summed to obtain the final value. Model training is achieved by adjusting the parameters to minimize the loss.
Complex weight initialization and backpropagation: The initialization of the complex weights follows the derivation of [61,62]. In the complex-valued domain, a weight $W$ is denoted as:

$$W = |W| e^{j\theta} = \Re(W) + j\,\Im(W) \tag{17}$$

where $|W|$ and $\theta$ are the amplitude and phase, respectively, and $|W|$ follows a Rayleigh distribution.
The variance is calculated as:

$$\operatorname{Var}(W) = E[WW^{*}] - (E[W])^{2} = E[|W|^{2}] - (E[W])^{2} \tag{18}$$

$$\operatorname{Var}(|W|) = E[|W|\,|W|] - (E[|W|])^{2} = E[|W|^{2}] - (E[|W|])^{2} \tag{19}$$

where $E[\cdot]$ is the expectation and $W^{*}$ is the complex conjugate of $W$. When $W$ is symmetrically distributed around 0, $\operatorname{Var}(W)$ equals $E[|W|^{2}]$. Combining the above:

$$\operatorname{Var}(W) = \operatorname{Var}(|W|) + (E[|W|])^{2} \tag{20}$$
Equation (20) shows that the variance of W is associated with | W | . The weight initialization is performed by adjusting the Rayleigh distribution parameter of | W | .
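The following sketch illustrates such an initialization: the magnitude is drawn from a Rayleigh distribution and the phase uniformly from $(-\pi, \pi]$. The fan-in-based choice of the Rayleigh scale is an assumption here, not a prescription from the paper.

```python
import numpy as np

def complex_rayleigh_init(shape):
    """Draw complex weights with Rayleigh-distributed magnitude and uniform phase
    (Equations (17)-(20)); returns separate real and imaginary arrays."""
    fan_in = int(np.prod(shape[:-1]))            # e.g. k*k*in_channels for a conv kernel
    sigma = 1.0 / np.sqrt(fan_in)                # assumed scale; controls Var(|W|)
    magnitude = np.random.rayleigh(scale=sigma, size=shape)
    phase = np.random.uniform(-np.pi, np.pi, size=shape)
    return magnitude * np.cos(phase), magnitude * np.sin(phase)
```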
A prerequisite for model backpropagation is that the real and imaginary parts of the objective and activation functions are differentiable with respect to each complex-valued parameter. When the loss function $L$ is real, the complex chain rule is expressed as follows:

$$\nabla L(z) = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \Re(z)} + j\,\frac{\partial L}{\partial \Im(z)} = \Re\big(\nabla L(z)\big) + j\,\Im\big(\nabla L(z)\big) \tag{21}$$

where $z$ is a complex variable.

3.3. Improved Attention Block for Complex-Valued Tensors

In PolSAR image classification tasks, information from different scattering channels of the center pixel and its surrounding window-sized neighborhood is fed into the network for processing. Nevertheless, pixels at different positions and scattering information from different channels contribute to the task to very different degrees. It is therefore worth emphasizing valuable information while filtering out irrelevant information across these dimensions. Since some attention mechanisms incur a heavy computational cost [63], this article draws on [30] to design an improved attention mechanism capable of handling complex-valued tensors in a CV-CNN. Coordinate attention integrates positional information into channel attention, enabling it to sift information across both the spatial and channel dimensions, and its parallel one-dimensional feature encoding allows the model to attend to more crucial features with minimal computational overhead. The schematic diagram of our improved attention module is illustrated in Figure 5.
The improved attention module filters the complex-valued tensor across space and channels at the cost of a small computational overhead. Assume the input feature tensor is $X \in \mathbb{C}^{C \times H \times W}$, which can be split into $\Re(X) \in \mathbb{R}^{C \times H \times W}$ and $\Im(X) \in \mathbb{R}^{C \times H \times W}$ according to the real and imaginary parts, where $\mathbb{R}$ denotes the real domain. Taking $\Re(X)$ as an example, average pooling encodes its position along both the $H$ and $W$ dimensions, and the outputs can be represented as:

$$a^{h} = \frac{1}{W}\sum_{i=0}^{W-1} \Re\big(X(C, H, i)\big), \qquad a^{h} \in \mathbb{R}^{C \times H \times 1} \tag{22}$$

$$a^{w} = \frac{1}{H}\sum_{i=0}^{H-1} \Re\big(X(C, i, W)\big), \qquad a^{w} \in \mathbb{R}^{C \times 1 \times W} \tag{23}$$

where $a^{h}$ and $a^{w}$ are the results of pooling along the height $H$ and the width $W$, respectively.
After that, we reshape $a^{h}$ to $\mathbb{R}^{C \times 1 \times H}$ and concatenate it with $a^{w}$. A shared convolution $F_s$ with kernel size 1 × 1 is applied to the concatenated result:

$$l = \psi\big(F_{s}([a^{h}, a^{w}])\big), \qquad l \in \mathbb{R}^{C/r \times 1 \times (W+H)} \tag{24}$$

where $[\cdot, \cdot]$ is the concatenation operation, $\psi(\cdot)$ denotes a non-linear activation function, and $r$ is a reduction ratio that controls the block size.
Then, $l^{h} \in \mathbb{R}^{C/r \times 1 \times H}$ and $l^{w} \in \mathbb{R}^{C/r \times 1 \times W}$ are obtained by splitting $l$ along the spatial dimension. Note that $l^{h}$ is reshaped back to $\mathbb{R}^{C/r \times H \times 1}$ here. Two independent 1 × 1 convolutions $F_h$ and $F_w$ are applied to $l^{h}$ and $l^{w}$, respectively, to restore the channel count of the input $X$:

$$g^{h} = \sigma\big(F_{h}(l^{h})\big), \qquad g^{h} \in \mathbb{R}^{C \times H \times 1} \tag{25}$$

$$g^{w} = \sigma\big(F_{w}(l^{w})\big), \qquad g^{w} \in \mathbb{R}^{C \times 1 \times W} \tag{26}$$

where $\sigma(\cdot)$ represents the sigmoid activation function. The attention weight $W_{re}$ computed from the real part of the input $X$ is then obtained as:

$$W_{re} = g^{h} \times g^{w}, \qquad W_{re} \in \mathbb{R}^{C \times H \times W} \tag{27}$$
Similarly, the attention weight $W_{im}$ is calculated from the imaginary part. Finally, the output of our attention block $Y$ can be expressed as:

$$Y = X \odot \frac{W_{re} + W_{im}}{2}, \qquad Y \in \mathbb{C}^{C \times H \times W} \tag{28}$$

where $\odot$ is the element-wise product.
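A minimal TensorFlow sketch of this block is given below. It assumes channels-last tensors with statically known height and width and a ReLU for the non-linearity $\psi$; the class name and the reduction default are illustrative assumptions rather than the authors’ settings.

```python
import tensorflow as tf

class ComplexCoordAttention(tf.keras.layers.Layer):
    """Improved attention block for complex tensors (Equations (22)-(28)): coordinate
    attention weights are computed from the real and imaginary parts and averaged."""
    def __init__(self, channels, reduction=8, **kwargs):
        super().__init__(**kwargs)
        mid = max(channels // reduction, 1)
        self.shared = tf.keras.layers.Conv2D(mid, 1, activation="relu")          # F_s, psi = ReLU
        self.conv_h = tf.keras.layers.Conv2D(channels, 1, activation="sigmoid")  # F_h
        self.conv_w = tf.keras.layers.Conv2D(channels, 1, activation="sigmoid")  # F_w

    def _weights(self, x):                                   # x: (B, H, W, C), one part of X
        h = x.shape[1]                                       # assumes static spatial dims
        a_h = tf.reduce_mean(x, axis=2, keepdims=True)       # (B, H, 1, C), Equation (22)
        a_w = tf.reduce_mean(x, axis=1, keepdims=True)       # (B, 1, W, C), Equation (23)
        a_h = tf.transpose(a_h, [0, 2, 1, 3])                # (B, 1, H, C), align for concat
        l = self.shared(tf.concat([a_h, a_w], axis=2))       # (B, 1, H+W, C/r), Equation (24)
        l_h, l_w = l[:, :, :h, :], l[:, :, h:, :]            # split along the spatial axis
        g_h = self.conv_h(tf.transpose(l_h, [0, 2, 1, 3]))   # (B, H, 1, C), Equation (25)
        g_w = self.conv_w(l_w)                               # (B, 1, W, C), Equation (26)
        return g_h * g_w                                     # broadcast product, Equation (27)

    def call(self, x_real, x_imag):
        w = 0.5 * (self._weights(x_real) + self._weights(x_imag))   # Equation (28)
        return x_real * w, x_imag * w                        # recalibrated real/imaginary parts
```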

4. Experiments

4.1. Datasets

The performance of our designed method is validated by performing classification tasks on three datasets: Flevoland, San Francisco, and Oberpfaffenhofen. The details of each dataset are given below.
(1) Flevoland: The Flevoland dataset was acquired by the AIRSAR system of the NASA/JPL laboratory over the Flevoland region of the Netherlands in 1989; it consists of 1024 × 750 pixels and contains fifteen defined categories. Its corresponding Pauli-RGB map, ground truth, and category diagrams are depicted in Figure 6a–c, respectively.
(2) San Francisco: The San Francisco dataset was acquired by the RadarSat-2 satellite during its earth observation over the San Francisco region of the United States in 2008, which consists of 1024 × 900 pixels and contains five categories. Its corresponding Pauli-RGB map, ground truth, and category diagrams are depicted in Figure 7a–c, respectively.
(3) Oberpfaffenhofen: The Oberpfaffenhofen dataset was acquired by the German Aerospace Center during ground exploration over the Oberpfaffenhofen region of Germany, which consists of 1300 × 1200 pixels and contains three defined categories. Its corresponding Pauli-RGB map, ground truth, and category diagrams are depicted in Figure 8a–c, respectively.

4.2. Experimental Setting

The approach designed in this article is independently experimented on three selected datasets. For the Flevoland dataset, training involves randomly choosing 9% of the labeled samples from each category, while 1% is allocated to validation, leaving the remaining 90% for testing. For the San Francisco and Oberpfaffenhofen datasets, only 1% of the randomly selected labeled pixels are used for training, 1% for validation, and the remaining 98% for testing. Since the scattering noise present in the original PolSAR images interferes with the classification results, this paper employs refined Lee filtering with a 7 × 7 window size to filter each dataset before formal sampling [64]. In addition, the data for each channel are standardized.
During model training, the Adam algorithm is employed to update the parameters. The batch size is 64, and the initial learning rate is 0.001. Training runs for at most 50 epochs with an early stopping mechanism to avoid overfitting: training halts when the validation loss fails to improve for 5 epochs.
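A minimal sketch of this training configuration in Keras is shown below; `model`, `x_train`, `y_train`, `x_val`, and `y_val` stand in for the assembled network and the sampled patches and are assumptions for illustration.

```python
import tensorflow as tf

# Adam optimizer with the stated initial learning rate; categorical cross-entropy loss
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt when the validation loss has not improved for 5 epochs
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, batch_size=64, callbacks=[early_stop])
```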
To illustrate the effectiveness and advancement of the designed approach, we select seven other PolSAR image classification methods to compare against ours: SVM [12], CV-MLP [65], CV-2D-CNN [23], CV-3D-CNN [24], CV-FCN [25], CNN-WT [66], and PolSF [67]. In particular, SVM, CNN-WT, and PolSF operate in the real domain, while the other, complex-valued methods have demonstrated superior performance to their real-valued counterparts in the corresponding papers. The details of each method are as follows:
  • SVM: SVM is a classical and powerful supervised machine learning technique. It finds an optimal hyperplane in pixel-level classification by maximizing the distance between pixels of different categories to determine the categories. Specifically, SVM is based on kernels that map the data into a high-dimensional space to make them linearly divisible, thus solving the non-linear problems well.
  • CV-MLP: CV-MLP uses a stack of fully connected layers to map the input image data into high-dimensional features. It extracts and transforms the features through hidden layers to learn the non-linear relationship between data and labels.
  • CV-2D-CNN/CV-3D-CNN: Both CV-2D-CNN and CV-3D-CNN are capable of hierarchically extracting local abstract features for pixel classification. They are major deep neural networks. However, CV-2D-CNN extracts features in spatial dimensions under channel independence, while CV-3D-CNN can simultaneously process information in three dimensions.
  • CV-FCN: CV-FCN learns the mapping between each pixel and its label end-to-end during classification. It eliminates the computational redundancy arising from the need to input a fixed-size patch around the center pixel when identifying it. Moreover, CV-FCN uses inverse convolution to upsample the extracted features and adapts well to inputs of arbitrary size.
  • CNN-WT: CNN-WT is a new method based on deep neural networks and wavelet transform. It notes that the amount of PolSAR data is not enough to support a network with many layers, and the data are noisy. CNN-WT uses the wavelet transform for feature extraction and denoising of the raw data and feeds them into a multi-branch deep network for classification.
  • PolSF: PolSF is a recently proposed PolSAR classification method for extracting abstract features based on the transformer framework. It analyzes complex dependencies within a region through a special local attention mechanism. Compared with shallow CNN models, PolSF has deeper layers and stronger feature extraction capability.
In addition, the complexity of each neural network model is shown in Table 1, including the parameter count, floating point operations (FLOPs), and multiply–accumulate operations (MACs). The inclusion of the attention module adds little to our model’s complexity. The designed hybrid model notably reduces FLOPs and MACs compared to CV-3D-CNN, and the same advantage holds against CNN-WT and PolSF. Four frequently chosen metrics, namely overall accuracy (OA), average accuracy (AA), the kappa coefficient, and per-category precision, are applied to evaluate the classification results. Furthermore, the training time of each method on each dataset is recorded. We conduct ten independent replications of each method on each dataset to obtain the mean of the above four metrics and the variance of the first three.
Among the involved methods, CV-MLP, CV-2D-CNN, CV-3D-CNN, CV-FCN, CNN-WT, PolSF, and the proposed CV-2D/3D-CNN-AM are implemented with the TensorFlow framework in Python, while SVM is implemented with the machine learning package scikit-learn in Python. The classification experiments are run on a workstation featuring a CPU with a 13.75 MB cache and a 2.40 GHz clock, an RTX 3070 graphics processor, and 64 GB of RAM.

4.3. Experimental Results of Flevoland

The result maps of all involved methods on the Flevoland dataset are displayed in Figure 9. It can be seen from the result maps that our proposed CV-2D/3D-CNN-AM outperforms the other compared approaches. The machine learning method SVM misjudges most regions of the building category because of the small number of labeled samples and produces non-negligible error points in the other categories. CV-MLP and CV-2D-CNN improve classification accuracy for building identification compared to SVM, yet they erroneously classify a significant portion of bare soil pixels as water. CV-FCN assigns massive numbers of pixels to other classes in barley, water, and rapeseed, indicating that its learned abstract features are unstable. CV-3D-CNN is capable of learning the discriminative features for each category, but there is still room for improvement in accuracy, such as the continuous errors in the rapeseed region. CNN-WT exhibits numerous misclassified regions in the determination of wheat and forest, which is undesirable. PolSF has some misclassified pixels in the distribution areas of classes such as potatoes, which limits its accuracy. CV-2D/3D-CNN-AM shows excellent recognition ability for all categories in the dataset. In other words, the proposed method achieves a high recognition rate for categories with small distribution areas while ensuring high classification precision for categories with more samples.
The specific experimental data of all approaches on the Flevoland dataset are displayed in Table 2. Figure 10 displays the loss and accuracy curve trends of the proposed method during the training and validation phases to verify its convergence. From the curve diagram, the proposed method rapidly attains a high level of accuracy following the initial few epochs of training, indicating its efficiency in capturing critical information. It is known that our CV-2D/3D-CNN-AM obtains the most favorable values for OA, AA, and Kappa from the table, corresponding to the results presented in Figure 9. Compared with CV-2D-CNN, although we consume more training time, the improvement achieved in the accuracy metric is very obvious. The increases in OA and AA are 3.05% and 7.78%, respectively. With the hybrid convolution and attention module, the proposed method takes less than half the training time of CV-3D-CNN, reducing training costs. Even in comparison to CNN-WT and PolSF, both of the latest techniques, our method exhibits significantly enhanced OA, AA, and Kappa. Notably, these improvements are achieved without any additional training time burden. In detail for each category, CV-2D/3D-CNN-AM achieves the highest accuracy in all classes except lucerne, beet, wheat2, and water. It is important to mention that the designed approach not only achieves the highest mean in the three metrics of OA, AA, and Kappa but also possesses the lowest variance in OA and Kappa, which implies that our method possesses high classification performance along with strong stability.

4.4. Experimental Results of San Francisco

The result graphs of all involved methods operating on the San Francisco dataset are displayed in Figure 11. The classification results provided by the other comparison methods are less desirable as seen in the result graphs. SVM and CV-MLP have limited fitting ability and contain numerous misclassification points within the vegetation and mountain zones. CV-MLP and CV-2D-CNN tend to confuse bare soil samples with their neighboring water samples resulting in low accuracy of bare soil category. The limited ability of CV-FCN to process border information leads to many errors in its classification results around the boundaries. Although CV-3D-CNN can handle mountain, water, and building categories satisfactorily, the discrimination accuracy for bare soil and vegetation pixels still needs to be raised. CNN-WT struggles to precisely discern the pixels belonging to green spaces, ultimately leading to a substantial accumulation of errors within the category’s region, compromising the accuracy of its classification results. In the PolSF classification diagram, while continuous and large bare soil areas are relatively well recognized, the precision of recognition for fragmented and small areas is less satisfactory. Compared with the above approaches, the designed method in this article accurately discriminates samples across all categories. In particular, it demonstrates exceptional proficiency in efficiently categorizing small, geographically scattered areas.
The specific experimental data of all approaches on the San Francisco dataset are displayed in Table 3. The loss and accuracy curves are illustrated in Figure 12. Similarly, it can be obtained that the transition from low to high accuracy is efficiently accomplished by our model in only a few epochs. In comparison to CV-3D-CNN and PolSF, our method significantly excels in terms of training time. This superiority stems primarily from the integration of multi-level feature and attention modules, which not only empower the model with robust feature extraction capabilities but also facilitate swift identification of the optimal solution. Based on Table 3, it is evident that the proposed approach attains the highest scores for OA, AA, and Kappa with 97.53%, 93.74%, and 0.9613. Since the water pixels occupy a large portion of the graph and are easily discriminated by all methods, the lead of our method is not obvious on OA. However, our method exhibits a minimum improvement of 4.49% in AA compared to others. Particularly in the bare soil category, CV-2D/3D-CNN-AM acquires far superior accuracy than other methods, which indicates that the discriminative features acquired by our method are both generalized and resilient.

4.5. Experimental Results of Oberpfaffenhofen

The result graphs of all involved methods operating on the Oberpfaffenhofen dataset are displayed in Figure 13. Several comparative methods exhibit varying degrees of error rates in the ground object discrimination of each category. SVM incorrectly classifies a large proportion of built-up areas as open areas and there is also some misclassification within the woodland area. The CV-MLP recognition rates are unsatisfactory for all three surface feature types contained in the data, and there is a significant error region located at the bottom left corner of the image. The resultant plots of CV-2D-CNN are relatively cleaner than the previous two, but the discriminative correctness for built-up areas is still not high. CV-3D-CNN misidentifies badly in the area with more complex surface feature types at the lower right corner of the image. CV-FCN has low accuracy in classifying pixels near category boundaries, where the category changes are prone to massive and continuous misclassification. CNN-WT exhibits error dispersion across all three feature domains, with a pronounced tendency to inaccurately categorize numerous pixels within the built-up area region as open area, impeding its overall performance and accuracy. PolSF discriminates woodland areas well but encounters difficulties in classifying the image central region. By comparison, CV-2D/3D-CNN-AM enables better differentiation of samples in each category and performs more satisfactorily in boundary processing and region coherence.
The specific experimental data of all approaches on the Oberpfaffenhofen dataset are displayed in Table 4. The loss and accuracy curves are illustrated in Figure 14. Specifically, the curve’s change rate transitions from rapid to gradual, ultimately trending towards a continuously optimized state. Our model surpasses CV-3D-CNN, CNN-WT, and PolSF in terms of efficiency, achieving significantly more precise classification outcomes while requiring significantly less training time, underscoring its superior performance. Since this dataset contains fewer surface feature types, all methods can achieve a certain level of OA. However, there are still differences in the AA and per-category accuracy. It is evident that our model excels in terms of accuracy, achieving a remarkable 88.28% in the built-up zones and an outstanding 96.07% in the woodland zones. This represents a substantial improvement of at least 3.58% and 0.97%, respectively, over the other models. In addition, the proposed method yields superior values of 95.34%, 94.20%, and 0.9201 for three metrics, indicating that it has stronger classification performance.

4.6. Ablation Study

To verify whether the expansion to complex domains and the attention block have any impact on model performance, we execute RV-2D/3D-CNN, RV-2D/3D-CNN-AM, and CV-2D/3D-CNN on the three datasets, respectively. Among them, RV-2D/3D-CNN possesses the same network structure as the proposed method but removes the attention module and can only process real data. What is more, the attention block in RV-2D/3D-CNN-AM removes the calculation of complex parts. CV-2D/3D-CNN just masks out the attention module compared to our model. The ablation experiments on Flevoland, San Francisco, and Oberpfaffenhofen datasets are shown in Table 5, Table 6 and Table 7, respectively, in which, “Complex” represents the complex computational domain, and “Attention” represents the attention block.
The experimental findings indicate that incorporating the attention module and directly inputting complex-valued data both increase classification accuracy. RV-2D/3D-CNN and RV-2D/3D-CNN-AM achieve inferior results to the complex-valued models because they cannot utilize the phase information contained in the complex elements. In both the real and complex domains, the model with the attention module outperforms its counterpart without it. This confirms that the attention block helps the model identify more important information and thus enhances its feature extraction ability. Notably, even without the attention module, CV-2D/3D-CNN alone achieves accuracy comparable to a single 2D or 3D model, showing that the hybrid model itself has a decent feature extraction capability. Moreover, our model performs best on all datasets by incorporating the improved attention block and expanding to the complex domain. In particular, the AA of CV-2D/3D-CNN-AM improves on the three datasets by at least 1.08%, 5.36%, and 2.62%, respectively. This suggests that multi-level and targeted abstract features help our model recognize all classes.

5. Conclusions

We propose a simple, practical, and novel PolSAR image classification method called CV-2D/3D-CNN-AM, which combines the attention mechanism with a complex-valued CNN framework that hybridizes 2D and 3D convolutions. First, the model is extended to the complex domain to fully exploit the amplitude and phase information in complex-valued PolSAR data. Secondly, a hybrid of 2D and 3D convolutions extracts rich discriminative features while reducing the computational parameters. In addition, the attention mechanism module serves as a filter, allowing the model to focus on information that is more critical to the practical task. The effectiveness of each module in the designed framework is verified in ablation experiments. CV-2D/3D-CNN-AM is tested on three PolSAR datasets and achieves more advanced results than the other methods. Future work will focus on PolSAR image classification under limited-labeled samples or unsupervised scenarios. By employing self-supervised learning, domain generalization, and other methods to reduce the dependence on labeled samples, it seeks to effectively tackle the inherent complexities involved in the interpretation of PolSAR images.

Author Contributions

Conceptualization, W.L.; Methodology, H.X. and Y.W.; Resources, W.L.; Software, H.X. and J.Z.; Supervision, Y.J. and Y.H.; Validation, J.Z.; Writing—original draft, W.L. and H.X.; Writing—review and editing, Y.W., Y.J. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 42071414; the Key Laboratory of Land Satellite Remote Sensing Application, Ministry of Natural Resources of the People’s Republic of China under Grants KLSMNR-K202201 and 202305; the Open Fund of the State Key Laboratory of Remote Sensing Science under Grant OFSLRSS202202; the China Postdoctoral Science Foundation under Grant 2019M661896; and the Postgraduate Research and Practice Innovation Program of Jiangsu Province under Grant KYCX24_1219.

Data Availability Statement

The data presented in this study are available on reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, H.; Li, Q.; Wu, G.; Chen, J.; Liang, S. The Impacts of Building Orientation on Polarimetric Orientation Angle Estimation and Model-Based Decomposition for Multilook Polarimetric SAR Data in Urban Areas. IEEE Trans. Geosci. Remote Sens. 2016, 54, 5520–5532. [Google Scholar] [CrossRef]
  2. Yuzugullu, O.; Erten, E.; Hajnsek, I. Morphology estimation of rice fields using X-band PolSAR data. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 7121–7124. [Google Scholar]
  3. Whitcomb, J.; Chen, R.; Clewley, D.; Kimball, J.; Pastick, N.; Yi, Y.; Moghaddam, M. Active Layer Thickness Throughout Northern Alaska by Upscaling from P-Band Polarimetric Sar Retrievals. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3660–3663. [Google Scholar] [CrossRef]
  4. Zhang, T.; Quan, S.; Wang, W.; Guo, W.; Zhang, Z.; Yu, W. Information Reconstruction-Based Polarimetric Covariance Matrix for PolSAR Ship Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5202815. [Google Scholar] [CrossRef]
  5. Ortiz, G.P.; Lorenzzetti, J.A. Observing Multimodal Ocean Wave Systems by a Multiscale Analysis of Polarimetric SAR Imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1735–1739. [Google Scholar] [CrossRef]
  6. Pottier, E. Dr. JR Huynen’s main contributions in the development of polarimetric radar techniques and how the ‘Radar Targets Phenomenological Concept’ becomes a theory. In Proceedings of the Radar Polarimetry, SPIE, San Diego, CA, USA, 22 July 1993; Volume 1748, pp. 72–85. [Google Scholar]
  7. Cloude, S.R.; Pottier, E. A review of target decomposition theorems in radar polarimetry. IEEE Trans. Geosci. Remote Sens. 1996, 34, 498–518. [Google Scholar] [CrossRef]
  8. Cloude, S.R.; Pottier, E. An entropy based classification scheme for land applications of polarimetric SAR. IEEE Trans. Geosci. Remote Sens. 1997, 35, 68–78. [Google Scholar] [CrossRef]
  9. Krogager, E.; Boerner, W.M.; Madsen, S.N. Feature-motivated Sinclair matrix (sphere/diplane/helix) decomposition and its application to target sorting for land feature classification. In Proceedings of the Wideband Interferometric Sensing and Imaging Polarimetry, SPIE, San Diego, CA, USA, 27 July–1 August 1997; Volume 3120, pp. 144–154. [Google Scholar]
  10. Cameron, W.L.; Leung, L.K. Feature motivated polarization scattering matrix decomposition. In Proceedings of the IEEE International Conference on Radar, Arlington, VA, USA, 7–10 May 1990; pp. 549–557. [Google Scholar]
  11. Freeman, A.; Durden, S.L. A three-component scattering model for polarimetric SAR data. IEEE Trans. Geosci. Remote Sens. 1998, 36, 963–973. [Google Scholar] [CrossRef]
  12. Lardeux, C.; Frison, P.L.; Tison, C.; Souyris, J.C.; Stoll, B.; Fruneau, B.; Rudant, J.P. Support vector machine for multifrequency SAR polarimetric data classification. IEEE Trans. Geosci. Remote Sens. 2009, 47, 4143–4152. [Google Scholar] [CrossRef]
  13. Liu, W.; Yang, J.; Li, P.; Han, Y.; Zhao, J.; Shi, H. A novel object-based supervised classification method with active learning and random forest for PolSAR imagery. Remote Sens. 2018, 10, 1092. [Google Scholar] [CrossRef]
  14. Yin, Q.; Cheng, J.; Zhang, F.; Zhou, Y.; Shao, L.; Hong, W. Interpretable POLSAR image classification based on adaptive-dimension feature space decision tree. IEEE Access 2020, 8, 173826–173837. [Google Scholar] [CrossRef]
  15. Zhang, S.; Cui, L.; Zhang, Y.; Xia, T.; Dong, Z.; An, W. Research on Input Schemes for Polarimetric SAR Classification Using Deep Learning. Remote Sens. 2024, 16, 1826. [Google Scholar] [CrossRef]
  16. Li, Z.; Huang, H.; Zhang, Z.; Shi, G. Manifold-based multi-deep belief network for feature extraction of hyperspectral image. Remote Sens. 2022, 14, 1484. [Google Scholar] [CrossRef]
  17. Zhang, W.T.; Wang, M.; Guo, J.; Lou, S.T. Crop classification using MSCDN classifier and sparse auto-encoders with non-negativity constraints for multi-temporal, Quad-Pol SAR data. Remote Sens. 2021, 13, 2749. [Google Scholar] [CrossRef]
  18. Seydi, S.T.; Hasanlou, M.; Amani, M. A new end-to-end multi-dimensional CNN framework for land cover/land use change detection in multi-source remote sensing datasets. Remote Sens. 2020, 12, 2010. [Google Scholar] [CrossRef]
  19. Hochstuhl, S.; Pfeffer, N.; Thiele, A.; Hammer, H.; Hinz, S. Your Input Matters—Comparing Real-Valued PolSAR Data Representations for CNN-Based Segmentation. Remote Sens. 2023, 15, 5738. [Google Scholar] [CrossRef]
  20. Zhou, Y.; Wang, H.; Xu, F.; Jin, Y.Q. Polarimetric SAR image classification using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1935–1939. [Google Scholar] [CrossRef]
  21. Zhang, L.; Chen, Z.; Zou, B.; Gao, Y. Polarimetric SAR terrain classification using 3D convolutional neural network. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 4551–4554. [Google Scholar]
  22. He, C.; He, B.; Tu, M.; Wang, Y.; Qu, T.; Wang, D.; Liao, M. Fully convolutional networks and a manifold graph embedding-based algorithm for polsar image classification. Remote Sens. 2020, 12, 1467. [Google Scholar] [CrossRef]
  23. Zhang, Z.; Wang, H.; Xu, F.; Jin, Y.Q. Complex-valued convolutional neural network and its application in polarimetric SAR image classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 7177–7188. [Google Scholar] [CrossRef]
  24. Tan, X.; Li, M.; Zhang, P.; Wu, Y.; Song, W. Complex-valued 3-D convolutional neural network for PolSAR image classification. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1022–1026. [Google Scholar] [CrossRef]
  25. Cao, Y.; Wu, Y.; Zhang, P.; Liang, W.; Li, M. Pixel-wise PolSAR image classification via a novel complex-valued deep fully convolutional network. Remote Sens. 2019, 11, 2653. [Google Scholar] [CrossRef]
  26. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  30. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  31. Dong, H.; Zhang, L.; Lu, D.; Zou, B. Attention-based polarimetric feature selection convolutional network for PolSAR image classification. IEEE Geosci. Remote Sens. Lett. 2020, 19, 4001705. [Google Scholar] [CrossRef]
  32. Hua, W.; Wang, X.; Zhang, C.; Jin, X. Attention-Based Multiscale Sequential Network for PolSAR Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4506505. [Google Scholar] [CrossRef]
  33. Yang, R.; Xu, X.; Gui, R.; Xu, Z.; Pu, F. Composite sequential network with POA attention for PolSAR image analysis. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5209915. [Google Scholar] [CrossRef]
  34. Zhang, Q.; He, C.; He, B.; Tong, M. Learning Scattering Similarity and Texture-Based Attention with Convolutional Neural Networks for PolSAR Image Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5207419. [Google Scholar] [CrossRef]
35. Qin, X.; Hu, T.; Zou, H.; Yu, W.; Wang, P. PolSAR image classification via complex-valued convolutional neural network combining measured data and artificial features. In Proceedings of the IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3209–3212. [Google Scholar]
  36. Barrachina, J.; Ren, C.; Morisseau, C.; Vieillard, G.; Ovarlez, J.P. Complex-valued neural networks for polarimetric SAR segmentation using Pauli representation. In Proceedings of the IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 4984–4987. [Google Scholar]
37. Han, P.; Sun, D. Classification of Polarimetric SAR image with feature selection and deep learning. Signal Process. 2019, 35, 972–978. [Google Scholar]
  38. Yang, C.; Hou, B.; Ren, B.; Hu, Y.; Jiao, L. CNN-based polarimetric decomposition feature selection for PolSAR image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 8796–8812. [Google Scholar] [CrossRef]
  39. Mullissa, A.G.; Persello, C.; Stein, A. PolSARNet: A deep fully convolutional network for polarimetric SAR image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 5300–5309. [Google Scholar] [CrossRef]
  40. Ren, Y.; Jiang, W.; Liu, Y. A New Architecture of a Complex-Valued Convolutional Neural Network for PolSAR Image Classification. Remote Sens. 2023, 15, 4801. [Google Scholar] [CrossRef]
  41. Tan, X.; Li, M.; Zhang, P.; Wu, Y.; Song, W. Deep triplet complex-valued network for PolSAR image classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10179–10196. [Google Scholar] [CrossRef]
  42. Persello, C.; Stein, A. Deep fully convolutional networks for the detection of informal settlements in VHR images. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2325–2329. [Google Scholar] [CrossRef]
  43. Mullissa, A.G.; Persello, C.; Reiche, J. Despeckling polarimetric SAR data using a multistream complex-valued fully convolutional network. IEEE Geosci. Remote Sens. Lett. 2021, 19, 4011805. [Google Scholar] [CrossRef]
  44. Liu, X.; Jiao, L.; Tang, X.; Sun, Q.; Zhang, D. Polarimetric convolutional network for PolSAR image classification. IEEE Trans. Geosci. Remote Sens. 2018, 57, 3040–3054. [Google Scholar] [CrossRef]
45. Qin, X.; Yu, W.; Wang, P.; Chen, T.; Zou, H. Weakly supervised classification of PolSAR images based on sample refinement with complex-valued convolutional neural network. J. Radars 2020, 9, 525–538. [Google Scholar]
  46. Jiang, Y.; Zhang, P.; Song, W. Semisupervised complex network with spatial statistics fusion for PolSAR image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 9749–9761. [Google Scholar] [CrossRef]
  47. Xie, W.; Ma, G.; Zhao, F.; Liu, H.; Zhang, L. PolSAR image classification via a novel semi-supervised recurrent complex-valued convolution neural network. Neurocomputing 2020, 388, 255–268. [Google Scholar] [CrossRef]
  48. Zhu, L.; Ma, X.; Wu, P.; Xu, J. Multiple classifiers based semi-supervised polarimetric SAR image classification method. Sensors 2021, 21, 3006. [Google Scholar] [CrossRef] [PubMed]
  49. Zeng, X.; Wang, Z.; Wang, Y.; Rong, X.; Guo, P.; Gao, X.; Sun, X. SemiPSCN: Polarization Semantic Constraint Network for Semi-supervised Segmentation in Large-scale and Complex-valued PolSAR Images. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5200718. [Google Scholar] [CrossRef]
  50. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  51. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  52. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2014; Volume 27. [Google Scholar]
  53. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2015; Volume 28. [Google Scholar]
  54. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
55. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  56. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  57. Zhang, Z.; Lan, C.; Zeng, W.; Jin, X.; Chen, Z. Relation-aware global attention for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3186–3195. [Google Scholar]
  58. Liu, J.J.; Hou, Q.; Cheng, M.M.; Wang, C.; Feng, J. Improving convolutional networks with self-calibrated convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10096–10105. [Google Scholar]
  59. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012. [Google Scholar]
60. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  61. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
62. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  63. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  64. Lee, J.S.; Grunes, M.R.; De Grandi, G. Polarimetric SAR speckle filtering and its implication for classification. IEEE Trans. Geosci. Remote Sens. 1999, 37, 2363–2373. [Google Scholar]
  65. Hänsch, R. Complex-valued multi-layer perceptrons—An application to polarimetric SAR data. Photogramm. Eng. Remote Sens. 2010, 76, 1081–1088. [Google Scholar] [CrossRef]
  66. Jamali, A.; Mahdianpari, M.; Mohammadimanesh, F.; Bhattacharya, A.; Homayouni, S. PolSAR image classification based on deep convolutional neural networks using wavelet transformation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4510105. [Google Scholar] [CrossRef]
  67. Jamali, A.; Roy, S.K.; Bhattacharya, A.; Ghamisi, P. Local window attention transformer for polarimetric SAR image classification. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4004205. [Google Scholar] [CrossRef]
Figure 1. The operational domain of different attention modules. (a) Channel attention. (b) Spatial attention. (c) Channel and spatial attention. C represents the channel domain; H and W represent the spatial domain.
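Where Figure 1c combines both domains, the gating is typically implemented as sequential channel and spatial re-weighting. The following is a minimal real-valued PyTorch sketch in the spirit of CBAM [28]; the reduction ratio, pooling choices, and kernel size are illustrative assumptions and do not reproduce the attention block proposed in this paper.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel attention followed by spatial attention (CBAM-style sketch)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel attention: squeeze the spatial domain (H, W), re-weight C.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: squeeze the channel domain, re-weight (H, W).
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel weights from average- and max-pooled spatial descriptors.
        w_c = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * w_c.view(b, c, 1, 1)
        # Spatial weights from channel-wise average and max maps.
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(pooled))

# Example: 16 feature maps over a 15 x 15 patch (sizes chosen for illustration).
features = torch.randn(2, 16, 15, 15)
print(ChannelSpatialAttention(16)(features).shape)  # torch.Size([2, 16, 15, 15])
```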
Figure 2. Architecture of the proposed CV-2D/3D-CNN-AM for PolSAR image classification.
Figure 3. Comparison between 2D convolution and 3D convolution. (a) 2D convolution. (b) 3D convolution.
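To make the difference in Figure 3 concrete, the short PyTorch sketch below contrasts the two operations; the six scattering channels and the 15 × 15 patch size are assumptions for illustration, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

# A PolSAR patch with 6 scattering channels (e.g., upper-triangular coherency
# matrix elements) and a 15 x 15 spatial window.
patch = torch.randn(1, 6, 15, 15)                      # (batch, C, H, W)

# 2D convolution: each kernel covers all 6 channels at once and slides only
# over the spatial dimensions (H, W).
conv2d = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=3, padding=1)
print(conv2d(patch).shape)                             # torch.Size([1, 16, 15, 15])

# 3D convolution: the scattering-channel axis becomes a depth axis that the
# kernel also slides along, so features are extracted across that dimension.
volume = patch.unsqueeze(1)                            # (batch, 1, D=6, H, W)
conv3d = nn.Conv3d(in_channels=1, out_channels=16,
                   kernel_size=(3, 3, 3), padding=(0, 1, 1))
print(conv3d(volume).shape)                            # torch.Size([1, 16, 4, 15, 15])
```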
Figure 4. The implementation of complex-valued convolution.
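The construction in Figure 4 follows the identity (W_r + iW_i) ∗ (X_r + iX_i) = (W_r ∗ X_r − W_i ∗ X_i) + i(W_r ∗ X_i + W_i ∗ X_r), so one complex-valued convolution can be realized with four real-valued convolutions. Below is a minimal PyTorch sketch of this idea; the channel counts and patch size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex-valued 2D convolution built from two real-valued Conv2d layers."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int, **kwargs):
        super().__init__()
        self.conv_r = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)  # real part of W
        self.conv_i = nn.Conv2d(in_channels, out_channels, kernel_size, **kwargs)  # imaginary part of W

    def forward(self, x_r: torch.Tensor, x_i: torch.Tensor):
        # (W_r + iW_i) * (x_r + i x_i), expanded into four real convolutions.
        y_r = self.conv_r(x_r) - self.conv_i(x_i)
        y_i = self.conv_r(x_i) + self.conv_i(x_r)
        return y_r, y_i

# Example: a complex 6-channel input patch split into real/imaginary parts.
x_r, x_i = torch.randn(1, 6, 15, 15), torch.randn(1, 6, 15, 15)
y_r, y_i = ComplexConv2d(6, 16, kernel_size=3, padding=1)(x_r, x_i)
print(y_r.shape, y_i.shape)  # torch.Size([1, 16, 15, 15]) for each part
```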
Figure 5. The improved attention block.
Figure 6. Flevoland dataset. (a) Pauli-RGB image. (b) Ground truth. (c) Category legend.
Figure 7. San Francisco dataset. (a) Pauli-RGB image. (b) Ground truth. (c) Category legend.
Figure 8. Oberpfaffenhofen dataset. (a) Pauli-RGB image. (b) Ground truth. (c) Category legend.
Figure 9. Classification maps of the Flevoland dataset. (a) Ground truth. (b) SVM. (c) CV-MLP. (d) CV-2D-CNN. (e) CV-3D-CNN. (f) CV-FCN. (g) CNN-WT. (h) PolSF. (i) CV-2D/3D-CNN-AM.
Figure 10. Loss and accuracy curves of the proposed method for training and validation data on the Flevoland dataset.
Figure 11. Classification maps of the San Francisco dataset. (a) Ground truth. (b) SVM. (c) CV-MLP. (d) CV-2D-CNN. (e) CV-3D-CNN. (f) CV-FCN. (g) CNN-WT. (h) PolSF. (i) CV-2D/3D-CNN-AM.
Figure 12. Loss and accuracy curves of the proposed method for training and validation data on the San Francisco dataset.
Figure 13. Classification maps of the Oberpfaffenhofen dataset. (a) Ground truth. (b) SVM. (c) CV-MLP. (d) CV-2D-CNN. (e) CV-3D-CNN. (f) CV-FCN. (g) CNN-WT. (h) PolSF. (i) CV-2D/3D-CNN-AM.
Figure 14. Loss and accuracy curves of the proposed method for training and validation data on the Oberpfaffenhofen dataset.
Table 1. Complexity of different models.

| Model | Params | FLOPs | MACs |
| --- | --- | --- | --- |
| CV-MLP | 171,012 | 85,341 | 42,671 |
| CV-2D-CNN | 10,794 | 355,840 | 177,920 |
| CV-3D-CNN | 2,754,095 | 94,568,557 | 47,284,279 |
| CV-FCN | 964,316 | 438,260,314 | 219,130,157 |
| CNN-WT | 4,714,043 | 195,928,265 | 97,964,133 |
| PolSF | 1,351,961 | 689,628,230 | 344,814,115 |
| CV-2D/3D-CNN-AM | 2,716,891 | 40,489,291 | 20,244,646 |
Table 2. Classification results of different methods on the Flevoland dataset.

| Class | SVM | CV-MLP | CV-2D-CNN | CV-3D-CNN | CV-FCN | CNN-WT | PolSF | Proposed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stembeans | 85.01 | 98.31 | 97.40 | 99.21 | 97.97 | 99.73 | 98.90 | 99.84 |
| Peas | 92.83 | 96.81 | 99.01 | 99.46 | 98.27 | 99.95 | 99.65 | 99.97 |
| Forest | 92.52 | 91.65 | 98.54 | 99.02 | 97.19 | 98.96 | 95.29 | 99.92 |
| Lucerne | 97.25 | 97.74 | 98.54 | 99.96 | 95.25 | 98.79 | 98.76 | 99.91 |
| Wheat | 93.48 | 91.96 | 97.53 | 98.37 | 99.71 | 97.21 | 99.48 | 99.72 |
| Beet | 94.47 | 95.14 | 98.98 | 99.68 | 89.41 | 99.67 | 99.83 | 99.76 |
| Potatoes | 78.87 | 89.41 | 97.86 | 99.21 | 96.25 | 99.12 | 99.55 | 99.65 |
| Bare Soil | 99.16 | 16.45 | 44.67 | 97.50 | 96.08 | 99.79 | 99.01 | 99.99 |
| Grass | 92.80 | 81.13 | 94.22 | 93.33 | 93.89 | 88.90 | 98.88 | 99.37 |
| Rapeseed | 85.58 | 83.28 | 93.83 | 98.33 | 91.71 | 97.57 | 98.51 | 99.44 |
| Barley | 97.83 | 96.97 | 98.10 | 99.32 | 69.95 | 99.60 | 99.86 | 99.94 |
| Wheat 2 | 86.40 | 79.62 | 97.12 | 97.76 | 99.83 | 97.07 | 98.78 | 99.69 |
| Wheat 3 | 95.71 | 96.06 | 99.50 | 99.89 | 99.21 | 99.37 | 99.36 | 99.94 |
| Water | 99.79 | 98.59 | 99.20 | 98.83 | 75.53 | 99.09 | 99.99 | 99.98 |
| Buildings | 10.08 | 79.78 | 61.97 | 98.32 | 55.41 | 87.82 | 93.42 | 96.05 |
| Training time (s) | 39.42 | 45.62 | 91.78 | 1303.64 | 227.85 | 1434.63 | 3332.12 | 480.24 |
| OA (%) | 91.65 ± 0.358 | 90.64 ± 0.114 | 96.73 ± 0.564 | 98.81 ± 0.011 | 93.49 ± 4.197 | 98.34 ± 0.386 | 98.90 ± 0.145 | 99.78 ± 0.003 |
| AA (%) | 86.78 ± 0.316 | 86.19 ± 1.119 | 91.77 ± 3.622 | 98.55 ± 0.025 | 90.73 ± 6.413 | 97.51 ± 1.265 | 98.62 ± 0.009 | 99.55 ± 0.018 |
| Kappa (×100) | 90.88 ± 0.428 | 89.76 ± 0.138 | 96.43 ± 0.676 | 98.71 ± 0.013 | 92.75 ± 4.991 | 98.19 ± 0.459 | 98.80 ± 0.173 | 99.76 ± 0.004 |
Table 3. Classification results of different methods on the San Francisco dataset.

| Class | SVM | CV-MLP | CV-2D-CNN | CV-3D-CNN | CV-FCN | CNN-WT | PolSF | Proposed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bare Soil | 68.41 | 30.13 | 51.88 | 68.14 | 60.78 | 81.01 | 73.06 | 87.60 |
| Mountain | 87.57 | 84.22 | 96.26 | 98.15 | 95.92 | 97.87 | 97.21 | 97.75 |
| Water | 99.18 | 98.72 | 98.92 | 98.95 | 95.47 | 99.01 | 98.80 | 99.07 |
| Building | 95.73 | 95.71 | 97.83 | 97.52 | 96.23 | 98.61 | 97.38 | 98.18 |
| Vegetation | 68.78 | 67.42 | 79.85 | 82.73 | 90.07 | 65.00 | 79.81 | 86.10 |
| Training time (s) | 38.52 | 41.71 | 59.54 | 737.81 | 122.54 | 411.30 | 1950.58 | 374.43 |
| OA (%) | 94.24 ± 0.002 | 93.04 ± 0.081 | 96.17 ± 0.052 | 96.67 ± 0.058 | 94.88 ± 1.688 | 96.17 ± 0.032 | 96.36 ± 0.059 | 97.53 ± 0.016 |
| AA (%) | 83.93 ± 0.042 | 75.24 ± 8.746 | 84.95 ± 3.028 | 89.10 ± 1.349 | 87.69 ± 6.881 | 88.30 ± 1.157 | 89.25 ± 4.150 | 93.74 ± 0.859 |
| Kappa (×100) | 90.96 ± 0.006 | 89.03 ± 0.247 | 93.98 ± 0.132 | 94.78 ± 0.142 | 92.06 ± 3.825 | 93.96 ± 0.099 | 94.30 ± 0.125 | 96.13 ± 0.040 |
Table 4. Classification results of different methods on the Oberpfaffenhofen dataset.

| Class | SVM | CV-MLP | CV-2D-CNN | CV-3D-CNN | CV-FCN | CNN-WT | PolSF | Proposed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Built-up Areas | 68.07 | 65.87 | 80.57 | 82.05 | 84.70 | 84.31 | 84.61 | 88.28 |
| Woodland | 84.97 | 83.07 | 92.41 | 93.25 | 89.80 | 91.72 | 95.10 | 96.07 |
| Open Areas | 97.86 | 95.93 | 97.54 | 98.02 | 98.45 | 93.45 | 95.80 | 98.23 |
| Training time (s) | 62.75 | 77.86 | 98.45 | 1319.88 | 129.27 | 1007.62 | 3251.98 | 521.56 |
| OA (%) | 87.98 ± 0.041 | 85.74 ± 0.081 | 92.33 ± 0.110 | 93.12 ± 0.212 | 90.57 ± 2.918 | 93.64 ± 0.006 | 92.87 ± 0.312 | 95.34 ± 0.043 |
| AA (%) | 83.63 ± 0.068 | 81.29 ± 0.340 | 90.17 ± 0.242 | 91.10 ± 0.517 | 89.31 ± 3.806 | 91.49 ± 0.019 | 91.84 ± 0.084 | 94.20 ± 0.036 |
| Kappa (×100) | 78.98 ± 0.124 | 75.27 ± 0.277 | 86.79 ± 0.326 | 88.15 ± 0.637 | 84.03 ± 7.725 | 89.02 ± 0.016 | 87.89 ± 0.708 | 92.01 ± 0.114 |
Table 5. Ablation experiments on the Flevoland dataset.

| Methods | Complex | Attention | OA (%) | AA (%) | Kappa |
| --- | --- | --- | --- | --- | --- |
| RV-2D/3D-CNN | × | × | 96.65 | 96.23 | 0.9634 |
| CV-2D/3D-CNN | ✓ | × | 98.20 | 97.32 | 0.9803 |
| RV-2D/3D-CNN-AM | × | ✓ | 98.53 | 98.65 | 0.9840 |
| CV-2D/3D-CNN-AM | ✓ | ✓ | 99.81 | 99.77 | 0.9979 |
Table 6. Ablation experiments on the San Francisco dataset.

| Methods | Complex | Attention | OA (%) | AA (%) | Kappa |
| --- | --- | --- | --- | --- | --- |
| RV-2D/3D-CNN | × | × | 95.46 | 85.74 | 0.9278 |
| CV-2D/3D-CNN | ✓ | × | 95.99 | 87.22 | 0.9365 |
| RV-2D/3D-CNN-AM | × | ✓ | 96.35 | 89.38 | 0.9428 |
| CV-2D/3D-CNN-AM | ✓ | ✓ | 97.32 | 93.71 | 0.9580 |
Table 7. Ablation experiments on the Oberpfaffenhofen dataset.

| Methods | Complex | Attention | OA (%) | AA (%) | Kappa |
| --- | --- | --- | --- | --- | --- |
| RV-2D/3D-CNN | × | × | 92.32 | 90.46 | 0.8683 |
| CV-2D/3D-CNN | ✓ | × | 93.39 | 91.39 | 0.8858 |
| RV-2D/3D-CNN-AM | × | ✓ | 93.13 | 91.14 | 0.8816 |
| CV-2D/3D-CNN-AM | ✓ | ✓ | 95.53 | 94.20 | 0.9232 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
