Communication

Semantic Space Analysis for Zero-Shot Learning on SAR Images

by Bo Liu 1, Jiping Xu 2, Hui Zeng 3,4,*, Qiulei Dong 1 and Zhanyi Hu 1

1 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 Key Laboratory of Industrial Internet and Big Data, China National Light Industry, Beijing 100048, China
3 Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
4 Shunde Innovation School, University of Science and Technology Beijing, Foshan 528399, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(14), 2627; https://doi.org/10.3390/rs16142627
Submission received: 25 May 2024 / Revised: 11 July 2024 / Accepted: 16 July 2024 / Published: 18 July 2024

Abstract: Semantic feature space plays a bridging role from ‘seen classes’ to ‘unseen classes’ in zero-shot learning (ZSL). However, due to the nature of SAR distance-based imaging, which is drastically different from that of optical imaging, how to construct an appropriate semantic space for SAR ZSL is still a tricky and less well-addressed issue. In this work, three different semantic feature spaces, constructed using natural language, remote sensing optical images, and web optical images, respectively, are explored. Furthermore, three factors, i.e., model capacity, dataset scale, and pre-training, are investigated in semantic feature learning. In addition, three datasets are introduced for the evaluation of SAR ZSL. Experimental results show that the semantic space constructed using remote sensing images is better than the other two and that the quality of semantic space can be affected significantly by factors such as model capacity, dataset scale, and pre-training schemes.

1. Introduction

Automatic target recognition in synthetic aperture radar (SAR) images has a broad range of applications, such as ocean surveillance and land surveying. Unfortunately, in contrast to optical images, SAR images are difficult to interpret, even by humans, mainly due to the phase-coherent nature of SAR imaging. Recently, the tremendous success of deep neural networks (DNNs) in optical image analysis has demonstrated that DNNs are able to automatically extract intrinsic information from input data as long as a sufficient amount of representative data is provided. Inspired by this, the SAR image community has also applied DNNs to several SAR-related tasks [1,2], such as ship and scene classification [3,4,5,6], land segmentation [7,8], and moving object detection [9,10]. Many elaborate methods and techniques have been proposed, and decent performance has been achieved in the fully supervised classification scenario.
However, the success of DNNs heavily relies on a large number of labeled SAR images, which are costly to acquire, since labeling SAR images requires a massive amount of expert effort. In order to tackle this data-hungry problem, many techniques [11,12,13,14,15,16] have been proposed by the community. Very recently, zero-shot learning (ZSL) [17] has been considered as a promising way to alleviate this data scarcity problem. SAR image ZSL [18,19,20] involves training a model using labeled SAR images from a group of classes (often named seen classes) to recognize SAR images belonging to another group of classes (often named unseen classes) that have never been seen by the model at the training stage.
To achieve SAR image ZSL, Song et al. [18] trained a deep generative neural network with seen-class data to generate SAR images from their class labels, then used the learned hierarchical feature in the intermediate layer of the deep network as the reference feature (also known as a semantic feature in the ZSL community) to classify unseen-class SAR images. Wei et al. [21] used two auto-encoders to learn a latent feature space as the reference feature space, where unseen-class SAR objects were distinguished from seen-class ones. Song et al. [22] used simulated electromagnetic images to simulate unseen-class SAR images for classifier training and proposed a nonessential factor suppression method to reduce the domain gap between real unseen-class SAR images and simulated unseen-class ones. Yan et al. [12] employed generative adversarial networks to model the distribution of unseen classes, then tackled the SAR image ZSL problem via a supervised learning method.
Despite such successes in terms of model design, less effort has been devoted to the selection of semantic features. We believe that semantic features play a central role in ZSL, since they form the bridge from seen classes to unseen classes. Especially considering that SAR images are less visually informative than optical ones, constructing an appropriate semantic space for SAR ZSL is all the more desirable. Existing SAR ZSL works [18,21] mainly used latent representations learned from SAR images by unsupervised models as semantic features. Due to the limited availability of external information, these semantic features were applied to a very constrained ZSL task with only one unseen class, which is not a typical ZSL scenario in practice. Notably, Toizumi et al. [19] introduced optical image features to SAR image classification, which is similar to the methodology of ZSL, but they used optical image features only for fully supervised SAR image classification. Gui et al. [20] manually collected attributes as semantic features to describe characteristics of land cover types in PolSAR images. Despite these efforts, such works mainly aimed to improve SAR image classification performance rather than investigating semantic space learning and its effects on SAR ZSL; hence, more effort is needed to investigate a general and scalable semantic feature space, given its pivotal role in the SAR image ZSL task.
To this end, in this paper, we report a detailed investigation of semantic feature space for SAR image ZSL. Specifically, we first construct three semantic spaces by introducing three kinds of external resources, i.e., natural language, remote sensing (RS) optical images, and web optical images. Then, we provide an in-depth analysis of the potential factors that may have a substantial effect on the quality of semantic space, including model capacity, the scale of the dataset, and pre-training. For evaluation, we introduce three datasets, where multiple ZSL tasks with different seen–unseen splitting manners and different task difficulties are designed, and more than one unseen class is used to investigate SAR image ZSL in a more typical setting. Considering there exist few datasets for SAR image ZSL, the three introduced datasets and the corresponding semantic features are also expected to be helpful for future research on SAR image ZSL.
In summary, our contributions include the following:
  • We systematically investigate the effects of three popular semantic spaces on SAR ZSL—a first attempt in the literature to our knowledge;
  • We introduce three benchmarks for SAR ZSL. These datasets are constructed with multiple semantic features and multiple data settings that are helpful for future research on SAR ZSL.
The remainder of this paper is organized as follows. Three kinds of semantic feature space and a baseline ZSL method are introduced in Section 2, where three datasets for model evaluation and some implementation details are also described. Section 3 and Section 4 present some experimental results and some in-depth analysis, respectively. A short conclusion is presented in Section 5. For ease of reading, the abbreviations used in the paper are summarized in the Abbreviations section.

2. Materials and Methods

2.1. Semantic Spaces and Baseline Method

2.1.1. Semantic Spaces

In principle, ZSL models first learn to relate low-level and middle-level image features with high-level semantic features (e.g., semantic attributes) using seen-class training data; then, they infer the semantic labels of unseen-class image features according to the similarities between their image features and semantic features. Consequently, it is crucial for ZSL to design an appropriate semantic feature space that can reflect accurate semantic relationships among visual objects. Meanwhile, SAR image ZSL has its own particularities in terms of semantic space construction, since visual appearances in SAR images are usually not quite semantically explicit. Taking all these factors into account, here, we construct three kinds of semantic feature space for SAR image ZSL.
The first one is word vectors of class names. Word vectors are learned from large-scale language-based knowledge bases, which are quite easy to collect on the Internet. In unsupervised learning with a large-scale language model, the learned word vectors of semantic concepts generally encode some common-sense knowledge about the semantic relationships among common object classes. For example, as shown in Figure 1, the semantic similarity (cosine similarity) between SUV and pickup truck (0.588) is larger than that between SUV and airplane (0.260) in the learned word vector space. Meanwhile, SUV and pickup truck have more similar visual appearances than SUV and airplane in SAR images. Since many public pre-trained language models exist, in this work, we directly acquire the word vectors learned by Mikolov et al. [23]: (1) we download the pre-trained model from the website (https://radimrehurek.com/gensim/models/word2vec.html (accessed on 15 July 2024)) and (2) extract the word vectors of the SAR object classes from the pre-trained model according to their class names, as sketched below.
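As a concrete illustration, a minimal sketch of this extraction with gensim is given below (not the authors' released code; the local file name of the pre-trained word2vec model and the handling of multi-word class names are assumptions):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained word2vec vectors of Mikolov et al. [23]
# (file name assumed; see the gensim page linked above for download options).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def class_vector(class_name):
    """Average the vectors of the in-vocabulary words of a (possibly multi-word) class name."""
    words = [w for w in class_name.split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)

classes = ["sedan", "SUV", "pickup truck", "airplane"]   # SAR object class names
E = np.stack([class_vector(c) for c in classes])          # one semantic feature per class

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(E[1], E[2]), cosine(E[1], E[3]))             # SUV vs. pickup truck, SUV vs. airplane
```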
The second one is high-level representations of remote sensing (RS) optical images. Despite the easy access to word vectors, they contain only limited semantic structural knowledge about SAR objects. In contrast, RS optical images, captured from a viewpoint similar to that of SAR images, usually show clearer and more accurate visual appearances of ground objects. As shown in Figure 1, the visual appearances of SUV, pickup truck, and airplane in RS optical images can reflect some key visual features (e.g., geometry) of the three classes in SAR images. Hence, high-level representations of RS optical images are expected to represent more accurate semantic relationships among SAR objects than word vectors. To obtain these representations, we first collect RS optical images from existing datasets, as detailed in the next section; these images are then used to train DNNs on a conventional classification task. After training, the learned models are used to extract high-level representations of the RS optical images, and the high-level representation of each class is obtained by averaging the representations of the images belonging to that class (see the sketch after this paragraph). Furthermore, we investigate how to improve the learned semantic space from three perspectives: (1) employing three DNNs with different representation capacities, namely VGG11 [24], VGG16 [24], and ResNet18 [25]; (2) training the DNNs with different proportions of the training data to fit the semantic feature space of the external resource to different degrees, i.e., using 25%, 50%, and 100% of the training data; and (3) transferring knowledge from other large-scale datasets by pre-training, i.e., pre-training the DNNs on NWPU-RESISC45 [2].
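The per-class averaging step can be written compactly as follows (a minimal sketch under the assumption of a standard PyTorch/torchvision pipeline; the data loader, the number of classes, and the 512-d ResNet18 feature dimension are illustrative, and the classifier training itself is omitted):

```python
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# A classifier fine-tuned on the RS (or web) optical images; training is omitted here.
backbone = models.resnet18(num_classes=10).to(device).eval()
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])   # drop the final fc layer

@torch.no_grad()
def class_prototypes(loader, num_classes, feat_dim=512):
    """Average the penultimate-layer features per class to form the semantic feature matrix."""
    sums = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, 1, device=device)
    for images, labels in loader:                                # images: (B, 3, 128, 128)
        feats = feature_extractor(images.to(device)).flatten(1)  # (B, feat_dim)
        labels = labels.to(device)
        sums.index_add_(0, labels, feats)
        counts.index_add_(0, labels, torch.ones(len(labels), 1, device=device))
    return sums / counts                                          # one semantic feature per class
```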
The third one involves the use of high-level representations of web optical images. Although RS optical images contain key visual features of SAR objects, they are not easy to acquire, and the scale of the collected samples is usually limited. In contrast, web images are quite easy to collect, and a large number of diverse images can be obtained by simply searching for class names on Google Images. Meanwhile, web images also contain some important visual features of SAR objects, as shown in Figure 1. Hence, we also employ high-level representations of web optical images as semantic features. To this end, we first collect web images from Google Images and train DNNs with these images on a classification task; we then extract high-level representations of the images with the learned models, and the averaged representation of each class is taken as the semantic feature of the corresponding class. As in the analysis of RS optical images, we also investigate the effects of (1) the capacity of the DNNs, (2) the scale of the dataset, and (3) pre-training on the quality of the semantic space. In (1), the employed DNNs include VGG11, VGG16, and ResNet18. In (2), the scales are set to 20%, 60%, and 100% of the training set. In (3), the DNNs are pre-trained on ImageNet1000 [27].

2.1.2. Baseline Method

For semantic space analysis, here, we introduce a method to implement ZSL on SAR images. Note that since the main goal of this paper is to study semantic feature space rather than to propose a novel ZSL method, we directly apply a classical ZSL baseline, DeViSE [17], to SAR images. As shown in Figure 2, DeViSE first embeds an SAR image feature $x$, extracted by a feature extractor, into a semantic space via a model $F(\cdot)$, i.e., $\bar{x} = F(x)$. The embedded image feature $\bar{x}$ is then trained to be close to the semantic feature of its own class and far from the semantic features of other classes. It is trained with seen-class data using the following loss function:

$$\mathcal{L} = \mathbb{E}_{(x, y) \sim D}\, C\big(S(\langle E_S, F(x) \rangle), y\big) \quad (1)$$

where $D$ is the training set and $y$ is the label of $x$. $E_S$ is a feature matrix in which each row is the semantic feature of a seen class, and $S(\cdot)$ and $C(\cdot)$ denote the Softmax function and the cross-entropy loss function, respectively. In testing, an unseen-class SAR image is first transformed into an image feature $x_t$ by the feature extractor; $x_t$ is then embedded into the semantic space and classified according to its similarities to the semantic features of the unseen classes:

$$\hat{y} = \arg\max\, \langle E_U, F(x_t) \rangle \quad (2)$$

where $E_U$ is a feature matrix in which each row is the semantic feature of an unseen class, and $\hat{y}$ is the predicted class label of $x_t$.
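A minimal sketch of this baseline follows (our own simplified rendering, not the authors' released implementation; tensor shapes and function names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeviseHead(nn.Module):
    """Linear embedding F(.) that maps extracted SAR image features into the semantic space."""
    def __init__(self, feat_dim, sem_dim):
        super().__init__()
        self.embed = nn.Linear(feat_dim, sem_dim)

    def forward(self, x):
        return self.embed(x)

def devise_loss(head, x, y, E_S):
    """Equation (1): cross-entropy of the softmax over similarities to seen-class semantic features."""
    logits = head(x) @ E_S.t()        # (batch, num_seen_classes)
    return F.cross_entropy(logits, y)

@torch.no_grad()
def predict_unseen(head, x_t, E_U):
    """Equation (2): assign each test sample to the most similar unseen-class semantic feature."""
    scores = head(x_t) @ E_U.t()      # (batch, num_unseen_classes)
    return scores.argmax(dim=1)
```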

2.2. Datasets

Here, we introduce three datasets for the study of semantic space in SAR image ZSL, including a fine-grained one (namely Unicorn-ZSL), a coarse-grained one (namely COS10-ZSL), and a combined one (namely COS15-ZSL).

2.2.1. Unicorn-ZSL

In Unicorn-ZSL, we directly draw a subset from the Unicorn dataset [26] and split the data into seen and unseen classes in several different ways. Specifically, the original Unicorn includes 10 fine-grained car categories whose sample numbers differ significantly, as shown in Table 1. To avoid the data imbalance problem, we draw at most 2000 images from each class in Unicorn; if a class has fewer than 2000 images, we take all of its samples. The statistics of the selected dataset, called Unicorn-ZSL for ease of reading, are reported in Table 1. The SAR images in Unicorn have different sizes, so we reshape them all to 128 × 128 for model training and fair comparison. Since the seen–unseen split has a large effect on ZSL performance, we design two kinds of splits with different degrees of difficulty to evaluate ZSL models: the easier split has 2 unseen classes, while the harder one has 3 unseen classes. Within each kind of split, we further make 3 different splits that have the same number of seen/unseen classes but different class assignments, to investigate the SAR image ZSL problem more thoroughly. The specific unseen classes in each split are reported in Table 2. For the training/testing data split, we uniformly sample 20% of the samples per class to build the testing set, while the rest is used as the training set; a sketch of this per-class split is given below.
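The following is illustrative only (the sample-list format and the random seed are assumptions, and the released splits may differ):

```python
import random
from collections import defaultdict

def split_per_class(samples, test_ratio=0.2, seed=0):
    """samples: list of (image_path, class_label) pairs; returns (train, test) lists."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        n_test = int(round(test_ratio * len(items)))   # 20% of each class goes to the testing set
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```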
In order to learn RS optical image-based semantic features, an RS optical image set for the 10 classes in Unicorn-ZSL is needed. Fortunately, besides SAR images, Unicorn also provides users with RS optical images. Hence, we directly select, at most, 1000 RS optical images for each class in Unicorn, and the statistics of the resulting dataset (called Unicorn-ZSL-RS) are reported in Table 1.
To construct a web image-based semantic space, a web image set is needed to train DNNs for high-level feature learning. To this end, we download 60 web optical images for each class in Unicorn-ZSL from Google Images, forming a dataset called Unicorn-ZSL-Web, where all images are checked carefully so that they are in natural but diverse poses.

2.2.2. COS10-ZSL

In COS10-ZSL, we first collect and annotate five coarse-grained SAR object classes ourselves, then draw five other SAR object classes from Unicorn. More specifically, to collect the five coarse-grained classes, we first download SAR images from the websites of the TerraSAR-X and GaoFen-3 satellites; we then annotate the SAR objects in these images and crop them into small, distinct patches. The statistics of these five classes are shown in Table 3, which also reports the statistics of the five classes selected from Unicorn, from which we draw at most 800 images per class. All images in COS10-ZSL are reshaped to 128 × 128 for model training and fair comparison. Similar to Unicorn-ZSL, multiple seen–unseen splits with different degrees of ZSL difficulty are designed to evaluate ZSL models on COS10-ZSL; their details are reported in Table 2. Note that in the 2-unseen-class setting, one unseen class is from Unicorn and the other is from the five classes collected by us. As in Unicorn-ZSL, 20% of the samples are uniformly sampled per class as the testing set.
To construct semantic space for COS10-ZSL based on RS optical images, we have to collect RS optical images for the five coarse-grained classes in COS10-ZSL. To this end, we directly draw images of the five classes from the NWPU-RESISC45 dataset [2], which is a large-scale RS optical image set containing 45 ground-object classes. As for the RS optical images of the five Unicorn classes, we directly copy them from Unicorn. Table 3 shows the statistics of the whole image set (called COS10-ZSL-RS).
A web image set for the 10 classes in COS10-ZSL is required to learn a web image-based semantic space. For this purpose, a web optical image set called COS10-ZSL-Web is similarly built as in Unicorn-ZSL-Web by searching for and selecting 60 images per class from Google Images.

2.2.3. COS15-ZSL

In COS15-ZSL, we directly combine the 10 classes in Unicorn-ZSL with the 5 coarse-grained classes collected by ourselves. The corresponding RS optical image set and web optical image set are also combined in the same way. In COS15-ZSL, the 15 classes are split into 12 seen classes and 3 unseen classes in 3 different manners, where 2 unseen classes are from Unicorn-ZSL and 1 unseen class is from the 5 coarse-grained classes. The detailed unseen classes in each split are shown in Table 2. The training/testing data split here follows the splits in its two subsets.

2.3. Experimental Setup

Experiments reported in this work were conducted on the three introduced datasets, i.e., Unicorn-ZSL, COS10-ZSL, and COS15-ZSL. In each dataset, the semantic spaces and the models were evaluated under multiple seen–unseen configurations with different degrees of ZSL difficulty, as described in Section 2.2, and 20% of the samples were uniformly sampled per class as the testing set, while the rest were used as the training set.
To implement the ZSL model, i.e., DeViSE [17], a slightly modified VGG11 [24] network was employed. VGG11 originally consists of 5 network blocks followed by a max-pooling layer to extract visual features; in our implementation, the last max-pooling layer was replaced by an average-pooling layer to match the input image size. Note that more advanced networks, such as ResNet [25] and DenseNet [28], could also be used here, but since the used datasets are small-scale, larger and deeper models would risk overfitting; therefore, we chose VGG11, with moderate model complexity, for the experiments. After visual feature extraction, the visual features were projected into the semantic feature space for ZSL via a linear layer whose input and output dimensions equal the visual feature and semantic feature dimensions, respectively. Finally, the projected semantic features and the ground-truth semantic features were used to compute the loss in Equation (1). The networks were trained with an Adam optimizer for 50 epochs with a batch size of 64. The initial learning rate was set to 0.001, and a step schedule was used to adjust the learning rate, decaying it by 10% every 20 epochs. The input SAR images were reshaped to 128 × 128 for model training, and no data augmentation was used in the training procedure. Note that in the training procedure, all model parameters were jointly optimized.
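A minimal sketch of this setup is given below (our own reconstruction under stated assumptions: the semantic dimension, the exact decay factor of the step schedule, and the loader and matrix names are illustrative):

```python
import torch
import torch.nn as nn
from torchvision import models

sem_dim = 300                                   # semantic feature dimension (e.g., 300-d word vectors)

# Modified VGG11: replace the last max-pooling layer with average pooling,
# then project the 512-d visual feature to the semantic dimension via one linear layer.
vgg = models.vgg11(pretrained=False)
features = vgg.features
features[-1] = nn.AdaptiveAvgPool2d(1)
model = nn.Sequential(features, nn.Flatten(), nn.Linear(512, sem_dim))

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# One reading of "decaying the learning rate by 10% every 20 epochs".
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.9)

# Training loop (seen-class SAR images, batch size 64, 50 epochs); E_S is the
# seen-class semantic feature matrix described in Section 2.1.
# for epoch in range(50):
#     for images, labels in train_loader:
#         logits = model(images) @ E_S.t()
#         loss = nn.functional.cross_entropy(logits, labels)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```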
For semantic feature learning, since the word vectors were pre-trained with a large-scale language model, we directly downloaded the word vectors provided by Mikolov et al. [23] and extracted the word vectors of the SAR object classes in the corresponding datasets. For both RS optical image-based and web optical image-based semantic feature learning, 3 kinds of DNNs (i.e., VGG11 [24], VGG16 [24], and ResNet18 [25]) with different architectures were used to extract features. As with the ZSL model, we did not use the deeper and larger ResNet50 [25] or DenseNet [28], to avoid overfitting on our small-scale datasets. The input RS/web optical images were reshaped to 128 × 128 for model training after random data augmentations (such as horizontal flipping, resizing, and random cropping). The models were trained on the image classification task with a cross-entropy loss function, using an Adam optimizer for 50 epochs. The batch size and the initial learning rate were set to 64 and 0.001, respectively, and a step schedule was used in the same way as in the training of the ZSL model. Note that all model parameters were optimized during training. All models were trained on a machine equipped with 32 GB of RAM, an Intel CPU, and an NVIDIA TITAN X GPU; the coding environment was based on Python 3.6 and Torch 1.2.
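For reference, the augmentation pipeline described above might look like the following under torchvision (a sketch; the intermediate resize size before cropping is an assumption consistent with the 128 × 128 working resolution):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(146),               # resize slightly larger, then crop back to 128 x 128
    transforms.RandomCrop(128),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

eval_transform = transforms.Compose([     # no augmentation at feature-extraction time
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
])
```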

3. Results

ZSL Performance

We first report the results with different semantic spaces for the different seen–unseen splits of Unicorn-ZSL, COS10-ZSL, and COS15-ZSL in Table 4, Table 5, and Table 6, respectively. Overall accuracy (OA) and average per-class accuracy (APA) are two widely used metrics for evaluating classification models. Overall accuracy is computed over all testing samples and is suitable for data-balanced tasks. Average per-class accuracy first computes the accuracy of each class and then averages these accuracies over the testing classes, so it can deal with data imbalance. Due to the data imbalance in the proposed datasets, average per-class accuracy over unseen classes is used as the main evaluation metric here. From Table 4, we can see the following: (1) the RS optical image-based semantic space achieves the best performance among the three semantic spaces under both the two-unseen-class and three-unseen-class splits, demonstrating that RS optical images can reflect more appropriate semantic relationships among SAR object classes, since they contain some crucial visual features of SAR objects; (2) the web optical image-based semantic space, although inferior to the RS optical image-based one, still achieves moderate results; (3) word vectors seem unable to adequately deal with these ZSL tasks under either kind of split, because Unicorn-ZSL is a fine-grained dataset in which the semantic relationships among SAR objects are quite close, and word vectors have difficulty capturing such fine-grained distinctions; and (4) despite having the same number of unseen classes, the three different seen–unseen splits yield largely different performances in both the two-unseen-class and three-unseen-class settings. This is because recognition of unseen classes intrinsically relies on the seen-class training data used; with different seen-class data, the difficulty of ZSL differs accordingly, which is also consistent with observations in the ZSL community [29,30].
From Table 5 and Table 6, similar observations can be made, but some notable differences also exist. Firstly, compared with Unicorn-ZSL, both the two-unseen-class and three-unseen-class ZSL tasks in COS10-ZSL are better handled by all three semantic features; in particular, the two-unseen-class tasks reach an accuracy of 100%, even when word vectors are employed as semantic features. This is because COS10-ZSL is a relatively coarse-grained dataset, where unseen classes are distinct in both the semantic and visual feature spaces. It also indicates that the choice of a ‘good’ semantic feature space depends on the discriminability of the object classes in the ZSL task: word vectors, as a scalable and easily accessible semantic feature, are suited to coarse-grained ZSL tasks under a limited cost budget. Secondly, comparing the results of the three-unseen-class tasks in Table 5 and Table 6, the results in COS15-ZSL are better than those in COS10-ZSL, demonstrating that more seen-class data help to learn a more discriminative ZSL model.

4. Discussion

4.1. Factors to Improve Semantic Features

Here, we analyze the potential factors to improve semantic features from three perspectives, namely model capacity, dataset scale, and pre-training. Experiments are conducted on both RS optical images and web optical images in Unicorn-ZSL.
We first report the results for web optical images in Table 7. For the training of web optical image features, the default model is ResNet18 pre-trained on ImageNet1000 [27] and then fine-tuned with the whole web image training set (i.e., 100% of the training set). The intervention groups include different architectures (i.e., VGG16 and VGG11), different dataset scales (i.e., 60% and 20% of the training set), and different pre-training (i.e., training without ImageNet1000 pre-training). From the results presented in Table 7, we find the following: (1) Compared with the semantic features learned by a model without ImageNet1000 pre-training, those learned by a pre-trained model are significantly better, which demonstrates that pre-training on a large-scale, homogeneous image set is helpful for learning high-quality semantic features from web optical images. (2) VGG11 learns better web image-based semantic features, on average, than VGG16 and the default ResNet18. This is because a smaller model capacity alleviates overfitting to these web images, so the general semantic knowledge learned from large-scale pre-training is retained and the semantic features remain more discriminative. (3) When trained with 20% of the training data, the corresponding semantic features achieve the best performance. The reason is similar to that of the second observation: when trained with too many web images from Unicorn-ZSL-Web, the model overfits these web images and forgets the knowledge learned from the large-scale pre-training set, weakening the generalization ability of its representations.
Then, we analyze the RS optical image-based semantic feature spaces learned with different model capacities, dataset scales, and pre-training settings on Unicorn-ZSL. To investigate from a different angle, in contrast to the web image experiment, the default model here is not pre-trained on another large-scale dataset; it is trained from scratch with the whole RS image training set (i.e., 100% of the training set), with VGG11 as the default architecture. The intervention groups include different architectures (i.e., VGG16 and ResNet18), different dataset scales (i.e., 50% and 25% of the training data), and a different pre-training case (i.e., pre-training on NWPU-RESISC45 [2]). Note that in the RS optical image experiment, we investigate pre-training on an RS optical image dataset rather than a web image dataset, since RS images and web images have distinct domain gaps in viewpoint, appearance, and so on. All the results are reported in Table 8, from which we can see the following: (1) When pre-trained on NWPU-RESISC45, the learned semantic features achieve worse performance, in contrast to the case of web image-based feature learning. This is because NWPU-RESISC45 differs considerably from Unicorn-ZSL (with very different semantic classes), so pre-training on NWPU-RESISC45 does not contribute useful prior semantic knowledge for recognizing the object classes in Unicorn-ZSL, whereas in the web image experiment, the pre-training dataset, ImageNet1000, has a very fine-grained semantic category hierarchy, which helps in learning a discriminative semantic feature space. This comparison indicates that a large-scale, fine-grained, homogeneous dataset is important for pre-training when constructing semantic space, which is also largely consistent with observations in the machine learning community [31,32]. (2) Compared with larger networks, VGG11 learns better RS image-based semantic features, on average, since a smaller model capacity alleviates overfitting to the RS optical images, consistent with the observation in the previous web image experiment. (3) When trained with fewer training data (i.e., 25% of the training data), the corresponding semantic features achieve significantly worse performance. This is because the model is trained from scratch: when the scale of the training data is too small, the randomly initialized model cannot learn informative features from the data. This observation contrasts with the previous web image experiment, where the model was pre-trained on a very large-scale, homogeneous image dataset and thus already had the ability to extract informative semantic features, while here, the model is trained from scratch.
In summary, the above experimental results indicate that pre-training on a large-scale, fine-grained, homogeneous dataset helps in learning more accurate semantic features from external resources for SAR image ZSL, compared to training from scratch or pre-training on a small-scale, coarse-grained dataset with a distinct domain shift. Different kinds of semantic feature space may be suited to pre-training on different datasets; hence, the pre-training dataset should be chosen carefully according to the external resource used to learn the semantic space. At the same time, when learning semantic features from a small-scale external resource, we should keep in mind the problem of overfitting to that resource, which can prevent the model from learning general semantic features or weaken the discriminability of the semantic space learned by a large-scale pre-trained model.

4.2. Bias Phenomenon

For a comprehensive analysis, here, we also report some results based on the overall accuracy metric. Table 9 shows a comparison between overall accuracy and average per-class accuracy, where ‘C1’ and ‘C2’ denote the two testing classes in the two-unseen-class setting and the numbers after them indicate the number of tested samples. From Table 9, we can see that on an unbalanced dataset, the model is easily biased towards the ‘large’ class with many samples, so an excellent overall accuracy can be achieved, especially when a less discriminative semantic space (i.e., word vectors) is used. However, this does not necessarily mean the model can effectively discriminate the unseen testing classes, since it simply classifies most of the testing samples into the ‘large’ class. In contrast, average per-class accuracy, which takes all the testing classes into account, evaluates the ZSL discrimination ability of the model in a fairer way, which could serve as a helpful reminder for peers and practitioners in future work.
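The two metrics can be computed as follows (a small sketch, not tied to the authors' evaluation code):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def average_per_class_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))

# A biased classifier that assigns every sample to the larger class ('C1', 168 samples)
# reaches OA = 168/294 = 57.1% in split 2-1 but only 50% APA, as in Table 9.
```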

5. Conclusions

In this work, we investigated semantic feature spaces for ZSL on SAR images. To this end, we first constructed three semantic spaces by introducing three external resources. Then, we introduced three datasets and designed several ZSL tasks to evaluate the learned semantic spaces. Furthermore, we analyzed the factors that improve semantic features from three perspectives, namely model capacity, dataset scale, and pre-training. Experimental results show that semantic space selection is essential for SAR image ZSL and that its quality can be improved by pre-training on a large-scale, homogeneous dataset with an appropriate model capacity and scale of fine-tuning data. At the same time, overfitting to small-scale external resources can prevent the model from learning general semantic features.
There are some limitations in our current work. First, constrained by the scarcity of SAR image data, the number of categories in our introduced benchmarks is still small, which prevents models from learning more prior knowledge from seen classes and weakens their ability to accomplish more challenging unseen-class recognition tasks. In the future, first and foremost, we suggest collecting larger-scale SAR image datasets with more SAR object classes and more diverse semantics, to enable a more in-depth investigation of semantics for SAR ZSL.
Second, due to the distance-based nature of SAR imaging and the resulting difficulty of human interpretation of SAR images, how to construct a suitable ‘semantic feature’ for SAR ZSL remains largely an open question, and our current work is merely a small step in this direction. We believe it is an unavoidable, crucial step if SAR ZSL is to perform on par with ZSL for optical images, and many more efforts are needed. For the immediate future, we believe the following two directions are worth exploring. (1) The use of an implicit semantic space, where ‘semantic’ means ‘concepts’. Our current empirical results show that it seems impractical to construct a suitable explicit semantic feature for SAR images. Considering that SAR-optical image translation has achieved notable advances in the SAR community, how to adapt an intermediate representation in the translation flow is worth pursuing and will be the subject of our future work. (2) The use of large language model-based representations. Currently, large-scale models, including large language models, are revolutionizing AI research. Is it possible to use texts or SAR images to generate a suitable semantic representation for SAR ZSL? Conceptually speaking, it should be possible, but the key is how. We believe this is also a promising direction for SAR ZSL and hope fruitful results will emerge in the future.

Author Contributions

Conceptualization: B.L., H.Z. and Z.H.; methodology: B.L., H.Z., J.X. and Q.D.; software and dataset: B.L.; validation: B.L., H.Z. and Q.D.; draft: B.L. and Z.H.; review and editing: B.L., H.Z., J.X., Q.D. and Z.H.; funding acquisition: H.Z., Q.D. and Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61991423 and 62273034), the Scientific and Technological Innovation Foundation of Foshan (BK21BF004), and the Open Project Program of Key Laboratory of Industrial Internet and Big Data, China National Light Industry, Beijing Technology and Business University.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The original SAR images of the five coarse-grained classes are provided by the Microwave Imaging Lab, School of Electronics Engineering and Computer Science, Peking University.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SAR: Synthetic aperture radar
ZSL: Zero-shot learning
DNNs: Deep neural networks
VGG11/VGG16/ResNet18: Deep neural network architectures
RS: Remote sensing
Unicorn-ZSL: A dataset for SAR ZSL consisting of 10 classes collected from the open Unicorn dataset
COS10-ZSL: A dataset for SAR ZSL consisting of five classes collected by ourselves and five classes collected from the open Unicorn dataset
COS15-ZSL: A dataset for SAR ZSL combining Unicorn-ZSL with COS10-ZSL
OA: Overall accuracy over all testing samples
APA: Average per-class accuracy; average of the accuracies of all testing classes

References

  1. Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
  2. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  3. Zhang, T.; Zhang, X.; Ke, X.; Liu, C.; Xu, X.; Zhan, X.; Wang, C.; Ahmad, I.; Zhou, Y.; Pan, D.; et al. Hog-shipclsnet: A novel deep learning network with hog feature fusion for sar ship classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5210322. [Google Scholar] [CrossRef]
  4. Qian, X.; Liu, F.; Jiao, L.; Zhang, X.; Chen, P.; Li, L.; Gu, J.; Cui, Y. A hybrid network with structural constraints for sar image scene classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5202717. [Google Scholar] [CrossRef]
  5. Wang, L.; Qi, Y.; Mathiopoulos, P.T.; Zhao, C.; Mazhar, S. An improved sar ship classification method using text-to-image generation-based data augmentation and squeeze and excitation. Remote Sens. 2024, 16, 1299. [Google Scholar] [CrossRef]
  6. Wang, Y.; Zhang, W.; Chen, W.; Chen, C. Bsdsnet: Dual-stream feature extraction network based on segment anything model for synthetic aperture radar land cover classification. Remote Sens. 2024, 16, 1150. [Google Scholar] [CrossRef]
  7. Ren, S.; Zhou, F.; Bruzzone, L. Transfer-aware graph u-net with cross-level interactions for polsar image semantic segmentation. Remote Sens. 2024, 16, 1428. [Google Scholar] [CrossRef]
  8. Zhang, S.; Li, W.; Wang, R.; Liang, C.; Feng, X.; Hu, Y. Daliws: A high-resolution dataset with precise annotations for water segmentation in synthetic aperture radar images. Remote Sens. 2024, 16, 720. [Google Scholar] [CrossRef]
  9. Zhang, H.; Jian, Y.; Zhang, J.; Li, X.; Zhang, X.; Wu, J. Moving target shadow detection in video sar based on multi-frame images and deep learning. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 2666–2669. [Google Scholar]
  10. Li, C.; Yang, Y.; Yang, X.; Chu, D.; Cao, W. A novel multi-scale feature map fusion for oil spill detection of sar remote sensing. Remote Sens. 2024, 16, 1684. [Google Scholar] [CrossRef]
  11. Wei, Q.-R.; Chen, C.-Y.; He, M.; He, H.-M. Zero-shot sar target recognition based on classification assistance. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4003705. [Google Scholar] [CrossRef]
  12. Yan, K.; Sun, Y.; Li, W. Feature generation-aided zero-shot fast sar target recognition with semantic attributes. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4006805. [Google Scholar] [CrossRef]
  13. Wei, H.; Wang, Z.; Hua, G.; Ni, Y. A zero-shot nas method for sar ship detection under polynomial search complexity. IEEE Signal Process. Lett. 2024, 31, 1329–1333. [Google Scholar] [CrossRef]
  14. Guo, Q.; Xu, H.; Xu, F. Causal adversarial autoencoder for disentangled sar image representation and few-shot target recognition. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5221114. [Google Scholar] [CrossRef]
  15. Zheng, J.; Li, M.; Li, X.; Zhang, P.; Wu, Y. Revisiting local and global descriptor-based metric network for few-shot sar target classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205814. [Google Scholar] [CrossRef]
  16. Zhao, Y.; Zhao, L.; Ding, D.; Hu, D.; Kuang, G.; Liu, L. Few-shot class-incremental sar target recognition via cosine prototype learning. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5212718. [Google Scholar]
  17. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, NIPS; NeurIPS: San Diego, CA, USA, 2013; pp. 2121–2129. [Google Scholar]
  18. Song, Q.; Xu, F. Zero-shot learning of sar target feature space with deep generative neural networks. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2245–2249. [Google Scholar] [CrossRef]
  19. Toizumi, T.; Sagi, K.; Senda, Y. Automatic association between sar and optical images based on zero-shot learning. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 17–20. [Google Scholar]
  20. Gui, R.; Xu, X.; Wang, L.; Yang, R.; Pu, F. A generalized zero-shot learning framework for polsar land cover classification. Remote Sens. 2018, 10, 1307. [Google Scholar] [CrossRef]
  21. Wei, Q.-R.; He, H.; Zhao, Y.; Li, J.-A. Learn to recognize unknown sar targets from reflection similarity. IEEE Geosci. Remote Sens. Lett. 2020, 19, 4002205. [Google Scholar] [CrossRef]
  22. Song, Q.; Chen, H.; Xu, F.; Cui, T.J. Em simulation-aided zero-shot learning for sar automatic target recognition. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1092–1096. [Google Scholar] [CrossRef]
  23. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, NIPS; NeurIPS: San Diego, CA, USA, 2013; pp. 3111–3119. [Google Scholar]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  26. Liu, J.; Inkawhich, N.; Nina, O.; Timofte, R. Ntire 2021 multi-modal aerial view object classification challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 588–595. [Google Scholar]
  27. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  28. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  29. Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 41, 2251–2265. [Google Scholar] [CrossRef] [PubMed]
  30. Liu, B.; Hu, L.; Hu, Z.; Dong, Q. Hardboost: Boosting zero-shot learning with hard classes. arXiv 2022, arXiv:2201.05479. [Google Scholar]
  31. Cohen-Wang, B.; Vendrow, J.; Madry, A. Ask your distribution shift if pre-training is right for you. arXiv 2024, arXiv:2403.00194. [Google Scholar]
  32. Hsu, W.-N.; Sriram, A.; Baevski, A.; Likhomanenko, T.; Xu, Q.; Pratap, V.; Kahn, J.; Lee, A.; Collobert, R.; Synnaeve, G.; et al. Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training. arXiv 2021, arXiv:2104.01027. [Google Scholar]
Figure 1. Illustration of SAR images and three kinds of external resources for semantic space learning, i.e., natural language, remote sensing optical images, and web optical images.
Figure 2. Architecture of the introduced ZSL method on SAR images.
Table 1. Statistics of Unicorn, Unicorn-ZSL, Unicorn-ZSL-RS, and Unicorn-ZSL-Web.
| Dataset | Sedan | SUV | Pickup Truck | Van | Box Truck | Motorcycle | Flatbed Truck | Bus | Pickup Truck with Trailer | Flatbed Truck with Trailer |
|---|---|---|---|---|---|---|---|---|---|---|
| Unicorn | 234,209 | 28,089 | 15,301 | 10,655 | 1741 | 852 | 828 | 624 | 840 | 633 |
| Unicorn-ZSL | 2000 | 2000 | 2000 | 2000 | 1741 | 852 | 828 | 624 | 840 | 633 |
| Unicorn-ZSL-RS | 1000 | 1000 | 1000 | 1000 | 1000 | 852 | 828 | 624 | 840 | 633 |
| Unicorn-ZSL-Web | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
Table 2. Unseen classes in different seen–unseen splits of Unicorn-ZSL, COS10-ZSL, and COS15-ZSL.
| Dataset | Split | Unseen Classes |
|---|---|---|
| Unicorn-ZSL | 2-1 | Pickup truck with trailer, flatbed truck with trailer |
| Unicorn-ZSL | 2-2 | SUV, flatbed truck |
| Unicorn-ZSL | 2-3 | SUV, bus |
| Unicorn-ZSL | 3-1 | Sedan, pickup truck with trailer, flatbed truck with trailer |
| Unicorn-ZSL | 3-2 | Motorcycle, pickup truck with trailer, flatbed truck with trailer |
| Unicorn-ZSL | 3-3 | SUV, bus, pickup truck with trailer |
| COS10-ZSL | 2-1 | Flyover, flatbed truck |
| COS10-ZSL | 2-2 | Flyover, pickup truck |
| COS10-ZSL | 2-3 | Flyover, SUV |
| COS10-ZSL | 3-1 | Flyover, sedan, flatbed truck |
| COS10-ZSL | 3-2 | Flyover, sedan, pickup truck |
| COS10-ZSL | 3-3 | Flyover, pickup truck, bus |
| COS15-ZSL | 3-1 | Flyover, pickup truck with trailer, flatbed truck with trailer |
| COS15-ZSL | 3-2 | Flyover, SUV, pickup truck with trailer |
| COS15-ZSL | 3-3 | Flyover, sedan, flatbed truck with trailer |
Table 3. Statistics of COS10-ZSL, COS10-ZSL-RS, and COS10-ZSL-Web.
| Dataset | Airplane | Bridge | Building | Flyover | Oiltank | Sedan | SUV | Pickup Truck | Flatbed Truck | Bus |
|---|---|---|---|---|---|---|---|---|---|---|
| COS10-ZSL | 190 | 400 | 500 | 495 | 655 | 800 | 800 | 800 | 800 | 624 |
| COS10-ZSL-RS | 700 | 700 | 700 | 700 | 700 | 800 | 800 | 800 | 800 | 624 |
| COS10-ZSL-Web | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
Table 4. Comparative results of three semantic spaces in different seen–unseen splits of Unicorn-ZSL.
| Semantic Space | 2-1 | 2-2 | 2-3 | 3-1 | 3-2 | 3-3 |
|---|---|---|---|---|---|---|
| Word vector | 50.0 | 50.0 | 66.2 | 33.3 | 33.3 | 37.5 |
| RS optical image | 95.1 | 71.7 | 82.8 | 51.2 | 65.1 | 60.9 |
| Web optical image | 52.4 | 65.5 | 68.1 | 39.2 | 36.5 | 51.0 |
Table 5. Comparative results of three semantic spaces in different seen–unseen splits of COS10-ZSL.
| Semantic Space | 2-1 | 2-2 | 2-3 | 3-1 | 3-2 | 3-3 |
|---|---|---|---|---|---|---|
| Word vector | 100.0 | 100.0 | 100.0 | 66.7 | 66.7 | 68.0 |
| RS optical image | 100.0 | 100.0 | 100.0 | 66.7 | 73.5 | 76.4 |
| Web optical image | 100.0 | 100.0 | 100.0 | 66.7 | 66.7 | 67.2 |
Table 6. Comparative results of three semantic spaces in different seen–unseen splits of COS15-ZSL.
| Semantic Space | 3-1 | 3-2 | 3-3 |
|---|---|---|---|
| Word vector | 66.8 | 75.4 | 74.8 |
| RS optical image | 79.1 | 66.7 | 84.0 |
| Web optical image | 82.1 | 90.0 | 71.1 |
Table 7. Results of web image-based semantic space trained with different configurations on Unicorn-ZSL. The default configuration is ResNet18 pre-trained with ImageNet1000 and fine-tuned with 100 % of the training set.
| Setting | 2-1 | 2-2 | 2-3 | 2-ave | 3-1 | 3-2 | 3-3 | 3-ave |
|---|---|---|---|---|---|---|---|---|
| Default | 52.4 | 65.5 | 68.1 | 62.0 | 39.2 | 36.5 | 51.0 | 42.2 |
| VGG11 | 54.0 | 50.0 | 83.6 | 62.5 | 44.1 | 57.0 | 58.5 | 53.2 |
| VGG16 | 53.7 | 50.0 | 76.0 | 59.9 | 45.5 | 33.3 | 33.3 | 37.4 |
| Scale (60%) | 59.5 | 50.0 | 66.8 | 58.8 | 41.9 | 51.2 | 41.7 | 44.9 |
| Scale (20%) | 68.3 | 50.4 | 69.6 | 62.8 | 47.3 | 50.4 | 61.6 | 53.1 |
| w/o PT | 50.0 | 50.0 | 58.5 | 52.8 | 33.3 | 33.3 | 43.5 | 36.7 |
Table 8. Results of RS image-based semantic space trained with different configurations on Unicorn-ZSL. The default configuration is VGG11 without pre-training and trained from scratch with 100 % of the training set.
| Setting | 2-1 | 2-2 | 2-3 | 2-ave | 3-1 | 3-2 | 3-3 | 3-ave |
|---|---|---|---|---|---|---|---|---|
| Default | 95.1 | 71.7 | 82.8 | 83.2 | 56.3 | 41.8 | 58.7 | 52.3 |
| VGG16 | 74.0 | 60.5 | 61.1 | 65.2 | 52.5 | 65.4 | 33.3 | 50.4 |
| ResNet18 | 67.1 | 75.6 | 71.7 | 71.5 | 41.5 | 33.3 | 65.1 | 46.6 |
| Scale (50%) | 94.7 | 68.5 | 86.9 | 83.4 | 48.4 | 50.9 | 51.0 | 50.1 |
| Scale (25%) | 50.3 | 71.8 | 66.6 | 62.9 | 43.3 | 47.9 | 41.5 | 44.2 |
| PT | 62.6 | 68.9 | 77.0 | 69.5 | 44.2 | 35.9 | 46.6 | 42.2 |
Table 9. Comparison between overall accuracy (OA) and average per-class accuracy (APA).
| Setting | 2-1: C1 (168) | C2 (126) | OA | APA | 2-2: C1 (400) | C2 (165) | OA | APA | 2-3: C1 (400) | C2 (124) | OA | APA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Word vector | 100.0 | 0.0 | 57.1 | 50.0 | 100.0 | 0.0 | 70.8 | 50.0 | 47.8 | 84.7 | 56.5 | 66.2 |
| RS optical image | 91.1 | 99.2 | 94.6 | 95.1 | 68.3 | 75.2 | 70.3 | 71.7 | 90.0 | 75.8 | 86.4 | 82.8 |
| Web optical image | 100.0 | 4.8 | 59.2 | 52.4 | 89.3 | 41.8 | 75.4 | 65.5 | 95.0 | 41.1 | 82.3 | 68.1 |
