1. Introduction
Remote sensing images play an important role in meteorology, hydrology, traffic management, disaster prediction, and related fields. In the big data era, the volume of remote sensing imagery has grown rapidly, so the efficient utilization of large remote sensing archives has become a focus of attention. Single-modal retrieval methods are generally less versatile than multi-modal approaches: multi-modal methods not only make data usage more flexible but also integrate information across modalities, improving its reliability and richness. Consequently, cross-modal retrieval in remote sensing has attracted considerable scholarly attention in recent years.
To realize mutual retrieval between data of different modalities, existing cross-modal image–text retrieval methods for remote sensing images involve two primary steps: feature extraction within each modality and interaction between multi-modal features [1]. Extracting image and text features is the initial crucial phase of cross-modal retrieval, with image feature extraction posing the greater challenge. For image feature extraction, prevailing cross-modal image–text retrieval models for remote sensing images primarily employ the convolutional neural network (CNN) and the vision transformer (ViT) [2]. CNNs remain the predominant image feature extractors in cross-modal retrieval of remote sensing images [3,4,5,6,7,8,9,10,11]. With the recent surge in ViT's popularity, many researchers have also used ViT to extract visual features directly [2,12,13]. Alternative methods have also served as visual encoders. For instance, References [14,15] employed hypergraph neural networks to construct visual encoders, and Reference [16] applied the concept of graph neural networks to build text and remote sensing image modules, achieving an interactive fusion of image and text features. Some studies have combined multiple methods to extract image features [17,18]. In general, mainstream CNN architectures require many parameters and considerable computing resources; moreover, because they are limited by the size of the convolution kernel, CNN models perceive global information poorly. In contrast, ViT models, widely adopted in computer vision in recent years, do not rely on convolutional operations but instead use a global self-attention calculation that perceives global features effectively; ViT also has certain advantages in parameter count and computational complexity. However, this approach may focus excessively on large targets, impeding the extraction of small targets. The distinction between large and small targets in this paper is relative: within a remote sensing image, a target occupying a relatively high proportion of pixels is classified as large, and one occupying a relatively low proportion as small. In cross-modal image–text retrieval of remote sensing images, the text describes both large and small targets, and when distinguishing between similar images at a fine-grained level, small-target information often carries greater weight. Thus, enhancing the extraction of small-target information while preserving the extraction of large-target information remains a pivotal concern in cross-modal image–text retrieval of remote sensing images.
To address the challenge of extracting small targets from remote sensing images, some researchers have sought to enhance the ViT model. For instance, Reference [19] devised a layered ViT to extract both shallow and deep semantic information from images, achieving multi-scale feature fusion through downsampling and partially retaining small-target information. However, because the global self-attention calculation is still used when extracting shallow semantic information, this does not fundamentally resolve the interference of large targets with small-target information. Reference [20] augmented ViT with dense blocks to preserve semantic features within each transformer block layer; these features were concatenated with global features and passed to the subsequent transformer block layer, helping preserve the semantic features of small targets and mitigating their vanishing during forward propagation. Furthermore, Reference [21] employed a combination of CNN and ViT to extract local and global visual features, respectively, enhancing the extraction of small-target information, though at the cost of a complex structure. The primary limitation impeding ViT's efficacy on small targets stems from the receptive field of its self-attention layer [22]: for small-target features, an excessive number of patches introduces noise (in the form of large-target features) during the feature update, suppressing the expression of small-target features. Consequently, directly incorporating ViT into cross-modal retrieval of remote sensing images may allow large-target features to interfere with small-target feature extraction, thereby degrading retrieval effectiveness.
Under the global attention mechanism, large-target features interfere with the extraction of small-target features, which disturbs fine-grained semantic alignment during cross-modal retrieval of remote sensing images and reduces retrieval accuracy. To mitigate noise interference within the global receptive field, this study constructs distinct receptive fields within the self-attention layer to diminish the impact of large targets on small-target feature extraction. Furthermore, given the dispersed distribution of small targets in remote sensing images, we establish these receptive fields according to the saliency and similarity of regional image features. Specifically, we use the feature similarity between patches to judge whether a patch belongs to a large target, and we craft distinct receptive fields based on this similarity so that large- and small-target features are represented separately. This strategy reduces the noise that large-target features impose on small-target features, thereby facilitating the expression of the latter. In summary, the main contributions of this work fall into two aspects:
1. Given the multi-scene attributes of remote sensing images, we introduce a patch classification method based on cosine similarity: the sum of a patch's cosine similarities to all other patches serves as its feature saliency, and this saliency drives the classification (a code sketch of the idea follows this list).
2. Building upon ViT-based image feature extraction, we devise a tailored structure for the enhanced vision transformer (EViT) model aimed at strengthening small-target features. Compared with conventional ViT models, EViT incorporates, within certain block layers, multiple sets of twin networks with different receptive fields operating in parallel with shared parameters. On this basis, we develop an enhanced feature extraction framework (AEFEF) for cross-modal image–text retrieval of remote sensing images. Verification experiments on the RSITMD and RSICD datasets yielded promising results, showing that the proposed method improves retrieval accuracy.
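To make the grouping idea of contribution 1 concrete, the following is a minimal PyTorch sketch for a single image's patch tokens. The function name, the min–max normalization of the saliency score, and the default threshold are our illustrative assumptions standing in for the exact rule defined by Equation (9); they are not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def group_tokens(tokens: torch.Tensor, gamma: float = 0.5):
    """Split patch tokens into a salient (large-target) group and a
    non-salient (small-target) group by summed cosine similarity.

    tokens: (N, D) patch embeddings, class token excluded.
    gamma:  classification threshold on the normalized saliency score
            (placeholder for the threshold in Equation (9)).
    """
    normed = F.normalize(tokens, dim=-1)     # unit-length embeddings
    sim = normed @ normed.t()                # (N, N) pairwise cosine similarities
    saliency = sim.sum(dim=-1)               # a patch similar to many others
                                             # likely belongs to a large target
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-6)
    large_mask = saliency >= gamma           # large-target patches
    return large_mask, ~large_mask           # boolean masks over the N patches
```

The intuition is that pixels of a large, homogeneous target produce many mutually similar patches, so a high similarity sum flags a large-target patch, while small, scattered targets yield low sums.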
The remainder of this paper is organized as follows. Section 2 gives a comprehensive introduction to the proposed method. Section 3 presents the experimental details and results with a brief analysis. Section 4 discusses the experimental results in depth and expounds the advantages and disadvantages of the proposed method. Finally, Section 5 gives a brief summary of the research.
4. Discussion
During the experiments, EViT improved retrieval accuracy over ViT significantly more on the RSITMD dataset than on the RSICD dataset. This disparity can be attributed to RSITMD's richer text semantics and lower text repeatability, as well as the larger proportion of small targets in its texts, which places greater demands on extracting small-target features from the image. Furthermore, a larger number of patches corresponded to higher retrieval accuracy and a stronger improvement over ViT. This improvement arises because the local receptive field in the feature space shields the self-attention calculation from irrelevant image regions. With few patches, this interference is relatively small and does not need shielding; as the number of patches grows, however, the softmax computation cannot effectively suppress the interference from irrelevant regions.
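As a toy numerical illustration of this dilution effect (our numbers, not the paper's), consider a query with one strongly matching key (the small target itself) among many weakly matching ones (large-target patches):

```python
import torch

# As N grows, the aggregate softmax mass on the N-1 weakly matching keys
# grows and dilutes the single strong match. Scores are illustrative.
for n in (16, 64, 256):
    scores = torch.ones(n)          # weak affinity to irrelevant patches
    scores[0] = 3.0                 # strong affinity to the relevant patch
    rel = float(torch.softmax(scores, dim=0)[0])
    print(f"N={n:3d}  relevant weight={rel:.3f}  irrelevant mass={1 - rel:.3f}")
```

With these scores, the relevant patch's attention weight falls from roughly 0.33 at N=16 to under 0.03 at N=256, even though its raw affinity is unchanged.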
The experimental results in the previous section prove the effectiveness of the proposed method. We grouped tokens by feature similarity and set up twin networks to improve the model's retrieval performance for small targets. However, in cross-modal retrieval of remote sensing images, large targets are also an important part of the text description. Another problem to consider is the interference between small-target and large-target feature extraction that arises when the EViT method is used to enhance small-target feature extraction.
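To make the twin-network structure concrete, the sketch below shows one possible form of a twinned transformer block in PyTorch: the two branches share weights, and each attends only within its own token group. The class, its pre-norm layout, and the mask handling are our illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TwinnedBlock(nn.Module):
    """One transformer block run twice with shared (twin) weights: once on
    the large-target token group and once on the small-target group, so
    tokens in one group cannot attend to, and be drowned out by, the other.
    Illustrative reading of the twinned-block idea."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def _block(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm attention
        return x + self.mlp(self.norm2(x))                 # pre-norm MLP

    def forward(self, tokens: torch.Tensor, large_mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); large_mask: (N,) boolean mask from the token
        # grouping step (assumes both groups are non-empty).
        out = tokens.clone()
        for mask in (large_mask, ~large_mask):  # same weights, two receptive fields
            out[:, mask] = self._block(tokens[:, mask])
        return out
```

Because the two passes share one set of weights, the twin structure restricts each group's receptive field without doubling the block's parameter count, which is consistent with the small parameter overhead reported later.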
We regard the number of twinned transformer blocks as a measure of how much emphasis the model places on small targets. Therefore, we designed experiments to explore how changing the number of twinned transformer blocks affects retrieval accuracy.
Table 5 shows the retrieval accuracy when the EViT method was used for image feature extraction with different numbers of twinned transformer blocks. Due to GPU performance constraints, the number of twinned transformer blocks was limited to 1, 2, 3, or 4. To make the trend of retrieval accuracy with this number easy to observe, mR was used as the sole comparison index.
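For reference, mR is commonly computed in this literature as the mean of the six recall values, i.e., R@1, R@5, and R@10 in each retrieval direction; a minimal sketch with hypothetical numbers:

```python
def mean_recall(r_i2t, r_t2i):
    """mR: the mean of R@1, R@5, and R@10 over both retrieval directions.
    r_i2t, r_t2i: (R@1, R@5, R@10) tuples in percent."""
    return sum(r_i2t + r_t2i) / 6.0

# e.g. mean_recall((12.1, 30.2, 45.3), (10.4, 28.7, 44.0)) -> 28.45
```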
Table 5 illustrates that, while the number of twinned transformer blocks was less than four, increasing their number led to higher retrieval accuracy and a stronger scene-distinguishing capability. The likely reason is that more twinned transformer blocks strengthen the model's ability to extract small targets from remote sensing images, matching the fine-grained text descriptions more closely and thus improving retrieval accuracy. However, no further experiments were conducted under the same settings due to resource constraints.
To explore the effect of adding more twinned transformer blocks, we performed a simple extension experiment with the batch size set to 16. To reproduce the original training conditions as closely as possible at this batch size, we accumulated gradients over every two training batches (a sketch of this loop follows below). To preserve the regularity of the network structure, we compared the retrieval accuracy when three, four, six, and eight twinned transformer blocks were added. For three and four twinned transformer blocks, their positions matched the corresponding positions in Table 5; for six, we changed the odd-numbered blocks of the 12 blocks to twinned transformer blocks; for eight, we changed the first three of every four blocks. Because of the different batch sizes used, these results are not displayed in the same table.
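A minimal sketch of the gradient-accumulation loop referenced above, assuming a PyTorch training setup; the function name and the loss call are illustrative, not the paper's code:

```python
import torch

def train_epoch(model, loader, optimizer, accum_steps: int = 2):
    """Accumulate gradients over `accum_steps` mini-batches of 16 so the
    effective batch approximates the batch of 32 used elsewhere."""
    optimizer.zero_grad()
    for step, (images, texts) in enumerate(loader):
        loss = model(images, texts) / accum_steps  # scale so accumulated grads
        loss.backward()                            # match one larger batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()                       # one update per two batches
            optimizer.zero_grad()
```

Scaling the loss by the number of accumulation steps keeps the summed gradients comparable to those of a single batch of 32, so the optimizer sees a similar update magnitude.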
The results in Table 6 show that model retrieval performed best when the number of twinned transformer blocks was four; beyond four, the retrieval accuracy tended to decline.
Our analysis indicates that this phenomenon arises from the interference between small-target and large-target feature extraction when the EViT method is used to enhance small-target feature extraction. Exploring the impact of the number of modified transformer blocks on retrieval accuracy showed that the model performed best on the RSITMD dataset with four modified blocks: below four, retrieval accuracy improved gradually as more blocks were changed, but beyond four it gradually decreased and even fell below that of the ViT method. This preliminary finding suggests that enhancing small-target feature extraction can improve retrieval effectiveness to a certain extent, but it may also hinder the extraction of large-target features. Hence, both aspects must be balanced according to the retrieval requirements.
The existing model structure remains somewhat rigid, and the number and position of the twinned transformer blocks in the visual encoder are still empirical conclusions. In future work, the structure will need adjustment when the method is applied to different datasets: for images with varying proportions of small targets, the network may require flexible changes to different blocks, and when the proportion of small targets is large, the number of twinned transformer blocks should be increased accordingly, and vice versa. A network architecture that dynamically adjusts the number and position of twinned transformer blocks is therefore needed. In addition, some details require further optimization, such as the selection of the classification threshold γ in the token grouping block (Equation (9)). Finally, it should be noted that the method is not designed solely for small-target feature extraction; it is proposed on the premise that both large and small targets must be preserved, so the results may not be superior if the method is applied purely to small-target feature extraction.
5. Conclusions
In this paper, we propose an enhanced ViT method to tackle the challenge of extracting small-target features in cross-modal image–text retrieval of remote sensing images. To address the interference of large targets with small targets within the global receptive field of conventional ViT methods, our approach constructs different receptive fields based on the saliency and similarity of image regions, strengthening the extraction of small-target features. Unlike the conventional practice of designing local receptive fields in the spatial dimension, our method groups image regions in the feature dimension to construct the receptive fields. Additionally, to optimize the network structure and limit the number of parameters, we introduced a twin network structure, which improved retrieval accuracy while only marginally increasing computational complexity. Experimental evaluation on public datasets demonstrates that the proposed method improves accuracy: compared with ViT, the mR index on the Remote Sensing Image–Text Match Dataset (RSITMD) improved by 2.08%, with an increase of only 1.41% in model parameters.
A limitation of this study is that the retrieval effects on large and small targets were not evaluated separately in the experiments. Objects in remote sensing images vary in scale, and there is demand for retrieving both large and small objects, so better evaluation indexes are needed to analyze model performance when retrieving objects at different scales. As a result, the enhancement of small-target expression by our proposed method is difficult to verify quantitatively with existing indicators. In future research, we will focus on enhancing the network structure to dynamically adapt to scenarios that require both large- and small-target feature extraction and on improving the accuracy and generalization ability of the model.