1. Introduction
With the advancement of remote-sensing technology, obtaining high-resolution remote-sensing images has become increasingly accessible, providing abundant data resources for research in the field of Earth observation. Traditional computer vision tasks based on remote-sensing images, such as object detection [1], semantic segmentation [2], and scene classification [3], typically focus on analyzing the local semantic information of objects in the image or summarizing the global semantic information of the entire scene. In contrast, remote-sensing image captioning (RSIC) [4,5] aims to capture both local and global semantic information within images, understand the semantic relationships among local semantic elements and between local and global semantic information, and finally “translate” them into descriptive sentences that conform to the grammatical rules of natural language.
RSIC views the description generation process as a form of machine translation [6] and therefore adopts the encoder–decoder framework that is standard in machine translation. Convolutional neural networks (CNNs) [7] are used as the encoder to extract visual features, while the decoder employs language models such as recurrent neural networks (RNNs) [8], long short-term memory networks (LSTMs) [9], or transformers [10] to generate text.
To improve the performance and interpretability of models, some studies [11,12,13] have attempted to integrate attention mechanisms into the encoder–decoder framework to achieve semantic alignment between text and image. On this basis, additional tasks are added to obtain auxiliary information, further enhancing the model’s ability to understand semantic information and relationships within the images. We refer to this as a multi-stage method [14,15,16,17]. Zhao et al. [16] proposed a structured-attention block that leverages the structured characteristics of semantic content in remote-sensing images to perform semantic segmentation; masks are then created from the segmented content to guide attention to specific image regions during description generation. Ren et al. [17] replaced the original transformer encoder with the pre-trained vision transformer (ViT) [18], using multi-head self-attention (MSA) to explore the dependencies between different patches in the patch sequence, thereby enabling the model to learn both local and global semantic information from the images.
The above work mainly processes image features from the perspective of the spatial domain to explore semantic information and relationships in images. However, challenges remain when the same object type exhibits significant differences across scenes.
Figure 1 shows that the same type of object—“house”—exhibits significant differences in quantity, structure, layout, and relationship with the surrounding environment in different scenes. This indicates that solely analyzing images from the spatial domain may not be sufficient to capture all necessary semantic information. Therefore, it is necessary to integrate other methods to more comprehensively understand and describe the complex semantic relationships in remote-sensing images.
Previous works [19,20,21] have demonstrated that each channel of a feature map corresponds to different semantic information. Therefore, we approach the problem from both the spatial and channel domains, learning dependencies between different positions while also exploring dependencies between different channels. Once these two kinds of dependencies are obtained, effectively utilizing this information to enhance the model’s understanding of complex semantic relationships in images becomes a new challenge. To address this issue, existing works [22,23,24] often employ traditional fusion techniques, such as channel concatenation or element-wise summation. However, these methods may be insufficient to fully capture and utilize the complex relationships between spatial and channel domains. Particularly when dealing with remote-sensing images, which are highly complex and rich in detail, simple fusion techniques often lack the flexibility to adapt to semantic variations across different scenes.
To address the above issues, we propose a model for RSIC called the positional-channel semantic fusion transformer (PCSFTr). PCSFTr is constructed with a multi-stage method: in the first stage, the model is trained on a scene classification task to obtain the visual feature extractor, thereby preliminarily learning image semantic information and producing visual features. To further explore the semantic information in images, we introduce the positional-channel multi-head self-attention (PCMSA) block. PCMSA learns the dependencies within visual features from both spatial and channel perspectives. A feature fusion (FF) block based on the channel attention mechanism is then introduced to fuse these two kinds of dependencies, further enhancing the model’s understanding of image semantic information and relationships.
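To make the idea concrete, the following minimal PyTorch sketch illustrates one plausible form of the PCMSA block: positional multi-head self-attention (PMSA) treats the spatial positions of the feature map as tokens, while channel multi-head self-attention (CMSA) transposes the features so that channels become tokens. The dimensions, head counts, and module names are illustrative assumptions rather than the exact implementation described later in the paper.

```python
import torch
import torch.nn as nn

class PCMSA(nn.Module):
    """Illustrative positional-channel multi-head self-attention block.

    PMSA: tokens = spatial positions (H*W), so attention models
          dependencies between positions.
    CMSA: tokens = channels (C), obtained by transposing the feature
          map, so attention models dependencies between channels.
    """

    def __init__(self, dim=512, num_positions=49, num_heads=8):
        super().__init__()
        # Positional branch: embedding dimension is the channel size.
        self.pmsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel branch: embedding dimension is the number of positions.
        self.cmsa = nn.MultiheadAttention(num_positions, 7, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) visual features, e.g. N = 7*7 = 49, C = 512
        pos_out, _ = self.pmsa(x, x, x)        # (B, N, C), positional dependencies
        xc = x.transpose(1, 2)                 # (B, C, N), channels as tokens
        chn_out, _ = self.cmsa(xc, xc, xc)     # (B, C, N), channel dependencies
        chn_out = chn_out.transpose(1, 2)      # back to (B, N, C)
        return pos_out, chn_out


if __name__ == "__main__":
    feats = torch.randn(2, 49, 512)            # flattened ResNet feature map
    p, c = PCMSA()(feats)
    print(p.shape, c.shape)                    # torch.Size([2, 49, 512]) twice
```

In this sketch, the two branches run in parallel on the same input and return two dependency-enhanced feature sequences, which the FF block is then responsible for fusing.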
The main contributions of this paper are as follows:
We propose a novel RSIC model, PCSFTr, which can learn image semantic information from both spatial and channel domains and fuse them to understand semantic relationships in images.
To enhance the extraction of semantic information, we introduce a PCMSA block, designed to concurrently learn dependencies across both positional and channel dimensions. This dual-domain approach allows the model to detect subtle nuances often missed by conventional methods, thereby enabling a deeper and more comprehensive analysis of complex scenes.
We introduce an FF block that can integrate dependencies to enhance semantic understanding and capture complex image relationships. This block refines traditional fusion methods by employing both global and local channel attention strategies, effectively elucidating the relationships between objects and attributes, thus enhancing description accuracy and reliability.
Experiments were conducted on four RSIC datasets and compared with several state-of-the-art methods. The results indicate that PCSFTr achieved significant improvements across all four datasets.
The remainder of the paper is organized as follows: Section 2 provides a review of related work, summarizing previous developments and identifying gaps that our study aims to fill. Section 3 describes the methodology, detailing the design and implementation of our proposed positional-channel semantic fusion transformer (PCSFTr). Section 4 reports on experiments conducted to evaluate the effectiveness of our model across four standard datasets, followed by a comparison with existing approaches. Finally, Section 5 concludes the paper with a summary of our findings and a discussion of potential avenues for future research.
4. Experiments
This section presents the experimental evaluation of the PCSFTr across various datasets and scenarios to assess its performance. It includes descriptions of the datasets, evaluation metrics, experimental setup, ablation studies, comparative analyses, and visualizations to demonstrate the model’s effectiveness in remote-sensing image captioning.
4.1. Datasets
In the field of RSIC, three widely used datasets are UCM-caption [4], Sydney-caption [4], and RSICD [5]. Recently, a new dataset, NWPU-captions [27], has been proposed. This dataset has more images, more comprehensive ground scenes, and more complex and diverse sentences than the previous three datasets.
UCM-caption: Proposed by Qu et al. [4], based on the UC Merced land-use dataset [38]. It consists of 2100 images of size 256 × 256, covering 21 scenes, each with 100 images. Each image has 5 reference sentences, with a vocabulary of 285 words.
Sydney-caption: Proposed by Qu et al. [4], comprising 613 images of size 500 × 500, covering 7 scenes. Each image has 5 reference sentences, with a vocabulary of 169 words.
RSICD: Proposed by Lu et al. [5], consisting of 10,921 images of size 224 × 224, covering 30 scenes. Each image has up to 5 reference sentences, with a vocabulary of 2632 words.
NWPU-captions: Proposed by Cheng et al. [27], based on the NWPU-RESISC45 dataset [39]. It comprises 31,500 images of size 256 × 256, covering 45 scenes, each with 700 images. Each image has 5 reference sentences, with a vocabulary of 3149 words, the largest among the four datasets.
All four datasets are divided into training, validation, and test sets in a ratio of 8:1:1.
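For reproducibility, a split in this ratio can be generated as in the minimal sketch below; the random seed and the use of plain image indices are assumptions for illustration and do not correspond to any official split files.

```python
import random

def split_8_1_1(items, seed=42):
    """Shuffle a list of image IDs and split it into train/val/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val = int(0.8 * len(items)), int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_ids, val_ids, test_ids = split_8_1_1(range(2100))  # e.g., UCM-caption has 2100 images
print(len(train_ids), len(val_ids), len(test_ids))        # 1680 210 210
```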
4.2. Evaluation Indicators
In RSIC, the following five metrics are commonly used to evaluate the generated sentences: BLEU [40], METEOR [41], ROUGE-L [42], CIDEr [43], and SPICE [44]. BLEU, METEOR, ROUGE-L, and SPICE range from 0 to 1, while CIDEr ranges from 0 to 5.
BLEU: A metric used to evaluate machine translation quality. BLEU measures the similarity between the generated and reference sentences based on n-gram matching principles, thereby assessing the accuracy of the generated sentences. N-grams are sequences of n consecutive words, with n ranging from 1 to 4.
METEOR: A metric for evaluating machine translation quality. It uses external semantic resources like WordNet to achieve precise word matching, considering stems, prefixes, synonyms, and word order, thus measuring the similarity between the generated and reference sentences.
ROUGE-L: A metric for evaluating text summarization quality, where L denotes the longest common subsequence (LCS). It calculates recall, precision, and F1 score based on the length of the longest common subsequence between the generated and reference sentences.
CIDEr: A metric designed explicitly for image captioning. It treats each sentence as a document and computes its n-gram term frequency-inverse document frequency (TF-IDF) vector. TF-IDF weights each n-gram, and cosine similarity measures semantic consistency between the generated description and the reference sentences.
SPICE: Another metric designed for image description. Unlike the above metrics, SPICE is based on graph structures, encoding objects, attributes, and their relationships in the description, mapping them into a scene graph, and computing F-scores for these components.
Among these metrics, BLEU, ROUGE-L, and METEOR are borrowed from other text-generation tasks. They assess model performance by evaluating the similarity between generated and reference sentences but cannot adequately measure the model’s ability to extract and utilize semantic information. CIDEr and SPICE, however, are tailored for image description and provide evaluations closer to human judgment.
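For reference, BLEU-1 through BLEU-4 can be reproduced with off-the-shelf tooling; the short sketch below uses NLTK's corpus-level BLEU implementation (METEOR, ROUGE-L, CIDEr, and SPICE are typically computed with the coco-caption evaluation toolkit). The captions shown are invented examples, not sentences from the datasets.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption and its reference captions, all tokenized.
hypotheses = ["many buildings and green trees are in a dense residential area".split()]
references = [[
    "many buildings and some green trees are in a dense residential area".split(),
    "lots of houses arranged neatly with some green trees around".split(),
]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```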
4.3. Experimental Setup
All experiments were conducted on an NVIDIA GeForce RTX 2080 Ti device. In these experiments, the images from the UCM-caption, RSICD, and NWPU-captions datasets were resized to 224 × 224 pixels, while the images from the Sydney-caption dataset were resized to 384 × 384 pixels. During the training phase, the batch size of PCSFTr was set to 8, the initial learning rate was 3 × 10⁻⁵ with a decay of 0.8 every 3 epochs, and the total number of epochs was 20. During the inference phase, beam search was used with a beam size of 3. In this study, ResNet was trained on the remote-sensing image scene classification task; during the training of PCSFTr, it served as the visual feature extractor and its parameters were not updated.
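This schedule maps directly onto standard PyTorch components. The following sketch reproduces it under the assumption that the pre-trained ResNet and the captioning network are separate modules; the module placeholders and names are illustrative only.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholders for the two stages: `resnet_extractor` stands in for the
# pre-trained ResNet backbone, `pcsftr` for the encoder-decoder trained in stage 2.
resnet_extractor = torch.nn.Conv2d(3, 512, kernel_size=3)
pcsftr = torch.nn.Linear(512, 512)

# Freeze the visual feature extractor: its parameters are not updated.
for p in resnet_extractor.parameters():
    p.requires_grad = False

optimizer = Adam(pcsftr.parameters(), lr=3e-5)          # initial learning rate 3e-5
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)   # decay by 0.8 every 3 epochs

for epoch in range(20):                                  # 20 epochs, batch size 8
    # ... one training pass over the data loader would go here ...
    scheduler.step()
```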
4.4. Ablation Experiments
To verify the effectiveness of the PCMSA block and the FF block, we conducted ablation experiments on four datasets, the results of which are shown in Table 1, Table 2, Table 3 and Table 4.
In these experiments, the model constructed with the visual feature extractor and the decoder layer is referred to as the Baseline, as shown in Figure 5a. After adding the encoder layer to the Baseline, the model using only the PMSA block is referred to as Baseline_p, as shown in Figure 5b, and the model using only the CMSA block is referred to as Baseline_c, as shown in Figure 5c. When the PMSA and CMSA blocks are placed in parallel, the model summing their outputs element-wise is referred to as Baseline_p+c, as shown in Figure 5d; the model that fuses their outputs through the FF block is the PCSFTr. Additionally, the highest evaluation metrics in this section are highlighted in bold.
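For clarity, the sketch below contrasts the two fusion strategies compared in this ablation: element-wise summation (Baseline_p+c) versus a channel-attention-based fusion standing in for the FF block. The FF stand-in is a generic squeeze-and-excitation-style gate and only approximates the general idea; it is not the exact block used in PCSFTr.

```python
import torch
import torch.nn as nn

def fuse_by_sum(pos_out, chn_out):
    """Baseline_p+c: merge the two dependency maps by element-wise summation."""
    return pos_out + chn_out

class ChannelAttentionFusion(nn.Module):
    """Illustrative stand-in for the FF block: re-weight the channels of the
    concatenated positional/channel features, then project back to model width."""

    def __init__(self, dim=512, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim // reduction), nn.ReLU(),
            nn.Linear(2 * dim // reduction, 2 * dim), nn.Sigmoid(),
        )
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, pos_out, chn_out):
        x = torch.cat([pos_out, chn_out], dim=-1)           # (B, N, 2C)
        weights = self.gate(x.mean(dim=1, keepdim=True))     # global channel weights
        return self.proj(x * weights)                        # (B, N, C)

pos_out, chn_out = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
print(fuse_by_sum(pos_out, chn_out).shape)                   # torch.Size([2, 49, 512])
print(ChannelAttentionFusion()(pos_out, chn_out).shape)      # torch.Size([2, 49, 512])
```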
The ablation experiment results shown in Table 1, Table 2 and Table 4 indicate that integrating the PMSA or CMSA block into the Baseline leads to performance improvements in most evaluation metrics for Baseline_p and Baseline_c. This indicates that learning the dependencies between different positions or channels in an image can effectively extract semantic information, thereby improving the model’s ability to understand images.
Although Baseline_p and Baseline_c show a slight decrease in some metrics compared to the Baseline in Table 3, Baseline_p+c demonstrates performance improvements in all tables. This suggests that fusing different dependencies can further enrich the semantic information contained in visual features and improve the model’s understanding of complex semantic relationships in images, even when the fusion method is merely a simple element-wise summation.
More importantly, the ablation experiment results show that PCSFTr performs best on all four datasets, which further emphasizes the significant role of the FF block in enhancing semantic information and understanding semantic relations.
4.5. Comparison Experiments
In this section, we validate our model on four datasets and compare it with various models, including Soft-attention and Hard-attention [5], CSMLF [45], Attribute-attention [12], Structured-attention [16], GVFGA-LSGA [22], MLCA-Net [27], GLCM [34], SCAMET [20], Topic-guided [17], Deformable-T [35], and BITA [36].
Soft-attention and hard-attention employ VGG16 as the encoder, integrating soft or hard attention with LSTM as the decoder. Due to its dynamic weight allocation, the soft-attention method is chosen for comparison.
CSMLF maps GloVe-based sentence embedding features and CNN-based visual features into the same semantic space, achieving multi-sentence descriptions for remote-sensing images.
Attribute-attention uses VGG-16 as the encoder and employs an attribute attention mechanism to selectively extract object and attribute information from the high-level convolutional features for sentence generation.
Structured attention’s main framework is ResNet50 + LSTM, combined with the structured-attention block to extract structural features from high-resolution images.
GVFGA-LSGA proposes GVFGA and LSGA mechanisms based on the attention mechanism to filter out redundant and irrelevant information in the fused global-local image features and the fused visual–textual feature, respectively.
MLCA-Net utilizes multi-level attention blocks to adaptively aggregate visual features and introduces contextual attention blocks to explore the correlation between words and different regions of the image.
GLCM enhances the overall visual correlation between words and images by acquiring the global visual features of the image, and improves the recognition of words through local visual features.
SCAMET constructs a multi-attention encoder based on CNN-output visual features and uses a transformer model as the decoder, replacing the traditional RNN-like decoders.
Topic-guided uses a full transformer framework, employing ViT as the encoder to generate Topic Tokens and a transformer-like network as the decoder to fuse multi-modal features.
Deformable-T, inspired by “selective search” in Structured attention, designs a novel transformer framework to mine foreground, background, and raw information from the image and perform feature interactions.
BITA connects pre-trained image encoders with a large language model (LLM) through an interactive Fourier transformer (IFT) and aligns visual cues and text features using image–text contrastive learning (ITC).
In Table 5, PCSFTr outperforms the comparison models in most evaluation metrics, especially the CIDEr metric, which is 8.7% higher than that of Topic-guided. This highlights the advantages of PCSFTr in capturing image details and generating high-quality descriptive text.
In Table 6, PCSFTr shows a significant gap compared to Deformable-T, particularly in the CIDEr metric, where the latter leads by approximately 40%. This discrepancy does not appear in the other three datasets. The likely reason lies in the difference in image size during data pre-processing and visual feature extraction. The visual features extracted from Sydney-caption have dimensions of (12, 12, 512), whereas those of the other datasets have dimensions of (7, 7, 512), increasing the sequence length from 49 to 144, nearly threefold, while keeping the number of channels constant. This change significantly increases the complexity of establishing dependencies in the spatial domain, potentially affecting the model’s understanding of semantic information in the images.
In Table 7, PCSFTr slightly outperforms other models in metrics such as BLEU-2 to BLEU-4, demonstrating its effectiveness in constructing longer word sequences.
In Table 8, PCSFTr exhibits outstanding performance in several evaluation metrics. Particularly in the CIDEr metric, PCSFTr improves by 6.2% compared to BITA, highlighting its ability to generate semantically relevant and diverse content. However, in other metrics, PCSFTr performs slightly worse than BITA, indicating directions for further research and optimization.
4.6. Visualization
Figure 6 presents example sentences generated by PCSFTr across four datasets. Two scenes are shown for each dataset, with correct, incorrect, and novel words highlighted in different colors.
Specifically, words correctly expressing semantic information are marked in blue. For example, in Figure 6a, first row, “tennis court” and “road”; in Figure 6b, second row, “green bushes”, “meadow”, and “highway”; in Figure 6c, first row, “white stadium”; and in Figure 6d, second row, “wetland”, “bare land”, and “green plants” accurately describe the objects and their attributes in the images.
Words correctly expressing semantic relationships are marked in green. For instance, in Figure 6a, second row, “compose of”; in Figure 6b, first row, “arranged neatly” and “divided into rectangles by”; in Figure 6c, second row, “near”; and in Figure 6d, first row, “next to” accurately describe the relationships between objects in the images.
These words also exist in the ground truth, indicating that the sentences predicted by PCSFTr are consistent with the ground truth as a whole.
Novel words are marked in purple, such as “white bunkers” in Figure 6b, second row, which does not appear in the ground truth but is present in the image. Similarly, “football field” in Figure 6c, first row is more accurate than the ground truth’s “ground for the green” in conveying the same semantic information.
Unpredictable or mispredicted words are marked in red. For example, “small” in the Figure 6a first-row ground truth and “scattered” and “houses” in the Figure 6b second-row ground truth were not predicted by PCSFTr. Additionally, “a swimming pool” in Figure 6c, second row correctly describes the swimming pool in the image but incorrectly describes the quantity. Therefore, these words are marked in red.
This may be due to the weak ability of PCSFTr to handle multi-scale problems. PCSFTr only uses the output tensor of the last stage of ResNet as the input to the encoder layer, and the receptive field of the convolution kernels at this stage is large, which makes it difficult to effectively extract the features of small objects and results in description errors.
5. Discussion and Conclusions
In this paper, we introduced the PCSFTr model, a novel approach to remote-sensing image captioning that leverages both positional and channel attention mechanisms to enhance semantic understanding of images. Through comparative analysis, PCSFTr demonstrated superior performance over existing methods. Specifically, compared with the BITA model, PCSFTr improves the BLEU-4 score by 6.55% on UCM-caption, 4.06% on RSICD, and 1.41% on NWPU-captions. Notably, PCSFTr showed remarkable improvements in capturing the complex semantics of urban landscapes and mixed terrain, which are often challenging for traditional models.
However, while PCSFTr offers substantial advancements, it also encounters limitations in handling images with extreme scale variations, where traditional feature extraction methods may still hold an edge. Such limitations underscore the necessity for ongoing refinement of the feature fusion techniques to better adapt to diverse and challenging environmental conditions.
Looking forward, our research will focus on addressing these challenges by optimizing the model’s architecture to handle multi-scale information more effectively.