1. Introduction
With the advancement of remote-sensing technology, obtaining high-resolution remote-sensing images has become increasingly accessible, providing abundant data resources for research in the field of Earth observation. Traditional computer vision tasks based on remote-sensing images, such as object detection [1], semantic segmentation [2], and scene classification [3], typically focus on analyzing the local semantic information of objects in the image or summarizing the global semantic information of the entire scene. In contrast, remote-sensing image captioning (RSIC) [4,5] aims to capture both local and global semantic information within images, understand the semantic relationships among local semantic elements and between local and global semantic information, and finally “translate” them into descriptive sentences that conform to the grammatical rules of natural language.
RSIC views the description generation process as a form of machine translation [6] and therefore adopts the encoder–decoder framework that is standard in machine translation. Convolutional neural networks (CNNs) [7] are used as the encoder to extract visual features, while the decoder employs language models such as recurrent neural networks (RNNs) [8], long short-term memory networks (LSTMs) [9], or transformers [10] to generate text.
To improve the performance and interpretability of models, some studies [11,12,13] have attempted to integrate attention mechanisms into the encoder–decoder framework to achieve semantic alignment between text and image. On this basis, additional tasks are added to obtain auxiliary information, further enhancing the model’s ability to understand semantic information and relationships within the images. We refer to this as a multi-stage method [14,15,16,17]. Zhao et al. [16] proposed a structured-attention block that leverages the structured characteristics of semantic content in remote-sensing images to perform semantic segmentation; masks are then created from the segmented content to guide attention to specific image regions during description generation. Ren et al. [17] replaced the original transformer encoder with the pre-trained vision transformer (ViT) [18], using multi-head self-attention (MSA) to explore the dependencies between different patches in the patch sequence, thereby enabling the model to learn both local and global semantic information from the images.
The above work mainly processes image features from the perspective of the spatial domain to explore semantic information and relationships in images. However, challenges remain when the same object type exhibits significant differences across scenes.
Figure 1 shows that the same type of object—“house”—exhibits significant differences in quantity, structure, layout, and relationship with the surrounding environment in different scenes. This indicates that solely analyzing images from the spatial domain may not be sufficient to capture all necessary semantic information. Therefore, it is necessary to integrate other methods to more comprehensively understand and describe the complex semantic relationships in remote-sensing images.
Previous works [19,20,21] have demonstrated that each channel of a feature map corresponds to different semantic information. Therefore, we approach the problem from both the spatial and channel domains, learning dependencies between different positions while also exploring dependencies between different channels. Once these two kinds of dependencies are obtained, effectively utilizing this information to enhance the model’s understanding of complex semantic relationships in images becomes a new challenge. To address this issue, existing works [22,23,24] often employ traditional fusion techniques, such as channel concatenation or element-wise summation. However, these methods may be insufficient to fully capture and utilize the complex relationships between spatial and channel domains. Particularly when dealing with remote-sensing images, which are highly complex and rich in detail, simple fusion techniques often lack the flexibility to adapt to semantic variations across different scenes.
To address the above issues, we propose a model for RSIC called the positional-channel semantic fusion transformer (PCSFTr). PCSFTr is constructed with a multi-stage method: in the first stage, the model is trained on a scene classification task to obtain the visual feature extractor, thereby preliminarily learning image semantic information and producing visual features. To further explore the semantic information in images, we introduce the positional-channel multi-head self-attention (PCMSA) block. PCMSA learns the dependencies within visual features from both spatial and channel perspectives. A feature fusion (FF) block based on the channel attention mechanism is then introduced to fuse these two kinds of dependencies, further enhancing the model’s understanding of image semantic information and relationships.
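To make the idea concrete, the following minimal PyTorch sketch illustrates one plausible form of the PCMSA block: positional multi-head self-attention (PMSA) treats the spatial positions of the feature map as tokens, while channel multi-head self-attention (CMSA) transposes the features so that channels become tokens. The dimensions, head counts, and module names are illustrative assumptions rather than the exact implementation described later in the paper.

```python
import torch
import torch.nn as nn

class PCMSA(nn.Module):
    """Illustrative positional-channel multi-head self-attention block.

    PMSA: tokens = spatial positions (H*W), so attention models
          dependencies between positions.
    CMSA: tokens = channels (C), obtained by transposing the feature
          map, so attention models dependencies between channels.
    """

    def __init__(self, dim=512, num_positions=49, num_heads=8):
        super().__init__()
        # Positional branch: embedding dimension is the channel size.
        self.pmsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel branch: embedding dimension is the number of positions.
        self.cmsa = nn.MultiheadAttention(num_positions, 7, batch_first=True)

    def forward(self, x):
        # x: (B, N, C) visual features, e.g. N = 7*7 = 49, C = 512
        pos_out, _ = self.pmsa(x, x, x)        # (B, N, C), positional dependencies
        xc = x.transpose(1, 2)                 # (B, C, N), channels as tokens
        chn_out, _ = self.cmsa(xc, xc, xc)     # (B, C, N), channel dependencies
        chn_out = chn_out.transpose(1, 2)      # back to (B, N, C)
        return pos_out, chn_out


if __name__ == "__main__":
    feats = torch.randn(2, 49, 512)            # flattened ResNet feature map
    p, c = PCMSA()(feats)
    print(p.shape, c.shape)                    # torch.Size([2, 49, 512]) twice
```

In this sketch, the two branches run in parallel on the same input and return two dependency-enhanced feature sequences, which the FF block is then responsible for fusing.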
The main contributions of this paper are as follows:
We propose a novel RSIC model, PCSFTr, which can learn image semantic information from both spatial and channel domains and fuse them to understand semantic relationships in images.
To enhance the extraction of semantic information, we introduce a PCMSA block, designed to concurrently learn dependencies across both positional and channel dimensions. This dual-domain approach allows the model to detect subtle nuances often missed by conventional methods, thereby enabling a deeper and more comprehensive analysis of complex scenes.
We introduce an FF block that can integrate dependencies to enhance semantic understanding and capture complex image relationships. This block refines traditional fusion methods by employing both global and local channel attention strategies, effectively elucidating the relationships between objects and attributes, thus enhancing description accuracy and reliability.
Experiments were conducted on four RSIC datasets and compared with several state-of-the-art methods. The results indicate that PCSFTr achieved significant improvements across all four datasets.
The remainder of the paper is organized as follows: Section 2 provides a review of related work, summarizing previous developments and identifying gaps that our study aims to fill. Section 3 describes the methodology, detailing the design and implementation of our proposed positional-channel semantic fusion transformer (PCSFTr). Section 4 reports on experiments conducted to evaluate the effectiveness of our model across four standard datasets, followed by a comparison with existing approaches. Finally, Section 5 concludes the paper with a summary of our findings and a discussion of potential avenues for future research.
4. Experiments
This section presents the experimental evaluation of the PCSFTr across various datasets and scenarios to assess its performance. It includes descriptions of the datasets, evaluation metrics, experimental setup, ablation studies, comparative analyses, and visualizations to demonstrate the model’s effectiveness in remote-sensing image captioning.
4.1. Datasets
In the field of RSIC, three widely used datasets are UCM-caption [4], Sydney-caption [4], and RSICD [5]. Recently, a new dataset, NWPU-captions [27], has been proposed. This dataset has more images, more comprehensive ground scenes, and more complex and diverse sentences than the previous three datasets.
UCM-caption: Proposed by Qu et al. [4], based on the UC Merced land-use dataset [38]. It consists of 2100 images of size 256 × 256, covering 21 scenes, each with 100 images. Each image has 5 reference sentences, with a vocabulary of 285 words.
Sydney-caption: Proposed by Qu et al. [4], comprising 613 images of size 500 × 500, covering 7 scenes. Each image has 5 reference sentences, with a vocabulary of 169 words.
RSICD: Proposed by Lu et al. [5], consisting of 10,921 images of size 224 × 224, covering 30 scenes. Each image has up to 5 reference sentences, with a vocabulary of 2632 words.
NWPU-captions: Proposed by Cheng et al. [27], based on the NWPU-RESISC45 dataset [39]. It comprises 31,500 images of size 256 × 256, covering 45 scenes, each with 700 images. Each image has 5 reference sentences, with a vocabulary of 3149 words, the largest among the four datasets.
All four datasets are divided into training, validation, and test sets in a ratio of 8:1:1.
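For reproducibility, a split in this ratio can be generated as in the minimal sketch below; the random seed and the use of plain image indices are assumptions for illustration and do not correspond to any official split files.

```python
import random

def split_8_1_1(items, seed=42):
    """Shuffle a list of image IDs and split it into train/val/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train, n_val = int(0.8 * len(items)), int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_ids, val_ids, test_ids = split_8_1_1(range(2100))  # e.g., UCM-caption has 2100 images
print(len(train_ids), len(val_ids), len(test_ids))        # 1680 210 210
```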
4.2. Evaluation Indicators
In RSIC, the following five metrics are commonly used to evaluate the generated sentences: BLEU [40], METEOR [41], ROUGE-L [42], CIDEr [43], and SPICE [44]. BLEU, METEOR, ROUGE-L, and SPICE range from 0 to 1, while CIDEr ranges from 0 to 5.
BLEU: A metric used to evaluate machine translation quality. BLEU measures the similarity between the generated and reference sentences based on n-gram matching principles, thereby assessing the accuracy of the generated sentences. N-grams are sequences of n consecutive words, with n ranging from 1 to 4.
METEOR: A metric for evaluating machine translation quality. It uses external semantic resources like WordNet to achieve precise word matching, considering stems, prefixes, synonyms, and word order, thus measuring the similarity between the generated and reference sentences.
ROUGE-L: A metric for evaluating text summarization quality, where L denotes the longest common subsequence (LCS). It calculates recall, precision, and F1 score based on the length of the longest common subsequence between the generated and reference sentences.
CIDEr: A metric designed explicitly for image captioning. It treats each sentence as a document and computes its n-gram term frequency-inverse document frequency (TF-IDF) vector. TF-IDF weights each n-gram, and cosine similarity measures semantic consistency between the generated description and the reference sentences.
SPICE: Another metric designed for image description. Unlike the above metrics, SPICE is based on graph structures, encoding objects, attributes, and their relationships in the description, mapping them into a scene graph, and computing F-scores for these components.
Among these metrics, BLEU, ROUGE-L, and METEOR are borrowed from other text-generation tasks. They assess model performance by evaluating the similarity between generated and reference sentences but cannot adequately measure the model’s ability to extract and utilize semantic information. CIDEr and SPICE, however, are tailored for image description and provide evaluations closer to human judgment.
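For reference, BLEU-1 through BLEU-4 can be reproduced with off-the-shelf tooling; the short sketch below uses NLTK's corpus-level BLEU implementation (METEOR, ROUGE-L, CIDEr, and SPICE are typically computed with the coco-caption evaluation toolkit). The captions shown are invented examples, not sentences from the datasets.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption and its reference captions, all tokenized.
hypotheses = ["many buildings and green trees are in a dense residential area".split()]
references = [[
    "many buildings and some green trees are in a dense residential area".split(),
    "lots of houses arranged neatly with some green trees around".split(),
]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```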
4.3. Experimental Setup
All experiments were conducted on an NVIDIA GeForce RTX 2080 Ti device. In these experiments, the images from the UCM-caption, RSICD, and NWPU-captions datasets were resized to 224 × 224 pixels, while the images from the Sydney-caption dataset were resized to 384 × 384 pixels. During the training phase, the batch size of PCSFTr was set to 8, the initial learning rate was 3 × 10⁻⁵ with a decay of 0.8 every 3 epochs, and the total number of epochs was 20. During the inference phase, beam search was used with a beam size of 3. In this study, ResNet was trained on the remote-sensing image scene classification task; during the training of PCSFTr, it served as the visual feature extractor and its parameters were not updated.
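This schedule maps directly onto standard PyTorch components. The following sketch reproduces it under the assumption that the pre-trained ResNet and the captioning network are separate modules; the module placeholders and names are illustrative only.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Placeholders for the two stages: `resnet_extractor` stands in for the
# pre-trained ResNet backbone, `pcsftr` for the encoder-decoder trained in stage 2.
resnet_extractor = torch.nn.Conv2d(3, 512, kernel_size=3)
pcsftr = torch.nn.Linear(512, 512)

# Freeze the visual feature extractor: its parameters are not updated.
for p in resnet_extractor.parameters():
    p.requires_grad = False

optimizer = Adam(pcsftr.parameters(), lr=3e-5)          # initial learning rate 3e-5
scheduler = StepLR(optimizer, step_size=3, gamma=0.8)   # decay by 0.8 every 3 epochs

for epoch in range(20):                                  # 20 epochs, batch size 8
    # ... one training pass over the data loader would go here ...
    scheduler.step()
```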
4.4. Ablation Experiments
To verify the effectiveness of the PCMSA block and the FF block, we conducted ablation experiments on four datasets, the results of which are shown in Table 1, Table 2, Table 3 and Table 4.
In these experiments, the model constructed with the visual feature extractor and the decoder layer is referred to as the Baseline, as shown in Figure 5a. After adding the encoder layer to the Baseline, the model using only the PMSA block is referred to as Baseline_p, as shown in Figure 5b, and the model using only the CMSA block is referred to as Baseline_c, as shown in Figure 5c. When the PMSA and CMSA blocks are placed in parallel, the model summing their outputs element-wise is referred to as Baseline_p+c, as shown in Figure 5d; the model that fuses their outputs through the FF block is the PCSFTr. Additionally, the highest evaluation metrics in this section are highlighted in bold.
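For clarity, the sketch below contrasts the two fusion strategies compared in this ablation: element-wise summation (Baseline_p+c) versus a channel-attention-based fusion standing in for the FF block. The FF stand-in is a generic squeeze-and-excitation-style gate and only approximates the general idea; it is not the exact block used in PCSFTr.

```python
import torch
import torch.nn as nn

def fuse_by_sum(pos_out, chn_out):
    """Baseline_p+c: merge the two dependency maps by element-wise summation."""
    return pos_out + chn_out

class ChannelAttentionFusion(nn.Module):
    """Illustrative stand-in for the FF block: re-weight the channels of the
    concatenated positional/channel features, then project back to model width."""

    def __init__(self, dim=512, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim // reduction), nn.ReLU(),
            nn.Linear(2 * dim // reduction, 2 * dim), nn.Sigmoid(),
        )
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, pos_out, chn_out):
        x = torch.cat([pos_out, chn_out], dim=-1)           # (B, N, 2C)
        weights = self.gate(x.mean(dim=1, keepdim=True))     # global channel weights
        return self.proj(x * weights)                        # (B, N, C)

pos_out, chn_out = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
print(fuse_by_sum(pos_out, chn_out).shape)                   # torch.Size([2, 49, 512])
print(ChannelAttentionFusion()(pos_out, chn_out).shape)      # torch.Size([2, 49, 512])
```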
The ablation experiment results shown in Table 1, Table 2 and Table 4 indicate that integrating the PMSA or CMSA block into the Baseline leads to performance improvements in most evaluation metrics for Baseline_p and Baseline_c. This indicates that learning the dependencies between different positions or channels in an image can effectively extract semantic information, thereby improving the model’s ability to understand images.
Although Baseline_p and Baseline_c show a slight decrease in some metrics compared to the Baseline in Table 3, Baseline_p+c demonstrates performance improvements in all tables. This suggests that fusing different dependencies can further enrich the semantic information contained in visual features and improve the model’s understanding of complex semantic relationships in images, even when the fusion method is merely a simple element-wise summation.
More importantly, the ablation experiment results show that PCSFTr performs best on all four datasets, which further emphasizes the significant role of the FF block in enhancing semantic information and understanding semantic relations.
4.5. Comparison Experiments
In this section, we validate our model on four datasets and compare it with various models, including Soft-attention and Hard-attention [5], CSMLF [45], Attribute-attention [12], Structured-attention [16], GVFGA-LSGA [22], MLCA-Net [27], GLCM [34], SCAMET [20], Topic-guided [17], Deformable-T [35], and BITA [36].
Soft-attention and hard-attention employ VGG16 as the encoder, integrating soft or hard attention with LSTM as the decoder. Due to its dynamic weight allocation, the soft-attention method is chosen for comparison.
CSMLF maps GloVe-based sentence embedding features and CNN-based visual features into the same semantic space, achieving multi-sentence descriptions for remote-sensing images.
Attribute-attention uses VGG-16 as the encoder and employs an attribute attention mechanism to selectively extract object and attribute information from the high-level convolutional features for sentence generation.
Structured attention’s main framework is ResNet50 + LSTM, combined with the structured-attention block to extract structural features from high-resolution images.
GVFGA-LSGA proposes GVFGA and LSGA mechanisms based on the attention mechanism to filter out redundant and irrelevant information in the fused global-local image features and the fused visual–textual feature, respectively.
MLCA-Net utilizes multi-level attention blocks to adaptively aggregate visual features and introduces contextual attention blocks to explore the correlation between words and different regions of the image.
GLCM enhances the overall visual correlation between words and images by acquiring the global visual features of the image, and improves the recognition of words through local visual features.
SCAMET constructs a multi-attention encoder based on CNN-output visual features and uses a transformer model as the decoder, replacing the traditional RNN-like decoders.
Topic-guided uses a full transformer framework, employing ViT as the encoder to generate Topic Tokens and a transformer-like network as the decoder to fuse multi-modal features.
Deformable-T, inspired by “selective search” in Structured attention, designs a novel transformer framework to mine foreground, background, and raw information from the image and perform feature interactions.
BITA connects pre-trained image encoders with a large language model (LLM) through an interactive Fourier transformer (IFT) and aligns visual cues and text features using image–text contrastive learning (ITC).
In Table 5, PCSFTr outperforms the comparison models in most evaluation metrics, especially the CIDEr metric, which is 8.7% higher than that of Topic-guided. This highlights the advantages of PCSFTr in capturing image details and generating high-quality descriptive text.
In Table 6, PCSFTr shows a significant gap compared to Deformable-T, particularly in the CIDEr metric, where the latter leads by approximately 40%. This discrepancy does not appear in the other three datasets. The likely reason lies in the difference in image size during data pre-processing and visual feature extraction. The visual features extracted from Sydney-caption have dimensions of (12, 12, 512), whereas those of the other datasets have dimensions of (7, 7, 512), increasing the sequence length from 49 to 144, nearly threefold, while keeping the number of channels constant. This change significantly increases the complexity of establishing dependencies in the spatial domain, potentially affecting the model’s understanding of semantic information in the images.
In Table 7, PCSFTr slightly outperforms other models in metrics such as BLEU-2 to BLEU-4, demonstrating its effectiveness in constructing longer word sequences.
In Table 8, PCSFTr exhibits outstanding performance in several evaluation metrics. Particularly in the CIDEr metric, PCSFTr improves by 6.2% compared to BITA, highlighting its ability to generate semantically relevant and diverse content. However, in other metrics, PCSFTr performs slightly worse than BITA, indicating directions for further research and optimization.
4.6. Visualization
Figure 6 presents example sentences generated by PCSFTr across four datasets. Two scenes are shown for each dataset, with correct, incorrect, and novel words highlighted in different colors.
Specifically, words correctly expressing semantic information are marked in blue. For example, in Figure 6a, first row, “tennis court” and “road”; in Figure 6b, second row, “green bushes”, “meadow”, and “highway”; in Figure 6c, first row, “white stadium”; and in Figure 6d, second row, “wetland”, “bare land”, and “green plants” accurately describe the objects and their attributes in the images.
Words correctly expressing semantic relationships are marked in green. For instance, in Figure 6a, second row, “compose of”; in Figure 6b, first row, “arranged neatly” and “divided into rectangles by”; in Figure 6c, second row, “near”; and in Figure 6d, first row, “next to” accurately describe the relationships between objects in the images.
These words also exist in the ground truth, indicating that the sentences predicted by PCSFTr are consistent with the ground truth as a whole.
Novel words are marked in purple, such as “white bunkers” in Figure 6b, second row, which does not appear in the ground truth but is present in the image. Similarly, “football field” in Figure 6c, first row is more accurate than the ground truth’s “ground for the green” in conveying the same semantic information.
Unpredictable or mispredicted words are marked in red. For example, “small” in the Figure 6a first-row ground truth and “scattered” and “houses” in the Figure 6b second-row ground truth were not predicted by PCSFTr. Additionally, “a swimming pool” in Figure 6c, second row correctly describes the swimming pool in the image but incorrectly describes the quantity. Therefore, these words are marked in red.
This may be due to the weak ability of PCSFTr to handle multi-scale problems. PCSFTr only uses the output tensor of the last stage of ResNet as the input to the encoder layer, and the receptive field of the convolution kernels at this stage is large, which makes it difficult to effectively extract the features of small objects and results in description errors.
5. Discussion and Conclusions
In this paper, we introduced the PCSFTr model, a novel approach to remote-sensing image captioning that leverages both positional and channel attention mechanisms to enhance semantic understanding of images. Through comparative analysis, PCSFTr demonstrated superior performance over existing methods. Specifically, compared with the BITA model, PCSFTr improves the BLEU-4 score by 6.55% on UCM-caption, 4.06% on RSICD, and 1.41% on NWPU-captions. Notably, PCSFTr showed remarkable improvements in capturing the complex semantics of urban landscapes and mixed terrain, which are often challenging for traditional models.
However, while PCSFTr offers substantial advancements, it also encounters limitations in handling images with extreme scale variations, where traditional feature extraction methods may still hold an edge. Such limitations underscore the necessity for ongoing refinement of the feature fusion techniques to better adapt to diverse and challenging environmental conditions.
Looking forward, our research will focus on addressing these challenges by optimizing the model’s architecture to handle multi-scale information more effectively.