Article

Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

1 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
2 National Engineering Laboratory of E-Government Modeling Simulation, Harbin Engineering University, Harbin 150001, China
3 Monash Biomedical Imaging, Australia and National Imaging Facility, Monash University, Victoria 3800, Australia
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4682; https://doi.org/10.3390/rs15194682
Submission received: 14 July 2023 / Revised: 19 September 2023 / Accepted: 22 September 2023 / Published: 24 September 2023

Abstract

Remote-sensing visual question answering (RSVQA) aims to answer natural-language questions about remote sensing images by leveraging both visual and textual information during inference. However, most existing methods overlook the interaction between visual and language features: they typically adopt simple feature fusion strategies, fail to adequately model cross-modal attention, and struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we utilize the vision transformer (VIT) and BERT to extract visual and language features, respectively. Furthermore, we incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, as well as the modality experts, we effectively capture the intricate interactions between visual and language features and better focus on their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that our proposed method surpasses the current state-of-the-art (SOTA) techniques. Additionally, we perform an extensive analysis to validate the effectiveness of the different components of our framework.

1. Introduction

In recent decades, remote sensing technology has emerged as a powerful instrument for acquiring and analyzing vast quantities of spatial data derived from the Earth’s surface [1,2]. With the relentless progression of sensor technology and the pervasive use of high-resolution satellite imagery, remote sensing has become an indispensable source of information in various domains, including natural resource management, environmental protection, urban planning, and disaster monitoring [3,4,5,6,7]. Remote sensing images provide valuable visual information about the Earth’s surface, which has led to the exploration of numerous research directions and applications, such as remote sensing scene classification [8,9], object detection [10,11,12], semantic segmentation [13,14], and image classification [15,16,17]. However, the aforementioned tasks only require perceptual processing of remote sensing images to extract specific task-related information, such as image categories, object locations, and scene categories; they do not necessitate a comprehensive understanding of, and reasoning over, the entire image.
To achieve a comprehensive understanding of remote sensing information, researchers have actively explored the integration of natural language processing and computer vision into the realm of remote sensing. This integration has been applied in tasks such as image–text retrieval [18,19,20,21], image captioning [22,23,24], and visual question answering (VQA) [25,26,27]. As a subfield of VQA, RSVQA aims to integrate remote sensing images with textual questions and ultimately generate accurate answers. As shown in Figure 1, given a remote sensing image and the question “Is a building present?”, the objective is to produce the correct answer “Yes”. Compared to other remote sensing image analysis tasks, RSVQA poses greater challenges, as it requires models to handle multiple modalities, including visual and textual information. RSVQA has diverse and extensive potential applications. It can strengthen environmental monitoring by allowing researchers to inquire about land cover changes, vegetation health, and deforestation patterns. It can support urban planning and infrastructure management by enabling city officials to query information related to building density, road networks, and land usage. Furthermore, RSVQA can aid disaster management by facilitating real-time analysis of satellite images and assisting in emergency response efforts.
Despite the significant potential of RSVQA, the research in this field is still in its early stages, with limited relevant literature. A typical VQA model usually consists of four steps: extraction of visual representations, extraction of semantic representations of language, fusion of multimodal representations, and answer prediction. In the pioneering work of RSVQA, researchers proposed the first benchmark model [25] following this design. The model utilizes ResNet-152 [28] as the visual representation encoder, RNN as the language representation encoder, and concatenates the encoded visual representation and language representation into a single vector. Finally, an answer prediction is performed through the utilization of fully connected layers. Due to the presence of complex spatial and semantic information in remote sensing images, models need to possess advanced capabilities in spatial and semantic analysis. As a result, recent RSVQA models have primarily focused on learning visual feature extraction. Yuan et al. proposed a pioneering methodology for multi-level visual feature acquisition, wherein language-guided global and regional image features are concurrently extracted [27]. They progressively train the model using question–answer pairs of increasing difficulty. Bazi et al. introduced the CLIP model [29] into the RSVQA domain, using its fixed parameters as visual and language feature extractors [30]. They embedded image patches and question words into the feature representation sequence. Additionally, Zhang et al. [31] designed a spatial multiscale visual representation module based on hashing techniques, enabling the generation of visual object features containing appearance features and spatial information across multiple scales. This module enables the inference of visual and semantic relationships between different scales of geographic spatial objects in remote sensing images.
Undeniably, the aforementioned work has significantly contributed to the advancement of RSVQA research and highlighted the importance of visual feature extraction for RSVQA. However, most existing methods have not adequately emphasized the significance of the interaction between visual and language features. Specifically, the following limitations still exist:
  • Most existing methods utilize basic fusion strategies, like concatenation or element-wise operations, to combine visual and language features. However, these approaches inadequately capture the complex interactions between the two modalities, resulting in fused features with limited information richness;
  • The existing models do not adequately model cross-modal attention, whereas cross-modal attention mechanisms can assist the model in better attending to the correlation between visual and language features, enabling accurate determination of which features to focus on in specific questions and images;
  • The task of RSVQA requires the model to comprehend the semantic essence of the posed question and reason with the relevant content in the image to generate accurate answers. However, existing models may struggle to effectively model and capture the intricate semantic relationship between the question and the image.
In this study, we present our novel solution, namely the unified transformer with cross-modal mixture experts (TCMME) model, to address these limitations. The architecture of our proposed framework is illustrated in Figure 2. We utilize VIT [32] as the visual encoder to capture semantic information at different positions in the image, including visual features and spatial relationships. Simultaneously, we leverage BERT [33] as the language encoder to comprehend the meaning of the question. To achieve multimodal representation learning, we introduce the novel cross-modal mixture experts (CMME) module, which serves three purposes: (1) harnessing the self-attention mechanism embedded within the CMME module to fuse visual and language features, capturing the complex interaction between them, and generating fused features through modality experts; (2) applying cross-modal attention after each self-attention layer to enhance the focus on the correlation between visual and language features, effectively establishing connections between the semantics conveyed in the question and the relevant regions depicted in the image; and (3) effectively modeling and capturing the intricate semantic relationship between the question and the image by sharing the parameters of self-attention and cross-modal attention within the CMME module, thereby enhancing the model’s reasoning capability. Finally, the outputs of the modality experts, after cross-modal interaction, are concatenated and propagated through a fully connected layer for accurate answer prediction.
The main contributions of this paper can be summarized as follows:
  • We propose a novel TCMME model to address the RSVQA tasks;
  • We design the cross-modal mixture experts module to facilitate the interaction between visual and language features and to effectively model cross-modal attention, enabling the model to capture the intricate semantic relationship between the image and the question;
  • The experimental results unequivocally validate the efficacy of our proposed methodology. Compared to existing approaches, our model attains state-of-the-art (SOTA) performance on both the RSVQA-LR and RSVQA-HR datasets.
The present paper is organized into distinct sections. In Section 2, a comprehensive review is presented, thoroughly examining the current SOTA in the domains of VQA, RSVQA, and the advancements facilitated by the application of transformer models in the field of remote sensing. In Section 3, we present the detailed description of the proposed TCMME model. Section 4 presents the experimental results along with a discussion. Section 5 provides a brief summary of this study.

2. Related Work

This section provides a concise overview of the previous research literature on VQA, summarizes the latest advancements in the RSVQA domain, and presents the current research status of transformer models in the remote sensing field.

2.1. Visual Question Answering

In the realm of multimodal tasks, the effective extraction and utilization of features from disparate modalities have continually captivated the attention of researchers. VQA epitomizes the convergence of the visual modality and the natural language modality. Since the pioneering work in VQA was introduced [34], it has remained a prominent and highly active research field. Early VQA approaches commonly utilized popular convolutional neural networks (e.g., ResNet [28], VGGNet [35], and GoogleNet [36]), as well as recurrent neural networks variants (e.g., LSTM [37] and GRU [38]), to extract visual and language features, respectively, which were then fused together using simple mechanisms like concatenation and pooling. Subsequently, researchers introduced more expressive and complex feature fusion strategies based on bilinear techniques, such as multimodal compact bilinear pooling (MCB) [39], multimodal factorized bilinear pooling (MFB) [40], and multimodal low-rank bilinear pooling (MLB) [41].
Recently, attention mechanisms have emerged as a pivotal component in elevating the performance of VQA models, primarily by capturing semantic relationships between visual and textual information. Anderson et al. [42] pioneered the utilization of a bottom-up and top-down attention mechanism. This innovative technique enabled the model to effectively learn intricate image region features by leveraging the object level detection achieved through the Faster R-CNN [43]. Yang et al. introduced the stacked attention networks (SAN), which gradually searched relevant image regions based on the semantic representation of the question [44]. Kim et al. introduced the bilinear attention network (BAN), leveraging low-rank bilinear pooling to generate bilinear attention maps for the fusion of multimodal features [45]. Lu et al. presented a novel end-to-end composite relation attention network (CRA-Net) which extracts multiple relations guided by the corresponding question and effectively coordinates object and relation features, leading to significant improvements in VQA reasoning performance [46].
However, despite the fact that the attention mechanism is widely used in various models to learn visual and language features, there exist further opportunities for improvement. Most current approaches mainly focus on attending to the image based on the question while giving less consideration to the influence of the image on the question, thus lacking comprehensive cross-modal interaction modeling.

2.2. Remote-Sensing Visual Question Answering

While VQA has gained significant popularity in general domains, its application within the realm of remote sensing is still in its early stages. The pioneering work [25] established two large-scale benchmark datasets for RSVQA, RSVQA-LR and RSVQA-HR, whose question–answer pairs are automatically generated from OpenStreetMap data. Additionally, the authors present a baseline model for RSVQA, which employs ResNet-152 [28] for the extraction of image features and LSTM [37] for the extraction of language features. The extracted visual and language features are concatenated into a unified vector representation. This consolidated representation serves as the input to a fully connected layer, which plays a crucial role in accurately predicting the answer.
Subsequently, Yuan et al. devised a progressive VQA learning method that draws inspiration from the learning process observed in humans. The approach trains the model incrementally, starting from simple question–answer pairs and gradually transitioning to more challenging ones [27]. Zheng et al. introduced an improved method called MAIN, which aligns image features with text information using attention mechanisms and bilinear techniques to learn image features with textual information, thereby augmenting the joint feature representation in RSVQA [26]. On the other hand, Bazi et al. employed the pre-trained model CLIP [29] with fixed parameters as the feature extractors for both visual and textual modalities [30]. They incorporated image patches and question words into the feature representation sequence and captured cross-modal dependencies using the co-attention mechanism. Chappuis et al. [47] introduced the Prompt-RSVQA method, where visual information is translated into words and integrated into a pure language model, avoiding reliance on joint visual–textual representations. Considering the substantial variations observed in remote sensing images and geospatial objects with location-sensitive characteristics, Zhang et al. [31] proposed an innovative methodology termed the spatial hierarchy reasoning network (SHRNet). This approach employed a hash-based spatial multiscale visual representation module to encode multiscale visual features, effectively incorporating spatial position information. The SHRNet framework further leveraged spatial hierarchy reasoning to capture high-order internal group object relationships across multiple scales, thereby enhancing the model’s visual spatial reasoning capabilities.
A recent study [48] introduced a novel and significant task, wherein a VQA system was employed for change detection on multitemporal aerial images. The authors constructed a specific dataset, known as the change-detection visual question answering (CDVQA) dataset. To construct this dataset, they utilized an automatic question–answering generation method, which involves generating question–answer pairs corresponding to multitemporal images. Furthermore, they developed a baseline method to address the CDVQA task, which comprises four components: multitemporal feature encoding, multitemporal fusion, multimodal fusion, and answer prediction.

2.3. Transformers in Remote Sensing

The transformer architecture has recently emerged as a formidable paradigm, yielding remarkable accomplishments in natural language processing (NLP) and computer vision (CV), and its application has also extended to the field of remote sensing. Within the domain of remote sensing image classification, Bazi et al. conducted an in-depth exploration of the architecture’s influence on performance. They further addressed network compression by reducing the number of layers, marking the pioneering utilization of VIT [32] in this particular domain [49]. Deng et al. [50] proposed a two-stream framework that combines both CNN and transformer streams. To optimize the performance of the aforementioned two-stream framework, a joint loss function encompassing both cross-entropy and center loss was employed during training. Ma et al. introduced a transformer-based framework that integrates a patch generation module capable of generating both homogeneous and heterogeneous patches. Specifically, the patch generation module directly generates heterogeneous patches, while homogeneous patches are collected through the utilization of a superpixel segmentation technique [51].
Within the realm of remote-sensing image object detection, Xu et al. [52] harnessed the capabilities of both transformer and CNN architectures through the introduction of the local perception Swin Transformer (LPSW). This framework specifically focuses on augmenting the model’s local perception capability, thereby achieving superior detection performance. In [53], a framework leveraging transformers is introduced to discern relationships among sampled features, thereby improving grouping and bounding box predictions. Importantly, this framework obviates the necessity for further post-processing. Dai et al. [54] introduced AO2-DETR, a transformer-based object detector that incorporates an oriented proposal generation scheme. This scheme is specifically designed to generate oriented object proposals explicitly.
In the domain of remote-sensing image change detection, Chen et al. [55] introduced a novel bi-temporal image transformer. This approach leverages the power of transformers to effectively model spatiotemporal contextual information. The proposed method incorporates an encoder to capture context within a token-based spacetime representation. Subsequently, the decoder receives the contextualized tokens and performs feature refinement in the pixel-space. Wang et al. [56] presented UVACD, an innovative change detection architecture. UVACD combines the strengths of a convolutional neural network (CNN) backbone for high-level semantic feature extraction with transformers for capturing temporal information interaction. This fusion leads to enhanced generation of change features, enabling more accurate change detection. Ke et al. [57] proposed the Hybrid-TransCD, a hybrid multiscale transformer model designed specifically for change detection. This model leverages heterogeneous tokens and multiple receptive fields to effectively capture both fine-grained details and large object features in the change detection process. This comprehensive approach enables a more holistic understanding of change patterns within remote sensing images.
Within the domain of remote sensing image segmentation, Xu et al. [58] introduced a lightweight framework named Efficient-T, built upon the transformer architecture. Efficient-T incorporates an implicit technique for enhancing edges and employs a combination of a hierarchical Swin Transformer and an MLP head. The study conducted by [59] employed a pre-trained Swin Transformer backbone in combination with three decoder designs for the purpose of semantic segmentation in aerial images. Xiao et al. proposed STEB-UNet, which incorporates an encoding booster based on the Swin Transformer to capture semantic information from multi-level features derived from various scales [60].

3. Methodology

In this research endeavor, we introduce a pioneering and comprehensive framework, denoted the unified transformer with cross-modal mixture experts (TCMME), for RSVQA, as illustrated in Figure 2. The proposed approach leverages VIT, pre-trained on the ImageNet dataset, as the visual encoder to extract high-quality visual features. Similarly, BERT, a powerful transformer-based model, serves as the text encoder to extract informative language features. Subsequently, these visual and language features interact through the proposed cross-modal mixture experts, which serve as the multimodal fusion encoder. Finally, the interacted visual and language features are concatenated, and the resultant feature representation is processed through a fully connected layer for accurate answer prediction. The following sections provide a detailed explanation of the proposed TCMME method.

3.1. Problem Definition

The RSVQA problem can be considered as a classification task. Given a remote sensing dataset $D = \{(v_i, q_i, a_i)\}_{i=1}^{N}$, the task of RSVQA is to select the correct answer $a_i$ from a candidate set $A$ consisting of $d_A$ possible answers, based on the current input remote sensing image $v_i \in V$ and its corresponding question $q_i \in Q$, where $N$ is the total number of training samples in the dataset. The candidate answer can be obtained by optimizing the function $f_{\theta}(v_i, q_i)$:

$$f_{\theta}(v_i, q_i) = C_{\theta}\big(ME_{\theta}\big(VE_{V}(v_i), LE_{Q}(q_i)\big)\big),$$

where $VE_{V}$ is the visual encoder $V \rightarrow \mathbb{R}^{n_V \times d_V}$, $LE_{Q}$ is the language encoder $Q \rightarrow \mathbb{R}^{n_Q \times d_Q}$, $ME_{\theta}$ represents the multimodal fusion encoder, and $C_{\theta}$ denotes the answer prediction module. The goal of the training process is to learn the optimal model parameters $\theta$ by maximizing the log-likelihood of the correct answer $a_i$, which can be formulated as follows:

$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log P\big(a_i \mid f_{\theta}(v_i, q_i)\big).$$
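To make this formulation concrete, the following minimal PyTorch sketch implements one training step under the classification view: maximizing the log-likelihood of the correct answer is equivalent to minimizing cross-entropy over the $d_A$ candidate answers. The model interface and batch layout are illustrative assumptions, not the authors' released code.

```python
import torch.nn.functional as F

def rsvqa_training_step(model, images, questions, answer_ids, optimizer):
    """One optimization step for RSVQA treated as classification.

    `model` is assumed to map (images, questions) to unnormalized scores over
    the d_A candidate answers; minimizing cross-entropy on the ground-truth
    answer index maximizes log P(a_i | f_theta(v_i, q_i)).
    """
    logits = model(images, questions)           # (batch, d_A)
    loss = F.cross_entropy(logits, answer_ids)  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```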

3.2. Model Architecture

3.2.1. Visual Encoder

To effectively capture spatial information and long-range dependencies in the input images, we adopt VIT [32] as the visual encoder in our proposed method. We commence by initializing the VIT model with weights pre-trained on the ImageNet dataset [61] to facilitate better adaptation to new remote sensing data. In VIT, an input image $v_i \in \mathbb{R}^{H \times W \times C}$ is initially divided into a series of small patches $\{p_1, p_2, \ldots, p_N\} \in \mathbb{R}^{N \times (P^2 C)}$, where $H \times W$ is the image resolution, $C$ represents the number of channels, $P \times P$ represents the dimension of each individual patch, and $N = HW/P^2$ denotes the total number of patches. Subsequently, these patches are flattened, followed by a linear projection into patch embeddings utilizing a linear transformation $E_v \in \mathbb{R}^{(P^2 C) \times D}$. To provide global information within these embeddings, a learnable special token embedding $v_{[cls]} \in \mathbb{R}^{D}$ is added at the beginning of the embeddings, where $D$ represents the dimension of $v_{[cls]}$. Consequently, the input image representations are acquired by aggregating the patch embeddings together with the learnable 1D position embeddings denoted as $E_{pos}^{v} \in \mathbb{R}^{(N+1) \times D}$:

$$X_v = \big[v_{[cls]}, p_1 E_v, p_2 E_v, \ldots, p_N E_v\big] + E_{pos}^{v},$$

Then $X_v$ is fed into a transformer model, which consists of 12 transformer layers; each layer is equipped with a self-attention and a feed-forward layer. Within the transformer, the attention mechanism applied to the self-attention layer can be defined as follows:

$$\mathrm{SELFATT}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ represents the dimension of $K$. Finally, we obtain the contextualized visual representations:

$$H_v = \mathrm{FFN}\big(\mathrm{SELFATT}(X_v, X_v, X_v)\big) + X_v,$$

where $\mathrm{FFN}$ is the feed-forward layer and $X_v$ is the output of the previous transformer layer.
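For concreteness, the sketch below shows a PyTorch implementation of the patch-embedding step and one encoder layer corresponding to the equations above. The layer-normalization placement and hyperparameters (patch size 16, 12 heads, hidden size 768) are illustrative assumptions; in practice, the visual encoder is initialized from pre-trained VIT-B weights rather than built from scratch.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into P x P patches, projects them to D dimensions (E_v),
    and prepends the learnable [cls] token plus 1D position embeddings (X_v)."""
    def __init__(self, img_size=256, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A stride-P convolution is a standard way to realize the linear patch projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # v_[cls]
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos^v

    def forward(self, images):                               # images: (B, C, H, W)
        x = self.proj(images).flatten(2).transpose(1, 2)     # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # X_v: (B, N + 1, D)

class EncoderLayer(nn.Module):
    """One transformer layer: self-attention followed by a feed-forward network,
    each wrapped with a residual connection."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # SELFATT(X_v, X_v, X_v) + residual
        return x + self.ffn(self.norm2(x))   # FFN + residual -> H_v
```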

3.2.2. Language Encoder

In the language encoder module, we utilize the WordPiece [62] tokenization technique to segment the input question into subword tokens denoted as $\{w_1, w_2, \ldots, w_M\}$, and following BERT [33], $M$ denotes the length of the tokenized text sequence. Each token $w_i \in \mathbb{R}^{|V|}$ is encoded in a one-hot format, where $|V|$ represents the vocabulary size. Two special tokens, namely the start-of-sequence token ($w_{[cls]}$) and the special boundary token ($w_{[sep]}$), are added to the text sequence. We then apply a linear transformation $E_l \in \mathbb{R}^{|V| \times D}$ to obtain embeddings for these tokens. Consequently, the text input representations are calculated by summing the token embeddings with the corresponding text position embeddings $E_{pos}^{l} \in \mathbb{R}^{(M+2) \times D}$:

$$X_l = \big[w_{[cls]}, w_1 E_l, w_2 E_l, \ldots, w_M E_l, w_{[sep]}\big] + E_{pos}^{l},$$

Similar to the visual encoder, $X_l$ is fed into a transformer model with six transformer layers to obtain the contextualized language representations:

$$H_l = \mathrm{FFN}\big(\mathrm{SELFATT}(X_l, X_l, X_l)\big) + X_l,$$

where $X_l$ is the output of the previous transformer layer.
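As a usage illustration, the language branch can be driven through the Hugging Face transformers implementations of WordPiece tokenization and BERT. The checkpoint name below is an assumption, and bert-base has twelve layers whereas the text encoder described above uses six, so this sketch only demonstrates the tokenization and encoding interface.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenizer
encoder = BertModel.from_pretrained("bert-base-uncased")

question = "Is a residential building present?"
tokens = tokenizer(question, return_tensors="pt")   # adds [CLS] and [SEP] automatically
with torch.no_grad():
    outputs = encoder(**tokens)
H_l = outputs.last_hidden_state                     # contextualized tokens: (1, M + 2, D)
```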

3.2.3. Multimodal Fusion Encoder

Inspired by mixture-of-experts networks [63,64,65], we propose cross-modal mixture experts (CMMEs) as the multimodal fusion encoder. In detail, we introduce visual and textual experts to supplant the conventional feed-forward networks of standard transformers. Within each CMME block, these experts specialize in capturing fusion information unique to their respective modalities, enabling switching between different modality experts. The self-attention mechanism within the CMME module is utilized to fuse visual and language features, capturing the intricate interaction between visual and textual information and generating fused features through the modality experts. Furthermore, a cross-modal attention layer is applied after each self-attention layer to fuse the encoded visual representations $H_v$ and textual representations $H_l$. Through cross-modal attention, the model can leverage information from one modality to enhance the representation of the other. The query vector (Q) is matched against the key vector (K), and the value vector (V) is weighted according to the degree of matching, producing the cross-modal attention output. This intricate interplay enables the system to effectively establish semantic connections between the posed question and the corresponding regions of interest within the image, fostering a comprehensive understanding of the underlying semantics.
Whether $H_v$ or $H_l$ is input to the CMME, the parameters of the self-attention and cross-attention layers are shared. By leveraging different experts, we obtain fused visual and language feature representations. The sharing of parameters not only reduces the overall number of model parameters but also facilitates effective interaction between the visual and textual modalities. This design enables the model to proficiently capture the intricate correlations inherent in multimodal inputs, fostering information exchange and integration during the encoding process, effectively modeling and capturing the complex semantic relationship between the question and the image, and thereby achieving superior performance on the RSVQA task.
In CMME, the attention mechanism employed in the cross-attention layers can be formally defined as follows:

$$\mathrm{CROSSATT}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

The intra-modal interactions in the self-attention layer are represented as follows:

$$H_v^{s} = \mathrm{SELFATT}(H_v, H_v, H_v), \quad H_l^{s} = \mathrm{SELFATT}(H_l, H_l, H_l),$$

The execution process of cross-modal interaction is illustrated in Figure 3. In the cross-attention layer, the cross-modal interaction information is integrated into the respective representations:

$$H_v^{c} = H_v^{s} + \mathrm{CROSSATT}(H_v^{s}, H_l^{s}, H_l^{s}) + H_v, \quad H_l^{c} = H_l^{s} + \mathrm{CROSSATT}(H_l^{s}, H_v^{s}, H_v^{s}) + H_l,$$

Finally, given the preceding layer's output vectors $H_v^{z-1}$ and $H_l^{z-1}$ from the visual and textual experts, $H_v^{c}$ and $H_l^{c}$ are fed into the visual and textual experts to obtain the multimodal visual and textual representations:

$$H_v^{z} = H_v^{z-1} + \mathrm{VFFN}\big(\mathrm{LN}(H_v^{c})\big), \quad H_l^{z} = H_l^{z-1} + \mathrm{LFFN}\big(\mathrm{LN}(H_l^{c})\big),$$

where $\mathrm{LN}$ is short for layer normalization.
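The sketch below captures our reading of one CMME layer in PyTorch: the self-attention and cross-attention modules are shared by both modalities, while each modality keeps its own feed-forward expert (VFFN/LFFN). For brevity, the expert residual is taken from the cross-attended representation rather than the preceding layer's expert output, and the normalization placement is an assumption.

```python
import torch
import torch.nn as nn

class CMMELayer(nn.Module):
    """One cross-modal mixture-experts block: shared attention, per-modality experts."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # shared by both modalities
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared by both modalities
        self.v_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))  # VFFN
        self.l_expert = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))  # LFFN
        self.norm_v, self.norm_l = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, H_v, H_l):
        # Intra-modal interaction with the shared self-attention layer.
        H_v_s = self.self_attn(H_v, H_v, H_v)[0]
        H_l_s = self.self_attn(H_l, H_l, H_l)[0]
        # Cross-modal interaction: each modality queries the other one.
        H_v_c = H_v_s + self.cross_attn(H_v_s, H_l_s, H_l_s)[0] + H_v
        H_l_c = H_l_s + self.cross_attn(H_l_s, H_v_s, H_v_s)[0] + H_l
        # Modality experts produce the fused visual / language representations.
        H_v_z = H_v_c + self.v_expert(self.norm_v(H_v_c))
        H_l_z = H_l_c + self.l_expert(self.norm_l(H_l_c))
        return H_v_z, H_l_z
```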

3.2.4. Answer Prediction

Our proposed model predicts the answer by combining the visual representations $H_v^{z}$ and textual representations $H_l^{z}$ obtained from the visual and language experts in the CMME. Specifically, we apply the pooler to $H_v^{z}$ and $H_l^{z}$ to transform them into fixed-length vector representations, capturing the most important information from $H_v^{z}$ and $H_l^{z}$. Subsequently, we concatenate the processed features as shown in the formula below:

$$H_o = \mathrm{Concat}\big(\mathrm{Pooler}(H_v^{z}), \mathrm{Pooler}(H_l^{z})\big),$$

Then, we feed $H_o$ into a multi-layer perceptron ($\mathrm{MLP}$) followed by a softmax layer to compute the probability of each candidate answer:

$$a_{pred} = \mathrm{softmax}\big(\mathrm{MLP}(H_o)\big),$$

where $a_{pred} \in \mathbb{R}^{d_A}$ represents the answer probability vector. The final answer is determined by selecting the candidate with the highest probability.
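A possible implementation of this prediction head is sketched below. The BERT-style pooler (taking the [cls] position through a tanh-activated linear layer) and the MLP width are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    """Pools the expert outputs, concatenates them, and scores the candidate answers."""
    def __init__(self, dim=768, num_answers=99):
        super().__init__()
        # Assumed BERT-style pooler: take the [cls] position through tanh(Linear).
        self.pooler = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers))

    def forward(self, H_v_z, H_l_z):                 # both: (B, seq_len, dim)
        H_o = torch.cat([self.pooler(H_v_z[:, 0]), self.pooler(H_l_z[:, 0])], dim=-1)
        return torch.softmax(self.mlp(H_o), dim=-1)  # a_pred: (B, num_answers)
```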

4. Experiment Results and Discussion

We conducted a comprehensive set of experiments on two commonly used RSVQA datasets, RSVQA-LR and RSVQA-HR [25], to evaluate the performance of our proposed TCMME. To ascertain the efficacy of our approach, we conducted comparative analyses against state-of-the-art methodologies. Additionally, we performed ablation experiments to discern the individual contributions of various components within our model. Furthermore, we analyzed the potential interpretability of the model by visualizing the attention maps of the images and the questions selected for the RSVQA task.

4.1. Datasets

4.1.1. RSVQA-LR

The RSVQA-LR dataset is derived from Sentinel-2 images obtained in the Netherlands. It consists of nine distinct image patches with a spatial resolution of 10 m, covering an area of 6.55 square kilometers. The image patches are partitioned into 772 images, and each consists of 256 × 256 pixels. There are 77,232 question–answer pairs about these images. On average, each image corresponds to approximately 100 questions. An example of the images, questions, and answers is shown in Figure 4a. The questions in this dataset can be categorized into four types: rural/urban, presence, comparison, and counting. For object counting, we utilize the method from [25] to assign object counts into predefined ranges (answers: 0, 1–10, 11–100, 101–1000, and more than 1000). Therefore, the dataset comprises 11 distinct answer options. Following the split method proposed in [25], we allocate 77.8%, 11.1%, and 11.1% of the original image patches to the training, validation, and testing sets, respectively.
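For illustration, the quantization of exact object counts into the answer ranges above can be written as a small helper; the answer strings below are paraphrased and may not match the dataset's exact labels.

```python
def quantize_count(count: int) -> str:
    """Maps an exact object count to the answer ranges used for RSVQA-LR counting [25]."""
    if count == 0:
        return "0"
    if count <= 10:
        return "between 1 and 10"
    if count <= 100:
        return "between 11 and 100"
    if count <= 1000:
        return "between 101 and 1000"
    return "more than 1000"
```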

4.1.2. RSVQA-HR

The RSVQA-HR dataset was created by selecting 161 image blocks from the high-resolution orthophoto (HRO) datasets provided by the USGS. These HRO datasets have a spatial resolution of 15 cm and cover a significant portion of the urban areas along the northeastern coast of the United States. The image blocks were partitioned into 10,659 images with a size of 512 × 512 pixels, and each image corresponds to an area of 5898 square meters. Subsequently, a total of 1,066,316 question–answer pairs were generated, with an average of approximately 100 question–answer pairs per image. Figure 4b showcases illustrative examples extracted from this dataset, encompassing images, associated questions, and corresponding answers. The question–answer pairs are categorized into four types: existence, comparison, object counting, and the area covered by a specific object. For the existence and comparison types, the answers are either affirmative (yes) or negative (no). In the case of object counting, the answers span a numerical range from 0 to 89, denoting the exact count of objects. For the area questions, similar to the count quantization used in RSVQA-LR, the answers are grouped into ranges: equal to 0 m², greater than 0 m² and less than or equal to 10 m², greater than 10 m² and less than or equal to 100 m², greater than 100 m² and less than or equal to 1000 m², and greater than 1000 m². Consequently, this dataset encompasses a maximum of 99 distinct answers. For dataset partitioning, the training set comprises 61.5% of the total image blocks, the validation set accounts for 11.2%, and the remaining portion serves as the test set. Specifically, test set 1 covers 20.5% of the total image blocks, while test set 2 accounts for 6.8%. It is noteworthy that test set 2 is employed to evaluate the model's robustness to images acquired from diverse locations.

4.2. Implementation Details

The implementation of our method was carried out using Python 3.8 and PyTorch 2.0 on a server comprising an AMD EPYC 7753 processor and an NVIDIA RTX 4090 GPU with 24 GB of RAM. To initialize the visual encoder, we utilized the pre-trained weights of the VIT-B model trained on ImageNet. Similarly, for the language encoder, we initialized it with the pre-trained weights of the BERT model trained on a vast amount of text data.
We randomly cropped the images to a resolution of 256 × 256 as input and applied RandAugment [66] for data augmentation during the data preprocessing stage. The training process involved a batch size of 32, encompassing a total of 20 epochs. To optimize our model, we employed the AdamW optimizer [67], initializing it with a learning rate of $2 \times 10^{-5}$. The learning rate was decayed using a cosine schedule until it reached $1 \times 10^{-8}$, ensuring a smooth and gradual reduction throughout the training process. To prevent overfitting and promote regularization, a weight decay of 0.05 was applied.
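A sketch of the preprocessing and optimization setup described above is given below, using torchvision and PyTorch. The number of steps per epoch is a placeholder, and scheduling the learning rate per step rather than per epoch is an assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Data preprocessing: random 256 x 256 crops plus RandAugment.
train_transform = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.RandAugment(),
    transforms.ToTensor(),
])

def build_optimizer(model, num_epochs=20, steps_per_epoch=1000):
    """AdamW with the reported learning rate, weight decay, and cosine decay."""
    optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)
    scheduler = CosineAnnealingLR(optimizer,
                                  T_max=num_epochs * steps_per_epoch,
                                  eta_min=1e-8)
    return optimizer, scheduler
```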

4.3. Comparison Results and Analysis

To evaluate the performance of the proposed model, we compared it with state-of-the-art methods on two benchmark RSVQA datasets: RSVQA-LR and RSVQA-HR [25]. Consistent with prior works [25,26,27,30,31], we utilize accuracy as the performance metric for evaluating the model, which denotes the ratio of correctly predicted samples to the total number of samples. Additionally, to provide a comprehensive analysis, we calculate the accuracy of our model per question type for both the RSVQA-LR and RSVQA-HR datasets. The detailed comparison of the results is presented in Table 1, Table 2 and Table 3. Our method outperforms previous approaches, exhibiting the highest average accuracy and overall accuracy across both RSVQA datasets. The details of the comparison follow.
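Concretely, overall accuracy and the per-type accuracies can be computed as in the sketch below, where average accuracy is taken as the unweighted mean of the per-question-type accuracies (our reading of the protocol in [25]).

```python
from collections import defaultdict

def rsvqa_accuracies(predictions, answers, question_types):
    """Returns overall accuracy, average accuracy, and accuracy per question type."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, ans, qtype in zip(predictions, answers, question_types):
        total[qtype] += 1
        correct[qtype] += int(pred == ans)
    per_type = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / sum(total.values())
    average = sum(per_type.values()) / len(per_type)
    return overall, average, per_type
```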

4.3.1. Accuracy Comparison on RSVQA-LR Dataset

As shown in Table 1, our model not only achieves an overall accuracy of 86.69% and an average accuracy of 88.21% on the RSVQA-LR dataset, significantly surpassing the existing SOTA methods, but also attains the highest accuracy across all question types. Compared to the current SOTA method, SHRNet [31], our method, TCMME, showed improvements of 0.84% and 0.87% in overall accuracy and average accuracy, respectively. In direct comparison to the most competitive existing method, our proposed model achieves a superior accuracy increase by 0.21%, 0.61%, 0.94%, and 1.00% in count, presence, comparison, and rural/urban questions, respectively. Considering the performance improvement brought by the current SOTA method SHRNet compared to its previous best method Bi-Modal [30], the improvement brought by our method in comparison to SHRNet is undoubtedly significant.

4.3.2. Accuracy Comparison on RSVQA-HR Dataset

The comparative results for the two test sets within the RSVQA-HR dataset are presented in Table 2 and Table 3. Specifically, Table 2 showcases the exceptional performance of our proposed TCMME model on the test 1 dataset. With an impressive overall accuracy of 85.96% and an average accuracy of 86.12%, our model surpasses the current SOTA method, SHRNet, by 0.57% and 0.99%, respectively. Notably, our method exhibits an improvement of 4.73% on the area question type. Additionally, there is a performance gain of 0.31% on the comparison question type. However, our method exhibits a slight decrease of 0.61% and 0.48% in the count and presence question types, respectively. Although our method slightly underperforms compared to the SHRNet method on the count and presence questions, the accuracy differences between the two types are not substantial.
Parallel advancements are witnessed on the test 2 dataset. As shown in Table 3, our TCMME model achieves an absolute improvement of 0.78% and 1.53% in overall accuracy and average accuracy, respectively, compared to the current SOTA method, SHRNet. For each type of questions, our method significantly outperforms SHRNet by 7.85% on the area question type and demonstrates superior performance by 0.45% on the comparison question type. However, it exhibits a decrease of 1.29% and 0.88% in accuracy for the count and presence question types, respectively.

4.4. Discussion

It is noteworthy that on the two test sets of RSVQA-HR, our approach slightly underperforms in the count and presence tasks compared to the current SOTA method, SHRNet. However, in the count and presence tasks on the RSVQA-LR dataset, our method surpasses the SOTA method. Several factors potentially account for this performance variation.
Firstly, both the count and presence tasks are inherently challenging. In terms of count tasks, the models might be constrained by the remote sensing image resolutions, potentially missing minuscule objects. Furthermore, the non-uniform or cluttered distribution of count objects in images intensifies the task’s complexity. As a result, to date, no RSVQA approaches have achieved a truly satisfactory recognition outcome. For presence tasks, while models have attained relatively high recognition precision, further improvements might be challenging.
Secondly, the count tasks in RSVQA-LR and RSVQA-HR differ due to resolution limitations. In RSVQA-LR, counting focuses on ranges, such as more than 10, whereas in RSVQA-HR, counting is precise down to the exact number.
Lastly, the disparity between the RSVQA-LR and RSVQA-HR datasets stems from their inherent resolutions. RSVQA-LR features low-resolution images, while RSVQA-HR showcases high-resolution ones. Consequently, models for RSVQA-HR require a fine-grained understanding and perception of images, whereas those for RSVQA-LR necessitate a more macroscopic perspective due to its lower resolution. Our proposed method, TCMME, built entirely on the transformer architecture, leverages its inherent self-attention mechanism. This allows the model to capture long-range dependencies, granting it superior global perception capabilities. As a result, it demonstrates enhanced performance on RSVQA-LR. Conversely, SHRNet, by augmenting the model’s multiscale spatial visual representations, offers a heightened fine-grained perception. This makes it particularly proficient on the high-resolution RSVQA-HR dataset.
Given these observations, we need to not only focus on the interaction of cross-modal features but also enhance the multiscale spatial visual representation capabilities of the visual encoder. A potential viable approach is to replace the visual encoder VIT with the Swin Transformer. The hierarchical partitioned design of the Swin Transformer enables it to capture context information across different levels, thereby more effectively integrating multiscale information and further boosting the model’s performance.

4.5. Ablation Study

To evaluate the effectiveness of the proposed method, we conduct a rigorous set of ablation studies on both the RSVQA-LR and RSVQA-HR datasets, focusing on the contributions of different components within the TCMME framework, the optimal number of layers within the CMME module, and the sensitivity to the training set size. In the subsequent sections, we present a detailed exposition of our findings.

4.5.1. Ablation Study on the Effectiveness of Different Components

To ascertain the contribution of each component to the performance of our proposed method, we conduct several ablation experiments using the following different variants of TCMME.
  • TCMME w/o CMME: We remove the CMME module, which renders the model incapable of performing multimodal interaction, and directly concatenate the encoded visual representations with the language representations for answer prediction.
  • TCMME w/o ME: We remove the modality experts within the CMME module and replace them with a single FFN layer, so that this FFN outputs both the fused visual features and the fused language features.
  • TCMME w/o CA: We remove the cross-attention layer from the CMME module and utilize the remaining self-attention and modality experts components in CMME to fuse the visual and language features.
A comprehensive ablation comparison is presented in Table 4, showcasing the performance disparities among the four distinct variants on both the RSVQA-LR and RSVQA-HR datasets. The empirical results substantiate that the complete TCMME model manifests remarkable enhancements and surpasses all alternative variants. Compared to TCMME w/o CMME, the full TCMME model achieves a performance improvement of 1.32% in overall accuracy and 2.01% in average accuracy on the RSVQA-LR dataset. Additionally, it demonstrates a performance gain of over 1% in overall accuracy and average accuracy on both test sets of the RSVQA-HR dataset. These results indicate the effectiveness of the CMME module in the proposed model. Next, we remove different components within the CMME module to elucidate their individual influence on the holistic prediction prowess of the model. Specifically, when we remove the modality experts (MEs) within the CMME module, the proposed method exhibits a decrease of 0.40% in overall accuracy and 0.54% in average accuracy on the RSVQA-LR dataset. Likewise, the overall accuracy on the two test sets of the RSVQA-HR dataset decreases by 0.45% and 0.22%, respectively. Similarly, removing the cross attention (CA) from the CMME module leads to a slightly larger performance decline than removing the ME. These results indicate that the CA plays a crucial role in capturing cross-attention patterns and is beneficial for improving the model's prediction performance. The decrease in accuracy when removing the ME further demonstrates its importance in the overall architecture of the proposed model.

4.5.2. Ablation Study on the Optimal Number of Layers in the CMME Module

To ascertain the optimal number of layers in the CMME module, we conducted an extensive ablation study. The study encompassed an evaluation of the model's performance on both the RSVQA-LR and RSVQA-HR datasets, employing various numbers of layers ranging from one to six. Table 5 presents the results of the ablation experiments for different numbers of layers in the CMME module. When the CMME module consisted of a single layer, the model achieved an overall accuracy of 86.05% and an average accuracy of 87.47% on the RSVQA-LR dataset. On RSVQA-HR test 1, the model attained an overall accuracy of 85.36% and an average accuracy of 85.47%. Regarding RSVQA-HR test 2, the model demonstrated an overall accuracy of 81.62% and an average accuracy of 81.63%. Interestingly, as the number of layers in the CMME module increased, the performance of the model also improved, reaching its optimum when the CMME module comprised three layers. Furthermore, as we increased the number of layers beyond three, the performance started to plateau and even slightly decline. This indicates that adding more layers to the CMME module does not necessarily lead to better performance and may introduce unnecessary complexity.

4.5.3. Ablation Study on the Sensitivity to the Training Set Size

In order to probe the repercussions of altering the training set size on the model's performance, we conducted an extensive ablation study focusing on the sensitivity to variations in the training set size. This comprehensive investigation was carried out using the RSVQA-LR and RSVQA-HR datasets. As shown in Table 6, our findings revealed that utilizing only 10% of the training data yielded a remarkable overall accuracy of 83.07% on the RSVQA-LR dataset. Subsequently, as we increased the size of the training set, we observed a consistent improvement in the model's overall performance. Particularly noteworthy is that when the training dataset size reached 40%, our model surpassed the current SOTA method, SHRNet, exhibiting a minimal performance degradation of merely 0.81% compared to using the entire training set. Similarly, as depicted in Table 7, setting the training dataset size to 10% of the total yielded overall accuracies of 84.88% and 81.37% on test sets 1 and 2 of the RSVQA-HR dataset, respectively. As we increased the training dataset size to 20%, our model demonstrated performance comparable to that of the SHRNet model. Furthermore, as the training dataset size reached 30%, our model outperformed SHRNet. Significantly, with a training dataset size of 40%, our model achieved performance almost on par with using the entire training set. These compelling results highlight the robustness of our model, indicating that even with a relatively small training dataset, it can still achieve highly competitive performance.

4.6. Visualization

In order to delve into the interpretability of the proposed TCMME model and gain deeper insights into the relevance of textual and visual regions in generating answers, a comprehensive qualitative analysis was undertaken. We leverage the widely adopted Grad-CAM method [68] to generate attention maps and, in line with [69], consider the average of the attention heads.
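The head-averaging step mentioned above reduces the per-head attention of a layer to a single map that can be reshaped to the patch grid and upsampled over the image; a minimal sketch, with the tensor layout assumed to be (batch, heads, queries, keys), is given below.

```python
import torch

def head_averaged_attention(attn_weights: torch.Tensor) -> torch.Tensor:
    """Averages per-head attention weights into a single attention map.

    `attn_weights` is assumed to have shape (batch, heads, queries, keys), e.g.
    the [cls]-to-patch attention of the last visual layer; the result can be
    reshaped to the patch grid and upsampled for overlay on the input image.
    """
    return attn_weights.mean(dim=1)  # (batch, queries, keys)
```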
Figure 5 illustrates the visualization of four attention maps overlaid on the original images from the RSVQA-LR dataset. In Figure 5a,c, we demonstrate two presence questions, which inquire about the presence of corresponding objects in two distinct images. Our model accurately determines the presence of the relevant regions. In Figure 5a, to answer the question “Is a residential building present?”, the model accurately observes the residential building area and provides the corresponding answer “Yes”. Similarly, in Figure 5c, the question is “Is a large forest?”; the model accurately attends to the position of the large forest and likewise gives the answer “Yes”. In Figure 5b, we present a count question, specifically “What is the number of grass areas?” The model exhibits a notable inclination towards the grassy regions within the image, placing relatively less emphasis on the buildings or the forest, and it provides the predicted answer “Between 101 and 1000”. In Figure 5d, we showcase a comparison question, asking “Are there more water areas than buildings?” The model attends to the locations of the water areas and makes a judgment, ultimately producing the answer “No”.
Figure 6 presents the visualization results obtained from the RSVQA-HR dataset. In Figure 6a, a count question is presented, which is different from Figure 5b as it requires determining the quantity of residential buildings in the image rather than providing a range, making it more challenging. Fortunately, our model successfully recognizes all the residential buildings in the image and accurately generates the corresponding answer of “4”. Furthermore, as depicted in Figure 6b, the question is “Are there more buildings than residential areas?”. The model focuses on the building regions, performs the comparison, and provides the answer “Yes”. In Figure 6c, to answer the question “Is there a residential building?”, the model needs to correctly focus on the relevant residential building area, determine its presence and provide the accurate answer of “Yes”. Similarly, in Figure 6d, the question is about the area covered by residential buildings. The model faces the dual challenge of accurately identifying the regions housing residential buildings and estimating the approximate area range encompassed by these structures, ultimately providing the predicted answer “More than 1000 m²”.
In conclusion, these attention maps provide valuable insights into the TCMME model’s focus on different input regions. They enable us to explain the model’s decision-making process and understand the reasoning behind the generated answers while offering an intuitive representation of the model’s inference process. By visualizing attention patterns, we gain valuable insights into the model’s ability to capture the relationship between textual and visual information in the VQA task. Additionally, we can draw meaningful conclusions regarding the model’s attention to specific objects or concepts and its capability to answer questions based on relevant textual and visual information.

5. Conclusions

In this paper, we introduce a pioneering model architecture, TCMME, to tackle the RSVQA task. TCMME leverages VIT for visual feature extraction and BERT for language feature extraction. These features are fused through the CMME module, capturing their complex interactions and effectively modeling the intricate semantic relationships between questions and images. Comprehensive experiments were conducted on the challenging RSVQA-LR and RSVQA-HR datasets to assess the efficacy of the proposed TCMME model, and comparisons were made against SOTA methods. The experimental results demonstrate the significant superiority of our approach over existing SOTA methods. For future developments, we propose adopting vision–language pre-training methods to learn rich visual and textual representations from a large-scale dataset of remote sensing image–text pairs, which can be fine-tuned for downstream RSVQA tasks.

Author Contributions

Conceptualization, P.L.; methodology, P.L.; software, J.H.; validation, J.H. and H.L.; formal analysis, G.L.; investigation, P.L. and G.H.; resources, G.L. and S.Z.; data curation, P.L. and J.H.; writing—original draft preparation, J.H. and P.L.; writing—review and editing, P.L. and G.L.; visualization, J.H.; supervision, G.L. and S.Z.; project administration, G.L.; funding acquisition, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Heilongjiang Province under grant number LH2021F015.

Data Availability Statement

The RSVQA-LR dataset utilized in this work is openly available at https://zenodo.org/record/6344334 (accessed on 10 March 2022). The RSVQA-HR dataset utilized in this work is openly available at https://zenodo.org/record/6344367 (accessed on 10 March 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zheng, X.; Gong, T.; Li, X.; Lu, X. Generalized Scene Classification From Small-Scale Datasets with Multitask Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5609311. [Google Scholar] [CrossRef]
  2. Ye, Y.; Bruzzone, L.; Shan, J.; Bovolo, F.; Zhu, Q. Fast and Robust Matching for Multimodal Remote Sensing Image Registration. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9059–9070. [Google Scholar] [CrossRef]
  3. Pham, H.M.; Yamaguchi, Y.; Bui, T.Q. A case study on the relation between city planning and urban growth using remote sensing and spatial metrics. Landsc. Urban Plan. 2011, 100, 223–230. [Google Scholar] [CrossRef]
  4. Cheng, G.; Guo, L.; Zhao, T.; Han, J.; Li, H.; Fang, J. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 2013, 34, 45–59. [Google Scholar] [CrossRef]
  5. Jahromi, M.N.; Jahromi, M.N.; Pourghasemi, H.R.; Zand-Parsa, S.; Jamshidi, S. Accuracy assessment of forest mapping in MODIS land cover dataset using fuzzy set theory. In Forest Resources Resilience and Conflicts; Elsevier: Amsterdam, The Netherlands, 2021; pp. 165–183. [Google Scholar]
  6. Li, Y.; Yang, J. Meta-learning baselines and database for few-shot classification in agriculture. Comput. Electron. Agric. 2021, 182, 106055. [Google Scholar] [CrossRef]
  7. Li, X.; Shao, G. Object-based urban vegetation mapping with high-resolution aerial photography as a single data source. Int. J. Remote Sens. 2013, 34, 771–789. [Google Scholar] [CrossRef]
  8. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Convolutional Neural Networks for Large-Scale Remote-Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 645–657. [Google Scholar] [CrossRef]
  9. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [Google Scholar] [CrossRef]
  10. Feng, X.; Han, J.; Yao, X.; Cheng, G. TCANet: Triple Context-Aware Network for Weakly Supervised Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6946–6955. [Google Scholar] [CrossRef]
  11. Qian, X.; Lin, S.; Cheng, G.; Yao, X.; Ren, H.; Wang, W. Object Detection in Remote Sensing Images Based on Improved Bounding Box Regression and Multi-Level Features Fusion. Remote Sens. 2020, 12, 143. [Google Scholar] [CrossRef]
  12. Zhang, L.; Zhang, Y. Airport Detection and Aircraft Recognition Based on Two-Layer Saliency Model in High Spatial Resolution Remote-Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1511–1524. [Google Scholar] [CrossRef]
  13. Yao, X.; Cao, Q.; Feng, X.; Cheng, G.; Han, J. Scale-Aware Detailed Matching for Few-Shot Aerial Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5611711. [Google Scholar] [CrossRef]
  14. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation: New York, NY, USA, 2020; pp. 4095–4104. [Google Scholar] [CrossRef]
  15. Xu, P.; Li, Q.; Zhang, B.; Wu, F.; Zhao, K.; Du, X.; Yang, C.; Zhong, R. On-Board Real-Time Ship Detection in HISEA-1 SAR Images Based on CFAR and Lightweight Deep Learning. Remote Sens. 2021, 13, 1995. [Google Scholar] [CrossRef]
  16. Noothout, J.M.H.; de Vos, B.D.; Wolterink, J.M.; Postma, E.M.; Smeets, P.A.M.; Takx, R.A.P.; Leiner, T.; Viergever, M.A.; Isgum, I. Deep Learning-Based Regression and Classification for Automatic Landmark Localization in Medical Images. IEEE Trans. Med. Imaging 2020, 39, 4011–4022. [Google Scholar] [CrossRef] [PubMed]
  17. Cen, F.; Wang, G. Boosting Occluded Image Classification via Subspace Decomposition-Based Estimation of Deep Features. IEEE Trans. Cybern. 2020, 50, 3409–3422. [Google Scholar] [CrossRef] [PubMed]
  18. Zheng, G.; Li, X.; Zhou, L.; Yang, J.; Ren, L.; Chen, P.; Zhang, H.; Lou, X. Development of a Gray-Level Co-Occurrence Matrix-Based Texture Orientation Estimation Method and Its Application in Sea Surface Wind Direction Retrieval From SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5244–5260. [Google Scholar] [CrossRef]
  19. Yuan, Z.; Zhang, W.; Rong, X.; Li, X.; Chen, J.; Wang, H.; Fu, K.; Sun, X. A Lightweight Multi-Scale Crossmodal Text-Image Retrieval Method in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  20. Cheng, Q.; Zhou, Y.; Fu, P.; Xu, Y.; Zhang, L. A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 4284–4297. [Google Scholar] [CrossRef]
  21. Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  22. Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5608816. [Google Scholar] [CrossRef]
  23. Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global Visual Feature and Linguistic State Guided Attention for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615216. [Google Scholar] [CrossRef]
  24. Zhao, R.; Shi, Z.; Zou, Z. High-Resolution Remote Sensing Image Captioning Based on Structured Attention. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603814. [Google Scholar] [CrossRef]
  25. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual Question Answering for Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
  26. Zheng, X.; Wang, B.; Du, X.; Lu, X. Mutual Attention Inception Network for Remote Sensing Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5606514. [Google Scholar] [CrossRef]
  27. Yuan, Z.; Mou, L.; Wang, Q.; Zhu, X.X. From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Sydney, Australia, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  30. Bazi, Y.; Rahhal, M.M.A.; Mekhalfi, M.L.; Zuair, M.A.A.; Melgani, F. Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4708011. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Jiao, L.; Li, L.; Liu, X.; Chen, P.; Liu, F.; Li, Y.; Guo, Z. A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4400815. [Google Scholar] [CrossRef]
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, 3–7 May 2021. [Google Scholar]
  33. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  34. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 7–13 December 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 2425–2433. [Google Scholar] [CrossRef]
  35. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  36. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 1–9. [Google Scholar] [CrossRef]
  37. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  38. Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014; A Meeting of SIGDAT, a Special Interest Group of the ACL. Moschitti, A., Pang, B., Daelemans, W., Eds.; ACL: Toronto, ON, Canada, 2014; pp. 1724–1734. [Google Scholar] [CrossRef]
  39. Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 1–4 November 2016; Su, J., Carreras, X., Duh, K., Eds.; The Association for Computational Linguistics: Toronto, ON, Canada, 2016; pp. 457–468. [Google Scholar] [CrossRef]
  40. Yu, Z.; Yu, J.; Fan, J.; Tao, D. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 1839–1848. [Google Scholar] [CrossRef]
  41. Kim, J.; On, K.W.; Lim, W.; Kim, J.; Ha, J.; Zhang, B. Hadamard Product for Low-rank Bilinear Pooling. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
  42. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; Computer Vision Foundation: New York, NY, USA; IEEE Computer Society: Washington, DC, USA, 2018; pp. 6077–6086. [Google Scholar] [CrossRef]
43. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  44. Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola, A.J. Stacked Attention Networks for Image Question Answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 21–29. [Google Scholar] [CrossRef]
  45. Kim, J.; Jun, J.; Zhang, B. Bilinear Attention Networks. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R., Eds.; 2018; pp. 1571–1581. [Google Scholar]
  46. Peng, L.; Yang, Y.; Wang, Z.; Wu, X.; Huang, Z. CRA-Net: Composed Relation Attention Network for Visual Question Answering. In Proceedings of the 27th ACM International Conference on Multimedia (MM 2019), Nice, France, 21–25 October 2019; Amsaleg, L., Huet, B., Larson, M.A., Gravier, G., Hung, H., Ngo, C., Ooi, W.T., Eds.; ACM: New York, NY, USA, 2019; pp. 1202–1210. [Google Scholar] [CrossRef]
  47. Chappuis, C.; Zermatten, V.; Lobry, S.; Le Saux, B.; Tuia, D. Prompt-RSVQA: Prompting visual context to a language model for remote sensing visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1372–1381. [Google Scholar]
  48. Yuan, Z.; Mou, L.; Xiong, Z.; Zhu, X.X. Change Detection Meets Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5630613. [Google Scholar] [CrossRef]
  49. Bazi, Y.; Bashmal, L.; Rahhal, M.M.A.; Dayil, R.A.; Ajlan, N.A. Vision Transformers for Remote Sensing Image Classification. Remote Sens. 2021, 13, 516. [Google Scholar] [CrossRef]
  50. Deng, P.; Xu, K.; Huang, H. When CNNs Meet Vision Transformer: A Joint Framework for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8020305. [Google Scholar] [CrossRef]
  51. Ma, J.; Li, M.; Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Homo-Heterogenous Transformer Learning Framework for RS Scene Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2223–2239. [Google Scholar] [CrossRef]
  52. Xu, X.; Feng, Z.; Cao, C.; Li, M.; Wu, J.; Wu, Z.; Shang, Y.; Ye, S. An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation. Remote Sens. 2021, 13, 4779. [Google Scholar] [CrossRef]
  53. Tang, J.; Zhang, W.; Liu, H.; Yang, M.; Jiang, B.; Hu, G.; Bai, X. Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 4553–4562. [Google Scholar] [CrossRef]
  54. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 2342–2356. [Google Scholar] [CrossRef]
  55. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
  56. Wang, G.; Li, B.; Zhang, T.; Zhang, S. A Network Combining a Transformer and a Convolutional Neural Network for Remote Sensing Image Change Detection. Remote Sens. 2022, 14, 2228. [Google Scholar] [CrossRef]
  57. Ke, Q.; Zhang, P. Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation. ISPRS Int. J. Geo Inf. 2022, 11, 263. [Google Scholar] [CrossRef]
  58. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient Transformer for Remote Sensing Image Segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  59. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-Based Decoder Designs for Semantic Segmentation on Remotely Sensed Images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
  60. Xiao, X.; Guo, W.; Chen, R.; Hui, Y.; Wang, J.; Zhao, H. A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction. Remote Sens. 2022, 14, 2611. [Google Scholar] [CrossRef]
  61. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, FL, USA, 20–25 June 2009; IEEE Computer Society: Washington, DC, USA, 2009; pp. 248–255. [Google Scholar] [CrossRef]
  62. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  63. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  64. Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 120. [Google Scholar]
65. Bao, H.; Wang, W.; Dong, L.; Liu, Q.; Mohammed, O.K.; Aggarwal, K.; Som, S.; Piao, S.; Wei, F. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  66. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR Workshops 2020), Seattle, WA, USA, 14–19 June 2020; Computer Vision Foundation: New York, NY, USA, 2020; pp. 3008–3017. [Google Scholar] [CrossRef]
  67. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  68. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 618–626. [Google Scholar] [CrossRef]
  69. Abnar, S.; Zuidema, W.H. Quantifying Attention Flow in Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2020; pp. 4190–4197. [Google Scholar] [CrossRef]
Figure 1. An example of RSVQA.
Figure 2. Architecture of the TCMME for remote sensing VQA. It consists of a visual encoder, a language encoder, a multimodal fusion encoder, and an answer classification module.
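To make the four components named in the Figure 2 caption concrete, the following is a minimal sketch of the overall pipeline only. It does not reproduce the CMME internals (shared self-attention, cross-modal attention, and modality experts); the fusion block here is a plain transformer encoder over concatenated tokens, and the pretrained checkpoints, hidden size, and layer count are illustrative assumptions.

```python
# Minimal pipeline sketch for Figure 2 (visual encoder, language encoder,
# multimodal fusion, answer classification). NOT the paper's CMME: the fusion
# block is a plain transformer encoder, and backbone names/sizes are assumed.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel


class RSVQAPipelineSketch(nn.Module):
    def __init__(self, num_answers: int, hidden: int = 768, fusion_layers: int = 3):
        super().__init__()
        # Visual and language encoders, as in Figure 2 (ViT and BERT backbones).
        self.visual_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.language_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Stand-in multimodal fusion encoder (the paper uses CMME blocks here).
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(layer, num_layers=fusion_layers)
        # Answer classification head: RSVQA treats answering as classification.
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.visual_encoder(pixel_values=pixel_values).last_hidden_state          # (B, Nv, H)
        t = self.language_encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state    # (B, Nt, H)
        fused = self.fusion_encoder(torch.cat([v, t], dim=1))  # joint attention over both modalities
        return self.classifier(fused[:, 0])                    # logits over the fixed answer set
```

The default of three fusion layers mirrors Table 5, where three CMME layers give the best accuracy; everything else in the sketch is a placeholder.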
Figure 3. Illustration of fusion operations performed within the backbone.
Figure 4. Example of the RSVQA-LR (a) and RSVQA-HR (b) datasets with different types of question–answer pairs.
Figure 5. Visualizing image attention maps on the RSVQA-LR dataset. (a,c) are two presence questions, while (b,d) are count and comparison questions.
Figure 6. Visualizing image attention maps on the RSVQA-HR dataset. (a) is a count question, (b) is a comparison question, (c) is a presence question, and (d) is an area question.
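The captions for Figures 5 and 6 do not state how the attention maps were produced; the bibliography cites both Grad-CAM [68] and attention rollout [69]. As one hedged possibility, the sketch below implements attention rollout in the sense of [69]; the input format (per-layer attention matrices already averaged over heads) is an assumption.

```python
# Attention rollout in the sense of Abnar and Zuidema [69]: residual-corrected,
# row-normalised attention matrices are multiplied layer by layer. Assumes each
# entry of `attentions` is a (tokens, tokens) matrix already averaged over heads.
import numpy as np


def attention_rollout(attentions):
    n_tokens = attentions[0].shape[0]
    rollout = np.eye(n_tokens)
    for attn in attentions:
        attn = attn + np.eye(n_tokens)                  # account for residual connections
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalise rows to sum to 1
        rollout = attn @ rollout                        # propagate attention through the layer
    return rollout  # rollout[0] shows how the first (e.g., [CLS]) token attends to each token
```

Reshaping the patch-token entries of rollout[0] back to the ViT patch grid and upsampling to the image size yields heatmaps of the kind shown in Figures 5 and 6.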
Table 1. Comparison of TCMME with existing methods on the RSVQA-LR dataset. The standard deviation is reported in brackets. Bold represents the highest accuracy among these methods.
| Types | RSVQA [25] | EasyToHard [27] | Bi-Modal [30] | SHRNet [31] | TCMME (Ours) |
|---|---|---|---|---|---|
| Count | 67.01% (0.59%) | 69.22% (0.33%) | 72.22% (0.57%) | 73.87% (0.22%) | **74.08% (0.35%)** |
| Presence | 87.46% (0.06%) | 90.66% (0.24%) | 91.06% (0.17%) | 91.03% (0.13%) | **91.64% (0.13%)** |
| Comparison | 81.50% (0.03%) | 87.49% (0.10%) | 91.16% (0.09%) | 90.48% (0.05%) | **92.10% (0.06%)** |
| Rural/Urban | 90.00% (1.41%) | 91.67% (1.53%) | 92.66% (1.52%) | 94.00% (0.87%) | **95.00% (0.93%)** |
| Average Accuracy | 81.49% (0.49%) | 84.76% (0.35%) | 86.78% (0.28%) | 87.34% (0.13%) | **88.21% (0.24%)** |
| Overall Accuracy | 79.08% (0.20%) | 83.09% (0.15%) | 85.56% (0.16%) | 85.85% (0.28%) | **86.69% (0.21%)** |
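Tables 1–7 report per-type accuracy alongside "Average Accuracy" and "Overall Accuracy". The tables themselves do not define these aggregates; the snippet below assumes the usual RSVQA convention, in which average accuracy is the unweighted mean of the per-type accuracies and overall accuracy is computed over all questions regardless of type.

```python
# Assumed metric definitions for the RSVQA tables (not stated in the tables
# themselves): per-type accuracy, their unweighted mean (Average Accuracy),
# and accuracy over all questions (Overall Accuracy).
from collections import defaultdict


def rsvqa_accuracies(records):
    """records: iterable of (question_type, is_correct) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for qtype, ok in records:
        total[qtype] += 1
        correct[qtype] += int(ok)
    per_type = {q: correct[q] / total[q] for q in total}
    average = sum(per_type.values()) / len(per_type)       # mean over question types
    overall = sum(correct.values()) / sum(total.values())  # mean over all questions
    return per_type, average, overall
```

Because question types are not equally frequent, the two aggregates can differ, which is why both columns appear in every table.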
Table 2. Comparison of TCMME with existing methods on test set 1 of the RSVQA-HR dataset. The standard deviation is reported in brackets. Bold represents the highest accuracy among these methods.
| Types | RSVQA [25] | EasyToHard [27] | Bi-Modal [30] | SHRNet [31] | TCMME (Ours) |
|---|---|---|---|---|---|
| Count | 68.63% (0.11%) | 69.06% (0.13%) | 69.80% (0.09%) | **70.04% (0.15%)** | 69.43% (0.11%) |
| Presence | 90.43% (0.04%) | 91.39% (0.15%) | 92.03% (0.08%) | **92.45% (0.11%)** | 91.97% (0.07%) |
| Comparison | 88.19% (0.08%) | 89.75% (0.10%) | 91.83% (0.00%) | 91.68% (0.09%) | **91.99% (0.03%)** |
| Area | 85.24% (0.05%) | 85.92% (0.19%) | 86.27% (0.05%) | 86.35% (0.13%) | **91.08% (0.09%)** |
| Average Accuracy | 83.12% (0.03%) | 83.97% (0.06%) | 84.98% (0.05%) | 85.13% (0.08%) | **86.12% (0.05%)** |
| Overall Accuracy | 83.23% (0.02%) | 84.16% (0.05%) | 85.30% (0.05%) | 85.39% (0.05%) | **85.96% (0.06%)** |
Table 3. Comparison of TCMME with existing methods on test set 2 of the RSVQA-HR dataset. The standard deviation is reported in brackets. Bold represents the highest accuracy among these methods.
| Types | RSVQA [25] | EasyToHard [27] | Bi-Modal [30] | SHRNet [31] | TCMME (Ours) |
|---|---|---|---|---|---|
| Count | 61.47% (0.08%) | 61.95% (0.08%) | 63.06% (0.11%) | **63.42% (0.14%)** | 62.13% (0.10%) |
| Presence | 86.26% (0.47%) | 87.97% (0.06%) | 89.37% (0.21%) | **89.81% (0.27%)** | 88.93% (0.19%) |
| Comparison | 85.94% (0.12%) | 87.68% (0.23%) | 89.62% (0.29%) | 89.44% (0.23%) | **89.89% (0.22%)** |
| Area | 76.33% (0.50%) | 78.62% (0.23%) | 80.12% (0.39%) | 80.37% (0.16%) | **88.22% (0.33%)** |
| Average Accuracy | 77.50% (0.29%) | 79.06% (0.15%) | 80.54% (0.16%) | 80.76% (0.21%) | **82.29% (0.16%)** |
| Overall Accuracy | 78.23% (0.25%) | 79.29% (0.15%) | 81.23% (0.15%) | 81.37% (0.19%) | **82.15% (0.18%)** |
Table 4. Comparison of different components of the TCMME module on both the RSVQA-LR and RSVQA-HR datasets. Bold represents the highest accuracy among these variants.
| Variant | RSVQA-LR Overall Accuracy | RSVQA-LR Average Accuracy | RSVQA-HR Test Set 1 Overall Accuracy | RSVQA-HR Test Set 1 Average Accuracy | RSVQA-HR Test Set 2 Overall Accuracy | RSVQA-HR Test Set 2 Average Accuracy |
|---|---|---|---|---|---|---|
| w/o CMME | 85.37% | 86.20% | 84.95% | 85.01% | 81.03% | 81.04% |
| w/o ME | 86.29% | 87.67% | 85.51% | 85.60% | 81.93% | 82.02% |
| w/o CA | 86.12% | 87.53% | 85.42% | 85.55% | 81.65% | 81.71% |
| TCMME (full) | **86.69%** | **88.21%** | **85.96%** | **86.12%** | **82.15%** | **82.29%** |
Table 5. Comparison of different numbers of layers in the CMME module on both the RSVQA-LR and RSVQA-HR datasets. Bold represents the highest accuracy among these variants.
| Number of Layers in the CMME | RSVQA-LR Overall Accuracy | RSVQA-LR Average Accuracy | RSVQA-HR Test Set 1 Overall Accuracy | RSVQA-HR Test Set 1 Average Accuracy | RSVQA-HR Test Set 2 Overall Accuracy | RSVQA-HR Test Set 2 Average Accuracy |
|---|---|---|---|---|---|---|
| 1 | 86.05% | 87.47% | 85.36% | 85.47% | 81.62% | 81.63% |
| 2 | 86.23% | 87.65% | 85.77% | 85.95% | 82.06% | 82.01% |
| 3 | **86.69%** | **88.21%** | **85.96%** | **86.12%** | **82.15%** | **82.29%** |
| 4 | 86.36% | 87.19% | 85.78% | 85.98% | 82.12% | 82.22% |
| 5 | 86.21% | 88.02% | 85.91% | 86.06% | 81.94% | 81.91% |
| 6 | 86.09% | 86.52% | 85.53% | 85.64% | 82.08% | 82.00% |
Table 6. Results obtained on the RSVQA-LR test set with training sets of 10%, 20%, 30%, 40%, and 100%. Bold represents the highest accuracy among these training sets.
| Types | Train (10%) | Train (20%) | Train (30%) | Train (40%) | Train (100%) |
|---|---|---|---|---|---|
| Count | 66.98% | 70.21% | 71.50% | 72.38% | **74.08%** |
| Presence | 90.93% | 90.46% | 90.66% | 90.59% | **91.64%** |
| Comparison | 88.93% | 90.08% | 90.70% | **92.15%** | 92.10% |
| Rural/Urban | 90.00% | 88.00% | 92.00% | 93.00% | **95.00%** |
| Average Accuracy | 84.21% | 84.69% | 86.22% | 87.03% | **88.21%** |
| Overall Accuracy | 83.07% | 84.32% | 85.05% | 85.88% | **86.69%** |
Table 7. Results obtained on RSVQA-HR test sets 1 and 2 with training sets of 10%, 20%, 30%, 40%, and 100%. Bold represents the highest accuracy among these training sets.
| Types | Test Set 1, Train (10%) | Test Set 1, Train (20%) | Test Set 1, Train (30%) | Test Set 1, Train (40%) | Test Set 1, Train (100%) | Test Set 2, Train (10%) | Test Set 2, Train (20%) | Test Set 2, Train (30%) | Test Set 2, Train (40%) | Test Set 2, Train (100%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Count | 68.60% | 68.82% | 68.85% | **69.68%** | 69.43% | 62.41% | 62.02% | 61.67% | **62.42%** | 62.13% |
| Presence | 90.91% | 91.24% | 91.89% | 91.87% | **91.97%** | 88.11% | 88.63% | 89.06% | **89.43%** | 88.93% |
| Comparison | 90.63% | 91.34% | 91.45% | 91.53% | **91.99%** | 88.79% | 89.09% | 89.25% | 89.24% | **89.89%** |
| Area | 90.13% | 90.63% | 90.73% | 90.89% | **91.08%** | 86.31% | 86.78% | 87.54% | 87.33% | **88.22%** |
| Average Accuracy | 85.07% | 85.51% | 85.73% | 85.99% | **86.12%** | 81.41% | 81.63% | 81.88% | 80.10% | **82.29%** |
| Overall Accuracy | 84.88% | 85.33% | 85.55% | 85.82% | **85.96%** | 81.37% | 81.57% | 81.76% | 82.01% | **82.15%** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
