Article

Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration

College of Computer Science and Software Engineering, Hohai University, Nanjing 211100, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(3), 466; https://doi.org/10.3390/rs17030466
Submission received: 24 December 2024 / Revised: 24 January 2025 / Accepted: 27 January 2025 / Published: 29 January 2025

Abstract

Large vision language models (LVLMs) are built upon large language models (LLMs) and incorporate non-textual modalities; they can perform various multimodal tasks. Applying LVLMs to remote sensing (RS) visual question answering (VQA) tasks can take advantage of their powerful capabilities to promote the development of VQA in RS. However, due to the greater complexity of remote sensing images compared to natural images, general-domain LVLMs tend to perform poorly in RS scenarios and are prone to hallucination phenomena. Multi-agent debate for collaborative reasoning is commonly utilized to mitigate hallucination phenomena. Although this method is effective, it comes with a significant computational burden (e.g., high CPU/GPU demands and slow inference speed). To address these limitations, we propose Co-LLaVA, a model specifically designed for RS VQA tasks. Specifically, Co-LLaVA employs model collaboration between the Large Language and Vision Assistant (LLaVA-v1.5) and Contrastive Captioners (CoCa). It combines an LVLM with a lightweight generative model, reducing the computational burden compared to multi-agent debate. Additionally, through high-dimensional multi-scale features and higher-resolution images, Co-LLaVA can enhance the perception of details in RS images. Experimental results demonstrate significant performance improvements of our Co-LLaVA over existing LVLMs (e.g., GeoChat, RSGPT) on multiple metrics of four RS VQA datasets (e.g., +3% over SkySenseGPT on “Rural/Urban” accuracy in the test set of the RSVQA-LR dataset).

1. Introduction

RS images contain rich information that can be utilized for various tasks such as object counting [1], scene classification [2], and object detection [3]. To fully leverage RS images as a source of information and to enable non-expert users to access the rich information contained in RS images easily, the visual question answering (VQA) task has been introduced into RS. VQA is a challenging task because the model needs to understand both the image and the question semantically and to reason about the interactions between objects to extract information from the image. Although the VQA task has become widespread in general domains, the remote sensing visual question answering (RS VQA) task is still in its infancy due to the complexity of RS images (e.g., a large variety of small-scale objects). Early VQA models in RS [4,5] treated the VQA task as a classification task, selecting answers from the expected responses in the training data. Consequently, these methods were typically unable to generate open-ended responses. In other words, they had limited expressive and interactive capabilities, often failing to capture complex information in RS scenes.
Researchers have further harnessed the potential of large language models (LLMs) [6,7,8,9] by integrating visual perception modules, creating large vision language models (LVLMs) [10,11,12]. Moreover, LVLMs can understand complex image scenes and perform visual reasoning [13,14,15]. However, despite the impressive performance of LVLMs in general domains, their performance is often suboptimal when dealing with RS data [16]. This performance gap can be attributed to the diverse scales [17,18], unique acquisition angles [19,20], and complex land cover types [21,22,23] of RS images. Consequently, when faced with RS VQA tasks, LVLMs may provide inaccurate or even fabricated reasoning results.
Recently, significant progress has been made in the research on LVLMs for RS [24]. Many large-scale high-quality RS instruction datasets [25,26] have already emerged, but there have been relatively few improvements in algorithms specifically designed for RS images. For example, FIT-RS [25] is a large-scale instruction following dataset, including several complex comprehension tasks. The RS instruction dataset [27] integrates four diverse single-task datasets related to captioning and VQA. LHRS-Instruct [26] is built upon two image-caption datasets (RSITMD [28] and NWPU [29]) and one image–text dataset (LHRS-Align [26]). Specifically, existing LVLMs for RS still face the following challenges, as illustrated in Figure 1: (1) Since the inception of LVLMs, hallucination phenomena have consistently troubled researchers [30]. In the domain of RS, hallucinations in LVLMs primarily manifest as false object detection, incorrect land cover classification, erroneous change detection, etc. General LVLMs often employ multi-agent debate for collaborative inference to mitigate hallucinations [31,32]. However, this approach demands significant computational power and imposes a substantial computational burden on hardware devices. Notably, mitigating hallucination phenomena in RS has not been widely studied. (2) Remote sensing scenes are complex, encompassing water bodies, residential areas, forests, deserts, roads, and more. There is a significant scale difference between objects, making it difficult for a single scale to simultaneously capture the features of small objects (e.g., cars) and large objects (e.g., bridges and buildings). This scale difference also complicates the balance between global image understanding and local object recognition. Therefore, accurately extracting features of objects at different scales is a challenging task. Existing LVLMs often overlook the scale variations in RS images. (3) The typical resolution for RS images is  512 × 512 , whereas the input resolution for LVLM’s visual encoder (e.g., CLIP-ViT-L/14 [33]) is generally  224 × 224 . Due to many small objects within RS images, reducing the resolution to fit the visual encoder makes these small objects even harder to distinguish. Therefore, the resolution of LVLM’s visual encoder is insufficient to understand the details presented in RS images.
To alleviate the above obstacles, in this paper, we propose Co-LLaVA, a multimodal model specifically tailored for RS VQA tasks. We employ the lightweight generative model CoCa [34] and the large vision language model LLaVA-v1.5 [35] for efficient RS VQA. Its purpose is to correct or reduce the erroneous visual understanding of LLaVA-v1.5 by integrating the visual understanding results of CoCa, thus mitigating hallucination phenomena in RS. Inspired by the work of Shi et al. [36], we consider the scale variation in RS images, introducing a multi-scale feature fusion (MFF) module. In particular, we increase the resolution of LLaVA-v1.5’s visual encoder in this module. Through the above approaches, Co-LLaVA can perform exceptionally well in RS VQA tasks.
In summary, our contributions are as follows:
  • We propose Co-LLaVA, which introduces a model collaboration approach using CoCa and LLaVA-v1.5 to achieve efficient RS VQA. By utilizing the visual understanding results of CoCa as a reference, LLaVA-v1.5 can correct or reduce its erroneous visual understanding, mitigating hallucination phenomena in RS. This method effectively mitigates hallucination phenomena without imposing a significant computational burden, unlike multi-agent debate for collaborative reasoning.
  • Our approach enhances the model’s perception of RS image details through a multi-scale feature fusion (MFF) module. This module takes high-resolution RS images as input and integrates global and local information to adapt to object scale variations.
  • We conduct a series of experiments on four widely used RS VQA datasets. Co-LLaVA achieves state-of-the-art (SOTA) performance across multiple metrics of these datasets, demonstrating the effectiveness of our approach.

2. Related Works

2.1. Lightweight Models for RS VQA Tasks

The RS VQA task was first proposed by Lobry et al. [37]. They built the first large-scale VQA benchmark dataset RSVQA for RS images. They also proposed a VQA model consisting of a convolutional neural network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for question encoding. This model can predict answers by integrating visual information and question representations. Subsequent work has focused on effective modality fusion in RS VQA tasks. Bazi et al. [38] proposed Bi-Modal, which utilizes CLIP [33] to extract visual and question representations. Bi-Modal captures the internal dependencies and interdependencies of these representations using self-attention and cross-attention. To achieve a more efficient training process, Yuan et al. [39] introduced a VQA model based on Self-Paced Curriculum Learning (SPCL), training the network in an easy-to-hard way. Zhang et al. [5] discovered significant shortcomings in current RS VQA models regarding visual-spatial reasoning. Then they proposed a Spatial Hierarchical Reasoning Network (SHRNet) to perform spatial hierarchical reasoning on multi-scale visual features guided by language features. They also employed a visual question (VQ) interaction module to learn effective image–text fusion features to predict the final answer. To fully utilize the global and local information of images and interact with question information, Li et al. [40] proposed MADNet for RS VQA tasks. MADNet achieves a 6.89% increase in overall accuracy (OA) over the baseline.

2.2. LVLMs for RS VQA Tasks

Inspired by the outstanding performance of LLMs and LVLMs, many studies have begun exploring the application of LVLMs in the RS domain. These studies primarily focus on constructing high-quality instruction following datasets to fine-tune LVLMs, enabling them to handle various downstream tasks in RS, such as caption generation, visual question answering, and visual grounding. For example, Kuckreja et al. [16] generated a new multimodal RS instruction following dataset by expanding the image–text pairs of existing RS datasets and fine-tuning the first versatile RS LVLM (GeoChat). GeoChat has multi-task conversational capabilities for RS images. Zhang et al. [41] extended the images in their RS instruction following dataset to include SAR and infrared images. Unlike previous RS datasets, Hu et al. [42] constructed a high-quality RS image captioning dataset (RSICap). RSICap provides detailed descriptions for each image, including scene descriptions (e.g., residential area, airport, or farmland) and object information (e.g., color, shape, quantity, absolute location, etc.). RSGPT, fine-tuned on this dataset, performs exceptionally well on fine-grained tasks. Luo et al. [25] created a large-scale instruction tuning dataset FIT-RS. FIT-RS innovatively introduces several complex comprehension tasks of increasing difficulty, from relational reasoning to image-level scene graph generation. SkySenseGPT, proposed based on this dataset, can understand fine-grained relationships between objects in complex RS scenes. Other similar works include LHRS-Bot, SkyEyeGPT, and RS-LLAVA. Unlike the above studies, which only utilized large-scale image–text data for pre-training, Pang et al. [43] were the first to address the inevitable hallucination problem in RS LVLMs. By introducing various unanswerable questions in RS VQA tasks, they constructed the dataset RSSA. RSSA is the first dataset aimed at enhancing the self-awareness of RS LVLMs. Experimental results demonstrate that RSSA can improve the honesty of H2RSVLM.

2.3. Multi-Agent Debate for Collaborative Reasoning

From LLaVA [13] to Qwen-VL [14], and from GPT-4V [15] to CogAgent [44], the hallucination phenomenon has consistently been a significant issue for current LVLMs. Most of these models tend to provide extremely inaccurate responses to user-provided images and questions [30]. For example, they describe nonexistent objects in the image or identify features that do not match the object’s color, quantity, or spatial relationships. Recent research has explored the strategy of multi-agent debate for collaborative reasoning to reduce model hallucinations. This approach aims to improve output quality by leveraging diverse perspectives from multiple agents, each proposing and discussing their responses over multiple rounds to reach a common final answer. Du et al. [45] was the first to apply and evaluate this strategy in arithmetic reasoning tasks, where multiple agents converged on a single solution after several rounds of debate. Li et al. [46] proposed PRD, which considers preference rankings for possible answers from each agent and prompts agents to debate and try to reach a mutual agreement on their preferences. Additionally, Chen et al. [47] introduced ReConcile, where agents learn to persuade others to improve their answers and utilize a confidence-weighted voting mechanism to reach a better consensus. Multiple studies have shown that drawing inspiration from collaborative intelligence is beneficial, and diverse perspectives often converge into more refined solutions.

3. Model Architecture

To achieve efficient visual understanding in RS VQA, we design the framework for Co-LLaVA as illustrated in Figure 2. Co-LLaVA combines two efficient models, CoCa and LLaVA-v1.5, which perform well on RS VQA tasks after being pre-trained or fine-tuned. In this part, we will first introduce the multi-scale feature fusion module in Section 3.1, followed by a model collaboration module in Section 3.2. Lastly, we will present the core components in the model training phase in Section 3.3.

3.1. Multi-Scale Feature Fusion

RS images typically contain information at various scales (e.g., small objects like cars and large objects like land areas) and have resolutions of $512 \times 512$ or larger. However, conventional vision models typically process images at a single scale (e.g., $224 \times 224$), which may not adequately capture the comprehensive content of RS images. Directly feeding RS images into the CLIP visual encoder may make small-scale objects difficult to distinguish due to the reduced resolution. Therefore, we introduce a multi-scale feature fusion (MFF) module, where multi-scale feature fusion refers to the integration of the global and local features of the image into a single feature map. The module interpolates the positional encoding of the pre-trained visual backbone of CLIP-ViT-L/14-336px to scale the input image size to $504 \times 504$ [16]. Here, we utilize bicubic interpolation [48], a method for interpolating data points on a two-dimensional grid that is often applied in image processing to achieve smoother results than bilinear interpolation [49]. It estimates the value of an unknown point by considering the 16 surrounding points in a $4 \times 4$ grid. The bicubic interpolation formula can be expressed as:
$$g(m, n) = \sum_{a=0}^{3} \sum_{c=0}^{3} w(a, c) \cdot p(m + a - 1, n + c - 1),$$
where $g(m, n)$ is the interpolated pixel value at the point $(m, n)$, $p(m + a - 1, n + c - 1)$ represents the pixel value at the point $(m + a - 1, n + c - 1)$ in the original data grid, and $w(a, c)$ are the weight coefficients given by cubic polynomial functions of the distances from the interpolated point to the surrounding grid points. The input image resolution is therefore scaled to $504 \times 504$, which extends the visual backbone of LLaVA-v1.5 to a higher resolution. Although this increases the number of patches per image from 576 to 1296, more than doubling it, the enhanced resolution improves the model’s ability to perceive details in RS images and provides a better visual foundation for high-resolution RS images.
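As an illustration of this resolution extension, the following minimal PyTorch sketch shows how ViT positional embeddings could be bicubically interpolated from the $24 \times 24$ grid of CLIP-ViT-L/14-336px to the $36 \times 36$ grid required for $504 \times 504$ inputs (576 to 1296 patches). The function name and tensor layout are assumptions for illustration; the paper does not provide its implementation.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int = 24, new_grid: int = 36) -> torch.Tensor:
    """Bicubically interpolate ViT positional embeddings.

    pos_embed: (1, 1 + old_grid**2, dim) -- [CLS] token followed by patch embeddings.
    For CLIP-ViT-L/14-336px, old_grid = 336 / 14 = 24; scaling the input to
    504 x 504 gives new_grid = 504 / 14 = 36 (i.e., 1296 patches).
    """
    cls_token, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) for 2-D interpolation.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_token, patch_pos], dim=1)
```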
Moreover, we expand RS images to multiple scales by dividing the original image into smaller sub-images to capture local information in the multi-scale feature fusion (MFF) module. For instance, given an input image $I_{\mathrm{ori}}$, we obtain the global feature map $F_{\mathrm{glo}}$ using the following formula:
$$F_{\mathrm{glo}} = E_v(I_{\mathrm{ori}}).$$
Next, we interpolate $I_{\mathrm{ori}}$ to twice its original size and divide the interpolated image into four sub-images $I_{\mathrm{sub}}^{\,i}$. The local feature map $F_{\mathrm{loc\_sub}}^{\,i}$ of each sub-image can be formulated as follows:
$$F_{\mathrm{loc\_sub}}^{\,i} = E_v(I_{\mathrm{sub}}^{\,i}), \quad i = 1, 2, 3, 4.$$
After merging the $F_{\mathrm{loc\_sub}}^{\,i}$ into a single feature map and average-pooling it to match the size of $F_{\mathrm{glo}}$, we obtain the local feature map $F_{\mathrm{loc}}$. We concatenate $F_{\mathrm{glo}}$ with $F_{\mathrm{loc}}$ to produce the multi-scale feature map $F_{\mathrm{multi\_scale}}$:
$$F_{\mathrm{multi\_scale}} = \mathrm{Concat}(F_{\mathrm{glo}}, F_{\mathrm{loc}}).$$
$F_{\mathrm{multi\_scale}}$ has the same spatial dimension as the single-scale features but a higher channel dimension (e.g., 2048 vs. 1024). This higher dimensionality allows subsequent layers to better capture the relationships between different features. By integrating global and local features, Co-LLaVA can effectively utilize both overall and detailed information, leading to more accurate classification and recognition of objects in RS images.
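The following PyTorch sketch illustrates one possible realization of the MFF computation described by the formulas above: a global pass over the full image, a local pass over four sub-images of the 2× upsampled image, average pooling, and channel-wise concatenation. The `visual_encoder` callable and the exact pooling layout are assumptions for illustration, not released code.

```python
import torch
import torch.nn.functional as F

def multi_scale_features(image: torch.Tensor, visual_encoder) -> torch.Tensor:
    """Sketch of the multi-scale feature fusion (MFF) described above.

    image: (B, 3, 504, 504); visual_encoder maps (B, 3, 504, 504) to
    patch features of shape (B, 1296, 1024) -- a placeholder for E_v.
    """
    # Global branch: encode the full image.
    f_glo = visual_encoder(image)                             # (B, 1296, 1024)

    # Local branch: upsample to 2x and split into four sub-images.
    up = F.interpolate(image, scale_factor=2, mode="bicubic", align_corners=False)
    h, w = up.shape[-2:]
    subs = [up[..., :h // 2, :w // 2], up[..., :h // 2, w // 2:],
            up[..., h // 2:, :w // 2], up[..., h // 2:, w // 2:]]
    f_subs = [visual_encoder(s) for s in subs]                # 4 x (B, 1296, 1024)

    # Merge sub-image features into one 2x2 grid and average-pool back
    # to the spatial size of the global feature map.
    g = int(f_glo.shape[1] ** 0.5)                            # 36
    grids = [f.transpose(1, 2).reshape(-1, f.shape[-1], g, g) for f in f_subs]
    top = torch.cat([grids[0], grids[1]], dim=-1)
    bottom = torch.cat([grids[2], grids[3]], dim=-1)
    f_loc = F.avg_pool2d(torch.cat([top, bottom], dim=-2), kernel_size=2)
    f_loc = f_loc.flatten(2).transpose(1, 2)                  # (B, 1296, 1024)

    # Concatenate along the channel dimension: (B, 1296, 2048).
    return torch.cat([f_glo, f_loc], dim=-1)
```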
After obtaining visual tokens from the visual encoder, we leverage an MLP (Multi-Layer Perceptron) adapter [35] with two linear layers to connect the visual backbone and the LLM. The forward propagation process in an MLP with two linear layers can be described mathematically.
Let $x \in \mathbb{R}^{n}$ be the input vector; the output $y$ is obtained by applying GeLU [50]:
$$y(x) = \mathrm{GeLU}(Wx + b),$$
where $W \in \mathbb{R}^{m \times n}$ is the weight matrix, $b \in \mathbb{R}^{m}$ is the bias vector, and $m$ is the number of neurons.
The MLP adapter learns complex mappings from inputs to outputs through a series of linear transformations followed by GeLU and optimizes its performance on the given task, leading to potentially higher accuracy and better generalization on unseen data. The adapter projects the visual tokens ($\mathbb{R}^{1296 \times 1024}$) from an input dimensionality of 1024 into the LLM embedding space, producing output vectors of size 4096.
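A minimal sketch of such an adapter is shown below, assuming the Linear–GeLU–Linear layout of LLaVA-v1.5’s “mlp2x_gelu” projector together with the 1024-to-4096 dimensions stated above; it is illustrative rather than the exact module used in the paper.

```python
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Two-layer MLP projector connecting the visual backbone to the LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # first linear layer
            nn.GELU(),                       # GeLU non-linearity
            nn.Linear(llm_dim, llm_dim),     # second linear layer
        )

    def forward(self, visual_tokens):        # (B, 1296, 1024)
        return self.proj(visual_tokens)      # (B, 1296, 4096)
```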

3.2. Model Collaboration

Cognitive dissonance theory [51] posits that individuals experience discomfort when confronted with two conflicting pieces of information. To reduce this discomfort, individuals strive to change their attitudes, beliefs, or behaviors. For example, when solving a math problem, if you provide an answer and a second person gives the same answer, your confidence increases. However, if the second person’s answer differs from yours, you might begin to doubt your own answer and recheck it. In other words, when we encounter a situation where someone else’s answer conflicts with our own, we experience cognitive dissonance and tend to re-evaluate our answer to alleviate this discomfort. Motivated by cognitive dissonance theory, we propose a model collaboration strategy for RS VQA tasks. Specifically, we utilize a lightweight model CoCa, and a large vision language model LLaVA-v1.5, to perform efficient RS VQA. CoCa utilizes contrastive loss to learn global representations while using captioning loss [52] to learn fine-grained region-level features, performing well across multiple downstream tasks. This model collaboration strategy allows LLaVA-v1.5 to re-evaluate its own answer after receiving the visual understanding result from CoCa.
Specifically, as represented in Figure 2, given a question, CoCa generates an answer  A C , and LLaVA-v1.5 generates an answer  A L . LVLMs often struggle to balance short and long answers [35], while answers in RS VQA tasks are typically short. To obtain answers with minimal redundant information, we designed  Prompt 1  to prompt LLaVA-v1.5 to generate short answers:
  • Prompt 1 : Answer in one word or a short phrase.
Next, we compare whether  A C  and  A L  are equal. If they are equal, we directly leverage  A L  as the final answer. Otherwise, we leverage both  A C  and  A L  in  Prompt 2 :
  • Prompt 2 : The possible answer is  A C  or  A L . The true answer may not be included in the possible answer. Answer in one word or a short phrase.
While designing Prompt 2, we consider that both $A_C$ and $A_L$ might be incorrect. Therefore, we incorporate “The true answer may not be included in the possible answer” into Prompt 2 to enhance LLaVA-v1.5’s rethinking capability. Finally, we feed Prompt 2 into LLaVA-v1.5 to generate the final visual understanding result.
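The collaboration logic can be summarized in a few lines of Python. The `coca` and `llava` callables below are hypothetical wrappers around the two fine-tuned models, and their signatures are assumptions for illustration; the comparison rule and prompt templates follow the description above.

```python
def co_llava_answer(image, question, coca, llava) -> str:
    """Sketch of the model collaboration (MC) strategy described above."""
    prompt_1 = f"{question} Answer in one word or a short phrase."
    a_c = coca(image, question)          # answer from the lightweight model
    a_l = llava(image, prompt_1)         # answer from the LVLM

    # If both models agree, keep the LVLM answer as the final result.
    if a_c.strip().lower() == a_l.strip().lower():
        return a_l

    # Otherwise, ask LLaVA-v1.5 to rethink given both candidate answers.
    prompt_2 = (f"{question} The possible answer is {a_c} or {a_l}. "
                "The true answer may not be included in the possible answer. "
                "Answer in one word or a short phrase.")
    return llava(image, prompt_2)
```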

3.3. Model Training

In this part, we introduce the core components of the model training phase, namely fine-tuning CoCa and fine-tuning LLaVA-v1.5. While fine-tuning CoCa, we apply a contrastive loss and a captioning loss to optimize CoCa’s responses. During the fine-tuning of LLaVA-v1.5, the MLP adapter and the CLIP visual encoder are frozen, while the large language model Vicuna-v1.5 (7B) is fine-tuned. LoRA (low-rank adaptation) [53] is employed to improve the training efficiency of Vicuna-v1.5 (7B), and an autoregressive loss is employed to optimize Vicuna-v1.5 (7B)’s responses. The pseudo-code for our method Co-LLaVA is shown in Algorithm 1.
Algorithm 1: Co-LLaVA: model collaboration for RS VQA.

3.3.1. Fine-Tuning CoCa

The core structure of CoCa includes an image encoder and a text decoder. The image encoder is employed to extract visual tokens, while the text decoder is employed to extract text tokens and generate responses. To facilitate the alignment between visual tokens and text tokens, we compute contrastive loss  L C o n  using the following formula:
$$\mathcal{L}_{Con} = -\frac{1}{H} \Bigg( \underbrace{\sum_{i}^{H} \log \frac{\exp\!\left(x_i^{\top} y_i / \sigma\right)}{\sum_{j=1}^{H} \exp\!\left(x_i^{\top} y_j / \sigma\right)}}_{\text{image-to-text}} + \underbrace{\sum_{i}^{H} \log \frac{\exp\!\left(y_i^{\top} x_i / \sigma\right)}{\sum_{j=1}^{H} \exp\!\left(y_i^{\top} x_j / \sigma\right)}}_{\text{text-to-image}} \Bigg),$$
where $x_i$ and $y_j$ are the normalized encodings of the image from the $i$-th image–text pair and the text from the $j$-th image–text pair, respectively, $H$ is the batch size, and $\sigma$ is the temperature coefficient utilized to scale the logits.
To generate responses that are closer to the ground truths, the captioning loss  L C a p  is utilized.  L C a p  can be formulated as:
$$\mathcal{L}_{Cap} = -\sum_{s=1}^{S} \log P_{\theta}\left(y_s \mid y_{<s}, x\right),$$
where $S$ represents the total number of tokens in the normalized text encoding, and $P_{\theta}(y_s \mid y_{<s}, x)$ denotes the likelihood maximized by the text decoder under the forward autoregressive factorization for the image $x$ paired with the text $y$.
While training CoCa, the total loss $\mathcal{L}_{CoCa}$, which combines $\mathcal{L}_{Con}$ with $\mathcal{L}_{Cap}$, can be computed by:
$$\mathcal{L}_{CoCa} = \lambda_1 \cdot \mathcal{L}_{Con} + \lambda_2 \cdot \mathcal{L}_{Cap},$$
where $\lambda_1$ and $\lambda_2$ are the weighting hyperparameters for the contrastive loss $\mathcal{L}_{Con}$ and the captioning loss $\mathcal{L}_{Cap}$, respectively. During training, the values of $\lambda_1$ and $\lambda_2$ are adjusted to control the contributions of the contrastive loss $\mathcal{L}_{Con}$ and the captioning loss $\mathcal{L}_{Cap}$ to the total loss $\mathcal{L}_{CoCa}$, until $\mathcal{L}_{CoCa}$ gradually decreases and stabilizes.
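For illustration, a compact PyTorch sketch of the combined objective is given below. The symmetric cross-entropy over the similarity matrix corresponds to $\mathcal{L}_{Con}$, and the token-level cross-entropy corresponds to $\mathcal{L}_{Cap}$; the temperature and loss weights shown are placeholder values rather than the ones used in the experiments.

```python
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_targets,
              sigma=0.07, lambda_1=1.0, lambda_2=2.0):
    """Sketch of the combined CoCa objective (contrastive + captioning).

    img_emb, txt_emb: (H, d) L2-normalized embeddings of H image-text pairs.
    caption_logits: (H, S, V) decoder logits; caption_targets: (H, S) token ids.
    """
    logits = img_emb @ txt_emb.t() / sigma                 # (H, H) similarity matrix
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric InfoNCE: image-to-text and text-to-image directions.
    l_con = F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)

    # Autoregressive captioning loss over the target tokens.
    l_cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                            caption_targets.reshape(-1))

    return lambda_1 * l_con + lambda_2 * l_cap
```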

3.3.2. Fine-Tuning LLaVA-v1.5

LoRA is an efficient model fine-tuning method designed to reduce the complexity of parameter updates while maintaining the capabilities of LVLMs (e.g., LLaVA-v1.5). The core idea of LoRA is to perform low-rank decomposition on the weight matrix within the model. Specifically, given a weight matrix W, LoRA adjusts it by adding a low-rank matrix A and another low-rank matrix B, represented mathematically as:
$$W' = W + A \cdot B.$$
Here, $W'$ is the updated weight matrix, $A$ is a low-rank matrix typically of size $m \times r$, and $B$ is another low-rank matrix of size $r \times n$. The rank $r$ is usually much smaller than the dimensions $m$ and $n$, which allows the updates to focus solely on the smaller matrices $A$ and $B$ rather than the entire weight matrix.
In LoRA, the forward propagation of the model can be expressed as:
$$\mathrm{Output} = f(W' \cdot X) = f\left((W + A \cdot B) \cdot X\right),$$
where f denotes the forward propagation function of the model, and X represents the input data. This adaptation mechanism effectively reduces the number of parameters that need to be trained, thereby lowering the computational burden associated with fine-tuning large pre-trained models.
When implementing LoRA, several parameter settings are crucial for its effectiveness. The rank r determines the dimensions of the low-rank matrices. The open-source large language model Vicuna-v1.5 (7B) [54] is the foundation for LLaVA-v1.5. During the fine-tuning process, it is common to freeze most weights of Vicuna-v1.5 (7B) while training only the low-rank matrices A and B.
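A self-contained sketch of a LoRA-adapted linear layer is given below. The $\alpha/r$ scaling and the zero initialization of $A$ are standard LoRA implementation details assumed here; they are not spelled out in the text above.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: W' = W + (alpha / r) * A @ B.

    The frozen base weight W stays untouched; only the low-rank factors
    A (m x r) and B (r x n) are trained.
    """
    def __init__(self, base: nn.Linear, r: int = 64, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(m, r))          # zero init: no update at start
        self.B = nn.Parameter(torch.empty(r, n))
        nn.init.kaiming_uniform_(self.B, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x):
        # f((W + A B) x) = W x + scale * A (B x)
        return self.base(x) + self.scale * (x @ self.B.t() @ self.A.t())
```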
Vicuna-v1.5 (7B) takes a sequence of visual and language tokens as input and generates responses in an autoregressive manner. This approach involves predicting the next word in a sequence given the previous words, which allows for the modeling of language sequentially. We employ the autoregressive loss in Vicuna-v1.5 (7B) to optimize the generation of text.
Specifically, given a sequence of length L, we compute the autoregressive loss  L L L a V A  by:
$$\mathcal{L}_{LLaVA}\left(K_a \mid K_v, K_{instruct}\right) = -\sum_{z=1}^{L} \log p_{\theta}\left(k_z \mid K_v, K_{instruct,<z}, K_{a,<z}\right),$$
where $\theta$ denotes the trainable parameters, $K_v$ is the image, $K_{instruct}$ is the instructions from all rounds, $K_{instruct,<z}$ represents the instructions from all rounds before the currently predicted word $k_z$, and $K_{a,<z}$ represents the answers from all rounds before the currently predicted word $k_z$.
The autoregressive loss is a fundamental component of the large language model Vicuna-v1.5 (7B). By predicting the conditional probabilities of words in a sequence, these models are able to generate coherent and contextually relevant text. The optimization process through the negative log-likelihood ensures that the model learns to represent the underlying distribution of language accurately.
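The loss above can be implemented as a masked next-token cross-entropy. The sketch below assumes a mask that marks the answer tokens $K_a$, so that image and instruction tokens do not contribute to the loss; the tensor layout is illustrative.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(logits, target_ids, answer_mask):
    """Sketch of the autoregressive (next-token) loss over answer tokens.

    logits: (B, L, V) decoder outputs; target_ids: (B, L) token ids;
    answer_mask: (B, L) with 1 on answer tokens and 0 elsewhere.
    """
    # Shift so that position z predicts token z + 1.
    shift_logits = logits[:, :-1, :]
    shift_targets = target_ids[:, 1:]
    shift_mask = answer_mask[:, 1:].float()

    nll = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                          shift_targets.reshape(-1), reduction="none")
    nll = nll.reshape(shift_targets.shape) * shift_mask
    # Negative log-likelihood averaged over the supervised (answer) tokens.
    return nll.sum() / shift_mask.sum().clamp(min=1)
```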

4. Experiments and Results

4.1. Datasets and Evaluation Metrics

  • RSVQA-HR. RSVQA-HR [37] contains 10,659 high-resolution images ( 512 × 512 ) and 1,066,316 question–answer pairs. The 15 cm resolution images in the dataset are sourced from the USGS High-Resolution Orthophoto (HRO), with each image covering an area of 5898 square meters. Based on the total number of images, the dataset is divided into a training set (61.5%), a validation set (11.2%), and two test sets (Test Set 1 at 20.5% and Test Set 2 at 6.8%). The distinction between Test Set 1 and Test Set 2 is that Test Set 1 covers areas similar to those in the training and validation sets, while Test Set 2 covers previously unseen areas during training. This dataset includes four types of questions: object presence (answer: yes/no), object comparison (answer: yes/no), object counting, and area estimation. In RSVQA-HR, the questions of area estimation are quantified into five categories: 0, between 1 and 10, between 11 and 100, between 101 and 1000, and greater than 1000, with the unit being square meters.
  • RSVQA-LR. RSVQA-LR [37] contains 772 low-resolution images ( 256 × 256 ) and 77,232 question–answer pairs. These images are sourced from the Sentinel-2 satellite over The Netherlands, with a resolution as low as 10 m. Based on the total number of images, 77.8% of the dataset is utilized for training, 11.1% for validation, and 11.1% for testing. The dataset includes four types of questions: object presence (answer: yes/no), object comparison (answer: yes/no), rural/urban classification (answer: rural/urban), and object counting. In RSVQA-LR, the questions of object counting are quantified into five categories: 0, between 1 and 10, between 11 and 100, between 101 and 1000, and greater than 1000.
  • CRSVQA. CRSVQA [55] consists of 4639 images sourced from a publicly available RS classification aerial image dataset (AID) [56]. It includes 30 different scene types, with approximately 230 images per type, each sized at 600 × 600 pixels. Additionally, CRSVQA contains a total of 4644 question–answer pairs. The questions are roughly categorized into three groups: scene understanding, object detection, and relationship reasoning. The CRSVQA subset contains 1000 image–question pairs, which we divided into a 70% training set and a 30% test set.
We selected a random 10% of RSVQA-HR, all of RSVQA-LR, and 70% of the subset of CRSVQA as the training data. For CoCa, we utilized the original training data; for LLaVA-v1.5, we constructed an instruction-following dataset based on the original training data, providing the model with  Prompt 1  “Answer the question using a single word or phrase”. To evaluate the performance of Co-LLaVA, we utilized the test sets of RSVQA-LR, RSVQA-HR, and the subset of CRSVQA. We evaluated the answers to all questions in each test set via accuracy, average accuracy (AA), and overall accuracy (OA).
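These metrics can be computed straightforwardly from per-question records, as in the sketch below; the record format (question type, prediction, ground truth) is an assumption for illustration. AA averages the per-type accuracies, while OA is the accuracy over all questions.

```python
from collections import defaultdict

def vqa_metrics(records):
    """Compute per-type accuracy, average accuracy (AA), and overall accuracy (OA).

    `records` is a list of dicts with keys "type", "prediction", and
    "ground_truth" (one entry per question in a test set).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        if r["prediction"].strip().lower() == r["ground_truth"].strip().lower():
            correct[r["type"]] += 1

    per_type = {t: correct[t] / total[t] for t in total}
    aa = sum(per_type.values()) / len(per_type)          # mean of per-type accuracies
    oa = sum(correct.values()) / sum(total.values())     # accuracy over all questions
    return per_type, aa, oa
```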

4.2. Experimental Details

To adapt Co-LLaVA for RS VQA tasks, we fine-tuned CoCa_ViT-L-14 [34] for 6 epochs. During the fine-tuning of Vicuna-v1.5 (7B) on our RS instruction following dataset, we froze the weights of the CLIP image encoder and the MLP adapter and updated the pre-trained weights of Vicuna-v1.5 (7B). We utilized the resolution-enhanced pre-trained CLIP-ViT-L/14-336px [33] as the image encoder (scaling from  336 × 336  to  504 × 504 ). Given the large number of parameters in Vicuna-v1.5 (7B), full fine-tuning could be challenging and computationally expensive. To address this issue, we employed LoRA [53], a low-rank adaptation fine-tuning technique. LoRA improved training efficiency while preventing the LLM from forgetting its original knowledge from the source domain when fine-tuned on the target domain [57]. In the training process, we set the rank r of LoRA to 64 and  α  to 16. We utilized the Adam optimizer with a learning rate of  1 × 10 4 . We trained Co-LLaVA on 2 × NVIDIA GeForce RTX 3090 GPUs with 24 GB of memory for approximately 20 h. The parameter count of Co-LLaVA is 7B.
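For reference, the LoRA settings above could be expressed with the Hugging Face PEFT library roughly as follows; the target module names and the dropout value are typical choices for LLaMA/Vicuna-style models and are assumptions not stated in the text.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=64,                     # LoRA rank r
    lora_alpha=16,            # scaling factor alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,        # assumed regularization value
    task_type="CAUSAL_LM",
)

# `vicuna` would be the loaded Vicuna-v1.5 (7B) backbone of LLaVA-v1.5:
# model = get_peft_model(vicuna, lora_config)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```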

4.3. Comparison Results and Analysis

We compared the performance of Co-LLaVA on the test sets of RSVQA-LR, RSVQA-HR, and the subset of CRSVQA with existing models, including lightweight models and large vision language models. The average inference time on one image-instruction pair for Co-LLaVA is approximately 0.59 s. The comparison results are presented in Table 1, Table 2, Table 3 and Table 4.
Table 1 and Table 2 demonstrate the performance of existing models on the two test sets of RSVQA-HR. In Table 1, our model Co-LLaVA outperforms all other models across multiple question types. For example, Co-LLaVA achieves an accuracy of 70.12% on the “Count” question type, which is an improvement of 1.49% over the baseline model RSVQA. Compared to large vision language models RSGPT and SkyEyeGPT, Co-LLaVA also delivers the best performance on “Presence” and “Comparison” question types, with accuracies of 92.56% and 92.20%, respectively. In terms of the OA metric, Co-LLaVA outperforms all existing models, achieving a score of 85.55%. In Table 2, Co-LLaVA achieves higher accuracy on “Count” and “Comparison” question types than all existing models. For the “Count” question type, Co-LLaVA’s accuracy increases by 0.09% compared to the best existing method SHRNet. Although Co-LLaVA does not achieve the highest accuracy on the “Presence” question type compared to all existing models, it surpasses all existing lightweight models with an accuracy of 89.85%, second only to RSGPT’s accuracy of 89.87%. Overall, Co-LLaVA achieves state-of-the-art (SOTA) results on both the AA and OA metrics, with scores of 81.06% and 81.84%, respectively. Considering Co-LLaVA’s performance on both test sets of RSVQA-HR, it is evident that Co-LLaVA achieves higher or nearly equivalent accuracy across all question types compared to existing LVLMs. We also observe that Co-LLaVA’s overall performance in Table 2 is inferior to its performance in Table 1. A possible reason for this is that Test Set 1 (Table 1) covers areas similar to those in the training and validation sets, whereas Test Set 2 (Table 2) covers areas that are relatively different from those in the training and validation sets.
Table 3 demonstrates the accuracy of existing models on the test set of RSVQA-LR. From the table, our model Co-LLaVA achieves the highest accuracy on the “Comparison” and “Rural/Urban” question types, outperforming the best existing model SkySenseGPT by 0.73% and 3%, respectively. Although Co-LLaVA does not achieve the highest accuracy on the “Presence” question type, it remains the second-best. Regarding the AA metric, Co-LLaVA surpasses all lightweight models. Furthermore, compared to RS-LLaVA, Co-LLaVA shows an improvement of 0.79% in the AA metric.
Combining the results from Table 1, Table 2 and Table 3, it can be observed that general-purpose LVLMs (e.g., MiniGPT-4 [59] and Shikra [60]) have significantly lower accuracy across all question types compared to LVLMs fine-tuned on RS data (e.g., GeoChat [16] and RSGPT [42]), which demonstrates that fine-tuning on RS data provides domain-specific advantages for general-purpose LVLMs. In the types of questions related to scene understanding, object detection, and relationship reasoning, the accuracy of Co-LLaVA surpasses that of the general-purpose LLaVA-v1.5 [13] and the domain-specific GeoChat [16], as shown in Table 4.

4.4. Ablation Study

Co-LLaVA consists of multiple essential components, and the effectiveness of these components needs to be discussed and validated. In this section, we conducted several ablation experiments using the following variants of Co-LLaVA on four test sets (Test Set 1 of RSVQA-HR, Test Set 2 of RSVQA-HR, test set of RSVQA-LR, and test set of the subset of CRSVQA) to verify the contribution of each component to the overall prediction performance.
(1) Co-LLaVA w/o MC: We removed the model collaboration (MC) strategy, meaning that the visual understanding results were not corrected using CoCa* (a variant of CoCa trained on RS VQA data).
(2) Co-LLaVA w/o MFF and MC: We removed the multi-scale feature fusion (MFF) module, and the model collaboration (MC) strategy. The model relied solely on the fine-tuned LLaVA-v1.5 (7B) to generate visual understanding results.
As illustrated in Table 5, Co-LLaVA outperforms the other variants, confirming the effectiveness of the components designed in Co-LLaVA. Compared with Co-LLaVA w/o MC, Co-LLaVA delivers higher performance, which demonstrates that the model collaboration (MC) strategy is effective. The lightweight model CoCa* can help Co-LLaVA w/o MC rethink its original response, so that Co-LLaVA can give a correct response in most cases. Additionally, compared with Co-LLaVA w/o MFF and MC, Co-LLaVA w/o MC delivers higher performance, indicating the effectiveness of the MFF module. The MFF module effectively integrates the global and local features of remote sensing images, enabling Co-LLaVA to better perceive image details. We can also observe that the improvements from MFF and MC are more pronounced on RSVQA-LR than on RSVQA-HR. A possible explanation is that MFF and MC compensate for the limited information carried by each pixel in low-resolution images; in other words, they enhance the model’s feature extraction and reasoning capabilities.

4.5. Qualitative Analysis

In this section, we conducted a qualitative analysis of the effectiveness of Co-LLaVA. Figure 3, Figure 4, Figure 5 and Figure 6 demonstrate visual examples of typical image–question–response trios selected from Test Set 1 of RSVQA-HR, Test Set 2 of RSVQA-HR, the test set of RSVQA-LR, and the test set of the subset of CRSVQA, respectively. Compared to Co-LLaVA w/o MC, Co-LLaVA is able to predict the correct answers in most cases with the help of CoCa*, even when faced with challenging counting problems. Figure 6 features diverse scenes, demonstrating that Co-LLaVA can provide accurate responses based on questions across multiple scenes. Certainly, there are some edge cases. In the “Count” question type of Figure 3, when CoCa* generates the incorrect answer “6” and Co-LLaVA w/o MC generates the correct answer “0”, Co-LLaVA is able to maintain the original correct answer by considering both responses. Similarly, in the “Comparison” and “Presence” question types of Figure 5, when either CoCa* or Co-LLaVA w/o MC generates the correct answer, Co-LLaVA is able to produce the correct answer in most cases. More interestingly, as shown in the “Area” question type of Figure 4, when CoCa* and Co-LLaVA w/o MC provide different answers, Co-LLaVA gives a third answer after evaluating the two, and the third answer matches the ground truth, demonstrating that Prompt 2 grants Co-LLaVA the ability to rethink. We also compared responses from our method Co-LLaVA with other methods (i.e., GeoChat and RS-LLaVA) on the test set of RSVQA-LR in Figure 7. Figure 7 demonstrates that Co-LLaVA could respond correctly when faced with challenging image–question pairs in most cases. For instance, in Figure 7a, both GeoChat and RS-LLaVA provide incorrect answers for the “Rural/Urban” and “Comparison” question types, while Co-LLaVA successfully delivers the correct responses. Similarly, in Figure 7b, for the “Count” question type, GeoChat and RS-LLaVA also yield incorrect answers, whereas Co-LLaVA is able to provide the correct answer. Regarding the “Presence” question types illustrated in Figure 7, either GeoChat or RS-LLaVA tends to give incorrect answers, while Co-LLaVA consistently offers correct responses.

5. Limitations and Discussion

Although the experiments confirm that our proposed Co-LLaVA is superior to many advanced RS VQA methods on multiple metrics, several limitations and directions for future work must be discussed.
Firstly, the Co-LLaVA model we proposed focuses on the domain of RS VQA. In future work, we will extend Co-LLaVA to other generative cross-modal tasks in RS scenes (e.g., visual grounding, image captioning, etc.) to enhance the scalability of Co-LLaVA.
Secondly, in unseen domains, the model might struggle to effectively leverage the knowledge it has acquired, leading to a decline in performance. Future research should focus on integrating knowledge graphs or other external knowledge bases to help models make better inferences when faced with unseen domains.
Thirdly, the datasets that Co-LLaVA utilizes comprise eight question types. While we have demonstrated the effectiveness of the MFF and MC modules through ablation studies, the question types addressed by Co-LLaVA may still lack comprehensiveness. In the future, we will expand the scope of questions that Co-LLaVA can address.
Finally, the effectiveness of the model collaboration (MC) strategy heavily depends on the performance of the models utilized. In rare cases, erroneous responses from CoCa* may mislead Co-LLaVA, compromising the correct answers that Co-LLaVA w/o MC would otherwise produce. Therefore, making modular improvements to the lightweight models or introducing additional models may help alleviate such issues. In future work, we will explore more efficient model collaboration strategies, such as knowledge distillation or parameter pruning, to reduce the computational burden while maintaining performance.

6. Conclusions

This paper explored the potential capabilities of LVLMs in RS VQA tasks and proposed Co-LLaVA via a model collaboration (MC) strategy. The visual understanding results of the lightweight generative model CoCa were utilized as a reference, allowing LLaVA-v1.5 to correct or reduce its erroneous visual understanding results. This strategy could effectively mitigate the hallucination phenomenon in LVLMs. To adapt Co-LLaVA to the varying object scales in RS images, we introduced a multi-scale feature fusion (MFF) module to fuse global and local information. This module enhanced Co-LLaVA’s perception of details in RS images via increased resolution. We compared Co-LLaVA with existing methods across four test sets and conducted extensive experiments to evaluate the effectiveness of Co-LLaVA and its components. The experimental results demonstrated that Co-LLaVA significantly outperforms the current state-of-the-art methods across multiple metrics.

Author Contributions

Methodology, F.L., W.D. and C.Z.; investigation, F.L.; data curation, W.D. and J.Z.; writing—original draft preparation, W.D.; writing—review and editing, F.L., W.D., C.Z., J.Z. and L.Y.; supervision, L.Y. and X.L.; funding acquisition, F.L. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the Fundamental Research Funds for the Central Universities (No. B240201077), National Nature Science Foundation of China (No. 62372155 and No. 62302149), Aeronautical Science Fund (No. 2022Z071108001), Joint Fund of Ministry of Education for Equipment Pre-research (No. 8091B022123), Water Science and Technology Project of Jiangsu Province under grant No. 2021063, Qinglan Project of Jiangsu Province, Changzhou science and technology project No. 20231313. The work of Liang Yao was supported in part by Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. SJCX24_0183).

Data Availability Statement

Our code and data will be released at https://github.com/demo1shining/Co-LLaVA (accessed on 27 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LVLM: Large Vision Language Model
LLM: Large Language Model
RS: Remote Sensing
VQA: Visual Question Answering
LLaVA: Large Language and Vision Assistant
CoCa: Contrastive Captioners
MFF: Multi-scale Feature Fusion
MC: Model Collaboration
LoRA: Low-Rank Adaptation
MLP: Multi-Layer Perceptron

References

  1. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision Language Foundation Model for Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar] [CrossRef]
  2. Li, X.; Wen, C.; Hu, Y.; Zhou, N. RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497. [Google Scholar] [CrossRef]
  3. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep learning-based object detection techniques for remote sensing images: A survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
  4. Zheng, X.; Wang, B.; Du, X.; Lu, X. Mutual attention inception network for remote sensing visual question answering. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5606514. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Jiao, L.; Li, L.; Liu, X.; Chen, P.; Liu, F.; Li, Y.; Guo, Z. A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4400815. [Google Scholar] [CrossRef]
  6. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  7. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  8. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  9. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  10. Najdenkoska, I.; Zhen, X.; Worring, M. Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning. arXiv 2023, arXiv:2302.14794. [Google Scholar]
  11. Gong, T.; Lyu, C.; Zhang, S.; Wang, Y.; Zheng, M.; Zhao, Q.; Liu, K.; Zhang, W.; Luo, P.; Chen, K. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arXiv 2023, arXiv:2305.04790. [Google Scholar]
  12. Li, B.; Zhang, Y.; Chen, L.; Wang, J.; Yang, J.; Liu, Z. Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv 2023, arXiv:2305.03726. [Google Scholar]
  13. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  14. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  15. Yang, Z.; Li, L.; Lin, K.; Wang, J.; Lin, C.C.; Liu, Z.; Wang, L. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv 2023, arXiv:2309.17421. [Google Scholar]
  16. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 27831–27840. [Google Scholar]
  17. Wu, H.; Li, Z.L. Scale issues in remote sensing: A review on analysis, processing and modeling. Sensors 2009, 9, 1768–1793. [Google Scholar] [CrossRef] [PubMed]
  18. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
  19. Navalgund, R.; Jayaraman, V.; Kumar, A.K.; Sharma, T.; Mathews, K.; Mohanty, K.; Dadhwal, V.; Potdar, M.; Singh, T.; Ghosh, R.; et al. Remote sensing data acquisition, platforms and sensor requirements. J. Indian Soc. Remote Sens. 1996, 24, 207–237. [Google Scholar] [CrossRef]
  20. Zhang, H.; Yang, Z.; Zhang, L.; Shen, H. Super-resolution reconstruction for multi-angle remote sensing images considering resolution differences. Remote Sens. 2014, 6, 637–657. [Google Scholar] [CrossRef]
  21. Mahdianpari, M.; Salehi, B.; Rezaee, M.; Mohammadimanesh, F.; Zhang, Y. Very deep convolutional neural networks for complex land cover mapping using multispectral remote sensing imagery. Remote Sens. 2018, 10, 1119. [Google Scholar] [CrossRef]
  22. Hu, S.; Wang, L. Automated urban land-use classification with remote sensing. Int. J. Remote Sens. 2013, 34, 790–803. [Google Scholar] [CrossRef]
  23. Karalas, K.; Tsagkatakis, G.; Zervakis, M.; Tsakalides, P. Land classification using remotely sensed data: Going multilabel. IEEE Trans. Geosci. Remote Sens. 2016, 54, 3548–3563. [Google Scholar] [CrossRef]
  24. Zhou, Y.; Feng, L.; Ke, Y.; Jiang, X.; Yan, J.; Yang, X.; Zhang, W. Towards Vision-Language Geo-Foundation Model: A Survey. arXiv 2024, arXiv:2406.09385. [Google Scholar]
  25. Luo, J.; Pang, Z.; Zhang, Y.; Wang, T.; Wang, L.; Dang, B.; Lao, J.; Wang, J.; Chen, J.; Tan, Y.; et al. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding. CoRR 2024. abs/2406.10100. [Google Scholar] [CrossRef]
  26. Muhtar, D.; Li, Z.; Gu, F.; Zhang, X.; Xiao, P. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. arXiv 2024, arXiv:2402.02544. [Google Scholar]
  27. Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery. Remote Sens. 2024, 16, 1477. [Google Scholar] [CrossRef]
  28. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a Fine-Grained Multiscale Method for Cross-Modal Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  29. Cheng, G.; Han, J.; Lu, X. Remote Sensing Image Scene Classification: Benchmark and State of the Art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  30. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of multimodal large language models: A survey. arXiv 2024, arXiv:2404.18930. [Google Scholar]
  31. Elhenawy, M.; Abutahoun, A.; Alhadidi, T.I.; Jaber, A.; Ashqar, H.I.; Jaradat, S.; Abdelhay, A.; Glaser, S.; Rakotonirainy, A. Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges. arXiv 2024, arXiv:2407.00092. [Google Scholar] [CrossRef]
  32. Xie, J.; Chen, Z.; Zhang, R.; Wan, X.; Li, G. Large multimodal agents: A survey. arXiv 2024, arXiv:2402.15116. [Google Scholar]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  34. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
  35. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 26296–26306. [Google Scholar]
  36. Shi, B.; Wu, Z.; Mao, M.; Wang, X.; Darrell, T. When Do We Not Need Larger Vision Models? arXiv 2024, arXiv:2403.13043. [Google Scholar]
  37. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
  38. Bazi, Y.; Rahhal, M.M.A.; Mekhalfi, M.L.; Zuair, M.A.A.; Melgani, F. Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4708011. [Google Scholar] [CrossRef]
  39. Yuan, Z.; Mou, L.; Wang, Q.; Zhu, X.X. From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623111. [Google Scholar] [CrossRef]
  40. Li, Y.; Ma, Y.; Liu, G.; Wei, Q.; Chen, Y.; Shang, R.; Jiao, L. Enhancing Remote Sensing Visual Question Answering: A Mask-Based Dual-Stream Feature Mutual Attention Network. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6007805. [Google Scholar] [CrossRef]
  41. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5917820. [Google Scholar] [CrossRef]
  42. Hu, Y.; Yuan, J.; Wen, C.; Lu, X.; Li, X. RSGPT: A Remote Sensing Vision Language Model and Benchmark. arXiv 2023, arXiv:2307.15266. [Google Scholar]
  43. Pang, C.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Weng, X.; Wang, S.; Feng, L.; Xia, G.S.; et al. H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model. arXiv 2024, arXiv:2403.20213. [Google Scholar]
  44. Hong, W.; Wang, W.; Lv, Q.; Xu, J.; Yu, W.; Ji, J.; Wang, Y.; Wang, Z.; Zhang, Y.; Li, J.; et al. CogAgent: A Visual Language Model for GUI Agents. arXiv 2024, arXiv:2312.08914. [Google Scholar]
  45. Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv 2023, arXiv:2305.14325. [Google Scholar]
  46. Li, R.; Patel, T.; Du, X. PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. arXiv 2024, arXiv:2307.02762. [Google Scholar]
  47. Chen, J.C.Y.; Saha, S.; Bansal, M. ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. arXiv 2024, arXiv:2309.13007. [Google Scholar]
  48. Li, Y.; Qi, F.; Wan, Y. Improvements on bicubic image interpolation. In Proceedings of the 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chengdu, China, 20–22 December 2019; Volume 1, pp. 1316–1320. [Google Scholar]
  49. Kirkland, E.J.; Kirkland, E.J. Bilinear interpolation. In Advanced Computing in Electron Microscopy; Springer: Boston, MA, USA, 2010; pp. 261–263. [Google Scholar]
  50. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2023, arXiv:1606.08415. [Google Scholar]
  51. Harmon-Jones, E.; Harmon-Jones, C. Cognitive dissonance theory. Handbook of Motivation Science; Oxford University Press: Oxford, UK, 2012; Volume 71. [Google Scholar]
  52. Wang, F.; Liu, H. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 2495–2504. [Google Scholar]
  53. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  54. Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing Gpt-4 with 90%* Chatgpt Quality. 2023, Volume 2, p. 6. Available online: https://vicuna.lmsys.org (accessed on 14 April 2023).
  55. Zhang, M.; Chen, F.; Li, B. Multi-step question-driven visual question answering for remote sensing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4704912. [Google Scholar]
  56. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  57. Biderman, D.; Ortiz, J.G.; Portes, J.; Paul, M.; Greengard, P.; Jennings, C.; King, D.; Havens, S.; Chiley, V.; Frankle, J.; et al. Lora learns less and forgets less. arXiv 2024, arXiv:2405.09673. [Google Scholar]
  58. Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. CoRR 2023. abs/2310.09478. [Google Scholar] [CrossRef]
  59. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  60. Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; Zhao, R. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. CoRR 2023. abs/2306.15195. [Google Scholar] [CrossRef]
  61. Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv 2024, arXiv:2401.09712. [Google Scholar]
Figure 1. Motivation for our work. Existing LVLMs for RS still face the following challenges: (1) Hallucinations in LVLMs for RS have consistently troubled researchers. Multi-agent debate for collaborative inference is often employed to mitigate hallucinations, but it demands significant computational power and imposes a substantial burden on hardware devices. In contrast, by pairing a lightweight model with an LVLM, our model collaboration approach achieves efficient RS VQA. (2) There are significant scale differences between objects in RS images, such as small objects (e.g., cars) and large objects (e.g., bridges and buildings). Existing LVLMs often overlook these scale variations. Extracting multi-scale features can enhance an LVLM's perception of RS image details. (3) Because RS images contain many small objects, reducing the resolution to fit the visual encoder makes these objects even harder to distinguish; a relatively small input resolution is therefore insufficient to capture the details present in RS images. Enlarging the visual encoder's resolution benefits RS VQA.
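To illustrate the multi-scale idea raised in point (2) above, the sketch below resizes an image to several resolutions, encodes each copy with a shared vision encoder, and fuses the resulting features. This is only a minimal illustration of multi-scale feature extraction and fusion; the encoder, the choice of scales, the token pooling, and the linear projection are assumptions made for the example and are not the exact MFF module used in Co-LLaVA.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiScaleFusionSketch(nn.Module):
    """Illustrative multi-scale fusion: encode an image at several resolutions
    with a shared encoder and project the concatenated pooled features.
    This is a sketch of the general idea, not the paper's MFF module."""

    def __init__(self, encoder, feat_dim, out_dim, scales=(224, 336, 448)):
        super().__init__()
        self.encoder = encoder          # shared vision encoder returning (B, N, feat_dim)
        self.scales = scales            # assumed input resolutions
        self.proj = nn.Linear(feat_dim * len(scales), out_dim)

    def forward(self, image):           # image: (B, 3, H, W)
        feats = []
        for s in self.scales:
            x = F.interpolate(image, size=(s, s), mode="bilinear", align_corners=False)
            f = self.encoder(x)         # (B, N_s, feat_dim)
            feats.append(f.mean(dim=1)) # pool tokens to one vector per scale
        fused = torch.cat(feats, dim=-1)      # (B, feat_dim * num_scales)
        return self.proj(fused)               # (B, out_dim)

# Usage with a dummy encoder standing in for a real ViT backbone.
dummy_encoder = lambda x: torch.randn(x.shape[0], 16, 64)
mff = MultiScaleFusionSketch(dummy_encoder, feat_dim=64, out_dim=128)
print(mff(torch.randn(2, 3, 512, 512)).shape)  # torch.Size([2, 128])
```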
Figure 2. Framework of the proposed Co-LLaVA. The multi-scale visual features are generated via our multi-scale feature fusion (MFF) module. CoCa and the large language model (LLM) leverage visual and text features to produce their respective answers. If the visual understanding results from CoCa and the LLM differ, we use both answers as a prompt and re-input them into the LLM for a final answer.
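To make the collaboration step in Figure 2 concrete, the following sketch shows the agreement check between the two models. The functions `coca_answer` and `llava_answer` and the wording of the follow-up prompt are hypothetical placeholders standing in for CoCa and LLaVA-v1.5 inference; the sketch only assumes what the caption states, namely that a disagreement triggers a second LLM pass conditioned on both candidate answers.

```python
# Minimal sketch of the model-collaboration step described in Figure 2.
# `coca_answer` and `llava_answer` are hypothetical stand-ins; in practice
# they would wrap CoCa and LLaVA-v1.5 inference calls.

def coca_answer(image, question):
    # Placeholder: a fine-tuned CoCa head mapping (image, question) to a short answer.
    return "yes"

def llava_answer(image, prompt):
    # Placeholder: the LVLM decoding an answer from the image and prompt.
    return "no"

def collaborate(image, question):
    a_c = coca_answer(image, question)                      # lightweight model's answer
    a_l = llava_answer(image, f"Question: {question}")      # LVLM's first-pass answer

    if a_c.strip().lower() == a_l.strip().lower():
        return a_l                                          # agreement: accept directly

    # Disagreement: feed both candidate answers back to the LVLM for a final decision.
    prompt_2 = (
        f"Question: {question}\n"
        f"Candidate answer 1: {a_c}\n"
        f"Candidate answer 2: {a_l}\n"
        "The two answers disagree. Re-examine the image and give the final answer."
    )
    return llava_answer(image, prompt_2)

print(collaborate(image=None, question="Is a road present in the image?"))
```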
Figure 3. Typical image–question–response trios from Test Set 1 of RSVQA-HR. Given an image–question pair, CoCa* generates an answer A_C. With Prompt_1, Co-LLaVA w/o MC (without the help of CoCa*) generates another answer A_L. With A_C, A_L, and Prompt_2, Co-LLaVA (with the help of CoCa*) generates the final answer. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 4. Typical image–question–response trios from Test Set 2 of RSVQA-HR. In the bottom-left corner of the figure, when CoCa* and Co-LLaVA w/o MC (without the help of CoCa*) both generate wrong answers, Co-LLaVA can produce a third, correct response after rethinking thanks to Prompt_2. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 5. Typical image–question–response trios from the test set of RSVQA-LR. When Co-LLaVA w/o MC (without the help of CoCa*) generates an answer different from CoCa*'s, Co-LLaVA can revise its response with the help of CoCa* in most cases. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 6. Typical image–question–response trios from the test set of the CRSVQA subset. With the help of CoCa*, Co-LLaVA can generate the correct response in most cases. S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Figure 7. Response comparison with other methods (i.e., GeoChat and RS-LLaVA) on the test set of RSVQA-LR. (a) Both GeoChat and RS-LLaVA give incorrect answers for the "Rural/Urban" and "Comparison" question types, while Co-LLaVA delivers the correct responses. (b) For the "Count" question type, GeoChat and RS-LLaVA also yield incorrect answers, whereas Co-LLaVA provides the correct answer. Blue indicates the question type, while green indicates a correct answer and red indicates a wrong answer.
Table 1. Accuracy comparison with other methods on Test Set 1 of the RSVQA-HR dataset. The values after "±" are the standard deviations. The bold numbers indicate the highest accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Area | AA | OA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 68.63 | 90.43 | 88.19 | 85.24 | 83.12 | 83.23 |
| EasyToHard [39] | 148.83M | 69.06 | 91.39 | 89.75 | 85.92 | 83.97 | 84.16 |
| Bi-Modal [38] | - | 69.80 | 92.03 | 91.83 | 86.27 | 84.98 | 85.30 |
| SHRNet [5] | 105.56M | 70.04 | 92.45 | 91.68 | 86.35 | 85.13 | 85.39 |
| MADNet [40] | - | 70.02 | 92.36 | 91.87 | 86.58 | 85.21 | 85.51 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 43.34 | 63.97 | 64.69 | 1.27 | 43.32 | 38.19 |
| MiniGPT-v2 [58] | 7B | - | 64.80 | 59.17 | - | - | - |
| MiniGPT-4 [59] | 7B | - | 52.91 | 54.76 | - | - | - |
| Shikra [60] | 13B | - | 58.85 | 57.40 | - | - | - |
| RSGPT [42] | 13B | - | 91.86 | 92.15 | - | - | - |
| SkyEyeGPT [61] | 7B | - | 84.95 | 85.63 | - | - | - |
| Co-LLaVA | 7B | 70.12 ± 0.33 | 92.56 ± 0.24 | 92.20 ± 0.27 | 85.49 ± 0.44 | 85.09 ± 0.08 | 85.55 ± 0.18 |
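For reference, the AA and OA columns in Tables 1–4 can be read with the usual RS VQA convention, which is consistent with the figures reported here (e.g., in the LLaVA-v1.5 row of Table 1, AA equals the mean of the four per-type accuracies): AA is the unweighted mean of the per-question-type accuracies, while OA is the accuracy pooled over all questions. A minimal sketch, assuming this convention and using illustrative question counts:

```python
# AA: unweighted mean of per-type accuracies; OA: pooled accuracy over all questions.
def average_and_overall_accuracy(correct_per_type, total_per_type):
    per_type_acc = {t: correct_per_type[t] / total_per_type[t] for t in total_per_type}
    aa = sum(per_type_acc.values()) / len(per_type_acc)
    oa = sum(correct_per_type.values()) / sum(total_per_type.values())
    return per_type_acc, aa, oa

# Toy example (counts are illustrative, not the dataset's real question counts).
correct = {"Count": 70, "Presence": 92, "Comparison": 92, "Area": 85}
total = {"Count": 100, "Presence": 100, "Comparison": 100, "Area": 100}
print(average_and_overall_accuracy(correct, total))
```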
Table 2. Accuracy comparison with other methods on Test Set 2 of the RSVQA-HR dataset. The values after "±" are the standard deviations. The bold numbers indicate the highest accuracy among existing models, and underlined numbers indicate the second-best accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Area | AA | OA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 61.47 | 86.26 | 85.94 | 76.33 | 77.50 | 78.23 |
| EasyToHard [39] | 148.83M | 61.95 | 87.97 | 87.68 | 78.62 | 79.06 | 79.29 |
| Bi-Modal [38] | - | 63.06 | 89.37 | 89.62 | 80.12 | 80.54 | 81.23 |
| SHRNet [5] | 105.56M | 63.42 | 89.81 | 89.44 | 80.37 | 80.76 | 81.37 |
| MADNet [40] | - | 63.38 | 89.69 | 89.82 | 80.58 | 80.87 | 81.51 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 42.14 | 68.15 | 65.72 | 0.64 | 44.19 | 39.64 |
| MiniGPT-v2 [58] | 7B | - | 66.34 | 59.40 | - | - | - |
| MiniGPT-4 [59] | 7B | - | 50.43 | 52.60 | - | - | - |
| Shikra [60] | 13B | - | 57.28 | 56.63 | - | - | - |
| GeoChat [16] | 7B | - | 58.45 | 83.19 | - | - | - |
| EarthGPT [41] | 13B | - | 62.77 | 79.53 | - | - | - |
| RSGPT [42] | 13B | - | 89.87 | 89.68 | - | - | - |
| SkyEyeGPT [61] | 7B | - | 83.50 | 80.28 | - | - | - |
| H2RSVLM [43] | 7B | - | 65.00 | 83.70 | - | - | - |
| SkySenseGPT [25] | 7B | - | 69.14 | 84.14 | - | - | - |
| Co-LLaVA | 7B | 63.51 ± 0.26 | 89.85 ± 0.16 | 90.73 ± 0.13 | 80.14 ± 0.35 | 81.06 ± 0.19 | 81.84 ± 0.28 |
Table 3. Accuracy comparison with other methods on the test set of the RSVQA-LR dataset. The values after "±" are the standard deviations. The bold numbers indicate the highest accuracy among existing models, and underlined numbers indicate the second-best accuracy among existing models.
| Method | # Parameters | Count | Presence | Comparison | Rural/Urban | AA | OA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Lightweight Models: | | | | | | | |
| RSVQA [37] | 85.69M | 67.01 | 87.46 | 81.50 | 90.00 | 81.49 | 79.08 |
| EasyToHard [39] | 148.83M | 69.22 | 90.66 | 87.49 | 91.67 | 84.76 | 83.09 |
| Bi-Modal [38] | - | 72.22 | 91.06 | 91.16 | 92.66 | 86.78 | 85.56 |
| SHRNet [5] | 105.56M | 73.87 | 91.03 | 90.48 | 94.00 | 87.34 | 85.85 |
| MADNet [40] | - | 72.85 | 90.96 | 91.68 | 95.00 | 87.62 | 85.97 |
| Large Vision Language Models: | | | | | | | |
| LLaVA-v1.5 [13] | 7B | 26.13 | 54.45 | 65.72 | 59.00 | 51.32 | 50.66 |
| MiniGPT-v2 [58] | 7B | - | 49.85 | 63.09 | 59.00 | - | - |
| MiniGPT-4 [59] | 7B | - | 43.86 | 57.55 | 62.00 | - | - |
| Shikra [60] | 13B | - | 46.47 | 60.31 | 63.62 | - | - |
| GeoChat [16] | 7B | - | 91.09 | 90.33 | 94.00 | - | - |
| RSGPT [42] | 13B | - | 91.17 | 91.70 | 94.00 | - | - |
| SkyEyeGPT [61] | 7B | - | 88.63 | 75.00 | 88.93 | - | - |
| LHRS-Bot [26] | 7B | - | 88.51 | 90.00 | 89.07 | - | - |
| H2RSVLM [43] | 7B | - | 89.58 | 89.79 | 88.00 | - | - |
| SkySenseGPT [25] | 7B | - | 91.07 | 92.00 | 95.00 | - | - |
| RS-LLaVA [27] | 7B | 74.38 | 92.80 | 91.33 | 94.00 | 88.13 | 86.95 |
| Co-LLaVA | 7B | 73.53 ± 0.32 | 91.44 ± 0.22 | 92.73 ± 0.30 | 98.00 ± 1.00 | 88.92 ± 0.12 | 86.75 ± 0.23 |
Table 4. Accuracy comparison with LLaVA-v1.5 and GeoChat on the test set of the CRSVQA subset. The results for LLaVA-v1.5 and GeoChat were generated using open-source model weights. S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively. The values after "±" are the standard deviations.
| Method | # Parameters | S.U. | O.D. | R.R. | AA | OA |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-v1.5 [13] | 7B | 21.92 | 36.84 | 22.37 | 27.04 | 25.91 |
| GeoChat [16] | 7B | 20.55 | 28.95 | 18.42 | 22.64 | 21.59 |
| Co-LLaVA | 7B | 79.45 ± 0.22 | 61.84 ± 0.44 | 78.29 ± 0.52 | 73.19 ± 0.39 | 74.42 ± 0.19 |
Table 5. Test accuracy (%) of each variant of our method Co-LLaVA on four test sets (Test Set 1 of RSVQA-HR, Test Set 2 of RSVQA-HR, the test set of RSVQA-LR, and the test set of the CRSVQA subset). S.U., O.D., and R.R. stand for scene understanding, object detection, and relationship reasoning, respectively.
RSVQA-HR (Test Set 1)
| Methods | Count | Presence | Comparison | Area |
| --- | --- | --- | --- | --- |
| (w/o) MFF and MC | 69.49 | 92.26 | 91.94 | 85.23 |
| (w/o) MC | 69.78 | 92.40 | 92.05 | 85.44 |
| Co-LLaVA | 70.12 | 92.56 | 92.20 | 85.49 |

RSVQA-HR (Test Set 2)
| Methods | Count | Presence | Comparison | Area |
| --- | --- | --- | --- | --- |
| (w/o) MFF and MC | 63.14 | 89.35 | 90.66 | 79.95 |
| (w/o) MC | 63.42 | 89.51 | 90.69 | 80.01 |
| Co-LLaVA | 63.51 | 89.85 | 90.73 | 80.14 |

RSVQA-LR (Test Set)
| Methods | Count | Presence | Comparison | Rural/Urban |
| --- | --- | --- | --- | --- |
| (w/o) MFF and MC | 72.54 | 90.80 | 91.46 | 94.00 |
| (w/o) MC | 72.66 | 91.02 | 92.61 | 96.00 |
| Co-LLaVA | 73.53 | 91.44 | 92.73 | 98.00 |

Subset of CRSVQA (Test Set)
| Methods | S.U. | O.D. | R.R. |
| --- | --- | --- | --- |
| (w/o) MFF and MC | 77.65 | 59.87 | 73.68 |
| (w/o) MC | 78.08 | 60.47 | 74.34 |
| Co-LLaVA | 79.45 | 61.84 | 78.29 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
