Article

DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark

by Haodong Li 1, Xiaofeng Zhang 2 and Haicheng Qu 1,*
1 School of Software, Liaoning Technical University, Huludao 125000, China
2 Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(4), 719; https://doi.org/10.3390/rs17040719
Submission received: 27 December 2024 / Revised: 17 February 2025 / Accepted: 17 February 2025 / Published: 19 February 2025

Abstract

With the rapid development of large visual language models (LVLMs) and multimodal large language models (MLLMs), these models have demonstrated strong performance in various multimodal tasks. However, alleviating the generation of hallucinations remains a key challenge in LVLMs research. For remote sensing LVLMs, existing datasets and evaluation methods suffer from low quality, limited quantity, and poor reliability. As a result, these models are prone to hallucinations when applied to remote sensing tasks, leading to unsatisfactory performance. This paper proposes a more reliable and effective instruction set production process for remote sensing LVLMs to address these issues. The process generates detailed and accurate instruction sets through strategies such as shallow-to-deep reasoning, internal and external considerations, and manual quality inspection. Based on this production process, we collect 1.6 GB of remote sensing images to create the DDFAV dataset, which covers a variety of remote sensing LVLMs tasks. Finally, we develop a closed binary classification polling evaluation method, RSPOPE, specifically designed to evaluate hallucinations in remote sensing LVLMs or MLLMs visual question-answering tasks. Using this method, we evaluate the zero-shot remote sensing visual question-answering capabilities of multiple mainstream LVLMs. Our proposed dataset images, corresponding instruction sets, and evaluation method files are all open source.

1. Introduction

Remote sensing technology uses image data acquired from high altitudes to analyze and understand various phenomena on the Earth’s surface. It is widely applied in fields such as land cover classification [1,2], disaster monitoring [3,4], and environmental protection [5,6]. In recent years, with the rapid development of multimodal learning technology, researchers in the field of remote sensing have begun exploring how to apply large visual language models (LVLMs) and multimodal large language models (MLLMs) to the processing and analysis of remote sensing data. These models can not only process complex remote sensing images but also generate rich descriptions and explanations in combination with natural language, greatly enhancing the application value of remote sensing data. They have made breakthroughs in tasks such as image caption generation, scene classification, and complex reasoning. Specifically, SkyEyeGPT [7] constructs a multimodal dialogue instruction dataset containing both single-task and multi-task components and designs a two-stage adjustment method to successfully unify large language models (LLMs) and remote sensing images, further enhancing the model’s multi-task processing capabilities. RS-LLaVA [8] builds RS-instructions, a remote sensing LVLMs dataset for image captioning and visual question answering, and fine-tunes the dataset based on the open-source LLaVA model, significantly improving the model’s performance in remote sensing image understanding. EarthGPT [9] designs a visual enhancement perception mechanism to address the imbalance in remote sensing object recognition performance at different scales and proposes a cross-modal method to enhance image and language understanding, strengthening the model’s ability to process data in various modalities. GeoChat [10], based on image-level LVLMs dialogue, achieves regional-level dialogue by accepting user input coordinates, greatly improving interactivity and flexibility between users and models.
Furthermore, although LVLMs and MLLMs demonstrate excellent performance in multiple fields, the hallucination problem remains one of the major challenges these models face. Hallucination refers to the phenomenon where the content generated by the model is inconsistent with the input image or does not exist at all. From a cognitive science perspective, hallucination is similar to the “completion” phenomenon in the human brain when perceiving incomplete information. When faced with blurred or unclear images, the human visual system often fills in the missing details. In computer science, hallucinations arise from the model’s reasoning mechanism and data incompleteness. This issue is particularly common in tasks such as image description and visual question answering. This phenomenon not only reduces the accuracy of the model and causes users to distrust it, but it may also lead to catastrophic outcomes in high-precision application scenarios, such as medical diagnosis, autonomous driving, and others. There are significant differences between remote sensing images and natural images. These differences are evident not only in the complexity of image content and target size but also in the intricate spatial reasoning required for remote sensing images captured from a unique bird’s-eye perspective. These factors often lead LVLMs to produce hallucinations in remote sensing tasks. As shown in Figure 1, in the image description task, the answers generated by three different LVLMs exhibit hallucination phenomena, including object counting errors and fabricated objects. This highlights the limitations of current remote sensing visual language models and datasets in addressing hallucinations. Current hallucination research on LVLMs mainly focuses on developing new hallucination evaluation benchmarks or improving data quality [11,12,13,14,15,16,17,18,19], using external measures such as human knowledge or LLMs for comparative feedback [20,21,22,23,24,25,26,27,28,29,30,31], improving decoding strategies during LVLMs output generation [28,32,33,34,35,36,37], and designing dedicated frameworks or mechanisms to alleviate hallucinations [38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56].
In current research on remote sensing LVLMs and MLLMs, the main approach to alleviating hallucinations is to generate negative samples with the help of external knowledge for comparative feedback during the training process. Although this approach alleviates hallucinations to some extent, it incurs high training costs, involves complex processes, and is typically applicable to a single task. For example, LHRS-Bot-Nova [57] reduces hallucinations in remote sensing object positioning by adding a large number of negative samples to the dataset and optimizing the MLLMs structure. H2RSVLM [58] introduces incorrect questions and answers about the non-existence of objects, based on the basic remote sensing image text dataset, to enhance the robustness of model training and reduce hallucinations. RS-GPT4V [59] opts to add negative samples across various remote sensing scenes in the instruction-following dataset. However, alleviating the hallucination phenomenon in remote sensing LVLMs by improving the quality of datasets and instruction sets, as well as developing new and effective remote sensing LVLMs hallucination evaluation methods, has not been fully explored.
Most existing remote sensing LVLMs datasets are either restricted to a single task [60,61] or lack diversity and detail [62], limiting the models’ generalization and multi-task processing capabilities. In addition, the annotation quality of many datasets is inconsistent, there is a lack of complex scene reasoning data [63], and the captions in image caption instruction sets are too short [64], which leads to significant bias or errors in the trained models when processing diverse remote sensing images. Therefore, there is an urgent need for a high-quality remote sensing visual language dataset that covers a wider range of scenes, perspectives, and categories and supports multiple tasks, ranging from simple image description to complex reasoning.
Current evaluation methods for remote sensing LVLMs also have notable shortcomings. For example, in image captioning tasks, most evaluation methods focus on the overall similarity between the generated text and the reference text, while overlooking the ability of LVLMs to accurately recognize specific objects in remote sensing images, such as small or overlapping objects. Metrics like BLEU [65] and METEOR [66] emphasize n-gram overlap between generated and reference captions, ROUGE-L [67] focuses on the longest common subsequence between texts, and CIDEr-D [68] measures consistency across multiple reference captions. For visual question-answering tasks, methods such as RSVQA [69] include four types of questions: existence, comparison, rural/urban classification, and counting, seemingly providing a comprehensive evaluation of remote sensing LVLMs. However, the subjectivity and variability introduced by open-ended questions present a significant challenge for achieving consistent and reliable evaluation results.
In light of the above analysis, we choose to focus our research on developing new hallucination assessment benchmarks and enhancing data quality. First, we improve the quality of data annotation by designing a rigorous and precise instruction generation process. Based on this, high-quality visible light remote sensing images are selected to produce a multi-task remote sensing LVLMs dataset, addressing the scarcity of datasets and further helping to alleviate hallucinations. Second, a stable, consistent, and reliable hallucination assessment method for remote sensing visual question-answering tasks is developed, taking into account the particularities of images in the remote sensing field. This method evaluates the degree of hallucination of remote sensing LVLMs from a new perspective. Finally, compared with other hallucination mitigation methods, the development of new hallucination assessment benchmarks and data filtering techniques offers a higher degree of automation and scalability. These advances can be extended to a wider range of remote sensing tasks and application scenarios in the future.
The main contributions of this paper can be summarized as follows:
  • We analyze the shortcomings of existing remote sensing LVLMs in terms of dedicated datasets and evaluation methods, and design a Chain-of-Thought (CoT) production process to create a more complex and accurate remote sensing LVLMs instruction set.
  • We collect 1755 high-quality visible light remote sensing images as the DDFAV dataset, which covers multiple categories and perspectives and spans a wide range of object scales. We then produce the corresponding instruction set based on the CoT process, covering image captioning, complex reasoning, and visual question-answering tasks.
  • We develop an RSPOPE evaluation method, based on POPE [11], for current remote sensing LVLMs using the proposed DDFAV dataset and employ mainstream LVLMs to evaluate their zero-shot remote sensing image recognition capabilities.

2. Materials and Methods

2.1. Source and Selection of DDFAV Image Dataset

Our proposed remote sensing LVLMs dataset, DDFAV, integrates five mainstream target detection datasets: DIOR [70], DOTA [71], FAIR1M [72], AI-TOD [73], and VisDrone-2019 [74]. Each dataset has unique advantages and compensates for the shortcomings of the others, together forming a comprehensive, multi-perspective, object-rich remote sensing LVLMs dataset with challenging multi-scenario content. Table 1 shows the relevant information of the DDFAV instruction set.
DIOR is a dataset widely used for target detection in optical remote sensing images. It provides standard remote sensing satellite image perspectives and high-quality annotations, covering a variety of scene categories, such as ground track field, expressway service area, and train station. Recent research, DIOR-RSVG [75], further expands the application scope of DIOR. In the visual localization task, DIOR-RSVG introduces additional annotation information, including object bounding boxes, category labels, and text descriptions, enabling the model to accurately locate and understand the position and attributes of specific objects in the image. It also explores how to enhance the visual localization ability of remote-sensing LVLMs by improving both the dataset and the model. As a result, DIOR is not only an essential part of DDFAV but also provides a valuable foundation for both current and future research.
DOTA is one of the largest satellite image object detection datasets, containing more than 180,000 annotated instances across 15 different object categories, including helicopter, helipad, and container-crane, which addresses the limitations of DIOR. Additionally, unlike DIOR, DOTA images have higher resolution, denser objects, and more complex scenes. Therefore, DOTA is chosen to complement the support for high-resolution, dense objects in DDFAV, enhancing the model’s ability to handle complex scenes.
FAIR1M focuses on fine-grained target recognition tasks in high-resolution remote sensing images and is suitable for scenes that require accurate classification and localization. The key feature of this dataset is that it includes a variety of fine-grained categories, such as airplane (e.g., A220, A321, C919), ships (e.g., Dry-Cargo-Ship, Liquid-Cargo-Ship, Engineering-Ship), and vehicles (e.g., Dump-Truck, Cargo-Truck, Bus). Therefore, FAIR1M is chosen to provide DDFAV with more refined target recognition capabilities, particularly when processing high-resolution images, which helps improve the model’s accuracy and robustness.
AI-TOD focuses on small target detection in aerial images and is particularly suitable for processing targets that are small in size and difficult for traditional models to recognize. The dataset contains more than 100,000 annotated instances, with 87.7% of the targets smaller than 32 × 32 pixels, an average size of only 12.8 pixels, and a standard deviation of 5.9 pixels. Therefore, AI-TOD is chosen to enhance the performance of DDFAV in detecting small objects, addressing the challenges of recognizing small targets in remote sensing LVLMs when processing visible light remote sensing images.
VisDrone-2019 is a target detection dataset designed specifically for drone aerial images, covering a variety of targets, such as pedestrians, cyclists, and vehicles. Unlike traditional remote sensing satellite images, VisDrone-2019 provides images captured from a drone’s perspective, increasing the dataset’s diversity. VisDrone-2019 is chosen to complement DDFAV by supporting low-altitude perspectives and filling the gap in object categories, such as pedestrians and cyclists, which are relatively underrepresented in other datasets. Additionally, the inclusion of drone aerial images allows DDFAV to better simulate real-world application scenarios.
Finally, we merge all the remote sensing object categories from these five datasets into 29 categories, which are then divided into four groups based on the functional attributes of the objects: transport infrastructure, energy and public facilities, sports facilities, and transportation and people. We select 1755 images to create the DDFAV instruction set, totaling 1.6 GB of images. The object category and quantity information are shown in Figure 2, including large objects (such as airports, stadiums, and bridges) and small objects (such as cars, pedestrians, and ships). DDFAV not only covers various scenes in urban and rural areas but also provides different perspectives from satellites and drones. These features make DDFAV a high-quality and diverse remote sensing LVLMs dataset that effectively supports various remote sensing multimodal tasks.
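To make the grouping concrete, the sketch below shows one way the merged categories could be organized into the four functional groups in code. Only a handful of category names are listed, and their assignment to groups is illustrative rather than the exact DDFAV mapping.

```python
# Illustrative sketch of the four functional groups used to organize the 29
# merged DDFAV categories. The category names and their group assignments
# shown here are examples only, not the complete or official mapping.
DDFAV_GROUPS = {
    "transport infrastructure": ["airport", "bridge", "harbor", "train station"],
    "energy and public facilities": ["dam", "windmill", "storage tank"],
    "sports facilities": ["stadium", "ground track field", "tennis court"],
    "transportation and people": ["airplane", "ship", "vehicle", "pedestrian", "cyclist"],
}

# Reverse lookup from a category name to its functional group.
CATEGORY_TO_GROUP = {
    category: group
    for group, categories in DDFAV_GROUPS.items()
    for category in categories
}

print(CATEGORY_TO_GROUP["stadium"])  # -> "sports facilities"
```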

2.2. DDFAV Instruction Set CoT Production Process

In this section, we describe in detail the process of creating the DDFAV instruction set using the idea of CoT prompts. The process for creating the instruction set for each image is shown in Figure 3. Our CoT design mainly includes three steps: user questioning, GPT answering, and quality inspection and iteration.
Users need to clarify their questions or concerns when asking. These questions can involve object recognition in images, scene function inference, relationships between objects, and more. To ensure that GPT’s answers accurately reflect the user’s intentions, users should be as specific as possible when asking questions. Additionally, during the questioning process, users need to define the required answer range and expected output format, providing necessary background information for GPT’s subsequent responses. GPT generates corresponding answers based on the user’s questions, combining the image content and dataset characteristics, and offers ideas for subsequent quality inspection and optimization. After GPT generates the answers, users manually check whether the generated content accurately reflects the actual content of the image. Key inspection points include the accuracy of object recognition, the completeness of the description, and the rationality of the reasoning. If the user finds GPT’s answers to be inaccurate or incomplete, they can provide feedback and ask GPT to correct them.
Combined with CoT’s structured thinking process, multiple iterations of different tasks for each image allow GPT to understand the image more deeply and comprehensively. As a result, the generated answers become more accurate and standardized, ultimately achieving user satisfaction. The word cloud statistical visualization of the DDFAV instruction set, based on the questions asked by humans and the content of GPT’s answers, is shown in Figure 4.
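A minimal sketch of this question–answer–inspection loop is given below. The functions `ask_gpt` and `human_review` are hypothetical placeholders standing in for the chat-model call and the manual quality check; the real pipeline is driven interactively by annotators.

```python
# Minimal sketch of the CoT question-answer-inspection loop, assuming a chat
# model accessed through `ask_gpt` and a manual check represented by
# `human_review`. Both are hypothetical placeholders, not the authors' code.

def ask_gpt(conversation):
    # Placeholder: in practice this would call a chat-completion API with the
    # full conversation (including the uploaded image).
    return "draft answer to: " + conversation[-1]["content"]

def human_review(answer):
    # Placeholder: an annotator checks object accuracy, description
    # completeness, and reasoning, and returns (approved, feedback).
    return True, ""

def generate_answer(conversation, question, max_rounds=3):
    """Ask one question and iterate with feedback until the answer passes review."""
    conversation.append({"role": "user", "content": question})
    answer = ""
    for _ in range(max_rounds):
        answer = ask_gpt(conversation)
        conversation.append({"role": "assistant", "content": answer})
        approved, feedback = human_review(answer)
        if approved:
            break
        # Feed the inspection result back and ask GPT to correct itself.
        conversation.append({"role": "user", "content": "Please revise: " + feedback})
    return answer
```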

2.2.1. Preparation Stage

In the preparation phase, the user assigns GPT the role of a remote sensing expert and invites it to perform remote sensing image analysis, clearly expressing the intention to collaborate with GPT for step-by-step reasoning and problem-solving. The user’s initial request is as follows: “You will act as an expert in remote sensing image analysis. I will send you a visible light remote sensing image, and together we will reason through it step by step. You will provide answers based on my questions and the analysis of the image. <image\n>” The user then uploads a visible light remote sensing image. This image becomes the foundation of the entire CoT process, with all subsequent analysis and reasoning centered around it. After receiving the image and the user’s initial request, GPT confirms and asks for the user’s next instructions. “OK, I have received the visible light remote sensing images you sent me, please tell me your specific questions or areas you want to focus on. We will solve them together step by step”.
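In code, the preparation stage amounts to seeding the conversation with this role-play request and the image before any task-specific questions are asked. The message structure below is a sketch only; how the image is actually attached depends on the chat interface used, so the `<image\n>` placeholder from the prompt is kept verbatim.

```python
# Sketch of the conversation state after the preparation stage. The prompt
# text is quoted from the paper; the message format is an assumption and the
# "<image\n>" placeholder marks where the uploaded image would be attached.
conversation = [
    {
        "role": "user",
        "content": (
            "You will act as an expert in remote sensing image analysis. "
            "I will send you a visible light remote sensing image, and together "
            "we will reason through it step by step. You will provide answers "
            "based on my questions and the analysis of the image. <image\\n>"
        ),
    },
    {
        "role": "assistant",
        "content": (
            "OK, I have received the visible light remote sensing images you "
            "sent me, please tell me your specific questions or areas you want "
            "to focus on. We will solve them together step by step."
        ),
    },
]
```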

2.2.2. Detailed Description Stage

For the detailed image description task, we design three sub-steps to gradually refine GPT’s understanding of the image and guide it toward generating satisfactory answers: object recognition, brief description, and detailed description. In the object recognition step, GPT identifies all visible remote sensing objects in the image. The user then checks for any missed or misidentified objects and provides feedback. In the brief image description step, GPT provides a three to four sentence description based on user instructions and the recognized objects. This description covers the main objects and scene features in the image, but without going into too much detail. In the detailed image description step, GPT combines the results of the first two stages with the user’s quality inspection feedback to generate a six-to-seven-sentence detailed description of the image. This description includes more details, the relationships between objects, scene functions, and other important information. The final detailed description is then collected in the instruction set.
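The three sub-steps can be expressed as successive calls to the question–answer–inspection loop sketched in Section 2.2. The prompt wording below is illustrative, not the exact phrasing used during annotation, and `generate_answer` is the hypothetical helper from the earlier sketch.

```python
# Hypothetical composition of the detailed description stage on top of the
# generate_answer loop sketched earlier; the prompt wording is illustrative.
def detailed_description_stage(conversation, generate_answer):
    # Intermediate results remain in the conversation as context for later steps.
    objects = generate_answer(
        conversation,
        "Step 1: Identify all visible remote sensing objects in the image.")
    brief = generate_answer(
        conversation,
        "Step 2: Using the objects above, briefly describe the image in three "
        "to four sentences.")
    detailed = generate_answer(
        conversation,
        "Step 3: Combining the object list, the brief description, and my "
        "feedback, write a detailed six-to-seven-sentence description of the image.")
    # Only the final detailed description is collected into the instruction set.
    return detailed
```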

2.2.3. Complex Reasoning Stage

The complex reasoning stage ensures a progressively refined understanding of the image through three increasingly detailed steps. The first step infers the overall function of the scene, the second step further deduces additional potential information in the image, and the third step combines the inferences from the first two stages for a comprehensive complex analysis. Each stage includes manual quality inspection and iterative feedback to GPT, ensuring the accuracy and completeness of the final result. The output from the third step is included in the instruction set as the answer to the complex reasoning task, providing rich background information and a solid analytical foundation for subsequent visual question-answering tasks.

2.2.4. Visual Question-Answering Stage

The final task is visual question answering. We combine the image with the previous detailed description and complex reasoning results to perform three types of visual question-answering tasks: object counts, object colors, and object location. The human user asks GPT to provide two specific question–answer pairs for the relevant tasks, repeating this process three times. GPT generates clear answers based on the image content, followed by a user quality check step after each response to ensure accuracy. Finally, the outputs of these question–answer pairs are collected into the instruction set. Figure 5 shows a production example of our instruction set.
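Putting the three task outputs together, one instruction-set entry per image could look like the sketch below. The field names and example values are assumptions made for illustration; the paper specifies the collected content but not an exact file schema.

```python
import json

# Illustrative structure of one DDFAV instruction-set entry. Field names and
# example values are hypothetical; only the three collected task outputs
# (detailed description, complex reasoning, VQA pairs) follow the paper.
entry = {
    "image": "example_image.jpg",  # hypothetical file name
    "detailed_description": "A six-to-seven-sentence description of the scene ...",
    "complex_reasoning": "A comprehensive analysis combining the scene function "
                         "and the additional information inferred from the image ...",
    "visual_question_answering": [
        {"task": "object counts", "question": "How many airplanes are visible?",
         "answer": "Three."},
        {"task": "object colors", "question": "What color is the largest ship?",
         "answer": "Red."},
        {"task": "object location", "question": "Where is the bridge located?",
         "answer": "In the upper-left part of the image, crossing the river."},
    ],
}

print(json.dumps(entry, indent=2))
```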

2.3. RSPOPE Metric Evaluation

To evaluate the hallucination performance of LVLMs in visual question-answering tasks when processing remote sensing images, we design a binary classification hallucination evaluation method specifically for the proposed DDFAV dataset, inspired by the POPE evaluation method. This method is named RSPOPE. RSPOPE not only inherits the advantages of POPE in stability and scalability for binary classification tasks, but also further designs three levels of difficulty—easy, medium, and hard—based on the number and categories of objects in the remote sensing images. This allows for a more comprehensive and intuitive evaluation of LVLMs’ performance in remote sensing tasks. Based on the sampling strategy and difficulty level, a total of nine pairwise combination evaluation methods are defined. Figure 6 shows three of the RSPOPE evaluation methods.
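For clarity, the nine evaluation settings are simply the Cartesian product of the three difficulty levels and the three sampling strategies, as the short sketch below enumerates.

```python
from itertools import product

# The nine RSPOPE evaluation settings: every pairing of a difficulty level
# with a sampling strategy.
DIFFICULTY_LEVELS = ("easy", "medium", "hard")
SAMPLING_STRATEGIES = ("random", "popular", "adversarial")

EVALUATION_SETTINGS = list(product(DIFFICULTY_LEVELS, SAMPLING_STRATEGIES))
print(len(EVALUATION_SETTINGS))  # -> 9
```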

2.3.1. RSPOPE Sampling Strategy

RSPOPE adopts the same sampling strategies as POPE, including random sampling, popular sampling, and adversarial sampling, but adjusts them according to the characteristics of the DDFAV dataset and its 29 object categories. In all three sampling strategies, the binary classification answer for object categories that already exist in the image is set to “Yes”, while object categories that do not appear are sampled with the binary classification answer set to “No”. The three strategies are described below, followed by a minimal code sketch.
  • Random sampling: An object category is randomly selected with equal probability from the remaining categories to create the “No” binary classification answer. This randomness ensures the diversity of category samples, captures the distribution of objects in different scenes, and conducts preliminary tests on the remote sensing basic object category recognition and visual question-answering capabilities of LVLMs.
  • Popular sampling: First, the frequencies of occurrence of the 29 object categories in all images of the DDFAV dataset are counted and sorted from high to low. Then, based on this sorted list, the remaining object categories that do not appear in the image are selected to supplement the question–answer pairs with the “No” answer. This sampling method prioritizes the more common or important categories in practical applications, which are also more likely to experience hallucination phenomena. As a result, it helps evaluate the anti-interference ability of LVLMs and presents a more challenging test than the random sampling method.
  • Adversarial sampling: First, the co-occurrence frequencies of each object category with other object categories in the DDFAV dataset are sorted from high to low. To further demonstrate the statistical process of adversarial sampling, we create a heat map of the adversarial matrix, shown in Figure 7, which illustrates the co-occurrence frequencies of the 29 object categories. Based on this statistical matrix, object categories that do not appear in the image but co-occur most frequently with the categories that do appear are selected. Adversarial sampling increases the challenge and robustness of hallucination evaluation for LVLMs, helping to distinguish whether LVLMs truly understand the image or are making inferences based on common sense—such as assuming that where there are airplanes, there must be airports, or where there are ships, there must be harbors.
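The sketch below outlines the three negative-sampling strategies under the assumption that `annotations` maps each image to the set of categories present in it; it is a simplified illustration, not the released evaluation code.

```python
import random
from collections import Counter
from itertools import combinations

# Simplified sketch of the three RSPOPE negative-sampling strategies, assuming
# `annotations` maps an image id to the set of categories present in it and
# `category_freq` / `cooccur` are precomputed dataset statistics.

def random_negatives(present, all_categories, k, rng=random):
    """Pick k absent categories uniformly at random ('No' answers)."""
    absent = [c for c in all_categories if c not in present]
    return rng.sample(absent, k)

def popular_negatives(present, category_freq, k):
    """Pick the k most frequent categories (over the whole dataset) that are absent."""
    ranked = [c for c, _ in category_freq.most_common() if c not in present]
    return ranked[:k]

def adversarial_negatives(present, cooccur, k):
    """Pick absent categories that co-occur most often with the present ones."""
    scores = Counter()
    for p in present:
        for c, n in cooccur[p].items():
            if c not in present:
                scores[c] += n
    return [c for c, _ in scores.most_common(k)]

def build_cooccurrence(annotations):
    """Count, for every category pair, how many images contain both categories."""
    cooccur = {c: Counter() for cats in annotations.values() for c in cats}
    for cats in annotations.values():
        for a, b in combinations(sorted(cats), 2):
            cooccur[a][b] += 1
            cooccur[b][a] += 1
    return cooccur
```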

2.3.2. Difficulty Level

Given the large number of objects and the wide range of category scenes in different remote sensing images, we further design three difficulty levels based on the number of object instances and object categories in a single image, in increasing order of object and category counts: Easy, Medium, and Hard. We select 100 remote sensing images for each difficulty level to provide more detailed evaluation methods for different LVLMs. The specific definition of each level is as follows, with a code sketch of the level assignment after the list:
  • Easy level: Each image must contain at least 2 different categories of objects, with the total number of objects not exceeding 5. Each image generates 6 binary classification problems.
  • Medium level: Each image must contain at least 3 different categories of objects, with the total number of objects between 6 and 10. Each image generates 8 binary classification problems.
  • Hard level: Each image must contain at least 4 different categories of objects, with the total number of objects being 11 or more. Each image generates 10 binary classification problems.
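The level definitions above translate directly into a small selection rule; a sketch is given below, where `assign_level` and the `LEVELS` table are illustrative helpers rather than part of the released code.

```python
# Sketch of how images could be assigned to the three RSPOPE difficulty levels
# and how many binary questions each image yields, assuming the number of
# distinct categories and the total object count are known per image.

LEVELS = {
    # level: (min distinct categories, object-count predicate, questions per image)
    "easy":   (2, lambda n: n <= 5,       6),
    "medium": (3, lambda n: 6 <= n <= 10, 8),
    "hard":   (4, lambda n: n >= 11,      10),
}

def assign_level(num_categories, num_objects):
    for level, (min_cats, count_ok, _) in LEVELS.items():
        if num_categories >= min_cats and count_ok(num_objects):
            return level
    return None  # image does not qualify for any level

def questions_for(level):
    return LEVELS[level][2]

# Example: an image with 3 categories and 7 objects falls in the medium level
# and contributes 8 yes/no questions.
print(assign_level(3, 7), questions_for("medium"))
```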

2.3.3. Evaluation Metrics

To comprehensively evaluate the object hallucination problem of different LVLMs in remote sensing tasks, RSPOPE introduces several commonly used evaluation metrics for binary classification problems, including Accuracy, Precision, Recall, F1-Score, and Yes Ratio. To further elucidate the role of these indicators in the RSPOPE evaluation method, we introduce four basic counts: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). We then provide a detailed explanation of each, complemented by their corresponding formulas. “Accuracy” refers to the proportion of correct predictions made by the LVLMs across all binary classification tasks. It measures the overall classification performance of the LVLMs. The formula is as follows:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
The advantage is that it directly evaluates the overall performance of the model. The disadvantage is that, when the dataset categories are imbalanced, “Accuracy” may be dominated by correct predictions of high-frequency categories (such as vehicles, ships, people, etc.), which can obscure the performance of low-frequency categories (such as dams, windmills, train stations, etc.). “Precision” refers to the proportion of correctly predicted positive samples by the LVLMs. It measures the accuracy of the LVLMs in identifying the categories of the detected objects. The formula is as follows:
Precision = TP / (TP + FP)
The advantage is that “Precision” is suitable for evaluating the “out of thin air” type of object hallucination, where the model mistakenly identifies non-existent objects as present. The disadvantage is that “Precision” does not account for the “obliterating the facts” type of hallucination, where real objects are missed or incorrectly classified. “Recall” refers to the proportion of correct predictions made by the LVLMs among all samples that are actually positive. It measures the LVLMs’ ability to identify real objects in the image. The formula is as follows:
Recall = TP / (TP + FN)
The advantage is that “Recall” is particularly suitable for evaluating the “obliterating the facts” situation, where the model mistakenly ignores real objects. The disadvantage is that “Recall” alone cannot assess the “out of thin air” hallucination phenomenon. “F1-Score” is the harmonic mean of “Precision” and “Recall”, serving as an important indicator for measuring the balanced performance of the LVLMs. The formula is as follows:
F1-Score = (2 × Precision × Recall) / (Precision + Recall)
The advantage of the “F1-Score” is that it combines “Precision” and “Recall” for evaluation, thus avoiding the limitations of using a single indicator. The disadvantage is that it can only assess whether the model experiences the “out of thin air” and “obliterating the facts” hallucination phenomena, but it cannot evaluate more detailed aspects of the object, such as counting errors, color inaccuracies, etc. The “Yes Ratio” refers to the proportion of “Yes” answers given by the LVLMs across all questions. It reflects the LVLMs’ confidence and tendency, helping to determine whether it is overconfident or too conservative. The formula is as follows:
Yes Ratio = (TP + FP) / (TP + FP + FN + TN)
The “Yes Ratio” alone cannot evaluate the model’s hallucination phenomenon; it needs to be used in conjunction with other indicators and the actual object context of the remote sensing images. In addition, to evaluate the timeliness of different LVLMs in processing visual question-answering tasks, we also record the evaluation time for each test. This measures the processing efficiency and reasoning speed of the LVLMs. In practical applications, fast and accurate responses are crucial for ensuring a good user experience.
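Given the formulas above, the RSPOPE metrics reduce to a few lines of code once the model's yes/no answers and the ground-truth labels are aligned; the sketch below is a simplified illustration, not the released evaluation script.

```python
# Minimal sketch of the RSPOPE metric computation from binary yes/no answers.
# `predictions` and `labels` are assumed to be parallel lists of "Yes"/"No"
# strings, one pair per polling question.

def rspope_metrics(predictions, labels):
    tp = sum(p == "Yes" and y == "Yes" for p, y in zip(predictions, labels))
    tn = sum(p == "No" and y == "No" for p, y in zip(predictions, labels))
    fp = sum(p == "Yes" and y == "No" for p, y in zip(predictions, labels))
    fn = sum(p == "No" and y == "Yes" for p, y in zip(predictions, labels))

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    yes_ratio = (tp + fp) / total

    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall,
            "F1-Score": f1, "Yes Ratio": yes_ratio}

# Example usage with four toy questions.
print(rspope_metrics(["Yes", "No", "Yes", "No"], ["Yes", "Yes", "No", "No"]))
```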

3. Results

We apply our proposed RSPOPE remote sensing LVLMs binary hallucination evaluation method to several mainstream LVLMs to test their zero-shot capabilities. These LVLMs include Minigpt4 [76], InstructBLIP [79], Shikra [80], GeoChat [10], and LLaVA-v1.5 [81], with GeoChat being a remote sensing-specific LVLM and the others being general-purpose LVLMs. The LLMs use the 7B open-source versions of Llama [78], as well as the 7B and 13B open-source versions of Vicuna [77].
The main environmental parameters for the entire test experiment are as follows: the GPU is an NVIDIA Tesla A40 with 48 GB of video memory, the Python version is 3.9.10, the PyTorch version is 2.0.1, the TorchVision version is 0.15.2, and the Torchaudio version is 0.12.1.

3.1. Comparison of Zero-Shot Visual Question-Answering Capabilities at the Easy Level

Table 2 and Table 3 present the results of zero-shot visual question-answering hallucination evaluation experiments at the “Easy” level of RSPOPE, using several mainstream LVLMs with 7B and 13B size LLMs, respectively. Overall, the LLaVA-v1.5 model performs the best, achieving the highest F1-Score in all six experiments at the “Easy” level, with values of 82.44%, 91.91%, 84.72%, 82.70%, 88.89%, and 84.89%, respectively. These results even outperform the remote sensing-specific LVLM GeoChat by 0.99%, 9.16%, 4.04%, 3.89%, 1.25%, and 4.52%, respectively. The running time ranks third from the bottom among the 7B models, with times of 799s, 832s, and 830s, respectively, which are faster than Shikra and GeoChat. It ranks last among the 13B models, with times of 1473s, 1473s, and 1470s, respectively. This indicates that the rich and diverse datasets used by the LLaVA-v1.5 model during pre-training and fine-tuning enhanced the model’s knowledge reserve, reducing the hallucination phenomenon in the visual question-answering task, but also increasing the inference time.

3.2. Comparison of Zero-Shot Visual Question-Answering Capabilities at the Medium Level

Table 4 and Table 5 present the results of zero-shot visual question-answering hallucination evaluation experiments at the “Medium” level of RSPOPE, using several mainstream LVLMs with 7B and 13B size LLMs, respectively. Similar to the “Easy” level, LLaVA-v1.5 achieves the highest F1-Score in all six experiments. Additionally, we observe that Minigpt4 using Vicuna-v0-7B as the LLMs has the highest Yes Ratio among all 7B LVLMs, with values of 93.88%, 88.50%, and 91.25%, respectively. However, the corresponding F1-Scores are the lowest, at 63.73%, 67.21%, and 65.89%, respectively, which are 10.28%, 22.31%, and 12.13% lower than the best-performing LLaVA-v1.5 model. This suggests that the 7B version of Minigpt4 is an overconfident LVLM, and the low performance is likely due to the lack of remote sensing images in the pre-training and fine-tuning stages.

3.3. Comparison of Zero-Shot Visual Question-Answering Capabilities at the Hard Level

Table 6 and Table 7 present the results of zero-shot visual question-answering hallucination evaluation experiments at the “Hard” level of RSPOPE, using several mainstream LVLMs with 7B and 13B size LLMs, respectively. In the random sampling experiment group for the 7B model, InstructBLIP surpasses LLaVA-v1.5 in the F1-Score index, while GeoChat achieves the highest Recall index in all three 7B experiments, reaching 98.53%. However, its F1-Score performs poorly, with all six LVLMs under Random, Popular, and Adversarial settings ranking fourth, falling behind InstructBLIP, LLaVA-v1.5, and Shikra. This suggests that GeoChat, which has been fine-tuned with remote sensing images, can effectively recognize objects in the image, resulting in high Recall. However, its high Yes Ratio indicates overconfidence, leading to the misidentification of non-existent objects as present, which, in turn, lowers Precision. LLaVA-v1.5, on the other hand, maintains a relatively balanced Precision and Recall with a lower Yes Ratio, making it a more rigorous and modest LVLM.

4. Discussion

The main advantage of this study is that it proposes a systematic solution to address the hallucination problem in remote sensing LVLMs. First, the CoT production pipeline ensures the accuracy and rigor of the instruction set through multi-level reasoning and quality checks, thereby enhancing the model’s understanding ability. Second, the RSPOPE evaluation method is the first to provide consistent and reliable quantitative evaluation of hallucinations in remote sensing LVLMs. It evaluates the model’s performance from multiple sampling methods and difficulty levels, helping researchers better understand the model’s strengths and weaknesses. Finally, the diversity and high quality of the DDFAV dataset offer a solid foundation for future research and promote the development of remote sensing LVLMs.
Although this study achieves certain results, it still has some limitations and shortcomings. First, the DDFAV dataset currently focuses on visible light remote sensing images and does not yet include multispectral or hyperspectral remote sensing data. This limits the model’s generalization ability in a wider range of application scenarios, particularly in remote sensing tasks involving different spectral bands. In the future, we plan to address this shortcoming by screening images of different spectral bands from existing datasets that are already widely used for tasks such as classification, detection, and segmentation, or by cooperating with relevant companies or institutions to obtain the required images of different spectral bands using professional equipment. Second, the DDFAV dataset currently consists of 1755 images, which is a small number, and some categories (such as Windmill and Trainstation) have relatively few samples. Future work should focus on expanding the dataset while ensuring data quality. To this end, it is possible to continue to screen and supplement existing remote sensing dataset images, while adopting data augmentation strategies, such as geometric transformation, color jittering, etc. In addition, it is also possible to try to generate synthetic data using generative adversarial network (GAN) technology. Third, while the RSPOPE evaluation method effectively evaluates model hallucinations, its scope is currently limited to visual question-answering tasks, with little exploration of other types of remote sensing LVLMs tasks. Future work can extend RSPOPE to include more comprehensive evaluation methods across different remote sensing LVLMs tasks (such as image description and scene classification). Fourth, the evaluation indicators used by the RSPOPE method can only assess the “out of thin air” and “obliterating the facts” types of hallucination. They cannot evaluate more detailed aspects of remote sensing objects. Given that these hallucination phenomena are very important for remote sensing applications, in the future, we plan to expand the RSPOPE evaluation framework and add hallucination evaluation indicators based on factors such as color and position to improve the breadth and depth of the evaluation. At the same time, we will transition from a binary scoring system to a hierarchical scoring system, calculate color differences, object position offsets, etc., and grade them according to the differences from the real data, so as to comprehensively evaluate the model performance. Fifth, this study primarily uses improved data quality and newly developed hallucination evaluation methods to investigate hallucinations in remote sensing LVLMs. Future work could explore additional hallucination mitigation techniques, such as generation constraints or consistency checks during the decoding stage, to further enhance the model’s ability to handle hallucinations. Sixth, although the CoT instruction set design process proposed in this paper provides an accurate and comprehensive strategy for creating remote sensing LVLMs datasets, it remains somewhat complex and may include redundant or unnecessary steps. The design process can be further refined in the future to achieve the goal of maintaining data quality while enabling fast and efficient production.

5. Conclusions

This paper systematically analyzed the research status of remote sensing LVLMs and MLLMs, as well as the approaches to alleviating hallucinations, and highlighted the deficiencies and issues in datasets and evaluation methods within the field of remote sensing. To this end, a precise and rigorous remote sensing LVLMs instruction set production pipeline was designed based on the idea of CoT. The instruction set produced through this pipeline can improve data quality, better guide model training, and reduce the generation of hallucinations. In addition, we collected 1755 images from five high-quality visible light remote sensing datasets to form the DDFAV dataset, which contains 29 remote sensing object categories; ranges from small objects, such as people, to large objects, such as stadiums; covers a variety of scenes, such as airports, harbors, and cities; and provides high-quality images from multiple perspectives, such as high-altitude satellites and low-altitude drones. Based on the designed CoT, we produced instruction sets with detailed annotation information for these images, enabling comprehensive completion of tasks such as detailed image description, complex reasoning, and visual question answering. Finally, we developed RSPOPE, a hallucination evaluation method for remote sensing LVLMs and MLLMs in visual question-answering tasks. By mixing and matching three different difficulty levels with three different sampling methods, we aimed to achieve a more intuitive and detailed quantitative assessment of model hallucination degrees. We also evaluated the zero-shot capabilities of several mainstream LVLMs on remote sensing images to verify the effectiveness and practicality of the method.

Author Contributions

Conceptualization, H.L. and X.Z.; methodology, H.L. and X.Z.; software, H.L.; validation, H.L. and X.Z.; formal analysis, H.L. and X.Z.; investigation, H.L. and X.Z.; resources, H.L. and X.Z.; data curation, H.L. and X.Z.; writing—original draft preparation, H.L.; writing—review and editing, H.L. and X.Z.; visualization, H.L.; supervision, H.Q.; project administration, H.Q.; funding acquisition, H.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the National Natural Science Foundation of China (Grant No. 42271409) and the Scientific Research Foundation of the Higher Education Institutions of Liaoning Province (Grant No. LJKMZ20220699).

Data Availability Statement

Our proposed DDFAV remote sensing LVLMs dataset and RSPOPE evaluation files can be found at https://github.com/HaodongLi2024/rspope (accessed on 10 March 2024).

Acknowledgments

The authors would like to thank the editors and reviewers for their advice.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, G.; Chen, J.; Mo, L.; Wu, P.; Yi, X. Border-Enhanced Triple Attention Mechanism for High-Resolution Remote Sensing Images and Application to Land Cover Classification. Remote Sens. 2024, 16, 2814. [Google Scholar] [CrossRef]
  2. Wang, Y.; Zhang, W.; Chen, W.; Chen, C. BSDSNet: Dual-Stream Feature Extraction Network Based on Segment Anything Model for Synthetic Aperture Radar Land Cover Classification. Remote Sens. 2024, 16, 1150. [Google Scholar] [CrossRef]
  3. Huo, F.; Guo, F.; Shi, P.; Gao, Z.; Zhao, Y.; Wang, Y.; Meng, X.; Yue, D. The Application of Remote Sensing Technology in Post-Disaster Emergency Investigations of Debris Flows: A Case Study of the Shuimo Catchment in the Bailong River, China. Remote Sens. 2024, 16, 2817. [Google Scholar] [CrossRef]
  4. Zhang, W.; Peng, L.; Ge, X.; Yang, L.; Chen, L.; Li, W. Spatio-Temporal Knowledge Graph-Based Research on Agro-Meteorological Disaster Monitoring. Remote Sens. 2023, 15, 4403. [Google Scholar] [CrossRef]
  5. Wang, H.; Liu, C.; Zang, F.; Liu, Y.; Chang, Y.; Huang, G.; Fu, G.; Zhao, C.; Liu, X. Remote sensing-based approach for the assessing of ecological environmental quality variations using Google Earth Engine: A case study in the Qilian Mountains, Northwest China. Remote Sens. 2023, 15, 960. [Google Scholar] [CrossRef]
  6. Duo, L.; Wang, J.; Zhang, F.; Xia, Y.; Xiao, S.; He, B.J. Assessing the Spatiotemporal Evolution and Drivers of Ecological Environment Quality Using an Enhanced Remote Sensing Ecological Index in Lanzhou City, China. Remote Sens. 2023, 15, 4704. [Google Scholar] [CrossRef]
  7. Zhan, Y.; Xiong, Z.; Yuan, Y. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. arXiv 2024, arXiv:2401.09712. [Google Scholar] [CrossRef]
  8. Bazi, Y.; Bashmal, L.; Al Rahhal, M.M.; Ricci, R.; Melgani, F. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery. Remote Sens. 2024, 16, 1477. [Google Scholar] [CrossRef]
  9. Zhang, W.; Cai, M.; Zhang, T.; Zhuang, Y.; Mao, X. Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5917820. [Google Scholar]
  10. Kuckreja, K.; Danish, M.S.; Naseer, M.; Das, A.; Khan, S.; Khan, F.S. Geochat: Grounded large vision-language model for remote sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27831–27840. [Google Scholar]
  11. Li, Y.; Du, Y.; Zhou, K.; Wang, J.; Zhao, X.; Wen, J.R. Evaluating Object Hallucination in Large Vision-Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  12. Gunjal, A.; Yin, J.; Bas, E. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 18135–18143. [Google Scholar]
  13. Lu, J.; Rao, J.; Chen, K.; Guo, X.; Zhang, Y.; Sun, B.; Yang, C.; Yang, J. Evaluation and mitigation of agnosia in multimodal large language models. arXiv 2023, arXiv:2309.04041. [Google Scholar]
  14. Chen, X.; Wang, C.; Xue, Y.; Zhang, N.; Yang, X.; Li, Q.; Shen, Y.; Liang, L.; Gu, J.; Chen, H. Unified Hallucination Detection for Multimodal Large Language Models. arXiv 2024, arXiv:2402.03190. [Google Scholar]
  15. Han, T.; Lian, Q.; Pan, R.; Pi, R.; Zhang, J.; Diao, S.; Lin, Y.; Zhang, T. The Instinctive Bias: Spurious Images lead to Hallucination in MLLMs. arXiv 2024, arXiv:2402.03757. [Google Scholar]
  16. Wu, J.; Liu, Q.; Wang, D.; Zhang, J.; Wu, S.; Wang, L.; Tan, T. Logical closed loop: Uncovering object hallucinations in large vision-language models. arXiv 2024, arXiv:2402.11622. [Google Scholar]
  17. Yue, Z.; Zhang, L.; Jin, Q. Less is more: Mitigating multimodal hallucination from an eos decision perspective. arXiv 2024, arXiv:2402.14545. [Google Scholar]
  18. Sarkar, P.; Ebrahimi, S.; Etemad, A.; Beirami, A.; Arık, S.Ö.; Pfister, T. Mitigating Object Hallucination via Data Augmented Contrastive Tuning. arXiv 2024, arXiv:2405.18654. [Google Scholar]
  19. Wu, M.; Ji, J.; Huang, O.; Li, J.; Wu, Y.; Sun, X.; Ji, R. Evaluating and analyzing relationship hallucinations in lvlms. arXiv 2024, arXiv:2406.16449. [Google Scholar]
  20. Liu, F.; Lin, K.; Li, L.; Wang, J.; Yacoob, Y.; Wang, L. Mitigating hallucination in large multi-modal models via robust instruction tuning. In Proceedings of the Twelfth International Conference on Learning Representations, Boston, MA, USA, 16–17 January 2023. [Google Scholar]
  21. Wang, J.; Zhou, Y.; Xu, G.; Shi, P.; Zhao, C.; Xu, H.; Ye, Q.; Yan, M.; Zhang, J.; Zhu, J.; et al. Evaluation and analysis of hallucination in large vision-language models. arXiv 2023, arXiv:2308.15126. [Google Scholar]
  22. Hu, H.; Zhang, J.; Zhao, M.; Sun, Z. CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning. arXiv 2023, arXiv:2309.02301. [Google Scholar]
  23. Sun, Z.; Shen, S.; Cao, S.; Liu, H.; Li, C.; Shen, Y.; Gan, C.; Gui, L.Y.; Wang, Y.X.; Yang, Y.; et al. Aligning large multimodal models with factually augmented rlhf. arXiv 2023, arXiv:2309.14525. [Google Scholar]
  24. Lovenia, H.; Dai, W.; Cahyawijaya, S.; Ji, Z.; Fung, P. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. arXiv 2023, arXiv:2310.05338. [Google Scholar]
  25. Yu, Q.; Li, J.; Wei, L.; Pang, L.; Ye, W.; Qin, B.; Tang, S.; Tian, Q.; Zhuang, Y. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 12944–12953. [Google Scholar]
  26. Chen, Z.; Zhu, Y.; Zhan, Y.; Li, Z.; Zhao, C.; Wang, J.; Tang, M. Mitigating hallucination in visual language models with visual supervision. arXiv 2023, arXiv:2311.16479. [Google Scholar]
  27. Yu, T.; Yao, Y.; Zhang, H.; He, T.; Han, Y.; Cui, G.; Hu, J.; Liu, Z.; Zheng, H.T.; Sun, M.; et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13807–13816. [Google Scholar]
  28. Wang, X.; Pan, J.; Ding, L.; Biemann, C. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv 2024, arXiv:2403.18715. [Google Scholar]
  29. Jiang, C.; Xu, H.; Dong, M.; Chen, J.; Ye, W.; Yan, M.; Ye, Q.; Zhang, J.; Huang, F.; Zhang, S. Hallucination augmented contrastive learning for multimodal large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 27036–27046. [Google Scholar]
  30. Sun, L.; Wang, L.; Sun, J.; Okatani, T. Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models. arXiv 2024, arXiv:2401.09861. [Google Scholar]
  31. Kim, J.; Kim, Y.J.; Ro, Y.M. What if...?: Counterfactual inception to mitigate hallucination effects in large multimodal models. arXiv 2024, arXiv:2403.13513. [Google Scholar]
  32. Leng, S.; Zhang, H.; Chen, G.; Li, X.; Lu, S.; Miao, C.; Bing, L. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13872–13882. [Google Scholar]
  33. Huang, Q.; Dong, X.; Zhang, P.; Wang, B.; He, C.; Wang, J.; Lin, D.; Zhang, W.; Yu, N. Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 13418–13427. [Google Scholar]
  34. Zhu, L.; Ji, D.; Chen, T.; Xu, P.; Ye, J.; Liu, J. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. arXiv 2024, arXiv:2402.18476. [Google Scholar]
  35. Chen, Z.; Zhao, Z.; Luo, H.; Yao, H.; Li, B.; Zhou, J. HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding. arXiv 2024, arXiv:2403.00425. [Google Scholar]
  36. Ghosh, S.; Evuru, C.K.R.; Kumar, S.; Tyagi, U.; Nieto, O.; Jin, Z.; Manocha, D. VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap. arXiv 2024, arXiv:2405.15683. [Google Scholar]
  37. Kim, J.; Kim, H.; Kim, Y.; Ro, Y.M. CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models. arXiv 2024, arXiv:2406.01920. [Google Scholar]
  38. Wang, B.; Wu, F.; Han, X.; Peng, J.; Zhong, H.; Zhang, P.; Dong, X.; Li, W.; Li, W.; Wang, J.; et al. Vigc: Visual instruction generation and correction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5309–5317. [Google Scholar]
  39. Zhou, Y.; Cui, C.; Yoon, J.; Zhang, L.; Deng, Z.; Finn, C.; Bansal, M.; Yao, H. Analyzing and Mitigating Object Hallucination in Large Vision-Language Models. arXiv 2023, arXiv:2310.00754. [Google Scholar]
  40. Zhai, B.; Yang, S.; Zhao, X.; Xu, C.; Shen, S.; Zhao, D.; Keutzer, K.; Li, M.; Yan, T.; Fan, X. Halle-switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv 2023, arXiv:2310.01779. [Google Scholar]
  41. Yin, S.; Fu, C.; Zhao, S.; Xu, T.; Wang, H.; Sui, D.; Shen, Y.; Li, K.; Sun, X.; Chen, E. Woodpecker: Hallucination correction for multimodal large language models. Sci. China Inf. Sci. 2024, 67, 220105. [Google Scholar] [CrossRef]
  42. Jing, L.; Li, R.; Chen, Y.; Jia, M.; Du, X. Faithscore: Evaluating hallucinations in large vision-language models. arXiv 2023, arXiv:2311.01477. [Google Scholar]
  43. Wang, J.; Wang, Y.; Xu, G.; Zhang, J.; Gu, Y.; Jia, H.; Yan, M.; Zhang, J.; Sang, J. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv 2023, arXiv:2311.07397. [Google Scholar]
  44. Zhao, Z.; Wang, B.; Ouyang, L.; Dong, X.; Wang, J.; He, C. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv 2023, arXiv:2311.16839. [Google Scholar]
  45. Ben-Kish, A.; Yanuka, M.; Alper, M.; Giryes, R.; Averbuch-Elor, H. Mocha: Multi-objective reinforcement mitigating caption hallucinations. arXiv 2023, arXiv:2312.03631. [Google Scholar]
  46. Zhang, Y.F.; Yu, W.; Wen, Q.; Wang, X.; Zhang, Z.; Wang, L.; Jin, R.; Tan, T. Debiasing large visual language models. arXiv 2024, arXiv:2403.05262. [Google Scholar]
  47. Pi, R.; Han, T.; Xiong, W.; Zhang, J.; Liu, R.; Pan, R.; Zhang, T. Strengthening multimodal large language model with bootstrapped preference optimization. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025; pp. 382–398. [Google Scholar]
  48. Xiao, W.; Huang, Z.; Gan, L.; He, W.; Li, H.; Yu, Z.; Jiang, H.; Wu, F.; Zhu, L. Detecting and mitigating hallucination in large vision language models via fine-grained ai feedback. arXiv 2024, arXiv:2404.14233. [Google Scholar]
  49. An, W.; Tian, F.; Leng, S.; Nie, J.; Lin, H.; Wang, Q.; Dai, G.; Chen, P.; Lu, S. AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention. arXiv 2024, arXiv:2406.12718. [Google Scholar]
  50. Liu, S.; Zheng, K.; Chen, W. Paying more attention to image: A training-free method for alleviating hallucination in lvlms. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2025; pp. 125–140. [Google Scholar]
  51. Qu, X.; Chen, Q.; Wei, W.; Sun, J.; Dong, J. Alleviating hallucination in large vision-language models with active retrieval augmentation. arXiv 2024, arXiv:2408.00555. [Google Scholar]
  52. Yan, B.; Zhang, Z.; Jing, L.; Hossain, E.; Du, X. FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs. arXiv 2024, arXiv:2409.13612. [Google Scholar]
  53. Jiang, N.; Kachinthaya, A.; Petryk, S.; Gandelsman, Y. Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations. arXiv 2024, arXiv:2410.02762. [Google Scholar]
  54. Zou, X.; Wang, Y.; Yan, Y.; Huang, S.; Zheng, K.; Chen, J.; Tang, C.; Hu, X. Look twice before you answer: Memory-space visual retracing for hallucination mitigation in multimodal large language models. arXiv 2024, arXiv:2410.03577. [Google Scholar]
  55. Yuan, X.; Shen, C.; Yan, S.; Zhang, X.F.; Xie, L.; Wang, W.; Guan, R.; Wang, Y.; Ye, J. Instance-adaptive Zero-shot Chain-of-Thought Prompting. arXiv 2024, arXiv:2409.20441. [Google Scholar]
  56. Zhang, X.; Quan, Y.; Gu, C.; Shen, C.; Yuan, X.; Yan, S.; Cheng, H.; Wu, K.; Ye, J. Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs. arXiv 2024, arXiv:2411.09968. [Google Scholar]
  57. Li, Z.; Muhtar, D.; Gu, F.; Zhang, X.; Xiao, P.; He, G.; Zhu, X. LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation. arXiv 2024, arXiv:2411.09301. [Google Scholar]
  58. Pang, C.; Wu, J.; Li, J.; Liu, Y.; Sun, J.; Li, W.; Weng, X.; Wang, S.; Feng, L.; Xia, G.S.; et al. H2RSVLM: Towards Helpful and Honest Remote Sensing Large Vision Language Model. arXiv 2024, arXiv:2403.20213. [Google Scholar]
  59. Xu, L.; Zhao, L.; Guo, W.; Li, Q.; Long, K.; Zou, K.; Wang, Y.; Li, H. RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding. arXiv 2024, arXiv:2406.12479. [Google Scholar]
  60. Zhang, Z.; Zhao, T.; Guo, Y.; Yin, J. Rs5m: A large scale vision-language dataset for remote sensing vision-language foundation model. arXiv 2023, arXiv:2306.11300. [Google Scholar]
  61. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  62. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195. [Google Scholar] [CrossRef]
  63. Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (Cits), Kunming, China, 6–8 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–5. [Google Scholar]
  64. Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 53, 2175–2184. [Google Scholar] [CrossRef]
  65. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  66. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  67. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  68. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  69. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
  70. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  71. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  72. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  73. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3791–3798. [Google Scholar]
  74. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  75. Zhan, Y.; Xiong, Z.; Yuan, Y. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513. [Google Scholar] [CrossRef]
  76. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  77. Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. 2023. Available online: https://vicuna.lmsys.org (accessed on 14 April 2023). [Google Scholar]
  78. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  79. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 49250–49267. [Google Scholar]
  80. Chen, K.; Zhang, Z.; Zeng, W.; Zhang, R.; Zhu, F.; Zhao, R. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv 2023, arXiv:2306.15195. [Google Scholar]
  81. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. In Advances in Neural Information Processing Systems, Proceedings of the 2023 Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2024; Volume 36. [Google Scholar]
Figure 1. Comparison of hallucination phenomena of general LVLMs in image captioning tasks on remote sensing images. All three LVLMs produce hallucinated sentences (flagged in red), including object-counting errors and descriptions of objects that do not exist in the image.
Figure 2. Overview of the proposed DDFAV remote sensing LVLMs dataset. DDFAV contains 1.6 GB of images, categorized by function into four major classes and 29 minor categories, and also records the number of object instances within each category.
Figure 3. The DDFAV remote sensing LVLMs instruction set production process, designed based on the CoT method. For each image, we assign GPT an identity and upload the image in the preparation stage, and then generate dialogue data for the image description, complex reasoning, and visual question-answering tasks. Each sub-step involves manual prompts, GPT generation, and quality inspection with iterative feedback, ensuring high data quality and accuracy.
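For readers who prefer a procedural view of Figure 3, the following Python sketch outlines how the per-image generation loop could be organized. The ask_gpt and passes_quality_check helpers and the prompt strings are hypothetical placeholders standing in for the manual-prompt, GPT-generation, and human quality-inspection steps described above; they are illustrative assumptions, not the exact tooling used to build DDFAV.

```python
# Sketch of the Figure 3 instruction-generation loop (illustrative only).
# ask_gpt(), passes_quality_check(), and the prompt strings are hypothetical
# placeholders for the manual prompting, GPT generation, and human review steps.
import json
from pathlib import Path

TASKS = {
    "detailed_description": "Describe this remote sensing image in detail.",
    "complex_reasoning": "Answer a reasoning question about the scene.",
    "visual_qa": "Answer short questions about colors, counts, and locations.",
}

def ask_gpt(system_identity: str, image_path: Path, prompt: str) -> str:
    """Placeholder for a call to a multimodal GPT endpoint."""
    raise NotImplementedError

def passes_quality_check(answer: str) -> bool:
    """Placeholder for the manual quality-inspection step."""
    raise NotImplementedError

def build_entries(image_dir: Path, out_file: Path, max_retries: int = 3) -> None:
    identity = "You are an expert remote sensing image analyst."
    entries = []
    for image_path in sorted(image_dir.glob("*.png")):
        record = {"image": image_path.name, "conversations": []}
        for task, prompt in TASKS.items():
            # Regenerate with feedback until the answer passes inspection;
            # the last attempt is kept for manual correction if all retries fail.
            for _ in range(max_retries):
                answer = ask_gpt(identity, image_path, prompt)
                if passes_quality_check(answer):
                    break
            record["conversations"].append(
                {"task": task, "question": prompt, "answer": answer}
            )
        entries.append(record)
    out_file.write_text(json.dumps(entries, indent=2))
```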
Figure 4. Word frequency statistics for all sentences in the DDFAV instruction set. (a) The top 30 most frequent words in the human-written questions. (b) The top 30 most frequent words in the GPT-generated answers.
Figure 5. An example from the DDFAV remote sensing LVLMs instruction set: a remote sensing image paired with 8 question-answer pairs, comprising 1 detailed image description, 1 complex reasoning question, 2 visual questions about color, 2 about counting, and 2 about object location.
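To make the structure of such an entry concrete, the snippet below shows one hypothetical layout for the eight question-answer pairs listed in the caption. The field names and example wording are illustrative assumptions and do not reproduce the released DDFAV schema verbatim.

```python
# Hypothetical layout of a single DDFAV instruction-set entry.
# Field names and example text are illustrative, not the released schema.
example_entry = {
    "image": "P0001.png",
    "conversations": [
        {"type": "detailed_description",
         "question": "Describe the image in detail.",
         "answer": "An airport scene with two runways, several planes ..."},
        {"type": "complex_reasoning",
         "question": "Why might the planes be parked near the terminal?",
         "answer": "They are likely boarding or unloading passengers ..."},
        {"type": "vqa_color", "question": "What color is the largest plane?",
         "answer": "White."},
        {"type": "vqa_color", "question": "What color is the terminal roof?",
         "answer": "Gray."},
        {"type": "vqa_count", "question": "How many planes are visible?",
         "answer": "Six."},
        {"type": "vqa_count", "question": "How many runways are there?",
         "answer": "Two."},
        {"type": "vqa_location", "question": "Where are the vehicles located?",
         "answer": "Near the lower edge of the apron."},
        {"type": "vqa_location", "question": "Where is the control tower?",
         "answer": "To the right of the main terminal."},
    ],
}
```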
Figure 6. The overall production process of the proposed RSPOPE hallucination evaluation method for remote sensing LVLMs. The difficulty level and the positive sample categories are first determined from the object categories actually present in each image and their counts. The remaining negative sample categories are then selected using three distinct sampling methods. Finally, the corresponding number of question-answer pairs is generated, their internal order is randomized, and they are compiled into the appropriate evaluation files.
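The sampling logic summarized in Figure 6 can be sketched as follows, assuming a POPE-style construction in which positive questions come from categories actually present in an image, and negative categories are drawn uniformly at random, from the globally most frequent categories, or from the categories that most often co-occur with the present ones. Function and variable names are ours and are not taken from the released RSPOPE implementation.

```python
# Illustrative sketch of RSPOPE-style question generation (assumed details).
import random
from collections import Counter

def build_questions(image_objects, all_categories, global_counts,
                    cooccurrence, setting, n_negative, seed=0):
    """image_objects: categories present in one image.
    global_counts: Counter of category frequencies over the whole dataset.
    cooccurrence: dict mapping category -> Counter of co-occurring categories.
    setting: 'random', 'popular', or 'adversarial'."""
    rng = random.Random(seed)
    present = sorted(set(image_objects))
    absent = [c for c in all_categories if c not in present]

    if setting == "random":
        negatives = rng.sample(absent, min(n_negative, len(absent)))
    elif setting == "popular":
        # Most frequent dataset-wide categories that are absent from this image.
        negatives = [c for c, _ in global_counts.most_common()
                     if c in absent][:n_negative]
    elif setting == "adversarial":
        # Categories that most often co-occur with the objects actually present.
        scores = Counter()
        for p in present:
            scores.update(cooccurrence.get(p, {}))
        negatives = [c for c, _ in scores.most_common()
                     if c in absent][:n_negative]
    else:
        raise ValueError(f"unknown setting: {setting}")

    questions = [(f"Is there a {c} in the image?", "yes") for c in present]
    questions += [(f"Is there a {c} in the image?", "no") for c in negatives]
    rng.shuffle(questions)  # randomize internal order before writing the file
    return questions
```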
Figure 7. Heat map visualization of the adversarial (co-occurrence) matrix of the DDFAV dataset, showing the co-occurrence frequencies of the 29 categories. The diagonal elements are all 0 because co-occurrence within the same category is not considered; darker red indicates higher co-occurrence frequency, and darker blue indicates lower co-occurrence frequency.
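The co-occurrence statistics behind Figure 7 can be recomputed from per-image category lists with a straightforward counting pass; the minimal sketch below shows one way to do so (variable names are our own assumptions). The resulting matrix can then be rendered as a heat map with a diverging colormap, e.g., via matplotlib's imshow.

```python
# Minimal sketch: build a category co-occurrence matrix from per-image labels.
import itertools
import numpy as np

def cooccurrence_matrix(image_labels, categories):
    """image_labels: list of sets/lists of category names, one per image."""
    index = {c: i for i, c in enumerate(categories)}
    mat = np.zeros((len(categories), len(categories)), dtype=int)
    for labels in image_labels:
        for a, b in itertools.combinations(sorted(set(labels)), 2):
            mat[index[a], index[b]] += 1
            mat[index[b], index[a]] += 1
    np.fill_diagonal(mat, 0)  # same-category co-occurrence is not considered
    return mat
```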
Table 1. DDFAV instruction set statistics for the detailed description task, including the number of images, sentences, and words and the average caption length, grouped by source dataset.
Dataset | Images | Sentences | Words | Caption Length
DIOR | 499 | 7636 | 127,951 | 46.81
DOTA | 673 | 10,756 | 226,218 | 85.22
FAIR1M | 167 | 2704 | 49,676 | 74.22
AI-TOD | 37 | 564 | 10,961 | 71.73
VisDrone-2019 | 379 | 5990 | 92,105 | 44.13
Table 2. The proposed RSPOPE evaluation method applied zero-shot to 7B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the easy level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Easy | Random | Minigpt4 [76] | Vicuna-v0-7B [77] | 68.17 | 67.14 | 98.44 | 79.83 | 93.83 | 390
Easy | Random | Minigpt4 | Llama2-7B [78] | 65.50 | 66.30 | 93.75 | 77.67 | 90.50 | 408
Easy | Random | InstructBLIP [79] | Vicuna-v1.1-7B | 74.67 | 84.73 | 73.70 | 78.83 | 55.67 | 408
Easy | Random | Shikra [80] | Vicuna-7B | 71.00 | 72.44 | 88.28 | 79.58 | 78.00 | 1294
Easy | Random | GeoChat [10] | Vicuna-v1.5-7B | 71.00 | 68.95 | 99.48 | 81.45 | 92.33 | 1096
Easy | Random | LLaVA-v1.5 [81] | Vicuna-v1.5-7B | 75.00 | 74.89 | 91.67 | 82.44 | 78.33 | 799
Easy | Popular | Minigpt4 | Vicuna-v0-7B | 68.17 | 67.38 | 97.40 | 79.66 | 92.50 | 396
Easy | Popular | Minigpt4 | Llama2-7B | 71.50 | 71.01 | 93.75 | 80.81 | 84.50 | 408
Easy | Popular | InstructBLIP | Vicuna-v1.1-7B | 75.00 | 85.24 | 73.70 | 79.05 | 55.33 | 413
Easy | Popular | Shikra | Vicuna-7B | 82.17 | 84.54 | 88.28 | 86.37 | 66.83 | 1286
Easy | Popular | GeoChat | Vicuna-v1.5-7B | 73.83 | 71.14 | 99.48 | 82.95 | 89.50 | 1110
Easy | Popular | LLaVA-v1.5 | Vicuna-v1.5-7B | 89.67 | 92.15 | 96.17 | 91.91 | 63.67 | 832
Easy | Adversarial | Minigpt4 | Vicuna-v0-7B | 66.50 | 66.19 | 97.40 | 78.82 | 94.17 | 393
Easy | Adversarial | Minigpt4 | Llama2-7B | 67.50 | 67.80 | 93.75 | 78.69 | 88.50 | 407
Easy | Adversarial | InstructBLIP | Vicuna-v1.1-7B | 67.50 | 75.07 | 73.70 | 74.38 | 62.83 | 404
Easy | Adversarial | Shikra | Vicuna-7B | 75.83 | 77.22 | 88.28 | 82.38 | 73.17 | 1286
Easy | Adversarial | GeoChat | Vicuna-v1.5-7B | 69.50 | 67.85 | 99.48 | 80.68 | 93.83 | 1074
Easy | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-7B | 78.83 | 78.75 | 91.67 | 84.72 | 74.50 | 830
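Because RSPOPE reduces evaluation to binary yes/no polling, the Accuracy, Precision, Recall, F1-Score, and Yes Ratio columns in Tables 2–7 follow the standard binary-classification definitions. The helper below is a minimal sketch of how these percentages can be computed once a model's raw outputs have been mapped to "yes"/"no" answers; the function name and input format are our assumptions.

```python
# Minimal sketch: compute RSPOPE-style metrics from ground-truth labels and
# model answers, both given as "yes"/"no" strings of equal length.
def rspope_metrics(labels, answers):
    tp = sum(l == "yes" and a == "yes" for l, a in zip(labels, answers))
    tn = sum(l == "no" and a == "no" for l, a in zip(labels, answers))
    fp = sum(l == "no" and a == "yes" for l, a in zip(labels, answers))
    fn = sum(l == "yes" and a == "no" for l, a in zip(labels, answers))
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    yes_ratio = (tp + fp) / total  # fraction of questions answered "yes"
    return {name: round(value * 100, 2) for name, value in
            {"accuracy": accuracy, "precision": precision, "recall": recall,
             "f1": f1, "yes_ratio": yes_ratio}.items()}
```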
Table 3. The proposed RSPOPE evaluation method applied zero-shot to 13B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the easy level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Easy | Random | Minigpt4 | Vicuna-v0-13B | 62.83 | 76.92 | 59.90 | 67.35 | 49.83 | 668
Easy | Random | InstructBLIP | Vicuna-v1.1-13B | 69.17 | 70.35 | 89.58 | 78.81 | 81.50 | 524
Easy | Random | LLaVA-v1.5 | Vicuna-v1.5-13B | 74.33 | 72.73 | 95.83 | 82.70 | 84.33 | 1473
Easy | Popular | Minigpt4 | Vicuna-v0-13B | 65.67 | 81.56 | 59.90 | 69.07 | 47.00 | 669
Easy | Popular | InstructBLIP | Vicuna-v1.1-13B | 83.83 | 85.79 | 89.58 | 87.64 | 66.83 | 523
Easy | Popular | LLaVA-v1.5 | Vicuna-v1.5-13B | 84.67 | 82.89 | 95.83 | 88.89 | 74.00 | 1473
Easy | Adversarial | Minigpt4 | Vicuna-v0-13B | 65.50 | 81.27 | 59.90 | 68.97 | 47.72 | 669
Easy | Adversarial | InstructBLIP | Vicuna-v1.1-13B | 72.00 | 72.88 | 89.58 | 80.37 | 78.67 | 509
Easy | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-13B | 78.17 | 76.19 | 95.83 | 84.89 | 80.50 | 1470
Table 4. The proposed RSPOPE evaluation method applied zero-shot to 7B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the medium level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Medium | Random | Minigpt4 | Vicuna-v0-7B | 49.50 | 49.27 | 94.15 | 63.73 | 93.88 | 520
Medium | Random | Minigpt4 | Llama2-7B | 49.50 | 49.24 | 90.33 | 64.69 | 90.13 | 543
Medium | Random | InstructBLIP | Vicuna-v1.1-7B | 73.00 | 72.41 | 72.77 | 72.59 | 49.38 | 533
Medium | Random | Shikra | Vicuna-7B | 66.13 | 60.97 | 86.26 | 71.44 | 69.50 | 1721
Medium | Random | GeoChat | Vicuna-v1.5-7B | 59.88 | 55.07 | 99.49 | 70.90 | 88.75 | 1461
Medium | Random | LLaVA-v1.5 | Vicuna-v1.5-7B | 69.63 | 63.84 | 88.04 | 74.01 | 67.75 | 1103
Medium | Popular | Minigpt4 | Vicuna-v0-7B | 54.88 | 52.26 | 94.15 | 67.21 | 88.50 | 519
Medium | Popular | Minigpt4 | Llama2-7B | 59.25 | 55.21 | 90.33 | 68.53 | 80.38 | 545
Medium | Popular | InstructBLIP | Vicuna-v1.1-7B | 81.38 | 87.20 | 72.77 | 79.33 | 41.00 | 553
Medium | Popular | Shikra | Vicuna-7B | 81.13 | 77.75 | 86.26 | 81.79 | 54.50 | 1719
Medium | Popular | GeoChat | Vicuna-v1.5-7B | 67.38 | 60.15 | 99.49 | 74.98 | 81.25 | 1470
Medium | Popular | LLaVA-v1.5 | Vicuna-v1.5-7B | 89.88 | 91.05 | 88.04 | 89.52 | 47.50 | 1099
Medium | Adversarial | Minigpt4 | Vicuna-v0-7B | 52.13 | 50.68 | 94.15 | 65.89 | 91.25 | 520
Medium | Adversarial | Minigpt4 | Llama2-7B | 55.75 | 52.91 | 90.33 | 66.73 | 83.88 | 543
Medium | Adversarial | InstructBLIP | Vicuna-v1.1-7B | 71.63 | 70.44 | 72.77 | 71.59 | 50.75 | 540
Medium | Adversarial | Shikra | Vicuna-7B | 73.13 | 67.80 | 86.26 | 75.92 | 62.50 | 1720
Medium | Adversarial | GeoChat | Vicuna-v1.5-7B | 61.50 | 56.10 | 99.49 | 71.74 | 87.13 | 1437
Medium | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-7B | 75.63 | 70.04 | 88.04 | 78.02 | 61.75 | 1097
Table 5. The proposed RSPOPE evaluation method applied zero-shot to 13B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the medium level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Medium | Random | Minigpt4 | Vicuna-v0-13B | 61.50 | 61.84 | 56.49 | 59.04 | 44.88 | 890
Medium | Random | InstructBLIP | Vicuna-v1.1-13B | 60.63 | 56.66 | 84.48 | 67.82 | 73.25 | 712
Medium | Random | LLaVA-v1.5 | Vicuna-v1.5-13B | 66.50 | 60.40 | 92.37 | 73.04 | 75.13 | 1963
Medium | Popular | Minigpt4 | Vicuna-v0-13B | 66.88 | 70.25 | 56.49 | 62.62 | 39.50 | 892
Medium | Popular | InstructBLIP | Vicuna-v1.1-13B | 77.38 | 73.45 | 84.48 | 78.58 | 56.50 | 708
Medium | Popular | LLaVA-v1.5 | Vicuna-v1.5-13B | 84.75 | 79.78 | 92.37 | 85.61 | 56.86 | 1960
Medium | Adversarial | Minigpt4 | Vicuna-v0-13B | 61.75 | 62.18 | 56.49 | 59.20 | 44.63 | 892
Medium | Adversarial | InstructBLIP | Vicuna-v1.1-13B | 66.50 | 61.60 | 84.48 | 71.24 | 67.38 | 707
Medium | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-13B | 72.00 | 65.17 | 92.37 | 76.42 | 69.63 | 1960
Table 6. The proposed RSPOPE evaluation method applied zero-shot to 7B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the hard level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Hard | Random | Minigpt4 | Vicuna-v0-7B | 41.10 | 39.98 | 88.48 | 55.07 | 90.30 | 649
Hard | Random | Minigpt4 | Llama2-7B | 43.60 | 41.22 | 89.71 | 56.48 | 88.80 | 681
Hard | Random | InstructBLIP | Vicuna-v1.1-7B | 78.30 | 74.30 | 71.57 | 72.91 | 39.30 | 657
Hard | Random | Shikra | Vicuna-7B | 68.10 | 57.40 | 84.56 | 68.38 | 60.10 | 2130
Hard | Random | GeoChat | Vicuna-v1.5-7B | 50.30 | 45.17 | 98.53 | 61.80 | 89.30 | 1874
Hard | Random | LLaVA-v1.5 | Vicuna-v1.5-7B | 71.10 | 59.93 | 87.99 | 71.30 | 59.90 | 1374
Hard | Popular | Minigpt4 | Vicuna-v0-7B | 44.60 | 41.59 | 88.48 | 56.58 | 86.80 | 651
Hard | Popular | Minigpt4 | Llama2-7B | 50.70 | 44.80 | 89.71 | 59.76 | 81.70 | 675
Hard | Popular | InstructBLIP | Vicuna-v1.1-7B | 79.80 | 77.25 | 71.57 | 74.30 | 37.80 | 653
Hard | Popular | Shikra | Vicuna-7B | 76.30 | 66.47 | 84.56 | 74.43 | 51.90 | 2153
Hard | Popular | GeoChat | Vicuna-v1.5-7B | 50.30 | 45.02 | 98.53 | 61.80 | 89.30 | 1847
Hard | Popular | LLaVA-v1.5 | Vicuna-v1.5-7B | 77.50 | 67.10 | 88.00 | 76.14 | 53.50 | 1371
Hard | Adversarial | Minigpt4 | Vicuna-v0-7B | 44.70 | 41.64 | 88.48 | 56.63 | 86.70 | 649
Hard | Adversarial | Minigpt4 | Llama2-7B | 50.40 | 44.63 | 89.71 | 59.61 | 82.00 | 679
Hard | Adversarial | InstructBLIP | Vicuna-v1.1-7B | 77.60 | 73.00 | 71.57 | 72.28 | 40.00 | 663
Hard | Adversarial | Shikra | Vicuna-7B | 75.50 | 64.46 | 84.56 | 73.80 | 52.70 | 2152
Hard | Adversarial | GeoChat | Vicuna-v1.5-7B | 50.20 | 44.97 | 98.53 | 61.75 | 89.40 | 1848
Hard | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-7B | 79.40 | 69.57 | 87.99 | 77.71 | 51.60 | 1374
Table 7. The proposed RSPOPE evaluation method applied zero-shot to 13B-scale LVLMs on 100 remote sensing images selected from the DDFAV dataset at the hard level. Time is reported in seconds (s); all other values are percentages (%).
Level | Setting | Model | LLM | Accuracy | Precision | Recall | F1-Score | Yes Ratio | Time (s)
Hard | Random | Minigpt4 | Vicuna-v0-13B | 61.90 | 53.98 | 44.85 | 49.00 | 33.90 | 1110
Hard | Random | InstructBLIP | Vicuna-v1.1-13B | 60.00 | 50.63 | 78.19 | 61.46 | 63.00 | 953
Hard | Random | LLaVA-v1.5 | Vicuna-v1.5-13B | 66.40 | 55.57 | 87.99 | 68.12 | 64.60 | 2451
Hard | Popular | Minigpt4 | Vicuna-v0-13B | 64.90 | 59.22 | 44.85 | 51.05 | 30.90 | 1118
Hard | Popular | InstructBLIP | Vicuna-v1.1-13B | 67.50 | 57.48 | 78.19 | 66.25 | 55.50 | 885
Hard | Popular | LLaVA-v1.5 | Vicuna-v1.5-13B | 72.50 | 61.37 | 87.99 | 72.31 | 58.50 | 2449
Hard | Adversarial | Minigpt4 | Vicuna-v0-13B | 63.80 | 57.19 | 44.85 | 50.27 | 32.00 | 1118
Hard | Adversarial | InstructBLIP | Vicuna-v1.1-13B | 65.50 | 55.48 | 78.19 | 64.90 | 57.50 | 903
Hard | Adversarial | LLaVA-v1.5 | Vicuna-v1.5-13B | 75.10 | 64.22 | 87.99 | 74.25 | 55.90 | 2449