Search Results (6)

Search Parameters:
Keywords = LVLM

23 pages, 4379 KB  
Article
Large Vision Language Model: Enhanced-RSCLIP with Exemplar-Image Prompting for Uncommon Object Detection in Satellite Imagery
by Taiwo Efunogbon, Abimbola Efunogbon, Enjie Liu, Dayou Li and Renxi Qiu
Electronics 2025, 14(15), 3071; https://doi.org/10.3390/electronics14153071 - 31 Jul 2025
Viewed by 372
Abstract
Large Vision Language Models (LVLMs) have shown promise in remote sensing applications, yet struggle with “uncommon” objects that lack sufficient public labeled data. This paper presents Enhanced-RSCLIP, a novel dual-prompt architecture that combines text prompting with exemplar-image processing for cattle herd detection in satellite imagery. Our approach introduces a key innovation: an exemplar-image preprocessing module, using crop-based or attention-based algorithms, extracts focused object features, which are fed as a dual stream to a contrastive learning framework that fuses textual descriptions with visual exemplar embeddings. We evaluated our method on a custom dataset of 260 satellite images across UK and Nigerian regions. Enhanced-RSCLIP with crop-based exemplar processing achieved 72% accuracy in cattle detection and 56.2% overall accuracy on cross-domain transfer tasks, significantly outperforming text-only CLIP (31% overall accuracy). The dual-prompt architecture enables effective few-shot learning and cross-regional transfer from data-rich (UK) to data-sparse (Nigeria) environments, demonstrating a 41% improvement over baseline approaches for uncommon object detection in satellite imagery.
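As a rough illustration of the dual-prompt idea, the sketch below composes a CLIP-style query from both a text prompt and an exemplar crop, then scores candidate tiles by cosine similarity. It uses the Hugging Face transformers CLIP API; the checkpoint, file names, equal fusion weights, and tiling step are illustrative assumptions, not the Enhanced-RSCLIP implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    # Text-prompt stream.
    text_in = processor(text=["a herd of cattle in a field"],
                        return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_in)

    # Exemplar-image stream: a cropped patch containing the target object (hypothetical file).
    exemplar = Image.open("cattle_exemplar_crop.png")
    ex_in = processor(images=exemplar, return_tensors="pt")
    ex_emb = model.get_image_features(**ex_in)

    # Fuse the two prompts (equal weighting is an assumption, not the paper's scheme).
    query = torch.nn.functional.normalize(0.5 * text_emb + 0.5 * ex_emb, dim=-1)

    # Score candidate tiles cut from a large satellite scene (hypothetical files).
    tiles = [Image.open(p) for p in ["tile_00.png", "tile_01.png"]]
    tile_in = processor(images=tiles, return_tensors="pt")
    tile_emb = torch.nn.functional.normalize(model.get_image_features(**tile_in), dim=-1)
    scores = tile_emb @ query.T  # cosine similarity per tile; threshold to flag detections
```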

16 pages, 3396 KB  
Article
Parameter-Efficient Adaptation of Large Vision-Language Models for Video Memorability Prediction
by Iván Martín-Fernández, Sergio Esteban-Romero, Fernando Fernández-Martínez and Manuel Gil-Martín
Sensors 2025, 25(6), 1661; https://doi.org/10.3390/s25061661 - 7 Mar 2025
Viewed by 1698
Abstract
The accurate modelling of video memorability, or the intrinsic properties that render a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more efficient in retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of a video and its memorability, underscoring the importance of advanced visual comprehension abilities for model performance. Large Vision-Language Models (LVLMs) have demonstrated exceptional proficiency in generalist, high-level semantic comprehension of images and video, owing to their extensive, large-scale multimodal pre-training. This work draws on the vast generalist knowledge of LVLMs and explores efficient adaptation techniques with a view to using them as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. Building on existing research, we propose a methodology that transforms Qwen-VL from a language model into a memorability score regressor. Furthermore, we consider the influence of selecting appropriate LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-fold cross-validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. This work therefore represents a significant advancement in modelling video memorability through high-level semantic understanding.
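For readers unfamiliar with QLoRA-style adaptation, the sketch below shows how a 4-bit quantized Qwen-VL checkpoint could be wrapped with LoRA adapters and a scalar regression head using transformers and peft. The rank/alpha values, target modules, pooling choice, and head design are placeholders, not the configuration reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization (the "Q" in QLoRA); requires the bitsandbytes package.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

# Public Qwen-VL checkpoint; custom modeling code requires trust_remote_code.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", quantization_config=bnb, trust_remote_code=True)

# LoRA adapters; r and lora_alpha are the hyperparameters the paper cross-validates
# (the values and target modules here are placeholders).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

# A scalar regression head on the final hidden state (assumption: last-token pooling).
head = torch.nn.Linear(base.config.hidden_size, 1)

def predict_memorability(hidden_states):
    pooled = hidden_states[:, -1, :]                  # last-token representation
    return torch.sigmoid(head(pooled)).squeeze(-1)    # memorability score in [0, 1]
```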

21 pages, 1486 KB  
Article
DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
by Haodong Li, Xiaofeng Zhang and Haicheng Qu
Remote Sens. 2025, 17(4), 719; https://doi.org/10.3390/rs17040719 - 19 Feb 2025
Cited by 1 | Viewed by 1999
Abstract
With the rapid development of large visual language models (LVLMs) and multimodal large language models (MLLMs), these models have demonstrated strong performance on various multimodal tasks. However, alleviating hallucinations remains a key challenge in LVLM research. For remote sensing LVLMs, existing datasets and evaluation methods are of low quality, small in number, and unreliable; as a result, these models are prone to hallucinations when applied to remote sensing tasks and deliver unsatisfactory performance. This paper proposes a more reliable and effective instruction-set production process for remote sensing LVLMs to address these issues. The process generates detailed and accurate instruction sets through strategies such as shallow-to-deep reasoning, internal and external considerations, and manual quality inspection. Based on this production process, we collect 1.6 GB of remote sensing images to create the DDFAV dataset, which covers a variety of remote sensing LVLM tasks. Finally, we develop a closed binary classification polling evaluation method, RSPOPE, specifically designed to evaluate hallucinations in remote sensing LVLM or MLLM visual question-answering tasks. Using this method, we evaluate the zero-shot remote sensing visual question-answering capabilities of multiple mainstream LVLMs. Our proposed dataset images, the corresponding instruction sets, and the evaluation method files are all open source.
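The closed binary polling idea can be illustrated with a few lines of scoring code: the model answers yes/no object-presence questions, and standard binary metrics are computed against ground truth. The metric set below is an assumption in the spirit of POPE-style protocols, not the exact RSPOPE definition.

```python
# Minimal yes/no polling evaluation in the spirit of POPE-style hallucination checks.
def poll_metrics(answers, labels):
    """answers/labels: lists of 'yes'/'no' strings for object-presence questions."""
    tp = sum(a == "yes" and l == "yes" for a, l in zip(answers, labels))
    tn = sum(a == "no" and l == "no" for a, l in zip(answers, labels))
    fp = sum(a == "yes" and l == "no" for a, l in zip(answers, labels))
    fn = sum(a == "no" and l == "yes" for a, l in zip(answers, labels))
    accuracy = (tp + tn) / max(len(labels), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: questions such as "Is there a storage tank in this image?" with ground truth.
print(poll_metrics(["yes", "no", "yes"], ["yes", "yes", "no"]))
```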

21 pages, 1298 KB  
Article
Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration
by Fan Liu, Wenwen Dai, Chuanyi Zhang, Jiale Zhu, Liang Yao and Xin Li
Remote Sens. 2025, 17(3), 466; https://doi.org/10.3390/rs17030466 - 29 Jan 2025
Viewed by 2517
Abstract
Large vision language models (LVLMs) are built upon large language models (LLMs) and incorporate non-textual modalities, enabling them to perform various multimodal tasks. Applying LVLMs to remote sensing (RS) visual question answering (VQA) tasks can take advantage of their powerful capabilities to promote the development of VQA in RS. However, because remote sensing images are more complex than natural images, general-domain LVLMs tend to perform poorly in RS scenarios and are prone to hallucination. Multi-agent debate for collaborative reasoning is commonly used to mitigate hallucination; although effective, it comes with a significant computational burden (e.g., high CPU/GPU demands and slow inference speed). To address these limitations, we propose Co-LLaVA, a model specifically designed for RS VQA tasks. Specifically, Co-LLaVA employs model collaboration between the Large Language and Vision Assistant (LLaVA-v1.5) and Contrastive Captioners (CoCas), combining an LVLM with a lightweight generative model and reducing the computational burden compared to multi-agent debate. Additionally, through high-dimensional multi-scale features and higher-resolution images, Co-LLaVA enhances the perception of details in RS images. Experimental results demonstrate significant performance improvements of Co-LLaVA over existing LVLMs (e.g., Geochat, RSGPT) on multiple metrics across four RS VQA datasets (e.g., +3% over SkySenseGPT on “Rural/Urban” accuracy on the RSVQA-LR test set).
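A schematic sketch of the collaboration pattern is shown below: a lightweight captioner produces a draft scene description that is injected into the LVLM prompt as extra context, replacing a costly multi-agent debate. Both model calls are stand-in callables, since the paper's actual interfaces and prompt wording are not given here.

```python
# Schematic collaboration loop: a lightweight captioner (e.g., a CoCa-style model)
# drafts a description that conditions a single LVLM pass.
# `caption_fn` and `lvlm_fn` are placeholders for the two models' inference calls.

def collaborative_vqa(image, question, caption_fn, lvlm_fn):
    draft_caption = caption_fn(image)                  # cheap pass: scene description
    prompt = (
        "Context from an auxiliary captioner: "
        f"{draft_caption}\n"
        f"Question about the remote sensing image: {question}\n"
        "Answer concisely."
    )
    return lvlm_fn(image, prompt)                      # one LVLM pass, no debate rounds

# Usage with stand-in functions:
answer = collaborative_vqa(
    image="scene.png",
    question="Is this area rural or urban?",
    caption_fn=lambda img: "sparse buildings surrounded by farmland",
    lvlm_fn=lambda img, prompt: "rural",
)
```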

20 pages, 19180 KB  
Article
Leveraging Multi-Source Data for the Trustworthy Evaluation of the Vibrancy of Child-Friendly Cities: A Case Study of Tianjin, China
by Di Zhang, Kun Song and Di Zhao
Electronics 2024, 13(22), 4564; https://doi.org/10.3390/electronics13224564 - 20 Nov 2024
Cited by 2 | Viewed by 1077
Abstract
The vitality of a city is shaped by its social structure, environmental quality, and spatial form, and child-friendliness is an essential component of urban vitality. While there are numerous qualitative studies on the relationship between child-friendliness and various indicators of urban vitality, quantitative research remains relatively scarce, leaving urban planning and the development of child-friendly cities without sufficient objective and trustworthy data. This paper presents an analytical framework, using Heping District in Tianjin, China, as a case study. It defines four main indicators (social vitality, environmental vitality, spatial vitality, and urban scene perception) for a trustworthy and transparent quantitative evaluation. The study integrates multi-source data, including point-of-interest (POI) data, street view image (SVI) data, spatiotemporal big data, the normalized difference vegetation index (NDVI), and large visual language models (LVLMs), for trustworthy analysis. These data are visualized using corresponding big data and weighted analysis methods, ensuring transparent and accurate assessments of the child-friendliness of urban blocks. This research introduces an innovative and trustworthy method for evaluating the child-friendliness of urban blocks, addressing gaps in the quantitative theory of child-friendliness in urban planning. It also provides a practical and reliable tool for urban planners, offering a solid theoretical foundation to create environments that better meet the needs of children in a trustworthy manner.
(This article belongs to the Special Issue Adversarial Attacks and Defenses in AI Safety/Reliability)
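A minimal sketch of the weighted multi-indicator evaluation might look like the following: each indicator is min-max normalized per block and combined with category weights. The equal weights and example values are placeholders, not the study's calibrated settings.

```python
import numpy as np

# Four indicator categories from the abstract; weights are placeholders.
weights = {"social": 0.25, "environmental": 0.25, "spatial": 0.25, "scene_perception": 0.25}

def minmax(values):
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def vibrancy_scores(indicators):
    """indicators: dict mapping category -> per-block raw values (equal-length arrays)."""
    normalized = {k: minmax(v) for k, v in indicators.items()}
    return sum(weights[k] * normalized[k] for k in weights)

# Three example blocks with raw indicator values (illustrative only).
scores = vibrancy_scores({
    "social": [120, 300, 80],             # e.g., spatiotemporal activity counts
    "environmental": [0.3, 0.6, 0.2],     # e.g., NDVI
    "spatial": [15, 40, 10],              # e.g., POI density
    "scene_perception": [0.5, 0.8, 0.4],  # e.g., LVLM-rated street-view scores
})
print(scores)  # higher = more child-friendly vibrancy under these assumptions
```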

18 pages, 3629 KB  
Article
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
by Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci and Farid Melgani
Remote Sens. 2024, 16(9), 1477; https://doi.org/10.3390/rs16091477 - 23 Apr 2024
Cited by 39 | Viewed by 10823
Abstract
In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential, with a focus on image captioning and visual question answering (VQA). Specifically, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), adapted for RS imagery through a low-rank adaptation approach. To evaluate model performance, we create the RS-instructions dataset, a comprehensive benchmark that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model’s effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.
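As a rough sketch of how single-task captioning and VQA datasets could be merged into one instruction-following corpus, the snippet below flattens both sample types into a shared schema. The field names and prompt template are illustrative assumptions, not the RS-instructions format.

```python
# Flatten captioning and VQA samples into one instruction-tuning format
# (field names and templates are illustrative, not the RS-instructions schema).

def caption_to_instruction(sample):
    return {
        "image": sample["image"],
        "instruction": "Describe this remote sensing image in detail.",
        "response": sample["caption"],
    }

def vqa_to_instruction(sample):
    return {
        "image": sample["image"],
        "instruction": sample["question"],
        "response": sample["answer"],
    }

caption_data = [{"image": "rs_0001.png", "caption": "An airport with two runways."}]
vqa_data = [{"image": "rs_0002.png", "question": "How many buildings are present?",
             "answer": "Four."}]

rs_instructions = ([caption_to_instruction(s) for s in caption_data] +
                   [vqa_to_instruction(s) for s in vqa_data])
print(len(rs_instructions), "instruction samples")
```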
