Search Results (6)

Search Parameters:
Keywords = LVLM

23 pages, 4379 KB  
Article
Large Vision Language Model: Enhanced-RSCLIP with Exemplar-Image Prompting for Uncommon Object Detection in Satellite Imagery
by Taiwo Efunogbon, Abimbola Efunogbon, Enjie Liu, Dayou Li and Renxi Qiu
Electronics 2025, 14(15), 3071; https://doi.org/10.3390/electronics14153071 - 31 Jul 2025
Viewed by 372
Abstract
Large Vision Language Models (LVLMs) have shown promise in remote sensing applications, yet struggle with “uncommon” objects that lack sufficient public labeled data. This paper presents Enhanced-RSCLIP, a novel dual-prompt architecture that combines text prompting with exemplar-image processing for cattle herd detection in satellite imagery. Our approach introduces a key innovation: an exemplar-image preprocessing module, using crop-based or attention-based algorithms, extracts focused object features, which are fed as a dual stream to a contrastive learning framework that fuses textual descriptions with visual exemplar embeddings. We evaluated our method on a custom dataset of 260 satellite images across UK and Nigerian regions. Enhanced-RSCLIP with crop-based exemplar processing achieved 72% accuracy in cattle detection and 56.2% overall accuracy on cross-domain transfer tasks, significantly outperforming text-only CLIP (31% overall accuracy). The dual-prompt architecture enables effective few-shot learning and cross-regional transfer from data-rich (UK) to data-sparse (Nigeria) environments, demonstrating a 41% improvement over baseline approaches for uncommon object detection in satellite imagery.
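As a rough illustration of the dual-prompt idea, the sketch below composes a CLIP-style query from both a text prompt and an exemplar crop, then scores candidate tiles by cosine similarity. It uses the Hugging Face transformers CLIP API; the checkpoint, file names, equal fusion weights, and tiling step are illustrative assumptions, not the Enhanced-RSCLIP implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    # Text-prompt stream.
    text_in = processor(text=["a herd of cattle in a field"],
                        return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_in)

    # Exemplar-image stream: a cropped patch containing the target object (hypothetical file).
    exemplar = Image.open("cattle_exemplar_crop.png")
    ex_in = processor(images=exemplar, return_tensors="pt")
    ex_emb = model.get_image_features(**ex_in)

    # Fuse the two prompts (equal weighting is an assumption, not the paper's scheme).
    query = torch.nn.functional.normalize(0.5 * text_emb + 0.5 * ex_emb, dim=-1)

    # Score candidate tiles cut from a large satellite scene (hypothetical files).
    tiles = [Image.open(p) for p in ["tile_00.png", "tile_01.png"]]
    tile_in = processor(images=tiles, return_tensors="pt")
    tile_emb = torch.nn.functional.normalize(model.get_image_features(**tile_in), dim=-1)
    scores = tile_emb @ query.T  # cosine similarity per tile; threshold to flag detections
```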

16 pages, 3396 KB  
Article
Parameter-Efficient Adaptation of Large Vision-Language Models for Video Memorability Prediction
by Iván Martín-Fernández, Sergio Esteban-Romero, Fernando Fernández-Martínez and Manuel Gil-Martín
Sensors 2025, 25(6), 1661; https://doi.org/10.3390/s25061661 - 7 Mar 2025
Viewed by 1698
Abstract
The accurate modelling of video memorability, or the intrinsic properties that render a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more efficient in retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of a video and its memorability, underscoring the importance of advanced visual comprehension abilities for model performance. Large Vision-Language Models (LVLMs) have demonstrated exceptional proficiency in generalist, high-level semantic comprehension of images and video, owing to their extensive, large-scale multimodal pre-training. This work draws on the vast generalist knowledge of LVLMs and explores efficient adaptation techniques with a view to using them as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. Building on existing research, we propose a methodology that transforms Qwen-VL from a language model into a memorability score regressor. Furthermore, we consider the influence of selecting appropriate LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-fold cross-validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. This work therefore represents a significant advancement in modelling video memorability through high-level semantic understanding.
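For readers unfamiliar with QLoRA-style adaptation, the sketch below shows how a 4-bit quantized Qwen-VL checkpoint could be wrapped with LoRA adapters and a scalar regression head using transformers and peft. The rank/alpha values, target modules, pooling choice, and head design are placeholders, not the configuration reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization (the "Q" in QLoRA); requires the bitsandbytes package.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

# Public Qwen-VL checkpoint; custom modeling code requires trust_remote_code.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", quantization_config=bnb, trust_remote_code=True)

# LoRA adapters; r and lora_alpha are the hyperparameters the paper cross-validates
# (the values and target modules here are placeholders).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

# A scalar regression head on the final hidden state (assumption: last-token pooling).
head = torch.nn.Linear(base.config.hidden_size, 1)

def predict_memorability(hidden_states):
    pooled = hidden_states[:, -1, :]                  # last-token representation
    return torch.sigmoid(head(pooled)).squeeze(-1)    # memorability score in [0, 1]
```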

21 pages, 1486 KB  
Article
DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark
by Haodong Li, Xiaofeng Zhang and Haicheng Qu
Remote Sens. 2025, 17(4), 719; https://doi.org/10.3390/rs17040719 - 19 Feb 2025
Cited by 1 | Viewed by 1999
Abstract
With the rapid development of large visual language models (LVLMs) and multimodal large language models (MLLMs), these models have demonstrated strong performance on various multimodal tasks. However, alleviating hallucinations remains a key challenge in LVLM research. For remote sensing LVLMs, existing datasets and evaluation methods are of low quality, small in number, and unreliable; as a result, these models are prone to hallucinations when applied to remote sensing tasks and deliver unsatisfactory performance. This paper proposes a more reliable and effective instruction-set production process for remote sensing LVLMs to address these issues. The process generates detailed and accurate instruction sets through strategies such as shallow-to-deep reasoning, internal and external considerations, and manual quality inspection. Based on this production process, we collect 1.6 GB of remote sensing images to create the DDFAV dataset, which covers a variety of remote sensing LVLM tasks. Finally, we develop a closed binary classification polling evaluation method, RSPOPE, specifically designed to evaluate hallucinations in remote sensing LVLM or MLLM visual question-answering tasks. Using this method, we evaluate the zero-shot remote sensing visual question-answering capabilities of multiple mainstream LVLMs. Our proposed dataset images, the corresponding instruction sets, and the evaluation method files are all open source.
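The closed binary polling idea can be illustrated with a few lines of scoring code: the model answers yes/no object-presence questions, and standard binary metrics are computed against ground truth. The metric set below is an assumption in the spirit of POPE-style protocols, not the exact RSPOPE definition.

```python
# Minimal yes/no polling evaluation in the spirit of POPE-style hallucination checks.
def poll_metrics(answers, labels):
    """answers/labels: lists of 'yes'/'no' strings for object-presence questions."""
    tp = sum(a == "yes" and l == "yes" for a, l in zip(answers, labels))
    tn = sum(a == "no" and l == "no" for a, l in zip(answers, labels))
    fp = sum(a == "yes" and l == "no" for a, l in zip(answers, labels))
    fn = sum(a == "no" and l == "yes" for a, l in zip(answers, labels))
    accuracy = (tp + tn) / max(len(labels), 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example: questions such as "Is there a storage tank in this image?" with ground truth.
print(poll_metrics(["yes", "no", "yes"], ["yes", "yes", "no"]))
```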

21 pages, 1298 KB  
Article
Co-LLaVA: Efficient Remote Sensing Visual Question Answering via Model Collaboration
by Fan Liu, Wenwen Dai, Chuanyi Zhang, Jiale Zhu, Liang Yao and Xin Li
Remote Sens. 2025, 17(3), 466; https://doi.org/10.3390/rs17030466 - 29 Jan 2025
Viewed by 2517
Abstract
Large vision language models (LVLMs) are built upon large language models (LLMs) and incorporate non-textual modalities, enabling them to perform various multimodal tasks. Applying LVLMs to remote sensing (RS) visual question answering (VQA) tasks can take advantage of their powerful capabilities to promote the development of VQA in RS. However, because remote sensing images are more complex than natural images, general-domain LVLMs tend to perform poorly in RS scenarios and are prone to hallucination. Multi-agent debate for collaborative reasoning is commonly used to mitigate hallucination; although effective, it comes with a significant computational burden (e.g., high CPU/GPU demands and slow inference speed). To address these limitations, we propose Co-LLaVA, a model specifically designed for RS VQA tasks. Specifically, Co-LLaVA employs model collaboration between the Large Language and Vision Assistant (LLaVA-v1.5) and Contrastive Captioners (CoCas), combining an LVLM with a lightweight generative model and reducing the computational burden compared to multi-agent debate. Additionally, through high-dimensional multi-scale features and higher-resolution images, Co-LLaVA enhances the perception of details in RS images. Experimental results demonstrate significant performance improvements of Co-LLaVA over existing LVLMs (e.g., Geochat, RSGPT) on multiple metrics across four RS VQA datasets (e.g., +3% over SkySenseGPT on “Rural/Urban” accuracy on the RSVQA-LR test set).
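A schematic sketch of the collaboration pattern is shown below: a lightweight captioner produces a draft scene description that is injected into the LVLM prompt as extra context, replacing a costly multi-agent debate. Both model calls are stand-in callables, since the paper's actual interfaces and prompt wording are not given here.

```python
# Schematic collaboration loop: a lightweight captioner (e.g., a CoCa-style model)
# drafts a description that conditions a single LVLM pass.
# `caption_fn` and `lvlm_fn` are placeholders for the two models' inference calls.

def collaborative_vqa(image, question, caption_fn, lvlm_fn):
    draft_caption = caption_fn(image)                  # cheap pass: scene description
    prompt = (
        "Context from an auxiliary captioner: "
        f"{draft_caption}\n"
        f"Question about the remote sensing image: {question}\n"
        "Answer concisely."
    )
    return lvlm_fn(image, prompt)                      # one LVLM pass, no debate rounds

# Usage with stand-in functions:
answer = collaborative_vqa(
    image="scene.png",
    question="Is this area rural or urban?",
    caption_fn=lambda img: "sparse buildings surrounded by farmland",
    lvlm_fn=lambda img, prompt: "rural",
)
```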

20 pages, 19180 KB  
Article
Leveraging Multi-Source Data for the Trustworthy Evaluation of the Vibrancy of Child-Friendly Cities: A Case Study of Tianjin, China
by Di Zhang, Kun Song and Di Zhao
Electronics 2024, 13(22), 4564; https://doi.org/10.3390/electronics13224564 - 20 Nov 2024
Cited by 2 | Viewed by 1077
Abstract
The vitality of a city is shaped by its social structure, environmental quality, and spatial form, and child-friendliness is an essential component of urban vitality. While there are numerous qualitative studies on the relationship between child-friendliness and various indicators of urban vitality, quantitative research remains relatively scarce, leaving urban planning and the development of child-friendly cities without sufficient objective and trustworthy data. This paper presents an analytical framework, using Heping District in Tianjin, China, as a case study. It defines four main indicators (social vitality, environmental vitality, spatial vitality, and urban scene perception) for a trustworthy and transparent quantitative evaluation. The study integrates multi-source data, including point-of-interest (POI) data, street view image (SVI) data, spatiotemporal big data, the normalized difference vegetation index (NDVI), and large visual language models (LVLMs), for trustworthy analysis. These data are visualized using corresponding big data and weighted analysis methods, ensuring transparent and accurate assessments of the child-friendliness of urban blocks. This research introduces an innovative and trustworthy method for evaluating the child-friendliness of urban blocks, addressing gaps in the quantitative theory of child-friendliness in urban planning. It also provides a practical and reliable tool for urban planners, offering a solid theoretical foundation to create environments that better meet the needs of children in a trustworthy manner.
(This article belongs to the Special Issue Adversarial Attacks and Defenses in AI Safety/Reliability)
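A minimal sketch of the weighted multi-indicator evaluation might look like the following: each indicator is min-max normalized per block and combined with category weights. The equal weights and example values are placeholders, not the study's calibrated settings.

```python
import numpy as np

# Four indicator categories from the abstract; weights are placeholders.
weights = {"social": 0.25, "environmental": 0.25, "spatial": 0.25, "scene_perception": 0.25}

def minmax(values):
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

def vibrancy_scores(indicators):
    """indicators: dict mapping category -> per-block raw values (equal-length arrays)."""
    normalized = {k: minmax(v) for k, v in indicators.items()}
    return sum(weights[k] * normalized[k] for k in weights)

# Three example blocks with raw indicator values (illustrative only).
scores = vibrancy_scores({
    "social": [120, 300, 80],             # e.g., spatiotemporal activity counts
    "environmental": [0.3, 0.6, 0.2],     # e.g., NDVI
    "spatial": [15, 40, 10],              # e.g., POI density
    "scene_perception": [0.5, 0.8, 0.4],  # e.g., LVLM-rated street-view scores
})
print(scores)  # higher = more child-friendly vibrancy under these assumptions
```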

18 pages, 3629 KB  
Article
RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
by Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci and Farid Melgani
Remote Sens. 2024, 16(9), 1477; https://doi.org/10.3390/rs16091477 - 23 Apr 2024
Cited by 39 | Viewed by 10823
Abstract
In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential, with a focus on image captioning and visual question answering (VQA). Specifically, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), adapted for RS imagery through a low-rank adaptation approach. To evaluate model performance, we create the RS-instructions dataset, a comprehensive benchmark that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model’s effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.
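As a rough sketch of how single-task captioning and VQA datasets could be merged into one instruction-following corpus, the snippet below flattens both sample types into a shared schema. The field names and prompt template are illustrative assumptions, not the RS-instructions format.

```python
# Flatten captioning and VQA samples into one instruction-tuning format
# (field names and templates are illustrative, not the RS-instructions schema).

def caption_to_instruction(sample):
    return {
        "image": sample["image"],
        "instruction": "Describe this remote sensing image in detail.",
        "response": sample["caption"],
    }

def vqa_to_instruction(sample):
    return {
        "image": sample["image"],
        "instruction": sample["question"],
        "response": sample["answer"],
    }

caption_data = [{"image": "rs_0001.png", "caption": "An airport with two runways."}]
vqa_data = [{"image": "rs_0002.png", "question": "How many buildings are present?",
             "answer": "Four."}]

rs_instructions = ([caption_to_instruction(s) for s in caption_data] +
                   [vqa_to_instruction(s) for s in vqa_data])
print(len(rs_instructions), "instruction samples")
```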
