Search Results (159)

Search Parameters:
Keywords = cross-modal alignment

16 pages, 74967 KB  
Article
TVI-MFAN: A Text–Visual Interaction Multilevel Feature Alignment Network for Visual Grounding in Remote Sensing
by Hao Chi, Weiwei Qin, Xingyu Chen, Wenxin Guo and Baiwei An
Remote Sens. 2025, 17(17), 2993; https://doi.org/10.3390/rs17172993 - 28 Aug 2025
Abstract
Visual grounding for remote sensing (RSVG) focuses on localizing specific objects in remote sensing (RS) imagery based on linguistic expressions. Existing methods typically employ pre-trained models to locate the referenced objects. However, due to insufficient cross-modal interaction and alignment, the extracted visual features may suffer from semantic drift, limiting RSVG performance. To address this, this article introduces a novel RSVG framework named the text–visual interaction multilevel feature alignment network (TVI-MFAN), which leverages a text–visual interaction attention (TVIA) module to dynamically generate adaptive weights and biases along both the spatial and channel dimensions, enabling the visual features to focus on relevant linguistic expressions. Additionally, a multilevel feature alignment network (MFAN) aggregates contextual information through cross-modal alignment to enhance features and suppress irrelevant regions. Experiments demonstrate that the proposed method achieves 75.65% and 80.24% accuracy (absolute improvements of 2.42% and 3.1%) on the OPT-RSVG and DIOR-RSVG datasets, validating its effectiveness. Full article
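For readers who want a concrete picture of the channel- and spatial-level modulation described in this abstract, the following is a minimal PyTorch sketch of a text-conditioned visual modulation block; the module name, dimensions, and gating layout are illustrative assumptions, not the TVIA implementation.

```python
# Minimal sketch: a pooled text embedding generates channel-wise weights/biases
# plus a spatial attention map over the visual feature map. All names and sizes
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class TextVisualInteraction(nn.Module):
    def __init__(self, vis_channels=256, text_dim=768):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, vis_channels)          # channel-wise scale
        self.to_bias = nn.Linear(text_dim, vis_channels)           # channel-wise bias
        self.spatial = nn.Conv2d(vis_channels, 1, kernel_size=1)   # spatial attention

    def forward(self, vis_feat, text_emb):
        # vis_feat: (B, C, H, W); text_emb: (B, text_dim), e.g. a pooled language feature
        scale = self.to_scale(text_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        bias = self.to_bias(text_emb).unsqueeze(-1).unsqueeze(-1)
        modulated = vis_feat * torch.sigmoid(scale) + bias            # channel modulation
        attn = torch.sigmoid(self.spatial(modulated))                 # (B, 1, H, W)
        return modulated * attn                                       # spatially re-weighted

feats = torch.randn(2, 256, 32, 32)
text = torch.randn(2, 768)
out = TextVisualInteraction()(feats, text)   # -> (2, 256, 32, 32)
```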
26 pages, 1594 KB  
Review
Symmetry-Aware Advances in Multimodal Large Language Models: Architectures, Training, and Evaluation
by Xinran Liu and Haojie Liu
Symmetry 2025, 17(9), 1400; https://doi.org/10.3390/sym17091400 - 28 Aug 2025
Abstract
With the exponential growth of multimodal data, the limitations of traditional unimodal models in cross-modal understanding and complex scenario reasoning have become increasingly evident. Built upon the foundation of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) retain strong reasoning abilities and demonstrate unique capabilities in multimodal understanding. This survey provides a comprehensive overview of the current research landscape of MLLMs. It systematically analyzes mainstream model architectures, training, fine-tuning strategies, and task classifications, while offering a structured account of evaluation methodologies. Beyond synthesis, the paper highlights emerging trends that aim for balanced integration across modalities, tasks, and components, and critically examines key challenges together with potential solutions. The survey specifically emphasizes recent reasoning-oriented MLLMs, with a focus on DeepSeek-R1, analyzing their design paradigms and contributions from the perspective of symmetric reasoning capabilities. Overall, this work offers a comprehensive overview of cutting-edge advancements and lays a foundation for the future development of MLLMs, especially those guided by symmetry principles. Full article
18 pages, 16540 KB  
Article
E-CMCA and LSTM-Enhanced Framework for Cross-Modal MRI-TRUS Registration in Prostate Cancer
by Ciliang Shao, Ruijin Xue and Lixu Gu
J. Imaging 2025, 11(9), 292; https://doi.org/10.3390/jimaging11090292 - 27 Aug 2025
Abstract
Accurate registration of MRI and TRUS images is crucial for effective prostate cancer diagnosis and biopsy guidance, yet modality differences and non-rigid deformations pose significant challenges, especially in dynamic imaging. This study presents a novel cross-modal MRI-TRUS registration framework, leveraging a dual-encoder architecture with an Enhanced Cross-Modal Channel Attention (E-CMCA) module and an LSTM-Based Spatial Deformation Modeling Module. The E-CMCA module efficiently extracts and integrates multi-scale cross-modal features, while the LSTM-Based Spatial Deformation Modeling Module models temporal dynamics by processing depth-sliced 3D deformation fields as sequential data. A VecInt operation ensures smooth, diffeomorphic transformations, and a FuseConv layer enhances feature integration for precise alignment. Experiments on the μ-RegPro dataset from the MICCAI 2023 Challenge demonstrate that our model achieves a DSC of 0.865, RDSC of 0.898, TRE of 2.278 mm, and RTRE of 1.293, surpassing state-of-the-art methods and performing robustly in both static 3D and dynamic 4D registration tasks. Full article
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
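The VecInt operation mentioned above is commonly realized by scaling-and-squaring integration of a stationary velocity field; below is a minimal 2D PyTorch sketch of that idea (the paper's layer operates on 3D volumes, and the step count and shapes here are illustrative assumptions).

```python
# Minimal 2D sketch of scaling-and-squaring: scale the velocity field down,
# then repeatedly compose the displacement with itself to obtain an
# (approximately) diffeomorphic deformation.
import torch
import torch.nn.functional as F

def warp(field, disp):
    # field, disp: (B, 2, H, W) displacements in pixels; sample `field` at x + disp
    B, _, H, W = disp.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(disp)              # (2, H, W)
    coords = grid.unsqueeze(0) + disp                                  # sampling locations
    # normalize to [-1, 1] for grid_sample (x first, then y)
    norm = torch.stack((2 * coords[:, 0] / (W - 1) - 1,
                        2 * coords[:, 1] / (H - 1) - 1), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(field, norm, align_corners=True)

def vecint(velocity, steps=7):
    disp = velocity / (2 ** steps)          # scale down
    for _ in range(steps):                  # square: phi <- phi o phi
        disp = disp + warp(disp, disp)
    return disp

v = torch.randn(1, 2, 64, 64) * 2.0
phi = vecint(v)                             # smooth displacement field, (1, 2, 64, 64)
```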
25 pages, 3905 KB  
Article
Physics-Guided Multi-Representation Learning with Quadruple Consistency Constraints for Robust Cloud Detection in Multi-Platform Remote Sensing
by Qing Xu, Zichen Zhang, Guanfang Wang and Yunjie Chen
Remote Sens. 2025, 17(17), 2946; https://doi.org/10.3390/rs17172946 - 25 Aug 2025
Viewed by 176
Abstract
With the rapid expansion of multi-platform remote sensing applications, cloud contamination significantly impedes cross-platform data utilization. Current cloud detection methods face critical technical challenges in cross-platform settings, including neglect of atmospheric radiative transfer mechanisms, inadequate multi-scale structural decoupling, high intra-class variability coupled with inter-class similarity, cloud boundary ambiguity, cross-modal feature inconsistency, and noise propagation in pseudo-labels within semi-supervised frameworks. To address these issues, we introduce a Physics-Guided Multi-Representation Network (PGMRN) that adopts a student–teacher architecture and fuses tri-modal representations—Pseudo-NDVI, structural, and textural features—via atmospheric priors and intrinsic image decomposition. Specifically, PGMRN first incorporates an InfoNCE contrastive loss to enhance intra-class compactness and inter-class discrimination while preserving physical consistency; subsequently, a boundary-aware regional adaptive weighted cross-entropy loss integrates PA-CAM confidence with distance transforms to refine edge accuracy; furthermore, an Uncertainty-Aware Quadruple Consistency Propagation (UAQCP) enforces alignment across structural, textural, RGB, and physical modalities; and finally, a dynamic confidence-screening mechanism that couples PA-CAM with information entropy and percentile-based thresholding robustly refines pseudo-labels. Extensive experiments on four benchmark datasets demonstrate that PGMRN achieves state-of-the-art performance, with Mean IoU values of 70.8% on TCDD, 79.0% on HRC_WHU, and 83.8% on SWIMSEG, outperforming existing methods. Full article
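The InfoNCE contrastive loss referenced in this abstract has a standard form; the sketch below shows a generic symmetric implementation in PyTorch, not PGMRN's exact loss.

```python
# Minimal sketch: paired embeddings of the same samples under two views/modalities
# are pulled together, all other pairs in the batch are pushed apart.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    # z_a, z_b: (N, D) embeddings of the same N samples under two representations
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # symmetric form: a -> b and b -> a
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```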
26 pages, 3068 KB  
Article
EAR-CCPM-Net: A Cross-Modal Collaborative Perception Network for Early Accident Risk Prediction
by Wei Sun, Lili Nurliyana Abdullah, Fatimah Binti Khalid and Puteri Suhaiza Binti Sulaiman
Appl. Sci. 2025, 15(17), 9299; https://doi.org/10.3390/app15179299 - 24 Aug 2025
Viewed by 274
Abstract
Early traffic accident risk prediction in complex road environments poses significant challenges due to the heterogeneous nature and incomplete semantic alignment of multimodal data. To address this, we propose a novel Early Accident Risk Cross-modal Collaborative Perception Mechanism Network (EAR-CCPM-Net) that integrates hierarchical fusion modules and cross-modal attention mechanisms to enable semantic interaction between visual, motion, and textual modalities. The model is trained and evaluated on the newly constructed CAP-DATA dataset, incorporating advanced preprocessing techniques such as bilateral filtering and a rigorous MINI-Train-Test sampling protocol. Experimental results show that EAR-CCPM-Net achieves an AUC of 0.853, AP of 0.758, and improves the Time-to-Accident (TTA0.5) from 3.927 s to 4.225 s, significantly outperforming baseline methods. These findings demonstrate that EAR-CCPM-Net effectively enhances early-stage semantic perception and prediction accuracy, providing an interpretable solution for real-world traffic risk anticipation. Full article
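Cross-modal attention of the kind described here typically lets tokens from one modality query another; the following PyTorch sketch is a generic illustration whose dimensions and residual layout are assumptions, not the EAR-CCPM-Net design.

```python
# Minimal sketch: visual tokens attend to text tokens via multi-head attention,
# with a residual connection and layer norm.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D)
        attended, _ = self.attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        return self.norm(visual_tokens + attended)   # residual fusion

vis = torch.randn(2, 196, 256)   # e.g. flattened frame features
txt = torch.randn(2, 12, 256)    # e.g. encoded textual descriptors
fused = CrossModalAttention()(vis, txt)   # (2, 196, 256)
```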
17 pages, 2418 KB  
Article
InstructSee: Instruction-Aware and Feedback-Driven Multimodal Retrieval with Dynamic Query Generation
by Guihe Gu, Yuan Xue, Zhengqian Wu, Lin Song and Chao Liang
Sensors 2025, 25(16), 5195; https://doi.org/10.3390/s25165195 - 21 Aug 2025
Viewed by 380
Abstract
In recent years, cross-modal retrieval has garnered significant attention due to its potential to bridge heterogeneous data modalities, particularly in aligning visual content with natural language. Despite notable progress, existing methods often struggle to accurately capture user intent when queries are expressed through complex or evolving instructions. To address this challenge, we propose a novel cross-modal representation learning framework that incorporates an instruction-aware dynamic query generation mechanism, augmented by the semantic reasoning capabilities of large language models (LLMs). The framework dynamically constructs and iteratively refines query representations conditioned on natural language instructions and guided by user feedback, thereby enabling the system to effectively infer and adapt to implicit retrieval intent. Extensive experiments on standard multimodal retrieval benchmarks demonstrate that our method significantly improves retrieval accuracy and adaptability, outperforming fixed-query baselines and showing enhanced cross-modal alignment and generalization across diverse retrieval tasks. Full article
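As a rough illustration of feedback-driven query refinement, the sketch below retrieves by cosine similarity and nudges the query with a Rocchio-style update; the embedding sources, weights, and update rule are assumptions, not the InstructSee mechanism.

```python
# Minimal sketch: an instruction-conditioned query embedding retrieves items,
# and user feedback on the results shifts the query toward relevant items and
# away from irrelevant ones.
import torch
import torch.nn.functional as F

def retrieve(query, gallery, k=5):
    sims = F.normalize(query, dim=0) @ F.normalize(gallery, dim=1).t()
    return sims.topk(k).indices

def refine_query(query, gallery, positives, negatives, alpha=0.7, beta=0.2, gamma=0.1):
    # positives / negatives: indices the user marked relevant / irrelevant
    pos = gallery[positives].mean(dim=0) if len(positives) else torch.zeros_like(query)
    neg = gallery[negatives].mean(dim=0) if len(negatives) else torch.zeros_like(query)
    return alpha * query + beta * pos - gamma * neg

gallery = torch.randn(1000, 512)   # pre-computed multimodal item embeddings (placeholder)
query = torch.randn(512)           # embedding of the natural-language instruction (placeholder)
hits = retrieve(query, gallery)
query = refine_query(query, gallery, positives=hits[:2], negatives=hits[3:])
hits = retrieve(query, gallery)    # second round reflects the feedback
```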
20 pages, 1818 KB  
Article
Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation
by Liang Wang, Meiqing Jiao, Zhihai Li, Mengxue Zhang, Haiyan Wei, Yuru Ma, Honghui An, Jiaqi Lin and Jun Wang
Electronics 2025, 14(16), 3325; https://doi.org/10.3390/electronics14163325 - 21 Aug 2025
Viewed by 413
Abstract
To address the semantic mismatch between the limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as underutilized external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, a BERT text encoder, and a GPT-2 text decoder. It incorporates two core mechanisms. The first is a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. The second augments the model with external knowledge: Faster R-CNN extracts region-based object features, which are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k, and ablations confirm that both modules enhance caption quality. Full article
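The TransE knowledge embedding mentioned in this abstract scores a triple (head, relation, tail) by how closely head + relation approximates tail; the sketch below uses randomly initialized placeholder embedding tables purely for illustration.

```python
# Minimal sketch of TransE-style triple scoring and a margin-ranking loss over
# a plausible vs. a corrupted triple. Table sizes are placeholders.
import torch

num_entities, num_relations, dim = 10000, 40, 200
entity_emb = torch.nn.Embedding(num_entities, dim)
relation_emb = torch.nn.Embedding(num_relations, dim)

def transe_score(head_idx, rel_idx, tail_idx):
    h = entity_emb(head_idx)
    r = relation_emb(rel_idx)
    t = entity_emb(tail_idx)
    return -(h + r - t).norm(p=1, dim=-1)   # higher = more plausible triple

pos = transe_score(torch.tensor([12]), torch.tensor([3]), torch.tensor([87]))
neg = transe_score(torch.tensor([12]), torch.tensor([3]), torch.tensor([999]))
loss = torch.relu(1.0 + neg - pos).mean()   # margin of 1.0
```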
20 pages, 7578 KB  
Article
Cross Attention Based Dual-Modality Collaboration for Hyperspectral Image and LiDAR Data Classification
by Khanzada Muzammil Hussain, Keyun Zhao, Yang Zhou, Aamir Ali and Ying Li
Remote Sens. 2025, 17(16), 2836; https://doi.org/10.3390/rs17162836 - 15 Aug 2025
Viewed by 442
Abstract
Advancements in satellite sensor technology have enabled access to diverse remote sensing (RS) data from multiple platforms. Hyperspectral Image (HSI) data offers rich spectral detail for material identification, while LiDAR captures high-resolution 3D structural information, making the two modalities naturally complementary. By fusing HSI and LiDAR, we can mitigate the limitations of each and improve tasks like land cover classification, vegetation analysis, and terrain mapping through more robust spectral–spatial feature representation. However, traditional multi-scale feature fusion models often struggle with aligning features effectively, which can lead to redundant outputs and diminished spatial clarity. To address these issues, we propose the Cross Attention Bridge for HSI and LiDAR (CAB-HL), a novel dual-path framework that employs a multi-stage cross-attention mechanism to guide the interaction between spectral and spatial features. In CAB-HL, features from each modality are refined across three progressive stages using cross-attention modules, which enhance contextual alignment while preserving the distinctive characteristics of each modality. These fused representations are subsequently integrated and passed through a lightweight classification head. Extensive experiments on three benchmark RS datasets demonstrate that CAB-HL consistently outperforms existing state-of-the-art models, confirming its effectiveness in learning deep joint representations for multimodal classification tasks. Full article
(This article belongs to the Special Issue Artificial Intelligence Remote Sensing for Earth Observation)
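A multi-stage, bidirectional cross-attention bridge between two modality streams can be sketched as follows; the stage count, dimensions, and classification head are illustrative assumptions, not the CAB-HL architecture.

```python
# Minimal sketch: in each stage, HSI tokens query LiDAR tokens and vice versa,
# with residual connections; pooled features from both streams feed a classifier.
import torch
import torch.nn as nn

class BridgeStage(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.hsi_from_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lidar_from_hsi = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hsi, lidar):
        hsi = hsi + self.hsi_from_lidar(hsi, lidar, lidar)[0]     # HSI queries LiDAR
        lidar = lidar + self.lidar_from_hsi(lidar, hsi, hsi)[0]   # LiDAR queries HSI
        return hsi, lidar

stages = nn.ModuleList(BridgeStage() for _ in range(3))           # three progressive stages
hsi, lidar = torch.randn(4, 49, 128), torch.randn(4, 49, 128)
for stage in stages:
    hsi, lidar = stage(hsi, lidar)
logits = nn.Linear(2 * 128, 10)(torch.cat([hsi.mean(1), lidar.mean(1)], dim=-1))
```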
25 pages, 1734 KB  
Article
A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis
by Yanhong Yuan, Shuangsheng Duo, Xuming Tong and Yapeng Wang
Algorithms 2025, 18(8), 513; https://doi.org/10.3390/a18080513 - 14 Aug 2025
Viewed by 493
Abstract
Addressing the issues of coarse emotional representation, low cross-modal alignment efficiency, and insufficient real-time response capabilities in current human–computer emotional language interaction, this paper proposes an affective interaction framework integrating BERT-based semantic understanding with VITS-based speech synthesis. The framework aims to enhance the naturalness, expressiveness, and response efficiency of human–computer emotional interaction. By introducing a modular layered design, a six-dimensional emotional space, a gated attention mechanism, and a dynamic model scheduling strategy, the system overcomes challenges such as limited emotional representation, modality misalignment, and high-latency responses. Experimental results demonstrate that the framework achieves superior performance in speech synthesis quality (MOS: 4.35), emotion recognition accuracy (91.6%), and response latency (<1.2 s), outperforming baseline models like Tacotron2 and FastSpeech2. Through model lightweighting, GPU parallel inference, and load balancing optimization, the system validates its robustness and generalizability across English and Chinese corpora in cross-linguistic tests. The modular architecture and dynamic scheduling ensure scalability and efficiency, enabling a more humanized and immersive interaction experience in typical application scenarios such as psychological companionship, intelligent education, and high-concurrency customer service. This study provides an effective technical pathway for developing the next generation of personalized and immersive affective intelligent interaction systems. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
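The gated attention mechanism and six-dimensional emotional space described above suggest a gated fusion of semantic and emotion embeddings; the sketch below is one plausible layout, with all dimensions and the gating scheme assumed for illustration.

```python
# Minimal sketch: a per-dimension gate blends a projected semantic embedding
# (e.g. from a BERT-style encoder) with a projected emotion vector, producing a
# conditioning vector that a TTS model could consume.
import torch
import torch.nn as nn

class GatedEmotionFusion(nn.Module):
    def __init__(self, sem_dim=768, emo_dim=6, out_dim=256):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)
        self.emo_proj = nn.Linear(emo_dim, out_dim)
        self.gate = nn.Linear(2 * out_dim, out_dim)

    def forward(self, semantic, emotion):
        s, e = self.sem_proj(semantic), self.emo_proj(emotion)
        g = torch.sigmoid(self.gate(torch.cat([s, e], dim=-1)))   # per-dimension gate
        return g * s + (1 - g) * e                                # fused conditioning vector

fused = GatedEmotionFusion()(torch.randn(2, 768), torch.rand(2, 6))
```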
23 pages, 2744 KB  
Article
CASF: Correlation-Alignment and Significance-Aware Fusion for Multimodal Named Entity Recognition
by Hui Li, Yunshi Tao, Huan Wang, Zhe Wang and Qingzheng Liu
Algorithms 2025, 18(8), 511; https://doi.org/10.3390/a18080511 - 14 Aug 2025
Viewed by 272
Abstract
With the increasing content richness of social media platforms, Multimodal Named Entity Recognition (MNER) faces the dual challenges of heterogeneous feature fusion and accurate entity recognition. To address the key problems of inconsistent distribution of textual and visual information, insufficient feature alignment, and noise-contaminated fusion, this paper proposes CASF-MNER, a multimodal named entity recognition model based on a dual-stream Transformer. The model designs cross-modal cross-attention over visual and textual features and builds a bidirectional interaction mechanism between single-layer features, forming higher-order semantic correlation modeling and achieving cross-relevance alignment of modal features. It constructs a dynamic saliency-perception mechanism for multimodal features based on multi-scale pooling, with an entropy-weighting strategy over the global feature distribution that adaptively suppresses noise redundancy and enhances key feature expression. It further establishes a deep semantic fusion method based on a hybrid isomorphic model, designing a progressive cross-modal interaction structure combined with contrastive learning to achieve global fusion of the deep semantic space and optimize representational consistency. Experimental results show that CASF-MNER achieves excellent performance on both the Twitter-2015 and Twitter-2017 public datasets, verifying the effectiveness and advancement of the proposed method. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
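One generic reading of the entropy-weighting strategy described above is to down-weight feature channels whose activation is spread diffusely across tokens; the sketch below implements that reading and is not CASF-MNER's exact formulation.

```python
# Minimal sketch: per channel, compute the entropy of its activation
# distribution over tokens; concentrated (low-entropy) channels are treated as
# salient and up-weighted, diffuse (high-entropy) channels are damped.
import torch

def entropy_weight(features, eps=1e-8):
    # features: (B, N, C) token features after cross-modal fusion
    p = torch.softmax(features, dim=1)                            # distribution over tokens, per channel
    entropy = -(p * (p + eps).log()).sum(dim=1)                   # (B, C)
    weight = torch.softmax(-entropy, dim=-1) * entropy.size(-1)   # low entropy -> larger weight
    return features * weight.unsqueeze(1)                         # re-weighted tokens

tokens = torch.randn(2, 64, 256)
salient = entropy_weight(tokens)   # (2, 64, 256)
```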
25 pages, 54500 KB  
Article
Parking Pattern Guided Vehicle and Aircraft Detection in Aligned SAR-EO Aerial View Images
by Zhe Geng, Shiyu Zhang, Yu Zhang, Chongqi Xu, Linyi Wu and Daiyin Zhu
Remote Sens. 2025, 17(16), 2808; https://doi.org/10.3390/rs17162808 - 13 Aug 2025
Viewed by 350
Abstract
Although SAR systems can provide high-resolution aerial view images all day and in all weather, the aspect- and pose-sensitivity of SAR target signatures, which defies the Gestalt perceptual principles, sets a frustrating performance upper bound for SAR Automatic Target Recognition (ATR). Therefore, we propose a network to support context-guided ATR by using aligned Electro-Optical (EO)-SAR image pairs. To realize EO-SAR image scene grammar alignment, the stable context features highly correlated to the parking patterns of the vehicle and aircraft targets are extracted from the EO images as prior knowledge, which is then used to assist SAR-ATR. The proposed network consists of a Scene Recognition Module (SRM) and an instance-level Cross-modality ATR Module (CATRM). The SRM is based on a novel light-condition-driven adaptive EO-SAR decision weighting scheme, and the Outlier Exposure (OE) approach is employed for SRM training to realize Out-of-Distribution (OOD) scene detection. Once the scene depicted in the cut of interest is identified with the SRM, the image cut is sent to the CATRM for ATR. Considering that the EO-SAR images acquired from diverse observation angles often feature unbalanced quality, a novel class-incremental learning method based on the Context-Guided Re-Identification (ReID)-based Key-view (CGRID-Key) exemplar selection strategy is devised so that the network is capable of continuous learning in the open-world deployment environment. Vehicle ATR experimental results based on the UNICORN dataset, which consists of 360-degree EO-SAR images of an army base, show that the CGRID-Key exemplar strategy offers a classification accuracy 29.3% higher than the baseline model for the incremental vehicle category, SUV. Moreover, aircraft ATR experimental results based on the aligned EO-SAR images collected over several representative airports and the Arizona aircraft boneyard show that the proposed network achieves an F1 score of 0.987, which is 9% higher than YOLOv8. Full article
(This article belongs to the Special Issue Applications of SAR for Environment Observation Analysis)
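The light-condition-driven adaptive EO-SAR decision weighting can be illustrated with a simple brightness proxy that shifts trust from the EO branch to the SAR branch in darkness; the proxy and weighting curve below are assumptions, not the paper's scheme.

```python
# Minimal sketch: estimate illumination from the EO image and blend EO / SAR
# scene-classification logits accordingly.
import torch

def fuse_scene_logits(eo_image, eo_logits, sar_logits):
    # eo_image: (B, 3, H, W) in [0, 1]; eo_logits, sar_logits: (B, num_classes)
    brightness = eo_image.mean(dim=(1, 2, 3))                      # crude illumination proxy
    w_eo = torch.sigmoid(10 * (brightness - 0.25)).unsqueeze(1)    # ~0 in darkness, ~1 in daylight
    return w_eo * eo_logits + (1 - w_eo) * sar_logits

eo = torch.rand(2, 3, 128, 128) * 0.2                              # dim, night-like images
fused = fuse_scene_logits(eo, torch.randn(2, 7), torch.randn(2, 7))
```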
25 pages, 9564 KB  
Article
Semantic-Aware Cross-Modal Transfer for UAV-LiDAR Individual Tree Segmentation
by Fuyang Zhou, Haiqing He, Ting Chen, Tao Zhang, Minglu Yang, Ye Yuan and Jiahao Liu
Remote Sens. 2025, 17(16), 2805; https://doi.org/10.3390/rs17162805 - 13 Aug 2025
Viewed by 319
Abstract
Cross-modal semantic segmentation of individual tree LiDAR point clouds is critical for accurately characterizing tree attributes, quantifying ecological interactions, and estimating carbon storage. However, in forest environments, this task faces key challenges such as high annotation costs and poor cross-domain generalization. To address these issues, this study proposes a cross-modal semantic transfer framework tailored for individual tree point cloud segmentation in forested scenes. Leveraging co-registered UAV-acquired RGB imagery and LiDAR data, we construct a technical pipeline of “2D semantic inference—3D spatial mapping—cross-modal fusion” to enable annotation-free semantic parsing of 3D individual trees. Specifically, we first introduce a novel Multi-Source Feature Fusion Network (MSFFNet) to achieve accurate instance-level segmentation of individual trees in the 2D image domain. Subsequently, we develop a hierarchical two-stage registration strategy to effectively align dense matched point clouds (MPC) generated from UAV imagery with LiDAR point clouds. On this basis, we propose a probabilistic cross-modal semantic transfer model that builds a semantic probability field through multi-view projection and the expectation–maximization algorithm. By integrating geometric features and semantic confidence, the model establishes semantic correspondences between 2D pixels and 3D points, thereby achieving spatially consistent semantic label mapping. This facilitates the transfer of semantic annotations from the 2D image domain to the 3D point cloud domain. The proposed method is evaluated on two forest datasets. The results demonstrate that the proposed individual tree instance segmentation approach achieves the highest performance, with an IoU of 87.60%, compared to state-of-the-art methods such as Mask R-CNN, SOLOV2, and Mask2Former. Furthermore, the cross-modal semantic label transfer framework significantly outperforms existing mainstream methods in individual tree point cloud semantic segmentation across complex forest scenarios. Full article
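The core 2D-to-3D transfer step behind such a pipeline is projecting LiDAR points into a segmented image and reading off per-pixel class probabilities; the single-view NumPy sketch below assumes known intrinsics and pose, whereas the paper aggregates multiple views with EM refinement.

```python
# Minimal single-view sketch: pinhole-project 3D points into a per-pixel class
# probability map and copy the probabilities onto the points.
import numpy as np

def transfer_labels(points, probs_2d, K, R, t):
    # points: (N, 3) LiDAR points; probs_2d: (H, W, C) per-pixel class probabilities
    # K: (3, 3) intrinsics; R, t: world-to-camera rotation (3, 3) and translation (3,)
    cam = points @ R.T + t                       # into camera frame
    valid = cam[:, 2] > 0                        # keep points in front of the camera
    uv = cam[valid] @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective divide -> pixel coords
    H, W, C = probs_2d.shape
    u = np.clip(uv[:, 0].astype(int), 0, W - 1)
    v = np.clip(uv[:, 1].astype(int), 0, H - 1)
    point_probs = np.zeros((points.shape[0], C))
    point_probs[valid] = probs_2d[v, u]          # look up class probabilities per point
    return point_probs

pts = np.random.rand(1000, 3) * 20
probs = np.random.dirichlet(np.ones(4), size=(480, 640)).astype(np.float32)
labels = transfer_labels(pts, probs,
                         K=np.array([[500, 0, 320], [0, 500, 240], [0, 0, 1.0]]),
                         R=np.eye(3), t=np.array([0, 0, 5.0])).argmax(axis=1)
```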
22 pages, 15242 KB  
Article
A Modality Alignment and Fusion-Based Method for Around-the-Clock Remote Sensing Object Detection
by Yongjun Qi, Shaohua Yang, Jiahao Chen, Meng Zhang, Jie Zhu, Xin Liu and Hongxing Zheng
Sensors 2025, 25(16), 4964; https://doi.org/10.3390/s25164964 - 11 Aug 2025
Viewed by 403
Abstract
Cross-modal remote sensing object detection holds significant potential for around-the-clock applications. However, the modality differences between cross-modal data and the degradation of feature quality under adverse weather conditions limit detection performance. To address these challenges, this paper presents a novel cross-modal remote sensing object detection framework designed to overcome two critical challenges in around-the-clock applications: (1) significant modality disparities between visible light, infrared, and synthetic aperture radar data, and (2) severe feature degradation under adverse conditions including fog and nighttime scenarios. Our primary contributions are as follows: First, we develop a multi-scale feature extraction module that employs a hierarchical convolutional architecture to capture both fine-grained details and contextual information, effectively compensating for missing or blurred features in degraded visible-light images. Second, we introduce an innovative feature interaction module that utilizes cross-attention mechanisms to establish long-range dependencies across modalities while dynamically suppressing noise interference through adaptive feature selection. Third, we propose a feature correction fusion module that performs spatial alignment of object boundaries and channel-wise optimization of global feature consistency, enabling robust fusion of complementary information from different modalities. The proposed framework is validated on visible light, infrared, and SAR modalities. Extensive experiments on three challenging datasets (LLVIP, OGSOD, and Drone Vehicle) demonstrate our framework’s superior performance, achieving state-of-the-art mean average precision scores of 66.3%, 58.6%, and 71.7%, respectively, representing significant improvements over existing methods in scenarios with modality differences or extreme weather conditions. The proposed solution not only advances the technical frontier of cross-modal object detection but also provides practical value for mission-critical applications such as 24/7 surveillance systems, military reconnaissance, and emergency response operations where reliable around-the-clock detection is essential. Full article
(This article belongs to the Section Remote Sensors)
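A feature correction and fusion step of the kind described can be sketched as predicting per-pixel offsets to align one modality's features before a channel-wise gate balances the two streams; the offsets, gating, and shapes below are illustrative, not the paper's module.

```python
# Minimal sketch: a small conv predicts per-pixel (dx, dy) offsets to warp the
# infrared feature map onto the visible-light grid, then an SE-style channel
# gate blends the two streams.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrectAndFuse(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)   # (dx, dy) per pixel
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, vis, ir):
        B, C, H, W = vis.shape
        offs = self.offset(torch.cat([vis, ir], dim=1)).permute(0, 2, 3, 1)  # (B, H, W, 2)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).to(vis).expand(B, -1, -1, -1)
        ir_aligned = F.grid_sample(ir, grid + offs, align_corners=True)      # boundary correction
        g = self.gate(torch.cat([vis.mean(dim=(2, 3)), ir_aligned.mean(dim=(2, 3))], dim=-1))
        return g[:, :, None, None] * vis + (1 - g)[:, :, None, None] * ir_aligned

fused = CorrectAndFuse()(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
```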
22 pages, 28581 KB  
Article
Remote Sensing Interpretation of Geological Elements via a Synergistic Neural Framework with Multi-Source Data and Prior Knowledge
by Kang He, Ruyi Feng, Zhijun Zhang and Yusen Dong
Remote Sens. 2025, 17(16), 2772; https://doi.org/10.3390/rs17162772 - 10 Aug 2025
Viewed by 439
Abstract
Geological elements are fundamental components of the Earth’s ecosystem, and accurately identifying their spatial distribution is essential for analyzing environmental processes, guiding land-use planning, and promoting sustainable development. Remote sensing technologies, combined with artificial intelligence algorithms, offer new opportunities for the efficient interpretation of geological features. However, in areas with dense vegetation coverage, the information directly extracted from single-source optical imagery is limited, thereby constraining interpretation accuracy. Supplementary inputs such as synthetic aperture radar (SAR), topographic features, and texture information—collectively referred to as sensitive features and prior knowledge—can improve interpretation, but their effectiveness varies significantly across time and space. This variability often leads to inconsistent performance in general-purpose models, thus limiting their practical applicability. To address these challenges, we construct a geological element interpretation dataset for Northwest China by incorporating multi-source data, including Sentinel-1 SAR imagery, Sentinel-2 multispectral imagery, sensitive features (such as the digital elevation model (DEM), texture features based on the gray-level co-occurrence matrix (GLCM), geological maps (GMs), and the normalized difference vegetation index (NDVI)), as well as prior knowledge (such as base geological maps). Using five mainstream deep learning models, we systematically evaluate the performance improvement brought by various sensitive features and prior knowledge in remote sensing-based geological interpretation. To handle disparities in spatial resolution, temporal acquisition, and noise characteristics across sensors, we further develop a multi-source complement-driven network (MCDNet) that integrates an improved feature rectification module (IFRM) and an attention-enhanced fusion module (AFM) to achieve effective cross-modal alignment and noise suppression. Experimental results demonstrate that the integration of multi-source sensitive features and prior knowledge leads to a 2.32–6.69% improvement in mIoU for geological elements interpretation, with base geological maps and topographic features contributing most significantly to accuracy gains. Full article
(This article belongs to the Special Issue Multimodal Remote Sensing Data Fusion, Analysis and Application)
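The multi-source inputs listed above (optical bands, SAR, DEM, NDVI) are typically stacked as channels for a segmentation network; the sketch below assembles such a stack with the standard NDVI formula, using assumed band names and normalization.

```python
# Minimal sketch: build a (C, H, W) input stack from Sentinel-2 bands, a
# Sentinel-1 backscatter layer, a DEM, and the derived NDVI.
import numpy as np

def build_input_stack(s2_bands, s1_vv, dem, eps=1e-6):
    # s2_bands: dict of Sentinel-2 bands as (H, W) float arrays; s1_vv, dem: (H, W)
    red, nir = s2_bands["B4"], s2_bands["B8"]
    ndvi = (nir - red) / (nir + red + eps)                       # standard NDVI
    layers = [s2_bands["B2"], s2_bands["B3"], red, nir, s1_vv, dem, ndvi]
    norm = [(x - x.mean()) / (x.std() + eps) for x in layers]    # per-layer standardization
    return np.stack(norm, axis=0)                                # (C, H, W) network input

H, W = 256, 256
bands = {b: np.random.rand(H, W).astype(np.float32) for b in ["B2", "B3", "B4", "B8"]}
x = build_input_stack(bands, np.random.rand(H, W), np.random.rand(H, W))   # (7, 256, 256)
```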
30 pages, 2469 KB  
Review
Open-Vocabulary Object Detection in UAV Imagery: A Review and Future Perspectives
by Yang Zhou, Junjie Li, Congyang Ou, Dawei Yan, Haokui Zhang and Xizhe Xue
Drones 2025, 9(8), 557; https://doi.org/10.3390/drones9080557 - 8 Aug 2025
Viewed by 966
Abstract
Due to its extensive applications, aerial image object detection has long been a hot topic in computer vision. In recent years, advancements in unmanned aerial vehicle (UAV) technology have further propelled this field to new heights, giving rise to a broader range of application requirements. However, traditional UAV aerial object detection methods primarily focus on detecting predefined categories, which significantly limits their applicability. The advent of cross-modal text–image alignment (e.g., CLIP) has overcome this limitation, enabling open-vocabulary object detection (OVOD), which can identify previously unseen objects through natural language descriptions. This breakthrough significantly enhances the intelligence and autonomy of UAVs in aerial scene understanding. This paper presents a comprehensive survey of OVOD in the context of UAV aerial scenes. We begin by aligning the core principles of OVOD with the unique characteristics of UAV vision, setting the stage for a specialized discussion. Building on this foundation, we construct a systematic taxonomy that categorizes existing OVOD methods for aerial imagery and provides a comprehensive overview of the relevant datasets. This structured review enables us to critically dissect the key challenges and open problems at the intersection of these fields. Finally, based on this analysis, we outline promising future research directions and application prospects. This survey aims to provide a clear road map and a valuable reference for both newcomers and seasoned researchers, fostering innovation in this rapidly evolving domain. We keep track of related works in a public GitHub repository. Full article
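The CLIP-style open-vocabulary step at the heart of OVOD compares region features against text embeddings of arbitrary category prompts, so new classes only require new prompts; the sketch below uses random placeholder embeddings in place of real image and text encoders.

```python
# Minimal sketch: detected region features are classified by cosine similarity
# to prompt embeddings, which stand in for a shared text-image embedding space.
import torch
import torch.nn.functional as F

def classify_regions(region_feats, class_text_embs, temperature=0.01):
    # region_feats: (R, D) from a region/proposal encoder; class_text_embs: (K, D)
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return torch.softmax(r @ t.t() / temperature, dim=-1)   # (R, K) class probabilities

prompts = ["a small car on a road", "a building roof", "a swimming pool"]  # any open vocabulary
text_embs = torch.randn(len(prompts), 512)     # would come from a text encoder such as CLIP
regions = torch.randn(10, 512)                 # would come from the detector's region heads
probs = classify_regions(regions, text_embs)
```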