
Search Results (702)

Search Parameters:
Keywords = modal representations

34 pages, 3611 KB  
Review
A Review of Multi-Sensor Fusion in Autonomous Driving
by Hui Qian, Mingchen Wang, Maotao Zhu and Hai Wang
Sensors 2025, 25(19), 6033; https://doi.org/10.3390/s25196033 - 1 Oct 2025
Abstract
Multi-modal sensor fusion has become a cornerstone of robust autonomous driving systems, enabling perception models to integrate complementary cues from cameras, LiDARs, radars, and other modalities. This survey provides a structured overview of recent advances in deep learning-based fusion methods, categorizing them by architectural paradigms (e.g., BEV-centric fusion and cross-modal attention), learning strategies, and task adaptations. We highlight two dominant architectural trends: unified BEV representation and token-level cross-modal alignment, analyzing their design trade-offs and integration challenges. Furthermore, we review a wide range of applications, from object detection and semantic segmentation to behavior prediction and planning. Despite considerable progress, real-world deployment is hindered by issues such as spatio-temporal misalignment, domain shifts, and limited interpretability. We discuss how recent developments, such as diffusion models for generative fusion, Mamba-style recurrent architectures, and large vision–language models, may unlock future directions for scalable and trustworthy perception systems. Extensive comparisons, benchmark analyses, and design insights are provided to guide future research in this rapidly evolving field.
(This article belongs to the Section Vehicular Sensing)
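
As a concrete illustration of the token-level cross-modal alignment trend highlighted in this survey, the following minimal PyTorch sketch shows camera tokens attending to LiDAR tokens with standard multi-head cross-attention; module names, dimensions, and token counts are illustrative assumptions rather than any specific method from the review.

```python
# Minimal sketch of token-level cross-modal attention: camera tokens query LiDAR tokens.
# Dimensions and names are illustrative; this is not code from the survey.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_tokens, lidar_tokens):
        # cam_tokens: (B, N_cam, dim); lidar_tokens: (B, N_lidar, dim)
        fused, _ = self.attn(query=cam_tokens, key=lidar_tokens, value=lidar_tokens)
        return self.norm(cam_tokens + fused)  # residual fusion of the two modalities

fused = CrossModalAttention()(torch.randn(2, 100, 256), torch.randn(2, 400, 256))
```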

19 pages, 7222 KB  
Article
Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025
Abstract
Early, non-invasive detection of Parkinson’s Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations—mel spectrogram, constant-Q transform (CQT), and gammatone spectrogram—into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and a Vision Transformer using the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results showed that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods using handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information. Our approach implies that multi-channel fusion could enhance the detection of discriminative acoustic biomarkers, potentially offering a more robust and effective framework for speech-based PD screening, though further validation is needed before clinical application.
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
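
The three-channel input described above can be sketched with librosa as below; the gammatone channel is approximated by a band-limited mel spectrogram because gammatone filterbanks come from a separate library, so treat the third channel as a placeholder and the parameters as assumptions.

```python
# Sketch of the mel / CQT / gammatone-style three-channel input (gammatone approximated
# by a band-limited mel spectrogram as a stand-in; parameters are illustrative).
import numpy as np
import librosa

def three_channel_input(y, sr, n_bins=84, hop_length=256):
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop_length))
    cqt = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, hop_length=hop_length)))
    gam = librosa.power_to_db(  # placeholder for a true gammatone spectrogram
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, hop_length=hop_length, fmax=4000))
    t = min(mel.shape[1], cqt.shape[1], gam.shape[1])  # align frame counts across transforms
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return np.stack([norm(mel[:, :t]), norm(cqt[:, :t]), norm(gam[:, :t])], axis=0)  # (3, freq, time)
```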

24 pages, 5484 KB  
Article
TFI-Fusion: Hierarchical Triple-Stream Feature Interaction Network for Infrared and Visible Image Fusion
by Mingyang Zhao, Shaochen Su and Hao Li
Information 2025, 16(10), 844; https://doi.org/10.3390/info16100844 - 30 Sep 2025
Abstract
As a key technology in multimodal information processing, infrared and visible image fusion holds significant application value in fields such as military reconnaissance, intelligent security, and autonomous driving. To address the limitations of existing methods, this paper proposes the Hierarchical Triple-Feature Interaction Fusion Network (TFI-Fusion). Based on a hierarchical triple-stream feature interaction mechanism, the network achieves high-quality fusion through a two-stage, separate-model processing approach: In the first stage, a single model extracts low-rank components (representing global structural features) and sparse components (representing local detail features) from source images via the Low-Rank Sparse Decomposition (LSRSD) module, while capturing cross-modal shared features using the Shared Feature Extractor (SFE). In the second stage, another model performs fusion and reconstruction: it first enhances the complementarity between low-rank and sparse features through the innovatively introduced Bi-Feature Interaction (BFI) module, realizes multi-level feature fusion via the Triple-Feature Interaction (TFI) module, and finally generates fused images with rich scene representation through feature reconstruction. This separate-model design reduces memory usage and improves operational speed. Additionally, a multi-objective optimization function is designed based on the network’s characteristics. Experiments demonstrate that TFI-Fusion exhibits excellent fusion performance, effectively preserving image details and enhancing feature complementarity, thus providing reliable visual data support for downstream tasks.
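
To make the low-rank/sparse intuition behind the LSRSD module concrete, here is a classical (non-learned) analogue using a truncated SVD in PyTorch: the top singular components give the global structure and the residual is treated as the sparse detail part. The rank and input are assumptions; this is not the paper's module.

```python
# Classical analogue of a low-rank + sparse split (illustrative, not the learned LSRSD module).
import torch

def lowrank_sparse_split(x, rank=8):
    # x: (H, W) single-channel image or feature map
    U, S, Vh = torch.linalg.svd(x, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]  # global structural part
    sparse = x - low_rank                                          # local detail / residual part
    return low_rank, sparse
```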

20 pages, 1488 KB  
Article
Attention-Fusion-Based Two-Stream Vision Transformer for Heart Sound Classification
by Kalpeshkumar Ranipa, Wei-Ping Zhu and M. N. S. Swamy
Bioengineering 2025, 12(10), 1033; https://doi.org/10.3390/bioengineering12101033 - 26 Sep 2025
Abstract
Vision Transformers (ViTs), inspired by their success in natural language processing, have recently gained attention for heart sound classification (HSC). However, most of the existing studies on HSC rely on single-stream architectures, overlooking the advantages of multi-resolution features. While multi-stream architectures employing early or late fusion strategies have been proposed, they often fall short of effectively capturing cross-modal feature interactions. Additionally, conventional fusion methods, such as concatenation, averaging, or max pooling, frequently result in information loss. To address these limitations, this paper presents a novel attention fusion-based two-stream Vision Transformer (AFTViT) architecture for HSC that leverages two-dimensional mel-cepstral domain features. The proposed method employs a ViT-based encoder to capture long-range dependencies and diverse contextual information at multiple scales. A novel attention block is then used to integrate cross-context features at the feature level, enhancing the overall feature representation. Experiments conducted on the PhysioNet2016 and PhysioNet2022 datasets demonstrate that the AFTViT outperforms state-of-the-art CNN-based methods in terms of accuracy. These results highlight the potential of the AFTViT framework for early diagnosis of cardiovascular diseases, offering a valuable tool for cardiologists and researchers in developing advanced HSC techniques.
(This article belongs to the Section Biosignal Processing)
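
The feature-level attention fusion idea (as opposed to concatenation, averaging, or max pooling) can be sketched as below; this is a generic attention-weighted combination of two pooled stream features, with dimensions assumed, not the paper's exact AFTViT block.

```python
# Generic attention-weighted fusion of two feature streams (illustrative only).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, f1, f2):
        # f1, f2: (B, dim) pooled features from the two streams
        stacked = torch.stack([f1, f2], dim=1)                 # (B, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)    # (B, 2, 1) stream weights
        return (weights * stacked).sum(dim=1)                  # (B, dim) fused feature
```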

25 pages, 1432 KB  
Article
GATransformer: A Network Threat Detection Method Based on Graph-Sequence Enhanced Transformer
by Qigang Zhu, Xiong Zhan, Wei Chen, Yuanzhi Li, Hengwei Ouyang, Tian Jiang and Yu Shen
Electronics 2025, 14(19), 3807; https://doi.org/10.3390/electronics14193807 - 25 Sep 2025
Abstract
Emerging complex multi-step attacks such as Advanced Persistent Threats (APTs) pose significant risks to national economic development, security, and social stability. Effectively detecting these sophisticated threats is a critical challenge. While deep learning methods show promise in identifying unknown malicious behaviors, they often struggle with fragmented modal information, limited feature representation, and generalization. To address these limitations, we propose GATransformer, a new dual-modal detection method that integrates topological structure analysis with temporal sequence modeling. Its core lies in a cross-attention semantic fusion mechanism, which deeply integrates heterogeneous features and effectively mitigates the constraints of unimodal representations. GATransformer reconstructs network behavior representation via a parallel processing framework in which graph attention captures intricate spatial dependencies, and self-attention focuses on modeling long-range temporal correlations. Experimental results on the CIDDS-001 and CIDDS-002 datasets demonstrate the superior performance of our method compared to baseline methods, with detection accuracies of 99.74% (nodes) and 88.28% (edges) on CIDDS-001 and 99.99% and 99.98% on CIDDS-002, respectively.
(This article belongs to the Special Issue Advances in Information Processing and Network Security)
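
A rough sketch of the parallel graph/sequence design with cross-attention fusion is given below, assuming PyTorch Geometric for the graph-attention stream; the dimensions and fusion layout are simplified assumptions, not the paper's implementation.

```python
# Simplified parallel graph + sequence encoder with cross-attention fusion (illustrative).
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class GraphSequenceEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.gat = GATConv(dim, dim, heads=2, concat=False)              # topological stream
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.seq = nn.TransformerEncoder(layer, num_layers=2)            # temporal stream
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)      # cross-attention fusion

    def forward(self, x, edge_index, flows):
        # x: (N, dim) node features; edge_index: (2, E); flows: (B, T, dim) flow sequences
        g = self.gat(x, edge_index).unsqueeze(0).repeat(flows.size(0), 1, 1)
        s = self.seq(flows)
        fused, _ = self.fuse(query=s, key=g, value=g)
        return fused  # (B, T, dim) graph-aware sequence representation
```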

23 pages, 17670 KB  
Article
UWS-YOLO: Advancing Underwater Sonar Object Detection via Transfer Learning and Orthogonal-Snake Convolution Mechanisms
by Liang Zhao, Xu Ren, Lulu Fu, Qing Yun and Jiarun Yang
J. Mar. Sci. Eng. 2025, 13(10), 1847; https://doi.org/10.3390/jmse13101847 - 24 Sep 2025
Abstract
Accurate and efficient detection of underwater targets in sonar imagery is critical for applications such as marine exploration, infrastructure inspection, and autonomous navigation. However, sonar-based object detection remains challenging due to low resolution, high noise, cluttered backgrounds, and the scarcity of annotated data. To address these issues, we propose UWS-YOLO, a novel detection framework specifically designed for underwater sonar images. The model integrates three key innovations: (1) a C2F-Ortho module that enhances multi-scale feature representation through orthogonal channel attention, improving sensitivity to small and low-contrast targets; (2) a DySnConv module that employs Dynamic Snake Convolution to adaptively capture elongated and irregular structures such as pipelines and cables; and (3) a cross-modal transfer learning strategy that pre-trains on large-scale optical underwater imagery before fine-tuning on sonar data, effectively mitigating overfitting and bridging the modality gap. Extensive evaluations on real-world sonar datasets demonstrate that UWS-YOLO achieves a mAP@0.5 of 87.1%, outperforming the YOLOv8n baseline by 3.5% and seven state-of-the-art detectors in accuracy while maintaining real-time performance at 158 FPS with only 8.8 GFLOPs. The framework exhibits strong generalization across datasets, robustness to noise, and computational efficiency on embedded devices, confirming its suitability for deployment in resource-constrained underwater environments.
(This article belongs to the Section Ocean Engineering)
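
The two-stage transfer-learning recipe (optical pre-training, then sonar fine-tuning) could look roughly like the following with the public Ultralytics YOLO API; the dataset config names are placeholders, and the paper's C2F-Ortho and DySnConv modules are not reproduced here.

```python
# Rough two-stage transfer-learning sketch; "optical_uw.yaml" and "sonar.yaml" are
# placeholder dataset configs, not files from the paper.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                                        # start from the YOLOv8n baseline
model.train(data="optical_uw.yaml", epochs=100, imgsz=640)        # stage 1: optical underwater imagery
model.train(data="sonar.yaml", epochs=50, imgsz=640, lr0=0.001)   # stage 2: fine-tune on sonar data
```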

28 pages, 14783 KB  
Article
HSSTN: A Hybrid Spectral–Structural Transformer Network for High-Fidelity Pansharpening
by Weijie Kang, Yuan Feng, Yao Ding, Hongbo Xiang, Xiaobo Liu and Yaoming Cai
Remote Sens. 2025, 17(19), 3271; https://doi.org/10.3390/rs17193271 - 23 Sep 2025
Abstract
Pansharpening fuses multispectral (MS) and panchromatic (PAN) remote sensing images to generate outputs with high spatial resolution and spectral fidelity. Nevertheless, conventional methods relying primarily on convolutional neural networks or unimodal fusion strategies frequently fail to bridge the sensor modality gap between MS and PAN data. Consequently, spectral distortion and spatial degradation often occur, limiting high-precision downstream applications. To address these issues, this work proposes a Hybrid Spectral–Structural Transformer Network (HSSTN) that enhances multi-level collaboration through comprehensive modelling of spectral–structural feature complementarity. Specifically, the HSSTN implements a three-tier fusion framework. First, an asymmetric dual-stream feature extractor employs a residual block with channel attention (RBCA) in the MS branch to strengthen spectral representation, while a Transformer architecture in the PAN branch extracts high-frequency spatial details, thereby reducing modality discrepancy at the input stage. Subsequently, a target-driven hierarchical fusion network utilises progressive crossmodal attention across scales, ranging from local textures to multi-scale structures, to enable efficient spectral–structural aggregation. Finally, a novel collaborative optimisation loss function preserves spectral integrity while enhancing structural details. Comprehensive experiments conducted on QuickBird, GaoFen-2, and WorldView-3 datasets demonstrate that HSSTN outperforms existing methods in both quantitative metrics and visual quality. Consequently, the resulting images exhibit sharper details and fewer spectral artefacts, showcasing significant advantages in high-fidelity remote sensing image fusion.
(This article belongs to the Special Issue Artificial Intelligence in Hyperspectral Remote Sensing Data Analysis)
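
A residual block with channel attention of the kind used in the MS branch can be sketched as follows; the layer sizes and reduction ratio are assumptions, not the exact RBCA configuration.

```python
# Generic residual block with squeeze-and-excitation style channel attention (illustrative).
import torch
import torch.nn as nn

class RBCA(nn.Module):
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.attn(y)  # channel re-weighting plus residual connection
```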

26 pages, 1333 KB  
Article
Category Name Expansion and an Enhanced Multimodal Fusion Framework for Few-Shot Learning
by Tianlei Gao, Lei Lyu, Xiaoyun Xie, Nuo Wei, Yushui Geng and Minglei Shu
Entropy 2025, 27(9), 991; https://doi.org/10.3390/e27090991 - 22 Sep 2025
Abstract
With the advancement of image processing techniques, few-shot learning (FSL) has gradually become a key approach to addressing the problem of data scarcity. However, existing FSL methods often rely on unimodal information under limited sample conditions, making it difficult to capture fine-grained differences between categories. To address this issue, we propose a multimodal few-shot learning method based on category name expansion and image feature enhancement. By integrating the expanded category text with image features, the proposed method enriches the semantic representation of categories and enhances the model’s sensitivity to detailed features. To further improve the quality of cross-modal information transfer, we introduce a cross-modal residual connection strategy that aligns features across layers through progressive fusion. This approach enables the fused representations to maximize mutual information while reducing redundancy, effectively alleviating the information bottleneck caused by uneven entropy distribution between modalities and enhancing the model’s generalization ability. Experimental results demonstrate that our method achieves superior performance on both natural image datasets (CIFAR-FS and FC100) and a medical image dataset.
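
One simple way to picture the category-name-expansion idea is to average text embeddings of the expanded names and mix them into the visual prototype; the sketch below uses the public Hugging Face CLIP weights, and the prompt wording and mixing weight are assumptions, not the paper's method.

```python
# Illustrative text-enriched class prototype using public CLIP weights (not the paper's method).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def fused_prototype(image_proto, expanded_names, alpha=0.5):
    # expanded_names: e.g. ["malamute", "malamute, a large sled dog with a thick grey coat"]
    inputs = processor(text=expanded_names, return_tensors="pt", padding=True)
    text = model.get_text_features(**inputs).mean(dim=0)
    text = text / text.norm()
    visual = image_proto / image_proto.norm()   # image_proto: (512,) mean of support-set features
    return alpha * visual + (1 - alpha) * text  # multimodal class prototype
```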

18 pages, 4817 KB  
Article
A Multimodal Deep Learning Framework for Accurate Wildfire Segmentation Using RGB and Thermal Imagery
by Tao Yue, Hong Huang, Qingyang Wang, Bo Song and Yun Chen
Appl. Sci. 2025, 15(18), 10268; https://doi.org/10.3390/app151810268 - 21 Sep 2025
Abstract
Wildfires pose serious threats to ecosystems, human life, and climate stability, underscoring the urgent need for accurate monitoring. Traditional approaches based on either optical or thermal imagery often fail under challenging conditions such as lighting interference, varying data sources, or small-scale flames, as they do not account for the hierarchical nature of feature representations. To overcome these limitations, we propose a multimodal deep learning framework that integrates visible (RGB) and thermal infrared (TIR) imagery for accurate wildfire segmentation. The framework incorporates edge-guided supervision and multilevel fusion to capture fine fire boundaries while exploiting complementary information from both modalities. To assess its effectiveness, we constructed a multi-scale flame segmentation dataset and validated the method across diverse conditions, including different data sources, lighting environments, and five flame size categories ranging from small to large. Experimental results show that BFCNet achieves an IoU of 88.25% and an F1 score of 93.76%, outperforming both single-modality and existing multimodal approaches across all evaluation tasks. These results demonstrate the potential of multimodal deep learning to enhance wildfire monitoring, offering practical value for disaster management, ecological protection, and the deployment of autonomous aerial surveillance systems.
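
A minimal dual-branch RGB + thermal segmentation layout (two small encoders, channel concatenation, per-pixel head) is sketched below; it is a toy stand-in for BFCNet's multilevel fusion, with all sizes assumed.

```python
# Toy dual-branch RGB + thermal fusion for segmentation (illustrative, not BFCNet).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class RGBTSegNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.rgb_enc = conv_block(3, 32)   # visible branch
        self.tir_enc = conv_block(1, 32)   # thermal-infrared branch
        self.head = nn.Sequential(conv_block(64, 32), nn.Conv2d(32, num_classes, 1))

    def forward(self, rgb, tir):
        fused = torch.cat([self.rgb_enc(rgb), self.tir_enc(tir)], dim=1)  # single-level fusion
        return self.head(fused)  # per-pixel fire / background logits
```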

8 pages, 541 KB  
Perspective
Rethinking Metabolic Imaging: From Static Snapshots to Metabolic Intelligence
by Giuseppe Maulucci
Biophysica 2025, 5(3), 42; https://doi.org/10.3390/biophysica5030042 - 19 Sep 2025
Abstract
Metabolic imaging is undergoing a fundamental transformation. Traditionally confined to static representations of metabolite distribution through modalities such as PET, MRS, and MSOT, imaging has offered only partial glimpses into the dynamic and systemic nature of metabolism. This Perspective envisions a shift toward dynamic metabolic intelligence—an integrated framework where real-time imaging is fused with physics-informed models, artificial intelligence, and wearable data to create adaptive, predictive representations of metabolic function. We explore how novel technologies like hyperpolarized MRI and time-resolved optoacoustics can serve as dynamic inputs into digital twin systems, enabling closed-loop feedback that not only visualizes but actively guides clinical decisions. From early detection of metabolic drift to in silico therapy simulation, we highlight translational pathways across oncology, cardiology, neurology, and space medicine. Finally, we call for a cross-disciplinary effort to standardize, validate, and ethically implement these systems, marking the emergence of a new paradigm: metabolism as a navigable, model-informed space of precision medicine.
(This article belongs to the Collection Feature Papers in Biophysics)

18 pages, 6012 KB  
Article
Vision-AQ: Explainable Multi-Modal Deep Learning for Air Pollution Classification in Smart Cities
by Faisal Mehmood, Sajid Ur Rehman and Ahyoung Choi
Mathematics 2025, 13(18), 3017; https://doi.org/10.3390/math13183017 - 18 Sep 2025
Abstract
Accurate air quality prediction (AQP) is crucial for safeguarding public health and guiding smart city management. However, reliable assessment remains challenging due to complex emission patterns, meteorological variability, and chemical interactions, compounded by the limited coverage of ground-based monitoring networks. To address this gap, we propose Vision-AQ (Visual Integrated Operational Network for Air Quality), a novel multi-modal deep learning framework that classifies Air Quality Index (AQI) levels by integrating environmental imagery with pollutant data. Vision-AQ employs a dual-input neural architecture: (1) a pre-trained ResNet50 convolutional neural network (CNN) that extracts high-level features from city-scale environmental photographs in India and Nepal, capturing haze, smog, and visibility patterns, and (2) a multi-layer perceptron (MLP) that processes tabular sensor data, including PM2.5, PM10, and AQI values. The fused representations are passed to a classifier to predict six AQI categories. Trained on a comprehensive dataset, the model achieves strong predictive performance, with accuracy, precision, recall, and F1-score all at 99%, using 23.7 million parameters. To ensure interpretability, we use Grad-CAM visualization to highlight the model’s reliance on meaningful atmospheric features, confirming its explainability. The results demonstrate that Vision-AQ is a reliable, scalable, and cost-effective approach for localized AQI classification, offering the potential to augment conventional monitoring networks and enable more granular air quality management in urban South Asia.
(This article belongs to the Special Issue Explainable and Trustworthy AI Models for Data Analytics)
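
The dual-input layout (ResNet50 on imagery plus an MLP on tabular pollutant values, fused for six-class AQI prediction) can be sketched with torchvision as below; layer widths are assumptions.

```python
# Sketch of a dual-input image + tabular classifier in the spirit of Vision-AQ (sizes assumed).
import torch
import torch.nn as nn
from torchvision import models

class DualInputAQI(nn.Module):
    def __init__(self, num_tabular=3, num_classes=6):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()                       # expose 2048-d image features
        self.cnn = backbone
        self.mlp = nn.Sequential(nn.Linear(num_tabular, 64), nn.ReLU(), nn.Linear(64, 64))
        self.classifier = nn.Linear(2048 + 64, num_classes)

    def forward(self, image, tabular):
        # image: (B, 3, 224, 224); tabular: (B, num_tabular) e.g. PM2.5, PM10, AQI
        feats = torch.cat([self.cnn(image), self.mlp(tabular)], dim=1)
        return self.classifier(feats)  # logits over the six AQI categories
```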

21 pages, 1694 KB  
Article
Integrating Temporal Interest Dynamics and Virality Factors for High-Precision Ranking in Big Data Recommendation
by Zhaoyang Ye, Jingyi Yang, Fanyu Meng, Manzhou Li and Yan Zhan
Electronics 2025, 14(18), 3687; https://doi.org/10.3390/electronics14183687 - 18 Sep 2025
Abstract
In large-scale recommendation scenarios, achieving high-precision ranking requires simultaneously modeling user interest dynamics and content propagation potential. In this work, we propose a unified framework that integrates a temporal interest modeling stream with a multimodal virality encoder. The temporal stream captures sequential user behavior through the self-attention-based modeling of long-term and short-term interests, while the virality encoder learns latent virality factors from heterogeneous modalities, including text, images, audio, and user comments. The two streams are fused in the ranking layer to form a joint representation that balances personalized preference with content dissemination potential. To further enhance efficiency, we design hierarchical cascade heads with gating recursion for progressive refinement, along with a multi-level pruning and cache management strategy that reduces redundancy during inference. Experiments on three real-world datasets (Douyin, Bilibili, and MIND) demonstrate that our method achieves significant improvements over state-of-the-art baselines across multiple metrics. Additional analyses confirm the interpretability of the virality factors and highlight their positive correlation with real-world popularity indicators. These results validate the effectiveness and practicality of our approach for high-precision recommendation in big data environments.
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)
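
A stripped-down two-stream ranking head in the spirit of this framework is sketched below: self-attention over the behaviour sequence for temporal interests, an MLP over multimodal virality features, and a joint score; all dimensions and the fusion are simplifications.

```python
# Simplified two-stream ranking head (temporal interest + virality factors); illustrative only.
import torch
import torch.nn as nn

class TwoStreamRanker(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.interest = nn.TransformerEncoder(layer, num_layers=2)       # temporal interest stream
        self.virality = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())    # virality factor stream
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, behavior_seq, item_virality):
        # behavior_seq: (B, T, dim) embeddings of past interactions
        # item_virality: (B, dim) multimodal virality features of the candidate item
        user = self.interest(behavior_seq)[:, -1]                        # last-step interest summary
        joint = torch.cat([user, self.virality(item_virality)], dim=-1)
        return self.score(joint).squeeze(-1)                             # ranking score per candidate
```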

25 pages, 6169 KB  
Article
Processing Written Language in Video Games: An Eye-Tracking Study on Subtitled Instructions
by Haiting Lan, Sixin Liao, Jan-Louis Kruger and Michael J. Richardson
J. Eye Mov. Res. 2025, 18(5), 44; https://doi.org/10.3390/jemr18050044 - 17 Sep 2025
Abstract
Written language is a common component among the multimodal representations that help players construct meanings and guide actions in video games. However, how players process texts in video games remains underexplored. To address this, the current exploratory eye-tracking study examines how players processed subtitled instructions and resultant game performance. Sixty-four participants were recruited to play a video game set in a foggy desert, where they were guided by subtitled instructions to locate, corral, and contain robot agents (targets). These instructions were manipulated into three modalities: visual-only (with subtitled instructions only), auditory-only (with spoken instructions), and visual–auditory (with both subtitled and spoken instructions). The instructions were addressed to participants (as relevant subtitles) or their AI teammates (as irrelevant subtitles). Subtitle-level results of eye movements showed that participants primarily focused on the relevant subtitles, as evidenced by more fixations and higher dwell time percentages. Moreover, the word-level results indicate that participants showed lower skipping rates, more fixations, and higher dwell time percentages on words loaded with immediate action-related information, especially in the absence of audio. No significant differences were found in player performance across conditions. The findings of this study contribute to a better understanding of subtitle processing in video games and, more broadly, text processing in multimedia contexts. Implications for future research on digital literacy and computer-mediated text processing are discussed.

25 pages, 4796 KB  
Article
Vision-Language Guided Semantic Diffusion Sampling for Small Object Detection in Remote Sensing Imagery
by Jian Ma, Mingming Bian, Fan Fan, Hui Kuang, Lei Liu, Zhibing Wang, Ting Li and Running Zhang
Remote Sens. 2025, 17(18), 3203; https://doi.org/10.3390/rs17183203 - 17 Sep 2025
Abstract
Synthetic aperture radar (SAR), with its all-weather and all-day active imaging capability, has become indispensable for geoscientific analysis and socio-economic applications. Despite advances in deep learning–based object detection, the rapid and accurate detection of small objects in SAR imagery remains a major challenge due to their extremely limited pixel representation, blurred boundaries in dense distributions, and the imbalance of positive–negative samples during training. Recently, vision–language models such as Contrastive Language-Image Pre-Training (CLIP) have attracted widespread research interest for their powerful cross-modal semantic modeling capabilities. Nevertheless, their potential to guide precise localization and detection of small objects in SAR imagery has not yet been fully exploited. To overcome these limitations, we propose the CLIP-Driven Adaptive Tiny Object Detection Diffusion Network (CDATOD-Diff). This framework introduces a CLIP image–text encoding-guided dynamic sampling strategy that leverages cross-modal semantic priors to alleviate the scarcity of effective positive samples. Furthermore, a generative diffusion-based module reformulates the sampling process through iterative denoising, enhancing contextual awareness. To address regression instability, we design a Balanced Corner–IoU (BC-IoU) loss, which decouples corner localization from scale variation and reduces sensitivity to minor positional errors, thereby stabilizing bounding box predictions. Extensive experiments conducted on multiple SAR and optical remote sensing datasets demonstrate that CDATOD-Diff achieves state-of-the-art performance, delivering significant improvements in detection robustness and localization accuracy under challenging small-object scenarios with complex backgrounds and dense distributions.
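
For orientation, a generic corner-distance plus IoU regression loss is sketched below; this is a common pattern in box regression and only a loose analogue of the BC-IoU loss described here, whose exact formulation is not reproduced.

```python
# Generic corner-distance + IoU box-regression loss (a loose analogue, not the paper's BC-IoU).
import torch

def corner_iou_loss(pred, target, w_corner=1.0, w_iou=1.0):
    # pred, target: (N, 4) boxes in (x1, y1, x2, y2) format
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    diag = ((target[:, 2] - target[:, 0]) ** 2 + (target[:, 3] - target[:, 1]) ** 2).sqrt() + 1e-7
    corner = ((pred[:, :2] - target[:, :2]).norm(dim=1) +
              (pred[:, 2:] - target[:, 2:]).norm(dim=1)) / diag   # scale-normalised corner distance
    return (w_iou * (1.0 - iou) + w_corner * corner).mean()
```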

17 pages, 1773 KB  
Article
CrossInteraction: Multi-Modal Interaction and Alignment Strategy for 3D Perception
by Weiyi Zhao, Xinxin Liu and Yu Ding
Sensors 2025, 25(18), 5775; https://doi.org/10.3390/s25185775 - 16 Sep 2025
Abstract
Cameras and LiDAR are the primary sensors utilized in contemporary 3D object perception, leading to the development of various multi-modal detection algorithms for images, point clouds, and their fusion. Given the demanding accuracy requirements in autonomous driving environments, traditional multi-modal fusion techniques often overlook critical information from individual modalities and struggle to effectively align transformed features. In this paper, we introduce an improved modal interaction strategy, called CrossInteraction. This method enhances the interaction between modalities by using the output of the first modal representation as the input for the second interaction enhancement, resulting in better overall interaction effects. To further address the challenge of feature alignment errors, we employ a graph convolutional network. Finally, the prediction process is completed through a cross-attention mechanism, ensuring more accurate detection outcomes.
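
The chained interaction idea (the output of the first modal interaction feeds the second enhancement step) can be pictured with two stacked cross-attention layers, as in the loose sketch below; shapes and names are assumptions, and the paper's graph-convolution alignment stage is omitted.

```python
# Loose sketch of a two-step modal interaction with stacked cross-attention (illustrative).
import torch
import torch.nn as nn

class TwoStepInteraction(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.inter1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, pts_tokens):
        # img_tokens: (B, N_img, dim) image features; pts_tokens: (B, N_pts, dim) point-cloud features
        first, _ = self.inter1(query=img_tokens, key=pts_tokens, value=pts_tokens)
        second, _ = self.inter2(query=first, key=pts_tokens, value=pts_tokens)  # reuse first output
        return second
```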
