Search Results (1,285)

Search Parameters:
Keywords = cross-feature fusion

17 pages, 36646 KB  
Article
A Two-Stage Approach for Infrared and Visible Image Fusion and Segmentation
by Wang Ren, Lanhua Luo and Jia Ren
Appl. Sci. 2025, 15(19), 10698; https://doi.org/10.3390/app151910698 - 3 Oct 2025
Abstract
Early studies rarely considered cascades among multiple tasks such as image fusion and semantic segmentation, and most image fusion methods ignore the interrelationship between fusion and segmentation. We propose a new two-stage infrared and visible image fusion and segmentation method called TSFS. By cascading the fusion module and the segmentation module in the first stage, we obtain better fusion results and enrich the semantic information passed to the second stage. The first-stage fusion module uses a feature extraction module (FEM) to extract deep features and then fuses them through a feature mixture fusion module (FMFM). To enhance the fusion of multimodal data in the second-stage segmentation network, we propose a Cross-Semantic Fusion Attention Module (CSFAM) to cross-fuse these features. Experimental evaluations on public datasets show that, compared with state-of-the-art methods, TSFS improves segmentation mIoU by 1.5% and 3.3% on the FMB and MFNet datasets, respectively, and produces visually better fused images.
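The abstract does not spell out CSFAM; as a rough, hedged sketch of what cross-fusing two modality feature maps with attention can look like, the PyTorch snippet below lets infrared and visible features query each other with a standard `torch.nn.MultiheadAttention` layer. The module name, channel sizes, and 1x1 projection are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Toy cross-attention fusion of two feature maps (not the paper's CSFAM)."""
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.attn_ir2vis = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_vis2ir = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_ir.shape
        # Flatten the spatial grid into a token sequence: (B, H*W, C).
        ir = feat_ir.flatten(2).transpose(1, 2)
        vis = feat_vis.flatten(2).transpose(1, 2)
        # Each modality queries the other, so complementary details flow both ways.
        ir_enh, _ = self.attn_ir2vis(ir, vis, vis)
        vis_enh, _ = self.attn_vis2ir(vis, ir, ir)
        fused = torch.cat([ir_enh, vis_enh], dim=-1).transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.proj(fused)

fused = CrossModalFusion()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```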

21 pages, 2248 KB  
Article
TSFNet: Temporal-Spatial Fusion Network for Hybrid Brain-Computer Interface
by Yan Zhang, Bo Yin and Xiaoyang Yuan
Sensors 2025, 25(19), 6111; https://doi.org/10.3390/s25196111 - 3 Oct 2025
Abstract
Unimodal brain–computer interfaces (BCIs) often suffer from inherent limitations because they rely on a single modality. While hybrid BCIs combining electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) offer complementary advantages, effectively integrating their spatiotemporal features remains a challenge due to inherent signal asynchrony. This study aims to develop a novel deep fusion network that synergistically integrates EEG and fNIRS signals to improve classification performance across different tasks. We propose a Temporal-Spatial Fusion Network (TSFNet), which consists of two key sublayers: the EEG-fNIRS-guided Fusion (EFGF) layer and the Cross-Attention-based Feature Enhancement (CAFÉ) layer. The EFGF layer extracts temporal features from EEG and spatial features from fNIRS to generate a hybrid attention map, enabling more effective and complementary integration of spatiotemporal information. The CAFÉ layer enables bidirectional interaction between fNIRS and fusion features via a cross-attention mechanism, which enhances the fusion features and selectively filters informative fNIRS representations. Through these two sublayers, TSFNet achieves deep fusion of multimodal features. Finally, TSFNet is evaluated on motor imagery (MI), mental arithmetic (MA), and word generation (WG) classification tasks. Experimental results demonstrate that TSFNet achieves superior classification performance, with average accuracies of 70.18% for MI, 86.26% for MA, and 81.13% for WG, outperforming existing state-of-the-art multimodal algorithms. These findings suggest that TSFNet provides an effective solution for spatiotemporal feature fusion in hybrid BCIs, with potential applications in real-world BCI systems.
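As a minimal sketch of the gating idea the EFGF layer describes (temporal EEG features and spatial fNIRS features jointly producing a hybrid attention map), the snippet below gates a fused representation with a sigmoid map. All layer shapes and operations are assumptions for illustration, not TSFNet's actual layers.

```python
import torch
import torch.nn as nn

class HybridGateFusion(nn.Module):
    """Toy EEG/fNIRS fusion: temporal EEG features and spatial fNIRS features
    jointly produce a sigmoid attention map that gates the fused representation."""
    def __init__(self, eeg_ch: int = 30, fnirs_ch: int = 36, d: int = 64):
        super().__init__()
        self.eeg_temporal = nn.Sequential(nn.Conv1d(eeg_ch, d, kernel_size=7, padding=3), nn.ReLU())
        self.fnirs_spatial = nn.Sequential(nn.Linear(fnirs_ch, d), nn.ReLU())
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

    def forward(self, eeg: torch.Tensor, fnirs: torch.Tensor) -> torch.Tensor:
        # eeg: (B, eeg_ch, T), fnirs: (B, fnirs_ch)
        t_feat = self.eeg_temporal(eeg).mean(dim=-1)       # (B, d), pooled over time
        s_feat = self.fnirs_spatial(fnirs)                 # (B, d)
        attn = self.gate(torch.cat([t_feat, s_feat], -1))  # hybrid attention map
        return attn * t_feat + (1.0 - attn) * s_feat       # gated fusion

out = HybridGateFusion()(torch.randn(2, 30, 256), torch.randn(2, 36))
print(out.shape)  # torch.Size([2, 64])
```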

20 pages, 57579 KB  
Article
Radar–Camera Fusion in Perspective View and Bird’s Eye View for 3D Object Detection
by Yuhao Xiao, Xiaoqing Chen, Yingkai Wang and Zhongliang Fu
Sensors 2025, 25(19), 6106; https://doi.org/10.3390/s25196106 - 3 Oct 2025
Abstract
Three-dimensional object detection based on the fusion of millimeter-wave radar and cameras is gaining increasing attention owing to its low cost, high accuracy, and strong robustness. Recently, the bird's eye view (BEV) fusion paradigm has dominated radar–camera fusion-based 3D object detection. In this paradigm, detection accuracy is jointly determined by the precision of the image BEV features and the radar BEV features. The precision of image BEV features depends heavily on depth estimation accuracy, yet estimating depth from a monocular image is an inherently ill-posed problem. In this article, we propose a novel approach that enhances depth estimation accuracy by fusing camera perspective view (PV) features with radar perspective view features, thereby improving the precision of the image BEV features. The refined image BEV features are then fused with radar BEV features to achieve more accurate 3D object detection. To realize PV fusion, we designed a radar image generation module based on radar cross-section (RCS) and depth information that accurately projects radar data into the camera view to generate radar images, from which radar PV features are extracted. We also present an attention-based cross-modal feature fusion module that dynamically fuses radar PV features with camera PV features. Comprehensive evaluations on the nuScenes 3D object detection dataset demonstrate that the proposed dual-view fusion paradigm outperforms the BEV fusion paradigm, achieving state-of-the-art performance with 64.2 NDS and 56.3 mAP.
(This article belongs to the Section Sensing and Imaging)
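The core of radar image generation is projecting radar returns into the camera view; the sketch below performs a generic pinhole projection and rasterizes depth and RCS into a two-channel radar image. The intrinsics, value ranges, and channel layout are made up for illustration and are not the paper's module.

```python
import numpy as np

def radar_to_image(points, rcs, K, image_hw=(900, 1600)):
    """Project radar points (N, 3), given in the camera frame, onto the image plane
    and rasterize a 2-channel 'radar image' holding depth and RCS per hit pixel."""
    h, w = image_hw
    radar_img = np.zeros((2, h, w), dtype=np.float32)
    z = points[:, 2]
    valid = z > 0.1                                    # keep points in front of the camera
    pts, z, rcs = points[valid], z[valid], rcs[valid]
    uv = (K @ pts.T).T                                 # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    radar_img[0, v[inside], u[inside]] = z[inside]     # depth channel
    radar_img[1, v[inside], u[inside]] = rcs[inside]   # RCS channel
    return radar_img

K = np.array([[1266.0, 0.0, 800.0], [0.0, 1266.0, 450.0], [0.0, 0.0, 1.0]])
img = radar_to_image(np.random.uniform([-20, -2, 1], [20, 2, 60], size=(128, 3)),
                     np.random.uniform(0, 30, size=128), K)
print(img.shape)  # (2, 900, 1600)
```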

17 pages, 1322 KB  
Article
Robust 3D Object Detection in Complex Traffic via Unified Feature Alignment in Bird’s Eye View
by Ajian Liu, Yandi Zhang, Huichao Shi and Juan Chen
World Electr. Veh. J. 2025, 16(10), 567; https://doi.org/10.3390/wevj16100567 - 2 Oct 2025
Abstract
Reliable three-dimensional (3D) object detection is critical for intelligent vehicles to ensure safety in complex traffic environments, and recent progress in multi-modal sensor fusion, particularly between LiDAR and camera, has advanced environment perception in urban driving. However, existing approaches remain vulnerable to occlusions and dense traffic, where depth estimation errors, calibration deviations, and cross-modal misalignment are often exacerbated. To overcome these limitations, we propose BEVAlign, a local–global feature alignment framework designed to generate unified BEV representations from heterogeneous sensor modalities. The framework incorporates a Local Alignment (LA) module that enhances camera-to-BEV view transformation through graph-based neighbor modeling and dual-depth encoding, mitigating local misalignment caused by depth estimation errors. To further address global misalignment in BEV representations, we present a Global Alignment (GA) module comprising a bidirectional deformable cross-attention (BDCA) mechanism and CBR blocks. BDCA employs dual queries from LiDAR and camera to jointly predict spatial sampling offsets and aggregate features, enabling bidirectional alignment within the BEV domain. The stacked CBR blocks then refine and integrate the aligned features into unified BEV representations. Experiments on the nuScenes benchmark highlight the effectiveness of BEVAlign, which achieves 71.7% mAP, outperforming BEVFusion by 1.5%. Notably, it performs strongly on small and occluded objects, particularly in dense traffic scenarios. These findings provide a basis for advancing cooperative environment perception in next-generation intelligent vehicle systems.
(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)
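As a hedged sketch of bidirectional alignment with dual queries, the snippet below lets camera and LiDAR BEV features attend to each other with standard cross-attention (standing in for the paper's deformable BDCA) and refines the result with a Conv-BN-ReLU stack, assuming that is what the CBR blocks denote.

```python
import torch
import torch.nn as nn

class BidirectionalBEVAlign(nn.Module):
    """Toy global alignment: LiDAR and camera BEV features query each other with
    standard cross-attention, then a Conv-BN-ReLU stack refines the concatenation."""
    def __init__(self, c: int = 64, heads: int = 4):
        super().__init__()
        self.cam_from_lidar = nn.MultiheadAttention(c, heads, batch_first=True)
        self.lidar_from_cam = nn.MultiheadAttention(c, heads, batch_first=True)
        self.cbr = nn.Sequential(
            nn.Conv2d(2 * c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))

    def forward(self, bev_cam, bev_lidar):
        b, c, h, w = bev_cam.shape
        cam = bev_cam.flatten(2).transpose(1, 2)             # (B, H*W, C)
        lid = bev_lidar.flatten(2).transpose(1, 2)
        cam_aligned, _ = self.cam_from_lidar(cam, lid, lid)  # camera queries LiDAR
        lid_aligned, _ = self.lidar_from_cam(lid, cam, cam)  # LiDAR queries camera
        fused = torch.cat([cam_aligned, lid_aligned], -1).transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.cbr(fused)

out = BidirectionalBEVAlign()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```
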
18 pages, 11220 KB  
Article
LM3D: Lightweight Multimodal 3D Object Detection with an Efficient Fusion Module and Encoders
by Yuto Sakai, Tomoyasu Shimada, Xiangbo Kong and Hiroyuki Tomiyama
Appl. Sci. 2025, 15(19), 10676; https://doi.org/10.3390/app151910676 - 2 Oct 2025
Abstract
In recent years, the demand for both high accuracy and real-time performance in 3D object detection has increased alongside the advancement of autonomous driving technology. While multimodal methods that integrate LiDAR and camera data have demonstrated high accuracy, these methods often have high computational costs and latency. To address these issues, we propose an efficient 3D object detection network that integrates three key components: a DepthWise Lightweight Encoder (DWLE) module for efficient feature extraction, an Efficient LiDAR Image Fusion (ELIF) module that combines channel attention with cross-modal feature interaction, and a Mixture of CNN and Point Transformer (MCPT) module for capturing rich spatial contextual information. Experimental results on the KITTI dataset demonstrate that our proposed method outperforms existing approaches by achieving approximately 0.6% higher 3D mAP, 7.6% faster inference speed, and 17.0% fewer parameters. These results highlight the effectiveness of our approach in balancing accuracy, speed, and model size, making it a promising solution for real-time applications in autonomous driving.
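A rough sketch of the two lightweight ingredients named above, under assumed shapes: a depthwise-separable encoder block (in the spirit of DWLE) and an SE-style channel-attention fusion of LiDAR and image features (in the spirit of ELIF). Neither is the paper's exact module.

```python
import torch
import torch.nn as nn

class DepthwiseEncoderBlock(nn.Module):
    """Depthwise-separable convolution block, a common lightweight encoder pattern."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)  # per-channel spatial conv
        self.pw = nn.Conv2d(c_in, c_out, 1)                          # pointwise channel mixing
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pw(self.dw(x)))

class ChannelAttentionFusion(nn.Module):
    """Fuse LiDAR and image features with SE-style channel attention over the concat."""
    def __init__(self, c: int = 64, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * c, 2 * c // r), nn.ReLU(inplace=True),
            nn.Linear(2 * c // r, 2 * c), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, lidar_feat, img_feat):
        x = torch.cat([lidar_feat, img_feat], dim=1)   # (B, 2C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                # global pooling -> channel weights
        x = x * w[:, :, None, None]                    # reweight channels
        return self.proj(x)

enc = DepthwiseEncoderBlock(32, 64)
fuse = ChannelAttentionFusion(64)
out = fuse(enc(torch.randn(1, 32, 64, 64)), torch.randn(1, 64, 64, 64))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```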

28 pages, 32809 KB  
Article
LiteSAM: Lightweight and Robust Feature Matching for Satellite and Aerial Imagery
by Boya Wang, Shuo Wang, Yibin Han, Linfeng Xu and Dong Ye
Remote Sens. 2025, 17(19), 3349; https://doi.org/10.3390/rs17193349 - 1 Oct 2025
Abstract
We present LiteSAM, a lightweight satellite–aerial feature matching framework for robust UAV absolute visual localization (AVL) in GPS-denied environments. Existing satellite–aerial matching methods struggle with large appearance variations, texture-scarce regions, and limited efficiency for real-time UAV applications. LiteSAM integrates three key components to address these issues. First, efficient multi-scale feature extraction optimizes representation and reduces inference latency on edge devices. Second, a Token Aggregation–Interaction Transformer (TAIFormer) with a convolutional token mixer (CTM) models inter- and intra-image correlations, enabling robust global–local feature fusion. Third, a MinGRU-based dynamic subpixel refinement module adaptively learns spatial offsets, enhancing subpixel-level matching accuracy and cross-scenario generalization. Experiments show that LiteSAM achieves competitive performance across multiple datasets. On UAV-VisLoc, LiteSAM attains an RMSE@30 of 17.86 m, outperforming state-of-the-art semi-dense methods such as EfficientLoFTR. Its optimized variant, LiteSAM (opt., without dual softmax), delivers inference times of 61.98 ms on standard GPUs and 497.49 ms on NVIDIA Jetson AGX Orin, which are 22.9% and 19.8% faster than EfficientLoFTR (opt.), respectively. With 6.31M parameters, 2.4× fewer than EfficientLoFTR's 15.05M, LiteSAM is well suited to edge deployment. Extensive evaluations on natural image matching and downstream vision tasks confirm its accuracy and efficiency for general feature matching.
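The MinGRU referenced above is a gated recurrent cell whose gate and candidate depend only on the current input; the sketch below implements that sequential form and uses it to regress a bounded subpixel offset. The refinement head, feature shapes, and offset range are assumptions, not LiteSAM's module.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Minimal GRU (sequential form): h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t,
    where the gate z_t and candidate h_tilde_t depend only on the input."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)
        self.to_h = nn.Linear(d_in, d_hidden)

    def forward(self, x):                       # x: (B, T, d_in)
        b, t, _ = x.shape
        h = torch.zeros(b, self.to_h.out_features, device=x.device)
        outs = []
        for step in range(t):
            z = torch.sigmoid(self.to_z(x[:, step]))
            h_tilde = self.to_h(x[:, step])
            h = (1 - z) * h + z * h_tilde
            outs.append(h)
        return torch.stack(outs, dim=1)         # (B, T, d_hidden)

class SubpixelRefiner(nn.Module):
    """Regress a bounded (dx, dy) offset from a sequence of local match descriptors."""
    def __init__(self, d_in: int = 32, d_hidden: int = 64):
        super().__init__()
        self.rnn = MinGRU(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, 2)

    def forward(self, descriptors):             # (B, T, d_in)
        return torch.tanh(self.head(self.rnn(descriptors)[:, -1]))  # offsets in [-1, 1]

offsets = SubpixelRefiner()(torch.randn(4, 9, 32))
print(offsets.shape)  # torch.Size([4, 2])
```
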
26 pages, 4710 KB  
Article
Research on Safe Multimodal Detection Method of Pilot Visual Observation Behavior Based on Cognitive State Decoding
by Heming Zhang, Changyuan Wang and Pengbo Wang
Multimodal Technol. Interact. 2025, 9(10), 103; https://doi.org/10.3390/mti9100103 - 1 Oct 2025
Abstract
Pilot visual behavior safety assessment is a cross-disciplinary technology that analyzes pilots' gaze behavior and neurocognitive responses. This paper proposes a multimodal analysis method for pilot visual behavior safety, aimed at cognitive state decoding and at a quantitative, efficient assessment of pilots' observational behavior. To address the subjective limitations of traditional methods, we build an observational behavior detection model that integrates facial images for dynamic, quantitative analysis, and we tackle the "Midas touch" problem of observational behavior by constructing a cognitive analysis method over multimodal signals. A bidirectional long short-term memory (LSTM) network that matches the rhythmic features of physiological signals addresses the problem of isolated features in multidimensional signals, capturing dynamic correlations among physiological behaviors, such as prefrontal theta activity and chest–abdominal coordination, to decode the cognitive state underlying pilots' observational behavior. Finally, a decision-level fusion method based on improved Dempster–Shafer (DS) evidence theory provides a quantifiable detection strategy for aviation safety standards. This dual-dimensional "visual behavior–neurophysiological cognition" assessment system reveals dynamic correlations between visual behavior and cognitive state among pilots of varying experience, and can offer a new paradigm for pilot neuroergonomics training and early warning of vestibular–visual integration disorders.
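Decision-level fusion with Dempster–Shafer evidence theory combines mass functions from independent sources; the snippet below implements the classical combination rule. The paper uses an improved DS variant, and the frame of discernment and mass values here are invented for illustration.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions (keys are frozenset focal elements) with
    Dempster's rule: multiply masses, keep intersections, renormalize by conflict."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb                  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

SAFE, UNSAFE = frozenset({"safe"}), frozenset({"unsafe"})
EITHER = SAFE | UNSAFE
m_gaze = {SAFE: 0.6, UNSAFE: 0.1, EITHER: 0.3}     # evidence from gaze behavior
m_physio = {SAFE: 0.5, UNSAFE: 0.2, EITHER: 0.3}   # evidence from physiological signals
print(dempster_combine(m_gaze, m_physio))
```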

19 pages, 7222 KB  
Article
Multi-Channel Spectro-Temporal Representations for Speech-Based Parkinson’s Disease Detection
by Hadi Sedigh Malekroodi, Nuwan Madusanka, Byeong-il Lee and Myunggi Yi
J. Imaging 2025, 11(10), 341; https://doi.org/10.3390/jimaging11100341 - 1 Oct 2025
Abstract
Early, non-invasive detection of Parkinson's Disease (PD) using speech analysis offers promise for scalable screening. In this work, we propose a multi-channel spectro-temporal deep-learning approach for PD detection from sentence-level speech, a clinically relevant yet underexplored modality. We extract and fuse three complementary time–frequency representations, namely the mel spectrogram, the constant-Q transform (CQT), and the gammatone spectrogram, into a three-channel input analogous to an RGB image. This fused representation is evaluated across CNNs (ResNet, DenseNet, and EfficientNet) and a Vision Transformer on the PC-GITA dataset, under 10-fold subject-independent cross-validation for robust assessment. Results show that fusion consistently improves performance over single representations across architectures. EfficientNet-B2 achieves the highest accuracy (84.39% ± 5.19%) and F1-score (84.35% ± 5.52%), outperforming recent methods that use handcrafted features or pretrained models (e.g., Wav2Vec2.0, HuBERT) on the same task and dataset. Performance varies with sentence type, with emotionally salient and prosodically emphasized utterances yielding higher AUC, suggesting that richer prosody enhances discriminability. Our findings indicate that multi-channel fusion enhances sensitivity to subtle speech impairments in PD by integrating complementary spectral information, and could strengthen the detection of discriminative acoustic biomarkers for speech-based PD screening, though further validation is needed before clinical application.
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
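A hedged sketch of the three-channel stacking step: mel and CQT spectrograms come from librosa, and since librosa has no gammatone transform, a low-fmin mel spectrogram stands in for the gammatone channel here. Bin counts, hop length, and normalization are illustrative choices, not the paper's configuration.

```python
import numpy as np
import librosa

def spectro_temporal_stack(y: np.ndarray, sr: int = 16000, bins: int = 84, hop: int = 512):
    """Stack three time-frequency views of one utterance into a 3-channel 'image'."""
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=bins, hop_length=hop))
    cqt = librosa.amplitude_to_db(
        np.abs(librosa.cqt(y, sr=sr, n_bins=bins, hop_length=hop)))
    gamma_like = librosa.power_to_db(   # placeholder for a true gammatone spectrogram
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=bins, hop_length=hop, fmin=50))
    reps = [mel, cqt, gamma_like]
    t = min(r.shape[1] for r in reps)                               # align frame counts
    reps = [(r[:, :t] - r.min()) / (r.max() - r.min() + 1e-8) for r in reps]
    return np.stack(reps, axis=0)                                   # (3, bins, t), RGB-like

y = librosa.tone(220, sr=16000, duration=2.0)                       # synthetic test signal
x = spectro_temporal_stack(y)
print(x.shape)  # e.g. (3, 84, 63)
```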

22 pages, 5982 KB  
Article
YOLO-FDLU: A Lightweight Improved YOLO11s-Based Algorithm for Accurate Maize Pest and Disease Detection
by Bin Li, Licheng Yu, Huibao Zhu and Zheng Tan
AgriEngineering 2025, 7(10), 323; https://doi.org/10.3390/agriengineering7100323 - 1 Oct 2025
Abstract
As a global staple ensuring food security, maize incurs 15–20% annual yield losses from pests and diseases. Conventional manual detection is inefficient (>7.5 h/ha) and subjective, while existing YOLO models suffer from >8% missed detections of small targets (e.g., corn armyworm larvae) in complex fields due to feature loss and poor multi-scale fusion. We propose YOLO-FDLU, a YOLO11s-based framework: LAD (Light Attention-Downsampling)-Conv preserves small-target features; C3k2_DDC (DilatedReparam–DilatedReparam–Conv) enhances cross-scale fusion; Detect_FCFQ (Feature-Corner Fusion and Quality Estimation) optimizes bounding-box localization; and UIoU (Unified-IoU) loss reduces high-IoU regression bias. Evaluated on a 25,419-sample dataset (6 categories, drawn from three public sources plus 1200 compliant web images), it achieves 91.12% Precision, 92.70% mAP@0.5, and 78.5% mAP@0.5–0.95 at 20.2 GFLOPs and a 15.3 MB model size. It outperforms YOLOv5-s through YOLO12-s, supporting precision maize pest and disease monitoring.

14 pages, 2759 KB  
Article
Unmanned Airborne Target Detection Method with Multi-Branch Convolution and Attention-Improved C2F Module
by Fangyuan Qin, Weiwei Tang, Haishan Tian and Yuyu Chen
Sensors 2025, 25(19), 6023; https://doi.org/10.3390/s25196023 - 1 Oct 2025
Abstract
In this paper, a target detection network based on multi-branch convolution and an attention-improved Cross-Stage Partial-Fusion Bottleneck with Two Convolutions (C2F) module is proposed for the difficult task of detecting small targets from unmanned aerial vehicles. A C2F variant that fuses partial convolutional (PConv) layers was designed to extract features faster and more efficiently, and multi-scale feature fusion was combined with a channel–spatial attention mechanism in the neck network. An FA-Block module was designed to improve feature fusion and attention to small-target features; this design enlarges the minuscule-target layer so that richer feature information about small targets is retained. Finally, the lightweight up-sampling operator Content-Aware ReAssembly of Features (CARAFE) replaces the original up-sampling method to expand the network's receptive field. Experimental tests were conducted on a self-compiled mountain pedestrian dataset and the public VisDrone dataset. Compared with the base algorithm, the improved algorithm raises mAP50, mAP50-95, P-value, and R-value by 2.8%, 3.5%, 2.3%, and 0.2%, respectively, on the Mountain Pedestrian dataset, and by 9.2%, 6.4%, 7.7%, and 7.6%, respectively, on the VisDrone dataset.
(This article belongs to the Section Sensing and Imaging)
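The PConv layers mentioned above follow the partial-convolution idea of convolving only a fraction of the channels; the sketch below shows that pattern inside a toy residual bottleneck. The channel ratio and surrounding block structure are assumptions, not the paper's C2F design.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: run a 3x3 conv on only a fraction of the channels and
    pass the rest through untouched, cutting FLOPs and memory traffic."""
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * ratio))
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        x_conv, x_pass = torch.split(x, [self.conv_ch, x.shape[1] - self.conv_ch], dim=1)
        return torch.cat([self.conv(x_conv), x_pass], dim=1)

class C2fPConvBottleneck(nn.Module):
    """Toy stand-in for a C2F-style bottleneck that swaps its convs for PConv."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(PConv(channels), nn.ReLU(inplace=True),
                                   nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        return x + self.block(x)   # residual connection, as in C2F bottlenecks

out = C2fPConvBottleneck(64)(torch.randn(1, 64, 40, 40))
print(out.shape)  # torch.Size([1, 64, 40, 40])
```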

25 pages, 9710 KB  
Article
SCS-YOLO: A Lightweight Cross-Scale Detection Network for Sugarcane Surface Cracks with Dynamic Perception
by Meng Li, Xue Ding, Jinliang Wang and Rongxiang Luo
AgriEngineering 2025, 7(10), 321; https://doi.org/10.3390/agriengineering7100321 - 1 Oct 2025
Abstract
Detecting surface cracks on sugarcane is a critical step in ensuring product quality control, with detection precision directly impacting raw material screening efficiency and economic benefits in the sugar industry. Traditional methods face three core challenges: (1) complex background interference complicates texture feature extraction; (2) variable crack scales limit models' cross-scale feature generalization capabilities; and (3) high computational complexity hinders deployment on edge devices. To address these issues, this study proposes a lightweight sugarcane surface crack detection model, SCS-YOLO (Surface Cracks on Sugarcane-YOLO), based on the YOLOv10 architecture. The model incorporates three key technical innovations. First, the designed RFAC2f module (Receptive-Field Attentive CSP Bottleneck with Dual Convolution) significantly enhances feature representation in complex backgrounds through dynamic receptive-field modeling and multi-branch feature processing and fusion. Second, the proposed DSA module (Dynamic SimAM Attention) achieves adaptive spatial optimization of cross-layer crack features by integrating dynamic weight allocation with a parameter-free spatial attention mechanism. Finally, the DyHead detection head employs a dynamic feature optimization mechanism to reduce parameter count and computational complexity. Experiments on the Sugarcane Crack Dataset v3.1 show that, compared to the baseline YOLOv10, our model raises mAP50:95 to 71.8% (a 2.1% gain) while reducing parameter count by 19.67% and computational load by 11.76%, and boosts FPS to 122, meeting real-time detection requirements. Considering precision, complexity, and FPS together, the SCS-YOLO framework proposed in this study provides a feasible technical reference for intelligent quality inspection of raw sugarcane in the sugar industry.
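The parameter-free spatial attention that the DSA module builds on is SimAM; the snippet below follows the public SimAM formulation (per-pixel inverse energy passed through a sigmoid). The dynamic weight allocation that DSA adds on top is not reproduced here.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: each activation's deviation from its channel
    mean defines an energy term, and the (inverse) energy reweights the feature map."""
    def __init__(self, e_lambda: float = 1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation per pixel
        v = d.sum(dim=(2, 3), keepdim=True) / n             # channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5          # inverse energy
        return x * torch.sigmoid(e_inv)

out = SimAM()(torch.randn(1, 64, 20, 20))
print(out.shape)  # torch.Size([1, 64, 20, 20])
```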

27 pages, 2645 KB  
Article
Short-Text Sentiment Classification Model Based on BERT and Dual-Stream Transformer Gated Attention Mechanism
by Song Yang, Jiayao Xing, Zhaoxia Liu and Yunhao Sun
Electronics 2025, 14(19), 3904; https://doi.org/10.3390/electronics14193904 - 30 Sep 2025
Abstract
With the rapid development of social media, short-text data have become increasingly important in fields such as public opinion monitoring, user feedback analysis, and intelligent recommendation systems. However, existing short-text sentiment analysis models often suffer from limited cross-domain adaptability and poor generalization. To address these challenges, this study proposes a novel short-text sentiment classification model based on Bidirectional Encoder Representations from Transformers (BERT) and a dual-stream Transformer gated attention mechanism. The model first employs BERT and the Chinese Robustly Optimized BERT Pretraining Approach (Chinese-RoBERTa) for data augmentation and multilevel semantic mining, expanding the training corpus and enhancing minority class coverage. Second, a dual-stream Transformer gated attention mechanism dynamically adjusts feature fusion weights, enhancing adaptability to heterogeneous texts. Finally, the model integrates a Bidirectional Gated Recurrent Unit (BiGRU) with Multi-Head Self-Attention (MHSA) to strengthen sequence modeling and global context capture, enabling precise identification of key sentiment dependencies. Experimental results demonstrate the model's superior performance in handling data imbalance and complex sentiment logic, with significant improvements in accuracy and F1 score: the F1 score reaches 92.4%, an average increase of 8.7% over the baseline models. This provides an effective solution for improving the performance and broadening the application scenarios of short-text sentiment analysis models.
(This article belongs to the Special Issue Deep Generative Models and Recommender Systems)
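As a hedged sketch of the sequence-modeling tail described above, the snippet combines a BiGRU with multi-head self-attention and a sigmoid gate that blends the two streams before pooling; the dimensions, gating form, and pooling choice are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BiGRUMHSAHead(nn.Module):
    """Classification head over contextual token embeddings (e.g., from BERT):
    a BiGRU models sequence order, multi-head self-attention captures global
    context, and a gate blends the two streams before pooling."""
    def __init__(self, d_model: int = 768, d_hidden: int = 256, heads: int = 8, n_classes: int = 3):
        super().__init__()
        self.bigru = nn.GRU(d_model, d_hidden, batch_first=True, bidirectional=True)
        self.mhsa = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.proj = nn.Linear(2 * d_hidden, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, token_emb):                       # (B, T, d_model)
        seq, _ = self.bigru(token_emb)                  # (B, T, 2 * d_hidden)
        seq = self.proj(seq)                            # back to d_model
        ctx, _ = self.mhsa(token_emb, token_emb, token_emb)
        g = self.gate(torch.cat([seq, ctx], dim=-1))    # dynamic fusion weights
        fused = g * seq + (1 - g) * ctx
        return self.classifier(fused.mean(dim=1))       # mean-pool, then classify

logits = BiGRUMHSAHead()(torch.randn(2, 48, 768))
print(logits.shape)  # torch.Size([2, 3])
```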

25 pages, 13955 KB  
Article
Adaptive Energy–Gradient–Contrast (EGC) Fusion with AIFI-YOLOv12 for Improving Nighttime Pedestrian Detection in Security
by Lijuan Wang, Zuchao Bao and Dongming Lu
Appl. Sci. 2025, 15(19), 10607; https://doi.org/10.3390/app151910607 - 30 Sep 2025
Abstract
In security applications, visible-light pedestrian detectors are highly sensitive to changes in illumination and fail under low-light or nighttime conditions, while infrared sensors, though resilient to lighting, often produce blurred object boundaries that hinder precise localization. To address these complementary limitations, we propose a practical multimodal pipeline, Adaptive Energy–Gradient–Contrast (EGC) Fusion with AIFI-YOLOv12, which first fuses infrared and low-light visible images using per-pixel weights derived from local energy, gradient magnitude, and contrast measures, and then detects pedestrians with an improved YOLOv12 backbone. The detector integrates an AIFI attention module at high semantic levels, replaces selected modules with A2C2f blocks to enhance cross-channel feature aggregation, and preserves P3–P5 outputs to improve small-object localization. We evaluate the complete pipeline on the LLVIP dataset and report Precision, Recall, mAP@50, mAP@50–95, GFLOPs, FPS, and detection time, comparing against YOLOv8 and YOLOv10–YOLOv12 baselines (n and s scales). Quantitative and qualitative results show that the proposed fusion restores complementary thermal and visible details and that the AIFI-enhanced detector yields more robust nighttime pedestrian detection while maintaining a competitive computational profile suitable for real-world security deployments.
(This article belongs to the Special Issue Advanced Image Analysis and Processing Technologies and Applications)
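A plain reading of the EGC weighting idea, sketched with OpenCV: local energy, gradient magnitude, and contrast are combined into a per-pixel activity map for each modality, and the normalized activities weight the fused image. The window size and the way the three cues are combined are assumptions, not the paper's exact scheme.

```python
import numpy as np
import cv2

def egc_fusion(ir: np.ndarray, vis: np.ndarray, win: int = 7) -> np.ndarray:
    """Fuse registered infrared and low-light visible images with per-pixel weights
    built from local energy, gradient magnitude, and contrast."""
    def activity(img: np.ndarray) -> np.ndarray:
        img = img.astype(np.float32) / 255.0
        mean = cv2.boxFilter(img, cv2.CV_32F, (win, win))
        energy = cv2.boxFilter(img * img, cv2.CV_32F, (win, win))      # local energy
        contrast = np.sqrt(np.maximum(energy - mean * mean, 0.0))      # local std dev
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
        gradient = np.sqrt(gx * gx + gy * gy)                          # gradient magnitude
        return energy + gradient + contrast                            # combined activity

    a_ir, a_vis = activity(ir), activity(vis)
    w = a_ir / (a_ir + a_vis + 1e-8)                                   # per-pixel IR weight
    fused = w * ir.astype(np.float32) + (1.0 - w) * vis.astype(np.float32)
    return np.clip(fused, 0, 255).astype(np.uint8)

ir = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
vis = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
print(egc_fusion(ir, vis).shape)  # (480, 640)
```
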
24 pages, 5484 KB  
Article
TFI-Fusion: Hierarchical Triple-Stream Feature Interaction Network for Infrared and Visible Image Fusion
by Mingyang Zhao, Shaochen Su and Hao Li
Information 2025, 16(10), 844; https://doi.org/10.3390/info16100844 - 30 Sep 2025
Abstract
As a key technology in multimodal information processing, infrared and visible image fusion holds significant application value in fields such as military reconnaissance, intelligent security, and autonomous driving. To address the limitations of existing methods, this paper proposes the Hierarchical Triple-Feature Interaction Fusion Network (TFI-Fusion). Based on a hierarchical triple-stream feature interaction mechanism, the network achieves high-quality fusion through a two-stage, separate-model processing approach. In the first stage, a single model extracts low-rank components (representing global structural features) and sparse components (representing local detail features) from source images via the Low-Rank Sparse Decomposition (LSRSD) module, while capturing cross-modal shared features using the Shared Feature Extractor (SFE). In the second stage, another model performs fusion and reconstruction: it first enhances the complementarity between low-rank and sparse features through the newly introduced Bi-Feature Interaction (BFI) module, realizes multi-level feature fusion via the Triple-Feature Interaction (TFI) module, and finally generates fused images with rich scene representation through feature reconstruction. This separate-model design reduces memory usage and improves operational speed. Additionally, a multi-objective optimization function is designed around the network's characteristics. Experiments demonstrate that TFI-Fusion exhibits excellent fusion performance, effectively preserving image details and enhancing feature complementarity, thus providing reliable visual data support for downstream tasks.
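Low-rank plus sparse decomposition is classically solved by robust PCA; the sketch below runs a basic ADMM loop with singular-value and soft thresholding. It only illustrates the decomposition the LSRSD module targets; the paper's module is learned, not this solver.

```python
import numpy as np

def low_rank_sparse_decompose(d: np.ndarray, n_iter: int = 100):
    """Split a matrix (e.g., a flattened source image) into a low-rank part and a
    sparse part with a basic robust-PCA ADMM loop."""
    def svt(x, tau):                       # singular value thresholding
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        return u @ np.diag(np.maximum(s - tau, 0.0)) @ vt

    def shrink(x, tau):                    # elementwise soft thresholding (sparsity)
        return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

    lam = 1.0 / np.sqrt(max(d.shape))
    mu = 0.25 * d.size / (np.abs(d).sum() + 1e-8)
    low, sparse, dual = np.zeros_like(d), np.zeros_like(d), np.zeros_like(d)
    for _ in range(n_iter):
        low = svt(d - sparse + dual / mu, 1.0 / mu)
        sparse = shrink(d - low + dual / mu, lam / mu)
        dual = dual + mu * (d - low - sparse)
    return low, sparse

img = np.random.rand(64, 64)
L, S = low_rank_sparse_decompose(img)
print(np.abs(img - L - S).max())   # residual shrinks as the iterations proceed
```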

19 pages, 5891 KB  
Article
MS-YOLOv11: A Wavelet-Enhanced Multi-Scale Network for Small Object Detection in Remote Sensing Images
by Haitao Liu, Xiuqian Li, Lifen Wang, Yunxiang Zhang, Zitao Wang and Qiuyi Lu
Sensors 2025, 25(19), 6008; https://doi.org/10.3390/s25196008 - 29 Sep 2025
Abstract
In remote sensing imagery, objects smaller than 32×32 pixels suffer from three persistent challenges that existing detectors inadequately resolve: (1) their weak signal is easily submerged in background clutter, causing high miss rates; (2) the scarcity of valid pixels yields few geometric or textural cues, hindering discriminative feature extraction; and (3) successive down-sampling irreversibly discards high-frequency details, which multi-scale pyramids still fail to compensate for. To counteract these issues, we propose MS-YOLOv11, an enhanced YOLOv11 variant that integrates frequency-domain detail preservation, lightweight receptive-field expansion, and adaptive cross-scale fusion. Specifically, a 2D Haar wavelet first decomposes the image into multiple frequency sub-bands to explicitly isolate and retain high-frequency edges and textures while suppressing noise. Each sub-band is then processed independently by small-kernel depthwise convolutions that enlarge the receptive field without over-smoothing. Finally, the Mix Structure Block (MSB) employs the MSPLCK module to perform densely sampled multi-scale atrous convolutions that capture rich context around diminutive objects, followed by the EPA module, which adaptively fuses and re-weights features via residual connections to suppress background interference. Extensive experiments on DOTA and DIOR demonstrate that MS-YOLOv11 surpasses the baseline in mAP@50, mAP@95, parameter efficiency, and inference speed, validating its targeted efficacy for small-object detection.
(This article belongs to the Section Remote Sensors)
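The frequency-domain step described above rests on a single-level 2D Haar transform; the sketch below uses `pywt.dwt2` to stack the four sub-bands as channels and processes them with a small-kernel depthwise convolution, as a hedged illustration of the recipe rather than the paper's module.

```python
import numpy as np
import pywt
import torch
import torch.nn as nn

def haar_subbands(image: np.ndarray) -> torch.Tensor:
    """Single-level 2D Haar decomposition: stack the LL, LH, HL, HH sub-bands as
    channels, so high-frequency edges remain explicit inputs to the network."""
    ll, (lh, hl, hh) = pywt.dwt2(image.astype(np.float32), "haar")
    bands = np.stack([ll, lh, hl, hh], axis=0).astype(np.float32)
    return torch.from_numpy(bands).unsqueeze(0)          # (1, 4, H/2, W/2)

class SubbandDepthwiseBlock(nn.Module):
    """Process each sub-band with its own small-kernel depthwise conv (groups=4),
    then mix sub-bands with a pointwise conv."""
    def __init__(self, out_ch: int = 16):
        super().__init__()
        self.dw = nn.Conv2d(4, 4, kernel_size=3, padding=1, groups=4)
        self.pw = nn.Conv2d(4, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pw(self.dw(x)))

x = haar_subbands(np.random.rand(256, 256))
print(SubbandDepthwiseBlock()(x).shape)  # torch.Size([1, 16, 128, 128])
```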