Search Results (9)

Search Parameters:
Keywords = RGB-IR image pair

22 pages, 6682 KB  
Article
Multimodal Fire Salient Object Detection for Unregistered Data in Real-World Scenarios
by Ning Sun, Jianmeng Zhou, Kai Hu, Chen Wei, Zihao Wang and Lipeng Song
Fire 2025, 8(11), 415; https://doi.org/10.3390/fire8110415 - 26 Oct 2025
Abstract
In real-world fire scenarios, complex lighting conditions and smoke interference significantly challenge the accuracy and robustness of traditional fire detection systems. Fusion of complementary modalities, such as visible light (RGB) and infrared (IR), is essential to enhance detection robustness. However, spatial shifts and geometric distortions occur in multi-modal image pairs collected by multi-source sensors due to installation deviations and inconsistent intrinsic parameters. Existing multi-modal fire detection frameworks typically depend on pre-registered data, an approach that struggles to handle modal misalignment in practical deployment. To overcome this limitation, we propose an end-to-end multi-modal Fire Salient Object Detection framework capable of dynamically fusing cross-modal features without pre-registration. Specifically, the Channel Cross-enhancement Module (CCM) facilitates semantic interaction across modalities in salient regions, suppressing noise from spatial misalignment. The Deformable Alignment Module (DAM) achieves adaptive correction of geometric deviations through cascaded deformation compensation and dynamic offset learning. For validation, we constructed an unregistered indoor fire dataset (Indoor-Fire) covering common fire scenarios. Generalizability was further evaluated on an outdoor dataset (RGB-T Wildfire). To fully validate the effectiveness of the method in complex building fire scenarios, we also conducted experiments using the Fire in Historic Buildings dataset. Experimental results demonstrate that the F1-score reaches 83% on both datasets, with the IoU maintained above 70%. Notably, while maintaining high accuracy, the number of parameters (91.91 M) is only 28.1% of the second-best SACNet (327 M). This method provides a robust solution for unaligned or weakly aligned modal fusion caused by sensor differences and is highly suitable for deployment in intelligent firefighting systems.
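
A minimal sketch of the dynamic-offset alignment idea summarized above (warping unregistered IR features onto the RGB coordinate frame), assuming a PyTorch setting; this is not the authors' DAM, and all module and variable names are illustrative.

```python
# Hedged sketch: predict per-pixel offsets from concatenated RGB/IR features,
# then resample the IR feature map with grid sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetWarpAlign(nn.Module):
    """Hypothetical alignment block: dynamic offsets + bilinear warping."""
    def __init__(self, channels: int):
        super().__init__()
        self.offset_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, feat_rgb: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        n, _, h, w = feat_rgb.shape
        # Per-pixel (dx, dy) offsets in normalized [-1, 1] grid units.
        offsets = torch.tanh(self.offset_head(torch.cat([feat_rgb, feat_ir], dim=1)))
        # Identity sampling grid of shape (N, H, W, 2).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat_ir.device),
            torch.linspace(-1, 1, w, device=feat_ir.device),
            indexing="ij",
        )
        base_grid = torch.stack([xs, ys], dim=-1).expand(n, h, w, 2)
        grid = base_grid + offsets.permute(0, 2, 3, 1)
        # Resample IR features onto the RGB coordinate frame.
        return F.grid_sample(feat_ir, grid, mode="bilinear", align_corners=True)

rgb = torch.randn(1, 64, 32, 32)
ir = torch.randn(1, 64, 32, 32)
print(OffsetWarpAlign(64)(rgb, ir).shape)  # torch.Size([1, 64, 32, 32])
```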

20 pages, 3201 KB  
Article
Dual-Branch Multimodal Fusion Network for Driver Facial Emotion Recognition
by Le Wang, Yuchen Chang and Kaiping Wang
Appl. Sci. 2024, 14(20), 9430; https://doi.org/10.3390/app14209430 - 16 Oct 2024
Cited by 3 | Viewed by 2072
Abstract
In the transition to fully automated driving, the interaction between drivers and vehicles is crucial, as drivers’ emotions directly influence their behavior and thereby impact traffic safety. Currently, relying solely on a convolutional neural network (CNN) backbone to extract facial features from the single RGB modality makes it difficult to capture enough semantic information. To address this issue, this paper proposes a Dual-branch Multimodal Fusion Network (DMFNet). DMFNet extracts semantic features from visible–infrared (RGB-IR) image pairs, effectively capturing complementary information between the two modalities and achieving a more accurate understanding of the drivers’ emotional state at a global level. However, the accuracy of facial recognition is significantly affected by variations in the drivers’ head posture and the lighting environment. Thus, we further propose a U-Shape Reconstruction Network (URNet) to enhance and reconstruct the detailed features of the RGB modality. Additionally, we design a Detail Enhancement Block (DEB), embedded in the U-shaped reconstruction network, for high-frequency filtering. Compared with the original driver emotion recognition model, our method improves accuracy by 18.77% on the DEFE++ dataset, demonstrating the superiority of the proposed approach.
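
The detail-enhancement idea (isolating and re-injecting high-frequency content) can be sketched as follows; this is a hypothetical block for illustration, not the paper's DEB.

```python
# Hedged sketch: isolate the high-frequency residual (feature minus a blurred
# copy) and re-inject it with a learned, gated gain.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqEnhance(nn.Module):
    def __init__(self, channels: int, blur_kernel: int = 5):
        super().__init__()
        self.blur_kernel = blur_kernel
        self.gain = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pad = self.blur_kernel // 2
        low = F.avg_pool2d(x, self.blur_kernel, stride=1, padding=pad)  # low-pass copy
        high = x - low                                                  # high-frequency residual
        return x + torch.sigmoid(self.gain(high)) * high                # gated re-injection

x = torch.randn(2, 32, 64, 64)
print(HighFreqEnhance(32)(x).shape)  # torch.Size([2, 32, 64, 64])
```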

19 pages, 8953 KB  
Article
Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems
by Huthaifa I. Ashqar, Taqwa I. Alhadidi, Mohammed Elhenawy and Nour O. Khanfar
Automation 2024, 5(4), 508-526; https://doi.org/10.3390/automation5040029 - 10 Oct 2024
Cited by 17 | Viewed by 4819
Abstract
The integration of thermal imaging data with multimodal large language models (MLLMs) offers promising advancements for enhancing the safety and functionality of autonomous driving systems (ADS) and intelligent transportation systems (ITS). This study investigates the potential of MLLMs, specifically GPT-4 Vision Preview and Gemini 1.0 Pro Vision, for interpreting thermal images for applications in ADS and ITS. Two primary research questions are addressed: the capacity of these models to detect and enumerate objects within thermal images, and to determine whether pairs of image sources represent the same scene. Furthermore, we propose a framework for object detection and classification by integrating infrared (IR) and RGB images of the same scene without requiring localization data. This framework is particularly valuable for enhancing the detection and classification accuracy in environments where both IR and RGB cameras are essential. By employing zero-shot in-context learning for object detection and the chain-of-thought technique for scene discernment, this study demonstrates that MLLMs can recognize objects such as vehicles and individuals with promising results, even in the challenging domain of thermal imaging. The results indicate a high true positive rate for larger objects and moderate success in scene discernment, with a recall of 0.91 and a precision of 0.79 for similar scenes. The integration of IR and RGB images further enhances detection capabilities, achieving an average precision of 0.93 and an average recall of 0.56. This approach leverages the complementary strengths of each modality to compensate for individual limitations. This study highlights the potential of combining advanced AI methodologies with thermal imaging to enhance the accuracy and reliability of ADS, while identifying areas for improvement in model performance.
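
The prompting strategy described above might look roughly like the following sketch; the actual prompts and model client used in the study are not given here, so every string and function below is an assumption.

```python
# Hedged sketch: a zero-shot object-counting prompt and a chain-of-thought
# scene-comparison prompt. `send_to_mllm` is a placeholder, not a real API.
def detection_prompt() -> str:
    return (
        "You are analyzing a thermal (infrared) traffic image.\n"
        "List every object you can identify (e.g., car, pedestrian, cyclist) "
        "and give a count for each category. Answer as 'category: count' lines."
    )

def scene_match_prompt() -> str:
    return (
        "You are given two images: one thermal (IR) and one RGB.\n"
        "Think step by step: first describe the layout of each scene, "
        "then compare road geometry, object positions, and landmarks, "
        "and finally answer 'SAME' or 'DIFFERENT' on the last line."
    )

def send_to_mllm(prompt: str, images: list) -> str:  # placeholder, not a real client
    raise NotImplementedError("Wire this to whichever vision-model client you use.")
```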

16 pages, 5861 KB  
Article
NRPerson: A Non-Registered Multi-Modal Benchmark for Tiny Person Detection and Localization
by Yi Yang, Xumeng Han, Kuiran Wang, Xuehui Yu, Wenwen Yu, Zipeng Wang, Guorong Li, Zhenjun Han and Jianbin Jiao
Electronics 2024, 13(9), 1697; https://doi.org/10.3390/electronics13091697 - 27 Apr 2024
Cited by 1 | Viewed by 1629
Abstract
In recent years, the detection and localization of tiny persons have garnered significant attention due to their critical applications in various surveillance and security scenarios. Traditional multi-modal methods predominantly rely on well-registered image pairs, necessitating the use of sophisticated sensors and extensive manual effort for registration, which restricts their practical utility in dynamic, real-world environments. Addressing this gap, this paper introduces a novel non-registered multi-modal benchmark named NRPerson, specifically designed to advance the field of tiny person detection and localization by accommodating the complexities of real-world scenarios. The NRPerson dataset comprises 8548 RGB-IR image pairs, meticulously collected and filtered from 22 video sequences, enriched with 889,207 high-quality annotations that have been manually verified for accuracy. Utilizing NRPerson, we evaluate several leading detection and localization models across both mono-modal and non-registered multi-modal frameworks. Furthermore, we develop a comprehensive set of natural multi-modal baselines for the innovative non-registered track, aiming to enhance the detection and localization of unregistered multi-modal data using a cohesive and generalized approach. This benchmark is poised to facilitate significant strides in the practical deployment of detection and localization technologies by mitigating the reliance on stringent registration requirements.
(This article belongs to the Special Issue Big Model Techniques for Image Processing)

21 pages, 7411 KB  
Article
MFMG-Net: Multispectral Feature Mutual Guidance Network for Visible–Infrared Object Detection
by Fei Zhao, Wenzhong Lou, Hengzhen Feng, Nanxi Ding and Chenglong Li
Drones 2024, 8(3), 112; https://doi.org/10.3390/drones8030112 - 21 Mar 2024
Cited by 4 | Viewed by 2914
Abstract
Drones equipped with visible and infrared sensors play a vital role in urban road supervision. However, conventional methods using RGB-IR image pairs often struggle to extract effective features because they treat the two spectra independently, missing the potential benefits of their interaction and complementary information. To address these challenges, we designed the Multispectral Feature Mutual Guidance Network (MFMG-Net). To prevent learning bias between spectra, we developed a Data Augmentation (DA) technique based on a masking strategy. The MFMG module is embedded between the two backbone networks, promoting the exchange of feature information between spectra to enhance extraction. We also designed a Dual-Branch Feature Fusion (DBFF) module based on attention mechanisms, enabling deep feature fusion by emphasizing correlations between the two spectra in both the feature-channel and spatial dimensions. Finally, the fused features feed into the neck network and detection head, yielding the final inference results. Our experiments, conducted on the Vehicle Detection in Aerial Imagery (VEDAI) dataset and two other public datasets (M3FD and LLVIP), showcase the superior performance of our method and the effectiveness of MFMG in enhancing multispectral feature extraction for drone ground detection.
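
A rough sketch of attention-based dual-branch fusion over channel and spatial dimensions, in the spirit of the DBFF module described above; this is not the authors' code, and the layer sizes are arbitrary.

```python
# Hedged sketch: weight concatenated visible/infrared features with channel
# attention, then with spatial attention.
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, 2 * channels), nn.Sigmoid(),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f_vis: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_vis, f_ir], dim=1)                   # (N, 2C, H, W)
        # Channel attention from globally pooled statistics.
        w_c = self.channel_fc(x.mean(dim=(2, 3)))[..., None, None]
        x = x * w_c
        # Spatial attention from channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)   # (N, 2, H, W)
        return x * torch.sigmoid(self.spatial_conv(s))

fused = ChannelSpatialFusion(64)(torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40))
print(fused.shape)  # torch.Size([1, 128, 40, 40])
```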

23 pages, 2929 KB  
Article
Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration
by Mingzhou He, Qingbo Wu, King Ngi Ngan, Feng Jiang, Fanman Meng and Linfeng Xu
Remote Sens. 2023, 15(19), 4887; https://doi.org/10.3390/rs15194887 - 9 Oct 2023
Cited by 11 | Viewed by 5012
Abstract
Object detection based on RGB and infrared images has emerged as a crucial research area in computer vision, and the synergy of RGB-Infrared ensures the robustness of object-detection algorithms under varying lighting conditions. However, the captured RGB-IR image pairs typically exhibit spatial misalignment due to sensor discrepancies, leading to compromised localization performance. Furthermore, because the deep features of the two modalities follow inconsistent distributions, directly fusing multi-modal features weakens the feature difference between the object and the background, thereby degrading RGB-Infrared object-detection performance. To address these issues, we propose an adaptive dual-discrepancy calibration network (ADCNet) for misaligned RGB-Infrared object detection, comprising spatial-discrepancy and domain-discrepancy calibration. Specifically, the spatial-discrepancy calibration module conducts an adaptive affine transformation to achieve spatial alignment of features. Then, the domain-discrepancy calibration module separately aligns object and background features from the different modalities, making the object and background distributions of the fused feature easier to distinguish and thereby enhancing the effectiveness of RGB-Infrared object detection. Our ADCNet outperforms the baseline by 3.3% and 2.5% in mAP50 on the FLIR and misaligned M3FD datasets, respectively. Experimental results demonstrate the superiority of our proposed method over state-of-the-art approaches.
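
The spatial-discrepancy calibration step can be illustrated with the standard affine-grid warping pattern in PyTorch; this is only a sketch of the general technique, not the ADCNet implementation.

```python
# Hedged sketch: predict a 2x3 affine transform from pooled RGB/IR features
# and apply it to the IR feature map via affine_grid / grid_sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineAlign(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.theta_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 6),
        )
        # Initialize to the identity transform so training starts aligned.
        nn.init.zeros_(self.theta_head[2].weight)
        self.theta_head[2].bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, feat_rgb: torch.Tensor, feat_ir: torch.Tensor) -> torch.Tensor:
        theta = self.theta_head(torch.cat([feat_rgb, feat_ir], dim=1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, feat_ir.shape, align_corners=False)
        return F.grid_sample(feat_ir, grid, align_corners=False)

aligned = AffineAlign(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(aligned.shape)  # torch.Size([1, 64, 32, 32])
```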

15 pages, 1171 KB  
Article
Cross-Modality Person Re-Identification via Local Paired Graph Attention Network
by Jianglin Zhou, Qing Dong, Zhong Zhang, Shuang Liu and Tariq S. Durrani
Sensors 2023, 23(8), 4011; https://doi.org/10.3390/s23084011 - 15 Apr 2023
Cited by 9 | Viewed by 2652
Abstract
Cross-modality person re-identification (ReID) aims to retrieve a pedestrian image of the RGB modality from infrared (IR) pedestrian images and vice versa. Recently, some approaches have constructed a graph to learn the relevance of pedestrian images from distinct modalities to narrow the gap between the IR and RGB modalities, but they overlook the correlation between IR and RGB image pairs. In this paper, we propose a novel graph model called the Local Paired Graph Attention Network (LPGAT). It uses the paired local features of pedestrian images from different modalities to build the nodes of the graph. For accurate propagation of information among the nodes, we propose a contextual attention coefficient that leverages distance information to regulate the node-updating process. Furthermore, we put forward Cross-Center Contrastive Learning (C3L) to constrain how far local features are from their heterogeneous centers, which is beneficial for learning a complete distance metric. We conduct experiments on the RegDB and SYSU-MM01 datasets to validate the feasibility of the proposed approach.
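
A toy illustration of distance-modulated graph attention over paired local features, loosely mirroring the contextual attention coefficient described above; the formulation below is an assumption, not the LPGAT definition.

```python
# Hedged sketch: attention between local-feature nodes is driven by feature
# affinity and damped by pairwise distance before a message-passing step.
import torch
import torch.nn.functional as F

def distance_modulated_attention(nodes: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """nodes: (N, D) local features; returns updated node features (N, D)."""
    sim = nodes @ nodes.t()                  # feature affinity
    dist = torch.cdist(nodes, nodes)         # pairwise distances
    logits = sim / temperature - dist        # nearby, similar nodes weigh more
    attn = F.softmax(logits, dim=-1)         # row-normalized coefficients
    return attn @ nodes                      # one message-passing update

nodes = torch.randn(12, 256)                 # e.g. 6 RGB + 6 IR local part features
print(distance_modulated_attention(nodes).shape)  # torch.Size([12, 256])
```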

16 pages, 5473 KB  
Article
Infusion-Net: Inter- and Intra-Weighted Cross-Fusion Network for Multispectral Object Detection
by Jun-Seok Yun, Seon-Hoo Park and Seok Bong Yoo
Mathematics 2022, 10(21), 3966; https://doi.org/10.3390/math10213966 - 25 Oct 2022
Cited by 21 | Viewed by 3394
Abstract
Object recognition studies typically rely on red, green, and blue (RGB) images. However, RGB images captured in low-light environments, or in environments where other objects occlude the targets, lead to poor object recognition performance. In contrast, infrared (IR) images provide acceptable object recognition performance in these environments because they detect IR radiation rather than visible illumination. In this paper, we propose an inter- and intra-weighted cross-fusion network (Infusion-Net), which improves object recognition performance by combining the strengths of RGB-IR image pairs. Infusion-Net connects dual object detection models using a high-frequency (HF) assistant (HFA) to combine the advantages of RGB-IR images. To extract HF components, the HFA transforms input images into the discrete cosine transform domain. The extracted HF components are weighted via pretrained inter- and intra-weights for feature-domain cross-fusion. The inter-weighted fused features are transmitted to each other’s networks to complement the limitations of each modality. The intra-weighted features are also used to enhance any insufficient HF components of the target objects. The experimental results demonstrate the superiority of the proposed network and its improved performance on the multispectral object recognition task.
(This article belongs to the Special Issue Advances in Pattern Recognition and Image Analysis)
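
The HF-extraction step of the Infusion-Net abstract above (transforming images into the discrete cosine transform domain and keeping high-frequency components) can be sketched as follows; the cutoff rule is an illustrative assumption, not the paper's HFA.

```python
# Hedged sketch: zero the low-frequency corner of a 2-D DCT and invert it to
# obtain a high-frequency component of the image.
import numpy as np
from scipy.fft import dctn, idctn

def high_frequency_component(img: np.ndarray, cutoff: float = 0.25) -> np.ndarray:
    """img: 2-D array; returns the part reconstructed from higher DCT coefficients."""
    coeffs = dctn(img, norm="ortho")
    h_cut, w_cut = int(img.shape[0] * cutoff), int(img.shape[1] * cutoff)
    coeffs[:h_cut, :w_cut] = 0.0          # drop the low-frequency block
    return idctn(coeffs, norm="ortho")

img = np.random.rand(64, 64)
print(high_frequency_component(img).shape)  # (64, 64)
```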

11 pages, 1416 KB  
Communication
Evaluating Thermal and Color Sensors for Automating Detection of Penguins and Pinnipeds in Images Collected with an Unoccupied Aerial System
by Jefferson T. Hinke, Louise M. Giuseffi, Victoria R. Hermanson, Samuel M. Woodman and Douglas J. Krause
Drones 2022, 6(9), 255; https://doi.org/10.3390/drones6090255 - 15 Sep 2022
Cited by 14 | Viewed by 3341
Abstract
Estimating seabird and pinniped abundance is central to wildlife management and ecosystem monitoring in Antarctica. Unoccupied aerial systems (UAS) can collect images to support monitoring, but manual image analysis is often impractical. Automating target detection using deep learning techniques may improve data acquisition, but different image sensors may affect target detectability and model performance. We compared the performance of automated detection models based on infrared (IR) or color (RGB) images and tested whether IR images, or training data that included annotations of non-target features, improved model performance. For this assessment, we collected paired IR and RGB images of nesting penguins (Pygoscelis spp.) and aggregations of Antarctic fur seals (Arctocephalus gazella) with a small UAS at Cape Shirreff, Livingston Island (60.79 °W, 62.46 °S). We trained seven independent classification models using the Video and Image Analytics for Marine Environments (VIAME) software and created an open-access R tool, vvipr, to standardize the assessment of VIAME-based model performance. We found that the IR images and the addition of non-target annotations had no clear benefits for model performance given the available data. Nonetheless, the generally high performance of the penguin models provided encouraging results for further improving automated image analysis from UAS surveys.
(This article belongs to the Special Issue UAV Design and Applications in Antarctic Research)
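
For context on how detection-model performance in studies like the one above is typically summarized, a minimal precision/recall/F1 sketch follows; the counts are made up, and whether vvipr uses exactly these definitions is an assumption.

```python
# Hedged sketch: standard precision, recall, and F1 from detection counts.
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only, not results from the study.
print(detection_metrics(tp=180, fp=20, fn=30))
# {'precision': 0.9, 'recall': 0.857..., 'f1': 0.878...}
```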
