1. Introduction
Industrial gas is widely used in modern society in fields such as steel production, commercial activities, transportation, power generation, and the chemical industry [1], making it a crucial component of the energy system and a significant driver of high-quality socio-economic development. However, gas leaks frequently occur during storage, use, and transportation and can easily lead to fires and explosions [2], causing substantial loss of life and property. Therefore, achieving rapid and effective gas leak detection, along with accurate localization of the leak source and the affected area, is of paramount importance.
Traditional contact-based detection methods, such as acoustic [3], electrochemical [4], and semiconductor techniques [5], can accurately pinpoint leak locations and have low production costs. However, they are significantly affected by environmental interference signals and temperature variations [6,7]. Infrared imaging technology, as a typical non-contact detection method, offers a wide detection range and intuitive visualization, enabling remote dynamic monitoring [8,9]. Because industrial gases predominantly exhibit characteristic absorption spectra in the mid-wave and long-wave infrared bands, post-processing with image analysis algorithms allows the identification of gas leak locations and diffusion trends, thereby enhancing detection efficiency and saving substantial time and labor. This technology is therefore widely used for industrial gas leak detection. However, the non-solid nature and variable characteristics of gas plumes [10] mean that the visual appearance of gas infrared images is highly influenced by environmental changes. The gas plume often appears with blurred contours and low contrast against the background, making it difficult to observe effectively.
Infrared imaging systems are categorized into active [11] and passive [12] systems based on the radiation source. Unlike active imaging systems, which require additional light sources to provide radiation, passive infrared imaging systems rely solely on ambient background radiation within the field of view. This makes them smaller, more cost-effective, and easier to implement in portable automated detection; consequently, imaging technologies based on infrared focal plane arrays are more widely used [13]. As the core component of infrared imaging systems, infrared detectors are classified into thermal and photon detectors based on their energy conversion methods, and into cooled and uncooled detectors depending on their operating temperature and cooling requirements. Cooled mid-wave infrared detectors [14], which detect infrared radiation through the photoelectric effect generated by the interaction between incoming photons and sensitive materials, offer rapid response and high reliability. They provide high-sensitivity, fast-response imaging that resolves minute temperature and radiation differences, making them particularly advantageous for gas leak detection tasks.
With the advancement of deep learning, artificial intelligence has been widely applied in the field of computer vision, demonstrating strong vitality in areas such as autonomous driving [15], medical image processing [16], and equipment inspection [17]. Owing to their effective feature extraction capabilities, deep learning-based methods have become a significant direction in gas leak detection research, with models suitable for continuous gas detection. However, several challenges remain: (a) For computer vision tasks, large-scale, high-quality, well-annotated datasets are crucial for model training, yet there is currently no unified dataset for comparing the performance of different algorithms. (b) Infrared image quality, which depends on the infrared detector, significantly impacts detection; a low-concentration gas plume is easily confused with image noise, which reduces detection efficiency. (c) A gas plume in infrared images often has blurred edges, low background contrast, and poor discernibility, leaving limited usable spatial features and making gas feature extraction challenging.
The You Only Look Once (YOLO) series of algorithms, a convolutional neural network-based object detection framework, has demonstrated excellent performance in terms of generalization and real-time capabilities for computer vision tasks. It is widely used in object detection research and can be considered a viable solution for gas leak detection using infrared imaging.
The main contents of this paper are as follows:
(1) A high-sensitivity and high-response imaging effect was achieved using a cooled mid-wave infrared (MWIR) imager. A dataset labeled with gas leak segmentation, MWIRGas-Seg, was collected and underwent visual classification and small target counting.
(2) For the task of gas leak detection in MWIR imaging, an algorithm based on YOLOv8-seg is proposed. This algorithm, named MWIRGas-YOLO, effectively detects and segments gas leaks within a given scene.
(3) A global attention mechanism was introduced during the feature fusion stage to reduce image information dispersion, enhance gas plume localization, and improve the extraction of small target gas plume features. Transfer learning was applied using a visible light smoke dataset with similar characteristics to make the pre-trained model more adept at handling and extracting gas features.
(4) Experimental validation confirms that MWIRGas-YOLO achieves more effective feature extraction and fusion of infrared gas plume targets, outperforming the original YOLOv8-seg and several typical image detection and segmentation algorithms. It is suitable for infrared gas image detection tasks.
4. Methods
4.1. YOLOv8-Seg Model
The YOLO series of methods, as one-stage object detection algorithms, utilize a backbone network to extract features, fuse multi-scale features, and then output target detection boxes through multiple detection heads. This approach achieves an excellent detection accuracy and real-time performance. The YOLOv8 model integrates methods from YOLACT, adding a segmentation branch on top of the existing detection branches to combine detection boxes and segmentation results, thereby achieving instance segmentation of targets. The backbone enhances the feature extraction capabilities using a lightweight C2f module and incorporates an SPPF layer at the end to extract features with different receptive fields, making the network more suited for targets of different scales. The neck section employs a dual-stream Feature Pyramid Network (FPN) structure that aggregates feature information from top to bottom, fusing high- and low-level feature information via upsampling to compute prediction feature maps. A Path Aggregation Network (PAN) introduces lateral connections to enhance semantic feature information and fusion capabilities. The head section adopts an anchor-free decoupling head, which inputs position and category information from the feature map into the detection and classification branches to compute position and category, respectively. Predictions for small-, medium-, and large-scale targets were generated based on the fused P3, P4, and P5 feature maps.
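The relation between the three prediction scales and the input resolution is simple stride arithmetic: each head predicts on a feature map downsampled by its stride. A minimal sketch (illustrative only, not the authors' code; the 640 × 640 input size is the YOLOv8 default assumed here):

```python
# Grid sizes for YOLOv8's three detection scales: a 640x640 input is
# downsampled by strides 8, 16, and 32 to produce the P3, P4, and P5
# prediction grids used by the decoupled heads.
def grid_sizes(img_size, strides=(8, 16, 32)):
    return {f"P{i + 3}/{s}": (img_size // s, img_size // s)
            for i, s in enumerate(strides)}

print(grid_sizes(640))
# {'P3/8': (80, 80), 'P4/16': (40, 40), 'P5/32': (20, 20)}
```

Each grid cell at stride s covers an s × s patch of the input, which is why the deepest 20 × 20 map is suited to large targets and the 80 × 80 map to small ones.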
This paper proposes the MWIRGas-YOLO model for mid-wave infrared imaging gas leak detection tasks, which incorporates a global attention mechanism and a small object detection layer based on the YOLOv8-seg model. The network structure is illustrated in Figure 5.
4.2. Global Attention Mechanism
In the context of infrared imaging gas leak detection, weak-concentration gas plume targets are often confused with background noise, posing challenges for detection and segmentation. To address this issue, adding the global attention mechanism can assign different weights to different parts of the input feature map, allowing the network to focus fully on gas plume targets while ignoring irrelevant background information, thus improving the detection accuracy.
The global attention mechanism (GAM) [51] consists of a channel attention submodule $M_c$ and a spatial attention submodule $M_s$. The channel attention submodule preserves information across three dimensions and utilizes a two-layer multilayer perceptron (MLP) to amplify cross-dimensional spatial information interaction, thereby enhancing the feature representation capability (as shown in Figure 6). The spatial attention submodule focuses on spatial information, using two convolutional layers for spatial information fusion and fully learning spatial features (as shown in Figure 7).
The entire process is illustrated in Figure 8. For a given input feature map $F_1$, the feature after channel attention is denoted as $F_2$ and the final output feature is denoted as $F_3$; the intermediate state $F_2$ and the output $F_3$ are defined as follows:
$$F_2 = M_c(F_1) \otimes F_1, \qquad F_3 = M_s(F_2) \otimes F_2,$$
where $\otimes$ indicates element-wise multiplication.
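The two-step weighting can be sketched in a few lines of numpy. This is a simplified stand-in, not the actual GAM implementation: the channel submodule here is a random-weight two-layer MLP and the spatial submodule a 1 × 1 channel projection, whereas the real GAM uses a permute + MLP and two 7 × 7 convolutions; only the structure $F_2 = M_c(F_1) \otimes F_1$, $F_3 = M_s(F_2) \otimes F_2$ is faithful.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, r=4):
    # Two-layer MLP over the channel dimension with reduction ratio r,
    # producing per-position channel weights in (0, 1).
    C, H, W = F.shape
    W1, W2 = rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C))
    x = F.reshape(C, -1).T             # (H*W, C): MLP acts on channels
    x = np.maximum(x @ W1, 0) @ W2     # hidden ReLU layer, then expand
    return sigmoid(x.T.reshape(C, H, W))

def spatial_attention(F):
    # Stand-in for the two 7x7 convs: a 1x1 projection across channels.
    C = F.shape[0]
    w = rng.normal(size=(C, C))
    return sigmoid(np.einsum("ij,jhw->ihw", w, F))

F1 = rng.normal(size=(8, 16, 16))      # toy input feature map (C,H,W)
F2 = channel_attention(F1) * F1        # element-wise multiplication
F3 = spatial_attention(F2) * F2
assert F1.shape == F2.shape == F3.shape  # GAM preserves the map shape
```

Because both submodules output sigmoid weights and multiply element-wise, the feature map shape is preserved end to end, which is what allows the GAM block to be dropped into the neck without altering the surrounding layers.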
In the YOLOv8 model, the backbone is responsible for extracting feature information from images, whereas the neck section focuses on better utilizing the extracted features for feature fusion. For gas leak detection tasks, noise in infrared images can significantly affect the feature extraction of gas plume targets. Adding an attention mechanism at the backbone feature extraction stage would amplify the noise impact while reinforcing the gas plume target features. Therefore, this paper adds a global attention mechanism at the neck feature fusion stage to enhance the feature fusion capability for gas plume targets and improve the detection accuracy.
4.3. Small Target Detection Layer
The original YOLOv8 model is equipped with three detection heads, enabling multi-scale object detection. The detection scales are P3/8, P4/16, and P5/32, which correspond to feature map sizes of 80 × 80, 40 × 40, and 20 × 20, respectively. These feature maps are responsible for detecting objects of sizes 8 × 8, 16 × 16, 32 × 32, and larger. However, because gas plume clusters contain small targets that are often influenced by environmental interference, the deeper feature maps struggle to capture the features of small objects effectively. Consequently, the original model exhibits poor performance in detecting small targets.
This paper proposes adding a small object detection layer to the original network, specifically a P2/4 detection layer with a size of 160 × 160. This modification enhances the semantic information and feature representation capabilities for small objects by incorporating supplementary fusion feature layers and an additional detection step. The 80 × 80 scale layer from the fifth layer (P2) in the backbone is stacked with an upsampled feature layer in the neck. After processing with C2f and upsampling, this produces a deep semantic feature layer containing small object feature information. This layer is then further stacked with the shallow positional feature layer from the third layer in the backbone, creating a comprehensive 160 × 160 scale fusion feature layer that improves the expression of semantic and positional information for small objects. Finally, this enriched feature layer is passed through a C2f module to an additional decoupled head.
The enhancement in the head section allows the small object feature information to continue propagating along the downsampling path to the other three scale feature layers. This strengthens the feature fusion capability of the network, and the introduction of an additional decoupled head expands the detection range for gas plume targets. These improvements in the detection accuracy and range enable the network to more precisely identify small gas plume targets within the scene.
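The shape bookkeeping of the extra P2 branch can be sketched as follows. This is an illustration of the upsample-and-concatenate step only, not the actual network; the channel counts (128 and 64) are assumptions chosen for the example:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C,H,W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

neck_feat = np.zeros((128, 80, 80))    # deep semantic feature from the neck
back_p2   = np.zeros((64, 160, 160))   # shallow positional backbone feature
fused_p2  = np.concatenate([upsample2x(neck_feat), back_p2], axis=0)
print(fused_p2.shape)                  # (192, 160, 160)
```

The fused 160 × 160 layer carries both deep semantic channels and shallow positional channels, which is what the additional decoupled head consumes to detect small gas plume targets.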
4.4. Transfer Learning
The objective of transfer learning is to apply knowledge or patterns learned from one domain or task to a different but related domain or problem [52]. Given that gas plume targets and smoke share similarly uncertain contour characteristics, this paper employs a smoke dataset with similar features for pre-training via transfer learning. This approach yields a prior model capable of effectively extracting features from gas plume targets. The smoke images used for this purpose were sourced from publicly available online datasets.
Testing the model, which was trained on a visible light smoke image dataset, directly on infrared gas images revealed that the smoke model could effectively identify some gas regions, as shown in Figure 9. This demonstrates that infrared gas plumes and smoke share similar features and confirms that the prior model has the capability to learn and extract features of gas plume targets.
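A minimal sketch of the fine-tuning setup described above: keep the feature-extraction weights learned on the smoke data and reinitialize only the task-specific head before training on the infrared gas data. Weights are modelled as a plain dict here, and the layer names are purely illustrative, not actual checkpoint keys:

```python
import random

def init_for_finetune(pretrained, head_prefix="head."):
    # Reuse all pre-trained weights except those in the detection head,
    # which are re-drawn from a small Gaussian before fine-tuning.
    weights = {}
    for name, value in pretrained.items():
        if name.startswith(head_prefix):
            weights[name] = random.gauss(0.0, 0.01)  # fresh head weights
        else:
            weights[name] = value                    # keep smoke features
    return weights

smoke_ckpt = {"backbone.c2f1": 0.7, "neck.gam": -0.2, "head.cls": 1.3}
ft = init_for_finetune(smoke_ckpt)
assert ft["backbone.c2f1"] == 0.7    # backbone carried over unchanged
```

The carried-over backbone and neck weights act as the prior for gas plume features, while the fresh head adapts to the MWIRGas-Seg label space during fine-tuning.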
6. Conclusions
In this paper, a cooled mid-wave infrared thermal imager was used to create MWIRGas-Seg, a dataset for mid-wave infrared gas leak segmentation, which was analyzed according to experimental scenarios and visual judgments. A model specifically designed for mid-wave infrared gas leak detection and segmentation, MWIRGas-YOLO, was proposed to achieve effective detection of leaking gas and segmentation of gas plumes in the scene. Compared with typical image segmentation models, the proposed model demonstrated superior performance in gas target detection and segmentation. Our method focuses directly on gas plume features in infrared images: the embedded GAM enhances feature fusion and directs attention to gas plume targets, the supplementary small target detection layer effectively addresses missed detections of small gas plume targets, and pre-training on smoke images provides substantial prior knowledge for learning infrared gas plume characteristics. The experimental results indicate that the MWIRGas-YOLO model designed in this paper exhibits superior detection and segmentation performance, making it suitable for infrared gas image detection and segmentation tasks. Importantly, the proposed method does not depend on a specific infrared detector and can therefore be applied broadly in the field of infrared imaging for gas leak detection.
Future studies will continue to explore this technology by expanding the dataset to represent a broader range of real-world leak scenarios, making the algorithm more applicable to leak detection in practice. Additionally, as gas leaks are continuous processes, future efforts will investigate incorporating temporal information from videos and exploring sequence modeling to address gas leak issues, further optimizing the algorithm for enhanced accuracy.