Article

Flame and Smoke Semantic Dataset: Indoor Fire Detection with Deep Semantic Segmentation Model

School of Automation, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(18), 3778; https://doi.org/10.3390/electronics12183778
Submission received: 5 August 2023 / Revised: 31 August 2023 / Accepted: 4 September 2023 / Published: 7 September 2023

Abstract

Indoor fires can easily cause property damage and, especially, serious casualties. Early and timely fire detection helps firefighters make scientific judgments on the cause of fires, thereby effectively controlling fire accidents. However, most existing computer-vision-based fire detection methods are only able to detect a single case of flame or smoke. In this paper, a tailored deep-learning-based scheme is designed to simultaneously detect flame and smoke objects in indoor scenes. We adopt the semantic segmentation architecture DeepLabv3+ as the main model, an encoder-decoder architecture for both the detection and segmentation of fire objects. Within this architecture, key modules such as atrous convolution are integrated to preserve feature-map resolution and accurately locate targets. In addition, to address the lack of indoor fire data, we prepare and construct a new annotated dataset named the ‘Flame and Smoke Semantic Dataset (FSSD)’, which includes extensive semantic information of fire objects collected from real indoor scenes and other fire sources. Experiments conducted on our FSSD dataset, together with comparisons against state-of-the-art methods (FCN, PSPNet, and DeepLabv3), confirm the high performance of the proposed scheme, which achieves 91.53% aAcc, 89.67% mAcc, and 0.8018 mIoU.

1. Introduction

People spend most of the day indoors, and the elderly and children stay indoors even longer. The impact of the indoor environment on people’s lives, work, and health is therefore far greater than that of the outdoor environment [1,2]. Frequent indoor fires have caused immeasurable harm to society, with large losses of life and property every year [3]. Therefore, indoor fire detection is a research hotspot and a necessary means of reducing losses caused by fires. In the early stages of a fire, smoke usually appears before flames because of the incomplete combustion of objects. Detecting smoke early can therefore provide more time for fire alarm and extinguishing [4]. For this reason, the accurate identification of both smoke and flames is especially crucial in indoor scenes [5].
Several fire detection and segmentation techniques have been proposed [6], but this domain is still far from mature for several reasons: (1) Fire objects are composed of flames and smoke. Because of their different and complex physical characteristics, it is difficult to detect flames and smoke simultaneously, which is reflected in two challenges: inconsistent feature descriptions of flame or smoke, and a lack of datasets containing both flame and smoke. (2) No indoor fire benchmark dataset is available. On the one hand, because indoor fire sites may involve casualties, such datasets are, for ethical reasons, rarely shared publicly or even used; on the other hand, the complexity of the indoor environment, the damage that fires cause to it, and the lack of advanced monitoring equipment make it difficult to collect a large number of high-quality indoor fire images, thereby limiting research in this area. (3) Segmenting both flame and smoke is challenging. The difficulty of fire recognition lies in the uncertain morphology of flame and smoke. In addition, smoke, unlike flame, is partially transparent, so it easily blends with the image background and is difficult to segment completely with masks. The irregular and chaotic motion of flames and smoke further complicates fire recognition and segmentation.
To address the above problems, we implement a DL-based model for fire detection and segmentation in indoor environments. Specifically, an efficient semantic segmentation model, DeepLabv3+ [7], is employed to identify both flames and smoke, and we combine it with modules such as atrous convolution and an encoder-decoder structure to achieve accurate edge segmentation. The proposed scheme can simultaneously perform the detection and segmentation of fire objects, including both flame and smoke. The main contributions of this paper can be summarized as follows:
  • We present a semantic segmentation scheme by incorporating the DeepLabv3+ model for intelligent fire disaster management. Our scheme has high adaptability to complex indoor environments and can be used for the segmentation of both flame and smoke objects.
  • To augment real-world fire datasets, we build a unique fire semantic segmentation dataset named ‘Flame and Smoke Semantic Dataset (FSSD)’ by manually labeling the flames and smoke objects.
  • Experiments are conducted to demonstrate the validity of the proposed scheme, and it is compared with state-of-the-art semantic segmentation models using several metrics, i.e., global accuracy (aAcc), mean accuracy (mAcc), and mean intersection over union (mIoU). Ultimately, we advocate DeepLabv3+ as an optimal semantic segmentation model for fire detection and segmentation.
The remainder of this paper is organized as follows. Section 2 reviews the literature on fire detection using traditional approaches, deep-learning-based methods, or both. Our collected FSSD database, which contains both semantic and annotation information of images, is presented in Section 3. Complete details of our DL framework for flame and smoke segmentation in indoor scenes are provided in Section 4. Section 5 presents extensive experiments with our FSSD. Section 6 concludes the paper and points out directions for future research. The general framework of our scheme is shown in Figure 1.

2. Literature Review

Existing research on fire detection can be divided into two broad categories: traditional and DL-based approaches. Several studies based on conventional approaches have been reported. For example, Peng et al. [8] used a manually designed algorithm to capture suspected smoke areas: each frame captured by the surveillance camera was processed using a background model built on hand-crafted features. In 2019, Gong et al. [9] presented innovative detection approaches based on the multi-feature fusion of flame and a new flame centroid stability algorithm based on space-time relations, with a support vector machine used for training. The algorithm proposed by Wang et al. [10] applied a rapid early fire smoke detection method using slope fitting on the histogram of the video image; smoke alarms were triggered by setting a time window, using the color and diffusion characteristics of the smoke, and fitting the linear rate of change within the time window. Sun et al. [11] proposed a smoke identification model and a smoke concentration inversion model in 2023, both based on the Mahalanobis distance (MD). Since some researchers only accurately annotated part of their data, a semisupervised learning method based on fire features was proposed for this situation [12]. A novel system combining laser-induced breakdown spectroscopy and machine learning was developed to detect and identify smoke from various tree species [13]. The Gaussian mixture model (GMM) was used to speed up detection and remove most of the static background, while a transfer model was used to effectively detect smoke areas [14]. In addition, a surrogate modeling technique based on machine learning was proposed that succeeded in calibrating fire source characteristics [15]. These approaches used a variety of features, such as color, shape, and texture, to identify fire areas without a deep learning process. Moreover, several methods used low-level features and passed them on to different classifiers or clustering techniques for fire prediction.
In contrast to the traditional approaches, DL-based methods utilize learned features to identify fire objects and can often perform detection more effectively than humans or traditional machine vision solutions. Several DL-based methods employed different models for fire detection and recognition, built on convolutional, pooling, and fully connected layers for learning visual concepts. A unified flame and smoke detection system was proposed by Hosseini et al. [16]; the method can classify frames into eight classes, and a decision-making module based on a voting scheme further increases reliability. Zhan et al. [17] proposed a new adjacent layer composite network for detecting forest fire smoke by unmanned aerial vehicle (UAV); by improving the backbone, feature pyramid network (FPN), and nonmaximum suppression (NMS), the model performance was improved. Some researchers were devoted to improving DL models with strategies such as intersection over union (IoU) calculation and loss function calculation [18]. In 2021, Wu et al. [19] proposed a patchwise dictionary learning method aimed at detecting forest fire smoke, a feature extraction method based on pixel blocks, and an online dictionary learning method based on an elastic-net-based sparse representation algorithm. Building on the advantages of convolutional neural networks (CNNs), Lin et al. [20] proposed a joint deep detection framework for identifying smoke. To perform early smoke detection and real-time processing in normal, foggy, or uncertain Internet of Things (IoT) environments, the method in [21] presented an energy-efficient system that uses the visual geometry group (VGG) network. To further improve model robustness, Peng et al. [8] focused on modifying the model layers, including depthwise separable convolution layers and batch normalization layers. The study in [14] first used smoke domain knowledge to segment suspicious smoke areas in video frames, and then a deep network was designed to extract features of smoke regions and distinguish them from all suspicious areas. Li et al. [22] built an end-to-end detector and then constructed a detection framework. The main limitation of these methods is that they cannot achieve simultaneous fire detection and segmentation.
Recently, Khan et al. [23] attempted to address this limitation and proposed a two-step smoke detection method. First, the EfficientNet architecture was improved to classify smoke, nonsmoke, smoke with fog, and nonsmoke with fog. Next, the detection results were transferred to a DL module to realize smoke segmentation. However, this two-step pipeline is relatively complex, and flame identification is ignored. To solve these problems, we propose a one-stage scheme that performs well in simultaneous smoke and flame segmentation. Specifically, Section 4 describes the details of the proposed method, and Section 5 discusses the experimental verification.

3. FSSD Dataset Preparation

DL-based methods automatically extract the required features from raw data; therefore, a suitable and sufficient training dataset is critical to the entire training process, and inadequate training data will result in overfitting. However, collecting an indoor fire dataset is difficult and time-consuming, for four main reasons: (1) Serious fires may cause casualties, so out of social ethics and morality, indoor fire scene images are often not published. (2) Once a fire breaks out, it spreads rapidly, burning nearby combustibles beyond recognition; the fire scene becomes extensively damaged and chaotic, resulting in incomplete collection of fire data. (3) Advanced monitoring equipment is a critical component of fire investigation, but the equipment at fire scenes is often insufficient and relatively outdated because of a lack of attention, insufficient capital investment, and unscientific management practices. (4) Indoor scene photographs are more complicated than outdoor images because of their rich backgrounds, intricate interior decoration, and severe occlusion; different viewpoints, scales, and texture variations in cluttered environments add further complexity. Because there are not enough labeled datasets in the smoke and flame detection field, it is necessary to collect a standard annotated fire dataset.
In this paper, we collect a dataset containing indoor fire images with semantic annotations, named ‘Flame and Smoke Semantic Dataset (FSSD)’. Next, a detailed description is given.

3.1. Image Collection

Our data collection and annotation work lasted from October 2022 to April 2023. FSSD contains a total of 1968 labeled real-world images, which consist of two categories: (1) indoor fire images and (2) fire images in other various scenes. The collection contains 2971 annotated instances of the fire category. Table 1 displays the number of annotated instances of each part in the dataset. The first part contains 304 labeled images that are gathered by searching for related keywords on the Kaggle platform (https://www.kaggle.com/ accessed on 10 October 2022). Some examples are shown in Figure 2. The second part of FSSD contains fire images derived from multiple scenes, not just indoor ones, as illustrated in Figure 3. The data collection procedure is detailed below.
Indoor fire images. The first part of our FSSD includes real, on-site indoor fire scenes. We collected these from the Kaggle website, a machine learning and data science community with powerful tools and resources. First, we collected images on a large scale using keywords such as fire, smoke, and building fire. Then, after filtering the images one by one, we kept only the indoor fire images and discarded the rest. During this process, we also collected 16 videos. Since consecutive frames of a video usually show the same scene, we captured an image every 15 s. Finally, we obtained 191 images from these 16 videos and an additional 113 images, for a total of 304.
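As an illustration of this frame-sampling step, the sketch below saves one frame every 15 s from a video. It assumes OpenCV is available, and the file names are placeholders rather than the actual FSSD source videos.

import cv2

def sample_frames(video_path: str, out_dir: str, interval_s: float = 15.0) -> int:
    # Open the video and derive how many frames correspond to one sampling interval.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back to 25 fps if metadata is missing
    step = max(int(round(fps * interval_s)), 1)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                      # keep one frame per 15 s interval
            cv2.imwrite(f"{out_dir}/frame_{saved:04d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(sample_frames("fire_video_01.mp4", "frames"))  # placeholder paths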
Fire in other scenes. The second component of our FSSD dataset was gathered from publicly available datasets and other websites. The scenes in these images include buildings, churches, factories, streets, airports, cars, trains, ships, forests, grasslands, and farmland. In the screening, we selected the images with obvious flame characteristics, clear boundaries, flames not occupying the entire image, and low smoke transparency, then removed the rest. Finally, 1664 images were selected to form our dataset in this part.

3.2. Image Annotation

This is our own proprietary dataset for evaluating the semantic segmentation module. A fire will be recognized as soon as flames or smoke are detected. Its images are annotated using LabelMe [24], a graphical image polygonal annotation tool in the Python environment.
LabelMe is an image annotation tool developed by the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT). The original version is a web-based JavaScript tool that can be used anywhere without installing large datasets on the local computer, while the widely used desktop implementation is written in Python and PyQt; its source code is publicly available and can be installed on a server to create customized annotation tasks or perform image annotation directly. The main functions are as follows: (1) labeling images with polygons, rectangles, circles, polylines, line segments, and points (which can be used for tasks such as object detection and image segmentation); (2) labeling images with flags (which can be used for image classification and cleaning tasks); (3) video annotation; (4) generating a dataset in visual object classes (VOC) format; and (5) generating a dataset in common objects in context (COCO) format.
To ensure quality and consistency, the following criteria apply to the annotation process:
Sufficient size. Each annotation object in the image must be large enough in both dimensions; otherwise, it is discarded (e.g., for being too thin or too small). The smallest acceptable size is roughly 1/50 of the image size, although this is a soft criterion estimated manually.
Clear boundaries. Each annotation object should have a clear boundary, otherwise, it will be discarded.
Category definition. Each annotation object belongs to the category “fire”; otherwise, it will be ignored. The object category is uniformly labeled as “fire” when the object is recognized as (1) flame only, (2) smoke only, or (3) both flame and smoke.
Appropriate annotation point density. A low annotation point density can reduce the model’s detection accuracy and performance, whereas an excessively high density takes too much work and resources while not considerably improving accuracy. Therefore, an appropriate density of annotation points is crucial for training a deep model that meets our expectations. It is particularly important to increase the density of annotation points at the irregular boundaries of smoke and flames, although this criterion is likewise soft and estimated manually.
Each object that meets the above criteria is annotated with a polygon indicating its boundary.
During the image annotation process, the annotation region must form a closed loop. Each mouse click places a point: starting from an initial point, the annotator draws line segments point by point around the selected object until the final point meets the starting point and closes the loop, at which time a prompt box appears for entering the label name. After annotating the dataset, we obtain the same number of JavaScript object notation (JSON) files as images.
The detailed annotations are stored in Microsoft COCO JSON files [25]. To reach a consensus, two authors cross-checked each annotated image. Sample images are given in Figure 4, where each labeled instance is indicated by a color mask and the category name is shown in the lower right corner.
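For training a segmentation model, these polygon annotations must be rasterized into per-pixel masks. The sketch below converts a LabelMe-style JSON file into a binary fire mask; the field names (shapes, points, imageHeight, imageWidth) follow the common LabelMe export layout and may need to be adapted to the actual files.

import json
import numpy as np
from PIL import Image, ImageDraw

def json_to_mask(json_path: str) -> np.ndarray:
    # Load the annotation file and create an empty background mask.
    with open(json_path, "r") as f:
        ann = json.load(f)
    h, w = ann["imageHeight"], ann["imageWidth"]
    mask = Image.new("L", (w, h), 0)             # 0 = background
    draw = ImageDraw.Draw(mask)
    # Fill every polygon labeled "fire" with class index 1.
    for shape in ann["shapes"]:
        if shape.get("label") == "fire":
            points = [tuple(p) for p in shape["points"]]
            draw.polygon(points, outline=1, fill=1)
    return np.array(mask, dtype=np.uint8)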

4. Methodology

4.1. Semantic Segmentation Background and DeepLabv3+ Model

The semantic segmentation model is essentially built on a classification model, which uses a CNN to extract features for classification. In general, a CNN uses convolution and pooling layers with stride > 1 to downsample the input image, reducing the dimensionality of the feature map and obtaining higher-level features with richer semantics. This principle suits simple classification tasks rather than segmentation tasks: the former only needs to predict a global probability, while the latter requires classification probabilities at every position of the input image, so if the feature map is too small, a lot of information is lost. However, downsampling layers are indispensable. A downsampling layer with stride > 1 is very important for enlarging the receptive field, which enriches the semantics of high-level features and is crucial for segmentation tasks that require a larger receptive field. Conversely, without downsampling layers, the feature map would always keep its original size, resulting in a massive amount of computation and making it harder to aggregate context semantic information. At the same time, spatial information is crucial for accurate segmentation, and heavy downsampling leads to inaccurate boundary localization. In summary, semantic segmentation based on deep convolutional neural networks (DCNNs) faces two challenges: convolution and pooling operations reduce the feature map resolution, and the enlarged receptive field introduces a degree of spatial invariance in the feature map, which makes accurate localization difficult.
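The trade-off described above can be illustrated with a minimal PyTorch example: a stride-2 convolution halves the feature-map resolution, whereas an atrous (dilated) convolution enlarges the receptive field while keeping the resolution unchanged.

import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)                    # dummy input feature map

strided = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
atrous = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=2, dilation=2)

print(strided(x).shape)   # torch.Size([1, 16, 32, 32]) -> resolution halved
print(atrous(x).shape)    # torch.Size([1, 16, 64, 64]) -> resolution kept, larger receptive field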
To alleviate these problems, DeepLabv1 [26] in 2015 introduced atrous convolution and the fully connected conditional random field (CRF) to overcome the limitations of deep networks and better pinpoint segmentation boundaries. Compared to the previous version, DeepLabv2 [27] in 2017 made the following improvements: first, the backbone was changed from VGG [28] to the residual network (ResNet) [29]; second, atrous spatial pyramid pooling (ASPP) was proposed to acquire multiscale information. ASPP is an atrous-convolution-based SPP that fully exploits atrous convolution to effectively expand the receptive field and merge more contextual information without increasing the number of parameters. Then, to handle the multiscale segmentation problem, DeepLabv3 [30] designed cascaded or parallel atrous convolution modules; its innovation lies in a multigrid strategy that replaces the later convolution layers with atrous convolutions at different rates, which increases the receptive field while maintaining the feature map resolution. In 2018, DeepLabv3+ extended DeepLabv3 by embedding the ASPP module within an encoder-decoder structure: the encoder extracts rich semantic information while the decoder restores fine object edges. Table 2 compares the structural components of these four semantic segmentation models.

4.2. DL Fire Segmentation Architecture

Figure 5 depicts the overall DL fire segmentation architecture based on the DeepLabv3+ model, which is divided into two parts: an encoder and a decoder. The main body of the encoder is a backbone DCNN with atrous convolution, used to extract image features. The ASPP+ network then has three tasks: processing the output of the backbone, concatenating the results, and using a 1 × 1 convolution to reduce the number of channels while introducing multiscale information. Following that, the decoder further integrates the intermediate output of the backbone (low-level features) and the ASPP+ output (high-level features) to increase the accuracy of boundary segmentation. Furthermore, the ASPP+ and decoder modules adopt depthwise separable convolution to form a faster and more powerful encoder-decoder network. The specifics of the main modules are as follows:

4.2.1. ASPP+

ASPP is a key module for semantic segmentation in the DeepLab series. Detected objects have varied sizes, which makes segmentation more difficult. Earlier methods rescaled the image to multiple sizes and fused the results after DCNN detection, resulting in high computational complexity. To solve this problem, DeepLabv2 adopts the ASPP module to obtain multiscale information without heavy computation. The ASPP+ module was then proposed in DeepLabv3, improving the ASPP module to extract multiscale information further: compared to DeepLabv2, DeepLabv3 incorporates a batch normalization (BN) layer in ASPP and adds global pooling over the final feature map. DeepLabv3+ also adopts the ASPP+ module, whose structure is shown in Figure 6; it mainly consists of the following parts (a code sketch is given after the list):
(1) One 1 × 1 convolution layer and three 3 × 3 atrous convolutions are utilized. The scale ratio of the input image to the output feature map is recorded as output_stride. For output_stride = 16, the rate is (6, 12, 18), and for output_stride = 8, the rate is doubled. These convolutional layers have 256-channel outputs and include a BN layer.
(2) A global average pooling layer obtains features of the image level and then feeds them to a 1 × 1 convolutional layer (output 256 channels) for bilinear interpolation to the original size.
(3) The resulting multiscale features are concatenated along the channel dimension, then passed through a 1 × 1 convolution for fusion, finally yielding a 256-channel feature map.
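A minimal PyTorch sketch of these ASPP+ branches is given below. It follows the structure described above (a 1 × 1 branch, three 3 × 3 atrous branches with rates 6/12/18 for output_stride = 16, and a global pooling branch) and is illustrative rather than the exact implementation used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, d):                         # conv + BN + ReLU branch
            pad = 0 if k == 1 else d
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1, 1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(          # image-level (global pooling) branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(             # 1x1 fusion of the concatenated branches
            nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        size = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=size,
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))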

4.2.2. Decoder

In DeepLabv3, the feature map output by the ASPP+ module and the classification layer is simply restored to the original image size by bilinear interpolation. This is a very rough decoding scheme and is not suitable for obtaining finer segmentation results. Thus, DeepLabv3+ takes advantage of the encoder-decoder structure, as shown in Figure 7. First, the output features of the encoder module are upsampled by a factor of 4 using bilinear interpolation and are then concatenated with low-level features of the corresponding size from the encoder module.
The encoder output has only 256 channels, whereas the low-level features may have far more. To prevent the high-level features obtained by the encoder from being diluted, a 1 × 1 convolution is first applied to reduce the number of channels of the low-level features. The two feature maps are then concatenated and fused using a 3 × 3 convolution. Finally, bilinear interpolation is used to obtain a segmentation prediction of the same size as the original image.
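The decoder logic can be summarized with the following sketch: a 1 × 1 convolution reduces the low-level feature channels, the ASPP+ output is upsampled and concatenated with them, and a 3 × 3 convolution fuses the result before the final upsampling. The 48-channel reduction follows the original DeepLabv3+ paper and is an assumption here, as is the two-class output (background and fire).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, low_ch: int, aspp_ch: int = 256, num_classes: int = 2):
        super().__init__()
        self.reduce = nn.Sequential(              # 1x1 conv so low-level features do not dominate
            nn.Conv2d(low_ch, 48, 1, bias=False),
            nn.BatchNorm2d(48), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(                # 3x3 fusion of the concatenated features
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1))       # per-pixel class scores

    def forward(self, aspp_out, low_level, image_size):
        low = self.reduce(low_level)
        up = F.interpolate(aspp_out, size=low.shape[-2:], mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([up, low], dim=1))
        return F.interpolate(fused, size=image_size, mode="bilinear", align_corners=False)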

4.2.3. Improved Xception

The backbone used by DeepLabv3+ is an improved Xception [17]. As shown in Figure 8, the Xception network mainly uses depthwise separable convolution, which reduces its computational cost. The improvements to Xception are mainly reflected in the following points (a sketch of a depthwise separable convolution block follows the list):
(1) It deepens the Xception structure by adding more layers;
(2) Depthwise separable convolutions with stride = 2 replace all max pooling layers, allowing atrous separable convolution to extract features at arbitrary resolution;
(3) BN and ReLU are added after each 3 × 3 depthwise convolution.
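For clarity, a depthwise separable convolution block of the kind used in the improved Xception can be sketched as follows; it is illustrative only, with BN and ReLU following each convolution as described in point (3). With stride = 2 it can stand in for a max pooling layer, as in point (2).

import torch.nn as nn

class SeparableConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, dilation: int = 1):
        super().__init__()
        self.depthwise = nn.Sequential(           # one 3x3 filter per input channel
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=dilation,
                      dilation=dilation, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.pointwise = nn.Sequential(           # 1x1 convolution mixes the channels
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.pointwise(self.depthwise(x))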

5. Experimental Evaluation

5.1. Experimental Environment and Settings

The experiments are performed on a platform with an NVIDIA GeForce RTX 3080 Ti GPU and a 12th Gen Intel(R) Core(TM) i9-12900K CPU. Our implementation is built on PyTorch [31] and the MMSegmentation toolbox [32], an open-source semantic segmentation toolbox that is part of the OpenMMLab project [33]. We employ ResNetV1c [34] with a depth of 101 layers as the model backbone, pretrained on ImageNet-1k [35], and extract dense feature maps through atrous convolution. The whole model setup follows the PASCAL VOC 2012 [36] semantic segmentation benchmark. Our dataset contains 1968 pixel-level annotated images, split into training, validation, and testing sets with a ratio of 8:1:1. The dataset includes two categories, namely, fire and background.
Specifically, the batch size was set to 4; the optimizer was stochastic gradient descent (SGD) with a momentum of 0.9; the initial learning rate was set to 0.01; and the weight decay was set to 0.0005. In addition, we fine-tuned the batch normalization parameters (output_stride = 16) and applied random-scale data augmentation during training. Input images were uniformly cropped to a size of 512 × 512, followed by a random flip with a probability of 0.5. Table 3 lists the hyperparameter settings.
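The optimizer settings above can be reproduced in plain PyTorch as sketched below. The poly learning-rate schedule (power 0.9) is a common choice for DeepLab-style training and is an assumption here, not a setting listed in Table 3.

import torch

def build_optimizer(model: torch.nn.Module, max_iters: int = 160_000):
    # SGD with the hyperparameters from Table 3.
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)
    # Poly decay: lr is scaled by (1 - iter / max_iters) ** 0.9 at each step (assumed schedule).
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda it: (1 - it / max_iters) ** 0.9)
    return opt, sched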
The performance is measured using three major metrics: aAcc, mAcc, and mIoU. The first two relate to accuracy. The aAcc is the ratio of the number of correctly classified pixels to the total number of pixels, as shown in Equation (1). The mAcc is the sum of the per-category accuracies divided by the number of categories, as shown in Equation (2), where Acc_i is the accuracy of the i-th category and M is the total number of categories. The mIoU is the pixel IoU averaged across the two categories. More detailed indicators, such as accuracy (Acc) background, Acc fire, IoU background, and IoU fire, are also used for evaluation: Acc background and Acc fire denote the accuracy of the background and fire categories, and IoU background and IoU fire denote their respective IoU values. The IoU measures how much the predicted region overlaps the actual region relative to their union; the larger the IoU, the more accurately the algorithm localizes the target. Let A be the actual region and C the predicted region, let B be their intersection, and denote their areas by S_A, S_C, and S_B. The IoU is calculated as in Equation (3).

aAcc = (number of correctly classified pixels) / (total number of pixels)        (1)

mAcc = (1/M) Σ_{i=1}^{M} Acc_i        (2)

IoU = S_B / (S_A + S_C − S_B)        (3)
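These metrics can be computed from a per-class confusion matrix, as the following sketch shows for the two classes (background and fire).

import numpy as np

def segmentation_metrics(conf: np.ndarray) -> dict:
    """conf[i, j] = number of pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    per_class_acc = tp / conf.sum(axis=1)                            # Acc_i
    per_class_iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # intersection / union
    return {
        "aAcc": tp.sum() / conf.sum(),      # Equation (1)
        "mAcc": per_class_acc.mean(),       # Equation (2)
        "mIoU": per_class_iou.mean(),       # IoU of Equation (3), averaged over classes
    }

# Toy example with rows/columns ordered as (background, fire):
print(segmentation_metrics(np.array([[900, 50], [30, 120]])))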

5.2. Experiment 1: Performance Analysis of the Proposed Scheme

To verify the effectiveness of the proposed model, we provide visual results on our FSSD dataset in various indoor scenes, e.g., office, kitchen, and living room. As displayed in Figure 9, the proposed scheme is able to segment fire objects well without any post-processing. The first and third rows of Figure 9 show input fire images, while the corresponding outputs segmented by the proposed model are shown in the second and fourth rows, where the fire object is indicated by a red mask and the other areas are darkened. It can be observed that the fire scenes in Figure 9(a1,a2) contain relatively strong fires and generate a large amount of thick smoke. In Figure 9(b1,b2,c2), indoor beds and sofas are covered with combustible fabric materials, making them highly flammable, while the fireplace and kitchen stove are also dangerous because of their obvious ignition sources, as shown in Figure 9(c1,d1). In addition, as shown in Figure 9(d2), electric vehicles catching fire or even exploding have also become a growing concern. These visualization results verify that the proposed scheme can segment both flame and smoke in indoor fire scenes; hence, it is feasible to identify fire objects using the proposed scheme.
Figure 10a displays the loss curve and Figure 10b shows three main metric curves (mIoU, mAcc, and aAcc) during the training process. The loss curve is made up of loss (vertical axis) and iteration (horizontal axis). It can be observed from Figure 10a that the loss value first rapidly decreases from 0.9 to 0.4 at about 20,000 iterations, and then slowly drops to approximately 0.07 at 160,000 iterations. As the number of iterations increases, the loss value can converge to around 0.05. The metric curves consist of mIoU, mAcc, aAcc values (vertical axis), and epoch (horizontal axis), which are more objective judgment criteria for the evaluation of model performance. It can be observed from Figure 10b that at 212 epochs, the first metric reaches a mIoU value of about 0.80, the second metric achieves about 89% mAcc, and the last one attains approximately 91% aAcc. Overall, the value of the aAcc curve is always greater than that of the mAcc curve. Their comparison results demonstrate that global accuracy is much higher than mean accuracy during the training process, because global accuracy actually ignores some samples with poor accuracy in the fire category.

5.3. Experiment 2: Comparison with State-of-the-Art Semantic Segmentation Models

To demonstrate the superiority of our model, our semantic segmentation approach was compared with other state-of-the-art semantic segmentation models: the Fully Convolutional Network (FCN) [37], the Pyramid Scene Parsing Network (PSPNet) [38], and DeepLabv3 [30]. The results were evaluated on our own FSSD dataset. All baselines adopt ResNetV1c with a depth of 101 layers as the backbone, with input images cropped to 512 × 512, and all other settings follow the same training protocol.
Table 4 displays the comparison results evaluated with the aAcc, mAcc, Acc.background, and Acc.fire metrics. FCN performs the worst on the first three indicators and the best on the last one on our FSSD dataset. On all indicators, the proposed scheme outperforms PSPNet and DeepLabv3. Overall, it achieves the best results on all indicators except Acc.fire, which is 1.31 percentage points lower than that of the FCN model, a gap within an acceptable range. The results show that the proposed scheme is superior to the existing state-of-the-art deep semantic segmentation models.
Table 5 shows the experimental results using the following evaluation metrics: mIoU, IoU.background, and IoU.fire. As for the mIoU, our model achieved 0.8018 mIoU; 0.0104 higher than the DeepLabv3 model, 0.0128 higher than the PSPNet model, and 0.0266 higher than the FCN model. In general, the IoU.background metric is always better than IoU.fire. We are able to confirm that our scheme outperformed all existing models in all of the IoU evaluation indicators, demonstrating its effectiveness.
Some representative comparison results are visualized in Figure 11, which intuitively demonstrates the advantages of our scheme over state-of-the-art semantic segmentation models. Figure 11a displays the input indoor fire images, and Figure 11b–e show the segmentation results of FCN, PSPNet, DeepLabv3, and our scheme, respectively. The comparison indicates that our scheme performs well on both large and small indoor fires: it has a significant advantage over the other models when the flame and smoke are large, and is slightly better when the fire is small. It can also be observed that the other three models are not sensitive to smoke targets, so in most cases smoke is missed or poorly segmented. In addition, in terms of segmentation details, the other models tend to include the neighborhood of fire objects, resulting in unclear segmentation edges. Overall, the segmentation results verify that the proposed scheme outperforms the existing models by correctly and simultaneously segmenting flame and smoke.

5.4. Experiment 3: Some Unsatisfactory Segmentation Results and Analysis

Some examples of unsatisfactory segmentation results obtained by the proposed scheme are also shown in Figure 12. The first example, as shown on the right of Figure 12(a2), demonstrates that the proposed scheme can segment flame and smoke exactly. However, the bright area on the left side of the television (TV) (indicated by blue bounding box and arrow) is also detected, which is not an actual fire but a reflection of the fire. In Figure 12(b2), several characters on the banner are mistakenly identified as targets (marked with blue bounding box and arrow). That is because the characters of the banner are yellow with a red background, which is very similar to the color of the flame, and the structure of the characters is also very similar to the irregular morphology of flame. The above issues need to be further addressed to reduce misjudgment.

6. Conclusions and Discussion

This paper employed a semantic segmentation model to detect and segment fire objects in indoor scenes. DeepLabv3+ was adopted as the main model to segment both flame and smoke simultaneously. Thanks to its encoder-decoder architecture, DeepLabv3+ can remove redundant information from the input data and obtain high-quality segmentation results. Furthermore, a new dataset, FSSD, was collected to support model training and address the issue of insufficient indoor fire data. Unlike other fire image datasets, our FSSD provides rich semantic information of fire objects. Experimental results on the FSSD dataset demonstrated the effectiveness of DeepLabv3+ for fire segmentation, with the proposed scheme achieving an mAcc of 89.67% and an mIoU of 0.8018.
Further efforts are needed to extend this research. First, indoor fire data remain insufficient; we are considering using generative adversarial networks to synthesize indoor fire images and expand our database. Second, this work achieved the detection and recognition of fire objects, but more information about the surroundings needs to be obtained and analyzed. Future research should focus on acquiring additional fire load (combustible material) information, which is crucial for assessing indoor fire losses and guiding firefighters in rescuing trapped individuals.

Author Contributions

Conceptualization, F.H.; data curation, X.R. and Y.C.; formal analysis, F.H.; funding acquisition, X.F.; investigation, Y.C.; methodology, F.H.; project administration, X.F.; resources, F.H.; software, F.H.; supervision, X.F.; validation, X.R.; visualization, X.R.; writing—original draft, F.H. and X.R.; writing—review and editing, X.R. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Changsha Natural Science Foundation under Grant kq2208285, and National Natural Science Foundation of China under Grant 62203475.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, Xinyu Fan, upon reasonable request.

Acknowledgments

We were supported by the 2023 College Student Innovation and Entrepreneurship Training Program Project. We also acknowledge Mengqi Fang, Qian Yu, Wenqing Zhao, Yuanzhi Wang, and Haoran Wang from Central South University for their contribution to the dataset annotation.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, F.; Chang-Richards, A.; Wang, K.I.K.; Dirks, K.N. Effects of indoor environment factors on productivity of university workplaces: A structural equation model. Build. Environ. 2023, 233, 110098. [Google Scholar] [CrossRef]
  2. Chen, Y.; Li, M.; Lu, J.; Chen, B. Influence of residential indoor environment on quality of life in China. Build. Environ. 2023, 232, 110068. [Google Scholar] [CrossRef]
  3. Muhammad, K.; Ahmad, J.; Baik, S.W. Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing 2018, 288, 30–42. [Google Scholar] [CrossRef]
  4. Yuan, F.; Zhang, L.; Xia, X.; Wan, B.; Huang, Q.; Li, X. Deep smoke segmentation. Neurocomputing 2019, 357, 248–260. [Google Scholar] [CrossRef]
  5. Chen, K.; Cheng, Y.; Bai, H.; Mou, C.; Zhang, Y. Research on Image Fire Detection Based on Support Vector Machine. In Proceedings of the 2019 9th International Conference on Fire Science and Fire Protection Engineering (ICFSFPE), Chengdu, China, 18–20 October 2019. [Google Scholar]
  6. Chaturvedi, S.; Khanna, P.; Ojha, A. A survey on vision-based outdoor smoke detection techniques for environmental safety. ISPRS J. Photogramm. Remote Sens. 2022, 185, 158–187. [Google Scholar] [CrossRef]
  7. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  8. Peng, Y.; Wang, Y. Real-time forest smoke detection using hand-designed features and deep learning. Comput. Electron. Agric. 2019, 167, 105029. [Google Scholar] [CrossRef]
  9. Gong, F.; Li, C.; Gong, W.; Li, X.; Yuan, X.; Ma, Y.; Song, T. A Real-Time Fire Detection Method from Video with Multifeature Fusion. Comput. Intell. Neurosci. 2019, 2019, 1939171. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, H.; Zhang, Y.; Fan, X. Rapid Early Fire Smoke Detection System Using Slope Fitting in Video Image Histogram. Fire Technol. 2019, 56, 695–714. [Google Scholar] [CrossRef]
  11. Sun, Y.; Jiang, L.; Pan, J.; Sheng, S.; Hao, L. A satellite imagery smoke detection framework based on the Mahalanobis distance for early fire identification and positioning. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103257. [Google Scholar] [CrossRef]
  12. Sun, G.; Wen, Y.; Li, Y. Instance segmentation using semi-supervised learning for fire recognition. Heliyon 2022, 8, e12375. [Google Scholar] [CrossRef]
  13. Zhai, R.; Ye, Y.; Wan, E.; Chen, Z.; Liu, Y. Source tracing of tree smoke in forest fires based on laser-induced breakdown spectroscopy. Optik 2023, 282, 170867. [Google Scholar] [CrossRef]
  14. Jia, Y.; Chen, W.; Yang, M.; Wang, L.; Liu, D.; Zhang, Q. Video smoke detection with domain knowledge and transfer learning from deep convolutional neural networks. Optik 2021, 240, 166947. [Google Scholar] [CrossRef]
  15. Nguyen, H.T.; Abu-Zidan, Y.; Zhang, G.; Nguyen, K.T.Q. Machine learning-based surrogate model for calibrating fire source properties in FDS models of façade fire tests. Fire Saf. J. 2022, 130, 103591. [Google Scholar] [CrossRef]
  16. Hosseini, A.; Hashemzadeh, M.; Farajzadeh, N. UFS-Net: A unified flame and smoke detection method for early detection of fire in video surveillance applications using CNNs. J. Comput. Sci. 2022, 61, 101638. [Google Scholar] [CrossRef]
  17. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  18. Hu, Y.; Zhan, J.; Zhou, G.; Chen, A.; Cai, W.; Guo, K.; Hu, Y.; Li, L. Fast forest fire smoke detection using MVMNet. Knowl.-Based Syst. 2022, 241, 108219. [Google Scholar] [CrossRef]
  19. Wu, X.; Cao, Y.; Lu, X.; Leung, H. Patchwise dictionary learning for video forest fire smoke detection in wavelet domain. Neural Comput. Appl. 2021, 33, 7965–7977. [Google Scholar] [CrossRef]
  20. Lin, G.; Zhang, Y.; Xu, G.; Zhang, Q. Smoke Detection on Video Sequences Using 3D Convolutional Neural Networks. Fire Technol. 2019, 55, 1827–1847. [Google Scholar] [CrossRef]
  21. Khan, S.; Muhammad, K.; Mumtaz, S.; Baik, S.W.; de Albuquerque, V.H.C. Energy-Efficient Deep CNN for Smoke Detection in Foggy IoT Environment. IEEE Internet Things J. 2019, 6, 9237–9245. [Google Scholar] [CrossRef]
  22. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492. [Google Scholar] [CrossRef]
  23. Khan, S.; Muhammad, K.; Hussain, T.; Del Ser, J.; Cuzzolin, F.; Bhattacharyya, S.; Akhtar, Z.; de Albuquerque, V.H.C. DeepSmoke: Deep learning model for smoke detection and segmentation in outdoor environments. Expert Syst. Appl. 2021, 182, 115125. [Google Scholar] [CrossRef]
  24. Torralba, A.; Russell, B.C.; Yuen, J. LabelMe: Online Image Annotation and Applications. Proc. IEEE 2010, 98, 1467–1484. [Google Scholar] [CrossRef]
  25. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  26. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. Comput. Sci. 2014, 4, 357–361. [Google Scholar]
  27. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  28. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. Comput. Sci. 2014. [Google Scholar] [CrossRef]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  30. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  32. Contributors, M. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 14 March 2023).
  33. Contributors, M. MMCV: OpenMMLab Computer Vision Foundation. 2018. Available online: https://github.com/open-mmlab/mmcv (accessed on 12 March 2023).
  34. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. arXiv 2018, arXiv:1812.01187. [Google Scholar]
  35. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  36. Everingham, M.; Eslami, S.M.A.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651. [Google Scholar]
  38. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
Figure 1. Flame and smoke recognition framework.
Figure 2. Indoor fire images.
Figure 3. Fire in other scenes.
Figure 4. Sample images with instance annotations. (a) Only flame; (b) only smoke; (c) both.
Figure 5. Fire segmentation framework based on DeepLabv3+ model.
Figure 6. ASPP+ module.
Figure 7. Encoder–decoder with atrous convolutions.
Figure 8. The advanced aligned Xception.
Figure 9. Visualization of segmented results by the proposed scheme. There are eight examples: (a1,b1,c1,d1,a2,b2,c2,d2). Each example consists of two parts: input image and segment result.
Figure 10. Performance curves. (a) Loss changes with iterations; (b) mIoU, mAcc, and aAcc changes with epochs.
Figure 11. Comparison of visualization results among different semantic segmentation models, where row (a) represents input fire images, rows (b–d) show the corresponding results segmented by FCN, PSPNet, and DeepLabv3, respectively, and row (e) displays the results of the proposed scheme.
Figure 12. Samples of unsatisfactory segmentation results. Panels (a1,b1) represent original images, and (a2,b2) indicate mistake marks using blue box and arrow.
Table 1. Number of annotated instances of each part in the dataset.
Part                 | Number of Instances
Indoor fire images   | 489
Fire in other scenes | 2482
Table 2. Details of the structural components for the DeepLab series.
Structure        | DeepLabv1 | DeepLabv2 | DeepLabv3 | DeepLabv3+
Backbone         | VGG-16    | ResNet    | ResNet    | Xception
Atrous Conv      | ✓         | ✓         | ✓         | ✓
CRF              | ✓         | ✓         | ×         | ×
ASPP             | ×         | ASPP      | ASPP+     | ASPP+
Encoder–decoder  | ×         | ×         | ×         | ✓
Table 3. Experimental settings.
Setting               | Value
Batch size            | 4
Momentum              | 0.9
Initial learning rate | 0.01
Weight decay          | 0.0005
Crop size             | 512 × 512
Table 4. Accuracy comparison results among various semantic segmentation models (best results are in bold).
Model               | aAcc   | mAcc   | Acc.background | Acc.fire
FCN (2015)          | 0.8996 | 0.8908 | 0.9078         | 0.8738
PSPNet (2017)       | 0.9086 | 0.8905 | 0.9256         | 0.8554
DeepLabv3 (2017)    | 0.9100 | 0.8909 | 0.9276         | 0.8542
The proposed scheme | 0.9153 | 0.8967 | 0.9327         | 0.8607
Table 5. IoU comparison results among various semantic segmentation models (best results are in bold).
Model               | mIoU   | IoU.background | IoU.fire
FCN (2015)          | 0.7752 | 0.8727         | 0.6777
PSPNet (2017)       | 0.7890 | 0.8848         | 0.6933
DeepLabv3 (2017)    | 0.7914 | 0.8866         | 0.6961
The proposed scheme | 0.8018 | 0.8930         | 0.7106
