Article

Real-Time Early Indoor Fire Detection and Localization on Embedded Platforms with Fully Convolutional One-Stage Object Detection

1 School of Mechanical Engineering and Rail Transit, Changzhou University, Changzhou 213164, China
2 Ministry of Industry and Information Technology, CEPREI, Guangzhou 510610, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(3), 1794; https://doi.org/10.3390/su15031794
Submission received: 16 November 2022 / Revised: 27 December 2022 / Accepted: 16 January 2023 / Published: 17 January 2023

Abstract

Fire disasters usually cause significant damage to human lives and property. Thus, early fire detection and localization in real time are crucial for minimizing fire disasters and reducing ecological losses. Studies of convolutional neural networks (CNNs) show their capabilities in image processing tasks such as image classification, visual recognition, and object detection, and using CNNs for fire detection can improve detection accuracy. However, achieving high accuracy with CNNs is computationally expensive and results in a large model size, which makes it difficult to deploy the trained models to resource-constrained edge devices and to process video in real time. This paper develops a real-time early indoor fire-detection and -localization system that can be deployed on embedded platforms such as the Jetson Nano. First, we propose a fully convolutional one-stage object detection framework for fire detection in real-time surveillance videos; the framework combines a backbone, a path aggregation network, and a detection head trained with generalized focal loss. We evaluate several networks as backbones and select the one with the best balance of efficiency and accuracy. We then develop a fire localization strategy that locates the fire with two cameras in an indoor setting. Results show that the proposed architecture achieves accuracy similar to the YOLO framework with one-tenth of the model size, and that the localization error is within 0.7 m.

1. Introduction

Fire accidents are among the most common disasters. Indoor and outdoor fire disasters usually cause significant damage to human lives and property. Since fire is one of the most severe types of accident, there is a constant need to improve fire detection capabilities, and precise, fast, and portable solutions for detecting and localizing fires have received increasing attention. Previous fire detection approaches mainly use physical sensors, such as smoke, thermal, and flame sensors, to detect fires. However, these sensors can cause false alarms because the thresholds that trigger the alarms are difficult to set. In addition, the response time of these sensors depends heavily on the sampling rate of the discrete signals, so a low sampling rate causes considerable delays in raising the fire alarm.
Recently, there has been growing interest in using image processing and computer vision techniques to detect fires, and these techniques are proving to be reliable and robust. Among them, convolutional neural networks (CNNs) are considered one of the most promising artificial intelligence approaches for image classification and segmentation [1]. Using CNNs to extract deep static features of fire has dramatically improved the accuracy of fire detection [2,3,4]. However, a CNN requires a high computational cost to achieve high accuracy, which results in a large model size and a low video processing rate, so deploying such a model on resource-constrained embedded devices for real-time fire detection is challenging. Moreover, most existing research focuses on improving detection accuracy, while the fire regions cannot be determined under the current architectures. Determining the fire regions would help robots locate the fire and fight it at an early stage.
To address these issues, this paper develops a real-time early indoor fire-detection and -localization system deployed on embedded platforms such as the Raspberry Pi and Jetson Nano. Inspired by the fully convolutional one-stage (FCOS) object detection model, we propose a fire detection CNN architecture for real-time surveillance videos that combines a backbone, a path aggregation network, and a detection head trained with generalized focal loss. Four networks are evaluated for efficiency and accuracy as backbones of the architecture, namely EfficientNet [5], ShuffleNet [6], RepVGG [7], and the Cross Stage Partial Network (CSPNet) [8]. Based on the fire segmentation result, we develop a fire localization strategy that uses two cameras in an indoor setting to locate the fire. Results show that the proposed architecture achieves accuracy similar to the YOLO framework with one-tenth of the model size, and the localization error is shown to be within 0.7 m.
This paper is organized as follows. Section 2 discusses previous research on CNN-based fire detection. Section 3 provides a detailed explanation of our proposed architecture for detecting and localizing fires. Section 4 describes the real-world implementation of the fire-detection and -localization system deployed on the Jetson Nano platform, and Section 5 discusses the results. Section 6 concludes the paper and discusses future work.

2. Related Work

CNNs have yielded state-of-the-art performance in image classification and other computer vision tasks. Initially, researchers mainly focused on using CNN architectures to detect fires. Muhammad et al. first proposed a CNN architecture based on SqueezeNet for fire detection, which used smaller convolutional kernels and contained no dense, fully connected layers [2]. Compared with AlexNet, their network had a much smaller size and could detect fire at an early stage at four frames per second and a resolution of 320 × 240, with an 8.87% false positive rate and 94.50% accuracy. Based on this, they proposed an adaptive prioritization mechanism for the surveillance system [9]. Kim and Lee proposed using a faster region-based CNN to detect suspected fire regions with spatial features and then using long short-term memory (LSTM) to classify whether there was a fire [3]; the decisions over successive video sequences were then combined for the final decision. Hashemzadeh et al. proposed using a robust Imperialist Competitive Algorithm to detect all candidate fire regions in a scene and then analyzing the characteristics of the fire with the motion intensity rate [4]. Saeed et al. proposed using both sensor data and image data for fire detection with AdaBoost and neural network models [10]. Foggia et al. proposed combining color, shape variation, and motion analysis for fire detection [11]. Antony et al. proposed evaluating multiple experts, such as color, movement, and shape features, to detect fire and smoke in videos acquired by surveillance cameras in real time [12]. Xie et al. proposed using both dynamic features and deep static features of fire for fire detection [13], with an adaptive, lightweight convolutional neural network as the detection model. Zhang et al. proposed using a joined deep CNN to detect fires and a fine-grained patch classifier to detect the precise location of fire patches [14]; the approach achieved 97% and 90% detection accuracy on the training and testing datasets, respectively. Guo et al. proposed using the Faster R-CNN model to detect flames in noisy images of the fire ground [15], obtaining a precision of up to 99.8% at 1.4 FPS. However, these approaches required expensive computation for training and real-time detection and were not deployable on resource-constrained devices.
To address the issue that fires cannot be detected accurately in extreme outdoor weather, Muhammad et al. also proposed an efficient CNN-based system for fire detection in videos captured in uncertain surveillance scenarios with smoke, fog, and snow [16]. Khan et al. proposed using deep convolutional neural networks with the VGG-16 architecture for early smoke detection in normal and foggy IoT environments [17]. Results showed that the smoke detection accuracy and time efficiency were better than those of computationally expensive networks such as GoogLeNet and AlexNet.
To achieve a good tradeoff among accuracy, model size, and speed, Li et al. proposed using a multiscale feature extraction mechanism to capture fire-like objects and a channel attention mechanism to selectively emphasize the contributions of different feature maps [18]. Their method achieved 95.3% accuracy, outperforming the suboptimal method by 2.5%, while being 3.76% faster on the GPU and 63.64% smaller in model size. Bari et al. fine-tuned pre-trained InceptionV3 and MobileNetV2 models with transfer learning on a curated dataset [19] and showed that transfer-learned models performed much better than fully trained models when trained on a limited dataset. Jadon et al. proposed FireNet, which can be deployed on embedded platforms such as the Raspberry Pi. However, FireNet was trained on a standard fire dataset, which is not suitable for early fire detection, and it does not support fire localization.

3. Methodology

The CNN-based fire detection approaches mentioned in the previous section mainly fine-tune existing CNNs such as GoogLeNet, SqueezeNet, VGG, and MobileNet. The major drawback of these approaches is their large number of layers and large model size, which makes it difficult to deploy the trained models on resource-constrained embedded devices such as the Raspberry Pi or Jetson Nano at a reasonable frame rate. Moreover, existing approaches do not support localizing fire regions in real time. A unified framework that detects and locates the fire in real time is therefore helpful.
Therefore, we propose a cost-effective fire-detection and -localization framework that can be deployed to embedded platforms to detect and localize indoor fires in real time. We first discuss the overall architecture of the system and then describe the fire detection module and fire localization module in detail.

3.1. Architecture

The architecture of the proposed framework is shown in Figure 1. Two cameras are placed inside a room to record real-time videos, which are processed by the embedded platform; two cameras are needed to determine the three-dimensional coordinates of the fire in the room. In this paper, we use the Jetson Nano as the platform to detect and localize fires. Before real-time fire detection and localization, we first train the CNN model and calibrate the cameras. The video frames are then passed into the fire detection module and the fire localization module to calculate the coordinates of the fire in the room. Finally, the system raises the alarm if there is a fire in the room and outputs the actual coordinates of the fire to the firefighting robots so that they can put out the fire.
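The end-to-end flow just described can be summarized as a short control loop. The sketch below is a hypothetical outline of that loop, not the authors' implementation; `detector`, `localizer`, `alarm`, and `robot` are placeholder objects standing in for the detection module (Section 3.2), the localization module (Section 3.3), and the downstream actuators.

```python
# Hypothetical outline of the per-frame detection-and-localization loop.
# cam_left / cam_right are expected to behave like cv2.VideoCapture objects.

def run_system(detector, localizer, cam_left, cam_right, alarm, robot):
    """Read frames from the two calibrated cameras, detect fire, report coordinates."""
    while True:
        ok_l, frame_l = cam_left.read()
        ok_r, frame_r = cam_right.read()
        if not (ok_l and ok_r):
            break
        # The detector returns fire bounding boxes in pixel coordinates (possibly empty).
        boxes_l = detector(frame_l)
        boxes_r = detector(frame_r)
        if boxes_l and boxes_r:
            # Map the matched pixel detections to 3D room coordinates (Section 3.3).
            xyz = localizer(boxes_l[0], boxes_r[0])
            alarm.trigger()          # raise the fire alarm
            robot.send_target(xyz)   # hand the coordinates to the firefighting robot
```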

3.2. Fire Detection Framework

Inspired by the fully convolutional one-stage object detection model [20], we propose a CNN architecture for fire detection and segmentation. Figure 2 shows the architecture, which mainly contains three parts: the backbone, the path aggregation network, and the detection head. The backbone is used as a feature-extracting network; in the training process, we use a pre-trained backbone to extract features from the input images. The features are then passed into the path aggregation network, which aggregates features from different layers. Finally, the aggregated features are passed into the detection head to predict bounding boxes and segmentation masks. The detection head comprises two parts: the classification head, which predicts the class of the object (fire or background), and the regression head, which predicts the bounding boxes of the fire. A segmentation head predicts the segmentation masks of the fire, and these masks are used to calculate the coordinates of the fire. In object detection tasks, pre-defined anchor boxes have been used in frameworks such as Faster R-CNN [21], YOLO [22], and SSD [23]. Anchor-free methods have been proposed to overcome the limitations of anchor-based methods; in anchor-free methods, the bounding boxes are predicted directly from the feature maps. In this paper, we use the fully convolutional one-stage approach to predict the bounding boxes, which reduces the number of hyper-parameters and the computational complexity.
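To make the anchor-free prediction concrete, the sketch below shows how an FCOS-style head can decode per-location distance predictions (left, top, right, bottom) on a stride-s feature map back into image-space boxes. It is an illustrative reimplementation of the standard FCOS decoding step, not the code used in this paper.

```python
import numpy as np

def decode_fcos_boxes(ltrb, stride, score, score_thresh=0.3):
    """Decode per-location (l, t, r, b) distance predictions into boxes.

    ltrb:   (H, W, 4) distances from each location to the four box sides, in pixels.
    stride: feature-map stride relative to the input image (e.g. 8, 16, or 32).
    score:  (H, W) classification score for the 'fire' class.
    """
    h, w = score.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each feature-map location maps back to the center of its cell in the input image.
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    x1 = cx - ltrb[..., 0]
    y1 = cy - ltrb[..., 1]
    x2 = cx + ltrb[..., 2]
    y2 = cy + ltrb[..., 3]
    keep = score > score_thresh
    boxes = np.stack([x1[keep], y1[keep], x2[keep], y2[keep]], axis=-1)
    return boxes, score[keep]
```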
First, we evaluate several networks, namely EfficientNet, ShuffleNet, RepVGG, and CSPNet, as the feature-extracting backbone. The backbone encodes the network's input into a specific feature representation, a technique commonly used in image segmentation and object detection tasks. The four candidate backbones are as follows:
EfficientNet: With the fast development of embedded systems, CNNs are nowadays commonly developed at a fixed resource budget and then scaled up for better accuracy when more resources are available. To identify the effect of the network depth d, width w, and resolution r on model accuracy, Tan and Le formalized this optimization problem as shown in Equation (1) [5]. Here N denotes the CNN, which can be represented as a list of composing layers. The optimization problem seeks the maximum accuracy the CNN architecture N̂ can achieve under the target memory and target FLOPS constraints, and a neural architecture search is used to find the best EfficientNet architecture. This paper uses efficient_lite0 as the backbone model and uses the second, fourth, and sixth stages as outputs to the Feature Pyramid Network (FPN).
$$\max_{d, w, r} \; \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \quad \text{s.t.} \quad \mathcal{N}(d, w, r) = \hat{\mathcal{N}}, \quad \mathrm{Memory}(\mathcal{N}) \le \mathrm{target\_memory}, \quad \mathrm{FLOPS}(\mathcal{N}) \le \mathrm{target\_flops} \tag{1}$$
ShuffleNet: ShuffleNet utilizes two new operations, pointwise group convolution and channel shuffle, to reduce the computation cost while maintaining model accuracy. The architecture is mainly composed of a stack of ShuffleNet units grouped into three stages. The image is first fed into a convolution layer with a stride of 2; the output is then fed into the following stages, with the output channels doubled at each stage. Results show that ShuffleNet can achieve better model accuracy than other models on ARM platforms [6].
RepVGG: RepVGG has a VGG-style architecture in which every layer takes the output of its only preceding layer as input and feeds its output into its only following layer. The model uses 3 × 3 convolutions and ReLU. There are five stages in RepVGG; the output of each stage is fed into the next, and the output channels are doubled at each stage.
CSPNet: CSPNet is a backbone that reduces heavy inference computation by integrating feature maps from the beginning and the end of a network stage [8]. CSPNet uses ResBlocks [24] while skipping part of the gradient computation across each stage. Table 1 shows the custom CSPNet configuration used in this paper.
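As a rough illustration of the cross-stage-partial idea (only part of the feature map passes through the residual blocks, while the rest takes a shortcut before the two paths are merged), a minimal PyTorch-style sketch is given below. It mirrors the layer types of Table 1 but is not the exact custom module used in the paper; the channel split and activation choices are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block used inside a CSP stage."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)

class CspBlock(nn.Module):
    """Cross-stage-partial stage: one path through ResBlocks, one shortcut path."""
    def __init__(self, in_ch, out_ch, num_res_blocks, stride):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        half = out_ch // 2
        self.split_res = nn.Conv2d(out_ch, half, 1, bias=False)    # path with ResBlocks
        self.split_skip = nn.Conv2d(out_ch, half, 1, bias=False)   # shortcut path
        self.blocks = nn.Sequential(*[ResBlock(half) for _ in range(num_res_blocks)])
        self.merge = nn.Conv2d(out_ch, out_ch, 1, bias=False)

    def forward(self, x):
        x = self.down(x)
        a = self.blocks(self.split_res(x))
        b = self.split_skip(x)
        # Concatenating the two paths limits duplicated gradient flow across the stage.
        return self.merge(torch.cat([a, b], dim=1))
```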
Different layers of the backbone output are then fed into the FPN, which produces proportionally sized feature maps for the different layers in a fully convolutional fashion. The feature pyramid network builds feature pyramids inside the deep convolutional network and can be used in object-detection tasks. This paper uses the Path Aggregation Network (PAN) [25] as the FPN to build the feature pyramids. The construction of PAN mainly involves two pathways. One is the bottom-up pathway, which computes a feature hierarchy consisting of feature maps at several scales.
The other is the top-down pathway, which hallucinates higher-resolution features by upsampling spatially coarser feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections; each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways. The configurations used in this paper, such as the output stages, activation functions, and PAN input and output channels, are listed in Table 2.
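The sketch below illustrates this bidirectional aggregation over the three backbone levels; the channel counts follow Table 2 (three input levels projected to 96 channels). It is a simplified version that uses interpolation for resizing instead of strided convolutions and is not the exact PAN module used in the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimplePAN(nn.Module):
    """Simplified top-down + bottom-up feature aggregation over three levels."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=96):
        super().__init__()
        # 1x1 lateral convs project each backbone level to a common channel width.
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])

    def forward(self, feats):
        # feats: [P3, P4, P5], ordered from high resolution to low resolution.
        maps = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample coarser maps and merge via lateral connections.
        for i in range(len(maps) - 1, 0, -1):
            up = F.interpolate(maps[i], size=maps[i - 1].shape[-2:], mode="nearest")
            maps[i - 1] = maps[i - 1] + up
        # Bottom-up pathway: resize finer maps down and merge again.
        for i in range(len(maps) - 1):
            down = F.interpolate(maps[i], size=maps[i + 1].shape[-2:], mode="nearest")
            maps[i + 1] = maps[i + 1] + down
        return maps  # three 96-channel maps fed to the detection head
```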
In image classification and localization, focal loss is usually used to measure classification accuracy, and the Dirac delta distribution is used to model the box locations in object detection. Li et al. proposed the Generalized Focal Loss (GFL) to learn a joint representation of localization quality and classification [26]. Focal loss is typically used for one-stage classification and supports only discrete labels such as 0 and 1, whereas the joint representation of localization quality and classification takes continuous values from 0 to 1. This paper uses the two forms of GFL, namely Quality Focal Loss (QFL) and Distribution Focal Loss (DFL), as the loss functions to train the model.
GFL uses a float target $y \in [0, 1]$ in place of the standard one-hot category label. If $y = 0$, the sample is a negative (background) sample; if $0 < y \le 1$, the sample is a positive sample with quality score $y$. The focal loss (FL) consists of two parts: the cross-entropy part $-\log(p_t)$ and a dynamic scaling factor $(1 - p_t)^{\gamma}$. QFL extends FL to support the joint representation of localization quality and classification, as shown in Equation (2). Here $-\log(p_t)$ is extended to $-\big((1 - y)\log(1 - \sigma) + y\log(\sigma)\big)$, where $\sigma$ is the output of multiple binary classifications with the sigmoid operator, and $(1 - p_t)^{\gamma}$ is extended to $|y - \sigma|^{\beta}$, the absolute distance between the estimate $\sigma$ and the continuous label $y$. In this paper, we use $\beta = 2$ to train the model.
$$\mathrm{QFL}(\sigma) = -\,|y - \sigma|^{\beta} \big( (1 - y)\log(1 - \sigma) + y\log(\sigma) \big) \tag{2}$$
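Equation (2) translates almost directly into code. The sketch below is written from the formula with β = 2; it is not taken from the authors' training code, and normalization of the loss (e.g. by the number of positive samples) is omitted.

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(logits, targets, beta=2.0):
    """Quality Focal Loss for a single 'fire' class.

    logits:  raw classification outputs (any shape).
    targets: continuous labels y in [0, 1] with the same shape
             (0 for background, otherwise the localization quality score).
    """
    sigma = torch.sigmoid(logits)
    # Cross-entropy part -((1 - y) * log(1 - sigma) + y * log(sigma)),
    # computed in a numerically stable way from the raw logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Modulating factor |y - sigma|^beta focuses training on hard examples.
    scale = (targets - sigma).abs().pow(beta)
    return (scale * ce).sum()
```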
Standard bounding box regression treats the target as a Dirac delta distribution of the relative offsets from the location of the object to the four sides of the bounding box. DFL instead learns a general discrete distribution over these offsets. First, the continuous domain is discretized into $\{y_0, y_1, \ldots, y_n\}$; the estimated regression value is then $\hat{y} = \sum_{i=0}^{n} P(y_i)\, y_i$ with $\sum_{i=0}^{n} P(y_i) = 1$. The DFL over the two probabilities adjacent to the label is shown in Equation (3). As seen from the equation, DFL forces the network to rapidly focus on values near the label by enlarging the probabilities of $y_i$ and $y_{i+1}$, the two discretized values closest to the label $y$.
$$\mathrm{DFL}\big(P(y_i), P(y_{i+1})\big) = -\big( (y_{i+1} - y)\log P(y_i) + (y - y_i)\log P(y_{i+1}) \big) \tag{3}$$
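Equation (3) likewise reduces to a weighted cross-entropy over the two discrete bins adjacent to the continuous target. The sketch below assumes the bins are the integers 0..n, as in the GFL paper; it is written from the formula rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(logits, target):
    """Distribution Focal Loss for one box side per sample.

    logits: (N, n+1) unnormalized scores over the bins y_0..y_n (assumed to be 0..n).
    target: (N,) continuous regression targets y with 0 <= y <= n.
    """
    n_bins = logits.size(1)
    y_left = target.floor().clamp(max=n_bins - 2).long()   # index i of bin y_i
    y_right = y_left + 1                                    # index i+1 of bin y_{i+1}
    w_left = y_right.float() - target                       # weight (y_{i+1} - y)
    w_right = target - y_left.float()                       # weight (y - y_i)
    log_probs = F.log_softmax(logits, dim=-1)
    loss = -(w_left * log_probs.gather(1, y_left.unsqueeze(1)).squeeze(1)
             + w_right * log_probs.gather(1, y_right.unsqueeze(1)).squeeze(1))
    return loss.mean()
```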
For the detection head, we therefore use QFL and DFL as the loss functions. The other configurations are shown in Table 3: the number of input channels and feature channels is set to 96, the strides are [8, 16, 32], and batch normalization is used as the normalization approach.

3.3. Fire Localization Framework

The fire detection framework trains a model that is deployed on the embedded system for real-time surveillance video processing; the inference module calculates the position of the fire in the current frame in pixels. Next, we propose a two-step real-world fire localization framework that maps the position of the fire in the two surveillance video frames to a location in the real-world setting.
Figure 3 shows our two-step framework for real-time fire localization. The first step is camera resectioning and geometric camera calibration. Camera resectioning estimates the parameters of a pinhole camera model that approximates the camera which produced a given photograph or video. Single-camera resectioning yields the camera's intrinsic parameters, such as the focal length and principal point; we use $f_x$ and $f_y$ to denote the focal length in pixels and $c_x$ and $c_y$ to denote the principal point. After single-camera resectioning, we perform stereo calibration to obtain the extrinsic parameters, which describe the coordinate transformation from 3D world coordinates to 3D camera coordinates. Here $R$ is the rotation matrix, which performs a rotation in Euclidean space, and $T$ is the position of the origin of the world coordinate system expressed in the camera-centered coordinate system; $R$ is a 3 × 3 matrix and $T$ is a 3 × 1 vector. Let $u$ and $v$ be the position of the fire in the video in pixels, and $X$, $Y$, and $Z$ be the location of the fire in the real-world setting. The relationship between $(u, v)$ and $(X, Y, Z)$ satisfies Equation (4), where $s$ is the projective transformation's arbitrary scaling.
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \left( R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + T \right) \tag{4}$$
Furthermore, actual lenses usually exhibit radial and tangential distortion. After single-camera resectioning and calibration of the two cameras, we compute the rectification transforms for the calibrated stereo pair, which gives us the radial and tangential distortion coefficients. These coefficients make the localization framework more accurate.
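This calibration step can be carried out with standard OpenCV routines. The sketch below shows one plausible way to obtain the intrinsic matrices, distortion coefficients, and extrinsic (R, T) pair from matched calibration-pattern corners; the exact procedure used by the authors is not specified, so treat this as an illustrative reference.

```python
import cv2

def calibrate_stereo_pair(obj_points, img_points_l, img_points_r, image_size):
    """Step 1 of the localization framework: intrinsics, extrinsics, distortion.

    obj_points:   list of (N, 3) float32 arrays of 3D calibration-pattern corners.
    img_points_l: matching (N, 2) pixel corners detected by the first camera.
    img_points_r: matching (N, 2) pixel corners detected by the second camera.
    image_size:   (width, height) of the calibration images.
    """
    # Single-camera resectioning: focal lengths (fx, fy), principal point (cx, cy),
    # and radial/tangential distortion coefficients for each camera.
    _, K_l, dist_l, _, _ = cv2.calibrateCamera(obj_points, img_points_l, image_size, None, None)
    _, K_r, dist_r, _, _ = cv2.calibrateCamera(obj_points, img_points_r, image_size, None, None)

    # Stereo calibration: rotation R and translation T between the two cameras.
    _, K_l, dist_l, K_r, dist_r, R, T, _, _ = cv2.stereoCalibrate(
        obj_points, img_points_l, img_points_r,
        K_l, dist_l, K_r, dist_r, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)

    # Rectification transforms; P_l and P_r are 3x4 projection matrices that can be
    # used later for triangulating fire positions.
    _, _, P_l, P_r, _, _, _ = cv2.stereoRectify(K_l, dist_l, K_r, dist_r, image_size, R, T)
    return K_l, dist_l, K_r, dist_r, R, T, P_l, P_r
```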
The first step of the fire-localization framework thus calibrates the two cameras. In the second step, we use an anchor point to calculate the coordinates of the fire relative to that point. First, we choose a fixed anchor point, find its pixel position, denoted $(u_a, v_a)$, and calculate its real-world coordinates $(X_a, Y_a, Z_a)$. The inference module of the trained model detects and localizes fire in each frame, producing the pixel region $([u_{f_{min}}, u_{f_{max}}], [v_{f_{min}}, v_{f_{max}}])$, which is then converted to the real-world coordinates $([X_{f_{min}}, X_{f_{max}}], [Y_{f_{min}}, Y_{f_{max}}], [Z_{f_{min}}, Z_{f_{max}}])$. The relative coordinates of the fire are given by Equation (5).
$$\big([X_{f_{min}} - X_a,\, X_{f_{max}} - X_a],\ [Y_{f_{min}} - Y_a,\, Y_{f_{max}} - Y_a],\ [Z_{f_{min}} - Z_a,\, Z_{f_{max}} - Z_a]\big) \tag{5}$$
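Given the projection matrices from the calibration step and a fire box detected in each camera's frame, the second step amounts to triangulating the box corners and subtracting the anchor coordinates, as in Equation (5). The sketch below is a hypothetical implementation that assumes the pixel coordinates have already been undistorted into the rectified image frames.

```python
import cv2
import numpy as np

def fire_relative_coordinates(P_l, P_r, box_l, box_r, anchor_xyz):
    """Step 2: map matched pixel detections to coordinates relative to the anchor.

    P_l, P_r:    3x4 projection matrices of the two (rectified) cameras.
    box_l/box_r: (u_min, v_min, u_max, v_max) fire box from each camera's frame,
                 assumed to be expressed in undistorted/rectified pixel coordinates.
    anchor_xyz:  (X_a, Y_a, Z_a) real-world coordinates of the fixed anchor point.
    """
    # Triangulate the two opposite corners of the fire region.
    pts_l = np.float64([[box_l[0], box_l[2]], [box_l[1], box_l[3]]])  # 2 x 2 (u; v)
    pts_r = np.float64([[box_r[0], box_r[2]], [box_r[1], box_r[3]]])
    hom = cv2.triangulatePoints(P_l, P_r, pts_l, pts_r)   # 4 x 2 homogeneous points
    xyz = (hom[:3] / hom[3]).T                            # 2 x 3 real-world corners
    lo, hi = xyz.min(axis=0), xyz.max(axis=0)
    anchor = np.asarray(anchor_xyz, dtype=np.float64)
    # Equation (5): coordinate intervals of the fire region relative to the anchor.
    return lo - anchor, hi - anchor
```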

4. Experiment Setup

4.1. Hardware Setup

The Jetson Nano 2 GB is a compact, low-cost embedded computer that is nevertheless powerful enough for object detection. It has a 128-core Maxwell GPU and a quad-core ARM A57 CPU running at 1.43 GHz. We use the 2 GB version of the Jetson Nano, which provides 64-bit LPDDR4 memory with 25.6 GB/s bandwidth. The video encoder of the Jetson Nano 2 GB supports nine 720p streams at 30 frames per second (fps), and the decoder supports eighteen 720p streams at 30 fps. We use two cameras in the system, one CSI camera and one USB camera, as the Jetson Nano has only one CSI-2 connector. First, the two cameras are calibrated. Then we train the model on a local machine, deploy the trained model to the Jetson Nano, and run it as a stand-alone application. The application processes the frames in real time and outputs the real-world coordinates of the fire if it detects any small flames.
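On the Jetson Nano, the CSI camera is typically opened through a GStreamer pipeline while the USB camera is opened by device index. The snippet below shows one common way to do this with OpenCV; the pipeline parameters (resolution, frame rate) and the USB index are illustrative, and OpenCV must be built with GStreamer support.

```python
import cv2

# A typical GStreamer pipeline for the Jetson Nano CSI camera (parameters illustrative).
CSI_PIPELINE = (
    "nvarguscamerasrc ! video/x-raw(memory:NVMM), width=1280, height=720, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink"
)

def open_cameras(usb_index=1):
    """Open the CSI camera through GStreamer and the USB camera through its device index."""
    csi_cam = cv2.VideoCapture(CSI_PIPELINE, cv2.CAP_GSTREAMER)
    usb_cam = cv2.VideoCapture(usb_index)
    if not (csi_cam.isOpened() and usb_cam.isOpened()):
        raise RuntimeError("Failed to open one of the cameras")
    return csi_cam, usb_cam
```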

4.2. Dataset Description

Some fire-detection datasets exist in previous works [2,9]. However, these datasets contain outdoor wildfires, which do not suit our early indoor fire detection requirement. The dataset provided in [11] contains 31 videos; even so, the quantity of data is not sufficient to train a deep neural network, and a network trained on it is expected to perform poorly in realistic fire detection scenarios. In order to train a model that can detect early fires in indoor scenarios, we accumulated as many realistic images as possible. We collected images containing small flames in indoor scenarios such as labs and rooms so that the trained model could detect fires at an early stage. Our complete training dataset consists of 92 images with labels giving the position of the fires as XML files. We also collected 20 images, with fire positions, as a test dataset. Figure 4 shows a few sample images from the training and test datasets.
During the training process, we first pre-processed the collected images. All images were scaled to 320 × 320 as the input to the deep neural network. We then applied several typical image transformations: the images were flipped vertically with a probability of 0.5, the brightness was set to 0.4, and the contrast and saturation were set to 0.7. Finally, we normalized the images.
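One way to express this pre-processing with torchvision is shown below. The resize, vertical flip, and color values are taken from the description above (interpreting the brightness/contrast/saturation settings as jitter strengths), while the normalization statistics are an assumption, since the paper only states that the images were normalized.

```python
import torchvision.transforms as T

train_transforms = T.Compose([
    T.Resize((320, 320)),                    # scale all images to the network input size
    T.RandomVerticalFlip(p=0.5),             # vertical flip with probability 0.5
    T.ColorJitter(brightness=0.4,            # jitter brightness, contrast, and saturation
                  contrast=0.7,
                  saturation=0.7),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics (assumed)
                std=[0.229, 0.224, 0.225]),
])
```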

5. Results and Discussion

5.1. Backbone Model Evaluation

As mentioned in Section 3, we used the fully convolutional one-stage architecture to construct the CNN model for fire detection and localization, with the same PAN architecture and detection head across all backbones. In the experiment, we compared our approach, with four different backbones, against the typical YOLO v3 framework for indoor fire detection. We examined four evaluation metrics: the QFL loss, the DFL loss, AP50, and the model size. AP50 is one of the most commonly used evaluation metrics for object detection; it is the average precision of the model at an IoU threshold of 0.5.
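For reference, the matching rule behind AP50 can be written in a few lines: a predicted box counts as a true positive when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, and AP50 is the area under the precision-recall curve built from these matches. The helper below is a generic illustration, not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction is a true positive for AP50 when iou(prediction, ground_truth) >= 0.5.
```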
Table 4 shows the training loss, average precision, and model size of the fully convolutional one-stage framework with different backbones and of YOLO v3. The ShuffleNet backbone produced the smallest model, at about 7.7 MB. EfficientNet and CSPNet had similar model sizes, roughly three times that of ShuffleNet, while RepVGG had the largest model size, about twice that of EfficientNet and CSPNet. Despite the differences in model size, all the backbones had similar training losses and average precision. Moreover, compared with YOLO v3, the fully convolutional one-stage framework had a much smaller model size but similar training loss and average precision; with the EfficientNet and ShuffleNet backbones, the model size was one-tenth or less of that of YOLO v3, making these models more suitable for deployment on edge devices.

5.2. Video Processing Performance Evaluation

We then deployed the model as a stand-alone application on the Jetson Nano to evaluate the real-time fire-detection and -localization performance. Figure 5 shows a sample output frame of the application. To compare the performance of the different backbone models, we recorded the network forward time, frame decode time, fps, CPU usage, memory usage, swap memory, and the percentage of EMC memory bandwidth in use relative to the current running frequency.
Table 5 shows the recorded parameters for the models trained with the different backbone networks. The forward time and decode time are around 0.04-0.07 s and 0.01-0.02 s, respectively. Moreover, the real-time fire-detection and -localization pipeline achieves about 14 fps, which is better than the existing work in [9]. It can therefore be concluded that the fully convolutional one-stage architecture achieves good early fire-detection and -localization performance regardless of the choice of backbone. The choice of backbone does, however, lead to different resource usage. All the backbones have similar memory and swap memory usage, but ShuffleNet and CSPNet use more CPU for real-time video processing. The ShuffleNet model has the lowest percentage of EMC memory bandwidth in use, while the RepVGG model has the highest.
Another observation concerns the detection accuracy of the four backbones. Although RepVGG has the largest model size when deployed to the Jetson Nano, its model cannot detect some small flames in real scenarios. ShuffleNet has the smallest model size for deployment on low-cost embedded devices, but it also fails to detect some small flames at an early stage. We observe that the custom CSPNet can detect small flames even under extreme scenarios, and its model size remains suitable for deployment on resource-constrained embedded devices.

5.3. Real-World Fire Localization

In the final experiment, we tested the localization accuracy of the proposed framework. We picked ten locations around the anchor point and used the Jetson Nano with the calibrated cameras to calculate the real-world coordinates of the fire. The results are shown in Table 6. The average localization error was 0.44 m, the maximum localization error was 0.63 m, and the minimum localization error was 0.28 m. The table also shows that the farther the test location was from the anchor point, the larger the error. Nevertheless, the localization error remained within roughly 0.7 m, which is acceptable because the firefighting robot can cover the fire within this range.

6. Conclusions and Future Work

This work presents a real-time indoor fire-detection and -localization system that can be deployed on embedded platforms such as the Jetson Nano. The aim is a lightweight, fast-response system that detects indoor fires at an early stage and informs the firefighting robot of the real-world coordinates of the fire. The fire detection model is trained with the fully convolutional one-stage object detection architecture, and we evaluate four lightweight networks as its backbone. Results show that although ShuffleNet and EfficientNet have lower memory usage, the model trained with the custom CSPNet backbone detects small flames more reliably, even under extreme scenarios, while remaining small enough for embedded deployment. The path aggregation network and detection head, with QFL and DFL as loss functions, are also used to improve fire detection accuracy. Based on the positions of the fire in the frames of the two surveillance videos, we also design a two-step real-world fire localization framework that maps these positions to a location in the real-world setting. Results show that the proposed framework achieves accuracy similar to the YOLO framework with one-tenth of the model size, and the localization error is shown to be within 0.7 m. The system performs well, with low latency and high detection and localization accuracy.
In this paper, we mainly calculate the coordinates of the fire relative to an anchor point, and we assume that no indoor fires are permitted in our scenarios. It can therefore be challenging to distinguish arson from safe fires such as candles. Future research could focus on identifying expanding indoor fires as arson while treating controllable fires as safe. Moreover, we will build firefighting robots that use the relative coordinates of the fire to locate it and put it out at an early stage.

Author Contributions

Conceptualization, Y.L. and J.S.; methodology, J.S.; software, Y.L.; validation, M.Y., B.D. and J.Z.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L.; visualization, Y.L.; supervision, Y.L.; project administration, J.S.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Networks
FCOS: Fully Convolutional One-Stage
CSPNet: Cross Stage Partial Network
LSTM: Long Short-Term Memory
FPN: Feature Pyramid Network
GFL: Generalized Focal Loss
QFL: Quality Focal Loss
DFL: Distribution Focal Loss

References

1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
2. Muhammad, K.; Ahmad, J.; Lv, Z.; Bellavista, P.; Yang, P.; Baik, S.W. Efficient deep CNN-based fire detection and localization in video surveillance applications. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 1419–1434.
3. Kim, B.; Lee, J. A video-based fire detection using deep learning models. Appl. Sci. 2019, 9, 2862.
4. Hashemzadeh, M.; Zademehdi, A. Fire detection for video surveillance applications using ICA K-medoids-based color model and efficient spatio-temporal visual features. Expert Syst. Appl. 2019, 130, 60–78.
5. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
6. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
7. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742.
8. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
9. Muhammad, K.; Ahmad, J.; Baik, S.W. Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing 2018, 288, 30–42.
10. Saeed, F.; Paul, A.; Karthigaikumar, P.; Nayyar, A. Convolutional neural network based early fire detection. Multimed. Tools Appl. 2019, 79, 9083–9099.
11. Foggia, P.; Saggese, A.; Vento, M. Real-time fire detection for video-surveillance applications using a combination of experts based on color, shape, and motion. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1545–1556.
12. Antony, J.; Prasad, J. Real Time Fire and Smoke Detection using Multi-Expert System for Video-Surveillance Applications. Int. J. Innov. Res. Sci. Technol. 2016, 3, 203–212.
13. Xie, Y.; Zhu, J.; Cao, Y.; Zhang, Y.; Feng, D.; Zhang, Y.; Chen, M. Efficient video fire detection exploiting motion-flicker-based dynamic features and deep static features. IEEE Access 2020, 8, 81904–81917.
14. Zhang, Q.; Xu, J.; Xu, L.; Guo, H. Deep convolutional neural networks for forest fire detection. In Proceedings of the 2016 International Forum on Management, Education and Information Technology Application, Guangzhou, China, 30–31 January 2016; Atlantis Press: Paris, France, 2016.
15. Guo, J.; Hou, Z.; Xie, X.; Yao, S.; Wang, Q.; Jin, X. Faster R-CNN Based Indoor Flame Detection for Firefighting Robot. In Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1390–1395.
16. Muhammad, K.; Khan, S.; Elhoseny, M.; Ahmed, S.H.; Baik, S.W. Efficient fire detection for uncertain surveillance environment. IEEE Trans. Ind. Inform. 2019, 15, 3113–3122.
17. Khan, S.; Muhammad, K.; Mumtaz, S.; Baik, S.W.; de Albuquerque, V.H.C. Energy-efficient deep CNN for smoke detection in foggy IoT environment. IEEE Internet Things J. 2019, 6, 9237–9245.
18. Li, S.; Yan, Q.; Liu, P. An efficient fire detection method based on multiscale feature extraction, implicit deep supervision and channel attention mechanism. IEEE Trans. Image Process. 2020, 29, 8467–8475.
19. Bari, A.; Saini, T.; Kumar, A. Fire Detection Using Deep Transfer Learning on Surveillance Videos. In Proceedings of the 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), Tirunelveli, India, 4–6 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1061–1067.
20. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
22. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Cham, The Netherlands, 2016; pp. 21–37.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
26. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv 2020, arXiv:2006.04388.
Figure 1. Architecture of the Proposed Real-Time Fire Detection and Localization Framework.
Figure 2. Convolutional Neural Network Architecture for Fire Detection and Segmentation.
Figure 3. Two-Step Fire Localization Framework.
Figure 4. Sample Images from the Training Set and Test Set.
Figure 5. Real-Time Indoor Fire Detection and Localization.
Table 1. Custom CSPNet Architecture.

| Layer    | Input Channels | Output Channels | ResBlock Num | Kernel Size | Stride |
|----------|----------------|-----------------|--------------|-------------|--------|
| Conv     | 3              | 32              | -            | 3           | 2      |
| MaxPool  | -              | -               | -            | 3           | 2      |
| CspBlock | 32             | -               | 1            | 3           | 1      |
| CspBlock | 64             | -               | 2            | 3           | 2      |
| CspBlock | 128            | -               | 2            | 3           | 2      |
| CspBlock | 256            | -               | 2            | 3           | 2      |
Table 2. Network Architectures and Path Aggregation Network Configurations for EfficientNet, ShuffleNet, RepVGG, and CSPNet.

| Network      | Out Stages | Activation Function | PAN In          | PAN Out |
|--------------|------------|---------------------|-----------------|---------|
| EfficientNet | [2, 4, 6]  | ReLU6               | [40, 112, 320]  | 96      |
| ShuffleNet   | [2, 3, 4]  | LeakyReLU           | [48, 96, 192]   | 96      |
| RepVGG       | [2, 3, 4]  | ReLU                | [96, 192, 512]  | 96      |
| CSPNet       | [3, 4, 5]  | LeakyReLU           | [128, 256, 512] | 96      |
Table 3. Detection Head Configurations for EfficientNet, ShuffleNet, RepVGG, and CSPNet.

|               | Input Channel | Feature Channel | Strides     | Norm                |
|---------------|---------------|-----------------|-------------|---------------------|
| Configuration | 96            | 96              | [8, 16, 32] | Batch Normalization |
Table 4. Training Loss, Average Precision and Model Size of Fully Convolutional One-Stage Framework with Different Backbones and Yolo v3.

| Model                                          | Backbone     | QFL_loss | DFL_loss | AP50  | Model Size (MB) |
|------------------------------------------------|--------------|----------|----------|-------|-----------------|
| Fully Convolutional One-Stage Object Detection | EfficientNet | 0.02     | 0.0985   | 0.954 | 25.014          |
|                                                | ShuffleNet   | 0.01     | 0.0983   | 0.962 | 7.729           |
|                                                | RepVGG       | 0.02     | 0.0999   | 0.948 | 52.395          |
|                                                | CSPNet       | 0.03     | 0.0989   | 0.941 | 26.69           |
| Yolo v3                                        | -            | 0.01     | 0.0869   | 0.971 | 237             |
Table 5. Recorded Parameters for Models Trained with Different Backbone Networks.

| Pre-Trained Models | Forward Time | Decode Time | fps   | CPU Usage | Memory | SWAP   | EMC |
|--------------------|--------------|-------------|-------|-----------|--------|--------|-----|
| EfficientNet       | 0.06 s       | 0.02 s      | 14.28 | 49.2%     | 1.96 G | 1.67 G | 27% |
| ShuffleNet         | 0.04 s       | 0.02 s      | 14.28 | 124.4%    | 1.9 G  | 1.55 G | 18% |
| RepVGG             | 0.07 s       | 0.01 s      | 14.28 | 100.7%    | 1.9 G  | 1.58 G | 31% |
| CSPNet             | 0.04 s       | 0.02 s      | 14.28 | 131.7%    | 1.9 G  | 1.5 G  | 21% |
Table 6. The true and predicted coordinates of ten random locations around the anchor point.

|             | True Coordinates | Predicted Coordinates | Error (Meters) |
|-------------|------------------|-----------------------|----------------|
| Location 1  | (1.3, 2.8, 2.1)  | (1, 3, 2)             | 0.37           |
| Location 2  | (2.3, 4.6, 0)    | (2.5, 4.8, 0)         | 0.28           |
| Location 3  | (3.5, 6.2, 1.4)  | (3, 6.6, 1.3)         | 0.648          |
| Location 4  | (1.5, 2.9, 3.3)  | (1.3, 3.2, 3.2)       | 0.37           |
| Location 5  | (6.8, 2, 1.4)    | (6.2, 1.8, 1.4)       | 0.632          |
| Location 6  | (1.2, 1.5, 3.3)  | (1.4, 1.3, 3.3)       | 0.28           |
| Location 7  | (1.5, 1.8, 2)    | (1.7, 1.6, 2.1)       | 0.3            |
| Location 8  | (6.2, 6.1, 1.4)  | (5.8, 5.5, 1.3)       | 0.72           |
| Location 9  | (3.3, 3.1, 2)    | (3, 3.4, 2.1)         | 0.43           |
| Location 10 | (4.3, 4, 1.4)    | (3.9, 4.2, 1.3)       | 0.458          |
