### *3.1. Structural Similarity*

SSIM (Structural Similarity) [49] measures the similarity between an original image and a version distorted by compression or transformation. It is widely used in signal processing because it is more accurate than the Mean Square Error (MSE) method, which only measures the difference between the pixel values of two images. We evaluated the test image (X) against the original image (Y) to quantify their visual similarity. The more similar the test image is to the original image, the closer the value is to 1.0; the more different it is, the closer the value is to 0.0. The SSIM formulas are defined as follows:

$$L(x, y) = \frac{2\mu_x \mu_y + K_1}{\mu_x^2 + \mu_y^2 + K_1} \tag{1}$$

$$M(x, y) = \frac{2\sigma_x \sigma_y + K_2}{\sigma_x^2 + \sigma_y^2 + K_2} \tag{2}$$

$$N(x, y) = \frac{\sigma_{xy} + K_3}{\sigma_x \sigma_y + K_3} \tag{3}$$

where μ*x* and μ*y* are the means of the pixels, σ*x* and σ*y* are the standard deviations, and σ*xy* is the covariance. *K1*, *K2*, and *K3* are constants that prevent the numerator and denominator from becoming zero. *L(x, y)* expresses the brightness (luminance) difference, *M(x, y)* the contrast difference, and *N(x, y)* the similarity of the structural change between x and y. The structural similarity is shown in Equation (4):

$$\text{SSIM} = [L(x, y)]^\alpha [M(x, y)]^\beta [N(x, y)]^\gamma \tag{4}$$

where α, β, and γ represent the importance of each term; all three were set to 1.0 in this paper.
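For illustration, a minimal NumPy sketch of Equations (1)–(4) follows, computed over whole images with α = β = γ = 1. The constants correspond to the common choices K1 = (0.01 × 255)², K2 = (0.03 × 255)², and K3 = K2/2, which are our assumptions rather than values stated above, and the windowed (sliding patch) form used in practice is omitted.

```python
import numpy as np

def ssim_global(x, y, K1=6.5025, K2=58.5225):
    # Global (whole-image) SSIM following Equations (1)-(4) with
    # alpha = beta = gamma = 1 and K3 = K2 / 2 (a common simplification);
    # windowed SSIM would apply this per local patch instead.
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()
    K3 = K2 / 2
    L = (2 * mu_x * mu_y + K1) / (mu_x**2 + mu_y**2 + K1)              # luminance, Eq. (1)
    M = (2 * sigma_x * sigma_y + K2) / (sigma_x**2 + sigma_y**2 + K2)  # contrast, Eq. (2)
    N = (sigma_xy + K3) / (sigma_x * sigma_y + K3)                     # structure, Eq. (3)
    return L * M * N                                                   # Eq. (4)
```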

### *3.2. RGB Color Histogram*

Generally, smoke is grayish (dark gray, gray, light gray, and white). Black smoke is produced by unburned material or combustion at high temperatures, which means that a certain amount of time has passed since the fire started. This paper focuses on early-stage smoke and therefore sets the conditions in Equation (5) to cover smoke colors ranging from gray to white:

$$C = (R + G + B)/3, \quad \tau_1 < C_L < \tau_2, \quad \tau_3 < C_H < \tau_4 \tag{5}$$

where *C* is the output image, and R, G, and B are the red, green, and blue channel images. This research set the lower range *CL* between 80 (τ1) and 150 (τ2), and the upper range *CH* between 180 (τ3) and 250 (τ4). The average image *C* is histogrammed into 256 bins (0 to 255). The values stored in each bin of the histogram are normalized by the input image size, and their sum is obtained, as in Equation (6):

$$H_S = \sum_{i=0}^{255} \frac{b_i}{h \times w} \tag{6}$$

where *HS* is the RGB color histogram result value, *bi* is the count of histogram bin *i* (0 to 255), where only bins within the range of Equation (5) are included, and *h* and *w* are the height and width of the input image. The grayish color is distributed intensively between 80 and 250.
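A minimal sketch of the grayish-color check in Equations (5) and (6); the function name is ours, and the default thresholds follow the τ1–τ4 values stated above.

```python
import numpy as np

def grayish_histogram_ratio(bgr, t1=80, t2=150, t3=180, t4=250):
    # Average the three channels (Equation (5)), histogram into 256 bins,
    # normalize by image size, and sum only the bins inside the two
    # grayish ranges (Equation (6)).
    c = bgr.astype(np.float32).mean(axis=2)          # C = (R + G + B) / 3
    hist, _ = np.histogram(c, bins=256, range=(0, 256))
    hist = hist / c.size                             # normalize by h x w
    return hist[t1:t2 + 1].sum() + hist[t3:t4 + 1].sum()
```

A region with a high returned ratio is a likely early-stage smoke candidate.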

Fire flames are usually bright orange or red (red → orange → yellow → white). This paper used HSV color instead of RGB color for flames. The HSV color range used in the paper is as follows:


As in the smoke color extraction, the average value of the HSV color image is calculated over the filtered range, and the HSV histogram is obtained with Equation (6).
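A minimal OpenCV sketch of the flame color filter; since the HSV range table is not reproduced above, the bounds below are illustrative placeholders only, not the paper's values.

```python
import cv2
import numpy as np

# Placeholder HSV bounds for a red/orange flame hue; the paper's actual
# ranges should be substituted here.
FLAME_LOWER = np.array([0, 80, 150])
FLAME_UPPER = np.array([35, 255, 255])

def flame_hsv_ratio(bgr):
    # Filter the frame to the flame color range and return the fraction of
    # pixels inside it, analogous to the normalized histogram sum of Eq. (6).
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, FLAME_LOWER, FLAME_UPPER)
    return cv2.countNonZero(mask) / mask.size
```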

### *3.3. Coefficient of Variation (CV)*

The coefficient of variation is a statistic that, like the variance and standard deviation, describes the scatter of a sample; specifically, it measures how large the standard deviation is relative to the mean. It is useful for comparing the spread of two datasets and for comparing variability when the scales of the data differ widely. It is used by economists and investors to assess the volatility of economic models and securities, as well as in fields such as engineering and physics for quality assurance research and ANOVA gauge R&R [50].

The coefficient of variation is the standard deviation divided by the mean, as shown in Equation (7):

$$CV = \sigma/m \tag{7}$$

where σ is the standard deviation and *m* is the mean. Regions containing smoke or fire flame showed lower *CV* values, whereas false alarm regions showed higher *CV* values, as shown in Figure 5. This paper adopted the CV as a weighting coefficient of the wavelet transform to remove false alarm cases. For fire flames, we used the R channel of the RGB color space, and for smoke regions we used the Y channel of the YCbCr color space.

**Figure 5.** The resulting coefficient of variation values for detected areas: (**a**) smoke area Coefficient of Variation (CV) value: 1.5 (87%), (**b**) fire area CV value: 1.9 (51%), (**c**) false alarm area CV value: 6.2 (20%), and (**d**) false alarm area CV value: 13.6 (76%).
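A minimal sketch of Equation (7) applied to a detected bounding box, using the channel choices stated above (R for flame, Y for smoke); the helper names are ours.

```python
import cv2
import numpy as np

def coefficient_of_variation(region):
    # CV = sigma / m (Equation (7)): standard deviation relative to mean.
    region = region.astype(np.float64)
    m = region.mean()
    return region.std() / m if m > 0 else 0.0

def region_cv(bgr_roi, kind):
    # Channel choice per the text: R channel for flame regions,
    # Y channel (YCbCr) for smoke regions.
    if kind == "flame":
        channel = bgr_roi[:, :, 2]                                     # R (BGR order)
    else:
        channel = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2YCrCb)[:, :, 0]  # Y
    return coefficient_of_variation(channel)
```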

### *3.4. Wavelet Transform*

In general, smoke is blurry and uneven; thus, it is difficult to detect its contours with conventional contour detection methods. The DWT (Discrete Wavelet Transform) [51,52] supports multiple resolutions and can represent the contour information of the vertical, horizontal, and diagonal components separately. Using this property, smoke is more apparent in DWT energy than in conventional edge detection methods.

When translucent smoke appears, the affected part of the image frame becomes less sharp and its high frequency component is reduced. Wavelet algorithms are generally well suited to expressing the image texture and edge characteristics of smoke and fire flames. Background images generally have low wavelet energy and few moving objects. In contrast, edges in smoke-covered regions become less visible and may disappear from the scene after a certain time, which means that the high frequency energy of the background scene decreases. To identify smoke in a scene, any decrease in high frequency in the detected blob images of each frame was monitored with a spatial wavelet transform algorithm.

As shown in Figure 6, if smoke spreads toward the edges of the image, it may be difficult to see initially, and the smoke may darken over time, causing part of the background to disappear [53,54]. Such a decrease indicates a high probability that smoke is present, which makes smoke detection easier. Therefore, this paper evaluated the sub-image energy by applying a first-level wavelet transform to the image and combining the squares of the high-frequency coefficient images, as in Equation (8):

$$E(x, y) = \sqrt{LH(x, y)^2 + HL(x, y)^2 + HH(x, y)^2} \tag{8}$$

where *x* and *y* represent positions within the image, and *LH*, *HL*, and *HH* each contain contour information of a high frequency component of the DWT: *LH* is the horizontal low-band, vertical high-band; *HL* is the horizontal high-band, vertical low-band; and *HH* is the horizontal high-band, vertical high-band. E(x, y) is the wavelet energy at each pixel of the candidate region detected by the deep learning algorithm within each frame.

**Figure 6.** Single level of wavelet transform results, (**a**) non-smoke sub-images and (**b**) smoke sub-images.
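A minimal PyWavelets sketch of Equation (8); the Haar mother wavelet is our assumption, as the text does not specify one.

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_energy(gray):
    # Single-level 2D DWT; the detail sub-bands hold the horizontal (LH),
    # vertical (HL), and diagonal (HH) high-frequency contours used in
    # Equation (8). The returned energy map is at half the input resolution.
    LL, (LH, HL, HH) = pywt.dwt2(gray.astype(np.float64), 'haar')
    return np.sqrt(LH**2 + HL**2 + HH**2)
```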

### **4. Experimental Results**

We propose a new algorithm that uses the similarity and color histograms of the global and local areas of the frame to reduce the smoke false positive rate of deep learning fire detection systems using ONVIF cameras. The experiments were performed on a computer with an Intel Core i7-7700 (3.5 GHz) CPU, 16 GB of memory, and a GeForce TITAN X GPU. The flame and smoke databases used in this study were obtained from the internet and from directly recorded ground and factory videos. The recording devices were a mobile phone camera, a Canon G5 camera, and a Raspberry Pi camera. Python 3.5, TensorFlow, and OpenCV were used in this paper.

To implement the proposed algorithm, the following process was carried out. The first step is labeling the training database using the LabelImg program, as shown in Figure 4. The labeling categories used in this paper are flame, smoke, grinder, welding, and human. The labeling result is stored in an .xml file that contains the object class name and the four-point coordinates of the object area.

The second step is the training process with the labeled images. In the training process, the input image is a JPEG or PNG file, and the .xml file must be converted to the training data format of TensorFlow. Since the metadata and labels of these images are stored in a separate file and must be read separately, the code for reading training data becomes complicated; additionally, performance can degrade if each image is read and decoded from JPEG or PNG every time. The TFRecord file format avoids this performance degradation and makes development easier. The TFRecord file stores the height and width of the image, the file name, the encoding format, the image binary, and the rectangle coordinates of the objects in the image. Through this process, the entire dataset was split into 70% training data and 30% validation data. We used Faster R-CNN with ResNet (Deep Residual Network) as the primary model for training, as it showed the fewest missed objects and the highest detection rate. The fire images used in the training consisted of 21,230 pieces.
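A hedged sketch of packing one labeled image into a TFRecord example; the feature keys follow the TensorFlow Object Detection API convention, which we assume was used here.

```python
import tensorflow as tf

def to_tf_example(image_path, width, height, boxes, labels):
    # Pack one labeled image into a tf.train.Example holding the image
    # binary, size, filename, encoding format, and normalized boxes.
    with open(image_path, 'rb') as f:
        encoded = f.read()
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)  # normalized [0, 1] coordinates
    def _bytes(v): return tf.train.Feature(bytes_list=tf.train.BytesList(value=v))
    def _ints(v): return tf.train.Feature(int64_list=tf.train.Int64List(value=v))
    def _floats(v): return tf.train.Feature(float_list=tf.train.FloatList(value=v))
    feature = {
        'image/encoded': _bytes([encoded]),
        'image/format': _bytes([b'jpeg']),
        'image/filename': _bytes([image_path.encode()]),
        'image/width': _ints([width]),
        'image/height': _ints([height]),
        'image/object/bbox/xmin': _floats(xmins),
        'image/object/bbox/ymin': _floats(ymins),
        'image/object/bbox/xmax': _floats(xmaxs),
        'image/object/bbox/ymax': _floats(ymaxs),
        'image/object/class/label': _ints(labels),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```

The resulting examples can then be serialized with tf.python_io.TFRecordWriter (TF 1.x) or tf.io.TFRecordWriter, with the 70%/30% split applied at the file level.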

Finally, we extracted the trained model. The training process stores a checkpoint that represents the training state at predetermined intervals. Each checkpoint carries meta information in the TensorFlow model file format and can be used to resume training. However, because the ".meta" file contains a lot of unnecessary information, it must be trimmed before the model can be used in practice. Finally, a ".pb" file is generated that combines the weights with the graph while excluding the unnecessary data in the ".meta" file.
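A TF 1.x-style freezing sketch corresponding to this step; the checkpoint paths and output node name are illustrative placeholders, not the paper's actual values.

```python
import tensorflow as tf
from tensorflow.python.framework import graph_util

# Restore a checkpoint, fold the variables into the graph as constants,
# and write a standalone .pb file. 'model.ckpt' and 'detection_boxes'
# are placeholders for the real checkpoint and output node.
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('model.ckpt.meta')
    saver.restore(sess, 'model.ckpt')
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ['detection_boxes'])
    with tf.gfile.GFile('frozen_inference_graph.pb', 'wb') as f:
        f.write(frozen.SerializeToString())
```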

In this paper, we used factory recorded video, mobile camera, Raspberry Pi camera, and general camera footage as the experimental data. Figure 7 shows an example of consecutive frames from a video used in the experiment. The fire detection experiment was performed using the ".pb" file based on the Faster R-CNN model. Figure 8 shows fire and smoke detection results, including true positives and false positives, using general deep learning.

**Figure 7.** Example of the frame sequence of test video.

**Figure 8.** The experimental results using the Faster R-CNN: (**a**) the results of true positive, (**b**) the results of false positive (similar shape and color and reflection of sun and light), (**c**) the results of false positive (moving objects and similar color).

Figure 8a shows the results of the experiment to detect fire and smoke in various videos; the detection threshold of the Faster R-CNN was 30% or higher. Figure 8b,c shows false positive detections produced by the trained deep learning model. The false positives appeared in many places and fall into three types. First, smoke or flame is falsely detected due to the reflection of sunlight. Second, facilities inside and outside the factory have shapes and colors similar to smoke and fire. Third, when objects move around, the deep learning system recognizes them as fire flame or smoke because their shapes resemble the trained fire flame and smoke, as shown in Figure 8c. Table 1 shows the fire and smoke detection results for several videos.


**Table 1.** The results of video test using general Faster R-CNN (frame).

Videos 1 to 8 contain smoke and fire flame, and Videos 9 to 16 contain non-fire (factory and office) scenes. Video 3, Video 7, and the non-fire videos included a number of false positive frames. In particular, Video 3 showed the same number of true positive and false positive frames, meaning that every frame contained a false positive object. In Table 1, Ground Truth represents the total number of frames in the video; True Positive (TP) indicates that fire flame and smoke are detected as fire flame and smoke; True Negative (TN) indicates that non-fire objects are not detected as fire flame and smoke; and False Positive (FP) is a case where a non-fire object is detected as fire. F/S signifies a video containing fire and smoke frames, and NON signifies a non-fire video.

In the fire videos, false positives are not generated in consecutive frames; since the video runs at 30 fps, occasional misses can be sufficiently compensated. However, in the case of Video 3, Video 7, and the non-fire videos, the alarm rings continuously and the stress on the operator increases. To reduce the false positives generated in these videos, we use the following characteristics. The first is a global check. Before performing deep learning, we checked the motion characteristics using SSIM, the mean square error, and the three frame difference [55], as in Equations (9) and (10). Since there is motion when a fire occurs, a frame is registered as a fire candidate when a block of moving pixels is generated. If the fire candidate status of a frame is true, the deep learning process is performed, as shown in Figure 1.

$$S_k = \mathrm{SSIM}(f_i, f_j), \quad M_k = \mathrm{MSE}(f_i, f_j), \quad A_k = \mathrm{diff}(f_i, f_j) \tag{9}$$

$$FS_G = \begin{cases} 1 & \text{if } S_k < th_1,\ M_k < th_2,\ A_k < th_3 \\ 0 & \text{else} \end{cases} \tag{10}$$

where *FSG* is the global decision parameter, *fi* and *fj* are frames at different times, and *th1* to *th3* are thresholds determined by experiment.
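A minimal sketch of the global check in Equations (9) and (10), reusing the ssim_global helper from the Section 3.1 sketch; all threshold values are illustrative placeholders, not the paper's.

```python
import numpy as np

def three_frame_difference(f1, f2, f3, motion_th=15):
    # Classic three frame difference: pixels that changed in both
    # consecutive pairs are treated as moving.
    d1 = np.abs(f2.astype(np.int16) - f1.astype(np.int16)) > motion_th
    d2 = np.abs(f3.astype(np.int16) - f2.astype(np.int16)) > motion_th
    return np.logical_and(d1, d2)

def global_check(f1, f2, f3, th1=0.95, th2=50.0, th3=0.01):
    # FS_G of Equation (10), implemented literally as written; a frame
    # passing all three tests is registered as a fire candidate.
    s_k = ssim_global(f1, f3)                     # Section 3.1 sketch
    m_k = np.mean((f1.astype(np.float64) - f3.astype(np.float64)) ** 2)
    a_k = three_frame_difference(f1, f2, f3).mean()
    return 1 if (s_k < th1 and m_k < th2 and a_k < th3) else 0
```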

The second is a local check on the area (bounding box) detected by deep learning. If a trained class appears in the input frame image, a bounding box is created and stored as a local area of interest. The next step is to verify this local area of interest again. In this paper, we determine the final fire region using the color histogram *H*, the *SSIM* index, the mean square error (*MSE*), the coefficient of variation, and the wavelet transform computed against other frames, as in the following equation:

$$\begin{aligned} F_L &= \begin{cases} 1 & \text{if } M_k < fth_1,\ A_k < fth_2,\ H_{sum} < fth_3,\ WE_k < fth_4 \\ 0 & \text{else} \end{cases} \\ WE_k &= \sqrt{FWV^2 + C\_R\_HH^2 \times (R\_H_{sum} + Y\_H_{sum})} \\ FWV &= \sqrt{C\_R\_HH^2 \times CV + C\_Y\_HH^2 \times CV} \end{aligned} \tag{11}$$

where *k* indexes frames and *fth1* to *fth4* are threshold values determined by experiment. *C\_R\_HH* and *C\_Y\_HH* are the HH wavelet transform coefficients for the RGB and YCbCr color spaces, and *R\_Hsum* and *Y\_Hsum* are the R color and Y color histogram results for the local region. We compared the local region (bounding box area) of interest using the three frame difference algorithm (first, middle, and last frames) of the stored 10 frame images.
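A simplified sketch of the local check *F_L* in Equation (11), composed from the earlier helper sketches; the weighted *WE_k* combination is reduced to a plain mean wavelet energy here, and all thresholds are illustrative placeholders.

```python
import cv2
import numpy as np

def local_check(frames, box, fth=(50.0, 0.02, 0.5, 40.0)):
    # F_L of Equation (11) for one detected bounding box. `frames` is the
    # stored 10-frame buffer; the first, middle, and last frames are
    # compared, as described above.
    x1, y1, x2, y2 = box
    r1, r2, r3 = (f[y1:y2, x1:x2]
                  for f in (frames[0], frames[len(frames) // 2], frames[-1]))
    m_k = np.mean((r1.astype(np.float64) - r3.astype(np.float64)) ** 2)  # MSE
    a_k = three_frame_difference(r1, r2, r3).mean()      # motion ratio
    h_sum = grayish_histogram_ratio(r3)                  # Section 3.2 sketch
    gray = cv2.cvtColor(r3, cv2.COLOR_BGR2GRAY)
    we_k = wavelet_energy(gray).mean()                   # stand-in for WE_k
    return 1 if (m_k < fth[0] and a_k < fth[1]
                 and h_sum < fth[2] and we_k < fth[3]) else 0
```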

For the final smoke region, in common with fire detection, we applied the same sequence, as in the following equation:

$$S_L = \begin{cases} 1 & \text{if } M_k > sth_1,\ A_k > sth_2,\ H_{sum\_F} > sth_3,\ WE_k < sth_4 \\ 0 & \text{else} \end{cases} \tag{12}$$

This paper added the following conditions to remove false positives:

$$\begin{aligned} SD_1 &= \sqrt{C\_Y\_HL^2 \times FWV} \\ SD_2 &= SD_1 \times FWV \\ SD_3 &= \{(CV_S + CV_F)/2\} \times SD_2 \\ SD_4 &= CV_S \times C\_Y\_HL\_LH \\ C\_Y\_HL\_LH &= \sqrt{C\_Y\_HL^2 \times FWV + C\_Y\_LH^2 \times FWV} \\ S_{SD} &= \begin{cases} 1 & \text{if } SD_1 > sth_5,\ SD_2 > sth_6,\ SD_3 > sth_7,\ SD_4 < sth_8 \\ 0 & \text{else} \end{cases} \end{aligned} \tag{13}$$

where *CVS* and *CVF* are the coefficients of variation of the local smoke and fire regions, respectively. In this paper, a detection is regarded as a fire if the final decision *FD* is satisfied, as shown in the following equation:

$$FD = \begin{cases} 1 & \text{if } F_L > 0,\ S_L > 0,\ S_{SD} > 0 \\ 0 & \text{else} \end{cases} \tag{14}$$

We described the result of the experiment applying the proposed algorithms in Table 2.

Table 2 shows the experimental results using the proposed algorithm. In these videos, the false positive rate dropped to 0% while the fire detection in Video 1 to Video 6 persisted. Even though Video 3 and Video 4 missed a few fire frames, this is not a problem, because the misses are not continuous and the alarm system can still send a warning signal to the operator if one or two frames are missed. As shown in Table 2, the proposed algorithm using the color histogram, wavelet transform, and coefficient of variation was able to eliminate the false positives (similar shapes and colors, sun and light reflections, moving objects, etc.) shown in Figure 8b,c. The proposed combination of the color histogram, the high frequency components of the wavelet transform, which discriminate smoke and fire flame from the background, and the coefficient of variation removed a higher ratio of false alarms than the traditional deep learning method. However, in the case of Video 7 and Video 8, the missed frames must be considered seriously. Additionally, we tested other factory and office videos, which also showed a zero false positive rate for the proposed method. The false positive removal rate for the additional 16 videos was 99%, and examples of the images used in this experiment are shown in Figure 9: Figure 9a shows office and factory videos, and Figure 9b shows fire and smoke videos. Since these videos involve a lot of movement, this likely contributed to the missed frames in Video 7 and Video 8.


**Table 2.** The results of video test using proposed algorithm.

**Figure 9.** Experimental videos for proposed algorithm test: (**a**) factory and office videos and (**b**) fire and smoke videos.

### **5. Conclusions**

Fires resulting from small sparks can cause terrible disasters that lead to both economic losses and the loss of human lives. In this paper, we described a new fire flame and smoke detection method that removes false positive detections using spatial and temporal features on top of deep learning from surveillance cameras. In general, a deep learning method that relies on the shape of an object frequently generates false positives, where a general object is detected as fire or smoke. To solve this problem, first, we used motion detection with the three frame difference algorithm as the global information and then applied the frame similarity using SSIM and MSE. Second, we adopted the Faster R-CNN algorithm to find smoke and fire candidate regions in the detected frames. Third, we determined the final fire flame and smoke areas using spatial and temporal features: the wavelet transform, coefficient of variation, color histogram, frame similarity, and MSE over the candidate regions. Experiments showed that the probability of false positives with the proposed algorithm is significantly lower than that of the conventional deep learning method.

For future work, it is necessary to study the analysis of moving videos and to experiment with the correlation between frames and the deep learning model to further reduce false positives and missed fire and smoke frames.

**Author Contributions:** Conceptualization, Y.L. and J.S.; methodology, Y.L. and J.S.; software, Y.L.; validation, Y.L.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.L.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and J.S.; visualization, Y.L.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S.

**Funding:** This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the National Program for Excellence in SW (IITP-2019-0-01113) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
