#### *2.3. DenseNet*

The early wildfire-detection algorithm was constructed using DenseNet, a state-of-the-art network architecture that is known to perform well in wildfire detection while alleviating the vanishing-gradient problem and reducing the training time [40]. DenseNet is a convolutional neural network (CNN) with a dense connection strategy: each layer is directly connected to every subsequent layer. Figure 3 illustrates the original dense-block architecture. The network comprises layers, each of which contains a non-linear transformation $H_l$ composed of functions such as batch normalization, a rectified linear unit (ReLU), and convolution. $X_0$ is the input image, and $X_{l-1}$ is the output of the $(l-1)$th layer after passing through a convolution. The $l$th layer receives the feature maps of all preceding layers as its input (Equation (6)).

$$X_l = H_l([X_0, X_1, X_2, \dots, X_{l-1}])\tag{6}$$

**Figure 3.** Architecture of five-layer densely connected convolution networks.
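To make the connectivity of Equation (6) concrete, the following is a minimal PyTorch sketch of a dense block. The layer composition (batch normalization, ReLU, 3 × 3 convolution) follows the description above; the 1 × 1 bottleneck convolution of the full model (Section 3.2) is omitted for brevity, and the layer count and channel widths are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One H_l of Equation (6): BN -> ReLU -> 3x3 convolution."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.norm = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(in_channels, growth_rate,
                              kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.relu(self.norm(x)))

class DenseBlock(nn.Module):
    """Each layer receives the concatenated feature maps of all preceding layers."""
    def __init__(self, num_layers: int, in_channels: int, growth_rate: int):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]                               # [X_0]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # X_l = H_l([X_0, ..., X_{l-1}])
            features.append(out)
        return torch.cat(features, dim=1)
```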

#### *2.4. Performance Evaluation Metrics*

To compare the performance of the models, five commonly used metrics were calculated: accuracy, precision, sensitivity, specificity, and F1-score [44–46]. Accuracy is the ratio of correctly predicted observations to the total number of observations and is the most intuitive performance measure. Precision is the ratio of correctly predicted positive observations to the total number of predicted positive observations. Sensitivity is the ratio of correctly predicted positive observations to the total number of actual positive observations. Specificity is the ratio of correctly predicted negative observations to the total number of actual negative observations. The F1-score is the harmonic mean of precision and sensitivity, which is generally useful for assessing a model's performance, particularly when the class distribution is uneven. The expressions for the evaluation metrics are presented as follows.

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}. \tag{7}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}.\tag{8}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}.\tag{9}$$

$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}.\tag{10}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.\tag{11}$$

In the aforementioned equations, true positive (TP) denotes the number of wildfire images correctly predicted as wildfires, and true negative (TN) denotes the number of non-fire images correctly identified as non-fire. Likewise, false positive (FP) denotes the number of non-fire images incorrectly predicted as wildfires, and false negative (FN) denotes the number of wildfire images incorrectly predicted as non-fire. These four quantities are defined by the confusion matrix of the binary classification. The overall performance-evaluation metrics were computed on the wildfire and non-wildfire testing sets.
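As a concrete illustration, Equations (7)–(11) can be computed directly from the four confusion-matrix counts. The following minimal Python sketch does so; for brevity, it includes no guards against zero denominators.

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Equations (7)-(11) computed from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    f1_score    = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "F1-score": f1_score}
```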

#### **3. Experimental Results**

The following sections present the results obtained for dataset balancing and the wildfire-detection models. The experimental environment was CentOS (Community Enterprise Operating System) Linux release 8.2.2004, configured as an artificial-intelligence server. The hardware configuration of the server consisted of an Intel(R) Xeon(R) Gold 6240 central processing unit (2.60 GHz) and an Nvidia Tesla V100 GPU with 32 GB of memory. The experiments were conducted using the PyTorch deep learning framework [47] with the Python language. The results and example experiment code are available online in a GitHub repository (https://github.com/pms5343/pms5343-WildfireDetection_by_DenseNet).

#### *3.1. Dataset Augmentation Using GAN*

To alleviate the data imbalance of the collected images, new wildfire images were generated using CycleGAN as a data augmentation strategy. The objective of the image-generation model was to convert a subset of the collected non-wildfire images into wildfire images. A total of 1294 wildfire images (domain A) and 2311 non-wildfire images (domain B) from our original dataset were used.

As can be observed from Figure 4, training was performed by increasing the number of epochs until each loss changed only slightly, in order to improve the model. The generator loss tended to increase as the number of epochs increased, because the objective of the generators was to create fake images that the discriminators could not distinguish from real ones. Conversely, the discriminator losses decreased during training, as the discriminators learned to distinguish between the generated and original images. Figure 4b shows that the cycle-consistency loss, added to increase the diversity of the generated images, and the identity-mapping loss, added to minimize changes in the background of the generated images, also decreased during training. After 650 epochs, there was no significant change in the losses, and the training was thus terminated.

**Figure 4.** Training loss curve of CycleGAN-based non-fire–wildfire image converter. (**a**) Adversarial loss curve for generator and discriminator by the number of epochs. (**b**) Cycle-consistency and identity mapping loss curve by the number of epochs.
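For reference, the loss terms plotted in Figure 4 can be sketched as follows in PyTorch. This is a minimal illustration assuming the least-squares adversarial loss and typical CycleGAN loss weights (lambda_cyc = 10 and lambda_id = 5); the generators `G_AB`, `G_BA` and discriminators `D_A`, `D_B` are assumed to be defined elsewhere, and the exact weights used in this study are not specified in the text.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()  # least-squares adversarial loss
l1  = nn.L1Loss()   # cycle-consistency and identity losses

def generator_losses(G_AB, G_BA, D_A, D_B, real_A, real_B,
                     lambda_cyc=10.0, lambda_id=5.0):
    """Generator-side loss for one CycleGAN step (sketch)."""
    fake_A = G_BA(real_B)   # non-fire -> wildfire direction in this study
    fake_B = G_AB(real_A)   # wildfire -> non-fire direction

    # Adversarial losses: the generators try to make each discriminator output "real" (1).
    pred_fake_A = D_A(fake_A)
    pred_fake_B = D_B(fake_B)
    loss_gan = mse(pred_fake_A, torch.ones_like(pred_fake_A)) + \
               mse(pred_fake_B, torch.ones_like(pred_fake_B))

    # Cycle consistency: translating there and back should recover the input.
    loss_cyc = l1(G_AB(fake_A), real_B) + l1(G_BA(fake_B), real_A)

    # Identity mapping: an image already in the target domain should change little.
    loss_id = l1(G_BA(real_A), real_A) + l1(G_AB(real_B), real_B)

    return loss_gan + lambda_cyc * loss_cyc + lambda_id * loss_id
```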

Figure 5 illustrates the overall process of the model and an example of images from domains A and B undergoing the model-training process. A mountain image without fire from domain B was converted into a wildfire image by the generator $G_{BA}$ and then compared with a domain-A image (an original wildfire image) by discriminator A ($D_A$) (① → ② → ③ in Figure 5). The converted image was then reconstructed by the generator $G_{AB}$, and the result did not differ significantly from the original domain-B image (① → ② → ④ in Figure 5). In addition, it was confirmed that passing a domain-B image through the generator $G_{AB}$ produced almost no change (① → ⑤ in Figure 5). The reverse process was conducted in the same manner. In total, 1195 new 224 × 224-pixel fire images were created from domain B (Figure 6) and added to the wildfire dataset.

**Figure 5.** CycleGAN-based wildfire-image-generation architecture.

**Figure 6.** Sample of the wildfire images converted from non-fire mountain images.
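The following sketch illustrates how a trained generator $G_{BA}$ might be applied to convert a non-fire image into a wildfire image, as in Figure 6. The variable `G_BA` and the file paths are hypothetical, and the normalization mirrors common CycleGAN practice rather than the exact preprocessing used here.

```python
import torch
from torchvision import transforms
from torchvision.utils import save_image
from PIL import Image

# G_BA: the trained non-fire -> wildfire generator (hypothetical; loaded elsewhere).
G_BA.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map pixels to [-1, 1]
])

with torch.no_grad():
    img = preprocess(Image.open("non_fire/0001.jpg").convert("RGB")).unsqueeze(0)
    fake_fire = G_BA(img)                        # converted wildfire image
    save_image(fake_fire * 0.5 + 0.5,            # undo the normalization
               "generated_fire/0001.jpg")
```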

#### *3.2. Wildfire Detection*

Wildfire detection was realized using a DenseNet-based classification network consisting of three dense blocks and two transition layers, which identifies fire from 224 × 224-pixel image inputs. The architecture of this simple network is illustrated in Figure 7.

The dense block included two kernel sizes. One was a 1 × 1 convolution, used to decrease the number of input feature-map channels, and the other was a 3 × 3 convolution. After each dense block, the feature maps passed through a transition layer consisting of batch normalization, ReLU, a 1 × 1 convolution, and 2 × 2 average pooling, which reduced the width and height of the feature maps as well as their number. Finally, after the three dense blocks, the output passed sequentially through global average pooling, a linear layer, and a softmax classifier, as in a traditional CNN.

**Figure 7.** DenseNet-based wildfire-detection architecture.
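A minimal PyTorch sketch of this architecture, reusing the `DenseBlock` from the Section 2.3 sketch, might look as follows. The per-block layer counts, growth rate, and channel widths are illustrative assumptions, not the exact configuration of the study, and the 1 × 1 bottleneck inside the dense layers is again omitted.

```python
import torch
import torch.nn as nn

class Transition(nn.Module):
    """BN -> ReLU -> 1x1 convolution -> 2x2 average pooling,
    reducing both the channel count and the spatial size."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)

class WildfireDenseNet(nn.Module):
    """Three dense blocks and two transition layers, then global average
    pooling and a linear classifier (channel numbers are illustrative)."""
    def __init__(self, num_classes: int = 2, growth_rate: int = 12):
        super().__init__()
        self.stem = nn.Conv2d(3, 24, kernel_size=3, padding=1, bias=False)
        self.block1 = DenseBlock(4, 24, growth_rate)   # 24 + 4*12 = 72 channels
        self.trans1 = Transition(72, 36)
        self.block2 = DenseBlock(4, 36, growth_rate)   # 36 + 48 = 84
        self.trans2 = Transition(84, 42)
        self.block3 = DenseBlock(4, 42, growth_rate)   # 42 + 48 = 90
        self.head = nn.Sequential(nn.BatchNorm2d(90), nn.ReLU(inplace=True),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(90, num_classes))

    def forward(self, x):                    # x: (N, 3, 224, 224)
        x = self.trans1(self.block1(self.stem(x)))
        x = self.trans2(self.block2(x))
        x = self.block3(x)
        return self.head(x)                  # logits; softmax applied at inference
```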

The following section presents the wildfire-detection performance obtained using the DenseNet-based deep learning classification model, compared with pre-trained models. Two results were derived for each model: one for trainset A and the other for trainset B.

#### 3.2.1. Dataset Partition

The train- and test-set partition is specified in this section. Several images from the collected original dataset were used to generate new images. The non-fire forest images used as a GAN domain were deleted from the dataset for the classification model; however, the wildfire-domain images were retained, because they served only as references during GAN training. Horizontal flips and random crops (to 200 pixels) were used to expand the number of samples in the training sets. The training sets were divided into trainset A, consisting only of photographs actually taken, and trainset B, which additionally included the wildfire images generated by the GAN. Previous research has shown that accuracy decreases when the numbers of data points per class are imbalanced [48]. To avoid this well-known disadvantage of data imbalance, trainset A maintained a similar ratio between the two classes, even though its total number of images was smaller than that of trainset B. The test set contained only original photographs and no generated images. Twenty percent of the total collected original image dataset was selected as the test dataset. The partition of the datasets is shown in Table 1.


**Table 1.** Image datasets for wildfire-detection model.

#### 3.2.2. Model Training and Comparison of the Models

To demonstrate the performance of the proposed method, the two training sets were used with the proposed model and with the well-known pre-trained models VGG-16 and ResNet-50 for the performance evaluation. To improve the performance of each model, the learning rate and optimizer were tuned. Ten values of the initial learning rate between 0.1 and 0.00001 were tested while varying three representative optimizers: stochastic gradient descent (SGD), Adam [49], and RMSprop [50]. The number of epochs was fixed at 250, and the batch size was fixed at 64. The best hyperparameter combination was selected based on the average accuracy from a k-fold (k = 5) cross-validation process and is presented in Table 2.


**Table 2.** Selected hyperparameters for CNN architectures.
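The hyperparameter search described above can be sketched as a simple grid search with 5-fold cross-validation. In the following illustration, `num_train_images` and the helper `train_and_evaluate` (which trains a model for 250 epochs with batch size 64 and returns the validation accuracy) are hypothetical placeholders, and `WildfireDenseNet` is the sketch model from Figure 7.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

# Search space mirroring the text: ten initial learning rates and three optimizers.
learning_rates = np.logspace(-1, -5, num=10)        # 0.1 down to 0.00001
optimizers = {"SGD": torch.optim.SGD,
              "Adam": torch.optim.Adam,
              "RMSprop": torch.optim.RMSprop}

dataset_indices = np.arange(num_train_images)        # num_train_images: hypothetical
kf = KFold(n_splits=5, shuffle=True, random_state=0)

best_combo, best_acc = None, -1.0
for opt_name, opt_cls in optimizers.items():
    for lr in learning_rates:
        fold_acc = []
        for train_idx, val_idx in kf.split(dataset_indices):
            model = WildfireDenseNet()
            optimizer = opt_cls(model.parameters(), lr=lr)
            # train_and_evaluate: hypothetical helper returning validation accuracy.
            fold_acc.append(train_and_evaluate(model, optimizer,
                                               train_idx, val_idx,
                                               epochs=250, batch_size=64))
        mean_acc = float(np.mean(fold_acc))
        if mean_acc > best_acc:
            best_combo, best_acc = (opt_name, lr), mean_acc

print("Best hyperparameter combination:", best_combo, "accuracy:", best_acc)
```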

The training process of each model using the selected hyperparameter combination is illustrated in Figure 8. The training accuracy curve obtained as the number of epochs increased is presented in Figure 8a. The accuracy of the six models increased most significantly between epochs 1 and 10 and then increased steadily until epoch 250.

**Figure 8.** Learning curve of the training process over epochs. (**a**) Accuracy curve. Final accuracy: VGG-16, trainset A (0.954); VGG-16, trainset B (0.969); ResNet-50, trainset A (0.989); ResNet-50, trainset B (0.995); DenseNet, trainset A (0.985); and DenseNet, trainset B (0.995). (**b**) Loss curve. Final loss: VGG-16, trainset A (0.123); VGG-16, trainset B (0.085); ResNet-50, trainset A (0.028); ResNet-50, trainset B (0.016); DenseNet, trainset A (0.0003; SGD); and DenseNet, trainset B (0.00006; SGD).

The proposed DenseNet-based model demonstrated the highest training accuracy, approximately 99% at the end of training, followed by ResNet-50 and then VGG-16. In addition, the accuracy obtained with trainset B, which included generated images, was greater than that obtained with trainset A for all three models. The training-loss curve over epochs is presented in Figure 8b. The DenseNet and ResNet-50 losses decreased rapidly until epoch 20, whereas the loss of VGG-16 continued to decrease steadily. The training loss also exhibited better performance for trainset B than for trainset A, in terms of both the initial and final losses.

The classifier models were evaluated based on the performance results, using the five metrics presented in Table 3. DenseNet yielded the best results in terms of all five metrics. Although the VGG-16 model exhibited slightly lower accuracy, sensitivity, and F1-score, the results obtained using trainset B were at a similar level to (or better than) those obtained with trainset A. For example, in the case of DenseNet, the accuracy increased from 96.734% to 98.271%, the precision increased from 96.573% to 99.380%, the sensitivity increased from 96.573% to 96.976%, the specificity increased from 96.881% to 99.450%, and the F1-score increased from 96.573% to 98.163%. The experimental results showed that new images created by converting normal mountain images into images of mountains on fire could maintain the performance of the CNN and could also improve the model performance by supplying more varied training data.


**Table 3.** Comparisons of performance evaluation.

Bold indicates the best result among the compared methods.

#### 3.2.3. Influence of Data Augmentation Methods

In this section, the performance of the proposed model is compared with and without CycleGAN-based data augmentation, to verify the influence of the proposed method. Horizontal flip, random zoom (200 pixels), rotation (original images were rotated by 10° and 350°), and random brightness (two values selected arbitrarily between $l_{min}$ = 0.8 and $l_{max}$ = 1.2) were used in this section as traditional data augmentation without a GAN. The F1-score was obtained for combinations of training sets built with the various augmentation methods.
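The traditional (non-GAN) augmentation pipeline described above could be expressed as follows; torchvision is used here purely for illustration, and the parameter choices approximate the text rather than reproducing the exact implementation.

```python
from torchvision import transforms

traditional_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(200),                     # random 200-pixel crop ("zoom")
    transforms.Resize((224, 224)),                  # restore the model input size
    transforms.RandomRotation(degrees=10),          # +/-10 degrees covers 10 and 350
    transforms.ColorJitter(brightness=(0.8, 1.2)),  # random brightness in [l_min, l_max]
    transforms.ToTensor(),
])
```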

Based on the experimental results, it can be seen that data augmentation using CycleGAN improved the accuracy of the wildfire-detection models. As can be seen from Table 4, the F1-scores of the models trained on data combinations including the GAN method were higher by 1.154, 0.902, and 0.821, respectively, than those of the models trained with the traditional methods without the GAN.

**Table 4.** F1-scores for model trained by various combination sets.


#### 3.2.4. Visualization of the Contributed Features

To visualize the output of the best-performing model, a class activation map (CAM) [51] was used to determine which features of the image were extracted to detect the wildfire. As can be observed from the example CAM results in Figure 9, the detection was made primarily based on the presence of smoke or flames in the image, and the elements used for classification as wildfire were found even in the early stages of a fire, with no flame and little smoke.

**Figure 9.** Sample of CAM results of the wildfire images.
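CAM, as defined in [51], projects the classifier weights of the target class onto the final convolutional feature maps. The following sketch assumes the `WildfireDenseNet` model from the Figure 7 sketch, whose head is batch normalization, ReLU, global average pooling, and a single linear layer; it is an illustration, not the exact visualization code of the study.

```python
import torch
import torch.nn.functional as F

def class_activation_map(model, image, target_class):
    """CAM: weighted sum of the final feature maps, using the linear-layer
    weights of the target class (Zhou et al. [51])."""
    model.eval()
    with torch.no_grad():
        feats = model.block3(model.trans2(model.block2(
            model.trans1(model.block1(model.stem(image))))))  # (1, C, h, w)
        feats = model.head[1](model.head[0](feats))           # BN + ReLU before pooling
        weights = model.head[-1].weight[target_class]         # (C,)
        cam = torch.einsum("c,chw->hw", weights, feats[0])    # weighted channel sum
        cam = F.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        # Upsample to the input resolution for overlaying on the image.
        return F.interpolate(cam[None, None], size=image.shape[-2:],
                             mode="bilinear", align_corners=False)[0, 0]
```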

Smoke in the part of the image occupied by forest was detected well, but smoke in the part occupied by sky was not treated as a deciding factor. It is hypothesized that this occurred because the model confused smoke with clouds or fog, so smoke against a sky background could not be treated as a strong feature for classification.

#### *3.3. Model Application*

To apply the learned model to on-site drones or the surveillance cameras used to monitor forests, a method of handling images of higher resolution than the model's input size (224 × 224) is required. One option is to resize the remote-camera image to a lower resolution; however, the method proposed in this study crops the high-resolution image at regular intervals (considering that surveillance cameras are generally used to observe large areas) and derives a result value for each cropped image.

Figures 10–12 present an example of applying the model to a drone-captured forest video [52]. This is a 1280 × 720 drone video of a wildfire that occurred in Daejeon, Korea, in 2015. The white and jade-green boxes denote the cropped 224 × 224 areas and are drawn in alternating colors for ease of visualization. The cropped images overlap each other at a fixed interval, and 28 images per video frame were cut out and input to the classification model. The text in each box indicates the value derived from the softmax layer, the final layer of the model. Because the model was trained with two classes, a region was determined to contain fire if its softmax value exceeded 0.5, and not to contain fire otherwise.
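The overlapping-crop procedure can be sketched as a sliding-window loop over each frame. In the following illustration, the stride and the assumption that class index 1 corresponds to "fire" are placeholders; the exact overlap used to obtain 28 crops per frame is not specified in the text.

```python
import torch
import torch.nn.functional as F

def detect_fire_regions(model, frame, crop=224, stride=176, threshold=0.5):
    """Slide a 224 x 224 window over a high-resolution frame (e.g. 1280 x 720),
    classify each crop, and return the boxes whose fire probability exceeds
    the threshold. stride < crop makes neighboring crops overlap."""
    model.eval()
    _, _, H, W = frame.shape                       # frame: (1, 3, H, W) tensor
    detections = []
    with torch.no_grad():
        for top in range(0, H - crop + 1, stride):
            for left in range(0, W - crop + 1, stride):
                patch = frame[:, :, top:top + crop, left:left + crop]
                # Assumes class index 1 is "fire" (hypothetical label order).
                prob_fire = F.softmax(model(patch), dim=1)[0, 1].item()
                if prob_fire > threshold:
                    detections.append((left, top, left + crop, top + crop, prob_fire))
    return detections
```

Because each crop is classified independently, the returned boxes directly localize the fire within the frame, which is what allows the tracking described at the end of this section.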

**Figure 10.** Example of model application with softmax result for early wildfire (with error).

Figure 10 presents the result of applying the model to an image captured approximately 1 min after the wildfire started. The photographs include not only the forest, but also parts of nearby villages. The model detected the smoke generated in the forest and determined the location at which the fire had occurred. However, a greenhouse at the bottom right of the photograph was falsely detected as a wildfire (0.829). This was likely caused by specific scenes, such as cities, roads, and farmland, not being properly taken into account when training the initial model. This phenomenon was also observed when the model was applied to other sites.

**Figure 11.** Example of model application with softmax result for non-wildfire (with error).

**Figure 12.** Example of model application with softmax result after 10-min wildfire progress.

As can be seen from the class activation map in Figure 11, the model mistook building features for wildfire. Although it cannot be concluded that all artificial objects cause false detections, it was confirmed that false positives may occur when more than half of a cropped image comprises non-natural objects. Conversely, there were no false positives caused by natural objects, such as confusion between clouds and smoke.

Figure 12 presents the result of applying the model approximately 10 min after the wildfire started. As the fire was accompanied by flames once it had grown to some extent, the softmax layer provided a prediction with 100% probability, and the fire could be detected more easily than at its beginning. Because the cropping method does not resize the image, the original image is not degraded. As each cropped image is discriminated individually, the location of the fire can be tracked while continuously obtaining real-time video footage from a surveillance camera.
