This section presents the learning process and test results of the MLC model to support wildfire responses. Experiments were conducted in a CentOS (Community Enterprise Operating System) Linux release 8.2.2004 environment with an Nvidia Tesla V100 GPU and 32 GB of memory, and the models were built and trained using PyTorch [45], an open-source deep learning framework.
4.1. Dataset
The dataset used to train and test the deep learning model contained daytime and nighttime wildfire images captured by surveillance or drone cameras and downloaded from the Web, as well as cropped images of a controlled forest fire captured by the researchers with a drone. This study also included the day–night image matching (DNIM) dataset [46], used to reduce the effects of lighting changes between day and night, and the Korean tourist spot (KTS) dataset [47], generated for deep learning research, which comprises forest-labeled images containing important wooden cultural properties in forests. Additionally, 154 cloud and 100 sun images were included because such wildfire-like images share color and shape properties with early wildfire smoke and flames and are often detected erroneously. Including them in the training dataset helps prevent predictable errors in the verification stage and makes the model robust against wildfire-like images.
The collected images were resized or cropped to 224 × 224 pixels in consideration of the model's applicability to high-definition images. The dataset comprised 3800 images.
Figure 3 shows samples of the images.
All instances were annotated according to the following classes: “Wildfire”, “Non-Fire”, “Flame”, “Smoke”, “Building”, and “Pedestrian” (each class was abbreviated as “W”, “N”, “F”, “S”, “B”, and “P”, respectively).
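For multilabel classification, each instance's label set can be encoded as a multi-hot target vector; a minimal sketch (the class ordering used here is an assumption):

```python
# Assumed class order; abbreviations "W", "N", "F", "S", "B", "P" follow the text.
CLASSES = ["Wildfire", "Non-Fire", "Flame", "Smoke", "Building", "Pedestrian"]

def encode_labels(assigned):
    """Return a 6-dimensional multi-hot target vector for one image."""
    return [1.0 if c in assigned else 0.0 for c in CLASSES]

# e.g., a daytime image of a wildfire with visible smoke and a pedestrian
target = encode_labels({"Wildfire", "Smoke", "Pedestrian"})  # -> [1, 0, 0, 1, 0, 1]
```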
Table 1 lists the number of images for each designated label set before data augmentation. It consists of 2165 images downloaded from the Web, 1000 images from the KTS dataset, 101 images from the DNIM dataset, 254 images for error protection purposes, and 280 cropped images captured by the researchers. To ensure the annotation quality and accuracy, all of the annotated images were checked twice by different authors.
4.2. Data Partition
The dataset used for the experiment was divided into training, validation, and test sets. The test dataset included 2280 images from the entire dataset. The remaining 1520 images were pre-processed with data augmentation techniques, such as rotation, horizontal flipping, and brightness adjustment, which are typically used in CNN image classification studies to secure sufficient training data, as shown in Table 2.
Table 1 also lists the number of images for each designated label set after augmentation. Overall, the non-fire label group was the most common, and the size of a label group decreased as the number of labels in it increased. In particular, label groups in which pedestrians and buildings appear at the wildfire site, which are difficult to obtain, were the least common.
Since drones are generally not perpendicular to the horizon or inverted when photographing wildfires, the rotation angle was not set to extreme values, such as 90° or 180°, but was instead limited to between 350° and 10° (i.e., within ±10°), considering the lateral tilt of the drone. In addition, if the brightness of an image is too high or too low, the boundary of the target becomes unclear and the object becomes ambiguous; therefore, brightness augmentation was performed between the minimum and maximum brightness limits. After data augmentation, the training and test datasets were divided in a ratio of 4:1. In the model learning phase, 912 randomly sampled instances from the training dataset were evenly divided into 10 groups for evaluation using the cross-validation strategy.
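The augmentation described above can be sketched as follows; this is an illustrative NumPy version (the actual pipeline would use torchvision transforms, and the 0.7–1.3 brightness bounds are an assumption, since the paper does not state the exact limits):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img, max_tilt_deg=10.0, brightness=(0.7, 1.3)):
    """Augment an HxWx3 image array with values in [0, 1].

    Only the rotation angle is sampled here, within +/-10 degrees
    (i.e., 350-10 degrees); applying the rotation itself requires an
    image library such as PIL (Image.rotate).
    """
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                   # random horizontal flip
    angle = rng.uniform(-max_tilt_deg, max_tilt_deg)
    scale = rng.uniform(*brightness)            # bounded brightness change so the
    img = np.clip(img * scale, 0.0, 1.0)        # target's boundary stays visible
    return img, angle
```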
The class distribution of the prepared data was checked and is shown in Figure 4. Because early wildfire detection relies mostly on smoke, the number of smoke instances was higher than the number of flame instances. The wildfire and non-fire classes were also imbalanced, and the building and pedestrian classes were relatively rare. Given the imbalance between the labeling classification in Table 1 and the overall class distribution in Figure 4, micro-average-based evaluation metrics should be checked.
4.3. Performance Analysis
This study compared models with different backbones and verified the efficiency of transfer learning and data augmentation. Each model was constructed using the training and validation sets partitioned by the 10-fold cross-validation strategy, and the final performance was measured with each performance metric on the test dataset. The initial hyperparameters for the CNN-based MLC architectures are listed in Table 3.
The models were trained using binary cross-entropy as the loss function with the selected parameters. Each model was trained using a 10-fold cross-validation strategy, and the results were calculated 10 times. The training process of each model with the selected hyperparameter combination is illustrated in Figure 5. For VGG-16, the training loss decreased gently and the validation loss started at a very high value, whereas the training losses of ResNet-50 and DenseNet-121 fell sharply until about epoch 10 and then remained close to zero. DenseNet-121 also maintained the lower validation learning curve. At the final epoch (epoch 100), DenseNet-121 recorded the lowest training and validation losses.
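The binary cross-entropy objective for multilabel outputs can be written explicitly; below is a NumPy sketch, equivalent in spirit to PyTorch's `nn.BCEWithLogitsLoss` (the example logits are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, targets, eps=1e-7):
    """Binary cross-entropy averaged over all samples and the 6 classes."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(p) + (1 - targets) * np.log(1 - p))))

# one image whose ground truth is {Wildfire, Smoke} out of the 6 classes
logits = np.array([[3.0, -3.0, -2.0, 2.5, -4.0, -4.0]])   # hypothetical outputs
targets = np.array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0]])
loss = bce_loss(logits, targets)   # small, since predictions match the labels
```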
The models were evaluated using label-based performance metrics, shown as box plots in Figure 6. All of the proposed models showed good multilabel classification ability on images of forest and wildfire sites for disaster response, with scores above 0.9 for most of the evaluation metrics. Among the proposed models, DenseNet-121 not only achieved the highest scores for all of the evaluation metrics (highest box and median distributions) but also typically had smaller interquartile ranges for each metric (i.e., less dispersed results) than the other models. Thus, the model maintained consistently high performance over repeated tests.
Table 4 presents the results of the evaluation measurements with the mean and standard deviation.
However, an evaluation that uses only label-based measurements cannot capture the dependencies between classes. Therefore, Table 4 also presents example-based scores, which consider all of the classes simultaneously and are thus considered more suitable for multilabel problems. The mAP score of the best-performing model (DenseNet-121) was 0.9629, and its HL was 0.009.
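For clarity, the Hamming loss (HL) reported here is the fraction of label slots predicted incorrectly; a minimal sketch with hypothetical predictions:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of incorrectly predicted label slots, averaged over all
    samples and classes (lower is better; 0.009 means roughly 1 wrong
    slot per 111 predictions)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# two images, six labels each: one wrong slot out of 12 -> 1/12
truth = [[1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0]]
preds = [[1, 0, 0, 1, 0, 1], [0, 1, 0, 0, 0, 0]]
hl = hamming_loss(truth, preds)
```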
In addition, the per-class area under the receiver operating characteristic curve (ROC-AUC) values of our proposed models were calculated to determine the performance for each class in the image dataset. The ROC curve is a graph showing the performance of the classification model at all possible classification thresholds, unlike the recall and precision values, which change as the threshold is adjusted. The AUC is the area under the ROC curve and represents a measure of separability. Therefore, the ROC-AUC is a performance metric that is more robust than other performance indicators. AUC values range from 0 to 1, where AUC = 0.5 indicates that the model performed a random guess and the prediction is thus entirely unacceptable. The best performance is AUC = 1, indicating that all of the instances are properly classified.
Table 5 presents the results with mean and standard deviation values.
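In practice the per-class ROC-AUC is typically computed with `sklearn.metrics.roc_auc_score`; the rank-based (Mann–Whitney) definition it implements can be sketched as:

```python
import numpy as np

def auc_score(y_true, y_score):
    """ROC-AUC for one class: the probability that a randomly chosen
    positive instance is scored above a randomly chosen negative one."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()   # ties count half
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```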
The dataset for the MLC model includes pictures of either fires or non-fires (the two outcomes are mutually exclusive). Therefore, the results for the two classes “Wildfire” and “Non-fire” are calculated in almost the same way. The results for the classes “Wildfire” and “Smoke” were also similar, as flames are inevitably accompanied by smoke, although the smoke may be invisible when a fire is small or obscured by the forest. The accuracy of the pedestrian and building labels was low in all of the models, which can be attributed to the relatively small number of instances assigned these labels. The ROC-AUC scores were generally high in all classes for the transfer learning algorithm using DenseNet-121 as the network.
Finally, to confirm the effects of transfer learning and data augmentation on the training model, we removed one data-limitation overcoming strategy at a time using the control variable method and obtained the F1-score and the HL value. This experiment was implemented for DenseNet-121, which showed the highest performance. The training learning curves over the epochs are illustrated in Figure 7. In the training stage, there was a significant difference in the slope of the learning curve depending on whether the strategies were used: the curve was steep when one or more strategies were used and shallow when none were used. This means that the other models require more training before performance improves to the same level. When all of the strategies were used, the curve converged the most rapidly and the lowest final loss was obtained. Comparing the gradient slopes and final losses, there was no significant difference between the data augmentation and transfer learning effects, but the roughness of the curve was further reduced when data augmentation was used; in other words, learning was more stable.
Additionally, the test results determined by the evaluation metrics are listed in Table 6. The results of this experiment show that transfer learning can significantly improve multilabel classification performance. Without the transfer learning strategy, the macro-average F1-score decreased by 0.0745, the micro-average F1-score decreased by 0.0466, and the HL increased by 0.0286. In the case where only augmentation was used, the macro-average F1-score decreased by 0.1159, the micro-average F1-score decreased by 0.0701, and the HL increased by 0.0412.
The performance of transfer learning was further reduced when the model was trained only on the datasets without data augmentation. Hence, the quantity of data available for learning has a significant impact on MLC model performance.
4.4. Visualization
To perform localization, a bounding box is drawn by a thresholding method that retains the regions above 20% of the Grad-CAM result. The model also provides a confidence score indicating the extent to which its predictions are true, with a threshold value of 0.5.
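A sketch of this thresholding step on a Grad-CAM heatmap (interpreting "20%" as a fraction of the peak activation, which is an assumption; the example blob is hypothetical):

```python
import numpy as np

def cam_to_bbox(cam, rel_thresh=0.2):
    """Bounding box from a Grad-CAM heatmap: keep cells above 20% of the
    peak activation and take the extent of the retained region."""
    mask = cam >= rel_thresh * cam.max()
    ys, xs = np.nonzero(mask)
    # (x_min, y_min, x_max, y_max) in heatmap coordinates; this would be
    # rescaled to the 224 x 224 input before drawing
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

cam = np.zeros((7, 7))
cam[2:5, 3:6] = 1.0           # hypothetical activation blob
box = cam_to_bbox(cam)        # -> (3, 2, 5, 4)
```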
Figure 8 and Figure 9 show examples of the results obtained with DenseNet-121 as the backbone.
As shown in Figure 8, the sum of the confidence scores of the wildfire and non-fire classes is almost 100% because the two classes are mutually exclusive. Among the test datasets, the Case 1 image was selected as a sample labeled wildfire with pedestrians, and the Case 2 image was selected as a sample with a confusing object, such as the sun or fog, that can be mistaken for a wildfire object. Finally, for Case 3, a sample image with a nighttime fire was selected to evaluate the model under nighttime conditions.
In Case 1, the model predicted smoke, flames, non-fire, and a pedestrian with confidence scores of 0.9464, 0.7642, 0.0539, and 0.8623, respectively. The heatmap and bounding box of each class were displayed separately to express the location. Conversely, for the non-fire class, which was not assigned to the instance, a heatmap without fire or smoke was displayed over the bush area.
Case 2 used an image taken at sunrise in a foggy mountainous area. All of the classes except the non-fire class scored 0.000, showing that the model correctly classified the sun and fog, which are frequent sources of error in wildfire detection.
Case 3 was an image of a wildfire that occurred near a downtown area at nighttime, with fireworks nearby that may have affected detection. For the wildfire class, a heatmap formed even in an area unrelated to the wildfire (lower left), which was judged to have detected the smoke generated by the firecrackers. Although nighttime lighting was considered in the model training process and the heatmap covered the residential area, the confidence for the building class was still very low (0.0030).
These results were similar across the other test datasets, indicating that classification accuracy improves only for targets observable with the naked eye, because only clear targets were labeled during dataset preparation.
Figure 9 shows, by example, that the classification model is robust to small objects and noise in a photograph. In Figure 9a, firefighters were dispatched to extinguish a fire in the forest, and nearby hikers were caught on camera. Although the human figure in Figure 9a looks very small, it was detected with a 0.9952 confidence score, and the approximate location of the object was determined. Figure 9b shows a house with its lights on near a forest fire. Despite the lamp's similarity to a fire image, the heatmap did not recognize it as fire. Thus, the model attended to the appropriate regions when identifying each class.
Finally, Figure 10 shows the influence of transfer learning and data augmentation. As discussed in the previous subsection, four cases were compared using the control variable method. For each case, the heatmap and bounding box of a specific class (smoke and person in Figure 10) were visualized, and the probability values were calculated. When no data-shortage overcoming strategy was used, the heatmap represented the wrong place, and classes that were difficult to distinguish could not be detected at all (a value below 0.5 is not treated as a detection). Therefore, as shown in Figures 9 and 10, the accuracy difference is significant when data supplementation strategies are not used for wildfire images, where data are inevitably lacking. When one data supplementation method was used, the heatmap distribution was somewhat reasonable, and the confidence value for the target class increased significantly. When both strategies were used, the heatmap distribution was the cleanest, and the positive predictive probability was the highest.
4.5. Application
The proposed model was applied to footage captured by the researchers using a DJI Phantom 4 Pro RTK drone, with an image size of 1280 × 720 pixels (high definition, HD). The filming site simulated a wildfire environment by lighting a fire in a drum near the forest.
Although the captured HD image can be resized to the input size of the proposed model, downscaling a high-resolution image may result in the loss of information useful for classification, and the model may not operate smoothly [48]. Thus, the images were divided into 28 equal parts of 224 × 224 pixels, and the model was evaluated on each part. When a picture is divided without overlap, an object may be cut off and fall into a blind spot, so the images were divided with overlapping parts. The predicted classification values of each part were merged into the entire image and visualized. The results of applying the proposed model to the drone footage are shown in Figure 11. Labels with confidence values over 50% were predicted for the corresponding image parts. In Figure 11, a forest fire was detected based on smoke in the central part of the picture. However, there was also an error (9.02%) in the building area, which is important for preserving residential and cultural assets. This error can be explained as follows: despite the overlapping method, the large object was still cut off in the cropped images. A white dotted circle highlights the area with people, and the model correctly predicted the presence of people at 99.34% and 77.03%.
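The overlapping tiling can be sketched as follows; the 7 × 4 grid layout is an assumption consistent with 28 tiles of 224 × 224 covering a 1280 × 720 frame:

```python
import numpy as np

def tile_coords(width=1280, height=720, tile=224, nx=7, ny=4):
    """Top-left corners of an nx-by-ny overlapping tiling.

    Neighbouring tiles overlap because nx * tile > width and
    ny * tile > height, so no object can fall entirely between tiles.
    """
    xs = np.linspace(0, width - tile, nx).round().astype(int)
    ys = np.linspace(0, height - tile, ny).round().astype(int)
    return [(int(x), int(y)) for y in ys for x in xs]

coords = tile_coords()   # 28 tiles; horizontal stride 176 px (48 px overlap)
```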