*Article* **Identifying Forest Fire Driving Factors and Related Impacts in China Using Random Forest Algorithm**

## **Wenyuan Ma <sup>1</sup> , Zhongke Feng 1,\*, Zhuxin Cheng 1, Shilin Chen <sup>1</sup> and Fengge Wang <sup>2</sup>**


Received: 22 March 2020; Accepted: 28 April 2020; Published: 1 May 2020

**Abstract:** Reasonable forest fire management measures can effectively reduce the losses caused by forest fires and forest fire driving factors and their impacts are important aspects that should be considered in forest fire management. We used the random forest model and MODIS Global Fire Atlas dataset (2010~2016) to analyse the impacts of climate, topographic, vegetation and socioeconomic variables on forest fire occurrence in six geographical regions in China. The results show clear regional differences in the forest fire driving factors and their impacts in China. Climate variables are the forest fire driving factors in all regions of China, vegetation variable is the forest fire driving factor in all other regions except the Northwest region and topographic variables and socioeconomic variables are only the driving factors of forest fires in a few regions (Northwest and Southwest regions). The model predictive capability is good: the AUC values are between 0.830 and 0.975, and the prediction accuracy is between 70.0% and 91.4%. High fire hazard areas are concentrated in the Northeast region, Southwest region and East China region. This research will aid in providing a national-scale understanding of forest fire driving factors and fire hazard distribution in China and help policymakers to design fire management strategies to reduce potential fire hazards.

**Keywords:** forest fire driving factors; forest fire occurrence; random forest; forest fire management; China

## **1. Introduction**

Forests are ecosystems with rich biodiversity [1–3], and they play an important role in soil and water conservation, climate regulation, the carbon cycle and other aspects [4,5]. Fire, which affects the biodiversity, species composition and ecosystem structure of forest ecosystems, is the dominant disturbance factor in many forest ecosystems [6–9]. Moreover, fire also affects human lives, regional economies and environmental health [10–12]. In short, forest fires threaten the sustainable development of modern forestry and human security [13]. Therefore, as an important component of global environmental change, forest fires have become the focus of forestry and ecological research [14,15]. An important aspect of forest fire management and prevention is studying forest fire driving factors and their impacts, which can help fire prevention departments to accurately assess forest fire hazards and effectively implement forest fire prevention strategies [11,16]. Forest fires are affected complexly by many driving factors, so it is very important to select appropriate forest fire driving factors and prediction models.

Forest fire driving factors have generally been divided into four types, namely, climate, vegetation, topography and socioeconomic [17,18], which vary at different temporal and spatial scales [19]. Regarding impact modes, climate factors control the accumulation and water content of forest

fuels [20,21], which are usually considered as the major determinants of forest fire occurrence [22]. Vegetation is a source of forest fuel and directly affects the ignition capacity [23,24]. Topography can affect the structure and distribution of vegetation, thus affecting the possibility of forest fires as well as the spread speed and direction of forest fires [25]. Socioeconomic factors affect forest fire occurrence via building expansion, traffic network construction and human-related activities, which increase pressure on wildlands, bringing ignition sources close to forests [23,26]. In terms of impact scope, climate affects forest fires on a larger scale while the vegetation, topography and socioeconomic factors affect forest fires on a smaller scale [27]. In terms of impact relationship, there are nonlinear relationships and thresholds between forest fire driving factors and forest fire occurrence [28–30]. Random forest is a machine learning algorithm, which can automatically select important variables and flexibly evaluate the complex interaction between variables. In recent years, random forest has been used in the study of forest fire driving factors and has shown better prediction ability than multiple linear regression [31] and logistic regression [18].

Previous studies have analysed forest fire distribution, forest fire frequency and burnt area at the national scale in China. Tian [32] analysed the spatial and temporal distribution characteristics of wildfires for 2008–2012 in mainland China. Chang [33] explored the environmental factors influencing the spatial variation in the mean number of fires and mean burned forest area from 1987 to 2007 at a provincial level using cluster analysis and redundancy analysis. Zhong [34] analysed the changes in fire frequency and burnt area during 1992–1999 in China. Lu [35] analysed the impacts of annual temperature and precipitation on the burnt area dynamics in China. Ying used ground-based data to analyse the environmental and social factor contributions to the spatial variation of forest fire frequency and burnt area summarized at a county level during 1989–1991 in China [36]. However, previous studies have used models to analyse the driving factors and their impacts of forest fire occurrence in China, mainly at the provincial scale, such as in Fujian province [18], Heilongjiang province [29,37] and Shanxi province [38]. There is still a lack of nationwide research on forest fire driving factors and their influence on recent forest fires. The value of this study lies in conducting the nationwide research which can provide a detailed analysis and practical information of the forest fire hazard and would help governments to formulate more accurate forest fire prevention strategies and allocate resources rationally. In this study, we used the random forest model and forest fire ignitions for 2010~2016 (obtained from MODIS Global Fire Atlas dataset) to evaluate the impact of four types of forest fire driving factors and the regional differences of these factors in China. This study has three objectives: (1) to determine the forest fire driving factors in various geographical regions of China and analyse how they affect forest fire occurrence; (2) to map the likelihood of forest fire occurrence in China and (3) to discuss forest fire prevention strategies in different geographical areas of China.

## **2. Materials and Methods**

## *2.1. Study Area*

The study area covered mainland China (Hong Kong, Macao and Taiwan were not analysed due to a lack of data). The driving factors of forest fires and their effect were analysed in 6 geographical regions: Northeast region (NE), North China region (N), East China region (E), Northwest region (NW), Southwest region (SW) and Mid-south region (MS). Each region is an aggregation of provinces with adjacent locations and similar topography, economy and climate. The details of each region are shown in Figure 1 and Table 1.



**Table1.**BasicdescriptionofthesixgeographicalregionsinChina.

**Figure 1.** Six geographical regions in China.

## *2.2. Data Preparation*

## 2.2.1. Dependent Variables

We identified 17,466 forest fires (ignitions) between 2010 and 2016 across mainland China with the Global Fire Atlas dataset (downloaded from the Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center (DAAC), https://daac.ornl.gov) and Chinese land-use type dataset (downloaded from the Resource and Environment Data Cloud Platform, http://www.dsac.cn). The timing and location of the fire ignitions were provided by the Global Fire Atlas, which is a global dataset that records the daily dynamics of individual fires based on the Global Fire Atlas algorithm and estimated burn dates from the Moderate Resolution Imaging Spectroradiometer (MODIS) [41]. A Chinese land-use type dataset provided the forest land range in mainland China for 2015 at a 1000-m spatial resolution. According to this range, we identified Chinese forest fire ignitions for 2010–2016 from the Global Fire Atlas dataset in ArcGIS10.2 software (Environmental Systems Research Institute, RedLands, CA, USA). Figure 2 shows the distribution of the forest fire ignitions for six geographical regions in China.

**Figure 2.** Distribution of the number, proportion and location of forest fire ignitions in six geographical regions from 2010 to 2016.

Modelling forest fire occurrence requires a binary target variable, so a certain percentage of control points (nonfire points) were generated randomly according to three principles: (1) the ratio of forest fire ignition points to control points was 1:1.5 [29], (2) the control points were located within the forest land range in mainland China and (3) the points were random in both time and space. ArcGIS10.2 software was used to randomly generate the control points, and the dates of the control points were randomly selected during 2010–2016 to meet the randomness of time.

## 2.2.2. Explanatory Variables

A total of 21 variables, grouped into climate, topography, vegetation and socioeconomic categories, were selected as the initial forest fire driving factors (Table 2). All variables were integrated in ArcGIS10.2 software and extracted to the forest fire ignition points and control points.




## Climate Variables

Climate variables affect fuel accumulation and moisture which largely determine the time, location and occurrence probability of forest fires [31]. In this study, the initial climate variables include annual variables and daily variables. The annual variables are precipitation and soil moisture. As climate factors in the period before the fire can also affect vegetation accumulation and the fuel moisture content, precipitation and soil moisture in the year before individual forest fire ignition during 2010–2016 were also taken into consideration [29,31]. We downloaded precipitation data with a 1-km spatial and monthly temporal resolution [42] and soil moisture data with a 0.25◦ spatial and monthly temporal resolution from the National Earth System Science Data Sharing Infrastructure, National Science & Technology Infrastructure of China (http://www.geodata.cn). Based on these data, we calculated the annual cumulative precipitation and the annual average soil moisture for 2009–2016.

The daily initial climate variables include daily average temperature, daily average ground surface temperature, daily average relative humidity, daily minimum relative humidity, daily precipitation, daily average atmospheric pressure, sunshine hours, daily average wind speed and daily maximum wind speed. The daily humidity, precipitation, wind speed and sunshine hours affect the possibility of forest fire occurrence by reflecting fuel moisture. Daily temperature is the key condition triggering fire ignition. Atmospheric pressure can affect the oxygen content in the air, and the pressure obviously differs due to significant altitude differences and the complex terrain in China; therefore, atmospheric pressure was also considered as an initial climate variable. Daily climate data were obtained from the Daily Data Set of China's Surface Climate Data (V3.0) of the National Meteorological Information Centre (http://data.cma.cn), and we included daily data from 824 national weather stations in China. The daily climate variable values for each fire ignition and control point were provided by the weather station nearest that point.

## Topographic Variables

Topography influences the possibility of forest fire occurrence by affecting the vegetation composition and distribution and local microclimate [25]. In this study, the initial topographic variables include elevation, slope and aspect. Data for these variables were extracted from digital elevation model (DEM) data in China with 90-m spatial resolution (obtained from Geospatial Data Cloud site, Computer Network Information Center, Chinese Academy of Sciences, http://www.gscloud.cn). Aspect was divided into 8 categories according to the criteria in Table 3.



## Vegetation Variable

The initial vegetation variable is the fractional vegetation cover (FVC), which is the percentage of the vertical projection of vegetation area to the ground surface within a unit area [48] and can well represent the amount of forest fuel [18,29]. The normalized difference vegetation index (NDVI) is significantly better than other vegetation indices in estimating FVC [49,50], so we calculated FVC based on the annual NDVI dataset for 2010–2016. The calculation formula is as follows:

$$FVC = \left(NDVI - NDVI\_{\text{soil}}\right) / \left(NDVI\_{\text{veg}} - NDVI\_{\text{soil}}\right) \tag{1}$$

where *NDVIsoil* is the NDVI of bare soil and *NDVIveg* is the NDVI of dense vegetation canopy. The annual NDVI dataset for 2010–2016 was from the Resource and Environment Data Cloud Platform (http://www.resdc.cn), and the resolution was 1 km [51].

## Socioeconomic Variables

Socioeconomic variables affect the probability of forest fire occurrence by affecting human activities. Human travel and engaging in production activities in or around forests will increase the occurrence probability of forest fires. In this study, the initial socioeconomic variables include the distance to the road and railway, the distance to the settlement, gross national product (GDP) and population density. Collectively, these variables can reflect the accessibility of a forest and the possibility of people engaging in fire-prone behaviours in forests [23,26]. The road, railway and settlement datasets were from the National Basic Geographic Database of 1:1 million, which was published on the National Catalogue Service for Geographic Information website (http://www.webmap.cn). The distance between the forest fire ignitions and control points to the nearest road and railway and settlement areas was calculated using the ArcGIS 10.2 "near analysis tool." The population density dataset and GDP dataset were downloaded by the National Earth System Science Data Center (http://www.geodata.cn), and the resolution was 1 km.

## *2.3. Model*

The random forest model was used to identify the forest fire driving factors and their corresponding impacts on forest fire occurrence in 6 geographical regions of China and the whole study area. Random forest is an ensemble learning technique that is derived from classification or regression trees (CARTs). Random forest has a high prediction accuracy and high tolerance to outliers and "noise," and it has shown good prediction ability in forest fire forecasting [30,52]. The random forest model is composed of a combination of various classification trees, which are individually generated by bootstrap samples. Two-thirds of the data are used to train the random forest model and one-third of the data (the out-of-bag samples, OOB) for model validation [53]. Variable importance can also be measured by OOB, which compares increases in OOB error with that variable randomly permuted and all others unchanged [54,55]. The importance score of a variable is as follows [56]:

$$VI\{\mathbf{X}^j\} = \frac{1}{n \text{tree}} \sum\_{t} \left( err'OOB\_t^j - errOOB\_t^j \right) \tag{2}$$

where *X<sup>j</sup>* is the *j*th variable, *ntree* is the number of trees, *errOOB<sup>j</sup> <sup>t</sup>* is the OOB error of each tree *t* and *err OOB<sup>j</sup> <sup>t</sup>* is the OOB error when *<sup>X</sup><sup>j</sup>* is permuted, while all other variables remain unchanged among OOB data. For regression, the OOB error is the mean square error; meanwhile, for classification, the OOB is misclassification probability.

In this study, RF was used for classification, which divided dependent variables into two categories: forest fire occurrence and forest fire nonoccurrence. When using an RF model, the number of trees to run (ntree) and the number of variables to try at each split (mtry) need to be defined. According to previous experience [56,57], the value of mtry was set as *number o f variables* and the value of ntree was set to 2000. The varSelRF package in R statistical software was applied to select significant variables from the initial variables. Then, we measured and ranked the variable importance of these variables. The partialPlot function was used to draw partial dependence plots which can describe the relationship between the dependent variables and explanatory variables.

To eliminate bias, in each study region and the whole study area, we selected 80% of the original dataset (training dataset) to build the model, and the remaining 20% of the original dataset (independent validation dataset) was used to assess the performance of the final model. Each training dataset was divided into an inner training dataset (60%) and an inner validation dataset (40%) randomly [52]; this procedure was repeated 5 times, and 5 random subsamples of data in each study region and the whole study area were obtained. Each subsample contained an inner training dataset and an inner validation dataset, and each subsample generates an intermediate model. The variables that were selected as significant variables in at least 3 of the 5 intermediate model were considered as the forest fire driving factors in a region. In each region, the driving factors and training dataset were used to build a final model, and the independent validation dataset was used to validate the model [30].

## *2.4. Prediction Accuracy of the Models*

The receiver operating characteristic (ROC) curve, a coordinate schema analysis method, was applied to measure the predictive capability of the RF models using the area under the curve (AUC) [28,58,59]. The AUC values ranged from 0.5 to 1, with values closer to 1 indicating a relatively higher accuracy, while an AUC value of >0.8 usually indicates good predictive capability [18,60]. We used the Youden criterion, calculated according to the sensitivity and specificity of ROC (Youden criterion = sensitivity + specificity − 1) [28,61], to determine the cut-off point, which was the threshold for judging whether a fire occurred in the models [62]. If the predicted probability was higher than the cut-off point, it was assumed that a forest fire had occurred and vice versa. The prediction accuracy of the model was calculation based on the cut-off point.

## *2.5. Mapping Forest Fire Occurrence Likelihood*

Based on the fire occurrence probability calculated by the random forest model for fire ignitions and nonfire points, we used ordinary kriging interpolation to map the forest fire occurrence likelihood in mainland China in ArcGIS 10.2 [30].

## **3. Results**

## *3.1. Identification of Forest Fire Driving Factors and Their Importance Ranks*

Table 4 and Figure 3 show the forest fire driving factors and their importance rank in six regions and the whole study area. Table A1 and Figure A1 show the significant variables and their importance rank of each intermediate model.

**Figure 3.** Importance rankings of the forest fire driving factors according to the mean decrease accuracy in six geographical regions and the whole study area. The abbreviated variable names are the same as in Table 4.


**Table 4.** The results when identifying the forest fire driving factors in six geographical regions and the whole study area.

VIF (variance inflation factor) was used to measure the amount of multicollinearity in the explanatory variables. When VIF > 10, then collinearity in the explanatory variable exists and is excluded in the random forest model. "+" indicates that the variable was identified as being a forest fire driving factor in a given region, and "/" indicates that the variable is excluded due to multicollinearity. NE: Northeast region; N: North China region; NW: Northwest region; SW: Southwest region; MS: Mid-south region; E: East China region; Pre\_year0: annual precipitation in the year before the fire; Pre\_year1: annual precipitation in the year of the fire; Soil\_mois0: annual soil moisture in the year before the fire; Soil\_mois1: annual soil moisture in the year of the fire; Tem\_avg: daily average temperature; GST\_avg: daily average ground surface temperature; RH\_avg: daily average relative humidity; RH\_min: daily minimum relative humidity; Pre\_daily: daily precipitation; Pres\_avg: daily average air pressure; SSD: sunshine hours; Win\_avg: daily average wind speed; Win\_max: daily maximum wind speed; DEM: elevation; FVC: fractional vegetation cover; Dis\_road: the distance to road and railway; Dis\_sett: the distance to settlement; Pop: population density; GDP: gross national product.

## *3.2. Influence of the Forest Fire Driving Factors on Forest Fire Occurrence in Di*ff*erent Regions*

Partial dependence plots of each forest fire driving factor in each region were drawn to analyse the variables' influence intervals and trends on the probability of forest fire occurrence, where *x* is the variable value and y is logit of the probability of forest fire occurrence/2 [30]. The markers on the *x*-axis show the data distribution, where fewer marks indicate less training data and inaccurate model predictions; therefore, only the impact trends within the dense data range are discussed in this study.

Figure 4 shows a nonlinear relationship between each forest fire driving factor and the probability of forest fire occurrence. The vegetation variable shows the same influence trend on the forest fire occurrence probability in each region, and the overall trend is fluctuating. When the fractional vegetation cover is approximately 0.9, the probability of forest fire occurrence shows a peak value and then shows a sharp decline trend, while the probability is lowest when the fractional vegetation cover is approximately 0.98. The impact of climate variables is complex. The daily average temperature shows the same influence trend in the Northeast region and Southwest region: it was positively correlated with the probability of forest fire occurrence initially and negatively correlated after the values exceeded thresholds (12 ◦C in the Northeast region and 21 ◦C in the Southwest region). However, it shows another influence trend in the Mid-south region and East China region: the probability of forest fire occurrence is stable at higher values within 20 ◦C and decreases sharply when the daily average temperature exceeds 20 ◦C. The average daily relative humidity and the minimum daily relative humidity are generally negative correlated with the probability of forest fire occurrence in the respective regions. The annual soil moisture shows different influence trends in the North China and Southwest regions: in the North China region, it shows a fluctuating trend, while in the Southwest region, the probability of forest fire occurrence increases initially and then decreases as the annual soil moisture increases. For other climate variables, annual precipitation in the year before the fire and the

year of the fire, daily average air pressure and daily maximum wind speed were generally positively correlated with the probability of forest fire occurrence. The elevation shows similar influence trends in the Northwest region and Southwest region and is negatively correlated with the probability of forest fire occurrence. Among socioeconomic variables, the probability of forest fire occurrence decreases with increasing distance from roads and increases initially and then declines with increasing population density and GDP.

**Figure 4.** *Cont.*

**Figure 4.** Partial dependence plots of each forest fire driving in six geographical regions and the whole study area.

## *3.3. Model Prediction Accuracy in Di*ff*erent Regions*

The AUC values of each final model and intermediate model are greater than 0.85, and the prediction accuracy is between 70.0% and 91.4% (Table 5), which indicates that the model predictive capability is good. In the final models, the AUC (0.974) and prediction accuracy (91.4% for training and 89.3% for testing) in the East China region were the highest. The AUC (0.871) and prediction accuracy (81.75% for training and 70.52% for testing) in the Northwest region were the lowest, which may be due to the too-few fire ignition in the Northwest region.



**Table 5.** Comparisons of prediction accuracy of random forest models.

## *3.4. Likelihood of Forest Fire Occurrence*

Figures 5 and 6 show that the areas with high probability of forest fires are concentrated in the Northeast and Mid-south regions as well as the south of East China region and the northwest of Northwest region. To compare the results of the national model and the regional model, we drew a map of the difference in the likelihood of forest fire occurrence calculated based on the whole study area model and the regional models (Figure 7). The map shows that the probability of the whole model was higher than those of the regional models in most areas of the Southwest region and North China region and in the centre of Northwest region and lower than those in most areas of the Northeast, Northwest, East China and Mid-south regions and in the north of Northwest region.

**Figure 5.** Map of the likelihood of forest fire occurrence in China obtained from the regional models.

**Figure 6.** Map of the likelihood of forest fire occurrence in China obtained from the whole study area model.

**Figure 7.** Map of the likelihood of forest fire occurrence obtained from the whole study area model minus that obtained from the regional model.

## **4. Discussion**

## *4.1. Forest Fire Driving Factors and Their Influence*

Previous studies have found regional and scale differences in forest fire factors [18,36,63]. This study also confirmed this point. We found that due to the differing geographical and social conditions in China from region to region, the forest fire driving factors vary in different regions, and the same variables also operate differently depending on the region and the scale of analysis, which illustrates the spatial applicability of forest fire research and the importance of formulating forest fire management systems based on regional characteristics. All final models included a smaller number of variables selected from the initial set. Previous studies have also shown that the simplified model is more stable. Previous studies have also noted that a parsimonious model would be more stable [28,49].

In this study, all final models included climate variables, which are considered the dominant factors affecting forest fires [64–66]. Among climate variables, daily average temperature was the forest fire driving factor in the most regions (Northeast, Southwest, Mid-south, East China and the whole study area). Previous studies [29,30] have shown thresholds and complex nonlinear relationships between temperature and forest fire occurrence probability, and our study confirms this point. The probability of forest fire occurrence initially increases or stabilizes at a higher value with the increase in temperature. When the temperature exceeds a certain threshold (12 ◦C in the Northeast region, 21 ◦C in the Southwest region and 20 ◦C in the Mid-south region and East China region), the probability shows a sharp downward trend. There may be two reasons for this situation. (1) Although high temperatures can increase plant evaporation, thereby reducing the moisture content of forest fire fuels [67], most parts of China experience a monsoon climate (Table 1), and rainfall and heat are synchronous. Therefore, high-temperature weather is often accompanied by high relative humidity levels, which have opposite effects on forest fires, so there were impact thresholds. (2) At high temperatures, forest fire prevention departments are vigilant, implementing strict fire prevention systems and limiting the occurrence of forest fires [68]. Relative humidity is also one of the main forest fire driving factors. In this study, relative humidity showed a similar influence trend in the respective regions, and it was negatively correlated with the occurrence probability of forest fires despite some moderate fluctuations. This is because high relative humidity increases the moisture content of combustible materials and reduces the

possibility of fire [64]. It is noteworthy that the daily minimum relative humidity was also selected as the forest fire driving factor in the whole study area, which indicates that this variable operates at both regional and large scales. Air pressure affects the oxygen content and fuel ignition temperature, and a relatively lower pressure will lead to a lower oxygen content and higher ignition temperature, thus reducing the possibility of forest fire occurrence [69]. However, in the Southwest region, when the daily air pressure is higher than 860 hPa, the probability of forest fire occurrence shows a small decrease, which indicates that there is also an impact threshold of air pressure. The daily maximum wind speed is also one of the driving factors of forest fires in southwest China. The wind will increase evaporation capacity, and the higher the wind speed, the smaller the water content of forest combustibles; hence, the wind speed has a positive correlation with the occurrence probability of forest fires, which is consistent with the research results of Guo [18] in Fujian province of China. The soil moisture in the year of the fire directly affects the water content of forest combustibles [31], so this variable is negatively correlated with the occurrence probability of forest fire. Annual precipitation promotes the accumulation of plant fuels, thus having a positive impact on the occurrence probability of forest fires.

Among the topographic factors, elevation is a forest fire driving factor in the Northwest region and Southwest region, and it is negatively correlated with the occurrence probability of forest fires in both regions. We suspect that this is because the surface of these two areas fluctuates greatly; the elevation in most areas is 500~5000 m in the Northwest region and 500~6000 m in the Southwest region. As elevation increases, human activity decreases, and its impact on weather conditions, vegetation and soil moisture is also not conducive to forest fire occurrence [49,70,71]. Tian's research [32] also showed that forest fires mainly occurred at low elevations in China.

The vegetation variable (fractional vegetation cover) is a forest fire driving factor in all regions except the Northwest region, and its importance ranked first in four regions (Northeast region, Southwest region, East China region and the whole study area). Previous studies have also shown that vegetation cover has an important impact on forest fires [29,67]. Generally, the higher the vegetation coverage, the more fuel is available, so high vegetation coverage leads to a high forest fire rate. However, in this study, fractional vegetation cover showed a complicated influence trend: when the fractional vegetation cover is between 0.8 and 0.97, the occurrence probability of forest fires fluctuates at a higher value, and then it drops rapidly, reaching a minimum value when the fractional vegetation cover is approximately 0.98. We suspect that this is because in forest land with high vegetation coverage, canopy occlusion will lead to some small fires that are not easily detected by MODIS [67].

Compared with other variables, socioeconomic variables are the forest fire factors in few regions (Northwest region and Southwest region), with low degrees of importance (Figure 4). Distance from the road is negatively correlated with the probability of forest fires in the Northwest region because the forests close to the road are vulnerable to traffic accidents and human activities (i.e., smoking and picnics) [26,61]. GDP and population density show similar influence trends, and they have a positive impact on the occurrence probability of forest fires initially and then have a negative impact after exceeding a certain threshold (GDP of 200 RMB/km2 and population density of 100 number/km2). This may be because within a certain range, the increase in population density and GDP will increase human activity in forests, thereby promoting forest fire occurrence [51,72,73]. However, in economically prosperous and high-population-density areas, the forest coverage rate is low and there are few forest-related production activities conducted by humans, so the occurrence probability of forest fires decreases [18,29,32].

#### *4.2. Implications for Forest Fire Prevention*

There are differences in the forest fire driving factors (Figure 3 and Table 4) and the prediction results (Figures 5–7) between the regional models and the whole study area model. Therefore, it is necessary to study forest fire driving factors based on geographical regions, and regional differences should also be fully considered in forest fire management. Forest fire management departments should formulate forest fire prevention strategies according to the differences in forest fire driving factors and impact thresholds in different regions. E.g., in the Northwest region, elevation has the greatest impact on forest fire occurrence, and the probability of forest fires is higher in low-elevation areas. Therefore, the Northwest region forest fire management departments should strengthen forest fire monitoring in low-elevation areas, such as setting up more forest fire observation towers and forest fire brigades [30]. In the North China region, soil moisture has the greatest impact, so changes in soil moisture should be taken into account when developing a forest fire prevention strategy. In the Northeast region and East China region, when the daily average temperature reaches the impact threshold, the occurrence probability of forest fires reaches a maximum; hence, forest fire management departments should be more vigilant in the corresponding weather. In the Southwest region, there are 13 forest fire driving factors. These factors should be integrated into the local assessment index systems of forest fire hazard, and the influence of these factors should be considered comprehensively when judging the forest fire hazard. In the Mid-south region, forest fire management departments should pay attention to monitoring the daily minimum relative humidity.

The map of the likelihood of forest fire occurrence is also crucial to forest fire management [74]. Understanding the distribution of forest fire occurrence likelihood can help to determine the location and number of fire observation towers [28], contributing to more effective use of financial and human resources. Figure 5 shows that areas with a high probability of forest fires are concentrated in the Northeast and Mid-south regions as well as the south of East China region and the northwest of Northwest region; thus, more stringent forest fire prevention systems should be implemented in these areas.

## *4.3. Strengths and Limitations*

Previous forest fire research has usually been based on eco-geographical areas or forest types [29,75,76]. However, these zoning methods ignore administrative boundaries, and forest fire management strategies are often formulated by administrative areas. Therefore, we chose a zoning method that takes administrative divisions into account, trying to provide a more practical reference for China's fire prevention department. Our research is based on geographical regions in China, a division method that considers both administrative divisions and natural conditions that has been used in some forestry analysis [77,78]. Each region is an aggregation of provinces with adjacent locations and similar topography, economy and climate. However, this zoning method has its shortcomings. First, if a province has complex topography and different climate and vegetation types (such as Tibet in the Southwest region), it must also be included in one region. We think that this may have led to far higher number of forest fire driving factors in the Southwest region than in the other regions. The second point is about the model. To reveal the nonlinear relationship and influence threshold between forest fire driving factors and forest fire occurrence probability, we used the random forest model, which has shown good prediction ability in previous studies on forest fire [18,30,31]. However, behaving as a "black box," this method cannot calculate regression coefficients or confidence intervals [63,79]. Based on these two points, in future research, we will try to use geographically weighted regression, a spatially explicit technique that would overcome the necessity of building predetermined regions, to analyse forest fire driving factors to address these limitations.

#### **5. Conclusions**

We used the random forest model to analyse the forest fire driving factors in different geographical regions in China for 2010 to 2018. The model predictive capability is good, with a prediction accuracy between 70.0% and 91.4%. Furthermore, we mapped the probability of forest fire occurrence in China based on the results of the model. In China, there are obvious regional differences in the types of forest fire driving factors and their impacts. Climate variables (especially temperature and humidity) have major impacts on forest fires occurrence, and the vegetation variable is secondary. Topographic variables and socioeconomic variables are only the forest fire driving factors in the Southwest and Northwest regions. There is a nonlinear relationship and influence threshold between forest fire driving

factors and forest fire occurrence probability. High fire hazard areas are concentrated in the Northeast and Mid-south regions as well as the south of East China region and the northwest of Northwest region. This research will aid in providing a national-scale understanding of forest fire driving factors and fire hazard distribution in China and help policymakers to design fire management strategies and allocate resources reasonably to reduce potential fire hazards.

**Author Contributions:** Conceptualization, W.M.; Data curation, W.M. and Z.C.; Formal analysis, W.M.; Funding acquisition, Z.F.; Investigation, W.M.; Methodology, W.M.; Resources, F.W.; Supervision, Z.F.; Writing–original draft, W.M.; Writing–review & editing, W.M. and S.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Fundamental Research Funds for the Central Universities (No. 2015ZCQ-LX-01) and the National Natural Science Foundation of China (No. U1710123).

**Acknowledgments:** We would like to acknowledge for the data support from National Earth System Science Data Center, National Science & Technology Infrastructure of China. (http://www.geodata.cn)

**Conflicts of Interest:** The authors declare no conflicts of interest.

## **Appendix A Appendix**

**Table A1.** The results when identifying the significant variables in each intermediate model in six geographical regions and the whole study area.



**Table A1.** *Cont.*


**Table A1.** *Cont.*

VIF (variance inflation factor) was used to measure the amount of multicollinearity in the explanatory variables. When VIF > 10, then collinearity in the explanatory variable exists and is excluded in the random forest model. "+" indicates that the variable was identified as being a forest fire driving factor in a given region, and "/" indicates that the variable is excluded due to multicollinearity. Pre\_year0: annual precipitation in the year before the fire; Pre\_year1: annual precipitation in the year of the fire; Soil\_mois0: annual soil moisture in the year before the fire; Soil\_mois1: annual soil moisture in the year of the fire; Tem\_avg: daily average temperature; GST\_avg: daily average ground surface temperature; RH\_avg: daily average relative humidity; RH\_min: daily minimum relative humidity; Pre\_daily: daily precipitation; Pres\_avg: daily average air pressure; SSD: sunshine hours; Win\_avg: daily average wind speed; Win\_max: daily maximum wind speed; DEM: elevation; FVC: fractional vegetation cover; Dis\_road: the distance to road and railway; Dis\_sett: the distance to settlement; Pop: population density; GDP: gross national product.

**Figure A1.** *Cont.*

**Figure A1.** Importance rankings of the significant variables according to the mean decrease accuracy in each intermediate model in six geographical regions and the whole study area. The abbreviated variable names are the same as in Table A1.

## **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Forest Fire Probability Mapping in Eastern Serbia: Logistic Regression versus Random Forest Method**

**Slobodan Milanovi´c 1,2,\* , Nenad Markovi´c 3, Dragan Pamuˇcar 4, Ljubomir Gigovi´c 4, Pavle Kosti´c <sup>5</sup> and Sladjan D. Milanovi´c <sup>6</sup>**


**Abstract:** Forest fire risk has increased globally during the previous decades. The Mediterranean region is traditionally the most at risk in Europe, but continental countries like Serbia have experienced significant economic and ecological losses due to forest fires. To prevent damage to forests and infrastructure, alongside other societal losses, it is necessary to create an effective protection system against fire, which minimizes the harmful effects. Forest fire probability mapping, as one of the basic tools in risk management, allows the allocation of resources for fire suppression, within a fire season, from zones with a lower risk to those under higher threat. Logistic regression (LR) has been used as a standard procedure in forest fire probability mapping, but in the last decade, machine learning methods such as fandom forest (RF) have become more frequent. The main goals in this study were to (i) determine the main explanatory variables for forest fire occurrence for both models, LR and RF, and (ii) map the probability of forest fire occurrence in Eastern Serbia based on LR and RF. The most important variable was drought code, followed by different anthropogenic features depending on the type of the model. The RF models demonstrated better overall predictive ability than LR models. The map produced may increase firefighting efficiency due to the early detection of forest fire and enable resources to be allocated in the eastern part of Serbia, which covers more than one-third of the country's area.

**Keywords:** occurrence of forest fire; machine learning; variable importance; prediction accuracy

## **1. Introduction**

Forest fires, as global phenomena, present numerous forms of threats to many countries around the world, from Canada and Siberia through to Indonesia, Australia, Africa, and the Amazonia. Although statistics show that the frequency of fires, alongside the total burnt area, have been decreasing from year to year globally [1], several regions will experience larger and more intense fires [2–4]. The Mediterranean region is traditionally the most at risk in Europe [5], but in recent years, forest fires have also become an important issue in Central Europe [6]. Among other European countries, Serbia experienced increased fire activity during the last two decades [7].

The increasing risk and associated damage caused by fires are usually linked to climate changes [8]. Considering the current climate scenario [9], which predicts an average temperature rise of 4–6 ◦C by the end of this century [10,11], and a decrease in total rainfall with an uneven distribution throughout the year in the south of Europe featuring long

**Citation:** Milanovi´c, S.; Markovi´c, N.; Pamuˇcar, D.; Gigovi´c, L.; Kosti´c, P.; Milanovi´c, S.D. Forest Fire Probability Mapping in Eastern Serbia: Logistic Regression versus Random Forest Method. *Forests* **2021**, *12*, 5. https:// dx.doi.org/10.3390/f12010005

Received: 24 November 2020 Accepted: 17 December 2020 Published: 22 December 2020

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/ licenses/by/4.0/).

periods of drought in summer, an increased risk of forest fire is expected in Europe [5]. The greatest changes in Europe are expected in the transition between the Mediterranean and Euro-Siberian regions [12], where the Balkan Peninsula is situated. The most dominant oak and beech forest may be replaced by evergreen Mediterranean vegetation [13,14], which is more prone to forest fire. Therefore, a further increase in forest fire risk can be expected due to changes in the fuel type.

Aside from the direct damage in terms of wood loss, which is obvious and easy to quantify, many other risks may appear following forest fires, due to the slow process of regeneration, especially in conifer forests [15]. Secondary pests, such as bark (Coleoptera: Curculionidae: Scolytinae) and ambrosia beetles (Coleoptera: Buprestidae) and diseases, that attack physiologically weakened trees may reach an outbreak population level and affect a much larger area than forest fire alone [16]. To minimize the harmful effects of forest fires, it is necessary to create an effective protection system against fire. According to San-Miguel-Ayanz [17], forest fire risk is defined as the probability of a fire happening and its consequences. Therefore, any increase in fire probability or its consequences directly results in an increase in fire risk. Our research has focused on forest fire occurrence probability mapping as a component of the future system for forest fire risk assessment. Forest fire probability mapping is a basic tool in risk management [18,19] and allows the allocation of resources for fire suppression, within a fire season, from zones with a lower risk to those under higher threat. Also, in the long run, mapping fire occurrence probability is a very valuable tool in forest management planning that improves forest resilience to fire through the implementation of various types of fire preventive silvicultural measures [20–23].

Among traditional methods, logistic regression (LR) is the most common when dealing with binary outcomes, like the presence or absence of fire [24]. Conversely, machine learning (ML) methods, as a form of artificial intelligence, are widely used in wildfire science and management, with more than 300 articles published on this topic since the 1990s [25]. Random forest (RF) belongs to the decision trees branch of the same group of ML methods [26]. Our study had the following objectives: (1) to map the probability of forest fire occurrence in Eastern Serbia based on LR and RF; (2) to determine the main explanatory variables for forest fire occurrence for both models.

## **2. Materials and Methods**

## *2.1. Study Area*

The study area is in the eastern and southern parts of Serbia (Figure 1). It has a total land area of 30,235.5 km2. Broadleaved, conifer, and mixed forest cover 12,587.5, 170.7, 340.6 km2, respectively. Natural grasslands, transitional woodland-shrub, and sparsely vegetated areas cover 925.7, 2949.5, and 48.0 km2, respectively, while other areas represent agricultural, urban, or other non-wood areas. The elevation ranges from 28 to 2169 m. It has been observed that durations of extremely hot weather last longer and periods of extremely cold weather are shortened compared to the reference period of 1960–1990 [27]. These trends of extreme temperature conditions are most pronounced in the summer season [27]. A decrease in spring precipitation has been detected over the central and eastern parts of Serbia [28]. The annual quantity of rainfall is often insufficient, and droughts are evident in eastern and south-eastern parts of Serbia [28]. Scenarios where the monthly quantity of precipitation falls in only a few days of the month are expected to become more frequent, which will lead to more extreme weather events such as floods and droughts [29]. Serbia has two peaks in its fire season [30]. The first peak occurs in March, which is associated with 25% of the annual burnt area, while the second and largest peak appears in August [31]. In August 2012, more than 40 large fires were recorded, including two fires that were more than 1000 ha in size [32]. More than 16,000 ha of forest land were destroyed in 2007, with the value of the wood lost was estimated at 40 million euros [33].

**Figure 1.** The geographic position of study area, Eastern Serbia, is marked in gray (latitudes 42.27– 44.82◦ N and longitudes 20.90–23.01◦ E) with the layer of forest fire hotspots obtained by NASA's Fire Information for Resource Management System (FIRMS) (MODIS fire hotspot) for the period of 2001–2018.

## *2.2. Data Preparation*

## 2.2.1. Dependent Variable

Data regarding the fire location were obtained from NASA's Fire Information for Resource Management System (FIRMS), which distributes near real-time fire products from Moderate Resolution Imaging Spectroradiometer (MODIS) from the Terra and Aqua platforms with a spatial resolution of 1 km (https://firms.modaps.eosdis.nasa.gov). Each position of a MODIS-identified active fire represents the center of a 1 × 1 km pixel that is labeled by the algorithm as containing one or more fires inside the pixel. Only fire pixels with a confidence higher than 80% for the period from January 2001 to December 2018 were considered for further analysis, because in certain cases the product underestimates the occurrence of fire [34].

#### 2.2.2. Independent Variables

Independent variables were grouped into four categories: topography, vegetation, anthropogenic factors, and climate. Specific variables within categories were selected based on previous studies on forest fire occurrence [24,34–39]. Detailed information on preselected variables is presented in Table 1.


**Table 1.** Independent variables considered for forest fire occurrence models with codes, units, source and calculated variable inflation factors (VIFs).

<sup>1</sup> excluded from further analysis due to high Spearman's rho correlation coefficient with the distance to agricultural land (DAgL). <sup>2</sup> included as matching criterion in propensity-score matching and therefore excluded in further analyses \* categorical explanatory variable.

Topographic parameters, elevation, slope, and aspect, were derived from the digital elevation model (DEM) with a precision of 3 arcsec previously downloaded from the United States Geological Survey (http://landsat.usgs.gov//landsatcollections.php). Average values of elevation, slope, and dominant aspect were calculated for each polygon in a 1 × 1 km grid using ArcGIS software (ESRI, Redlands, CA, USA). Obtained values of the slope and dominant aspect were divided into classes according to Carmo et al. [35], while elevations were divided into classes of 200 m (Figure 2).

Vegetation and land cover data were obtained from the CORINE 2012 data set (https://land.copernicus.eu/pan-european/corine-land-cover/clc-2012). The vector layer was intersected with polygon grid data. Objects in this vector layer were filtered for the following CORINE 2012 land cover classes (CLCs): broad-leaved forest (BF), coniferous forest (CF), mixed forest (MF), natural grasslands (NG), transitional woodland-shrub (TWS), and sparsely vegetated areas (SVA) (Figure 2). Intersecting the polygon grid data with the polygon CLC layer filtered this way, a new polygon layer with a table of attributes containing a polygon grid Object ID, CLC class ID with its description, and an area of CLC class that falls into the respective grid polygon were generated.

**Figure 2.** Categorical predictors related to forest fire in Eastern Serbia. (**a**) Classes of elevation; (**b**) aspect classes; (**c**) slope categories; (**d**) land cover categories obtained by CORINE 2012.

Data on drought code (DC), as a component of the Canadian Forest Fire Weather Index [40], were obtained from the Republic Hydrometeorological Service of Serbia (http://www.meteoalarm.rs/eng/fwi\_osm.php) as tables with coordinates of meteorological stations with values on the drought code for each day in the observed period (2012–2018). Each of these tables was then converted into a spatial point layer with points representing meteorological stations and a respective table of attributes with values of the drought code. The layer was then used for interpolation using the ordinary kriging method [41], resulting in raster layers whose pixels represent values of drought code. This way, we were able to calculate zonal statistics of drought code for each day and

each polygon of a grid layer for raster layers that represent the observed period. Finally, average values for each cell in the grid cells were calculated based on all of the drought code data for the observed period.

Layers of roads, populated places, railroads, and agricultural land were used for the calculation of the distance from each object of a 1 × 1 km grid to the nearest object of the respective layer, so the attribute table of the generated point and polygon grid layer was extended with distances to the nearest populated place, road, railway and agricultural land (https://www.openstreetmap.org).

Population data were obtained from a raster dataset available for download in Geo-TIFF format at the Center for International Earth Science Information Network (CIESIN), Columbia University (http://dx.doi.org/10.7927/H4639MPP). The sum of the number of people per polygon grid was also calculated by a zonal statistic tool.

At the end of the GIS portion of the analysis, Boolean values (Yes/No) were assigned to the elements of the grid by using the Spatial Join tool. Values of Yes were assigned to the elements for which the historical wildfire(s) occurred, and values of No were assigned to those for which it did not.

Each cell with at least one historical record of a fire event from 2001 to 2018 was classified as a fire cell and coded as one (1). In total, 429 cells were selected as a fire cell for Eastern Serbia. The use of binary LR alongside RF classification assumes that the predicted variable is dichotomous. Therefore, necessary non-fire cells were sampled based on the propensity score matching (PSM) method [42] and coded as zero (0). In general, a propensity score analysis reduces the effect of confounding in observational (nonrandomized) studies and can be used for matching, stratification, inverse probability of treatment weighting, and covariate adjustment, all based on the propensity score [43]. Thus, PSM forms matched sets with equal ratio of treated and untreated subjects [44], in our case, of fire and non-fire cells, who share a similar value of propensity score. Matching without replacement [45] was applied for the selection of non-fire cells in this study. Namely, non-fire cells can be selected only once as a match for a given fire cell. Among greedy and optimal matching procedures of PSM, the latter was used in this study. Both approaches choose the same sets of controls for the overall matched samples, but optimal matching does a better job of minimizing the difference in propensity score value within each pair [46]. PSM was conducted in the R package 'MatchIt' v 3.02 [47].

As it was important to avoid the selection of non-fire cells in the vicinity of the fire cells, because of the similar environmental conditions, we tested distances among created pairs of cells. The distance between pairs of fire and non-fire cells varied from 2.2 to 268.1 km, with an average of 134.2 km. Additionally, to validate the created models, whole paired samples were randomly divided into two equal subsamples, training and validation, with the same number of fire and non-fire cells [24,39,48,49].

One of the basic assumptions that must be met before applying LR is the absence of high correlations (multicollinearity) among the explanatory variables (predictors) included in the model. Explanatory variables were checked for multicollinearity by the variance inflation factor (VIF) and Spearman's rho correlation [50]. Only variables with VIF ≤ 10 and a Spearman's rho coefficient lower than 0.7 [51] were considered for model building.

## 2.2.3. Forest Fire Occurrence Frequency across Categorical Predictors

Forest fire distribution across categorical predictors was analyzed for all events in the 15 years. Observed versus expected frequencies were analyzed and compared. Observed frequencies represented the number of forest fires that occurred within the explanatory variable category, while the expected frequencies were represented by the surface covered by the respective variable category in the study area [24]. Comparisons between observed and expected frequencies were based on chi-square statistics [52] at a significance level of 0.001. Mean values were calculated for each class of categorical predictors, and histograms were created accordingly.

## *2.3. Modeling Procedures*

Based on collected and generated data, the model of probabilities for forest fire occurrences was generated for each cell of the polygon grid for the territory of Eastern Serbia.

#### 2.3.1. LR Models

The algorithm obtained by the LR calculates the conditional probability of the fire event occurring from one or more input variables [53]. Also, LR can be used for estimating the contribution and significance of each variable by the Wald test [54], and afterwards can select a combination of variables that can be introduced in more complex models [55]. To estimate forest fire occurrence probability in Eastern Serbia by LR, the following equation was applied:

$$\text{Pi} = \frac{1}{1 + e^{-\left(\alpha + \beta 1Xi1 + \beta 2Xii2 + \dots + \beta \, pXip\right)}}\tag{1}$$

with "Pi" as the probability of the forest fire occurrence, "*α*" as a constant, "*β*" as a coefficient from regression, and "*X*" as original values of variables.

#### 2.3.2. RF Models

This method uses a large number of decision trees, which produce their predictions and combine them into a single, more accurate, prediction [56]. RF has gained popularity in recent years, as it has been proven to perform better than LR according to several studies [39,57,58]. Specifically, it considerably outperforms LR in the accuracy measured, as well as the Brier score and area under curve (AUC) [59–61], on large and diverse datasets. Thus, considering the similarity of forest fire probability prediction to other risk-assessment applications, we considered RF as a strongly supported model of choice worth investigation. Similar to LR, the RF method is also good at handling both categorical and continuous types of explanatory variables.

The method proposed by Ye et al. and Genuer et al. [37,62] was applied for the variable selection. Each variable was included in the model *N* − 1 times, where *N* represents the number of variables considered as a predictor. RF models were run 16 times and an additional variable was excluded in each iteration. The variable importance obtained from each iteration was used for the calculation of relative variable importance, which represents an average value for a specific variable ranked from 0 to 1. All variables were scored from 1 to 16, according to the average importance. The variable with the highest average importance was scored as 1, and the variable with the lowest importance was scored as 16. Then, a RF model with each of the 15 highest ranked variables was generated for comparison with LR models. RF analyses were conducted by the default value of *mtry* (4), which represents the number of variables at each split and the 100 trees in the forest (ntree).

## *2.4. Model Validation*

Model evaluation was conducted by a receiver operating curve (ROC) analysis, calculating the proportion of fire cells, classified as fire (sensitivity), and non-fire cells, classified as non-fire (specificity), for the obtained models. A ROC curve plots the values that represent sensitivity versus values that represent 1—specificity for the range of possible cut points [63]. AUC values between 0.5–0.7 indicate poor precision, values between 0.7–0.8 indicate acceptable precision, values between 0.8–0.9 indicate excellent precision, and values higher than 0.9 indicate outstanding model precision [63]. Additional evaluation of the models was conducted by 2 × 2 classification tables based on the overall accuracy. Thus, the tables are a result of the cross-classification of the dichotomous outcome variable, whose values are derived from the predicted and observed probabilities. Also, to compare the predictive capacity of the models produced, the distribution of fire cells of the validation group across classes of forest fire occurrence probability was analyzed. Due to the limited size of data, the k-fold cross-validation (CV) method was also applied to compare obtained models. Analysis was conducted in the R package 'Caret' v 6.0–86 [64] using "glm" and "rf" methods for LR and RF, respectively. ROC analysis,

accuracy, and Kappa values obtained from CV were used to compare models, with k-value set as 10.

To designate each cell as fire or non-fire, an optimal cutoff point must first be defined. For the models selected by the LR and RF procedures, based on the ROC analysis and prediction accuracy obtained under the training subsets, the optimal cutoff point was determined by the sensitivity equals specificity method [65] using the easyROC webtool [66]. Then, the estimated probability for each cell was compared to the optimal cutoff point. If the estimated probability for the particular cell exceeded the optimal cutoff point, then that cell was designated as a fire cell. Conversely, if the estimated probability was lower than the optimal cutoff point, a particular cell was designated as a non-fire cell. Cutoff values were applied within selected models to both training and validation subsets.

#### *2.5. Variable Importance Analysis*

The quantification of variable importance is used for variable selection in many applied studies. To assess the importance of individual variables selected by the LR procedure, Wald statistics were applied to the models [67]. Variable importance in RF classification could be assessed by the permutation importance indices [68] or by the Gini impurity function for a classification problem [69]. For RF, the variable importance measures are summed across all trees in the forest and scaled in the same manner so that the most important variable has a value of 1.

## *2.6. Mapping Forest Fire Occurrence Probability*

Forest fire occurrence probability calculated by the LR or RF models for fire cells and non-fire cells was used to map the forest fire occurrence probability in Eastern Serbia in ArcGIS 10.2. According to Nhongo et al. [34], the produced map was classified into five categories: very low (0.01−0.20), low (0.21−0.40), medium (0.41−0.60), high (0.61−0.80), and very high (0.81−1.00).

All statistical analyses related to LR and RF were conducted using the software Statistica 13 (TIBCO Software Inc., Palo Alto, CA, USA).

## **3. Results**

#### *3.1. Forest Fire Occurrence Frequency*

Forest fire distributions across the categorical explanatory variables are displayed in Figure 3. An almost normal distribution of forest fires was recorded across elevation classes, with higher frequencies at elevations between 200 and 1000 (with a less pronounced peak in the class of 600−800 m), and with lower frequencies below and above this range (Figure 3a). The frequencies of forest fire across elevation classes highly correspond to surface area covered by the respective classes (χ<sup>2</sup> =39.83, *p* < 0.001). This connection was not significant between forest fire occurrence and aspect classes (χ<sup>2</sup> = 1.11, *p* < 0.774). Regarding aspect, the highest frequencies of fires occurred on southern aspects, then almost equally eastern and western aspects, while the lowest frequency was recorded on northern aspects (Figure 3b). The highest frequency of forest fire was recorded on the slopes with an inclination between 10◦ and 19.99◦, then between 5◦ and 9.99◦, while the lowest frequency was recorded at the lowest inclination (0−4.99◦). No fire was recorded at the highest inclination of over 30◦ (Figure 3c). Forest fires were influenced by slope (χ<sup>2</sup> = 47.54, *p* < 0.001). Also, land cover type had a significant influence on forest fire occurrence (χ<sup>2</sup> = 107.89, *p* < 0.001). More than 55% of forest fires occurred in broad-leaved forests, and almost 29% and 14% occurred in the transitional woodland-shrubs and natural grasslands respectively, while less than 2% of forest fires occurred in the mixed forest and coniferous forests. No forest fire was recorded in the sparsely vegetated area (Figure 3d).

**Figure 3.** Frequencies of forest fires and areas covered by categorical variables: (**a**) elevation classes, (**b**) aspect classes (exposure), (**c**) slope degree classes, and (**d**) vegetation classes obtained from CORINE land cover 2012 that are present in Eastern Serbia: BF: broad-leaved forest; CF: coniferous forest; MF: mixed forest; NG: natural grasslands; TWS: transitional woodland-shrubs, and SVA: sparsely vegetated areas.

## *3.2. Models of Forest Fire Occurrence*

In the first step, the total forested area (TFA) variable was excluded for further analysis due to the high correlation (*R* = 0.84) with distance agricultural land (DAgL), as displayed in Figure S1 (see the Supplementary Materials). The remaining 16 explanatory variables met the conditions of VIF ≤ 10 and were considered for the future models. After several iterations, the final models were created by LR and RF procedure for 15 variables. The highest impact on fire probability had a drought code, followed by distance to municipality, distance to water, distance to railway, and distance to arable land, while a relative contribution of mixed forest had the lowest impact in the RF model. Also, in the LR models, the highest impact on fire probability had a drought code, while a relative contribution of coniferous forest had the lowest impact on forest fire occurrence (Table 2).

The overall prediction accuracy of the LR and RF models, calculated by using an optimal cut-off value, was 86.7 and 87.7%, respectively (Table 3). Moreover, the LR and RF models display AUC values of 94.2% and 94.5%, respectively, affirming a high predictive power (Figure 4b).


**Table 2.** Explanatory variable importance evaluation based on Wald statistics for linear regression (LR) and Gini impurity for random forest (RF) models.

ns not-significant, \* significant at *p* < 0.05, \*\* significant at *p* < 0.01, \*\*\* significant at *p* < 0.001.

**Table 3.** Classification tables for the training and validation sets of data based on LR and RF models, with applied cut off values, according to the sensitivity equals specificity method.


**Figure 4.** Receiver operating curve (ROC) for LR and RF models: (**a**) training data set, (**b**) validation data set.

## *3.3. Relative Importance of Variables*

In the LR models, drought code was the most important variable for fire occurrences, followed by the distance to rail and agricultural land. Conversely, the lowest importance of the forest fire occurrence was associated with the contribution of coniferous and mixed forest, and also aspect classes (Table 2). In the RF models, drought code was the most important variable followed by the distances to municipality, water, rail, and arable land. Similarly, to the LR models, contribution of mixed and coniferous forest as well as aspect classes had the lowest impact on the RF models (Table 2). Interestingly, distance to municipality was recognized by RF as very important model variable, while in the LR models this variable did not have a statistically significant effect on model performance.

## *3.4. Spatial Modeling of Probability of Fire Occurrence*

Zones with a very high probability of forest fire occurrence were situated in the southeastern part of the study area in both models and vary from 19.7% (LR) to 18.9% (RF) of the forested area. Zones with a very low probability of forest fire occurrence were situated in the northwestern part of the study area and range from 21.1% (LR) to 22.2% (RF) of the forested area (Figure 5).

**Figure 5.** Maps of forest fire probability based on (**a**) LR models, (**b**) RF models.

#### *3.5. Model Validation*

LR models performed better in the very low fire distribution class, compared to the RF models, identifying lower forest fire incidence in the validation set of data. On the other hand, RF models performed better in high and very high classes than LR models, identifying a higher number of forest fire events in the same set of data (Table 4). If the two lowest classes (low and very low) classes are compared among models, then there is a very slight difference between the LR and RF models (e.g., 23.3% vs. 23.7%, respectively). On the other hand, the RF models classified 58.6% of forest fire events in the two higher classes (high and very high), while the LR models classified 54.0% in the same classes of the validation set of data. The overall prediction accuracy of the LR and RF models, based on 10-fold cross validation, was 86.5 (kappa = 0.7296) and 91.7% (kappa = 0.8345), respectively. Moreover, the LR and RF models display AUC values of 92.4% and 97.5%, respectively, after 10-fold cross validation.

**Table 4.** Forest fire distribution (%) across different probability classes based on LR and RF models for the validating set of data.


#### **4. Discussion**

There has been a huge gap in forest fire risk assessment in Serbia without any system for forest fire risk assessment at the national or even regional scale [70]. Within this study, the first step has been made on that course by producing maps with forest fire occurrence probability that cover more than one-third of the state territory. To map the forest fire probability in Eastern Serbia, the area that has been the most affected by forest fires in the past [32], two different methods were applied, LR and RF. Both methods have been widely used in forest fire risk assessment within the past few decades, with the dominance of LR at the beginning of this century [24,35,50,55,71], while the RF method has prevailed since the last decade [25,37,58,60,70]. The RF models had slightly better performance than the LR models in the training and validation sets. Better performance of the RF models over the LR models in fire occurrence prediction is consistent with similar studies in our region [72,73] and other regions [39,58]. The LR models shift performance from outstanding in training data sets to excellent in validation data sets, while in the RF procedure this shift is less pronounced according to applied model classification [63]. Based on the results obtained within this study and literature survey, we can strongly recommend the use of the RF method for the forest fire occurrence mapping of the entire territory of Serbia.

The propensity-score matching method was used to select non-fire events for known fire events. This method was adopted from the field of medical sciences [74–77], like many others, and it was used for the first time in natural hazard risk assessment by Hudson et al. [78] for evaluating the effectiveness of flood damage mitigation measures. We used this method for the first time in forest fire risk assessment to pair historical fire event data obtained from the NASA FIRMS with non-fire events using elevation as matching criteria.

Elevation had a significant and positive effect on forest fire occurrence probability, with higher frequencies observed between 200 and 1000 m. The fire season is shorter at higher elevations due to snow melting later, leading to a lower fire frequency [79]. Also, a higher relative humidity due to a decrease in temperature of 1 ◦C per 100 m rise in elevation reduces chances for fuel ignition [80]. However, the rapid decrease in fire frequency at altitudes over 1000 m, observed by Ramón González et al. [71], was not recorded in our study. It is more likely that climate changes will influence a positive effect of the elevation on forest fire frequency in the future, as has already been shown by Schwartz et al. [81]. Southern, intermediate slopes between 5◦ and 20◦ experienced increased fire activity in our study due to the lower moisture of combustible material being exposed to the sun radiation more than other aspects with milder and steeper slopes [34,81,82]. All topographical features strongly affect vegetation and its burn ability [83]. Forest fire frequency was much higher in the broad-leaved forests than in the coniferous forests. The negative effect of the distance to water on forest fire occurrence probability may be linked to the NASA sensor's ability to detect only larger forest fires, while a smaller fire is suppressed easily when it is in the vicinity of the water bodies and the early phase of development are thereby not detected. Conifers cover less than 1% of the study area and were scattered in the higher mountains within the bigger broad-leaved forest complex and therefore were lesser exposed to anthropogenic activity, explaining the negative effect of its proportion

on fire activity. Transitional woodland-shrubs are usually situated in the zone between forests and agricultural or arable land. Both types of land cover are connected to spring and autumn burning activity, and therefore transitional woodland-shrubs had increased fire frequency. Forests near human settlements and infrastructures that are densely populated are more prone to small fires due to negligence and/or accidental ignitions [84], while distant locations are prone to often larger, but less frequent, forest fires [85]. Therefore, an expected decrease in the population density in rural areas in the southern part of the study area [86] may result in lower fire activity in the future. Conversely, the expansion of urbanized areas and road cover with an associated increase in population density due to migration from south to north can be expected to lead to higher fire activity in the northern part of the study area.

Zones with the highest probability for forest fire occurrence are located in the southeastern part of the study area in all models, which correspond to the more pronounced drought periods during summer [27,28]. A higher drought code (higher temperature and lack of precipitation) leads to the lower moisture of fuel and makes an area more susceptible to ignition. It is well known that higher temperatures [87] reduce fuel moisture, making the fuels highly susceptible to ignition. Additionally, a study by Chang et al. [88] described low precipitation as a determinant factor for ignition. Drought code was the most important variable, followed by anthropogenic features, in both the LR and RF models. These results were consistent with other studies on determinant factors for the occurrence of wildfire where climatic and anthropogenic predictors had a higher influence on the fire occurrence probability [70,89–92]. All those models were efficiently applied at a smaller scale (such as national parks or protected areas), while our models showed similar efficacy at a larger scale. The produced maps can be used by firefighting services for strategic and operative planning. Defined zones with higher forest fire occurrence probability, in the southeast of the study area, should be intensively monitored during the fire season, especially during the second peak in August [30], when the largest forest fires can be expected [31]. Intense monitoring allows early detection of forest fires and leads to rapid response. All of these measures, along with enhanced equipment for fire suppression, can significantly decrease the burnt area in Eastern Serbia. Also, silvicultural measures that reduce fire risk [23] can be applied under zones with a higher forest fire occurrence probability according to the created maps. Thus, in the fire prone zone, less flammable tree species [93] should be selected for afforestation. Additionally, to prevent the transfer of ground fire to the crown, the introduction of silvicultural measures, such as pruning of the lower branches, should be obligatory in order to reduce fire hazard in the vulnerable zones. Other fuel reduction treatments, such as thinning, prescribed burning, and fuel breaks, can be useful tools to achieve these objectives at the landscape level [20,21].

## **5. Conclusions**

The overall accuracy of the RF models was higher than those of the LR models. Both types of model identified drought code and anthropogenic features as the most important forest fire predictors. The models displayed a very high predictive ability, but the RF models were slightly more efficient and could be recommended for forest fire occurrence mapping in the eastern part of Serbia. The obtained maps could improve the efficacy of forest fire suppression in the study area in several ways. First, the fire probability map could be used for position optimization of the devices used in the early detection of forest fires. Also, firefighting resource allocation could be planned and applied in a manner consistent with the fire frequency. Finally, forest management planning and silvicultural measures should be adapted in terms of the forest fire risk reduction, based on the obtained maps.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/1999-4 907/12/1/5/s1: Figure S1: Correlation plot for all preselected variables based on Spearman's rho coefficient; Table S1: Tables of contingency for training and validation sets of data for the tested random forest models.

**Author Contributions:** Conceptualization: S.M.; methodology: S.M. and S.D.M.; validation: D.P., L.G., and P.K.; formal analysis of Logistic Rregression: S.D.M., and D.P.; formal analysis of Random Forest: S.M., and P.K. investigation: S.M. and S.D.M.; resources: N.M.; data curation: N.M. and L.G.; writing—original draft preparation: S.M.; writing—review and editing: N.M., D.P., and L.G.; visualization: S.D.M. and N.M.; supervision: S.M. and P.K.; project administration: S.D.M.; funding acquisition: S.M. and S.D.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Agriculture, Forestry and Water Management of the Republic of Serbia-Forest Directorate (contract: 401-00-1713/2019-10) and by the Ministry of Education, Science and Technological Development of the Republic of Serbia (contract: 451-03- 68/2020-14/200015).

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**

