1. Introduction
Forest fires occur regularly in our country during the winter (post-monsoon) to summer (pre-monsoon) seasons. These events occur on large spatial and temporal scales; for example, between November 2020 and June 2021, the Moderate Resolution Imaging Spectroradiometer (MODIS) detected 345,989 forest fires across various districts in India [
1]. The Forest Survey of India (FSI) estimates that nearly 4% of our forest cover is highly prone to frequent forest fires, whereas 6% is very highly prone to fire [
2]. Forest fires are a global concern due to the extensive damage they cause to ecosystems and communities. Further, they are a major cause of land degradation, leading to biodiversity loss. While most fire events are controlled or restricted within the forest area, some extend beyond the forest floor and cause damage to adjoining civilian areas.
Odisha has 52,156 square kilometres of forest coverage, and forest land covers 33.5% of the overall geographical area, providing habitat for a diverse range of plants and fauna [
3]. Odisha ranks ninth among Indian states in terms of the number of districts highly vulnerable to fire events. A total of 31–47% of the population lives in these vulnerable districts, reflecting their dependence on forests. Forest ecosystems face a substantial threat involving the loss of timber, fruit-bearing trees, medicinal plants, and wildlife habitats. Among the most active fire spots, Odisha recorded 659 events in the short span between 1 and 8 February 2023. The complex relationship between forest growth and wildfires is particularly noticeable in this region, where ground fires endanger the seedlings essential for regeneration during the monsoon season. This disrupts the natural cycle, hindering the forest’s ability to regenerate and posing a long-lasting ecological problem.
The FSI conducted a study based on spatial analysis of forest fire sites recorded between 2004 and 2021 to identify fire-prone forest areas in the country. The classification of central Odisha (which accounts for 10.66% of the fire-prone forest area in India) as being at exceptionally high risk of fire underscores the region’s high susceptibility. The need to address this issue is even more urgent considering the recent incidents in Odisha, where forest fires have been widespread. Since October 2022, an extended drought has worsened the situation by promoting the rapid propagation of wildfires. The magnitude of the issue becomes evident when examining the data from the FSI, which, as of 7 March 2023, had recorded 391 fire occurrences across the country [
4] (
https://fsiforestfire.gov.in/index.php, last accessed on 26 May 2024).
A complex interaction of meteorological factors and environmental circumstances determines the catastrophic consequences of wildfires. Understanding how these factors influence wildfire initiation and spread is critical for building effective early warning systems. This research investigates the interaction between meteorological data, environmental conditions, and wildfire incidence, focusing specifically on forest fires in Odisha, India. The temperature of air close to the surface, the surface or skin temperature, the relative humidity of air close to the surface, soil moisture, and precipitation are the dominant meteorological parameters that influence the occurrence of a fire event.
Numerous studies have reported the connection between meteorological parameters and the occurrence of fire events. In one of the earliest works involving remote sensing and geographic information systems (GISs), a study evaluated the likelihood of fires using GIS and spatial data [
5]. The authors created a thorough database and mapped out areas with high fire risk. Another similar study examined the impact of climatic conditions on wildfires [
6]. The authors used remote sensing data to analyse how climate variables, vegetation, and fire size interact. Their findings indicate that certain precipitation levels can lead to an increased risk of wildfires, highlighting the importance of geographical factors. Another study focused on fires caused by human activities and examined the pre-monsoon conditions and the influence of different species on fire patterns in the Himalayan region [
7]. This study considered factors such as the gathering of biomass, the conditions before the monsoon season, and the levels of moisture present.
The need to use real-time monitoring, science process algorithms, and photography to improve forest fire preparedness and response was highlighted by [
8]. Plant and land cover changes also significantly impact wildfires [
9]. This was shown utilising digital photo categorisation and ISODATA to analyse these changes. Further, the Normalised Difference Vegetation Index (NDVI) and GISs were used to detect regions susceptible to drought and create a detailed representation of the fire damage intensity of the forest fire near Karabaglur, Turkey [
10].
A time series analysis of remote sensing data to analyse fire disturbance and forest recovery across Canada shows that accounting for the temporal variability of the NDVI within unburned areas aided the definition of recovery times to pre-burn levels, which typically took five years or more following a fire [
11]. An extensive knowledge of fire dynamics is obtained using remote sensing methods, especially in the summer, for wind pattern analysis and precipitation data to measure vegetation susceptibility. Furthermore, the awareness of drier weather patterns and human-induced fire hazards highlights the need for proactive forest fire control plans [
9]. More studies suggest integrating burnt area estimates with ground operator data, further improving the response and emphasising the need for multidisciplinary approaches to battle wildfires. Geospatial techniques and metrics such as the NDVI and Differenced Normalised Burn Ratio (dNBR) are also used to detect drought-prone areas, i.e., high-temperature areas with low precipitation, using the two indices for burn severity mapping [
12].
A predictive classifier was developed using machine learning to categorise rain and no-rain conditions from remote sensing observations [
13]. In this study, the authors employed a dataset from remote sensing observations to train and evaluate the machine-learning models. A proactive approach to forecasting wildfires in Chapada das Mesas National Park achieved high accuracy using artificial neural networks and data-mining techniques [
14]. Recently, an ensemble method for forest fire susceptibility modelling was developed, focusing on the Western Ghats section [
15], in which the authors show that the land use land cover is an important factor having a significant role in explaining fire severity.
Forest fires are a global issue due to their destruction of ecosystems, lives, and economies. Recent studies have been directed towards early forest fire detection to limit their spread using cutting-edge remote sensing and machine learning technology. A novel early-stage forest fire detection method was developed using Himawari-8 Advanced Himawari Imager (AHI) images, a modified MOD14 algorithm, and a random forest classifier [
16]. With further satellite sensors, this can precisely spot fires—especially in Australia. Improved monitoring systems to predict grassland fires in Inner Mongolia can be developed using remote sensing and random forest models [
17]. A better fire control strategy is to utilise remote sensing data to monitor climatic elements and vegetation indices.
Using spectral reflectance data, random forest algorithms were shown to identify early drought in greenhouse tomatoes, addressing another drought stress issue [
18]. The algorithm achieved over 85% accuracy by addressing collinearity and class imbalance, making it a reasonably affordable greenhouse irrigation technique. Data mining in education reveals that uneven data make prediction models challenging. To solve class imbalance in educational datasets, random and synthetic minority oversampling techniques were evaluated [
19]. Oversampling performed better for mild imbalance and hybrid resampling for extreme imbalance. The spatial variability of Swedish forest fires using a random forest model connecting the topography, temperature, and socioeconomic factors to fire incidence was examined [
20]. Their results support targeted fire protection initiatives by offering fresh insights into the causes of forest fires in Swedish biogeographical zones.
Understanding the behaviour of forest fires and their consequences depends on the accuracy of wildfire prediction and modelling, as fire events could seriously endanger the surrounding areas and people. The synoptic atmospheric conditions at the surface and free troposphere are found to be associated with active fire months in the south central Chile region [
21]. Using CiteSpace, research trends to identify gaps in wildfire forecasts were evaluated [
22], revealing the significance of specific keywords such as “wildfire,” “prediction,” and “model,” in relation to trends in land use, precipitation, and vegetation. The authors advocated for adopting new data sources and advanced approaches to address these research gaps.
The geographical and temporal distribution of lightning-induced wildfires in Australia using ISS LIS and MODIS data revealed that lightning ignitions were infrequent, and thunderstorms did not influence peak wildfire activity [
23]. During the dry Australian season, thunderstorms were found to have minimal impact on wildfires, with other factors playing a more significant role. A comparison of mid-latitude California’s wildfire emissions alongside high-latitude Krasnoyarsk Krai using a multi-dataset approach shows that high-latitude wildfires generated more pollutants including black carbon than mid-latitude ones [
24]. This study underlined the need for thick vegetation to produce significant emissions and demanded more studies for better understanding.
Stochastic wind vectors were included in modified wildfire spread models to consider environmental uncertainty [
25]. Including wind variability in wildfire models produced more accurate forecasts than deterministic models, which usually underestimate wildfire spread risks. Their results show that accounting for environmental uncertainties helps to increase forecast accuracy and improve the control of wildfire risk.
Several studies have been conducted on the impact of forest fires on the flora and fauna of the affected forest. A study of the pyrogenic transformation of ecosystems, undertaken to evaluate forest reproduction, indicates successful reforestation and, hence, a favourable forecast for the post-fire recovery of light coniferous forests [
26]. Understanding the relationship between the nature of damage and the response of the ecosystem components can allow us to predict the response of an ecosystem after forest fires [
27].
A thorough survey of the literature shows that various approaches have been used to detect patterns or predictive factors behind the occurrence of forest fire events. These approaches include using remote sensing observations coupled with GIS and meteorological observations. However, although forest fires are a recurrent phenomenon, most of the earlier work adopts a diagnostic mode of investigation, analysing the causes and patterns of past events, and very few attempts have been made to forecast them. This study identifies a critical gap in the current literature: the lack of a proactive and predictive model for forest fires in Odisha. Hence, the present study aims to develop an integrated anomaly detection and early warning system that utilises meteorological parameters and historical wildfire data to anticipate future incidents. The objectives of this study are as follows:
To consider basic and derived meteorological parameters on a daily scale and study their effect on forest fire occurrence at different time lags using the mutual information approach.
To identify, based on the mutual information approach, the most important parameters and use them to train a random forest model that predicts the occurrence of forest fire one day in advance.
To conduct a detailed study on the performance and robustness of the random forest algorithm in addressing the class imbalance as well as the sample size.
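As an illustration of the first objective, the following minimal sketch estimates the mutual information between a discretised meteorological series and a fire flag shifted by a chosen lag. It is not the authors' implementation: the function names, the equal-width binning, the number of bins, and the toy temperature series are all assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def discretise(values, n_bins=4):
    """Equal-width binning (an assumed choice) of a continuous series."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def lagged_mi(feature, fire_flag, lag, n_bins=4):
    """MI between the feature on day i and the fire flag on day i + lag."""
    x = discretise(feature, n_bins)
    x = x[:len(x) - lag]
    y = list(fire_flag[lag:])
    return mutual_information(x, y)

# Hypothetical toy data: a fire is flagged the day after a hot day (>= 40).
temps = [30, 41, 32, 43, 31, 44, 29, 42, 33, 40]
fire = [0] + [1 if t >= 40 else 0 for t in temps[:-1]]

for lag in (0, 1):
    print(lag, round(lagged_mi(temps, fire, lag), 3))
```

In this toy series the one-day lag carries more information about the flag than the same-day pairing, which is the kind of comparison that would be used to rank parameters and lags.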
2. Problem Description
The current study aims to predict forest fire occurrence given meteorological observations over the forest area that are available a day in advance. When solar radiation is incident on a land surface, it is absorbed in the form of sensible heat, thereby causing the land surface temperature to increase. Elevated temperatures contribute to greater evaporation, dry out plants, and make them more flammable. Further, when the surface gets hot enough, a part of that sensible heat is transferred to the surrounding air through convection, which causes the air temperature to increase. If there is moisture at the surface, only part of the incident solar radiation is absorbed as sensible heat, while the remainder is taken up as latent heat. The exchange of latent heat results in the evaporation of water and its subsequent mixing with the ambient air, leading to an increase in humidity near the surface. The complex interaction of land and atmosphere regulates the local weather conditions that govern the energy and moisture transport.
Another essential metric is soil moisture, which indicates the amount of water in the soil. Dry soil means that less water is available to trees, making them dry and hence a good fuel source. Therefore, low soil moisture levels suggest a higher likelihood of ignition and prolonged fire spread. Adequate soil moisture, on the other hand, functions as a natural firebreak, slowing the spread of wildfires. The prolonged effect of drought conditions results in a lack of moisture availability at the surface, which causes the sensible heat component to dominate. Such a situation results in a very high surface temperature, which is favourable for fire incidents to occur in the presence of dry vegetation. However, precipitation in an area could mitigate this effect by reducing the sensible heat and overall land surface temperature. The present study considers the surface temperature, air temperature close to the surface, relative humidity, and soil moisture as critical meteorological parameters to forecast forest fire occurrence one day in advance. Though the literature suggests the inclusion of additional parameters such as topography, vegetation area, etc., we focus only on these four meteorological parameters due to their wide availability across all weather stations.
Since the relationship between the meteorological parameters and the occurrence of fire events is considered to have a time lag, the complex relationship between the two could best be “learned” by employing a non-parametric machine learning algorithm. Given the meteorological observations a day in advance, we considered the random forest model as a classifier to predict the future event as fire or no fire.
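The one-day-ahead setup can be sketched as a simple alignment step: features from day i are paired with the fire label from day i + 1 before any classifier is trained. The function name, the `lead_days` parameter, and the toy values below are hypothetical; the resulting pairs could be passed to any classifier, for example a random forest.

```python
def make_lagged_pairs(daily_features, daily_fire_flag, lead_days=1):
    """Pair features from day i with the fire label from day i + lead_days."""
    if len(daily_features) != len(daily_fire_flag):
        raise ValueError("features and labels must cover the same days")
    if not 0 <= lead_days < len(daily_features):
        raise ValueError("lead_days must be between 0 and the series length - 1")
    # Drop the last lead_days feature rows (no label yet) and the first
    # lead_days labels (no preceding features).
    X = daily_features[:len(daily_features) - lead_days]
    y = daily_fire_flag[lead_days:]
    return X, y

# Hypothetical daily rows: [surface temperature (K), relative humidity (%)]
features = [[305.1, 40.0], [311.4, 22.0], [306.0, 38.0], [312.2, 20.0]]
fire_flag = [0, 0, 1, 0]

X, y = make_lagged_pairs(features, fire_flag, lead_days=1)
print(len(X), y)
```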
2.1. Data
The meteorological parameters such as the surface temperature, soil moisture, air temperature near the surface, and relative humidity were downloaded from the Reanalysis Data Services (RDS), provided by the National Centre for Medium-Range Weather Forecasting (NCMRWF), under the aegis of the Ministry of Earth Sciences. The RDS service provides the regional atmospheric reanalysis data over the Indian subcontinent obtained from the Indian Monsoon Data Assimilation and Analysis (IMDAA). The IMDAA system [
28] is based on the UK Met Office’s four-dimensional variational data assimilation (4DVAR) and Unified Model. The IMDAA system provides the reanalysis data at a horizontal resolution of 12 km with 63 vertical levels up to a height of 40 km, updated hourly. The meteorological data were obtained from the RDS from January to June between 2014 and 2020 for the Odisha region bounded between 17.49° N–22.34° N latitude and 81.27° E–87.29° E longitude. The study area showing the distribution of forest areas across the state of Odisha is shown in
Figure 1. Data at six intervals, collected every four hours, are averaged to obtain the daily average temperature, soil moisture, and humidity values.
We considered the data provided by the Forest Survey of India (FSI) for the spatial occurrence of fire events. The FSI functions under the Ministry of Environment, Forest, and Climate Change, Government of India. Its principal mandate is to conduct surveys and assess forest resources in the country. As part of various forest survey and assessment activities, the FSI has developed a Fire Alert System using near-real-time satellite data from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensors on the Aqua and Terra satellites [
29]. The daily fire alerts are issued to users automatically at 1 km × 1 km spatial resolution, based on overpasses at about 10:30 a.m. and 10:30 p.m. from the Terra satellite and 1:30 a.m. and 1:30 p.m. from the Aqua satellite. The fire pixels identified by the MODIS platform generate a spatial database of archival forest fire events in the form of a fire flag matrix.
The original dataset consists of four parameters, as discussed in the above section: the surface temperature, air temperature close to the surface, soil moisture, and relative humidity. The daily average values of the four parameters and their daily maximum and minimum values are considered in our analysis. This results in twelve original parameters under consideration. However, the complex relationship between the meteorological variables and the occurrence of a fire event with a time lag requires additional derived parameters based on the original dataset. As such, twelve more variables were derived and added to the total variables list as shown in
Table 1.
Data pre-processing ensures a comprehensive exploration of the dynamical aspect of the meteorological parameters, establishing a foundation for a thorough understanding of how they interact in wildfire incidents.
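The construction of the twelve original parameters can be sketched as follows: each variable's six four-hourly samples per day are reduced to a daily mean, maximum, and minimum. This is a minimal illustration only; the function names, variable keys, and sample values are hypothetical, and the twelve derived variables are not reproduced here because their definitions are given in Table 1.

```python
from statistics import mean

def daily_stats(samples, per_day=6):
    """Collapse a flat list of sub-daily samples into mean/max/min per day."""
    if len(samples) % per_day != 0:
        raise ValueError("series length must be a multiple of per_day")
    stats = []
    for start in range(0, len(samples), per_day):
        day = samples[start:start + per_day]
        stats.append({"mean": mean(day), "max": max(day), "min": min(day)})
    return stats

def build_daily_parameters(series_by_variable, per_day=6):
    """Return {"<variable>_<stat>": [daily values]} for every variable."""
    params = {}
    for name, samples in series_by_variable.items():
        per_day_stats = daily_stats(samples, per_day)
        for stat in ("mean", "max", "min"):
            params[f"{name}_{stat}"] = [d[stat] for d in per_day_stats]
    return params

# Hypothetical single day of four-hourly samples for the four base variables.
raw = {
    "surface_temp": [298.0, 296.5, 301.0, 312.0, 309.5, 302.0],
    "air_temp": [297.0, 295.8, 299.5, 308.0, 306.2, 300.1],
    "rel_humidity": [62.0, 70.0, 55.0, 28.0, 31.0, 48.0],
    "soil_moisture": [0.21, 0.21, 0.20, 0.19, 0.19, 0.19],
}
daily = build_daily_parameters(raw)
print(sorted(daily))
```

Four variables with three statistics each give the twelve parameter series described above.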
2.2. Fire Incident Data and Fire Flag Matrix
The archival fire data were downloaded from the FSI website, specifically from the MODIS database, for January to June between 2014 and 2020. While the MODIS fire data are available at a fine resolution of 1 km × 1 km, the meteorological parameters from the RDS are available at a coarser resolution of 12 km. Hence, an RDS pixel is considered a fire pixel if a MODIS-identified fire incident occurs within a threshold distance of its centre point. The threshold distance ensures that the fire pixel is within the bounds of the given RDS pixel. Latitude and longitude distances were also limited to a threshold of 0.06° each in the x and y directions to ensure that the closest point was flagged. This approach generates a spatial Boolean matrix, henceforth called the fire flag matrix, in which locations where fire is detected are set to ‘1’ and all other locations are set to ‘0’.
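A minimal sketch of the flagging step is given below, assuming a regular latitude/longitude grid of RDS pixel centres and a list of MODIS fire coordinates. Only the 0.06° threshold comes from the text; the function name, the grid spacing of 0.12°, and the coordinates are hypothetical.

```python
def fire_flag_matrix(grid_lats, grid_lons, fire_points, threshold_deg=0.06):
    """Build a 0/1 matrix over grid centres from (lat, lon) fire detections."""
    flags = [[0] * len(grid_lons) for _ in grid_lats]
    for f_lat, f_lon in fire_points:
        # Index of the closest grid centre along each axis.
        i = min(range(len(grid_lats)), key=lambda k: abs(grid_lats[k] - f_lat))
        j = min(range(len(grid_lons)), key=lambda k: abs(grid_lons[k] - f_lon))
        # Flag the pixel only if the fire lies within the threshold on both axes.
        if (abs(grid_lats[i] - f_lat) <= threshold_deg
                and abs(grid_lons[j] - f_lon) <= threshold_deg):
            flags[i][j] = 1
    return flags

# Hypothetical grid centres (assumed spacing 0.12 degrees) and fire detections.
lats = [20.00, 20.12, 20.24]
lons = [84.00, 84.12]
fires = [(20.13, 84.01),   # close to the centre (20.12, 84.00)
         (21.00, 84.00)]   # outside the grid, so it is ignored

flags = fire_flag_matrix(lats, lons, fires)
for row in flags:
    print(row)
```

Stacking one such matrix per day would give the label array that is paired with the daily meteorological fields.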
5. Discussion
Forest fire prediction has been approached in the literature using various techniques, often relying on remote sensing, GIS, and meteorological data. Nevertheless, the majority of these studies adopt a diagnostic approach, examining historical events without a significant emphasis on predicting future occurrences. To address this gap, our work sought to create a predictive model utilising fundamental meteorological parameters and derived variables to forecast forest fire incidents in Odisha. The mutual information method was utilised to determine critical characteristics affecting fire behaviour, which were subsequently employed to train a random forest model for predicting fire outbreaks one day in advance. Furthermore, we concentrated on addressing issues associated with class imbalance and identifying ideal sample sizes for model training, which were essential for improving the resilience and dependability of the predictions. We illustrate the main distinctions in methodology, data, and outcomes by contrasting our study’s methods and findings with those of a number of well-known forest fire prediction studies.
For comparison, Ref. [
34] suggested using the relative humidity and cumulative precipitation as input features to predict fire/no-fire events. The authors used the artificial neural network (ANN) and support vector machine (SVM) approaches to predict the occurrence of fire events. Their results showed a prediction accuracy of 93–94% using the SVM technique and 89–91% using ANN. In [
35], the authors considered the NDVI, land surface temperature, and thermal anomaly as input to predict the occurrence of forest fires using ANN and SVM. The input parameters were acquired from MODIS satellite observations. Their model was able to detect forest fires with 98.32% accuracy. In both of these works, the authors did not discuss the class imbalance effect and other metrics to evaluate the model performance such as precision, recall, and F1 score. Ref. [
36] used nine spatially explicit explanatory variables, namely, elevation, slope angle, aspect, average annual temperature, drought index, river density, land cover, and distance from roads and residential areas. The authors evaluated four different models, namely, Bayes Network, Naive Bayes, Decision Trees, and Multivariate Logistic Regression, for prediction and mapping of fire susceptibility areas across the Pu Mat National Park, Vietnam. Their results show that the Bayes Network model outperformed the other models with an area under the receiver operating characteristic score of 0.96.
The approaches employed by [
14] and our investigation diverge considerably in approach and emphasis. The authors apply ANN and Classification Rules (CR) for wildfire prediction, utilising a three-layer ANN including 13 hidden neurones and rule-based models to categorise occurrences as “wildfire active” or “not active.” The ANN attained an accuracy of 84.79% and a precision of 40.01%, whereas the CR model exhibited a slightly lower accuracy of 83.71% but a significantly higher precision of 66.06%. Conversely, our study evaluates data sampling and SMOTE, two synthetic oversampling methodologies, to improve classification performance across multiple metrics including accuracy, precision, recall, and F1 score. Data sampling consistently surpassed SMOTE, attaining an average accuracy of between 0.92 and 0.99 and precision from 0.88 to 0.98, whereas SMOTE’s accuracy varied from 0.88 to 0.97 and precision from 0.85 to 0.96. In contrast to DeSouza et al., who concentrate on rule-based classifications and artificial neural networks, our research highlights the impact of sampling approaches on classification metrics, specifically demonstrating that data sampling outperforms in precision and F1 score, although SMOTE is superior in recall. Our study’s methodology demonstrates superior techniques for managing imbalanced datasets, while [
14] emphasise the significance of interpretability and rule-based models in predictive tasks.
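The reason for reporting precision, recall, and F1 score alongside accuracy can be illustrated with a short sketch that computes the four metrics from binary labels, treating fire as the positive class. The function name and the example labels are hypothetical and are not taken from the study's results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fire)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced case: 95 no-fire days, 5 fire days, and a
# classifier that always predicts "no fire".
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A classifier that never predicts fire still scores high accuracy on such data while its recall is zero, which is why accuracy alone is not sufficient under class imbalance.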
In contrast, the authors of [
37] create a forest fire danger forecasting system (FFDFS) to forecast fire danger in northern Canada by utilising precipitable water, NDVI, NMDI, and surface temperature taken from MODIS. With a 95.51% classification accuracy for fires falling into “moderate” to “extremely high” hazard categories, their algorithm predicts fire danger across five categories (from extremely high to low). Their approach offers useful information about overall fire risk over a longer period of time, but it is not intended to forecast the exact frequency of fire occurrences on a daily basis, which is the main goal of our research. Additionally, the problem of class imbalance—a crucial component in raising the predictive accuracy of our random forest model—is not addressed in their study.
In a similar study [
38], the wildfire susceptibility in Irkutsk Oblast, Russia, was mapped using random forest models. In order to create risk maps for regions that are prone to fire, their study takes into account a variety of factors, such as vegetation type, human activity, and meteorological data. Similar to our findings, they claim a high accuracy (0.89), F1-score (0.88), and AUC (0.96). However, instead of accurately forecasting the likelihood of a fire, they continue to map susceptibility. Furthermore, even though their model is strong, it does not specifically address class imbalance, a problem that our study thoroughly tackles using data sampling strategies, producing a more balanced performance across precision, recall, and F1 scores.
A different strategy is used by [
39] based on the LightGBM model to predict fires in China’s Central–South region. With accuracy, precision, and F1 scores above 85% and AUC values above 89%, their work focuses on spatial analysis utilising GIS to predict the likelihood of fire. They attain great predictive accuracy, as in the present study, but instead of concentrating on short-term event prediction, the authors prioritise risk zoning and large-scale spatial grouping. Unlike us, they do not employ comprehensive meteorological data for daily forecasts, and their model does not thoroughly address the problem of class imbalance.
In order to assess fire risk characteristics and forecast fire occurrence in central China, [
40] employs a deep learning methodology, combining convolutional neural networks (CNNs) with geographic information systems (GISs). The scores we attain are comparable to the high accuracy (86.00%), precision (88.00%), recall (87.00%), and AUC (90.50%) displayed by their model. Their research, however, is less concerned with daily forecasts based on meteorological anomalies and more with zoning management techniques and the extraction of spatial features. CNNs are excellent for spatial analysis, but our method, which combines mutual information with a random forest model, performs better in short-term forecasting and gives us an advantage for real-time fire event forecasting.
Our research, in contrast to past studies, focuses on incorporating meteorological parameters into a predictive model that anticipates forest fire occurrences and proactively addresses data issues such as class imbalance. This allows the model to perform more evenly across a range of metrics, with precision and F1 scores consistently above 0.85. Our model is distinct in that it can forecast fire events one day ahead of time, giving early warning systems a crucial lead time, whereas many other studies concentrate on susceptibility mapping, fire danger classification, or spatial analysis.
Despite the exclusion of additional parameters such as the topography, aspect ratio, vegetation index, etc., the techniques developed in the present study can use anomalies detected in the atmospheric parameters to predict the occurrence of forest fires so that suitable mitigating actions can be taken. Further studies could help to extend the lead time to a few days in advance.
6. Conclusions
This study aims to predict the occurrence of wildfire events rather than classifying them for diagnostic purposes. For this, we considered four essential meteorological parameters that influence the occurrence of fire events, namely the surface and air temperatures, relative humidity, and soil moisture. These parameters were chosen due to their wide availability with reduced uncertainty across many places around the world. Using the four essential parameters, we derived 20 additional parameters that represent the heat energy in various forms. Our first objective was to identify key meteorological parameters that exhibit significant correlations with the occurrence of a fire event a day in advance. The mutual information approach was used for this purpose to study how much each meteorological variable contributes to predicting the fire event. Based on the mutual information study, nine meteorological parameters were identified that carry significant information about the fire event.
The nine meteorological parameters observed between 2015 and 2017 are used to train the random forest classifier model. The forecast is achieved by feeding the meteorological data from the ith day as input to the random forest classifier, while the output is the classification in the form of fire or no fire on the (i+1)th day. The performance of the classifier depends on the class imbalance in the dataset. To address the imbalance between the two classes, we used the weighted data sampling and SMOTE approaches. Both approaches show consistent improvement in balancing the two classes.
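The idea behind SMOTE can be sketched in a few lines: each synthetic minority (fire) sample is placed at a random point on the segment joining a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration, not the implementation used in the study; the function name, the value of k, the seed, and the toy samples are assumptions, and the weighted data sampling approach is not reproduced because its details are not given in this section.

```python
import random

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Other minority samples, sorted by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical fire-day samples: (scaled surface temperature, scaled humidity)
fire_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(fire_samples, n_new=10)
print(len(new_points))
```

Because every synthetic point lies between two existing fire samples, the minority class grows without simply duplicating rows, which is the main difference from plain random oversampling.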
The last objective was realised by conducting an extensive study on the effect of class imbalance and sample size on the performance of the random forest algorithm. Using four different metrics, namely, the accuracy, precision, recall, and F1 score, we found that as the class imbalance ratio increases, the performance of the classifier improves with both the weighted data sampling and the SMOTE approaches. The study on the effect of sample size shows that as the sample size for training the classifier increases, both sampling techniques show consistently good performance, with the evaluation metric values in both cases approaching 1. The results demonstrate the model’s robustness and its adaptability to varying amounts of synthetic data. Both SMOTE and weighted data sampling addressed class imbalance and increased model performance with larger training sizes, but the weighted data sampling technique satisfied the project goals better with the real dataset.