Framework to Create Inventory Dataset for Disaster Behavior Analysis Using Google Earth Engine: A Case Study in Peninsular Malaysia for Historical Forest Fire Behavior Analysis

Chew, Yee Jian; Ooi, Shih Yin; Pang, Ying Han; Lim, Zheng You

doi:10.3390/f15060923

Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia

Author to whom correspondence should be addressed.

Forests 2024, 15(6), 923; https://doi.org/10.3390/f15060923

Submission received: 8 April 2024 / Revised: 21 May 2024 / Accepted: 22 May 2024 / Published: 26 May 2024

(This article belongs to the Special Issue Managing Forest Wildfires in Climate Changes: New Paradigms and Challenges)

Abstract

This study developed a comprehensive framework using Google Earth Engine to efficiently generate a forest fire inventory dataset, which enhanced data accessibility without specialized knowledge or access to private datasets. The framework is applicable globally, and the datasets generated are freely accessible and shareable. By implementing the framework in Peninsular Malaysia, significant forest fire factors were successfully extracted, including the Keetch–Byram Drought Index (KBDI), soil moisture, temperature, windspeed, land surface temperature (LST), Palmer Drought Severity Index (PDSI), Normalized Difference Vegetation Index (NDVI), land cover, and precipitation, among others. This study also adopted large language models, specifically GPT-4 with the Noteable plugin, for preliminary data analysis to assess the dataset’s validity. Although the plugin effectively performed basic statistical analyses and visualizations, it demonstrated limitations, such as selectively dropping or choosing only relevant columns for tests and automatically modifying scales. These behaviors underscore the need for users to perform additional checks on the generated code to ensure that it accurately reflects the intended analyses. The initial findings indicate that factors such as KBDI, LST, climate water deficit, and precipitation significantly impact forest fire occurrences in Peninsular Malaysia. Future research should explore extending the framework’s application to various regions and further refine it to accommodate a broader range of factors. Embracing and rigorously validating large language model technologies, alongside developing new tools and plugins, are essential for advancing the field of data analysis.

Keywords:

disaster behavior; forest fire behavior; forest fire dataset; data extraction framework; Google Earth Engine; remote sensing; Malaysia; ChatGPT; Noteable; large language model

1. Introduction

Forest fires pose a significant ecological and societal challenge due to their destructive impact and potential harm to human communities. Recent advances in machine learning and data analysis offer promising tools for understanding and mitigating these disasters [1,2]. However, a substantial obstacle has persisted: the tedious and laborious process of collecting and preparing the essential data for analysis [3,4,5,6]. Many existing projects have relied on government intervention or costly data acquisition methods, limiting their accessibility and scalability. This study addresses this critical issue by presenting a simplified approach that harnesses the power of Google Earth Engine (GEE) [7]. Our primary objective was to streamline the behavioral analysis of forest fires, making it more accessible to researchers and practitioners.

A recurring question pertains to the emphasis on real-time detection over behavioral analyses of forest fires. While real-time detection is unquestionably essential [8,9,10,11,12], it often necessitates collaboration from multiple stakeholders, including government agencies, firefighters, and satellite providers, making it resource-intensive and costly. Our contribution focuses on enhancing the understanding of forest fire dynamics and historical patterns, contributing to more effective long-term forest fire management.

To provide context for our research, previous studies related to historical forest fire data extraction leveraging remote sensing data will be reviewed. This review will help us ascertain whether any previous efforts within the forest fire domain have attempted to simplify the data collection and analysis process, and can serve as a foundational background study. It will also highlight the existing knowledge gap and set the stage for our innovative approach within the broader field of fire ecology and management.

Our research objective was to develop a simplified approach using GEE to locate historical fire locations and extract relevant factors to create a fire inventory map. Refs. [13,14] highlighted the importance of having a fire inventory map as a prerequisite for constructing a wildfire susceptibility map. Presently, the optimal method for building the inventory datasets is still unknown [3].

Hence, this simplification of data acquisition is crucial because many machine learning studies have primarily focused on algorithmic aspects, often overlooking the complexity of data collection and preprocessing. In many cases, such projects have required significant governmental involvement and substantial resources. Our approach sought to shift the focus toward making the data readily available for analysis and machine learning purposes. The research question underlying our work is fundamental. It revolves around the importance of extracting historical fire data for analysis and machine learning purposes. Without access to this historical data, the scope of analysis remains limited, and the potential for comprehensive insights remains untapped. Moreover, although recent advancements in remote sensing have made a wealth of satellite data readily available [15], these data remain significantly underutilized, primarily due to the lack of experts, accessible tools, and methodologies.

This paper presents a method that leverages GEE to efficiently locate historical fire locations, extract relevant factors, and conduct data analysis on pertinent factors. This streamlined approach simplifies the entire process, empowering researchers to focus on refining algorithms or integrating new datasets. Additionally, it eliminates the need for resource-intensive data downloads and local processing, democratizing access to this type of analysis. Our study aimed to facilitate the behavioral analysis of forest fires, offering a foundational framework for future research in this domain. Often, the initial hurdle researchers face is finding a starting point. Our approach addresses this challenge by simplifying the data acquisition and analysis stages. Our methodology employs GEE’s Python API and the Geemap [16] package to achieve these objectives. While this paper emphasizes forest fire disasters, it is hypothesized that this methodology can be readily extended to address other types of disasters by simply providing the coordinates of the incidents.

On the other hand, to showcase the usefulness of the generated forest fire dataset, it is crucial to perform a series of analyses to validate its applicability. Although traditional methods for analyzing datasets are still widely preferred, this study adopted a distinctive approach by leveraging the advancements in large language models (LLMs). LLMs face several challenges, including hallucinations [17,18], issues with bias and fairness [18,19,20,21], data accuracy [18,22,23], data security or leakage [19,22], and ethical concerns [18,20,22]. However, numerous studies [18,19,20,22,23] suggest that the advancement of LLMs is inevitable and will continue to progress. Therefore, embracing these technologies is essential. In our study, ChatGPT [24], particularly the GPT-4 variant with the Noteable plugin [25], was utilized to demonstrate the utility of the forest fire dataset. At the end of the analysis, the potential uses, advantages, and limitations of this plugin are also discussed.

Section 2 discusses previous studies that have focused on extracting remote sensing data for forest fire analysis. In Section 3, the proposed framework for streamlining remote sensing data extraction for historical forest fire datasets is deliberated. This section also includes a comprehensive list of forest fire attributing factors, along with their sources and detailed information. Section 4 outlines a case study in Peninsular Malaysia, where the framework was applied to generate a forest fire inventory dataset tailored to this specific location. Section 5 presents the sample analysis using an LLM, specifically GPT-4 with the Noteable plugin, to validate the generated forest fire dataset’s applicability. The section also delves into the potential applications and limitations of this analysis approach. Section 6 serves as the conclusion, summarizing the paper’s contributions, while Section 7 outlines future work to enhance the proposed framework and validate the adoption of LLMs in data analysis.

2. Literature Reviews and Background Study

In a recent systematic review [26], an extensive exploration was undertaken to unveil the most influential factors affecting forest fires. The exhaustive analysis combed through a total of 144 factors from 94 publications spanning the years from 2001 to 2021. Among the factors that emerged as highly significant were slope, elevation, aspect, land cover, NDVI (Normalized Difference Vegetation Index), temperature, precipitation, windspeed, and more. Notably, the prevalence of these factors was attributed to their wide global availability in existing databases.

To perform data analysis [27,28] and develop machine learning models [3,4,5,6,29,30] in the domain of forest fires, it is essential to possess an inventory dataset containing both fire locations and associated factors. For a comprehensive exploration of the application of data analysis and machine learning in the domain of forest fires, one can refer to the review papers [2,31].

Creating these datasets involves various researchers employing their own methodologies, which may integrate multiple datasets from a variety of sources and government agencies, along with various Geographic Information System (GIS) tools [32,33]. These processes are tedious, time-consuming, and challenging, especially for data scientists and machine learning engineers who lack GIS expertise. Additionally, it is crucial to note that accessing government data typically requires navigating multiple layers of permissions and requests, making it a challenging and often non-shareable resource. Therefore, this study aimed to streamline the data collection process for building a forest fire dataset using publicly available resources, with the goal of enhancing accessibility for data scientists. Often referred to as an inventory dataset, this type of data is structured in tabular form for efficient organization and easy analysis. In the context of forest fire research, it is commonly known as a forest fire inventory dataset. To date, the best method for compiling data to create a fire inventory dataset remains undiscovered [3]. Hence, the next subsection will explore some of the past research attempts to create a fire inventory dataset.

2.1. Building Fire Inventory Datasets: Methods and Sources from Past Studies

This subsection discusses previous studies that utilized remote sensing data to construct fire inventory datasets for analysis or the development of machine learning and deep learning models. The emphasis is on the acquisition of these datasets rather than the tasks performed or the outcomes achieved, as the interest lies in understanding how data from various sources are combined and collected to create fire inventory datasets. Table 1 summarizes the factors used to build fire inventory datasets and the sources of each factor from past studies.

From the summaries provided, it is evident that some sources were unclear based on the descriptions in the original manuscripts. It is also apparent that many studies relied on data from local authorities or government sources. Data from these local government or authority sources generally offer much higher accuracy and more detail than publicly available data. However, the use of government data typically presents several challenges: (i) the data cannot be publicly shared, and (ii) a complex process is required to apply for and obtain permission to use the data.

2.2. Key Data Sources for Understanding Forest Fires in Malaysia

Focusing on case studies in Malaysia, this subsection examines the datasets used in previous research on forest fire analyses within the country. The primary emphasis here is on datasets; thus, task-specific details are not expanded upon. For information on the tasks performed in each study, a detailed review can be found in [40]. The primary aim of this subsection is to determine if any related work has previously been performed to create forest fire inventory datasets in Malaysia.

Landsat imagery, for example, is instrumental in deriving LULC products, the Normalized Burn Ratio (NBR), the Normalized Difference Water Index (NDWI), and the NDVI [41]. For effective data calculation and classification, clear or mostly cloud-free satellite images are essential. In Malaysia, Landsat images have been widely used, with applications including Landsat-5 [42,43], Landsat-7 [27,43,44,45,46], and Landsat-8 [41,47,48]. The Landsat datasets are generally accessible to the public through the United States Geological Survey (USGS) Earth Explorer [49]. Additionally, a validated private land cover dataset [50] for Malaysia and Indonesia that leverages Landsat-7 and Landsat-8 imagery has also been produced [51,52].

Other key datasets include precipitable water vapor for relative humidity, and surface air temperature and land surface temperature (LST) from MODIS sensors, which were adopted in [53,54]. MODIS level-1 and level-2 product data are publicly available from the Level-1 and Atmosphere Archive and Distribution System (LAADS) Distributed Active Archive Center (DAAC) [55,56,57].

For tracking active fire hotspots, refs. [50,58,59] utilized the MODIS MCD14ML Collection 5 hotspots from NASA FIRMS, which are accessible via the FIRMS archive website [60]. Alternative sources of hotspot data include the World Fire Atlas, available from 1995 to 2012, which were cited in [61,62]. Additionally, hotspots from the Advanced Very-High Resolution Radiometers (AVHRRs) of the National Oceanic and Atmospheric Administration (NOAA) were used in [27,44,54], whose data are available through the ASEAN Specialized Meteorological Centre (ASMC) [63]. Detailed instructions for access to these datasets are provided in [64].

It is important to note that while most of the publicly available data can be accessed and downloaded, replicating the analysis procedures from research studies remains challenging. This complexity stems from the need for researchers to preprocess the data using commonly used GIS tools such as ArcGIS or QGIS before the analysis and to integrate the data within these tools. Additionally, most data acquisition details are not disclosed in the manuscript.

For Malaysian government data, refs. [27,44,54] utilized contours, administrative boundaries, water resources, settlements, and transportation infrastructure from the Department of National Mapping and Survey (JUPEM) [65]. The same studies [27,44,54] also employed hotspots, fire occurrence maps, peat swamp maps, and soil maps from the Malaysia Center of Remote Sensing (MACRES) [50]. Additionally, population data from the Department of Statistics Malaysia [66,67] were used in [27,44]. The Malaysia Meteorological Services Department [68,69] provides data such as temperature, relative humidity, the Fire Danger Rating System (FDRS), and rainfall, which were utilized in [27,42,44,54,70,71]. Ref. [42] leveraged land cover data from the Department of Forestry and the Department of Agriculture [57]. Previous research [27,42,44,54] indicates that fire occurrence reports in Peninsular Malaysia can be obtained from the Forestry Department of Peninsular Malaysia (JPSM); however, the details are not readily available in the manuscripts or on the department's website.

Our attempts to contact the relevant departments via email and telephone were unsuccessful in obtaining the information. This highlights the challenge of accessing government-based data, which are often private or require a fee. Even when permission is granted, it typically involves approval from senior officials such as department directors, and the data obtained cannot be shared publicly.

Consequently, using tools like GEE to aggregate and analyze the necessary data could simplify these processes. Although this manuscript focuses on forest fire scenarios in Malaysia, the proposed methods and tools are likely applicable to other incidents and locations, provided the coordinates of the incidents are available.

2.3. Key Studies Driving Research Motivation

A study closely aligned with our research was conducted in [72]. This study aimed to assemble a comprehensive forest fire inventory dataset, encompassing historical fire incidents and associated factors in Australia. The dataset served as the foundation for investigating the primary factors contributing to forest fires and for constructing machine learning models to predict forest fire incidents. The research relied upon key tools such as the GEE code editor JavaScript API [7] and made use of multiple global public satellite datasets as well as government datasets. However, replicating the authors’ methodology and adapting it to different geographical locations presented considerable challenges when using the provided scripts [73].

In contrast, our work emphasizes the use of globally publicly available datasets from the GEE catalog to ensure its flexibility in adapting to various locations. While government data or any private data are not considered in this study, researchers can seamlessly integrate their datasets into the framework to enrich their analyses. Importantly, we prioritize the development of a swift data extraction method. Our primary purpose was to provide a means to quickly gather data for understanding forest fire occurrences in specific locations, establishing a robust foundation for other researchers to utilize this framework. Notably, none of the previous efforts in the forest fire domain have attempted to simplify the data collection process. Given these challenges, simplifying the process of data collection for forest fires can empower data scientists to leverage their expertise in better analyzing and understanding fire behavior in different locations.

3. Methodology

3.1. Proposed Framework

In this work, GEE was leveraged as the primary big data platform for accessing remote sensing data, providing a robust foundation for meaningful analysis. It has been widely adopted for various tasks, showcasing its versatility [74,75,76,77,78]. The significant increase in the availability of publicly accessible remote sensing data [15] presents a promising opportunity for research and analysis. However, obtaining historical forest fire data remains a formidable challenge. This challenge arises because the ownership of such data is predominantly vested in governmental authorities, which often necessitates a complex approval process for data access and utilization. Figure 1 presents a comprehensive visual representation of the proposed framework. It illustrates the process of extracting historical forest fire locations and highlights the multifaceted factors that contribute to the occurrence of fires at these locations.

In this study, the MCD64A1 Burnt Area (BA) dataset [79] was utilized to extract historical forest fire locations. By default, this study uses all years with complete data availability in MCD64A1 (i.e., from 1 January 2001 to 31 December 2022) to harvest the historical fire points. This dataset was selected as it offers a comprehensive global record of burned areas with a spatial resolution of 500 m, which is useful for monitoring and analyzing wildfire and land cover change dynamics. Previous research [80,81,82,83,84,85,86] validated the dataset’s ability to detect historical burnt areas in various locations, with most studies successfully identifying historical BAs. Although this dataset has not previously been used to detect BAs in the study area of Peninsular Malaysia, our prior work [87] successfully employed it to identify a small fire in the state of Pahang, within Peninsular Malaysia. Given this capability, the burnt areas were extracted as individual fire points, each covering an area of 1 km², to construct the dataset that forms the foundation of this study. Additionally, the dataset includes the incident dates of the fires, enabling further extraction of relevant variables based on these dates. MCD64A1 was chosen for this study over alternatives such as FireCCI51 v5.1 [88] and Globfire Fire Event [89] from the GEE catalog because of its extended temporal availability within GEE.

To derive historical fire points, the MCD64A1 dataset was filtered based on the year of interest and the specific region. All detected fire locations were then extracted as fire points at a 1 km² resolution, as shown in Figure 1, Step 1. To extract the year and month of the fire incidents, the ‘system:index’ from MCD64A1 was utilized, with the first four digits denoting the year and the subsequent two digits indicating the month. Since MCD64A1 provides tentative BurnDate values ranging from 0 to 366, the tentative date of the fire at each location was calculated using these burnt day values. A spatial resolution of 1 km² was employed in this study, considering the size of burnt areas in the study location, Peninsular Malaysia, which predominantly comprises small fires, typically less than 100 hectares [90,91,92,93,94,95,96,97]. This spatial resolution is also consistent with the resolutions of other datasets used in the study, which are generally larger than 1 km². It is important to note that in locations with larger-scale fires, using a coarser resolution (i.e., a larger pixel size) may be necessary to prevent complications in GEE due to computational resource constraints.
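The date derivation described above can be illustrated with a minimal local sketch. It assumes that the ‘system:index’ string carries the year and month as its first six digits (the example string with underscore separators is an assumption, not taken from the paper) and that BurnDate is a day-of-year value in which 0 marks an unburnt pixel. The function names are hypothetical and are not part of the published source code.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def parse_year_month(system_index: str) -> Tuple[int, int]:
    """Read (year, month) from the leading digits of a 'system:index' string.

    Non-digit separators are ignored, so the first four digits are the year
    and the next two digits are the month.
    """
    digits = "".join(ch for ch in system_index if ch.isdigit())
    if len(digits) < 6:
        raise ValueError(f"cannot read year and month from {system_index!r}")
    return int(digits[:4]), int(digits[4:6])


def burn_date_to_calendar(year: int, burn_day: int) -> Optional[date]:
    """Convert a BurnDate day-of-year value into a calendar date.

    Assumption: a value of 0 (or less) means the pixel was not burnt,
    in which case None is returned.
    """
    if burn_day <= 0:
        return None
    result = date(year, 1, 1) + timedelta(days=burn_day - 1)
    if result.year != year:
        raise ValueError(f"day {burn_day} does not exist in {year}")
    return result
```

For instance, under these assumptions an index of `2021_03_01` yields the year 2021 and the month 3, and a BurnDate of 75 in 2021 corresponds to 16 March 2021.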

Conversely, the rationale for selecting non-fire points was based on the fact that these areas have not been identified as BAs by the MCD64A1 dataset in the past 20 years. Consequently, it is reasonable to assume that these locations experienced no fire events, making them suitable as non-fire points in our analysis. For the collection of non-fire points, all available burnt area data within the region of interest from the MCD64A1 dataset were utilized. The burnt areas were blended into a single image, depicting the historical burnt regions across all years. In consideration of potential omissions of nearby fires, an additional dilation morphological operation [98] was applied to expand the boundaries of the burnt regions, with the default radius and number of iterations both set to 2. To obtain the non-fire regions, the dilated burnt regions were subtracted from the region of interest. Subsequently, the GEE function ee.FeatureCollection.randomPoints was employed to randomly extract non-fire points at a 1 km² resolution, with the total number of non-fire points matching the total number of fire points. For the date assigned to non-fire points, we leveraged the most recent available year in the MCD64A1 dataset. This approach is justified by the necessity of incorporating the most up-to-date data concerning non-fire locations to ensure a comprehensive understanding of non-fire occurrences in the current context. The process for extracting non-fire points is elucidated in Step 2 of Figure 1. It is important to emphasize that all the historical fire and non-fire points extracted include their respective coordinates (latitude and longitude), as they are essential for the next step involving importing the fire incidents back into GEE. After the extraction of historical fire and non-fire points, the points were saved as .csv files for the subsequent analysis. This decision to extract and store the data as .csv files was primarily driven by the consideration that GEE may encounter issues such as crashes or connection losses during processing. Therefore, using .csv files allows the process to continue without restarting the extraction of historical points from the beginning in the case of disruptions in GEE operations.
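The non-fire sampling logic (dilate the burnt mask, invert it within the region of interest, then draw as many random points as there are fire points) can be sketched locally on a small grid. This is an illustration only and not the GEE implementation: it assumes a square kernel measured in grid cells, represents the region of interest as a rows × cols grid, and samples whole cells rather than continuous coordinates. All names are hypothetical.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of one grid cell


def dilate(burnt: Set[Cell], rows: int, cols: int,
           radius: int = 2, iterations: int = 2) -> Set[Cell]:
    """Grow the burnt mask with a square kernel, repeated `iterations` times."""
    current = set(burnt)
    for _ in range(iterations):
        grown = set()
        for r, c in current:
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    nr, nc = r + dr, c + dc
                    # Keep only cells that stay inside the region of interest.
                    if 0 <= nr < rows and 0 <= nc < cols:
                        grown.add((nr, nc))
        current = grown
    return current


def sample_non_fire(burnt: Set[Cell], rows: int, cols: int,
                    n_points: int, seed: int = 0) -> List[Cell]:
    """Pick `n_points` random cells outside the dilated burnt mask."""
    buffered = dilate(burnt, rows, cols)
    # Invert the buffered mask within the grid to get candidate non-fire cells.
    candidates = sorted(
        (r, c) for r in range(rows) for c in range(cols)
        if (r, c) not in buffered
    )
    if n_points > len(candidates):
        raise ValueError("not enough non-fire cells to sample from")
    return random.Random(seed).sample(candidates, n_points)
```

Calling `sample_non_fire(burnt, rows, cols, n_points=len(burnt))` mirrors the balanced design described above, in which the number of non-fire points equals the number of fire points. The fixed seed is a design choice of this sketch that makes the draw repeatable after an interruption.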

The factors for each point were extracted based on the coordinates of both fire and non-fire points. Step 3 in Figure 1 lists all the conditioning factors exploited in this work, while Table 2 provides comprehensive information about each factor, including its source, temporal availability, temporal cycle, spatial resolution, etc. As this study aimed to alleviate the challenges faced by data analysts and machine learning engineers in the extraction of remote sensing data, all fire factors were extracted for every period in which their temporal availability overlaps that of the BA dataset, to maximize data accessibility for future analysts. However, it is worth noting that due to resource constraints within GEE, factors that are available on a daily basis (e.g., KBDI) are processed inside GEE to compute monthly averages before exporting.
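The monthly averaging of daily factors can be illustrated with a short local sketch. In the framework this reduction runs inside GEE; the pure-Python version below only mirrors the arithmetic for a single point. The function name, the input layout (pairs of date and value), and the handling of missing values are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Optional, Tuple


def monthly_means(
    daily: Iterable[Tuple[date, Optional[float]]]
) -> Dict[Tuple[int, int], float]:
    """Average daily values into one value per (year, month).

    Assumption: missing observations are passed as None and are skipped,
    so a month with no valid observation is absent from the result.
    """
    sums: Dict[Tuple[int, int], float] = defaultdict(float)
    counts: Dict[Tuple[int, int], int] = defaultdict(int)
    for day, value in daily:
        if value is None:
            continue
        key = (day.year, day.month)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Reducing daily series to one value per month before export keeps the exported table small, which is consistent with the resource constraint mentioned above.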

In Step 4, additional processing is conducted locally to derive supplementary factors that are valuable for fire behavior analysis. For all the factors available on a monthly basis, the computation of annual and seasonal averages was performed for each year. The seasonal averages were determined by aggregating data over three-month periods: December–February, March–May, June–August, and September–November. This approach proves valuable as it acknowledges the substantial variations in fire behavior across seasons, which can be attributed to factors such as weather conditions, vegetation growth, and human activities. This aligns with established research findings [99] highlighting the importance of effectively capturing and analyzing these seasonal patterns. Subsequently, to gain a more comprehensive understanding of fire behavior, the factors associated with fire incidents for the specific year were extracted. This information was incorporated as a new column, referencing the year of each fire event obtained from MCD64A1. Historically, this step has been perceived as a complex and labor-intensive process, as generating datasets with a high temporal resolution can pose practical challenges, as noted in prior studies [100]. Consequently, the forest fire dataset was assembled, encompassing all factors, including the monthly, annual, seasonal, and current-year fire-influencing variables, rendering it prepared for in-depth analyses. It is important to acknowledge that this dataset may contain missing data, mandating additional processing before effective utilization in analyzing the data or training machine learning models.
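The Step 4 aggregation can be sketched as follows for one factor at one point. The sketch assumes that each seasonal average is computed from the months of a single calendar year, so that December–February combines January, February, and December of the same year; the paper does not state how the December boundary is handled, and all names here are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Three-month seasons as described in the text.
SEASONS: Dict[str, Tuple[int, int, int]] = {
    "DJF": (12, 1, 2),
    "MAM": (3, 4, 5),
    "JJA": (6, 7, 8),
    "SON": (9, 10, 11),
}


def _mean(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean of the non-missing values, or None if every value is missing."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def summarise_year(monthly: Dict[int, Optional[float]]) -> Dict[str, Optional[float]]:
    """Annual and seasonal averages from a {month: value} mapping for one year."""
    summary = {"annual": _mean(monthly.get(m) for m in range(1, 13))}
    for name, months in SEASONS.items():
        summary[name] = _mean(monthly.get(m) for m in months)
    return summary


def value_for_fire_year(
    per_year: Dict[int, Dict[str, Optional[float]]],
    fire_year: int,
    key: str,
) -> Optional[float]:
    """Look up the summary value for the year in which the fire occurred."""
    return per_year.get(fire_year, {}).get(key)
```

The last helper corresponds to the additional current-year column: for a fire recorded in a given year, it returns that year's summary value, or None when the factor is unavailable, which is one way the missing data noted above can arise.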

One of the key advantages of the proposed framework is its universal applicability, as it relies solely on globally available, publicly accessible datasets. Given their public accessibility, the generated datasets can be readily shared and distributed without concerns related to copyright or privacy issues. For instance, the forest fire dataset produced in our study location, Peninsular Malaysia, is accessible at https://doi.org/10.5281/zenodo.10050852 (accessed on 21 May 2024) [101]. In this study, the GEE Python API was employed instead of GEE JavaScript. This choice allows for additional analyses and processing, tapping into the widespread utility of Python in data science and machine learning. Furthermore, Python offers the advantage of code reuse for various geospatial and data analysis tasks beyond the GEE environment. In conjunction with the GEE Python API, we also employed GeeMap [16,102], a Python package tailored for interactive geospatial analysis and visualization within the GEE framework. To facilitate the replicability of our proposed methodology and encourage its adoption in other geographical locations, all the source code utilized in this study is readily accessible on GitHub https://github.com/chewyeejian/GEE_FrameworkForestFireDataset (accessed on 21 May 2024). It is important to note that the provided source code primarily focuses on our study location, Peninsular Malaysia. Modifying the default Country Feature Collection is necessary to adapt the code for use in different locations.

3.2. Forest Fire Attributing Factor Data Source and Details

Table 2 provides a comprehensive overview of the factors harnessed in this study, offering insights into their respective categories, sources, temporal availability, temporal cycle, spatial resolution, and more. It is important to note that the temporal availability of the data listed in the table is based on the most recent access date of September 2023. As emphasized earlier, all the datasets used in this research are globally sourced, ensuring their adaptability across various locations without restriction. The selection of factors in this study was based on their documented prevalence and their potential high correlation with forest fire incidents, as indicated in the existing literature [26].

For example, land cover, also known as land use, significantly influences wildfire incidence and spread through its impact on landscape characteristics, fuel types, and interactions with human activities [38,103,104,105,106,107]. Elevation affects temperature, moisture, and wind dynamics, influencing vegetation structure and air humidity, with higher moisture levels in elevated terrains acting as preventive measures against severe wildfires [108]. Slope inclination affects fire ignitions by impacting accessibility; generally, steeper slopes tend to reduce accessibility [108]. Aspect, which indicates the direction a slope faces [109], influences forest fire frequency by affecting the amount of solar radiation received through sunlight exposure. For instance, south-facing slopes, which receive more direct sunlight, often experience more fires due to their drier and less dense vegetation compared to north-facing slopes. Nighttime light intensity data effectively capture various human activities, as highlighted by the authors’ results in [39]. Further details on each of these fire-related factors can be found in [26].

It is important to highlight that the Human Impact Index (HII) from the Wildlife Conservation Society [110] is the sole dataset not directly available in the official GEE dataset catalog. However, it can be conveniently accessed on GEE, where its authors have uploaded it to the platform as an image collection.

Table 2. Fire conditioning factor data source and details (note: last accessed September 2023).

Category	Source of Data	Temporal Availability	Temporal Cycle	Spatial Resolution	Annual Average	Monthly Average	Seasonal Average	Data Layer	Unit
Climate & Environment	TerraClimate [111]	1 January 1958 to 1 December 2022	Monthly	4 km²	✔	✔	✔	Actual Evapotranspiration (AET)	mm
								Water Deficit (DEF)	mm
								Palmer Drought Severity Index (PDSI)	-
								Reference Evapotranspiration (PET)	mm
								Precipitation (PR)	mm
								Runoff (RO)	mm
								Soil Moisture (SOIL)	mm
								Downward Surface Shortwave Radiation (SRAD)	W/m²
								Snow Water Equivalent (SWE)	mm
								Minimum Temperature (TMMN)	°C
								Maximum Temperature (TMMX)	°C
								Vapor Pressure (VAP)	kPa
								Vapor Pressure Deficit (VPD)	kPa
								Wind Speed (WS)	m/s
	Rainfall [112]	1 January 2007 to 12 September 2023	Daily	4 km²	✔	✔	✔	Keetch–Byram Drought Index (KBDI)	-
	MOD11A2.061 Terra [113]	18 February 2000 to 29 August 2023	8 days	1 km²	✔	✔	✔	Land Surface Temperature (LST)	K
	MOD13Q1.061 Terra [114]	18 February 2002 to 13 August 2023	16 days	250 m	✔	✔	✔	Normalized Difference Vegetation Index (NDVI)	-
	MOD13Q1.061 Terra [114]	18 February 2002 to 13 August 2023	16 days	250 m	✔	✔	✔	Enhanced Vegetation Index (EVI)	-
Land Cover	MCD12Q1.061 MODIS [115]	1 January 2001 to 1 January 2022	Annual	500 m	✔	N/A	N/A	Annual University of Maryland (UMD) Classification (LC_Type2)	16 classes
Land Cover	European Space Agency (ESA) [116] (static)	1 January 2021 to 1 January 2022	Annual	10 m	✔	N/A	N/A	Land Cover (Map)	11 classes
Topography	NASADEM [117] (static)	11 February 2000 to 22 February 2000		30 m	✔	N/A	N/A	Elevation	m
								Slope (derived from DEM)	degrees
								Aspect (derived from DEM)	degrees
Social Economic/Anthropogenic factors	Wildlife Conservation Society [110]	1 January 2001 to 1 January 2020	Annual	300 m	✔	N/A	N/A	Human Footprint/Human Impact Index (HII)	-
	Deutsches Zentrum für Luft- und Raumfahrt [118]	1 January 2015 to 1 January 2016	Annual	10 m	✔	N/A	N/A	World Settlement Footprint 2015 (settlement)	-
	Visible Infrared Imaging Radiometer Suite (VIIRS) [119]	1 April 2012 to 1 January 2021	Annual	500 m	✔	N/A	N/A	Nighttime Light (average)	nanoWatts/sr/cm²
Burn Area	MCD64A1.061 MODIS [79]	1 November 2000 to 1 July 2023	Monthly	500 m	N/A	N/A	N/A	BurnDate	-

3.3. Limitations of Methodology

Determining the exact cause of each fire requires the presence of domain experts at the scene to conduct thorough investigations, with the resulting reports usually available only through forestry agencies or government departments. The focus of the proposed methodology is on analyzing variables that may either cause forest fires or intensify their severity. Although human negligence [120,121] is identified as a predominant factor in forest fires, meteorological conditions also play a critical role. For example, the Director of the Fire and Rescue Department highlighted that an extended period of hot and dry weather was the catalyst for the fire that occurred in March 2021 in Pahang, Malaysia [122]. As mentioned in Section 2.2, accessing data from government or forestry agency sources in Malaysia has proven challenging, resulting in a significant gap in detailed information about each fire incident. Nonetheless, it is advisable to integrate this framework with additional data from forest agencies and government sources, when available, to fully leverage the framework’s potential.

4. Application of the Proposed Framework in the Study Area—Peninsular Malaysia

4.1. Study Area—Peninsular Malaysia

In this study, Peninsular Malaysia served as the chosen study location for assessing the proposed framework’s effectiveness. For further details related to the study area, please refer to Appendix B.

This selection stems from the limited previous efforts to create a publicly available forest fire dataset for analytical purposes in this region [40]. Most prior works relied on private datasets sourced either from the Malaysian Government or private agencies [64]. In our approach, we utilize the level 2 administrative boundaries of Malaysia [123], refined to encompass only the states within Peninsular Malaysia. Employing these administrative boundaries enables the extraction of both fire and non-fire points to include their respective states and districts, which, in turn, facilitates subsequent analyses based on states or districts.

Following the extraction of historical fire and non-fire points in Steps 1 and 2 of Figure 1, a total of 5557 fire points and 5526 non-fire points were collected. However, visualizing such a substantial volume of points within GEE is not feasible because of the platform’s request payload limit, which triggers a “Request payload size exceeds the limit: 10,485,760 bytes” error. To address this, QGIS was employed to visualize all the points, as depicted in Figure 2. Note that displaying a subset of the points for sample visualization remains achievable in GEE.

4.2. Peninsular Malaysia Forest Fire Dataset Description

The forest fire dataset comprises 11,083 rows (instances), including 5557 fire points and 5526 non-fire points, each characterized by 7040 columns (features). After removing all columns that exclusively contained null values (i.e., ADM2_REF, ADM2ALT2EN, ADM2ALT1EN), a total of 7037 columns remained available for a comprehensive analysis. The 5557 fire points represent burned areas detected from 1 January 2001 to 31 December 2022, at a spatial resolution of 1 km². In contrast, the 5526 non-fire points default to the latest date in the analysis (31 December 2022), reflecting the present context of non-fire scenes. The 7037 columns encompass all monthly, annual, and seasonal factors detailed in Table 2, administrative boundary features, and burnt date information sourced from MCD64A1. The full, unprocessed dataset is freely accessible at https://doi.org/10.5281/zenodo.10050852 (accessed on 21 May 2024) [101].

Figure 3 depicts the annual distribution of fire points from 2001 to 2022, revealing significant peaks in 2005 and 2014, which correspond with a previous analysis of FIRMS hotspots [124]. An analysis of the datasets to identify missing data is presented in Figure 4, highlighting the top 20 features with the highest percentage of missing data. Among these variables, LST, nighttime light, and the HII stand out as those with the most substantial missing data. For LST, the missing data likely originate from the source product. Regarding nighttime light and the HII, the high percentage of missing data can be attributed to the use of the latest available date (December 2022) as the reference date for non-fire points: the nighttime light data extend only until January 2021, while the HII only covers data until January 2020. While addressing missing data is not the primary focus of this study, future studies may consider strategies such as substituting missing values with data from the previous year or month or using overall averages to fill the gaps. In the next section, we conduct a preliminary analysis to assess whether the generated datasets can offer insights into the behavior of fires in Peninsular Malaysia. It should be noted that Figure 3 and Figure 4 were generated using the full dataset [101], which was created based on the methodology described in Section 3.
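The previous-year substitution mentioned above can be sketched as follows. This is a minimal illustration and not part of the published framework: the function name `fallback_year` is hypothetical, and reading the end dates in Table 2 as the last available annual image (2021 for nighttime light, 2020 for the HII) is an assumption.

```python
# Annual coverage read from Table 2. Treating the end dates as the last
# available annual image is an assumption of this sketch.
COVERAGE = {
    "nighttime": (2012, 2021),
    "hii": (2001, 2020),
}

def fallback_year(target_year, first_year, last_year):
    """Return the year to sample: the target if covered, else the latest earlier year."""
    if target_year < first_year:
        return None  # nothing earlier exists to fall back on
    return min(target_year, last_year)

# Non-fire points use 2022 as the reference year.
substitutes = {name: fallback_year(2022, *span) for name, span in COVERAGE.items()}
print(substitutes)  # {'nighttime': 2021, 'hii': 2020}
```

Under this rule, a 2005 fire point would still have no nighttime light value, because no earlier image exists; such gaps would need a different strategy, for example an overall average.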

5. Assessing the Usability of the Forest Fire Dataset Leveraging a Large Language Model

5.1. ChatGPT (GPT-4) and Noteable Plugin

The primary objective of this subsection is to evaluate the suitability of the forest fire dataset for fire behavioral analysis. It is important to emphasize that this assessment does not encompass a comprehensive analysis of the dataset. Instead, our focus was on ensuring that the dataset contains the necessary information and variables needed for in-depth analyses, paving the way for future research.

A total of 7037 attributes (columns) were extracted based on the framework proposed in Section 3.1; however, conducting a detailed empirical analysis on all the attributes was beyond the scope of this study. Therefore, to ensure the forest fire dataset is logical and to streamline the evaluation process, the dataset was filtered to include only annual key features. Table 3 presents a high-level comparison between the full and filtered datasets. It is worth mentioning that the dataset filtering process was conducted locally, specifically targeting variables containing the keyword ‘annual’, resulting in the inclusion of only dynamic variables. Although static variables such as ESA land cover class, elevation, slope, and aspect are acknowledged as important features as discussed in Section 3.2, they were excluded from this preliminary analysis because our focus was primarily on the dynamic variables that changed annually and impacted forest fire behavior. This section intends to determine whether the generated forest fire dataset can provide valuable insights into the behavior of fires in Peninsular Malaysia.
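The keyword-based filtering described above can be sketched in a few lines. This is an illustration rather than the authors' script: the monthly and static column names in the sample list are hypothetical, the `keep` parameter is an assumption, and only the 'annual' keyword rule comes from the text.

```python
# Sample column names. The monthly and static names are hypothetical;
# only the 'annual' keyword rule is taken from the text.
columns = [
    "current_KBDI_annual",
    "current_LST_annual",
    "current_pr_annual",
    "current_pr_month01",  # hypothetical monthly column
    "elevation",           # static variable, lacks the 'annual' keyword
    "fire",                # class label
]

def select_annual(names, keep=("fire",)):
    """Keep columns whose name contains 'annual', plus explicitly listed ones."""
    return [name for name in names if "annual" in name or name in keep]

annual_columns = select_annual(columns)
print(annual_columns)
```

Because static variables such as elevation do not carry the keyword, a filter of this kind drops them automatically, which matches the exclusion noted in the paragraph above.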

The significant rise in the popularity of LLMs, such as ChatGPT [24], has created new opportunities for exploring and leveraging their capabilities in the analysis phase. In this study, we employed GPT-4 in conjunction with the Noteable Plugin to conduct the sample analysis discussed in the following subsections. Although Microsoft and Meta have recently released an open-source LLM called Llama 2 [125], it was not utilized in this work due to the unavailability of corresponding plugins. GPT-4, despite being a closed-source model, was employed because the available plugin is exclusive to this platform. It should be emphasized that our primary focus was on the analysis, not the specific LLM adopted.

Noteable, originally designed as a collaborative notebook platform, facilitates team data utilization and visualization through its secure cloud-based deployment, no-code visualizations, and expertly designed collaborative features, offering a unified data workspace for businesses. With the advent of ChatGPT and its increasing adoption, a new plugin has emerged, Noteable Plugin [25] with GPT-4, extending the platform’s functionalities, enabling the creation of notebooks that encompass exploratory analysis, data visualization, machine learning, and data manipulation through natural language prompts. One key feature of this plugin is its ability to generate all the Python scripts utilized for the analysis, encompassing figures, charts, tables, and more within the Noteable platform. This approach promotes open data science by providing not only the results but also the Python scripts used to generate them [15]. Consequently, researchers and analysts can access these scripts to reproduce the same results, enhancing source code reusability.

In the forthcoming discussions, we will delve into the analysis conducted through GPT-4 with the Noteable Plugin. It is essential to acknowledge that while this tool offers valuable insights, it may not be without minor errors. These imperfections were intentionally retained to emphasize that the plugin, while powerful, is not flawless and may require further refinement in the generated Python scripts for improved results. Nonetheless, it remains a valuable addition to a researcher’s or data analyst’s toolkit, offering a swift method for conducting preliminary analyses. To enhance transparency, the prompt history through ChatGPT via https://chat.openai.com (accessed on 21 May 2024) that was used to trigger the plugin was made accessible and can be found at https://github.com/chewyeejian/GEE_FrameworkForestFireDataset (accessed on 21 May 2024). Additionally, the scripts generated by the plugin through the prompts are also provided at the same GitHub repository. Hence, the second aim of this study was to evaluate the suitability of ChatGPT with the Noteable plugin as a tool for analysis, highlighting its potential and limitations.

It is important to emphasize that the two primary goals were to (i) evaluate whether the created dataset is usable and (ii) assess if the Noteable AI plugin can effectively serve as an alternative to manual human analysis. This section does not delve into a detailed analysis to identify influencing variables or to develop machine learning models in Malaysia. To verify the results of the analysis produced by the Noteable AI plugin, we performed a separate manual analysis. The code for this analysis is available on our GitHub repository, with subsequent results and analyses detailed in Appendix A.

5.2. Termination of the Noteable Plugin

It was announced that the Noteable plugin would be discontinued in December 2023, though no specific reasons were provided for this decision [126]. Despite its discontinuation, we postulate that the findings detailed within this paper might encourage the future development and adoption of similar tools within the academic research domain. It is important to note that, while direct access to active notebooks on the Noteable platform has ceased, an archived copy of the notebook has been preserved and is accessible through our GitHub repository: https://github.com/chewyeejian/GEE_FrameworkForestFireDataset (accessed on 21 May 2024). For those seeking alternatives, several ChatGPT-4 plugins offer coding assistance, including Code Interpreter [127] and Code Copilot [128].

5.3. Sample Analysis of Forest Fire Dataset in Peninsular Malaysia through GPT-4

In this subsection, the forest fire inventory dataset generated earlier for Peninsular Malaysia is explored and analyzed to assess its usability. As previously mentioned, the dataset had been locally filtered to encompass only the annual average features to facilitate a simplified analysis. This analysis was carried out solely through ChatGPT prompts, which trigger the Noteable environment to generate the results and analysis. As mentioned in Section 5.1, the corresponding prompt history and the associated Noteable notebook are available at the GitHub repository shared earlier. This analysis aimed to offer an initial glimpse into the potential utility of the forest fire dataset for understanding fire behavior in Peninsular Malaysia.

To begin the analysis, GPT-4 with the Noteable plugin was employed to establish a connection between ChatGPT prompts and the Noteable platform. After the successful establishment of the linkage, ChatGPT was directed through prompts to execute various operations on the dataset. For instance, it was instructed to first analyze the dataset, determining the number of rows, the total number of columns, and listing all the column names. This revealed that the dataset comprises a total of 11,083 rows, aligning with the total number of rows in the full dataset. However, after the local filtering process to include only the annual average features, only 40 columns remained available, as detailed in Table 4. To refine the dataset for fire-focused analyses, columns irrelevant to the analytical objectives were removed. These include date/shape categories inherited from MCD64A1 (i.e., system:index, Shape_Leng, Shape_Area, date, year, month, day, validOn), various administrative boundary categories (i.e., ADM0_EN, ADM1_EN, ADM2_EN, ADM0_PCODE, ADM1_PCODE, ADM2_PCODE), and others (i.e., longitude, latitude).

5.3.1. Sample Analysis—Missing Feature Analysis with GPT-4 and Noteable Plugin

Following the data filtering process, an assessment to discover the missing features was carried out, akin to the analysis conducted on the full dataset. As depicted in Figure 5, variables such as nighttime light, the HII, and KBDI exhibit a high percentage of missing data. This trend aligns with the observations made in the analysis of the full dataset and underscores the temporal availability of the data, which does not extend to 2022. This limitation arises from the chosen reference date for non-fire points. To elaborate further:

Nighttime Light: Approximately 80% of the data was missing. Specifically, data were only available for 1960 out of the 5557 fire points. Additionally, data were missing for all 5526 non-fire points. The nighttime light data span from 1 April 2012 to 1 January 2021, which resulted in at least 10 years of missing data (from 2001 to 2011). Furthermore, with the reference date set to 2022 (the latest year), data were absent for all non-fire points due to this cutoff.
HII: Approximately 50% of the data was missing, with no data available for the 5526 non-fire points. The HII data cover the period from 1 January 2001 to 1 January 2020. The reference year being set to 2022 led to a lack of data for all non-fire points.
KBDI: Approximately 20% of the data was missing, with data only available for 3404 out of the 5557 fire points. It is important to note that data for all non-fire points (5526 out of 5526) are available. The KBDI data from 1 January 2007 to the present are accessible, indicating that data from 2001 to 2006 were missing, covering approximately 6 years.
LST: The LST annual average did not appear to present an issue. However, the high percentage of missing data in the full dataset (Figure 4) may be attributed to the missing monthly values.
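The approximate percentages in the list above can be checked from the reported counts. The sketch below is only an arithmetic check. For the HII, the text does not state how many fire points have data, so the calculation assumes that every fire point has an HII value and therefore gives a lower bound.

```python
TOTAL_FIRE = 5557
TOTAL_NON_FIRE = 5526
TOTAL = TOTAL_FIRE + TOTAL_NON_FIRE  # 11,083 rows

# (fire points with data, non-fire points with data), as reported in the text.
available = {
    "nighttime": (1960, 0),
    # Assumption: all fire points have an HII value (not stated in the text),
    # so this entry yields a lower bound on the missing share.
    "hii": (TOTAL_FIRE, 0),
    "kbdi": (3404, TOTAL_NON_FIRE),
}

def missing_pct(fire_ok, non_fire_ok):
    """Percentage of all rows that lack a value."""
    return 100 * (TOTAL - fire_ok - non_fire_ok) / TOTAL

missing = {name: round(missing_pct(*counts), 1) for name, counts in available.items()}
print(missing)  # {'nighttime': 82.3, 'hii': 49.9, 'kbdi': 19.4}
```

These values agree with the rounded figures of roughly 80%, 50%, and 20% given above.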

It is essential to emphasize that the primary objective of this analysis was to showcase the tool’s effectiveness in evaluating the dataset. Therefore, this preliminary examination did not include any methods for replacing the missing values.

5.3.2. Sample Analysis—Basic Statistical Analysis with GPT-4 and Noteable Plugin

To provide further insights into the dataset, Table 5 presents the statistical mean and standard deviation of key features in relation to the fire class.

The formula for calculating the mean is expressed as follows:

μ = \frac{\sum_{i = 1}^{n} x_{i}}{n}    (1)

where μ represents the mean, x_{i} represents each individual data point, and n is the total number of data points.

The calculation of the standard deviation can be written as follows:

σ = \sqrt{\frac{\sum_{i = 1}^{n} {(x_{i} - μ)}^{2}}{n}}    (2)

where σ is the standard deviation, x_{i} represents each individual data point, μ represents the mean, and \sum_{i = 1}^{n} {(x_{i} - μ)}^{2} computes the squared deviations of each data point from the mean.
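Equations (1) and (2) can be verified numerically with a short sketch. The values below are illustrative and are not taken from the dataset.

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative numbers

n = len(values)
mu = sum(values) / n                                        # Equation (1)
sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)   # Equation (2)

print(mu, sigma)  # 5.0 2.0

# Equation (2) divides by n, which matches the population standard deviation.
population_sd = statistics.pstdev(values)
# The sample standard deviation divides by n - 1 and is slightly larger.
sample_sd = statistics.stdev(values)
```

Note that some libraries, including pandas, divide by n − 1 by default when computing a standard deviation. The text does not state which convention the plugin-generated script used, so small differences from Equation (2) are possible in Table 5.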

Several noteworthy observations emerge from this table. First, the total count of average annual values observed for current0101_average_annual_nighttime and current0101_hii_annual was 0 for the non-fire category, aligning with our earlier discussions regarding the reference date for non-fire scenarios. Additionally, the current_swe_annual (snow water equivalent) feature remained at 0 for both fire and non-fire classes, which is reasonable given the absence of snowfall in Malaysia throughout the entire year. From the table, higher mean values were observed for KBDI, LST, AET, DEF, PET, and VPD for fire conditions, indicating drier conditions. The lower values of PDSI, PR, and RO also denote drier conditions. On the other hand, higher values of SRAD, TMMN, TMMX, and wind speed (vs) suggest more favorable conditions for fire incidents.

5.3.3. Sample Analysis—Boxplot Analysis with GPT-4 and Noteable Plugin

A box plot, also known as a box-and-whisker plot, which visually represents the data distribution using a five-number summary (minimum, first quartile, median, third quartile, and maximum), is presented in Figure 6. It is important to emphasize that the figures and graphs in this section were generated using the plugin and were intentionally left as is to highlight their flaws and imperfections. For the validated outputs obtained through manual coding, refer to Appendix A.

In non-fire conditions, BurnDate carries no information because it was set to the placeholder value −1, with 2022 as the reference year. In fire scenarios, however, the median fell around day 60 of the year, with the interquartile range (first to third quartile) spanning approximately day 40 to day 150. This suggests that fires are generally prevalent from February to around May.
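The mapping from day-of-year values to calendar months can be made explicit with a small conversion. The sketch assumes that BurnDate is a day-of-year value, as the text implies, and uses 2021 as an arbitrary non-leap year; the function name is hypothetical.

```python
from datetime import date, timedelta

def day_of_year_to_date(year, day_of_year):
    """Convert a 1-based day of year into a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)

# Approximate first quartile, median, and third quartile from the box plot.
converted = {doy: day_of_year_to_date(2021, doy).isoformat() for doy in (40, 60, 150)}
print(converted)
# {40: '2021-02-09', 60: '2021-03-01', 150: '2021-05-30'}
```

The quartiles therefore fall in early February and late May, with the median at the start of March.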

For the features current_KBDI_annual, current_LST_annual, current_aet_annual, current_def_annual, current_pdsi_annual, current_pet_annual, current_pr_annual, current_ro_annual, current_tmmn_annual, current_tmmx_annual, current_vap_annual, current_vpd_annual, and current_vs_annual, distinct medians were apparent when comparing non-fire scenario (i.e., fire = 0) to fire scenario (i.e., fire = 1). This disparity suggests that these features exhibit varying central tendencies in areas with fire. It is important to note that the black line inside the blue box in the diagram corresponds to the median.

The results obtained from the box plot analysis appear to be quite reasonable because most of the median values observed from the meteorological variables suggest conditions favorable for fire incidents. For instance, higher values of KBDI indicate drier conditions, elevated LST values signify higher temperatures, increased AET and PET values represent more significant water losses due to evaporation, lower PDSI and higher DEF (climate water deficit) values imply moisture deficits, lower PR (precipitation) levels indicate reduced moisture conditions, and decreased RO (runoff) values denote less water flowing over the land surface, signifying drier conditions or reduced water availability. In addition, higher values of SRAD indicate more solar energy reaching the Earth’s surface, while the higher median temperature values (TMMN and TMMX) suggest a more favorable environment for fire occurrence. Elevated VPD values, associated with dry air, can promote the rapid drying of vegetation, while higher wind speed (vs) values may accelerate fire spread. Conversely, the lower median values for NDVI and EVI underscore the presence of less green vegetation at fire-affected sites. This analysis offers valuable insights into the interplay between various features and fire incidents.

5.3.4. Sample Analysis—t-Test Statistical Tests with GPT-4 and Noteable Plugin

The t-test is a valuable tool for comparing the means of two groups and determining whether the observed differences between them are statistically significant. In this analysis, t-tests were applied exclusively to the numeric columns, and any columns containing missing data were excluded from the analysis. The t-test methodology involves comparing the means of two distinct groups and generating p-values, which quantitatively express the statistical significance of the observed differences. These p-values are subsequently compared to a predefined significance level, typically set at 0.05, to make informed decisions regarding the null hypothesis. The groups in question here are as follows: group 1 comprises variables associated with non-fire scenarios (identified by “fire = 0”), while group 2 encompasses variables linked to fire scenarios (designated by “fire = 1”). The t-tests, as illustrated in Table 6, yielded both t-statistic values and corresponding p-values, aiding in the assessment of the statistical significance within the dataset.

In general, the magnitude of a t-statistic measures the difference between the two group means relative to the variability within the groups. A larger magnitude indicates a more substantial disparity between the means of the non-fire and fire groups, while a magnitude close to 0 suggests that the means of both groups are quite similar. Examining the table, we observe that t-statistic values for most key features exhibit a substantial magnitude, with the exception of current_soil_annual, which hovers closer to 0. Furthermore, the p-values generated for the majority of key variables, except for current_soil_annual, current_0101_average_annual_nighttime, current0101_hii_annual, and current_swe_annual, are extremely low, effectively reaching 0. This indicates that the differences observed between non-fire and fire conditions are indeed statistically significant for most key features.
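How the magnitude of a two-group t-statistic behaves can be illustrated with a small sketch. The text does not specify which t-test variant the plugin used, so the example below uses Welch's form (unequal variances) as an assumption; the data are illustrative, and the p-value, which requires the t-distribution, is left to a statistics library.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t-statistic: difference in means over its standard error."""
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

non_fire = [10.0, 12.0, 11.0, 13.0, 9.0]   # group 1 (fire = 0), illustrative
fire = [20.0, 22.0, 21.0, 23.0, 19.0]      # group 2 (fire = 1), illustrative
similar = [10.5, 11.5, 11.0, 12.5, 9.5]    # same mean as non_fire

t_distinct = welch_t(non_fire, fire)     # large magnitude: means differ clearly
t_similar = welch_t(non_fire, similar)   # zero: means are identical
print(t_distinct, t_similar)
```

The sign of the statistic depends only on the order of the two groups; it is the magnitude that indicates how far apart the means are.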

In the case of current_soil_annual, the t-statistic and p-value indicate that the difference is not statistically significant. For current_0101_average_annual_nighttime and current0101_hii_annual, the results are not available due to missing data, as highlighted in Figure 5. With the non-fire data for these variables absent, it is not possible to perform the statistical tests. It should be noted that current_0101_LC_Type2_annual is not ideally suited for this analysis as it represents a categorical attribute converted to a numeric form, rendering it less relevant. Similarly, BurnDate should not be included as a key feature or predictor in this analysis; it was retained here intentionally because these results were generated directly through the ChatGPT prompts.

5.3.5. Sample Analysis—Variance Inflation Factor with GPT-4 and Noteable Plugin

The Variance Inflation Factor (VIF) is a crucial metric used to identify multicollinearity in regression analyses. Multicollinearity refers to a scenario in which two or more predictor variables in a regression model exhibit high correlations. To address multicollinearity issues, it is generally recommended to either eliminate or combine features with high VIF values.

Mathematically, the VIF can be represented as

{VIF}_{i} = \frac{1}{1 - R_{i}^{2}}    (3)

where {VIF}_{i} is the Variance Inflation Factor for the i-th predictor and R_{i}^{2} is the coefficient of determination (R-squared value) obtained by regressing the i-th predictor on all the other predictors. Further details related to the VIF formula can be found in [129].
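Equation (3) can be worked through for the simplest case of two predictors, where regressing one predictor on the other gives an R-squared equal to the squared Pearson correlation. The sketch below covers only this special case; with more predictors, R_{i}^{2} must come from a multiple regression. The data and function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF from Equation (3), valid when there are exactly two predictors."""
    r_squared = pearson(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]        # correlation with x1 is 0.8
x3 = [1.1, 1.9, 3.2, 3.9, 5.0]        # almost a copy of x1

vif_moderate = vif_two_predictors(x1, x2)   # 1 / (1 - 0.64), about 2.78
vif_high = vif_two_predictors(x1, x3)       # far larger: near-collinear pair
print(round(vif_moderate, 2), vif_high > 10)
```

As the correlation approaches 1, the denominator approaches 0 and the VIF grows without bound, which is why very large VIF values flag redundant predictors.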

For this analysis, the Noteable plugin considered only numeric variables, excluding any rows containing null values. While we were able to reproduce the results shown in Appendix A, Figure A5, we deemed the analysis produced in this subsection invalid. Although the plugin successfully executed the analysis, in our opinion, dropping all rows with null values using the dropna function from pandas is not an effective method for handling missing data in this case. This approach resulted in a significant reduction in the dataset, leaving only 1841 rows (out of 11,083) available for the VIF analysis (Figure 7). Given the high percentage of missing data in the current_0101_average_annual_nighttime and current_0101_hii_annual columns, we believe that it would have been a better choice to exclude these two columns rather than dropping all rows with null values. Additionally, the plugin automatically changed the x-axis (VIF value) scale to a logarithmic scale instead of the standard linear scale. While it is commendable to perform such changes automatically to accommodate a wide range of values in a graph, users should always double-check the scale settings to ensure that the representation accurately reflects the underlying data.
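The difference between dropping rows and dropping sparse columns can be seen on a toy example. The sketch uses plain Python records instead of pandas, and the rows, column names, and helper functions are hypothetical; it only illustrates why excluding a mostly empty column preserves far more data.

```python
# Toy records: 'nighttime' is missing for most rows, as in the real dataset.
rows = [
    {"fire": 1, "kbdi": 500.0, "nighttime": 1.2},
    {"fire": 1, "kbdi": 620.0, "nighttime": None},
    {"fire": 0, "kbdi": 300.0, "nighttime": None},
    {"fire": 0, "kbdi": 280.0, "nighttime": None},
]

def drop_rows_with_nulls(records):
    """Row-wise deletion: keep only records with no missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_columns(records, columns):
    """Column-wise exclusion: remove the named columns from every record."""
    return [{k: v for k, v in r.items() if k not in columns} for r in records]

row_wise = drop_rows_with_nulls(rows)
column_first = drop_rows_with_nulls(drop_columns(rows, {"nighttime"}))
print(len(row_wise), len(column_first))  # 1 4
```

Row-wise deletion keeps a single record here, whereas excluding the sparse column first keeps all four at the cost of one variable.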

5.4. Limitation of GPT-4 with Noteable Plugin

In our evaluation of GPT-4 with the Noteable plugin, we observed a combination of successful outcomes and challenges. The plugin effectively conducted basic statistical analyses, boxplot visualizations, and t-test statistics, which provided valuable insights, as detailed in the previous subsection. However, it is important to highlight certain behaviors of the plugin that users should be cautious of.

The plugin’s use of an LLM means that it retains a memory of previous interactions [130]. This can lead to the unintended persistence of changes made to the dataset. For example, in our validation of the basic statistical analysis (Appendix A, Figure A2), it was noticed that the column BurnDate was omitted from the basic statistical analysis of mean and standard deviation in Table 5. This omission resulted from an earlier prompt.

The plugin often attempts to execute analyses by dropping null values or limiting the analysis to only numeric values. For instance, during the t-test statistics, it automatically excluded non-numeric variables. In the VIF analysis, it dropped all the rows containing null values, which we found compromised the validity of the analysis. Additionally, the plugin automatically converted the x-axis from a linear to a logarithmic scale to enhance graph visualization. Therefore, it is crucial for users to verify that the scale adjustment accurately reflects their intended analysis.

These observations indicate that the plugin (i) can retain modifications from previous interactions, potentially affecting the dataset and subsequent analyses and (ii) tends to drop or select only relevant columns to facilitate the execution of tests or analyses. Users should be mindful of these characteristics to ensure accurate results.

5.5. Incomplete Tests by GPT-4 with Noteable Plugin

The previous section highlighted the limitations of the analyses and tests that were performed successfully. Several other tests suggested by the plugin, however, encountered errors and were not completed. It is important to emphasize that the suggestions for various statistical and feature importance tests were generated by ChatGPT. This subsection delves into the tests and analyses that failed to be completed.

The first test that encountered errors is the feature importance ranking utilizing the Random Forest machine learning model. This analysis aims to rank features based on their importance in predicting the fire class, offering insights into which features strongly influence fire occurrences. However, the prompts consistently detected only one class, preventing model training and the calculation of feature importance through Random Forest. A similar issue also arose when examining logistic regression coefficients. This anomaly could be attributed to the unaddressed missing data, which may require further attention.
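One mechanism consistent with the single-class error is sketched below. The text only states that unaddressed missing data could be the cause, so this is a plausible illustration rather than a confirmed diagnosis: if a script removes every record containing a null, and a variable is missing for all non-fire points, then only fire records survive. The records and the helper name are hypothetical.

```python
from collections import Counter

# Toy records in which 'hii' is missing for every non-fire point.
records = [
    {"fire": 1, "kbdi": 510.0, "hii": 12.0},
    {"fire": 1, "kbdi": 640.0, "hii": 9.5},
    {"fire": 0, "kbdi": 310.0, "hii": None},
    {"fire": 0, "kbdi": 290.0, "hii": None},
]

def class_counts_after_null_removal(rows, label="fire"):
    """Count class labels among records that have no missing values."""
    complete = [r for r in rows if None not in r.values()]
    return Counter(r[label] for r in complete)

counts = class_counts_after_null_removal(records)
print(counts)  # Counter({1: 2})
```

Checking the class counts in this way before training would reveal the problem early, since a classifier cannot be fitted when only one class remains.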

In addition, attempts to conduct multiple statistical tests in a single prompt using numerous suggestions from ChatGPT responses resulted in the conversation crashing. This issue occurred during multiple retries on logistic regression coefficients, recursive feature elimination, chi-squared tests, mutual information, and ANOVA (analysis of variance) tests. The crash may be attributed to exceeding a certain output limit. It is important to note that these errors or limitations may also be caused by the imperfect prompts used to instruct ChatGPT to perform specific tasks, highlighting the need for more precise prompts to ensure accurate execution.

5.6. Considerations for Adopting AI Plugins in Analyses

While the plugin has demonstrated its capability to generate graphs, charts, and analysis insights in Section 5.3 through multiple prompts, it is equally important for data users to have a solid grasp of the specific analysis or tests being conducted. This understanding is crucial for interpreting the analysis results effectively and deriving deeper insights from the generated outcomes.

Additional studies are needed to validate the use of these tools in analysis. While the limitations of the plugins are acknowledged, as described in Section 5.4 and Section 5.5, the results presented in Section 5.3 suggest that embracing these tools as a part of one’s toolkit for preliminary analyses or assessments is warranted. However, given the noted limitations, it remains crucial to perform additional checks to verify the accuracy and intended outcomes of the analyses. From our perspective, considering that most tests and procedures were proposed by the plugin, this tool can significantly benefit data analysts and researchers seeking guidance on which available tests to perform.

6. Future Works

In this study, the proposed framework relied solely on the MCD64A1 BA dataset for extracting fire points in the study area. It is important to acknowledge that this dataset may not be entirely accurate and could miss some fire occurrences [131]. However, for the purposes of this study, it provided a rapid means to discover historical fire locations. For future work, improving the collection of historical fire points could involve incorporating other BA datasets such as FireCCI51 v5.1 [88], Globfire Fire Event [89], or region-specific government data. Additionally, future work should also consider addressing the significant amount of missing data in nighttime light and the HII for non-fire points by employing methods such as replacing the missing entries with data from the previous year or month, or by adopting overall averages to fill the gaps. While the provided scripts demonstrate the feasibility of the proposed framework, creating a GUI version could enhance user convenience.

While GPT-4 with the Noteable plugin proved useful for analysis, subsequent validation is needed to confirm the tool’s applicability and limitations. The sample analysis conducted with GPT-4 and the Noteable plugin focused solely on the annual averages for the year of the fire. A more comprehensive analysis that considers all variables would provide a deeper understanding of the critical factors influencing fires in Malaysia. Furthermore, the dataset can be used to train a machine learning model to forecast future fire occurrences, which would serve as a valuable resource for proactive fire management strategies.

The versatility of the proposed framework extends beyond forest fires. By supplying the geographical coordinates of other disaster events, it becomes feasible to extract the pertinent remote sensing features from GEE and analyze the occurrence of these disasters using the same methodology applied in this study. Much of the remote sensing data employed in this research is accessible through the GEE data catalog, and the GEE community catalog [132] offers a multitude of additional datasets, further broadening the scope of potential applications. Future work could explore the use of these diverse datasets for different types of environmental assessments and disaster management efforts.

7. Conclusions

The primary contribution of this study lies in the proposed framework, which includes the scripts for swiftly generating forest fire inventory datasets from GEE. This methodology is easily replicable for various locations. Peninsular Malaysia served as the case study to showcase the effectiveness of the proposed framework. Because the dataset was generated without any private government or organizational data, it can be openly accessed and shared without permission from government authorities or other organizations. The framework greatly lowers the barriers for data scientists, enabling them to apply their analytical skills directly to the GEE-extracted datasets and reducing the need for in-depth remote sensing knowledge.

The second contribution of this work involves the successful adoption and demonstration of an LLM, specifically GPT-4 with the Noteable plugin, as a tool for conducting preliminary analysis on the generated dataset. The sample analysis revealed valuable insights into the fire scenarios in Peninsular Malaysia. The key factors affecting forest fires in this region, based on the preliminary analysis of the annual averages referencing the year of fire, included KBDI, LST, PDSI, DEF, PR, RO, SRAD, TMMX, TMMN, and VPD. Section 5.4 and Section 5.5 discussed the limitations of GPT-4 with the Noteable plugin. It is important to note that no manual coding was performed during the analysis; rather, the analysis and Python scripts producing the results were generated through simple prompts within the ChatGPT interface. As technology continues to evolve, researchers at the forefront should consider adopting such technologies to enhance their methodologies and analyses.

In conclusion, the key contributions of this work can be summarized as follows: (i) the proposed framework enables the swift generation of forest fire inventory datasets from GEE, facilitating open access and sharing without restrictions; (ii) an LLM, specifically GPT-4 with the Noteable plugin, was successfully adopted to automate a preliminary analysis; (iii) the case study in Peninsular Malaysia demonstrated the framework’s effectiveness and the utility of the GPT-4 Noteable plugin; and (iv) a forest fire factor inventory dataset for Peninsular Malaysia, containing all the relevant variables and generated without reliance on private data sources, was published as a public dataset [48] to facilitate a more comprehensive analysis of fire factors in the future.

Author Contributions

Conceptualization, Y.J.C.; Methodology, Y.J.C. and S.Y.O.; Formal Analysis, Y.J.C.; Investigation, Y.J.C. and S.Y.O.; Visualization, Y.J.C. and Z.Y.L.; Writing—Original Draft, Y.J.C.; Funding Acquisition, S.Y.O.; Resources, S.Y.O.; Project Administration, S.Y.O.; Supervision, S.Y.O. and Y.H.P.; Writing—Review and Editing, S.Y.O. and Y.H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by a Fundamental Research Grant Scheme (FRGS) under the Ministry of Education and Multimedia University, Malaysia (Project ID: FRGS/1/2020/ICT02/MMU/02/2).

Data Availability Statement

The complete Python scripts for extracting fire factors through GEE, the filtered forest fire dataset containing only the annual variables, a copy of the Noteable Python scripts generated by the ChatGPT prompt, our own Python script to verify the results generated by Noteable, and the ChatGPT conversation history used to trigger the plugin for generating the analysis can be easily accessed at the following GitHub repository: https://github.com/chewyeejian/GEE_FrameworkForestFireDataset (accessed on 21 May 2024). The full unprocessed forest fire dataset for Peninsular Malaysia is available at https://doi.org/10.5281/zenodo.10050852 [48]. Information related to the GEE datasets mentioned in Table 2 can be accessed at https://developers.google.com/earth-engine/datasets/catalog (accessed on 21 May 2024).

Acknowledgments

We extend our heartfelt gratitude to Morgan A. Crowley (Forest Fire Research Scientist at Natural Resources Canada) for the valuable suggestions and advice provided on the proposed framework.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

We repeated the analysis from Section 5.3 using our own program code to validate the results presented. We are grateful to one of the anonymous reviewers, whose suggestions led us to include this additional section. The results presented in this appendix confirm that we obtained the same outcomes as those produced by the ChatGPT Noteable plugin prompts. The previously discussed discrepancies include the removal of BurnDate in the basic statistical analysis shown in Figure A2 and the dropping of all rows containing null values in the VIF analysis shown in Figure A5. Additionally, we found that customizing the charts is sometimes more straightforward when writing the code ourselves. The scripts used for this validation are available in our GitHub repository, as detailed in the Data Availability Statement.

Figure A1. Percentage of missing data in the filtered forest fire dataset (validated).

Figure A2. Statistical mean and standard deviation of the key features by fire class (validated). (Note: The ‘BurnDate’ column is not available in Table 5 because it was removed by the Noteable plugin.)

Figure A3. Boxplot analysis for each key feature for fire and non-fire points (validated).

Figure A4. t-test statistics and p-value test results (validated).

Figure A5. Variance inflation factor for key features (validated). (Note: The analysis is considered invalid as the plugin dropped all rows containing null values, leaving only 1841 rows for this analysis.)
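
The note on Figure A5 highlights how removing every row that contains a null value can shrink a dataset. The following minimal sketch uses made-up values, and the `nightlight` field name is a hypothetical placeholder. It contrasts the per-column percentage of missing data with the number of rows that survive such listwise deletion.

```python
def missing_percentage(rows, columns):
    """Percentage of missing (None) entries in each column."""
    total = len(rows)
    return {
        col: (100.0 * sum(1 for row in rows if row.get(col) is None) / total)
        if total
        else 0.0
        for col in columns
    }


def complete_cases(rows, columns):
    """Keep only rows in which every listed column has a value (listwise deletion)."""
    return [row for row in rows if all(row.get(col) is not None for col in columns)]


# Made-up sample rows; the values are illustrative only.
rows = [
    {"KBDI": 120.0, "LST": 30.1, "nightlight": None},
    {"KBDI": 95.0, "LST": None, "nightlight": 1.2},
    {"KBDI": 110.0, "LST": 29.5, "nightlight": 0.8},
    {"KBDI": 130.0, "LST": 31.0, "nightlight": None},
]
columns = ["KBDI", "LST", "nightlight"]

pct = missing_percentage(rows, columns)
kept = complete_cases(rows, columns)

print(pct)
print(f"{len(kept)} of {len(rows)} rows survive listwise deletion")
```

In this toy example no column is more than half empty, yet only one of the four rows is complete. It is therefore worth checking the remaining row count before interpreting any test that silently drops incomplete rows.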

Appendix B

Appendix B provides more information on the study area described in Section 4.1. Malaysia is a country in Southeast Asia, consisting of Peninsular Malaysia and East Malaysia. This study focused on the whole of Peninsular Malaysia, where the capital, Kuala Lumpur, is located. Figure A6 illustrates the location of Peninsular Malaysia.

Malaysia is a tropical country and one of the 17 megadiverse countries, home to a vast diversity of flora and fauna [133]. According to a news article published in 2023, Malaysia’s Minister of Natural Resources, Environment and Climate Change (NRECC) stated that forest covers 54.6% (18.04 million hectares) of the country’s total land area [134]. Temperatures range from approximately 21 °C to 32 °C throughout the year, and the country experiences only sunny and rainy weather. Figure A7 presents an elevation map, while Figure A8 and Figure A9 show the mean maximum temperature and mean precipitation for 2023, respectively. The elevation data were sourced from NASADEM [117], and the temperature and precipitation data were sourced from TerraClimate [111]. GEE was used to generate Figure A7, Figure A8 and Figure A9.

Figure A6. Location of the study area, Peninsular Malaysia.

Figure A7. Elevation map of Peninsular Malaysia generated using NASADEM data.

Figure A8. Mean maximum temperature of Peninsular Malaysia for 2023, sourced from TerraClimate data.

Figure A9. Mean precipitation of Peninsular Malaysia for 2023, sourced from TerraClimate data.

References

Arif, M.; Alghamdi, K.K.; Sahel, S.A.; Alosaimi, S.O.; Alsahaft, M.E.; Alharthi, M.A.; Arif, M. Role of Machine Learning Algorithms in Forest Fire Management: A Literature Review. J. Robot. Autom. 2021, 5, 212–226.
Bot, K.; Borges, J.G. A Systematic Review of Applications of Machine Learning Techniques for Wildfire Management Decision Support. Inventions 2022, 7, 15.
Moayedi, H.; Mehrabi, M.; Bui, D.T.; Pradhan, B.; Foong, L.K. Fuzzy-Metaheuristic Ensembles for Spatial Assessment of Forest Fire Susceptibility. J. Environ. Manag. 2020, 260, 109867.
Bui, D.T.; Bui, Q.-T.; Nguyen, Q.-P.; Pradhan, B.; Nampak, H.; Trinh, P.T. A Hybrid Artificial Intelligence Approach Using GIS-Based Neural-Fuzzy Inference System and Particle Swarm Optimization for Forest Fire Susceptibility Modeling at a Tropical Area. Agric. For. Meteorol. 2017, 233, 32–44.
Bui, D.T.; Hoang, N.-D.; Samui, P. Spatial Pattern Analysis and Prediction of Forest Fire Using New Machine Learning Approach of Multivariate Adaptive Regression Splines and Differential Flower Pollination Optimization: A Case Study at Lao Cai Province (Viet Nam). J. Environ. Manag. 2019, 237, 476–487.
Sevinc, V.; Kucuk, O.; Goltas, M. A Bayesian Network Model for Prediction and Analysis of Possible Forest Fire Causes. For. Ecol. Manag. 2020, 457, 117723.
Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone. Remote Sens. Environ. 2017, 202, 18–27.
Ban, Y.; Zhang, P.; Nascetti, A.; Bevington, A.R.; Wulder, M.A. Near Real-Time Wildfire Progression Monitoring with Sentinel-1 SAR Time Series and Deep Learning. Sci. Rep. 2020, 10, 1322.
Hodges, J.L.; Lattimer, B.Y. Wildland Fire Spread Modeling Using Convolutional Neural Networks. Fire Technol. 2019, 55, 2115–2142.
Zhang, G.; Wang, M.; Liu, K. Forest Fire Susceptibility Modeling Using a Convolutional Neural Network for Yunnan Province of China. Int. J. Disaster Risk Sci. 2019, 10, 386–403.
Jiao, Z.; Zhang, Y.; Xin, J.; Mu, L.; Yi, Y.; Liu, H.; Liu, D. A Deep Learning Based Forest Fire Detection Approach Using UAV and YOLOv3. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5.
Wang, S.; Zhao, J.; Ta, N.; Zhao, X.; Xiao, M.; Wei, H. A Real-Time Deep Learning Forest Fire Monitoring Algorithm Based on an Improved Pruned + KD Model. J. Real-Time Image Process. 2021, 18, 2319–2329.
Trucchia, A.; Meschi, G.; Fiorucci, P.; Gollini, A.; Negro, D. Defining Wildfire Susceptibility Maps in Italy for Understanding Seasonal Wildfire Regimes at the National Level. Fire 2022, 5, 30.
Nur, A.S.; Kim, Y.J.; Lee, J.; Lee, C.-W. Spatial Prediction of Wildfire Susceptibility Using Hybrid Machine Learning Models Based on Support Vector Regression in Sydney, Australia. Remote Sens. 2023, 15, 760.
Gomes, V.C.F.; Queiroz, G.R.; Ferreira, K.R. An Overview of Platforms for Big Earth Observation Data Management and Analysis. Remote Sens. 2020, 12, 1253.
Wu, Q. Geemap: A Python Package for Interactive Mapping with Google Earth Engine. J. Open Source Softw. 2020, 5, 2305.
Nye, B.D.; Mee, D.; Core, M.G. Generative Large Language Models for Dialog-Based Tutoring: An Early Consideration of Opportunities and Concerns. CEUR Workshop Proc. 2023, 3487, 78–88.
Lin, Z. Why and How to Embrace AI Such as ChatGPT in Your Academic Life. R. Soc. Open Sci. 2023, 10, 230658.
Ziems, C.; Held, W.; Shaikh, O.; Chen, J.; Zhang, Z.; Yang, D. Can Large Language Models Transform Computational Social Science? Comput. Linguist. 2024, 50, 237–291.
Meyer, J.G.; Urbanowicz, R.J.; Martin, P.C.N.; O’Connor, K.; Li, R.; Peng, P.C.; Bright, T.J.; Tatonetti, N.; Won, K.J.; Gonzalez-Hernandez, G.; et al. ChatGPT and Large Language Models in Academia: Opportunities and Challenges. BioData Min. 2023, 16, 20.
Liu, S.C.; Wang, S.K.; Lin, W.; Hsiung, C.W.; Hsieh, Y.C.; Cheng, Y.P.; Luo, S.H.; Chang, T.; Zhang, J. JarviX: A LLM No Code Platform for Tabular Data Analysis and Optimization. In Proceedings of the EMNLP 2023—2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, Singapore, 6–10 December 2023; pp. 622–630.
Caglayan, A.; Slusarczyk, W.; Rabbani, R.D.; Ghose, A.; Papadopoulos, V.; Boussios, S. Large Language Models in Oncology: Revolution or Cause for Concern? Curr. Oncol. 2024, 31, 1817–1830.
Zvarevashe, K.; Mapanga, I.; Kadebu, P. A Technical Evaluation of the Performance of Classical Artificial Intelligence (AI) and Methods Based on Computational Intelligence (CI) i.e Supervised Learning, Unsupervised Learning And Ensemble Algorithms In Intrusion Detection Systems. In Proceedings of the Annual Conference of UbuntuNet Alliance 2016, Entebbe, Uganda, 30 October–4 November 2016.
OpenAI. ChatGPT [Large Language Model]. Available online: https://chat.openai.com/chat (accessed on 21 October 2023).
Noteable. Noteable (ChatGPT Plugin for Notebook). Available online: https://noteable.io/chatgpt-plugin-for-notebook/ (accessed on 21 October 2023).
Chicas, S.D.; Østergaard Nielsen, J. Who Are the Actors and What Are the Factors That Are Used in Models to Map Forest Fire Susceptibility? A Systematic Review. Nat. Hazards 2022, 114, 2417–2434.
Pradhan, B.; Bin Suliman, M.D.H.; Bin Awang, M.A. Forest Fire Susceptibility and Risk Mapping Using Remote Sensing and Geographical Information Systems (GIS). Disaster Prev. Manag. 2007, 16, 344–352.
Pu, R.; Li, Z.; Gong, P.; Csiszar, I.; Fraser, R.; Hao, W.-M.; Kondragunta, S.; Weng, F. Development and Analysis of a 12-Year Daily 1-Km Forest Fire Dataset across North America from NOAA/AVHRR Data. Remote Sens. Environ. 2007, 108, 198–208.
Lestari, A.; Rumantir, G.; Tapper, N. A Spatio-Temporal Analysis on the Forest Fire Occurrence in Central Kalimantan, Indonesia. In Proceedings of the 20th Pacific Asia Conference on Information Systems, Chiayi, Taiwan, 27 June–1 July 2016; p. 90.
Monjarás-Vega, N.A.; Briones-Herrera, C.I.; Vega-Nieva, D.J.; Calleros-Flores, E.; Corral-Rivas, J.J.; López-Serrano, P.M.; Pompa-García, M.; Rodríguez-Trejo, D.A.; Carrillo-Parra, A.; González-Cabán, A. Predicting Forest Fire Kernel Density at Multiple Scales with Geographically Weighted Regression in Mexico. Sci. Total Environ. 2020, 718, 137313.
Abid, F. A Survey of Machine Learning Algorithms Based Forest Fires Prediction and Detection Systems. Fire Technol. 2021, 57, 559–590.
Esri. Introducing ArcGIS Platform|Esri. Available online: https://www.esri.com/en-us/home (accessed on 13 March 2021).
QGIS Development Team. Welcome to the QGIS Project! Available online: https://www.qgis.org/en/site/ (accessed on 13 March 2021).
National Centers for Environmental Information. Daily Weather Records|Data Tools|Climate Data Online (CDO)|National Climatic Data Center (NCDC). Available online: https://www.ncdc.noaa.gov/cdo-web/datatools/records (accessed on 5 April 2021).
Cartus, O.; Kellndorfer, J.; Walker, W.; Franco, C.; Bishop, J.; Santos, L.; Fuentes, J.M.M. A National, Detailed Map of Forest Aboveground Carbon Stocks in Mexico. Remote Sens. 2014, 6, 5559–5588.
Pourghasemi, H.R. GIS-Based Forest Fire Susceptibility Mapping in Iran: A Comparison between Evidential Belief Function and Binary Logistic Regression Models. Scand. J. For. Res. 2016, 31, 80–98.
Piao, Y.; Lee, D.; Park, S.; Kim, H.G.; Jin, Y. Forest Fire Susceptibility Assessment Using Google Earth Engine in Gangwon-Do, Republic of Korea. Geomat. Nat. Hazards Risk 2022, 13, 432–450.
Nur, A.S.; Kim, Y.J.; Lee, C.-W. Creation of Wildfire Susceptibility Maps in Plumas National Forest Using InSAR Coherence, Deep Learning, and Metaheuristic Optimization Approaches. Remote Sens. 2022, 14, 4416.
Zhang, X.; Lan, M.; Ming, J.; Zhu, J.; Lo, S. Spatiotemporal Heterogeneity of Forest Fire Occurrence Based on Remote Sensing Data: An Analysis in Anhui, China. Remote Sens. 2023, 15, 598.
Chew, Y.J.; Ooi, S.Y.; Pang, Y.H.; Wong, K.-S. A Review of Forest Fire Combating Efforts, Challenges and Future Directions in Peninsular Malaysia, Sabah, and Sarawak. Forests 2022, 13, 1405.
Abdul Aziz, N.F.; Ya’acob, N.; Yusof, A.L.; Omar, H. Statistical Analysis for Forest Fire Factors Using Geography Information System (GIS) and Remote Sensing Imagery. J. Adv. Res. Appl. Sci. Eng. Technol. 2024, 38, 17–30.
Patah, N.A.; Mansor, S.; Mispan, M.R. An Application of Remote Sensing and Geographic Information System for Forest Fire Risk Mapping. Malaysian Cent. Remote Sens. 2006, 54–67.
Phua, M.-H.; Tsuyuki, S.; Lee, J.S.; Sasakawa, H. Detection of Burned Peat Swamp Forest in a Heterogeneous Tropical Landscape: A Case Study of the Klias Peninsula, Sabah, Malaysia. Landsc. Urban Plan. 2007, 82, 103–116.
Pradhan, B.; Awang, M.A. Application of Remote Sensing and Gis for Forest Fire Susceptibility Mapping Using Likelihood Ratio Model. In Proceedings of the Map Malaysia 2007, Kuala Lumpur, Malaysia, 3–4 May 2007.
Suliman, M.D.H.; Mahmud, M.; Reba, M.N.M. Mapping and Analysis of Forest and Land Fire Potential Using Geospatial Technology and Mathematical Modeling. IOP Conf. Ser. Earth Environ. Sci. 2014, 18, 12034.
Mohd, D.; Mastura, M. Analysis of Potential Forest Fires by Utilizing Geospatial and AHP Model in Selangor, Malaysia. Sains Malays. 2013, 42, 579–586.
Bin Jamaruppin, M.E.; Bayuaji, L.; Ab Ghani, N.B.; Rahman, M.A.; Akashah, F.W.; Shah, A. Forest Fire Occurrence Analysis Base on Land Brightness Temperature Using Landsat Data (Study Area: Jalan Kuantan–Pekan, Pahang, Malaysia). In Proceedings of the National Conference for Postgraduate Research, Pekan, Malaysia, 24 September 2016; pp. 798–805.
Ya’Acob, N.; Jamil, I.A.A.; Aziz, N.F.A.; Yusof, A.L.; Kassim, M.; Naim, N.F. Hotspots Forest Fire Susceptibility Mapping for Land Use or Land Cover Using Remote Sensing and Geographical Information Systems (GIS). IOP Conf. Ser. Earth Environ. Sci. 2022, 1064, 012029.
United States Geological Survey. EarthExplorer. Available online: https://earthexplorer.usgs.gov/ (accessed on 3 April 2021).
Hyer, E.J.; Reid, J.S.; Prins, E.M.; Hoffman, J.P.; Schmidt, C.C.; Miettinen, J.I.; Giglio, L. Patterns of Fire Activity over Indonesia and Malaysia from Polar and Geostationary Satellite Observations. Atmos. Res. 2013, 122, 504–519.
Miettinen, J.; Liew, S.C. Degradation and Development of Peatlands in Peninsular Malaysia and in the Islands of Sumatra and Borneo since 1990. Land Degrad. Dev. 2010, 21, 285–296.
Miettinen, J.; Shi, C.; Liew, S.C. Land Cover Distribution in the Peatlands of Peninsular Malaysia, Sumatra and Borneo in 2015 with Changes since 1990. Glob. Ecol. Conserv. 2016, 6, 67–78.
Peng, G.; Li, J.; Chen, Y.; Norizan, A.P.; Tay, L. High-Resolution Surface Relative Humidity Computation Using MODIS Image in Peninsular Malaysia. Chin. Geogr. Sci. 2006, 16, 260–264.
Pradhan, B. Hot Spot Detection and Monitoring Using MODIS and NOAA AVHRR Images for Wild Fire Emergency Preparedness. In Proceedings of the 2nd Applied Geoinformatics for Society and Environment (AGSE) Conference, Stuttgart, Germany, 13–18 July 2009; pp. 53–61.
NASA. LAADS DAAC (Archive). Available online: https://ladsweb.modaps.eosdis.nasa.gov/archive/ (accessed on 3 April 2021).
NASA. Find Data—LAADS DAAC. Available online: https://ladsweb.modaps.eosdis.nasa.gov/search/ (accessed on 3 April 2021).
NASA. LP DAAC (MODIS Download). Available online: https://e4ftl01.cr.usgs.gov/MOLA/ (accessed on 3 April 2021).
Miettinen, J.; Shi, C.; Liew, S.C. Fire Distribution in Peninsular Malaysia, Sumatra and Borneo in 2015 with Special Emphasis on Peatland Fires. Environ. Manag. 2017, 60, 747–757.
Leewe, Y.; Ahmad, A.N.; Ismail, A.; Sheriza, M.R. Analysis of Hotspot Pattern Distribution at Sabah, Malaysia for Forest Fire Management. J. Environ. Sci. Technol. 2016, 9, 291–295.
Fire Information for Resource Management System. Archive Download—NASA|LANCE|FIRMS. Available online: https://firms.modaps.eosdis.nasa.gov/download/ (accessed on 1 April 2021).
Dymond, C.C.; Field, R.D.; Roswintiarti, O. Using Satellite Fire Detection to Calibrate Components of the Fire Weather Index System in Malaysia and Indonesia. Environ. Manag. 2005, 35, 426–440.
Ash’aari, Z.H.; Badrunsham, A.S. Spatial Temporal Analysis of Forest Fire in Malaysia Using ATSR Satellite Measurement. Bull. Environ. Sci. Sustain. Manag. 2014, 2, 8–11.
Asean Specialised Meteorological Centre (ASMC). VIIRS Hotspot—Annual. Available online: http://asmc.asean.org/asmc-haze-hotspot-annual-new#Hotspot (accessed on 4 April 2021).
Chew, Y.J.; Ooi, S.Y.; Pang, Y.H. Data Acquisition Guide for Forest Fire Risk Modelling in Malaysia. In Proceedings of the 2021 9th International Conference on Information and Communication Technology (ICoICT), Yogyakarta, Indonesia, 3–5 August 2021; pp. 633–638.
JUPEM. Product & Services by Department of Survey and Mapping Malaysia (JUPEM). 2024. Available online: https://www.jupem.gov.my/en/orang-awam (accessed on 21 May 2024).
Malaysia Government Portal. Data Terbuka (One Stop Center for Public Data). Available online: https://www.data.gov.my/ (accessed on 5 April 2021).
Department of Statistics Malaysia. Open Data. Available online: https://www.dosm.gov.my/v1/index.php?r=column3/accordion&menu_id=amZNeW9vTXRydTFwTXAxSmdDL1J4dz09 (accessed on 5 April 2021).
Malaysia Meteorological Department. MetMalaysia: Ramalan Cuaca Negeri. Available online: https://www.met.gov.my/forecast/weather/state?lang=en (accessed on 6 April 2021).
Malaysian Meteorological Department. Web Service API. Available online: https://api.met.gov.my/ (accessed on 5 April 2021).
De Groot, W.J.; Field, R.D.; Brady, M.A.; Roswintiarti, O.; Mohamad, M. Development of the Indonesian and Malaysian Fire Danger Rating Systems. Mitig. Adapt. Strateg. Glob. Chang. 2007, 12, 165.
Ainuddin, N.A.; Ampun, J. Temporal Analysis of the Keetch-Byram Drought Index in Malaysia: Implications for Forest Fire Management. J. Appl. Sci. 2008, 8, 3991–3994.
Sulova, A.; Jokar Arsanjani, J. Exploratory Analysis of Driving Force of Wildfires in Australia: An Application of Machine Learning within Google Earth Engine. Remote Sens. 2021, 13, 10.
Sulova, A.; Jokar Arsanjani, J. Github Code: Exploratory Analysis of Wildfires in Australia and Approach for Wildfire Modeling in Google Earth Engine. Available online: https://github.com/sulova/AustraliaFires (accessed on 3 October 2023).
Velastegui-Montoya, A.; Montalván-Burbano, N.; Carrión-Mero, P.; Rivera-Torres, H.; Sadeck, L.; Adami, M. Google Earth Engine: A Global Analysis and Future Trends. Remote Sens. 2023, 15, 3675.
Pham-Duc, B.; Nguyen, H.; Phan, H.; Tran-Anh, Q. Trends and Applications of Google Earth Engine in Remote Sensing and Earth Science Research: A Bibliometric Analysis Using Scopus Database. Earth Sci. Inform. 2023, 16, 2355–2371.
Pérez-Cutillas, P.; Pérez-Navarro, A.; Conesa-García, C.; Zema, D.A.; Amado-Álvarez, J.P. What Is Going on within Google Earth Engine? A Systematic Review and Meta-Analysis. Remote Sens. Appl. Soc. Environ. 2023, 29, 100907.
Yang, L.; Driscol, J.; Sarigai, S.; Wu, Q.; Chen, H.; Lippitt, C.D. Google Earth Engine and Artificial Intelligence (AI): A Comprehensive Review. Remote Sens. 2022, 14, 3253.
Chen, H.; Yang, L.; Wu, Q. Enhancing Land Cover Mapping and Monitoring: An Interactive and Explainable Machine Learning Approach Using Google Earth Engine. Remote Sens. 2023, 15, 4585.
Giglio, L.; Justice, C.; Boschetti, L.; Roy, D. MODIS/Terra+Aqua Burned Area Monthly L3 Global 500m SIN Grid V061 [Data Set]. Available online: https://lpdaac.usgs.gov/products/mcd64a1v061/ (accessed on 1 March 2023).
Giglio, L.; Boschetti, L.; Roy, D.P.; Humber, M.L.; Justice, C.O. The Collection 6 MODIS Burned Area Mapping Algorithm and Product. Remote Sens. Environ. 2018, 217, 72–85.
Bar, S.; Parida, B.R.; Pandey, A.C. Landsat-8 and Sentinel-2 Based Forest Fire Burn Area Mapping Using Machine Learning Algorithms on GEE Cloud Platform over Uttarakhand, Western Himalaya. Remote Sens. Appl. Soc. Environ. 2020, 18, 100324.
Arruda, V.L.S.; Piontekowski, V.J.; Alencar, A.; Pereira, R.S.; Matricardi, E.A.T. An Alternative Approach for Mapping Burn Scars Using Landsat Imagery, Google Earth Engine, and Deep Learning in the Brazilian Savanna. Remote Sens. Appl. Soc. Environ. 2021, 22, 100472.
Roteta, E.; Bastarrika, A.; Franquesa, M.; Chuvieco, E. Landsat and Sentinel-2 Based Burned Area Mapping Tools in Google Earth Engine. Remote Sens. 2021, 13, 816.
Gholamrezaie, H.; Hasanlou, M.; Amani, M.; Mirmazloumi, S.M. Automatic Mapping of Burned Areas Using Landsat 8 Time-Series Images in Google Earth Engine: A Case Study from Iran. Remote Sens. 2022, 14, 6376.
Turco, M.; Herrera, S.; Tourigny, E.; Chuvieco, E.; Provenzale, A. A Comparison of Remotely-Sensed and Inventory Datasets for Burned Area in Mediterranean Europe. Int. J. Appl. Earth Obs. Geoinf. 2019, 82, 101887.
Katagis, T.; Gitas, I.Z. Assessing the Accuracy of MODIS MCD64A1 C6 and FireCCI51 Burned Area Products in Mediterranean Ecosystems. Remote Sens. 2022, 14, 602.
Chew, Y.J.; Ooi, S.Y.; Pang, Y.H. MCD64A1 Burnt Area Dataset Assessment Using Sentinel-2 and Landsat-8 on Google Earth Engine: A Case Study in Rompin, Pahang in Malaysia. In Proceedings of the 2023 IEEE 13th Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 20–21 May 2023; pp. 38–43.
Lizundia-Loiola, J.; Otón, G.; Ramo, R.; Chuvieco, E. A Spatio-Temporal Active-Fire Clustering Approach for Global Burned Area Mapping at 250 m from MODIS Data. Remote Sens. Environ. 2020, 236, 111493.
Artés, T.; Oom, D.; De Rigo, D.; Durrant, T.H.; Maianti, P.; Libertà, G.; San-Miguel-Ayanz, J. A Global Wildfire Dataset for the Analysis of Fire Regimes and Fire Behaviour. Sci. Data 2019, 6, 296.
Malaysia Kini. Hutan Seluas 34 Hektar Terbakar Di Kuantan (A 34-Hectare Forest Burned in Kuantan). Available online: https://www.malaysiakini.com/news/339616 (accessed on 2 August 2021).
Astro Awani. Kebakaran Hutan Simpan Pekan Tak Membimbangkan (Fire in Pekan Forest Reserve Is Not a Concern). Available online: https://www.astroawani.com/berita-malaysia/kebakaran-hutan-simpan-pekan-tak-membimbangkan-186979 (accessed on 2 August 2021).
Bernama. Kebakaran Hutan Simpan Pekan: Anggota Bomba, Jabatan Perhutanan Terkandas (Fire in Pekan Forest Reserve: Fire Fighters, Forestry Department Is Stranded). Available online: https://www.utusanborneo.com.my/2018/10/01/kebakaran-hutan-simpan-pekan-anggota-bomba-jabatan-perhutanan-terkandas (accessed on 2 August 2021).
Alagesh, T.N. 40ha of Pahang Forest, Peat Land on Fire. New Straits Times, 26 February 2019.
Bernama. 80 Hektar Hutan Simpan Kuala Langat Terbakar. Available online: https://www.bharian.com.my/berita/kes/2020/04/679541/80-hektar-hutan-simpan-kuala-langat-terbakar (accessed on 2 August 2021).
Idris, M.N. Kebakaran Hutan Di Selangor Meningkat—Utusan Digital. Available online: https://www.utusan.com.my/berita/2020/07/kebakaran-hutan-di-selangor-meningkat/ (accessed on 2 August 2021).
Malaymail. Kuala Langat Selatan Forest Fire Spreads to over 40 Hectares, Says Selangor Fire Dept. 2 March 2021. Available online: https://www.malaymail.com/news/malaysia/2021/03/02/kuala-langat-selatan-forest-fire-spreads-to-over-40-hectares-says-selangor/1954203 (accessed on 21 May 2024).
Bernama. Large Forest Fire Raging in Perak; Orang Asli Settlement Threatened. New Straits Times, 5 March 2019.
Haralick, R.M.; Sternberg, S.R.; Zhuang, X. Image Analysis Using Mathematical Morphology. IEEE Trans. Pattern Anal. Mach. Intell. 1987, PAMI-9, 532–550.
Zeng, A.-C.; Cai, Q.-J.; Su, Z.; Guo, X.-B.; Jin, Q.-F.; Guo, F.-T. Seasonal Variation and Driving Factors of Forest Fire in Zhejiang Province, China, Based on MODIS Satellite Hot Spots. Chin. J. Appl. Ecol. 2020, 31, 399–406.
Parisien, M.A.; Parks, S.A.; Krawchuk, M.A.; Little, J.M.; Flannigan, M.D.; Gowman, L.M.; Moritz, M.A. An Analysis of Controls on Fire Activity in Boreal Canada: Comparing Models Built with Different Temporal Resolutions. Ecol. Appl. 2014, 24, 1341–1356.
Chew, Y.J. Forest Fire Dataset for Peninsular Malaysia (2001–2022) Extracted from Multiple-Source Remote Sensing Data Using Google Earth Engine [Data Set]. Zenodo 2023.
Wu, Q.; Lane, C.R.; Li, X.; Zhao, K.; Zhou, Y.; Clinton, N.; DeVries, B.; Golden, H.E.; Lang, M.W. Integrating LiDAR Data and Multi-Temporal Aerial Imagery to Map Wetland Inundation Dynamics Using Google Earth Engine. Remote Sens. Environ. 2019, 228, 1–13.
Vilar, L.; Woolford, D.G.; Martell, D.L.; Martín, M.P. A Model for Predicting Human-Caused Wildfire Occurrence in the Region of Madrid, Spain. Int. J. Wildl. Fire 2010, 19, 325.
Syphard, A.D.; Radeloff, V.C.; Keuler, N.S.; Taylor, R.S.; Hawbaker, T.J.; Stewart, S.I.; Clayton, M.K. Predicting Spatial Patterns of Fire on a Southern California Landscape. Int. J. Wildl. Fire 2008, 17, 602.
Bustillo Sánchez, M.; Tonini, M.; Mapelli, A.; Fiorucci, P. Spatial Assessment of Wildfires Susceptibility in Santa Cruz (Bolivia) Using Random Forest. Geosciences 2021, 11, 224.
Oliveira, S.; Oehler, F.; San-Miguel-Ayanz, J.; Camia, A.; Pereira, J.M.C. Modeling Spatial Patterns of Fire Occurrence in Mediterranean Europe Using Multiple Regression and Random Forest. For. Ecol. Manag. 2012, 275, 117–129.
Tavakkoli Piralilou, S.; Einali, G.; Ghorbanzadeh, O.; Nachappa, T.G.; Gholamnia, K.; Blaschke, T.; Ghamisi, P. A Google Earth Engine Approach for Wildfire Susceptibility Prediction Fusion with Remote Sensing Data of Different Spatial Resolutions. Remote Sens. 2022, 14, 672.
Oliveira, S.; Pereira, J.M.C.; San-Miguel-Ayanz, J.; Lourenço, L. Exploring the Spatial Patterns of Fire Density in Southern Europe Using Geographically Weighted Regression. Appl. Geogr. 2014, 51, 143–157.
Ljubomir, G.; Pamučar, D.; Drobnjak, S.; Pourghasemi, H.R. Modeling the Spatial Variability of Forest Fire Susceptibility Using Geographical Information Systems and the Analytical Hierarchy Process. In Spatial Modeling in GIS and R for Earth and Environmental Sciences; Elsevier: Amsterdam, The Netherlands, 2019; pp. 337–369.
Sanderson, E.W.; Fisher, K.; Robinson, N.; Sampson, D.; Duncan, A.; Royte, L. The March of the Human Footprint. EcoEvoRxiv 2022.
Abatzoglou, J.T.; Dobrowski, S.Z.; Parks, S.A.; Hegewisch, K.C. TerraClimate, a High-Resolution Global Dataset of Monthly Climate and Climatic Water Balance from 1958–2015. Sci. Data 2018, 5, 170191.
Takeuchi, W.; Darmawan, S.; Shofiyati, R.; Van Khiem, M.; Oo, K.S.; Pimple, U.; Heng, S. Near-Real Time Meteorological Drought Monitoring and Early Warning System for Croplands in Asia. In Proceedings of the Asian Conference on Remote Sensing 2015: Fostering Resilient Growth in Asia, Quezon City, Philippines, 19–23 October 2015; Volume 1, pp. 171–178.
Wan, Z.; Hook, S.; Hulley, G. MODIS/Terra Land Surface Temperature/Emissivity 8-Day L3 Global 1km SIN Grid V061 [Dataset]. NASA EOSDIS Land Processes DAAC. 2021. Available online: https://lpdaac.usgs.gov/products/mod11a2v061/ (accessed on 7 February 2023).
Didan, K. MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061 [Data Set]. NASA EOSDIS Land Processes DAAC. 2021. Available online: https://lpdaac.usgs.gov/products/mod13q1v061/ (accessed on 23 December 2022).
Friedl, M.; Sulla-Menashe, D. MODIS/Terra+ Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V061; NASA EOSDIS Land Processes DAAC: Sioux Falls, SD, USA, 2022. [Google Scholar]
Zanaga, D.; Van De Kerchove, R.; Daems, D.; De Keersmaecker, W.; Brockmann, C.; Kirches, G.; Wevers, J.; Cartus, O.; Santoro, M.; Fritz, S.; et al. ESA WorldCover 10 m 2021 V200. Zenodo 2022. [Google Scholar] [CrossRef]
NASA JPL. NASADEM Merged DEM Global 1 Arc Second V001 [Data Set]; NASA: Washington, DC, USA, 2020. [CrossRef]
Marconcini, M.; Metz-Marconcini, A.; Üreyen, S.; Palacios-Lopez, D.; Hanke, W.; Bachofer, F.; Zeidler, J.; Esch, T.; Gorelick, N.; Kakarla, A. Outlining Where Humans Live, the World Settlement Footprint 2015. Sci. Data 2020, 7, 242. [Google Scholar] [CrossRef]
Elvidge, C.D.; Zhizhin, M.; Ghosh, T.; Hsu, F.-C.; Taneja, J. Annual Time Series of Global VIIRS Nighttime Lights Derived from Monthly Averages: 2012 to 2019. Remote Sens. 2021, 13, 922. [Google Scholar] [CrossRef]
Abdullah, M.J.; Ibrahim, M.R.; Abdul Rahim, A.R. The Incidence of Forest Fire in Peninsular Malaysia: History, Root Causes, Prevention and Control. Prev. Control Fire Peatl. 2002, 20–27. [Google Scholar]
Chandrasekharan, C. The Mission on Forest Fire Prevention and Management to Indonesia and Malaysia (Sarawak). Trop. For. Fire Prev. Control. Rehabil. Trans-Bound. Issues 1998, 14, 1–79. [Google Scholar]
Awang, A. Lebih 300 Hektar Hutan Di Pahang Terbakar (More than 300 Hectare of Forest Burnt in Pahang). Berita Harian, 11 March 2021. [Google Scholar]
OCHA Malaysia—Subnational Administrative Boundaries. Available online: https://data.humdata.org/dataset/cod-ab-mys (accessed on 20 October 2023).
Chew, Y.J.; Ooi, S.Y.; Pang, Y.H.; Wong, K.S. Trend Analysis of Forest Fire in Pahang, Malaysia from 2001-2021 with Google Earth Engine Platform. J. Logist. Inform. Serv. Sci. 2022, 9, 15–26. [Google Scholar] [CrossRef]
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
Zhipa, J. Noteable’s Unexpected Farewell. Available online: https://medium.com/@julia.zhipa/noteables-unexpected-farewell-b18a312346b4 (accessed on 29 March 2024).
Paliwal, A. Code Interpreter. Available online: https://gptstore.ai/gpts/R3QCSpFIgF-code-interpreter (accessed on 29 March 2024).
Promptspellsmith Code Copilot. Available online: https://gptstore.ai/gpts/_0GbIdtVSh-codecopilot (accessed on 29 March 2024).
O’Brien, R.M. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Qual. Quant. 2007, 41, 673–690. [Google Scholar] [CrossRef]
Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
Zhu, C.; Kobayashi, H.; Kanaya, Y.; Saito, M. Size-Dependent Validation of MODIS MCD64A1 Burned Area over Six Vegetation Types in Boreal Eurasia: Large Underestimation in Croplands. Sci. Rep. 2017, 7, 4181. [Google Scholar] [CrossRef]
Roy, S.; Swetnam, T. Samapriya/Awesome-Gee-Community-Datasets: Community Catalog (2.6.0). Zenodo 2024. [Google Scholar] [CrossRef]
Latiff, A. Biodiversity in Malaysia. In Global Biodiversity; Apple Academic Press: Palm Bay, FL, USA, 2018; pp. 307–349. [Google Scholar]
Biodiversity: Our Life, Our Heritage, Our Future. The Star, 30 May 2023; p. 1.

Figure 1. Overall methodology to build a forest fire dataset.

Figure 2. Fire and non-fire point distribution, illustrated in QGIS.

Figure 3. Number of fire points for each year from 2001 to 2022.

Figure 4. Percentage of missing data for top 20 features from full forest fire dataset.

Figure 5. Percentage of missing data in the filtered forest fire dataset.

Figure 6. Boxplot analysis for each key feature for fire and non-fire points.

Figure 7. Variance inflation factor for key features. (Note: The analysis is considered invalid as the plugin dropped all rows containing null values, leaving only 1841 rows for this analysis).

Table 1. Summary of the factors and sources utilized to build fire inventory datasets.

Published Year	Reference	Tools/Platforms	Factors Used to Build Dataset	Source/Satellite
2017	[4]	ArcGIS 10.2	Slope, Aspect, Elevation, Distance to Road, Distance to Residential Area	National Topographic Maps
			Land Use	Local Authority from Lam Dong Province
			NDVI	Landsat-8
			Temperature, Windspeed, Rainfall	Climate Forecast System Reanalysis [34]
2019	[5]	ArcGIS 10.4	Slope, Aspect, Elevation, Distance to Road	Topographic Maps from Ministry of Natural Resources and Environment of Vietnam
			Land Use Map 2015, Distance to Residential Area	Local Authority
			NDVI	Landsat-8
			Rainfall, Temperature, Windspeed, Humidity	Climate Forecast System Reanalysis [34]/ Ministry of Natural Resources and Environment of Vietnam
			Historical Forest Fire (2008 to 2016)	Department of Forest Protection of Vietnam
2020	[30]	ArcGIS 10.5	Urban Area Distance, Distance to Agriculture, Urban Area Density, Population Density, Density of Agricultural Interface, Elevation, Slope, Aspect	National Institute of Statistics and Geography
			Main Road Distance, Density of Main Road	Secretariat of Infrastructure, Communications and Transportation
			Secondary Road Distance, Density of Secondary Road	Mexican Institute of Transportation
			Aboveground Carbon Density	Cartus et al. [35]
			Tree Cover	NASA
			Precipitation	National Autonomous University of Mexico
			Forest Fire/Fire Density	National Forestry Commission Mexico
2020	[3]	ArcGIS	Distance to Villages, Distance to Road, Distance to Rivers	Local Topographic Database
			Elevation, Slope, Aspect, Plan Curvature, Slope Degree, Topographic Wetness Index (TWI)	Digital Elevation Map (DEM) Created from Topographic Data
			NDVI, Land Use	Landsat-8
			Wind, Temperature, Rainfall	Meteorological Database
			Soil Type (Soil Texture)	Soil Texture Map of Golestan Province from Agriculture Department, Iran
			Fire Hotspots	MODIS
			Note: Fire Inventory Map information is made available in [36]
2020	[6]	Bayesian Network	Fire Causes, Relative Humidity, Temperature, Distance from Settlement, Month, Amount of Burnt Area, Distance from Agricultural Land, Windspeed, Distance from Road, Tree Species	General Directorate of Forestry (Mugla Region)
2022	[37]	GEE	Landsat, Shuttle Radar Topography Mission V3 Product, NDVI	GEE
			River Line Map	Korea Ministry of Environment
			Road Line Map	Korea National Spatial Data Infrastructure Portal
			Rainfall and Temperature Dataset	Korea Meteorological Administration
			Forest Fire List	Korea Forest Service
			Land Use and Land Cover (LULC)	Korea Ministry of Environment
2022	[38]	ArcGIS 10.4	Fire Perimeter Database in California	California Department of Forestry and Fire Protection
			Aspect, Altitude, Slope, Plan Curvature, TWI	Copernicus DEM
			Precipitation, Maximum Temperature	Oregon State University’s Parameter-Elevation Regressions on Independent Slopes Model (PRISM) Climate Group
			Solar Radiation	National Renewable Energy Laboratory
			Windspeed	Global Wind Atlas
			Distance to Stream	California State Geoportal
			Soil Moisture, Drought Index	TerraClimate
			NDVI	MODIS/Terra
			Land Use, Distance to Road, Distance to Settlement	USGS National Land Cover Database (NLCD-2019)
2023	[14]	ArcGIS 10.4	Aspect, Altitude, Slope, Plan Curvature	Copernicus DEM
			Precipitation, Maximum Temperature, Soil Moisture, Drought Index	TerraClimate
			Windspeed	Global Wind Atlas
			NDVI	MODIS
			Distance to Rivers, Forest Type, Land Use, Distance to Recreational Areas, Distance to Roads, Distance to Human Settlements	Australian Bureau of Agricultural and Resource Economics and Sciences
2023	[39]	ArcGIS 10.5	Slope Angle	Geospatial Data Cloud
			Elevation, NDVI	Resource and Environment Science and Data Center
			Precipitation, Average Maximum Temperature, Average Minimum Temperature, Annual Average Temperature	National Earth System Science Data Center
			Railway Density, Road Density	Geographic Information Professional Knowledge Service System
			Population Density	WorldPOP
			Nighttime Light	Earth Observation Group of Colorado School of Mines
			VIIRS Fire Hotspots, Land Surface Temperature	NASA Earth Data

Table 3. Comparison of full vs. filtered datasets for Peninsular Malaysia.

	Full Peninsular Malaysia Dataset [101]	Filtered Peninsular Malaysia Dataset
Number of Fire Points	5557	5557
Number of Non-Fire Points	5526	5526
Number of Attributes (Column)	7037	40

Table 4. Feature information in filtered forest fire dataset.

Feature Name	Description	Feature Name	Description
system:index	System-generated from MCD64A1	current_aet_annual	Actual Evapotranspiration
longitude	Longitude Coordinate of Fire Points	current_def_annual	Climate Water Deficit
latitude	Latitude Coordinate of Fire Points	current_pdsi_annual	Palmer Drought Severity Index
fire	Fire Occurrence (binary class)	current_pet_annual	Reference Evapotranspiration
date	Date attribute from the administrative boundaries shapefile	current_pr_annual	Precipitation Accumulation
ADM1_PCODE	Administrative level 1 code	current_ro_annual	Runoff
ADM2_PCODE	Administrative level 2 code	current_soil_annual	Soil Moisture
Shape_Leng	Shape Length (from MCD64A1)	current_srad_annual	Downward Surface Shortwave Radiation
ADM0_EN	Country Name	current_swe_annual	Snow Water Equivalent
ADM1_EN	Administrative level 1 name	current_tmmn_annual	Minimum Temperature
ADM2_EN	Administrative level 2 name	current_tmmx_annual	Maximum Temperature
validOn	Validation date attribute from the administrative boundaries shapefile	current_vap_annual	Vapor Pressure
Shape_Area	Shape area (from MCD64A1)	current_vpd_annual	Vapor Pressure Deficit
ADM0_PCODE	Country code	current_vs_annual	Wind Speed at 10 m
BurnDate	Date in 0–365 (from MCD64A1)	current_EVI_annual	Enhanced Vegetation Index
year	Year of Fire Observation	current_NDVI_annual	Normalized Difference Vegetation Index
month	Month of Fire Observation	current_LST_annual	Land Surface Temperature
day	Day of Fire Observation	current_KBDI_annual	Keetch–Byram Drought Index
current0101_hii_annual	Human Impact Index	current0101_LC_Type2_annual	Land Cover Classification of UMD (Numeric)
current0101_average_annual_nighttime	Nighttime Brightness	current0101_LC_Type2_annual_classname	Land Cover Classification of UMD (Class Name)

Table 5. Statistical mean and standard deviation of the key features by fire class.

Feature	Non-Fire (fire = 0) Count	Non-Fire (fire = 0) Mean	Non-Fire (fire = 0) Standard Deviation	Fire (fire = 1) Count	Fire (fire = 1) Mean	Fire (fire = 1) Standard Deviation
current0101_LC_Type2_annual	5526	3.088853	2.465596	5557	9.613101	3.652621
current0101_average_annual_nighttime	0	-	-	1960	1.33929	2.232185
current0101_hii_annual	0	-	-	5403	2094.723	881.2572
current_EVI_annual	5526	0.460531	0.064722	5553	0.369616	0.059076
current_KBDI_annual	5526	21.63871	9.450237	3404	95.86203	55.7931
current_LST_annual	5510	299.5443	2.200019	5522	304.4709	1.561426
current_NDVI_annual	5526	0.684663	0.075106	5553	0.567431	0.078325
current_aet_annual	5526	96.14751	8.020696	5557	104.8773	9.327677
current_def_annual	5526	0.57292	0.925342	5557	7.798602	7.765711
current_pdsi_annual	5526	3.93935	1.623397	5557	0.453241	2.904917
current_pet_annual	5526	96.72054	8.350228	5557	112.6757	12.85509
current_pr_annual	5526	269.2244	48.8007	5557	202.7966	36.59999
current_ro_annual	5526	173.054	52.46889	5557	97.5236	41.03856
current_soil_annual	5526	94.80768	27.17345	5557	95.14247	52.39215
current_srad_annual	5526	169.1053	7.917929	5557	185.4051	24.57128
current_swe_annual	5526	0	0	5557	0	0
current_tmmn_annual	5526	21.33283	2.059547	5557	23.47415	0.549855
current_tmmx_annual	5526	29.97259	2.362977	5557	32.31917	0.718615
current_vap_annual	5526	2.808721	0.217333	5557	3.065104	0.080479
current_vpd_annual	5526	0.623304	0.215493	5557	0.825758	0.098175
current_vs_annual	5526	1.327443	0.267973	5557	1.834698	0.287861
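
The per-class summaries in Table 5 amount to a grouped count, mean, and standard deviation that skips missing values for each feature. The following is a minimal sketch using only the Python standard library. The helper name `summarize_by_class` and the toy rows are hypothetical and are not records from the dataset, and the use of the sample (n − 1) standard deviation is an assumption, since the table does not state which form was used.

```python
import math
from statistics import mean, stdev


def summarize_by_class(rows, feature, label="fire"):
    """Return {class: (count, mean, std)} for one feature, ignoring missing values."""
    groups = {}
    for row in rows:
        value = row.get(feature)
        # Treat None and NaN as missing so counts reflect valid observations only.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        groups.setdefault(row[label], []).append(value)

    summary = {}
    for cls, values in groups.items():
        # stdev() needs at least two observations; report NaN otherwise.
        std = stdev(values) if len(values) > 1 else float("nan")
        summary[cls] = (len(values), mean(values), std)
    return summary


# Hypothetical toy rows (assumption: not real records from the dataset).
toy_rows = [
    {"fire": 0, "current_NDVI_annual": 0.70},
    {"fire": 0, "current_NDVI_annual": 0.66},
    {"fire": 0, "current_NDVI_annual": None},
    {"fire": 1, "current_NDVI_annual": 0.55},
    {"fire": 1, "current_NDVI_annual": 0.59},
]
result = summarize_by_class(toy_rows, "current_NDVI_annual")
```

In this sketch, a class with no valid values for a feature is simply absent from the result, which corresponds to the rows in Table 5 that report a count of 0 for the non-fire class.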

Table 6. t-test statistics and p-value test results.

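Feature	Group 1	Group 2	t-Statistic	p-Value	Statistical Significance
BurnDate	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−93.2815	0	Significant
current0101_LC_Type2_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−110.2646	0	Significant
current_EVI_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	77.2107	0	Significant
current_KBDI_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−76.9397	0	Significant
current_LST_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−135.6047	0	Significant
current_NDVI_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	80.4098	0	Significant
current_aet_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−52.8367	0	Significant
current_def_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−68.8715	0	Significant
current_pdsi_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	78.0405	0	Significant
current_pet_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−77.5257	0	Significant
current_pr_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	81.0322	0	Significant
current_ro_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	84.3792	0	Significant
current_srad_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−47.0552	0	Significant
current_tmmn_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−74.6871	0	Significant
current_tmmx_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−70.6441	0	Significant
current_vap_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−82.2645	0	Significant
current_vpd_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−63.5849	0	Significant
current_vs_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−96.0227	0	Significant
current_soil_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	−0.4226	0.6726	Not Significant
current0101_average_annual_nighttime	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	-	-	NaN (Missing Data)
current0101_hii_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	-	-	NaN (Missing Data)
current_swe_annual	Filter Non-Fire Condition (‘fire’ = 0)	Filter Fire Condition (‘fire’ = 1)	-	-	NaN (Constant)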
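
Because the plugin-generated code warrants independent checks, a two-sample t-statistic can be recomputed directly from the counts, means, and standard deviations reported in Table 5. The sketch below assumes Welch's unequal-variance form, which the paper does not confirm, and the function name `welch_t` is hypothetical. Under that assumption, the `current_soil_annual` row evaluates to approximately −0.42, in line with the value reported in Table 6.

```python
import math


def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Welch's t-statistic and approximate degrees of freedom from summary statistics."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# current_soil_annual: non-fire (group 1) vs. fire (group 2), values from Table 5.
t_soil, df_soil = welch_t(5526, 94.80768, 27.17345, 5557, 95.14247, 52.39215)
print(round(t_soil, 2), round(df_soil))
```

The sign convention here (non-fire minus fire) follows the Group 1 and Group 2 ordering in Table 6, so a negative statistic indicates a higher mean in the fire class.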

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chew, Y.J.; Ooi, S.Y.; Pang, Y.H.; Lim, Z.Y. Framework to Create Inventory Dataset for Disaster Behavior Analysis Using Google Earth Engine: A Case Study in Peninsular Malaysia for Historical Forest Fire Behavior Analysis. Forests 2024, 15, 923. https://doi.org/10.3390/f15060923

