Spatial Predictive Modeling of Liver Fluke Opisthorchis viverrine (OV) Infection under the Mathematical Models in Hexagonal Symmetrical Shapes Using Machine Learning-Based Forest Classification Regression

Pumhirunroj, Benjamabhorn; Littidej, Patiwat; Boonmars, Thidarut; Artchayasawat, Atchara; Prasertsri, Narueset; Khamphilung, Phusit; Sangpradid, Satith; Buasri, Nutchanat; Uttha, Theeraya; Slack, Donald

doi:10.3390/sym16081067


by Benjamabhorn Pumhirunroj ¹, Patiwat Littidej ²,*, Thidarut Boonmars ³, Atchara Artchayasawat ⁴, Narueset Prasertsri ², Phusit Khamphilung ², Satith Sangpradid ², Nutchanat Buasri ², Theeraya Uttha ² and Donald Slack ⁵

¹ Program in Animal Science, Faculty of Agricultural Technology, Sakon Nakhon Rajabhat University, Sakon Nakhon 47000, Thailand

² Geoinformatics Research Unit for Spatial Management, Department of Geoinformatics, Faculty of Informatics, Mahasarakham University, Maha Sarakham 44150, Thailand

³ Department of Parasitology, Faculty of Medicine, Khon Kaen University, Khon Kaen 40002, Thailand

⁴ Department of Agriculture and Resources, Faculty of Natural Resources and Agro-Industry, Kasetsart University, Chalermphrakiat Sakon Nakhon Province Campus, Sakon Nakhon 47000, Thailand

⁵ Department of Civil & Architectural Engineering & Mechanics, University of Arizona, 1209 E. Second St., P.O. Box 210072, Tucson, AZ 85721, USA

* Author to whom correspondence should be addressed.

Symmetry 2024, 16(8), 1067; https://doi.org/10.3390/sym16081067

Submission received: 21 June 2024 / Revised: 21 July 2024 / Accepted: 12 August 2024 / Published: 19 August 2024

(This article belongs to the Special Issue Mathematical Modeling of the Infectious Diseases and Their Controls)


Abstract

Infection with liver flukes (Opisthorchis viverrini) is partly due to their ability to thrive in habitats in sub-basin areas, where the intermediate host remains in the watershed system throughout the year. Spatial modeling is used to predict infection in water sources, which involves designing appropriate area units with hexagonal grids. This allows for the creation of a set of independent variables, which are then modeled using machine learning techniques such as forest-based classification and regression methods. The independent variable set was obtained from the local public health agency and used to establish a relationship with a mathematical model. The ordinary least squares (OLS) approach was used to screen the variables, and the most consistent set was selected to create a new set of variables using principal component analysis (PCA). The results showed that the forest classification and regression (FCR) model was able to accurately predict the infection rates, with the PCA factor yielding a reliability value of 0.915. This was followed by values of 0.794, 0.741, and 0.632, respectively. This article provides detailed information on the factors related to water body infection, including the length and density of water flow lines in hexagonal form, and traces the depth of each process.

Keywords:

Opisthorchis viverrini; spatial modeling; forest classification regression (FCR); machine learning; hexagonal grid

1. Introduction

Severe liver fluke infections have been discovered in the Phon Na Kaeo district of the Sakon Nakhon province, Thailand. The liver fluke, Opisthorchis viverrini, is known to cause cholangiocarcinoma (CCA) [1]. Thailand has the highest prevalence of bile duct cancer cases due to liver fluke infections [2], which are primarily caused by consuming raw fish contaminated with infectious larvae and the widespread consumption of semi-raw or raw seafood [3]. There have even been reports of fluke infections from fermented fish products [4]. In Thailand, liver fluke and cholangiocarcinoma are persistent public health concerns. Annually, these diseases claim the lives of at least 20,000 people in the northeast region alone [5]. With 6–8 million current cases of liver fluke infections, it is essential to test individuals for this infection [6]. By eliminating the parasites, we can significantly reduce the risk of developing cholangiocarcinoma [7].

Sakon Nakhon Hospital alone records nearly a thousand new cases of CCA each year. Surprisingly, despite knowing the main risk factors for O. viverrini infection, the incidence of CCA has not decreased in the past decade [8]. The prevalence of CCA involves Thailand’s four main regions—Sakon Nakhon, Phrae, Roi-Et, and Nong Bua Lamphu [9]. A study by the authors of [10] found that individuals with a high severity of O. viverrini infection (>6000 eggs/g. feces) had a 14.1-fold higher likelihood of developing CCA compared to those without the infection. Approximately 10% of people infected with O. viverrini eventually develop CCA, causing major health crises in the region [11]. The five-year survival rates for patients who underwent surgery for intrahepatic [12], distal extrahepatic, and hilar CCA were 22–44%, 27–37%, and 11–41%, respectively [13].

The largest natural water contact zone in the northeast can be found near the boundary of the Nong Han subdistrict. This is due to the unique topography of the area. The swamp has specific physical characteristics that make it a significant source of water. It remains full all the time because it is fed by multiple streams, making it an essential source of food for the local population. Fish is a primary source of protein for people living in the watershed. There is a cultural preference for raw fish, cooked with herbs to create a sweet, sour, and spicy flavor [1]. As a result, residents living close to the river basin eat fish with every meal [3]. Preliminary screening results from 2019 to 2021 indicate that only a small proportion of individuals have developed liver fluke [2]. Furthermore, a study on fish liver fluke infection prevalence found that 33.33% of the Sakon Nakhon province is affected by the infection. In a study on the density of larvae in fish, it was discovered that there were 10–20 metacercaria per kilogram of fish [10]. Consequently, the Sakon Nakhon province continues to experience outbreaks of liver fluke. It is possible that the feces containing the eggs of the liver fluke contaminate water sources and contribute to recurring illnesses and an ongoing cycle of infection. It is important to monitor the outbreaks and infections spatially, but this requires data from remote sensing and geographic information system processing. Figure 1 illustrates the percentage of reported cases found. The Sakon Nakhon province is home to the largest freshwater resource in the northeast, serving as a breeding ground for animals during the rainy season [2]. Due to Phon Na Kaeo in the Sakon Nakhon province having the highest average infection rate, provincial health officials should closely monitor the situation. The investigation utilized data on the number of liver fluke infections in the Phon Na Kaeo district [3].

Remote sensing information systems enable the geographic study of liver fluke infections, making geographic information system (GIS) knowledge particularly valuable as an analytical tool. Remote sensing (RS) derived from satellite imagery allows for an in-depth analysis of the likelihood and distribution of liver flukes. This analysis can incorporate various indicators, such as the standardized vegetation index, soil moisture index, and soil cover index, which can be linked to the presence of liver fluke intermediate hosts. Several studies, including those by the authors of [3,16], have employed spatial statistics to investigate geographical factors correlated with liver fluke infection. However, these studies analyzed large areas, leading to inconsistencies and discrepancies in the raster data. To address this, geographically weighted regression (GWR) models were created in small area unit systems for hydrological factor analyses, proving effective with high R² values in all models [17,18,19,20]. Combining spatial modeling with mathematical models can enhance the accuracy of linear models and improve risk analyses. However, when utilizing spatial statistical models for forecasting, a risk analysis can only be conducted at the sub-basin level, requiring sufficient independent variable data to generate relevant trendlines. Therefore, by incorporating machine learning (ML) and learning from spatial characteristics, it is possible to estimate the risk of infection at the location of a water supply.

Modern research has incorporated machine learning into spatial risk assessment problems, as demonstrated in [21]. In recent years, advancements in machine learning techniques, processing power, and geospatial developments, including software, have made it easier to generate spatial maps [22]. Various machine learning algorithms, such as knowledge-based approaches [23], multivariate logistic regression methods [24,25], and multivariate binary logistic regression [26], have been shown to improve the accuracy of spatial maps in recent studies. The following algorithms have been investigated: the general linear model [27,28], quadratic discriminant analysis [29], boosted regression tree [24], random forest classification (RFC) [30,31,32,33], multivariate adaptive regression splines [34,35,36], classification and regression tree [21,36], support vector machine [37,38], naïve Bayes [39,40], generalized additive model [36], neuro-fuzzy and adaptive neuro-fuzzy inference [41,42,43], fuzzy logic [44], and artificial neural networks [25,45,46,47,48,49]. Maximum entropy [50,51] and decision tree [52,53] approaches have also been proposed. Machine learning technologies have also been commonly used to create landslide maps (LSMs). The study of [54,55] compared the performance and effectiveness of various machine learning techniques in the literature and found that tree-based ensemble optimization strategies outperformed other machine learning algorithms. Several studies on machine learning applications have shown that the random forest classification method consistently performs better in terms of the receiver operating characteristic (ROC) and area under the ROC curve (AUC) than other models, although this also depends on the factors influencing the machine learning process, such as the number of training and testing points. A wide range of spatial conditioning factors can be utilized to build machine learning models. 
Several studies on flood-prone landslides and land-use change evaluation have employed remote sensing and GIS approaches [15,56,57].

There is currently a lack of studies on spatial fluke infection in small river basins due to the absence of a direct model to predict the presence of liver flukes in water bodies. Most of the existing models are designed for larger areas such as regions, and thus are not suitable for studying smaller river basins. To address this gap, this study suggests that spatial features could be used as indicators of infection, by employing a forest-based classification and regression (FCR) modeling technique. The main objective of this study was to identify the key features of the small river basin that are most influential in causing infection, and then use these characteristics to develop a forecasting model based on FCR. Unlike previous studies, this research is the first to utilize FCR for this type of prediction.

The previous study used frequency-compensation ratio (FR) methodologies to forecast the proportion of infection risk based on geographical parameters at the watershed level and learning point locations, which are bodies of water containing diseased fish. As a result, appropriate management of the sub-basin level can be achieved for protection, provided that the spatial distribution of each parasite’s features is significant to any subspace unit at the sub-basin level [58]. For example, by interrupting the mollusc host cycle, healthy populations can be maintained and future diseases can be prevented, leading to reduced community impact and medical treatment costs. However, the creation of independent variable data from point data and using them as representatives for watersheds to analyze the relationship with infection may not be common in all watersheds. This approach requires estimation methods from point data to be equal to the grid size, which can result in an increase or decrease in data, leading to multicollinearity, which is the correlation between independent variables.

This study used a hexagonal grid to implement independent variables. The FCR model test narrowed down the relevant independent variable factors to four out of nine spatial factors: the index of land-use types, index of soil drainage properties, distance index from the road network, and distance index from surface water resources. The distance index from the flow accumulation lines, index of average surface temperature, average surface moisture index, average normalized difference vegetation index, and average soil-adjusted vegetation index were found to have minimal impact on accuracy. In this study, an independent set of variables will be reconstructed within the constraints of a 2000 m × 2000 m hexagonal grid. This will enable an accurate description of the data values of all variables by defining border distances using distance analysis variables. The mathematical model will be adjusted to fit the hexagonal grid, and the score values of independent variables will be obtained from data provided by the Sakon Nakhon province’s public health authority.

The main aim of this study is to establish a dataset of measurable independent variables across various areas. In order to achieve this, a mathematical model in the form of a hexagonal grid was created. These models were used to justify the selection of independent variables for the FCR model, which serves as the primary framework for this study. A machine learning model using FCR was created to analyze the spatial factors associated with liver fluke infection. This study aimed to achieve two objectives: (1) analyze the spatial factors correlated with liver fluke infection and (2) predict the percentage of infection in water bodies using a machine learning-based FCR model with hexagonal grids.

2. Materials and Methods

2.1. Analyses and Datasets

The Sakon Nakhon Provincial Public Health Office (SKKO) [59], https://pnkhospital.net/ (accessed on 1 November 2023), provided the information on liver fluke infections used in this study. Stool testing is a common screening technique that has been used for a long time. For instance, the modified Kato–Katz approach, which proved efficient during past widespread parasite outbreaks, can be used to examine parasite eggs in feces in depth.

In addition, a stool analysis is a well-established procedure that has been used for many years. The modified Kato–Katz technique was used to examine stool samples for O. viverrini eggs shortly after collection [60]. The majority of infected individuals were found in the Phon Na Kaeo district of the Sakon Nakhon province. The prevalence of infection tends to increase in the age range of 18–80 years. Two other testing techniques, namely the enzyme-linked immunosorbent assay (ELISA) and the formalin–ethyl acetate concentration technique (FECT), are more effective than stool samples [61]. These methods provide numerical data that reflect the parasite density and can be used for post-drug evaluations to determine the rate of new or reinfection [61]. However, due to their high cost, these approaches were not utilized in this study. Nevertheless, the modified Kato–Katz method is a suitable technique for screening a large number of individuals, and the secondary data obtained from SKKO on the number of liver fluke patients assessed using this method are reliable.

Based on the data on modified Kato–Katz fluke infections, the highest number of cases were found in the Phon Na Kaeo district of the Sakon Nakhon province. The prevalence of infection tended to increase among individuals aged 30 to 40 years. It was also observed that the density of liver fluke infection among the infected individuals in the province was similar to the prevalence. Table 1 provides the age range of 20 to 30 years for the Sakon Nakhon province. Although the number of people infected with bile duct cancer in the Sakon Nakhon province has decreased from 161 to 130, it still has significantly higher rates compared to other provinces. However, it is important to note that the reported number of infected individuals in each province may not accurately reflect the true extent of the problem. Budget constraints can impact the ability to detect all infected individuals. Nevertheless, the findings from Table 1 clearly highlight the distinctiveness of the infected population in the Sakon Nakhon province compared to other provinces, making it an interesting area for further study.

From 2019 to 2020, a total of 12,063 cases of stool testing were identified and reported to the 8th Health District Office (Region 8) [62], Available online: https://r8way.moph.go.th/r8way/ (accessed on 15 November 2023). Out of these cases, 599 were discovered in the Sakon Nakhon province, which has the highest number of liver fluke infections among the surrounding provinces in the interconnected river basin system of Nakhon Phanom and Bueng Kan [63].

The Study Area

This study aimed to increase the number of samples collected from infected water sources over a three-year period (2019–2022). The samples were collected from three districts: the Mueang Sakon Nakhon district, Phon Na Kaeo district, and Khok Si Suphan district. These districts are all located near the largest freshwater lake in northeastern Thailand, which has geographic coordinates 17°13′18″ N, 104°17′24″ E. The lake’s water bodies have a high percentage of fish liver fluke infection, as shown in Figure 2a–d. To analyze the data, we first determined the environmental independent variables that contribute to the spread of infected freshwater fish from the main water source to the tributary water bodies. This was performed by examining the stream flow lines and using mathematical models and geographic information systems with map algebra under ArcGIS Pro v.3.2.0, as shown in Equations (1)–(11).

2.2. Design of Independent Variables

Public health authorities have recommended a specific set of independent variables. These variables were carefully selected and incorporated into the model to analyze the spread of liver flukes in the river basin junction area. They are based on mathematical models that assess the approximate distance to surface moisture, which provides the optimal environment for the intermediate host’s habitat. This study uses water flow lines as an independent variable set to represent the movement of infected fish from the main water source, Nongharn, to the various tributary water bodies in the three study districts. The ordinary least squares (OLS) regression model is applied to screen the independent variables, and the infection percentage of each sub-water source is then predicted using a machine learning approach. A forest-based classification and regression model is employed, using the same set of independent variables as the previous model.

2.2.1. The Distance of the Line Flows to Connect with the Main Water Source (Stream Length, SL)

This factor was developed to assess the probability of liver fluke infection in fish that migrate from the primary contaminated water source to the surrounding areas. The closer the distance to the infected water, the greater the likelihood of infection. This is due to the fish’s ability to travel and reproduce rapidly, making them an efficient carrier for intermediary hosts.

SL_{ij} = \sum_{i=1}^{n} SL_{i} \qquad (1)

The main source of water, which serves as the primary habitat and breeding ground for the fish studied, is Nongharn, the largest freshwater body in northeastern Thailand. In this study, SL_{ij} represents the cumulative length, in meters, of any sub-stream flow line connecting to the main water source j.

2.2.2. Continuation of the Slope of the Stream Line to the Main Infectious Water Source (Continuous of Slope Length)

The slope of the terrain affects how water flows from the outlet that connects the tributary stream system to the main body of water where the infected fish are located. To quantify this effect, an index is computed by multiplying the slope of each flow-line section by its length and summing the products. A smaller index indicates a higher risk of infected fish, as fish are more likely to move between contaminated water sources and their seasonal habitats than along flow lines with higher index values.

S_{ikj} = \sum_{i=1}^{n} \sum_{k=1}^{m} L_{ik} S_{ik} \qquad (2)

where j is the main water source defined above; S_{ikj} is the sum of the products of the slope and length of any sub-stream flow lines i and k that connect to the main water source j; L_{ik} is the length of those flow lines; S_{ik} is their slope; and n and m are the numbers of flow lines indexed by i and k, respectively.
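As a minimal illustration of Equation (2), the sketch below sums the product of length and slope over a few flow-line sections. The function name `slope_length_index`, the sample values, and the choice to express slope as a dimensionless rise-over-run ratio are assumptions made for illustration; the text does not state the slope units used in the study.

```python
from typing import List, Tuple


def slope_length_index(sections: List[Tuple[float, float]]) -> float:
    """Sum of length * slope over (length_m, slope) flow-line sections."""
    return sum(length_m * slope for length_m, slope in sections)


# Hypothetical sections: (length in meters, slope as rise/run).
sections = [(1200.0, 0.010), (800.0, 0.025), (500.0, 0.004)]
index = slope_length_index(sections)
print(f"Slope-length index: {index:.1f}")
```

Under the interpretation given in the text, a flow line with a smaller index than its neighbors would be flagged as carrying a higher risk of fish movement.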

2.2.3. Prevalence of Nearby Infected Water Bodies (PNW)

The prevalence index is a measure that assesses the connectivity of water bodies where infected fish are found in close proximity to each other and connected by the same water flow lines. By analyzing the characteristics of these water bodies, along with the percentage of infected fish in villages and water bodies, this index can provide insights into the likelihood of finding infected fish. The prevalence index is calculated based on the cohesion value of the location point where infected fish were discovered. This index is derived from the Getis–Ord model, and the Z-value index is calculated using the distance of the water flow line connecting the infected water source to other water sources. The results are presented as Z-scores, where high positive scores indicate denser clusters of high values, known as hot spots, and low negative scores indicate cold spots. In this study, confidence levels of 90%, 95%, and 99% were used to identify significant clusters, while Z-values approaching 0 indicate no significant spatial clustering. The calculation can be expressed using the Getis–Ord local statistical formula [64].

G_{i}^{*} = \frac{\sum_{j=1}^{n} w_{i,j} x_{j} - \bar{x} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\frac{n \sum_{j=1}^{n} w_{i,j}^{2} - \left(\sum_{j=1}^{n} w_{i,j}\right)^{2}}{n-1}}} \qquad (3)

where x_{j} represents the attribute value for stream line feature j, w_{i,j} represents the spatial weight between features i and j, and n is equal to the total number of features. The mean \bar{x} and the standard deviation S are given by

\bar{x} = \frac{\sum_{j=1}^{n} x_{j}}{n} \qquad (4)

S = \sqrt{\frac{\sum_{j=1}^{n} x_{j}^{2}}{n} - \bar{x}^{2}} \qquad (5)

The statistic is already in the form of a Z-score, so there is no need for any additional calculations.
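The Getis–Ord statistic can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the standard Gi* formula, not the ArcGIS Pro routine used in the study. The binary weights (1 for a feature and its immediate neighbors along a line, 0 otherwise) and the sample infection values are assumptions; the study derives its weights from distances along the water flow lines.

```python
import math
from typing import List, Sequence


def getis_ord_gi_star(values: Sequence[float],
                      weights: Sequence[Sequence[float]]) -> List[float]:
    """Return the Gi* Z-score of every feature for a full weight matrix."""
    n = len(values)
    mean = sum(values) / n
    # Standard deviation S; max() guards against tiny negative rounding.
    s = math.sqrt(max(sum(v * v for v in values) / n - mean ** 2, 0.0))
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * sum_w
        spread = max((n * sum_w2 - sum_w ** 2) / (n - 1), 0.0)
        denominator = s * math.sqrt(spread)
        # A zero denominator means no variation; report no clustering.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Hypothetical infection percentages for five features along one flow line.
values = [10.0, 12.0, 11.0, 1.0, 2.0]
# Assumed binary weights: each feature, itself included, and its neighbors.
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(5)]
           for i in range(5)]
scores = getis_ord_gi_star(values, weights)
print([round(z, 3) for z in scores])
```

In this toy input the first feature sits among high values and receives a positive Z-score, while the last sits among low values and receives a negative one.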

2.2.4. Continuous Length of the Line of Water Connecting to the Infected Water Source (CL)

The flow of tributary streams from the main source of infected fish to other surface water bodies within the small sub-basin enables an increase in the percentage of infected fish in that water. This can be determined by classifying the index level on a scale of 0–9, based on a stream order analysis. In Strahler’s method, a higher stream order indicates a greater chance of water flowing from Nongharn to the water source. Stream order increases only when streams of the same order intersect. Therefore, the intersection of a first-order and a second-order link remains a second-order link, rather than creating a third-order link. This is the default setting. The continuous flow of stream water from the main source of infected fish to other surface water bodies within the small sub-basin increases the likelihood of a higher percentage of infected fish in that water source [15].

CL_{ikj} = \sum_{i=1}^{n} SO_{ij} \qquad (6)

where j is the main water source, CL_{ikj} represents the sum of the stream orders of sub-stream flow lines i and k that connect to the main water source j, and SO_{ij} represents the stream order of each of those flow lines.
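The Strahler rule described above, and the summation in Equation (6), can be sketched as follows. The network layout, the node names, and the dictionary representation (each stream mapped to the tributaries that flow into it) are hypothetical; a GIS stream-order tool would derive the same ordering from a flow-direction raster.

```python
from typing import Dict, List


def strahler_order(stream: str, upstream: Dict[str, List[str]]) -> int:
    """Strahler order of a stream given its upstream tributaries."""
    tributaries = upstream.get(stream, [])
    if not tributaries:
        return 1  # headwater streams are first order
    orders = [strahler_order(t, upstream) for t in tributaries]
    top = max(orders)
    # Order rises only when two or more tributaries share the top order.
    return top + 1 if orders.count(top) >= 2 else top


# Hypothetical network: a1 and a2 join to form a; a and b join at the outlet.
upstream = {"outlet": ["a", "b"], "a": ["a1", "a2"]}
streams = ["outlet", "a", "b", "a1", "a2"]

outlet_order = strahler_order("outlet", upstream)
cl_index = sum(strahler_order(s, upstream) for s in streams)
print(outlet_order, cl_index)
```

Here the second-order stream a meets the first-order stream b, so the outlet stays second order, matching the rule that unequal orders do not raise the order.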

2.2.5. The Proximity between the Water Source and the Community (PWC)

It has been found that when communities are located in close proximity to water bodies, there is a higher chance of intermediate hosts becoming embedded in moist soils because feces are discharged into rivers. These intermediate hosts serve as habitats for parasite eggs, which are typically found at depths of 10–15 cm along the banks of the water bodies in moist soil zones.

P_{ikj} = \sum_{i=1}^{n} \sum_{k=1}^{m} W_{ik} L_{ik} \qquad (7)

where j denotes the main source of water; P_{ikj} represents the sum of the products of the population-density weight and the length of the sub-stream flow lines i and k that connect the main water source to any village; W_{ik} is the weighted value of population density for position i in a village connected to stream flow lines k; L_{ik} represents the length of the stream flow lines i and k connecting the main water source and any village; and n and m are the numbers of positions and flow lines indexed by i and k, respectively. The population density is divided into five ranges: 500, 1000, 2000, 3000, and more than 3001 persons per square kilometer. The corresponding weights for these ranges are 0.2, 0.4, 0.6, 0.8, and 1, respectively.
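The weighting scheme in Equation (7) can be illustrated with a small lookup. The text lists the density classes only by single values, so the sketch assumes they are upper bounds (up to 500, up to 1000, up to 2000, up to 3000, and above 3000 persons per square kilometer); that reading, the function names, and the sample segments are assumptions rather than the study's stated rule.

```python
from bisect import bisect_left
from typing import List, Tuple

# Assumed upper bounds of the first four density classes (persons/km^2).
UPPER_BOUNDS = [500, 1000, 2000, 3000]
WEIGHTS = [0.2, 0.4, 0.6, 0.8, 1.0]


def density_weight(density: float) -> float:
    """Map a population density to its class weight."""
    return WEIGHTS[bisect_left(UPPER_BOUNDS, density)]


def proximity_index(segments: List[Tuple[float, float]]) -> float:
    """Sum of weight * length over (density, length_m) segments."""
    return sum(density_weight(d) * length_m for d, length_m in segments)


# Hypothetical segments: (village density in persons/km^2, length in meters).
segments = [(450, 1200.0), (1500, 800.0), (3500, 300.0)]
pwc = proximity_index(segments)
print(f"PWC index: {pwc:.1f}")
```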

2.2.6. Population Density in Close Proximity to Water Bodies (PPW)

The likelihood of infection within the population can be determined in a hexagonal grid by calculating the product of population density (persons/square kilometer), the perimeter (in kilometers) of the hexagonal grid, and the length (in kilometers) of the stream lines. This result should then be divided by the area of the hexagonal grid.

P_{i} = \frac{D_{i} \, Pe_{i} \, L_{ik}}{A_{i}} \qquad (8)

where P_{i} represents the population risk in the flow line of the infection risk area at any point i on the hexagonal grid; D_{i} represents the population density (persons/square kilometer) determined using hexagonal grid units at any point i; Pe_{i} represents the perimeter of the hexagonal grid at any given point i; L_{ik} represents the length of any sub-stream flow lines i and k that connect to the main water source j; and A_{i} represents the area of the hexagonal grid at any given point i.
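A worked instance of Equation (8) for a regular hexagon is sketched below. The side length of 1 km, the density of 800 persons per square kilometer, and the stream length of 3.5 km are hypothetical inputs chosen only to show the arithmetic; the perimeter and area formulas are the standard ones for a regular hexagon.

```python
import math


def hexagon_perimeter_km(side_km: float) -> float:
    """Perimeter of a regular hexagon with the given side length."""
    return 6.0 * side_km


def hexagon_area_km2(side_km: float) -> float:
    """Area of a regular hexagon: (3 * sqrt(3) / 2) * side^2."""
    return 1.5 * math.sqrt(3.0) * side_km ** 2


def population_risk(density: float, stream_length_km: float,
                    side_km: float) -> float:
    """Equation (8): density * perimeter * stream length / area."""
    perimeter = hexagon_perimeter_km(side_km)
    area = hexagon_area_km2(side_km)
    return density * perimeter * stream_length_km / area


# Hypothetical cell: 800 persons/km^2, 3.5 km of stream, 1 km hexagon side.
risk = population_risk(800.0, 3.5, 1.0)
print(f"Population risk index: {risk:.2f}")
```

Because the perimeter-to-area ratio of a regular hexagon depends only on its side length, cells of equal size scale the density and stream length by the same constant, so differences in the index between cells come from those two inputs.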

2.2.7. Continuation of Surface Moisture Index along the Banks of Flowing Water Lines Connecting Infected Water Bodies (CSW)

This index was created to measure the continuity of surface moisture along infected tributary water sources. It follows the flow lines connecting any tributary water source to the Nongharn lake and extracts surface moisture values from the normalized difference moisture index (NDMI), averaged over a 3-month period (January–March, which is the dry season). This measurement helps assess the embedding capacity of intermediate hosts in moist soils during the dry season. When the spatial autocorrelation index indicates a continuous moisture level above 0.25, the water flow line is considered to be better at maintaining surface moisture along the banks compared to other lines with lower index values. The NDMI uses the NIR and SWIR bands to represent moisture. NIR reflectance is not affected by the amount of water in a leaf; it is instead governed by the leaf’s interior structure and dry matter content. The index is derived from the band ratio (B08 − B11)/(B08 + B11) in Sentinel-2 satellite imagery. It is then re-standardized, as indicated in Equation (9) [65].

CSW_{i} = \frac{NDMI_{i} - NDMI_{min}}{NDMI_{max} - NDMI_{min}} / A_{i} \qquad (9)
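The band ratio and the re-standardization in Equation (9) can be sketched as follows. The reflectance pairs and the cell area of 2 km² are hypothetical, and the sketch treats each value as the mean NDMI of one hexagonal cell, which is an assumption about how the per-cell values are prepared.

```python
from typing import List, Sequence


def ndmi(b08: float, b11: float) -> float:
    """NDMI from Sentinel-2 NIR (B08) and SWIR (B11) reflectance."""
    return (b08 - b11) / (b08 + b11)


def csw(ndmi_values: Sequence[float],
        areas_km2: Sequence[float]) -> List[float]:
    """Equation (9): min-max scaled NDMI divided by the cell area."""
    low, high = min(ndmi_values), max(ndmi_values)
    span = high - low
    if span == 0:
        # All cells identical: no contrast to scale.
        return [0.0 for _ in ndmi_values]
    return [((v - low) / span) / a for v, a in zip(ndmi_values, areas_km2)]


# Hypothetical mean reflectances (B08, B11) for three hexagonal cells.
bands = [(0.4, 0.2), (0.3, 0.3), (0.5, 0.1)]
ndmi_values = [ndmi(nir, swir) for nir, swir in bands]
csw_values = csw(ndmi_values, [2.0, 2.0, 2.0])
print([round(v, 3) for v in csw_values])
```

The same min-max pattern applies to Equations (10) and (11) with MSAVI and NDVI in place of NDMI.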

2.2.8. Continuity of the Soil Surface Heat Index along the Banks of the Flowing Water Line Connecting the Infected Water Source (CSH)

In contrast to the previous index, the surface heat index was developed to measure the continuity of surface moisture deficiency along infected tributary water sources. For the stream flow lines connecting any of the sources to Nongharn, values were extracted from the MSAVI (Modified Soil Adjusted Vegetation Index), which was averaged over a 3-month period (January–March) during the dry season. This was performed to assess the embedding capacity of intermediate hosts in moist soils during this season. When the spatial autocorrelation index is below 0.25, the water flow line is considered to be less effective at retaining surface moisture compared to other lines with higher index values. Empirical studies have shown that NDVI products are unstable due to variations in soil color, soil wetness, and saturation effects from high-density vegetation. To improve on the NDVI, this index applies a transformation that reduces the influence of soil brightness on the spectral vegetation index computed from the red and near-infrared (NIR) wavelengths. This is expressed as (B08 − B04)/(B08 + B04 + 9) × (1 + 9) and is applied to a mathematical model to normalize the data by increasing the number of images to 9. This normalization is represented by a constant value (L), showing a greater difference than the original scale of 0–1, as shown in Equation (10).

CSH_{i} = \frac{MSAVI_{i} - MSAVI_{min}}{MSAVI_{max} - MSAVI_{min}} / A_{i} \qquad (10)

2.2.9. Continuation of Vegetation Index along the Banks of Stream Lines Connecting Infected Water Bodies (CVS)

The NDVI is a well-known and commonly used index for measuring green vegetation. It contrasts the strong reflectance of green leaves in the near-infrared wavelength range with the absorption by chlorophyll in the red wavelength range. In this study, the NDVI was used to measure the continuity of vegetation covering the soil surface along infected tributary water bodies. The stream flow lines connecting any of the tributary water sources to Nongharn were used to extract the average NDVI of surface vegetation over a period of 3 months (January–March, 2019–2021), which corresponds to the dry season. This measurement aimed to assess the embedding capacity of intermediate hosts in moist soils during the dry season. Water flow lines with a spatial autocorrelation index above 0.25 were considered to have better surface moisture maintenance along the banks compared to lines with lower index values (as shown in Equation (11)).

CVS_{i} = \frac{NDVI_{i} - NDVI_{min}}{NDVI_{max} - NDVI_{min}} / A_{i} \qquad (11)

The index values obtained from the Sentinel-2 remote sensing satellite images, as shown in Equations (9)–(11), need to be normalized using the Benefit of Scale transformation method. This method adjusts GIS data by keeping large values within a range of no more than 1. The purpose of this adjustment is to ensure that the scale of these data ranges can be used to measure range values accurately. Additionally, there are options to modify the size of the hexagonal grid area to reduce data fluctuation that may arise when processing images from multiple time periods together.

2.3. Designing Spatial Units Using a Hexagonal Grid

This study used the percentage of fluke infection as the dependent variable, which was obtained by translating spatial data of infected fish locations. These data represented the prevalence of infection in the water source. In order to measure various factors alongside the surface moisture index, soil surface heat index, and vegetation index obtained from Sentinel-2 satellite imagery, it is essential to generate raster data that average each index within a hexagonal grid. This is required as some of the data obtained from remote sensing have a grid size of 10 m. However, it is important to note that using this type of raster may not be suitable for every grid. This is because averaging does not evenly distribute the average across the area units. In this investigation, the point data in Figure 2c were converted into a raster within a hexagonal border measuring 2.1 km diagonally. This allowed for the generation of raster data based on the distribution and density of the infection site, as shown in Figure 2d. All independent variables were also generated using hexagonal raster data in order to calculate an average index value within the same unit of space. By using mathematical modeling, each grid agent’s connection to the infected water source from a central location can be analyzed.

The calculation of the index for the independent variable relies on the location of the primary infected water source, the Nong Han lake, which is connected by drainage streams to the freshwater bodies commonly found in the study area where the infected fish were discovered. This index is determined by generating descriptive data on the length of the water flow line, measured in meters. By examining the characteristics of liver fluke infection in this specific watershed, it was observed that infected fish tend to migrate within a range of 50–60 km from the water source, reaching smaller bodies of water along the drainage flow line. The mathematical model for the nine variables has been developed based on the research conducted by the authors of [15]. It is specifically designed to generate indexes on hexagonal grids.

2.4. Spatially Transformed Distribution Using a Hexagonal Grid

Hexagonal grids were adopted to index all factors because uniformly shaped grids serve several purposes [66]. These include normalizing geography for mapping and avoiding the problems of arbitrarily constructed irregular polygons, such as district boundaries or block clusters formed through political processes. The use of mathematical models in deriving independent variable factors ensures that every grid contains index values, which is not possible with regular rectangular grids. Besides squares and hexagons, a regularly shaped grid can also be built from equilateral triangles. These three polygons (equilateral triangles, squares, and regular hexagons) are the only regular polygons that tile the plane in an equally spaced grid without gaps or overlaps.

Hexagons are preferred for sampling because of the grid boundaries imposed by the location of the infected water supply. Among the polygons that tile the plane, the hexagon has the lowest perimeter-to-area ratio, making it an ideal choice. While circles have the lowest ratio of any shape, they cannot be arranged in a continuous grid. Hexagons are the most circular polygons that can be tessellated into an evenly spaced grid. One advantage of using hexagonal grids for generating independent variable datasets is that the near-circular shape of the hexagon allows for a more natural display of curves in the data compared to rectangular grids [67]. Additionally, because squares and triangles have more acute vertex angles than the hexagon, points near the border of a hexagon are closer to its centroid than the corresponding points of an equal-area square or triangle. This observation has led to the development of the hexagonal index, which includes factors such as X1 (SL), X4 (CL), X6 (PPW), X7 (CSW), and X8 (CSH). This index facilitates efficient learning for FCR models. Furthermore, a hexagonal grid can accommodate more samples compared to a rectangular grid, as demonstrated by the same distance search between the two types of grids. Consequently, the FCR model can learn from a larger amount of data when constructed using a hexagonal grid rather than a rectangular one.
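The compactness argument can be checked numerically. The sketch below, with hypothetical function names, uses the standard formulas for a regular n-sided polygon of circumradius R (area = (n/2)·R²·sin(2π/n), side = 2R·sin(π/n)) to compare the perimeter and the centroid-to-vertex distance of equal-area triangles, squares, and hexagons against a circle.

```python
import math


def regular_polygon_metrics(n_sides, area=1.0):
    """Return (perimeter, circumradius) of a regular n-gon with the given area."""
    circumradius = math.sqrt(2.0 * area / (n_sides * math.sin(2.0 * math.pi / n_sides)))
    perimeter = 2.0 * n_sides * circumradius * math.sin(math.pi / n_sides)
    return perimeter, circumradius


def circle_metrics(area=1.0):
    """Return (circumference, radius) of a circle with the given area."""
    radius = math.sqrt(area / math.pi)
    return 2.0 * math.pi * radius, radius


if __name__ == "__main__":
    for name, sides in (("triangle", 3), ("square", 4), ("hexagon", 6)):
        perimeter, farthest = regular_polygon_metrics(sides)
        print(f"{name:8s} perimeter={perimeter:.4f} farthest point={farthest:.4f}")
    circumference, radius = circle_metrics()
    print(f"{'circle':8s} perimeter={circumference:.4f} farthest point={radius:.4f}")
```

For a unit area, both the perimeter and the farthest distance from the centroid shrink in the order triangle, square, hexagon, circle, which is the ordering the text relies on.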

When comparing polygons with equal areas, the more similar the polygon is to a circle, the closer the points near its border, especially those near the vertices, are to the centroid [15]. This means that the outermost points of a hexagon are closer to the centroid than those of an equal-area square or triangle. Rectangles have a linear character, which can draw attention to the straight, unbroken, parallel lines of a fishnet grid, potentially obscuring underlying patterns in the data. Hexagons, on the other hand, break up these lines and make any curvature in the data patterns more visible and understandable. This fragmentation of artificial linear patterns also reduces any orientation bias that may be present in fishnet grids. Analyzing and modeling broad areas with hexagonal grids results in less distortion due to earth curvature compared to fishnet grids. In this sub-basin area, the quantity of samples from adjacent areas needs to be determined. Using a hexagonal grid instead of a square grid can result in a fairer estimation of index values for the nine independent variables. Finding neighbors is easier with a hexagonal grid: because the length of the shared edge is the same on every side, the centroids of all six neighbors are equidistant. In this study, distance bands were used with the optimized hot spot analysis tool to identify neighbors. When using hexagonal grids instead of fishnet grids, more neighbors can be included in the computation for each feature. Figure 3 shows an example of a search simulation from a water body.
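The neighbor argument can be illustrated with a short sketch. It assumes pointy-top hexagons in axial coordinates and, for the fishnet grid, the eight surrounding cells (edge and corner neighbors); the names are hypothetical and the code is not the hot spot analysis tool itself.

```python
import math

# Axial offsets of the six hexagons that share an edge with the cell (0, 0).
HEX_NEIGHBOURS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

# Offsets of the eight cells surrounding a square cell (edge and corner contacts).
SQUARE_NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]


def hex_centre(q, r, size):
    """Centre of a pointy-top hexagon with circumradius `size`."""
    return size * math.sqrt(3) * (q + r / 2.0), size * 1.5 * r


def hex_neighbour_distances(size):
    """Centroid-to-centroid distances from a hexagon to its six neighbours."""
    return [math.hypot(*hex_centre(q, r, size)) for q, r in HEX_NEIGHBOURS]


def square_neighbour_distances(cell):
    """Centroid-to-centroid distances from a square cell to its eight neighbours."""
    return [math.hypot(dx * cell, dy * cell) for dx, dy in SQUARE_NEIGHBOURS]
```

All six hexagon distances equal √3 times the hexagon size, whereas a square cell has four neighbors at one cell width and four at √2 cell widths, so a single distance band treats hexagon neighbors uniformly.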

2.5. FCR for Predicting Infection with the Liver Fluke (Opisthorchis viverrini)

This study used FCR to examine the distribution of infectious water sources along the important water flow lines that connect to the outlet points of the streams. This approach enabled the model to generate more accurate weight values that represent the actual infection, surpassing other linear models. The establishment of learning for the model is extremely important. To train a model, known values (such as the average number of infected OV in each hexagonal grid) are provided as part of a training dataset by the forest-based classification and regression (FCR) tool [68].

Feature Selection

All datasets were trained and tested. Parasites were discovered in 83 surveyed hexagonal grids, which comprised 50 modeling (training) points (see Figure 4d) and 33 testing points. Figure 4c displays the locations of both the training and testing points, which represent the data of infected water sources. The prevalence of liver fluke infection in the water source is expressed as a percentage, with higher percentages indicating a higher prevalence. The remaining points are used for testing the accuracy of the model, as shown in Figure 4b. Figure 4a illustrates the correlation between the test points and the modeling agent through the natural water flow line, depicted as the green line.

To ensure that the data sampling process was unbiased, repeated sampling was conducted over 300 runs, as per the usual setting of the FCR model. The learning setting included a number of trees set to 300, leaf size set to 5, and a tree depth range of 9–31. The forecasting model integration employed a boosting strategy, which involved developing multiple data categorization models. Each model was built using the same training data but with an additional weighted value. Weighted voting methods were used to assign new data groups based on majority voting, with two methods utilized in this study: average and weighted.
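One possible reading of the repeated sampling is that the 83 grids are re-partitioned at random on every run. The sketch below illustrates that reading only; the function name, the seed, and the assumption that each run redraws a fresh 50/33 split are not taken from the FCR tool.

```python
import random


def repeated_splits(grid_ids, n_train, n_runs, seed=0):
    """Yield (train, test) partitions of the grid ids, reshuffled on every run."""
    rng = random.Random(seed)
    ids = list(grid_ids)
    for _ in range(n_runs):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # 83 surveyed grids, 50 for modeling and 33 for testing, over 300 runs.
    runs = list(repeated_splits(range(83), n_train=50, n_runs=300))
    print(len(runs), len(runs[0][0]), len(runs[0][1]))
```

Averaging a diagnostic over such runs reduces its dependence on any single partition of the surveyed grids.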

2.6. Validating the Model

In order to assess the accuracy of the FCR, we utilized the receiver operating characteristic–area under curve (ROC-AUC) technique. This method is widely employed in machine learning to evaluate the performance of different models and address any issues related to interpretation and criteria selection [29]. ROC curves are created by plotting the True Positive Rate (TPR), also known as sensitivity, against the False Positive Rate (FPR), which equals 1 − specificity. The TPR on the y-axis measures the proportion of actual positives that are correctly detected, while the FPR on the x-axis measures the proportion of negative instances or non-events that are incorrectly classified as positive or events [69]. The ROC curve therefore also indicates how often the model incorrectly predicts a positive outcome when the true outcome is negative [70]. The calculations for TPR and FPR can be performed using Equations (12) and (13).

TPR = \frac{TP}{TP + FN}

(12)

FPR = \frac{FP}{FP + TN}

(13)

TP is the number of True Positives (correctly predicted positive instances), FN is the number of False Negatives (actual positive instances incorrectly predicted as negative), FP is the number of False Positives (actual negative instances incorrectly predicted as positive), and TN is the number of True Negatives (correctly predicted negative instances). After computing TPR and FPR across classification thresholds, the AUC is the area under the resulting curve; an AUC value of 50% indicates that the estimation is not discriminatory [37]. However, when the AUC exceeds 90%, the model can be considered to have outstanding discriminatory power.
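A small sketch of these quantities, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN), is given below. The function names are hypothetical, and the AUC is computed through the rank-based formulation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), which equals the area under the ROC curve.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def true_positive_rate(tp, fn):
    """Sensitivity: share of actual positives that are detected."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of actual negatives that are wrongly flagged as positive."""
    return fp / (fp + tn)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

A model that ranks every infected grid above every uninfected grid reaches an AUC of 1, while random scores give a value near 0.5.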

3. Results

3.1. Spatial Factor Distribution

Figure 5a–i illustrate the components of the independent variables X1–X9. These components represent vector default data that are converted into raster data using a mathematical model and generated with the map algebra function on a hexagonal grid. In order to ensure fairness, all hexagonal grids are normalized to a range of 0–1, except for the raster for the X6 factor, which represents the actual population in the area, and X8, which requires further index differentiation.

The outcome of the adjustment is described by the X1 factor, also known as stream length (SL), which varies from 0.00 to 0.359. The number of value dispersions covers a wide range, from 0.941 to 1.000. This is clearly visible in the histogram, where the higher range is more frequent compared to the other ranges. Additionally, the X2 factor, or continuous slope length, demonstrates a consistent increase in the cumulative length of the water flow line with low slopes. This is indicated by the red grid, which indicates a range value between 0.927 and 1.000 for this index.

When analyzing the proximity of infected water sources to each drought grid, the X3 index (prevalence of nearby infected water bodies (PNW)) ranges from 0.930 to 1.000. This indicates that the likelihood of these grids experiencing infections is even higher, as the frequency of this hexagonal grid is clearly higher than other levels. However, if we only consider the length of the water flow line from the diseased grid that can connect to each grid, the flow of the connection is not sufficient to cover all grids. This leads us to screen the X4 continuous length of the line of water connected to the infected water source (CL) index as a more precise variable. It suggests that the high-level grid between 0.517 and 1.000 should be monitored.

Next, the X5 index was developed to examine the proximity between the water source and the community (PWC). This index is used to determine the link between the closeness of the water source and the location of the community. The frequency of dispersion was primarily determined by how close the village was to the infected water supply. This allowed for an easy identification of opportunities to acquire sick fish. However, upon examining the population of the villages surrounding the source of infected fish, it was discovered that the population density was low. This was reflected in the index image as the most spread grid. X6, the index for population density adjacent to water bodies (PPW), also supports this finding.

When analyzing the surface moisture variables along the river flow that connects the contaminated water source to the hexagonal grid, it was discovered that the grid surrounding the Nong Han lake had the highest index range, between 0.425 and 0.671. Additionally, the minor level range was also noteworthy. Despite the grid having a small frequency distribution, the risk of infection for fish was relatively higher compared to the grid in the low index range. This can be seen in the X7 continuation of surface moisture index along the banks of flowing water lines connecting infected water bodies (CSW).

The X8 continuity of the soil surface heat index along the banks of the flowing water line connecting the infected water source (CSH) consistently showed that grids with higher moisture levels were less likely to be soil. It could be inferred that these streams were covered with weeds along the banks of the water flow lines. This resulted in values in the high-risk range, ranging from 0.000 to 1.312. Similarly, the X9 continuation of vegetation index along the banks of stream lines connecting infected water bodies (CVS) confirmed this pattern. Water flow lines at risk were found to have moderate to fairly large vegetation index values ranging from 0.502 to 0.738. The analysis of nine factors, including the percentage of water bodies where infected fish were found, will help identify the appropriate set of factors to be used in the FCR model for learning and predicting other infectious grids.

3.2. Selected Factors for Prediction

3.2.1. Correlation between All Variables and Infection

The exploratory regression global summary table uses an ordinary least squares (OLS) model to examine the consistency of the data obtained from developing a regression model. It does this by reporting the percentage of trials that pass each of the search criteria used to filter the models. This information is shown in Table 2. The table displays five search criteria, with the most important one being the max VIF value. This criterion was passed in all 511 trials (100.00%). However, the other criteria, namely the minimum adjusted R², the maximum coefficient p-value, the minimum Jarque–Bera p-value, and the minimum spatial autocorrelation p-value, did not meet the validation conditions specified in the table. Based on these data, it is possible that the regression model created has issues with data consistency or is not compatible with the input data used to develop the model. Therefore, it is recommended to evaluate and update the data or model accordingly before using it further in analyses or decision-making related to this regression model.

Table 3 presents the results of variable significance. The following statements summarize the findings: X3 is the most important variable and is negatively related to the outcome. The variation in X4 is consistently essential and is associated with positive outcomes. X7 has low priority and yields negative results. The importance of the X9 variation is low, but it is linked to beneficial results. The variation in X1 is not substantial, but it does correspond to positive outcomes. X2 is not a significant variable, yet it correlates with favorable findings. X5 and X6 variables are not significant and also correlate with positive results. Therefore, analyzing these data allows us to confidently categorize and predict the impact of each variable on the outcome. When the variance inflation factor (VIF) value exceeds 10, it indicates the presence of multicollinearity. Conversely, a VIF value below 10 suggests the absence of multicollinearity. As shown in Table 4, our model is free from multicollinearity. This allows us to make accurate predictions without encountering issues that could compromise the model’s accuracy.
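The VIF rule of thumb can be sketched with the textbook definition VIF_j = 1/(1 − R_j²), where R_j² is the coefficient of determination obtained when predictor j is regressed on the other predictors. The helper names below are hypothetical, and the threshold of 10 is the one stated above.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor for a predictor whose regression on the others gives R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def has_multicollinearity(vif, threshold=10.0):
    """Apply the rule of thumb: a VIF above the threshold signals multicollinearity."""
    return vif > threshold


if __name__ == "__main__":
    # With only two predictors, R^2 is the squared pairwise correlation.
    # 0.45663 is the X7-X8 correlation reported in Section 3.2.2; this two-variable
    # case is an illustration and does not reproduce the values in Table 4.
    vif = vif_from_r_squared(0.45663 ** 2)
    print(round(vif, 4), has_multicollinearity(vif))
```

Under this definition, a VIF of 10 corresponds to R_j² = 0.9, meaning that 90% of the variance of one predictor is already explained by the others.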

The residual normality (JB) validation summary presents the results of the Jarque–Bera test applied to the residuals of the statistical models. In the case of Model 1, the JB value for the other components is 0, suggesting that the residuals deviate from a normal distribution. This indicates that adjustments may be necessary to enhance the reliability of the data for future analyses. Please refer to Table 5 for further illustration.

A multiple regression analysis for spatial autocorrelation (SA) can be summarized as follows: (1) The SA value represents the association between the value of an existing variable and a previously randomized variable. A higher value indicates a stronger association between the variables in the area. (2) The Adj R² value measures how well the model matches the data. A higher value suggests a more effective model in explaining the data. (3) The AICc value is the corrected Akaike information criterion of the model. A lower value indicates a better ability of the model to describe the data. (4) The Jarque–Bera test value, also known as the “JB value”, indicates whether the model residuals deviate from a normal distribution. A value of 0 denotes that the residuals are not normally distributed. (5) The K(BP) value refers to the Koenker studentized Breusch–Pagan test. Higher values indicate the model’s ability to characterize the data. (6) The variance inflation factor (VIF) measures multicollinearity issues. A VIF value of 1 indicates that the variable does not have a multicollinearity problem in the model. By examining the values in the above table, we can summarize the link between the data and models used in the analysis. The second set of models has the lowest values for SA, Adj R², and AICc, indicating a better fit to the data. The JB and K(BP) values of the second set of models are roughly 0, suggesting a good random approximation. Table 6 illustrates the second group of models that are considered appropriate for analyzing spatial autocorrelation data because their VIF values approach 1, indicating no multicollinearity problem.
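Two of the diagnostics listed above have simple closed forms. The sketch below uses the textbook definitions of adjusted R² and of the small-sample correction that turns AIC into AICc; the function names are hypothetical, and the exact parameter counting used by the regression software is not stated in this section, so the parameter count is left as an input.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aicc(aic, n_obs, n_params):
    """AICc = AIC + 2k(k + 1) / (n - k - 1), with k estimated parameters."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("sample too small for the AICc correction")
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
```

Both corrections penalize additional predictors more strongly when the number of observations is small, which is relevant when only a few dozen grids are available for modeling.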

3.2.2. Generate New Independent Variables by Employing Principal Component Analysis (PCA)

Based on an analysis of the association between the independent variables and the infection percentage of water sources, it was concluded that the X3 variable was the most important. The X4, X7, and X8 variables were also found to be high priorities. To ensure that these variables are definitively included in the FCR model, they must be tested using a principal component analysis (PCA), as shown in Tables 7–10. Table 7 shows that the X7 variable should be included in the covariance matrix model. This is because it has a higher covariance compared to the correlation in X3 and X4. Including X7 helps to capture the learning between highly correlated features in the data imported into the model. It also improves the effectiveness of the weights in the model training process for supervised learning of FCR models, reducing the risk of overfitting in large datasets. Additionally, the correlation matrix values in Table 7 reveal the following correlations between variables: X3 has a correlation of 0.03570 with X8, indicating a very weak positive correlation. X4 has a correlation of 0.05525 with X7, which is also a weak positive correlation. Finally, X7 and X8 have a correlation of 0.45663, the strongest of the three and a moderate positive correlation.

Table 9 displays the eigenvalues and eigenvectors, providing insight into the significance of each principal component (PC) Layer. The eigenvalues in Table 8 indicate the importance of each PC Layer, while the eigenvectors reveal the relationship between the Input Layer and the PC Layer. The vector associated with each Input Layer signifies the pattern or trend of this relationship. In our case, we have X3, X4, X7, and X8 as the Input Layers, and a total of four principal component layers. The highest eigenvalue belongs to PC Layer X3, while the smallest eigenvalue is associated with PC Layer X8. Each Input Layer represents its relationship with the PC Layers by contributing a specific value that helps define the PC Layer. Utilizing eigenvalues and eigenvectors in this manner enables an effective analysis and reduction in data dimensions based on their relevance and relationships.

To analyze the significance of each variable or dimension in a data analysis, it is useful to refer to the percent and cumulative value table. The percent values indicate how important each variable is in describing the data, while the cumulative values show the overall importance of each variable. Among the variables, X4 stands out as the second most significant, with an eigenvalue of 0.00861, representing 38.1232% of the total, and an accumulated eigenvalue of 90.5935%. The X7 variable comes next in terms of importance, with an eigenvalue of 0.00164, accounting for 7.2683% of the total, and an accumulated eigenvalue of 97.8617%. On the other hand, the X8 variable contributes the least to the description of the data, with an eigenvalue of 0.00048, representing only 2.1383% of the total, and an accumulated eigenvalue of 100.00%. Figure 6 and Table 10 illustrate the reconfiguration of the four variables into a new set of independent variables, with a data range spanning from 1.35 to 3.41.
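The percent and cumulative columns follow directly from the eigenvalues: each percentage is the eigenvalue divided by the sum of all eigenvalues, and the cumulative column is the running total. The sketch below shows the arithmetic; the function name is hypothetical and the eigenvalues in the usage example are illustrative numbers, not those of Table 9.

```python
def explained_variance(eigenvalues):
    """Return (percent, cumulative percent) for eigenvalues sorted from largest to smallest."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    ordered = sorted(eigenvalues, reverse=True)
    percents = [100.0 * value / total for value in ordered]
    cumulative = []
    running = 0.0
    for percent in percents:
        running += percent
        cumulative.append(running)
    return percents, cumulative


if __name__ == "__main__":
    # Illustrative eigenvalues only (assumed, not taken from the paper).
    percents, cumulative = explained_variance([4.0, 3.0, 2.0, 1.0])
    print(percents, cumulative)
```

Components are usually retained until the cumulative percentage reaches a chosen level, which is how the trailing components with small eigenvalues come to be treated as minor contributors.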

3.3. Predicting OV Infection Using a Machine Learning-Based FCR Model

Table 11 provides information on the characteristics of the learning model. FCR refers to the creation of a decision tree model using integers ranging from two to five, and one decimal place. The model includes the following factors for interpretation: (1) Number of Trees: This indicates the total number of trees in the model, which is set at 300. (2) Leaf Size: This refers to the size of the leaves within the tree, which is set at 1. (3) Tree Depth Range: This represents the range of depths for the trees, which falls between 6 and 25. (4) Mean Tree Depth: This indicates the average depth across all trees created, which is set at 15. (5) % of Training Available per Tree: This reflects the percentage of training data used to build each individual tree, which is set at 100%. (6) Number of Randomly Sampled Variables: This refers to the number of randomly generated variables in each tree, which is set at 3. (7) % of Training Data Eliminated for Validation: This represents the percentage of training data, specifically 40%, that is removed for model testing purposes.

The out-of-bag errors in Table 12 show how the model’s error is estimated using the training observations that were left out when each tree was built. This provides an estimate of forest performance as the number of trees grows, as shown in the table. It is clear that the forest with 300 trees has a lower mean squared error (MSE) for out-of-bag errors than the forest with 150 trees. Additionally, with 150 trees, the out-of-bag MSE for target class 0.0 is smaller than that for target class 1.0.
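The idea behind out-of-bag evaluation can be sketched generically: each tree is trained on a bootstrap sample drawn with replacement, and the observations that were never drawn form that tree's out-of-bag set. The code below illustrates this bagging principle with hypothetical names; it is not the internal procedure of the FCR tool.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw a bootstrap sample of indices and return it with the out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of observations left out of bag across the trees of a forest."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        _, out_of_bag = bootstrap_with_oob(n_samples, rng)
        fractions.append(len(out_of_bag) / n_samples)
    return sum(fractions) / n_trees
```

Each observation is left out of a given bootstrap sample with probability (1 − 1/n)^n, which approaches 1/e (about 37%) as n grows, so every tree can be scored on data it has not seen.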

The Top Variable Importance table displays the significance of factors. This is determined by the importance value assigned to each variable, indicating its predictive power for outcomes. The variable X6 has the highest importance (16%), indicating that it greatly contributes to accurate outcome predictions when included in the modeling process. Following X6, the variables X4, X2, and X1 also have high importance. On the other hand, the variables X5, X8, X7, and X9 have comparatively lower importance. All the variables mentioned in Table 13 have percentages below 10%, suggesting that their impact on model predictions may be minimal.

The results of the data classification test are presented in Table 14 and Table 15, showing F1-Score, MCC, sensitivity, and accuracy values for two data groups. Based on the analysis of these tables, it can be concluded that the created models are highly effective in classifying and predicting data. It is clear that the developed model performs well in accurately predicting the type of data found in most of the test datasets.

The precision and recall of the model are used to calculate the F1-Score, which is their harmonic mean and a measure of the effectiveness of categorization. A higher F1-Score indicates high precision and recall, while a lower F1-Score indicates lower precision and recall. By evaluating the F1-Score, we can determine which models perform better and choose the most effective ones for data categorization. When analyzing the validation data, it is important to consider the following classification diagnostics: Category F1-Score, MCC, sensitivity, and accuracy. (1) F1-Score: This metric measures the effectiveness of the model in data classification. A higher F1-Score indicates close agreement between the actual and predicted values for a category; the Matthews correlation coefficient (MCC) complements it, with values close to 1 indicating a strong correlation between the actual and predicted classes. (2) Sensitivity: This metric describes the model’s ability to correctly identify positive cases. A high sensitivity indicates that the model can accurately recognize positive instances. (3) Accuracy: Accuracy refers to the proportion of all cases that the model classifies correctly.

By using F1-Score, MCC, sensitivity, and accuracy to analyze the validation data, we can gain a better understanding of the model’s effectiveness in classifying the data. The analysis can be performed as follows: The table below presents the results of the analysis of the validation data test results, which evaluates the accuracy of the model’s predictions or classifications. For Category 0.0, the model demonstrates high accuracy with an F1-Score of up to 0.96 and a sensitivity of up to 0.99. However, it should be noted that a negative MCC may lead to incomplete data ties. On the other hand, for Category 1.0, the model fails to classify correctly, as indicated by an F1-Score and sensitivity of 0.00, along with a negative MCC when compared to Category 0.0. Overall, the model achieves a total accuracy of 0.92, meaning that it accurately classifies 92% of the data for both categories. It is important to highlight that the model still performs well for Category 0.0, exhibiting high precision.
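The four diagnostics discussed above can all be derived from a single confusion matrix, as the following sketch shows. The function name is hypothetical, and the counts in the usage example are invented for illustration; they are not the counts behind Tables 14 and 15.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Compute sensitivity, precision, F1-Score, MCC, and accuracy from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + sensitivity:
        f1 = 2.0 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "mcc": mcc,
        "accuracy": (tp + tn) / total,
    }


if __name__ == "__main__":
    # Hypothetical counts: 93 grids of category 0.0 and 7 grids of category 1.0.
    as_category_0 = classification_metrics(tp=92, fp=7, fn=1, tn=0)
    as_category_1 = classification_metrics(tp=0, fp=1, fn=7, tn=92)
    print(as_category_0)
    print(as_category_1)
```

With these invented counts, a classifier that almost always predicts the majority category obtains a high accuracy and a high F1-Score for that category while scoring zero on the minority category and a slightly negative MCC. This illustrates why accuracy alone can overstate performance when the categories are imbalanced.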

The synthesis results of the factor concluded that X1 is the best explanatory variable. This is because it has the best range diagnostics in all data ranges (training, validation, prediction), the most compatible share values in all ranges (training, validation, prediction), and the highest value in prediction of 1.00. Therefore, in order to compare the predicted value to the alternative model, the X1 factor must be included in the optional FCR model, as indicated in Table 16.

4. Discussion

4.1. Prediction of Water Source Infections

The FCR model was trained to predict water sources contaminated with liver parasites using four independent variables obtained through stepwise screening. The prediction results using PCA components showed the highest accuracy value [71], with the proportion of infected water sources ranging from 4.000 to 4.414. This differed from the prediction results obtained using the other three sets of variables. If we can develop independent variables that are strongly associated with infection, using a single component generated by the PCA approach can lead to more accurate predictions. While the separate set of variables from factors X1, X2, X3, X4, X6, X7, and X8 provide an acceptable level of accuracy, it is crucial to focus on the 1 to 4 percent of the hexagonal grid that has a chance of detecting water contamination. The coherence factor of the water flow line from the contaminated water source to a different water source has a greater impact on the model’s correctness compared to other factors [15,16]. Figure 7 illustrates the distribution of these components corresponding to the influence of soil surface moisture from factors X3, X4, X7, and X8. During the rainy season, there are numerous water flows, which facilitate the regeneration of the medium in these places.

4.2. Comparing Models and Understanding Their Limitations

AUC stands for “area under the curve”, and it is a number used to assess the performance of a prediction model [72]. The AUC of the FCR model using PCA is 0.915, indicating that the model is highly effective at prediction. The AUC of the FCR model (X3, X4, X7, and X8) is 0.741, indicating that the model performs moderately in terms of prediction. The AUC of the FCR model (X1, X2, X4, and X6) is 0.794, which is comparable to the AUC of the second graph. The AUC of the FCR model (X1 to X9) is 0.632, suggesting that the model has poor predictive ability. Analyzing the AUC value helps assess the integrity and performance of the model for more accurate predictions. Testing the accuracy of the results is affected by the inclusion of infection sites from a moderate to low population. In order to improve our predictive performance, we have specifically tested difficult scenarios with accepted predictions. Unfortunately, the results of this testing, as shown in Figure 8d, have revealed unsatisfactory AUC values when using nine independent variables.

A high AUC value usually indicates a reliable and well-performing model. This means that the independent variables used to train the FCR model yield predictions of moderate efficiency. When all variables are used, the efficiency is the lowest but still exceeds 0.5, which is considered acceptable for the prediction of the liver fluke infection of water sources. Figure 8 illustrates this, showing that Figure 8a has no False Positive Rate when using the independent variable factor data generated by the PCA process, unlike the other three groups. In this study, an OLS model was initially used to screen for correlation [73]. Then, non-duplicate factors were considered, and multicollinearity values were analyzed to select only the variables used in the PCA technique to create a new set of independent variables.
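The comparison of the four variable sets can be summarized with a short sketch that ranks the reported AUC values and assigns each a qualitative band. The function name and band labels are hypothetical; the cut-offs are the ones quoted in this article (0.5 as non-discriminatory and above 0.9 as outstanding in Section 2.6, and 0.7 as the acceptable level in Section 4.3).

```python
# AUC values as reported in Section 4.2 for the four sets of independent variables.
AUC_BY_MODEL = {
    "PCA component": 0.915,
    "X3, X4, X7, X8": 0.741,
    "X1, X2, X4, X6": 0.794,
    "X1 to X9": 0.632,
}


def auc_band(auc):
    """Assign a qualitative band using the thresholds quoted in the article."""
    if auc > 0.9:
        return "outstanding"
    if auc > 0.7:
        return "acceptable"
    if auc > 0.5:
        return "better than chance"
    return "non-discriminatory"


def rank_models(auc_by_model):
    """Return (model, AUC, band) tuples sorted from the highest to the lowest AUC."""
    ordered = sorted(auc_by_model.items(), key=lambda item: item[1], reverse=True)
    return [(name, auc, auc_band(auc)) for name, auc in ordered]


if __name__ == "__main__":
    for name, auc, band in rank_models(AUC_BY_MODEL):
        print(f"{name:16s} AUC={auc:.3f} ({band})")
```

Read this way, only the PCA-based model reaches the outstanding band, the two four-variable models are acceptable, and the nine-variable model falls below the 0.7 level.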

4.3. Critique of the Overall Model

The success of this study is, at the very least, to demonstrate that incorporating alternative models can reveal the importance of including significant independent factors from both the training and testing datasets of FCR testing. This is crucial for the selection of independent variables.

Another accomplishment is the development of a mathematical model to standardize the raster data and calculate its average within a hexagonal grid. This optimization greatly enhances the prediction capability of the FCR model, particularly when generating a new dataset using the PCA method.

In the past, studies of this nature primarily focused on assessing the accuracy of comparative models. They did not specifically aim to enhance the predictive power of independent variables, as we have performed in our study. When applying the FCR model and this set of independent variables to other floodplain areas, the approach of creating an independent set of variables within a hexagonal grid can be employed to improve predictability. It may be worthwhile to initially test it with linear models such as OLS or GWR to evaluate the validity of the statistics obtained.

One limitation of our study is that the design of the hexagonal grid needs to be analyzed based on the diameter of the village size, which varies in each area. This could potentially affect the accuracy of this study’s application, resulting in outcomes that may not meet expectations. The acceptable AUC value for predicting liver fluke infection risk in this small watershed should be greater than 0.7. If the AUC is lower, it indicates a significant discrepancy in the infection range. In such cases, an update is required for the set of independent variables that demonstrate a specific relationship with the infected water source. This update needs to be implemented in multiple parts of the data to better explain these changes.

4.4. Apply to the Appropriate Agencies

The guidelines for preventing and controlling liver fluke and bile duct cancer, as established by the Sakon Nakhon Provincial Public Health Office, include various measures. These measures encompass the implementation of sanitation systems, sewage management to disrupt the parasite cycle, health literacy education in schools, liver fluke screening for individuals over 15 years old, bile duct cancer screening for individuals aged 40 and above who are at risk and have undergone ultrasound, systematic management of referrals for suspected cholangiocarcinoma, safe food practices, a campaign to eradicate parasites from fish, and a system for receiving and referring patients from hospitals to communities. The performance of these measures is reported through the Ministry of Public Health’s reporting system or the Isan Cohort database. This spatial model study approach is valuable in supporting sanitation and sewage management strategies that aim to break the parasite cycle, as indicated by investigations into prevention and control practices. Furthermore, the FCR model provides insights into trends by continuously gathering data on the number of infected individuals.

5. Conclusions

The prediction of liver fluke (Opisthorchis viverrini) infection in water sources in three districts of the Sakon Nakhon province, which have the highest rate of bile duct cancer in the province [2], was the objective of this study. In order to determine the number of samples required for the FCR model, a mathematical model was developed to analyze the spatial data and convert it into a standardized spatial unit, specifically hexagons. By organizing the independent variable dataset into a hexagonal grid, the number of independent variable factors was reduced from nine to four. The relevant factors were then analyzed using the OLS model and the best factors were selected using PCA to create a new dataset for machine learning. The FCR model only needs to learn variables that are associated with water source infection and does not require a wide range of variables [54,74]. Furthermore, this study presents a method for combining spatial modeling with environmental factors such as soil surface moisture using machine learning models to estimate the percentage of water source infections. The predictions obtained from this approach are accurate, and it is clear that machine learning from different sets of variables can produce predictions with varying degrees of accuracy. The findings of this study can guide the development of geographical epidemiology simulations. However, it is important to consider local knowledge and develop relevant mathematical models in order to establish the connections between these variables. Interested individuals can download the zip files (Supplementary Materials) containing the geographic information system (GIS) data of the study area. These files include the independent variable data and can be used for preliminary modeling experiments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sym16081067/s1.

Author Contributions

Conceptualization, B.P. and P.L.; methodology, B.P.; testing, T.B., S.S., N.P. and N.B.; formal analysis, P.K.; data collection, A.A., T.U. and B.P.; writing—original draft preparation, P.L.; writing—review and editing, P.L. and D.S.; supervision, P.L.; project administration, B.P. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was financially supported by Mahasarakham University in 2024 (funding number 6801) for spatial analysis and GIS laboratory usage. This work was also supported by the Fundamental Fund (funding number FY-67-2023), granted by Thailand Science Research and Innovation, with funding through Sakon Nakhon Rajabhat University for the analysis of the percentage of people infected with liver fluke.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Patient consent was waived due to using aggregated data for secondary data analysis.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request. ArcGIS Pro version 3.2.0 was used under a license held by Mahasarakham University (subscription ID: 6875220XXX; customer number: 389XXX).

Acknowledgments

This research was supported by the Sakon Nakhon Provincial Public Health Office, which has integrated cooperation in knowledge development for the prevention and resolution of liver fluke and bile duct cancer health problems in the community.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Boonjaraspinyo, S.; Boonmars, T.; Ekobol, N.; Artchayasawat, A.; Sriraj, P.; Aukkanimart, R.; Pumhirunroj, B.; Sripan, P.; Songsri, J.; Juasook, A.; et al. Prevalence and Associated Risk Factors of Intestinal Parasitic Infections: A Population-Based Study in Phra Lap Sub-District, Mueang Khon Kaen District, Khon Kaen Province, Northeastern Thailand. Trop. Med. Infect. Dis. 2023, 8, 22.
2. Perakanya, P.; Ungcharoen, R.; Worrabannakorn, S.; Ongarj, P.; Artchayasawat, A.; Boonmars, T.; Boueroy, P. Prevalence and Risk Factors of Opisthorchis Viverrini Infection in Sakon Nakhon Province, Thailand. Trop. Med. Infect. Dis. 2022, 7, 313.
3. Geadkaew-Krenc, A.; Krenc, D.; Thanongsaksrikul, J.; Grams, R.; Phadungsil, W.; Glab-ampai, K.; Chantree, P.; Martviset, P. Production and Immunological Characterization of ScFv Specific to Epitope of Opisthorchis Viverrini Rhophilin-Associated Tail Protein 1-like (OvROPN1L). Trop. Med. Infect. Dis. 2023, 8, 160.
4. Prasongwatana, J.; Laummaunwai, P.; Boonmars, T.; Pinlaor, S. Viable Metacercariae of Opisthorchis Viverrini in Northeastern Thai Cyprinid Fish Dishes--as Part of a Rational Program for Control of O. Viverrini-Associated Cholangiocarcinoma. Parasitol. Res. 2013, 112, 1323–1327.
5. Qian, M.-B.; Utzinger, J.; Keiser, J.; Zhou, X.-N. Clonorchiasis. Lancet 2016, 387, 800–810.
6. Sripa, B.; Kaewkes, S.; Intapan, P.M.; Maleewong, W.; Brindley, P.J. Chapter 11—Food-Borne Trematodiases in Southeast Asia: Epidemiology, Pathology, Clinical Manifestation and Control. In Important Helminth Infections in Southeast Asia: Diversity and Potential for Control and Elimination, Part A; Zhou, X.-N., Bergquist, R., Olveda, R., Utzinger, J., Eds.; Academic Press: Cambridge, MA, USA, 2010; Volume 72, pp. 305–350. ISBN 0065-308X.
7. Brindley, P.J.; Bachini, M.; Ilyas, S.I.; Khan, S.A.; Loukas, A.; Sirica, A.E.; Teh, B.T.; Wongkham, S.; Gores, G.J. Cholangiocarcinoma. Nat. Rev. Dis. Primers 2021, 7, 65.
8. Sadaow, L.; Rodpai, R.; Janwan, P.; Boonroumkaew, P.; Sanpool, O.; Thanchomnang, T.; Yamasaki, H.; Ittiprasert, W.; Mann, V.H.; Brindley, P.J.; et al. An Innovative Test for the Rapid Detection of Specific IgG Antibodies in Human Whole-Blood for the Diagnosis of Opisthorchis Viverrini Infection. Trop. Med. Infect. Dis. 2022, 7, 308.
9. Sripa, B.; Bethony, J.M.; Sithithaworn, P.; Kaewkes, S.; Mairiang, E.; Loukas, A.; Mulvenna, J.; Laha, T.; Hotez, P.J.; Brindley, P.J. Opisthorchiasis and Opisthorchis-Associated Cholangiocarcinoma in Thailand and Laos. Acta Trop. 2011, 120 (Suppl. S1), S158–S168.
10. Pumhirunroj, B.; Aukkanimart, R. Liver Fluke-Infected Cyprinoid Fish in Northeastern Thailand (2016–2017). Southeast Asian J. Trop. Med. Public Health 2017, 51, 1–7.
11. Pinlaor, S.; Onsurathum, S.; Boonmars, T.; Pinlaor, P.; Hongsrichan, N.; Chaidee, A.; Haonon, O.; Limviroj, W.; Tesana, S.; Kaewkes, S.; et al. Distribution and Abundance of Opisthorchis Viverrini Metacercariae in Cyprinid Fish in Northeastern Thailand. Korean J. Parasitol. 2013, 51, 703–710.
12. Thinkhamrop, K.; Suwannatrai, A.T.; Chamadol, N.; Khuntikeo, N.; Thinkhamrop, B.; Sarakarn, P.; Gray, D.J.; Wangdi, K.; Clements, A.C.A.; Kelly, M. Spatial Analysis of Hepatobiliary Abnormalities in a Population at High-Risk of Cholangiocarcinoma in Thailand. Sci. Rep. 2020, 10, 16855.
13. Hasegawa, S.; Ikai, I.; Fujii, H.; Hatano, E.; Shimahara, Y. Surgical Resection of Hilar Cholangiocarcinoma: Analysis of Survival and Postoperative Complications. World J. Surg. 2007, 31, 1258–1265.
14. Office, 8th Health District. Annual Report 2021. 2021. Available online: https://r8way.moph.go.th/r8way/ (accessed on 21 July 2021).
15. Pumhirunroj, B.; Littidej, P.; Boonmars, T.; Bootyothee, K.; Artchayasawat, A.; Khamphilung, P.; Slack, D. Machine-Learning-Based Forest Classification and Regression (FCR) for Spatial Prediction of Liver Fluke Opisthorchis Viverrini (OV) Infection in Small Sub-Watersheds. ISPRS Int. J. Geo-Inf. 2023, 12, 503.
16. Suwannatrai, A.T.; Thinkhamrop, K.; Clements, A.C.A.; Kelly, M.; Suwannatrai, K.; Thinkhamrop, B.; Khuntikeo, N.; Gray, D.J.; Wangdi, K. Bayesian Spatial Analysis of Cholangiocarcinoma in Northeast Thailand. Sci. Rep. 2019, 9, 14263.
17. Littidej, P.; Buasri, N. Built-up Growth Impacts on Digital Elevation Model and Flood Risk Susceptibility Prediction in Muaeng District, Nakhon Ratchasima (Thailand). Water 2019, 11, 1496.
18. Littidej, P.; Uttha, T.; Pumhirunroj, B. Spatial Predictive Modeling of the Burning of Sugarcane Plots in Northeast Thailand with Selection of Factor Sets Using a GWR Model and Machine Learning Based on an ANN-CA. Symmetry 2022, 14, 1989.
19. Prasertsri, N.; Littidej, P. Spatial Environmental Modeling for Wildfire Progression Accelerating Extent Analysis Using Geo-Informatics. Pol. J. Environ. Stud. 2020, 29, 3249–3261.
20. Sangpradid, S. Application of a Multi-Layer Perceptron Neural Network To Simulate Spatial-Temporal Land Use and Land Cover Change Analysis Based on Cellular Automata in Buriram Province, Thailand. Environ. Eng. Manag. J. 2023, 22, 917–931.
21. Hussain, M.A.; Chen, Z.; Zheng, Y.; Shoaib, M.; Shah, S.U.; Ali, N.; Afzal, Z. Landslide Susceptibility Mapping Using Machine Learning Algorithm Validated by Persistent Scatterer In-SAR Technique. Sensors 2022, 22, 3119.
22. Achour, Y.; Pourghasemi, H.R. How Do Machine Learning Techniques Help in Increasing Accuracy of Landslide Susceptibility Maps? Geosci. Front. 2020, 11, 871–883.
23. Kumar, R.; Anbalagan, R. Landslide Susceptibility Mapping Using Analytical Hierarchy Process (AHP) in Tehri Reservoir Rim Region, Uttarakhand. J. Geol. Soc. India 2016, 87, 271–286.
24. Park, S.; Kim, J. Landslide Susceptibility Mapping Based on Random Forest and Boosted Regression Tree Models, and a Comparison of Their Performance. Appl. Sci. 2019, 9, 942.
25. Bui, D.T.; Moayedi, H.; Kalantar, B.; Osouli, A.; Pradhan, B.; Nguyen, H.; Rashid, A.S.A. A Novel Swarm Intelligence—Harris Hawks Optimization for Spatial Assessment of Landslide Susceptibility. Sensors 2019, 19, 3590.
26. Mandal, S.; Mandal, K. Modeling and Mapping Landslide Susceptibility Zones Using GIS Based Multivariate Binary Logistic Regression (LR) Model in the Rorachu River Basin of Eastern Sikkim Himalaya, India. Model. Earth Syst. Environ. 2018, 4, 69–88.
27. Pourghasemi, H.R.; Rahmati, O. Prediction of the Landslide Susceptibility: Which Algorithm, Which Precision? CATENA 2018, 162, 177–192.
28. Youssef, A.M.; Pourghasemi, H.R.; Pourtaghi, Z.S.; Al-Katheeri, M.M. Landslide Susceptibility Mapping Using Random Forest, Boosted Regression Tree, Classification and Regression Tree, and General Linear Models and Comparison of Their Performance at Wadi Tayyah Basin, Asir Region, Saudi Arabia. Landslides 2016, 13, 839–856.
29. Rossi, M.; Guzzetti, F.; Reichenbach, P.; Mondini, A.C.; Peruccacci, S. Optimal Landslide Susceptibility Zonation Based on Multiple Forecasts. Geomorphology 2010, 114, 129–142.
30. Sevgen, E.; Kocaman, S.; Nefeslioglu, H.A.; Gokceoglu, C. Photogrammetric Techniques for Landslide Susceptibility Mapping with Logistic Regression. Sensors 2019, 19, 3940.
31. Pérez-Díaz, P.; Martín-Dorta, N.; Gutiérrez-García, F.J. Construction Labour Measurement in Reinforced Concrete Floating Caissons in Maritime Ports. Civ. Eng. J. 2022, 8, 195–208.
32. Hussain, M.A.; Chen, Z.; Wang, R.; Shoaib, M. Ps-Insar-Based Validated Landslide Susceptibility Mapping along Karakorum Highway, Pakistan. Remote Sens. 2021, 13, 4129.
33. Taalab, K.; Cheng, T.; Zhang, Y. Mapping Landslide Susceptibility and Types Using Random Forest. Big Earth Data 2018, 2, 159–178.
34. Conoscenti, C.; Ciaccio, M.; Caraballo-Arias, N.A.; Gómez-Gutiérrez, Á.; Rotigliano, E.; Agnesi, V. Assessment of Susceptibility to Earth-Flow Landslide Using Logistic Regression and Multivariate Adaptive Regression Splines: A Case of the Belice River Basin (Western Sicily, Italy). Geomorphology 2015, 242, 49–64.
35. Felicísimo, Á.M.; Cuartero, A.; Remondo, J.; Quirós, E. Mapping Landslide Susceptibility with Logistic Regression, Multiple Adaptive Regression Splines, Classification and Regression Trees, and Maximum Entropy Methods: A Comparative Study. Landslides 2013, 10, 175–189.
36. Vorpahl, P.; Elsenbeer, H.; Märker, M.; Schröder, B. How Can Statistical Models Help to Determine Driving Factors of Landslides? Ecol. Model. 2012, 239, 27–39.
37. Ma, J.; Wang, Y.; Niu, X.; Jiang, S.; Liu, Z. A Comparative Study of Mutual Information-Based Input Variable Selection Strategies for the Displacement Prediction of Seepage-Driven Landslides Using Optimized Support Vector Regression. Stoch. Environ. Res. Risk Assess. 2022, 36, 3109–3129.
38. Kalantar, B.; Pradhan, B.; Naghibi, S.A.; Motevalli, A.; Mansor, S. Assessment of the Effects of Training Data Selection on the Landslide Susceptibility Mapping: A Comparison between Support Vector Machine (SVM), Logistic Regression (LR) and Artificial Neural Networks (ANN). Geomat. Nat. Hazards Risk 2018, 9, 49–69.
39. Pham, B.T.; Tien Bui, D.; Pourghasemi, H.R.; Indra, P.; Dholakia, M.B. Landslide Susceptibility Assessment in the Uttarakhand Area (India) Using GIS: A Comparison Study of Prediction Capability of Naïve Bayes, Multilayer Perceptron Neural Networks, and Functional Trees Methods. Theor. Appl. Climatol. 2017, 128, 255–273.
40. Pham, B.T.; Pradhan, B.; Tien Bui, D.; Prakash, I.; Dholakia, M.B. A Comparative Study of Different Machine Learning Methods for Landslide Susceptibility Assessment: A Case Study of Uttarakhand Area (India). Environ. Model. Softw. 2016, 84, 240–250.
41. Mehrabi, M.; Pradhan, B.; Moayedi, H. Optimizing an Adaptive Neuro-Fuzzy Inference System for Spatial Prediction of Landslide Susceptibility Using Four State-of-the-art Metaheuristic Techniques. Sensors 2020, 20, 1723.
42. Dehnavi, A.; Aghdam, I.N.; Pradhan, B.; Morshed Varzandeh, M.H. A New Hybrid Model Using Step-Wise Weight Assessment Ratio Analysis (SWARA) Technique and Adaptive Neuro-Fuzzy Inference System (ANFIS) for Regional Landslide Hazard Assessment in Iran. CATENA 2015, 135, 122–148.
43. Aghdam, I.N.; Varzandeh, M.H.M.; Pradhan, B. Landslide Susceptibility Mapping Using an Ensemble Statistical Index (Wi) and Adaptive Neuro-Fuzzy Inference System (ANFIS) Model at Alborz Mountains (Iran). Environ. Earth Sci. 2016, 75, 553.
44. Kumar, R.; Anbalagan, R. Landslide Susceptibility Zonation in Part of Tehri Reservoir Region Using Frequency Ratio, Fuzzy Logic and GIS. J. Earth Syst. Sci. 2015, 124, 431–448.
45. Charandabi, S.E.; Kamyar, K. Prediction of Cryptocurrency Price Index Using Artificial Neural Networks: A Survey of the Literature. Eur. J. Bus. Manag. Res. 2021, 6, 17–20.
46. Roshani, M.; Sattari, M.A.; Muhammad Ali, P.J.; Roshani, G.H.; Nazemi, B.; Corniani, E.; Nazemi, E. Application of GMDH Neural Network Technique to Improve Measuring Precision of a Simplified Photon Attenuation Based Two-Phase Flowmeter. Flow Meas. Instrum. 2020, 75, 101804.
47. Moayedi, H.; Abdolreza, O.; Bui, D.T.; Foong, L.K. Spatial Landslide Susceptibility Assessment Based On. Sensors 2019, 19, 4698.
48. Arnone, E.; Francipane, A.; Scarbaci, A.; Puglisi, C.; Noto, L.V. Effect of Raster Resolution and Polygon-Conversion Algorithm on Landslide Susceptibility Mapping. Environ. Model. Softw. 2016, 84, 467–481.
49. Aditian, A.; Kubota, T.; Shinohara, Y. Comparison of GIS-Based Landslide Susceptibility Models Using Frequency Ratio, Logistic Regression, and Artificial Neural Network in a Tertiary Region of Ambon, Indonesia. Geomorphology 2018, 318, 101–111.
50. Kornejady, A.; Ownegh, M.; Bahremand, A. Landslide Susceptibility Assessment Using Maximum Entropy Model with Two Different Data Sampling Methods. CATENA 2017, 152, 144–162.
51. Park, N.-W. Using Maximum Entropy Modeling for Landslide Susceptibility Mapping with Multiple Geoenvironmental Data Sets. Environ. Earth Sci. 2015, 73, 937–949.
52. Dang, V.H.; Hoang, N.D.; Nguyen, L.M.D.; Bui, D.T.; Samui, P. A Novel GIS-Based Random Forest Machine Algorithm for the Spatial Prediction of Shallow Landslide Susceptibility. Forests 2020, 11, 118.
53. Wu, X.; Ren, F.; Niu, R. Landslide Susceptibility Assessment Using Object Mapping Units, Decision Tree, and Support Vector Machine Models in the Three Gorges of China. Environ. Earth Sci. 2014, 71, 4725–4738.
54. Merghadi, A.; Yunus, A.P.; Dou, J.; Whiteley, J.; ThaiPham, B.; Bui, D.T.; Avtar, R.; Abderrahmane, B. Machine Learning Methods for Landslide Susceptibility Studies: A Comparative Overview of Algorithm Performance. Earth-Sci. Rev. 2020, 207, 103225.
55. Sahin, E.K. Comparative Analysis of Gradient Boosting Algorithms for Landslide Susceptibility Mapping. Geocarto Int. 2022, 37, 2441–2465.
56. Nohani, E.; Moharrami, M.; Sharafi, S.; Khosravi, K.; Pradhan, B.; Pham, B.T.; Lee, S.; Melesse, A.M. Landslide Susceptibility Mapping Using Different GIS-Based Bivariate Models. Water 2019, 11, 1402.
57. Pourghasemi, H.R.; Gayen, A.; Panahi, M.; Rezaie, F.; Blaschke, T. Multi-Hazard Probability Assessment and Mapping in Iran. Sci. Total Environ. 2019, 692, 556–571.
58. Suwannahitatorn, P.; Webster, J.; Riley, S.; Mungthin, M.; Donnelly, C.A. Uncooked Fish Consumption among Those at Risk of Opisthorchis Viverrini Infection in Central Thailand. PLoS ONE 2019, 14, e0211540.
59. Sakon Nakhon Provincial Public Health Office (SKKO). Annual Report 2023. 2023. Available online: https://pnkhospital.net/ (accessed on 1 November 2023).
60. Dao, T.T.H.; Bui, T.V.; Abatih, E.N.; Gabriël, S.; Nguyen, T.T.G.; Huynh, Q.H.; Van Nguyen, C.; Dorny, P. Opisthorchis Viverrini Infections and Associated Risk Factors in a Lowland Area of Binh Dinh Province, Central Vietnam. Acta Trop. 2016, 157, 151–157.
61. Ruantip, S.; Eamudomkarn, C.; Kopolrat, K.Y.; Sithithaworn, J.; Laha, T.; Sithithaworn, P. Analysis of Daily Variation for 3 and for 30 Days of Parasite-Specific IgG in Urine for Diagnosis of Strongyloidiasis by Enzyme-Linked Immunosorbent Assay. Acta Trop. 2021, 218, 105896.
62. Office, 8th Health District. Annual Report 2023. 2023. Available online: https://r8way.moph.go.th/r8way/ (accessed on 15 November 2023).
63. Honjo, S.; Srivatanakul, P.; Sriplung, H.; Kikukawa, H.; Hanai, S.; Uchida, K.; Todoroki, T.; Jedpiyawongse, A.; Kittiwatanachot, P.; Sripa, B.; et al. Genetic and Environmental Determinants of Risk for Cholangiocarcinoma via Opisthorchis Viverrini in a Densely Infested Area in Nakhon Phanom, Northeast Thailand. Int. J. Cancer 2005, 117, 854–860.
64. Ord, J.K.; Getis, A. Local Spatial Autocorrelation Statistics: Distributional Issues and an Application. Geogr. Anal. 2010, 27, 286–306.
65. Aukkanimart, R.; Boonmars, T.; Sriraj, P.; Sripan, P.; Songsri, J.; Ratanasuwan, P.; Laummaunwai, P.; Suwanantrai, A.; Aunpromma, S.; Khueangchaingkhwang, S.; et al. Carcinogenic Liver Fluke and Others Contaminated in Pickled Fish of Northeastern Thailand. Asian Pac. J. Cancer Prev. APJCP 2017, 18, 529–533.
66. Zhang, C.; Han, J. Data Mining and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2021; ISBN 9780387098227.
67. Brunton, L.A.; Alexander, N.; Wint, W.; Ashton, A.; Broughan, J.M. Using Geographically Weighted Regression to Explore the Spatially Heterogeneous Spread of Bovine Tuberculosis in England and Wales. Stoch. Environ. Res. Risk Assess. 2017, 31, 339–352.
68. Arabameri, A.; Yamani, M.; Pradhan, B.; Melesse, A.; Shirani, K.; Tien Bui, D. Novel Ensembles of COPRAS Multi-Criteria Decision-Making with Logistic Regression, Boosted Regression Tree, and Random Forest for Spatial Prediction of Gully Erosion Susceptibility. Sci. Total Environ. 2019, 688, 903–916.
69. Wang, H.; Zheng, H. True Positive Rate. In Encyclopedia of Systems Biology; Dubitzky, W., Wolkenhauer, O., Cho, K.-H., Yokota, H., Eds.; Springer: New York, NY, USA, 2013; pp. 2302–2303. ISBN 978-1-4419-9863-7.
70. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
71. Hao, S.; Fu, Y.; Zhang, J.; Zou, Y.; Wei, J.; Zheng, H. Modeling and Evaluating Spatial Variation of Pollution Characteristics in the Nyang River. Pol. J. Environ. Stud. 2022, 31, 75–83.
72. Sulaiman, N.A.F.; Shaharudin, S.M.; Ismail, S.; Zainuddin, N.H.; Tan, M.L.; Abd Jalil, Y. Predictive Modelling of Statistical Downscaling Based on Hybrid Machine Learning Model for Daily Rainfall in East-Coast Peninsular Malaysia. Symmetry 2022, 14, 927.
73. Isazade, V.; Qasimi, A.B.; Dong, P.; Kaplan, G.; Isazade, E. Integration of Moran’s I, Geographically Weighted Regression (GWR), and Ordinary Least Square (OLS) Models in Spatiotemporal Modeling of COVID-19 Outbreak in Qom and Mazandaran Provinces, Iran. Model. Earth Syst. Environ. 2023, 9, 3923–3937.
74. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

Figure 1. Percentage of individuals infected with liver fluke in the 8th Regional Health Province (R8), near the Mekong River, between 2019 and 2021 [14] (adapted from [15]). Source data available online: https://r8way.moph.go.th/r8way/ (accessed on 21 July 2021).

Figure 2. Study area and distribution of the percentage of OV infection: (a) study area at the national scale, (b) study area at the regional scale, (c) infection presence shown as points, and (d) infection shown by hexagonal cells.

Figure 3. Comparison of the number of samples obtained with the Optimized Hot Spot Analysis tool on a rectangular grid (left) and a hexagonal grid (right); the circular search boundaries and the straight red lines differ between the two grids.

Figure 4. Distribution of training and testing points on hexagonal grids: (a) overall connected stream lines with OV-infected points, (b) training points, (c) testing points, and (d) training points and testing points.

Figure 5. Distribution of independent variables and dependent variable obtained from mathematical model within hexagonal grid: (a) X1, (b) X2, (c) X3, (d) X4, (e) X5, (f) X6, (g) X7, (h) X8, (i) X9, and (j) Y.

Figure 6. Independent variable obtained from PCA using X3, X4, X7, and X8.

Figure 7. Predicted percentage of water source infection from the FCR model using different sets of independent variables: (a) FCR predicted using X3, X4, X7, and X8, (b) FCR predicted using X1 to X9, (c) FCR predicted using X1, X2, X4, and X6, (d) FCR predicted using PCA (X3, X4, X7, and X8).

Figure 8. AUC graphs used to select the appropriate FCR model for predicting liver fluke infection: (a) AUC of FCR predicted using X3, X4, X7, and X8, (b) AUC of FCR predicted using X1 to X9, (c) AUC of FCR predicted using X1, X2, X4, and X6, (d) AUC of FCR predicted using PCA (X3, X4, X7, and X8), (e) AUC of all FCR models.

Table 1. Comparison of the 2019 and 2020 cholangiocarcinoma patient counts [59].

Province	Number of People with Cholangiocarcinoma in 2019	Number of People with Cholangiocarcinoma in 2020	Difference
Nongkhai	22	37	|15|
Buengkarn	8	7	|−1|
Loei	54	84	|30|
Nakhon Phanom	7	10	|3|
Udon Thani	50	88	|38|
Nongbualumphu	19	12	|−7|
Sakon Nakhon	161	130	|−31|

Table 2. Percentage of search criteria passed.

Search Criterion	Cut–Off	Trials	# Passed	% Passed
Min Adjusted R–Squared	>0.50	511	0	0
Max Coefficient p-Value	<0.05	511	2	0.39
Max VIF Value	<7.50	511	511	100
Min Jarque–Bera p-Value	>0.10	511	0	0
Min Spatial Autocorrelation p-Value	>0.10	27	0	0

Table Abbreviations: # Passed = number of trials that passed the criterion.

Table 3. Summary of variable significance.

Variable	% Significant	% Negative	% Positive
X3	100.00	100.00	0.00
X4	85.55	0.00	100.00
X7	7.03	100.00	0.00
X8	1.56	100.00	0.00
X9	0.00	70.31	29.69
X1	0.00	0.00	100.00
X2	0.00	26.17	73.83
X5	0.00	1.17	98.83
X6	0.00	0.00	100.00

Table 4. Summary of multicollinearity.

Variable	VIF
X7	2.27
X8	3.99
X9	4.02
X1	1.1
X2	2.19
X3	1.06
X4	1.16
X5	1.09
X6	1.1
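
As a reading aid for Table 4, a variance inflation factor can be converted back into the R² of the auxiliary regression of that variable on the remaining explanatory variables, using the standard identity VIF = 1/(1 − R²). The short Python sketch below applies this identity to the published VIF values and compares them with the <7.50 cut-off listed in Table 2. The names `vif`, `implied_r2`, and `VIF_CUTOFF` are illustrative assumptions and are not part of the original analysis.

```python
# VIF values copied from Table 4.
vif = {"X7": 2.27, "X8": 3.99, "X9": 4.02, "X1": 1.10, "X2": 2.19,
       "X3": 1.06, "X4": 1.16, "X5": 1.09, "X6": 1.10}

VIF_CUTOFF = 7.50  # "Max VIF Value" cut-off from Table 2

# Standard identity: VIF = 1 / (1 - R^2)  =>  R^2 = 1 - 1 / VIF
implied_r2 = {name: 1.0 - 1.0 / value for name, value in vif.items()}

for name, value in vif.items():
    status = "below" if value < VIF_CUTOFF else "above"
    print(f"{name}: VIF = {value:.2f}, "
          f"implied R^2 = {implied_r2[name]:.3f} ({status} cut-off)")
```

Under this identity, the largest VIF in the table (X9) corresponds to an auxiliary R² of roughly 0.75, which is still below the level implied by the 7.50 cut-off.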

Table 5. Summary of residual normality (JB).

AdjR2	AICc	K(BP)	VIF	SA	Model
−0.00138	1713.306	0.784	1	0.000064	−X9
0.002586	1710.864	0.237	1	0.000187	−X8
0.00332	1710.411	0.141	1	0.000365	−X7 *

Table Abbreviations: Model Variable significance (* = 0.10).

Table 6. Summary of residual spatial autocorrelation (SA).

SA	AdjR2	AICc	K (BP)	VIF	Model
0.003364	0.013443	1710.342	0.573381	3.991	−X7 − X8 − X9 + X1 − X3 + X4 + X5
0.003275	0.014683	1708.521	0.458194	1.3707	−X7 * − X9 + X1 − X3 + X4 * + X5
0.002862	0.014801	1708.448	0.500987	3.985	−X7 − X8 − X9 + X1 − X3 + X4

Table Abbreviations: AdjR2—Adjusted R-Squared, AICc—Akaike’s Information Criterion, JB—Jarque–Bera p-Value; K(BP)—Koenker (BP) Statistic p-Value, VIF—Max Variance Inflation Factor, SA—Global Moran’s I p-Value; Model Variable sign (+/−), and Model Variable significance (* = 0.10; ** = 0.05; *** = 0.01).

Table 7. Covariance matrix.

Layer	X3	X4	X7	X8
X3	1.59 × 10⁻³	−1.46 × 10⁻⁶	−2.34 × 10⁻⁴	1.47 × 10⁻⁴
X4	−1.46 × 10⁻⁶	9.63 × 10⁻³	1.43 × 10⁻⁴	1.50 × 10⁻³
X7	−2.34 × 10⁻⁴	1.43 × 10⁻⁴	6.92 × 10⁻⁴	1.24 × 10⁻³
X8	1.47 × 10⁻⁴	1.50 × 10⁻³	1.24 × 10⁻³	1.07 × 10⁻²

Table 8. Correlation matrix.

Layer	X3	X4	X7	X8
X3	1	−0.00037	−0.22298	0.0357
X4	−0.00037	1	0.05525	0.14769
X7	−0.22298	0.05525	1	0.45663
X8	0.0357	0.14769	0.45663	1
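
The relationship between Tables 7 and 8 can be checked with a few lines of Python: each correlation is the covariance divided by the product of the two standard deviations, r_ij = c_ij / sqrt(c_ii · c_jj). The sketch below uses the rounded covariances from Table 7, so small differences from Table 8 in the third decimal place are expected; the function name `covariance_to_correlation` is an illustrative assumption.

```python
import math

layers = ["X3", "X4", "X7", "X8"]

# Covariance matrix copied from Table 7 (published values are rounded).
cov = [
    [1.59e-3, -1.46e-6, -2.34e-4, 1.47e-4],
    [-1.46e-6, 9.63e-3, 1.43e-4, 1.50e-3],
    [-2.34e-4, 1.43e-4, 6.92e-4, 1.24e-3],
    [1.47e-4, 1.50e-3, 1.24e-3, 1.07e-2],
]


def covariance_to_correlation(c):
    """Return r_ij = c_ij / sqrt(c_ii * c_jj) for a square covariance matrix."""
    n = len(c)
    std = [math.sqrt(c[i][i]) for i in range(n)]
    return [[c[i][j] / (std[i] * std[j]) for j in range(n)] for i in range(n)]


corr = covariance_to_correlation(cov)
for name, row in zip(layers, corr):
    print(name, " ".join(f"{value:9.5f}" for value in row))
```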

Table 9. Eigenvalues and eigenvectors.

Number of Input Layers	Number of Principal Component Layers
4	4
PC Layer	X3	X4	X7	X8
Eigenvalues	0.01185	0.00861	0.00164	0.00048
Eigenvectors’ Input Layer
X3	0.00946	0.00941	0.97518	−0.22101
X4	0.56019	0.82835	0.00145	−0.00487
X7	0.09843	0.07187	−0.22095	−0.96764
X8	0.82244	0.55550	0.01424	0.12166
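
To illustrate where the first column of Table 9 comes from, the sketch below approximates the dominant eigenvalue and eigenvector of the Table 7 covariance matrix by power iteration. This is a minimal pure-Python stand-in, not the PCA routine used in the study; the function name `power_iteration` and the iteration count are assumptions, and because the published covariances are rounded, the result only approximately reproduces the first eigenvalue and the first eigenvector column of Table 9.

```python
import math

layers = ["X3", "X4", "X7", "X8"]

# Covariance matrix copied from Table 7 (layer order X3, X4, X7, X8).
cov = [
    [1.59e-3, -1.46e-6, -2.34e-4, 1.47e-4],
    [-1.46e-6, 9.63e-3, 1.43e-4, 1.50e-3],
    [-2.34e-4, 1.43e-4, 6.92e-4, 1.24e-3],
    [1.47e-4, 1.50e-3, 1.24e-3, 1.07e-2],
]


def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenpair of a symmetric matrix."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n  # arbitrary unit-length start vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n))
                   for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # For a unit vector v, the Rayleigh quotient v^T A v estimates the eigenvalue.
    product = [sum(matrix[i][j] * vector[j] for j in range(n))
               for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))
    return eigenvalue, vector


pc1_eigenvalue, pc1_loadings = power_iteration(cov)
print(f"First eigenvalue: {pc1_eigenvalue:.5f}")
for name, loading in zip(layers, pc1_loadings):
    print(f"{name}: {loading:.5f}")
```

Power iteration only recovers the leading component; a full eigendecomposition routine would be needed to reproduce the remaining columns.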

Table 10. Percent and accumulative eigenvalues.

PC Layer	Eigen Value	Percent of Eigen Values	Accumulation of Eigen Values
X3	0.01185	52.4703	52.4703
X4	0.00861	38.1232	90.5935
X7	0.00164	7.2683	97.8617
X8	0.00048	2.1383	100.0000
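
The percentage and accumulated columns of Table 10 follow directly from the eigenvalues: each eigenvalue is divided by their sum, and the shares are then added cumulatively. The Python sketch below recomputes both columns from the rounded eigenvalues, so the results differ from the published figures in the second decimal place. The 90% threshold and the name `n_components_90` are illustrative assumptions rather than criteria stated in the study.

```python
from itertools import accumulate

# Eigenvalues copied from Table 10, in the order listed there.
eigenvalues = [0.01185, 0.00861, 0.00164, 0.00048]

total = sum(eigenvalues)
percent = [100.0 * value / total for value in eigenvalues]
cumulative = list(accumulate(percent))

for index, (share, running) in enumerate(zip(percent, cumulative), start=1):
    print(f"Component {index}: {share:7.4f}% of variance, {running:8.4f}% accumulated")

# Smallest number of leading components whose accumulated share reaches 90%
# (an illustrative threshold, not one taken from the paper).
n_components_90 = next(i for i, running in enumerate(cumulative, start=1)
                       if running >= 90.0)
print("Components needed for 90%:", n_components_90)
```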

Table 11. Model Characteristics.

Number of Trees	300
Leaf Size	1
Tree Depth Range	6–25
Mean Tree Depth	15
% of Training Available per Tree	100
Number of Randomly Sampled Variables	3
% of Training Data Excluded for Validation	40

Table 12. Model Out-of-Bag Errors.

Number of Trees	150	300
MSE	6.131	5.991
0.0	1.582	1.263
1.0	88.098	89.441

Table 13. Top Variable Importance.

Variable	Importance	%
X6	0.55	16
X4	0.51	15
X2	0.50	15
X1	0.50	15
X3	0.32	10
X9	0.27	8
X7	0.25	8
X8	0.24	7
X5	0.23	7
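
The percentage column in Table 13 appears to be each variable's share of the summed importance scores. The sketch below recomputes those shares under that assumption; because the published importance values are rounded to two decimals, the recomputed percentages can differ from the table by up to about one percentage point. The names `importance`, `share`, and `top_four_share` are illustrative.

```python
# Importance scores copied from Table 13.
importance = {"X6": 0.55, "X4": 0.51, "X2": 0.50, "X1": 0.50, "X3": 0.32,
              "X9": 0.27, "X7": 0.25, "X8": 0.24, "X5": 0.23}

total = sum(importance.values())

# Assumed definition: percentage = 100 * importance / sum of all importances.
share = {name: 100.0 * value / total for name, value in importance.items()}

for name, value in sorted(share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:5.1f}%")

# Combined share of the four highest-ranked variables (X6, X4, X2, X1).
top_four_share = sum(sorted(share.values(), reverse=True)[:4])
print(f"Top four variables: {top_four_share:.1f}%")
```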

Table 14. Training Data: Classification Diagnostics.

Category	F1-Score	MCC	Sensitivity	Accuracy
0.0	1.00	1.00	1.00	1.00
1.0	1.00	1.00	1.00	1.00

Table 15. Validation Data: Classification Diagnostics.

Category	F1-Score	MCC	Sensitivity	Accuracy
0.0	0.96	−0.03	0.99	0.92
1.0	0.00	−0.03	0.00	0.92
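
The diagnostics in Tables 14 and 15 are standard functions of a confusion matrix. The Python sketch below shows how F1-score, MCC, sensitivity, and accuracy are computed per category. The counts are hypothetical, chosen only so that the rounded results resemble Table 15; they are not the study's actual confusion matrix, which is not reported here, and the function name `binary_diagnostics` is likewise an assumption.

```python
import math


def binary_diagnostics(tp, fp, fn, tn):
    """Diagnostics for one category treated as the 'positive' class."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"f1": f1, "mcc": mcc, "sensitivity": sensitivity, "accuracy": accuracy}


# HYPOTHETICAL counts for 100 validation samples (not taken from the paper):
# 93 samples of category 0.0 (92 predicted 0.0, 1 predicted 1.0) and
# 7 samples of category 1.0 (all 7 predicted 0.0).
category_0 = binary_diagnostics(tp=92, fp=7, fn=1, tn=0)  # 0.0 as positive
category_1 = binary_diagnostics(tp=0, fp=1, fn=7, tn=92)  # 1.0 as positive

for label, result in (("0.0", category_0), ("1.0", category_1)):
    print(label, {key: round(value, 2) for key, value in result.items()})
```

With counts this imbalanced, overall accuracy stays high even though no sample of category 1.0 is recovered, which is why sensitivity and MCC are reported alongside accuracy.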

Table 16. Explanatory Variable Range Diagnostics.

Variable	Training		Validation		Prediction		Share
Variable	Minimum	Maximum	Minimum	Maximum	Minimum	Maximum	Training ^a	Validation ^b	Prediction ^c
X7	−0.14	0.67	−0.15	0.65	−0.15	0.67	0.99 *	0.97 *	1.01 +
X8	0.00	6.64	0.45	6.76	0.00	6.76	0.98 *	0.93 *	1.02 +
X9	−0.52	0.73	−0.39	0.74	−0.52	0.74	1.00 *	0.90 *	1.00 +
X1	0.00	1.00	0.04	1.00	0.00	1.00	1.00	0.96 *	1.00
X2	0.00	1.00	0.05	1.00	0.00	1.00	1.00	0.95 *	1.00
X3	0.49	1.00	0.00	1.00	0.00	1.00	0.51 *	1.96	1.96 +
X4	0.00	0.99	0.00	1.00	0.00	1.00	0.99 *	1.01	1.01 +
X5	0.29	1.00	−0.00	1.00	−0.00	1.00	0.71 *	1.41	1.41 +
X6	0.00	2104.21	0.00	1533.11	0.00	2104.21	1.00	0.73 *	1.00

^a % of overlap between the ranges of the training data and the input explanatory variable; ^b % of overlap between the ranges of the validation data and the training data; ^c % of overlap between the ranges of the training data and the prediction data; * data ranges do not coincide. Training or validation is occurring with incomplete data. + Ranges of the training data and prediction data do not coincide and the tool is attempting to extrapolate.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
