1. Introduction
Severe liver fluke infections have been discovered in the Phon Na Kaeo district of the Sakon Nakhon province, Thailand. The liver fluke, Opisthorchis viverrini, is known to cause cholangiocarcinoma (CCA) [1]. Thailand has the highest prevalence of bile duct cancer cases due to liver fluke infections [2], which are primarily caused by consuming raw fish contaminated with infectious larvae and by the widespread consumption of semi-raw or raw seafood [3]. There have even been reports of fluke infections from fermented fish products [4]. In Thailand, liver fluke and cholangiocarcinoma are persistent public health concerns. Annually, these diseases claim the lives of at least 20,000 people in the northeast region alone [5]. With 6–8 million current cases of liver fluke infection, it is essential to test individuals for this infection [6]. Eliminating the parasites can significantly reduce the risk of developing cholangiocarcinoma [7].
Sakon Nakhon Hospital alone records nearly a thousand new cases of CCA each year. Although the main risk factors for O. viverrini infection are known, the incidence of CCA has not decreased in the past decade [8]. CCA is most prevalent in four Thai provinces: Sakon Nakhon, Phrae, Roi-Et, and Nong Bua Lamphu [9]. The authors of [10] found that individuals with a high severity of O. viverrini infection (>6000 eggs per gram of feces) had a 14.1-fold higher likelihood of developing CCA compared to those without the infection. Approximately 10% of people infected with O. viverrini eventually develop CCA, causing major health crises in the region [11]. The five-year survival rates for patients who underwent surgery for intrahepatic [12], distal extrahepatic, and hilar CCA were 22–44%, 27–37%, and 11–41%, respectively [13].
The largest natural water contact zone in the northeast lies near the boundary of the Nong Han subdistrict, owing to the unique topography of the area. The swamp has specific physical characteristics that make it a significant source of water. It remains full year-round because it is fed by multiple streams, making it an essential source of food for the local population. Fish is a primary source of protein for people living in the watershed. There is a cultural preference for raw fish prepared with herbs to create a sweet, sour, and spicy flavor [1]. As a result, residents living close to the river basin eat fish with every meal [3]. Preliminary screening results from 2019 to 2021 indicate that only a small proportion of individuals have developed liver fluke infection [2]. Furthermore, a study on fish liver fluke infection prevalence found that 33.33% of the Sakon Nakhon province is affected by the infection. A study on the density of larvae in fish found 10–20 metacercariae per kilogram of fish [10]. Consequently, the Sakon Nakhon province continues to experience outbreaks of liver fluke. Feces containing liver fluke eggs may contaminate water sources and contribute to recurring illnesses and an ongoing cycle of infection. It is important to monitor the outbreaks and infections spatially, but this requires data from remote sensing and geographic information system processing.
Figure 1 illustrates the percentage of reported cases found. The Sakon Nakhon province is home to the largest freshwater resource in the northeast, serving as a breeding ground for animals during the rainy season [2]. Because Phon Na Kaeo has the highest average infection rate in the Sakon Nakhon province, provincial health officials should closely monitor the situation. The investigation utilized data on the number of liver fluke infections in the Phon Na Kaeo district [3].
Remote sensing information systems enable the geographic study of liver fluke infections, making geographic information system (GIS) knowledge particularly valuable as an analytical tool. Remote sensing (RS) data derived from satellite imagery allow an in-depth analysis of the likelihood and distribution of liver flukes. This analysis can incorporate various indicators, such as the standardized vegetation index, soil moisture index, and soil cover index, which can be linked to the presence of the liver fluke’s intermediate hosts. Several studies, including those by the authors of [3,16], have employed spatial statistics to investigate geographical factors correlated with liver fluke infection. However, these studies analyzed large areas, leading to inconsistencies and discrepancies in the raster data. To address this, geographically weighted regression (GWR) models were created in small area unit systems for hydrological factor analyses, proving effective with high R² values in all models [17,18,19,20]. Combining spatial modeling with mathematical models can enhance the accuracy of linear models and improve risk analyses. However, when spatial statistical models are used for forecasting, a risk analysis can only be conducted at the sub-basin level, and sufficient independent variable data are required to generate relevant trendlines. Therefore, by incorporating machine learning (ML) and learning from spatial characteristics, it is possible to estimate the risk of infection at the location of a water supply.
Modern research has incorporated machine learning into spatial risk assessment problems, as demonstrated in [21]. In recent years, advancements in machine learning techniques, processing power, and geospatial developments, including software, have made it easier to generate spatial maps [22]. Various machine learning algorithms, such as knowledge-based approaches [23], multivariate logistic regression methods [24,25], and multivariate binary logistic regression [26], have been shown to improve the accuracy of spatial maps in recent studies. The following algorithms have been investigated: the general linear model [27,28], quadratic discriminant analysis [29], boosted regression tree [24], random forest classification (RFC) [30,31,32,33], multivariate adaptive regression splines [34,35,36], classification and regression tree [21,36], support vector machine [37,38], naïve Bayes [39,40], generalized additive model [36], neuro-fuzzy and adaptive neuro-fuzzy inference [41,42,43], fuzzy logic [44], and artificial neural networks [25,45,46,47,48,49]. Maximum entropy [50,51] and decision tree [52,53] approaches have also been proposed. Machine learning technologies have also been commonly used to create landslide susceptibility maps (LSMs). The studies in [54,55] compared the performance and effectiveness of various machine learning techniques in the literature and found that tree-based ensemble optimization strategies outperformed other machine learning algorithms. Several studies on machine learning applications have shown that the random forest classification method consistently performs better than other models in terms of the receiver operating characteristic (ROC) and area under the ROC curve (AUC), although this also depends on the factors influencing the machine learning process, such as the number of training and testing points. A wide range of spatial conditioning factors can be utilized to build machine learning models. Several studies on flood-prone landslides and land-use change evaluation have employed remote sensing and GIS approaches [15,56,57].
There is currently a lack of studies on spatial fluke infection in small river basins due to the absence of a direct model to predict the presence of liver flukes in water bodies. Most of the existing models are designed for larger areas such as regions, and thus are not suitable for studying smaller river basins. To address this gap, this study suggests that spatial features could be used as indicators of infection, by employing a forest-based classification and regression (FCR) modeling technique. The main objective of this study was to identify the key features of the small river basin that are most influential in causing infection, and then use these characteristics to develop a forecasting model based on FCR. Unlike previous studies, this research is the first to utilize FCR for this type of prediction.
The previous study used frequency-compensation ratio (FR) methodologies to forecast the proportion of infection risk based on geographical parameters at the watershed level and learning point locations, which are bodies of water containing diseased fish. As a result, the sub-basin level can be managed appropriately for protection, provided that the spatial distribution of each parasite’s features is significant to any subspace unit at the sub-basin level [58]. For example, interrupting the mollusc host cycle can maintain healthy populations and prevent future disease, reducing community impact and medical treatment costs. However, creating independent variable data from point data and using them as representatives for watersheds to analyze the relationship with infection may not be applicable in all watersheds. This approach requires the point data to be estimated at the grid size, which can inflate or reduce the data and lead to multicollinearity, that is, correlation between independent variables.
This study used a hexagonal grid to implement independent variables. The FCR model test narrowed down the relevant independent variable factors to four out of nine spatial factors: the index of land-use types, index of soil drainage properties, distance index from the road network, and distance index from surface water resources. The distance index from the flow accumulation lines, index of average surface temperature, average surface moisture index, average normalized difference vegetation index, and average soil-adjusted vegetation index were found to have minimal impact on accuracy. In this study, an independent set of variables will be reconstructed within the constraints of a 2000 m × 2000 m hexagonal grid. This will enable an accurate description of the data values of all variables by defining border distances using distance analysis variables. The mathematical model will be adjusted to fit the hexagonal grid, and the score values of independent variables will be obtained from data provided by the Sakon Nakhon province’s public health authority.
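The layout of a hexagonal grid can be illustrated with a minimal sketch. It assumes pointy-top hexagons whose flat-to-flat width is 2000 m and a rectangular bounding box in projected metre coordinates; the function name `hex_grid_centers`, the bounding box, and the orientation are illustrative assumptions, not the procedure used in this study.

```python
import math

def hex_grid_centers(x_min, y_min, x_max, y_max, width):
    """Return centers of pointy-top hexagons covering a bounding box.

    `width` is the flat-to-flat distance of a cell, which equals the
    distance between the centers of any two neighboring cells.
    """
    row_step = width * math.sqrt(3) / 2  # vertical distance between rows
    centers = []
    row = 0
    y = float(y_min)
    while y <= y_max:
        # Every other row is shifted by half a cell so the cells interlock.
        x = x_min + (width / 2 if row % 2 else 0.0)
        while x <= x_max:
            centers.append((x, y))
            x += width
        y += row_step
        row += 1
    return centers

# Hypothetical 10 km x 10 km extent with 2000 m cells.
centers = hex_grid_centers(0, 0, 10_000, 10_000, width=2000)
```

A property worth noting is that every cell has six neighbors at the same center-to-center distance, which is one reason distance-based variables behave more uniformly on a hexagonal grid than on a square one.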
The main aim of this study is to establish a dataset of measurable independent variables across various areas. In order to achieve this, a mathematical model in the form of a hexagonal grid was created. These models were used to justify the selection of independent variables for the FCR model, which serves as the primary framework for this study. A machine learning model using FCR was created to analyze the spatial factors associated with liver fluke infection. This study aimed to achieve two objectives: (1) analyze the spatial factors correlated with liver fluke infection and (2) predict the percentage of infection in water bodies using a machine learning-based FCR model with hexagonal grids.
3. Results
3.1. Spatial Factor Distribution
Figure 5a–i illustrates the components of the independent variables X1–X9. These components represent default vector data that are converted into raster data using a mathematical model and generated with the map algebra function on a hexagonal grid. To keep the variables comparable, all hexagonal grids are normalized to a range of 0–1, except for the raster for the X6 factor, which represents the actual population in the area, and X8, which requires further index differentiation.
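Rescaling a grid variable to the range 0–1 can be sketched as follows. This assumes simple min-max scaling, which the text does not confirm; the function name and the sample stream lengths are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no contrast; map every cell to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical stream lengths (m) in four hexagonal cells.
stream_length_m = [0.0, 350.0, 1200.0, 4800.0]
x1_index = min_max_normalize(stream_length_m)
```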
The outcome of the adjustment is described by the X1 factor, also known as stream length (SL), which varies from 0.00 to 0.359. The number of value dispersions covers a wide range, from 0.941 to 1.000. This is clearly visible in the histogram, where the higher range is more frequent compared to the other ranges. Additionally, the X2 factor, or continuous slope length, demonstrates a consistent increase in the cumulative length of the water flow line with low slopes. This is indicated by the red grid cells, which correspond to index values between 0.927 and 1.000.
When analyzing the proximity of infected water sources to each drought grid, the X3 index (prevalence of nearby infected water bodies (PNW)) ranges from 0.930 to 1.000. This indicates that the likelihood of these grids experiencing infections is even higher, as the frequency of this hexagonal grid is clearly higher than other levels. However, if we only consider the length of the water flow line from the diseased grid that can connect to each grid, the flow of the connection is not sufficient to cover all grids. This leads us to screen the X4 continuous length of the line of water connected to the infected water source (CL) index as a more precise variable. It suggests that the high-level grid between 0.517 and 1.000 should be monitored.
Next, the X5 index was developed to examine the proximity between the water source and the community (PWC). This index is used to determine the link between the closeness of the water source and the location of the community. The frequency of dispersion was primarily determined by how close the village was to the infected water supply. This allowed for an easy identification of opportunities to acquire sick fish. However, upon examining the population of the villages surrounding the source of infected fish, it was discovered that the population density was low. This was reflected in the index image as the most spread grid. X6, the index for population density adjacent to water bodies (PPW), also supports this finding.
When analyzing the surface moisture variables along the river flow that connects the contaminated water source to the hexagonal grid, it was discovered that the grid surrounding the Nong Han lake had the highest index range, between 0.425 and 0.671. Additionally, the minor level range was also noteworthy. Despite the grid having a small frequency distribution, the risk of infection for fish was relatively higher compared to the grid in the low index range. This can be seen in the X7 continuation of surface moisture index along the banks of flowing water lines connecting infected water bodies (CSW).
The X8 continuity of the soil surface heat index along the banks of the flowing water line connecting the infected water source (CSH) consistently showed that grids with higher moisture levels were less likely to be soil. It could be inferred that these streams were covered with weeds along the banks of the water flow lines. This resulted in values in the high-risk range, ranging from 0.000 to 1.312. Similarly, the X9 continuation of vegetation index along the banks of stream lines connecting infected water bodies (CVS) confirmed this pattern. Water flow lines at risk were found to have moderate to fairly large vegetation index values ranging from 0.502 to 0.738. The analysis of nine factors, including the percentage of water bodies where infected fish were found, will help identify the appropriate set of factors to be used in the FCR model for learning and predicting other infectious grids.
3.2. Selected Factors for Prediction
3.2.1. Correlation between All Variables and Infection
The exploratory regression global summary table uses ordinary least squares (OLS) models to examine the consistency of the data used to develop a regression model. It does so by reporting, for each search criterion used to filter the candidate models, the percentage of trials that passed. This information is shown in Table 2. The table lists five search criteria, the most important being the maximum VIF value, which was passed by all 511 trials (100.00%). However, the other criteria, namely the minimum adjusted R², the maximum coefficient p-value, the minimum Jarque–Bera p-value, and the minimum spatial autocorrelation p-value, did not meet the validation conditions specified in the table. These results suggest that the regression model may have issues with data consistency or may not be compatible with the input data used to develop it. The data or model should therefore be evaluated and updated before being used in further analyses or decision-making.
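The figure of 511 trials is consistent with testing every non-empty combination of the nine candidate variables, since 2⁹ − 1 = 511. The sketch below only enumerates those combinations; it assumes that this is how the trials were formed, which the text does not state explicitly.

```python
from itertools import combinations

variables = [f"X{i}" for i in range(1, 10)]  # X1 ... X9

# Every non-empty subset of the nine variables is one candidate OLS model.
candidate_models = [
    subset
    for size in range(1, len(variables) + 1)
    for subset in combinations(variables, size)
]

print(len(candidate_models))  # 511, i.e. 2**9 - 1
```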
Table 3 presents the results of variable significance, summarized as follows. X3 is the most important variable and is negatively related to the outcome. X4 is consistently important and is positively related to the outcome. X7 has low importance and a negative relationship. X9 also has low importance but a positive relationship. X1, X2, X5, and X6 are not significant, although each corresponds to a positive relationship. These results allow the direction and relative impact of each variable on the outcome to be categorized. A variance inflation factor (VIF) above 10 indicates the presence of multicollinearity, whereas a VIF below 10 suggests its absence. As shown in Table 4, our model is free from multicollinearity, so its predictions are not compromised by correlated explanatory variables.
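The VIF check can be illustrated with a short sketch. It relies on the standard identity that the VIF of variable j equals the j-th diagonal element of the inverse of the correlation matrix of the explanatory variables. The function names and the sample columns are hypothetical and are not taken from the study data.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long, non-constant columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]

# Hypothetical columns: x_b is nearly a copy of x_a, so both VIFs exceed 10.
x_a = [1.0, 2.0, 3.0, 4.0, 5.0]
x_b = [1.1, 1.9, 3.2, 3.9, 5.1]
collinear_vif = vif([x_a, x_b])
```

For two uncorrelated columns the same function returns values of 1, which is the no-multicollinearity case described in the text.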
The residual normality (JB) distribution validation summary presents the results of the Jarque–Bera test conducted on the model residuals during statistical modeling. In the case of Model 1, the JB value is 0, suggesting that the residuals deviate from a normal distribution. This indicates that adjustments may be necessary to enhance the reliability of the data for future analyses. Please refer to Table 5 for further illustration.
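For reference, the Jarque–Bera statistic combines the skewness and kurtosis of the residuals, and its asymptotic p-value follows a chi-square distribution with two degrees of freedom. The sketch below is a generic implementation on hypothetical residuals; it assumes that the JB value reported in the table is this p-value, which is suggested by the "minimum Jarque–Bera p-value" criterion above but is not stated outright.

```python
import math

def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # variance
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of the chi-square distribution with 2 d.o.f.
    p_value = math.exp(-jb / 2)
    return jb, p_value

# Hypothetical, strongly skewed residuals: the p-value is close to 0.
skewed = [0.0] * 99 + [100.0]
jb_stat, jb_p = jarque_bera(skewed)
```

A p-value near 0 rejects the hypothesis of normally distributed residuals, which matches the reading of the JB value given in the text.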
The diagnostics of the multiple regression analysis can be summarized as follows: (1) The spatial autocorrelation (SA) value tests whether the model residuals are spatially clustered rather than randomly distributed. (2) The Adj R² value measures how well the model matches the data. A higher value suggests a more effective model in explaining the data. (3) The AICc value is a relative measure of model quality. A lower value indicates a better ability of the model to describe the data. (4) The Jarque–Bera test value, also known as the “JB value”, indicates whether the model residuals deviate from a normal distribution. A value of 0 denotes that they do. (5) The Koenker (Breusch–Pagan) test value, referred to as the K(BP) value, tests whether the modeled relationships are consistent across the study area. (6) The variance inflation factor (VIF) measures multicollinearity issues. A VIF value close to 1 indicates that the variable does not have a multicollinearity problem in the model. Examining these values in the table summarizes the link between the data and the models used in the analysis. The second set of models has the lowest values for SA, Adj R², and AICc; the lower AICc indicates a better fit to the data. The JB and K(BP) values of the second set of models are roughly 0, which, as with the JB result above, points to residuals that deviate from the model assumptions. Table 6 illustrates the second group of models, which are considered appropriate for analyzing spatial autocorrelation data because their VIF values approach 1, indicating no multicollinearity problem.
3.2.2. Generate New Independent Variables by Employing Principal Component Analysis (PCA)
Based on an analysis of the association between the independent variables and the infection percentage of water sources, it was concluded that the X3 variable was the most important. The X4, X7, and X8 variables were also found to be high priorities. To ensure that these variables are definitively included in the FCR model, they must be tested using a principal component analysis (PCA), as shown in Table 7, Table 8, Table 9, and Table 10.
Table 7 shows that the X7 variable should be included in the covariance matrix model. This is because it has a higher covariance compared to the correlation in X3 and X4. Including X7 helps to capture the learning between highly correlated features in the data imported into the model. It also improves the effectiveness of the weights in the model training process for supervised learning of FCR models, reducing the risk of overfitting in large datasets. Additionally, the correlation matrix values in Table 7 reveal the following correlations between variables: X3 has a correlation of 0.03570 with X8, indicating a very weak positive correlation. X4 has a correlation of 0.05525 with X7, which is also a weak positive correlation. Finally, X7 and X8 have a correlation of 0.45663, the strongest of these pairs, indicating a moderate positive correlation.
Table 9 displays the eigenvalues and eigenvectors, providing insight into the significance of each principal component (PC) Layer. The eigenvalues in Table 8 indicate the importance of each PC Layer, while the eigenvectors reveal the relationship between the Input Layer and the PC Layer. The vector associated with each Input Layer signifies the pattern or trend of this relationship. In our case, we have X3, X4, X7, and X8 as the Input Layers, and a total of four principal component layers. The highest eigenvalue belongs to PC Layer X3, while the smallest eigenvalue is associated with PC Layer X8. Each Input Layer represents its relationship with the PC Layers by contributing a specific value that helps define the PC Layer. Utilizing eigenvalues and eigenvectors in this manner enables an effective analysis and reduction in data dimensions based on their relevance and relationships.
The percent and cumulative value table is useful for assessing how much each variable or dimension contributes to describing the data. The percent values indicate the share of the total described by each variable, while the cumulative values give the running total of those shares. Among the variables, X4 stands out as the second most significant, with an eigenvalue of 0.00861, representing 38.1232% of the total and bringing the cumulative share to 90.5935%. The X7 variable comes next in terms of importance, with an eigenvalue of 0.00164, accounting for 7.2683% of the total and a cumulative share of 97.8617%. The X8 variable contributes the least to the description of the data, with an eigenvalue of 0.00048, representing only 2.1383% of the total and a cumulative share of 100.00%.
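The percent and cumulative columns follow directly from the eigenvalues: each percent is the eigenvalue divided by the sum of all eigenvalues, and the cumulative column is the running total. The reported figures imply that the first component accounts for 90.5935 − 38.1232 = 52.4703% of the total. The sketch below uses hypothetical eigenvalues rather than the values in the table, because the first eigenvalue is not quoted in the text.

```python
def explained_share(eigenvalues):
    """Return the percent and cumulative percent for each eigenvalue."""
    total = sum(eigenvalues)
    percents = [100 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in percents:
        running += p
        cumulative.append(running)
    return percents, cumulative

# Hypothetical eigenvalues, sorted from largest to smallest.
percents, cumulative = explained_share([0.5, 0.3, 0.15, 0.05])
```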
Figure 6 and Table 10 illustrate the reconfiguration of the four variables into a new set of independent variables, with a data range spanning from 1.35 to 3.41.
3.3. Predicting OV Infection Using a Machine Learning-Based FCR Model
Table 11 provides information on the characteristics of the learning model. FCR refers to the creation of a decision tree model using integers ranging from two to five, and one decimal place. The model includes the following factors for interpretation: (1) Number of Trees: This indicates the total number of trees in the model, which is set at 300. (2) Leaf Size: This refers to the minimum size of the leaves within a tree, which is set at 1. (3) Tree Depth Range: This represents the range of depths for the trees, which falls between 6 and 25. (4) Mean Tree Depth: This indicates the average depth across all trees created, which is 15. (5) % of Training Available per Tree: This reflects the percentage of training data used to build each individual tree, which is set at 100%. (6) Number of Randomly Sampled Variables: This refers to the number of variables randomly sampled for each tree, which is set at 3. (7) % of Training Data Eliminated for Validation: This represents the percentage of training data, specifically 40%, that is removed for model testing purposes.
The out-of-bag errors in Table 12 show how the model’s error is estimated from data that were not used to build a given tree. The table shows that the model with 300 trees has a lower mean squared error (MSE) for out-of-bag errors. Additionally, with 150 trees, the out-of-bag MSE is smaller when the target class is 0.0 than when the target class is 1.0.
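The idea behind out-of-bag evaluation can be shown with a small simulation. It assumes the standard random forest procedure, in which each tree is trained on a bootstrap sample drawn with replacement, so that the observations never drawn for a tree (its out-of-bag set) can be used to test that tree. Whether the FCR tool samples in exactly this way is an assumption, and the sample size and seed are arbitrary.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample and return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 1000                # hypothetical number of training cells
oob_fractions = []
for _ in range(300):    # one bootstrap sample per tree
    _, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
```

On average about 1/e, or roughly 37%, of the observations are out-of-bag for any single tree, so every observation is left out of many of the 300 trees and can be scored by them.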
The Top Variable Importance table displays the significance of factors. This is determined by the importance value assigned to each variable, indicating its predictive power for outcomes. The variable X6 has the highest importance (16%), indicating that it greatly contributes to accurate outcome predictions when included in the modeling process. Following X6, the variables X4, X2, and X1 also have high importance. On the other hand, the variables X5, X8, X7, and X9 have comparatively lower importance. These lower-ranked variables in Table 13 have percentages below 10%, suggesting that their impact on model predictions may be minimal.
The results of the data classification test are presented in Table 14 and Table 15, showing F1-Score, MCC, sensitivity, and accuracy values for two data groups. Based on the analysis of these tables, it can be concluded that the created models are highly effective in classifying and predicting data. It is clear that the developed model performs well in accurately predicting the type of data found in most of the test datasets.
The precision and recall of the model are used to calculate the F1-Score, which is a measure of the effectiveness of classification. A higher F1-Score indicates high precision and recall, while a lower F1-Score indicates lower precision and recall. By evaluating the F1-Score, we can determine which models perform better and choose the most effective ones for data classification. When analyzing the validation data, it is important to consider the following classification diagnostics for each category: F1-Score, MCC, sensitivity, and accuracy. (1) F1-Score: This metric measures the effectiveness of the model in data classification by combining precision and recall into a single value. (2) MCC: The Matthews correlation coefficient measures the correlation between the actual and predicted classes. A value close to 1 indicates strong agreement. (3) Sensitivity: This metric describes the model’s ability to correctly identify positive cases. A high sensitivity indicates that the model can accurately recognize positive instances. (4) Accuracy: This metric is the proportion of all observations that the model classifies correctly.
By using F1-Score, MCC, sensitivity, and accuracy to analyze the validation data, we can gain a better understanding of the model’s effectiveness in classifying the data. The table below presents the results of the validation data test, which evaluates the accuracy of the model’s predictions or classifications. For Category 0.0, the model demonstrates high accuracy, with an F1-Score of up to 0.96 and a sensitivity of up to 0.99. However, it should be noted that the MCC is negative, which indicates that the predicted and actual classes are not positively correlated. On the other hand, for Category 1.0, the model fails to classify correctly, as indicated by an F1-Score and sensitivity of 0.00, along with a negative MCC. Overall, the model achieves a total accuracy of 0.92, meaning that it correctly classifies 92% of the data across both categories. It is important to highlight that the model still performs well for Category 0.0.
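The relationship between these diagnostics can be made concrete with a small calculation. The confusion-matrix counts below are hypothetical: they were chosen only because they reproduce the reported pattern (F1-Score 0.96 and sensitivity 0.99 for Category 0.0, F1-Score and sensitivity 0.00 for Category 1.0, negative MCC, accuracy 0.92) and are not the counts from this study.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, F1, accuracy, and MCC for one category."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity,
            "f1": f1, "accuracy": accuracy, "mcc": mcc}

# Hypothetical validation set: 93 cells of class 0.0 and 7 of class 1.0.
# The model labels 92 of the class-0.0 cells correctly and none of class 1.0.
class_0 = binary_metrics(tp=92, fp=7, fn=1, tn=0)   # class 0.0 as "positive"
class_1 = binary_metrics(tp=0, fp=1, fn=7, tn=92)   # class 1.0 as "positive"
```

With counts like these, the high overall accuracy comes almost entirely from the majority category, which is why the per-category F1-Score and the MCC are reported alongside it.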
The synthesis of the factor results concluded that X1 is the best explanatory variable. This is because it has the best range diagnostics in all data ranges (training, validation, prediction), the most compatible share values in all ranges (training, validation, prediction), and the highest value in prediction of 1.00. Therefore, in order to compare the predicted value to the alternative model, the X1 factor must be included in the optional FCR model, as indicated in Table 16.