Article

Applying Fuzzy Inference and Machine Learning Methods for Prediction with a Small Dataset: A Case Study for Predicting the Consequences of Oil Spills on a Ground Environment

by Anastasiya Burmakova * and Diana Kalibatienė *
Department of Information Systems, Faculty of Fundamental Sciences, Vilnius Gediminas Technical University, Saulėtekio al. 11, 10223 Vilnius, Lithuania
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(16), 8252; https://doi.org/10.3390/app12168252
Submission received: 28 June 2022 / Revised: 12 August 2022 / Accepted: 15 August 2022 / Published: 18 August 2022
(This article belongs to the Special Issue Design, Development and Application of Fuzzy Systems)

Abstract
Applying machine learning (ML) and fuzzy inference systems (FIS) requires large datasets to obtain more accurate predictions. However, in the cases of oil spills on ground environments, only small datasets are available. Therefore, this research aims to assess the suitability of ML techniques and FIS for the prediction of the consequences of oil spills on ground environments using small datasets. Consequently, we present a hybrid approach for assessing the suitability of ML (Linear Regression, Decision Trees, Support Vector Regression, Ensembles, and Gaussian Process Regression) and the adaptive neural fuzzy inference system (ANFIS) for predicting the consequences of oil spills with a small dataset. This paper proposes enlarging the initial small dataset of an oil spill on a ground environment by using the synthetic data generated by applying a mathematical model. ML techniques and ANFIS were tested with the same generated synthetic datasets to assess the proposed approach. The proposed ANFIS-based approach shows significant performance and sufficient efficiency for predicting the consequences of oil spills on ground environments with a smaller dataset than the applied ML techniques. The main finding of this paper indicates that FIS is suitable for prediction with a small dataset and provides sufficiently accurate prediction results.

1. Introduction

In real-world prediction applications, we often face the challenge of small labelled datasets [1] or small sets of data [2], such as when detecting outliers [3], detecting new trending topics, diagnosing a rare disease [4,5], predicting emerging social or natural events [6], predicting river flooding [7], etc. However, to provide sufficiently accurate predictions even under uncertainty, data-based prediction approaches, such as machine-learning (ML) techniques and fuzzy inference systems (FIS), need large datasets. Predicting the consequences of an oil spill on the ground environment also falls within the realm of prediction with small datasets [8,9], since collecting real data from the extreme situations of oil spills is very time consuming and expensive.
We can find different approaches to solving the small dataset issue in the literature. The authors of [4] generated a surrogate dataset for heart disease prediction based on original case studies. The authors of [10] showed that an appropriate choice of regression algorithm and input parameters allows us to build accurate surrogates for simulations in additive manufacturing. The authors of [11] used a cluster analysis approach to generate surrogate data in the medical domain. In [12], the authors applied a deep neural network (DNN) to integrate similarity features and perform learning to predict software defects. As can be seen from these works, synthetic data generation from the existing small datasets of real oil spills could be used in designing experiments, and these datasets are relatively inexpensive to develop and use [10]. However, for efficiency, the generation of synthetic oil spill data must be accurate enough for the task of prediction.
Moreover, the process of an oil spill and its penetration into the ground environment is complex and vague due to its nature and the dynamic environment (i.e., temperature, sun, and wind) [13,14,15]. All of these factors lead to inaccurate initial qualitative data [16,17] and subjective expert opinions regarding oil spill consequences that are difficult to understand [18,19]. Consequently, we need an effective and intelligent prediction model to facilitate the selection of suitable response methods [5]. This motivates some researchers to apply the adaptive neural fuzzy inference system (ANFIS) for oil spill prediction. According to [20,21], the application of this method in modelling and the prediction of various complex processes allows us to achieve more successful and accurate results. ANFIS is used in this study because it combines the advantages of artificial neural networks (ANNs) (i.e., self-learning from available data) and FIS (i.e., human-like reasoning) [22,23].
However, despite existing research and approaches, prediction is challenging when we have only a few real entities of oil spills, and the oil spill process remains fuzzy. Therefore, it is unclear which prediction approach is best for predicting the consequences of oil spills on ground environments when only a small dataset is present.
Consequently, in this paper, we aim to assess the appropriateness of several ML techniques and ANFIS for predicting the consequences of oil spills on ground environments when only small datasets are present. The main contributions and novelties of this paper are the following:
  • we propose applying ANFIS to predict the consequences of oil spills on ground environments when small datasets are present;
  • we propose using synthetic data generated from the small datasets of real oil spills by applying the mathematical formalism (described in [24]) of oil spill penetration into groundwater;
  • we use the tree impurity (i.e., a Random Forest [25,26,27] implemented in the R environment) and a correlation analysis (i.e., the Spearman and Kendall correlation coefficients) to select the appropriate variables for prediction;
  • we propose an approach that combines several ML techniques and FIS for predicting the consequences of oil spills on ground environments when small datasets are present;
  • we propose a conceptual architecture for the implementation of the introduced approach for predicting the consequences of oil spills on ground environments when small datasets are present;
  • we use a practical problem for predicting the consequences of oil spills to investigate the performance of various ML techniques and FIS. We demonstrate that the proposed ANFIS-based approach has the best prediction performance results.
This research is intended to support researchers and practitioners working on predicting and eliminating the consequences of oil spills on ground and groundwater environments. We believe that the proposed approach to oil spill prediction can help practitioners to predict the consequences of oil spills more effectively and to eliminate them more quickly. Moreover, future research directions are opened in this paper.
This paper is structured as follows: in Section 2, the background of the research is presented; Section 3 reviews related works; in Section 4, we describe the proposed assessment approach applied in the research; Section 5 presents the experiments conducted and the results obtained; Section 6 discusses the obtained results; and the conclusions of the paper are presented in Section 7.

2. Background

In recent years, the protection of the geological environment has been enhanced. As a result, different approaches to predicting, detecting, and assessing oil spills have been proposed to reduce their consequences [28,29,30]. Depending on the types of prediction, detection, and assessment problems being solved and the answers required, particular classification [11] or regression [31] algorithms are used. Both classes of algorithms (classification and regression) are used to predict outcomes in Supervised ML (SML), and both use labelled datasets to train algorithms [32]. However, they are each applied for different SML problems.
Classification is a model-finding process used to partition data into discrete classes according to certain features [33]. In classification, the algorithm is trained using the training dataset (input (x)) to categorise data into different discrete classes (output (y)), such as dog or cat, yes or no, healthy or unhealthy, spam or not spam, etc. The most popular classification algorithms are as follows: Linear classifiers [34], Support Vector Machines (SVM) [35], the k-Nearest Neighbours algorithm [36], Decision Tree learning [37], and Neural Networks [11,12,20].
Regression is a statistics-based estimation process used for finding relationships between dependent and independent variables [38]. In regression, the algorithm is trained using the training dataset (input (x)) to predict continuous variables (output (y)) which fall within some allowable interval, such as when predicting market trends, prices, weather, etc. The most popular regression algorithms are as follows: Gaussian Process Regression (GPR) [39], Support Vector Regression (SVR) [40], Linear and Logistic Regression [27,30,41], and Neural Networks [11,12,20].
As we can see, with minimal modifications some algorithms are used for solving both regression and classification tasks, such as Decision Trees and Neural Networks (also see Table 1).
One of the main differences between classification and regression is assessing the quality of the prediction [54]. For evaluating classification results, precision and recall performance metrics based on relevance [43,55] and receiver operating characteristic (ROC) curves are used. For evaluating regression results, statistical tests are used.
Predicting the consequences of oil spills on ground environments belongs to the regression group, since we would like to obtain continuous values, such as the weight of the adsorbed oil, the penetrating oil volume, etc. Consequently, in this research, we use the following statistical tests to evaluate the results of the prediction: the Root Mean Square Error (RMSE), the Mean Squared Error (MSE), and the Coefficient of Determination (R2) [35].
RMSE shows how far the observation values (ftrain) are from the predicted data points (fi) [56] (see Equation (1)).
  RMSE = \sqrt{\frac{\sum_{i=1}^{n} (f_i - f_{train})^2}{n}}  (1)
RMSE is non-negative, and 0 indicates a perfect fit to the data. Since RMSE can vary in the interval [0, +∞), some authors [36,57,58] propose using a normalised RMSE (Equation (2)).
Then, a normalised RMSE (NRMSE) number ranges in the interval of [0; 1], where values closer to 0 represent better fitting predictions.
  NRMSE = \frac{RMSE}{\max f_i - \min f_i}  (2)
The coefficient of determination (R2) is used to show how many predicted data points (fi) meet the regression line (Equation (3)) [56]. Its value varies in the interval from 0 to 1, where 1 indicates a perfect fit and 0 indicates no fit at all.
  R^2 = 1 - \frac{\text{sum squared regression}}{\text{total sum of squares}} = 1 - \frac{\sum_{i=1}^{n} (f_i - \bar{f}_i)^2}{\sum_{i=1}^{n} (f_i - f_{train})^2}  (3)
where \bar{f}_i represents the mean value of a sample.
The MSE [56,57,58,59,60] (Equation (4)) shows the number of errors in the model by measuring the average squared difference between the predicted output fi and the training value ftrain.
  MSE = \frac{1}{n} \sum_{i=1}^{n} (f_i - f_{train})^2  (4)
MSE is always non-negative and approaches zero for a more accurate model.
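For illustration, the four performance measures above can be computed in a few lines of code. The following is a minimal NumPy sketch (it is not the paper's implementation, which used MATLAB); note that R2 is computed here in its conventional residual-over-total sum-of-squares form.

```python
import numpy as np

def regression_metrics(f_train, f_pred):
    """Compute MSE, RMSE, NRMSE, and R2 (cf. Equations (1)-(4))."""
    f_train = np.asarray(f_train, dtype=float)   # observed (training) values
    f_pred = np.asarray(f_pred, dtype=float)     # model predictions
    mse = np.mean((f_pred - f_train) ** 2)                 # Equation (4)
    rmse = np.sqrt(mse)                                    # Equation (1)
    nrmse = rmse / (f_pred.max() - f_pred.min())           # Equation (2): normalised by the prediction range
    ss_res = np.sum((f_train - f_pred) ** 2)
    ss_tot = np.sum((f_train - f_train.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                             # conventional form of Equation (3)
    return {"MSE": mse, "RMSE": rmse, "NRMSE": nrmse, "R2": r2}

# Example: a near-perfect prediction gives a small RMSE and an R2 close to 1.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.0, 4.2]))
```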

3. Related Works

This section reviews approaches, methods, and techniques used in ML to solve the issues caused by small datasets in different application areas. They are summarized in Table 1.
Table 1 summarizes the relevant related works on prediction with small datasets in various application domains. Consequently, the table consists of seven columns as follows: (1) reference for the publication; (2) domain of interest, describing the application domain of prediction with small datasets; (3) solution to the issue of small data, presenting approaches used to overcome the small dataset issue; (4) classification/regression, identifying the type of prediction; (5) methods/techniques used in the research; (6) tools used to implement the proposed approach; and (7) dataset for the experiment.
As can be seen from Table 1 Column 2, the small dataset issue exists in different research areas including medicine [39], studying the chemical and physical properties of various materials [20,45,47], energy optimisation [11,31,46], marine oil spills [11], etc. Despite different research areas, the small dataset issue tends to be solved in similar ways (Table 1 Column 3). Most authors use the surrogate data generation approach [4,50], whilst some use unsupervised learning, such as the cluster analysis approach [11] and deep neural networks [11,12,20], to increase the amount of data. Other authors suggest the use of an interpolation approach for data augmentation and data visualisation [46,49], whilst some select useful predictive features of data samples [12,47]. The authors of [31] propose an approach where users enter data for prediction, whilst the authors of [51] conclude that their proposed approach of two ML models based on a convolutional network can naturally operate on an input of any size.
Column 4 in Table 1 shows that despite the differences in classification and regression, the same methods/techniques (Table 1 Column 5) were used to make predictions. As a rule, most authors applied ML approaches, such as various kinds of regression algorithms, k-nearest neighbours, SVM, Decision Trees and Ensembles [4,45,48]. In the analyzed papers, most authors also applied deep neural networks [4,12,20,50], arguing that this approach is optimal for solving predictions with small datasets. Some authors [11,44] used unsupervised learning algorithms when indicating dependence in the data features was difficult. Several authors [31,38,45] showed optimization methods for predictive algorithms, such as cross-validation, gradient descent, and the use of fuzzy logic.
Modern tools simplify the implementation of prediction algorithms (Table 1 Column 6). The authors of [4,47] used various programming languages, such as C++ and R. Most authors argued that the Python libraries Scikit-Learn, TensorFlow, and Keras are convenient for data analysis and deep learning methods [12,20,49]. The authors of [12] used MATLAB modules for ML, whilst the authors of [31] proposed the implementation of ANFIS from MATLAB to implement predictions with fuzzy logic.
Column 7 shows the datasets that the authors used for predictions. In most cases, this data concerned real objects or processes [4,20,44,45]. The authors of [10] proposed an approach for generating a dataset based on two previously built models.
Summing up, in this research, the most suitable approach to solving the small dataset issue for oil spill prediction is generating a surrogate dataset, since the process of an oil spill can be specified using a mathematical model [13]. According to the related works presented in Table 1, it is reasonable to use several ML methods for oil spill prediction and apply fuzzy logic to describe the vagueness of the oil spill process on the ground environment.

4. Method

This section presents the proposed approach for the prediction of the consequences of oil spills using small sets of data. Its main steps are as follows (see Figure 1).

4.1. The Mathematical Model of Oil Spills and Their Penetration into the Ground

As described before, we use a mathematical model to generate synthetic data. This mathematical model [13] was developed based on mathematical equations of oil penetration into the ground environment. The mathematical model corresponds to the four layers of ground environments (see Figure 2): (1) surface layer; (2) soil layer; (3) ground layer; and (4) groundwater layer.
In each layer a set of outputs are calculated from the initial variables presented in Table 2 using the mathematical model visualized in Figure 2 as follows: (1) the shape of the pollution area and the weight of the evaporated oil product (OP); (2) the OP penetration depth in the soil layer and the ground layer; (3) the weight and concentration of the absorbed OP in the ground layer; (4) the remaining OP weight that can reach the groundwater layer; (5) the time of maximum OP penetration into the groundwater layer; (6) the distance of horizontal propagation of OP in the groundwater layer.

4.2. Generating the Synthetic Data Using the Mathematical Model

The mathematical model described is used to generate synthetic data, which will be used for further oil spill prediction. According to the mathematical model, an initial set of 12 variables is presented in Table 2. Therefore, we should determine the relationship among those variables and select only those that do not correlate.
An initial set of variables is presented in Table 2 as follows: Column 2 shows the name of the variable and the measurement units; Column 3 shows the minimum possible value; Column 4 shows the maximum possible value. Column 1 (Rank) and Column 5 (Importance (Gini impurity)) were obtained by applying the importance method in the R software environment, as described in Section 4.3.
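As an illustration of this step, the sketch below shows one possible way to generate such a synthetic dataset in Python: each initial variable is sampled within its minimum-maximum range and the sample is then fed into the mathematical model of [13,24]. The variable names, the bounds, the uniform sampling scheme, and the oil_spill_model placeholder are hypothetical assumptions; the actual model equations are given in [13,24] and are not reproduced here, and the paper's own sampling scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minimum/maximum bounds for four of the initial variables; the real
# ranges are listed in Table 2 and these numbers are placeholders only.
bounds = {
    "spilled_oil_volume_m3": (1.0, 5000.0),
    "oil_density_kg_m3": (730.0, 990.0),
    "surface_spreading_coefficient": (0.1, 1.0),
    "ground_thickness_m": (0.5, 10.0),
}

def oil_spill_model(sample):
    """Placeholder for the mathematical model of [13,24] that maps the initial
    variables to outputs such as the weight of OP penetrating the groundwater."""
    raise NotImplementedError("implement the equations of the mathematical model here")

def generate_synthetic_dataset(n_samples=150):
    """Draw each variable within its allowed range; each sample would then be
    propagated through oil_spill_model to obtain the synthetic outputs."""
    samples = []
    for _ in range(n_samples):
        sample = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
        # sample["penetrated_op_kg"] = oil_spill_model(sample)  # enable once the model is implemented
        samples.append(sample)
    return samples

print(generate_synthetic_dataset(n_samples=2))
```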

4.3. Selecting the Main Initial Variables for Prediction

In order to apply prediction methods, we have to select the appropriate variables on which the performance of the prediction depends [61]. In the literature, we can find various approaches for selecting variables as follows: filtering techniques, which are applied before model training [62]; questionnaires [63], which are based on human expert judgments as to whether a variable is relevant; visual analysis, such as the removal of a feature that has only one value or whose values are mostly missing; and the assessment of features using some statistical criteria (variance, correlation, χ2, etc.) [64].
Compared to the other variable selection approaches, questionnaires are subjective and difficult to collect [65]. Therefore, we use the statistical selection of variables as presented below.
To refine the initial set of variables derived from the mathematical model (Section 4.2), we use the tree impurity of a Random Forest [25,26,27], computed in the R software environment (based on the Gini impurity), and a set of correlations (i.e., the Spearman and Kendall correlation coefficients). The ranking of variables for prediction based on their importance is presented in Figure 3.
Consequently, we selected the four most important initial variables for predicting the consequences of oil spills: spilled oil volume, oil density, surface spreading coefficient, and ground thickness. Choosing more variables increases the time and resources needed to train the ML models and ANFIS, and the complexity of the prediction model grows with the number of input parameters.
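The importance ranking above was obtained with the R randomForest implementation; purely as an illustration, the following scikit-learn sketch computes analogous impurity-based feature importances. The data is randomly generated and stands in for the synthetic oil spill dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random placeholder data: 150 samples of the 12 initial variables (cf. Table 2) and
# one model output (e.g., the weight of OP penetrating the groundwater).
rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 12))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=150)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# Impurity-based importances, analogous to the Gini importance values in Table 2.
ranking = np.argsort(forest.feature_importances_)[::-1]
for idx in ranking[:4]:
    print(f"variable {idx}: importance = {forest.feature_importances_[idx]:.3f}")
```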
The correlation analysis between the four selected variables is presented as follows. The Pearson correlation coefficient [66,67] measures the linear relationship between two variables as the ratio of the covariance of the two variables to the product of their standard deviations (Equation (5)). It ranges in the interval from −1 to 1, where 1 shows a very strong positive correlation and −1 a very strong negative correlation. The closer the value is to 0, the weaker the relationship between the two variables.
  r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}  (5)
In order to determine the statistical significance of a correlation, a p-value can be calculated; a p-value of less than 0.05 typically indicates a statistically significant correlation. The Spearman rank correlation coefficient (Equation (6)) is computed analogously on the ranks R(x_i) and R(y_i) of the observations:
  \rho = \frac{S_{xy}}{S_x S_y} = \frac{\frac{1}{n} \sum_{i=1}^{n} (R(x_i) - \overline{R(x)})(R(y_i) - \overline{R(y)})}{\sqrt{\left(\frac{1}{n} \sum_{i=1}^{n} (R(x_i) - \overline{R(x)})^2\right) \left(\frac{1}{n} \sum_{i=1}^{n} (R(y_i) - \overline{R(y)})^2\right)}}  (6)
Table 3 presents the results of the correlation analysis between input variables (spilled oil volume, oil density, surface spreading coefficient, and ground thickness), which do not correlate.
From Table 3, we can see that the four input variables do not correlate and are suitable for prediction.
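The correlation coefficients and their p-values can be computed directly with SciPy, as in the sketch below. The input arrays are random placeholders rather than the actual synthetic dataset, and the value ranges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Placeholder samples of two input variables; the value ranges are assumptions.
rng = np.random.default_rng(1)
spilled_oil_volume = rng.uniform(1.0, 5000.0, size=150)
oil_density = rng.uniform(730.0, 990.0, size=150)

for name, test in [("Pearson", pearsonr), ("Spearman", spearmanr), ("Kendall", kendalltau)]:
    statistic, p_value = test(spilled_oil_volume, oil_density)
    # A p-value below 0.05 would indicate a statistically significant correlation.
    print(f"{name}: coefficient = {statistic:.3f}, p-value = {p_value:.3f}")
```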

4.4. ML Methods Used for Prediction

We use several ML methods in this study. Firstly, Linear Regression [68] is a model that uses a linear equation of input values (x) to predict output data (y) (see Equation (7)):
  y = b + wx  (7)
where x is the predictor (feature), y is the output value (target), w is the slope of the regression line, and b is the value of y when x = 0. The goal of the Linear Regression model is to minimize the sum of square errors (Equation (8)).
  \sum_{i=1}^{m} (y_i - w x_i)^2 \rightarrow \min  (8)
where the meanings of y_i, w, and x_i are the same as in Equation (7).
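A minimal scikit-learn sketch of fitting such a model by least squares is shown below; the toy data is generated on the spot and is not related to the oil spill dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = b + w*x with noise (unrelated to the oil spill dataset).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=(100, 1))
y = 1.5 + 0.8 * x[:, 0] + rng.normal(scale=0.2, size=100)

# Ordinary least squares minimises the sum of squared errors of Equation (8).
model = LinearRegression().fit(x, y)
print(f"estimated w = {model.coef_[0]:.3f}, estimated b = {model.intercept_:.3f}")
```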
SVR [69] is a supervised ML model, during the training of which we aim to find the “maximum-margin hyperplane”, i.e., to minimize the l2-norm of the coefficient vector (Equation (9)). SVR allows us to define the error acceptability of our model and find the right line to fit the data. The error term is constrained by the absolute error margins, called the maximum error (ε–epsilon) (Equation (10)). We can adjust the epsilon to obtain the performance we want from our model. Consequently, we should choose which is more appropriate: a more accurate prediction (but larger margin of error) or a smaller margin of error (but more prediction error) [70].
  \min \frac{1}{2} \|w\|^2  (9)
  |y_i - w_i x_i| \leq \varepsilon  (10)
This optimisation task can be formulated as follows: the minimum deviation ξ of the value outside of ε should be found as presented by Equations (11) and (12).
  \min \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} |\xi_i|  (11)
  |y_i - w_i x_i| \leq \varepsilon + |\xi_i|  (12)
where C is the hyper-parameter.
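The sketch below illustrates this formulation with scikit-learn's epsilon-insensitive SVR, where epsilon and C correspond to the margin and the slack penalty of Equations (10)-(12); the toy data and the chosen parameter values are illustrative only, not the configuration used in the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data; epsilon sets the tolerated error margin of Equation (10) and C weighs
# the slack variables xi of Equations (11) and (12).
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 10.0, size=(100, 1))
y = 1.5 + 0.8 * x[:, 0] + rng.normal(scale=0.2, size=100)

svr = SVR(kernel="linear", epsilon=0.5, C=1.0).fit(x, y)
print(f"coefficient = {svr.coef_[0][0]:.3f}, support vectors = {len(svr.support_)}")
```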
The Decision Tree [71] is developed by the recursive partitioning of the root node (i.e., the first parent) into left and right child nodes. Note that in the literature, we can find binary and n-ary Decision Trees, whereas in this research we emphasize binary Decision Trees. Consequently, the data is divided into subsets, which results in information gain (IG) [72]. This partitioning process is repeated until we achieve pure leaves, i.e., leaves in which all samples belong to a single class. To divide the nodes according to the most informative features, we have to solve an optimization task by applying the tree-learning algorithm (Equations (13) and (14)).
  IG(D_p, f) = I_E(D_p) - \left( \frac{N_{left}}{N_p} I_E(D_{left}) + \frac{N_{right}}{N_p} I_E(D_{right}) \right)  (13)
  where I_E(z_1, z_2, \ldots, z_j) = -\sum_{i=1}^{j} z_i \log_2 z_i  (14)
where f is the split feature; Dp, Dleft, and Dright are the parent, left child, and right child nodes, respectively; Np is the total number of samples in the parent node; Nleft and Nright are the numbers of samples in the left child and right child nodes, respectively; IE is the impurity measure; and z1,…, zj represent the degree to which each class is present in the child node resulting from a tree split. Consequently, the lower the impurity of the child nodes, the larger the information gain. Depending on the recursive partitioning, in the literature we can find different Decision Tree algorithms, such as ID3, C4.5 [25], and C5.0 [26]. In this research, we use the CART [27] algorithm with the Gini impurity criterion (Equation (15)), since its statistical meaning shows how often a randomly selected example of a training set would be recognised incorrectly under a certain statistical distribution.
  I_{Gini}(D) = 1 - \sum_{i=1}^{k} z_i^2  (15)
where the probability of samples belonging to class i at a given node can be denoted as zi. For more about Decision Trees, see [71].
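As a small worked illustration of Equations (13)-(15), the NumPy sketch below computes the entropy and Gini impurity of class proportions and the information gain of a split; the class counts are hypothetical and the sketch does not build a full CART tree.

```python
import numpy as np

def entropy(class_probs):
    """Impurity I_E of Equation (14); class_probs are the class proportions z_i."""
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(class_probs):
    """Gini impurity of Equation (15)."""
    p = np.asarray(class_probs, dtype=float)
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=entropy):
    """Information gain of a split (Equation (13)); each argument lists class counts."""
    parent, left, right = map(np.asarray, (parent, left, right))
    n_p, n_l, n_r = parent.sum(), left.sum(), right.sum()
    return (impurity(parent / n_p)
            - n_l / n_p * impurity(left / n_l)
            - n_r / n_p * impurity(right / n_r))

# A pure split of a balanced parent node yields the maximum gain.
print(information_gain([10, 10], [10, 0], [0, 10]))                 # 1.0 bit with entropy
print(information_gain([10, 10], [10, 0], [0, 10], impurity=gini))  # 0.5 with Gini impurity
```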
Ensembles [73] combine several ML algorithms to achieve better prediction results than applying a single ML algorithm. Based on their usage, Ensembles are divided into three main categories as follows: (1) sequential ensemble learning (boosting), where each subsequent algorithm corrects the errors of the previous algorithm, such as AdaBoost, Stochastic Gradient Boosting, etc.; (2) parallel ensemble learning (bagging), where algorithms predict independently of each other and the final prediction is the arithmetic mean of all results, such as Random Forest, Bagged Decision Trees, Extra Trees, etc.; and (3) Stacking and Blending, which combine multiple models through a meta-learner. Stacking and Blending are applied less often than the previous two approaches; however, they can be used to combine models of different types.
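To make the three categories concrete, the sketch below fits one representative of each from scikit-learn on random placeholder data; these particular estimators and hyper-parameters are illustrative choices, not the ensemble configuration used in the paper.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(150, 4))                      # placeholder inputs
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.05, size=150)

boosting = AdaBoostRegressor(random_state=0)        # sequential ensemble (boosting)
bagging = RandomForestRegressor(random_state=0)     # parallel ensemble (bagging)
stacking = StackingRegressor(                       # stacking: a meta-learner combines base models
    estimators=[("tree", DecisionTreeRegressor(random_state=0)), ("lin", LinearRegression())],
    final_estimator=LinearRegression(),
)

for name, model in [("boosting", boosting), ("bagging", bagging), ("stacking", stacking)]:
    model.fit(X, y)
    print(name, "training R2 =", round(model.score(X, y), 3))
```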
GPR [74] is based on Bayes’ Rule (Equation (16)):
  p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}  (16)
where p(w|y, X) is the posterior distribution, p(y|X, w) is the likelihood, p(w) is the prior, and p(y|X) is the marginal likelihood. The likelihood can be modelled by different distributions, such as the Bernoulli distribution [75], the Gaussian distribution [73], and the multinomial distribution [76]. In this research, we use the Gaussian distribution, i.e., GPR (Equation (17)).
  p(f_* \mid x_*, y, X) = \int p(f_* \mid x_*, w)\, p(w \mid y, X)\, dw  (17)
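The sketch below shows GPR in scikit-learn on toy one-dimensional data; the RBF-plus-noise kernel is an illustrative assumption, not necessarily the kernel used in the paper's MATLAB implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy one-dimensional data; the posterior predictive of Equation (17) yields both a
# mean and a standard deviation for every test point.
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 10.0, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=60)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True).fit(X, y)
mean, std = gpr.predict(np.array([[2.5]]), return_std=True)
print(f"prediction at x = 2.5: {mean[0]:.3f} +/- {std[0]:.3f}")
```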

4.5. Predicting Oil Spill Consequences Using ANFIS

ANFIS consists of a FIS and an ANN used to solve nonlinear problems [77]. The ANFIS reference schema for n inputs and one output is presented in Figure 4.
In ANFIS, fuzzy inference is based on the Takagi–Sugeno model incorporating fuzzy IF-THEN rules (Equation (18)). These rules consist of a consequence (i.e., an output), which is linearly dependent on a premise (i.e., an input) as follows:
  R_i: \text{if } (x_1 \text{ is } A_1^i) \text{ and } \ldots \text{ and } (x_n \text{ is } A_n^i) \text{ then } f_i = a_i^T x + b_i  (18)
where x \in \mathbb{R}^n is the vector of inputs in the premise, characterized by appropriate membership functions (MF), and (a_i, b_i) are the coefficients of the linear equation. Training ANFIS means determining the parameters belonging to the premise and consequence parts by utilising a particular optimisation algorithm.
The basic structure of ANFIS has five layers:
  • Fuzzification, during which crisp inputs are fuzzified using Gaussian MF [78] as follows (Equation (19)):
    \mu_{Gaus}(x) = e^{-\frac{(x - c)^2}{2\sigma^2}}  (19)
    where c, σ are the function coefficients which are adapted by a back-propagation algorithm during the learning process. Note that other types of MFs can also be used.
  • Evaluating the strength of rules, where for each node the strength, w_i, is provided by multiplication (Equation (20)):
    w_i = \prod_{j} \mu_{A_j^i}(x_j)  (20)
  • Normalisation, during which the rule strengths, w_i, are normalized (\bar{w}_i) (Equation (21)):
    \bar{w}_i = \frac{w_i}{\sum_i w_i}  (21)
  • Obtaining an output, fi, where the rules Ri (Equation (18)) are applied.
  • Obtaining global model response f using Equation (22):
    f = \sum_i \bar{w}_i f_i  (22)
Once the ANFIS training has been completed, the performance is determined by a specific statistical test, examples of which are presented in [79]. In this research, we use MSE (Equation (4)), R2 (Equation (3)), and NRMSE (Equation (2)).
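The paper trains ANFIS with the MATLAB Fuzzy Logic Toolbox; purely as an illustration of the five layers above, the following NumPy sketch implements a single forward pass of a first-order Takagi-Sugeno system (Equations (18)-(22)) with hand-picked parameters. Training, i.e., adapting the premise and consequent parameters, is not shown.

```python
import numpy as np

def gauss_mf(x, c, sigma):
    """Gaussian membership function (Equation (19))."""
    return np.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))

def anfis_forward(x, centers, sigmas, a, b):
    """Forward pass of a first-order Takagi-Sugeno fuzzy system (Equations (18)-(22)).

    x        -- input vector of length n
    centers  -- (n_rules, n) MF centres, one per rule and input
    sigmas   -- (n_rules, n) MF widths
    a, b     -- (n_rules, n) and (n_rules,) consequent coefficients of Equation (18)
    """
    mu = gauss_mf(x, centers, sigmas)          # Layer 1: fuzzification (Equation (19))
    w = mu.prod(axis=1)                        # Layer 2: rule firing strengths (Equation (20))
    w_bar = w / w.sum()                        # Layer 3: normalisation (Equation (21))
    f_rules = a @ x + b                        # Layer 4: rule outputs f_i (Equation (18))
    return float(w_bar @ f_rules)              # Layer 5: weighted sum (Equation (22))

# Two rules over two inputs, with hand-picked illustrative parameters.
x = np.array([0.3, 0.7])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.full((2, 2), 0.5)
a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([0.0, 0.5])
print(anfis_forward(x, centers, sigmas, a, b))
```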

4.6. The Conceptual Architecture for the Implementation of the Proposed Method

Here we present a conceptual architecture for the implementation of the proposed approach to the prediction of the consequences of oil spills using small sets of data (see Figure 5). This consists of four main components: interface; manager; synthetic data generation; and prediction. The interface component is responsible for the interaction of the user (i.e., expert) with the overall system, implementing the proposed approach to the prediction of the consequences of oil spills using small sets of data. It consists of sub-components as follows: data collection and pre-processing, which is used for data input; and prediction visualization, used to visualize the results of the prediction for the expert.
The synthetic data generation component is used to generate synthetic data from the data input. Its sub-components correspond to the description presented in Section 4.1, Section 4.2 and Section 4.3. The sub-components of selecting the main initial variables for prediction and checking variables for correlation are described in Section 4.3. The sub-component of generating synthetic data works as described in Section 4.2.
The prediction component is responsible for prediction by applying ANFIS or the other ML methods. It works as described in Section 4.4 and Section 4.5 and Figure 1.
A manager is used to initiate all processes and tasks for the prediction of the consequences of oil spills using small sets of data.

5. Experimenting and Results

For the implementation of the proposed approach, we used MATLAB R2021b with the Statistics and Machine Learning Toolbox. The ANFIS model was developed using the MATLAB Fuzzy Logic Toolbox. For the objectivity of the final results, we used the default parameter settings of the MATLAB Statistics and Machine Learning Toolbox.
First, according to the presented approach to the prediction of the consequences of oil spills using small sets of data, a mathematical model was employed to generate an initial dataset of 150 input values (see Section 4.1). This input was generated based on the selected input parameters (in Section 4.3): (1) spilled oil volume, (2) oil density, (3) spreading coefficient over the surface, and (4) ground thickness (presented in Table 4).

5.1. Application of ML Methods

With the MATLAB ML Toolbox module, the algorithms of Linear Regression, Decision Trees, SVR, Ensembles, and GPR (described in Section 4.4) were used for predicting the consequences of oil spills on the ground with the described dataset. A cross-validation approach was used to improve the training of the applied ML algorithms (see Figure 1).
After training, the prediction quality of the algorithms was assessed with MSE (Equation (4)), R2 (Equation (3)), and NRMSE (Equation (2)). The following results were obtained: Linear Regression (MSE = 0.4; R2 = 0.45; NRMSE = 14.2%); SVR (MSE = 0.5; R2 = 0.54; NRMSE = 19.1%); Decision Trees (MSE = 0.3; R2 = 0.32; NRMSE = 15.8%); Ensembles (MSE = 0.3; R2 = 0.14; NRMSE = 17.7%); and GPR (MSE = 0.1; R2 = 0.95; NRMSE = 4.3%). Among the applied ML algorithms, GPR showed the best statistical measures.
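Outside MATLAB, an analogous comparison can be set up with scikit-learn, as in the sketch below; the data is a random placeholder for the synthetic dataset, and the models use default hyper-parameters, mirroring the paper's use of default toolbox settings. The results of this sketch therefore do not reproduce the figures reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data: 150 samples of the four selected inputs and one output.
rng = np.random.default_rng(5)
X = rng.uniform(size=(150, 4))
y = 3.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.05, size=150)

models = {
    "Linear Regression": LinearRegression(),
    "SVR": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Ensemble (bagging)": RandomForestRegressor(random_state=0),
    "GPR": GaussianProcessRegressor(normalize_y=True),
}

# Five-fold cross-validation, analogous to the cross-validation step in Figure 1.
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R2 = {r2:.3f}")
```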

5.2. ANFIS Results

The application of ANFIS for the fuzzy prediction of the consequences of oil spills on the ground is presented in Figure 6. The initial parameters for the ANFIS model were as follows: (1) Gaussian MF; (2) 5 linguistic terms; and (3) 30 learning epochs (also see [80]).
In Layer 1, the fuzzification of the input data is performed and the MFs are obtained based on Equation (19). The input data is partitioned into five terms: “Very High”, “High”, “Moderate”, “Low”, and “Very Low” (Table 4), and is transformed into Gaussian MFs represented by μGaus (Equation (19)).
The MFs of ground thickness (Input 1) and the corresponding intervals of the five linguistic terms (Table 4) are presented in Figure 7. The MFs of the other input variables are not shown, since their graphical representations look similar to Figure 7 and only the interval values differ. This is because we used the Gaussian MF for all inputs.
In Layer 2, a set of initial fuzzy rules is formed from the combinatorial variants of the linguistic terms based on Equation (18); with five terms for each of the four inputs, this results in 5⁴ = 625 fuzzy rules. Every node in this layer is fixed, or non-adaptive. In Layer 3, the normalization of the rule strengths is performed based on Equation (21). Each node in Layer 3 is also fixed. Fuzzy rule examples are presented in Table 5.
Using the trained ANFIS, we obtained the prediction surfaces, two of which are presented in Figure 8. Figure 8a illustrates how the penetration of the oil spill (the penetrated OP) depends on the ground thickness (Input 1) and the spilled oil volume (Input 2).
Figure 8b presents how the penetration of the oil spill (the penetrated OP) depends on the spilled oil volume (input 2) and the oil density (input 4). The penetration of the oil spill (the weight of the OP that penetrates into the groundwater in kg–an output) depends on the oil density, i.e., the same volume of spilled oil spreads differently on the surface depending on its density. At the same time, the more OP spreads over the surface, the less depth it penetrates into the ground. Eventually, when a huge (i.e., “high”) volume of oil spills (~5000 m3), it will pass into the groundwater.
The defuzzification and application of fuzzy rules are performed in Layer 4 (see Equation (18)), where every node is adapted based on those rules. The global model response is obtained in Layer 5 using Equation (22).
The output of the oil spill prediction falls into five categories as follows: “Very High” [9.2 × 10⁶; 1.5 × 10⁵), “High” (1.4 × 10⁵; −6.2 × 10⁶), “Moderate” (−6.1 × 10⁶; −1.3 × 10⁷), “Low” (−1.2 × 10⁷; −2.1 × 10⁷), and “Very Low” (−2.0 × 10⁷; −2.9 × 10⁷].

5.3. Evaluation of Results

The results of the prediction of oil spill penetration into groundwater using the chosen ML algorithms and the proposed ANFIS-based approach are shown in Table 6. The proposed ANFIS-based approach for the prediction of the consequences of oil spills using small sets of data showed the lowest NRMSE (1.0%). It also had a better oil penetration predictive performance in comparison with the analysed ML algorithms (i.e., Linear Regression, Decision Trees, SVR, Ensembles, and GPR). However, the GPR method also showed sufficiently accurate results. Thus, the proposed ANFIS-based approach can be considered an efficient model for predicting oil penetration into groundwater, which can be further utilised with more data and other oil spill cases.

6. Discussion

Predictions with fuzzy inference and ML methods are now widely used in various fields of application, including the prediction of the impact of oil spills on the ground environment [13]. Although oil spills in the geological environment are not as large as oil spills on water, the entire consequences of such oil spills on the ground and their scale are not immediately visible due to the slow and gradual penetration of oil into the ground and groundwater. When oil penetrates groundwater, it can lead to massive pollution. Therefore, it is vital to predict oil spill penetration into groundwater in order to assess and evaluate the possible magnitude of the oil spill and prevent catastrophic consequences. However, given the peculiarities and characteristics of the oil spill process on the ground (see [13]), there is only a small amount of data on real oil spills.
Consequently, we require an approach to attaining more data to apply fuzzy inference and ML methods for the prediction of the consequences of oil spills. As the results of related works and analyses show, various approaches to solving the issue of small datasets can be found in the literature. They are as follows: generating surrogate data, unsupervised learning, interpolation, and collecting data from users. We have found that generating surrogate data for the prediction of the consequences of oil spills is the most suitable for this research. On this basis, a mathematical model describing oil spill penetration into the groundwater process was developed [24], and necessary surrogate data was generated from existing data collected from real world oil spill accidents. Furthermore, we proposed the ANFIS-based approach for the prediction of the consequences of oil spills using small sets of data and compared its predictive performance with several ML algorithms using the same set of data.
The performance and effectiveness of the proposed ANFIS-based approach were compared with the chosen ML algorithms (Linear Regression, Decision Trees, SVR, Ensembles, and GPR) using the statistical measures of R2, NRMSE, and MSE. These measures showed that the proposed ANFIS-based approach for the prediction of the consequences of oil spills using small sets of data and the GPR algorithm achieved the best predictive performance and effectiveness. This result can be explained by the generalisation capability of ANFIS [81] and its stronger ability to deal with inaccurate and fuzzy input data for prediction than models based on mathematical equations [24].
We also achieved better prediction results with Linear Regression (NRMSE = 14.2%) than with Decision Trees (NRMSE = 15.8%) and Ensembles (NRMSE = 17.7%), but this difference was not significant. The better statistical measures of the Linear Regression algorithm can be explained by the small dataset (with low noise), on which Linear Regression outperforms Decision Trees and Ensembles. Moreover, as can be seen in Figure 8b, the oil spill penetration into the ground surface depends on the volume of spilled oil (Input 2) and the oil density (Input 4), i.e., at certain values of spilled oil volume and oil density, the spilled oil spreads more over the surface and does not penetrate into the ground. Therefore, we observed a curved oil spill penetration surface, which is consistent with real oil spill cases. However, the Decision Tree and Ensemble algorithms treat this curvature as an outlier, which they correct or eliminate. In contrast, the proposed ANFIS-based approach copes with this curvature. Consequently, its prediction performance is much better.
Summing up, the proposed ANFIS-based approach for predicting the consequences of oil spills using small sets of data can be considered a model that produces an efficient prediction with a small dataset. We have shown that none of the analysed ML algorithms gives a predictive performance comparable to that of ANFIS, which combines FIS and ANN, and that the proposed ANFIS-based approach performed the best. These results show that the inclusion of fuzzy sets in the prediction algorithm allows us to flexibly describe the oil spill process and, as a result, obtain more accurate prediction results. Therefore, we assume that if the GPR algorithm were extended to include fuzzy sets, it could produce more accurate results. Moreover, we can assume that the proposed ANFIS-based approach will perform better than other algorithms with small datasets in general. However, in order to verify this hypothesis, we need additional research with small datasets from various application areas. As such, we leave these tasks for future work.
In this vein, this paper has several limitations which at the same time provide future research opportunities. We will always have limitations associated with the quality of data, model configuration, algorithm assumptions, etc. With that in mind, it is possible to develop a model using better quality field data to help improve the performance and reliability of the prediction. In future work, we intend to extend the set of analysed fuzzy inference and ML methods and conduct new and more complex experiments that allow us to explore in more detail the prediction capabilities and to choose the most suitable approaches for predicting the consequences of oil spills.
In addition, we plan to analyze the efficiency of the methods used for the prediction of the consequences of oil spills. This study will allow us to determine which prediction approach is most appropriate under certain conditions. For example: Which approach is more suitable for devices with limited resources? Which approach has less computational complexity? Which approach takes the least time to achieve a prediction?
Summing up, the presented research has proposed a better understanding of the application of fuzzy inference and ML methods for prediction with small datasets. The obtained results show that ANFIS is suitable for prediction with a small dataset and provides sufficiently accurate prediction results. Moreover, it is useful for further research.

7. Conclusions

The analysis of the related works on the application of fuzzy inference and ML methods when predicting with small datasets showed that the small dataset prediction issue exists in different research areas and tends to be solved in similar ways: (1) most commonly, using the surrogate data generation approach; (2) applying unsupervised learning; (3) using an interpolation approach for data augmentation; (4) selecting useful predictive features of data samples; (5) collecting data from users; and (6) using convolutional networks that naturally operate on an input of any size. The analysis of these approaches and of the environment for the prediction of the consequences of oil spills on the ground showed that the most suitable approach is generating a surrogate dataset, since the process of an oil spill can be specified using a mathematical model. Moreover, applying fuzzy logic allows us to flexibly describe the oil spill process.
The proposed approach for the prediction of the consequences of oil spills using small sets of data allows us to investigate the suitability of ML algorithms (Linear Regression, Decision Trees, SVR, Ensembles, and GPR) and fuzzy inference methods to predict the consequences of oil spills using small sets of data. A small dataset was enlarged with the synthetic data generated by the mathematical model, and the same generated dataset was used for experimentation with fuzzy inference and ML methods for prediction. The main advantage and contribution of the proposed approach is that it allows us to perform a statistical comparison of the application of the fuzzy inference and ML methods for predicting the consequences of oil spills using small sets of data.
The advantage and contribution of the current research is the proposed conceptual architecture for predicting the consequences of oil spills using small datasets, which could be adapted for users’ needs by adding new prediction methods and techniques and using different implementation tools.
The obtained results showed that the proposed ANFIS-based approach (MSE = 0.12; R2 = 0.95; NRMSE = 4.3%) has the best prediction performance results. Consequently, in the case of small datasets, it is suitable and can be successfully applied for predicting the consequences of oil spills.
In future works, we will increase the number of prediction experiments with various oil spill datasets. We plan to expand and use our proposed approach to analyse and predict oil product migration with groundwater. During these experiments, we produced a large set of fuzzy rules, and a standard personal computer does not have enough computing resources for high-quality, large-scale training and prediction of oil spill consequences. Therefore, in the future, we plan to develop a method for transferring the proposed ANFIS-based approach to a cloud server. Another important direction is developing an approach for tuning and optimising fuzzy rules.

Author Contributions

Conceptualization, D.K.; methodology, A.B. and D.K.; software, A.B.; validation, A.B. and D.K.; formal analysis, A.B. and D.K.; investigation, A.B. and D.K.; resources, A.B.; data curation, A.B.; writing—original draft preparation, A.B. and D.K.; writing—review and editing, A.B. and D.K.; visualization, A.B. and D.K.; supervision, D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Z.; Yao, H.; Ma, F. Learning with Small Data. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; Association for Computing Machinery: New York, NY, USA, 2020. [Google Scholar]
  2. Papageorgiou, E.I.; Aggelopoulou, K.; Gemtos, T.A.; Nanos, G.D. Development and evaluation of a fuzzy inference system and a neuro-fuzzy inference system for grading apple quality. Appl. Artif. Intell. 2018, 32, 253–280. [Google Scholar] [CrossRef]
  3. Azmy, S.B.; Sneineh, R.A.; Zorba, N.; Hassanein, H.S. Small data in IoT: An MCS perspective. In Performability in Internet of Things; Springer: Cham, Switzerland, 2019; pp. 209–229. [Google Scholar]
  4. Sabay, A.; Harris, L.; Bejugama, V.; Jaceldo-Siegl, K. Overcoming Small Data Limitations in Heart Disease Prediction by Using Surrogate Data. SMU Data Sci. Rev. 2018, 1, 12. [Google Scholar]
  5. Chen, R.J.; Lu, M.Y.; Chen, T.Y.; Williamson, D.F.; Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 2021, 5, 493–497. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, Y.; Chen, J.; Lu, H. Predicting Future Event via Small Data (e.g., 4 Data) by ASF and Curve Fitting Methods. In Proceedings of the 11th International Conference on ICICIP IEEE 2021, Dali, China, 3–7 December 2021. [Google Scholar]
  7. Suwa, M.; Watanabe, Y.; Ikeda, H.; Matsuoka, H.; Suzuki, T.A. Model for Predicting River Flooding Using Relatively Small Data Sets. AGU Fall Meet. Abstr. 2018, 17, H43J-2603. [Google Scholar]
  8. Burmakova, A.; Kalibatienė, D. Machine learning vs fuzzy inference methods for predicting the oil spill consequences with small data sets. In Proceedings of the Data Analysis Methods for Software Systems, Druskininkai, Lithuania, 2–4 December 2021. [Google Scholar]
  9. Mohammadiun, S.; Hu, G.; Gharahbagh, A.A.; Li, J.; Hewage, K.; Sadiq, R. Evaluation of machine learning techniques to select marine oil spill response methods under small-sized dataset conditions. J. Hazard. Mater. 2022, 436, 129282. [Google Scholar] [CrossRef] [PubMed]
  10. Kamath, C.; Fan, Y.J. Regression with small data sets: A case study using code surrogates in additive manufacturing. Knowl. Inf. Syst. 2018, 57, 475–493. [Google Scholar] [CrossRef]
  11. Sakizadeh, M.; Rahmatinia, H. Statistical learning methods for classification and prediction of groundwater quality using a small data record. Int. J. Agric. Environ. Inf. Syst. (IJAEIS) 2017, 8, 37–53. [Google Scholar] [CrossRef]
  12. Zhao, L.; Shang, Z.; Qin, A.; Tang, Y.Y. Siamese Dense Neural Network for Software Defect Prediction with Small Data. IEEE Access 2019, 7, 7663–7677. [Google Scholar] [CrossRef]
  13. Kalibatiene, D.; Burmakova, A. Fuzzy Model for Predicting Contamination of the Geological Environment during an Accidental Oil Spill. IJFS Int. J. Fuzzy Syst. 2021, 24, 425–439. [Google Scholar] [CrossRef]
  14. Jiao, Z.; Jia, G.; Cai, Y. A new approach to oil spill detection that combines deep learning with unmanned aerial vehicles. Comput. Ind. Eng. 2019, 135, 1300–1311. [Google Scholar] [CrossRef]
  15. Mohammadiun, S.; Hu, G.; Gharahbagh, A.A.; Mirshahi, R.; Li, J.; Hewage, K.; Sadiq, R. Optimization of integrated fuzzy decision tree and regression models for selection of oil spill response method in the Arctic. Knowl.-Based Syst. 2021, 213, 106676. [Google Scholar] [CrossRef]
  16. Sajid, Z.; Khan, F.; Veitch, B. Dynamic ecological risk modelling of hydrocarbon release scenarios in Arctic waters. Mar. Pollut. Bull. 2020, 153, 111001. [Google Scholar] [CrossRef] [PubMed]
  17. Cherednichenko, O.; Yanholenko, O.; Vovk, M.; Tkachenko, V. Formal Modeling of Decision-Making Processes under Transboundary Emergency Conditions. Data-Cent. Bus. Appl. 2020, 42, 141–162. [Google Scholar]
  18. Lourenzutti, R.; Krohling, R.A. A generalized TOPSIS method for group decision making with heterogeneous information in a dynamic environment. Inf. Sci. 2016, 330, 1–18. [Google Scholar] [CrossRef]
  19. Akyuz, E.; Celik, E. A quantitative risk analysis by using interval type-2 fuzzy FMEA approach: The case of oil spill. Marit. Policy Manag. 2018, 45, 979–994. [Google Scholar] [CrossRef]
  20. Yu, Z.; Ye, S.; Sun, Y.; Zhao, H.; Feng, X.Q. Deep learning method for predicting the mechanical properties of aluminum alloys with small data sets. Mater. Today Commun. 2021, 28, 102570. [Google Scholar] [CrossRef]
  21. Karaboga, D.; Kaya, E. Adaptive network based fuzzy inference system (ANFIS) training approaches: A comprehensive survey. Artif. Intell. Rev. 2019, 52, 2263–2293. [Google Scholar] [CrossRef]
  22. Al-Mahasneh, M.; Aljarrah, M.; Rababah, T.; Alu’datt, M. Application of hybrid neural fuzzy system (ANFIS) in food processing and technology. Food Eng. Rev. 2016, 8, 351–366. [Google Scholar] [CrossRef]
  23. Elsisi, M.; Tran, M.Q.; Mahmoud, K.; Lehtonen, M.; Darwish, M.M. Robust design of ANFIS-based blade pitch controller for wind energy conversion systems against wind speed fluctuations. IEEE Access 2021, 9, 37894–37904. [Google Scholar] [CrossRef]
  24. Kalibatiene, D.; Burmakova, A.; Smelov, V. On Knowledge-Based Forecasting Approach for Predicting the Effects of Oil Spills on the Ground. Digit. Transform. 2020, 4, 44–56. [Google Scholar] [CrossRef]
  25. Hssina, B.; Merbouha, A.; Ezzikouri, H.; Erritali, M. A comparative study of decision tree ID3 and C4. Int. J. Adv. Comput. Sci. Appl. 2014, 4, 13–19. [Google Scholar]
  26. Pandya, R.; Pandya, J. C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. Int. J. Comput. Appl. 2015, 117, 18–21. [Google Scholar]
  27. Lewis, R.J. An introduction to classification and regression tree (CART) analysis. In Proceedings of the Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, CA, USA, 22–25 May 2000. [Google Scholar]
  28. Hu, G.; Mohammadiun, S.; Gharahbagh, A.A.; Li, J.; Hewage, K.; Sadiq, R. Selection of oil spill response method in Arctic offshore waters: A fuzzy decision tree-based framework. Mar. Pollut. Bull. 2020, 161, 111705. [Google Scholar] [CrossRef] [PubMed]
  29. Zhao, Q.; Wang, J. Disaster Chain Scenarios Evolutionary Analysis and Simulation Based on Fuzzy Petri Net: A Case Study on Marine Oil Spill Disaster. IEEE Access 2019, 7, 183010–183023. [Google Scholar] [CrossRef]
  30. Feng, D.; Passalacqua, P.; Hodges, B.R. Innovative Approaches for Geometric Uncertainty Quantification in an Operational Oil Spill Modeling System. JMSE J. Mar. Sci. Eng. 2019, 7, 259. [Google Scholar] [CrossRef]
  31. Hoblitzell, A.; Babbar-Sebens, M.; Mukhopadhyay, S. Machine Learning with Small Data for User Modeling of Watershed Stakeholders Engaged in Interactive Optimization. In Proceedings of the 2nd International Conference on Computer Science and Artificial Intelligence, Shenzhen, China, 8 December 2018; Association for Computing Machinery: New York, NY, USA, 2018. [Google Scholar]
  32. Russel, S.; Norvig, P. Artificial Intelligence. A Modern Approach, 3rd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2012; pp. 30–86. [Google Scholar]
  33. Kumar, R.; Verma, R. Classification algorithms for data mining: A survey. Int. J. Inf. Educ. Technol. 2012, 1, 7–14. [Google Scholar]
  34. Yuan, G.X.; Ho, C.H.; Lin, C.J. Recent advances of large-scale linear classification. Proc. IEEE 2012, 100, 2584–2603. [Google Scholar] [CrossRef]
  35. Cortes, C.; Vapnik, V.; Saitta, L. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  36. Fix, E.; Hodges, J.L. Nonparametric Discrimination. Consistency Properties; International Statistical Institute (ISI): Randolph Field, TX, USA, 1951; Volume 1, pp. 21–49. [Google Scholar]
  37. Wu, X.; Kumar, V.; Ross, Q.J.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.; Liu, B.; Yu, P.S.; et al. Top 10 algorithms in data mining. Knowl. Inf. Syst. 2007, 14, 1–37. [Google Scholar] [CrossRef]
  38. Freedman, D.A. Statistical Models: Theory and Practice; Cambridge University Press: Cambridge, UK, 2009; pp. 41–72. [Google Scholar]
  39. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, UK, 2006; pp. 7–30. [Google Scholar]
  40. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: New York, NY, USA, 1999; pp. 69–99. [Google Scholar]
  41. Tolles, J.; Meurer, W.J. Logistic Regression: Relating Patient Characteristics to Outcomes. JAMA 2016, 316, 533–534. [Google Scholar] [CrossRef]
  42. Janosi, A.; Steinbrunn, W.; Pfisterer, M.; Detrano, R. The UCI machine Learning Repository Online. Available online: http://archive.ics.uci.edu/ml/datasets/Heart+Disease (accessed on 20 June 2022).
  43. Piryonesi, S.M.; El-Diraby, T.E. Data Analytics in Asset Management: Cost-Effective Prediction of the Pavement Condition Index. J. Infrastruct. Syst. 2019, 26, 04019036. [Google Scholar] [CrossRef]
  44. Gong, H.F.; Chen, Z.S.; Zhu, Q.X.; He, Y.L. A Monte Carlo and PSO based virtual sample generation method for enhancing the energy prediction and energy optimization on small data problem: An empirical study of petrochemical industries. Appl. Energy 2017, 197, 405–415. [Google Scholar] [CrossRef]
  45. Cubuk, E.D.; Sendek, A.D.; Reed, E.J. Screening billions of candidates for solid lithium-ion conductors: A transfer learning approach for small data. J. Chem. Phys. 2019, 150, 214701. [Google Scholar] [CrossRef] [PubMed]
  46. He, Y.L.; Wang, P.J.; Zhang, M.Q.; Zhu, Q.X.; Xu, Y. A novel and effective nonlinear interpolation virtual sample generation method for enhancing energy prediction and analysis on small data problem: A case study of Ethylene industry. Energy 2018, 147, 418–427. [Google Scholar] [CrossRef]
  47. Drechsler, R.; Huhn, S.; Plump, C. Combining Machine Learning and Formal Techniques for Small Data Applications-A Framework to Explore New Structural Materials. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD), Kranj, Slovenia, 26–28 August 2020; Volume 1, pp. 518–525. [Google Scholar]
  48. Baldominos, A.; Ogul, H.; Colomo-Palacios, R. Infection diagnosis using biomedical signals in small data scenarios. In Proceedings of the 32nd International Symposium on Computer-Based Medical Systems, Cordoba, Spain, 5–7 June 2019. [Google Scholar]
  49. Micallef, L.; Sundin, I.; Marttinen, P.; Ammad-ud-din, M.; Peltola, T.; Soare, M.; Jacucci, G.; Kaski, S. Interactive Elicitation of Knowledge on Feature Relevance Improves Predictions in Small Data Sets. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, 13–16 March 2017; Association for Computing Machinery: New York, NY, USA, 2017. [Google Scholar]
  50. Shaikhina, T.; Khovanova, N.A. Handling limited datasets with neural networks in medical applications: A small-data approach. Artif. Intell. Med. 2017, 75, 51–63. [Google Scholar] [CrossRef]
  51. Li, Y.; Yang, X.; Ye, Y.; Cui, L.; Jia, B.; Jiang, Z.; Wang, S. Detection of oil spill through fully convolutional network. In Proceedings of the International Conference on Geo-Spatial Knowledge and Intelligence, Chiang Mai, Thailand, 8–10 December 2017. [Google Scholar]
  52. Li, Y.; Lyu, X.; Frery, A.C.; Ren, P. Oil Spill Detection with Multiscale Conditional Adversarial Networks with Small-Data Training. Remote Sens. 2021, 13, 2378. [Google Scholar] [CrossRef]
  53. Qin Ouyang, X.; Chen, Y.P.; Wei, B.H.; Mosic, D. Experimental Study on Class Imbalance Problem Using an Oil Spill Training Data Set. Br. J. Math. Comput. Sci. 2017, 2, 1–9. [Google Scholar] [CrossRef]
  54. Mills, P. Efficient statistical classification of satellite measurements. Int. J. Remote Sens. 2012, 32, 6109–6132. [Google Scholar] [CrossRef]
  55. Powers, D.M.; Ailab, W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. IJMLT 2020, 2, 37–63. [Google Scholar]
  56. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  57. Teke, A.; Yildirim, H.B.; Çelik, Ö. Evaluation and performance comparison of different models for the estimation of solar radiation. Renew. Sustain. Energy Rev. 2015, 50, 1097–1107. [Google Scholar] [CrossRef]
  58. Jadhav, A.; Pramod, D. Comparison of performance of data imputation methods for numeric dataset. Appl. Artif. Intell. 2019, 33, 913–933. [Google Scholar] [CrossRef]
  59. Ranković, V.; Radulović, J.; Radojević, I.; Ostojić, A.; Čomić, L. Neural network modeling of dissolved oxygen in the Gruža reservoir. Ecol. Model. 2010, 221, 1239–1244. [Google Scholar] [CrossRef]
  60. Sammut, C.; Webb, G. Encyclopedia of Machine Learning; Springer: Boston, MA, USA, 2011; pp. 150–207. [Google Scholar]
  61. Putka, D.J.; Beatty, A.S.; Reeder, M.C. Modern prediction methods: New perspectives on a common problem. Organ. Res. Methods 2018, 21, 689–732. [Google Scholar] [CrossRef]
  62. Duch, W. Filter Methods; Springer: Heidelberg/Berlin, Germany, 2006; pp. 89–117. [Google Scholar]
  63. Cherrington, M.; Thabtah, F.; Lu, J.; Xu, Q. Feature selection: Filter methods performance challenges. In Proceedings of the International Conference on Computer and Information Sciences, Jouf University, Aljouf, Saudi Arabia, 3–4 April 2019. [Google Scholar]
  64. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2012, 1, 483–519. [Google Scholar] [CrossRef]
  65. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. Feature selection for high-dimensional data. PAI 2016, 5, 65–75. [Google Scholar] [CrossRef]
  66. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson Correlation Coefficient. Springer Topics in Signal Processing, 2nd ed.; Springer: Heidelberg/Berlin, Germany, 2009; pp. 1–4. [Google Scholar]
  67. Ardil, C.; Pashaev, A.M.; Sadiqov, R.A.; Abdullayev, P. Multiple Criteria Decision-Making Analysis for Selecting and Evaluating Fighter Aircraft. Int. J. Transp. Veh. Eng. 2021, 13, 683–694. [Google Scholar]
  68. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009; pp. 1–175. [Google Scholar]
  69. Awad, M.; Khanna, R. Support vector regression. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 67–80. [Google Scholar]
  70. Karimi, K.; Hamilton, H.J. Generation and interpretation of temporal decision rules. arXiv 2010, arXiv:1004.3334. [Google Scholar]
  71. Jadhav, S.; He, H.; Jenkins, K. Information gain directed genetic algorithm wrapper feature selection for credit rating. Appl. Soft Comput. 2018, 69, 541–553. [Google Scholar] [CrossRef]
  72. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms; CRC Press: Cambridge, UK, 2012; pp. 69–85. [Google Scholar]
  73. Khatri, C. Classical statistical analysis based on a certain multivariate complex Gaussian distribution. Ann. Math. Stat. 1965, 36, 98–114. [Google Scholar] [CrossRef]
  74. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003; pp. 54–75. [Google Scholar]
  75. Koizumi, D. On the Prediction of a Nonstationary Bernoulli Distribution based on Bayes Decision Theory. ICAART 2021, 2, 957–965. [Google Scholar]
  76. Bloch, D.A.; Watson, G.S. A Bayesian study of the multinomial distribution. Ann. Math. Stat. 1967, 38, 1423–1435. [Google Scholar] [CrossRef]
  77. Jang, J.-S. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685. [Google Scholar] [CrossRef]
  78. Choi, B.I.; Rhee, F.C.H. Interval type-2 fuzzy membership function generation methods for pattern recognition. Inf. Sci. 2009, 179, 2102–2122. [Google Scholar] [CrossRef]
  79. Willmott, C.; Matsuura, K. Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  80. Burmakova, A.; Kalibatiene, D. An ANFIS-based Model to Predict the Oil Spill Consequences on the Ground. In Proceedings of the IEEE Open Conference of Electrical, Electronic and Information Sciences (eStream), Vilnius, Lithuania, 22 April 2021. [Google Scholar]
  81. Fathipour-Azar, H. Machine learning-assisted distinct element model calibration: ANFIS, SVM, GPR, and MARS approaches. Acta Geotech. 2021, 17, 1207–1217. [Google Scholar] [CrossRef]
Figure 1. The schema of the approach for the prediction of the consequences of oil spills on the ground environment.
Figure 2. The visualization of the mathematical model for predicting the impact of an oil spill on the ground environment.
Figure 3. The ranking of variables for prediction (#1, #2, …, #12 denote the variable numbers).
Figure 4. The basic structure of ANFIS.
Figure 5. The conceptual architecture of the proposed approach to the prediction of the consequences of oil spills using small sets of data.
Figure 6. The proposed ANFIS network structure for oil spill prediction.
Figure 7. The MFs of ground thickness.
Figure 8. The results of the prediction of oil penetration with ANFIS.
Table 1. Comparison of approaches, methods, and techniques used in ML to solve issues caused by small datasets.
| References | Domain of Interest | Solution for Small Data Issue | Classification/Regression | Methods/Techniques Used | Tools Used | Dataset for Experiment |
|---|---|---|---|---|---|---|
| (1) | (2) | (3) | (4) | (5) | (6) | (7) |
| [4] | Prediction of heart disease | Surrogate data generation from the characteristics of original observations | Regression | Neural network (NN), Logistic Regression, Decision Tree and Random Forest | Synthpop package in R | Heart disease data from the UCI Repository [42], 14 dataset variables, 297 records |
| [10] | Additive manufacturing | Surrogate data generation from the two physical models | Regression | Regression Trees, SVR, kernel regression, multivariate adaptive regression, GPR; GPR performed the best | Analysis of variance (ANOVA) | Three datasets: for Eagar–Tsai-100, four input parameters and 100 samples; for Eagar–Tsai-462, four input parameters and random sampling; for Verhaeghe-41, 41 data points consisting of the four inputs |
| [11] | Pollution of groundwater | Cluster analysis | Classification | Clusters, SVM and NN | SVM with polynomial and RBF kernel methods | 14 groundwater quality variables gathered from 27 groundwater samples |
| [12] | Software defect prediction (SDP) | DNN model with integrated similarity feature learning and distance metric learning | Classification | Siamese Dense neural networks (SDNN) | TensorFlow, Keras and MATLAB | 10 datasets from the NASA repository from 87 to 2032 instances |
| [31] | Modelling of watershed stakeholders engaged | Integrating user models with limited data | Regression | NN | ANFIS, MATLAB | Known data on the physical and chemical properties of soil and aquatic environment (datasets of 25, 50, 100 and 360 samples) |
| [43] | Material design | ML model training using only elementary descriptors on the same dataset | Classification | Linear SVM with leave-one-out cross-validation | NA | Materials Project database |
| [44] | The energy prediction and optimisation of petrochemical systems | Virtual sample generation from the underlying information of the small dataset | Regression | The Monte Carlo, Particle Swarm Optimisation, and extreme learning machine (ELM) algorithms | NA | Two real-world cases of the petroleum production process |
| [45] | Prediction of the mechanical properties of aluminium alloys | DNN model pre-training and tuning its parameters | Regression | DNN with gradient descent optimisation | TensorFlow | Data of mechanical properties of aluminium alloys |
| [46] | Enhancing energy prediction | Virtual sample generation through nonlinear interpolation | Regression | Nonlinear interpolation, ELM, non-linear interpolation-based virtual sample generation | NA | 50 production data items from Chinese plants between 2011–2013; five input variables |
| [47] | Exploring new structural materials | Extracting relevant properties from a high-dimensional small training dataset | Regression | Kernel Regression-based Learning, Kernel Recursive Least Squares, expert knowledge | C++, dlib, MongoDB | Tensile test specimen and spherical micro samples; the training data of grid points, i.e., fully classified structural materials of 6500 data points; 15 kernel functions |
| [48] | Infection diagnosis | Aggregating existing measurements to generate required features | Classification | Decision support system based on an ensemble of Decision Trees, k-nearest neighbours, logistic regression, multi-layer perceptron, SVM | NA | Real dataset of 60 patients for infections; three variables |
| [49] | Interactive knowledge extraction based on feature relevance | Interactive visualisation for extracting tacit prior knowledge | Classification, Regression | Interactive visualisation (IVis), user model | Python Natural Language Toolkit, Python Rake and KPMiner | The user's knowledge about feature relevance from 162 scientific documents; 457 unique keywords that were used as features |
| [50] | Medical applications | Surrogate data generated using statistical features of the original dataset | Regression | NN | NA | 56 samples; 5 input parameters |
| [51] | Marine oil spill detection from images | The ResNet model for extracting the feature maps with the input data | Classification | Convolutional NN (CNN) | FCN-GoogLeNet and FCN-ResNet models | 20 oil spill images |
| [52] | Oil spill detection | Generating an oil spill detection map from the observed image characteristics | Classification | Multiscale conditional adversarial network | NA | Four oil spill image pairs (size of 256 × 256 pixels) |
| [53] | Oil spill detection | Generating new examples applying the Mega-Trend Diffusion function, intelligent over-sampling methods | Classification | Unsupervised algorithm SOM | MATLAB, LIBSVM | Pictures of real oil spills compared to pictures of fake spills |
Table 2. An initial set of variables for predicting the impact of an oil spill on the ground environment.
| Rank | Variables, Units | Min. | Max. | Importance (Gini Impurity) |
|---|---|---|---|---|
| 1 | Spilled oil volume, m³ | 10 | 1000 | 10.69 |
| 2 | Surface spreading coefficient, m⁻¹ | 5 | 30 | 6.51 |
| 3 | Ground type | 4 possible values: sand, sandy loam, loam, clay | | 6.17 |
| 4 | Ground thickness, m | 3 | 5 | 5.41 |
| 5 | Time after the spill, days | 1 | 10 | 4.63 |
| 6 | Oil density, kg/m³ | 750 | 930 | 4.54 |
| 7 | Depth of groundwater, m | 3.15 | 5.35 | 4.32 |
| 8 | Ground temperature, °C | 3 | 11 | 4.15 |
| 9 | Soil moisture | 0.08 | 0.2 | 3.92 |
| 10 | Terrain relief, m | 130 | 170 | 3.9 |
| 11 | Air temperature, °C | 5 | 25 | 3.68 |
| 12 | Ground moisture | 0.18 | 0.46 | 3.49 |
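As an illustration of how an impurity-based ranking such as the one in Table 2 can be produced, the sketch below uses the feature importances of a scikit-learn random forest on placeholder synthetic data. The variable subset, the data values, and the choice of a random forest are assumptions made for illustration only; they do not reproduce the scores reported in the table.

```python
# Illustrative sketch only: impurity-based feature ranking with scikit-learn.
# X, y and feature_names are hypothetical placeholders, not the study's dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["spilled_oil_volume", "surface_spreading_coeff", "ground_type",
                 "ground_thickness", "time_after_spill", "oil_density"]
X = rng.uniform(size=(300, len(feature_names)))                   # placeholder inputs
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)  # placeholder target

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, score in sorted(zip(feature_names, model.feature_importances_),
                          key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")   # mean decrease in impurity per feature
```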
Table 3. Correlations among variables (Pearson) *.
| | Surface Spreading Coefficient | Ground Thickness | Oil Density | Spilled Oil Volume |
|---|---|---|---|---|
| Surface spreading coefficient | 1 | 0 | 0 | 0 |
| Ground thickness | 0 | 1 | 0 | 0 |
| Oil density | 0 | 0 | 1 | 0 |
| Spilled oil volume | 0 | 0 | 0 | 1 |
* The Pearson correlation coefficient is used because the variables are continuous. The off-diagonal coefficients are zero because the small dataset used as the prediction input contains clean data without noise or outliers.
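For completeness, the following minimal sketch shows how a Pearson correlation matrix such as Table 3 can be computed with pandas. The data frame is filled with placeholder values drawn independently from the variable ranges listed in Table 4, so it is only an assumed stand-in for the study's actual input dataset.

```python
# Minimal sketch: pairwise Pearson correlations of the four selected inputs.
# The values below are synthetic placeholders, not the study's data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "surface_spreading_coeff": rng.uniform(5, 30, 200),
    "ground_thickness": rng.uniform(3, 5, 200),
    "oil_density": rng.uniform(750, 850, 200),
    "spilled_oil_volume": rng.uniform(100, 10000, 200),
})
corr = df.corr(method="pearson")
print(corr.round(2))   # off-diagonal values near zero indicate uncorrelated inputs
```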
Table 4. Linguistic terms of input data.
| Name of the Variable | Value Range, Measurement Units | Input No. | Very Low | Low | Moderate | High | Very High |
|---|---|---|---|---|---|---|---|
| Ground thickness | [3; 5], m | Input 1 | [3; 3.65) | (3; 3.9) | (3.4; 4.6) | (4.1; 5) | (4.3; 5] |
| Spilled oil volume | [100; 10,000], m³ | Input 2 | [100; 3500) | (100; 6000) | (1700; 8300) | (4000; 10,000) | (6000; 10,000] |
| Surface spreading coefficient | [5; 30], m⁻¹ | Input 3 | [5; 14) | (5; 20) | (9; 26) | (16; 30) | (23; 30] |
| Oil density | [750; 850], kg/m³ | Input 4 | [750; 785) | (750; 810) | (765; 835) | (790; 850) | (815; 850] |
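To make the linguistic terms concrete, the sketch below encodes the ground-thickness intervals of Table 4 as trapezoidal membership functions in plain NumPy. The trapezoidal shape and the inner breakpoints are assumptions made for illustration; in the study itself the membership functions are generated and tuned during ANFIS training (see Figure 7).

```python
# Illustrative sketch: trapezoidal membership functions for "ground thickness".
# Shapes and inner breakpoints are assumptions; only the interval bounds come from Table 4.
import numpy as np

def trapmf(x, a, b, c, d):
    """Trapezoidal MF: rises on [a, b], plateaus on [b, c], falls on [c, d]."""
    left = np.where(x >= b, 1.0, np.clip((x - a) / max(b - a, 1e-9), 0.0, 1.0))
    right = np.where(x <= c, 1.0, np.clip((d - x) / max(d - c, 1e-9), 0.0, 1.0))
    return np.minimum(left, right)

x = np.linspace(3.0, 5.0, 201)               # ground thickness range [3; 5] m
mf = {
    "very_low":  trapmf(x, 3.00, 3.00, 3.30, 3.65),
    "low":       trapmf(x, 3.00, 3.30, 3.60, 3.90),
    "moderate":  trapmf(x, 3.40, 3.80, 4.20, 4.60),
    "high":      trapmf(x, 4.10, 4.40, 4.70, 5.00),
    "very_high": trapmf(x, 4.30, 4.70, 5.00, 5.00),
}
idx = int(np.searchsorted(x, 4.0))           # membership degrees of a 4.0 m thick layer
print({term: round(float(values[idx]), 2) for term, values in mf.items()})
```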
Table 5. The fuzzy rules (VH—Very High, H—High, M—Moderate, L—Low, VL—Very Low).
| Rule # | IF Input 1 | AND Input 2 | AND Input 3 | AND Input 4 | THEN Output |
|---|---|---|---|---|---|
| 1 | VH | VH | VH | VH | VH |
| 2 | L | M | M | M | M |
| 3 | M | L | VH | VH | VH |
| 4 | H | VH | VH | VH | VH |
| 5 | H | L | L | L | M |
| 6 | H | M | M | M | H |
| 7 | M | L | L | L | M |
| 8 | M | H | H | VL | L |
| 9 | M | VH | VH | VH | VH |
| 10 | L | M | VH | M | M |
| 11 | L | H | H | VL | M |
| 12 | L | VH | VH | VL | M |
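The following sketch evaluates one rule from Table 5 by combining the antecedent membership degrees with the minimum operator, one common choice for the fuzzy AND. The membership values shown are hypothetical; in the study, the rules are evaluated inside the ANFIS network rather than by hand-written code.

```python
# Minimal sketch of evaluating a single fuzzy rule's firing strength (min as AND).
# The membership degrees below are hypothetical stand-ins for one observation.
def rule_strength(memberships, antecedent):
    """AND the antecedent terms using the minimum operator."""
    return min(memberships[inp][term] for inp, term in antecedent.items())

# Rule 8: IF Input 1 is M AND Input 2 is H AND Input 3 is H AND Input 4 is VL THEN Output is L
memberships = {
    "input1": {"M": 0.7}, "input2": {"H": 0.6},
    "input3": {"H": 0.8}, "input4": {"VL": 0.4},
}
antecedent = {"input1": "M", "input2": "H", "input3": "H", "input4": "VL"}
print(rule_strength(memberships, antecedent))   # 0.4 -> weight of the "Low" consequent
```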
Table 6. Performance evaluation of the proposed approach and the compared ML regression techniques.
| Method | MSE | R² | NRMSE |
|---|---|---|---|
| Linear Regression | 0.43 | 0.45 | 14.2% |
| Decision Trees | 0.33 | 0.32 | 15.8% |
| SVR | 0.52 | 0.54 | 19.1% |
| Ensembles | 0.30 | 0.14 | 17.7% |
| GPR | 0.12 | 0.95 | 4.3% |
| The proposed ANFIS-based approach | 0.01 | 0.99 | 1.0% |
Note that better training of the algorithms and more accurate predictions could be achieved if more real-world data from oil spill sites were available.
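For reference, the sketch below computes the three metrics reported in Table 6 for a pair of hypothetical vectors. NRMSE is normalised here by the range of the observed values, which is one common convention and may differ from the normalisation used in the study.

```python
# Sketch of the metrics in Table 6 (MSE, R², NRMSE) on hypothetical vectors.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([1.2, 2.4, 3.1, 4.8, 5.0])   # placeholder reference values
y_pred = np.array([1.1, 2.6, 3.0, 4.5, 5.2])   # placeholder model predictions

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
nrmse = np.sqrt(mse) / (y_true.max() - y_true.min())   # range-normalised RMSE
print(f"MSE={mse:.3f}  R2={r2:.3f}  NRMSE={nrmse:.1%}")
```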