*Review* **Application of Computational Intelligence Methods in Agricultural Soil–Machine Interaction: A Review**

**Chetan Badgujar 1,\*, Sanjoy Das 2, Dania Martinez Figueroa <sup>2</sup> and Daniel Flippo <sup>1</sup>**


**\*** Correspondence: chetan19@ksu.edu

**Abstract:** Rapid advancements in technology, particularly in soil tools and agricultural machinery, have led to the proliferation of mechanized agriculture. The interaction between such tools/machines and soil is a complex, dynamic process. The modeling of this interactive process is essential for reducing energy requirements, excessive soil pulverization, and soil compaction, thereby leading to sustainable crop production. Traditional methods that rely on simplistic physics-based models are not often the best approach. Computational intelligence-based approaches are an attractive alternative to traditional methods. These methods are highly versatile, can handle various forms of data, and are adaptive in nature. Recent years have witnessed a surge in adapting such methods in all domains of engineering, including agriculture. These applications leverage not only classical computational intelligence methods, but also emergent ones, such as deep learning. Although classical methods have routinely been applied to the soil–machine interaction studies, the field is yet to harness the more recent developments in computational intelligence. The purpose of this review article is twofold. Firstly, it provides an in-depth description of classical computational intelligence methods, including their underlying theoretical basis, along with a survey of their use in soil–machine interaction research. Hence, it serves as a concise and systematic reference for practicing engineers as well as researchers in this field. Next, this article provides an outline of various emergent methods in computational intelligence, with the aim of introducing state-of-the-art methods to the interested reader and motivating their application in soil–machine interaction research.

**Keywords:** tillage; traction; compaction; neural networks; support vector regression; fuzzy inference system; adaptive neuro-fuzzy inference system

#### **1. Introduction**

Soil-engaging tools or machines are an indispensable part of mechanized agriculture. A soil–machine interaction deals with a behavior of tools or machines with soil that results in either tillage, traction, or compaction. The soil–machine interaction is mainly categorized into tillage, traction, and compaction [1]. In traction, a powered traction element (wheel/track) of the vehicle operates on deformable soil, causing soil shear to generate the traction [2]. The soil-derived traction force overcomes the vehicle's resisting forces and maintains its constant motion with its slip and terrain damage [3]. The slip is a principal form of vehicle power loss and one of the prime reasons behind the off-road vehicle's worst traction efficiency. Tractors are the prime movers in agriculture and are mainly used for drawbar work. The drawbar is the most used but the least efficient power outlet, and approx. 20 to 55% of the tractor's available energy is wasted at the soil-tire interface, often resulting in soil compaction and tire wear [4]. Therefore, the traction performance of the vehicle is quantified in terms of traction, slip, and power efficiencies on certain terrain. In the off-road vehicle, it is critical to optimize or increase the working capacity, efficiency, and reduce slip and terrain damage. Multiple variables, such as traction element geometry,

**Citation:** Badgujar, C.; Das, S.; Figueroa, D.M.; Flippo, D. Application of Computational Intelligence Methods in Agricultural Soil–Machine Interaction: A Review. *Agriculture* **2023**, *13*, 357. https:// doi.org/10.3390/agriculture13020357

Academic Editors: Mustafa Ucgul and Chung-Liang Chang

Received: 25 October 2022 Revised: 18 January 2023 Accepted: 19 January 2023 Published: 31 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

operating variables, and soil physical conditions, influence vehicle traction performance. Therefore, traction models are often proposed to optimize off-road vehicle performance (e.g., drawbar, slip, traction efficiency).

Tillage alters the soil mechanically to create favorable conditions for crops [2]. It employs either powered or unpowered mechanical devices (tool/implement) to apply forces to the soil, resulting in soil cutting, inversion, pulverization, displacement, mixing, or a collective action aiming to obtain desired conditions [5]. Most tillage devices are passive (unpowered), known as conventional tillage, where a drawbar is applied to the device, and its movement through soil results in tillage. In contrast, active tillage, also known as rotary tillage, employs a powered device to transmit power to the soil. The powered tool comparatively moves greater soil volume than required, and energy cost increases with a working width and depth. Tillage is the most energy-intensive agricultural operation and accounts for nearly half of the total crop production energy [6].

Tillage energy is influenced by multiple factors, including soil conditions (initial condition, texture, bulk density, moisture content, and crop residue cover), tool parameters (shape, size, and cutting-edge sharpness), and operating parameters (depth and speed) [5,6]. Therefore, extensive literature is available that aims to reduce tillage input energy by optimizing those factors. The research efforts mainly revolved around the soil failure pattern, soil movement, and force or energy prediction models [5,6]. The information on tillage force or energy is critical to select tillage types, tools, control variables, energy management, optimization, and reducing excessive soil pulverization. For example, knowing the tool draft in specific soil helps select the tractor size with a matching implement, reducing operation costs and negative soil impacts. Therefore, tillage force or energy prediction models are necessary from an energy optimization perspective.

Soil compaction is a leading factor in degrading productive farmland [7,8]. It has degraded an estimated 83 Mha of farmland [7,9] and affected around 45% of agrarian soil [10,11]. Natural and artificial activities are responsible for soil compaction. The artificial activities involved in crop production can severely affect soil compaction. These activities include heavyweight machinery, and its intense use, uncontrolled vehicle traffic, multiple passes, operating machines under unfavorable conditions (e.g., wet soil), repeated tillage, and bad crop rotation [7,12,13]. In addition to topsoil compaction, a subsoil or plow pan is caused by heavy vehicular movement, heavy plow weight, downward forces from a plow bottom/disk, and repeated tillage. Soil compaction resulting from the soil–machine interaction influences the soil structure, porosity, permeability, and density [7,14], which impacts the crop yield and may degrade the soil. The soil compaction evolved from soil-machine interaction is a complex process that involves multiple interrelated factors. Hence, optimizing the vehicle parameters (e.g., tire type, orientation, inflation pressure, axle weight, traffic), tillage parameters (tool shape, weight, depth, speed, and tillage intensity), and assessing the initial field conditions (soil moisture) can minimize or eliminate the soil compaction.

The soil–machine interaction is a dynamic and intricate process that includes multivariate. However, understanding and accurately describing (models) the soil–machine interaction may provide a solution to sustainable agricultural production. For example, a slight improvement in the tillage tool design or practice could significantly reduce the input energy and avoid excessive soil pulverization or compaction. Likewise, improving the vehicle traction efficiency may increase the working capacity, save energy, and avoid terrain damage or compaction.

In recent years, computational intelligence (CI) methods have succeeded in solving intricate problems in agriculture and life science. The literature shows that researchers, scholars, and engineers have implemented cutting-edge CI methods, including neural networks, fuzzy logic, neuro-fuzzy systems, support vector machines, and genetic algorithms, to solve a challenging problem in the soil tillage and traction domain. However, it lacks a comprehensive, curated source of reference and a detailed and well-organized discussion on the application of CI methods on the soil–machine interaction. Therefore, this study

aimed to survey and analyze the recent research efforts in the soil–machine interaction and critically review the existing methods with a detailed discussion. The article provides brief information, progress, and future direction on CI methods in the soil tillage and traction domain. The proposed study would serve as a concise reference to the reader, engineers, researchers, and farm managers who are further interested in the soil–machine interaction. It is also a quick and systematic way to understand the applicable methods that allow crucial decision-making in farm management.

The review is organized into the following sections: Section 2 discusses the traditional approach used in soil–machine interaction modeling. It also discusses the strengths and limitations of the traditional approach. Section 3 presents a brief overview of popular CI methods, while Section 4 discusses the popular CI methods. These sections provide a piece of fundamental and in-depth information to the readers. Section 5 provides a brief literature survey on CI methods that are employed in soil–machine interaction studies. It also contains the followed literature survey methodology. In Section 6, the strengths and limitations of CI methods were identified. Section 7 talks about the other emergent CI models that may provide a better solution than popular CI methods. Section 8 discusses the significance, scope, and future direction.

#### **2. Traditional Modeling Methods**

In recent decades, numerous methods have been proposed to evaluate, analyze, model, and understand the soil–machine interaction, which aims to optimize energy, time, efficiency, and machine or tool service life with reduced wear. The methods are explained as follows:

#### *2.1. Analytical Method*

Analytical methods are based on physical principles of soil/terrain, machine parameters, and simple assumptions. The traction force is often computed from a soil–tire contact–surface interface and stress distribution (shear and normal) [5,15]. However, both soil and tire deform during the process, making it challenging to describe in mathematical terms. Moreover, machine dynamics, varying soil conditions, its elastic-plastic nature, and inadequate information on boundary conditions make the soil–machine interaction a very complex problem to model accurately. These challenges raise questions on its adaptability [3,5,15,16].

Likewise, in tillage, soil resistance is computed with a logarithmic spiral method and passive earth pressure theory [6]. These are assumption-based methods that do not include the actual soil failure patterns that vary with tool parameters (shape, rake angle, speed) and soil parameters (moisture, density, and structure) [17–20]. Moreover, the analytical methods are suitable for simple shapes, but difficult for the complex shape tool [5,6]. Thus, it exhibits limited applicability for tool design and energy or force prediction.

#### *2.2. Empirical Method*

Empirical methods are derived from a large amount of experimental data, where the best-fit regression curve explains a relationship among the selected variables. Empirical equations are simple and consist of a few variables with constants specific to the soil, track/wheel, tool type, machine configuration, and operating conditions. Thus, these equations cannot be extrapolated to the other problems, restricting their broad applicability [3,5,15]. Thus, precautions are necessary on a new tire, tillage tool, and test environment [3,5]. Moreover, it requires a large amount of experimental data, which is laborious and costly. Additionally, it is subjected to a multi-collinearity problem, arising from not truly dependent factors [21].

#### *2.3. Semi-Empirical Method*

Semi-empirical method combines experimental data, empirical formulations, and analytical methods. In traction studies, the stress (normal and shear) and soil deformation are computed by assuming stress under a flat plate, and bevameter is used [3,5,15]. The flat plate is non-flexible, but the tire or track is flexible, working on deformable soil. Thus, this method requires improvement. Similarly, a passive earth pressure theory explains the soil-failure pattern for a simple shaped tillage tool [5,6,19]. However, adopting the earth pressure theory to other complex shaped tools is challenging [5,17,22,23]. Semi-empirical is a hybrid, reliable, and the most common method, although equations derived from assumptions limit its accuracy in varying terrain.

#### *2.4. Numerical Method*

Numerical methods such as finite element and discrete element were extensively studied, and lately in soil tillage and traction domain [3,5,15,19,24–28]. The detailed examples can be found here [25,26]. These methods have successfully modeled the complex, dynamic, and non-linear soil–machine interaction problems with greater accuracy and fewer assumptions [3,25,26]. However, it is a highly computational method consisting of a virtual simulation with commercial software installed on a high-speed computer. Therefore, it is time-consuming and requires special and costly resources. Moreover, the simulation setup needs an accurate description of a soil medium that varies on a spatial-temporal basis, making it challenging.

In short, the traditional modeling methods have a few limitations and are very specific to machine or tool types and experimental conditions, which restricts their wide applicability.

#### **3. Computational Intelligence: An Overview**

Broadly speaking, the term "computational intelligence" refers to a wide class of approaches that rely on approximate information to solve complex problems [29–32]. There are a vast array of such problems (e.g., classification, regression, clustering, anomaly detection, function optimization), where CI models have been extensively used. In the available literature on soil–machine interactions, these models have been used for regression tasks. Accordingly, this article describes CI models from a regression standpoint. However, as some articles have used CI optimization approaches for training the models (i.e., model parameter optimization for best performance), CI-based optimization algorithms are also addressed here, albeit in the context of training.

Many CI models are derived from paradigms observed in the natural world. Artificial neural networks (NN), deep neural networks (DNN), and radial basis function networks (RBFN) are structures that loosely resemble the organization of neurons in higher animals. Fuzzy inference systems (FIS) perform computations in a manner analogous to verbal reasoning. Adaptive neuro-fuzzy systems (ANFIS) are designed to combine the attractive features of NN and FIS. These models are very well suited for regression tasks.

Other CI regression models, which are designed from purely mathematical considerations, do not have any natural counterparts. This class of machine learning methods includes support vector regression (SVR) and Bayesian methods.

Natural phenomena also provide the backdrop of CI optimization methods. Genetic algorithms (GA) are modeled after Darwinian evolution, while particle swarm optimization (PSO) simulates the foraging strategy of a swarm of organisms. These methods have been routinely used to train other CI models [33–36].

CI models for regression are data-driven approaches. In soil–machine interaction studies, the data are typically collected from field experiments. Each sample in the data is a pair (**x**(*n*), **<sup>t</sup>**(*n*)), where *<sup>n</sup>* is a sample index. The quantity **<sup>x</sup>**(*n*) <sup>∈</sup> <sup>R</sup>*<sup>M</sup>* in a sample is an *<sup>M</sup>*-dimensional input, **<sup>t</sup>**(*n*) <sup>∈</sup> <sup>R</sup>*<sup>N</sup>* is its corresponding *<sup>M</sup>*-dimensional target (or desired output) vector. The symbol **Θ** is used in this article to denote the set of all trainable parameters of any CI model. Wherever necessary, it may be treated as a vector. Note that throughout this article, italicized fonts are used to represent scalar quantities, and bold fonts for vectors (lowercase) and multi-dimensional arrays (uppercase).

#### *3.1. Data Preprocessing*

Preprocessing is often essential before using data to train a CI model. It renders the data more suitable for CI models.


The samples are randomly divided into three disjoint sets—the training set S*t*, the test set S*s*, and the validation set S*v*. Training samples are used directly to adjust the model parameters in small increments. Unlike them, samples in S*s*, are not used explicitly to compute parameter increments. Instead, testing samples are used intermittently during training to monitor progress. Validation samples in S*<sup>v</sup>* are used as surrogates for the real world. The performance of the CI model is evaluated with respect to S*<sup>v</sup>* only after training is completely accomplished. Approximately 60%–80% of the samples are assigned to S*<sup>t</sup>* and the remainder divided roughly equally between S*<sup>s</sup>* and S*v*.

#### *3.2. Loss Functions*

The purpose of training any model is to minimize the differences between the targets and the true outputs, quantified in terms of its loss [41,42], which is the average of the penalties incurred by all samples. The symbol L is used to represent the loss. The model's loss with respect to samples in the dataset S is,

$$\mathcal{L}\_{\Theta}(\mathcal{S}) = \frac{1}{|\mathcal{S}|} \sum\_{n \in \mathcal{S}} l(t(n) - y(n)) \tag{1}$$

The optional subscript **Θ** in Equation (1) above is used to highlight the loss's dependence on the model parameters. Each term *l*(·) is a sample penalty or error. Using this convention, L**Θ**(S*t*), L**Θ**(S*s*), and L**Θ**(S*v*) are the training and validation losses.

Several loss functions have been proposed. The following are the most commonly used.

(*i*) Mean squared (L2) loss: For scalars, this loss is the average of the squared differences between the network's outputs *y*(*n*), for inputs **x**(*n*) and the corresponding targets, so that, <sup>L</sup><sup>2</sup> <sup>=</sup> |S|−<sup>1</sup> <sup>∑</sup>*n*[*y*(*n*) <sup>−</sup> *<sup>t</sup>*(*n*)]2. For vector outputs, the Euclidean norm ||**y**(*n*) − **t**(*n*)|| is used, where **y**(*n*) is the model's vector output. The L<sup>2</sup> loss is the most commonly used function. Using quadratic penalty terms makes the function quite sensitive to statistical outliers.


$$l\_\delta(n) = \begin{cases} \frac{1}{2} [y(n) - t(n)]^2, & \text{if } |y(n) - t(n)| < \delta; \\ \delta |y(n) - t(n)| - \frac{1}{2} \delta^2, & \text{otherwise.} \end{cases} \tag{2}$$

As the Hüber loss function is not twice differentiable at ±*δ*, the similarly shaped log-cosh function below can be used in its place,

$$d\_{lc}(n) = \log\_{\ell} \frac{1}{2} \left( e^{y(n) - t(n)} + e^{t(n) - y(n)} \right). \tag{3}$$

(*iv*) -Loss: This loss does not apply a penalty when the difference *y*(*n*) − *t*(*n*) lies within a tolerable range [−, +], for some constant . A linear penalty is incurred whenever the numerical difference lies outside this range. In other words, <sup>L</sup> <sup>=</sup> |S|−<sup>1</sup> <sup>∑</sup>*<sup>n</sup> <sup>l</sup>*(*n*), where,

$$l\_{\mathfrak{c}}(n) = \begin{cases} 0, & \text{if } |y(n) - t(n)| < \epsilon; \\ |y(n) - t(n)| - \epsilon, & \text{otherwise.} \end{cases} \tag{4}$$

Since the loss function is not differentiable at ±, if needed, a subgradient in [0, 1] can be used as a substitute for its derivative at ±.

The shapes of the above loss functions are illustrated in Figure 1. The log-cosh loss, which is similar to the Hüber loss, is not shown. There are several other loss functions, including those that are specific to the application, that have not been listed here.

**Figure 1.** Loss Functions. Losses L as functions of the difference between the model output *y*, and the corresponding target (desired output) *t*.

#### *3.3. Model Selection*

Model complexity is a key concept in statistical learning theory, closely related to overfitting. The complexity of a model can be quantified as the number of independent scalar parameters used to compute its output and their ranges. The V-C (Vapnik–Chervonenkis) dimensionality of a model is one such measure of complexity [44] that has led to the development of support vector machines.

Model complexity is a critical factor that should be considered during model selection. Low-complexity models tend to exhibit a bias towards specific input-output maps. For instance, a linear model, which is the least complex regression model, cannot be used to capture nonlinear input–output relationships. Conversely, increasing a CI model's complexity endows it with more degrees of freedom to fit the training data. Due to its lower bias, training the model yields significantly lower training error L**Θ**(S*t*). Unfortunately, a

model with too large a complexity becomes too sensitive to extraneous artifacts present in its training dataset S*t*, such as random noise, sampling, or aliasing. These are extraneous artifacts that do not reflect any underlying input–output relationship. As stated in another manner, as the model's complexity increases, so do its variance. The model with higher variance performs poorly in the real world, with inputs outside S*t*. This is reflected in terms of its higher validation loss L**Θ**(S*v*). In general, the model's effective loss can be decomposed into three components,

$$\mathcal{L}\_{\mathsf{G}} = \mathsf{bias}\_{\mathsf{G}}^2 + \mathsf{var}\_{\mathsf{G}} + \mathsf{noise}.\tag{5}$$

The square of the bias term is used in (5), as it can acquire positive and negative values. The noise component is an artifact introduced by the external environment, and is independent of the model **Θ**. Selecting a CI model with the optimal complexity is a well-known bias-variance dilemma in machine learning. This phenomenon is depicted in Figure 2.

**Figure 2.** Bias, Variance, and Model Complexity. (**Left**) Performance of three models, **Θ**<sup>1</sup> (dashed blue), **Θ**<sup>2</sup> (solid red), and **Θ**<sup>3</sup> (dotted green) with low, optimum, and high model complexities. Small grey circles are training samples (**x**, **t**) ∈ S. (**Right**) Squared bias (solid blue), variance (solid green), noise (dotted brown), and loss (dashed red) as functions of model complexity.

A widely used approach to keep the model's complexity at lower levels is by adding a regularization term R(·) to the loss function. Regularizers are routinely devised in terms of the model parameters in **Θ**. If **Θ** is treated as a vector of parameters, R(**Θ**) = ||**Θ**||<sup>1</sup> and R(**Θ**) = ||**Θ**||<sup>2</sup> <sup>2</sup> are used as LASSO (least absolute shrinkage and selection operator) and ridge regularizers. The elastic net function, which is the convex combination of the LASSO and ridge terms, so that R(**Θ**) = *<sup>r</sup>*||**Θ**||<sup>2</sup> <sup>2</sup> + (1 − *r*)||**Θ**||<sup>1</sup> (where 0 < *r* < 1 is a constant), is another popular choice for regularization [45].

#### *3.4. Training Algorithms*

At present, almost all training algorithms rely on the basic gradient descent. If L**Θ**(·) is the loss function (which may include a regularization term), the parameters of the model are incremented using the training samples in S*t*, as shown in the following expression,

$$
\Theta \leftarrow \Theta + \eta \nabla\_{\Theta} \mathcal{L}\_{\Theta}(\mathcal{S}\_l) \tag{6}
$$

The quantity *η* in the above expression is the gradient descent step size, commonly referred to as the learning rate in CI terminology. The operator ∇**<sup>Θ</sup>** is the gradient (vector derivative) w.r.t. **Θ**.

Since the loss L**Θ**(S*t*) is the sum of all sample penalties *l*(*n*), where *n* ∈ S*t*, a direct implementation of Equation (6) would require a pass over all samples in S*<sup>t</sup>* before **Θ** can be updated. As this is computationally burdensome (particularly for large datasets), training algorithms invariably use stochastic gradient descent (SGD). Before every training epoch of SGD, the samples in S*<sup>t</sup>* are rearranged randomly. The vector parameter **Θ** is incremented once for each sample *n*, using the gradient ∇**Θ***l*(*n*).

In theory, SGD can lead to a speed up the training algorithm by a factor |S*t*|. However, as the directions of the gradients ∇**Θ***l*(*n*) are not perfectly aligned with one another, the

actual speed up is considerably less than |S*t*|. Adding a momentum term to the gradient step helps alleviate this situation. If in step *n* − 1 the parameter **Θ** is incremented by an amount Δ**Θ**(*n* − 1), in the next step *n*, the increment would be Δ**Θ**(*n*) = *η*∇**Θ***l*(*n*) + *μ*Δ**Θ**(*n* − 1). The quantity *μ* (0 ≤ *μ* < *η*) is the momentum rate.

The convergence rate of the training algorithm can be significantly improved by Newton's algorithm, which requires the Hessian matrix ∇<sup>2</sup> **<sup>Θ</sup>**. It can be readily established that the outer product ∇**Θ**∇<sup>T</sup> **<sup>Θ</sup>** is a close approximation of the Hessian. In the Levenberg-Marquardt algorithm [46], the diagonal elements of this matrix are incremented by an amount *μ* to improve the conditioning. Accordingly, incremental updates with the Levenberg-Marquardt algorithm are implemented as per the following rule,

$$
\Theta \leftarrow \Theta + \left(\nabla\_{\Theta} \nabla\_{\Theta}^{\mathsf{T}} + \mu \mathbf{I}\right)^{-1} \nabla\_{\Theta} \mathcal{L}\_{\Theta}(\mathcal{S}\_{l}) \tag{7}
$$

Overtraining—a problem that is closely related to overfitting, is frequently encountered during training. This is shown in Figure 3. Since samples in S*<sup>t</sup>* are used to compute the gradient, as long as *η* is small enough, the training loss decreases each time the parameters are incremented. Initially, the test loss L(S*t*) also drops with training. However, after the model has undergone a significant amount of training, L(S*t*) begins to rise. Applying (6) further will cause overtraining.

*K*-fold cross-validation [47] is an effective means for performance evaluation with sparse data. The samples in S*<sup>t</sup>* are randomly shuffled and split into *K* groups or folds of equal size. One of the folds is used as the test set S*s*, and the rest are used to increment **Θ**. This process is repeated *K* times, with each fold acting as the test set. The loss averaged over all *K* folds is a reliable estimate of its true (real-world) loss.

Premature convergence is another issue that is sometimes observed during training (see Figure 3). This occurs if the training algorithm encounters a local minimum of the loss function's landscape, where the gradient ∇**<sup>Θ</sup>** is very close to zero. Applying (6) or (7) would have little effect on the parameter **Θ**. A simple method to rectify the situation is to restart the training process from some other (randomly generated) initial point.

The presence of narrow ridges with "V"-shaped cross sections is another reason why the loss may remain unchanged (see Figure 3), giving the appearance of a local minimum for several iterations. Although there is no perceptible drop in the loss, the amount of increment to **Θ** is not negligible. Restarts are unnecessary in such situations, for the algorithm eventually leaves such a ridge after multiple updates.

**Figure 3.** Overtraining and premature convergence. Overtraining is illustrated (**left**), showing how test loss (dashed red) begins to rise with overtraining (shaded green region) although training loss (solid blue) decreases. Premature convergence (**right**) of the training loss is shown (dotted red) in contrast to desired convergence (dashed green, solid blue). Due to "V" shaped narrow ridges in the loss function's landscape, there may not be any perceptible decrease in the loss for many training iterations (dashed green)

#### *3.5. Optimization Metaheuristics*

Existing training algorithms apply optimization metaheuristics, such as GA and PSO, to avoid getting trapped in local minima. These algorithms maintain a set of many candidate solutions, referred to as its population. In each step of the optimization algorithm, a new population is formed out of the existing one, using a variety of stochastic and heuristic search operators. Stochastic operators help the algorithm escape from local minima, while heuristics aid in its convergence towards the global maximum of the objective function.

GAs are useful in training CI regression models. Let the population of such a GA be the set {**Θ***<sup>j</sup>* |*j* = 1, 2, ···}, where each **Θ** is a candidate model parameter. During each iteration, pairs of solutions are selected from the population in a random manner, but with better ones (in terms of the inverse loss function) being more likely to be picked. Using the crossover operator, a new pair of new solutions is generated from the old ones. For example, in a convex crossover, the existing pair (**Θ***<sup>i</sup>* , **Θ***<sup>j</sup>* ) can be used to generate a new pair, (**Θ***<sup>i</sup>* , **Θ***<sup>j</sup>* ) = - *<sup>μ</sup>***Θ***<sup>i</sup>* + (<sup>1</sup> <sup>−</sup> *<sup>μ</sup>*)**Θ***<sup>j</sup>* ,(<sup>1</sup> <sup>−</sup> *<sup>μ</sup>*)**Θ***<sup>i</sup>* <sup>+</sup> *<sup>μ</sup>***Θ***<sup>j</sup>* .

In the mutation operator, a small amount of perturbation Δ**Θ***<sup>i</sup>* is added to each new candidate parameter so that it becomes equal to **Θ***<sup>i</sup>* + Δ**Θ***<sup>i</sup>* .

In Gaussian mutation, the perturbation ΔΘ*<sup>i</sup>* follows a Gaussian distribution centered around the origin. This process is repeated many times until no further improvement can be found.

Although GA and PSO have found widespread use in many optimization applications, their use in machine–soil interaction studies is rather limited. GAs have been used during model training. In these cases, the GA is hybridized with (6) and (7), or any other related method. Gradient descent steps can be incorporated into the GA in different ways. For instance, **<sup>Θ</sup>***<sup>i</sup>* can be mutated into **<sup>Θ</sup>***<sup>i</sup>* <sup>+</sup> <sup>Δ</sup>**Θ***<sup>i</sup>* <sup>+</sup> *<sup>η</sup>*∇**Θ***<sup>i</sup>* <sup>L</sup>, where <sup>Δ</sup>**Θ***<sup>i</sup>* is the random perturbation and ∇**Θ**L, the gradient of the training loss L.

Similar hybrid techniques exist for PSO (cf. [48]). However, PSO has not been used in the existing literature on machine–soil interactions. On the other hand, a relatively unknown population-based stochastic algorithm has been used in [49,50].

#### **4. Current Computational Intelligence Models**

#### *4.1. Neural Networks*

Neural network (NN) models have been routinely used in various regression applications since the mid-eighties, wherever a significant amount of data are involved. Neural networks are layered structures consisting of an input layer, one or more intermediate layers, called hidden layers, and an output layer. Each layer comprises elementary processing units or neurons. In a fashion resembling the mammalian cortex, the neurons in each hidden and output layer receive the outputs of those of the preceding layer, as their inputs via weighted synaptic connections.

Figure 4 shows the layout of a neural network with *L* layers. The indices of the input and output layers are 1 and *L*, where *M* and *N* are the number of neurons in the input and output layers. The vectors **<sup>x</sup>** (**<sup>x</sup>** <sup>∈</sup> <sup>R</sup>*M*) and **<sup>y</sup>** (**<sup>y</sup>** <sup>∈</sup> <sup>R</sup>*N*) denote the network's input and output.

The size of an NN can be written succinctly as ∏*<sup>L</sup> <sup>l</sup>*=<sup>1</sup> *<sup>N</sup>*(*l*) where *<sup>N</sup>*(*l*) is the number of neurons in layer *l*. For instance, a 3 × 5 × 6 × 2 NN has three input neurons, a hidden layer with five neurons, another hidden layer with six neurons, and two output neurons. Note that indices of layers (superscripts) are shown within parentheses so as not to confuse them with exponents.

Until recently, NNs were equipped with only one or two hidden layers (so that *L* = 3 or *L* = 4)—an approach used everywhere in the published literature on soil–machine interaction studies. To distinguish them from deep neural networks (DNNs), which have multiple hidden layers, models with only one hidden layer are referred to as shallow networks. However, for the purpose of this review, networks with two hidden layers are also included in this category. This section focuses on classical methods that are common to both shallow and deep networks. Advanced features that are relevant to DNNs are addressed separately in a subsequent section.

**Figure 4.** Neural network. Neurons are depicted as small circles and synaptic connections as straight lines. The network has an input layer (red), hidden layers (green), and an output layer (blue). Since the network shown has multiple hidden layers, it is a deep neural network.

The output of the *<sup>k</sup>*th neuron in a layer indexed *<sup>l</sup>* (*<sup>l</sup>* ∈ {<sup>1</sup> ··· *<sup>L</sup>*}) is denoted as *<sup>y</sup>* (*l*) *<sup>j</sup>* . Thus, the *i* th element of **x** is *xj* = *y* (1) *<sup>j</sup>* ; similarly *yj* = *y* (*L*−1) *<sup>j</sup>* . Figure 4 shows all quantities associated with the *k*th neuron in the layer *l* (*l* > 1). The neuron's input is the weighted sum of the outputs of all neurons in the preceding layer (*l* − 1), as shown in the following expression,

$$s\_k^{(l)} = w\_{k,0}^{(l)} + \sum\_j w\_{k,j}^{(l)} y\_j^{(l-1)}.\tag{8}$$

The summation in (8) is carried out over the outputs *y* (*l*−1) *<sup>j</sup>* of all neurons (indexed *<sup>j</sup>*, *<sup>j</sup>* <sup>≥</sup> 1) of the previous layer, and the associated weight is *<sup>w</sup>*(*l*) *<sup>k</sup>*,*<sup>j</sup>* . The quantity *<sup>w</sup>*(*l*) *<sup>k</sup>*,0 is the neuron's bias. Figure 5 shows a neuron in a hidden layer. The weights and biases are the trainable parameters of the neural network that are included in **Θ**.

Neurons in the input layer are linear elements; their role is merely to transmit the incoming vector to hidden neurons. However, those in the hidden layers, and optionally in the output layer as well, incorporate a monotonically increasing nonlinear function *f*(·), where either *<sup>f</sup>* : <sup>R</sup> <sup>→</sup> (0, 1) or *<sup>f</sup>* : <sup>R</sup> <sup>→</sup> (−1, 1), that is referred to as the activation function. The output of the neuron is,

$$y\_k^{(l)} = f\left(s\_k^{(l)}\right). \tag{9}$$

The logistic function *<sup>σ</sup>*(*s*) = (<sup>1</sup> <sup>+</sup> exp(−*s*))−<sup>1</sup> and the hyperbolic tangent function (tanh(·)) are the most commonly used activation functions used in shallow networks. Due to their characteristic 'S' shapes, such activation nonlinearities fall under the category of sigmoid functions.

**Figure 5.** Neuron. Quantities associated with a neuron (green circle). Also shown is a neuron in the preceding layer (grey circle)

Historically, the popularity of NNs surged with the introduction of the back-propagation (BP) algorithm [51], which is a reformulation of SGD designed to train layered structures. The error *δ* (*l*) *<sup>k</sup>* of the *<sup>k</sup>*th neuron in the *<sup>l</sup>* th layer is defined as the derivative of the penalty term *l***<sup>Θ</sup>** (in the loss L**Θ**) with respect to the neuron's input *s* (*l*) *<sup>k</sup>* (see Equation (8)). Such penalties can be readily differentiated for neurons in the output layer (*l* = *L*). The back-propagation rule shows how *δ* (*l*) *<sup>k</sup>* can be computed for hidden neurons (*l* < *L*), using the errors of the

next layer, *l* + 1. The schematic in Figure 6 illustrates how errors back-propagate. The general expression to compute the errors is,

$$\delta\_k^{(l)} = \begin{cases} \frac{\partial}{\partial s\_k^{(l)}} l\_{\Theta'} & \text{if } l = L; \\\sum\_j w\_{kj}^{(l+1)} \delta\_j^{(l+1)}, & \text{otherwise.} \end{cases} \tag{10}$$

The weights in **Θ** can be updated in the following manner,

$$w\_{kj}^{(l)} = w\_{kj}^{(l)} + \eta y\_j^{(l-1)} \delta\_k^{(l)}.\tag{11}$$

It is common practice to include a momentum term to BP. Additionally, BP can be extended to apply Levenberg-Marquardt updates. This is the Levenberg-Marquardt BP (LMBP) algorithm [52].

**Figure 6.** Back-propagation of errors. Shown here are the quantities relevant to the back-propagation of error from a neuron (solid circle) to another in the previous layer (green circle).

The VC-dimensionality of a neural network is typically specified in terms of the total number of weights and biases [53]. The number of training samples should be about ten times this quantity. The number of epochs to achieve training is independent of the data size.

#### *4.2. Radial Basis Function Networks*

The radial basis function network (RBFN) [54,55] is another popular computational intelligence regression model that is topologically identical to an *M* × *K* × 1 neural network. In other words, an RBFN has *M* input neurons, a single hidden layer of *K* neurons, and only one output neuron. The sole purpose of the input layer, which contains *M* linear neurons, is to pass on *M* dimensional inputs to the hidden layer. The *K* neurons in the hidden layer are incorporated with nonlinear activation functions. The output neuron computes the weighted summation of the outputs from the hidden layer. Due to its strong resemblance to a shallow neural network, the RBFN is sometimes treated as a specific kind of NN. RBFNs have been successfully used in agricultural applications [56–58].

Unlike in NNs, the hidden neurons of the RBFN are designed to produce localized responses. The activation function of any hidden neuron has an *M* dimensional parameter called its center. The closer an input is to its center, the higher the neuron's output. In this manner, the network's hidden neurons simulate sensory cells of the peripheral nervous system, which have localized receptive fields.

Suppose **<sup>x</sup>** (**<sup>x</sup>** <sup>∈</sup> <sup>R</sup>*M*) is the network's input. Each hidden neuron *<sup>k</sup>* (where *<sup>k</sup>* ∈ {<sup>1</sup> ··· *<sup>K</sup>*}) receives **x** from the input layer, and produces an output *f*(||**x** − **c***k*||), where || · || denotes a vector norm operator (e.g., length). Gaussian nonlinearities are the most widely used activation functions. In this case, the output of the *k*th hidden neuron, denoted as *φk*, is obtained according to the following expression,

$$\phi\_k = \mathfrak{e}^{\frac{1}{\sigma\_k}||\mathbf{x} - \mathbf{c}\_k||^2} \tag{12}$$

The quantity *σ<sup>k</sup>* in (12) is an optional width parameter of the *k*th hidden neuron. When dealing with training samples that are distributed evenly within the input space, all hidden neurons may be assigned the same widths *σ*.

With *wk* (*k* ∈ 1 ··· *K*) being network weights, the RBFN's output *y* is the weighted sum ∑*<sup>k</sup> wkφk*. Using (12), *y* can be expressed directly in terms of the input **x** as,

$$y = \sum\_{k} w\_{k} \epsilon^{\frac{1}{\sigma\_{k}} ||\mathbf{x} - \mathbf{c}\_{k}||^{2}} \tag{13}$$

Figure 7 depicts the main quantities of an RBFN.

**Figure 7.** Radial Basis Function. Shown are a hidden neuron (green) and the output neuron (grey).

The RBFN's parameters in **Θ** are all its weights *wk* and centers **c***k*. If necessary to train the network's widths *σk*, they are also included in **Θ**. Due to the use of localized activation functions, the number of hidden neurons *K* required by the RBFN increases exponentially with the input dimensionality *M*. Hence the effectiveness of RBFNs is limited to tasks involving low dimensional data (up to *M* = 6 or 7). Even in such tasks, RBFNs require significantly more hidden neurons than NNs. As the trade-off for this limitation, RBFNs offer faster training, often by a few orders of magnitude. This speedup over Equation (6) is achieved when the centers, widths, and weights are trained separately [59].

A popular method to train the centers of the hidden neurons is by using *K*-means clustering [60]. For each hidden neuron *k*, a subset N*<sup>k</sup>* of samples in training set S*<sup>t</sup>* is obtained. This subset consists of all samples that are closer to the neuron's center **c***<sup>k</sup>* than to **c***<sup>k</sup>* of any other neuron *k* , *k* = *k*. The center of each hidden neuron is made equal to the average of all samples **x**(*n*) in N*k*. The two steps can be expressed, as shown below,

$$\begin{aligned} \mathbf{c}\_{k} &\leftarrow \frac{1}{|\mathcal{N}\_{k}|} \sum\_{n \in \mathcal{N}\_{k}} \mathbf{x}(n), \quad \text{where,} \\ \mathcal{N}\_{k} &= \{ n \in \mathcal{S}\_{l} | \boldsymbol{k}' \neq \boldsymbol{k}, ||\mathbf{x}(n) - \mathbf{c}\_{k}|| < ||\mathbf{x}(n) - \mathbf{c}\_{k'}|| \} \end{aligned} \tag{14}$$

A relatively small number of iterations of (14) is enough to train the centers of all hidden neurons. Their widths can be fixed at some constant value such that *σ<sup>k</sup>* = *σ*, *k* ∈ {1 ··· *K*}. Alternately, the nearest neighbor heuristic can be applied to determine each *σ<sup>k</sup>* separately, such as,

$$
\sigma\_k = \mathfrak{c} \underset{k' \neq k}{\operatorname{argmin}} ||\mathbf{c}\_k - \mathbf{c}\_{k'}|| \tag{15}
$$

The quantity *c* in (15) is an algorithmic constant.

For the L<sup>1</sup> or the Hüber loss functions, the weight parameters **w***<sup>k</sup>* must be trained in an iterative manner using (6). When the L<sup>2</sup> loss is used, the Moore-Penrose pseudoinverse formula provides a simpler method to obtain the weights. Let **<sup>w</sup>** <sup>∈</sup> <sup>R</sup>*<sup>K</sup>* be the vector of all weights. Similarly, let *<sup>φ</sup>*(*n*) <sup>∈</sup> <sup>R</sup>*<sup>K</sup>* (*<sup>n</sup>* ∈ S*t*) be the vector of outputs of the hidden neurons, determined using (12)) with input **x**(*n*). It can be observed that the RBFN's output is *<sup>y</sup>*(*n*) = *<sup>φ</sup>*T(*n*) **<sup>w</sup>** (where · <sup>T</sup> is the transpose operator).

To observe how the pseudoinverse formula works, let us construct an activation matrix **<sup>Φ</sup>** <sup>∈</sup> <sup>R</sup>|S*t*|×*<sup>K</sup>* <sup>+</sup> , whose *<sup>n</sup>*th row is *<sup>φ</sup>*T(*n*). Whence **<sup>y</sup>** = **<sup>Φ</sup><sup>w</sup>** is the |S*t*| × 1 vector of all outputs of the RBFN. If **<sup>t</sup>** <sup>∈</sup> <sup>R</sup>|S*t*<sup>|</sup> is the corresponding vector of all target values, the mean squared <sup>L</sup><sup>2</sup> loss is the expression <sup>L</sup><sup>2</sup> <sup>=</sup> ||**<sup>y</sup>** <sup>−</sup> **<sup>t</sup>**||<sup>2</sup> . The weight vector that minimizes this loss is argmin ||**Φ<sup>w</sup>** <sup>−</sup> **<sup>t</sup>**||<sup>2</sup> .

**w** If the number of training samples is more than the number of hidden neurons (i.e., |S*t*<sup>|</sup> <sup>&</sup>gt; *<sup>K</sup>*), which is always the case in a real application, the matrix **<sup>Φ</sup>**T**<sup>Φ</sup>** is non-singular. In this case, the expression for the loss minimizing weight vector **w** is determined as,

$$\mathbf{w} = \left(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} \tag{16}$$

The factor (**Φ**T**Φ**) −1 **<sup>Φ</sup>**<sup>T</sup> in (16) is a matrix of size *<sup>K</sup>* × |S*t*|. It is referred to as the pseudoinverse of **Φ** and denoted as **Φ**+.

In theory, all RBFN parameters can be trained iteratively using gradient descent. Although training the RBFN parameter vectors [*μk*] and [*σk*] in this manner is fairly uncommon, and gradient descent is often used to train the weight vector **w**. This is carried out as in Equation (6), with **Θ** replaced with **w**. This method is applied to avoid numerical issues with matrix pseudoinversion and wherever the training algorithm is not based on the mean squared loss function.

Recent RBFN models use multivariate Gaussian distributions, where Equation (12) is replaced with,

$$\phi\_k = e^{\frac{1}{2}(\mathbf{x} - \mathbf{c}\_k)^T \Sigma\_k (\mathbf{x} - \mathbf{c}\_k)} \tag{17}$$

In the above expression, the quantity **<sup>Σ</sup>***<sup>k</sup>* <sup>∈</sup> <sup>R</sup>*M*×*<sup>M</sup>* is a covariance matrix.

#### *4.3. Support Vector Regression*

SVRs are another class of CI models [61,62] that are widely used in various engineering and other applications [63]. SVRs have been used for regression applications in agriculture [64–67]. Unlike the other CI models discussed earlier, SVRs do not have any strong parallels in nature. Instead, they are specifically aimed at addressing the issue of model complexity, which is addressed below.

The simplest formulation is the linear SVR with -loss, as shown in Figure 8. Sample targets that lie within a margin of ± from the regression line do not incur any penalty, while those outside the margin incur penalties. Hence, the error arising from a sample pair (**x**(*n*), *t*(*n*)) is obtained as shown in (4). Denoting this error as *ξ*(*n*), it can be readily established that the following constraints are satisfied,

$$\begin{cases} \mathcal{J}(n) \ge 0 \\ \begin{array}{l} t(n) - \mathbf{w}^{\mathrm{T}} \mathbf{x}(n) - b \le \boldsymbol{\varepsilon} + \boldsymbol{\mathcal{G}}(n) \\ \mathbf{w}^{\mathrm{T}} \mathbf{x}(n) + b - t(n) \le \boldsymbol{\varepsilon} + \boldsymbol{\mathcal{G}}(n) \end{array} \end{cases} \tag{18}$$

When the above conditions are satisfied, the loss is simply the sum of all errors, ∑*<sup>n</sup> ξ*(*n*). It has been demonstrated that the gap between the validation and training losses (i.e., L(S*v*) − L(S*t*)) can be lowered by increasing , or alternately by decreasing ||**w**||<sup>2</sup> while ± is a constant [68]. This term can be recognized as the LASSO regularizer.

With *C* being an algorithmic constant, the optimal regression model can be obtained as the solution to the following constrained optimization problem,

$$\begin{cases} \min\_{\mathbf{w}, b\_{\mathcal{S}}^{\mathcal{X}}} & \frac{1}{2} \mathbf{w}^{\mathcal{T}} \mathbf{w} + \mathbb{C} \sum\_{n \in \mathcal{S}\_{\mathbf{t}}} \xi(n) \\ & \text{s.t.} \quad \text{(18)} \qquad \text{is true} \end{cases} \tag{19}$$

**Figure 8.** Linear support vector regression. The regression line *y* = **w**T**x** + *b* (solid green), and the region (shaded green) of zero penalties around it, are shown. Also shown are samples (small circles) including the support vectors (filled circles) indexed *m* and *n*.

The above problem (19) to obtain the SVR is in primal form. Classical optimization theory (cf. [69,70]) illustrates that for every primal problem, a dual problem can be constructed using the Lagrange multipliers of the primal constraints as its variables. The optimization theory establishes that under certain constraint qualifications, the optima of the primal and dual problems coincide at a saddle point. The dual form of (19) can be derived readily [68]. Ignoring the constraints *<sup>ξ</sup>*(*n*) <sup>≥</sup> 0 and using the symbols *<sup>λ</sup>*<sup>+</sup> <sup>∈</sup> <sup>R</sup>|S*t*<sup>|</sup> <sup>+</sup> and *<sup>λ</sup>*<sup>−</sup> <sup>∈</sup> <sup>R</sup>|S*t*<sup>|</sup> <sup>+</sup> as the Lagrange multiplier vectors of the other constraints in (18), the dual problem can be formulated in the following manner,

$$\begin{cases} \min\_{\lambda\_{+},\lambda\_{-}} \quad \frac{1}{2}(\lambda\_{+}-\lambda\_{-})^{\mathrm{T}}\mathbf{K}(\lambda\_{+}-\lambda\_{-}) + \lambda\_{+}^{\mathrm{T}}(\epsilon\mathbf{1}-\mathbf{t}) + \lambda\_{-}^{\mathrm{T}}(\epsilon\mathbf{1}+\mathbf{t})\\ \text{ s.t.} \quad \mathbf{1}^{\mathrm{T}}(\lambda\_{+}-\lambda\_{-}) = 0\\ \quad \mathbf{0} \le \lambda\_{+}, \lambda\_{+} \le \mathbf{C1} \end{cases} \tag{20}$$

The element in the *<sup>m</sup>*th row and *<sup>n</sup>*th column of the symmetric matrix **<sup>K</sup>** <sup>∈</sup> <sup>R</sup>|S*t*<sup>|</sup> <sup>+</sup> in (20) is **x**(*m*)T**x**(*n*). The bias *b* and the normal vector **w** can be obtained from the dual solution, although **w** is not required.

In more generalized settings, input samples can lie in any arbitrary Hilbert space. The inner product of the *<sup>m</sup>*th and *<sup>n</sup>*th samples is represented as **x**(*m*), **<sup>x</sup>**(*n*). The matrix **<sup>K</sup>** will contain pairwise inner products of such samples.

Nonlinear SVRs implicitly apply a transformation *φ*(·) from the input space S to an unknown Hilbert space [61]. Under these circumstances, the (*m*, *n*)th element of **K**, which we now denote as *K*(**x**(*m*), **x**(*n*)), is obtained as provided below,

$$K(\mathbf{x}(m), \mathbf{x}(n)) = \langle \phi(\mathbf{x}(m)), \phi(\mathbf{x}(n)) \rangle \tag{21}$$

The function *<sup>K</sup>* : S×S→ <sup>R</sup><sup>+</sup> is referred to as the kernel function. Mercer's theorem states that as long as the kernel satisfies a few conditions, there must exist some transformation *<sup>φ</sup>* : S → <sup>H</sup> satisfying (21). As long as these conditions are met, the matrix **<sup>K</sup>** obtained from every possible sample set will be symmetric and positive definite. In other words, kernel functions can be devised without even considering the mapping *φ*(·); this mapping, along with its range in Hilbert space H can remain unknown. This is a remarkable feature of SVR models. In engineering applications, any symmetric, non-negative measure of similarity between pairs of samples can be adopted as the kernel function. For instance, Gaussian kernels *e* 1 *<sup>σ</sup>* ||**x**(*m*)−**x**(*n*)||<sup>2</sup> , or L*p*-normed kernels ||**x**(*m*) − **x**(*n*)||*<sup>p</sup>* can be adopted for inputs that lie in a Euclidean space R*M*. In bio-informatics, where samples may consist of

DNA strands that are sequences of the letters C, T, G, and A, the kernel may vary negatively with the minimum edit distance between every pair of samples.

For each sample, **x**(*n*) that is strictly within the ± margin of the regression line (see Figure 8), the corresponding dual variables *λ*±(*n*) obtained from (20) will be zeros. It is only when the sample lies either on the margin's boundaries or outside it that yields either *λ*+(*n*) > 0 or *λ*−(*n*) > 0. These samples are the support vectors. The set of all support vectors is,

$$\mathcal{V} = \{ n | \lambda\_+ (n) > 0 \text{ or } \lambda\_- (n) > 0 \}. \tag{22}$$

Given an unknown sample **x**, the estimated output *y* can be obtained using the kernels of **x** and the support vectors in V,

$$y = \sum\_{n \in \mathcal{V}} (\lambda\_+(n) - \lambda\_-(n)) K(\mathbf{x}(n), \mathbf{x}) + b.$$

Although not provided in this article, the bias *b* can be obtained readily from the dual form in (20).

As long as the training set S*<sup>t</sup>* is small enough so that it is computationally feasible to compute the matrix **K** and store in memory, quadratic programming can be applied directly to solve (20). Otherwise, there are a plethora of iterative training algorithms [71–73], that are well-equipped to train SVRs with larger data sets. SVRs can be formulated using other losses and regularizers as well.

#### *4.4. Fuzzy Inference Systems*

FIS is a CI model that is inspired by decision-making processes in humans. It uses fuzzy sets to capture the inherent vagueness in human verbal reasoning. The fuzzy set theory extends the classical concept of a set (called a 'crisp' set in fuzzy terminology) by incorporating such imprecision. The manner in which it does so is described next.

Any element *x* from the universe of discourse U can either be in a given crisp set *A*, where *<sup>A</sup>* <sup>⊂</sup> <sup>U</sup> (i.e., *<sup>x</sup>* <sup>∈</sup> *<sup>A</sup>*) or not in it (i.e., *<sup>x</sup>* <sup>∈</sup>/ *<sup>A</sup>*). Accordingly, a binary membership function *<sup>μ</sup><sup>A</sup>* : <sup>U</sup> → {0, 1} can be defined such that *<sup>μ</sup>A*(*x*) = 1 iff *<sup>x</sup>* <sup>∈</sup>/ *<sup>A</sup>*, otherwise *μA*(*x*) = 0 iff *x* ∈/ *A*. The membership function of a fuzzy set *A* is allowed to have any real value within the interval [0, 1], i.e., *<sup>μ</sup><sup>A</sup>* : <sup>U</sup> <sup>→</sup> [0, 1]. The numerical value of *<sup>μ</sup>A*(*x*) indicates the degree to which *x* is included in *A*. For example, let *T* be the set of tall students in a class. If *T* is a crisp set, there must be a minimum cutoff for tallness. Let this cutoff be 5 10. Hence, Jack and Jill, whose heights are 5 9 and 6 1, have memberships *μT*(Jack) = 0, and *μT*(Jill) = 1 in . On the other hand, if *T* is a fuzzy set, then memberships such as *μT*(Jack) = 0.7, and *μT*(Jill) = 0.99 are possible, indicating that Jack is very close to being tall, whereas Jill is definitely tall.

When the universe of the discourse is a continuous variable, memberships can be defined in terms of functions of real arguments *<sup>μ</sup>* : <sup>R</sup> <sup>→</sup> [0, 1]. The Gaussian, trapezoidal, and triangular functions are commonly used for memberships. The Gaussian membership of a scalar input *<sup>x</sup>* to the fuzzy set *<sup>A</sup>* <sup>⊂</sup> <sup>U</sup> is *<sup>e</sup>*−(*x*−*μ*)/*σ*. The trapezoidal membership can be defined using four parameters, *a*, *b*, *c*, and *d* (*a* ≤ *b* < *c* ≤ *d*),

$$\mu\_A(x) = \begin{cases} 0, & \text{if } x < a; \\ \frac{\frac{x-a}{b-a'}}{\frac{b-a'}{b-a'}} & \text{if } a \le x < b; \\\ 1, & \text{if } b \le x < c; \\ \frac{d-x}{d-c'} & \text{if } c \le x < d; \\\ 0, & \text{if } d \le x. \end{cases} \tag{2.3}$$

The triangle membership function requires only three parameters, *a*, *b*, and *c* (*a* ≤ *b* ≤ *c*),

$$\mu\_A(x) = \begin{cases} 0, & \text{if } x < a; \\ \frac{\frac{x-a}{b-a'}}{\frac{c-x}{c-b'}}, & \text{if } a \le x < b; \\ \frac{\frac{c-x}{c-b'}}{0, & \text{if } c \le x. \end{cases} \tag{24}$$

Gaussian memberships, as well as those in (23) and (24), have peak values of unity. Although this is common practice in real-world applications, fuzzy sets can also admit any other membership function as long as its maximum lies anywhere in (0, 1]. The complement *A*¯ of the fuzzy set *A* can be readily defined in terms of the membership function as *μA*¯ = 1 − *μA*. A fuzzy singleton—say *B*, is a fuzzy set that is fully parametrized by a constant *vB*, where *vB* <sup>∈</sup> <sup>R</sup>, such that for the input *<sup>y</sup>* <sup>∈</sup> <sup>R</sup>,

$$
\mu\_B(y) = \begin{cases} 1, & \text{if } y = v\_B; \\ 0, & \text{otherwise.} \end{cases} \tag{25}
$$

The operations of union (∪) and intersection (∩) in crisp sets correspond to conjunction (AND) and disjunction (OR) in Boolean algebra. In terms of membership functions, the union *A* ∪ *B* and intersection *A* ∩ *B* of the sets *A* and *B* are *μA*∪*<sup>B</sup>* = *μ<sup>A</sup>* OR *μA*, and *μA*∩*<sup>B</sup>* = *μ<sup>A</sup>* AND *μA*. Union and intersection of fuzzy sets can be realized in various ways [74], using t-conorms and t-norms. A popular choice is to use max(···) as the t-conorm operator and min(···) as the t-norm. In this case, *μA*∪*<sup>B</sup>* = max{*μA*, *μB*} and *μA*∩*<sup>B</sup>* = min{*μA*, *μB*}. In our previous example, suppose *S* is the fuzzy set of smart students and *μS*(Jill) = 0.75, then *μS*∪*T*(Jill) = max{0.75, 0.99} = 0.99 and *μS*∩*T*(Jill) = min{0.75, 0.99} = 0.75.

A FIS encapsulates human knowledge through a fuzzy rule base. Each rule in the base consists of two parts, an antecedent and a consequent, and is written in the format, "**If** ANTECEDENT **then** CONSEQUENT". If the input to the model is an *M*-dimensional vector **x** and its output is an *M*-dimensional vector **y**, the antecedents and consequents are made up of *M* and *N* fields. The generic format of a rule with index *k* ∈ 1, 2, ··· , *K* is as shown below,

$$\textbf{If} \quad \underbrace{\textbf{x}\_{1} \text{ is } A\_{1}^{k} \diamond \textbf{x}\_{2} \text{ is } A\_{2}^{k} \cdot \cdots \cdot \textbf{x}\_{M} \text{ is } A\_{M}^{k}}\_{\textbf{AND-EEDENT}} \textbf{ then} \quad \underbrace{\textbf{y}\_{1} \text{ is } B\_{1}^{k} \cdot \cdots \cdot \diamond \textbf{y}\_{N} \text{ is } B\_{N}^{k}}\_{\textbf{CONSEQUEN}}.\tag{26}$$

Each diamond symbol () in (26) represents an AND or an OR operator.

The order in which these operations are applied may either be in accordance with an established convention or, alternately, specified explicitly by inserting brackets at appropriate places. Mathematically speaking, the *j* th field in the antecedent of the fuzzy rule in Equation (26), "*xj* is *Aj*" is the membership, *μAj* (*xj*). In a similar fashion, the *i* th field in the consequent is *μBi* (*yi*). Figure 9 shows a simple rule base with *K* = 6 rules.

There are two kinds of FIS, differing only in the way the sets *Bi* in the consequent's *i* th field (*<sup>i</sup>* ∈ {1, 2, ··· , *<sup>N</sup>*}) are defined. In the Mamdani model [75], they are allowed to be fuzzy sets. As a result of this flexibility, a Mamdani FIS can easily apply verbal descriptions of the consequents. On the other hand, in the Takagi-Sugeno-Kang (TSK) model [76,77], each *Bi* must be a singleton, as in Equation (25). A TSK model renders the FIS more amenable to mathematical treatment. Figure 9 shows examples of the Mamdani and TSK models.

The various steps involved in mapping an input to its output will be illustrated using the examples shown in Figure 10 (Mamdani model) and Figure 11 (TSK model). The steps are briefly described below.

(*i*) Fuzzification: This step is carried out separately in each antecedent field "*xj* is *A<sup>k</sup> j* " and for each rule *k*. It involves computing the values of the memberships *μA<sup>k</sup> j* (*xj*) using the numerical values of the input element *xj*.

**Figure 9.** Fuzzy inference system. The figure shows membership functions (**top**), and fuzzy rule base (**bottom**). The input **x** has two elements, *WL* ∈ [0, 10] (wheel load), that can be L (Low), M (Medium), or H (High), and *IP* ∈ [0, 10] (inflation pressure) that can be L (Low) or H (High). The output *y* ∈ [0, 30] is a scalar (*N* = 1). This is the quantity *CA* (contact area), which can be L (Low), M (Medium), or H (High). The membership functions, *μ*L(*CA*), *μ*M(*CA*), and *μ*H(*CA*) are trapezoids/triangle (Mamdani) or singletons (TSK).

(*ii*) Aggregation: In this step, AND and OR operations are applied as appropriate to each rule in the FIS. The rules in the FIS shown in Figures 10 and 11 only involve conjunctions (AND) that are implemented through the min(·) t-norm. The aggregated membership is referred to as its rule strength. The strength of rule *k* is,

$$\mu\_A^k = \bigcup\_j \mu\_{A\_j^k}(x\_j). \tag{27}$$

(*iii*) Inference: The strength of each rule is applied to its consequent. Each rule *k* in our example contains only one consequent field. Its membership function *μB<sup>k</sup>* is limited to a maximum of *<sup>μ</sup>A<sup>k</sup>* . For every rule, *<sup>k</sup>* in *<sup>k</sup>*, a two-dimensional region R*<sup>k</sup>* is identified in the Mamdani model. Since the TSK model involves only singletons at this step, only a two-dimensional point R*<sup>k</sup>* is necessary. Accordingly,

$$\mathcal{R}^k = \left\{ \begin{array}{c} \{(y^k, z^k) | y^k \in [0, y\_{\text{max}}], \, z^k \in [0, \max(\mu\_{\text{B}^k}(y), \mu\_{A^k})] \}, & \text{Mandani};\\ (v\_{\text{B}^k}, \mu\_{A^k}), & \text{TSK}. \end{array} \right. \tag{28}$$

In the example shown in Figure 10, the upper limit *ymax* = 30.

(*iv*) Defuzzification: The value of the FIS's output is determined in the last step. The Mamdani FIS in Figure <sup>10</sup> uses the centroid defuzzification method. The regions R*<sup>k</sup>* are unified into a single region R. The final output is the x-coordinate of the centroid of R. The TSK model in Figure 11 uses a weighted sum to obtain the output *y* of the FIS. Mathematically,

$$y = \begin{cases} \left[\int\_{\mathcal{R}} d\mathcal{R}\right]^{-1} \int\_{\mathcal{R}} z^k d\mathcal{R}, & \text{Mandani;}\\ \left[\sum\_k \mu\_{A^k}\right]^{-1} \sum\_k \upsilon\_{\mathcal{B}^k} \mu\_{A^{k\_\prime}} & \text{TSK}. \end{cases} \tag{29}$$

In the above expression, R = , *k* R. It is evident from the above description, that the inference and defuzzification step in a Mamdani FIS is more computationally intensive in comparison to that in the TSK model. There are several other methods to obtain the output of a FIS. For details, the interested reader is referred to [78,79]. The Mamdani model [80–82] as well as the TSK model [83–87] have been used frequently in agricultural research.

**Figure 10.** Mamdani FIS. The inputs to the FIS in Figure 9 are *WL* = 7.5 and *WL* = 5.0 (dotted vertical lines), and the output is *y* = 23.50). The first three rules in Figure 9 with *WL* = H are ignored since *<sup>μ</sup>*L(7.5)=0). The dark-shaded regions are <sup>R</sup>*<sup>k</sup>* of the relevant rules.

**Figure 11.** TSK FIS. The inputs to the FIS in Figure 9 are *WL* = 7.5 and *WL* = 5.0 (dotted vertical lines), and the output is *y* = 23.49. The first three rules in Figure 9 with *WL* = H are ignored since *μ*L(7.5) = 0). In the other rules, the values of *vBk* are 15, 25, and 25.

#### *4.5. Adaptive Neuro-Fuzzy Inference Systems*

A TSK model with fuzzy rules as in (26) is often referred to as a zero-order FIS. An ANFIS is based on a zero or higher order TSK model, that is arranged in a manner resembling an NN [88–90]. ANFIS models frequently use first-order TSK rule bases. Assuming a scalar output *y*, the format of the *k*th rule in such a first-order TSK model is,

$$\text{If } \mathbf{x}\_1 \text{ is } A\_1^k \diamond \cdots \diamond \mathbf{x}\_M \text{ is } A\_M^k \text{ then } \ y = b\_0^k + b\_1^k \mathbf{x}\_1 + \cdots + b\_M^k \mathbf{x}\_M. \tag{30}$$

The consequent in Equation (30) is a linear expression for *y* in terms of **x**, with *M* × *K* coefficients, *b<sup>k</sup> <sup>j</sup>* (where *j* ∈ 1, ··· , *M* and *k* ∈ 1, ··· , *K*). To simplify its training, the membership functions in the ANFIS rules' antecedents are usually restricted to Gaussian [90]. Figure 12 shows an example of a first-order TSK model.

Figure 13 illustrates the ANFIS corresponding to the first-order TSK rule set shown in Figure 12. The parameters of the membership functions of each input variable are trainable quantities. For Gaussian memberships, they are *σ<sup>k</sup> <sup>j</sup>* , *<sup>μ</sup><sup>k</sup> <sup>j</sup>* , where *j* ∈ {1, ··· , *M*}, and *k* is the index of a rule). The coefficients in the consequent side of each such rule, which are *b<sup>k</sup> j* ,

*j* ∈ {0, 1, ··· , *M*} are also trainable. All trainable quantities constitute the parameter vector **Θ** of the ANFIS.

**Figure 12.** Type 1 TSK FIS. Shown are the membership functions of the fields (top) and the fuzzy rule base (bottom). The antecedents of the rules are the same as those in Figure 9.

There are five layers in the ANFIS model, which are as follows.


$$
\mu\_A^k = \prod\_j \mu\_{A\_j^k}(x\_j). \tag{31}
$$

(*iii*) Normalizing layer: This is the third layer of the ANFIS, whose role is to normalize the incoming aggregated memberships, *μ<sup>k</sup> <sup>A</sup>* from the previous layer. The output of its *<sup>k</sup>*th unit is,

$$
\hat{\mu}\_A^k = \frac{\mu\_A^k}{\sum\_{k'} \mu\_A^{k'}}.\tag{32}
$$

(*iv*) Consequent layer: The output of the *k*th unit of the fourth layer is,

$$y^k = \hat{\mu}\_A^k \left( b\_0^k + \sum\_j b\_j^k x\_j \right). \tag{33}$$

(*v*) Output layer: The final layer of the ANFIS performs a summation of the consequent outputs *yk*,

$$y = \sum\_{k} y^{k}.\tag{34}$$

The quantity *y* is the output of the ANFIS.

Several methods have been proposed to train the parameters of an ANFIS model [91]. Much research has been directed towards gradient descent approaches ((6)) resembling BP [89,92]. Such approaches have been used in agriculture [93–95]. A Levenberg-Marquardt approach has been suggested recently [96]. Stochastic metaheuristics such as GAs [97] and PSO [98] have also been investigated. Hybrid approaches combining them are widely used to train ANFISs [99]. A comparison of three metaheuristics has been reported in [100] for an agriculture-related application.

**Figure 13.** Adaptive Neuro-Fuzzy Inference System. Shown are the inputs and the five layers of an ANFIS.

#### **5. Soil–Machine Interaction Studies: A Brief Survey**

*5.1. Literature Survey Methodology*

In recent decades, CI methods have been extensively studied in agriculture, particularly in crop management, insect–pest management, irrigation scheduling, precision agriculture, input application optimization, yield prediction, and so on [80]. Initially, we collected research articles for the period ranging from 1990 to 2022, from multiple online databases such as Web of Science, Scopus, Science Direct, Google Scholar, Wiley, and Springer Link. More than 150 research articles were collected in the preliminary screening stage. Out of the 150 articles, only 50 articles directly related to the CI application on traction, tillage, and compaction were selected. Figure 14 shows the year-wise and categorical distribution of the selected articles where CI methods were employed.

#### *5.2. Traction*

In traction studies, an individual traction element (wheel, track) or entire off-road vehicle is tested in a controlled laboratory setup or prepared or un-prepared fields for its performance optimization. The major performance parameters are drawbar pull, traction efficiency, slip, and fuel consumption, which are optimized as a function of numerous variables pertaining to the machine, operational setup, and soil properties. A summary of CI models developed in selected traction studies is shown in Table 1.

An off-road vehicle (tractor or skidder) used in agriculture and forestry is specially designed for drawbar work, i.e., pulling or pushing the implements. Drawbar power is a product of drawbar pull and vehicle velocity in the travel direction. The vehicle tire size and its inflation pressure increase the soil contact area, which improves the drawbar performance. The drawbar performance of the forestry skidder was studied [101] in soft soil to develop a

multiple linear regression (MLR) and fully connected NN to predict the drawbar pull. For tractor energy management and optimization, the NN hybridized with GA, and the ANFIS was implemented to predict the drawbar pull energy [50], and net traction power [102]. The tractor's drawbar pull varies with vehicle configuration, weight, and operating mode (2WD and 4WD). Thus, a FIS was proposed to estimate a drawbar pull [103]. In addition to tire size, the drawbar pull is also influenced by the tire geometrical parameters, which can be defined with 3D footprints. Thus, NN was implemented to understand the complex relationship between 3D tire footprints and generated the drawbar pull [104].

The traction device develops a force, parallel to the travel direction and transfers to the vehicle. The traction efficiency is a ratio of output power to input power to the device [105]. It is one of the most critical factors in traction studies and relates to energy saving. Several studies were conducted in a laboratory setup with a single-wheel tester to study the influence of the traction device's operational parameters and soil properties on traction efficiency. Table 1 contains the various CI methods proposed to model the traction efficiency [106–112].

Motion resistance is an opposing force, that works against the traction device's forward motion and accounts for all energy loss unrelated to slip [105]. Motion resistance is the difference between gross traction and net traction. A series of experiments were conducted on a driven wheel in a soil bin (clay loam) [113–115] that aimed to study the motion resistance influenced by various operating parameters, and predict motion resistance with the CI methods (NN & FIS).

The tractor is a major power source in agriculture. Therefore, it is essential to understand how tractor power can be best utilized for varying field conditions for efficient operation. The tractor loses the most power at the soil–tire interface, and its performance is influenced by operational and soil/terrain parameters. Therefore, the field performance of a 75 HP tractor was evaluated [116], and NN was proposed for predicting the tractor performance as a function of soil and tractor-implement variables. Likewise, NN and AN-FIS were proposed to study the performance of tractor-implement operational parameters on traction efficiency [94] and wheel slip [117]. Specific fuel consumption is the most used and common indicator of tractor performance. Thus, NN was proposed to predict the fuel consumption of a 60 HP 2WD tractor [118].

Mobile robots and autonomous ground vehicles (AGV) are becoming popular on smart farms. Thus, the traction behavior of the ground vehicle was studied on a sloped soil bin, and NN predicted the traction, mobility, and energy requirement of the AGV [119].

#### *5.3. Tillage*

Tillage is classified into two major categories: (1) primary and (2) secondary tillage, based on purpose, tillage depth, and energy requirement. Primary tillage is an initial major soil working operation, aiming to open up any cultivable land, reduce soil strength, cover plants/residues, and rearrange soil aggregates [2]. It manipulates soil at a greater depth (15 to 30 cm), and moldboard, disk, chisel plow, and subsoiler are commonly used primary tillage tools.

Moldboard (MB) plow shatters soil, inverts furrow slices, and covers crop residues or grasses. As the plow bottom advances, it cuts, fails the ground, and forms furrow slices, which shows a periodic variation in the draft force. Therefore, a time-lagged recurrent neural network (RNN) was proposed to predict the dynamic draft as a function of one step ahead prediction [120] for various shaped tillage tools (MB, Korean, model plow). The MB plow consumes the highest energy compared to other tillage tools for a given depth [121,122]. Therefore, the researchers studied the performance of various types of MB plow in varying soil conditions for energy optimization. The developed CI methods are listed in Table 2 and explained briefly as follows: The ANFIS models were proposed for predicting the draft and specific draft of three-bottom MB plow [123]. The NN predicted the specific draft and fuel consumption of a tractor-mounted MB plow under varying operating conditions [124]. Similarly, the NN was proposed for general-purpose MB plow's draft, and energy requirement [122].



The MB plow has a sliding plow bottom that slides through the soil. The sliding friction is one of the primary reasons for the MB plow's higher draft and energy requirement. On the contrary, a disk plow is equipped with concave rolling disks, i.e., a rolling plow bottom designed to reduce friction through rolling action. The energy requirement of the disk plow is significantly lower than the MB plow. Thus, the NN was proposed to predict the disk plow draft and energy requirement [125,126].

Deep tillage (depth < 30 cm) is designed to shatter soil, breaking up hardpans and compacted soil layers to ease water and plant root movement. A chisel plow and subsoiler are mainly used for deep tillage. The chisel plow has a series of shovels or teeth spaced on a frame. Its draft requirement is comparatively low and varies with soil type and depth of operation. Hence, the NN was proposed to model the chisel plow draft using various soil textural indices [127]. TSK-type ANFIS was proposed for chisel plow draft prediction [86]. Likewise, NN was presented for modeling the chisel plow performance parameters [128]. More details on the proposed model inputs and output can be found in Table 2.

A subsoiler has a narrow straight shank to break and fracture the deep compacted soil zone at a greater depth (60–90 cm). The subsoiling demands high horsepower, ranging from 30 to 50 hp per shank [129]. Thus, to predict the draft and energy requirement of the subsoiler, the NN was presented as a function of soil parameters and operational variables [130]. The subsoiler is a non-inversion tillage tool, available in various shaped shanks, and selecting the right shank could reduce the draft [131]. The conventional straight shank requires a significantly higher draft and is often replaced with parabolic, bent leg, or paraplow shanks [132]. Therefore, the CI-based models (ANFIS, MLR, RSM) were presented for predicting the draft of three types of subsoiler shank (subsoiler, paraplow, and bent leg) [130]. Similarly, the ANFIS was proposed to predict the forces acting on paraplow having three different design configurations (bent wing with forward, backward, and without wing) [133].

Secondary tillage is performed for seedbed preparation, crop production practices, and moisture conservation. Examples of secondary tillage tools include a harrow (disk, spring or spike tooth, chisel), cultivator, and clod crushing roller. The energy requirement of secondary tillage tools is comparatively less than that of primary. The cultivator and harrow are often operated at a higher ground speed to produce finer tilth, soil pulverization, and weed control. Thus, its operational parameters (tool type, speed, depth) are often investigated to achieve finer tilth, prevent soil degradation and optimize the tillage energy. The NN predicted the draft of a cultivator, disk harrow, and MB plow in a soil bin setup [21]. The FIS was proposed for predicting the soil fragmentation resulting from a combination of primary and secondary tillage implements during the seedbed preparation [134]. The draft efficiency and soil loosening of the duck foot cultivator were predicted with the FIS in soil bin [135]. Similarly, the NN was proposed to predict the draft force of a chisel cultivator [136]. An RBF neural network was presented to simulate the soil–machine interaction of five narrow blades in field conditions [137].

Reduced tillage offers several benefits, such as reduced energy and soil disturbance over traditional tillage. Winged share is a reduced tillage tool, and the CI models (NN and FIS) were proposed for predicting the draft force of two different types of winged share tillage tools in a soil bin (loam soil) [138,139]. Likewise, a combined tillage implement is equipped with multiple tillage tools on a single frame to reduce the tractor passes. The combined tillage saves time, fuel, and energy to obtain the desired soil conditions compared to the conventional method [140,141]. Therefore, the CI models (NN and ANFIS) were proposed to predict the energy indices of the tractor-implement system during a combined tillage operation [142].

The model tool is a miniature-scale replica of an actual tool and is often studied in a laboratory environment. The NN models were developed for predicting the energy requirement of the rectangular cross-sectional model tool in a soil bin [143]. Similarly, NN was proposed to understand the design and technical insight of the plowing process of a multi-flat plate (model tool) and resulting soil fineness [144].



#### *5.4. Compaction*

Vehicular traffic is common during field operation, and it is estimated that the wheels traffic the soil more than five times a year. The vehicular traffic affects the soil structure, void ratio, and bulk density, which further influence crop yield. Therefore, the soil compaction resulting from vehicular traffic needs to be reduced or avoided. Hence, two agricultural tires were studied in a soil bin and the FIS-based models were developed to predict bulk density, penetration resistance, and soil pressure at a 20 cm depth [147].

Tire–soil contact area varies with tire parameters such as vertical load, inflation pressure, and thread type/pattern. The contact area determines the forces acting on soil and resulting stress–strain. Therefore, a series of experiments were conducted in a soil bin, and several CI models (i.e., NN, FIS, and Wavelet NN) were proposed to predict the wheel contact area, contact pressure, soil strength, and soil density based on tire parameter [148–150]. Multiple wheel passes cumulatively compact the soil. Hence, the NN was presented for predicting the penetration resistance and soil sinkage as a function of wheel pass and wheel operating parameters [151] mentioned in Table 3.

**Table 3.** Soil compaction studies that employed the CI methods.


#### *5.5. Implemented CI Methods*

A summary of CI methods proposed in a selected article (50) is presented in Figure 15. The NNs were the most frequently employed, followed by multiple linear regression, ANFIS, and FIS. The NN-based models were proposed in 36 studies (50.7%), out of which 34 studies employed a fully connected feedforward (FF) NN type (Figure 15b). Other types of NNs, such as RNNs, wavelet NNs, and RBFNs, were reported once (Figure 16a). This indicates that shallow NN with only one or two layers was sufficient to model complex soil–machine interactions (Figure 16a). In most of these studies, NNs were trained with BP or LMBP (Figure 16b). A GA-based metaheuristic was used in one such research, and dimension reduction using ICA in another.

**Figure 15.** CI methods used in soil–machine interaction studies; (**a**) soft computing methods and their frequency; (**b**) percentage share of each method.

**Figure 16.** Neural network: (**a**) types of a neural network, (**b**) training methods.

Subsequently, the FIS was implemented in a total of eight studies (11.3%). The triangular, Gaussian, and linear were observed as the most popular membership functions. The ANFIS models were proposed in eleven studies (15.5%), with the first-order TSK fuzzy inference system being the preferred approach. ANFIS models were often trained using a combination of the least-squares method and BP. The SVR models were applied in two studies, which used various kernel functions.

Additionally, traditional regression methods were implemented in thirteen (11.3%) studies. These regression methods included MLR and the standard ASABE equations (tool draft equation). These methods were usually compared with CI methods in terms of prediction accuracy.

The performances of the models were evaluated with commonly used metrics. Figure 17a shows the frequencies of their usages. As is evident from Figure 17b, CI models consistently outperformed classical approaches. Although the performance of the traditional regression method was comparatively lower in terms of model accuracy, MATLAB was the most widely used platform to implement CI-based models (Figure 17c).

#### **6. Strengths and Limitations of CI Methods**

CI models offer manifold advantages over traditional methods described earlier. The features that make these models so attractive are enumerated below.

(*i*) Data-driven models can handle copious amounts of data with relative ease [152]. With increasing data size, the corresponding growth in computational overheads is generally between linear and quadratic orders of magnitude. For instance, the number of iterations (called epochs) needed to train a neural network is fixed regardless of data size [53]. On the other hand, traditional methods regularly witness quadratic or higher growths.


It is of little surprise that CI approaches have become very popular in agricultural soil-tillage, traction domains, and many other applications.

In spite of the several attractive features that CI models offer, they have a few shortcomings as well. These are outlined below.


#### **7. Emergent Computational Intelligence Models**

The study critically analyzed the most popular CI methods found in the literature, particularly in the soil–machine interaction domain. Further, we suggest emergent CI methods that may provide better results and can be considered as alternatives to existing CI methods. Those methods are described in brief here.

#### *7.1. Deep Neural Networks*

DNNs are NN models with multiple hidden layers [169–171]. In the past few years, this class of CI models has witnessed explosive growth in popularity. DDNs have emerged as a popular tool in a wide range of applications in agriculture [172–177], where they have been used for various image recognition tasks. Unfortunately, DNNs have yet to be explored in any soil–machine interaction application.

Figure 4 illustrates the layout of such a DNN with fully connected layers. State-ofthe-art DNNs incorporate various other types of layers, including RBFN [178], SVR [179], and TSK fuzzy [180,181] layers. DNNs can be endowed with the ability to handle time series data by incorporating long short-term memory (LSTM) or gated recurrent unit (GRU) layers [176,182]. At each time step *t*, these layers can hold in memory essential features from earlier time steps (i.e., *t* − 1, *t* − 2, etc.) by means of time-delayed feedback. Such DNNs are called recurrent neural networks. An alternate to LSTM and GRU in DNNs is the attention mechanism [183], which has been applied in agriculture [184].

Although sigmoid functions are widely used as neuronal nonlinearities, the presence of a large number of layers in the DNN poses the problem of vanishing gradients [185]. This issue is addressed by using rectified linear (ReLu) units [186], which incorporate the ReLu function, *f*(*s*) = max(*s*, 0). Current training methods that are based on BP [187–189]. The Adam (adaptive momentum estimation) algorithm is currently the dominant approach to train DNNs [190]. In [191], Adam was used to successfully train DNNs for agriculture data. The use of metaheuristics in conjunction with gradient methods has been investigated [192,193].

Unlike FIS and ANFIS models, DNNs are black-box approaches, whose outputs are not readily amenable to human interpretation. However, recent studies are beginning to address this issue [164,165].

#### *7.2. Regression Trees and Random Forests*

Decision trees are CI methods that use graphical tree-based representations [194,195], with binary trees [196] being most frequently used. During training, each node in a binary tree is used to split sample pairs (**x**(*n*), *<sup>t</sup>*(*n*)) (*<sup>n</sup>* ∈ S*t*) into two subsets S<sup>L</sup> *<sup>t</sup>* and SLR *<sup>t</sup>* . A threshold *θ<sup>j</sup>* is applied to an element *xj*. Hence,

$$y = \begin{cases} \mathcal{S}\_t^\mathcal{L} = \{ n \in \mathcal{S}\_l | x\_j(n) \le \theta\_j \}; \\ \mathcal{S}\_t^\mathcal{R} = \{ n \in \mathcal{S}\_l | x\_j(n) > \theta\_j \}. \end{cases} \tag{35}$$

The threshold is computed so that at each node, the split is as evenly balanced as possible. Information theoretic and heuristic methods using values of the targets *t*(*n*) in the training dataset. Regression trees have found agricultural applications in the past few years [197,198].

Random forests are CI methods that use multiple trees to obtain outputs [199,200]. There has been a steep rise in the use of this approach for various applications in agriculture [201–211]. An excellent survey of decision trees, random forests, and other CI models has been published in [212].

#### *7.3. Extreme Learning Machines*

Extreme learning machines (ELM) are CI models that are useful in regression problems [213–215]. Although in comparison to some other CI models (NNs, RBFNs, and SVRs), ELMs have not been as widely used in other engineering domains; surprisingly, they are very popular in various agricultural applications [216–223].

An ELM is structurally equivalent to an *M* × *K* × *N* NN. The neurons in the hidden layer incorporate nonlinear activation functions in the same manner as in Equation (9). However, unlike in NNs, the hidden layer in an ELM is not fully connected to the input. The hidden weights of an ELM can be arranged as a *K* × *M* sparse matrix. These weights are assigned randomly and do not undergo any training. Only the output weights are trained using a matrix form of the pseudoinverse rule in Equation (16). This allows ELMs to be trained significantly faster than equivalent NNs. Hybrid training algorithms for ELMs have also been proposed in [224–226] for agricultural applications. DNN architectures that contain ELM layers are being investigated (cf. [227,228]).

#### *7.4. Bayesian Methods*

Bayesian methods are CI paradigms where the outcome renders itself to a probabilistic interpretation. Central to these methods is the Bayes rule. The rule can be applied to a parametric Bayesian model in the following manner,

$$\mathbf{p}(\boldsymbol{\Theta}|\mathcal{S}\_{l}) = \frac{\mathbf{p}(\mathcal{S}\_{l}|\boldsymbol{\Theta})\mathbf{p}(\boldsymbol{\Theta})}{\mathbf{p}(\mathcal{S}\_{l})}.\tag{36}$$

In this expression, p(·) is the probability of the argument. The left-hand side of Equation (36) is the posterior probability. The factors p(S*t*|**Θ**), and p(S*t*) in the right-hand side are the likelihood and the prior probability. It can be demonstrated that LASSO and ridge regularization discussed earlier in Section 3.3 are instances of Bayesian methods, where the prior probabilities follow Laplacian and Gaussian distributions.

Since the training data S*<sup>t</sup>* is independent of the model, it can be dropped from the Bayes rule. The model parameter is obtained as the one that has the highest probability, argmax**Θ**p(**Θ**|S*t*). Given any unknown input **x**, the output probability p(*y*|**x**, **Θ**) can be obtained from **Θ**. Bayesian approaches have been used in several areas of agriculture [229–232].

A Bayesian network is a specific Bayesian modeling approach that uses a graphical structure that resembles an NN [233]. Inferring the output in this model relies heavily on statistical sampling techniques [234]. Bayesian networks have been used in [184,235,236].

A mixture of Gaussians [237,238] is a Bayesian model that uses hidden variables *zi*, *i* = 1, 2, ··· , which play an intermediate role between the inputs and outputs. Given any input **x**, the output probability p(*y*) is determined as the following summation,

$$\mathbf{p}(y) = \sum\_{i} \mathbf{p}(y|z\_i)\mathbf{p}(z\_i|\mathbf{x}).\tag{37}$$

The use of such methods has begun to be explored in agriculture [38,239,240].

Gaussian process regression [241–243] is a Bayesian approach that assumes the presence of Gaussian noise. As in SVR, kernel matrices are applied in this method. Gaussian process regression has been extensively used in various applications related to agriculture [244–248].

#### *7.5. Ensemble Models*

Ensemble models are approaches that combine multiple CI models for decisionmaking [249–251]. Bagging and boosting are two commonly used ensemble approaches. Random forests and Gaussian mixtures discussed earlier in this section are ensemble models.

There has been a surge in the use of these methods in the agriculture domain [252–259]. Recent research has been directed towards using bagging [260–263] and boosting [198]. Ensembles of NNs have been investigated in [264–270]. GA and PSO have also been studied in this context [255,264].

#### **8. Future Direction and Scope**

#### *8.1. Online Traction Control*

A sensing technology has reached its maturity, and ample research material is available, where numerous sensors were employed to sense, measure, and provide real-time information on the biological material (e.g., plant, soil, and field conditions). This review article taught us that CI methods can accurately and precisely model or predict complex soil–machine interactions. Therefore, future research efforts should target automatic and real-time traction-tillage control with the help of a sensing and prediction model. The online traction control system would optimize the machine parameters in real-time to increase traction efficiencies and reduce soil compaction. For example, traction control is a standard safety feature in today's automotive vehicles. The wheel sensor senses the road conditions (icy or slippery), and the control algorithm enables the traction control to adapt to road conditions in real-time. Moreover, the planetary rover developed by NASA is also

equipped with a traction control algorithm that senses the terrain driving condition and predicts the chance of getting trapped in soil (immobility condition) [271,272].

#### *8.2. Online Tillage Control*

The agricultural soil and field conditions are dynamic and vary on a spatial and temporal scale. Hence, a single tillage tool or management system operating uniformly throughout the field would not be sufficing. Multiple factors, including soil type, texture, structure, moisture, field topography, slope, and crop rotation, play a vital role when deciding which implement is best for the field. The current tillage management approach involves employing a single tillage tool for the entire area. The soil moisture is the only parameter checked before performing the tillage operation. Therefore, future research should develop variable depth, variable-intensity, and adaptive tillage implements that can be controlled in real-time. This site-specific tillage management would collect real-time information on soil and operating terrain, and CI models would serve as decision-support tools, creating a fully automated tillage management system. Site-specific tillage has excellent potential since the intensity of the operation is adapted to the local needs, which can dramatically improve tillage. Recently, adaptive tillage has become a significant research focus, where the tillage tool adapts or changes its shape in real-time [273,274].

**Author Contributions:** Conceptualization, C.B., S.D. and D.F.; Methodology, C.B., S.D. and D.F.; Formal analysis, C.B. and S.D.; Resources, C.B., D.M.F., S.D. and D.F.; Data curation, C.B., D.M.F. and S.D.; Writing—original draft preparation, C.B. and S.D.; Writing—review and editing, C.B., S.D. and D.F.; Supervision, S.D. and D.F.; project administration, S.D. and D.F.; funding acquisition, S.D. and D.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was financially supported by the National Institute of Food and Agriculture (NIFA-USDA). The project is titled "National Robotics Initiative (NRI): Multi-Robot Farming on Marginal, Highly Sloped Lands" Project number- KS4513081.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
