*Article* **Predicting the 305-Day Milk Yield of Holstein-Friesian Cows Depending on the Conformation Traits and Farm Using Simplified Selective Ensembles**

**Snezhana Gocheva-Ilieva <sup>1,\*</sup>, Antoaneta Yordanova <sup>2</sup> and Hristina Kulina <sup>1</sup>**


**Abstract:** In animal husbandry, it is of great interest to determine and control the key factors that affect the production characteristics of animals, such as milk yield. In this study, simplified selective tree-based ensembles were used for modeling and forecasting the 305-day average milk yield of Holstein-Friesian cows, depending on 12 external traits and the farm as an environmental factor. The preprocessing of the initial independent variables included their transformation into rotated principal components. The resulting dataset was divided into learning (75%) and holdout test (25%) subsamples. Initially, three diverse base models were generated using Classification and Regression Trees (CART) ensembles and bagging and arcing algorithms. These models were processed using the developed simplified selective algorithm based on the index of agreement. An average reduction of 30% in the number of trees of selective ensembles was obtained. Finally, by separately stacking the predictions from the non-selective and selective base models, two linear hybrid models were built. The hybrid model of the selective ensembles showed a 13.6% reduction in the test set prediction error compared to the hybrid model of the non-selective ensembles. The identified key factors determining milk yield include the farm, udder width, chest width, and stature of the animals. The proposed approach can be applied to improve the management of dairy farms.

**Keywords:** machine learning; rotation CART ensemble; bagging; boosting; arcing; simplified selective ensemble; linear stacked model

**MSC:** 62-11; 62P30

#### **1. Introduction**

Numerous studies have found associative connections between external characteristics of dairy cows and their milk production [1–3]. The 305-day milk yield is dependent on many other factors, such as the genetic potential of the animals, fertility, health status, environmental comforts, etc. Therefore, establishing which connections between the various factors determine a given productive trait and predicting its values, including milk yield, is an important research issue for improving economic profitability and dairy farm management.

In dairy science, many studies are based on modeling of collected empirical data using modern computer-based statistical techniques. These techniques enable determination of not only linear-type dependencies using standard statistical approaches, such as multiple linear regression (MLR), but also complex and hidden local dependencies between examined variables with significantly better predictive ability. A review paper [4] showed that the health and productivity of milk cows depend on various parameters and that numerous researchers have recognized the potential of machine learning (ML) as a powerful tool in

**Citation:** Gocheva-Ilieva, S.; Yordanova, A.; Kulina, H. Predicting the 305-Day Milk Yield of Holstein-Friesian Cows Depending on the Conformation Traits and Farm Using Simplified Selective Ensembles. *Mathematics* **2022**, *10*, 1254. https:// doi.org/10.3390/math10081254

Academic Editor: Ripon Kumar Chakrabortty

Received: 28 February 2022 Accepted: 6 April 2022 Published: 11 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

this field. In [5], MLR, random forest (RF), and artificial neural networks (ANN) were used to determine dairy herd improvement metrics, with the highest impact on the first-test-day milk yield of primiparous dairy Holstein cows. MLR and ANN were used in [6] for 305-day milk yield prediction. In [7], the decision tree (DT) method was used to study lactation milk yield for Brown Swiss cattle, depending on productivity and environmental factors. The live body weight of Pakistani goats was predicted in [8] depending on morphological measurements using classification and regression trees (CART), Chi-square Automatic Interaction Detector (CHAID), and multivariate adaptive regression splines (MARS). In [9], DT was used to assess the relationship between the 305-day milk yield and several environmental factors for Brown Swiss dairy cattle. Fenlon et al. [10] applied logistic regression, generalized additive models, and ensemble learning in the form of bagging to model milk yield depending on age, stage of suckling, calving, and energy balance measures related to the animals. Four ML methods were tested by Van der Heide et al. [11]: majority voting rule, multiple logistic regression, RF, and Naive Bayes for predicting cow survival as a complex characteristic, which combines variables such as milk production, fertility, health, and environmental factors. The authors of [12] studied cattle weight using active contour models and bagged regression trees.

Other publications in the field of study related to dairy cows and the use of data mining and ML methods are [13–16]. In a broader aspect, predictive ML models and algorithms are essential to make intelligent decisions for efficient and sustainable dairy production management using information, web information, and expert systems [17]. As stated in [17], modern dairy animals are selected for physical traits that directly or indirectly contribute to high milk production. In particular, this motivates the development of models and tools for assessing and forecasting expected milk based on a limited number of easily measurable factors, such as the main external characteristics of the animals.

A new approach based on ensemble methods using bagging, boosting, and linear stacking of their predictions was developed in this paper to increase the predictive ability of the models. The essential part of modeling is the construction of selective ensembles, which reduce the number of trees in the ensemble and, at the same time, improve the performance of the model. Many researchers are actively studying this problem. The complete solution to the problem of choosing a subset of trees in the ensemble to minimize generalization errors comes down to 2<sup>*tn*</sup> − 1 possibilities, where *tn* is the number of trees. Such an algorithm is NP-complete [18]. For this reason, various heuristic algorithms for pruning and building selective ensembles are being developed. Some of the well-known results on selective ensembles of decision trees and ANN are based on genetic algorithms [19,20]. In [19], the resulting ensemble model is a weighted combination of component neural networks, the weights of which are determined by the developed algorithm so as to reduce the ensemble size and improve the performance. The algorithm selects the trees with weights greater than a preset threshold to form an ensemble with a reduced number of trees. This algorithm was further modified and applied to build decision tree selective ensembles in [20]. A significant reduction in the number of trees was achieved, from 20 to an average of 8 trees for 15 different empirical datasets. It is also believed that to obtain more efficient models, the components of an ensemble must be sufficiently different [21–23]. Applied results in this area can be found in [24–26] and others.

This paper contributes to statistical data modeling and machine learning by developing a framework based on a new heuristic algorithm for constructing selective decision tree ensembles. The ensembles are built with rotation CART ensembles and bagging (EBag), as well as rotation-adaptive resampling and combining (Arcing) algorithms. The simplified selective ensembles are built from the obtained models based on the index of agreement. This approach not only reduces the number of trees in the ensemble but also increases the index of agreement and the coefficient of determination and reduces the root mean square error (RMSE) of the models. In addition, combinations by linear stacking of models were obtained that satisfy four diversity criteria. The proposed approach was applied to predict the 305-day milk yield of Holstein-Friesian cows depending on the conformation

traits of the animals and their breeding farm. Comparative data analysis with the real-world datasets used showed that the constructed selective ensembles have higher performance than the models with non-selective ensembles.

#### **2. Materials and Methods**

All measurements of the animals were performed in accordance with the official laws and regulations of the Republic of Bulgaria: Regulation No. 16 of 3 February 2006 on protection and humane treatment in the production and use of farm animals, the Regulation amending Regulation No. 16 (last updated 2017), and the Veterinary Law (Chapter 7: Protection and Humane Treatment of Animals, Articles 149–169). The measurement procedures were carried out in compliance with Council Directive 98/58/EC concerning the protection of animals kept for farming purposes. All measurements and data collection were performed by qualified specialists from the Department of Animal Husbandry—Ruminants and Dairy Farming, Faculty of Agriculture, Trakia University, Stara Zagora, Bulgaria, with methodologies approved by the International Committee for Animal Recording (ICAR) [27]. The data do not involve physical interventions, treatments, experiments with drugs, or other activities harmful or dangerous to animals.

#### *2.1. Description of the Analyzed Data*

In this study, we used measurements from *n* = 158 Holstein-Friesian cows from 4 different farms located within Bulgaria. One productive characteristic was recorded: 305-day milk yield. Table 1 provides a description of the initial variables used. The collection of data and the choice of variables were based on the following considerations. It is well known from practice and research that the form and level of development of conformation traits depend on heritability and phenotypic characteristics of animals and influence their productivity, health, and longevity. The linear traits used were measured and evaluated for the animals according to the recommendations of the International Agreement on Recording Practices for conformation traits of ICAR (pp. 199–214, [27]). Our dataset of approved standard traits includes stature, chest width, rump width, rear leg set, rear legs (rear view), foot angle, and locomotion. Hock development and bone structure are representatives of the group of common standard traits. In addition, three other traits eligible under ICAR rules were recorded: foot depth, udder width, and lameness. For the present study, from each group, we selected those traits that have the highest coefficient of heritability and correlation with the 305-day milk yield, established as per Bulgarian conditions in [28,29]. The dataset includes the variable *Farm* to account for growing conditions, the environment, the influence of the herd, and other implicit and difficult-to-measure factors.

**Table 1.** Description of the variables used in statistical analyses.


External traits are described individually as ordinal variables. This scale complies with the standards of the ICAR [27]. The examined traits have two types of coding. Two traits (the variables *RLSV* and *FootA*), whose two biological extremes both represent disadvantages, were transformed to a ranking scale from 1 to 5 with ascending positive evaluation of the trait, in accordance with the ICAR evaluation instructions. All other traits were measured linearly from one biological extreme to the other. The range of scores is from 1 to 9, and improvement of the characteristic corresponds to a higher value. The variable *Farm* is of categorical type, with 4 different values. The distribution of the number of cows across the farms is 54, 32, 34, and 38.

It should be noted that, in the general case, the relationships between the variables for exterior traits and the productive and phenotypic characteristics of Holstein cattle are considered to be nonlinear (see, for example, [30]). Therefore, the machine learning approach is better suited to reveal the deep multidimensional dependencies between them.

Tables 1 and 2 list notations used in this paper.


**Table 2.** Nomenclature of the notations 1.

<sup>1</sup> All variable names are in italic style.

#### *2.2. Modeling Methods*

Statistical analyses of the data were performed using principal component analysis (PCA), factor analysis, and ensemble methods EBag and ARC. We used EBag and ARC as ensemble methods based on bagging and boosting, respectively. The main types of ensemble methods, their characteristics, advantages, and disadvantages are discussed in [21,23,32].

#### 2.2.1. Principal Component Analysis and Exploratory Factor Analysis

PCA is a statistical method for transforming a set of correlated variables into so-called principal components (PCs) [33]. The number of extracted PCs is equal to the number of variables. When the data include several strongly correlated variables, their linear combination can be replaced by a new common artificial variable through factor analysis. In this case, the number of initial variables is reduced at the cost of a certain loss in the total variance explained by the new sample. Following the rotation procedure, the resulting rotated factor variables are non-correlated or correlate weakly with one another. These can be used in subsequent statistical analyses. Applications of PCA can be found in [34,35].
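To make the transformation concrete, the following is a minimal pure-Python sketch on synthetic data. The two-variable closed form for the component directions is a textbook special case used here for illustration only; it is not the Promax-rotated 11-component solution obtained in the study, and the trait names are hypothetical.

```python
import math
import random

random.seed(1)

# Synthetic stand-ins for two correlated, standardized conformation traits.
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]
r = 0.6  # target population correlation between the two traits
x1 = z
x2 = [r * zi + math.sqrt(1 - r * r) * ei for zi, ei in zip(z, e)]

def corr(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

# For two standardized variables with correlation r, the principal components
# point along (x1 + x2) and (x1 - x2), carrying variances 1 + r and 1 - r.
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(x1, x2)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(x1, x2)]

print(round(corr(x1, x2), 2))    # strongly correlated inputs
print(round(corr(pc1, pc2), 2))  # near-uncorrelated components
```

The same idea, generalized to eigen-decomposition of the full correlation matrix plus a rotation step, underlies the 12-variable transformation described above.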

#### 2.2.2. CART Ensemble and Bagging (EBag)

An ensemble is a model that includes many single models (called components) of the same type. In our case, the components are decision trees constructed using the powerful ML and data-mining CART method [36]. CART is used for regression and classification of numerical, ordinal, and nominal datasets. For example, let an initial sample of *n* observations {Y, X} be given, where Y = {*y*1, *y*2, ..., *yn*} is the target variable and X = {*X*1, *X*2, ..., *Xp*}, *p* ≥ 1, are independent variables. The single CART model is a binary tree structure, *T*, obtained by recursively dividing the initial dataset into disjoint subsets called nodes of the tree. The predicted value for each case in a node, *τ*ℓ ∈ *T*, is the mean value of Y over the cases in *τ*ℓ. The root of the tree contains all the initial observations, and its prediction is the mean value of the sample.

For each splitting of a given node, *τ*ℓ, the algorithm selects a predictor, *Xk*, and a threshold value, *Xk*,*θ*, from all variables or from a pool of the variables, X, and the cases in *τ*ℓ, so as to minimize some preselected type of model prediction error. The division of the cases of *τ*ℓ is performed according to the rule: if *Xk*,*i* ≤ *Xk*,*θ*, the observation with index *i* is assigned to the left child node of *τ*ℓ; if *Xk*,*i* > *Xk*,*θ*, it is assigned to the right child node of *τ*ℓ. The growth of the tree is limited and stopped by preset hyperparameters (depth of the tree, accuracy, etc.). Thus, all initial observations are classified into the terminal nodes of the tree. For a given training sample, the CART model function can be written as [33]:

$$\hat{\mu}(\mathbf{X}) = \sum_{\tau_{\ell} \in T} \hat{Y}(\tau_{\ell})\, I_{[\mathbf{X} \in \tau_{\ell}]} = \sum_{\ell=1}^{m} \hat{Y}(\tau_{\ell})\, I_{[\mathbf{X} \in \tau_{\ell}]} \tag{1}$$

where:

$$\hat{Y}(\tau_{\ell}) = \overline{Y}(\tau_{\ell}) = \frac{1}{n(\tau_{\ell})} \sum_{\mathbf{X}_{i} \in \tau_{\ell}} y_{i}, \qquad I_{[\mathbf{X} \in \tau_{\ell}]} = \begin{cases} 1, & \mathbf{X} \in \tau_{\ell} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where *m* is the number of terminal nodes of the tree, and *n*(*τ*ℓ) is the number of observations in node *τ*ℓ. For each case *i*, *y*ˆ*i* = *μ*ˆ(X*i*) is the predicted value for the observation, X*i*.

An example of a CART model with 2 independent variables and 5 nodes is shown in Figure 1.

**Figure 1.** Example of a single-regression CART tree with two predictors and five terminal nodes.
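The node-splitting rule and Equations (1) and (2) can be illustrated with a minimal pure-Python sketch: one binary split chosen to minimize the summed squared error, with terminal-node predictions equal to the node means. The toy data are hypothetical; the study itself uses the SPM CART engine.

```python
# Minimal regression-tree sketch (a depth-1 tree, i.e., a "stump"):
# choose the split that minimizes SSE; predict with terminal-node means.

def mean(v):
    return sum(v) / len(v)

def sse(v):
    m = mean(v)
    return sum((yi - m) ** 2 for yi in v)

def best_split(x, y):
    """Scan candidate thresholds of one predictor; return (theta, left mean, right mean)."""
    best = None
    for theta in sorted(set(x))[:-1]:  # the maximum value cannot be a threshold
        left = [yi for xi, yi in zip(x, y) if xi <= theta]
        right = [yi for xi, yi in zip(x, y) if xi > theta]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, theta, mean(left), mean(right))
    return best[1:]

# Hypothetical ordinal trait score vs. milk yield: yield jumps above a score of 5.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [6000, 6100, 5900, 6050, 6000, 7400, 7500, 7450, 7600]
theta, mu_left, mu_right = best_split(x, y)
print(theta)              # chosen threshold: 5
print(mu_left, mu_right)  # node means: 6010.0 7487.5
```

A full CART tree applies `best_split` recursively to each child node over all predictors until the stopping hyperparameters are reached.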

CART ensembles and bagging is an ML ensemble method for classification and regression proposed by Leo Breiman in [37]. For ensembles, the training set is perturbed repeatedly to generate multiple independent CART trees, and then the predictions are averaged by simple voting. In this study, we used the software engine CART ensembles and bagger included in the Salford Predictive Modeler [38].

In order to compile the ensemble, the researcher sets the number of trees, type of cross-validation, number of subsets of predictors for the splitting of each branch of each tree, limits for the minimum number of cases per parent and child node, and some other hyperparameters. The method's main advantage is that it leads to a dramatic decrease in test-set errors and a significant reduction in variance [39].
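A hedged sketch of the bagging procedure described above: draw bootstrap resamples of the training set, fit one weak learner per resample, and average the predictions. The `stump_fit` weak learner and the toy data are illustrative stand-ins, not the SPM bagger engine.

```python
import random

random.seed(7)

def stump_fit(x, y):
    """Weak learner: the single best split minimizing summed squared error."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((t - m) ** 2 for t in v)
    if len(set(x)) == 1:  # degenerate resample: fall back to the mean
        m = sum(y) / len(y)
        return lambda q: m
    best = None
    for theta in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= theta]
        right = [yi for xi, yi in zip(x, y) if xi > theta]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, theta, sum(left) / len(left), sum(right) / len(right))
    _, theta, ml, mr = best
    return lambda q: ml if q <= theta else mr

def bagged_fit(x, y, tn=25):
    """Grow tn weak learners on bootstrap resamples; predict by averaging (voting)."""
    n = len(x)
    models = []
    for _ in range(tn):
        idx = [random.randrange(n) for _ in range(n)]  # bootstrap resample
        models.append(stump_fit([x[i] for i in idx], [y[i] for i in idx]))
    return lambda q: sum(m(q) for m in models) / len(models)

# Hypothetical trait score vs. milk yield (same toy pattern as before).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [6000, 6100, 5900, 6050, 6000, 7400, 7500, 7450, 7600]
ens = bagged_fit(x, y)
print(round(ens(3)), round(ens(8)))
```

Averaging over resampled trees is what produces the variance reduction noted in [39]: individual stumps vary with the resample, but their mean prediction is stable.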

During generation, the tree components of the ensemble are characterized by considerable differences in their performance and, individually, do not have high statistical indices. For this reason, in the literature, they are called "weak learners". However, after averaging, the statistics improve, and the final ensemble model (for classification or regression) is more efficient. Component trees that worsen the ensemble's statistics by any statistical measure are called "negative" trees. Various heuristic algorithms have been developed to reduce the impact of these trees [19,26].

#### 2.2.3. Adaptive Resampling and Combining Algorithm (Arcing)

Another approach that uses ensemble trees is based on the boosting technique first proposed in [40]. A variant of boosting is the Arcing algorithm developed and studied by Breiman in [39], also known as Arc-x4. The family of Arc-x(h) algorithms is differentiated from Adaboost [40] by the simpler weight updating rule in the form:

$$w_{t+1}(V_i) = \frac{1 + m(V_i)^h}{\sum_{i} \left(1 + m(V_i)^h\right)} \tag{3}$$

where *m*(*Vi*) is the number of misclassifications of instance *Vi* by the models generated in the previous iterations 1, 2, ... , *t*, and *h* is an integer. In this way, the ensemble components are generated sequentially, with resampling weighted toward the cases that yielded bad predictions up to the current step, *t*. Breiman showed that Arcing had error performance comparable to that of Adaboost.

Combining multiple models and applying any of the two methods—bagging or arcing—leads to a significant variance reduction, whereby arcing is more successful than bagging in test-set error reduction [39].
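The weight update in Equation (3) can be sketched directly; *h* = 4 corresponds to Arc-x4. The miscount vector below is a toy illustration, not data from the study.

```python
# Arc-x4 resampling weights (Equation (3)): instances misclassified more often
# by the trees built so far receive sharply larger resampling probabilities.
def arc_weights(miscounts, h=4):
    """miscounts[i] = number of times instance i was misclassified in steps 1..t."""
    raw = [1 + m ** h for m in miscounts]
    total = sum(raw)
    return [r / total for r in raw]

w = arc_weights([0, 0, 1, 3])  # instance 3 was wrong 3 times -> 1 + 3**4 = 82
print([round(wi, 3) for wi in w])  # [0.012, 0.012, 0.023, 0.953]
```

Because of the fourth power, even a few repeated mistakes make an instance dominate the next bootstrap draw, which is what forces subsequent trees to focus on the hard cases.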

#### 2.2.4. Proposed Simplified Selective Ensemble Algorithm

To improve predictive performance, we further developed the algorithm for building simplified selective ensembles that we recently proposed in [41] for time series analysis. In this study, we used it in the case of a non-dynamic data type. We applied the algorithm separately to two types of ensembles of CART trees: with bagging and with boosting. The simplified selective algorithm is presented here for the case of EBag. It consists of the following steps:


$$SSEB_{tn-k} = \frac{tn \cdot EB_{tn} - \sum_{j=1}^{k} ss_j}{tn - k}, \quad k = 1, 2, \dots, s. \tag{4}$$

In this way, removing the "negative" trees improves the IA of the initial EBag model and generates a series of new ensemble models for *k* = 1, 2, ... , *s*. The maximally simplified selective ensemble is obtained at *k* = *s*.

To implement the simplified selective algorithm, we used the generated EBag and ARC component trees using the ensembles and bagger engine of SPM software [38] and the authors' code in Wolfram Mathematica [42]. A detailed description of the simplified selective algorithm is given in Algorithm 1.
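Since the formal listing of Algorithm 1 is not fully reproduced in this text, the following hedged sketch illustrates only the core selection idea: a component tree is treated as "negative" when the ensemble's index of agreement (IA) improves once that tree's predictions are removed from the average. The toy predictions and function names are illustrative, not the authors' Mathematica implementation.

```python
# Illustrative reconstruction of the selective-ensemble idea based on IA.
def ia(P, Y):
    """Index of agreement d, per the paper's definition in (5)."""
    Pm = sum(P) / len(P)
    Ym = sum(Y) / len(Y)
    num = sum((p - y) ** 2 for p, y in zip(P, Y))
    den = sum((abs(p - Pm) + abs(y - Ym)) ** 2 for p, y in zip(P, Y))
    return 1 - num / den

def ensemble_ia(tree_preds, obs, keep):
    """IA of the ensemble averaged over the kept component trees."""
    avg = [sum(tree_preds[j][i] for j in keep) / len(keep) for i in range(len(obs))]
    return ia(avg, obs)

obs = [10, 12, 11, 13, 12, 14]
tree_preds = [
    [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],  # tree close to the observations
    [10.5, 11.5, 11.0, 13.5, 12.0, 14.0],  # another reasonable tree
    [12.0, 12.0, 12.0, 12.0, 12.0, 12.0],  # uninformative constant tree
]
all_idx = set(range(len(tree_preds)))
d_full = ensemble_ia(tree_preds, obs, all_idx)
negative = sorted(j for j in all_idx
                  if ensemble_ia(tree_preds, obs, all_idx - {j}) > d_full)
selective = all_idx - set(negative)
print(negative)                                         # the constant tree is "negative"
print(ensemble_ia(tree_preds, obs, selective) > d_full) # IA improves after removal
```

In the paper's notation, removing the cumulative sums of such negative trees from the averaged prediction is what Equation (4) expresses for *k* = 1, 2, ... , *s*.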

**Algorithm 1**: Simplified selective ensemble


#### 2.2.5. Methodology

In this study, regression models were constructed to determine the influence of the observed external characteristics of Holstein-Friesian cows and the farm on milk quantity and to predict the values of 305-day milk yield. First, EBag and arcing ensembles and corresponding simplified selective models were built, and their predictions were then combined linearly in stacked models according to the stacked generalization paradigm developed by Wolpert [43].

Our study was carried out under the following framework (see also Figure 2):


**Figure 2.** Framework of the study.

For application of the stacking paradigm in particular, the number of base models at the first stage has to be between 3 and 8. In addition, these models need to be differentiated from each other according to some diversity criteria [21–23].
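The linear stacking step can be sketched as follows. The base-model prediction vectors `p1` and `p2` are hypothetical stand-ins for, e.g., a bagged-ensemble and an arcing prediction; the meta-model weights are fit by ordinary least squares (two weights, no intercept, solved with Cramer's rule for compactness).

```python
# Toy sketch of stacked generalization: a linear meta-model over base predictions.
def stack_weights(p1, p2, y):
    """Least-squares weights for a two-model linear stack via the normal equations."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

y  = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]   # observed target (toy numbers)
p1 = [10.4, 11.8, 11.2, 12.8, 12.1, 13.9]   # e.g., a bagged-ensemble prediction
p2 = [ 9.8, 12.3, 10.7, 13.3, 11.8, 14.2]   # e.g., an arcing prediction
w1, w2 = stack_weights(p1, p2, y)
stacked = [w1 * a + w2 * b for a, b in zip(p1, p2)]
rmse = (sum((s - t) ** 2 for s, t in zip(stacked, y)) / len(y)) ** 0.5
print(round(w1, 2), round(w2, 2))
print(round(rmse, 3))
```

Because the least-squares combination can always reproduce either base model alone (weights (1, 0) or (0, 1)), the stacked RMSE on the training data never exceeds that of the best single base model; the gain on holdout data depends on the diversity of the base models, which is why the diversity criteria above matter.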

#### 2.2.6. Evaluation Measures

The quality of the built models was assessed and compared using standard measures of prediction accuracy: root mean squared error (RMSE), mean absolute percentage error (MAPE), goodness-of-fit measure (coefficient of determination *R*2), and index of agreement (IA) *d* [31], defined as follows:

$$\begin{aligned} \text{RMSE} &= \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left(P_{k} - Y_{k}\right)^{2}}, \qquad \text{MAPE} = \frac{100}{n} \sum_{k=1}^{n} \left|\frac{P_{k} - Y_{k}}{Y_{k}}\right|, \\ R^{2} &= \frac{\left\{\sum_{k=1}^{n} \left(P_{k} - \overline{P}\right) \left(Y_{k} - \overline{Y}\right)\right\}^{2}}{\sum_{k=1}^{n} \left(P_{k} - \overline{P}\right)^{2} \cdot \sum_{k=1}^{n} \left(Y_{k} - \overline{Y}\right)^{2}}, \qquad \text{IA} = d = 1 - \frac{\sum_{k=1}^{n} \left(P_{k} - Y_{k}\right)^{2}}{\sum_{k=1}^{n} \left(\left|P_{k} - \overline{P}\right| + \left|Y_{k} - \overline{Y}\right|\right)^{2}} \end{aligned} \tag{5}$$

where *Yk* and *Y* are the values and the mean of the dependent variable, *Y*, respectively; *Pk* and *P* are the predicted values and their mean, respectively; and *n* is the sample size. Among these measures, a good predictive model should have a value close to 0 for RMSE and MAPE and a value close to 1 for *R*<sup>2</sup> and IA. IA is not a measure of correlation or association in the formal sense but a measure of the degree to which a model's predictions are error-free [31].
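The four measures in (5) can be transcribed directly; the numbers below are toy values for illustration only.

```python
# Direct pure-Python transcription of Equation (5).
def metrics(P, Y):
    n = len(Y)
    Pm = sum(P) / n
    Ym = sum(Y) / n
    rmse = (sum((p - y) ** 2 for p, y in zip(P, Y)) / n) ** 0.5
    mape = 100 / n * sum(abs((p - y) / y) for p, y in zip(P, Y))
    num = sum((p - Pm) * (y - Ym) for p, y in zip(P, Y)) ** 2
    den = sum((p - Pm) ** 2 for p in P) * sum((y - Ym) ** 2 for y in Y)
    r2 = num / den
    ia = 1 - sum((p - y) ** 2 for p, y in zip(P, Y)) / \
         sum((abs(p - Pm) + abs(y - Ym)) ** 2 for p, y in zip(P, Y))
    return rmse, mape, r2, ia

Y = [10, 12, 11, 13, 12, 14]          # observed (toy numbers)
P = [10.5, 11.5, 11.0, 13.5, 12.0, 14.0]  # predicted (toy numbers)
rmse, mape, r2, ia = metrics(P, Y)
print(round(rmse, 3), round(mape, 2), round(r2, 3), round(ia, 3))
```

Note that IA rewards small squared errors relative to the total spread of predictions and observations around their means, which is why it reacts to the removal of "negative" trees even when *R*<sup>2</sup> changes little.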

Furthermore, the nonparametric Wilcoxon signed-rank test (WSRT) is used to compare diversity between the predictive models [44]. This test does not assume that the data follow the normal distribution.

#### **3. Results and Discussion**

#### *3.1. Data Preprocessing*

Table 3 shows the results of the descriptive statistics of the initial variables from Table 1. We see that the values of skewness and kurtosis for all variables are close to zero, and we can assume that the distribution of all variables is close to normal.


**Table 3.** Descriptive statistics of the measured data 1.

<sup>1</sup> Std. Err. Skewness is 0.193; for *Milk\_miss*40, 0.223; for *Milk*\_40, 0.374. Std. Err. Kurtosis is 0.384; for *Milk\_miss*40, 0.442; for *Milk*\_40, 0.733.

#### *3.2. PCA Results*

During the initial data processing, multicollinearity was found between the considered 12 independent variables for conformation traits from Table 1. In order to reduce the influence of multicollinearity and improve the accuracy of the regression models, these 12 initial variables were transformed into independent variables using exploratory factor analysis and PCA [33]. The goal is to retain information and preserve the total variance explained following this transformation as much as possible. The basic assumptions for applying this procedure are fulfilled, namely: close to a normal distribution of the 12 variables and a small determinant of their correlation matrix, det = 0.019 ≈ 0. In addition, the adequacy verification of factor analysis indicates that the Kaiser–Meyer–Olkin (KMO) measure of sampling adequacy is 0.658 > 0.5, and the significance of Bartlett's test of sphericity is Sig. = 0.000.

With the help of nonparametric Spearman's rho statistics *R*, the following correlation coefficients were found between the dependent variable, *Milk*305, and the 12 measured variables, respectively: with the variable *UdderW*, *R* = 0.685; with *Stature*, *R* = 0.501; with *ChestW*, *R* = 0.492; and with *Bone*, *R* = 0.343. Other significantly correlated Spearman's rho coefficients are: *R*(*Stature*, *UdderW*) = 0.558, *R*(*Stature*, *RumpW*) = 0.508, *R*(*Stature*, *ChestW*) = 0.466, *R*(*Bone*, *Stature*) = 0.466, and *R*(*Lameness*, *Locom*) = −0.929. All correlation coefficients are significant at the 0.01 level (2-tailed). Research into this type of linear correlation is a known approach, including for external traits [28,29]. This often leads to establishing both positive and negative linear correlations (e.g., *R*(*Lameness*, *Locom*) = −0.929). The latter can lead to an inaccurate interpretation of the influence of some external traits, the interactions of which are primarily nonlinear and difficult to determine [30].
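Spearman's rho used above is the Pearson correlation computed on (tie-averaged) ranks. A small pure-Python sketch on hypothetical 1–9 trait scores, not the study's data:

```python
# Spearman's rho as the Pearson correlation of ranks (average ranks for ties).
def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied block (1-based)
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

udder_w = [3, 5, 4, 7, 6, 8, 5, 9]                       # hypothetical 1-9 scores
milk305 = [5800, 6400, 6100, 7300, 6900, 7600, 6500, 7900]  # hypothetical yields
print(round(spearman(udder_w, milk305), 3))
```

Because it works on ranks, the coefficient captures monotonic association without assuming linearity or normality, which suits ordinal trait scores.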

The next step is to conduct factor analysis. In our case, 12 PCs were preset for extraction using the PCA method. Due to the strong negative correlation *R*(*Lameness*, *Locom*) = −0.908, these two variables were grouped in a common factor. This resulted in 11 factors extracted from the 12 independent variables. The factors were rotated using the Promax method. The resulting rotated matrix of factor loadings is shown in Table 4. The extracted 11 factor-score variables are very well differentiated. We denote them by *PC*1, *PC*2, ... , *PC*11. These 11 variables account for 99.278% of the total variance of the independent continuous variables. The residual 0.722% of the variance can be ignored. The correspondence between the initial 12 linear traits and the resulting 11 *PC*s is given in Table 4. The coefficients of the factor loadings are sorted by size, and coefficients with an absolute value below 0.1 are suppressed [33].


**Table 4.** Rotated pattern matrix with 11 PCs generated using Promax 1.

<sup>1</sup> Extraction method: principal component analysis; Rotation method: Promax with Kaiser normalization; rotation converged in six iterations.

Considering that the coefficients along the main diagonal in the rotated pattern matrix of Table 4 are equal to 1 or almost 1, in subsequent analyses, we can interpret each generated factor as a direct match with the corresponding initial variable, except *PC*1, which groups *Locom* and *Lameness*.

#### *3.3. Building and Evaluation of Base Models*

To model and predict the milk yield dataset, *Milk\_miss*40, we used the eleven *PC*s and *Farm* variables as predictors. The aim is to build between 3 and 8 base models that meet the diversity requirement, as recommended in the stacking paradigm [22,43,45]. In this study, we set the following four diversity criteria:


#### 3.3.1. CART Ensembles and Bagging and Simplified Selective Bagged Ensembles

First, numerous CART-EBag models with different numbers of component trees (*tn* = 10, 15, 20, . . . , 60) were built. The hyperparameters were varied as follows: the ratio of minimum cases in the parent node to minimum cases in the child node took the values 14:7, 10:5, 8:4, and 7:4, and cross-validation was varied over 5-fold, 10-fold, and 20-fold. Of these models, two ensemble models, *EB*15 and *EB*40, were selected, with *tn* = 15 and *tn* = 40 trees. A subsequent increase in the number of trees in the ensemble and further tuning of the hyperparameters led to a decrease in the statistical indices of the ensembles. These two models were used to generate selective ensembles according to the algorithm described in Section 2.2.4. Four negative trees were removed from model *EB*15. The resulting simplified selective ensemble with 11 component trees is denoted as *SSEB*11. Accordingly, for the second model, *EB*40, 15 negative trees were identified, and after their removal, model *SSEB*25 with 25 component trees was obtained.

The analysis of the statistical indices of the simplified selective ensembles revealed some special dependencies. We demonstrate the main ones for the components of the *EB*40 model. Figure 3a illustrates the values of *dj*, *j* = 1, 2, ... , 40, calculated for all component trees, compared against the *dE* of the initial ensemble. Values greater than *dE* correspond to negative trees. Figure 3b–d show the change in the statistical indices for the generated selective models, *SSEB*40−*k*, *k* = 1, 2, ... , 15, obtained from *EB*40 after the removal of the cumulative sums of negative trees in (4).

Figure 3b shows that the IA and *R*<sup>2</sup> curves of the ensembles *SSEB*40−*k*, *k* = 1, 2, ... , *s*, increase monotonically with the removal of each subsequent negative tree, *Tj*<sup>−</sup>, with the values of *R*<sup>2</sup> increasing faster. The RMSE behaves inversely and decreases monotonically with increasing *k*. We found that with the removal of each subsequent negative tree, all statistics in (5) improve, excluding MAPE. In our case, for the selected *SSEB*25 model and *Milk*305, IA increases by 0.5%, *R*<sup>2</sup> increases by 1.7%, RMSE is reduced by 12.8%, and MAPE is reduced by 6.6% compared to the initial ensemble, *EB*40 (see Section 3.3.4).

#### 3.3.2. Arcing and Simplified Selective Arcing Models

Numerous ARC models with different hyperparameters were built by varying the number of component trees: *tn* = 5, 10, ... , 30. The ratio of minimum cases in the parent node to minimum cases in the child node took the values 14:7, 10:5, 8:4, 7:4, and 6:3, and cross-validation was varied over 5-fold, 10-fold, and 20-fold. One model with 10 components, denoted as *AR*10, was selected from the obtained ARC models; it satisfies the diversity criteria C1, ... , C4 with respect to *EB*15 and *EB*40. This model was used to generate a selective ensemble with nine component trees, denoted by *SSAR*9.

#### 3.3.3. Diversity of the Selected Base Models and Their Hyperparameters

The diversity criteria between the base models were checked using a two-related-samples WSRT. The resulting statistics are given in Table 5. Because they are all significant at a level of *α* = 0.05, we can assume that the selected base models are different [44].


**Table 5.** Test statistics for diversity verification among the selected base models a.

<sup>a</sup> Wilcoxon signed ranks test. <sup>b</sup> Based on negative ranks.

Table 6 shows the relevant hyperparameters of the base models in the following two groups:


The number of variables for splitting each node on each tree was set to 3. It should also be noted that the indicated value, k, of the cross-validation is applied to all trees in the respective ensemble model.

#### 3.3.4. Evaluation Statistics of the Selected Base Models

First, let us estimate the reduction in the number of trees in the simplified selective ensembles. For the three base models, we have: from *EB*15 to *SSEB*11, 4 trees; from *EB*40 to *SSEB*25, 15 trees; and from *AR*10 to *SSAR*9, 1 tree. The relative reductions are 26.7%, 37.5%, and 10%, respectively; in total, 20 of the 65 trees were removed, i.e., an average reduction of about 30%.


**Table 6.** Hyperparameters of the selected base models.

The performance statistics (5) of the two selected groups of base models for predicting the reduced dependent variable *Milk\_miss*40 were evaluated and compared. In addition, the predicted values of these models were compared against the initial sample, *Milk*305, with 158 cases; *Milk\_miss*40, with 118 cases; and the holdout test sample, *Milk*\_40, with 40 cases, not used in the modeling procedure. The obtained basic statistics of the predictive models are shown in the first six columns of Table 7. It can be seen that the performance results are similar, whereas all statistics from (5) of the selective ensembles are superior.

**Table 7.** Summary statistics of the predictions of obtained models against the measured values of the dependent variables.


In particular, the *SSEB*11 model demonstrates better performance than the *EB*15 model from which it is derived. For example, for the whole sample, the reduction in RMSE of *SSEB*11 compared to *EB*15 is 5.1%, whereas for the test sample, *Milk*\_40, the error is reduced by 9.26%. Similarly, model *SSEB*25 outperforms the source model, *EB*40: the improvement in RMSE for the whole sample is 11.4%, and for the holdout sample, the error is reduced by 3.0%. For *SSAR*9, these indices are 2% and 1.9%, respectively. Overall, the indicators of model *AR*10 and its simplified selective model, *SSAR*9, are comparatively the weakest. This can be explained by the fact that they contain the smallest number of trees, and only one negative tree was removed from the *AR*10 ensemble.

#### 3.3.5. Relative Importance of the Factors Determining Milk305 Quantity

The regression models we built were used to predict 305-day milk yield, allowing us to determine, with high accuracy, how the considered factors explain the predicted values according to their weights in the models. For better interpretation, the initial names of the variables are given alongside the corresponding predictors, according to Table 4. The predictor with the greatest importance in a model receives the highest weight (a score of 100), and the other scores are expressed relative to it.

The results in Table 8 show the relative variable importance of the predictors within the built base ensemble models. As expected, the main defining variable for 305-day milk yield, with the greatest importance of 100, is *Farm*. The other significant conformation traits, in descending order, are *PC*4 (*UdderW*), with a relative weight between 60 and 68; *PC*3 (*ChestW*), 45 to 58; *PC*11 (*Stature*), 19 to 36; and *PC*10 (*Bone*), 19 to 27. The conformation traits with the weakest influence are *PC*8 (*FootA*), with a relative weight of 8 to 14, and *PC*7 (*RLSV*), with a relative weight of 7 to 11. Because all predictors have an average weight of more than five relative scores, we consider them all essential traits on which milk yield depends.
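The scoring convention used in Table 8 can be sketched as follows. The raw importance values below are hypothetical placeholders, not the paper's actual numbers; only the rescaling rule (top predictor gets 100, others relative to it) is taken from the text:

```python
# Rescale raw variable importances so the top predictor scores 100 and the
# rest are expressed relative to it. Raw values here are hypothetical.
raw = {"Farm": 0.41, "PC4 (UdderW)": 0.26, "PC3 (ChestW)": 0.20,
       "PC11 (Stature)": 0.12, "PC8 (FootA)": 0.04}

top = max(raw.values())
relative = {k: round(100 * v / top, 1) for k, v in raw.items()}
print(relative["Farm"])  # 100.0
```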


**Table 8.** Relative averaged variable importance in base models.

In practice, the average values of the main conformation traits should be maintained within the lower and upper bounds of their means (5% confidence intervals). In our case, these limits are given in Table 3.

#### *3.4. Building and Evaluation of the Linear Hybrid Models*

The next stage of the proposed framework is combining the obtained predictions from the single base models. To illustrate the higher efficiency when using simplified selective ensembles, we compared the results obtained from the two groups of base models.

#### 3.4.1. Results for Hybrid Models

Using the well-known approach of linear combinations of ensembles (see [45]), we sought to find a linear hybrid model, *y*ˆ, of the type

$$
\hat{y} = \alpha\_1 E\_1 + \alpha\_2 E\_2 + \alpha\_3 E\_3, \tag{6}
$$

where *Ei*, *i* = 1, 2, 3, are ensemble models that satisfy the conditions for diversity, C1, ... , C4 (see Section 3.3), and the coefficients *αi* are sought such that

$$\sum\_{i=1}^{3} \alpha\_i = 1, \quad \alpha\_i \in [0, 1]. \tag{7}$$

By varying each coefficient over the interval [0, 1] with step *h* = 0.05 and examining all admissible combinations of *αi*, *i* = 1, 2, 3, the following two hybrid models with the least RMSE on the test sample *Milk*\_40 were obtained:

$$Hybr\_1 = 0.55\ EB15 + 0.15\ EB40 + 0.3\ AR10.\tag{8}$$

$$Hybr\_2 = 0.75\ SSEB11 + 0.25\ SSAR9.\tag{9}$$

The main statistics of these models are given in the last two columns of Table 7. The hybrid models improve all indicators of the base models. For the holdout test sample, *Milk*\_40, the *Hybr*<sup>1</sup> model has an RMSE of 509.555 kg, which is less than the errors of the Group A models by 7.9% for *EB*15, 12.4% for *EB*40, and 28.8% for *AR*10. Accordingly, model *Hybr*<sup>2</sup> improves the statistics of the Group B models: for the test sample, its RMSE = 473.690 kg, which is smaller than that of the *SSEB*11, *SSEB*25, and *SSAR*9 models by 5.1%, 17.2%, and 35.8%, respectively. Furthermore, we obtained the desired result: the superiority of the simplified selective models and *Hybr*<sup>2</sup> over the initial non-selective models and *Hybr*<sup>1</sup>. In particular, the RMSE of *Hybr*<sup>2</sup> is smaller than that of *Hybr*<sup>1</sup> by 7% for the holdout test sample, *Milk*\_40. MAPE values of about 5.5% were achieved.
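The constrained grid search behind Equations (6)–(9) can be sketched as follows. The target vector and base-model predictions below are synthetic stand-ins (the real predictions are not published); only the enumeration scheme with step *h* = 0.05 under the simplex constraint (7) follows the text:

```python
# Grid search for stacking weights: enumerate alpha_i over [0, 1] with step
# h = 0.05 under sum(alpha_i) = 1, keeping the combination with minimal RMSE.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(7000, 800, size=40)              # stand-in for Milk_40
E = [y + rng.normal(0, s, size=40) for s in (550, 580, 700)]  # base predictions

h = 0.05
steps = int(round(1 / h))
best = (np.inf, None)
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        k = steps - i - j                       # enforces a1 + a2 + a3 = 1
        a = np.array([i, j, k]) * h
        yhat = a[0] * E[0] + a[1] * E[1] + a[2] * E[2]
        rmse = float(np.sqrt(np.mean((yhat - y) ** 2)))
        if rmse < best[0]:
            best = (rmse, a)

print(best[1], round(best[0], 1))  # best weights and their holdout RMSE
```

Because the single base models themselves (e.g., weights (1, 0, 0)) lie on the grid, the selected combination can never have a larger holdout RMSE than the best individual model.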

A comparison of the values predicted by models (8) and (9) and the initial values for *Milk*305 is illustrated in Figure 4.

**Figure 4.** Quality of the coincidence of the measured values of milk yield and the predictions by the hybrid models with 5% confidence intervals: (**a**) model *Hybr*<sup>1</sup> against *Milk*305; (**b**) model *Hybr*<sup>2</sup> against *Milk*305; (**c**) model *Hybr*<sup>1</sup> against the holdout test sample, *Milk\_*40; (**d**) model *Hybr*<sup>2</sup> against *Milk\_*40.

#### 3.4.2. Comparison of Statistics of All Models

A comparison of the performance statistics of all eight models constructed in this study for *Milk*305 and *Milk*\_40 is illustrated in Figures 5 and 6. Figure 5 shows that the coefficients of determination for model *AR*10 and its simplified selective ensemble, *SSAR*9, are weaker than those of the other base models. However, although these models carry the second-largest coefficients in (8) and (9), respectively, the *R*<sup>2</sup> of the hybrid models remains satisfactory for the small data samples studied. In the same context, Figure 6 illustrates the behavior of the RMSE values of the hybrid models, which do not deteriorate significantly despite the higher values of the *AR*10 and *SSAR*9 models.

**Figure 5.** Comparison of coefficients of determination *R*<sup>2</sup> for all eight models for *Milk*305 and *Milk\_*40 samples.

**Figure 6.** Comparison of RMSE values for all eight constructed models for *Milk*305 and *Milk\_*40 samples.

Finally, we compared the RMSE and the generalization error (mean squared error, MSE = RMSE<sup>2</sup>) of the built models for the randomly selected holdout test sample, *Milk*\_40. The results are shown in Table 9. The *Hybr*<sup>2</sup> model produces an RMSE 7% lower than that of *Hybr*<sup>1</sup>; compared to the base models, the improvement varies from 5% to 26%. The comparison by generalization error shows 13.6% and 9.6% lower values for *Hybr*<sup>2</sup> than those for *Hybr*<sup>1</sup> and *SSEB*11, respectively.


**Table 9.** Holdout test-set prediction errors.

#### **4. Discussion**

We investigated the relationship between the 305-day milk yield of Holstein-Friesian cows and 12 external traits and the farm in a sample of 158 cows. To evaluate the constructed models, a random holdout test subsample was used, including 25% (40 entries) from the variable *Milk*305 for 305-day milk yield. In order to reveal the dependence and to predict milk yield, a new framework was developed based on ensemble methods using bagging and boosting algorithms and enhanced by a new proposed simplified selective ensemble approach.

We simultaneously applied the CART ensembles and bagging and arcing methods to livestock data for the first time. To improve the predictive ability of the models, the initial ordinal variables were transformed using factor analysis to obtain rotated feature samples. Three initial base models (Group A) were selected, satisfying four diversity criteria. Numerous simplified selective ensembles were built from each of these models, and from these, a second trio of base models (Group B) was selected. The predictions from each group of base models were stacked into two linear hybrid models. The models successfully predict up to 94.5% of the data for the initial and holdout test samples. The obtained results for predicting the 25% holdout values of milk yield showed that the two hybrid models have better predictive capabilities than the single base models. In particular, the RMSE of hybrid model *Hybr*<sup>2</sup>, built from the simplified selective ensembles, is 7.0% lower than that of the other hybrid model, based on the non-selective ensembles. The number of trees in the three selective ensembles was decreased by 26.7%, 37.5%, and 10%, an overall reduction of about 30% (20 of 65 trees).

Our proposed approach to build selective tree ensembles is characterized by a simple algorithm, reduces the dimensionality of the ensemble, improves basic statistical measures, and provides many new ensembles to be used to satisfy the diversity criteria. In addition, in the two-level stacking procedure, we used two different criteria: increasing the index of the agreement to build simplified selective ensembles and minimizing the RMSE for choosing the stacked model.
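The first of these two criteria can be made concrete. A minimal sketch, assuming Willmott's index of agreement *d* (the standard definition of this measure; the paper's exact variant may differ):

```python
# Willmott's index of agreement d: the criterion maximized when deciding
# whether dropping a tree improves a simplified selective ensemble.
import numpy as np

def index_of_agreement(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    num = np.sum((pred - obs) ** 2)
    den = np.sum((np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    return 1.0 - num / den   # d in [0, 1]; 1 means perfect agreement

obs = [6800, 7200, 7100, 6500, 7400]                   # milk yields, kg
print(index_of_agreement(obs, obs))                    # 1.0 for a perfect fit
print(index_of_agreement(obs, [7000] * 5))             # 0.0 for the flat mean
```

A candidate sub-ensemble is retained only if its *d* on the learning sample exceeds that of the ensemble it was derived from.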

However, some shortcomings can be noted. The selection of base models that meet the condition of diversity remains a challenging problem, known as a "black art" [43]. Another difficulty is determining the variable importance of the initial predictors in the stacked models. The method proposed in this study may have certain limitations in practical applications. It inherits the main shortcomings of ensemble algorithms based on decision trees: it requires more computing resources than a single model, i.e., additional computational costs, training time, and memory. The greater algorithmic complexity of our method compared to standard ensemble methods would also be an obstacle to its application in real-time problems, unless greater accuracy and stability of the predictions are sought. However, on parallel computer systems, these limitations are reduced by at least one order of magnitude. Another disadvantage is the more difficult interpretation of the obtained results.

Our results can be compared with those obtained by other authors. For example, selective ensembles were derived in [19,20] using genetic algorithms. In [19], a large empirical study was performed, including 10 datasets for regression generated from mathematical functions. Twenty neural network trees were used for each ensemble. The component neural networks were trained using 10-fold cross validation. As a result, the number of trees in selective ensembles was reduced to an average of 3.7 without sacrificing the generalization ability. In [20], selective C4.5 decision tree ensembles were constructed for 15 different empirical datasets. All ensembles initially consisted of 20 trees. A modified genetic algorithm with a 10-fold cross-validation procedure was applied. There were reductions in the number of trees in the range of 7 to 12, with an average of 8, and a reduction in the ensemble error by an average of 3%. Several methods for ensemble selection were proposed in [24], and a significant reduction (60–80%) in the number of trees in Adaboost ensembles was achieved without significantly deteriorating the generalization error. The authors of [26] developed a complex hierarchical selective ensemble classifier for multiclass problems using boosting, bagging, and RF algorithms and achieved accuracy of up to 94–96%.

The classical paper by Breiman [45] can be mentioned, wherein various linear combinations with stacked regressions, including decision tree ensembles, were studied. Stacking was applied to 10 CART subtrees of different sizes with 10-fold cross-validation for relatively small samples. Least squares under non-negativity constraints was used to determine the coefficients of the linear combination. A reduction in generalization error of 10% was obtained for 10% and 15% holdout test samples. These performance results are comparable with those achieved in the present empirical study. Here, under the constraints (7), we obtained a 9.6% to 13.6% reduction in the prediction generalization error of model *Hybr*<sup>2</sup> compared to the *SSEB*11 and *Hybr*<sup>1</sup> models, respectively (see Table 9).

Furthermore, the proposed simplified selective algorithm easily adapts to other ensemble methods, including neural-network-type ensembles.

As a practical result of modeling, it was also found that 305-day milk yield depends on the following key factors (in descending order of importance): breeding farm, udder width, chest width, and the animals' stature. Furthermore, the farm as a breeding environment is found to be of crucial importance. In our case, numerous hard-to-measure factors were stochastically taken into account, such as state of the farm, comfort conditions for each animal, feeding method and diet, milking method, cleaning, animal healthcare, etc. With the obtained estimates, the indicators of the main external traits could be monitored within their mean values and confidence intervals to maintain and control a certain level of milk yield for each herd. The developed framework may also be used to forecast milk quantity in the case of measurements prior to the end of lactation.

This study shows a moderate to strong nonlinear dependence between conformation traits and 305-day milk yield, which presents an indirect opportunity to improve animal selection. However, to achieve real results in the management and selection of animals, it is recommended to accumulate data and perform statistical analyses periodically to monitor multiple dependencies between external, productive, and genetic traits and environmental factors.

**Author Contributions:** Conceptualization and methodology, S.G.-I.; software, S.G.-I. and A.Y.; validation, all authors; investigation, all authors; resources, A.Y.; data curation, A.Y. and H.K.; writing original draft preparation, S.G.-I.; writing—review and editing, S.G.-I. and H.K.; funding acquisition, S.G.-I. and H.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors acknowledge the support of the Bulgarian National Science Fund, Grant KP-06-N52/9. The first author is partially supported by Grant No. BG05M2OP001-1.001-0003, financed by the Science and Education for Smart Growth Operational Program (2014–2020), co-financed by the European Union through the European Structural and Investment funds.

**Institutional Review Board Statement:** All measurements of the animals were performed in accordance with the official laws and regulations of the Republic of Bulgaria: Regulation No. 16 of 3 February 2006 on protection and humane treatment in the production and use of farm animals, the Regulation amending of the Regulation No. 16 (last updated 2017), and the Veterinary Law (Chapter 7: Protection and Human Treatment of Animals, Articles 149–169). The measurement procedures were carried out in compliance with Council Directive 98/58/EC concerning the protection of animals kept for farming purposes.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Article* **Anomaly Detection in the Internet of Vehicular Networks Using Explainable Neural Networks (xNN)**

**Saddam Aziz 1,\*, Muhammad Talib Faiz 1, Adegoke Muideen Adeniyi 1, Ka-Hong Loo 1,2,\*, Kazi Nazmul Hasan 3, Linli Xu <sup>2</sup> and Muhammad Irshad <sup>2</sup>**


**Abstract:** It is increasingly difficult to identify complex cyberattacks in a wide range of industries, such as the Internet of Vehicles (IoV). The IoV is a network of vehicles that consists of sensors, actuators, network layers, and communication systems between vehicles. Communication plays an important role as an essential part of the IoV. Vehicles in a network share and deliver information based on several protocols. Due to wireless communication between vehicles, the whole network can be vulnerable to cyber-attacks. In these attacks, sensitive information can be shared with a malicious network or a bogus user, resulting in malicious attacks on the IoV. For the last few years, detecting attacks in the IoV has been a challenging task. It is becoming increasingly difficult for traditional Intrusion Detection Systems (IDS) to detect these newer, more sophisticated attacks, which employ unusual patterns. Attackers disguise themselves as typical users to evade detection. These problems can be solved using deep learning. Many machine-learning and deep-learning (DL) models have been implemented to detect malicious attacks; however, feature selection remains a core issue. By training on empirical data, DL independently defines intrusion features. We built a DL-based intrusion model that focuses on Denial of Service (DoS) assaults in particular. We used K-Means clustering for feature scoring and ranking. After extracting the best features for anomaly detection, we applied a novel model, i.e., an Explainable Neural Network (xNN), to classify attacks in the CICIDS2019 dataset and the UNSW-NB15 dataset separately. The model performed well regarding precision, recall, F1 score, and accuracy, and our proposed xNN model performed best after the feature-scoring step.
In dataset 1 (UNSW-NB15), xNN performed well, with the highest accuracy of 99.7%, while CNN scored 87%, LSTM scored 90%, and the Deep Neural Network (DNN) scored 92%. xNN achieved the highest accuracy of 99.3% while classifying attacks in the second dataset (CICIDS2019); the Convolutional Neural Network (CNN) achieved 87%, Long Short-Term Memory (LSTM) achieved 89%, and the DNN achieved 82%. The suggested solution outperformed the existing systems in terms of the detection and classification accuracy.

**Keywords:** IoV; xNN; K-MEANS; anomaly detection

**MSC:** 62T07; 68T05

#### **1. Introduction**

The IoV is an open, convergent network system that encourages collaboration between people, vehicles, and the environment [1,2]. With the help of vehicular ad hoc networks (VANET), cloud computing, and multi-agent systems (MAS), this hybrid paradigm plays a crucial role in developing an intelligent transportation system that is both cooperative and effective [3]. The presence of an anomaly detection system in the IoV is essential in today's uncertain world for the sake of data validity and safety. When it comes to critical safety data analysis, the cost of real-time anomaly detection of all data in a data package must be considered [4].

**Citation:** Aziz, S.; Faiz, M.T.; Adeniyi, A.M.; Loo, K.-H.; Hasan, K.N.; Xu, L.; Irshad, M. Anomaly Detection in the Internet of Vehicular Networks Using Explainable Neural Networks (xNN). *Mathematics* **2022**, *10*, 1267. https://doi.org/10.3390/math10081267

Academic Editor: Snezhana Gocheva-Ilieva

Received: 5 March 2022; Accepted: 31 March 2022; Published: 11 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

IoV consists of the following three layers:

- the experiment and control layer;
- the computing layer;
- the application layer.


In the experimental and control layers, the vehicle is controlled and monitored according to sensed data and information from its environment. In the computing layer, vehicles communicate with the help of WLAN, cellular (4G/5G), and short-range wireless networks [5]. In the application layer, closed and open service models, or IoVs, are present. Key components of an IoV system are shown in Figure 1.

**Figure 1.** Key components and layers of an IoV system.

Unlike the internet, with its specific data-security prevention techniques, IoV data security issues arise from both internal and external factors [6,7]. The internal safety problems of vehicles are reflected in existing communication protocols, such as the lack of a reliable data verification mechanism in the Controller Area Network (CAN) protocol. The open architecture of the IoV and its widespread use make it more difficult to defend against data breaches and cyber-attacks [8]. An anomaly detection system for autonomous vehicles is the subject of this paper. IoVs are unprecedentedly vulnerable when backed by a dynamic and uncertain network [9].

Human safety and property can be jeopardized by malicious assaults and data tampering, as well as by system breakdowns [10]. Figure 2 shows the possible security risks in an IoV system. Vehicle-to-vehicle (V2V) communication is the first risk, where data can be intercepted or tampered with by an attacker, causing harm to drivers. A second security risk arises in the vehicle-to-infrastructure (V2I) communication scenario.

Numerous concerns have been raised about the privacy and security of intelligent vehicles and intelligent transportation networks due to the multiple attack models for intelligent vehicles [10]. Cyber attackers might jam and spoof the signal of the VANET communication network, which raises serious security problems [11]. This could cause the entire V2X system to be impacted by misleading signaling and signal delays, so that the message conveyed is corrupted and does not fulfill its intended aims [12].

The internet or physical access to a connected vehicle's intelligence system is another security danger that intelligent automobiles encounter. In 2015, for example, security researchers Charlie Miller and Chris Valasek wirelessly hacked a Jeep Cherokee's intelligence system [13]. While the driver was still behind the wheel, they compromised the entertainment system, steering, brakes, and air conditioning to show that the Jeep's intelligence system had security vulnerabilities. In another case, cybercriminals abused the Nissan Leaf's companion app using the vehicle's unique identification number, which is generally displayed on the windshield; this flaw allowed hackers to gain control of the HVAC system [14].

**Figure 2.** Possible security risks in an IoV system.

IoV's growth has been bolstered by embedded systems, hardware and software enhancements, and networking devices. However, there are still several dangers in the IoV, concerning security, accuracy, performance, networks, and privacy. Many security and privacy concerns have arisen due to the rising usage of intelligent services, remote access, and frequent network modifications. As a result, security vulnerabilities in IoV data transfer are a significant concern. Therefore, clustering [15,16] and deep-learning algorithms and approaches [17–19] can be used to handle network and security issues relating to the IoV. As part of this study, the security standards for IoV applications are outlined to improve the efficiency of network and user services. Denial of Service (DoS) assaults are detected using a novel model, xNN. The motivations of this study are:


#### **2. Related Work**

#### *2.1. Anomaly Detection Systems*

The safety of IoV users is a significant concern. In the event of an infiltration attack on an IoV system, hackers could gain direct control of vehicles, resulting in traffic accidents. Many previous studies have addressed improving security for vehicular networks. To detect both known and unknown assaults on automotive networks, a multi-tiered hybrid IDS that integrates a signature-based IDS with an anomaly-based IDS was presented by Yang et al. [1]. The suggested system can detect several known assaults with 99.99% accuracy on the CAN-intrusion dataset, representing internal vehicular network data, and with 99.88% accuracy on the CICIDS2017 dataset, representing external network data.

The suggested system has strong F1 scores of 0.963 and 0.800 on the two datasets above when it comes to zero-day attack detection. Intrusion detection networks, IDS design, and the limitations and characteristics of an IoV network were explored by Wu et al. [3]. The IDS designs for IoV networks were discussed in detail, and a wide range of optimization targets were investigated and thoroughly analyzed in that study. Vehicular ad hoc networks (VANETs) provide wireless communication between cars and infrastructures. Connected vehicles may help intelligent cities and Intelligent Transportation Systems (ITS). VANET's primary goals are to reduce travel time and improve driver safety, comfort, and productivity. VANET is distinct from other ad hoc networks due to its extreme mobility. However, the lack of centralized infrastructure exposes it to several security flaws.

This poses a serious threat to road traffic safety. CAN is a protocol for reliable and efficient communication between in-vehicle parts. The CAN bus does not contain source or destination information; therefore, messages cannot be verified as they transit between nodes. An attacker can easily insert any message and cause system issues. Alshammari et al. [4] presented KNN and SVM techniques for grouping and categorizing VANET intrusions. The offset ratio and time gap between the CAN message request and answer were examined to detect intrusions.

#### *2.2. Machine-Learning-Based Models*

A data-driven IDS was designed by evaluating the link load behavior of the Roadside Unit (RSU) in the Internet of Things (IoT) against various assaults that cause traffic flow irregularities. An intrusion targeting RSUs can be detected using a deep-learning architecture based on a Convolutional Neural Network (CNN). The proposed architecture [5] uses a standard CNN and a basic error term based on the backpropagation algorithm's convergence. In the meantime, the suggested CNN-based deep architecture's probabilistic representation provides a theoretical analysis of convergence.

An IoV system must efficiently manage traffic, re-configure itself, and secure streaming data. Software-defined networks (SDN) provide network flexibility and control; however, they can attract hostile agents. Garg et al. [7] use probabilistic data structures to detect aberrant IoV behaviour. In phase 1, a Count-Min Sketch is used to find suggestive nodes. Phase 2 uses Bloom-filter-based control to check questionable nodes' signatures. Phase 3 uses a Quotient filter to store risky nodes quickly. To detect super points (malicious hosts connecting to several destinations), the flows across each switch are counted in phase 4. The scheme was tested using a computer simulation and outperformed the current standard in terms of detection ratios and false-positive rates.
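To make the first phase concrete, here is a toy Count-Min sketch (an illustrative sketch, not the authors' implementation): a fixed-size table of counters that over-counts but never under-counts, so high-traffic ("suggestive") nodes can be flagged cheaply:

```python
# Toy Count-Min sketch: estimate per-node traffic counts in constant memory.
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, derived from SHA-256.
        h = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Minimum over rows: may overestimate (collisions), never underestimates.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for _ in range(120):
    cms.add("node-17")   # heavy hitter, a candidate suggestive node
cms.add("node-3")

print(cms.estimate("node-17") >= 120)  # True: counts are never underestimated
```

Nodes whose estimated count exceeds a traffic threshold would then be passed to the Bloom-filter signature check of phase 2.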

In a generic threat model, an attacker can access the CAN bus utilising common access points. Xiao et al. [8] presented an in-vehicle network anomaly detection framework based on SIMATT and SECCU symmetry. To obtain state-of-the-art anomaly detection performance, SECCU and SIMATT are integrated, and the authors aim to reduce the computing overhead of the training and detection stages. The SECCU and SIMATT models each have only one layer of 500 cells, thus reducing computing expenses. Numerous evaluations of SIMATT-SECCU architectures have shown near-optimal accuracy and recall rates compared with other traditional algorithms, such as LSTM, GRU, GIDS, RNN, or their derivatives [20,21].

#### *2.3. Anomaly Detection Based Driving Patterns*

The Anomaly Detection Based on the Driver's Emotional State (EAD) algorithm was proposed by Ding et al. [9] to achieve the real-time detection of data related to safe driving in a cooperative vehicular network. A driver's emotional quantification model was defined in this research, which was used to characterize the driver's driving style in the first place. Second, the data anomaly detection technique was built using the Gaussian Mixed Model (GMM) based on the emotion quantization model and vehicle driving status information. Finally, the authors performed extensive experiments on a real data set (NGSIM) to demonstrate the EAD algorithm's high performance in combination with the application scenarios of cooperative vehicular networks.
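A GMM-based anomaly check in the spirit of the EAD algorithm can be sketched as follows. The data are synthetic (the real feature set combines the emotion quantization model and driving-state information, which is not reproduced here); the pattern shown is the generic one: fit a Gaussian mixture to normal samples and flag low-likelihood observations:

```python
# Sketch: fit a GMM to "normal driving" features and flag low-likelihood samples.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical features: [speed km/h, steering variability]
normal = rng.normal([60, 0.2], [5, 0.05], size=(500, 2))
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal)

# Threshold at the 1st percentile of log-likelihoods on normal data
threshold = np.quantile(gmm.score_samples(normal), 0.01)
anomaly = np.array([[120.0, 0.9]])            # erratic driving sample
print(gmm.score_samples(anomaly)[0] < threshold)  # True: flagged as anomalous
```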

For the case in which the IoV cloud can provide only a small amount of labelled data for a novel assault, Li et al. [10] suggested two model-updating approaches: a cloud-assisted update, in which the IoV cloud supplies a small quantity of labelled data, and a local update, used when the IoV cloud cannot send labelled data promptly. Their research shows that pre-labelled data can be leveraged to derive pseudo labels for unlabelled data in new assaults, so a vehicle can update without obtaining labelled data from the IoV cloud. The schemes proposed by Li et al. improved the detection accuracy by 23% over conventional methods.

Connected vehicle cybersecurity and safety have been addressed using anomaly detection techniques. Prior research in this field is categorised according to the taxonomy proposed by Rajbahadur et al. [11], which comprises nine main categories and 38 subcategories. The authors found that real-world data are rarely used, with most results derived from simulations; that V2I and in-vehicle communication are not considered together; that proposed techniques are seldom compared to a baseline; and that the safety of the vehicles is not given as much attention as cybersecurity.

Maintaining a safe and intelligent transportation system necessitates avoiding routes that are prone to accidents. With the help of crowd-sourcing and historical accident data, intelligent navigation systems can help drivers avoid dangerous driving conditions (such as snowy roads and rain-slicked road areas). Using crowd-sourced data, such as images and sensor readings, a vehicle cloud can compute such safe routes and react faster than a centralised service. Security and privacy must be ensured for each data owner in such intelligent routing. Additionally, crowd-sourced data need to be verified in the vehicle cloud before being used. Joy et al. [12] investigated ways to ensure that vehicular clouds are secure, private, and protected against intrusion.

Over the past few years, the complexity and connectivity of today's automobiles have steadily increased. In the context of this development, there has been a massive increase in the security risks for in-vehicle networks and their components. In addition to putting the driver and other road users at risk, these attacks can compromise the vehicle's critical safety systems. The detection of anomalies in automobile in-vehicle networks is discussed by Müter et al. [13]. A set of anomaly detection sensors was introduced based on the characteristics of typical vehicular networks, such as the CAN. These sensors allow the detection of attacks during vehicle operation without causing false positives. A vehicle attack detection system is also described and discussed in terms of its design and application criteria.

#### *2.4. Distributed Anomaly Detection System*

Negi et al. [14] proposed a framework for a distributed anomaly detection system that incorporates an online new-data selection algorithm, which directs retraining and modifies the model parameters as needed for self-driving and connected cars. The framework's implementation includes offline training of an LSTM model over many machines in a distributed manner using all available data. The trained parameters are then sent to the individual vehicles, where anomaly detection occurs at the vehicle level. A more complex LSTM anomaly detection model is used, and the proposed distributed framework's accuracy in detecting anomalies is improved using the MXNet framework, which is used to test the framework's performance.

Sakiyama et al. [22] offered filter banks defined by a sum of sinusoidal waves in the graph spectral domain. These filter banks have low approximation errors even when using a lower-order shifted Chebyshev polynomial approximation. Their parameters can be efficiently obtained from any real-valued linear phase finite impulse response filter banks regularly. The author's proposed frequency-domain filter bank design has the same characteristics as a classical filter bank. The approximation precision determines the approximation orders. Many spectral graph wavelets and filter banks exist to test the author's techniques.

For autonomous and connected automobiles, securing vehicles is a top priority in light of the 2015 Jeep Cherokee incident, in which the vehicle was illegally controlled remotely by spoofed messages placed on the public mobile network. Security solutions for unknown cyberattacks require the timely identification of attacks that occur over a vehicle's lifespan. Hamada et al. [23] described how spoofed communications at the central gateway can be detected using an IDS, and reported the system's detection performance on communications from a real-world in-vehicle network.

#### *2.5. Ad Hoc Vehicle Network Intrusion Detection System*

Ad hoc vehicle networks are evolving into the Internet of Vehicles as the Internet of Things (IoT) takes hold. The IoV has attracted a large number of businesses and researchers due to the rapid advancement of computing and communication technologies. Using an abstract model of the IoT, Yang et al. [24] provided an overview of the technologies needed to build the IoV, examined many IoV-related applications, and outlined open research challenges and necessary future research in the IoV field.

Future Connected and Automated Vehicles (CAVs) in intelligent transportation systems will form a highly interconnected network. City traffic flows can only be coordinated if vehicles are connected via the Internet of Vehicles (herein the Internet of CAVs), which makes it possible to monitor and regulate CAVs using anonymized CAV mobility data. To ensure safe and secure operations, the early detection of anomalies is crucial. Wang et al. [25] proposed an unsupervised learning technique based on a deep autoencoder to detect anomalies in CAV self-reported locations. Quantitative investigations on simulated datasets showed that the proposed approach works well in detecting self-reported location anomalies.

As real-time anomaly detection on complete data packages is expensive, Ding et al. [26] concentrated on crucial safety data analysis. A traffic cellular automata model was used for preprocessing to obtain optimal anomaly detection with minimal computing resources. By modelling the driver's driving style, their algorithm can discover irregularities in safe-driving data online and in real time. It starts with a driving style quantization model that describes a driver's driving style as a driving coefficient; a Gaussian mixture model (GMM) is then used to detect data anomalies based on the driving style quantization and the vehicle driving state. Finally, the study evaluated the suggested ADD algorithm's performance in IoV applications using real and simulated data.

Chandola et al. [27] summarized the research on anomaly detection, categorising existing techniques into groups based on their core approach and stating the key assumptions each category uses to distinguish normal from deviant behaviour. These assumptions can be used to judge a technique's efficacy in a specific domain. Using a basic anomaly detection template, the authors showed how the existing techniques are all variations of the same underlying technique, which makes the techniques in each area easier to categorise and remember. The pros and cons of each technique are listed separately, and the authors also examined the strategies' computational complexity, which is important in real-world applications. The survey aims to provide a better understanding of how strategies developed for one field can be applied to other fields.

The In-Vehicle Anomaly Detection Engine (IVADE) is a machine-learning-based intrusion detection technology developed by Araujo et al. [28]. The system monitors vehicle mobility data (such as position, speed, and direction) using Cooperative Awareness Messages (CAMs), which are exchanged between cars and infrastructure via V2V and V2I networks. The IVADE Lane Keeping Assistance System (LKAS) uses an ECU for signal measurement and control computations on a CAN bus. Implementing machine learning in IVADE requires CAN message fields, automotive domain-specific knowledge about dynamic system behaviour, and decision trees. The simulation results suggest that IVADE can detect irregularities in in-vehicle applications and thereby aid safety functions.

#### *2.6. In-Vehicle Network Intrusion Detection*

A remote wireless attack on an in-vehicle network is possible with 5G and the Internet of Vehicles. Anomaly detection systems can be effective as a first line of defence against security threats. Wang et al. [29] proposed an anomaly detection system that leverages hierarchical temporal memory (HTM) to secure a vehicle controller area network bus. The HTM model can predict real-time flow data based on prior learning. The forecast evaluator's anomaly scoring algorithm was improved with manually created field-modification and replay attacks. The results revealed that the distributed HTM anomaly detection system outperformed recurrent neural network and hidden Markov model detection systems in terms of the RCC score, precision, and recall.

Khalastchi et al. [30] described an online anomaly detection approach for robots that is lightweight and capable of considering a large number of sensors and internal measures with high precision. By selecting correlated data online, the authors presented a robot-specific version of the well-known Mahalanobis distance and illustrated how it can be applied in high dimensions. They tested these contributions on commercial Unmanned Aerial Vehicles (UAVs), a vacuum-cleaning robot, and a high-fidelity flight simulator. According to their findings, the online Mahalanobis distance was superior to previous methods.
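The Mahalanobis distance underlying the approach above can be illustrated with a minimal batch sketch. The sliding-window scoring, the regularization constant, and the toy data below are our own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def mahalanobis_score(window, x):
    """Score a new observation x against a sliding window of recent
    sensor readings using the Mahalanobis distance. Larger scores
    indicate observations farther from the window's distribution."""
    mu = window.mean(axis=0)
    # Regularize the covariance so it stays invertible for short windows.
    cov = np.cov(window, rowvar=False) + 1e-6 * np.eye(window.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = x - mu
    return float(np.sqrt(diff @ inv_cov @ diff))

# Toy usage: 200 correlated 3-D sensor readings, then one clear outlier.
rng = np.random.default_rng(0)
window = rng.normal(0, 1, size=(200, 3))
normal_score = mahalanobis_score(window, np.zeros(3))
outlier_score = mahalanobis_score(window, np.array([8.0, 8.0, 8.0]))
```

An online variant would additionally update `mu` and `cov` incrementally as new readings arrive, which is the efficiency point made by the authors.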

Automobiles are cyber-physical systems (CPSs) due to their sensors, ECUs, and actuators. External connectivity increases the attack surface, affecting those inside vehicles and those nearby. The attack surface has also grown because complex systems are built on top of older, less secure common bus frameworks that lack basic authentication mechanisms. Narayanan et al. [31] treated making such systems safer as a data analytics challenge and employed a Hidden Markov Model to detect dangerous behaviour and issue alerts while a vehicle is in motion. To demonstrate the technique's ability to detect anomalies in vehicles, the authors tested it with single and dual parameters; moreover, the technique worked on both new and old cars.

#### *2.7. Feature Based Intrusion Detection System*

Garg et al. [32] proposed an anomaly detection system with three stages: (a) feature selection, (b) SVM parameter optimization, and (c) traffic classification. The first two stages are expressed as a multi-objective optimization problem, in which the "C-ABC" coupling increases the optimizer's local search capability and speed. The final stage classifies the data using an SVM with the updated parameters. The proposed model was evaluated extensively using OMNET++ and SUMO, and the detection rate, accuracy, and false positive rate demonstrate its effectiveness.

Marchetti et al. [33] examined information-theoretic anomaly detection methods for current automotive networks, focusing on entropy-based anomaly detectors. The authors simulated in-vehicle network attacks by inserting bogus CAN messages into real data gathered from a modern licenced vehicle. Their experiments found that entropy-based anomaly detection applied to all CAN messages could detect large numbers of forged CAN messages, whereas the forging of individual CAN signals was only detectable by deploying a separate entropy-based anomaly detector for each class of CAN message.
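The entropy-based detection idea can be sketched in a few lines: compute the Shannon entropy of a window of CAN message IDs and flag windows whose entropy deviates from the baseline. The window contents below are toy data of our own, not the authors' vehicle traces:

```python
import math
from collections import Counter

def entropy(messages):
    """Shannon entropy (in bits) of a window of CAN message IDs."""
    counts = Counter(messages)
    n = len(messages)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Normal traffic: IDs roughly uniform across the bus. A flooding attack
# that injects one forged ID collapses the entropy of the window.
normal_window = [0x100, 0x200, 0x300, 0x400] * 25
attack_window = normal_window[:20] + [0x666] * 80
```

A detector would compare each window's entropy against a tolerance band learned from attack-free traffic; per-message-class detectors repeat the same computation restricted to one CAN ID's signal values.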

To accurately estimate a vehicle's location and speed, the adaptive extended Kalman filter (AEKF) must additionally take into account the traffic situation surrounding the vehicle. The car-following model incorporates a communication time delay factor to improve its suitability for real-world applications. The anomaly detection results in [34] suggested that this method is superior to the AEKF with the typical χ²-detector, and that increasing the time delay had a negative effect on the overall detection performance.

#### *2.8. Connected and Autonomous Vehicles*

Connected and autonomous vehicles (CAVs) are expected to revolutionise the automobile industry. Their autonomous decision-making systems process data from external and on-board sensors, and CAVs are exposed to signal sabotage, hardware degradation, software errors, power instability, and cyberattacks. Preventing these potentially fatal anomalies requires real-time detection [35] and identification. Oucheikh et al. [36] proposed a hierarchical model based on an LSTM auto-encoder to reliably categorise each signal sequence in real time.

The effect of model parameter modification on anomaly detection and the benefits of channel boosting were examined in three cases; the model achieved a precision of 95.5%. Table 1 below shows a comparative analysis of previous studies on anomaly detection in the IoV. As the table shows, multiple techniques have been used previously, i.e., hybrid models, random forests, Gaussian mixture models, MXNet, HTM models, support vector machines, and various other machine- and deep-learning models.


**Table 1.** Comparative analysis of previous studies.

#### *2.9. Research Gap*

The capacity of anomaly detection systems to detect unexpected attacks has garnered a great deal of interest and has led to their widespread use in fields including intrusion detection, pattern recognition, and machine learning. Traditional machine-learning techniques commonly employed in IDSs rely on time-consuming feature extraction and feature selection processes. Additionally, the classification algorithms currently in use rely on shallow machine learning, and in a real-world network application, shallow machine-learning techniques cannot effectively analyse high-dimensional inputs, resulting in a lower detection rate.

Last but not least, the data that IDSs must deal with mostly consist of network traffic or host call sequences, and there are significant distinctions between the two: host call sequences are more of a sequence problem than network traffic data. Earlier methods are generally geared toward a specific case, and their detection algorithms are not adaptive, especially for hybrid data source detection systems or advanced detection systems; consequently, the previous detection algorithms are ineffective. For feature selection, we used *K*-Means clustering to extract and select the best features, and for the classification of attacks, we used an Explainable Neural Network (xNN).

The main research gaps are:


#### *2.10. Contributions*

In this article, an xNN model for anomaly detection in the IoV is proposed for the classification of attacks in two different datasets separately. Compared with the existing literature, the contributions of this paper are twofold.

The contributions of this study are summarized as:


The remainder of this paper is arranged as follows: Section 3 presents the proposed xNN for anomaly detection in the IoV, Section 4 describes the training method of xNN for the IoV, and Sections 5 and 6 present our results and conclusions, respectively.

#### **3. Proposed xNN for Anomaly Detection in the IoV**

Data with sequential features are difficult for standard neural networks to deal with. In the UNSW-NB15 and CICIDS data [37,38], host calls follow the system call order, and an unusual behaviour may contain call sequences and subsequences that are normal. Because of this, the sequential properties of the system calls must be taken into account when performing intrusion detection in the IoV. This means that the classification of the input data must take into account not only the current data but also prior data and its shifted and scaled attributes. Thus, so that intrusion detection can take input instances with normal and abnormal sequences, we shift and scale the *K*-Means-clustered data features to meet the above requirements for the xNN. xNN works on the additive index model:

$$f(\mathbf{x}) = g_1\!\left(\boldsymbol{\beta}_1^T \mathbf{x}\right) + g_2\!\left(\boldsymbol{\beta}_2^T \mathbf{x}\right) + \dots + g_K\!\left(\boldsymbol{\beta}_K^T \mathbf{x}\right) \tag{1}$$

*f*(*x*) is the function for the classification of the output variable, i.e., the attacks, and *x* is the vector of feature values for each instance, with the features arranged according to their *K*-based values from *K*-Means clustering. The *β*k are the projection coefficients, *T* denotes the transpose, and the *g*k are the ridge functions applied to the projections *β*k *<sup>T</sup>x*. Building on Equation (1), we added scaling parameters *γ*k to the neural network and, in Equation (2), a shifting parameter *σ*; the *h*k are the subnetwork transfer functions, whose hyper-parameters guard against over- and under-fitting of the model. The alternative formulation of xNN is:

$$f(\mathbf{x}) = \sigma + \gamma_1 h_1\!\left(\boldsymbol{\beta}_1^T \mathbf{x}\right) + \gamma_2 h_2\!\left(\boldsymbol{\beta}_2^T \mathbf{x}\right) + \dots + \gamma_K h_K\!\left(\boldsymbol{\beta}_K^T \mathbf{x}\right) \tag{2}$$
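A minimal numerical sketch of Equation (2) follows. The function name `xnn_forward` and the use of `tanh` as a stand-in for the learned subnetworks *h*k are our own illustrative assumptions; in the actual model the *h*k are trained networks:

```python
import numpy as np

def xnn_forward(x, beta, gamma, sigma, h=np.tanh):
    """Additive index model of Equation (2):
    f(x) = sigma + sum_k gamma_k * h_k(beta_k^T x).
    beta: (K, d) projection coefficients; gamma: (K,) scaling
    parameters; sigma: scalar shift; h: ridge/subnetwork function
    (tanh here as an illustrative stand-in for the learned h_k)."""
    projections = beta @ x           # the K projections beta_k^T x
    return sigma + gamma @ h(projections)

# Toy usage with K = 2 ridge terms over d = 3 features.
beta = np.array([[1.0, 0.0, -1.0],
                 [0.5, 0.5, 0.5]])
gamma = np.array([2.0, -1.0])
x = np.array([0.2, 0.4, 0.1])
score = xnn_forward(x, beta, gamma, sigma=0.1)
```

Because each term depends on the input only through a single projection, the contribution of every ridge term can be plotted and inspected, which is what makes the model "explainable".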

When data are fed into the network, they are multiplied by the weights assigned to each connection before being sent to the second layer of neurons, as shown in Figure 3. The sigmoid activation function is applied to the weighted sum of the inputs of each neuron. The resulting activations are then multiplied by the weights of the connections between layers two and three, and the process is repeated until the final layer.

The architectural diagram of xNN can be seen below:

**Figure 3.** The proposed architecture of xNN.

If we let


then we can define a universal equation for the activation of any neuron in an Explainable Neural Network (xNN):

$$a_j^l = \sigma\!\left(\sum_{k=1}^{n_{l-1}} w_{j,k}^l\, a_k^{l-1} + b_j^l\right) \tag{3}$$
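Equation (3) can be vectorized over a whole layer, which is how it is computed in practice; the weight matrix, bias, and toy activations below are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_activations(w, a_prev, b):
    """Equation (3) for an entire layer: a^l = sigma(W^l a^{l-1} + b^l),
    where row j of w holds the weights w_{j,k}^l feeding neuron j."""
    return sigmoid(w @ a_prev + b)

# Toy usage: a layer of 2 neurons fed by 3 activations from layer l-1.
w = np.array([[0.2, -0.5, 0.1],
              [0.7, 0.3, -0.2]])
a_prev = np.array([1.0, 0.5, -1.0])
b = np.array([0.1, -0.1])
a = layer_activations(w, a_prev, b)   # sigmoid outputs lie in (0, 1)
```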

A weighted directed graph can be used to conceptualise xNN, in which neurons are nodes and directed edges with weights connect the nodes. Information from the outside world is encoded as vectors and received by the neural network model. For *d* inputs, the notation *x*(*d*) is used to designate these inputs.

The weights of each input are multiplied. The neural network relies on weights to help it solve a problem. Weight is typically used to represent the strength of the connections between neurons in a neural network.

The computing unit (artificial neuron) sums together all of the weighted inputs. If the weighted total is zero, a bias is added to make the result non-zero or to increase the system's responsiveness; for the bias term, both the weight and the input are fixed at 1.

The weighted sum can take any value from zero to infinity, so a threshold value is used to limit the response to the desired range, and an activation function *f*(*x*) is applied to pass the sum forward.

To obtain the desired result, the activation function is set to the transfer function. The activation function might be linear or nonlinear.

#### **4. Training Method of xNN for IoV**

This section provides a detailed description of the dataset, methodology, and performance metrics. We used two recent datasets of autonomous vehicular networks, i.e., UNSW-NB15 and CICIDS2017, which contain a mix of common and modern attacks. The complete flow of the methodology is shown in Figure 4 below.

**Figure 4.** The proposed workflow.

### *4.1. Dataset Description*

#### 4.1.1. UNSW-NB15

The UNSW-NB15 dataset tracks network intrusions and contains nine different types of attacks, including DoS, Worms, Backdoors, and Fuzzers, along with the corresponding network packets. There are 175,341 records in the training set and 82,332 records in the testing set, covering both attack and normal records. Table 2, given below, shows the dataset attributes, i.e., the ID, duration, protocols, state, flags, source and destination bytes, and packets. Attack is the output variable with multiple classes, i.e., DDoS, Backdoor attacks, Worms, and others.

The figure below shows the repartition and total counts of protocols, i.e., HTTP, FTP, FTP Data, SMTP, Pop3, DNS, SNMP, SSL, DHCP, IRC, Radius, and SSH.

Figure 5 shows the total number of categories of attacks present in the UNSW-NB15 dataset, i.e., Generic, Shell Code, DoS, Reconnaissance, Backdoor, Exploits, Analysis, Fuzzers, and Worms, while a total of 3500 instances were considered Normal.

**Figure 5.** Repartition of services in UNSW-NB15.

#### 4.1.2. CICIDS2019

Table 3 shows the attributes of the second dataset used in this study, CICIDS2019. This dataset contains numerous malicious attacks related to real-world anomalies that can be found in vehicular networks. The results of the network traffic analysis, produced with CICFlowMeter, include a time stamp, source and destination IPs, source and destination ports, protocols, and attacks, and the extracted feature definitions are also available. The data collection period lasted 5 days, from 9 a.m. on Monday, 3 July 2019, to 5 p.m. on Friday, 7 July 2019. Monday was a regular day with light traffic, while Infiltration, Botnet, and DDoS attacks were carried out on Tuesday, Wednesday, Thursday, and Friday mornings and afternoons.

Figure 5 shows the repartition of services in UNSW-NB15, Figure 6 exhibits the repartition of attack types, and Figure 7 below shows the distribution of the target variable, i.e., Attacks.

There has been long-term interest in anomaly detection across several research communities, yet advanced approaches are still needed to deal with complicated problems and obstacles. An important new direction, deep-learning-enabled anomaly detection (sometimes called "deep anomaly detection"), has developed in recent years. The suggested method is tested on these two recent datasets. The datasets are preprocessed so that deep-learning techniques can be applied to them. The homogeneity measure (via *K*-Means clustering) is used to select relevant features from both datasets in an unsupervised manner to improve the performance of the classifiers. The performance of the deep-learning models is estimated and improved via five-fold cross validation, and an Explainable Neural Network (xNN) is used to classify attacks.
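The five-fold cross validation mentioned above can be sketched as follows; the array shapes are arbitrary toy values, and a real pipeline would fit and score the model inside the loop:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 50 samples with 2 features and alternating binary labels.
X = np.arange(100).reshape(50, 2)
y = np.tile([0, 1], 25)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # A real pipeline would train on X[train_idx], y[train_idx]
    # and evaluate on X[test_idx], y[test_idx] here.
    fold_sizes.append(len(test_idx))
```

Each of the five folds holds out a disjoint 20% of the data, so every sample is used for evaluation exactly once.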

**Table 2.** UNSW-NB15 dataset description.


**Figure 6.** Repartition of attack types.

**Figure 7.** Target variable distribution in CICIDS2019.

**Table 3.** CICIDS2019 dataset description.


#### *4.2. Data Preprocessing*

The dataset is preprocessed to make it more appropriate for a neural network classifier.

#### 4.2.1. Removal of Socket Information

For impartial identification, it is necessary to delete the IP addresses of the source and destination hosts from the original dataset, since this information may cause the training to overfit toward the socket information. Rather than relying on the socket information, the classifier should be taught by the packet's characteristics, so that any host producing similar packet information is classified in the same way.
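Dropping the socket information amounts to removing the address and port columns before training; the column names below are hypothetical, as the actual datasets use their own field names:

```python
import pandas as pd

# Toy rows with hypothetical column names standing in for the
# dataset's real socket and flow fields.
df = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.9"],
    "src_port": [51234, 51235],
    "dst_port": [80, 443],
    "duration": [0.12, 0.34],
    "label": ["Normal", "DoS"],
})

# Remove socket columns so the classifier learns only from
# packet characteristics, not from which hosts were involved.
socket_cols = ["src_ip", "dst_ip", "src_port", "dst_port"]
df = df.drop(columns=socket_cols)
```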

#### 4.2.2. Remove White Spaces

When creating multi-class labels, white spaces may be included. Because the actual value then differs from the labels of other tuples in the same class, these white spaces result in separate classes.

#### 4.2.3. Label Encoding

The multi-class labels in the dataset are string values containing the names of the attacks. To teach the classifier which class each tuple belongs to, these values must be encoded numerically. This operation is applied to the multi-class labels only, as the binary labels are already in zero-one form.
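A minimal sketch of the white-space stripping and label encoding steps; the exact encoder used in the study is not specified, so `pandas.factorize` is shown here as one standard choice, with toy label values:

```python
import pandas as pd

# Toy multi-class labels, including a stray trailing space.
labels = pd.Series(["Normal", "DoS ", "Worms", "DoS", "Backdoor"])

labels = labels.str.strip()                  # remove white spaces first
codes, classes = pd.factorize(labels, sort=True)
# codes is an integer array; classes maps each code back to its name.
```

Stripping before encoding matters: without it, "DoS " and "DoS" would be assigned two different codes and silently split one class in two.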

#### 4.2.4. Data Normalization

The dataset contains a wide variety of numerical values, which presents a challenge to the classifier during training. Therefore, each attribute is rescaled so that its minimum and maximum values become zero and one, respectively. This gives the classifier more uniform values while still maintaining the relevance of each attribute's values.
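The min-max rescaling described above can be sketched as follows; the guard for constant columns is our own addition to keep the sketch robust:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column to [0, 1] via (x - min) / (max - min).
    Constant columns are mapped to 0 to avoid division by zero."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    return (X - mins) / span

# Toy usage: two attributes on very different scales.
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
Xn = min_max_normalize(X)   # each column now spans [0, 1]
```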

#### 4.2.5. Removal of Null and Missing Values

The CICIDS2017 dataset contains 2867 tuples with missing or infinity values. This was addressed in two ways, resulting in two datasets. In the second dataset, infinite values were replaced by maximum values, and missing values were replaced by averages. The proposed method was tested on both datasets. Only the attack information packets were used to evaluate the proposed approach, with the data packets representing normal network traffic from both sets being ignored.
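The second variant described above (infinities replaced by the column maximum, missing values by the column mean) can be sketched per column; the column name and values below are toy data:

```python
import numpy as np
import pandas as pd

# Toy column containing one infinity and one missing value.
df = pd.DataFrame({"flow_rate": [1.0, np.inf, 3.0, np.nan]})

col = df["flow_rate"]
is_inf = np.isinf(col)
finite = col.replace([np.inf, -np.inf], np.nan)  # keep only finite values
col = col.mask(is_inf, finite.max())   # infinities -> finite maximum
col = col.fillna(finite.mean())        # missing   -> mean of finite values
df["flow_rate"] = col
```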

#### 4.2.6. Feature Ranking

The preprocessed datasets are fed into the *K*-Means-clustering algorithm, which clusters each attribute individually to rank the attributes by importance before the entire dataset is clustered. For multi-class classification, *k* equals the number of attack classes in the dataset; in the binary case, the data points of a feature are clustered into two groups: normal and anomalous. To rank the attributes, the clusters' homogeneity score is computed, with higher homogeneity denoting higher class similarity across the objects inside each cluster. A high score indicates that an attribute is important for the classification, while a low score indicates that it is not. To calculate the similarity between the features, we first calculate the distance and then construct an objective function.

$$distance(C_j, p) = \sqrt{\sum_{i=1}^{d} \left(C_{j_i} - p_i\right)^2} \tag{4}$$

From Equation (4), we compute the distance between the *j*th cluster centroid and the data point *p* across the *d* dimensions, comparing the *j*th feature with the data point at each instance *i*. We then construct an objective function to minimize the distance to the cluster centroid and to check the homogeneity between the selected features.

$$Obj(C_j) = \sum_{m}^{p} \left[distance(C_j, p)\right]^2 \tag{5}$$

For feature ranking, we derived the objective function for the *j*th feature in Equation (5). It calculates the minimal distance of the centroid *C* from *p*, taking *m* as the starting point, to rank the best features.
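The per-feature ranking described above can be sketched with scikit-learn: cluster each feature on its own and score the resulting clusters' homogeneity against the labels. The function name and toy data are our own illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score

def rank_features(X, y, k, seed=0):
    """Cluster each feature individually with K-Means and score the
    clusters' homogeneity against the attack labels; higher scores
    mark features more useful for separating the classes."""
    scores = {}
    for j in range(X.shape[1]):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        clusters = km.fit_predict(X[:, j].reshape(-1, 1))
        scores[j] = homogeneity_score(y, clusters)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage: feature 0 separates the two classes, feature 1 is noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([y * 5.0 + rng.normal(0, 0.1, 100),
                     rng.normal(0, 1, 100)])
ranking = rank_features(X, y, k=2)
```

The informative feature should rank first with a homogeneity score near 1, while the noise feature scores near 0.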

#### **5. Results**

This section presents the implementation and results of the xNN model, applied to each of the selected datasets separately. Both datasets are publicly available [37,38]. For the experimental setup, we used Python with Jupyter on a GPU-based system with a processor above 3.2 GHz, which meets the minimal simulation requirements. In the first phase, we evaluated the model using the accuracy, precision, recall, and F1 score for the classification of the nine attacks in the UNSW-NB15 dataset; in the second phase, the model was evaluated on the CICIDS2019 dataset.
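The four evaluation metrics used throughout this section can be computed as follows; the label vectors are toy data, and macro averaging is shown as one common choice for multi-class attack labels (the paper's exact averaging scheme is not stated):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy multi-class labels standing in for attack classes.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

acc = accuracy_score(y_true, y_pred)
# Macro averaging weights every attack class equally, which matters
# when classes such as Worms are rare.
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```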

#### *5.1. Performance of xNN on UNSW-NB15*

Figure 8 shows the performance of the xNN model on UNSW-NB15 after applying the K-Means-clustering-based feature scoring method. In the figure, the *y* axis shows the percentage of accuracy, and the *x* axis shows the accuracy, precision, recall, and F1 score of xNN. It shows that the model is 99.7% accurate in classifying the attacks in the IoV-based dataset.

It can be seen from Figure 9 that, without feature scoring, the accuracy of xNN is 91.5%, which is less than the accuracy with feature scoring. In the figure, the *y* axis shows the percentage of accuracy, and the *x* axis shows the accuracy, precision, recall, and F1 score of xNN.

Figure 10 shows the confusion matrix with feature scoring, while Figure 11 shows the confusion matrix without feature scoring. It can be seen from Figure 10 that the true positive rate with feature scoring is much higher than without the feature scoring confusion matrix.

We also applied a Convolutional Neural Network and Long Short-Term Memory for the classification of attacks in order to compare our model with previous state-of-the-art models. xNN demonstrated promising accuracy and was the highest among the other deep-learning models. The comparison of deep-learning models for the classification of attacks in UNSW-NB15 is shown in Figure 12. In the figure, the *y* axis shows the percentage of accuracy, and the *x* axis shows the model's accuracy histogram.

#### *5.2. Performance of xNN on CICIDS2019*

Figure 13 shows the performance of the xNN model on CICIDS2019 after applying the K-Means-clustering-based feature scoring method; the model was 99.3% accurate in classifying the attacks in the IoV-based dataset. In Figures 13 and 14, the *y* axis shows the percentage of accuracy, and the *x* axis shows the model's accuracy histogram.

**Figure 8.** The performance of xNN on UNSW-NB15.

**Figure 9.** The performance of xNN on UNSW-NB15 without feature scoring.

**Figure 10.** Confusion matrix of xNN for UNSW-NB15 with feature scoring.

It can be seen from Figure 14 that, without feature scoring, the accuracy of xNN is 87.3%, which is lower than the accuracy with feature scoring. We also applied a Convolutional Neural Network and Long Short-Term Memory network for the classification of attacks in order to compare our model with previous state-of-the-art models; xNN demonstrated promising accuracy, the highest among the deep-learning models. The comparison of deep-learning models for the classification of attacks in CICIDS2019 is shown in the figure below, where the *y* axis shows the percentage of accuracy, and the *x* axis shows the model's accuracy histogram.

**Figure 11.** Confusion matrix of xNN for UNSW-NB15 without feature scoring.

**Figure 12.** Comparison of deep-learning models for the classification of attacks in UNSW-NB15.

Comparatively, our proposed xNN model performed well after the feature-scoring technique was applied. On Dataset 1 (UNSW-NB15), xNN achieved the highest accuracy of 99.7%, while CNN scored 87%, LSTM scored 90%, and DNN scored 92%; in the classification of attacks on the second dataset (CICIDS2019), xNN again scored the highest accuracy of 99.3%, with CNN at 87%, LSTM at 89%, and DNN at 82%. Tables 4 and 5 show the comparative analysis of the deep-learning models proposed in this study, confirming that xNN scored the highest accuracy and was a consistent model for the detection of intrusions on both datasets. Figures 15–17 show the confusion matrix of xNN for CICIDS2019 with feature scoring, the confusion matrix of xNN for CICIDS2019 without feature scoring, and the comparison of the deep-learning models on the CICIDS2019 dataset, respectively.

**Figure 13.** The performance of xNN on CICIDS2019.

**Figure 14.** The performance of xNN on CICIDS2019 without feature scoring.

We also compared our model with previous research. In this comparative analysis, we found that our proposed model scored the highest accuracy relative to several recent techniques.

**Figure 16.** Confusion matrix of xNN for CICIDS2019 without feature scoring.

**Figure 17.** Comparison of the deep-learning model on the CICIDS2019 dataset.


**Table 4.** Comparative analysis of the deep-learning models.

**Table 5.** Comparative analysis of previous studies.


#### **6. Conclusions**

One of the most difficult challenges is developing systems that can detect CAN message attacks as early as possible. Vehicle networks can be protected from cyber threats through the use of artificial-intelligence-based technology: when an intruder attempts to enter the autonomous vehicle, deep learning safeguards it. The CICIDS2019 and UNSW-NB15 datasets were used to evaluate our proposed security system. During preprocessing, categorical data were converted into numerical data, and K-Means clustering was used to determine which features were the most important.

Attack types in these datasets were detected using an Explainable Neural Network (xNN). The model's precision, recall, F1 score, and accuracy were all high, which is encouraging. After the application of the feature-scoring technique, our proposed xNN model outperformed the other models. On Dataset 1 (UNSW-NB15), xNN scored 99.7% accuracy, while CNN scored 87%, LSTM scored 90%, and DNN scored 92%. In the classification of attacks on the second dataset (CICIDS2019), xNN achieved the highest accuracy of 99.3%, followed by CNN with 87%, LSTM with 89%, and DNN with 82%.

With regard to accuracy in detection and classification, as well as real-time CAN bus security, the proposed approach outperformed the existing solutions considered in this study. Furthermore, this work can be extended to real-world scenarios, real-time controlled vehicles, and autonomous systems to protect against malicious attacks. Applying the high-performance xNN model to the protocol's data packages, analysed with the maximum-value treatment, would be preferable in the future for reducing and eliminating security attacks, such as those on the IoV.

**Author Contributions:** Data curation, S.A.; Funding acquisition, K.-H.L.; Investigation, S.A. and M.T.F.; Methodology, S.A.; Project administration, K.-H.L.; Resources, K.-H.L.; Software, A.M.A. and M.I.; Validation, K.N.H. and M.I.; Writing—original draft, S.A.; Writing—review & editing, A.M.A., K.-H.L., K.N.H. and L.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The work presented in this article is supported by Centre for Advances in Reliability and Safety (CAiRS) admitted under AIR@InnoHK Research Cluster.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **References**

