**3. Related Work**

Although the MAML algorithm and its variants do not use parameters other than those of the base-learner, network training is quite slow and computationally expensive, as it involves second-order derivatives. In particular, the meta-update of the MAML algorithm includes gradients nested in gradients, i.e., second-order derivatives, which significantly increases the computational cost. In order to solve the above problem, several approximation techniques have been proposed to accelerate the algorithm.

Finn et al. [12] extended MAML by ignoring the second-order derivatives when computing the gradient in the meta-update, a variant they called FOMAML (First-Order MAML).

More specifically, MAML optimizes the following objective:

$$\min\_{\theta} \mathbb{E}\_{T \sim p(T)} \Big[ L\_T \Big( \mathbb{U}\_T^k(\theta) \Big) \Big], \tag{11}$$

where $\mathbb{U}\_T^k$ denotes the operation that updates θ using *k* samples drawn from task *T*. This procedure employs the support set and the query set, so the optimization can be rewritten as follows:

$$\min\_{\theta} \mathbb{E}\_{T \sim p(T)} \left[ L\_{T,Q}(\mathbb{U}\_{T,S}(\theta)) \right] \tag{12}$$

Finally, MAML uses the chain rule to compute the gradient as follows:

$$g\_{\mathrm{MAML}} = \nabla\_{\theta} L\_{T,Q}(\mathbb{U}\_{T,S}(\theta)) = \mathbb{U}'\_{T,S}(\theta)\, L'\_{T,Q}(\widetilde{\theta}), \tag{13}$$

where $\widetilde{\theta} = \mathbb{U}\_{T,S}(\theta)$ and $\mathbb{U}'\_{T,S}$ is the Jacobian matrix of the update operation $\mathbb{U}\_{T,S}$. FOMAML treats $\mathbb{U}'\_{T,S}$ as the identity matrix, so it computes the following:

$$g\_{\mathrm{FOMAML}} = L'\_{T,Q}(\widetilde{\theta}) \tag{14}$$

The resulting method still computes the meta-gradient at the post-update parameter values $\widetilde{\theta}$, which makes it an effective post-learning method from a theoretical point of view. Moreover, experiments have shown that its performance is almost the same as that obtained with second-order derivatives. Most of the improvement in MAML comes from the gradients of the objective at the post-update parameter values, rather than from the second-order updates obtained by differentiating through the gradient update.

A different implementation employing first-order derivatives was studied and analyzed by Nichol et al. [13]. They introduced the Reptile algorithm, a variation of MAML that uses only first derivatives. The basic difference from FOMAML is that in the last step Reptile treats $\widetilde{\theta} - \theta$ as a gradient and feeds it into an adaptive algorithm such as Adam. Algorithm 2 presents Reptile.
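The core of the Reptile meta-update can be sketched in a few lines. This is a minimal NumPy sketch under illustrative settings, not the paper's implementation: each task is represented by a hypothetical gradient oracle `task_grad`, and the outer update uses plain SGD on the averaged $(\widetilde{\theta} - \theta)$ direction (the original work feeds it into Adam instead).

```python
import numpy as np

def reptile_step(theta, tasks, inner_lr=0.01, meta_lr=0.1, inner_steps=5):
    """One Reptile meta-update (first-order only; illustrative sketch).

    `tasks` is a list of hypothetical gradient oracles: each maps the current
    parameters to the gradient of that task's loss.
    """
    deltas = []
    for task_grad in tasks:
        phi = theta.copy()
        for _ in range(inner_steps):       # inner-loop SGD on one task
            phi -= inner_lr * task_grad(phi)
        deltas.append(phi - theta)         # (theta_tilde - theta) treated as a gradient
    # outer update: move toward the average adapted parameters
    theta += meta_lr * np.mean(deltas, axis=0)
    return theta
```

With quadratic toy tasks whose minima lie at different points, one meta-step moves θ toward the region from which all tasks can be adapted quickly, which is exactly the Reptile intuition.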


MAML also suffers from training instability, which can currently only be alleviated by arduous architecture and hyperparameter searches.

Antoniou et al. proposed an improved variant of the algorithm, called MAML++, which effectively addresses MAML's problems by providing much improved training stability and removing the dependency of training stability on the model's architecture. Specifically, Antoniou et al. [14] found that simply replacing max-pooling layers with strided convolutional layers makes network training unstable. Figure 3 clearly shows that in two of the three cases the original MAML appears unstable and irregular, while all three MAML++ models converge consistently and quickly, with much higher generalization accuracy and without any stability problems.

**Figure 3.** Stabilizing MAML.

It was estimated that the instability was caused by gradient degradation (gradient explosion or vanishing gradients) due to the depth of the network. Let us consider a typical four-layer Convolutional Neural Network (CNN) followed by a single linear layer. If we repeat the Inner Loop Learning N times, then the inference graph comprises 5N layers in total, without any skip connections.

Since the original MAML only uses the final weights for the Outer Loop Learning, the backpropagation algorithm has to go through all layers, which causes gradient degradation. To solve the above problem, the Multi-Step Loss (MSL) optimization approach was adopted. It eliminates the problem by calculating the external loss after each internal step, based on the outer loop update, as in Equation (15) below:

$$\theta = \theta - \beta \nabla\_{\theta} \sum\_{i=1}^{B} \sum\_{j=1}^{N} w\_{j} L\_{T\_{i}} \big( f\_{\theta\_{j}^{i}} \big) \tag{15}$$

where β is the Learning Rate; $L\_{T\_i}(f\_{\theta\_j^i})$ denotes the outer loss of task *i* when using the base-network weights after the *j*-th inner-step update; and $w\_j$ denotes the importance weight of the outer loss at step *j*.

The following, Figure 4, is a graphical display of the MAML++ algorithm, where the outer loss is calculated after each internal step and the weighted average is obtained at the end of the process.

In practice, all losses are initialized with equal contributions to the overall loss, but as iterations increase, the contributions of earlier steps are reduced while those of later steps gradually increase. This ensures that as training progresses, the final-step loss receives more attention from the optimizer, thereby ensuring that the lowest possible loss is achieved. If this annealing is not used, the final loss might be higher than the one obtained with the original formulation. Additionally, because the original MAML avoids the cost of second-order derivatives by ignoring them completely, the final generalization performance of the network is reduced. MAML++ solves this problem by using the *derivative-order annealing* method. Specifically, it employs first-order gradients for the first 50 epochs of training and then switches to second-order gradients for the rest of the training process. An interesting observation is that derivative-order annealing does not create incidents of exploding or vanishing gradients, so training is much more stable.
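The per-step loss-weight annealing described above can be sketched as follows. This is a simplified linear schedule with illustrative constants (`anneal_epochs`, `min_w` are assumptions, not the paper's exact rates): weights start equal and mass gradually shifts to the final inner step.

```python
def msl_weights(num_steps, epoch, anneal_epochs=100, min_w=0.03):
    """Per-step importance weights w_j for the Multi-Step Loss (Eq. 15).

    Starts from equal contributions and linearly shifts weight toward the
    final inner step as training progresses (illustrative schedule).
    """
    equal = 1.0 / num_steps
    # fraction of the annealing completed, capped at 1
    progress = min(epoch / anneal_epochs, 1.0)
    decay = progress * (equal - min_w)
    w = [equal - decay] * (num_steps - 1)   # earlier steps shrink toward min_w
    w.append(1.0 - sum(w))                  # final step absorbs the remainder
    return w
```

At epoch 0 all steps contribute equally; once annealing completes, the earlier steps keep only a small residual weight and the final step dominates, matching the behavior described in the text.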

Another drawback of MAML is the fact that it does not use *batch normalization statistic accumulation*; instead, only the statistics of the current batch are used. As a result, batch normalization is less effective, since the trained parameters must accommodate a variety of different means and standard deviations from different tasks. A naive remedy would accumulate running batch statistics across all stages of the Inner Loop Learning update, but this could cause optimization problems, and it could slow or stop optimization altogether. The problem stems from the erroneous assumption that the original model and all its updated iterations have similar feature distributions, so that the running statistics could be shared across all inner-loop updates of the network. Obviously, this assumption is not correct. A better alternative, which is employed by MAML++, is the storage of *per-step batch normalization statistics* and the use of *per-step batch normalization weights and biases* for each iteration of the inner loop. One more issue that affects generalization and convergence speed is the use of a *shared Learning Rate* for all parameters and all update steps of the learning process.

**Figure 4.** MAML++ visualization.

A shared Learning Rate requires multiple hyperparameter searches in order to find the right rate for a particular dataset, which is computationally expensive and time consuming. In addition, while gradient descent is an effective data-fitting tactic, a constant Learning Rate can easily lead to overfitting under the *few-shot regime*. An approach to avoid potential overfitting is the tuning of all learning factors in a way that maximizes generalization power rather than overfitting the data.

Li et al. [15] proposed a Learning Rate for each parameter in the core network where the internal loop was updated, as in the following equation (Equation (16)):

$$\theta' = \theta - \alpha \circ \nabla\_{\theta} L\_{T\_i}(f\_{\theta}), \tag{16}$$

where α is a vector of learnable parameters of the same size as θ, and ◦ denotes the element-wise product operation. No positivity constraint is placed on the Learning Rate (LER) α; therefore, the inner-update direction should not be expected to follow the gradient direction.

A clearly improved approach to the above process is suggested by MAML++ which employs *per-layer per-step* Learning Rates. For example, if it is assumed that the *core network* comprises L layers and the *Inner Loop Learning* consists of N stages of updating, then there are LN additional learnable parameters for the *Inner Loop Learning Rate*.
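The per-layer per-step scheme can be sketched as below. This is an illustrative NumPy sketch (function names and the initial rate are assumptions): an L × N table of rates is created, and the inner update for step *j* reads row-wise rates for each layer. In MAML++ these L × N values are themselves learned by the outer loop.

```python
import numpy as np

def init_inner_lrs(num_layers, num_steps, init=0.01):
    """L x N learnable inner-loop Learning Rates: one per layer per step."""
    return np.full((num_layers, num_steps), init)

def inner_update(params, grads, lrs, step):
    """Apply the step-specific, layer-specific rates to each layer's weights.

    `params` and `grads` are per-layer lists of arrays (illustrative shapes).
    """
    return [p - lrs[l, step] * g
            for l, (p, g) in enumerate(zip(params, grads))]
```

Compared to a single shared rate, only L × N extra scalars are introduced, which is negligible next to the base-network size, yet each layer can adapt at its own pace at each inner step.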

MAML uses the *Adam* algorithm with a constant LER to optimize the meta-objective. This means that more time is required to properly adjust the Learning Rate, which is a critical parameter for generalization performance. In contrast, MAML++ employs cosine annealing scheduling on the *meta-optimizer*, defined by the following Equation (17) [16].

$$\beta = \beta\_{\rm min} + \frac{1}{2} \left( \beta\_{\rm max} - \beta\_{\rm min} \right) (1 + \cos \left( \frac{T}{T\_{\rm max}} \pi \right)), \tag{17}$$

where β*min* denotes the minimum Learning Rate, β*max* denotes the initial Learning Rate, *T* is the current iteration number and *Tmax* is the maximum number of iterations. When *T* = 0, the LER β = β*max*; when *T* = *Tmax*, β = β*min*. In practice, *Tmax* is usually set to the total number of training iterations.
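Equation (17) is straightforward to compute; the sketch below uses illustrative default values for β*min* and β*max*.

```python
import math

def cosine_annealed_lr(t, t_max, beta_min=1e-5, beta_max=1e-3):
    """Equation (17): cosine-annealed meta-optimizer Learning Rate.

    Decays smoothly from beta_max at t = 0 to beta_min at t = t_max.
    """
    return beta_min + 0.5 * (beta_max - beta_min) * (1 + math.cos(math.pi * t / t_max))
```

The rate starts at β*max*, decreases slowly at first, drops fastest mid-schedule, and flattens out near β*min*, which lets the meta-optimizer take large steps early and fine-grained steps late in training.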

In summary, this particular MAML++ formulation enables its use in complex Deep Learning architectures, making it easier to learn more complex functions, such as loss functions, optimizers or even gradient computation functions. Moreover, the use of first-order derivatives offers a powerful pre-training method aiming to detect the parameters which are less likely to cause exploding or vanishing gradients. Finally, the learnable *per-layer per-step* LER technique avoids potential overfitting, while it significantly reduces the computational cost and time required to find a suitable Learning Rate throughout the process.

### **4. Design Principles and Novelties of the Introduced MAME-ZsL Algorithm**

As it has already been mentioned, the proposed MAME-ZsL algorithm employs MAML++ for the development of a robust *Hyperspectral Image Analysis* and *Classification* (HIAC) model, based on ZsL. The basic novelty introduced by the improved MAME-ZsL model, is related to a neural network with Convolutional (CON) filters, comprising very small receptive fields of size 3 × 3.

The Convolutional stride and the spatial padding were set to 1 pixel. Max-pooling was performed over 3 × 3 pixel windows with a stride equal to three. All of the CON layers were developed using the Rectified Linear Unit (ReLU) nonlinear Activation Function (ACF), except for the last layer, where the Softmax ACF [3] (Equation (18)) was applied, as it performs better on multi-class classification problems like the one under consideration.

$$\sigma\_{j}(z) = \frac{e^{z\_j}}{\sum\_{k=1}^{K} e^{z\_k}}, \; j = 1, \dots, K \tag{18}$$

The Sigmoid approach offers better results in binary classification tasks. In the Softmax, the probabilities sum to 1, which is not the case for the Sigmoid. Moreover, in the Softmax the largest input value receives a higher probability than the others, while in the Sigmoid the largest value is expected to have a high probability, but not necessarily the highest one.
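The contrast between the two activations is easy to verify numerically. A minimal NumPy sketch of Equation (18) next to the Sigmoid (the max-shift in `softmax` is a standard numerical-stability trick, not part of the equation):

```python
import numpy as np

def softmax(z):
    """Equation (18); shifting by max(z) avoids overflow in exp."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    """Element-wise logistic function for comparison."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))
```

For an input such as `[2.0, 1.0, 0.5]`, the Softmax outputs sum exactly to 1 and assign the largest probability to the largest logit, while the element-wise Sigmoid outputs are independent of each other and their sum exceeds 1.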

The fully Convolutional Neural Network (CNN) was trained based on the novel *AdaBound* algorithm [17] which employs dynamic bounds on the Learning Rate and it achieves a smooth transition to stochastic gradient descent. Algorithm 3 makes a detailed presentation of the AdaBound [17]:

**Algorithm 3.** The AdaBound algorithm.

**Input:** *x*1 ∈ *F*, initial step size α, {β1*t*}*T t*=1, β2, lower bound function η*l*, upper bound function η*u*
1: Set *m*0 = 0, *u*0 = 0
2: **for** *t* = 1 **to** *T* **do**
3: *gt* = ∇*ft*(*xt*)
4: *mt* = β1*t* *mt*−1 + (1 − β1*t*) *gt*
5: *ut* = β2 *ut*−1 + (1 − β2) *gt*² and *Vt* = diag(*ut*)
6: ηˆ*t* = Clip(α/√*Vt*, η*l*(*t*), η*u*(*t*)) and η*t* = ηˆ*t*/√*t*
7: *xt*+1 = Π*F*, diag(η*t*−1) (*xt* − η*t* ⊙ *mt*)
8: **end for**

Compared to other methods, AdaBound has two major advantages. For methods with a hard switch from Adam to SGD, it is uncertain whether there exists a fixed turning point at which the switch should happen, and the extra hyperparameter that determines the switching time is not easy to fine-tune. AdaBound addresses this problem with a continuous transformation procedure rather than a "hard" switch. Moreover, it has a higher convergence speed than stochastic gradient descent methods. Finally, it overcomes the poor generalization ability of adaptive models, as it uses dynamic bounds on the LER, aiming towards higher classification accuracy.
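Algorithm 3 can be sketched for the unconstrained case as below (the projection Π is omitted). The bound functions follow the defaults suggested in [17], where both η*l*(*t*) and η*u*(*t*) converge to a final rate, giving the smooth Adam-to-SGD transition; all constants here are illustrative assumptions.

```python
import numpy as np

def adabound_update(x, grad_fn, T, alpha=0.01, beta1=0.9, beta2=0.999,
                    final_lr=0.1, gamma=1e-3, eps=1e-8):
    """Minimal unconstrained AdaBound sketch (Algorithm 3, steps 3-7).

    eta_l(t) and eta_u(t) both tend to final_lr as t grows, so the clipped
    adaptive rate gradually collapses to an SGD-like constant rate.
    """
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        eta_l = final_lr * (1 - 1 / (gamma * t + 1))   # lower bound
        eta_u = final_lr * (1 + 1 / (gamma * t))       # upper bound
        eta = np.clip(alpha / (np.sqrt(v) + eps), eta_l, eta_u) / np.sqrt(t)
        x = x - eta * m
    return x
```

On a convex toy problem such as minimizing (x − 3)², the iterates move steadily toward the minimum while the clipped rate narrows over time.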

The selection of the appropriate hyperparameters for the proposed method was based on restriction settings and configurations which should take into account the different decision boundaries of the classification problem. For example, the obvious choice of the classifiers with the smallest error on the training data is considered improper for generating a classification model. The performance on a training dataset, even when cross-validation is used, may be misleading when previously unseen data vectors are used. In order for the proposed process to be effective, individual hyperparameters were chosen. They not only display a certain level of diversity, but they also use different operating functions, thus allowing different decision boundaries to be created and combined in such a way that the overall error can be reduced.

In general, the selection of features was based on a heuristic method which considers the way the proposed method faces each situation. For instance:


### **5. Application of the MAME-ZsL in Hyperspectral Image Analysis**

Advances in artificial intelligence, combined with the extended availability of high quality data and advances in both hardware and software, have led to serious developments in the efficient processing of data related to the GeoAI field (*Artificial Intelligence and Geography*/*Geographic Information Systems*). Hyperspectral Image Analysis for efficient and accurate object detection using Deep Learning is one of the timely topics of GeoAI. The most recent research examples include detection of soil characteristics [18], detailed ways of capturing densely populated areas [19], extracting information from scanned historical maps [20], semantic point sorting [21], innovative spatial interpolation methods [22] and traffic forecasting [23].

Similarly, modern applications of artificial vision and imaging (IMG) systems significantly extend the distinctive ability of optical systems, both in terms of spectral sensitivity and resolution. Thus, it is possible to identify and differentiate spectral and spatial regions which, although having the same color appearance, are characterized by different physico-chemical and/or structural properties. This differentiation is based on the emerging spatial diversification, which is detected by observing in narrow spectral bands, within or outside the visible spectrum.

Recent technological developments have made it possible to combine IMG (spatial variation in RGB resolution) and spectroscopy (spectral analysis in spatially emitted radiation) in a new field called "Spectral Imaging" (SIM). In the SIM process, the intensity of light is recorded simultaneously as a function of both wavelength and position. The dataset corresponding to the observed surface contains a complete image, different for each wavelength. In the field of spectroscopy, a fully resolved spectrum can be recorded for each pixel of the spatial resolution of the observation field. The multitude of spectral regions which the IMG system can manage determines the difference between multispectral (tens of regions) and Hyperspectral (hundreds of regions) Imaging. The key element of a typical spectral IMG system is the monochromatic image sensor (monochrome camera), which can be used to select the desired observation wavelength.

It can be easily perceived that the success of a sophisticated Hyperspectral Analysis System (HAS) is a major challenge for DL technologies, which use a series of algorithms attempting to model data characterized by a high level of abstraction. HASs use a multi-level processing architecture, which is based on sequential linear and non-linear transformations. Despite their undoubtedly well-established and effective approaches and their advantages, these architectures depend on training with huge datasets which include multiple representations of images of the same class. Considering the multitude of classes which may be included in a Hyperspectral image, we realize that this process is so incredibly time consuming and costly that it can sometimes be impossible to run [1].

The ZsL method was adopted based on a heuristic [24], hierarchical parameter search methodology [25]. It is part of a family of learning techniques which exploit data representations to interpret and derive the optimal result. This methodology uses distributed representation, the basic premise of which is that the observed data result from the interactions of factors which are organized in layers. A fundamental principle is that these layers correspond to levels of abstraction or composition based on their quantity and size.

Fine-Grained Recognition (FIG\_RC) is the task of distinguishing between visually very similar objects, such as identifying the species of a bird, the breed of a dog or the model of an aircraft. FIG\_RC, which aims to identify the type of an object among a large number of subcategories [26], is an emerging application, as increasing image resolution exposes new details in image data. Traditional fully supervised algorithms fail to handle this problem, where there is low between-class variance and high within-class variance for the classes of interest and the sample sizes are small. The experiments in [26] show that the proposed fine-grained object recognition model achieves only 14.3% recognition accuracy for classes with no training examples, which is only slightly better than the random-guess accuracy of 6.3%. Another method [27] automatically creates a training dataset from a single degraded image and trains a denoising network without any clean images. However, at high noise levels this method shows the same performance as the optimization-based method.

Hu et al. [28] proposed a time-consuming and resource-dependent model which learns to perform zero-shot classification, using a meta-learner that is trained to produce corrections to the output of a previously trained learner. The model consists of a Task Module (TM) which supplies an initial prediction, and a Correction Module (CM) which updates the initial prediction. The TM is the learner and the CM is the meta-learner. The correction module is trained in an episodic approach, whereby many different task modules are trained on various subsets of the total training data, with the rest being used as unseen data for the CM. The correction module takes as input a representation of the TM's training data to perform the predicted correction. The correction module is trained to update the task module's prediction to be closer to the target value.

In addition, [29] proposes the use of the visual space as the embedding space. In this space, the subsequent nearest-neighbor search suffers much less from the hubness problem and becomes more effective. This model design also provides a natural mechanism for multiple semantic modalities (e.g., attributes and sentence descriptions) to be fused and optimized jointly in an end-to-end manner. However, only the statistics of the current environment are used, and the trained process must include a variety of different statistics from different tasks and environments.

Additionally, [30] proposes a very promising approach with high accuracy, but the model is characterized by high bias. In the case of image classification, various types of spatial information can be extracted and used, such as edges, shapes and associated color areas. As they are organized into multiple levels, they are hierarchically separated into levels of abstraction, creating the conditions for selecting the most appropriate features for the training process. Utilizing the above processes, ZsL inspires and simulates the functions of human visual perception, where multiple functional levels and intermediate representations are developed, from capturing an image on the retina to responding to stimuli. This function is based on the conversion of the input representation to a higher-level one, as performed by each intermediate unit. High-level features are more general and invariant, while low-level ones help to categorize inputs. Their effectiveness is interpreted on the basis of the "*universal approximation theorem,*" which deals with the ability of a neural structure to approximate continuous functions, and the probabilistic inference which considers the activation of nonlinearity as a function of cumulative distribution. These are related to the concepts of optimization and generalization, respectively [25].

Given that in deep neural networks, each hidden level trains a distinct set of features, coming from the output of the previous level, further operation of this network enables the analysis of the most complex features, as they are reconstructed and decomposed from layer to layer. This hierarchy, as well as the degradation of information, while increasing the complexity of the system, also enables the handling of high-dimensional data, which pass through non-linear functions. It is thus possible to discover unstructured data and to reveal a latent structure in unmarked data. This is done in order to handle more general problematic structures, even discerning the minimal similarities or anomalies they entail.

Specifically, since the aim was the design of a system with zero samples from the target class, the proposed methodology used the intermediate representations extracted from the rest of the image samples. This was done in order to find the appropriate representations to be used in order to classify the unknown image samples.

To increase the efficiency of the method, bootstrap sampling was used, in order to train different subsets of the dataset in the most appropriate way. Bootstrap sampling is the process of repeatedly drawing random samples with replacement from the dataset. Each sample is used to build a separate model, and the results of each model are aggregated by "voting": for each input vector, each classifier predicts the output variable, and the value with the most "votes" is selected as the response for that particular vector. This methodology, which belongs to the ensemble methods, is called bagging, and it has many advantages, such as reducing variance and avoiding overfitting, as shown in Figure 5 below [31].

**Figure 5.** Bagging (bootstrap sampling) (https://www.kdnuggets.com/).
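The bagging procedure just described can be sketched as follows. This is a toy NumPy sketch: `train_stump` is a deliberately simple, hypothetical base learner (a one-feature threshold rule) standing in for any classifier, and the vote is a plain majority over bootstrap-trained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y):
    """Toy base learner: threshold halfway between the two class means."""
    thr = (X[y == 0].mean() + X[y == 1].mean()) / 2
    return lambda Z: (Z > thr).astype(int)

def bagging_fit_predict(X, y, Z, n_models=11):
    """Train n_models stumps on bootstrap samples, then majority-vote on Z."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        models.append(train_stump(X[idx], y[idx]))
    votes = np.stack([m(Z) for m in models])         # (n_models, n_test)
    return (votes.mean(axis=0) > 0.5).astype(int)    # majority vote
```

Because each model sees a slightly different resample, their individual errors partially cancel in the vote, which is the variance-reduction effect described above.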

The ensemble approach was selected for this research due to the particularly high complexity of the examined ZsL task and due to the fact that the prediction results were highly volatile. This can be attributed to the sensitivity of the correlational models to the data, and to the complex relationship which describes them. The ensemble function of the proposed system offers a more stable model with better prediction results, because the overall behavior of a multiple model is less noisy than that of a corresponding single one. This reduces the overall risk of a particularly bad choice.

It is important to note that in Deep Learning, the training process is based on analyzing large amounts of data. The research and development of neural networks is flourishing thanks to recent advancements in computational power, the discovery of new algorithms and the increase in labeled data.

Neural networks typically take longer to run, as an increase in the number of features or columns in the dataset also increases the number of hidden layers. Specifically, a single affine layer of a neural network without any non-linearities/activations is practically the same as a linear model; here we refer to deep neural networks with multiple layers and activation functions (non-linearities such as ReLU, ELU, tanh, Sigmoid). All of the nonlinearities and multiple layers introduce a nonconvex and usually rather complex error space, which means there are many local minima to which the training of the deep neural network can converge. Consequently, many hyperparameters have to be tuned in order to reach a place in the error space where the error is small enough for the model to be useful. The number of hyperparameters can start from 10 and reach 40 or 50; they are dealt with via Bayesian optimization using Gaussian processes, which still does not guarantee good performance. Training is very slow, and adding hyperparameter tuning makes it even slower, whereas a linear model would be much faster to train. This introduces a serious cost-benefit tradeoff. On the other hand, a trained linear model has interpretable weights which give the data scientist useful information about the roles of the various features in the task at hand.

Modern frameworks like TensorFlow or Theano perform execution of neural networks on GPU. They take advantage of parallel programming capabilities for large array multiplications, which are typical of backpropagation algorithms.

The proposed Deep Learning model is a quite resource-demanding technology. It requires powerful, high-performance graphics processing units and large amounts of storage to train the models. Furthermore, this technology needs more time to train in comparison with traditional machine learning. Another important disadvantage of any Deep Learning model is that it is incapable of providing arguments about why it has reached a certain conclusion. Unlike in the case of traditional machine learning, you cannot follow an algorithm to find out why your system has decided that there is a tree in a picture and not a tile. To correct errors in Deep Learning, you have to revise the whole algorithm.

### **6. Description of the Datasets**

The datasets used in this research include images taken by the Reflective Optics System Imaging Spectrometer (ROSIS). More specifically, the Pavia University and Pavia Centre datasets were considered [32]. Both datasets came from the ROSIS sensor during a flight campaign over Pavia in northern Italy. The number of spectral bands is 102 for Pavia Centre and 103 for Pavia University. The Pavia Centre and Pavia University images have a resolution of 1096 × 1096 pixels and 610 × 610 pixels, respectively. The hyperspectral imagery consists of 115 spectral channels ranging from 430 to 860 nm, of which only 102 were used in this research, as 13 were removed due to noise. Rejected samples, which in both cases contain no information (including black bars), can be seen in the following figure (Figure 6) below.

(**a**) Sample band of Pavia Centre dataset (**b**) Ground truth of Pavia Centre dataset

(**c**) Sample band of Pavia University dataset (**d**) Ground truth of Pavia University dataset

**Figure 6.** Noisy bands in Pavia Centre and University datasets.

The available samples were scaled down, so that every image has a resolution of 610 × 610 pixels and a geometric resolution of 1.3 m. In both datasets, the basic points of the image belong to nine categories which are mainly related to land-cover objects. The Pavia University dataset includes nine classes and, in total, 46,697 cases. The Pavia Centre dataset comprises nine classes with 7456 cases, and the first seven classes (*Asphalt, Meadows, Trees, Bare Soil, Self-Blocking Bricks, Bitumen, Shadows*) are common to both datasets [33].

The Pavia University dataset was divided into training, validation and testing sets, as is presented in the following table (Table 2) [32].



The Pavia Center dataset was divided into training, validation and testing sets, as is presented analytically in the following table (Table 3) [32].


**Table 3.** Pavia Center dataset.

### *Metrics Used for the Assessment of the Modeling Effort*

The following metrics were used for the assessment of the modeling effort [33,34]:

(a) Overall Accuracy (OA): This index represents the number of samples correctly classified, divided by the number of testing samples.

(b) Kappa Statistic: This is a statistical measure which provides information on the level of agreement between the truth map and the final classification map. It is the percentage of agreement corrected for the agreement which could be expected to occur by chance. In general, it is considered to be a more robust index than a simple percent-agreement calculation, since *k* takes into account the agreement occurring by chance. It is a popular measure for benchmarking classification accuracy under class imbalance. It is used in static classification scenarios and for streaming data classification. Cohen's kappa measures the agreement between two raters, where each classifies *N* items into *C* mutually exclusive categories. The definition of κ is [35,36]:

$$\kappa = \frac{p\_0 - p\_\varepsilon}{1 - p\_\varepsilon} = 1 - \frac{1 - p\_0}{1 - p\_\varepsilon},\tag{19}$$

where *po* is the relative observed agreement among raters (identical to accuracy), and *pe* is the hypothetical probability of chance agreement. The observed data are used to calculate the probability of each observer randomly seeing each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by *pe*), then κ ≈ 0.
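Equation (19) can be computed directly from a confusion matrix, as in the minimal sketch below (the function name is illustrative): *po* is the normalized trace and *pe* comes from the row and column marginals.

```python
import numpy as np

def cohens_kappa(y_true, y_pred, num_classes):
    """Equation (19): kappa = (p0 - pe) / (1 - pe) from the confusion matrix."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p0 = np.trace(cm) / n                                  # observed agreement
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2    # chance agreement
    return (p0 - pe) / (1 - pe)
```

Perfect agreement yields κ = 1, while a classifier that always predicts the majority class scores κ = 0 even when its raw accuracy looks reasonable, which is precisely why kappa is preferred under class imbalance.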

The Kappa Reliability (KR) can be considered the outcome of the data editing, allowing the retention of the most relevant data for the upcoming forecast. A detailed analysis of the KR is presented in the following Table 4.



(c) McNemar test: The McNemar statistical test was employed to evaluate the significance of the classification accuracy derived from different approaches [31]:

$$z\_{12} = \frac{f\_{12} - f\_{21}}{\sqrt{f\_{12} + f\_{21}}} \tag{20}$$

where *fij* is the number of samples correctly classified in classification *i* and incorrectly classified in classification *j*. McNemar's test is based on the standardized normal test statistic; therefore, the null hypothesis of "no significant difference" is rejected at the widely used *p* = 0.05 (|*z*| > 1.96) level of significance.
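Equation (20) and the decision rule above can be sketched as follows (function names are illustrative): only the two discordant counts are needed.

```python
import math

def mcnemar_z(f12, f21):
    """Equation (20): standardized McNemar statistic from the discordant
    counts f12 (correct in classifier 1, wrong in 2) and f21 (the reverse)."""
    return (f12 - f21) / math.sqrt(f12 + f21)

def significantly_different(f12, f21, z_crit=1.96):
    """Reject the null hypothesis of no difference at p = 0.05."""
    return abs(mcnemar_z(f12, f21)) > z_crit
```

For example, discordant counts of 30 and 10 give z = 20/√40 ≈ 3.16 > 1.96, so the two classifiers differ significantly, whereas counts of 12 and 10 give z ≈ 0.43 and the difference is not significant.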
