**1. Introduction**

Hyperspectral image analysis and classification is a timely field of Geoinformatics which has attracted much attention recently. This has led to the development of a wide variety of new approaches, exploiting both the spatial and the spectral content of images, in order to classify them optimally into discrete components related to specific standards. Typical information products obtained by these approaches serve diverse areas, namely: ground cover maps for environmental Remote Sensing; surface mineral maps used in geological applications; and vegetation species maps, employed in agricultural-geoscience studies and in urban mapping. Recent developments in optical sensor technology and Geoinformatics (GINF) provide multispectral, Hyperspectral (HyS) and panchromatic images at very high spatial resolution. Accurate and effective HyS image analysis and classification is one of the key applications which can enable the development of new decision support systems, providing significant opportunities for business, science and engineering in particular. Automatic assignment of a specific semantic label to each object of a HyS image (according to its content) is one of the most difficult problems of GINF Remote Sensing (RES).

With the available HyS resolution, subtle objects and materials can be extracted by HyS imaging sensors with very narrow diagnostic spectral bands. This can be achieved for a variety of purposes, such as detection, urban planning, agriculture, identification, surveillance and quantification. HyS image analysis enables the characterization of objects of interest (e.g., land cover classes) with unprecedented accuracy, and keeps inventories up to date. Improvements in spectral resolution have called for advances in signal processing and exploitation algorithms.

A Hyperspectral image is a 3D data cube, which contains two-dimensional spatial information (image features) and one-dimensional spectral information (spectral bands). Notably, the spectral bands occupy very fine wavelength intervals. Additionally, image features related to land cover and shape disclose the disparity and association among adjacent pixels from different directions at a given wavelength. HyS image analysis is important due to its vital applications in the design and management of soil resources, precision farming, complex ecosystem/habitat monitoring, biodiversity conservation, disaster logging, traffic control and urban mapping.

It is well known that increasing data dimensionality and high redundancy between features might cause problems during data analysis. There are many significant challenges which need to be addressed when performing HyS image classification. Primarily, supervised classification faces a challenge related to the imbalance between high dimensionality and the limited availability of training samples, or to the presence of mixed pixels in the data. Furthermore, it is desirable to integrate the essential spatial and spectral information, so as to combine the complementary features which stem from the source images.

Deep Learning methodologies have significantly contributed towards the evolution and development of HyS image analysis and classification [1]. Deep Learning (DL) is a branch of computational intelligence which uses a series of algorithms that model high-level abstractions in data using a multi-level processing architecture.

Despite their undoubtedly well-established functions and advantages, it is difficult for all Deep Learning algorithms to achieve satisfactory classification results with limited labeled samples. The approaches with the highest classification accuracy and generalization ability fall under the supervised learning umbrella. For this reason, especially in the case of Ultra-Spectral Images, huge datasets with hundreds or thousands of specimens labeled by experts are required [2]. This process is very expensive and time consuming.

In the case of supervised image classification, the input image is processed by a series of operations performed at different neuronal levels. Eventually, the output generates a probability distribution over all possible classes (usually using the *Softmax* function). Softmax is a function which takes an input vector *z* of *K* real numbers and normalizes it into a probability distribution consisting of *K* probabilities, proportional to the exponentials of the input numbers [3].

$$\sigma(z)\_j = \frac{e^{z\_j}}{\sum\_{i=1}^{K} e^{z\_i}}, \quad j = 1, \dots, K, \quad \text{where } \sigma: \mathbb{R}^K \to \mathbb{R}^K, \ z = (z\_1, \dots, z\_K) \in \mathbb{R}^K \tag{1}$$
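Eq. (1) can be sketched in a few lines; the max-subtraction step below is a standard numerical-stability trick that is not part of the formula itself but leaves its result unchanged:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax as in Eq. (1): maps a vector of K real numbers to a
    probability distribution proportional to the exponentials of
    the inputs."""
    # Subtracting the maximum avoids overflow for large inputs;
    # the ratio in Eq. (1) is unaffected by this shift.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)  # probabilities summing to 1, largest for the largest score
```

The largest input receives the largest probability, and the outputs always sum to one, which is what allows the final network layer to be read as class probabilities.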

For example, if we try to classify an image as *Lim*−*a*, or *Lim*−*b*, or *Lim*−*c*, or *Lim*−*d*, then we generate four probabilities for each input image, indicating the respective probabilities of the image belonging to each of the four categories. There are two important points to be mentioned here. First, during the training process, we require a large number of images for each class (*Lim*−*a*, *Lim*−*b*, *Lim*−*c*, *Lim*−*d*). Secondly, if the network is only trained on the above four image classes, then we cannot expect to test it on any other class; e.g., "*Lim*−*x*." If we want our model to classify such images as well, then we need to get many "*Lim*−*x*" images and to rebuild and retrain the model [3]. There are cases where we do not have enough data for each category, or where the number of classes is huge and dynamically changing; thus, the cost of data collection and periodic retraining is enormous, and a reliable solution should be sought. In contrast, *k*-shot learning is a framework within which the network is called upon to learn quickly and with few examples. During training, a limited number of examples from diverse classes, together with their labels, are introduced. The network is required to learn general characteristics of the problem, such as features which are either common to the samples of the same class, or unique features which differentiate and eventually separate the classes.

In contrast to the learning process of traditional neural networks, it is not sufficient for the network to learn good representations of the training classes, since the testing classes are distinct and are not presented during training. Instead, it is desirable to learn features which distinguish between classes in general.

The evaluation process consists of two distinct stages of the following format [4]:

Step 1: Given *k* examples (value of *k*-shot), if *k* = 1, then the process is called *one-shot*; if *k* = 5, *five-shot*, and so on. The parameter *k* represents the number of labeled samples given to the algorithm by each class. By considering these samples, which comprise the support set, the network is required to classify and eventually adapt to existing classes.

Step 2: Previously unseen examples of the labeled classes, distinct from those presented in the previous step, are presented in random order, and the network is called to classify them correctly. The set of examples used in this stage is known as the query set.

The above procedure is repeated many times, using random classes and examples which are sampled from the testing-evaluation set.
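The two evaluation steps above amount to repeatedly sampling "episodes", each consisting of a support set and a query set. A minimal sketch follows; the function name, the dictionary-based dataset layout and the toy data are illustrative assumptions, not part of the cited protocol:

```python
import random

def sample_episode(dataset, n_way, k_shot, q_queries):
    """Build one evaluation episode: a support set with k_shot labeled
    samples per class (Step 1) and a query set with q_queries further,
    unseen samples per class (Step 2).

    `dataset` maps a class label to its list of samples; this layout
    is an illustrative assumption."""
    classes = random.sample(list(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint samples so query items never appear in the support set.
        samples = random.sample(dataset[label], k_shot + q_queries)
        support += [(s, label) for s in samples[:k_shot]]
        query += [(s, label) for s in samples[k_shot:]]
    return support, query

# Toy dataset: 6 classes with 20 samples each.
data = {f"class_{c}": [f"s{c}_{i}" for i in range(20)] for c in range(6)}
support, query = sample_episode(data, n_way=5, k_shot=1, q_queries=3)
print(len(support), len(query))  # 5 15 for a 5-way one-shot episode
```

Averaging the network's accuracy over many such randomly sampled episodes gives the final evaluation score.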

As is immediately apparent from the description of the evaluation process, the task becomes more difficult as the number of classes increases, because the network has to decide between more alternatives. Likewise, *Zero-shot Learning* [5] is clearly more difficult than *one-shot*, which is more difficult than *five-shot*, and so on. Although humans have the ability to cope with this process, traditional ANNs require many more examples to generalize effectively and achieve the same degree of performance. The limitation of these learning approaches is that the model has access to a minimal number of samples from each class, and the validation process is performed by calculating the cross-entropy error on the test set. Specifically, in one-shot learning only one example of each candidate class is shown at the evaluation stage, while in *Zero-shot Learning* (ZsL) only meta-data is available.
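The cross-entropy error mentioned above can be computed per test sample as the negative log-probability assigned to the true class; a minimal sketch (with illustrative numbers, not taken from the cited evaluation) is:

```python
import numpy as np

def cross_entropy(probs: np.ndarray, true_idx: int) -> float:
    """Cross-entropy error of a predicted class distribution against
    the true label: -log of the probability given to the true class."""
    return float(-np.log(probs[true_idx]))

# Hypothetical Softmax output over four classes.
p = np.array([0.7, 0.1, 0.1, 0.1])
print(cross_entropy(p, 0))  # ~0.357: confident, correct prediction
print(cross_entropy(p, 2))  # ~2.303: true class received low probability
```

Averaging this quantity over the query set gives the test-set error used for validation: the more probability mass the model places on the correct class, the lower the score.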

Overall, *k-shot learning* is a prime example of a problematic area where specialized solutions are needed to design and train systems capable of learning very quickly from a small support set containing only 1–5 samples per class. These systems can offer strong generalization to a corresponding target set. A successful exploitation of the above k-shot learning cases is provided by meta-learning techniques, which can be used to deliver effective solutions [6].

In this work, we propose a new classification model based on the zero-shot philosophy, named MAME-ZsL. The significant advantages of the proposed algorithm are that it reduces computational cost and training time; it avoids potential overfitting by enhancing the learning of features which do not cause exploding or vanishing gradients; and it offers improved training stability, high generalization performance and remarkable classification accuracy. The superiority of the proposed model stems from the fact that the instances in the testing set belong to classes which were not contained in the training set, whereas the traditional supervised state-of-the-art Deep Learning models were trained with labeled instances from all classes. The performance of the proposed model was evaluated against state-of-the-art supervised Deep Learning models, and the presented numerical experiments provide convincing arguments regarding its classification efficiency.
