**2. Meta-Learning**

Meta-learning is a field of machine learning where advanced learning algorithms are applied to the data and the metadata of a given problem. The models "*learn to learn*" [7] from previous learning processes or previously completed classification tasks [8]. It is an advanced form of learning where computational models, which usually consist of multiple levels of abstraction, can improve their own learning ability. This is achieved by learning some or all of their own building blocks, through the experience gained in handling a large number of tasks. The building blocks that "*learn to learn*" can be optimizers, loss functions, initializations, learning rates, update functions and architectures.

In general, for real physical modeling situations, the labeled and unlabeled input patterns are derived from the same marginal distribution, or they follow a common cluster structure. Thus, labeled data can contribute to the learning process, while correspondingly useful information related to the exploration of the overall data structure can be extracted from the unlabeled data. This information can be combined with knowledge originating from prior learning processes or from previously completed classification tasks. Based on the above, *meta-learning* techniques can discover the structure of the data, allowing new tasks to be learned quickly. This is achieved by using different types of metadata, such as the properties of the learning problem, the properties of the algorithm used (e.g., performance measures) or the patterns derived from the data of a previous problem. The process employs knowledge from unknown cases sampled from a real-world distribution of examples, aiming to enhance the outcome of the learning task. In this way it is possible to learn, select, alter or combine different learning algorithms to effectively solve a given problem.

*Meta-learning* is achieved by conceptually dividing learning into two levels. The inner-most level acquires task-specific knowledge (e.g., fine-tuning a model on a new dataset), while the outer-most level acquires across-task knowledge (e.g., learning how to transfer to new tasks more efficiently).

If the inner-most level uses learnable components, the outer-most optimization process can meta-learn the parameters of those components, thereby enabling the automatic learning of the inner-loop elements.
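To make this two-level structure concrete, the following minimal sketch (an illustration of ours, not taken from the cited works) meta-learns a shared initialization over a family of toy quadratic tasks: the inner loop adapts task-specific parameters with a few gradient steps, while the outer loop improves the shared initialization by differentiating (here, numerically) through the inner adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A toy task: minimize (theta - c)^2 for a task-specific target c."""
    c = rng.uniform(-2.0, 2.0)
    loss = lambda theta: (theta - c) ** 2
    grad = lambda theta: 2.0 * (theta - c)
    return loss, grad

def inner_loop(theta0, grad, lr=0.1, steps=3):
    """Inner-most level: task-specific adaptation from the shared init theta0."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Outer-most level: meta-learn the shared initialization theta0 by
# differentiating (numerically) the post-adaptation loss across tasks.
theta0, meta_lr, eps = 0.0, 0.05, 1e-4
for _ in range(500):
    loss, grad = sample_task()
    post = lambda t0: loss(inner_loop(t0, grad))
    meta_grad = (post(theta0 + eps) - post(theta0 - eps)) / (2 * eps)
    theta0 -= meta_lr * meta_grad
print("meta-learned initialization:", theta0)
```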

A *meta-learning* system should combine the following three requirements [9,10]:

1. The system must include a learning subsystem which adapts with experience.
2. Experience should be gained by exploiting meta-knowledge extracted either from previous learning episodes on a single dataset, or from different domains or problems.
3. The learning bias must be chosen dynamically.
Employing a generic approach, a credible meta-learning model should be trained on a variety of learning tasks and optimized for the best generalization performance across tasks, including potentially unknown ones. Each task is associated with a dataset *D*, containing attribute vectors and class labels in a supervised learning problem. The optimal parameters of the model are [11]:

$$\theta^\* = \arg\min\_{\theta} \mathbb{E}\_{D \sim P(D)}[L\_{\theta}(D)] \tag{2}$$

This looks similar to an ordinary learning process, except that here each dataset is treated as a single data sample.

The dataset *D* is divided into two parts, a training (support) set *S* and a prediction set *B* used for validation and testing.

$$D = \langle \mathcal{S}, \mathcal{B} \rangle \tag{3}$$

The dataset *D* contains pairs of attribute vectors and labels, such that:

$$D = \{ (\mathbf{x}\_i, y\_i) \} \tag{4}$$

Each label belongs to a known label set *L*.

Let us consider a classifier *f*<sub>θ</sub> with parameters θ, which outputs the probability *P*<sub>θ</sub>(*y*|**x**) that a data point belongs to class *y*, given the attribute vector **x**. The optimal parameters should maximize the probability of identifying the true labels across multiple training batches *B* ⊂ *D*:

$$\theta^\* = \arg\max\_{\theta} \mathbb{E}\_{(\mathbf{x}, y) \in D} [P\_{\theta}(y|\mathbf{x})] \tag{5}$$

$$\theta^\* = \arg\max\_{\theta} \mathbb{E}\_{B \subset D} \left[ \sum\_{(\mathbf{x}, y) \in B} P\_{\theta}(y|\mathbf{x}) \right] \tag{6}$$
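The following sketch illustrates the objective of Equations (5) and (6) on synthetic data, assuming a linear softmax classifier; in practice the log-likelihood is maximized for numerical stability, and all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dataset D: 2-D attribute vectors, 3 classes, linear classifier.
X = rng.normal(size=(300, 2))
y = rng.integers(0, 3, size=300)
theta = rng.normal(scale=0.1, size=(2, 3))          # classifier parameters

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def batch_log_likelihood(theta, idx):
    """Mean log P_theta(y|x) over one training batch B ⊂ D (cf. Equation (6))."""
    probs = softmax(X[idx] @ theta)
    return np.log(probs[np.arange(len(idx)), y[idx]] + 1e-12).mean()

# Monte-Carlo estimate of the expectation over batches sampled from D.
batches = [rng.choice(len(X), size=32, replace=False) for _ in range(10)]
print(np.mean([batch_log_likelihood(theta, b) for b in batches]))
```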

The aim is to reduce the prediction error on data samples with unknown labels, given a small support set which can be used for "fast learning", i.e., for fine-tuning.


Fast learning is a trick which creates a "fake" dataset containing only a small subset of the labels (to avoid exposing all labels to the model). During the optimization process, various modifications of the learning procedure take place, aiming to achieve rapid learning.

A brief step-by-step description of the whole process is presented below [11]:

1. Sample a subset of labels, *L*<sub>\*</sub> ⊂ *L*.
2. Sample a support set *S*<sup>*L*</sup> ⊂ *D* and a training batch *B*<sup>*L*</sup> ⊂ *D*, both of which contain only data points with labels belonging to the sampled subset *L*<sub>\*</sub>.
3. The support set is used as part of the model input.
4. The final optimization uses the batch *B*<sup>*L*</sup> to compute the loss and update the model parameters through back-propagation.
In this way each sample pair (*S*<sup>*L*</sup>, *B*<sup>*L*</sup>) can be considered as one data point. The model is trained so that it can generalize to new, unknown datasets.

The following function (Equation (7)) is a modification of the supervised learning objective, with the symbols of the meta-learning process added:

$$\theta^\* = \arg\max\_{\theta} \mathbb{E}\_{L\_\* \subset L} \left[ \mathbb{E}\_{S^L \subset D,\, B^L \subset D} \left[ \sum\_{(\mathbf{x}, y) \in B^L} P\_{\theta}\big(y \,\big|\, \mathbf{x}, S^L\big) \right] \right] \tag{7}$$
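A minimal sketch of the sampling procedure behind Equation (7) and the "fake dataset" trick described above is given below; the 5-way/1-shot setting and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical labeled pool D = {(x_i, y_i)} with known label set L = {0, ..., 9}.
X = rng.normal(size=(1000, 8))
y = rng.integers(0, 10, size=1000)

def sample_episode(n_way=5, k_shot=1, n_queries=15):
    """Build one 'fake' few-shot task: sample a label subset L* ⊂ L, then a
    support set S^L and a prediction batch B^L restricted to those labels."""
    label_subset = rng.choice(10, size=n_way, replace=False)      # L* ⊂ L
    support, batch = [], []
    for lab in label_subset:
        idx = rng.permutation(np.flatnonzero(y == lab))
        support.extend(idx[:k_shot])                              # goes into S^L
        batch.extend(idx[k_shot:k_shot + n_queries])              # goes into B^L
    return np.array(support), np.array(batch)

S_idx, B_idx = sample_episode()
print(len(S_idx), "support points,", len(B_idx), "query points")
```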

There are three *meta-learning* modeling approaches, namely the *model-based*, the *metric-based* and the *optimization-based* approach, as presented below [11] and in Table 1:

**Table 1.** Meta-learning approaches.

| | Model-Based | Metric-Based | Optimization-Based |
|---|---|---|---|
| Key idea | RNN; memory | Metric learning | Gradient descent |
| How *P*<sub>θ</sub>(*y*\|**x**) is modeled | *f*<sub>θ</sub>(**x**, *S*) | Σ *k*<sub>θ</sub>(**x**, **x**<sub>*i*</sub>) *y*<sub>*i*</sub> | *P*<sub>*g*<sub>φ</sub>(θ, *S*)</sub>(*y*\|**x**) |

*k*<sub>θ</sub> is a kernel function which calculates the similarity between **x**<sub>*i*</sub> and **x**.
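As an illustration of the metric-based entry of Table 1, the sketch below predicts class probabilities as a kernel-weighted sum of the support labels; a fixed RBF kernel stands in for the learned kernel *k*<sub>θ</sub>, and all names are illustrative.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    """A fixed similarity kernel, standing in for the learned k_theta."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def metric_based_predict(x, support_x, support_y, n_classes):
    """P(y|x) as a kernel-weighted sum of support labels: sum_i k(x, x_i) y_i."""
    weights = np.array([rbf_kernel(x, xi) for xi in support_x])
    one_hot = np.eye(n_classes)[support_y]          # labels y_i as one-hot rows
    scores = weights @ one_hot                      # unnormalized class scores
    return scores / scores.sum()                    # normalize to probabilities

# Usage with a tiny hypothetical support set S:
support_x = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, -0.1]])
support_y = np.array([0, 1, 0])
print(metric_based_predict(np.array([0.05, 0.0]), support_x, support_y, 2))
```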

*Recurrent Neural Networks* (RNNs) which use only internal memory, as well as the *Long Short-Term Memory* (LSTM) approaches, are not considered meta-learning techniques [11]. Meta-learning can be achieved through a variety of learning paradigms; among them, *supervised gradient-based* learning can be considered the most effective [11]. More specifically, *gradient-based, end-to-end differentiable meta-learning* provides a wide framework for the application of effective *meta-learning* techniques.

This research proposes an *optimization-based, gradient-based, end-to-end differentiable* meta-learning architecture, based on an innovative evolution of the MAML algorithm [10]. MAML is one of the most successful and, at the same time, simplest optimization algorithms belonging to the meta-learning approach. One of its great advantages is that it is compatible with any model which learns through the Gradient Descent (GRD) method. It comprises the Base-Learner (BL) and the Meta-Learner (ML) models, with the second used to train the first. The weights of the BL are updated following the GRD method on the learning tasks of the k-shot problem, whereas the ML applies GRD to the initial weights of the BL across tasks [10].

Figure 1 depicts the MAML algorithm.

**Figure 1.** Model-Agnostic Meta-Learning algorithm.

It should be clarified that θ denotes the weights of the meta-learner, ∇*L*<sub>*i*</sub> denotes the gradient of the loss for task *i* in a meta-batch, and θ\*<sub>*i*</sub> are the optimal weights for each task. MAML is essentially an optimization procedure over a set of parameters, such that when a gradient step is taken with respect to a particular task *i*, the respective parameters θ\*<sub>*i*</sub> approach their optimal values. Therefore, the goal of this approach is to learn an internal representation which is broadly applicable to all tasks of a distribution *p*(*T*), rather than to a single one. This is achieved by minimizing the total loss across tasks sampled from the distribution *p*(*T*).

In particular, we have a base-model represented by a parametric function *f*<sub>θ</sub> with parameters θ, and a task *T*<sub>*i*</sub> ∼ *p*(*T*). After applying one step of Gradient Descent, a new parameter vector θ′<sub>*i*</sub> is obtained:

$$\theta\_i' = \theta - \alpha\nabla\_{\theta} L\_{T\_i}(f\_{\theta}) \tag{8}$$

We consider that only one GRD step is executed. The meta-learner then optimizes the initial parameters θ based on the performance of the adapted model *f*<sub>θ′<sub>*i*</sub></sub> on tasks sampled from *p*(*T*). Equation (9) is the meta-objective [10]:

$$\min\_{\theta} \sum\_{T\_i \sim p(T)} L\_{T\_i}\big(f\_{\theta\_i'}\big) = \min\_{\theta} \sum\_{T\_i \sim p(T)} L\_{T\_i}\big(f\_{\theta - \alpha\nabla\_{\theta} L\_{T\_i}(f\_{\theta})}\big) \tag{9}$$

The meta-optimization across tasks is performed with *Stochastic Gradient Descent* (SGD), which updates the parameters θ as follows:

$$\theta \gets \theta - \beta \nabla\_{\theta} \sum\_{T\_i \sim p(T)} L\_{T\_i}\big(f\_{\theta\_i'}\big) \tag{10}$$
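Equations (8)–(10) can be sketched in a few lines of PyTorch. The toy per-task loss below is a hypothetical stand-in for a real task loss; the key point is that the meta-gradient in Equation (10) is taken through the inner update of Equation (8), hence `create_graph=True`.

```python
import torch

alpha, beta = 0.01, 0.001                      # step sizes, as in the text
theta = [torch.randn(3, requires_grad=True)]   # meta-parameters theta

def task_loss(params):
    # Hypothetical stand-in for L_Ti; a real task would use its own data.
    target = torch.randn(3)
    return ((params[0] - target) ** 2).sum()

meta_loss = 0.0
for _ in range(4):                             # tasks T_i sampled from p(T)
    grads = torch.autograd.grad(task_loss(theta), theta, create_graph=True)
    theta_prime = [p - alpha * g for p, g in zip(theta, grads)]   # Equation (8)
    meta_loss = meta_loss + task_loss(theta_prime)                # Equation (9)

meta_grads = torch.autograd.grad(meta_loss, theta)  # gradient through Eq. (8)
with torch.no_grad():
    for p, g in zip(theta, meta_grads):
        p -= beta * g                               # Equation (10)
```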

It should be noted that the θ′<sub>*i*</sub> are not actually an additional set of free variables; their values are calculated by taking one (or more) Gradient Descent steps from θ with respect to task *i*. This step is known as *Inner Loop Learning (INLL)*, in contrast to the *Outer Loop Learning (OLL)*, which optimizes Equation (10). If, for example, we apply INLL to fine-tune θ for task *i*, then according to Equation (10) we are optimizing an objective with the expectation that the model will perform well on each task after the corresponding fine-tuning procedure.

The following Algorithm 1 is an analytical presentation of the MAML algorithm [10].

**Algorithm 1.** MAML.

**Require:** *p*(*T*): distribution over tasks
**Require:** α, β: step size hyperparameters
1: Randomly initialize θ
2: **while** not done **do**
3: Sample batch of tasks *T*<sub>*i*</sub> ∼ *p*(*T*)
4: **for all** *T*<sub>*i*</sub> **do**
5: Evaluate ∇<sub>θ</sub>*L*<sub>*T*<sub>*i*</sub></sub>(*f*<sub>θ</sub>) with respect to *K* examples
6: Compute adapted parameters with gradient descent: θ′<sub>*i*</sub> = θ − α∇<sub>θ</sub>*L*<sub>*T*<sub>*i*</sub></sub>(*f*<sub>θ</sub>)
7: **end for**
8: Update θ ← θ − β∇<sub>θ</sub> Σ<sub>*T*<sub>*i*</sub> ∼ *p*(*T*)</sub> *L*<sub>*T*<sub>*i*</sub></sub>(*f*<sub>θ′<sub>*i*</sub></sub>)
9: **end while**

**Figure 2.** Graphical representation of the MAML algorithm.
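A compact PyTorch sketch of Algorithm 1 follows, using the sinusoid-regression task family employed in [10]; the network size, step sizes, *K* and the meta-batch size of 4 are illustrative choices, not the exact settings of the cited work.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Base-model f_theta: a small MLP for 1-D regression.
net = nn.Sequential(nn.Linear(1, 40), nn.ReLU(),
                    nn.Linear(40, 40), nn.ReLU(), nn.Linear(40, 1))
alpha, beta, K = 0.01, 0.001, 10
meta_opt = torch.optim.Adam(net.parameters(), lr=beta)

def sample_task():
    """T_i ~ p(T): a sinusoid with random amplitude and phase."""
    A, phi = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * 3.14159
    return lambda x: A * torch.sin(x + phi)

def forward_with(params, x):
    """Run the MLP with explicit parameters (needed for the adapted theta')."""
    w1, b1, w2, b2, w3, b3 = params
    h = torch.relu(x @ w1.t() + b1)
    h = torch.relu(h @ w2.t() + b2)
    return h @ w3.t() + b3

for step in range(1000):                              # while not done do
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                                # sample batch of tasks
        f = sample_task()
        theta = list(net.parameters())
        x_s = torch.rand(K, 1) * 10 - 5               # K support examples
        loss = nn.functional.mse_loss(forward_with(theta, x_s), f(x_s))
        grads = torch.autograd.grad(loss, theta, create_graph=True)
        theta_prime = [p - alpha * g for p, g in zip(theta, grads)]   # line 6
        x_q = torch.rand(K, 1) * 10 - 5               # held-out query examples
        meta_loss = meta_loss + nn.functional.mse_loss(
            forward_with(theta_prime, x_q), f(x_q))
    meta_loss.backward()                              # line 8: outer (meta) update
    meta_opt.step()
```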

It should be clarified that the intermediate parameters θ′<sub>*i*</sub> are considered fast weights. The INLL takes all *N* gradient steps to produce the final estimate of the fast weights, and the outer learning loop then calculates the outer task loss *L*<sub>*T*<sub>*i*</sub></sub>(*f*<sub>θ′<sub>*i*</sub></sub>). However, although the inner learning loop performs *N* iterations, the MAML algorithm employs only the final weights to perform the OLL. This is a fairly significant problem, as it can create instability in learning when *N* is large.

The field of few-shot and zero-shot learning has recently seen substantial advances, most of which came from casting few-shot learning as a meta-learning problem. MAML is currently one of the best approaches for few-shot learning via meta-learning. It is a simple, general and effective optimization algorithm that does not place any constraints on the model architecture or the loss function. As a result, it can be combined with arbitrary networks and different types of loss functions, which makes it applicable to a variety of learning processes. However, it also has a number of issues: it is very sensitive to neural network architectures, often leading to instability during training; it requires arduous hyperparameter searches to stabilize training and achieve high generalization; and it is computationally expensive at both training and inference time.
