As discussed above, the goal of meta-learning is to train a model that fits a collection of tasks, such that the resulting meta-model works well on any task drawn from this collection. Assume we are given T tasks for training, with inputs denoted as x and outputs as y. The data of each task are divided into training data and test data; following [3], we refer to the training samples as the support set and to the test samples as the query set, so each task t consists of a support set and a query set. In this section, we revisit the classical hierarchical variational Bayes probabilistic model and propose an amortized Bayesian meta-learning scheme in which the variational posterior is implemented as a function of the task data.
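To make the task structure concrete, the following is a minimal sketch of a task collection with support and query sets (a toy regression setting with NumPy; the names `Task`, `support_x`, and `sample_sinusoid_task` are ours, not the paper's):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Task:
    """One meta-learning task: a support set for adaptation and a query set for evaluation."""
    support_x: np.ndarray  # inputs x of the support (training) samples
    support_y: np.ndarray  # outputs y of the support samples
    query_x: np.ndarray    # inputs x of the query (test) samples
    query_y: np.ndarray    # outputs y of the query samples

def sample_sinusoid_task(n_support=5, n_query=15, rng=np.random):
    """Toy task generator: each task is a sinusoid with its own amplitude and phase."""
    amplitude, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
    x = rng.uniform(-5.0, 5.0, size=(n_support + n_query, 1))
    y = amplitude * np.sin(x + phase)
    return Task(x[:n_support], y[:n_support], x[n_support:], y[n_support:])

tasks = [sample_sinusoid_task() for _ in range(20)]  # a collection of T = 20 tasks
```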
3.1. Probabilistic Model
According to the classical hierarchical variational Bayes theory discussed in [10], the marginal likelihood of a hierarchical Bayes model is given in Equation (1). Its generative process is shown as a directed graphical model in Figure 2, in which a global latent variable is shared across tasks. For each task t, a task-specific parameter is sampled from its prior, conditioned on the global latent variable. Hierarchical variational inference can be used to estimate a lower bound on this likelihood, given in Equation (2); the bound is the evidence lower bound (ELBO) of the log marginal likelihood, and a variational distribution with task-specific parameters is introduced to approximate the posterior over the task-specific parameters. Given the task data, variational inference solves the optimization problem stated in Equation (3).
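For reference, the standard hierarchical form that such equations typically take is sketched below in generic notation (we write θ for the global latent variable, φ_t for the task-specific parameter, and D_t for the data of task t; these symbols are our own choice and the exact parameterization in Equations (1)–(3) may differ):

```latex
% Marginal likelihood of the hierarchical Bayes model (cf. Equation (1)):
p(\mathcal{D}_{1:T}) = \int p(\theta) \prod_{t=1}^{T}
    \left[ \int p(\mathcal{D}_t \mid \phi_t)\, p(\phi_t \mid \theta)\, d\phi_t \right] d\theta .

% Hierarchical ELBO obtained by variational inference (cf. Equation (2)):
\log p(\mathcal{D}_{1:T}) \;\ge\;
  \mathbb{E}_{q(\theta)}\!\left[ \sum_{t=1}^{T}
    \Big( \mathbb{E}_{q(\phi_t;\lambda_t)}\big[\log p(\mathcal{D}_t \mid \phi_t)\big]
    - \mathrm{KL}\big(q(\phi_t;\lambda_t)\,\|\,p(\phi_t \mid \theta)\big) \Big) \right]
  - \mathrm{KL}\big(q(\theta)\,\|\,p(\theta)\big).

% Variational inference then maximizes this bound over the variational
% parameters \lambda_1,\dots,\lambda_T and the hyperparameters (cf. Equation (3)).
```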
Note that the optimization process in (3) incurs heavy computational costs, since it has to optimize T different variational distributions when T is large. To overcome this shortcoming, we introduce a transductive amortization network that takes the samples of a task as input and outputs the corresponding variational distribution. The idea of amortized variational inference (AVI) was first introduced in [26] as a trainable autoencoder that produces variational parameters for each data point, and it has since been extended to Bayesian variational inference [27]. We assume the variational distribution is a Dirac delta distribution centered at parameters obtained by stochastic gradient descent. The k-th stochastic gradient descent iterate is given in Equation (4), where the learning rate denotes the descent step size. Running this optimization up to the K-th step guarantees that the resulting variational parameters are a good approximation to their optimal values. In summary, the optimization process consists of two steps: an inner loop that adapts the task-specific variational parameters with K gradient descent iterations, and an outer loop that updates the shared meta-parameters, as sketched below.
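A minimal PyTorch-style sketch of this two-step process under the Dirac-delta assumption (the inner loop plays the role of the amortized variational update; `loss_fn` maps a parameter list and a data batch to a scalar tensor, and all names are illustrative rather than the paper's implementation):

```python
import torch

def inner_adapt(loss_fn, model_params, support_x, support_y, K=5, alpha=0.01):
    """Inner loop: K SGD steps from the meta-initialization produce the
    task-specific parameters (the center of the Dirac-delta posterior)."""
    lam = [p.clone() for p in model_params]
    for _ in range(K):
        loss = loss_fn(lam, support_x, support_y)
        grads = torch.autograd.grad(loss, lam, create_graph=True)  # keep graph for the meta-update
        lam = [p - alpha * g for p, g in zip(lam, grads)]
    return lam

def meta_step(loss_fn, model_params, tasks, meta_optimizer, K=5, alpha=0.01):
    """Outer loop: update the shared meta-parameters from the query-set loss
    evaluated at the adapted task-specific parameters."""
    meta_optimizer.zero_grad()
    meta_loss = 0.0
    for task in tasks:
        lam_t = inner_adapt(loss_fn, model_params, task.support_x, task.support_y, K, alpha)
        meta_loss = meta_loss + loss_fn(lam_t, task.query_x, task.query_y)
    meta_loss.backward()
    meta_optimizer.step()
```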
Let the loss function measure the discrepancy between the ground truth y and the predicted value for task t; the decomposition of its gradient with respect to the meta-parameters is given in Equation (5). Next, replacing each task-specific variational distribution in (3) with its amortized counterpart, the optimization problem becomes the one stated in Equation (6). In conclusion, the meta-model is optimized over the feature network f and the hyperparameters of the Bayes formulation, together with the transductive amortization network.
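Because the adapted task-specific parameters depend on the meta-parameters through the inner gradient steps, a gradient decomposition of this kind generally follows the chain rule; a sketch in the generic notation introduced above (not necessarily the exact form of Equation (5)):

```latex
% Chain-rule decomposition of the meta-gradient, where \lambda_t^{(K)} denotes
% the task-specific parameters after K inner SGD steps starting from the
% meta-parameters \theta:
\nabla_{\theta}\, \mathcal{L}\big(y_t, \hat{y}_t(\lambda_t^{(K)})\big)
  \;=\;
  \underbrace{\frac{\partial \lambda_t^{(K)}}{\partial \theta}}_{\text{through the $K$ descent steps}}
  \;\nabla_{\lambda_t^{(K)}}\, \mathcal{L}\big(y_t, \hat{y}_t(\lambda_t^{(K)})\big).
```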
3.2. Fast Transductive Inference with the Synthetic Inference Network
We now introduce the proposed synthetic inference network for fast transductive inference. A side-by-side comparison between MAML and our method is displayed in Figure 3. In MAML, the global parameter is initialized from its previous update, and the method optimizes for a representation that can quickly adapt to new tasks. In our method, we first introduce a transductive neural network that outputs an initial distribution for each task t. Next, we propose a trainable synthetic inference network (SIN), which takes task features as input and outputs the inferred parameters of a refined variational distribution. The refined latent variable is then formed by sampling from this distribution. Within a few iteration steps, it converges to a close approximation of its optimal value.
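The following is a minimal sketch of this two-stage inference, using a simple Gaussian parameterization purely for illustration (the paper's SIN outputs the normal and Wishart/Gamma parameters described below; class names, layer sizes, and the mean-pooling choice are our assumptions):

```python
import torch
import torch.nn as nn

class TransductiveAmortizer(nn.Module):
    """Maps a task's features to an initial Gaussian over the task-specific latent variable."""
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * latent_dim))
    def forward(self, task_features):              # (n_samples, feat_dim)
        h = self.net(task_features).mean(dim=0)    # pool over the task's samples (transductive)
        mu0, log_var0 = h.chunk(2)
        return mu0, log_var0

class SINRefiner(nn.Module):
    """Refines the initial distribution into the parameters of the task's variational posterior."""
    def __init__(self, latent_dim):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(2 * latent_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 2 * latent_dim))
    def forward(self, mu0, log_var0):
        mu, log_var = self.refine(torch.cat([mu0, log_var0])).chunk(2)
        return mu, log_var

def sample_latent(mu, log_var):
    """Reparameterized sample of the refined latent variable."""
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
```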
The details of SIN are given below. We let the meta-parameter consist of a mean component and a variance component. For each task-specific parameter, its weight follows the normal distribution given in Equation (7), and its prior is written in Equation (8), where the mixture weights follow a Dirichlet distribution with a symmetric prior. The remaining parameters denote the mean and precision matrix of the normal distribution, and the scale matrix and degrees of freedom of the Wishart distribution. Note that in the one-dimensional case, the Wishart distribution reduces to the Gamma distribution given in Equation (9).
Note that the exact posterior is a mixture of T distributions. We therefore use the standard variational Bayes (VB) approximation to the posterior, given in Equation (10), which is fit by maximizing the lower bound in Equation (11).
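For orientation, the standard densities involved typically take the following form (our notation; see Equations (7)–(9) for the paper's exact parameterization):

```latex
% Gaussian likelihood of a task-specific weight w_t (cf. Equation (7)):
w_t \sim \mathcal{N}(w_t \mid \mu, \Sigma).

% Dirichlet-weighted Normal--Wishart prior over the mean and precision (cf. Equation (8)):
\pi \sim \mathrm{Dir}(\alpha/T, \dots, \alpha/T), \qquad
(\mu, \Lambda) \sim \mathcal{N}\big(\mu \mid m, (\beta \Lambda)^{-1}\big)\,
                   \mathcal{W}(\Lambda \mid W, \nu).

% In one dimension, the Wishart density reduces to a Gamma density (cf. Equation (9)):
\mathcal{W}(\lambda \mid w, \nu) \;\propto\; \lambda^{(\nu-2)/2}\, e^{-\lambda/(2w)}
\;\;\Longleftrightarrow\;\;
\lambda \sim \mathrm{Gamma}\!\left(a = \tfrac{\nu}{2},\; b = \tfrac{1}{2w}\right)
\quad \text{(shape--rate parameterization).}
```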
For a detailed overview of the proposed meta-training with the SIN framework, we show a computational graph in Figure 1. A teacher–student network architecture is proposed for fast, convergent training of SIN. SIN is treated as an online trainable neural network with the stochastic gradient descent (SGD) network as its teacher network. Note that the learnable parameters of SIN are the inference parameters of the variational distribution, namely the parameters of the normal and Wishart distributions in (8). In fact, in the one-dimensional case, the Wishart distribution reduces to the Gamma distribution. Hence, the inference parameters consist of the mean and precision parameters of the normal distribution together with the shape (alpha) and rate (beta) parameters of the Gamma distribution, the latter being determined by the degrees of freedom and the scale of the corresponding Wishart distribution.
In our work, the T-way meta-parameters serve as learnable initializations. The transductive neural network takes the feature network output as its input and is treated as the weight initialization network. Each task-specific parameter is then passed through SIN to acquire its inferred variational parameters. The proposed network structure is displayed in Figure 4 as a brief overview. It follows a multi-task learning structure, in which the common network layers are the shared layers, while the last two separate layers output the two predicted groups of parameters (those of the normal distribution and those of the Gamma distribution), as sketched below.
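A minimal sketch of such a shared-trunk, two-head structure (layer sizes, activations, and the softplus positivity constraint are our choices; the real SIN outputs the normal-distribution parameters from one head and the Gamma parameters from the other):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SINHeads(nn.Module):
    """Multi-task style inference network: shared layers followed by two output heads."""
    def __init__(self, in_dim, latent_dim, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.normal_head = nn.Linear(hidden, 2 * latent_dim)  # mean and (log-)precision
        self.gamma_head = nn.Linear(hidden, 2 * latent_dim)   # Gamma alpha and beta

    def forward(self, task_features):
        h = self.shared(task_features)
        mean, log_prec = self.normal_head(h).chunk(2, dim=-1)
        # softplus keeps the Gamma shape/rate parameters strictly positive
        alpha, beta = [F.softplus(p) + 1e-6 for p in self.gamma_head(h).chunk(2, dim=-1)]
        return (mean, log_prec), (alpha, beta)
```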
Observe that SIN follows an online supervised learning scheme, in which its training data consist of task features paired with teacher targets. In Figure 1, one can see that these targets are acquired through variational Bayes inference with expectation maximization (VBEM). The EM method is introduced to maximize the lower bound in (11), where the teacher parameters are obtained from the SGD network with K descent steps. As for the loss function, we use the simple MSE loss between the SIN predictions and the teacher targets, weighted by a default weight factor.
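A sketch of one online teacher–student update under these assumptions (the SGD network's VBEM-inferred parameters act as the regression target for the two-head SIN sketched above; the mixing weight `gamma` and the target layout are illustrative, not the paper's exact loss):

```python
import torch

def sin_teacher_student_step(sin, sin_optimizer, task_features,
                             teacher_normal, teacher_gamma, gamma=0.5):
    """One online update of SIN: regress its two predicted parameter groups
    onto the teacher targets produced by the K-step SGD network (via VBEM)."""
    (mean, log_prec), (alpha, beta) = sin(task_features)
    pred_normal = torch.cat([mean, log_prec], dim=-1)
    pred_gamma = torch.cat([alpha, beta], dim=-1)
    loss = (gamma * torch.mean((pred_normal - teacher_normal) ** 2)
            + (1.0 - gamma) * torch.mean((pred_gamma - teacher_gamma) ** 2))
    sin_optimizer.zero_grad()
    loss.backward()
    sin_optimizer.step()
    return loss.item()
```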
A detailed description of the proposed amortized Bayesian meta-training is given in Algorithm 1. The algorithm starts by initializing the meta-models, namely the meta-parameters, the transductive amortization network, and the feature network f. For each task t, it computes the task-specific variational parameters via fast gradient descent steps, mixing SIN with the SGD network. It first uses the standard SGD network to obtain the adapted parameters; in parallel, the SGD network serves as a teacher network for training SIN. Once training is sufficiently advanced, the output variable sampled from SIN can reach the teacher's adapted parameters in fewer descent steps. After the T training tasks of an epoch have been processed, the transductive amortization network is updated together with the fine-tuned feature network f. Note that an online training loss threshold determines when the model can switch to the learned amortized network output for fast training. At the start of training, SIN adopts the teacher–student architecture for predicting the parameters reached after the k-th gradient descent step of the SGD teacher. It uses a linear artificial neural network (ANN) to learn the task-specific latent parameters, i.e., the parameters of the normal and Wishart distributions in Equation (8). These latent parameters represent the posterior distribution and are fit by maximizing the lower bound given in Equation (11). We first train the teacher and student networks jointly for a number of steps; once the student network has learned the representation of the teacher network, the student's inference is used directly to output the variational parameters. The teacher targets are obtained by applying the VBEM method to the SGD output. A new latent variable is then sampled from the SIN output and assigned as the initialization for an intermediate descent stage. By applying a modest number of descent steps, fewer than the full K, the learned latent variable of the meta-learner converges to the optimal point.
Algorithm 1 Amortized Bayesian meta-training with SIN.
- Require: the dataset; the learning rate; the iteration number K; the number of tasks T.
- 1: Initialize the meta-model parameters as in (8).
- 2: Initialize the transductive amortization network and the pre-trained feature network f.
- 3: Initialize the loss function for the T tasks of SIN.
- 4: while not converged do
- 5:   for t = 1 to T do
- 6:     Sample task t's data with its related support set and query set.
- 7:     Compute the initial task-specific parameters from the amortization network and the feature network output.
- 8:     if the online loss of SIN is below the threshold then
- 9:       Compute the parameters sampled from the output of SIN and assign the sampled value as the initialization.
- 10:      for a reduced number of descent steps do
- 11:        Update the task-specific parameters via (4) with the given learning rate.
- 12:      end for
- 13:    else
- 14:      for k = 1 to K do
- 15:        Update the task-specific parameters via (4) with the given learning rate.
- 16:      end for
- 17:    end if
- 18:    Implement the VBEM method to compute the inferred teacher parameters.
- 19:    Update SIN with teacher–student iterations weighted by the SIN weight factor.
- 20:  end for
- 21:  Update the meta-parameters and the transductive amortization network.
- 22:  Optionally, fine-tune f.
- 23: end while
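The key control flow of Algorithm 1 is the switch between the full K-step SGD teacher path and the shorter SIN-initialized path. A self-contained sketch of that switch under our earlier assumptions (the threshold name `eps` and the reduced step count `K_fast` are placeholders for the paper's exact settings, and the meta-gradient is omitted for brevity):

```python
import torch

def adapt_with_sin_switch(loss_fn, theta0, sin_mean, sin_loss,
                          support_x, support_y, K=5, K_fast=2, alpha=0.01, eps=1e-2):
    """If SIN's online loss is below the threshold eps, start from the SIN-predicted
    mean and use fewer descent steps; otherwise fall back to the meta-initialization
    with the full K teacher steps."""
    if sin_loss < eps:
        lam, steps = sin_mean.detach().clone(), K_fast   # trusted student: fast path
    else:
        lam, steps = theta0.detach().clone(), K          # teacher path: full SGD adaptation
    lam.requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(lam, support_x, support_y)
        (grad,) = torch.autograd.grad(loss, lam)
        lam = (lam - alpha * grad).detach().requires_grad_(True)
    return lam
```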
To summarize the architecture of the proposed ABML-SIN model, we provide its overall learning structure in Figure 5. The architecture of ABML-SIN is embedded into the meta-learner module. It helps the meta-learner quickly infer the learned latent posterior for task t from the corresponding prior. ABML-SIN provides an amortized synthetic inference network (ASIN) that uses the pre-trained student network for fast posterior output. For different tasks, the inference network outputs different posterior parameters related to the prediction results.