1. Introduction
With the daily emergence of novel categories of objects in nature, such as new species of fish and birds, it is essential to construct classification models for these new categories. Traditional models can only classify well when a sufficient number of instances of each class is available for training, which is not an option for classes absent from the training data. In addition to the cost of collecting large amounts of data for new categories, annotating these data represents a significant expense. Furthermore, a new training set must be constructed and the model retrained for each new category added to the dataset, which can be impractical. To address these problems, ref. [1] presented for the first time a method called zero-shot learning (ZSL) to discriminate instances that do not participate in training, thus improving the generalizability of classification models.
ZSL relates new category instances to known category instances using semantic information, e.g., describing a zebra (assuming it is an unknown animal) as "shaped like a horse and colored like a panda". Such information can be obtained in several ways, including manually defined semantic attributes that describe the semantic features of instances [1,2] or vector embeddings of class names produced by related models [3]. In both approaches, a feature vector describes the semantic information of each class, and the space these vectors span is called the semantic space. Each category in this space is represented by a single vector, unlike the visual feature space, where every instance of every category is represented by its own visual feature vector. Semantic vectors for both seen and unseen classes are always available for training or testing, but only seen-class instances are involved in training. The objective of ZSL is to discover the implicit relationships between seen and unseen classes and use them to transfer knowledge, so that the semantic features of unseen classes can be matched with their visual features.
The traditional ZSL model is based on the assumption that only instances of the unseen classes are involved during the testing phase. Consequently, classification is performed exclusively over the unseen classes. However, in practice, the true class should be recognized among all classes, based on the more reasonable assumption that both seen and unseen classes participate in the testing phase. This approach is called generalized zero-shot learning (GZSL) [4].
From a methodological viewpoint, there are currently three dominant approaches in ZSL: (1) global compatibility methods, (2) embedding space methods, and (3) generative methods [5]. The global compatibility approach [6,7] defines a global compatibility function on visual feature vectors and semantic vectors, whose goal is to maximize the compatibility score between each training instance and the semantic vector of its correct class. The embedding space approach maps visual feature vectors to the semantic space, maps semantic vectors in the opposite direction, or maps both into a predefined common space. Once the different features have been mapped to the same space, the maximum similarity, typically the closest distance, is used to match instances and categories. Some studies [8,9] map visual feature vectors to the semantic space to preserve the semantic structure. In contrast, other studies [10,11] reverse the embedding direction to form unique visual prototypes. A final trend in embedding approaches is to unify the visual feature vectors and the corresponding semantic feature vectors by defining a projection function that maps them into a common space [12]. The generative approach [13,14] synthesizes samples for each unseen class by analyzing the connection between seen and unseen classes in conjunction with semantic relations [5]. Enough instances of the unseen classes are then generated to join the training, thus transforming ZSL into a traditional classification task. Generative networks, such as Generative Adversarial Networks (GANs) [15,16,17], are often used to achieve this goal. Generative models have higher time and space complexity than the other two approaches and are less transparent and interactive. This paper focuses on the embedding space approach.
ZSL suffers from the domain-shift problem (DSP), where the apparent difference between the data distributions of the source (training) and target (test) domains can bias the model toward the seen classes, preventing it from achieving reasonable recognition performance [18]. To address this problem, instead of adhering to the inductive setting of ZSL, some researchers treat unseen-class instances as negative samples in the training phase, in addition to the labeled seen-class instances; this is called transductive ZSL [19]. However, in realistic recognition scenarios, data related to unseen classes are usually not available during training. Therefore, the transductive setting is rarely considered the best solution to the DSP.
Another challenge in ZSL is the hubness problem [20]. Hubness arises when points are sampled in a high-dimensional space: some points tend to appear frequently in the neighborhoods of other points, which means that nearest neighbor searches retrieve such points disproportionately often [21]. For ZSL methods that map visual features into the semantic space, hubness is observed in classification because this embedding direction usually targets a high-dimensional space. To mitigate this problem, several methods attempt to learn appropriate parameterized distance functions instead of using general distance metrics such as the Euclidean distance [10,22]. However, the disjointness of the training and test classes still degrades the classification ability of the model.
To address the above problems, this paper proposes a similarity learning algorithm built on kernels within the embedding space approach, which aims to increase the similarity between instances and visual prototypes of the same class while decreasing their similarity to visual prototypes of different classes. To construct category-specific prototypes, kernelized ridge regression is proposed to learn visual prototypes from the corresponding semantic vectors. Applications of kernelization have demonstrated the potential to enhance model generalization; however, this avenue has rarely been explored in ZSL and GZSL. In addition, kernel polarization is applied to the kernelized similarity function to improve the classification ability of the model, and an autoencoder structure is introduced into the optimization objective to reduce hubness and the DSP. The classification performance of the model is evaluated on five standard datasets in both ZSL and GZSL experiments. The results show that the classification ability of the model is improved compared to other methods. Our contributions are as follows:
A kernelized similarity function is proposed to adjust the cosine distance of visual feature vectors to associated prototypes mapped into the visual feature space, alleviating hubness and the DSP.
The similarity function is enhanced using the kernel polarization method to improve discriminative ability.
A prototype learning method based on kernelized ridge regression is used to represent unseen visual prototypes.
Related works are discussed in Section 2, the background is detailed in Section 3, the research methodology is explained in Section 4, and the experimental results and corresponding interpretations are presented in Section 5. Finally, Section 6 presents the conclusions of the proposed methodology.
4. Proposed Method
The structure of the proposed method is shown in Figure 1. Our method focuses on kernelized visual prototype learning and kernelized similarity learning for zero-shot recognition. In this section, we explain the proposed method in two steps.
Let $\mathcal{D}^{s} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the set of instances from seen classes, where $x_i \in \mathbb{R}^{d}$ is the $d$-dimensional visual feature vector of the $i$-th instance. Let $\mathcal{Y}^{s}$ and $\mathcal{Y}^{u}$ represent the sets of seen and unseen class labels, respectively, where $\mathcal{Y}^{s} \cap \mathcal{Y}^{u} = \varnothing$ and $\mathcal{Y} = \mathcal{Y}^{s} \cup \mathcal{Y}^{u}$ denotes the set of all labels. Moreover, each class label $c \in \mathcal{Y}$ corresponds to a vector $a_c \in \mathbb{R}^{m}$, where $a_c$ is the $m$-dimensional semantic feature vector of the class $c$. The class label of $x_i$ is denoted by $y_i \in \mathcal{Y}^{s}$. The ZSL classification task can be described as follows: given a test instance $x$ from an unseen class, the model predicts a label $\hat{y} \in \mathcal{Y}^{u}$. In addition, in GZSL classification, new instances are from all classes (both seen and unseen), i.e., $\hat{y} \in \mathcal{Y}$.
Let $p_c \in \mathbb{R}^{d}$ be the visual prototype of class $c$. We define a kernelized similarity function to calculate the similarity of each instance $x_i$ to all visual prototypes $p_c$. The kernel function lifts the visual feature vectors to a higher-dimensional space to achieve nonlinearization, and the similarity is calculated as shown in Equation (9):

$$ s(x_i, p_c) = k(W x_i, \; W p_c), \tag{9} $$

where $W$ is a projection matrix and $k$ is a selected kernel function. Kernelized similarity involves replacing the common vector inner product with a kernel to create the nonlinear mapping part.
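To make the role of Equation (9) concrete, the following minimal sketch computes such a kernelized similarity in NumPy. The Gaussian kernel, the matrix names, and the dimensions are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """RBF (Gaussian) kernel between two vectors."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def kernelized_similarity(x, prototypes, W, kernel=gaussian_kernel):
    """Similarity of one instance x to every class prototype.

    x          : (d,)   visual feature vector
    prototypes : (C, d) one visual prototype per class
    W          : (r, d) projection matrix applied before the kernel
    Returns a length-C array of similarities s(x, p_c) = k(Wx, Wp_c).
    """
    Wx = W @ x
    return np.array([kernel(Wx, W @ p) for p in prototypes])

# Toy usage with random data (dimensions are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=2048)            # e.g., a ResNet feature vector
prototypes = rng.normal(size=(10, 2048))
W = rng.normal(size=(512, 2048))
scores = kernelized_similarity(x, prototypes, W)
print(int(np.argmax(scores)))        # index of the most similar prototype
```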
The set of prototypes $\{p_c\}$ is derived first and then utilized to learn the similarity function. In the first subsection, with some theoretical justification, the method for obtaining suitable prototypes $p_c$ is presented. Then, we introduce the kernelized similarity function and the kernel polarization method in the second subsection.
4.1. Prototype Learning
In learning the kernelized similarity, the prototype of each seen class must first be given. Therefore, the proposed method requires prototype learning as its first step. Two different approaches to learning prototypes were mentioned earlier, one for late learning [24] and one for early learning [23]. The former sets the initial prototype of each seen class as the average of the instances of that class, then assumes a parameterized mapping function from semantic features to prototypes, adjusts the parameters to minimize the error between the mapped semantic features and the prototypes, and finally uses the learned function to map semantic feature vectors to visual prototypes. In the latter case, the parameters are adjusted so that the semantic features are mapped to prototypes that are as similar as possible to the instances of the same seen classes, in order to facilitate the classification task directly. In early learning, prototypes are thus generated by the mapping function rather than by the mean of the associated instances. It is shown below that, when the mapping function is linear and the objective is the mean square error, early and late learning yield similar weight matrices. In early learning, the general objective function is given by Equation (10):

$$ J(W) = \sum_{i=1}^{N} \left\| x_i - W a_{y_i} \right\|_2^2 . \tag{10} $$
Furthermore, the objective function is rewritten as a matrix formula in Equation (11):

$$ J(W) = \left\| X - W A \right\|_F^2 , \tag{11} $$

where $X \in \mathbb{R}^{d \times N}$ and $A \in \mathbb{R}^{m \times N}$ are the matrices whose $i$-th columns are $x_i$ and $a_{y_i}$, respectively.
To find the matrix $W$ that minimizes $J(W)$, first compute the partial derivative of $J(W)$ with respect to $W$, and then set it equal to zero to solve for $W$:

$$ \frac{\partial J}{\partial W} = 2\,(W A - X) A^{\top} = 0 \;\;\Longrightarrow\;\; W A A^{\top} = X A^{\top} . \tag{12} $$
On the one hand, let $A_s \in \mathbb{R}^{m \times s}$ be the matrix whose columns are the semantic vectors of the $s$ seen classes, and let $H \in \{0,1\}^{s \times N}$ be the one-hot coding matrix of the seen classes associated with $X$, such that $A = A_s H$. Let $\bar{X} = \bar{P} H$, where each column of $\bar{P} \in \mathbb{R}^{d \times s}$ is the average of the instances belonging to the corresponding seen class. It is claimed that, after replacing every column of $X$ with the corresponding column of $\bar{X}$ (the mean of the same seen class), that is, replacing each class instance with the mean of its class, the result of the summation by class remains unchanged. This assertion can be expressed by the following Equation (13):

$$ X H^{\top} = \bar{X} H^{\top} , \tag{13} $$

where the matrix $\bar{X}$ is the transformed matrix obtained after replacing each $x_i$ with the mean of its class. Thus, since $A = A_s H$, Equation (14) can be derived from Equation (13):

$$ X A^{\top} = X H^{\top} A_s^{\top} = \bar{X} H^{\top} A_s^{\top} = \bar{X} A^{\top} . \tag{14} $$

Finally, by using Equations (12) and (14), Equation (15) is derived:

$$ W A A^{\top} = \bar{X} A^{\top} . \tag{15} $$

At this point, the inverse of $A A^{\top}$ cannot be multiplied directly on both sides of Equation (15), because it is not known whether this inverse exists.
On the other hand, in late learning, let $P \in \mathbb{R}^{d \times s}$ denote the matrix of prototypes to be learned, and replace $W A$ in Equation (11) with $P H$. Hence, the objective $J$ is rewritten, as shown in Equation (16):

$$ J(P) = \left\| X - P H \right\|_F^2 . \tag{16} $$

Similar to the solution in early learning, the $P$ that minimizes $J$ can be found, with the result given in Equation (17):

$$ P H H^{\top} = X H^{\top} . \tag{17} $$

Since the columns of $H$ are one-hot vectors and every seen class contains at least one instance, $H H^{\top} = \mathrm{diag}(n_1, \ldots, n_s)$ is invertible, where $n_c$ is the number of instances of class $c$. Hence,

$$ P = X H^{\top} \left( H H^{\top} \right)^{-1} = \bar{P} , \tag{18} $$

i.e., the optimal prototypes are exactly the class means. After finding $P$, the optimization method minimizes the Frobenius norm $\| P - W A_s \|_F$ in order to obtain the best projection matrix $W$ from $A_s$ to $P$. The following Equation (19) is obtained by setting the derivative to zero:

$$ W A_s A_s^{\top} = P A_s^{\top} . \tag{19} $$
When comparing Equation (15) with Equation (19), note that, with $A = A_s H$ and $\bar{X} = \bar{P} H$, Equation (15) can be rewritten as $W A_s (H H^{\top}) A_s^{\top} = \bar{P} (H H^{\top}) A_s^{\top}$, which has the same form as Equation (19) up to the diagonal weighting $H H^{\top}$ given by the class counts. If $A A^{\top}$ is invertible, $W$ can be uniquely represented as $W = \bar{X} A^{\top} (A A^{\top})^{-1}$. Thus, when a linear mapping function is used, early learning with the objective of minimizing the mean square error produces parameters similar to those of late learning. To reduce the number of calculations, late learning is chosen, where the means of the instances from the corresponding seen classes are considered the true prototypes.
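As an illustration of the late-learning step, the sketch below computes the class-mean prototypes $\bar{P} = X H^{\top}(H H^{\top})^{-1}$ with NumPy; the array shapes and variable names are assumptions made for the example.

```python
import numpy as np

def class_mean_prototypes(X, y, num_classes):
    """Class-mean prototypes via the one-hot formulation P = X H^T (H H^T)^{-1}.

    X : (d, N) visual features, one column per instance
    y : (N,)   integer class labels in [0, num_classes)
    Returns P : (d, num_classes), one prototype (class mean) per column.
    """
    N = X.shape[1]
    H = np.zeros((num_classes, N))
    H[y, np.arange(N)] = 1.0                 # one-hot coding matrix
    counts = H.sum(axis=1)                   # diagonal of H H^T
    return (X @ H.T) / counts                # identical to the per-class means

# Sanity check on toy data: the formula matches a direct per-class mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])
P = class_mean_prototypes(X, y, 3)
assert np.allclose(P[:, 0], X[:, y == 0].mean(axis=1))
```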
Since the datasets used for testing have an imbalance in the number of instances per class, directly ignoring this imbalance with the plain Frobenius-norm minimization may not effectively reduce its impact. Ridge regression improves generalization by preventing the weights from being biased toward classes with a large number of instances. In addition, a kernel function is used to enhance the nonlinear component and to relate the semantic features of the seen classes to those of the unseen classes. Therefore, a kernelized ridge regression is used instead of the original form, which avoids the effects of using a purely linear mapping. The common form of ridge regression is given in Equation (20):

$$ J(W) = \left\| P - W A_s \right\|_F^2 + \lambda \left\| W \right\|_F^2 , \tag{20} $$

where $\lambda \ge 0$ is a non-negative regularization hyperparameter.
Similar to the previous solution, the optimal $W$ is obtained by first finding the derivative of $J$ with respect to $W$ and then setting it to zero, as shown in Equation (21):

$$ W \left( A_s A_s^{\top} + \lambda I \right) = P A_s^{\top} . \tag{21} $$

To simplify the form of Equation (21), the variable $\Gamma$ is introduced, which is defined as shown in Equation (22):

$$ W = \Gamma A_s^{\top} . \tag{22} $$

By introducing $\Gamma$ into Equation (21), Equation (21) can be simplified to the form of Equation (23):

$$ \Gamma \left( A_s^{\top} A_s + \lambda I \right) = P . \tag{23} $$

Then, using Equation (23) in Equation (22), Equation (22) can be rewritten as Equation (24):

$$ W = P \left( A_s^{\top} A_s + \lambda I \right)^{-1} A_s^{\top} , \tag{24} $$

which implies that

$$ p_c = W a_c = P \left( A_s^{\top} A_s + \lambda I \right)^{-1} A_s^{\top} a_c . \tag{25} $$
A kernel matrix $K \in \mathbb{R}^{s \times s}$, with $K_{c c'} = k(a_c, a_{c'})$, can be computed for the semantic vectors of the seen classes as the pairwise similarities of the class vectors under a kernel function $k$, and Equation (25) can be reformulated as Equation (26):

$$ p_c = P \left( K + \lambda I \right)^{-1} k_c , \qquad k_c = \left[ k(a_c, a_1), \ldots, k(a_c, a_s) \right]^{\top} . \tag{26} $$

Finally, let $p_u$ represent the visual prototype of the unseen class $u \in \mathcal{Y}^{u}$, which can be computed by Equation (27):

$$ p_u = \Gamma k_u , \qquad \Gamma = P \left( K + \lambda I \right)^{-1} , \tag{27} $$

where each element of $k_u$ is the inner product, computed by the kernel function, between the semantic feature vector of the unseen class and the semantic feature vector of each seen class. Furthermore, $k_u$ gives the coordinates of $p_u$ in the space spanned by the column vectors of $\Gamma$.
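The prototype-learning step can be summarized in a short script. The sketch below follows Equations (26) and (27) with a Gaussian kernel on the semantic vectors; the kernel choice, the regularization value, and all variable names are assumptions for illustration.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Pairwise Gaussian kernel values between the columns of A and B."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma ** 2))

def unseen_prototypes(P_seen, A_seen, A_unseen, lam=1.0, sigma=1.0):
    """Kernelized ridge regression from semantic vectors to visual prototypes.

    P_seen   : (d, s) class-mean prototypes of the seen classes
    A_seen   : (m, s) semantic vectors of the seen classes
    A_unseen : (m, u) semantic vectors of the unseen classes
    Returns  : (d, u) predicted visual prototypes of the unseen classes.
    """
    K = gaussian_gram(A_seen, A_seen, sigma)                       # (s, s) kernel matrix
    Gamma = P_seen @ np.linalg.inv(K + lam * np.eye(K.shape[0]))   # Eq. (27)
    K_u = gaussian_gram(A_seen, A_unseen, sigma)                   # (s, u) cross-kernel
    return Gamma @ K_u

# Toy usage with random seen prototypes and attribute vectors.
rng = np.random.default_rng(0)
P_seen = rng.normal(size=(2048, 40))
A_seen, A_unseen = rng.normal(size=(85, 40)), rng.normal(size=(85, 10))
print(unseen_prototypes(P_seen, A_seen, A_unseen).shape)           # (2048, 10)
```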
4.2. Similarity Learning
In ZSL, maximum similarity is often used to identify the correct unseen class label, and the similarity commonly takes the form of a vector inner product. Considering the maximization of intraclass similarity, this paper follows another approach with an autoencoder structure. The similarity function is presented in Equation (28):

$$ s(x_i, p_c) = x_i^{\top} M^{\top} M \, p_c , \tag{28} $$

where $M$ is the learned weight matrix. Proposition 1 implies that the goal of maximizing Equation (28) is to find an orthogonal matrix solution $M^{*}$ whose column vectors are weakly incoherent. Therefore, the weight matrix $M$ is constrained in an implicit manner. Such a constrained $M$ creates greater similarity between the data points and the corresponding prototypes than an unconstrained $M$.
Although the above formula contains the autoencoder structure, the weakness of this similarity function is that it is merely a linear mapping between the original features and the projected ones. Thus, kernel methods are used to reformulate the similarity function in this paper. Let $k(\cdot, \cdot)$ denote an RBF kernel; the similarity function is then rewritten as Equation (29):

$$ s(x_i, p_c) = k(M x_i, \; M p_c) . \tag{29} $$

Due to the kernel, Equation (29) contains a nonlinear component. However, the similarity function defined in this way only considers maximizing intraclass similarity, and, for better classification, interclass similarity must also be reduced. Therefore, in the following, kernel polarization is applied to Equation (29).
Let $\{y_i\}_{i=1}^{N}$ denote the set of corresponding class labels, one for each data point. Following Proposition 3, the objective function is given by Equation (30):

$$ J(M) = \sum_{i=1}^{N} \sum_{c \in \mathcal{Y}^{s}} L_{ic} \, k(M x_i, M p_c) , \tag{30} $$

where $L_{ic}$ is formed as in Equation (31):

$$ L_{ic} = \delta_{ic} - \gamma \left( 1 - \delta_{ic} \right) . \tag{31} $$

Here, $y_i$ is the class label corresponding to $x_i$, $p_c$ denotes the prototype corresponding to class $c$, and $\delta_{ic}$ equals one when $y_i = c$ and zero when $y_i \neq c$. The term $\delta_{ic}$ preserves the similarity of instances within the same class and moves within-class samples closer to their prototypes. In contrast, for instances that are not from the same class, the term $-\gamma(1 - \delta_{ic})$ contributes negative values to the off-diagonal entries of the matrix $L$, which reduces the similarity between instances from different classes and moves between-class samples away from each other. Thus, $\gamma$ controls a form of regularization that balances the positive and negative entries.
The final goal of similarity learning is to find the best projection matrix $M$ that maximizes the objective function given by Equation (30). With the prototypes $p_c$ and the kernel hyperparameters fixed, our approach is demonstrated in detail as follows.
Then, we give the objective function concerning $M$ that needs to be optimized (Equation (34)), where $\mathcal{Y}^{s} \setminus \{y_i\}$ denotes the set of all seen class labels except $y_i$, and $p_j$ denotes the prototype corresponding to class $j$. In order for the algorithm to converge slightly faster, slightly modified forms of the intraclass and interclass terms are used instead of the original ones, and one further constant is fixed for simplicity.
Then, the corresponding gradient is computed, and the weight matrix is updated by gradient ascent (equivalently, gradient descent on the negative objective) using Equation (35) as follows:

$$ M_{t+1} = M_t + \eta \, g_t , \qquad g_t = \sum_{i \in \mathcal{B}_t} \frac{\partial J_i}{\partial M}\bigg|_{M = M_t} , \tag{35} $$

where $\mathcal{B}_t$ is the set of indices of size $I$ in the $t$-th batch and $\eta$ is the learning rate. As the algorithm is applied within different domains (instances from the same class), the learning rate should be adjusted accordingly. Root Mean Square Propagation (RMSprop) [30] is a more appropriate method, where $v_t$ is the weighted moving average of the squared gradient and $\beta$ is the decay rate:

$$ v_t = \beta \, v_{t-1} + (1 - \beta)\, g_t \odot g_t , \qquad M_{t+1} = M_t + \frac{\eta}{\sqrt{v_t} + \epsilon} \odot g_t . \tag{36} $$

Furthermore, if the kernel function does not impose soft/implicit incoherence on $M$, Equation (34) can be simplified to Equation (37).
The proposed method consists of two parts. The first part is prototype learning, which simply involves computing the prototypes $p_u$ according to Equations (26) and (27). The pseudo-code of the second part is shown in Algorithm 1.
Algorithm 1 Optimization of the objective function
1: Initialize $M$ randomly and set $v \leftarrow 0$, $t \leftarrow 1$
2: while $t \le T$ do            # $T$ denotes the number of batches
3:     $g \leftarrow 0$
4:     for $i \in \mathcal{B}_t$ do
5:         $g \leftarrow g + \partial J_i / \partial M$            # $\partial J_i / \partial M$ is based on Equation (34)
6:     end for
7:     $v \leftarrow \beta v + (1 - \beta)\, g \odot g$            # Equation (36)
8:     $M \leftarrow M + \eta\, g / (\sqrt{v} + \epsilon)$          # Equation (35)
9:     $t \leftarrow t + 1$
10: end while
11: return $M$
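For concreteness, the sketch below optimizes a simplified variant of the kernel-polarization objective (the own-class kernel response minus a $\gamma$-weighted average of the other-class responses, i.e., without the incoherence-related terms of Equation (34)) using RMSprop. The square projection matrix, the Gaussian kernel, and all hyperparameter values are assumptions made for the example, not the authors' exact settings.

```python
import numpy as np

def rbf(u, v, sigma):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def grad_pair(M, x, p, sigma):
    """Gradient of k(Mx, Mp) with respect to M for the Gaussian kernel."""
    d = x - p
    return -rbf(M @ x, M @ p, sigma) / sigma ** 2 * np.outer(M @ d, d)

def train_similarity(X, y, P, sigma=1.0, gamma=1.0, eta=1e-3, beta=0.9,
                     eps=1e-8, epochs=5, batch=64, seed=0):
    """RMSprop ascent on a simplified kernel-polarization objective:
    sum_i [ k(Mx_i, Mp_{y_i}) - gamma * mean_{c != y_i} k(Mx_i, Mp_c) ].
    X: (N, d) features, y: (N,) labels, P: (C, d) prototypes."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    C = P.shape[0]
    M = rng.normal(scale=0.01, size=(d, d))
    v = np.zeros_like(M)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), max(1, N // batch)):
            g = np.zeros_like(M)
            for i in idx:
                g += grad_pair(M, X[i], P[y[i]], sigma)          # pull toward own class
                for c in range(C):
                    if c != y[i]:                                 # push away from other classes
                        g -= gamma / (C - 1) * grad_pair(M, X[i], P[c], sigma)
            v = beta * v + (1 - beta) * g * g                     # RMSprop accumulator
            M = M + eta * g / (np.sqrt(v) + eps)                  # ascent step
    return M
```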
4.3. Classification
In prototype learning, we obtain a new prototype $p_u$ for each unseen class. Having learned $M$, we calculate the similarity between the instance $x$ to be classified and all visual prototypes $p_c$, based on the weight matrix and the kernel function, and take the class with the highest similarity as the classification result during testing:

$$ \hat{y} = \arg\max_{c} \; k(M x, \; M p_c) , \tag{38} $$

where $k$ is the same kernel function as in training and each unseen prototype $p_u$ is determined using Equation (27). Since the proposed method is required to compute vector dot products for this kind of nearest neighbor classification, a useful option is a kernel approximation approach (e.g., the Nystrom approximation [31]). However, this is beyond the scope of our research, and we will explore the effectiveness of such approximations in our algorithm in future work.
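Putting the pieces together, the following sketch classifies test instances by the maximum kernelized similarity to the prototypes. It assumes a Gaussian kernel and row-stacked prototypes, and illustrates the decision rule of Equation (38) rather than the exact implementation.

```python
import numpy as np

def rbf(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def predict(X_test, prototypes, M, sigma=1.0):
    """Assign each test instance to the class of its most similar prototype.

    X_test     : (N, d) test visual features
    prototypes : (C, d) visual prototypes (all classes for GZSL, unseen only for ZSL)
    M          : (r, d) learned projection matrix
    Returns an array of N predicted class indices.
    """
    proj_protos = prototypes @ M.T                      # project prototypes once
    preds = []
    for x in X_test:
        Mx = M @ x
        scores = [rbf(Mx, q, sigma) for q in proj_protos]
        preds.append(int(np.argmax(scores)))            # maximum similarity, Eq. (38)
    return np.array(preds)
```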
5. Experiments and Evaluation
Since the classification space of the algorithm is the visual feature space and images cannot be used directly for classification, image features first need to be extracted as feature vectors. For the visual feature vectors, we extracted 2048-dimensional features from the images using ResNet-152 [32] with top-layer average pooling, pre-trained on the ImageNet dataset [33]. For the semantic feature vectors, we used the real-valued attribute vectors provided for each class in the dataset.
The proposed method was evaluated on five ZSL datasets: Attribute Pascal and Yahoo (aPY) [2], Animals with Attributes (AWA1) [34], Animals with Attributes 2 (AWA2) [33], Caltech-UCSD Birds (CUB) [34], and SUN Attributes (SUN) [35]. The detailed characteristics of the datasets are presented in Table 2. The data-splitting method used in the experiments is detailed in Table 3; since the visual features were pre-trained on ImageNet, we needed to ensure that the classes involved in testing did not overlap with ImageNet classes. We processed the visual and semantic feature vectors using two basic preprocessing methods: mean subtraction and $\ell_2$-norm normalization.
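For reference, the two preprocessing steps can be written in a few lines; the sketch below is an assumption of the usual formulation (statistics computed on the training features), not necessarily the authors' exact pipeline.

```python
import numpy as np

def preprocess(F_train, F_test, eps=1e-12):
    """Mean subtraction followed by L2-norm normalization of feature rows."""
    mean = F_train.mean(axis=0, keepdims=True)   # statistics from training data only
    def transform(F):
        F = F - mean
        return F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)
    return transform(F_train), transform(F_test)
```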
Table 4 shows all the kernel functions and their derivatives used in the experiments. The first step of the proposed method is kernelized ridge regression, in which the kernel function we chose was the Gaussian kernel for all $\ell_2$-normalized datasets. The radius $\sigma$ of the kernel function and the hyperparameter $\lambda$ of the kernelized ridge regression were determined using cross-validation. The second step of the proposed method is similarity learning, for which the learning rate and the batch size were fixed, the decay rate $\beta$ of the RMSprop solver [30] was fixed, and the datasets were trained using SGD for 5–12 epochs.
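The kernels compared in the experiments are standard; a sketch of the Gaussian, Cauchy, and linear forms (our assumed parameterizations, since Table 4 is not reproduced here) is given below.

```python
import numpy as np

def gaussian(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def cauchy(u, v, sigma=1.0):
    # Heavier tails than the Gaussian, which may reduce overfitting to local clusters.
    return 1.0 / (1.0 + np.sum((u - v) ** 2) / sigma ** 2)

def linear(u, v, c=0.0):
    return float(u @ v) + c
```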
To evaluate the proposed ZSL method, we report the average per-class top-1 accuracy for classification performed on the unseen classes, which is calculated as shown in Equation (39):

$$ acc_{\mathcal{Y}^{u}} = \frac{1}{\left| \mathcal{Y}^{u} \right|} \sum_{c \in \mathcal{Y}^{u}} \frac{\#\,\text{correct predictions in } c}{\#\,\text{instances in } c} , \tag{39} $$

where $\left| \mathcal{Y}^{u} \right|$ is the number of unseen classes. Under the GZSL protocol, both seen and unseen classes are involved in the testing phase; let $acc_{\mathcal{Y}^{s}}$ denote the average per-class top-1 accuracy on the seen classes, calculated in the same way as $acc_{\mathcal{Y}^{u}}$. Then, we use the harmonic mean [33] to evaluate the performance of our method for GZSL, which is calculated as shown in Equation (40):

$$ H = \frac{2 \; acc_{\mathcal{Y}^{s}} \; acc_{\mathcal{Y}^{u}}}{acc_{\mathcal{Y}^{s}} + acc_{\mathcal{Y}^{u}}} . \tag{40} $$

This strategy can identify whether the proposed method overfits to either the seen or the unseen classes, for both ZSL and GZSL.
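A minimal implementation of these two metrics, assuming integer label arrays, is shown below.

```python
import numpy as np

def per_class_top1(y_true, y_pred):
    """Average of per-class accuracies, Eq. (39)."""
    classes = np.unique(y_true)
    accs = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """GZSL harmonic mean of seen and unseen accuracies, Eq. (40)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```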
5.1. Comparison with Baselines
To evaluate the effectiveness, the proposed zero-shot kernel learning approach was compared with several methods, including DAP and IAP [34], CONSE [36], CMT [37], SSE [38], LATEM [39], ALE [7], DEVISE [6], SJE [40], ESZSL [41], SYNC [42], SAE [11], PLNPS [3], APN [43], and CCZSL [44].
5.1.1. Zero-Shot Learning
Table 5 shows the ZSL classification accuracy of the proposed method with respect to different kernel functions. It can be observed that the Gaussian-Ort kernel achieved the best performance, followed by the Cauchy-Ort kernel. The methods with implicit incoherence constraints on $M$ outperformed the unconstrained methods on all five datasets, and the unconstrained methods also performed worse than the Linear kernel. The performance of the proposed method on the CUB dataset did not significantly improve.
It is hypothesized that the superior performance of the Gaussian kernel can be attributed to its ability to map data points into an implicit infinite-dimensional Hilbert space, where decision boundaries can be identified more easily and data points can be classified according to the assigned labels. Also, the orthogonality constraint of the -Ort variants ensures that the constructed mappings are autoencoder-like, which helps preserve the correct distribution of the semantic features when they are mapped to the visual feature space, mitigating the DSP to some extent.
5.1.2. Generalized Zero-Shot Learning
Table 6 presents the results of our experiments and the scores obtained for GZSL. When comparing the generalized score (H), the proposed method with the Cauchy-Ort kernel outperformed other methods on the AWA1 and aPY datasets (2/5). The generalized score (H) is a composite indicator of the model’s classification performance in all classes. In terms of this score, the Cauchy-Ort kernel was closely followed by the Gaussian-Ort kernel and outperformed the other methods on the SUN dataset (1/5). In particular, on the aPY dataset, the Cauchy-Ort and Gaussian-Ort kernels obtained the best and second-best accuracies, respectively, which implies that the Cauchy-Ort kernel outperformed the Gaussian-Ort kernel. Similar to the ZSL results, the performance of the proposed method on the CUB dataset did not significantly improve.
It should be noted that the models that imposed implicit incoherence continued to demonstrate superior performance compared to the variants that did not impose it. Furthermore, the Cauchy kernel was observed to be a suitable option for a range of testing tasks. The Cauchy kernel may be less susceptible to overfitting to local clusters of data points, as its tails decay more gradually in comparison to the tails of a Gaussian kernel.
5.2. Sensitivity Analysis
In this section, we analyze the robustness of the proposed method with respect to the choice of the radius $\sigma$ of the Gaussian and Cauchy kernels, the bias $c$ of the Linear kernel, and the regularization parameter $\gamma$ in Equation (32). A robust algorithm avoids overfitting and exhibits better generalization ability, while studying the sensitivity of the hyperparameter selection facilitates fine-tuning.
Figure 2 shows the ZSL classification accuracy of the proposed method on the AWA2 dataset with respect to the above hyperparameters. The plot shows that the radius $\sigma$ is an important hyperparameter in the proposed method. For the Gaussian-Ort kernel, the validation and test curves were relatively smooth. The best results were obtained at $\sigma = 0.6$ and $\sigma = 1$ for the validation and test curves, respectively, and the testing accuracy was consistently higher than the validation accuracy. A similar trend was observed for the Cauchy-Ort kernel. The difference between the testing and validation accuracies indicates a domain-shift problem, which is common in knowledge transfer tasks. For the Linear kernel, the testing and validation accuracies were not significantly different and reached their maximum within the examined range of the bias $c$, but were lower than those of the Gaussian-Ort and Cauchy-Ort kernels. This demonstrates that, while the Linear kernel generalized well, it lacked the ability to represent nonlinear data patterns.
Figure 2 (bottom) shows the accuracy with respect to the regularization parameter $\gamma$, which controls the extent to which data points projected into the same space are pushed apart from each other. For all kernel functions in the experiment, the values of $\gamma$ corresponding to the peaks in the testing and validation accuracies were almost identical. The value of $\gamma$ had little effect on the performance of the proposed method within the interval (0.8, 2), where the best performance was reached. However, it is clear that as $\gamma \to 0$ or $\gamma \to \infty$, the classification performance of the proposed method degrades significantly. This indicates the importance of balancing the intra- and interclass statistics through kernel polarization.
5.3. Visual Prototype Learning
This method first obtains prototypes of the semantic features in the visual feature space and then classifies instances according to their distances to these prototypes, so the two steps do not affect each other. Therefore, it is important to choose suitable and distinguishable visual prototypes.
Figure 3 shows the t-SNE visualization of the seen-class instances and their means on the AWA2 dataset. Although some class instances were at the fuzzy boundaries between classes, the prototypes (means) of the classes maintained a suitable distance from each other and remained distinguishable under the similarity-based classification, so it was appropriate to use the mean of the visual features as the prototype.
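A typical way to produce such a plot, sketched here with scikit-learn's TSNE (an assumption about the tooling; the paper does not state its implementation), is to embed the instances together with the class means and color them by class.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_instances_and_means(X, y):
    """2-D t-SNE of instances plus class means; X: (N, d), y: (N,) labels."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(
        np.vstack([X, means]))
    inst, proto = emb[: len(X)], emb[len(X):]
    plt.scatter(inst[:, 0], inst[:, 1], c=y, s=5, alpha=0.5)          # instances
    plt.scatter(proto[:, 0], proto[:, 1], c=classes, marker="*",      # class means
                s=200, edgecolors="black")
    plt.show()
```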
5.4. Kernel Effect on Visual Prototype Learning
Since the prototypes of the unseen classes were obtained through kernelized ridge regression, we needed to evaluate the effect of using different kernels on the final performance of kernelized ridge regression. This subsection uses a simple model: kernelized ridge regression + Euclidean-distance nearest neighbor search. The accuracy was calculated on the AWA2 dataset using three different kernel functions: Linear, Gaussian, and Cauchy. As shown in Figure 4, the Gaussian kernel achieved the highest accuracy of 70.5%, followed by the Cauchy kernel with an accuracy of 70.4%; the difference between the two was very small (0.1%). The Linear kernel was less accurate, but the gap did not reach 5%. Considering the impact of the different kernels on model performance, the selection of the prototype-learning kernel is therefore not as important as the selection of the similarity-learning kernel.