Article

A Robust Generalized Zero-Shot Learning Method with Attribute Prototype and Discriminative Attention Mechanism

by Xiaodong Liu 1,2,†, Weixing Luo 1,*,†, Jiale Du 1, Xinshuo Wang 1, Yuhao Dang 1 and Yang Liu 1,2
1 School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
2 Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2024, 13(18), 3751; https://doi.org/10.3390/electronics13183751
Submission received: 20 August 2024 / Revised: 14 September 2024 / Accepted: 18 September 2024 / Published: 21 September 2024
(This article belongs to the Special Issue Deep/Machine Learning in Visual Recognition and Anomaly Detection)

Abstract:
In the field of Generalized Zero-Shot Learning (GZSL), the challenge lies in learning attribute-based information from seen classes and effectively conveying this knowledge to recognize both seen and unseen categories during the training process. This paper proposes an innovative approach to enhance the generalization ability and efficiency of GZSL models by integrating a Convolutional Block Attention Module (CBAM). The CBAM blends channel-wise and spatial-wise information to emphasize key features, thereby improving the model’s discriminative and localization capabilities. Additionally, the method employs a ResNet101 backbone for systematic image feature extraction, enhanced contrastive learning, and a similarity map generator with attribute prototypes. This comprehensive framework aims to achieve robust visual–semantic embedding for classification tasks. The proposed method demonstrates significant improvements in performance metrics in benchmark datasets, showcasing its potential in advancing GZSL applications.

1. Introduction

Generalized Zero-Shot Learning (GZSL) tasks are designed to learn attribute-based information from seen classes and convey that knowledge to recognize categories both seen and unseen during training. Numerous studies have been carried out to improve the abstraction capability and efficiency of such models, helping them correctly recognize images of uncommon species for which no training instances exist. Furthermore, GZSL seeks to identify objects in a human-like manner, transferring knowledge from a limited number of known examples to unknown ones [1,2,3]. Therefore, it is vital for GZSL to learn an embedding that integrates visual and semantic information and can transfer knowledge from seen-category attributes to both seen and unseen categories.
The primary approach to performing GZSL tasks involves learning a correlation function between image depictions and class attribute vectors [4,5,6]. The principle of the GZSL compatibility function is to map visual representations onto a semantic descriptor space [7,8,9]. In the semantic domain, we can simply align an example with the attribute vector of its class by identifying and linking image regions to visual attributes [7,10,11]. Some methods also adopt visual attention or attribute classifiers to enhance the localization capability of the model [7,10,12]. The importance of attention mechanisms has been validated extensively in previous studies [13,14,15,16,17,18]. By integrating an attention mechanism into GZSL, the representation power of samples, as well as the localization and discriminative abilities of the overall model, can be improved.
Therefore, we adopted an attention mechanism known as the Convolutional Block Attention Module (CBAM), which integrates channel-wise and spatial-wise information to emphasize the main features along both dimensions [13]. By suppressing and highlighting features separately along the channel and spatial axes, we can learn both what the salient content of a sample is and where it is located, thus enhancing the discriminative capacity and localization ability of the overall framework.
In addition to attention mechanisms, GZSL frequently employs semantic attributes for classification tasks. These semantic attributes detail the features of objects from both seen and unseen categories, highlighting similarities and differences between classes. Consequently, current GZSL methods leverage these discriminative properties to classify images accurately. The learned visual and semantic information is mapped onto the feature description domain, where the similarity between class attributes and visual features is maximized. In our work, we exploit this similarity to further enhance the model’s generalization ability. We assess the quality of the visual features by adopting attribute regression [7,19], which compels the visual features to be converted to class-level attributes. This allows us to understand the relationship between image information and ground-truth attributes, enabling the generalization of information from seen categories to unseen categories. By optimizing the regression loss, the model’s generalization ability can be further improved.
We also emphasize the robustness of GZSL methods. Therefore, we adopt a Teacher–Student mechanism as our consistency regularization hierarchy, which aims to train our model to generate similar predictions even when an input is perturbed relative to the original image. This Teacher–Student mechanism is composed of a student model, which processes the original data augmentation, and a teacher model, which receives weakly disturbed and randomly augmented image data. The teacher parameters are updated with the exponential moving average (EMA) strategy [20] in accordance with the settings from the previous steps. This improves the model’s discriminative ability and robustness by reducing the distance between the predictions of the student network and the teacher network.
Our contributions are generally fourfold:
  • The Convolutional Block Attention Module (CBAM) is integrated into the GZSL method to enhance the model’s overall localization and discriminant capabilities. This integration optimizes channel-wise and spatial-wise information, thereby improving computational efficiency.
  • The attribute prototype network is adopted into the model during the classification procedure in order to generate similarity maps between class-specific attribute prototypes and local image features, enhancing the locality of the learned representation.
  • In addition to projecting the visual features of an image onto a semantic space, we also encode these visual features with class-specific visual attributes to enhance the discriminative power of the visual–semantic representation.
  • We utilize a Teacher–Student network with exponential moving average strategy to generate perturbed augmentations of the original image, thereby enhancing the robustness of our model.

2. Related Works

2.1. Zero-Shot Learning

Traditional Zero-Shot Learning (ZSL) is focused on developing a model capable of identifying objects from unseen categories by leveraging attribute-based learning acquired from seen categories and applying it to unseen classes. Generalized ZSL takes a step further by aiming to recognize objects from not only seen classes but also unseen classes during testing. A crucial aspect of ZSL is its ability to convey visual knowledge and semantic information from seen classes to unseen ones. Therefore, the key to successful ZSL lies in effectively transferring knowledge of visual and semantic information between seen categories and unseen ones. ZSL methods are categorized into three main types, as follows:
  • Algorithms based on embedding form a large component of Zero-Shot Learning; they plot visual characteristics and semantic properties onto a common feature domain and then classify objects based on similarity measurements through nearest neighbor search [21,22,23,24,25,26,27,28,29,30,31,32,33]. DeViSE, proposed by Frome et al. [34], employs a zero-shot classification model trained in an end-to-end manner that combines a pre-trained convolutional neural network [35] with the Word2Vec model [36]. Inspired by this work, Akata et al. [37] introduce attribute label embedding for Zero-Shot Learning and eliminate the use of the Word2Vec embedding method. Huynh and Elhamifar [38] design an attention mechanism to classify objects by matching regional features with attribute semantic vectors. The attribute prototype network is created by Liu et al. [10,38] to enhance the locality of image representation in ZSL.
  • Generative methods aim to train a sample generator that can produce synthetic examples of unseen classes based on given semantic attributes, thereby transforming traditional ZSL problems into supervised image classification tasks [39,40,41,42,43,44]. Commonly adopted generative methods include Generative Adversarial Networks (GANs) [45], Variational Autoencoders (VAEs) [46], and other techniques [47] for training the synthesizer. Xian et al. [48] designed a GAN conditioned on class-level semantic descriptors to generate image features of unseen categories. To enhance the authenticity and consistency of the visual characteristics, they integrated the strengths of both VAEs [49] and GANs [43,48,50] to develop f-VAEGAN-D2 [51], which performs better in learning the marginal attribute distribution of unlabeled images. Shen et al. designed and employed flow models [47,52] to generate unseen examples from seen categories. Cheng et al. [19] further optimized ZSL by using the EMA [20] to enhance the robustness of their algorithm.
  • Gating methods are often used in GZSL tasks to separate unseen samples from seen samples, thus transforming the GZSL task into a traditional ZSL task plus a supervised classification problem [53,54,55]. In the model of Chen et al. [54], a boundary-based classifier that detects out-of-distribution samples is proposed to divide the seen and unseen categories.
ZSL algorithms are also frequently categorized into inductive models and transductive models based on the availability of unlabeled unseen examples during the training process. Inductive ZSL involves learning a model that can generalize to new, unseen classes based on what is learned from seen classes. It assumes that test examples of unseen classes will be represented in the same feature space as the seen classes. The model is trained on a set of seen classes with corresponding attributes or semantic information, and when it encounters examples of unseen classes, it uses this learned information to make predictions about the new classes. In contrast, transductive ZSL [56,57] makes use of labeled seen-class samples and semantic attribute embeddings for all classes, as well as unlabeled samples of unseen categories.
Thus, transductive ZSL methods tend to perform better, as they can leverage seen-category information together with unlabeled unseen-class samples to learn the underlying attribute embedding [57]. This approach helps to mitigate the inherent bias challenge that ZSL tasks often encounter. However, in reality it is challenging to access all the unseen examples simultaneously during model training. Inductive ZSL offers more flexibility in this regard, so our method adopts the general inductive ZSL setting to enhance the overall discriminative capacity and generalization ability of the model.

2.2. Attention Mechanism

The attention mechanism is not only key to human perception but also helps a model capture salient parts and visual structure during training.
Previous studies [58,59] have integrated attention mechanisms to improve the discriminative and localization capacity of CNN models in general classification tasks. Wang et al. [58] design the Residual Attention Network, which introduces an encoder–decoder attention style. With this mechanism, the model is able to refine the feature map and thus shows robust performance on perturbed input. Since forming a full 3D attention map is complex, the process can be decomposed into a channel attention and a spatial attention to reduce computation and parameter overhead, making such a module flexible for pre-existing CNN architectures.
The attention mechanism used in our model mainly draws inspiration from the module of Hu et al. [59], who propose a compact module, called the Squeeze-and-Excitation module, to deal with the inter-channel relationship. In their module, a global average-pooling operation is employed to obtain channel-wise features and compute channel attention. To further improve the robustness of the module, Woo et al. [13] additionally exploit the max-pooling operation and form the CBAM module, which generates both spatial-wise and channel-wise attention.

3. A Robust Generalized Zero-Shot Learning Method with Attribute Prototype and Discriminative Attention Mechanism

3.1. Problem Definition

Zero-Shot Learning (ZSL) is designed to identify images from unseen classes, represented as $Y^U$, by leveraging semantic information learned from seen classes, represented as $Y^S$, based on class attributes, represented as $D$. The seen and unseen sets do not overlap, i.e., $Y^S \cap Y^U = \emptyset$, and all classes are described within the attribute space $D$. The number of categories in the seen image space is $S$ and the number of unseen classes is $U$. Images from seen classes are denoted as $X^S$, and images from unseen classes are denoted as $X^U$.
The training set for our ZSL model is composed of labeled images with attributes from seen classes, represented as $S_{train} = \{ (x_i, y_i, d_{y_i}) \mid x_i \in X^S, y_i \in Y^S, d_{y_i} \in D \}_{i=1}^{N_{train}}$. Here, $x_i$ is the $i$-th image in the seen image space $X^S$, $y_i$ is the corresponding label, and $d_{y_i}$ is the class embedding. Each class attribute vector $d \in \mathbb{R}^{m \times 1}$, where $m$ denotes the total number of visual attributes, and $N_{train}$ denotes the number of training samples. All class attribute vectors are specific to their classes and are given for both seen and unseen categories throughout the training process.
During model testing, in addition to traditional Zero-Shot Learning, which only predicts labels from unseen classes ($X^U \rightarrow Y^U$), we also employ Generalized Zero-Shot Learning (GZSL). GZSL targets the prediction of labels from both seen and unseen categories ($X^S \cup X^U \rightarrow Y^S \cup Y^U$).

3.2. Overall Framework

Methods for Zero-Shot Learning that rely on semantic embedding are specifically developed to learn visual features and plot them onto the semantic attribute domain. Our approach builds upon this foundation and incorporates four main modules (see Figure 1) to provide a more efficient and robust network for Generalized Zero-Shot Learning (GZSL), as follows:
  • The ResNet101 [60] backbone network, which is utilized for systematic image feature extraction.
  • Enhanced contrastive learning between variant classes, which is composed of Convolutional Block Attention Module [13] and linear layer.
  • Similarity map generator module with attribute prototype for local feature regression and mapping.
  • Consistency control with Exponential Moving Average strategy.

3.3. Enhanced Visual-Semantic Embedding for Classification

We utilize ResNet101 [60] as the backbone network for image feature extraction in our model. An input image $x$ from the training set of seen classes is mapped onto the feature domain, represented as $f(x) \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ denote the height, width, and number of channels of the feature $f(x)$. We then employ the Convolutional Block Attention Module (CBAM) [13].
CBAM is designed to learn which features to focus on or suppress and refines intermediate features effectively. The feature of the input image $f(x)$ undergoes max pooling and average pooling operations over the spatial dimensions, yielding two different spatial context descriptors: $f^C_{max}$ and $f^C_{avg}$. Both descriptors are then passed through a shared network, which comprises a multi-layer perceptron (MLP) with one hidden layer, to produce the channel attention map $M_C(f) \in \mathbb{R}^{1 \times 1 \times C}$. In the hidden layer, the activation size is set to $\mathbb{R}^{1 \times 1 \times C/r}$, where $r$ is the reduction ratio. After the two descriptors pass through the shared network, they are merged via element-wise summation to obtain the output feature vector, i.e.,
$$M_C(f) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(f(x))) + \mathrm{MLP}(\mathrm{MaxPool}(f(x)))\big) = \sigma\big(W_1(W_0(f^C_{avg})) + W_1(W_0(f^C_{max}))\big), \qquad (1)$$
where $\sigma$ denotes the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the shared network, which are thus common to both inputs. The ReLU activation function is applied in the hidden layer of the multi-layer perceptron, directly after $W_0$.
With the channel attention map, we can further generate the channel-refined output $f_C$ by performing element-wise multiplication between the input feature $f(x)$ and the channel attention map $M_C$, i.e.,
$$f_C(x) = M_C(f) \otimes f(x), \qquad (2)$$
where $\otimes$ denotes element-wise multiplication. The channel attention is broadcast along the spatial dimensions; thus, $f_C(x)$ depicts what the label of image $x$ could be.
Then, the channel-refined feature $f_C(x)$ is forwarded to the spatial attention module, which focuses on the location of the informative parts. First, the input goes through average pooling and max pooling operations separately to suppress the channel dimension, yielding $f^S_{avg}$ and $f^S_{max}$, which are concatenated and forwarded to a convolutional layer. Thus, we obtain the refined spatial attention map $M_S(f_C(x))$. This process can be expressed as:
$$M_S(f_C(x)) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(f_C(x)); \mathrm{MaxPool}(f_C(x))])\big) = \sigma\big(f^{7 \times 7}([f^S_{avg}; f^S_{max}])\big), \qquad (3)$$
where $f^{7 \times 7}$ denotes a convolution operation with a filter size of $7 \times 7$.
With the attention maps optimized in both the spatial and channel dimensions, the final refined output is generated by applying element-wise multiplication between the spatial attention map $M_S(f_C(x))$ and the channel-refined output $f_C(x)$:
$$F(x) = M_S(f_C(x)) \otimes f_C(x), \qquad (4)$$
where $F(x) \in \mathbb{R}^{H \times W \times C}$ denotes the final refined feature output of the CBAM.
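To make the two attention stages concrete, the following PyTorch sketch implements Equations (1)–(4) for the stated tensor shapes. It is a minimal sketch under our own naming assumptions (ChannelAttention, SpatialAttention, reduction ratio r), not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention M_C(f), Equation (1): a shared MLP over the avg- and max-pooled descriptors."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_0, followed by ReLU
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W_1
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (B, C, H, W)
        avg = self.mlp(f.mean(dim=(2, 3)))                # MLP(AvgPool(f))
        mx = self.mlp(f.amax(dim=(2, 3)))                 # MLP(MaxPool(f))
        return torch.sigmoid(avg + mx).view(f.size(0), -1, 1, 1)  # (B, C, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention M_S, Equation (3): a 7x7 convolution over the concatenated channel-pooled maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_c: torch.Tensor) -> torch.Tensor:  # f_c: (B, C, H, W)
        avg = f_c.mean(dim=1, keepdim=True)                 # average pooling over channels
        mx = f_c.amax(dim=1, keepdim=True)                  # max pooling over channels
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)

class CBAM(nn.Module):
    """Refine f(x) channel-wise and then spatial-wise, Equations (2) and (4)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_c = self.ca(f) * f       # Equation (2): broadcast element-wise multiplication
        return self.sa(f_c) * f_c  # Equation (4): spatially refined output F(x)
```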
After the CBAM, average pooling over $H$ and $W$ is applied to $F(x)$ to obtain $g(x) \in \mathbb{R}^{C \times 1}$, which is forwarded to a linear projection layer: multiplication by a learnable matrix $W_S \in \mathbb{R}^{C \times m}$ plots $g(x)$ into the semantic attribute space. Here, $C$ again denotes the number of channels, identical in $g(x)$ and $W_S$, and $m$ is the size of the standardized attribute descriptor, i.e., the number of visual attributes. Then, we apply a scaled dot product between the projected attributes and all $S$ seen-class semantic descriptors. Thus, for an image $x^s$, the similarity/classification score for each class attribute is calculated as:
$$P(y^s \mid x^s) = \frac{\exp\big(g(x^s)^T W_S\, d_{y^s} / \tau\big)}{\sum_{i=1}^{S} \exp\big(g(x^s)^T W_S\, d_i / \tau\big)}, \qquad (5)$$
where $\tau$ denotes the temperature coefficient that standardizes the outcome score and controls its sharpness, and $d_{y^s}$ denotes the attribute vector corresponding to the input image $x^s$.
After the linear projection layer, we employ a cross-entropy loss to maximize the score of each projected image feature paired with its corresponding attribute, while keeping the scores for the remaining $S-1$ attribute descriptors low. Thus, the loss generated in the classification procedure, $\mathcal{L}_{cls}$, is calculated as follows:
$$\mathcal{L}_{cls} = -\log \frac{\exp\big(g(x^s)^T W_S\, d_{y^s} / \tau\big)}{\sum_{i=1}^{S} \exp\big(g(x^s)^T W_S\, d_i / \tau\big)}. \qquad (6)$$
To minimize the cross-entropy loss, the parameters of the linear projection layer and of the ResNet101 backbone are adjusted.
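As a sketch of the projection and scaled dot-product scoring just described (Equations (5) and (6)), the snippet below assumes the pooled CBAM feature g(x) as input and a matrix stacking the attribute vectors of the S seen classes row-wise; the module name and the default temperature value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticClassifier(nn.Module):
    """Projects pooled features g(x) into the attribute space via W_S and scores them
    against the seen-class attribute vectors with a temperature-scaled dot product."""
    def __init__(self, channels: int, num_attributes: int, tau: float = 0.05):
        super().__init__()
        self.proj = nn.Linear(channels, num_attributes, bias=False)  # W_S in R^{C x m}
        self.tau = tau

    def forward(self, g: torch.Tensor, class_attrs: torch.Tensor) -> torch.Tensor:
        # g: (B, C) pooled features; class_attrs: (S, m), one attribute vector per seen class
        return self.proj(g) @ class_attrs.t() / self.tau             # logits: (B, S)

def classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Equation (6): cross-entropy, i.e. the negative log of the softmax score in Equation (5)
    return F.cross_entropy(logits, labels)
```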

3.4. Similarity Map Generator Module

To improve the ability of the learned visual features to distinguish between different items and to enhance the spatial accuracy of the image representation, our model incorporates a similarity map generator module with attribute prototypes, inspired by a previous study [7]. Attribute prototypes align the visual elements of localized regions with their semantic characteristics, significantly enhancing our model’s abstraction capability, which is crucial for ZSL.
We use $f(x) \in \mathbb{R}^{H \times W \times C}$ as the input for this module and learn a set of attribute prototypes $P = \{ p_i \in \mathbb{R}^C \}_{i=1}^{m}$ to predict scores for the corresponding attributes, where $p_i$ is the prototype of the $i$-th attribute. The similarity map is obtained by taking the dot product between the input features and each attribute prototype, i.e.,
$$M^i_{x,y} = \langle p_i, f_{x,y}(x) \rangle, \qquad (7)$$
where $f_{x,y}(x) \in \mathbb{R}^C$ refers to the local image feature at position $(x, y)$ and $M^i_{x,y}$ denotes the similarity map of the $i$-th attribute prototype at position $(x, y)$. We take the maximum value of $M^i$ as the $i$-th predicted attribute score, $\hat{M}^i = \max_{x,y} M^i_{x,y}$. All the estimated attribute scores form the predicted class attribute vector $\hat{M}$. Thus, we can adopt a Mean Square Error (MSE) loss as the attribute regression loss $\mathcal{L}_{reg}$, so that we can supervise the learning outcome of the similarity map generator and the backbone network, i.e.,
$$\mathcal{L}_{reg} = \| \hat{M} - M_o \|_2^2, \qquad (8)$$
where $\hat{M}$ is the estimated class attribute vector and $M_o$ is the ground-truth attribute vector for the input $x$. Thus, the locality of our model is enhanced by encoding local features into attributes and optimizing the regression loss.
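A minimal sketch of the similarity map generator of Equations (7) and (8), assuming m learnable prototypes of dimension C; the class and function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributePrototypes(nn.Module):
    """Dot-product similarity maps between local features and attribute prototypes (Equation (7)),
    followed by spatial max-pooling to obtain one predicted score per attribute."""
    def __init__(self, channels: int, num_attributes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_attributes, channels))  # P = {p_i}

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W) backbone features; sim_maps: (B, m, H, W)
        sim_maps = torch.einsum('bchw,mc->bmhw', f, self.prototypes)
        return sim_maps.amax(dim=(2, 3))  # predicted attribute vector M_hat: (B, m)

def regression_loss(pred_attrs: torch.Tensor, gt_attrs: torch.Tensor) -> torch.Tensor:
    # Equation (8): MSE between the predicted and ground-truth class attribute vectors
    return F.mse_loss(pred_attrs, gt_attrs)
```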

3.5. Consistency Regularization Hierarchy

To enhance the resilience of aligning semantics across different modalities, we introduce the Teacher–Student mechanism based on the Exponential Moving Average (EMA) strategy [19]. This hierarchy is constructed from a student model and a teacher model with identical internal structures, where the teacher’s parameters are updated from the student’s. Each time a training step is completed, the parameters are updated by averaging the weights of the current step and are carried forward to the next step. We deploy this hierarchy based on the premise that our model should produce consistent predictions when given perturbed variations of the same image.
Let $\theta_o$ denote the parameters of the student model and $\theta_n$ denote the parameters of the teacher model at the $n$-th step of the hierarchy. The EMA strategy is calculated as follows:
$$\theta_n = \alpha \theta_{n-1} + (1 - \alpha) \theta_o, \qquad (9)$$
where α is the smoothing coefficient.
As a result, two input images $x_o$ and $x_n$, obtained from the same image with different augmentation techniques, are sent to the student model and to the teacher model at the $n$-th step, respectively. They both go through identical processing modules and output the classification scores $p(y_o \mid x_o)$ and $p(y_n \mid x_n)$, which are computed according to Equation (5). To validate our assumption, we expect the classification scores of the student and the teacher to be similar. Thus, the Mean Square Error (MSE) is used as the consistency regularization:
$$\mathcal{L}_{con} = \| p(y_o \mid x_o) - p(y_n \mid x_n) \|_2^2, \qquad (10)$$
where $p(y_o \mid x_o)$ is the classification score of the student model and $p(y_n \mid x_n)$ is the classification score of the teacher model.
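The following sketch shows the EMA teacher update of Equation (9) and the consistency loss of Equation (10); the default smoothing coefficient follows Section 4.1, while the function names and the initialization comment are assumptions.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, alpha: float = 0.995):
    # Equation (9): theta_n = alpha * theta_{n-1} + (1 - alpha) * theta_o
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def consistency_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    # Equation (10): MSE between the classification scores of the student and the teacher,
    # computed on two differently augmented views of the same image; the teacher side is
    # detached so that only the student is trained through this loss.
    return F.mse_loss(student_scores, teacher_scores.detach())

# The teacher is typically initialized as a frozen copy of the student, e.g.:
# teacher = copy.deepcopy(student)
# for p in teacher.parameters():
#     p.requires_grad_(False)
```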

3.6. Global Loss Regulation

To optimize the parameters of the backbone network, the similarity map generator module, the convolutional block attention module, and the linear projection layer, we regulate the global loss with the Adam optimization technique [61]:
$$\mathcal{L}_{global} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{reg}, \qquad (11)$$
where $\lambda_1$ and $\lambda_2$ are hyper-parameters used to balance the losses generated by the different modules in the global loss regulation function.
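A compact sketch of one optimization step under Equation (11) is given below; the learning rate and the values of λ1 and λ2 are placeholders, since the latter are tuned per dataset (Section 4.1), and the helper names are illustrative.

```python
import torch

def build_optimizer(modules, lr: float = 1e-4) -> torch.optim.Adam:
    # Collect the trainable parameters of the backbone, CBAM, linear projection, and prototypes.
    params = [p for m in modules for p in m.parameters()]
    return torch.optim.Adam(params, lr=lr)

def optimize_step(optimizer, loss_cls, loss_con, loss_reg,
                  lam1: float = 0.1, lam2: float = 0.1) -> float:
    # Equation (11): L_global = L_cls + lambda_1 * L_con + lambda_2 * L_reg
    loss_global = loss_cls + lam1 * loss_con + lam2 * loss_reg
    optimizer.zero_grad()
    loss_global.backward()
    optimizer.step()
    return loss_global.item()
```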

3.7. Zero-Shot Learning Inference

The student model of our hierarchy is used for evaluation at test time, since the teacher model operates on perturbed parameters, which means the student model is somewhat more reliable than the teacher network. Thus, for the ZSL task, given an input $x$, the best class output $\hat{y}$ is calculated as follows:
$$\hat{y} = \arg\max_{\tilde{y} \in Y^U} g(x)^T W_S\, a_{\tilde{y}}, \qquad (12)$$
where $a_{\tilde{y}}$ is the original class attribute descriptor for test class $\tilde{y}$.
For the Generalized Zero-Shot Learning task, the test dataset includes samples from both seen and unseen classes, which introduces bias in class predictions. To address this issue, we use Calibrated Stacking (CS) [6,7,19,62,63] to balance the scores of seen and unseen classes by reducing the scores of seen classes by a factor γ .
$$\hat{y} = \arg\max_{\tilde{y} \in Y^S \cup Y^U} g(x)^T W_S\, a_{\tilde{y}} - \gamma\, \mathbb{I}[\tilde{y} \in Y^S], \qquad (13)$$
where $\mathbb{I}$ is an indicator determined by $\tilde{y}$: when $\tilde{y}$ is from the seen categories, $\mathbb{I} = 1$, and when $\tilde{y}$ is from the unseen classes, the latter term is discarded by setting $\mathbb{I} = 0$. Here, $\gamma$ denotes the balancing coefficient, pre-adjusted on a validation set; a brief sketch of this calibrated inference step is given after Algorithm 1. The pseudo-code of PA-GZSL is given as Algorithm 1:
Algorithm 1 Training procedure of PA-GZSL.
Input: Training set $S_{train} = \{ (x_i, y_i, d_{y_i}) \}$, smoothing coefficient $\alpha$.
Output: Predicted labels for test images.
Initialize: Visual backbone ResNet101.
for each training epoch do
for each batch $\{ (x_i, y_i, d_{y_i}) \}$ in $S_{train}$ do
1. $f \leftarrow E_o(\mathrm{ResNet101}(x_i))$;
2. Generate the channel attention map $M_C$ via Equation (1);
3. $f_C \leftarrow f \otimes M_C$: conduct element-wise multiplication of $M_C$ and $f$ to obtain the channel-refined feature $f_C$;
4. Compute the spatial attention map $M_S$ via Equation (3);
5. $F \leftarrow M_S \otimes f_C$: compute the output of the CBAM via Equation (4);
6. $g \leftarrow \mathrm{AvgPool}(F)$: conduct average pooling on $F$ to generate $g \in \mathbb{R}^{C \times 1}$;
7. Calculate $\mathcal{L}_{cls}$ via Equation (6);
8. Generate the similarity maps $M^i$ via Equation (7);
9. Calculate $\mathcal{L}_{reg}$ via Equation (8);
10. Update the model parameters in the Teacher–Student network via Equation (9);
11. Calculate $\mathcal{L}_{con}$ via Equation (10);
12. $\mathcal{L} \leftarrow \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{con} + \lambda_2 \mathcal{L}_{reg}$;
13. Compute the best class output $\hat{y}$ via Equation (13).
end
end
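A brief sketch of the calibrated inference of Equation (13); the tensor layout (one attribute row per class over all K = S + U classes) and the default value of γ are assumptions.

```python
import torch

@torch.no_grad()
def gzsl_predict(g: torch.Tensor, W_s: torch.Tensor, class_attrs: torch.Tensor,
                 seen_mask: torch.Tensor, gamma: float = 0.7) -> torch.Tensor:
    """Calibrated Stacking over all seen and unseen classes.

    g: (B, C) pooled student features; W_s: (C, m) projection matrix;
    class_attrs: (K, m) attribute vectors for all classes; seen_mask: (K,) bool,
    True for seen classes. Returns the predicted class index for each image.
    """
    scores = g @ W_s @ class_attrs.t()            # (B, K) compatibility scores
    scores = scores - gamma * seen_mask.float()   # subtract gamma only from seen-class scores
    return scores.argmax(dim=1)
```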

4. Experiments

4.1. Experiment Setting

  • Datasets: To rigorously evaluate the effectiveness of the proposed method, experiments are carried out on three widely used benchmark ZSL datasets: SUN Attributes (SUN) [64], Caltech-UCSD Birds-200-2011 (CUB) [65], and Animals with Attributes 2 (AWA2) [66]. SUN [64] is a fine-grained dataset with 14,340 images from 717 scene categories and 102 annotated attributes, split into 645 seen classes and 72 unseen classes. The CUB [65] dataset is a fine-grained dataset with 11,788 bird images from 150 seen classes and 50 unseen classes distinguished by 312 annotated attributes. The AWA2 [66] dataset, a coarse-grained animal dataset, contains 37,322 animal images from 40 seen classes and 10 unseen classes with 85 attributes.
  • Evaluation Metrics: We adopt the commonly used evaluation metric, the average per-class top-1 accuracy, to evaluate our proposed method. For traditional ZSL tasks, we only report T1 (top-1 accuracy) on the test set containing only unseen categories. For GZSL, performance is evaluated on a test set with both seen and unseen classes. Following the protocol proposed in [7,19,66], we report the top-1 accuracy on seen classes and unseen classes, denoted as s and u, as well as the harmonic mean H, which is defined as follows (a short sketch of this computation is given after this list):
    $$H = \frac{2 \times s \times u}{s + u}. \qquad (14)$$
  • Implementation Details: Following [7,19,66], we adopt ResNet101 [60] pre-trained on ImageNet-1K [67] as the backbone network. Also, based on [13], we set the kernel size of the CBAM to 7 and arrange the module so that the channel-wise attention is computed first, followed by the spatial-wise attention; we fine-tune the parameters of our model on this basis. Given an input image size of $224 \times 224$, the output image feature size is $7 \times 7 \times 2048$. Two different augmentation strategies are applied, and the resulting views are sent to the student and teacher networks in the Teacher–Student mechanism. As for the optimization method, we adopt Adam [61] to optimize the parameters. The smoothing coefficient $\alpha$ is set to the constant 0.995. The hyper-parameters $\lambda_1$ and $\lambda_2$ in Equation (11) vary with the three datasets.
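As mentioned in the Evaluation Metrics item above, a small sketch of the average per-class top-1 accuracy and the harmonic mean of Equation (14) is given here; it assumes integer class labels and predictions, and the example values in the comment are taken from our AWA2 results.

```python
import torch

def per_class_top1(preds: torch.Tensor, labels: torch.Tensor, classes: torch.Tensor) -> float:
    # Average of the top-1 accuracies computed separately within each class.
    accs = [(preds[labels == c] == c).float().mean() for c in classes]
    return torch.stack(accs).mean().item()

def harmonic_mean(s: float, u: float) -> float:
    # Equation (14): H = 2*s*u / (s + u); e.g. s = 79.6, u = 59.8 gives H ≈ 68.3.
    return 2 * s * u / (s + u)
```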

4.2. Comparison with Cutting-Edge Methods

To demonstrate the feasibility of our approach, we compare it with other state-of-the-art ZSL methods, including both embedding-based and generative-based approaches, on both ZSL and GZSL tasks. Table 1 reports the top-1 accuracy for ZSL tasks and Table 2 reports the results for GZSL tasks, where “-” indicates that no result was reported and bold text indicates the best results. The embedding methods without fine-tuning include DEViSE [34], CONSE [68], SJE [69], ALE [67], PSR [26], LATEM [70], TCN [71], DAZLE [38], and SP-AEN [42]. Embedding methods with fine-tuning include QFSL [72], LDF [73], LFGAA [74], SGMA [75], SR2E [76], AREN [77], TCDCSS [78], Hybrid-RT [3], and VGSE [22]. Overall, the performance of these methods on the three datasets shows that our proposed method has advantages over most existing methods.
  • Traditional Zero-Shot Learning Results: In traditional ZSL tasks, we use different sets of samples for training and testing: training categories come only from seen classes and testing samples come only from unseen classes. As shown in Table 1, our proposed method performs strongly compared to these cutting-edge methods, and its top-1 accuracy on AWA2, CUB, and SUN exceeds that of most of the chosen state-of-the-art methods. On the AWA2 dataset, we achieve a T1 score of 65.3%, ranked first and higher by 1.3% than VGSE [22], which follows with a T1 score of 64%. For the CUB dataset, we reach a performance of 69.7%, exceeding LDF [73] by 2.2%. On the SUN dataset, our model ties for first place with SP-AEN [42] at a T1 performance of 59.2%.
  • Generalized Zero-Shot Learning Results: In Generalized ZSL tasks, the testing set contains samples from both seen and unseen classes; thus, we report the top-1 performance on seen classes s and unseen classes u as well as the harmonic mean H. According to Table 2, our method performs well on all three datasets and is competitive with or superior to the progressive methods shown above. On AWA2 we reach an H score of 68.3%, exceeding most of the other methods and falling only 0.5% behind SDGZSL [32]. The top-1 scores for seen and unseen classes are 79.6% and 59.8%, respectively, indicating competent information transfer between the two types of classes. On the CUB dataset, we also have the best performance compared with both the end-to-end and non-end-to-end methods: the harmonic mean for CUB is 67.5%, exceeding our baseline APN [7], which has an H of 67.2%, and surpassing the other models. The s and u on this dataset are 69.9% and 65.2%, which also reflects the strong consistency of our model. For the SUN dataset, our performance still ranks highly with an H of 39.1%, exceeding the score of BGZSL [50], the second highest on SUN, by 0.9%. The s and u for SUN are 36.2% and 42.6%, respectively, showcasing the effective consistency regularization of our method.

4.3. Ablation Study

For the ablation study of our proposed method, we examine the three terms in Equation (11), which describe the loss from each part of the model. We also remove the CBAM [13] to see whether this module improves the model. To illustrate the contribution of each part of our method, we report the top-1 accuracy for the seen and unseen classes in GZSL tasks (s and u), the harmonic mean H, and the top-1 accuracy for traditional ZSL tasks (T1) on the AWA2, CUB, and SUN datasets. Using only cross-entropy for classification, our performance is rather ordinary compared to current state-of-the-art methods. Thus, we introduce the attribute prototype [7] to enhance the discriminative capability of our method. On ZSL tasks, adding this module achieves a 0.8 improvement on AWA2, a boost of 3.2 on CUB, but only a small improvement of 0.1 on SUN. For GZSL, we enhance the performance by a margin of 1.8 on AWA2, with a setback of 0.6 on CUB and a little progress of 0.2 on SUN. To further develop the robustness of our method, we then introduce the consistency control hierarchy [20]. On ZSL, we achieve 0.7 progress on AWA2, a great leap of 3.7 on CUB, and a considerable improvement of 0.5 on SUN. For GZSL tasks, we gain 1.5 on AWA2, 2.8 on CUB, and 2.2 on SUN. Finally, we add the attention mechanism CBAM [13] to improve the localization property and discriminative capacity. For ZSL tasks, we reach a 1.2 improvement on AWA2, 2.2 on CUB, and 0.4 on SUN. For GZSL tasks, we obtain a 1.3 improvement on AWA2, 2.4 on CUB, and 1.2 on SUN. In total, on ZSL, we make a 2.5 improvement on AWA2, a 9.1 improvement on CUB, and a 0.8 improvement on SUN. On GZSL tasks, we boost our method by a margin of 4.6 on AWA2, 4.6 on CUB, and 3.6 on SUN.
From the results in Table 3, it is clear that all three modules contribute to the effectiveness of our proposed method. The attribute prototype [7] helps regulate the discriminative capacity. The consistency control hierarchy significantly helps our model build robustness, relying on the consistency between the student and teacher networks to improve the steadiness of the method. The introduction of CBAM [13] greatly enhances the localization power and the discriminative ability, since it works both channel-wise and spatial-wise.
In Figure 2, each dot is colored according to its category, and dots of the same color cluster according to the classification function of our method. It is evident from the t-SNE plots that the discriminative power of our method is greatly enhanced by integrating the CBAM [13]: the clusters in the t-SNE result with CBAM are more compact, whereas without CBAM the cluster of an unseen class expands to some extent.
The fundamental difference compared to previous studies [7] is that, when we conduct classification using cross-entropy, the input feature has already been refined by an attention mechanism that serves both the attribute prototype module and the classification module, simplifying the subsequent processing. Also, we employ consistency control in this method to ensure that our model produces consistent predictions for differently perturbed versions of the same image, improving its robustness.

5. Conclusions

In conclusion, our study introduces an innovative approach to enhance the generalization capabilities of GZSL models through the integration of a CBAM. The experimental results demonstrate the effectiveness of the proposed method, showing significant improvements in top-1 accuracy and harmonic mean across the benchmark datasets AWA2, CUB, and SUN. The combination of a ResNet101 backbone, enhanced contrastive learning, and attribute prototype-based similarity mapping has proven to be highly effective in improving the discriminative capacity and localization ability of the GZSL models.
Our method consistently outperforms existing state-of-the-art techniques in both ZSL and GZSL tasks, particularly excelling in the transfer of knowledge from seen to unseen classes. The robustness and efficiency of our approach suggest that incorporating attention mechanisms and fine-tuning hyperparameters can substantially advance the field of GZSL.
Future work could explore further optimization of the CBAM module and the integration of additional attention mechanisms to further enhance the performance of GZSL models. Additionally, expanding the scope of the datasets and applying the proposed method to other domains could provide further validation of its versatility and effectiveness. Overall, this research contributes valuable insights and advancements to the field of Zero-Shot Learning, paving the way for more accurate and efficient recognition of unseen classes.

Author Contributions

Conceptualization, X.L. and W.L.; investigation, J.D.; writing—original draft preparation, X.L.; writing—review and editing, W.L. and Y.L.; visualization, X.W. and Y.D.; funding acquisition, Y.L. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62376207; in part by the Open Research Fund from the Guangdong Provincial Key Laboratory of Big Data Computing; The Chinese University of Hong Kong, China, Shenzhen, under Grant No. B10120210117-OF06; in part by the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, China (No. VRLAB2024A03); in part by the Opening Project of Guangdong Province Key Laboratory of Computational Science at the Sun Yat-sen University, China (No. 2024015); in part by the Open Fund of Anhui Engineering Research Center for Intelligent Applications and Security of Industrial Internet, China (No. IASII24-02); in part by the Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University (No. MMC202301); in part by the Open Projects Program of State Key Laboratory of Multimodal Artificial Intelligence Systems (No. MAIS2024109); in part by the Xidian University Specially Funded Project for Interdisciplinary Exploration, China; and in part by the Fundamental Research Funds for the Central Universities, China.

Data Availability Statement

The CUB dataset is available at https://www.vision.caltech.edu/datasets/cub_200_2011/, the AWA2 dataset at https://cvml.ista.ac.at/AwA2/, and the SUN dataset at https://groups.csail.mit.edu/vision/SUN/, accessed on 19 August 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Morgado, P.; Vasconcelos, N. Semantically consistent regularization for zero-shot recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6060–6069. [Google Scholar]
  2. Liu, J.; Chen, Y.; Liu, H.; Zhang, H.; Zhang, Y. From less to more: Progressive generalized zero-shot detection with curriculum learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19016–19029. [Google Scholar] [CrossRef]
  3. Cheng, D.; Wang, G.; Wang, B.; Zhang, Q.; Han, J.; Zhang, D. Hybrid routing transformer for zero-shot learning. Pattern Recognit. 2023, 137, 109270. [Google Scholar] [CrossRef]
  4. Zhang, L.; Wang, P.; Liu, L.; Shen, C.; Wei, W.; Zhang, Y.; Van Den Hengel, A. Towards effective deep embedding for zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2843–2852. [Google Scholar] [CrossRef]
  5. Li, Y.; Liu, Z.; Yao, L.; Wang, X.; McAuley, J.; Chang, X. An entropy guided reinforced partial convolutional network for zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5175–5186. [Google Scholar] [CrossRef]
  6. Chen, S.; Hong, Z.; Xie, G.S.; Yang, W.; Peng, Q.; Wang, K.; Zhao, J.; You, X. MSDN: Mutually semantic distillation network for zero shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7612–7621. [Google Scholar]
  7. Xu, W.; Xian, Y.; Wang, J.; Schiele, B.; Akata, Z. Attribute prototype network for zero-shot learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1–12. [Google Scholar]
  8. Chen, S.; Xie, G.; Liu, Y.; Peng, Q.; Sun, B.; Li, H.; You, X.; Shao, L. HSVA: Hierarchical semantic-visual adaptation for zero shot learning. Adv. Neural Inf. Process. Syst. 2021, 34, 16622–16634. [Google Scholar]
  9. Chi, J.; Peng, Y. Zero-shot cross-media embedding learning with dual adversarial distribution network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1173–1187. [Google Scholar] [CrossRef]
  10. Liu, Y.; Zhou, L.; Bai, X.; Huang, Y.; Gu, L.; Zhou, J.; Harada, T. Goal-oriented gaze estimation for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3794–3803. [Google Scholar]
  11. Han, Z.; Fu, Z.; Chen, S.; Yang, J. Contrastive embedding for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2371–2381. [Google Scholar]
  12. Shen, J.; Xiao, Z.; Zhen, X.; Zhang, L. Spherical zero-shot learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 634–645. [Google Scholar] [CrossRef]
  13. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  14. Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv 2014, arXiv:1412.7755. [Google Scholar]
  15. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  16. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv 2015, arXiv:1502.03044. [Google Scholar]
  17. Gregor, K.; Danihelka, I.; Graves, A.; Rezende, D.J.; Wierstra, D. Draw: A recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015. [Google Scholar]
  18. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Proceedings of the Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  19. Cheng, D.; Wang, G.; Wang, N.; Zhang, D.; Zhang, Q.; Gao, X. Discriminative and Robust Attribute Alignment for Zero-Shot Learning. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4244–4256. [Google Scholar] [CrossRef]
  20. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10. [Google Scholar]
  21. Xie, G.-S.; Zhang, Z.; Xiong, H.; Shao, L.; Li, X. Towards zero-shot learning: A brief review and an attention-based embedding network. IEEE Trans. Circuits Syst. Video Technol. 2022. early access. [Google Scholar] [CrossRef]
  22. Xu, W.; Xian, Y.; Wang, J.; Schiele, B.; Akata, Z. VGSE: Visually-grounded semantic embeddings for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9316–9325. [Google Scholar]
  23. Li, Y.; Liu, Z.; Yao, L.; Chang, X. Attribute-modulated generative meta learning for zero-shot learning. IEEE Trans. Multimed. 2021. early access. [Google Scholar] [CrossRef]
  24. Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1425–1438. [Google Scholar] [CrossRef] [PubMed]
  25. Tian, Y.; Kong, Y.; Ruan, Q.; An, G.; Fu, Y. Aligned dynamic-preserving embedding for zero-shot action recognition. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 1597–1612. [Google Scholar] [CrossRef]
  26. Biswas, S.; Annadani, Y. Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7603–7612. [Google Scholar]
  27. Liu, S.; Chen, J.; Pan, L.; Ngo, C.-W.; Chua, T.-S.; Jiang, Y.-G. Hyperbolic visual embedding learning for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9273–9281. [Google Scholar]
  28. Liu, Y.; Gao, Q.; Li, J.; Han, J.; Shao, L. Zero shot learning via low-rank embedded semantic AutoEncoder. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2490–2496. [Google Scholar]
  29. Xie, G.; Liu, L.; Zhu, F.; Zhao, F.; Zhang, Z.; Yao, Y.; Qin, J.; Shao, L. Region graph embedding network for zero-shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020. [Google Scholar]
  30. Cheng, D.; Gong, Y.; Wang, J.; Zheng, N. Balanced mixture of deformable part models with automatic part configurations. IEEE Trans. Circuits Syst. Video Technol. 2017, 27, 1962–1973. [Google Scholar] [CrossRef]
  31. Gong, C.; Tao, D.; Maybank, S.J.; Liu, W.; Kang, G.; Yang, J. Multi-modal curriculum learning for semi-supervised image classification. IEEE Trans. Image Process. 2016, 25, 3249–3260. [Google Scholar] [CrossRef]
  32. Chen, Z.; Luo, Y.; Qiu, R.; Wang, S.; Huang, Z.; Li, J.; Zhang, Z. Semantics Disentangling for Generalized Zero-Shot Learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 8692–8700. [Google Scholar]
  33. Guo, J.; Guo, S. A Novel Perspective to Zero-Shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion. IEEE Trans. Multimed. 2021, 23, 524–537. [Google Scholar] [CrossRef]
  34. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.A.; Mikolov, T. DeViSE: A deep visual-semantic embedding model. Proc. Adv. Neural Inf. Process. Syst. 2013, 26, 2121–2129. [Google Scholar]
  35. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  36. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compo sitionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119. [Google Scholar]
  37. Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 819–826. [Google Scholar]
  38. Huynh, D.; Elhamifar, E. Fine-grained generalized zero-shot learning via dense attribute-based attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4483–4493. [Google Scholar]
  39. Kim, J.; Shim, K.; Shim, B. Semantic feature extraction for generalized zero-shot learning. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1166–1173. [Google Scholar] [CrossRef]
  40. Yu, Y.; Ji, Z.; Han, J.; Zhang, Z. Episode-based prototype generating network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14035–14044. [Google Scholar]
  41. Wu, J.; Zhang, T.; Zha, Z.-J.; Luo, J.; Zhang, Y.; Wu, F. Self supervised domain-aware generative network for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12767–12776. [Google Scholar]
  42. Chen, L.; Zhang, H.; Xiao, J.; Liu, W.; Chang, S.-F. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1043–1052. [Google Scholar]
  43. Felix, R.; Kumar, V.B.; Reid, I.; Carneiro, G. Multi-modal cycle consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 21–37. [Google Scholar]
  44. Gong, C.; Yang, J.; You, J.; Sugiyama, M. Centroid estimation with guaranteed efficiency: A general framework for weakly supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2841–2855. [Google Scholar] [CrossRef]
  45. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661. [Google Scholar] [CrossRef]
  46. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  47. Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear independent components estimation. arXiv 2014, arXiv:1410.8516. [Google Scholar]
  48. Xian, Y.; Lorenz, T.; Schiele, B.; Akata, Z. Feature generating networks for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5542–5551. [Google Scholar]
  49. Verma, V.K.; Arora, G.; Mishra, A.; Rai, P. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4281–4289. [Google Scholar]
  50. Zhao, X.; Shen, Y.; Wang, S.; Zhang, H. Boosting generative zero-shot learning by synthesizing diverse features with attribute augmentation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 3454–3462. [Google Scholar]
  51. Xian, Y.; Sharma, S.; Schiele, B.; Akata, Z. F-VAEGAN-d2: A feature generating framework for any-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10275–10284. [Google Scholar]
  52. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real NVP. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  53. Atzmon, Y.; Chechik, G. Adaptive confidence smoothing for generalized zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11671–11680. [Google Scholar]
  54. Chen, X.; Lan, X.; Sun, F.; Zheng, N. A boundary based out-of distribution classifier for generalized zero-shot learning. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020. [Google Scholar]
  55. Kwon, G.; Regib, G.A. A gating model for bias calibration in generalized zero-shot learning. IEEE Trans. Image Process. 2022. early access. [Google Scholar] [CrossRef]
  56. Ye, M.; Guo, Y. Zero-shot classification with discriminative semantic representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7140–7148. [Google Scholar]
  57. Paul, A.; Krishnan, N.C.; Munjal, P. Semantically aligned bias reducing zero shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7056–7065. [Google Scholar]
  58. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. arXiv 2017, arXiv:1704.06904. [Google Scholar]
  59. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  60. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  61. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  62. Yue, Z.; Wang, T.; Zhang, H.; Sun, Q.; Hua, X.-S. Counterfactual zero-shot and open-set visual recognition. arXiv 2021, arXiv:2103.00887. [Google Scholar]
  63. Chao, W.; Changpingyo, S.; Gong, B.; Sha, F. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
  64. Patterson, G.; Xu, C.; Su, H.; Hays, J. The SUN attribute database: Beyond categories for deeper scene understanding. Int. J. Comput. Vis. 2014, 108, 59–81. [Google Scholar] [CrossRef]
  65. Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; Perona, P. Caltech-UCSD Birds 200; California Institute of Technology: Pasadena, CA, USA, 2010. [Google Scholar]
  66. Xian, Y.; Schiele, B.; Akata, Z. Zero-shot learning—The good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4582–4591. [Google Scholar]
  67. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  68. Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A.; Corrado, G.S.; Dean, J. Zero-shot learning by convex combination of semantic embeddings. arXiv 2013, arXiv:1312.5650. [Google Scholar]
  69. Akata, Z.; Reed, S.; Walter, D.; Lee, H.; Schiele, B. Evaluation of output embeddings for ffne-grained image classiffcation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2927–2936. [Google Scholar]
  70. Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; Schiele, B. Latent embeddings for zero-shot classiffcation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 69–77. [Google Scholar]
  71. Jiang, H.; Wang, R.; Shan, S.; Chen, X. Transferable contrastive network for generalized zero-shot learning. In Proceedings of the IEEE/CVF IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9765–9774. [Google Scholar]
  72. Song, J.; Shen, C.; Yang, Y.; Liu, Y.; Song, M. Transductive unbiased embedding for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1024–1033. [Google Scholar]
  73. Li, Y.; Zhang, J.; Zhang, J.; Huang, K. Discriminative learning of latent features for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7463–7471. [Google Scholar]
  74. Liu, Y.; Guo, J.; Cai, D.; He, X. Attribute attention for semantic disambiguation in zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6698–6707. [Google Scholar]
  75. Zhu, Y.; Xie, J.; Tang, Z.; Peng, X.; Elgammal, A. Semantic-guided multi-attention localization for zero-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 14943–14953. [Google Scholar]
  76. Ge, J.; Xie, H.; Min, S.; Zhang, Y. Semantic-guided reinforced region embedding for generalized zero-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 1406–1414. [Google Scholar]
  77. Xie, G.S.; Liu, L.; Jin, X.; Zhu, F.; Zhang, Z.; Qin, J.; Yao, Y.; Shao, L. Attentive region embedding network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9384–9393. [Google Scholar]
  78. Feng, Y.; Huang, X.; Yang, P.; Yu, J.; Sang, J. Non-generative generalized zero-shot learning via task-correlated disentanglement and controllable samples synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9346–9355. [Google Scholar]
  79. Gao, R.; Hou, X.; Qin, J.; Shen, Y.; Long, Y.; Liu, L.; Zhang, Z.; Shao, L. Visual-semantic aligned bidirectional network for zero-shot learning. IEEE Trans. Multimed. 2022. early access. [Google Scholar] [CrossRef]
Figure 1. Framework of our proposed method. An input image first goes through ResNet101 to extract features. The features then pass through the CBAM attention module, which emphasizes location information and attributes, and cross-entropy is applied to obtain the classification loss. The features also go through the similarity map generator with attribute prototypes, where the predicted similarity is measured against the ground-truth attribute vector to obtain the regression loss. The input is processed by a student network and a teacher network, whose parameters are updated by the exponential moving average (EMA) across steps; the consistency loss is calculated with MSE between the classification scores obtained from the two networks.
Figure 2. t-SNE results for unseen classes illustrate the discriminative power of our method. (a) The result without CBAM module and (b) the result integrated with CBAM.
Table 1. Comparison of our experimental outcomes with other cutting-edge ZSL methods. Our method and the other advanced ZSL methods are evaluated on AWA2, CUB, and SUN and separated into two groups: the first group contains non-end-to-end methods and the second group contains end-to-end methods. We report T1 accuracy in this table. The best results are bolded.
Method | AWA2 (T1) | CUB (T1) | SUN (T1)
CONSE [68] | 44.5 | 34.3 | 38.8
DEVISE [34] | 59.7 | 52 | 56.5
ALE [67] | 62.5 | 54.9 | 58.1
SJE [69] | 61.9 | 53.9 | 53.7
LATEM [70] | 55.8 | 49.3 | 55.3
SP-AEN [42] | - | 55.4 | 59.2
LDF [73] | - | 67.5 | -
QFSL [72] | 63.5 | 58.8 | 56.2
VGSE [22] | 64 | 28.9 | 38.1
PA-GZSL (ours) | 65.3 | 69.7 | 59.2
Table 2. Comparison of our experimental outcome with other cutting-edge GZSL methods. These methods are arranged as in Table 1. We report the top-1 accuracy for seen classes (s) and unseen classes (u) and the harmonic mean (H). The best results are bolded.
Method | AWA2 (s / u / H) | CUB (s / u / H) | SUN (s / u / H)
CONSE [68] | 90.6 / 0.5 / 1 | 72.2 / 1.6 / 3.1 | 39.9 / 6.8 / 11.6
DEVISE [34] | 74.7 / 17.1 / 27.8 | 53 / 23.8 / 32.8 | 27.4 / 16.9 / 20.9
ALE [67] | 81.8 / 14 / 23.9 | 62.8 / 23.7 / 34.4 | 33.1 / 21.8 / 26.3
SJE [69] | 73.9 / 8 / 14.4 | 59.2 / 23.5 / 33.6 | 30.5 / 14.7 / 19.8
LATEM [70] | 77.3 / 11.5 / 20 | 57.3 / 15.2 / 24 | 28.8 / 14.7 / 19.5
PSR [26] | 73.8 / 20.7 / 32.3 | 54.3 / 24.6 / 33.9 | 37.2 / 20.8 / 26.7
SP-AEN [42] | - / - / - | 70.6 / 34.7 / 46.6 | 38.6 / 24.9 / 30.3
TCN [71] | 65.8 / 61.2 / 63.4 | 52 / 52.6 / 52.3 | 37.3 / 31.2 / 34
DAZLE [38] | 75.7 / 60.3 / 67.1 | 59.6 / 56.7 / 58.1 | 24.3 / 52.3 / 33.2
LDF [73] | - / - / - | 81.6 / 26.4 / 39.9 | - / - / -
QFSL [72] | 72.8 / 52.1 / 60.7 | 48.1 / 33.3 / 39.4 | 18.4 / 30.9 / 23.1
SGMA [75] | 87.1 / 37.6 / 52.5 | 71.3 / 36.7 / 48.5 | - / - / -
LFGAA [74] | 93.4 / 27 / 41.9 | 80.9 / 36.2 / 50 | 40.4 / 18.5 / 25.3
AREN [77] | 79.1 / 54.7 / 64.7 | 69 / 63.2 / 66 | 32.3 / 40.3 / 35.9
APN [7] | 78 / 56.5 / 65.5 | 69.3 / 65.3 / 67.2 | 34 / 41.9 / 37.6
SR2E [76] | 80.7 / 58 / 67.5 | 70.6 / 61.6 / 65.8 | 36.8 / 40.5 / 37.9
AMGML [29] | 74.6 / 56 / 64 | 55.7 / 58.2 / 56.9 | 35.1 / 42 / 38.3
VSABN [79] | 71.8 / 56.1 / 63 | - / - / - | 33.4 / 40.1 / 36.4
HybridRT [3] | 78.7 / 58.9 / 67.4 | 63.5 / 62.1 / 62.8 | 26.9 / 53.2 / 35.7
BGZSL [50] | 75 / 59.3 / 66.2 | 59.6 / 59.2 / 59.4 | 32.6 / 46.2 / 38.2
TCDCSS [78] | 74.9 / 59.2 / 66.1 | 62.8 / 44.2 / 51.9 | - / - / -
VGSE [22] | 81.8 / 51.2 / 63 | 45.5 / 21.9 / 29.5 | 31.8 / 24.1 / 27.4
SDGZSL [32] | 64.6 / 73.6 / 68.8 | 59.9 / 66.4 / 63 | - / - / -
PA-GZSL (ours) | 79.6 / 59.8 / 68.3 | 69.9 / 65.2 / 67.5 | 36.2 / 42.6 / 39.1
Table 3. Ablation results for Generalized Zero-Shot Learning on three datasets.
$\mathcal{L}_{cls}$ | $\mathcal{L}_{reg}$ | $\mathcal{L}_{con}$ | CBAM | AWA2 (s / u / H / T1) | CUB (s / u / H / T1) | SUN (s / u / H / T1)
✓ | | | | 76.3 / 54.6 / 63.7 / 62.8 | 64.3 / 55.6 / 62.9 / 60.6 | 34.2 / 36.9 / 35.5 / 58.4
✓ | ✓ | | | 77.2 / 56.9 / 65.5 / 63.4 | 65.6 / 59.3 / 62.3 / 63.8 | 34.2 / 37.3 / 35.7 / 58.3
✓ | ✓ | ✓ | | 78.8 / 58.2 / 67 / 64.1 | 68.1 / 62.3 / 65.1 / 67.5 | 35.3 / 40.8 / 37.9 / 58.8
✓ | ✓ | ✓ | ✓ | 79.6 / 59.6 / 68.3 / 65.3 | 69.9 / 65.2 / 67.5 / 69.7 | 36.2 / 42.6 / 39.1 / 59.2


