Article

Zero-Shot Image Classification with Rectified Embedding Vectors Using a Caption Generator

School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(12), 7071; https://doi.org/10.3390/app13127071
Submission received: 10 May 2023 / Revised: 4 June 2023 / Accepted: 10 June 2023 / Published: 13 June 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Although image recognition technologies are developing rapidly with deep learning, conventional recognition models trained by supervised learning with class labels do not work well when test inputs from untrained classes are given. For example, a recognizer trained to classify Asian bird species cannot recognize the kiwi, because the class label “kiwi” and its image samples have not been seen during training. To overcome this limitation, zero-shot classification has been studied recently, and the joint-embedding-based approach has been suggested as one of the promising solutions. In this approach, image features and text descriptions belonging to the same class are trained to be closely located in a common joint-embedding space. Once we obtain an embedding function that can represent the semantic relationship of the image–text pairs in the training data, test images and text descriptions (prototypes) of unseen classes can also be mapped to the joint-embedding space for classification. The main challenge with this approach is mapping inputs of two different modalities into a common space, and previous works suffer from the inconsistency between the distributions of the two feature sets extracted from the heterogeneous inputs in the joint-embedding space. To address this problem, we propose a novel method that employs additional textual information to rectify the visual representation of input images. Since the conceptual information of test classes is generally given as text, we expect that additional descriptions from a caption generator can adjust the visual feature for better matching with the representations of the test classes. We also propose using the generated textual descriptions to augment training samples for learning the joint-embedding space. In experiments on two benchmark datasets, the proposed method shows significant performance improvements of 1.4% on the CUB dataset and 5.5% on the flower dataset in comparison with existing models.

1. Introduction

With the development of deep learning technology, the classification performance on image data has improved, and the range of applications has expanded. This improvement is attributed to advances in feature-extraction methods and to training massive artificial neural networks on large amounts of human-labeled data. However, conventional image classification methods are limited in that they do not work well when the classes in the testing dataset are not identical to those in the training set. Recently, zero-shot learning has been studied to overcome this problem.
Zero-shot learning is the task of recognizing new patterns whose samples are unavailable during training [1,2,3,4,5,6]. In zero-shot image classification, for example, a classifier trained on images of several bird species is asked to assign test images of new species to appropriate categories. Some conceptual information about the new species must be provided to the classifier in some form to perform this task. A representative form of such conceptual information is a set of predefined attributes, such as the color of the beak, but predefined attributes are rarely available in real applications. As an alternative, more natural descriptions such as “this bird has black feathers and a long beak” have also been used [7,8,9,10,11]. In this case, image–text matching plays an important role in assigning test images to an appropriate class defined by textual descriptions.
Some researchers [7,8] have introduced an embedding-based approach to this task, which creates a semantic embedding space containing representation vectors for both images and text. In the training stage, image and text features are extracted separately, and the two sets of features are mapped onto a common embedding space. This joint-embedding method trains the image and text features from a single pair to be close to each other. In the test stage, conceptual information about the unseen classes is given as textual descriptions, and the trained model maps these descriptions to the embedding space to create a prototype vector for each unseen class. The zero-shot classification of a test image is then performed by assigning its embedded feature to the class of the nearest prototype vector.
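To make this procedure concrete, the following minimal Python sketch illustrates nearest-prototype classification in a joint-embedding space; the embedding dimensionality, class names, and vectors are toy values used only for illustration.

```python
import numpy as np

def classify_zero_shot(image_embedding, prototypes):
    """Assign an embedded test image to the class of the nearest prototype.

    image_embedding: (d,) vector of the test image in the joint-embedding space.
    prototypes: dict mapping each unseen class to its (d,) prototype vector,
                e.g., the embedded textual description of that class.
    """
    distances = {cls: np.linalg.norm(image_embedding - proto)
                 for cls, proto in prototypes.items()}
    return min(distances, key=distances.get)

# Toy example in a 4-dimensional joint-embedding space.
prototypes = {
    "kiwi": np.array([0.9, 0.1, 0.0, 0.2]),
    "albatross": np.array([0.1, 0.8, 0.3, 0.0]),
}
test_embedding = np.array([0.8, 0.2, 0.1, 0.1])
print(classify_zero_shot(test_embedding, prototypes))  # -> kiwi
```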
The performance of this approach depends on whether the image features in the joint-embedding space are closely mapped to the corresponding class prototype. In other words, the image and text features belonging to the same category must be well-bound around the prototype in the embedding space. However, in zero-shot learning, test images are given from untrained classes, and their distributional properties can be different from those of training classes. This can lead to a large discrepancy between image features and class prototypes defined as textual features, which results in performance degradation.
To deal with this problem, we propose rectifying the image-feature vector by using additional semantic information obtained from a caption generator [12,13,14]. A caption generator can produce a general description sentence that provides meaningful information about the given input image; thus, it can be used as a guide for enhancing the image feature. Therefore, by combining the two vectors, we expect a better representation of the given input than the single image feature provides.
Figure 1 illustrates how the additional text feature obtained from the image caption can be combined with the original image feature to aid prediction and the comparison with the prototype vectors. Furthermore, the generated description can also be used as augmented data in the training stage, improving generalizability when the training data have insufficient text descriptions. Extensive experiments were conducted on the CUB-bird [15] and flower [16] datasets to verify the effectiveness of the proposed model.
In summary, the main contributions of this work are as follows:
  • We propose a novel approach for zero-shot image classification using additional semantic information generated from an image-caption generator. Unlike previous studies [17,18] that used a caption generator to enhance visual feature extraction, the proposed model directly uses the generated text to rectify the embedding vector in the semantic space.
  • The generated-text-feature vector enhances the image-embedding vector in the testing stage by providing additional information. We call the combined vector a rectified vector, and our experiments show that it yields a significant improvement in classification accuracy.
  • We also demonstrate that image-caption generators can play an augmentation role in learning a joint-embedding space, especially when the image–description pairs in the training set are insufficient.
The paper is organized as follows. In Section 1, we describe the zero-shot image classification problem, briefly explain the existing prototype-based methods and their limitation, and outline our proposed approach to solving it. In Section 2, related works on zero-shot image classification and image captioning are described. In Section 3, the proposed method for rectifying embedding vectors and for data augmentation using generated captions is explained. In Section 4, we analyze the quantitative and qualitative results of the proposed method on two benchmark datasets. Finally, in Section 5, a summary and future work are presented.

2. Related Works

Zero-shot learning: Early zero-shot image classification studies [19,20,21,22] used a two-step strategy to predict the image class. First, when a new image arrives, its attributes are predicted by a classifier trained on the seen classes in the learning phase. Second, the resulting attribute probability scores are combined with the prior attribute information of the unseen classes for the final class prediction. When combining the class attributes and attribute labels in this two-stage approach, a domain shift problem occurs in which the attributes of two images from different classes overlap, and the performance deteriorates. For example, the shapes of a lion and a dog are very different, but the attribute “tail” yields the same answer in the image-attribute classifier. This kind of information loss in the first stage leads to wrong predictions in the second stage.
To solve this domain shift problem, some studies [7,23,24,25,26,27,28] have proposed locating the attributes and images in the same embedding space. When an input image and its corresponding attributes are given, their feature vectors should be located close to each other in the joint-embedding space. In the inference stage, the image class is assigned to the closest prototype by comparing the distances between the prototypes composed of attributes and the input image in the same space. To construct such a joint-embedding space, models embed the attribute data into the image space [7], the image into the attribute space [23,26], or the image and attributes into an intermediate space [21,24,27,28]. Some models [29,30] that predict the class of an unseen image through the relationship between the image and a semantic label embedding have also been presented.
Although various approaches using predefined attributes have demonstrated good results, the fundamental problem is that these methods rely on attribute information. Predefined attributes rarely exist in real data and are expensive to obtain because humans must label them. Some studies [7,8] suggested using free-style natural language descriptions instead of attributes to overcome this problem. Since such descriptions usually exist in image–text pair datasets that can be easily obtained through a web search, it is more natural and general to use textual descriptions as the conceptual definitions of the test classes for zero-shot image classification. In this manner, related works on image–text matching [17,21,22] have also been applied to zero-shot classification tasks.
Recently, models such as CLIP [27] and ALBEF [28] have achieved outstanding zero-shot recognition performance using transformer-based architectures [31,32] with contrastive learning. These models showed that a large number of image–text pairs can improve zero-shot recognition ability because they help to construct a better joint-embedding space. However, such models are difficult to train because of their high computational costs.
Though the embedding-based methods designed for general image–text matching have shown considerable success in zero-shot classification, they focus on the distance between individual image–text pairs and do not consider the class distribution. Since the class distributions of the test images (unseen classes) are not observable while learning the embedding space, it is difficult to guarantee that the image features of a test class are well clustered around its corresponding prototype defined by textual features. In this paper, we try to obtain a better representation of test images by using additional textual information that is derived from the images themselves.
Image-caption generation: Caption generation is a challenging artificial intelligence problem in which a text caption is generated from a given image. Recently, deep learning methods have achieved notable results [33,34,35,36] on this problem with attention-based models. Given an image encoded by a CNN, the image-caption-generation model predicts a sequence of words that describes the image through an RNN. Some works [37,38] have addressed this problem by learning image–text matching relationships from large amounts of image–text pairs with a transformer structure [39]. Furthermore, researchers [17,18] have exploited the image-captioning process to solve the image–text matching task. In [17], the image-captioning process sorts the order of cropped local image regions, and the captioning module helps to obtain better image features when the local features are fused into a whole feature. Similarly, another study [18] employed the image-captioning process to obtain finer image features. They fed an extracted image feature composed of a local feature and a global feature as the input of the image-captioning model; the supervision signal of the captioning process then led the extracted feature to a better representation.
Although both models significantly improve over previous studies, they focus on obtaining better representations of image features and do not directly use the generated text. In contrast, the proposed model uses the output of the caption-generator module directly. The direct use of generated-text features alleviates the discrepancy between the image and text distributions in the test stage. In addition, the generated texts can also be applied to data augmentation when learning the joint-embedding space.

3. Proposed Method

3.1. Overall Structure

We consider a set of training samples $D_{tr} = \{(I_i, T_i, y_i),\ i = 1, \ldots, N_{tr}\}$ with associated class labels $y_i \in \mathcal{T}_{tr}$, where $I_i$ is the $i$th training image and $T_i$ is a textual description of $I_i$. The goal of the zero-shot classification task is to predict the label $y_j \in \mathcal{T}_{ts}$ of a test image $I_j^{ts}$ under the condition that the training and testing classes are mutually disjoint, i.e., $\mathcal{T}_{tr} \cap \mathcal{T}_{ts} = \emptyset$.
The overall structure of the proposed model is presented in Figure 1. In the training stage, when an image $I_i^{tr}$ and a text $T_i^{tr}$ are given, the features of each modality are extracted through a CNN and a long short-term memory (LSTM) network, respectively. These features are mapped onto the joint-embedding space through two fully connected layers to obtain the vector representations $\varphi_I(I_i^{tr})$ and $\varphi_T(T_i^{tr})$. In the joint-embedding space, the semantic relationship of an image–text pair is represented by the distance between the embedding vector pair. For example, the embedding vector of an image of a bird with a white beak and the text-embedding feature of the description “white head” are trained to be located near each other in the joint-embedding space.
In the testing stage, a given test image $I_j^{ts}$ is assumed to belong to an unseen class in $\mathcal{T}_{ts}$ that was not available in the training stage. Conceptual information for each new class must be provided to recognize it, and it is represented as a vector in the learned joint-embedding space, defined as a prototype vector $P$. Given the prototype vectors $P_k\ (k \in \mathcal{T}_{ts})$, the classification of the test image $I_j^{ts}$ is performed based on the distance between the image-embedding vector $\varphi_I(I_j^{ts})$ and the prototype vectors. This method is similar to the baseline approach [8] in that the joint-embedding model is constructed in an intermediate space. While previous studies [7,8] have used the single image vector $\varphi_I(I_j^{ts})$ in the embedding space, we propose using additional textual information to obtain a better representation of the image. For example, when we examine a red bird, a human analyzes its visual appearance and simultaneously thinks, “It is a red bird.” By combining this kind of linguistic thinking with the visual representation, we expect a better representation of the object, as in the human cognitive system. To implement this procedure, we generate a textual description $G(I_j^{ts})$ that expresses the image linguistically by exploiting an image-caption generator, and we use this description to obtain a rectified embedding vector $R_j$ that complements the image feature. In addition, the generated text descriptions can also be used in the training stage. In real situations, image–text data pairs sometimes do not exist or are very limited, which results in poor performance because the joint-embedding space is not well formed. We deal with this problem of insufficient data by using the generated texts as augmented descriptions.

3.2. Learning Joint-Embedding Space

As depicted in Figure 2, the training stage for the joint-embedding space consists of two flows. First, a CNN model is applied to the image data, and the extracted feature is placed in the joint-embedding space through two trainable fully connected layers. The output of the image-processing flow is written as $\varphi_I(I_i)$. Second, an LSTM network is applied to the textual description of the given input image, and the text features extracted from the trainable LSTM network are placed in the joint-embedding space through a trainable fully connected layer. The output of the text-processing flow is written as $\varphi_T(T_i)$. The vectors from the two flows lie in the same embedding space; thus, their inner product can be calculated as follows:
$$s_{ij} = \varphi_I(I_i) \cdot \varphi_T(T_j). \qquad (1)$$
Then, the matching loss for all image–text pairs $(I_i, T_i)$ in a minibatch subset $B$ is defined as follows:
$$L_{mat} = \sum_{i=1}^{|B|} \sum_{j \neq i} \max(0,\ s_{ij} - s_{ii} + margin), \qquad (2)$$
where $s_{ij}$ is the score value of the inner product calculated for a mismatched image–text pair $(j \neq i)$, and $margin$ is a positive constant for stable training. Learning with this loss function updates the embedding functions to increase the matched score $s_{ii}$ and decrease the mismatched scores $s_{ij}$ so that the embedding space can represent the semantic correlation between images and texts.
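The two embedding flows and the matching loss of Equation (2) could be sketched in PyTorch roughly as follows; the hidden sizes, the margin value of 0.2, and the module names are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    """Map pre-extracted image features and tokenized descriptions
    into a shared 256-dimensional joint-embedding space."""

    def __init__(self, img_dim=2048, vocab_size=5000, txt_dim=1024, emb_dim=256):
        super().__init__()
        # Image flow: CNN feature -> two fully connected layers.
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
        # Text flow: word embeddings -> LSTM -> fully connected layer.
        self.word_emb = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, txt_dim, batch_first=True)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def embed_image(self, img_feat):          # img_feat: (B, img_dim)
        return self.img_proj(img_feat)        # phi_I: (B, emb_dim)

    def embed_text(self, tokens):             # tokens: (B, T) word indices
        _, (h, _) = self.lstm(self.word_emb(tokens))
        return self.txt_proj(h[-1])           # phi_T: (B, emb_dim)

def matching_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-style matching loss of Eq. (2): push the matched score s_ii
    above every mismatched score s_ij by at least `margin`."""
    scores = img_emb @ txt_emb.t()                 # s_ij for all pairs in the batch
    diag = scores.diag().unsqueeze(1)              # s_ii, broadcast over rows
    cost = (margin + scores - diag).clamp(min=0)   # max(0, s_ij - s_ii + margin)
    cost.fill_diagonal_(0.0)                       # drop the j == i terms
    return cost.sum()
```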

3.3. Learning Caption Generator

Inspired by the human recognition system, which employs multiple sources of information in awareness, we deploy linguistic information derived from images to rectify the embedding vector $\varphi_I(I_i)$. To realize this process, we use an image-captioning approach that translates an image into a textual sentence [12,33,34,35,36]. Given a test image $I_j^{ts}$ as input, a text description $G(I_j^{ts})$ is generated through the caption-generator module, as illustrated in Figure 1. The attention-based image-caption-generation method [12] is adopted as the generation model. We employ a simple but intuitive model because our main goal in employing the caption-generator module is to leverage semantic textual information from images. Many other elaborate state-of-the-art captioning models [37,38] could also be utilized for generating higher-quality captions.
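As a rough stand-in for the attention-based generator of [12], the sketch below uses a plain LSTM decoder conditioned on the joint-space image embedding; the special-token ids, hidden sizes, and the greedy decoding routine are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Simplified caption generator: a single-layer LSTM conditioned on the
    joint-space image embedding instead of attention over CNN feature maps."""

    def __init__(self, emb_dim=256, vocab_size=5000, hid_dim=512,
                 bos_id=1, eos_id=2):          # special-token ids are assumed
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim * 2, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)
        self.bos_id, self.eos_id = bos_id, eos_id

    def forward(self, img_emb, captions):
        """Teacher-forced logits for words 1..N given ground-truth words 0..N-1."""
        prev = self.word_emb(captions[:, :-1])                    # (B, N-1, E)
        ctx = img_emb.unsqueeze(1).expand(-1, prev.size(1), -1)   # (B, N-1, E)
        hidden, _ = self.lstm(torch.cat([prev, ctx], dim=-1))
        return self.out(hidden)                                   # (B, N-1, V)

    @torch.no_grad()
    def generate(self, img_emb, max_len=20):
        """Greedy decoding of G(I): pick the most probable word at every step."""
        tokens, state = [self.bos_id], None
        for _ in range(max_len):
            prev = self.word_emb(torch.tensor([[tokens[-1]]]))    # (1, 1, E)
            step = torch.cat([prev, img_emb.view(1, 1, -1)], dim=-1)
            hidden, state = self.lstm(step, state)
            next_id = self.out(hidden[:, -1]).argmax(dim=-1).item()
            tokens.append(next_id)
            if next_id == self.eos_id:
                break
        return torch.tensor([tokens])                             # (1, T) token ids
```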
To train the caption generator, the image–sentence pairs $\{(I_i, T_i)\ |\ T_i = [w_{i,1}, \ldots, w_{i,N}]\}$ in $D_{tr}$, where $w$ denotes a word composing a sentence, are used. The loss for generating the sentence is the negative log-likelihood of the sequence of words, written as follows:
$$L_{gen} = -\sum_{i=1}^{|B|} \sum_{t=0}^{N-1} \log p_{i,t+1}(w_{i,t+1}), \qquad (3)$$
where $N$ denotes the number of words in the target sentence, $w_{i,t}$ represents the $t$th word in the $i$th sentence of a minibatch, and $p$ is the softmax function for predicting the next-timestep word. One of the merits of our proposed model is that the caption-generator and joint-embedding modules can be trained simultaneously. In the entire end-to-end training process, the total loss to be minimized is defined as follows:
$$L = L_{mat} + L_{gen}. \qquad (4)$$
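A possible end-to-end training step combining the two losses, reusing the JointEmbedding, matching_loss, and CaptionDecoder sketches above, might look as follows; teacher forcing and the cross-entropy formulation are standard choices and not necessarily identical to the authors' exact setup.

```python
import torch.nn.functional as F

def training_step(model, decoder, img_feat, captions, margin=0.2):
    """One end-to-end step minimizing L = L_mat + L_gen (Eq. (4)).

    img_feat: (B, 2048) pre-extracted CNN features.
    captions: (B, N) padded token indices of the paired descriptions.
    """
    img_emb = model.embed_image(img_feat)          # phi_I(I_i)
    txt_emb = model.embed_text(captions)           # phi_T(T_i)
    l_mat = matching_loss(img_emb, txt_emb, margin)

    # L_gen: negative log-likelihood of the ground-truth next words (Eq. (3)).
    logits = decoder(img_emb, captions)            # predicts words 1..N-1
    l_gen = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            captions[:, 1:].reshape(-1))
    return l_mat + l_gen
```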

3.4. Rectified Embedding Vector for Zero-Shot Classification

In the testing stage of zero-shot classification, the generated caption is passed to the text-feature extractor and combined with the image feature in the joint-embedding space to obtain the rectified vector. As displayed in Figure 2, the rectified vector created through this combination is expected to provide a better representation for matching with the prototypes defined by textual descriptions. For a given $j$th test image $I_j^{ts}$, the rectified embedding vector $R_j^{ts}$ is obtained through a weighted combination of the image-embedding feature $\varphi_I(I_j^{ts})$ and the embedding feature of the generated text $\varphi_T(G(I_j^{ts}))$, which can be written as follows:
$$R_j^{ts} = (1 - \alpha)\,\varphi_I(I_j^{ts}) + \alpha\,\varphi_T(G(I_j^{ts})), \qquad (5)$$
where $\alpha$ is a weight value in $[0, 1]$ that is empirically adjusted. The rectified vector $R$ is applied to the distance-based classifier with the class prototypes. Finally, the predicted class is obtained as follows:
$$y(I_j^{ts}) = \underset{k \in \mathcal{T}_{ts}}{\arg\min}\ Dist(R_j^{ts}, P_k). \qquad (6)$$
We verified the performance of the proposed method using the rectified vector on the benchmark datasets based on Equation (6), and the results show a considerable improvement in performance compared with the individual vectors, i.e., the single image and single text vectors.
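Equations (5) and (6) can be illustrated with the following test-time sketch, which reuses the modules defined earlier; the default weight of 0.3 is a placeholder, since the paper only states that α is tuned empirically.

```python
import torch

def rectified_prediction(model, decoder, img_feat, prototypes, alpha=0.3):
    """Classify one test image with the rectified vector of Eqs. (5) and (6).

    img_feat: (1, 2048) feature of a single test image.
    prototypes: dict {unseen class id: (emb_dim,) prototype tensor}.
    alpha: combination weight; 0.3 is a placeholder, not the tuned value.
    """
    with torch.no_grad():
        img_emb = model.embed_image(img_feat)       # phi_I(I)
        gen_tokens = decoder.generate(img_emb)      # G(I), generated caption
        txt_emb = model.embed_text(gen_tokens)      # phi_T(G(I))
        rectified = (1 - alpha) * img_emb + alpha * txt_emb        # Eq. (5)
        # Eq. (6): assign the image to the class of the nearest prototype.
        dists = {k: torch.dist(rectified.squeeze(0), p).item()
                 for k, p in prototypes.items()}
    return min(dists, key=dists.get)
```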

3.5. Data Augmentation Using Generated Captions

The textual descriptions obtained using the caption generator can also augment insufficient training data in more realistic situations. In actual datasets where images and texts exist as pairs, the text information for an image is often insufficient for learning the joint-embedding space. We propose a method to augment the training data with sentences created by the caption-generator model to deal with this problem. When a training image is given, descriptions can be generated through the caption generator and added to the training dataset. In that case, the training dataset $D_{tr} = \{(I_i, T_i, y_i)\}$ is enlarged with the augmented dataset $D_{Aug} = \{(I_i, T_i^{aug}, y_i)\}$, where $T_i^{aug} = G(I_i)$. The effect of this augmentation is investigated in the experiments section (Section 4.4).
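A simple sketch of this augmentation step, again reusing the earlier modules, is given below; note that generating several distinct captions per image (as in Section 4.4) would require sampling from the decoder rather than the greedy decoding assumed here.

```python
import torch

def augment_with_captions(train_set, model, decoder):
    """Enlarge D_tr with generated descriptions T_aug = G(I).

    train_set: list of (image_feature, description_tokens, class_label) triples,
    where image_feature has shape (1, 2048).
    """
    augmented = list(train_set)
    with torch.no_grad():
        for img_feat, _, label in train_set:
            img_emb = model.embed_image(img_feat)
            gen_tokens = decoder.generate(img_emb)     # G(I) as token indices
            augmented.append((img_feat, gen_tokens, label))
    return augmented
```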

4. Experimental Results

4.1. Experimental Settings

This section explains the experimental settings and the details of the training and testing stages. For image processing, the pre-trained ResNet-152 model [40] is employed as the feature extractor, and the weights of the image-feature extractor are fixed. For text processing, text features are extracted from the given descriptions by a trainable LSTM network. For constructing the intermediate joint-embedding space, two trainable fully connected layers were employed for dimension reduction, converting the 2048-dimensional image feature and the 1024-dimensional text feature into 256-dimensional vectors. In addition, we used the Adam optimizer with a learning rate of 0.005 and gradient clipping of 5.0 to stabilize the training of the LSTM, and the batch size in the training step was 55. For text preprocessing, all words in the training descriptions were used without removing stop words, and word-to-vector [34] models were used to embed each word. We used a single NVIDIA RTX 3090 GPU for computation.
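Under the settings above, the optimizer and gradient clipping could be wired up as in the following sketch, which assumes the JointEmbedding and CaptionDecoder classes sketched in Section 3; it is a configuration outline, not the released code.

```python
import torch
import torch.nn as nn

# Joint-embedding model with the dimensions stated above: 2048-d ResNet-152
# features and 1024-d LSTM text features are both projected to 256 dimensions.
model = JointEmbedding(img_dim=2048, txt_dim=1024, emb_dim=256)
decoder = CaptionDecoder(emb_dim=256)
BATCH_SIZE = 55

params = list(model.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=0.005)

def optimize(loss, max_norm=5.0):
    """One parameter update with the gradient clipping mentioned above."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(params, max_norm)   # clip gradients at 5.0
    optimizer.step()
```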
The number of descriptions per image also affects the classification accuracy. The training dataset has 10 descriptions per image, and we used all of them in the training stage. To create the prototypes $P_k\ (k \in \mathcal{T}_{ts})$ in the testing stage, we used all descriptions of one image in each prototype class, and this setting is the same as in [8] for a fair comparison. As the evaluation metric on the testing dataset, we used Rank-1 accuracy, as in previous studies [8,10,11].
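Prototype construction and Rank-1 evaluation could be sketched as follows; averaging the embedded descriptions into a prototype is an assumption about how the class prototype is formed, made here only to keep the example concrete.

```python
import torch

def build_prototypes(model, class_descriptions):
    """Average the embedded descriptions of each unseen class into a prototype.

    class_descriptions: dict {class id: (n_desc, T) tensor of token indices}.
    """
    protos = {}
    with torch.no_grad():
        for cls, tokens in class_descriptions.items():
            protos[cls] = model.embed_text(tokens).mean(dim=0)   # (emb_dim,)
    return protos

def rank1_accuracy(model, decoder, test_set, prototypes, alpha=0.3):
    """Fraction of test images whose nearest prototype is the true class."""
    correct = 0
    for img_feat, label in test_set:        # img_feat: (1, 2048)
        pred = rectified_prediction(model, decoder, img_feat, prototypes, alpha)
        correct += int(pred == label)
    return correct / len(test_set)
```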

4.2. Comparison Models

We compare the proposed model with the five methods presented below. We refer to the experimental results reported in previous works [8,11].
(1) Word CNN and Word CNN + RNN [8] construct the joint embedding using compatibility functions. Unlike previous work [28] that solely trained the image encoder, these methods train the image and text encoders simultaneously using a symmetric structured joint-embedding function.
(2) DEM [7] learns discriminative features using a bidirectional LSTM and a CNN. This method constructs the joint-embedding space by minimizing the Euclidean distance between the image and text features in the same embedding space. Unlike the other four comparison models, its original experiments only employ the CUB dataset.
(3) IATV [10] uses a two-stage strategy for the image–text matching task. This approach proposes a cross-modal entropy loss to construct the joint-embedding space. It also proposes a co-attention mechanism that incorporates the sentence into the image feature through an encoder–decoder LSTM network to improve the image representation.
(4) CMPC + CMPM [9] uses a norm-softmax loss to make the embedded features compact. This method regularizes the angular range of the representation features and thus yields more discriminative image features. In addition, it proposes a probability-based loss that minimizes the KL-divergence between the probability distributions of the truly matching pairs and the predicted matching pairs in the minibatch, avoiding sampling processes such as the traditional bidirectional ranking loss.
(5) TIMAM [11] also employs the norm-softmax cross-entropy loss for a compact image representation and uses KL-divergence minimization to construct the joint-embedding space as in [9]. It additionally introduces a GAN [41] loss with a discriminator to improve the representation and reduce the ambiguity between text and image features.

4.3. Experimental Results

The CUB dataset [15] contains 200 bird species and 11,788 images, divided into 150 training and 50 testing classes. In the training stage, 8855 image–description pairs corresponding to the 150 classes were used. In the testing stage, 2933 images corresponding to the 50 unseen classes were used. The experimental results on the CUB dataset are presented in Table 1. The proposed method surpasses the other methods in the Rank-1 accuracy of zero-shot image classification. In addition, an experiment was conducted on the flower dataset [16], which includes images of various flowers and their corresponding descriptions and class information. The dataset has 8189 images, divided into 82 training and 20 testing classes. In the training stage, 7034 image–description pairs corresponding to the 82 classes were used. A total of 10 descriptions were provided for each image, and all sentences were used for training. In the testing stage, 1155 images corresponding to the 20 unseen classes were used in the inference process. All input images were resized to 224 × 224. As with the CUB dataset, Table 1 reveals that the proposed model achieves a clear improvement in zero-shot classification accuracy. Additionally, the results of our proposed model also outperform most of the results of previous works under the zero-shot classification setting with predefined attributes [22,26,27].
We demonstrated the effect of the rectified vector through ablation studies. In Table 2, the method using only image features means that the generated-text feature is not combined with the image feature in the testing stage; thus, $\alpha = 0$ in Equation (5). Compared with this, the proposed method significantly improves the classification performance, demonstrating the effect of the rectified vector. We also verified whether the generated captions are semantically correct sentences. We set up a situation where only text features are used for the classification task, i.e., $\alpha = 1$ in Equation (5). In this experiment, the classification performance presents meaningful results, indicating that the image-captioning model generates sentences that semantically match the images.
Interestingly, the quality of the generated texts affects the improvement obtained by the combination. For the flower dataset, on which the method using only text features achieves relatively good classification performance, combining the two pieces of information increases the performance considerably more than on the CUB dataset. Thus, advanced image-captioning models that can produce more accurate descriptions, such as OFA [37] and mPLUG [38], could further improve the accuracy of the proposed method.

4.4. Image Caption for Training Data Augmentation

When the training data are insufficient, data augmentation using the generated sentences is expected to have a strong effect on model training. A different setting is required in the training stage to verify this assumption. Unlike the prior setting, we start the experiment with only one description for each training image. Then, we increase the number of additional descriptions generated by the caption generator one by one to observe the effect of augmentation.
As presented in Figure 3, using only one description (no augmentation) for training results in poor performance because the image-embedding vector cannot be related to various text-embedding vectors. However, through the data augmentation process using the generated captions, the performance improved remarkably on both datasets. From this result, we confirm that this simple approach to textual data augmentation is highly effective and delivers strong performance in data-insufficient situations.

4.5. Qualitative Study

Figure 4 presents examples of zero-shot image classification on the CUB dataset. In the table in Figure 4, the upper part shows the prediction results obtained using only the image-feature vector, and the lower part shows the case of using our proposed rectified vector. In the bird example, a Laysan albatross is characterized by a brown part of the body. A brown creeper, which the upper model predicted as the rank-1 answer, also has a brown body part. The upper model relied too heavily on the brown-body feature of the image, causing an incorrect prediction. Our proposed model (lower part) modifies the prediction of the original model by providing additional information from the generated sentence so that the prediction does not depend solely on the dominant image feature. As a result, the additional textual information provided by the generated description improves the classification accuracy. Although the generated sentences are sometimes not grammatically correct (right side of Figure 4), they can still provide meaningful information that corrects the wrong prediction.
One limitation of the proposed method is that the caption generator does not always generate semantically accurate textual information. Figure 5 illustrates some examples of generated sentences. While the sentences in the left examples are semantically matched with the images, the right examples reveal that some words in the generated sentences are inaccurate. Generated text that is semantically mismatched with the image may lead to poor performance. The use of a better generation model may reduce this problem and avoid performance degradation caused by misinterpreting the images.

5. Conclusions

In this paper, we proposed a novel zero-shot image-classification model that employs additional textual information obtained through a generative image-captioning model. To address the problem that classification performance is degraded by the modality inconsistency between the image and the prototype vectors composed of texts, we derived additional textual information that linguistically describes the image and combined it with the image feature to form a rectified vector in the test stage. This rectified vector results in a better representation than the image feature alone. Utilizing this rectified embedding vector achieves a notable performance improvement on two benchmark datasets. In addition, we demonstrated that when the textual descriptions in the training data are insufficient, the generated image captions can be used as augmentation data for learning the joint-embedding space.
In future work, various attempts at performance enhancement are possible. The image- and text-feature-extraction models can be improved by using more sophisticated architectures, such as ConvNeXt [21] and BERT [42]. Such advanced models could capture more discriminative features in image and text data. In addition, although we used a fixed weighted sum to combine the image and text features in the current model, this can be elaborated into a trainable combination module. These refinements may lead to better learning of the joint-embedding space and reduce the classification error. Furthermore, the proposed embedding-vector-rectification method can easily be applied to various multimodal tasks, such as text–video retrieval. For example, in a text–video search application that uses a joint-embedding space, better search results can be expected by combining the video representation with a caption obtained through a video-captioning module. Finally, while we currently augment the training data from the text side only, performance may be further improved by obtaining a richer set of data pairs through advanced image-augmentation methods [43].

Author Contributions

Conceptualization, C.H. and H.P.; data curation, C.H. and H.P.; formal analysis, C.H. and H.P.; methodology, C.H. and H.P.; software, C.H.; validation, C.H. and H.P.; visualization, C.H.; writing—original draft, C.H. and H.P.; writing—review and editing, C.H. and H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), with financial resources granted by the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600060). This work was also supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data we used are public data for academic research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 453–465.
  2. Larochelle, H.; Erhan, D.; Bengio, Y. Zero-data learning of new tasks. In Proceedings of the 23rd National Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008; AAAI Press: Washington, DC, USA, 2008.
  3. Rohrbach, M.; Stark, M.; Schiele, B. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011.
  4. Yu, X.; Aloimonos, Y. Attribute-Based Transfer Learning for Object Categorization with Zero/One Training Example. In Proceedings of the 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 127–140.
  5. Xu, X.; Shen, F.; Yang, Y.; Zhang, D.; Shen, H.T.; Song, J. Matrix Tri-Factorization with Manifold Regularizations for Zero-Shot Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2007–2016.
  6. Ding, Z.; Shao, M.; Fu, Y. Low-Rank Embedded Ensemble Semantic Dictionary for Zero-Shot Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6005–6013.
  7. Zhang, L.; Xiang, T.; Gong, S. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 2021–2030.
  8. Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 49–58.
  9. Zhang, Y.; Lu, H. Deep cross-modal projection learning for image-text matching. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 707–723.
  10. Li, S.; Xiao, T.; Li, H.; Yang, W.; Wang, X. Identity-aware textual-visual matching with latent co-attention. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1908–1917.
  11. Sarafianos, N.; Xu, X.; Kakadiaris, I. Adversarial Representation Learning for Text-To-Image Matching. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5813–5823.
  12. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015.
  13. Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903.
  14. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
  15. Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200-2011 Dataset; California Institute of Technology: Pasadena, CA, USA, 2011.
  16. Nilsback, M.-E.; Zisserman, A. Automated Flower Classification over a Large Number of Classes. In Proceedings of the 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008; pp. 722–729.
  17. Huang, Y.; Wu, Q.; Wang, W.; Wang, L. Learning Semantic Concepts and Order for Image and Sentence Matching. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
  18. Zhang, Y.; Zhou, W.; Wang, M.; Tian, Q.; Li, H. Deep Relation Embedding for Cross-Modal Retrieval. IEEE Trans. Image Process. 2021, 30, 617–627.
  19. Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009.
  20. Farhadi, A.; Endres, I.; Hoiem, D.; Forsyth, D. Describing Objects by their Attributes. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 1778–1785.
  21. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976.
  22. Jayaraman, D.; Grauman, K. Zero-shot recognition with unreliable attributes. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Cambridge, MA, USA, 8–13 December 2014; pp. 3464–3472.
  23. Palatucci, M.; Pomerleau, D.; Hinton, G.; Mitchell, T.M. Zero-shot learning with semantic output codes. In Proceedings of the 22nd International Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 7–10 December 2009; Curran Associates Inc.: Red Hook, NY, USA, 2009.
  24. Romera-Paredes, B. An embarrassingly simple approach to zero-shot learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015.
  25. Akata, Z.; Reed, S.; Walter, D.; Lee, H.; Schiele, B. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2927–2936.
  26. Bucher, M.; Herbin, S.; Jurie, F. Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classification. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 730–746.
  27. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
  28. Li, J.; Selvaraju, R.R.; Gotmare, A.D.; Joty, S.; Xiong, C.; Hoi, S. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. arXiv 2021, arXiv:2107.07651.
  29. Akata, Z.; Perronnin, F.; Harchaoui, Z.; Schmid, C. Label-embedding for image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 1425–1438.
  30. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 5–10 December 2013.
  31. Yang, Y.; Hospedales, T. A unified perspective on multi-domain and multi-task learning. arXiv 2014, arXiv:1412.7489.
  32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
  33. Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3242–3250.
  34. Chen, H.; Ding, G.; Lin, Z.; Zhao, S.; Han, J. Show, Observe and Tell: Attribute-Driven Attention Model for Image Captioning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 606–612.
  35. Cheng, Y.; Huang, F.; Zhou, L.; Jin, C.; Zhang, Y.; Zhang, T. A Hierarchical Multimodal Attention-based Neural Network for Image Captioning. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 889–892.
  36. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086.
  37. Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying Architectures, Tasks, and Modalities through a Simple Sequence-to-Sequence Learning Framework. arXiv 2022, arXiv:2202.03052.
  38. Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mPLUG: Effective and Efficient Vision-Language Learning by Cross-Modal Skip-Connections. arXiv 2022, arXiv:2205.12005.
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  40. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  41. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
  42. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  43. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
Figure 1. An example of utilizing generated text as additional information for zero-shot image classification. This generated-text-feature vector is combined with the original image-feature vector. Therefore, this rectified vector has a better representation on joint-embedding space, which leads to improved classification performance when it is compared with prototype vectors composed of free-style textual sentences.
Figure 2. The overall process of the proposed method. In the training stage, images and text pairs are processed through feature extraction encoders (CNN and LSTM) with fully connected layers to construct a joint-embedding space using matching loss. In the test stage, a rectified vector is created through a weighted sum of image-feature vectors and text-feature vectors in the joint-embedding space, and it is compared to the test class prototype vectors. Additionally, images and generated captions can be used for the augmentation of text descriptions in the training stage.
Figure 3. Change of zero-shot classification performance depending on the number of augmentation data. Unlike the original setting (five descriptions per image) used in the experiment of Table 1, we consider a situation where only one description was given for each image as the baseline (0 augmentation) and added the descriptions one by one. As the number of augmentation increases, the classification performance improves accordingly.
Figure 4. Zero-shot image classification results on the CUB dataset. Given the single test image, we compare classification results between using only the image-feature vector (upper) and using the proposed rectified vector (lower).
Figure 5. Examples of generated captions. The two examples on the left show correctly generated descriptions, and the two on the right contain incorrect attribute words. The words in red on the right side misrepresent the features of the birds.
Table 1. Experimental results on the two datasets (Rank-1 accuracy, %).

Representation Method | CUB | Flower
Word CNN [8] | 51.0 | 60.7
Word CNN + RNN [8] | 56.8 | 65.6
DEM [7] | 58.3 | 60.9
IATV [10] | 61.5 | 68.9
CMPC + CMPM [9] | 64.3 | 68.4
TIMAM [11] | 67.7 | 70.6
Proposed method (rectified vector) | 69.3 | 76.1
Table 2. Ablation studies (Rank-1 accuracy, %).

Method | CUB | Flower
Proposed method (rectified vector) | 69.3 | 76.1
Only image features (w/o generated text feature) | 67.5 | 72.1
Only text features (w/o image feature) | 24.4 | 43.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
