Article

Zero-Shot Image Caption Inference System Based on Pretrained Models

The School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3854; https://doi.org/10.3390/electronics13193854
Submission received: 26 August 2024 / Revised: 20 September 2024 / Accepted: 27 September 2024 / Published: 28 September 2024
(This article belongs to the Section Electronic Multimedia)

Abstract

Recently, zero-shot image captioning (ZSIC) has gained significant attention, given its potential to describe unseen objects in images. This is important for real-world applications such as human–computer interaction, intelligent education, and service robots. However, zero-shot image captioning methods based on large-scale pretrained models may generate descriptions containing objects that are not present in the image, a phenomenon termed “object hallucination”. This occurs because large-scale models tend to predict words or phrases that appeared with high frequency during the training phase. Additionally, these methods impose a fixed limit on the description length, which often leads to an improper ending. In this paper, a novel approach is proposed to reduce the object hallucination and improper ending problems in the ZSIC task. We introduce an additional emotion signal as guidance for sentence generation, and we find that a proper emotion filters out words describing objects that do not appear in the image. Moreover, we propose a novel strategy that gradually extends the number of words in a sentence to ensure that the generated sentence is properly completed. Experimental results show that the proposed method achieves the leading performance on unsupervised metrics. More importantly, qualitative examples illustrate the effect of our method in reducing hallucination and generating properly ended sentences.

1. Introduction

In the rapidly evolving fields of artificial intelligence and computer vision, image captioning has emerged as a foundational technique integral to various applications [1], including visually assistive systems [2,3,4], human–computer interaction [5,6,7], intelligent education [8,9], and service robots [10,11]. The ability to generate accurate and descriptive captions for images enhances the user experience and plays a crucial role in bridging the gap between visual and textual data. With the advent of zero-shot learning, zero-shot image captioning has gained significant attention [12], given its potential to generate captions for images not encountered during the training phase. This capability is necessary for real-world applications where the diversity and novelty of images often surpass the scope of pretrained datasets.
The current methods of zero-shot image captioning predominantly rely on large pretrained models to predict captions without additional training [12]. These methods generally initialize a sentence of limited length. After that, a sampling strategy is used to update each word iteratively. To predict the word at the selected position, a distribution over the vocabulary is calculated using a combination of language models [13] and vision-language models such as CLIP [14] (a large pretrained model capable of calculating the similarity between images and text). Although these approaches mark a significant advancement in the field, they are not without limitations.
One disadvantage of existing zero-shot image captioning methods [12] is the tendency to generate descriptions containing objects that are not present in the image, a phenomenon termed “object hallucination” [15]. Large-scale models tend to describe objects that do not exist in the image but that frequently appeared during pretraining. This issue also arises because current methods do not adequately consider the emotion conveyed by the image. If the emotion of the output sentence does not match the inherent emotion of the image, the language model might infer incorrect objects, leading to flawed captions. Additionally, these methods impose a fixed description length, which leads to an improper ending: a description that stops too early is insufficient, while one that stops too late is padded with an awkward sequel.
To address these issues, this paper proposes an innovative approach that introduces emotion signals and dynamic sentence-length adjustment into the zero-shot image captioning framework. Our method begins by pre-defining the emotion of the image, which is then used as a control signal to guide the caption generation process. This alignment ensures that the emotion underlying the image matches the emotion the model uses to generate captions, significantly reducing the occurrence of hallucination. Furthermore, we implement an iterative procedure to change the length of the generated sentence. Initially, a short sentence is produced as a foundation, followed by the insertion of masked positions to incrementally expand the sentence length. Alongside this process, the input image is also progressively refined from a blurred version to the original. This iterative extension allows for more flexible and accurate descriptions, accommodating the natural complexity and detail inherent in visual content.
The contributions of this paper are summarized as follows:
(1) An emotion signal module is introduced to integrate the control signal, which aligns better with the emotions inherently present in the images.
(2) An iterative word insertion algorithm is proposed, which can extend the length of a sentence to obtain a better caption.

2. Related Work

The typical image captioning pipeline includes two modules, the visual encoder and the language generator [16].
The first phase of this pipeline provides an effective representation of the visual content. Early researchers extracted global features by convolutional neural networks (CNNs, neural networks using convolutional layers to extract features from data) [17], which may lead to excessive compression of information. To increase the granularity level of visual encoding, Xu et al. [18] introduced the visual attention mechanism to focus on specific regions. Jiang et al. [19] proposed to use multiple CNNs to exploit their complementary information and then fused their representations with a recurrent procedure. Anderson et al. [20] proposed the bottom-up framework where an object detector was utilized to detect interesting image regions. Considering the relationship between objects, Graph Convolutional Networks (GCNs; they extend neural network methods to graph data, enabling node classification and link prediction by leveraging neighborhood information) [21] were introduced to mine the visual relationships between objects [22]. Shi et al. [23] represented the image as a semantic relationship graph and required the module to predict the nodes on annotated captions. Self-attention techniques can enhance the correlation between objects by adding different weights. Yang et al. [24] introduced the self-attention module to enhance relationships between features extracted by the object detector. Similarly, Li et al. [25] proposed a transformer model, where a visual encoder was coupled with a semantic encoder. With the development of visual transformer structures, effective image representations extracted by Vision Transformers (ViTs; they apply self-attention mechanisms to sequences of image patches, enhancing performance on vision tasks) [26] and CLIP [14] were utilized for the image captioning task [27].
When the visual encoding is obtained by the encoder, the second phase is to generate a sentence from it. A direct approach for the generator, or decoder, is to use a single-layer Long Short-Term Memory (LSTM, a type of recurrent neural network designed to remember long-term dependencies), as proposed by Vinyals et al. [28]. After that, Ge et al. [29] used a bidirectional LSTM as an auxiliary module to enhance context generation. Huang et al. [30] augmented the LSTM with an attention operator, where attention to visual and textual information was enhanced for text generation. Recurrent Neural Networks (RNNs) are a class of neural networks designed for processing sequential data, capable of maintaining a memory of previous inputs through their internal state. Zhu et al. [31] proposed RNN-based image captioning language models using a decoder enriched with self-attention. The Transformer structure [32] significantly improves the modeling of sequential data and has also been used in the image captioning task. Ji et al. [33] used multi-head attention to calculate the influence of the image feature on each word. Cornia et al. [34] performed cross-attention between the decoder and all encoding layers rather than only the last one.
Since large-scale models such as BERT and CLIP are widely used as pretrained modules of visual and textual systems, it is possible to achieve image captioning without training on datasets, which is called zero-shot image captioning [12]. However, this may introduce the visual “hallucination” problem, especially when the control signal does not match the emotion originally contained in the image. In this case, the language model generates words for objects that do not appear in the image. An example is illustrated in Section 4.1. The reason for the visual hallucination is the textual prior of the pretrained model, which encodes strong correlations between words, especially under specific emotions.
Tu et al. introduced a zero-shot controllable text generation model that utilized multimodal signals to generate text with desired attributes [35]. Although the method worked without additional training, it still had some hallucination issues and lacked results on unsupervised metrics.
Other methods require more or less additional training when using large pretrained models, and these methods can also be classified as supervised methods. Li et al. introduced a framework that leveraged the capabilities of CLIP [36]; it required text data for decoder training. Ma et al. presented an approach to image captioning through the development of the Dynamic Transformer Network (DTNet) [37]. DTNet addressed the limitations of existing methods by dynamically assigning customized paths for different input samples, enhancing the ability to generate accurate captions. Guo et al. introduced an approach to semantic communications, specifically focusing on enhancing understanding-level semantic communications between agents and humans, as well as between agents themselves [38]. This approach transcended traditional feature-level semantic communications by incorporating true semantic understanding into the transmission process. However, these methods all rely on additional data and training, so they are not truly zero-shot methods and lack experimental results on unsupervised metrics.

3. Framework of Image Captioning with Zero-Shot

In this section, we introduce the image captioning task and present an implementation framework [12] for this task. The framework mainly consists of two parts: masked prediction based on a language model and similarity evaluation based on a cross-modal text–image matching model. The purpose of the framework is to select text that is linguistically fluent and aligns with the image content for sentence optimization. Although the framework has some flaws and fails to consider emotional factors, it provides a good baseline for generating emotional image descriptions.

3.1. Formulation of Task: Image Captioning

The image captioning task requires the model to generate descriptive text for a given input image. This is a cross-modal task, so the main issues are how to generate fluent and coherent sentences in the text modality, and how to ensure that the generated sentences match the content of the image. Let $I$ be the input image; then, the task can be modeled as follows: in the latent semantic space $\mathcal{L}$, find a descriptive text $X$ that minimizes the distance between $I_{latent}$ and $X_{latent}$ while remaining consistent with the language prior distribution $p(X)$. Here, $I_{latent}$ and $X_{latent}$ are the mappings of $I$ and $X$ into the semantic space.
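One possible formalization of this objective is sketched below; the distance function $d(\cdot,\cdot)$ and the weighting term $\lambda$ are our notation rather than the paper's:

$$X^{\ast} = \arg\min_{X}\; d\left(I_{latent},\, X_{latent}\right) - \lambda \log p(X),$$

where $\lambda$ trades off image–text alignment against linguistic fluency.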
Existing research has many implementations that can achieve excellent performance. However, they often require a large training cost and have poor generalization ability. Therefore, we introduce a framework that has the potential to address this deficiency.

3.2. Introduction of Framework

To solve the problem of the high training cost and poor generalization ability in existing image captioning methods, we introduce and incorporate a promising framework [12]. The framework is an iterative optimization method that mainly predicts the distribution of input sentences through a Masked Language Model, samples based on context, evaluates texts using a visual-language alignment model, and updates sentences. The input of the framework consists of the image to be described and the sentence to be updated. The output is an optimized description sentence. We qualitatively demonstrate this framework in Figure 1a. We made partial improvements to the original framework to address deficiencies and enhance performance in text generation, as shown in Figure 1b. The details are explained in Section 4.2.
Input of sentence: One of the inputs of the framework is a sentence $X = [x_1, x_2, ..., x_n]$ to be updated, where n is the length of the sentence. In the first iteration, X is initialized with all [MASK] tokens, a technique from Masked Language Modeling (MLM). When a position in a sentence is filled with [MASK], the language model can predict the word distribution at that position according to the context. In subsequent iterations, X is the output of the previous iteration, which enables continuous optimization and updating. We utilize BERT instead of auto-regressive models such as GPT-2 [39] to avoid the pattern collapse [40] and repetition issues introduced by auto-regressive models.
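As a minimal sketch of this masked prediction step, the snippet below queries BERT's word distribution at a [MASK] position with the Hugging Face transformers library; bert_topk is an illustrative helper name and the example sentence is hypothetical:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

def bert_topk(masked_text: str, k: int = 5, mask_occurrence: int = 0):
    """Top-k words and confidences for the mask_occurrence-th [MASK] token in masked_text."""
    inputs = tok(masked_text, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    mask_positions = (inputs.input_ids[0] == tok.mask_token_id).nonzero(as_tuple=True)[0]
    probs = logits[0, mask_positions[mask_occurrence]].softmax(dim=-1)
    top = probs.topk(k)
    return tok.convert_ids_to_tokens(top.indices.tolist()), top.values

# e.g. bert_topk("two girls in uniforms are [MASK] desserts", k=3)
```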
Masking strategy: After obtaining the input X, the framework needs to select and mask a position for subsequent sentence updates. The strategy of selecting masking positions mainly includes two types: sequential and shuffle. The sequential strategy masks from the first word, moving one position backward with each iteration. For example, in the first iteration, it masks the first position; in the second iteration, it masks the second position, and so on. In contrast, the shuffle strategy randomly selects a masked position in each sentence but ensures that masked positions are neither repeated nor missed. Additionally, the sequential strategy can be seen as a special case of the shuffle strategy. The difference between them is that the sequential strategy can produce linguistically smoother results while the shuffle strategy can avoid monotonous repetition in generated results and prevent potential error accumulation.
Fill the Mask: After determining the mask position and applying the mask, the framework takes the masked sentence X and the image I as input to update the masked position. Specifically, the update relies on the Masked Language Model (i.e., BERT) and the image–text matching model (i.e., CLIP). First, we feed X into BERT. Based on the context, BERT outputs candidate words for the masked position together with their confidences $P_{BERT}$. We sample the K words with the highest confidence from these candidates and fill them into the sentence to form a set of candidate sentences $X_{candi}$. Then, we input $X_{candi}$ and I into CLIP to evaluate the matching degree between each candidate sentence and the image I, obtaining the corresponding confidence $P_{CLIP}$. Finally, we calculate
$$P_{Score} = \alpha P_{BERT} + \beta P_{CLIP}, \quad (1)$$
and select the sentence with the highest $P_{Score}$ as the output of this iteration (which also serves as the input for the next iteration).
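For illustration, a minimal sketch of this scoring step is given below, assuming the clip-vit-base-patch32 checkpoint named in Section 5.2 and the α and β values from Table 1; score_candidates and p_bert are illustrative names, and image is a PIL image:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_candidates(image, candidates, p_bert, alpha=0.001, beta=15.0):
    """Combine BERT confidences with CLIP image-text matching as in Equation (1)."""
    inputs = clip_proc(text=candidates, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image   # shape (1, K): image-to-text similarities
    p_clip = logits.softmax(dim=-1)[0]             # normalize over the K candidate sentences
    p_score = alpha * torch.as_tensor(p_bert) + beta * p_clip
    best = int(p_score.argmax())
    return candidates[best], p_score
```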
The detailed generation process under this framework is shown in Algorithm 1.
Algorithm 1 Zero-shot caption generation with fixed length
Input: Image I, caption length n, initial caption X^(0) = [x_1, ..., x_n], iterations T
Output: Final caption X^(T) = [x_1, ..., x_n]
1: for t = 1 to T do
2:    Determine the position sequence P
3:    for all i ∈ P do
4:        Replace x_i with [MASK]
5:        Predict the word distribution at the masked position and sample
6:        Select the top-K candidate words from the word distribution
7:        Obtain the candidate sentence set X_C = {x^(k)}, k = 1, ..., K
8:        Compute the CLIP score for X_C
9:        Select x_i to infill [MASK] by Equation (1)
10:   end for
11: end for
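For concreteness, a compact Python sketch of this loop is shown below. It reuses the hypothetical bert_topk and score_candidates helpers from the earlier sketches, glosses over subword handling, and is not the authors' implementation:

```python
import random

def generate_caption(image, length=10, iterations=2, K=600, shuffle=False):
    """Sketch of Algorithm 1: iteratively re-fill each position of a fixed-length caption."""
    tokens = ["[MASK]"] * length                           # all-[MASK] initialization
    for _ in range(iterations):
        positions = list(range(length))
        if shuffle:                                        # shuffle strategy; sequential otherwise
            random.shuffle(positions)
        for i in positions:
            tokens[i] = "[MASK]"
            occ = tokens[:i].count("[MASK]")               # which [MASK] in the string is position i
            words, p_bert = bert_topk(" ".join(tokens), k=K, mask_occurrence=occ)
            candidates = [" ".join(tokens[:i] + [w] + tokens[i + 1:]) for w in words]
            best, _ = score_candidates(image, candidates, p_bert)
            tokens[i] = best.split()[i]                    # keep the word chosen for position i
    return " ".join(tokens)
```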

4. Analysis and Improvement

The framework we introduced and adopted not only shows promising results in terms of flexibility and richness of generation but also benefits from the intrinsic properties of large pretrained models, demonstrating good zero-shot performance. However, there are still some issues to be addressed, including the so-called “hallucination” [15] and mismatches. We discuss and analyze these problems below, followed by the proposal of further improvement methods to enhance the performance of the framework.

4.1. Analysis of Framework

Hallucination [15]. During generation, the model may produce results that do not match or that deviate from the content of the image. As shown in Figure 2, two girls in uniforms are enjoying chocolate and desserts. However, the model outputs that the girls are “smoking cigarettes”, which is a “hallucination”. Through an in-depth analysis, we found that this is due to a problem inherent in large language models themselves. The model’s generation is based on selecting candidate words produced by context-based language models such as BERT. A serious issue arises when there is no visual information guiding this step: there may be no words within the top-K candidate set that align with the image content. In other words, it is like “the blind man and the elephant”. Language models learn associations between words from training data as a textual prior and produce incorrect, hallucinated associations when generating candidate words.
The improper ending. As mentioned earlier, in the generated sentence $X = [x_1, x_2, ..., x_n]$, the length n is also a hyperparameter. However, setting a pre-determined sentence length brings another issue: the model may have already generated a good description $X_A$, yet there is still space left that must be filled. As a consequence, the model is forced to awkwardly append content after $X_A$, which decreases the fluency and accuracy of the generated results. Examples of this problem can be seen in Figure 3.
The above issues and analysis focus on specific examples. Here, we can provide a more vivid qualitative analysis. Suppose there is a latent space L , where both images and text are mapped as distributions in L . It can be seen that the sentences generated by the model are closer to human language in terms of distribution but further away from the image distribution. This indicates that the generated results are insufficient to reflect the semantic content of the image excellently. Therefore, improvements need to be made to reduce the gap between generated text and image content.

4.2. Improvement

Based on the analysis of the cases in the existing framework, we identified the reasons behind these results. We then improved three aspects: sentiment signal guidance, input image pre-processing, and caption length interpolation.
Image Captioning with Sentiment. Although existing image captioning (IC) methods have achieved remarkable results, the generated content is still somewhat rigid and emotionless. In contrast, human descriptions of images are often natural and flexible. One possible reason is that existing methods do not consider emotional factors when generating captions. Emotion can not only enhance the linguistic similarity between generated content and human descriptions but also guide text generation to reduce the hallucination issue. Therefore, we improved the evaluation criteria for candidate words by adding a Text Sentiment Analysis (TSA) module. For a given emotional category (positive/negative), the module calculates the confidence that a candidate word belongs to the specified category. The scoring formula is thus updated as follows:
$$P_{Score} = \alpha P_{BERT} + \beta P_{CLIP} + \gamma P_{TSA}. \quad (2)$$
For example, if the “positive” emotional signal is added as a control to the example of Figure 2, it can improve the generation results of the model and reduce the hallucination problem, as shown in Figure 4.
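As an illustration of how the $P_{TSA}$ term can be computed, the sketch below scores candidate sentences with an off-the-shelf binary sentiment classifier; the SST-2 checkpoint is used purely for illustration and differs from the TSA model described in Section 5.2:

```python
from transformers import pipeline

# Off-the-shelf binary sentiment classifier (illustrative choice, not the paper's TSA model).
tsa = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english", top_k=None)

def p_tsa(candidates, target="positive"):
    """Confidence that each candidate sentence carries the requested emotion signal."""
    scores = []
    for preds in tsa(candidates):              # one list of {label, score} per sentence
        probs = {p["label"].lower(): p["score"] for p in preds}
        scores.append(probs.get(target, 0.0))
    return scores
```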
Transformation of Image. Emotion is an important part of improving the generation. Therefore, it is necessary to understand the human emotional cognitive process. According to existing research in psychology [41], the process of emotion generation is gradual: it starts with rough recognition and processing of visual signals to obtain a general sense of emotion, followed by gradual refinement. Following this human mechanism, we transformed the images using different degrees of Gaussian blur. For an input original image I, after the transformation, we obtain a set $\mathcal{I} = \{I_1, I_2, I\}$ with different levels of blurriness. The method starts with the more blurred images as input and gradually increases the clarity of the input image to achieve an accurate refinement of emotion. When the clarity of the image changes, the length restriction on the sentence is also adjusted accordingly.
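A minimal sketch of this pre-processing step is shown below, using OpenCV's GaussianBlur with the kernel sizes and standard deviations listed in Table 1; blur_pyramid is an illustrative name:

```python
import cv2

def blur_pyramid(image_path):
    """Build the progressively sharper input set {I1, I2, I} with the Table 1 settings."""
    img = cv2.imread(image_path)
    i1 = cv2.GaussianBlur(img, (11, 11), 5)    # strongly blurred input for the first stage
    i2 = cv2.GaussianBlur(img, (3, 3), 0.5)    # lightly blurred input for the middle stage
    return [i1, i2, img]                       # the original image is used in the final stage
```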
Interpolation and variable length. On the one hand, a fully randomly initialized sentence may lead the model to fall into local optima; on the other hand, emotional cognitive processes proceed from shallow to deep. Therefore, a natural idea is to initially restrict the model to generate shorter sentences and then gradually relax the length limitation to refine them. As mentioned earlier, expanding the length is accompanied by changes in image clarity. In our implementation, we limited the length to 5 in the first two rounds of iteration with the blurry $I_1$ as input. We then relaxed this limitation to 8 for the next three iterations with the input replaced by the relatively clear $I_2$, and finally expanded it to 10 in the last iteration while using the original image I as input (a sketch of the interpolation step follows Algorithm 2). This approach not only provided good guidance for model generation but also reduced computational complexity and improved generation speed. Most importantly, it reduced the probability of “the improper ending” described earlier in Figure 3. Specifically, Algorithm 1 shows the process of generating a caption with a fixed number of words, and our method gradually extends the length of the caption as described in Algorithm 2. Note that Algorithm 2 is an improvement and extension of Algorithm 1. In conclusion, our improvements include the integration of emotional factors, the processing of input images, and the guided restriction of caption length. Refer to Figure 1a,b for a qualitative illustration of this point.
Algorithm 2 Algorithm of our method
Input: Image I, initial caption X^(0) = [x_1, ..., x_n], iterations T, emotion signal E
Output: Final caption X^(T) = [x_1, ..., x_n]
1: Transform the image: I = f(I)
2: for t = 1 to 2 do
3:    Run Algorithm 1 with n = 5
4: end for
5: Randomly select 3 positions in X^(2) as P
6: for all i ∈ P do
7:    Replace x_i with [MASK]
8:    Predict the word distribution at the masked position and sample
9:    Select the top-K candidate words from the word distribution
10:   Obtain the candidate sentence set X_C = {x^(k)}, k = 1, ..., K
11:   Compute the CLIP and sentiment scores for X_C
12:   Select x_i to infill [MASK] by Equation (2)
13: end for
14: Run Algorithm 1 with the current caption X^(3) as input and n = 8
15: Randomly select 2 positions in the current caption as P
16: for all i ∈ P do
17:   Replace x_i with [MASK]
18:   Predict the word distribution based on the masked position and sample
19:   Select the top-K candidate words from the word distribution
20:   Obtain the candidate sentence set X_C = {x^(k)}, k = 1, ..., K
21:   Compute the CLIP and sentiment scores for X_C
22:   Select x_i to infill [MASK] by Equation (2)
23: end for
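The length-expansion step in Algorithm 2 (selecting positions and re-filling them) can be read as inserting [MASK] placeholders into the shorter caption and then filling them with the Equation (2) score. A sketch under that reading, with extend_with_masks as an illustrative name:

```python
import random

def extend_with_masks(tokens, n_insert):
    """Lengthen a caption by inserting [MASK] placeholders at random positions;
    the new slots are then re-filled using the Equation (2) score."""
    tokens = list(tokens)
    for _ in range(n_insert):
        pos = random.randint(0, len(tokens))   # a slot between (or around) existing words
        tokens.insert(pos, "[MASK]")
    return tokens

# e.g. extend_with_masks("a dog runs on grass".split(), 3) might give
# ['a', '[MASK]', 'dog', 'runs', '[MASK]', 'on', '[MASK]', 'grass']
```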

5. Experiment Design and Details

To evaluate the effectiveness of our method, particularly its zero-shot ability, we conducted experiments on both standard image captioning (without any sentiment signal) and sentiment image captioning tasks. We first follow previous work and conduct standard image captioning (IC) using our method. Then, experiments on sentiment IC are carried out. Finally, the generation speed and other qualitative results are presented.

5.1. Dataset

The COCO dataset is a large-scale visual dataset that has been widely used in computer vision tasks such as object detection, object segmentation, and image captioning. The dataset includes 328k image instances of 91 object types, aiming to depict common objects in natural scenes. COCO Caption is a subset of the COCO dataset. For each image in the dataset, five independent human caption annotations are provided. This dataset has been widely used in previous work. We conducted experiments using the val2017 split of COCO Caption, which contains 5000 images and 25,000 corresponding human captions.
SentiCap is a dataset focusing on sentiment image captioning. By rewriting human captions in the COCO Caption dataset to incorporate binary emotional signals, an image dataset containing emotion categories and emotional descriptions is constructed. We conducted experiments using a test set with 429 positive captions and 493 negative captions.

5.2. Implementation Detail

All experiments were conducted on a single RTX 4080, without any additional training or fine-tuning. The hyperparameters used in the experiments are shown in Table 1. For sentiment tasks, an additional hyperparameter γ = 1 was used. The BERT model used was bert-base-uncased, and the CLIP model was clip-vit-base-patch32. The TSA model was indobert-emotion-classification; we post-processed the model's output categories by excluding neutral and then re-normalizing, considering happy and love as positive emotions and sadness, fear, and anger as negative emotions. The confidences within each class were summed to obtain the output confidence score. Gaussian blur processing of images was performed using the Gaussian blur function provided by OpenCV, with the kernel sizes and corresponding standard deviations shown in Table 1.
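A sketch of this post-processing, assuming the classifier's output has been collected into a dictionary of class confidences; group_emotions is an illustrative name:

```python
POSITIVE = {"happy", "love"}
NEGATIVE = {"sadness", "fear", "anger"}

def group_emotions(label_scores):
    """Drop 'neutral', re-normalize, then sum class confidences into positive/negative."""
    kept = {k.lower(): v for k, v in label_scores.items() if k.lower() != "neutral"}
    total = sum(kept.values()) or 1.0
    kept = {k: v / total for k, v in kept.items()}
    return {
        "positive": sum(v for k, v in kept.items() if k in POSITIVE),
        "negative": sum(v for k, v in kept.items() if k in NEGATIVE),
    }

# e.g. group_emotions({"happy": 0.5, "love": 0.1, "sadness": 0.2,
#                      "fear": 0.1, "anger": 0.05, "neutral": 0.05})
```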

5.3. Zero-Shot Ability with Standard Image Captioning

On the COCO Caption dataset, we completed the standard image captioning (IC) task without adding emotional signals. We hoped to verify through the performance of our method on the standard IC whether it could have a good generalization ability and perform well on unseen samples.
Following previous work, we selected BLEU-4, METEOR, CIDEr, and RefCLIPScore as supervised metrics, and CLIPScore as an unsupervised metric. Supervised metrics are obtained by calculating the similarity between generated results and human reference annotations. They have been widely used to evaluate the quality of text generated by models. However, these metrics may lead models to generate text consistent with dataset bias, thereby limiting the flexibility of the generation capability and the zero-shot ability. Unsupervised metrics are obtained by directly encoding images and texts into a unified vector space and calculating distances. This cross-modal metric avoids over-fitting to specific datasets and can bring greater diversity to the results, thus benefiting the zero-shot capabilities of methods.
BLEU [42] evaluates the quality of text generated by machines by measuring the similarity of n-grams between the generated text and reference texts. Specifically, BLEU-4 focuses on the precision of four-gram overlaps to assess the accuracy of consecutive word sequences.
METEOR [43] is an automated metric designed to evaluate the quality of text machines produce. It incorporates unigram matching, including both exact and stemmed forms, and aligns these matches.
CIDEr [44] serves as an evaluation metric specifically for the quality of image captions within the field of computer vision. It utilizes a tf-idf weighting scheme to emphasize the significance of less common, yet more informative n-grams within the captions.
RefCLIPScore [45] is an evaluation metric based on deep learning, specifically designed to assess the quality of text generated for image description tasks. It incorporates the capabilities of the CLIP model, which evaluates text accuracy and relevance by measuring the semantic similarity between generated text and reference text. The CLIP model, trained jointly on images and text, acquires rich features that adeptly capture content across different modalities. In RefCLIPScore, both the generated and reference texts are transformed into vector representations, and their semantic proximity is determined through the calculation of cosine similarity between these vectors. This method goes beyond mere surface matching, probing deeper into the semantic layers of the text, thereby ensuring more thorough and precise evaluations.
CLIPScore [45] is an unsupervised metric for evaluating image captioning models. Leveraging the ability of the CLIP model to interpret the relationship between images and text, CLIPScore measures the quality of descriptions by evaluating semantic and lexical alignment between generated text and images in a shared embedding space. A higher CLIPScore indicates greater semantic consistency between captions and images, which is reflected in their proximity in the embedding space.
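As a minimal sketch, CLIPScore can be computed with the CLIP checkpoint from Section 5.2 as shown below; the scaling constant w = 2.5 follows the original CLIPScore formulation [45], and clip_score is an illustrative name:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption, w=2.5):
    """Reference-free CLIPScore: w * max(cosine(image_emb, text_emb), 0)."""
    inputs = proc(text=[caption], images=image, return_tensors="pt",
                  padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1)
    return float(w * torch.clamp(cos, min=0.0))
```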

5.4. Emotional Zero-Shot Image Caption

Next, to further explore the generation ability of our method on the emotional captioning task, we conducted experiments on the SentiCap dataset. Following previous research, we adopted BLEU-3 and METEOR as supervised metrics and CLIPScore as an unsupervised metric. We also used a TSA model to calculate the sentiment accuracy of the generated results.

6. Experiment Results

6.1. Quantitative Experiment Results

The results are shown in Table 2. In terms of the traditional supervised metrics (BLEU-4, METEOR, and CIDEr), our method lagged behind other methods. This was because our generation, while ensuring semantic correctness, adopted a more flexible expression rather than simply imitating the language style and vocabulary of the COCO dataset. Our method performed well on RefCLIPScore and CLIPScore, which focus more on semantics. This indicated that our method outperformed the others, whether calculating the semantic similarity between generated results and human reference captions or the similarity between generated results and image content. It also showed that our method captured higher-level semantic concepts with more flexible expressive ability and zero-shot capability.
To further evaluate the ability of methods on the emotional image captioning task, we conducted experiments on the SentiCap dataset. As shown in Table 2, similar to the standard image captioning (IC) results, our method lagged behind other methods on traditional supervised metrics, likely for the same reasons as in the standard IC experiment. Since there was no public code available for the other methods, we could not obtain their CLIPScore. However, our method slightly outperformed its standard IC results on CLIPScore after introducing the emotion signal. This indicated that introducing emotional factors did improve the quality of generation and guided the method to generate text that better aligned with image semantics. We also evaluated the sentiment accuracy of the generated results and found that our method outperformed the comparative methods. Finally, the performance of our method in generating text with negative emotion was slightly worse than that with positive emotion; this may be because negative emotions are more complex.

6.2. Qualitative Result

We present some generated results here as qualitative results. As shown in Figure 5, our method produced more diverse content and integrated emotional factors well. Our method generated “river” in the first example, which was also mentioned in the ground truth (GT). It also generated “church”, which was ignored in the GT. Due to the positive emotional guidance, the result included the word “love”, which was consistent with the image content. This indicated that our method had better diversity and generalization capabilities. At the same time, ZeroCap encountered a pattern collapse issue in this example: repeated word generation. In the second example, both our method and ZeroCap successfully focused on “couple” and “dressed”, two semantic elements appearing in the GT. Meanwhile, ZeroCap also paid attention to their different eyes. The third example was more complex, with abstract semantic elements, and neither our method nor ZeroCap generated a fully satisfactory caption. However, while ZeroCap focused only on the sign, our method captured semantic elements such as “roadblock” and “public”.
As a verification of the assumption of consistency between the image and text distributions, we qualitatively analyzed one result of the method. In Figure 6, we encoded the generated text and images into 512-dimensional vectors using CLIP and randomly sampled 30 dimensions for visualization. Our results were close to both the image distribution and the human caption references. This indicated that our method indeed reflected the image semantics well and had a richer, more flexible generative ability to handle unseen samples. Moreover, in Figure 5, the first and second images show that ZeroCap’s results differed significantly from the image distribution, while our method achieved results close to it. In the third image, the GT’s results also showed a large deviation from the image distribution, qualitatively demonstrating the significance of zero-shot tasks: better generalization avoids dataset annotation bias.
We also wondered whether the method alleviated the “bad sequel” problem. Therefore, based on the positive subset of SentiCap, we truncated the generated results word by word and calculated their CLIPScore. As shown in Figure 7a, the generated results were sequentially truncated to fixed lengths from the list [3, 4, 5, …, 9, 10]. ZeroCap was introduced as a comparison method, and the metrics represent the average results over the positive subset. The results showed that, overall, as the sentence length increased, the CLIPScore also gradually increased. Additionally, our method had higher overall results than ZeroCap and smaller variance at the same time. This indicated that our interpolation strategy indeed alleviated the “bad sequel” problem.
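A sketch of this truncation analysis, reusing the hypothetical clip_score helper from Section 5.3:

```python
def truncation_curve(image, caption, lengths=range(3, 11)):
    """Truncate a generated caption word by word and record CLIPScore at each length."""
    words = caption.split()
    return {n: clip_score(image, " ".join(words[:n])) for n in lengths if n <= len(words)}
```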

6.3. Performance on Speed

We also care about the speed of generation. We explored the generation speed of methods under different k’s with k = [200, 300, 400, 500, 600], and the results showed that as k increased, the time taken for model generation continued to increase. As shown in Figure 7b, compared to ZeroCap used as a control group, our method had a faster generation speed.

7. Conclusions

This paper proposed a novel method for zero-shot image captioning. Previous work used control signals to generate descriptions with a fixed overall number of words. However, existing zero-shot image captioning methods often suffer from hallucination problems and inappropriate endings during generation, while the generated results are typically overly rigid. This leaves space for improvement in captioning outcomes. To avoid the hallucination problem in this framework, we added an emotion module to ensure that the sentiment of the description matched that of the input image. Moreover, the length of the output sentence was gradually increased, instead of being fixed at the beginning or left totally unlimited. This brought more flexible and natural generation to the method. Experimental results showed that the proposed method achieved the best performance on both the unsupervised metric CLIPScore and the supervised metric RefCLIPScore.
Our method still has some issues. Currently, the emotion labels are only binary, and the granularity of the labels is relatively coarse. Therefore, the method does not perform well when describing abstract art paintings, because they often contain subtle and complex emotions. Moreover, compared to traditional supervised methods that do not use large-scale pretraining, our method still lags behind in generation speed. We hope to alleviate this issue through mechanisms such as knowledge distillation in future work.

Author Contributions

Conceptualization, J.L. and X.Z.; methodology, X.Z. and J.S.; software, X.Z.; validation, X.Z., Y.W., and J.X.; formal analysis, X.Z. and J.L.; investigation, X.Z.; resources, J.L.; data curation, J.X. and Y.W.; writing—original draft preparation, J.L. and X.Z.; writing—review and editing, J.L., X.Z., and J.S.; visualization, J.S. and Y.W.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China grant no. 62206162, Young Talent Fund of Xi’an Association for Science and Technology grant no. 959202313048, the Ministry of Education’s Cooperative Education Project grant no. 202102591018.

Data Availability Statement

We have shared the link of the used dataset in the manuscript. The link for the COCO Caption image dataset is: http://images.cocodataset.org/zips/val2017.zip; the corresponding caption annotation information can be found at: https://cocodataset.org/#download. The link for the SentiCap dataset is: https://users.cecs.anu.edu.au/~u4534172/senticap.html. The experimental section did not create new data.

Conflicts of Interest

We declare that all authors have no conflict of interest including known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Ghandi, T.; Pourreza, H.; Mahyar, H. Deep Learning Approaches on Image Captioning: A Review. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  2. Dognin, P.; Melnyk, I.; Mroueh, Y.; Padhi, I.; Rigotti, M.; Ross, J.; Schiff, Y.; Young, R.A.; Belgodere, B. Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge. J. Artif. Int. Res. 2022, 73, 437–459. [Google Scholar] [CrossRef]
  3. Sidorov, O.; Hu, R.; Rohrbach, M.; Singh, A. TextCaps: A Dataset for Image Captioning with Reading Comprehension. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 742–758. [Google Scholar]
  4. Gurari, D.; Zhao, Y.; Zhang, M.; Bhattacharya, N. Captioning Images Taken by People Who Are Blind. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 417–434. [Google Scholar]
  5. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 121–137. [Google Scholar]
  6. Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 457–468. [Google Scholar]
  7. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
  8. Zhao, W.; Wu, X.; Zhang, X. Memcap: Memorizing style knowledge for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12984–12992. [Google Scholar]
  9. Li, H.; Li, X.; Wang, W. Research on Image Caption of Children’s Image Based on Attention Mechanism. In Proceedings of the 2021 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 28–30 June 2021; pp. 182–186. [Google Scholar]
  10. Luo, R.C.; Hsu, Y.T.; Ye, H.J. Multi-modal human-aware image caption system for intelligent service robotics applications. In Proceedings of the 2019 IEEE 28th International Symposium on Industrial Electronics (ISIE), Vancouver, BC, Canada, 12–14 June 2019; pp. 1180–1185. [Google Scholar]
  11. Luo, R.C.; Hsu, Y.T.; Wen, Y.C.; Ye, H.J. Visual image caption generation for service robotics and industrial applications. In Proceedings of the 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS), Taipei, Taiwan, 6–9 May 2019; pp. 827–832. [Google Scholar]
  12. Tewel, Y.; Shalev, Y.; Schwartz, I.; Wolf, L. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 17918–17928. [Google Scholar]
  13. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  14. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR, Virtual Event, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  15. Rohrbach, A.; Hendricks, L.A.; Burns, K.; Darrell, T.; Saenko, K. Object hallucination in image captioning. arXiv 2018, arXiv:1809.02156. [Google Scholar]
  16. Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From show to tell: A survey on deep learning-based image captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 539–559. [Google Scholar] [CrossRef] [PubMed]
  17. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3128–3137. [Google Scholar]
  18. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
  19. Jiang, W.; Ma, L.; Jiang, Y.G.; Liu, W.; Zhang, T. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 499–515. [Google Scholar]
  20. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Utah, UT, USA, 19–21 June 2018; pp. 6077–6086. [Google Scholar]
  21. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  22. Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
  23. Shi, Z.; Zhou, X.; Qiu, X.; Zhu, X. Improving Image Captioning with Better Use of Caption. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7454–7464. [Google Scholar]
  24. Yang, X.; Zhang, H.; Cai, J. Learning to collocate neural modules for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4250–4260. [Google Scholar]
  25. Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8928–8937. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26 April–1 May 2020. [Google Scholar]
  27. Mokady, R.; Hertz, A.; Bermano, A.H. Clipcap: Clip prefix for image captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
  28. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3156–3164. [Google Scholar]
  29. Ge, H.; Yan, Z.; Zhang, K.; Zhao, M.; Sun, L. Exploring overall contextual information for image captioning in human-like cognitive style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1754–1763. [Google Scholar]
  30. Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
  31. Zhu, X.; Wang, W.; Guo, L.; Liu, J. AutoCaption: Image captioning with neural architecture search. arXiv 2020, arXiv:2012.09742. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.u.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1655–1663. [Google Scholar]
  34. Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 10578–10587. [Google Scholar]
  35. Tu, H.; Yang, B.; Zhao, X. Zerogen: Zero-shot multimodal controllable text generation with multiple oracles. In Proceedings of the CCF International Conference on Natural Language Processing and Chinese Computing, Foshan, China, 14–15 October 2023; pp. 494–506. [Google Scholar]
  36. Li, W.; Zhu, L.; Wen, L.; Yang, Y. DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 114–514. [Google Scholar]
  37. Ma, Y.; Ji, J.; Sun, X.; Zhou, Y.; Hong, X.; Wu, Y.; Ji, R. Image Captioning via Dynamic Path Customization. arXiv 2024, arXiv:2406.00334. [Google Scholar] [CrossRef]
  38. Guo, S.; Wang, Y.; Ye, J.; Zhang, A.; Xu, K. Semantic Importance-Aware Communications with Semantic Correction Using Large Language Models. arXiv 2024, arXiv:2405.16011. [Google Scholar]
  39. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. Available online: https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf (accessed on 14 February 2019).
  40. Xiao, Y.; Wu, L.; Guo, J.; Li, J.; Zhang, M.; Qin, T.; Liu, T.y. A survey on non-autoregressive generation for neural machine translation and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11407–11427. [Google Scholar] [CrossRef] [PubMed]
  41. LeDoux, J.E. Brain mechanisms of emotion and emotional learning. Curr. Opin. Neurobiol. 1992, 2, 191–197. [Google Scholar] [CrossRef] [PubMed]
  42. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 June 2002; pp. 311–318. [Google Scholar]
  43. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  44. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  45. Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. arXiv 2021, arXiv:2104.08718. [Google Scholar]
  46. Su, Y.; Lan, T.; Liu, Y.; Liu, F.; Yogatama, D.; Wang, Y.; Kong, L.; Collier, N. Language Models Can See: Plugging Visual Controls in Text Generation. arXiv 2022, arXiv:2205.02655. [Google Scholar]
  47. Nguyen, V.Q.; Suganuma, M.; Okatani, T. Grit: Faster and better image captioning transformer using dual visual features. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXXVI; pp. 167–184. [Google Scholar]
  48. Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling Up Vision-Language Pretraining for Image Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 17959–17968. [Google Scholar]
  49. Gan, C.; Gan, Z.; He, X.; Gao, J.; Deng, L. Stylenet: Generating attractive visual captions with styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 3137–3146. [Google Scholar]
  50. Guo, L.; Liu, J.; Yao, P.; Li, J.; Lu, H. Mscap: Multi-style image captioning with unpaired stylized text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, BC, USA, 16–20 June 2019; pp. 4204–4213. [Google Scholar]
Figure 1. Overview of the zero-shot image captioning framework and a schematic diagram of our improved framework. (a) The baseline zero-shot image captioning framework. (b) Our improved zero-shot image captioning framework; specific details are provided in Section 4.2. “TSA” stands for Text Sentiment Analysis.
Figure 2. An example of the hallucination problem. The “cigarettes” do not appear in this image; however, this word is strongly related to a negative emotion.
Figure 3. An example in which the framework is limited to a specified length and is therefore forced to awkwardly supplement content at the end.
Figure 4. The comparison effect before and after improvement is displayed. For the same image, the generated results show a noticeable improvement after receiving the “positive” emotional signal, reducing the hallucination issue.
Figure 5. Qualitative results of our method.
Figure 6. Qualitative result of consistency in generating text, image, and reference text.
Figure 7. Results on alleviating the “improper ending” problem and on generation speed. (a) Matching degree based on CLIPScore: our method maintains stability during the generation process and alleviates the “improper ending” issue. (b) Relationship between generation time cost and the number of candidate words k: our method significantly reduces the time cost compared to ZeroCap.
Table 1. Experimental parameter settings.
Parameter | Value
k | 600
α | 0.001
β | 15
γ | 1
temperature | 0.2
Gaussian kernel size 1 | (11, 11)
Deviation 1 | 5
Gaussian kernel size 2 | (3, 3)
Deviation 2 | 0.5
PyTorch seed | 42
Table 2. Performance compared with other methods on the MSCOCO and SentiCap datasets. B-n, M, C, R, C-S, and A are short for BLEU-n, METEOR, CIDEr, RefCLIPScore, CLIPScore, and sentiment classification accuracy (%), respectively. Type labels whether the method is supervised (Sup) or unsupervised (Unsup). Note that CLIPScore is an unsupervised metric (*), so it is more important for unsupervised methods like ours. For each metric in the table, the best result is highlighted in bold.
Columns: MSCOCO (B-4, M, C, R, C-S *); SentiCap Positive (B-3, M, C-S *, A); SentiCap Negative (B-3, M, C-S *, A). “/” denotes a result that is not available.
Method | Type | B-4 | M | C | R | C-S * | B-3 | M | C-S * | A | B-3 | M | C-S * | A
ClipCap [27] | Sup | 32.15 | 27.1 | 108.35 | 0.81 | 0.77 | / | / | / | / | / | / | / | /
MAGIC [46] | Sup | 12.90 | 17.22 | 48.33 | 0.74 | 0.74 | / | / | / | / | / | / | / | /
GRIT [47] | Sup | 42.4 | 30.6 | 144.2 | 0.77 | 0.82 | / | / | / | / | / | / | / | /
LEMON [48] | Sup | 42.6 | 31.4 | 145.5 | / | / | / | / | / | / | / | / | / | /
StyleNet [49] | Sup | / | / | / | / | / | 12.1 | 12.1 | / | 45.2 | 10.6 | 10.9 | / | 56.6
MSCap [50] | Sup | / | / | / | / | / | 16.2 | 16.8 | / | 92.5 | 15.4 | 16.2 | / | 93.4
MemCap [8] | Sup | / | / | / | / | / | 17.0 | 16.6 | / | 96.1 | 18.1 | 15.7 | / | 98.9
ZeroCap [12] | Unsup | 2.60 | 11.50 | 14.60 | 0.79 | 0.87 | / | / | / | / | / | / | / | /
Ours | Unsup | 1.14 | 8.80 | 12.0 | 0.83 | 0.91 | 1.87 | 6.89 | 0.95 | 96.8 | 1.69 | 3.30 | 0.88 | 98.9