 
 
Article

Zero-Shot Image Caption Inference System Based on Pretrained Models

The School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3854; https://doi.org/10.3390/electronics13193854
Submission received: 26 August 2024 / Revised: 20 September 2024 / Accepted: 27 September 2024 / Published: 28 September 2024
(This article belongs to the Section Electronic Multimedia)

Abstract

Recently, zero-shot image captioning (ZSIC) has gained significant attention for its potential to describe unseen objects in images, which matters for real-world applications such as human–computer interaction, intelligent education, and service robots. However, ZSIC methods built on large-scale pretrained models may generate descriptions containing objects that are not present in the image, a phenomenon termed “object hallucination”. This occurs because large-scale models tend to predict words and phrases that appeared with high frequency during training. In addition, these methods impose a fixed limit on description length, which often produces improperly ended sentences. In this paper, a novel approach is proposed to reduce both object hallucination and improper endings in the ZSIC task. We introduce additional emotion signals to guide sentence generation, and we find that an appropriate emotion signal filters out words describing objects that do not appear in the image. Moreover, we propose a strategy that gradually extends the number of words in a sentence to ensure the generated sentence is properly completed. Experimental results show that the proposed method achieves leading performance on unsupervised metrics. More importantly, qualitative examples illustrate the effect of our method in reducing hallucination and generating properly ended sentences.
Keywords: zero-shot learning; image captioning; large pre-trained model; affective computing; cross-modal
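
To make the length-extension idea from the abstract concrete, the following is a minimal sketch, not the authors' implementation: a caption is grown one word at a time, and a sentence terminator is accepted only once a minimum length is reached, so decoding stops on a natural ending rather than at a hard cutoff. The scorers image_text_score and fluency_score are hypothetical stubs standing in for pretrained models (e.g., a CLIP-style image-text matcher and a language model), and the 0.7/0.3 weighting is an assumption for illustration.

import random

# Stub scorers: in a real system these would be pretrained models
# (e.g., a CLIP-style image-text similarity model and a language model).
# They are hypothetical placeholders so the control flow runs stand-alone.
def image_text_score(image, caption: str) -> float:
    return random.random()

def fluency_score(caption: str) -> float:
    return random.random()

EOS = "."  # sentence terminator token

def decode_caption(image, vocab: list[str], min_len: int = 5, max_len: int = 20) -> str:
    """Grow a caption one word at a time, accepting EOS only after min_len,
    so the sentence ends on its own terms rather than at a hard length cutoff."""
    words: list[str] = []
    for _ in range(max_len):
        # EOS becomes a candidate only once the caption is long enough.
        candidates = vocab + ([EOS] if len(words) >= min_len else [])
        # Score each one-word extension; visual grounding penalizes words
        # for objects absent from the image (the hallucination failure mode).
        scored = [(0.7 * image_text_score(image, " ".join(words + [w]))
                   + 0.3 * fluency_score(" ".join(words + [w])), w)
                  for w in candidates]
        best = max(scored, key=lambda t: t[0])[1]
        words.append(best)
        if best == EOS:
            break
    return " ".join(words)

if __name__ == "__main__":
    vocab = ["a", "dog", "runs", "on", "the", "grass", "happily"]
    print(decode_caption(image=None, vocab=vocab))

In the paper's setting, the emotion signal would additionally bias this scoring toward words consistent with the chosen emotion; that component is omitted from the sketch.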

Share and Cite

MDPI and ACS Style

Zhang, X.; Shen, J.; Wang, Y.; Xiao, J.; Li, J. Zero-Shot Image Caption Inference System Based on Pretrained Models. Electronics 2024, 13, 3854. https://doi.org/10.3390/electronics13193854

AMA Style

Zhang X, Shen J, Wang Y, Xiao J, Li J. Zero-Shot Image Caption Inference System Based on Pretrained Models. Electronics. 2024; 13(19):3854. https://doi.org/10.3390/electronics13193854

Chicago/Turabian Style

Zhang, Xiaochen, Jiayi Shen, Yuyan Wang, Jiacong Xiao, and Jin Li. 2024. "Zero-Shot Image Caption Inference System Based on Pretrained Models" Electronics 13, no. 19: 3854. https://doi.org/10.3390/electronics13193854

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.

