Open Access Article
Zero-Shot Image Caption Inference System Based on Pretrained Models
by Xiaochen Zhang, Jiayi Shen, Yuyan Wang, Jiacong Xiao and Jin Li *
The School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(19), 3854; https://doi.org/10.3390/electronics13193854
Submission received: 26 August 2024 / Revised: 20 September 2024 / Accepted: 27 September 2024 / Published: 28 September 2024
Abstract
Recently, zero-shot image captioning (ZSIC) has gained significant attention for its potential to describe unseen objects in images, which matters for real-world applications such as human–computer interaction, intelligent education, and service robots. However, ZSIC methods built on large-scale pretrained models may generate descriptions containing objects that are not present in the image, a phenomenon termed "object hallucination". This occurs because large-scale models tend to predict words or phrases that appeared frequently during training. In addition, these methods impose a fixed limit on description length, which often produces improperly ended sentences. In this paper, a novel approach is proposed to reduce both the object hallucination and the improper ending problems in the ZSIC task. We introduce an additional emotion signal to guide sentence generation, and we find that a suitable emotion filters out words describing objects that do not appear in the image. Moreover, we propose a novel strategy that gradually extends the number of words in a sentence to ensure that the generated sentence is properly completed. Experimental results show that the proposed method achieves leading performance on unsupervised metrics. More importantly, qualitative examples illustrate the effectiveness of our method in reducing hallucination and generating properly ended sentences.
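The gradual length-extension idea described in the abstract can be sketched as a decoding loop: instead of truncating the caption at a fixed word budget, the budget is enlarged step by step until the model terminates the sentence on its own. The sketch below is only an illustration of that control flow, not the paper's implementation; `generate_caption`, the default budgets, and the punctuation-based stopping test are all hypothetical stand-ins (in the actual system the generator would be a large pretrained model guided by image–text similarity).

```python
def generate_caption(image, max_words):
    # Hypothetical stand-in for a pretrained caption generator: returns a
    # candidate caption of at most `max_words` words for the given image.
    words = ["a", "dog", "runs", "on", "the", "grass"]
    n = min(max_words, len(words))
    caption = " ".join(words[:n])
    # The model may or may not finish the sentence within the budget.
    if n == len(words):
        caption += "."
    return caption

def ends_properly(caption):
    # Treat a caption as properly ended if it closes with sentence punctuation.
    return caption.endswith((".", "!", "?"))

def progressive_caption(image, start=5, step=2, limit=25):
    """Gradually extend the word budget until the generated sentence
    terminates on its own, instead of cutting it off at a fixed length."""
    budget = start
    caption = generate_caption(image, budget)
    while not ends_properly(caption) and budget < limit:
        budget += step
        caption = generate_caption(image, budget)
    return caption

print(progressive_caption(None))  # -> "a dog runs on the grass."
```

The `limit` parameter keeps the loop bounded when the generator never produces terminal punctuation, so the strategy degrades to ordinary fixed-length truncation in the worst case.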
Share and Cite
MDPI and ACS Style
Zhang, X.; Shen, J.; Wang, Y.; Xiao, J.; Li, J.
Zero-Shot Image Caption Inference System Based on Pretrained Models. Electronics 2024, 13, 3854.
https://doi.org/10.3390/electronics13193854
Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.