1. Introduction
Image captioning models aim to automatically describe the visual content of a given image with coherent and accurate textual descriptions. The task is a standard example of multi-modal learning, bridging the domains of Computer Vision (CV) and Natural Language Processing (NLP). Image captioning models have utility across diverse domains, with applications including assistance to individuals with visual impairments [1,2], automatic medical image captioning [3] and diagnosis [4], and enhancing human–computer interactions [5]. Motivated by the achievements of deep learning techniques in machine translation [6], the majority of image captioning models adopt the encoder–decoder framework coupled with a visual attention mechanism [7,8]. The encoder transforms input images into fixed-length vector features, while the decoder decodes the image features into descriptions, progressing word by word [9,10,11,12,13].
In the past few years, researchers have adopted a pre-trained Convolutional Neural Network (CNN) as an encoder for extracting high-level features from the input image, with a Recurrent Neural Network (RNN) serving as the decoder [9,10]. Anderson et al. [11] introduced the use of the Faster R-CNN object detector for extracting features at the regional level. Owing to its substantial advantages, this approach became widely adopted in subsequent works. However, regional-level features and the object-detector encoder still have shortcomings. Regional-level features may not capture the specific and subtle elements that contribute to a more comprehensive understanding of the image content [14]. Additionally, the encoder treats the image as a sequence of visual features and does not preserve the spatial semantic information of the image. This can result in inaccurate or ambiguous captions, especially when objects in the image have spatial semantic relationships, as noted in [11,15].
Recently, the dominant approach in image captioning has been the use of Long Short-Term Memory (LSTM) [16] decoders with a soft attention mechanism [10]. However, drawbacks in training efficiency for handling long-term dependencies, together with issues inherent to the sequential processing of LSTMs, constrain the effectiveness of such models. Motivated by the achievements of the multihead self-attention mechanism and the Transformer architecture [17] in NLP tasks, numerous researchers have started integrating multihead self-attention into the LSTM decoder [12,13] or directly employing the Transformer architecture as the decoder [14,18,19] in image captioning models.
Notably, the Transformer architecture has gradually shown extraordinary potential in CV tasks and multi-modal tasks [14,20,21,22]. Researchers have proposed various methods that provide new choices for encoding images into feature vectors. Nevertheless, these methods neglect the semantic information of the image content in the encoder Transformer modules and focus only on visual features extracted by CNNs and object detectors. Acknowledging the constraints associated with semantic image representation, we employ a Transformer-based image captioning model and incorporate an external semantic knowledge representation of image objects in the encoder Transformer module, aiming to capture meaningful relationships between image objects and thereby improve the caption generation process. In the encoder, we adopt Faster R-CNN as an image object detector to extract the visual features of objects within the image, along with the class labels of the detected objects. We then generate object semantic word embedding representations, similar to [15], from the class labels by using an external knowledge base. Both the object visual features and the object semantic word embedding representations serve as input to the encoder Transformer module, allowing it to focus attention on relevant regions when generating image captions. In contrast to [15], in the decoder we directly adopt the Transformer decoder from machine translation [17] to generate captions. This design enhances image captioning performance by enabling parallel processing of information, which is more efficient for sequence-to-sequence tasks than LSTM models. It also empowers the model to make more informed and contextually relevant decisions when generating descriptive text for the image content, by combining the encoder's context vector with the encoding representation of the current word to produce the output text [20,23].
We validate our model on the MS-COCO [24] offline “Karpathy” test split, which demonstrates the effectiveness of our proposed model. We also use the private novel MACE [25] dataset to assess model generalization. A comprehensive set of experiments, together with quantitative and qualitative analyses, provides insights into the effectiveness of semantic attention image captioning models in visual captioning tasks.
Our main contributions are summarized as follows:
We create a Transformer-based image captioning model that integrates the external semantic knowledge representation of image objects into the encoder Transformer. This incorporation enhances the encoder and decoder Transformers’ capability to focus their attention on relevant regions and capture the meaningful relationships between image objects throughout the image captioning generation process.
We conduct a linguistic social word analysis for the generated captions, offering valuable insights into the efficacy of using the proposed model in vision and language tasks.
We extend the applicability of the proposed model and generate descriptions for the MACE visual captioning dataset. This new archival dataset contains significant historical videos and scenes.
The remainder of this paper is organized as follows: Section 2 presents background and related work. Section 3 describes the model architecture. This is followed by the experiments and results in Section 4. Section 5 provides a discussion of the achieved outcomes. Model generalization is presented in Section 6. The paper's conclusions and future work are provided in Section 7.
2. Background and Related Works
In the past few years, motivated by the achievements of encoder–decoder frameworks in machine translation [6], a diverse range of approaches adopting the encoder–decoder model for image captioning have emerged, achieving significant success. Conventional encoder–decoder models [9,26] employ a CNN as the encoder and an LSTM as the decoder, incorporating sequence-to-sequence connections. Subsequently, there have been numerous efforts to advance the encoder–decoder paradigm. Anderson et al. [11] introduced a bottom-up mechanism for encoding with LSTM decoding, enabling attention to be computed at the level of visual objects rather than over a uniform grid of CNN features [10,27]. Zhang et al. [28] introduced a visual relationship attention mechanism employing contextualized embeddings for visual objects. In the decoding phase, Xu et al. [10] utilized LSTM to decode the convolutional features of an image, employing both hard and soft attention mechanisms to effectively highlight crucial regions. Lu et al. [27] proposed incorporating a visual sentinel into the encoder–decoder framework to automatically regulate adaptive attention. Zhong et al. [29] suggested employing adaptive spatial information attention (ASIA) to improve the utilization of feature information within images, enhancing the LSTM's ability to grasp the spatial details of significant objects or entire images from both global and local viewpoints.
In addition to utilizing visual features, techniques that leverage semantic information have been shown to significantly enhance caption accuracy. This additional semantic data can originate either from the entire image [30,31] or from specific visual elements within the image [11,32]. To maximize the utilization of object semantic details, Yao et al. [33] introduced Long Short-Term Memory with Attributes (LSTM-A), which incorporates attributes and visual features as inputs to the LSTM, thus merging attributes into the effective CNN-plus-LSTM image captioning framework. Li et al. [34] proposed a visual–semantic LSTM model that incorporates an attention mechanism to focus on visual semantic information. Furthermore, certain methods employing Graph Convolutional Networks (GCNs) introduce semantic object relationships into the encoder–decoder architecture, enhancing semantic information utilization. Yao et al. [35] suggested using GCNs to incorporate semantic and spatial object relationships into the encoder. Taking a different approach to integrating semantic information, Hafeth et al. [15] proposed involving external semantic knowledge-base representations of image object labels to enrich visual attention in image encoders. Yang et al. [36] introduced the Scene Graph Auto-Encoder (SGAE), which leverages semantic information to construct a dictionary, providing essential linguistic knowledge to guide the encoder–decoder process. Alternatively, rather than combining integrated semantic and visual information, Guo et al. [37] proposed Visual Semantic Units Alignment (VSUA) to fully exploit the alignment between word embeddings and integrated visual semantic units for image captioning.
Traditional encoder–decoder frameworks, characterized by recursive dependencies, encounter challenges in parallelization during training, resulting in diminished algorithmic efficiency. Consequently, the Transformer model [17], which naturally accommodates the encoder–decoder paradigm and supports parallel training, emerged as a solution for image captioning tasks. Sharma et al. [38] suggested integrating the Transformer model into image captioning, demonstrating its efficacy. Additionally, the Transformer leverages spatial relationships extensively to enhance captioning accuracy. Herdade et al. [39] proposed an object relation Transformer that explicitly incorporates spatial relationships among detected objects using geometric attention in the encoder phase. He et al. [8] introduced a model based on an image-Transformer encoder, aiming to enhance multihead attention by adding relative spatial graph Transformer layers among image regions, using only region visual features as input. Huang et al. [12] proposed AoANet, introducing an additional attention mechanism that gates the information flow, thereby enhancing the model's ability to focus on relevant information. Taking a different approach to encoder attention, Cornia et al. [18] utilized attention mechanisms to integrate outputs from multiple encoder layers. To maximize semantic information utilization in the Transformer, Li et al. [40] introduced EnTangled Attention (ETA), enabling simultaneous exploitation of semantic and visual information in the decoder. Zhang et al. [20] introduced the Multi-Feature Fusion-enhanced Transformer, a new approach to image captioning whose model aims to boost Transformer performance in both the encoder and decoder stages. By incorporating multi-feature fusion mechanisms, the model aligns specific visual and semantic features while also improving word organization, contributing to more detailed and accurate descriptions. Luo et al. [23] introduced the SCD-Net model, which enhances the synchronization of visual content and text across three stacked Transformers: a visual encoder, a semantic Transformer, and a sentence decoder. Their objective is to produce captions that are both coherent and semantically rich.
Based on the above review, it is apparent that few techniques fully leverage image semantic representation within Transformer-based image captioning methods. In addition, the Transformer architecture in Natural Language Processing demonstrates the ability to capture complex semantic connections. Inspired by these observations, we propose a new Transformer-based model specifically designed for image captioning. The proposed model employs a Transformer network for both the encoder and decoder architecture and integrates a semantic encoder Transformer to enhance semantic understanding and generate detailed captioning output.
3. Model Architecture
In this section, we provide background information on the Transformer model, which serves as the foundation for our work (Section 3.1). Subsequently, we present an illustration of the semantic knowledge graph used (Section 3.2). Lastly, we explain the comprehensive architecture of our proposed model in detail (Section 3.3).
3.1. Transformer Model for Image Captioning
We employ the Transformer model for image captioning, comprising an encoder and a decoder (Figure 1). The encoder maps the input image representation $x = (x_1, \ldots, x_n)$ to a sequence of continuous representations $z$. Given $z$, the decoder generates the output sequence $y = (y_1, \ldots, y_m)$. Here, $x$ represents the image visual features extracted from the input image, and $n$ denotes the number of features. The features we utilized are known as bottom-up features, derived from the bottom-up attention model introduced by Anderson et al. [11]. $z$ represents the output vector of the Transformer encoder, with a dimension of $t$, and $y$ corresponds to the output sentence generated by the Transformer decoder, with a length of $m$. Unlike other image captioning models, the Transformer employs stacked self-attention and point-wise fully connected layers instead of recurrent layers for both the encoder and decoder. The Transformer model employed in this paper is based on [17], with the model's input replaced by features extracted from images.
Generally, the Transformer employs scaled dot-product attention to focus on relevant parts of the input sequence when generating the output, providing a way to capture dependencies and relationships within the data. This involves calculating the dot product between the query and key vectors, scaling it, applying a softmax to obtain attention scores, and then using these scores to weigh the corresponding values for each element in the input sequence [42]. The computational procedure can be written as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$$

In the given context, the attention inputs comprise the queries matrix $Q$, keys matrix $K$, and values matrix $V$, all derived from the input sequence, with dimensions $d_k$, $d_k$, and $d_v$, respectively. Because substantial values of $d_k$ would push the softmax function into regions with extremely small gradients, a normalization factor of $1/\sqrt{d_k}$ is employed. In practice, dot-product attention proves to be faster and more space-efficient, as it can be implemented through parallel optimization [17].
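As an illustration, the following is a minimal NumPy sketch of Equation (1); the function name and the toy shapes are ours and do not come from the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # convex combination of the values

# Toy self-attention example: 4 positions with 8-dimensional features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
```

Because each attention row is a convex combination, every output element stays within the range of the corresponding value column.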
Furthermore, multihead attention is constructed on the foundation of scaled dot-product attention [42]. It is able to acquire diverse representation subspaces at various positions. It consists of $h$ identical attention heads, where each head functions as a scaled dot-product attention, independently applying the attention mechanism to queries, keys, and values. The outputs from the $h$ attention heads are then concatenated and projected back to the original dimension, yielding the final values (Equations (2) and (3)):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \quad (2)$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (3)$$

where $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{hd_v \times d_{model}}$ are trainable projection matrices. To minimize overall computational expense, the approach outlined in [17] projects the initial dimension $d_{model}$ onto $d_k = d_v = d_{model}/h$, where $h$ is set to 8.
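Equations (2) and (3) can be sketched as follows; the random parameter matrices and the small model dimension are illustrative assumptions, not values from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(Q, K, V, params, h=8):
    """h independent scaled dot-product heads, concatenated and projected by W_O."""
    d_model = Q.shape[-1]
    d_k = d_model // h                          # per-head dimension d_model / h
    heads = []
    for Wq, Wk, Wv in zip(params["Wq"], params["Wk"], params["Wv"]):
        q, k, v = Q @ Wq, K @ Wk, V @ Wv        # project into this head's subspace
        weights = softmax(q @ k.T / np.sqrt(d_k))
        heads.append(weights @ v)               # Equation (3)
    return np.concatenate(heads, axis=-1) @ params["Wo"]  # Equation (2)

# Hypothetical random parameters, used only to exercise the shapes.
rng = np.random.default_rng(0)
h, d_model = 8, 64
params = {
    "Wq": [rng.normal(size=(d_model, d_model // h)) for _ in range(h)],
    "Wk": [rng.normal(size=(d_model, d_model // h)) for _ in range(h)],
    "Wv": [rng.normal(size=(d_model, d_model // h)) for _ in range(h)],
    "Wo": rng.normal(size=(d_model, d_model)),
}
x = rng.normal(size=(10, d_model))              # 10 input positions
out = multihead_attention(x, x, x, params, h=h)
```

The output has the same shape as the input sequence, so layers of this form can be stacked.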
The feed-forward network serves as another fundamental component, comprising a two-layer fully connected network with a ReLU activation function, which enhances the network's nonlinear capabilities [42], as specified in Equation (4):

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \quad (4)$$

where $x$ is the output of the previous sub-layer.
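A sketch of the position-wise network in Equation (4); the dimensions 512 and 2048 are the defaults of the base Transformer in [17], used here only for illustration:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer position-wise network: FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU between the two layers

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                 # base-Transformer dimensions
x = rng.normal(size=(10, d_model))        # 10 positions from the previous sub-layer
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
out = feed_forward(x, W1, np.zeros(d_ff), W2, np.zeros(d_model))
```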
The encoder consists of N identical layers, each containing two sub-layers. The first sub-layer is a multihead self-attention mechanism, while the second is a fully connected feed-forward network. Both sub-layers are accompanied by a residual connection [43] and a normalization layer. The residual connection improves the flow of information and gradients, enabling more effective training, preserving important features, and improving the overall performance of the Transformer model.
The decoder, like the encoder, consists of a stack of N identical layers. Each decoder layer contains three sub-layers. In addition to the two sub-layers found in the encoder, the decoder introduces a third sub-layer that performs multihead attention over the output of the encoder stack. As in the encoder, residual connections followed by normalization layers are applied around these sub-layers. The masked multihead attention sub-layer ensures that predictions for position i rely solely on the known outputs preceding position i, achieved through a mask operation. This is necessary because, during training, the Transformer generates the word at position i using the ground-truth words, whereas, during testing, it generates the word at position i based on the previously generated words. This is depicted in Figure 1.
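The mask operation described above can be sketched as follows; this is an illustrative NumPy fragment, not the paper's code:

```python
import numpy as np

def causal_mask(m):
    """Lower-triangular boolean mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((m, m), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over keys, with disallowed (future) positions forced to ~zero weight."""
    scores = np.where(mask, scores, -1e9)  # a large negative logit blocks future tokens
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With uniform (zero) logits over 4 positions, the first position attends only
# to itself, while the last position attends equally to all four.
weights = masked_softmax(np.zeros((4, 4)), causal_mask(4))
```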
To apply the Transformer model to the image captioning task, we take the pre-trained bottom-up attention features [11] as the representation of the input image. These visual features are extracted using the bottom-up attention model, which identifies salient objects or regions within an image.
3.2. Leveraging Knowledge Graphs
The encoder Transformer model traditionally relies on visual embedding vectors as input. Typically, these visual embedding vectors, associated with individual objects in an image, are derived exclusively from the objects themselves, utilizing only their basic information.
In our work, we adopt an attention Transformer architecture comprising 6 blocks, as outlined by Vaswani et al. [17], to more effectively encode input images. As proposed by Hafeth et al. [15], the attention mechanism is enriched by external semantic knowledge bases (KBs), such as ConceptNet5 [41], which provide semantic object word representations.
The integration of KBs offers access to a wealth of semantic knowledge, resulting in enhanced caption quality and accuracy. This integration allows for the visual and semantic features extracted from the visual inputs to be mapped into a common space, facilitating meaningful comparisons and combinations. In essence, supplementing the visual content with additional semantic knowledge and context leads to the generation of more coherent and meaningful captions.
To achieve this, we extract ConceptNet word embeddings by harnessing the ConceptNet knowledge graph [41]. These word embeddings encapsulate comprehensive information about the meanings and relationships of words in a compact vector format. Each word or concept in ConceptNet is assigned a high-dimensional vector representation, with similar words having closely positioned vectors, signifying their semantic similarity. These embeddings capture various aspects of word meaning, encompassing synonyms, antonyms, hypernyms, and hyponyms. For instance, the vectors representing “dog” and “cat” are positioned closer to each other than those representing “dog” and “car”, reflecting the greater semantic similarity between dogs and cats. This approach allows us to incorporate not only the information of the object itself but also the information of its neighboring objects.
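The “dog”/“cat”/“car” example can be illustrated with cosine similarity; the 4-dimensional vectors below are invented stand-ins for the real 300-dimensional ConceptNet embeddings, not actual ConceptNet values:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional embeddings (NOT actual ConceptNet data):
# "dog" and "cat" share most of their mass in the first two coordinates,
# while "car" lives mostly in the last two.
emb = {
    "dog": np.array([0.9, 0.8, 0.1, 0.0]),
    "cat": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}
dog_cat = cosine_similarity(emb["dog"], emb["cat"])
dog_car = cosine_similarity(emb["dog"], emb["car"])
```

Under this construction, "dog" is markedly closer to "cat" than to "car", mirroring the semantic-neighborhood property the encoder exploits.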
3.3. Transformer with Semantic-Based Model for Image Captioning
The architecture of the proposed image captioning model is illustrated in Figure 1 and outlined in Algorithm 1. The training dataset has two input modalities: an input image and the caption(s) for that image. In the remainder of this section, we explain the process used to extract the semantic features representing the image objects.
Algorithm 1: Caption Generating Procedure
The proposed model has a dual-stream encoder to encode visual information and a single-stream decoder to decode the input image caption. The encoder uses a popular object detection architecture, the Faster Region-based Convolutional Neural Network (Faster R-CNN), which utilizes a deep Residual Network (ResNet) [43] as a convolutional backbone to extract both a visual feature map and a class label for each detected object in the input image. The object detector network has been pre-trained on the ImageNet [44] and Visual Genome [45] datasets. The combination of Faster R-CNN and ResNet has demonstrated outstanding performance in object detection tasks, achieving state-of-the-art results [11].

Given an input image I, Faster R-CNN extracts feature vectors for the N detected objects. These visual features form one stream of the encoder and are used as part of the input sequence for the attention-based Transformer. They are projected to the model dimension by a feed-forward layer, followed by a stack of six Transformer layers. Each layer consists of a self-attention layer and a feed-forward layer with residual connections and layer normalization, as explained in Section 3.1. The visual attended vector is thus the output of each individual Transformer attention layer.
In the other stream, for the input image I, Faster R-CNN predicts class labels for the detected objects as a list of words. These are transformed into word embedding vectors using ConceptNet embeddings [41]. These word embedding vectors are dense numerical vectors in a continuous 300-dimensional space. The resulting word embeddings encode information about the meanings and relationships of the objects' semantic words in a dense vector format, as explained in Section 3.2.

To enhance the Transformer layer, a fusion strategy is devised to integrate the two representations of the input image: the visual attended representation and the semantic word representation. In the channel-connect strategy, these two feature representations are concatenated and then reduced to the model hidden dimension with a linear matrix, as shown in Figure 1.
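The channel-connect fusion can be sketched as below; the region count (36), visual dimension (512), and hidden dimension are illustrative assumptions on our part, with only the 300-dimensional semantic embedding size taken from the text:

```python
import numpy as np

def fuse(visual, semantic, W, b):
    """Channel-connect fusion: concatenate the visual-attended and semantic
    word representations per object, then project to the model hidden dimension."""
    fused = np.concatenate([visual, semantic], axis=-1)  # (N, d_v + d_s)
    return fused @ W + b                                 # (N, d_hidden)

rng = np.random.default_rng(0)
N, d_v, d_s, d_hidden = 36, 512, 300, 512   # illustrative sizes
visual = rng.normal(size=(N, d_v))          # Transformer-attended region features
semantic = rng.normal(size=(N, d_s))        # ConceptNet label embeddings (300-d)
W = rng.normal(size=(d_v + d_s, d_hidden)) * 0.01  # trainable linear reduction
out = fuse(visual, semantic, W, np.zeros(d_hidden))
```

After fusion, each object token carries both fine-grained visual detail and label-level semantic context in a single vector of the hidden dimension.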
By fusing these modalities, we can leverage the complementary information they provide about the image content. Visual features offer fine-grained details about the visual content, while semantic word representations provide higher-level understanding and contextual information. As a consequence, this leads to the generation of captions that are more accurate, contextually relevant, and enriched with both visual and semantic details.
5. Discussion
In this work, we present a new image-Transformer-based model boosted with image object semantic representations, extending the semantic Transformer model proposed in [15]. The core idea behind the proposed architecture is to enhance the attention mechanism of the original Transformer layers, specifically for image captioning. In the encoder, we augment the Transformer layer with semantic representations of image object labels to capture the spatial relationships between these objects. We conducted extensive experiments to demonstrate the superiority of our model, presenting both qualitative and quantitative analyses to validate the proposed encoder and decoder Transformer layers. Compared to previous top models in image captioning, our model achieves a high CIDEr score, indicating that it can generate captions that are not only accurate but also diverse, coherent, and contextually relevant. This improvement is attributed to the utilization of external commonsense knowledge.
In evaluating the impact of different CNN models on caption quality, the experimental results demonstrate that captions generated by a ResNet-101 encoder consistently outperform those from ResNet-18 and ResNet-50 encoders in all tested scenarios, validating our original hypothesis. The superior performance can be attributed to the residual connections in ResNet, which enable the creation of a deeper model. Additionally, the ResNet-101 model excels at preserving significant visual information, resulting in better feature extraction for image captioning by learning more abstract and distinctive visual features. This is particularly advantageous for generating accurate and descriptive captions for complex images, where identifying and describing subtle visual details is essential. However, the choice of a Faster R-CNN backbone for feature extraction depends on the specific task and available resources: more complex backbones such as ResNet-101 or ResNet-50 may yield better performance but also require additional computational resources and longer training times. In addition, fine-tuning visual features using CNNs such as ResNet improves the relevance and quality of the generated captions, as evidenced by higher BLEU@4 scores across the various encoder models.
Furthermore, we have observed that increasing the number of Transformer heads in the model enhances accuracy across various evaluation metrics. However, this comes at the cost of increased training time: each additional head introduces extra parameters that must be optimized, extending the training process. Moreover, during the inference phase, generating captions with models featuring a high number of attention heads can result in slower performance, which may pose a notable drawback in real-time applications.
In summary, integrating visual semantic features significantly enhances the performance of Transformer-based image captioning models, enabling them to generate captions that faithfully represent the visual content of the images.
6. Generalization
To demonstrate the broad applicability of the proposed semantic Transformer model, we conducted experiments on the MACE dataset [25]. This dataset comprises images from a visual historical archive that are not included in ImageNet; it is generated from archival video data.
In particular, generating content captions for such historical archival data is an open problem with various challenges: (i) the lack of truly large-scale datasets; (ii) some old video sounds/scenes are unclear or become damaged when converted and played on modern devices; and (iii) the data contain outdated objects and scenes and also carry cultural and historical context.
Table 5 reports the evaluation results for captions generated using ResNet-101, ResNet-50, and ResNet-18, which were used to extract feature vectors from the frames of each video. These vectors were then passed through the encoder semantic Transformer and decoder Transformer modules.
The results from ResNet-50 are higher than those from ResNet-18 and ResNet-101 in most evaluation metrics. The likely reason is that MACE is a small dataset that might not provide enough diverse examples to leverage the additional capacity of ResNet-101. Additionally, the deeper and more complex nature of ResNet-101 in a small-dataset context raises the risk of overfitting, potentially capturing noise instead of generalizing well to unseen data.
7. Conclusions and Future Work
In this work, we introduced a new Transformer-based model for image captioning. Our approach incorporates semantic representations of image objects to capture spatial relationships between objects, aiming to enhance attention mechanisms for image captioning. Extensive experiments on the MS-COCO dataset confirm that the proposed model achieves an impressive CIDEr score of 132.0, indicating that it generates accurate, diverse, coherent, and contextually relevant captions through the use of external commonsense knowledge. A ResNet-101 encoder consistently outperforms ResNet-18 and ResNet-50 encoders in caption quality, attributable to its residual connections and better feature extraction. Additionally, fine-tuning with ResNet enhances BLEU@4 scores, thereby improving caption quality. Moreover, increasing the number of Transformer attention heads improves image captioning outcomes. Nevertheless, this heightened accuracy comes at the cost of extended training time, which can negatively affect real-time applications.
The study also applies the model to the MACE dataset to generate descriptive sentences for video frames, improving the accessibility and understanding of historical artifacts. In summary, integrating visual semantic features enhances image captioning model performance and provides reliable representations of visual content.
Future work will also examine the use of new models that have been successfully applied in other domains. These include (a) the PF-BiGRU-TSAM model, which has been used for interactive remaining useful life prediction of lithium-ion batteries [57]; this model uses data-driven deep learning methods and time windows for prediction tasks over time; and (b) the neural-network-based lifetime extension approach built on a Leven–Marq neural network and power routing [58]; this model uses the Levenberg–Marquardt algorithm to optimize a backpropagation neural network for real-time prediction in an industrial system.