A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages

Alsayed, Ashwaq; Arif, Muhammad; Qadah, Thamir M.; Alotaibi, Saud

doi:10.3390/app131910894

Open AccessSystematic Review

A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages

¹

Computer Science Department, Umm Al-Qura University, Makkah 24230, Saudi Arabia

²

Information Systems Department, Umm Al-Qura University, Makkah 24230, Saudi Arabia

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(19), 10894; https://doi.org/10.3390/app131910894

Submission received: 6 September 2023 / Revised: 25 September 2023 / Accepted: 27 September 2023 / Published: 30 September 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

With the explosion of visual content on the Internet, creating captions for images has become a necessary task and an exciting topic for many researchers. Furthermore, image captioning is becoming increasingly important as the number of people utilizing social media platforms grows. While there is extensive research on English image captioning (EIC), studies focusing on image captioning in other languages, especially Arabic, are limited. There has also yet to be an attempt to survey Arabic image captioning (AIC) systematically. This research aims to systematically survey encoder-decoder EIC while considering the following aspects: visual model, language model, loss functions, datasets, evaluation metrics, model comparison, and adaptability to the Arabic language. A systematic review of the literature on EIC and AIC approaches published in the past nine years (2015–2023) from well-known databases (Google Scholar, ScienceDirect, IEEE Xplore) is undertaken. We have identified 52 primary English and Arabic studies relevant to our objectives (The number of articles on Arabic captioning is 11, and the rest are for the English language). The literature review shows that applying the English-specific models to the Arabic language is possible, with the use of a high-quality Arabic database and following the appropriate preprocessing. Moreover, we discuss some limitations and ideas to solve them as a future direction.

Keywords:

image captioning; Arabic image captioning; BLEU score; transformer; feature extraction

1. Introduction

Image captioning (IC) is the task of automatically generating a description of an image. The image captions are generated using both Natural Language Processing and Computer Vision.

In IC, the whole detail of each object and its associated relationship is extracted. Finally, the system can construct a natural sentence to explain the image automatically. The encoder-decoder architecture has proven effective on various sequence-to-sequence prediction problems. The sequence prediction models use the outcome from the previous time step as input for the current time step model. The input image is encoded into a feature vector by an encoder network, where a decoder network accepts the feature vector as input and produces the output caption. Some areas using encoder-decoder architecture include machine translation, caption generation, and text summarizing.

IC is critical because it assists visually impaired individuals. Figure 1 shows an example with a few images with possible captions. The captions are “A brown dog is sprayed with water”, “A boy on a scooter“, and “A young kid playing the goalie in a hockey rink” for the images in Figure 1a–c, respectively. Automatically generated captions are useful in many applications, including image categorization, image filtering, search engines, newspaper companies, and teaching children.

Much work has been done in this field. One of the early works based on neural networks appeared in 2014 and proposed multimodal neural language models [2]. Subsequently, many studies emerged using Recurrent Neural Network models, where the sequence of language can be modeled by the recurrence relations [3,4,5]. An attention mechanism is used along with the recurrence to identify the relationships between image regions, words, and eventually, tags [6,7]. The current transformer [8] in which the recurring relation is foregone to utilize self-attention offers exceptional prospects in set and sequence modeling performances [9,10]. Few studies have examined captioning images in other languages, especially morphologically complex languages such as Arabic. The first paper on Arabic image captioning appeared in 2017 [11]. Thereafter, three articles were published in 2018 [12,13,14]. In 2020, two articles were published [15,16], in 2021, two articles were published [17,18] and three were published in 2022 [19,20,21]. Furthermore, some image captioning systematic literature reviews are available despite not covering some critical models between 2015 and 2023 [22,23].

1.1. Human Captioning

Most people agree that our ability to see is the most crucial sense for perception. The human brain has been created to receive, transmit, and analyze visual information. In reality, 90% of the information delivered to the brain is visual, and it is interesting to note that the human brain can process visuals faster than words. So, data from a visual perspective are processed more effectively than data from any other type [24]. Information travels from the eye to the brain at an unfathomably fast rate. When we perceive an image, we instantly analyze it, give it meaning, and position it in a context. We can quickly connect new knowledge to information already stored in memory as it is delivered into the brain.

Specifically, the steps to understand and describe the image involve detecting the objects and focusing on the important objects. Further, humans try to understand objects’ relationships to each other and identify the object’s activity [25]. Object recognition when we see our surroundings is a key element in image understanding [26]. As shown in Figure 2, first, the objects of the image are recognized (e.g., dog, ball, grass, and trees). After that, the important elements (e.g., the dog and the ball) are focused on. The interaction between these objects is determined (e.g., the dog trying to catch the ball), and the activity of the object is recognized (e.g., the dog jumping). Finally, a caption is given based on these data.

1.2. Previous Surveys

Despite the impressive developments in image captioning in the last decade, automatic image captioning is still challenging. Some surveys have covered the techniques used in image captioning. We briefly give an overview of previous surveys in this section.

The review paper [27] thoroughly analyzes deep learning methods used in image captioning until 2018. The survey’s authors focused on reviewing deep learning models used for novel caption generation. The review covers various aspects, including learning types and architectures, feature mapping, and language models. Other techniques include novel object-based captioning, stylized captions, attention-based captioning, and semantic concept-based captioning. They also examine the datasets and evaluation metrics often utilized in deep-learning-based image captioning.

Stefanini et al. [28] review image captioning approaches comprehensively from 2015 to 2021. They focus on presenting different training strategies, comparing datasets, and evaluating metrics. Additionally, they quantitatively compare multiple state-of-the-art methods to determine the most significant technological advancements in training methods and architectures with discussions of the problems and challenges.

Xu et al. [29] review different image captioning methods in natural and medical domains from 2019 to 2022. They describe standard captioning techniques, either with or without reinforcement learning. By simulating the human processes of seeing, focusing, and telling, which are subsequently reflected in feature representation, visual encoding, and language development, respectively, they suggest a workflow for making image captions. Additionally, they present datasets, evaluation metrics, and loss functions with defining advantages and disadvantages of existing approaches. The paper [22] provides a brief overview of advancements in image captioning from 2016 to 2019 in a comprehensive systematic way. Different feature extractors have been investigated, such as GoogleNet (including all nine Inception models), AlexNet, ResNet, VGGNet, and DenseNet. Language models such as RNN, LSTM, CNN, GRU, and TPGN were also discussed. This comparison involves the evaluation metrics BLEU (1–4), CIDEr, and METEOR on MSCOCO and Flicker30k datasets.

The paper by Chohan et al. [23] comprises a systematic literature review (SLR) from 2017 to 2019. They discuss the techniques and models used for deep-learning-based image captioning. They find that the most important models are CNNs, used to understand images. RNN and LSTM are used to generate language, and the most used evaluation metrics are BLEU and the databases MS COCO, Flicker30k, and Flicker8k, of which MS COCO is the most used. Additionally, it is discovered that LSTM and CNN have outperformed RNN and CNN. Encoder-decoder and attention mechanisms were discovered to be the two most effective methods.

The work by Elhagry et al. [30] focuses on recent deep-learning-based techniques between 2018 and 2020, such as adversarial learning, deep reinforcement, and attention mechanisms. They review a few techniques and implementation specifics, including the Conditional GAN-Based model, Meta-Learning, UpDown, OSCAR, and VIVO. They found that the Conditional GAN-Based model achieves the highest score. The survey by Luo et al. [31] compares state-of-the-art techniques used in image captioning from 2011 to 2021, providing details of different architectures used, evaluation metrics, and available datasets. They divide the image captioning methods into categories to summarize them and discuss their advantages and limitations. Moreover, they quantitatively evaluate the relevant cutting-edge studies to identify current trends and prospects in image captioning.

The paper by Hrga and Ivašić-Kos [32] presents an overview of recent advances in encoder-decoder image captioning research. They determine that the main benefit of that architecture is that it can be trained end-to-end, directly mapping from images to sentences. They focus on standard image captioning techniques with or without an attention mechanism. Furthermore, cross-lingual or multilingual image captioning methods are analyzed.

The survey paper by Ghandi et al. [33] presents a structured overview of image captioning methods and their performance, primarily focusing on deep learning methods. Along with discussions of open issues and unresolved challenges with image captioning, they also review commonly used datasets and performance metrics. This survey includes research works from 2018 to 2022.

Sharma et al. [34] reviewed a variety of image captioning techniques, including retrieval-based, template-based, and deep-learning-based techniques, and several assessment metrics, including BLEU, ROGUE, and METEOR.

The survey paper by Attai et al. [35] focused on image captioning in Arabic. Their main emphasis was architecture, attention mechanisms, image models, and language models. They discuss datasets, translation approaches, evaluation metrics, and the results for each technique. They cover Arabic papers from 2017 to 2020.

All the aforementioned works do not follow a systematic review procedure and have a limited scope, either for four years (e.g., [22]) or three years (e.g., [23]).

Surveying Arabic and English papers related to encoder-decoder image captioning systematically helps to give a comprehensive and disciplined overview of the field. Thus, we follow a systematic approach to the article collection process in this survey. Table 1 compares previous surveys regarding their coverage of different aspects of image captioning and whether the review is conventional or systematic. The conventional literature review gives background information on the topic of interest. As a result, the process may not be comprehensive. The main purpose is to give a general overview of the subject. Typically, the approaches employed are not predetermined or thoroughly discussed in the review. Systematic literature reviews raise precise questions and answer them by summarizing articles that satisfy pre-specified criteria. A review team searches for studies to answer these questions using a systematic search strategy.

1.3. Motivation and Contributions

Image captioning is essential, and many reasons motivate researchers to delve into this field, such as wanting to help visually impaired individuals by creating a product that helps people with visual impairments move around the world by converting the scene in front of them into descriptive text and then converting this text into an audible sound. It can be used in closed-circuit television cameras (CCTV cameras) so that relative descriptions are provided along with viewing the scene so an alert can be raised if there is malicious activity. Also, it can be used in self-driving cars by providing linguistic explanations of the scene around the car to reduce the psychological strain on passengers and avoid accidents. Many surveys focus on collecting, presenting, and analyzing different methodologies for image captioning. Similarly, we discuss how to generate a human-like description of an image to provide the interested reader with the latest developments in image captioning. Due to the development we have seen in performance using the encoder-decoder architecture, we have limited our focus to the encoder-decoder architecture of English image captioning and how to use these techniques in Arabic image captioning. This survey follows a systematic search procedure to gather methods based on the encoder-decoder architecture and other relevant information. Our contributions are

A systematic literature review is performed on encoder-decoder-based image captioning models from 2015 until 2023.
We also include Arabic image captioning research in the systematic procedure.
We discuss how the English captioning models can be adapted for image captioning to support the Arabic language.

1.4. Paper Organisation

The remainder of the paper is divided as follows: the next section (Section 2) provides the methodology of the research protocol. Section 3 illustrates the main techniques of image captioning. Section 4 presents the available datasets for image captioning. The evaluation metrics are covered in Section 5. Section 6 presents the Arabic image captioning. The SLR’s discussion is in Section 7, whereas the limitations and future directions are in Section 8. Finally, Section 9 provides the conclusion of this SLR.

2. Methodology

Following the “Preferred Reporting Items for Systematic Reviews and Meta-analyses” (PRISMA) checklist 2020 methodology [36], we conducted this systematic review. Image-captioning-related papers are collected based on specific search terms. Research questions filter the initial collection and exclusion criteria to assess the quality of the papers’ content. Each of these stages is described as follows.

2.1. Research Questions

This paper aims to systematically review encoder-decoder IC methods and techniques for Arabic and English captioning. In particular, we review how to adapt and develop EIC methods for AIC, and we intend to present a detailed perspective. It is critical to have specific questions that must be addressed after thoroughly reviewing the literature. Because the results of each inquiry must be exact and noise-free, the questions are carefully constructed after several attempts. Table 2 presents these questions and their purposes.

2.2. Exclusion Criteria

Several criteria are used to exclude the papers from the candidate set. The review has applied three exclusion criteria (EC), which are:

$E C_{1}$ : Only peer-reviewed articles published in English are considered. All papers written in other languages are excluded.
$E C_{2}$ : Books, notes, theses, letters, and patents are not included in this literature review.
$E C_{3}$ : Papers focusing on applying image captioning methods to languages other than English and Arabic are omitted.

2.3. Search Process

The PRISMA flowchart for the research process is shown in Figure 3. We have used Google Scholar, IEEE Xplore, and ScienceDirect databases to find all the relevant papers. We used the following search terms: “image captioning”, “Arabic image captioning”, and “Arabic image description” with a year limit from 2015 to 2023. As a result, we found 1690 studies in Google Scholar, 436 in IEEE Xplore, and 78 in ScienceDirect. We removed the duplicate records and excluded records based on title and abstract, so the records became 82. After applying the exclusion criteria, 52 remaining records were included in this review.

3. Methods for Image Captioning

Image captioning creates textual descriptions of an image using visual learning and language models. The caption should be in the natural language, containing meaningful sentences with correct syntax. Several methods have been implemented in image captioning, as the task depends on several aspects, including the image models, language models, the quality of the database, and the method used in the evaluation. Figure 4 shows the general workflow of the encoder-decoder image captioning model. First, the image is sent to the visual model to extract feature representations. Then, visual encoding is used to highlight important informatics components. The language model is then given representative visual content elements to produce sentences. This section describes the captioning process, including visual models, visual encoding and language models, and training strategies.

3.1. Visual Models

Providing effective image features is a critical step in image captioning. Using different deep models, the feature representations are primarily obtained. In this section, the models are divided into two main categories:

Feature Vector using Convolutional Neural Network (CNN).
Object Detection.

In the following subsections, we will explain each category in detail.

3.1.1. Feature Vector Using Convolutional Neural Network (CNN)

Different CNN models, such as VGGNet, ResNet, and Inception-V3, each with a different number of layers and filters but sharing the same fundamental architecture, can extract the image’s features.

GoogLeNet. The GoogLeNet [37] architecture employs various techniques, including global average pooling and 1 × 1 convolution, to build a deeper architecture. Its creators lowered the 60 million AlexNet-proposed parameters to just 4 million, comprising 22 layers. With a top-5 error rate of 6.67%, which was extremely similar to human performance, it took first place in ILSVRC. Inception-V3 [38] is a different GoogLeNet variant. Compared to its predecessors, it has 42 layers and a decreased error rate.

VGGNet. Karen Simonyan and Andrew Zisserman [39] proposed VGGNet in 2014 based on a Convolutional Neural Network architecture. The input to the VGGNet is a 224 × 224 RGB image. The preprocessing layer calculates mean image values derived for the complete ImageNet training. The mean image values are subtracted from the RGB input image. Then, this input image goes to the convolution layers. Convolution layers are stacked and then applied to the training images. The VGG16 architecture has 13 convolutional layers and three fully connected layers. Another VGGNet version, VGG19, contains 19 weight layers, including the same five pooling layers and 16 convolutional layers with three fully connected layers each.

Residual Neural Network (ResNet). On the ImageNet dataset, ResNet [40] took first place with a top-5 error rate of 3.57%, outperforming human performance. ResNet’s new architecture is based on two key ideas: skip connections (gated recurrent units) and heavy batch normalization. These two methods made it possible to train the model with 152 layers, less complex than VGGNet [39]. Another popular ResNet variant is DenseNet [41], which makes additional connections to address the vanishing gradients problem.

3.1.2. Object Detection

The two primary components of computer vision are classification and object detection. Identifying what is in an image is called classification while locating an object in an image is called object detection and localization. Finding the object’s coordinates in an image makes detection more difficult. Region-based CNNs or regions with CNN features (R-CNNs) are two cutting-edge techniques for employing deep learning to recognize objects [42]. Below, we will introduce the R-CNN and its improvements: the Fast R-CNN [43] and the Faster R-CNN [44].

Regions with CNN Features (R-CNN). R-CNN [42] extracts around 2000 region proposals from the input image. These proposed regions are often chosen at various scales, sizes, and shapes. Every region proposal will have a class and a ground-truth bounding box indicated on it. A pre-trained CNN should be chosen and truncated before the output layer. Resize each region proposal to the network’s required input size, then output the region proposal’s extracted features by forward propagation. Train multiple support vector machines to classify objects.

However, this method has some problems. It is slow because the feature map is calculated for each region proposal. R-CNN has three parts, CNN, SVM, and Bounding Box Regressor, that must be trained separately. Large memory is required to save every feature map of each region proposal.

Fast Region-Based Convolutional Network (Fast R-CNN). Fast R-CNN [43] appeared to solve the problems found in R-CNN. The region proposal is generated from a selective search algorithm. CNN inputs these region proposals with the image to generate the convolution feature map. The feature vector of a fixed length for each feature map is then extracted for each object proposal by a Region of Interest (RoI) pooling layer. Each feature vector is subsequently sent into the twin layers of the bounding-box regression and softmax classifier to categorize region suggestions and improve the location of the object’s bounding box. Fast R-CNN obtains a higher mAP on PASCAL VOC 2012 [45] and trains the very deep VGG16 network nine times faster than R-CNN.

Faster R-CNN. The Fast R-CNN model frequently creates region recommendations during a selective search to improve object detection accuracy. The Faster R-CNN [44] suggests changing from a selective search to a region proposal network to decrease region proposals without losing accuracy. In contrast to the Fast R-CNN, the Faster R-CNN transforms the region proposal strategy from a selective search to a region proposal network. The remaining parts of the model are unchanged.

3.2. Visual Encoding

The essence of image captioning is to give a suitable representation of images. Providing representational features of images to language models so they can produce sentences is challenging. The visual encoding method supports the caption model to focus on the critical informatics parts. Based on the CNN feature vector, some research works relied on extracting the global features, while others used the attention mechanism over grid features. On the other hand, some research depends on attention over visual regions and graph-based attention using object detection models.

3.2.1. Global CNN Features

Generally, high-level representations are extracted using the activation of a CNN last layer and then fed to the language model, as shown in Figure 5. Sharma et al. [9] employed Inception-ResNet-v2 [46], which offers optimization advantages through residual connections and computationally effective inception units. Vinyals et al. [4] and You et al. [47] utilized GoogleNet [37] to extract global image features, which were then fed to the LSTM to generate a corresponding sentence. Instead of the conventional CNN model, Deng et al. [48] employed the DenseNet [41] to improve the extraction of global features from images and enhance the descriptive text. The key benefit of using global features is that they are straightforward and consider the context of an image as a whole. However, this makes it challenging for a captioning model to generate precise, detailed descriptions [28].

3.2.2. Grid Features

Most of the research focused on the grid features to improve the extraction of accurate features, as shown in Figure 6. Xu et al. [3] presented the convolutional layer’s spatial output grid’s additive attention. The model may decide whether to depend on specific grid components by using a subset of features for each created word. The model first extracts the activation of the last convolutional layer of a VGG network. Then, it uses additive attention to compute a weight for each grid element that can be interpreted as the relative importance of that element in generating the following word. Another study also uses the grid feature by [49]. They use complementary data from multiple CNN encoders. The input image may be described in various ways and with greater detail using multiple CNNs. Chen et al. [5] suggest using channel-wise attention over convolutional activations. They also analyzed the effect of using multiple convolutional layers to exploit multi-layer features.

For the spatial positions in the image, Parameswaran et al. [50] presented adaptive positional encoding, which considers both the bounding box coordinates for the object features for the bottom-up method and the two-dimensional grid structure of the image features for the top-down method.

Table 3 represents studies based on CNN with their parameters.

3.2.3. Region-Based

Most of the research based on element discovery uses Faster R-CNN, as shown in Figure 7. Li et al. [55] used the Fast R-CNN proposed to generate a set of object proposals for an image. We have to predict each proposal’s object category and bounding box offsets. From AIC research, Jindal [11,12] used R-CNN as an object detector.

Anderson et al. [56] proposed a combined bottom-up and top-down visual attention mechanism. The bottom-up approach suggests a collection of salient image regions and is implemented using Faster R-CNN. They use the Visual Genome dataset [57] to pre-train their bottom-up attention model. Using task-specific context, the top-down method predicts an attention distribution over the visual regions. Huang et al. [6] used Faster R-CNN to extract the object features and then send them to a multimodal attribute detector (MAD) to predict the probability of image attributes using attribute embedding. Guo et al. [58] introduced a class of Geometry-aware Self-Attention that explicitly employs the relative geometry relations and the content of objects to improve picture interpretation. He et al. [59] proposed the image transformer for image captioning to encode spatial connections between image regions and decode the varied information in image regions. Each transformer layer contains several sub-transformers. Pan et al. [60] utilized Faster R-CNN for image encoder to detect a set of image regions. A stack of X-linear attention blocks is modified to encode region-level features with higher-order intramodal relations to increase region-level and image-level features.

Kumar et al. [61] extracted the feature vectors and detected objects in the image. Then, using the feature, vector embedding and object embedding are created, respectively. A final embedding vector is created by merging the two embeddings. The positional encoding vector and the final embedding vector were combined to create feds, which served as the encoder unit’s input. Wang et al. [62] propose an attention-reinforced transformer that improves the image encoding stage. It incorporates a feature attention block (FAB) and uses the connections between image regions. Geometrically coherent object proposals and label-attention were employed by Dubey et al. [63] to learn the associations between objects and the object detector for extracting object proposals. The main idea behind object extraction from images is to give the proposed architecture fine-grained data about the image’s content and overall structure. At the same time, the geometrical properties determine the association between objects. A global-local attention (GLA) technique for image captioning is suggested by Li et al. [64]. The proposed GLA approach integrates local features at the object level with global characteristics at the image level to selectively focus on semantically more important regions at different times while keeping global context information.

3.2.4. Graph-Based Attention

The scene graphs of the images are used to create structured representations of the images, as shown in Figure 8. Li et al. [55] suggested a scene graph-based framework for image captioning. Scene graphs have much structured information since they show pairwise relationships and object entities from images. In structured scene graphs, visual features and semantic knowledge are used. They extracted CNN features from the object entities’ bounding box offsets for visual representations. Additionally, they use triples to extract semantic relationship features for semantic representations. Afterward, a hierarchical attention-based module is proposed to learn discriminative features for word generation at each time step. To provide context for the next visual action, Zha et al. [7] created a sub-policy network that interprets the visual component sequentially by encoding past visual actions via an LSTM.

The word embedding vector only gets information from itself because it is built on itself. Zhang et al. [65] introduced the knowledge graph to use information from the word itself and its neighbors. To help the language model generate captions, Dong et al. [66] suggested a curriculum learning strategy to guide the transformer to generate image descriptions via Dual Graph Convolutional Networks (Dual-GCN) that specifically employs an object-level GCN to capture the object-to-object spatial relationship in a single image and an image-level GCN to collect the feature data offered by relevant images. Nguyen et al. [67] suggested a scene graph to bridge the semantic gap between the two scene graphs, one generated from the input image and the other from its caption, by leveraging the spatial location of objects and the Human–Object Interaction (HOI) labels as an extra HOI graph. Yang et al. [68] propose a sequential training algorithm to guide the relational transformer (ReFormer) to learn a scene graph to express the object relationships in decoding a sentence caption.

3.2.5. Self-Attention Encoding

Self-attention links every element in the set to every other element and then leverages residual connections to build a refined representation of the same set. It was introduced by Vaswani et al. [8] for machine translation tasks. A modified form of self-attention that considers the spatial interactions between regions is developed by Herdade et al. [69]. The attention weights are scaled using an additional geometric weight computed between object pairs. EnTangled Attention (ETA) was proposed by Li et al. [70], allowing the transformer to simultaneously examine semantic and visual information. Instead of considering spatial relationships between query and key pairs as the original transformer, He et al. [59] used a spatial graph transformer that assumes different categories of spatial relationship for each query region in a graph structure (e.g., parent, neighbor, child). Song et al. [71] recommended the Direction Relation Transformer (DRT) to improve the orientation perception between visual features, integrating relative direction embedding into multi-head attention. They constructed a relative direction matrix and investigated three types of direction-aware multi-head attention to include direction embedding into transformer design. A geometry-aware self-attention is created by Guo et al. [58] that explicitly employs the relative geometry relationships and the content of objects to help in image comprehension. Dubey et al. [63] proposed LATGeO, which uses a transformer to link the features, surrounds, geometrical attributes, and associated labels of semantically coherent objects. Labels are passed through a label-attention module to compute geometrical relationships. Multi-level representation of objects and features input to LATGeO, a fully connected encoder-decoder transformer. In contrast to the original transformer [10], a memory vector is included in the encoder, and a mesh-like structure is designed to connect the encoder and decoder. The encoder can encode prior information utilizing keys and values augmented with learnable vectors. A learnable gating technique regulates the contribution of each encoder layer during cross-attention in a mesh connection. A Global Enhanced Transformer GET enables a complete global representation extraction and then adaptively guides the decoder to produce high-quality captions [72]. GET includes a Global Enhanced Encoder and a Global Adaptive Decoder for embedding global features and caption-generating instructions. To model intra-layer and inter-layer global representation, the former employs the proposed Global Enhanced Attention and a layer-wise fusion module. The latter contains a Global Adaptive Controller, which can integrate global data into the decoder to direct caption synthesis. Luo et al. [73] developed a novel Dual-Level Collaborative Transformer (DLCT) network. In DLCT, region, and grid features are first mined for intrinsic properties using a novel Dualway Self Attention (DWSA), followed by a Comprehensive Relation Attention component to embed the geometric information. A Locality-Constrained Cross Attention module is suggested to solve the semantic noises caused by the direct merging of these two features, in which a geometric alignment graph is built to align and reinforce region and grid information accurately. Yang et al. [68] proposed a relational transformer (ReFormer) to generate embedded features with relation information and clearly explain the pair-wise interactions between items by a supplementary scene graph generation task. Because image captioning and scene graph construction are independent tasks, they propose a sequential training strategy to aid the ReFormer in learning both tasks.

3.3. Language Models

The language model generates the sequence of words by predicting the probability of producing the current word, given the previous word and visual image features. As outlined in Figure 4, the language models include RNN, LSTM, and transformer.

3.3.1. Recurrent Neural Networks (RNN)

Recurrent neural networks (RNNs) contain loops to allow information to persist. In Figure 9, a part of a recurrent neural network is shown, where the loop allows information to be passed between several steps as A processes some input

x_{t}

to produce output

h_{t}

.

We can show the loops in detail, where the recurrent neural network can be considered repeated copies of the same network, and each network sends a message to the next network, as shown in Figure 10.

When processing a sequence of text, each word of the text is received as input to the RNN. The RNN passes the information from the previous word to the next network. So, each word is handled independently, and the finished sentence is produced by sending a hidden state to the decoding step, producing the output.

However, there is an issue with recurrent neural networks, which is that they learn long data sequences. Gradients carry over the information used in RNN, and when the gradient becomes too tiny, the parameter updates become negligible.

3.3.2. Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) [74] is a specific type of RNN that can learn long-term dependencies. A series of sequentially repeating neural network modules characterizes all recurrent neural networks. This repeating module’s structure will be fairly straightforward with RNNs. This chain-like structure is also present in LSTMs, although the repeating module has a different structure. Four neural network layers, interacting in a highly unique manner, are used instead of just one, as shown in Figure 11.

The fact that the model frequently forgets information about distant points in the sequence is one of the drawbacks of LSTM. Due to the need to parse sentences word by word, another issue with RNNs and LSTMs is the difficulty in parallelizing the task. Furthermore, no model accounts for long-term and short-term interdependence.

Single-Layer LSTM. The simplest LSTM-based captioning architecture was developed by Vinyals et al. [4] and is based on a single-layer LSTM. The LSTM creates the output caption using visual encoding as its initial hidden state. At each time step, a word is predicted by applying a softmax activation function on the hidden state’s projection into a vector the same size as the vocabulary. The ground-truth sentence is utilized as training input, and the words used for inference are from the previous stage. Xu et al. [3] presents the additive attention mechanism. In this case, the previous hidden state directs the attention mechanism across the visual cues to compute a context vector given to the MLP responsible for predicting the output word. A single-layer LSTM decoder has been used in many subsequent studies, typically without architectural change [5]. Jia et al. [54] supplement each unit of the LSTM block with semantic information derived from the image, directing the model toward findings more closely related to the image’s content. An adaptive attention model with a visual sentinel is proposed by Lu et al. [76]. The model chooses whether to focus on certain regions of the image or the visual sentinel. The model chooses whether to focus on the image and where to gather relevant data to generate sequential words.

Two-Layer LSTM. Multi-layer structures can be added to LSTMs to improve their ability to capture higher-order relationships. Nguyen et al. [67] used a two-layer encoder-decoder LSTM architecture for this step. The encoder, an attention LSTM, is fed the encoding of visual scene graphs. The captions are then generated using an LSTM-based language model. Chu et al. [51] proposed a two-layer LSTM with a soft attention mechanism as the decoder, which selectively focuses the attention over a specific area of an image to predict the next sentences. Huang et al. [6] insert a subsequent attribute predictor (SAP) module with a two-layer LSTM to generate plausible image captions using the object features and probability of image attributes. Qin et al. [77] utilized two modules to handle the possible issue of accumulating errors at inference time: the look back module, which uses the previously attended vector to compute the next one, and the predict ahead module, which predicts the new two words concurrently. LSTM with two layers has been used in numerous additional investigations [56,64].

Attention. Adding the attention mechanism gave the encoder-decoder paradigm for machine translation a performance boost. The attention mechanism enables the decoder to employ the most pertinent portions of the input sequence flexibly [78]. Lu et al. [76] presented adaptive attention that can automatically select when to rely on visual information and when to rely on the linguistic model. Indeed, the model specifies where the image region should focus its attention while relying on visual data. Huang et al. [79] modified the LSTM by including the Attention on Attention operator, which computes an additional attention step on top of the visual self-attention step. Deng et al. [48] proposed an adaptive attention model with a visual sentinel to enable the model to determine better when to focus on language rules and when to focus on image data.

3.3.3. Transformer

The transformer model [8] was first created for machine translation and is now employed in various natural language processing applications. The transformer is a CNN with attention combined to solve the parallelization problem and boost the model’s speed. It is a sequence-to-sequence model that contains six encoders and a decoder. It depends on the attention mechanism, specifically self-attention. Osolo et al. [80] applied fast Fourier transforms to decompose the input features and extract more important information from the images to provide succinct and informative captions. To distinguish between the word semantics and grammatical structures of captions and include the PoS guiding information in the modeling, Wang et al. [81] proposed a novel part-of-speech guided transformer (PoS-Transformer). The PoS-Transformer smoothly merged the PoS prediction module with the transformer-based captioner for more accurate and fine-grained image captioning. Yang et al. [68] Proposed a relational transformer (ReFormer) to generate embedded features with relation information and clearly explain the pair-wise interactions between items by a supplementary scene graph generation task. Image captioning and scene graph generation are distinct tasks, so they suggest a sequential training algorithm that guides the ReFormer to learn both tasks. Cornia et al. [10] improved the image encoding and the language generation by transformer Integrated with Memory.

Encoder Each encoder contains a feed-forward neural network and a self-attention layer. The encoder’s inputs pass through a self-attention layer first. It assists the encoder in considering additional words in the input text as it encodes a specific word.

Decoder The decoder has the same layers as the encoder but also has an attention layer that enhances the decoder’s ability to focus on key elements of the input text. To bridge the gap between the visual and linguistic, the label-attention module (LAM), which is an extension of the conventional transformer, is developed by Dubey et al. [63]. In LAM, each decoder layer’s input includes object labels as prior data for caption generation. He et al. [59] suggested a decoder comprised of LSTM and implicit transformer decoding layers. The transformer layer uses dot product attention to infer the most important region in the image; the LSTM layer is a common memory module.

Ji et al. [72] proposed a Global Enhanced Transformer (GET) that follows the encoder-decoder architecture. The global-enhanced encoder maps the original inputs into highly abstract local representations, extracting the intra-layer and inter-layer global representations. The decoder uses the suggested global adaptive controller to create the caption word by word while concurrently incorporating the multimodal information. Li et al. [70] suggested Entangled Attention (ETA) and Gated Bilateral Controller (GBC) to examine visual and semantic information simultaneously. To concurrently execute attention over the visual and semantic outputs of the dual-way encoder, the decoder block inserts an ETA module and a GBC module between the self-attention sub-layer and the feed-forward sub-layer.

Self-Attention. Each word follows its path through the encoder. In the self-attention layer, these paths are interdependent. These dependencies are absent from the feed-forward layer, allowing the various paths to be executed in parallel as data passes. The self-attention is effectively used for capturing the relationships between the regions and words.

Multi-Head Attention. The multi-head attention technique uses a different learned projection each time to linearly project the queries, keys, and values h times. Then, each of these h projections is subjected to the single attention process in parallel to generate h outputs, which are then concatenated and projected once more to generate the final result. Song et al. [71] recommended the Direction Relation Transformer (DRT) to improve the orientation perception between visual features, integrating relative direction embedding into multi-head attention. They constructed a relative direction matrix and investigated three types of direction-aware multi-head attention to include direction embedding in transformer design.

The goal of multi-head attention is to make it possible for the attention function to extract data from several representation subspaces, which is impossible when using a single attention head. Zhou et al. [82] developed a semi-autoregressive image captioning (SATIC) model to balance speed and quality better. This model retains the autoregressive behavior at the global level while creating words at the local level in tandem. They based it on the well-known transformer but substituted relaxed masked multi-head self-attention for the original masked multi-head attention.

Positional Encoding. Positional encoding is crucial to the transformer’s encoding process for each word. Encoding each word’s position is essential because it affects how it will be translated. Guo et al. [58] modified the inputs at the bottom of the decoder to include sinusoidal positional encodings to follow the original transformer because position information is not included in the encoder side since the image’s regions do not naturally occur in sequences.

3.4. Loss Functions

The image captioning model typically expects to produce a caption word by word while considering the image and the preceding word. The output word is taken as a sample from a vocabulary distribution at each time step. The word with the highest probability is produced using the greedy decoding technique, which is the simplest situation used in the earlier methods [56,83]. The fundamental disadvantage of the situation is the rapid accumulation of potential prediction mistakes. The beam search strategy is one effective method for dealing with this problem. It keeps the k sequence candidates with the highest probability at each time step and selects the most likely one [4]. To train the model to generate captions better, image captioning training models utilize a variety of loss functions. Most studies follow a standard practice in image captioning proposed by Rennie et al. [83]. On the training side, where initial methods relied on time-wise cross-entropy training, a big advancement comes with the introduction of reinforcement learning, which allows non-differentiable caption metrics to be used as optimization objectives.

3.4.1. Cross-Entropy (CE)

Maximum likelihood estimation (MLE), or teacher forcing, is commonly used in image captioning models. This enables parameters

θ

to be learned by maximizing the likelihood of the observed sequence. The goal is to reduce the cross-entropy loss (XE), also known as negative log-likelihood, given a ground truth phrase

y_{1 : T}^{*}

and a prediction

y_{t}^{*}

of the captioning model with parameters

θ

:

L_{X E} (θ) = - \sum l o g (p θ (y_{t}^{*} | y_{1 : t - 1}^{*}))

(1)

3.4.2. Self-Critical Sequence Training (SCST)

Captioning systems are typically trained using the cross-entropy loss, as mentioned in the section above. The generative models are cast in Reinforcement Learning (RL) terms to optimize non-differentiable metrics and handle the exposure bias issue [83]. Equation (2) provides RL’s reward function

r (\cdot)

depending on the CIDEr score of a generated caption [10].

▽_{θ} L (θ) = - \frac{1}{k} \sum_{i = 1}^{k} ((r (Y^{i}) - b) ▽_{θ} log p (Y^{i}))

(2)

where

θ

is the learning parameters,

Y^{i}

is the

i - t h

sentence in the beam, k is the beam size, and

b = \frac{(\sum_{i} r (w^{i}))}{k}

is the baseline, the mean of the rewards attained by the sampled sequences, as computed.

3.4.3. Kullback–Leibler Divergence (KL Divergence)

The Kullback–Leibler Divergence score, abbreviated as the KL divergence score, quantifies the difference between two probability distributions. The Kullback–Leibler (KL) divergence function can discriminate between correct and incorrect predictions. The KL divergence term is calculated by comparing two probability distributions, one based on model prediction and the other on ground truth.

D_{K L} (P | | Q) = \sum P (i) l o g \frac{P (i)}{Q (i)}

(3)

Zhang et al. [65] suggested improving the transformer model. They add an extra KL divergence to the training objective MLE to help distinguish between incorrect predictions. They also found the efficiency of their method for the transformer.

4. Available Datasets in English Image Captioning

There are many datasets available for performing image captioning. The most commonly used datasets in the literature are MS COCO, Flicker30k, and Flicker8k. The distribution of the studies over the datasets is shown in Figure 12. MS COCO is the most used dataset due to its large size.

4.1. Microsoft COCO (MS COCO)

Microsoft COCO Dataset [84] is a sizable dataset for image segmentation, recognition, and captioning. The MS COCO dataset has many features, including context recognition, object segmentation, and multiple objects in each class. It contains 82,783 images for training and 40,504 for validation, with five descriptions per image. Most studies use the splits recommended by Karpathy et al. [85] for evaluation, employing 5000 images from the primary validation set for validation, 5000 images for testing, and the remaining images for training. The dataset also has an official test set server with 40,775 images and 40 private captions for each of them.

4.2. Flicker30k

The Flicker30k dataset [86] is used for automatic image captioning and grounded language understanding. It includes 30k Flicker images with 158k captions added by human annotators. The collection also includes classifiers for colors, detectors for specific objects, and a bias toward choosing larger objects.

4.3. Flicker8k

Flicker8k [1] is a well-known dataset of 8000 images gathered from Flicker. The test and validation data contain 1000 images, while the training data comprises 6000. There are five human-annotated reference captions for each image in the dataset. Table 4 summarises the datasets used in English image captioning and their split into training, validation, and testing.

5. Evaluation Metrics

In image captioning, evaluating the trained model is a challenging problem for which many evaluation matrices have been developed. The methods for evaluation BLEU, ROUGE, CIDEr, METEOR, and SPICE are the most often used in the literature. Figure 13 shows the distribution of the studies over the evaluation metrics. As we can see, BLEU is the most used evaluation metric.

5.1. Bilingual Evaluation Understudy (BLEU)

BLEU (Bilingual Evaluation Understudy) [87] is a metric used to assess machine-translated text. It measures the machine translation’s similarity to a group of expert reference translations ranging from 0 to 1 (a value of 1 denotes high quality). It is the evaluation criterion most often used. It examines the n-gram correlation between the reference translation and evaluated sentences.

5.2. METEOR

METEOR (Metric for Evaluation of Translation with Explicit Ordering) [88] was developed to address several issues with the BLEU metric. The authors show in their research that METEOR considerably enhances correlation with human assessments and that recall is more important than precision in achieving high levels of correlation with human judgments.

5.3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [89] is a set of measures for evaluating automatic text summarization used for image captioning. The metric compares reference captions with automatically generated captions. To determine their similarity, it computes the longest matching sequence of words between the generated caption and the reference caption.

5.4. CIDEr (Consensus-Based Image Description Evaluation)

CIDEr [90] calculates the cosine similarity between the candidate caption and the collection of reference captions related to the image. It Uses Frequency-Inverse Document Frequency weighted n-grams. This method accounts for both recall and precision.

5.5. SPICE (Semantic Propositional Image Caption Evaluation)

SPICE (Semantic Propositional Image Caption Evaluation) [91] is a novel metric for evaluating captions based on semantic concepts. It is based on scene-graph, a graph-based semantic representation [184, 185]. From the descriptions of the images, this graph may extract information about various objects, attributes, and relationships between them. Table 5 presents the evaluation metrics’ original tasks and input. The possible inputs are the predicted caption, reference caption, and image, as shown in the table for each metric.

6. Arabic Image Captioning (AIC)

Arabic is a Semitic language. More than 315 million people speak it as a native language [92]. Morphology is crucial in the highly structured and derivational language of Arabic. Arabic script is written from right to left. It has 28 letters written in various shapes depending on where they are in a word (beginning, middle, or end).

Arabic natural language processing (NLP) tries to handle the complexity of the Arabic language. Few studies focus on the Arabic language in image captioning compared to English since there are only 11 research works for the Arabic language.

6.1. Model Architecture

Due to the lack of Arabic research works, all of them were collected, a total of 11. Two of these research works adopted the compositional architecture, while the rest adopted the encoder-decoder architecture, as shown in Table 6. In 2017, Jindal [11] proposed a three-step root-word-based method for automatic AIC generation. First, the image is fragmented, and the image fragments are mapped onto root words in Arabic deep belief networks pre-trained by Restricted Boltzmann Machines. Then, the method finds the most appropriate words for an image by selecting a set of vowels that must be added to the root words. Finally, it builds Arabic captions using the dependency tree relationships between these words. The authors upgraded the language model a year later, while preserving the same model architecture in [12] by replacing the deep belief network with LSTM. The experimental results show promising performance.

6.2. Visual Models

Since the image captioning databases contain images, Arabic research follows the same approach as other research in extracting the features of images. As explained in Section 3, image features fall into two categories: object-detection-based and CNN-feature-based. Jindal [11,12] relies on object detection in his studies to deal with images using R-CNN [42]. The work by Emami et al. [20] uses ResNeXt-152 C4 [93] as the object detection model. The image region vector is generated for each object detected and used as input to the final linear classification layer. The rest of the research depends on the CNN features, VggNet [13,15,18], ResNet [16,19,21], and Inception [18].

6.3. Language Models

Language models are used like those used with English. Language models such as RNN, LSTM, GRU, and Transformer can handle Arabic. The first research in the Arabic language [11] used the Deep Belief Network as a language model. A year later, the authors updated the language model While maintaining the same model architecture in [12] by replacing the deep belief network with LSTM. The experimental results showed promising performance. Al-muzaini et al. [13] and ElJundi et al. [15] use a single-layer LSTM with 256 memory units, while Za’ter et al. [19] use two-layer LSTM with 256 memory units. Also, many other research works use LSTM, such as [14,16,17,21]. Hejazi et al. [18] compare LSTM and GRU, but they do not notice an appreciable difference when using LSTM instead of GRU. Emami et al. [20] initialize pre-trained transformers on different Arabic corpora as a language model.

Text Preprocessing. Arabic is distinct from other languages because of its ambiguous, complicated structure and diversity of dialects, which the computational system must take into account at every linguistic level. Due to its complexity in form and meaning, Arabic verb morphology is essential to creating an Arabic sentence [94]. Therefore, it is necessary to preprocess the Arabic text to simplify it somewhat so the model can handle it.

Jindal [11,12] uses root words based on caption generation in Arabic for images. He first creates image fragments using a previously trained deep neural network on ImageNet. After that, these fragments are mapped to a set of root words in Arabic. A deep belief network chooses different root words associated with the image fragments and extracts the most relevant words. Finally, he uses dependency tree relations to construct a sentence from the words [11]. ElJundi et al. [15] followed the Arabic preprocessing methods suggested by [95]. The word ending characters ’t marbouta’ and ’ya’ maqsoura’ are normalized, the ’hamza’ on characters is normalized, and diacritics are eliminated. Additionally, they eliminate punctuation and non-Arabic letters. Hejazi et al. [18] followed four preprocessing methods to test their effect on the final result. In the first method, they use the caption without preprocessing. In the second method, they followed the method in [95]. In the third method, they remove the alif tanween. In the fourth method, they keep the single letter waw. It was found that the fourth method, which integrates the various preprocessing steps without deleting the waw, was the best. Hence, the importance of preprocessing the Arabic text becomes clear to us. Lasheen et al. [21] tokenize the Arabic text using FARASA word segmenter and Pyarabic [96], which splits captions into tokens based on spaces. FARASA [97] separates Arabic words into their component prefixes, stems, and suffixes.

6.4. Datasets in Arabic Image Captioning

The database is considered the core change in this task. On its basis, the quality and effectiveness of the model are evaluated, as the research relies on several methods for collecting Arabic data. Jindal [11] used two datasets, the first from the ImageNet dataset that contains 10,000 images with manually translated Arabic captions by professional Arabic translators. The second dataset is 100,000 images from the Al-Jazeera news website [98]. Additionally, Jindal [12] used two distinct datasets: the Flicker8k dataset with manually written Arabic captions by Arabic translators and the 405,000 images with captions from different Middle Eastern newspapers. Al-Muzaini et al. [13] collected data from two sources, MSCOCO and Flicker8k, and translated it using three different methods. They select 1166 images and 5358 captions from the MSCOCO training set. The captions utilized to create the dataset are human-generated using Crowd-Flower Crowdsourcing [99]. A professional translator translated 750 captions to Arabic of 150 images from Flicker8k. Using Google Cloud Translation API (GCT), 10,555 captions of 2111 images have been translated to Arabic and verified by native speakers. So, the total of images is 3427 from both datasets, MSCOCO and Flicker8k. Mualla et al. [14] used a subset of the Flicker8k dataset consisting of 2000 images, translated using smart translator Ultra edit [100]. ElJundi et al. [15] created the first public Arabic database based on Flicker8k, which is translated using GCT and then validated by a professional Arabic translator. Cheikh et al. [16] created a new dataset called ArabicFlickr1K containing 1095 images, each associated with three to five captions. Afyouni et al. [17] used GCT [101] to translate MSCOCO and Flicker8k datasets. Za’ter et al. [19] translated using three open-source services: GCT [101], Facebook Machine Translation [102], and the University of Helsinki Open translation services [103]; the three datasets are MSCOCO, Flicker30k, and Flicker8k. Emami et al. [20] and Lasheen et al. [21] used two public databases, Arabic-COCO [104] and Arabic Flicker8k by ElJundi et al. [15]. For COCO, it has been translated using GCT [101] into Arabic, but it has not been validated. Hejazi et al. [18] used Arabic Flicker8k by ElJundi et al. [15]. Table 7 summarizes the used dataset in Arabic regarding availability, size, and translation method.

6.5. Evaluation Metrics

The Arabic research evaluated the models using the same scales used in other languages, BLEU, METEOR, ROUGE, and CIDEr. In addition to Multilingual Universal Sentence Encoder (MUSE) [105], a new encoding model variation is developed for various NLP applications and languages, including Arabic.

7. Discussion

The primary goal of this survey is to analyze the outcomes from studies on Arabic and English image captioning. We conducted a systematic approach to collect papers related to encoder-decoder image captioning. The article collection process is shown in detail in Section 2. Based on selection criteria, we have selected 52 papers, and the techniques used in these papers are explained in Section 3, Section 4, Section 5, and Section 6, respectively. Specifically, we mentioned the main five components used to build the IC models: the language models, feature extraction model, loss functions, available datasets, and evaluation metrics. These findings require more analysis, which this section seeks to accomplish by answering the survey questions in Table 2. The answers to the survey questions are listed below.

RQ1. How do image caption generation techniques identify the important objects in the image? The image caption pipeline’s initial challenge is to represent visual material adequately. Two categories may be used to categorize the existing visual coding techniques: object detection and CNN features. Each category falls under two other categories, explained in Section 3. The object detection category, which mostly depends on Faster R-CNN, includes region-based and graph-based. Global CNN features and grid-based features fall under the CNN Feature category.

Region-based features, which followed the development of global and grid features, have been the state-of-the-art option, and most studies used them in image captioning for years because of their appealing performance, as shown in Table 8. However, several new factors have reopened the discussion on the ideal feature model for image captioning. The appearance of Adaptive Attention [76] led to improvement, as we can see in Table 8, although Li et al. [64] use two types of image features [76], which are superior to them, despite only using grid-based. Also, using complementary data from multiple CNN encoders [49] led to a remarkable development in the field compared to [50].

RQ2. Which deep learning techniques are used for caption generation? For many years, recurrent models have been used as standard sequential modeling, and their use has led to the creation of ingenious and effective concepts that may be incorporated into nonrecurrent systems. Nevertheless, they are difficult to train and have trouble maintaining long-term dependence. So, these limitations are mitigated by autoregressive and transformer-based systems that have lately gained favor. The transformer can take advantage of parallelism as it uses non-sequential processing. As shown in Table 8, the transformer has outperformed LSTM in many studies, such as MT [107] outperforming AoANet [79] in CIDEr score by 4.3 and in BLEU-4 score by 1.8. Many researchers have compared RNN and transformer using the same image feature extraction, such as [9,60], and indicate that the transformer model outperforms the RNN model, X-LAN [60] (LSTM) and X-Transformer [60] (Transformer); they have plugged their X-Linear attention blocks into LSTM and Transformer-based models, as we can see in Table 8. X-Transformer outperforms X-LAN in CIDEr score by 0.8 and in BLEU-4 score by 0.2.

RQ3. Which evaluation mechanisms are used in the literature for image captioning? Measuring the model’s performance concerning unobserved data is necessary after updating the model parameters to reduce the error between the predicted and ground truth values in the training stage. Evaluating a produced caption’s quality is a challenging task complicated by the requirement that captions correctly relate to the input image and be grammatically correct and fluid. A human judgment is still the most accurate way to evaluate how well a caption suits an image. Nevertheless, human evaluation is costly and unrepeatable, and it is impossible to compare various strategies fairly. Existing automated evaluation techniques frequently evaluate the produced captions’ quality by contrasting them with human reference captions, as described in Section 5. Table 9 shows the results of the experiments on the MS COCO evaluation server that contains 40 captions for 5000 randomly selected images from the MS COCO testing dataset. The evaluation server guarantees consistency in evaluating automatic caption generation methods.

RQ4. Which datasets are used for image captioning? The dataset is one of the key elements of image captioning, which is used to develop and evaluate the model. The several datasets that are available online are detailed in Section 4. The following requirements should be taken into account when selecting the dataset:

Size. Image captioning is a complex problem. It is a combination of two tasks requiring knowledge of handling images and text, so it requires a large amount of data for the model to generate accurate captions.
Data quality. Having many data with low quality has a side effect on performance. The data must be of high quality so that the model can accurately detect the objects in the images.
Diversity. To make the model generalizable to new domains, tasks, and scenarios. In particular, if they differ considerably from the training data regarding content, style, or context, the models developed using current datasets may need to perform better on unseen images or texts. For instance, a model trained on MSCOCO could not caption fashion or medical images.
Annotation quality. Annotation of images in any dataset must be consistent, complete, accurate, and contain no spelling or grammatical errors.
Linguistic richness. The captions for the images should be more accurate, sizable, and varied datasets that can accurately represent the diversity and richness of linguistic and visual information in the actual world.
Complexity. To use the image captioning task for diverse applications, the dataset must contain various objects, attributes, and interactions. This means the dataset should have various objects, such as man, dog, ball, and tree; different interactions, such as standing, walking, running, and extending; and more complex interactions, such as flipping and kicking.

Most of the data in these datasets are high-resolution and contain different postural activities such as running, walking, talking, and kicking. However, the captions for the images are basic, inaccurate, or repetitious, which may not match viewers’ expectations or natural language usage.

RQ5: Which loss functions are used to train image captioning models? Most studies follow a two-phase sequence training procedure, including pre-training through supervised learning and fine-tuning through reinforcement learning. (The traditional studies follow just the first phase as shown in Table 10). The cross-entropy loss is optimized for pre-training between the ground truth and the probability distribution produced. Sequential models for text are often trained to use back-propagation to maximize the likelihood of the following ground-truth word given the preceding ground-truth word, and this is called “Teacher-Forcing” [108]. This strategy, however, leads to a mismatch between training and testing since, during testing, the model predicts the following word using the previously produced words from the model distribution. Since the model has never been exposed to its predictions, this

e x p o s u r e b i a s

[109] causes error accumulation during generation at test time. During the fine-tuning phase, they use the REINFORCE method [110] to directly optimize the sequence level metrics and deal with the exposure bias issue. They particularly follow the self-critical technique [83]. Table 11 presents the results of studies that followed the two phases.

RQ6: What are the challenges in adopting the existing methods for image captioning in Arabic? It is possible to attain state-of-the-art outcomes by adapting English captioning models to Arabic utilizing public datasets [20]. The real challenge is using a high-quality, large, and varied database that contains rich information covering many aspects. Then, the appropriate preprocessing steps should be applied to reduce the complexity of the Arabic language, such as cleaning the text and applying normalization and tokenization. These seem like simple steps, but they significantly impact the model’s performance and the generation of high-quality captions.

8. Limitations and Future Directions

Despite the rapid development of image captioning, it is considered to be in the initial stages in Arabic. Some challenges in this field, such as accuracy, performance, and generalization to different domains and datasets, are still far from expected. In the following, we present several key paths for future directions in image captioning based on different aspects.

Visual aspect. The visual aspect should be developed in choosing the visual model and the appropriate encoding method so that the model can accurately extract the image objects, their properties, and the relationships between them. Since this aspect of Arabic studies has not been sufficiently developed, most studies use CNN-based features without an attention mechanism. A good feature representation should be provided by extracting the objects and their relationships, using the attention mechanism to focus on the important regions in the image, and sending them to the language model to produce better descriptions.
Language model aspect. Focusing on the transformer model with the Arabic language, it has proven its superiority in sequence modeling, and the reason for this is due to its ability to capture the relationships between each word and another in the sequence. Additionally, using pre-trained models saves time and resources, possibly leading to better results such as BERT [111]. It has several versions, such as mBERT [112], short for Multilingual BERT, which is pre-trained in several languages, including Arabic, AraBERT [113], and ArabicBERT [114].
Dataset aspect. Providing an open-source dataset enables people to add descriptions of images in natural and clear language to provide various descriptions of the image, as each person will give a description based on his view, resulting in a wide and varied dataset in natural language. Usually, image captioning relies on the vocabulary set from the dataset to generate the output sentence; here, we can create a learning mechanism for image captioning by asking people to name the objects not mentioned in the dataset.
Learning mechanism aspect. With supervised methods, the focus will be on creating more diverse, realistic datasets compatible with natural language. Therefore, the focus will be on unsupervised learning and reinforcement learning in the future.
Evaluation aspect. It is possible to use the reverse method by text-to-image models. So, an image is created from the generated caption, and then the created image is compared with the original image.

9. Conclusions

Given the overlap between NLP and Computer Vision challenges, image captioning presents a particularly severe challenge for artificial intelligence. This SLR reviews and analyzes most research on image captioning encoder-decoder architecture and examines how to adapt it to Arabic. Several aspects are covered: system architecture, visual model, language model, loss functions, datasets, and evaluation metrics. Other aspects considered are how the Arabic research applies to each aspect and how different challenges are dealt with.

Author Contributions

Conceptualization, A.A. and M.A.; methodology, A.A., M.A. and T.M.Q.; validation, A.A., M.A., T.M.Q. and S.A.; formal analysis, M.A., A.A. and T.M.Q.; investigation, A.A., T.M.Q. and M.A.; resources, T.M.Q., M.A. and S.A.; data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, M.A. and T.M.Q.; visualization, S.A.; supervision, M.A.; project administration, M.A.; funding acquisition, T.M.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef]
Kiros, R.; Salakhutdinov, R.; Zemel, R. Multimodal neural language models. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 595–603. [Google Scholar]
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
Huang, Y.; Chen, J.; Ouyang, W.; Wan, W.; Xue, Y. Image captioning with end-to-end attribute detection and subsequent attributes prediction. IEEE Trans. Image Process. 2020, 29, 4013–4026. [Google Scholar] [CrossRef]
Zha, Z.J.; Liu, D.; Zhang, H.; Zhang, Y.; Wu, F. Context-aware visual policy network for fine-grained image captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 710–722. [Google Scholar] [CrossRef] [PubMed]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
Jindal, V. A deep learning approach for arabic caption generation using roots-words. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
Jindal, V. Generating image captions in Arabic using root-word based recurrent neural networks and deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Al-Muzaini, H.A.; Al-Yahya, T.N.; Benhidour, H. Automatic arabic image captioning using rnn-lst m-based language model and cnn. Int. J. Adv. Comput. Sci. Appl. 2018, 9. [Google Scholar]
Mualla, R.; Alkheir, J. Development of an Arabic Image Description System. Int. J. Comput. Sci. Trends Technol. 2018, 6, 205–213. [Google Scholar]
ElJundi, O.; Dhaybi, M.; Mokadam, K.; Hajj, H.M.; Asmar, D.C. Resources and End-to-End Neural Network Models for Arabic Image Captioning. In Proceedings of the VISIGRAPP (5: VISAPP), Valletta, Malta, 27–29 February 2020; pp. 233–241. [Google Scholar]
Cheikh, M.; Zrigui, M. Active learning based framework for image captioning corpus creation. In Proceedings of the International Conference on Learning and Intelligent Optimization, Athens, Greece, 24–28 May 2020; pp. 128–142. [Google Scholar]
Afyouni, I.; Azhar, I.; Elnagar, A. AraCap: A hybrid deep learning architecture for Arabic Image Captioning. Procedia Comput. Sci. 2021, 189, 382–389. [Google Scholar] [CrossRef]
Hejazi, H.; Shaalan, K. Deep Learning for Arabic Image Captioning: A Comparative Study of Main Factors and Preprocessing Recommendations. Int. J. Adv. Comput. Sci. Appl. 2021, 12. [Google Scholar] [CrossRef]
Eddin Za’ter, M.; Talaftha, B. Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm. arXiv 2022, arXiv:2202.05474. [Google Scholar] [CrossRef]
Emami, J.; Nugues, P.; Elnagar, A.; Afyouni, I. Arabic Image Captioning using Pre-training of Deep Bidirectional Transformers. In Proceedings of the 15th International Conference on Natural Language Generation, Waterville, ME, USA, 18–22 July 2022; pp. 40–51. [Google Scholar]
Lasheen, M.T.; Barakat, N.H. Arabic Image Captioning: The Effect of Text Pre-processing on the Attention Weights and the BLEU-N Scores. Int. J. Adv. Comput. Sci. Appl. 2022, 13. [Google Scholar] [CrossRef]
Staniūtė, R.; Šešok, D. A systematic literature review on image captioning. Appl. Sci. 2019, 9, 2024. [Google Scholar] [CrossRef]
Chohan, M.; Khan, A.; Mahar, M.S.; Hassan, S.; Ghafoor, A.; Khan, M. Image Captioning using Deep Learning: A Systematic. Image 2020, 11. [Google Scholar]
Thorpe, S.; Fize, D.; Marlot, C. Speed of processing in the human visual system. Nature 1996, 381, 520–522. [Google Scholar] [CrossRef]
Biederman, I. Recognition-by-components: A theory of human image understanding. Psychol. Rev. 1987, 94, 115. [Google Scholar] [CrossRef] [PubMed]
Bracci, S.; Op de Beeck, H.P. Understanding human object vision: A picture is worth a thousand representations. Annu. Rev. Psychol. 2023, 74, 113–135. [Google Scholar] [CrossRef] [PubMed]
Hossain, M.Z.; Sohel, F.; Shiratuddin, M.F.; Laga, H. A comprehensive survey of deep learning for image captioning. ACM Comput. Surv. (CsUR) 2019, 51, 1–36. [Google Scholar] [CrossRef]
Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From show to tell: A survey on deep learning-based image captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 539–559. [Google Scholar] [CrossRef] [PubMed]
Xu, L.; Tang, Q.; Lv, J.; Zheng, B.; Zeng, X.; Li, W. Deep Image Captioning: A Review of Methods, Trends and Future Challenges. Neurocomputing 2023, 546, 126287. [Google Scholar] [CrossRef]
Elhagry, A.; Kadaoui, K. A thorough review on recent deep learning methodologies for image captioning. arXiv 2021, arXiv:2107.13114. [Google Scholar]
Luo, G.; Cheng, L.; Jing, C.; Zhao, C.; Song, G. A thorough review of models, evaluation metrics, and datasets on image captioning. IET Image Process. 2022, 16, 311–332. [Google Scholar] [CrossRef]
Hrga, I.; Ivašić-Kos, M. Deep image captioning: An overview. In Proceedings of the 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 20–24 May 2019; pp. 995–1000. [Google Scholar]
Ghandi, T.; Pourreza, H.; Mahyar, H. Deep Learning Approaches on Image Captioning: A Review. arXiv 2022, arXiv:2201.12944. [Google Scholar] [CrossRef]
Sharma, H.; Agrahari, M.; Singh, S.K.; Firoj, M.; Mishra, R.K. Image captioning: A comprehensive survey. In Proceedings of the 2020 International Conference on Power Electronics & IoT Applications in Renewable Energy and Its Control (PARC), Mathura, India, 28–29 February 2020; pp. 325–328. [Google Scholar]
Attai, A.; Elnagar, A. A survey on arabic image captioning systems using deep learning models. In Proceedings of the 2020 14th International Conference on Innovations in Information Technology (IIT), Virtual Conference, 17–18 November 2020; pp. 114–119. [Google Scholar]
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Int. J. Surg. 2021, 88, 105906. [Google Scholar] [CrossRef] [PubMed]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4651–4659. [Google Scholar]
Deng, Z.; Jiang, Z.; Lan, R.; Huang, W.; Luo, X. Image captioning using DenseNet network and adaptive attention. Signal Process. Image Commun. 2020, 85, 115836. [Google Scholar] [CrossRef]
Jiang, W.; Ma, L.; Jiang, Y.G.; Liu, W.; Zhang, T. Recurrent fusion network for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 499–515. [Google Scholar]
Parameswaran, S.N.; Das, S. A Bottom-Up and Top-Down Approach for Image Captioning using Transformer. In Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, Hyderabad, India, 18–22 December 2018; pp. 1–9. [Google Scholar]
Chu, Y.; Yue, X.; Yu, L.; Sergei, M.; Wang, Z. Automatic image captioning based on ResNet50 and LSTM with soft attention. Wirel. Commun. Mob. Comput. 2020, 2020, 8909458. [Google Scholar] [CrossRef]
Chen, X.; Lawrence Zitnick, C. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2422–2431. [Google Scholar]
Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1473–1482. [Google Scholar]
Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding the long-short term memory model for image caption generation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2407–2415. [Google Scholar]
Li, X.; Jiang, S. Know more say less: Image captioning based on scene graphs. IEEE Trans. Multimed. 2019, 21, 2117–2130. [Google Scholar] [CrossRef]
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
Guo, L.; Liu, J.; Zhu, X.; Yao, P.; Lu, S.; Lu, H. Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10327–10336. [Google Scholar]
He, S.; Liao, W.; Tavakoli, H.R.; Yang, M.; Rosenhahn, B.; Pugeault, N. Image captioning through image transformer. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10971–10980. [Google Scholar]
Kumar, D.; Srivastava, V.; Popescu, D.E.; Hemanth, J.D. Dual-Modal Transformer with Enhanced Inter-and Intra-Modality Interactions for Image Captioning. Appl. Sci. 2022, 12, 6733. [Google Scholar] [CrossRef]
Wang, Z.; Shi, S.; Zhai, Z.; Wu, Y.; Yang, R. ArCo: Attention-reinforced transformer with contrastive learning for image captioning. Image Vis. Comput. 2022, 128, 104570. [Google Scholar] [CrossRef]
Dubey, S.; Olimov, F.; Rafique, M.A.; Kim, J.; Jeon, M. Label-attention transformer with geometrically coherent objects for image captioning. Inf. Sci. 2023, 623, 812–831. [Google Scholar] [CrossRef]
Li, L.; Tang, S.; Deng, L.; Zhang, Y.; Tian, Q. Image caption with global-local attention. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
Zhang, Y.; Shi, X.; Mi, S.; Yang, X. Image captioning with transformer and knowledge graph. Pattern Recognit. Lett. 2021, 143, 43–49. [Google Scholar] [CrossRef]
Dong, X.; Long, C.; Xu, W.; Xiao, C. Dual graph convolutional networks with transformer and curriculum learning for image captioning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 2615–2624. [Google Scholar]
Nguyen, K.; Tripathi, S.; Du, B.; Guha, T.; Nguyen, T.Q. In defense of scene graphs for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1407–1416. [Google Scholar]
Yang, X.; Liu, Y.; Wang, X. Reformer: The relational transformer for image captioning. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5398–5406. [Google Scholar]
Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8928–8937. [Google Scholar]
Song, Z.; Zhou, X.; Dong, L.; Tan, J.; Guo, L. Direction relation transformer for image captioning. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 15 July 2021; pp. 5056–5064. [Google Scholar]
Ji, J.; Luo, Y.; Sun, X.; Chen, F.; Luo, G.; Wu, Y.; Gao, Y.; Ji, R. Improving image captioning by leveraging intra-and inter-layer global representation in transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1655–1663. [Google Scholar]
Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.W.; Ji, R. Dual-level collaborative transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2286–2293. [Google Scholar]
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Understanding LSTM Networks. Available online: https://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 31 May 2023).
Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
Qin, Y.; Du, J.; Zhang, Y.; Lu, H. Look back and predict forward in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 15 2019 to June 20 2019, Long Beach, CA, USA; 2019; pp. 8367–8375. [Google Scholar]
Hernández, A.; Amigó, J.M. Attention mechanisms and their applications to complex systems. Entropy 2021, 23, 283. [Google Scholar] [CrossRef] [PubMed]
Huang, L.; Wang, W.; Chen, J.; Wei, X.Y. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4634–4643. [Google Scholar]
Osolo, R.I.; Yang, Z.; Long, J. An attentive fourier-augmented image-captioning transformer. Appl. Sci. 2021, 11, 8354. [Google Scholar] [CrossRef]
Wang, D.; Liu, B.; Zhou, Y.; Liu, M.; Liu, P.; Yao, R. Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning. Appl. Sci. 2022, 12, 11875. [Google Scholar] [CrossRef]
Zhou, Y.; Zhang, Y.; Hu, Z.; Wang, M. Semi-autoregressive transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3139–3143. [Google Scholar]
Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
Plummer, B.A.; Wang, L.; Cervantes, C.M.; Caicedo, J.C.; Hockenmaier, J.; Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 2641–2649. [Google Scholar]
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, 29 June 2005; pp. 65–72. [Google Scholar]
Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. Spice: Semantic propositional image caption evaluation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Part V 14. Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 382–398. [Google Scholar]
Asian Languages—The Origin and Overview of Major Languages. Available online: https://gtelocalize.com/asian-languages-origin-and-overview/ (accessed on 18 August 2023).
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
Shaalan, K.; Siddiqui, S.; Alkhatib, M.; Abdel Monem, A. Challenges in Arabic natural language processing. In Computational Linguistics, Speech and Image Processing for Arabic Language; World Scientific: Singapore, 2019; pp. 59–83. [Google Scholar]
Shoukry, A.; Rafea, A. Preprocessing Egyptian dialect tweets for sentiment mining. In Proceedings of the Fourth Workshop on Computational Approaches to Arabic-Script-Based Languages, San Diego, CA, USA, 1 November 2012; pp. 47–56. [Google Scholar]
PyArabic. Available online: https://pypi.org/project/PyArabic/ (accessed on 2 May 2023).
Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A fast and furious segmenter for arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 11–16. [Google Scholar]
Al-Jazeera News Website. Available online: http://www.aljazeera.net (accessed on 23 May 2023).
Collect, Clean, and Label Your Data at Scale with CrowdFlower. Available online: https://visit.figure-eight.com/People-Powered-Data-Enrichment_T (accessed on 23 May 2023).
Ultra Edit Smart Translator. Available online: https://forums.ultraedit.com/how-to-change-the-menu-language-t11686.html (accessed on 23 May 2023).
Google Cloud Translation API. Available online: https://googleapis.dev/python/translation/latest/index.html (accessed on 23 May 2023).
Facebook Machine Translation. Available online: https://ai.facebook.com/tools/translate/ (accessed on 23 May 2023).
University of Helsinki Open Translation Services. Available online: https://www.helsinki.fi/en/language-centre/translation-services-for-the-university-community (accessed on 23 May 2023).
Arabic-COCO. Available online: https://github.com/canesee-project/Arabic-COCO (accessed on 2 May 2023).
Yang, Y.; Cer, D.; Ahmad, A.; Guo, M.; Law, J.; Constant, N.; Abrego, G.H.; Yuan, S.; Tar, C.; Sung, Y.H.; et al. Multilingual universal sentence encoder for semantic retrieval. arXiv 2019, arXiv:1907.04307. [Google Scholar]
Chen, C.; Mu, S.; Xiao, W.; Ye, Z.; Wu, L.; Ju, Q. Improving image captioning with conditional generative adversarial nets. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8142–8150. [Google Scholar]
Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 4467–4480. [Google Scholar] [CrossRef]
Bengio, S.; Vinyals, O.; Jaitly, N.; Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
Ranzato, M.; Chopra, S.; Auli, M.; Zaremba, W. Sequence level training with recurrent neural networks. arXiv 2015, arXiv:1511.06732. [Google Scholar]
Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinf. Learn. 1992, 5–32. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Multilingual BERT. Available online: https://github.com/google-research/bert/blob/master/multilingual.md (accessed on 1 June 2023).
Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
Safaya, A.; Abdullatif, M.; Yuret, D. Kuisail at semeval-2020 task 12: Bert-cnn for offensive speech identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; pp. 2054–2059. [Google Scholar]

Figure 1. Examples of images with their captions from Flicker8k dataset [1].

Figure 2. The steps of the human captioning process.

Figure 3. Stages of review protocol.

Figure 4. The general workflow of encoder-decoder image captioning mainly follows the following steps: visual model, visual encoding, and language model, and it is trained with different training strategies.

Figure 5. High-Level Diagram of Global CNN Features.

Figure 6. High-Level Diagram of Grid-based Features.

Figure 7. High-Level Diagram of Region-based Features.

Figure 8. High-Level Diagram of Graph-based Features.

Figure 9. Representation of Part of RNN.

Figure 10. Detailed Representation of RNN.

Figure 11. An LSTM’s repeating module [75].

Figure 12. Distribution of the studies over the datasets.

Figure 13. Distribution of the studies over the evaluation metrics.

Table 1. Summary of existing surveys of image captioning. The * symbol indicates that the aspect is partially covered.

Ref.	Year	Visual Model	Language Model	Loss	Dataset	Metrics	Languages Coverd	Conventional/Systematic
[27]	2019	✓	✓		✓	✓	One	Conventional
[22]	2019	✓	✓		✓	✓	One	Systematic
[32]	2019	✓	✓		✓	✓	One	Conventional
[23]	2020	✓	✓		✓	✓	One	Systematic
[34]	2020	✓	✓		✓	✓	One	Conventional
[35]	2020	✓	✓		✓	✓	One	Conventional
[30]	2021	✓	✓	*	✓	✓	One	Conventional
[28]	2022	✓	✓	✓	✓	✓	One	Conventional
[31]	2022	✓	✓	*	✓	✓	One	Conventional
[33]	2022	✓	✓	*	✓	✓	One	Conventional
[29]	2023	✓	✓	✓	✓	✓	One	Conventional
ours	2023	✓	✓	✓	✓	✓	Two	Systematic

Table 2. Research questions of our SLR.

Question	Purpose
RQ1. How do image caption generation techniques identify the important objects in the image?	Aims to comprehend the reasons for the diversity of image models, each seeking to solve a specific problem.
RQ2. Comparison of deep learning techniques used for caption generation?	It aims to discover the sequence model used to generate the captions.
RQ3. What types of evaluation mechanisms are used for image captioning?	Finds the evaluation metrics used in image captioning to measure the performance.
RQ4. What types of datasets are available for image captioning?	It aims to find another factor simulating the model’s performance, i.e., the dataset’s quality used to train it.
RQ5. What loss functions are used to train image captioning models?	It aims to clarify and compare the loss functions used to train the image captioning modes.
RQ6. What are the challenges in adopting the existing methods for image captioning in Arabic?	It focuses on how authors have adopted the methods used from English to Arabic captioning.

Table 3. Summary of studies based on CNN.

Ref.	Year	CNN	Parameters
[9]	2018	Inception-ResNetv2	54 M
[47]	2016	GoogleNet	71 M
[3]	2015	VGG19	138 M
[4]	2015	GoogleNet	71 M
[51]	2020	VGG16	138 M
[51]	2020	ResNet50	21 M
[5]	2017	VGG19	138 M
[5]	2017	ResNet152	41 M
[48]	2020	DenseNet-121	7.9 M
[52]	2015	VGGNet	138 M
[53]	2015	AlexNet	62.3 M
[53]	2015	VGGNet	138 M
[54]	2015	MatConvNet	NA

Table 4. An overview of the training, validation, and testing datasets for English image captioning.

Dataset	Training	Validation	Testing
MSCOCO [84]	82,783	40,504	40,775
Flicker30K [86]	29,000	1000	1000
Flicker8k [1]	6000	1000	1000

Table 5. Original task, input, and summary description of the standard evaluation metrics.

Metric	Original Task	Inputs			Description
Metric	Original Task	Pred	Refs	Image	Description
BLEU	Translation	✓	✓		Relies on n-gram precision taking n-grams up to four.
METEOR	Translation	✓	✓		The recall of matching unigrams from the predicted and reference sentences.
ROUGE	Summarization	✓	✓		Taking into account the longest subsequence of tokens in both the predicted and the reference caption, and both are in the same relative order.
CIDEr	Captioning	✓	✓		Evaluates how well a produced caption matches the reference captions.
SPICE	Captioning	✓	✓	✓	Semantically captures human judgments over model-generated captions

Table 6. Distribution of the architectures and Attention used in the studies.

Year	Study	Architecture	Attention
2017	[11]	Compositional	No
2018	[12]	Compositional	No
2018	[13]	Encoder-decoder	No
2018	[14]	Encoder-decoder	No
2020	[15]	Encoder-decoder	No
2020	[16]	Encoder-decoder	No
2021	[17]	Encoder-decoder	Yes
2021	[18]	Encoder-decoder	No
2022	[19]	Encoder-decoder	Yes
2022	[20]	Encoder-decoder	Yes
2022	[21]	Encoder-decoder	Yes

Table 7. Summary of Arabic datasets.

Ref.	Dataset	Access	Size	Notes
[11]	ImageNet dataset	Private	10,000 images	The captions are manually written in Arabic by professional Arabic translators.
[11]	Al-Jazeera news website	Private	100,000 images	A native Arabic website.
[12]	Flicker8k dataset	Private	8000 images	Manually written captions in Arabic by professional Arabic translators.
[12]	Middle Eastern countries’ newspapers	Private	405,000 images	Arabic
[13]	MSCOCO	Private	1166 images	Human-generated captions using Crowd-Flower Crowdsourcing [99].
[13]	Flicker8k	Private	150 images	Translated by professional translator.
[13]	Flicker8k	Private	2111 images	Translated by GCT, then verified by Arabic native speakers.
[14]	Flicker8k	Private	2000 images	Translated using smart translator Ultra edit [100].
[15]	Flicker8k	Public	8000 images	Translated using GCT and then validated by a professional Arabic translator.
[16]	ArabicFlickr1K	Private	1095 images	Framework for creating an image captioning corpus based on active learning.
[17]	MSCOCO	Private	123,287 images	GCT
[17]	Flicker8k	Private	8000 images	GCT
[19]	MS COCO	Private	120,000 images	The translation uses a combination of GTC, Facebook Machine Translation (FMT), and translation services offered by the University of Helsinki (UH).
[19]	Flicker30k	Private	32,000 images	The translation uses a combination of GTC, FMT, and UH.
[19]	Flicker8k	Private	8000 images	The translation uses a combination of GTC, FMT, and UH.

Table 8. An overview of models for image captioning based on deep learning. Scores are gathered from the corresponding papers. For the metrics, a higher value is better.

Year	Model	Global	Grid	Region	Graph	Language Model	BLEU-4	CIDEr
2015	DMSM [53]	✓				-	-	-
	Mind’s Eye [52]	✓				RNN	18.8	-
	Hard-Attention [3]		✓			LSTM	25.0	-
	gLSTM [54]	✓				LSTM	26.4	81.25
	NIC [4]	✓				LSTM	27.7	85.5
2016	ATT-FCN [47]	✓				LSTM	30.4	-
2017	SCA-CNN [5]		✓			LSTM	31.1	-
	GLA [64]	✓		✓		LSTM	31.2	96.4
	Adaptive Attention [76]		✓			LSTM	33.2	108.5
2018	Sharma et al. [9]	✓				Transformer	-	-
	Up-Down [56]		✓			LSTM	36.3	120.1
	Bottom-Up [50]		✓	✓		Transformer	36.5	120.6
	RFNet [49]		✓			LSTM	36.5	121.9
2019	Obj-R+Rel-A+CIDEr [55]				✓	LSTM	36.3	120.2
	Chen et al. [106]			✓		LSTM	38.3	123.2
	CAVP [7]			✓		LSTM	38.6	126.3
	ETA [70]			✓		Transformer	39.3	126.6
	LBPF [77]			✓		LSTM	38.3	127.6
	Herdade et al. [69]			✓		Transformer	38.6	128.3
	AoANet [79]			✓		LSTM	38.9	129.8
	MT [107]			✓		Transformer	40.7	134.1
2020	AICRL [51]			✓		LSTM	-	-
	D-ada [48]	✓				LSTM	32.6	-
	MAD+SAP (F) [6]			✓		LSTM	38.6	128.8
	He et al. [59]			✓		Transformer	39.5	130.8
	M2 Transformer [10]			✓		Transformer	39.1	131.2
	X-LAN [60]			✓		LSTM	39.5	132.0
	NG-SAN [58]					Transformer	39.9	132.1
	X-Transformer [60]			✓		Transformer	39.7	132.8
2021	SG2Caps [67]				✓	LSTM	33.0	112.3
	Trans[D2GPO+MLE]+KG [65]				✓	Transformer	34.39	112.60
	SATIC [82]			✓		Transformer	38.4	129.0
	Dual-GCN+Transformer+CL [66]				✓	Transformer	39.7	129.2
	AFCT [80]			✓		Transformer	38.7	130.1
	(w/MAC) [72]			✓		Transformer	39.5	131.6
	DRT [71]			✓		Transformer	40.4	133.2
	Luo et al. [73]		✓	✓		Transformer	39.8	133.8
2022	Dual-Modal Transformer [61]		✓	✓		Transformer	-	-
	Wang et al. [81]			✓		Transformer	39.3	129.9
	ReFormer [68]				✓	Transformer	39.8	131.9
	ArCo [62]			✓		Transformer	41.4	139.7
2023	LATGeO [63]			✓		Transformer	38.8	131.7

Table 9. The reported results from Microsoft COCO server.

Year	Ref	Test Server
Year	Ref	B-1	B-2	B-3	B-4	M	R	C	S
2015	Hard-Attention [3]	70.5	52.8	38.3	27.7	24.1	51.6	86.5	17.2
	NIC [4]	71.3	54.2	40.7	30.9	25.4	53.0	94.3	18.2
	Mind’s Eye [52]	-	-	-	18.4	19.5	-	53.1	-
2016	ATT-FCN [47]	73.1	56.5	42.4	31.6	25.0	53.5	94.3	18.2
2017	SCA-CNN [5]	71.2	54.2	40.4	30.2	24.4	52.4	91.2	-
	Adaptive Attention [76]	74.8	58.4	44.4	33.6	26.4	55.0	104.2	-
2018	RFNet [49]	80.4	64.9	50.1	38.0	28.2	58.2	122.9	-
	Up-Down [56]	80.2	64.1	49.1	36.9	27.6	57.1	117.9	21.5
2019	AoANet [79]	81.0	65.8	51.4	39.4	29.1	58.9	126.9	-
	Chen et al. [106]	81.9	66.3	51.7	39.6	28.7	59.0	123.1	-
	Obj-R+Rel-A [55]	79.2	62.6	47.5	35.4	27.3	56.2	115.1	-
	CAVP [7]	80.1	64.7	50.0	37.9	28.1	28.1	121.6	-
	MT [107]	81.7	66.8	52.4	40.4	29.4	59.6	130.0	-
	ETA [70]	81.2	65.5	50.9	38.9	28.6	58.6	122.1	-
2020	M2 Transformer [10]	81.6	66.4	51.8	39.7	29.4	59.2	129.3	-
	MAD+SAP [6]	80.5	65.1	50.4	38.4	28.6	58.7	125.1	-
	X-LAN [60]	81.4	66.5	52.0	40.0	29.7	59.5	130.2	-
	X-Transformer [60]	81.9	66.9	52.4	40.3	29.6	59.5	131.1	-
	NG-SAN [58]	80.8	65.4	50.8	38.8	29.0	58.7	126.3	-
	He et al. [59]	81.2	-	-	39.6	29.1	59.2	127.4	-
2021	(w/MAC) [72]	81.6	66.5	51.9	39.7	29.4	59.1	130.3	-
	DLCT [73]	82.4	67.4	52.8	40.6	29.8	59.8	133.3	-
	DRT [71]	82.7	67.7	53.1	40.9	29.6	59.8	132.2	-
2022	ReFormer [68]	82.0	-	-	40.1	29.8	59.9	129.9	-
	ArCo [62]	83.4	68.8	54.3	42.0	30.6	60.8	138.5	-
2023	LATGeO [63]	80.5	64.8	50.0	37.9	28.8	58.1	126.7	-

Table 10. The reported results from Microsoft COCO Karpathy test split [85], with cross-entropy loss function.

Year	Ref	(Cross-Entropy Loss)
Year	Ref	B-1	B-2	B-3	B-4	M	R	C	S
2015	Hard-Attention [3]	71.8	50.4	35.7	25.0	23.04	-	-	-
	Soft-Attention [3]	70.7	49.2	34.4	24.3	23.9	-	-	-
	NIC [4]	-	-	-	27.7	23.7	-	85.5	-
	Mind’s Eye [52]	-	-	-	18.8	19.6	-	-	-
	gLSTM [54]	67.0	49.1	35.8	26.4	22.74	-	81.25	-
2016	ATT-FCN [47]	0.709	0.537	0.402	0.304	0.243	-	-	-
2017	SCA-CNN [5]	71.9	54.8	41.1	31.1	25.0	-	-	-
	GLA [64]	72.5	55.6	41.7	31.2	24.9	53.3	96.4
	Adaptive Attention [76]	74.2	58.0	43.9	33.2	26.6	-	108.5	-
2018	RFNet [49]	76.4	60.4	46.6	35.8	27.4	56.5	112.5	20.5
	Bottom-Up [50]	76.2	60.4	46.8	36.3	27.9	56.7	114.6	20.9
	Up-Down [56]	77.2	-	-	36.2	27.0	56.4	113.5	20.3
2019	AoANet [79]	77.4	-	-	37.2	28.4	57.5	119.8	21.3
	Obj-R+Rel-A+CIDEr. [55]	0.767	0.598	0.453	0.338	0.262	0.549	1.103	0.198
	MT [107]	77.3	-	-	37.4	28.7	57.4	119.6	-
	LBPF [77]	77.8	-	-	37.4	28.1	57.5	116.4	-
	ETA [70]	77.3	-	-	37.1	28.2	57.1	117.9	21.4
2020	MAD+SAP (F) [6]	-	-	-	37.1	28.1	57.2	117.3	21.3
	D-ada [48]	73.9	57.0	42.2	32.6	27.0	-	-	-
	X-LAN [60]	78.0	62.3	48.9	38.2	28.8	58.0	122.0	21.9
	X-Transformer [60]	77.3	61.5	47.8	37.0	28.7	57.5	120.0	21.8
2021	Trans[D2GPO+MLE]+KG [65]	76.24	-	-	34.39	27.71	-	112.60	-
	Dual-GCN+Transformer+CL [66]	82.2	67.6	52.4	39.7	29.7	59.7	129.2	-
	SG2Caps [67]	-	-	-	32.6	26.4	55.0	106.6	19.8
2022	Wang et al. [81]	76.6	-	-	36.3	28.2	56.9	116.1	-
	ReFormer [68]	82.3	-	-	39.8	29.7	59.8	131.9	23.0
2023	LATGeO [63]	76.5	-	-	36.4	27.8	56.7	115.8	-

Table 11. The reported results from Microsoft COCO Karpathy test split [85], with CIDEr Score Optimization.

Year	Ref	(CIDEr Score Optimization)
Year	Ref	B-1	B-2	B-3	B-4	M	R	C	S
2018	RFNet [49]	79.1	63.1	48.4	36.5	27.7	57.3	121.9	21.2
	Bottom-Up [50]	78.0	-	-	36.5	28.1	57.2	120.6	21.6
	Up-Down [56]	79.8			36.3	27.7	56.9	120.1	21.4
2019	AoANet [79]	80.2	-	-	38.9	29.2	58.8	129.8	22.4
	Herdade et al. [69]	80.5	-	-	38.6	28.7	58.4	128.3	22.6
	Chen et al. [106]	81.1	65.0	50.4	38.3	28.6	58.6	123.2	22.1
	Obj-R+Rel-A+CIDEr [55]	79.2	63.2	48.3	36.3	27.6	-	120.2	-
	CAVP [7]	-	-	-	38.6	28.3	58.5	126.3	21.6
	MT [107]	81.9	-	-	40.7	29.5	59.7	134.1	-
	LBPF [77]	80.5	-	-	38.3	28.5	58.4	127.6	22.0
	ETA [70]	81.5	-	-	39.3	28.8	58.9	126.6	22.7
2020	M2 Transformer [10]	80.8	-	-	39.1	29.2	58.6	131.2	22.6
	MAD+SAP (F) [6]	-	-	-	38.6	28.7	58.5	128.8	22.2
	NG-SAN [58]	-	-	-	39.9	29.3	59.2	132.1	23.3
	He et al. [59]	80.8	-	-	39.5	29.1	59.0	130.8	22.8
	X-LAN [60]	80.8	65.6	51.4	39.5	29.5	59.2	132.0	23.4
	X-Transformer [60]	80.9	65.8	51.5	39.7	29.5	59.1	132.8	23.4
2021	(w/MAC) [72]	81.5	-	-	39.5	29.3	58.9	131.6	22.8
	DLCT [73]	81.4	-	-	39.8	29.5	59.1	133.8	23.0
	SATIC [82]	-	-	-	38.4	28.8	-	129.0	22.7
	DRT [71]	81.7	-	-	40.4	29.5	59.3	133.2	23.3
	SG2Caps [67]	-	-	-	33.0	26.2	55.6	112.3	19.4
	AFCT [80]	80.5	-	-	38.7	29.2	58.4	130.1	22.5
2022	Wang et al. [81]	80.8	-	-	39.3	29.0	58.9	129.9	-
	ArCo [62]	82.8	-	-	41.4	30.4	60.4	139.7	24.5
2023	LATGeO [63]	81.0	-	-	38.8	29.2	58.7	131.7	22.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alsayed, A.; Arif, M.; Qadah, T.M.; Alotaibi, S. A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages. Appl. Sci. 2023, 13, 10894. https://doi.org/10.3390/app131910894

AMA Style

Alsayed A, Arif M, Qadah TM, Alotaibi S. A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages. Applied Sciences. 2023; 13(19):10894. https://doi.org/10.3390/app131910894

Chicago/Turabian Style

Alsayed, Ashwaq, Muhammad Arif, Thamir M. Qadah, and Saud Alotaibi. 2023. "A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages" Applied Sciences 13, no. 19: 10894. https://doi.org/10.3390/app131910894

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages

Abstract

1. Introduction

1.1. Human Captioning

1.2. Previous Surveys

1.3. Motivation and Contributions

1.4. Paper Organisation

2. Methodology

2.1. Research Questions

2.2. Exclusion Criteria

2.3. Search Process

3. Methods for Image Captioning

3.1. Visual Models

3.1.1. Feature Vector Using Convolutional Neural Network (CNN)

3.1.2. Object Detection

3.2. Visual Encoding

3.2.1. Global CNN Features

3.2.2. Grid Features

3.2.3. Region-Based

3.2.4. Graph-Based Attention

3.2.5. Self-Attention Encoding

3.3. Language Models

3.3.1. Recurrent Neural Networks (RNN)

3.3.2. Long Short-Term Memory (LSTM)

3.3.3. Transformer

3.4. Loss Functions

3.4.1. Cross-Entropy (CE)

3.4.2. Self-Critical Sequence Training (SCST)

3.4.3. Kullback–Leibler Divergence (KL Divergence)

4. Available Datasets in English Image Captioning

4.1. Microsoft COCO (MS COCO)

4.2. Flicker30k

4.3. Flicker8k

5. Evaluation Metrics

5.1. Bilingual Evaluation Understudy (BLEU)

5.2. METEOR

5.3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

5.4. CIDEr (Consensus-Based Image Description Evaluation)

5.5. SPICE (Semantic Propositional Image Caption Evaluation)

6. Arabic Image Captioning (AIC)

6.1. Model Architecture

6.2. Visual Models

6.3. Language Models

6.4. Datasets in Arabic Image Captioning

6.5. Evaluation Metrics

7. Discussion

8. Limitations and Future Directions

9. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI