1. Introduction
Recently, the number of parameters in deep learning models has grown exponentially. GPT-3 [1] has 175B parameters, more than 500 times as many as the BERT-large [2] model. As a result, training costs have increased significantly: training GPT-3 takes 3.14 × 10²³ FLOPs, almost 590 times the cost of BERT-large. This corresponds to hundreds of years of computation on a single NVIDIA V100 GPU and costs over $4.6 million. Training such large models from scratch every time is expensive; therefore, research on reducing training costs is needed.
A common way to reduce training costs is transfer learning with pre-trained models. Transfer learning solves one problem and then applies the result to a different but related problem. In computer vision, a pre-trained feature extractor such as a VGG network [3] is fine-tuned on target tasks [4,5,6,7]. In NLP, pre-trained BERT [2] is fine-tuned on target tasks [8,9,10]. However, previous methods take a pre-trained network and use it as only part of the entire model, e.g., connecting the pre-trained feature extractor to a Region Proposal Network to detect objects in an image [5], or attaching a pooling layer and softmax classifier to pre-trained BERT for sentence embedding [8]. Thus, part of the model is still trained from scratch, and the more parts are trained from scratch, the higher the training cost.
The Compositional Intelligence (CI) method, on the other hand, constructs the entire model by combining pre-trained models. The advantage of using pre-trained models is that training converges quickly and performance can improve. A representative architecture to which CI can be applied is the encoder–decoder structure: since the encoder and decoder communicate through a hidden layer in the middle, they can be connected whenever the shape of the hidden vector is the same, so various target tasks can be addressed by combining a pre-trained encoder and decoder. Previous CI studies combined two pre-trained models from the same domain. Yoo et al. [11] performed the image style transfer task by combining an encoder and decoder trained on images of different styles. Oh et al. [12] applied CI to machine translation: Transformers are first pre-trained on a monolingual task, and then the machine translation task is trained by connecting an encoder and decoder trained on different languages. Yoo et al. [11] applied CI to the image domain and Oh et al. [12] to the text domain; because the data come from a single domain, the problem of domain differences does not arise. We extended CI to dual domains, image and text, and used two models trained in each domain for the image captioning task.
As shown in Figure 1, the typical image captioning model has an encoder–decoder structure. Most encoders use pre-trained feature extractors [3,5,13]. In the decoder, NLP models such as LSTM [14,15] and the Transformer [16] are used as caption generators. After the Transformer [16] was introduced and showed higher performance than existing NLP models, models using the Transformer as a caption generator were studied [17,18,19,20]. Since the Transformer [16] is much larger than LSTM [14], training costs increased along with the performance improvement. Usually the caption generator is trained from scratch, so training is expensive.
Since image captioning has an encoder–decoder structure, the CI method can be applied, reducing training costs while maintaining or improving performance. We used the pre-trained image feature extractor [21] commonly used in image captioning models, and devised and trained a pre-training task that works well for a caption generator. Connecting two pre-trained models trained on data from different domains requires a feature mapping layer (FML) to mitigate the differences in data distributions. We compared the training cost and metric scores [22,23,24,25,26] of our CI model, which uses a pre-trained caption generator, with those of a From Scratch model that trains the caption generator from scratch. To compare training cost, we measured training time with early stopping when fine-tuning on the image captioning task. Compared to the From Scratch model, our CI model significantly shortened training time, and all metric scores showed a slight improvement. In summary, the contributions of this study are as follows.
We extended our CI study from a single domain to dual domains;
We devised a pre-training task for the caption generator;
We showed, on the image captioning task, that the CI model is effective in reducing training costs.
3. Approach
A typical image captioning model uses a pre-trained image feature extractor, but the caption generator is trained from scratch. In contrast, we pre-trained the caption generator and trained image captioning using CI. A pre-training method suited to the target task to which CI is applied is required, since an inappropriate pre-training method can degrade performance during fine-tuning. Therefore, we devised and applied a pre-training method suitable for the image caption generator.
3.1. Pre-Training Model Architecture
We used the Transformer [16] base model as the pre-training model. As shown in Figure 2, the Transformer is divided into encoder and decoder, and the decoder is responsible for generating text, so we reused the pre-trained decoder as the caption generator. We used keywords as input to pre-training, but there is no need for positional information between keywords: regardless of keyword order, the model completes the sentence by placing each keyword in an appropriate position during generation. Therefore, the Positional Encoding (PE) used in the original Transformer encoder was removed. The pre-training model was constructed as shown in Figure 3. We extract keywords from sentences and use them as encoder input. The decoder computes attention scores between the keywords passed through the encoder and the previously predicted tokens, updates the token representations with those attention scores, and predicts the next token. As shown in Figure 4, the CI model requires an FML because the two pre-trained models have different vector distributions; the From Scratch model does not need an FML because it trains the caption generator from scratch.
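To make the order-invariant encoder concrete, the following is a minimal PyTorch sketch (our own illustrative class and hyperparameters, not the exact implementation) of an encoder that embeds keyword tokens without adding positional encoding:

```python
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    """Transformer encoder without Positional Encoding: keyword order
    carries no information, so only token embeddings are used."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, keyword_ids, pad_mask=None):
        # Unlike the original Transformer, no positional encoding is added.
        x = self.embed(keyword_ids)
        return self.encoder(x, src_key_padding_mask=pad_mask)
```

Because no positional signal is added, self-attention is permutation-equivariant: any reordering of the same keywords yields the same set of encoder outputs.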
3.2. Pre-Training Method
Simply connecting pre-trained models to each other does not unconditionally improve performance. To connect the models harmoniously, a pre-training method suited to the target task must be devised. We used two pre-trained models for the image captioning task. As shown in Figure 1, the image captioning model consists of two parts: the image feature extractor and the caption generator. As shown in Figure 4, a Faster R-CNN built on ResNet-101 was used as the image feature extractor [21]. Between 10 and 100 2048-dimensional feature vectors are extracted per image.
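Because the number of detected objects varies between 10 and 100 per image, the region features must be padded and masked before batching; a minimal sketch with hypothetical names:

```python
import torch

def batch_region_features(feature_list, max_boxes=100, dim=2048):
    """Pad variable-length (num_boxes, 2048) feature tensors into one
    batch and build a padding mask (True = padded position)."""
    batch = torch.zeros(len(feature_list), max_boxes, dim)
    mask = torch.ones(len(feature_list), max_boxes, dtype=torch.bool)
    for i, feats in enumerate(feature_list):
        n = min(feats.size(0), max_boxes)
        batch[i, :n] = feats[:n]
        mask[i, :n] = False
    return batch, mask
```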
Each object feature vector extracted by the image feature extractor is fed into the caption generator; that is, object information is used to create captions. The caption generator learns to insert verbs or adjectives describing the relationships between objects when no information other than the objects is available. As shown in Figure 5, the objects detected in an image and the keywords of a sentence play the same role. Therefore, we assumed that pre-training the caption generator to construct sentences from keywords would suit the image captioning task.
Figure 6 shows the training process and input of the pre-training model. The input is generated in two steps, as shown in Figure 6b. First, we extract keywords from the sentence: the sentence is split into words, and stopwords are excluded [42]. Then, we select a subset of the keywords and shuffle them to form the input data. Training used only 70% of the keywords from each sentence; this ratio was determined through the experiments in Section 4.3.1.
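A minimal sketch of this input-generation step, assuming NLTK's English stopword list as a stand-in for the stopword set of [42]:

```python
import random
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def make_keyword_input(sentence, keep_ratio=0.7):
    """Split a sentence into words, drop stopwords, keep a random 70%
    of the remaining keywords, and shuffle their order."""
    words = [w for w in sentence.lower().split() if w.isalpha()]
    keywords = [w for w in words if w not in STOPWORDS]
    if not keywords:
        return []
    k = max(1, int(len(keywords) * keep_ratio))
    sampled = random.sample(keywords, k)
    random.shuffle(sampled)  # order is irrelevant to the model
    return sampled

# e.g., make_keyword_input("A man is riding a horse on the beach")
# -> ['horse', 'riding', 'beach']  (one possible outcome)
```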
Table 1 shows an example of the input actually produced from a sentence. As shown in Figure 6a, the generated input is fed into the encoder of the pre-training model, and the decoder is trained to generate the original sentence. A word-level tokenizer was used instead of a sub-word tokenizer so that keywords match the per-object feature units of the images used in fine-tuning. We used the pre-trained Transformer-XL tokenizer [43] from Huggingface [44].
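For illustration, loading that tokenizer through the transformers library might look as follows (assuming a library version that still ships the Transformer-XL tokenizer, which is word-level by design):

```python
from transformers import TransfoXLTokenizer

# Transformer-XL tokenizes at the word level (no sub-word splitting),
# so one keyword maps to one token, as one object maps to one feature.
tokenizer = TransfoXLTokenizer.from_pretrained("transfo-xl-wt103")
ids = tokenizer("man riding horse")["input_ids"]
```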
Since our pre-training method creates input keywords from a single sentence, no input-output data pairs are required, so the data needed for pre-training are easy to obtain. We used the WikiText-103 dataset [45] for pre-training. To obtain a distribution similar to the MS-COCO captions of the image captioning dataset, several preprocessing steps were performed: the MS-COCO captions range from 21 to 250 characters in length, so wiki sentences shorter than 21 or longer than 250 characters were excluded. Pre-training was performed for 5 epochs on about 730,000 sentences.
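A sketch of this length filter, assuming the Huggingface datasets copy of WikiText-103 (the paper does not specify how the corpus was loaded):

```python
from datasets import load_dataset

def load_filtered_wikitext(min_len=21, max_len=250):
    """Load WikiText-103 and keep lines whose character length falls
    inside the range observed for MS-COCO captions."""
    wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
    return [t.strip() for t in wiki["text"]
            if min_len <= len(t.strip()) <= max_len]
```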
3.3. Feature Mapping Layer (FML)
We pre-trained models in the image domain and the text domain. As shown in Figure 7, an image and a word with the same meaning, trained in different domains, have different vector values, because different models produce differently distributed feature spaces. We therefore need a layer that smoothly maps between the two feature spaces. Previous studies also showed that models with an FML outperform models without one [11,12]. In the experiments in Section 4.3.2, the Overcomplete FML showed the best performance. The FML consists of two fully-connected layers; with a ReLU between them, it forms a non-linear mapping function.
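A minimal PyTorch sketch of such an FML; the hidden width of 2048 is our assumption for the Overcomplete variant (a hidden layer wider than the input and output):

```python
import torch.nn as nn

class FeatureMappingLayer(nn.Module):
    """Two fully-connected layers with a ReLU in between, mapping image
    features into the text feature space. The 2048-wide hidden layer
    illustrates the Overcomplete variant (an assumption)."""
    def __init__(self, dim=512, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)
```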
3.4. Compositional Intelligence (CI)
As shown in Figure 4, the CI method connects the pre-trained image feature extractor and the caption generator through the FML. We used a model composed of Faster R-CNN [5] and ResNet-101 [13] as the pre-trained image feature extractor. However, we did not train this model ourselves; we used the extracted image vectors provided by Anderson et al. [21], who pre-trained the image feature extractor on the Visual Genome dataset [46] and extracted object feature vectors from the images of the MS-COCO dataset [47]. The image inside each bounding box produced by the Region Proposal Network (RPN) of Faster R-CNN was given as input to ResNet-101, and an intermediate feature map of ResNet-101 was used, yielding a 2048-dimensional vector for each bounding box.
Before proceeding with CI, we reduced the 2048-dimensional vectors to 512 dimensions using an auto-encoder, because our experimental environment could not accommodate the full 2048-dimensional vectors. We confirmed experimentally that this dimensionality reduction caused no performance degradation. The reduced vector passes through the FML and is fed as the Key and Value of the second Multi-Head Attention of the Transformer decoder, the caption generator, whose weights come from the Transformer pre-trained with keywords. The completed model thus connects Faster R-CNN with ResNet-101 to the Transformer decoder through the FML. Fine-tuning on the image captioning task is then performed on the MS-COCO dataset.
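Putting the pieces together, a hedged sketch of the composed forward pass (all class and argument names are ours): region features are reduced to 512 dimensions by the auto-encoder's encoder half, mapped by the FML, and supplied as the Key and Value (memory) of the decoder's cross-attention.

```python
import torch
import torch.nn as nn

class CaptioningCI(nn.Module):
    """CI composition: pre-extracted region features -> auto-encoder
    bottleneck (2048 -> 512) -> FML -> pre-trained Transformer decoder."""
    def __init__(self, ae_encoder, fml, decoder, embed, vocab_size, d_model=512):
        super().__init__()
        self.ae_encoder = ae_encoder  # e.g., the Linear(2048, 512) half of the auto-encoder
        self.fml = fml                # FeatureMappingLayer (Section 3.3)
        self.decoder = decoder        # pre-trained decoder (assumed batch_first=True)
        self.embed = embed            # token embedding from the pre-trained model
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, caption_ids, feat_pad_mask=None):
        # Map image features into the decoder's text feature space.
        memory = self.fml(self.ae_encoder(region_feats))    # (B, boxes, 512)
        # Positional encoding on the caption tokens is omitted for brevity.
        tgt = self.embed(caption_ids)                       # (B, T, 512)
        # Causal mask: each position may only attend to earlier tokens.
        T = caption_ids.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=caption_ids.device), diagonal=1)
        # The image memory acts as Key/Value in the decoder's cross-attention.
        h = self.decoder(tgt, memory, tgt_mask=causal,
                         memory_key_padding_mask=feat_pad_mask)
        return self.out(h)                                  # next-token logits
```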