*3.3. Results*

As mentioned in the previous sections, we used four different pre-trained CNNs for the image encoding branch: EfficientNet, ResNet50, Inception\_v3, and VGG16. Figure 7 illustrates the evolution of the triplet loss during training for these networks. The loss decreased gradually as the number of iterations increased and, in general, the models reached stable values after 40 iterations. Figure 8 shows examples of the features obtained by the image and text encoding branches at the end of training.

**Figure 7.** Evolution of the loss function for EfficientNet, ResNet50, Inception\_v3, and VGG16.
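For reference, a minimal sketch of a hinge-based triplet ranking loss for training such a two-branch matching network is given below. The margin value, embedding shapes, and use of cosine similarity are illustrative assumptions, not the exact settings reported in this work.

```python
import torch
import torch.nn.functional as F

def triplet_loss(img_emb, txt_emb, margin=0.2):
    """Triplet ranking loss over a batch of matched image/text embeddings.
    Row i of img_emb and txt_emb is assumed to be a true pair.
    The margin value is an illustrative assumption."""
    # Normalize embeddings and compute the pairwise cosine-similarity matrix.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()              # (batch, batch)

    pos = scores.diag().view(-1, 1)             # similarity of true pairs
    # Image-to-text: push negative captions below the positive by the margin.
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # Text-to-image: push negative images below the positive by the margin.
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)

    # Mask out the diagonal (true pairs are not negatives).
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()
```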

Table 1 reports the performance of DBTN using EfficientNet as the pre-trained CNN for encoding the visual features. With a single sentence (Sent.1), the method achieved 13.02%, 40%, and 59.30% in R@1, R@5, and R@10, respectively. In contrast, when the five sentences were fused, the performance improved to 17.20%, 51.39%, and 73.02%, respectively. We also computed the average of R@1, R@5, and R@10 for each individual sentence and for the fusion, and observed that the fusion obtained the highest average score. Table 2 shows the results obtained using ResNet50 as the image encoder. For Sent.1, the performances in R@1, R@5, and R@10 were 10.93%, 38.60%, and 54.41%, respectively, while with the fusion the method achieved 13.72%, 50.93%, and 69.06%. Similarly, Table 3 shows that with Inception\_v3 the fusion again outperformed the individual sentences. Finally, the results obtained with VGG16 are shown in Table 4. For Sent.1, our method achieved 10%, 36.27%, and 51.62% in R@1, R@5, and R@10, respectively, whereas the fusion yielded 11.86%, 44.41%, and 63.72%.
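To clarify how these scores are obtained, the sketch below shows one common way to compute Recall@K for text-to-image retrieval from a similarity matrix. The function and variable names, and the assumption that entry (i, i) corresponds to the ground-truth pair, are illustrative and not the exact evaluation code of this work.

```python
import numpy as np

def recall_at_k(similarity, ks=(1, 5, 10)):
    """Recall@K for text-to-image retrieval.

    similarity: (num_texts, num_images) matrix, where entry (i, i) is the
    score of the ground-truth pair (assumed alignment, for illustration).
    """
    num_queries = similarity.shape[0]
    # Rank images for each text query in decreasing order of similarity.
    order = np.argsort(-similarity, axis=1)
    # Position of the ground-truth image in each ranked list.
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(num_queries)])
    # Percentage of queries whose true image appears in the top K.
    return {f"R@{k}": 100.0 * np.mean(ranks < k) for k in ks}
```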

According to these preliminary results, fusing the representations of the five sentences produced better matching results than using a single sentence. Additionally, EfficientNet performed better than the other three pre-trained networks, which indicates that learning visual features with EfficientNet was quite effective and led to higher scores than the other pre-trained CNNs.
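One simple way to realize such a fusion is to average the five sentence embeddings before matching; this averaging scheme and the function name below are illustrative assumptions, as other fusion strategies are possible.

```python
import torch.nn.functional as F

def fuse_sentence_embeddings(sent_embs):
    """Fuse the embeddings of the five sentences describing one image.

    sent_embs: tensor of shape (5, dim), one row per sentence embedding.
    Averaging followed by re-normalization is an illustrative choice.
    """
    fused = sent_embs.mean(dim=0)
    return F.normalize(fused, dim=0)
```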

**Figure 8.** Image and text features generated by the image and text encoding branches.


**Table 1.** Bidirectional text-image matching results on our dataset using EfficientNet-B2.

**Table 2.** Bidirectional text-image matching results on our dataset using ResNet50.


**Table 3.** Bidirectional text-image matching results on our dataset using Inception\_v3.


**Table 4.** Bidirectional text-image matching results on our dataset using VGG16.


To analyze the performance of image retrieval given a query text in more detail, we show several successful and failure scenarios. For example, Figure 9 presents a query text (five sentences) with its image and the top nine retrieved images (from left to right); the image in the red box is the ground-truth image of the query text (true match). Our method returned reasonably relevant images, with all nine images sharing almost the same content (objects). In these three scenarios, the rank of the retrieved true image was 1, 6, and 1, respectively.
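A minimal sketch of how such ranked lists can be produced from the learned embeddings is given below; the function and variable names are hypothetical and only illustrate cosine-similarity ranking of the image gallery against one fused text query. The rank of the ground-truth image can then be read off as its position in the full ranking.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, image_embs, k=9):
    """Return the indices of the top-k images for one fused text query,
    ranked by cosine similarity (illustrative retrieval procedure)."""
    query_emb = F.normalize(query_emb, dim=0)    # (dim,)
    image_embs = F.normalize(image_embs, dim=1)  # (num_images, dim)
    scores = image_embs @ query_emb              # (num_images,)
    return torch.topk(scores, k).indices
```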


**Figure 9.** Successful scenarios (**a**, **b**, and **c**) of text-to-image retrieval.

In contrast, Figure 10 shows two failure scenarios. In these cases, both relevant and irrelevant images were retrieved, but the true matched image was not among them. This indicates that the problem is not easy and requires further investigation to improve the alignment between the descriptions and the image content.

**Figure 10.** Unsuccessful scenarios (**a** and **b**) of text-to-image retrieval.
