**4. Discussion**

In this section, we further analyze the performance of DBTN with different versions of EfficientNet, namely B0, B3, and B5. B0 contains 5.3M parameters, while B3 and B5 are deeper and have 12M and 30M parameters, respectively. The results reported in Table 5 show that B2 yields slightly better results than the other models. On the other hand, B0 is less competitive, with an average recall of 45.65 compared to 47.20 for B2.
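For readers who want to reproduce this kind of comparison, the following is a minimal sketch of how an average recall score could be computed from a text-image similarity matrix. It assumes that the ground-truth match of query `i` is candidate `i`, and that the average is taken over R@K values in both retrieval directions; this definition, the choice of `ks`, and the function names are assumptions for illustration, not taken from the text above.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth match is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the ground-truth match of query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by decreasing similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def average_recall(similarity, ks=(1, 5, 10)):
    """Mean of R@K over both retrieval directions (assumed definition)."""
    # Transposing the matrix swaps the roles of queries and candidates.
    transposed = [list(col) for col in zip(*similarity)]
    scores = [recall_at_k(similarity, k) for k in ks]
    scores += [recall_at_k(transposed, k) for k in ks]
    return sum(scores) / len(scores)
```

Under these assumptions, each backbone would be scored by building its similarity matrix on the test set and calling `average_recall` on it.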


**Table 5.** Bidirectional text image matching results on our dataset using different EfficientNets.

Table 6 shows a sensitivity analysis of bidirectional text-image matching for several margin values. Setting this parameter to α = 0.5 appears to be the most suitable choice. Increasing it further decreases the average recall, as the network tends to select easy negative triplets.
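The effect of the margin can be illustrated with a small sketch. It assumes the standard hinge form of the triplet loss, max(0, d(a, p) - d(a, n) + α), computed on distances; the exact loss used by DBTN is not restated in this section, and the distance values below are hypothetical. With a larger margin, triplets whose negative is already well separated from the anchor still produce a non-zero loss, so more easy triplets take part in training.

```python
def triplet_loss(d_pos, d_neg, margin):
    """Hinge triplet loss on anchor-positive and anchor-negative distances."""
    return max(0.0, d_pos - d_neg + margin)


def active_fraction(triplets, margin):
    """Fraction of (d_pos, d_neg) pairs with a non-zero loss at this margin."""
    active = [t for t in triplets if triplet_loss(t[0], t[1], margin) > 0.0]
    return len(active) / len(triplets)


# Hypothetical (anchor-positive, anchor-negative) distances, for illustration only.
triplets = [(0.4, 0.6), (0.3, 0.9), (0.2, 1.4), (0.6, 0.5)]

for margin in (0.1, 0.5, 1.0):
    # The share of contributing triplets grows with the margin.
    print(margin, active_fraction(triplets, margin))
```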

In Table 7, we report the recall results obtained when training in a single direction rather than bidirectionally, that is, with text-to-image triplets only (anchor text) or image-to-text triplets only (anchor image). Relative similarity learned in one direction is useful for retrieval in the other: the model trained with text-to-image triplets obtains reasonable results on the image-to-text retrieval task, and vice versa. Nevertheless, the model trained with bidirectional triplets achieves the best results, indicating that bidirectional triplets provide more overall information for text-image matching.
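The difference between one-directional and bidirectional training can be sketched as follows. This is a minimal illustration that assumes Euclidean distances between embeddings, a hinge triplet loss, and an unweighted sum of the two directions; the function names and the equal weighting are assumptions rather than details confirmed by the text.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge(anchor, positive, negative, margin):
    """Triplet hinge loss for a single (anchor, positive, negative) triple."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def bidirectional_triplet_loss(text, image, neg_image, neg_text, margin=0.5):
    """Sum of a text-anchored and an image-anchored triplet term.

    Assumption: both directions are weighted equally.
    """
    text_to_image = hinge(text, image, neg_image, margin)  # anchor: text
    image_to_text = hinge(image, text, neg_text, margin)   # anchor: image
    return text_to_image + image_to_text
```

Training with only `text_to_image` or only `image_to_text` corresponds to the one-directional settings, while the sum corresponds to the bidirectional setting.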


**Table 6.** Sensitivity with respect to the margin parameter α.


