**1. Introduction**

The steady accessibility of remote sensing data, particularly high-resolution images, has stimulated remarkable research output in the remote sensing community. Two of the most active topics in this regard are image classification and retrieval [1–5]. Image classification aims to assign scene images to a discrete set of land use/land cover classes depending on the image content [6–10]. Recently, with rapidly expanding remote sensing acquisition technologies, both the quantity and quality of remote sensing data have increased. In this context, content-based image retrieval (CBIR) has become a paramount research subject in order to meet the increasing need for the efficient organization and management of massive volumes of remote sensing data, which has been a long-lasting challenge in the remote sensing community.

In recent decades, great efforts have been made to develop effective and precise retrieval approaches to search for information of interest across large remote sensing archives. A typical CBIR system involves two main steps [11], namely feature extraction and matching, where the most relevant images from the archive are retrieved. In this regard, both feature extraction and matching play a pivotal role in determining the efficiency of a retrieval system [12].

Content-based remote sensing image retrieval is a particular application of CBIR in the field of remote sensing. However, the remote sensing community seems to place more emphasis on devising powerful features, since the performance of image retrieval systems relies greatly on the effectiveness of the extracted features [13]. In this respect, remote sensing image retrieval approaches rely on either handcrafted features or deep learning.

As for handcrafted features, low-level features are harnessed to depict the semantic content of remote sensing images, and they can be drawn from either local or global regions of the image. Color features [14,15], texture features [2,16,17], and shape features [18] are widely applied as global features. On the other hand, local features tend to emphasize the description of local regions instead of looking at the image as a whole. There are various algorithms for describing local image regions, such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) [19,20]. The bag-of-words (BOW) model [21] and the vector of locally aggregated descriptors (VLAD) [22] are generally used to encode local features into a fixed-size image signature via a codebook/dictionary of keypoint/feature vectors.

Recently, remote sensing images have witnessed a steady increase due to the prominent technological progress of remote sensors [23]. Therefore, huge volumes of data with various spatial dimensions and spectral channels have become available [24]. On this point, handcrafted features can be customized and successfully tailored to small volumes of data; however, they do not meet the standards of practical contexts where the size and complexity of the data increase. Nowadays, deep learning strategies, which aim to learn discriminative and representative features automatically, are highly effective in large-scale image recognition [25–27], object detection [28,29], semantic segmentation [30,31], and scene classification [32]. Furthermore, recurrent neural networks (RNNs) have achieved immense success in various sequential data analysis tasks such as action recognition [33,34] and image captioning [35]. Recent research shows that image retrieval approaches work particularly well by exploiting deep neural networks. For example, the authors in [36] introduced a content-based remote sensing image retrieval approach based on deep metric learning using a triplet network; the proposed approach showed promising results compared to prior state-of-the-art approaches. The work in [37] presented an unsupervised deep feature learning method for the retrieval of remote sensing images. Yang et al. [38] proposed a dynamic kernel with a deep convolutional neural network (CNN) for image retrieval, which focuses on matching patches between the filters and relevant images while removing those for irrelevant pairs. Furthermore, deep hashing neural network strategies have been adopted in some works for large-scale remote sensing image retrieval [39]. Li et al. [40] presented a new unsupervised hashing method, the aim of which is to build an effective hash function. In another work, Li et al. [41] investigated cross-source remote sensing image retrieval via source-invariant deep hashing CNNs, which automatically extract the semantic features of multispectral data.

It is worth mentioning that the aforementioned image retrieval methods are single-label retrieval approaches, where the query image and the images to be retrieved are each labelled with a single class label. Although these approaches have been applied with a certain amount of success, they abstract the rich semantic content of a remote sensing image into a single label.

In order to narrow the semantic gap and enhance retrieval performance, recent remote sensing research has proposed multi-label approaches. For instance, the work in [12] presented a multi-label method making use of a semi-supervised graph-theoretic technique in order to improve the region-based retrieval method of [42]. Zhou et al. [43] proposed a multi-label retrieval technique by training a CNN for semantic segmentation and feature generation. Shao et al. [11] constructed a dense labeling remote sensing dataset to evaluate the performance of retrieval techniques based on traditional handcrafted features as well as deep learning-based ones. Dai et al. [44] discussed the use of multiple labels for hyperspectral image retrieval and introduced a multi-label scheme that incorporates spatial and spectral features.

It is evident that the multi-label scenario is generally favored (over the single label case) on account of its abundant semantic information. However, it remains limited due to the discrete nature of labels pertaining to a given image. This suggests a further endeavor to model the relation among objects/labels using an image description. With the rapid advancement of computer vision and natural language processing (NLP), machines began to understand, slowly but surely, the semantics of images.

Current computer vision literature suggests that, instead of tackling the problem from an image-to-image matching perspective, cross-modal text–image learning offers a more concrete alternative. This concept has lately manifested itself in the form of image captioning, a crossover where computer vision meets NLP. Basically, it consists of generating a sequential textual narration of visual data, similar to how humans perceive it. In fact, image captioning is considered a subtle aid for image understanding, as a description generation model must capture not only the objects/scenes present in the image but also express how these objects/scenes relate to each other in a textual sentence.

The leading deep learning techniques for image captioning can be categorized into two streams. One stream adopts an encoder–decoder, end-to-end fashion [45,46], where a CNN is typically the encoder and an RNN, often a long short-term memory (LSTM) network [47], the decoder. Rather than translating between different languages, such techniques translate from a visual representation to language: the visual representation is extracted via a pre-trained CNN [48], and translation is achieved by RNN-based language models. The major advantage of this approach is that the whole system is trained end-to-end [47]. Xu et al. [35] went one step further by introducing the attention mechanism, which enables the decoder to concentrate on specific portions of the input image when generating each word. The other stream adopts a compositional framework, such as [49], which divides the captioning task into several parts: word detection by a CNN, caption candidate generation, and sentence re-ranking by a deep multimodal similarity model.

With respect to image captioning, the computer vision literature offers several contributions mainly based on deep learning. For instance, You et al. [50] combined top-down (i.e., image-to-words) and bottom-up (i.e., joining several relevant words into a meaningful image description) approaches via CNN and RNN models for image captioning, which revealed interesting experimental results. Chen et al. [51] proposed an alternative architecture based on spatial and channel-wise attention for image captioning. In other works, a deep model called a bi-directional spatial–semantic attention network was introduced [52,53], where an embedding network and a similarity network were adopted to model the bidirectional relations between pairs of text and image. Zhang and Lu [54] proposed a projection classification loss that classifies the vector projection of representations from one modality onto the other, improving on the norm-softmax loss. Huang et al. [52] addressed the problem of bidirectional image–text matching by making use of attention networks.

So far, it can be noted that computer vision has accumulated a steady research basis in the context of image captioning [47,50,55], often regarded as the 'next frontier' of the field. In remote sensing, however, contributions have barely begun to move in this direction. Lu et al. [56], for instance, proposed a concept similar to [51] by combining CNNs (for image representation) with an LSTM network for sentence generation in remote sensing images. Shi et al. [57] leveraged a fully convolutional architecture for remote sensing image description. Zhang et al. [58] adopted an attribute attention strategy to produce remote sensing image descriptions and investigated the effect of the attributes derived from remote sensing images on the attention system.

As previously reviewed, mainstream remote sensing works focus mainly on single-label scenarios, whereas in practice images may contain many classes simultaneously. In the quest to tackle this bottleneck, recent works have attempted to allocate multiple labels to a single query image. Nevertheless, coherence among the labels in such cases remains questionable, since multiple labels are assigned to an image regardless of their mutual relations. These methods therefore do not explicitly specify (or model) the relations between the different objects in a given image for a better understanding of its content. Evidently, remote sensing image description has received rather scarce attention in this sense. This may be explained by the fact that remote sensing images exhibit a wide range of morphological complexities and scale changes, which render text-to/from-image retrieval intricate.

In this paper, we propose a solution based on a deep bidirectional triplet network (DBTN) for solving the text-to-image matching problem. It is worth mentioning that this work is inspired by [53]. The major contributions of this work can be highlighted as follows:


The paper is organized into five sections. In Section 2, we introduce the proposed DBTN method. Section 3 presents the TextRS dataset and the experimental results, followed by discussions in Section 4. Finally, Section 5 provides conclusions and directions for future developments.

#### **2. Description of the Proposed Method**

Assume a training set $D = \{X_i, Y_i\}_{i=1}^{N}$ composed of $N$ images with their matching sentences. In particular, with each training image $X_i$ we associate a set of $K$ matching sentences $Y_i = \{y_i^1, \ldots, y_i^K\}$. In the test phase, given a query sentence $t_q$, we aim to retrieve the most relevant image in the training set $D$. Figure 1 shows a general description of the proposed DBTN method, composed of image and text encoding branches that aim to learn appropriate image and text embeddings $f(X_i)$ and $g(T_i)$, respectively, by optimizing a bidirectional triplet loss. Detailed descriptions are provided in the following subsections.


**Figure 1.** Flowchart of the proposed Deep Bidirectional Triplet Network (DBTN): (**a**) text as anchor, (**b**) image as anchor.

#### *2.1. Image Encoding Module*

The image encoding module uses a pre-trained CNN augmented with an additional network to learn the visual features $f(X_i)$ of the image (Figure 2). To learn informative features and suppress less relevant ones, this extra network applies a channel attention layer termed squeeze-and-excitation (SE) to the activation maps obtained after the 3 × 3 convolution layer. The goal is to further enhance the feature representation by grasping the significance of each feature map among all extracted feature maps. As illustrated in Figure 2, the squeeze operation produces features of dimension (1, 1, 128) by means of global average pooling (GAP), which are then fed to a fully connected layer that reduces the dimension by a factor of 16. The produced feature vector *s* then recalibrates the feature maps of each channel (*V*) by a channel-wise scaling operation. The SE block works as shown below [59]:

$$s = \text{sigmoid}(W_2(\text{ReLU}(W_1(V)))) \tag{1}$$

$$V_{SE} = s \odot V \tag{2}$$

where *s* is the scaling factor, ⊙ refers to channel-wise multiplication, and *V* represents the feature maps obtained from a particular layer of the pre-trained CNN. The resulting activation maps *VSE* are then fed to a GAP layer followed by a fully connected layer and *l*2-normalization for feature rescaling, yielding the features $f(X_i)$.
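For concreteness, the following is a minimal Keras sketch of an SE layer matching Equations (1) and (2); the function name, the functional-API style, and the reduction ratio of 16 are illustrative assumptions rather than the paper's exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def squeeze_excitation(feature_maps, reduction=16):
    """Squeeze-and-excitation block following Equations (1)-(2).

    `feature_maps` is a (batch, H, W, C) tensor; the channel count of
    128 and the reduction factor of 16 follow the text above.
    """
    channels = feature_maps.shape[-1]
    # Squeeze: global average pooling -> (batch, C)
    z = layers.GlobalAveragePooling2D()(feature_maps)
    # Excitation: bottleneck FC with ReLU, then sigmoid gating (Eq. 1)
    s = layers.Dense(channels // reduction, activation="relu")(z)
    s = layers.Dense(channels, activation="sigmoid")(s)
    # Scale: channel-wise multiplication s ⊙ V (Eq. 2)
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([feature_maps, s])
```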

As pre-trained CNNs, we adopted in this work different networks, including VGG16, Inception-v3, ResNet50, and EfficientNet. VGG16, proposed in 2014, has 16 layers [27] and was trained on the ImageNet dataset to classify 1.2 million RGB images of size 224 × 224 pixels into 1000 classes. The Inception-v3 network [60], introduced by Google, contains 42 layers and three kinds of inception modules, which comprise convolution kernels with sizes from 1 × 1 to 5 × 5; such modules seek to reduce the number of parameters. The residual network (ResNet) [25] is a 50-layer network with shortcut connections, proposed to mitigate the vanishing-gradient problem in deeper networks. Finally, EfficientNets, state-of-the-art models with up to 10 times better efficiency (faster as well as smaller), were developed recently by a research team from Google [61] to scale up CNNs using a simple compound coefficient. Unlike traditional approaches that scale network dimensions (width, depth, and resolution) individually, EfficientNet scales all dimensions in a balanced way using a fixed set of scaling coefficients. In practice, scaling any individual dimension can improve performance, but scaling all three dimensions uniformly yields the best overall accuracy and efficiency.
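All four backbones ship as pre-trained models in Keras (EfficientNet in recent versions). A hypothetical helper for loading one of them as a feature extractor, with the illustrative input size and variant choices being our assumptions, might look as follows:

```python
from tensorflow.keras import applications

def build_backbone(name="ResNet50", input_shape=(224, 224, 3)):
    """Load a pre-trained backbone with ImageNet weights; the
    classification head is dropped (include_top=False) so the network
    returns convolutional feature maps for the SE layer above."""
    backbones = {
        "VGG16": applications.VGG16,
        "InceptionV3": applications.InceptionV3,
        "ResNet50": applications.ResNet50,
        "EfficientNetB0": applications.EfficientNetB0,
    }
    return backbones[name](include_top=False, weights="imagenet",
                           input_shape=input_shape)
```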

**Figure 2.** Image encoding branch for extracting the visual features.

#### *2.2. Text Encoding Module*

Figure 3 shows the text encoding module, which is composed of *K* symmetric branches, where each branch is used to encode one sentence describing the image content. These sub-branches use a word embedding layer followed by an LSTM, a fully connected layer, and *l*2-normalization.

**Figure 3.** Text embedding branch: the five sentences describing the content of an image are aggregated using an average fusion layer. LSTM: long short-term memory.

The word embedding layer receives a sequence of integers representing the words in the sentence and transforms them into dense representations, where similar words have similar encodings. The outputs of this layer are then fed to an LSTM [62], which models the entire sentence thanks to its capacity for learning long-term dependencies. Figure 4 shows the architecture of the LSTM, with its four gates at each time step *t* in the memory cell: the input gate *it*, the update gate *ct*, the output gate *ot*, and the forget gate *ft*. At each time step, these gates receive as input the previous hidden state *ht*−1 and the current input *yt*. The cell memory then recursively updates itself based on its previous value and the forget and update gates.

**Figure 4.** LSTM structure.

The working mechanism of LSTM is given below (for simplicity, we omit the image index *i*) [62]:

$$i_t = \text{sigmoid}(W_i \cdot [h_{t-1}, y_t]) \tag{3}$$

$$f_t = \text{sigmoid}(W_f \cdot [h_{t-1}, y_t]) \tag{4}$$

$$\widetilde{c}_t = \tanh(W_g \cdot [h_{t-1}, y_t]) \tag{5}$$

$$c_t = f_t \ast c_{t-1} + i_t \ast \widetilde{c}_t \tag{6}$$

$$o_t = \text{sigmoid}(W_o \cdot [h_{t-1}, y_t]) \tag{7}$$

$$h_t = o_t \ast \tanh(c_t) \tag{8}$$

where ∗ denotes the Hadamard product, and *Wi*, *Wf*, *Wg*, and *Wo* are learnable weights. In general, we can model the hidden state *ht* of the LSTM as follows [62]:

$$h_t = \text{LSTM}(h_{t-1}, y_t, c_{t-1}) \tag{9}$$

where *ct*−1 indicates the memory cell vector at time step *t* − 1.
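To make Equations (3)–(8) concrete, here is a minimal NumPy sketch of a single LSTM time step; the weight shapes and the omission of bias terms are simplifying assumptions:

```python
import numpy as np

def lstm_step(h_prev, c_prev, y_t, W_i, W_f, W_g, W_o):
    """One LSTM time step following Equations (3)-(8). Each W_* acts on
    the concatenation [h_{t-1}, y_t]; biases are omitted for brevity."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.concatenate([h_prev, y_t])
    i_t = sigmoid(W_i @ x)              # input gate, Eq. (3)
    f_t = sigmoid(W_f @ x)              # forget gate, Eq. (4)
    c_tilde = np.tanh(W_g @ x)          # candidate cell, Eq. (5)
    c_t = f_t * c_prev + i_t * c_tilde  # cell update, Eq. (6)
    o_t = sigmoid(W_o @ x)              # output gate, Eq. (7)
    h_t = o_t * np.tanh(c_t)            # hidden state, Eq. (8)
    return h_t, c_t
```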

For each branch, the output of the LSTM is fed to an additional fully connected layer, yielding $K$ feature representations $g(y_i^k)$, $k = 1, \ldots, K$. Then, the final outputs of the different branches are fused using an average fusion layer to obtain a feature of dimension 128 [7]:

$$g(T_i) = \frac{1}{K} \sum_{k=1}^{K} g\left(y_i^k\right) \tag{10}$$
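Putting the pieces together, a minimal Keras sketch of the text encoding module might look as follows; the LSTM width, the vocabulary handling, and the sharing of weights across the *K* branches are our assumptions, as the paper does not state them explicitly:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_text_encoder(vocab_size, seq_len, K=5, embed_dim=128):
    """K symmetric sentence branches (word embedding -> LSTM -> FC ->
    l2-normalization), fused by averaging as in Equation (10)."""
    # Layers shared across the K branches (an assumption; the paper
    # describes symmetric branches without stating weight sharing).
    embed = layers.Embedding(vocab_size, embed_dim)
    lstm = layers.LSTM(256)
    fc = layers.Dense(128)  # 128-dim output, per the text above

    inputs, branch_outputs = [], []
    for _ in range(K):
        inp = layers.Input(shape=(seq_len,), dtype="int32")
        h = fc(lstm(embed(inp)))
        h = layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=-1))(h)
        inputs.append(inp)
        branch_outputs.append(h)
    g_T = layers.Average()(branch_outputs)  # average fusion, Eq. (10)
    return Model(inputs, g_T)
```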

#### *2.3. DBTN Optimization*

Many machine learning and computer vision problems are based on learning a distance metric for solving retrieval problems [63]. Inspired by the achievements of deep learning in computer vision [26], deep neural networks have been used to learn discriminative feature embeddings [64,65]. These methods learn to project images or texts into a discriminative embedding space, where the embedded vectors of similar samples lie closer together and farther from those of dissimilar samples. Several loss functions have been developed for this optimization, such as the triplet [65], quadruplet [66], lifted structure [67], N-pairs [68], and angular [69] losses. In this work, we concentrate on the triplet loss, which aims to learn a discriminative embedding for various applications such as classification [64], retrieval [70–74], and person re-identification [75,76]. It is worth recalling that a standard triplet in image-to-image retrieval is composed of three samples: an anchor, a positive sample (from the same category as the anchor), and a negative sample (from a different category than the anchor). The aim of the triplet loss is to learn an embedding space where anchor samples are closer to positive samples than to negative ones by a given margin.

In our case, unlike standard triplet networks, the network is composed of asymmetric branches, as the anchor, positive, and negative samples are represented in different ways. For instance, triplets can be formed using a text as an anchor, its corresponding image as a positive sample, and an image with different content as a negative sample. Similarly, one can use an image as an anchor associated with positive and negative textual descriptions. The aim is to learn discriminative features for different textual descriptions as well as for different visual contents, while learning similar features for each image and its corresponding textual representation. For this purpose, we propose a bidirectional triplet loss, given as follows:

$$l_{\rm DBTN} = \lambda_1 L_1 + \lambda_2 L_2 \tag{11}$$

$$L_1 = \sum_{i=1}^{N} \left[ \left\| g(T_i^a) - f(X_i^p) \right\|_2^2 - \left\| g(T_i^a) - f(X_i^n) \right\|_2^2 + \alpha \right]_+ \tag{12}$$

$$L_2 = \sum_{i=1}^{N} \left[ \left\| f(X_i^a) - g(T_i^p) \right\|_2^2 - \left\| f(X_i^a) - g(T_i^n) \right\|_2^2 + \alpha \right]_+ \tag{13}$$

where $[z]_+ = \max(z, 0)$ and α is the margin that ensures the negative is farther away than the positive. $g(T_i^a)$ refers to the embedding of the anchor text, $f(X_i^p)$ is the embedding of the positive image, and $f(X_i^n)$ refers to the embedding of the negative image. Conversely, $f(X_i^a)$ refers to the embedding of the anchor image, $g(T_i^p)$ is the embedding of the positive text, and $g(T_i^n)$ refers to the embedding of the negative text. λ1 and λ2 are regularization parameters controlling the contribution of the two terms.
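A minimal TensorFlow sketch of the bidirectional triplet loss in Equations (11)–(13) could read as follows; the margin value is an assumption, while λ1 = λ2 = 0.5 matches the experimental setup in Section 3.2:

```python
import tensorflow as tf

def bidirectional_triplet_loss(g_anchor_txt, f_pos_img, f_neg_img,
                               f_anchor_img, g_pos_txt, g_neg_txt,
                               margin=0.2, lam1=0.5, lam2=0.5):
    """Equations (11)-(13): hinge over squared l2 distances, summed
    over the batch; margin=0.2 is an assumed (not reported) value."""
    def hinge(anchor, pos, neg):
        d_pos = tf.reduce_sum(tf.square(anchor - pos), axis=-1)
        d_neg = tf.reduce_sum(tf.square(anchor - neg), axis=-1)
        return tf.reduce_sum(tf.maximum(d_pos - d_neg + margin, 0.0))
    L1 = hinge(g_anchor_txt, f_pos_img, f_neg_img)  # text as anchor
    L2 = hinge(f_anchor_img, g_pos_txt, g_neg_txt)  # image as anchor
    return lam1 * L1 + lam2 * L2
```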

The performance of the DBTN heavily relies on triplet selection. Indeed, the training process is often very sensitive to the selected triplets; selecting triplets randomly can lead to non-convergence. To surmount this problem, the authors in [77] proposed triplet mining, which utilizes only semi-hard triplets, where the positive pair is closer than the negative one. Such valid semi-hard triplets are scarce, and therefore semi-hard mining requires a large batch size to search for informative pairs. A framework named smart mining was proposed by Harwood et al. [78] to find hard samples from the entire dataset, at the cost of a heavy off-line computation burden. Wu et al. [79] discussed the significance of sampling and proposed a technique called distance-weighted sampling, which uniformly samples negative examples by similarity. Ge et al. [80] built a hierarchical tree of all the classes to find hard negative pairs, which were collected via a dynamic margin. In this paper, we propose to use a semi-hard mining strategy, as shown in Figure 5, although other more sophisticated selection mechanisms could be investigated as well. In particular, we select triplets in an online mode based on the following constraint [77]:

$$d(g(T^a), f(X^p)) < d(g(T^a), f(X^n)) < d(g(T^a), f(X^p)) + \alpha \tag{14}$$

where *d*(·) is the cosine distance.
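As an illustration, the constraint in Equation (14) can be expressed as a boolean mask over candidate triplets; the sketch below assumes the anchor–positive and anchor–negative distances have already been computed:

```python
import tensorflow as tf

def semi_hard_mask(d_pos, d_neg, margin=0.2):
    """Equation (14): keep triplets whose negative is farther than the
    positive but still inside the margin. d_pos/d_neg are per-triplet
    distances between the text anchor and positive/negative images."""
    return tf.logical_and(d_neg > d_pos, d_neg < d_pos + margin)
```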

**Figure 5.** Semi-hard triplet selection scheme.

#### **3. Experimental Results**

#### *3.1. Dataset Description*

We built a dataset, named TextRS, by collecting images from four well-known scene datasets. The AID dataset consists of 10,000 aerial images of size 600 × 600 pixels in 30 classes, collected from Google Earth imagery acquired by different remote sensors. The Merced dataset contains 21 classes; each class has 100 RGB images of size 256 × 256 pixels with a resolution of 30 cm; this dataset was collected from the USGS. The PatternNet dataset was gathered from high-resolution imagery and includes 38 classes; each class contains 800 images of size 256 × 256 pixels. The NWPU dataset is another scene dataset, with 31,500 images in 45 scene classes.

TextRS is composed of 2144 images selected randomly from the above four scene datasets. In particular, 480, 336, 608, and 720 images were selected from AID, Merced, PatternNet, and NWPU, respectively (16 images from each class of each dataset). Each remote sensing image was then annotated with five different sentences, giving a total of 10,720 sentences; the captions were generated by five different people to ensure diversity. It is worth recalling that the choice of five sentences per image was mainly motivated by datasets developed in the general computer vision literature [47,81]. During annotation, the following rules had to be followed when generating the sentences:


Some samples from our dataset are shown in Figure 6.



**Figure 6.** Example of images with the five sentences for each image.

#### *3.2. Performance Evaluation*

We implemented the method using the Keras open-source deep learning library written in Python. For training the network, we randomly selected 1714 images for training and kept the remaining 430 images for testing, corresponding approximately to an 80%/20% train/test split. For training the DBTN, we used a mini-batch size of 50 images and the Adam optimization method with a fixed learning rate of 0.001 and exponential decay rates for the moment estimates of 0.9 and 0.999. Additionally, we set the regularization parameters to the default values λ1 = λ2 = 0.5. To evaluate the performance of the method, we used the widely used recall measure, which is suitable for text-to-image retrieval problems. In particular, we present the results in terms of Recall@K (R@K) for different values of K (1, 5, 10), i.e., the percentage of queries whose ground-truth match appears in the top K ranked results. We conducted the experiments on a workstation with an Intel Core i9 processor running at 3.6 GHz, 32 GB of memory, and a graphical processing unit (GPU) with 11 GB of GDDR5X memory.
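For reference, a minimal sketch of the R@K computation is given below; the variable names are ours, and the embeddings are assumed to be l2-normalized so that a dot product acts as cosine similarity:

```python
import numpy as np

def recall_at_k(text_emb, image_emb, gt_index, ks=(1, 5, 10)):
    """For each query sentence embedding, rank all test images by
    similarity and check whether its ground-truth image appears in
    the top K. Returns a dict {K: R@K}."""
    sims = text_emb @ image_emb.T           # cosine similarity matrix
    ranks = np.argsort(-sims, axis=1)       # best match first
    return {k: float(np.mean([gt_index[q] in ranks[q, :k]
                              for q in range(len(gt_index))]))
            for k in ks}
```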
