**3. Methodology**

#### *3.1. Sinkhorn Loss*

The Sinkhorn loss combines the softmax function with the Sinkhorn distance. The score vector output by the fully connected layer is first converted into a probability distribution by the softmax function; the Sinkhorn distance between this predicted distribution and the actual distribution is then computed. An approximate solution of the optimal transport problem between the two distributions can be obtained by an iterative procedure. The advantages of the Sinkhorn distance for measuring the distance between two distributions are introduced next.

Two signatures, $P$ and $Q$, are defined to represent the predicted distribution and the actual distribution, each over the same $m$ classes. They are given by Equations (4) and (5), where $p_i$ is a label in $P$ with probability $w_{p_i}$, and $q_j$ is a label in $Q$ with probability $w_{q_j}$. Here we set $p_i = i$ and $q_j = j$ to index the classes. The value of $w_{p_i}$ is given by the output of the softmax function, and the value of $w_{q_j}$ is determined by the real class: for a given sample, $w_{q_j} = 1$ if $j$ is its real class and $w_{q_j} = 0$ otherwise.

$$P = \{ (p_1, w_{p_1}), \dots, (p_m, w_{p_m}) \}, \tag{4}$$

$$Q = \{ (q_1, w_{q_1}), \dots, (q_m, w_{q_m}) \}. \tag{5}$$
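As a concrete illustration (not taken from the paper), the two signatures can be built from a hypothetical score vector and a one-hot label; the class count $m = 4$ and the scores below are made up for the example:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical score vector from the fully connected layer (m = 4 classes).
scores = np.array([2.0, 0.5, -1.0, 0.1])

# Predicted signature P: labels p_i = i with weights w_{p_i} from the softmax.
w_p = softmax(scores)

# Actual signature Q: one-hot weights w_{q_j}; here the real class is 0.
true_class = 0
w_q = np.zeros_like(w_p)
w_q[true_class] = 1.0

print(w_p.round(3))   # predicted distribution, sums to 1
print(w_q)            # actual distribution
```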

In order to measure the work of transforming one distribution into another, two matrices are introduced: the distance matrix $D$ and the coupling matrix $F$. Each element $d_{ij}$ in the distance matrix $D$ represents the cost of moving $p_i$ to $q_j$; here we set $d_{ij} = 0$ when $i = j$ and $d_{ij} = 1$ when $i \neq j$. Each element $f_{ij}$ in the coupling matrix $F$ indicates the probability mass to be moved from $p_i$ to $q_j$. With these definitions, the total cost $t(P, Q)$ is given by the Frobenius inner product of $D$ and $F$:

$$t(P, Q) = \langle D, F \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m} d_{ij} f_{ij}. \tag{6}$$
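A minimal sketch of Equation (6), assuming the 0/1 distance matrix defined above and a made-up coupling matrix $F$ chosen only for illustration:

```python
import numpy as np

m = 4
# Distance matrix from the text: d_ij = 0 when i = j, 1 otherwise.
D = 1.0 - np.eye(m)

# A hypothetical coupling matrix F (entry f_ij moves mass from p_i to q_j).
F = np.array([[0.10, 0.05, 0.00, 0.00],
              [0.00, 0.40, 0.05, 0.00],
              [0.00, 0.00, 0.20, 0.00],
              [0.00, 0.00, 0.00, 0.20]])

# Frobenius inner product <D, F> = sum_ij d_ij * f_ij  (Equation (6)).
total_cost = float((D * F).sum())
print(total_cost)  # only off-diagonal (inter-class) mass is charged
```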

The goal is to find an optimal coupling matrix $F^{*}$ that minimizes the total cost; the minimum cost over all feasible coupling matrices is the solution to this optimal transport problem, called the Earth Mover's Distance (*EMD*).

$$F^{*} = \underset{F}{\arg\min}\; t(P, Q), \tag{7}$$

$$EMD = \frac{\min_{F} t(P, Q)}{\sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij}}, \tag{8}$$

s.t.

$$f_{ij} \ge 0,$$

$$\sum_{j=1}^{m} f_{ij} \le w_{p_i},$$

$$\sum_{i=1}^{m} f_{ij} \le w_{q_j},$$

$$\sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij} = \min\left( \sum_{i=1}^{m} w_{p_i}, \sum_{j=1}^{m} w_{q_j} \right).$$

EMD requires an expensive exact solver to find the optimal solution and is therefore not suitable as a loss function. However, when measuring the distance between distributions, it can account for inter-class distances in the cost function through a suitably preset distance matrix. We therefore adopt the Sinkhorn distance, an approximation of EMD, as the loss function. It smooths the classic optimal transport problem with an entropic regularization term, and the solution can be rewritten as:

$$\text{For } \lambda > 0, \quad SD := \left\langle D, F^{\lambda} \right\rangle, \tag{9}$$

where

$$F^{\lambda} = \underset{F}{\arg\min}\; t(P, Q) - \frac{1}{\lambda} h(F),$$

$$h(F) = -\sum_{i,j} f_{ij} \log f_{ij}.$$

$\lambda$ is the regularization coefficient. As $\lambda$ grows, $F^{\lambda}$ gets closer to the optimal vertex $F^{*}$, but convergence slows and the computational cost rises. We therefore take $\lambda = 10$, at which the computational complexity and the accuracy of the approximate solution reach a compromise. The entropic regularization term turns the transport problem into a strictly convex problem that can be solved with Sinkhorn's matrix scaling algorithm, several orders of magnitude faster than exact transport solvers. For $\lambda > 0$, the solution $F^{\lambda}$ is unique and has the form $F^{\lambda} = \text{diag}(u)\, K\, \text{diag}(v)$, where $u$ and $v$ are two non-negative vectors of $\mathbb{R}^m$ and $K = e^{-\lambda D}$ is the element-wise exponential of $-\lambda D$.
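The matrix scaling iteration described above can be sketched as follows; the input distributions and the iteration count are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sinkhorn_distance(w_p, w_q, D, lam=10.0, n_iter=100):
    """Approximate EMD between distributions w_p and w_q under cost D.

    Sketch of Sinkhorn's matrix scaling algorithm: the regularized
    optimum has the form F = diag(u) K diag(v) with K = exp(-lam * D).
    """
    K = np.exp(-lam * D)
    u = np.ones_like(w_p)
    for _ in range(n_iter):
        # Alternate scaling so the marginals of F approach w_p and w_q.
        v = w_q / (K.T @ u)
        u = w_p / (K @ v)
    F = np.diag(u) @ K @ np.diag(v)
    return float((D * F).sum()), F

m = 4
D = 1.0 - np.eye(m)                       # 0/1 cost matrix, as in the text
w_p = np.array([0.7, 0.1, 0.1, 0.1])      # hypothetical predicted distribution
w_q = np.array([1.0, 0.0, 0.0, 0.0])      # one-hot actual distribution

sd, F = sinkhorn_distance(w_p, w_q, D)
print(round(sd, 3))  # 0.3: the probability mass placed on the wrong classes
```

With a 0/1 cost matrix and a one-hot target, the distance reduces to the total mass predicted off the true class, which makes the behaviour easy to check by hand.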

#### *3.2. Integrating Deep Learning with Binary Coding for Texture and Remote Sensing Image Classification*

Nowadays, networks for image classification are generally trained and tested end to end, and classification accuracy is improved by optimizing the parameters of the feature extractor and the classifier. However, the features extracted by a deep network alone have limitations. To improve the performance of the classification algorithm, the local texture information obtained by applying ULBP to the image is used as a supplementary feature: it is combined with the deep features as the input of the fully connected layer, and the optimization of the network parameters is guided by the Sinkhorn loss. The framework of the two-stream model is shown in Figure 7.

**Figure 7.** The detailed framework of the proposed algorithm: DBSNet.

ResNet-50 is pre-trained on the ImageNet 2012 dataset used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [35], and the original softmax with cross-entropy loss is replaced with the Sinkhorn loss to get a new network (RSNet). We then fine-tune RSNet on the different datasets and remove the fully connected layer and the classifier to obtain the deep feature extraction network. The binary-coded feature extractor is the ULBP algorithm. The input of the model is an RGB image. First, 2048-dimensional features are extracted by the deep feature extractor. At the same time, the image is converted to grayscale and encoded by ULBP to get 59-dimensional local texture features. After the two sets of features are fused, the class of the image is predicted from the output of the fully connected layer.
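The two-stream pipeline above can be sketched in NumPy. The ULBP here is a simplified 8-neighbor, radius-1 variant written for illustration, and the deep feature is replaced by a random stand-in vector since RSNet itself is not reproduced:

```python
import numpy as np

def ulbp_histogram(gray):
    """59-bin uniform LBP histogram (8 neighbors, radius 1) -- a sketch.

    The 58 uniform patterns each get their own bin; all non-uniform
    patterns share the final bin.
    """
    # 8-neighbor offsets in clockwise order around each interior pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = gray[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=np.uint8)
    for dy, dx in offsets:
        neighbor = gray[1 + dy: gray.shape[0] - 1 + dy,
                        1 + dx: gray.shape[1] - 1 + dx]
        codes = (codes << 1) | (neighbor >= center).astype(np.uint8)

    # Uniform codes have at most 2 circular 0/1 transitions.
    def transitions(c):
        bits = [(c >> i) & 1 for i in range(8)]
        return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    uniform = [c for c in range(256) if transitions(c) <= 2]
    lut = np.full(256, 58, dtype=np.int64)       # non-uniform -> bin 58
    for bin_id, c in enumerate(uniform):
        lut[c] = bin_id
    hist = np.bincount(lut[codes.ravel()], minlength=59).astype(np.float64)
    return hist / hist.sum()

# Hypothetical fusion: 2048-d deep feature + 59-d ULBP histogram -> 2107-d.
rng = np.random.default_rng(0)
deep_feat = rng.standard_normal(2048)            # stand-in for RSNet output
gray = rng.integers(0, 256, size=(32, 32))       # stand-in grayscale image
fused = np.concatenate([deep_feat, ulbp_histogram(gray)])
print(fused.shape)  # (2107,)
```

The fused 2107-dimensional vector is what the fully connected layer would receive as input.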

In order to clearly observe the difference before and after feature fusion, t-distributed stochastic neighbor embedding (t-SNE) [36] is used to visualize the pre-fusion deep features, the ULBP features, and the merged DBSNet features extracted on the KTH-TIPS2-b texture dataset in 2D space. The results are shown in Figure 8. The deep features have good image characterization capability, but samples of the same class are relatively scattered. The ULBP features have some characterization capability but poor discriminability. The DBSNet features combine the deep features and the ULBP features; the reduced-dimensional features show that their representation capability is better than that of the deep features or the ULBP features alone, and samples of the same class are more compact, indicating that the ULBP features complement the deep features.

**Figure 8.** Comparison of feature maps of RSNet, ULBP, and DBSNet algorithms on KTH-TIPS2-b dataset.
