**2. Preliminaries**

#### *2.1. Deep Feature for Image Classification*

Deep learning has been successfully applied to image classification, where the complex functions of human visual perception can be approximated by rationally combining several basic modules. The basic modules of a deep model are mainly the information extraction module, the activation and pooling module, and the tricks module. The information extraction module extracts features from the input. The activation and pooling module is mainly devoted to nonlinear transformation and dimensionality reduction. The tricks module can speed up the training procedure and avoid overfitting [32]. The information extraction module of a CNN is mainly composed of convolutional layers. The receptive field describes the area of the input image that can affect a given feature of the CNN. Figure 2 shows that as the network deepens, the receptive field of neurons in later layers increases, and the extracted features change from low-level features such as edge information to mid-level features such as texture information and high-level features such as structural information.
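The receptive-field growth with depth can be sketched numerically. The helper below is a minimal illustration (assuming square kernels and no dilation); the function name and layer lists are our own, not taken from any specific network:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output.
    Returns the receptive field of one output neuron on the input,
    assuming square kernels and no dilation."""
    rf, jump = 1, 1  # receptive field and cumulative stride ("jump")
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input steps
        jump *= s             # later layers step over the input in larger jumps
    return rf

# Stacked 3x3 convolutions with stride 1: the receptive field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1)]))      # 3
print(receptive_field([(3, 1)] * 2))  # 5
print(receptive_field([(3, 1)] * 3))  # 7
```

This matches the behavior sketched in Figure 2: neurons in deeper layers see progressively larger regions of the input image.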

In a CNN, the high-level features are often used for classification, but they lose a lot of detailed information, such as edge and texture features, which may lead to poor performance on image classification tasks that require more detail. To improve the discriminative ability of the model, texture features can be used to complement the discriminability of the deep features.

Deep models are widely used in texture classification and remote sensing scene classification. ResNet, proposed by Kaiming He et al. in 2015, is one of the best deep models for image classification. A 152-layer deep neural network was trained with a special network structure and won the championship of the ImageNet competition classification task (top-5 error rate: 3.57%). ResNet overcame the difficulty of training very deep networks: not only was classification accuracy improved, but the number of parameters was also smaller than that of the VGG model. Consequently, ResNet-50 is used as the deep feature extractor in the proposed algorithm.

**Figure 2.** The relationship between receptive field and network depth.

ResNet deepens the network without reducing classification accuracy by means of residual learning. Based on existing design ideas (batch normalization, small convolution kernels, and fully convolutional networks), the residual module is introduced. Each residual module contains two paths: one performs two or three convolution operations on the input feature to obtain the residual of the feature; the other is a direct path for the input feature. The outputs of the two paths are added together to form the output of the residual module. An example of a residual module is shown in Figure 3. The first 1 × 1 convolution in the module reduces the dimension (from 256 to 64), and the second 1 × 1 convolution restores it (from 64 to 256). Consequently, the number of input and output channels of the intermediate 3 × 3 convolution is small (64 to 64), and the number of parameters to be learned is significantly reduced.
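The parameter saving of the bottleneck design in Figure 3 is easy to verify by counting convolution weights. The channel sizes below come from the text; the helper function and the comparison against a single full-width 3 × 3 convolution are illustrative (biases ignored):

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution from c_in to c_out channels."""
    return c_in * c_out * k * k

# Bottleneck module of Figure 3: 1x1 reduce, 3x3 at reduced width, 1x1 restore.
bottleneck = (conv_params(256, 64, 1)     # 1x1: 256 -> 64
              + conv_params(64, 64, 3)    # 3x3: 64 -> 64
              + conv_params(64, 256, 1))  # 1x1: 64 -> 256

# A single 3x3 convolution kept at full width for comparison.
plain = conv_params(256, 256, 3)          # 3x3: 256 -> 256

print(bottleneck)  # 69632
print(plain)       # 589824
```

Under these assumptions the bottleneck uses roughly one eighth of the weights of a full-width 3 × 3 convolution, which is the point of the 1 × 1 dimension-reduction step.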

**Figure 3.** Residual module in ResNet-50.

Figure 4 shows the framework of ResNet-50, in which the residual modules are repeated 3, 4, 6, and 3 times, respectively. The deep features of the image are extracted using the fine-tuned ResNet-50.
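As a quick sanity check, the name "ResNet-50" follows from the stage repeats in Figure 4, under the usual counting convention (assumed here) of three convolutional layers per bottleneck module plus the initial convolution and the final fully connected layer:

```python
# Stage repeats of the residual modules, from Figure 4.
repeats = [3, 4, 6, 3]

# 3 conv layers per bottleneck module, plus the initial 7x7 conv
# and the final fully connected layer.
layers = 3 * sum(repeats) + 1 + 1
print(layers)  # 50
```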

**Figure 4.** The framework of ResNet-50.

#### *2.2. Binary Coded Feature for Image Classification*

In the field of texture recognition, LBP is one of the most commonly used texture description methods. It was first proposed by T. Ojala et al. in 1994, originally as a method of describing texture images, and was later improved for general image feature analysis. LBP has significant advantages such as grayscale invariance and rotation invariance [29]. It has been extensively researched in many fields and has demonstrated outstanding performance in several comparative studies [33,34]. The LBP descriptor works by thresholding the values of the neighborhood pixels, using the value of the center pixel as the threshold. It is capable of detecting local primitives, including flat regions, edges, corners, curves, and edge ends, and it was later extended to multi-scale, rotation-invariant, and uniform representations and successfully applied to other tasks, such as object detection, face recognition [11], and remote sensing scene classification. The framework of the traditional LBP, presented in Figure 5, can be divided into three steps. First, the binary relationship between each pixel in the image and its local neighborhood is computed in grayscale. Then, the binary relationships are weighted into an LBP code according to certain rules. Finally, the histogram computed over the LBP image is used as the image feature.

**Figure 5.** Calculation of the local binary pattern (LBP).

The traditional LBP algorithm can be expressed by Equations (1) and (2):

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \tag{1}$$

where

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}.$$

*P* is the number of neighborhood pixels, *R* represents the distance between the central pixel and the neighborhood pixels, and we set *R* = 1 here. *g<sub>c</sub>* is the grayscale value of the center pixel, and *g<sub>p</sub>* is the grayscale value of a neighborhood pixel. Compared with the value of the center pixel, the value of a neighborhood pixel is coded as 1 when it is greater than or equal and 0 when it is less. Binary encoding is then performed in a fixed order. For sampling points that do not fall exactly at pixel centers, the grayscale values can be estimated by linear interpolation. Finally, by traversing all LBP pixel values, a histogram is created to represent the texture features of the image. Assuming the image size is *I* × *J*, the histogram is expressed as:

$$H(k) = \sum_{i=1}^{I} \sum_{j=1}^{J} f\big(LBP_{P,R}(i,j), k\big), \tag{2}$$

where

$$k \in [0, 255],$$

$$f(x, y) = \begin{cases} 1, & x = y \\ 0, & \text{otherwise} \end{cases}.$$
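Equations (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration for the *P* = 8, *R* = 1 case only: the clockwise neighbor order (and hence the bit weights 2<sup>*p*</sup>) is an assumed convention, border pixels are skipped, and interpolation for off-grid sampling points is omitted:

```python
import numpy as np

def lbp_3x3(img):
    """LBP_{8,1} codes over the 3x3 neighborhood (Equation (1)).
    Border pixels are skipped; the clockwise neighbor order used here
    is one common convention, not the only possible one."""
    # Offsets (dy, dx) of the 8 neighbors, clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    center = img[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        # s(g_p - g_c) = 1 when neighbor >= center, weighted by 2^p.
        codes += (neighbor >= center).astype(np.uint8) << p
    return codes

def lbp_histogram(codes):
    """256-bin histogram of the LBP codes (Equation (2))."""
    return np.bincount(codes.ravel(), minlength=256)
```

On a flat region every neighbor equals the center, so every bit of the code is 1 (code 255); a center pixel brighter than all of its neighbors yields code 0, matching the thresholding rule *s*(*x*) above.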

As one of the binary coded feature extractors, LBP can effectively deal with illumination changes and is widely used in texture analysis and texture recognition. However, as the variety of patterns grows, the computational complexity and data volume of the traditional LBP method increase sharply. LBP is also sensitive to noise: slight fluctuations of the central pixel may produce quite different codes. To solve the problem of excessive binary patterns and make the statistical process more concise, Ojala et al. proposed the uniform pattern to reduce the dimensionality of the patterns. Ojala observed that in natural images most patterns contain at most two jumps from 1 to 0 or from 0 to 1. A circular binary pattern with at most two such jumps is therefore called a "uniform pattern", such as 00000000 (0 jumps) or 10001111 (one jump from 1 to 0, then one jump from 0 to 1). All other patterns are classified as mixed patterns, such as 10010111 (4 jumps in total). The framework of ULBP is shown in Figure 6. *U*(*LBP<sub>P,R</sub>*) indicates the number of jumps in the code and is calculated as follows:

$$U(LBP_{P,R}) = \left| s(g_{P-1} - g_c) - s(g_0 - g_c) \right| + \sum_{p=1}^{P-1} \left| s(g_p - g_c) - s(g_{p-1} - g_c) \right|. \tag{3}$$

Through this improvement, the number of patterns is reduced from the original 2*<sup>P</sup>* to *P* × (*P* − 1) + 3, where *P* represents the number of sampling points in the neighborhood. For 8 sampling points in the 3 × 3 neighborhood, the number of binary patterns is reduced from 256 to 59. The uniform patterns are assigned the values 1 to 58 in ascending order, and the mixed patterns are all assigned 0. Since the range of grayscale values is 0–58, the ULBP feature image appears entirely dark, and the feature vector has lower dimensionality and is less affected by high-frequency noise.
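The pattern counts above can be checked directly by enumerating all 8-bit codes and counting circular transitions as in Equation (3). This sketch is illustrative; the function name is our own:

```python
def transitions(code, P=8):
    """Number of circular 0/1 jumps in a P-bit LBP code (Equation (3))."""
    bits = [(code >> p) & 1 for p in range(P)]
    return sum(bits[p] != bits[(p + 1) % P] for p in range(P))

# Uniform patterns have at most two jumps: P*(P-1) + 2 = 58 codes for P = 8.
uniform = [c for c in range(256) if transitions(c) <= 2]
print(len(uniform))  # 58
```

Adding the single bin that collects all mixed patterns gives the 59 values stated in the text, i.e. *P* × (*P* − 1) + 3 for *P* = 8.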

**Figure 6.** The framework of uniform local binary pattern (ULBP).
