#### *1.1. Background*

High-resolution remote sensing images contain complex geometrical structures and spatial patterns, which makes scene classification a challenge for remote sensing image interpretation. Discriminative feature extraction is vital for scene classification, which aims to automatically assign a semantic label to each remote sensing image according to the category it belongs to. Recently, convolutional neural networks (CNNs) have attracted increasing attention in the remote sensing community [1,2], as they can learn abstract and discriminative features and achieve state-of-the-art classification performance. A typical CNN framework is shown in Figure 1; it comprises three types of modules: the feature extraction module, the quantization module, and the trick module. The quantization module includes spatial quantization, amplitude quantization, and evaluation quantization. Pooling is a typical spatial quantization used for downsampling and dimensionality reduction, reducing the number of parameters and calculations while retaining significant information. Amplitude quantization is a nonlinear process that compresses real values into a specific range. Examples are sigmoid, tanh, ReLU [3], and its variants [4–6], which are often used to activate neurons because nonlinearity is vital for strengthening approximation and representation abilities. Evaluation quantization is performed to obtain outputs that meet specific requirements; for instance, the softmax function is often used in the final layer of a neural network to output classification probabilities. The trick module contains algorithms such as dropout [7] and batch normalization [8] for improving training and achieving better performance. Among all the modules, the feature extraction module is the most essential; based on convolution, it extracts image features and invests the CNN with favourable characteristics such as local perception, parameter sharing, and translation invariance.

**Figure 1.** Convolutional neural network (CNN) framework. A CNN generally contains three modules: the feature extraction module, the quantization module, and the trick module. These modules are repeatedly stacked to build the deep structure, and finally a classification module is applied for the specific classification task. The trick module is embedded in the other two modules and is therefore not displayed in this figure.

The extraction of scene-level discriminative features is the key step of scene classification. In CNN-based methods, convolution is the most critical technique for remote sensing imagery feature extraction. It has the properties of local connection, weight sharing, and translation invariance, which make full use of the spatial relationship between pixels. However, convolution has some intrinsic drawbacks, such as considerable computational complexity and limited transformation capacity due to its linearity. For high-resolution remote sensing images, the computational challenge is severe, and the considerable complexity prevents scene classification from being conducted on computationally limited hardware platforms. In addition, it is vital to learn discriminative feature representations of remote sensing images, but linearity impedes CNNs from extracting more powerful features with better representation and from fitting complex functions well. To extract more representative features, the method in [9] employs collaborative representation after a pre-trained CNN, and the method in [10] integrates multilayer features of CNNs. In [11], the features of the last two fully connected layers of a pre-trained VggNet are fused to improve discriminative ability. In [12], a pre-trained deep CNN is combined with sparse representation to improve performance. Most of these methods increase the dimension of the features, and none of them improve the convolutional layer itself. Therefore, the exploration of better feature extraction methods is still of great significance for future CNN-based scene classification research.

Convolution is also the paramount operation in the traditional implementation of the wavelet transform (WT). The discrete wavelet transform (DWT) is one branch of the WT, traditionally implemented by the two-channel filter bank representation [13]. It is widely applied in scientific fields because of its favourable characteristic of time-frequency localization, but it is limited by the considerable computational complexity and the linearity brought by convolution. To overcome those difficulties, W. Sweldens proposed the lifting scheme [14–18], which mainly contains three steps: split, predict, and update. **Split**: The input signal is decomposed into two non-overlapping sample sets. A common method is the lazy wavelet transform, which splits the original signal into an even subset and an odd subset, each containing half of the input samples. **Predict**: The two subsets obtained by splitting a signal with strong local correlation are themselves highly correlated in most cases. With one subset known, the other subset can be reasonably predicted, producing some prediction error. In particular, the prediction errors are all zero for a constant signal. **Update**: One key characteristic of the coarse signal is that it has the same average as the input signal, which is guaranteed by this step. The lifting scheme is an efficient algorithm for the traditional DWT, as Daubechies and Sweldens proved that any FIR wavelet transform can be decomposed into a series of prediction and update steps [19]. The form of the lifting scheme makes it easier to perform the DWT on a hardware platform, and it was adopted as the algorithm for the wavelet transform module in the JPEG2000 standard [20,21]. Furthermore, second-generation wavelets, produced directly by designing the prediction and update operators in the lifting scheme [16], can be constructed to perform nonlinear transformations [22–26]. Briefly, compared with the traditional wavelet transform, the lifting scheme is superior in the following aspects [19]: (1) It leads to a faster implementation of the discrete wavelet transform. (2) It allows a fully in-place implementation of the fast wavelet transform, which means the wavelet transform can be calculated without allocating auxiliary memory. (3) Nonlinear wavelet transforms are easy to build with the lifting scheme; a typical example is the integer wavelet transform [25].
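To make the three steps concrete, the following minimal sketch applies split, predict, and update using simple Haar-style operators; the choice of predictor and updater here is an illustrative assumption, not the construction used later in this paper.

```python
import numpy as np

def haar_lifting(x):
    """One level of a Haar-style lifting transform (illustrative operators)."""
    x = np.asarray(x, dtype=float)
    xe, xo = x[0::2], x[1::2]   # split: even and odd samples (lazy wavelet transform)
    d = xo - xe                 # predict: each odd sample predicted by its even neighbour
    a = xe + d / 2              # update: keeps the coarse signal's average equal to the input's
    return a, d

x = np.array([4.0, 6.0, 5.0, 3.0])
a, d = haar_lifting(x)
print(a, d)                             # [5. 4.] [ 2. -2.]
assert np.isclose(a.mean(), x.mean())   # the update step preserves the mean
```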

Therefore, the lifting scheme is a more efficient algorithm for the DWT than the convolution-based method. In this paper, we prove that the lifting scheme is also an efficient implementation of convolution in the CNN and can substitute for the convolutional layer to extract better representations. With the lifting scheme introduced, neural networks gain the advantages of nonlinear transformation, all-integer transformation, and ease of hardware implementation.

#### *1.2. Problems and Motivations*

This paper develops the method from the following aspects:


#### *1.3. Contributions and Structure*

In this paper, the lifting scheme is introduced into neural networks to serve as the feature extraction module, and a lifting scheme-based deep neural network (LSNet) is proposed to enhance network performance. The main contributions are summarized as follows. (1) This paper introduces the lifting scheme to propose a novel CNN-based method for scene classification, which has the potential to ease the computational burden. (2) This paper expands the range of filter bases that can be replaced with the lifting scheme: convolution kernels with arbitrary-valued parameters are proven to have equivalent lifting schemes. (3) A learnable lifting scheme block and its backpropagation approach are given; therefore, any vanilla convolutional layer in a CNN can be replaced with its corresponding lifting scheme. (4) A novel lifting scheme-based deep neural network (LSNet) model is presented using a nonlinear lifting scheme constructed from nonlinear operators. The nonlinear lifting scheme is used as a feature extractor to substitute for the vanilla linear convolutional layer and learn discriminative feature representations of remote sensing images. (5) LSNet is validated on the CIFAR-100 dataset and then evaluated on the AID dataset. Experimental results demonstrate that LSNet outperforms the vanilla CNN and has potential for remote sensing scene classification.

The rest of this paper is organized as follows. Section 2 describes the proposed method and the datasets. Section 3 presents the experimental results on the CIFAR-100 and AID datasets. Section 4 analyzes the results and discusses future research directions. Section 5 concludes the paper.

#### **2. Materials and Methods**

In this section, the equivalence between the lifting scheme and convolution in CNNs is first proven, extending the few wavelet bases that can be replaced with the lifting scheme to convolution kernels with arbitrary-valued parameters. The corresponding lifting scheme is then derived for a 1 × 3 convolution kernel as an example. Finally, we propose a novel lifting scheme-based deep neural network (LSNet), substituting the linear convolutional layers in CNNs with the nonlinear lifting scheme and demonstrating the capability of the lifting scheme to introduce nonlinearity into the feature extraction module. The datasets used in the experiments are also described in this section.

#### *2.1. Equivalence between the Lifting Scheme and Convolution*

In the two-channel filter bank representation of the traditional wavelet transform, a low-pass digital filter *h* and a high-pass digital filter *g* are used to process the input signal *x*, followed by downsampling by a factor of 2, as shown in Equations (1) and (2).

$$a = (x \ast h) \downarrow 2 \tag{1}$$

$$d = (x \ast g) \downarrow 2 \tag{2}$$

The filter bank is elaborately designed according to requirements such as compact support and perfect reconstruction, while downsampling is used for redundancy removal. The low-pass filter is tightly coupled with the high-pass filter; in other words, once the low-pass filter is designed, the high-pass filter is consequently determined. Therefore, the wavelet bases of the first-generation wavelets are finite, restricted, and fixed. In contrast, the parameters of convolution kernels in CNNs change ceaselessly during backpropagation, which makes it necessary to expand the lifting scheme so that it is equivalent to an arbitrary-valued filter. In addition, to fit the structure of the convolutional layer, the detail component *d* generated by the lifting scheme is removed while the coarse component *a* is retained in the following proof.

Consider a 1D convolution kernel represented by $h = [h_0, h_1, \ldots, h_{k-1}]$. It is a finite impulse response (FIR) filter from the signal processing perspective, as only a finite number of filter coefficients are non-zero. The *z*-transform of the FIR filter *h* is a Laurent polynomial given by

$$H(z) = \sum_{i=0}^{k-1} h_i z^{-i} \tag{3}$$

The degree of the Laurent polynomial *H*(*z*) is

$$\deg(H(z)) = k - 1\tag{4}$$

In contrast to convolution in the signal processing field [32,33], convolution in the CNN omits the reversal operation. The convolution between the input signal *x* and the convolution kernel *h* can be written as

$$y = x \odot h = x \ast \bar{h} \tag{5}$$

where *y* represents the matrix of output feature maps. The operators "$\odot$" and "$\ast$" represent cross-correlation and convolution in the sense of digital signal processing, respectively, while $\bar{h}$ is the reversed version of *h*. The *z*-transform of Equation (5) is

$$Y(z) = X(z)H(z^{-1})\tag{6}$$

where $H(z^{-1})$ is the *z*-transform of the reversed sequence $\bar{h} = [h_{k-1}, \ldots, h_1, h_0]$.
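The relationship in Equation (5) is easy to verify numerically; the sketch below (with arbitrary example values) checks that the cross-correlation computed by a CNN layer equals the signal-processing convolution with the reversed kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.array([0.5, -1.0, 2.0])                         # kernel h = [h0, h1, h2]

cross_corr = np.correlate(x, h, mode='valid')          # what a CNN "convolution" computes
conv_reversed = np.convolve(x, h[::-1], mode='valid')  # x convolved with reversed h

assert np.allclose(cross_corr, conv_reversed)
print(cross_corr)                                      # [4.5 6.  7.5]
```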

In the lifting scheme implementation of traditional wavelets, a common method in the split stage is splitting the original signal $x = [x_0, x_1, x_2, \ldots]$ into an even subset $x_e = [x_0, x_2, \ldots, x_{2k}, \ldots]$ and an odd subset $x_o = [x_1, x_3, \ldots, x_{2k+1}, \ldots]$. Transforming the signal space to the *z*-domain, the two-channel filter bank representation shown in Equations (1) and (2) is equivalent to Equation (7), with $A(z)$ and $D(z)$ representing the *z*-transforms of *a* and *d*.

$$
\begin{pmatrix} A(z) \\ D(z) \end{pmatrix} = P^T(z^{-1}) \begin{pmatrix} X_e(z) \\ X_o(z) \end{pmatrix} \tag{7}
$$

$X_e(z)$ and $X_o(z)$ are the *z*-transforms of $x_e$ and $x_o$, which are the even subset and odd subset of *x*, respectively. $P(z)$ is the polynomial matrix of *h* and *g*:

$$P(z) = \begin{pmatrix} H_e(z) & G_e(z) \\ H_o(z) & G_o(z) \end{pmatrix} \tag{8}$$

where $H_e(z)$ and $H_o(z)$ are the *z*-transforms of the even subset $h_e$ and the odd subset $h_o$ of *h*, respectively, while $G_e(z)$ and $G_o(z)$ are the *z*-transforms of the even subset $g_e$ and the odd subset $g_o$ of *g*, respectively.

Different from the DWT in Equations (1) and (2), there is only one branch in the convolution in the CNN, as shown in Equation (6). To adapt to this particular form, we keep the low-pass filter *h* while discarding the high-pass filter *g*, and modify the polynomial matrix to preserve only the part related to $H(z)$, as in Equation (9).

$$P(z) = \begin{pmatrix} H_e(z) \\ H_o(z) \end{pmatrix} \tag{9}$$

Instead of the lazy wavelet transform, we use the original signal $x = [x_0, x_1, x_2, \ldots]$ and its time-shifted version $x' = [x_1, x_2, \ldots]$ as $x_e$ and $x_o$ in the first stage to obtain the same form as Equation (6), as shown in Equation (10).

$$\begin{aligned} P^T(z^{-2}) \begin{pmatrix} X(z) \\ zX(z) \end{pmatrix} &= \left( H_e(z^{-2}) + z H_o(z^{-2}) \right) X(z) \\ &= X(z)H(z^{-1}) \end{aligned} \tag{10}$$

Furthermore, the polynomial matrix $P(z)$ can be decomposed into a product of finitely many matrices, and the transform is thus completed by a finite number of prediction and update steps. As mentioned above, both $H_e(z)$ and $H_o(z)$ are Laurent polynomials. If the following conditions are satisfied:

$$\begin{aligned} H_e(z) &\neq 0, \quad H_o(z) \neq 0 \\ \deg(H_e(z)) &\geq \deg(H_o(z)) \end{aligned} \tag{11}$$

then there always exists a Laurent polynomial $q(z)$ (the quotient) with $\deg(q(z)) = \deg(H_e(z)) - \deg(H_o(z))$ and a Laurent polynomial $r(z)$ (the remainder) with $\deg(r(z)) < \deg(H_o(z))$ such that

$$q(z) = H_e(z) / H_o(z) \tag{12}$$

$$r(z) = H_e(z) \,\%\, H_o(z) \tag{13}$$

Iterating the above step, $P(z)$ is then decomposed. As the convolution kernels frequently used in CNNs are commonly of odd size, such as 1 × 3, 3 × 1 [34], 3 × 3, and 5 × 5, the above conditions are generally satisfied. Comparing Equation (6) with Equation (10), we conclude that convolution in the CNN, with arbitrary-valued parameters, has an equivalent lifting scheme implementation.

#### *2.2. Derivation of the Lifting Scheme Corresponding to a Convolutional Layer*

#### 2.2.1. Lifting Scheme for 1 × 3 Convolutional Layer

In this subsection, we derive the corresponding lifting scheme for a 1 × 3 convolutional layer as an example to illustrate the decomposition of $P(z)$ with the Euclidean algorithm. Given a convolution kernel $h = [h_0, h_1, h_2]$, its polynomial matrix is

$$P(z) = \begin{pmatrix} H_e(z) \\ H_o(z) \end{pmatrix} = \begin{pmatrix} h_0 + h_2 z^{-1} \\ h_1 \end{pmatrix} \tag{14}$$

The first step of the decomposition is to divide $H_e(z)$ by $H_o(z)$ to obtain the quotient and remainder.

$$\begin{aligned} r_1(z) &= H_e(z) \,\%\, H_o(z) = h_0 \\ q_1(z) &= H_e(z) / H_o(z) = \frac{h_2}{h_1} z^{-1} \end{aligned} \tag{15}$$

Then, the divisor $H_o(z)$ from the first step serves as the dividend in the second step and is divided by the remainder $r_1(z)$ of the first step.

$$\begin{aligned} r_2(z) &= H_o(z) \,\%\, r_1(z) = 0 \\ q_2(z) &= H_o(z) / r_1(z) = \frac{h_1}{h_0} \end{aligned} \tag{16}$$

The iteration stops after two steps because the final remainder is 0, and the dividend in the final step is $H_o(z) = h_1 = \gcd(H_e(z), H_o(z))$, where the operator $\gcd(\cdot)$ stands for the greatest common divisor.
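The two division steps can be checked numerically; the sketch below evaluates the Laurent polynomials at a few sample points for an arbitrarily chosen kernel (the values are illustrative) and confirms the division identities behind Equations (15) and (16).

```python
# Numeric check of the Euclidean division steps for h = [h0, h1, h2]
# (kernel values chosen arbitrarily for illustration).
h0, h1, h2 = 0.5, -1.2, 0.3

He = lambda z: h0 + h2 * z**-1        # H_e(z) = h0 + h2 * z^{-1}
Ho = lambda z: h1                     # H_o(z) = h1
q1 = lambda z: (h2 / h1) * z**-1      # quotient of step 1
r1 = h0                               # remainder of step 1
q2 = h1 / h0                          # quotient of step 2 (remainder r2 = 0)

for z in (0.7, 1.0, 2.5):
    assert abs(He(z) - (q1(z) * Ho(z) + r1)) < 1e-12  # H_e = q1 * H_o + r1
    assert abs(Ho(z) - q2 * r1) < 1e-12               # H_o = q2 * r1 + 0
print("division identities hold")
```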

With the quotients and remainders obtained above, $P(z)$ is decomposed into a matrix product, as Equation (17) shows.

$$P(z) = \begin{pmatrix} q_1(z) & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} q_2(z) & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} h_0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 & \frac{h_2}{h_1} z^{-1} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \frac{h_1}{h_0} & 1 \end{pmatrix} \begin{pmatrix} h_0 \\ 0 \end{pmatrix} \tag{17}$$

Therefore,

$$P^T(z^{-2}) = \begin{pmatrix} h_0 & 0 \end{pmatrix} \begin{pmatrix} 1 & \frac{h_1}{h_0} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ \frac{h_2}{h_1} z^2 & 1 \end{pmatrix} \tag{18}$$

Given the matrix product in Equation (18), the parameters of the lifting scheme, including the prediction and update operators and the scaling factor, are thus obtained. The 1 × 3 convolutional layer and its equivalent lifting scheme in one spatial plane are shown in Figure 2. In the 1 × 3 convolution, the convolution kernel moves across the input row by row in the horizontal direction to perform the sliding dot product. The lifting scheme can process the entire input image simultaneously within each of the steps listed in Table 1.

**Figure 2.** Equivalence between the 1 × 3 convolution and the lifting scheme. The convolution kernel moves across the entire input image row by row to conduct the sliding dot product and generate the output feature maps. The lifting scheme can process the entire input image in parallel within each step. The lifting scheme implementation with three steps is equivalent to a vanilla 1 × 3 convolution in one plane.

As illustrated in Table 1, the input image passes through the three steps of the lifting scheme. In the split stage, the input is transformed into two branches, $x_e$ and $x_o$. Different from the lifting scheme in [14], we utilize a sliding window to obtain the two branches instead of the lazy wavelet transform. Then, $x_o$ is predicted from $x_e$ with the predict operator $\frac{h_2}{h_1} z^2$, and the outcome is used to update $x_e$. The final output is obtained by scaling the updated $x_e$ by the factor $h_0$.


**Table 1.** Steps of the lifting scheme equivalent to the 1 × 3 convolutional layer.

| Step | Operation |
| --- | --- |
| Split | $x_e = x$, $x_o = x'$ (one-sample shift of $x$) |
| Predict | $d = x_o + \frac{h_2}{h_1} x''$ |
| Update | $a = x_e + \frac{h_1}{h_0} d$ |
| Scale | $y = h_0 \, a$ |
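The equivalence can be verified directly; the following sketch (with an arbitrary kernel, and "valid" boundary handling assumed for simplicity) runs the split, predict, update, and scaling steps and compares the result with a vanilla 1 × 3 cross-correlation.

```python
import numpy as np

h0, h1, h2 = 0.5, -1.2, 0.3                  # arbitrary 1 x 3 kernel
w0, w1, w2 = h0, h1 / h0, h2 / h1            # lifting parameters, Eq. (21)

x = np.random.randn(16)
xe, xo = x[:-2], x[1:-1]                     # split: x and its one-sample shift
d = xo + w2 * x[2:]                          # predict: apply (h2/h1) z^2
a = xe + w1 * d                              # update
y_lift = w0 * a                              # scale by h0

y_conv = np.correlate(x, np.array([h0, h1, h2]), mode='valid')
assert np.allclose(y_lift, y_conv)           # lifting output == 1 x 3 convolution
```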

#### 2.2.2. The Lifting Scheme for the 2D Convolutional Layer

For a convolutional layer with 2D convolution kernels, the lifting scheme can also be realized through the 1D lifting scheme, because the 2D convolution operation is the sum of several 1D convolution operations. For instance, the output of the convolution between an *m* × *n* input image and a 3 × 3 convolution kernel has a size of (*m* − 2) × (*n* − 2). The element in the *i*-th row and *j*-th column of the output matrix is obtained by

$$y(i,j) = \sum_{u=0}^{2} \sum_{v=0}^{2} x(i+u, j+v) \cdot h(u,v) \tag{19}$$

which can be rewritten as

$$y(i,j) = \sum_{v=0}^{2} x(i, j+v) \cdot h(0,v) + \sum_{v=0}^{2} x(i+1, j+v) \cdot h(1,v) + \sum_{v=0}^{2} x(i+2, j+v) \cdot h(2,v) \tag{20}$$

In other words, the 2D convolution operation is equivalent to the summation of three 1D convolution operations. As the corresponding lifting scheme of the 1D convolution has been worked out, as shown on the right side of Figure 2, the 2D lifting scheme is simply the summation of the corresponding 1D lifting schemes.
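The decomposition in Equation (20) can also be checked numerically; the sketch below (with random example data) splits a 3 × 3 kernel into its three rows and confirms that the row-wise results sum to the full 2D cross-correlation.

```python
import numpy as np
from scipy.signal import correlate2d

x = np.random.randn(8, 8)
h = np.random.randn(3, 3)

y_2d = correlate2d(x, h, mode='valid')       # full 3 x 3 operation, output (6, 6)

# Sum of three 1 x 3 row operations: row u of the kernel applied to rows u..u+5 of x.
y_rows = sum(correlate2d(x[u:u + 6, :], h[u:u + 1, :], mode='valid')
             for u in range(3))
assert np.allclose(y_2d, y_rows)
```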

Thus, any convolution kernel with arbitrary-valued parameters has an equivalent lifting scheme implementation. As the prediction and update operators in the lifting scheme can be designed to meet other requirements, which gives it a wider range of applications than vanilla convolution, the lifting scheme is a superset of vanilla convolution. In other words, the vanilla convolutional layer is just a special case of the lifting scheme. To improve the feature extractor in the CNN, a more powerful lifting scheme structure can be designed by choosing other predict and update operators.

#### 2.2.3. Backpropagation of the Lifting Scheme

In this part, we propose the backpropagation algorithm to train and update the parameters of the lifting scheme. For simplicity, a new list of weights $w = [w_0, w_1, w_2]$ is used to represent the parameters in Figure 2:

$$w_0 = h_0, \quad w_1 = \frac{h_1}{h_0}, \quad w_2 = \frac{h_2}{h_1} \tag{21}$$

Let *L* denote the loss propagated back from the next layer; the gradient with respect to this layer's output *y* is then

$$
\Delta y(i,j) = \frac{\partial L}{\partial y(i,j)}\tag{22}
$$

where $y(i,j)$ represents one pixel in the output feature map *y*. According to the chain rule, the gradients with respect to the weights are obtained as follows.

$$
\Delta w_0 = \sum_i \sum_j \frac{\partial L}{\partial y(i,j)} \frac{\partial y(i,j)}{\partial w_0} = \Delta y \odot x + w_1 \cdot \Delta y \odot x' + w_1 w_2 \cdot \Delta y \odot x'' \tag{23}
$$

$$
\Delta w_1 = \sum_i \sum_j \frac{\partial L}{\partial y(i,j)} \frac{\partial y(i,j)}{\partial w_1} = w_0 \cdot \Delta y \odot x' + w_0 w_2 \cdot \Delta y \odot x'' \tag{24}
$$

$$
\Delta w_2 = \sum_i \sum_j \frac{\partial L}{\partial y(i,j)} \frac{\partial y(i,j)}{\partial w_2} = w_0 w_1 \cdot \Delta y \odot x'' \tag{25}
$$

where $x'$ is the map whose elements in each row are the elements of the corresponding row of *x* shifted one position to the left, and $x''$ is the map whose rows are shifted two positions to the left. Both $x'$ and $x''$ maintain the same size as *x* through boundary extension. The operator "$\odot$" here denotes the cross-correlation between two matrices, i.e., elementwise multiplication followed by summation. With these gradients, the weights can be updated by stochastic gradient descent.
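The gradients in Equations (23)–(25) can be validated against finite differences; the sketch below does so for a 1D input with "valid" boundaries (the boundary handling and example values are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w0, w1, w2 = 0.5, -2.4, 0.3

def forward(w0, w1, w2):
    # y = w0 * (x + w1 * (x' + w2 * x'')) with one- and two-sample left shifts
    return w0 * (x[:-2] + w1 * (x[1:-1] + w2 * x[2:]))

y = forward(w0, w1, w2)
dy = rng.standard_normal(y.shape)            # upstream gradient dL/dy

x0, x1, x2 = x[:-2], x[1:-1], x[2:]          # x, x', x''
dw0 = np.sum(dy * x0) + w1 * np.sum(dy * x1) + w1 * w2 * np.sum(dy * x2)  # Eq. (23)
dw1 = w0 * np.sum(dy * x1) + w0 * w2 * np.sum(dy * x2)                    # Eq. (24)
dw2 = w0 * w1 * np.sum(dy * x2)                                           # Eq. (25)

eps = 1e-6                                   # finite-difference check on w1
numeric = np.sum(dy * (forward(w0, w1 + eps, w2) - y)) / eps
assert np.isclose(dw1, numeric, rtol=1e-4)
```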

#### *2.3. Lifting Scheme-Based Deep Neural Network*

In this section, the lifting scheme is introduced into the deep learning field, and a lifting scheme-based deep neural network (LSNet) is proposed to enhance network performance. From Sections 2.1 and 2.2, the lifting scheme can substitute for the convolutional layer, because it can perform convolution and utilize backpropagation to update its parameters. Moreover, the operators in the lifting scheme are flexible: they can be designed not only to make the lifting scheme equivalent to a vanilla convolutional layer but also extended to meet other requirements. Thus, we develop the LSNet with nonlinear feature extractors, utilizing the nonlinear transformation ability of the lifting scheme.

#### 2.3.1. Basic Block in LSNet

Nonlinearity enables neural networks to fit complex functions and thus strengthens their representation ability. As the lifting scheme is capable of constructing nonlinear wavelets, we introduce nonlinearity into the feature extraction module to build the LSNet. This is realized by designing nonlinear predict and update operators in the lifting scheme, which demonstrates the great potential of the lifting scheme to perform nonlinear transformations and enhance the nonlinear representation ability of the neural network.

We construct the basic block in LSNet based on ResNet34 [35], as shown in Figure 3. The first layer in the basic block is a 3 × 3 convolutional layer, which is used to change the number of channels and to perform downsampling. The middle layer is the LS block, used mainly for feature extraction, with the same number of channels in the input and the output. The plug-and-play LS block substitutes for the vanilla convolutional layer without any other alterations. In the LS block, the input is split into two parts, $x_e$ and $x_o$. The nonlinear predict and update operators are each constructed from a vanilla convolution kernel followed by a nonlinear function. $x_o$ is predicted from $x_e$ to obtain the detail component, which is discarded after the update step. $x_e$ is updated by the detail component, and the outcome, the coarse component, is used as the final output of the LS block. Finally, we add a 1 × 1 convolutional layer as the third layer to enhance channel-wise communication. Each of the first two layers is followed by batch normalization and ReLU, to help avoid overfitting and to provide activation, respectively. The identity of the input, passed through a shortcut, is added to the output of the third layer, followed by batch normalization. The result of the addition is again activated by ReLU, which gives the final output of the basic block.

**Figure 3.** Basic block in LSNet.
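A minimal PyTorch sketch of such a block is given below. The split via a shifted copy, the subtraction in the predict step, and the 3 × 3 operator kernels are illustrative assumptions, not the authors' exact configuration; the precise operator design follows Table 2.

```python
import torch
import torch.nn as nn

class LS2DBlock(nn.Module):
    """Sketch of a 2D LS block: nonlinear predict/update operators built from
    a vanilla convolution followed by a nonlinearity (here ReLU)."""
    def __init__(self, channels):
        super().__init__()
        self.predict = nn.Sequential(               # N{Pconv2D(.)}
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU())
        self.update = nn.Sequential(                # M{Uconv2D(.)}
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, x):
        xe = x
        xo = torch.roll(x, shifts=-1, dims=-1)      # split via a shifted copy (sliding window)
        d = xo - self.predict(xe)                   # predict: detail component (later discarded)
        a = xe + self.update(d)                     # update: coarse component = block output
        return a

out = LS2DBlock(64)(torch.randn(2, 64, 32, 32))
print(out.shape)                                    # torch.Size([2, 64, 32, 32])
```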

As both 1D and 2D convolutional layers are widely used, we propose 1D and 2D LS blocks, named the LS1D block and the LS2D block, respectively; their specific processes are illustrated in Table 2.

**Table 2.** Lifting scheme steps of LS1D block and LS2D block.


Note that the notation $x[i]$ represents the *i*-th column of the image *x*, while $x[i,j]$ represents the pixel in the *i*-th row and *j*-th column of *x*.

Note that $N\{\cdot\}$ and $M\{\cdot\}$ denote the nonlinear transformations in the predict step and the update step, respectively. $\mathrm{Pconv}_{n\mathrm{D}}(\cdot)$ and $\mathrm{Uconv}_{n\mathrm{D}}(\cdot)$ represent the vanilla *n*D convolution operations in the predict and update steps, respectively. The prediction and update operators are $N\{\mathrm{Pconv}_{n\mathrm{D}}(\cdot)\}$ and $M\{\mathrm{Uconv}_{n\mathrm{D}}(\cdot)\}$, which can be changed to meet other requirements.

#### 2.3.2. Network Architecture and Settings

The network architecture of LSNet is shown in Figure 4. The input of LSNet passes through a single LS block and a 1 × 1 convolutional layer for initial feature extraction. The feature maps are then processed by stacked basic blocks to obtain a low-dimensional image representation, followed by average pooling for dimension reduction. Finally, the representation is flattened into a 1D vector and processed by a fully connected layer and a softmax function to obtain the output.

**Figure 4.** LSNet architecture.

The modified ResNet34 is chosen as the baseline model to evaluate the performance of LSNet. Experiments are separately conducted for the 1D convolution and the 2D convolution, as both are widely used. The setups of each network are listed in Table 3.

In Table 3, the size and number of convolution kernels in the vanilla convolutional layers are listed. For LSNet, the structures of the LS1D block and LS2D block are shown in Table 2, and the listed number represents the depth of the output. To demonstrate the effect of nonlinearity, we use five different nonlinear functions to construct different LS blocks and compare their performance: ReLU [3], leaky ReLU [4], ELU [5], CELU, and SELU [6].


**Table 3.** Network architectures in contrast experiments.

In the experiment on the AID dataset, we use stochastic gradient descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 5 × 10−4. We train on the training set for 100 epochs with a mini-batch size of 32. The learning rate is 0.01 initially and is divided by 5 every 25 epochs. For the CIFAR-100 dataset, all settings are the same except that the mini-batch size is 128 and the initial learning rate is 0.1.
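For reference, a sketch of these optimizer settings in PyTorch might look as follows, where `model` and `train_loader` are assumed placeholders for an LSNet instance and an AID data loader with mini-batch size 32 (dividing the learning rate by 5 corresponds to `gamma=0.2`).

```python
import torch
import torch.nn.functional as F

# `model` and `train_loader` are assumed to be defined elsewhere.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.2)

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                # lr divided by 5 every 25 epochs
```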
