**1. Introduction**

Remote sensing image change detection (CD) compares and analyzes two or more remote sensing images of the same area acquired at different times, together with their spectral and sensor information, through artificial intelligence or mathematical statistics to obtain the change information of the area [1,2]. As a key technology for monitoring surface conditions, CD is an important research direction in the field of remote sensing and plays an important role in many fields such as land planning, urban expansion [3,4], environmental monitoring [5–7], and disaster assessment [8].

Recently, with the gradual maturity of remote sensing imaging technology, high-resolution (HR) remote sensing image data have become increasingly available. Compared with medium- and low-resolution remote sensing images, HR remote sensing images contain richer geometric and spatial information, which provides favorable conditions for monitoring surface changes more accurately. Therefore, researchers have paid increasing attention to the processing of HR remote sensing images. Effectively extracting the rich feature information of HR remote sensing images, focusing on the change regions, avoiding the interference of other factors, and reducing the interference of pseudo-changes are the key issues of remote sensing image CD research [9].


Many CD methods have been proposed, and different authors have summarized and classified them from different aspects. In this paper, we summarize and compare them from two perspectives: traditional methods and deep learning-based methods.

The traditional methods are divided into pixel-based remote sensing image CD methods and object-oriented remote sensing image CD methods according to the size of the basic unit [10]. The pixel-based remote sensing image CD method usually directly processes the input image according to pixel-level spectral features, texture features, and other specific meaningful features (water bodies, vegetation indices). It obtains a difference image by differencing or ratioing, and the change information is then extracted using a threshold segmentation method [11]. In the early days, methods such as image differencing [12], image ratioing [13], and regression analysis [14] were commonly used. However, these methods usually failed to obtain complete change information. To better utilize the spectral information of images, methods based on image transformation such as independent component analysis (ICA) [15] and multivariate alteration detection (MAD) [16,17] emerged one after another and achieved good results in land CD. For multispectral remote sensing images, the change vector analysis (CVA) [18] method was proposed to detect different changes on the ground. The CVA method calculates the change amplitude and phase angle and uses the phase angle information to subdivide the changes. However, the performance of this type of method depends heavily on the quality of the spectral bands involved in the calculation, and the stability of the algorithm cannot be guaranteed. Therefore, improved versions of the CVA technique were proposed during 2012–2016 to further improve the performance of CD [19–22]. With the development of HR optical remote sensing satellite technology, more and more HR remote sensing images are used for CD.

The characteristic of "different objects with the same spectrum" in HR remote sensing images easily leads to the "salt-and-pepper" phenomenon in the detection results. This problem further limits the practical application of pixel-level CD methods to HR remote sensing images [23]. Object-based CD methods are commonly used for CD in HR remote sensing images because they allow a richer representation of information. Ma et al. [24] investigated the effects of semantic strategy, scale, and feature space on an unsupervised, object-based CD method in urban areas. Subsequently, Zhang et al. [25] proposed an object-based method for unsupervised CD by incorporating a multi-scale uncertainty analysis. Zhang et al. [26] proposed a method based on the box-whisker plot with the cosine law, which outperformed traditional CD methods. For CD tasks where "from–to" change information has to be determined, Gil-Yepes et al. [27] and Qin et al. [28] utilized a post-classification comparison strategy. Although the object-based CD method can better utilize the spatial feature information of HR remote sensing images than the pixel-based CD method, it still relies on traditional manual feature extraction, which is not only complicated and inefficient but also yields less stable CD performance [9]. In recent years, deep learning methods have been widely used in natural language processing, speech recognition [29,30], and image processing [31–33]. Deep learning methods have excellent learning ability and do not require the manual design of feature factors to extract features. With the success of deep learning in the field of image processing, deep learning-based CD for remote sensing images has quickly attracted the interest of scholars. With the continuous development of technology, excellent research based on convolutional neural networks (CNNs) [34] has also begun to appear in the field of remote sensing CD. CNNs do not require feature extraction by manually designed features. In the field of remote sensing CD, ResNet [35], fully convolutional networks (FCN) [36], and UNet [37] structures have been widely used for feature map extraction with certain success. With continuous research, remote sensing CD models have been continuously optimized and improved.

For example, the FC-EF [38] network concatenates the bi-temporal images before feeding them into a backbone network with a UNet structure. The FC-Siam-conc and FC-Siam-diff [38] networks instead process the two images separately through two encoder branches that have the same structure and shared parameters, and the outputs of the two branches are finally combined using convolutional layers. FC-Siam-conc improves the network by connecting the feature maps from the two encoder branches to the corresponding decoder layer with skip connections. FC-Siam-diff improves the network by first differencing the feature maps of the two encoder branches, taking the absolute value of the difference, and finally using a skip connection strategy to connect it with the corresponding decoder layer. Subsequently, the FCN-based UNet network was successfully applied to the CD task [39,40], trained in an end-to-end manner from scratch using only the available CD datasets. Coarse-to-fine [41] proposes a coarse-to-fine detection framework for remote sensing change regions. It first uses an encoder and decoder to obtain coarse change maps of the bi-temporal images and then applies the idea of residuals to obtain refined change maps; the method can effectively detect the change regions with good results. Beyond relating feature maps between different layers with the idea of residuals, many scholars also use attention mechanisms in remote sensing CD to extract richer and finer feature maps. STANet [42] uses ResNet as its backbone and adds a self-attention module for CD in the feature extraction process, which can model the relationship between any two pixels. The authors of this model then introduced a Transformer on top of ResNet, which further improves the network performance [43]. DASNet [44] proposes a dual-attention mechanism to generate better feature representations and enhance the performance of the network. Zhang et al. [45] first use two Siamese network branches to extract features from the raw images; to enhance the integrity of change map boundaries and internal densities, multi-level deep features are fused with image difference features by an attention mechanism. In 2021, Hou et al. [46] proposed a novel attention mechanism for mobile networks that embeds position information into channel attention, called Coordinate Attention (CA), which enhances feature representation. Also in 2021, HDFNet [47] used hierarchical fusion and a dynamic convolution model to obtain fine feature maps; the network innovates in the fusion of features at different levels, which gives it superior recognition performance. The above methods have achieved certain results in the field of remote sensing CD. However, the accurate extraction of effective feature representations and the adequate fusion of feature information at different scales remain research challenges in the field of remote sensing CD. For ease of reference, a summary of the above-mentioned methods is presented in Table 1.

**Table 1.** Summary of contemporary CD methods.


In this paper, we propose a Multi-Attention Guided Feature Fusion Network (MAFF-Net) for remote sensing images to address the above problems effectively. The main contributions of this article are as follows:

1. We propose the Feature Enhancement Module (FEM), which addresses the problem that the features extracted from the backbone network contain much interference information and that the feature representation is not clear enough. The FEM captures not only cross-channel information but also direction-aware and location-sensitive information, which helps the model locate the regions of interest more accurately and enhances the representation of changing-region features.


We tested the model on three publicly available remote sensing image datasets. The experimental results validate the effectiveness of our proposed algorithm. The remainder of this article is organized as follows: Section 2 describes the proposed method in detail. In Section 3, corresponding experiments are designed to verify the effectiveness of the method in this article, and the experimental results are analyzed and discussed. Section 4 draws some conclusions about our method.

#### **2. Methodology**

In this section, the network proposed for the remote sensing image CD task is described in detail. First, the backbone of the architecture is described. Second, a detailed description of the proposed FEM is presented. Next, the attention-guided feature fusion modules, which are the focus of this section, are described separately. Then, the RRB proposed in this paper is introduced. Finally, the final prediction results are generated by applying convolutional operations [48,49] on the final fused feature maps.

#### *2.1. Network Architecture*

The overall structure of the proposed network is shown in Figure 1. The proposed network uses ResNet18 as its backbone. Following previous work [42,50,51], the proposed network modifies ResNet18 by removing the last max-pooling layer and the fully connected layer and retaining the layers in the first five convolutional blocks (Conv1 to Conv5).
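To make the backbone modification concrete, a minimal PyTorch sketch is given below. It assumes torchvision's ResNet18 layout and drops its global pooling and fully connected head while keeping the five convolutional stages; the exact layers removed and the taps used for the five feature maps are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of the truncated ResNet18 backbone (assumption: torchvision
# layout; ImageNet weights can be loaded via torchvision's weights API if desired).
import torch.nn as nn
from torchvision import models

class ResNet18Backbone(nn.Module):
    """Keeps the five convolutional stages (Conv1-Conv5) and drops the final
    global pooling and fully connected layers."""
    def __init__(self):
        super().__init__()
        net = models.resnet18()
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # Conv1
        self.pool = net.maxpool
        self.stage2, self.stage3 = net.layer1, net.layer2        # Conv2, Conv3
        self.stage4, self.stage5 = net.layer3, net.layer4        # Conv4, Conv5

    def forward(self, x):
        f0 = self.stem(x)                # H/2,  64 channels
        f1 = self.stage2(self.pool(f0))  # H/4,  64 channels
        f2 = self.stage3(f1)             # H/8,  128 channels
        f3 = self.stage4(f2)             # H/16, 256 channels
        f4 = self.stage5(f3)             # H/32, 512 channels
        return [f0, f1, f2, f3, f4]      # only the last four are used downstream
```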

First, the bi-temporal image pair ($T_1$, $T_2$) is input to the feature extraction network to obtain two sets of feature maps, $F_{T_1}^{0}, F_{T_1}^{1}, F_{T_1}^{2}, F_{T_1}^{3}, F_{T_1}^{4}$ and $F_{T_2}^{0}, F_{T_2}^{1}, F_{T_2}^{2}, F_{T_2}^{3}, F_{T_2}^{4}$. For each set of feature maps, the proposed method uses only the last four. These feature maps are then fed into the Feature Enhancement Module (FEM) at their respective scales to obtain two sets of updated feature maps, $F_1^{1}, F_1^{2}, F_1^{3}, F_1^{4}$ and $F_2^{1}, F_2^{2}, F_2^{3}, F_2^{4}$. Next, a cross-layer feature fusion strategy is applied to each set of updated feature maps. It should be noted that our cross-layer feature fusion strategy targets features of different scales from the same image. Take image $T_1$ as an example. First, bilinear up-sampling [52–54] and convolution operations are performed on the high-level feature $F_1^{3} \in \mathbb{R}^{4C \times H/4 \times W/4}$ to bring it to $\mathbb{R}^{C \times H \times W}$, where $H \times W$ is the size of the feature map $F_1^{1} \in \mathbb{R}^{C \times H \times W}$ and $C$ is the channel dimension of $F_1^{1}$. Then, the feature map $F_1^{1}$ and the up-sampled $F_1^{3}$ of image $T_1$ are concatenated to obtain the feature $F_1^{13} \in \mathbb{R}^{2C \times H \times W}$. $F_1^{13}$ is input to the convolutional block attention module (CBAM) [55] and, after a $3 \times 3$ convolution, becomes $F_1^{13} \in \mathbb{R}^{C \times H \times W}$. The same method is used to fuse $F_1^{2} \in \mathbb{R}^{2C \times H/2 \times W/2}$ and $F_1^{4} \in \mathbb{R}^{8C \times H/8 \times W/8}$ of $T_1$ to obtain $F_1^{24} \in \mathbb{R}^{2C \times H/2 \times W/2}$. With the FFM module, four feature maps, $F_1^{13}$, $F_2^{13}$, $F_1^{24}$, and $F_2^{24}$, are obtained. According to their corresponding scales, the fused feature map pairs ($F_1^{13}$, $F_2^{13}$) and ($F_1^{24}$, $F_2^{24}$) are fed into the proposed RRB to further refine the feature representation, yielding $F_{12}^{13} \in \mathbb{R}^{C \times H \times W}$ and $F_{12}^{24} \in \mathbb{R}^{2C \times H/2 \times W/2}$, respectively. Finally, the two feature maps $F_{12}^{13}$ and $F_{12}^{24}$ are sent to the FFM for final fusion. The prediction map is obtained after applying a pixel classifier (a sequence of $3 \times 3$ Conv, batch normalization (BN) [56], and ReLU [57]).
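The overall data flow described above can be summarized in the following sketch. It assumes that the FEM, FFM\_S1, RRB, FFM\_S2, and classifier modules are implemented as in Sections 2.2–2.4 and injected from outside, so only the wiring of the pipeline, not the module internals, is illustrated; it is not the authors' exact implementation.

```python
# A high-level sketch of the MAFF-Net forward pipeline; all submodules are assumed.
import torch.nn as nn

class MAFFNetSketch(nn.Module):
    def __init__(self, backbone, fems, ffm_s1_pair, rrb_pair, ffm_s2, classifier):
        super().__init__()
        self.backbone = backbone      # truncated ResNet18 (Section 2.1)
        self.fems = fems              # nn.ModuleList of 4 FEMs, one per scale
        self.ffm_s1 = ffm_s1_pair     # nn.ModuleList: fuses (F^1, F^3) and (F^2, F^4)
        self.rrb = rrb_pair           # nn.ModuleList: refines same-scale pairs
        self.ffm_s2 = ffm_s2          # final fusion of the two refined maps
        self.classifier = classifier  # 3x3 Conv + BN + ReLU pixel classifier

    def enhance(self, image):
        feats = self.backbone(image)[1:]   # keep only the last four feature maps
        return [fem(f) for fem, f in zip(self.fems, feats)]

    def forward(self, t1, t2):
        f1_1, f2_1, f3_1, f4_1 = self.enhance(t1)
        f1_2, f2_2, f3_2, f4_2 = self.enhance(t2)
        # Cross-layer fusion inside each image: (F^1, F^3) and (F^2, F^4).
        f13_1, f24_1 = self.ffm_s1[0](f1_1, f3_1), self.ffm_s1[1](f2_1, f4_1)
        f13_2, f24_2 = self.ffm_s1[0](f1_2, f3_2), self.ffm_s1[1](f2_2, f4_2)
        # Refine each same-scale pair across the two dates with the RRB.
        f13_12 = self.rrb[0](f13_1, f13_2)
        f24_12 = self.rrb[1](f24_1, f24_2)
        # Final fusion and pixel-wise classification.
        return self.classifier(self.ffm_s2(f13_12, f24_12))
```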

**Figure 1.** Architecture of the proposed MAFF-Net network. The green dotted box shows the cross-layer fusion strategy. $F_1^{1}, F_1^{2}, F_1^{3}, F_1^{4}$ and $F_2^{1}, F_2^{2}, F_2^{3}, F_2^{4}$ denote the two sets of features updated by the FEM.

#### *2.2. Feature Enhancement Module*

Existing CD methods for HR remote sensing images have paid less attention to position information and channel relationships. HR remote sensing images contain rich position and spatial information. Therefore, a Feature Enhancement Module (FEM) based on coordinate attention (CA) is proposed in this paper to capture the accurate position information and channel relationships of HR remote sensing images. The module considers both position information and channel information. The structure of the FEM is shown in Figure 2.

In Figure 2, a $3 \times 3$ convolution operation is first performed on the input $F1$. The result is then fed into the CA block to obtain the weighted feature map $F2 \in \mathbb{R}^{C \times H \times W}$. Feature maps $F1$ and $F2$ are merged into one feature map by element-wise summation, and a $3 \times 3$ convolution operation is applied to obtain $F3$.
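A minimal sketch of the FEM wrapper is given below; the CA block is assumed to be implemented as in Equations (1)–(6) (see the sketch after Equation (6)), and the use of BN and ReLU inside the 3 × 3 convolution blocks is an assumption rather than a detail given in the text.

```python
# A minimal sketch of the FEM in Figure 2: 3x3 conv -> CA -> element-wise sum
# with the input -> 3x3 conv. BN/ReLU choices are assumptions.
import torch.nn as nn

class FEM(nn.Module):
    def __init__(self, channels, ca_block):
        super().__init__()
        self.conv_in = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.ca = ca_block        # coordinate attention block (Section 2.2)
        self.conv_out = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f1):
        f2 = self.ca(self.conv_in(f1))   # weighted feature map F2
        return self.conv_out(f1 + f2)    # element-wise sum, then 3x3 conv -> F3
```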

In Figure 2, the coordinate attention module encodes along the *H* and *W* directions separately. In an HR remote sensing image, for a given position (*i*, *j*), its value on channel *c* is $x_c(i, j)$. The *H* average pooling output of the *c*-th channel at height *h* is given by Equation (1) [46]:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{1}$$

**Figure 2.** Feature Enhancement Module (FEM). "*W* average pooling" and "*H* average pooling" refer to 1D horizontal global average pooling and 1D vertical global average pooling, respectively. The r indicates the reduction ratio, where r is set to 16. The Reshape operation permutes the Dimension of the tensor. The Resize operation extends the tensor to the same size as the input I1.

Similarly, the *W* average pooling output of the *c*-th channel at width *w* is given by Equation (2) [46]:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{2}$$

Then, the Reshape operation is used to permute the dimensions of the $z_c^h$ tensor so that they match those of the $z_c^w$ tensor. Next, the coordinate attention module applies concatenation, convolution, and activation operations, as defined in Equation (3) [46]:

$$f = \delta\left(F_C\left(\left[z_c^h, z_c^w\right]\right)\right) \tag{3}$$

where [·,·] indicates a concatenation operation, $F_C$ indicates a $1 \times 1$ convolution operation, and $\delta$ indicates the ReLU activation function. $f$ is the output feature map of the ReLU layer.

After the split operation, $f$ can be decomposed into $f^h \in \mathbb{R}^{C/r \times 1 \times H}$ and $f^w \in \mathbb{R}^{C/r \times 1 \times W}$. The Reshape operation is used again to permute the dimensions of the tensor $f^h$, giving $f^h \in \mathbb{R}^{C/r \times H \times 1}$. Next, two $1 \times 1$ convolutional transforms, $F_h$ and $F_w$, are used to transform $f^h$ and $f^w$ into tensors with the same number of channels as the input I1. Then, applying the sigmoid activation function [58] to the tensors updated by $F_h$ and $F_w$, respectively, two outputs are obtained, as shown in Equations (4) and (5) [46]:

$$g^h = \sigma\left(F_h\left(f^h\right)\right) \tag{4}$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right) \tag{5}$$

where $\sigma$ indicates the sigmoid activation function. The Resize operation expands $g^h \in \mathbb{R}^{C \times H \times 1}$ and $g^w \in \mathbb{R}^{C \times 1 \times W}$ to the same size as the input I1 $\in \mathbb{R}^{C \times H \times W}$, and the resized $g^h$ and $g^w$ are used as attention weights. Finally, the output feature map I2 of the CA block is defined as Equation (6) [46]:

$$y_c(i,j) = x_c(i,j) \times g_c^h(i) \times g_c^w(j) \tag{6}$$

where $c$ denotes the $c$-th channel, $g_c^h(i)$ is the weight of the $i$-th position in the *H* direction, $g_c^w(j)$ is the weight of the $j$-th position in the *W* direction, and $y_c(i, j)$ is the value of the output feature map I2.
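The following sketch implements Equations (1)–(6) in PyTorch, following the structure of coordinate attention in [46]; the normalization inside the shared 1 × 1 transform and the minimum width of the reduced channel dimension are assumptions.

```python
# A minimal sketch of the coordinate attention (CA) block of Equations (1)-(6).
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.shared = nn.Sequential(                # F_C then delta (ReLU), Eq. (3); BN is an assumption
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.f_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h in Eq. (4)
        self.f_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w in Eq. (5)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                     # Eq. (1): N x C x H x 1
        z_w = x.mean(dim=2, keepdim=True)                     # Eq. (2): N x C x 1 x W
        # Reshape z_h so both tensors can be concatenated along the spatial axis.
        y = torch.cat([z_h.permute(0, 1, 3, 2), z_w], dim=3)  # N x C x 1 x (H+W)
        y = self.shared(y)
        f_h, f_w = torch.split(y, [h, w], dim=3)              # split into the two directions
        g_h = torch.sigmoid(self.f_h(f_h.permute(0, 1, 3, 2)))  # Eq. (4): N x C x H x 1
        g_w = torch.sigmoid(self.f_w(f_w))                       # Eq. (5): N x C x 1 x W
        return x * g_h * g_w                                     # Eq. (6), broadcast over H and W
```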

#### *2.3. Feature Fusion Module*

Studies of deep learning-based CD have shown that results are unsatisfactory when the task relies only on simple feature extraction networks. On the one hand, simple feature extraction networks cannot eliminate semantic interference such as seasonal appearance differences and cannot accurately label change regions in the presence of diverse object shapes and complex boundaries. On the other hand, multi-scale information is not fully exploited; fusing multi-scale features so that they communicate with each other can help improve the performance of the network.

Therefore, as shown in Figure 1, an attention-based Feature Fusion Module (FFM) is introduced into the CD network. The detail of the FFM is shown in Figure 3.

The proposed FFM is slightly different at different stages. The FFM whose input features are from FEM is named FFM\_S1, and the FFM whose input features are from RRB is named FFM\_S2. Specifically, the difference between FFM\_S1 and FFM\_S2 lies in the input part. The inputs of FFM\_S1 are two feature maps of different scales of one image, while the input of FFM\_S2 is two feature maps of the same scale of two images.

After FEM processing, two sets of updated feature maps, $F_1^{1}, F_1^{2}, F_1^{3}, F_1^{4}$ from $T_1$ and $F_2^{1}, F_2^{2}, F_2^{3}, F_2^{4}$ from $T_2$, are obtained. For FFM\_S1, the inputs are the feature map pairs ($F_1^{1}$, $F_1^{3}$), ($F_1^{2}$, $F_1^{4}$), ($F_2^{1}$, $F_2^{3}$), and ($F_2^{2}$, $F_2^{4}$), respectively. Figure 3 shows FFM\_S1; the structure of FFM\_S2 is not drawn separately because the two differ only in their inputs. However, it should be emphasized that FFM\_S2, whose two input feature maps have the same scale, does not distinguish between high-level and low-level features and does not need to up-sample the high-level feature as FFM\_S1 does.

**Figure 3.** Feature Fusion Module (FFM). *F*1–*F*5 represent the feature maps that are output by different blocks.

The next step is to describe FFM\_S1. Experiments show that fusing features across layers is more effective. This may be because high-level features lose some of the semantic information carried by the original image or by low-level features, such as edge features, as the number of convolution layers increases, and fusion with low-level features can compensate for this deficiency. At the same time, the semantic information carried by feature maps in neighboring layers does not differ much, so the cross-layer fusion strategy plays a useful role. For the original image $T_1$, the feature map pairs ($F_1^{1}$, $F_1^{3}$) and ($F_1^{2}$, $F_1^{4}$) are fed into FFM\_S1, respectively. For the original image $T_2$, the feature map pairs ($F_2^{1}$, $F_2^{3}$) and ($F_2^{2}$, $F_2^{4}$) are fed into FFM\_S1, respectively. As shown in Figure 3, the high-level feature is first up-sampled so that its shape is consistent with that of the low-level feature. Next, a $1 \times 1$ convolution is used to obtain the feature map $F2 \in \mathbb{R}^{C \times H \times W}$. The two inputs $F1 \in \mathbb{R}^{C \times H \times W}$ and $F2$ are concatenated to obtain the feature map $F3 \in \mathbb{R}^{2C \times H \times W}$, which can be viewed as a feature map with different channels. The calculation of $F3$ is shown in Equation (7):

$$F3 = \left[Conv(Up(F1)), F2\right] \tag{7}$$

where *Conv* denotes the $1 \times 1$ convolution, *Up* denotes the up-sampling operation, and [·,·] denotes the concatenation operation. Considering that this direct cross-layer aggregation of features does not yet allow sufficient communication in the channel and spatial dimensions, $F3$ is fed to the CBAM. The CBAM is an attention module consisting of channel and spatial attention; it considers both the importance of pixels in different channels and the importance of pixels at different positions within the same channel. The CBAM outputs a re-weighted feature map of size $\mathbb{R}^{2C \times H \times W}$. Then, a $3 \times 3$ convolution block is used, the main purpose of which is to recover the number of channels of the aggregated feature map to that of the input feature map, yielding $F4$. The above calculation process is shown in Equation (8):

$$F4 = Conv(CBAM(F3)) \tag{8}$$

where *Conv* denotes the $3 \times 3$ convolution block. In the following two subsections, the two parts of the CBAM, namely the channel attention module and the spatial attention module, are described in detail.
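A minimal sketch of FFM\_S1 corresponding to Equations (7) and (8) is given below; a CBAM implementation is assumed (see the CAM and SAM sketches in the next two subsections), and the use of BN and ReLU in the final 3 × 3 block is an illustrative choice.

```python
# A minimal sketch of FFM_S1: bilinear up-sampling of the high-level map,
# 1x1 channel reduction, concatenation with the low-level map, CBAM re-weighting,
# and a 3x3 conv block that restores the channel count (Eqs. (7)-(8)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFM_S1(nn.Module):
    def __init__(self, low_channels, high_channels, cbam):
        super().__init__()
        self.reduce = nn.Conv2d(high_channels, low_channels, kernel_size=1)  # 1x1 conv in Eq. (7)
        self.cbam = cbam                                                      # channel + spatial attention
        self.fuse = nn.Sequential(                                            # 3x3 conv block in Eq. (8)
            nn.Conv2d(2 * low_channels, low_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(low_channels), nn.ReLU(inplace=True))

    def forward(self, f_low, f_high):
        f2 = self.reduce(F.interpolate(f_high, size=f_low.shape[2:],
                                       mode="bilinear", align_corners=False))
        f3 = torch.cat([f_low, f2], dim=1)   # Eq. (7): 2C x H x W
        f4 = self.cbam(f3)                   # CBAM re-weighting
        return self.fuse(f4)                 # Eq. (8): back to C channels
```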

#### 2.3.1. Channel Attention Module

In the Channel Attention Module (CAM), the vectors $F_{avg}^{ca} \in \mathbb{R}^{B \times C \times 1}$ and $F_{max}^{ca} \in \mathbb{R}^{B \times C \times 1}$ are obtained by the average-pooling and max-pooling operations, respectively. Each of them is then input to a shared multi-layer perceptron (MLP) with one hidden layer, and the two resulting vectors are merged into one feature vector by element-wise summation. After sigmoid activation, the feature map of the CAM is finally obtained, as shown in Equation (9) [55]:

$$M_{ca}(D) = \delta\left(FC_1\left(FC_0\left(F_{avg}^{ca}\right)\right) + FC_1\left(FC_0\left(F_{max}^{ca}\right)\right)\right) \tag{9}$$

where $FC_0$ and $FC_1$ denote the convolution operations in the MLP and $\delta$ denotes the sigmoid function. The CAM compresses the spatial dimensions of the feature map into a one-dimensional vector before manipulating it. Channel attention is concerned with what is significant in the feature map. During gradient backpropagation, average-pooling provides feedback for every pixel on the feature map, whereas max-pooling provides gradient feedback only where the response in the feature map is greatest.
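A minimal sketch of the CAM of Equation (9) is shown below; it returns the channel weights $M_{ca}$, which are multiplied with the input feature map by the caller, and the reduction ratio of the shared MLP is an assumption.

```python
# A minimal sketch of the channel attention module (CAM), following CBAM [55].
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP: FC_0 then FC_1
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))   # F_avg^ca branch
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))    # F_max^ca branch
        return torch.sigmoid(avg + mx)                     # Eq. (9): channel weights
```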

#### 2.3.2. Spatial Attention Module

In the Spatial Attention Module (SAM), the feature map output from the CAM is used as input. First, max-pooling and average-pooling are performed along the channel dimension, and a concatenation operation is then performed on the two resulting maps. Next, a convolution reduces the result to one channel, and the feature map output of the SAM is obtained by sigmoid activation. This is given by Equation (10) [55]:

$$M_{sa}(D^{ca}) = \delta\left(f^{7 \times 7}\left(Cat\left(F_{avg}^{sa}, F_{max}^{sa}\right)\right)\right) \tag{10}$$

where *Cat* is the concatenation operation, $f^{7 \times 7}$ represents a convolutional layer with a filter size of $7 \times 7$, and $\delta$ denotes the sigmoid function. The SAM is a channel compression mechanism that performs average-pooling and max-pooling in the channel dimension. The max-pooling operation extracts the maximum value over the channels, and the number of extracted values is *H* × *W*; the average-pooling operation extracts the average value over the channels, and the number of extracted values is also *H* × *W*. Thus, a 2-channel feature map is obtained.
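A corresponding sketch of the SAM of Equation (10) follows; here the spatial weights are applied directly to the CAM-reweighted input, so chaining the two modules reproduces the CBAM used in the FFM.

```python
# A minimal sketch of the spatial attention module (SAM), following CBAM [55].
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)     # F_avg^sa: N x 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)    # F_max^sa: N x 1 x H x W
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (10)
        return x * attn                        # re-weight the CAM output

# Usage as a CBAM, assuming ChannelAttention from the previous sketch:
# x = SpatialAttention()(x * ChannelAttention(channels)(x))
```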

### *2.4. Refinement Residual Block*

The use of a single 3 × 3 convolutional kernel has some shortcomings in refining the feature representation. Inspired by Yu et al. [59], a Refinement Residual Block (RRB) is introduced to modify the channels of the aggregated feature map to be consistent with the input feature map and further refine the feature representation before the final feature fusion using FFM\_S2. Its structure is shown in Figure 4.

As can be seen in Figure 4, the RRB has three inputs, one of which is the difference map of the two feature maps. The three feature maps are first concatenated and then passed through two consecutive convolution blocks, each consisting of a $3 \times 3$ convolution, BN, and ReLU. The two convolution blocks output the feature maps $F1 \in \mathbb{R}^{C \times H \times W}$ and $F2 \in \mathbb{R}^{C \times H \times W}$, respectively. It should be noted that the number of output channels of each convolutional block is different. In addition, the module adds residual connections with $1 \times 1$ convolutional layers to obtain additional spatial information from the remote sensing images. Finally, the four feature maps are combined by element-wise summation to obtain the final output feature map $F4 \in \mathbb{R}^{C \times H \times W}$.
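Because Figure 4 is not fully specified by the text alone, the sketch below is only one plausible reading of the RRB: the two same-scale maps and their absolute difference are concatenated, passed through two consecutive Conv-BN-ReLU blocks, and summed with two residual 1 × 1 convolution branches taken from the concatenated input. The channel sizes and the sources of the residual branches are assumptions.

```python
# A minimal, assumption-laden sketch of the RRB in Figure 4.
import torch
import torch.nn as nn

class RRB(nn.Module):
    def __init__(self, channels):
        super().__init__()
        concat_ch = 3 * channels                      # two inputs + their difference map
        self.block1 = nn.Sequential(
            nn.Conv2d(concat_ch, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.block2 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.res1 = nn.Conv2d(concat_ch, channels, kernel_size=1)  # residual branch 1
        self.res2 = nn.Conv2d(concat_ch, channels, kernel_size=1)  # residual branch 2

    def forward(self, f_t1, f_t2):
        diff = torch.abs(f_t1 - f_t2)                 # difference-map input
        x = torch.cat([f_t1, f_t2, diff], dim=1)
        f1 = self.block1(x)
        f2 = self.block2(f1)
        # Element-wise summation of the four feature maps (Figure 4).
        return f1 + f2 + self.res1(x) + self.res2(x)
```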

**Figure 4.** Refinement Residual Block (RRB). *F*1–*F*4 represent the feature maps that are output by different blocks.

#### *2.5. Loss Function*

In the training stage, a cross-entropy loss function optimized by Chen et al. [43] is used, which minimizes the cross-entropy loss to optimize the network parameters. Formally, the loss function is defined as Equation (11) [43]:

$$L = \frac{1}{H_0 \times W_0} \sum_{h=1, w=1}^{H_0, W_0} l(P_{hw}, y_{hw}) \tag{11}$$

where $l(P_{hw}, y) = -\log P_{hwy}$ is the cross-entropy loss and $y_{hw}$ is the label of the pixel at location ($h$, $w$) [43].
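Equation (11) is the standard pixel-wise cross-entropy averaged over all spatial positions; a minimal sketch is given below, assuming two-class logits of shape (N, 2, H, W) and integer labels of shape (N, H, W).

```python
# A minimal sketch of the pixel-wise cross-entropy loss of Equation (11).
import torch
import torch.nn.functional as F

def change_detection_loss(logits, labels):
    # F.cross_entropy averages -log P_{hw, y_hw} over all pixels, i.e. Eq. (11).
    return F.cross_entropy(logits, labels)

# Example (shapes are illustrative):
# logits = model(t1, t2)                     # (N, 2, H, W)
# loss = change_detection_loss(logits, labels.long())
```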

#### **3. Experiments and Results**

In this section, the proposed network MAFF-Net is evaluated on three publicly available benchmark datasets to demonstrate its effectiveness. First, the details of the three datasets, the CDD dataset [60], the LEVIR-CD dataset [42], and the WHU-CD dataset [61], are introduced. Next, the implementation details are presented, including the experimental environment and evaluation metrics. Then, seven state-of-the-art (SOTA) comparison methods are introduced. In this section, quantitative and qualitative analyses of these methods are presented on three datasets.

#### *3.1. Datasets and Settings*

The CDD dataset contains three types of images: synthetic images with no relative movement of objects, synthetic images with small relative movement of objects, and real remote sensing images with seasonal changes (obtained from Google Earth). In this paper, the subset of remote sensing images with seasonal changes is selected. This subset has 16,000 images with an image size of 256 × 256 pixels, of which 10,000 images are used as the training set, 3000 as the validation set, and 3000 as the test set. As shown in Figure 5, the change scenarios of this dataset include building changes, road changes, and vehicle changes. The dataset also takes objects of different sizes into account.

**Figure 5.** Illustration of samples from CDD. (Image-T1) and (Image-T2) indicate the bi-temporal image pairs. (GT) indicates the ground truth.

LEVIR-CD contains 637 very high resolution (VHR, 0.5 m/pixel) Google Earth image patch pairs of size 1024 × 1024 pixels. These bi-temporal images, spanning 5 to 14 years, exhibit significant land-use changes, especially building growth. LEVIR-CD covers various types of buildings such as villas, high-rise apartments, small garages, and large warehouses. The fully annotated LEVIR-CD contains a total of 31,333 individual instances of building change. As shown in Figure 6, each sample is cropped into 16 small patches of size 256 × 256, generating 7120 image patch pairs for training, 1024 for validation, and 2048 for testing.

**Figure 6.** Illustration of samples from LEVIR-CD. (Image-T1) and (Image-T2) indicate the bi-temporal image pairs. (GT) indicates the ground truth.

The third dataset is the WHU-CD dataset, a public building CD dataset. It covers the area where a 6.3-magnitude earthquake occurred in February 2011 and which was rebuilt in the following years. The dataset consists of a pair of HR (0.075 m) aerial images of size 32,507 × 15,354 pixels. Since the authors of the original paper did not provide a data-splitting scheme, as shown in Figure 7, the images were cropped into small patches of size 224 × 224 and divided randomly into three parts: 7918/987/955 for training/validation/testing, respectively.
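A minimal sketch of the tiling and random splitting scheme described above is given below; file handling, the random seed, and the exact split ratios are illustrative assumptions (the reported 7918/987/955 split corresponds approximately to an 80/10/10 division of the patches).

```python
# A minimal sketch of cropping a large image into non-overlapping 224x224 patches
# and splitting them randomly into train/val/test; assumes NumPy-style arrays.
import random

def tile(image, size=224):
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

def random_split(patches, ratios=(0.8, 0.1, 0.1), seed=0):
    idx = list(range(len(patches)))
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * len(idx))
    n_val = int(ratios[1] * len(idx))
    train = [patches[i] for i in idx[:n_train]]
    val = [patches[i] for i in idx[n_train:n_train + n_val]]
    test = [patches[i] for i in idx[n_train + n_val:]]
    return train, val, test
```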

**Figure 7.** Illustration of samples from WHU-CD. (Image-T1) and (Image-T2) indicate the bi-temporal image pairs. (GT) indicates the ground truth.

#### *3.2. Evaluation Metrics and Settings*

For quantitative assessment, three indices, namely the *F*1-score (*F*1), *Kappa* coefficient (*Kappa*), and overall accuracy (*OA*) are used as the evaluation metrics. These three indices can be calculated as follows:

$$P = \frac{TP}{TP + FP} \tag{12}$$

$$R = \frac{TP}{TP + FN} \tag{13}$$

$$F1 = \frac{2}{P^{-1} + R^{-1}}\tag{14}$$

$$OA = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$

$$PRE = \frac{\left(TP + FN\right) \times \left(TP + FP\right) + \left(TN + FP\right) \times \left(TN + FN\right)}{\left(TP + TN + FP + FN\right)^2} \tag{16}$$

$$Kappa = \frac{OA - PRE}{1 - PRE} \tag{17}$$

where *OA* and *PRE* denote the overall accuracy and expected accuracy, respectively. The *TP*, *FP*, *TN*, and *FN* are the number of true positives, false positives, true negatives, and false negatives, respectively.
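For reference, Equations (12)–(17) can be computed directly from the binary prediction and ground-truth maps, as in the following sketch (it assumes that both the changed and unchanged classes are present, so no denominator is zero).

```python
# A minimal sketch of the evaluation metrics of Equations (12)-(17).
import numpy as np

def cd_metrics(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    tn = np.sum(~pred & ~gt)
    fn = np.sum(~pred & gt)
    p = tp / (tp + fp)                                   # Eq. (12): precision
    r = tp / (tp + fn)                                   # Eq. (13): recall
    f1 = 2 / (1 / p + 1 / r)                             # Eq. (14)
    total = tp + fp + tn + fn
    oa = (tp + tn) / total                               # Eq. (15)
    pre = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2  # Eq. (16)
    kappa = (oa - pre) / (1 - pre)                       # Eq. (17)
    return {"F1": f1, "OA": oa, "Kappa": kappa}
```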

We implemented our proposed method in PyTorch, supported by NVIDIA CUDA, on a GeForce RTX 2080 Ti GPU. In the training stage, the feature extraction backbone of the proposed MAFF-Net is initialized from ResNet18. We used the Adam optimizer (β1 = 0.5, β2 = 0.9), and the entire training period was set to 200 epochs. The initial learning rate is 0.001 for the first 100 epochs; over the next 100 epochs, the learning rate decays linearly to 0. Considering the GPU memory, we set the batch size to 8 to facilitate GPU training.
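The training schedule described above can be reproduced with a standard PyTorch optimizer and a linear-decay scheduler, as sketched below; the placeholder model and the commented loop are illustrative and would be replaced by MAFF-Net and the loss of Equation (11).

```python
# A minimal sketch of the training schedule: Adam (beta1=0.5, beta2=0.9),
# lr = 1e-3 held for the first 100 epochs, then linearly decayed to 0 over the
# last 100 epochs, with batch size 8. The model below is a placeholder.
import torch

model = torch.nn.Conv2d(3, 2, kernel_size=1)  # placeholder; replace with MAFF-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.9))

def lr_lambda(epoch, total=200, constant=100):
    # Multiplicative factor on the initial lr: 1.0 for epochs 0-99, then linear decay.
    if epoch < constant:
        return 1.0
    return max(0.0, 1.0 - (epoch - constant) / (total - constant))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Typical loop (a data loader with batch size 8 and a bi-temporal model are assumed):
# for epoch in range(200):
#     for t1, t2, labels in train_loader:
#         optimizer.zero_grad()
#         loss = torch.nn.functional.cross_entropy(model(t1, t2), labels)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```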

#### *3.3. Comparison of Experimental Results*

In this section, the performance of the different methods is compared on the three datasets CDD, LEVIR-CD, and WHU-CD, respectively. The advantages and disadvantages of each method are further described based on the results of the quantitative and qualitative analyses. In addition, an ablation study is performed on the proposed method to compare and analyze the effectiveness of each of its modules.

#### 3.3.1. Comparison Methods

To verify the effectiveness and superiority of our method, we selected seven representative CD methods and compared their performance on CDD, LEVIR-CD, and WHU-CD, respectively. A brief description of the selected methods is as follows:


#### 3.3.2. CDD Dataset

For quantitative comparison, we calculated and summarized the evaluation metrics for CDD, LEVIR-CD, and WHU-CD, as shown in Tables 2–4, respectively. To compare the performance of each method more visually, we visualized the test results of each method on the three datasets, as shown in Figures 8–10, respectively. White indicates correctly detected changes, black indicates correctly detected unchanged regions, red indicates false alarms, and blue indicates missed changes.


**Table 2.** Comparison of CDD dataset results. The best scores are highlighted in bold.

**Table 3.** Comparison of LEVIR-CD dataset results. The best scores are highlighted in bold.


**Figure 8.** Illustration of a qualitative comparison on the CDD dataset. (**a**–**h**) indicate samples from CDD and the change maps obtained with different methods. White indicates correctly detected changes, black indicates correctly detected unchanged regions, red indicates false alarms, and blue indicates missed changes.

**Figure 9.** Illustration of a qualitative comparison on the LEVIR-CD dataset. (**a**–**h**) indicate samples from LEVIR-CD and the change maps obtained with different methods. White indicates correctly detected changes, black indicates correctly detected unchanged regions, red indicates false alarms, and blue indicates missed changes.

As can be seen from Table 2, the proposed MAFF-Net ranks first in *F*1, *Kappa*, and *OA* on the CDD dataset, which indicates that the proposed network performs best on this dataset. It is also evident from Figure 8 that the proposed network can better mark the change regions, with few wrong and missed detections. Specifically, as can be seen from the data in Table 2, CD-Net, which does not pay attention to the connections and interactions between multi-scale features, performs relatively poorly in the three evaluation metrics, with an *F*1 score 14.6% lower than that of the proposed MAFF-Net. This is partly related to its fewer network levels and relatively simple structure. Adopting early fusion and late fusion strategies, respectively, and using skip-connected encoding-decoding, the baselines FC-EF, FC-Siam-conc, and FC-Siam-diff achieve better performance with their compact and efficient structures. Among these three baselines, the late fusion baselines show a clear advantage over the early fusion baseline. Fusing the feature maps of the bi-temporal image pair with their difference maps achieves better results than fusing the feature maps of the bi-temporal image pair alone: FC-Siam-diff scores 0.8%, 0.9%, and 0.1% higher than FC-Siam-conc in *F*1, *Kappa*, and *OA*, respectively. This is because the original image encoding features are preserved as much as possible while the difference maps are obtained, which helps the network achieve better performance.

**Figure 10.** Illustration of a qualitative comparison on the WHU-CD dataset. (**a**–**h**) indicate samples from WHU-CD and the change maps obtained with different methods. White indicates correctly detected changes, black indicates correctly detected unchanged regions, red indicates false alarms, and blue indicates missed changes.

Based on the attention mechanism, which can further focus on the information exchange between feature maps, DASNet works better than FC-EF. IFN pays more attention to the connection and interaction of multi-scale information; it introduces channel attention and spatial attention and uses a post-fusion strategy for deep supervision. Its *F*1 and *Kappa* scores reach 90.1% and 89.2%, respectively. STANet proposes a spatial-temporal attention module based on a feature pyramid to better adapt the network to detection in complex scenes, ranking second in all evaluation metrics. The proposed MAFF-Net achieves the highest level in all metrics. It is able to detect and label the change regions better than the other methods because the network employs an attention-based cross-layer feature fusion strategy and also designs a refinement residual block to further improve detection performance.

Also, the qualitative analysis in Figure 8 allows for further analysis of the performance of each network. For visual analysis, eight challenging sets of bi-temporal images were selected and visualized. Each set of images contains different ranges of change regions or change scenes. Among the three FCN-based baselines, FC-Siam-conc and FC-Siam-diff can give better results than FC-EF. As can be seen in Figure 8, only a small number of change regions (Figure 8a) can be marked by FC-EF, but it performs poorly for smaller change regions and more complex scenes (Figure 8b–h). This is because it does not preserve the features of each original image, especially the shallow features, which makes the detected change regions significantly inaccurate. In general, the other two baselines perform better than FC-EF, as evidenced by the completeness of the information in the regions of change detected in the illustrations. However, they still suffer from many missed and false detections, such as Figure 8b–g. In particular, in Figure 8e, they do not detect the change region at all. Therefore, there is still potential for improvement. By introducing dual attention in the decoding stage, DASNet can detect most of the change regions. However, its detection performance for small change regions needs to be improved. For example, in Figure 8e, there are many missed regions in its detection results, and there are also false detection regions. This demonstrates that it is not yet quite accurate in terms of the boundaries and details of the change regions. In addition, it also does not perform well in Figure 8b,f,h with false detections and missed detections.

IFN and STANet are relatively more complete in terms of local detail because of the introduction of channel and spatial attention. However, they still produce false positives and false negatives when detecting some very small target regions or edges, as shown in the red and blue regions in Figure 8c,e,g,h. Some regions are over-smoothed, and some edge information is ignored to some extent. The proposed MAFF-Net can better label the change regions and accurately detect their edges. It can be seen from the exhibited samples that there are very few red and blue regions representing false and missed detections. In particular, the detection performance is good for small and complex change regions, as shown in Figure 8e–h. This also demonstrates that the proposed network can detect the change regions accurately in general.

#### 3.3.3. LEVIR-CD Dataset

As can be seen from Table 3, the differences in performance among the three FCN baselines are not significant; the best-performing of the three is FC-Siam-diff, with *F*1 and *Kappa* scores of 83.7% and 82.8%, respectively. DASNet, by introducing dual attention, improved the *F*1 score by about 0.9% compared to the three FCN baselines. The *F*1 score of IFN reached third place with 86.2%, and its *Kappa* and *OA* scores also performed well. However, the scores of all its metrics are lower than those of STANet, which may be because STANet pays more attention to multi-scale information while introducing attention. By introducing an attention mechanism involving multiple scales, the proposed MAFF-Net improves the *F*1 score to 89.7%, which is better than the other comparative methods. Moreover, *Kappa* and *OA* reach the highest values among the compared methods, with 89.1% and 98.7%, respectively.

Figure 9 also illustrates the change maps on eight selected sets of bi-temporal images. The change regions in these images cover multiple scenes, areas, shapes, and distribution ranges. For multiple regularly shaped building changes in Figure 9a,b, the overall contours of the buildings are correctly detected. However, the detection results of the CD-Net and FC-EF methods still have obvious false detection and missed detection areas. Although STANet can locate the change region, the detection of more complex and small change regions is not entirely correct. For example, as shown in Figure 9f,h, the proposed MAFF-Net is more accurate than the other methods, as seen from the fewer regions marked in red and blue. For Figure 9a,b,d, the attention-based methods DASNet and STANet and the proposed attention-based guided cross-layer feature fusion network MAFF-Net are visually closer to the GT. For the more densely distributed change regions in Figure 9c, DASNet, STANet, and MAFF-Net maintain visual correctness, while MAFF-Net has fewer errors and can accurately detect and distinguish multiple dense change regions. However, for Figure 9f–h with more complex edges and smaller change regions, IFN, DASNet, and STANet do not perform well. On the contrary, MAFF-Net shows better adaptability, and it can accurately detect changing regions with complex shapes and small objects.

#### 3.3.4. WHU-CD Dataset

According to the data in Table 4, the performance of the methods with FCN as the baseline does not differ much. The dual attention-based DASNet performs slightly better than IFN and STANet, with scores of 90.7%, 90.1%, and 99.0% for *F*1, *Kappa*, and *OA*, respectively. We attribute this to the fact that the weighted double-margin contrastive (WDMC) loss used by DASNet can alleviate the problem of sample imbalance. The proposed MAFF-Net achieved the best scores in all evaluation metrics compared to the other comparison methods. Compared with the methods using FCN as the baseline, the proposed method obtained increases of 9.1%, 9.6%, and 1.0% in *F*1, *Kappa*, and *OA*, respectively. This also demonstrates the effectiveness of the proposed multi-attention-guided feature fusion method. Compared to DASNet, IFN, and STANet, the proposed method yields gains of 1.7%, 2.0%, and 0.4% in *F*1, *Kappa*, and *OA*, respectively. Such gains arise from our fusion strategy, which fully considers multi-scale features while effectively exploiting the advantages of attention to greatly improve the network performance.


**Table 4.** Comparison of WHU-CD dataset results. The best scores are highlighted in bold.

For visual comparison, Figure 10 shows some typical CD results for the test samples in the WHU-CD dataset. As shown in Figure 10a,c–e,h, there are many missed detections and false detections in the compared methods. As shown in Figure 10c,e,h, CD-Net not only has false detections but also has many missed detection regions. The performance of the FCN-based FC-EF, FC-Siam-conc, and FC-Siam-diff is improved, and the missed detection regions are significantly reduced. However, they still have the same problems as CD-Net, as shown in Figure 10d,e,h. In Figure 10e,h, the attention-based DASNet, IFN, and STANet do not perform well, with significant missed detection regions and some false detection regions. In terms of consistency with the GT, the proposed MAFF-Net achieves the best visual performance. Specifically, as shown in the samples in Figure 10, MAFF-Net significantly reduces the missed detections and has a very low false detection rate compared with the other methods. In addition, the change maps generated by MAFF-Net have clearer and more accurate boundaries than those of the other methods.

#### *3.4. Ablation Study*

Our proposed model achieves superior performance in the CD task. To validate the effectiveness and feasibility of the proposed method, we conducted a series of ablation experiments on the three datasets CDD, LEVIR-CD, and WHU-CD. In total, five ablation experiments were conducted, in which the Baseline represents the ResNet18 network structure: Baseline, Baseline+FEM, Baseline+FEM+FFM\_S1, Baseline+FEM+FFM\_S1+RRB, and MAFF-Net (Baseline+FEM+FFM\_S1+RRB+FFM\_S2). As shown in Figure 11, the Baseline does not achieve good performance in detecting change regions, especially when the change scene is complex or the change region is small (Figure 11d). Compared with the Baseline, the Baseline+FEM method obtains richer features after adding the FEM, which helps the network detect most of the change regions. It can be seen that Baseline+FEM+FFM\_S1 can effectively remove some irrelevant information (Figure 11f) while further capturing the change features and refining the feature representation. The FFM\_S1 module adopts a cross-layer fusion strategy, which helps the model fully fuse high-level and low-level features to achieve a better feature representation. Compared with the Baseline+FEM method, the Baseline+FEM+FFM\_S1 method detects more accurate and complete change regions. However, it can also be found that this method is still slightly lacking when faced with small change regions or weakly characterized features (Figure 11f). Therefore, the Baseline+FEM+FFM\_S1+RRB method further refines the feature representation, which helps to detect smaller change features and improve the network performance. As can be seen from Figure 11g, the change map obtained by this method is already very close to the change regions of the GT. Finally, the method proposed in this paper performs a final fusion of the feature maps to obtain a prediction map that is closest to the real change regions. As can be seen from Figure 11h, the change map obtained by the proposed method is very close to the GT, which also demonstrates the effectiveness of the proposed method. Meanwhile, the proposed method shows good accuracy on the three different datasets. By comparing the visualization results of each module, the effectiveness and accuracy of the MAFF-Net method proposed in this paper are effectively demonstrated.

**Figure 11.** Visualization comparison plots of each network on different datasets in the ablation experiment. (**1**–**3**) indicate samples from the CDD dataset, (**4**–**6**) indicate samples from the LEVIR-CD dataset, and (**7**–**9**) indicate samples from the WHU-CD dataset. (**a**) Image T1. (**b**) Image T2. (**c**) Ground truth. (**d**) Baseline. (**e**) Baseline+FEM. (**f**) Baseline+FEM+FFM\_S1. (**g**) Baseline+FEM+FFM\_S1+RRB. (**h**) MAFF-Net (Baseline+FEM+FFM\_S1+RRB+FFM\_S2).

In addition, we also performed statistics and comparisons on the *F*1, *Kappa*, and *OA* values of different methods. As shown in Table 5, the model achieves optimal performance when all innovation modules are added, which also proves the effectiveness of our proposed innovation modules.


**Table 5.** Ablation study of different modules on different datasets. All the scores are described in percentage (%). The best scores are highlighted in bold.

In the Baseline+FEM method, as can be seen, there is a significant improvement in three indicators compared with the Baseline method. In the CDD dataset, *Kappa*, *F*1, and *OA* increased by 6.4%, 5.6%, and 1.5% compared with the Baseline, respectively. In the LEVIR-CD dataset, *Kappa*, *F*1, and *OA* were increased by 3.9%, 3.7%, and 0.4%, respectively, compared with the Baseline. In the WHU-CD dataset, *Kappa*, *F*1, and *OA* were increased by 4%, 3.9%, and 0.2%, respectively, compared with the Baseline.

In the Baseline+FEM+FFM\_S1 method, it can be seen that all metrics are improved compared to the Baseline+FEM method. In the CDD dataset, *Kappa*, *F*1, and *OA* improve by 1.1%, 1%, and 0.3%, respectively, compared to Baseline+FEM. In the LEVIR-CD dataset, *Kappa*, *F*1, and *OA* improve by 1.3%, 1.2%, and 0.1%, respectively. In the WHU-CD dataset, *Kappa*, *F*1, and *OA* improve by 1.6%, 1.5%, and 0.1%, respectively. The improvement in all metrics on all datasets indicates the effectiveness of the proposed FFM\_S1, while the joint use of FFM\_S1 and FEM achieves better performance and makes the model more accurate.

In the Baseline+FEM+FFM\_S1+RRB method, there are improvements in all metrics compared with the Baseline+FEM+FFM\_S1 method. In the CDD dataset, *Kappa*, *F*1, and *OA* improve by 1.6%, 1.3%, and 0.3%, respectively, compared to Baseline+FEM+FFM\_S1. In the LEVIR-CD dataset, *Kappa*, *F*1, and *OA* improve by 0.6%, 0.6%, and 0.1%, respectively. In the WHU-CD dataset, *Kappa*, *F*1, and *OA* improve by 0.6%, 0.5%, and 0.1%, respectively. The improvement in all metrics on all datasets indicates that the proposed RRB enhances the feature representation of the feature maps, while the combined use of FFM\_S1, FEM, and RRB leads to better performance of the model.

In the Baseline+FEM+FFM\_S1+RRB+FFM\_S2 method, all metrics are improved compared to the Baseline+FEM+FFM\_S1+RRB approach. In the CDD dataset, *Kappa*, *F*1, and *OA* improve by 0.6%, 0.6%, and 0.2%, respectively, compared to Baseline+FEM+FFM\_S1+RRB. In the LEVIR-CD dataset, *Kappa* and *F*1 improve by 0.9% and 0.9%, respectively. In the WHU-CD dataset, *Kappa*, *F*1, and *OA* improve by 0.6%, 0.5%, and 0.1%, respectively. The improvement in all metrics on all datasets indicates that the proposed FFM\_S2 facilitates the fusion and exchange of multi-scale feature information and that FFM\_S1 and FFM\_S2 reinforce each other in feature extraction. The experiments also show that MAFF-Net helps the network fuse multi-scale features and achieve multi-scale information communication, which improves the performance of the network.

#### *3.5. Efficiency Analysis of the Proposed Network*

Although the proposed network MAFF-Net achieves encouraging performance, it has some potential limitations. The computational complexity of MAFF-Net is relatively high and the number of parameters is large, which is not friendly to devices and applications with limited resources. In this section, the number of parameters (in millions, M) and the training time per epoch (in min/epoch) are used as quantitative indicators for evaluation. As shown in Figure 12, the number of trainable parameters of MAFF-Net is 49.08 million, which is the largest among the compared methods. However, from another perspective, the training efficiency of the proposed MAFF-Net is relatively impressive. Compared with STANet and DASNet, the training time of the proposed method is reduced by 56.22% and 40.86%, respectively, which makes the proposed method more valuable in practical applications under the same equipment conditions.

**Figure 12.** Illustration of an efficiency analysis of the comparison methods.

Considering both the number of training parameters and the training time, the proposed method still has room for improvement and enhancement in the future. For example, model compression can be applied to the proposed network, employing pruning and knowledge distillation [63,64] to reduce the size of the model.

#### **4. Conclusions**

In this paper, we propose a novel feature fusion network for remote sensing image CD tasks. To enhance the feature representation, we propose a Feature Enhancement Module (FEM), which introduces coordinate attention (CA) that can capture long-range dependencies with precise position information while modeling inter-channel relationships. The FEM helps the network further refine the features extracted by the backbone network ResNet18. The quantitative and qualitative analyses of the ablation study show that the FEM improves the performance over the Baseline, which demonstrates its reasonableness and effectiveness. Considering that layer-by-layer feature fusion may lose part of the semantic information, we propose an FFM employing a cross-layer feature fusion strategy. The FFM uses semantic cues in the high-level feature map to guide feature selection in the low-level feature map. In addition, to highlight changing regions and suppress useless features, we introduce a CBAM into the FFM, which combines the advantages of channel attention and spatial attention, allowing the model to learn which regions to focus on and to pay more attention to critical information. Depending on the input features, we classify the FFM into FFM\_S1 and FFM\_S2, both of which further enhance the feature fusion effect. Based on the ablation study in Section 3, we can see that the FFM significantly improves the performance of the network. To compensate for the shortcomings of using a single convolutional kernel for feature refinement, we propose a Refinement Residual Block (RRB) that employs a residual structure. The RRB changes the number of channels of the aggregated features and uses convolutional blocks to further refine the feature representation. Based on the quantitative and qualitative comparisons between the proposed MAFF-Net and other methods, the proposed method is able to efficiently detect changing regions and has a strong ability to select features through a feature fusion strategy guided by multiple attention mechanisms. On the three publicly available benchmark datasets CDD, LEVIR-CD, and WHU-CD, the *F*1 scores of MAFF-Net are improved by at least 1%, 2%, and 3%, respectively, compared to the other methods. This demonstrates the better performance of our method compared with other SOTA methods.

However, it should be noted that, as shown in Figure 12, although the proposed model has an advantage in terms of training speed, the number of parameters of the proposed model is relatively large, reaching 49.08 M. This poses potential limitations for its practical application in the future. Therefore, in future work, we hope to make the network lightweight by using model compression techniques. In addition, the proposed method addresses the CD task for bi-temporal remote sensing images; in future work, we will focus on the CD task for multi-temporal remote sensing images.

**Author Contributions:** Conceptualization, J.M.; methodology, J.M.; software, Y.L.; validation, J.M. and Z.Z.; formal analysis, G.S.; investigation, G.S.; resource, Y.L.; data curation, J.M.; writing-original draft preparation, J.M.; writing-review and editing, G.S., Z.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Natural Science Foundation of China under Grant No. 62162059 and the National Key R&D Plan Project under Grant No. 2018YFC0825504.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The CDD, LEVIR-CD, and WHU-CD datasets are openly available at https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9 (accessed on 1 December 2021), https://justchenhao.github.io/LEVIR/ (accessed on 1 December 2021), and http://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 1 December 2021), respectively.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **References**

