Article
Peer-Review Record

A Novel Squeeze-and-Excitation W-Net for 2D and 3D Building Change Detection with Multi-Source and Multi-Feature Remote Sensing Data

Remote Sens. 2021, 13(3), 440; https://doi.org/10.3390/rs13030440
by Haiming Zhang 1, Mingchang Wang 1,2,*, Fengyan Wang 1, Guodong Yang 1, Ying Zhang 1, Junqian Jia 1 and Siqi Wang 1,3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 22 December 2020 / Revised: 15 January 2021 / Accepted: 22 January 2021 / Published: 27 January 2021

Round 1

Reviewer 1 Report

The topic of the manuscript, change detection of buildings in an urban environment, and the choice of methods, i.e. deep learning and multi-source data for the detection, make this work interesting. This value is, however, diminished by the presentation of the work, including:

  • Structure of the paper, e.g. Methodology and Introduction are overloaded by explanations of previous/existing research, Results and their interpretation are split into two sections (Experiments and Discussion), a late introduction of Data (and labels/annotations for the supervised learning);
  • It is not clear where the novelty lies (both W-Net and SE-Net are known architectures/methods);
  • Math is sloppy, explanations are wordy and often ambiguous, terminology is at times confusing.

The following provides more detailed comments referring to the text in the manuscript (e.g. with line numbers) and elaborates on the above points.

Section 2.1.1: explanations of U-net are out of place in this section of methodology for your solution.

Eq. 1: what is the index k?

Eq. 2: what is N, e.g. number of image pixels? (related question: what is the output, e.g. a binary map giving change?)

Lines 282-283: what is the sample, e.g. a pixel? If that is the case, better use consistently the same way of indexing a pixel, e.g. (i,j) that is used in eq. 3, 9, 10 ... instead of i for pixel index.

L284-286: what are low- and high-dimensional features?

Section 2.1.2: the first two paragraphs (and further) sound more like a description of previous/existing work instead of a concise presentation of the architectural choice (and novelty?) of your method.

L323: what is a feature channel?

L325-327: what are H, W and C? What is the dimension of the output of a convolution operator (is it not a 3D image, with the 3rd dimension being as image band/feature)?

Are equations 3, 4 and 5 connected, i.e. does a later equation use the calculation from a previous equation? For example, z in eq. 4 is z_c as calculated from eq. 3; s_c in eq. 5 is s as calculated in eq. 4.

Use consistently the same notation: e.g. U_c and u_c in eq. 3 & 5 both denote the same image, i.e. c-th feature map; keep only one.

Eq. 4: what are W1 and W2?

Eq. 5: use the index c for the term X-tilde.

L350-361: a long introduction for the choices in your method; most of the text here is better placed within a 'Previous work' section.

Section 2.2.1: what are the features referred to in the title? What is a feature, e.g. is it a (2D) image/map? 

Eq. 6-8: indexing is done differently from eq. 3, e.g. j for a pixel instead of (i,j); i is index for an image band (e.g. red or green bands?) while eq. 3 uses c for the analogous feature map; is N equal to H*W in eq. 3? 

Use the same indexing throughout, in order to reduce confusion.

Eq. 9-13: is the output of the operation a 2D image/map? 

Eq. 10: should the denominator be under the sums? 

L395-399: is p(i,j) a pixel or a kernel around pixel (i,j)? or around pixel (x,y)??

Table 1: explain the number of bands/band number. Eq. 6-8 suggest that a color moment is a scalar; this table seems to suggest it is an image (?)

Figure 5: what do notations Band-1, Band-3, Band-9, Band-15 mean?

Section 3.1.1: How is the ground truth/reference map acquired/computed? 

L555: give a reference for 'original article'.

Table 4: DeepLabv3+ performs better than the method of this paper.

Table 5: U-Net performs better (in accuracy and run-time) than the method of this paper.

Section 4.1.2: Why do you use different methods (manual, automatic) and different training sets (size) for the different methods during their comparison? This calls into question the fairness of the comparison and the conclusions drawn.

L700-701: what is validation performance (loss and accuracy) during training? Do you use a separation of train and validation set? Is there a test set; and if there is, what is the performance in the test set? 

L753 & Fig.15: there is no significant difference in the results from different methods. 

Fig. 4(d): what does the range -0.2 to 1.4 mean??

Author Response

Response to Reviewer

We would like to thank the reviewer for providing truly outstanding comments and suggestions that significantly helped us improve the technical quality and presentation of our paper.

Our comments are inset in blue following each point of the reviewer. The text quoted directly from the revised manuscript is set in italics. The line numbers cited in our response refer to the revised manuscript.

Reviewer #1’s comments

  • Structure of the paper, e.g. Methodology and Introduction are overloaded by explanations of previous/existing research, Results and their interpretation are split into two sections (Experiments and Discussion), a late introduction of Data (and labels/annotations for the supervised learning);

Thank you for your careful review; your comments are valuable. Our article structure does have some problems, and through careful reading we found several parts that needed to be modified. We have deleted the excessive explanation of previous research in the introduction, added an introduction to the source of the reference map in Section 3.1, and reorganized the content that was confusingly split between the experimental results and the discussion. The specific modifications are shown in lines 51-76, 101-118, 474-476, 484-488:

With the continuous development of remote sensing technology and computer technology, more and more satellite-borne and airborne sensors such as QuickBird, Worldview, GaoFen, Sentinel, ZY, Pléiades, et al. are designed, manufactured, and put into operation. In this case, massive and diverse remote sensing data are produced, which also enriches the data sources of change detection [5]. The available data types for change detection have expanded from the medium- and low-resolution optical remote sensing images to high resolution or very high resolution (HR/VHR) optical remote sensing images, light detection and ranging (LiDAR), or synthetic aperture radar (SAR) data. HR/VHR remote sensing images contain richer spectral, texture, and shape features of ground objects, allowing a detailed comparison of geographic objects in different periods. Furthermore, non-optical image data such as LiDAR or SAR can provide observation information with different ground physical mechanisms, and solve the technical problem of optical sensors being affected by weather conditions. It can also make up for the shortcoming, that is, HR/VHR remote sensing images can provide a macro view of the earth observation, but it is difficult to fully reflect the types and attributes of objects in the observation area [6]. Multi-source remote sensing data such as HR/VHR remote sensing images, LiDAR, or SAR data can provide rich information for the observed landscape through various physical and material perspectives. If these data are used comprehensively and collaboratively, the data sources of change detection will be significantly enriched, and the detection results can describe the change information more accurately and comprehensively [7]. However, due to the diverse sources of multi-source remote sensing data, it is difficult to compare and analyze these heterogeneous data based on one method. Most of the current related research focuses on the use of homogeneous remote sensing data for change detection [8-12]. Therefore, in order to use remote sensing data for change detection more fully and effectively, it is exigent to develop a change detection method that can comprehensively use multi-source remote sensing data.

Buildings have unique geographic attributes and play an essential role in the process of urbanization. The accurate depiction of their temporal and spatial dynamics is an effective way to strengthen land resource management and ensure sustainable urban development [27]. Therefore, BCD has always been a research hotspot in the field of change detection. At present, the application of HR/VHR remote sensing images has been popularized, which provides a reliable data source for change detection tasks, especially for identifying detailed structures (buildings, et al.) on the ground. In addition, LiDAR and SAR data have also received extensive attention in urban change detection. Related researches have appeared one after another, such as extracting linear features from bitemporal SAR images to detect changing pixels [28], using time series SAR images and combining likelihood ratio testing and clustering identification methods to identify objects in urban areas [29], fusing aerial imagery and LiDAR point cloud data, using height difference and gray-scale similarity as change indicators, and integrating spectral and geometric features to identify building targets [30]. However, most of these studies are only based on SAR images, and some use optical remote sensing images as their auxiliary data, so the degree of data fusion is low, and some methods cannot even be directly applied to optical images. In addition, the degree of 3D change detection is relatively low, and there is almost no suitable method capable of simultaneously performing 2D and 3D change detection.

The image of the experimental area and the reference change map are shown in Figure 5. Among them, the reference change map is provided by the data publisher.

Since the MtS-WH data set is mainly used for theoretical research and validation of scene change detection methods, and the original data only provides the category label of the scene. To obtain the changing scene, we obtained the reference change map of the building area by comparing the scene categories.

  • Not clear where is the novelty (both W-net and SEnet are known architectures/methods);

For the novelty of the network we designed, we would like to explain a few points. First, some people may have proposed a neural network called W-Net. However, we have read a lot of literature and found that some so-called W-Nets do not have a two-sided input and single-sided output structure; the networks they design are merely W-shaped. For example, some people designed a W-shaped network by stacking two U-Nets and named the network W-Net. Second, although SENet is also a known network, some of its excellent properties are rarely used. As far as we know, research on embedding squeeze-and-excitation (SE) modules into U-Net networks is rare. The SE module has good channel relationship analysis capabilities, and we innovatively use this structure in the new network. Third, the W-shaped network we designed has two-sided input terminals. This structure can accommodate the input of heterogeneous remote sensing data: we input remote sensing data on both sides at the same time and adjust the model parameters through category labels. In our opinion, such a structure is original. Fourth, we redesigned the structure of the network; specifically, this includes deepening the convolution operation, using the Batch Normalization layer, and replacing the loss function. These operations can improve the performance of the model, which has also been verified in experiments. Fifth, the proposed squeeze-and-excitation W-Net is a powerful and universal network structure, which can learn the abstract features contained in homogeneous and heterogeneous data through a structured symmetric system. Sixth, the form of two-sided input not only satisfies the input of multi-source data, but is also suitable for multiple features derived from multi-source data. We innovatively introduce the squeeze-and-excitation module as a strategy for explicit modeling of the interdependence between channels, which makes the network more directional and can recalibrate the feature channels, emphasize essential features, and suppress secondary features. Moreover, the squeeze-and-excitation module is embedded between each convolution operation, which can overcome the insufficiency of the convolution operation, namely that it only takes into account the feature information in the local receptive field, and improve the global reception ability of the network.
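To make the two-sided-input structure and the placement of the SE modules concrete, the following is a minimal conceptual sketch in PyTorch. It is an illustration only, with assumed channel widths and depth and without the skip connections of the full network; it is not the implementation used in the paper.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-excitation: global average pooling -> bottleneck FCs -> channel re-weighting.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # squeeze + excitation
        return x * w                                       # scale (channel recalibration)

def conv_block(cin, cout):
    # Conv -> BatchNorm -> ReLU followed by an SE block, as described above.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                         SEBlock(cout))

class TwoSidedWNet(nn.Module):
    # Two contracting paths (one per data source) feeding a single expansive path.
    def __init__(self, in_a=3, in_b=3, base=32):
        super().__init__()
        self.enc_a = nn.Sequential(conv_block(in_a, base), nn.MaxPool2d(2), conv_block(base, 2 * base))
        self.enc_b = nn.Sequential(conv_block(in_b, base), nn.MaxPool2d(2), conv_block(base, 2 * base))
        self.dec = nn.Sequential(conv_block(4 * base, 2 * base),
                                 nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                                 conv_block(2 * base, base),
                                 nn.Conv2d(base, 1, 1))      # single-sided output: change-probability map
    def forward(self, xa, xb):
        fa, fb = self.enc_a(xa), self.enc_b(xb)              # two-sided inputs (e.g. image and derived features)
        return torch.sigmoid(self.dec(torch.cat([fa, fb], dim=1)))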

  • Math is sloppy, explanations are wordy and often ambiguous, terminology is at times confusing.

Thank you for your detailed comments. Based on your comments, we have revised and refined the formulas, explanations and terms in the article. For the same content, we use a uniform index. For some vague expressions, we have modified them.

  • Section 2.1.1: explanations of U-net are out of place in this section of methodology for your solution.

As you said, our explanation of U-Net in Section 2.1.1 was redundant. In this section we only need to explain how we designed a bilaterally symmetrical end-to-end network architecture; there is no need to explain U-Net at length. Therefore, we deleted the explanation of U-Net and deleted Figure 2. The specific modifications are shown in lines 239-240:

U-Net is an extension of FCN and is currently a widely used semantic segmentation network with good scalability [52]. It is a U-shaped symmetrical structure with a contracting path on the left and an expansive path on the right. The contracting path can extract image features and reduce the spatial dimension, and the expansive path restores the image details and spatial dimensions. The skip connection operation between the two sides can combine the low-dimensional simple features on the left with the high-dimensional complex features on the right so that the expansive path can better restore image detail information and restore image accuracy. The left side of the network contains 4 sets of encoding modules, and the right side contains 4 sets of decoding modules, with ReLu as the activation function. Each group of encoding modules includes 2 convolutional layers with a convolution kernel size of 3*3 and 1 downsampling layer, and each group of decoding modules includes 2 convolutional layers with a convolution kernel size of 3*3 and 1 upsampling layer. At the end of the network is a convolutional layer with a convolution kernel size of 1*1, which is used to map a 64-dimensional feature map equal to the size of the input image into a 2-dimensional feature map, and a Softmax classifier is used to calculate the probability value of each pixel belonging to a certain category [47]. The U-Net model structure is shown in Figure 2.

Figure 2. Schematic diagram of the U-Net structure.

  • Eq. 1: what is the index k?

The meaning of k was not explained, and we are sorry for that. In Eq. 1, k represents the k-th neuron. The specific modifications are shown in line 264:

where,  is the activation value of the k-th neuron after transformation;

  • Eq. 2: what is N, e.g. number of image pixels? (related question: what is the output, e.g. a binary map giving change?)

Our explanation of what N means was unclear, and we have modified it. N represents the number of predicted values output by the model. The output of the model refers to the model's prediction of the input data. For example, for data set , where  is the feature value or input variable,  is the predicted value, which is the output of the model. The simplest output has only two discrete values,  or . The specific modifications are shown in lines 271-272:

where,  represents the number of predicted values output by the model;
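As an illustration only, assuming Eq. 2 takes the standard binary cross-entropy form averaged over the N predicted values (the exact formula is given in the manuscript), the role of N can be seen in this toy computation:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)                  # avoid log(0)
    n = y_true.size                                         # N = number of predicted values
    return -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / n

y_true = np.array([1, 0, 1, 1])                             # labels: changed (1) / unchanged (0)
y_pred = np.array([0.9, 0.2, 0.7, 0.4])                     # predicted probabilities from the model
print(binary_cross_entropy(y_true, y_pred))                 # about 0.40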

  • Lines 282-283: what is the sample, e.g. a pixel? If that is the case, better use consistently the same way of indexing a pixel, e.g. (i,j) that is used in eq. 3, 9, 10 ... instead of i for pixel index.

The samples in Eq. 3 represent pixels, but the loss value calculated by the loss function is based on a batch of data, and the number of samples in a batch is represented by N. Using  to indicate a particular sample could easily cause confusion, so we have replaced  with . In eq. 3, 9, 10...,  still represents pixels. The specific modifications are shown in lines 270-272:

 is the sample label;  is the predicted label of the sample by the model;

  • L284-286: what are low- and high-dimensional features?

Our expression may have caused confusion; we apologize for that. The W-shaped network we designed has a contracting path on the left and right sides, and an expansive path in the middle. In the contracting path, layer-by-layer convolution and down-sampling are performed, and in the expansive path, layer-by-layer convolution and up-sampling are performed. As the convolution and down-sampling operations proceed, shallow, middle, and deep features are extracted from the original image. The up-sampling process is the process of restoring the original image based on the feature maps. Each step of convolution and down-sampling corresponds to a step of image restoration (cropping the feature map of the contracting path to the same size as the corresponding feature map of the expansive path). Moreover, image restoration is a deeper convolution and sampling process. Therefore, we used the expressions low-dimensional features and high-dimensional features here.

  • Section 2.1.2: the first two paragraphs (and further) sound more like a description of previous/existing work instead of a concise presentation of the architectural choice (and novelty?) of your method.

Thank you for your valuable comments. Our description of deep convolutional networks such as VGGNet, DenseNet, and ResNet was superfluous, and the attention mechanism did not need to be explained at such length. We deleted some content and refined the sentences. The specific modifications are shown in lines 286-302:

The W-shaped network is improved on the basis of U-Net, expanding the path of data input, deepening the convolution operation, accelerating the training speed of the model, improving the robustness of the model, and effectively preventing overfitting. However, the convolution operation can only be along the data input channel, fusing the spatial and channel information in the local receptive field [55]. In addition, when comprehensively considering the multi-source data and the multiple features derived from it, it is difficult to model the spatial dependence of the data based on the information feature construction method of the local receptive field. Moreover, the repeated convolution operation without considering the spatial attention is not conducive to the extraction of useful features.

We introduce the attention mechanism [56-58] strategy, using global information to explicitly model the dynamic nonlinear relationship between channels, which can simplify the learning process and enhance the network representation ability. The main function of the attention mechanism is to assign weights to each channel to enhance important information and suppress secondary information. The main operation can be divided into three parts: the squeeze operation , the excitation operation , and the fusion operation . Its operation flow chart is shown in Figure 2.

  • L323: what is a feature channel?

Our expression here was wrong; we are sorry. We have changed "feature channel" to "feature map". For convolution operations, a large part of the work is to improve the receptive field, that is, to fuse more features spatially or to extract multi-scale spatial information, such as the multi-branch structure of the Inception network. For feature fusion in the channel dimension, the convolution operation basically defaults to fusing all channels of the input feature map. The Group Convolution and Depthwise Separable Convolution in the MobileNet network group channels mainly to make the model more lightweight and reduce the amount of calculation. The innovation of the squeeze-and-excitation operation is to focus on the relationship between channels, hoping that the model can automatically learn the importance of different channel features. Therefore, what we wanted to express here is "feature map". The specific modifications are shown in lines 306, 315, 321, 323:

The squeeze operation uses a global average pooling method to compress features along the spatial dimension and scale each two-dimensional feature map to a real number, which has a global receptive field and can represent global information.

The squeeze operation only obtains a 1*1 global descriptor, which cannot be used as the weight of each feature map.

That is, the channel weight calculated by the excitation operation is fused with the original feature map, and the calculation is as shown in formula (5):

where,  is the c-th global description;  is the c-th original feature map.

  • L325-327: what are H, W and C? What is the dimension of the output of a convolution operator (is it not a 3D image, with the 3rd dimension being as image band/feature)?

Here we use the letters H, W, and C to denote the height, width, and number of feature maps, respectively. The output of each convolution kernel is a two-dimensional feature map, and the output of the entire convolution layer is the stack of these feature maps; the third dimension represents the number of feature maps, which can also be understood as the number of channels. The important physical meaning of convolution is the weighted superposition of one function (such as a unit response) on another function (such as an input signal). For a linear time-invariant system, if the unit response of the system is known, then convolving the unit response with the input signal is equivalent to weighting and superimposing the unit response at each time point of the input signal, directly yielding the output signal. Therefore, the third dimension of the convolution output is a quantitative representation of the number of signals.
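As a concrete (toy) illustration of the H, W, C convention, a convolution layer with 64 kernels applied to a 3-band 256*256 image produces 64 feature maps of size 256*256 (PyTorch stores them as N, C, H, W):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)                    # one 256*256 image with 3 bands
y = nn.Conv2d(3, 64, kernel_size=3, padding=1)(x)
print(y.shape)                                     # torch.Size([1, 64, 256, 256]): C = 64 feature maps of size H*W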

  • Are equations 3, 4 and 5 connected, i.e. a later equation uses the calculation from a previous equation? For example, z in eq. 4 is z_c as calculated from eq. 3; s_c in eq. 5 is s calculation in eq. 4.

Yes, Equations 3, 4, and 5 are connected. The input of Eq. 4 is the calculation result of Eq. 3, and the input of Eq. 5 is the calculation result of Eq. 4. The squeeze-and-excitation module first performs a squeeze operation on the feature maps obtained by convolution to obtain channel-level global features, then performs an excitation operation on the global features to learn the relationship between the channels and obtain the weights of the different channels, and finally multiplies these weights by the original feature maps to get the final features. In essence, the squeeze-and-excitation module performs attention or gating operations in the channel dimension. This attention mechanism allows the model to pay more attention to the channel features with the most information, while suppressing unimportant channel features.

  • Use consistently the same notation: e.g. U_c and u_c in eq. 3 & 5 both denote the same image, i.e. c-th feature map; keep only one.

Thank you for your careful review. We used different notation to express the same meaning, which was incorrect. We have replaced  with . The specific modifications are shown in lines 322-323:

  • Eq. 4: what are W1 and W2?

W1 and W2 here represent weight matrices, but they have different roles. In order to obtain the relationship between channels, the excitation operation must meet two conditions: (1) flexibility, i.e. it must be able to learn the nonlinear interaction between the channels; (2) non-mutual exclusion, i.e. the learned relationships are not mutually exclusive, because multi-channel features are allowed rather than a one-hot form. For this reason, the excitation uses a gating mechanism in the form of a sigmoid. In order to reduce model complexity and improve generalization ability, a bottleneck structure containing two fully connected layers is adopted here. Among them, the dimension-reduction layer parameter is W1, and the dimension-reduction ratio is r. After a ReLU, there is a dimension-increase layer with parameter W2. Finally, a real number sequence of 1*1*C is obtained, and the final output is obtained by performing a Scale operation through Equation 5.

W1 ∈ R^((C/r)×C), W2 ∈ R^(C×(C/r)),

In the above formula, R represents the set of real numbers, and C represents the number of channels.
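The following toy computation (assumed values C = 64 and r = 16, random weights) illustrates the shapes of W1 and W2 in the bottleneck described above:

import numpy as np

C, r = 64, 16
z = np.random.rand(C)                     # global descriptor from the squeeze step (Eq. 3)
W1 = np.random.randn(C // r, C)           # dimension-reduction layer, shape (C/r, C)
W2 = np.random.randn(C, C // r)           # dimension-increase layer, shape (C, C/r)

relu = lambda v: np.maximum(v, 0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

s = sigmoid(W2 @ relu(W1 @ z))            # per-channel weights in (0, 1), length C, used in Eq. 5
print(s.shape)                            # (64,)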

  • Eq. 5: use the index c for the term X-tilde.

Thank you for your correction. We have revised the irregular notation. The specific modifications are shown in lines 322-323:

  • L350-361: a long introduction for the choices in your method; most of the text here is better placed within a 'Previous work' section.

Your opinion is very reasonable. This detailed introduction to the method selection is indeed not appropriate here. As you said, placing this content in a 'Previous work' section makes it clear at a glance and makes the article more fluent and logical. We have made the adjustment. The specific modifications are shown in lines 167-178:

A variety of features derived from remote sensing data tend to show characteristics such as stable nature, little impact from radiation differences, and insensitivity to remote sensing image time-phase changes [3]. Using spatial or spectral features to detect changes in the state of objects or regions has become a hot spot for researchers. In addition, the phenomenon of "same object, different spectra; same spectrum, different objects" appears in large numbers in HR/VHR remote sensing images, making it more difficult to detect small and complex objects such as buildings or roads in cities. Emerging deep learning methods have the potential to extract features of individual buildings in complex scenes. However, the feature extraction method of deep learning represented by the convolution operation only extracts the abstract features of the original image through the continuous deepening of the number of convolution layers and does not consider the use of useful derivative features of the ground objects [3,27].

  • Section 2.2.1: what are the features referred to in the title? What is a feature, e.g. is it a (2D) image/map?

As we mentioned earlier in the article, "A variety of features derived from remote sensing data tend to show characteristics such as stable nature, little impact from radiation differences, and insensitivity to remote sensing image time-phase changes." What we want to express is the hidden characteristics in remote sensing data. We use different methods to extract color, texture, and shape features from the remote sensing data, and the representation of each feature can be understood as a two-dimensional image. In the field of machine learning and deep learning, feature engineering is of central importance.

Features can be understood as the characteristics of a thing different from other things. It is the performance of some outstanding properties and the key to distinguishing things. To classify or recognize things is actually to extract ‘features’ and make judgments based on the performance of the features. There are many features of things, but in the end, the extracted features should obey our purpose. Feature selection needs to be performed after feature extraction. Its essence is to measure the goodness of a given feature subset through a specific evaluation criterion. Through feature selection, redundant features and irrelevant features in the original feature set are removed, while useful features are retained.

  • Eq. 6-8: indexing is done differently from eq. 3, e.g. j for a pixel instead of (i,j); i is the index for an image band (e.g. red or green bands?) while eq. 3 uses c for the analogous feature map; is N equal to H*W in eq. 3?

We used different indexes in the article, which caused confusion, and we apologize for that. We have modified some of the indexes. In Eq. 6-8, we use (i,j) to represent a pixel and k to represent the color component. In addition, N denotes the total number of pixels, and H*W denotes the size of the feature map; to make the distinction clear, we did not use H*W to represent the size of the image. The specific modifications are shown in lines 346-348:

 

μ_k = (1/N) Σ_(i,j) p_k(i,j)    (6)

σ_k = [ (1/N) Σ_(i,j) (p_k(i,j) − μ_k)^2 ]^(1/2)    (7)

s_k = [ (1/N) Σ_(i,j) (p_k(i,j) − μ_k)^3 ]^(1/3)    (8)

where p_k(i,j) is the k-th color component of the (i,j)-th pixel in the image; N is the number of pixels in the image.

  • Use the same indexing throughout, in order to reduce confusion.

We checked the full text carefully. After modification, we used the same index in the full text.

  • Eq. 9-13: is the output of the operation a 2D image/map?

The outputs of Eq. 9-13 are all two-dimensional feature maps. From the gray-level co-occurrence matrix, we calculated texture feature maps based on five scalars: Variance, Homogeneity, Contrast, Dissimilarity, and Entropy.

  • Eq. 10: should the denominator be under the sums?

Your review is very meticulous, and we appreciate it. Our expression of this formula was incorrect; this was due to our carelessness, and we have modified the formula. The specific modifications are shown in lines 366-367:

 

Homogeneity = Σ_i Σ_j [ p(i,j) / (1 + (i − j)^2) ]    (10)

  • L395-399: is p(i,j) a pixel or a kernel around pixel (i,j)? or around pixel (x,y)?

The gray-level co-occurrence matrix is obtained by statistically calculating the status of two pixels maintaining a certain distance on the image with a certain gray level. The gray level co-occurrence matrix of an image can reflect the comprehensive information of the image gray level on the direction, the adjacent interval, and the range of change. It is the basis for analyzing the local patterns of the image and their arrangement rules.

We use P(i,j) to represent the gray-level co-occurrence matrix in the article. Assuming a two-dimensional image of size M*N with Ng gray levels, the gray-level co-occurrence matrix that satisfies a certain spatial relationship is P(i,j); obviously, P(i,j) is a matrix of size Ng*Ng. When the distance between two pixels is d, and the angle between them and the horizontal axis of the coordinates is θ, gray-level co-occurrence matrices for various distances and angles can be obtained.
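As an additional illustration (a sketch based on scikit-image, not our own implementation), the following computes a GLCM for one window and derives some of the texture scalars; sliding the window over the image yields the per-pixel texture feature maps mentioned above:

import numpy as np
from skimage.feature import graycomatrix, graycoprops   # spelled greycomatrix/greycoprops in older scikit-image versions

window = np.random.randint(0, 16, size=(15, 15), dtype=np.uint8)   # toy 15*15 window quantized to 16 gray levels

# GLCM at distance d = 1 and angle theta = 0, normalized so the entries behave like p(i, j)
glcm = graycomatrix(window, distances=[1], angles=[0], levels=16, symmetric=True, normed=True)
p = glcm[:, :, 0, 0]

contrast = graycoprops(glcm, 'contrast')[0, 0]
dissimilarity = graycoprops(glcm, 'dissimilarity')[0, 0]
homogeneity = graycoprops(glcm, 'homogeneity')[0, 0]     # sum of p(i, j) / (1 + (i - j)^2)
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))           # computed directly from p(i, j)
print(contrast, dissimilarity, homogeneity, entropy)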

  • Table 1: explain the number of bands/band number. Eq. 6-8 suggest that a color moment is a scalar; this table seems to suggest it is an image (?)

We did not express our thoughts clearly. We use the first moment (mean), second moment (variance), and third moment (skewness) of the color moments to extract the color features of the image. The extracted result is not a single real number but a feature map: we use a sliding window and calculate the three values for the small image patch inside the window, and by moving the window we obtain the three color feature values of each pixel in the image. In this way we get a color feature map with three channels. The specific modifications are shown in lines 341-345:

Since the color information in the image is mostly distributed in the low-order moments of the image, we extract the color features of the image by calculating the first-order moment (mean), second-order moment (variance), and third-order moment (skewness) of the image. The color feature map of the entire image is extracted with a fixed-size sliding window.
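For illustration, a simplified sketch of this sliding-window computation for one band is given below (the window size is an arbitrary assumption, and this is not our exact implementation):

import numpy as np
from scipy.ndimage import uniform_filter

def color_moment_maps(band, size=7):
    # Per-pixel first, second and third color moments computed in a size*size sliding window.
    band = band.astype(np.float64)
    mu = uniform_filter(band, size)                                 # first moment: local mean
    var = uniform_filter(band ** 2, size) - mu ** 2                 # E[x^2] - E[x]^2
    std = np.sqrt(np.clip(var, 0, None))                            # second moment
    m3 = uniform_filter(band ** 3, size) - 3 * mu * var - mu ** 3   # third central moment
    skew = np.cbrt(m3)                                              # third moment, signed cube root as in Eq. 8
    return mu, std, skew

band = np.random.rand(64, 64)                                       # toy single-band image
mu, std, skew = color_moment_maps(band)
print(mu.shape, std.shape, skew.shape)                              # three 64*64 feature maps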

  • Figure 5: what do notations Band-1, Band-3, Band-9, Band-15 mean?

We use Band-1, Band-3, Band-9, and Band-15 in Figure 5 to indicate the number of channels included in each feature map. The specific modifications are shown in lines 393-394:

The number of bands (Band-number) corresponding to each is shown in Table 1, and the combination is shown in Figure 4.

  • Section 3.1.1: How is the ground truth/reference map acquired/computed?

We conducted a total of four sets of experiments in this article, and the data source of each experiment is different. The reference map for the first set of experiments was provided by the data publisher. The data publisher of the second set of experiments only provided two remote sensing images and corresponding category label maps. Based on the two category label maps, we calculated the reference change map through image difference. The change reference map of the third set of experiments was manually annotated by us. The reference change map of the fourth set of experiments was obtained through field investigation and manual annotation.
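As a simple sketch of the label-comparison step (toy data and an assumed building class id, not our actual labels), a building-change reference map can be derived from two category label maps as follows:

import numpy as np

BUILDING = 3                                       # assumed class id for "building" in the label maps

labels_t1 = np.random.randint(0, 6, (128, 128))    # category labels at time 1 (toy data)
labels_t2 = np.random.randint(0, 6, (128, 128))    # category labels at time 2 (toy data)

# a cell is marked as building change when exactly one of the two epochs is labelled "building"
change_map = ((labels_t1 == BUILDING) != (labels_t2 == BUILDING)).astype(np.uint8)
print(change_map.sum(), "changed cells out of", change_map.size)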

  • L555: give a reference for 'original article'.

Thank you for your comment. The parameters we used follow the paper "Squeeze-and-Excitation Networks". We did not cite the original article here, which was an oversight; we have now cited it. The specific modifications are shown in lines 529-532:

In order to facilitate comparison with other methods and minimize the time expenditure, the number of epochs in each experiment is set to 100, 1000 training images are used, and the reduction ratio set in the network is 16, as provided in the original article [55].

  • Table 4: DeepLabv3+ performs better than the method of this paper.

Table 5: U-Net performs better (in accuracy and run-time) than the method of this paper.

Thank you for your careful review. As you said, in the third and fourth sets of experiments the method we proposed was not uniformly superior to the other methods in terms of validation accuracy, validation loss, and training time; in some respects DeepLabv3+ and U-Net even perform better than our method. We conducted the experiments carefully and analyzed the results objectively. The method we propose has certain advantages, but they are not absolute. Moreover, the training of a model is related to many factors: the quality of the training samples and the randomly allocated training and validation sets both affect the validation accuracy and loss value during model training. Model training is a process of continuous optimization, and the optimal value across all epochs does not fully reflect the pros and cons of a model, because the whole training process is iterative. The overall trend and the test results on a new data set are the most valuable basis for evaluating a model.

When developing a model, it is always necessary to adjust the model configuration, such as selecting the number of layers or the size of each layer. This adjustment process needs to use the performance of the model on the validation data as a feedback signal. This adjustment process is essentially a kind of learning: finding a good model configuration in a certain parameter space. Therefore, if the model configuration is adjusted based on the performance of the model on the validation set, it will quickly cause the model to overfit on the validation set, even if the model is not directly trained on the validation set. Every time the model hyperparameters are adjusted based on the model's performance on the validation set, some information about the validation data will leak into the model. If we adjust each parameter only once, there is little information leaked, and the validation set can still reliably evaluate the model. But if this process is repeated many times (run an experiment, evaluate on the validation set, and then modify the model accordingly), more and more information about the validation set will leak into the model.

Therefore, the validation accuracy, loss value, and time cost can be used to evaluate the pros and cons of a model, but the final performance of the model on a new data set best reflects its quality.

  • Section 4.1.2: Why do you use different methods (manual, automatic) and different training sets (size) for the different methods during their comparison? This calls into question the fairness of the comparison and the conclusions drawn.

Thank you for your valuable comment. Regarding the sample selection, we would like to offer the following explanation. The comparison methods we use fall into four types: traditional methods (RCVA), machine learning methods (SVM, RF), transition methods (DBN), and deep learning methods (U-Net, SegNet, DeepLabv3+). Among them, the machine learning, transition, and deep learning methods all require training samples. We used different sample selection methods for these three groups, and the numbers of samples also differ. For the machine learning methods, we manually selected more than 1000 training samples. For the transition method, we automatically selected 5000 training samples based on the reference change map. For the deep learning methods, we automatically selected 1000 training samples based on the reference change map.

As is well known, the quantity and quality of training samples are crucial to model training. When the model hyperparameters are basically reasonable, the number and quality of training samples determine the final performance of the model. We believe that the quality of manually selected training samples is better than that of automatically selected ones. The number of training samples of the traditional method and the transition method is larger than that of the method in this paper, which can better illustrate the superiority of our method. The experimental results show that although the machine learning methods use more, and higher-quality, training samples than our method, their results are not as good. In addition, the transition method uses five times as much training data as our method, but the quality of its detection results is far lower.

Finally, regarding the machine learning methods and the transition method, there is no fixed standard for the quality and quantity of the training samples they use. The machine learning methods use higher-quality training samples, while the transition method uses a larger number of them; we believe these two variable factors may offset each other, which makes the comparison between them more meaningful. Furthermore, this article mainly discusses the difference between the results of our method and those of the other methods, so we do not discuss the differences between the other methods in depth. Given our limitations, the experiments and methods can still be improved, and this may be our next step. We hope the above explanation addresses your concern; please consider our ideas.

  • L700-701: what is validation performance (loss and accuracy) during training? Do you use a separation of train and validation set? Is there a test set; and if there is, what is the performance in the test set?

Thanks for your question. The focus of the evaluation model is to divide the data into three sets: training set, validation set and test set. Train the model on the training data and evaluate the model on the validation data. Once the best parameters are found, the final test is performed on the test set. The validation performance (loss and accuracy) in the figure represents the performance of the model on the validation data set. We separated the training data and validation data when training the model. We use a scale parameter to control the amount of training data and validation data. After we trained the model, we did not use the test data set to test the model. Instead, we directly use the model to make predictions for the entire image. In other words, our experimental results are the performance of the model on the test set. Our code to separate the training data set and the validation data set is as follows.

import os
import random

# `filepath` is the dataset root directory; it is defined elsewhere in our script.
def get_train_val(val_rate=0.2):
    # Randomly split the image tiles in `train_left` into training and validation sets.
    train_url = []
    train_set = []
    val_set = []
    for pic in os.listdir(filepath + 'train_left'):
        train_url.append(pic)
    random.shuffle(train_url)                    # shuffle before splitting
    total_num = len(train_url)
    val_num = int(val_rate * total_num)          # val_rate of the tiles go to validation
    for i in range(len(train_url)):
        if i < val_num:
            val_set.append(train_url[i])
        else:
            train_set.append(train_url[i])
    return train_set, val_set
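For completeness, a minimal usage sketch for the listing above (the path is an assumed placeholder):

filepath = './dataset/'                            # assumed dataset root; set to your own data directory
train_set, val_set = get_train_val(val_rate=0.2)
print(len(train_set), 'training tiles,', len(val_set), 'validation tiles')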

  • L753 & Fig.15: there is no significant difference in the results from different methods.

Thank you for your valuable comments. Our wording was not appropriate: we misused "significantly", which is not scientifically justified here. We have replaced "significantly" with "clearly". The specific modifications are shown in lines 727-728:

The results in Figure 14 and Figure 15 show that the detection results of the models obtained by the two data input methods are clearly different.

  • Fig. 4(d): what does the range -0.2 to 1.4 mean?

Thank you for your detailed comments. Since the increases and decreases of FA and MA in the fourth set of experiments are very large (the increase of FA is 64.03% and the decrease of MA is 67.83%), the range shown in the figure exceeds [0, 1]. In order to clearly express the magnitude of the increases and decreases, we set the range of the ordinate axis to [-0.2, 1.4]. If anything here does not meet academic standards, we are willing to modify it. The accuracy data of the fourth set of experiments on DeepLabv3+ are as follows.

DeepLabv3+      multi-feature    only-image    increment
                0.991594         0.8804        -0.11119
                0.958498         0.318182      -0.64032
                0.04806          0.72632       0.67826
                0.003903         0.050709      0.046806

 

Author Response File: Author Response.pdf

Reviewer 2 Report

This is an interesting paper about detecting changes in buildings using multiple sources of remote sensing data, especially high- or very high-resolution remote sensors. The authors use both two- and three-dimensional images. The authors provide a detailed and extensive discussion of current literature in this area in the Introduction section. A good description of the proposed algorithm is provided as well as a comparison of results from the proposed algorithm with 7 other common classification/change detection methods. Nice, clear descriptions of datasets used, including ground truth images of detected changes.

  • What does assumed-change area mean in figure 7? Are changes being simulated here in the image to create a second image because there is only one date of a real image?
  • Line 556 – what is meant by the traditional method? Similarly, other methods are mentioned in the paragraph. How do they all relate to the 7 methods mentioned at the end of section 2? Table 2 clarifies this, but it would be good to indicate these groupings in section 2 where these 7 methods are first mentioned.
  • Accuracy of results is good in the proposed method compared to others; not very different from U-Net; in many cases U-Net is slightly better.
  • Time comparisons – U-Net, SegNet and W-Net take roughly the same amount of time.
  • So, what is the real advantage of the proposed W-Net over U-Net?
  • Lines 787-789: “The qualitative and quantitative analysis of the experimental results showed that our method obtained higher OA and F1 values, and lower MA and FA values. And while improving the detection accuracy, the time cost of the squeeze-and-excitation W-Net we designed is lower.” This statement is questionable when W-Net results are compared to U-Net.

Author Response

Response to Reviewer

We would like to thank the reviewer for providing truly outstanding comments and suggestions that significantly helped us improve the technical quality and presentation of our paper.

Our comments are inset in blue following each point of the reviewer. The text quoted directly from the revised manuscript is set in italics. The line numbers cited in our response refer to the revised manuscript.

Reviewer #2’s comments

  • What does assumed-change area mean in figure 7? Are changes being simulated here in the image to create a second image because there is only one date of a real image?

Thank you for your careful review. We used the Vaihingen dataset provided by ISPRS Commission II Working Group II/4 in the first set of 3D experiments. This data set contains only one period of data, so we created another period of data based on the change area. However, the simulated data are still useful: in addition to making changes in the changed area, we also added a series of noise points in the unchanged area, which we believe helps verify the robustness of the model. Furthermore, the data required for three-dimensional change detection are more complicated, and it is very difficult to find a usable change detection data set. In future research, we will look for higher-quality change detection data sets for experiments. We hope you are satisfied with our explanation.

  • Line 556 – what is meant by the traditional method? Similarly, other methods are mentioned in the paragraph. How do they all related to the 7 methods mentioned at the end of section 2? Table 2 clarifies this, but it would be good to indicate these groupings in section 2 where these 7 methods are first mentioned.

We used a total of 7 comparison methods, divided into 4 types: traditional methods (RCVA), machine learning methods (SVM, RF), transition methods (DBN), and deep learning methods (U-Net, SegNet, DeepLabv3+). We did not explain this clearly in Section 2, and we have modified it. The specific modifications are shown in lines 424-427:

We adopted seven widely used change detection methods and classified them into traditional methods (RCVA), machine learning methods (SVM, RF), transition methods (DBN) from machine learning to deep learning (hereinafter referred to as transition methods), and deep learning methods (U-Net, SegNet, DeepLabv3+).

  • Accuracy of results are good in the proposed method compared to others; not very different from U-Net; in many cases U-Net is slightly better.

Time comparisons – U-Net, SegNet and W-Net take roughly the same amount of time.

So, what is the real advantage of the proposed W-Net over U-Net?

Thank you for your very valuable comment. As you said, the results obtained by our method are comparable with those of U-Net, and in some cases our results are worse. However, such cases are not numerous. In terms of validation accuracy and validation loss, the accuracy of U-Net in the third and fourth sets of experiments is higher than that of our method, but its loss value is much larger; in the other two experiments, our method is better than U-Net in both accuracy and loss. In terms of model training time, although our method consumes more time than U-Net in the first, second, and fourth experiments, the times are of the same order of magnitude and the differences are almost within a few seconds, which cannot fully explain the pros and cons of the models. In addition, our model is more complicated than the U-Net model, which may be the root cause of the increased time cost.

In terms of experimental results, in the four sets of experiments only in the second set is the OA value of U-Net larger than that of our model; in all other cases our model achieves the largest OA and F1 values. Moreover, the MA and FA values of our model are relatively small, and many of the minimum MA and FA values in the four sets of experiments are obtained by our model.

Regarding the similarity between the U-Net results and ours: our model is created on the basis of U-Net, but we have made many improvements to U-Net and designed a new network structure. The specific content is as follows: "In order to make up for these shortcomings of U-Net, we designed a two-sided input W-shaped network, which contains a contracting path on both sides and an expansive path in the middle. The contracting path on both sides contains 4 sets of encoding modules, but the encoding module deepens the number of layers of convolution and introduces the Batch Normalization layer. The expansive path contains 4 sets of decoding modules and also adds the Batch Normalization layer. Among them, the Batch Normalization layer can normalize the input data of each batch with the mean and variance, so that the input of each layer maintains the same distribution, which can speed up model training. In addition, the Batch Normalization layer can introduce noise through the idea of updating the mean and variance of each layer, thereby increasing the robustness of the model and effectively reducing overfitting."

In addition, we have embedded squeeze-and-excitation modules in the network, which makes our network more robust. The specific content is as follows: "The convolution operation can only proceed along the data input channel, fusing the spatial and channel information in the local receptive field. In addition, when comprehensively considering the multi-source data and the multiple features derived from it, it is difficult to model the spatial dependence of the data based on the information feature construction method of the local receptive field. Moreover, the repeated convolution operation without considering the spatial attention is not conducive to the extraction of useful features. We introduce the attention mechanism strategy, using global information to explicitly model the dynamic nonlinear relationship between channels, which can simplify the learning process and enhance the network representation ability. The main function of the attention mechanism is to assign weights to each channel to enhance important information and suppress secondary information."

  • Lines 787-789: “The qualitative and quantitative analysis of the experimental results showed that our method obtained higher OA and F1 values, and lower MA and FA values. And while improving the detection accuracy, the time cost of the squeeze-and-excitation W-Net we designed is lower.” This statement is questionable when W-Net results are compared to U-Net.

Thanks for your review. Our expression here did not conform to the norms of scientific writing and was too absolute; it should say that, in most cases, our model achieved such results. We have modified this. The specific modifications are shown in lines 761-765:

The qualitative and quantitative analysis of the experimental results showed that, in most cases, our method obtained higher OA and F1 values, and lower MA and FA values. And while improving the detection accuracy, the time cost of the squeeze-and-excitation W-Net we designed is lower.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Thank you for the careful consideration of the comments in the previous review round in the response letter, and the modifications made in the manuscript. 
