Article
Peer-Review Record

Joint Learning of Contour and Structure for Boundary-Preserved Building Extraction

Remote Sens. 2021, 13(6), 1049; https://doi.org/10.3390/rs13061049
by Cheng Liao 1, Han Hu 1,*, Haifeng Li 2, Xuming Ge 1, Min Chen 1, Chuangnong Li 1 and Qing Zhu 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5:
Submission received: 8 February 2021 / Revised: 2 March 2021 / Accepted: 8 March 2021 / Published: 10 March 2021

Round 1

Reviewer 1 Report

Page 2, Line 70:

Is dilated convolution applied only to reduce the computational cost, or does it have other benefits as well?

Page 3, Line 106:

What is CRF?

Page 4, Figure 2:

Blue labels in the image on the left side are not visible.

The Legend shows both a black and a blue arrow (Convolutional layers), but only black arrows appear in the Figure.

If I follow the arrows, the final stage of the data flow is an equation (from BCELoss and DiceLoss)? Is the main structure presented properly in the Figure?

You also have a typo in the Legend (Convlution => Convolutional).

Page 4, Lines 134-146:

This section should be rewritten, and Figure 2 should be described more clearly. I do not know what the building extraction constraint module and the boundary-enhanced multi-loss represent in the Figure. I have a feeling that a lot of information is missing or represented confusingly.

Page 5, Figure 3:

The residual module contains several convolutional blocks. Where exactly is the dilated convolution applied? In the residual module, the structural constraint module, or both?

Page 5, Lines 174-177:

These lines should be in Section 4, where results are presented.

Page 6, Line 193:

“with half of the original image”

Which half of the original image? Do you mean half of the resolution of the original image?

Page 6, Line 201:

“for the segmentation brunch”

Do you mean “segmentation branch”?

Page 7, Line 208:

The subscript “dice” of “L” is not entirely in italics.

Pages 6-7, Lines 200-211:

You applied two loss functions (binary cross-entropy loss and Dice loss). Which function is used where?
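For reference, a common pattern in boundary-aware segmentation networks is to supervise the segmentation branch with binary cross-entropy and the boundary branch with Dice loss, which copes better with the extreme class imbalance of thin contour maps. A minimal NumPy sketch (the branch assignment and the 0.4/0.6 weights are illustrative assumptions, not necessarily the authors' exact formulation):

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    # Binary cross-entropy, averaged over all pixels.
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def dice_loss(pred, target, eps=1e-7):
    # Dice loss = 1 - Dice coefficient; the overlap ratio is insensitive
    # to the large background area, which is why it suits boundary maps.
    inter = np.sum(pred * target)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def total_loss(seg_pred, seg_gt, bnd_pred, bnd_gt, alpha=0.4, beta=0.6):
    # Hypothetical combined objective: BCE on the segmentation branch,
    # Dice on the boundary branch; alpha/beta are illustrative weights.
    return alpha * bce_loss(seg_pred, seg_gt) + beta * dice_loss(bnd_pred, bnd_gt)
```

A perfect prediction drives both terms toward zero, while a boundary map that misses every contour pixel drives the Dice term toward one regardless of how many background pixels are correct.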

Page 8, Lines 223-231:

These lines should be in Section 4 (not in description of methodology), because you describe how the experiments are performed.

Page 9, Table 1; Page 10, Table 2; Page 11, Table 3:

Table 1 shows 8 other methods. Table 2 presents only 6 (DE-Net and MA-FCN are missing). Table 3 also shows 6, but SRI-Net, DE-Net and MA-FCN are missing and BRR-NET is added. Why is the same set of methods not applied to all datasets?

Author Response

We are thankful to the reviewers for pointing out some important modifications needed in the report. 

We have thoughtfully taken into account these comments. The explanation of what we have changed in response to the reviewers’ concerns is given point by point.

Please see the attachment. 

Author Response File: Author Response.pdf

Reviewer 2 Report

This manuscript presents a very interesting topic on the extraction of buildings from high-resolution orthoimages proposing a strategy that considers the contours of the buildings as well.

The workflow is correct and the article is well presented, but I would recommend some changes in presentation. Below are some comments.

s23-s32-s41: orthophoto/s. I would avoid the term “orthophoto” and change it to “orthoimage”, because “photo” refers to an analogue product, whereas “image” refers to a digital one. Please consider it.

 

s38: “This issue is widely acknowledged in the computer vision community.”: This is also a well-known topic in GIS; please have a look at “traditional” GIS publications such as

Burrough, P.A.; McDonnell, R.A. Principles of Geographical Information Systems (Spatial Information Systems), 2nd ed.; Oxford University Press: Oxford, 1998; Chapter 9. ISBN-10: 0198233655; ISBN-13: 9780198233657.

Or newer such as

https://www.wiley.com/en-us/Geographical+Information+Systems%3A+Principles%2C+Techniques%2C+Management+and+Applications%2C+2nd+Edition%2C+Abridged-p-9780471735458

 

s40: “it causes blurry or zigzag effects on building boundaries when applied to orthophotos”: Several concepts related to orthoimage production underlie this, such as the “true orthoimage” concept, anti-aliasing, Spectral Mixture Analysis, etc.

 

s50: MAP-Net. The first time it appears, please explain it and refer to the bibliography.

 

s51: pixel-wise scores. Please explain this term.

 

s54: Figure 1: This comment applies to Figures 1, 5, 6 and 7, and partially to Figure 4. As presented, it is really difficult to appreciate what the text intends to show, since sometimes the difference is just one pixel, and unless the zoom is greater than 100% it is very hard to see. I recommend changing these figures by adding a zoomed detail of the yellow areas. Overlaying the results with additive colours might also be a good way to see the differences in a “summary” image.

 

S121: “SOTA”. The first time it appears, please explain what it means and refer to the bibliography.

 

S148: Figure 2. Please explain what each element and drawing represent (as in the left images); a label in the lower/upper part might work (blue is difficult to read), and some elements are not labelled.

Please increase the resolution of the left images, especially the Gradient (Sobel filtering) figure.

 

S180: It seems the third image in (b), “Features extracted without the SC module”, is not correct. Please correct it. Again, I miss a zoom/comparison that would clarify the results. Please consider it.

 

S212: 3.5. Evaluation Metrics and Training Strategies

For the sake of completeness in the statement of accuracy, it might be good to test the significance of these values, e.g., by adding a Chi-square test.
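One way to carry out such a test is a Pearson chi-square statistic on a 2x2 contingency table of correctly/incorrectly classified pixels for two methods. A self-contained sketch (the counts are hypothetical, and `scipy.stats.chi2_contingency` provides an equivalent off-the-shelf computation):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. correct/incorrect pixel counts for two
    methods. With df = 1, a statistic above 3.841 indicates a
    significant difference at the p < 0.05 level."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]          # row marginals
    col = [a + c, b + d]          # column marginals
    obs = [[a, b], [c, d]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n   # expected count under independence
            stat += (obs[i][j] - e) ** 2 / e
    return stat

# Hypothetical counts: method A vs. method B, correct/incorrect pixels.
stat = chi_square_2x2([[9500, 500], [9200, 800]])
significant = stat > 3.841  # critical value for df = 1, alpha = 0.05
```

For matched predictions on the same test pixels, a McNemar test on the disagreement cells would be the more rigorous choice, but the framing above conveys the idea.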

 

s299: Figure 5. Same as Figure 1.

 

s339: Figure 6. Same as Figure 1.

 

s344: Figure 7. Same as Figure 1.

Author Response

We are thankful to the reviewers for pointing out some important modifications needed in the report. 

We have thoughtfully taken into account these comments. The explanation of what we have changed in response to the reviewers’ concerns is given point by point.

Please see the attachment. 

Author Response File: Author Response.pdf

Reviewer 3 Report

Good paper!

Author Response

We are thankful for your praise of our work.

Best Regards!

Reviewer 4 Report

In this paper, the authors present a new method for building extraction from aerial images that uses both the contours and the structures of the buildings. Multiscale feature extraction is performed using the MAP-Net network, and a residual module is used to extract structural features that represent boundary information. Results on three different datasets show that the proposed method outperforms most state-of-the-art methods, compared with both pixel-level and instance-level metrics.

Remarks for the paper:

1. In Equation 3, please better explain why you chose the alpha and beta parameters to be 0.4 and 0.6. Would a change in the weighting factors influence the results?

2. Regarding the loss function in Eq. 1, have you considered some other type? This might be better explained.

3. Timing performance is missing, together with a description of the computer that was used, at least for the proposed method (and possibly the other methods that were tested).

4. Table 4: although there is improvement in all cases compared with MAP-Net, some results might be considered low: e.g., with threshold 0.8 on the Massachusetts dataset, compared with the WHU dataset at the same threshold. Can you comment on whether this can be improved, or whether the dataset should have better resolution?

5. The source code for the proposed method might be uploaded somewhere for easier comparison with other algorithms.

 

Author Response

We are thankful to the reviewers for pointing out some important modifications needed in the report. 

We have thoughtfully taken into account these comments. The explanation of what we have changed in response to the reviewers’ concerns is given point by point.

Please see the attachment. 

Author Response File: Author Response.pdf

Reviewer 5 Report

This paper proposes a new strategy for the extraction of buildings from high-resolution remote sensing orthoimages that, besides segmentation masks, also considers the contours of the buildings. The proposed framework inherits the structure of MAP-Net and includes a structural feature constraint module with a Dice loss.

The paper has a good experimental section, which shows better performance for the proposed method than the state of the art.

Some minor errata:

- Page 7, line 205: I don't think the Dice loss was first proposed in reference [28].
- Page 12, line 356: Results are presented in Table 4, not in Table 3.

Author Response

We are thankful to the reviewers for pointing out some important modifications needed in the report. 

We have thoughtfully taken into account these comments. The explanation of what we have changed in response to the reviewers’ concerns is given point by point.

Please see the attachment. 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors provided satisfactory explanations and corrected the paper according to the comments.

Reviewer 2 Report

Dear authors,
You have done a serious and conscientious job of revision. The quality of the paper has increased considerably.

I have only one comment more:

S26: "This issue is widely acknowledged in the computer vision community and Geographical Information Systems (GIS)."
I miss some references.
