Article
Peer-Review Record

Improved Winter Wheat Spatial Distribution Extraction from High-Resolution Remote Sensing Imagery Using Semantic Features and Statistical Analysis

Remote Sens. 2020, 12(3), 538; https://doi.org/10.3390/rs12030538
by Feng Li 1,2,†, Chengming Zhang 3,4,*,†, Wenwen Zhang 3,†, Zhigang Xu 5, Shouyi Wang 3, Genyun Sun 6 and Zhenjie Wang 6
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 23 December 2019 / Revised: 29 January 2020 / Accepted: 4 February 2020 / Published: 6 February 2020

Round 1

Reviewer 1 Report

Overall, the paper is nicely written. I do notice a couple of typos (page 1, line 27: should be "named as post-processing CNN"; page 1, line 44: "remote" should be capitalized). The authors should go through the paper carefully to make sure that there are no more typos or grammatical errors.

     My main issue is the lack of detail in the algorithm description, i.e. Section 3. The authors should add more details and formulas for the bullet points that might need them. For example, this might need more explanation- "Wavelet transform was used to extract texture features from remote sensing images in the training data set".  

Also, it would look better if the "4. Results" section title was moved to the next page. Moreover, the bottom of page 3 is partly blank- perhaps the text could be moved to fill up the whole page. 

Author Response

Dear Reviewer:

We would like to thank you for the good comments and suggestions. We have substantially revised the manuscript according to your good suggestions and detailed responses are provided below. All revised contents are in blue.

General comments

Comment: Overall, the paper is nicely written.

Reply: Thank you for your support. According to your good suggestion, we have revised the relevant content.

Specific comments

Comment 1. I do notice a couple of typos (page 1, line 27: should be "named as post-processing CNN"; page 1, line 44: "remote" should be capitalized). The authors should go through the paper carefully to make sure that there are no more typos or grammatical errors.

Reply: According to your good suggestion, we employed a professional English editor to check and edit the full text in English, and corrected grammatical errors in the original text.

Comment 2.  My main issue is the lack of detail in the algorithm description, i.e. Section 3. The authors should add more details and formulas for the bullet points that might need them. For example, this might need more explanation- "Wavelet transform was used to extract texture features from remote sensing images in the training data set". 

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

3.3.1. Feature Selection

Based on the prior knowledge that the inner pixels and edge pixels in winter wheat planting areas have very similar colors and textures, and that the near-infrared (NIR) band is sensitive to crops, we created a feature vector for each pixel using the red, green, blue, and near-infrared bands along with NDVI, uniformity (UNI), contrast (CON), entropy (ENT), and inverse difference (INV). NDVI was calculated following Wang et al. [10]:

$\mathrm{NDVI} = \dfrac{\rho_{\mathrm{NIR}} - \rho_{\mathrm{Red}}}{\rho_{\mathrm{NIR}} + \rho_{\mathrm{Red}}}$ ,  (3)

UNI, CON, ENT, and INV were extracted using the methods proposed by Yang and Yang based on the GLCM [23]:

$\mathrm{UNI} = \sum_{i=1}^{q}\sum_{j=1}^{q} g(i,j)^{2}$ ,  (4)

$\mathrm{CON} = \sum_{i=1}^{q}\sum_{j=1}^{q} (i-j)^{2}\, g(i,j)$ ,  (5)

$\mathrm{ENT} = -\sum_{i=1}^{q}\sum_{j=1}^{q} g(i,j)\,\ln g(i,j)$ ,  (6)

$\mathrm{INV} = \sum_{i=1}^{q}\sum_{j=1}^{q} \dfrac{g(i,j)}{1+(i-j)^{2}}$ ,  (7)

In (4)-(7), q is the number of gray levels used in the quantization and g(i,j) is the (i,j)th element of the GLCM.

The feature vector v of each pixel had nine elements, structured as:

$v = (\mathrm{red}, \mathrm{green}, \mathrm{blue}, \mathrm{NIR}, \mathrm{NDVI}, \mathrm{UNI}, \mathrm{CON}, \mathrm{ENT}, \mathrm{INV})$  (8)
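For concreteness, the feature extraction of Eqs. (3)-(8) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the local window, the horizontal GLCM offset, the q = 16 gray levels, and the natural logarithm in ENT are all assumptions, since the response does not state them.

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Eq. (3): NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)

def glcm_features(window, q=16):
    """Eqs. (4)-(7): UNI, CON, ENT, INV from the GLCM of a 2-D gray window.

    Assumptions: horizontal-neighbor offset, q = 16 levels, natural log.
    """
    lo, hi = window.min(), window.max()
    if hi == lo:                                  # flat window: single level
        w = np.zeros_like(window, dtype=int)
    else:                                         # quantize to levels 0..q-1
        w = np.floor((window - lo) / (hi - lo) * (q - 1) + 0.5).astype(int)

    # Count horizontal co-occurrences, then normalize to probabilities g(i,j).
    g = np.zeros((q, q))
    np.add.at(g, (w[:, :-1].ravel(), w[:, 1:].ravel()), 1)
    g /= g.sum()

    i, j = np.indices((q, q))
    uni = np.sum(g ** 2)                          # Eq. (4) uniformity
    con = np.sum((i - j) ** 2 * g)                # Eq. (5) contrast
    ent = -np.sum(g[g > 0] * np.log(g[g > 0]))    # Eq. (6) entropy
    inv = np.sum(g / (1.0 + (i - j) ** 2))        # Eq. (7) inverse difference
    return uni, con, ent, inv

def feature_vector(red, green, blue, nir, gray_window):
    """Eq. (8): nine-element feature vector v for one pixel."""
    return np.array([red, green, blue, nir, ndvi(nir, red),
                     *glcm_features(gray_window)])
```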

3.3.2. Vector Distance Calculation Method

We used the improved Euclidean distance to calculate the vector distance of the two feature vectors. The standard Euclidean distance is defined as:

$d(x, y) = \sqrt{\sum_{i=1}^{b} (x_i - y_i)^{2}}$ ,  (9)

where x and y are the feature vectors to be compared, xi and yi are their components, and b is the length of the feature vectors. Smaller distances between two feature vectors correspond to greater similarity. In the standard Euclidean distance, all components are weighted equally, without considering how the degree of aggregation of each component's values affects the distance.

Statistically, among the samples of the same category, the more concentrated the values of a feature, the more distinguishable that feature is and the greater the weight it should be assigned. Similarly, the more dispersed the values of a feature, the less distinguishable it is and the smaller its assigned weight should be.

Based on this prior knowledge, we introduced the reciprocal of each feature's value range as a weight factor to improve the Euclidean distance, thus better reflecting the influence of feature value aggregation on the vector distance. This weight factor was calculated as:

$w_i = \dfrac{1}{\max_i - \min_i}$ ,  (10)

where i is the position number of the component in the feature vector, wi is the weight of the component, maxi is the maximum value of the ith components of all feature vectors, and mini is the minimum value of the ith components of all feature vectors. On this basis, the vector distance calculation formula was:

$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^{2}}$ ,  (11)

where x and y are the feature vectors to be compared, xi and yi are their components, wi is the weight of component i, and n is the number of components in the feature vector.
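A minimal sketch of Eqs. (10)-(11) in Python (assuming the feature vectors are stacked as an (N, 9) NumPy array; the epsilon guard against a zero value range is our addition):

```python
import numpy as np

def feature_weights(features, eps=1e-9):
    """Eq. (10): w_i = 1 / (max_i - min_i) over all feature vectors."""
    return 1.0 / (features.max(axis=0) - features.min(axis=0) + eps)

def weighted_distance(x, y, w):
    """Eq. (11): improved (weighted) Euclidean distance between x and y."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Example: a component with a narrow value range receives a large weight.
features = np.array([[0.40, 120.0], [0.42, 30.0], [0.41, 210.0]])
w = feature_weights(features)                    # approx. [50.0, 0.0056]
d = weighted_distance(features[0], features[1], w)
```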

3.3.3. Vector Distance Threshold Determination

Firstly, each complete crop planting area in the training image was set as a statistical unit. The vector distance d between each pixel and every other pixel was calculated, and the maximum vector distance di of each unit was recorded, where i is the number of the statistical unit. Secondly, the vector distance threshold (vdt) was obtained by:

$vdt = \dfrac{1}{n}\sum_{i=1}^{n} d_i$ ,  (12)

where n is the number of statistical units.
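Reading Eq. (12) as the mean of the per-unit maximum distances d_i (an interpretation consistent with "n is the number of statistical units", but an assumption nonetheless), the threshold can be sketched as:

```python
import numpy as np

def unit_max_distance(unit, w):
    """d_i: maximum pairwise weighted distance within one statistical unit.

    `unit` is an (m, 9) array of the unit's pixel feature vectors."""
    d_max = 0.0
    for a in range(len(unit) - 1):
        diffs = unit[a + 1:] - unit[a]           # pairs (a, a+1) .. (a, m-1)
        d_max = max(d_max, np.sqrt((w * diffs ** 2).sum(axis=1)).max())
    return d_max

def vector_distance_threshold(units, w):
    """Eq. (12): vdt as the mean of the per-unit maximum distances d_i."""
    return float(np.mean([unit_max_distance(u, w) for u in units]))
```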

3.3.4. Low-confidence Pixel Classification

We used the following steps to optimize the winter wheat planting areas output by the improved RefineNet model:

1. NDVI was calculated for each pixel.
2. UNI, CON, ENT, and INV were calculated for each pixel.
3. CL was calculated pixel by pixel.
4. Winter wheat pixels with continuous positions and CL > minCL were divided into separate groups.
5. The adjacent pixels of each group for which CL < minCL were processed individually. For a given adjacent pixel p, we calculated the vector distances between p and each pixel in the adjacent group and took the minimum value as the minimum distance mind. If mind < vdt, p was re-classified as a winter wheat pixel.
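Step 5, reclassifying a single low-confidence pixel p against an adjacent high-confidence group, might look like the following (names are illustrative, not from the authors' code; `group` is the (m, 9) feature array of the adjacent group):

```python
import numpy as np

def reclassify_as_wheat(p, group, w, vdt):
    """Return True if pixel p (9-element feature vector) should be
    re-classified as winter wheat: mind < vdt, where mind is the minimum
    weighted distance from p to any pixel of the adjacent group."""
    diffs = group - p
    mind = np.sqrt((w * diffs ** 2).sum(axis=1)).min()
    return mind < vdt
```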

 

Comment 3. Also, it would look better if the "4. Results" section title was moved to the next page. Moreover, the bottom of page 3 is partly blank- perhaps the text could be moved to fill up the whole page.

Reply: According to your good suggestion, we have rearranged the relevant content.

 


Reviewer 2 Report

Overall

The authors present a method for post-processing a CNN architecture to improve the accuracy of image classification, mitigating edge and boundary effects. Although there is merit in this work, I do not believe the manuscript is ready for publication, as additional analysis is needed to better establish the presented methods. Please see in detail:

Introduction

The presented literature on methods addressing edge and boundary effects in CNNs is very poor. Please perform an appropriate literature review. Here are some examples to start with [1-6]:

[1] Fu, Tengyu, et al. "Using convolutional neural network to identify irregular segmentation objects from very high-resolution remote sensing imagery." Journal of Applied Remote Sensing 12.2 (2018): 025010.

[2] Audebert, Nicolas, et al. "Distance transform regression for spatially-aware deep semantic segmentation." Computer Vision and Image Understanding 189 (2019): 102809.

[3] Zhao, Wenzhi, Shihong Du, and William J. Emery. "Object-based convolutional neural network for high-resolution imagery classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10.7 (2017): 3386-3396.

[4] Mboga, Nicholus, et al. "Fully Convolutional Networks and Geographic Object-Based Image Analysis for the Classification of VHR Imagery." Remote Sensing 11.5 (2019): 597.

[5] Mi, Li, and Zhenzhong Chen. "Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation." ISPRS Journal of Photogrammetry and Remote Sensing 159 (2020): 140-152.

[6] Papadomanolaki, Maria, Maria Vakalopoulou, and Konstantinos Karantzalos. "A Novel Object-Based Deep Learning Framework for Semantic Segmentation of Very High-Resolution Remote Sensing Data: Comparison with Convolutional and Fully Convolutional Networks." Remote Sensing 11.6 (2019): 684.

Methods

The issues with edge pixels and boundary definition are a well-known topic in remote sensing image segmentation. Post-processing with object-based image analysis, conditional random fields, or more refined deep-learning networks that improve boundary definition is well-established. The authors compare their method with raw forms of CNN architectures. As expected, there are significant gains in accuracy, but the actual merit of the authors' proposed method cannot be known unless it is compared with other techniques that aim to improve boundary and edge pixel classification. I recommend the authors add at least one method that contains post-processing to the benchmark to assess their method on firmer ground and dispel any doubt about biased results.

Moreover, it would be wise to report the computational time for each presented method to attain the results. For example, if the authors’ proposed method requires several more hours to be completed (compared to running a simple CNN) the gains in accuracy might not be justified according to the needs of each application.

Minor comments:

L77:80: Support Vector Machines, Random Forests and Decision Trees are not deep learning methods but machine learning ones.

L101-102: The accuracy of CNNs, although usually demonstrated to be higher, is not 'far greater than traditional methods'. That is an exaggeration not supported by evidence. In fact, in some cases, traditional machine learning techniques seemed to outperform CNNs in image classification [7].

[7] Jozdani, Shahab Eddin, Brian Alan Johnson, and Dongmei Chen. "Comparing deep neural networks, ensemble classifiers, and support vector machine algorithms for object-based urban land use/land cover classification." Remote Sensing 11.14 (2019): 1713.

Author Response

Dear Reviewer:

We would like to thank you for the good comments and suggestions. We have substantially revised the manuscript according to your good suggestions and detailed responses are provided below. All revised contents are in blue.

General comments

Comment: The authors present a method for post-processing a CNN architecture to improve the accuracy of image classification, mitigating edge and boundary effects. Although there is merit in this work, I do not believe the manuscript is ready for publication, as additional analysis is needed to better establish the presented methods.

Reply: Thank you for your good suggestions and comments. According to your good suggestion, we have revised the relevant content.

Specific comments

Comment 1. Introduction

 

The presented literature on methods addressing edge and boundary effects in CNNs is very poor. Please perform an appropriate literature review. Here are some examples to start with [1-6]:

 

[1] Fu, Tengyu, et al. "Using convolutional neural network to identify irregular segmentation objects from very high-resolution remote sensing imagery." Journal of Applied Remote Sensing 12.2 (2018): 025010.

[2] Audebert, Nicolas, et al. "Distance transform regression for spatially-aware deep semantic segmentation." Computer Vision and Image Understanding 189 (2019): 102809.

[3] Zhao, Wenzhi, Shihong Du, and William J. Emery. "Object-based convolutional neural network for high-resolution imagery classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 10.7 (2017): 3386-3396.

[4] Mboga, Nicholus, et al. "Fully Convolutional Networks and Geographic Object-Based Image Analysis for the Classification of VHR Imagery." Remote Sensing 11.5 (2019): 597.

[5] Mi, Li, and Zhenzhong Chen. "Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation." ISPRS Journal of Photogrammetry and Remote Sensing 159 (2020): 140-152.

[6] Papadomanolaki, Maria, Maria Vakalopoulou, and Konstantinos Karantzalos. "A Novel Object-Based Deep Learning Framework for Semantic Segmentation of Very High-Resolution Remote Sensing Data: Comparison with Convolutional and Fully Convolutional Networks." Remote Sensing 11.6 (2019): 684.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

When CNNs are used for pixel classification, the accuracy is high in the inner area but low in the edge area, resulting in rough edges [61, 62]. Because the rough edges are caused by the differences in feature values between pixels of the same type, it is necessary to introduce appropriate post-processing methods to improve the accuracy of edge pixel classification [63-65]. The fully connected CRF comprehensively uses the pixel spatial distance information and the semantic information generated by the CNN to effectively improve the edge accuracy of segmentation, but the amount of calculation required by the model is very large. Researchers have used recurrent neural networks [63] and convolution [64] to improve the calculation efficiency. Reference [65] comprehensively used the pixel spatial distance information and category information as constraints for network training to improve the accuracy of image segmentation results.

Object-level information is an information category commonly used in post-processing methods; it includes object shape information [66] and position information [66, 67]. Using object-level information to post-process the CNN segmentation results can improve the fineness of the edges. Multiresolution segmentation algorithms [68] and patch-based learning [66,69] have been used to successfully generate image object information. Classifiers are equally important; using more powerful classifiers such as decision trees, the results obtained are better than those obtained by simple linear classifiers [70]. Methods for extracting more knowledge and more suitable post-processing methods still require further research.

In order to obtain fine winter wheat spatial distribution information from high-spatial-resolution remote sensing imagery using CNNs, we proposed a post-processing CNN (PP-CNN). PP-CNN uses prior knowledge, namely the similarity in color and texture between the inner and edge pixels of the target type and their differences from other types, to post-process the CNN segmentation results, effectively improving the accuracy of edge pixel classification and thus the overall classification.

 

Audebert, N.; Boulch, A.; Le Saux, B.; Lefèvre, S. Distance transform regression for spatially-aware deep semantic segmentation. Comput. Vis. Image Underst. 2019, 189, 102809. DOI: 10.1016/j.cviu.2019.102809.
Fu, T.; Ma, L.; Li, M.; Johnson, B.A. Using convolutional neural network to identify irregular segmentation objects from very high-resolution remote sensing imagery. J. Appl. Remote Sens. 2018, 12(2), 025010. DOI: 10.1117/1.JRS.12.025010.
Mboga, N.; Georganos, S.; Grippa, T.; Lennert, M.; Vanhuysse, S.; Wolff, E. Fully Convolutional Networks and Geographic Object-Based Image Analysis for the Classification of VHR Imagery. Remote Sens. 2019, 11, 597. DOI: 10.3390/rs11050597.
Zhao, W.; Du, S.; Emery, W.J. Object-based convolutional neural network for high-resolution imagery classification. IEEE J. Sel. Topics Appl. Earth Obs. Remote Sens. 2017, 10(7), 3386–3396. DOI: 10.1109/JSTARS.2017.2680324.
Papadomanolaki, M.; Vakalopoulou, M.; Karantzalos, K. A Novel Object-Based Deep Learning Framework for Semantic Segmentation of Very High-Resolution Remote Sensing Data: Comparison with Convolutional and Fully Convolutional Networks. Remote Sens. 2019, 11, 684. DOI: 10.3390/rs11060684.
Mi, L.; Chen, Z. Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2020, 159, 140–152. DOI: 10.1016/j.isprsjprs.2019.11.006.

 

Comment 2. Methods

 

(1) The issues with edge pixels and boundary definition are a well-known topic in remote sensing image segmentation. Post-processing with object-based image analysis, conditional random fields, or more refined deep-learning networks that improve boundary definition is well-established. The authors compare their method with raw forms of CNN architectures. As expected, there are significant gains in accuracy, but the actual merit of the authors' proposed method cannot be known unless it is compared with other techniques that aim to improve boundary and edge pixel classification. I recommend the authors add at least one method that contains post-processing to the benchmark to assess their method on firmer ground and dispel any doubt about biased results.

Reply: According to your good suggestion, we have revised the relevant content. The revisions include the following: the original CRF model was added as a comparison method, and we rewrote the experimental design and the experimental results. The revised contents are as follows, and all revised contents are in blue.

 

When CNNs are used for pixel classification, the accuracy is high in the inner area but low in the edge area, resulting in rough edges [61, 62]. Because the rough edges are caused by the differences in feature values between pixels of the same type, it is necessary to introduce appropriate post-processing methods to improve the accuracy of edge pixel classification [63-65]. The fully connected CRF comprehensively uses the pixel spatial distance information and the semantic information generated by the CNN to effectively improve the edge accuracy of segmentation, but the amount of calculation required by the model is very large. Researchers have used recurrent neural networks [63] and convolution [64] to improve the calculation efficiency.

 

Carranza-García, M.; García-Gutiérrez, J.; Riquelme, J.C. A framework for evaluating land use and land cover classification using convolutional neural networks. Remote Sens. 2019, 11, 274. DOI: 10.3390/rs11030274.
Zhang, C.M.; Han, Y.J.; Li, F.; Gao, S.; Song, D.J.; Zhao, H.; Fan, K.Q.; Zhang, Y.N. A new CNN-Bayesian model for extracting improved winter wheat spatial distribution from GF-2 imagery. Remote Sens. 2019, 11, 619. DOI: 10.3390/rs11060619.
Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 2015; pp. 1529–1537.
Teichmann, M.T.T.; Cipolla, R. Convolutional CRFs for Semantic Segmentation. arXiv 2018, arXiv:1805.04777v2 [cs.CV].
Audebert, N.; Boulch, A.; Le Saux, B.; Lefèvre, S. Distance transform regression for spatially-aware deep semantic segmentation. Comput. Vis. Image Underst. 2019, 189, 102809. DOI: 10.1016/j.cviu.2019.102809.

 

3.4. Experimental Setup

We conducted a comparative experiment on a graphics workstation with a 12-GB graphics card running the Linux Ubuntu 16.04 operating system. The statistical analysis and post-processing code was written in Python using TensorFlow 1.10. We used a RefineNet implementation from GitHub and modified the output of its SoftMax layer; this model performed the initial segmentation, and its output served as the basic data for statistical analysis.

We selected the SegNet and unmodified RefineNet models as standard CNNs, and CRF as the post-processing method, for comparison with PP-CNN (Table 1). SegNet works like RefineNet, except that it uses only high-level semantic features to generate feature vectors for each pixel.

 

Table 1. Models used in the comparative experiment.

Name            Description
PP-CNN          The proposed method
SegNet          Classifier using only high-level semantic features
SegNet-CRF      SegNet as the initial segmentation model, CRF as the post-processing method
PP-SegNet       As in PP-CNN, but with SegNet as the initial segmentation model
RefineNet       Classic RefineNet; a linear model is adopted for feature fusion
RefineNet-CRF   Classic RefineNet as the initial segmentation model, CRF as the post-processing method

 

By comparing the results from SegNet and RefineNet, we hoped to verify that the strategy of generating features with RefineNet was better than that of generating features with SegNet. By comparing the results of SegNet-CRF, RefineNet, and RefineNet-CRF with PP-CNN, we hoped to show that post-processing could effectively improve the accuracy of segmentation results. By comparing the results of SegNet with PP-SegNet, we hoped to show that the proposed post-processing method had strong adaptability.

 

Results

We randomly selected ten test images from the test data set and assessed their segmentation results using the SegNet, SegNet-CRF, PP-SegNet, RefineNet, RefineNet-CRF, and PP-CNN models (Figure 5).

The six methods had very similar performances within the winter wheat planting areas, with virtually no misclassifications. However, differences were obvious at the edges of these areas. PP-CNN and PP-SegNet misclassified only very small numbers of discrete pixels, while SegNet had the most errors, in a more continuous pattern, with errors being more common at corners than at edges. RefineNet had significantly fewer errors than the SegNet model, with most located near corners and few in continuous patterns.

Comparing SegNet-CRF with PP-SegNet and RefineNet-CRF with PP-CNN, it can be seen that, given the same initial segmentation results, post-processing with the proposed method produces better results than post-processing with CRF. Considering that CRF performs very well on camera images, this may be because the resolution of remote sensing images is lower than that of camera images, which reduces the performance of CRF. This shows that the post-processing method should be selected according to the image characteristics.

Whether using CRF or the method proposed in this paper, the accuracy of the results improves after post-processing, which also shows the importance of post-processing methods when CNNs are applied to image segmentation.

Figure 5. Comparison of segmentation results for GF-2 satellite imagery for six test images: (a) original image; (b) manually labeled image; (c) SegNet; (d) SegNet-CRF; (e) PP-SegNet; (f) RefineNet; (g) RefineNet-CRF; (h) PP-CNN.

We then produced a confusion matrix for the segmentation results of all six models (Table 4), where each column represents the classification result obtained from the segmentation results and each row represents the actual category defined by manual classification. PP-CNN was clearly superior, with classification errors accounting for only 5.6% of pixels, lower than the 19.3% for SegNet, 15.4% for SegNet-CRF, 11.8% for PP-SegNet, 12.8% for RefineNet, and 11.5% for RefineNet-CRF.

Table 4. Confusion matrix for winter wheat classification.

Approach        Actual                Predicted winter wheat    Predicted non-winter wheat
SegNet          Winter wheat          29.6%                     9.4%
                Non-winter wheat      9.9%                      51.1%
SegNet-CRF      Winter wheat          31.9%                     7.1%
                Non-winter wheat      8.3%                      52.7%
PP-SegNet       Winter wheat          33.1%                     5.9%
                Non-winter wheat      5.9%                      55.1%
RefineNet       Winter wheat          32.5%                     6.5%
                Non-winter wheat      6.3%                      54.7%
RefineNet-CRF   Winter wheat          35.3%                     3.7%
                Non-winter wheat      7.8%                      53.2%
PP-CNN          Winter wheat          36.9%                     2.1%
                Non-winter wheat      3.5%                      57.5%

 

We used the accuracy, precision, recall, and Kappa coefficient to evaluate the performance of the six models [45] (Table 5). The average accuracy of PP-CNN was 13.7% higher than that of SegNet, 7.2% higher than that of RefineNet, and 6.2% higher than that of PP-SegNet.

Table 5. Statistical comparison of model performance.

Index       SegNet   SegNet-CRF   PP-SegNet   RefineNet   RefineNet-CRF   PP-CNN
Accuracy    80.7%    84.6%        88.2%       87.2%       88.5%           94.4%
Precision   79.7%    83.7%        87.6%       86.6%       87.7%           93.9%
Recall      79.8%    84.1%        87.6%       86.5%       88.9%           94.4%
Kappa       0.663    0.722        0.779       0.763       0.786           0.889

 

 

(2) Moreover, it would be wise to report the computational time for each presented method to attain the results. For example, if the authors’ proposed method requires several more hours to be completed (compared to running a simple CNN) the gains in accuracy might not be justified according to the needs of each application.

Reply: According to your good suggestion, we have revised relevant content, added content about the model running time. The new contents are as follows, and all new contents are in blue.

 

Table 6 shows the average time each method required to process one test image. The proposed post-processing method increases the runtime by only about 2% while improving the accuracy by 7.2%. The time consumed by CRF is higher than that of the proposed method because CRF must calculate the distances between all pixel pairs in an image, while the proposed method calculates distances for only a small number of pixel pairs.

Table 6. Average time required for each model to process one test image.

Index       SegNet   SegNet-CRF   PP-SegNet   RefineNet   RefineNet-CRF   PP-CNN
Time (ms)   295      375          301         297         361             302

 

Comment 3. Minor comments:

 

(1) L77:80: Support Vector Machines, Random Forests and Decision Trees are not deep learning methods but machine learning ones.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

The development of machine learning has allowed researchers to use its capabilities to improve pixel feature extraction. However, early machine learning methods such as neural networks [26,27], support vector machines [28,29], decision trees [30,31], and random forests [32,33] still use pixel spectral information as input.

 

(2) L101-102: The accuracy of CNNs, although usually demonstrated to be higher, is not 'far greater than traditional methods'. That is an exaggeration not supported by evidence. In fact, in some cases, traditional machine learning techniques seemed to outperform CNNs in image classification [7].

 

[7] Jozdani, Shahab Eddin, Brian Alan Johnson, and Dongmei Chen. "Comparing deep neural networks, ensemble classifiers, and support vector machine algorithms for object-based urban land use/land cover classification." Remote Sensing 11.14 (2019): 1713.

 

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

CNN and traditional feature extraction methods have different advantages, and CNN cannot completely replace traditional feature extraction methods. The fusion of different feature extraction methods can improve the accuracy of the segmentation results [60].

 

Jozdani, S.; Johnson, B. A.; Chen, D. Comparing deep neural networks, ensemble classifiers, and support vector machine algorithms for object-based urban land use/land cover classification. Remote Sens. 2019, 11, 1713. doi:10.3390/rs11141713.

 


Reviewer 3 Report

In this paper, the authors proposed a method based on a convolutional neural network (CNN) using knowledge and statistical analysis for edge pixel classification. In fact, an improved RefineNet model is used to roughly segment remote sensing imagery. Moreover, manual labels are used as a reference to perform statistical analysis on the class probability vectors. Finally, the filtered pixels are exploited to improve classification accuracy.


The proposed approach has been validated on real remote sensing images.
Generally, the proposed idea is very interesting; however, some revisions have to be made, and some parts of the experiments are not complete enough to claim the advantage of the proposed method:

1) In the proposed method, could the authors explain what is the main contribution over standard classification methods based on CNN? what are its advantages over them?

2) In the experimental setup, the authors randomly choose the training and testing samples (images) for the classification task. What happens when you change the training samples (another random selection of images)?

3) In the experimental setup, could the authors give a clarification about how they set the hyper-parameters of the different models, e.g., learning rate, epochs, batch size, etc.?

4) I suggest the authors add in the introduction these references related to 3-D CNN, which aim to preserve the spectral and spatial feature of remote sensing images :

- Spectral--spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sensing, 2017.
- Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection, Expert Systems with Applications, 2019.

5) The English and format of this manuscript should be checked very carefully.

Author Response

Dear Reviewer:

We would like to thank you for the good comments and suggestions. We have substantially revised the manuscript according to your good suggestions and detailed responses are provided below. All revised contents are in blue.

General comments

Comments: In this paper, the authors proposed a method based on a convolutional neural network (CNN) using knowledge and statistical analysis for edge pixel classification. In fact, an improved RefineNet model is used to roughly segment remote sensing imagery. Moreover, manual labels are used as a reference to perform statistical analysis on the class probability vectors. Finally, the filtered pixels are exploited to improve classification accuracy.

The proposed approach has been validated on real remote sensing images.

Generally, the proposed idea is very interesting; however, some revisions have to be made, and some parts of the experiments are not complete enough to claim the advantage of the proposed method.

Reply: Thank you for your support. According to your good suggestion, we have revised the relevant content.

Specific comments

Comment 1. In the proposed method, could the authors explain what is the main contribution over standard classification methods based on CNN? what are its advantages over them?

Reply: According to your good suggestion, we added relevant content in section 1, i.e., introduction. The added contents are as follows, and added contents are in blue.

 

The main contributions of this work are as follows.

1. PP-CNN uses confidence to evaluate the reliability of the pixel-by-pixel classification results obtained using the CNN and clarifies how this confidence is calculated.
2. PP-CNN proposes a new hierarchical classification strategy: features generated by the standard CNN from large receptive fields are used for the first-level classifier, and features generated from small receptive fields are used for the second-level classifier. Because this hierarchical classification strategy combines the advantages of the large and small receptive fields, it achieves the goal of obtaining fine edges.

 

 

Comment 2. In the experimental setup, the authors randomly choose the training and testing samples (images) for the classification task. What happens when you change the training samples (another random selection of images)?

Reply: According to your good suggestion, we have revised the relevant content. The word 'randomly' was inappropriate, and we have deleted it. The revised contents are as follows, and all revised contents are in blue.

 

We used cross-validation techniques in the comparative experiments. Each CNN model was trained over four rounds; in each round, 87 images were selected as test images and the other images were used as training images. Each image was used at least once as the test image (Table 2).
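A minimal sketch of this four-round split (illustrative only; it assumes the image list is shuffled once and then partitioned into disjoint 87-image test folds, which guarantees every image is tested at least once):

```python
import numpy as np

def four_round_splits(image_ids, test_size=87, rounds=4, seed=0):
    """Yield (train_ids, test_ids) for each of the four rounds."""
    order = np.random.default_rng(seed).permutation(image_ids)
    for r in range(rounds):
        test = order[r * test_size:(r + 1) * test_size]
        train = np.setdiff1d(order, test)        # all remaining images
        yield train, test
```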

 

 

Table 2. Percentage of samples in each category used in the experiments.

Category                 Percent of total samples
Winter wheat             39.00%
Agricultural buildings   0.10%
Woodland                 9.01%
Buildings                19.01%
Roads                    0.81%
Water bodies             0.90%
Unplanted farmland       24.12%
Other                    7.05%

 

Comment 3. In the experimental setup, could the authors give a clarification about how they set the hyper-parameters of the different models, e.g., learning rate, epochs, batch size, etc.?

Reply: According to your good suggestion, we have added relevant content. The added contents are as follows, and all new contents are in blue.

 

Table 3 shows the hyper-parameter setup we used to train our model. In the comparison experiments, the same hyper-parameters were also applied to the comparison models.

Table 3. The hyper-parameter setup.

Hyper-parameter   Value
Mini-batch size   32
Learning rate     0.0001
Momentum          0.9
Epochs            20000
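Applied in the paper's TensorFlow 1.10 environment, the Table 3 settings would correspond to something like the following sketch. The one-layer classifier is a stand-in for illustration only; the actual network is the improved RefineNet.

```python
import tensorflow as tf  # TensorFlow 1.x API

BATCH_SIZE, LEARNING_RATE, MOMENTUM, EPOCHS = 32, 1e-4, 0.9, 20000

# Stand-in model: a single dense layer over per-pixel feature vectors.
x = tf.placeholder(tf.float32, [None, 9])        # feature vectors (Eq. 8)
y = tf.placeholder(tf.float32, [None, 2])        # one-hot class labels
logits = tf.layers.dense(x, 2)
loss = tf.losses.softmax_cross_entropy(y, logits)

# Momentum SGD with the Table 3 learning rate and momentum.
train_op = tf.train.MomentumOptimizer(LEARNING_RATE, MOMENTUM).minimize(loss)
```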

 

Comment 4. I suggest the authors add in the introduction these references related to 3-D CNN, which aim to preserve the spectral and spatial feature of remote sensing images:

- Spectral--spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sensing, 2017.

- Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection, Expert Systems with Applications, 2019.

 

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

2-D convolution methods are unsuitable for processing images with many channels, such as hyperspectral remote sensing images [55]. Aiming to preserve the spectral and spatial features of hyperspectral remote sensing images, researchers use 3-D convolution to extract spectral–spatial information [55, 56]. Because 3-D convolution can fully utilize the abundant spectral and spatial information of hyperspectral imagery, it has achieved remarkable success in the classification of hyperspectral images.

 

Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. DOI: 10.3390/rs9010067.
Sellami, A.; Farah, M.; Farah, I.R.; Solaiman, B. Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection. Expert Syst. Appl. 2019, 129, 246–259. DOI: 10.1016/j.eswa.2019.04.006.

 

Comment 5. The English and format of this manuscript should be checked very carefully.

Reply: According to your good suggestion, we employed a professional English editor to check and edit the full text in English, and corrected grammatical errors in the original text.


Round 2

Reviewer 2 Report

The authors have addressed my main concerns in the revised version.

Reviewer 3 Report

The authors have revised the manuscript carefully according to my questions. I have no further questions about this manuscript. It could be accepted. 
