Article
Peer-Review Record

Improved Winter Wheat Spatial Distribution Extraction Using A Convolutional Neural Network and Partly Connected Conditional Random Field

Remote Sens. 2020, 12(5), 821; https://doi.org/10.3390/rs12050821
by Shouyi Wang 1,†, Zhigang Xu 2,†, Chengming Zhang 1,*,†, Jinghan Zhang 3, Zhongshan Mu 4, Tianyu Zhao 4, Yuanyuan Wang 1,5, Shuai Gao 6, Hao Yin 1 and Ziyun Zhang 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 6 February 2020 / Revised: 27 February 2020 / Accepted: 28 February 2020 / Published: 3 March 2020
(This article belongs to the Special Issue Deep Learning and Remote Sensing for Agriculture)

Round 1

Reviewer 1 Report

In this paper, the authors proposed an approach based on a convolutional neural network and a partly connected conditional random field for winter wheat spatial distribution extraction. First, the RefineNet model is used to segment remote sensing images. Then, a statistical analysis is performed to select the pixels that require optimization. Finally, the PCCRF model is used to refine the classification. The proposed approach has been validated on remote sensing images.

Generally, the proposed idea is very interesting; however, some revisions have to be made, and some parts of the experiments are not complete enough to support the claimed advantages of the proposed models:

1) Could the authors explain the main contribution of the proposed approach over existing methods of winter wheat spatial distribution extraction? Moreover, what is the motivation for using the random field model?

2)   In the experimental setup, the authors randomly choose the training and testing samples for the classification task. What happens when you change the training samples (another random selection of images)?

3) In the experimental setup, could the authors add some details about the hyper-parameter settings, e.g., epochs, learning rate, etc.?

4) I suggest the authors add to the manuscript the following references related to 3-D CNNs, which aim to preserve the spectral and spatial features of remote sensing images:

- Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sensing, 2017.

- Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection, Expert Systems with Applications, 2019.

5) The English and format of this manuscript should be checked very carefully.

Author Response

Dear Reviewer:

We would like to thank you for your helpful comments and suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

General comments

Comments: In this paper, the authors proposed an approach based on a convolutional neural network and a partly connected conditional random field for winter wheat spatial distribution extraction. First, the RefineNet model is used to segment remote sensing images. Then, a statistical analysis is performed to select the pixels that require optimization. Finally, the PCCRF model is used to refine the classification. The proposed approach has been validated on remote sensing images.

Generally, the proposed idea is very interesting; however, some revisions have to be made, and some parts of the experiments are not complete enough to support the claimed advantages of the proposed models.

Reply: Thank you for your support and helpful suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

 

Specific comments

Comment 1. Could the authors explain the main contribution of the proposed approach over existing methods of winter wheat spatial distribution extraction? Moreover, what is the motivation for using the random field model?

Reply: According to your good suggestions, first, we added comments about the existing methods of winter wheat spatial distribution extraction.

Second, we added a description of the main contribution of the proposed approach compared with the existing methods, and of the motivation for using the conditional random field, at the end of Section 1 (Introduction). In particular, we added a new section, Section 5.4, that compares this paper with our previously published paper and indicates what is new.

Third, we rewrote the conclusion, adding content that emphasizes the effect of the conditional random field.

 

The added comments on the existing methods of winter wheat spatial distribution extraction are provided below.

As winter wheat is an important food crop, previous studies have proposed numerous methods to extract the spatial distribution information of winter wheat from remote sensing images. When low- and medium-resolution images are used as data sources, NDVI and other vegetation indices are typically used as the main features [71]. When higher-resolution remote sensing images are used as data sources, regression methods [72], support vector machines [73,74], random forests [75], linear discriminant analysis [76], and CNNs [77,78] are the more commonly used methods. Mis-segmented pixels at the edges of winter wheat planting areas are a common problem that these methods must overcome. Although the edge accuracy of the winter wheat planting area can be improved with the use of a CRF [78], improving the computational efficiency of the CRF is still an important issue that requires an urgent solution.

  1. Chu, L.; Liu, Q.-S.; Huang, C.; Liu, G.-H. Monitoring of winter wheat distribution and phenological phases based on MODIS time-series: A case study in the Yellow River Delta, China. J. Integr. Agric. 2016, 15(10), 2403-2416. doi:10.1016/S2095-3119(15)61319-3.
  2. Zhang, X.-W.; Liu, J.-F.; Qin, Z.; Qin, F. Winter wheat identification by integrating spectral and temporal information derived from multi-resolution remote sensing data. J. Integr. Agric. 2019, 18(11), 2628-2643. doi:10.1016/S2095-3119(19)62615-8.
  3. Hao, Z.; Zhao, H.; Zhang, C.; Wang, H.; Jiang, Y.; Yi, Z. Estimating winter wheat area based on an SVM and the variable fuzzy set method. Remote Sens. Lett. 2019, 10(4), 343-352. doi:10.1080/2150704X.2018.1552811.
  4. He, T.; Xie, C.; Liu, Q.; Guan, S.; Liu, G. Evaluation and Comparison of Random Forest and A-LSTM Networks for Large-scale Winter Wheat Identification. Remote Sens. 2019, 11, 1665. doi:10.3390/rs11141665.
  5. He, Y.; Wang, C.; Chen, F.; Jia, H.; Liang, D.; Yang, A. Feature Comparison and Optimization for 30-M Winter Wheat Mapping Based on Landsat-8 and Sentinel-2 Data Using Random Forest Algorithm. Remote Sens. 2019, 11, 535. doi:10.3390/rs11050535.
  6. Aneece, I.; Thenkabail, P. Accuracies Achieved in Classifying Five Leading World Crop Types and their Growth Stages Using Optimal Earth Observing-1 Hyperion Hyperspectral Narrowbands on Google Earth Engine. Remote Sens. 2018, 10, 2027. doi:10.3390/rs10122027.
  7. Teimouri, N.; Dyrmann, M.; Jorgensen, R.N. A Novel Spatio-Temporal FCN-LSTM Network for Recognizing Various Crop Types Using Multi-Temporal Radar Images. Remote Sens. 2019, 11, 990. doi:10.3390/rs11080990.
  8. Chen, Y.; Huang, L.; Zhu, L.; Yokoya, N.; Jia, X. Fine-Grained Classification of Hyperspectral Imagery Based on Deep Learning. Remote Sens. 2019, 11, 2690. doi:10.3390/rs11222690.

 

 

The main contributions of the proposed approach compared with these existing methods are provided below. All new and revised contents are in blue.

 

In this study, we propose a partly connected conditional random field (PCCRF) model to post-process the RefineNet extraction results, referred to as RefineNet-PCCRF, with the goal of obtaining a high-quality winter wheat spatial distribution. The main contributions of this paper are as follows:

 

  • Statistical analysis is used to analyze the segmentation results of RefineNet, and the resulting prior knowledge is applied to PCCRF modeling.
  • Based on this prior knowledge, we modified the fully connected conditional random field (FCCRF) to build the PCCRF. We refined the definition of the pairwise potential energy, employing a linear model to connect the unary potential energy and the pairwise potential energy (a minimal sketch follows this list). Compared to the equal-weight connection model used in the FCCRF, the new fusion model used in the PCCRF can better reflect the different roles of information generated from a larger receptive field and information generated from a smaller receptive field.
  • We only used pixel-pairs associated with the selected pixels in the PCCRF, which effectively reduces the amount of data required for computing the model and improves the computational efficiency of the PCCRF.
  • Benefiting from the CRF's ability to describe the spatial correlation between pixel categories, RefineNet-PCCRF not only improves the classification accuracy of edge pixels in the winter wheat planting area but also has high computational efficiency.
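To make the fusion idea concrete, the following minimal sketch (Python/NumPy) illustrates how pairwise potential energy can be accumulated only for pixel pairs involving the selected low-confidence pixels and then fused linearly with the unary term. The function name, the Gaussian similarity kernel, and the example weights are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def pccrf_energy(unary, feats, labels, selected, a=0.6, b=0.4):
    # Accumulate pairwise terms only for pairs that involve a selected
    # (low-confidence) pixel, then fuse linearly with the unary term.
    n = len(unary)
    pairwise = np.zeros(n)
    for i in selected:                          # only low-confidence pixels
        for j in range(n):
            if i == j:
                continue
            sim = np.exp(-np.sum((feats[i] - feats[j]) ** 2))
            if labels[i] != labels[j]:          # penalize differing labels
                pairwise[i] += sim
    pairwise /= max(pairwise.max(), 1e-9)       # normalize to the unary range
    return a * unary + b * pairwise             # weights a, b learned in training

# 5 pixels with 3-D feature vectors; pixel 2 is flagged for post-processing.
rng = np.random.default_rng(0)
energy = pccrf_energy(rng.random(5), rng.random((5, 3)),
                      labels=np.array([0, 0, 1, 1, 1]), selected=[2])
print(energy)
```

The key point is that the double loop runs only over the selected pixels, which is what distinguishes the partly connected model from a fully connected one.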

 

5.4. Comparison Between PP-CNN and RefineNet-PCCRF

To obtain high-quality spatial distribution information of winter wheat, we previously used an improved Euclidean distance to establish PP-CNN as a post-processing method [81]. According to the improved Euclidean distance between the feature vector of the pixel being classified and that of a determined winter wheat pixel, the pixel being classified can be judged as belonging to winter wheat or not. Unlike PP-CNN, the proposed PCCRF is established on the basis of the CRF. Owing to the advantage of the CRF in using global distribution characteristics, the PCCRF can more accurately determine the category labels at the edge of the winter wheat planting area.

In general, PP-CNN can be used in cases where the feature differences between the mixed pixels on the edge of the winter wheat planting area and the inner pixels of the same area are stable. When the differences are unbalanced, the distance threshold bias is large, which increases the probability of pixel classification errors during post-processing. The PCCRF fully considers the spatial correlation between pixel categories and hence has a strong global balancing ability. Therefore, this method can better handle situations where the edge pixels differ significantly from the inner pixels, thereby effectively reducing the impact of large differences in crop growth.

 

  1. Li, F.; Zhang, C.; Zhang, W.; Xu, Z.; Wang, S.; Sun, G.; Wang, Z. Improved Winter Wheat Spatial Distribution Extraction from High-Resolution Remote Sensing Imagery Using Semantic Features and Statistical Analysis. Remote Sens. 2020, 12, 538; doi:10.3390/rs12030538.

 

The revised contents in the conclusions (Section 6) that emphasize the effect of the conditional random field are provided below.

 

The main contributions of this study are as follows: (1) Pre-processing (such as statistical analysis of the CNN segmentation results) allows post-processing to use modeled prior knowledge, such that only pixels with lower confidence are processed, thus significantly reducing calculation time. As RefineNet has high segmentation accuracy, this post-processing only requires the use of 20% of all pixels. (2) According to the characteristics of winter wheat planting areas in remote sensing images, the PCCRF uses original channel values, texture features, and low-level semantic features to compose the feature vector and construct the pairwise potential energy. This feature vector better matches the characteristics of the remote sensing imagery. At the same time, after normalizing the pairwise potential energy, its data range is identical to that of the unary potential energy, which is more reasonable than the FCCRF. (3) The PCCRF uses a linear model to fuse the unary energy and pairwise energy, such that the parameters of the linear model are determined while training the PCCRF. This strategy is more reasonable than the fixed-weight strategy adopted by the FCCRF. Owing to the CRF's ability to describe the global spatial correlation between pixel categories, the RefineNet-PCCRF can effectively improve the classification accuracy of edge pixels in a winter wheat planting area.

 

Comment 2. In the experimental setup, the authors randomly choose the training and testing samples for the classification task. What happens when you change the training samples (another random selection of images)?

Reply: According to your good suggestions, we revised the contents of the experimental setup. The new contents are provided below; all new contents are in blue.

 

We used cross-validation techniques in the comparative experiments. Each CNN model was trained over five rounds. In each round, 200 images were selected as test images and the remaining images were used as training images, guaranteeing that each image was used at least once as a test image.
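A minimal sketch (Python) of this evaluation scheme is given below; the shuffling strategy and the handling of the final round (where fewer than 200 never-tested images remain) are our assumptions, not the authors' code.

```python
import random

def five_round_splits(image_ids, test_size=200, rounds=5, seed=42):
    # Each round: test_size test images, the rest for training, while
    # guaranteeing that every image is used at least once as a test image.
    rng = random.Random(seed)
    ids = list(image_ids)
    untested = set(ids)
    splits = []
    for _ in range(rounds):
        if len(untested) >= test_size:
            test = set(rng.sample(sorted(untested), test_size))
        else:
            # Final round: sweep up the remaining never-tested images and
            # pad with random re-tests to keep the test set size constant.
            test = set(untested)
            test |= set(rng.sample(sorted(set(ids) - test),
                                   test_size - len(test)))
        untested -= test
        splits.append((sorted(set(ids) - test), sorted(test)))
    return splits

for train, test in five_round_splits(range(920)):
    print(len(train), len(test))   # 720 training / 200 test images per round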

 

Comment 3. In the experimental setup, could the authors add some details about the hyper-parameter settings, e.g., epochs, learning rate, etc.?

Reply: According to your good suggestions, we added the hyper-parameter settings used in our study. The new contents are provided below; all new contents are in blue.

Table 2 lists the hyper-parameter setup used to train the proposed RefineNet-PCCRF. In the comparison experiments, the same hyper-parameters were also applied to the comparison models.

Table 2. The hyper-parameter setup.

Hyper-parameter     Value
mini-batch size     32
learning rate       0.00001
momentum            0.9
epochs              30,000
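Read as a standard SGD configuration, Table 2 translates directly into framework code. The PyTorch sketch below is an illustration only: the one-layer stand-in model, the 4-band input assumption, and the interpretation of "momentum" as SGD momentum are our assumptions, not the authors' training code.

```python
import torch

# Stand-in model; the real network is the modified RefineNet described above.
model = torch.nn.Conv2d(in_channels=4, out_channels=10,
                        kernel_size=3, padding=1)

# Hyper-parameters from Table 2.
MINI_BATCH_SIZE = 32
LEARNING_RATE = 0.00001
MOMENTUM = 0.9
EPOCHS = 30_000

optimizer = torch.optim.SGD(model.parameters(),
                            lr=LEARNING_RATE, momentum=MOMENTUM)
```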

 

Comment 4. I suggest the authors add to the manuscript the following references related to 3-D CNNs, which aim to preserve the spectral and spatial features of remote sensing images:

- Spectral-spatial classification of hyperspectral imagery with 3D convolutional neural network, Remote Sensing, 2017.

- Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection, Expert Systems with Applications, 2019.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

RefineNet and most other classic CNNs typically use two-dimensional (2-D) convolution methods to extract feature values. Two-dimensional convolution methods are suited to processing images with only a few channels, such as optical remote sensing images or camera images [56]. When processing hyperspectral remote sensing images, to preserve the spectral and spatial features, previous studies have used three-dimensional (3-D) convolution methods to extract spectral–spatial features [56,57]. As the 3-D convolution method can fully use the abundant spectral and spatial information of hyperspectral imagery, it has achieved remarkable success in the classification of hyperspectral images.

 

  1. Li, Y.; Zhang, H.; Shen, Q. Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sens. 2017, 9, 67. doi:10.3390/rs9010067.
  2. Sellami, A.; Farah, M.; Farah, I.R.; Solaiman, B. Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection. Expert Syst. Appl. 2019, 129(5), 246–259. doi:10.1016/j.eswa.2019.04.006.
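To make the channel-handling difference concrete, the hedged PyTorch sketch below contrasts the output shapes of 2-D and 3-D convolutions on a synthetic hyperspectral cube; the band count and kernel sizes are illustrative assumptions.

```python
import torch

# A 2-D convolution mixes all input bands into each filter response,
# while a 3-D convolution also slides along the spectral axis and
# therefore preserves a spectral dimension in its output.
x2d = torch.randn(1, 100, 64, 64)          # (N, bands, H, W)
conv2d = torch.nn.Conv2d(100, 16, kernel_size=3, padding=1)
print(conv2d(x2d).shape)                   # torch.Size([1, 16, 64, 64])

x3d = torch.randn(1, 1, 100, 64, 64)       # (N, C=1, bands, H, W)
conv3d = torch.nn.Conv3d(1, 16, kernel_size=(7, 3, 3), padding=(3, 1, 1))
print(conv3d(x3d).shape)                   # torch.Size([1, 16, 100, 64, 64])
```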

 

Comment 5. The English and format of this manuscript should be checked very carefully.

Reply: According to your good suggestion, we employed a professional English editor to check and edit the English of the full text and corrected the grammatical errors in the original text.

Author Response File: Author Response.pdf

Reviewer 2 Report

I congratulate the authors on this magnificent work. I think it is very well structured and written, and in general they justify each of their decisions perfectly, showing the state of the art and the basic algorithms on which they improve to create their own algorithm, demonstrating its effectiveness brilliantly with data, examples, graphs, and tables, as well as with a very well justified comparison.

 

However, I think there are some aspects that should be considered for publication:

 

I think Figure 4 should be placed before Equation 1, and it should be made a little larger so that the direction of each arrow in the block diagram can be seen more clearly. Does the image enter the network from the top and pass through each convolutional block from top to bottom? This is not very clear from the figure.

 

I think that equation 1 can be better explained by showing the probability vector and the i-th iteration. In this way, the expression that relates each element of the vector of probabilities will be better seen and not so many of the same expressions will be repeated.

I understand that m = 10 categories, but I don't quite understand the relationship with Figure 4, where there are only 4 RefineNets. Is that so? Or is it an example that doesn't exactly correspond to the model implemented?

 

In line 237, it is said that scatter plots are used to obtain the Cgate value. Perhaps an example figure would be useful to better understand that concept.

 

 

I think the statement in line 250 should be supported with references.

 

I don't quite understand what figure 6 provides.

 

At the end of line 293, a space is needed to separate t(xi) from the subsequent "as".

 

Although the metrics used in the experimentation are the classic ones for this type of article and have been referenced, it would not be superfluous to include their definitions if there is space.

 

The results obtained, both in Table 2 and in Figures 8 and 9, are very interesting, but I think more conclusions and more critical discussion could be drawn than what is expressed in the paragraph at lines 364–367. In any case, these conclusions should be presented after the corresponding figures, not before.

 

I think the conclusions about Figures 10 and 11 presented in the paragraph at lines 422–430 should be shown after the figures, not before.

 

In several places in the text, the reduced calculation requirements of the proposed algorithm are mentioned, but I missed a more thorough study supporting this statement.

Author Response

Dear Reviewer:

We would like to thank you for your helpful comments and suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

General comments

Comments: I congratulate the authors on this magnificent work. I think it is very well structured and written, and in general they justify each of their decisions perfectly, showing the state of the art and the basic algorithms on which they improve to create their own algorithm, demonstrating its effectiveness brilliantly with data, examples, graphs, and tables, as well as with a very well justified comparison.

However, I think there are some aspects that should be considered for publication:

Reply: Thank you for your support and helpful suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

Specific comments

Comment 1. I think Figure 4 should be placed before Equation 1, and it should be made a little larger so that the direction of each arrow in the block diagram can be seen more clearly. Does the image enter the network from the top and pass through each convolutional block from top to bottom? This is not very clear from the figure.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

We selected RefineNet as our initial segmentation model. Unlike the FCN, SegNet, DeepLab, and other models, this model uses a multi-path structure that fuses low-level detailed semantic features with high-level rough semantic features, thereby effectively improving the distinguishing ability of the pixel features. We modified the classic RefineNet model to initially segment remote sensing images; Fig. 4 shows the structure of the improved RefineNet model.

 

Figure 4. Structure of the improved RefineNet model used in this study.

 

Improvements to the RefineNet model were as follows.

First, we replaced the equal weight fusion model used in the classic model with a linear fusion model to fuse detailed low-level semantic features and high-level rough semantic features. The fusion method is as follows:

s = a·f + b·g,    (1)

where s denotes the fused features, f represents the detailed low-level semantic feature values generated by the convolution block, g denotes the up-sampled feature of the high-level rough semantic features, and a and b are the coefficients of the fusion model. The specific values of a and b must be determined via model training.

Second, we modified the classifier of RefineNet (i.e., softmax) to simultaneously output the prediction category label and the category probability vector, P, for each pixel.

The probability pi that a pixel is assigned the i-th category label was calculated as follows:

pi = exp(ri) / Σj=1,…,m exp(rj),    (2)

where m is the number of categories, and ri and rj represent the outputs of the RefineNet encoder (i.e., the products of the pixel's feature vector with the i-th and j-th feature functions, respectively). Based on the definition of pi, P can be defined as follows:

P = (p1, p2, …, pm).    (3)
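A compact numerical sketch (Python/NumPy) of Eqs. (1)–(3) follows; the toy encoder outputs and fusion weights are made-up values for illustration, and the max-shift inside the softmax is a standard numerical-stability step rather than part of the paper's formulation.

```python
import numpy as np

def fuse(f, g, a, b):
    # Eq. (1): linear fusion of low-level features f and up-sampled
    # high-level features g; a and b are learned during training.
    return a * f + b * g

def softmax_probs(r):
    # Eqs. (2)-(3): per-pixel category probability vector P from the
    # encoder outputs r_1..r_m (shifted by max(r) for stability).
    e = np.exp(r - r.max())
    return e / e.sum()

r = np.array([2.1, 0.3, -1.0, 0.5])   # toy encoder outputs, m = 4
P = softmax_probs(r)
print(P, P.argmax())                   # probability vector, predicted label
```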

 

Comment 2. I think that equation 1 can be better explained by showing the probability vector and the i-th iteration. In this way, the expression that relates each element of the vector of probabilities will be better seen and not so many of the same expressions will be repeated.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

Second, we modified the classifier of RefineNet (i.e., softmax) to simultaneously output the prediction category label and the category probability vector, P, for each pixel.

The probability pi that a pixel is assigned the i-th category label was calculated as follows:

pi = exp(ri) / Σj=1,…,m exp(rj),    (2)

where m is the number of categories, and ri and rj represent the outputs of the RefineNet encoder (i.e., the products of the pixel's feature vector with the i-th and j-th feature functions, respectively). Based on the definition of pi, P can be defined as follows:

P = (p1, p2, …, pm).    (3)

 

Comment 3. I understand that m = 10 categories, but I don't quite understand the relationship with Figure 4, where there are only 4 RefineNets. Is that so? Or is it an example that doesn't exactly correspond to the model implemented?

Reply: According to your good suggestion, we have revised the relevant confusing content; in particular, we renamed the fuse blocks in Figure 4. The revised contents are as follows, and all revised contents are in blue.

 

The probability pi that a pixel is assigned the i-th category label was calculated as follows:

pi = exp(ri) / Σj=1,…,m exp(rj),    (2)

where m is the number of categories, and ri and rj represent the outputs of the RefineNet encoder (i.e., the products of the pixel's feature vector with the i-th and j-th feature functions, respectively). Based on the definition of pi, P can be defined as follows:

P = (p1, p2, …, pm).    (3)

 

 Comment 4. In line 237, it is said that scatter plots are used to obtain the Cgate value. Perhaps an example figure would be useful to better understand that concept.

Reply: According to your good suggestion, we added an example. It should be noted that, considering that a histogram illustrates the concept better than a scatter plot, we now use a histogram. The new contents are as follows, and all revised contents are in blue.

 

Third, histograms were produced for PR and PW using CL as the x-axis and the number of pixels corresponding to a certain CL value as the y-axis. Figure 5 provides an example of such a histogram, which was used to determine the value of Cgate. In general, the principle is that, when CL is greater than Cgate, the number of misclassified pixels should be as small as possible.

Figure 5. An example of a histogram.
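As an illustration of how such histograms can drive the choice of Cgate, the sketch below (Python/NumPy) draws synthetic confidence levels for correctly and incorrectly classified pixels and picks the smallest CL threshold above which misclassified pixels are rare; the Beta distributions and the 5% tolerance are assumptions for the example only.

```python
import numpy as np

def choose_cgate(cl_correct, cl_wrong, max_wrong_frac=0.05):
    # Pick the smallest confidence level above which the share of
    # misclassified pixels stays below the tolerance.
    for cgate in np.linspace(0.5, 0.99, 50):
        kept_wrong = np.sum(cl_wrong > cgate)
        kept_total = kept_wrong + np.sum(cl_correct > cgate)
        if kept_total and kept_wrong / kept_total < max_wrong_frac:
            return cgate
    return 0.99

rng = np.random.default_rng(1)
cl_correct = rng.beta(8, 2, 100_000)   # correctly classified: high CL
cl_wrong = rng.beta(3, 4, 20_000)      # misclassified: lower CL
print(round(choose_cgate(cl_correct, cl_wrong), 3))
```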

 

 

Comment 5. I think the statement in line 250 should be supported with references.

Reply: According to your good suggestion, we have revised the relevant content and added references. The revised contents are as follows, and all revised contents are in blue.

Based on previous studies [51–53,58,59], approximately 80% of the pixel-by-pixel classification results generated by CNN models are credible.

 

Comment 6. I don't quite understand what figure 6 provides.

Reply: According to your good suggestion, and considering that Figure 6 does not provide new information, we have deleted it.

 

Comment 7. At the end of line 293, a space is needed to separate t(xi) from the subsequent "as".

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

Based on the definition of t(xi), we can define the sum of the pairwise potential energy of xi as:

 

Comment 8. Although the metrics used in the experimentation are the classic ones for this type of article and have been referenced, it would not be superfluous to include their definitions if there is space.

Reply: According to your good suggestion, we have added relevant content. The revised contents are as follows, and all revised contents are in blue.

 

We used four popular criteria, namely Accuracy, Precision, Recall, and F1-score [80], to evaluate the performance of the proposed model. They were calculated using the confusion matrix.

Accuracy is the ratio of the number of correctly classified samples to the total number of samples, calculated as:

Accuracy = Σi nii / Σi Σj nij,    (23)

where nii denotes the number of correctly classified samples of class i, and nij is the number of samples of class i misidentified as class j. Precision denotes the average proportion of pixels correctly classified to one class out of the total pixels retrieved for that class, calculated as:

Precision = (1/m) Σk (nkk / Σi nik),    (24)

Recall represents the average proportion of pixels that are correctly classified in relation to the actual total pixels of a given class, calculated as:

Recall = (1/m) Σk (nkk / Σj nkj),    (25)

The F1-score represents the harmonic mean of precision and recall, calculated as:

F1-score = 2 × Precision × Recall / (Precision + Recall).    (26)
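The hedged sketch below (Python/NumPy) computes Eqs. (23)–(26) from a confusion matrix; the convention that entry n[i, j] counts class-i samples predicted as class j is our assumption, chosen to match the definitions above.

```python
import numpy as np

def metrics(conf):
    # conf[i, j]: number of class-i samples predicted as class j.
    n = conf.astype(float)
    diag = np.diag(n)
    accuracy = diag.sum() / n.sum()                               # Eq. (23)
    precision = np.mean(diag / np.maximum(n.sum(axis=0), 1e-9))   # Eq. (24)
    recall = np.mean(diag / np.maximum(n.sum(axis=1), 1e-9))      # Eq. (25)
    f1 = 2 * precision * recall / (precision + recall)            # Eq. (26)
    return accuracy, precision, recall, f1

conf = np.array([[90,  5,  5],
                 [10, 80, 10],
                 [ 0, 10, 90]])
print([round(v, 4) for v in metrics(conf)])
```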

We evaluated the results using the accuracy, precision, recall, and F1-score. The RefineNet-PCCRF scored highest of all the models on all metrics (Table 3).

Table 3. Comparison of the six results.

Index        SegNet    SegNet-CRF    SegNet-PCCRF    RefineNet    RefineNet-CRF    RefineNet-PCCRF
Accuracy     79.01%    81.31%        83.86%          86.79%       94.01%           94.51%
Precision    76.50%    78.94%        80.68%          85.45%       91.71%           92.39%
Recall       73.61%    76.24%        80.40%          79.54%       89.16%           90.98%
F1-score     75.03%    77.57%        80.54%          82.39%       90.42%           91.68%

 

Comment 9. The results obtained, both in Table 2 and in Figures 8 and 9, are very interesting, but I think more conclusions and more critical discussion could be drawn than what is expressed in the paragraph at lines 364–367. In any case, these conclusions should be presented after the corresponding figures, not before.

Reply: According to your good suggestion, we have added relevant content. The revised contents are as follows, and all revised contents are in blue.

Results and Evaluation

Figure 7 presents 10 randomly selected image blocks and their corresponding results using the six comparison methods.

 

Figure 7. Comparison of the segmentation results for 10 randomly selected image blocks: (a) original images, (b) manually-labeled images corresponding to (a), (c) SegNet, (d) SegNet-CRF, (e) SegNet-PCCRF, (f) RefineNet, (g) RefineNet-CRF, and (h) RefineNet-PCCRF.

 

Although there are some misclassified pixels in the inner regions of the winter wheat planting area in the SegNet results, the overall classification accuracy of each comparison method in the inner regions of the winter wheat planting area is satisfactory. The differences between the results of the six comparison models at the edges are observable. In the SegNet results, the edges of the winter wheat fields are rough; the RefineNet results are therefore superior to those of SegNet, demonstrating the importance of using fused features rather than high-level features alone. Both the CRF and PCCRF post-processing methods produced superior results, demonstrating the importance of post-processing procedures. SegNet-PCCRF was superior to SegNet-CRF, and RefineNet-PCCRF was superior to RefineNet-CRF, demonstrating that the PCCRF is more suitable as a post-processing method. Comparing SegNet-PCCRF and RefineNet-CRF, the performance of RefineNet-CRF was superior, confirming that the initial segmentation method is also an extremely significant factor in determining the final result.

We used four popular criteria, namely Accuracy, Precision, Recall, and F1-score [80], to evaluate the performance of the proposed model. They were calculated using the confusion matrix.

Accuracy is the ratio of the number of correctly classified samples to the total number of samples, calculated as:

Accuracy = Σi nii / Σi Σj nij,    (23)

where nii denotes the number of correctly classified samples of class i, and nij is the number of samples of class i misidentified as class j. Precision denotes the average proportion of pixels correctly classified to one class out of the total pixels retrieved for that class, calculated as:

Precision = (1/m) Σk (nkk / Σi nik),    (24)

Recall represents the average proportion of pixels that are correctly classified in relation to the actual total pixels of a given class, calculated as:

Recall = (1/m) Σk (nkk / Σj nkj),    (25)

The F1-score represents the harmonic mean of precision and recall, calculated as:

F1-score = 2 × Precision × Recall / (Precision + Recall).    (26)

We evaluated the results using the accuracy, precision, recall, and F1-score. The RefineNet-PCCRF scored highest of all the models on all metrics (Table 3).

Table 3. Comparison of the six results.

Index        SegNet    SegNet-CRF    SegNet-PCCRF    RefineNet    RefineNet-CRF    RefineNet-PCCRF
Accuracy     79.01%    81.31%        83.86%          86.79%       94.01%           94.51%
Precision    76.50%    78.94%        80.68%          85.45%       91.71%           92.39%
Recall       73.61%    76.24%        80.40%          79.54%       89.16%           90.98%
F1-score     75.03%    77.57%        80.54%          82.39%       90.42%           91.68%

 

Figures 8 and 9 present confusion matrices for the different models, demonstrating that the RefineNet-PCCRF achieved the best segmentation results.

 

Figure 8. Confusion matrices of different models using the GF-2 image datasets: (a) SegNet, (b) SegNet-CRF, (c) SegNet-PCCRF, (d) RefineNet, (e) RefineNet-CRF, and (f) RefineNet-PCCRF.

 

 

 

Figure 9. Confusion matrices of the different models using the GF-2 image datasets: (a) SegNet, (b) SegNet-CRF, (c) SegNet-PCCRF, (d) RefineNet, (e) RefineNet-CRF, and (f) RefineNet-PCCRF.

In the confusion matrices of the six models, there is almost no confusion between winter wheat and urban areas, which can be attributed to the difference in the characteristics of the two land-use types. However, the confusion between winter wheat and farmland is serious. This is because most winter wheat regions that were misclassified as farmland have poor growing conditions; in these areas, the characteristics are similar to those of farmland in winter, which leads to a greater probability of misclassification. There is also a certain degree of confusion between winter wheat and woodland, because certain trees are still green in winter, similar to the characteristics of winter wheat regions. However, in this case, owing to the use of both texture and high-level semantic information, the degree of confusion is significantly lower than that with farmland. This also illustrates the advantage of post-processing from another aspect: it introduces new information, which can effectively improve the accuracy of the classification results.

 

 

Comment 10. I think the conclusions about Figures 10 and 11 presented in the paragraph at lines 422–430 should be shown after the figures, not before.

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

5.3. Cgate Effect

Given the overall importance of the Cgate parameter in the RefineNet-PCCRF, we held all other parameters steady and calculated the relationships among Cgate, accuracy (Fig. 10), and consumed time (Fig. 11).

 

Figure 10. The relationship between the average segmentation accuracy and Cgate.

Figure 11. Relationship between average consumed time and Cgate.

Higher Cgate values improved the accuracy because pixels were filtered with a higher level of confidence. Post-processing resulted in the reclassification of the initially misclassified pixels, thus improving the accuracy of the overall result. Therefore, when selecting the Cgate value, we must consider the classification ability of the initial segmentation model. In addition, selecting a model with a stronger classification ability for the preliminary segmentation can significantly improve the performance of the results obtained from the PCCRF model. Higher Cgate values also increased the consumed time; this indicates that further reducing the number of pixels involved in modeling, i.e., using more prior knowledge, is the key to further improving the computational efficiency of both the PCCRF and classic CRF models.
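A minimal sketch of the experiment behind Figures 10 and 11, in Python: hold all other parameters fixed, sweep Cgate, and record accuracy and wall-clock time. The run_pccrf callable is a hypothetical stand-in for the real segmentation pipeline, not the authors' code.

```python
import time

def sweep_cgate(cgate_values, run_pccrf):
    # run_pccrf(cgate) -> accuracy; hypothetical stand-in for the full
    # RefineNet-PCCRF pipeline evaluated at a given Cgate.
    results = []
    for cgate in cgate_values:
        t0 = time.perf_counter()
        accuracy = run_pccrf(cgate)
        results.append((cgate, accuracy, time.perf_counter() - t0))
    return results

# Usage with a dummy stand-in that mimics the observed upward trend:
for cgate, acc, secs in sweep_cgate([0.6, 0.7, 0.8, 0.9],
                                    lambda c: 0.90 + 0.05 * c):
    print(f"Cgate={cgate:.1f}  accuracy={acc:.3f}  time={secs:.6f}s")
```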

 

 

Comment 11. In several places in the text, the reduced calculation requirements of the proposed algorithm are mentioned, but I missed a more thorough study supporting this statement.

Reply: According to your good suggestion, we have added relevant content to illustrate the reduced calculation requirements of the proposed algorithm. The revised contents are as follows, and all revised contents are in blue.

 

Table 4 lists the average time required for each method to complete the testing of a single image. The proposed RefineNet-PCCRF method increases the time by approximately 3% while improving the accuracy by 5–8%. The time consumed by the CRF is higher than that of the proposed PCCRF method because the CRF has to calculate the distances between all pixel–pixel pairs in a single image, whereas the proposed PCCRF method calculates the distances for only a small number of pixel–pixel pairs. The number of pixel–pixel pairs calculated in the SegNet-PCCRF is only approximately 30% of that in the SegNet-CRF, and the number calculated in the RefineNet-PCCRF is only approximately 20% of that in the RefineNet-CRF.

Table 4. Statistical comparison of model performance.

Index        SegNet    SegNet-CRF    SegNet-PCCRF    RefineNet    RefineNet-CRF    RefineNet-PCCRF
Time [ms]    301       383           315             293          403              313

*ms: millisecond
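The quoted 20–30% figures follow directly from counting pixel–pixel pairs. A back-of-the-envelope sketch in Python, assuming each post-processed pixel sums its pairwise terms over all other pixels (one sum per updated pixel; this counting convention is our assumption):

```python
def fccrf_pairs(n):
    # Pairs evaluated when every pixel's pairwise energy is summed
    # over all other pixels (fully connected CRF).
    return n * (n - 1)

def pccrf_pairs(n, selected_frac):
    # Pairs evaluated when only the selected low-confidence pixels
    # are post-processed (partly connected CRF).
    return int(n * selected_frac) * (n - 1)

n = 1_000 * 1_000                            # one 1,000 x 1,000 image block
ratio = pccrf_pairs(n, 0.20) / fccrf_pairs(n)
print(f"{ratio:.0%} of the fully connected workload")   # 20%
```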

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Image classification is one of the most important problems in remote sensing, and a particularly important task is improving the accuracy of edge pixel classification. This problem has been known for years. The authors tried to solve it through improved winter wheat spatial distribution extraction using a CNN and a PCCRF. The obtained results allowed them to conclude that a CNN can significantly improve the overall accuracy of remote sensing image segmentation results, but not in all cases. This was the reason why they tested the PCCRF model, which was used to post-process the results of the CNN to better solve the problem of rough edges in results extracted using only a CNN. In my opinion, the results of the experiments are worth publishing.

Some comments for the authors:

152 - explain why ENVI software was used;

164 - I suggest explaining the number of pixels in a block (1000) and the number of blocks (920) - why such numbers?

172 - the legend is not clearly visible

356 - I suggest giving the picture and the description of the picture (Figure 7) on the same page

 

 

Author Response

Dear Reviewer:

We would like to thank you for your helpful comments and suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

General comments

Comments: Image classification is one of the most important problems in remote sensing, and a particularly important task is improving the accuracy of edge pixel classification. This problem has been known for years. The authors tried to solve it through improved winter wheat spatial distribution extraction using a CNN and a PCCRF. The obtained results allowed them to conclude that a CNN can significantly improve the overall accuracy of remote sensing image segmentation results, but not in all cases. This was the reason why they tested the PCCRF model, which was used to post-process the results of the CNN to better solve the problem of rough edges in results extracted using only a CNN. In my opinion, the results of the experiments are worth publishing.

Reply: Thank you for your support and helpful suggestions. We have substantially revised the manuscript according to your suggestions, and detailed responses are provided below. All revised contents are in blue.

 

Specific comments

Comment 1. 152 - explain why ENVI software was used;

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

 

ENVI is remote sensing image processing software that integrates numerous mainstream image processing tools and therefore improves the efficiency of image processing and utilization. In particular, ENVI allows image processing programs to be developed in the Interactive Data Language (IDL) according to our requirements, which further improves our work efficiency. We used ENVI to pre-process the imagery in three steps: atmospheric correction with the FLAASH module, orthorectification with the RPC module, and data fusion with the NNDiffuse Pan Sharpening module. We developed a batch program in IDL to improve the degree of automation during pre-processing.

Comment 2. 164 - I suggest to explain the number of pixels in block (1000) and number of blocks (920) - why such numbers

Reply: According to your good suggestion, we have revised relevant content. The revised contents are as follows, and all revised contents are in blue.

Larger image blocks are advantageous for model training. Considering the hardware used in our research, we cut each pre-processed image into equal-sized image blocks (1,000 × 1,000 pixels). A total of 920 cloudless image blocks were selected for manual labeling, with numbers assigned to the following categories: (1) winter wheat, (2) mountain land, (3) water, (4) urban residential area, (5) agricultural building, (6) woodland, (7) farmland, (8) roads, (9) rural residential area, and (10) others. While selecting the pixel blocks, we followed the principle that each pixel block should contain at least three land-use types, and that the area proportion of each land-use type in the selected images should be similar to that in the pre-processed images.
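A minimal tiling sketch in Python/NumPy is shown below; the border handling (discarding partial tiles) and the example image size are assumptions for illustration.

```python
import numpy as np

def cut_into_blocks(image, block=1000):
    # Cut a pre-processed image (H, W, bands) into equal-sized
    # block x block tiles, discarding partial tiles at the borders.
    h, w = image.shape[:2]
    return [image[r:r + block, c:c + block]
            for r in range(0, h - block + 1, block)
            for c in range(0, w - block + 1, block)]

tiles = cut_into_blocks(np.zeros((3500, 4200, 4), dtype=np.uint16))
print(len(tiles), tiles[0].shape)   # 12 tiles of shape (1000, 1000, 4)
```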

 

Comment 3. 172 - the legend is not clearly visible

Reply: According to your good suggestion, we have revised the figure. The revised figure is as follows.

Comment 4. 356 - I suggest giving the picture and the description of the picture (Figure 7) on the same page

Reply: According to your good suggestion, we have retyped this part, and the figure and its description are now on the same page.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have revised the manuscript carefully according to my questions. I have no further questions about this manuscript. It could be accepted.
