Peer-Review Record

Attention-Based Fine-Grained Lightweight Architecture for Fuji Apple Maturity Classification in an Open-World Orchard Environment

Agriculture 2023, 13(2), 228; https://doi.org/10.3390/agriculture13020228
by Li Zhang 1, Qun Hao 1,2,3 and Jie Cao 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 10 November 2022 / Revised: 11 January 2023 / Accepted: 12 January 2023 / Published: 17 January 2023
(This article belongs to the Special Issue Advances in Agricultural Engineering Technologies and Application)

Round 1

Reviewer 1 Report

This paper presents a novel CNN-based fine-grained lightweight architecture for the Fuji apple maturity classification (AFGL-MC) task. I have the following comments that may improve the paper.

1. The following very recent paper is highly related to the scope of this paper and should be reviewed in Section 2.1, since augmentation is an important factor to consider in the agricultural field for overcoming challenges such as occlusion, illumination, and ripeness level.

https://doi.org/10.29130/dubited.1075572

2. The data collection process is not detailed enough. What were the camera specs during the data collection process, and what was the distance between the camera and the objects? I understand that the images were taken once a week; however, the timing should also be included.

3. How can you make sure that the proposed method can handle fruit occlusion when conducting ripeness classification?

4. Since the proposed model was trained from scratch, how did you manage the tuning of the hyperparameters apart from the learning rate?

5. The experiments only demonstrate a comparison with other popular neural networks. I would suggest adding a new experimental setup under different occlusion and illumination conditions and reporting the accuracy for all the networks, including the proposed one.

6. What is the application of the proposed model? Can it be employed on smartphones through the cloud?

7. Are there any other benchmarks for Fuji apples so that you can also show a comparative analysis?

8. Who carried out the labelling procedure? An expert in the field or the researchers?

 

9. The paper requires proofreading.

Author Response

1. The following very recent paper is highly related to the scope of this paper and should be reviewed in Section 2.1, since augmentation is an important factor to consider in the agricultural field for overcoming challenges such as occlusion, illumination, and ripeness level.

https://doi.org/10.29130/dubited.1075572

 

 

Response: We are grateful to the reviewer for pointing this out. Following your advice, we have added this highly related work to Section 2.1: “[16] proposed a strawberry ripeness detection system based on a camouflage-based data augmentation technique that simulates the natural environment of strawberry harvesting conditions and achieved promising results.”

Reference:

  1. Sadak, F. Strawberry Ripeness Assessment Via Camouflage-Based Data Augmentation for Automated Strawberry Picking Robot. Düzce Üniversitesi Bilim ve Teknoloji Dergisi 2022, 10, 1589-1602.

 

2. The data collection process is not detailed enough. What were the camera specs during the data collection process, and what was the distance between the camera and the objects? I understand that the images were taken once a week; however, the timing should also be included.

 

Response: Thank you very much for your positive and valuable comments. We have added the detailed information to the revised manuscript: “We captured images in an open-world orchard environment in Bologna (11°21′ E, 44°30′ N), Italy, with a digital camera (Canon EOS Kiss X5). Images were captured once per week for more than sixteen weeks, from July 5 to November 15, with a distance of around one to one and a half meters between the camera and the Fuji trees. Because the fruits were quite small and entirely immature during the first four weeks, only the last twelve weeks of data, comprising 9852 captured images, were kept as our original image data and processed at a resolution of 1280 by 720 pixels.”

 

3. How can you make sure that the proposed method can handle fruit occlusion when conducting ripeness classification?

 

Response: We are grateful to the reviewer for raising this point. In the present work, we focus on fruits under natural sunlight changes and background influence. We designed the model with an effective attention module that focuses on the fruit region and avoids background influence as much as possible. We will address the ripeness classification of fruits under occlusion in future work, and we have added this to Section 7 (future work) of the revised manuscript: “In future work, we will go one step further and focus on the fruit ripeness classification task under occlusion conditions, trying to classify fruit maturity under occlusion by complementing the information from multiple nearby frames of a video.”
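For illustration only, the sketch below shows a generic spatial-attention module in PyTorch (in the spirit of CBAM's spatial branch). It is our own hedged example, not the paper's actual attention module, and the class name and kernel size are assumptions; it merely demonstrates how an attention map can down-weight background pixels so later layers focus on the fruit region.

    # Generic spatial-attention sketch (NOT the paper's module):
    # an attention map re-weights the feature map so background
    # responses are suppressed and the fruit region is emphasized.
    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size=7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x):                   # x: (B, C, H, W)
            avg = x.mean(dim=1, keepdim=True)   # channel-wise average
            mx, _ = x.max(dim=1, keepdim=True)  # channel-wise max
            attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
            return x * attn                     # re-weighted features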

 

4. Since the proposed model was trained from scratch, how did you manage the tuning of the hyperparameters apart from the learning rate?

Response: Thank you very much for your positive and valuable comments. We have added detailed information on how we managed the hyperparameters to the revised manuscript.

Here, N is the total number of epochs, and we took N = 200 in this experiment. AFGL-MC was trained with a mini-batch size of 64 for 200 epochs. The learning rate was decreased in stages: 10^-1 for the first 60 update epochs, then 10^-2 for epochs 60 to 120, 10^-3 for epochs 120 to 180, and 10^-4 for the last 20 update epochs.
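For concreteness, here is a minimal sketch of this staged schedule, assuming a PyTorch training loop (the placeholder model and loop body are ours, not from the paper):

    # Staged learning-rate schedule described above: 1e-1 for the first
    # 60 epochs, then 1e-2, 1e-3, and 1e-4 for epochs 60-120, 120-180,
    # and 180-200, respectively.
    import torch

    model = torch.nn.Linear(10, 4)  # placeholder, not AFGL-MC itself
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 180], gamma=0.1)

    N = 200  # total epochs; the paper uses a mini-batch size of 64
    for epoch in range(N):
        # ... one training pass over the mini-batches goes here ...
        scheduler.step()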

 

5. The experiments only demonstrate a comparison with other popular neural networks. I would suggest adding a new experimental setup under different occlusion and illumination conditions and reporting the accuracy for all the networks, including the proposed one.

Response: Thank you very much for your positive and valuable comments. Following your advice, we have added a new experimental result under different occlusion and illumination conditions to the revised manuscript.

To analyze the performance of our proposed model under different occlusion and illumination conditions, we set up an experiment comparing the four models under three conditions: normal light without occlusion, normal light with occlusion, and non-integrated light without occlusion. The final accuracy results are shown in Table 4.

From the compared results, we found that all the models achieved their best accuracy under natural illumination without occlusion. When the target fruits were occluded, the accuracy decreased considerably, probably because occlusion seriously affects the judgment of fruit appearance. Under the occlusion condition, our proposed model obtained better results than the other three models, which may be thanks to the attention mechanism focusing on the fruit region. Illumination changes had a smaller effect than occlusion, and our proposed model handled them well.
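A hedged sketch of how such a per-condition evaluation can be run (our illustration, not the authors' code; the function name and loader structure are assumptions):

    # Evaluate a trained classifier separately on each condition subset,
    # e.g. normal light / occluded / non-integrated light.
    import torch

    @torch.no_grad()
    def accuracy_per_condition(model, loaders):
        # loaders: dict mapping condition name -> DataLoader of (image, label)
        model.eval()
        results = {}
        for cond, loader in loaders.items():
            correct = total = 0
            for images, labels in loader:
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
            results[cond] = correct / total
        return results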

 

6. What is the application of the proposed model? Can it be employed on smartphones through the cloud?

Response: We are grateful to the reviewer for raising this point. In this research, we hope the designed model can serve as a ripeness classification step after fruit detection for a harvesting robot. By adding this lightweight model after the fruit detection task, a harvesting robot can precisely choose fruit of a certain grade, which may avoid the potential extra costs and fruit damage of subsequent tasks such as sorting. In future work, we will also try to deploy the model on a harvesting robot.

7. Are there any other benchmarks for Fuji apples so that you can also show a comparative analysis?

Response: Thanks very much for the kind suggestion. To the best of our knowledge, there are few public benchmarks for Fuji ripeness classification. We found some public Fuji benchmarks, but they target the fruit detection task and are irrelevant to the research content of this paper. To facilitate future research on the Fuji ripeness classification task, we will organize the image dataset we used and release it as a public benchmark.

 

8. Who carried out the labelling procedure? An expert in the field or the researchers?

Response: We are grateful to the reviewer for raising this point. Our captured images were labelled by experts in the field according to the United States Standards for Grades of Apples. We specify this point in the revised manuscript: “Because the appearance of Fuji apples is highly related to their attributes, we invited experts to label these images. According to the capture date and the appearance of each fruit, the experts classified the images into four categories: unripe, turning, ripe, and overripe.”

9. The paper requires proofreading.

Response: We are very grateful for your comments on the manuscript. We have checked all the references and formatted them strictly according to the Guide for Authors. Following your advice, we did our best to improve the manuscript and made a number of changes. These changes do not affect the content or framework of the paper.

Author Response File: Author Response.docx

Reviewer 2 Report

This manuscript presents a fine-grained classification architecture for the Fuji maturity classification task. The overall English is weak. The main issues are as follows:

L68, “The rest of the paper is organized as follows.”: I think this part should start a new paragraph, and L73 repeats its content.

L80, the “Apple Maturity Classification” part first introduces maturity and methods, but I think this material belongs in the introduction. The related work should instead introduce the details of the deep methods, such as fine-grained visual categorization, and extended content.

L143, the regular growth pattern of Fuji: at what point in time? At what stage of growth? Why?

L158, why are four categories used for the classification? Is there a reference, or is it expert experience?

L184-186: ImageNet images have much lower resolutions. Training a model from scratch incurs high time and hardware costs, whereas fine-tuning a model requires only a few resources.

L216, Figure 4: what is the gray part? Where is the max pooling? What is “BR”? Perhaps it should be “B”? Does the blue Conv(3,1) represent a kernel size of 1 or 3, and why?

L344: five models are compared, but the figure only shows three other models. The training curves in Figure 8 also differ greatly from those in Figure 7; why? Moreover, in the training analysis the loss is more important than the accuracy.

L359: the sequence of the four categories from best to worst is G4, G1, G2, G3, but Figure 6 shows the sequence G4, G1, G3, G2. The two results are contradictory.

 

L375, Section “5.3 Generalization evaluation” aims to analyze the applicability of AFGL-MC. However, since this manuscript introduces a fine-grained dataset, the claim should be proved on that fine-grained dataset, and only a few training epochs are not enough.

This paper presents modeling results on field apple classification data and other public data as an argument for the superiority of the proposed method. From the analysis of the whole article's architecture, one can only conclude that the present architecture achieves better classification accuracy when classifying apples in the field.

Since no ablation experiments were performed, the description of the construction and the training results is not very clear, and the analysis of the results contains a serious logical contradiction, some questions remain about the validity of the model training. Therefore, the overall argument is weak and not illustrative.

Overall, the manuscript offers little algorithmic innovation and a weak introduction to the maturity measurement of Fuji apples; it reads as an algorithmic article that is not very informative for agriculture.

Author Response

L68, “The rest of the paper is organized as follows.”: I think this part should start a new paragraph, and L73 repeats its content.

Response: We are very sorry for this mistake, and we have corrected it in the manuscript.

 

L80, the “Apple Maturity Classification” part first introduces maturity and methods, but I think this material belongs in the introduction. The related work should instead introduce the details of the deep methods, such as fine-grained visual categorization, and extended content.

Response: We are grateful to the reviewer for raising this point. Following your advice, we have reorganized the content of the introduction and the related work.

L143, the regular growth pattern of Fuji: at what point in time? At what stage of growth? Why?

Response: Thank you very much for your positive and valuable comments. We have added the detailed information to the revised manuscript: “We captured images in an open-world orchard environment in Bologna (11°21′ E, 44°30′ N), Italy, with a digital camera (Canon EOS Kiss X5). Images were captured once per week for more than sixteen weeks, from July 5 to November 15, with a distance of around one to one and a half meters between the camera and the Fuji trees. Because the fruits were quite small and entirely immature during the first four weeks, only the last twelve weeks of data, comprising 9852 captured images, were kept as our original image data and processed at a resolution of 1280 by 720 pixels.”

 

L158, why are four categories used for the classification? Is there a reference, or is it expert experience?

Response: Thanks very much for the kind comments. These four categories are derived from the United States Standards for Grades of Apples. We studied these standards and, together with the experts' experience, the image capture dates, the fruit appearance, and so on, defined these four categories for our research dataset. Following your advice, we have added a reference and detailed information to the revised manuscript: “According to the United States Standards for Grades of Apples [38], as a mature apple becomes overripe it shows varying degrees of firmness depending on the stage of the ripening process, so the four terms ‘hard’, ‘firm’, ‘firm ripe’, and ‘ripe’ are used to describe the different stages. Because the appearance of Fuji apples is highly related to their attributes, we invited experts to label these images. According to the capture date and the appearance of each fruit, the experts classified the images into four categories: unripe, turning, ripe, and overripe.”

Reference:

[38] USDA, AMS. United States Standards for Grades of Apples. USDA Publication, 2002.

 

L184-186: ImageNet images have much lower resolutions. Training a model from scratch incurs high time and hardware costs, whereas fine-tuning a model requires only a few resources.

Response: Thank you very much for your positive and valuable comments. Lines 184-186 indicate that tiny-scale datasets are more promising in practical use. We have revised this in the manuscript:

“In fact, such tiny-scale datasets are widely and prevalently used in practical fields due to the high time and monetary cost of data collection in practice.”

 

L216, Figure 4: what is the gray part? Where is the max pooling? What is “BR”? Perhaps it should be “B”? Does the blue Conv(3,1) represent a kernel size of 1 or 3, and why?

Response: Thanks for your comments. The gray part represents Batch Normalization, and the deep purple part represents the max pooling after the first ReLU layer. The B and R of the abbreviation BR stand for Batch Normalization and ReLU, respectively. Conv(k, s) denotes a convolution layer with kernel size k and stride s, so Conv(3,1) is a convolution layer with kernel size 3 and stride 1. To present Figure 4 more clearly, we have redrawn it in the revised manuscript.
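As a reading aid, here is a minimal PyTorch sketch of this Conv(k, s) + BR notation (our illustration; the channel counts are placeholder assumptions, not values from the paper):

    # Conv(k, s): convolution with kernel size k and stride s, followed
    # by the "BR" pair: Batch Normalization (B) and ReLU (R).
    import torch.nn as nn

    def conv_br(in_ch, out_ch, k, s):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    # e.g. the blue Conv(3, 1) block, followed by the max pooling that
    # comes after the first ReLU in the figure:
    stem = nn.Sequential(conv_br(3, 32, k=3, s=1), nn.MaxPool2d(2))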

 

L344: five models are compared, but the figure only shows three other models. The training curves in Figure 8 also differ greatly from those in Figure 7; why? Moreover, in the training analysis the loss is more important than the accuracy.

Response: Thanks for your comments. We compared against ResNet-18, DenseNet-121, MobileNet-Tiny, AlexNet, and VGG-16. Because AlexNet and VGG-16 were hard to converge on our tiny dataset, as stated at L348, Figure 8 only shows the comparison results of the other three models. To avoid possible disagreement, we have revised this section of the manuscript: “We trained all these network models independently on our tiny-scale dataset with 200 total training epochs and kept the other conditions as close as possible. We found that AlexNet and VGG-16 were hard to converge with such a limited number of training samples and iterations, so we only compared AFGL-MC with ResNet-18, DenseNet-121, and MobileNet-Tiny.”

Figure 7 presents the visualized feature maps, Figure 8 shows the curves of the evaluation results, and Figure 6 shows the evaluation results of our proposed model on the different grades; they present different content. All the compared models we trained tended to converge.

 

L359: the sequence of the four categories from best to worst is G4, G1, G2, G3, but Figure 6 shows the sequence G4, G1, G3, G2. The two results are contradictory.

Response: We are grateful to the reviewer for raising this point. In fact, Figure 6 shows the curves of the average accuracy of the four categories, which are G4, G1, G2, and G3 from best to worst. Table 3 at L359 reports the Precision, Recall, and F1-score indicators, while the confusion matrix in Figure 9 presents the average accuracy on each grade: G4, G1, G3, and G2 achieve 1.0, 0.92, 0.88, and 0.84 from best to worst, respectively.
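For readers unfamiliar with reading per-grade accuracy off a confusion matrix, a small sketch follows; the counts below are made up by us, chosen only to reproduce the reported 0.92, 0.84, 0.88, and 1.0 (the real matrix is in Figure 9 of the paper).

    # Per-grade accuracy (per-class recall) from a confusion matrix:
    # the diagonal entry of each row divided by that row's total.
    import numpy as np

    # Rows: true grade G1..G4; columns: predicted grade (illustrative
    # counts only, consistent with the reported per-grade accuracies).
    cm = np.array([[92,  5,  3,   0],
                   [ 6, 84,  8,   2],
                   [ 4,  6, 88,   2],
                   [ 0,  0,  0, 100]])
    per_grade = cm.diagonal() / cm.sum(axis=1)
    print(dict(zip(["G1", "G2", "G3", "G4"], per_grade)))
    # -> {'G1': 0.92, 'G2': 0.84, 'G3': 0.88, 'G4': 1.0}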

 

L375, Section “5.3 Generalization evaluation” aims to analyze the applicability of AFGL-MC. However, since this manuscript introduces a fine-grained dataset, the claim should be proved on that fine-grained dataset, and only a few training epochs are not enough.

Response: Thank you so much for the valuable comments. Section “5.3 Generalization evaluation” analyzes whether our presented model generalizes to other public benchmarks; the experiments on our own dataset were already reported in the previous section. We fully agree that only a few training epochs would not be enough in general; however, because both the public benchmark and our proposed fine-grained dataset are quite tiny, training with more epochs quickly plateaus. That is why we designed this small architecture, which aims to learn from small datasets.

Round 2

Reviewer 1 Report

I have carefully considered the response letter, which addresses my comments thoroughly, and I suggest the publication of this paper.

 

Reviewer 2 Report

All the questions above have been addressed, although there is still room for improvement regarding the training loss and the public dataset.

I agree to the publication of this manuscript. 
