Peer-Review Record

Design and Experimental Verification of the YOLOV5 Model Implanted with a Transformer Module for Target-Oriented Spraying in Cabbage Farming

Agronomy 2022, 12(10), 2551; https://doi.org/10.3390/agronomy12102551
by Hao Fu 1,2,†, Xueguan Zhao 3,†, Huarui Wu 3, Shenyu Zheng 1, Kang Zheng 1 and Changyuan Zhai 1,2,*
Submission received: 18 August 2022 / Revised: 22 September 2022 / Accepted: 13 October 2022 / Published: 18 October 2022

Round 1

Reviewer 1 Report

Paper review for Agronomy (MDPI)

 

Design and Experimental Verification of the YOLOV5 Model Implanted with a Transformer Module for Target-Oriented Spraying in Cabbage Farming

 

Summary

 

The paper uses a deep neural network inspired by YOLOv5 in a target-oriented cabbage spraying task. To improve the detection scores, two main strategies are employed: (I) adding transformer modules to YOLOv5 for improved feature extraction and (II) image data augmentation by blurring a subset of the collected images. The proposed pipeline shows improvement in terms of pesticide saving rate, cabbage detection scores, and computation time w.r.t. YOLOv5. The effect of lighting conditions on the performance of the detection and spraying system is also evaluated. The pipeline is tested with a real system in the field on non-trivial test cases.

 

Strength

- Experiments are conducted on a real-world task in the field.

- Novel machine vision approach w.r.t. the cabbage spraying domain.

- Many test cases to evaluate the effect of the task variables (e.g., lighting conditions) on the performance.

- Low computational complexity of the model and capability for onboard computation.

 

Weakness

- The paper lacks clarity in stating the main contributions and novelties. If modifying the YOLOv5 network is the main contribution, the paper significantly lacks test cases to support the claim. The proposed method should be compared either with similar detection techniques used in the cabbage detection domain or by benchmarking on known datasets used by YOLO models (e.g., the COCO dataset) to support the improvement.

- Technical discussion about model design and interpretation of the results is missing from most parts of the paper. For instance, there is no comment on the intuition for why adding Transformer modules to YOLOv5 improves performance in this task. Perhaps simply adding more convolutional layers to YOLOv5 could yield the same amount of improvement.

- It appears the idea of adding Transformer modules to the YOLO network was already presented in [1]. A comparison to that paper could further clarify the efficiency of adding the Transformers in the specific way this paper does it w.r.t. [1].

- Although it is claimed that hardware development is a contribution of the paper, it appears most of the hardware parts were already developed in previous research [2]. However, there is no reference to previous works in Section 2.4.

- It is interesting to know how adding Transformer modules also improves the computational complexity w.r.t. YOLOv5. Is it only because of using DWConv layers instead of normal convolution layers in YOLOv5, or are there other reasons?

- The presented l1 and l2 metrics are difficult for the reader to interpret as a performance measure. In particular, the way these metrics are illustrated in Figure 15 presents redundant information for all the measuring points in the field. This section appears unique to the spraying setup, and thus does not increase domain understanding for the reader.

- In terms of improving the computational complexity, there is a 4.83 ms (8%) decrease in execution time relative to YOLOv5, which does not seem to be a significant improvement. Does an 8% improvement have a significant influence on spraying system performance?

- Comments on the limitations of the proposed system are also missing from the paper.

 

Suggestions for improvement

- The flow of the contents in the paper could be significantly improved to aid the reader's understanding. Currently, the focus shifts between (i) model design, (ii) pre-processing method, and (iii) real-world experiment sections across the paper, which can be confusing for the reader.

- Both laboratory and field experiments should include a comparison to methods close to the one used in the paper. Simply adding a layer to a network and showing slight improvement is not a sufficient contribution for a journal paper.

- The metrics used for spray precision could either be replaced or plotted more intuitively for easier interpretation.

- Technical comments should be added about the intuition behind choosing the design parameters, along with discussion of the results.

 

 

 

[1] Zhang, Zixiao, Xiaoqiang Lu, Guojin Cao, Yuting Yang, Licheng Jiao, and Fang Liu. "ViT-YOLO: Transformer-based YOLO for object detection." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2799-2808. 2021.

 

[2] Zhao, Xueguan, Xiu Wang, Cuiling Li, Hao Fu, Shuo Yang, and Changyuan Zhai. "Cabbage and Weed Identification Based on Machine Learning and Target Spraying System Design." Frontiers in Plant Science (2022): 2299.

 

Author Response

Article Title: “Design and Experimental Verification of the YOLOV5N Model Implanted with a Transformer Module for Target-Oriented Spraying in Cabbage Farming”

Dear Reviewers,

On behalf of all the authors, we sincerely appreciate your valuable comments on the manuscript. Your comments not only provided constructive suggestions for improving the quality of the manuscript but also led us to consider our approaches and the design of the system in detail. These comments will also promote our future research.

Best regards,

Hao Fu, Xueguan Zhao, Changyuan Zhai *

Comment # 1: The paper lacks clarity in stating the main contributions and novelties. If modifying the YOLOv5n network is the main contribution, the paper significantly lacks test cases to support the claim. The proposed method should be compared either with similar detection techniques used in the cabbage detection domain or by benchmarking on known datasets used by YOLO models (e.g., the COCO dataset) to support the improvement.

Author response 1: Based on your comments, a performance comparison of cabbage recognition models was added beginning at line 572 of the manuscript (yellow highlight in lines 572-578). Compared with the cabbage recognition models proposed in articles [1] and [2], the recognition accuracy of the model in this manuscript is higher: at a speed of 0.7 m/s, the recognition accuracy increases by 37.91% compared with the model in article [1], and by 6.61% compared with the cabbage recognition model in article [2].

Comment # 2: Technical discussion about model design and interpretation of the results is missing from most parts of the paper. For instance, there is no comment on the intuition for why adding Transformer modules to YOLOv5n improves performance in this task. Perhaps simply adding more convolutional layers to YOLOv5n could yield the same amount of improvement.

Author response 2: Based on your comments: YOLOv5n is a one-stage target detection model that integrates the classification and positioning of cabbage into a single neural network. It uses the C3 structure, based on the cross-stage partial network (CSPNet), to extract target features, which works well for large targets without mutual occlusion. However, when cabbage and weeds occlude each other in the image, and under strong light conditions, the extracted cabbage feature information is relatively sparse. This paper proposes using a transformer module to replace the C3 structure in the original network. By adding positional encoding of the cabbage targets and adopting a multi-head attention mechanism, the feature acquisition ability for cabbage targets is improved, and the recognition accuracy of the model rises. The computational load of a single transformer module is lower than that of the C3 structure, so the image processing speed can also be improved to a certain extent. The technical discussion on model design was added in lines 146-176 of the manuscript (yellow highlight in lines 146-177), the reason the transformer module increases accuracy is discussed at line 558, and the conclusion of this manuscript is confirmed by references.
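
For illustration, here is a minimal sketch of such a block (our own illustration, assuming a standard encoder-style layout with a learnable positional encoding; the exact block used in the manuscript may differ): the feature map is flattened into a token sequence, position information is added, and multi-head self-attention aggregates the features before the map is restored.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of a transformer block standing in for a C3 block."""

    def __init__(self, channels: int, height: int, width: int, num_heads: int = 4):
        super().__init__()
        # learnable positional encoding, one vector per feature-map location
        self.pos = nn.Parameter(torch.zeros(1, height * width, channels))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 2 * channels),
            nn.SiLU(),
            nn.Linear(2 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2) + self.pos   # (B, H*W, C) tokens + position
        attn_out, _ = self.attn(tokens, tokens, tokens)    # multi-head self-attention
        tokens = self.norm1(tokens + attn_out)             # residual + norm
        tokens = self.norm2(tokens + self.mlp(tokens))     # feed-forward + residual + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # restore the feature map

# e.g. TransformerBlock(64, 18, 30)(torch.randn(1, 64, 18, 30)) keeps shape (1, 64, 18, 30)
```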

Comment # 3: It appears the idea of adding Transformer modules to the YOLO network was already presented in [1]. A comparison to that paper could further clarify the efficiency of adding the Transformers in the specific way this paper does it w.r.t. [1].

Author response 3: Reading this reference was of great help to the improvement of our research. The model in [1] is optimized by integrating a transformer on the basis of YOLOv4-P7. YOLOv4-P7 adds two output parts on top of YOLOv4; these additions improve the detection accuracy for small targets to a certain extent, but the increase in computation is very large. The YOLOv4-P7 structure is similar to YOLOv5-P7, which has higher speed and recognition accuracy than it, with a computational load of 121.4 GFLOPs. The model proposed in this manuscript uses a transformer to replace the repeatedly stacked C3 structure in the original model, with a computational load of 4.0 GFLOPs, only about 1/30 of that load. The computational efficiency of our model is therefore far better than that of the YOLOv4-P7 fusion transformer model in [1].

Comment # 4: Although it is claimed that hardware development is a contribution of the paper, it appears most of the hardware parts were already developed in previous research [2]. However, there is no reference to previous works in Section 2.4.

Author response 4: Thanks for your suggestion. In previous research, our team developed a targeted spraying system based on active light sources, in which the image processing was completed on a computer. In a field environment without an external power supply, such a system cannot work continuously for a long time, and the industrial camera it used is expensive. In view of these problems, this manuscript makes the corresponding improvements: the computer is replaced with a low-power edge computing device, and a low-cost webcam is used for image acquisition, which reduces the cost and extends the operation time of the sprayer. References to the corresponding previous works were added at line 325 of the text.

Comment # 5: It is interesting to know how adding Transformer modules also improves the computational complexity w.r.t. YOLOv5n. Is it only because of using DWConv layers instead of normal convolution layers in YOLOv5n, or are there other reasons?

Author response 5: Thank you for the suggestion. The model proposed in this manuscript uses a transformer module to replace the repeatedly connected C3 modules in YOLOv5n. The number of repetitions of the C3 module is reduced, which lowers the model's computational load to a certain extent. The reduction in computation is therefore the result of two optimizations, the other being the use of DWConv to replace traditional convolution. For example, with an input image of 480 × 288 pixels, 16 output feature maps, and a 3 × 3 convolution kernel, the computation of traditional convolution is 5.972 × 10^7 multiplications, while that of depthwise separable convolution is 3.732 × 10^6, i.e., only 6.25% of the traditional convolution. The computational load is reduced from 10.2 GFLOPs for YOLOv5n to 4.0 GFLOPs; this is the combined result of the two module replacements. If we only replace the convolution module, the computational load of the model drops to 4.2 GFLOPs, so using the transformer module instead of the C3 module accounts for a further reduction of 0.2 GFLOPs.
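
As a back-of-the-envelope check of the figures above (our own sketch, not code from the manuscript; we assume 3 input channels, which reproduces the quoted numbers exactly, and the depthwise figure counts only the per-channel stage of the separable convolution):

```python
# Multiply counts for one convolution layer on a 480 x 288 feature map.
H, W = 288, 480            # feature-map height and width
C_in, C_out, k = 3, 16, 3  # assumed input channels, output maps, kernel size

standard = H * W * k * k * C_in * C_out   # ordinary convolution
depthwise = H * W * k * k * C_in          # per-channel (depthwise) stage of DWConv

print(f"standard:  {standard:.3e}")              # 5.972e+07
print(f"depthwise: {depthwise:.3e}")             # 3.732e+06
print(f"ratio:     {depthwise / standard:.2%}")  # 6.25% (= 1 / C_out)
```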

Comment # 6: The presented l1 and l2 metrics are difficult for the reader to interpret as a performance measure. In particular, the way these metrics are illustrated in Figure 15 presents redundant information for all the measuring points in the field. This section appears unique to the spraying setup, and thus does not increase domain understanding for the reader.

Author response 6: In this manuscript, the L1 and L2 evaluation criteria were changed to the advance distance of nozzle opening and the delay distance of nozzle closing, which reflect the spraying accuracy more intuitively. At the same time, Figure 15 was changed into a scatter plot, and the mean and variance of the data were added, which shows the distribution of the data more intuitively.

Comment # 7: In terms of improving the computational complexity, there is a 4.83 ms (8%) decrease in execution time relative to YOLOv5n, which does not seem to be a significant improvement. Does an 8% improvement have a significant influence on spraying system performance?

Author response 7: The requirement for targeted spraying is to identify the cabbage targets accurately while in motion and under different lighting conditions. Our main goal was to solve the problem of low recognition accuracy caused by motion blur during operation and by varying lighting conditions. YOLOv5n is a model with excellent recognition speed, and after being optimized to solve the above problem it still retains a high processing speed. At the same time, the image processing speed cannot yet keep up with the image acquisition speed: the image acquisition time is 33 ms, and we want the image processing time to approach it. The image processing time of the original model is 55.9 ms, which is 22.9 ms longer than the image acquisition time (33 ms). After model optimization, the image processing time (51.07 ms) is 18.07 ms longer than the image acquisition time. From 22.9 ms to 18.07 ms, the time gap decreased by 21.09% because of the model improvement. (Yellow highlight in lines 560-564.)
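
For clarity, the quoted percentage follows from the gap figures: (55.9 - 33) ms = 22.90 ms before optimization, (51.07 - 33) ms = 18.07 ms after, and (22.90 - 18.07) / 22.90 = 4.83 / 22.90 ≈ 21.09%.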

The decreased time gap (4.83 ms) does indeed have some influence on spraying system performance. Thank you for your comment. We will continue to focus on making the model more lightweight and improving the image processing speed in the future.

Comment # 8: Comments on the limitations of the proposed system are also missing in the paper.

Author response 8: Based on your comments, a discussion of the system's shortcomings and of future optimization methods was added beginning at line 610. The shortcoming is that when the speed is higher than 0.7 m/s, the solenoid valve cannot close within the small gaps between cabbage plants. The reason is that the response time of the solenoid valve is longer than the time the sprayer takes to travel through the cabbage gap. According to a previous test, the measured response time of the solenoid valve is 20 ms; at a speed of 0.7 m/s, a cabbage gap of less than 1.4 cm therefore has a certain influence on the targeted spraying effect. The proposed future optimization is to reduce the response time of the solenoid valve, so as to increase the operating speed of the system.
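
The 1.4 cm figure is simply the distance traveled during the valve's response time: d = v × t = 0.7 m/s × 0.020 s = 0.014 m = 1.4 cm.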

Comment # 9: The flow of the contents in the paper could be significantly improved to aid the reader's understanding. Currently, the focus shifts between (i) model design, (ii) pre-processing method, and (iii) real-world experiment sections across the paper, which can be confusing for the reader.

Author response 9: Following the suggestion, the structure of the manuscript was modified. An analysis of the overall structure of the YOLOv5n model was added, the advantages and disadvantages of different model structures are analyzed theoretically, and the model optimization method of this manuscript is introduced, which paves the way for the subsequent model optimization content. The model design is placed in the first part of the manuscript, and the collection and production of the dataset is placed in the image pre-processing part of the second part, which makes the manuscript more readable. (See Sections 2.1 and 2.2.)

Comment # 10: Both laboratory and field experiments should include a comparison to methods close to the one used in the paper. Simply adding a layer to a network and showing slight improvement is not a sufficient contribution for a journal paper.

Author response 10: Following your suggestion, we supplemented the field experiment, mainly to compare the spraying accuracy for cabbage, under different lighting conditions, of the optimized model and the YOLOv5n model, each trained with and without the motion-blur-augmented dataset; only with high spraying accuracy are the other spraying indicators meaningful. The test site was the National Precision Agriculture Research Demonstration Base in Xiaotangshan Town, Beijing, on September 16 and 17, 2022. In the three groups of tests, the light intensities were 1.10~5.51, 8.47~11.23, and 4.23~8.66 wlx. The experimental process was as follows: the numbers of sprayed cabbages with missed and misidentified sprays were recorded manually. If the nozzle was not opened above a cabbage, it was counted as a missed spray; when the nozzle was opened over a non-target area, it was counted as a misidentification. Specifically, during the target-oriented spraying experiment, two people followed the sprayer with red and blue labels. Red labels were placed on unsprayed cabbages, and blue labels were placed on mistakenly sprayed targets. After spraying, the labels were collected, the numbers of mistakenly sprayed and unsprayed labels were counted, and the spraying accuracy was calculated in combination with the number of cabbages in the test area counted manually before the test. The test site is shown in Fig. 1.

Fig. 1. Test picture.

Figure 2 shows the recognition accuracy results of the different models in the three groups of tests. It can be seen that the hybrid transformer model proposed in this paper, trained on the motion-blur-augmented dataset, achieved higher recognition accuracy than the other models in all three groups of tests. The spraying accuracies were 98.91%, 98.84%, and 98.20%, with an average of 98.65%. In the first and third groups of experiments, with relatively low light intensity, the spraying accuracy was improved by at most 3.87%. In the second group of experiments, under strong light conditions, the spraying accuracy was 98.84%, an increase of 7.98%. Figure 2 also shows that the spraying accuracy of the cabbage recognition models trained with the motion-blur-augmented dataset was higher than that of the models trained without it. Thus, the cabbage recognition model optimization method proposed in this paper significantly improves the spraying accuracy under strong light conditions.

Fig. 2 Spraying accuracy for cabbages under different light intensities.

Note: YOLOv5T represents the optimized model proposed in this paper, and + represents a model trained on a data set augmented with motion blur.

The test showed that the spraying accuracy reached 98.91%, and the model performance was significantly improved compared with other cabbage recognition models: for example, relative to the cabbage recognition models in articles [1] and [2], the recognition accuracy improved significantly, by up to 37.91%. We added analysis and discussion of the new model's performance improvement to the manuscript. (See the discussion section, with yellow highlight on lines 552-577 and 585-596.)

Thank you for your suggestion. We believe that this manuscript will attract the extensive attention and interest of readers.

Comment # 11: The metrics used for spray precision could either be replaced or plotted more intuitively for easier interpretation.

Author response 11: In this manuscript, the L1 and L2 evaluation criteria were changed to the advance distance of nozzle opening and the delay distance of nozzle closing, which reflect the spraying accuracy more intuitively. At the same time, Figure 15 was changed into a scatter plot, and the mean and variance of the data were added, which shows the distribution of the data more intuitively.

Comment # 12: Technical comments should be added about the intuition behind choosing the design parameters, along with discussion of the results.

Author response 12: In lines 147-170 of the manuscript, a theoretical analysis of the model structure selection and optimization scheme was added, and in lines 549-572 of the discussion section, a discussion of the model optimization results was added. Relevant literature is cited to further verify that the model's performance can be improved through the optimization method of this manuscript.

  1. Zhai, C.; Fu, H.; Zheng, K.; Zheng, S.; Wu, H.; Zhao, X. Establishment and experimental verification of deep learning model for on-line recognition of field cabbage. Trans. Chin. Soc. Agric. Mach. 2022, 53, 293–303.
  2. Cruz Ulloa, C.; Krus, A.; Barrientos, A.; Cerro, J. del; Valero, C. Robotic Fertilization in Strip Cropping Using a CNN Vegetables Detection-Characterization Method. Computers and Electronics in Agriculture 2022, 193, 106684, doi:10.1016/j.compag.2022.106684.
  3. Hussain, N.; Farooque, A.; Schumann, A.; McKenzie-Gopsill, A.; Esau, T.; Abbas, F.; Acharya, B.; Zaman, Q. Design and Development of a Smart Variable Rate Sprayer Using Deep Learning. Remote Sensing 2020, 12, 4091, doi:10/gnkct6.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors presented an interesting research issue in their work. I believe the article would certainly gain in value and readability after applying the following corrections and additions:

1) For what purpose was the information about "image processing time" provided? How does this time translate into the entire analytical process? I wonder how this time was determined (lines 22-23)

2) In many places in the article, "is" appears instead of "was" - after all, the experiment has already taken place.

3) The authors described the process of taking pictures of plants with little detail and without explanation (starting from line 148). Why were the photos taken 10 times? Were these ten photos of the same plant at different stages of development? Why is the word "most of..." used in the text (line 154)? How were the plants selected for the pictures? How many plants are registered in a single photo?

4) Was the selection of 2,425 photos (line 159) statistically significant? Have they all been analyzed?

5) The description of the monitoring and control system (starting from line 154) should be more detailed. To what extent was it automated?

6) How was the sprinkler head controlled? Was it stationary and only moved with the entire device? Was the sprinkler spraying only one plant or more, as shown in Fig. 6 (part 9)?

7) Figure 8 does not show the length of the row as written in line 325.

8) Was the analysis of photos (line 331) with the use of Photoshop done manually or was it an automatic process (without human intervention)? Did it take place in the field prior to spraying or in a controlled indoor setting?

9) How did you get the "fog droplet deposition amount" described in line 361?

10) Some of the photos included in the work are currently not very clear and should be larger.

Author Response

Article Title: “Design and Experimental Verification of the YOLOV5 Model Implanted with a Transformer Module for Target-Oriented Spraying in Cabbage Farming”

Dear Reviewers,

On behalf of all the authors, we sincerely appreciate your valuable comments on the manuscript. Your comments not only provided constructive suggestions for improving the quality of the manuscript but also led us to consider our approaches and the design of the system in detail. These comments will also promote our future research.

Best regards,

Hao Fu, Xueguan Zhao, Changyuan Zhai *

Comment # 1: For what purpose was the information about "image processing time" provided? How does this time translate into the entire analytical process? I wonder how this time was determined (lines 22-23)

Author response 1: The image processing time is the time the cabbage recognition model takes to process one image, i.e., the period from the image being input to the model until the model outputs the cabbage position and diameter information. During measurement, the time at image input and the time at model output were recorded using the time() function of the Python programming language; the image processing time is the difference between the two.
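
A minimal sketch of this measurement (our own illustration, not the authors' code; `model` and `image` are hypothetical placeholders):

```python
import time

def timed_inference(model, image):
    """Run the recognition model once and report the processing time in ms."""
    t_in = time.time()                    # time recorded at image input, as described
    positions, diameters = model(image)   # model outputs position and diameter info
    t_out = time.time()                   # time recorded at model output
    # time.perf_counter() would give finer resolution, but time() is what the response describes
    return positions, diameters, (t_out - t_in) * 1000.0
```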

Comment # 2: In many places in the article, "is" appears instead of "was" - after all, the experiment has already taken place.

Author response 2: "Is" has been changed to "was" in many parts of the text.

Comment # 3: The authors described the process of taking pictures of plants with little detail and without explanation (starting from line 148). Why were the photos taken 10 times? Were these ten photos of the same plant at different stages of development? Why is the word "most of..." used in the text (line 154)? How were the plants selected for the pictures? How many plants are registered in a single photo?

Author response 3: Line 255 of the text was modified to explain that the decline in recognition accuracy is due to motion blur in the images collected in real time during machine operation. The reason for shooting 10 times was to capture different growth stages of the cabbage. For cabbages collected later in the growth period, owing to plant-spacing errors at planting and some differences in growth, a certain number of cabbage leaves overlap; hence the paper states that most of the cabbage leaves do not overlap. Due to missing seedlings and other reasons, the number of cabbages in each picture ranges from 0 to 12.

Comment # 4: Was the selection of 2,425 photos (line 159) statistically significant? Have they all been analyzed?

Author response 4: For deep learning training data, more images are generally better; however, owing to cost constraints, once a certain number of images is reached, the trained model can converge without overfitting. For example, reference [1] collected 920 pictures, reference [2] collected 592 pictures, and reference [3] collected 750 pictures.

Comment # 5: The description of the monitoring and control system (start from lines 154) should be more detailed. To what extent was it automated?

Author response 5: When the dataset acquisition system captures cabbage images, the camera is triggered manually for each shot. A description of the system's operation was added at line 245 of the text.

Comment # 6: How was the sprinkler head controlled? Was it stationary and only moved with the entire device? Was the sprinkler spraying only one plant or more, as shown in Fig. 6 (part 9)?

Author response 6: The nozzle is equipped with a solenoid valve. When the cabbage recognition model detects a cabbage, the controller combines the detection with the relative position between the nozzle and the cabbage, calculated from the encoder, and outputs a signal that opens the solenoid valve directly above the cabbage. The nozzle is statically mounted on the sprayer and moves forward with it. Following the control system's signal, only the cabbage is sprayed.
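
A simplified sketch of this control logic as we understand it (hypothetical names and parameters, not the authors' controller code): the encoder tracks travel distance, and the valve is energized early enough to offset its measured response time.

```python
VALVE_RESPONSE_S = 0.020  # measured solenoid valve response time (20 ms, per response 8 to Reviewer 1)

def valve_should_open(cabbage_x_m: float, nozzle_x_m: float,
                      speed_m_s: float, radius_m: float) -> bool:
    """Return True while the solenoid valve should be energized.

    cabbage_x_m: cabbage centre along the row (detection + encoder position)
    nozzle_x_m:  current nozzle position from the encoder
    speed_m_s:   sprayer forward speed
    radius_m:    half of the cabbage diameter reported by the model
    """
    lead_m = speed_m_s * VALVE_RESPONSE_S  # open early to offset valve lag
    return (cabbage_x_m - radius_m - lead_m) <= nozzle_x_m <= (cabbage_x_m + radius_m)

# At 0.7 m/s the lead distance is 0.7 * 0.020 = 0.014 m, i.e. the 1.4 cm
# critical gap discussed in the responses to Reviewer 1.
```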

Comment # 7: Figure 8 does not show the length of the row as written in line 325.

Author response 7: Figure 8 was modified to add the row length and ridge number information for the cabbage rows.

Comment # 8: Was the analysis of photos (line 331) with the use of Photoshop done manually or was it an automatic process (without human intervention)? Did it take place in the field prior to spraying or in a controlled indoor setting?

Author response 8: The weed segmentation was used to calculate and record the weed density during the experiment. The segmentation has no effect on the targeted spraying of the cabbage.

Comment # 9: How did you get the "fog droplet deposition amount" described in line 361?

Author response 9: Lines 405-410 of the text add the steps for processing the water-sensitive paper to obtain the droplet deposition density and deposition amount: "After the experiment, the TSN450 scanner developed by Tiancai Electronics (Shenzhen) Co., Ltd. was used to scan the water-sensitive paper sampled in the test to obtain its gray-scale image, and then the droplet deposition analysis software developed by Chongqing Liuliu Shanxia Co., Ltd. was used to analyze the scanned image to obtain the deposition density and droplet deposition amount."

Comment # 10: Some of the photos included in the work are currently not very clear and should be larger.

Author response 10: Figures 4, 15, and 17 were replaced with larger and clearer pictures.

  1. Jin, X.; Sun, Y.; Che, J.; Bagavathiannan, M.; Yu, J.; Chen, Y. A Novel Deep Learning-Based Method for Detection of Weeds in Vegetables. Pest Management Science 2022, 78, 1861–1869, doi:10.1002/ps.6804.
  2. Li, G.; Suo, R.; Zhao, G.; Gao, C.; Fu, L.; Shi, F.; Dhupia, J.; Li, R.; Cui, Y. Real-Time Detection of Kiwifruit Flower and Bud Simultaneously in Orchard Using YOLOv4 for Robotic Pollination. Computers and Electronics in Agriculture 2022, 193, 106641, doi:10.1016/j.compag.2021.106641.
  3. Ying, B.; Xu, Y.; Zhang, S.; Shi, Y.; Liu, L. Weed Detection in Images of Carrot Fields Based on Improved YOLO V4. Trait. Signal 2021, 38, 341–348, doi:10.18280/ts.380211.

 

Author Response File: Author Response.pdf
