Open Access Article

Peer-Review Record

Influence of Preprocessing Methods of Automated Milking Systems Data on Prediction of Mastitis with Machine Learning Models

AgriEngineering 2024, 6(3), 3427-3442; https://doi.org/10.3390/agriengineering6030195

by Olivier Kashongwe ¹,²,*, Tina Kabelitz ¹, Christian Ammon ¹, Lukas Minogue ¹, Markus Doherr ³, Pablo Silva Boloña ⁴, Thomas Amon ¹,³ and Barbara Amon ⁵,⁶

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Reviewer 5: Alexey Sibirev

Submission received: 15 July 2024 / Revised: 29 August 2024 / Accepted: 6 September 2024 / Published: 18 September 2024

(This article belongs to the Special Issue Implementation of Artificial Intelligence in Agriculture)

Round 1

Reviewer 1 Report (Previous Reviewer 4)

Comments and Suggestions for Authors

The topic of the paper is relevant: mastitis is a very common and costly disease in the cattle industry, and using machine learning techniques to predict its occurrence is a very promising application. The paper uses a variety of missing value processing methods and resampling techniques and explores their impact on the performance of machine learning models. This is a worthwhile issue to explore, as missing values and class imbalance are common problems in real data. The paper employs a variety of machine learning models, including logistic regression, decision trees, random forests and multilayer perceptrons, with parameter tuning through cross-validation and grid search. This is a comprehensive approach to evaluating the performance of the models. The experimental setup is reasonable, the dataset is from an actual cattle farm, and the experimental results are credible. However, there are some areas for improvement:

1. The introductory part of the paper could go into more depth on the impact of mastitis on the cattle industry and the need to use machine learning for prediction.

2. The results part of the paper could present the performance comparison of different models under different preprocessing methods more clearly, so that readers can better understand the impact of different preprocessing methods on model performance.

3. The discussion part of the paper can analyse the influence mechanism of different preprocessing methods on model performance more deeply, and put forward the direction of future research.

4. The conclusion part of the paper can more clearly summarise the impact of different preprocessing methods on model performance and put forward suggestions for practical application.

Author Response

Reviewer 1

The introductory part of the paper could go into more depth on the impact of mastitis on the cattle industry and the need to use machine learning for prediction.

This comment has been addressed; see below.

Mastitis affects both the economic viability of farms and the health of dairy cows, hence impacting the viability of the dairy industry [1]. Mastitis leads to reduced milk yield, increased veterinary costs, and higher culling rates [2]. Mastitis affects both the quantity and quality of milk produced. Infected cows produce less milk, and the milk often has altered composition, including higher somatic cell counts (SCC) and lower levels of key components like casein [3]. This not only reduces the volume of milk available but also its suitability for processing into dairy products. Effective management of mastitis involves both preventive measures and timely treatment. Strategies include maintaining good milking hygiene, using proper milking techniques, and implementing selective dry cow therapy to reduce the use of antibiotics. Early detection and treatment are crucial to minimize the impact of the disease [4].

The results part of the paper could present the performance comparison of different models under different preprocessing methods more clearly, so that readers can better understand the impact of different preprocessing methods on model performance (Table 6).

Table 6 has been reworked.

The RF models exhibited the highest performance, taking four of the top five rankings, while LR had the lowest ranking. The top-performing classifiers did not undergo resampling and were imputed with MI for RF (kappa = 0.962), LI for DT (0.811), and SI for MLP (0.781) and for LR (0.607). The best-performing resampling methods were SVMSMOTE and SMOTE, which produced fair to moderate kappa scores (>0.35). The best discriminative classifiers had lower kappa scores (0.239-0.607). MLP models recorded low prediction accuracy (kappa = 0.229-0.332) with LI, CC and MICE, both with and without resampling methods.
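For readers less familiar with the metric, the kappa values quoted above can be read as chance-corrected agreement between predicted and true labels. The following is a minimal sketch, assuming the reported metric is Cohen's kappa on binary labels (1 = mastitis, 0 = healthy); the function name and the toy labels are illustrative and are not taken from the manuscript.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: share of samples where prediction equals label.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class counts, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (a single class everywhere); returning 0.0 is a
        # convention chosen for this sketch only.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Toy, invented data: 90 healthy and 10 mastitis records.
y_true = [0] * 90 + [1] * 10
always_healthy = [0] * 100                        # accuracy 0.90, no skill
decent_model = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2  # accuracy 0.96

kappa_trivial = cohen_kappa(y_true, always_healthy)
kappa_decent = cohen_kappa(y_true, decent_model)
```

On such imbalanced toy data, a classifier that always predicts "healthy" reaches 90% accuracy but a kappa of zero, which illustrates why kappa is a useful headline metric when the positive class is rare.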

The discussion part of the paper can analyse the influence mechanism of different preprocessing methods on model performance more deeply, and put forward the direction of future research.

This comment has been addressed; see below and in the manuscript.

Hence, in-depth feature engineering with techniques such as interaction features, binning and more domain-specific transformations than only moving averages can be explored in further studies to improve ML performance with the resampling and imputation techniques presented in our study. Looking at the time series perspective as well as applying a weighted loss to classifiers such as LR may also be explored to improve performance.

The conclusion part of the paper can more clearly summarise the impact of different preprocessing methods on model performance and put forward suggestions for practical application.

The conclusion has been improved (see below and in the text).

Based on our research, the choice between missing value imputation and resampling techniques for machine learning models depends on the specific model being used. We found that complete case analysis yielded higher kappa scores than missing value imputation techniques for logistic regression (LR), while random forest (RF), decision tree (DT), and multilayer perceptron (MLP) performed better with imputation techniques. We also noticed significant variations between models and agreement between accuracy, F1 score, precision, and recall metrics with kappa. For ensemble models, resampling with Synthetic Minority Over-sampling Technique (SMOTE) or Support Vector Machine SMOTE (SVMSMOTE) improved classification performance using simple imputations or complete cases. In addition, linear interpolation (LI) and SMOTE resampling improved MLP classification, while LR performed better with complete cases and SVMSMOTE resampling. Therefore, we suggest using SVMSMOTE sampling for studies with similar class imbalance problems when using LR or ensemble models for classification, and SMOTE when using an MLP classifier. However, in cases where missing values are significant, simple imputation for ensemble models and linear interpolation for MLP will enhance classifier performance. When dealing with missing value imputation, we recommend comparing results from imputed datasets with complete cases.
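As background to the resampling recommendations above, the core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step in plain Python. It is not the implementation used in the study, and the function name, the parameter `k` and the toy points are hypothetical. SVMSMOTE follows the same interpolation idea but uses a support vector machine to choose which minority samples to interpolate from; that selection step is not shown here.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        # New point lies on the segment between base and the chosen neighbour.
        synthetic.append(tuple(b + u * (nb - b) for b, nb in zip(base, neighbour)))
    return synthetic


# Toy, invented minority-class points in a two-feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority_points, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, resampling of this kind should only be applied to the training data, never to the held-out test set.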

Reviewer 2 Report (Previous Reviewer 5)

Comments and Suggestions for Authors

Regarding the manuscript "Influence of Preprocessing Methods of Automated Milking Systems Data on the Prediction of Mastitis with Machine Learning Models" (agriengineering-3132773): compared with the first submission of this study, some parts have improved; others, however, need strong revisions. Before recommending the manuscript for publication, the authors must improve several aspects of the present study. Therefore, I am recommending this work for major revisions.

As small observations, which must be attended to, I highlight:

1 – Was the manuscript reviewed by a fluent English speaker? I suggest that authors request a review by a company.

2 – Authors must reformulate the abstract. Note that you are presenting an abstract with 220 words, and AgriEngineering limits an abstract to 200 words. I also highlight that authors must follow the premise of presenting the highlights of the results in the abstract, something that I did not observe in this summary.

3 – Unfortunately, the manuscript does not comply with AgriEngineering's standards, featuring irregular indents, a template that does not fit within the margins, and overall poor formatting. If the authors do not agree to make the changes now, the editorial board of AgriEngineering will undoubtedly require them at some point.

4 – Emphasize in the introduction why predicting mastitis is important and the impact of preprocessing methods on the performance of the machine learning model. Additionally, the authors need to expand the literature review to include more recent studies related to mastitis prediction and preprocessing techniques. Note that more than 70% of the references in your study are from before 2020. The topic of your study is closely related to Precision Livestock Farming, a new field with numerous current studies.

5 – Provide more detailed explanations of the preprocessing methods used (e.g., simple imputer, MICE, linear interpolation) and the justification for choosing them.

6 – Figure 1 needs to be more detailed regarding the methodological flowchart of the processing steps, analyses, and results.

7 – In my opinion, the results need to be better presented. The authors have barely explored the results presented.

8 – In the discussion, I noticed the absence of three lines of reasoning: "Contextualization," "Limitations," and "Future Work" for the present study. Contextualize the results within the broader literature and discuss how the findings align or differ from previous studies. Acknowledge any limitations of the study, such as potential biases or data limitations. Finally, suggest avenues for future research, such as exploring other preprocessing techniques or testing the models in different farm scenarios.

9 – The work needs a separate section for "Conclusions," which is essential! Summarize the main findings and their implications succinctly, and emphasize the practical applications of the study for dairy producers and industry stakeholders.

Comments on the Quality of English Language

Moderate editing of English language required.

Author Response

Reviewer 2

As small observations, which must be attended to, I highlight:

1 – Was the manuscript reviewed by a fluent English speaker? I suggest that authors request a review by a company.

Yes, the manuscript was reviewed by a professional scientific editing company.

The abstract has been reduced to 200 words.

A journal template has been used in the current revision. Margins and indents now conform to the journal's formatting guidelines.

4 – Emphasize in the introduction why predicting mastitis is important and the impact of preprocessing methods on the performance of the machine learning model. Additionally, the authors need to expand the literature review to include more recent studies related to mastitis prediction and preprocessing techniques. Note that more than 70% of the references in your study are from before 2020. The topic of your study is closely related to Precision Livestock Farming, a new field with numerous current studies.

We have updated the references that were older than 2020. Only key references for which no alternatives were found have been retained. More than 70% are from 2020 or newer.

5 – Provide more detailed explanations of the preprocessing methods used (e.g., simple imputer, MICE, linear interpolation) and the justification for choosing them.

Detailed explanations of the mechanics of the imputation methods and the reasons for choosing them are presented in the methodology section (lines 170-184), to complement the brief description in the introduction (lines 80-94).

6 – Figure 1 needs to be more detailed regarding the methodological flowchart of the processing steps, analyses, and results.

The description of Figure 1 has been expanded; see lines 266-270.

The figure shows how the test set was split from the original dataset, and the two lines of processing: with complete cases and with missing values. The data with missing values were imputed with SI, MI and LI, while the complete case analysis was without missing values. All the datasets were then submitted to resampling, creating 16 datasets that were split into training and validation sets to train and evaluate the models before testing the final model on the test set.
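To make the difference between two of the imputation routes in this description concrete, the sketch below contrasts simple imputation (SI) with linear interpolation (LI) on a short series with gaps. It is a minimal illustration in plain Python under stated assumptions: SI is taken to mean filling with the column mean (other statistics such as the median are equally plausible), leading and trailing gaps in LI are filled with the nearest observed value, and the function names and toy values are invented for this example rather than taken from the study's code.

```python
def simple_impute(values):
    """SI sketch: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def linear_interpolate(values):
    """LI sketch: fill None on a straight line between the nearest observed neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to impute from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:        # leading gap: copy the first observed value
            filled[i] = values[right]
        elif right is None:     # trailing gap: copy the last observed value
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Toy, invented sensor readings for consecutive milkings; None marks a missing value.
series = [4.0, None, 6.0, None, None, 9.0]
si_result = simple_impute(series)       # every gap receives the same mean value
li_result = linear_interpolate(series)  # gaps follow the local trend
```

SI ignores the ordering of the records, whereas LI uses it, which is one reason the two routes can affect time-dependent sensor data differently.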

7 – In my opinion, the results need to be better presented. The authors have barely explored the results presented.

The results section has been improved: Table 6 has been simplified, and the descriptions of the tables and figures have been updated.

8 – In the discussion, I noticed the absence of three lines of reasoning: "Contextualization," "Limitations," and "Future Work" for the present study. Contextualize the results within the broader literature and discuss how the findings align or differ from previous studies. Acknowledge any limitations of the study, such as potential biases or data limitations. Finally, suggest avenues for future research, such as exploring other preprocessing techniques or testing the models in different farm scenarios.

The aspects mentioned have been added to the discussion section (lines 320-325, lines 359-362)

The study is based on the analysis of data collected by a milking robot and hence should be understood in the context of mastitis prediction with sensor-collected data. The sensors offer the advantage of data with high time resolution, but they also bring misrecorded and missing values. Handling the latter is the purpose of the current study. Although the nature of data recording with sensors can be seen as a limitation compared to data generated under controlled conditions, it offers greater opportunities for practical application on dairy farms, especially because of the increasing use of automated milking systems.

Hence, in-depth feature engineering with techniques such as interaction features, binning and more domain-specific transformations than only moving averages can be explored in further studies to improve ML performance with the resampling and imputation techniques presented in our study. Looking at the time series dimension as well as applying a weighted loss to discriminative classifiers such as LR may also be explored to improve their performance.
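One way to read the weighted-loss suggestion above is that errors on the rare mastitis class are given a larger weight in the training loss, as an alternative to resampling. The sketch below is an illustration under assumptions, not the authors' method: it uses the common "balanced" heuristic, weight = n_samples / (n_classes × class count), and a weighted binary cross-entropy. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Weighted binary cross-entropy; probs are predicted P(y = 1)."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# Toy, invented data: 90 healthy (0) and 10 mastitis (1) records.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# A model that predicts a 10% mastitis probability for every record.
probs = [0.1] * 100
plain_loss = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted_loss = weighted_log_loss(labels, probs, weights)
```

With these weights, each class contributes equally to the loss, so a model that simply follows the base rate is penalised more than under the unweighted loss.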

9 – The work needs a separate section for "Conclusions," which is essential! Summarize the main findings and their implications succinctly, and emphasize the practical applications of the study for dairy producers and industry stakeholders.

We have added a separate section for conclusions

Reviewer 3 Report (New Reviewer)

Comments and Suggestions for Authors

This manuscript addresses an important aspect of dairy farming—predicting mastitis using machine learning (ML) models and data from automated milking systems (AMS). The focus on preprocessing methods, specifically imputation techniques and class imbalance handling, is relevant and could provide valuable insights for improving mastitis prediction models. The manuscript is well-organized especially after revision (highlighted). Thus, I would like to recommend it to be published. One question: a separate 5. Conclusion part should be provided at the end.

Comments on the Quality of English Language

Fine.

Author Response

This manuscript addresses an important aspect of dairy farming—predicting mastitis using machine learning (ML) models and data from automated milking systems (AMS). The focus on preprocessing methods, specifically imputation techniques and class imbalance handling, is relevant and could provide valuable insights for improving mastitis prediction models. The manuscript is well-organized especially after revision (highlighted). Thus, I would like to recommend it to be published. One question: a separate 5. Conclusion part should be provided at the end.
Conclusions have been placed in a separate section.

Reviewer 4 Report (New Reviewer)

Comments and Suggestions for Authors

The manuscript makes a valuable contribution to the mastitis field, with a clear logical structure and a good writing style. However, to make the article more complete, I suggest the author make the following minor revisions.

1. Line 126: Is there any correlation between daily milk yield, electric conductivity at quarter and cow levels, somatic cell counts and mastitis? Why were these indicators chosen?

2. Line 131, Can you tell us what the six features are and why they were chosen?

3. Line 183: How are the imbalance ratios calculated, and is there any basis for them?

4. In discussion, you can further explore the potential significance and application prospects of the research results to enhance the impact of the paper.

5. Please review the text carefully and correct any spelling, grammatical, or improper punctuation mistakes.

6. Although the literature review is already quite comprehensive, authors are advised to supplement relevant studies published within the last year to enrich the background information further.

Comments on the Quality of English Language

Minor editing of English language required.

Author Response

This manuscript addresses an important aspect of dairy farming—predicting mastitis using machine learning (ML) models and data from automated milking systems (AMS). The focus on preprocessing methods, specifically imputation techniques and class imbalance handling, is relevant and could provide valuable insights for improving mastitis prediction models. The manuscript is well-organized especially after revision (highlighted). Thus, I would like to recommend it to be published. One question: a separate 5. Conclusion part should be provided at the end.
Conclusions have been placed in a separate section.

Reviewer 5 Report (New Reviewer)

Comments and Suggestions for Authors

The purpose of the study, assessing the impact of resampling and imputation techniques on the performance of ML models, is important and significant for the development of modern agricultural science. Assessing three imputation and resampling methods at once demonstrates a holistic scientific approach to the problem. The lack of systematization is a significant problem for the field of ML. The article addresses the issues of choosing imputation and resampling methods, as well as their interaction with various classifiers. However, in the Abstract section, it is worth reporting, as numerical values, the best results for precision, recall, and F1 score, and comparing the models by the kappa score obtained during the study. This paper systematizes the use of resampling and imputation methods as a means of reconstructing missing data in samples used to train ML models. In the Materials and Methods section, I drew attention to the years of data collection, 2015-2017. Can they be considered relevant? Has the accuracy of data collection improved with the basic tools of milking robots? In subsection 2.5, I suggest describing in more detail why you chose these 4 classifiers (logistic regression (LR), decision tree (DT), random forest (RF) and multilayer perceptron (MLP)). The Results section is presented in an understandable form thanks to a large number of graphs and tables.
The Discussion section contains a detailed analysis of the results, but I suggest supplementing the work with a Conclusions section, including a brief summary of the results obtained in the study, to make them easier for readers to understand. The Introduction section presents a sufficient number of scientific papers that cover the topic of the study. However, the statements in lines 36-43 should be supplemented with more numerical values in the text that demonstrate the prevalence of mastitis among cows, as well as the economic losses incurred by dairy farms. Similarly, lines 64-66 do not indicate even an approximate number of "missing values, outliers and skewed values". In lines 485-498, the figures are so small that it is difficult to see the results presented.

Author Response

Reviewer 5

In the Materials and Methods section, I drew attention to the years of data collection - 2015-2017. Can they be considered relevant? Has the accuracy of data collection improved with the basic tools of milking robots?

In subsection 2.5, I suggest describing in more detail why you chose these 4 classifiers (logistic regression (LR), decision tree (DT), random forest (RF) and multilayer perceptron (MLP)).

More details on the choice of classifiers are given in lines 216-219.

The Results section is presented in an understandable form thanks to a large number of graphs and tables. The Discussion section contains a detailed analysis of the results, but I suggest supplementing the work with a Conclusions section, including a brief summary of the results obtained in the study to simplify the understanding of readers.

A conclusion section has been added to the study, lines 408-419

The Introduction section presents a sufficient number of scientific papers that cover the topic of the study. However, the statements in lines 36-43 should be supplemented with more numerical values in the text that demonstrate the prevalence of mastitis among cows, as well as the economic losses that accompany dairy farms.

The paragraph has been expanded to address these comments. See lines 35-39.

Similarly, lines 64-66 do not indicate even an approximate number of "missing values, outliers and skewed values".

Details with figures are provided in lines 70-72.

In lines 485-498 the size of the figures is so small that it is difficult to see the results presented.

The figures in the appendix have been enlarged.

Round 2

Reviewer 2 Report (Previous Reviewer 5)

Comments and Suggestions for Authors

Based on the corrections provided, I am considering this study for publication.

Comments on the Quality of English Language

Minor editing of English language required.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Although the problem tackled by the authors deserves to be investigated, I have found the research presented in the article:

i. of little relevance to the reader, as no clear conclusion or suggestion is given on how to investigate similar datasets, so the reader cannot draw up a strategy for approaching problems of this family;

ii. lacking scientific soundness in the application of ML approaches.

Regarding this second point, some major issues and problems are present:

1. No detail is given on the subdivision of data into training/validation/test sets.

2. If point 1 was performed by the authors, then only a single subdivision was studied (considering the presented results), thus the article suffers from a strong bias.

3. Not having performed multiple subdivisions, there is no statistical analysis of the results, thus no sound conclusion can be drawn (the study remains on an exploratory level, not getting to a deep enough level to get sound results).

4. No detail is given on the selection of the parameters of the investigated ML techniques: are default ones used? If so, without a parameter exploration, no sound conclusion can be drawn when comparing the algorithms' performance.

Author Response

Reviewers' comments

SN	Reviewers' comments/questions	Reviewer	Authors' response
	*General comments*
1	Lacking scientific soundness in the application of ML approaches	1	The paper addresses an important question pertaining to the preprocessing of data before submitting it to machine learning models. We show how this can impact the prediction outcomes and provide ways to handle the issue. Due to the nature of data missingness and the occurrence of mastitis on dairy farms, the problem is a crucial aspect if we want to obtain robust models for disease prediction, but it is rather unaddressed, and the solutions we provide could pave the way for a systematic way to handle the problem.
	*Methods*
8	No detail is given on the subdivision of data into training/validation/test sets.	1	The data processing workflow in Figure 1 describes the steps followed in the analysis, as is typical for this type of study. Nevertheless, we have added more details on the data subdivision into training + validation (80%, i.e., 60% + 20%) and test (20%) sets. We had already provided access to the source code used in this study (a GitHub link is provided).
9	No detail on the selection of the parameters of the investigated ML techniques is given: are default ones used? If so, without a parameter exploration, no sound conclusion can be drawn when comparing algorithms' performance.	1	Revised section 2.3. Briefly, we mostly used default parameters with a few alterations, mainly for SGD and LR, where max_iter was set to 1000; cross-validation was 5-fold, and the best C for LR was searched for each model in a range between 0.01 and 100; 1.0 was often the best fit (see supplementary file 2). For resampling methods, parameters are inserted in the text.
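The 60/20/20 subdivision mentioned in response 8 can be sketched as follows. This is a minimal illustration in plain Python, not the code from the authors' repository. It assumes the split is stratified by class, so that the rare mastitis cases appear in every subset; whether the authors stratified is not stated here. The function name, the seed and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return (train, validation, test) index lists, splitting each class separately."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)
    # Group record indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, validation, test


# Toy, invented labels: 90 healthy (0) and 10 mastitis (1) records.
labels = [0] * 90 + [1] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
```

In a workflow like the one described in the responses, imputation and resampling would be fitted on the training portion only, with the test indices set aside until the final evaluation.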

Reviewer 2 Report

Comments and Suggestions for Authors

The scientific research is good and has addressed a major problem, but the clear aspect of the research is more of a theoretical study than the practical aspect, and there is a percentage of error in it because it relied on inventory and collected data.

I hope the data can be included in the body of the paper rather than at the end.

The research is important as an idea, but it was not presented in an ideal way that clarifies the problem and presents its solution in a positive way.

I wish you continued success and excellence.

Author Response

Kindly find our responses attached and in the revised manuscript

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

No comments

Comments for author File: Comments.pdf

Author Response

We improved the referencing style throughout the document for consistency.

For this, we used the open-source Zotero reference manager, which has the MDPI reference format embedded.

Thanks

Reviewer 4 Report

Comments and Suggestions for Authors

This paper explores the impact of data preprocessing methods for automated milking systems on the prediction of mastitis in dairy cows by machine learning models. On dairy farms, machine learning models are used to help make decisions and these models need to process large amounts of regularly collected data. Automated milking systems and herd management procedures are playing an increasingly important role in collecting such data. However, I also found some areas for improvement:

1. The title of the paper is not quite consistent with the content. The title expresses "The influence of data preprocessing methods of automated milking system on machine learning models for mastitis prediction". It is suggested that the title be revised to more accurately reflect the content of the paper.

2. In the "data preprocessing methods" section of Section 2.2, the description of missing value imputation methods is not comprehensive and clear enough, for example: the simple method of taking the mean and median value; the principle of multiple imputation; and parameter selection in the linear interpolation method. The specific implementation of these methods is very important for the reader to understand the experiments in this paper.

3. In Section 2.3, "Imputation of missing Values", three imputation methods are used, but the experimental results show that only the simple imputation method and the multiple imputation method are used, and the results of the linear imputation method are not reported. Whether they are missed or the experimental results show that the linear imputation method performs poorly needs to be clarified.

4. According to "Impact of missing value imputation methods on model prediction performance", the performance of the SVMSMOTE oversampling method is poor, which is inconsistent with the statement that SVMSMOTE oversampling can improve performance. It is necessary to re-examine the experimental process and explain this result.

5. In the results table in Section 3, "Impact of missing Data Imputation Methods on Model Predictive Performance," the fifth column, "no treatment," means that missing data imputation and oversampling are not used at all.

6. On the whole, the analysis of the experimental results can be more comprehensive and in-depth, such as discussing the applicable situations of different methods, comparing their advantages and disadvantages, and testing the robustness of the results.

7. The paper writing format is standardized, and the chart drawing quality is high, but please unify the size and format of the text in the figure. It is recommended to add more descriptive text to improve the readability of the paper.

8. Chapter 5 "Conclusion" should be added.

9. At the end of the Introduction, please provide brief descriptions of Sections 2-4/5.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

None

Author Response

Reviewers comments

SN	Reviewers Comments/ questions	Reviewer
	*General comments*
2	The title of the paper is not quite consistent with the content.	4	Title revised to read Influence of preprocessing methods of AMS data on the prediction of mastitis with machine learning models
3	unify the size and format of the text in the figure. It is recommended to add more descriptive text to improve the readability of the paper.	4	addressed
	*Introduction*
7	In the "data preprocessing methods" section of Section 2.2, the description of missing value imputation methods is not comprehensive and clear enough	4	The description and theoretical background around missing value imputation has been expanded. In the introduction, lines 92-119 and in section 2.3, lines 179-202. Briefly more details about theoretical assumption of each methods and examples of their applications in the literature have been added. Secondly we have expanded on the procedure we used to impute the data used in this study
	*Methods*
10	Improve methods / research design description	2, 4	Methods description is improved see respnses 7, 8, 9
11	In the "data preprocessing methods" section of Section 2.2, the description of missing value imputation methods is not comprehensive and clear enough	4	See response 7
12	In Section 2.3, "Imputation of missing Values", three imputation methods are used, but the experimental results show that only the simple imputation method and the multiple imputation method are used, and the results of the linear imputation method are not reported	4	The three methods are reported see sections 2 and 3
	*Results*
14	I hope to include the data in the body of the research and not the end	2	This is not possible since the table is too lengthy. But we have provided an appendix attached to the paper and github link to the source code and data will be stored in a public repository after anonymization.
15	Could improve the results	2,4	The overall text of result explanation is revised by the language editing service. We have deepened the discussion (see lines 327-332; 346-352; 356-358; 374-379)
16	the performance of SVM-MOTE oversampling method is poor, which is inconsistent with the mentioned SVM-MOTE oversampling method can improve the performance. It is necessary to re-examine the experimental process and explain this result.	4	This was found to depend on the ML model tested. A clear performance improvement was found for SGD and LR
17	In the results table in Section 3, "Impact of missing Data Imputation Methods on Model Predictive Performance," the fifth column, "no treatment," means that missing data imputation and oversampling are not used at all.	4	This does not seem to be found in the text
	*Discussion*
18	On the whole, the analysis of the experimental results can be more comprehensive and in-depth, such as discussing the applicable situations of different methods, comparing their advantages and disadvantages, and testing the robustness of the results	4	We improved the in-depth description of each method tested and the discussion of their results (see lines 327-332; 346-352; 356-358; 374-379). We have included an argument on the benefits and downsides of each method when targeting specific machine learning models, and provided more direction as to which ones to use preferentially under which conditions
	*Conclusion*
19	Conclusions must be improved	2, 4, 5	We improved the conclusion (see lines 385-396) to provide in a nutshell an indication of the performance of each group of ML models based on resampling and imputation and provided at least one solution to handle the problem for each case

	*References*

Reviewer 5 Report

Comments and Suggestions for Authors

Regarding the manuscript "Influence of automated milking systems data preprocessing methods on the prediction of mastitis occurrence with machine learning models" (agriengineering-2849419), I note that this study presents different machine learning algorithms for determining mastitis in dairy cows, an innovative study in the field of animal production, resulting in a very low rate of plagiarism. However, I question the validation procedures and have my doubts about the results presented. Before I can recommend the manuscript for publication, the authors must improve several aspects of the present study. Therefore, I am recommending this work for major revisions.

As small observations, which must be attended to, I highlight:

1 – I noticed some grammatical errors in writing, therefore, I suggest the revision of English by a native speaker.

2 – Authors must reformulate the abstract. Note that you are presenting an abstract with 327 words, and AgriEngineering limits an abstract to 200 words. I also highlight that authors must follow the premise of presenting the highlights of the results in the abstract, something that I did not observe in this summary.

3 – The keywords need to be revised, the authors present 7 keywords, but 3 occur in the title, which I do not see as a correct action, keywords should not be contained in the title.

4 – Improve the quality of Figure 1, set its output to at least 600 DPI, so that when converting to PDF, the quality is not compromised.

5 – Remove the borders of Figure 2 from “a” to “d”, and improve the image quality, following the recommendations in the previous question.

6 – Same for Figure 3.

7 – Why were the Kappa index values so low? Is there any statistical question to justify such a drop in this index?

8 – Authors must insert the “Final Considerations” topic, it is essential!

9 – Why did the authors not submit an ethics committee statement?

As a minor and main note, I highlight:

1 – Use the Mendeley Reference Manager for references as well as citations, as both Water standards are not standardized in the body of every manuscript.

2 – Authors should present more references; I think 28 references is a very low number for a work of such magnitude.

Comments on the Quality of English Language

Extensive editing of English language required.

Author Response

Reviewers comments

SN	Reviewers' comments/questions	Reviewer	Response
	*General comments*
4	Extensive editing of English language required	5	The document has been submitted to an English language editing service (attach certificate)
5	Improve the quality of Figure 1, set its output to at least 600 DPI, so that when converting to PDF, the quality is not compromised	5	increased
6	Why did the authors not submit an ethics committee statement?	5	For this study, ethical approval was not necessary. No direct measurements were made on animals. Neither animal identification nor farm name is reported. We used past records of cow performance data collected with non-invasive sensors.
13	I question the validation procedures and have my doubts about the results presented.	5	Please see response 8 and 9.
	*Conclusion*
19	Conclusions must be improved	2, 4,5	We improved the conclusion (see lines 385-396) to provide in a nutshell an indication of the performance of each group of ML models based on resampling and imputation and provided at least one solution to handle the problem for each case
20	Insert final considerations	5	The last paragraph contains the final considerations (conclusion) unless it is advised to have a distinct subheading.

	*References*
20	Use Mendeley reference manager	5	We used the open-source Zotero reference manager, which has the MDPI reference format embedded
21	Authors should present more references; I think 28 references is a very low number for a work of such magnitude.	5	We increased the number of cited references to 43
	*Abstract*
22	Note that you are presenting an abstract with 327 words, and AgriEngineering limits an abstract to 200 words	5	Reduced
23	I also highlight that authors must follow the premise of presenting the highlights of the results in the abstract, something that I did not observe in this summary	5	Results are presented in the abstract
24	The keywords need to be revised, the authors present 7 keywords, but 3 occur in the title, which I do not see as a correct action, keywords should not be contained in the title.	5	The keywords are revised; 5 are presented and they do not occur in the title

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors did not address several points raised in my first review. Point ii has not been discussed, but, more importantly, no answer was given to points 2 and 3, and only a marginal answer was given to point 4.

Reviewer 4 Report

Comments and Suggestions for Authors

Accept in present form

Reviewer 5 Report

Comments and Suggestions for Authors

The authors did not highlight the changes made in the manuscript. Still, in parts I did not detect significant changes that could consider the manuscript for publication.

Therefore, I am rejecting this study.
