1. Introduction
Mastitis is the most costly and frequent disease in dairy farming, contributing considerable (about EUR 125 per cow per year) and recurring costs incurred through reduced milk quantity and quality, as well as impaired reproductive performance and longevity of cows [1,2,3]. The disease affects 20–40% of lactating cows annually, representing a potential loss of 11 to 18% of the gross margins of dairy farms. Subclinical mastitis contributes about 48% of these costs, through either subsequent milk yield reduction (72%) or the subsequent culling of infected cows (25%). The disease can be caused by a wide diversity of pathogens but is dominated by only a few species. The most important mastitis-causing bacterial species are
Streptococcus agalactiae, Streptococcus dysgalactiae, Streptococcus uberis, Staphylococcus aureus, and Escherichia coli [2,4]. The subclinical form of mastitis, without overt symptoms, is by far more prevalent than the clinical form and causes high economic losses [2,5]. At the same time, the scarcity of symptoms makes subclinical mastitis difficult to detect. Mastitis affects both the economic viability of farms and the health of dairy cows, hence impacting the viability of the dairy industry [6]. It leads to reduced milk yield, increased veterinary costs, and higher culling rates [7], and affects both the quantity and quality of the milk produced. Infected cows produce less milk, and the milk often has an altered composition, including higher somatic cell counts (SCCs) and lower levels of key components such as casein [8]. This reduces not only the volume of milk available but also its suitability for processing into dairy products. Effective mastitis management involves both preventive measures and timely treatment. Strategies include maintaining good milking hygiene, using proper milking techniques, and implementing selective dry-cow therapy to reduce antibiotic use. Early detection and treatment are crucial to minimize the impact of the disease [9]. Therefore, modern data analysis tools such as machine learning could help identify subclinical mastitis cases more frequently and accurately.
Managing mastitis is an even bigger challenge in large-scale dairy farms, despite the use of automated milking systems (AMSs), which has steadily increased in Germany and across European countries over the last few years [10]. An AMS uses sensors to collect data such as milk yield and milk components, and, through management programs, issues alerts when variations in milk production (milk yield and milk flow), electrical conductivity, somatic cell count (SCC), milk temperature, milk color, or a combination of these parameters indicate a possible mastitis occurrence [11,12]. The data collected at each milking, or through monthly milk recording programs, are routinely used to support farmers in decision-making on production, reproduction, and health. This is specifically the case for mastitis predictions based on milk yield, milk parameters (conductivity, SCC, blood in milk, temperature), and cow characteristics [11,13]. For this prediction purpose, several machine learning (ML) approaches have been used that aim to improve the monitoring of udder health status in general or of mastitis specifically, whether subclinical or clinical. Bobbo et al. [14] compared eight ML models and achieved a prediction accuracy above 75% for all of them in a binary classification, with 200,000 cells/mL SCC as the threshold for positivity. Hyde et al. [15] trained random forest models to predict mastitis infection patterns in a binary classification, where the target was contagious vs. environmental mastitis or an environmental lactation vs. an environmental dry period, and obtained 98% prediction accuracy. Post et al. [16] applied ML models to a group of animals with historical disease records and achieved higher prediction accuracy than when the model was applied to the whole population. Findings from other studies using similar methodologies also showed high prediction accuracy [16,17]. Despite these encouraging results, the application of mastitis-prediction ML models under real-life conditions remains limited because of a discrepancy between the performance on the training and validation datasets and on actual farm data. The nature of data recording by sensor systems and the low occurrence of the disease appear to be the major reasons for this difference [18].
Indeed, farm sensor data present, in general, two types of challenges that make them difficult to handle for prediction algorithms. First, the data are often noisy, with missing values, outliers, and skewed values accounting for about 30% of data loss prior to analysis. These occur because of sensor failure during signal transmission or the interruption of the milking process, e.g., by the detachment of a teat cup [19]. The missing or wrongly recorded values fall into the categories of either missing completely at random or missing at random. Missing values, together with the lack of a clear definition of positive cases, represent a major hindrance for ML algorithms trained on ‘experimental datasets’ to be used under farm conditions [20]. To handle the problem, it is common in practice to delete missing values completely or at least to apply methods such as listwise deletion; less common, however, is the reporting of the magnitude of missing values or the use of dedicated missing-data handling methods [21]. Although working without missing values is convenient, it only produces reliable estimates in the limited situation where values are missing completely at random and only for the dependent variable. In other situations, it results in severely biased estimates, not to mention the potential waste of information in the omitted data and the low practical applicability of the obtained results [22]. This is of particular importance in disease prediction, where metrics obtained from the training datasets need to transfer to real-life situations [20]. Indeed, large amounts of missing values are very common in sensor-generated datasets [21].
Various techniques have been developed to deal with the challenge of missing values in large datasets. The common imputation methods are simple imputation, multiple imputation, and linear interpolation. The simple imputation method replaces missing values with mean, median, or mode values [23]. This method is widely applied because of its computational convenience, although, in many cases, the results and conclusions are neither sensible nor generalizable [24]. Multiple imputation uses the MICE (multiple imputation with chained equations) algorithm, a Markov chain Monte Carlo method that imputes incomplete data in a variable-by-variable way, starting with a random draw from the observed data. For instance, the first variable with missing values is regressed on all other variables, using only the rows with observations for the variable of interest. Then, its missing values are replaced by simulated draws from its posterior predictive distribution. The process is repeated for all other variables with missing values in turn; this is called a cycle. Several cycles are run to generate a single imputed dataset, and the whole process is repeated three to five times to obtain stable results [25]. Although recognized for its robustness, the method suffers from the limitation of lacking a theoretical rationale [23,25,26]. Linear interpolation estimates the value of a missing data point based on the two adjacent data points in a one-dimensional data sequence [27]. It is reputed to perform well on time-dependent data and on datasets with a small to moderate number of missing values between adjacent points [26].
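To make the contrast between these imputation strategies concrete, the following minimal NumPy sketch implements simple (mean) imputation and linear interpolation on a toy one-dimensional sequence; MICE is omitted for brevity, and the function names and data are illustrative only, not the pipeline used in this study.

```python
import numpy as np

def simple_impute(x):
    """Replace NaNs with the column mean (simple imputation)."""
    x = x.copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

def linear_interpolate(x):
    """Estimate each missing point from its two adjacent observed
    neighbours in a one-dimensional sequence (linear interpolation)."""
    x = x.copy()
    idx = np.arange(len(x))
    obs = ~np.isnan(x)
    x[~obs] = np.interp(idx[~obs], idx[obs], x[obs])
    return x

# toy milk-yield-like sequence with two gaps
y = np.array([10.0, np.nan, 12.0, 13.0, np.nan, 15.0])
print(simple_impute(y))       # both NaNs -> 12.5 (column mean)
print(linear_interpolate(y))  # NaNs -> 11.0 and 14.0 (between neighbours)
```

Note how simple imputation fills every gap with the same value, whereas interpolation respects the local trend, which is why the latter is preferred for time-dependent data with short gaps.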
The second major hurdle when training models on AMS data to predict mastitis is the class imbalance between positive and negative cases [18]. Although frequently observed on dairy farms, mastitis is a rare occurrence when the data resolution is increased to a daily basis, the animal level, or both. This imbalance biases standard learning classifiers, reflected in their inability to correctly predict the minority class despite sometimes achieving high prediction accuracy [28]. Johnson and Khoshgoftaar [29] noted that the total number of minority-class samples matters more than the percentage of imbalance. Various methods to handle class imbalance and improve disease prediction are reported in the literature [18,28]. Johnson and Khoshgoftaar [29] categorized them into three groups: data-level methods, algorithm-level methods, and hybrid methods. Data-level methods change the dataset structure by reducing the majority class (undersampling), increasing the minority class (oversampling), or both, to achieve a more balanced class distribution [24]. Among the popular resampling techniques is the Synthetic Minority Oversampling Technique (SMOTE), which produces synthetic samples by interpolating minority samples with their k-nearest neighbors. The algorithm can be improved by focusing on minority samples lying along the borderline, thereby expanding the minority-class area towards the side of the majority class where only few majority instances are found [30]. However, oversampling techniques may lead to overfitting. Random undersampling is among the first undersampling techniques developed and works by discarding random samples from the majority class. It has been refined by several techniques that use nearest neighbors to reduce instances in the majority class: Edited Nearest Neighbors (ENN) tests every instance against the rest of the samples using k-NN and removes incorrectly classified samples. Undersampling methods have the disadvantage of discarding potentially useful information. Techniques combining oversampling and undersampling have been developed to overcome the limitations of the individual methods; SMOTE-ENN, for instance, combines SMOTE oversampling with ENN undersampling [28].
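The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is an illustrative toy version, with invented names and data, not the reference implementation found in libraries such as imbalanced-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(minority, n_new, k=2):
    """Generate synthetic minority samples by interpolating each chosen
    sample with one of its k nearest minority neighbours (SMOTE idea)."""
    synthetic = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        # distances to all minority samples; skip the sample itself (index 0)
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        z = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(x + gap * (z - x))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]])
new = smote_like(minority, n_new=4)
print(new.shape)  # (4, 2)
```

Each synthetic row lies on a line segment between two real minority points, which is also why plain SMOTE can blur the class border and why borderline-aware variants were proposed.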
Studies on mastitis prediction with machine learning classifiers use the above-mentioned data (pre)processing techniques almost interchangeably, making the comparison and evaluation of effectiveness across studies complex. Hence, starting from a complete-case analysis, in which all missing values are removed from the dataset prior to analysis, we first evaluated whether imputation techniques at three levels of complexity, namely simple imputation, multiple imputation with chained equations, and linear interpolation, would improve the performance of ML classifiers. Second, we evaluated the improvement in ML classifiers’ performance when class imbalance was handled by resampling techniques of varying complexity, namely SMOTE, SMOTEEN and SVMSMOTE. Third, we compared performance metrics across models combining imputation and resampling techniques in order to assess the individual models’ suitability and robustness for mastitis prediction using data collected through automated milking systems. We used several metrics, including accuracy, F1 score, precision, recall and kappa score.
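For reference, all the evaluation metrics listed above can be derived from the binary confusion matrix. The following self-contained sketch (the function name is ours, not the study's code) computes them, including Cohen's kappa as observed agreement corrected for chance agreement.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and Cohen's kappa from binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    n = len(y_true)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # chance agreement for Cohen's kappa (product of marginal frequencies)
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# imbalanced toy example: 3 positives among 10 samples
m = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                           [1, 0, 1, 0, 0, 0, 0, 1, 0, 0])
print(m)  # precision = recall = 2/3, accuracy = 0.8
```

The toy example illustrates why kappa is reported alongside accuracy for imbalanced data: accuracy is 0.8, yet kappa is only about 0.52 once chance agreement on the abundant negative class is discounted.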
4. Discussion
This study demonstrated the influence of resampling and imputation techniques on the prediction performance of three types of machine learning models trained to detect mastitis incidence from automated milking system data. Features included quarter- and cow-level conductivity, milk yield and in-line somatic cell count, together with their seven-day moving averages and standard deviations, from a conventional dairy farm in Germany. The study is based on the analysis of data collected by a milking robot and should hence be understood in the context of mastitis prediction with sensor-collected data. Sensors offer the advantage of data with high temporal resolution, but they bring along issues of misrecording and missing values; handling the latter is the purpose of the current study. Although the nature of data recording with sensors can be seen as a limitation compared to data generated under controlled conditions, it offers greater opportunities for practical application on dairy farms, especially given the increasing use of automated milking systems. Three types of classifiers were evaluated: a classical discriminative classifier (LR), ensemble classifiers (DT and RF), and a neural network-based classifier (MLP). We considered the data with complete cases (without missing values) as the control or ground-truth dataset to which the imputation methods were compared for each classifier, and the dataset without resampling as the baseline for assessing the classifiers’ performance with the resampling techniques (namely SMOTE, SMOTEEN and SVMSMOTE). Additionally, within each imputation case (CC, SI, MI and LI), we evaluated the performance improvement contributed by both imputation and resampling methods.
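The seven-day moving averages and standard deviations used as features can be derived with a standard rolling-window computation; the following pandas sketch uses invented column names and toy values purely for illustration, not the study's actual data.

```python
import pandas as pd

# Hypothetical AMS records: one row per cow-day; column names are illustrative.
df = pd.DataFrame({
    "cow_id": [1] * 10,
    "milk_yield": [30.1, 29.8, 31.0, 30.5, 28.9, 29.5, 30.2, 27.5, 26.8, 27.0],
})

# seven-day moving average and standard deviation per cow
roll = df.groupby("cow_id")["milk_yield"].rolling(window=7, min_periods=7)
df["yield_ma7"] = roll.mean().reset_index(level=0, drop=True)
df["yield_sd7"] = roll.std().reset_index(level=0, drop=True)
print(df.tail(3))
```

Grouping by cow before rolling keeps the window from leaking across animals; with `min_periods=7`, the first six days of each cow produce NaN features, which is one way such pipelines themselves generate missing values.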
We found that, on average, the recall scores of the DT, RF and MLP classifiers under CC analysis were lower than those obtained with the imputation techniques, whereas their precision scores were higher only for RF and DT. The CC analysis was considered in this study as the reference for the datasets with imputation methods. Mukaka et al. [35] argue along the same lines, stating that CC analysis can generate unbiased estimates for binary outcomes while achieving high statistical coverage, and therefore recommend using CC analysis to complement imputation techniques.
The analysis of model-specific performance revealed that the ensemble models had the highest performance metrics, with the smallest difference between precision and recall, regardless of the imputation technique used. In contrast, the discriminative model (LR) had the largest difference between precision and recall scores. Bobbo et al. [14], comparing machine learning models for the prediction of udder health status, also reported better performance for ensemble and neural network-based models than for linear models. In our study, for instance, the RF classifier performed best with MI (kappa = 0.96), with CC without resampling (kappa = 0.83), and with SMOTEEN resampling without imputation (kappa = 0.811). The best two DT models were those with LI and SI without resampling (kappa = 0.81 and 0.79, respectively). This trend was confirmed by the ROC curves for RF and DT, which showed a higher TPR (>90%) and lower FPR (<10%) for RF with imputation methods compared to CC (Supplement). Findings by Tiwaskar et al. [36], who tested RF models at various levels of missing values, confirm this improvement in machine learning models’ performance with imputation techniques. Performance improvement was observed not only for ensemble models but also for the others. Simple imputation improved the performance of LR (kappa = 0.61 vs. 0.24) compared to CC without resampling. Overall, the LR models had higher recall and lower precision scores for CC than for the imputation techniques, leading to lower kappa scores than the ensemble models. The ROC curves, showing higher or similar performance for CC than for the imputation techniques regardless of the resampling technique, confirm this (Supplement). Mukaka et al. [35] also found better results for CC analysis than for imputation techniques for binary outcomes with LR. Other authors [29,30] found the opposite and suggested that imputation was better than CC analysis. On the one hand, this can be explained by the fact that imputation techniques, especially MI, increase the variability in the outcome values, which inflates the standard error of the effect-size estimate, probably because of the random component added to the missing outcome values [24,25,35]. On the other hand, the difference can also be attributed to the mechanisms behind the occurrence of missing values, of which [35] provided an in-depth analysis and suggested a thorough examination before deciding on the imputation method to apply. Hence, in-depth feature engineering with techniques such as interaction features, binning and more domain-specific transformations than moving averages alone could be explored in further studies to improve ML performance with the resampling and imputation techniques presented here. Exploiting the time-series dimension, as well as applying a weighted loss to discriminative classifiers such as LR, may also be explored to improve their performance.
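As one concrete form of the weighted-loss idea for discriminative classifiers, the following NumPy sketch trains a logistic regression whose loss up-weights the rare positive class by inverse class frequency. It is a toy gradient-descent implementation under our own assumptions (invented data and names), not the model fitted in this study.

```python
import numpy as np

def weighted_logreg(X, y, n_iter=2000, lr=0.1):
    """Logistic regression with a class-weighted loss: errors on the rare
    positive class are up-weighted by inverse class frequency, one way to
    address class imbalance at the algorithm level."""
    n, d = X.shape
    # inverse-frequency weights, as in common 'balanced' weighting schemes
    w_pos = n / (2 * y.sum())
    w_neg = n / (2 * (n - y.sum()))
    sample_w = np.where(y == 1, w_pos, w_neg)
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))           # predicted probabilities
        grad = X.T @ (sample_w * (p - y)) / n     # weighted log-loss gradient
        beta -= lr * grad
    return beta

# toy imbalanced data: 90 negatives around 0, 10 positives around 2
rng = np.random.default_rng(1)
X = np.c_[np.ones(100), np.r_[rng.normal(0, 1, 90), rng.normal(2, 1, 10)]]
y = np.r_[np.zeros(90), np.ones(10)]
beta = weighted_logreg(X, y)
pred = (1 / (1 + np.exp(-X @ beta)) > 0.5).astype(int)
print("recall on minority class:", pred[y == 1].mean())
```

Compared to an unweighted fit, the weighting shifts the decision boundary toward the majority class, trading some precision for recall on the rare positive cases.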
The comparison among the imputation methods showed that LI performed either similarly to or worse than SI and MI for all models. According to [19], this could be due to the lack of data segregation before applying linear interpolation to the datasets. They noted that LI estimates the value of a missing data point from the two adjacent points in a one-dimensional sequence. Hence, for datasets containing many consecutive missing data points, such as the AMS data used in the current study, the performance of LI is likely to suffer. The LI method relies mainly on time-dependent missing-value imputation rather than on the inter-attribute correlations employed by other imputation techniques [35]. For this reason, LI is particularly efficient for time series and has been reported to improve the performance of neural network-based classifiers in other studies [36,37]. This is less of an issue for ensemble models, which work by segregating the data into packets of similar observations small enough to identify their inherent patterns in the terminal nodes. For example, decision trees have two kinds of nodes: each internal node represents a question on the features that branches out according to the answers found, and each leaf node is assigned a class label by a majority vote of the training examples reaching it. The tree is thus split until the questions are exhausted [38]. Therefore, these intrinsic characteristics of the ensemble models and of LI explain why LI brought no significant performance improvement compared to the other imputation techniques or to complete cases. Following the approach suggested by [27], it could be beneficial to segregate the data before applying LI for better results. Such data segregation may not be necessary for ensemble models, which are reputed to be robust enough to yield good performance with SI and MI, and sometimes even without imputing missing values, as explained above [39]. Indeed, four of the top ten models in this study were RF or DT models without missing-value imputation.
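The node-splitting mechanism described above can be illustrated with a minimal impurity-based split search. The following NumPy sketch (toy data and names of our choosing) finds the single-feature threshold an internal node would ask about, which is the step that lets trees segregate data into homogeneous packets.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary label vector."""
    p = np.bincount(y, minlength=2) / len(y)
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    """Find the threshold on one feature that minimises the weighted Gini
    impurity of the two child nodes -- the 'question' at an internal node."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# toy one-feature data: conductivity-like values, 1 = mastitis-positive
x = np.array([4.0, 4.2, 4.5, 5.8, 6.0, 6.3])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (4.5, 0.0): both child nodes are pure
```

A full tree repeats this search recursively in each child node; samples with a missing feature value are simply never routed by that question, which hints at why tree ensembles tolerate missing data better than distance-based methods.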
The performance of the ML models on the resampled datasets showed behavior similar to that observed with missing-value imputation. The RF models had the highest metrics, followed by DT, MLP and LR. Resampling improved the recall scores of all classifiers. The SVMSMOTE yielded the best recall score for all models and the best or comparable precision scores for DT, RF and MLP. Nithya et al. [40] similarly suggested that integrating ensemble models with SVMSMOTE allows for a more effective handling of imbalanced datasets. SMOTE was slightly better than the other resampling techniques for LR, which is reported to perform better on more balanced datasets [41,42]. The evaluation of model fit revealed that both resampling and missing-value imputation are relevant to explaining the performance of most of the tested ML models.
The MLP model performed moderately well with imputed data, even without resampling. This finding aligns with reports that imputation improves the performance of neural network-based models and could hence be applied to these ML models without resampling [36,37]. The same behavior was observed for SI and MI data fitted to the DT and RF models, resulting in better performance than CC. For LR, SVMSMOTE performed better without imputation than with imputed missing values. This improvement suggests that reducing the imbalance between the majority and minority classes is beneficial for the performance of these classifiers [41,43,44]. Indeed, some studies have applied resampling methods without imputation and obtained satisfactory prediction performance. Random forest, DT and, to some extent, MLP produced models with good performance without resampling or missing-value imputation [37,39,45]. These models are reported to have robust intrinsic mechanisms for handling imbalance and missing values and are sometimes themselves used for data preprocessing and prediction [46,47]. However, in the case of AMS data for mastitis prediction, it always seems reasonable to compare the results obtained with resampling/imputation techniques against those obtained without them, to assess the extent of the performance improvement. In this context, the data without resampling or imputation may serve as the ground truth for evaluating the preprocessing techniques.