Article

Leveraging IGOOSE-XGBoost for the Early Detection of Subclinical Mastitis in Dairy Cows

College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8763; https://doi.org/10.3390/app15158763
Submission received: 24 October 2024 / Revised: 30 November 2024 / Accepted: 2 January 2025 / Published: 7 August 2025

Abstract

Subclinical mastitis in dairy cows poses a significant challenge to the dairy industry, leading to reduced milk yield, altered milk composition, compromised animal health, and substantial economic losses for dairy farmers. To address these problems, this work presents a model for predicting subclinical mastitis based on the XGBoost algorithm optimized with an Improved GOOSE Optimization Algorithm (IGOOSE). The model was built from the Dairy Herd Improvement (DHI) records of 4154 cows; after extensive data cleaning and preprocessing, the final dataset comprised 3232 samples with 21 features. To overcome the shortcomings of the original GOOSE algorithm in intricate, high-dimensional problem spaces, three significant enhancements were made. First, an elite opposition-based strategy was implemented to improve population initialization, enhancing the algorithm’s balance between global exploration and local exploitation. Second, an adaptive nonlinear control factor was added to increase the algorithm’s stability and convergence speed. Lastly, a golden sine strategy was adopted to reduce the risk of premature convergence to suboptimal solutions. Experimental results show that the IGOOSE-XGBoost model outperforms other models in predicting subclinical mastitis, particularly in predicting somatic cell scores, an important marker of the disease. This study provides a strong predictive framework for managing the health of dairy cows, allowing for the prompt identification and treatment of subclinical mastitis and thereby enhancing the efficiency and quality of the milk supply.

1. Introduction

1.1. Background

With the rapid advancement of contemporary animal husbandry, dairy farming has become one of the most important industries in the world. High-quality milk is of great significance to both human health and the economy of the dairy industry. In dairy farming, cows can develop various health conditions due to factors such as environmental conditions, feed quality, and exposure to pathogens. Common health issues include mastitis, pneumonia, and hoof inflammation [1]. One of the biggest issues facing dairy producers is mastitis, which not only reduces milk production and alters milk composition, but also compromises the health of dairy cows and causes huge economic losses [2]. In India alone, mastitis causes an economic loss of about 237 million rupees per year, of which subclinical mastitis accounts for about 70% [3]. According to Tekel [4], on average, 50% of cows in herds without a mastitis health control program have subclinical mastitis. Clinical mastitis is thought to be responsible for 20–30% of the economic losses related to mastitis, with subclinical mastitis accounting for the remaining portion. In 90–95% of cases, despite the normal appearance of both the udder and milk, elevated somatic cell count (SCC) is accompanied by changes in milk composition and reduced yield. Chronic mastitis has also been found to slow down normal calf development [5]. Therefore, early detection and treatment of mastitis in dairy cows is important for maintaining cow health, ensuring dairy product quality, improving production efficiency and animal welfare, and reducing financial losses for farms.
Mammadova and Keskin [6] employed support vector machine technology to identify dairy cows with subclinical and clinical mastitis according to their somatic cell counts, classifying the cows as healthy or infected. Somatic cell count (SCC) is a critical tool for diagnosing subclinical mastitis in dairy cows, yet its accuracy can be influenced by various non-infectious factors, leading to false-positive outcomes. These factors encompass the stage of lactation, environmental stress, milk yield levels, and genetic predispositions, which can cause an abnormal increase in SCC and thus generate false-positive results [7]. Consequently, when interpreting SCC data, it is imperative to consider these non-infectious factors comprehensively to prevent misdiagnosis and unnecessary treatment. To predict the presence of subclinical mastitis in Italian Mediterranean water buffaloes, Bobbo et al. [8] applied machine learning (ML) analysis, building prediction models with four different algorithms: neural networks, random forests, support vector machines, and generalized linear models. Based on multiple evaluation criteria, the neural network gave the best predictions on the test set, establishing that machine learning techniques hold promise for enhancing the monitoring and prevention of subclinical mastitis. The advent of these emerging methods offers fresh approaches to the early identification of mastitis in dairy cows and is anticipated to become a standard detection technique in dairy farming in the future. Wang et al. [9] presented an improved moth-flame optimization algorithm (IMFO) and tested it on the UCI datasets and a dairy cow subclinical mastitis dataset. Comparative analysis revealed IMFO’s superior feature selection performance, which enables more efficient and accurate detection of subclinical mastitis in large dairy herds. Satoła and Satoła [10] compared the accuracy of single machine learning algorithms, such as logistic regression, support vector machines, Gaussian naive Bayes, k-nearest neighbors, and decision trees, with ensemble learning techniques, such as bagging, boosting, stacking, and super-learner models. The results of the study confirmed the beneficial effect of the ensemble ML methods.
Existing machine learning models have significantly advanced the ability to forecast the likelihood of subclinical mastitis in dairy cows; however, they still exhibit limitations. Predominantly, both domestic and international prediction models for animal diseases rely on classical machine learning methods such as support vector machines and decision trees. These methods allow the analysis and forecasting of disease probabilities, often emphasizing comparative evaluations across multiple machine learning algorithms. Currently, no single model is universally accepted as the most effective or accurate, leaving ample room for experimentation and innovation. Of particular note are the hyperparameters of machine learning models, which typically require manual configuration prior to training to regulate model learning and complexity [11]. This parameterization significantly impacts model performance and generalizability. Traditionally, grid search (GS) has been employed as a standard method to determine optimal hyperparameters [12]. However, GS becomes computationally expensive when dealing with extensive parameter spaces. Consequently, swarm intelligence optimization algorithms are increasingly being adopted to navigate parameter spaces and identify optimal parameter combinations, thereby enhancing model performance and generalization capabilities [13]. In the realm of feature selection, these algorithms prove beneficial by identifying the most pertinent feature subsets [14]. This capability is particularly valuable for managing high-dimensional data and extensive feature sets, ultimately improving model efficiency and accuracy.
In this study, our principal objective was to construct a predictive model for subclinical mastitis in dairy cows, integrating an enhanced GOOSE optimization approach with the XGBoost algorithm. We hypothesized that the IGOOSE-XGBoost model would provide superior predictive accuracy compared to traditional machine learning approaches for both disease classification and somatic cell score prediction.

1.2. Methodological Approach

The research implements three distinct strategies to enhance the GOOSE optimization algorithm [15], yielding the Improved GOOSE Optimization Algorithm (IGOOSE). Comparative simulation experiments were conducted between IGOOSE and other algorithms, including DBO [16], HBA [17], GWO [18], HHO [19], and the original GOOSE, using the CEC2022 benchmark functions [20], leading to the establishment of the IGOOSE-XGBoost prediction model for bovine subclinical mastitis. The main innovation of this study is an XGBoost-based predictive model for subclinical mastitis in dairy cows whose parameters and feature subset are optimized by the improved GOOSE algorithm, yielding the IGOOSE-XGBoost prediction model. Qualitative and quantitative evaluation results demonstrate the significant contribution of the IGOOSE-XGBoost model in diagnosing and predicting subclinical mastitis in dairy cows.

1.3. Paper Structure

The paper is structured as follows: Section 2 outlines the XGBoost model and the GOOSE algorithm; Section 3 introduces the development and technical details of the Improved GOOSE Optimization Algorithm; Section 4 evaluates IGOOSE performance using the CEC2022 benchmark functions; Section 5 details the application of IGOOSE-XGBoost to subclinical mastitis prediction, including classification models for disease status and regression models for somatic cell score; Section 6 discusses the research findings and their implications. The overall research workflow is illustrated in Figure 1.

2. Relevant Methods and Theories

2.1. XGBoost Model

An ensemble machine learning approach based on decision trees, called XGBoost (eXtreme Gradient Boosting), has gained significant recognition for its exceptional performance in both classification and regression tasks [21]. For classification, it effectively handles binary and multi-class problems through optimized tree structures and node splitting strategies [22]. The algorithm’s robust performance against noise and missing values, coupled with its high computational efficiency, has led to successful applications across various domains.
In regression tasks, XGBoost efficiently manages complex relationships in high-dimensional data [23], achieving notable success through its gradient boosting technology. The algorithm’s extensive parameter set significantly influences model performance and training efficiency, making intelligent optimization algorithms essential for parameter tuning and feature selection. This optimization approach not only enhances model performance and generalization capabilities, but also improves the interpretability and practical applicability of machine learning models.
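To make the role of XGBoost in this framework concrete, the following is a minimal sketch of training an XGBoost classifier and regressor on tabular data with the scikit-learn-style API. The synthetic feature matrix, toy targets, and the hyperparameter values shown are placeholders rather than the configuration used in this study.

```python
# Minimal sketch of XGBoost for classification and regression on tabular data.
# The synthetic data and hyperparameters are illustrative, not the study's settings.
import numpy as np
from xgboost import XGBClassifier, XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                              # 500 samples, 6 numeric features
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)                # toy binary label (diseased vs. healthy)
y_reg = X @ rng.normal(size=6) + rng.normal(0, 0.1, 500)   # toy continuous target (e.g., SCS)

X_tr, X_te, yc_tr, yc_te, yr_tr, yr_te = train_test_split(
    X, y_cls, y_reg, test_size=0.2, random_state=0)

clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric="logloss")
clf.fit(X_tr, yc_tr)
print("classification accuracy:", clf.score(X_te, yc_te))

reg = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
reg.fit(X_tr, yr_tr)
print("regression R^2:", reg.score(X_te, yr_te))
```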

2.2. GOOSE Optimization Algorithm

The GOOSE algorithm, an intelligent optimization technique inspired by the behavior of geese during rest and foraging, was proposed in 2024 by Rebwar Khalid Hamad [15]. The algorithm draws inspiration from the unique behavioral patterns of geese, particularly their resting habits, whereby a designated guardian goose maintains vigilance by holding a stone with one leg. This natural behavior is translated into algorithmic terms through specific mathematical formulations for position updates and search strategies.
The algorithm operates in two distinct phases determined by a random variable (rnd). When rnd ≥ 0.5, the development phase activates, utilizing Equations (1) and (2) to update goose positions:
$X_{it+1} = F\_F\_S + D\_G_{it} \times T\_A^{2}$    (1)
$X_{it+1} = F\_F\_S + D\_G_{it} \times T\_A^{2} \times Coe$    (2)
These equations incorporate parameters such as F_F_S (falling stone velocity), D_G_it (guardian–individual distance), and T_A (sound propagation time), with the coefficient Coe maintaining 0.17 as a threshold value. Conversely, when rnd < 0.5, Equation (3) governs the exploration phase:
$X_{it+1} = randn(1, \dim) \times (M\_T \times alpha) + Best\_pos$    (3)
To better illustrate the working mechanism of the GOOSE algorithm, a simple numerical example is presented below. Consider the optimization problem of finding the minimum of the quadratic function:
$f(x) = x^{2} - 4x + 4, \quad x \in [-2, 6]$
The implementation process begins with the initialization phase, in which a population of three geese (n = 3) is randomly distributed in the search area. The initial positions are assigned as x1 = −1, x2 = 2, and x3 = 5, with corresponding fitness values of f(−1) = 9, f(2) = 0, and f(5) = 9, respectively. Given these values, the position x2 = 2 serves as the guardian goose position, as it yields the optimal fitness value. During the iterative process, when rnd ≥ 0.5, the development phase is activated, and Equations (1) and (2) are applied with the following specific parameters: F_F_S = 0.17, D_G,1 = |2 − (−1)| = 3, D_G,3 = |2 − 5| = 3, and T_A = 1. These parameters determine how the non-guardian geese update their positions relative to the guardian goose. Conversely, when rnd < 0.5, the exploration phase of the algorithm begins, utilizing Equation (3) with Best_pos = 2 as the reference point and allowing geese to explore new regions of the search space.
Through successive iterations, the geese progressively approach the optimal solution. The algorithm successfully identifies the global minimum at x = 2, where f(2) = 0, thereby validating the effectiveness of the GOOSE algorithm’s dual-phase optimization strategy. This example demonstrates how the algorithm combines exploitation (the development phase) and exploration to find the best solution and effectively navigate the search space.
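The sketch below reproduces the spirit of this worked example in Python: a tiny population alternates between an exploitation step toward the current best position and a random exploration step on f(x) = x² − 4x + 4. The update rules are simplified stand-ins for Equations (1)–(3), so the sketch illustrates the dual-phase idea rather than the exact GOOSE formulas.

```python
# Simplified sketch of the dual-phase (exploitation/exploration) idea on f(x) = x^2 - 4x + 4.
# The update rules are illustrative stand-ins, not the exact GOOSE Equations (1)-(3).
import numpy as np

def f(x):
    return x**2 - 4*x + 4

rng = np.random.default_rng(1)
lb, ub = -2.0, 6.0
pos = np.array([-1.0, 2.0, 5.0])           # initial geese positions from the worked example
best = pos[np.argmin(f(pos))]              # guardian goose = current best position

for it in range(100):
    if rng.random() >= 0.5:                # "development" phase: contract toward the guardian
        pos = best + 0.17 * (best - pos) * rng.random(pos.shape)
    else:                                  # exploration phase: random perturbation around the best
        pos = best + rng.standard_normal(pos.shape) * (1 - it / 100)
    pos = np.clip(pos, lb, ub)
    cand = pos[np.argmin(f(pos))]
    if f(cand) < f(best):
        best = cand

print(f"best x = {best:.3f}, f(best) = {f(best):.4f}")   # converges to x = 2, f = 0
```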
While the emergence of new algorithms continually enriches the field of optimization, the “no free lunch” theorem serves as a reminder that no single algorithm can tackle every optimization problem optimally. This understanding drives the ongoing refinement of the GOOSE algorithm to better address real-world complexities.

3. Improved GOOSE Optimization Algorithm (IGOOSE)

To mitigate the algorithm’s limitations in handling high-dimensional complex problems, this study introduces several enhancements. An elite reverse strategy is incorporated to enhance population initialization, thereby augmenting both global exploration and local development capabilities. Additionally, to improve algorithm stability and speed up convergence, an adaptive nonlinear control element is added. Furthermore, a golden sine strategy is implemented to mitigate the risk of converging to local optima. These improvements constitute the IGOOSE algorithm.

3.1. Elite Opposition-Based Learning Strategy (EOBL)

The success of swarm intelligence algorithms depends heavily on population initialization. If the population is initialized purely at random, it may become overly concentrated in a particular region of the search space, producing an unbalanced population distribution and causing early individuals to drift into aimless exploration during the search process. Therefore, the elite opposition-based learning technique is used at the initialization stage. In addition to broadening the algorithm’s search range through opposite-solution generation and elite-individual selection, it also improves the algorithm’s global search capability and its capacity to avoid local optima [18].
Let the solution of the jth goose at the tth iteration be $X_j^t$ and its opposite solution be $X_j^{t\prime}$, and let the fitness function be denoted by F(x). When $F(X_j^t) < F(X_j^{t\prime})$, the original solution is better than the opposite solution and is retained as the elite solution. Equation (5) defines the elite opposite solution.
$X_j^{t\prime} = ub + lb - X_j^t$    (5)
Here, ub and lb denote the upper and lower bounds of the dynamic boundary, respectively. The advantage of the dynamic boundary is that it compensates for the drawback of a fixed boundary, which cannot retain search experience, and thus helps to shorten the search time.
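The following is a minimal sketch of how such an elite opposition-based initialization could be implemented: a random population is mirrored with opposite = ub + lb − x using the dynamic (per-dimension) bounds of the current population, and the fittest individuals from the merged set are kept. The function name, the use of per-dimension dynamic bounds, and the sphere test function are illustrative assumptions.

```python
# Sketch of elite opposition-based initialization (assumed form: opposite = ub + lb - x,
# with ub/lb taken as the dynamic per-dimension bounds of the random population).
import numpy as np

def elite_opposition_init(n, dim, lower, upper, fitness, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    pop = rng.uniform(lower, upper, size=(n, dim))   # random initial population
    lb = pop.min(axis=0)                             # dynamic lower bound per dimension
    ub = pop.max(axis=0)                             # dynamic upper bound per dimension
    opp = ub + lb - pop                              # elite opposite solutions, Equation (5)
    merged = np.vstack([pop, opp])                   # evaluate originals and opposites together
    scores = np.apply_along_axis(fitness, 1, merged)
    return merged[np.argsort(scores)[:n]]            # keep the n fittest individuals

# Example: initialize 30 geese in 5 dimensions for the sphere function
init_pop = elite_opposition_init(30, 5, -100, 100, lambda x: np.sum(x**2))
print(init_pop.shape)   # (30, 5)
```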

3.2. Adaptive Nonlinear Control Parameter

In the exploration phase of the GOOSE Optimization Algorithm, the parameter alpha, which lies in [0, 2], is a crucial variable. Its value decreases steadily as the iterations proceed; Equation (6) gives its expression. Evidently, alpha is close to 2 when the number of iterations is small, which favors global search, and close to 0 when the number of iterations is large, which favors local search. An adaptive nonlinear control parameter is introduced to enhance the algorithm’s robustness and adaptability, improve the positional accuracy of the optimal solution, and more faithfully model the behavior of the goose, increasing the likelihood that the algorithm finds a better solution faster or converges to the optimal solution sooner. Equation (7) gives the expression of the adaptive nonlinear control parameter alpha.
$alpha = 2 - \frac{loop}{Max\_IT} \times 2$    (6)
$alpha = 2 - \left(\frac{loop}{Max\_IT}\right)^{2} \times \frac{2\,loop}{Max\_IT}$    (7)
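A short numerical sketch of the two schedules is given below; it uses the forms reconstructed in Equations (6) and (7) above, so the exact published expressions may differ, and the snippet is intended only to show how a linear and a nonlinear decay of alpha compare over the iterations.

```python
# Sketch comparing the linear and nonlinear alpha schedules as reconstructed above
# (Equations (6)-(7)); the published forms may differ slightly.
import numpy as np

Max_IT = 1000
loop = np.arange(Max_IT + 1)
alpha_linear = 2 - (loop / Max_IT) * 2             # Eq. (6): linear decrease from 2 to 0
alpha_nonlinear = 2 - 2 * (loop / Max_IT) ** 3     # Eq. (7): decays slowly early, quickly late

print(alpha_linear[[0, 500, 1000]])     # [2. 1. 0.]
print(alpha_nonlinear[[0, 500, 1000]])  # [2.   1.75 0.  ]
```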

3.3. Golden Sine Strategy

Due to a decline in population diversity, the GOOSE algorithm is vulnerable to local optima in the later phases of the search. Thus, to improve its global search capability and its ability to escape local optima, the golden sine technique is employed [24]. The golden sine method scans the unit circle with a sine function to simulate the search space and combines this with the golden-section coefficients, so that each iteration traverses the search space more thoroughly and the algorithm is kept from becoming trapped in a local optimum. The mathematical model is as follows:
$V_j^{t+1} = V_j^t \left| \sin(r_1) \right| - r_2 \sin(r_1) \left| x_1 D_j^t - x_2 V_j^t \right|$
where $V_j^{t+1}$ is the jth individual’s position at the (t + 1)th iteration; $V_j^t$ is the jth individual’s position at the tth iteration; $D_j^t$ is the jth individual’s global optimal position at the tth iteration; $r_1$ is a random number in $[0, 2\pi]$ that controls the search distance; $r_2$ is a random number in $[0, \pi]$ that controls the search direction; and the golden-section coefficients are $x_1 = a(1 - g) + bg$ and $x_2 = ag + b(1 - g)$, where the golden ratio is $g = (\sqrt{5} - 1)/2$ and the initial values of the golden-section search are $a = -\pi$ and $b = \pi$.
The goose’s position update procedure is as follows, following the introduction of the golden sine strategy:
$x_j^{t\_new} = x_j^t \left| \sin(r_1) \right| - r_2 \sin(r_1) \left| x_1 x_{best} - x_2 x_j^t \right|$
where $x_{best}$ is the best goose position at the tth iteration; $x_j^t$ is the position of the jth goose obtained from the original GOOSE position-update formulas at the tth iteration; and $x_j^{t\_new}$ is the position of the jth goose after applying the golden sine strategy at the tth iteration.
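As an illustration, the snippet below applies this golden sine update to a candidate position. The coefficient values follow the definitions given above (g = (√5 − 1)/2, a = −π, b = π), while the function name and the toy positions are assumptions used only for demonstration.

```python
# Sketch of the golden sine position update described above; the toy vectors are placeholders.
import numpy as np

def golden_sine_update(x, x_best, rng):
    g = (np.sqrt(5) - 1) / 2                  # golden ratio coefficient
    a, b = -np.pi, np.pi                      # initial golden-section search values
    x1 = a * (1 - g) + b * g                  # golden-section coefficients
    x2 = a * g + b * (1 - g)
    r1 = rng.uniform(0, 2 * np.pi)            # controls the search distance
    r2 = rng.uniform(0, np.pi)                # controls the search direction
    return x * np.abs(np.sin(r1)) - r2 * np.sin(r1) * np.abs(x1 * x_best - x2 * x)

rng = np.random.default_rng(0)
x_new = golden_sine_update(np.array([3.5, -1.0]), np.array([2.0, 0.0]), rng)
print(x_new)
```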

3.4. Algorithm Flow Chart

Based on the three improvement strategies, enhancements were made to the GOOSE optimization algorithm. The detailed flowchart is depicted in Figure 2.

4. Performance Testing and Comparison

The CEC2022 benchmark functions are a standard test set designed by the IEEE Computational Intelligence Society for evaluating the performance of optimization algorithms. The set contains different types of optimization problems, including unimodal functions (for testing local search capability), multimodal functions (for testing global search capability), hybrid functions, and composition functions (for testing the performance of algorithms in complex scenarios). These functions have different characteristics, such as multimodality, non-separability, and rotation, which enable a thorough evaluation of the optimization techniques’ performance. The IGOOSE method was simulated using the CEC2022 benchmark functions to examine its performance. The functions used comprise the unimodal function (F1), basic multimodal functions (F2–F5), hybrid functions (F6–F8), and composition functions (F9–F12). These functions were chosen because they can thoroughly test the algorithm’s performance and represent many kinds of optimization challenges. Through these experiments, we can confirm the IGOOSE algorithm’s optimization capability in a range of situations and guarantee its dependability for subsequent use in predicting subclinical mastitis in dairy cows. This article uses 10 dimensions. The specific function information is shown in Table 1.
This study compared IGOOSE with the DBO, HBA, GWO, HHO, and GOOSE algorithms to confirm its superiority and efficacy. The comparison algorithms were set to a population size of N = 30 and a maximum number of iterations of T = 1000, and each algorithm was executed 30 times independently to guarantee the fairness of the comparison findings. The statistical findings are displayed in Table 2.
(1) Table 2 shows that, for 11 of the 12 test functions, the algorithm’s results are close to the theoretical optimal values, with F6 being the exception. According to the experimental data, IGOOSE achieves the theoretical optimal value for the unimodal function F1 and exhibits notable optimization performance. Among the multimodal functions F2–F5, the theoretical optimum is reached for F4. Among the hybrid functions F6–F8, the theoretical optimum is reached for F7 and F8. All of the composition functions F9–F12 reach values close to the theoretical optimum. For F1, F3, F4, F8, F9, F10, F11, and F12, the very low standard deviations show that IGOOSE is more stable than the other algorithms. In this study, we also selected six representative functions from Table 2—F1, F3, F6, F7, F10, and F12—and plotted convergence curves over 30 runs. As seen in Figure 3, the ordinate denotes the function’s average fitness value, while the abscissa indicates the number of iterations.
(2) From the process point of view, in the single-peak function F1, despite starting the iteration with a somewhat slower rate of convergence than the HHO and GWO algorithms, the IGOOSE algorithm achieves the theoretical optimum after roughly 400 iterations. The IGOOSE algorithm starts to converge at around 100 iterations until it reaches the global optimum for function F3, and its convergence pace is marginally faster than HBA’s. After roughly 50 iterations, the IGOOSE algorithm starts to converge in the mixed function F6, eventually reaching the global optimum; meanwhile, the F7 function reaches its optimum after 100 iterations. For the combined functions F10 and F12, convergence begins at the 50th iteration, and at the beginning of the iteration, the convergence pace is faster than the other five algorithms, until convergence to the global optimum.
IGOOSE performs better than DBO, HBA, GWO, HHO, and GOOSE in terms of convergence speed and stability, as evidenced by the fact that its optimization curve is generally much lower than those of the other algorithms and that its convergence results are superior to those of the other algorithms.

5. Prediction Model for Subclinical Mastitis in Dairy Cows Based on IGOOSE-XGBoost

This study applies IGOOSE to a predictive model of subclinical mastitis in dairy cows, aiming to validate its feasibility and applicability. There are two tasks: (1) predicting whether cows have subclinical mastitis, and (2) predicting somatic cell score (SCS). Task 1 involves classification prediction, while Task 2 involves regression prediction. The study yields two types of results: categorical disease status and numerical SCS. Qualitative analysis discusses non-numerical features related to disease status to assist in herd health management. Quantitative analysis utilizes SCS data to establish models predicting the risk and likelihood of subclinical mastitis, providing decision support for farms.

5.1. Data Processing

The study was conducted at Gansu Nongken Tianmu Dairy Co., Ltd., located in Jinchang City, China. The farm utilized a free-stall housing system with mattress-bedded stalls. The cows were given a total mixed ration (TMR) twice a day, which included concentrate, alfalfa hay, and corn silage. Milking was performed in an enclosed milking facility equipped with a large rotary milking parlor (72–80 stalls). The milking process followed a systematic routine in which cows entered the rotary platform in sequence. The milking procedure included pre-milking teat sanitization (medicated bath), disinfection, cleaning, and milk inspection before cluster attachment, and the milking clusters were automatically removed upon completion. Post-milking procedures included teat cleaning, sanitization, and disinfection to prevent infection. The fresh milk was immediately transferred to rapid cooling tanks and delivered to processing facilities to ensure milk quality. The herd consisted of purebred Holstein cows. The dataset comprised the DHI data of a total of 4154 cows on the farm in 2022 and recorded more than 20 indicators, such as individual number, parity, measurement date, milk yield, and milk fat amount. These data were analyzed to form a test report reflecting the farm’s breeding, reproduction, feeding, disease, and production performance. The test report recorded the basic information of the dairy herd and of the individual cows, providing the raw data for herd improvement and for monitoring the health and nutritional status of individual cows [25].
After eliminating irrelevant attributes, handling outliers, and imputing missing values in the DHI dataset, we obtained a final dataset comprising 3232 samples and 21 features. These features include Parity, Lactation Persistence, Days In Milk, Within-Herd Index (WHI), Milk Yield, Foremilk yield, Milk Fat Percentage, Peak Milk Yield, Protein Percentage, Peak Lactation Day, Fat-to-Protein Ratio [26], 305-Day Milk Yield, Urea Nitrogen, Total Milk Volume, Milk Loss, Total Milk Fat, Economic Loss, Total Protein, Corrected Milk, Adult Equivalent, and Somatic Cell Score, as shown in Table 3.
To predict subclinical mastitis in dairy cows, the cows were classified into healthy and diseased groups based on somatic cell count (SCC). The international standard for judging the milk SCC in subclinical mastitis ranges from 100,000 to 500,000 cells/mL, and in China, the current criterion for subclinical intramammary infection is 200,000 to 500,000 cells/mL [27]. To enhance the predictive accuracy of our model, this study adopted an SCC greater than 100,000 cells/mL as the criterion for diagnosing subclinical mastitis. Somatic cell score (SCS), derived from SCC, exhibits a higher genetic correlation (0.6–0.8) with subclinical mastitis in dairy cows and has a higher heritability than SCC [28]. Thus, SCC was transformed into SCS using the formula SCS = log2(SCC/100,000) + 3 [29]; samples with SCS ≥ 3 were labeled 1 (diseased) and samples with SCS < 3 were labeled 0 (healthy).
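The labeling step can be expressed in a few lines of code. The sketch below assumes a pandas DataFrame with an SCC column in cells/mL and toy values, so the column names and data are placeholders.

```python
# Sketch of the SCC-to-SCS transformation and binary labeling described above.
# The DataFrame and its "SCC" column (in cells/mL) are assumed placeholders with toy values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"SCC": [50_000, 120_000, 260_000, 800_000]})   # cells/mL

df["SCS"] = np.log2(df["SCC"] / 100_000) + 3        # SCS = log2(SCC / 100,000) + 3
df["label"] = (df["SCS"] >= 3).astype(int)          # 1 = subclinical mastitis, 0 = healthy

print(df)
```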

5.2. Building the IGOOSE-XGBoost Classification Model

5.2.1. Evaluation Metrics for Classification Models

To predict whether a cow had subclinical mastitis, we used four different metrics [30] to evaluate the model: accuracy (ACC), precision (PRE), F1 score (F1), and recall (REC). Their expressions are as follows:
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$
$Precision = \frac{TP}{TP + FP}$
$Recall = \frac{TP}{TP + FN}$
$F1 = \frac{2PR}{P + R}$
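For reference, the snippet below computes these four metrics directly from the entries of a confusion matrix; the prediction arrays are toy values rather than outputs of the models in this study.

```python
# Sketch computing accuracy, precision, recall, and F1 from a confusion matrix; toy data only.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```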

5.2.2. Building an IGOOSE-XGBoost Classification Model for Subclinical Mastitis in Dairy Cows

Combining an intelligent optimization algorithm with XGBoost for feature selection and classification is a powerful approach that helps identify the best features in complex datasets and train high-performance classifiers. First, the dataset was split into a training set and a test set at an 8:2 ratio. Feature selection was then performed on the dataset using the IGOOSE, GOOSE, SSA, HBA, HO [31], and DBO algorithms, respectively, and the XGBoost classification algorithm was used to classify and predict the data after feature selection. The population size of each algorithm was set to 30, the number of iterations to 1000, and the results were averaged over 50 runs to ensure their reliability. Six assessment indicators—accuracy, number of selected features, algorithm fitness value, recall, precision, and F1 score—were employed to compare the algorithms’ performance in feature selection. Table 4 displays the findings.
Table 4 shows the classification results of the different models after feature selection of the dataset. The experimental data show that among the six algorithms, the IGOOSE-XGBoost model achieves the highest average classification accuracy of 0.85 ± 0.013 after feature selection. In addition, the model has the highest recall (0.87 ± 0.015) and F1 score (0.80 ± 0.013). It is noteworthy that IGOOSE-XGBoost still achieved such excellent performance while maintaining a condensed feature set of only six features. In terms of fitness value, the IGOOSE-XGBoost model also outperforms the other models with a value of 0.151 ± 0.0019, indicating the best optimization. The standard deviations of all the evaluation indexes were controlled within a small range (≤0.02), reflecting the high repeatability and reliability of the experiment.
Analyses with Friedman’s test show significant differences between the algorithms (χ² = 4.93, p < 0.05). In particular, the IGOOSE-XGBoost model shows significant advantages in several assessment metrics, mainly in terms of accuracy (p < 0.05), recall (p < 0.05), and fitness value (p < 0.05). Although the improvement in precision is relatively modest, it is still statistically significant (p < 0.05). In terms of feature selection efficiency, IGOOSE consistently selects a compact feature set, which demonstrates its ability to identify the most relevant features while still maintaining high predictive performance.
IGOOSE’s greater optimization power is demonstrated by the fact that its fitness value (0.151 ± 0.0019) is much lower than that of the other algorithms (p < 0.05). In addition, the narrow confidence intervals for all the evaluated metrics indicate that the results are highly reliable and reproducible. The IGOOSE algorithm shows unique advantages in the feature selection process: not only can it achieve the best performance while maintaining a small number of features, but it also considers the interactions between the features during the feature selection process and effectively avoids local optimal solutions through the iterative optimization process.
The swarm intelligence optimization algorithm not only helps to determine the most representative feature subset, but also searches for the parameter space to find the optimal parameter combination to enhance the model’s functionality and capacity for generalization. Next, based on feature selection, the corresponding intelligent optimization algorithm simultaneously optimizes the parameters (maximum number of iterations, depth, learning rate) of the classifier XGBoost. The results are shown in Table 5.
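One way to picture this joint feature selection and hyperparameter tuning is through the fitness function that a metaheuristic such as IGOOSE would minimize. The sketch below encodes a candidate as a binary feature mask plus three XGBoost hyperparameters and scores it by cross-validated error with a small penalty on subset size; the encoding, the value ranges, and the weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of a fitness function for joint feature selection and XGBoost hyperparameter tuning.
# The candidate encoding, parameter ranges, and weights are illustrative assumptions.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def fitness(candidate, X, y, alpha=0.99):
    n_feat = X.shape[1]
    mask = candidate[:n_feat] > 0.5                        # binary feature-selection mask
    if not mask.any():
        return 1.0                                         # penalize empty feature subsets
    n_estimators = int(50 + candidate[n_feat] * 450)       # map [0, 1] to 50-500 trees
    max_depth = int(2 + candidate[n_feat + 1] * 8)         # map [0, 1] to depth 2-10
    learning_rate = 0.01 + candidate[n_feat + 2] * 0.29    # map [0, 1] to 0.01-0.30
    model = XGBClassifier(n_estimators=n_estimators, max_depth=max_depth,
                          learning_rate=learning_rate, eval_metric="logloss")
    acc = cross_val_score(model, X[:, mask], y, cv=3, scoring="accuracy").mean()
    # lower is better: weighted error rate plus a small penalty on the subset size
    return alpha * (1 - acc) + (1 - alpha) * mask.sum() / n_feat
```

An optimizer such as IGOOSE would then search candidate vectors in [0, 1] over the n_feat + 3 dimensions for the minimum of this fitness, simultaneously deciding which features to keep and which hyperparameter values to use.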
Table 5 shows the classification results of each algorithm after parameter optimization. The prediction performance of the IGOOSE-XGBoost model is significantly improved in all metrics: the accuracy rate is increased to 0.87 ± 0.012, the precision rate reaches 0.89 ± 0.013, the recall rate stays at a high level of 0.88 ± 0.014, and the F1 score is improved to 0.88 ± 0.012. The standard deviation of all evaluation metrics remained below 0.014, indicating that the model has good stability. By comparing prediction performance, precision, recall, F1 score, fitness value, and number of features, it can be observed that the IGOOSE-XGBoost model has the best prediction effect.
From the perspective of disease prediction, recall is of particular importance. Lower recall means that many cows that are sick may be misclassified as healthy, which will delay diagnosis and treatment. The IGOOSE-XGBoost model (0.88 ± 0.014) and the DBO-XGBoost model (0.89 ± 0.017) perform well in terms of recall. A higher recall rate helps with disease prevention and early diagnosis and treatment. Considering all the metrics together, the IGOOSE-XGBoost model demonstrates the best overall prediction.
For the parameter-optimized model, IGOOSE-XGBoost shows significant improvements in accuracy, precision, and F1 score (p < 0.05). The smaller standard deviation (≤0.014) and narrow confidence interval further demonstrate the high stability and reliability of this model. Through parameter optimization, the model not only improves predictive accuracy, but also maintains the consistency and reproducibility of the results.

5.3. Building the IGOOSE-XGBoost Regression Model for Subclinical Mastitis in Dairy Cows

5.3.1. Correlation Analysis

Correlation analysis was conducted for the attributes listed in Table 3 to ensure variable independence and minimize redundancy. Pearson’s correlation coefficient was utilized to quantify the associations among variables. This coefficient measures the strength of the linear association between attributes and is suitable for analyzing correlations among continuous variables. A coefficient near 0 (0 to 0.2) indicates no or a very weak correlation, a coefficient between 0.2 and 0.4 indicates a weak correlation, a coefficient between 0.4 and 0.6 indicates a moderate correlation, a coefficient between 0.6 and 0.8 indicates a high correlation, and a coefficient between 0.8 and 1 indicates an extremely strong association [32]. The correlation heat map depicted in Figure 4 illustrates these relationships, based on the Pearson correlation coefficients between variables.
Figure 4 displays the Pearson correlation coefficients among the different variables, with darker colors indicating stronger correlations, and illustrates the relationship between the individual variables and SCS. In this study, features with correlation coefficients greater than or equal to 0.4 were selected. Ultimately, x7, x8, x13, and x21 (i.e., Milk Fat Percentage, Peak Milk Yield, Urea Nitrogen, and SCS) were identified and used to construct the regression model.
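This threshold-based selection can be sketched as follows; the DataFrame name, the "SCS" column label, and the handling of the threshold are assumptions used only to illustrate the filtering step described above.

```python
# Sketch of selecting features whose |Pearson r| with SCS is at least 0.4; "dhi" is an
# assumed DataFrame holding the 21 cleaned DHI features, including the "SCS" column.
import pandas as pd

def select_by_correlation(dhi: pd.DataFrame, target: str = "SCS", threshold: float = 0.4):
    corr = dhi.corr(method="pearson")[target].abs()     # |Pearson r| of each feature vs. SCS
    keep = corr[corr >= threshold].index.drop(target)   # keep strong features, drop the target
    return list(keep)

# selected = select_by_correlation(dhi)
# Expected to return features such as Milk Fat Percentage, Peak Milk Yield, and Urea Nitrogen.
```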

5.3.2. Evaluation Indicators for Regression Models

Four indicators—the mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and coefficient of determination (R2)—were taken into consideration in order to test and assess the prediction accuracy and performance of the regression method machine learning model examined in this study [33].
The mean square error (MSE) is the primary loss measure for the regression problem. A lower MSE value indicates a smaller error between the predicted results and the original data and therefore better prediction performance; conversely, a higher MSE value indicates a greater discrepancy between the data and the predicted outcomes, a higher error, and a less effective model. It is expressed as follows:
$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}$
The root mean square error (RMSE) falls between 0 and +∞. RMSE = 0, corresponding to an ideal model, occurs when the predicted values equal the true values; the RMSE value increases with the size of the error. The formula is expressed as follows:
$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^{2}}$
The advantage of the mean absolute error (MAE) is that it is not sensitive to extreme values, it has a stable gradient regardless of the input value, and it has a robust solution. The range is [0, +∞). A smaller score indicates a better model, because there is less discrepancy between the true and predicted values. On the contrary, a greater score indicates that the model is poor. The formula is expressed as follows:
$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$
The primary purpose of the coefficient of determination (R²) is to assess the degree of fit between the predicted and actual values and to ascertain whether the statistical model can adequately fit the data. Its value lies in [0, 1]; the closer it is to 1, the better the model’s predictive performance. The specific expression is as follows:
$R^{2} = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^{2}}{\sum_i \left(y_i - \bar{y}\right)^{2}}$
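These four indicators can be computed directly with scikit-learn, as in the short sketch below; the prediction arrays are toy values used only to illustrate the calculation.

```python
# Sketch computing MAE, MSE, RMSE, and R2 with scikit-learn; the arrays are toy values.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.1, 3.0, 4.2, 1.8, 2.9])
y_pred = np.array([2.3, 2.8, 4.0, 2.0, 3.1])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```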

5.3.3. Building the IGOOSE-XGBoost Regression Model

First, four commonly used machine learning methods, KNN [34], XGBoost, SVM [35], and GBDT [36], were used to perform regression prediction. According to the correlation analysis, each sample contained four features.
The comparison of the evaluation metrics of several regression models is displayed in Table 6. From the evaluation results, the XGBoost model shows optimal prediction performance, with the highest coefficient of determination (R2 = 0.631 ± 0.013) and the lowest mean absolute error (MAE = 0.552 ± 0.016) and mean square error (MSE = 0.565 ± 0.017). This is followed by the GBDT model, which performs slightly better than XGBoost in terms of root mean square error (RMSE = 0.753 ± 0.020) but is slightly inferior in terms of other metrics such as R2 value (0.611 ± 0.015) and MAE (0.565 ± 0.021). The KNN model has the worst overall performance, with the lowest R2 value (0.485 ± 0.018) and all its error metrics being significantly higher than the other models. All the models maintain strong stability (standard deviation ≤ 0.026), according to the standard deviation study, with the XGBoost model having the best stability (standard deviation range 0.013–0.018). The best prediction results are shown by the XGBoost model when all four evaluation measures are compared.
The relevant literature indicates that swarm intelligence optimization techniques are increasingly being used to optimize model search procedures. SSA [37], WOA [38], ESOA [39], GOOSE, and IGOOSE were combined with XGBoost, as shown in Figure 5. Five combined models were obtained by adding an optimization process, and the obtained prediction results were compared and analyzed to find a new model with optimal performance.
The comparison findings of these optimized models’ assessment standards are displayed in Table 7. As can be seen from the data, the IGOOSE-XGBoost model achieves the best performance in all the metrics: it has the highest R2 value (0.663 ± 0.012), and the lowest MAE (0.526 ± 0.013), MSE (0.511 ± 0.014), and RMSE (0.693 ± 0.015). The GOOSE-XGBoost model performs second best, with a similar R2 value (0.642 ± 0.013) and error metrics (MAE = 0.531 ± 0.014, MSE = 0.521 ± 0.015, and RMSE = 0.722 ± 0.016) to IGOOSE-XGBoost. The other optimized models, although improved compared to the base model, are not as significantly improved as the IGOOSE and GOOSE models.
It is particularly noteworthy that in comparison with the original XGBoost model, the IGOOSE-optimized model achieves significant improvements in all evaluation metrics: the R2 value improves by 0.032 (from 0.631 to 0.663), the MAE decreases by 0.026 (from 0.552 to 0.526), the MSE decreases by 0.054 (from 0.565 to 0.511), and the RMSE decreases by 0.065 (from 0.758 to 0.693). These improvements are not only statistically significant (p < 0.05), but also valuable in practical applications. Furthermore, all the optimized models’ assessment metrics’ standard deviations are maintained at a low level (≤0.015), demonstrating the models’ strong stability and dependability. In particular, the IGOOSE-XGBoost model has the smallest standard deviation (≤0.014) for all the indicators, which further confirms that the model has the best stability and reliability.
In summary, our study created an IGOOSE-XGBoost-based model to forecast subclinical mastitis in dairy cows, predicting whether cows are diseased using their SCS. To predict whether cows had subclinical mastitis, feature selection was performed using six intelligent optimization algorithms, with XGBoost as the classifier; each of these six algorithms was then applied to optimize the XGBoost parameters to further improve prediction accuracy. According to the results, the IGOOSE-XGBoost classification model performed the best (ACC = 0.87; REC = 0.88; PRE = 0.89; F1 score = 0.88; fitness value = 0.14523). For predicting the SCS of dairy cows, the DHI data used started with 33 features, and after deleting irrelevant attributes and performing correlation analysis, the four most important features were selected. We trained and compared four machine learning models and selected the model with the best prediction performance, XGBoost, to which five intelligent optimization algorithms were added to optimize its parameters. Judging by the results, the IGOOSE-XGBoost regression model performed the best (RMSE = 0.693; MSE = 0.511; MAE = 0.526; R2 = 0.663). The integration of these approaches enhances prediction accuracy and comprehensiveness, offering valuable insights for targeted prevention and treatment strategies.

6. Conclusions and Future Work

Early identification and management of dairy cow mastitis are critically important for maintaining the health of herds, ensuring the quality of dairy products, increasing production efficiency, safeguarding animal welfare, and reducing economic losses in dairy operations. This study presents an IGOOSE-XGBoost predictive model for subclinical mastitis in dairy cows, which integrates the XGBoost algorithm with an enhanced GOOSE optimization algorithm.
Our work advances the field in several key respects compared with previous studies. Ebrahimi et al. [40], using several machine learning models (deep learning, naïve Bayes, generalized linear model, logistic regression, decision tree, gradient-boosted trees, and random forest) and a large dataset of 364,249 milking instances, found that gradient-boosted trees were the most accurate, with 84.9% accuracy, in predicting subclinical mastitis. Fadul-Pacheco et al. [41] evaluated different machine learning methods (naïve Bayes, random forest, and extreme gradient boosting) for clinical mastitis prediction, reaching a 72% accuracy rate and accurately categorizing 85% of clinical mastitis cows in their continuous model. In comparison, our model achieves higher accuracy (87%) while specifically addressing the more challenging task of subclinical mastitis detection. The experimental outcomes, scrutinized from both qualitative and quantitative standpoints, indicate that the IGOOSE-XGBoost model has higher accuracy in predicting subclinical mastitis (ACC = 0.87; REC = 0.88; PRE = 0.89; F1 score = 0.88), outperforming traditional methods. These results surpass those reported in the recent literature, in which typical accuracy rates range from 0.75 to 0.83. Furthermore, our regression model for SCS prediction achieves an R2 value of 0.663, demonstrating stronger predictive power compared with conventional approaches. In future work, we intend to utilize different CEC test functions to evaluate the efficiency of the improved algorithm and apply it to other cow disease prediction tasks, thereby expanding the scope and applicability of our prediction framework to the veterinary and dairy industries.

Author Contributions

Methodology, R.G. and Y.D.; writing—original draft preparation, R.G.; writing—review and editing, R.G.; visualization, Y.D.; supervision, Y.D.; funding acquisition, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Research Innovation Group (grant number 21JR7RA858) and the Gansu University Innovation Fund Project “Research on Machine Learning Prediction Model of Subclinical Mastitis in Dairy Cows” (grant number 2022B-107).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vargová, M.; Zigo, F.; Výrostková, J.; Farkašová, Z.; Rehan, I.F. Biofilm-Producing Ability of Staphylococcus Aureus Obtained from Surfaces and Milk of Mastitic Cows. Vet. Sci. 2023, 10, 386. [Google Scholar] [CrossRef]
  2. Kerro Dego, O.; Vidlund, J. Staphylococcal Mastitis in Dairy Cows. Front. Vet. Sci. 2024, 11, 1356259. [Google Scholar] [CrossRef] [PubMed]
  3. Sahu, C.; Manimaran, A.; Kumaresan, A.; Rajendran, D.; Sivaram, M. Role of Trisodium Citrate and Nanominerals in Mastitis Management in Dairy Animals: A Review. Agric. Rev. 2025, 46, 272–279. [Google Scholar] [CrossRef]
  4. Deshapriya, R.M.C.; Rahularaj, R.; Ranasinghe, R. Mastitis, Somatic Cell Count and Milk Quality: An Overview. Sri Lanka Vet. J. 2019, 66, 1–12. [Google Scholar] [CrossRef]
  5. Hisira, V.; Zigo, F.; Kadaši, M.; Klein, R.; Farkašová, Z.; Vargová, M.; Mudroň, P. Comparative Analysis of Methods for Somatic Cell Counting in Cow’s Milk and Relationship between Somatic Cell Count and Occurrence of Intramammary Bacteria. Vet. Sci. 2023, 10, 468. [Google Scholar] [CrossRef] [PubMed]
  6. Mammadova, N.; Keskin, İ. Application of the Support Vector Machine to Predict Subclinical Mastitis in Dairy Cattle. Sci. World J. 2013, 2013, 603897. [Google Scholar] [CrossRef]
  7. Sun, X.; Zhao, R.; Wang, N.; Zhang, J.; Xiao, B.; Huang, F.; Chen, A. Milk Somatic Cell Count: From Conventional Microscope Method to New Biosensor-Based Method. Trends Food Sci. Technol. 2023, 135, 102–114. [Google Scholar] [CrossRef]
  8. Bobbo, T.; Matera, R.; Pedota, G.; Manunza, A.; Cotticelli, A.; Neglia, G.; Biffani, S. Exploiting Machine Learning Methods with Monthly Routine Milk Recording Data and Climatic Information to Predict Subclinical Mastitis in Italian Mediterranean Buffaloes. J. Dairy Sci. 2023, 106, 1942–1952. [Google Scholar] [CrossRef]
  9. Wang, Z.; Dai, Y.; Liu, H. Feature Selection of Recessive Mastitis in Dairy Cows Based on Improved Moth-Flame Optimization Algorithm. Heilongjiang Anim. Husb. Vet. Med. 2024, 66, 8–16. [Google Scholar]
  10. Satoła, A.; Satoła, K. Performance Comparison of Machine Learning Models Used for Predicting Subclinical Mastitis in Dairy Cows: Bagging, Boosting, Stacking, and Super-Learner Ensembles versus Single Machine Learning Models. J. Dairy Sci. 2024, 107, 3959–3972. [Google Scholar] [CrossRef]
  11. Sheik, A.G.; Malla, M.A.; Srungavarapu, C.S.; Patan, A.K.; Kumari, S.; Bux, F. Prediction of Wastewater Quality Parameters Using Adaptive and Machine Learning Models: A South African Case Study. J. Water Process Eng. 2024, 67, 106185. [Google Scholar] [CrossRef]
  12. Khalil, M.; AlSayed, A.; Liu, Y.; Vanrolleghem, P.A. An Integrated Feature Selection and Hyperparameter Optimization Algorithm for Balanced Machine Learning Models Predicting N2O Emissions from Wastewater Treatment Plants. J. Water Process Eng. 2024, 63, 105512. [Google Scholar] [CrossRef]
  13. Wang, T.; Zhang, K.; Liu, Z.; Ma, T.; Luo, R.; Chen, H.; Wang, X.; Ge, W.; Sun, H. Prediction and Explanation of Debris Flow Velocity Based on Multi-Strategy Fusion Stacking Ensemble Learning Model. J. Hydrol. 2024, 638, 131347. [Google Scholar] [CrossRef]
  14. Kwakye, B.D.; Li, Y.; Mohamed, H.H.; Baidoo, E.; Asenso, T.Q. Particle Guided Metaheuristic Algorithm for Global Optimization and Feature Selection Problems. Expert Syst. Appl. 2024, 248, 123362. [Google Scholar] [CrossRef]
  15. Hamad, R.K.; Rashid, T.A. GOOSE Algorithm: A Powerful Optimization Tool for Real-World Engineering Challenges and Beyond. Evol. Syst. 2024, 15, 1249–1274. [Google Scholar] [CrossRef]
  16. Xue, J.; Shen, B. Dung Beetle Optimizer: A New Meta-Heuristic Algorithm for Global Optimization. J. Supercomput. 2023, 79, 7305–7336. [Google Scholar] [CrossRef]
  17. Hashim, F.A.; Houssein, E.H.; Hussain, K.; Mabrouk, M.S.; Al-Atabany, W. Honey Badger Algorithm: New Metaheuristic Algorithm for Solving Optimization Problems. Math. Comput. Simul. 2022, 192, 84–110. [Google Scholar] [CrossRef]
  18. Zhang, H.; Zhang, Y.; Niu, Y.; He, K.; Wang, Y. A Grey Wolf Optimizer Combined with Artificial Fish Swarm Algorithm for Engineering Design Problems. Ain Shams Eng. J. 2024, 15, 102797. [Google Scholar] [CrossRef]
  19. Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris Hawks Optimization: Algorithm and Applications. Future Gener. Comput. Syst. 2019, 97, 849–872. [Google Scholar] [CrossRef]
  20. Sun, K.; Huo, J.; Jia, H.; Liu, Q.; Yang, J.; Cai, C. Nonlinear Optimization of Optical Camera Multiparameter via Triple Integrated Gradient-Based Optimizer Algorithm. Opt. Laser Technol. 2024, 179, 111294. [Google Scholar] [CrossRef]
  21. Costache, R.; Arabameri, A.; Moayedi, H.; Pham, Q.B.; Santosh, M.; Nguyen, H.; Pandey, M.; Pham, B.T. Flash-Flood Potential Index Estimation Using Fuzzy Logic Combined with Deep Learning Neural Network, Naïve Bayes, XGBoost and Classification and Regression Tree. Geocarto Int. 2022, 37, 6780–6807. [Google Scholar] [CrossRef]
  22. Joe, H.; Kim, H.-G. Multi-Label Classification with XGBoost for Metabolic Pathway Prediction. BMC Bioinform. 2024, 25, 52. [Google Scholar] [CrossRef]
  23. Meng, W.; Meng, X.; Wang, J.; Li, G.; Liu, B.; Kan, G.; Lu, J.; Zhao, L.; Zhi, P. Prediction of the Shear Wave Speed of Seafloor Sediments in the Northern South China Sea Based on an XGBoost Algorithm. Front. Mar. Sci. 2024, 11, 1307768. [Google Scholar] [CrossRef]
  24. Fu, J.; Wu, C.; Wang, J.; Haque, M.M.; Geng, L.; Meng, J. Lithium-Ion Battery SOH Prediction Based on VMD-PE and Improved DBO Optimized Temporal Convolutional Network Model. J. Energy Storage 2024, 87, 111392. [Google Scholar] [CrossRef]
  25. Han, X.; Gao, M.; Shen, W.; Liu, H.; Dai, B.; He, Y.; Liu, H. A Framework for Generating Anomaly Analysis Comments in DHI Interpretation Report. Comput. Electron. Agric. 2023, 214, 108331. [Google Scholar] [CrossRef]
  26. De Oliveira Padilha, D.A.; Evangelista, A.F.; Valloto, A.A.; Zadra, L.E.F.; De Almeida, R.; De Almeida Teixeira, R.; Dias, L.T. Genetic Association between Fat-to-Protein Ratio and Traits of Economic Interest in Early Lactation Holstein Cows in Brazil. Trop. Anim. Health Prod. 2024, 56, 90. [Google Scholar] [CrossRef]
  27. YE, W.; MA, Z.; YU, Y.; HAN, B. Incidence Status of Mastitis in Dairy Cows and Its Prevention and Treatment Measures in China. Chin. J. Anim. Sci. 2023, 59, 343–348. [Google Scholar]
  28. Chen, R.; Wang, Z.; Yang, Z.; Zhu, X.; Ji, D.; Mao, Y. Association of IL8 -105G/A with Mastitis Somatic Cell Score in Chinese Holstein Dairy Cows. Anim. Biotechnol. 2015, 26, 143–147. [Google Scholar] [CrossRef]
  29. Carvalho-Sombra, T.C.F.; Fernandes, D.D.; Bezerra, B.M.O.; Nunes-Pinheiro, D.C.S. Systemic Inflammatory Biomarkers and Somatic Cell Count in Dairy Cows with Subclinical Mastitis. Vet. Anim. Sci. 2021, 11, 100165. [Google Scholar] [CrossRef]
  30. Karthik, A.; Hamatta, H.S.A.; Patthi, S.; Krubakaran, C.; Kumar Pradhan, A.; Rachapudi, V.; Shuaib, M.; Rajaram, A. Ensemble-Based Multimodal Medical Imaging Fusion for Tumor Segmentation. Biomed. Signal Process. Control 2024, 96, 106550. [Google Scholar] [CrossRef]
  31. Amiri, M.H.; Mehrabi Hashjin, N.; Montazeri, M.; Mirjalili, S.; Khodadadi, N. Hippopotamus Optimization Algorithm: A Novel Nature-Inspired Optimization Algorithm. Sci. Rep. 2024, 14, 5032. [Google Scholar] [CrossRef]
  32. Shohag, M.J.I.; Tian, S.; Sriti, N.; Liu, G. Enhancing Bok Choy Growth through Synergistic Effects of Hydrogel and Different Nitrogen Fertilizer Forms. Sci. Hortic. 2024, 336, 113400. [Google Scholar] [CrossRef]
  33. Che, Z.; Peng, C. Improving Support Vector Regression for Predicting Mechanical Properties in Low-Alloy Steel and Comparative Analysis. Mathematics 2024, 12, 1153. [Google Scholar] [CrossRef]
  34. Yang, J.; Kuang, J.; Wang, G.; Zhang, Q.; Liu, Y.; Liu, Q.; Xia, D.; Li, S.; Wang, X.; Wu, D. Adaptive Three-Way KNN Classifier Using Density-Based Granular Balls. Inf. Sci. 2024, 678, 120858. [Google Scholar] [CrossRef]
  35. Han, R. Application of Inertial Navigation High Precision Positioning System Based on SVM Optimization. Syst. Soft Comput. 2024, 6, 200105. [Google Scholar] [CrossRef]
  36. Wang, L.; Chi, J.; Ding, Y.; Yao, H.; Guo, Q.; Yang, H. Transformer Fault Diagnosis Method Based on SMOTE and NGO-GBDT. Sci. Rep. 2024, 14, 7179. [Google Scholar] [CrossRef]
  37. Xue, J.; Shen, B. A Novel Swarm Intelligence Optimization Approach: Sparrow Search Algorithm. Syst. Sci. Control Eng. 2020, 8, 22–34. [Google Scholar] [CrossRef]
  38. Mirjalili, S.; Lewis, A. The Whale Optimization Algorithm. Adv. Eng. Softw. 2016, 95, 51–67. [Google Scholar] [CrossRef]
  39. Chen, Z.; Francis, A.; Li, S.; Liao, B.; Xiao, D.; Ha, T.; Li, J.; Ding, L.; Cao, X. Egret Swarm Optimization Algorithm: An Evolutionary Computation Approach for Model Free Optimization. Biomimetics 2022, 7, 144. [Google Scholar] [CrossRef]
  40. Ebrahimi, M.; Mohammadi-Dehcheshmeh, M.; Ebrahimie, E.; Petrovski, K.R. Comprehensive Analysis of Machine Learning Models for Prediction of Sub-Clinical Mastitis: Deep Learning and Gradient-Boosted Trees Outperform Other Models. Comput. Biol. Med. 2019, 114, 103456. [Google Scholar] [CrossRef]
  41. Fadul-Pacheco, L.; Delgado, H.; Cabrera, V.E. Exploring Machine Learning Algorithms for Early Prediction of Clinical Mastitis. Int. Dairy J. 2021, 119, 105051. [Google Scholar] [CrossRef]
Figure 1. Overall Flow Chart.
Figure 2. Algorithm flow chart.
Figure 3. Iterative curves.
Figure 4. Correlation heat map.
Figure 5. Composite Flowchart.
Table 1. CEC2022 benchmark functions.

Category | No. | Function | Fmin
Unimodal Function | 1 | Shifted and fully Rotated Zakharov Function | 300
Basic Functions | 2 | Shifted and fully Rotated Rosenbrock’s Function | 400
Basic Functions | 3 | Shifted and fully Rotated Expanded Schaffer’s F6 Function | 600
Basic Functions | 4 | Shifted and fully Rotated Non-Continuous Rastrigin’s Function | 800
Basic Functions | 5 | Shifted and fully Rotated Levy Function | 900
Hybrid Functions | 6 | Hybrid Function 1 (N = 3) | 1800
Hybrid Functions | 7 | Hybrid Function 2 (N = 6) | 2000
Hybrid Functions | 8 | Hybrid Function 3 (N = 5) | 2200
Composition Functions | 9 | Composition Function 1 (N = 5) | 2300
Composition Functions | 10 | Composition Function 2 (N = 4) | 2400
Composition Functions | 11 | Composition Function 3 (N = 5) | 2600
Composition Functions | 12 | Composition Function 4 (N = 6) | 2700
Search range: [−100, 100]^D, where D denotes 10 dimensions.
Table 2. Optimization performance indicators of IGOOSE and comparison algorithms for different functions.

Function | Criterion | DBO | HBA | GWO | HHO | GOOSE | IGOOSE
F1 | std | 3.23 × 10² | 2.10 × 10⁶ | 1.57 × 10² | 1.24 × 10¹ | 5.70 × 10⁻⁵ | 8.45 × 10⁻⁵
F1 | avg | 4.49 × 10² | 3.50 × 10² | 3.60 × 10² | 3.51 × 10² | 3.45 × 10² | 3.00 × 10²
F2 | std | 3.17 × 10¹ | 1.29 × 10¹ | 2.41 × 10¹ | 6.19 × 10¹ | 1.32 × 10¹ | 2.56 × 10¹
F2 | avg | 4.27 × 10² | 4.07 × 10² | 4.28 × 10² | 4.58 × 10² | 4.08 × 10² | 4.17 × 10²
F3 | std | 6.96 × 10¹ | 1.05 × 10¹ | 1.06 × 10¹ | 1.06 × 10¹ | 1.15 × 10¹ | 1.90 × 10¹
F3 | avg | 6.09 × 10² | 6.00 × 10² | 6.01 × 10² | 6.19 × 10² | 6.62 × 10² | 6.02 × 10²
F4 | std | 1.03 × 10¹ | 7.90 × 10⁰ | 7.17 × 10⁰ | 7.84 × 10⁰ | 2.05 × 10¹ | 3.56 × 10¹
F4 | avg | 8.31 × 10² | 8.19 × 10² | 8.17 × 10² | 8.27 × 10² | 8.48 × 10² | 8.00 × 10²
F5 | std | 1.19 × 10² | 4.50 × 10¹ | 2.69 × 10¹ | 1.23 × 10² | 5.14 × 10² | 2.50 × 10²
F5 | avg | 9.97 × 10² | 9.23 × 10² | 9.19 × 10² | 1.38 × 10³ | 2.04 × 10³ | 1.27 × 10³
F6 | std | 2.05 × 10³ | 2.16 × 10³ | 2.34 × 10³ | 2.68 × 10³ | 1.99 × 10³ | 1.82 × 10³
F6 | avg | 5.10 × 10³ | 5.63 × 10³ | 5.11 × 10³ | 5.84 × 10³ | 5.72 × 10³ | 5.08 × 10³
F7 | std | 1.19 × 10¹ | 6.20 × 10¹ | 2.35 × 10¹ | 3.46 × 10¹ | 5.92 × 10¹ | 1.35 × 10¹
F7 | avg | 2.03 × 10³ | 2.02 × 10³ | 2.03 × 10³ | 2.07 × 10³ | 2.14 × 10³ | 2.03 × 10³
F8 | std | 7.92 × 10¹ | 3.03 × 10¹ | 5.84 × 10¹ | 1.24 × 10¹ | 1.31 × 10² | 1.09 × 10¹
F8 | avg | 2.23 × 10³ | 2.23 × 10³ | 2.22 × 10³ | 2.23 × 10³ | 2.43 × 10³ | 2.22 × 10³
F9 | std | 1.03 × 10¹ | 2.68 × 10¹ | 4.32 × 10¹ | 3.39 × 10¹ | 4.62 × 10¹ | 1.73 × 10¹
F9 | avg | 2.54 × 10³ | 2.54 × 10³ | 2.58 × 10³ | 2.57 × 10³ | 2.56 × 10³ | 2.44 × 10³
F10 | std | 6.22 × 10¹ | 5.71 × 10¹ | 6.87 × 10¹ | 1.18 × 10² | 6.41 × 10² | 1.60 × 10¹
F10 | avg | 2.40 × 10³ | 2.40 × 10³ | 2.50 × 10³ | 2.42 × 10³ | 2.51 × 10³ | 2.40 × 10³
F11 | std | 1.21 × 10² | 1.25 × 10² | 1.18 × 10² | 3.04 × 10² | 3.46 × 10² | 6.65 × 10¹
F11 | avg | 2.71 × 10³ | 2.67 × 10³ | 2.96 × 10³ | 2.93 × 10³ | 3.68 × 10³ | 2.65 × 10³
F12 | std | 1.42 × 10¹ | 3.13 × 10¹ | 1.24 × 10¹ | 4.42 × 10¹ | 8.63 × 10¹ | 2.88 × 10¹
F12 | avg | 2.87 × 10³ | 2.88 × 10³ | 2.91 × 10³ | 2.94 × 10³ | 2.91 × 10³ | 2.77 × 10³
Table 3. Description of features.

No. | Variable | Mean | Standard Deviation
x1 | Parity | 1.1 | 0.37
x2 | Lactation Persistence | 101.38 | 10.39
x3 | Days in Milk | 138.86 | 64.90
x4 | WHI | 99.98 | 26.33
x5 | Milk Yield | 1.16 | 0.16
x6 | Foremilk Yield | 11.41 | 2.49
x7 | Milk Fat Percentage | 4.46 | 0.85
x8 | Peak Milk Yield | 0.27 | 0.90
x9 | Protein Percentage | 3.7376 | 0.3657
x10 | Peak Lactation Day | 100.88 | 9.48
x11 | Fat-to-Protein Ratio | 1.19 | 0.28
x12 | 305-Day Milk Yield | 7696 | 1684
x13 | Urea Nitrogen | 8.55 | 16.90
x14 | Total Milk Volume | 33.37 | 6.69
x15 | Milk Loss | 0.1789 | 0.6414
x16 | Total Milk Fat | 7720.51 | 1666.25
x17 | Economic Loss | 0.4915 | 0.7616
x18 | Total Protein | 206.36 | 59.64
x19 | Corrected Milk | 166.42 | 49.42
x20 | Adult Equivalents | 8828.01 | 1863.72
x21 | SCS | 2.1 | 1.3
Table 4. Comparison of evaluation indicators of different classification models.

Algorithm | Average Accuracy | Average Recall | Average Precision | Average F1 Score | Average Fitness Value | Average Number of Features
SSA | 0.83 ± 0.016 | 0.80 ± 0.018 | 0.75 ± 0.017 | 0.77 ± 0.015 | 0.170 ± 0.0024 | 9.12 ± 0.42
HBA | 0.81 ± 0.014 | 0.83 ± 0.016 | 0.77 ± 0.014 | 0.79 ± 0.014 | 0.161 ± 0.0021 | 10.08 ± 0.45
CPO | 0.83 ± 0.015 | 0.85 ± 0.017 | 0.71 ± 0.016 | 0.75 ± 0.015 | 0.163 ± 0.0023 | 6.05 ± 0.31
DBO | 0.80 ± 0.017 | 0.78 ± 0.019 | 0.73 ± 0.018 | 0.77 ± 0.016 | 0.163 ± 0.0022 | 6.07 ± 0.33
GOOSE | 0.84 ± 0.015 | 0.84 ± 0.017 | 0.74 ± 0.016 | 0.79 ± 0.014 | 0.163 ± 0.0021 | 7.10 ± 0.38
IGOOSE | 0.85 ± 0.013 | 0.87 ± 0.015 | 0.76 ± 0.014 | 0.80 ± 0.013 | 0.151 ± 0.0019 | 6.04 ± 0.28
Table 5. Comparison of evaluation indicators of different models with statistical measures.

Algorithm | Average Accuracy | Average Recall | Average Precision | Average F1 Score | Average Fitness Value
SSA | 0.83 ± 0.015 | 0.81 ± 0.017 | 0.80 ± 0.016 | 0.79 ± 0.015 | 0.158 ± 0.0023
HBA | 0.84 ± 0.014 | 0.83 ± 0.016 | 0.79 ± 0.015 | 0.80 ± 0.014 | 0.147 ± 0.0021
CPO | 0.84 ± 0.015 | 0.87 ± 0.016 | 0.78 ± 0.015 | 0.85 ± 0.014 | 0.148 ± 0.0022
DBO | 0.84 ± 0.016 | 0.89 ± 0.017 | 0.84 ± 0.016 | 0.79 ± 0.015 | 0.148 ± 0.0023
GOOSE | 0.85 ± 0.014 | 0.84 ± 0.016 | 0.78 ± 0.015 | 0.81 ± 0.014 | 0.146 ± 0.0020
IGOOSE | 0.87 ± 0.012 | 0.88 ± 0.014 | 0.89 ± 0.013 | 0.88 ± 0.012 | 0.145 ± 0.0018
Table 6. Comparison of evaluation indicators of different regression models.

Evaluation Indicators | SVR | GBDT | KNN | XGBoost
Average R2 | 0.578 ± 0.016 | 0.611 ± 0.015 | 0.485 ± 0.018 | 0.631 ± 0.013
Average MAE | 0.646 ± 0.021 | 0.565 ± 0.021 | 0.727 ± 0.023 | 0.552 ± 0.016
Average MSE | 0.753 ± 0.024 | 0.571 ± 0.019 | 0.934 ± 0.026 | 0.565 ± 0.017
Average RMSE | 0.868 ± 0.022 | 0.753 ± 0.020 | 0.967 ± 0.024 | 0.758 ± 0.018
Table 7. Comparison of evaluation indicators of different optimized models.

Evaluation Indicators | SSA | WOA | ESOA | GOOSE | IGOOSE
Average R2 | 0.585 ± 0.015 | 0.623 ± 0.014 | 0.598 ± 0.015 | 0.642 ± 0.013 | 0.663 ± 0.012
Average MAE | 0.675 ± 0.019 | 0.585 ± 0.016 | 0.667 ± 0.018 | 0.531 ± 0.014 | 0.526 ± 0.013
Average MSE | 0.910 ± 0.022 | 0.799 ± 0.018 | 0.870 ± 0.020 | 0.521 ± 0.015 | 0.511 ± 0.014
Average RMSE | 0.954 ± 0.021 | 0.848 ± 0.019 | 0.932 ± 0.020 | 0.722 ± 0.016 | 0.693 ± 0.015
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
