1. Introduction
Lactation curves (LC) provide invaluable information that could be used to evaluate the genetic and management status of lactating dairy cows. An LC is a graphical representation of change in milk yield during the lactation period following calving [
1]. Milk yield typically increases rapidly until the peak is reached and then decreases slowly until the dry period. The shape of an LC, described by peak yield, peak time, and persistency, is used to determine the milking potential of a dairy cow and serves as an index that facilitates feed and health management as well as breeding decisions [
2,
3]. For a large group of cows, the mean LC follows a typical pattern, which can be described mathematically using a simple equation [
4]. Various equations have been proposed [
5,
6,
7,
8], and they fit the mean LC of a large group of cows relatively well [
2,
9].
The shapes of individual LCs, however, are quite diverse because LC shape varies with many factors, including biological factors such as breed, genetic effects, physiological conditions, pregnancy, parity, and age, and environmental factors such as feed, herd management, and calving season [
5,
10,
11,
12,
13]. Some studies have reported atypical shapes, such as the absence of peak yield, in approximately 20∼30% of cases [
9,
14,
15]. Despite their considerable proportion, such cases have been considered outliers or disregarded in analyses, as they are difficult to explain using previously developed equations [
16,
17,
18]. To facilitate precision dairy farming, however, it is useful to identify and analyze LC shapes that can be influenced by individual variation, feed management, or disease. In previous studies, LCs have been analyzed after grouping dairy cows based on production levels, parity differences, or presence of disease [
3,
19,
20,
21]. However, limitations remain because those studies derived LCs for predefined groups instead of extracting physiological characteristics from the diverse LC shapes themselves.
Cluster analysis allows groups to be defined objectively based on phenotypic observations. In particular, unsupervised, non-hierarchical clustering methods (i.e.,
k-means and
k-medoids) are extensively used for the analysis of large datasets because they are powerful and have lower computational cost requirements than hierarchical clustering methods [
22,
23]. Several studies in the field of dairy science have applied the
k-means method to classify dairy cows according to genetic status and health status based on peak milk yield, milk components, and blood properties [
24,
25,
26,
27].
The objective of the present paper, therefore, was to cluster lactation data based on the shapes of LCs using the k-medoids clustering algorithm, a variant of k-means. We obtained three years of milking records from commercial Korean dairy farms and used them for the analyses. After clustering, the representative LCs of each cluster were fitted using several conventional LC models. We also investigated differences in physiological and milking characteristics among the clustered groups.
2. Background
An LC can be regarded as a time series of daily milk production for an individual dairy cow. Therefore, grouping LCs can be treated as a time-series clustering problem. Time-series clustering and its applications have been widely studied in various fields [
28,
29,
30,
31,
32]. Liao T. W. [
28] provided guidance on how to apply clustering algorithms to time-series data. Aghabozorgi et al. [
31] classified time-series clustering into three categories: whole time-series clustering, subsequence time-series clustering, and time point clustering. In that study, the authors focused only on whole time-series clustering because subsequence time-series clustering was considered meaningless by Keogh and Lin [
33] and time point clustering is similar to time-series segmentation [
31].
The two most extensively used time-series clustering algorithms are connectivity-based hierarchical clustering and centroid-based
k-means clustering. Hierarchical clustering either repeatedly merges smaller clusters into larger ones (i.e., agglomerative clustering) or repeatedly divides a larger cluster into smaller clusters until no further partition is possible (i.e., divisive clustering). This clustering algorithm is often used as a first choice for time-series clustering because it requires no prior understanding of the datasets and clusters [
31]. However, due to its high computational complexity, the hierarchical clustering method is not recommended for large datasets [
28,
32]. Conversely,
k-means algorithm is sensitive to initialization; that is, its result relies heavily on the initial centers of the clusters [
31,
34]. In addition,
k-means algorithm has a slow initial convergence rate [
34]. To address the challenge, a weighted probability distribution can be adopted to initialize the center of the cluster, which is introduced in the
k-means++ algorithm [
34]. In the method, each subsequent initial center is selected from the data points with probability $D(x)^2 / \sum_{x'} D(x')^2$, where $D(x)$ is the shortest distance from data point $x$ to the nearest center already selected. Therefore, the method increases the rate of convergence and spreads the initial centers more evenly.
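The seeding step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the present study: the function names (`squared_distance`, `kmeanspp_seeds`), the use of squared Euclidean distance between curves, and the toy data are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length curves (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers using D(x)^2-weighted sampling (k-means++ seeding)."""
    rng = rng or random.Random(0)
    # The first center is drawn uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform choice.
            centers.append(rng.choice(points))
            continue
        # Points far from all current centers are more likely to be selected.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


if __name__ == "__main__":
    # Hypothetical toy curves (three time points each), not data from the study.
    toy_curves = [(30, 40, 35), (31, 41, 36), (20, 22, 21), (21, 23, 22), (45, 50, 30)]
    print(kmeanspp_seeds(toy_curves, k=3))
```

Because points that already serve as centers have zero weight, the same data point is not drawn twice as long as at least one other distinct point remains.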
Although the
k-means algorithm is the most extensively applied, the
k-medoids algorithm has promising applications because it is more robust against noise and outliers [
23,
35]. For example, Sauder et al. [
36] compared the performance of hierarchical clustering and two non-hierarchical clustering algorithms (i.e.,
k-means and
k-medoids) in the classification of Holstein dairy cows based on their growth curves, and reported that the
k-medoids method was the most appropriate for grouping cows. They also observed significant differences in milking performance between the clustered groups.
K-medoids algorithm is a
k-means variant that compensates for the noise sensitivity of the
k-means algorithm.
K-means algorithm adopts the mean value as the center of the cluster. Therefore, it naturally suffers from noise [
32]. Conversely,
k-medoids algorithm is less sensitive to outliers because it selects an actual data point (the medoid), rather than the mean, as the center of each cluster. This makes the computation procedure in
k-medoids more complex than the procedure in
k-means algorithm; however,
k-medoids is still a more powerful tool for clustering large datasets than hierarchical clustering [
31]. In addition, unlike
k-means clustering,
k-medoids does not require additional calculation for the inter-LC distances whenever the center is updated.
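The point about inter-LC distances can be illustrated with a minimal k-medoids sketch. Because each medoid is itself one of the LCs, all pairwise distances can be computed once and only looked up afterwards. The sketch below uses a simple alternating (assign, then update) scheme with Euclidean distance. Both choices, as well as the function names and the toy curves, are assumptions for illustration and are not necessarily the variant or distance measure used in the present study.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length curves (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(dist, medoids):
    """Label each curve with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(curves, initial_medoids, max_iter=100):
    """Alternating k-medoids on a distance matrix that is computed only once."""
    n = len(curves)
    # Pairwise distances are computed a single time and reused in every iteration.
    dist = [[euclidean(curves[i], curves[j]) for j in range(n)] for i in range(n)]
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the previous medoid if a cluster happens to be empty.
                new_medoids.append(old)
                continue
            # The new medoid is the member with the smallest total distance
            # to the other members; only table lookups are needed here.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


if __name__ == "__main__":
    # Hypothetical toy curves, not data from the study.
    toy = [[10, 20, 15], [11, 21, 16], [12, 19, 14],
           [30, 30, 30], [31, 29, 30], [29, 31, 31]]
    print(k_medoids(toy, initial_medoids=[1, 5]))
```

In k-means, by contrast, each updated center is a newly computed mean curve, so the distances from every LC to the new centers have to be recalculated after each update.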
4. Results
The
k-medoids clustering algorithm ($k = 6$) grouped 330 datasets into six clusters of 119, 64, 50, 47, 38, and 12 datasets. The clusters were denoted (a)–(f) (
Table 4). The largest cluster, (a), contained 36% of the datasets, while the smallest cluster, (f), contained only 4%. Excluding the 70th DIM in a lactation, there were significant differences in parity, average daily milk yield, total milk yield from 10 to 280 DIM, peak DIM, and peak milk yield (
) among the clusters. Clusters (a) and (d), which had high proportions of multiparous cows, had an average parity of 2.6, which was significantly higher than those of the other clusters (
). In contrast, clusters (e) and (f) contained more primiparous cows than multiparous cows, and had lower average parities when compared with the other clusters (
). The daily average milk yield and total milk yield were the highest in cluster (a) (39.5 L and 10,713 L) and the lowest in cluster (e) (35 L and 9509 L) when compared to other clusters (
). Peak milk yield was the highest in cluster (d) (54 L), with an 11.6-L difference from cluster (e), which had the lowest peak yield (
). Clusters (a), (b), and (d) had peak yields at the early lactation period, while clusters (c), (e), and (f) had peak yields at the mid-lactation period. In cluster (e), which had the lowest peak yield, DIM at peak yield was 144 days, which was the latest among the six clusters (
).
The regression graphs and model parameters of the three conventional models for the average lactation data in the clusters are presented in
Figure 2 and
Table 5, respectively. The curves of clusters (a) and (b) displayed a typical shape similar to that of the average LC of the 330 lactations, while those of clusters (d) and (f) had different shapes (
Figure 2). The curves of clusters (c) and (e) were gentler, with lower peak milk yields than those of clusters (a) and (b). The cluster (d) curve exhibited a rapid decline in milk production in the mid-lactation period after a high peak yield, and the cluster (f) curve displayed an abnormal shape rather than a general LC. None of the three models fitted the cluster (f) curve properly, and distorted parameter estimates were derived from the Wilmink model (
Table 5). When the
s of the three models for the clusters were compared, the model with the lowest
varied across the clusters (Wood model-cluster [e], Wilmink model-clusters [a] and [c], Dijkstra model-cluster [b]). The clusters (d) and (f), which exhibited abnormal shapes, had high
s (2.53 and 1.96, respectively) on average for all models, unlike other clusters. The
for the lactation data of the samples within the cluster was the lowest in cluster (a) with the ideal curve shape, and the
of cluster (f) was twice as high as that of cluster (a).
The estimated values of the parameters are listed in
Table 6, except for the Wilmink model. Predicted values of peak milk yield were close between the Wood model and the Dijkstra model, but were underestimated by 6 L when compared to the actual lactation data. The greatest difference was 8 L, which was observed in cluster (d). Except in cluster (a), the calculated peak DIM value varied between the two models, and there was a gap of 15 days on average. The greatest difference between the models was 20 days, which was observed in cluster (d). Compared to the actual lactation value, the difference in the predicted peak DIM value from the two models was relatively low, at 3 days in clusters (a) and (e), but high in cluster (d), at 34 days. Persistency, the relative declining rate at the half point between peak milk yield and the end of lactation, was the lowest in cluster (a) and the highest in cluster (e).
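As an illustration of how peak yield, peak DIM, and a persistency-type measure follow from fitted parameters, the sketch below uses the standard form of the Wood model, y(t) = a·t^b·exp(−c·t), for which the peak occurs at t = b/c and the relative rate of change is (dy/dt)/y = b/t − c. The parameter values, the end of lactation at 280 DIM, and the reading of persistency as the relative rate at the midpoint between peak and end of lactation are assumptions for this example. The exact formulas and estimates used in the present study are those reported with Table 6.

```python
import math


def wood_yield(t, a, b, c):
    """Daily milk yield at day t under the Wood model y = a * t**b * exp(-c * t)."""
    return a * t ** b * math.exp(-c * t)


def wood_peak_dim(b, c):
    """Day of peak yield: dy/dt = 0 gives t = b / c."""
    return b / c


def wood_peak_yield(a, b, c):
    """Yield at the peak day, equal to a * (b / c)**b * exp(-b)."""
    return wood_yield(wood_peak_dim(b, c), a, b, c)


def wood_relative_rate(t, b, c):
    """Relative rate of change (dy/dt) / y = b / t - c; negative after the peak."""
    return b / t - c


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    a, b, c = 20.0, 0.2, 0.004
    end_dim = 280  # assumed end of the analysed lactation period
    peak_dim = wood_peak_dim(b, c)
    midpoint = (peak_dim + end_dim) / 2
    print("peak DIM:", peak_dim)
    print("peak yield (L):", wood_peak_yield(a, b, c))
    print("relative rate at midpoint (per day):", wood_relative_rate(midpoint, b, c))
```

A curve without a peak inside the lactation period, such as that of cluster (f), would give b/c values outside the observed range or undefined, which is one way distorted parameter estimates show up.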
5. Discussion
Clustering of LC shapes yielded clusters (a), (b), (c), and (e), which accounted for 82% of the total individuals, and had typical LC shapes (
Table 4). In particular, clusters (a) and (e) showed typical LC shapes of multiparous and primiparous cows, respectively. Primiparous cows generally have a flatter LC and have relatively high persistency, whereas multiparous cows exhibit rapid increases in daily milk yield from calving to the peak milk yield, followed by a significant decline [
5,
43]. Our study revealed significant differences in milking characteristics such as peak milk yield, peak time, and persistency (
). The results are consistent with the findings of previous studies [
44,
45,
46,
47,
48,
49], which reported that total milk yield in a 305-d lactation increases with an increase in the parity of Holstein cows. Particularly, some studies reported that peak milk yield increased dramatically from parity 1 to parity 2, and peak milk yield was observed at later periods in primiparous cows and earlier in multiparous cows [
47,
48]. Other studies have reported that milk yield is generally higher in multiparous cows than in primiparous cows, while persistency is often greater in the primiparous cows with less developed mammary glands [
50,
51,
52]. In addition, most primiparous cows are not physically mature [
53]. Since primiparous cows require nutrients for their own growth, their metabolic status is different from that of multiparous cows [
54]. Such results support the observation that the clustering method used in the present study can discriminate milking characteristics such as total milk yield, peak milk yield, and DIM at peak yield, which vary depending on parity.
Clusters (d) and (f), which accounted for approximately 18% of the cows (
Table 4), exhibited abnormal LC shapes (
Figure 2). Cluster (f), which consisted of only 12 individuals (4%), had an LC shape with no peaks and no significant changes in milk production. The LC of cluster (d), which had high and rapid peaks, had an undulating shape due to a sharp decrease in milk yield during the mid-lactation period. Previous studies have reported that LCs with atypical shapes, such as no peak in LC, account for approximately 20∼30% of individuals in datasets [
2,
9,
13,
14,
15], which is consistent with the observations of the present study. Some studies have also reported that very high peaks in the early lactation stages are accompanied by sharp decreases in milk yield during subsequent lactation periods, which could be due to metabolic stress and negative energy balance in some instances [
55,
56]. In addition, previous studies have reported that cows with high milk production and relatively early peak milk yields in early-lactation periods exhibit slow rates of recovery of body condition score, increased physiological stress, and have high risks of udder disease during mid- and late lactation periods [
57,
58,
59]. The results suggest that the undulating shape of cluster (d) may indicate an energy imbalance or a metabolic disorder. However, such atypical shapes (e.g., non-peak or undulating) of LCs could also be caused by individual characteristics [
2,
9,
60]. Arnal, et al. [
61] classified LCs of dairy goats using principal component analysis and reported that LCs that were undulating in the mid-lactation (120 DIM) period after peak yield were not associated with health problems or environmental factors. To facilitate precision dairy farming, it is critical to determine when such atypical LC shapes emerge and whether they are attributable to individual variation or to health problems. Therefore, future studies should develop algorithms that can distinguish between these causes.
The best-fitting model differed among the clusters, as shown in
Table 5. The results are consistent with those of previous studies [
3,
19]. Previous studies performed goodness-of-fit tests of LC models for lactation data by grouping the data according to the period in which the milking data were obtained, the level of milk production, and parity, with the aim of finding a model that optimally describes lactation. A direct comparison of the fitting accuracy of the present method with that of the previous studies is difficult because the data collection duration, the number and breed of cows, the metric of average milk yield, and the data grouping method were not standardized among the previous studies. The LC models in previous studies [
3,
19] were established from groups selected from a large breeding population according to milk production ability and parity. Abnormal populations were not considered in the LC models and were excluded as exceptional cases. The previous approaches are therefore not appropriate for studying atypical LCs, whereas the clustering method can take the LC characteristics of atypical cases into account.
The three conventional LC models applied in the present study had high fitting errors for clusters with atypical shapes (
Table 5). Cluster (d), which had the undulating shape, had the highest fitting error among the clusters, and cluster (f), which exhibited an abnormal shape without peaks, had distorted parameter estimates. Consequently, it was not possible to calculate peak yield, peak DIM, and persistency in cluster (f) using parameters derived from the model (
Table 6). In addition, in cluster (d), the peak milk yield and peak DIM calculated using model parameters differed more from the actual milking data than in the other clusters. Cluster (f) accounted for only 4% of the individuals in the entire dataset, and the individual LCs within the cluster were less similar to one another. Therefore, the individual data should be validated before further analyses (
Table 5;
). However, cluster (d) individuals accounted for a considerable proportion of the total data (47 out of 330 cows). In addition, since the similarity between individual LCs in the group was comparable to the similarity observed within other clusters (
Table 5;
), it can be considered as one of the LC types that can emerge in the commercial farms. In previous studies, LCs with atypical shapes have been omitted from initial datasets or their influence on the overall dataset has been diluted using mean values [
16,
17,
18]. As mentioned above, such LC types are attributable to diverse factors ranging from simple individual differences to health problems [
2,
9,
55,
56,
60], so more detailed investigations are required to determine their specific origins. In the present study, the conventional models had high fitting errors for LCs with atypical shapes, which is consistent with the findings of previous studies [
9,
15,
62]. Therefore, it is necessary to develop a high-resolution model that can distinguish detailed features with high goodness of fit for diverse LC shapes.
In summary, clusters delineated from LC shape have different milking characteristics due to the varying proportions of multiparous and primiparous cows in each cluster. In addition, 18% of the individuals had atypical LC shapes. One of the clusters had an undulating LC shape and accounted for 14% of the individuals. This observation is presumably due to metabolic problems caused by rapid peaks and high production at the early stages of lactation. When the three popular LC models were fitted to the average lactation data of each cluster, the goodness of fit varied among the clusters. We confirmed, however, that the conventional models are not suitable for clusters with atypical shapes due to high fitting errors. The method proposed in the present study may facilitate the identification and management of cows that require attention in herds. One practical application of this study is the prediction of the daily milk production of individual cows using the cluster information. If the LC of a currently milking cow follows that of a typically shaped cluster, then the regression model of the cluster can be used as a predictor. The present study, however, was carried out using a limited amount of lactation data, so more commercial farm data should be collected to validate the various LC shapes. In addition, the relationships between LC shapes and the factors influencing them should be investigated further based on biological and environmental data from individuals linked to lactation data. It is also necessary to investigate how many clusters of lactation data are appropriate for application in herd management. Finally, further studies should be conducted to develop a model that can mathematically explain LCs with various shapes.
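The prediction idea mentioned above could be prototyped roughly as follows. This is a hypothetical sketch, not a procedure evaluated in the present study: the helper names, the use of root-mean-square error over the days observed so far, and the toy representative curves are all assumptions made for illustration.

```python
import math


def rmse(observed, reference):
    """Root-mean-square error over the days observed so far."""
    n = len(observed)
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(observed, reference)) / n)


def predict_remaining(partial_curve, representatives):
    """Match a partial LC to the closest representative curve.

    Returns the name of the best-matching cluster and the remainder of its
    representative curve, used here as a naive forecast of future yields.
    """
    n = len(partial_curve)
    best = min(representatives,
               key=lambda name: rmse(partial_curve, representatives[name][:n]))
    return best, representatives[best][n:]


if __name__ == "__main__":
    # Hypothetical representative curves (one value per period), not study data.
    reps = {
        "typical": [30, 40, 45, 44, 42, 40],
        "flat": [33, 34, 35, 35, 34, 34],
    }
    print(predict_remaining([31, 41, 44], reps))
```

In practice, the representative curve would be replaced by the fitted regression model of the matched cluster, and a match to an atypically shaped cluster would flag the cow for closer inspection rather than yield a forecast.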