1. Introduction
Dairy farming is a vital component of the livestock industry. However, traditional manual farming methods are characterized by high labor intensity, high time consumption, and notable subjectivity [1,2]. Therefore, supported by rapidly advancing information and intelligent technologies, dairy farming is developing towards informatization, intelligence, and precision [3]. Precision and intelligent farming enables the real-time monitoring of the behavior of individual cows and facilitates timely management decisions [4]. The identification of individual cows serves as a prerequisite and foundation for achieving precision and intelligent farming [5,6].
Currently, radio frequency identification (RFID) technology is widely employed for identifying individual cows in large-scale dairy farms. This technology uses tags (typically ear tags) attached to cows and wireless transmission to record individual cow information [7,8]. Compared with traditional manual methods, RFID technology improves identification accuracy and work efficiency. However, attaching ear tags is labor intensive and may cause animal stress, impacting animal welfare, and ear tags are easily lost and damaged [9].
With the advancement of computer technology, contactless cattle identification methods based on computer vision have increasingly emerged [10,11,12,13]. These methods extract biometric features from cow images captured by cameras, including nose and mouth patterns [14], iris patterns [15], facial contour textures [16], and body patterns [17,18]. This technique enables automatic, contactless, and accurate image recognition without the need for extensive manual operations [19]. However, nose and mouth patterns, iris patterns, and facial contour textures, as physiological features of the head, require a high degree of cooperation from cows during extraction, and the image acquisition process is highly sensitive to the angle and position of the camera. Kumar et al. [20] proposed a deep learning-based method for identifying individual cattle through the recognition of muzzle and nose pattern image features. This method achieved relatively high recognition accuracy, but it required close-range capture of the muzzle and nose pattern images, and dirt on the cattle's muzzle and nose had to be wiped off before image acquisition, which in turn demanded a high level of cooperation from the cattle. Lu et al. [15] developed a dairy cattle iris recognition method based on the two-dimensional complex wavelet transform. Although the reliability of this method was validated, extracting ocular information proved challenging: to obtain high-quality images, the animals' heads had to be restrained during acquisition to prevent excessive head movement, a process that was overly complex and still did not guarantee image quality. Yang et al. [16] proposed a dairy cattle facial recognition method based on an improved RetinaFace and FaceNet. By optimizing the network architecture and loss function, the method achieved efficient feature extraction and precise matching. Qi Yongsheng et al. [21] proposed a dual-branch network for recognizing partially occluded cattle faces. By employing depthwise separable convolution and multi-scale hybrid pooling, the method achieved occlusion segmentation and feature restoration, resulting in an average accuracy of 86.34% for occluded cattle face recognition. Although cameras can capture the facial features of cattle, the animals' low cooperation during image acquisition necessitates specialized equipment to restrain them in order to obtain clear, high-quality cattle face images.
Body patterns denote the systematic arrangement of black and white fur on the torso of Holstein cattle. They are widely distributed, yet their features remain distinguishable and stable over time. Images of body patterns can be captured through side-view or overhead-view photos or videos of cows while they are walking. Zhao Kaixuan and He Dongjian (2015) derived a 48 × 48 feature matrix from images of the cow torso and developed a convolutional neural network (CNN) as a model for individual cow recognition [22]. Their dataset included 30 cows, and they obtained an identification accuracy of 90.55%. However, in side-view image acquisition, the overlap and occlusion of multiple cows often occur, reducing trunk localization accuracy. Therefore, researchers have explored identity recognition methods based on top-view images. He Dongjian and Liu Jianmin (2020) proposed an individual identification method for dairy cows based on video analysis and an improved deep CNN, namely the You Only Look Once version 3 (YOLOv3) algorithm, together with a video acquisition scheme for the backs of cows entering and exiting the milking parlor, achieving an identification accuracy of 95.91% [23]. Xiao et al. (2022) proposed an individual cow identification method based on an improved Mask R-CNN model. In this method, an improved Mask R-CNN algorithm was employed to segment recorded back images of cows, the optimal feature subset was obtained, and an SVM classifier was then employed to classify and identify the back images, achieving an identification accuracy of 98.67% [24]. These methods all use color images to locate the trunk of cows and extract body pattern features. However, gait activity affects the angle of the cow relative to the camera, resulting in inconsistent orientations and positions in the obtained color images.
Despite the high accuracy achieved by two-dimensional image analysis techniques in individual dairy cattle identification, practical performance remains susceptible to significant fluctuations caused by variations in illumination, interference from complex backgrounds, and other environmental variables. Zhao et al. [13] utilized depth images to generate point cloud data of cattle backs and combined voxelization with convex hull algorithms to quantify the degree of fat depression around the skeletal structure. The convex hull distance feature, calculated as the distance from peripheral voxels to the surface of the convex hull, reflects the uniqueness of the structural characteristics of the hook and pin bones. Kyaw et al. [25] addressed the light sensitivity issue of traditional 2D image-based cattle identification through 3D point cloud segmentation. By integrating the PointNet++ algorithm with multi-scale feature extraction, they achieved high-precision segmentation of the cattle back region. Menezes et al. [26] developed a computer vision system (CVS) for animal identification that recognizes individual cattle through key points on the animal's body surface. The system locates seven specific anatomical landmarks (left and right hips, left and right pin bones, tail head, sacral, and cervical vertebrae) to establish a key point model and performs individual identification based on the Euclidean distances between these key points.
Building on the existing research, this paper proposes an individual cow identification method based on anchor point detection and body pattern characteristics. In this method, a three-dimensional point cloud is employed to standardize the direction and position of the top-view cow images, key points are located to determine the key dorsal areas, and body pattern features are extracted from these areas for identity determination. By analyzing the three-dimensional structure of cows, the proposed method standardizes the direction and position of the obtained cow images, thereby correcting deformation in the color images. The detailed procedure is as follows. First, the color and depth images are registered to generate a colored point cloud of the back of the selected cow. Second, the point cloud is subjected to pose normalization, background and noise removal, and clustering analysis to extract the back trunk of the cow. The improved PointNet++ model is then employed to coarsely locate the hook and pin bones and segment the corresponding regions. The point cloud of the key regions is subsequently converted into a body pattern image, which is binarized using the Otsu thresholding method. Finally, the body pattern images are input into an improved ConvNeXt classification model to realize body pattern image classification. The method is robust to the position and angle of the cow in the top-view field of vision. The contributions of this paper are as follows.
A new method for individual dairy cattle identification is proposed, which includes the following steps: first, the hook and pin bone points are accurately located; then, the key areas are extracted and converted into two-dimensional images; and finally, the individual identity of the cattle is recognized by classifying the body pattern images.
A method for locating the hook and pin bone points on the cattle back is proposed, which reduces the impact of the environment and camera angle on torso localization and offers better robustness.
A method for individual cattle identification based on the improved PointNet++ and ConvNeXt models is proposed, which combines the three-dimensional structural features of cattle backs with body pattern features to improve the accuracy of individual cattle identification.
2. Materials and Methods
2.1. Data Acquisition
To verify the robustness of the proposed algorithm, this study established two scenarios for the data collection of dairy cow images. In the first scenario, a camera was mounted above the feeding stall to capture top-view images of cows’ backs. In the second scenario, a camera was installed above the passageway leading to the milking parlor to collect top-view images of cows’ backs while they were walking. The images captured above the feeding stall were used for training and analysis of the method proposed in this study, while the images collected above the walking passageway were designated as the test set to validate the robustness of the proposed method.
(1) Dataset 1
The experimental data were collected at Shengsheng Farm, located in Luoyang, Henan Province, from 24 to 25 October 2024. The subjects of data collection were lactating Holstein cows during feeding. Using a camera stand, an Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) was mounted above the back of the cow at a height of 2.9 m from the ground. The data collection setup is shown in Figure 1.
A depth camera was used to simultaneously capture top-view color images and depth images of the back of cows, as shown in Figure 2, both with a resolution of 640 × 360 pixels. A total of 40 cows were surveyed, resulting in 7600 color images and 7600 corresponding depth images. These images were split into training, validation, and test sets at a ratio of 7:2:1.
The distribution of the proportion of black fur on the cows' backs across all the data was analyzed. The data were divided into five groups from 0 to 100%, with an interval of 20% per group; the resulting histogram is shown in Figure 3. The occurrence probability of the groups with black back-area percentages of 0–20% and 80–100% was less than 10%, and the overall distribution was normal.
(2) Dataset 2
To validate the robustness of the proposed method for individual dairy cow identification, a more complex moving-scene dataset was established for testing. An Intel RealSense D455 depth camera was mounted above the passage leading to the milking parlor, as shown in Figure 4, to capture top-view back RGB-D data of cows while they were walking. A total of 200 top-view back depth images and 200 color images were collected from 10 dairy cows, each with a resolution of 640 × 360 pixels, as shown in Figure 5.
2.2. Converting Depth Images into Point Cloud
To analyze the three-dimensional characteristics of the backs of the selected cows, the depth and color images obtained with the Intel RealSense D455 depth camera in Dataset 1 were registered to generate a point cloud. Then, operations such as pose normalization, ground removal, clustering analysis, noise and background removal, spine line fitting, and the rotation and translation of the point cloud were performed to align the spine line with the X-axis.
2.2.1. Depth Image Preprocessing
Before registering the depth and color images to generate a 3D point cloud, it can be observed that some data loss occurs in the depth image shown in Figure 2b. To improve the quality of the depth images of dairy cows and ensure the accuracy and robustness of subsequent analyses, a comprehensive preprocessing approach was employed to address potential issues such as local depth information loss, noise interference, and outlier points. Specifically, three methods, namely nearest neighbor interpolation, median filtering, and bilateral filtering, were utilized to repair missing information in the depth images, remove noise and outliers, and smooth the images while preserving edge features. The results of this preprocessing are shown in Figure 6.
Firstly, to address the potential local depth information loss in the depth images, nearest neighbor interpolation was employed. This method leverages the known depth values surrounding the missing pixels to infer and fill in the voids within the image. By doing so, it ensures the continuity of depth information and mitigates the discontinuity issues at the boundaries caused by data loss.
Secondly, in the environment where the dairy cows are located, the presence of water stains, mud, or other impurities on the ground can lead to abnormally high or low reflective regions in the depth images. To address this, median filtering was applied to smooth the depth images and remove noise and outliers. Median filtering effectively suppresses random noise arising from sensor errors, environmental interference, or the data acquisition process. It reduces the impact of these anomalies on subsequent processing, thereby enhancing the stability and usability of the images.
Furthermore, to preserve the edge features of the images while denoising, bilateral filtering was introduced. Bilateral filtering is a non-linear filtering technique that combines spatial proximity and pixel intensity similarity for weighted smoothing. It suppresses noise while maintaining the sharpness of object edges. This is particularly important for the depth images of dairy cows, as the contours and feature points of the cows need to be as clear as possible. High reflective regions, such as those caused by spots or wet areas on the cows’ skin, can interfere with depth measurements. Bilateral filtering can mitigate the influence of these high reflective areas, resulting in smoother depth images while retaining critical boundary information. This provides more stable data support for the subsequent feature extraction and target recognition.
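These three operations map directly onto standard SciPy and OpenCV routines. The following sketch is a minimal illustration, assuming a single-channel 16-bit depth image in which missing pixels are stored as 0; the kernel sizes and filter parameters are illustrative choices rather than the exact values used in this study.

```python
import cv2
import numpy as np
from scipy import ndimage

def preprocess_depth(depth: np.ndarray) -> np.ndarray:
    """Fill holes, suppress noise, and apply edge-preserving smoothing to a
    depth image. `depth` is assumed to be uint16 with missing pixels set to 0."""
    # 1) Nearest neighbor interpolation: each missing pixel takes the value of
    #    its closest valid pixel, found with a Euclidean distance transform.
    missing = depth == 0
    if missing.any():
        idx = ndimage.distance_transform_edt(
            missing, return_distances=False, return_indices=True)
        depth = depth[tuple(idx)]

    # 2) Median filtering removes salt-and-pepper noise and isolated outliers
    #    caused by reflective spots (water, mud) on the ground.
    depth = cv2.medianBlur(depth, ksize=5)

    # 3) Bilateral filtering smooths the surface while preserving the sharp
    #    depth discontinuities at the cow's contour.
    depth_f = depth.astype(np.float32)
    depth_f = cv2.bilateralFilter(depth_f, d=9, sigmaColor=30.0, sigmaSpace=7.0)
    return depth_f
```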
2.2.2. Point Cloud Attitude Normalization
After preprocessing, the depth images were transformed into point clouds through the inverse perspective projection transformation (Formula (1)).
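Formula (1) corresponds to the standard pinhole back-projection. A minimal NumPy sketch is shown below, assuming the camera intrinsics (focal lengths fx, fy and principal point cx, cy) are taken from the RealSense calibration and that depth values are metric; the function name and parameters are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth image (H x W) into an N x 3 point cloud.

    Inverse perspective projection:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth(v, u)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels
```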
Following the conversion of the depth images to three-dimensional point clouds, and to facilitate the precise analysis of the body pattern features of dairy cows, the RGB information from the color images must be mapped to the generated 3D point cloud. This is achieved through the registration of the color and depth images, ensuring that each point in the 3D point cloud corresponds to an accurate RGB color value. The resultant visualization is shown in Figure 7. As can be seen from the figure, the colored point cloud data clearly depict the three-dimensional morphology of the cow's back and the distribution of its body patterns.
Prior to the removal of background and noise, and to facilitate the subsequent unified processing of the cow point clouds, the point clouds should be normalized. Based on the ground in the background point cloud, each cow point cloud was translated and rotated so that the ground plane coincides with the XOY plane.
The ground is typically a large planar region, which can be fitted using methods such as least squares. Specifically, the x, y, and z coordinates of all the points in the point cloud are substituted into the plane equation. The best plane parameters A, B, C, and D can be determined using the least squares method. The resulting fitted plane can be expressed as Ax + By + Cz + D = 0.
To align the fitted plane with the XOY plane, we extracted the normal vector $\mathbf{n} = (A, B, C)$ of the fitted plane and the normal vector $\mathbf{n}_z = (0, 0, 1)$ of the XOY plane. Then, we computed a rotation matrix via the Rodrigues model (Formula (2)). This matrix was applied to align the normal vector of the fitted plane with the positive direction of the Z-axis, thereby rotating the fitted plane to coincide with the XOY plane and correcting the tilt in the point cloud. The core of the Rodrigues method is to convert a rotation axis vector $\mathbf{k}$ and a rotation angle $\theta$ into a rotation matrix. For the rotation between the vector $\mathbf{n}$ and the target vector $\mathbf{n}_z$, the cross product is employed to calculate the rotation axis, and the dot product is applied to calculate the cosine of the rotation angle:

$$R = I + \sin\theta\,[\mathbf{k}]_{\times} + (1 - \cos\theta)\,[\mathbf{k}]_{\times}^{2}$$

where $R$ is the rotation matrix, $I$ is a 3 × 3 identity matrix, and $\theta$ is the rotation angle, which is calculated as the angle between the two vectors (using the dot product and the magnitudes of the vectors). Moreover, $\mathbf{k}$ is the rotation axis, which is the normalized cross product of the two vectors, and $[\mathbf{k}]_{\times}$ is the skew-symmetric matrix of $\mathbf{k}$ (also referred to as the cross-product matrix or the antisymmetric matrix), which is defined as follows:

$$[\mathbf{k}]_{\times} = \begin{pmatrix} 0 & -k_z & k_y \\ k_z & 0 & -k_x \\ -k_y & k_x & 0 \end{pmatrix}$$

where $k_x$, $k_y$, and $k_z$ are the components of the vector $\mathbf{k}$.
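A compact NumPy sketch of the ground fitting and Rodrigues alignment is given below. It assumes a near-horizontal ground, so the plane is fitted in the equivalent form z = ax + by + c; the function names are illustrative, not part of the original implementation.

```python
import numpy as np

def fit_ground_plane(points: np.ndarray) -> np.ndarray:
    """Least-squares fit of a near-horizontal plane z = a*x + b*y + c.
    Returns the unit plane normal, oriented along +Z."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    (a, b, c), *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    n = np.array([a, b, -1.0])
    n /= np.linalg.norm(n)
    return n if n[2] > 0 else -n            # make the normal point upward

def rotation_to_z(n: np.ndarray) -> np.ndarray:
    """Rodrigues rotation matrix that maps the unit vector n onto (0, 0, 1)."""
    target = np.array([0.0, 0.0, 1.0])
    axis = np.cross(n, target)
    s = np.linalg.norm(axis)                # sin(theta)
    c = float(np.dot(n, target))            # cos(theta)
    if s < 1e-8:                            # already aligned (or opposite)
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # skew-symmetric cross-product matrix
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

# Usage: rotate the whole cloud so the fitted ground coincides with the XOY plane.
# normal = fit_ground_plane(ground_points)
# cloud_aligned = cloud @ rotation_to_z(normal).T
```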
2.2.3. Point Cloud Noise and Background Removal
Since the cow images were captured during feeding, the environment is complex. To eliminate the influence of the environment, it is necessary to remove the background and noise from the collected data and extract the cow trunks. The resultant visualization of the cattle torso after background removal is shown in Figure 8.
(1) Ground Removal
With the above process, the orientation of the point cloud is normalized, with the ground aligned to the XOY plane. Since this study focuses on the top-view point cloud of the cattle's back, which lies at a certain distance from the ground, and considering that the farm's ground is uneven, the ground is removed as completely as possible by thresholding the z-value: points with z < 100 are classified as part of the ground, and clipping all points in this region effectively eliminates the ground portion.
(2) Filtering Processing
The remaining portion of the point cloud was then clustered via the density-based spatial clustering of applications with noise (DBSCAN) method for cow point cloud analysis [27]. The DBSCAN algorithm is a typical density-based spatial clustering algorithm that divides high-density point regions into clusters and can effectively filter out low-density regions; it can achieve clustering of arbitrary shapes in datasets containing noise [28]. The traditional k-means algorithm [29] can hardly resolve non-spherical clusters and clusters of different sizes, whereas DBSCAN groups clusters on the basis of density differences, allowing it to handle clusters of varying sizes and shapes [30]. Therefore, the DBSCAN algorithm was selected for cluster analysis of the cow point clouds in this study. This method effectively distinguished the point cloud of the cattle's back from railings and other objects, retaining only the point cloud corresponding to the cattle's back. Subsequently, statistical filtering was applied to the point cloud for noise removal and outlier elimination. Through statistical filtering, environmental noise points within the cattle point cloud were effectively removed, while the authentic three-dimensional structural information of the cattle's back was preserved. This approach enhanced the quality and accuracy of the point cloud.
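With the Open3D library, the clustering and statistical filtering steps can be sketched as follows; the eps, min_points, nb_neighbors, and std_ratio values are illustrative placeholders rather than the parameters used in this study, and the largest cluster is assumed to be the cow's back.

```python
import numpy as np
import open3d as o3d

def extract_cow_cluster(pcd: o3d.geometry.PointCloud) -> o3d.geometry.PointCloud:
    """Keep the largest DBSCAN cluster (assumed to be the cow's back) and
    remove residual outliers with statistical filtering."""
    # Density-based clustering; points labelled -1 are treated as noise.
    labels = np.array(pcd.cluster_dbscan(eps=0.05, min_points=20))
    valid = labels[labels >= 0]
    if valid.size == 0:
        return pcd
    largest = int(np.bincount(valid).argmax())
    cow = pcd.select_by_index(np.where(labels == largest)[0].tolist())

    # Statistical outlier removal: discard points whose mean distance to their
    # neighbors deviates strongly from the global average.
    cow, _ = cow.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return cow
```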
(3) Point Cloud Rotation
To facilitate subsequent processing, the cow point cloud must be rotated so that the spine line is aligned with the X-axis. The resultant visualization is shown in Figure 9. Based on the three-dimensional structural characteristics of the cattle, the z-values in each column are compared to find the point with the maximum z-value in each column. These points are then linearly fitted to determine a preliminary spine fitting line. A depth threshold was then set, and all points with depths greater than this value were fitted to obtain the final spine line of the cow. To enable the subsequent differentiation between the left and right hook and pin bones, the cow point cloud was rotated so that the spine line became parallel to the X-axis, after which it was translated until the spine line coincided with the X-axis.
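A NumPy sketch of this spine alignment is given below. Only the preliminary pass over the column maxima is shown, the binning direction and bin count are assumptions, and the depth-threshold refinement described above is omitted for brevity.

```python
import numpy as np

def align_spine_to_x(points: np.ndarray, n_bins: int = 60) -> np.ndarray:
    """Fit the spine line from the highest point in each slice along x and
    rotate/translate the cloud so that this line lies on the X-axis."""
    x, y, z = points.T
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    ridge = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (x >= lo) & (x < hi)
        if sel.any():
            i = np.argmax(z[sel])                 # highest point in this slice
            ridge.append([x[sel][i], y[sel][i]])
    ridge = np.asarray(ridge)

    # Linear fit y = m*x + b of the ridge points gives the spine line.
    m, b = np.polyfit(ridge[:, 0], ridge[:, 1], deg=1)

    # Rotate about the Z-axis by -atan(m) so the spine becomes parallel to X.
    theta = -np.arctan(m)
    c, s = np.cos(theta), np.sin(theta)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    aligned = points @ Rz.T

    # Translate so the rotated spine line coincides with the X-axis (y = 0).
    aligned[:, 1] -= (np.array([0.0, b, 0.0]) @ Rz.T)[1]
    return aligned
```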
2.3. Locating Key Areas
The body pattern features are mainly distributed in the trunk area on the back of the cow. To standardize the feature areas of each cow, the key regions were defined using the hook and pin bones as landmarks. First, with the use of the PointNet++ model, the hook and pin bones were approximately located, and the corresponding hook and pin bone regions were segmented. Second, the hook and pin bones were located precisely via curvature analysis, thus enabling the accurate determination of these points. Finally, based on the area enclosed by the hook and pin bones, the distance between the two hook bones (the hook width) was extended along the positive direction of the X-axis (the direction of the spinal line), starting from the line connecting the two hook bones. This extension area, added to the original enclosed area, constitutes the key area, as shown in Figure 10.
2.3.1. Coarse Localization of Hook and Pin Bones via the Improved PointNet++ Model
There are several methods for segmenting regions in point clouds, including distance-, normal-, and deep learning-based segmentation methods. In distance-based segmentation, a seed point is selected from the point cloud, and it is determined whether adjacent points meet the predefined distance or angle criteria. The points that satisfy these conditions are added to the current region, and this process is continued until no new points can be included. This method is suitable for extracting regions of flat or continuous surfaces. Normal-based segmentation involves calculating the normal of each point and segmenting regions based on changes in the normal direction. If the difference in the normal direction between adjacent points is small, those points are grouped into the same region. However, this method performs poorly for surfaces with complex shapes or curvatures. In deep learning-based segmentation methods, such as the PointNet++ model, point cloud data are directly used as input, and each point is classified to complete the segmentation task [31].
To improve the feature extraction and classification capability of the PointNet++ network in complex scenarios, modifications were made to its feature extraction module. The HeatBlock attention mechanism was integrated into the module to enhance the weighted adjustment of channel features. By generating channel weights, the mechanism dynamically modulates the significance of each feature channel, thereby augmenting the focus on target regions. This process effectively suppresses noise and redundant information, leading to enhanced feature representation. The architecture of the improved network is depicted in Figure 11.
The HeatBlock module achieves the efficient modeling of both global structure and local details by integrating feature processing in both the spatial and frequency domains. Initially, local features are extracted from the input feature maps through convolutional operations, capturing fine-grained information. Subsequently, the features are transformed into the frequency domain via the Discrete Cosine Transform (DCT), where they are decomposed into low-frequency and high-frequency components. Based on the Heat Conduction Operator (HCO) module, the exponential decay factor formula (Formula (4)) is utilized to simulate the heat conduction process. In the frequency domain, the high-frequency components, which exhibit intense oscillations, decay rapidly, while the low-frequency components are preserved, thereby enhancing the global structural information.
where m and n represent the frequencies, and a and b represent the range of the spectrum.
The attenuated spectrum is returned to the spatial domain through the Inverse Discrete Cosine Transform (IDCT), generating a smooth feature map. Then, the dynamic weight generation module produces a position-related weight matrix based on the input features. Through training, this module adaptively optimizes the feature responses. Ultimately, the feature map processed by the heat conduction treatment is element-wise multiplied with the dynamic weights, integrating global and local features to enhance the model’s ability to focus on different patterns.
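The exact HeatBlock implementation is not reproduced here; the PyTorch sketch below only illustrates the mechanism described above (a local convolution, a 2D DCT, exponential attenuation of high frequencies, the inverse DCT, and position-dependent dynamic weights). The layer sizes, the decay constant, and the residual fusion are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = torch.arange(n).unsqueeze(1).float()
    i = torch.arange(n).unsqueeze(0).float()
    m = torch.cos(math.pi * (i + 0.5) * k / n) * math.sqrt(2.0 / n)
    m[0, :] /= math.sqrt(2.0)
    return m

class HeatBlockSketch(nn.Module):
    """Illustrative HeatBlock-style module: local convolutional features are moved
    to the frequency domain with a 2D DCT, high frequencies are attenuated with an
    exponential (heat-conduction) decay, the result is transformed back with the
    IDCT and modulated by position-dependent dynamic weights."""

    def __init__(self, channels: int, decay: float = 4.0):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.weight_gen = nn.Conv2d(channels, channels, 1)   # dynamic weights
        self.decay = decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat = self.local(x)                                  # local detail features

        # 2D DCT over the spatial dimensions.
        dh, dw = dct_matrix(h).to(x), dct_matrix(w).to(x)
        freq = torch.einsum('ij,bcjw->bciw', dh, feat)
        freq = torch.einsum('ij,bchj->bchi', dw, freq)

        # Heat-conduction style attenuation: higher frequencies decay faster.
        m = torch.arange(h, device=x.device).view(h, 1) / h
        n = torch.arange(w, device=x.device).view(1, w) / w
        freq = freq * torch.exp(-self.decay * (m ** 2 + n ** 2))

        # Inverse DCT back to the spatial domain.
        smooth = torch.einsum('ji,bcjw->bciw', dh, freq)
        smooth = torch.einsum('ji,bchj->bchi', dw, smooth)

        # Position-related dynamic weights fuse global (smooth) and local cues.
        weights = torch.sigmoid(self.weight_gen(x))
        return x + weights * smooth
```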
The improved PointNet++ network builds on a hierarchical feature learning framework that uses Sampling and Grouping to downsample local regions, PointNet for feature extraction, and Set Abstraction with interpolation for multi-scale fusion. It incorporates DCT/IDCT frequency transforms and heat conduction operations to dynamically fuse spatial and frequency-domain features, enhancing global structure perception and local detail capture. A dynamic weight generation mechanism adaptively adjusts feature importance, optimizing performance on sparse and non-uniform point clouds. During encoding, multi-level Sampling and Grouping together with the HeatBlock strengthen neighborhood interactions, while the decoding phase employs interpolation and a lightweight Unit PointNet to restore resolution and integrate multi-level features, ultimately boosting robustness and expressive power for complex point cloud tasks such as classification and segmentation.
Therefore, to accurately locate the hook and pin bones of cattle, the improved PointNet++ model was employed for training and prediction. First, an annotation tool was used to label and delineate the hook and pin bone regions of the cow, as depicted in Figure 12. Then, training and evaluation were performed on the training, validation, and test sets, which were split at a 7:2:1 ratio as described above. During the training process, to enhance the performance of the algorithm and reduce overfitting, the following hyperparameters and strategies were employed: a batch size of 16, a total of 251 epochs, an initial learning rate of 0.001, a weight decay coefficient of 1 × 10−4, and the Adam optimizer. Additionally, a dynamic learning rate adjustment strategy was adopted, in which the learning rate was decayed by a factor of 0.5 every 20 epochs.
2.3.2. Model Evaluation Indicators
The training set was used to train a region segmentation model based on the improved PointNet++ model. Following the training process, images from the test set were fed into the trained segmentation model to assess its performance. To verify the accuracy of the segmentation model proposed in this study and to enhance the reliability of the experimental results, the performance of the algorithm was evaluated using the mean intersection over union (MIoU) and the average precision (AP). To ensure accurate recognition and optimal segmentation performance, a given segmentation result was defined as a positive sample when the intersection over union (IoU) for detection exceeded 0.5:

$$IoU = \frac{TP}{TP + FN + FP}, \qquad MIoU = \frac{1}{N}\sum_{i=1}^{N} IoU_i$$

where $TP$ denotes the number of true positive samples, indicating the number of samples correctly identified as positive by the model; $FN$ denotes the number of false negative samples, indicating the number of positive samples incorrectly identified as negative by the model; $FP$ denotes the number of false positive samples, indicating the number of negative samples incorrectly identified as positive by the model; and $N$ is the number of classes.

The AP was employed to evaluate the localization accuracy of the proposed algorithm for hook and pin bone region segmentation. The instance segmentation evaluation metrics include the AP and AP50, where AP50 denotes the AP measured at an IoU threshold of 0.5.
2.3.3. Accurate Positioning of Hook and Pin Bones via Curvature Analysis
Because the hook and pin bones lie at the raised parts of their respective regions, the principal curvature of the raised surface is high, whereas the mean curvature at the edge of the raised area is low. Therefore, the three-dimensional curvature characteristics of the extracted hook and pin bone regions can be used to accurately locate the hook and pin bones.
Gaussian and mean curvature parameters are often used together in 3D surface analysis, as they describe the curvature properties of a surface from different perspectives. When combined, they provide a more accurate characterization of both the local and global features of the surface. At sharp edges or singular points, both the Gaussian curvature K and the mean curvature H exhibit significant local variations. In contrast, in flatter regions, both the Gaussian curvature K and the mean curvature H approach zero.
To calculate the Gaussian curvature, the local neighborhood of a point must be determined first. For a given point $p_i$, its nearest neighboring points $N(p_i)$ are identified. These neighboring points are employed to approximate the local surface around $p_i$. Next, the local surface is fitted: a quadratic surface is fitted to the neighboring points via principal component analysis (PCA). The fitted surface can be expressed as follows:

$$z = f(x, y) = a x^{2} + b x y + c y^{2}$$

where $z$ denotes the value along the normal direction, and $x$ and $y$ are the coordinates within the local tangent plane.
The principal curvatures can then be calculated: based on the fitted quadratic surface equation, the principal curvatures are derived from the second-order derivatives using differential geometry. Notably, the principal curvatures can be obtained as follows:

$$k_{1,2} = \frac{f_{xx} + f_{yy}}{2} \pm \sqrt{\left(\frac{f_{xx} - f_{yy}}{2}\right)^{2} + f_{xy}^{2}}$$

where $f_{xx}$, $f_{yy}$, and $f_{xy}$ are the second-order derivatives of the local surface along different directions.

Finally, the Gaussian curvature can be calculated: once the two principal curvatures $k_1$ and $k_2$ are obtained, the Gaussian curvature $K$ can be calculated as follows:

$$K = k_1 k_2$$

The mean curvature $H$ can be calculated as follows:

$$H = \frac{k_1 + k_2}{2}$$
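A compact NumPy sketch of this local curvature estimate is shown below; it assumes the neighborhood has already been expressed in the local frame (z along the PCA normal, x and y in the tangent plane) and fits the quadratic patch introduced above.

```python
import numpy as np

def local_curvatures(neigh: np.ndarray):
    """Estimate principal, Gaussian, and mean curvature from neighborhood points
    given in a local frame (columns: x, y in the tangent plane, z along the
    normal). A quadratic patch z = a*x^2 + b*x*y + c*y^2 is fitted first."""
    x, y, z = neigh[:, 0], neigh[:, 1], neigh[:, 2]
    A = np.c_[x ** 2, x * y, y ** 2]
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

    # Second-order derivatives of the fitted surface.
    fxx, fxy, fyy = 2.0 * a, b, 2.0 * c

    # Principal curvatures: eigenvalues of the Hessian [[fxx, fxy], [fxy, fyy]].
    mean = 0.5 * (fxx + fyy)
    delta = np.sqrt(((fxx - fyy) * 0.5) ** 2 + fxy ** 2)
    k1, k2 = mean + delta, mean - delta

    K = k1 * k2            # Gaussian curvature
    H = 0.5 * (k1 + k2)    # mean curvature
    return k1, k2, K, H
```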
2.3.4. Identifying Key Areas
After accurately locating the hook and pin bone points, the cow back point cloud was projected onto the XOY plane. Owing to the irregular edges of the point cloud, it was necessary to ensure that the extracted region contained sufficient body pattern features for effective feature extraction and recognition while also simplifying the extraction process. Therefore, the key area was delimited by connecting the hook and pin bones in turn, with the length of the line connecting the two hook bones defining the hook width. Based on the area enclosed by the hook and pin bones, a square area with a side length equal to the hook width was extended from the two hook bones along the positive direction of the X-axis. The sum of this extension area and the area enclosed by the hook and pin bones was taken as the key area.
2.4. Processing Body Pattern Images
The most distinctive aspect of the body pattern images in the trunk area is the arrangement of black and white patches. Consequently, this arrangement served as the foundation for classifying the body pattern images. To facilitate the subsequent extraction of body pattern features, it is necessary to convert the key regions from a point cloud format into body pattern images.
Due to the positional relationship between the hook bone points and the pin bone points, the extracted key regions are not regular rectangular areas. To facilitate the subsequent extraction of body pattern features, this study performed a normalization process on the point cloud data within the key regions. First, the resolution was unified to 224 × 224, and the scaling ratio was calculated. Second, the point cloud data within the key region were stretched to fill the entire two-dimensional image. Third, the image was smoothed using Gaussian blur. Finally, to emphasize the key characteristics of the black and white patterns, the pattern images were binarized. This process made the black-and-white features in the images more prominent and enhanced their contrast.
The Otsu algorithm was applied for body pattern image binarization. Otsu's method is an adaptive threshold selection algorithm that is computationally simple and largely insensitive to overall image brightness and contrast. It determines the optimal threshold by minimizing the within-class variance, which is equivalent to maximizing the between-class variance.
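The resizing, smoothing, and Otsu binarization steps can be sketched with OpenCV as follows; the 224 × 224 target size follows the text, while the assumption of an 8-bit input, the Gaussian kernel size, and the interpolation mode are illustrative choices.

```python
import cv2
import numpy as np

def to_binary_pattern(region_img: np.ndarray) -> np.ndarray:
    """Resize a key-region image to 224 x 224, smooth it, and binarize it with
    the Otsu threshold so the black/white coat pattern stands out.
    `region_img` is assumed to be an 8-bit grayscale or BGR image."""
    img = cv2.resize(region_img, (224, 224), interpolation=cv2.INTER_LINEAR)
    if img.ndim == 3:                          # convert color input to grayscale
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.GaussianBlur(img, (5, 5), 0)     # suppress small speckles
    # Otsu's method picks the threshold that maximizes between-class variance.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```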
2.5. Classifying Body Pattern Images
After the binarization of the body pattern images and other processing steps, they were input into the classification model to extract body pattern features, achieve image classification, and ultimately realize identity recognition.
2.5.1. ConvNeXt
The ConvNeXt model was adopted as the primary CNN architecture, mainly because it retains the structural advantages of classical CNNs while incorporating the design principles of modern transformer models. The ConvNeXt model provides significantly enhanced representational capability and performance via techniques such as larger receptive fields, deeper network structures, and layer normalization built on the CNN architecture [32]. Compared with traditional models such as ResNet, the ConvNeXt model offers higher accuracy and better generalizability across various computer vision tasks. In addition, the ConvNeXt model features a relatively simple structure, making it easy to integrate into existing deep learning frameworks. Its high computational efficiency allows it to better address the challenges of individual cow identification in this study. In this model, multi-level ConvNeXt blocks are employed, with downsampling at each stage to reduce spatial dimensions while increasing the number of feature map channels.
2.5.2. Convolutional Block Attention Module
To increase the generalizability and robustness of the model, a lightweight attention module, the convolutional block attention module (CBAM), was introduced to enhance the ConvNeXt network architecture. The CBAM [33] is an attention mechanism module that enhances the representational ability of CNNs. The CBAM is a sequential structure, as shown in Figure 13a, comprising two independent submodules: the channel attention module (CAM) and the spatial attention module (SAM), as shown in Figure 13b,c, respectively. The primary task of the CAM is to perform attention adjustment along the channel dimension of the input feature map, adaptively adjusting the weight of each channel by learning the correlations between channels. The SAM focuses primarily on the spatial dimension of the input feature map and aims to adjust the weight of each spatial position by learning the correlations among different spatial locations.

In the original CBAM, the CAM and SAM are combined via a serial connection. However, this serial connection may lead to interference between the two attention modules. To address this issue, the CAM and SAM were instead connected in parallel, so that both modules can function simultaneously without interfering with each other. The improved CBAM attention mechanism can better capture the channel and spatial correlations of the input feature maps and increase the representational ability and performance of the model. The improved CBAM attention mechanism is shown in Figure 14.
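A PyTorch sketch of this parallel variant is given below. How the two attention maps are fused is not detailed above, so the sketch simply computes both from the same input and applies them jointly; this fusion, the reduction ratio, and the kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: per-channel weights from pooled descriptors passed through a shared MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)            # per-channel weights

class SpatialAttention(nn.Module):
    """SAM: per-pixel weights from channel-wise average and max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ParallelCBAM(nn.Module):
    """CBAM variant in which channel and spatial attention are computed in
    parallel from the same input and then applied jointly."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return x * self.ca(x) * self.sa(x)
```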
2.5.3. Improving the ConvNeXt Network Architecture
To increase the performance of the ConvNeXt model, the modified CBAM was added after the first convolution layer of the ConvNeXt block, as shown in Figure 15a. The improved ConvNeXt block replaces the original ConvNeXt block in the ConvNeXt model, as shown in Figure 15b.
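Reading Figure 15a as placing the modified CBAM directly after the 7 × 7 depthwise convolution, a hedged sketch of the improved block is given below; it reuses the ParallelCBAM class from the preceding snippet, follows the public ConvNeXt block layout, and omits layer scale and stochastic depth, so it is an approximation rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class ImprovedConvNeXtBlock(nn.Module):
    """ConvNeXt block with the modified (parallel) CBAM inserted after the
    first (depthwise 7x7) convolution. ParallelCBAM is the class defined in
    the previous sketch."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.attn = ParallelCBAM(dim)                 # modified CBAM
        self.norm = nn.LayerNorm(dim, eps=1e-6)       # channels-last LayerNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                             # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = self.attn(x)                              # attention on spatial features
        x = x.permute(0, 2, 3, 1)                     # to channels-last for LN/MLP
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x
```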
2.5.4. Model Training and Evaluation Metrics
Following the training and optimization of the network structure, the images from the input dataset were utilized to assess the performance of the enhanced ConvNeXt model. Before being input into the model, all the body pattern images were resized to a uniform size of 224 × 224 pixels via bicubic interpolation. The enhanced ConvNeXt classification model was assessed by allocating the body pattern images of each cow in the dataset to the training, validation, and test sets in a 7:2:1 proportion. During training, the images fed into the model had a resolution of 256 × 256. To enhance algorithm performance and mitigate overfitting, the batch size was set to 32, and the number of epochs was set to 200. The AdamW optimizer was selected for training the model, with a learning rate of 0.0005 and a weight decay of 0.05.
After classification, the performance of the improved ConvNeXt model was comprehensively evaluated in terms of its precision, accuracy, recall, and F1 score. Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive by the model; a higher value indicates a greater prediction capacity of the model. Precision can be calculated with Formula (10). Accuracy refers to the probability of correct predictions among all the samples and can be calculated with Formula (11). Recall is the proportion of true positive samples that are correctly identified out of all actual positive samples; a higher value indicates better prediction performance of the model. Recall can be obtained via Formula (12).
The F1 score is a metric used in statistics to evaluate the accuracy of binary classification (or multi-task binary classification) models. It simultaneously accounts for both the precision and recall of the classification model and can be regarded as a weighted average of the two. The maximum value is 1, and the minimum value is 0; the higher the value, the better the model. The F1 score can be calculated via Formula (13).
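For reference, the standard forms of these metrics, consistent with the descriptions above (with $TN$ denoting the number of true negative samples), are:

$$Precision = \frac{TP}{TP + FP}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$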
4. Discussion
4.1. Mechanistic Analysis of Depth-Enhanced Cattle Identification
This study employs RGB-D images to enhance the accuracy of individual dairy cow identification. The specific contributions of depth information to the accuracy, stability, and robustness of dairy cow identification are analyzed from two aspects: position normalization and motion robustness.
Firstly, regarding geometric invariance, depth information provides three-dimensional data about the cow's morphology, enabling the construction of a standardized representation based on spatial coordinates. RGB images are susceptible to variations in shooting angles, whereas depth data offer the true three-dimensional shape of the object. By utilizing depth information, the cow's body contour can be normalized to ensure consistent feature representation across different shooting angles. Moreover, depth images provide the distance from the depth camera to the cow along the Z-axis, which allows scale errors caused by varying shooting distances to be eliminated, thereby enhancing the stability of identification.
In terms of motion robustness, dairy cows are often not stationary during image capture, leading to deformation in torso images due to their movement. RGB images are prone to texture distortion when the cow is walking or slightly moving, while the depth channel can more stably capture the cow’s shape information.
In summary, depth information plays a significant role in the task of dairy cow identification, demonstrating superior stability and robustness in aspects such as position normalization and motion robustness.
4.2. Robust Analysis
To validate the robustness of the dairy cow individual identification method proposed in this study, the data in Dataset 2 were processed and analyzed to evaluate the robustness of the proposed algorithm under different scenarios. First, the RGB-D images of the cattle’s back were converted into point clouds, from which the key regions of the cattle’s back were extracted. These key regions were then transformed into binary body pattern images and fed into the improved ConvNeXt classification model to achieve body pattern classification. The final accuracy of individual identity recognition was 94.62%. Therefore, our research method demonstrates satisfactory performance in identifying dairy cows during the walking process.
To further validate the robustness of the proposed method, five dairy cows were randomly selected from the feeding dataset, and their top-view RGB-D information was captured while they were walking. The identity recognition model previously trained on the feeding dataset was then tested using these new data. A total of 50 depth images and 50 color images were captured for the five cows. These images were processed to generate 3D point clouds of the cows' backs, followed by the extraction of key regions and conversion into binary body pattern images for classification. The final identity recognition accuracy on this test set was 93.16%. Additionally, the binary body pattern images of the key regions from the same cows during feeding and walking were compared. During feeding, cows are not entirely stationary and may exhibit minor movements. Therefore, the back images captured during feeding, walking, and static feeding states were compared to further demonstrate the accuracy of the results. The key regions were extracted and converted into binary body pattern images for all three states, and the overlap of these images was visualized, as shown in Figure 21. The white regions extracted from the three states were displayed in blue, yellow, and green, respectively, and then overlaid, with the overlapping areas highlighted in red. The red overlapping area accounted for 87.62% of the blue area, 85.18% of the yellow area, and 90.24% of the green area. These results indicate that the cows' movements within the camera's field of view during top-view image capture have a minimal impact on identity recognition.
4.3. Analysis of the Performance of Individual Dairy Cow Identification
To demonstrate its superiority more intuitively, the proposed method is compared with alternative recognition approaches based on body pattern images, as detailed in Table 4. Although Xiao et al. (2022) [24] achieved a high recognition accuracy in the table, their method could hardly extract effective features for cows without body patterns or with small body patterns. The method proposed in this study increases the recognition accuracy for cows with no or small white body patterns by combining three-dimensional features with body pattern features. In the process of extracting the key regions and standardizing their shapes, blank areas were filled with white point clouds, which enhanced the features and contributed to greater recognition performance. The dataset in Shen et al. (2020) [34] encompasses cow features captured from a lateral view, where occlusion is a significant issue. He Dongjian and Liu Jianmin (2020) [23] fixed the camera at the entrance and exit of the milking area, which resulted in a limited shooting angle and the occurrence of dairy cow congestion. This led to a significant degree of bending deformation in the backs of some cows, causing them to be missed during identification. In contrast, the present study collected top-view RGB-D images. By performing posture normalization on the point cloud images and adjusting the angle of the cows' back images, the impact of the shooting angle is reduced, thus achieving higher robustness. Andrew et al. (2021) [17], Wang et al. (2024) [35], and our method all collected top-view back images. However, their accuracy rates were lower than that of the method presented in this study. Moreover, they collected RGB images, which are susceptible to the influence of perspective and environment, leading to missing image information. The present study, by collecting top-view RGB-D images, located the key points of the hook and pin bones on the back to determine the critical regions. Adjustments were made to these regions, making the body pattern images more comparable and enhancing the identity recognition features. Therefore, the research method proposed in this paper not only achieves high identity recognition accuracy and good recognition effects but also exhibits high robustness with respect to the position and angle of cattle in the top-view field of vision.
4.4. Error Analysis
The incorrect results of individual identification were counted and analyzed. Figure 22 illustrates the two cows with the lowest recognition rate in the dataset, both having an individual identification rate of 78%. After analysis, the factors contributing to the low identification rate can be summarized in two aspects: (1) there were many patches on these cows, which affected the processing of body pattern features; and (2) the extracted key area of the backs of these cows contained an excessive proportion of white parts, the black pattern region was small, and the corresponding recognition features were limited. Additionally, the activities of dairy cows can cause minor changes in the distribution and shape of the patterns, leading to a scarcity of pattern features and thereby reducing the accuracy of identity recognition. In this study, we conducted a statistical analysis of the cows in the dataset whose black body pattern area accounted for less than 10% of the total area. This subset constituted 7% of the total number of cows in the dataset. The accuracy of individual identity recognition for these cows was 85.2%, which is 12.75 percentage points lower than the overall recognition accuracy. However, this level of accuracy still enables the basic individual identification of dairy cows.
4.5. Future Research
Although the method proposed in this paper enabled highly accurate cow identification, there is still room for improvement. The camera remained stationary during data acquisition, and the torso area may fall outside the acquisition area due to small-amplitude movements of the cow during feeding, which could cause key areas to be missing. In future research, a tracking camera could be integrated with a platform for real-time image display and processing, enabling the remote acquisition and processing of dairy cow images in real time. The proposed method yields increased recognition accuracy for cows with a high proportion of black area, but the recognition accuracy is lower for cows with a high proportion of white area or for all-white cows. In future research, cows with a high proportion of white area could be identified by additional means, such as marking the torso or combining other identification characteristics, to increase the identification accuracy of individual cows.