1. Introduction
Dairy farming is a vital component of the livestock industry. However, traditional manual farming methods are characterized by high labor intensity, high time consumption, and notable subjectivity [1,2]. Therefore, supported by rapidly advancing information and intelligent technologies, dairy farming is developing towards informatization, intelligence, and precision [3]. Precision and intelligent farming enables the real-time monitoring of the behavior of individual cows and facilitates timely management decisions [4]. The identification of individual cows serves as a prerequisite and foundation for achieving precision and intelligent farming [5,6].
Currently, radio frequency identification (RFID) technology is widely employed for identifying individual cows in large-scale dairy farms. This technology uses tags (typically ear tags) attached to cows and wireless transmission to record individual cow information [7,8]. Compared with traditional manual methods, RFID technology improves identification accuracy and work efficiency. However, attaching ear tags is labor intensive and may cause animal stress, impacting animal welfare, and ear tags are easily lost and damaged [9].
With the advancement of computer technology, contactless cattle identification methods based on computer vision have increasingly emerged [10,11,12,13]. These methods extract biometric features from cow images captured by cameras, including nose and mouth patterns [14], iris patterns [15], facial contour textures [16], and body patterns [17,18]. This technique enables automatic, contactless, and accurate image recognition without the need for extensive manual operations [19]. However, nose and mouth patterns, iris patterns, and facial contour textures, as physiological features of the head, require a high degree of cooperation from cows during extraction, and the image acquisition process is highly sensitive to the angle and position of the camera. Kumar et al. [20] proposed a deep learning-based method for identifying individual cattle through the recognition of muzzle and nose pattern image features. This method achieved relatively high recognition accuracy, but it required close-range capture of the muzzle and nose pattern images, and dirt on the cattle's muzzle and nose had to be wiped off before image acquisition, which in turn demanded a high level of cooperation from the cattle. Lu et al. [15] developed a dairy cattle iris recognition method based on the two-dimensional complex wavelet transform. Although the reliability of this method was validated, extracting ocular information proved challenging: to obtain high-quality images, the animals' heads had to be restrained during acquisition to prevent excessive head movement, a process that was overly complex and still did not guarantee image quality. Yang et al. [16] proposed a dairy cattle facial recognition method based on an improved RetinaFace and FaceNet. By optimizing the network architecture and loss function, the method achieved efficient feature extraction and precise matching. Qi Yongsheng et al. [21] proposed a dual-branch network for recognizing partially occluded cattle faces. By employing depthwise separable convolution and multi-scale hybrid pooling, the method achieved occlusion segmentation and feature restoration, resulting in an average accuracy of 86.34% for occluded cattle face recognition. Although cameras can capture the facial features of cattle, the animals' low cooperation during image acquisition necessitates specialized equipment to restrain them in order to obtain clear, high-quality cattle face images.
Body patterns denote the systematic arrangement of black and white fur on the torso of Holstein cattle. They are widely distributed, yet their features remain distinguishable and stable over time. Images of body patterns can be captured through side-view or overhead-view photos or videos of cows while they are walking. Zhao Kaixuan and He Dongjian (2015) derived a 48 × 48 feature matrix from images of the cow torso and developed a convolutional neural network (CNN) as a model for individual cow recognition [22]. Their dataset included 30 cows, and they obtained an identification accuracy of 90.55%. However, in side-view image acquisition, the overlap and occlusion of multiple cows often occur, reducing trunk localization accuracy. Therefore, researchers have explored identity recognition methods based on top-view images. He Dongjian and Liu Jianmin (2020) proposed an individual identification method for dairy cows based on video analysis and an improved deep CNN, namely the You Only Look Once version 3 (YOLOv3) algorithm, together with a video acquisition scheme for the backs of cows entering and exiting the milking parlor, achieving an identification accuracy of 95.91% [23]. Xiao et al. (2022) proposed an individual cow identification method based on an improved Mask R-CNN model. In this method, an improved Mask R-CNN algorithm was employed to segment recorded back images of cows, the optimal feature subset was obtained, and an SVM classifier was then employed to classify and identify the back images, achieving an identification accuracy of 98.67% [24]. These methods all use color images to locate the trunk of cows and extract body pattern features. However, gait activity affects the angle of the cow relative to the camera, resulting in inconsistent orientations and positions in the obtained color images.
Despite the high accuracy achieved by two-dimensional image analysis techniques in individual dairy cattle identification, practical performance remains susceptible to significant fluctuations caused by variations in illumination, interference from complex backgrounds, and other environmental variables. Zhao et al. [13] utilized depth images to generate point cloud data of cattle backs and combined voxelization with convex hull algorithms to quantify the degree of fat depression around the skeletal structure. The convex hull distance feature, calculated as the distance from peripheral voxels to the surface of the convex hull, reflects the uniqueness of the structural characteristics of the hook and pin bones. Kyaw et al. [25] addressed the light sensitivity issue of traditional 2D image-based cattle identification through 3D point cloud segmentation. By integrating the PointNet++ algorithm with multi-scale feature extraction, they achieved high-precision segmentation of the cattle back region. Menezes et al. [26] developed a computer vision system (CVS) for animal identification that recognizes individual cattle through key points on the animal's body surface. The system locates seven specific anatomical landmarks (left and right hips, left and right pin bones, tail head, sacral, and cervical vertebrae) to establish a key point model and performs individual identification based on the Euclidean distances between these key points.
Building on the existing research, this paper proposes an individual cow identification method based on anchor point detection and body pattern characteristics. In this method, a three-dimensional point cloud is employed to standardize the direction and position of the top-view cow images, key points are located to determine the key dorsal areas, and body pattern features are extracted from these areas for identity determination. By analyzing the three-dimensional structure of cows, the proposed method standardizes the direction and position of the obtained cow images, thereby correcting deformation in the color images. The detailed procedure is as follows. First, the color and depth images are registered to generate a colored point cloud of the back of the selected cow. Second, the point cloud is subjected to pose normalization, background and noise removal, and clustering analysis to extract the back trunk of the cow. The improved PointNet++ model is then employed to coarsely locate the hook and pin bones and segment the corresponding regions. The point cloud of the key regions is subsequently converted into a body pattern image, which is binarized using the Otsu thresholding method. Finally, the body pattern images are input into an improved ConvNeXt classification model to realize body pattern image classification. The method is robust to the position and angle of the cow in the top-view field of vision. The contributions of this paper are as follows.
A new method for individual dairy cattle identification is proposed, which includes the following steps: first, the hook and pin bone points are accurately located; then, the key areas are extracted and converted into two-dimensional images; and finally, the individual identity of the cattle is recognized by classifying the body pattern images.
A method for locating the hook and pin bone points on the cattle back is proposed, which reduces the impact of the environment and camera angle on torso localization and offers better robustness.
A method for individual cattle identification based on the improved PointNet++ and ConvNeXt models is proposed, which combines the three-dimensional structural features of cattle backs with body pattern features to improve the accuracy of individual cattle identification.
2. Materials and Methods
2.1. Data Acquisition
To verify the robustness of the proposed algorithm, this study established two scenarios for the data collection of dairy cow images. In the first scenario, a camera was mounted above the feeding stall to capture top-view images of cows’ backs. In the second scenario, a camera was installed above the passageway leading to the milking parlor to collect top-view images of cows’ backs while they were walking. The images captured above the feeding stall were used for training and analysis of the method proposed in this study, while the images collected above the walking passageway were designated as the test set to validate the robustness of the proposed method.
(1) Dataset 1
The experimental data were collected at Shengsheng Farm, located in Luoyang, Henan Province, from 24 to 25 October 2024. The subjects of data collection were lactating Holstein cows during feeding. Using a camera stand, an Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) was mounted above the back of the cow at a height of 2.9 m from the ground. The data collection setup is shown in Figure 1.
A depth camera was used to simultaneously capture top-view color images and depth images of the back of cows, as shown in Figure 2, both with a resolution of 640 × 360 pixels. A total of 40 cows were surveyed, resulting in 7600 color images and 7600 corresponding depth images. These images were split into training, validation, and test sets at a ratio of 7:2:1.
The distribution of the proportion of black fur on the cows' backs across all the data was analyzed. The data were divided into five groups from 0 to 100%, with an interval of 20% per group; the resulting histogram is shown in Figure 3. The occurrence probability of the groups with black back-area percentages of 0–20% and 80–100% was less than 10%, and the overall distribution was normal.
(2) Dataset 2
To validate the robustness of the proposed method for individual dairy cow identification, a more complex moving-scene dataset was established for testing. An Intel RealSense D455 depth camera was mounted above the passage leading to the milking parlor, as shown in Figure 4, to capture top-view back RGB-D data of cows while they were walking. A total of 200 top-view back depth images and 200 color images were collected from 10 dairy cows, each with a resolution of 640 × 360 pixels, as shown in Figure 5.
2.2. Converting Depth Images into Point Cloud
To analyze the three-dimensional characteristics of the backs of the selected cows, the depth and color images obtained with the Intel RealSense D455 depth camera in Dataset 1 were registered to generate a point cloud. Then, operations such as pose normalization, ground removal, clustering analysis, noise and background removal, spine line fitting, and the rotation and translation of the point cloud were performed to align the spine line with the X-axis.
2.2.1. Depth Image Preprocessing
Before registering the depth and color images to generate a 3D point cloud, it can be observed that some data loss occurs in the depth image shown in Figure 2b. To improve the quality of the depth images of dairy cows and ensure the accuracy and robustness of subsequent analyses, a comprehensive preprocessing approach was employed to address potential issues such as local depth information loss, noise interference, and outlier points. Specifically, three methods, namely nearest neighbor interpolation, median filtering, and bilateral filtering, were utilized to repair missing information in the depth images, remove noise and outliers, and smooth the images while preserving edge features. The results of this preprocessing are shown in Figure 6.
Firstly, to address the potential local depth information loss in the depth images, nearest neighbor interpolation was employed. This method leverages the known depth values surrounding the missing pixels to infer and fill in the voids within the image. By doing so, it ensures the continuity of depth information and mitigates the discontinuity issues at the boundaries caused by data loss.
Secondly, in the environment where the dairy cows are located, the presence of water stains, mud, or other impurities on the ground can lead to abnormally high or low reflective regions in the depth images. To address this, median filtering was applied to smooth the depth images and remove noise and outliers. Median filtering effectively suppresses random noise arising from sensor errors, environmental interference, or the data acquisition process. It reduces the impact of these anomalies on subsequent processing, thereby enhancing the stability and usability of the images.
Furthermore, to preserve the edge features of the images while denoising, bilateral filtering was introduced. Bilateral filtering is a non-linear filtering technique that combines spatial proximity and pixel intensity similarity for weighted smoothing. It suppresses noise while maintaining the sharpness of object edges. This is particularly important for the depth images of dairy cows, as the contours and feature points of the cows need to be as clear as possible. High reflective regions, such as those caused by spots or wet areas on the cows’ skin, can interfere with depth measurements. Bilateral filtering can mitigate the influence of these high reflective areas, resulting in smoother depth images while retaining critical boundary information. This provides more stable data support for the subsequent feature extraction and target recognition.
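These three operations map directly onto standard SciPy and OpenCV routines. The following sketch is a minimal illustration, assuming a single-channel 16-bit depth image in which missing pixels are stored as 0; the kernel sizes and filter parameters are illustrative choices rather than the exact values used in this study.

```python
import cv2
import numpy as np
from scipy import ndimage

def preprocess_depth(depth: np.ndarray) -> np.ndarray:
    """Fill holes, suppress noise, and apply edge-preserving smoothing to a
    depth image. `depth` is assumed to be uint16 with missing pixels set to 0."""
    # 1) Nearest neighbor interpolation: each missing pixel takes the value of
    #    its closest valid pixel, found with a Euclidean distance transform.
    missing = depth == 0
    if missing.any():
        idx = ndimage.distance_transform_edt(
            missing, return_distances=False, return_indices=True)
        depth = depth[tuple(idx)]

    # 2) Median filtering removes salt-and-pepper noise and isolated outliers
    #    caused by reflective spots (water, mud) on the ground.
    depth = cv2.medianBlur(depth, ksize=5)

    # 3) Bilateral filtering smooths the surface while preserving the sharp
    #    depth discontinuities at the cow's contour.
    depth_f = depth.astype(np.float32)
    depth_f = cv2.bilateralFilter(depth_f, d=9, sigmaColor=30.0, sigmaSpace=7.0)
    return depth_f
```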
2.2.2. Point Cloud Attitude Normalization
After preprocessing, the depth images were transformed into point clouds through the inverse perspective projection transformation (Formula (1)).
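Formula (1) corresponds to the standard pinhole back-projection. A minimal NumPy sketch is shown below, assuming the camera intrinsics (focal lengths fx, fy and principal point cx, cy) are taken from the RealSense calibration and that depth values are metric; the function name and parameters are illustrative.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth image (H x W) into an N x 3 point cloud.

    Inverse perspective projection:
        X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy,  Z = depth(v, u)
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels
```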
Following the conversion of the depth images to three-dimensional point clouds, and to facilitate the precise analysis of the body pattern features of dairy cows, the RGB information from the color images must be mapped to the generated 3D point cloud. This is achieved through the registration of the color and depth images, ensuring that each point in the 3D point cloud corresponds to an accurate RGB color value. The resultant visualization is shown in Figure 7. As can be seen from the figure, the colored point cloud data clearly depict the three-dimensional morphology of the cow's back and the distribution of its body patterns.
Prior to the removal of background and noise, and to facilitate the subsequent unified processing of the cow point clouds, the point clouds should be normalized. Based on the ground in the background point cloud, each cow point cloud was translated and rotated so that the ground plane coincides with the XOY plane.
The ground is typically a large planar region, which can be fitted using methods such as least squares. Specifically, the x, y, and z coordinates of all the points in the point cloud are substituted into the plane equation. The best plane parameters A, B, C, and D can be determined using the least squares method. The resulting fitted plane can be expressed as Ax + By + Cz + D = 0.
To align the fitted plane with the XOY plane, we extracted the normal vector $\mathbf{n} = (A, B, C)$ of the fitted plane and the normal vector $\mathbf{n}_z = (0, 0, 1)$ of the XOY plane. Then, we computed a rotation matrix via the Rodrigues model (Formula (2)). This matrix was applied to align the normal vector of the fitted plane with the positive direction of the Z-axis, thereby rotating the fitted plane to coincide with the XOY plane and correcting the tilt in the point cloud. The core of the Rodrigues method is to convert a rotation axis vector $\mathbf{k}$ and a rotation angle $\theta$ into a rotation matrix. For the rotation between the vector $\mathbf{n}$ and the target vector $\mathbf{n}_z$, the cross product is employed to calculate the rotation axis, and the dot product is applied to calculate the cosine of the rotation angle:

$$R = I + \sin\theta\,[\mathbf{k}]_{\times} + (1 - \cos\theta)\,[\mathbf{k}]_{\times}^{2}$$

where $R$ is the rotation matrix, $I$ is a 3 × 3 identity matrix, and $\theta$ is the rotation angle, which is calculated as the angle between the two vectors (using the dot product and the magnitudes of the vectors). Moreover, $\mathbf{k}$ is the rotation axis, which is the normalized cross product of the two vectors, and $[\mathbf{k}]_{\times}$ is the skew-symmetric matrix of $\mathbf{k}$ (also referred to as the cross-product matrix or the antisymmetric matrix), which is defined as follows:

$$[\mathbf{k}]_{\times} = \begin{pmatrix} 0 & -k_z & k_y \\ k_z & 0 & -k_x \\ -k_y & k_x & 0 \end{pmatrix}$$

where $k_x$, $k_y$, and $k_z$ are the components of the vector $\mathbf{k}$.
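A compact NumPy sketch of the ground fitting and Rodrigues alignment is given below. It assumes a near-horizontal ground, so the plane is fitted in the equivalent form z = ax + by + c; the function names are illustrative, not part of the original implementation.

```python
import numpy as np

def fit_ground_plane(points: np.ndarray) -> np.ndarray:
    """Least-squares fit of a near-horizontal plane z = a*x + b*y + c.
    Returns the unit plane normal, oriented along +Z."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    (a, b, c), *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    n = np.array([a, b, -1.0])
    n /= np.linalg.norm(n)
    return n if n[2] > 0 else -n            # make the normal point upward

def rotation_to_z(n: np.ndarray) -> np.ndarray:
    """Rodrigues rotation matrix that maps the unit vector n onto (0, 0, 1)."""
    target = np.array([0.0, 0.0, 1.0])
    axis = np.cross(n, target)
    s = np.linalg.norm(axis)                # sin(theta)
    c = float(np.dot(n, target))            # cos(theta)
    if s < 1e-8:                            # already aligned (or opposite)
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])        # skew-symmetric cross-product matrix
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

# Usage: rotate the whole cloud so the fitted ground coincides with the XOY plane.
# normal = fit_ground_plane(ground_points)
# cloud_aligned = cloud @ rotation_to_z(normal).T
```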
2.2.3. Point Cloud Noise and Background Removal
Since the cow images were captured during feeding, the environment is complex. To eliminate the influence of the environment, it is necessary to remove the background and noise from the collected data and extract the cow trunks. The resultant visualization of the cattle torso after background removal is shown in Figure 8.
(1) Ground Removal
With the above process, the orientation of the point cloud is normalized, with the ground aligned to the XOY plane. Since this study focuses on the top-view point cloud of the cattle's back, which lies at a certain distance from the ground, and considering that the farm's ground is uneven, the ground is removed as completely as possible by thresholding the z-value: points with z < 100 are classified as part of the ground, and clipping all points in this region effectively eliminates the ground portion.
(2) Filtering Processing
The remaining portion of the point cloud was then clustered via the density-based spatial clustering of applications with noise (DBSCAN) method for cow point cloud analysis [27]. The DBSCAN algorithm is a typical density-based spatial clustering algorithm that divides high-density point regions into clusters and can effectively filter out low-density regions; it can achieve clustering of arbitrary shapes in datasets containing noise [28]. The traditional k-means algorithm [29] can hardly resolve non-spherical clusters and clusters of different sizes, whereas DBSCAN groups clusters on the basis of density differences, allowing it to handle clusters of varying sizes and shapes [30]. Therefore, the DBSCAN algorithm was selected for cluster analysis of the cow point clouds in this study. This method effectively distinguished the point cloud of the cattle's back from railings and other objects, retaining only the point cloud corresponding to the cattle's back. Subsequently, statistical filtering was applied to the point cloud for noise removal and outlier elimination. Through statistical filtering, environmental noise points within the cattle point cloud were effectively removed, while the authentic three-dimensional structural information of the cattle's back was preserved. This approach enhanced the quality and accuracy of the point cloud.
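With the Open3D library, the clustering and statistical filtering steps can be sketched as follows; the eps, min_points, nb_neighbors, and std_ratio values are illustrative placeholders rather than the parameters used in this study, and the largest cluster is assumed to be the cow's back.

```python
import numpy as np
import open3d as o3d

def extract_cow_cluster(pcd: o3d.geometry.PointCloud) -> o3d.geometry.PointCloud:
    """Keep the largest DBSCAN cluster (assumed to be the cow's back) and
    remove residual outliers with statistical filtering."""
    # Density-based clustering; points labelled -1 are treated as noise.
    labels = np.array(pcd.cluster_dbscan(eps=0.05, min_points=20))
    valid = labels[labels >= 0]
    if valid.size == 0:
        return pcd
    largest = int(np.bincount(valid).argmax())
    cow = pcd.select_by_index(np.where(labels == largest)[0].tolist())

    # Statistical outlier removal: discard points whose mean distance to their
    # neighbors deviates strongly from the global average.
    cow, _ = cow.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return cow
```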
(3) Point Cloud Rotation
To facilitate subsequent processing, the cow point cloud must be rotated so that the spine line is aligned with the X-axis. The resultant visualization is shown in Figure 9. Based on the three-dimensional structural characteristics of the cattle, the z-values in each column are compared to find the point with the maximum z-value in each column. These points are then linearly fitted to determine a preliminary spine fitting line. A depth threshold was then set, and all points with depths greater than this value were fitted to obtain the final spine line of the cow. To enable the subsequent differentiation between the left and right hook and pin bones, the cow point cloud was rotated so that the spine line became parallel to the X-axis, after which it was translated until the spine line coincided with the X-axis.
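A NumPy sketch of this spine alignment is given below. Only the preliminary pass over the column maxima is shown, the binning direction and bin count are assumptions, and the depth-threshold refinement described above is omitted for brevity.

```python
import numpy as np

def align_spine_to_x(points: np.ndarray, n_bins: int = 60) -> np.ndarray:
    """Fit the spine line from the highest point in each slice along x and
    rotate/translate the cloud so that this line lies on the X-axis."""
    x, y, z = points.T
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    ridge = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (x >= lo) & (x < hi)
        if sel.any():
            i = np.argmax(z[sel])                 # highest point in this slice
            ridge.append([x[sel][i], y[sel][i]])
    ridge = np.asarray(ridge)

    # Linear fit y = m*x + b of the ridge points gives the spine line.
    m, b = np.polyfit(ridge[:, 0], ridge[:, 1], deg=1)

    # Rotate about the Z-axis by -atan(m) so the spine becomes parallel to X.
    theta = -np.arctan(m)
    c, s = np.cos(theta), np.sin(theta)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    aligned = points @ Rz.T

    # Translate so the rotated spine line coincides with the X-axis (y = 0).
    aligned[:, 1] -= (np.array([0.0, b, 0.0]) @ Rz.T)[1]
    return aligned
```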
2.3. Locating Key Areas
The body pattern features are mainly distributed in the trunk area on the back of the cow. To standardize the feature areas of each cow, the key regions were defined using the hook and pin bones as landmarks. First, with the use of the PointNet++ model, the hook and pin bones were approximately located, and the corresponding hook and pin bone regions were segmented. Second, the hook and pin bones were located precisely via curvature analysis, thus enabling the accurate determination of these points. Finally, based on the area enclosed by the hook and pin bones, the distance between the two hook bones (the hook width) was extended along the positive direction of the X-axis (the direction of the spinal line), starting from the line connecting the two hook bones. This extension area, added to the original enclosed area, constitutes the key area, as shown in Figure 10.
2.3.1. Coarse Localization of Hook and Pin Bones via the Improved PointNet++ Model
There are several methods for segmenting regions in point clouds, including distance-, normal-, and deep learning-based segmentation methods. In distance-based segmentation, a seed point is selected from the point cloud, and it is determined whether adjacent points meet the predefined distance or angle criteria. The points that satisfy these conditions are added to the current region, and this process is continued until no new points can be included. This method is suitable for extracting regions of flat or continuous surfaces. Normal-based segmentation involves calculating the normal of each point and segmenting regions based on changes in the normal direction. If the difference in the normal direction between adjacent points is small, those points are grouped into the same region. However, this method performs poorly for surfaces with complex shapes or curvatures. In deep learning-based segmentation methods, such as the PointNet++ model, point cloud data are directly used as input, and each point is classified to complete the segmentation task [31].
To improve the feature extraction and classification capability of the PointNet++ network in complex scenarios, modifications were made to its feature extraction module. The HeatBlock attention mechanism was integrated into the module to enhance the weighted adjustment of channel features. By generating channel weights, the mechanism dynamically modulates the significance of each feature channel, thereby augmenting the focus on target regions. This process effectively suppresses noise and redundant information, leading to enhanced feature representation. The architecture of the improved network is depicted in Figure 11.
The HeatBlock module achieves the efficient modeling of both global structure and local details by integrating feature processing in both the spatial and frequency domains. Initially, local features are extracted from the input feature maps through convolutional operations, capturing fine-grained information. Subsequently, the features are transformed into the frequency domain via the Discrete Cosine Transform (DCT), where they are decomposed into low-frequency and high-frequency components. Based on the Heat Conduction Operator (HCO) module, the exponential decay factor formula (Formula (4)) is utilized to simulate the heat conduction process. In the frequency domain, the high-frequency components, which exhibit intense oscillations, decay rapidly, while the low-frequency components are preserved, thereby enhancing the global structural information.
where m and n represent the frequencies, and a and b represent the range of the spectrum.
The attenuated spectrum is returned to the spatial domain through the Inverse Discrete Cosine Transform (IDCT), generating a smooth feature map. Then, the dynamic weight generation module produces a position-related weight matrix based on the input features. Through training, this module adaptively optimizes the feature responses. Ultimately, the feature map processed by the heat conduction treatment is element-wise multiplied with the dynamic weights, integrating global and local features to enhance the model’s ability to focus on different patterns.
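The exact HeatBlock implementation is not reproduced here; the PyTorch sketch below only illustrates the mechanism described above (a local convolution, a 2D DCT, exponential attenuation of high frequencies, the inverse DCT, and position-dependent dynamic weights). The layer sizes, the decay constant, and the residual fusion are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = torch.arange(n).unsqueeze(1).float()
    i = torch.arange(n).unsqueeze(0).float()
    m = torch.cos(math.pi * (i + 0.5) * k / n) * math.sqrt(2.0 / n)
    m[0, :] /= math.sqrt(2.0)
    return m

class HeatBlockSketch(nn.Module):
    """Illustrative HeatBlock-style module: local convolutional features are moved
    to the frequency domain with a 2D DCT, high frequencies are attenuated with an
    exponential (heat-conduction) decay, the result is transformed back with the
    IDCT and modulated by position-dependent dynamic weights."""

    def __init__(self, channels: int, decay: float = 4.0):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.weight_gen = nn.Conv2d(channels, channels, 1)   # dynamic weights
        self.decay = decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        feat = self.local(x)                                  # local detail features

        # 2D DCT over the spatial dimensions.
        dh, dw = dct_matrix(h).to(x), dct_matrix(w).to(x)
        freq = torch.einsum('ij,bcjw->bciw', dh, feat)
        freq = torch.einsum('ij,bchj->bchi', dw, freq)

        # Heat-conduction style attenuation: higher frequencies decay faster.
        m = torch.arange(h, device=x.device).view(h, 1) / h
        n = torch.arange(w, device=x.device).view(1, w) / w
        freq = freq * torch.exp(-self.decay * (m ** 2 + n ** 2))

        # Inverse DCT back to the spatial domain.
        smooth = torch.einsum('ji,bcjw->bciw', dh, freq)
        smooth = torch.einsum('ji,bchj->bchi', dw, smooth)

        # Position-related dynamic weights fuse global (smooth) and local cues.
        weights = torch.sigmoid(self.weight_gen(x))
        return x + weights * smooth
```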
The improved PointNet++ network builds on a hierarchical feature learning framework that uses Sampling and Grouping to downsample local regions, PointNet for feature extraction, and Set Abstraction with interpolation for multi-scale fusion. It incorporates DCT/IDCT frequency transforms and heat conduction operations to dynamically fuse spatial and frequency-domain features, enhancing global structure perception and local detail capture. A dynamic weight generation mechanism adaptively adjusts feature importance, optimizing performance on sparse and non-uniform point clouds. During encoding, multi-level Sampling and Grouping together with the HeatBlock strengthen neighborhood interactions, while the decoding phase employs interpolation and a lightweight Unit PointNet to restore resolution and integrate multi-level features, ultimately boosting robustness and expressive power for complex point cloud tasks such as classification and segmentation.
Therefore, to accurately locate the hook and pin bones of cattle, the improved PointNet++ model was employed for training and prediction. First, an annotation tool was used to label and delineate the hook and pin bone regions of the cow, as depicted in Figure 12. Then, training and evaluation were performed on the training, validation, and test sets, which were split at a 7:2:1 ratio as described above. During the training process, to enhance the performance of the algorithm and reduce overfitting, the following hyperparameters and strategies were employed: a batch size of 16, a total of 251 epochs, an initial learning rate of 0.001, a weight decay coefficient of 1 × 10−4, and the Adam optimizer. Additionally, a dynamic learning rate adjustment strategy was adopted, in which the learning rate was decayed by a factor of 0.5 every 20 epochs.
2.3.2. Model Evaluation Indicators
The training set was used to train a region segmentation model based on the improved PointNet++ model. Following the training process, images from the test set were fed into the trained segmentation model to assess its performance. To verify the accuracy of the segmentation model proposed in this study and to enhance the reliability of the experimental results, the performance of the algorithm was evaluated using the mean intersection over union (MIoU) and the average precision (AP). To ensure accurate recognition and optimal segmentation performance, a given segmentation result was defined as a positive sample when the intersection over union (IoU) for detection exceeded 0.5:

$$IoU = \frac{TP}{TP + FN + FP}, \qquad MIoU = \frac{1}{N}\sum_{i=1}^{N} IoU_i$$

where $TP$ denotes the number of true positive samples, indicating the number of samples correctly identified as positive by the model; $FN$ denotes the number of false negative samples, indicating the number of positive samples incorrectly identified as negative by the model; $FP$ denotes the number of false positive samples, indicating the number of negative samples incorrectly identified as positive by the model; and $N$ is the number of classes.

The AP was employed to evaluate the localization accuracy of the proposed algorithm for hook and pin bone region segmentation. The instance segmentation evaluation metrics include the AP and AP50, where AP50 denotes the AP measured at an IoU threshold of 0.5.
2.3.3. Accurate Positioning of Hook and Pin Bones via Curvature Analysis
Because the hook and pin bones lie at the raised parts of their respective regions, the principal curvature of the raised surface is high, whereas the mean curvature at the edge of the raised area is low. Therefore, the three-dimensional curvature characteristics of the extracted hook and pin bone regions can be used to accurately locate the hook and pin bones.
Gaussian and mean curvature parameters are often used together in 3D surface analysis, as they describe the curvature properties of a surface from different perspectives. When combined, they provide a more accurate characterization of both the local and global features of the surface. At sharp edges or singular points, both the Gaussian curvature K and the mean curvature H exhibit significant local variations. In contrast, in flatter regions, both the Gaussian curvature K and the mean curvature H approach zero.
To calculate the Gaussian curvature, the local neighborhood of a point must be determined first. For a given point $p_i$, its nearest neighboring points $N(p_i)$ are identified. These neighboring points are employed to approximate the local surface around $p_i$. Next, the local surface is fitted: a quadratic surface is fitted to the neighboring points via principal component analysis (PCA). The fitted surface can be expressed as follows:

$$z = f(x, y) = a x^{2} + b x y + c y^{2}$$

where $z$ denotes the value along the normal direction, and $x$ and $y$ are the coordinates within the local tangent plane.
The principal curvatures can then be calculated: based on the fitted quadratic surface equation, the principal curvatures are derived from the second-order derivatives using differential geometry. Notably, the principal curvatures can be obtained as follows:

$$k_{1,2} = \frac{f_{xx} + f_{yy}}{2} \pm \sqrt{\left(\frac{f_{xx} - f_{yy}}{2}\right)^{2} + f_{xy}^{2}}$$

where $f_{xx}$, $f_{yy}$, and $f_{xy}$ are the second-order derivatives of the local surface along different directions.

Finally, the Gaussian curvature can be calculated: once the two principal curvatures $k_1$ and $k_2$ are obtained, the Gaussian curvature $K$ can be calculated as follows:

$$K = k_1 k_2$$

The mean curvature $H$ can be calculated as follows:

$$H = \frac{k_1 + k_2}{2}$$
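A compact NumPy sketch of this local curvature estimate is shown below; it assumes the neighborhood has already been expressed in the local frame (z along the PCA normal, x and y in the tangent plane) and fits the quadratic patch introduced above.

```python
import numpy as np

def local_curvatures(neigh: np.ndarray):
    """Estimate principal, Gaussian, and mean curvature from neighborhood points
    given in a local frame (columns: x, y in the tangent plane, z along the
    normal). A quadratic patch z = a*x^2 + b*x*y + c*y^2 is fitted first."""
    x, y, z = neigh[:, 0], neigh[:, 1], neigh[:, 2]
    A = np.c_[x ** 2, x * y, y ** 2]
    (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

    # Second-order derivatives of the fitted surface.
    fxx, fxy, fyy = 2.0 * a, b, 2.0 * c

    # Principal curvatures: eigenvalues of the Hessian [[fxx, fxy], [fxy, fyy]].
    mean = 0.5 * (fxx + fyy)
    delta = np.sqrt(((fxx - fyy) * 0.5) ** 2 + fxy ** 2)
    k1, k2 = mean + delta, mean - delta

    K = k1 * k2            # Gaussian curvature
    H = 0.5 * (k1 + k2)    # mean curvature
    return k1, k2, K, H
```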
2.3.4. Identifying Key Areas
After accurately locating the hook and pin bone points, the cow back point cloud was projected onto the XOY plane. Owing to the irregular edges of the point cloud, it was necessary to ensure that the extracted region contained sufficient body pattern features for effective feature extraction and recognition while also simplifying the extraction process. Therefore, the key area was delimited by connecting the hook and pin bones in turn, with the length of the line connecting the two hook bones defining the hook width. Based on the area enclosed by the hook and pin bones, a square area with a side length equal to the hook width was extended from the two hook bones along the positive direction of the X-axis. The sum of this extension area and the area enclosed by the hook and pin bones was taken as the key area.
2.4. Processing Body Pattern Images
The most distinctive aspect of the body pattern images in the trunk area is the arrangement of black and white patches. Consequently, this arrangement served as the foundation for classifying the body pattern images. To facilitate the subsequent extraction of body pattern features, it is necessary to convert the key regions from a point cloud format into body pattern images.
Due to the positional relationship between the hook bone points and the pin bone points, the extracted key regions are not regular rectangular areas. To facilitate the subsequent extraction of body pattern features, this study performed a normalization process on the point cloud data within the key regions. First, the resolution was unified to 224 × 224, and the scaling ratio was calculated. Second, the point cloud data within the key region were stretched to fill the entire two-dimensional image. Third, the image was smoothed using Gaussian blur. Finally, to emphasize the key characteristics of the black and white patterns, the pattern images were binarized. This process made the black-and-white features in the images more prominent and enhanced their contrast.
The Otsu algorithm was applied for body pattern image binarization. Otsu's method is an adaptive threshold selection algorithm that is computationally simple and largely insensitive to overall image brightness and contrast. It determines the optimal threshold by minimizing the within-class variance, which is equivalent to maximizing the between-class variance.
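The resizing, smoothing, and Otsu binarization steps can be sketched with OpenCV as follows; the 224 × 224 target size follows the text, while the assumption of an 8-bit input, the Gaussian kernel size, and the interpolation mode are illustrative choices.

```python
import cv2
import numpy as np

def to_binary_pattern(region_img: np.ndarray) -> np.ndarray:
    """Resize a key-region image to 224 x 224, smooth it, and binarize it with
    the Otsu threshold so the black/white coat pattern stands out.
    `region_img` is assumed to be an 8-bit grayscale or BGR image."""
    img = cv2.resize(region_img, (224, 224), interpolation=cv2.INTER_LINEAR)
    if img.ndim == 3:                          # convert color input to grayscale
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.GaussianBlur(img, (5, 5), 0)     # suppress small speckles
    # Otsu's method picks the threshold that maximizes between-class variance.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```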
2.5. Classifying Body Pattern Images
After the binarization of the body pattern images and other processing steps, they were input into the classification model to extract body pattern features, achieve image classification, and ultimately realize identity recognition.
2.5.1. ConvNeXt
The ConvNeXt model was adopted as the primary CNN architecture, mainly because it retains the structural advantages of classical CNNs while incorporating the design principles of modern transformer models. The ConvNeXt model provides significantly enhanced representational capability and performance via techniques such as larger receptive fields, deeper network structures, and layer normalization built on the CNN architecture [32]. Compared with traditional models such as ResNet, the ConvNeXt model offers higher accuracy and better generalizability across various computer vision tasks. In addition, the ConvNeXt model features a relatively simple structure, making it easy to integrate into existing deep learning frameworks. Its high computational efficiency allows it to better address the challenges of individual cow identification in this study. In this model, multi-level ConvNeXt blocks are employed, with downsampling at each stage to reduce spatial dimensions while increasing the number of feature map channels.
2.5.2. Convolutional Block Attention Module
To increase the generalizability and robustness of the model, a lightweight attention module, the convolutional block attention module (CBAM), was introduced to enhance the ConvNeXt network architecture. The CBAM [33] is an attention mechanism module that enhances the representational ability of CNNs. The CBAM is a sequential structure, as shown in Figure 13a, comprising two independent submodules: the channel attention module (CAM) and the spatial attention module (SAM), as shown in Figure 13b,c, respectively. The primary task of the CAM is to perform attention adjustment along the channel dimension of the input feature map, adaptively adjusting the weight of each channel by learning the correlations between channels. The SAM focuses primarily on the spatial dimension of the input feature map and aims to adjust the weight of each spatial position by learning the correlations among different spatial locations.

In the original CBAM, the CAM and SAM are combined via a serial connection. However, this serial connection may lead to interference between the two attention modules. To address this issue, the CAM and SAM were instead connected in parallel, so that both modules can function simultaneously without interfering with each other. The improved CBAM attention mechanism can better capture the channel and spatial correlations of the input feature maps and increase the representational ability and performance of the model. The improved CBAM attention mechanism is shown in Figure 14.
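A PyTorch sketch of this parallel variant is given below. How the two attention maps are fused is not detailed above, so the sketch simply computes both from the same input and applies them jointly; this fusion, the reduction ratio, and the kernel size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: per-channel weights from pooled descriptors passed through a shared MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)            # per-channel weights

class SpatialAttention(nn.Module):
    """SAM: per-pixel weights from channel-wise average and max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ParallelCBAM(nn.Module):
    """CBAM variant in which channel and spatial attention are computed in
    parallel from the same input and then applied jointly."""
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return x * self.ca(x) * self.sa(x)
```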
2.5.3. Improving the ConvNeXt Network Architecture
To increase the performance of the ConvNeXt model, the modified CBAM was added after the first convolution layer of the ConvNeXt block, as shown in Figure 15a. The improved ConvNeXt block replaces the original ConvNeXt block in the ConvNeXt model, as shown in Figure 15b.
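Reading Figure 15a as placing the modified CBAM directly after the 7 × 7 depthwise convolution, a hedged sketch of the improved block is given below; it reuses the ParallelCBAM class from the preceding snippet, follows the public ConvNeXt block layout, and omits layer scale and stochastic depth, so it is an approximation rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class ImprovedConvNeXtBlock(nn.Module):
    """ConvNeXt block with the modified (parallel) CBAM inserted after the
    first (depthwise 7x7) convolution. ParallelCBAM is the class defined in
    the previous sketch."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.attn = ParallelCBAM(dim)                 # modified CBAM
        self.norm = nn.LayerNorm(dim, eps=1e-6)       # channels-last LayerNorm
        self.pwconv1 = nn.Linear(dim, 4 * dim)        # pointwise expansion
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                             # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = self.attn(x)                              # attention on spatial features
        x = x.permute(0, 2, 3, 1)                     # to channels-last for LN/MLP
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)
        return shortcut + x
```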
2.5.4. Model Training and Evaluation Metrics
Following the training and optimization of the network structure, the images from the input dataset were utilized to assess the performance of the enhanced ConvNeXt model. Before being input into the model, all the body pattern images were resized to a uniform size of 224 × 224 pixels via bicubic interpolation. The enhanced ConvNeXt classification model was assessed by allocating the body pattern images of each cow in the dataset to the training, validation, and test sets in a 7:2:1 proportion. During training, the images fed into the model had a resolution of 256 × 256. To enhance algorithm performance and mitigate overfitting, the batch size was set to 32, and the number of epochs was set to 200. The AdamW optimizer was selected for training the model, with a learning rate of 0.0005 and a weight decay of 0.05.
After classification, the performance of the improved ConvNeXt model was comprehensively evaluated in terms of its precision, accuracy, recall, and F1 score. Precision is the ratio of correctly predicted positive samples to the total number of samples predicted as positive by the model; a higher value indicates a greater prediction capacity of the model. Precision can be calculated with Formula (10). Accuracy refers to the probability of correct predictions among all the samples and can be calculated with Formula (11). Recall is the proportion of true positive samples that are correctly identified out of all actual positive samples; a higher value indicates better prediction performance of the model. Recall can be obtained via Formula (12).
The F1 score is a metric used in statistics to evaluate the accuracy of binary classification (or multi-task binary classification) models. It simultaneously accounts for both the precision and recall of the classification model and can be regarded as a weighted average of the two. The maximum value is 1, and the minimum value is 0; the higher the value, the better the model. The F1 score can be calculated via Formula (13).
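For reference, the standard forms of these metrics, consistent with the descriptions above (with $TN$ denoting the number of true negative samples), are:

$$Precision = \frac{TP}{TP + FP}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$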
4. Discussion
4.1. Mechanistic Analysis of Depth-Enhanced Cattle Identification
This study employs RGB-D images to enhance the accuracy of individual dairy cow identification. The specific contributions of depth information to the accuracy, stability, and robustness of dairy cow identification are analyzed from two aspects: position normalization and motion robustness.
Firstly, regarding geometric invariance, depth information provides three-dimensional data about the cow's morphology, enabling the construction of a standardized representation based on spatial coordinates. RGB images are susceptible to variations in shooting angles, whereas depth data offer the true three-dimensional shape of the object. By utilizing depth information, the cow's body contour can be normalized to ensure consistent feature representation across different shooting angles. Moreover, depth images provide the distance from the depth camera to the cow along the Z-axis, which allows scale errors caused by varying shooting distances to be eliminated, thereby enhancing the stability of identification.
In terms of motion robustness, dairy cows are often not stationary during image capture, leading to deformation in torso images due to their movement. RGB images are prone to texture distortion when the cow is walking or slightly moving, while the depth channel can more stably capture the cow’s shape information.
In summary, depth information plays a significant role in the task of dairy cow identification, demonstrating superior stability and robustness in aspects such as position normalization and motion robustness.
4.2. Robust Analysis
To validate the robustness of the dairy cow individual identification method proposed in this study, the data in Dataset 2 were processed and analyzed to evaluate the robustness of the proposed algorithm under different scenarios. First, the RGB-D images of the cattle’s back were converted into point clouds, from which the key regions of the cattle’s back were extracted. These key regions were then transformed into binary body pattern images and fed into the improved ConvNeXt classification model to achieve body pattern classification. The final accuracy of individual identity recognition was 94.62%. Therefore, our research method demonstrates satisfactory performance in identifying dairy cows during the walking process.
To further validate the robustness of the proposed method, five dairy cows were randomly selected from the feeding dataset, and their top-view RGB-D information was captured while they were walking. The identity recognition model previously trained on the feeding dataset was then tested using these new data. A total of 50 depth images and 50 color images were captured for the five cows. These images were processed to generate 3D point clouds of the cows' backs, followed by the extraction of key regions and conversion into binary body pattern images for classification. The final identity recognition accuracy on this test set was 93.16%. Additionally, the binary body pattern images of the key regions from the same cows during feeding and walking were compared. During feeding, cows are not entirely stationary and may exhibit minor movements. Therefore, the back images captured during feeding, walking, and static feeding states were compared to further demonstrate the accuracy of the results. The key regions were extracted and converted into binary body pattern images for all three states, and the overlap of these images was visualized, as shown in Figure 21. The white regions extracted from the three states were displayed in blue, yellow, and green, respectively, and then overlaid, with the overlapping areas highlighted in red. The red overlapping area accounted for 87.62% of the blue area, 85.18% of the yellow area, and 90.24% of the green area. These results indicate that the cows' movements within the camera's field of view during top-view image capture have a minimal impact on identity recognition.
4.3. Analysis of the Performance of Individual Dairy Cow Identification
To demonstrate its superiority more intuitively, the proposed method is compared with alternative recognition approaches based on body pattern images, as detailed in Table 4. Although Xiao et al. (2022) [24] achieved a high recognition accuracy in the table, their method could hardly extract effective features for cows without body patterns or with small body patterns. The method proposed in this study increases the recognition accuracy for cows with no or small white body patterns by combining three-dimensional features with body pattern features. In the process of extracting the key regions and standardizing their shapes, blank areas were filled with white point clouds, which enhanced the features and contributed to greater recognition performance. The dataset in Shen et al. (2020) [34] encompasses cow features captured from a lateral view, where occlusion is a significant issue. He Dongjian and Liu Jianmin (2020) [23] fixed the camera at the entrance and exit of the milking area, which resulted in a limited shooting angle and the occurrence of dairy cow congestion. This led to a significant degree of bending deformation in the backs of some cows, causing them to be missed during identification. In contrast, the present study collected top-view RGB-D images. By performing posture normalization on the point cloud images and adjusting the angle of the cows' back images, the impact of the shooting angle is reduced, thus achieving higher robustness. Andrew et al. (2021) [17], Wang et al. (2024) [35], and our method all collected top-view back images. However, their accuracy rates were lower than that of the method presented in this study. Moreover, they collected RGB images, which are susceptible to the influence of perspective and environment, leading to missing image information. The present study, by collecting top-view RGB-D images, located the key points of the hook and pin bones on the back to determine the critical regions. Adjustments were made to these regions, making the body pattern images more comparable and enhancing the identity recognition features. Therefore, the research method proposed in this paper not only achieves high identity recognition accuracy and good recognition effects but also exhibits high robustness with respect to the position and angle of cattle in the top-view field of vision.
4.4. Error Analysis
The incorrect results of individual identification were counted and analyzed. Figure 22 illustrates the two cows with the lowest recognition rate in the dataset, both having an individual identification rate of 78%. After analysis, the factors contributing to the low identification rate can be summarized in two aspects: (1) there were many patches on these cows, which affected the processing of body pattern features; and (2) the extracted key area of the backs of these cows contained an excessive proportion of white parts, the black pattern region was small, and the corresponding recognition features were limited. Additionally, the activities of dairy cows can cause minor changes in the distribution and shape of the patterns, leading to a scarcity of pattern features and thereby reducing the accuracy of identity recognition. In this study, we conducted a statistical analysis of the cows in the dataset whose black body pattern area accounted for less than 10% of the total area. This subset constituted 7% of the total number of cows in the dataset. The accuracy of individual identity recognition for these cows was 85.2%, which is 12.75 percentage points lower than the overall recognition accuracy. However, this level of accuracy still enables the basic individual identification of dairy cows.
4.5. Future Research
Although the method proposed in this paper enabled highly accurate cow identification, there is still room for improvement. The camera remained stationary during data acquisition, and the torso area may fall outside the acquisition area due to small-amplitude movements of the cow during feeding, which could cause key areas to be missing. In future research, a tracking camera could be integrated with a platform for real-time image display and processing, enabling the remote acquisition and processing of dairy cow images in real time. The proposed method yields increased recognition accuracy for cows with a high proportion of black area, but the recognition accuracy is lower for cows with a high proportion of white area or for all-white cows. In future research, cows with a high proportion of white area could be identified by additional means, such as marking the torso or combining other identification characteristics, to increase the identification accuracy of individual cows.