1. Introduction
As dairy farming modernizes, the demand for precision production on farms is becoming increasingly prominent [1]. Holstein cows are the core population of dairy farming, yet identifying individuals by manual methods is inefficient and insufficiently accurate. The accuracy of identity data directly affects the productivity and economic benefits of dairy farms. Developing a fast, accurate intelligent identification system and applying it in the cattle farm environment is therefore particularly important. Individual identification of Holstein dairy cows is a core aspect of precision farming and plays a crucial role in automated management tasks such as body condition monitoring [2], disease early warning [3], and estrus detection [4].
Traditional identification methods include manual identification, which is inefficient and prone to errors, and radio frequency identification (RFID) [5]. RFID has been widely used in Holstein dairy farming [6], most commonly through electronic ear tags. However, ear tags are easily lost during cow activity [7], which increases replacement costs and may cause stress in cows [8]. RFID-based individual identification of Holstein cows therefore carries high economic costs and potential animal welfare problems.
Holstein cows and their crossbred progeny have black-and-white patched coats and are therefore also known as black-and-white cattle. The coat pattern differs between individuals and can serve as an identity trait, so each Holstein cow can be uniquely identified by image recognition techniques [9]. In livestock farming, computer vision is becoming pivotal for identification because it is non-intrusive, induces no stress, and is economical. It identifies individual Holstein cows automatically using a vision acquisition device and a computing device, without needing additional equipment. It can exploit textural features of the face, the unique muzzle pattern of Holstein cows [10,11], and details of the body contour. This technique safeguards animal welfare while improving the efficiency and accuracy of individual identification [12,13].
Research on computer-vision-based individual identification of Holstein dairy cows has made significant progress, providing technical support for herd management, health monitoring, and production efficiency. The technology raises identification efficiency and reduces the error rate of manual operation. For example, Fu et al. [14] introduced a non-invasive system for individual Holstein cow recognition based on an enhanced ResNet50 architecture integrated with the Ghost module and the convolutional block attention module (CBAM). The method significantly reduces the number of parameters and the model size while maintaining high accuracy. Yang et al. [15] proposed an individual cow recognition method that fuses RetinaFace with an improved FaceNet; using a MobileNet-enhanced RetinaFace and joint optimization of cross-entropy loss and triplet loss, it achieves accuracies of 99.50% and 83.60% on the test set. Unlike previous approaches that rely on a single feature, Hu et al. [16] attained a recognition accuracy of 98.36% on a dataset comprising 93 cow side views by extracting deep features from the head, torso, and legs as separate components and then fusing them with a deep part feature fusion technique.
The research above indicates that individual Holstein cows can be identified from their heads and trunks; however, the practical applicability of these methods is somewhat constrained [17,18]. Head and side images of Holstein cows taken at different angles are prone to distortion and occlusion and are thus unsuitable for individual recognition on large-scale farms. A feasible alternative is to acquire back images of Holstein cows from a top view. To this end, Xiao et al. [19] proposed a method that recognizes individual Holstein cows using an improved Mask R-CNN model and an SVM classifier. By analyzing top-view images, the technique raised recognition accuracy to 98.67%, addressed individual recognition in an unconstrained environment, and demonstrated potential for precise herd management. Similarly, Xu et al. [20] proposed a Bottleneck Transformer (BoTNet) model combining graph sampling and counterfactual attention learning, which optimizes feature capture in the back pattern region of Holstein cows, outperforms existing techniques on public datasets, and generalizes better, making it suited to individual recognition on large-scale farms. Wang et al. [21] addressed orientation diversity and partial occlusion in top-view recognition with an open-set method that integrates spatial feature transformation and metric learning; it achieved 94.58% recognition accuracy in the open-set setting and improved recognition of partially visible Holstein cows by optimizing the deep feature extraction module and the attention mechanism. These studies show that computer vision based on object detection and image recognition has significant advantages for Holstein cow recognition: it avoids stressing the animals, significantly reduces labor costs, and has good prospects for wider application [22].
However, because most Holstein cow back images in practical farming environments are captured from an overhead perspective, factors such as cow movement and variations in lighting can lead to inadequate extraction of image features and thereby reduce recognition precision. To address these problems in the traditional top-view recognition method, this study proposes a recognition scheme that acquires Holstein cow back images from different viewpoints and explores how different attention mechanisms and their fusion methods affect recognition performance. Random rotation and random occlusion are introduced in the data preprocessing stage to strengthen the model’s ability to recognize local image features and to improve its stability and reliability in practical applications.
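The two preprocessing strategies named above can be illustrated with a minimal sketch. It is not the authors’ pipeline: the function names, the nearest-neighbour rotation, the angle range, and the occlusion size limit (`max_frac`) are all assumptions chosen for illustration, and the image is a plain 2D grid of pixel values.

```python
import math
import random


def rotate_nearest(img, angle_deg, fill=0):
    """Rotate a 2D grid about its centre using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel that lands on (y, x).
            dx, dy = x - cx, y - cy
            sx = cos_a * dx + sin_a * dy + cx
            sy = -sin_a * dx + cos_a * dy + cy
            ix, iy = int(round(sx)), int(round(sy))
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def random_occlude(img, rng, max_frac=0.3, fill=0):
    """Blank one random rectangle spanning at most max_frac of each side."""
    h, w = len(img), len(img[0])
    oh = rng.randint(1, max(1, int(h * max_frac)))
    ow = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - oh)
    left = rng.randint(0, w - ow)
    out = [row[:] for row in img]
    for y in range(top, top + oh):
        for x in range(left, left + ow):
            out[y][x] = fill
    return out


def augment(img, rng, max_angle=180.0):
    """Apply a random rotation followed by a random occlusion (assumed order)."""
    angle = rng.uniform(-max_angle, max_angle)
    return random_occlude(rotate_nearest(img, angle), rng)
```

Applying `augment` to each training image with a fresh random draw exposes the network to rotated and partially hidden back patterns, which is the stated motivation for the two strategies.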
The main contributions of this study are as follows:
- A Holstein cow back image dataset covering different viewpoints is constructed from dairy farm data collected in a real production environment. It is used to verify the effectiveness of the proposed method under real production conditions and provides a data basis for future related research.
- A lightweight feature extraction network for Holstein cow individual recognition, CowBackNet, is proposed. By introducing a composite attention mechanism, the network maintains high recognition accuracy when the shooting angle changes, adapts better to external conditions, and remains stable and reliable across viewpoints.
- Compared with existing mainstream recognition models, CowBackNet shows significant advantages in recognition efficiency, model size, and classification accuracy. Its lightweight design and computational efficiency in particular provide more efficient inference and meet the deployment requirements of actual production.
- Gradient-weighted class activation mapping (Grad-CAM) is introduced to visualize the model’s decision-making process, significantly improving the interpretability of the individual recognition task. The heatmaps show the key regions the model focuses on when making a decision, providing a basis for further optimization.
  3. Results
  3.1. Experimental Setup and Parameters
The experimental platform in this study is a Linux server running the Ubuntu Server 22.04 operating system. The server has two Intel Xeon Gold 6139M processors clocked at 2.30 GHz, 128 GB of RAM, and eight NVIDIA GeForce RTX 3090 GPUs. The software environment includes Python 3.8.19, CUDA 11.3, and the deep learning frameworks PyTorch 1.10.1, MMEngine 0.10.4, and MMPretrain 1.2.0.
Table 3 outlines the training settings: the models were optimized using the Adam optimizer with an initial learning rate of 0.0005, a weight decay of 0.0001, and a batch size of 32.
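As an illustration only, the optimizer settings above could be written as an MMEngine-style Python config fragment. Only the three numeric values come from the text; the surrounding keys follow the usual `optim_wrapper` / `train_dataloader` layout and are an assumption about how the experiments were configured, and a complete config would need dataset, model, and schedule entries as well.

```python
# Hypothetical MMPretrain/MMEngine-style config fragment (assumed layout).
# Values taken from the text: lr = 0.0005, weight decay = 0.0001, batch size = 32.
optim_wrapper = dict(
    optimizer=dict(type='Adam', lr=0.0005, weight_decay=0.0001),
)

# Dataset, sampler, and worker settings are omitted from this fragment.
train_dataloader = dict(batch_size=32)
```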
  3.2. Selection of Feature Extraction Networks
This study evaluated ResNet50, ResNet101, MobileNetV2, MobileViT, and EfficientNetV2 as feature extraction networks and compared their test set performance. The results are detailed in Table 4. On our self-built CowBack test set, which contains back images of 22 Holstein cows acquired from different viewpoints, EfficientNetV2 achieves 76.61% top-1 accuracy, and the other models perform worse. On the Cows2021 dataset, which consists of back images of 155 Holstein cows captured from a top-down perspective, EfficientNetV2 again performed best, with a top-1 accuracy of 95.69% and a top-5 accuracy of 98.76%. ResNet50 performed worst but still achieved 91.46% top-1 accuracy and 96.78% top-5 accuracy, while MobileNetV2 and MobileViT performed slightly below EfficientNetV2. On the multi-view Cows2021_mix test set, which also covers 155 Holstein cows, EfficientNetV2 once more had the highest recognition accuracy, with a top-1 accuracy of 93.56% and a top-5 accuracy of 97.70%. This shows that EfficientNetV2 still performs best when back images of Holstein cows from a real production environment are mixed in. Based on the top-1 and top-5 accuracy results across the three datasets, EfficientNetV2 was chosen as the base network for feature extraction in this research. Figure 8 illustrates the evolution of top-1 accuracy during the training of each base model.
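For readers unfamiliar with the two metrics, the following sketch shows how top-1 and top-5 accuracy are commonly computed from per-class scores. The function name and the toy inputs are illustrative and are not taken from the paper’s code.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores.

    scores: one list of per-class scores for each sample.
    labels: the true class index of each sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example with three samples and three classes (hypothetical scores).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_labels = [1, 1, 0]
top1 = topk_accuracy(toy_scores, toy_labels, k=1)
top2 = topk_accuracy(toy_scores, toy_labels, k=2)
```

Top-5 accuracy is always at least as high as top-1 because a prediction counts as correct whenever the true cow is anywhere among the five best-scoring identities.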
  3.3. Single-Attention Mechanism
To validate the effectiveness of the selected attention mechanisms, this study compares several attention mechanisms on the CowBack, Cows2021, and Cows2021_mix datasets to evaluate their impact on classification performance. These attention mechanisms are CA, CBAM, and ECA, as shown in Table 5.
The experimental results show that replacing the original SE attention mechanism in the EfficientNetV2 model with CA, CBAM, or ECA improved performance in identifying individual cows in every case. CBAM and ECA improved top-1 accuracy on the CowBack dataset by 7.5 and 11.17 percentage points, respectively.
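To make the channel-attention idea concrete, here is a minimal pure-Python sketch of an ECA-style block as it is usually described: global average pooling per channel, a 1D convolution across neighbouring channels, a sigmoid, and channel-wise rescaling. The kernel weights are learned in practice; here they are passed in as a hypothetical argument, and the exact variant used in the paper may differ.

```python
import math


def eca(feature_map, kernel):
    """ECA-style channel attention on a list of C channels (each H x W).

    kernel: odd-length list of 1D convolution weights shared by all channels
    (hypothetical values; learned during training in a real network).
    Returns the rescaled feature map and the per-channel attention weights.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Local cross-channel interaction: zero-padded 1D convolution over channels.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(len(pooled)):
        z = sum(kernel[j] * padded[c + j] for j in range(len(kernel)))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate
    # Excite: scale every value in a channel by that channel's weight.
    scaled = [[[v * w for v in row] for row in ch]
              for ch, w in zip(feature_map, weights)]
    return scaled, weights
```

Unlike SE, this formulation uses no fully connected bottleneck, which is one reason ECA adds very few parameters.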
  3.4. Multi-Attention Fusion Mechanism
This section combines the advantages of the CBAM and ECA attention mechanisms. Specifically, it analyzes three attention fusion strategies: tandem (serial) fusion, weighted fusion, and weighted fusion combined with a residual connection (as shown in Table 6). The effects of these fusion strategies on recognition performance are evaluated through experiments on three datasets: CowBack, Cows2021, and Cows2021_mix.
The experimental results show that weighted fusion with a residual connection outperforms every single-attention mechanism on all the test datasets from different viewpoints. This fusion approach strengthens the model’s focus on channel information and also overcomes the shortcoming of the SE attention mechanism, which ignores spatial information. The improvement is most evident on the CowBack dataset, whose real-environment images contain complex backgrounds and cows in variable postures: there, the fusion approach improves top-1 recognition accuracy by 0.52 percentage points over using the ECA attention mechanism alone.
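The text does not give the exact fusion formula, so the following is only one plausible reading of "weighted fusion with a residual connection": the outputs of the two attention branches are mixed with softmax-normalized weights and the input is added back. The function names, the softmax normalization, and the flat-list representation of features are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(x, cbam_out, eca_out, logits=(0.0, 0.0), residual=True):
    """Weighted fusion of two attention outputs, optionally with a residual.

    x, cbam_out, eca_out: flattened feature vectors of equal length.
    logits: hypothetical learnable fusion parameters (equal weights by default).
    """
    w_cbam, w_eca = softmax(list(logits))
    fused = [w_cbam * a + w_eca * b for a, b in zip(cbam_out, eca_out)]
    if residual:
        # Residual connection: add the unmodified input back to the fused output.
        fused = [f + xi for f, xi in zip(fused, x)]
    return fused
```

Under this reading, tandem fusion would instead feed the output of one attention module into the other, with no explicit mixing weights.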
  3.5. Model Performance Analysis
To further evaluate the performance of CowBackNet, we added two benchmark models for comparison, namely EfficientNetV2 and EfficientNetV2 + ECA. The training loss curves over the number of iterations are shown in Figure 9. Figure 9 shows that the models tend to converge after 100 iterations. The loss of the CowBackNet model finally stabilizes at around 0.951, 0.457, and 0.54 on the CowBack, Cows2021, and Cows2021_mix datasets, respectively, which indicates that the improved model has a strong learning ability.
We then compared CowBackNet with mainstream recognition models, including ResNet, MobileNet, MobileViT, and EfficientNetV2, as well as with the EfficientNetV2 variants that use a single attention mechanism. Table 7 shows the statistical results. As can be seen from Table 7, the FLOPs and the number of parameters of CowBackNet are significantly lower than those of ResNet50 and ResNet101, which also have a residual network structure. Compared with the lightweight models MobileNet, MobileViT, and EfficientNetV2, CowBackNet has more parameters but achieves a higher accuracy rate. In addition, the top-1 accuracy of CowBackNet on the CowBack test set is 11.69, 4.93, 4.19, and 0.52 percentage points higher than that of EfficientNetV2, EfficientNetV2 + CA, EfficientNetV2 + CBAM, and EfficientNetV2 + ECA, respectively. In terms of computation, the FLOPs of CowBackNet are lower than those of all four of these models, and in terms of model size, CowBackNet is also smaller than all four, at 6.096 MB. This shows that the complexity of CowBackNet is significantly lower than that of the models above.
The data in 
Table 7 were analyzed using a radar chart, and the results are shown in 
Figure 10. Based on the above results and analysis, the proposed CowBackNet model has certain advantages in terms of accuracy, model size, and efficiency compared with mainstream classification models for the task of individual cow recognition.
  3.6. Model Interpretability Analysis
To fully evaluate the performance of CowBackNet for Holstein cow individual recognition in a realistic environment, we generated heatmap visualizations using gradient-weighted class activation mapping (Grad-CAM) [34]; the results for individual identification from back images are shown in Table 8. All Holstein cow samples in Table 8 come from the test set of the CowBack dataset, from which back images with salient features and with weak features were randomly selected. Salient features are those that are easily recognized and understood in an image, such as noticeable color patches or specific markings or labels. Weak features are less obvious or easily overlooked; they may not affect the interpretation and recognition of the image as directly as salient features but still have recognition value in some cases, for example subtle hair textures or slight color gradients. In addition to the original image, the recognition results are visualized on images after random rotation, after random occlusion, and from different viewpoints, in order to verify that the model still focuses on the same or similar regions of interest after these operations. The test set of the Cows2021 dataset, whose back images were acquired from a top view, was also used to visualize the recognition results for individual Holstein cows, as shown in Table 9. In the heatmaps, warm colors indicate regions to which the model pays strong attention, i.e., regions that the model treats as contributing more to the decision.
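The Grad-CAM computation behind such heatmaps can be sketched in a few lines, assuming the activations of the last convolutional layer and the gradients of the target class score with respect to them are already available. The function below is a simplified illustration with hypothetical inputs; the final scaling to [0, 1] is a common display convention rather than part of the method itself.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W list of lists.
    Returns an H x W map scaled to [0, 1] for display.
    """
    h, w = len(activations[0]), len(activations[0][0])
    # Channel importance: spatial average of each gradient map.
    alphas = [sum(map(sum, g)) / (h * w) for g in gradients]
    # Weighted sum of the feature maps.
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, activations):
        for y in range(h):
            for x in range(w):
                cam[y][x] += alpha * fmap[y][x]
    # ReLU keeps only regions with a positive influence on the class score.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam
```

Values near 1 in the returned map correspond to the warm regions of the heatmap when it is overlaid on the input image.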
When EfficientNetV2 was used to recognize individual Holstein cows in a realistic environment, it was susceptible to background interference and paid less attention to the back region of the cow. This is the main reason why the accuracy of the EfficientNetV2 model on the CowBack dataset is lower than on the top-view Cows2021 dataset. The CowBackNet heatmaps show the effect of the optimizations applied to EfficientNetV2 in two ways. First, for a given model, the Grad-CAM visualizations of transformed or noisy images differ little from those of the original image, and the model still captures the key information accurately, which indicates better generalization and resistance to interference. Second, when different models are compared on images that underwent the same operation, the warm regions of CowBackNet are more intense, indicating that CowBackNet pays greater attention to the critical feature regions and bases its decisions on the areas that matter for classification, which in turn affects recognition accuracy.
This result confirms the effectiveness of the optimizations, showing that CowBackNet strengthens the focus on the back region of Holstein cows while mitigating background interference. It also confirms that the multi-attention fusion mechanism significantly enhances the model’s context awareness, enabling it to localize and focus on key features on the back of Holstein cows more accurately. Even when the shooting angle changes, LightCBAM enables the model to adjust its channel and spatial attention automatically, so that it extracts back features more accurately under different viewpoints and recognizes individual cows more reliably. In conclusion, this approach further optimizes feature extraction by filtering and suppressing irrelevant background noise, thus improving the robustness and accuracy of the overall model. The heatmap evaluation shows that these optimizations help improve recognition performance for individual Holstein cows in a fully realistic environment.
  4. Discussion
In this study, the proposed CowBackNet model achieves 88.30%, 95.86%, and 94.32% top-1 accuracy on the CowBack, Cows2021, and Cows2021_mix test sets, respectively. As can be seen from Table 4 and Table 6, compared with the EfficientNetV2 model, CowBackNet improves top-1 accuracy by 0.17 and 0.76 percentage points on the Cows2021 and Cows2021_mix test sets, respectively, while the improvement on the CowBack test set reaches 11.69 percentage points. This large gain indicates that CowBackNet generalizes better when processing Holstein cow back images from a fully realistic environment. Besides accuracy, computational efficiency and storage requirements are key factors in practical applications. As shown in Table 7, the FLOPs of the CowBackNet model are 0.727 G, indicating a moderate computational demand at inference that makes the model suitable for resource-constrained environments. In addition, the model’s overall size is only 6.096 MB, which further reduces storage pressure and improves deployment flexibility. This combination lowers running costs and facilitates use on mobile devices and embedded systems, enabling efficient individual identification of Holstein cows in real-time scenarios. CowBackNet thus uses resources efficiently while maintaining high accuracy, demonstrating its potential for modern Holstein dairy cow management systems.
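The reported gains are absolute differences in top-1 accuracy (percentage points), which can be checked directly from the figures quoted in Sections 3.2 and 4. The short sketch below reproduces that arithmetic and, for contrast, also computes the relative gain on CowBack; the variable names are illustrative.

```python
# Top-1 accuracies (%) quoted in the text for each test set.
efficientnetv2 = {"CowBack": 76.61, "Cows2021": 95.69, "Cows2021_mix": 93.56}
cowbacknet = {"CowBack": 88.30, "Cows2021": 95.86, "Cows2021_mix": 94.32}

# Absolute gain in percentage points, rounded to two decimals.
gains = {name: round(cowbacknet[name] - efficientnetv2[name], 2)
         for name in efficientnetv2}

# Relative gain on CowBack, as a percentage of the baseline accuracy.
relative_cowback = 100.0 * (cowbacknet["CowBack"] - efficientnetv2["CowBack"]) \
    / efficientnetv2["CowBack"]
```

Keeping the two notions apart matters because a gain of 11.69 percentage points on a 76.61% baseline corresponds to a relative improvement of roughly 15%.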
Furthermore, the CowBackNet model performs well in multi-viewpoint individual recognition, accurately recognizing individual Holstein cows in the vast majority of cases. As shown in Figure 11, cow001, cow003, cow004, cow008, cow009, cow010, cow012, cow013, cow014, and cow016 all have 100% recognition accuracy, and cow005, cow006, cow007, cow011, cow015, cow017, cow018, cow020, cow021, and cow022 achieve more than 80% recognition accuracy. Cow002 and cow019 have the lowest recognition accuracy, mainly because the back images of these two cows contain relatively few features, which makes recognition difficult for the model. Despite the model’s strong performance across multiple viewpoints, misrecognition still occurred in a few specific viewpoints. Some examples of misrecognition in the CowBack test set are shown in Table 10. Because of changes in shooting angle, some images lose viewpoint-related features (e.g., cow006, cow011), which leaves essential features incomplete, affects feature extraction, and thus leads to misrecognition. In addition, background interference is particularly prominent in large-scale farm environments, especially when other Holstein cows are present (e.g., cow005, cow015). The complexity of the background and the presence of other cows can make it difficult for the model to distinguish the unique back features of different individuals, which may reduce recognition accuracy. Rapid movement of Holstein cows is another important cause of recognition errors. The motion blur produced by fast walking (e.g., cow007, cow022) removes detail from the image, so the key features of the cow’s back cannot be captured accurately, further affecting the model’s recognition ability.
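Per-cow accuracies of the kind plotted in Figure 11 can be derived from paired lists of true and predicted identities. The sketch below shows one straightforward way to do this; the function name and the example labels are hypothetical and do not reproduce the paper’s predictions.

```python
from collections import defaultdict


def per_class_accuracy(true_ids, predicted_ids):
    """Return {cow_id: accuracy in %} over all test images of each cow."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(true_ids, predicted_ids):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {cow: 100.0 * correct[cow] / total[cow] for cow in total}


# Hypothetical predictions for two cows with two test images each.
example = per_class_accuracy(
    ["cow001", "cow001", "cow002", "cow002"],
    ["cow001", "cow001", "cow002", "cow019"],
)
```

Breaking accuracy down per animal in this way is what exposes hard cases such as cows with few distinctive back features, which an overall top-1 figure would hide.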
As shown in Figure 12, the recognition accuracy of cow037, cow040, cow047, cow054, and cow078 is below 80%, and that of a few individuals such as cow092, cow098, cow116, and cow138 reaches only about 20–40%, while the rest reach about 80–100%. Thus, despite the strong performance of the CowBackNet model on top-view images, some samples in the Cows2021 test set were still misrecognized. Table 11 shows some examples of these misrecognized samples. Analyzing them, we observe that low texture complexity, image blurring, information loss, background interference, and pose changes directly affect the model’s recognition precision. Low texture complexity refers to the lack of significant texture variation in an image, which leaves less valid information and makes it difficult to extract detailed features. Because the backs of Holstein cows have extensive black areas, their distinctive features are found predominantly in the white-patterned regions. When the white areas are small, as in the images of cow050, cow109, and cow154, the subtle differences in these white patterns between individuals make it very difficult to distinguish one Holstein cow from another, which increases the model’s recognition difficulty. In addition, the movement of Holstein cows leads to image blurring, e.g., in the images of cow001 and cow138; different parts of the image can no longer be clearly distinguished and too little useful information can be extracted, which ultimately leads to recognition errors. Information loss is another major cause of recognition errors. The images of cow005, cow077, and cow092 have different degrees of information loss, leading to incomplete feature extraction. This loss seriously interfered with the model’s recognition of individual features and in turn lowered its recognition rate. Background interference (cow054) and pose variation (cow116) are also key factors in misrecognition.
In summary, textural features [35] on the back of Holstein cows vary from breed to breed and pose different challenges for individual recognition. This study considers back images with salient features and with weak features together, and enhances the diversity of the training data through random rotation, random occlusion, and other methods, so that the model learns both global and detailed features and improves, as far as possible, its recognition of weak-feature images with solid colors or no obvious patterns.
However, because weak-feature images lack strong texture features and have low contrast, feature extraction with the proposed model is still less effective on them than on salient-feature images. Future research could combine visual images with other modalities, such as near-infrared images, temperature data, or depth images, to form multimodal inputs that compensate for the deficiencies of solid-color images and help the model discriminate better through the additional channel information. Alternatively, transfer learning could apply deep networks pre-trained on other tasks (e.g., face recognition, object detection) to cow recognition, so that more general features help with weak-feature images that are difficult to distinguish. Due to time and resource constraints, we were not able to explore these directions in the current study. The main consideration is that integrating multimodal information faces several challenges. First, acquiring data in different modalities is complicated and requires additional equipment and technical support, which costs time and money. Second, integrating data from different modalities effectively requires complex algorithms and model design to ensure that the information is complementary and consistent, which is technically challenging. Third, multimodal inputs may require larger training datasets and longer training times before models learn effective features, which may be difficult to achieve with limited resources. In addition, the Holstein cow back images were obtained from pen 2 only; although the images of adult Holstein cows were collected to reflect the various situations of the production environment as far as possible, they may not cover all environmental conditions, which can affect accuracy.
Subsequent studies could consider collecting images at different dairy farms and evaluating the effects caused by indoor and outdoor lighting environments to improve the individual cow recognition model using a more diverse dataset. Future work will aim to extend the dataset and optimize the model structure to enhance the model’s usefulness in natural production environments and address the challenges of applications in this setting.