1. Introduction
Accurately detecting pain in dairy cattle is an unresolved and critical problem in animal welfare. Animals can suffer from pain caused by various health conditions, such as mastitis, metritis, lameness, and infections, as well as by environmental factors. Pain erodes the quality of life of the animals and silently drains farm productivity through reduced milk yields, impaired immunity, and higher veterinary costs, creating an ethical and economic imperative to intervene early [1]. Evolution, however, has prepared cattle to mask their distress [2]. As prey animals, they minimize conspicuous behaviors that might attract predators, so their overt pain signals are rare, fleeting, and low-amplitude. Conventional tools such as locomotion scoring, heart-rate telemetry, and vocal-sound counting often miss these subtle cues and demand invasive devices or continuous expert oversight, making them impractical for large herds and commercial barns [3].
Researchers from different domains have been exploring animal welfare and production more actively in recent years, and lately, both farmers and researchers have been leaning towards automated, non-invasive solutions for dairy cattle management. Identifying and addressing stress and pain in dairy cows from implicit cues by analyzing body movements, gestures, behaviors, and facial expressions are some applications in this domain [4]. Researchers have shown that discomfort or pain in dairy cows can arise from several types of health conditions and diseases, as well as from farming operations and environmental factors, such as crowding, heat stress, poor technology, and rough handling [5]. Although manual and sensor-based animal welfare monitoring has been popular for decades, computer-vision-based approaches and multimodal data analyses combining sensor and visual information show significant improvements in automated systems for animal monitoring and emotion recognition [6,7,8,9,10].
Over the past decade, researchers have turned to facial expression analysis as a non-intrusive lens for mammalian nociception (i.e., the process of detecting and transmitting signals in the nervous system in response to potentially damaging stimuli) [11,12]. Studies have been conducted on large domestic animals like horses, pigs, and cows to monitor their facial expressions from images, video frames, and real-time observations during different types of pain (i.e., painful diseases, dental treatments, lameness, castration, laparotomy, farrowing, etc.) and to assess and analyze their pain scores using both manual and automated systems [13,14,15]. In sheep, pigs, and horses, discrete Facial Action Units (FAUs) correlate reproducibly with both natural pain and experimentally induced pain, suggesting that rapid cranio-facial reflexes can outsmart voluntary suppression [16,17,18]. The biological rationale is well established: nociceptive input propagates through cranial nerves V (i.e., the trigeminal nerve) and VII (i.e., the facial nerve), driving reflex contractions in the periocular, perinasal, and perioral muscles that remain difficult to inhibit at will [19,20]. Even species that dampen their lameness or visceral discomfort cannot entirely hide these muscle activations, which arise milliseconds after noxious stimulation [21,22].
Bovine behaviorists have cataloged a repertoire of such markers of long- and short-term discomfort in cattle of different breeds and ages [23,24,25]. Ear position asymmetry, orbital tightening, tension above the eye, nostril flare, and mouth-strain geometry all increase after painful procedures like dehorning or mastitis induction [26,27]. The Calf Grimace Scale (CGS) formalizes six FAUs with high inter-observer agreement, giving practitioners a shared lexicon for pain scoring in nonhuman mammals [28,29]. Even with a high volume of exploratory research, these protocols remain fundamentally static. Each frame is paused, examined, and annotated by trained observers, throttling the throughput, embedding subjective bias, and discarding the temporal dynamism that may distinguish short-lived discomfort from benign facial idiosyncrasy [30].
Artificial intelligence (AI) studies have begun to automate facial analysis, but most still use coarse labels such as ‘pain’ or ‘no pain’ assigned to entire images or videos, ignoring the sub-second time structure that characterizes acute distress [31]. This omission matters because the most diagnostic events are the involuntary micro-expressions—muscle twitches lasting between 1/25 s and 1/3 s—that psychologists and security agencies have long exploited to unveil concealed emotion in humans [32,33,34,35].
Human-factors engineering has already harnessed micro-expression analytics for driver-state monitoring [36]. In advanced driver-assistance systems, lightweight convolutional–recurrent networks ingest live video, amplify tiny pixel motions through optical flow or Eulerian magnification, and flag drowsiness or aggression with frame-level precision [37]. Architectures such as MobileNetV2 or RM-Xception funnel spatial features into LSTM heads, achieving millisecond responsiveness while running on edge devices [38]. Recent work has enhanced the sensitivity of these systems through time-series fusion and attention-weighted pooling, preserving accuracy across faces, lighting modes, and camera angles [39,40].
Our hypothesis is that these temporal-expression architectures can be transplanted to cattle after anatomical calibration. Bovine facial musculature is simpler than human musculature yet still offers enough contractile diversity in the eye, ear, and muzzle regions to betray the nociceptive load through brief asymmetries and aperture changes [41]. Early feasibility studies in livestock vision hinted at this possibility, but they either relied on static landmarks or used frame-level models without explicit temporal reasoning [42,43,44].
To scrutinize this hypothesis, we designed an end-to-end temporal vision system tailored to barn conditions (Figure 1). A YOLOv8-Pose [45] backbone isolated the face and placed thirty anatomically coherent landmarks with real-time throughput and robust performance in oblique or overhead camera views. Region of Interest (ROI) patches from the eyes, ears, and muzzle were fed into a pretrained MobileNetV2 [46] encoder that condensed each frame into a 3840-dimensional descriptor sensitive to fine-grained gradients. A 128-unit LSTM [47] stitched together five-frame sequences, learning motion trajectories that separated nociceptive twitches from benign facial jitter [48]. At the video level, probability averaging and burst-density heuristics tempered false alarms, borrowing the confidence-weighting logic from driver-monitoring systems.
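A minimal sketch of this per-frame feature extraction and sequence classification is given below. It assumes that the 3840-dimensional descriptor is the concatenation of three 1280-dimensional MobileNetV2 embeddings (one per ROI group), which matches the reported dimensionality but is our reading rather than a stated implementation detail; the weight file name and the crop_eye_ear_muzzle helper are hypothetical.

```python
# Sketch of the frame-to-sequence pipeline (illustrative names, not the authors' released code).
import numpy as np
import tensorflow as tf
from ultralytics import YOLO

pose_model = YOLO("cow_face_pose.pt")   # assumed custom 30-keypoint YOLOv8-Pose weights
encoder = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(96, 96, 3))   # 1280-d embedding per ROI

def frame_descriptor(frame_bgr):
    """Detect the face, crop eye/ear/muzzle ROIs, and return a 3 x 1280 = 3840-d vector.
    Assumes exactly one cow face is detected in the frame."""
    result = pose_model(frame_bgr, verbose=False)[0]
    keypoints = result.keypoints.xy[0].cpu().numpy()        # (30, 2) landmark coordinates
    rois = crop_eye_ear_muzzle(frame_bgr, keypoints)        # hypothetical helper -> 3 patches
    feats = []
    for roi in rois:
        roi = tf.image.resize(roi, (96, 96))
        roi = tf.keras.applications.mobilenet_v2.preprocess_input(roi)
        feats.append(encoder(tf.expand_dims(roi, 0)).numpy().ravel())
    return np.concatenate(feats)                            # 3840-dimensional descriptor

def classify_sequence(lstm_model, five_frames):
    """Five consecutive frame descriptors form one input sequence for the LSTM classifier."""
    seq = np.stack([frame_descriptor(f) for f in five_frames])        # (5, 3840)
    return float(lstm_model.predict(seq[None, ...], verbose=0)[0, 0])  # pain probability
```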
This investigation therefore bridges two previously siloed fields: livestock pain phenotyping and human micro-expression AI. This research contributes three advances.
First, it provides the first systematic evaluation of micro-expression dynamics as a pain biomarker in dairy cattle, bringing a fine-timescale analysis into a domain dominated by static scoring;
Second, it develops a sequential process for visual data collection, processing, and species-specific annotation, creating a foundation for transfer learning across breeds and lighting environments;
Third, it demonstrates a deployable low-latency pipeline aligned with the compute constraints of on-farm edge devices, moving pain detection from sporadic manual checks to continuous surveillance.
By capturing the vanishingly brief facial echoes of nociception, the system aims to trigger earlier interventions, elevate welfare metrics, and improve both ethical resilience and economic efficiency in the dairy sector. The preliminary study conducted on data collected from Canadian dairy farms showed a promising performance, reflecting the potential for full-scale AI-based solutions for automated and non-invasive cattle discomfort detection to enhance animal welfare and farm productivity.
The rest of this paper is organized as follows:
Section 2 discusses the methodology used for the experiments on facial keypoint detection and pain recognition;
Section 3 shows and analyzes the experimental results;
Section 4 explains the implications of these results; and finally,
Section 5 concludes this paper with future research directions.
3. The Experimental Setup and Results
In this section, we present a comprehensive evaluation of the proposed automated pain detection pipeline, implemented in Python 3 in Google Colab, encompassing detailed quantitative and qualitative analyses. We systematically examine the performance of each pipeline component, from face and landmark detection to temporal sequence classification and video-level inference, and provide critical insights into the system’s strengths, limitations, and real-world applicability. As part of an ongoing extensive research project on real-time cow pain detection, the purpose of this study is to provide an initial validation of our research concept and to implement and test the validity of the complete pipeline. Hence, this study and implementation focus on a small annotated subset of our data collection and provide limited results. Our full-scale deployment will include the complete annotated dataset with improvements in the parameters and thresholds appropriate for the complete set of data. It will also provide a robust pipeline with complete performance evaluations of the models trained, validated, and tested on the whole dataset before our pilot deployment in dairy farms.
3.1. Performance Metrics
In this experimental setup with the YOLOv8-Pose, MobileNetV2, and LSTM models, the performance metrics computed in scores and graphs were the confusion matrix, accuracy, precision, recall, F1-score, and mean average precision (mAP). The graphs were generated by the models using the default legends and notations. The confusion matrix shows four measurements: the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These scores indicate
TP = accurately predicted positive values;
TN = accurately predicted negative values;
FP = inaccurately predicted positive values;
FN = inaccurately predicted negative values.
The accuracy of a model provides the ratio of correct predictions; precision shows the accuracy of positive predictions; recall represents the ability of the model to identify actual positive values; and the F1-score is the harmonic mean of precision and recall. The equations (Equations (2)–(5)) are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$

The average precision (AP) shows the area under the precision–recall curve for a class, and the mean average precision (mAP) is the mean of the AP across all of the classes. Equation (6) shows the mAP computation, where $C$ is the total number of classes, $N_i$ is the total number of precision–recall points for class $i$, and $P_{ij}$ is the precision at the $j$th recall level for class $i$:

$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{AP}_i = \frac{1}{C} \sum_{i=1}^{C} \frac{1}{N_i} \sum_{j=1}^{N_i} P_{ij} \qquad (6)$$
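For illustration, the metrics of Equations (2)–(6) can be computed from raw counts as sketched below. The function names are ours rather than the study’s evaluation code; the example call reuses the sequence-level validation counts reported later in Section 3.3.2.

```python
# Illustrative computation of Equations (2)-(6) from raw counts (not the authors' evaluation code).
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def mean_average_precision(precisions_per_class):
    """precisions_per_class: one list per class of precision values sampled at that
    class's recall levels (the P_ij terms in Equation (6))."""
    ap = [sum(p) / len(p) for p in precisions_per_class]   # AP_i per class
    return sum(ap) / len(ap)                                # mean over the C classes

# Example with the sequence-level validation counts reported in Section 3.3.2:
print(classification_metrics(tp=929, tn=765, fp=3, fn=3))
# -> approximately (0.9965, 0.9968, 0.9968, 0.9968)
```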
3.2. Face and Landmark Detection Performance
3.2.1. Model Training and Evaluation
To enable precise localization of the facial regions and extraction of the keypoints from dairy cow videos, the customized YOLOv8-Pose model mentioned earlier was trained on a dataset comprising 1997 manually annotated images of cows’ faces with dimensions of 1080 × 1920. This dataset is an annotated subset of our whole set of collected data. Each annotation included a bounding box for the face and 30 anatomically relevant facial landmarks. The model’s performance was rigorously evaluated using the standard object detection and pose estimation metrics.
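A hedged sketch of such a training run with the Ultralytics API is shown below; the dataset configuration file name, epoch count, and training image size are illustrative assumptions rather than the exact settings used in this study.

```python
# Sketch of fine-tuning YOLOv8-Pose for cow face and 30-landmark detection (assumed settings).
from ultralytics import YOLO

model = YOLO("yolov8n-pose.pt")      # pretrained pose checkpoint used as the starting point
model.train(
    data="cow_face_pose.yaml",       # hypothetical dataset config; must declare kpt_shape (e.g., [30, 3])
    epochs=100,                      # assumed; not the reported setting
    imgsz=640,                       # assumed; the source frames are 1080 x 1920
)
metrics = model.val()                # reports box/pose precision, recall, mAP@0.50, and mAP@0.50-0.95
```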
3.2.2. Bounding Box Detection
The YOLOv8-Pose model demonstrated exceptional localization capabilities. At a confidence threshold of 0.969, the model achieved a bounding box precision of 1.00, indicating no false positives at this threshold. The recall was 0.96 at a threshold of 0.0, reflecting the model’s ability to detect nearly all true cow faces across the dataset. The F1-score, which balances precision and recall, peaked at 0.95 around a confidence value of 0.505, suggesting a robust performance across varying thresholds. These results are visualized in Figure 7, Figure 8 and Figure 9, which display the precision, recall, and F1-score curves, respectively, as functions of the model’s confidence output.
3.2.3. Pose Estimation and Landmark Localization
For pose estimation, the model’s precision and recall both exceeded 0.85 across a wide range of confidence values, with a maximum F1-score of 0.90 at a confidence threshold of 0.838. This indicates reliable and consistent detection of facial landmarks, even under challenging real-world conditions, such as variable lighting, occlusions, and diverse cow postures. The mAP at an IoU = 0.50 reached 0.969 for bounding boxes and 0.838 for keypoint detection. The more stringent mAP@0.50–0.95 scored 0.899 for cow face detection and 0.590 for keypoint detection (Figure 10). These metrics confirm the model’s capability to not only detect cows’ faces with high accuracy but also to precisely localize facial landmarks, critical for downstream pain recognition tasks.
3.2.4. Implications for Downstream Processing
This robust detection and localization performance provided high-quality input for subsequent stages of the pipeline. The YOLOv8-Pose model was deployed as the backbone of the frame-level preprocessing stage, reliably detecting cows’ faces and extracting 30 facial keypoints per frame. These keypoints were then used to define regions of interest (the eyes, ears, and mouth) for feature extraction and temporal sequence modeling, forming the foundation for a micro-expression-based pain analysis.
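One plausible implementation of this keypoint-to-ROI step (the crop_eye_ear_muzzle helper assumed in the earlier pipeline sketch) is given below; the landmark index groups and padding are illustrative assumptions, since the exact keypoint-to-region mapping is not specified here.

```python
# A plausible way to turn the 30 detected keypoints into eye/ear/muzzle ROI crops.
# The keypoint index groups and padding are illustrative assumptions.
import numpy as np

ROI_GROUPS = {                        # hypothetical index assignments for the 30 landmarks
    "eyes": range(0, 10),
    "ears": range(10, 20),
    "muzzle": range(20, 30),
}

def crop_eye_ear_muzzle(frame, keypoints, pad=20):
    """frame: HxWx3 image; keypoints: (30, 2) array of (x, y) landmark coordinates."""
    h, w = frame.shape[:2]
    crops = []
    for name, idx in ROI_GROUPS.items():
        pts = keypoints[list(idx)]
        x0, y0 = np.maximum(pts.min(axis=0) - pad, 0).astype(int)
        x1, y1 = np.minimum(pts.max(axis=0) + pad, [w - 1, h - 1]).astype(int)
        crops.append(frame[y0:y1, x0:x1])   # padded bounding box around each landmark group
    return crops
```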
3.3. LSTM Model Training and Classification Performance
3.3.1. LSTM Convergence and Learning Dynamics
The LSTM model was trained to classify ‘pain’ versus ‘no pain’ sequences based on the extracted facial features. The training and validation accuracy curves (Figure 11a) reveal rapid convergence: the validation accuracy surpassed 98% by epoch 5 and steadily approached 99.5% by epoch 30. The close alignment between the training and validation curves indicates minimal overfitting and suggests that the LSTM model effectively learned to recognize the temporal patterns in pain-related facial behavior while maintaining strong generalization. Although the training and validation performance is promising, we should keep in mind that the model used a limited amount of data, and the frames chosen for training and validation shared some similarities, as they were drawn from sequential frames.
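A minimal Keras sketch of such a 128-unit LSTM classifier over 5-frame, 3840-dimensional sequences is shown below; the dropout rate, optimizer, and batch size are illustrative assumptions, while the 30 training epochs mirror the convergence horizon described above.

```python
# Minimal Keras sketch of the 128-unit LSTM sequence classifier (assumed hyperparameters).
import tensorflow as tf

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(5, 3840)),           # 5-frame sequences of 3840-d descriptors
    tf.keras.layers.LSTM(128),                        # 128-unit temporal encoder
    tf.keras.layers.Dropout(0.3),                     # assumed regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),   # pain probability
])
lstm_model.compile(optimizer="adam",
                   loss="binary_crossentropy",
                   metrics=["accuracy"])
# history = lstm_model.fit(X_train, y_train, validation_data=(X_val, y_val),
#                          epochs=30, batch_size=32)   # hypothetical training call
```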
3.3.2. Quantitative Evaluation on Validation Data
The LSTM model’s performance was quantified further using a confusion matrix (Figure 11b), which showed 765 true negatives (TNs) and 929 true positives (TPs), with only 3 false positives (FPs) and 3 false negatives (FNs). Hence, the model achieved 0.9965 accuracy and 0.9968 for all three other metrics: the precision, recall, and F1-score.
These metrics demonstrate that the LSTM model achieved a near-perfect performance on the validation set. The high precision and recall indicate that the model was both highly sensitive to pain sequences and highly specific in avoiding false alarms. The F1-score, harmonizing precision and recall, further confirms balanced and robust classification.
3.3.3. Interpretation and Limitations
It is crucial to note that these metrics were computed on a validation set derived from the same distribution as the training data, using an 80:20 stratified split. While these results confirm the model’s learning capacity and temporal pattern recognition, they do not fully guarantee generalizability to novel, real-world scenarios. Therefore, further evaluations on completely unseen videos were conducted to assess the model’s robustness in practice.
3.4. Qualitative Visualization of the Frame-Level Inference
3.4.1. Frame Analysis for Pain Probability
To elucidate the model’s decision-making process, individual frames from both ‘pain’ and ‘no pain’ videos were visualized and annotated with predicted labels and confidence scores. In a correctly classified ‘pain’ frame (Figure 12a), the model assigned a pain probability of 1.00, with a red bounding box highlighting the detected face. Overlaid landmarks, color-coded by anatomical region, revealed pronounced facial tension and changes around the eyes and muzzle—features consistent with the expression of pain. Conversely, a correctly classified ‘no pain’ frame (Figure 12b) displayed a pain probability of 0.00, with a green bounding box and evenly distributed landmarks indicative of a relaxed facial state. These visualizations underscore the model’s ability to distinguish subtle behavioral cues associated with pain, even in the presence of various types of background noise and other environmental feature variations.
3.4.2. Robustness to Environmental Variability
The model’s consistent keypoint localization across diverse frames contributed to the robustness of downstream LSTM-based sequence modeling. Importantly, the system maintained high accuracy despite common farm environment challenges, such as feeding bars, equipment, and fluctuating illumination. This resilience is critical for real-world deployment, where controlled laboratory conditions cannot be assumed.
3.5. Inference Performance on Unseen Videos
To evaluate the generalizability of the trained model beyond the validation set, we conducted inference on a collection of 14 previously unseen videos. These videos were recorded under similar farm conditions but were not used during model training or hyperparameter tuning. The inference pipeline was applied in full, including frame-wise landmark detection, region-based feature extraction, sequential buffering, and classification via the pretrained LSTM model. For each video, the model predicted a binary ‘pain’ or ‘no pain’ label for each sequence of five consecutive frames, and the overall video-level decision was determined by aggregating these predictions. Specifically, a video was classified as ‘pain’ if the proportion of its frame sequences labeled as pain exceeded a fixed threshold of 30% (see Equation (1)).
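The aggregation rule can be expressed as the short sketch below; the per-sequence probability cutoff of 0.5 is an assumption, while the 30% pain-ratio threshold follows Equation (1).

```python
# Sketch of the fixed-threshold video-level aggregation rule (Equation (1)).
def video_level_decision(sequence_probs, prob_threshold=0.5, pain_ratio_threshold=0.30):
    """sequence_probs: per-sequence pain probabilities from the LSTM for one video."""
    pain_flags = [p >= prob_threshold for p in sequence_probs]   # 0.5 cutoff is an assumption
    pain_ratio = sum(pain_flags) / len(pain_flags)
    label = "pain" if pain_ratio > pain_ratio_threshold else "no pain"
    return label, pain_ratio
```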
The detailed inference results are shown in Table 1, which includes per-video statistics such as the total number of analyzed frames, the number of frames predicted as pain, the computed pain ratio, and the final label. Out of the 14 test videos, 9 were annotated as ‘pain’ and 5 as ‘no pain’ based on the expert assessment. The resulting confusion matrix is shown in Figure 13, whereas the detailed classification scores are presented in Table 2. The model correctly predicted 5 out of 9 ‘pain’ videos and 4 out of 5 ‘no pain’ videos, and after aggregating the total correctly predicted ‘pain’ and ‘no pain’ classes (i.e., 9 out of 14), the model provided an overall accuracy of 64.3% at the video level.
Quantitatively, the inference model achieved a precision of 0.83 for the ‘pain’ class, indicating that when the model predicts ‘pain’, it is often correct. However, the recall was lower at 0.56, suggesting that the model failed to detect a significant fraction of the actual ‘pain’ cases. The F1-score for the ‘pain’ class was 0.67, while the ‘no pain’ class had an F1-score of 0.62. The macro-averaged precision and recall were 0.67 and 0.68, respectively, and the overall weighted F1-score across both classes was 0.65. These values reflect a moderate performance for a continuous ‘pain’/‘no pain’ detection task in a challenging, real-world setting.
A closer look at the misclassified ‘pain’ videos reveals important insights. For instance, pain_421(3) and pain_255(4) were incorrectly classified as ‘no pain’, despite being ground-truth ‘pain’ samples. However, their predicted pain ratios were 24.6% and 23.6%, respectively—hovering just below the decision threshold (i.e., 30%). Pain_255(5), with a pain ratio of only 10.2%, was also misclassified for the same reason. These examples suggest that certain pain expressions were either too brief or too subtle to influence the overall sequence-level predictions sufficiently. The use of a rigid 30% threshold may therefore be too coarse, potentially ignoring pain patterns that are temporally sparse but clinically significant.
To address the challenge of balancing sensitivity and specificity in video-level pain classification, future iterations of the system could move beyond fixed-threshold rules toward more context-aware decision mechanisms. It is also important to consider the impact of domain shifts between the training and test videos. While the environment and camera setup were kept similar, individual differences in the cows’ appearance (i.e., fur color, ear position, facial asymmetry) as well as lighting variation and partial occlusion (i.e., feeding bars) may have affected landmark/keypoint detection and subsequently downstream feature extraction. Such variability can cause slight misalignment in landmark localization and micro-expression analysis, especially for sensitive regions like the eyes and mouth, ultimately affecting the classification accuracy.
Despite these limitations, the model’s ability to correctly classify the majority of the test videos, including several with high confidence, demonstrates the effectiveness of our pipeline under non-ideal conditions. These results also underscore the need for more diverse training data, robust data augmentation, and possibly ensemble decision strategies when deploying the system in production environments.
4. Discussion
4.1. A Summary of Our Observations
Despite using a limited amount of data, the present study demonstrates that automated detection of pain in dairy cattle using facial micro-expression analysis is not only feasible but also highly promising, as evidenced by the strong sequence-level validation accuracy of our LSTM-based system. The primary intention of this small-scale experiment is to use this as a proof-of-concept to support our original hypothesis on transplanting temporal-expression architectures to identify pain in dairy cattle. However, the translation of this high accuracy to a robust, real-world video-level performance remains a significant challenge. The observed drop from nearly perfect sequence-level results to a more moderate 64.3% video-level accuracy on unseen data underscores several fundamental issues that must be addressed for practical deployment and clinical utility.
Our findings suggest that pain in cattle is not merely observable to the trained human eye but is also computationally accessible, even when the animal’s evolutionary instincts drive it to suppress outward signs of discomfort. The ability to detect pain through short sequences of facial micro-expressions—movements lasting only fractions of a second—opens new possibilities for real-time, non-intrusive welfare monitoring. This capability is particularly transformative in large-scale farming environments, where observation of individual animals is often impractical and early intervention can make a profound difference in terms of health outcomes and quality of life.
The implications of this work extend far beyond technical achievement. By providing a scalable, automated means to monitor pain, our system lays the groundwork for individualized welfare baselines, where each cow’s unique pain signature can inform tailored care strategies. In the future, such technology could enable precision interventions, such as the automated administration of analgesics when specific pain patterns are detected or the creation of auditable welfare records that support ethical supply chains and consumer transparency. These advances hold the promise to not only improve animal health and productivity but also address the growing societal demand for humane and responsible livestock management.
However, the deployment of such systems also raises important ethical and practical questions. The potential for false negatives (i.e., instances where pain is present but not detected) reminds us that even the most sophisticated algorithms must be continually refined and validated to minimize suffering. Equally, the challenge of interpretability remains: stakeholders, from farmers to veterinarians, require clear explanations of the system’s decisions, including which facial features or micro-expressions triggered a pain alert. As we move toward greater automation in animal care, it is essential to balance technological innovation with transparency and trust.
Looking ahead, the framework established in this research opens several intriguing avenues. The methods developed here could be adapted to other species and emotional states, potentially enabling the detection of stress, anxiety, or positive welfare indicators in a variety of animals. Integrating a facial micro-expression analysis with other sensing modalities—such as vocalization analysis, thermal imaging, and posture tracking—could provide a more holistic and nuanced understanding of animal well-being. The creation of digital twins or virtual herds, where welfare interventions can be simulated and optimized before real-world application, represents another exciting frontier.
4.2. Current States and Challenges
4.2.1. Temporal Sparsity and Variability in Pain Expression
Temporal sparsity is one of the most prominent challenges in cattle pain detection: cattle show minimal overt displays of pain, with brief, subtle, and often context-dependent expressions [22]. Our findings, such as the misclassification of pain-labeled videos like pain_255(4) that contained only a minority of pain frames, highlight the inadequacy of fixed-threshold aggregation rules. Rigid thresholds—such as labeling a video as ‘pain’ if more than 30% of its frames are classified as pain—fail to accommodate the diversity of the pain expression patterns across individuals and contexts. Similar to the versatility of emotional expression in humans, some cows may exhibit pain as short, intense bursts, while others may display more diffuse or intermittent cues. This diversity is complicated further by environmental influences, such as feeding or resting periods, and by individual differences in pain tolerance and behavioral strategies.
4.2.2. Effects of Domain and Environmental Variations
The limitations of rule-based aggregation are compounded by the impact of domain shifts and environmental variation. Our system’s performance was affected by factors such as individual cows’ appearance and environmental conditions. Even minor misalignments in landmark localization can propagate through the feature extraction and temporal modeling pipeline, ultimately degrading the classification accuracy. These findings emphasize the necessity of more diverse and representative training data, as well as robust data augmentation strategies that simulate real-world variability. The use of advanced augmentation techniques, such as synthetic occlusion, random cropping, and brightness variation, could help the model generalize more effectively to the heterogeneous conditions encountered in commercial dairy environments.
4.3. A Comparison with the Existing Literature
The comparative analysis with the existing literature further contextualizes the strengths and limitations of our approach. Previous studies in automated cattle pain and lameness detection have often relied on gross locomotor changes, utilizing 3D CNNs or ConvLSTM2D architectures to achieve video-level accuracies in the range of 85–90%. However, these approaches are typically limited to overt, late-stage pathologies and require the animal to be walking or moving in a controlled manner. In contrast, our focus on facial micro-expressions enables continuous monitoring and has the potential to detect pain at earlier and less severe stages. Nevertheless, the moderate video-level accuracy observed in our study reflects the inherent difficulty of the task and the impact of environmental and subject variability, which are less pronounced in controlled locomotion-based studies.
The relevance of our approach to the ongoing development of grimace scales in animal welfare research is also noteworthy. Manual grimace scales, such as the CGS, have become widely used for pain assessments across species, including rodents, equines, and bovines. These scales rely on the manual annotation of static facial features, such as orbital tightening, ear position, and nostril dilation. While effective in controlled settings, manual scoring is labor-intensive, subject to observer bias, and limited in its temporal resolution. Automated systems like ours offer the potential for scalable, objective, and continuous pain assessments but must overcome the challenges of subtlety, temporal sparsity, and environmental complexity. Our system’s ability to capture and classify brief, transient pain-related facial movements represents a significant advance, yet the moderate recall on unseen videos suggests that certain pain expressions—especially those that are brief, subtle, or confounded by environmental noise—remain difficult to detect reliably. This observation aligns with recent studies indicating that even trained human observers can struggle to consistently assess facial expressions and ear positions, particularly when micro-expressions are fleeting or ambiguous [52].
4.4. Future Research Directions and the Pilot Deployment Plan
Addressing the limitations of our small-scale implementation and the research gaps in the existing literature requires a shift from rigid, rule-based decision mechanisms to more adaptive, context-aware strategies. Future iterations of our system should move beyond fixed-threshold rules toward decision mechanisms that are informed by intra-video distributional features, such as variance, local burst density, and score skewness. Adaptive thresholds could provide a more nuanced signal than global averages, while the incorporation of high-confidence frame clusters when sufficiently dense, temporally coherent, or strongly predicted could enhance the detection sensitivity. Conversely, burst-to-frame ratio gating or temporal consistency checks could filter out sporadic false positives that may arise in long ‘no pain’ videos. Ultimately, integrating learned metaclassifiers or video-level neural decision heads that consume frame-wise predictions may allow the system to recognize subtle, context-dependent pain patterns that rule-based heuristics cannot capture.
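As an illustration of the direction described above (not something implemented in this study), the sketch below combines the global pain ratio with a local burst-density feature so that short but dense runs of high-confidence pain predictions can trigger a video-level alert; all thresholds are placeholders.

```python
# Illustrative context-aware alternative to the fixed 30% rule (not part of the current system).
import numpy as np

def adaptive_video_decision(sequence_probs, window=10, high_conf=0.8):
    probs = np.asarray(sequence_probs, dtype=float)
    pain_ratio = float((probs >= 0.5).mean())
    # Densest sliding window of high-confidence pain predictions (local burst density).
    flags = (probs >= high_conf).astype(float)
    if len(flags) >= window:
        burst_density = float(np.convolve(flags, np.ones(window) / window, mode="valid").max())
    else:
        burst_density = float(flags.mean())
    # Flag the video if either the global ratio or a dense local burst is high (placeholder cutoffs).
    is_pain = pain_ratio > 0.30 or burst_density > 0.60
    return is_pain, {"pain_ratio": pain_ratio, "burst_density": burst_density}
```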
From a methodological perspective, the adoption of advanced temporal modeling architectures, such as attention-based transformers, holds promise for improving the system’s ability to capture sparse and discontinuous pain events. Unlike traditional LSTM models, transformers can assign variable attention weights to different frames, highlighting those that are most informative for pain detection. This capability is particularly relevant for micro-expression analysis, where the most critical signals may be temporally isolated and easily overlooked by models that rely on fixed-length sequences or uniform weighting. In addition, multi-scale feature extraction and ensemble approaches could help capture a broader range of facial expression dynamics, further enhancing the system’s robustness.
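For instance, an attention-weighted temporal pooling head of the kind described above could look like the following sketch (illustrative only, not part of the current system); it learns one relevance score per frame and pools the frame descriptors by their softmax-normalized weights before classification.

```python
# Illustrative attention-weighted temporal pooling head over 5-frame descriptor sequences.
import tensorflow as tf

def attention_pooled_classifier(seq_len=5, feat_dim=3840):
    inputs = tf.keras.layers.Input(shape=(seq_len, feat_dim))
    scores = tf.keras.layers.Dense(1)(inputs)                 # one relevance score per frame
    weights = tf.keras.layers.Softmax(axis=1)(scores)         # attention weights over the frames
    pooled = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([inputs, weights])   # weighted sum of frames
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)       # pain probability
    return tf.keras.Model(inputs, outputs)
```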
Expanding and diversifying the training dataset is another critical priority. Including a wider variety of cow breeds, ages, and pain contexts, such as metabolic pain, parturition, and breed-specific facial patterns, would help mitigate the risk of overfitting and improve generalizability. The use of self-supervised pretraining on large volumes of unlabeled video data, as well as generative augmentation techniques to synthesize rare pain expressions, could further enhance the model’s capacity to recognize diverse and subtle pain indicators [53].
Interpretability remains a key concern for the adoption of automated pain detection systems in practice. While our use of keypoint-based micro-expression detection provides some transparency at the frame level, the temporal decision boundaries produced by the LSTM are less interpretable. The development of visual explanation tools, such as Grad-CAM, could help elucidate which facial regions and time intervals are most influential in the model’s decisions. Such tools would not only improve the system’s trustworthiness for veterinarians and animal welfare inspectors but also facilitate the identification of potential failure modes and biases.
The integration of facial micro-expression analysis with other non-invasive sensing modalities represents a promising direction for future research. Combining facial analysis with posture, gait, and physiological signals could provide a more holistic assessment of pain and welfare in dairy cattle. Multimodal fusion approaches combining sensor data, images, videos, and audio may help disambiguate ambiguous cases, improve sensitivity and specificity, and enable the detection of pain states that are not readily apparent from facial cues alone.
The clinical and economic implications of automated pain detection are substantial. The early detection of pain can prevent production losses, reduce treatment costs, and improve overall herd health. For example, timely intervention in mastitis cases can reduce antibiotic usage and the associated costs, while early detection of lameness can prevent declines in milk yield and reproductive performance [54]. Ethically, the deployment of such systems aligns with the growing emphasis on animal welfare and precision livestock farming, supporting the goal of individualized, real-time monitoring and proactive care.
As mentioned earlier, this research is a preliminary part of our ongoing work on developing a completely annotated benchmark dataset including facial images and videos of healthy cows and cows in pain. The complete dataset will be used to develop a modified version of the proposed model that will incorporate both images and videos to detect the pain state of a cow in real time and provide alerts to assist farmers and veterinarians with animal welfare assessments and monitoring. The complete system’s results and performance will be tested, compared, and presented with different visualizations for easier analysis. Our complete dataset will additionally include thermal images of both healthy cows and cows in pain to provide a baseline for our application. Our solution will be available to users (i.e., farmers, producers, farming staff, veterinarians, etc.) through mobile applications that they can use by simply taking images or videos of cows for pain state identification. Our pilot application will be deployed for performance evaluation and improvement before it is launched publicly. The detections generated by the pilot application will be verified by animal welfare professionals to assess its performance through accurate classifications, false positives, false negatives, and other performance measurements. The model will be adjusted and retested several times to rectify performance issues and optimize its accuracy, specificity, and sensitivity by incorporating professional human intervention. The improved model will then be deployed on farms, where its performance will be closely monitored. The proposed model and the improved application are designed to serve as decision support tools for farmers, producers, and clinicians, alerting them early to cows’ discomfort and complementing their expertise. Our application is not intended to replace clinicians or other animal welfare professionals.
Our study establishes micro-expression analysis as a viable paradigm for automated pain detection in dairy cattle while also highlighting the complexities and challenges inherent in translating this technology to real-world farm environments. Achieving reliable, real-time pain monitoring on a large-scale farm will require ongoing innovation in the data collection, model architecture, interpretability, and multimodal integration. Future systems must embrace biologically informed temporal modeling, continual domain adaptation, and sensor fusion to fully realize the potential of precision welfare ecosystems. By addressing these challenges, automated pain detection systems can transform animal welfare from reactive intervention to proactive, individualized care, fulfilling both scientific and ethical imperatives in modern agriculture.
5. Conclusions
This research marks a significant step forward in the field of animal welfare technology by demonstrating that facial micro-expressions in dairy cattle can serve as reliable, quantifiable indicators of pain. By adapting advanced computer vision and temporal modeling techniques, originally developed for human micro-expression analysis, to the unique morphology and behavioral context of cows, we have shown that pain—a deeply subjective and often concealed experience—can be objectively inferred from subtle, fleeting facial movements. The system we developed, integrating a custom-trained YOLOv8-Pose model for precise facial landmark detection with a MobileNetV2 and LSTM-based temporal classifier, achieved remarkable sequence-level accuracy. This not only validates the technical feasibility of such an approach but also challenges the traditional boundaries of animal pain assessments, which have long relied on coarse behavioral scoring or invasive physiological monitoring.
This research highlights the profound potential of AI to enhance our understanding and stewardship of animals. By giving a voice to the silent signals of pain, we not only advance the science of animal welfare but also reaffirm our ethical commitment to those in our care. From the fleeting tension of a muscle, the brief narrowing of an eye, or the subtle twitch of an ear, these micro-movements, once invisible and unmeasurable, now become the foundation of a new era of compassionate, data-driven animal husbandry. As we refine and expand these technologies, we must do so with humility and imagination, always guided by the principle that to care for animals is to listen deeply, measure wisely, and act with empathy. At the convergence of technology and compassion lies the promise of a future where animal suffering is not only seen but prevented and where our relationship with the creatures who share our world is marked by understanding, respect, and care.