1. Introduction
In 2023, China’s pig production capacity continued to grow: 726.62 million pigs were slaughtered, an increase of 3.81% year-on-year, and pork production reached 57.94 million tons, up 4.6% from 2022 [1]. Global pork output grew by 0.85% year-on-year, totaling 115.5 million tons. Productivity per sow per year (PSY), defined as the number of weaned piglets per sow annually, is a core indicator of sow reproductive efficiency and of the technical level of a pig farm. In 2014, China’s PSY was only 17.14, far below the levels of developed pig-raising countries; for comparison, Denmark, Ireland, Germany, and the USA had PSYs of 30, 27.8, 27.5, and 26, respectively, in that year [2]. In Chinese pig production, 10–30% of replacement gilts exhibit silent estrus that goes undetected because it lacks visible signs, which delays breeding and negatively affects PSY [3]. The second phase of China’s Pig Genetic Improvement Plan aims to achieve a PSY of over 32 for lean-type sows by 2035, which would reach global production standards and ensure a high-level supply of breeding stock. China has therefore been promoting precision phenotyping technologies and genomic selection to continuously improve the performance of pig breeding stock.
Traditional sow estrus detection methods include visual observation, boar testing, and manual palpation [4]. However, these methods have significant limitations. They rely heavily on the subjective judgment of technicians, leading to inconsistencies and potential misjudgments. They also depend on observing sow behaviors, such as standing still or mounting, which are influenced by environmental and individual differences and therefore lack objectivity and reliability. Consequently, traditional methods often yield low estrus detection accuracy, which reduces conception rates.
In recent years, infrared thermography has been used to study the relationship between changes in vulvar skin temperature and estrus in sows [5]. Sykes et al. [6] demonstrated that infrared thermal imaging could distinguish vulvar temperatures during estrus and non-estrus periods, offering a non-invasive detection method. The technology has also been used to measure temperature changes at the vulvae and nose tips of dairy cows, enabling the development of algorithms with higher sensitivity than traditional methods such as behavioral observation and Estrotect patches [7]. Wang et al. [8] extended the application to cow health monitoring by proposing a deep-learning approach based on an improved YOLOv4 network to detect eye temperatures from thermal images, achieving rapid detection with high accuracy. Zheng et al. [9] explored infrared thermography for sow estrus detection, developing an improved YOLOv5s detector that uses feature fusion and dilated convolutions for automatic vulvar temperature extraction. These studies highlight the potential of combining infrared thermal imaging with deep learning for estrus detection.
Deep learning has also advanced speech recognition and processing [10,11]. Although primarily applied to human speech and environmental sounds, audio monitoring techniques have likewise been developed for animal sounds [12,13]. Wang et al. [14] created a database of cow vocalizations and trained Conv-TasNet and ECAPA-TDNN models to accurately identify cow estrus states and individual identities. Pan et al. [15] replaced the Gaussian mixture models of traditional sound recognition with DNN-HMM models, improving the detection of pig calls. Yin et al. [16] fine-tuned AlexNet using spectrogram features to classify pig cough sounds, achieving a 95.4% recognition rate. For estrus detection, Wang et al. [17] proposed a dual Long Short-Term Memory joint discrimination strategy based on an optimal combination to improve estrus detection performance. Wang et al. [18] used an improved lightweight MobileNetV3_esnet model to recognize estrus and non-estrus sounds of sows, achieving an accuracy of 97.12%. These findings demonstrate the broad application potential and practical effectiveness of deep learning in animal sound recognition.
While existing studies have demonstrated promising results using single-modal approaches, these methods face inherent limitations that hinder their practical applicability. First, single-modal data often fail to capture the comprehensive physiological and behavioral manifestations of estrus. For instance, thermal imaging may overlook vocalizations indicative of estrus behavior, while audio analysis cannot detect subtle vulvar temperature variations. Second, environmental noise further degrades the reliability of single-modal systems. Moreover, traditional feature fusion strategies, such as simple concatenation, inadequately model the complex interactions between modalities, leading to suboptimal information integration. To address these challenges, this study proposes APO-CViT, a multimodal feature fusion framework that synergizes thermal infrared images and audio data for robust estrus detection. The key innovations include:
- I. Adaptive Cross-Attention Mechanism: Dynamically aligns and weights features from the thermal and audio modalities, enabling context-aware fusion to emphasize discriminative cues while suppressing irrelevant noise (a high-level sketch of the overall pipeline is provided at the end of this section).
- II. Enhanced DenseNet-SE Backbone: Integrates Squeeze-and-Excitation blocks with dense connectivity to amplify critical multimodal features and improve gradient flow.
- III. Non-Destructive Multimodal Dataset: A curated dataset of 960 synchronized thermal-audio samples collected under real farm conditions, addressing the scarcity of high-quality multimodal resources in livestock research.
By resolving the incompleteness of single-modal data and advancing interpretable fusion strategies, this work provides a scalable solution to enhance PSY metrics in modern pig farming, bridging the gap between laboratory prototypes and field deployment.
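To make the overall design concrete, the sketch below outlines the assumed data flow of the proposed framework: an audio branch over spectrogram features, an image branch over thermal features, a fusion step, and a classification head. All module names, layers, and dimensions here are illustrative placeholders rather than the exact APO-CViT implementation; the adaptive cross-attention fusion and the SE-enhanced backbone are sketched separately in the Results section.

```python
import torch
import torch.nn as nn

class EstrusPipelineSketch(nn.Module):
    """Skeleton of the assumed APO-CViT data flow (illustrative placeholders)."""
    def __init__(self, feat_dim: int = 256, num_classes: int = 2):
        super().__init__()
        # Placeholder extractors; the paper uses a CNN over audio spectrograms
        # and a ViT over thermal images.
        self.audio_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.image_branch = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        # Fusion is treated as a black box here; an adaptive cross-attention
        # variant is sketched in the Results section.
        self.fusion = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Stand-in for the DenseNet-SE classification head.
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, spectrogram_feat, thermal_feat):
        a = self.audio_branch(spectrogram_feat)          # (B, feat_dim)
        v = self.image_branch(thermal_feat)              # (B, feat_dim)
        fused = self.fusion(torch.cat([a, v], dim=-1))   # joint representation
        return self.head(fused)                          # estrus / non-estrus logits

model = EstrusPipelineSketch()
logits = model(torch.randn(4, 128), torch.randn(4, 768))  # dummy pooled features
print(logits.shape)  # torch.Size([4, 2])
```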
3. Results
3.1. APO-CViT Training Results
Figure 7 illustrates the loss of the APO-CViT model on both the training and test sets, as well as the confusion matrix. As depicted in Figure 7a, as the number of training epochs increases, the training and test losses of APO-CViT gradually decrease and stabilize after approximately 20 epochs. This indicates that the model successfully learned the features of the data without overfitting on either the training or test set.
3.2. Comparison of the Improved DenseNet Backbone Network
To demonstrate the effectiveness of the improved DenseNet-SE backbone network, a comprehensive comparison was conducted with ResNet [31], AlexNet [32], GoogLeNet [33], and MobileNetV3 [34] under identical training parameters. The experimental results, summarized in Table 4, show that the DenseNet-SE model achieved superior performance across multiple evaluation metrics, with precision and F1-score reaching 98.92% and 97.35%, respectively. This performance indicates that the improved architecture significantly enhanced the model’s ability to process data from both the audio and image modalities.
In terms of computational efficiency, the DenseNet-SE model maintained a relatively moderate model size while achieving superior performance. This balance between computational cost and output quality makes it particularly suitable for real-world applications where both efficiency and accuracy are critical. The network’s ability to handle audio and visual data simultaneously without compromising performance further underscores its potential as an effective solution for multimodal tasks.
As shown in Figure 8, the choice of DenseNet-SE as the optimal model is supported by a thorough analysis of several performance metrics, including classification accuracy, recall, F1-score, and computational cost. Compared with ResNet, which also achieved notable results, DenseNet-SE delivered higher precision and F1-score, reflecting a better trade-off between accuracy and computational efficiency. While ResNet performed well on some metrics, it fell short in precision, indicating that a more capable architecture was needed. The dense connections in DenseNet-SE further optimize the flow of information within the network, enabling it to capture complex patterns in both audio and visual data more effectively. DenseNet-SE was selected as the backbone for multimodal processing not only for its superior performance but also for its practical computational efficiency: unlike parameter-heavy networks such as AlexNet, which can demand substantial computational resources without a corresponding gain in performance, DenseNet-SE strikes a balance that makes it accessible for real-world deployment. Its ability to process audio and visual data simultaneously makes it well suited to tasks such as estrus detection, where cues from multiple modalities must be integrated seamlessly.
In summary, through rigorous comparison and analysis of multiple network architectures under uniform training conditions, the improved DenseNet-SE backbone network has been identified as the most effective solution for processing data in both audio and image modalities while maintaining a practical balance between computational demands and output quality.
3.3. Comparison of Unimodal Networks and Multimodal Networks
This study comprehensively evaluates the performance of unimodal and multimodal networks for sow estrus detection under identical training conditions. As shown in Table 5, the proposed APO-CViT framework achieves state-of-the-art results, with a precision of 98.92%, recall of 95.83%, and F1-score of 97.35%, significantly outperforming both unimodal and existing multimodal approaches. Unimodal models, such as Faster R-CNN [35] and EfficientNetV2 [36] for thermal image analysis and Wav2Vec2 [37] for audio processing, exhibit inherent limitations due to their reliance on single-modality data. While EfficientNetV2 effectively captures physiological cues such as vulvar temperature changes, it cannot integrate behavioral indicators from vocalizations. Conversely, Wav2Vec2 struggles with environmental noise and overlapping acoustic patterns, highlighting the necessity of multimodal fusion.
Advanced multimodal models, including MulT [38] (F1-score: 94.23%) and ViLT [39] (F1-score: 95.12%), demonstrate improved performance by leveraging cross-modal interactions. However, their rigid fusion strategies, such as fixed attention layers in MulT or predefined alignment in ViLT, limit adaptability to dynamic farm environments. In contrast, APO-CViT introduces an adaptive cross-attention mechanism that dynamically balances contributions from thermal images and audio data. This mechanism, combined with the DenseNet-SE backbone, enhances feature representation through channel-wise reweighting and dense connectivity, enabling precise suppression of noise while prioritizing discriminative patterns.
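As an illustration of what this adaptive balancing could look like, the following is a minimal sketch of a gated cross-attention fusion layer in PyTorch: each modality attends to the other, and a small learned gate produces per-sample modality weights. The module name, head count, pooling, and dimensions are assumptions made for exposition, not the exact APO-CViT implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionFusion(nn.Module):
    """Sketch of adaptive cross-modal fusion with a learned modality gate."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.audio_attends_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_attends_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate maps pooled, cross-attended features to two modality weights.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, audio_tokens, image_tokens):
        # audio_tokens: (B, Ta, dim), image_tokens: (B, Tv, dim)
        a_ctx, _ = self.audio_attends_image(audio_tokens, image_tokens, image_tokens)
        v_ctx, _ = self.image_attends_audio(image_tokens, audio_tokens, audio_tokens)
        a_vec, v_vec = a_ctx.mean(dim=1), v_ctx.mean(dim=1)      # pool over tokens
        weights = self.gate(torch.cat([a_vec, v_vec], dim=-1))   # (B, 2), per sample
        # A noisy modality receives a lower weight; discriminative cues dominate.
        return weights[:, :1] * a_vec + weights[:, 1:] * v_vec   # (B, dim)

fusion = GatedCrossAttentionFusion(dim=256)
fused = fusion(torch.randn(4, 50, 256), torch.randn(4, 196, 256))
print(fused.shape)  # torch.Size([4, 256])
```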
3.4. Ablation Experiment
This study compares three multimodal frameworks for sow estrus detection: CNN-ViT, Adaptive-CNN-ViT, and APO-CViT. The baseline CNN-ViT combines CNN-processed audio features with ViT-extracted thermal image features through simple concatenation, feeding them into a standard DenseNet classifier. Adaptive-CNN-ViT enhances this baseline by introducing an adaptive cross-attention mechanism to dynamically align and weight multimodal features, mitigating noise interference. APO-CViT is the final model, which additionally integrates the SE-enhanced backbone. To ensure the robustness of the results, ten independent experimental trials were conducted under identical conditions. The performance metrics reported in Table 6 correspond to the optimal outcomes from these repeated trials.
For the baseline, the audio and image features extracted by CNN-ViT were simply concatenated, and the fused features were fed into the original, unmodified DenseNet network, yielding the test results for the CNN-ViT model.
Adaptive Cross-Attention Mechanism: To validate the effectiveness of the Adaptive module, the CNN-ViT model was compared with a version incorporating the Adaptive mechanism. As shown in Table 6, the addition of the Adaptive module improved Precision, Recall, and F1-score by 4.77, 1.13, and 2.95 percentage points, respectively. This indicated that the Adaptive Cross-Attention Mechanism further enhanced the fusion of audio and image features, effectively improving the model’s performance.
SE Block: After integrating the SE Block into the backbone network, the model achieved its best performance, with Precision, Recall, and F1-score reaching 98.92%, 95.83%, and 97.35%, respectively. Significant improvements were observed compared with Adaptive-CNN-ViT, i.e., the CNN-ViT model equipped with the Adaptive Cross-Attention Mechanism. The results demonstrated that the SE Block more effectively captured and expressed complex features, enhancing the model’s understanding of multimodal data.
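For reference, the SE Block referred to here follows the standard squeeze-and-excitation formulation: a global average pooling “squeeze” followed by a two-layer bottleneck with a sigmoid “excitation” that produces per-channel weights used to rescale the feature map. The sketch below shows such a block as it might be applied to DenseNet feature maps; the reduction ratio and the exact placement within the dense blocks are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation channel reweighting (illustrative)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global spatial average
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                       # emphasize informative channels

# Example: reweighting a DenseNet-style feature map with 128 channels.
feat = torch.randn(4, 128, 14, 14)
print(SEBlock(128)(feat).shape)  # torch.Size([4, 128, 14, 14])
```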
As shown in Figure 9, APO-CViT outperforms CNN-ViT and Adaptive-CNN-ViT on all three core metrics. Its precision reaches 98.92%, an increase of 6.81 percentage points over CNN-ViT and 2.04 percentage points over Adaptive-CNN-ViT, indicating a clear advantage in reducing false positives. Its recall of 95.83% is a more modest improvement (1.87 percentage points over the former and 0.74 over the latter) but shows that high precision is achieved without sacrificing the ability to capture key features. The F1-score of 97.35% reflects this balance, improving by 4.32 percentage points over the baseline model and 1.37 over the improved model. The pattern of these gains has two notable characteristics. First, the improvement is concentrated in precision: relative to Adaptive-CNN-ViT, the precision gain (+2.04 percentage points) is substantially larger than the recall gain (+0.74), indicating that the architectural changes mainly reduce false positives rather than expand feature coverage. Second, recall appears to be approaching a ceiling: the recall differences among the three models narrow progressively (APO-CViT exceeds Adaptive-CNN-ViT by only 0.74 percentage points), suggesting that gains in recall from structural optimization alone are nearing saturation. Even so, the 1.37-percentage-point increase in F1-score confirms APO-CViT’s superior overall balance.
The ablation study in Table 7 evaluates the contributions of the thermal image and audio modalities to sow estrus detection. Using images alone, the model achieves moderate performance thanks to reliable physiological cues from vulvar temperature, though it misses detections in cases with subtle thermal changes. The audio-only model performs significantly worse, as environmental noise and overlapping vocal patterns between estrus and non-estrus states introduce many false positives. Fixed-weight fusion strategies yield gradual improvements: balanced weights (0.5:0.5) give an F1-score of 90.39%, while image-dominated weighting (0.7:0.3) further improves accuracy by prioritizing the robust thermal features. Conversely, audio-dominated weighting (0.3:0.7) degrades performance, underscoring the audio modality’s susceptibility to noise. The adaptive weighting of APO-CViT outperforms all of these baselines by dynamically adjusting the modality weights to suppress noise and amplify discriminative features. This context-aware fusion validates APO-CViT’s superiority in balancing precision (98.92%) and recall (95.83%), establishing it as the strongest of the evaluated frameworks for multimodal estrus detection.
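For clarity, the fixed-weight baselines in Table 7 amount to replacing the learned gate with a constant convex combination of the pooled image and audio features. A minimal sketch, with the weight settings taken from the ablation and everything else (names, shapes) assumed, is shown below.

```python
import torch

def fixed_weight_fusion(image_vec: torch.Tensor, audio_vec: torch.Tensor,
                        image_weight: float = 0.5) -> torch.Tensor:
    """Fuse pooled modality features with a constant weight (ablation baseline)."""
    return image_weight * image_vec + (1.0 - image_weight) * audio_vec

image_vec, audio_vec = torch.randn(4, 256), torch.randn(4, 256)
balanced = fixed_weight_fusion(image_vec, audio_vec, image_weight=0.5)   # 0.5 : 0.5
image_led = fixed_weight_fusion(image_vec, audio_vec, image_weight=0.7)  # 0.7 : 0.3
audio_led = fixed_weight_fusion(image_vec, audio_vec, image_weight=0.3)  # 0.3 : 0.7
# The adaptive variant replaces the constant with per-sample weights produced by a
# learned gate, as in the cross-attention sketch in Section 3.3.
```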
4. Discussion
In this study, the improved APO-CViT model, which incorporates the Adaptive Cross-Attention Feature Fusion Mechanism and an enhanced DenseNet backbone, demonstrated outstanding performance in the multimodal sow estrus detection task. Experimental results show that, compared with traditional feature fusion methods, the Adaptive Cross-Attention Mechanism significantly enhances the model’s performance during the feature fusion phase. This mechanism dynamically adjusts the attention weights assigned to the different modalities, allowing the model to capture useful information from both audio and image features more effectively, thereby improving overall detection accuracy and robustness.
In comparative experiments, models using different backbone networks exhibited substantial differences in performance metrics. By comparing ResNet, AlexNet, GoogLeNet, MobileNetV3, DenseNet, and DenseNet-SE, it was observed that DenseNet-SE achieved the best results in Precision, Recall, and F1-score while maintaining a good balance in model size. This demonstrates the superiority of the improved DenseNet backbone in processing multimodal data. Notably, DenseNet-SE achieved a precision of 98.92% and an F1-score of 97.35%, significantly outperforming the other models. This confirms the effectiveness of the SE module in enhancing the model’s ability to express complex features.
Ablation studies further verified the contribution of each improvement module to the model’s performance. Introducing the Adaptive Cross-Attention Mechanism and the SE Block individually led to notable performance gains. Particularly, after incorporating the SE Block, the APO-CViT model achieved optimal performance, indicating that the SE module effectively captures and expresses complex multimodal features, enhancing the model’s understanding of multimodal data.
5. Conclusions
Overall, this paper proposes a multimodal feature fusion method that integrates audio and image data for estrus detection in sows. The experimental results validate the effectiveness of multimodal feature fusion, demonstrating that the combination of audio and image data significantly improves the accuracy of estrus detection. The APO-CViT model achieved excellent results in the detection task, with Precision, Recall, and F1-score reaching 98.92%, 95.83%, and 97.35%, respectively. Compared with traditional single-modal detection methods, the proposed APO-CViT model captures more comprehensive multidimensional information related to sow estrus, addressing detection errors caused by the incompleteness of single-modal data. By introducing the Adaptive Cross-Attention Feature Fusion Mechanism, the model dynamically adjusts the weight distribution across modalities, enhancing detection accuracy and robustness. Furthermore, the improved DenseNet structure strengthens the model’s ability to extract complex features, improving its understanding and processing of multimodal data.
This approach is also highly practical, as both modalities can be collected through non-contact methods, making data acquisition easy and cost-effective. The study provides a more intelligent estrus monitoring solution for modern pig farms, with significant practical value and potential for widespread application. Future research can further optimize multimodal data fusion methods, explore the use of other physiological parameters for estrus detection, and validate the generalizability of the approach for detecting estrus and non-estrus states in diverse pig breeds under varying environmental conditions. Such efforts will enhance the adaptability of the proposed method across livestock management systems and broaden its practical impact.