1. Introduction
China is the world’s largest pork-producing and pork-consuming country [1], with a large scale of pig farming. In 2023, the country slaughtered 726.62 million pigs, an increase of 3.8% over 2022, and produced 57.94 million tons of pork, an increase of 4.6% over 2022 and the highest level since 2015 [2]. In recent years, to meet the growing demand for pork, farms have generally adopted large-scale farming and intensive management to expand production. Pig counting is an important part of pig management, and its importance for pig farmers cannot be ignored. Traditional manual counting increases labor costs, is time-consuming and inefficient, and is prone to false reporting and omission [3]; these problems are especially prominent in large-scale animal husbandry enterprises. To manage pig farms well, the number of pigs in each unit must be counted in a timely manner so that the disappearance or death of pigs can be identified and dealt with promptly, avoiding the spread of disease or a decline in quality. However, traditional pen-filling counting can only count pigs when they are in a pen, which cannot meet the requirement of real-time knowledge of pig numbers during the breeding process. Sensor-based pig counting requires the purchase and installation of sensor equipment, which is costly, and the equipment may be affected by environmental factors, resulting in counting errors or equipment failure that further reduce accuracy. Wearable devices, moreover, are invasive to pigs. A more suitable alternative is therefore needed to meet counting requirements.
With the expansion of farming scale, pig management also requires more scientific and effective control. Chinese pig farms are gradually introducing advanced farming technologies and equipment, such as intelligent farming, digital management, and environmental control [4]. Through intelligent monitoring, precise management, and other technical means, breeding efficiency is improved, costs are reduced, and breeding quality and the sustainable development of the breeding industry are ensured. Computer vision technology provides a non-contact, low-cost counting method. Traditional computer vision approaches extract features such as color, shape, and contour from an image and then count using classical machine learning algorithms such as k-means, support vector machines, and random forests [5,6]. For example, Pandit et al. [7] proposed a silkworm egg counting method that preprocesses grayscale images with contrast enhancement and then applies thresholding and morphological operations to determine counts. However, this type of method relies mainly on morphological features such as color, shape, and contour, has poor generalization ability [8], and fluctuates in performance when feature equations or thresholds are chosen improperly. In complex scenarios, such methods struggle to produce satisfactory results.
In recent years, spurred by the rapid advancement of deep learning, deep learning-based techniques have been able to extract features automatically; among them, convolutional neural networks have shown strong feature extraction capability [9]. Deep learning detectors such as the single shot MultiBox detector (SSD) [10], EfficientDet [11], and you only look once (YOLO) [12] offer high accuracy, strong generalization ability, and fast operation, and they have achieved good performance in practical applications. Many researchers have therefore applied deep learning algorithms to agricultural counting, with significant results in applications such as wheat counting [13,14], fruit yield estimation [15], and livestock inventory [16,17]. In the field of pig counting, Rong Wang et al. [18] proposed a high-density herd pig counting model based on the SOLOv2 algorithm that integrates a multi-scale feature pyramid and second-generation deformable convolution; by optimizing the model structure to reduce computational resource consumption, it achieved accurate counting of high-density herds with a segmentation accuracy of 96.7%. Tian et al. [19] accurately estimated the number of pigs in a whole image by modifying a counting convolutional neural network with the ResNeXt structure and mapping image features to a density map, achieving a mean absolute error of 1.67 on a real dataset. Feng et al. [20] proposed a pig counting and localization method for overhead shot images using density map estimation; their efficient neural network used depth-wise separable SK blocks and hybrid-ViT blocks to obtain a density map and compute the pig count, achieving a mean absolute error (MAE) of 0.726, a localization precision of 88.26%, and a recall of 86.02%. Density map-based methods have advantages in dense, crowded scenes, but they cannot retain detailed individual information and lack accurate location information for individual targets. As a result, they cannot associate targets across a time span and are not applicable to continuous counting [21]. To identify pig locations accurately, Ju et al. [22] proposed a real-time segmentation method using the Kinect depth sensor for the problem of pigs adhering to each other, combining YOLO and image processing to separate adherent pigs accurately and efficiently. Yang et al. [23] proposed an improved YOLOv5n pig inventory algorithm based on a multi-scene pig dataset and the SE channel attention module, which improved accuracy and robustness in complex occluded and overlapping scenarios, with an MAE of 0.173. Hao et al. [24] proposed a novel pig detection and counting model based on YOLOv5 that combines shuffle attention and the Focal-CIoU loss; the shuffle attention module fuses multi-channel information and enhances feature extraction, and the improved model achieved a high mAP of 93.8% and an accuracy of 95.6% in pig detection and counting, respectively.
The above methods improve model robustness in complex scenes by adding attention mechanisms and effectively improve pig counting accuracy, but the field of view of a single image is limited and these methods do not realize continuous counting, so they are not applicable to pig counting in a large pigsty. For this reason, many researchers have applied tracking techniques to the counting domain [25,26]. Cao et al. [27] achieved dynamic counting of a small number of sheep by fusing an improved YOLOv5x (with an added ECA mechanism) and the DeepSORT algorithm, maintaining a low error rate. Chen et al. [21] proposed a real-time automatic counting system based on a single fisheye camera that achieved accurate counting through a bottom-up pig detection algorithm, deep convolutional neural network keypoint detection and association, an effective online tracking method, and a novel spatio-temporal response filtering method; however, the keypoint detection-based method relies heavily on the overhead shooting angle and is not suitable for multitask counting scenarios. Jin et al. [28] designed an embedded pig counting system that uses a lightweight object detector, Tiny-YOLOv4, and the DeepSORT tracking algorithm to realize real-time counting on embedded devices; due to the limited performance of such devices, the lightweight detector struggles to meet the counting requirements of complex scenarios with occlusion and overlap. Huang et al. [29] proposed an improved pig counting algorithm based on the more accurate YOLOv5x combined with DeepSORT, which improved pig detection accuracy by embedding two SPP networks of different sizes in the YOLOv5x network and replacing MaxPool with SoftPool operations. In their video-based counting experiment, the correlation coefficient (R2) between results and actual values reached 98.14%, but the algorithm can only count fewer than 10 pigs at a time to ensure accuracy, and continuous counting must be divided into several passes, which does not meet dynamic counting requirements. Moreover, although the above methods realize continuous counting, they all rely on the overhead shot angle, which offers a wide field of view with little obstruction and is conducive to pig detection and identification, while the slant shot angle has not been studied. The overhead angle is also more limited in application scenarios, since the equipment must be deployed on top of a pigsty, which complicates deployment and maintenance.
Based on this, the current study proposes an improved you only look once version 7 (YOLOv7) model for detection and compiles a multi-scene pig counting dataset covering different barn sizes, group sizes, light intensities, occlusion levels, and shooting angles. For the occlusion and overlap problem, this paper introduces the coordinate attention (CA) mechanism, which makes the model pay more attention to information at different coordinate positions in the input image and helps improve robustness and accuracy in occluded, overlapping, and crowded scenes. Finally, the deep association metric SORT (DeepSORT) algorithm is used to realize continuous dynamic counting of an arbitrary number of pigs. The work and innovations of this paper are summarized as follows:
This study constructed a multi-scene pig counting dataset, including different pig house sizes, numbers of pigs in groups, light intensities, degrees of occlusion, and shooting angles.
Aiming at the possible misdetection and omission problems in pig counting with complex scenes, this study introduces a CA mechanism to the head of the YOLOv7 model for optimization so that the model can dynamically adjust the degree of attention to different coordinate positions in an image.
This study proposes the P-ELAN-W module, which uses PConv to improve some convolution operations in the ELAN-W module in the head of the YOLOv7 model, reducing redundant calculations and memory accesses and maintaining good feature extraction capabilities.
Aiming at the problem of tracking ID jumping caused by occlusion and crowding under slant shot angles, this study proposes a dynamic scanning counting method combining the improved YOLOv7 and DeepSORT, realizing dynamic counting under slant shot angles and effectively reducing the influence of tracking ID jumping on counting results.
3. Results
3.1. Comparative Experiment on Pig Detection Performance of YOLOv7 Series Basic Models
The aim of this section is to compare the detection performance of the different models in the YOLOv7 family in order to select the most suitable model for the dynamic pig counting task. YOLOv7-tiny, YOLOv7, YOLOv7x, YOLOv7-w6, YOLOv7-d6, YOLOv7-e6, and YOLOv7-e6e were trained under the experimental environment and training parameters described in Section 2.5 using their respective pre-trained weights. Then, 264 images from the test set were used for testing. The attributes of the seven models and their detection performance are shown in Table 3.
As can be seen in Table 3, YOLOv7-tiny, as a lightweight model, has a detection speed of 3.12 milliseconds per image and a model size of only 13 MB, which makes it suitable for embedded deployment, but its accuracy is lower. The YOLOv7 model size is moderate at 71.3 MB, with a precision of up to 97.16% on the full dataset, Test-All. YOLOv7x is scaled in depth and width relative to YOLOv7; on Test-All, its mAP_0.5, mAP_0.5:0.95, and recall improved by 0.25, 0.21, and 0.48 percentage points, respectively, while its precision decreased by 0.19 percentage points and its model size increased by 67.7 MB. YOLOv7-w6 is optimized for cloud GPUs, but the results show that its performance does not improve on regular GPUs. YOLOv7-e6 and YOLOv7-d6 are scaled and tuned based on YOLOv7-w6; YOLOv7-e6 has the highest accuracy on the full test set, and YOLOv7-d6 is also more accurate than YOLOv7, but its model size of 1480 MB is not suitable for practical applications. YOLOv7-e6e further optimizes YOLOv7-e6 using E-ELAN but nevertheless suffers a slight performance degradation on the test datasets.
For the dynamic pig counting task in combination with DeepSORT, the accuracy and speed of the target detection phase are equally important. If the target detection produces a large number of false detections, it will lead to biased counting results. At the same time, a large model resulting in a slow detection speed will also seriously affect practical applications. YOLOv7 achieved a good balance between accuracy and speed in the experiments, so it was selected as the detection model for the dynamic counting experiments, and subsequent optimization experiments were carried out on this basis.
3.2. Improved YOLOv7 Model Detection Performance Experiment
3.2.1. Model Detection Performance Experiment after Attention Mechanism Optimization
In order to verify the effects of different attention mechanisms and their insertion locations on the model, the coordinate attention mechanism (CA), the shuffle attention mechanism (SA) [40], and the hybrid attention mechanism CBAM were added to YOLOv7 at two locations: location I, the last layer of the backbone network, and location II, between the ELAN-W module and the REPConv module, as shown in Table 4. In this experiment, YOLOv7 was used as the base model, the input image resolution was 640 × 640, each model was trained for 100 epochs, and each model was tested on the two viewpoint test sets, Test-O and Test-S.
From Table 4, it can be observed that adding attention mechanisms has a positive impact on overall model performance without increasing model size. The improvement to YOLOv7 is particularly significant after adding the CA and CBAM mechanisms at location II. The model with CA achieves an mAP of 94.54%, a recall of 94.44%, and a precision of 97.41% on the entire test dataset; compared to the original model, these metrics improved by 0.97, 1.40, and 0.25 percentage points, respectively. The model with CBAM achieves an mAP of 94.97%, a recall of 94.42%, and a precision of 96.93% on the entire test dataset; compared to the original model, mAP and recall improved by 1.40 and 1.38 percentage points, respectively, while precision decreased slightly by 0.23 percentage points.
Comparing the two insertion locations reveals that adding CA at location II, compared to location I, improved the mAP, recall, and precision on the full test set by 0.48, 0.74, and 0.17 percentage points, respectively; adding CBAM at location II, compared to location I, improved the mAP and recall by 0.40 and 0.16 percentage points while decreasing precision slightly by 0.39 percentage points.
In terms of detection results at different angles, the base YOLOv7 model achieved mAP, recall, and precision at the overhead shot angle that were 8.65, 7.53, and 4.25 percentage points higher, respectively, than at the slant shot angles. After adding the attention mechanisms at location II, CA, SA, and CBAM improved on the base model at slant shot angles by 3.59, 1.02, and 4.28 percentage points in mAP; by 3.0, 0.99, and 4.23 percentage points in recall; and by 3.85, 0.77, and 1.80 percentage points in precision. At the overhead shot angle, however, the attention mechanisms improved the model less, and some indicators even decreased: SA, CBAM, and CA added at location II decreased precision by 0.20, 0.06, and 0.99 percentage points relative to the original model, and SA also decreased the mAP by 0.54 percentage points.
A comprehensive analysis revealed that the overhead shot angle is more conducive to the model detection of pigs because it offers a broader field of view with less mutual obstruction among pig groups. Conversely, the slant shot angles, due to their lower angle, are prone to clustering and obstruction, thereby affecting detection performance. The addition of attention mechanisms effectively improves the model’s performance at slant shot angles. Specifically, CA enhances the model’s perception of different locations by assigning different weights to each coordinate position in the input. In obstructed scenes, the CA attention module helps the model focus more on areas that may contain pigs. Meanwhile, adding attention mechanisms to the head network is more effective than adding them to the backbone.
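To make the mechanism concrete, the following minimal NumPy sketch shows the forward pass of coordinate attention: the input is pooled along each spatial direction, the pooled descriptors are transformed, and the result re-weights the feature map per row and per column. The 1 × 1 convolutions are modeled here as channel-wise matrix multiplications with random illustrative weights, and BatchNorm and the intermediate non-linearity of the full CA block are omitted; this is a sketch of the idea, not the implementation used in the YOLOv7 head of this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, w_reduce, w_h, w_w):
    """Minimal forward pass of coordinate attention for one feature map
    x of shape (C, H, W). The 1x1 convolutions are modeled as matrix
    multiplications over the channel axis; BatchNorm and the intermediate
    activation of the full CA block are omitted for brevity."""
    C, H, W = x.shape
    pool_h = x.mean(axis=2)                        # (C, H): pooled along width
    pool_w = x.mean(axis=1)                        # (C, W): pooled along height
    y = np.concatenate([pool_h, pool_w], axis=1)   # (C, H + W)
    y = np.maximum(w_reduce @ y, 0.0)              # shared channel reduction + ReLU
    y_h, y_w = y[:, :H], y[:, H:]
    a_h = sigmoid(w_h @ y_h)                       # (C, H) attention over rows
    a_w = sigmoid(w_w @ y_w)                       # (C, W) attention over columns
    return x * a_h[:, :, None] * a_w[:, None, :]   # re-weight by both directions

rng = np.random.default_rng(0)
C, C_r, H, W = 8, 4, 6, 6
x = rng.standard_normal((C, H, W))
out = coordinate_attention(
    x,
    rng.standard_normal((C_r, C)) * 0.1,  # illustrative random weights
    rng.standard_normal((C, C_r)) * 0.1,
    rng.standard_normal((C, C_r)) * 0.1,
)
```

Because the attention weights lie in (0, 1), the output is an elementwise down-weighting of the input, with different scaling for each row and column position, which is what lets the model emphasize coordinates likely to contain pigs.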
3.2.2. Visual Analysis of the Impact of Attention Mechanism on Model Detection Performance
To intuitively demonstrate the areas of focus of the model and the role of attention mechanisms in guiding the network to key areas, this study utilized Grad-CAM [41] for visualization to present the decision-making process from the second layer to the final layer of the model. Grad-CAM heatmaps for YOLOv7 and for YOLOv7 with the added CA mechanism are shown in Figure 14.
In the heatmap, the intensity of colors represents the model’s confidence in detecting targets, with the red areas highlighting the key regions where the model predicts the target’s position. As shown in Figure 14, both models are capable of accurately distinguishing between targets and backgrounds. In Figure 14(a-1,a-2), the model’s focus is on the head region of the pig, indicating that the model has successfully learned the most distinctive features of the pig, which helps improve detection accuracy and reduce false detections. Additionally, after adding the CA mechanism, the colors in the heatmap become darker, indicating higher confidence. In the group b heatmaps, Figure 14(b-1) shows that YOLOv7 does not pay enough attention to the pig in the bottom right corner, which can easily lead to missed detections; after adding the CA mechanism, attention to this pig is significantly enhanced. The group c heatmaps show that, although YOLOv7’s attention can distinguish targets from the background, some background areas are still included in the attention region. With the CA mechanism added, the network focuses its attention more accurately on the target pig.
The results of this experiment show that the introduction of CA makes the network’s focus on the critical region more explicit and further improves the model’s recognition accuracy with the target. This has a positive impact on optimizing the pig detection task.
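The Grad-CAM weighting behind these heatmaps can be summarized in a few lines: each activation map of a chosen layer is weighted by the spatial mean of its gradient with respect to the target score, the weighted maps are summed, and a ReLU keeps only positively contributing regions. The sketch below assumes the activations and gradients have already been extracted from the network (here they are random placeholders):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Core Grad-CAM weighting: each activation map is weighted by the
    spatial mean of its gradient, the maps are summed, and ReLU keeps
    only positively contributing regions. Inputs have shape (K, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # (K,) channel importance
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                        # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # scale to [0, 1] for display
    return cam

rng = np.random.default_rng(1)
acts = rng.random((4, 8, 8))            # placeholder layer activations
grads = rng.standard_normal((4, 8, 8))  # placeholder gradients of the score
cam = grad_cam(acts, grads)
```

In practice the normalized map is resized to the input resolution and overlaid on the image to produce heatmaps like those in Figure 14.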
3.3. Experiments on the Impact of Different IoU Thresholds on the Detection Performance of the Base and Improved Models
The IoU threshold is a criterion used in object detection to measure the degree of overlap between a detection box and its ground truth box. By adjusting the IoU threshold, the strictness of non-maximum suppression (NMS) can be controlled. To optimize performance and find the best threshold setting, this experiment optimized the ELAN-W module using PConv, introduced the CA mechanism into the head network, and named the improved model YOLOv7-Improved. Under IoU thresholds of 0.4, 0.45, 0.5, 0.55, 0.6, and 0.65, YOLOv7 and YOLOv7-Improved were tested on the Test-S, Test-O, and Test-All datasets in this paper. The experimental results are shown in Figure 15.
In Figure 15a, within the range of 0.4 to 0.65, all curves initially rise with the IoU threshold and then gradually decline, indicating that increasing the IoU threshold improves the F1-Score up to 0.5. In Figure 15b, the mAP continues to increase with the IoU threshold, but the growth rate slows once the threshold exceeds 0.5. Meanwhile, the improved YOLOv7 outperforms the original model on all three test datasets, with the greatest improvement under the slant shot angles. In summary, the model performs best at an IoU threshold of 0.5, and the improved model generalizes well across perspectives, indicating stability and strong adaptability to data from different angles.
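For reference, the IoU criterion and the greedy NMS procedure whose strictness the threshold controls can be sketched as follows (a minimal plain-Python version; production detectors use vectorized implementations):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every remaining
    box whose IoU with it exceeds iou_thresh; repeat on what is left."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
```

With these boxes, the second box overlaps the first at IoU ≈ 0.68, so it is suppressed at a 0.5 threshold but survives at 0.7, which illustrates why a looser threshold helps keep separate detections for adherent pigs while risking duplicate boxes.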
3.4. Ablation Experiments
In this experiment, PConv was utilized to optimize ELAN-W, and the CA attention mechanism was introduced. Ablation experiments were conducted on the improved YOLOv7 model using the Test-S, Test-O, and Test-All datasets in this paper. The IoU threshold was set to 0.5, and the experimental results are presented in Table 5.
From Table 5, it can be observed that Model II is the lightweight P-ELAN-W-optimized YOLOv7. Compared to the original model, it achieved improvements of 1.49 percentage points in mAP and 1.25 percentage points in recall on the Test-S dataset, while computational complexity decreased by 3.7 GFLOPS, at the cost of a slight decrease in precision. Model III introduced the CA attention mechanism into the head of YOLOv7; relative to the original model, it showed improvements of 3.59, 3.0, and 3.85 percentage points in mAP, recall, and precision, respectively, on the Test-S dataset, and improvements of 0.97, 1.40, and 0.25 percentage points on the Test-All dataset. Model IV optimized YOLOv7 with both P-ELAN-W and CA; compared to the original model, it exhibited mAP improvements of 3.24, 0.05, and 1.00 percentage points on the Test-S, Test-O, and Test-All datasets, respectively; recall improvements of 2.54, 0.23, and 1.34 percentage points; and precision improvements of 4.39, 0, and 1.10 percentage points, with computational complexity decreased by 3.6 GFLOPS.
The ablation tests reveal that using PConv to improve ELAN-W significantly reduces computation while improving model performance at slant shot angles, and that adding the CA attention mechanism improves the model’s ability to perceive different locations. In occlusion scenarios, CA helps the model pay more attention to areas that may contain pigs, further improving detection.
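The PConv idea behind P-ELAN-W can be illustrated with a minimal NumPy sketch: a regular convolution is applied to only the first few channels of the feature map, while the remaining channels are passed through untouched, which is what cuts the redundant computation and memory access. The naive loops and random weights below are for illustration only, not the trained layer:

```python
import numpy as np

def partial_conv(x, w, n_conv):
    """PConv sketch: apply a 3x3 convolution (same padding) to only the
    first n_conv channels of x and pass the remaining channels through
    unchanged. x: (C, H, W); w: (n_conv, n_conv, 3, 3)."""
    C, H, W = x.shape
    out = x.copy()                                     # untouched channels pass through
    pad = np.pad(x[:n_conv], ((0, 0), (1, 1), (1, 1)))
    conv = np.zeros((n_conv, H, W))
    for o in range(n_conv):                            # naive cross-correlation loops
        for c in range(n_conv):
            for i in range(3):
                for j in range(3):
                    conv[o] += w[o, c, i, j] * pad[c, i:i + H, j:j + W]
    out[:n_conv] = conv
    return out

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 6))
w = rng.standard_normal((2, 2, 3, 3))
y = partial_conv(x, w, 2)   # convolve 2 of 8 channels; FLOPs scale as (2/8)^2
```

Since only n_conv of C channels enter the convolution, the FLOPs of this layer scale roughly with (n_conv/C)², which is the source of the GFLOPS reduction reported in Table 5.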
3.5. Comparison Experiment for Detection Performance of Optimized YOLOv7 Model and Other Models
This experiment compared the improved YOLOv7 model with other typical object detection models: Faster RCNN, YOLOv5, YOLOv4, YOLOv3, and SSD. The IoU threshold was set to 0.5, and the models were tested on the Test-All dataset of this paper in order to demonstrate the superiority of the improved YOLOv7 model in this paper’s experimental scenario. The results are shown in Table 6.
From Table 6, it can be observed that YOLOv5 has a model size of 40.1 MB, making it efficient and convenient for deployment and mobile applications. YOLOv5 achieves an mAP of 92.5%, demonstrating good object detection accuracy, but its recall is relatively low at 86.90%, indicating that some targets may go undetected. The SSD (MobileNetV2) model is only 14.3 MB and achieves 116.05 FPS, but its mAP, recall, and precision are relatively average, and it may be limited in detecting fine-grained objects. The Faster RCNN model is larger at 521 MB, with a higher recall but relatively lower precision, and it may produce some false positive detections.
YOLOv7-Improved further enhances performance based on YOLOv7, with increases of 1.0, 1.34, and 1.1 percentage points in mAP, recall, and precision, respectively, compared to YOLOv7. Compared to YOLOv5, it achieves increases of 2.07, 7.48, and 2.36 percentage points in mAP, recall, and precision, respectively; compared to YOLOv4, increases of 5.20, 4.57, and 1.43 percentage points; compared to YOLOv3, increases of 2.16, 1.67, and 0.83 percentage points; compared to SSD, increases of 19.73, 18.20, and 3.85 percentage points; and compared to Faster RCNN, increases of 7.05, 3.09, and 32.93 percentage points.
In summary, YOLOv7 is at the forefront of object detection, offering high recall and accuracy and achieving a better balance between precision and recall. The improved YOLOv7 further enhances its performance, providing a reliable foundation for subsequent dynamic counting tasks.
3.6. Visualization of the Detection Effect of the Model in Different Scenarios
To examine the detection effect of the algorithm, this experiment compared the improved YOLOv7 with the original YOLOv7 model in occluded, dim, and overhead shot situations. The detection results are shown in Figure 16, from which it can be seen that the original YOLOv7 produces a large number of missed and false detections. In Figure 16(a-1), a pig was missed, and a pig’s ear was mistakenly detected as a pig. In Figure 16(b-1), a high degree of crowding led to missed detections. Although Figure 16c,d show overhead angles, the dense pig population resulted in pigs adhering to one another; the original YOLOv7 could not effectively distinguish these adherent pigs, leading to missed and false detections.
The improved YOLOv7 showed a significant improvement in detection performance and can still accurately detect all pigs in crowded and adherent scenarios. As shown in Figure 16a,b, in the crowded and occluded scenario, the detection accuracy for pigs is up to 90%; under dim conditions, the highest accuracy for pig detection is up to 94%, as shown in Figure 16c.
3.7. ReID Model Training
DeepSORT is mainly used for pedestrian tracking, and the original ReID [42] model is designed for pedestrian appearance re-identification, so it is not applicable to pigs. Therefore, to achieve pig tracking, this experiment retrained the ReID model on the pig re-identification dataset in this paper to extract pig appearance features and improve the tracking effect. The loss and accuracy curves of the ReID model during training and testing are shown in Figure 17.
As shown in Figure 17, the loss and accuracy curves gradually flatten as the number of iterations increases, indicating that the model converges. After 200 epochs, the loss on the test set stabilized at 0.407, and the test accuracy stabilized at about 0.92. A test accuracy close to 1 indicates good re-identification performance; an accuracy of 0.92 fully meets the demand for dynamic counting in this paper.
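Once trained, the ReID model’s role inside DeepSORT is to supply an appearance embedding per detection, which is compared to stored track features by cosine distance. A minimal sketch of this matching step is shown below; the gating distance of 0.2 is an illustrative assumption, not the value used in this paper:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between L2-normalized appearance embeddings, the
    appearance metric used by DeepSORT-style trackers."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

def match_detection(track_gallery, det_feature, max_dist=0.2):
    """Return the index of the stored track feature closest to the new
    detection's ReID embedding, or None if every distance exceeds the
    gating threshold (so a new track would be created instead)."""
    dists = [cosine_distance(f, det_feature) for f in track_gallery]
    best = int(np.argmin(dists))
    return best if dists[best] <= max_dist else None

gallery = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # toy 2-D track features
```

In the full tracker this appearance distance is combined with the Kalman-filter motion gate before Hungarian assignment; better pig embeddings shrink the distances for true matches and reduce ID switches.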
3.8. Dynamic Pig Counting Experiment
The improved YOLOv7 (YOLOv7-Improved) was fused with DeepSORT to perform dynamic counting experiments on the video-144, video-201, video-285, and video-294 video test sets; the counting process is shown in Figure 18.
As shown in Figure 18, the model accurately detects each pig and assigns it a tracking ID. A counting line is placed in the center of the frame; when a target box is scanned by the counting line, the system checks whether its ID is in the ID container. If not, the ID is added to the container and counted; otherwise, it is not counted again. The counting results are shown in Table 7.
As can be seen in Table 7, the first group used the method of taking the total number of assigned tracking IDs directly as the counting result. The results show that this method deviates from the actual counts by a large amount: the average error of the original model is 144 heads, and that of the improved model is 129 heads, a slight reduction. The average accuracy of the model before improvement is 41.25%; the improved model alleviates this slightly, with an average accuracy of 46.93%, but the counting effect is still unsatisfactory. The second group adopts the counting strategy proposed in Section 2.4 of this paper, and its counting effect is significantly improved: the error counts of the original model on the four video test sets are −4, +5, +19, and −32, with an average accuracy of 94.33%, while the improved model has error counts of −3, +3, −4, and −26 on the same test sets, with an average accuracy of 96.58%.
The counting technique proposed in this paper effectively alleviates the dependence on the tracking effect. The results show that the counting effect is significantly improved, while the improved YOLOv7 is proven to have an enhancing effect on the final counting result.
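The scan-line counting strategy can be sketched as follows; the box-center test and the 5-pixel scan band are illustrative assumptions, not the exact geometry used in Section 2.4:

```python
def update_count(tracks, line_y, counted_ids):
    """One frame of scan-line counting: a track whose box center falls on
    the counting line is counted once, and its ID is stored in counted_ids
    so that the same ID crossing again is not double counted.
    tracks: list of (track_id, (x1, y1, x2, y2)) from the tracker."""
    new_counts = 0
    for track_id, (x1, y1, x2, y2) in tracks:
        cy = (y1 + y2) / 2.0
        on_line = abs(cy - line_y) < 5          # illustrative 5-pixel scan band
        if on_line and track_id not in counted_ids:
            counted_ids.add(track_id)
            new_counts += 1
    return new_counts

counted = set()
tracks = [(1, (0, 95, 10, 105)), (2, (0, 0, 10, 10))]
first = update_count(tracks, 100, counted)    # pig 1 crosses the line
second = update_count(tracks, 100, counted)   # same frame again: nothing new
```

Because counting is gated by the ID container rather than by the raw number of IDs ever assigned, an ID that jumps and later re-crosses the line is only counted if it has never been seen, which is why this strategy is far less sensitive to tracking ID jumps than the first-group method in Table 7.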
4. Discussion
4.1. Analysis of Pig Detection Performance from Different Shooting Angles
Currently, most pig counting studies are conducted from an overhead shot angle, with few focusing on slant shot angles. To make up for this gap, we constructed a multi-scene pig counting dataset that includes data from both overhead and slant shot angles and tested the YOLOv7 model on both test sets. The experimental results show that, compared to slant shot angles, YOLOv7 performs significantly better at overhead shot angles in terms of mAP, recall, and precision, with improvements of 8.65, 7.53, and 4.25 percentage points, respectively. The superior performance at overhead angles is likely due to the broader field of view and reduced occlusion among pig groups, while slant angles, with their lower perspectives, are prone to pig clustering and occlusion, which affect detection performance. However, counting applications cannot be confined to overhead angles alone; slant shot angles are more suitable for handheld devices in small pig farms. Therefore, this study did not overlook slant shot angles but, rather, improved the model by adding a CA mechanism, which effectively enhances performance at those angles. Specifically, after incorporating the CA attention mechanism, the model improved by 3.59, 3.0, and 3.85 percentage points in mAP, recall, and precision, respectively, compared to the base model. Visual analysis through heatmaps shows that CA makes the network’s focus on critical areas more distinct, further enhancing the model’s accuracy in pig recognition and indicating that attention mechanisms have a positive impact on optimizing object detection tasks.
Nonetheless, even with the CA improvement, the model's detection performance at overhead angles remains 5.04 and 4.79 percentage points higher than at oblique camera angles in terms of mAP, recall, and precision.
4.2. Analysis of Target Detection Performance Using Different Optimization Methods
In this study, the performance of the YOLOv7 series models was tested first. Among them, YOLOv7 exhibited the best balance between model size and performance: at 71.3 M, it achieved an mAP, recall, and precision of 93.57%, 93.04%, and 97.16%, respectively, on the entire test set. At the same time, the experiments showed that YOLOv7x, YOLOv7-d6, and YOLOv7-e6e have more complex network structures but do not demonstrate stronger performance. This may be because large models typically require more data for training and parameter tuning to fully exploit their expressive capacity; if the available training data are limited, large models may struggle to learn enough patterns and features from the data [
43], thereby affecting their accuracy. Therefore, under specific task and data constraints, smaller models may be more suitable for achieving better performance.
In addition, this study lightweighted YOLOv7 by using PConv to replace part of the convolutions in the ELAN-W module of the HEAD, which reduced the model's computation by 3.7 GFLOPs without a significant drop in performance. To address overlapping and occlusion when counting pigs from oblique shooting angles, attention mechanisms were used to improve YOLOv7, and the effects of the CA, SA, and CBAM attention mechanisms at different insertion positions were compared in this paper's experiments. The results show that adding an attention mechanism generally has a positive impact on model performance without increasing model size. We also found that the attention mechanism affects the model differently at different angles. In the scenario studied here, the key features in the image (the pigs' facial features) are more prominent at oblique shooting angles, and the attention mechanism successfully captures these features and assigns them higher weights. In contrast, the model improves little after adding the attention mechanism at the overhead angle, possibly because the distribution of image information at this angle is not conducive to the attention mechanism, or because the key features are hidden or weakened.
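Why PConv saves computation can be seen from a back-of-envelope FLOPs count. The sketch below is illustrative only: the 20×20×512 feature-map size and the 1/4 partial ratio are assumptions for the example, not the actual dimensions of the ELAN-W layers in this paper.

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a standard k x k convolution over an
    h x w feature map (bias and stride effects ignored)."""
    return h * w * k * k * c_in * c_out

def pconv_flops(h, w, k, c, ratio=0.25):
    """FLOPs of a FasterNet-style partial convolution (PConv): the k x k
    convolution is applied to only a fraction `ratio` of the c channels,
    while the remaining channels pass through untouched."""
    cp = int(c * ratio)
    return h * w * k * k * cp * cp

# Illustrative comparison on an assumed 20 x 20 x 512 feature map, 3 x 3 kernel:
full = conv_flops(20, 20, 3, 512, 512)       # standard convolution
partial = pconv_flops(20, 20, 3, 512, 0.25)  # PConv on a quarter of the channels
print(f"standard: {full/1e9:.2f} GFLOPs, PConv: {partial/1e9:.2f} GFLOPs")
```

With a partial ratio of 1/4, the convolved channel count enters the cost quadratically, so the convolution itself needs only 1/16 of the standard FLOPs, which is why swapping PConv into part of the HEAD yields a measurable GFLOPs reduction.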
4.3. Adaptations and Limitations of Counting Methods
The DeepSORT-based dynamic counting task should flexibly adopt targeted counting strategies for different application scenarios. Take static targets such as wheat ears as an example [
39]; although it is simple to use the number of IDs assigned by DeepSORT directly as the counting result, the accuracy of this method depends heavily on tracking quality. In complex scenarios such as pig counting, experiments show that occlusion and running behavior among pigs easily cause frequent switching of tracking IDs and large counting errors. In the scanning counting approach proposed in this paper, the core counting trigger is the counting line coming into contact with the center point of a target's detection box. Since this contact is usually brief, even if the ID jumps afterward, the final counting result is unaffected. At the same time, to ensure that the same pig is not counted repeatedly, the method records the IDs that have already been scanned. In practical testing, this method exhibits high counting accuracy when the pigs are moving relatively slowly; e.g., counting errors of −3, −3, and −4 were recorded for the video-144, video-201, and video-285 test videos, respectively. In the video-295 test video, however, when a human approached the camera, the pigs' stress response caused them to congregate and cover each other, making detection boxes disappear briefly so that they could not be scanned by the counting line, which increased missed detections. In addition, pigs that have already been recorded may re-pass the counting line, potentially leading to double counting. Although DeepSORT can match a pig that passes the counting line multiple times as the same individual and assign it the same ID to avoid double counting, tracking often fails when pigs move quickly and are occluded by other pigs at the counting line, which leads to double counting.
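The scanning strategy described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the `(track_id, x1, y1, x2, y2)` detection format and the tolerance band around the counting line are assumptions for the example.

```python
def scan_count(frames, line_y, tol=2.0):
    """Count targets with a counting line: a track is counted the first time
    the line touches the centre of its detection box, and its ID is remembered
    so the same individual is never counted twice.

    `frames` is a list of per-frame detections, each a list of
    (track_id, x1, y1, x2, y2) tuples from the tracker (assumed format).
    """
    counted = set()                          # IDs already scanned by the line
    for detections in frames:
        for track_id, x1, y1, x2, y2 in detections:
            cy = (y1 + y2) / 2.0             # centre of the detection box
            if abs(cy - line_y) <= tol and track_id not in counted:
                counted.add(track_id)        # the contact is brief, so later
                                             # ID switches no longer matter
    return len(counted)

# Toy example: two pigs touch the line at y=100; pig 1 later touches it
# again with the same ID and is not double-counted.
frames = [
    [(1, 10, 90, 30, 108)],                           # pig 1, centre y=99  -> counted
    [(1, 10, 120, 30, 140), (2, 50, 95, 70, 105)],    # pig 2, centre y=100 -> counted
    [(1, 10, 92, 30, 108)],                           # pig 1 again         -> ignored
]
print(scan_count(frames, line_y=100, tol=2.0))  # -> 2
```

Note that this sketch also shows the failure mode discussed above: if DeepSORT assigns a *new* ID to a pig that re-crosses the line after an occlusion, the ID is absent from `counted` and the pig is counted twice.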
Therefore, to ensure counting accuracy, this method should be used when the pigs are moving gently or are relatively still. In future work, we will further optimize the tracking model and counting strategy to improve the stability and accuracy of counting in complex scenarios.
4.4. Future Application Scenarios and Research Directions
The dynamic counting system in this study achieves accurate counting in a wide range of scenarios, including counting tasks from overhead and other angles. Different application strategies can be adopted for farms of different sizes:
For small-scale (retail) farmers, the number of pigs is relatively small and occlusion is less severe, making handheld-device counting more suitable. In future research, the model can be further lightweighted, and a counting application can be developed for use by retail farmers.
For large farms with many pig pens and large sites, manual handheld counting is inefficient. The counting method from the top-view angle can be adopted, as shown in
Figure 19, where a track-mounted camera is installed above the pig house and dynamic counting is realized by moving the camera. This approach counts without disturbing the pigs, effectively avoids occlusion, improves counting accuracy, and reduces the labor cost of a farm.