1. Introduction
In the daily farming and management of large-scale farms, it is essential to conduct regular inventory and counting of livestock. Standardized and refined agricultural practices are progressively becoming the norm to enhance farming efficiency [
1]. By utilizing pig counting data to regulate feed quantity and type, and formulating customized farming strategies for various breeds and sizes of pigs, the profitability of pig farming can be significantly enhanced.
In traditional livestock farms, common methods for counting pigs include manual counting and the use of wearable sensors [
2]. However, with the continuous expansion of farming scale, manual counting has become increasingly impractical, placing a significant burden on the physical and mental resources of farm workers. Moreover, if the staff is inattentive or fatigued, manual counting is prone to substantial errors. The use of wearable sensors, such as RFID ear tags, also faces considerable challenges. First, tagging each pig individually in large herds is a labor-intensive task and often induces stress responses in the animals. Additionally, these electronic tags may cause a certain degree of physical discomfort or injury to the pigs. Over time, the tags may become detached due to pig behaviors such as biting, rubbing, or fighting, leading to inaccuracies in counting and increased operational costs. Furthermore, in large-scale farms, especially in facilities with dense populations and frequent occlusions, sensor-based systems often suffer from signal interference or recognition failures, particularly in pigpens with extensive metal structures. In view of these limitations, this study adopts a fixed-camera-based approach, integrating object detection and tracking techniques for pig counting and monitoring. Compared with traditional manual counting and sensor-based methods, vision-based approaches offer a higher degree of automation and intelligence, significantly reducing labor and operational costs. More importantly, they allow for accurate pig identification and counting even in complex environments, without causing any harm or discomfort to the animals.
Driven by the rapid development of artificial intelligence and big data, deep learning (DL) algorithms, characterized by multi-layer neural networks, have achieved notable breakthroughs across diverse domains, including image classification, computer vision, and time series classification [
3,
4]. DL algorithms have been extensively utilized in the field of object detection owing to their superior accuracy and rapidity [
5,
6,
7]. These algorithms can be classified into two categories: two-stage detection methods and single-stage detection methods. Two-stage detection methodologies encompass Region-based Convolutional Neural Networks (RCNN) [
8], Fast R-CNN [
9], and Mask R-CNN [
10], which exhibit superior detection accuracy. Single-stage detection methods include SSD [
11], YOLOv3 [
12], and others, which offer advantages in processing speed. All of these deep learning algorithms have demonstrated great accuracy in object detection and localization, rendering them reliable and consistent models for animal monitoring.
Numerous researchers have investigated animal detection and identification systems. Hansen et al. [
13] applied facial recognition technology to pigs, extracting biometric features for non-invasive detection and identification. Song et al. [
14] proposed a YOLOv3-p model for sheep facial detection by optimizing anchor box dimensions and reducing model parameters, which improved detection accuracy while decreasing computational complexity. Yang et al. [
15] created a lightweight sheep face detection model based on RetinaFace, employing a streamlined backbone network to reduce model parameters and optimizing the network’s loss function, achieving a detection accuracy of 97.1%. For behavioral analysis, Chae et al. [
16] identified cattle mating behavior using an enhanced object detection network. By incorporating additional convolutional layers and sampling layers into YOLOv3, they proposed a four-scale detection network with 98.5% accuracy. For livestock counting applications, Yang et al. [
17] enhanced the YOLOv5n algorithm for pig counting by constructing a multi-scene pig dataset and integrating the SE channel attention module. This approach enhanced both the accuracy and robustness in complex occlusion scenarios, achieving a mean absolute error (MAE) of 0.173. Similarly, Hao et al. [
18] introduced an improved YOLOv5-based pig detection and counting model, which integrates the shuffle attention mechanism and Focal-CIoU loss, achieving a mean average precision (mAP) of 93.8% for detection and 95.6% for counting. In recent years, numerous computer vision algorithms have been progressively applied to animal husbandry. Free-roaming animal counting systems for cattle [
19,
20], sheep [
21], and wildlife [
22] help researchers and herders document behavioral patterns and activity zones. Huang et al. [
23] modified the SSD detection network with InceptionV4 modules to enhance the detection accuracy of cattle tails in movement corridors, integrating an augmented Kalman filter and the Hungarian algorithm for tracking. Cao et al. [
24] achieved dynamic counting of a limited number of sheep by integrating the improved YOLOv5x and Deep SORT algorithms, supplemented by the ECA mechanism, while maintaining a low error rate.
Some scholars have made significant progress in pig counting algorithms. Cowton et al. [
25] combined Faster R-CNN for detection with Deep SORT for tracking and implemented a convolutional network-based re-identification algorithm to track pigs in sties. The tracking data were utilized to measure total distance traveled, idle time, and average speed. Stavrakaki et al. [
26] developed a Kinect motion detection system using reflective markers on pigs’ necks to transmit movement data to receivers positioned along the channel, distinguishing healthy from lame pigs by analyzing locomotion sounds. For direct counting applications, Tian et al. [
1] converted RGB images into density distribution maps and employed an enhanced CNN network to accurately count 15 pigs. Kim et al. [
27] implemented pig counting on the NVIDIA Jetson Nano embedded platform (NVIDIA, Santa Clara, CA, USA), installing cameras above the pigsty corridors to record pigs’ movements. They utilized the lightweight model Tiny YOLOv4 for detection and introduced a lightweight tracking method, LightSort, by enhancing the re-identification module of Deep SORT. A line-crossing approach was employed to enumerate the pigs. Chen et al. [
28] proposed a real-time automated counting system with a single fisheye camera, achieving accurate counts through bottom-up pig detection, deep convolutional neural networks for keypoint recognition and association, online tracking, and a novel spatiotemporal response filtering mechanism. Huang et al. [
29] enhanced pig counting accuracy by improving YOLOv5x with embedded dual-sized SPP networks and replacing MaxPool operations with SoftPool, integrating this with Deep SORT for tracking. A comprehensive review of the literature reveals persistent challenges in pig counting: mutual occlusion and overlapping among pigs, target loss and frequent identification switching during tracking, and false positive tracking trajectories. Although counting algorithms share basic architectural similarities, significant variations exist in camera angles, lighting conditions, pig appearance features, and movement patterns across different farming environments. These variations necessitate targeted optimization of pig counting algorithms for specific environmental conditions. This study proposes a dynamic pig counting algorithm, YOLOv8n-EGV+Deep SORT-P, which enhances the YOLOv8n+Deep SORT model. The YOLOv8n model serves as the detector, while the ELA attention mechanism is incorporated to improve detection accuracy for pig targets under challenging conditions such as occlusion, deformation, and poor lighting. A lightweight module is also incorporated to enhance detection speed without sacrificing precision. The Deep SORT tracking model is enhanced through improvements to the feature extraction network and the introduction of CIoU, effectively addressing tracking challenges such as target loss and identity switching while improving algorithm efficiency and stability. Finally, the virtual counting line method provides a straightforward and efficient approach to dynamic pig counting, performing optimally with the improved tracking algorithm. The main contributions of this study are as follows:
Development of a comprehensive pig detection and tracking dataset collected from corridor videos in real farming environments. Extensive experimental comparisons validate the effectiveness of the proposed improvements, ensuring the method’s reliability and practical applicability.
Introduction of the YOLOv8n-EGV algorithm for pig detection. The improved algorithm incorporates the ELA (Efficient Local Attention) mechanism and lightweight convolutional modules (GSConv and VOVGSCSP), which enhance detection accuracy and diminish computational and network structure complexity, resulting in markedly increased detection speed.
Development of the Deep SORT-P algorithm for pig tracking, which introduces an improved DenseNet-based feature extraction network and CIoU matching algorithm. These improvements augment the precision of multi-object tracking and tracking robustness under challenging farm conditions.
Implementation of a virtual counting line technique for counting pigs traversing the farming corridor. Comparative analysis demonstrates that the improved YOLOv8n-EGV+Deep SORT-P model significantly outperforms the original model in counting accuracy.
2. Methods
This study proposes an enhanced YOLOv8n-EGV+Deep SORT-P algorithm for high-precision dynamic pig counting. The algorithm comprises three main modules: the detection module, the tracking module, and the counting module. The overall technical roadmap of the algorithm is illustrated in
Figure 1.
The proposed system operates in a sequential pipeline. First, the improved YOLOv8n-EGV algorithm detects pigs in each video frame and forwards the detection results to the improved Deep SORT-P pig object tracking algorithm. Upon receiving the pig detection results for the current frame, the Deep SORT-P algorithm uses the tracking state from the previous frame to predict pig movement trajectories in the current frame. Next, the pig re-identification network calculates the correlation between these predicted trajectories and the current detection results. Finally, the virtual counting line method dynamically counts the pigs as they traverse the corridor.
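To make this data flow concrete, the following minimal Python sketch outlines the per-frame pipeline. The detector, tracker, and counter objects and their method names are hypothetical placeholders standing in for YOLOv8n-EGV, Deep SORT-P, and the virtual counting line, respectively; OpenCV is assumed only for frame reading.

```python
# Minimal per-frame pipeline sketch; interfaces are hypothetical placeholders.
import cv2

def count_pigs(video_path, detector, tracker, counter):
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detector(frame)                 # per-frame pig detections
        tracks = tracker.update(detections, frame)   # associate detections with trajectories
        counter.update(tracks)                       # check virtual counting line crossings
    cap.release()
    return counter.total
```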
2.1. Detector
2.1.1. YOLOv8n
The YOLO [
30] (You Only Look Once) series is a landmark family of classic single-stage object detection algorithms, excelling in both detection accuracy and speed. In 2023, the Ultralytics team released the YOLOv8 algorithm, whose network architecture is shown in
Figure 2. This algorithm builds upon the YOLOv1-v7 series models, reducing network parameters while improving both detection accuracy and real-time performance, leading to its widespread adoption across various object detection and segmentation applications.
YOLOv8’s backbone network adheres to the CSPDarkNet structure, comprising three key components: the Conv convolution module, the C2F module, and the SPPF module. The C2F module, inspired by the ELAN concept in YOLOv7, enhances the model’s gradient flow by integrating additional gradient flow branches. The SPPF module adaptively transforms feature maps of varying sizes into fixed-size outputs while efficiently integrating local and global feature information. The standard convolution block (CBS) in YOLOv8 consists of Conv2d, BatchNorm2d, and the SiLU activation function. The neck component retains the bidirectional feature pyramid network structure (FPN+PAN) from YOLOv5, enabling multi-scale object prediction by merging features from different layers. Feature maps from the backbone network are processed through the Feature Pyramid Network (FPN) in a top-down manner, transmitting high-level semantic information. However, this top-down pathway alone does not effectively convey feature localization information, whereas the bottom-up structure of the PAN (Path Aggregation Network) transmits low-level localization features upwards, enabling the network to perform multi-scale detection while effectively propagating both semantic and spatial information. YOLOv8’s head component represents a significant architectural advancement compared to its predecessors. The original merged head structure has been supplanted by the decoupled head structure (Decoupled-Head), which separates classification and detection tasks: one head handles classification, measured by binary cross-entropy loss, whereas the other performs bounding box regression, measured by bounding box loss. This decoupled approach allows for more specialized optimization of each task.
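For reference, the standard convolution block described above can be written in a few lines of PyTorch. This is a generic sketch of the CBS structure (Conv2d + BatchNorm2d + SiLU), not the Ultralytics implementation itself; kernel size and stride defaults are illustrative.

```python
import torch.nn as nn

class CBS(nn.Module):
    """Standard YOLOv8-style convolution block: Conv2d + BatchNorm2d + SiLU."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```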
Based on variations in network depth and width, the YOLOv8 algorithm is divided into five types: n, s, m, l, and x. Among them, YOLOv8n has the smallest number of model parameters and the lowest number of floating-point operations, resulting in faster inference speed and offering excellent usability and customizability. In this study, considering both detection accuracy and speed, YOLOv8n is selected as the baseline model.
2.1.2. YOLOv8n-EGV
To address the limitations of the standard YOLOv8n algorithm when detecting pigs under challenging conditions—such as occlusion, deformation, partial visibility, and poor lighting—this study proposes the YOLOv8n-EGV detection network. The new network incorporates an efficient local attention mechanism and a lightweight convolutional module, as illustrated in
Figure 3. The YOLOv8n-EGV model features two key architectural improvements. Firstly, the Efficient Local Attention (ELA) mechanism is incorporated into the backbone network to enhance the representational capability of the CNN. The ELA structure is lightweight and straightforward, precisely identifying regions of interest while improving detection accuracy. Secondly, the Conv module in the Neck layer of the YOLOv8n network is substituted with GSConv, and the C2f module in the Neck is replaced with VoV-GSCSP. This modification reduces the model’s parameter count while maintaining detection accuracy, thereby rendering the model more lightweight. The specific improvement methods are outlined below:
ELA
ELA (Efficient Local Attention) [
31] is an efficient local attention mechanism, illustrated in
Figure 4. Despite its simple architecture, ELA significantly enhances detection performance by precisely identifying regions of interest while preserving the channel dimensions of the input feature map, and it remains lightweight. The ELA mechanism operates through the following process: First, similarly to CA (Coordinate Attention), ELA utilizes strip pooling in the spatial dimension to obtain feature vectors in both horizontal and vertical orientations. It preserves a narrow kernel shape to capture long-range relationships and mitigate the influence of irrelevant regions on label prediction, thereby acquiring rich target location features in the respective matrices. Second, ELA processes these horizontal and vertical feature vectors independently using 1D convolutions for local interactions. The kernel size can be adjusted as needed to control the scope of these interactions. The resulting vectors undergo group normalization (GN) followed by nonlinear activation to generate directional attention predictions. The final position attention is obtained by multiplying the position attention from both directions. Compared to 2D convolutions, 1D convolutions are more adept at handling continuous signals, and are lighter and faster. In comparison to BN, GN demonstrates comparable performance and greater versatility.
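The following PyTorch sketch illustrates this process under stated assumptions: the 1D convolution is depth-wise with a kernel size of 7, a single convolution is shared between the two directions, and GroupNorm uses 16 groups. The published ELA implementation may differ in these details.

```python
import torch.nn as nn

class ELA(nn.Module):
    """Efficient Local Attention sketch: strip pooling along each spatial
    direction, a depth-wise 1D convolution for local interaction, GroupNorm
    plus sigmoid, then multiplication of the two directional attention maps.
    Kernel size and number of GN groups are illustrative assumptions."""
    def __init__(self, channels, kernel_size=7, gn_groups=16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels, bias=False)
        self.gn = nn.GroupNorm(gn_groups, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3)                               # (b, c, h) horizontal strip pooling
        x_w = x.mean(dim=2)                               # (b, c, w) vertical strip pooling
        a_h = self.sigmoid(self.gn(self.conv(x_h))).view(b, c, h, 1)
        a_w = self.sigmoid(self.gn(self.conv(x_w))).view(b, c, 1, w)
        return x * a_h * a_w                              # position attention from both directions
```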
Compared to some well-established attention mechanisms, such as SE [
32] (Squeeze-and-Excitation) and CBAM [
33] (Convolutional Block Attention Module), the ELA attention mechanism demonstrates superior performance. SE mainly focuses on adjusting the channel-wise weights, neglecting the crucial role of spatial information in occlusion and local structural changes. Although CBAM incorporates spatial attention, its spatial modeling is relatively shallow, making it difficult to capture detailed local features of the target adequately. The ELA attention mechanism strengthens the modeling of local feature regions and places greater emphasis on fine-grained information in key areas of the target. As a result, it can effectively extract discriminative regional features even when the target is partially occluded or deformed. ELA typically employs deformable or multi-scale receptive field designs, which enhance its spatial selectivity and dynamic adaptability, enabling it to accurately locate and identify targets in complex backgrounds. By enhancing the response to local textures and edge regions, ELA exhibits greater robustness to strongly structured local features, such as contours and boundaries. It maintains focus on key areas even under blurry visual conditions, such as low-light environments, thereby improving detection accuracy. Consequently, in the real-world scenarios of this study, ELA demonstrates superior performance compared to certain established attention mechanisms.
GSConv
GSConv [
34] is an innovative convolutional operation aimed at enhancing the performance and efficiency of deep learning models. Its core principle is based on the integration of group convolutions and a displacement mechanism [
35]. First, GSConv segments the input feature map into multiple groups, performing convolution operations on each group independently. This method markedly decreases the parameter count and computational complexity while maintaining model expressiveness. Subsequently, during the convolution process, GSConv introduces a displacement operation that shifts certain channels of the feature map, enhancing the acquisition of local contextual information. After displacement, the feature map is reassembled by concatenating the groups, allowing the model to merge information from multiple directions and further improve overall performance. GSConv is not only more computationally efficient than conventional dense convolutions, making it suitable for resource-constrained devices, but also enhances feature learning capability, enabling superior performance on intricate patterns. In convolutional neural network architectures, standard convolution operations yield higher accuracy but result in increased model complexity and longer inference times. Conversely, while DWConv (depth-wise separable convolution) offers faster detection, its accuracy is diminished. Therefore, the design objective of the GSConv module is to reduce model complexity while preserving detection accuracy. The structure of the module is presented in
Figure 5. GSConv integrates the semantic information generated by standard convolution into various parts of DWConv, fully leveraging the advantages of both standard and DWConv convolutions. This approach accelerates DWConv and enhances the accuracy of standard convolutions, thereby achieving better performance in the pig detection task.
To accelerate the computation of predictions, the feature maps fed forward through the Backbone of a CNN almost always undergo a similar transformation process, in which spatial information is progressively transferred to the channel dimension. Moreover, each time the spatial dimensions (width and height) of the feature map are compressed while the channels are expanded, some semantic information is inevitably lost. Dense convolution calculations maximize the retention of hidden connections between channels, while sparse convolutions completely sever these connections. GSConv preserves these connections as much as possible. Nevertheless, if it were employed at every stage of the model, the network would become excessively deep, and the increased depth would impede data flow, considerably prolonging the inference time. By the time the feature maps reach the Neck, they have already been elongated (with the channel dimension maximized and the width and height dimensions minimized), rendering further transformation superfluous. Consequently, a more effective strategy is to apply GSConv only in the Neck layer. At this stage, employing GSConv to process concatenated feature maps is preferable, as it reduces redundant and repeated information without requiring further compression, thus enhancing the efficacy of attention modules such as SPP and CA.
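A compact PyTorch sketch of the GSConv idea is shown below. It follows the description above (a standard convolution producing half of the output channels, a depth-wise convolution, concatenation, and a channel-mixing shuffle); the depth-wise kernel size and the exact shuffle pattern are assumptions, not a reproduction of the reference implementation.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv sketch: a standard convolution produces half of the output
    channels, a depth-wise convolution refines them, the two halves are
    concatenated, and a channel shuffle mixes information between them."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cv2 = nn.Sequential(                          # depth-wise branch
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        x1 = self.cv1(x)
        x2 = self.cv2(x1)
        y = torch.cat((x1, x2), dim=1)                     # (b, c_out, h, w)
        b, c, h, w = y.shape                               # channel shuffle
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)
```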
VOVGSCSP
The cross-level part of the VOVGSCSP [
34] network is constructed using a one-shot aggregation method. This approach effectively integrates information between feature maps from different stages. Stacking GSConv as the host convolution further strengthens the GS bottleneck structure, improving the network’s feature processing capability, enhancing the nonlinear expression of the features, and increasing feature reuse. Although GSConv can markedly reduce redundant information in the feature maps of the pig detection model, on its own it offers limited scope for further decreasing inference time without compromising accuracy. Consequently, the C2f module in the Neck network is replaced with the VoVGSCSP module. VoVGSCSP consists of GSConv modules aggregated in a one-shot manner, as shown in
Figure 6a. The structure of VoVGSCSP is illustrated in
Figure 6b. The implementation of the VoVGSCSP module decreases both computational complexity and network structure complexity while preserving detection accuracy. This further reduces the model’s memory usage, accelerates inference performance, and optimizes feature utilization, making the model more suitable for lightweight tasks.
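Building on the GSConv sketch above (and reusing its imports and the GSConv class), the following sketch illustrates one plausible form of the GS bottleneck and the one-shot aggregation in VoVGSCSP. The number of bottlenecks, the hidden channel widths, and the shortcut design are illustrative assumptions rather than the exact structure used in this study.

```python
class GSBottleneck(nn.Module):
    """Two stacked GSConv layers with a 1x1 shortcut (sketch)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.gs1 = GSConv(c_in, c_out // 2, k=1)
        self.gs2 = GSConv(c_out // 2, c_out, k=3)
        self.shortcut = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.gs2(self.gs1(x)) + self.shortcut(x)

class VoVGSCSP(nn.Module):
    """One-shot aggregation of GSConv bottlenecks, used in place of C2f in the Neck."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.cv2 = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.blocks = nn.Sequential(*[GSBottleneck(c_hidden, c_hidden) for _ in range(n)])
        self.cv3 = nn.Conv2d(2 * c_hidden, c_out, 1, bias=False)

    def forward(self, x):
        return self.cv3(torch.cat((self.blocks(self.cv1(x)), self.cv2(x)), dim=1))
```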
2.2. Tracker
2.2.1. Deep SORT
The SORT [
36] (Simple Online and Realtime Tracking) algorithm employs Kalman filtering and the Hungarian algorithm for multi-target tracking, providing the benefits of simplicity and high speed. Nevertheless, the SORT algorithm only predicts the tracking trajectory based on the motion trends of the target, disregarding the object’s appearance characteristics. The Kalman filter employs a linear constant-velocity model for prediction; however, in real-world tracking, the object’s velocity may fluctuate. Moreover, the SORT algorithm is designed exclusively for short-term target tracking and is ineffective in scenarios involving target occlusion. These constraints result in frequent identity switches and low tracking accuracy in the presence of occlusion.
To address the limitations of the SORT algorithm, Wojke et al. proposed an enhanced tracking algorithm, Deep SORT [
37] (Simple Online and Realtime Tracking with a Deep Association Metric). This improved approach incorporates appearance information, combining motion and appearance features through a linear weighting as the final matching metric for the Hungarian algorithm. Additionally, Deep SORT employs a cascade matching strategy, assigning higher matching priority to tracks that have been updated more recently. The algorithm incorporates deep-learning-based feature extraction and similarity metrics, comparing appearance features for all targets during each tracking step to reduce identity switches and achieve more sustained tracking. The network structure of the Deep SORT algorithm is shown in
Figure 7. The Deep SORT architecture consists of two main components: the deep appearance descriptor branch and the motion prediction branch. The motion prediction branch uses Kalman filtering to predict the trajectory state, leveraging data from previous frames to project the target’s position in the subsequent frame. The Mahalanobis distance is employed to assess the difference between the predicted trajectory and the actual detection. The deep appearance descriptor branch operates as a convolutional network for image classification, extracting appearance features from the detected frames and converting them into feature vectors. Cosine distance metrics are utilized to evaluate the similarity between these feature vectors. The trajectory segments are subsequently linked using a cascade matching algorithm that combines both cosine and Mahalanobis distances. During the tracking management phase, the tracks are updated, initialized, and deleted to ensure the continuity of effective tracking.
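As a concrete illustration of how the two distance terms are combined, the sketch below builds a linearly weighted cost matrix and solves the assignment with the Hungarian algorithm (SciPy's linear_sum_assignment). The weight lam and the gate thresholds are illustrative values, not the exact settings used in this study or in the reference Deep SORT code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

INFEASIBLE = 1e5  # large cost assigned to gated-out pairs

def associate(maha_dist, cos_dist, lam=0.5, maha_gate=9.4877, cos_gate=0.2):
    """maha_dist / cos_dist: (num_tracks, num_detections) distance matrices.
    Returns the list of matched (track_index, detection_index) pairs."""
    cost = lam * maha_dist + (1.0 - lam) * cos_dist       # linear weighting of the two metrics
    cost = np.where((maha_dist > maha_gate) | (cos_dist > cos_gate),
                    INFEASIBLE, cost)                      # gate infeasible pairs
    rows, cols = linear_sum_assignment(cost)               # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < INFEASIBLE]
```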
2.2.2. Deep SORT-P
To mitigate target loss and frequent identity switches during pig tracking with the Deep SORT algorithm, this study proposes a pig tracking algorithm, Deep SORT-P, which incorporates an efficient re-identification feature extraction network and an advanced IoU matching algorithm. The specific improvements are described as follows:
DenseNet
The original Deep SORT employs ResNet [
38] as the re-identification model to extract target features, which effectively improves the target tracking performance and facilitates more precise identification and tracking of the target, thereby enhancing the overall tracking performance and stability. However, in this study, the pigs are monitored in a channel environment, where mutual occlusion and squeezing are particularly pronounced, especially when they first enter the channel. This creates a more complex scenario, requiring stronger re-identification capabilities. A re-identification network with enhanced feature extraction ability is employed to perform deeper feature extraction of the pigs’ appearance, aligning with the demands of the actual pig tracking process.
DenseNet [
39] is a densely connected model that differs from ResNet, which adds the input and output of a block to form a residual structure. In DenseNet, each layer is connected to all preceding layers, receiving their outputs, concatenated along the channel dimension, as additional inputs. This allows each layer within a dense block to directly access the feature maps of all prior layers, enabling feature reuse. The network can therefore extract and utilize more features from the input data, enhancing its expressive power, and this extensive feature reuse enables relatively simple network structures to achieve better performance. As shown in Formula (1), the input of the i-th layer depends not only on the output of the (i − 1)-th layer but also on the outputs of all previous layers:

x_i = H_i([x_0, x_1, …, x_{i−1}])   (1)

In this context, [x_0, x_1, …, x_{i−1}] represents the concatenation operation, where all outputs from x_0 to x_{i−1} are combined along the channel dimension, and H_i denotes the nonlinear transformation, which is a combination of BN + ReLU + Conv (3 × 3).
During the target tracking process, when occlusion arises, one crucial way for Deep SORT to maintain consistent IDs is cascade matching between the appearance features derived from the re-identification network and the detection boxes. The matching results directly affect the overall ID switch rate of the tracking model. Consequently, the DenseNet model, capable of extracting richer visual information, is better suited to the task of appearance feature extraction in the context of this study. This research employs DenseNet-121 as the feature extraction network to enhance pig re-identification performance, reduce frequent identity switches during tracking, and improve the tracking model’s accuracy. The structure of the DenseNet-121 network is illustrated in
Table 1.
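A minimal sketch of using DenseNet-121 as the appearance-embedding (re-identification) backbone is given below, assuming the torchvision implementation. The embedding dimension (128) and the absence of pretrained weights are illustrative choices, not the exact training configuration used here.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class DenseNetReID(nn.Module):
    """Appearance-embedding sketch for Deep SORT-P with a DenseNet-121 backbone."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.densenet121(weights=None)        # dense blocks + transitions
        self.features = backbone.features
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.Linear(backbone.classifier.in_features, embed_dim)

    def forward(self, x):                                   # x: batch of cropped pig patches
        f = self.pool(self.features(x)).flatten(1)
        return F.normalize(self.embed(f), dim=1)            # unit-norm appearance vector
```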
CIoU
In the second stage of the Deep SORT algorithm, while the IoU matching algorithm offers scale invariance, it exhibits limitations as an overlap metric due to the significant scale variations in pig movement and the presence of occlusions. In contrast, CIoU matching evaluates not only the overlap area but also the distance between center points and the aspect ratio. As a result, CIoU matching can reduce trajectory matching errors and decrease the number of identity switches. The expression is as follows:

CIoU = IoU − ρ²(b, b^gt)/c² − αv   (2)

where ρ(b, b^gt) represents the Euclidean distance between the center points of the two detection boxes; b and b^gt denote the center points of the predicted bounding box and the ground truth bounding box, respectively; c represents the diagonal length of the smallest enclosing box that encompasses both bounding boxes; α is a balancing factor that regulates the relative influence of the center-point distance and aspect-ratio terms; and v is a parameter utilized to assess the consistency of the aspect ratio between the predicted and ground truth bounding boxes. After integrating CIoU into the Deep SORT object tracking algorithm, the similarity between bounding boxes is evaluated from a more holistic standpoint. This is particularly beneficial in target tracking scenarios that involve occlusion, scale variation, or shape deformation. When monitoring a large number of targets with frequent occlusions, CIoU can more accurately measure the similarity between detection boxes, mitigating the impact of occlusions caused by pigs crowding together. This enhances the model’s representational capability and significantly improves the precision of the object tracking process.
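For reference, the sketch below computes CIoU for two boxes in (x1, y1, x2, y2) format following the standard formulation of the expression above; it mirrors the common definition of the v and α terms rather than any particular library implementation.

```python
import math
import torch

def ciou(box1, box2, eps=1e-7):
    """CIoU between two boxes given as tensors [x1, y1, x2, y2] (sketch)."""
    # Intersection and union
    iw = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    ih = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = iw * ih
    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union
    # Squared center-point distance and enclosing-box diagonal
    rho2 = ((box1[0] + box1[2] - box2[0] - box2[2]) ** 2 +
            (box1[1] + box1[3] - box2[1] - box2[3]) ** 2) / 4
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its balancing factor
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```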
2.3. Counting Technique
This study employs the virtual counting line method for object counting. This approach is straightforward and effective, and demonstrates commendable counting performance when the detection and tracking algorithms achieve high accuracy. The counting process is illustrated in
Figure 8.
The key aspect of implementing this counting method in Deep SORT is the incorporation of the virtual counting line judgment into the multi-object tracking process. In this study, the counting line is manually defined by setting specific coordinates: a vertical counting line is positioned at the center of the screen, extending beyond the channel’s width so that it vertically spans the entire channel. When a target crosses the counting line, logical conditions determine whether the count is valid. During the tracking process in Deep SORT, each target is allocated a unique ID and trajectory. To ensure that a target is counted only once upon crossing the counting line, a “counting flag” is assigned to each target, indicating whether it has already been counted. The counting flag is a binary variable (0 or 1), with 0 signifying that the target has not been counted and 1 indicating that it has. In each frame, Deep SORT updates the state of each target. If a target’s counting flag is already set to 1, it is skipped. Otherwise, the system checks whether the “direction flag” of the target’s trajectory is set to 1. The direction flag is a binary variable, where 1 represents movement to the left and 0 represents movement to the right. If the direction is erroneous, the trajectory is disregarded. If the direction is correct, the system further verifies whether the target’s trajectory has crossed the counting line. If so, the count is incremented and the trajectory’s counting flag is set to 1; otherwise, no count is recorded for that frame.
In summary, each time Deep SORT processes a frame, it initially updates the target’s position, velocity, and visual characteristics. Subsequently, for each target, it verifies if the counting flag is set to 1. If the flag is set, it indicates that the target has been previously counted, and the target is omitted. Next, if the target has not been counted, its direction of movement is examined. If the direction is incorrect, the target is skipped. If the target’s movement direction is correct, it is further verified whether it has crossed the counting line. Upon fulfillment of the condition, the count is augmented, and the target’s counting flag is set to 1. Finally, all counting results are aggregated, and upon processing all video frames, the total number of targets that have crossed the counting line is obtained.
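The counting logic described above can be summarized in the following sketch. The track representation (an ID plus the previous and current horizontal centre positions) and the assumption that leftward movement is the valid direction are simplifications for illustration.

```python
def update_counts(tracks, line_x, counted_ids, total):
    """Virtual counting line sketch. Each track is (track_id, cx_prev, cx_curr),
    giving the horizontal centre position in the previous and current frame."""
    for track_id, cx_prev, cx_curr in tracks:
        if track_id in counted_ids:               # counting flag already set: skip
            continue
        moving_left = cx_curr < cx_prev           # direction flag: leftward is valid
        if moving_left and cx_prev > line_x >= cx_curr:
            counted_ids.add(track_id)             # mark this trajectory as counted
            total += 1
    return total

# Usage per frame (line at x = 960 is illustrative):
# counted_ids, total = set(), 0
# total = update_counts(frame_tracks, 960, counted_ids, total)
```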
4. Conclusions
This study proposes an enhanced dynamic pig counting algorithm, YOLOv8n-EGV+Deep SORT-P. Compared to the original model, the improved algorithm enhances detection accuracy and speed in pig detection, particularly under occlusion scenarios such as crowding and deformation, as well as in low-light conditions. Additionally, it improves the precision and stability of multi-object pig tracking, mitigating issues such as target loss and frequent identity switches. Experimental results show that the mAP increased by 1.7%, while the Params and GFLOPs were reduced by 14.3% and 17.3%, respectively. Moreover, MOTA improved by 4.2%, MOTP increased by 1.7%, and IDSW decreased by approximately 25.5%. Finally, a virtual counting line method was employed to perform counting experiments on 63 video segments of pigs passing through a farm passage. When comparing the counting results before and after the algorithm enhancement, the improved algorithm achieved a counting accuracy of approximately 92.1%, representing a 17.5% increase. In summary, the proposed improved algorithm demonstrates significant superiority.
This study’s analysis indicates that future work can focus on the following directions:
The dataset established in this study consists of pigs of similar size and a single category. Future research could expand the dataset to include different pig categories, classified by size and color. The proposed improved algorithm can be adapted for the separate counting of various pig types, thereby broadening its applicability.
Further optimization and enhancement of the tracking model could be explored to improve counting accuracy and stability in more complex environments, ensuring a more comprehensive evaluation of the model’s effectiveness.
The counting strategy can be further optimized. This study utilizes a dataset of pigs being manually herded through the farm passage, eliminating the issue of duplicate counts due to pigs crossing back over the virtual counting line. Future work could introduce a bidirectional counting method, where pigs crossing the counting line in the left and right directions are counted separately, and the final count is obtained by subtracting these two values. This enhanced counting strategy would broaden the method’s applicability and effectively prevent duplicate counts when pigs traverse the passage without human supervision.