1. Introduction
At present, China's pig breeding industry is developing rapidly toward intelligent, large-scale production. The pork industry is closely tied to people's daily lives and health, and with the growing global population and rising demand for meat, efficient pig farming technology is particularly important. The pig weight evaluation addressed in this paper serves this purpose directly. From pig weight estimates, information such as the growth uniformity and health status of pig herds can be obtained in real time to formulate scientific breeding plans, match feed rationally, improve breeding efficiency, and reduce breeding costs [1,2,3]. In addition, for insurance and credit purposes, insurers and lenders can count pigs and assess their weight automatically each time a farm ships pigs out, and calculate the farm's income for risk assessment. Traditionally, a floor scale has been the main way to measure pig weight, but the process easily triggers a stress response in the pigs and is labor-intensive for workers. There is therefore an urgent need for a fast, accurate, non-invasive method of estimating pig weight to increase production efficiency, strengthen biosecurity measures, and improve pig health. With the development of deep learning in recent years, intelligent counting and weight estimation from camera images have been widely studied in animal husbandry [4,5,6,7].
Marchant [8] divided the target region of the image into several main parts and estimated mass from the areas of these parts. Based on the binocular vision principle, Li Zhuo [9] proposed a pig contour extraction algorithm based on depth images to estimate pig weight, addressing the susceptibility of natural images to contamination and lighting interference. Liu Pengfei [10] proposed a method for acquiring pig body size parameters and estimating weight based on a 3D point cloud algorithm. Kong Shangyu [11] estimated body weight with a ResNet regression network.
Our investigation identified four main kinds of pig weight evaluation methods. (1) Parameters such as body size and back area are extracted from 2D images, and body weight is estimated from a relationship model over these parameters; the average estimation error is 3.38~5.30% [12,13]. (2) A binocular or depth camera photographs the pig, a 3D image of the body is reconstructed, parameters such as back height and area are extracted from it, and weight is estimated from these parameters; the average estimation error is 2.26~3.30% [14]. (3) The back region of the pig image is fitted with an ellipse by the least-squares method, and body weight is estimated from a relationship model over the ellipse's centroid, major- and minor-axis lengths, area, and eccentricity; the average relative error is 3.0~3.8% [15]. (4) A grid slide is projected onto the pig's back, and weight is estimated after the body height and area are computed by the principle of stereoscopic projection [16], but this method is difficult to automate. The image processing pipeline of methods (2) and (3) is rather complicated: the pig image generally requires background removal, image enhancement, binarization, filtering and noise reduction, head and tail removal, and extraction of body size, volume, and other parameters. This takes a long time and is difficult to apply where acquisition speed and real-time performance matter, such as group feeding systems for fattening pigs and sow feeding stations [17].
Neural network algorithms for pig weight estimation still face several challenges: (1) Most studies rely on binocular cameras and rangefinders, which are costly and difficult to install and maintain; their cost and installation/maintenance effort are at least twice those of a monocular camera scheme. (2) Crowding causes pigs to occlude and squeeze one another, deforming their shapes and degrading the accuracy of image segmentation. (3) Image segmentation models are slow at inference, demand substantial computing power, and are expensive to deploy [18,19].
To address these problems, this study aimed to develop and validate an improved pig weight estimation algorithm based on the EfficientVit network [20] combined with a cascaded model inference method. Our hypothesis is that an accurate pig weight estimation algorithm can be built using only a single camera.
The time cost of memory access is a key factor limiting model inference speed. In ViT (Vision Transformer) [21], frequent reshape operations, element-wise addition, element-wise multiplication, and normalization all require accesses across different storage units, which markedly reduces memory access efficiency. Although this burden can be alleviated by simplifying the softmax self-attention mechanism, e.g., with ReLU [22] self-attention, sparse attention, or low-rank approximation, these approaches often sacrifice model accuracy. In addition, tensor reshaping in MHSA (multi-head self-attention) also seriously degrades memory access efficiency. Recent studies have shown that inefficient memory access operations are concentrated mainly in the MHSA layers, not the FFN (feedforward network) [23] layers. Therefore, properly adjusting the ratio between MHSA and FFN layers can significantly reduce memory access time while preserving model performance: reducing the proportion of MHSA operations and increasing the proportion of the more memory-efficient linear FFN operations is a viable optimization scheme. In this study, we replaced MHSA with the CGA (cascading group attention) module. The EfficientVit-C network improvements are as follows:
(1) MHSA (multi-head self-attention) modules in the EfficientVit network were replaced with CGA (cascading group attention) modules, increasing the usage ratio of linear FFN (feedforward network) layers while reducing that of multi-head self-attention.
(2) At the input, an Overlap PatchEmbed layer (overlapping patch embedding) directly downsamples the image by a factor of eight, and the whole model uses only three stage scales.
(3) LN (layer normalization) operations were uniformly replaced with BN (batch normalization), and model accuracy and parameter efficiency were improved through structured pruning.
In Section 2, we describe the construction of the dataset and the improved network, explain how the back area of the pig is obtained from the image, and show how pig weight is calculated. Experimental results are provided in Section 3. Finally, the discussion and conclusions are provided in Section 4 and Section 5.
2. Materials and Methods
2.1. Image Collection and Dataset Construction
The study was carried out on pigs at commercial farms. The data collection site was a 30,000-head pig farm in Ninghe, Tianjin, China. The pigs were aged 1 to 5 months, and data collection lasted 3 months. The collection scene was the pig exit channel, which measures 5.0 m × 1.2 m. A Hikvision DS-2CD3T25-I3 camera with a resolution of 1920 × 1080 pixels was used for image acquisition, installed 3 m above the middle of the channel. During filming, 1-5 pigs passed through the channel at a time. To ensure stable acquisition, all images were collected under natural light during the day (7 a.m. to 6 p.m.) and under heat-lamp lighting at night. To improve the robustness of the model across environments, we collected 4000 pictures of 800 pigs and divided the dataset into training, validation, and test sets at a ratio of 7:2:1.
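The 7:2:1 split can be sketched as follows; the file names, random seed, and per-image shuffling strategy are illustrative assumptions, not the paper's code:

```python
# Sketch of the 7:2:1 train/val/test split described above.
# File names and the seed are hypothetical placeholders.
import random

def split_dataset(image_paths, seed=42, ratios=(0.7, 0.2, 0.1)):
    """Shuffle and split a list of image paths into train/val/test."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# 4000 images, as in the dataset above -> 2800 / 800 / 400
train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(4000)])
```

In practice one might instead split by pig ID so that near-duplicate frames of the same animal do not leak across subsets.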
To support the design of the weight estimation algorithm based on adaptive weight adjustment, ground-truth weights were measured with the scale in the farm's restraint stall. Since the pigs are active during measurement, the reading fluctuates slightly for some time in most cases; accurate values can be obtained by setting thresholds and applying smoothing filtering [11]. Pig weights in the dataset lie in the range [20, 130] kg.
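The thresholding and smoothing step can be sketched as follows; the window size and filter choice (a median filter) are assumptions for illustration, not taken from the paper:

```python
import statistics

def stable_weight(readings, lo=20.0, hi=130.0, window=5):
    """Reject out-of-range scale readings, then median-filter the rest.

    lo/hi follow the [20, 130] kg dataset range; the window size is an
    assumed value, not specified in the paper.
    """
    valid = [r for r in readings if lo <= r <= hi]
    if len(valid) < window:
        return None  # not enough stable samples yet
    # median of the last `window` in-range readings
    return statistics.median(valid[-window:])

# A pig shifting on the scale: the spike (250.0) and dropout (0.0)
# are rejected; the remaining readings are smoothed.
readings = [96.8, 97.4, 250.0, 96.9, 0.0, 97.1, 97.0]
print(stable_weight(readings))  # -> 97.0
```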
Images are captured by the remote-server-controlled camera whenever the shooting conditions are met. Captured images are named and saved by the rule "pig farm number + camera number + shooting time", and the data are uploaded every other day. Labelme (version 3.16.2) annotation software is used to annotate the images sent back to the server. For the counting task, the entire visible surface of the pig is labeled. For the weight assessment task, because the pig's head and tail swing freely and may not be captured by the camera, which easily causes errors, only the back of the pig (from the shoulder to the rump) is labeled. The dataset production process is illustrated in Figure 1.
2.2. Construction of the EfficientVit-C Model
2.2.1. CGA (Cascading Group Attention) Modules
MHSA improves model performance by embedding the input sequence into multiple attention heads and computing attention in each head in parallel. However, many attention heads are highly similar—they learn similar projections of the same complete features—which produces a large amount of redundant feature computation. Drawing on the design of group convolution, the CGA module splits the attention heads to save computation: in each attention subhead, the input is grouped before computing Q (query), K (key), and V (value). The formal formula is expressed as follows:

$$\widetilde{X}_{ij} = \mathrm{Attn}\left(X_{ij}W_{ij}^{Q},\; X_{ij}W_{ij}^{K},\; X_{ij}W_{ij}^{V}\right)$$

$$\widetilde{X}_{i+1} = \mathrm{Concat}\left[\widetilde{X}_{ij}\right]_{j=1:h} W_{i}^{P}$$

where $W_{ij}^{Q}$, $W_{ij}^{K}$, and $W_{ij}^{V}$ are the mapping layers that split the input features into different subspaces, and $W_{i}^{P}$ is a linear layer that maps the concatenated output features back to the input feature dimension. The output of each subattention head is added to the input of the next subattention head, thereby increasing model capacity and further improving feature diversity. The formal formula is expressed as follows:

$$X_{ij}' = X_{ij} + \widetilde{X}_{i(j-1)}, \quad 1 < j \le h$$

where $X_{ij}'$ is the input to the $j$-th head, formed by adding the output of the $(j-1)$-th head to the channel split $X_{ij}$.
Parameter reallocation strategy: using Taylor structured pruning, we set small channel dimensions for the Q and K projections in all stages of each head, while the V projection keeps a dimension matching the input embedding. This design exploits the key role of the V projection in information transmission. At the FFN level, we addressed parameter redundancy by reducing the expansion ratio from 4 to 2. This adjustment reduces model complexity and helps improve efficiency. Important modules retain enough channels to capture and learn rich representations in high-dimensional space, while redundant parameters in unimportant modules are removed. This design not only prevents loss of feature information during training but also improves the inference speed of the model.
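As a toy illustration of the cascading computation (a NumPy sketch with simplified, equal Q/K/V dimensions, not the paper's implementation): each head attends over its own channel slice of the input, and each head's output is added to the next head's input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascading_group_attention(X, Wq, Wk, Wv, Wp, heads):
    """Toy CGA. X: (n_tokens, d); Wq/Wk/Wv: per-head projection lists;
    Wp: (d, d) output projection. Shapes are illustrative only."""
    n, d = X.shape
    dh = d // heads
    outs = []
    carry = np.zeros((n, dh))
    for j in range(heads):
        # cascade: add the previous head's output to this head's slice
        Xj = X[:, j * dh:(j + 1) * dh] + carry
        q, k, v = Xj @ Wq[j], Xj @ Wk[j], Xj @ Wv[j]
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        carry = attn @ v
        outs.append(carry)
    # concatenate head outputs and project back to the input dimension
    return np.concatenate(outs, axis=1) @ Wp

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
dh = d // h
X = rng.normal(size=(n, d))
Wq = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wk = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wv = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wp = rng.normal(size=(d, d))
out = cascading_group_attention(X, Wq, Wk, Wv, Wp, h)
assert out.shape == (n, d)
```

In the actual module the Q and K projections use smaller channel dimensions than V, per the parameter reallocation strategy above.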
The multi-layer module architecture designed in this study is token interaction + linear FFN + cascading group attention + token interaction + linear FFN. The formal formula is expressed as follows:

$$X_{i+1} = \prod^{N}\Phi_{i}^{F}\left(\Phi_{i}^{A}\left(\prod^{N}\Phi_{i}^{F}\left(X_{i}\right)\right)\right)$$

where $X_i$ is the complete input feature of the $i$-th memory block, $\Phi_{i}^{A}$ is the cascading group attention layer, and $\prod^{N}\Phi_{i}^{F}$ denotes $N$ stacked token interaction + linear FFN layers. This multi-layer architecture reduces memory-access time and enables efficient communication between different feature channels. In addition, the token interaction layer uses depth-wise separable convolution, which better captures the inductive bias of local structure and enhances the model's ability to extract features. The internal structure of cascading group attention and EfficientViT-C's building block is illustrated in Figure 2.
2.2.2. Neural Network Structure Optimization
At the network input, we use an overlapping patch embedding layer to achieve 8× downsampling and keep the network efficient [18,19]. Only three stages are used, which greatly improves operating efficiency. For normalization, we use batch normalization (BN) instead of layer normalization (LN), because BN can be folded into a preceding convolution or linear layer, giving it a significant run-time advantage over LN [24].
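The folding argument can be verified numerically. A minimal NumPy sketch (with made-up statistics, not the paper's code) shows that a linear layer followed by inference-time BN is exactly one linear layer with rescaled weights and a shifted bias:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)
# Fixed BN running statistics and affine parameters (inference time)
mean, var = rng.normal(size=d_out), rng.uniform(0.5, 2.0, size=d_out)
gamma, beta = rng.normal(size=d_out), rng.normal(size=d_out)
eps = 1e-5

x = rng.normal(size=(3, d_in))
scale = gamma / np.sqrt(var + eps)

# Linear layer followed by BN
y_ref = (x @ W + b - mean) * scale + beta
# Folded: a single linear layer, so BN costs nothing at run time
W_fold = W * scale                 # per-output-channel rescaling
b_fold = (b - mean) * scale + beta
y_fold = x @ W_fold + b_fold

assert np.allclose(y_ref, y_fold)
```

LN has no such fold, because its statistics are computed per sample at run time rather than fixed.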
For the activation function, we adopted ReLU: compared with the commonly used SiLU [25] or GELU [26], ReLU is not only faster but also better supported on many inference deployment platforms [27,28]. This choice adds practicality and stability to our model. Finally, we pruned the whole network with the structured pruning method. The overall network structure of the proposed EfficientViT-C is illustrated in Figure 3.
2.3. Model Structure for Pig Weight Estimation
2.3.1. Monocular Depth Estimation Network
Under perspective imaging (near objects appear larger, far objects smaller), differences in pig height distort the target size in the captured image and thereby introduce errors into weight estimation. To reduce such errors, we must first obtain a depth image of the pig. With depth calibration, the pig's depth can be determined accurately, and the pig target mask segmented by the EfficientVit-C model is then scaled up or down accordingly.
In our experimental setting, the camera height is 3 m and a pig height of 1 m is taken as the benchmark, so a pig directly below the camera at a distance of 2 m from it defines the baseline distance. When the pig-to-camera distance exceeds 2 m, the target mask is magnified; when it is less than 2 m, the mask is shrunk. In this way, we obtain a target mask closer to the pig's true size.
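The scaling rule can be sketched as follows; assuming apparent size varies inversely with distance (our reading of the text, not code from the paper), the mask area is rescaled by the squared ratio of actual to baseline distance:

```python
# Perspective correction sketch: apparent linear size scales as
# 1/distance, so a mask of a pig at distance d is rescaled by d/d_ref
# to match the 2 m baseline. Function name and values are illustrative.
D_REF = 2.0  # baseline pig-to-camera distance in metres

def corrected_area(mask_area_px, distance_m, d_ref=D_REF):
    """Rescale a segmented mask area to the reference distance.

    Area goes as the square of the linear scale factor.
    """
    scale = distance_m / d_ref
    return mask_area_px * scale ** 2

# A pig 2.5 m away looks smaller than at the 2 m baseline: magnify.
print(corrected_area(10_000, 2.5))   # -> 15625.0
# A pig 1.6 m away looks larger: shrink (about 6400 px).
print(corrected_area(10_000, 1.6))
```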
Monocular depth estimation uses a single camera to estimate the relative distance between each pixel of an RGB image and the camera. The key is how the network captures and analyzes depth information in the image and quantifies distances through depth calibration. The Depth Anything [29] network adopts a feature alignment loss (cosine embedding loss, L2 loss, etc.) to make the feature space highly continuous, and uses this semantically continuous feature information to exploit semantic priors more effectively, greatly improving the accuracy of monocular depth estimation. The feature alignment loss is as follows, where $\cos(\cdot,\cdot)$ is the cosine similarity between two feature vectors:

$$\mathcal{L}_{\mathrm{feat}} = 1 - \frac{1}{HW}\sum_{i=1}^{HW}\cos\left(f_{i}, f_{i}'\right)$$

where $f_i$ and $f_i'$ are the corresponding feature vectors of the two models at pixel $i$, and $H$ and $W$ are the feature map height and width.
Depth Anything uses the pre-trained weights of the DINOv2 [30] model and, following Segment Anything [31], builds a large-scale data engine over unlabeled images. Distant vista regions (e.g., sky) are detected by a pre-trained semantic segmentation model, and their relative depth (disparity) is set to 0, the farthest value. The Depth Anything model uses the fine-tuned ZoeDepth [32] framework as the depth model that predicts relative depth. The relative depth must then be converted to absolute depth by the depth calibration method. The specific steps are as follows:
- (1) Obtain the coordinates of the mask center point in the image from the EfficientVit-C model.
- (2) Read the relative depth at these coordinates from the depth map produced by the Depth Anything model.
- (3) Convert the relative depth at this point to absolute depth by the depth calibration method.
The original images (a,d), depth images (b,e), and mask images (c,f) are shown in Figure 4.
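The calibration in step (3) is not specified in detail; one common choice, shown here as a hedged sketch, is a linear fit in disparity space, $1/z = a \cdot \mathrm{rel} + b$, solved from two reference points with known absolute depth (the function names and readings below are hypothetical):

```python
# Hedged sketch of relative-to-absolute depth calibration. We assume
# the common linear fit in disparity space, 1/z = a*rel + b, solved
# from two points with known absolute depth (e.g. the channel floor
# at 3.0 m and a reference marker at the 2.0 m baseline).
def fit_disparity_calibration(rel1, z1, rel2, z2):
    """Solve 1/z = a*rel + b from two (relative, absolute) pairs."""
    a = (1.0 / z1 - 1.0 / z2) / (rel1 - rel2)
    b = 1.0 / z1 - a * rel1
    return a, b

def to_absolute(rel, a, b):
    """Convert a relative depth value to absolute depth in metres."""
    return 1.0 / (a * rel + b)

# Hypothetical readings: floor pixel rel=0.10 at 3.0 m,
# baseline marker rel=0.35 at 2.0 m.
a, b = fit_disparity_calibration(0.10, 3.0, 0.35, 2.0)
z = to_absolute(0.35, a, b)   # recovers 2.0 m at the marker
```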
2.3.2. Weight Adaptive Regression Network
To better balance accuracy and speed, this paper adopts the ResNet50 network, with its fully connected layer replaced by a regression output layer. The mask, enlarged or reduced according to the depth distance, and the corresponding true weight value are fed to the network, and the GHM-R [33] (gradient harmonizing mechanism regression) loss function is used. The number of iterations is 100, the optimizer is Adam, and the learning rate is 0.001. The network evaluation metrics are root mean square error (RMSE) and relative root mean square error (RRMSE).
The absolute error $E$ is the deviation between the estimated value $\hat{y}$ and the actual value $y$, reflecting the actual magnitude of the deviation. Its representation is as follows:

$$E = \hat{y} - y$$

The relative error $\delta$ is the ratio of the absolute error to the true value, reflecting the degree of deviation between the estimated value and the true value. Its representation is as follows:

$$\delta = \frac{\hat{y} - y}{y}$$

The root mean square error (RMSE) is the arithmetic square root of the expected squared absolute error $E^{2}$, measuring the overall deviation between the estimated and true values in the discrete case, and is represented as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}}$$

The relative root mean square error (RRMSE) is the arithmetic square root of the expected squared relative error $\delta^{2}$, measuring the overall relative deviation between the predicted and true values in the discrete case, and is represented as follows:

$$\mathrm{RRMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{y}_{i} - y_{i}}{y_{i}}\right)^{2}}$$
The inference flow chart of the cascaded model for the pig weight estimation task designed in this study is presented in Figure 5.
2.4. Experiments
For hardware, an NVIDIA Jetson NX with 16 GB of RAM was used as the inference platform. The operating system was Ubuntu 20.04, with JetPack 5.0.2 as the AI development environment; JetPack 5.0.2 ships with CUDA 11.4, TensorRT 8.4.1, and cuDNN 8.4.1. PyTorch 2.0.1 was used.
In the first set of experiments, EfficientVit was used as the baseline model for ablation experiments. Training used the augmented training set, and performance was evaluated on the test set. Models were trained for 100 epochs, and the evaluation metrics were precision, recall, and average precision (AP50).
In the second set of experiments, the datasets were trained and evaluated with the EfficientVit-C model proposed in this study. The model was compared with several image segmentation models: Mask R-CNN, SOLOv2, YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv8l-seg, and EfficientVit. The same training set was used for this evaluation.
In the third set of experiments, the ResNet50, Xception, MobileNetV4, DenseNet18, and ResNet101 models were compared. The dataset for this assessment comprised 200 images of 100 pigs with an average body weight of 95.0 kg.
4. Discussion
In this study, the EfficientVit model was enhanced to increase the accuracy and speed of image segmentation, and an innovative multi-model cascade method was proposed to complete the pig weight evaluation task with only one monocular camera.
In Experiment 1, the ablation study, model accuracy increased significantly after the CGA module was added. The CGA mechanism improves segmentation accuracy for the following reasons. First, feeding each head a different split of the features increases the diversity of the attention maps. Second, cascading the attention heads increases network depth, further boosting model capacity without introducing any additional parameters. Moreover, since each head's attention map is computed over a smaller QK channel dimension, the extra computational overhead is small.
Using overlapping patch embedding, replacing LN with BN, and applying structured pruning significantly increased both speed and accuracy, likely for the following reasons. Overlap PatchEmbed blocks and BN in place of LN reduce the amount of parameter computation. Structured pruning leaves important modules with enough channels to capture and learn rich representations in high-dimensional space, while redundant parameters in unimportant modules are removed. This design not only prevents loss of feature information during training but also improves inference speed. Similar conclusions were reached in a related study [34].
In Experiment 2, comparing EfficientVit-C with other models, EfficientVit-C achieved both the highest AP50 and the highest FPS, showing very good accuracy and speed.
In Experiment 3, the weight evaluation experiment, the ResNet50 model showed the best weight evaluation accuracy, owing to the better generalization performance of the ResNet architecture.
At present, the pig breeding industry generally requires an average deviation of less than 4 kg per 100 kg. We improved the network based on EfficientVit and devised a cascade of multiple models for pig weight estimation. In the pig exit channel scenario, with pigs moving freely, rigorous testing shows that the scheme's average deviation reaches 3.11 kg/100 kg, fully meeting the accuracy requirements of the pig breeding industry.
It is worth noting that this study uses only a monocular camera as the acquisition device, with no other complex equipment required. This not only greatly reduces the cost of actual deployment but also makes installation and maintenance much easier and faster. The results of this study therefore have broad application prospects in the breeding industry and promise more efficient and economical weighing solutions.
The experiments show that the improved EfficientVit-C effectively handles the segmentation difficulties caused by deformation and occlusion when pigs crowd together, and that the EfficientVit-C-based multi-model cascade meets the accuracy and speed requirements of pig weight estimation tasks. The proposed EfficientVit-C model still has some shortcomings: its segmentation in densely populated scenes (e.g., more than 80 pigs) needs improvement, and new feature extraction architectures (such as the recently open-sourced Mamba) deserve further study.