1. Introduction
At present, China's pig breeding industry is developing rapidly toward intelligent, large-scale production. The pork industry is closely tied to people's daily lives and health, and with the growing global population and rising demand for meat, efficient pig farming technology is particularly important. The pig weight evaluation addressed in this paper serves this purpose directly. From pig weight estimates, information such as the growth uniformity and health status of pig herds can be obtained in real time to formulate scientific breeding plans, match feed rationally, improve breeding efficiency, and reduce breeding costs [1,2,3]. In addition, for insurance and credit purposes, insurers and lenders can count pigs and assess their weight automatically each time a farm ships pigs out, and calculate the farm's income for risk assessment. Traditionally, a floor scale has been the main way to measure pig weight, but the process easily triggers a stress response in the pigs and is labor-intensive for workers. There is therefore an urgent need for a fast, accurate, non-invasive method of estimating pig weight to increase production efficiency, strengthen biosecurity measures, and improve pig health. With the development of deep learning in recent years, intelligent counting and weight estimation from camera images have been widely studied in animal husbandry [4,5,6,7].
Marchant [8] divided the target region of the image into several main parts and estimated mass from the areas of these parts. Based on the binocular vision principle, Li Zhuo [9] proposed a pig contour extraction algorithm based on depth images to estimate pig weight, addressing the susceptibility of natural images to contamination and lighting interference. Liu Pengfei [10] proposed a method for acquiring pig body size parameters and estimating weight based on a 3D point cloud algorithm. Kong Shangyu [11] estimated body weight with a ResNet regression network.
Our investigation identified four main kinds of pig weight evaluation methods. (1) Parameters such as body size and back area are extracted from 2D images, and body weight is estimated from a relationship model over these parameters; the average estimation error is 3.38~5.30% [12,13]. (2) A binocular or depth camera photographs the pig, a 3D image of the body is reconstructed, parameters such as back height and area are extracted from it, and weight is estimated from these parameters; the average estimation error is 2.26~3.30% [14]. (3) The back region of the pig image is fitted with an ellipse by the least-squares method, and body weight is estimated from a relationship model over the ellipse's centroid, major- and minor-axis lengths, area, and eccentricity; the average relative error is 3.0~3.8% [15]. (4) A grid slide is projected onto the pig's back, and weight is estimated after the body height and area are computed by the principle of stereoscopic projection [16], but this method is difficult to automate. The image processing pipeline of methods (2) and (3) is rather complicated: the pig image generally requires background removal, image enhancement, binarization, filtering and noise reduction, head and tail removal, and extraction of body size, volume, and other parameters. This takes a long time and is difficult to apply where acquisition speed and real-time performance matter, such as group feeding systems for fattening pigs and sow feeding stations [17].
Neural network algorithms for pig weight estimation still face several challenges: (1) Most studies rely on binocular cameras and rangefinders, which are costly and difficult to install and maintain; their cost and installation/maintenance effort are at least twice those of a monocular camera scheme. (2) Crowding causes pigs to occlude and squeeze one another, deforming their shapes and degrading the accuracy of image segmentation. (3) Image segmentation models are slow at inference, demand substantial computing power, and are expensive to deploy [18,19].
To address these problems, this study aimed to develop and validate an improved pig weight estimation algorithm based on the EfficientVit network [20] combined with a cascaded model inference method. Our hypothesis is that an accurate pig weight estimation algorithm can be built using only a single camera.
The time cost of memory access is a key factor limiting model inference speed. In ViT (Vision Transformer) [21], frequent reshape operations, element-wise addition, element-wise multiplication, and normalization all require accesses across different storage units, which markedly reduces memory access efficiency. Although this burden can be alleviated by simplifying the softmax self-attention mechanism, e.g., with ReLU [22] self-attention, sparse attention, or low-rank approximation, these approaches often sacrifice model accuracy. In addition, tensor reshaping in MHSA (multi-head self-attention) also seriously degrades memory access efficiency. Recent studies have shown that inefficient memory access operations are concentrated mainly in the MHSA layers, not the FFN (feedforward network) [23] layers. Therefore, properly adjusting the ratio between MHSA and FFN layers can significantly reduce memory access time while preserving model performance: reducing the proportion of MHSA operations and increasing the proportion of the more memory-efficient linear FFN operations is a viable optimization scheme. In this study, we replaced MHSA with the CGA (cascading group attention) module. The EfficientVit-C network improvements are as follows:
(1) MHSA (multi-head self-attention) modules in the EfficientVit network were replaced with CGA (cascading group attention) modules, increasing the usage ratio of linear FFN (feedforward network) layers while reducing that of multi-head self-attention.
(2) At the input, an Overlap PatchEmbed layer (overlapping patch embedding) directly downsamples the image by a factor of eight, and the whole model uses only three stage scales.
(3) LN (layer normalization) operations were uniformly replaced with BN (batch normalization), and model accuracy and parameter efficiency were improved through structured pruning.
In Section 2, we describe the construction of the dataset and the improved network, explain how the back area of the pig is obtained from the image, and show how pig weight is calculated. Experimental results are provided in Section 3. Finally, the discussion and conclusions are provided in Section 4 and Section 5.
2. Materials and Methods
2.1. Image Collection and Dataset Construction
The study was carried out on pigs at commercial farms. The data collection site was a 30,000-head pig farm in Ninghe, Tianjin, China. The pigs were aged 1 to 5 months, and data collection lasted 3 months. The collection scene was the pig exit channel, which measures 5.0 m × 1.2 m. A Hikvision DS-2CD3T25-I3 camera with a resolution of 1920 × 1080 pixels was used for image acquisition, installed 3 m above the middle of the channel. During filming, 1-5 pigs passed through the channel at a time. To ensure stable acquisition, all images were collected under natural light during the day (7 a.m. to 6 p.m.) and under heat-lamp lighting at night. To improve the robustness of the model across environments, we collected 4000 pictures of 800 pigs and divided the dataset into training, validation, and test sets at a ratio of 7:2:1.
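The 7:2:1 split can be sketched as follows; the file names, random seed, and per-image shuffling strategy are illustrative assumptions, not the paper's code:

```python
# Sketch of the 7:2:1 train/val/test split described above.
# File names and the seed are hypothetical placeholders.
import random

def split_dataset(image_paths, seed=42, ratios=(0.7, 0.2, 0.1)):
    """Shuffle and split a list of image paths into train/val/test."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# 4000 images, as in the dataset above -> 2800 / 800 / 400
train, val, test = split_dataset([f"img_{i:04d}.jpg" for i in range(4000)])
```

In practice one might instead split by pig ID so that near-duplicate frames of the same animal do not leak across subsets.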
To support the design of the weight estimation algorithm based on adaptive weight adjustment, ground-truth weights were measured with the scale in the farm's restraint stall. Since the pigs are active during measurement, the reading fluctuates slightly for some time in most cases; accurate values can be obtained by setting thresholds and applying smoothing filtering [11]. Pig weights in the dataset lie in the range [20, 130] kg.
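The thresholding and smoothing step can be sketched as follows; the window size and filter choice (a median filter) are assumptions for illustration, not taken from the paper:

```python
import statistics

def stable_weight(readings, lo=20.0, hi=130.0, window=5):
    """Reject out-of-range scale readings, then median-filter the rest.

    lo/hi follow the [20, 130] kg dataset range; the window size is an
    assumed value, not specified in the paper.
    """
    valid = [r for r in readings if lo <= r <= hi]
    if len(valid) < window:
        return None  # not enough stable samples yet
    # median of the last `window` in-range readings
    return statistics.median(valid[-window:])

# A pig shifting on the scale: the spike (250.0) and dropout (0.0)
# are rejected; the remaining readings are smoothed.
readings = [96.8, 97.4, 250.0, 96.9, 0.0, 97.1, 97.0]
print(stable_weight(readings))  # -> 97.0
```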
Images are captured by the remote-server-controlled camera whenever the shooting conditions are met. Captured images are named and saved by the rule "pig farm number + camera number + shooting time", and the data are uploaded every other day. Labelme (version 3.16.2) annotation software is used to annotate the images sent back to the server. For the counting task, the entire visible surface of the pig is labeled. For the weight assessment task, because the pig's head and tail swing freely and may not be captured by the camera, which easily causes errors, only the back of the pig (from the shoulder to the rump) is labeled. The dataset production process is illustrated in Figure 1.
2.2. Construction of the EfficientVit-C Model
2.2.1. CGA (Cascading Group Attention) Modules
MHSA improves model performance by embedding the input sequence into multiple attention heads and computing attention in each head in parallel. However, many attention heads are highly similar—they learn similar projections of the same complete features—which produces a large amount of redundant feature computation. Drawing on the design of group convolution, the CGA module splits the attention heads to save computation: in each attention subhead, the input is grouped before computing Q (query), K (key), and V (value). The formal formula is expressed as follows:

$$\widetilde{X}_{ij} = \mathrm{Attn}\left(X_{ij}W_{ij}^{Q},\; X_{ij}W_{ij}^{K},\; X_{ij}W_{ij}^{V}\right)$$

$$\widetilde{X}_{i+1} = \mathrm{Concat}\left[\widetilde{X}_{ij}\right]_{j=1:h} W_{i}^{P}$$

where $W_{ij}^{Q}$, $W_{ij}^{K}$, and $W_{ij}^{V}$ are the mapping layers that split the input features into different subspaces, and $W_{i}^{P}$ is a linear layer that maps the concatenated output features back to the input feature dimension. The output of each subattention head is added to the input of the next subattention head, thereby increasing model capacity and further improving feature diversity. The formal formula is expressed as follows:

$$X_{ij}' = X_{ij} + \widetilde{X}_{i(j-1)}, \quad 1 < j \le h$$

where $X_{ij}'$ is the input to the $j$-th head, formed by adding the output of the $(j-1)$-th head to the channel split $X_{ij}$.
Parameter reallocation strategy: using Taylor structured pruning, we set small channel dimensions for the Q and K projections in all stages of each head, while the V projection keeps a dimension matching the input embedding. This design exploits the key role of the V projection in information transmission. At the FFN level, we addressed parameter redundancy by reducing the expansion ratio from 4 to 2. This adjustment reduces model complexity and helps improve efficiency. Important modules retain enough channels to capture and learn rich representations in high-dimensional space, while redundant parameters in unimportant modules are removed. This design not only prevents loss of feature information during training but also improves the inference speed of the model.
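As a toy illustration of the cascading computation (a NumPy sketch with simplified, equal Q/K/V dimensions, not the paper's implementation): each head attends over its own channel slice of the input, and each head's output is added to the next head's input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascading_group_attention(X, Wq, Wk, Wv, Wp, heads):
    """Toy CGA. X: (n_tokens, d); Wq/Wk/Wv: per-head projection lists;
    Wp: (d, d) output projection. Shapes are illustrative only."""
    n, d = X.shape
    dh = d // heads
    outs = []
    carry = np.zeros((n, dh))
    for j in range(heads):
        # cascade: add the previous head's output to this head's slice
        Xj = X[:, j * dh:(j + 1) * dh] + carry
        q, k, v = Xj @ Wq[j], Xj @ Wk[j], Xj @ Wv[j]
        attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
        carry = attn @ v
        outs.append(carry)
    # concatenate head outputs and project back to the input dimension
    return np.concatenate(outs, axis=1) @ Wp

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
dh = d // h
X = rng.normal(size=(n, d))
Wq = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wk = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wv = [rng.normal(size=(dh, dh)) for _ in range(h)]
Wp = rng.normal(size=(d, d))
out = cascading_group_attention(X, Wq, Wk, Wv, Wp, h)
assert out.shape == (n, d)
```

In the actual module the Q and K projections use smaller channel dimensions than V, per the parameter reallocation strategy above.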
The multi-layer module architecture designed in this study is token interaction + linear FFN + cascading group attention + token interaction + linear FFN. The formal formula is expressed as follows:

$$X_{i+1} = \prod^{N}\Phi_{i}^{F}\left(\Phi_{i}^{A}\left(\prod^{N}\Phi_{i}^{F}\left(X_{i}\right)\right)\right)$$

where $X_i$ is the complete input feature of the $i$-th memory block, $\Phi_{i}^{A}$ is the cascading group attention layer, and $\prod^{N}\Phi_{i}^{F}$ denotes $N$ stacked token interaction + linear FFN layers. This multi-layer architecture reduces memory-access time and enables efficient communication between different feature channels. In addition, the token interaction layer uses depth-wise separable convolution, which better captures the inductive bias of local structure and enhances the model's ability to extract features. The internal structure of cascading group attention and EfficientViT-C's building block is illustrated in Figure 2.
2.2.2. Neural Network Structure Optimization
At the network input, we use an overlapping patch embedding layer to achieve 8× downsampling and keep the network efficient [18,19]. Only three stages are used, which greatly improves operating efficiency. For normalization, we use batch normalization (BN) instead of layer normalization (LN), because BN can be folded into a preceding convolution or linear layer, giving it a significant run-time advantage over LN [24].
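The folding argument can be verified numerically. A minimal NumPy sketch (with made-up statistics, not the paper's code) shows that a linear layer followed by inference-time BN is exactly one linear layer with rescaled weights and a shifted bias:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 4
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)
# Fixed BN running statistics and affine parameters (inference time)
mean, var = rng.normal(size=d_out), rng.uniform(0.5, 2.0, size=d_out)
gamma, beta = rng.normal(size=d_out), rng.normal(size=d_out)
eps = 1e-5

x = rng.normal(size=(3, d_in))
scale = gamma / np.sqrt(var + eps)

# Linear layer followed by BN
y_ref = (x @ W + b - mean) * scale + beta
# Folded: a single linear layer, so BN costs nothing at run time
W_fold = W * scale                 # per-output-channel rescaling
b_fold = (b - mean) * scale + beta
y_fold = x @ W_fold + b_fold

assert np.allclose(y_ref, y_fold)
```

LN has no such fold, because its statistics are computed per sample at run time rather than fixed.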
For the activation function, we adopted ReLU: compared with the commonly used SiLU [25] or GELU [26], ReLU is not only faster but also better supported on many inference deployment platforms [27,28]. This choice adds practicality and stability to our model. Finally, we pruned the whole network with the structured pruning method. The overall network structure of the proposed EfficientViT-C is illustrated in Figure 3.
2.3. Model Structure for Pig Weight Estimation
2.3.1. Monocular Depth Estimation Network
Under perspective imaging (near objects appear larger, far objects smaller), differences in pig height distort the target size in the captured image and thereby introduce errors into weight estimation. To reduce such errors, we must first obtain a depth image of the pig. With depth calibration, the pig's depth can be determined accurately, and the pig target mask segmented by the EfficientVit-C model is then scaled up or down accordingly.
In our experimental setting, the camera height is 3 m and a pig height of 1 m is taken as the benchmark, so a pig directly below the camera at a distance of 2 m from it defines the baseline distance. When the pig-to-camera distance exceeds 2 m, the target mask is magnified; when it is less than 2 m, the mask is shrunk. In this way, we obtain a target mask closer to the pig's true size.
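The scaling rule can be sketched as follows; assuming apparent size varies inversely with distance (our reading of the text, not code from the paper), the mask area is rescaled by the squared ratio of actual to baseline distance:

```python
# Perspective correction sketch: apparent linear size scales as
# 1/distance, so a mask of a pig at distance d is rescaled by d/d_ref
# to match the 2 m baseline. Function name and values are illustrative.
D_REF = 2.0  # baseline pig-to-camera distance in metres

def corrected_area(mask_area_px, distance_m, d_ref=D_REF):
    """Rescale a segmented mask area to the reference distance.

    Area goes as the square of the linear scale factor.
    """
    scale = distance_m / d_ref
    return mask_area_px * scale ** 2

# A pig 2.5 m away looks smaller than at the 2 m baseline: magnify.
print(corrected_area(10_000, 2.5))   # -> 15625.0
# A pig 1.6 m away looks larger: shrink (about 6400 px).
print(corrected_area(10_000, 1.6))
```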
Monocular depth estimation uses a single camera to estimate the relative distance between each pixel of an RGB image and the camera. The key is how the network captures and analyzes depth information in the image and quantifies distances through depth calibration. The Depth Anything [29] network adopts a feature alignment loss (cosine embedding loss, L2 loss, etc.) to make the feature space highly continuous, and uses this semantically continuous feature information to exploit semantic priors more effectively, greatly improving the accuracy of monocular depth estimation. The feature alignment loss is as follows, where $\cos(\cdot,\cdot)$ is the cosine similarity between two feature vectors:

$$\mathcal{L}_{\mathrm{feat}} = 1 - \frac{1}{HW}\sum_{i=1}^{HW}\cos\left(f_{i}, f_{i}'\right)$$

where $f_i$ and $f_i'$ are the corresponding feature vectors of the two models at pixel $i$, and $H$ and $W$ are the feature map height and width.
Depth Anything uses the pre-trained weights of the DINOv2 [30] model and, following Segment Anything [31], builds a large-scale data engine over unlabeled images. Distant vista regions (e.g., sky) are detected by a pre-trained semantic segmentation model, and their relative depth (disparity) is set to 0, the farthest value. The Depth Anything model uses the fine-tuned ZoeDepth [32] framework as the depth model that predicts relative depth. The relative depth must then be converted to absolute depth by the depth calibration method. The specific steps are as follows:
- (1) Obtain the coordinates of the mask center point in the image from the EfficientVit-C model.
- (2) Read the relative depth at these coordinates from the depth map produced by the Depth Anything model.
- (3) Convert the relative depth at this point to absolute depth by the depth calibration method.
The original images (a,d), depth images (b,e), and mask images (c,f) are shown in Figure 4.
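The calibration in step (3) is not specified in detail; one common choice, shown here as a hedged sketch, is a linear fit in disparity space, $1/z = a \cdot \mathrm{rel} + b$, solved from two reference points with known absolute depth (the function names and readings below are hypothetical):

```python
# Hedged sketch of relative-to-absolute depth calibration. We assume
# the common linear fit in disparity space, 1/z = a*rel + b, solved
# from two points with known absolute depth (e.g. the channel floor
# at 3.0 m and a reference marker at the 2.0 m baseline).
def fit_disparity_calibration(rel1, z1, rel2, z2):
    """Solve 1/z = a*rel + b from two (relative, absolute) pairs."""
    a = (1.0 / z1 - 1.0 / z2) / (rel1 - rel2)
    b = 1.0 / z1 - a * rel1
    return a, b

def to_absolute(rel, a, b):
    """Convert a relative depth value to absolute depth in metres."""
    return 1.0 / (a * rel + b)

# Hypothetical readings: floor pixel rel=0.10 at 3.0 m,
# baseline marker rel=0.35 at 2.0 m.
a, b = fit_disparity_calibration(0.10, 3.0, 0.35, 2.0)
z = to_absolute(0.35, a, b)   # recovers 2.0 m at the marker
```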
2.3.2. Weight Adaptive Regression Network
To better balance accuracy and speed, this paper adopts the ResNet50 network, with its fully connected layer replaced by a regression output layer. The mask, enlarged or reduced according to the depth distance, and the corresponding true weight value are fed to the network, and the GHM-R [33] (gradient harmonizing mechanism regression) loss function is used. The number of iterations is 100, the optimizer is Adam, and the learning rate is 0.001. The network evaluation metrics are root mean square error (RMSE) and relative root mean square error (RRMSE).
The absolute error $E$ is the deviation between the estimated value $\hat{y}$ and the actual value $y$, reflecting the actual magnitude of the deviation. Its representation is as follows:

$$E = \hat{y} - y$$

The relative error $\delta$ is the ratio of the absolute error to the true value, reflecting the degree of deviation between the estimated value and the true value. Its representation is as follows:

$$\delta = \frac{\hat{y} - y}{y}$$

The root mean square error (RMSE) is the arithmetic square root of the expected squared absolute error $E^{2}$, measuring the overall deviation between the estimated and true values in the discrete case, and is represented as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_{i} - y_{i}\right)^{2}}$$

The relative root mean square error (RRMSE) is the arithmetic square root of the expected squared relative error $\delta^{2}$, measuring the overall relative deviation between the predicted and true values in the discrete case, and is represented as follows:

$$\mathrm{RRMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\hat{y}_{i} - y_{i}}{y_{i}}\right)^{2}}$$
The inference flow chart of the cascaded model for the pig weight estimation task designed in this study is presented in Figure 5.
2.4. Experiments
For hardware, an NVIDIA Jetson NX with 16 GB of RAM was used as the inference platform. The operating system was Ubuntu 20.04, with JetPack 5.0.2 as the AI development environment; JetPack 5.0.2 ships with CUDA 11.4, TensorRT 8.4.1, and cuDNN 8.4.1. PyTorch 2.0.1 was used.
In the first set of experiments, EfficientVit was used as the baseline model for ablation experiments. Training used the augmented training set, and performance was evaluated on the test set. Models were trained for 100 epochs, and the evaluation metrics were precision, recall, and average precision (AP50).
In the second set of experiments, the datasets were trained and evaluated with the EfficientVit-C model proposed in this study. The model was compared with several image segmentation models: Mask R-CNN, SOLOv2, YOLOv8n-seg, YOLOv8s-seg, YOLOv8m-seg, YOLOv8l-seg, and EfficientVit. The same training set was used for this evaluation.
In the third set of experiments, the ResNet50, Xception, MobileNetV4, DenseNet18, and ResNet101 models were compared. The dataset for this assessment comprised 200 images of 100 pigs with an average body weight of 95.0 kg.
4. Discussion
In this study, the EfficientVit model was enhanced to increase the accuracy and speed of image segmentation, and an innovative multi-model cascade method was proposed to complete the pig weight evaluation task with only one monocular camera.
In Experiment 1, the ablation study, model accuracy increased significantly after the CGA module was added. The CGA mechanism improves segmentation accuracy for the following reasons. First, feeding each head a different split of the features increases the diversity of the attention maps. Second, cascading the attention heads increases network depth, further boosting model capacity without introducing any additional parameters. Moreover, since each head's attention map is computed over a smaller QK channel dimension, the extra computational overhead is small.
Using overlapping patch embedding, replacing LN with BN, and applying structured pruning significantly increased both speed and accuracy, likely for the following reasons. Overlap PatchEmbed blocks and BN in place of LN reduce the amount of parameter computation. Structured pruning leaves important modules with enough channels to capture and learn rich representations in high-dimensional space, while redundant parameters in unimportant modules are removed. This design not only prevents loss of feature information during training but also improves inference speed. Similar conclusions were reached in a related study [34].
In Experiment 2, comparing EfficientVit-C with other models, EfficientVit-C achieved both the highest AP50 and the highest FPS, showing very good accuracy and speed.
In Experiment 3, the weight evaluation experiment, the ResNet50 model showed the best weight evaluation accuracy, owing to the better generalization performance of the ResNet architecture.
At present, the pig breeding industry generally requires an average deviation of less than 4 kg per 100 kg. We improved the network based on EfficientVit and devised a cascade of multiple models for pig weight estimation. In the pig exit channel scenario, with pigs moving freely, rigorous testing shows that the scheme's average deviation reaches 3.11 kg/100 kg, fully meeting the accuracy requirements of the pig breeding industry.
It is worth noting that this study uses only a monocular camera as the acquisition device, with no other complex equipment required. This not only greatly reduces the cost of actual deployment but also makes installation and maintenance much easier and faster. The results of this study therefore have broad application prospects in the breeding industry and promise more efficient and economical weighing solutions.
The experiments show that the improved EfficientVit-C effectively handles the segmentation difficulties caused by deformation and occlusion when pigs crowd together, and that the EfficientVit-C-based multi-model cascade meets the accuracy and speed requirements of pig weight estimation tasks. The proposed EfficientVit-C model still has some shortcomings: its segmentation in densely populated scenes (e.g., more than 80 pigs) needs improvement, and new feature extraction architectures (such as the recently open-sourced Mamba) deserve further study.