Article

SheepFaceNet: A Speed–Accuracy Balanced Model for Sheep Face Recognition

College of Information Engineering, Northwest A&F University, Xianyang 712100, China
*
Author to whom correspondence should be addressed.
Animals 2023, 13(12), 1930; https://doi.org/10.3390/ani13121930
Submission received: 19 May 2023 / Revised: 6 June 2023 / Accepted: 6 June 2023 / Published: 9 June 2023
(This article belongs to the Section Animal System and Management)


Simple Summary

Computer vision has made individual sheep identification more effective, but existing methods suffer from large parameter counts, slow recognition, and difficult deployment. To address these problems, we improved and optimized the RetinaFace face recognition model to build a sheep face recognition model that balances speed and accuracy. The optimized model has fewer parameters, simpler computations, faster inference, and higher recognition accuracy, making it well suited to deployment on resource-constrained edge devices. This research is expected to promote the application of deep-learning-based sheep face recognition methods in production.

Abstract

The recognition of sheep faces based on computer vision has improved the efficiency and effectiveness of individual sheep identification, providing technical support for the development of smart farming. However, current recognition models have problems such as large parameter sizes, slow recognition speed, and difficult deployment. Therefore, this paper proposes an efficient and fast basic module called Eblock and uses it to build a lightweight sheep face recognition model called SheepFaceNet, which achieves the best balance between speed and accuracy. SheepFaceNet includes two modules: SheepFaceNetDet for detection and SheepFaceNetRec for recognition. SheepFaceNetDet uses Eblock to construct the backbone network to enhance feature extraction capability and efficiency, designs a bidirectional FPN layer (BiFPN) to enhance geometric location ability, and optimizes the network structure, which affects inference speed, to achieve fast and accurate sheep face detection. SheepFaceNetRec uses Eblock to construct the feature extraction network, uses ECA channel attention to improve the effectiveness of feature extraction, and uses multi-scale feature fusion to achieve fast and accurate sheep face recognition. On our self-built sheep face dataset, SheepFaceNet recognized 387 sheep face images per second with an accuracy rate of 97.75%, achieving an advanced balance between speed and accuracy. This research is expected to further promote the application of deep-learning-based sheep face recognition methods in production.

1. Introduction

Sheep play a crucial role in animal husbandry and are essential for agricultural production and the human diet. Traditional sheep individual identification methods usually use artificial marking, whereby a tag or magnetic disk is attached to the sheep’s ear, and identity confirmation is done through handheld readers. However, this method has many problems, such as easy loss, damage, and impersonation of tags [1], and it is difficult for the approach to meet the high load management requirements of large-scale farms [2]. In addition, long-term wearing of tags may also cause physical and psychological harm to sheep [3].
In contrast, sheep face recognition is contactless, non-invasive, rapid, and accurate, which avoids the problems of artificial marking and improves breeding efficiency and production benefits. Deep-learning-based sheep face recognition mainly uses convolutional neural network architectures [4] and training methods to learn features and patterns from a large number of sheep face images for automatic recognition and classification. As in human face recognition, the process has two parts: sheep face detection and sheep face recognition. Detection determines whether an image contains a sheep face, estimates the size and location of the face, and crops the whole face area for subsequent recognition. Recognition extracts features through a convolutional neural network, represents the sheep face as a vector, compares it with the sheep faces stored in a database, finds the one with the highest similarity, and returns the identity of the sheep. For example, [5,6,7,8,9,10] improved the efficiency and effectiveness of sheep face recognition using neural-network-based methods with high recognition accuracy. However, they use heavyweight neural networks for detection and recognition, which leads to a large number of parameters, complex computations, and slow inference, and makes deployment on resource-limited edge devices impractical. To address this issue, [11,12,13] designed lightweight feature extraction networks to reduce the number of model parameters and the computational complexity for deployment on edge devices.
Although these works effectively reduce the number of model parameters and complexity, related studies [14] show that reducing the number of parameters and the computational complexity does not necessarily improve inference speed, so such models may still fail to meet the real-time requirements of actual production. Accordingly, [15,16,17,18,19,20] designed real-time recognition models that effectively improve recognition speed; however, their recognition accuracy still needs to be improved. A recognition model that strikes a good balance between recognition accuracy and recognition speed is therefore urgently needed.
We propose a computationally efficient and fast basic module, Eblock. Eblock uses inexpensive linear operations to complement redundant features and employs reparameterization during inference to reduce model parameters and calculation complexity, accelerating model inference. Based on this, we built a lightweight sheep face recognition model, SheepFaceNet, achieving the best speed–accuracy trade-off. SheepFaceNet consists of a detection module, SheepFaceNetDet, and a recognition module, SheepFaceNetRec. SheepFaceNetDet uses Eblock to enhance feature extraction efficiency and speed, designs a BiFPN layer to enhance model geometric positioning ability, and optimizes the network structure, which affects inference speed, achieving fast and accurate sheep face detection. SheepFaceNetRec uses Eblock to construct a feature extraction network, employs ECA channel attention to improve the effectiveness of feature extraction, and uses multi-scale feature fusion to achieve fast and accurate sheep face recognition. On our self-built sheep face dataset, SheepFaceNet recognized 387 sheep face images per second with a recognition accuracy of 97.75%, achieving the state-of-the-art speed–accuracy trade-off. This research is expected to further promote the application of deep-learning-based sheep face recognition methods in production.
Our contributions are summarized as follows:
(1) Proposing the efficient and fast basic module Eblock;
(2) Proposing the lightweight sheep face recognition model SheepFaceNet;
(3) Achieving the state-of-the-art speed–accuracy trade-off using SheepFaceNet.

2. Materials and Methods

2.1. Dataset

The sheep face image data used in this paper come from the Yanchi Tan sheep breeding base in Ningxia, China. A total of 95 segments of sheep face videos were obtained by tracking and shooting the sheep faces in different environments such as indoors, outdoors, in different lighting, at different distances, and at different heights. The videos cover 78 dairy goats and 61 Tan sheep, with a resolution of 1920 × 1080 pixels. These videos were then divided into frames using FFmpeg, and the SSIM algorithm [21] was used to delete images with high similarity in order to avoid too much interference in the network training.
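The similarity-based frame filtering described above can be sketched as follows. This is a minimal illustration, assuming a single-window (global) SSIM computed on grayscale frames stored as flat lists of pixel values, and a hypothetical threshold of 0.9; the paper states neither the threshold nor the SSIM variant that was actually used.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale frames.

    `a` and `b` are flat sequences of pixel intensities. This global form
    is a simplification of the usual sliding-window SSIM.
    """
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((p - mu_a) ** 2 for p in a) / n
    var_b = sum((q - mu_b) ** 2 for q in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def drop_near_duplicates(frames, threshold=0.9):
    """Keep a frame only if it differs enough from the last kept frame.

    `threshold` is a hypothetical value chosen for illustration.
    """
    kept = []
    for frame in frames:
        if not kept or global_ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

Comparing each frame with the last kept frame, rather than with its immediate predecessor, prevents a slow drift across many nearly identical frames from passing the filter unnoticed.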
A total of 7328 images were obtained, and these raw datasets were divided into a training set, validation set, and test set in a ratio of 6:3:1. To expand the dataset and improve the model’s generalization ability, the ImageEnhance module in the PIL library of Python was used to augment the original dataset. The specific augmentation methods include adjustment of brightness, contrast, rotation, occlusion, random cropping, etc., and the enhanced examples are shown in Figure 1. After data augmentation, the training set contained a total of 16,812 images.
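The 6:3:1 split can be sketched as below. This is only an illustration under stated assumptions: the file names are hypothetical, the shuffle seed is arbitrary, and the paper does not say whether the split was random or stratified per animal.

```python
import random


def split_dataset(items, ratios=(6, 3, 1), seed=0):
    """Shuffle `items` and split them into train/val/test by integer ratios."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 7328 extracted frames.
image_names = [f"frame_{i:05d}.jpg" for i in range(7328)]
train, val, test = split_dataset(image_names)
print(len(train), len(val), len(test))
```

For identification tasks it is often preferable to split per individual or per video so that near-identical frames do not end up in both training and test sets; whether the authors did so is not stated.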
To more accurately detect sheep faces, the Labelme tool [22] was used in this paper to annotate the sheep faces in the obtained images. When annotating, referring to the production method of the face recognition dataset, the main facial area of the sheep face was selected as the detection box, and key points, such as the sheep’s eyes and nose, were labeled, as shown in Figure 2.

2.2. Efficient and Fast Basic Module EBlock

To reduce the number of parameters and computational complexity of the model and improve the inference speed, this paper proposes an efficient and fast basic module called Eblock. Deep convolutional neural networks typically consist of many convolutional layers, which leads to a huge computational cost. Some recent works, such as MobileNet [23] and ShuffleNet [24], have introduced depthwise convolution or shuffle operations to build efficient convolutional neural networks using smaller convolution kernels, but the remaining 1 × 1 convolution layers still consume a considerable amount of memory and parameters. Assuming the input data is $X \in \mathbb{R}^{c \times h \times w}$, where $c$, $h$, and $w$ represent the number of channels, height, and width, respectively, the operation that generates $n$ feature maps can be represented as:
$$Y = X * F + B \tag{1}$$
In the equation, $*$ represents the convolution operation, $B$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, and $h'$ and $w'$ are the height and width, respectively, of the output. $F \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution kernel in this layer and $k \times k$ is the size of $F$. In this convolution process, the number of FLOPs is $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$. The number of convolution kernels $n$ and the number of channels $c$ are usually very large, resulting in a large quantity of FLOPs, as shown in Figure 3a.
Mainstream convolutional neural networks usually generate redundancy when computing feature maps, and this redundancy is necessary for the performance of the network. However, related research [14] has shown that it is not necessary to use a large number of parameters and FLOPs to generate these redundant feature maps one by one. Simple linear transformations can generate redundant feature maps from intrinsic feature maps. The process of generating $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ is shown in Equation (2), where $m$ is far less than $n$. These maps are generated by a primary convolution with the same kernel settings as in Equation (1). Here, $F' \in \mathbb{R}^{c \times k \times k \times m}$ is the convolution kernel used.
$$Y' = X * F' \tag{2}$$
The process of using inexpensive linear transformations to generate $n$ redundant feature maps from $m$ intrinsic feature maps is shown below. Here, $y'_i$ is the $i$-th intrinsic feature map in $Y'$, and $\Phi_{i,j}$ is the $j$-th linear transformation, used to generate the $j$-th feature map $y_{ij}$ in Equation (3). Using Equation (3), we can obtain $n = m \cdot s$ feature maps $Y = [y_{11}, y_{12}, \ldots, y_{ms}]$. The linear transformations are performed on each channel, and their computational cost is much lower than that of ordinary convolutions. The structure is shown in Figure 3b.
$$y_{ij} = \Phi_{i,j}(y'_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, s \tag{3}$$
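To make the saving concrete, the cost of an ordinary convolution can be compared with the cost of producing the same number of maps through a smaller primary convolution plus cheap per-channel transforms. The sketch below is illustrative only: the layer sizes are made up, and the cost model for the cheap transforms (a d × d per-channel kernel, with one of the s maps being the intrinsic map itself) is an assumption borrowed from the GhostNet-style formulation rather than something stated in the text.

```python
def conv_flops(n, h_out, w_out, c, k):
    """Multiply-accumulate count of an ordinary convolution: n*h'*w'*c*k*k."""
    return n * h_out * w_out * c * k * k


def cheap_feature_flops(n, h_out, w_out, c, k, s, d=3):
    """Cost of producing n maps from m = n / s intrinsic maps.

    The primary convolution produces m maps; each intrinsic map then gets
    (s - 1) cheap d x d per-channel linear transforms, the s-th map being the
    intrinsic map itself. The identity shortcut and the kernel size d are
    assumptions, not values taken from the paper.
    """
    if n % s:
        raise ValueError("n must be divisible by s")
    m = n // s
    primary = conv_flops(m, h_out, w_out, c, k)
    transforms = (s - 1) * m * h_out * w_out * d * d
    return primary + transforms


# Hypothetical layer: 64 -> 128 channels, 3 x 3 kernel, 40 x 40 output.
ordinary = conv_flops(n=128, h_out=40, w_out=40, c=64, k=3)
ghost_like = cheap_feature_flops(n=128, h_out=40, w_out=40, c=64, k=3, s=2)
print(f"ordinary: {ordinary:,}  cheap: {ghost_like:,}  "
      f"ratio: {ordinary / ghost_like:.2f}")
```

With s = 2 the cost roughly halves, because the primary convolution dominates and the per-channel transforms are comparatively negligible.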
The above operation can significantly reduce the number of model parameters and computational complexity, but it does not significantly improve the inference speed of the model. Therefore, this paper proposes to reparameterize the process described in Equation (3) to improve the inference speed and build an efficient and fast basic module, Eblock. Its structure is shown in Figure 4.
Structural reparameterization technology [26,27] is an effective neural network technology that decouples the training phase and the inference phase. In the training phase, for a given backbone network, reparameterization technology increases the model’s representational power by adding multiple branches or specific layers with various neural network components to the backbone network. In the inference phase, the added branches or layers can be merged into the parameters of the backbone network through some equivalent transformations, which can significantly reduce the number of parameters or computational costs without affecting performance and accelerate inference.
The reparameterization of the process described in Equation (3) is shown in Figure 5. During training, for a convolution layer with kernel size $K \in \{1, 3\}$, input channel dimension $C_{in}$, and output channel dimension $C_{out}$, the weight matrix can be represented as $W' \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$, and the bias is represented as $B' \in \mathbb{R}^{D}$. The BatchNorm (BN) layer contains the accumulated mean $\mu$, accumulated standard deviation $\sigma$, bias $\beta$, and scaling factor $\gamma$. Since convolution and BN are linear operations during inference, they can be merged, and the corresponding weight is $\hat{W} = W' \cdot \frac{\gamma}{\sigma}$ and the bias is $\hat{B} = (B' - \mu) \cdot \frac{\gamma}{\sigma} + \beta$. For skip connections, BN is merged into a 1 × 1 identity kernel, which is then zero-padded. After merging BN into each branch, the final weight matrix is $W = \sum_{i}^{M} \hat{W}_i$ and the bias is $B = \sum_{i}^{M} \hat{B}_i$, where $M$ is the number of network branches. In this way, the model’s parameters and computational costs are greatly reduced, and inference is accelerated.
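The merging rule can be checked numerically on a toy case. The sketch below is not the authors’ implementation: it assumes a single input and output channel, three branches (3 × 3 convolution, 1 × 1 convolution, and identity, each followed by BN), and arbitrary BN statistics, and it treats σ as the accumulated standard deviation. It verifies that the sum of the training-time branches equals one merged 3 × 3 convolution.

```python
import random


def conv3x3(x, w, b):
    """Zero-padded, stride-1 3x3 convolution on a single-channel 2D list."""
    h, wd = len(x), len(x[0])
    out = []
    for i in range(h):
        row = []
        for j in range(wd):
            acc = b
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < wd:
                        acc += w[di][dj] * x[ii][jj]
            row.append(acc)
        out.append(row)
    return out


def batch_norm(x, mu, sigma, gamma, beta):
    """Inference-time BN with accumulated mean mu and standard deviation sigma."""
    return [[(v - mu) * gamma / sigma + beta for v in row] for row in x]


def fuse(w, b, mu, sigma, gamma, beta):
    """Fold BN into the convolution: W^ = W*gamma/sigma, B^ = (B-mu)*gamma/sigma + beta."""
    scale = gamma / sigma
    return [[v * scale for v in row] for row in w], (b - mu) * scale + beta


def pad_1x1(w1):
    """Embed a 1x1 kernel at the centre of a zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, w1, 0.0], [0.0, 0.0, 0.0]]


rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(5)]

# Each branch is (kernel, bias, (mu, sigma, gamma, beta)); values are arbitrary.
branches = [
    ([[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)], 0.1,
     (0.2, 1.5, 0.9, -0.3)),                      # 3x3 branch
    (pad_1x1(0.7), -0.2, (0.0, 0.8, 1.1, 0.05)),  # 1x1 branch
    (pad_1x1(1.0), 0.0, (-0.1, 1.2, 0.6, 0.4)),   # identity branch
]

# Training-time output: sum of the BN-normalised branch outputs.
train_out = [[0.0] * 5 for _ in range(5)]
for w, b, bn in branches:
    y = batch_norm(conv3x3(x, w, b), *bn)
    train_out = [[t + v for t, v in zip(tr, yr)] for tr, yr in zip(train_out, y)]

# Inference-time: fold BN into each branch, then sum kernels and biases.
merged_w = [[0.0] * 3 for _ in range(3)]
merged_b = 0.0
for w, b, bn in branches:
    fw, fb = fuse(w, b, *bn)
    merged_w = [[m + v for m, v in zip(mr, fr)] for mr, fr in zip(merged_w, fw)]
    merged_b += fb
infer_out = conv3x3(x, merged_w, merged_b)

max_diff = max(abs(a - c) for ra, rc in zip(train_out, infer_out)
               for a, c in zip(ra, rc))
print(f"max |train - inference| = {max_diff:.2e}")
```

The difference is at floating-point rounding level, which is why the merged single-branch layer can replace the multi-branch structure at inference without changing the output.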

2.3. SheepFaceNet

SheepFaceNet is a sheep face recognition model that follows the general process of sheep face recognition and consists of two parts: SheepFaceNetDet for sheep face detection and SheepFaceNetRec for recognition, as shown in Figure 6. SheepFaceNetDet is trained on a sheep face detection dataset; it extracts the sheep face area from the input image and prepares it for subsequent recognition. SheepFaceNetRec uses the extracted sheep face area for sheep face recognition and outputs the identity of the sheep.

2.4. Sheep Face Detection Model, SheepFaceNetDet

SheepFaceNetDet inherits the structure of RetinaFace [28], a widely used facial detection model, including three parts: feature extraction backbone network, BiFPN module, and SSH feature enhancement module. The feature extraction backbone network performs spatial feature extraction on the input sheep face image and obtains feature maps of different scales in the last three stages. Then, BiFPN is used to fuse detailed information and semantic information, and finally, SSH is used to enhance the receptive field. Its structure is shown in Figure 7.

2.4.1. Feature Extraction Network Based on Eblock

MobilenetV1-0.25 is a lightweight backbone network used by RetinaFace, which uses depthwise separable convolution to reduce the number of model parameters and improve inference speed. It is widely used in object detection tasks. However, its relatively weak feature extraction capability and insufficient perception of local structures reduce detection accuracy. For example, in the original RetinaFace paper, the AP of the RetinaFace face detection model with ResNet50 as the feature extraction network is 96.94%, while the AP of the detection model with MobileNet 0.25 as the main feature extraction network is only 78.2%. Eblock differs from MobilenetV1’s Inverted Residual Block in two ways. First, Eblock supplements redundant features through inexpensive linear operations to reduce network computations. Specifically, Eblock performs pointwise convolution and depthwise convolution on the input tensor and concatenates the two results along the channel dimension as the final output. This approach is exactly the opposite of the Inverted Residual Block. Second, Eblock reparameterizes the pointwise and depthwise convolutions, effectively improving the model’s inference speed. Through these operations, the feature extraction network constructed from Eblock has higher computational efficiency, better feature extraction performance, and faster inference speed. The structure of the feature extraction network based on Eblock is shown in Table 1. The SheepFaceNetDet detection model receives 640 × 640 sheep face images, reduces the feature map size through downsampling in five stages while increasing the number of channels, and takes the feature maps $P_3$, $P_4$, and $P_5$ from the last three stages as the output for sheep face detection.
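The spatial sizes of the three output feature maps follow from the input size and the number of downsampling stages. The short sketch below assumes that every stage halves the resolution (stride 2), which the text does not state explicitly.

```python
def stage_sizes(input_size=640, num_stages=5, stride_per_stage=2):
    """Spatial size after each downsampling stage (stride 2 per stage assumed)."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= stride_per_stage
        sizes.append(size)
    return sizes


sizes = stage_sizes()
p3, p4, p5 = sizes[-3:]  # the last three stages feed the detection head
print(sizes, "->", (p3, p4, p5))
```

Under this assumption the three maps are 80 × 80, 40 × 40, and 20 × 20, which agrees with the 80 × 80 and 40 × 40 maps mentioned in the time-consumption analysis.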

2.4.2. BiFPN

The original RetinaFace face detection model uses FPN for multi-scale feature fusion. However, this top-down feature fusion method gradually fuses the semantic information of deep feature maps into shallow ones, which does not significantly improve the geometric localization ability of each feature layer. High-level neurons usually respond strongly to entire objects, while low-level neurons are more easily activated by local textures and patterns. Therefore, the FPN network uses a top-down branch to propagate high-level semantic features to lower layers so that all feature maps have classification ability. In the FPN network, a proposal is assigned to a specific feature level based on its size: large proposals are assigned to higher feature levels and small ones to lower feature levels. Although this approach is simple, it can cause errors. For example, two similar proposals that differ by only 10 pixels may be assigned to different feature levels, which can affect the recognition results. Therefore, this paper improves the FPN of the original RetinaFace detection model by adopting the bidirectional feature fusion structure described in [29]. After the unidirectional top-down feature fusion layer, a bottom-up feature fusion layer is added so that low-level features with precise details and position information can help locate large proposals, enhancing the model’s geometric localization ability. As shown in Figure 7, the feature maps $N_3$, $N_4$, and $N_5$, which carry both rich semantic information and detail information after BiFPN, are input into the SSH module for final sheep face detection, further improving detection accuracy.
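The data flow of the two fusion passes can be illustrated schematically. The sketch below is deliberately simplified and is not the model’s actual layer definition: it assumes single-channel maps, nearest-neighbour upsampling, 2 × 2 average pooling for downsampling, and plain element-wise addition in place of the learned convolutions a real BiFPN layer would use. Tiny 8 × 8, 4 × 4, and 2 × 2 maps stand in for the three pyramid levels.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fm):
    """2x2 average pooling with stride 2 (even-sized maps only)."""
    return [
        [(fm[i][j] + fm[i][j + 1] + fm[i + 1][j] + fm[i + 1][j + 1]) / 4.0
         for j in range(0, len(fm[0]), 2)]
        for i in range(0, len(fm), 2)
    ]


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bidirectional_fuse(p3, p4, p5):
    # Top-down pass: push semantic information from deep to shallow levels.
    t5 = p5
    t4 = add(p4, upsample2x(t5))
    t3 = add(p3, upsample2x(t4))
    # Bottom-up pass: push localisation detail from shallow to deep levels.
    n3 = t3
    n4 = add(t4, downsample2x(n3))
    n5 = add(t5, downsample2x(n4))
    return n3, n4, n5


def constant_map(size, value):
    return [[float(value)] * size for _ in range(size)]


n3, n4, n5 = bidirectional_fuse(
    constant_map(8, 1), constant_map(4, 10), constant_map(2, 100)
)
print(n3[0][0], n4[0][0], n5[0][0])
```

After both passes every output level contains contributions from all three inputs, which is the property the bottom-up path adds over a purely top-down FPN.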

2.4.3. Time-Consumption Analysis

To further improve the detection speed of the SheepFaceNetDet model, this paper analyzes the time consumption of all network structures in RetinaFace, using FLOPs as the measurement indicator. To a first approximation, inference time grows with the number of FLOPs, so a larger number of FLOPs means a longer inference time. Although factors such as model parallelism and memory access can also affect inference speed, this paper uses FLOPs to roughly measure the time consumption of each component of the model while keeping these factors as constant as possible. FLOPs are measured with ptflops.get_model_complexity_info [30]. The experimental results are shown in Figure 8.
It can be seen that the feature extraction network has the highest time consumption, accounting for 39.00% of the total time consumption of the original RetinaFace model, while FPN and SSH [31] account for 36.80% and 23.87%, respectively, and the loss function only accounts for 0.33%. Further analysis found that the feature fusion operation in FPN took the longest time; for example, when fusing the 80 × 80 feature maps with the 40 × 40 feature maps, the time consumption reached 237.16 Mmac, accounting for 24.43% of the total time consumption. The reason for the high time consumption of SSH is similar. Since the feature fusion operation is implemented by regular convolution, in order to further improve the detection speed of the model, this paper replaces all regular convolutions in FPN and SSH with depthwise separable convolutions.
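The effect of replacing a regular convolution with a depthwise separable one can be estimated with the standard multiply-accumulate counts. The layer shape below (64 input and output channels on an 80 × 80 map, 3 × 3 kernel) is a hypothetical example, not a configuration taken from the paper, and bias and normalization costs are ignored.

```python
def regular_conv_macs(h, w, c_in, c_out, k=3):
    """MACs of a standard k x k convolution on an h x w output map."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k=3):
    """MACs of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels on an 80 x 80 feature map.
reg = regular_conv_macs(80, 80, 64, 64)
sep = depthwise_separable_macs(80, 80, 64, 64)
print(f"regular: {reg / 1e6:.2f} MMac, separable: {sep / 1e6:.2f} MMac, "
      f"ratio: {sep / reg:.3f}")
```

The ratio equals 1/C_out + 1/k², so for 3 × 3 kernels and a few dozen output channels the separable version needs roughly one eighth of the computation.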

2.5. Sheep Face Recognition Method, SheepFaceNetRec

To address the low recognition accuracy and slow inference speed of the RetinaFace-MobilenetV1-0.25 model, this paper uses Eblock as the basis of the feature extraction network and proposes SheepFaceNetRec, a lightweight, high-accuracy sheep face recognition model. The overall structure of SheepFaceNetRec is shown in Figure 9. SheepFaceNetRec uses two Eblocks and ten Ebottlenecks to extract features of different scales. The Eblock is the efficient and fast basic module proposed in this paper; it is used to build the backbone network of the sheep face recognition model to improve the efficiency and effectiveness of feature extraction. The Ebottleneck is used to reduce the number of parameters and computations of the model while maintaining the accuracy of sheep face recognition. The feature maps of sizes 28 × 28 × 80, 14 × 14 × 96, 7 × 7 × 144, and 7 × 7 × 128 are concatenated and sent to the MS-FC layer to provide the model with different receptive fields for recognizing sheep faces of different sizes. The output of the MS-FC layer is then input into the GDConv layer [32], which gives different attention to different positions of the feature map. The FC layer then flattens the feature map into a 128-dimensional representation of the sheep face, and L2 normalization maps the sheep face features onto a unit hypersphere to avoid the influence of features at different scales on computational efficiency.
In this paper, the Ebottleneck is designed by mimicking the residual block of ResNet to reduce the number of parameters and computations of the model while maintaining recognition accuracy. The Ebottleneck consists mainly of two Eblocks and one ECA module. The first Eblock serves as an expansion layer to increase the number of channels, while the second Eblock reduces the number of channels to match the residual connection. Residual connections are added between the inputs and outputs of the two Eblocks. After each layer, a BN layer and a ReLU activation function are used, except that the second Eblock does not use the ReLU activation function. A depthwise convolutional layer with a stride of 2 is inserted between the two Eblocks. When the stride is 1, the part enclosed by the green dotted line box in the Ebottleneck does not exist. It should be noted that the BN used here is merged into the corresponding convolution layer during inference.
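Once faces are embedded on the unit hypersphere, identification reduces to a nearest-neighbour search over stored embeddings. The sketch below is illustrative: it assumes cosine similarity as the similarity measure (the paper only says the most similar stored face is selected), uses toy 4-dimensional vectors instead of 128-dimensional ones, and the sheep IDs are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [v / norm for v in vec]


def identify(query, gallery):
    """Return (identity, cosine similarity) of the closest gallery embedding.

    `gallery` maps a sheep ID to its stored embedding.
    """
    q = l2_normalize(query)
    best_id, best_sim = None, -2.0
    for sheep_id, emb in gallery.items():
        # For unit vectors the dot product equals the cosine similarity.
        sim = sum(a * b for a, b in zip(q, l2_normalize(emb)))
        if sim > best_sim:
            best_id, best_sim = sheep_id, sim
    return best_id, best_sim


# Toy 4-dimensional embeddings; the model described above outputs 128 dimensions.
gallery = {"sheep_A": [1.0, 0.0, 0.0, 0.0], "sheep_B": [0.0, 1.0, 1.0, 0.0]}
match_id, match_sim = identify([0.1, 0.9, 1.1, 0.0], gallery)
print(match_id, round(match_sim, 3))
```

In practice a minimum-similarity threshold is usually added so that animals not present in the database are rejected instead of being matched to their nearest neighbour.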
ECA [33] is an efficient channel attention module that effectively avoids the influence of dimensionality reduction on channel attention learning by using a non-dimensionality reduction local inter-channel interaction strategy. This module improves the performance of the model with a small increase in the number of parameters. The structure of ECA is shown in Figure 8. It first calculates the average value of each dimension feature channel through global average pooling to obtain a vector with a dimension equal to the number of channels. Then, an MLP is used to process the channel vector to obtain a vector representing the weights of each channel. Finally, an accumulation operation is performed.
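A channel attention step of this kind can be sketched in a few lines. The version below follows the original ECA design [33] as generally described, where the pooled channel descriptor is processed by a small 1D convolution across neighbouring channels, passed through a sigmoid, and used to rescale each channel; whether SheepFaceNetRec’s implementation matches this in every detail is an assumption, and the kernel weights are placeholders rather than learned values.

```python
import math


def eca(feature, kernel):
    """Apply ECA-style channel attention to `feature` (a list of C 2D maps).

    `kernel` holds the weights of a 1D convolution of odd size k that slides
    across the channel axis with zero padding.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    # 1) Global average pooling: one scalar descriptor per channel.
    desc = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]
    # 2) Local cross-channel interaction without dimensionality reduction.
    pad = k // 2
    c = len(desc)
    weights = []
    for i in range(c):
        acc = 0.0
        for t in range(k):
            j = i + t - pad
            if 0 <= j < c:
                acc += kernel[t] * desc[j]
        weights.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid gate in (0, 1)
    # 3) Rescale every channel by its attention weight.
    out = [[[v * wgt for v in row] for row in ch]
           for ch, wgt in zip(feature, weights)]
    return out, weights


# Three constant 2 x 2 channels with values 0, 1, 2 and a placeholder kernel.
feature = [[[float(c)] * 2 for _ in range(2)] for c in range(3)]
out, weights = eca(feature, kernel=[0.0, 1.0, 0.0])
print([round(w, 3) for w in weights])
```

Because the only learned parameters are the k kernel weights, the module adds almost nothing to the parameter count, which is consistent with the small increase reported in the ablation study.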
During the sheep face recognition process, the size of sheep faces in the images varies, thus requiring different receptive fields. This paper borrows from [34] and selects four different scales of feature maps, namely 28 × 28 × 80, 14 × 14 × 96, 7 × 7 × 144, and 7 × 7 × 128, to cover sheep faces of different scales. Meanwhile, these feature maps are relatively small, with low computational complexity and rich semantic information. In this paper, the 28 × 28 feature map is subjected to 4 × 4 average pooling to obtain a 7 × 7 feature map. The 14 × 14 feature map is subjected to 2 × 2 average pooling to obtain a 7 × 7 feature map. Finally, these feature maps with consistent sizes but different channel numbers are connected to form a 7 × 7 × 448 feature map.
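The shape arithmetic of this fusion can be verified directly. The sketch below uses constant dummy feature maps (their values are placeholders) and non-overlapping average pooling to bring every scale to 7 × 7 before concatenating along the channel axis.

```python
def avg_pool(fm, k):
    """Non-overlapping k x k average pooling on a square 2D list."""
    n = len(fm)
    if n % k:
        raise ValueError("size must be divisible by k")
    return [
        [sum(fm[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
         for j in range(0, n, k)]
        for i in range(0, n, k)
    ]


def make_feature(size, channels, value):
    """Dummy feature map: `channels` constant maps of size x size."""
    return [[[float(value)] * size for _ in range(size)] for _ in range(channels)]


# (spatial size, channels) of the four scales named in the text.
scales = [(28, 80), (14, 96), (7, 144), (7, 128)]
fused = []
for idx, (size, channels) in enumerate(scales):
    feat = make_feature(size, channels, value=idx)
    k = size // 7  # 4 for 28 x 28, 2 for 14 x 14, 1 for 7 x 7
    fused.extend(avg_pool(ch, k) for ch in feat)  # concatenate along channels

print(len(fused), len(fused[0]), len(fused[0][0]))
```

The channel counts add up to 80 + 96 + 144 + 128 = 448, giving the 7 × 7 × 448 map described above.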

3. Results

3.1. Experimental Environment and Evaluation Index

3.1.1. Experimental Environment

The experimental environment configuration of this paper is shown in Table 2. Following the pre-training approach of transfer learning, the feature extraction network is pre-trained on the publicly available WIDER FACE dataset [35], which shortens the model training time. After multiple rounds of parameter adjustment, this paper selected the parameters that gave better experimental results. A 640 × 640 × 3 image is used as the input for the detection model, with a batch size of 64 and an SGD optimizer with a momentum of 0.927. The sheep face detection model uses the Complete Intersection over Union (CIoU) loss function, with an initial learning rate of $1 \times 10^{-3}$, a weight decay regularization coefficient of $1 \times 10^{-5}$, and a learning rate decay of 10 for model iteration. The sheep face recognition model uses the ArcFace loss function, with an initial learning rate of $1 \times 10^{-3}$, and the learning rate is adjusted using the cosine annealing algorithm. Dropout is set to 0.6, and the model is trained for 100 epochs.
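The cosine annealing schedule used for the recognition model can be written out explicitly. The sketch below assumes an initial learning rate of 1e-3, a single annealing cycle over the 100 training epochs, and a minimum learning rate of 0; the cycle length and minimum value are not given in the text and are assumptions here.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=100, lr_max=1e-3, lr_min=0.0):
    """Learning rate at `epoch` (0-based) under cosine annealing without restarts."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


for epoch in (0, 25, 50, 75, 100):
    print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```

The rate starts at its maximum, falls slowly at first, drops fastest around the midpoint, and flattens out again near the end of training.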

3.1.2. Evaluation Index

This paper selects the following evaluation metrics: accuracy, precision, recall, average precision (AP), FLOPs, parameters, FPS, and latency. Their calculation methods are shown in Equations (4)–(8), where TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives. Accuracy refers to the ratio of correctly identified samples to the total number of samples. Precision refers to the ratio of correctly predicted positive samples to the total number of predicted positive samples. Recall refers to the ratio of correctly predicted positive samples to the total number of actual positive samples. AP represents the area under the precision–recall (P-R) curve plotted with recall on the x-axis and precision on the y-axis, which jointly reflects the precision and recall of the classifier. FLOPs are commonly used to measure the computational complexity of the model. The smaller the FLOPs value, the lower the model’s computational complexity, and the less time it takes to process. In Equation (8), $h$, $w$, and $C_{in}$ represent the height, width, and number of channels of the input feature map, respectively, while $C_{out}$ represents the number of channels of the output feature map, and $K$ represents the width of the convolution kernel. The number of parameters measures the size of the model and the memory it requires. Generally, the more parameters a model has, the longer its inference time. In this paper, FPS is used to measure the speed of the sheep face detector, while latency is used to measure the speed of the sheep face recognition model.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$
$$AP = \int_{0}^{1} P(r)\, dr \tag{7}$$
$$\mathrm{FLOPs} = 2hw \left( C_{in} K^{2} + 1 \right) C_{out} \tag{8}$$
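The metrics above translate directly into code. In the sketch below the confusion-matrix counts and the precision–recall points are made-up values for illustration, and AP is approximated with the plain trapezoid rule; detection benchmarks often use an interpolated variant, so this choice is an assumption.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def average_precision(recalls, precisions):
    """Trapezoidal estimate of the area under the P-R curve.

    Points must be sorted by increasing recall.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def conv_flops(h, w, c_in, c_out, k):
    """FLOPs of one convolution layer: 2*h*w*(C_in*K^2 + 1)*C_out."""
    return 2 * h * w * (c_in * k * k + 1) * c_out


# Made-up counts, for illustration only.
tp, tn, fp, fn = 90, 5, 3, 2
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
print(average_precision([0.0, 0.5, 1.0], [1.0, 0.9, 0.6]))
print(conv_flops(h=80, w=80, c_in=64, c_out=64, k=3))
```

The factor of 2 in the FLOPs formula counts each multiply-accumulate as two operations, and the "+ 1" term accounts for the bias.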

3.2. Sheep Face Detection Results

3.2.1. Comparison of Different Sheep Face Detection Models

To verify the performance of the SheepFaceNetDet sheep face detection model proposed in this paper, this section compares it with mainstream face detection models, including Retinaface-Mobilenet0.25, Retinaface-ResNet50, YOLOv5s, and CenterFace [36]. The experimental results are shown in Table 3. The AP of SheepFaceNetDet is 96.36%, slightly lower than that of the best-performing heavyweight model, Retinaface-ResNet50, which has an AP of 97.51%. Among all models, SheepFaceNetDet has the fewest parameters, the fastest detection speed, and the lowest computational complexity, achieving the best speed–accuracy trade-off. Compared with the original Retinaface-Mobilenet0.25 model, SheepFaceNetDet reduces the number of parameters by 60% and FLOPs by 65%. Meanwhile, its AP is 2.77 percentage points higher and its detection speed is faster, indicating that the improvements to Retinaface-Mobilenet0.25 proposed in this paper are effective. Compared with Retinaface-ResNet50, the AP of SheepFaceNetDet is 1.15 percentage points lower, but Retinaface-ResNet50 has 105 times as many parameters and 60 times as many FLOPs as SheepFaceNetDet, and its detection speed is only 34% of that of the proposed model. SheepFaceNetDet is therefore much faster and much smaller at a small cost in AP. Compared with YOLOv5s and CenterFace, SheepFaceNetDet also shows a better speed–accuracy trade-off.

3.2.2. Ablation Experiment

In order to investigate the effectiveness of the different improvement measures proposed in this paper for RetinaFace, ablation experiments were conducted, and the results are shown in Table 4. “Eblock” represents using Eblock to construct the feature extraction network instead of Mobilenet0.25, “BiFPN” represents using bidirectional FPN, and “DWConv” represents using depthwise separable convolution to replace the ordinary convolution in BiFPN and SSH. It can be seen that due to the weak feature extraction ability of Retinaface-Mobilenet0.25, the detection accuracy is poor, with an AP of only 93.59%. After replacing it with Eblock to construct the feature extraction network, the model’s AP increased by 2.29%, and the detection speed slightly improved, while the changes in parameter count and FLOPs were not significant, which proves the effectiveness of the Eblock design. When replacing the original unidirectional FPN with BiFPN, AP increased by 1.53%, but the detection speed slowed down, and the parameter count and FLOPs slightly increased, indicating that BiFPN can indeed improve detection performance but at the cost of the extra computational burden. When the above two improvements were combined, AP further increased to 96.97%, and the detection speed slowed down slightly. When replacing the convolutions in BiFPN and SSH with depthwise separable convolution, although AP slightly decreased, parameter count and FLOPs significantly reduced, and the detection speed became noticeably faster. Through the above analysis, it can be concluded that the improvements proposed in this paper for Retinaface-Mobilenet0.25 are necessary and effective.

3.2.3. Sheep Face Detection Results

The detection results of SheepFaceNetDet are shown in Figure 10, which indicates that SheepFaceNetDet can accurately detect the sheep faces in the image and is not affected by the detection environment, demonstrating its excellent performance.

3.3. Sheep Face Recognition

3.3.1. Comparison of Recognition Effect with Other Models

To verify the effectiveness of SheepFaceNetRec, this paper compares it with the current state-of-the-art methods ResNet18, MobileNetv1, MobileNetv2 [37], MobileNetv3 [38], EfficientNet-B0 [39], GhostNet, and MobileViT [40]. The experimental results are shown in Table 5. SheepFaceNetRec achieved the best recognition accuracy, similar to that of Resnet18 and Inception-Resnetv2, but with the smallest number of parameters and FLOPs and the lowest latency, demonstrating a good balance between speed and accuracy. Compared with other models, SheepFaceNetRec has different degrees of advantages. For example, compared with the recently proposed lightweight MobileViT-XXS, SheepFaceNetRec’s recognition accuracy is 4.23% higher, and its number of parameters and FLOPs are only 47% and 38% of MobileViT-XXS, respectively, while its latency is only 36%. The effectiveness of SheepFaceNetRec proposed in this paper is evident.

3.3.2. Ablation Experiment

To verify the effectiveness of each component of SheepFaceNetRec, ablation experiments were conducted by enabling its two components, ECA and MS-FC, separately and together. The experimental results are shown in Table 6. With neither component, SheepFaceNetRec is a relatively lightweight convolutional neural network, with parameters, latency, and recognition accuracy of 0.58 MB, 2.42 ms, and 92.00%, respectively; the recognition accuracy is relatively low. When ECA is used to optimize channel feature extraction, the model’s parameters, latency, and recognition accuracy are 0.59 MB, 2.48 ms, and 93.10%, respectively. Parameters and latency did not change significantly, but the recognition accuracy increased by 1.10 percentage points, demonstrating the effectiveness of ECA. When multi-scale features (MS-FC) are used for sheep face recognition, the model’s parameters, latency, and recognition accuracy are 0.59 MB, 2.55 ms, and 96.20%, respectively. Compared with the initial model, parameters and latency did not change significantly, but the recognition accuracy increased by 4.20 percentage points, indicating that MS-FC greatly improves the model’s recognition performance. When both ECA and MS-FC are used, parameters and latency still change little, but the recognition accuracy improves further, reaching 97.75%. The ablation experiments show that each part of the model is necessary and effective.
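The small cost of ECA in Table 6 is plausible given how little it adds to a network. The following is a minimal pure-Python sketch of ECA-style channel attention in the general form described in ECA-Net [33]: global average pooling, a 1D convolution across neighbouring channels, a sigmoid gate, and channel rescaling. It is an illustration only and not the authors' implementation; the kernel weights and the tiny feature map are hypothetical, and the kernel size actually used in SheepFaceNetRec is not given in this section.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size as proposed in ECA-Net [33]."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca(feature_map, weights):
    """Apply ECA-style channel attention.

    feature_map: list of C channels, each a list of rows (H x W floats).
    weights: 1D kernel of odd length shared by all channels
             (learned in practice; supplied by hand in this sketch).
    """
    pad = len(weights) // 2
    # 1) Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # 2) 1D convolution across neighbouring channels (zero padding).
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(w * padded[i + j] for j, w in enumerate(weights))
             for i in range(len(pooled))]
    # 3) Sigmoid gate, then rescale every value of each channel.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    scaled = [[[g * v for v in row] for row in ch]
              for g, ch in zip(gates, feature_map)]
    return scaled, gates


# Hypothetical 3-channel, 2 x 2 feature map with channel means 1, 2 and 0.
fmap = [[[1.0, 1.0], [1.0, 1.0]],
        [[2.0, 2.0], [2.0, 2.0]],
        [[0.0, 0.0], [0.0, 0.0]]]
out, gates = eca(fmap, weights=[0.0, 1.0, 0.0])  # identity kernel for clarity
print(eca_kernel_size(64), [round(g, 3) for g in gates])
```

The only learnable parameters in such a module are the few kernel weights, which is consistent with the near-constant parameter count between the first two rows of Table 6.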

3.3.3. Sheep Face Recognition Results

Figure 11 shows the recognition results of SheepFaceNetRec. The figure shows that the proposed model accurately identifies sheep of each color under both strong and weak light. This indicates that SheepFaceNetRec can effectively complete sheep face recognition tasks under most regular conditions.

4. Discussion

This paper proposes a lightweight sheep face recognition model, SheepFaceNet, which greatly reduces the number of parameters and the computational complexity of the recognition model, improves inference speed, and retains high recognition accuracy, achieving the best speed–accuracy trade-off. It addresses the problems faced by sheep face recognition models, such as large parameter sizes, slow recognition speed, and difficult deployment. This paper proposes a computationally efficient and fast basic module, Eblock, and uses it to construct the feature extraction network, which improves the efficiency and effectiveness of feature extraction and addresses the problem that reducing the number of parameters and the computational complexity does not reduce inference time to the same extent. To improve the accuracy of detection and recognition, a bidirectional FPN layer is designed to enhance the model’s geometric localization ability, and ECA channel attention is combined with multi-scale feature fusion for sheep face recognition. Comparative experiments with other models show that SheepFaceNet has a fast inference speed and high recognition accuracy. Ablation experiments show that each component of the designed model is effective and necessary. The detection and recognition results show that SheepFaceNet has great application potential. On a self-built sheep face dataset, SheepFaceNet can recognize 387 sheep face images per second with a recognition accuracy of 97.75%. Compared to [5,6,7,8,9,10], SheepFaceNet has a similar recognition accuracy, but its parameter size, computational complexity, and inference latency are much lower than those of these models.
While existing heavyweight backbone networks can achieve good recognition accuracy, their parameter size, computational complexity, and inference latency prevent them from being deployed on edge devices. This study designed an efficient and fast basic module, Eblock, and used it as the basis for a lightweight backbone network. Through a series of improvements and optimizations, SheepFaceNet achieves not only high recognition accuracy but also real-time recognition speed. Previous studies [11,12,13] effectively reduced model parameter size and computational complexity by using or improving lightweight recognition models, but their recognition accuracy is lower than that of SheepFaceNet. While MobilenetV2 can reduce model parameters and computational complexity, its use of depthwise separable convolution and pointwise convolution reduces model capacity, which affects recognition accuracy. The basic module Eblock proposed in this paper uses inexpensive linear transformations to extract features, reducing feature extraction costs without affecting model capacity. Moreover, previous models do not effectively reduce inference latency. In contrast, SheepFaceNet’s Eblock improves inference speed through reparameterization, and the time-consuming components of the model are optimized, allowing SheepFaceNet to maintain high recognition accuracy while meeting real-time requirements. Previous models [15,16,17] have also implemented sheep face recognition, but with low recognition accuracy. For example, [15] has a recognition accuracy of only 91%, which is much lower than SheepFaceNet’s 97.75%. SheepFaceNet strikes a balance between recognition accuracy and speed. It should be noted that the above comparisons are based on results reported by the respective works and are for reference only, because different datasets were used. Nevertheless, they are enough to show SheepFaceNet’s advantages in the speed–accuracy trade-off.
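The speed-up from reparameterization rests on the fact that parallel linear branches with batch normalization can be merged into one layer for inference. The sketch below illustrates this general idea in the RepVGG sense [27]. It is not the exact Eblock transformation, which is described elsewhere in the paper. For brevity it assumes two hypothetical 1 × 1 convolution branches evaluated at a single pixel, so each convolution reduces to a matrix-vector product; all weights and statistics are made-up values.

```python
import math


def matvec(w, x):
    """A 1 x 1 convolution at one pixel is a matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]


def fold_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free conv: returns (weight, bias)."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    w_f = [[s * wij for wij in row] for s, row in zip(scale, w)]
    b_f = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return w_f, b_f


def merge_branches(branches):
    """Sum parallel (weight, bias) branches into one equivalent layer."""
    rows, cols = len(branches[0][0]), len(branches[0][0][0])
    w = [[sum(br[0][i][j] for br in branches) for j in range(cols)]
         for i in range(rows)]
    b = [sum(br[1][i] for br in branches) for i in range(rows)]
    return w, b


# Hypothetical two-branch block on a 2-channel input (illustrative values only).
x = [0.5, -1.0]
w1 = [[1.0, 2.0], [0.0, 1.0]]
bn1 = dict(gamma=[1.0, 0.5], beta=[0.1, 0.0], mean=[0.2, -0.3], var=[1.0, 4.0])
w2 = [[0.5, 0.0], [1.0, -1.0]]
bn2 = dict(gamma=[2.0, 1.0], beta=[0.0, 0.3], mean=[0.0, 0.1], var=[0.25, 1.0])

# Training-time structure: two conv + BN branches, outputs added.
multi_branch = [a + b for a, b in zip(batchnorm(matvec(w1, x), **bn1),
                                      batchnorm(matvec(w2, x), **bn2))]

# Inference-time structure: one merged layer with a bias.
w, b = merge_branches([fold_bn(w1, **bn1), fold_bn(w2, **bn2)])
single_branch = [y + bi for y, bi in zip(matvec(w, x), b)]

print(multi_branch, single_branch)
```

Both structures produce the same output up to floating-point error, so the merged layer can replace the multi-branch block at inference time with fewer operations and memory accesses.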
The parameter size and computational complexity of SheepFaceNet are already very low, which meets the requirements for deployment on edge devices. The recognition accuracy of SheepFaceNet is 97.75%, which needs to be further improved for better application in actual production. The above recognition results are based on the dataset constructed in this paper, and we need to further test SheepFaceNet’s performance in actual production. SheepFaceNet accelerates the deployment of deep-learning-based sheep face recognition models on edge devices, promoting the application of sheep face recognition in actual production.

5. Conclusions

The lightweight sheep face recognition model proposed in this paper, SheepFaceNet, achieves a balance between speed and accuracy with 0.60 MB of parameters and 0.09 G FLOPs. It can recognize 387 sheep face images per second on our self-built sheep face dataset with an accuracy of 97.75%, surpassing many advanced recognition models. However, these results were obtained on our self-built dataset and equipment, and the model’s generalization needs further verification. Moreover, the reparameterization used in SheepFaceNet requires both its structure and its transformation to be specially designed, which makes the model more complicated than an ordinary neural network. The balance between speed and accuracy is essential for the deployment of sheep face recognition models in practice. In the future, we will verify the generalization of SheepFaceNet on other datasets and optimize its speed–accuracy balance to make it more suitable for deployment on edge devices and better serve sheep production.

Author Contributions

Conceptualization, X.L. and Y.Z.; methodology, X.L.; software, X.L.; validation, X.L. and Y.Z.; formal analysis, X.L.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing – original draft preparation, X.L.; writing – review and editing, X.L.; visualization, Y.Z.; supervision, S.L.; project administration, S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (grant number 2022YFD1300201).

Institutional Review Board Statement

The experimental procedures were approved under the control of the Guidelines for Animal Experiments by the Committee for the Ethics on Animal Care and Experiments of Northwest A&F University and performed under the control of the “Guidelines on Ethical Treatment of Experimental Animals” (2006) No. 398 set by the Ministry of Science and Technology, China.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to the privacy policy of the authors’ institution.

Acknowledgments

We thank all of the funders.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ait-Saidi, A.; Caja, G.; Salama, A.A.K.; Carné, S. Implementing electronic identification for performance recording in sheep: I. Manual versus semiautomatic and automatic recording systems in dairy and meat farms. J. Dairy Sci. 2014, 97, 7505–7514.
  2. Corkery, G.P.; Gonzales-Barron, U.A.; Butler, F.; Mc Donnell, K.; Ward, S. A preliminary investigation on face recognition as a biometric identifier of sheep. Trans. ASABE 2007, 50, 313–320.
  3. Leslie, E.; Hernández-Jover, M.; Newman, R.; Holyoake, P. Assessment of acute pain experienced by piglets from ear tagging, ear notching and intraperitoneal injectable transponders. Appl. Anim. Behav. Sci. 2010, 127, 86–95.
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  5. Salama, A.Y.A.; Hassanien, A.E.; Fahmy, A. Sheep identification using a hybrid deep learning and bayesian optimization approach. IEEE Access 2019, 7, 31681–31687.
  6. Hitelman, A.; Edan, Y.; Godo, A.; Berenstein, R.; Lepar, J.; Halachmi, I. Biometric identification of sheep via a machine-vision system. Comput. Electron. Agric. 2022, 194, 106713.
  7. Li, X.; Xiang, Y.; Li, S. Combining convolutional and vision transformer structures for sheep face recognition. Comput. Electron. Agric. 2023, 205, 107651.
  8. Zhang, C.; Zhang, H.; Tian, F.; Zhou, Y.; Zhao, S.; Du, X. Research on sheep face recognition algorithm based on improved AlexNet model. Neural Comput. Appl. 2023, 1–9.
  9. Agrawal, D.; Minocha, S.; Namasudra, S.; Kumar, S. Ensemble algorithm using transfer learning for sheep breed classification. In Proceedings of the 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI), Timisoara, Romania, 19–21 May 2021.
  10. Jwade, S.A.; Guzzomi, A.; Mian, A. On farm automatic sheep breed classification using deep learning. Comput. Electron. Agric. 2019, 167, 105055.
  11. Pang, Y.; Yu, W.; Zhang, Y.; Xuan, C.; Wu, P. Sheep face recognition and classification based on an improved MobilenetV2 neural network. Int. J. Adv. Robot. Syst. 2023, 20, 17298806231152969.
  12. Li, X.; Du, J.; Yang, J.; Li, S. When Mobilenetv2 Meets Transformer: A Balanced Sheep Face Recognition Model. Agriculture 2022, 12, 1126.
  13. Li, Z.; Lei, X.; Liu, S. A lightweight deep learning model for cattle face recognition. Comput. Electron. Agric. 2022, 195, 106848.
  14. Dehghani, M.; Arnab, A.; Beyer, L.; Vaswani, A.; Tay, Y. The efficiency misnomer. arXiv 2021, arXiv:2110.12894.
  15. Zhang, X.; Xuan, C.; Ma, Y.; Su, H.; Zhang, M. Biometric facial identification using attention module optimized YOLOv4 for sheep. Comput. Electron. Agric. 2022, 203, 107452.
  16. Billah, M.; Wang, X.; Yu, J.; Jiang, Y. Real-time goat face recognition using convolutional neural network. Comput. Electron. Agric. 2022, 194, 106730.
  17. Song, S.; Liu, T.; Wang, H.; Hasi, B.; Yuan, C.; Gao, F.; Shi, H. Using pruning-based YOLOv3 deep learning algorithm for accurate detection of sheep face. Animals 2022, 12, 1465.
  18. Zhao, S.; Hao, G.; Zhang, Y.; Wang, S. A real-time semantic segmentation method of Sheep Carcass images based on ICNet. J. Robot. 2021, 2021, 8847984.
  19. Fu, L.; Yang, Z.; Wu, F.; Zou, X.; Lin, J.; Cao, Y.; Duan, J. YOLO-Banana: A lightweight neural network for rapid detection of banana bunches and stalks in the natural environment. Agronomy 2022, 12, 391.
  20. Li, W.; Zhang, L.; Wu, C.; Cui, Z.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv. Manuf. Tech. 2022, 123, 1999–2015.
  21. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  22. Wada, K. labelme: Image Polygonal Annotation with Python. Available online: https://github.com/wkentaro/labelme (accessed on 25 April 2023).
  23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  24. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  25. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
  26. Ding, X.; Guo, Y.; Ding, G.; Han, J. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  27. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021.
  28. Deng, J.; Guo, J.; Ververas, E.; Kotsia, I.; Zafeiriou, S. Retinaface: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
  29. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  30. Sovrasov, V. Ptflops: A Flops Counting Tool for Neural Networks in Pytorch Framework. Available online: https://github.com/sovrasov/flops-counter.pytorch (accessed on 10 April 2023).
  31. Najibi, M.; Samangouei, P.; Chellappa, R.; Davis, L.S. Ssh: Single stage headless face detector. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
  32. Chen, S.; Liu, Y.; Gao, X.; Han, Z. Mobilefacenets: Efficient cnns for accurate real-time face verification on mobile devices. In Proceedings of the Biometric Recognition: 13th Chinese Conference, CCBR 2018, Urumqi, China, 11–12 August 2018.
  33. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020.
  34. Guo, X.; Li, S.; Yu, J.; Zhang, J.; Ma, J.; Ma, L.; Ling, H. PFLD: A practical facial landmark detector. arXiv 2019, arXiv:1902.10859.
  35. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
  36. Xu, Y.; Yan, W.; Yang, G.; Luo, J.; Li, T.; He, J. CenterFace: Joint face detection and alignment using face as point. Sci. Program. Meth. 2020, 2020, 7845384.
  37. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  38. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Adam, H. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
  39. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 10–15 June 2019.
  40. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
Figure 1. Enhanced results of sheep face images. (a) represents the original image, (b) represents the image after brightness adjustment, (c) represents the image after blurring, (d) represents the image after rotation, (e) represents the image after occlusion, and (f) represents the image after contrast adjustment.
Figure 2. The process of labeling sheep faces.
Figure 3. Traditional convolution and Ghost module. (a) is the traditional convolutional structure and (b) is the Ghost module in [25].
Figure 4. Structure of Eblock.
Figure 5. Reparameterization process. (a) is the original structure and (b) is the reparameterized structure.
Figure 6. SheepFaceNet sheep face recognition process.
Figure 7. SheepFaceNetDet sheep face detection model structure.
Figure 8. Time-consumption analysis of RetinaFace-Mobilenet0.25.
Figure 9. SheepFaceNetRec sheep face recognition model structure.
Figure 10. SheepFaceNetDet sheep face detection results.
Figure 11. SheepFaceNetRec sheep face recognition results.
Table 1. Feature extraction network structure of SheepFaceNetDet.

| Input | Operation | Input Channel | Output Channel | Feature Map | Stage |
|---|---|---|---|---|---|
| 640 × 640 × 3 | EBlock | 8 | 16 | N | 0 |
| 320 × 320 × 16 | EBlock | 16 | 32 | N | 1 |
| 160 × 160 × 32 | EBlock | 32 | 64 | N | 2 |
| 80 × 80 × 64 | EBlock | 64 | 128 | Y | 3 |
| 40 × 40 × 128 | EBlock | 128 | 128 | N | 4 |
| 40 × 40 × 128 | EBlock | 128 | 256 | Y | 4 |
| 20 × 20 × 256 | EBlock | 256 | 256 | N | 5 |
| 20 × 20 × 256 | EBlock | 256 | 256 | N | 5 |
| 20 × 20 × 256 | EBlock | 256 | 256 | Y | 5 |

Table 2. Experimental environment configuration.

| Experimental Environment | Configuration Parameters |
|---|---|
| GPU | NVIDIA GeForce RTX3090 |
| CPU | Intel(R) Core(TM) I7-7700K |
| Operating system | Ubuntu 16.04 |
| Deep Learning Framework | Pytorch 1.10 |
| Programming languages | Python 3.8 |
| GPU Acceleration Library | CUDA 11.6, CUDNN 8.3.2 |

Table 3. Comparison of different detection models.

| Model | AP (%) | Parameters (MB) | FPS (fps) | FLOPs (G) |
|---|---|---|---|---|
| Retinaface-Mobilenet0.25 | 93.59 | 0.42 | 75.13 | 1.29 |
| Retinaface-ResNet50 | 97.51 | 27.29 | 31.52 | 50.75 |
| YOLOv5s | 94.20 | 7.27 | 37.82 | 8.19 |
| CenterFace | 92.64 | 0.38 | 65.54 | 1.56 |
| SheepFaceNetDet | 96.36 | 0.25 | 91.43 | 0.84 |

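Table 4. Results of ablation experiments.

| Eblock | BiFPN | DWConv | AP (%) | Parameters (MB) | FPS (fps) | FLOPs (G) |
|---|---|---|---|---|---|---|
|  |  |  | 93.59 | 0.42 | 75.13 | 1.02 |
| ✓ |  |  | 95.88 | 0.41 | 81.75 | 1.29 |
|  | ✓ |  | 95.12 | 0.57 | 63.87 | 1.24 |
| ✓ | ✓ |  | 96.97 | 0.58 | 72.11 | 1.58 |
| ✓ | ✓ | ✓ | 96.36 | 0.25 | 91.43 | 0.84 |
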
Table 5. Comparison results of different models.

| Model | Accuracy (%) | Parameters (MB) | Latency (ms) | FLOPs (G) |
|---|---|---|---|---|
| Resnet18 | 97.88 | 11.69 | 11.30 | 1.82 |
| MobilenetV2 | 94.20 | 3.50 | 4.27 | 0.40 |
| MobilenetV3 | 96.10 | 2.54 | 5.33 | 0.22 |
| EfficientNet-B0 | 97.78 | 5.29 | 6.98 | 0.38 |
| Inception-Resnetv2 | 97.91 | 55.84 | 28.31 | 6.67 |
| MobileViT-XXS | 93.52 | 1.27 | 7.01 | 0.25 |
| MobileViT-XS | 95.61 | 2.32 | 7.30 | 0.70 |
| MobileViT-S | 95.77 | 5.58 | 7.70 | 1.42 |
| SheepFaceNetRec | 97.75 | 0.60 | 2.58 | 0.09 |

Table 6. Results of ablation experiments.

| ECA | MS-FC | Parameters (MB) | Latency (ms) | Accuracy (%) |
|---|---|---|---|---|
|  |  | 0.587 | 2.42 | 92.00 |
| ✓ |  | 0.596 | 2.48 | 93.10 |
|  | ✓ | 0.593 | 2.55 | 96.20 |
| ✓ | ✓ | 0.602 | 2.58 | 97.75 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, X.; Zhang, Y.; Li, S. SheepFaceNet: A Speed–Accuracy Balanced Model for Sheep Face Recognition. Animals 2023, 13, 1930. https://doi.org/10.3390/ani13121930

AMA Style

Li X, Zhang Y, Li S. SheepFaceNet: A Speed–Accuracy Balanced Model for Sheep Face Recognition. Animals. 2023; 13(12):1930. https://doi.org/10.3390/ani13121930

Chicago/Turabian Style

Li, Xiaopeng, Yichi Zhang, and Shuqin Li. 2023. "SheepFaceNet: A Speed–Accuracy Balanced Model for Sheep Face Recognition" Animals 13, no. 12: 1930. https://doi.org/10.3390/ani13121930
