2.1. Dataset
The sheep face image data used in this paper come from the Yanchi Tan sheep breeding base in Ningxia, China. A total of 95 segments of sheep face videos were obtained by tracking and shooting the sheep faces in different environments such as indoors, outdoors, in different lighting, at different distances, and at different heights. The videos cover 78 dairy goats and 61 Tan sheep, with a resolution of 1920 × 1080 pixels. These videos were then divided into frames using FFmpeg, and the SSIM algorithm [
21] was used to delete images with high similarity in order to avoid too much interference in the network training.
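For illustration, the frame extraction and de-duplication step could be implemented roughly as follows; the SSIM threshold, the downscaled comparison size, and the file layout are assumed values, not settings reported in this paper.

```python
import glob
import os

import cv2
from skimage.metrics import structural_similarity as ssim

# Frames are assumed to have been extracted beforehand, e.g. with
#   ffmpeg -i sheep_video.mp4 frames/%06d.jpg
SSIM_THRESHOLD = 0.85  # hypothetical threshold; frames more similar than this are dropped

def filter_similar_frames(frame_dir: str, keep_dir: str) -> None:
    """Keep a frame only if it differs sufficiently from the last kept frame."""
    last_kept_gray = None
    for path in sorted(glob.glob(os.path.join(frame_dir, "*.jpg"))):
        img = cv2.imread(path)
        # compare on a downscaled grayscale copy to speed up SSIM
        gray = cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (480, 270))
        if last_kept_gray is None or ssim(last_kept_gray, gray, data_range=255) < SSIM_THRESHOLD:
            cv2.imwrite(os.path.join(keep_dir, os.path.basename(path)), img)
            last_kept_gray = gray
```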
A total of 7328 images were obtained, and this raw dataset was divided into a training set, validation set, and test set at a ratio of 6:3:1. To expand the dataset and improve the model’s generalization ability, the ImageEnhance module of the Python PIL library was used to augment the original data. The specific augmentation methods include brightness and contrast adjustment, rotation, occlusion, and random cropping, and enhanced examples are shown in
Figure 1. After data augmentation, the training set contained a total of 16,812 images.
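As an illustration, such an augmentation pass could be written with PIL roughly as below; the enhancement factor ranges, occlusion size, and crop ratio are illustrative assumptions rather than the exact settings used to build the dataset.

```python
import random

from PIL import Image, ImageDraw, ImageEnhance

def augment(img: Image.Image) -> Image.Image:
    """Apply one randomly parameterized augmentation pass (all ranges are illustrative)."""
    img = img.convert("RGB")
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.6, 1.4))
    img = img.rotate(random.uniform(-15, 15))
    # random occlusion: paint a grey rectangle over part of the image
    w, h = img.size
    x0, y0 = random.randint(0, w // 2), random.randint(0, h // 2)
    ImageDraw.Draw(img).rectangle([x0, y0, x0 + w // 8, y0 + h // 8], fill=(128, 128, 128))
    # random crop to 90% of the original area, then resize back
    cw, ch = int(w * 0.9), int(h * 0.9)
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    return img.crop((left, top, left + cw, top + ch)).resize((w, h))
```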
To more accurately detect sheep faces, the Labelme tool [
22] was used in this paper to annotate the sheep faces in the collected images. Following the common practice for constructing face recognition datasets, the main facial area of each sheep was selected as the detection box, and key points such as the eyes and nose were labeled, as shown in
Figure 2.
2.2. Efficient and Fast Basic Module EBlock
To reduce the number of parameters and the computational complexity of the model and to improve its inference speed, this paper proposes an efficient and fast basic module called EBlock. Deep convolutional neural networks typically consist of many convolutional layers, which leads to a huge computational cost. Some recent works, such as MobileNet [
23] and ShuffleNet [
24], have introduced depthwise convolutions or shuffle operations to build efficient convolutional neural networks with smaller convolution kernels, but the remaining 1 × 1 convolution layers still consume a considerable amount of memory and parameters. Assuming the input data is $X \in \mathbb{R}^{c \times h \times w}$, where $c$, $h$, and $w$ represent the number of channels, height, and width, respectively, the operation that generates $n$ feature maps can be represented as:
$$Y = X \ast F + B \quad (1)$$
In the equation, $\ast$ represents the convolution operation, $B$ is the bias term, $Y \in \mathbb{R}^{h' \times w' \times n}$ is the output feature map with $n$ channels, $h'$ and $w'$ are the height and width, respectively, of the output, $F \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution kernel in this layer, and $k \times k$ is the size of $F$. The FLOPs of this convolution are $n \cdot h' \cdot w' \cdot c \cdot k \cdot k$. Because the number of convolution kernels $n$ and the number of channels $c$ are usually very large, the resulting FLOPs are large, as shown in
Figure 3a.
Mainstream convolutional neural networks usually produce considerable redundancy among the feature maps they compute, and this redundancy is important for network performance. However, related research [14] has shown that it is not necessary to spend a large number of parameters and FLOPs to generate these redundant feature maps one by one: simple linear transformations can generate them from a small set of intrinsic feature maps. The process of generating $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h' \times w' \times m}$ is shown in Equation (2), where $m$ is far less than $n$:
$$Y' = X \ast F' \quad (2)$$
This primary convolution uses the same hyper-parameters as Equation (1); here, $F' \in \mathbb{R}^{c \times k \times k \times m}$ is the convolution kernel used, and the bias term is omitted for simplicity. Inexpensive linear transformations are then used to generate the $n$ redundant feature maps from the $m$ intrinsic feature maps:
$$y_{ij} = \Phi_{i,j}(y'_i), \quad i = 1, \dots, m, \; j = 1, \dots, s \quad (3)$$
Here, $y'_i$ is the $i$-th intrinsic feature map in $Y'$, and $\Phi_{i,j}$ is the $j$-th linear transformation used to generate the $j$-th feature map $y_{ij}$ from it. Using Equation (3), we obtain $n = m \cdot s$ feature maps. Because each linear transformation operates on a single channel, its computational cost is much lower than that of an ordinary convolution. The structure is shown in
Figure 3b.
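To make the construction concrete, a minimal PyTorch sketch of Equations (2) and (3) is given below, mirroring the Ghost-module idea of [14]: a primary convolution produces the intrinsic maps and depthwise convolutions serve as the cheap per-channel transformations. The class name CheapOpConv, the 3 × 3 depthwise kernel, and the default ratio s = 2 are illustrative assumptions, and the reparameterized inference-time form of EBlock is not shown here.

```python
import math

import torch
import torch.nn as nn

class CheapOpConv(nn.Module):
    """Sketch of the intrinsic + cheap-transform idea behind EBlock (Eqs. (2) and (3))."""

    def __init__(self, c_in, n_out, k=1, s=2):
        super().__init__()
        m = math.ceil(n_out / s)                         # intrinsic channels, m << n
        self.primary = nn.Conv2d(c_in, m, k, padding=k // 2, bias=False)             # Eq. (2)
        self.cheap = nn.Conv2d(m, m * (s - 1), 3, padding=1, groups=m, bias=False)   # Eq. (3)
        self.n_out = n_out

    def forward(self, x):
        y_intrinsic = self.primary(x)                    # m intrinsic feature maps
        y_cheap = self.cheap(y_intrinsic)                # cheap per-channel linear transforms
        return torch.cat([y_intrinsic, y_cheap], dim=1)[:, : self.n_out]
```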
The above operation can significantly reduce the number of model parameters and the computational complexity, but it does not noticeably improve the model’s inference speed. Therefore, this paper reparameterizes the process described in Equation (3) to accelerate inference and builds an efficient and fast basic module, EBlock. Its structure is shown in
Figure 4.
Structural reparameterization technology [
26,
27] is an effective technique that decouples the training phase from the inference phase. In the training phase, for a given backbone network, reparameterization increases the model’s representational power by adding multiple branches or layers built from various neural network components to the backbone. In the inference phase, these added branches or layers are merged into the parameters of the backbone through equivalent transformations, which significantly reduces the number of parameters and the computational cost without affecting accuracy, thereby accelerating inference.
The reparameterization of the process described in Equation (3) is shown in
Figure 5. During training, for a convolution branch with kernel size $K \in \{1, 3\}$, input channel dimension $C_{in}$, and output channel dimension $C_{out}$, the weight matrix can be represented as $W \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$ and the bias as $b$. The BatchNorm (BN) layer contains an accumulated mean $\mu$, standard deviation $\sigma$, bias $\beta$, and scaling factor $\gamma$. Since convolution and BN are both linear operations during inference, they can be merged into a single convolution with weight $W' = W \cdot \gamma / \sigma$ and bias $b' = (b - \mu) \cdot \gamma / \sigma + \beta$. For the skip connection, BN is merged into a 1 × 1 identity kernel, which is then zero-padded to the common kernel size. After merging BN into each branch, the weights and biases of all branches are summed, giving the inference-time weight matrix $W_{rep} = \sum_{i=1}^{M} W'_i$ and bias $b_{rep} = \sum_{i=1}^{M} b'_i$, where $M$ is the number of network branches. In this way, the model’s parameters and computational costs are greatly reduced, and inference is also accelerated.
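The folding and merging steps follow directly from these formulas. The PyTorch sketch below shows one way they could be implemented; the function names are hypothetical and this is not the authors’ implementation, but it illustrates how BN statistics are absorbed into convolution weights and how parallel branches are summed at inference.

```python
import torch
import torch.nn as nn

def fold_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm layer into the preceding convolution.

    Implements W' = W * gamma / sigma and b' = (b - mu) * gamma / sigma + beta,
    where sigma is the accumulated standard deviation.
    """
    sigma = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / sigma                                  # gamma / sigma, per channel
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    b = (b - bn.running_mean) * scale + bn.bias
    return w, b

def merge_branches(folded):
    """Sum the BN-folded (weight, bias) pairs of M parallel branches into one convolution.

    All kernels are assumed to have been zero-padded to a common size beforehand
    (e.g. 1 x 1 identity kernels padded to 3 x 3).
    """
    w = torch.stack([wi for wi, _ in folded]).sum(dim=0)
    b = torch.stack([bi for _, bi in folded]).sum(dim=0)
    return w, b
```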
2.5. Sheep Face Recognition Method, SheepFaceNetRec
To address the low recognition accuracy and slow inference speed of the RetinaFace-MobilenetV1-0.25 model, this paper uses the EBlock as the basis of the feature extraction network to construct SheepFaceNetRec, a lightweight and high-accuracy sheep face recognition model. The overall structure of SheepFaceNetRec is shown in
Figure 9. SheepFaceNetRec uses two EBlocks and ten EBottlenecks to extract features at different scales. The EBlock is the efficient and fast basic module proposed in this paper and is used to build the backbone of the sheep face recognition model, improving the efficiency and effectiveness of feature extraction. The EBottleneck reduces the number of parameters and computations of the model while maintaining the accuracy of sheep face recognition. Feature maps of sizes 28 × 28 × 80, 14 × 14 × 96, 7 × 7 × 144, and 7 × 7 × 128 are concatenated and sent to the MS-FC layer to provide the model with different receptive fields for recognizing sheep faces of different sizes. The output of the MS-FC layer is then fed into the GDConv layer [32] to assign different attention to different positions of the feature map. The feature map is flattened into a 128-dimensional representation of the sheep face through the FC layer, and the L2 norm is used to map the sheep face features onto a unit hypersphere to eliminate the influence of differing feature magnitudes on the subsequent similarity computation.
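As a sketch of the tail of this pipeline, the GDConv layer, FC projection, and L2 normalization could be written as follows; the 448-channel input (from the multi-scale concatenation described later), the omission of the MS-FC layer, and the class name are simplifying assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Sketch of the recognition head: GDConv (depthwise conv with kernel equal to the
    7 x 7 spatial size), an FC layer projecting to a 128-D embedding, and L2
    normalization onto the unit hypersphere."""

    def __init__(self, c_in=448, embed_dim=128):
        super().__init__()
        self.gdconv = nn.Conv2d(c_in, c_in, kernel_size=7, groups=c_in, bias=False)  # 7x7 -> 1x1
        self.bn = nn.BatchNorm2d(c_in)
        self.fc = nn.Linear(c_in, embed_dim)

    def forward(self, x):                        # x: B x 448 x 7 x 7
        x = self.bn(self.gdconv(x))              # B x 448 x 1 x 1
        x = self.fc(x.flatten(1))                # B x 128
        return F.normalize(x, p=2, dim=1)        # unit-length sheep face embedding
```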
In this paper, the EBottleneck is proposed by mimicking the structure of the residual block in ResNet, reducing the number of parameters and computations of the model while maintaining recognition accuracy. The EBottleneck consists mainly of two EBlocks and one ECA module. The first EBlock serves as an expansion layer to increase the number of channels, while the second EBlock reduces the number of channels to match the residual connection, and a residual connection is added between the input of the first EBlock and the output of the second. Each layer is followed by a BN layer and a ReLU activation function, except that the second EBlock does not use the ReLU activation function. When down-sampling is required, a depthwise convolution layer with a stride of 2 is inserted between the two EBlocks; when the stride is 1, the part enclosed by the green dashed box in the EBottleneck is omitted. It should be noted that the BN used here is merged into the corresponding convolution layer during inference.
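A rough sketch of this block is shown below, reusing the CheapOpConv sketch above as a stand-in for EBlock and the ECA sketch that follows the next paragraph; the expansion ratio and the exact placement of the ECA module are assumptions made for illustration.

```python
import torch.nn as nn

class EBottleneck(nn.Module):
    """Sketch of the EBottleneck: expand EBlock -> optional stride-2 depthwise conv ->
    project EBlock (no ReLU) -> ECA, with a residual connection when stride == 1."""

    def __init__(self, c_in, c_out, stride=1, expand=2):
        super().__init__()
        c_mid = c_in * expand                               # assumed expansion ratio
        self.expand = nn.Sequential(
            CheapOpConv(c_in, c_mid), nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        # depthwise conv with stride 2, present only in down-sampling blocks
        self.down = (nn.Sequential(
            nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid)) if stride == 2 else nn.Identity())
        self.project = nn.Sequential(CheapOpConv(c_mid, c_out), nn.BatchNorm2d(c_out))  # no ReLU
        self.eca = ECA(c_out)                               # placement assumed
        self.use_res = stride == 1 and c_in == c_out

    def forward(self, x):
        out = self.eca(self.project(self.down(self.expand(x))))
        return x + out if self.use_res else out
```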
ECA [33] is an efficient channel attention module that avoids the adverse effect of dimensionality reduction on channel attention learning by using a local cross-channel interaction strategy without dimensionality reduction. This module improves the performance of the model with only a small increase in the number of parameters. The structure of ECA is shown in
Figure 8. It first applies global average pooling to each feature channel to obtain a vector whose dimension equals the number of channels. A one-dimensional convolution with an adaptive kernel size is then applied to this channel vector to obtain the weight of each channel, and finally the resulting weights are used to re-weight the original feature map channel by channel.
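A minimal PyTorch sketch of such an ECA module is shown below; the adaptive kernel-size rule follows the original ECA formulation, while the class and parameter names are illustrative.

```python
import math

import torch.nn as nn

class ECA(nn.Module):
    """Sketch of ECA: global average pooling, a 1-D convolution across channels
    (no dimensionality reduction), and a sigmoid gate producing channel weights."""

    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))    # adaptive kernel size
        k = t if t % 2 else t + 1                          # kernel size must be odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):                                  # x: B x C x H x W
        w = self.pool(x)                                   # B x C x 1 x 1
        w = self.conv(w.squeeze(-1).transpose(1, 2))       # 1-D conv across channels
        w = self.gate(w.transpose(1, 2).unsqueeze(-1))     # B x C x 1 x 1 channel weights
        return x * w                                       # channel-wise re-weighting
```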
During the sheep face recognition process, the size of sheep faces in the images varies, thus requiring different receptive fields. This paper borrows from [
34] and selects four feature maps at different scales, namely 28 × 28 × 80, 14 × 14 × 96, 7 × 7 × 144, and 7 × 7 × 128, to cover sheep faces of different scales. These feature maps are relatively small, so the computational cost is low, while their semantic information is rich. The 28 × 28 feature map is down-sampled with 4 × 4 average pooling to obtain a 7 × 7 feature map, and the 14 × 14 feature map is down-sampled with 2 × 2 average pooling to obtain a 7 × 7 feature map. Finally, these feature maps of the same spatial size but different channel numbers are concatenated along the channel dimension to form a 7 × 7 × 448 feature map.
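This pooling-and-concatenation step can be expressed compactly; the following sketch assumes NCHW tensors and uses illustrative function and variable names.

```python
import torch
import torch.nn.functional as F

def aggregate_multiscale(f28, f14, f7a, f7b):
    """Pool the 28x28 and 14x14 feature maps down to 7x7 and concatenate all four
    along the channel dimension, giving 7 x 7 x (80 + 96 + 144 + 128) = 7 x 7 x 448."""
    f28 = F.avg_pool2d(f28, kernel_size=4)    # 28x28x80 -> 7x7x80
    f14 = F.avg_pool2d(f14, kernel_size=2)    # 14x14x96 -> 7x7x96
    return torch.cat([f28, f14, f7a, f7b], dim=1)

# Example with random NCHW tensors:
# x = aggregate_multiscale(torch.randn(1, 80, 28, 28), torch.randn(1, 96, 14, 14),
#                          torch.randn(1, 144, 7, 7), torch.randn(1, 128, 7, 7))
# x.shape -> torch.Size([1, 448, 7, 7])
```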