3.2. Results and Discussion of the Experiments
The dataset was randomly divided into non-overlapping training and test sets in a 4:1 ratio: the training set contained 773 videos and the test set contained 227 videos. The 227 randomly selected test samples were fed into the I3D and TSN networks trained on the training set, and the recognition accuracy of each video category was obtained.
For the TSN model, if the number of video segments is 1, the networks degenerate into normal two-stream networks.
Table 3 shows the behavior recognition accuracy of TSN models with different backbone networks under different numbers of video segments. As Table 3 shows, recognition performance is best when the video is divided into three segments: the video information is fully extracted, and actions are modeled over the whole video. The experimental results reflect the superiority of the sparse temporal sampling strategy.
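As an illustration of the sparse sampling idea, the fragment below divides a video's frame indices into equal segments and draws one random snippet index from each. This is a minimal sketch: the function name and snippet granularity are ours, not the exact implementation used in the experiments.

```python
import random

def sparse_sample(num_frames, num_segments=3):
    """TSN-style sparse temporal sampling: split the frame indices into
    equal-length segments and draw one random index from each segment."""
    seg_len = num_frames // num_segments
    return [k * seg_len + random.randrange(seg_len)
            for k in range(num_segments)]

# Example: a 90-frame clip split into 3 segments, e.g. -> [12, 41, 77]
print(sparse_sample(90, 3))
```

With `num_segments = 1`, the function samples a single snippet from the whole clip, which corresponds to the degenerate two-stream case described above.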
Table 4 shows the performance of different network architectures for the two-stream convolutional networks. As Table 4 shows, recognition performance is best when ResNet101 is the backbone of both the spatial and temporal networks, which indicates that recognition accuracy can be improved by increasing network depth; the residual network resolves the degradation problem caused by an overly deep network. This also reflects the advantage of combining two-stream networks with deep networks. Moreover, the temporal networks are more accurate than the spatial networks, which underscores the importance of optical flow, whose motion characteristics and apparent invariance [36] benefit recognition. Introducing optical flow significantly improves the fused accuracy, which shows that the two streams are complementary.
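The fusion step can be sketched as a weighted average of the class scores produced by the two streams. In this minimal sketch the 1:1.5 spatial-to-temporal weighting follows common two-stream practice and is illustrative, not necessarily the exact setting of this paper:

```python
import numpy as np

def fuse_two_stream(spatial_scores, temporal_scores,
                    w_spatial=1.0, w_temporal=1.5):
    """Late fusion: weighted average of the class scores from the spatial
    (RGB) and temporal (optical-flow) streams; returns the class index."""
    fused = w_spatial * spatial_scores + w_temporal * temporal_scores
    return int(np.argmax(fused))

# Five behavior classes: feeding, lying, mounting, scratching, walking.
spatial = np.array([0.10, 0.30, 0.25, 0.20, 0.15])
temporal = np.array([0.05, 0.55, 0.15, 0.15, 0.10])
print(fuse_two_stream(spatial, temporal))  # -> 1 (lying)
```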
Table 5 shows the confusion matrix of the TSN model with ResNet101 as the feature extraction network. The recognition accuracy for feeding, mounting, and lying reaches 100%, while the accuracies for scratching and walking are 97.82% and 97.14%, respectively. The average recognition accuracy is 98.99%, so the model performs well in pig behavior recognition.
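For readers unfamiliar with how these figures are read off a confusion matrix, here is a hypothetical matrix consistent with the reported per-class percentages; the class sample counts and the targets of the two misclassifications are assumptions, not values taken from Table 5:

```python
import numpy as np

labels = ["feeding", "lying", "mounting", "scratching", "walking"]
# Rows are true classes, columns are predicted classes (counts assumed).
cm = np.array([
    [40,  0,  0,  0,  0],
    [ 0, 50,  0,  0,  0],
    [ 0,  0, 46,  0,  0],
    [ 0,  0,  0, 45,  1],   # 45/46 correct -> 97.82%
    [ 1,  0,  0,  0, 34],   # 34/35 correct -> 97.14%
])

per_class = cm.diagonal() / cm.sum(axis=1)   # per-behavior accuracy (recall)
print({l: round(a * 100, 2) for l, a in zip(labels, per_class)})
print(round(per_class.mean() * 100, 2))      # macro average -> 98.99
```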
For the I3D model, we compared two ways of initializing the feature extraction network parameters. The first expanded the 2D convolution kernel parameters pre-trained on ImageNet into 3D convolution kernel parameters, then applied I3D network parameters from the Kinetics dataset and fine-tuned the network on the pig video behavior dataset. The second initialized the feature extraction network completely at random and trained the network directly on the pig video behavior dataset. The loss curves of the two-stream convolutional networks are shown in Figure 8, and the accuracy curves during training are shown in Figure 9; for each stream, the curves of the two parameter initialization methods are drawn in the same graph for comparison.
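The kernel "inflation" used in the first initialization scheme can be sketched in a few lines of PyTorch (the function name is ours; the rescaling rule follows the published I3D bootstrapping idea): a pre-trained 2D kernel is repeated along a new temporal axis and divided by the temporal extent, so that a "boring" video of identical frames reproduces the 2D network's activations.

```python
import torch

def inflate_conv2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel (out, in, h, w) into a 3D kernel
    (out, in, t, h, w) by repeating it t times along the new temporal
    axis and dividing by t to preserve the response magnitude."""
    return w2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

# Example: inflate a 7x7 ImageNet-style kernel to a 7x7x7 I3D kernel.
w2d = torch.randn(64, 3, 7, 7)
print(inflate_conv2d_to_3d(w2d, 7).shape)  # torch.Size([64, 3, 7, 7, 7])
```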
As Figure 8 shows, for both the temporal and the spatial convnets, the networks with randomly initialized parameters exhibit large fluctuations in the loss value during the initial training phase; as the number of training iterations increases, they eventually converge, but more slowly than the pre-trained networks. According to Figure 9, the pre-trained networks also achieve higher recognition accuracy and faster convergence. From this analysis, applying pre-trained models and then fine-tuning their parameters on the new pig dataset accelerates convergence and yields high accuracy. The I3D networks were evaluated on the test dataset, and the results are shown in Table 6. Again, the temporal networks are more accurate than the spatial networks, indicating that the model is more sensitive to optical flow information; compared with either single-stream network, the two-stream network still performs better.
Table 7 shows the confusion matrix of the I3D model with pre-training. One scratching behavior was misidentified as walking behavior, and one walking behavior was misidentified as feeding. The average recognition accuracy is also 98.99%.
The experiments above show that both the TSN model with ResNet101 as the feature extraction network and the pre-trained I3D model achieve a high accuracy of 98.99% in pig video behavior recognition. To compare the two models more comprehensively, we also compared the average recognition time per video; the result is shown in Figure 10. With the accuracy of the two models being equal, the average recognition time per video of the TSN network (ResNet101) is 0.8565 s less than that of the pre-trained I3D network. Therefore, the TSN model (ResNet101) offers good recognition efficiency and effectiveness for multi-behavior recognition of pigs, with good robustness to different pigpen environments, pig ages, pig body sizes, and lighting conditions.
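The per-video timing comparison can be reproduced with a simple wall-clock measurement. This is a sketch assuming a PyTorch-style callable network; `model` and `clips` are placeholders for a loaded network and a list of pre-processed test clips:

```python
import time
import torch

def avg_recognition_time(model, clips):
    """Average wall-clock inference time per video clip."""
    model.eval()                       # inference mode
    start = time.perf_counter()
    with torch.no_grad():              # no gradients needed at test time
        for clip in clips:
            model(clip)
    return (time.perf_counter() - start) / len(clips)
```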
There are many publications in the pig behavior recognition field.
The studies [17,18,19,20] are all pig behavior recognition studies based on computer vision techniques in which behavior feature extraction relied on human observation and design. In this paper, a deep learning method was adopted, so there is no need to manually design feature extraction methods; features are learned from the data automatically, and the learned features are more suitable and effective for behavior recognition. Viazzi et al. [17] treated manual feature extraction and subsequent action classification as two separate processes, whereas the deep learning approach in this paper is end-to-end: pig videos are input and behavior categories are output, seamlessly connecting feature extraction and classification. Kashiha et al. [18] and Nasirahmadi et al. [8] both fitted ellipses to pigs for image analysis and calculation, which depended on high-precision image segmentation and was susceptible to lighting, the contrast between pig and background, and complex backgrounds. Kashiha et al. [19] and Lao et al. [20] detected behavior through the distance between a part of the pig body and objects such as the drinking nipple and feeder, which depended on image processing and shooting conditions. The work in this paper is not affected by lighting, the contrast between pig and background, or complex backgrounds, and it does not need to perform image processing on video frames.
In the following studies, pig images were segmented using deep learning-based methods. Zheng et al. [2] and Yang et al. [22] used Faster R-CNN to recognize pig postures and feeding behaviors. Nasirahmadi et al. [16] evaluated three detectors, Faster R-CNN, SSD, and R-FCN, to recognize pig postures. Real-time recognition of sow drinking, urination, and mounting behavior has been achieved with an optimized target detection method based on SSD and MobileNet [24]. Li et al. [10] used Mask R-CNN to segment pigs from images and then extracted eigenvectors for mounting recognition. These methods extract spatial features from still images without considering the temporal features of behavior. Compared with still-image classification, the temporal component of video provides an additional and important clue, and behavior can be identified more reliably from temporal information. In this paper, the spatial stream networks process image frames to obtain spatial information and the temporal stream networks process optical flow to obtain motion information, so the two-stream convolutional networks extract the spatio-temporal information of the video for behavior recognition. According to the experimental results of this article, the temporal networks are more accurate than the spatial networks, which shows the importance of temporal features for recognition. In [23], sows were segmented from all frames with an FCN model to extract spatial features; temporal features were then designed and extracted, and a classifier was finally used to classify nursing behavior. The method in this paper extracts spatial and temporal features directly through training and is end-to-end; another advantage is that it can simultaneously identify five different behaviors that reflect the health and welfare of pigs.
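As a concrete example of the optical-flow input consumed by the temporal stream, the fragment below computes dense flow for one frame pair with OpenCV's Farneback method. This is illustrative only: two-stream pipelines often use TV-L1 flow instead, the video path is a placeholder, and the parameter values are OpenCV defaults rather than settings from this paper.

```python
import cv2

# Read two consecutive frames from a pig video (path is illustrative).
cap = cv2.VideoCapture("pig_clip.mp4")
_, frame1 = cap.read()
_, frame2 = cap.read()
cap.release()
prev = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense optical flow: per-pixel (dx, dy) displacements, which the
# temporal stream takes as a two-channel input.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
print(flow.shape)  # (H, W, 2)
```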