1. Introduction
With the growth of the world’s population, the demand for food, especially meat, milk, and eggs, has steadily increased, and with it the need to raise livestock production. In recent years, the importance of Precision Livestock Farming (PLF) has grown globally. PLF is a system for monitoring and managing livestock production that focuses on improving animal welfare and optimizing production [1]. Within PLF, there is a trend toward applying artificial intelligence to improve breeding efficiency and reduce costs [2].
Computer vision tasks are fundamental to efficient livestock management, and many deep-learning-based methods have been proposed for PLF across different livestock species. PLF manages animals as individuals, and individual identification is the foundation of any management activity [3]. Hu et al. [4] applied the YOLO model to detect cows in images and then identified individual cows using a convolutional neural network and a support vector machine (SVM). Similarly, Shang et al. [5] used the Single Shot MultiBox Detector (SSD) network to preprocess the dataset and designed a loss function combining Triplet Loss and Label Smoothing Cross-Entropy Loss to identify sheep. In these approaches, a simple detection network located the targets before identification was performed. However, natural conditions introduce many interference factors, such as occlusion by obstacles and overlapping targets, which hinder individual identification, as shown in Figure 1a.
To further advance PLF, deep-learning-based research on behavior recognition, weight estimation, and related tasks has been proposed. To reduce the interference caused by noise and occlusion, these tasks commonly rely on detecting and segmenting the animals. Yang et al. [6] used fully convolutional networks (FCNs) to segment still images and extract spatial features, and applied motion analysis to spatio-temporal video to identify active and inactive behaviors of sows in loose pens. He et al. [7] applied detection followed by semantic segmentation to obtain more effective sheep region data for sheep weight estimation. The authors of [8] used a detection and segmentation algorithm to remove the background and extract pig features, and then used neural networks to estimate pig weight. However, because the segmentation algorithm is simple, its impact on the estimated results cannot be ignored.
These works share a common feature: they obtain the features needed for the downstream task through simple detection or segmentation. The quality of image segmentation directly affects the accuracy of feature extraction and of computer vision tasks [9]. It is therefore crucial to reduce the interference of external factors and to improve performance on visual tasks. High-quality image segmentation can significantly ease the difficulties of downstream tasks such as individual identification, behavior recognition, and weight estimation.
Image segmentation is divided into semantic segmentation [10,11], instance segmentation [12,13,14], and panoptic segmentation [15,16]. The study objects in livestock farming usually belong to the same category, so instance segmentation is more appropriate than semantic segmentation: it extracts the location and contour of each object of the same category without background interference. One-stage instance segmentation performs unsatisfactorily in livestock farming scenes because of the large number of highly overlapping instances and the noisy background. Two-stage instance segmentation detects potential targets first, which allows it to identify the location and size of each object more accurately before segmenting it. Because detection provides approximate boundaries between objects, the two-stage method separates objects better than the one-stage method in complex scenes where multiple instances overlap or occlude each other. Therefore, we investigated a high-performance two-stage instance segmentation network to address the difficulty of accurate segmentation caused by irregular fleece and highly overlapping sheep.
Most current instance segmentation work is based on the two-stage pipeline of Mask R-CNN [12]. Mask R-CNN adds a fully convolutional segmentation branch to Faster R-CNN [17]. It first extracts multi-scale features with a backbone and a Feature Pyramid Network (FPN) [18], then obtains region of interest (ROI) features, which the first stage uses for target classification and bounding-box regression, and finally performs the second stage of fully convolutional segmentation to obtain the mask. Qiao et al. [19] proposed an instance segmentation method based on the Mask R-CNN framework for cattle segmentation and contour extraction in real environments. The authors of [20] applied Mask R-CNN instance segmentation to dairy cows to analyze herd activity in a multi-camera setting. Dohmen et al. [21] applied Mask R-CNN to segment the regions of heifers in images to support body mass prediction. Xu et al. [22] achieved sheep behavior recognition with Mask R-CNN. Other works have also employed segmentation for computer vision tasks on sheep [23,24].
As can be seen, these instance segmentation studies are based on Mask R-CNN, and their research targets, such as pigs and cattle, have relatively smooth edges; the same approach is not satisfactory on sheep data. Because instances on the farm overlap heavily and have complex shapes, Mask R-CNN cannot produce accurate masks. This is because detection performance reaches a bottleneck in crowded sheep farm scenes, and much detailed information is lost through pooling operations. Moreover, the mask resolution is generally 28 × 28, which is far lower than that of the original image. By visualizing the results of Mask R-CNN (as shown in Figure 2), we found that the mask boundary has a wavy shape on our data, which is caused by insufficient information during repeated upsampling.
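To make the resolution gap concrete, the following minimal sketch computes how many image pixels a single cell of a 28 × 28 mask must cover once the mask is resized to its bounding box, and shows how a coarse edge becomes a staircase when enlarged. The box size, the toy mask, and both function names are illustrative assumptions. Nearest-neighbour upsampling is used only for simplicity; practical implementations typically interpolate and then threshold, which smooths the steps but cannot restore the missing detail.

```python
def pixels_per_mask_cell(box_width, box_height, mask_size=28):
    """Image pixels spanned by one mask cell along x and y after resizing."""
    return box_width / mask_size, box_height / mask_size


def upsample_nearest(mask, factor):
    """Nearest-neighbour upsampling of a 2D binary mask given as a list of lists."""
    out = []
    for row in mask:
        # Repeat every value `factor` times along x ...
        wide = [v for v in row for _ in range(factor)]
        # ... and every widened row `factor` times along y.
        out.extend(list(wide) for _ in range(factor))
    return out


# Hypothetical sheep box of 420 x 280 pixels: each mask cell covers 15 x 10 pixels.
print(pixels_per_mask_cell(420, 280))  # (15.0, 10.0)

# A diagonal edge in a tiny mask turns into a staircase of 3 x 3 blocks.
for row in upsample_nearest([[1, 0], [1, 1]], 3):
    print(row)
```

The larger the object relative to the fixed mask size, the more pixels each mask cell has to explain, which is one way to read the loss of boundary detail described above.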
Based on an analysis of practical application scenarios and data content, our work aims to finely locate and segment the contours of sheep in the working area of the farm, in both indoor and outdoor environments. We found that RefineMask [25] offers an approach to the problems of information loss and low mask resolution, producing high-quality segmentation results through multi-stage boundary refinement. Although RefineMask worked well compared with Mask R-CNN and other methods, we found that problems remain on sheep data. Figure 2a shows that, when RefineMask processes images with highly overlapping instances, the network cannot clearly detect each complete instance, so the detection performance needs to be enhanced. As can be seen in Figure 2c, RefineMask cannot segment the entire instance correctly because the relevant features are not attended to. Building on RefineMask, we made the following improvements: (1) to extract better features, we replaced the ResNet [26] backbone with ConvNeXt-E, which is obtained by adding the Efficient Channel Attention (ECA) [27] module to ConvNeXt-T [28]; (2) we changed the detector to Dynamic R-CNN [29] and added shared convolutional layers to improve the representational ability of the detection network; (3) we added a spatial attention module (SAM) [30] to both the mask head and the semantic head of RefineMask, so that the segmentation network attends to effective features and suppresses noise.
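As an illustration of the channel attention used in improvement (1), the sketch below follows the general ECA recipe: global average pooling per channel, a small 1D convolution across neighbouring channels, a sigmoid gate, and channel-wise rescaling. It is written in plain Python on nested lists for readability. The function name `eca`, the hand-set kernel, and the toy feature map are assumptions for illustration; in a trained network the kernel weights are learned, and how the module is wired into ConvNeXt-T is not shown here.

```python
import math


def eca(features, kernel):
    """Efficient Channel Attention on a C x H x W feature map (nested lists).

    `kernel` holds the odd-length 1D convolution weights shared by all channels.
    Returns the rescaled feature map and the per-channel gates.
    """
    c = len(features)
    # 1) Global average pooling: one descriptor per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in features]
    # 2) 1D convolution across neighbouring channels, with zero padding.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    mixed = [sum(w * padded[i + j] for j, w in enumerate(kernel)) for i in range(c)]
    # 3) Sigmoid gate per channel.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]
    # 4) Rescale every value of a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(features, gates)]
    return scaled, gates


# Toy input: 2 channels of size 1 x 2; an identity-like kernel for clarity.
feats = [[[2.0, 2.0]], [[0.0, 0.0]]]
scaled, gates = eca(feats, kernel=[0.0, 1.0, 0.0])
print(gates)  # the second gate is 0.5 because its pooled value is 0
```

Because the gate depends only on a few neighbouring channel descriptors, the module adds very few parameters, which is consistent with the low-cost motivation stated for the backbone change.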
In this paper, we constructed a sheep instance segmentation dataset on a real farm as the basis for our research. In addition, the Copy-Paste [31] data augmentation method was applied during training to make full use of images of single sheep, enriching the dataset and improving the generalization performance of the model. We propose SheepInst, a high-performance instance segmentation algorithm for sheep data in livestock farming that focuses on boundary segmentation quality; it provides high-quality features for subsequent tasks and offers a solution for PLF. Moreover, our work provides a useful reference for other research and applications involving sheep.
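The core idea of Copy-Paste augmentation can be sketched as follows: the pixels of one segmented sheep are pasted onto another image, a new instance mask is added, and existing masks lose the pixels that the pasted sheep now hides. This is a simplified illustration on small nested lists. The function name `copy_paste`, the grayscale toy images, and the fixed paste position are assumptions; the random scaling, flipping, and blending that a full augmentation pipeline may use are omitted.

```python
def copy_paste(src_img, src_mask, dst_img, dst_masks, top, left):
    """Paste the masked source object onto a copy of the destination image.

    Images are 2D lists of pixel values; masks are 2D lists of 0/1.
    Returns the new image and the updated list of instance masks.
    """
    h, w = len(dst_img), len(dst_img[0])
    out_img = [row[:] for row in dst_img]
    new_mask = [[0] * w for _ in range(h)]
    for y, mask_row in enumerate(src_mask):
        for x, m in enumerate(mask_row):
            ty, tx = top + y, left + x
            # Copy only object pixels that land inside the destination image.
            if m and 0 <= ty < h and 0 <= tx < w:
                out_img[ty][tx] = src_img[y][x]
                new_mask[ty][tx] = 1
    # Occlusion handling: existing instances lose the pixels now covered.
    updated = [
        [[v & (1 - p) for v, p in zip(row, paste_row)]
         for row, paste_row in zip(mask, new_mask)]
        for mask in dst_masks
    ]
    return out_img, updated + [new_mask]


# Toy example: a 2 x 2 "sheep" pasted at (1, 1) into a 3 x 3 image.
image, masks = copy_paste(
    src_img=[[9, 9], [9, 9]],
    src_mask=[[1, 0], [1, 1]],
    dst_img=[[0, 0, 0], [0, 0, 0], [0, 0, 0]],
    dst_masks=[[[1, 1, 0], [1, 1, 0], [0, 0, 0]]],
    top=1,
    left=1,
)
print(image)  # [[0, 0, 0], [0, 9, 0], [0, 9, 9]]
```

Updating the occluded masks matters in this setting because pasted sheep deliberately create the overlapping cases that the model must learn to separate.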
The rest of this paper is organized as follows. Section 2 describes the image collection and augmentation methods, and also presents an overview of the methods, details of the improvements, and the experimental details. Section 3 introduces the evaluation metrics and the experimental results of the proposed method. Section 4 discusses the application value of our work and future research directions. Finally, Section 5 summarizes the proposed method.
4. Discussion
In sheep farms, there is a trend toward applying artificial intelligence technologies to manage sheep individually, improving farming efficiency and reducing labor costs. However, when deploying technologies such as individual identification, behavior recognition, and weight estimation, it proved difficult to distinguish individuals and to improve task accuracy. The wool of Hu sheep is more disorganized, irregular, and dense than that of other sheep. Moreover, the tendency of sheep to congregate causes significant overlap of targets, which, together with interference from varying lighting and backgrounds, poses a challenge for vision tasks. Few instance segmentation methods exist for Hu sheep, and there is still room for exploration and improvement. The purpose of this paper is to address these problems.
We proposed SheepInst, a two-stage instance segmentation method based on RefineMask, to achieve high-performance detection and segmentation of sheep. Compared with the baseline (RefineMask with ResNet-50), SheepInst improved the box AP, mask AP, and boundary AP by 19.7%, 8.3%, and 10.1%, respectively. When the baseline backbone was changed to ResNet-101, the improvements were 7.9%, 4%, and 3.5%, respectively. These gains were obtained at low additional cost, and they result from the following. (1) The improved backbone can focus on crucial features that support high-performance detection and segmentation. (2) We introduced new training strategies and modified the network structure of the object detection branch, achieving high-accuracy detection that supports high-quality segmentation. (3) The added spatial attention modules guide the segmentation network toward meaningful learning, so that it focuses on effective information and the quality of the segmentation boundary improves.
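For readers less familiar with the metrics, the sketch below illustrates the overlap measures that underlie mask AP and boundary AP. It assumes that boundary AP follows the Boundary IoU idea of comparing only the pixels near each mask contour. The 1-pixel band, the function names, and the toy masks are illustrative assumptions, not the evaluation code used in this work.

```python
def mask_iou(a, b):
    """Intersection over union of two binary masks (2D lists of 0/1)."""
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(x | y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union if union else 0.0


def boundary_band(mask):
    """Mask pixels whose 3 x 3 window touches background or the image border."""
    h, w = len(mask), len(mask[0])
    band = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            window = [(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            interior = all(
                0 <= ny < h and 0 <= nx < w and mask[ny][nx] for ny, nx in window
            )
            band[y][x] = 0 if interior else 1
    return band


def boundary_iou(a, b):
    """IoU restricted to the 1-pixel boundary bands of both masks."""
    return mask_iou(boundary_band(a), boundary_band(b))


# Two 3 x 3 squares in a 5 x 5 image, shifted by one pixel.
gt = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
pred = [[1 if 1 <= y <= 3 and 2 <= x <= 4 else 0 for x in range(5)] for y in range(5)]
print(mask_iou(gt, pred))  # 0.5
print(boundary_iou(gt, pred))
```

In this toy case the boundary score is lower than the mask score for the same one-pixel shift, which illustrates why a boundary-focused metric is more sensitive to contour errors such as those caused by irregular fleece.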
Extensive experiments showed that SheepInst outperformed state-of-the-art models in both accuracy and generalization, and that the network focuses on the correct information and suppresses noise, which makes it better suited to sheep farm scenes. Even when sheep overlap heavily, our method precisely locates them and outputs high-quality masks, supporting downstream computer vision tasks. SheepInst offers a solution to the difficulty of using computer vision to achieve PLF in a real livestock environment. Other work on sheep can also use our method as a pre-processing step.
The limitations of our work, which also indicate directions for future improvement, are as follows. (1) Research is gradually shifting toward semi-supervised, weakly supervised, and unsupervised learning, with the common goal of achieving good segmentation performance from a small amount of manual annotation. The dataset labels in this paper are all manually annotated; in the future, we can focus on semi-supervised methods to obtain higher model performance at lower labor cost. (2) The dataset covers a limited range of sheep. It contains only one category of sheep; in the future, we can add other phenotypically different species, such as cashmere goats and milk goats, to increase the generalizability of the model.