**1. Introduction**

Using machine vision and video surveillance equipment to automatically analyze the behavior of dairy cows, obtain their physiological and health information, and provide data support and decision-making bases for precision breeding and management has gradually become a research hotspot [1–4]. Machine vision systems have the advantages of no contact, low cost, and low stress [5]. However, the lack of stable and reliable image-based individual identification methods and technologies seriously restricts the promotion and use of these machine vision systems [6].

Radio frequency identification (RFID) is an individual identification method commonly used on large commercial dairy farms. RFID enables the recording of individual information, feeding information [7,8], and milk production [9,10] of dairy cows by reading tags attached to their bodies (usually ear tags) with wireless transmission technology. Compared with traditional methods, RFID improves both the reliability of the recorded information and the speed with which it is acquired. However, the workload of attaching and maintaining ear tags is substantial, and the process causes a stress response in the animals [11]. RFID performance is also affected by tag power, iron fences, and electromagnetic environments [12]. Additionally, when there are multiple cows in a recognition scene, individual information from multiple cows cannot be obtained simultaneously.

**Citation:** Zhao, K.; Zhang, R.; Ji, J. A Cascaded Model Based on EfficientDet and YOLACT++ for Instance Segmentation of Cow Collar ID Tag in an Image. *Sensors* **2021**, *21*, 6734. https://doi.org/10.3390/ s21206734

Academic Editor: Cosimo Distante

Received: 25 July 2021 Accepted: 7 October 2021 Published: 11 October 2021



Individual RFID recognition devices add complexity to machine-vision-based intelligent information perception systems. Therefore, scholars have begun to study individual biometric identification methods based on machine vision, in which recognition is realized mainly through the extraction and classification of the biological characteristics of cows [13]. The muzzle, iris, and face of a dairy cow carry information that can serve as recognizable biological characteristics of the individual. Different algorithms have been used to extract descriptive features from images of muzzles [14,15], irises [16,17], and faces [18], and machine learning techniques are then used to classify the feature vectors to achieve individual recognition. However, the acquisition of head images requires a special shooting environment. The shooting angle, the quality of light, and the cooperation of the cows all affect the details in the image and can reduce recognition accuracy.

Holstein cows are the most common cows on farms. The black and white patterns on their bodies can be used as biological features to identify individual animals. Zhao and He [19] proposed an individual cow recognition method based on a convolutional neural network: a 48 × 48 matrix extracted from the trunk image of a dairy cow served as the feature value, and a recognition model based on a convolutional neural network was constructed and trained. In their test, 90.55% of the images were correctly identified. Zhao et al. [6] proposed a cow recognition method based on template matching, in which a feature template library was generated by extracting the trunk image features of all cows, and individual cows were identified by matching their trunk image features against the library. Okura et al. [11] proposed a method for the individual identification of dairy cows based on RGB-D video: RGB images were used to obtain the texture features of the cows, and depth image videos were used to obtain their gait features; these two complementary features were used together to identify the cows. He et al. [20] proposed an individual identification method based on an improved YOLO v3 model. Images of cow backs were obtained with video frame decomposition technology, and a recognition model with optimized anchors and an improved network structure was constructed based on the Gaussian YOLO v3 algorithm. Yukun et al. [21] obtained images of the backs of dairy cows with moving cameras; while constructing an automatic system for scoring cow body condition, these authors established an individual recognition model of dairy cows based on a YOLO model and a convolutional neural network. Side-view or top-view images of walking dairy cows are easy to obtain: a video or image acquisition device can generally be placed near dairy cows during feeding, drinking, or milking, and the camera focus can be adapted to different recognition distances.
However, these methods are only applicable to Holstein cows with black and white patterns and do not solve the problem of identifying cows with uniformly colored bodies. In addition, the output dimension of the network corresponds to the number of cows in the herd. When the number of cows increases, the scale of the network increases accordingly, and once new cows join the herd, the entire network must be retrained.

An individual dairy cow identification model that works by detecting and recognizing the ID number on a collar tag requires simpler descriptive features and a smaller network than biometric identification. Tagging and maintenance also impose a lower workload than RFID and do not affect the welfare of the dairy cows. Specifically, the ID tag worn on the cow's neck is first located, and the ID numbers on the tag are then recognized to identify the cow. Zhang et al. [22] proposed an individual cow identification method based on collar ID tags: the tag is first located by a cascade detector combined with multi-angle detection, and character segmentation and character recognition are then performed on the tag image. However, this localization method cannot adapt well to the distortion and deformation of ID tags. Zin et al. [23] proposed a tracking system for individual cows using visual ear tag analysis: the head and ear tag are first detected, and the ear tag is then recognized by finding the four-digit area and performing digit segmentation and digit recognition. However, attaching an ear tag requires punching a hole in the cow's ear, which can easily cause a stress response and affects the welfare of the cow. The cascaded instance segmentation method we propose can adapt to the various deformations of ID tags, and wearing a collar has little effect on the physiology and psychology of dairy cows. A comparison of different identification methods for cows is provided in Table A1.

The detection of the ID tag is the first and key step in identifying individual cows, and its results directly affect the subsequent character recognition accuracy. If the ID tag is accurately segmented along its contour, the digit recognition task becomes similar to license plate character recognition, which has already achieved high accuracy in existing research [24,25]. At present, few studies have addressed cow collar tag detection, but there are many studies on and applications of license plate detection. Xie et al. [26] proposed a multidirectional license plate detection framework based on a CNN, which predicts the rectangular box and the corresponding rotation angle of the license plate. This method can handle the rotation of a license plate within a plane, but it cannot accommodate the tilt of the plate in three-dimensional space caused by the shooting angle. Xu et al. [27] proposed a method for locating irregular quadrilateral license plates. Their algorithm has two prediction branches: one predicts the bounding box containing the license plate area, and the other predicts four groups of vertex offsets corresponding to the four bounding box corners, so as to obtain the vertices of the irregular quadrilateral. This method is implemented on top of YOLOv3 with extended output dimensions. Kim et al. [28] proposed a two-step license plate location method that first detects the vehicle area and then locates the license plate within each vehicle area, which quickly filters out the complex background in an image. The license plates detected by these methods are rigid objects, whereas the four digit blocks on the ID tag we aimed to detect are attached to a flexible neck collar (to reduce the foreign body sensation experienced by the cow).

The flexible collar ID tag detection task must solve two key problems. First, in a side-view image of a walking cow, the animal is in continuous motion, so the tag is rotated and distorted in different planes, producing varying degrees of deformation. Second, when the cow is far from the image acquisition equipment, the pixel area of the tag is relatively small, which makes accurate detection difficult. Therefore, we constructed a cascaded model for instance segmentation of the targets. First, the EfficientDet-D4 [29] model detects the bounding box surrounding the ID tag, which effectively filters out most of the background in the image and makes the segmentation task more targeted. Then, the image in the bounding box is sent to the YOLACT++ [30] model, and the ID tag is accurately segmented along its contour to solve the tag deformation problem.
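The two-stage cascade described above can be sketched as follows. Here `detector` and `segmenter` stand in for trained EfficientDet-D4 and YOLACT++ models; their call signatures and the small box padding are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the cascade: detect the tag with one model,
# crop the detected box, then segment the crop with a second model.

def cascade_segment(image, detector, segmenter, pad=8):
    """Return (mask, box) pairs for every ID tag found in `image`.

    `image` is an H x W x 3 array; `detector(image)` is assumed to yield
    (x1, y1, x2, y2) boxes and `segmenter(crop)` a boolean mask.
    """
    results = []
    h, w = image.shape[:2]
    for (x1, y1, x2, y2) in detector(image):          # stage 1: bounding boxes
        # pad the box slightly so a tight detection does not clip the tag
        x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
        x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
        crop = image[y1:y2, x1:x2]                    # background filtered out
        mask = segmenter(crop)                        # stage 2: contour mask
        results.append((mask, (x1, y1, x2, y2)))
    return results
```

Cropping before segmentation is what lets the second model work on a target that fills most of its input, sidestepping the small-area problem.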

To accurately segment collar ID tags of cows, we conducted the following work: (1) To address the detection and recognition of a cow collar ID tag, we propose a high-precision cascaded model based on EfficientDet and YOLACT++ for instance segmentation, which overcomes the detection difficulty caused by the small area and large deformation of the tag. (2) We tested the performance of the EfficientDet-D0–D5 models in the ID tag detection task and analyzed the ability of the different models to detect small targets. (3) The YOLACT++ model with different backbone networks (ResNet50/ResNet101) and different numbers of prototype masks was used to segment ID tags, and the effects of these parameters on the accuracy and speed of the single-target segmentation task were analyzed. (4) The common two-stage segmentation models Mask R-CNN [31] and Mask Scoring R-CNN [32] and the one-stage instance segmentation model SOLOv2 [33] were used to segment ID tags, and the accuracy and speed of our proposed method and these methods were compared. (5) The robustness of the cascaded instance segmentation model to changes in tag area, tag deformation, and brightness was analyzed.

#### **2. Materials and Methods**

#### *2.1. Data Acquisition*

Experimental images were collected at Coldstream Research Dairy Farm, University of Kentucky, USA, in August 2016, and the subjects were Holstein cows during lactation. When a cow returned to her shed after milking, she passed through a flat, straight passage 2 m wide with four electric fences (two each at the front and rear) that limited her active area. A Nikon D5200 camera was mounted on a tripod 3.5–5 m from the passage and 1.5 m above the ground. The camera used a 35 mm lens and was set to ISO 400, autoexposure, and autofocus. As cows passed through the camera's field of view, pictures were captured continuously at fixed time intervals. The resolution of the images was 6000 (horizontal) × 4000 (vertical) pixels. Images were captured from 16:00 to 18:00 on sunny days under natural light and were stored on the camera's local memory card. One of the original images is shown in Figure 1a. The cows all wore collars, as shown in Figure 1b. Each collar carried four square blue plastic blocks with white numbers; the four-digit number was the only identity label of the cow.

To verify the background robustness of the cascaded model, we collected images of dairy cows wearing collar ID tags at the feed bunk of Sheng Sheng Farm, Luoyang, Henan Province, China. The images were captured from 9:30 to 11:30 under natural light on 16 September 2021. A cellphone (Xiaomi 10, Xiaomi Inc., Beijing, China) was used for hand-held shooting, with the camera set to autofocus and autoexposure. Images were captured from different angles while the cows were fixed at the fences. A total of 200 images were captured, involving 20 cows and four different collar IDs. The resolution of the images was 5792 (horizontal) × 4344 (vertical) pixels.

The images were screened to exclude those with no cow or those that were overexposed, leaving 1344 images for the experiment. Because the cows moved through the field of view at different speeds, the number of samples per individual differed. A total of 670 images of 36 cows were randomly selected as the training set, which included 788 tags; 268 images of 16 cows were randomly selected as the validation set, which included 321 tags; and the 406 images of the remaining 26 cows were used as the test set, which included 492 tags. The ratio of images in the training, validation, and test sets was approximately 5:2:3, and no individual appeared in more than one data set. The training set was used to fit the ID tag detection and segmentation models. The validation set was used to preliminarily evaluate the models and adjust their hyperparameters. The test set was used to evaluate the generalization ability of the final models.
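A split with these properties, a roughly 5:2:3 ratio taken over individuals so that no cow appears in more than one subset, can be sketched as follows; the function and its interface are hypothetical, not part of the paper:

```python
import random

def split_by_individual(images_by_cow, ratios=(5, 2, 3), seed=0):
    """Partition cow IDs into train/val/test so no individual is shared.

    `images_by_cow` maps a cow ID to its list of image paths; the split
    is performed over cows rather than images, mirroring the paper's
    36/16/26 grouping, so the image-level ratio is only approximate.
    """
    cows = sorted(images_by_cow)
    random.Random(seed).shuffle(cows)          # deterministic shuffle
    total = sum(ratios)
    n_train = round(len(cows) * ratios[0] / total)
    n_val = round(len(cows) * ratios[1] / total)
    groups = (cows[:n_train],
              cows[n_train:n_train + n_val],
              cows[n_train + n_val:])
    return [sum((images_by_cow[c] for c in grp), []) for grp in groups]
```

Splitting over individuals rather than images is what prevents near-duplicate frames of the same cow from leaking between training and test sets.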

#### *2.2. Data Labelling*

Labelme software (https://github.com/CSAILVision/LabelMeAnnotationTool, accessed on 22 October 2020) was used to annotate the data and build data sets in COCO format. Because the activities of the cows led to different degrees of tag deformation, the polygon mode was selected to label the targets. For the cascaded instance segmentation model in this paper, two annotation steps were required. Step (1): the tags in the original images were labelled to train and test the ID tag detection model. Step (2): the detection model trained in step (1) was used to detect ID tags in the training set and crop the detected bounding boxes; the cropped images were then labelled and used as the training set for the segmentation model. The same procedure was applied to the validation and test sets of the detection model to obtain the validation and test sets of the segmentation model, respectively. Because the resolution of the original images was very high (6000 × 4000 pixels), model training would have required a large amount of memory, so the images of the training and validation sets were compressed to 1200 × 800 pixels when training the ID tag detection model.
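The two image-preparation operations above, compressing originals for detector training and cropping detected boxes to feed the segmentation set, might look like the following Pillow sketch; the function name and box format are illustrative assumptions:

```python
from PIL import Image

DET_SIZE = (1200, 800)   # training resolution for the detection model

def prepare_images(img, boxes):
    """Prepare one original image for both stages of the cascade.

    `img` is a PIL image at original resolution; `boxes` are detected
    (x1, y1, x2, y2) tag boxes in original-pixel coordinates.  Returns
    the compressed detector-training image and full-resolution tag
    crops for the segmentation data set.
    """
    detector_img = img.resize(DET_SIZE)          # e.g. 6000x4000 -> 1200x800
    tag_crops = [img.crop(b) for b in boxes]     # keep full detail in crops
    return detector_img, tag_crops
```

Note that the crops are taken from the original image, not the compressed one, so the segmentation model sees the tag at full detail.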

#### *2.3. Cascaded Model for Instance Segmentation*

To solve the problems of the small area and the deformation of the cow collar ID tag in the images, a cascaded detection method was developed in this study. First, the detection model was used to detect the ID tag, and the image in the bounding box surrounding the ID tag was cropped as the input to the segmentation model. Then, the ID tag was accurately segmented along its contour using the instance segmentation model. For the detection model in the first step, since the target occupied only a small portion of the whole image, the feature extraction network was required to obtain both high-level semantic information and low-level spatial information. We also wanted the model to accept as large an input image resolution as possible to retain more feature information. EfficientDet is a scalable model architecture for object detection based on EfficientNet. EfficientDet-D0–D7 are obtained by the composite scaling of each part of the detection network. This composite scaling method makes it possible to balance accuracy and speed and to choose a suitable model. The BiFPN structure in the EfficientDet model enables the network to obtain rich semantic and spatial information about the target through the upsampling, downsampling, and weighted fusion of different feature layers. Therefore, we chose EfficientDet as the ID tag detection model and tested the performance of the different EfficientDet variants to select the optimal model.
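For orientation, the composite scaling that generates the D0–D7 family can be approximated as below. The formulas follow the scaling rules in the EfficientDet paper, but the published configurations round or adjust some values (notably input size and BiFPN width for the larger models), so treat this as a sketch of the scaling idea rather than the exact table:

```python
def efficientdet_scaling(phi):
    """Approximate compound-scaling rules for EfficientDet, where the
    compound coefficient phi indexes the family (D0 -> phi=0, ...,
    D7 -> phi=7).  All four quantities grow together from one knob."""
    return {
        "input_size": 512 + phi * 128,            # input resolution, grows linearly
        "bifpn_depth": 3 + phi,                   # number of BiFPN layers
        "bifpn_width": round(64 * 1.35 ** phi),   # BiFPN channels (before rounding
                                                  # to hardware-friendly sizes)
        "head_depth": 3 + phi // 3,               # conv layers in class/box heads
    }
```

Scaling one coefficient rather than tuning each dimension separately is what lets accuracy and speed be traded off along a single family of models.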

For the ID tag segmentation task, the segmentation result was the final output, and it directly affected the accuracy of the subsequent character recognition. We therefore wanted the mask along the edge of the tag to completely contain all the ID numbers without including redundant background. Because the image to be segmented was relatively small and the target area generally occupied more than half of the image, the segmentation difficulty was low, and a fully convolutional network could segment the ID tags efficiently. Therefore, we chose the real-time instance segmentation model YOLACT++, which is based on a fully convolutional network, to complete the tag segmentation task.

#### 2.3.1. EfficientDet Detection Model

EfficientDet uses EfficientNet as its backbone to extract feature maps. EfficientNet obtains EfficientNet-B0–B7 by scaling the baseline model while adjusting the depth, width, and input image resolution. As the baseline, EfficientNet-B0 is composed of 1 stem and 7 blocks, as shown in Figure 2a. The stem adjusts the number of channels through convolution, and each block includes several mobile inverted bottleneck convolution (MBConv) modules. The MBConv module combines inverted residuals with ResNet-style residual connections. First, a 1 × 1 convolution upgrades the dimension of the feature maps and a 3 × 3 or 5 × 5 depthwise separable convolution is performed, after which a simple channel attention mechanism is applied. Finally, a 1 × 1 convolution reduces the dimensionality of the feature maps, which are connected to the input side to form a residual structure. The channel attention mechanism effectively reduces redundant channel feature information, accelerates network training, and reduces the memory required for training. Based on EfficientNet-B0, EfficientNet-B1–B7 are obtained by changing the width coefficient, the depth coefficient, and the input image size.
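A minimal PyTorch sketch of such an MBConv block (1 × 1 expansion, depthwise convolution, squeeze-and-excitation channel attention, 1 × 1 projection, residual connection) is shown below; the layer sizes are illustrative rather than EfficientNet-B0's exact configuration:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Sketch of a mobile inverted bottleneck block: expand, filter
    depthwise, reweight channels, project back, and add the input."""

    def __init__(self, ch, expand=6, kernel=3, se_ratio=0.25):
        super().__init__()
        mid = ch * expand
        self.expand = nn.Sequential(                      # 1x1 dimension upgrade
            nn.Conv2d(ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        self.dw = nn.Sequential(                          # depthwise convolution
            nn.Conv2d(mid, mid, kernel, padding=kernel // 2,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU())
        squeezed = max(1, int(ch * se_ratio))
        self.se = nn.Sequential(                          # squeeze-and-excitation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, squeezed, 1), nn.SiLU(),
            nn.Conv2d(squeezed, mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(                     # 1x1 dimension reduction
            nn.Conv2d(mid, ch, 1, bias=False),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        h = self.dw(self.expand(x))
        h = h * self.se(h)            # channel attention reweights features
        return x + self.project(h)    # residual connection to the input
```

The residual add assumes input and output channel counts match; real EfficientNet blocks skip it when the stride or channel count changes.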

**Figure 2.** The structure of EfficientDet. (**a**) EfficientNet-B0, (**b**) BiFPN, (**c**) prediction head.

Simply put, BiFPN is an enhanced version of FPN. The feature extraction process of BiFPN is shown in Figure 2b. BiFPN mainly includes two parts: feature upsampling with weighted fusion, and feature downsampling with weighted fusion. After downsampling and adjustment of the number of channels, the feature maps extracted by EfficientNet are used as the input to BiFPN. First, the input features are upsampled and stacked, and then downsampled and stacked. In the next BiFPN layer, the feature layers of the previous stage are used as the input, and up- and downsampling and feature fusion are carried out again. This scheme of repeated up and down sampling with weighted fusion retains the spatial information of the ID tag while acquiring semantic information. From EfficientDet-D0 to EfficientDet-D7, BiFPN has an increasing number of layers, which means that the depth of the network increases and the extracted feature information becomes richer. However, as the network deepens, training and inference slow down.
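The weighted fusion step can be illustrated with BiFPN's fast normalized fusion, in which each input feature map is weighted by a learned non-negative scalar and the weights are normalized to sum to roughly one; in this sketch the weights are plain numbers rather than learned parameters:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps with normalized scalar weights.

    A cheap substitute for softmax weighting: ReLU keeps each weight
    non-negative, and dividing by the (epsilon-stabilized) sum makes
    the fused output a convex combination of the inputs.
    """
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))
```

Because the learned weights let the network favor the more informative resolution at each fusion node, spatial detail from shallow layers and semantics from deep layers are balanced automatically.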

The prediction head consists of two parts: the classification network and the prediction box regression network. The former assesses the category of the target, and the latter regresses the location of the target, as shown in Figure 2c. Before prediction, anchors are generated on the feature layers extracted by BiFPN. Through repeated separable convolutions, the classification branch and the prediction box regression branch generate 1 category parameter and 4 position adjustment parameters for each anchor, and finally obtain the location of the prediction box and the category of the target within it. From EfficientDet-D0 to EfficientDet-D7, the classification and box regression branches have different depths. When the EfficientDet head uses more separable convolutions, it may become less sensitive to small targets while acquiring deeper semantic information.

Compared with other EfficientDet network structures, the number of parameters of EfficientDet-D6 and EfficientDet-D7 is significantly larger. Considering the image resolution and detection efficiency, we did not consider the use of EfficientDet-D6 or EfficientDet-D7 in the ID tag detection task.

#### 2.3.2. YOLACT++ Segmentation Model

In the YOLACT++ instance segmentation model [30], a series of prototype masks and mask coefficients are generated by a fully convolutional network and fully connected layers, respectively, and the final mask is obtained as a linear combination of the two. Although YOLACT++ is a one-stage model, it also adds a fast mask re-scoring network to improve the segmentation accuracy of the mask, so the model offers excellent detection speed together with high segmentation accuracy.

The YOLACT++ model uses ResNet as its backbone and a feature pyramid network (FPN) to construct feature maps P3, P4, P5, P6, and P7 with different sizes and advanced semantic information, as shown in Figure 3a. To adapt to the different scales and deformations of the target, a deformable convolutional network (DCN) is introduced into ResNet. The prototype generation branch (Protonet) takes the P3 layer of the FPN as its input (Figure 3c) because the P3 layer, as a deep backbone feature, has high resolution and can produce high-quality masks. Protonet is a fully convolutional network (FCN) composed of 3 × 3 and 1 × 1 convolution layers. Protonet predicts *k* prototype masks for the image, and every final predicted mask is a linear combination of these *k* prototype masks. The prediction head takes the five feature maps (*Pi*) output by FPN as its input and uses fully connected layers to generate three branches: one predicts the confidence of the target belonging to *c* categories, the second predicts the four position regression parameters of the bounding box, and the third predicts *k* mask coefficients (*k* corresponds to the number of prototype masks), as shown in Figure 3b. Then, non-maximum suppression (NMS) is carried out according to the predicted bounding boxes and the corresponding category confidences. The linear combinations of prototype masks and corresponding mask coefficients are the results of instance segmentation. These operations can be implemented efficiently using a single matrix multiplication and a sigmoid:

$$\mathbf{M} = \sigma\left(\mathbf{P}\mathbf{C}^T\right) \tag{1}$$

where **P** is an *h* × *w* × *k* matrix of prototype masks and **C** is an *n* × *k* matrix of mask coefficients for the *n* instances that survive NMS and score thresholding. Finally, the masks are cropped with the predicted bounding boxes.
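A NumPy sketch of Equation (1), assembling the soft masks for *n* instances from the *k* prototypes and their coefficients:

```python
import numpy as np

def assemble_masks(prototypes, coeffs):
    """Implement M = sigma(P C^T) from Equation (1).

    `prototypes` is the h x w x k prototype tensor and `coeffs` the
    n x k coefficient matrix for the n instances surviving NMS; the
    output is an h x w x n stack of soft (0..1) instance masks.
    """
    h, w, k = prototypes.shape
    logits = prototypes.reshape(-1, k) @ coeffs.T    # (h*w, n) matrix product
    masks = 1.0 / (1.0 + np.exp(-logits))            # elementwise sigmoid
    return masks.reshape(h, w, -1)
```

The single matrix multiplication is why mask assembly is cheap at inference time: all instances share the same prototype tensor and differ only in their coefficient vectors.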
