#### *2.4. Training Platform and Parameter Settings*

#### 2.4.1. Training Platform

The software environment of our experimental platform was an Ubuntu 18.04 LTS 64-bit system. The programming language was Python 3.7. CUDA 10.1 and cuDNN 7.6.5 were used as the parallel computing architecture and GPU acceleration library for the deep neural network. We selected PyTorch 1.4 as the deep learning framework. The GPU was an NVIDIA GeForce GTX 1080Ti with 11 GB of memory. The CPU was a 3.50 GHz Intel(R) Core(TM) i7-7800X processor, and the working memory was 32 GB.

**Figure 3.** The structure of YOLACT++. (**a**) ResNet-FPN, (**b**) prediction head, (**c**) protonet.

#### 2.4.2. Training Parameters of EfficientDet and YOLACT++

First, the training set and validation set constructed in step (1) of Section 2.2 were used to train the ID tag detection model EfficientDet. AdamW was selected as the optimizer for model training, and the batch size was set to 2. The initial learning rate was set to 1 × 10<sup>−3</sup>. If the loss on the validation set was less than 0.1 within three epochs, the learning rate was reduced to 0.1 times its original value. The weight decay coefficient and momentum coefficient were set to 1 × 10<sup>−4</sup> and 0.9, respectively. The maximum number of iterations of all models was 1 × 10<sup>4</sup>. According to the statistics of the tag size in the scaled images, the anchor sizes were determined to be 4, 8, 16, 32, and 64. The K-means clustering algorithm was used to calculate the anchor ratios suited to our dataset, which were (0.7, 1.4), (1.0, 1.0), and (1.4, 0.7). The same training environment and training parameters were used to train EfficientDet-D0–D5 from pretrained models. After training, the performance of the different EfficientDet models was evaluated on the test set.
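For illustration, the anchor-ratio clustering could be implemented roughly as follows; the use of scikit-learn's KMeans, clustering in log-aspect-ratio space, and the area-preserving normalization are assumptions for this sketch rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_ratios(boxes_wh, n_clusters=3, seed=0):
    """Cluster ground-truth box shapes into anchor aspect ratios.

    boxes_wh: (N, 2) array of labeled tag box widths and heights in the scaled images.
    Returns (w_ratio, h_ratio) pairs normalized so that w_ratio * h_ratio == 1,
    matching the EfficientDet anchor convention.
    """
    aspect = boxes_wh[:, 0] / boxes_wh[:, 1]            # width / height per box
    km = KMeans(n_clusters=n_clusters, random_state=seed)
    km.fit(np.log(aspect).reshape(-1, 1))               # cluster in log space
    ratios = np.sort(np.exp(km.cluster_centers_.ravel()))
    # Express each ratio r as (sqrt(r), 1/sqrt(r)) so the anchor area is preserved.
    return [(float(np.sqrt(r)), float(1.0 / np.sqrt(r))) for r in ratios]

# Example: clustered ratios near 0.5, 1.0 and 2.0 map to roughly
# (0.7, 1.4), (1.0, 1.0) and (1.4, 0.7), as used in this study.
```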

The training set and validation set constructed in step (2) of Section 2.2 were used to train the ID tag segmentation model YOLACT++. SGD was selected as the optimizer for model training, and the batch size was set to 4. The initial learning rate was set to 1 × 10<sup>−4</sup>. The weight decay coefficient and the momentum coefficient were set to 1 × 10<sup>−4</sup> and 0.9, respectively. In the training process, the maximum number of iterations of the model was 1 × 10<sup>4</sup>. ResNet50 and ResNet101 were selected as backbones of YOLACT++ to compare the effects of different backbones on the accuracy and speed of the ID tag segmentation model. To study the influence of the number of generated prototype masks *k* on the segmentation accuracy and speed for a single target, the YOLACT++ models were trained with *k* = 4, 16, and 32.
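For reference, the optimizer settings above correspond to a PyTorch configuration like the following sketch; the placeholder network stands in for YOLACT++, and this is not the authors' training script.

```python
import torch

# Placeholder network standing in for the YOLACT++ model (not shown here).
model = torch.nn.Linear(10, 2)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-4,            # initial learning rate
    momentum=0.9,       # momentum coefficient
    weight_decay=1e-4,  # weight decay coefficient
)
max_iterations = 10_000  # maximum number of training iterations
batch_size = 4
```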

#### *2.5. Precision Evaluation Index of Model*

In this study, COCO detection evaluation indexes were used to evaluate the precision of the model. The intersection over union (IoU) is a value used to measure the degree of overlap between a prediction box and a groundtruth box, and its formula is:

$$\text{IoU} = \frac{S_p \cap S_g}{S_p \cup S_g} \tag{2}$$

where *S*<sub>*p*</sub> represents the area of the predicted bounding box and *S*<sub>*g*</sub> represents the area of the groundtruth bounding box. The IoU threshold is used to determine whether the content in the prediction box is a positive sample.
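As a minimal illustration of Equation (2), the IoU of two axis-aligned boxes can be computed as follows; the (x1, y1, x2, y2) box format is an assumption made for this example.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```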

For the target detection model, the commonly used evaluation indices are precision (*P*) and recall (*R*), and their calculation formulas are:

$$P = \frac{TP}{TP + FP} \tag{3}$$

$$R = \frac{TP}{TP + FN} \tag{4}$$

where *TP* represents the number of correctly predicted targets; *FP* represents the number of falsely predicted targets, that is, background mistaken for a positive sample; and *FN* represents the number of missed targets, that is, a positive sample mistaken for background. Confidence is an important indicator in target detection algorithms. For each prediction box, a confidence value was generated, indicating the credibility of the prediction box. Different combinations of *P* and *R* were obtained by setting different confidence thresholds. Taking *P* and *R* as the vertical and horizontal coordinates, respectively, the *P*–*R* curve could be drawn. When the IoU threshold was set to 0.5, the area under the *P*–*R* curve was AP<sub>IoU=0.50</sub> (AP50). When the IoU threshold was set to 0.75, the area under the *P*–*R* curve was AP<sub>IoU=0.75</sub> (AP75). AP was averaged over multiple IoU thresholds; specifically, we used the 10 thresholds 0.50:0.05:0.95. Averaging over multiple IoU thresholds reflects the performance of the model more comprehensively.
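The following sketch illustrates how AP can be computed as the area under the *P*–*R* curve and then averaged over the 10 IoU thresholds; it follows the COCO convention in spirit but is a simplified stand-in for the official evaluation code.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under a precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], np.asarray(recalls), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions), [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def coco_style_ap(ap_per_threshold):
    """Mean AP over the 10 IoU thresholds 0.50:0.05:0.95 (one AP value per threshold)."""
    return float(np.mean(list(ap_per_threshold)))
```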

From the statistics of the test results, for the first step of the ID tag detection task, only when the IoU of the prediction and groundtruth bounding box was greater than 0.5 could the prediction box contain all the numbers on a tag. Therefore, the AP<sub>IoU=0.50</sub> (AP50) and AP of the detected bounding boxes were selected as the evaluation indices of the accuracy of the tag detection model. For the second step of the ID tag segmentation task, only when the IoU of the prediction and groundtruth mask was greater than 0.75 could the prediction mask contain all the numbers on a tag without background. Therefore, the AP<sub>IoU=0.75</sub> (AP75) and AP of the segmented masks were selected as the evaluation indices of the accuracy of the tag segmentation model. For the proposed cascaded instance segmentation method, we multiplied the AP50 of the detection model and the AP75 of the segmentation model to obtain the overall accuracy of the cascaded instance segmentation method.
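Expressed as a formula, the overall accuracy of the cascaded method is:

$$\text{AP}_{\text{overall}} = \text{AP}_{50}^{\text{det}} \times \text{AP}_{75}^{\text{seg}}$$

where the superscripts denote the detection and segmentation stages, respectively.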

#### **3. Results**

#### *3.1. Training and Testing of EfficientDet*

During EfficientDet training, the loss decreased rapidly in the first 1000 iterations. Between 1000 and 6000 iterations, the loss showed no obvious convergence trend and oscillated continuously. After 6000 iterations, owing to the reduction in the learning rate, the loss began to converge again and finally reached a stable state. Therefore, for EfficientDet, reducing the learning rate in the late training stage effectively inhibited the loss oscillation of the model and accelerated the convergence rate. From EfficientDet-D0 to EfficientDet-D5, the training time gradually increased from 3 h to 43 h, indicating that the complexity of the network structure significantly affected the training time of the model.

To test the performance of the different EfficientDet models on the ID tag detection task, the original images (6000 × 4000 pixels) in the test set, which was constructed in step (1) of Section 2.2, were input to the trained EfficientDet-D0–D5 models for detection. Based on the detection results, we aimed to find the EfficientDet model that best balanced accuracy and speed. The AP<sub>IoU=0.50</sub> (AP50) and AP of the detected bounding boxes and the inference time per image were used as the evaluation indices. The test results are shown in Figure 4.

**Figure 4.** The precision and efficiency of EfficientDet-D0–D5. D0–D5 represent EfficientDet-D0–EfficientDet-D5, respectively.

As shown in Figure 4, from EfficientDet-D0 to EfficientDet-D4, the accuracy increased, indicating that the increased network depth and multiple BiFPN cycles significantly improved the extraction and expression of image features, while the inference time for a single image did not significantly increase. The main factor affecting the inference speed of the different EfficientDet models was their complexity. Although EfficientDet-D0–D4 had different complexities, their parameter counts were all within 5–20 million. For our ID tag detection task, these differences had less influence on the inference speed than the high resolution of the images. Thus, the inference time of the EfficientDet-D0–D4 models for a single image showed no obvious change.

Although EfficientDet-D5 has a wider and deeper network than EfficientDet-D4, its accuracy in the tag detection task was lower than that of EfficientDet-D4. This shows that for our small-target detection task, the spatial information of small targets gradually diminished once the network reached a certain depth, which led to a decrease in detection accuracy. In addition, the number of parameters of the EfficientDet-D5 model was 30 million, approximately 1.5 times that of EfficientDet-D4, so its inference time was longer than that of EfficientDet-D4. Therefore, we finally adopted the EfficientDet-D4 model, with its high accuracy and efficiency, as the ID tag detection model.

Figure 5 shows the detection results of EfficientDet-D0–D5 for some of the images in the test set. Because of the high image resolution, only the regions around the prediction boxes are shown, cropped from the original images.

Figure 5 shows that for EfficientDet-D0 and EfficientDet-D1, missed targets and inaccurate localization occurred frequently, indicating that the shallow network structure could not effectively extract the features of small targets in the image. For EfficientDet-D2 and EfficientDet-D3, there were few missed targets but many false detections. This indicates that the increases in network depth, width, and input image resolution improved the ability to extract features from small targets, but the semantic information extracted was not sufficient to accurately classify anchors. EfficientDet-D4 could not only accurately classify and locate small targets but also assigned higher confidence to correct classifications than the preceding models, thus accurately and efficiently completing the ID tag detection task. The confidence of the detection boxes of EfficientDet-D5 was high, but there were false samples near the target. This shows that as the network depth increased, the high-level semantic information could correctly classify anchors, but the low-level spatial information of small targets decreased, resulting in false detection boxes near targets. Thus, for small-target detection tasks, a reasonable network depth and width are the keys to simultaneously obtaining accurate semantic information and complete position information.


**Figure 5.** Some detection results of EfficientDet-D0–D5. Groundtruth represents the true bounding box in the image to be detected; D0–D5 represent EfficientDet-D0–EfficientDet-D5, respectively; the green boxes in the detection images represent the prediction results of the models; and ID represents the class of detected targets. For our ID tag detection task, ID is the only class. The number after ID represents the confidence of the corresponding detection box in % (not shown in some black-background images).

#### *3.2. Training and Testing of YOLACT++*

During training, the model converged rapidly in the first 500 iterations. From 500 to 6000 iterations, although the loss stabilized overall, there were still some large loss values. The loss stabilized after the 6000th iteration. The model with the ResNet101 backbone had a slightly longer training time than the model with the ResNet50 backbone. The greater the *k* value, the longer the training time. Compared with the *k* value, the backbone had more influence on the training time.

To study the influence of different parameters on the accuracy and detection speed of the ID tag segmentation model, after training was completed, the images of the test set constructed in step (2) of Section 2.2 were input to the YOLACT++ models with different parameters for segmentation. The AP<sub>IoU=0.75</sub> (AP75) and AP of the segmented masks and the detection speed were used as test indices. The test results are shown in Figure 6.

**Figure 6.** The precision and efficiency of YOLACT++. (**a**) The detection accuracy of the model with ResNet50 as backbone. (**b**) The detection accuracy of the model with ResNet101 as backbone. (**c**) The detection speed of the model with different parameters.

As shown in Figure 6a,b, the accuracy index AP75 of the models with different parameters reached nearly or exactly 100%, and the detection time for a single image was 0.25–0.34 s, indicating that the YOLACT++ model could quickly and accurately segment the ID tag through a linear combination of the prototype masks and the mask coefficients. As also shown in Figure 6a,b, the accuracy of the YOLACT++ model using ResNet50 as the feature extraction network was higher than that using ResNet101. This result indicates that for simple segmentation tasks, the depth of ResNet50 was sufficient to extract the features of the target in the image. When ResNet50 was used as the backbone, reducing the number of generated prototype masks slightly improved the segmentation accuracy. This shows that for single-target segmentation, in which there is little background in the image, too many prototype masks interfere with accurate segmentation of a tag.
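The linear combination mentioned above can be summarized by the following sketch; the tensor names and shapes are illustrative assumptions rather than the exact YOLACT++ implementation.

```python
import torch

def assemble_masks(prototypes, coefficients):
    """Combine prototype masks with per-instance mask coefficients.

    prototypes:   (H, W, k) tensor of prototype masks from the protonet.
    coefficients: (n, k) tensor of mask coefficients, one row per detection.
    Returns an (n, H, W) tensor of instance masks in [0, 1].
    """
    # Linear combination P @ c^T followed by a sigmoid, as in YOLACT/YOLACT++.
    masks = torch.sigmoid(prototypes @ coefficients.t())  # (H, W, n)
    return masks.permute(2, 0, 1)                         # (n, H, W)

# Example with k = 4 prototypes, as chosen in this study:
# masks = assemble_masks(torch.rand(138, 138, 4), torch.rand(10, 4))
```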

In terms of detection speed, the YOLACT++ model with ResNet50 as its backbone was faster than the model with ResNet101, as shown in Figure 6c. When the backbone of YOLACT++ was ResNet50, reducing the *k* value slightly improved the detection speed. However, when the backbone of YOLACT++ was ResNet101, reducing the *k* value had little effect on the detection speed. This result indicates that, compared with the *k* value, the backbone had a greater impact on the detection speed. According to the test results, we finally decided to use ResNet50 as the backbone for feature extraction and to generate four prototype masks. As a result, the overall accuracy of the cascaded model based on EfficientDet-D4 and YOLACT++ was 96.5%, and the total detection time for a single image was 1.92 s. Figure 7 shows some test results for the YOLACT++ model. The predicted masks cleanly surrounded the numbers on the tags with relatively high confidence. The results were robust to rotation of the tag, changes in brightness, and interference around the tag.

**Figure 7.** Some segmentation results of YOLACT++. ID represents the identified class name, and the number after ID represents the confidence of the predicted mask.

#### **4. Discussion**

#### *4.1. Comparison with Common Instance Segmentation Model*

The proposed cascaded detection model was compared with two common two-stage models, Mask RCNN and Mask Scoring RCNN, and the one-stage model SOLOv2. Mask RCNN and Mask Scoring RCNN are two-stage detection models based on a region proposal network; their detection accuracy is high, but their detection speed is slow. SOLOv2 is a one-stage, box-free instance segmentation model that predicts masks directly by location; its detection accuracy is slightly lower, but its speed is fast. We used the same training set, validation set, and test set to train and test the accuracy and detection time of the different models in the same operating environment. The results are shown in Table 1.

**Table 1.** The precision and efficiency of the cascaded model and other models.


The overall accuracy of our proposed cascaded instance segmentation model is 96.5%, and its detection time for a single image is 1.92 s. The two-stage instance segmentation model Mask RCNN accurately locates and segments most ID tags with high accuracy; its segmentation index AP75 is 85.3%, but its detection time for a single image is 2.63 s, which is slightly longer. Mask Scoring RCNN, with its re-scoring branch, performs worse than Mask RCNN on our ID tag segmentation task; its segmentation index AP75 is 58.2%, and its detection time per image is 3.39 s, which is longer than that of Mask RCNN. As a one-stage instance segmentation model, SOLOv2 has a short detection time of 1.21 s. However, its segmented masks fit the edges of the tags only roughly, with tortuous contours, so its segmentation accuracy is low, at 18.5%. In most cases, the masks with tortuous contours contain some background outside the tags.

The detection results for some images are shown in Figure 8. Three situations with high segmentation difficulty are depicted in the figure. In the first, the brightness of the tag is too low and there are multiple targets in the image. In the second, interference around the tag has characteristics similar to those of the characters (such as white chains). In the third, the brightness of the tag is too high and the borders of the character blocks are also visible. In all three cases, our method accurately segments the ID tag from the complex background. In contrast, the other models are prone to location offsets, missed characters, inclusion of redundant background, and even complete failure to detect the tag. Therefore, compared with existing two-stage and one-stage segmentation models, our proposed cascaded instance segmentation method achieves high-precision ID tag segmentation in complex environments, exhibits strong robustness, and overcomes the detection difficulty caused by the small area and large deformation of the tag.

#### *4.2. Deformation and Brightness Robustness*

To analyze the performance in detecting targets with different areas, we quantified the results of detecting ID tags with different areas using EfficientDet-D4, as shown in Figure 9. As seen from Figure 9, the proportion of the ID tag area to the whole image was only 0.02–0.09%, which is representative of a small target that is difficult to detect, but the model still achieved a high detection rate. By observing the tags of different areas, we found that the rotation and distortion of the tag were the main reasons for the change in area, and the distance between the cow and the image acquisition equipment was a secondary reason. The larger the area of the ID tag, the larger the deformation of the tag. The detection accuracy of ID tags in intervals (5) and (6) was low for two main reasons: (1) the total number of samples in these two intervals was small, so even a small number of false detections had a relatively large impact on the results; (2) the deformation of the tag led to an increase in false detection boxes that overlapped with the target but did not fully contain the numbers on the tag. There were only three ID tags in interval (7), so it had very few samples. When drawing its *P*–*R* curve, the precision and recall were both 100% with the confidence threshold set to 0.7, so its AP50 was 100%.

**Figure 8.** Some segmentation results of the cascaded model and other models. The ID above each detection result and the number after the ID represent the class and the confidence of the prediction mask, respectively. (**a**–**c**) correspond to the three difficult situations. (**a**) The brightness of the tag is too low, and there are multiple targets in the image. (**b**) Distractions around the ID tag. (**c**) The brightness of the tag is too high.

**Figure 9.** The precision of ID tags with different areas. The abscissa represents the proportion of the bounding box area of the ID tag to the whole image; (**1**) to (**7**) represent seven intervals from 0.02% to 0.09% in increments of 0.01%.

To analyze whether the model is robust to different types of deformation, we divided the deformation of the ID tag into three types (Figure 10): (1) the cow was walking slowly or standing still, and the tag was only rotated or slightly distorted; (2) the cow was walking quickly or had lowered its head, and the tag was rotated; (3) the cow's head was twisted, and the tag was both rotated and distorted. The detection results of EfficientDet-D4 for ID tags with the different types of deformation were statistically analyzed (Table 2). Table 2 shows that the accuracy was highest when the tag was only slightly rotated or distorted, lower when the tag was rotated, and lowest when the tag was both rotated and distorted. However, rotation and distortion together reduced the accuracy by only 2.9%. Therefore, regardless of the state of the ID tag, the model achieves a high detection rate and has high robustness to different types of deformation.

**Figure 10.** The three types of deformation of ID tags: (**a**) slight rotation or distortion; (**b**) rotation; (**c**) rotation and distortion.


**Table 2.** ID tag detection accuracy with different types of deformation.

The cascaded instance segmentation method proposed in this paper can also adapt to varying brightness of the target. Figure 11 shows the detection results for some ID tags under different lighting conditions.

**Figure 11.** Some detection results under different light conditions.

#### *4.3. Background Robustness*

Without retraining the model, the dairy cow images collected at Sheng Sheng Farm were passed through the cascaded model for detection and segmentation. The results showed that the AP50 of the EfficientDet-D4 model was 94.1% and the AP75 of the YOLACT++ model was 100%. Some test results are shown in Figure 12. Even in these different scenarios, the model achieved high accuracy. Therefore, we conclude that the constructed and trained cascaded instance segmentation model is strongly robust to different backgrounds and has promising application prospects.

**Figure 12.** Detection and segmentation results with different backgrounds.

#### *4.4. Analysis of False and Missed Detections*

Since the segmentation index AP75 of the YOLACT++ model was 100%, we only analyzed the false and missed detections of ID tags by EfficientDet-D4. After the statistics were compiled, no ID tags were found to be missed. False detections mainly included two cases: (1) a tree branch in the background was mistaken for the target with relatively high confidence, as shown in Figure 13a; (2) a bounding box overlapped with the tag but did not completely contain the numbers on the tag, as shown in Figure 13b. The reason for the first type of false detection may be that the high-level semantic features of the branches in this region were coincidentally similar to those of the ID tag, which led the model to misjudge the branches as the target. The reason for the second type may be that part of the ID tag was included in these false bounding boxes, which prevented the network from making correct judgements; alternatively, these false bounding boxes were not filtered out when NMS was carried out. The confidence of the false bounding boxes was generally low, so they could be filtered out by setting a confidence threshold.

**Figure 13.** Some of the false detection results. (**a**) A tree branch mistaken as a target. (**b**) The bounding box partly overlapped with the target.

#### *4.5. ID Number Recognition*

ID number recognition is completed by character segmentation and character recognition. The purpose of character segmentation is to divide the color tag image into four binary images, each containing a single character; the segmentation consists of several simple image processing steps, which are illustrated in Figure 14. The character recognition model, whose purpose is to classify the single-character images, is a simple convolutional neural network constructed based on LeNet-5 [34]: we changed the C5 layer of LeNet-5 from a fully connected layer to a convolutional layer in order to reduce redundant parameters and enrich the features. The unsegmented images detected by the EfficientDet-D4 model and the images segmented by the cascaded model proposed in this paper were each passed through the character segmentation and character recognition models.
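A minimal sketch of the modified character recognition network described above is shown below; the input size (32 × 32 single-channel character images), channel widths, and the ten-digit class count are assumptions made for illustration, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class CharNet(nn.Module):
    """LeNet-5-style digit classifier with the C5 stage expressed as a convolution."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),    # C1: 32x32 -> 28x28
            nn.MaxPool2d(2),                              # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),   # C3: 14x14 -> 10x10
            nn.MaxPool2d(2),                              # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(), # C5 as a 5x5 convolution -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):  # x: (batch, 1, 32, 32) binary character image
        return self.classifier(self.features(x))
```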

**Figure 14.** The process of character segmentation: (**a**) graying; (**b**) grey-level transformation; (**c**) binary segmentation; (**d**) morphological processing and removal of redundant background; (**e**) character cropping; (**f**) character normalization.
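For illustration, the steps in Figure 14 could be sketched with OpenCV roughly as follows; the Otsu thresholding, kernel size, character polarity, and 32 × 32 output size are assumptions rather than the parameters actually used in this study.

```python
import cv2
import numpy as np

def segment_characters(tag_bgr, char_size=(32, 32), max_chars=4):
    """Split a segmented tag image into normalized single-character binary images."""
    gray = cv2.cvtColor(tag_bgr, cv2.COLOR_BGR2GRAY)                # (a) graying
    stretched = cv2.normalize(gray, None, 0, 255, cv2.NORM_MINMAX)  # (b) grey-level transformation
    _, binary = cv2.threshold(stretched, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # (c) binary segmentation
    kernel = np.ones((3, 3), np.uint8)
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)      # (d) morphological processing
    contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Keep the largest regions as character candidates, then order them left to right.
    boxes = sorted((cv2.boundingRect(c) for c in contours),
                   key=lambda b: b[2] * b[3], reverse=True)[:max_chars]
    chars = []
    for x, y, w, h in sorted(boxes, key=lambda b: b[0]):
        char = cleaned[y:y + h, x:x + w]                            # (e) character cropping
        chars.append(cv2.resize(char, char_size))                   # (f) character normalization
    return chars
```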

Table 2 shows that approximately 76% of the ID tags in the dataset are rotated or twisted, so the corresponding detected bounding boxes contain different amounts of background. During binarization, the pale white body areas of cows and the grass in the background are misjudged as characters, which considerably interferes with the subsequent steps. If the brightness of the background exceeds that of the characters, some characters are lost because of the high threshold used in the binary segmentation. Figure 15a depicts the character segmentation results of some unsegmented images, and Figure 15b displays the character segmentation results of some segmented images. In Figure 15, the images of each group, from left to right, are the image after detection/segmentation, the image after binarization, and the images after character normalization. As can be seen from the figure, the character segmentation results of the unsegmented images are very poor because of the background contained in the detection bounding box, and their character recognition accuracy is clearly lower than that of the segmented images. The character recognition accuracy for the segmented images is 95.4%, which is 2.05% higher than that in [22]. This proves that segmenting the tag images to remove the redundant background can effectively improve character recognition.

#### *4.6. Future Studies*

Although the cascaded model based on EfficientDet-D4 and YOLACT++ achieves 96.5% segmentation accuracy, there is still room for improvement. For the false detection bounding boxes that overlap with the target but do not fully contain it, their union can be calculated and taken as the detected tag area, after which the segmentation model can remove the redundant background from the detection result. Alternatively, these false positives can be suppressed by better NMS methods, such as Fast NMS [35], which keeps the highest-confidence bounding boxes by computing the mutual suppression of all detection boxes in parallel. Compared with the traditional method, it allows already-removed detections to suppress other detections and requires less time.
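For reference, the core idea of Fast NMS can be sketched in a few lines of PyTorch, written from the description in [35]; this single-class version is illustrative and is not the authors' implementation.

```python
import torch

def pairwise_iou(a, b):
    """IoU matrix between two sets of (x1, y1, x2, y2) boxes."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def fast_nms(boxes, scores, iou_threshold=0.5):
    """Fast NMS (single class): any higher-scoring box may suppress a lower-scoring one,
    even if it has itself been suppressed, so the whole step is one matrix operation."""
    order = scores.argsort(descending=True)
    boxes, scores = boxes[order], scores[order]
    iou = pairwise_iou(boxes, boxes).triu_(diagonal=1)  # upper triangle: better vs. worse
    max_iou, _ = iou.max(dim=0)                         # worst overlap with any higher-scoring box
    keep = max_iou <= iou_threshold
    return boxes[keep], scores[keep]
```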

**Figure 15.** Character segmentation results of (**a**) unsegmented and (**b**) segmented images.

The detection speed of the cascaded model also needs to be improved. The detection time is mainly consumed by the EfficientDet-D4 detection stage. Because of the high resolution of the images, many anchors must be generated at different scales and then classified and regressed, which requires considerable time. In practical applications, if the size of the tag is known to lie within a certain range, the number of anchors generated at each grid point can be reduced, effectively simplifying the detection process. Additionally, the images contain a large amount of irrelevant background. If the target appears only within a specified spatial range, adding a spatial attention mechanism to EfficientDet can make the network pay more attention to the areas where the ID tag may appear. This would reduce the time spent extracting features from irrelevant background, thereby improving the detection efficiency.

#### **5. Conclusions**

This paper proposed a cascaded method for the instance segmentation of cow collar ID tags based on EfficientDet-D4 and YOLACT++, which accurately detects and segments targets with a small area. The detection accuracy AP50 of the EfficientDet-D4 model is 96.5%, the segmentation accuracy AP75 of the YOLACT++ model is 100%, and the overall segmentation accuracy is 96.5%. Compared with common instance segmentation models, the accuracy is improved by more than 11.2%. Changes in brightness and deformation of the tag have little effect on the detection accuracy of the proposed model. It shows high anti-interference capability and has the potential to be applied to remote, multi-target cow identification on dairy farms. In the future, we can optimize the structure of EfficientDet and propose a better NMS method to reduce false detections. Additionally, an attention mechanism and other strategies can be considered to reduce the time used by the feature extraction process and to improve the detection speed when an image has a large background area.

**Author Contributions:** Conceptualization, K.Z. and J.J.; methodology, K.Z. and R.Z.; software, R.Z. and K.Z.; validation, R.Z.; formal analysis, R.Z.; investigation, K.Z.; resources, J.J. and K.Z.; data curation, R.Z. and K.Z.; writing—original draft preparation, R.Z. and K.Z.; writing—review and editing, K.Z. and J.J.; visualization, R.Z.; supervision, J.J. and K.Z.; project administration, K.Z. and J.J.; funding acquisition, J.J. and K.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Key Research and Development Project of China (2019YFE0125600) and National Natural Science Foundation of China (32002227).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors thank Coldstream Farm, University of Kentucky, U.S., for their cooperation in data collection.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Table A1.** Comparison of different identification methods for cows.


#### **References**

