#### *2.5.2. Training Parameter Settings*

The pre-training stage used 631 images containing 1723 annotated bud labels. The recognition performance of the YOLOv7 model was compared with that of four different attention mechanisms added at different positions in the network. The dataset was randomly divided into training, validation and test sets, with a training + validation : test ratio of 9:1 and a training : validation ratio of 9:1. During training of the improved YOLOv7 network, we carried out 50 epochs of freeze training followed by 300 epochs of unfreeze training. The batch size was set to 8, the initial learning rate to 0.001 and the minimum learning rate to 0.00001; we used the SGD optimizer with a momentum of 0.937. The learning rate was decayed dynamically with a cosine annealing schedule, and the Mosaic data augmentation method was turned off. The confidence threshold was set to 0.5 and the IoU threshold used for non-maximum suppression to 0.3. The loss function consisted of three parts: Reg (rectangular-box regression prediction), Obj (confidence prediction) and Cls (classification prediction). The Reg part uses CIoU Loss, while the Obj and Cls parts use BCE Loss (binary cross-entropy loss).
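The two-stage 9:1 split described above can be sketched as follows. This is a minimal illustration, not the authors' code; the `seed` value and the use of image indices in place of file paths are assumptions for the example.

```python
import random

def split_dataset(items, seed=0):
    """Randomly split items 9:1 into (train+val) vs. test,
    then split (train+val) 9:1 into train vs. val,
    following the ratios stated in the text."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    items = list(items)
    rng.shuffle(items)
    n_test = round(len(items) * 0.1)          # 10% held out as test set
    test, trainval = items[:n_test], items[n_test:]
    n_val = round(len(trainval) * 0.1)        # 10% of the remainder as validation
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

# Pre-training stage: 631 images -> 511 train / 57 val / 63 test
train, val, test = split_dataset(range(631))
print(len(train), len(val), len(test))  # → 511 57 63
```

Applying the same function to the 2049 formal-training images reproduces the corresponding split sizes for that stage.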

The formal training stage used 2049 images containing 6055 annotated bud labels. The parameter settings were the same as those in the pre-training stage. We selected the YOLOv7+CBAM and YOLOv7+ECA models, which had the best recognition performance in the pre-training process, and compared them with the YOLOv7 model in the formal training stage.

### *2.6. Evaluation Index*

In this study, precision (P) was used to represent the percentage of buds correctly identified by the model; recall (R) was used to represent the coverage of bud targets identified in the images; mean average precision (mAP), the average of the per-class average precision over all classes, was used to measure overall detection accuracy; the F1 score was used to evaluate the performance of the method by balancing the weights of precision and recall; and the frames per second (FPS), based on the detection time of a single image, was used to evaluate the actual bud recognition speed of the model. These parameters were used as evaluation indicators for the trained model. The relevant calculation formulas are as follows:

$$P = \frac{\text{TP}}{\text{TP} + \text{FP}} \times 100\% \tag{1}$$

$$\text{R} = \frac{\text{TP}}{\text{TP} + \text{FN}} \times 100\% \tag{2}$$

$$\text{F1} = \frac{2 \times \text{P} \times \text{R}}{\text{P} + \text{R}} \times 100\% \tag{3}$$

$$\text{mAP} = \frac{1}{C} \sum_{k=1}^{N} P(k) \Delta R(k) \tag{4}$$

In the formulas, true positives (TP) means that both the detection result and the true value are a famous tea bud; in other words, a famous tea bud is detected correctly. False positives (FP) means that the detection result is a famous tea bud but the true value is the background; in other words, a famous tea bud is counted incorrectly. False negatives (FN) means that the detection result is the background but the true value is a famous tea bud; in other words, a famous tea bud is not counted.

"TP + FP" refers to the total number of famous tea buds detected, and "TP + FN" refers to the total number of famous tea buds in an image. *C* is the number of categories, *N* represents the number of all pictures in the test set, *P*(*k*) represents the precision when *k* pictures can be recognized, and Δ*R*(*k*) represents the change of the recall value when the number of recognized pictures changes from *k* − 1 to *k* [20].

### *2.7. Results and Analysis*

(1) We randomly selected 631 images from the captured images to form the pre-training dataset, used it to conduct training under the same environment, and compared the recognition performance of the networks with different attention mechanisms added at different positions of the YOLOv7 network. The recognition effect parameters of the models are shown in Table 1. The changes in loss value and mAP of the YOLOv7+CBAM network during training are shown in Figure 9.


**Table 1.** Recognition effect parameters of different networks.

**Figure 9.** YOLOv7+CBAM network diagram. (**a**) Loss value change curve during training. (**b**) mAP curve change during training.

By comparing the recognition effects of different attention modules placed at different positions in the YOLOv7 network, it can be seen that adding CBAM Block or ECA Block improved YOLOv7 in precision, recall, F1 score and detection speed. Overall, these variants had a better recognition effect.

(2) In order to compare the influence of the number of training images on the network recognition effect, we used 1049 captured images and 1000 data-augmented images in the formal training phase, for a total of 2049 images in the dataset; the training parameters remained unchanged from the pre-training phase. We selected the networks with AB1 = CBAM and AB1 = ECA, and compared the recognition effects of the YOLOv7+CBAM, YOLOv7+ECA and YOLOv7 networks. The recognition effect parameters of the three networks are shown in Table 2.

**Table 2.** Identification effect parameters of different models.


(3) Bud images with occlusion, small targets and dense distribution were selected, respectively, and the detection effects of the improved networks based on CBAM Block and ECA Block were compared with those of the YOLOv7 network without an added attention mechanism, as shown in Figure 10. The improved YOLOv7 network based on CBAM Block had higher recognition accuracy and a lower rate of missed detections.

**Figure 10.** Comparison of recognition effects of different networks.
