#### *2.2. Schematic Diagram of Method*

Compared with a controllable greenhouse environment, monitoring maize tassels in the field poses several challenges: (1) the background is complex (Figure 2a); (2) tassels vary in size, color and shape because of differences in light conditions, varieties and unsynchronized growth stages (Figure 2b); (3) other factors interfere, such as tassel pose variations caused by wind during image capture (Figure 2c); and (4) newly grown tassels have no distinctive shape and are similar in color to reflective leaves, which further increases the difficulty of tassel detection.

**Figure 2.** Challenges of maize tassel detection in the field. (**a**) Complex background: (1) soil background, (2) green leaf, (3) reflective leaf, (4) shadow. (**b**) Tassels vary in shape and size: (1), (2) tassels vary in shape, and (3) in size. (**c**) Tassel pose variations caused by wind.

In view of these problems and challenges, we propose an automatic tassel detection method that combines RF and the VGG16 network. Figure 3 shows the main process of this method. First, an RF classifier was used to perform pixel-based supervised classification of the UAV images. The advantage of this step is that tassels of any size can be found, but it also causes problems such as interference pixels and unconnected tassel regions. Therefore, morphological dilation was applied to the unconnected regions belonging to tassels and noise, and the resulting connected regions are called the potential tassel region proposals. During RF classification, pixels of other categories may be misidentified as tassels, so the potential region proposals contain some false connected regions. To reduce these false positives, we used the VGG16 network to re-classify the potential tassel region proposals and obtain accurate detection results. Finally, we explored how to extract the branch number of the detected tassels.

**Figure 3.** Schematic diagram of method.

#### *2.3. Potential Tassel Region Proposals by RF and Morphological Method*

Both the dynamic monitoring of tassel development in breeding fields and the arrangement of detasseling in seed maize production require timely tassel detection. Selective search is a common strategy in object detection and has become an essential element of fast detection. It extracts potential bounding boxes based on image segmentation and sub-region merging [33], which effectively reduces the high computational complexity and the many redundant bounding boxes of exhaustive search [11]. Following the idea of selective search, we propose a method for extracting potential tassel region proposals that suits our study. The method consists of two stages: the extraction of tassel regions by RF, and the generation of potential tassel region proposals by morphological methods.

#### 2.3.1. The Extraction of Tassel Regions by RF

In this study, we need not only to accurately identify the position of maize tassels, but also to extract their morphological characteristics. We therefore first used RF to separate tassels from the background. RF is an effective ensemble learning technique proposed by Breiman [34] in 2001; it reduces the correlation between decision trees through the random selection of features and samples. By combining multiple weak classifiers, the model achieves high precision and good generalization ability [35], and it has been successfully applied in biomedical science, agricultural informatization and other fields [36–39].

To account for the diversity of light conditions and the variation in tassel size and shape, 4 images were randomly selected from each of the 5 groups of UAV data, giving a total of 20 images for sample labeling. To minimize the influence of camera lens distortion, all these images were cropped to 25% of their original size (Figure 4b). A total of 3835 sample points were labeled by generating random points (Figure 4c), so that the samples represent the actual distribution of field environment types. The sample points were divided into 5 categories: leaves, tassels, field path, road (some images contained hardened roads), and background shadow. Tassel samples accounted for 6% of the total. To compare the influence of different tassel sample proportions on the classification results, additional tassel samples were labeled manually.

For each sample, a series of color features was calculated: R, G, B, H (hue), S (saturation), V (value), ExG (Equation (1)) and ExR (Equation (2)) [40,41]. The ratio of training set to test set was 2:1, and the proportion of tassel samples in the training set (4%, 6%, 10%, 15%, 20%, and 25%) was adjusted across multiple experiments. The results (Table 1) show that, as the proportion of tassel samples increased, the recall of tassels increased gradually while the precision decreased. This indicates that the classification model is biased toward categories with a large number of samples, and that the classification results are clearly affected by the sample proportion. In this study, once the proportion of tassel samples reached 10%, the identification performance of the model for tassels tended to be stable, and based on the overall accuracy (OA), a proportion of 15% was considered the best.

$$\text{ExG} = 2g - r - b \tag{1}$$

$$\text{ExR} = 2r - g - b \tag{2}$$

The normalized channel values *r*, *g*, and *b* used in Equations (1) and (2) are calculated as follows:

$$\begin{cases} r = \frac{R}{R+G+B} \\ g = \frac{G}{R+G+B} \\ b = \frac{B}{R+G+B} \end{cases} \tag{3}$$
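As a minimal sketch of how the eight color features could be computed for one sample point, the following plain-Python function applies Equations (1)–(3) and uses the standard-library `colorsys` module for H, S and V. The function name `pixel_features`, the 8-bit input range and the handling of pure black pixels are assumptions made for illustration, not details given in this paper.

```python
import colorsys

def pixel_features(R, G, B):
    """Return the features R, G, B, H, S, V, ExG, ExR for one pixel (8-bit inputs assumed)."""
    total = R + G + B
    if total == 0:
        # Pure black: the normalized channels are undefined, so zeros are used (an assumption).
        r = g = b = 0.0
    else:
        r, g, b = R / total, G / total, B / total      # Equation (3)
    exg = 2 * g - r - b                                 # Equation (1)
    exr = 2 * r - g - b                                 # Equation (2)
    h, s, v = colorsys.rgb_to_hsv(R / 255, G / 255, B / 255)
    return {"R": R, "G": G, "B": B, "H": h, "S": s, "V": v, "ExG": exg, "ExR": exr}

features = pixel_features(60, 120, 20)   # an arbitrary greenish pixel
print(round(features["ExG"], 3), round(features["ExR"], 3))
```

Because r + g + b = 1 for any non-black pixel, the two indices reduce to ExG = 3g − 1 and ExR = 3r − 1, so each depends on a single normalized channel.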

**Figure 4.** Process of sample acquisition. (**a**) Original image. (**b**) Cropped image. (**c**) Random sample points (white points).


**Table 1.** Classification results under different proportions of tassel samples.

The RF classifier also allows the importance of features to be evaluated. We applied the permutation importance method on the test set: the column values of a single feature are permuted, the trained model is rerun, and the resulting change in accuracy is taken as the importance score [34,42]. This method is more reliable than the Gini importance [43]. The rfpimp package for Python was used for this purpose, and the resulting feature importance is shown in Figure 5. ExR has the highest importance score, followed by G (green band), while B (blue band) has the lowest. Figure 6b shows the classification result for a UAV image, in which the morphological characteristics of the tassels are well displayed. After this, all categories other than tassel were merged to obtain a binary image of tassels (Figure 6c), referred to as tassel regions.
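The permutation importance procedure can be illustrated with a short sketch. This is not the rfpimp API; it is a plain-Python restatement of the idea, and the toy classifier `toy_model`, the tiny test set and the random seed are hypothetical stand-ins for the trained RF and the real test samples.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows in X whose predicted label matches y."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, seed=0):
    """Accuracy drop for each feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)                      # break the link between feature j and the labels
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(predict, X_perm, y))
    return importances

def toy_model(row):
    """Hypothetical 'trained' classifier that only looks at feature 0."""
    return 1 if row[0] > 0.5 else 0

X_test = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.3], [0.1, 0.9], [0.7, 0.5], [0.3, 0.2]]
y_test = [1, 1, 0, 0, 1, 0]
scores = permutation_importance(toy_model, X_test, y_test)
print(scores)
```

A feature the model never uses (feature 1 here) always scores exactly 0, whereas shuffling a feature the model relies on can only lower or preserve the accuracy of this perfectly fitted toy model.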

**Figure 5.** Importance derived by permuting each feature and computing change in accuracy.

**Figure 6.** The extraction of potential tassel region proposals. (**a**) Original image. (**b**) Classification result by RF. (**c**) Binary image of tassels (white is tassel regions, black is non-tassel regions). (**d**) Result of morphological dilation (white is tassel regions, black is non-tassel regions).

#### 2.3.2. Potential Tassel Region Proposals Based on Morphological Processing

In the complex field environment, mutual occlusion between the top leaves and the tassels is serious, so the tassels in the binary image obtained in Section 2.3.1 were not connected regions (Figure 6c). In addition, pixels of other categories were misidentified as tassels.

To remove some of the noise, we applied the morphological remove-small-objects operation to the binary image, and then applied morphological dilation to the unconnected pixel regions belonging to tassels and noise to obtain the potential tassel region proposals (Figure 6d).
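The two morphological steps can be sketched on a small binary grid as follows. The minimum object size, the 3 × 3 square structuring element and the use of 8-connectivity are assumptions chosen for illustration; the paper does not state these parameters. Libraries such as scikit-image provide equivalent operations, but the sketch uses only plain Python so that it is self-contained.

```python
from collections import deque

def label_regions(mask):
    """Return the 8-connected foreground regions of a 0/1 grid as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for i in range(rows):
        for j in range(cols):
            if mask[i][j] != 1 or seen[i][j]:
                continue
            region, queue = [], deque([(i, j)])
            seen[i][j] = True
            while queue:                                   # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny in range(max(0, y - 1), min(rows, y + 2)):
                    for nx in range(max(0, x - 1), min(cols, x + 2)):
                        if mask[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions

def remove_small_objects(mask, min_size):
    """Zero out connected regions with fewer than min_size pixels."""
    out = [row[:] for row in mask]
    for region in label_regions(mask):
        if len(region) < min_size:
            for y, x in region:
                out[y][x] = 0
    return out

def dilate(mask, radius=1):
    """Binary dilation with a square structuring element of side 2 * radius + 1."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if mask[i][j] == 1:
                for y in range(max(0, i - radius), min(rows, i + radius + 1)):
                    for x in range(max(0, j - radius), min(cols, j + radius + 1)):
                        out[y][x] = 1
    return out

# Two fragments of one tassel separated by a one-pixel gap, plus one isolated noise pixel.
tassel_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_objects(tassel_mask, min_size=2)
proposals = dilate(cleaned, radius=1)
print(len(label_regions(tassel_mask)), len(label_regions(cleaned)), len(label_regions(proposals)))
```

In this toy grid the noise pixel is removed first, and dilation then merges the two tassel fragments into a single connected region whose bounding rectangle would serve as one region proposal.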

#### *2.4. Fine Detection of Tassels by Using VGG16*

To reduce the false positives, the potential tassel region proposals obtained after morphological processing were labeled, with 0 representing a non-tassel connected region and 1 representing a tassel connected region. A total of 2745 samples were labeled, including 1230 positive samples (tassel connected regions) and 1515 negative samples (non-tassel connected regions).

We found that many labeled samples contained little surrounding information because their envelope rectangles were small, which made these samples difficult to recognize by eye. As shown in Figure 7, all samples were therefore labeled with a human recognition attribute (called recognition), and the distribution of the pixel number of the envelope rectangles was analyzed. Almost all (96%) of the envelope rectangles of the unidentifiable samples (recognition = 0) are smaller than 600 pixels, so envelope rectangles smaller than this were expanded to 600 pixels according to their aspect ratio before being fed into the VGG network, so that more information could be included. Although nearly half of the envelope rectangles of the identifiable samples (recognition = 1) are also smaller than 600 pixels, verification showed that almost all of these samples are non-tassel regions, which means they could be recognized because of their distinctive color.
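One way to read the expansion step is sketched below. It assumes that "600 pixels" refers to the pixel count (area) of the envelope rectangle, as in Figure 7, and that the rectangle grows about its center; both are interpretations rather than details stated in the text. Rounding to whole pixels and clipping to the image border are omitted.

```python
import math

def expand_rectangle(x, y, w, h, min_area=600):
    """Grow an (x, y, w, h) box about its center to min_area pixels, keeping w / h unchanged."""
    area = w * h
    if area >= min_area:
        return float(x), float(y), float(w), float(h)   # already large enough
    scale = math.sqrt(min_area / area)                    # same factor for width and height
    new_w, new_h = w * scale, h * scale
    cx, cy = x + w / 2, y + h / 2
    return cx - new_w / 2, cy - new_h / 2, new_w, new_h

# A hypothetical 10 x 15 pixel proposal (150 pixels) becomes 20 x 30 pixels (600 pixels).
print(expand_rectangle(100, 50, 10, 15))
```

Scaling both sides by the square root of the area ratio is what keeps the aspect ratio fixed while reaching the target pixel count.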

**Figure 7.** The pixel number distribution of envelope rectangles.

The VGG network was proposed by the Visual Geometry Group of Oxford University and took part in the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where it won first and second place in the localization and classification tasks, respectively. The VGG network stacks multiple 3 × 3 convolution filters in place of a single large convolution filter, which increases the network depth while reducing the total number of parameters [44]. The idea of 3 × 3 convolution filters has therefore become the basis of various subsequent classification models. The two common VGG structures are VGG16 and VGG19, which differ in the number of convolutional layers: VGG16 has 13 convolutional layers, while VGG19 has 16.
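The parameter saving from stacking 3 × 3 filters can be checked with a small calculation. The sketch below counts weights only (biases ignored) and assumes the same number of input and output channels, 64, chosen purely for illustration.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights in a stack of `layers` k x k convolutions with `channels` inputs and outputs."""
    return layers * kernel * kernel * channels * channels

def receptive_field(kernel, layers):
    """Side length of the receptive field of `layers` stacked stride-1 k x k convolutions."""
    return layers * (kernel - 1) + 1

C = 64  # illustrative channel count
for layers, big in ((2, 5), (3, 7)):
    assert receptive_field(3, layers) == big     # same receptive field as one big filter
    print(f"{layers} stacked 3x3: {conv_weights(3, C, layers)} weights, "
          f"one {big}x{big}: {conv_weights(big, C)} weights")
```

Two stacked 3 × 3 layers cover the same 5 × 5 receptive field with 18C² instead of 25C² weights, and three cover 7 × 7 with 27C² instead of 49C², while adding extra non-linearities between the layers.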

The VGG16 network structure was selected in this paper, and we used the weight parameters of the model pre-trained on ImageNet to achieve better training results and reduce the running time. Because this study deals with binary classification, the activation function of the output layer was replaced with the sigmoid (Figure 8). The ratio of training set to validation set was 7:3. To prevent overfitting due to the limited number of samples and to improve the generalization of the model, data augmentation with small random rotations and flips was used. The number of iterations and the batch size were 50 and 64, respectively; the initial learning rate was 0.001, and a learning rate decay strategy was adopted. We then fine-tuned the optimal model by releasing (unfreezing) the parameters of the deep layers, so that the model parameters better fit the data in this study.

**Figure 8.** Architecture of VGG16 network.

#### *2.5. Extraction of Tassel Branch Number*

As an important factor determining tassel size and pollen quantity, the tassel branch number is one of the most important indicators in maize breeding. Some researchers placed tassels collected in the field in a portable photo box to take pictures. The branch number was estimated with a series of circular arcs drawn from the lowest branch node: the number of intersections between each arc and the binary object was calculated, and the greatest value was taken as the branch number [4]. However, data acquisition with this method is low-throughput and therefore not suitable for large-scale application. Moreover, this way of calculating the branch number is not suitable for UAV images.

In this paper, the tassel branch number was extracted from the result of morphological dilation (obtained as described in Section 2.3.2). Skeleton extraction was performed first, and an endpoint detection method suited to the shape of tassels was then proposed on this basis (Figure 9). The method abstracts the tassel skeleton into a matrix of 0s and 1s (0 is a background pixel and 1 is a skeleton pixel). Because a tassel endpoint is surrounded mostly by background pixels, the number of background pixels in the 3 × 3 window centered on each skeleton point was calculated. The tassel endpoints are the skeleton points with the most background pixels.


**Figure 9.** Endpoint detection of tassels. (**a**) Binary (0/1) matrix of the tassel skeleton. (**b**) Matrix after padding; the blue box is a 3 × 3 window. (**c**) The pixels with the deepest color are the endpoints of the tassel.
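The endpoint rule can be sketched in a few lines. The sketch assumes a one-pixel-wide skeleton stored as a list of 0/1 rows and zero padding of one pixel on every side; the function name `detect_endpoints` and the toy Y-shaped skeleton are hypothetical.

```python
def detect_endpoints(skeleton):
    """Return the skeleton pixels whose 3 x 3 window contains the most background pixels."""
    rows, cols = len(skeleton), len(skeleton[0])
    # Zero-pad by one pixel so that border pixels also have a full 3 x 3 window.
    padded = ([[0] * (cols + 2)]
              + [[0] + list(row) + [0] for row in skeleton]
              + [[0] * (cols + 2)])
    background = {}
    for i in range(rows):
        for j in range(cols):
            if skeleton[i][j] == 1:
                # Pixel (i, j) sits at (i + 1, j + 1) in the padded matrix.
                window = [padded[i + di][j + dj] for di in range(3) for dj in range(3)]
                background[(i, j)] = window.count(0)
    if not background:
        return []
    most = max(background.values())
    return sorted(p for p, count in background.items() if count == most)

# A Y-shaped toy skeleton: two upper branches joining a vertical stem.
skeleton = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
print(detect_endpoints(skeleton))
```

Each endpoint here has exactly one skeleton neighbor and therefore 7 background pixels in its window, whereas interior and junction pixels have 6 or fewer. An isolated skeleton pixel would score 8 and mask the true endpoints, so stray pixels would need to be removed before applying this rule.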

#### *2.6. Model Evaluation*

To evaluate the performance of the proposed method, precision, recall and F1-score were selected [45]. Precision is the proportion of correct detections among all positive results returned by the model, with 1.0 being perfect; recall (1.0 is perfect) is the proportion of all relevant samples that are correctly detected; and F1-score is the harmonic mean of precision and recall. The metrics were calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

where *TP*, *FP*, and *FN* are the number of true positives, false positives and false negatives, respectively.
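Equations (4)–(6) translate directly into code. In the sketch below the counts passed to `detection_metrics` are illustrative only and are not results from this study; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, Equation (6)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def detection_metrics(tp, fp, fn):
    """Precision, recall and F1-score from counts, Equations (4)-(6)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

# Illustrative counts: 90 correct detections, 10 false detections, 30 missed tassels.
precision, recall, f1 = detection_metrics(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

As a consistency check, a precision of 0.904 and a recall of 0.979 give `f1_from(0.904, 0.979)` ≈ 0.940, which matches the overall F1-score of 0.94 reported in Section 3.2.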

#### **3. Results**

#### *3.1. Influence of the Envelope Rectangles' Size on Model Accuracy*

As described in Section 2.4, we trained the model both on the enlarged envelope rectangles (samples whose envelope rectangle was smaller than 600 pixels were expanded to 600 pixels according to the aspect ratio) and directly on the original envelope rectangles. The number of iterations was set to 50. Figure 10 shows the loss and accuracy curves during model training after the expansion of the envelope rectangles.

**Figure 10.** Loss and accuracy curve.

We fixed the shallow network parameters of the optimal model obtained during iteration and released different numbers of deep layers for fine-tuning (Table 2). No matter how many deep layers were released, the recognition performance with the enlarged envelope rectangles was better than with the original envelope rectangles. This is because the more surrounding information the samples include, the more comprehensively the model can learn (especially for samples whose real category is a small part of a leaf: after the envelope rectangle is enlarged, the surrounding scene turns out to be different from that of real tassels). Moreover, when the envelope rectangles were enlarged, the model obtained by releasing the layers after the eighth convolution layer had the highest validation accuracy, 0.954.


**Table 2.** Validation accuracy of model under fine-tuning.

Validation accuracy\_6 represents the validation accuracy of the model obtained by releasing the deep layers after the sixth convolution layer.

#### *3.2. Influence of Different Tasseling Stages on Detection Accuracy*

To demonstrate the performance of the proposed method, 50 plots were randomly selected from the series of UAV images. The detection results are shown in Figure 11. The red, blue and yellow boxes in the figure represent automatic detections, missed detections and incorrect detections by the proposed method, respectively. Tassels of different shapes and sizes, and even tassels that can only just be observed by eye, were detected well (Figure 12). The precision, recall and F1-score were 0.904, 0.979 and 0.94, respectively (Table 3). It should be noted that some tassels were severely covered by leaves, so the dilation processing of Section 2.3.2 failed to connect their parts. As a result, a small number of tassel branches were also marked separately in the final detection results, and we regarded these as false positives. In addition, the model error mainly comes from leaf veins and reflective leaves being recognized as tassels (Figure 12).

When analyzing the detection accuracy, we found that the detection performance was related to the tasseling stage of the breeding plots. The tasseling stage of the breeding plots was therefore divided into early, middle and late stages according to the proportion of tasseling plants and whether the tassels had complete morphological characteristics. The stages are defined as follows: in the early tasseling stage, the tasseling proportion is less than half (Figure 11a,b); in the middle tasseling stage, the tasseling proportion is more than half, but most tassels do not have complete morphological characteristics (Figure 11c,d); in the late tasseling stage, the tasseling proportion is more than half and the tassels have complete morphological characteristics (Figure 11e,f).
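The stage definition amounts to a simple decision rule, sketched below. The function name, the boolean flag `mostly_complete` and the treatment of a proportion of exactly one half (handled like "more than half") are assumptions, since the text does not specify the boundary case.

```python
def tasseling_stage(tasseling_proportion, mostly_complete):
    """Classify a plot as 'early', 'middle' or 'late' from the definitions in this section.

    tasseling_proportion: fraction of plants in the plot that have tasseled (0.0 to 1.0).
    mostly_complete: True if the tassels have complete morphological characteristics.
    """
    if not 0.0 <= tasseling_proportion <= 1.0:
        raise ValueError("tasseling_proportion must lie between 0 and 1")
    if tasseling_proportion < 0.5:
        return "early"
    return "late" if mostly_complete else "middle"

print(tasseling_stage(0.3, False), tasseling_stage(0.7, False), tasseling_stage(0.9, True))
```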

According to Table 3, the detection performance at the three tasseling stages ranks as follows: late tasseling stage > middle tasseling stage > early tasseling stage, with corresponding F1-scores of 0.962, 0.914 and 0.863, respectively. The performance in the early and middle stages was worse than in the late stage, mainly because of lower precision. This is because the top leaves of maize plants in these two stages were not fully unfolded, and their high flexibility leads to a large inclination angle. Under the same light conditions, such leaves reflect light more strongly and are easily misidentified as tassels. Moreover, at every tasseling stage the recall was higher than the precision, which indicates that the proposed method missed few tassels but identified some other objects (mainly leaf veins and reflective leaves, Figure 12) as tassels.



**Figure 11.** Detection results of tassels. The red, blue and yellow boxes in the figure represent automatic detections, missed detections and incorrect detections by the proposed method, respectively.

**Figure 12.** Detailed diagram of detection results. Red boxes represent automatic detections by the proposed method: (1) newly grown tassels; (2) a reflective leaf misidentified as a tassel; (3) and (4) leaf veins misidentified as tassels.

#### *3.3. The Calculation of Tassel Branch Number*

Before extracting the tassel skeleton, we found that, because of the high planting density in the field, the envelope rectangles of detected tassels may contain partial branches of other tassels, which produce interference connected regions (the red boxes in Figure 13b) in the result of morphological dilation (obtained as described in Section 2.3.2). These interference connected regions should therefore be removed before skeleton extraction and endpoint detection.

The tassel branch number was calculated as follows (Figure 13c,d). The uppermost branch of a tassel cannot be detected because our images are taken from an orthographic angle. Therefore, 1 must be added to the endpoint detection result to obtain the final branch number. In this example, the tassel has 7 endpoints plus the uppermost branch, making a total of 8 branches.

**Figure 13.** The calculation of tassel branch number. (**a**) Original image. (**b**) Binary image after morphological dilation. (**c**) Tassel skeleton. (**d**) Endpoint detection.

#### **4. Discussion**

#### *4.1. Comparison of the Generation Method of Detection Boxes*

The detection of tassels carried out in this paper can be regarded as a common object detection problem in computer vision. Yunling Liu et al. [14] detected tassels using Faster R-CNN. Faster R-CNN is a two-stage object detection algorithm proposed by Shaoqing Ren et al. [46], which changed the way detection boxes are generated in Fast R-CNN by introducing the Region Proposal Network (RPN). RPN has become a mainstream method. It forms a series of anchor boxes as initial region proposals by setting different anchor ratios and scales at each position of the feature map obtained by the convolutional neural network. RPN speeds up the generation of region proposals, but even its smallest anchor box (128<sup>2</sup> pixels) is bigger than most tassels, which affects the detection accuracy of the model [47]. Yunling Liu et al. [14] also paid attention to this problem and adjusted the anchor sizes from [128<sup>2</sup>, 256<sup>2</sup>, 512<sup>2</sup>] to [85<sup>2</sup>, 128<sup>2</sup>, 256<sup>2</sup>], and the final prediction accuracy for tassels increased from 87.27% to 89.96%.

In summary, the size setting of the anchor boxes is particularly important in RPN, especially for the tassel detection problem in this paper, which involves different tasseling stages (tassel size changes significantly) and different shapes. The ratio of width to height of the tassel samples was used to represent tassel morphology, and Figure 14 shows the distribution of tassel morphology and size in this paper (tassel samples labeled in Section 2.4). The width to height ratio (Figure 14b) is concentrated in (0.2, 2.0), and the default anchor ratios in RPN are 0.5, 1 and 2, which can satisfy the requirements of the different tassel shapes. The pixel widths and heights of the tassel samples (Figure 14a) are mostly within the range (11, 100), so the default anchor sizes [128<sup>2</sup>, 256<sup>2</sup>, 512<sup>2</sup>] cannot cover the tassels in our dataset. To identify tassels better, the anchor sizes can be adjusted and additional anchor boxes added while keeping the default anchor ratios. However, this is only a preliminary discussion; the optimal parameter setting would need to be determined through a series of experiments.
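The mismatch between anchor size and tassel size can be made concrete with a rough calculation. The sketch below generates anchor shapes from scales and width-to-height ratios and computes the best possible intersection over union (IoU) with a tassel box when both share the same center. The 50 × 50 pixel tassel box is a hypothetical example within the (11, 100) range, and the same-center simplification is an assumption that gives an upper bound on the overlap.

```python
import math

def anchor_shapes(scales, ratios=(0.5, 1.0, 2.0)):
    """(width, height) of anchors with area scale**2 and width / height equal to ratio."""
    return [(s * math.sqrt(r), s / math.sqrt(r)) for s in scales for r in ratios]

def centered_iou(box_a, box_b):
    """IoU of two (width, height) boxes that share the same center."""
    inter = min(box_a[0], box_b[0]) * min(box_a[1], box_b[1])
    union = box_a[0] * box_a[1] + box_b[0] * box_b[1] - inter
    return inter / union

def best_iou(box, scales, ratios=(0.5, 1.0, 2.0)):
    """Largest centered IoU between the box and any anchor shape."""
    return max(centered_iou(box, anchor) for anchor in anchor_shapes(scales, ratios))

tassel = (50, 50)  # hypothetical tassel box in pixels
print(round(best_iou(tassel, (128, 256, 512)), 3))  # default anchor sizes
print(round(best_iou(tassel, (85, 128, 256)), 3))   # sizes used by Liu et al. [14]
```

For this example box, the best overlap rises from about 0.15 with the default sizes to about 0.35 with the adjusted sizes, which is still low and illustrates why smaller anchors would be needed for the tassels in this dataset.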

In fact, the procedure in our proposed method, in which images are divided into tassel and non-tassel regions by random forest and the potential tassel region proposals are then extracted by morphological methods, is also a way of forming detection boxes. It can find the detection boxes of tassels at different tasseling stages, even including newly grown tassels (Figure 12), and it does not suffer from the problem that RPN has with small objects. However, it takes more time to label samples than an object detection method, because a large number of samples is also needed to train the random forest classifier.

**Figure 14.** (**a**) The distributions of pixel width and height of tassels. (**b**) The distribution of tassels morphology.

#### *4.2. Comparison of VGG16 and RF in Fine Detection*

In the classification of the potential tassel region proposals, we compared the performance of RF and VGG16. The potential tassel region proposals extracted in Section 2.3 were classified with an RF classifier. The number of samples and the training and validation sets were the same as those used as input to the VGG16 network, and the Histogram of Oriented Gradients (HOG) method was applied to extract the features of each sample.

The results are shown in Table 4. The OA of the RF classifier is 0.796, while that of the VGG16 network is 0.954 (Table 2; validation accuracy was calculated in the same way as OA for RF), indicating that VGG16 performs significantly better than RF. The recall of tassels obtained by RF is very low, indicating that many tassels were missed, which is not suitable for our application scenario. This may be caused by the small size of the training set; however, with the same sample size, the VGG16 network performs better because we used the model pre-trained on ImageNet. This also reflects the advantage of deep learning in transferring an existing model to the problem at hand; moreover, it does not require manual feature engineering or feature optimization.


**Table 4.** The result of RF in fine detection of tassels.

#### **5. Conclusions**

Rapid and accurate extraction of tassel development in maize breeding fields and seed maize production fields can provide decision support for variety selection and detasseling arrangement. However, because of the complex planting environment in the field, such as unsynchronized growth stages and variety-dependent variation in tassel size and shape, the detection of maize tassels remains a challenging problem. In this paper, based on time series UAV images of the maize flowering stage, we proposed a method for detecting maize tassels in complex scenes (different varieties, different tasseling stages) by combining RF and the VGG16 network. The main conclusions are as follows:


and found that the detection effect of tassels was highly correlated with the tasseling stages and the detection effect in late tasseling stage was better than that in middle and early stages.


The detection method of maize tassels proposed in this paper has the advantages of high precision and fast data acquisition, and it can be applied to large areas of maize breeding fields and seed maize production fields. In future research, we will carry out experiments on UAV images collected at different times of day, at different flight heights and with other sensors (such as multispectral and hyperspectral sensors), and reduce the number of samples required, in order to propose a more efficient detection method for maize tassels. We will also look for a solution to the problem of unsuccessful photogrammetric processing caused by the high similarity of the acquired images, so that single images can be replaced with orthophoto mosaics in later work.

**Author Contributions:** Conceptualization, X.Z. (Xiaodong Zhang), S.L. and Z.L.; methodology, Z.L., W.S. and Y.Z.; formal analysis, X.Z. (Xuli Zan), and Z.L.; investigation, X.Z. (Xuli Zan), X.Z. (Xinlu Zhang) and W.L.; data curation, X.Z. (Xinlu Zhang), Z.X. and W.L.; writing—original draft preparation, X.Z. (Xuli Zan); writing—review and editing, Z.L. and X.Z. (Xuli Zan). All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Key Research and Development Plan of China, grant number 2018YFD0100803.

**Conflicts of Interest:** The authors declare no conflict of interest.
