3.1. Comparison of Instance Segmentation Models
To verify the effectiveness of the proposed method for weed segmentation, six typical instance segmentation algorithms, including Mask R-CNN, SOLO [39], PolarMask [40], CenterMask [41], YOLACT [42] and BlendMask, were compared. This study aims to identify weed phenotypic information in complex field environments. To further improve the adaptability of the model, the image data were enhanced. The six algorithms were trained on two datasets: one consisting of the original images and the other of the enhanced images. To examine the recognition performance of the models in a complex field environment, the test images were not enhanced.
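The specific enhancement operations are not restated here, so the following is only a minimal sketch of a typical augmentation pipeline, written with torchvision; the particular transforms (flips, small rotations, colour jitter) are assumptions for illustration, not the study's exact settings.

```python
from torchvision import transforms

# Illustrative augmentation pipeline (assumed operations, not the
# study's exact configuration). For instance segmentation, the same
# geometric transforms must also be applied to the instance masks.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),    # mirror left-right
    transforms.RandomRotation(degrees=15),     # small random rotation
    transforms.ColorJitter(brightness=0.3,     # simulate lighting
                           contrast=0.3),      # variation in the field
    transforms.ToTensor(),                     # PIL image -> tensor
])
```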
Figure 8 shows the F1, AP50, and AP70 values of the models. Figure 8a shows the F1 values of the six instance segmentation networks. As the figure indicates, data enhancement increases the F1 value by at most 3.21% and at least 1.53% relative to training without enhancement; the results with data enhancement are generally better than those without. For convenience of description, in the following research we write YOLACT-550++ as YOLACT. With data enhancement, the F1 values of Mask R-CNN, SOLO, CenterMask and BlendMask are greater than 0.92, while the F1 values of YOLACT and PolarMask are the lowest. To further analyse the recognition performance for multiclass target location and category information, Figure 8b,c show the AP50 and AP70 values of the six instance segmentation networks. The comparison shows that AP50 is higher than AP70, so choosing an IOU threshold greater than or equal to 0.5 is more suitable for this study. From Figure 8b, we can see that with data enhancement, the AP50 values of the six models are between 65% and 72%, which meets the needs of weed instance segmentation. The AP50 values of BlendMask, SOLO and CenterMask are greater than 70%.
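AP50 and AP70 differ only in the IoU threshold at which a predicted mask is counted as matching a ground-truth mask. The sketch below (our illustration in NumPy; the function names are not from the authors' code) shows this matching rule; because the 0.7 criterion is strictly harder to satisfy, AP70 can never exceed AP50.

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, thresh=0.5):
    """A prediction counts as a true positive when its IoU with the
    ground truth reaches the threshold: 0.5 for AP50, 0.7 for AP70."""
    return mask_iou(pred, gt) >= thresh
```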
Field weeds are visual objects with complex structures and rich texture features; even within the same species, there are great differences in morphology and colour. Data enhancement can improve the generalization ability of the model, reduce overfitting, and improve the adaptability of the model to complex field environments. The main reasons for the good recognition performance of BlendMask, SOLO and CenterMask are as follows. BlendMask combines the ideas of top-down and bottom-up methods to fuse rich instance-level information with accurate dense pixel features, so it is well suited to the leaf occlusion encountered in this study. SOLO uses instance categories to realize direct instance segmentation and is thus free from the influence of object detection. CenterMask is also a one-stage instance segmentation model that combines the global and local image-processing approaches of YOLACT and PolarMask; it can segment different objects with pixel-level feature alignment, so it achieves good segmentation results. However, CenterMask is still not completely free of the influence of object detection, so its AP50 value is second only to those of BlendMask and SOLO.
The main reasons for the poor recognition performance of YOLACT and PolarMask are as follows. YOLACT is a one-stage instance segmentation network that processes the image with a global, image-based method. This method retains the location information of the object well; however, in the case of leaf occlusion, it may not accurately locate each weed leaf, so occluded leaves below may be recognized as part of the foreground mask, causing errors. PolarMask is also a one-stage instance segmentation model; it describes the contour of an object by a polygon composed of rays emitted from the object centre. However, weeds are polymorphic, their morphological structure is complex, and the plant centre is atypical, so this method may not accurately describe the object edge. When the endpoints of the rays are connected, some local segmentation information is lost, which degrades the final mask.
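To make the ray-based contour description concrete, the following is a rough sketch of reducing a contour to ray lengths sampled at fixed angles from the centre (our own illustration of the idea, not PolarMask's implementation); boundary points falling between the sampled angles are simply dropped, which is the local detail loss described above.

```python
import numpy as np

def contour_to_rays(contour, centre, n_rays=36):
    """Approximate a closed contour (an (N, 2) array of x, y points)
    by n_rays ray lengths emitted from `centre` at fixed angles.
    Detail between sampled angles is discarded."""
    dx = contour[:, 0] - centre[0]
    dy = contour[:, 1] - centre[1]
    angles = np.arctan2(dy, dx)
    dists = np.hypot(dx, dy)
    bins = np.linspace(-np.pi, np.pi, n_rays + 1)
    ray_len = np.zeros(n_rays)
    for k in range(n_rays):
        in_bin = (angles >= bins[k]) & (angles < bins[k + 1])
        if in_bin.any():
            ray_len[k] = dists[in_bin].max()  # farthest boundary point per angle
    return ray_len
```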
In summary, BlendMask and SOLO have the best weed segmentation performance in a complex field environment. With data enhancement, the F1 value of BlendMask is 0.47% higher than that of SOLO, and its AP50 value is 0.69% higher. To further explore the ability of these two models to obtain weed phenotypic information, BlendMask and SOLO were each tested with two backbone networks (ResNet50 and ResNet101), and the computation times under the different backbones were compared. The results are shown in Table 3.
Table 3 lists the prediction time of each model for a single image. With the two backbone networks, the prediction time of BlendMask is 13.4 ms and 13.9 ms lower than that of SOLO, respectively. SOLO is influenced by anchor-based methods and, similar to FCIS, must distinguish location information. BlendMask adopts the fusion approach of FCIS and YOLACT and introduces the Blender module, which gives it a faster processing speed. Therefore, considering both segmentation performance and prediction time, BlendMask shows satisfactory segmentation performance and can support fast and accurate weeding.
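Per-image prediction times of the kind listed in Table 3 are typically measured with a routine like the sketch below; the PyTorch-style model interface is an assumption, and the warm-up runs and GPU synchronization guard against common timing pitfalls.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time_ms(model, images, warmup=10):
    """Average single-image prediction time in milliseconds.
    `model` and `images` stand in for the trained segmentation
    model and the test images."""
    for img in images[:warmup]:
        model(img)                    # warm-up runs, not timed
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # flush pending GPU work
    start = time.perf_counter()
    for img in images:
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / len(images) * 1000.0
```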
3.3. Segmentation Results of Weeds with Different Shooting Angles and Leaf Ages
The morphology of a plant differs considerably under different shooting angles and leaf ages. To verify the recognition performance of the model under these conditions, we compared the segmentation results of BlendMask with two different backbone networks (ResNet50 and ResNet101) combined with FPN under different leaf ages and shooting angles. We used seven key indexes (their conventional definitions are given below): precision (P), recall (R), F1, intersection over union (IOU), average precision (AP), mean average precision (mAP), and mean intersection over union (mIOU). The test set, which included 600 images, was used to verify the generalization ability of the model; therefore, 600 images without data enhancement were selected for testing, with 200 images each for the front view, side view, and top view. As shown in Figure 9, instances of the labels a, b, c, a_leaf, b_leaf, c_leaf, and centre that were not recognized were treated as having been identified as background.
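These indexes follow their conventional definitions. With TP, FP, and FN denoting true positives, false positives, and false negatives, A and B the predicted and ground-truth masks, and N the number of classes:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 P R}{P + R},

IOU = \frac{|A \cap B|}{|A \cup B|}, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad
mIOU = \frac{1}{N}\sum_{i=1}^{N} IOU_i .
```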
Figure 9 shows the confusion matrices of the detection results of the model with data enhancement. Each matrix counts classified instances by comparing the actual labels in the test data with the predicted types and indicates whether the model can differentiate among classes. As shown in Figure 9, ResNet50 and ResNet101 share an intuitive common feature: the prediction for the leaves of Solanum nigrum is highly accurate. Solanum nigrum is an annual herb with oval leaves; moreover, this weed has a large number of leaves, which allows the model to learn more features. Because the model can extract sufficient features from Solanum nigrum leaves, Solanum nigrum exhibits a high recognition accuracy. However, some Solanum nigrum leaves are predicted to be leaves of Abutilon theophrasti Medicus, because the leaves of Abutilon theophrasti Medicus are also oval and similar to those of Solanum nigrum. In addition, blurred images collected in the field may yield insufficient image information. These factors easily lead to misjudgement, which is undesirable because it may cause errors in leaf age identification. Plant centre recognition accuracy is second only to that of Solanum nigrum leaves, and the plant centre is not mistaken for any other label: because the plant centre is a special physiological position of the weed that differs clearly from the characteristics of the other categories, there is little misjudgement. From the confusion matrix, the precision, recall and F1 evaluation indexes of the model can be further calculated.
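As a minimal sketch of that calculation (assuming the common convention that rows of the confusion matrix are actual labels and columns are predictions; this is not the authors' code):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F1 per class from a confusion matrix cm,
    where cm[i, j] counts instances of actual class i predicted as
    class j (assumed convention)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp          # actually the class but missed
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```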
Figure 10 shows the detection results of the BlendMask model with data enhancement under the two backbone networks, three angles, and seven types of labels. According to Figure 10, the F1 values of ResNet50 in the front, side, and top views and the total test set were all greater than or equal to 0.7634, while the corresponding F1 values of ResNet101 were greater than or equal to 0.8983; ResNet101 therefore outperforms ResNet50. With ResNet50, the recognition accuracy for barnyard grass (Echinochloa crus-galli) leaves was higher than that for Abutilon theophrasti Medicus, whereas with ResNet101 the two accuracies were comparable. Because ResNet50 has fewer convolutional layers than ResNet101, it may not extract sufficient features. ResNet101 increases the depth of the network, so the F1 values for barnyard grass and Abutilon theophrasti Medicus leaves are both high and comparable. The difference in recognition accuracy between these two weeds can likewise be attributed to the shallower network's inability to extract enough features, and it gradually decreases as the number of network layers increases.
In terms of the F1 value with data enhancement, the recognition accuracy for Solanum nigrum leaves was the highest, and according to the confusion matrix, Solanum nigrum leaves also have the highest classification accuracy. From a botanical perspective, Solanum nigrum is an annual herbaceous plant of the Solanaceae family, with upright stems, many branches, oval or heart-shaped leaves, and a large number of leaves. The characteristics of this type of plant are distinctive, which facilitates feature extraction by deep learning models.
For the plant centre, the precision values in the front, side, and top views and the total test set are all 1.0000. According to the confusion matrix analysis in Figure 9, the characteristics of the plant centre are more distinctive than those of the other categories, so the precision is high. The F1 values of ResNet101 on the front, side, top, and total test sets were 0.9445, 0.9371, 0.9643 and 0.9479, respectively. With ResNet101 as the backbone network, the recall values on the front, side and top view test sets were greater than or equal to 0.8267, 0.8374 and 0.9149, respectively; the top view test set exhibited the highest performance in all classifications.
Since Figure 9 and Figure 10 indicate only the classification performance of the model, the recognition accuracy cannot be determined from them, and the actual field environment is complex, which is expected to influence weed identification. Therefore, the recognition accuracy is critical for evaluating model performance.
Table 7 presents the detection results for the weeds under different networks and angles with data enhancement. The mAP is a commonly used index in target detection. Table 7 shows that the mAP and mIOU values of ResNet101 are higher than those of ResNet50, indicating good detection performance that extends to the segmentation of small objects and meets the needs of weed segmentation. Therefore, this study selected ResNet101 combined with the FPN framework to extract weed characteristics. For the total test set with ResNet101 as the backbone network, the AP50 value is 12.8% higher than the AP70 value, indicating that an IOU threshold greater than or equal to 0.5 yields good detection performance. With ResNet101 as the backbone, the AP50 value of the top view is 5.2% and 13.9% higher than those of the front view and side view, respectively, so the top view achieves good detection performance. The mIOU is a valuable index for evaluating segmentation results [43] and is commonly used to evaluate the segmentation performance of the BlendMask model. As shown in Table 7, with ResNet101 as the backbone network, the mIOU values of the top view are 4.5% and 5.9% higher than those of the front view and side view, respectively, while good segmentation results are still achieved at the other angles.
Among the three orthogonal angles, more comprehensive weed phenotype information can be obtained from the top view; in particular, the plant centre of the weeds can be identified more clearly, whereas the side and front views cannot clearly reveal the plant centre. Consequently, the detection accuracy for the top view is higher than that for the other angles. Nevertheless, when intelligent agricultural equipment is employed in the field, the camera is usually fixed at one angle, although the position and shape of field weeds are complex and changeable. When the machine moves, the imaging angle of the weeds changes, and it also differs with the position of each weed. At certain angles, only the information of the side and front views is visible; therefore, obtaining side- and front-view images helps the model segment weeds accurately, and constructing datasets from different perspectives enables the model to adapt to the requirements of different scenarios. To verify the accuracy of this method for leaf age identification, Figure 11 shows the leaf age identification accuracy with data enhancement and ResNet101 combined with FPN, and Figure 12 lists some of the leaf age identification results for the three weeds.
To obtain the accuracy of leaf age identification, we used a test set containing 900 unenhanced images: 300 of Solanum nigrum, 300 of barnyard grass (Echinochloa crus-galli) and 300 of Abutilon theophrasti Medicus. The Solanum nigrum dataset contained 100 weeds each at the two-, three-, and four-leaf stages, and the other two datasets were organized in the same way. The accuracy of leaf age identification was determined by comparing the leaf count calculated by the computer with the leaf count recorded on the label when the data were collected.
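The comparison just described amounts to exact-match scoring of the predicted leaf count against the labelled count; a minimal sketch (the names are illustrative) is:

```python
def leaf_age_accuracy(predicted_counts, labelled_counts):
    """Fraction of test images whose computed leaf count equals the
    leaf count recorded on the label at collection time. Both
    arguments are equal-length sequences of integers."""
    matches = sum(p == t for p, t in zip(predicted_counts, labelled_counts))
    return matches / len(labelled_counts)
```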
From Figure 11, we can see that the leaf age identification accuracy for all three weeds was higher than 88%. From the perspective of the growth stage, the average recognition accuracy of the three weeds at the 2-leaf stage was 0.913, at the 3-leaf stage 0.930, and at the 4-leaf stage 0.911. In particular, the recognition accuracy for Solanum nigrum at the 3-leaf stage was 0.957, the highest among all classes. Figure 12a,b show that Solanum nigrum leaves mostly grow from the same centre, and fewer leaves are occluded below.
At present, the treatment of weeds in the field mostly involves weed classification and detection [7,8]. However, weed classification can determine only the species of a weed; the specific position coordinates cannot be obtained, so an exact target cannot be sprayed. Weed detection can produce a bounding box around each weed [12]. However, weeds have irregular shapes and sizes, which may cause the machine to miss the target, so some herbicide falls to the ground and is not absorbed by the weeds; this can lead to environmental pollution and wasted herbicide. As a kind of deep learning model, instance segmentation can detect the target pixel by pixel, thereby handling leaf adhesion and occlusion; moreover, the leaf age of weeds and the position of the plant centre can be obtained accurately. The data for this study were obtained from a complex field environment, whereas Bell and Dobrescu et al., who carried out extensive studies on plant leaf counting [15,16,17], worked with images taken in indoor environments where the background is often pure and the illumination uniform. Studying the field environment helps make the model more suitable for practical applications. Wang and Huang et al. identified the central regions of maize and rice, respectively, which corresponded to the protected area for mechanical weeding and were also the plant centres [44,45]. It is worth noting that the morphological and structural characteristics of maize and rice are relatively uniform, so the characteristics of the central area are more obvious; weeds, in contrast, are polymorphic, the morphology of weeds of different varieties and leaf ages differs greatly, and their growth positions are random and variable.
Only a few existing studies on plant phenotypes are specific to weed phenotypes. However, weeds of different leaf ages require different doses of herbicide; therefore, obtaining information on weed leaf age is significant for reducing herbicide use. In the Northeast Plain of China, the main economic crops are maize, soybeans, and wheat, which are susceptible to annual and perennial weeds. Controlling annual and perennial weeds can increase crop yields and reduce the likelihood of damage caused by weeds in the following year [46]. Moreover, studying the interaction between plant phenotype and vision through effective phenotypic analysis can provide information on plant growth and morphological changes.
The DCNN model employed here was used to segment only three kinds of weeds, but different kinds of weeds still differ in detail, so it is necessary to expand the range of weed species, collect and segment images of field crops, and enlarge the datasets so that the model can achieve a higher segmentation accuracy. Moreover, the leaf age obtained for economic crops can provide a basis for crop fertilization. For certain plants, the plant centre is the pollination area of the flowers, and segmenting this part can provide valuable guidance for subsequent studies. In the weed test images, BlendMask failed to segment weeds close to the image edge, but continuous video input can help eliminate these edge effects in field applications.