1. Introduction
Lotus seeds are the mature seeds of the lotus, an aquatic herbaceous economic plant distributed mainly in Asian, Australian, and North American countries. Due to their delicious taste, rich nutrients, and medicinal ingredients, they are consumed as fresh fruit, functional food, and traditional Chinese medicine [1,2,3,4]. The lotus seeds grow inside the lotus pod, which is bowl-shaped and supported by a slender lotus stalk (Figure 1).
Lotus pod harvesting is an essential procedure in the lotus seed cultivation process. Owing to the aquatic or muddy growth environment and the gradual ripening and random spatial distribution of the lotus pods, manual selective picking has been the only way to harvest lotus pods since ancient times. However, manual labor involves harsh working conditions, high labor intensity, and low efficiency. In particular, lotus seeds mature during the high-temperature season between July and September every year [5], so many experienced workers are required for picking during this period. With the intensification of urbanization and population aging, the labor shortage for harvesting has become prominent, and harvesting costs have risen accordingly [6,7]. Hence, automatic lotus pod-harvesting technology is urgently needed to fulfill the requirements of the lotus seed industry.
At present, robotic picking has become a popular research field in the automatic harvesting of fruits and vegetables, providing an effective solution to the challenges faced in manual picking [8,9]. In manual lotus pod picking, the lotus pod is separated from the lotus stalk by breaking or cutting off the stalk near the pod. Therefore, to realize effective robotic lotus pod picking, it is necessary to identify and segment the lotus stalk area to provide basic data for the end-effector to perform the holding and separation actions. Moreover, to prevent misjudgment caused by the similarity between the lotus stalk and the surrounding lotus leaf stalks, as well as the possible clustering of multiple lotus pods, the affiliation between each lotus pod and its corresponding stalk must be analyzed in combination with lotus pod identification information before automatic picking. Hence, studying fast, robust, and effective segmentation methods for lotus pods and stalks is necessary.
With the development of harvesting robot technology, image segmentation of fruits and vegetables has received extensive attention from researchers, with the relevant literature divided into traditional image processing and deep learning approaches. Traditional image processing-based segmentation methods preprocess the image first and then exploit the differences in color, shape, and texture between the object and the background to achieve segmentation [10,11,12,13,14,15]. Septiarini et al. [11] proposed an image segmentation method for oil palm fruit that includes object localization, color and smoothing pre-processing, and edge detection. Linker et al. [12] developed a four-step detection and segmentation method for green apples based on color, texture, and contour information.
Although segmentation based on traditional image processing is simple and convenient, it relies on high-quality images and hand-engineered features and is sensitive to environmental changes. The growth environment of agricultural products typically involves illumination changes, occlusion by branches and leaves, overlapping, and color similarity between objects and background, all of which pose significant challenges to traditional methods.
In recent years, deep learning technology, with its high precision, high efficiency, and good robustness, has been applied to the segmentation of fruits and vegetables [10,16]. It enables autonomous learning, automatic extraction of image features, and end-to-end detection [17,18], and is better suited to segmentation tasks in complex environments than traditional methods. Existing segmentation methods fall into three sub-types: semantic segmentation [19], instance segmentation, and panoptic segmentation [20]. Among them, instance segmentation can obtain pixel-level masks of individual objects [21] and is the primary type used for object segmentation in automatic harvesting applications.
Reported instance segmentation algorithms mainly include two-stage algorithms represented by Mask R-CNN [22,23] and one-stage algorithms represented by You Only Look At CoefficienTs (YOLACT) [24], U-Net, fully convolutional one-stage (FCOS), etc. For the segmentation of fruits and leaves, Wang and He [10] designed an apple segmentation method by integrating an attention mechanism module into the backbone network of the Mask R-CNN algorithm, achieving a segmentation mAP of 0.917. Lu et al. [14] studied the application of deep learning segmentation algorithms, i.e., the pyramid scene parsing network (PSPNet), U-Net, and DeepLabV3+, to the segmentation of Sichuan peppers, and compared the results with three traditional segmentation methods, i.e., RGB and HSV color-space methods and k-means clustering. Xu et al. [25] segmented cherry tomatoes and stems using an improved Mask R-CNN algorithm, achieving identification accuracies of 93.76% and 89.34%, respectively. Jia et al. [26] proposed a segmentation method for green fruits called FoveaMask, which introduces a position attention module and performs instance segmentation of fruit through fully convolutional operations. Liu et al. [27] proposed a modified FCOS model for the segmentation of obscured green fruit and achieved a segmentation accuracy of 85.3% on an apple dataset.
For the segmentation of fruit peduncles and branches, Li et al. [28] proposed a multitask-aware YOLACT network that realizes the segmentation of the main stem and fruit pedicel of cherry tomatoes. Zhong et al. [29] studied the segmentation of the main fruit-bearing branches (MFBB) of litchi using YOLACT. Yang et al. [30] developed an integrated system that simultaneously detects and measures citrus fruits and branches by combining Mask R-CNN with a branch segmentation fusion algorithm; the average identification accuracies of fruits and branches were 88.15% and 96.27%, respectively. Hitherto, to the authors' knowledge, there have been very few reports on the segmentation of lotus pods and stalks.
The segmentation of lotus pods and stalks in an unstructured planting environment is quite difficult due to the color similarity between the objects and the background, their scattered distribution and multi-scale characteristics, and occlusion. In this study, the effective instance segmentation of lotus pods and stalks in the unstructured planting environment is investigated. A dedicated segmentation dataset of lotus pods and stalks is established. A model for lotus pod and stalk segmentation, LPSS-YOLOv5 (Lotus Pod and Stalk Segmentation model based on You Only Look Once version 5), is proposed based on the latest YOLOv5 v7.0 instance segmentation model. The CBAM (Convolutional Block Attention Module) attention mechanism is introduced into the network, and the scale distribution of the network's output feature layers is adjusted. The model is then compared with the mainstream Mask R-CNN and YOLACT segmentation models. In addition, a method for localizing the stalk picking point and the lotus pod's key point in the 2D image based on the model's segmentation results is built and tested, and a 3D localization test is conducted on the self-developed lotus pod harvesting robot. The research results are expected to support the development of lotus pod harvesting robots.
3. Results and Discussion
3.1. Instance Segmentation Performance of the Proposed LPSS-YOLOv5
In this section, experiments were carried out to test the performance of the proposed LPSS-YOLOv5 model. The model’s loss curves on the training and validation sets are shown in
Figure 11. In line with the original YOLOv5 v7.0 model, the loss comprises four components: box loss, segmentation loss, object loss, and classification loss. As shown in Figure 11, both the training and validation loss curves exhibit an ideal L shape: as training progressed, each loss value gradually decreased until final convergence, indicating that the model's performance had stabilized and the whole training process was normal. In addition, the gap between the training and validation loss curves was small, showing that the model fitted well.
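For readers reproducing the training, curves like those in Figure 11 can be re-plotted from the results.csv file that YOLOv5 v7.0 writes during segmentation training. The sketch below assumes the standard YOLOv5 v7.0 run directory and column names; these are assumptions about the framework's defaults, not details given in the paper.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed default output directory of YOLOv5 v7.0 segmentation training.
df = pd.read_csv("runs/train-seg/exp/results.csv")
df.columns = df.columns.str.strip()  # YOLOv5 pads column names with spaces

# The four loss components named above, for the train and validation splits.
components = ["box_loss", "seg_loss", "obj_loss", "cls_loss"]
fig, axes = plt.subplots(2, len(components), figsize=(16, 6))
for col, name in enumerate(components):
    for row, split in enumerate(("train", "val")):
        axes[row, col].plot(df["epoch"], df[f"{split}/{name}"])
        axes[row, col].set_title(f"{split}/{name}")
fig.tight_layout()
plt.show()
```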
Then, the model was tested on Test Sets A and B to verify its segmentation performance for lotus pods and stalks in actual planting environments. The trained weights were loaded into the model, the images in the test sets were input for inference, and the test results were obtained accordingly, as listed in Table 3.
The P, R, F1-score, and mAP0.5 values achieved with LPSS-YOLOv5 on Test Set A were 96.7%, 99.3%, 98.0%, and 99.3%, respectively, and the AP0.5 for the lotus pod and stalk categories was 99.3% for both, which means that the model's segmentation performance for medium-large scale lotus pods and stalks is high. On Test Set B, the P, R, F1-score, and mAP0.5 values were 93.4%, 83.0%, 87.9%, and 88.8%, respectively, with AP0.5 values of 94.2% and 83.3% for the lotus pod and stalk categories.
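As a quick sanity check on the reported metrics, the F1-score is the harmonic mean of precision and recall, so the reported F1 values follow directly from P and R:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.967, 0.993), 3))  # Test Set A -> 0.98
print(round(f1_score(0.934, 0.830), 3))  # Test Set B -> 0.879
```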
Further, the segmentation performance on Test Set B was lower than that on Test Set A: the P, R, and mAP0.5 were 3.3%, 16.3%, and 10.5% lower, respectively. These data indicate that the model's missed segmentation rate increased, reflecting the difficulty of small-scale object segmentation. In addition, the model's segmentation performance for stalks was lower than that for lotus pods; the AP0.5 for stalks was 10.9% lower than that for lotus pods. This is because the stalk is much smaller than the lotus pod and occupies fewer pixels in the image, making it more difficult to detect and segment.
Figure 12a,b show the representative segmentation effects of LPSS-YOLOv5 on Test Sets A and B, respectively. Both medium-large scale lotus pods and stalks were effectively segmented. The generated masks fit the objects’ boundaries well. In the presence of multiple lotus pods, the model could accurately segment each object (
Figure 12(a3)). In addition, the model could still achieve accurate segmentation even if the lotus pod was partially obscured by surrounding lotus leaves, as shown in
Figure 12(a4).
Figure 12b shows the segmentation effects of the model on Test Set B. The model successfully segmented small-scale lotus pod and stalk objects in the actual growth environment, even though they occupy a very small proportion of the image. In addition, it is worth mentioning that similar objects in the image, such as leaf stalks, were not misidentified.
The above results indicate that LPSS-YOLOv5 can effectively and robustly segment lotus pods and stalks at various scales, meeting the requirements of robotic lotus pod harvesting in actual planting environments.
3.2. Ablation Experiment
An ablation experiment was conducted to verify the contribution of the improvement measures mentioned in
Section 2.3 to the performance of the LPSS-YOLOv5 model. First, two comparative models, YOLOv5-CBAM and YOLOv5-AFL (Adjustment of the Feature Layers), were established based on YOLOv5 v7.0. YOLOv5-CBAM was built by introducing the CBAM attention mechanism into the backbone and neck networks, while YOLOv5-AFL was established by adding a 160 × 160 small-object detection layer and removing the original 20 × 20 large-object detection layer.
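For illustration, the CBAM block referred to above has the standard two-stage structure: channel attention followed by spatial attention. The following is a minimal PyTorch sketch using the commonly cited defaults (reduction ratio 16, 7 × 7 spatial kernel), which are assumptions here; the actual insertion points and hyperparameters used in LPSS-YOLOv5 are those of Section 2.3.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: a shared MLP over global average- and max-pooled features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention: a convolution over channel-wise mean and max maps."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """CBAM: refine features with channel attention, then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```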
Table 4 lists the results of the ablation experiment for each model on Test Sets A and B. The mAP0.5 achieved with YOLOv5 v7.0, YOLOv5-CBAM, YOLOv5-AFL, and LPSS-YOLOv5 on Test Set A was 99.4%, 99.4%, 99.0%, and 99.3%, respectively, which means that each improvement measure had little impact on the model's segmentation accuracy for medium-large objects, and every model achieved very high segmentation accuracy. The mAP0.5 achieved with YOLOv5 v7.0, YOLOv5-CBAM, YOLOv5-AFL, and LPSS-YOLOv5 on Test Set B was 86.2%, 87.0%, 88.3%, and 88.8%, respectively. The results of the three improved models were higher than those of the original YOLOv5 v7.0, indicating that both improvement measures contributed to improving the model's segmentation performance for small-scale objects. Introducing the two measures simultaneously was most effective, yielding a 2.6% increase in mAP0.5.
On the other hand, the improvements affected the model's parameter count. Introducing the CBAM module increased the number of parameters from 7,401,119 to 7,467,631, indicating a limited increase in model complexity. Adding the 160 × 160 detection layer while removing the original 20 × 20 detection layer significantly reduced the parameter count from 7,401,119 to 5,645,727, because the 160 × 160 feature map carries fewer deep-layer features than the 20 × 20 feature map, which reduces the model's complexity. Therefore, after adopting both improvement measures, the final LPSS-YOLOv5 model achieved a significant reduction in parameters compared with the original YOLOv5 v7.0 model, from 7,401,119 to 5,705,041, which implies better deployment performance.
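Parameter counts like those quoted above can be verified directly in PyTorch once a model is instantiated; the loading call below is illustrative only, and 'lpss_yolov5.pt' is a placeholder checkpoint name, not a released file.

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of learnable parameters in the network."""
    return sum(p.numel() for p in model.parameters())

# Illustrative loading via the YOLOv5 hub entry point; the checkpoint
# name is a placeholder.
model = torch.hub.load("ultralytics/yolov5", "custom", "lpss_yolov5.pt")
print(count_parameters(model))  # 5,705,041 is reported for LPSS-YOLOv5
```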
Table 5 shows the test results for each category in the ablation experiment. On Test Set B, the AP0.5 values of YOLOv5-CBAM for the lotus pod and stalk categories were 94.4% and 79.6%, higher than the 94.0% and 78.3% achieved by the original YOLOv5 v7.0 model, respectively. This verifies that introducing the CBAM enhances the model's feature extraction ability for lotus pods and stalks in complex environments. The YOLOv5-AFL model achieved AP0.5 values of 93.8% and 82.7% for the lotus pod and stalk categories, respectively. Compared with the original YOLOv5 v7.0 model, it showed a slight decrease in lotus pod segmentation but a 4.4% increase in lotus stalk segmentation, verifying that adjusting the multi-scale feature layer structure effectively improves the model's segmentation performance for small-scale lotus stalks. With both improvement measures introduced, the LPSS-YOLOv5 model achieved an AP0.5 of 83.3% for lotus stalk segmentation, a 5.0% increase over the YOLOv5 v7.0 model.
Furthermore, as shown in Figure 13, the LPSS-YOLOv5 model obtained the best comprehensive segmentation performance for small objects when improved by the two measures simultaneously.
Figure 14 shows representative segmentation effects from the test. The results of the YOLOv5 v7.0 model contain both false segmentations (marked by red circles) and missed segmentations (marked by yellow circles). After introducing the CBAM, the previously missed lotus pod was successfully detected (Figure 14d), and the falsely detected lotus flower and lotus leaves (Figure 14b,e) were no longer detected by YOLOv5-CBAM. These results indicate that introducing the CBAM not only improved the model's feature extraction ability for lotus pods and stalks but also suppressed interference from surrounding irrelevant objects and reduced false segmentation. However, YOLOv5-CBAM still failed to segment a small-scale stalk and a heavily occluded lotus pod (Figure 14a,c).
On the other hand, YOLOv5-AFL correctly segmented the small stalk and occluded lotus pod objects that were not identified by the YOLOv5 v7.0 model (Figure 14a,c). However, it falsely segmented an immature lotus pod, lotus flowers, and leaves, which indicates that when only the AFL improvement was introduced, the model's ability to suppress interference from surrounding irrelevant objects was limited. In contrast, the LPSS-YOLOv5 model, which incorporates both improvements, achieved complementary advantages: it delivered the best comprehensive segmentation effect and avoided the false and missed segmentation problems described above. In summary, according to the results of the ablation experiments, the improvements made to YOLOv5 v7.0 in this study played their intended roles.
3.3. Performance Comparison of the Mainstream Instance Segmentation Models
To further verify the segmentation performance of the proposed model, the mainstream instance segmentation models Mask R-CNN [34,35] and YOLACT [36] were selected for contrast experiments with LPSS-YOLOv5. The same datasets were used for all models, and the results are shown in Table 6.
On Test Set A, the mAP0.5 of LPSS-YOLOv5 was 8.4% and 1.6% higher than those of Mask R-CNN and YOLACT, respectively, and similar to that of YOLOv5 v7.0. On Test Set B, the mAP0.5 of LPSS-YOLOv5 was 55.3%, 14.4%, and 2.6% higher than those of the Mask R-CNN, YOLACT, and YOLOv5 v7.0 models, respectively.
Figure 15 shows representative segmentation effects of each model on the two test sets. The YOLOv5 v7.0, Mask R-CNN, and YOLACT models all exhibited missed and false segmentations of lotus pods and stalks. In contrast, LPSS-YOLOv5 focused more fully on the object features and paid more attention to small objects when segmenting lotus pods and stalks; even for objects in complex environments, the model achieved good segmentation results.
Figure 16 compares some sample predicted masks with corresponding ground truth for the four models. As shown in
Figure 16a, for medium-large lotus pod and stalk objects, the differences between the predicted masks and the ground-truth masks were small for all four models, and each model achieved effective segmentation of all objects.
As shown in Figure 16b, for the segmentation of small objects, Mask R-CNN and YOLACT produced smoother mask contour edges but failed to segment all the objects. Combined with the performance results listed in Table 6, their high missed-detection rate for small objects limits their practical application. In contrast, although the mIOU indicators of LPSS-YOLOv5 and YOLOv5 v7.0, which reflect mask coverage quality and contour smoothness, were lower than those of Mask R-CNN and YOLACT, these models still achieved effective segmentation of the main body of the objects. Combined with the results in Table 6, provided a high recall rate is maintained, a slightly lower mask coverage quality has a limited impact on the practical application of the model.
In terms of detection speed, LPSS-YOLOv5 achieved 93.5 FPS, much higher than Mask R-CNN's 5.8 FPS and YOLACT's 51.3 FPS, though 10.6 FPS slower than YOLOv5 v7.0. Although the improvements reduced the detection speed, it remained significantly higher than that of the mainstream one-stage algorithm YOLACT. In addition, the model size of LPSS-YOLOv5 was only 12 MB, much smaller than the 343.1 MB of Mask R-CNN and the 194.4 MB of YOLACT. Therefore, LPSS-YOLOv5 is more feasible and practical to deploy on intelligent lotus pod harvesting robots than the other models.
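The timing protocol behind such FPS figures is not spelled out in the text; a common way to obtain comparable numbers is to time warmed-up inference over the test images, as in the sketch below.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup: int = 10) -> float:
    """Average end-to-end inference throughput in frames per second."""
    for img in images[:warmup]:      # warm-up pass excludes one-off setup costs
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # flush queued GPU work before timing
    start = time.perf_counter()
    for img in images:
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return len(images) / (time.perf_counter() - start)
```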
Combining the radar chart shown in Figure 17 with the above analysis, the comprehensive performance of LPSS-YOLOv5 was better than that of the mainstream Mask R-CNN and YOLACT models in terms of segmentation quality, real-time performance, and deployability.
3.4. Localization Effects of the Picking Point Based on LPSS-YOLOv5
To explore the feasibility of using the segmentation results of LPSS-YOLOv5 for picking point localization and pod–stalk affiliation confirmation, a method for localizing the stalk picking point and the lotus pod's key point in the 2D image was built, and corresponding 2D localization tests were performed in this section. The implementation steps are as follows: (1) obtain the detection boxes and segmentation masks of lotus pods and stalks in the image using LPSS-YOLOv5; (2) extract the pixel data within each detection box and mask area; (3) calculate the centroid of each mask using Equations (7) and (8) and determine the pixel coordinates of the centroid in the image. The centroid of the stalk mask serves as the picking point, and the centroid of the lotus pod mask serves as the key point representing the individual lotus pod. The line connecting the picking point and the nearest key point is used as the pod–stalk relationship vector, which can assist in judging the pose and approach direction of the picking end-effector.
$$x_c = \frac{\sum_{i=1}^{n} x_i \, f(x_i, y_i)}{\sum_{i=1}^{n} f(x_i, y_i)} \qquad (7)$$

$$y_c = \frac{\sum_{i=1}^{n} y_i \, f(x_i, y_i)}{\sum_{i=1}^{n} f(x_i, y_i)} \qquad (8)$$

where $(x_c, y_c)$ are the coordinates of the centroid pixel, $n$ is the total number of pixels in the detection box, $(x_i, y_i)$ are the coordinates of the $i$th pixel, and $f(x_i, y_i)$ is the pixel value, equal to 1 for pixels inside the mask and 0 otherwise.
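In code, Equations (7) and (8) reduce to averaging the coordinates of the mask pixels, since f(xi, yi) simply selects them. A minimal NumPy sketch, assuming the mask is a binary array of the image size:

```python
import numpy as np

def mask_centroid(mask: np.ndarray) -> tuple[float, float]:
    """Centroid (xc, yc) per Equations (7) and (8): the mean x and y
    coordinates of all pixels where f(xi, yi) = 1."""
    ys, xs = np.nonzero(mask)  # row (y) and column (x) indices of mask pixels
    return float(xs.mean()), float(ys.mean())
```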
Images with different scales and lighting conditions were selected from the test sets for testing, and the results are shown in Figure 18. It can be seen that the corresponding picking points (red) and key points (blue) of the lotus pods were successfully located under the different conditions. In addition, when multiple lotus pods were detected (
Figure 18(a2,b2,b3)), the affiliation relationship between the lotus pod and the corresponding lotus stalk could be correctly established. The test results indicate that the LPSS-YOLOv5 model is robust and could effectively support the picking point localization and the pod–stalk affiliation confirmation calculation tasks under various scales and lighting conditions.
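The affiliation rule itself, connecting each picking point to the nearest key point, amounts to a nearest-neighbor association in the image plane; the function below is an illustrative sketch of that rule, not the paper's exact implementation.

```python
import numpy as np

def associate_stalks_to_pods(stalk_points, pod_points):
    """Pair each stalk picking point with its nearest lotus pod key point;
    each returned pair defines one pod-stalk relationship vector."""
    pairs = []
    for sx, sy in stalk_points:
        d2 = [(sx - px) ** 2 + (sy - py) ** 2 for px, py in pod_points]
        pairs.append(((sx, sy), pod_points[int(np.argmin(d2))]))
    return pairs
```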
3.5. A 3D Localization Test Based on the Lotus Pod Harvesting Robot
Furthermore, a 3D localization test of lotus pods and stalks in the laboratory environment was conducted based on the self-developed lotus pod harvesting robot (
Figure 19a). The 3D information was acquired with a side-view depth camera (Intel RealSense D435) installed on the arm of the harvesting robot (Figure 19b). The camera uses stereoscopic depth technology with a global shutter, and its ideal depth range is 0.3–3 m. In the test, RGB and depth images of the sample lotus pods were captured with the camera, as shown schematically in Figure 19c. Both the RGB and depth images were set to a resolution of 1280 × 720. After acquisition, the depth image was aligned to the color image. Then, the method described in Section 3.4 was used to obtain the 2D localization information of the picking point and the lotus pod's key point. Finally, the 3D space coordinates of the points were obtained from the camera's intrinsic parameters and the corresponding depth data.
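The 2D-to-3D step is the standard pinhole back-projection, which the RealSense SDK exposes as rs2_deproject_pixel_to_point. Below is a distortion-free sketch of that computation under the assumption that the intrinsics fx, fy, cx, cy are those of the color camera after depth-to-color alignment.

```python
import numpy as np

def deproject_pixel(u: float, v: float, depth_m: float,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project pixel (u, v) with aligned depth d (in meters) into
    camera coordinates: X = (u - cx) * d / fx, Y = (v - cy) * d / fy, Z = d.
    Lens distortion is neglected in this sketch."""
    return np.array([(u - cx) * depth_m / fx,
                     (v - cy) * depth_m / fy,
                     depth_m])
```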
The localization results are shown in
Figure 20a–d. For single or multiple lotus pods, the combination of the LPSS-YOLOv5 model and the depth camera achieved effective 3D localization of the picking point and the lotus pod’s key point.
In summary, the LPSS-YOLOv5 model proposed in this study showed good performance in lotus pod and stalk detection and segmentation tasks. It maintained a high segmentation speed and achieved high accuracy and robustness while reducing model complexity, and it can provide the location and contour information of lotus pods and stalks for lotus pod harvesting robots to perform picking operations.
4. Conclusions
To achieve accurate segmentation of lotus pod and stalk objects in the unstructured planting environment, this study proposed an instance segmentation model named LPSS-YOLOv5. In the model network, the CBAM attention mechanism was introduced and the multi-scale feature layer structure was adjusted. Introducing the CBAM module improved the model's feature extraction ability for lotus pods and stalks, while adding a 160 × 160 small-scale detection layer and removing the existing 20 × 20 large-scale detection layer effectively improved the model's segmentation performance for small-scale lotus stalks and reduced the model size. The model's mAP0.5 on the medium-large scale and small-scale test sets was 99.3% and 88.8%, respectively.
Compared with the other mainstream instance segmentation algorithms, i.e., Mask R-CNN, YOLACT, and YOLOv5 v7.0: on the medium-large scale test set, the mAP0.5 of LPSS-YOLOv5 was 8.4% and 1.6% higher than those of Mask R-CNN and YOLACT, respectively, and similar to that of YOLOv5 v7.0; on the small-scale test set, it was 55.3%, 14.4%, and 2.6% higher than those of Mask R-CNN, YOLACT, and YOLOv5 v7.0, respectively. Notably, the AP0.5 for stalks achieved with LPSS-YOLOv5 was 83.3%, 5.0% higher than that of the YOLOv5 v7.0 model.
LPSS-YOLOv5 achieved a detection speed of 93.5 FPS, much higher than Mask R-CNN's 5.8 FPS and YOLACT's 51.3 FPS, and its model size was only 12 MB, much smaller than the 343.1 MB of Mask R-CNN and the 194.4 MB of YOLACT. The comprehensive performance of LPSS-YOLOv5 was better than that of the other mainstream models in terms of segmentation accuracy, speed, and deployability.
Finally, a method for localizing the stalk picking point and lotus pod’s key point in the 2D image was built and tested. A 3D localization test was conducted based on the self-developed lotus pod harvesting robot. The research results verified that LPSS-YOLOv5 could effectively support the picking point localization and the pod–stalk affiliation confirmation calculation.
In future work, we will explore deploying the LPSS-YOLOv5 model and the localization method into the control system of the lotus pod harvesting robot to support the end-effector in performing multi-DOF picking.