Plant phenotypes are a comprehensive manifestation of all physical, physiological, and biochemical characteristics and traits that reflect the structure, composition, and growth and development processes of plants, influenced jointly by their genotype and the environment [
1]. Understanding plant phenotypic characteristics and traits is a significant topic in biological research, as it helps uncover the complex interactions between the genome and environmental factors that shape plant phenotypes [
2]. This, in turn, enables the prediction of phenotypes from genotypes, facilitating the effective selection and improvement of crop varieties [
3], which is crucial in the pursuit of more efficient and sustainable agriculture. The goal of plant phenotyping is to understand and describe the morphological characteristics and traits of plants under different environmental conditions, which involves the quantitative measurement and evaluation of these traits. Earlier quantitative analysis methods relied primarily on manual, destructive measurements, which were slow, labor-intensive, and limited in accuracy. Consequently, the development of non-destructive, efficient, high-throughput phenotyping technologies has become a key research focus across various fields.
Two-dimensional digital phenotyping, while widely used for its high throughput and efficiency [
4], has several limitations. It lacks depth information, which results in incomplete representations of plant structures, especially when dealing with occlusions or overlapping parts. This makes the accurate measurement of three-dimensional traits, such as volume and spatial distribution, challenging. Additionally, 2D methods struggle with complex plant morphologies and are highly sensitive to environmental factors such as lighting and viewing angle, which can distort results. Together, these limitations prevent precise and reliable measurement of plant traits. In contrast, 3D phenotyping technologies capture depth data, allowing for more accurate measurements of plant size, volume, and spatial structure. By addressing the challenges of occlusion, overlap, and irregular morphology, 3D systems provide a more comprehensive and reliable approach to plant phenotyping, making them essential for advancing phenotypic research.
While it is relatively straightforward to obtain overall plant phenotypic information (such as plant height and convex hull volume), finer measurements at the organ or part level require organ-level segmentation to precisely isolate individual organs, such as fruits or leaves. In 3D phenotypic analysis, segmenting the model into distinct plant organs is a crucial yet challenging step for achieving accurate measurements of specific plant parts. Nevertheless, researchers have developed various organ-level segmentation methods, which have yielded promising results.
1.1. Traditional Phenotyping Methods
Traditional organ-level phenotyping methods primarily rely on clustering algorithms and region-growing algorithms. The basic principle of clustering methods is to group points in a point cloud into different clusters based on their features, such as spatial position, color, and normal vectors. Xia et al. [
5] employed the mean-shift clustering algorithm to segment plant leaves from background elements in depth images. By analyzing the vegetation characteristics of candidate segments generated through mean shift, plant leaves were extracted from the natural background. Moriondo et al. [
6] utilized structure-from-motion (SfM) technology to generate point cloud data for olive tree canopies. They combined spatial-scale and color features with a random forest classifier to effectively segment the point cloud into distinct plant structures, such as stems and leaves. Zermas et al. [
7] implemented an algorithm known as RAIN to segment maize plants. However, this method performs poorly on densely foliated canopies. Miao et al. [
8] introduced an automated method for segmenting stems and leaves in maize by leveraging point cloud data. This approach begins by extracting the skeleton of the maize point cloud, followed by the application of topological and morphological techniques to categorize and quantify various plant organs.
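To make the clustering principle concrete, the following minimal sketch groups a plant point cloud into candidate organ clusters by combining each point's position with its down-weighted surface normal. It is illustrative only, not any of the cited pipelines; the feature weighting, bandwidth value, and function name are assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_organs(points, normals, normal_weight=0.05, bandwidth=0.03):
    """Group an (N, 3) plant point cloud into candidate organ clusters
    using mean shift over combined position + orientation features."""
    # Stack xyz with scaled normals so both spatial position and surface
    # orientation influence the clustering, as in feature-based methods.
    features = np.hstack([points, normal_weight * normals])
    return MeanShift(bandwidth=bandwidth).fit_predict(features)
```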
The region-growing algorithm is a segmentation method based on point cloud similarity and neighborhood connectivity. Its fundamental principle involves starting from one or more initial seed points and progressively merging adjacent points that meet specific similarity criteria into the corresponding region until the entire point cloud is segmented into multiple clusters. Liu et al. [
9] developed a maize leaf segmentation algorithm that combines the Laplacian operator with the region-growing algorithm. This method first uses the Laplacian operator to generate the plant skeleton [
10] and then applies the region-growing algorithm [
11] to divide the point cloud into multiple clusters based on the curvature and angular variation of the normal vectors along the organ surface. Miao, Wen et al. [
12] applied a median-based region-growing algorithm [
13] to achieve semi-automatic segmentation of maize seedlings.
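A simplified sketch of this principle is shown below: seeds are chosen in flat, low-curvature areas, and a region grows across neighboring points whose normals deviate by less than a fixed angle. The thresholds and neighborhood radius are assumed values, and production implementations (e.g., the region-growing module in PCL) add further checks.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_grow(points, normals, curvature, radius=0.02,
                angle_thresh=np.deg2rad(10.0), curv_thresh=0.05):
    """Normal-based region growing over an (N, 3) point cloud.
    Returns an (N,) array of region labels."""
    tree = cKDTree(points)
    labels = np.full(len(points), -1, dtype=int)
    region = 0
    for seed in np.argsort(curvature):          # flattest points seed first
        if labels[seed] != -1:
            continue
        labels[seed] = region
        front = [seed]
        while front:
            i = front.pop()
            for j in tree.query_ball_point(points[i], r=radius):
                if labels[j] != -1:
                    continue
                # Merge a neighbor when its normal is nearly parallel to
                # the current point's normal ...
                cos = abs(float(normals[i] @ normals[j]))
                if np.arccos(np.clip(cos, -1.0, 1.0)) < angle_thresh:
                    labels[j] = region
                    # ... and keep growing only through smooth
                    # (low-curvature) points, so regions stop at organ edges.
                    if curvature[j] < curv_thresh:
                        front.append(j)
        region += 1
    return labels
```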
The parameter selection in traditional methods typically depends on the specific structure of the point cloud, which means that the processing parameters suitable for one plant may not be applicable to another. The effectiveness of a given method is influenced by factors such as the plant’s morphology and the quality of its 3D representation [
14].
1.2. Deep Learning-Based Phenotyping Methods
Deep learning approaches have become highly effective for segmenting 2D images [
15,
16]. However, most of these segmentation techniques are tailored for structured data and do not perform well with unstructured data, such as 3D point clouds.
PointNet [
17] was the first to directly use point cloud data as input for neural networks, accomplishing tasks such as object classification and segmentation. Building upon PointNet, Y. Li et al. [
18] developed a point cloud segmentation system specifically for automatically segmenting maize stems and leaves. However, PointNet struggles to capture local structures, limiting its ability to recognize fine-grained patterns and making it difficult to apply in more complex scenarios. In response, Qi et al. [
19] proposed PointNet++, which hierarchically partitions the input point set and applies PointNet recursively at multiple scales, partially addressing the limitations of PointNet. Heiwolt et al. [
20] applied the improved PointNet++ architecture to the segmentation of tomato plants, successfully using the network to directly predict the pointwise semantic information of soil, leaves, and stems from the point cloud data. Both Shen et al. [
21] and Hao et al. [
22] combined the PointNet++ architecture with an improved region-growing algorithm to achieve enhanced local feature extraction. They developed an organ segmentation network specifically designed for cotton seedlings, enabling accurate extraction of their 3D phenotypic structural information.
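The architectural idea behind this family of networks, and the local-structure weakness noted above, can be illustrated with a deliberately minimal PointNet-style encoder. This is a sketch, not the cited networks; all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Minimal PointNet-style segmentation head: a shared per-point MLP
    followed by a symmetric (max) pooling, which makes the output invariant
    to point order but blind to local neighborhood structure."""
    def __init__(self, in_dim=3, feat_dim=128, num_classes=4):
        super().__init__()
        self.mlp = nn.Sequential(          # applied independently per point
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, pts):                # pts: (B, N, 3)
        per_point = self.mlp(pts)          # (B, N, feat_dim)
        global_feat = per_point.max(dim=1).values   # symmetric pooling
        # Concatenate global context back onto each point for segmentation.
        fused = torch.cat(
            [per_point, global_feat.unsqueeze(1).expand_as(per_point)],
            dim=-1)
        return self.head(fused)            # per-point class logits
```

Because the only interaction between points is the single global max pooling, such a network cannot describe local neighborhoods, which is precisely what PointNet++'s hierarchical partitioning restores.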
The irregularity and unordered nature of point cloud data prevent the direct application of convolutional operators designed for regular data. The emergence of PointCNN [
23] and PointConv [
24] extended classic CNN architectures to point cloud data. Ao et al. [
25] used PointCNN to segment the stems and leaves of individual maize plants in field environments, overcoming key challenges in extracting organ-level phenotypic traits. Gong et al. [
26] proposed Panicle-3D, a network that achieves higher segmentation accuracy and faster convergence than PointConv, and applied it to the point clouds of rice panicles. However, this method requires large amounts of annotated data for training. Subsequently, Jin et al. [
27] proposed a voxel-based CNN (VCNN) as an alternative convolution method to extract features, applying it to semantic and leaf instance segmentation of LiDAR point clouds from 3000 maize plants. Zarei et al. [
28] developed a leaf instance segmentation system specifically designed for sorghum crops by integrating the EdgeConv architecture with the DBSCAN algorithm. This system was successfully applied in large-scale field environments. Li et al. [
29] introduced PlantNet, a dual-function point cloud segmentation network and the first architecture capable of handling multiple plant species, applying it to tobacco, tomato, and sorghum plants. Building on this, they enhanced the network’s feature extraction and fusion modules, naming the improved network PSegNet [
30], which achieved better results than PlantNet.
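The voxel-based route mentioned above can be illustrated by rasterizing the unordered points into a regular occupancy grid, after which standard 3D convolutions apply directly. This is a minimal sketch under an assumed grid size; VCNN itself is considerably more elaborate.

```python
import torch
import torch.nn as nn

def voxelize(points, grid=32):
    """Convert an unordered (N, 3) point cloud into a dense binary
    occupancy grid so that ordinary 3D convolutions become applicable."""
    pts = points - points.min(dim=0).values      # shift to the origin
    pts = pts / pts.max() * (grid - 1)           # scale into the grid
    idx = pts.long().clamp(0, grid - 1)
    vox = torch.zeros(1, 1, grid, grid, grid)
    vox[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vox

# A standard 3D convolution now applies directly to the regular grid.
conv = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
features = conv(voxelize(torch.rand(2048, 3)))   # (1, 8, 32, 32, 32)
```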
Another approach is the multi-view representation method, which projects 3D representations onto multiple 2D images and then uses 2D image segmentation networks to obtain segmentation results. The main challenges of this technique lie in determining viewpoint information and mapping the segmentation results back from 2D to 3D space. Shi et al. [
31] used ten grayscale cameras positioned at different angles to observe plants, capturing images in which plant and background could be cleanly separated, and generated a 3D point cloud of the plant using a shape-from-silhouette reconstruction method. They trained a fully convolutional network (FCN) based on an improved VGG-16 and a Mask R-CNN network to perform semantic and instance segmentation, respectively, and used a voting strategy to merge the segmentation results from the ten 2D images into a 3D point cloud of tomato seedlings. However, this voting scheme assumes that each point in the point cloud is visible to all ten cameras; due to occlusions, this assumption rarely holds for complex plants or scenes.
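A minimal sketch of such back-projection voting for the semantic case is given below. The function name and the simplified pinhole projection (no lens distortion) are assumptions, and the code makes the visibility assumption explicit: every projected point casts a vote whether or not it is actually occluded in that view.

```python
import numpy as np

def vote_point_labels(points, cameras, masks):
    """Fuse per-view 2D label images into per-point 3D labels by majority
    vote, assuming every point is visible in every view -- exactly the
    assumption that occlusion breaks in practice.

    points  : (N, 3) 3D points
    cameras : list of (3, 4) projection matrices
    masks   : list of (H, W) integer label images, one per camera
    """
    n_pts = len(points)
    n_classes = int(max(m.max() for m in masks)) + 1
    votes = np.zeros((n_pts, n_classes), dtype=int)
    homog = np.hstack([points, np.ones((n_pts, 1))])   # homogeneous coords
    for P, mask in zip(cameras, masks):
        proj = homog @ P.T                             # (N, 3) pixel coords
        z = np.where(proj[:, 2:3] > 0, proj[:, 2:3], 1.0)  # guard divide
        uv = np.round(proj[:, :2] / z).astype(int)
        h, w = mask.shape
        ok = (proj[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
                              & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        idx = np.where(ok)[0]
        votes[idx, mask[uv[idx, 1], uv[idx, 0]]] += 1  # one vote per view
    return votes.argmax(axis=1)                        # majority label
```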
Despite these methods and improvements, several key challenges remain in plant phenotype extraction: the difficulty of operating in complex scenes with occlusions, the reliance on high-quality annotated 3D plant datasets, and the limited generalization of methods across diverse plant segmentation tasks. To overcome these limitations, this paper proposes a zero-shot 3D leaf instance segmentation algorithm based on deep learning and multi-view stereo (MVS). This approach extends the generalization and zero-shot capabilities of the 2D segmentation model SAM into 3D space and introduces a novel fusion method to incrementally merge multiple instance segmentation results. The main contributions of this paper are as follows:
It introduces a zero-shot, training-free 3D leaf segmentation method based on multi-view images, which achieves higher accuracy than traditional methods and removes the deep-learning dependence on large amounts of annotated training data.
An incremental merging method based on confidence scores is proposed to integrate 3D point cloud instance segmentation results from multiple viewpoints (a generic sketch of this idea follows this list).
It provides an efficient, cost-effective, and scalable solution for phenotypic analysis in agriculture.
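As a purely illustrative aid for the second contribution (this is not the paper's algorithm, which is specified later; the IoU-based matching rule, threshold, and names are all assumptions), the following sketch shows one generic way confidence-scored 3D instance masks from successive viewpoints could be merged incrementally:

```python
def merge_view(global_insts, new_masks, new_scores, iou_thresh=0.5):
    """Illustrative incremental fusion of per-view 3D instance masks.

    global_insts : list of dicts {"pts": set of point indices, "score": float}
    new_masks    : list of sets of point indices segmented in the new view
    new_scores   : confidence score for each new mask
    A new mask is merged into the best-overlapping existing instance when
    their IoU exceeds `iou_thresh`; otherwise it starts a new instance.
    """
    for pts, score in zip(new_masks, new_scores):
        best, best_iou = None, 0.0
        for inst in global_insts:
            iou = len(pts & inst["pts"]) / len(pts | inst["pts"])
            if iou > best_iou:
                best, best_iou = inst, iou
        if best is not None and best_iou >= iou_thresh:
            best["pts"] |= pts                         # union of support
            best["score"] = max(best["score"], score)  # keep higher confidence
        else:
            global_insts.append({"pts": set(pts), "score": score})
    return global_insts
```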