Article

Automatic Method for Extracting Tree Branching Structures from a Single RGB Image

1 College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 310007, China
2 Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou 311300, China
3 Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Hangzhou 311300, China
4 State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China
5 State Key Laboratory of Subtropical Silviculture, Zhejiang A&F University, Hangzhou 311300, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Forests 2024, 15(9), 1659; https://doi.org/10.3390/f15091659
Submission received: 31 July 2024 / Revised: 17 September 2024 / Accepted: 18 September 2024 / Published: 20 September 2024
(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)

Abstract

Creating automated methods for detecting branches in images is crucial for applications like harvesting robots and forest monitoring. However, the tree images encountered in real-world scenarios present significant challenges for branch detection techniques due to issues such as background interference, occlusion, and varying environmental lighting. While there has been notable progress in extracting tree trunks for specific species, research on identifying lateral branches remains limited. The primary challenges include establishing a unified mathematical representation for multi-level branch structures, conducting quantitative analyses, and the absence of suitable datasets to facilitate the development of effective models. This study addresses these challenges by creating a dataset encompassing various tree species, developing annotation tools for multi-level branch structure labeling, and designing branch vector representations and quantitative metrics. Building on this foundation, the study introduces an automatic extraction model for multi-level branch structures that utilizes ResNet and a self-attention mechanism, along with a tailored loss function for branch extraction tasks. The study evaluated several model variants through both qualitative and quantitative experiments. Results from different tree images demonstrate that the final model can accurately identify the trunk structure and effectively extract detailed lateral branch structures, offering a valuable tool for applications in this area.

1. Introduction

The branch structure information of trees can be applied in various agri-forestry applications, including harvesting robots [1,2,3,4,5,6,7,8,9,10], tree pruning [11,12,13,14], tree health and growth monitoring [15,16,17], and phenotypic analysis [18,19,20]. In practical applications, it is often necessary to combine data from multiple types of sensors, e.g., visual or thermal cameras and laser scanners, to obtain accurate branching information [4,21,22,23,24]. However, image sensors are cheaper, and processing two-dimensional image data is more efficient than processing three-dimensional laser scanning data, making them more suitable for applications that require real-time performance or long-term monitoring [8,16,25]. At the same time, images captured from real scenes often contain complex backgrounds, varying degrees of occlusion caused by leaves and branches, and varying lighting conditions. These all pose significant challenges for developing methods to automatically extract structural information of trees from image data. Nonetheless, the ongoing progress in deep learning technologies is progressively improving image processing capabilities, thereby sustaining the active research focus on the extraction of tree branch structures from images.
Over the past decades, significant progress has been made in detecting tree branch structures from images. These studies can be divided into two main categories, namely methods based on traditional image processing techniques, e.g., [1,2,3,5,21,26,27], and methods based on deep learning, e.g., [9,10,12,13,17,25,28,29,30,31]. Early research was mainly based on image processing techniques, such as color space transformation and template matching, so its ability to handle complex scenes was limited. To obtain good results, the branches in the image had to be easily distinguishable from the background. The accuracy of these methods is easily affected by factors such as lighting conditions and occlusion in real scenes. In the following, several typical related works are reviewed.
Lu et al. [1] proposed a contrast-based tree trunk detection method based on the assumption that the trunks of trees are narrow and vertical to the ground. They applied bar filters as contrast templates to detect trunks within a limited range of diameters and depths. Auat Cheein et al. [2] extracted histogram of oriented gradients (HOG) features from the image and trained a support vector machine (SVM) classifier on such features to detect olive stems from images captured by a car-like mobile robot. Shao et al. [3] converted the RGB images to the Lab color space, where the pixels were further processed by chrominance classification and morphological operations; the edge boundaries of the trunk were then detected by the Hough transform. The algorithm was proposed to segment and detect tree trunks in outdoor fields, which is useful for forestry harvesting robots. Shalal et al. [21] developed a tree trunk detection method for orchards by fusing laser scanner and camera data. Their vision-based trunk detector was designed around the color and edge information extracted from the image in the HSV color space. Juman et al. [26] also pre-processed the oil-palm image in the HSV color space to eliminate the background from the trunk regions, and then applied the Viola and Jones detector to detect the trunk parts. They reported better results compared to the SVM-based methods, as used in [2].
Ji et al. [5] proposed a segmentation method for apple branch images based on an iterative threshold method. They first converted the RGB images to the XYZ and I1I2I3 color spaces, and then processed the images with the contrast limited adaptive histogram equalization method. They tested their method on 100 images taken under different illumination conditions and achieved better results than other threshold-based methods. Liu et al. [27] proposed a clustering-based method to detect citrus fruit and tree trunks in natural environments. They converted the RGB image into the YCrCb color space and constructed an elliptical boundary model to fit and assign the image pixels into various elliptical clusters. All such methods can only detect unobstructed tree trunks whose color is easily distinguishable from the background. Though several methods reported the performance on their collected images, e.g., [21,26], there is still a lack of applicable quantitative metrics and common datasets to effectively evaluate and compare the performance of different methods.
Traditional vision-based methods rely on manually designed features for branch detection, which limits the complexity of the images they can process. However, deep-learning-based methods extract the optimal features from images by learning from data, thus having greater advantages and practical potential. For deep-learning-based methods, it is necessary to collect a large amount of tree image data for model training, but data collection and annotation are not easy. Currently, deep-learning-based branch extraction methods can be mainly divided into two categories, namely object detection, e.g., [9,10,13,24,28,31], and semantic segmentation methods, e.g., [12,17,25,29].
Yang et al. [28] collected about 5000 images from a citrus orchard using the Kinect v2 camera under different lighting directions. Then, they trained a Mask R-CNN recognition model to detect the branches and fruits, and the bounding boxes of the detected branches were merged to reconstruct the branches and trunk of the citrus tree. By integrating the depth information with the detected branches, a picking robot can use such structural information for picking path planning and obstacle avoidance. They reported average recognition precisions of 88.15% and 96.27% for the fruits and branches, respectively. Su et al. [9] collected 1800 images of apple orchards in different seasons and under different light conditions to train a trunk detector based on an improved YOLOv5 model. The centroid of the detected trunk can then be used in path planning of orchard robots. The trunk detector achieved its highest precision of 98.37% on summer data and its lowest precision of 89.61% on winter data. Liu et al. [10] collected a total of 1500 pictures of Camellia oleifera trunks under different conditions, including single and multiple trunks with front-light or back-light, to train their improved YOLOv7 model. To enhance the detection accuracy, they incorporated a convolutional block attention module (CBAM) in the backbone of YOLOv7 and replaced the original loss function with the Focal-EIoU loss function. Their experimental results demonstrated that the proposed method achieves a mAP of 89.2%.
Targeting the application scenario of preventing damage from tree branches to power lines, Silva et al. [13] developed a branch detector based on a convolutional neural network (CNN). They trained the model on a dataset of about 1800 images, where each image contains at least one main tree branch. The CNN model predicted the pixels covered by the branch, then a Hough transform was applied to detect the straight line belonging to the branch, and a grip point was calculated on the detected line for their application. They computed the distances and angles between the detected and labeled lines to evaluate the method. Grondin et al. [31] proposed two densely annotated datasets, one containing 43,000 synthetic tree images generated using the Unity game engine and the other containing 100 images captured from real trees. They trained two models, Mask R-CNN and Cascade Mask R-CNN, which were first pre-trained on the synthetic dataset and then fine-tuned on the real dataset. The Cascade Mask R-CNN achieved a slightly better average precision when measuring the bounding box predictions of the tree trunks.
Chen et al. [12] collected 521 RGB and depth images from partially occluded apple trees, then three deep learning models, U-Net, DeepLabv3, and Pix2Pix, were trained and compared on their semantic segmentation results. They defined an occlusion difficulty index to quantify the difficulty of the branch segmentation task for occluded data. According to their experiments, U-Net outperformed the other two models in the given metric. Tong et al. [29] collected 2500 apple tree images to train various models based on the Mask R-CNN and Cascade Mask R-CNN by varying their backbones from ResNet to transformer architectures. From the segmented primary branches and trunk, they applied a skeletonization algorithm to locate the junction points between the trunk and the primary branches. According to their experiments, the combination of the Cascade Mask R-CNN with a transformer backbone outperformed the other models.
Li et al. [17] applied the salient object detection technique to detect tree trunks or branch parts from images. In their research, about 2500 tree images of various tree species in different urban scenes were collected. Their model can be divided into two stages, one for feature extraction and the other for feature aggregation. They employed U2Net as the base network and incorporated a texture attention module to enhance the extraction of local and global features. They applied seven metrics to evaluate the similarity between the predicted saliency map and the ground truth from multiple perspectives. Wan et al. [25] also employed U2Net, designing a new model called U2ESPNet by incorporating the efficient spatial pyramid module into the network. They collected about 1000 images over 200 apple trees to train the model. Their experimental results showed that the model achieved an Intersection over Union (IoU) of 75.35% and an F1-score of 85.94% in visible branch segmentation.
The successful extraction of branch structures from images, whether utilizing traditional methods or deep learning techniques, fundamentally relies on the identification of branch features. It is important to note that branches of varying sizes correspond to different pixel area coverage; as branch size diminishes, the number of effective pixels they encompass also decreases, thereby complicating the process of image feature extraction. Previous studies predominantly concentrated on detecting tree trunks, which were subsequently represented as bounding boxes or pixel regions. Furthermore, many of these models were tailored to specific tree species, limiting their generalizability to other species. Currently, deep learning methods are prevalent due to their efficacy in processing images captured in real-world environments. Despite notable advancements, additional research is necessary to mitigate the effects of occlusion and varying lighting conditions on the accuracy of branch detection in practical applications. A contemporary research focus involves the extraction of multi-level branch structures to facilitate more precise operations, which presents a significant challenge. This challenge arises primarily from the need to develop consistent representations of multi-level branch structures that align with the requirements of deep learning models. Additionally, the creation of a well-annotated branch dataset poses another obstacle. As branch size decreases at higher branching levels, even state-of-the-art object detection models based on deep learning encounter difficulties in effectively extracting relevant features from images. Therefore, it is imperative to explore model architectures and evaluation metrics specifically tailored for branch detection, rather than confining the investigation to conventional object detection or semantic segmentation frameworks.
In this study, we have developed an end-to-end framework utilizing deep neural networks to tackle the aforementioned challenges. This framework is designed to automatically extract multi-level tree branching structures from a single RGB image. Our main contributions are summarized as follows:
  • We developed a dataset of tree images that encompasses a diverse range of tree species, with each image featuring a meticulously annotated multi-level branching structure. To support annotating tree branches at different levels, we developed a dedicated interactive annotation tool and an XML file format to store the tree branch data.
  • We designed a vector representation to encode the multi-level tree branches data consistently for diverse tree structures.
  • We developed an end-to-end tree branching structure extraction model by incorporating the attention mechanism and a specially designed loss function based on the proposed branch representation.
  • We designed quantitative metrics to measure the topological and geometrical similarities between two tree branching structures, which are then used to evaluate the performance of the model.

2. Materials and Methods

The proposed tree branching structure extraction model is a neural network consisting of a CNN backbone and two branch heads, as illustrated in Figure 1. To successfully train such a model, an image dataset with annotated branching structures is needed. Several datasets from previous studies, such as those by [9,12,28,29,31], may have constraints when utilized in our research for model training. For instance, some of these datasets, like those by [12,29], are primarily tailored to a specific tree type, such as fruit trees in orchards, thereby limiting the generalizability of the model across various tree species. Additionally, these studies have primarily focused on annotating trunk or lateral branch bounding boxes using tools like Labelme [32], which are inadequate for annotating complete tree branching structures at different scales. To overcome these limitations, we first gathered a diverse tree image dataset encompassing various tree species and subsequently developed an interactive annotation tool for annotating the tree branching structure data in each image.

2.1. Dataset Construction

The dataset comprises 1729 images of trees representing a variety of tree species such as Ginkgo biloba, Populus bolleana, Platanus acerifolia, Betula platyphylla, Pterocarya stenoptera, Pinus sylvestris, Acer palmatum, Cinnamomum camphora, and Salix babylonica. These images were obtained through three distinct methods: 655 images were downloaded from online sources, 376 images were captured using mobile phones from real-life trees, and 698 images were created using AIGC tools [33]. Figure 2 illustrates a subset of tree images from the dataset, showcasing samples obtained through each of these three acquisition methods. The mobile phone images were taken on the campus of Zhejiang A&F University in Hangzhou, Zhejiang, China, with an image resolution of 2304 × 4096. These images were taken between 9 a.m. and 6 p.m. from October 2021 to October 2022, with a total of 233 images taken between 9 a.m. and 12 p.m. and 143 images taken between 1 p.m. and 6 p.m. Different weather conditions in different seasons were considered when collecting these images. Among them, 205 images were taken on cloudy days and 171 images were taken on sunny days. In total, 86 images were taken from January to March, 98 images from April to June, 139 images from July to September, and 53 images from October to December.
For the convenience of further discussion, it is necessary to define some basic concepts for characterizing the branching structure of trees. Branches are essential elements of a tree’s branching system, each comprising multiple nodes. The hierarchical tree branching structure includes branches located at different levels. As per conventional terminology, the main branch at level 0 is known as the trunk, while branches at level 1 encompass all lateral branches directly connected to the trunk, with this pattern continuing for subsequent levels. The primary node of the trunk is designated as the root. Additionally, a set of parameters has been established to provide a clearer expression of the aforementioned concepts (Table 1). The parameter N is used to characterize the number of nodes on the trunk. The structure of the lateral branches is more complex and needs more parameters to characterize it. Based on observations of real-world trees as well as some botanical prior knowledge, we defined four parameters, namely L, k, n, and M, to characterize the structure of the lateral branches. The interpretations of these parameters are listed in Table 1.
A typical annotation process for one image is demonstrated in Figure 3. The user can select a current branching level to start the annotation. For any branch at the given level, the user needs to annotate the nodes on the branch, and the annotation tool will automatically connect these nodes by line segments (the red lines in Figure 3). The 2D positions of all the nodes at one level are grouped together, followed by the corresponding branches connected by these nodes. These annotation data are saved in an XML file organized in the way shown on the right of Figure 3. In the middle of Figure 3, the annotated data are visualized as a binary image.
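To give a concrete impression of this layout, the following Python snippet sketches how such a per-level annotation file could be written; the tag and attribute names are our own illustration and not necessarily the exact schema used by our annotation tool:

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of the per-level XML layout: node positions at each
# branching level are grouped together, followed by the branches that
# connect them. Tag and attribute names are illustrative only.
root = ET.Element("tree", image="sample_0001.jpg")
level0 = ET.SubElement(root, "level", id="0")                  # level 0 = trunk
nodes = ET.SubElement(level0, "nodes")
for i, (x, y) in enumerate([(512, 1900), (520, 1500), (531, 1100)]):
    ET.SubElement(nodes, "node", id=str(i), x=str(x), y=str(y))
branches = ET.SubElement(level0, "branches")
ET.SubElement(branches, "branch", nodes="0 1 2")               # trunk as a node sequence
ET.ElementTree(root).write("sample_0001_annotation.xml")
```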

2.2. Resampling of Branch Structure

The collected annotation data cannot be directly used by a deep neural network, since the branches have varying numbers of nodes across different trees and branching levels. In order to represent the branching structure in a consistent way, we developed two steps to process the annotation data. An overview of these steps is shown at the top of Figure 1. Here, we first introduce the branch resampling step; the vectorization step will be introduced in the next section.
In Figure 4a–c, three histograms are computed to show the distributions of the values of the parameters N, n, and L in the annotated dataset. These numbers will be used in resampling and vectorizing the tree branching structure data. In computing these histograms, we excluded the tree images generated by AI tools since they are not real trees. According to Figure 4a, N = 20 is the maximum number of annotated trunk nodes over all trees. The average number of annotated nodes on each lateral branch peaks at n = 5 according to Figure 4b. According to Figure 4c, only a small fraction of trees have more than 20 annotated lateral branches, so L = 20 may be a reasonable choice.
The goal of the resampling step is to make the number of nodes at each branch equal to a given value. In Figure 5, we resampled a user annotated branching structure (left tree in Figure 5) such that the number of nodes at level 0 is N = 20 and the number of nodes of each lateral branch at level 1 is n = 5 (right of Figure 5). These numbers are set according to the statistics we computed from the dataset in Figure 4a–c.
The annotated branching structure of a tree image is represented as a set of line segments. In order to resample a branch, we interpolated a smooth curve through the annotated image points. We used the natural cubic spline method, since it can generate a $C^2$-continuous parametric curve from the control points [34]. Suppose the $n+1$ user-annotated 2D image points on a branch are denoted as $P_0, P_1, \ldots, P_n$. Then each pair of consecutive points $P_i$ and $P_{i+1}$ (with $0 \le i < n$) defines a cubic arc $C_i(u)$ ($0 \le u \le 1$) as follows:
$$C_i(u) = H_0(u)\,P_i + H_1(u)\,P_{i+1} + H_2(u)\,P'_i + H_3(u)\,P'_{i+1}. \quad (1)$$
The $H_i(u)$ functions in Equation (1) are the cubic Hermite blending polynomials [34], and $P'_i$, $P'_{i+1}$ are the tangents at points $i$ and $i+1$. These arcs are joined end to end to generate the final smooth curve $C(u)$ through the $n+1$ points. Finally, we can resample each branch curve $C(u)$ uniformly to generate the desired number of nodes on it (the right tree in Figure 5).
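As an illustration, the following Python sketch performs this resampling using SciPy's natural cubic spline as a stand-in for the Hermite construction in Equation (1); the chord-length parameterization of the annotated points is our own assumption:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_branch(points, num_nodes):
    """Resample an annotated branch polyline to a fixed number of nodes.
    `points` is an (n+1, 2) array of annotated image coordinates. A natural
    cubic spline (C2-continuous) is fitted through the points and then sampled
    uniformly in the curve parameter."""
    points = np.asarray(points, dtype=float)
    # Parameterize by cumulative chord length to handle unevenly spaced nodes.
    d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(points, axis=0), axis=1))]
    t = d / d[-1]
    spline = CubicSpline(t, points, bc_type='natural')   # natural end conditions
    u = np.linspace(0.0, 1.0, num_nodes)
    return spline(u)                                      # (num_nodes, 2) resampled nodes

# Example: resample the trunk to N = 20 nodes and a lateral branch to n = 5 nodes.
# trunk_nodes = resample_branch(annotated_trunk_points, 20)
# lateral_nodes = resample_branch(annotated_lateral_points, 5)
```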

2.3. Vector Representations for Tree Branching Structures

A good mathematical representation for tree branching structures should be able to handle diverse tree species in a consistent way. However, this is not trivial, since branching structures vary considerably both within and across tree species. To this end, we designed a vector representation that does not depend on the tree species and can be easily incorporated into a deep neural network. In this study, we mainly focused on the first two levels of the branching structure, since this is enough to characterize the overall structure of almost all trees [35]; it is straightforward to extend the vector representation to finer levels of branches. By design, the final vector representation consists of two groups of data collected from the trunk and the lateral branches. Each group is a sub-vector of the whole vector representation with a different size.
To obtain a fixed-size vector representation, the values of N and M are fixed in our design. The total size of the vector is then 2N + 2M, since each node stores a pair of image coordinates (x, y). For the trunk, we can simply store the trunk nodes sequentially in its sub-vector. However, it is more difficult to represent the structure of the lateral branches in a fixed-size vector. For real trees, the number of lateral branches at each trunk node varies, so the value of k should not be fixed. Instead, the total number of lateral branches (L) and the number of nodes on each lateral branch (n) are fixed, which yields a fixed value of M = L × n. Based on this parameterization, the vector encoding and decoding processes for the branch representation are detailed as follows.
To get a consistent vector representation, it is necessary to define the order in which the various nodes are processed. In our design, the trunk nodes are processed in a bottom-up order. If a trunk node has several lateral branches, these lateral branches are processed in a left-to-right order. For the left lateral branches, the order is from the lowest branch to the highest one; for the right lateral branches, the order is from the highest branch to the lowest one.
In vector encoding (see Figure 6), the coordinates of the trunk nodes are stored one by one in its 2N-sized sub-vector. Then, for each trunk node, if the node has some lateral branches, we store the lateral nodes’ coordinates in the corresponding positions of the 2M-sized sub-vector. If the node has no lateral branches, we simply skip it and continue to the next one. This process is repeated until the total number of lateral nodes is reached. The dataset contains trees whose number of lateral nodes may be either below or above the specified value M. In instances where the number of lateral nodes is less than M, all remaining positions in the vector are set to zero. Conversely, when the number of lateral nodes exceeds M, the excess node data are simply omitted. During the vector decoding process, given that both the trunk and lateral nodes are fixed in number, the 2D coordinates of the nodes can be sequentially extracted from the vector. Subsequently, every n consecutive positions from the lateral branch group are interpreted as representing a lateral branch.
In Figure 6, we illustrate the vector representations for two example trees. For both trees, N = 4, L = 6, and n = 3; thus, the total number of lateral nodes is M = L × n = 18. The trunk part of the two trees is the same, so we only discuss the differences in the lateral part. For the top tree, only the lateral nodes from trunk nodes 2 and 3 are stored. Trunk nodes 1 and 4 are skipped since they have no lateral branches. Because the total number of lateral nodes of trunk nodes 2 and 3 is only 9, the extra positions of the vector are all filled with zeros. For the bottom tree, the lateral nodes of trunk nodes 2 and 3 are stored in the vector. However, for trunk node 4, the limit M is reached after storing the nodes on its left branch, so the nodes of the right branch are discarded in the final vector representation.
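A minimal Python sketch of this encoding and decoding, assuming the branches have already been resampled and sorted into the traversal order described above, is given below; the function names and array layout are illustrative:

```python
import numpy as np

def encode_tree_vector(trunk_nodes, lateral_branches, N=20, L=20, n=5):
    """Sketch of the vector encoding. `trunk_nodes` is an (N, 2) array of
    resampled trunk node coordinates in bottom-up order; `lateral_branches` is
    a list of (n, 2) arrays already sorted by the traversal order described in
    the text. Lateral data beyond M = L * n nodes are dropped; missing entries
    are zero-filled."""
    M = L * n
    vec = np.zeros(2 * N + 2 * M)
    vec[:2 * N] = np.asarray(trunk_nodes, dtype=float).reshape(-1)  # trunk sub-vector
    lateral = (np.concatenate([np.asarray(b, float).reshape(-1) for b in lateral_branches])
               if lateral_branches else np.zeros(0))
    lateral = lateral[:2 * M]                       # discard any excess lateral nodes
    vec[2 * N:2 * N + lateral.size] = lateral       # remaining positions stay zero
    return vec

def decode_tree_vector(vec, N=20, L=20, n=5):
    """Inverse mapping: split the fixed-size vector back into N trunk nodes and
    L lateral branches of n nodes each."""
    trunk = vec[:2 * N].reshape(N, 2)
    lateral = vec[2 * N:].reshape(L, n, 2)
    return trunk, lateral
```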

2.4. Tree Branching Structure Extraction Model

In the development of an automated method for extracting tree branching structures from RGB images, it is essential to create a model capable of not only autonomously identifying significant features within the image but also establishing a correspondence between these features and the vector representations of the tree branching structures. To address these requirements, we have formulated the tree branching structure extraction model as an end-to-end deep neural network.
The model consists of three building blocks. The first building block is a convolutional deep neural network, which is used to extract image features from the input image. The second building block is a set of attention layers; their goal is to enable the network to collect and associate CNN features from a large neighborhood to represent the global relationships of the objects in the image. This is useful for extracting tree branching structures from images, since there are natural and strong relationships between the lateral branches and the trunk. The third building block is the loss function, defined on the vector representations, which penalizes mismatches between the predicted and the annotated values. The first two building blocks together can be viewed as a backbone network and the third as a multi-head layer on top of the backbone. The overall architecture of the proposed model is illustrated in Figure 1.
There are different ways to combine attention mechanisms with CNNs, and one simple way is to replace some of the top layers of a CNN with attention layers. Such a design has two advantages. First, it is convenient to combine off-the-shelf CNNs and attention layers into a new network without spending a lot of effort building it from scratch. Second, for applications with small datasets, as in our case, reusing the pre-trained CNN weights makes training such a model more efficient than training a pure attention model. In this study, we chose the combination of the popular ResNet model [36] and the transformer’s self-attention layers [37], as in the BoTNet model [38].
ResNet is a widely used CNN model, which consists of multiple bottleneck structures with residual connections between them. BoTNet is a network constructed from ResNet by replacing the last three bottleneck structures with three bottleneck transformer structures. The key design of the bottleneck transformer structure is to use a multi-head self-attention (MHSA) layer to transform the feature maps instead of the traditional 3 × 3 convolution layer used in ResNet. By using such MHSA layers, information about the global dependencies across the features can be extracted, which is useful for inferring the positional relationships among branches (see the comparison discussions in Section 3.1).
In Figure 1, the overall structure of our method is illustrated, and the attention mechanism is illustrated in Figure 7. The feature maps X from the previous layer go through 1 × 1 convolutions to compute the query (q), key (k), and value (v) encodings. The p terms in Figure 7 are positional encodings. These encodings are grouped into different matrices and combined by the operations illustrated in Figure 7, and the final output Z has the same shape (w, h, d) as the input feature maps X. Please refer to [37,38] for more details of the self-attention mechanism.
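For readers who prefer code, the following PyTorch sketch shows a simplified multi-head self-attention layer operating on CNN feature maps in the spirit of the BoTNet block; the relative positional encoding used by BoTNet is replaced here by a learned absolute positional embedding for brevity, and all module names are our own:

```python
import torch
import torch.nn as nn

class MHSA2D(nn.Module):
    """Simplified multi-head self-attention over a 2D feature map. The query,
    key, and value encodings are produced by 1x1 convolutions as in Figure 7,
    and the output keeps the same (w, h, d) shape as the input. Unlike BoTNet,
    a learned absolute positional embedding is used here, purely to keep the
    sketch short."""

    def __init__(self, dim, heads=4, feat_h=7, feat_w=7):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_k = nn.Conv2d(dim, dim, kernel_size=1)
        self.to_v = nn.Conv2d(dim, dim, kernel_size=1)
        # One positional entry per pixel of the feature map (assumed size).
        self.pos = nn.Parameter(torch.randn(1, dim, feat_h, feat_w) * 0.02)

    def forward(self, x):                        # x: (B, dim, H, W)
        b, d, h, w = x.shape
        q = self.to_q(x + self.pos)              # position-aware query
        k = self.to_k(x + self.pos)
        v = self.to_v(x)

        def split(t):                            # (B, heads, dim/heads, H*W)
            return t.reshape(b, self.heads, d // self.heads, h * w)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q.transpose(-2, -1) @ k * self.scale, dim=-1)  # (B, heads, HW, HW)
        out = v @ attn.transpose(-2, -1)         # aggregate values for every query position
        return out.reshape(b, d, h, w)           # same shape as the input feature maps
```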
In a forward pass, a tree image is fed into the CNN to compute the feature maps. Then, these feature maps are further transformed by the attention layers to compute a tree branching structure vector. The sub-vectors of the trunk and lateral branches are computed by two fully connected layers separately, which are the two heads illustrated in Figure 1. The final loss is defined as a weighted sum of the trunk loss and the lateral branch loss. In Equation (2), the loss function for one image sample is given:
$$L(y, \hat{y}; \Theta, X) = \alpha \,\| y_T - \hat{y}_T \|_1 + \beta \,\| y_L - \hat{y}_L \|_1, \quad (2)$$
where L is the loss function for a training sample; y and $\hat{y}$ denote the true and predicted tree branching structure vectors, respectively, the latter being dependent on Θ and X; Θ is the parameter set of the model to be learned; X is the input image; and $\|\cdot\|_1$ denotes the $L_1$ norm of a vector. The two terms on the right side of the equation quantify the prediction errors of the trunk and the lateral branches, respectively (refer to Figure 1 for a visual representation). Here, $y_T$ and $\hat{y}_T$ are the sub-vectors of the trunk group, and $y_L$ and $\hat{y}_L$ are the sub-vectors of the lateral branch group. The two weights α and β control the importance of each group. In training, the final loss function is defined over a mini-batch of images.
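A minimal PyTorch sketch of this loss, assuming the trunk sub-vector occupies the first 2N entries of the structure vector, is as follows; the function name is our own:

```python
import torch

def branch_structure_loss(pred, target, n_trunk=20, alpha=1.0, beta=1.0):
    """Sketch of Equation (2): a weighted sum of L1 errors over the trunk
    sub-vector (first 2N entries) and the lateral-branch sub-vector (remaining
    2M entries). `pred` and `target` have shape (batch, 2N + 2M)."""
    split = 2 * n_trunk
    trunk_err = torch.abs(pred[:, :split] - target[:, :split]).sum(dim=1)    # ||y_T - y^_T||_1
    lateral_err = torch.abs(pred[:, split:] - target[:, split:]).sum(dim=1)  # ||y_L - y^_L||_1
    return (alpha * trunk_err + beta * lateral_err).mean()                   # mean over the mini-batch
```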

2.5. Model Training

The models are implemented in the Python language (3.9.13) using the PyTorch framework (1.12.1), and were trained on a desktop PC using an Nvidia GeForce RTX 3080 GPU (Nvidia, Santa Clara, CA, USA) and an Intel(R) Core(TM) i5-13600KF CPU (Intel, CA, USA) with 16.0 GB of RAM (Corsair, CA, USA).
Before training, our dataset was divided into a training set and a testing set with a ratio of 8:2. Based on the ResNet backbone, we experimented with models of various depths, both with and without the self-attention bottlenecks. For the models without the self-attention bottlenecks, we trained three models with varying depths, called TreeResNet-50, TreeResNet-101, and TreeResNet-152, respectively (see Supplementary Figure S1). For the models with the self-attention bottlenecks, we also trained three models, called TreeBotNet-50, TreeBotNet-101, and TreeBotNet-152, respectively (see Supplementary Figure S2).
Due to the constraints imposed by the limited size of our dataset, we employed transfer learning to train all six deep neural network models. For the three TreeResNet models, we froze the weights of the backbone and only optimized the weights of the fully connected layers that connect the backbone to the tree branching structure vector. Similarly, for the three TreeBotNet models, we also froze the weights of the backbone, but learned the weights within the self-attention bottlenecks and the weights of the final fully connected layers leading to the tree branching structure vector (refer to Figure 1). According to the statistics computed in Section 2.2, we set N = 20, L = 20, and n = 5. Thus, the size of the branching structure vector is 2N + 2M = 240, where M = L × n = 100. For the loss in Equation (2), we set both α and β to 1.0 to give equal importance to the trunk and lateral branches. In training, the learning rate was set to 0.0001, the batch size was 16, and all the tree images in a batch were resized to 224 × 224. In Figure 8, the convergence curves of the training loss and validation loss over 1200 epochs are given. It is evident from the figure that the models with self-attention bottlenecks converged to a lower loss value than the models without attention.
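For concreteness, the following PyTorch sketch shows the transfer-learning setup for a TreeResNet-style variant, with the pretrained backbone frozen and only the new fully connected heads optimized; the module names, the use of torchvision's ResNet-50 weights, and the choice of the Adam optimizer are our own illustrative assumptions:

```python
import torch
import torchvision

# Sketch of the transfer-learning setup: a pretrained ResNet backbone is kept
# frozen, and only the two heads mapping image features to the 240-dimensional
# branching structure vector (2N = 40 trunk values + 2M = 200 lateral values)
# are trained.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feat_dim = backbone.fc.in_features
backbone.fc = torch.nn.Identity()              # expose the pooled CNN features

for p in backbone.parameters():                # keep the pretrained weights fixed
    p.requires_grad = False

trunk_head = torch.nn.Linear(feat_dim, 40)     # 2N trunk coordinates
lateral_head = torch.nn.Linear(feat_dim, 200)  # 2M lateral-branch coordinates

optimizer = torch.optim.Adam(
    list(trunk_head.parameters()) + list(lateral_head.parameters()), lr=1e-4)

def predict_structure_vector(images):          # images: (B, 3, 224, 224)
    feats = backbone(images)
    return torch.cat([trunk_head(feats), lateral_head(feats)], dim=1)  # (B, 240)
```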

2.6. Model Evaluation

The model proposed in Section 2.4 will produce a vector representation for the tree branching structure. Thus, it is not suitable to adopt the common metrics used in object detection tasks, e.g., the IoU of two boxes, to evaluate the accuracy of the model. To compare two tree branching structures in a quantitative way, both the geometrical and topological differences should be considered. However, there is still no recognized optimal method in the literature to measure the similarity between two tree branching structures. To this end, we developed two metrics to approximately compute the similarity of two tree branching structures. The first metric is based on the edit distance, which is designed to compute the topological distance from one tree to another. The second metric is based on the Euclidean distance between two vectors, which is designed to compute the geometrical distance of two trees.
The edit distance is a common similarity measure for two strings [39]. Three editing operations, namely replace, insert, and delete, are combined to transform one string to another and the minimum number of the required operations is the edit distance from the target string to the source. To adopt the edit distance in our task, the key is to encode the tree branching structure as a string. In Figure 9, an illustration of our string encoding of a tree is given. Each branch segment between two nodes on the trunk is coded as ‘T’. Each branch segment on the left or right lateral branches is coded as ‘L’ or ‘R’, respectively. By following the same encoding order as in Section 2.3, the final string can be computed. In Figure 9, the center tree is obviously more similar to the reference tree than the right tree, which is consistent with their edit distances.
We can use the edit distance to evaluate the model’s accuracy by computing the distance between the predicted tree structure and its annotated structure. Thus, we defined the mean edit distance (MED) over a set T of |T| tree pairs $(\hat{t}_i, t_i)$ as follows:
$$MED(T) = \frac{1}{|T|} \sum_{i=1}^{|T|} ED(\hat{t}_i, t_i), \quad (3)$$
where $\hat{t}_i$ and $t_i$ are the vector representations of the two trees. The edit distance sub-routine ED was implemented using a dynamic programming algorithm with time complexity O(NM) when the two strings' sizes are N and M, respectively [39].
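A compact Python sketch of the ED sub-routine and the MED metric, assuming each tree has already been encoded as a 'T'/'L'/'R' string as in Figure 9, is as follows:

```python
def edit_distance(a, b):
    """Standard dynamic-programming edit (Levenshtein) distance between two
    branch strings such as 'TTLRTL'; time complexity O(len(a) * len(b))."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete
                        dp[j - 1] + 1,                  # insert
                        prev + (a[i - 1] != b[j - 1]))  # replace (free if equal)
            prev = cur
    return dp[-1]

def mean_edit_distance(pairs):
    """MED of Equation (3) over (predicted_string, labeled_string) pairs."""
    return sum(edit_distance(p, t) for p, t in pairs) / len(pairs)
```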
The edit distance is a topological metric that is insensitive to the geometry of the tree, for example, the node positions and branch lengths. Thus, we also defined a metric based on the Euclidean distance between two trees’ branch vectors to measure their geometrical distance. The mean geometrical distance (MGD) over the set of tree pairs T is then defined as follows:
$$MGD(T) = \frac{1}{|T|} \sum_{i=1}^{|T|} GD(\hat{t}_i, t_i), \quad (4)$$
$$GD(\hat{t}_i, t_i) = \frac{1}{K} \sum_{k=0}^{K-1} \sqrt{(\hat{t}_i[2k] - t_i[2k])^2 + (\hat{t}_i[2k+1] - t_i[2k+1])^2}, \quad (5)$$
where K is the total number of tree nodes and $\hat{t}_i[2k]$, $\hat{t}_i[2k+1]$ are the 2D coordinates of the k-th node in the tree branching structure vector $\hat{t}_i$.
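The geometrical metric can be sketched directly on the structure vectors; the helper names below are our own:

```python
import numpy as np

def geometric_distance(pred_vec, true_vec):
    """GD of Equation (5): mean Euclidean distance between corresponding nodes
    of two tree branching structure vectors of length 2K."""
    p = np.asarray(pred_vec, float).reshape(-1, 2)
    t = np.asarray(true_vec, float).reshape(-1, 2)
    return float(np.mean(np.linalg.norm(p - t, axis=1)))

def mean_geometric_distance(pairs):
    """MGD of Equation (4) over (predicted, labeled) structure-vector pairs."""
    return sum(geometric_distance(p, t) for p, t in pairs) / len(pairs)
```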

3. Results

Here, the model proposed in this study will be analyzed and validated from multiple perspectives. We first present some quantitative and qualitative comparison results for different models. Then, the prediction results of the trunk and the lateral branching structures from various trees will be presented. The quantitative metrics used are the MED and MGD defined in Equations (3) and (4). The qualitative results are visualizations of the predicted tree branching structures’ vectors (the ‘Visualization’ step in Figure 1). A post processing step for filtering lateral branches was applied to the predicted tree branching structure vector. A lateral branch will be filtered out if the distance between its starting node to the closest trunk node is greater than d pixels. In practice, we found d = 10 is a reasonable value and is used for all the results presented. The results of the subsequent analysis will be based on the filtered branch structures.
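A minimal sketch of this filtering step, assuming the decoded trunk and lateral arrays from Section 2.3, is as follows; the function name is our own:

```python
import numpy as np

def filter_lateral_branches(trunk, lateral, d=10.0):
    """Drop a predicted lateral branch when its starting node lies more than d
    pixels from the closest trunk node. `trunk` is (N, 2), `lateral` is
    (L, n, 2); zero-padded (void) branches are skipped."""
    kept = []
    for branch in lateral:
        if not np.any(branch):                 # void (all-zero) branch from padding
            continue
        start = branch[0]
        dist = np.min(np.linalg.norm(trunk - start, axis=1))
        if dist <= d:
            kept.append(branch)
    return kept
```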

3.1. Comparison of Results for Different Model Variants

We conducted experiments utilizing six models based on various ResNet backbones, both with and without self-attention bottlenecks (refer to Section 2.5). Additionally, to assess the influence of AI-generated images on model performance, we trained these models on datasets that included and excluded such images. To facilitate a quantitative comparison of their performance, we calculated the Mean Edit Distance (MED) and Mean Geometric Distance (MGD) for these models on the test set, with the results presented in Table 2.
From the quantitative analysis, three key observations emerge. First, models trained on datasets containing AI-generated images consistently outperform those trained on datasets devoid of such images. Second, there is a positive correlation between the depth of the backbone network and the accuracy of the corresponding model; however, exceptions are noted with TreeResNet101 and TreeBotNet101, which exhibit larger MGDs compared to TreeResNet50 and TreeBotNet50, despite having smaller MEDs. Third, models incorporating self-attention bottlenecks demonstrate superior performance relative to those that do not. Notably, TreeBotNet50 exhibits smaller MGD and MED values than TreeResNet152, suggesting that the self-attention mechanism, as delineated in Section 2.4, enhances the extraction of spatial relationships among branches more effectively than traditional convolutional operations. This finding underscores the self-attention mechanism as a significant component in the model for extracting tree branching structures.
In addition to the evaluation of the models utilizing the metrics of Mean Geometric Distance (MGD) and Mean Edit Distance (MED), we incorporated standard performance metrics, namely accuracy, precision, recall, and F1-score, tailored to our specific task. These metrics were computed based on the labeled and predicted tree structure vectors, with the elements utilized in the calculations outlined in Equations (6)–(9), as depicted in Figure 10. As discussed in Section 2.3, the tree structure vector comprises two types of branch sub-vectors: branches containing node positions (represented in green in Figure 10) and void branches filled with zeros (represented in gray in Figure 10). By aligning the predicted and labeled tree structure vectors on a branch-by-branch basis, we can ascertain the number of correctly predicted branches (True Positives and True Negatives, as illustrated in Figure 10) as well as the incorrectly predicted branches (False Negatives and False Positives, also shown in Figure 10). Subsequently, we can derive these performance metrics for the models using the test dataset through the application of the following formulas:
$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}. \quad (6)$$
$$Precision = \frac{TP}{TP + FP}. \quad (7)$$
$$Recall = \frac{TP}{TP + FN}. \quad (8)$$
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}. \quad (9)$$
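The following sketch illustrates one way the branch-by-branch counting behind Equations (6)–(9) could be implemented on the lateral sub-vectors; the exact criterion for judging a non-void predicted branch as correct is not spelled out in code here, so the simple rule below (a slot counts as a true positive when both vectors contain node data) is our simplifying assumption:

```python
import numpy as np

def branch_confusion_counts(pred_lateral, true_lateral):
    """Count TP/TN/FP/FN over aligned lateral branch slots. Each element of the
    inputs is an (n, 2) branch slot; an all-zero slot is the void filler."""
    tp = tn = fp = fn = 0
    for p, t in zip(pred_lateral, true_lateral):
        p_has, t_has = np.any(p), np.any(t)
        if p_has and t_has:
            tp += 1
        elif not p_has and not t_has:
            tn += 1
        elif p_has and not t_has:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn
```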
The metrics computed for the different models are presented in Table 3. From these findings, it can be inferred that all TreeBotNet models surpass the performance of the TreeResNet models, with TreeBotNet152 exhibiting the highest levels of accuracy, precision, and F1-score.
In Figure 11, we present a visualization of the predicted tree branching structures generated by various models for a selection of trees. The images of these trees were obtained from online sources and were not part of the training dataset. The chosen tree images were selected to reflect variations in branch structure and the complexity of the backgrounds. Additionally, we calculated the edit distance (ED) and geometric distance (GD) for each predicted tree branching structure in comparison to the annotated reference. The results indicate that the predictions made by TreeBotNets exhibit smaller distances to the annotations than those produced by TreeResNets, which aligns with the visual outcomes depicted in the figure. Furthermore, as the complexity of the backgrounds in the tree images increases, both the ED and GD for all models also show a corresponding increase, as illustrated in Figure 11.
For the first tree in Figure 11, the branching structure is much simpler. TreeResNet101, TreeBotNet50, and TreeBotNet152 all achieved the optimal ED value of 0, but the geometric distance of TreeResNet101 is larger than those of the TreeBotNets. In this case, TreeBotNet152 has the smallest GD of 0.87 and achieved the best results. For the second tree, the branching structure generated by TreeBotNet152 is perceptually closest to the annotation. Its ED of 6 is the smallest, and its GD of 2.20 is much smaller than the values of the TreeResNets, which is consistent with the visualization results. For the third tree, the results generated by the three TreeBotNets are all better than those of the TreeResNets, with TreeBotNet152 giving the best result. From these comparison results, we can first conclude that the ED and GD together provide a good metric for characterizing the similarity between two tree branching structures; these quantitative distances are consistent with the perceived differences in the visualized results. Second, the tree branching structures predicted by TreeBotNet152 are better than those of the other models according to both the quantitative and qualitative results.

3.2. Analysis and Comparison of Model Prediction Results

During training, TreeBotNet152 converged to the minimum loss (Figure 8), which was also confirmed in the comparative experiments above. Therefore, we chose TreeBotNet152 as the final model to analyze the tree branching structures generated by our method. In the testing environment described in Section 2.5, TreeBotNet152 processes approximately 20 images per second, which meets the needs of interactive applications.
To demonstrate the model’s capabilities, a series of prediction results from the testing image set is presented in Figure 12. These images contain several trees with different branching patterns, leaf colors, backgrounds, and lighting conditions. In Figure 13, we also compare our results with the current state-of-the-art results based on the YOLO series model [10]. These results demonstrate that the model is capable of extracting tree branching structures from a single tree image for varying branching patterns and challenging scenarios, such as self-occlusion.
In Figure 12, each row in the figure represents a distinct tree, which includes the input image of the tree, the predicted results of the tree’s trunk, and the complete results including lateral branches. The tree images presented in the first and fourth rows are sourced from the internet, while the tree image in the second row is generated using artificial intelligence tools. The tree images located in the third, fifth, and sixth rows are captured using mobile phones. All of these tree images are included in the test dataset. Among these results, the tree trunk detected by the model is very consistent with the actual position. However, for lateral branches, although there are some deviations, their overall structure still fits well with the actual branching structure of the tree. For the first and second trees, due to their relatively simple and easily distinguishable background, both the trunk and lateral branches are closely aligned with the actual position of the branches.
The third and fourth trees in Figure 12 are a Ginkgo and a Willow, respectively. Due to the presence of the same tree species in the background, the color of their leaves causes significant interference, making these cases more challenging than the previous two trees. It is well established that Ginkgo and Willow trees exhibit markedly distinct branching architectures. The branch structures generated by our model accurately capture these differences, thereby offering substantial support for the validation of the model’s branch extraction capabilities.
The trees located in the fifth and sixth rows present significant challenges due to interference from adjacent trees, background elements, and varying lighting conditions. Notably, the lighting in the sixth tree image is dim, a condition frequently encountered in real-world environments. Despite these challenges, the results indicate that the extracted trunks align closely with the actual trunks. While the extracted lateral branches may not be entirely accurate, they nonetheless succeed in representing the overall structure of the tree. The results obtained from these varied scenarios have effectively demonstrated the robustness of our model across multiple dimensions.
As previously discussed, the existing literature predominantly employs object detection techniques to identify the primary trunks of trees, which are typically delineated by rectangular bounding boxes. In this context, we conducted a comparative analysis of the results from a leading tree trunk detection study utilizing the YOLO series [10], as presented in Figure 13. Owing to the absence of open access to their dataset and model, our analysis concentrated on a comparative examination of their primary results, specifically the four trees depicted in Figure 10 from [10]. The images of these four trees presented in Figure 13 are derived from screenshots of the corresponding figures in their publication. It is crucial to highlight that their research is specifically tailored to camellia trees, a species that is not represented in our dataset, thereby presenting a considerable challenge to the generalizability of our model. The images of camellia trees were captured under varying lighting conditions, and the actual orchard setting in which they were taken serves as a robust validation for the practical applicability of our model.
The findings presented in Figure 13 facilitate a comparative analysis between our proposed method and the approach outlined in [10]. Firstly, it is noteworthy that the method described in [10] is limited to detecting only certain visible portions of the trunk, whereas our method is capable of reconstructing the entire trunk structure, including critical nodes, even in instances of significant self-occlusion. It is important to acknowledge, however, that the method in [10] can simultaneously identify multiple trees within a single image, while our current approach is restricted to detecting only the central tree depicted in the image.
Secondly, in addition to identifying the main trunk, our method is also proficient in delineating lateral branch structures that align with the overall morphology of the tree. Although there may be some positional discrepancies in the detected branches, the identified branch structures generally conform well to the overall shape of the trees. For the first three trees illustrated in Figure 13, the bases of the detected trunks are contained within the designated pink rectangular boxes, thereby achieving a level of accuracy comparable to that reported in [10]. However, for the final tree, a noticeable deviation in the detected trunk is observed. This discrepancy may be attributed to the trunk’s color being relatively similar to that of the surrounding soil, which occupies a substantial area and ultimately influences the model’s predictive performance.
It is pertinent to note that these tree images were captured in real-world environments, where varying lighting conditions present considerable challenges to the branch extraction model. Consequently, the comparative results serve to further validate the generalization capabilities of the proposed model and its potential applicability in complex real-world scenarios.

4. Discussion

The experimental findings presented above have substantiated the efficacy of the automatic branch structure extraction model introduced in this study from multiple perspectives. A comparative analysis with existing literature highlights several unique aspects of the methodology employed in this study. Firstly, this research is capable of extracting not only tree trunks but also lateral branches. In contrast to the majority of current methods that utilize a rectangular bounding box to represent detected branches, this approach directly outputs the positions of each branch node, thereby facilitating a more accurate representation of branch morphology. A significant contributor to these outcomes is the vector representation of branches that we have proposed.
Secondly, as previously discussed, while a limited number of studies can address partially occluded scenarios, most existing approaches necessitate that the detected branches be unobstructed and distinctly separable from the background. The results from the preceding section demonstrate that our method can effectively extract corresponding branching structures even in cases of severe self-occlusion. This capability can be attributed to two primary factors: the dataset constructed for this study, which comprises a diverse array of tree images accompanied by detailed annotation, and the integration of a self-attention mechanism within the feature extraction process. This mechanism enhances the model’s ability to capture the global relationships among local features, thereby improving its performance in occluded situations. This assertion is further corroborated by findings in other studies, such as those by [10,17,29].
Lastly, we have introduced a specific metric based on edit distance to assess the topological similarity of tree branching structures, distinguishing it from widely used metrics, such as Intersection over Union (IoU). The IoU-based metrics are limited to measuring the accuracy of detected branches on a local scale, which complicates the provision of a comprehensive assessment of the entire branching structure. Nevertheless, accurately characterizing the differences between various tree structures remains an unresolved issue that warrants further investigation. Given the results obtained, we posit that the metrics proposed in this study offer a promising avenue for future exploration.
While the methodology presented in this article has demonstrated significant advancements in the specified areas, it is important to note that, being an image-based approach, certain practical factors inherent to the application context (like in [10]) may still exert varying degrees of influence on the model’s performance. A comprehensive quantitative and qualitative analysis has been performed in the preceding section. The findings indicate that the primary factors affecting performance include occlusion caused by tree branches and leaves, the size of higher-level branches, environmental lighting conditions, and background interference. Consequently, there remain several avenues for enhancement in the proposed method for future research. The potential limitations of the current approach and prospective directions for improvement are summarized as follows.
  • The proposed representation of tree branch structures has exhibited a commendable degree of compatibility with deep neural network models, as evidenced by the results obtained. Nevertheless, there remains a need for enhancement in its capacity to represent a wider variety of branch structures. This necessity can be delineated into two primary aspects. Firstly, the current framework only accommodates a two-level branch structure. While it is posited that extending this framework to incorporate additional levels is feasible, further empirical investigations are required to substantiate this claim. Secondly, the reliance on a redundant representation methodology results in the generation of representation vectors that include a significant number of zero-filled values. Moving forward, it is imperative to explore more compact representation techniques, as these are essential for effectively characterizing multi-level branch structures across multiple trees and for augmenting the efficiency of model learning.
  • The existing model architecture relies solely on traditional convolutional neural networks for the extraction of image features. While the integration of attention mechanisms has yielded positive outcomes, it remains pertinent to investigate the potential for combining widely recognized large models within the image domain to enhance semantic feature extraction. For instance, the research presented in [40] examined the features derived from self-supervised vision transformers, demonstrating their efficacy for tasks such as semantic segmentation. Therefore, it is essential to explore methodologies that integrate models with robust feature representation capabilities with tasks related to tree branch structure detection.
  • The dataset developed in this study has been augmented with AI-generated tree images; however, there remains a need for further expansion of the image count. Our analysis indicates that the trees produced by current AI tools exhibit constraints in terms of overall morphology and the complexity of branching patterns. Given the challenges associated with obtaining a diverse array of tree images under varying lighting conditions and backgrounds in natural settings, the integration of virtual tree data, as demonstrated in [31], continues to represent a promising avenue for future exploration.

5. Conclusions

The methodology outlined in this study demonstrates the ability to autonomously delineate the branching architecture of trees from a single image, effectively accommodating various levels of occlusion, lighting conditions, and complex background interference. The final model yielded values of 19.39 and 12.40 for the Mean Edit Distance (MED) and Mean Geometric Distance (MGD), respectively, which assess the topological and geometric accuracy of the branch structure. Using standard evaluation metrics for branch identification, the model achieved an average accuracy of 80.2% and an F1-score of 0.721. The results of this research indicate substantial potential for practical applications. For example, by extracting the branching structure of trees, it becomes possible to incorporate domain-specific knowledge to create automated analytical tools for identifying optimal tree species, which could be particularly beneficial in areas such as tree breeding. It is expected that as this technology progresses, a wider array of related applications will develop, leveraging its capabilities in the future.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/f15091659/s1, Figure S1: TreeResNets.pdf and Figure S2: TreeBotNets.pdf.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y. and H.L.; software, Y.Y., H.L. and B.C.; validation, H.L., Y.Y. and B.C.; formal analysis, Y.Y. and H.L.; investigation, Y.Y. and H.L.; resources, Y.Y., Y.H. and K.X.; data curation, H.L. and B.C.; writing—original draft preparation, Y.Y. and H.L.; writing—review and editing, Y.Y., Y.H., K.X. and J.H.; visualization, Y.Y., H.L. and B.C.; supervision, Y.Y. and Y.H.; project administration, Y.Y. and J.H.; funding acquisition, Y.Y., Y.H., K.X. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Zhejiang Provincial Natural Science Foundation of China (LQ20F020005); the Natural Science Foundation of China (62441205, 32271869); and the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (2022C02009).

Data Availability Statement

The software will be available on GitHub soon. The data of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

References

  1. Lu, Y.; Rasmussen, C. Tree trunk detection using contrast templates. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 1253–1256. [Google Scholar] [CrossRef]
  2. Auat Cheein, F.; Steiner, G.; Perez Paina, G.; Carelli, R. Optimized EIF-SLAM algorithm for precision agriculture mapping based on stems detection. Comput. Electron. Agric. 2011, 78, 195–207. [Google Scholar] [CrossRef]
  3. Shao, L.; Chen, X.; Milne, B.; Guo, P. A novel tree trunk recognition approach for forestry harvesting robot. In Proceedings of the 9th IEEE Conference on Industrial Electronics and Applications, Hangzhou, China, 9–11 June 2014; pp. 862–866. [Google Scholar] [CrossRef]
  4. Shalal, N.; Low, T.; McCarthy, C.; Hancock, N. A preliminary evaluation of vision and laser sensing for tree trunk detection and orchard mapping. In Proceedings of the Australasian Conference on Robotics and Automation ACRA, Sydney, Australia, 2–4 December 2013. [Google Scholar]
  5. Ji, W.; Qian, Z.; Xu, B.; Tao, Y.; Zhao, D.; Ding, S. Apple tree branch segmentation from images with small gray-level difference for agricultural harvesting robot. Optik 2016, 127, 11173–11182. [Google Scholar] [CrossRef]
  6. Colmenero-Martinez, J.T.; Blanco-Roldán, G.L.; Bayano-Tejero, S.; Castillo-Ruiz, F.J.; Sola-Guirado, R.R.; Gil-Ribes, J.A. An automatic trunk-detection system for intensive olive harvesting with trunk shaker. Biosyst. Eng. 2018, 172, 92–101. [Google Scholar] [CrossRef]
  7. Chen, X.; Wang, S.; Zhang, B.; Luo, L. Multi-feature fusion tree trunk detection and orchard mobile robot localization using camera/ultrasonic sensors. Comput. Electron. Agric. 2018, 147, 91–108. [Google Scholar] [CrossRef]
  8. Wan, H.; Fan, Z.; Yu, X.; Kang, M.; Wang, P.; Zeng, X. A real-time branch detection and reconstruction mechanism for harvesting robot via convolutional neural network and image segmentation. Comput. Electron. Agric. 2022, 192, 106609. [Google Scholar] [CrossRef]
  9. Su, F.; Zhao, Y.; Shi, Y.; Zhao, D.; Wang, G.; Yan, Y.; Zu, L.; Chang, S. Tree Trunk and Obstacle Detection in Apple Orchard Based on Improved YOLOv5s Model. Agronomy 2022, 12, 2427. [Google Scholar] [CrossRef]
  10. Liu, Y.; Wang, H.; Liu, Y.; Luo, Y.; Li, H.; Chen, H.; Liao, K.; Li, L. A Trunk Detection Method for Camellia oleifera Fruit Harvesting Robot Based on Improved YOLOv7. Forests 2023, 14, 1453. [Google Scholar] [CrossRef]
  11. Majeed, Y.; Karkee, M.; Zhang, Q.; Fu, L.; Whiting, M.D. Determining grapevine cordon shape for automated green shoot thinning using semantic segmentation-based deep learning networks. Comput. Electron. Agric. 2020, 171, 105308. [Google Scholar] [CrossRef]
  12. Chen, Z.; Ting, D.; Newbury, R.; Chen, C. Semantic segmentation for partially occluded apple trees based on deep learning. Comput. Electron. Agric. 2021, 181, 105952. [Google Scholar] [CrossRef]
  13. Silva, R.; Junior, J.M.; Almeida, L.; Gonçalves, D.; Zamboni, P.; Fernandes, V.; Silva, J.; Matsubara, E.; Batista, E.; Ma, L.; et al. Line-based deep learning method for tree branch detection from digital images. Int. J. Appl. Earth Obs. Geoinf. 2022, 110, 102759. [Google Scholar] [CrossRef]
  14. Li, Y.; Zhang, Z.; Wang, X.; Fu, W.; Li, J. Automatic reconstruction and modeling of dormant jujube trees using three-view image constraints for intelligent pruning applications. Comput. Electron. Agric. 2023, 212, 108149. [Google Scholar] [CrossRef]
  15. Sass-Klaassen, U.; Fonti, P.; Cherubini, P.; Gričar, J.; Robert, E.M.; Steppe, K.; Bräuning, A. A tree-centered approach to assess impacts of extreme climatic events on forests. Front. Plant Sci. 2016, 7, 1069. [Google Scholar] [CrossRef] [PubMed]
  16. Esperon-Rodriguez, M.; Quintans, D.; Rymer, P.D. Urban tree inventories as a tool to assess tree growth and failure: The case for Australian cities. Landsc. Urban Plan. 2023, 233, 104705. [Google Scholar] [CrossRef]
  17. Li, R.; Sun, G.; Wang, S.; Tan, T.; Xu, F. Tree trunk detection in urban scenes using a multiscale attention-based deep learning method. Ecol. Inform. 2023, 77, 102215. [Google Scholar] [CrossRef]
  18. Jin, S.; Sun, X.; Wu, F.; Su, Y.; Li, Y.; Song, S.; Xu, K.; Ma, Q.; Baret, F.; Jiang, D.; et al. Lidar sheds new light on plant phenomics for plant breeding and management: Recent advances and future prospects. ISPRS J. Photogramm. Remote Sens. 2021, 171, 202–223. [Google Scholar] [CrossRef]
  19. Zhang, P.; Huang, J.; Ma, Y.; Wang, X.; Kang, M.; Song, Y. Crop/Plant Modeling Supports Plant Breeding: II. Guidance of Functional Plant Phenotyping for Trait Discovery. Plant Phenomics 2023, 5, 0091. [Google Scholar] [CrossRef] [PubMed]
  20. Gao, W.; Yang, X.; Cao, L.; Cao, F.; Liu, H.; Qiu, Q.; Shen, M.; Yu, P.; Liu, Y.; Shen, X. Screening of Ginkgo Individuals with Superior Growth Structural Characteristics in Different Genetic Groups Using Terrestrial Laser Scanning (TLS) Data. Plant Phenomics 2023, 5, 0092. [Google Scholar] [CrossRef] [PubMed]
  21. Shalal, N.; Low, T.; McCarthy, C.; Hancock, N. Orchard mapping and mobile robot localisation using on-board camera and laser scanner data fusion—Part A: Tree detection. Comput. Electron. Agric. 2015, 119, 254–266. [Google Scholar] [CrossRef]
  22. Xue, J.; Fan, B.; Yan, J.; Dong, S.; Ding, Q. Trunk detection based on laser radar and vision data fusion. Int. J. Agric. Biol. Eng. 2018, 11, 20–26. [Google Scholar] [CrossRef]
  23. da Silva, D.Q.; dos Santos, F.N.; Sousa, A.J.; Filipe, V. Visible and Thermal Image-Based Trunk Detection with Deep Learning for Forestry Mobile Robotics. J. Imaging 2021, 7, 176. [Google Scholar] [CrossRef]
  24. Jiang, A.; Noguchi, R.; Ahamed, T. Tree trunk recognition in orchard autonomous operations under different light conditions using a thermal camera and faster R-CNN. Sensors 2022, 22, 2065. [Google Scholar] [CrossRef] [PubMed]
  25. Wan, H.; Zeng, X.; Fan, Z.; Zhang, S.; Kang, M. U2ESPNet-A lightweight and high-accuracy convolutional neural network for real-time semantic segmentation of visible branches. Comput. Electron. Agric. 2023, 204, 107542. [Google Scholar] [CrossRef]
  26. Juman, M.A.; Wong, Y.W.; Rajkumar, R.K.; Goh, L.J. A novel tree trunk detection method for oil-palm plantation navigation. Comput. Electron. Agric. 2016, 128, 172–180. [Google Scholar] [CrossRef]
  27. Liu, T.H.; Ehsani, R.; Toudeshki, A.; Zou, X.J.; Wang, H.J. Detection of citrus fruit and tree trunks in natural environments using a multi-elliptical boundary model. Comput. Ind. 2018, 99, 9–16. [Google Scholar] [CrossRef]
  28. Yang, C.; Xiong, L.; Wang, Z.; Wang, Y.; Shi, G.; Kuremot, T.; Zhao, W.; Yang, Y. Integrated detection of citrus fruits and branches using a convolutional neural network. Comput. Electron. Agric. 2020, 174, 105469. [Google Scholar] [CrossRef]
  29. Tong, S.; Yue, Y.; Li, W.; Wang, Y.; Kang, F.; Feng, C. Branch Identification and Junction Points Location for Apple Trees Based on Deep Learning. Remote Sens. 2022, 14, 4495. [Google Scholar] [CrossRef]
  30. Itakura, K.; Hosoi, F. Automatic tree detection from three-dimensional images reconstructed from 360° spherical camera using YOLO v2. Remote Sens. 2020, 12, 988. [Google Scholar] [CrossRef]
  31. Grondin, V.; Fortin, J.M.; Pomerleau, F.; Giguère, P. Tree detection and diameter estimation based on deep learning. Forestry 2023, 96, 264–276. [Google Scholar] [CrossRef]
  32. Wada, K. labelme: Image Polygonal Annotation with Python. 2018. Available online: https://github.com/wkentaro/labelme (accessed on 1 July 2021).
  33. AUTOMATIC1111. Stable Diffusion Webui. 2023. Available online: https://github.com/AUTOMATIC1111/stable-diffusion-webui (accessed on 10 January 2024).
  34. Guha, S. Computer Graphics Through OpenGL: From Theory to Experiments; A K Peters/CRC Press: Boca Raton, FL, USA, 2022. [Google Scholar] [CrossRef]
  35. Wither, J.; Boudon, F.; Cani, M.; Godin, C. Structure from silhouettes: A new paradigm for fast sketch-based design of trees. Comput. Graph. Forum 2009, 28, 541–550. [Google Scholar] [CrossRef]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  38. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16514–16524. [Google Scholar] [CrossRef]
  39. Skiena, S.S. The Algorithm Design Manual; Springer: London, UK, 2008. [Google Scholar] [CrossRef]
  40. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9630–9640. [Google Scholar]
Figure 1. Overview of our method. For further details regarding the content depicted in the image, please refer to the description in the text.
Figure 2. Sample images from the dataset. From left to right: web images, phone images, and AI-generated images.
Figure 3. Demonstration of the annotation procedure applied to a tree image. The blue dots in the tree image are the key nodes of the branches annotated by the user. After completing a certain level of branch annotation, the annotation tool will turn these dots red and automatically connect them as branches.
Figure 4. Dataset statistics. (a) Histogram of the number of annotated nodes on the trunk. (b) Histogram of the average number of annotated nodes on each lateral branch. (c) Histogram of the total number of annotated lateral branches in each tree.
Figure 5. Demonstration of the resampling procedure applied to a tree branching structure.
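The resampling step illustrated in Figure 5 can be realized in several ways; the sketch below shows one straightforward variant that redistributes a fixed number of nodes uniformly along the arc length of an annotated branch polyline using linear interpolation. The function name and the uniform-spacing rule are assumptions for illustration, not necessarily the exact procedure used in this work.

```python
import numpy as np

def resample_polyline(points, num_nodes):
    """Resample a 2D polyline to `num_nodes` points spaced uniformly
    along its arc length (linear interpolation between annotated nodes)."""
    pts = np.asarray(points, dtype=np.float64)
    seg_len = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # per-segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg_len)])          # cumulative arc length
    targets = np.linspace(0.0, s[-1], num_nodes)              # uniform sample positions
    x = np.interp(targets, s, pts[:, 0])
    y = np.interp(targets, s, pts[:, 1])
    return np.stack([x, y], axis=1)

# Example: a trunk annotated with 4 key nodes, resampled to 6 nodes.
trunk = [(100, 400), (105, 300), (98, 180), (110, 60)]
print(resample_polyline(trunk, 6).round(1))
```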
Figure 6. Illustration of the vector representations of two trees. Each green dot represents a node (a 2D point) on a branch. The sub-vector marked 'Trunk' stores the trunk nodes, and 'Lateral' stores the lateral-branch nodes. The sub-vector marked 'Trunk Node 2' stores the nodes of the lateral branches directly connected to trunk node 2, and so forth; the root node is 'Trunk Node 1'. The sub-vector marked 'Extra Positions' stores zero values when the total number of nodes in the tree is less than the desired vector size.
Figure 7. Schematic illustration of the self-attention mechanism in a bottleneck transformer.
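For readers unfamiliar with the mechanism sketched in Figure 7, the following minimal PyTorch snippet implements the core of the self-attention used in a bottleneck transformer block [37,38]: the spatial positions of a convolutional feature map are flattened into tokens and attended over with multi-head scaled dot-product attention. The relative position encodings and the surrounding residual bottleneck wiring are omitted, and the layer sizes are illustrative assumptions rather than those of our final model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Multi-head scaled dot-product self-attention over the H*W positions
    of a feature map (relative position terms omitted for brevity)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.qkv = nn.Conv2d(channels, 3 * channels, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)    # each (B, C, H, W)
        def split(t):                            # -> (B, heads, H*W, C//heads)
            return t.reshape(b, self.heads, c // self.heads, h * w).transpose(-1, -2)
        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5), dim=-1)
        out = (attn @ v).transpose(-1, -2).reshape(b, c, h, w)
        return self.proj(out)

# Example: attend over a 14x14 feature map with 256 channels.
feat = torch.randn(2, 256, 14, 14)
print(SpatialSelfAttention(256)(feat).shape)     # torch.Size([2, 256, 14, 14])
```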
Figure 8. Convergence curves for training different models. (a) Convergence curves on the training set. (b) Convergence curves on the validation set.
Figure 9. Illustration of the string encoding used in the edit distance computation. The leftmost tree is the reference; the other two are target trees. To compute the edit distance from a target tree to the reference tree, all trees are first encoded as strings (listed below each tree, with differences highlighted in red). The edit distance from the center tree to the reference tree is 2, since two 'L's must be inserted to match the reference; the edit distance from the right tree to the reference tree is 4, with the edit operations highlighted in the string.
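To make the computation in Figure 9 concrete, the sketch below implements the standard Levenshtein edit distance on two encoded tree strings. The example strings are hypothetical, chosen only to reproduce the "insert two 'L's" case described in the caption; the actual encoding alphabet is defined in the main text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two encoded tree strings
    (insertions, deletions, and substitutions each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Hypothetical encodings in the spirit of Figure 9: the target string is
# missing two 'L' characters, so its distance to the reference is 2.
reference = "TLLTLLTLL"
target = "TLLTLLT"
print(edit_distance(target, reference))   # 2
```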
Figure 10. Elements used in the computation of the different metrics. The relationships highlighted in red indicate branches with prediction errors.
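The branch-level detection metrics reported in Table 3 follow the usual definitions; the short sketch below computes them from counts of correctly and incorrectly predicted branches. The rule for matching a predicted branch to an annotated one (the red relations in Figure 10) is defined in the main text, so the counts here are simply assumed to be given, and the example numbers are illustrative only.

```python
def branch_metrics(tp: int, fp: int, fn: int, tn: int = 0):
    """Accuracy, precision, recall, and F1-score from branch-level
    true/false positive and negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Illustrative counts only (not taken from this paper's experiments):
print(branch_metrics(tp=73, fp=29, fn=27, tn=71))
```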
Figure 11. Comparison of the results from different models. For each tree, the left column contains the input tree image and its annotated branches. The right columns are the prediction results from different models. The branches are drawn as red line segments. The distances between the predicted and annotated branches are listed below each tree image.
Figure 12. Visualization of several tree branching structure results on the testing set. Columns, from left to right: the input tree images, the predicted trunk structures, and the predicted trunk together with the lateral branches.
Figure 13. Comparison with the object detection method of [10] on Camellia oleifera trees, a species absent from our dataset. The upper row shows the trunks detected by [10] (pink boxes). The lower row overlays our branch extraction results (red lines) on the results of [10].
Table 1. Parameters to characterize a tree branching structure.
Parameter | Interpretation
N | Number of nodes on the trunk
L | Total number of lateral branches
k | Number of lateral branches at each trunk node
n | Number of nodes on each lateral branch
M | Number of nodes on all lateral branches
Table 2. Quantitative comparisons of different models.
Model | MGD 1 (w.o/w AIGI 3) | MED 2 (w.o/w AIGI)
TreeResNet50 | 15.97/15.75 | 25.48/22.28
TreeResNet101 | 18.14/16.25 | 25.73/22.01
TreeResNet152 | 17.61/15.68 | 26.34/21.86
TreeBotNet50 | 14.72/12.71 | 24.10/19.69
TreeBotNet101 | 15.36/12.90 | 24.33/19.46
TreeBotNet152 | 13.94/12.40 | 22.07/19.39
1 Mean geometrical distance: smaller is better. 2 Mean edit distance: smaller is better. 3 w.o/w AIGI: training without and with the AI-generated images. The values emphasized in bold within the table indicate the models that reach the lowest values for these two metrics.
Table 3. Evaluation of various models utilizing multiple metrics for comparison.
Model | Accuracy | Precision | Recall | F1-Score
TreeResNet50 | 0.745 | 0.633 | 0.580 | 0.610
TreeResNet101 | 0.741 | 0.614 | 0.582 | 0.597
TreeResNet152 | 0.746 | 0.630 | 0.588 | 0.608
TreeBotNet50 | 0.798 | 0.691 | 0.744 | 0.717
TreeBotNet101 | 0.796 | 0.697 | 0.725 | 0.711
TreeBotNet152 | 0.802 | 0.713 | 0.729 | 0.721
The values emphasized in bold within the table indicate the highest value for the corresponding metric.
