Article

Is Your Training Data Really Ground Truth? A Quality Assessment of Manual Annotation for Individual Tree Crown Delineation

Janik Steier, Mona Goebel and Dorota Iwaszczuk
Department of Remote Sensing and Image Analysis, Technical University of Darmstadt, Franziska-Braun-Str. 7, 64287 Darmstadt, Germany
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(15), 2786; https://doi.org/10.3390/rs16152786
Submission received: 7 June 2024 / Revised: 22 July 2024 / Accepted: 25 July 2024 / Published: 30 July 2024
(This article belongs to the Section AI Remote Sensing)

Abstract

For the accurate and automatic mapping of forest stands based on very-high-resolution satellite imagery and digital orthophotos, precise object detection at the individual tree level is necessary. Currently, supervised deep learning models are primarily applied for this task. To train a reliable model, it is crucial to have an accurate tree crown annotation dataset. The current method of generating these training datasets still relies on manual annotation and labeling. Because of the intricate contours of tree crowns, the high vegetation density in natural forests and the insufficient ground sampling distance of the imagery, manually generated annotations are error-prone. It is unlikely that the manually delineated tree crowns represent the true conditions on the ground. If these error-prone annotations are used as training data for deep learning models, the models may produce inaccurate mapping results. This study critically validates manual tree crown annotations on two study sites: a forest-like plantation on a cemetery and a natural city forest. The validation is based on tree reference data in the form of an official tree register and tree segments extracted from UAV laser scanning (ULS) data for the quality assessment of a training dataset. The validation results reveal that the manual annotations correctly detect only 37% of the tree crowns in the forest-like plantation area and 10% of the tree crowns in the natural forest. Furthermore, at both study sites, multiple trees are frequently interpreted as a single tree in the annotations.

1. Introduction

A sufficiently large amount of reference data is essential for training deep learning models and ensuring the validity of their results. In addition to quantity, researchers and users are increasingly focusing on quality, since low-quality training data cannot be compensated for by even the best performing machine learning algorithm [1,2].
Especially in the field of environmental observation using remote sensing, such as satellite or aerial images, high-quality reference data in sufficient quantity are important in order to validate the classification results of the deep learning models against the real conditions on site. Ideally, but rarely in practice, such reference data are available in the form of on-site measurements or ground truth data for training deep learning models.
Individual tree crown delineation and mapping enable access to important information for the analysis, modelling and management of natural forests, planted forests and urban forests [3]. For example, the mapping of individual tree crowns supports a more accurate analysis of carbon storage estimations [4], biodiversity management [5] and forest health descriptions on an individual tree level [6].
Furthermore, by accurately determining the spatial delineation of tree crowns, valuable information, such as the crown projection area, which can be defined as the parallel vertical projection of the crown onto a horizontal plane, can be determined [7]. The crown projection area at the individual tree level is utilized in forest management to estimate the diameter and volume [8] as well as the growth rate [9] of trees.
In recent years, deep learning, particularly convolutional neural networks (CNNs), has become established in the field of individual tree detection and crown delineation (ITDCD) [3]. CNN models, including object detection networks like RetinaNet [10] and instance segmentation models such as Mask R-CNN [11], have shown a significant improvement in terms of accuracy over traditional ITDCD approaches [3]. CNN models are mostly applied to high-resolution RGB (red, green, and blue) images [3].
The training of deep learning models, especially CNNs, is usually computationally intensive and requires a large amount of labeled samples, also called annotations [12]. In general, these annotations for deep learning models can be provided as class labels, semantic descriptors, bounding boxes or dense contours described as masks or polygons. Many real-world annotation projects require near-pixel-accurate labels [13]. In the case of ITDCD, precise annotations of single tree crowns’ contours are required.
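For illustration, a single polygon annotation of a tree crown could be stored as an instance segmentation record. The following sketch uses the widely adopted COCO-style field names as an assumed storage format; all identifiers and coordinates are hypothetical and not taken from the datasets used in this study.

```python
# Illustrative only: one tree crown stored as a COCO-style instance record.
# Field names follow the common COCO convention; values are hypothetical.
crown_annotation = {
    "id": 1,                       # unique annotation id
    "image_id": 42,                # id of the source image tile
    "category_id": 1,              # single class: "tree crown"
    "bbox": [118.0, 203.5, 64.0, 71.0],            # [x, y, width, height] in pixels
    "segmentation": [[118.0, 210.0, 131.5, 203.5,  # flattened polygon vertices
                      160.0, 205.0, 182.0, 238.0,  # (x1, y1, x2, y2, ...)
                      155.0, 274.5, 122.0, 260.0]],
    "area": 2731.0,                # polygon area in pixels
    "iscrowd": 0,                  # individual instance, not a merged region
}
```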
Most studies in the field of ITDCD collect training samples or annotations via hand labelling, in which individual tree crowns or canopy cover are manually labelled using bounding boxes or polygons, also called crown masks. The annotations are usually generated by the visual interpretation of remote sensing data, like RGB and multispectral images and LiDAR data [3,7,11,14,15].
However, in general, the manual generation of training data through the visual interpretation of remote sensing imagery, and polygon annotation in particular, is inherently error-prone and fraught with several unavoidable challenges [13,16]. Possible error sources in manual annotations of tree crowns include a variety of subjective and objective factors:
  • The characteristics of the tree crown: Tree crowns have irregular shapes, overlapping canopies and indistinct edges, and shadows appear between crowns [17];
  • The skills of the annotators: The subjective recognition of complex crown shapes varies between annotators, whose patience, levels of fatigue and attitude affect the quality of the annotation labeling [18];
  • Forest conditions: Natural forests’ dense vegetation makes it very difficult to distinguish individual tree crowns with the human eye [7,14,19];
  • Image quality: A coarse ground sampling distance (GSD) and unfavorable lighting conditions in the images make it difficult to distinguish tree crowns.
Consequently, manually delineated tree crowns are not considered to represent the true delineations of individual tree crowns on site. The above reasons are bound to result in a significant variation in tree crown annotations depending on the labeler, the forest conditions and the remote sensing image quality. Some of these errors can be mitigated by labeling the imagery while observing trees on site. However, field-based labeling is time-consuming and costly and requires the expertise of local botanists, if further tree characteristics like tree species are examined [14,20]. Because of these costs, labeling based solely on remote sensing imagery has become the preferred approach for labeling tree crowns specifically [17].
A growing number of studies have applied CNN models to ITDCD with manually generated annotations in various forest environments. The model detection results are also validated against samples delineated manually by visual interpretation [21]. The manually generated annotations are rarely, if ever, validated with the true delineation of individual tree crowns on site.
For example, Sun et al. [19] used a cascade mask region-based convolutional neural network (CMask R-CNN) to detect and delineate 112 million individual tree crowns in a subtropical mega city in airborne optical images. To train and test the network, 128,000 tree crown labels were manually annotated based on visual interpretation. A validation of these annotations with reference data was not conducted. The authors also assume that the quality of their labels deteriorates in dense forests, with negative effects on model performance.
In contrast, Wagner et al. [14] had their manually delineated tree crowns in a tropical forest validated by an expert botanist in the field, which required additional financial resources. Furthermore, only crowns that were clearly visible in the visual interpretation were outlined. Freudenberg et al. [7] validated their manually generated annotations, created through the visual interpretation of an aerial image, against a canopy height model based on LiDAR. Unfortunately, these studies did not provide any quantitative data regarding the validation results or information on whether the annotations were post-processed.
In a previous work, we trained and tested two instance segmentation deep learning frameworks (Mask R-CNN and YOLOv8) to detect and delineate individual tree crowns in dense forests. As the input data, we used airborne images and very-high-resolution satellite images from WorldView-3. The instance segmentation technique combines the computer vision tasks of object detection and semantic segmentation with the additional function of delimiting separate instances of a specific object class and assigning an identification value to the respective instance. To train and test the frameworks, we manually created polygonal annotations of the tree crowns as reference data. However, the model's delineation results were not sufficient. We identified that one of the reasons for the unsatisfactory performance was the error-prone training data. The annotation process was quite difficult for the annotators in terms of identifying individual tree crowns in dense forest stands and delineating them with pixel accuracy. Thus, the manually generated annotations did not reliably represent the real position and extent of the single tree crowns and led to inaccurate model results.
The lack of validation of training data is not unique to the deep-learning-based delineation of individual tree crowns. The issue of uncertainty in training data is rarely assessed or reported in the context of machine learning applied to observations of Earth. However, it is frequently assumed that such data are inherently accurate [22]. Nevertheless, a general evaluation of the manually generated labels against the true conditions on the ground is still necessary in order to realistically assess the final deep learning prediction [16].
Our aim is to highlight this systematic issue in manual tree crown annotation and in the visual interpretation of remote sensing imagery of forests, and to quantify the severity of the problem. Through a detailed investigation, the accuracy of manually generated annotations of individual tree crowns is evaluated. We introduce a method to quantify the quality of these annotations. Finally, we demonstrate our approach using two study sites with different forest conditions and reference data sources.

Related Work: Training Data Error Using Visual Interpretation

We previously described potential sources of errors in manual annotation based on the visual interpretation of remote sensing data (e.g., stemming from the geometric characteristics of the object of interest or the image quality), using the example of individual tree crown delineation. However, multiple sources of errors are possible and likely in training data based on visual image interpretation [23].
In the field of the remote sensing of vegetation, training data in CNN-based studies are most often acquired directly from raw or pre-processed remote sensing data using visual interpretation. While visual interpretation might be more efficient for data annotation than relying solely on in situ data, it can still be a tedious process, especially for large datasets or complex vegetation canopies that require very detailed annotations [12]. Furthermore, visual interpretation can also be prone to errors and bias, particularly when identifying vegetation and biophysical coverage, e.g., mapping species in the field, as well as human land use [12,16,24].
Training data errors associated with manual annotations may also result from inadequate or poorly communicated definitions of the semantic class or object of interest [25]. This is especially evident in urban and vegetation environments, which exhibit high spatial and spectral heterogeneity and in which the target of the annotations is semantically and geometrically vague [16]. For example, in a training dataset for mapping informal settlements in Nairobi, Kenya, several annotators delineated the same area quite differently [26] (Figure 1). Because informal settlements are defined by sociodemographic factors in addition to spatial and spectral properties, the creation of training data based on visual interpretation is error-prone, stemming from insufficient semantic class definitions [25].
Furthermore, training data errors can occur if the annotators are insufficiently trained, are unfamiliar with the area under investigation or lack experience with remote sensing data and Earth observation [16]. For example, local knowledge is crucial for delineating different classes of urban land use [25].
This brief summary of training data error sources is by no means exhaustive. For a more comprehensive overview of training data errors in Earth observation, we highly recommend the review by Elmes et al. [16].

2. Materials and Methods

In this study, we present an approach to quantify the quality of manually generated individual tree crown annotations. To better understand the case distinction and validation metrics, we first introduce the two study sites as well as the image and tree reference data. Afterwards, the manual generation of the annotations is explained. The case distinction and metrics used to validate the annotations against the tree reference data are explained and visualized in Section 2.3.

2.1. Study Sites and Tree Reference Data

The manually drawn annotations of individual tree crowns were generated under two varying forest site conditions, with different image sources and different tree reference data.

2.1.1. Study Site 1

Site conditions: The first study site is the municipal cemetery of the city of Frankfurt am Main, Germany, which is a forest-like plantation. Within the cemetery, four sub-areas with a visually apparent dense tree stand were selected for validation (Figure 2). These four sub-areas together cover an area of 8.5 hectares and include 817 individual trees in the tree register, representing a tree stand density of 96 trees per hectare. More precisely, 457 trees (57%) belong to the conifer order (Coniferales), 174 trees (21%) belong to the order Fagales, for example the genus birch (Betula), and 71 trees (9%) are Sapindales, e.g., maple (Acer). The remaining 115 trees are distributed over nine different tree orders.
Image data: The individual tree crowns were delineated using digital orthophotos (DOP) with a 20 cm GSD, an 8-bit radiometric resolution and RGB bands. The image dataset was acquired by an aerial flight in June 2021. The resulting digital orthophotos are published by the federal state of Hesse, Germany. They are freely available and updated in a two-year cycle.
Tree reference data: The annotations are validated against the publicly accessible tree register of the city of Frankfurt am Main from May 2022. The register information was converted to vector point data for each tree. The register provides detailed and georeferenced information, e.g., genus, tree height and stem diameter.

2.1.2. Study Site 2

Site conditions: The second study site is located in the eastern city forest of Darmstadt, Germany. The study site covers an area of 12.4 hectares that is mainly dense forest with smaller clearings of young trees. The eastern section of the city forest is mainly (67%) composed of beech (Fagus), oak (Quercus) and other deciduous tree species, as well as pines (Pinus) (33%) [27]. A detailed tree species distribution for this study site is not available due to the density of the forest and the lack of official data.
Image data: The image source for the second study site is a very-high-resolution multispectral 8-band satellite image with approximately 30 cm GSD after pan-sharpening, an 11-bit radiometric resolution and an off-nadir angle of 11.7°. The image was acquired by the commercial Earth observation satellite WorldView-3, operated by DigitalGlobe, in June 2019 under mainly cloud-free conditions. For the annotation process, the satellite image was visualized as an 8-bit RGB image; this representation was chosen for better comparability with study site 1 and other related work.
Tree reference data: The tree reference data for the second study site were self-generated as 2D tree segments. These 2D tree segments (Figure 3) are based on 3D UAV laser scanning (ULS) point clouds. The ULS data were collected in November 2022 with a DJI Zenmuse L1 sensor attached to a DJI Matrice 300 RTK drone. The same area was scanned with five flights: one at nadir and four at 45° sensor angles from the four cardinal directions. Data collection and basic processing, such as registration, were performed by the company Vermessung3D. The resulting point cloud has 731 million points and covers 12.4 hectares (Figure 3a). The tree segmentation was completed using MATLAB's Lidar Toolbox [28] and following the tree segmentation workflow documented in [29]. First, the ground points were isolated, and the elevation was normalized. Thereafter, the point cloud was divided into a grid, and forest metrics, such as canopy cover, gaps in the canopy and leaf area index, were determined from a minimum of 2 m above ground. After generating a canopy height model, treetops were detected, which formed the basis for the subsequent individual tree segmentation (Figure 3b). For the comparison with the manual annotations, the final segmentation was converted into a 2D raster from the nadir perspective using CloudCompare [30]. Finally, after clipping the area exactly to the manually annotated extent, the delineations were converted from raster to vector data with QGIS [31]. The final tree reference dataset includes 3572 segments, after removing tree segments with an area smaller than 1.5 m² from the dataset. The average tree stand density in this study site is 288 trees per hectare.
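The segmentation itself was carried out with MATLAB's Lidar Toolbox; the following Python sketch is not that workflow but a generic illustration of the same processing idea: rasterizing a height-normalized point cloud to a canopy height model (CHM), detecting treetops as local maxima above the 2 m vegetation threshold, and delineating crowns with a marker-controlled watershed. The cell size, smoothing and peak-distance parameters are assumptions chosen for illustration only.

```python
# Illustrative sketch (not the MATLAB Lidar Toolbox workflow used in this study):
# generic CHM-based treetop detection and crown segmentation, assuming the
# height-normalized point cloud is available as an (N, 3) array `points` in metres.
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def segment_crowns(points, cell_size=0.5, min_height=2.0, min_distance_px=4):
    """Rasterize a normalized point cloud to a CHM and segment tree crowns."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    cols = ((x - x.min()) / cell_size).astype(int)
    rows = ((y.max() - y) / cell_size).astype(int)
    chm = np.zeros((rows.max() + 1, cols.max() + 1))
    np.maximum.at(chm, (rows, cols), z)          # highest return per grid cell
    chm = ndimage.median_filter(chm, size=3)     # smooth small gaps and noise

    # Treetops: local height maxima above the 2 m vegetation threshold.
    tops = peak_local_max(chm, min_distance=min_distance_px,
                          threshold_abs=min_height)
    markers = np.zeros_like(chm, dtype=int)
    markers[tuple(tops.T)] = np.arange(1, len(tops) + 1)

    # Marker-controlled watershed on the inverted CHM delineates crown extents.
    labels = watershed(-chm, markers, mask=chm >= min_height)
    return chm, labels                           # labels: one integer id per crown
```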

2.2. Annotation Generation

The annotation generation process aims to create a training dataset for an instance segmentation deep learning model. Consequently, the precise acquisition of single tree crowns during the annotation process is mandatory. The labeling of non-separable tree-crown-covered areas with a single annotation should be explicitly avoided.
Deep learning models for instance segmentation typically use vector data as the training data. Geometric shapes, such as points, lines and polygons, serve as representations of these annotations. Therefore, we used polygons to manually delineate the tree crowns, to capture the irregular shape and contours of the crowns as accurately as possible.
Before the annotation generation, the original images were split into tiles of 512 × 512 pixels. Typically, the image sizes for training CNNs range between 64 × 64 and 256 × 256 pixels [32]. However, the larger image size was chosen to avoid excessive tree crown fragmentation when splitting the image into tiles, and the training process speed was not prioritized. Furthermore, the image size of 512 × 512 pixels proved to be a favorable compromise in the annotation process between (1) effective digitalization of the tree crowns, due to fewer interruptions when creating the annotations, and (2) having a manageable number of trees within a single image. However, the fragmentation of tree crowns at the tile edges is unavoidable due to the tiling process. Therefore, spatially adjacent annotations were merged in post-processing to reduce duplicate annotations of tree crowns along tile edges.
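A minimal sketch of the tiling step, under the assumption that the orthophoto or satellite scene is available as a NumPy array, could look as follows; edge tiles smaller than 512 × 512 pixels are simply skipped here, whereas in practice they could be padded instead.

```python
# Minimal tiling sketch, assuming the scene is a (H, W, 3) uint8 array `image`.
import numpy as np

def split_into_tiles(image: np.ndarray, tile_size: int = 512):
    """Yield (row_offset, col_offset, tile) for non-overlapping tiles."""
    h, w = image.shape[:2]
    for r in range(0, h - tile_size + 1, tile_size):
        for c in range(0, w - tile_size + 1, tile_size):
            yield r, c, image[r:r + tile_size, c:c + tile_size]
```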
Overall, an average image tile from the first study site has 131 individual tree crowns, while an image tile from the second study site has about 800 tree crowns. The end-to-end platform SuperAnnotate [33] was used to manage the training data and generate the manual annotations.
The annotations for both sites were created manually by three annotators, each with a different level of experience in creating single tree crown annotations. Annotator 1 already had experience from several annotation projects. For Annotator 2, this was their second annotation project, and Annotator 3 did not have any experience in labeling.
The annotators were instructed to capture the outline of the individual tree crowns as precisely as possible and to annotate every tree crown in the image that they considered identifiable. Furthermore, no time limit was set for the annotation process, and the time taken was not recorded. No payment was made per individual annotation either. Hence, extrinsic factors, such as monetary rewards for the creation of annotations, did not influence the annotation result [34,35].

2.3. Case Distinction and Validation Metrics

To validate the manually generated annotations and to investigate potential causes of errors, we identify and adapt common cases and metrics used in machine learning and deep learning to evaluate model performance. In binary classification problems, the performance of a model can be expressed with a confusion matrix. This matrix includes four relations between prediction and reference: true positive (TP), false positive (FP), true negative (TN) and false negative (FN) [36]. For our validation, we use the cases TP, FP and FN. The determination of true negatives (TN) is not applicable in our study, because the condition for a TN is fulfilled if the model correctly predicts the negative class; in our case, the negative class would be the absence of tree reference data, which cannot be captured by a manual annotation. Furthermore, the case "multiple reference (MR)" is introduced, which considers multiple tree crowns summarized into one annotation.
Our definitions of true positive, false negative, false positive and multiple reference data differ between the first and second study site, due to varying reference data types. In the first study site, the reference data are available as points, and in the second, they are available as polygons. A detailed description and illustrations of four examples for both study sites are presented in Table 1 and Table 2.
We chose the metrics recall (true positive rate) and miss rate (false negative rate) to validate the acquisition of the single tree reference data by the annotations. The metric “multiple reference rate” determines the proportion of multiple reference data captured by one annotation. To validate the quality of our annotations, we use the metrics precision and false discovery rate as well as the “multiple reference rate-annotation”. This metric measures the proportion of annotations that capture multiple reference data. A comprehensive description of the validation metrics is specified in Table 3 and Table 4.
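To make the case distinction and metrics concrete, the following sketch shows how they could be computed for study site 1, where the annotations are polygons and the tree register entries are points. The implementation actually used in this study is not published; the code below only mirrors the definitions in Tables 1, 3 and 4, assumes shapely geometries as the data representation, and ignores edge cases such as overlapping annotations.

```python
# Hedged sketch of the study-site-1 case distinction and validation metrics.
from shapely.geometry import Point, Polygon

def validate_point_reference(annotations: list[Polygon], register_points: list[Point]):
    tp = fp = mr = 0                 # per-annotation cases
    multiple_ref_points = 0          # register points lumped into MR annotations
    captured = set()                 # indices of register points captured by any annotation
    for ann in annotations:
        inside = [i for i, pt in enumerate(register_points) if ann.contains(pt)]
        captured.update(inside)
        if len(inside) == 1:
            tp += 1                  # exactly one register point -> true positive
        elif len(inside) == 0:
            fp += 1                  # no register point -> false positive
        else:
            mr += 1                  # several register points -> multiple reference
            multiple_ref_points += len(inside)
    fn = len(register_points) - len(captured)   # register points never captured

    n_ref, n_ann = len(register_points), len(annotations)
    return {
        "recall": tp / n_ref,                   # TP / tree reference data (Table 3)
        "miss_rate": fn / n_ref,                # FN / tree reference data
        "mrr": multiple_ref_points / n_ref,     # multiple reference data / reference data
        "precision": tp / n_ann,                # TP / annotations (Table 4)
        "false_discovery_rate": fp / n_ann,     # FP / annotations
        "mrr_a": mr / n_ann,                    # MR annotations / annotations (per the
    }                                           # textual description in Table 4)
```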

3. Validation Results

The validation metrics presented in Section 2.3 were implemented for both study sites to obtain a quantitative validation of the manually generated annotations. The annotations of each annotator were evaluated separately. Afterwards, the mean value and the standard deviation of all the annotations and metrics obtained were calculated. In Table 5 and Table 6, the validation results are shown for each annotator and as a total.

3.1. Validation Study Site 1

For the first study site, an average of 511 annotations were made, in comparison to 817 single reference trees. An acquisition recall of only 37% was determined, meaning that only about a third of the tree register points were correctly captured by exactly one annotation. In addition, 17% of the single tree register points were not captured by any annotation (miss rate). Almost half (46%) of all single tree register points were captured together with other register points within a single polygon (multiple reference rate).
Regarding the quality of the manual annotations, on average, 59% of the annotations correctly identify a single tree register point. Conversely, 16% of the annotations fail to detect a tree register point (false discovery rate). Lastly, 25% of the annotations encompass multiple tree register points.

3.2. Validation Study Site 2

For the second study site, we analyzed the acquisition of tree segments (tree reference data) based on UAV laser scanning data. The manual annotations capture on average 10% of the single tree segments correctly (recall). However, the miss rate reached 43%, indicating that a significant number of single tree segments were not captured correctly by any annotation. As already observed for study site 1, many annotations summarize multiple tree segments as one (48%). The proportion of correct annotations among all the obtained annotations is, on average, 30% (precision). The false discovery rate is 34%, representing the ratio of annotations that do not accurately capture a single tree segment. Lastly, 36% of the annotations encompass multiple tree segments within a single polygon.

4. Discussion

The validation result is impacted by factors such as image quality and the definition of the metrics used, which differ between the two study sites. We present and explore the four most important factors, providing a foundation for the subsequent analysis of the validation results.

4.1. Influencing Factors on the Validation Result

Image quality and on-site conditions: The annotations were based on images of varying quality. At the first study site (Figure 4a), the annotations were made on airborne digital orthophotos with a GSD of 20 cm. In comparison, the satellite image from WorldView-3 for the second study site (Figure 4b) has a slightly lower resolution, with a GSD of approximately 30 cm. Furthermore, the satellite image has a slight off-nadir view of 11.7°, which makes the annotation process even more challenging. In addition to the image quality differences, the stand characteristics of the two study sites also differ. The trees at the first study site were planted rather than naturally grown or spread, which results in a low tree density of 96 trees per hectare. The city forest of the second study site has a significantly higher tree density of 288 trees per hectare. The image quality and on-site conditions of the second study site made the annotation process considerably more difficult, which was confirmed by all three annotators.
Metric definition: The comparability of the validation results between the study sites is also limited, as the metric definitions differ due to the different reference data geometries. For the first study site, the polygonal annotations are compared to points representing the tree reference data. As the position of the tree register point within the polygon is not considered, it is easier to fulfill the criterion for the acquisition of reference data by the annotations. For example, to correctly capture a tree register point with an annotation, which then counts as a true positive, the register point only has to lie anywhere within the area of the polygon. On the other hand, it is more challenging to meet the definition of a true positive for the second study site, for which at least 50% of the area of a single tree segment must be within a single annotation. Generally, the definitions of TP and FN significantly influence the validation result. For example, if the criterion for the TP case of the second study site were defined in such a way that the intersection over union (IoU) between the annotation and the tree reference segment had to be at least 50%, the recall would decrease significantly.
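The difference between the two acquisition criteria discussed above can be made explicit. The following sketch contrasts the "at least 50% of the segment area inside one annotation" rule used for study site 2 with the stricter IoU-based alternative, assuming shapely polygons as the data representation; the 50% thresholds follow the text.

```python
# Sketch of the two alternative TP criteria for study site 2 (shapely polygons).
from shapely.geometry import Polygon

def covered_fraction(segment: Polygon, annotation: Polygon) -> float:
    """Fraction of the reference segment's area that lies inside the annotation."""
    return segment.intersection(annotation).area / segment.area

def iou(segment: Polygon, annotation: Polygon) -> float:
    """Intersection over union of reference segment and annotation."""
    inter = segment.intersection(annotation).area
    return inter / (segment.area + annotation.area - inter)

# Criterion used in this study: TP if covered_fraction(segment, annotation) >= 0.5.
# Stricter alternative from the discussion: TP if iou(segment, annotation) >= 0.5,
# which additionally penalizes annotations much larger than the reference segment.
```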
Reference data quality: As described in Section 2.1.1, the reference tree data of the first study site are extracted from the publicly accessible tree register of the city of Frankfurt am Main. The geoposition of the individual trees as well as tree parameters like genus and stem diameter are measured and documented by tree inspectors from the official park department of the city of Frankfurt am Main [37]. It is therefore safe to assume that the geoposition and number of trees are accurate. In contrast, at the second study site, the tree reference data are automatically generated two-dimensional tree segments based on three-dimensional UAV laser scanning point clouds; hence, the tree segments may have errors with regard to their shape, position and number. Possible position and geometry errors (two-dimensional segments) of the tree reference data were not quantified in our validation process and are not considered in the calculation of the validation metrics. Finally, when analyzing the validation result, it is important to consider that reference data may also be incorrect due to data collection or processing and may not fully represent the true situation on the ground (the ground truth).
Temporal differences in the data: Both study sites have temporal differences between the data acquisition or publication of the tree reference data and the acquisition of the images on which the annotations were generated. The time difference in the first study site between the image acquisition and publication of the tree reference data is 11 months. At the second site, the ULS point cloud for the reference data was collected three years after the satellite image acquisition and with a shift from summer to autumn. The temporal changes of the trees, such as natural tree growth and death or anthropogenic tree planting and felling, between the annotations and tree reference data have a negative influence on the validation result. It was not possible to quantify this influence within the scope of this work. However, we assume that the validation results would not have improved significantly if the images and reference data had been recorded at the same time.

4.2. Analysis of the Validation Result

The overall rates of correctly capturing the tree reference data (recall) are quite low for both study sites (Table 5 and Table 6). At the second study site, which is far more demanding, the recall is significantly lower, with only 10% of the trees being correctly annotated. Significant differences between the recall values of the different annotators are not seen for either study site, as indicated by the small standard deviations. The standard deviations increase greatly with respect to the miss rate and MRR. For both study sites, the standard deviation for these two metrics is around 7%, but the annotator with more experience did not necessarily achieve the best results. This indicates that a certain level of experience can help to an extent but does not ensure good annotation quality.
The influences of image quality, on-site conditions and metric definitions between the two sites are notable in the higher miss rate and false discovery rate for the second study site. For this study site, the annotators missed a high proportion of tree reference data with their annotations.
The most noticeable aspect of the validation result is the underestimation of the actual number of trees in the images by the annotators, which can be seen equally for both study sites. Additionally, the first site's standard deviation of five for the total annotation count is noticeably lower than the second site's value of 209. This may be due to both the increased quantity of tree reference data available and the higher degree of complexity of the forest at the second site. A discrepancy is also visible between the total number of annotations and the amount of reference data. For the first study site, approximately 40% fewer annotations were made than would be necessary to capture each reference tree individually (on average 511 annotations for 817 reference trees). For the second study site, almost 70% fewer annotations were made than reference trees are actually present (on average 1170 annotations for 3572 reference segments). Furthermore, almost half of the tree reference data on both study sites were captured within a single annotation, as shown by the multiple reference rate. For the first study site, on average, three individual tree register points were captured by one annotation summarizing multiple tree crowns. For the second study site, four individual tree segments were captured on average by one such annotation.
Given that these data are intended to guide model training, the annotations of the second study site, with an average precision of 30%, would be unfit for this purpose. The annotations do not consistently represent the boundaries of individual tree crowns, as reflected by the high MRR-A of 36%. In contrast, the first study site fares slightly better, with a precision of about 60% and an MRR-A of about 25%. Nonetheless, true reference data should have a precision of over 90%.
A correlation between the experience of the annotators and the quality of the annotation expressed by the validation metrics could not be clearly recognized in this work. The annotator with the highest level of experience and the annotator with no experience achieved overall similar validation results.

5. Conclusions

Accuracy and quality assessments are crucial for training data in order to correctly evaluate machine learning and deep learning model results. However, the quantification of training data quality is still insufficiently addressed and is rarely treated as an open problem in the Earth observation community.
In this study, to validate the manual annotations, we adapt common cases and validation metrics used in machine learning (e.g., true positive, recall). We extend them by the case “multiple reference”, which considers multiple tree crowns being summarized into one annotation, and the metric “multiple reference rate”, which determines the ratio of multiple tree reference data captured in a single annotation (Section 2.3). The method is demonstrated on two study areas with different image sources, reference data and site conditions (Section 2.1), in which manual polygonal annotations of individual tree crowns were generated by three separate annotators.
The presented validation results of the annotations (Section 3) show the low quality of the manual annotations in capturing individual tree crowns for both study sites. Furthermore, the annotators significantly underestimate the true number of reference trees in the images, as quantified by the miss rate. In addition, the annotators often summarize multiple tree crowns into one annotation, as shown by the high values of the multiple reference rate for both study sites. The quantitative results did not reveal a clear effect of the annotators' level of expertise.
Finally, the annotation and validation results are influenced by many individual factors, such as image quality, site conditions, metric definitions and reference data quality, which differ from study site to study site.
Based on our research, we conclude that manual annotations of individual tree crowns in forests and areas with a forest-like plantation on remote sensing images are very likely to have significant deficits in capturing the actual conditions on the ground. As the annotations are used as training data, error-prone annotations can cause substantial errors in the prediction of deep learning models [16]. When training a model, the training data must be valued with caution, and the model results, as well as the quality of automatically created tree crown maps, should be reviewed critically.
An approach for enhancing the quality of manually generated, polygon-based annotations is the integration of annotations by multiple annotators. This strategy leverages the principles of crowdsourcing, in which the collective intelligence of crowdworkers is harnessed to accomplish a specific task [38]. Mei et al. [18] provide a method for integrating the tree crown annotations of multiple annotators for the same region of interest for urban trees using a Markov random field and multispectral information. Given the promising results demonstrated by this method, in our future research, we could apply it to our existing annotations of individual tree crowns within forest areas and investigate the impact of the integration of multiple annotations on the validation result.
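As a purely illustrative baseline, and not the Markov-random-field method of Mei et al. [18], the crown labels of several annotators could be combined by a simple pixel-wise majority vote, assuming each annotator's polygons have already been rasterized to binary masks of identical shape.

```python
# Naive multi-annotator integration by pixel-wise majority vote (baseline only,
# not the method of Mei et al. [18]); `masks` holds one binary (H, W) mask per annotator.
import numpy as np

def majority_vote(masks: list[np.ndarray]) -> np.ndarray:
    """Return a binary consensus mask: a pixel is 'crown' if a strict majority agrees."""
    stack = np.stack(masks).astype(np.uint8)          # (n_annotators, H, W)
    votes = stack.sum(axis=0)                         # number of annotators marking each pixel
    return (votes * 2 > len(masks)).astype(np.uint8)  # strict majority threshold
```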
Moreover, we will train and test a deep learning model for individual tree crown delineation using only annotations that were validated as correct in this study. This enables a controlled investigation of how annotations that closely represent the real shape and position of single tree crowns influence model performance, in comparison to non-validated annotations.

Author Contributions

Conceptualization, J.S. and M.G.; methodology, J.S.; formal analysis, J.S. and M.G.; investigation, J.S.; resources, J.S. and M.G.; data curation, J.S. and M.G.; writing—original draft preparation, J.S.; writing—review and editing, M.G. and D.I.; visualization, J.S. and M.G.; supervision, M.G. and D.I. All authors have read and agreed to the published version of the manuscript.

Funding

The funding was provided by the State of Hesse as a part of “LOEWE funding line 3” (HA-Project-No.: 1381/22-86).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

This study was conducted as part of the cooperative project ForSens between TU Darmstadt and Karuna Technology UG. We would like to express our sincere gratitude to Karuna Technology for providing the data for this research study. A special thanks goes to the annotators.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Oksuz, K.; Cam, B.C.; Kalkan, S.; Akbas, E. Imbalance Problems in Object Detection: A Review. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3388–3415. [Google Scholar] [CrossRef]
  2. Whang, S.E.; Roh, Y.; Song, H.; Lee, J.-G. Data collection and quality challenges in deep learning: A data-centric AI perspective. VLDB J. 2023, 32, 791–813. [Google Scholar] [CrossRef]
  3. Zhao, H.; Morgenroth, J.; Pearse, G.; Schindler, J. A Systematic Review of Individual Tree Crown Detection and Delineation with Convolutional Neural Networks (CNN). Curr. For. Rep. 2023, 9, 149–170. [Google Scholar] [CrossRef]
  4. Fujimoto, A.; Haga, C.; Matsui, T.; Machimura, T.; Hayashi, K.; Sugita, S.; Takagi, H. An End to End Process Development for UAV-SfM Based Forest Monitoring: Individual Tree Detection, Species Classification and Carbon Dynamics Simulation. Forests 2019, 10, 680. [Google Scholar] [CrossRef]
  5. Saarinen, N.; Vastaranta, M.; Näsi, R.; Rosnell, T.; Hakala, T.; Honkavaara, E.; Wulder, M.; Luoma, V.; Tommaselli, A.; Imai, N.; et al. Assessing Biodiversity in Boreal Forests with UAV-Based Photogrammetric Point Clouds and Hyperspectral Imaging. Remote Sens. 2018, 10, 338. [Google Scholar] [CrossRef]
  6. Shendryk, I.; Broich, M.; Tulbure, M.G.; McGrath, A.; Keith, D.; Alexandrov, S.V. Mapping individual tree health using full-waveform airborne laser scans and imaging spectroscopy: A case study for a floodplain eucalypt forest. Remote Sens. Environ. 2016, 187, 202–217. [Google Scholar] [CrossRef]
  7. Freudenberg, M.; Magdon, P.; Nölke, N. Individual tree crown delineation in high-resolution remote sensing images based on U-Net. Neural Comput. Appl. 2022, 34, 22197–22207. [Google Scholar] [CrossRef]
  8. Dalponte, M.; Frizzera, L.; Ørka, H.O.; Gobakken, T.; Næsset, E.; Gianelle, D. Predicting stem diameters and aboveground biomass of individual trees using remote sensing data. Ecol. Indic. 2018, 85, 367–376. [Google Scholar] [CrossRef]
  9. Wyckoff, P.H.; Clark, J.S. Tree growth prediction using size and exposed crown area. Can. J. For. Res. 2005, 35, 13–20. [Google Scholar] [CrossRef]
  10. Weinstein, B.G.; Marconi, S.; Bohlman, S.; Zare, A.; White, E. Individual Tree-Crown Detection in RGB Imagery Using Semi-Supervised Deep Learning Neural Networks. Remote Sens. 2019, 11, 1309. [Google Scholar] [CrossRef]
  11. Braga, J.R.G.; Peripato, V.; Dalagnol, R.; Ferreira, M.P.; Tarabalka, Y.; Aragão, L.E.O.C.; de Campos Velho, H.F.; Shiguemori, E.H.; Wagner, F.H. Tree Crown Delineation Algorithm Based on a Convolutional Neural Network. Remote Sens. 2020, 12, 1288. [Google Scholar] [CrossRef]
  12. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  13. Zimmermann, E.; Szeto, J.; Ratle, F. An Empirical Study of Uncertainty in Polygon Annotation and the Impact of Quality Assurance. 2023. Available online: http://arxiv.org/pdf/2311.02707.pdf (accessed on 14 February 2024).
  14. Ball, J.G.C.; Hickman, S.H.M.; Jackson, T.D.; Koay, X.J.; Hirst, J.; Jay, W.; Archer, M.; Aubry-Kientz, M.; Vincent, G.; Coomes, D.A. Accurate delineation of individual tree crowns in tropical forests from aerial RGB imagery using Mask R-CNN. Remote Sens. Ecol. Conserv. 2023, 9, 641–655. [Google Scholar] [CrossRef]
  15. Lassalle, G.; Ferreira, M.P.; La Rosa, L.E.C.; de Souza Filho, C.R. Deep learning-based individual tree crown delineation in mangrove forests using very-high-resolution satellite imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 220–235. [Google Scholar] [CrossRef]
  16. Elmes, A.; Alemohammad, H.; Avery, R.; Caylor, K.; Eastman, J.; Fishgold, L.; Friedl, M.; Jain, M.; Kohli, D.; Laso Bayas, J.; et al. Accounting for Training Data Error in Machine Learning Applied to Earth Observations. Remote Sens. 2020, 12, 1034. [Google Scholar] [CrossRef]
  17. Stewart, D.; Zare, A.; Marconi, S.; Weinstein, B.G.; White, E.P.; Graves, S.J.; Bohlman, S.A.; Singh, A. RandCrowns: A Quantitative Metric for Imprecisely Labeled Tree Crown Delineation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11229–11239. [Google Scholar] [CrossRef]
  18. Mei, Q.; Steier, J.; Iwaszczuk, D. Integrating Crowd-sourced Annotations of Tree Crowns using Markov Random Field and Multispectral Information. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, 48, 257–263. [Google Scholar] [CrossRef]
  19. Sun, Y.; Li, Z.; He, H.; Guo, L.; Zhang, X.; Xin, Q. Counting trees in a subtropical mega city using the instance segmentation method. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102662. [Google Scholar] [CrossRef]
  20. Caughlin, T.T.; Graves, S.J.; Asner, G.P.; Tarbox, B.C.; Bohlman, S.A. High-Resolution Remote Sensing Data as a Boundary Object to Facilitate Interdisciplinary Collaboration. In Collaboration Across Boundaries for Social-Ecological Systems Science; Perz, S.G., Ed.; Springer International Publishing: Cham, Switzerland, 2019; pp. 295–326. ISBN 978-3-030-13826-4. [Google Scholar]
  21. Wagner, F.H.; Ferreira, M.P.; Sanchez, A.; Hirye, M.C.; Zortea, M.; Gloor, E.; Phillips, O.L.; de Souza Filho, C.R.; Shimabukuro, Y.E.; Aragão, L.E. Individual tree crown delineation in a highly diverse tropical forest using very high resolution satellite images. ISPRS J. Photogramm. Remote Sens. 2018, 145, 362–377. [Google Scholar] [CrossRef]
  22. Foody, G.; Pal, M.; Rocchini, D.; Garzon-Lopez, C.; Bastin, L. The Sensitivity of Mapping Methods to Reference Data Quality: Training Supervised Image Classifications with Imperfect Reference Data. ISPRS Int. J. Geo-Inf. 2016, 5, 199. [Google Scholar] [CrossRef]
  23. Copass, C.; Antonova, N.; Kennedy, R. Comparison of Office and Field Techniques for Validating Landscape Change Classification in Pacific Northwest National Parks. Remote Sens. 2019, 11, 3. [Google Scholar] [CrossRef]
  24. Lepš, J.; Hadincová, V. How reliable are our vegetation analyses? J Veg. Sci. 1992, 3, 119–124. [Google Scholar] [CrossRef]
  25. Kohli, D.; Sliuzas, R.; Kerle, N.; Stein, A. An ontology of slums for image-based classification. Comput. Environ. Urban Syst. 2012, 36, 154–163. [Google Scholar] [CrossRef]
  26. Kohli, D.; Stein, A.; Sliuzas, R. Uncertainty analysis for image interpretations of urban slums. Comput. Environ. Urban Syst. 2016, 60, 37–49. [Google Scholar] [CrossRef]
  27. Meining, S. Waldzustandsbericht 2020 für den Stadtwald Darmstadt. 2020. Available online: https://www.darmstadtnews.de/wp-content/uploads/2021/01/Waldzustandsbericht_Darmstadt_2020.pdf (accessed on 13 December 2023).
  28. The MathWorks Inc. 2022, Lidar Toolbox Version: 9.4 (R2022b). Available online: https://www.mathworks.com (accessed on 19 July 2024).
  29. The MathWorks Inc. Extract Forest Metrics and Individual Tree Attributes from Aerial Lidar Data. Available online: https://www.mathworks.com/help/lidar/ug/extraction-of-forest-metrics-and-individual-tree-attributes.html (accessed on 19 July 2024).
  30. CloudCompare (Version 2.13.2). 2024. Available online: http://www.cloudcompare.org/ (accessed on 19 July 2024).
  31. QGIS.org. 2024, QGIS Geographic Information System. QGIS Association, Version 3.28.2. Available online: http://www.qgis.org (accessed on 19 July 2024).
  32. Thambawita, V.; Strümke, I.; Hicks, S.A.; Halvorsen, P.; Parasa, S.; Riegler, M.A. Impact of Image Resolution on Deep Learning Performance in Endoscopy Image Classification: An Experimental Study Using a Large Dataset of Endoscopic Images. Diagnostics 2021, 11, 2183. [Google Scholar] [CrossRef]
  33. SuperAnnotate AI, Inc. 2024. Available online: https://www.superannotate.com/ (accessed on 19 July 2024).
  34. Collmar, D.; Walter, V.; Kölle, M.; Sörgel, U. From Multiple Polygons to Single Geometry: Optimization of Polygon Integration for Crowdsourced Data. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 10, 159–166. [Google Scholar] [CrossRef]
  35. Hossain, M. Users’ motivation to participate in online crowdsourcing platforms. In Proceedings of the International Conference on Innovation Management and Technology Research (ICIMTR), Malacca, Malaysia, 21–22 May 2012; IEEE: New York, NY, USA, 2012; pp. 310–315, ISBN 978-1-4673-0654-6. [Google Scholar]
  36. Zhou, Z.-H. Machine Learning; Springer: Singapore, 2021; ISBN 978-981-15-1966-6. [Google Scholar]
  37. FRANKFURT.DE-DAS OFFIZIELLE STADTPORTAL. Baumkataster und Baumliste|Stadt Frankfurt am Main. Available online: https://frankfurt.de/themen/umwelt-und-gruen/umwelt-und-gruen-a-z/im-gruenen/baeume/baumkataster (accessed on 30 May 2024).
  38. Saralioglu, E.; Gungor, O. Crowdsourcing in Remote Sensing: A Review of Applications and Future Directions. IEEE Geosci. Remote Sens. Mag. 2020, 8, 89–110. [Google Scholar] [CrossRef]
Figure 1. Mapping informal settlements in Nairobi, Kenya with manual annotations. Each colored line indicates a different annotator's delineation of the same area [16]: (a) boundary deviation due to generalization of informal settlements and (b) deviation resulting from inclusion or exclusion of the fringe [26] (adapted from Elmes et al. [16] with permission from Kohli et al. [26]).
Figure 2. The four validation areas (red outlines) of study site 1.
Figure 3. Nadir 3D point cloud in RGB color scheme (a) and derived 2D segments (b), which represent the single tree reference data for the validation process of study site 2.
Figure 4. Example annotation images with 512 × 512 pixel resolution based on the digital orthophoto (a) and the satellite image from WorldView-3 (b).
Table 1. Description and illustrations of examples of true positive, false negative, false positive and multiple reference data for study site 1 (annotations are shown by red lines, and tree register points are shown as green dots). FN and FP are portrayed in the same illustration.
  • True positive (TP): One annotation captures exactly one tree register point.
  • False negative (FN): A tree register point is not captured by a single annotation.
  • False positive (FP): An annotation does not capture a single tree register point.
  • Multiple reference (MR): One annotation captures multiple tree register points.
Table 2. Description and illustrations of examples of true positive, false negative, false positive and multiple reference data for study site 2 (annotations are visualized as red polygon lines and tree segments as green-filled polygons). FN and FP are portrayed in the same illustration.
  • True positive (TP): At least 50% of the area of a single segment is located within a single annotation.
  • False negative (FN): Less than 50% of the area of a single segment is located within a single annotation.
  • False positive (FP): An annotation captures less than 50% of the area of a single segment.
  • Multiple reference (MR): An annotation contains multiple segments with at least 50% of their area.
Table 3. Metrics for the validation of the acquisition of tree reference data (study site 1: tree register points, study site 2: tree segments).
  • Recall (true positive rate). Universal definition 1: TP / (TP + FN). Definition for study sites 1 and 2: the ratio of correctly annotated tree reference data among all tree reference data, i.e., TP / (tree reference data).
  • Miss rate (false negative rate). Universal definition 1: FN / (FN + TP). Definition for study sites 1 and 2: the ratio of non-annotated tree reference data among all tree reference data, i.e., FN / (tree reference data).
  • Multiple reference rate (MRR). No universal counterpart. Definition for study sites 1 and 2: the ratio of tree reference data captured together in a single annotation among all tree reference data, i.e., (multiple reference data) / (tree reference data).
1 Universal definition in the field of machine and deep learning classification.
Table 4. Metrics for the validation of annotations.
  • Precision (positive predictive value). Universal definition 1: TP / (TP + FP). Definition for study sites 1 and 2: the ratio of annotations that correctly capture single tree reference data among all annotations, i.e., TP / (annotations).
  • False discovery rate. Universal definition 1: FP / (FP + TP). Definition for study sites 1 and 2: the ratio of annotations that do not capture single tree reference data among all annotations, i.e., FP / (annotations).
  • Multiple reference rate-annotation (MRR-A). No universal counterpart. Definition for study sites 1 and 2: the ratio of annotations that capture multiple tree reference data among all annotations, i.e., (multiple reference data) / (annotations).
1 Universal definition in the field of machine and deep learning classification.
Table 5. Metrics for the acquisition validation of tree reference data and the quality validation of annotations for study site 1.
Reference tree count (points): 817
Metric | Annotator 1 | Annotator 2 | Annotator 3 | Mean Value | Standard Deviation
Annotation count | 505 | 518 | 510 | 511 | 5
Metrics for acquisition validation of tree reference data:
Recall | 37.2% | 36.2% | 37.6% | 37.0% | 0.6%
Miss rate | 21.2% | 8.1% | 22.6% | 17.3% | 6.5%
MRR | 41.6% | 55.7% | 39.8% | 45.7% | 7.1%
Metrics for quality validation of annotations:
Precision | 60.2% | 57.1% | 60.2% | 59.2% | 1.5%
False discovery rate | 15.6% | 17.8% | 15.7% | 16.4% | 1.0%
MRR-A | 24.2% | 25.1% | 24.1% | 24.5% | 0.4%
Table 6. Metrics for the acquisition validation of tree reference data and the quality validation of annotations for study site 2.
Reference tree count (segments): 3572
Metric | Annotator 1 | Annotator 2 | Annotator 3 | Mean Value | Standard Deviation
Annotation count | 1465 | 1020 | 1024 | 1170 | 209
Metrics for acquisition validation of tree reference data:
Recall | 11.7% | 9.2% | 8.5% | 9.8% | 1.4%
Miss rate | 37.8% | 36.1% | 53.6% | 42.5% | 7.9%
MRR | 50.5% | 54.8% | 38.0% | 47.8% | 7.1%
Metrics for quality validation of annotations:
Precision | 28.5% | 32.2% | 29.5% | 30.1% | 1.6%
False discovery rate | 41.4% | 25.8% | 34.7% | 34.0% | 6.4%
MRR-A | 30.0% | 42.1% | 35.8% | 36.0% | 4.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
