4.2. Annotation
We used the VGG Image Annotator (VIA) [39] for the annotation process, as it is one of the most widely used image annotation tools. However, since manual annotation is time-consuming, we reduced the manual effort by following the steps below.
First, we annotated 3751 trees, using 2303 of them for training the model and 1448 for evaluating the machine learning procedure. Each tree was annotated four times, once for each cultivation year, unless it did not appear in the orthomosaic image because it was too small or had no leaves on the flight day. Thus, the annotation dataset consists of two classes: cherry trees and the background.
The model trained at this step was then used to detect cherry trees in other orchards. We converted the detected masks from this step into new annotated cherry trees using the following procedure. First, we detected all pixels on the edge of each mask and created a polygon. The polygon was then simplified, reducing complexity by removing vertices, and converted to a JSON format suitable for the VGG Image Annotator.
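The following is a minimal sketch of this conversion, assuming the detected masks are available as binary NumPy arrays; the class label, simplification tolerance, and function name are illustrative:

import json

import cv2
import numpy as np

def mask_to_via_region(mask: np.ndarray, epsilon_px: float = 2.0) -> dict:
    """Trace a binary tree mask, simplify it, and emit a VIA polygon region."""
    # Detect the pixels on the edge of the mask as an outer contour.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    # Simplify the polygon by removing vertices (Douglas-Peucker).
    polygon = cv2.approxPolyDP(contour, epsilon_px, True).reshape(-1, 2)
    # VIA stores polygons as parallel lists of x and y coordinates.
    return {"shape_attributes": {"name": "polygon",
                                 "all_points_x": polygon[:, 0].tolist(),
                                 "all_points_y": polygon[:, 1].tolist()},
            "region_attributes": {"class": "cherry_tree"}}

# Example: serialise one synthetic circular mask as a VIA region.
demo = np.zeros((64, 64), np.uint8)
cv2.circle(demo, (32, 32), 20, 1, -1)
print(json.dumps(mask_to_via_region(demo)))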
In the next step, a second stage of manual annotation took place with the VIA tool. In this step, we manually corrected any faulty tree detections and added the trees that the algorithm did not detect. Finally, we obtained 11,254 annotated cherry trees: 6440 for the training dataset and 4814 for the evaluation dataset. Thus, 57.22% of the cherry trees were used for training and 42.78% for evaluation. In addition, we converted the final annotations to the corresponding format for YOLOv8. All information about the annotation is summarised in Table 3.
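For the YOLOv8 conversion, each instance becomes one text line consisting of the class index followed by the polygon coordinates normalised to [0, 1]; a sketch, reusing the VIA region structure from above:

def via_region_to_yolo_seg(region: dict, img_w: int, img_h: int,
                           class_id: int = 0) -> str:
    """Convert one VIA polygon region to a YOLOv8 segmentation label line."""
    xs = region["shape_attributes"]["all_points_x"]
    ys = region["shape_attributes"]["all_points_y"]
    # YOLOv8 segmentation labels: "<class> x1 y1 x2 y2 ..." in [0, 1].
    coords = [f"{x / img_w:.6f} {y / img_h:.6f}" for x, y in zip(xs, ys)]
    return " ".join([str(class_id)] + coords)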
4.3. Configuration
For the training process, we used a Debian Linux (version 11) virtual machine on the cloud with 2 vCPU cores, 8 GB of RAM, and an NVIDIA Tesla T4 GPU. We evaluated both Detectron2 and YOLOv8 for instance segmentation in order to detect cherry trees in orchards. As the backbone network for Detectron2, we tested both ResNet-50 and ResNet-101. For YOLOv8, we tested YOLOv8m-seg and YOLOv8x-seg as pre-trained models. We trained both YOLOv8 configurations for 500 epochs; for Detectron2, we used 16,000 iterations, which corresponds to 500 epochs given the number of images in the training dataset. In addition, we used a batch size of 4 for all configurations.
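The two trainings might be launched as sketched below; the dataset paths are illustrative, and the exact hyper-parameters are those reported in Tables 4 and 5:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from ultralytics import YOLO

# YOLOv8: train a pre-trained segmentation model for 500 epochs, batch 4.
yolo = YOLO("yolov8m-seg.pt")                      # or "yolov8x-seg.pt"
yolo.train(data="cherry_trees.yaml", epochs=500, batch=4)

# Detectron2: Mask R-CNN with a ResNet-50 FPN backbone (or R_101), 16,000
# iterations with 4 images per batch (~500 epochs for this training set).
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.MAX_ITER = 16000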
Furthermore, since all cherry trees have almost round or slightly elliptical crowns, we selected anchor boxes whose side lengths differ only slightly. Thus, we selected the values (0.8, 1.0, 1.2) as aspect ratios for Detectron2.
Figure 5 illustrates the possible anchor boxes with these values for RoI extraction.
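In Detectron2, this corresponds to a single configuration entry, continuing the sketch above (the framework's default is [[0.5, 1.0, 2.0]]):

# Near-square anchors for round or slightly elliptical crowns.
cfg.MODEL.ANCHOR_GENERATOR.ASPECT_RATIOS = [[0.8, 1.0, 1.2]]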
YOLOv8 uses an anchor-free mechanism, meaning it directly predicts the centre of an object instead of the offset from a known anchor box.
The configuration described above and the main hyper-parameters used for Detectron2 and YOLOv8 are summarised in Table 4 and Table 5, respectively.
4.4. Mask Improvement
While most of the masks generated by Detectron2 and YOLOv8 closely approximate the actual tree crowns, minor corrections are required to achieve optimal results. Furthermore, in some cases, two of the provided masks overlap, for both Detectron2 and YOLOv8. Since we want to calculate and analyse vegetation indices based on the tree crowns, it is preferable to ensure that each tree's mask is independent.
For the above reasons, we employed an additional method to improve the provided masks. More specifically, we utilised the NDVI index of the orchard and applied a threshold based on the OTSU [17] method, along with gamma correction, to precisely outline the crowns of the trees.
OTSU efficiently divides a given image into two areas of interest by determining the optimal threshold based on the image's histogram. It achieves this by iteratively searching for the threshold that minimises the intra-class variance, defined as the weighted sum of the variances of the two classes, as shown in Equation (1):

\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t) \qquad (1)

The weights \omega_0(t) and \omega_1(t) represent the probabilities of the two classes separated by a threshold t, while \sigma_0^2(t) and \sigma_1^2(t) denote the variances of the two classes based on the image's histogram. Assuming the image's histogram has L bins, the class probabilities \omega_0(t) and \omega_1(t) are calculated using Equations (2) and (3), respectively, where p(i) denotes the normalised frequency of histogram bin i:

\omega_0(t) = \sum_{i=0}^{t-1} p(i) \qquad (2)

\omega_1(t) = \sum_{i=t}^{L-1} p(i) \qquad (3)
The OTSU method for threshold estimation is implemented in various graphics programming libraries, such as OpenCV.
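For instance, in OpenCV the OTSU threshold is obtained by passing the THRESH_OTSU flag; a minimal sketch, with an illustrative file name:

import cv2

# 'gray' must be a single-channel 8-bit image, e.g., the NDVI grayscale image.
gray = cv2.imread("ndvi_gray.png", cv2.IMREAD_GRAYSCALE)
# OpenCV returns the estimated OTSU threshold alongside the binary image.
t, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)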
Furthermore, before applying OTSU thresholding, we use gamma correction to improve the effectiveness of the algorithm, particularly in areas where the presence of weeds and grass is obvious.
We selected the NDVI since it indicates the presence of photosynthetic activity. Thus, values above a threshold indicate the presence of cherry trees in the orchard, while lower values indicate the soil surface or areas with low coverage of grass and weeds. The image of the NDVI index must be perfectly aligned with the corresponding orthomosaic image of the orchard. Moreover, for optimal results, it is preferable to have low weed coverage in the orchard and to have all trees detected, especially those in close proximity to others.
The proposed method is divided into two stages. The first stage proceeds as follows. Firstly, we calculate the corresponding NDVI index for the cultivation area. Subsequently, based on this index, we create a grayscale pseudocolour image and use it as input for the next steps.
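The NDVI is computed from the near-infrared and red bands as NDVI = (NIR − Red) / (NIR + Red); a minimal sketch of producing the grayscale image, assuming the two bands are available as arrays:

import numpy as np

def ndvi_to_gray(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Compute the NDVI and rescale it to an 8-bit grayscale image."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    ndvi = (nir - red) / (nir + red + 1e-6)          # NDVI lies in [-1, 1]
    return np.clip((ndvi + 1.0) * 127.5, 0, 255).astype(np.uint8)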
Secondly, for each detected cherry tree, we use the OTSU method to calculate a threshold on the surrounding area, which is enlarged by 100% of the mask provided by Detectron2 or YOLOv8. Before applying the threshold to the area, we apply a gamma correction based on Equations (4) and (5):

\gamma = \frac{\log(0.5)}{\log(T / 255)} \qquad (4)

I_{out} = 255 \left( \frac{I_{in}}{255} \right)^{\gamma} \qquad (5)

where T is the threshold provided by the OTSU method in this step, I_{in} are the pixels of the input image, I_{out} are the pixels of the output image, and 255 is the maximum value of a pixel. With this choice of \gamma, the OTSU threshold is mapped to mid-grey, which helps to make cherry trees more distinct from grass and weeds.
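Under the reconstruction above, in which Equation (4) maps the OTSU threshold to mid-grey, the correction might be implemented as follows:

import numpy as np

def auto_gamma(img: np.ndarray, t: float) -> np.ndarray:
    """Gamma-correct an 8-bit image so the OTSU threshold t maps to mid-grey."""
    gamma = np.log(0.5) / np.log(t / 255.0)          # Equation (4)
    out = 255.0 * (img / 255.0) ** gamma             # Equation (5)
    return out.astype(np.uint8)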
In the next step, after applying gamma correction, we recalculate the threshold using the OTSU method and apply it to obtain a thresholded image of the subsection around a specific tree.
Finally, we concatenate all the derived results from the subsections of the image to create a final thresholded image.
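Putting the first stage together, a simplified sketch, reusing auto_gamma from above and assuming the bounding boxes of the detected masks as (x, y, w, h) tuples:

import cv2
import numpy as np

def stage_one(gray: np.ndarray, boxes: list) -> np.ndarray:
    """Per-tree OTSU + gamma correction + OTSU, concatenated into one image."""
    canvas = np.zeros_like(gray)
    for x, y, w, h in boxes:
        # Enlarge the surrounding area by 100% of the mask's bounding box.
        x0, y0 = max(0, x - w // 2), max(0, y - h // 2)
        x1 = min(gray.shape[1], x + w + w // 2)
        y1 = min(gray.shape[0], y + h + h // 2)
        crop = gray[y0:y1, x0:x1]
        # First pass: estimate the local OTSU threshold.
        t, _ = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Gamma-correct with that threshold, then re-threshold with OTSU.
        corrected = auto_gamma(crop, float(np.clip(t, 1.0, 254.0)))
        _, local = cv2.threshold(corrected, 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        canvas[y0:y1, x0:x1] = np.maximum(canvas[y0:y1, x0:x1], local)
    return canvas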
For example, Figure 6b displays the pseudocoloured grayscale image of the NDVI index for the orchard in Figure 6a. Figure 7a depicts the surrounding area of the image for a specific tree. In addition, Figure 8 displays the corresponding histogram and the threshold detected by the OTSU method for this tree. Furthermore, Figure 7b displays the same surrounding area after applying gamma correction. Figure 7c displays the resulting black-and-white image of the two classes defined by the OTSU method after the gamma correction. Finally, Figure 9 presents the overall image of the orchard, obtained by concatenating the thresholded images of all trees.
In the second stage of our method, we use this image as a reference to improve the mask of each tree detected by Detectron2 or YOLOv8. In our examples, we used the masks derived from Detectron2, since they provide better accuracy in cherry tree detection. Furthermore, the final masks produced by our method remain almost the same even when the initial masks are taken from YOLOv8.
First, we remove all pixels from the perimeter of the mask whose corresponding pixels in the OTSU black-and-white image are black. In addition, we remove perimeter pixels that belong to other masks. This part of the method resolves any overlaps with nearby trees.
Secondly, we search for pixels adjacent to the mask whose corresponding pixels in the OTSU black-and-white image are white. We conduct this process step by step, adding one pixel at a time along the perimeter of the existing mask. Throughout this procedure, we ensure that the additional pixels do not belong to other masks, preventing any overlap. At each step, we invert the order of the masks to ensure equal expansion when two cherry trees share the same area. We repeat this procedure until no further pixels can be added. A simplified sketch of this refinement is given below.
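The sketch assumes boolean NumPy masks and the stage-one black-and-white image; the alternating order is approximated by reversing the processing order on each round:

import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def union_of_others(masks: list, i: int) -> np.ndarray:
    """Union of every mask except mask i (all masks are boolean arrays)."""
    out = np.zeros(masks[i].shape, dtype=bool)
    for j, mj in enumerate(masks):
        if j != i:
            out |= mj
    return out

def refine_masks(masks: list, otsu_bw: np.ndarray,
                 max_iters: int = 100) -> list:
    veg = otsu_bw > 0                                # white = vegetation
    # Step 1: peel perimeter pixels lying on black OTSU pixels or on other
    # masks, which also resolves overlaps between nearby trees.
    for i in range(len(masks)):
        for _ in range(max_iters):
            rim = masks[i] & ~binary_erosion(masks[i])
            bad = rim & (~veg | union_of_others(masks, i))
            if not bad.any():
                break
            masks[i] &= ~bad
    # Step 2: grow each mask one pixel ring at a time over white, unclaimed
    # pixels, reversing the order each round so neighbours expand evenly.
    for it in range(max_iters):
        order = list(range(len(masks)))
        if it % 2:
            order.reverse()
        grew = False
        for i in order:
            ring = (binary_dilation(masks[i]) & ~masks[i]
                    & veg & ~union_of_others(masks, i))
            if ring.any():
                masks[i] |= ring
                grew = True
        if not grew:
            break
    return masks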
As an example, Figure 10c,d display the masks detected by Detectron2 and YOLOv8, respectively, for the same cherry tree. The predicted masks are close to the ground truth (Figure 10b) but are not exactly the same. Figure 10a shows the part of the orthomosaic photo containing this specific cherry tree.
Furthermore, Figure 11a shows the area removed from or added to the initial mask using the OTSU method; dark grey pixels indicate the removed area, while light grey pixels indicate the added area. Finally, Figure 11b shows the final mask of the detected tree after the OTSU-based step, which clearly aligns more precisely with the ground truth mask in Figure 10b. Figure 11c illustrates the corresponding image in RGB format together with the perimeter of the final mask, highlighting the improvement achieved through the suggested method. It is important to note that the proposed method also handles the shadows of the trees effectively, as seen in Figure 11c. This is due to the characteristics of the NDVI, which has low values in areas with no vegetation; the shadow is thus separated from the cherry tree, as its NDVI values fall into a different class during OTSU thresholding.