**4. Discussion**

#### *4.1. Impacts of Classifier, Feature Selection and Data Fusion*

Both SVM and RF classifiers have been reported to perform well in remote sensing based classifications, and neither consistently outperforms the other [17–19]. In our results, there was no statistically significant difference between the two classifiers with the best performing feature set (MNF + ALS). Previous research has shown that fusing ALS data with IS data may increase the classification accuracy, although canopy height information alone is often not enough to improve the results [9,18,52]. In our results, the impact of data fusion depended on the classifier and on how the IS data were used. For example, the SVM classifier benefited from data fusion when ALS data were combined with MNF data, while the RF classifier did not. The most important ALS-derived feature was the maximum height divided by the maximum crown diameter. This is logical as, for example, *Eucalyptus* spp. and *Grevillea robusta* are very tall trees with narrow crowns, while acacias tend to have wider crowns in relation to their height. Our ALS feature set was limited because we used discrete return ALS data; full waveform ALS data could yield even bigger improvements [53].

Feature selection did not have a statistically significant impact on the best performing feature set, MNF + ALS. However, it was still useful, given that the same accuracies were achieved with fewer features, making the training and execution of the models faster. In addition, we used feature selection to find the features that were important for the classification procedure. In a recent review, the visible and near-infrared wavelength regions that were important for tree species classification were found most commonly at 450–550 nm and 650–700 nm [9]. In our results, the most important regions were found at 400–450 nm, 550–570 nm and 700–800 nm. The most commonly important spectral region in the review was at 650 nm, which was not important in our study. In our results, the wavelengths near 400 nm were especially important; this region was selected as important in 38% of the studies in the review [9]. The wavelength regions between 800 and 1000 nm were not commonly important in the review, and the same was true in our study. We did not have SWIR data available; SWIR contains important wavelengths, for example near 1200 nm and 1450 nm, which might increase the classification accuracy.

#### *4.2. The Impact of Up-Sampling and Grouping of Species on the Classification Results*

The OA and Kappa were low when all 31 species were classified separately. However, when species with fewer than 20 samples were combined, we reached higher OA and Kappa than a recent study conducted in a similar landscape in Panama [10], although our data had fewer species with more than 20 samples. In addition, the F1-scores had a similar range, with high variability between species.
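For readers less familiar with the three metrics discussed above, the following is a minimal sketch of how OA, Cohen's Kappa and per-class F1-scores can be computed from reference and predicted labels. The function names and the example labels are illustrative assumptions and are not taken from the study's own processing chain.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = overall_accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: a single class in both label sets
    return (observed - expected) / (1.0 - expected)


def f1_per_class(y_true, y_pred):
    """Per-class F1-score, the harmonic mean of precision and recall."""
    scores = {}
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical labels for illustration only.
reference = ["eucalyptus", "eucalyptus", "eucalyptus", "other"]
predicted = ["eucalyptus", "eucalyptus", "other", "other"]
print(overall_accuracy(reference, predicted))
print(cohen_kappa(reference, predicted))
print(f1_per_class(reference, predicted))
```

Because OA and Kappa summarize all classes at once, they can hide poor performance on rare species, which is why per-class F1-scores are reported alongside them.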

The high F1-score (with low variability) for *Eucalyptus* spp. enables map production for conservation planning. However, it needs to be considered that species with fewer than four samples were removed from the model (8.1% of the field measurements with a matching tree crown). The highest F1-score for *Eucalyptus* spp. was achieved when it was classified against a mixed group of all the other species using up-sampling. However, the difference between the up-sampled and the imbalanced model was marginal. In addition, *Acacia mearnsii*, *Grevillea robusta* and *Euphorbia kibwezensis* could be mapped with relatively high accuracy. *Acacia mearnsii* is highly invasive, and monitoring its distribution in the long term would be valuable for conservation planning. *Euphorbia kibwezensis*, a dry land species, could possibly be used to study linkages between changes in climate and the occurrence of the species, as done earlier with *Euphorbia ingens* in South Africa [54]. Species with fewer samples, *Ficus sycomorus*, *Acacia tortilis*, *Erythrina abyssinica* and *Syzygium* spp., were also classified with relatively high mean F1-scores in the single species setting. However, the small sample size and high variability in the results make it difficult to assess how the models would perform when extended over the whole study area. *Cupressus lusitanica* was classified with poor accuracy, possibly because it is used in fences, where it grows densely and achieves low maximum heights, but also in plantations for lumber production, where it grows much taller. *Persea americana* was classified with poor accuracy, which could be explained by spectral and structural similarities with a number of other fruit trees. As hyperspectral data also capture the phenological states of trees [55], it is possible that some of the misclassifications can be explained by the spectral variation caused by different phenology resulting from the varying local climate caused by topography.
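As an illustration of the up-sampling discussed above, the sketch below balances a training set by randomly duplicating samples of the smaller classes until every class matches the largest one. This is only one possible up-sampling scheme, assumed here for illustration; the function name `upsample` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def upsample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples (with replacement)
    until every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Keep all originals, then draw the missing number of duplicates.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical feature vectors: five samples of one class, two of another.
features = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9], [1.0]]
species = ["eucalyptus"] * 5 + ["other"] * 2
balanced_features, balanced_species = upsample(features, species)
```

In a sketch like this, up-sampling should be applied to the training fold only, so that duplicated samples never appear in both the training and the validation data.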

In previous studies with a high number of species, a mixed group of species with fewer samples has commonly been used [10,56]. However, combining all the species under a fixed limit (e.g., 20 samples) creates a large and highly heterogeneous mixed class. On the other hand, spectral similarity measures, like JM distance, can be used to find spectrally and structurally similar species, which enables creating smaller and more homogeneous groups that also balance the training data. For example, Group 3 created with this approach has two exotic (*Persea americana* and *Mangifera indica*) and three native (*Ficus sur*, *Syzygium* spp. and *Bridelia micrantha*) fruit bearing tree species. The sixth species, *Phoenix reclinata*, is a palm that produces edible fruits (dates). Thus, it is also an ecologically meaningful group, as all of the species are fruit bearing. Based on our results, JM distance may help in identifying groups of species that could be classified together with acceptable accuracy. However, grouping the species using JM distance makes sense only if the created groups have a common ecological or economic function. Up-sampling did not improve OA or Kappa, but we did see improvements at the species level, usually for species with smaller sample sizes or lower initial classification accuracy.
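To make the JM distance used for grouping more concrete, the following is a minimal sketch of the Jeffries-Matusita distance between two classes. It rests on several assumptions that are not stated in the text: each class is modeled as a Gaussian, features are treated as independent (diagonal covariance, a simplification that avoids matrix inversion), and the distance is scaled to the range 0 to 2. The function names are hypothetical.

```python
import math
import statistics


def class_stats(samples):
    """Per-feature mean and sample variance for one class.
    `samples` is a list of equal-length feature vectors (at least two)."""
    columns = list(zip(*samples))
    means = [statistics.fmean(col) for col in columns]
    variances = [statistics.variance(col) for col in columns]
    return means, variances


def jm_distance(mean_a, var_a, mean_b, var_b):
    """Jeffries-Matusita distance between two Gaussian classes with
    diagonal covariance. 0 means identical, 2 means fully separable."""
    bhattacharyya = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        pooled = (va + vb) / 2.0
        # Term driven by the difference between class means.
        bhattacharyya += (ma - mb) ** 2 / (8.0 * pooled)
        # Term driven by the difference between class variances.
        bhattacharyya += 0.5 * math.log(pooled / math.sqrt(va * vb))
    return 2.0 * (1.0 - math.exp(-bhattacharyya))


# Hypothetical two-feature samples for two species.
species_a = [[0.10, 0.40], [0.12, 0.43], [0.09, 0.38]]
species_b = [[0.11, 0.41], [0.13, 0.40], [0.10, 0.44]]
print(jm_distance(*class_stats(species_a), *class_stats(species_b)))
```

Pairs of species with a low JM distance are candidates for merging into one group, while values close to 2 indicate classes that are separable in the chosen feature space.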

#### *4.3. Evaluation of the Quality of Airborne Data, Field Measurements and Segmentation*

As the airborne data were acquired in 2013 and the later field campaign was conducted in 2015, it was difficult to estimate whether a tree had been over five meters tall two years earlier. Thus, some of the trees measured in 2015 might have been left undetected by the segmentation algorithm. In addition, some of the species (e.g., *Acacia mearnsii*) grow in dense bush-like formations, which were problematic for the segmentation algorithm, as the crowns were difficult to separate even in the visual interpretation of the CHM. For example, one of the segmented tree crowns contained six *Acacia mearnsii* field measurements. Generally, isolated trees on farmland were easier for the segmentation algorithm, but naturally growing trees with tightly knit canopy structure were challenging, which underlines the difficulty of accurate tree crown segmentation in tropical areas with dense canopies [11,57]. Furthermore, the positional accuracy was low in some areas due to the mountainous nature of the study area and the availability of only one GNSS base station. Removing all field measurements with positional accuracy worse than four meters helped, but some of the field measurements might still have been matched with the wrong tree crown.
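The filtering and matching step described above can be sketched as follows. The sketch assumes that each field measurement carries an estimated positional error in meters, that "accuracy worse than four meters" means an error above 4 m, and that crown segments are available as simple polygons in the same coordinate system. All names and the data layout are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_points_to_crowns(points, crowns, max_error_m=4.0):
    """Drop field points with a positional error above the threshold
    and assign each remaining point to the first crown containing it."""
    matches = {crown_id: [] for crown_id in crowns}
    for point in points:
        if point["error_m"] > max_error_m:
            continue  # position too uncertain to trust the match
        for crown_id, polygon in crowns.items():
            if point_in_polygon(point["x"], point["y"], polygon):
                matches[crown_id].append(point["species"])
                break
    return matches


# Hypothetical crown segments and field measurements.
crowns = {
    "crown_1": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "crown_2": [(20, 0), (30, 0), (30, 10), (20, 10)],
}
points = [
    {"x": 5, "y": 5, "error_m": 1.0, "species": "Acacia mearnsii"},
    {"x": 6, "y": 6, "error_m": 2.0, "species": "Acacia mearnsii"},
    {"x": 25, "y": 5, "error_m": 6.0, "species": "Eucalyptus spp."},
]
matches = match_points_to_crowns(points, crowns)
# Crowns holding several field measurements hint at under-segmentation.
crowded = {cid: sp for cid, sp in matches.items() if len(sp) > 1}
```

Listing crowns that contain more than one field measurement is a simple way to flag segments that may cover several trees, such as the dense *Acacia mearnsii* stands mentioned above.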
