*2.4. Classification Process and Accuracy Assessment*

One of our goals was to analyze the impact of the number of pixels in the training data sets on classification accuracy; hence, we created training data sets with a set number of pixels per class. Using stratified random sampling, 50% of all reference polygons were selected to create a training test data set, while the remaining polygons were used to create a validation data set.

The training test data set was used to create subsets (training data sets with a set number of pixels per class), and all remaining samples formed the test data set used for preliminary accuracy assessment. The validation data set was created to eliminate spatial autocorrelation with the training test data set (otherwise, randomly selected pixels from the same polygons could end up in both the training and validation sets). This allowed us to create a spatially independent and stable validation data set, which was used to assess the final results.

To investigate the influence of training data set size on classification results, the training test data set was sub-sampled to create training data sets containing exactly 30, 50, 100, 200, and 300 pixels per class; these sub-sampled data sets were used for classifier training. If a given class had fewer available samples than required, random sampling with replacement was used; otherwise, random sampling without replacement was employed. If all available pixels for a given class were selected for training purposes, a copy of the training data for this class was used instead.
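The per-class sampling rule above can be sketched as follows; the function, class names, and pixel pools are illustrative, not the study's actual code:

```python
import numpy as np

def subsample_per_class(pixels_by_class, n_per_class, rng):
    """Draw exactly n_per_class training pixels for each class:
    without replacement when enough pixels are available, with
    replacement otherwise (the fallback described above)."""
    training = {}
    for cls, pixels in pixels_by_class.items():
        replace = len(pixels) < n_per_class
        chosen = rng.choice(len(pixels), size=n_per_class, replace=replace)
        training[cls] = [pixels[i] for i in chosen]
    return training

rng = np.random.default_rng(42)
# Toy pixel pools: the second class is smaller than the requested size,
# so it is sampled with replacement.
pools = {"species_a": list(range(500)), "shadows": list(range(20))}
train = subsample_per_class(pools, 30, rng)
```

Every class ends up with the same number of training pixels, which keeps the training sets balanced regardless of how rare a class is in the reference data.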

An iterative accuracy assessment was used to objectively compare the achieved results. The procedure consisted of the following steps, repeated 100 times:


Pixel classification was carried out using the Support Vector Machine (SVM) and Random Forest (RF) algorithms in R software. The first stage of the training process was to optimize the learning parameters of these algorithms in order to obtain the best possible settings; this was done on the training and test sets before their division. A radial basis function kernel was chosen for the SVM algorithm because of its proven efficiency [43] and fewer computational difficulties [44]. The learning parameters of the compared classification algorithms were subjected to a tuning process: a gamma value of 0.1 and a cost of 1000 were obtained for the SVM algorithm. For the Random Forest algorithm, on the basis of out-of-bag (OOB) error analysis, the mtry parameter (the number of features randomly sampled at each split) was set to 140 for classification on the 430 hyperspectral bands and to 13 for classification on the set of the first 30 MNF transformation bands. In both cases, the number of random trees (ntree) was set to 500.
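The study performed this tuning in R; as a rough analogue, the sketch below reproduces the same two steps with scikit-learn in Python: grid-searching gamma and cost (C) for an RBF-kernel SVM, and choosing mtry (here `max_features`) by OOB error with 500 trees. The data, grids, and variable names are illustrative only, not the study's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in for the training pixels (30 features, 3 classes).
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=3, random_state=0)

# RBF-kernel SVM: grid-search gamma and cost, as in the tuning step.
svm_grid = GridSearchCV(SVC(kernel="rbf"),
                        {"gamma": [0.01, 0.1, 1.0], "C": [10, 100, 1000]},
                        cv=3)
svm_grid.fit(X, y)

# Random Forest: mtry corresponds to max_features; the OOB error
# (1 - oob_score_) guides its choice, with ntree = 500 trees.
oob_errors = {}
for mtry in (5, 13, 20):
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                oob_score=True, random_state=0).fit(X, y)
    oob_errors[mtry] = 1.0 - rf.oob_score_
best_mtry = min(oob_errors, key=oob_errors.get)
```

The OOB error is attractive for this purpose because it is computed from trees that did not see each sample, so no separate hold-out set is consumed during tuning.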

In this work, we compared two classification algorithms (SVM and RF), two data sets (the 430 original hyperspectral bands, 430 HS, and the first 30 Minimum Noise Fraction bands, 30 MNF), and five sample sizes per class in the training data set (30, 50, 100, 200, and 300 pixels). Due to the unavailability of larger continuous areas of invasive plants in our study area, we limited the analysis to 300 pixels. All combinations of the above parameters were tested, resulting in 20 different classification scenarios.

The accuracy of classifier training was assessed on the test data set and on data spatially separated from the training and test sets (i.e., the pixels of the validation data set), which was kept constant across all scenarios. The algorithms were compared, and the best combination of image data set and classifier was determined based on validation performance. The following accuracy parameters were calculated on the basis of the error matrix:

*Remote Sens.* **2020**, *12*, 516


$$F1 = \frac{2 \cdot P \cdot R}{P + R}\tag{1}$$
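Assuming P and R in Equation (1) denote per-class precision and recall, as is standard for the F1 score, these error-matrix measures can be sketched as follows; the labels and class names are made up, not the study's data:

```python
import numpy as np

def error_matrix_scores(y_true, y_pred, classes):
    """Overall accuracy plus per-class precision (P), recall (R),
    and F1 = 2PR/(P+R) computed from an error matrix."""
    k = len(classes)
    idx = {c: i for i, c in enumerate(classes)}
    cm = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1          # rows: reference, cols: predicted
    recall = np.diag(cm) / cm.sum(axis=1)     # producer's accuracy
    precision = np.diag(cm) / cm.sum(axis=0)  # user's accuracy
    f1 = 2 * precision * recall / (precision + recall)
    overall = np.trace(cm) / cm.sum()
    return cm, overall, precision, recall, f1

y_true = ["a", "a", "b", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c", "c"]
cm, oa, p, r, f1 = error_matrix_scores(y_true, y_pred, ["a", "b", "c"])
```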

Afterwards, the best models for each classifier and data set were selected on the basis of the mean F1 score across all classes (based on the validation data), and the images were classified. The statistical significance of differences between the accuracies of the models was checked using the Mann–Whitney–Wilcoxon test [49] (significance level = 0.05), which is well suited for testing differences between non-normally distributed populations [26,50]. Distributions of the achieved accuracy measures for all classification scenarios were visualized using box plots. A detailed explanation of the boxes used in the box plots is shown in Figure 4.
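The significance test can be illustrated with a small sketch using the normal approximation to the Mann–Whitney–Wilcoxon U statistic (no tie correction, which is adequate for continuous accuracy scores); the two samples of per-iteration F1 scores are synthetic:

```python
import math
import numpy as np

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney-Wilcoxon test via the normal
    approximation; assumes continuous data with no ties."""
    n1, n2 = len(x), len(y)
    combined = np.concatenate([x, y])
    ranks = np.empty(n1 + n2)
    ranks[combined.argsort()] = np.arange(1, n1 + n2 + 1)
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p

rng = np.random.default_rng(0)
# Hypothetical mean F1 scores from 100 iterations of two scenarios.
f1_scenario_a = rng.normal(0.85, 0.03, 100)
f1_scenario_b = rng.normal(0.80, 0.04, 100)
u, p = mann_whitney_u(f1_scenario_a, f1_scenario_b)
significant = p < 0.05  # significance level used in the study
```

Because the test compares ranks rather than raw values, it does not assume the 100 per-iteration accuracy scores are normally distributed.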

**Figure 4.** Explanation of structural elements of boxes used in box plot.

Moreover, classifier training was performed on nine classes, each with an identical number of training samples, to reduce any effect of unbalanced training data. After classifier training, the background classes were treated as a single class relative to the plant classes. These steps allowed us to properly assess classification quality (i.e., which classes are confused with which) and helped us achieve the most accurate results. In our work, we assumed that confusion among background classes was acceptable, while confusion between plant species and background classes, or between plant species, would be a concern that would need to be addressed and reported. When classifying plant species, it is important to provide a suitable and representative sample of pixels characterizing objects other than the object of study; such classes are often referred to as background classes. Since our study aimed to investigate the influence of training data set size, it would have been insufficient to classify only four classes, i.e., three plant species and one combined background class. This is mainly due to the difficulty of randomly sampling the background in such a way that, for example, 30 pixels would represent all background objects; such an approach would almost guarantee that background classes covering relatively small areas would not be represented in the training data set with a sufficient number of samples, undermining the credibility of the analysis. To address this issue, when creating the training data set, each background class (shadows, trees, other plants, soils, and buildings) was given the same number of training samples as each plant species class, ensuring that the background classes had representation similar to that of the plant species classes during classifier training.
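The post-training merging of the five background classes into one might be sketched as follows; the species labels are placeholders, not the study's actual classes:

```python
# The five background classes listed in the text are merged into a
# single "background" label after classification, before accuracy
# assessment against the plant species classes.
BACKGROUND = {"shadows", "trees", "other plants", "soils", "buildings"}

def merge_background(labels):
    """Relabel any background class as 'background'; species labels
    pass through unchanged."""
    return ["background" if lab in BACKGROUND else lab for lab in labels]

predicted = ["trees", "species_a", "soils", "species_b"]
merged = merge_background(predicted)
```

Training on the nine separate classes but reporting on the merged background keeps the error matrix informative during training while making confusion among background classes invisible (and harmless) in the final assessment.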
