*3.2. Taxonomic Assignment*

The collected specimens were initially identified from their key morphological characters. These identifications were then evaluated against the NCBI GenBank nucleotide database using the NCBI BLASTn tool.

Overall, we obtained 99 barcodes from 36 specimens and 13 species in the present study. In addition, we retrieved 18 barcodes from 18 specimens and 11 species from NCBI GenBank, based on records from previous studies of the flora of the UAE. Altogether, the dataset comprised 117 barcodes from 54 specimens and 20 species, viz., rbcL (*n* = 49), matK (*n* = 38), and ITS2 (*n* = 30).

These barcode datasets were further analyzed using the unsupervised OTU picking methods ABGD and ASAP. Using the JC69 and K80 metrics, ABGD recognized only 10 to 16 species groups. In addition, the initial partitions resolved species less accurately than the recursive partitions, so only the recursive partitions were considered further. For rbcL, 6 partitions were recognized, of which the fifth recursive partition correctly resolved 28 specimens and 7 species (a priori intraspecific divergence *P* = 0.0077, relative gap width X = 1.0) (Figure 3a and Table 1). For matK, 9 partitions were recognized, of which the eighth recursive partition successfully resolved 29 specimens and 9 species (*P* = 0.035, X = 1) (Figure 3b and Table 1). For ITS2, 10 partitions were recognized, of which the seventh recursive partition resolved 29 specimens and 10 species (*P* = 0.0215, X = 1) (Figure 3c and Table 1). The simple distance metric showed lower accuracy than JC69 and K80 and was therefore not considered.
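The three distance metrics compared above differ only in how they correct the observed proportion of differing sites. The following is a minimal sketch, not the ABGD implementation: the function names are hypothetical, the input is assumed to be a pair of pre-aligned sequences of equal length, and sites containing gaps or ambiguous bases are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_counts(seq1, seq2):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n = ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases
        n += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ts += 1  # A<->G or C<->T
        else:
            tv += 1  # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return n, ts, tv


def p_distance(seq1, seq2):
    """Simple (uncorrected) distance: proportion of differing sites."""
    n, ts, tv = site_counts(seq1, seq2)
    return (ts + tv) / n


def jc69_distance(seq1, seq2):
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq1, seq2)
    return -0.75 * math.log(1 - 4 * p / 3)


def k80_distance(seq1, seq2):
    """Kimura two-parameter correction using transition (P) and transversion (Q) proportions."""
    n, ts, tv = site_counts(seq1, seq2)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

Both corrections return values slightly larger than the simple distance for divergent sequences, because they account for multiple substitutions at the same site; for very similar barcodes the three metrics are nearly identical.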

**Figure 3.** Taxonomic evaluation using unsupervised (ABGD and ASAP) and supervised learning (SVM) methods. (**a**) rbcL maximum likelihood (ML) phylogeny inferred using the Kimura-2-parameter (K2P) model and a discrete gamma distribution with 100 bootstrap replicates. (**b**) matK ML phylogeny inferred using the General Time Reversible model and a discrete gamma distribution with 1000 bootstrap replicates. (**c**) ITS2 ML phylogeny inferred using the K2P model along with a discrete gamma distribution and invariant sites with 1000 bootstrap replicates.

**Table 1.** Summary of species identification using unsupervised and supervised learning methods.


As seen in the ABGD analysis, the algorithm identified several species partitions for each a priori value (*P*), which may introduce uncertainty into the results [50]. Therefore, an integrative taxonomic approach is recommended to evaluate the relevance of the ABGD partitions [50]. Thus, the species assignment was further validated using ASAP, followed by the supervised machine learning approach. ASAP provides a gap-width score, a *p*-value, a threshold distance (dT), and the number of species for each defined partition, and thus overcomes the need for the a priori value required by ABGD. The partitions could then be prioritized by considering the smallest ASAP score and the asterisk marks that represent the overall best scores.
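The prioritization rule described above can be illustrated with a short sketch. It assumes that the ASAP score of a partition is the average of two ranks, one by *p*-value (smaller is better) and one by relative gap width (larger is better), so that the lowest score marks the preferred partition. The `Partition` class, the function names, and all numeric values in the usage note are hypothetical and are not taken from the present datasets or from the ASAP software.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    n_species: int    # number of species (Nb) in the partition
    p_value: float    # p-value reported for the partition
    gap_width: float  # relative barcode gap width
    threshold: float  # threshold distance (dT)


def rank(values, reverse=False):
    """Return 1-based ranks; rank 1 is the smallest value (largest if reverse)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def asap_like_scores(partitions):
    """Average the p-value rank and the gap-width rank (assumed scoring rule)."""
    p_ranks = rank([p.p_value for p in partitions])
    w_ranks = rank([p.gap_width for p in partitions], reverse=True)
    return [(a + b) / 2 for a, b in zip(p_ranks, w_ranks)]


def best_partition(partitions):
    """Return the partition with the smallest score, together with that score."""
    scores = asap_like_scores(partitions)
    best = min(range(len(partitions)), key=lambda i: scores[i])
    return partitions[best], scores[best]
```

For example, with three made-up partitions `Partition(12, 0.20, 0.001, 0.05)`, `Partition(10, 0.01, 0.004, 0.02)`, and `Partition(15, 0.05, 0.003, 0.01)`, the second ranks first on both criteria and is returned by `best_partition`. Ties in either criterion are broken by input order in this sketch.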

Accordingly, the partitions with the highest species resolution were found for the matK and ITS2 datasets at threshold distances of 0.029 and 0.0134, respectively (Figure 4a,b). In the matK dataset, 29 specimens and 9 species were resolved, while in the ITS2 dataset, 29 specimens and 10 species were resolved successfully (Figure 4a,b and Table 1). For the rbcL dataset, however, the second successive partition, at a threshold distance of 0.0045 and with a lower ASAP score, was found to be the best partition, showing a higher resolution (Figure 4c). It accurately discriminated 33 specimens and 9 species and was thus taken into consideration (Figure 3a and Table 1).
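To make the role of the threshold distance concrete, the sketch below groups specimens into putative species by linking any two specimens whose pairwise distance does not exceed a given threshold (single-linkage grouping with a union-find structure). This is an illustrative simplification and not the ASAP algorithm itself; the function name, the specimen labels, and the distances in the usage note are hypothetical.

```python
def partition_at_threshold(names, dist, threshold):
    """Group specimens connected by pairwise distances <= threshold.

    names: list of specimen labels
    dist: dict mapping (label_a, label_b) to a pairwise distance
    """
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the group root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

With four made-up specimens where `s1`/`s2` and `s3`/`s4` differ by less than 0.0045 and all other pairs differ by about 0.02, a threshold of 0.0045 yields two groups, a much smaller threshold yields four singletons, and a threshold above 0.02 merges everything into one group. This is why the choice of threshold distance directly determines the number of species recovered.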

**Figure 4.** Threshold distance ranking the best partition for species delimitation. (**a**) matK 2nd partition (ASAP score = 2.50, number of species (Nb) = 13, *p* = 0.0027); (**b**) ITS2 2nd partition (ASAP score = 2.50, Nb = 11, *p* = 0.042); (**c**) rbcL 2nd partition (ASAP score = 4.50, Nb = 18, *p* = 0.0052).

Following the unsupervised approach, the supervised machine learning (SML) analysis exhibited the highest species resolution for all three markers, rbcL, matK, and ITS2. SML resolved 39 specimens and 10 species in rbcL, 34 specimens and 10 species in matK, and 29 specimens and 10 species in ITS2 (Figure 3a–c and Table 1).

Overall, in the rbcL dataset, the ASAP and SVM methods successfully differentiated *Paronychia arabica* from *Sclerocephalus arabicus* and resolved the *Suaeda* genus (*Suaeda aegyptiaca* from *Suaeda vermiculata*), whereas *Haloxylon persicum* and *Haloxylon salicornicum* could not be discriminated by any of the three methods (ABGD, ASAP, and SVM) (Figure 3a). Moreover, in the rbcL dataset, ASAP alone was able to differentiate the *Amaranthus* genus (*Amaranthus viridis* and *Amaranthus hybridus*), while SVM alone was able to delimit the *Calligonum* genus (*Calligonum crinitum* and *Calligonum comosum*) (Figure 3a).

In the matK dataset, *Suaeda aegyptiaca* and *Suaeda vermiculata* were resolved only by SVM (Figure 3b), while in the ITS2 dataset, both species were accurately differentiated by all three methods (ABGD, ASAP, and SVM) (Figure 3c). Altogether, 17 of the 20 barcoded species were successfully resolved using the rbcL, matK, and ITS2 markers, although the matK and ITS2 datasets lacked sufficient representation of all 20 species.
