**3. Discussion**

In this study, we present the complete DNA sequence of the mitochondrial genome of *F. sylvatica* L. (Figure 1; GenBank MT446430), a deciduous tree species in the Fagaceae family. This sequence is the first complete mitochondrial genome sequence for the genus *Fagus* and the third one for the order Fagales, together with mtDNA sequences of *Quercus variabilis* (GenBank MN199236; unverified) [9] and *Betula pendula* (LT855379.1; not annotated) [10]. The mitochondrial genome sequence of the *F. sylvatica* individual FASYL\_29 represents the second extranuclear genome sequence of this individual, in addition to the already published chloroplast genome sequence (NC\_041437.1) [4].

Although short reads (Illumina MiSeq reads of 2 × 300 bp) were used in this study, the complete mitochondrial genome sequence could be combined from two large contigs of the initial assembly. The assembled sequence has been successfully validated by mapping of nanopore MinION long reads (Figure S1). Because short reads encounter numerous difficulties due to low-complexity homopolymeric sequence characteristics and the potential presence of large repeat regions in mitochondrial genomes [12,17,19,22,26,42], long read sequencing is increasingly applied (often in addition to short read sequencing) for subsequent assemblies of mitochondrial genome sequences (e.g., [57–59]).

At 504,715 bp, the *F. sylvatica* mitochondrial genome assembly is intermediate in size between the mitochondrial genomes of *Quercus variabilis*, another member of the Fagaceae family (412,886 bp; GenBank MN199236) [9], and *Betula pendula* (Betulaceae; 581,505 bp; LT855379.1) [10].

Mitochondrial genomes of flowering plants are well known for their large size, fluid genome structure, and variable coding-gene set often due to horizontal gene transfer; e.g., chloroplast and nuclear sequences have been found in mitochondrial genomes or vice versa [11–13,18,19,22–28,56]. These are ongoing processes in plants. Most of the transfers in angiosperms involve ribosomal protein genes [60]. Thus, it is not unexpected that seven genes (*rps2*, *rps7*, *rps10*, *rps11*, *rps13*, *rpl2*, *rpl15*) of the ribosomal genes belonging to the ancestral gene content of the mitochondrial genome of flowering plants [18] were not annotated in *F. sylvatica* (Figure 1). Five of these seven genes (with the exception of *rps10* and *rpl2*) are also missing in the mitochondrial genome of *Quercus variabilis* (MN199236) [9]. The missing genes *rps2* and *rps11* are also lacking in the mtDNA of *Ricinus communis* [61], *Hevea brasiliensis* [62], and *Populus tremula* [44,63] among others. The absence of the *rps13* gene from mitochondrial genomes has been shown for many members of the rosids subclass [60] including *F. sylvatica* (in this study). The ribosomal gene *rps10* missing in *F. sylvatica* is also missing e.g., in *Populus tremula* [44] and *Hevea brasiliensis* [62], but present in *Ricinus communis* [61]. A loss of *rps7* was also reported in ancestors of the Fabaceae family and of *rpl2* in some Fabaceae species [64]. Although the *rpl5* gene is lacking from many of the sequenced plant mitochondrial genomes [65], it is annotated in *F. sylvatica* (MT446430). The two respiratory genes—*sdh3* and *sdh4* (encoding subunits 3 and 4 of succinate dehydrogenase)—that have been reported to be lost from the mitochondrial genome of various angiosperms [66], were annotated in *F. sylvatica* (MT446430).
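The gene losses listed above can be tabulated to make the overlaps between species explicit. The following is a minimal sketch that encodes only the losses mentioned in this section; the dictionary `reported_missing` and the helper `species_lacking` are illustrative names, and the sets are deliberately partial rather than complete gene inventories of the cited genomes.

```python
# Ribosomal protein gene losses as stated in this section only.
# These sets are partial: they record what the text mentions, not the
# full gene content of each mitochondrial genome.
reported_missing = {
    "Fagus sylvatica": {"rps2", "rps7", "rps10", "rps11", "rps13", "rpl2", "rpl15"},
    "Quercus variabilis": {"rps2", "rps7", "rps11", "rps13", "rpl15"},
    "Ricinus communis": {"rps2", "rps11"},
    "Hevea brasiliensis": {"rps2", "rps10", "rps11"},
    "Populus tremula": {"rps2", "rps10", "rps11"},
}


def species_lacking(gene):
    """Return the species (sorted by name) for which the text reports `gene` as missing."""
    return sorted(sp for sp, genes in reported_missing.items() if gene in genes)


# Genes reported as missing in both Fagaceae genomes, and in F. sylvatica only.
shared_fagaceae = reported_missing["Fagus sylvatica"] & reported_missing["Quercus variabilis"]
only_fagus = reported_missing["Fagus sylvatica"] - reported_missing["Quercus variabilis"]

print("Missing in both Fagaceae:", sorted(shared_fagaceae))
print("Missing in F. sylvatica only:", sorted(only_fagus))
print("Species lacking rps10:", species_lacking("rps10"))
```

Set intersection and difference reproduce the statement that five of the seven genes are missing from both Fagaceae genomes, with *rps10* and *rpl2* as the exceptions.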

Plant mitochondrial genomes have abundant interspersed repeats [12,17,19,22,26,42] often including pairs of large repeats which cause isomerization of the genome by recombination, and numerous repeats of up to several hundred base pairs that recombine only when the genome is stressed by DNA damaging agents or mutations in DNA repair pathway genes [19]. In general, the largest repeats within a species (in angiosperms often longer than about 1 kb) have been found to recombine constitutively, leading to isomerization [19]. The longest interspersed repeat in the mtDNA of *F. sylvatica* is about 1 kb (918 bp in size; Table S1) and may be responsible for isomerization. Whereas the longest repeat in the mtDNA of *Quercus variabilis* (another Fagaceae member) is 17.3 kb in size, the longest *Betula pendula* repeat is only 474 bp (Table S1). By comparing mtDNA sequences of 72 angiosperm species, Wynn and Christensen [19] found that only 43% of the species have repeats longer than 10 kb.

The dynamic nature of mitochondrial genomes in the Fagales is also reflected by a gene order comparison between *F. sylvatica* and *Quercus variabilis* (Figure S3) which both are members of the Fagaceae family. Although some small collinear gene clusters inferred by Richardson et al. [18] as ancestral angiosperm gene clusters in *Liriodendron tulipifera* were also identified in the mitochondrial genomes of *F. sylvatica* and *Quercus variabilis*, no larger syntenic gene groups could be identified. Interestingly, two common gene clusters—not present in *Liriodendron tulipifera*—were identified in *F. sylvatica* and *Quercus variabilis*: the clusters *ccmB*/*rpl10* and *cox1*/*sdh3*. Whether the *ccmB*/*rpl10*-cluster, which was also identified in *Betula pendula* (Figure S4), is a common cluster of all Fagales remains an open question for future research.

Plant mitochondria employ distinct and complex RNA metabolic mechanisms including RNA editing, splicing of group I and group II introns, maturation of transcript ends, and RNA degradation (reviewed in [34]). RNA editing (in the form of C→U base transitions) is a post-transcriptional process which is highly prevalent in mitochondria and chloroplasts of land plants [67]. Numerous C→U conversions (and in some plants also U→C) alter the coding sequences of many transcripts of the organellar genomes, for example by eliminating premature stop codons or creating AUG start sites, as also shown in this study for the start sites of *nad4L*, *cox1*, and *nad1* (Figure S2). The start codon of *cox1* is also generated by RNA editing in other land plants, e.g., *Liriodendron tulipifera*, *Nelumbo nucifera*, *Nicotiana tabacum* [68], and *Solanum tuberosum* [69]. The start codons of *nad1* and *nad4L* are also created by RNA editing in *Allium cepa*, *Cucumis sativus*, *Glycine max*, *Gossypium hirsutum*, *Liriodendron tulipifera*, *Nelumbo nucifera*, *Oryza sativa*, *Phoenix dactylifera*, and *Zea mays* [68], among others. In general, non-synonymous RNA editing sites were shown to be particularly highly conserved across different plant species ([68,70] among others).
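How a single C→U edit can create a start codon is illustrated by the following minimal sketch, in which an ACG triplet at the 5′ end of a transcript becomes AUG. The function name `apply_c_to_u_editing` and the toy transcript are hypothetical and chosen for illustration only; they do not represent sequences or edit positions from *F. sylvatica*.

```python
def apply_c_to_u_editing(transcript, edit_sites):
    """Return the RNA sequence with C->U edits applied at 0-based positions.

    DNA-style input (T) is converted to RNA (U) first. A ValueError is
    raised if a requested position does not carry a C.
    """
    bases = list(transcript.upper().replace("T", "U"))
    for pos in edit_sites:
        if bases[pos] != "C":
            raise ValueError(f"position {pos} is {bases[pos]}, not C")
        bases[pos] = "U"
    return "".join(bases)


# Hypothetical toy transcript; NOT a real F. sylvatica sequence.
unedited = "ACGGCUCAAUAA"
edited = apply_c_to_u_editing(unedited, [1])

print(unedited[:3], "->", edited[:3])
```

Only the second position of the first codon is changed, so the remainder of the reading frame is unaffected.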

Aiming at the development of mitochondrial genetic markers suitable to identify *Fagus* species from potential mixtures of different tree species in wood composite products, we sought SNPs specific for *Fagus*, Fagaceae, and Fagales in this study. In contrast to other studies that focused on classical plant barcoding regions (e.g., [71–73]), we followed a strategy similar to super-barcoding [45]; however, we did not consider the entire mitochondrial genome but included all mitochondrial genes common to the tree species used for marker development. Because of the highly dynamic structure of mitochondrial genomes of angiosperms, alignments of complete mitochondrial genome sequences make sense only in very closely related individuals. Recombination activities involving repeated sequences may generate subgenomic forms and extensive structural variation of angiosperm mitochondrial genomes even within the same species [11,12,14,15,17,19,26,29–34].

The development of the SNPtax tool allowed us to select SNP markers potentially specific for different pre-defined taxa based on alignments of DNA sequences of mitochondrial genes (also considering intron-containing genes but excluding trans-spliced genes). The screen for taxon-specific SNPs in conserved genic regions allows considering a broad taxonomic range during the initial SNP selection and also during marker validation because primers can be designed that amplify the region of interest in tree species of various families. The developed CAPS markers (Table 1, Figure 3) are specific for the taxa *Fagus*, Fagaceae, or Fagales, respectively, when considering the tree individuals and related species (59–63 species from about 15 families and 10 orders) included in the entire validation for each marker (see also Table S5). All CAPS markers (Table 1) are located in exonic regions of the related genes with the exception of the marker 4\_Fagaceae\_*nad7* that is based on a Fagaceae-specific SNP in intron 2 of the *nad7* gene. An intron of the *nad7* gene (fourth intron region) was also considered in a study aiming at the identification of medicinal plants [74].
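The general idea of screening an alignment for taxon-specific SNPs can be sketched as follows. This is not the SNPtax implementation; it is a simplified illustration under the assumption that a position counts as taxon-specific when all sequences of the target taxon share one base that no other sequence carries. The function name `taxon_specific_snps`, the sequence identifiers, and the toy alignment are hypothetical.

```python
def taxon_specific_snps(alignment, target_ids):
    """Return (column, base) pairs that are specific to the target taxon.

    `alignment` maps sequence IDs to aligned sequences of equal length.
    A column qualifies if all target sequences share one base and no
    non-target sequence carries it. Columns with gaps or ambiguous
    bases are skipped. Columns are 0-based.
    """
    target_ids = set(target_ids)
    others = [sid for sid in alignment if sid not in target_ids]
    if not target_ids or not others:
        raise ValueError("need at least one target and one non-target sequence")
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("sequences must be aligned to equal length")

    hits = []
    for col in range(length):
        column = {sid: seq[col].upper() for sid, seq in alignment.items()}
        if any(base not in "ACGT" for base in column.values()):
            continue
        target_bases = {column[sid] for sid in target_ids}
        other_bases = {column[sid] for sid in others}
        if len(target_bases) == 1 and not (target_bases & other_bases):
            hits.append((col, next(iter(target_bases))))
    return hits


# Hypothetical toy alignment of one gene region; not real data.
alignment = {
    "Fagus_1": "ATGCCGTA",
    "Fagus_2": "ATGCCGTA",
    "Quercus_1": "ATGTCGTA",
    "Betula_1": "ATGTCATA",
}

fagus_snps = taxon_specific_snps(alignment, {"Fagus_1", "Fagus_2"})
fagaceae_snps = taxon_specific_snps(alignment, {"Fagus_1", "Fagus_2", "Quercus_1"})
print("Fagus-specific:", fagus_snps)
print("Fagaceae-specific:", fagaceae_snps)
```

Calling the function with nested target sets, as above, mirrors the screening at several taxonomic levels. A candidate found in this way would still have to overlap a restriction site to be converted into a CAPS marker, and its specificity would depend entirely on the species included in the alignment.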

Further validation of the CAPS markers developed in this study is necessary to prove their taxon-specificity also in extended sets of tree individuals from various species, especially if they should be applied for taxon identification against a broader species background than the potential species spectrum of wood composite products. In particular, the two potentially *Fagus*-specific CAPS markers should be further validated with other *Fagus* species besides the five *Fagus* species included in this study.

Molecular markers for taxon assignment within the Fagaceae were also developed in previous studies. For species identification among common tree species of the Alps, Brunner et al. [75] developed CAPS markers based on SNPs in the intron of the plastid gene *trnL* (UAA). One of the markers allows for differentiating *F. sylvatica* from the 21 other tree species tested in that study. Because no other *Fagus* species were analyzed, it is unclear if the marker is specific only for *F. sylvatica* or also for other *Fagus* species [75]. Unfortunately, applying this marker to highly degraded DNA from processed wood products is probably not feasible because the amplicon is too large for this purpose. In another study, microsatellite primers were developed for the endangered beech tree species *Fagus hayatae* [76]. Recently, a set of 58 SNPs has been selected from coding regions and applied for species discrimination among European white oaks [77]. Different types of molecular markers for DNA profiling of *Quercus* spp. or *Quercus* species groups were developed in other studies, e.g., based on plastid SNPs and InDels [78], short tandem repeat loci [79], or inter-primer binding sites [80].

Recent advances in real-time nanopore sequencing could pave the way to species identification using genome-scale data in the future, as shown in a field-based study of closely related *Arabidopsis* species [81].
