*3.2. Read Preprocessing for De Novo Assembly*

Before assembly, we downsampled the reads (reduced their number) to lower the CPU and RAM requirements of the assembly. For the PacBio data used for the de novo assembly, the k-mer frequency distribution indicated that the full read set provides approximately 800× coverage of the mitochondrial genome. We therefore downsampled in two stages. First, using SeqTk v1.3 (https://github.com/lh3/seqtk), we randomly picked 253,638 reads (command "seqtk 0.1"). Second, with the command "kmer\_filter" from a modified version (see below) of Stacks 2.5 [45], we removed all reads whose k-mers have a median copy number below 5, which removes the majority of reads corresponding to single-copy nuclear regions. This operation reduced the number of reads approximately twofold, from 253,638 to 133,619. By default, Stacks removes reads in which 80% of k-mers have a copy number below a given threshold. We slightly changed its source code so that a read is discarded when 50% of its k-mers are low-copy, which means that our version of Stacks effectively uses the median k-mer copy number of a read as the criterion.
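
The filtering criterion described above can be illustrated with a minimal Python sketch. It is not the Stacks implementation: the function names, the toy k-mer size, the forward-strand-only counting, and the choice to discard a read when *at least* the given fraction of its k-mers is low-copy are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only; an illustrative simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def keep_read(read, counts, k, min_copy=5, max_low_fraction=0.5):
    """Return True if the read passes the filter.

    A k-mer is "low-copy" when its count is below min_copy. The read is
    discarded when the fraction of low-copy k-mers reaches max_low_fraction
    (assumed comparison: >=). 0.8 mimics the default described in the text,
    0.5 the modified, median-like behaviour.
    """
    copies = [counts[km] for km in kmers(read, k)]
    if not copies:  # read shorter than k: nothing to evaluate, so drop it
        return False
    low = sum(1 for c in copies if c < min_copy)
    return low / len(copies) < max_low_fraction


def filter_reads(reads, k, min_copy=5, max_low_fraction=0.5):
    """Count k-mers over all reads, then keep only the reads that pass."""
    counts = count_kmers(reads, k)
    return [r for r in reads if keep_read(r, counts, k, min_copy, max_low_fraction)]
```

Under these assumptions, a read in which four of six k-mers are low-copy is kept with `max_low_fraction=0.8` but discarded with `max_low_fraction=0.5`. With a fraction of 0.5, the discard rule is the same as requiring the lower median of the read's k-mer copy numbers to be below `min_copy`.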

For the Illumina data, we randomly picked 50–200 million paired reads. By sampling and mapping different numbers of reads (5–300 million), we found empirically that this number of reads is sufficient to provide >1000× coverage of the plastid genome and >100× coverage of the mitochondrial genome.
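
Random picking of paired reads has to keep the two mates of each pair together. The following Python sketch shows one way to do this; it is an illustration rather than the procedure actually used, and the function names, the four-lines-per-record FASTQ layout, and the seed value are assumptions.

```python
import random
from itertools import islice


def fastq_records(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    lines = iter(handle)
    while True:
        record = tuple(islice(lines, 4))
        if len(record) < 4:  # end of input (or a truncated final record)
            return
        yield record


def sample_pairs(handle_r1, handle_r2, fraction, seed=11):
    """Yield (R1, R2) record pairs, keeping each pair with probability `fraction`.

    Both files are read in lockstep, so one random draw decides the fate of
    both mates and the pairing is preserved.
    """
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    for pair in zip(fastq_records(handle_r1), fastq_records(handle_r2)):
        if rng.random() < fraction:
            yield pair
```

The handles can be open text files or any iterable of lines. Because each pair is kept independently, the number of pairs returned is close to, but not exactly, `fraction` times the input size.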
