**Organization Features of the Mitochondrial Genome of Sunflower (***Helianthus annuus* **L.) with ANN2-Type Male-Sterile Cytoplasm**

**Maksim S. Makarenko 1,2,\*, Alexander V. Usatov 1, Tatiana V. Tatarinova 2,3,4,5, Kirill V. Azarin 1, Maria D. Logacheva 2,6, Vera A. Gavrilova 7, Igor V. Kornienko 1,8 and Renate Horn <sup>9</sup>**


Received: 18 August 2019; Accepted: 19 October 2019; Published: 23 October 2019

**Abstract:** This study provides insights into the flexibility of the mitochondrial genome in sunflower (*Helianthus annuus* L.) as well as into the causes of ANN2-type cytoplasmic male sterility (CMS). *De novo* assembly of the mitochondrial genome of male-sterile HA89(ANN2) sunflower line was performed using high-throughput sequencing technologies. Analysis of CMS ANN2 mitochondrial DNA sequence revealed the following reorganization events: twelve rearrangements, seven insertions, and nine deletions. Comparisons of coding sequences from the male-sterile line with the male-fertile line identified a deletion of *orf777* and seven new transcriptionally active open reading frames (ORFs): *orf324*, *orf327*, *orf345*, *orf558*, *orf891*, *orf933*, *orf1197*. Three of these ORFs represent chimeric genes involving *atp6* (*orf1197*), *cox2* (*orf558*), and *nad6* (*orf891*). In addition, *orf558*, *orf891*, *orf1197*, as well as *orf933*, encode proteins containing membrane domain(s), making them the most likely candidate genes for CMS development in ANN2. Although the investigated CMS phenotype may be caused by simultaneous action of several candidate genes, we assume that *orf1197* plays a major role in developing male sterility in ANN2. Comparative analysis of mitogenome organization in sunflower lines representing different CMS sources also allowed identification of reorganization hot spots in the mitochondrial genome of sunflower.

**Keywords:** sunflower; cytoplasmic male sterility (CMS); mitochondrial genome; reorganizations; next generation sequencing (NGS)

#### **1. Introduction**

Low substitution rate in genes along with considerable variability in size and structure are distinct features of plant mitochondrial genomes (mitogenome) [1,2]. Reorganization events in mitochondrial DNA (mtDNA) are primarily caused by disruption of a fragile equilibrium of intramolecular recombinations, maintained by nuclear-mitochondrial genetic interactions [3,4]. Runaway recombination of mtDNA can lead to changes in gene content and expression patterns of mitochondria [5,6]. The mitogenomes of flowering plants carry genes for rRNAs, tRNA, subunits of the respiratory chain complexes, as well as genes for the ribosomal proteins (*rps* and *rpl*). Maturase-related protein gene (*matR*) and genes responsible for the biogenesis of cytochrome *c* (*ccmB*, *ccmC*, *ccmFC,* and *ccmFN*) are also part of the plant mitochondrial gene set [7]. Alterations in transcription activity of the mitochondrial genes can have profound effects on the functionality of mitochondria and, thus, on different plant traits. Among the phenotypic traits caused by mitochondrial impairments, special attention is devoted to cytoplasmic male sterility (CMS) [8]. CMS is an inability to produce or shed functional pollen, which has been described in more than 140 higher plant species [9]. As a result of mtDNA rearrangements, new open reading frames are created, leading to male sterility [10]. In turn, dominant nuclear-encoded restorer-of-fertility genes (*Rf* genes) can restore normal development of pollen. Hence CMS/Rf systems are important both for studying pollen development in plants and for commercial applications [11,12]. The existence of the CMS phenotype in plants eliminates the need for laborious, manual emasculations for a directional crossing of plants, thus promoting its utilization in hybrid breeding [13].

Comparing mtDNA configurations in sunflower is especially interesting, as more than 70 CMS sources have been described within the Food and Agriculture Organization (F.A.O.) program for sunflower [14], even though modern sunflower hybrid breeding predominantly relies on a single male-sterile cytoplasm, the so-called CMS PET1 [15]. This CMS source originated from an interspecific cross of *H. petiolaris* with *H. annuus* [16]. The molecular characterization of the CMS mechanism helps to introduce new CMS sources into breeding programs. So far, the mitogenomes of only a few CMS sources have been sufficiently characterized to be used in sunflower hybrid production. The CMS phenotype can arise spontaneously in wild populations, while in breeding lines—after wide crosses, interspecific exchange of nuclear and/or cytoplasmic genomes, and mutagenesis [11]. It has been demonstrated that some CMS sources obtained from different inter- or intraspecific crosses showed the same mechanism of male sterility formation as the CMS PET1 type [17]. Even though these CMS sources had different origins, they have the same mitochondrial genome organization indicating that some configurations may be preferentially maintained in sunflowers [17].

Less is known about the spontaneously occurring CMS sources in *Helianthus annuus* L. [14]. ANN2 was derived from wild sunflower population N517 in Texas [18]. In this population, 40% of the plants were male-sterile [19]. However, sunflower cultivars with ANN2 CMS type, developed from the N517 population by maintaining with lines like HA89 or RHA265, showed 100% male-sterile progenies, which indicates a stable mitochondrial DNA configuration and absence of heteroplasmy. ANN2 and other spontaneously occurring CMS sources like ANN1, ANN3, and ANN4 maintained by RHA265 were hardly restored [20]. None of the tested maintainer and restorer lines of CMS PET1 were able to restore pollen production in CMS ANN1 and CMS ANN3. Only 12.5% and 15.8% of all investigated lines showed restorer capacity towards CMS ANN2 and CMS ANN4, respectively, indicating very different CMS mechanisms compared to CMS PET1. Three restorer lines, Rf ANN2-PI 413178, Rf ANN2-P21, and Rf ANN2-RMAX1, carry a restorer-of-fertility gene for ANN2 [21,22]. A suppressor gene S1 overpowering the restorer gene action has been recently described by Liu et al. [23], thus making the CMS-Rf interactions in the ANN2 source even more complicated.

Previous mtDNA investigations of some spontaneously occurring CMS sources were based on Southern blot hybridizations with mitochondrial genes [24]. Hybridizations of the CMS sources ANN1, ANN2, ANN3 with *atp6*, *atp9*, *cob*, *cox1*, *cox2*, *cox3*, *rrn18*/*rrn5*/*nad5*, *orfH522*, *orfH708*, or *orfH873* as a probe revealed unique banding patterns for 4 out of the 10 probes [24]. Besides, the analyses showed that CMS ANN4 and ANN5 are very similar to each other and form a distinct group from CMS ANN1, ANN2, and ANN3 [24]. It was also shown that ANN1/ANN2/ANN3 mtDNA-type significantly differs from both the male-fertile sunflower cytoplasm and the CMS PET1 source [24].

We describe the first assembly of the CMS source ANN2, which occurred spontaneously in *Helianthus annuus* L. The current study also provides insights into the flexibility of sunflower mitochondrial genome by comparing different isonuclear male-sterile lines HA89 (ANN2, MAX1, PET1, PET2) and the male-fertile line HA89, allowing identification of hot spots for rearrangements in the sunflower mitochondrial genome. For the CMS mechanism in ANN2, new open reading frames were identified, which were transcriptionally active. The ANN2 CMS source may be interesting not only for oilseed hybrid breeding, but also for horticultural purposes, as it is difficult to restore. It is a highly desirable trait in ornamental sunflowers since pollen production is usually not required nor looked-for, except if the pollen color enhances the contrast with the florets.

#### **2. Results**

#### *2.1. Rearrangements in the Mitochondrial Genome of the Male-Sterile Line HA89(ANN2)*

We assembled the complete mitochondrial genome of the HA89 sunflower line with ANN2 cytoplasmic sterility type (NCBI accession MN175741.1). The master chromosome of HA89 (ANN2) consists of 306,018 bp (Figure 1), and it is 5071 bp longer than the mitogenome of the male-fertile isonuclear line HA89 (NCBI accession MN171345.1).

The HA89 (ANN2) mitochondrial genome has a wide range of rearrangements as compared to the male-fertile HA89 mitogenome. The summary of whole mitogenome alignment of male-sterile and male-fertile lines is presented in Table 1.


**Table 1.** Alignment of the mitochondrial genomes of HA89 and HA89(ANN2) lines.

\* genes, which had impaired sequences as a result of rearrangements.

**Figure 1.** Mitochondrial genome map of HA89(ANN2) cytoplasmic male sterility (CMS) line of sunflower. Intron containing genes are marked by an asterisk (\*) symbol. Trans-spliced genes are presented as the compilation of exons (ex).

The mitochondrial genomes of male-sterile ANN2 and male-fertile HA89 share 14 complementary regions, but their localizations and orientation differ. We showed the localization of complementary regions in a scheme with both genomes shown in linear forms in Figure 2. Since regions #1 and #14 in the case of the circular molecule represent the same region, we classified the other twelve regions (#2–#13) as rearrangements.

In most cases, rearrangements only resulted in a reversal of a gene's direction or a change in gene order. However, the 8584 bp (#7) and 21,433 bp (#8) rearrangements influenced the coding sequence of *nad6*, and the 6029 bp rearrangement (#12) impaired *atp6*. The largest part of the *nad6* gene sequence (~88%) is in the rearrangement #8, while the 3 terminal part of *nad6* lies in the rearrangement #7. As a result of the convergence of #8 and #10 rearrangements in mitochondrial DNA of HA89(ANN2), the new *nad6*-chimeric open reading frame—*orf891*—was created. Analyses of *orf891* transcription pattern gave ambiguous results. Transcripts were detected for both *nad6* (HA89) and *orf891* (HA89(ANN2)) when using primers derived from the 5 identical sequence of their mRNA (Supplementary Table S1). Nevertheless, using the same forward primer, but different reverse primers

(Supplementary Table S1), complementary to the 3 sequence of *nad6* and *orf891*, transcription was detected only for *nad6* (the fertile line), but not for *orf891* (CMS line). It is important to note that almost all the rearrangements found in mtDNA of HA89(ANN2) are accompanied by other types of genome reorganizations—deletions and insertions.

**Figure 2.** Schematic illustration of homologous regions between mitochondrial genomes of HA89 male-fertile and HA89(ANN2) CMS lines. 1—29,196 bp; 2—557 bp; 3—1245 bp; 4—77,441 bp; 5—41,702 bp; 6—4150 bp; 7—8584 bp; 8—21,433 bp 9—8158 bp; 10—24,687 bp; 11—41,505 bp; 12—6029 bp; 13—12,520 bp; 14—977 bp.

#### *2.2. Deletions and Insertions in the Mitochondrial Genome of the Male-Sterile Line HA89(ANN2)*

In comparison to the male-fertile analog, we identified nine longer than 100 bp deletions in the mtDNA of HA89(ANN2), which are shown in Table 2. Most deletions did not affect the protein-coding sequences, except for two deletions of 316 bp and 1165 bp. The 1165 bp deletion resulted in the total elimination of *orf777*, while the 316 bp deletion affected the part of the *atp6* gene. Interestingly, in previous studies, we also discovered the removal of *orf777* from the mitochondrial genomes of two other CMS lines—HA89(PET2) [25] and HA89(MAX1) [26].


Seven longer than 100 bp insertions were detected in mtDNA of the HA89(ANN2) CMS line (Table 3). As a result of these insertions in the mitochondrial DNA of HA89(ANN2), five new open reading frames, namely *orf324*, *orf327*, *orf345*, *orf558*, and *orf933*, have appeared. All five ORFs are transcribed in the case of ANN2, contrary to the HA89 line.

A search for transmembrane domains (TDs) revealed that the protein encoded by *orf558* contained a single TD. In the case of *orf933*, two TDs were detected. The *orf933* encoded protein did not show homology to other sunflower proteins in GeneBank, and had only limited similarity (40–60 amino acids) to hypothetical mitochondrial proteins with unknown functions in *Lactuca sativa* (accession PLY70338.1), *Salvia miltiorrhiza* (accession YP\_008992338.1), *Beta vulgaris* (accession CBJ23356.1), etc. Forty-six amino acids of the N-terminus of the protein encoded by *orf558* matched the N-terminus of cytochrome c oxidase subunit 2 (*cox2* gene). Moreover, most of the amino acids that form the

transmembrane domain in *orf558* protein are identical to those in COX2. However, the sunflower cytochrome c oxidase subunit 2 has two TD and the protein encoded by *orf558*—only a single one (Figure 3). So the *orf558* represents a chimeric *cox2* gene and could potentially play a role in the ANN2 CMS phenotype development.


**Table 3.** Localization of insertions (>100 bp) in the mitochondrial genome of HA89(ANN2) CMS line.

\* appeared as the result of several simultaneous reorganizations of mtDNA structure.

**Figure 3.** Comparison of transmembrane domains of proteins encoded by *cox2* (**A**) and *orf558* (**B**).

The most complex among the discovered ORFs in the HA89(ANN2) mitogenome was *orf1197*, which has appeared from three simultaneous reorganization events involving the 316 bp deletion, the 430 bp insertion, and the 6029 bp rearrangement. The *orf1197* represents a chimeric *atp6* gene, with transcription activity specific for the CMS line HA89(ANN2). In sunflower, the *atp6* gene typically encodes a protein consisting of 351 aa, whereas the predicted size of the translation product of the *orf1197* is 399 aa. Both proteins share 251 identical amino acids in the C-terminus. Thus, the protein encoded by the *orf1197* carries all seven TDs present in the C-terminus of the ATP synthase Fo subunit 6 from mitochondria of male-fertile sunflower (Figure 4).

**Figure 4.** Comparison of *orf1197* and *atp6* encoded proteins and prediction of transmembrane helices. The amino acid sequence of the *orf1197* encoded protein is presented. Amino acids identical to ATP6 are shown in bold. Amino acids forming transmembrane domains are marked by red bars.

#### **3. Discussion**

Recently, we had investigated complete mitochondrial DNA sequences for three CMS sources in sunflower, coming from interspecific crosses: PET1, PET2, MAX1 [25,26]. Comparison of the HA89(ANN2) mitogenome with mitochondrial genome assemblies of male-fertile lines [25,27] and the other HA89 male-sterile analogs provides insights into reorganizations of mtDNA associated with CMS phenotypes. The male-fertile lines (HA89, HA412) have only slight variations in the mtDNA sequence [25]. Whereas the mitogenomes of the CMS sources (HA89(ANN2), HA89(PET1), (HA89(PET2), HA89(MAX1)) showed significant differences as compared to their alloplasmic male-fertile analog. The complete mitochondrial genome of the male-fertile line HA89 adds up to 300,947 bp (NCBI accession MN171345.1), while HA89(PET1) has a size of 305,217 bp (NCBI accession MG735191.1), HA89(PET2) of 316,582 bp (NCBI accession MG770607.2), HA89(MAX1) of 295,586 bp (NCBI accession MH704580.1) and HA89(ANN2) of 306,018 bp (MN175741.1). The difference in genomes sizes is due to several deletions and insertions. For instance, in the mtDNA of all the investigated CMS sources, except HA89(PET1), a similar deletion in the *nad4L*-*orf777*-*atp8* region was observed. In the case of HA89(PET2), this is due to a 711 bp deletion, in HA89(MAX1) has a 978 bp deletion, and in HA89(ANN2) there is an 1195 bp deletion. All these deletions resulted in removal of *orf777* from the mtDNA in the CMS lines. Another region enriched by deletions is the area between *cob*-*ccmFC*, here three overlapping deletions were detected: a 451 bp deletion in HA89(PET1), one of 3780 bp in HA89(PET2) and another one of 4204 bp in HA89(ANN2).

The coincidence between locations of these deletions is not accidental. There are three 265 bp repeats present in the sunflower mitochondrial genome, with the following positions in the mtDNA of the male-fertile HA89 line: 36537-36801 (adjacent to *atp8*), 190074-190338 (next to *cob*), and 202902-202638 (between *orf873*-*atp1*). These repeat regions are shown by red stars in Figure 5. Repeats represent common recombination points in mtDNA molecules [4,28]. The identification of small repeats involved in recombination is important because they influence the maintenance and evolution of mitochondrial genomes [28]. An imbalance in the nuclear-mitochondrion relationship that may occur in distant hybridizations impairs the recombination of mtDNA sub-genomic molecules, therefore, leading to reorganizations in the mitochondrial master chromosome. For instance, in HA89(PET1) the deletion, insertion, and inversion were mentioned in the *cob*-*atp1* region, directly between two repeats (Figure 5). In HA89(PET2), there were also several rearrangements in hot spots, resulting in the formation of a new gene cluster *cob*-*atp8*-*cox3*, as well as in the translocation of the *ccmFC*-*orf873*-*atp1* gene cluster into the *nad4L*-*orf777* region, combined with a deletion and huge insertion (Figure 5). In the mtDNA of both lines, HA89(MAX1) and HA89(ANN2), the specific *atp1*-*atp8*-*cox3* gene order was created, while in the MAX1 CMS source *ccmFC*-*orf873* translocated into the *nad4L*-*orf777* region (with deletion and huge insertion in this region). In the case of ANN2, the *ccmFC*-*orf873* region is located next to the *cob* gene (Figure 5). Thus, we have established that these three 265 bp repeats represent a reorganization hot spots in the sunflower mitochondrial genome.

Considering the insertions discovered in the HA89(ANN2) mitogenome, we also observed a similarity between the insertions of different CMS sources in sunflower. For instance, about 85% of the 1027 bp insertion sequence (ANN2) is complementary to the part of the 15,885 bp insertion (PET2). The 3757 bp insertion (ANN2) contains 1959 nucleotides identical to the 15,885 bp insertion (PET2) and 1215—to the 5272 bp insertion (MAX1). Also, 2343 bp of the 5338 bp insertion in ANN2 are similar to another region of sunflower mtDNA proximal to the *cob* gene (position 185,987–188,330 in the mtDNA of HA89 fertile line). So, this sequence is duplicated in the mitochondrial genome of the HA89(ANN2) CMS line. About 10% of 6452 bp insertion (ANN2) is complementary to both the 5050 bp (PET2) and the 5272 bp (MAX1) insertions. As well as 1158 bp of the 9044 bp insertion (ANN2) are identical to the 5050 bp and 15,885 bp insertions (PET2) and 5272 bp insertion (MAX1).

**Figure 5.** Reorganizations in CMS lines involving the 265 bp repeats (red stars) shown in the male-fertile mtDNA of HA89.

Although there are similarities in deletions, insertions, and rearrangements between mitochondrial genomes of the HA89(ANN2) and other CMS lines, the discovered ORFs were different. We summarized the data of all identified ORFs in the male-sterile cytoplasms in Table 4. The ORFs encoding proteins with similarity to other mitochondrial proteins and especially having transmembrane domains are of particular interest since the chimeric proteins with TD most often cause CMS phenotypes [10,29]. In the mtDNA of HA89(ANN2), we detected three new transcriptionally active ORFs, encoding proteins with TD—*orf558* (one TD), *orf933* (two TD), *orf1197* (seven TD). The *orf933* shows no homology to other sunflower genes, while *orf558* represents a chimeric *cox2* gene, and *orf1197* a chimeric *atp6* gene. It is difficult to estimate the exact contribution of *orf558*, *orf933*, *orf1197* to the development of the male-sterile phenotype in ANN2. However, the possibility of involvement of more than one open reading frame might be one explanation that ANN2 is so difficult to restore. On the other hand, the presence of a suppressor gene S1 discovered by Liu et al. [23] might be the reason for low rates of fertility restoration of the ANN2 CMS source. Previous studies indicate that the chimeric *atp6* genes or new ORFs that are co-transcribed with *atp6* most often cause CMS phenotypes in flowering plants [10,30–33]. Therefore, we suggest that *orf1197* is the major CMS candidate gene for ANN2 CMS source. Moreover, chimeric *atp6* genes were also identified in MAX1 [26] and CMS3/ANT1 [34] CMS types of sunflower. In CMS lines, chimeric *atp6* genes encode N-terminal extended proteins compared to the normal ATP synthase subunit 6 (351 aa): ANN2—399 aa, MAX1—429 aa, (AYV91168.1), CMS3/ANT1—437 aa (CAA57790.1). Moreover, 397 of 399 amino acids in the *orf1197* protein are identical to the chimeric *ATP6* of CMS3/ANT1 line, and therefore this protein represents a shorter version of this 437 aa-long protein. Such similarities support our hypothesis about the importance of *orf1197* in shaping the CMS phenotype in ANN2.


**Table 4.** Summary of transcriptionally active open reading frames in the mitochondrial genome of isonuclear fertile and CMS lines.

**In bold**: ORFs encoding proteins with transmembrane domains.

The *orf558* (as the chimeric *cox2* gene) might also cause cytoplasmic male sterility in ANN2. In other plants species, modified *cox2* sequences seem to be involved in the male sterility. For instance, the CMS specific *pcf* gene in petunia is composed from sequences of the 5 portion of *atp9*, segments of *cox2*, and a large region of unknown origin—*urfS* [35]. In wild beets, the CMS-associated *orf129* shows homology to the 5 flanking and the coding sequence of *cox2* [36]. In the mitogenome of the inap CMS source of *Brassica napus*, which was created by a somatic hybridization with *Isatis indigotica*, a novel *cox2-2* gene was detected, which represents recombination of the *cox2* of woad and *cox2-2* of rapeseed [37].

Another unique feature of HA89(ANN2) mitogenome is the formation of *orf891*. According to the ORFs predictions, a 3 elongation of the *nad6* gene (*orf891*) may occur. However, the cDNA analyses did not agree with the genomic data. Perhaps due to *nad6* mRNA editing instead of a 3 elongated transcript, the shorter one is formed. Heteromorphism in *nad6* transcript length was also observed in *Mimulus guttatus x M. nasutus* hybrids with the CMS phenotype [38]. Both male-fertile and male-sterile hybrids have a single copy of the *nad6* gene, and the divergence in mRNA length was only observed in male-sterile plants [38].

#### **4. Materials and Methods**

#### *4.1. Plant Material*

The CMS line HA89(ANN2) of sunflower was obtained from the genetic collection of the N. I. Vavilov All-Russian Institute of Plant Genetic Resources (Saint Petersburg, Russia). The original source of the ANN2 male-sterile cytoplasm was obtained by Serieys in 1984 [18]. All sunflower lines were grown in regularly irrigated pots in the growth chamber KBWF 720 (Binder, Tuttlingen, Germany) with the following growth conditions: temperature—26 ◦C, humidity—70%, photoperiod—10/14 h (dark/light).

#### *4.2. Mitochondrial DNA Extraction, NGS Library Preparation, and Sequencing*

First, the organelle fraction from leaves of 14-days-old sunflower seedlings was isolated, as described by Makarenko et al. [39]. Such preparations significantly reduced the amount of nuclear DNA. Then DNA extraction was performed with PhytoSorb kit (Syntol, Moscow, Russia), according to the manufacturer's protocol. Equal amounts of DNA from seven plants were mixed, and we used 1 ng of DNA pull for the NGS library preparation step. The library was made with Nextera XT DNA Library Prep Kit (Illumina, Mountain View, CA, USA), following the guidelines of Illumina. The quality of the library was evaluated using Bioanalyzer 2100 (Agilent, Santa Clara, CA, USA). The library was quantified at the Qubit fluorimeter (Invitrogen, Carlsbad, CA, USA) and by qPCR, then diluted up to the concentration of 8 pM. Sequencing was performed on two different Illumina sequencing platforms: HiSeq 2000 using TruSeq SBS Kit v3-HS 200-cycles and MiSeq using MiSeq Reagent Kit v2 500-cycles (Illumina, Mountain View, CA, USA). A total number of 3,063,836 reads (100-bp paired) and 3,305,268 reads (250-bp paired) were generated.

#### *4.3. Mitochondrial Genome Assembly and Annotation*

Quality control of reads was done using FastQC (https://www.bioinformatics.babraham.ac. uk/projects/fastqc/). Trimming of adapter-derived and low quality (Q-score below 25) reads was performed with Trimmomatic software [40]. For contig generation, we used SPAdes Genome Assembler v.3.10.1 [41]. The whole mitochondrial genome was manually assembled using scaffolds based on high coverage (depths > 70) contigs (length > 1000 kbp) and available bridge contigs (length = 0.3–1 kbp). The genome assembly was validated by remapping the initially obtained reads using Bowtie 2 v.2.3.3 [42]. All observed rearrangements were verified by PCR analysis. For variant calling, we used samtools/bcftools software [43] and manually revised polymorphic sites using the IGV tool [44].

The mitochondrial genome was annotated with MITOFY [45], BLAST tool [46], and ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder). Using GeSeq [47], we provided comparisons of our annotation with the current reference annotations (NCBI accessions NC\_023337.1, CM007908.1) of the sunflower mitochondrial genome. Graphical genome maps were generated using the OGDRAW tool v.1.3.1 [48]. The prediction of transmembrane domains was made with TMHMM Server v.2.0 (http://www.cbs.dtu.dk/services/TMHMM-2.0/). The scheme of the bioinformatic pipeline is presented in Supplementary Figure S1.

#### *4.4. Expression Analyses*

RNA was extracted from leaves of seven 28-days-old sunflower plants using a guanidinium thiocyanate-phenol-chloroform based method with the ExtractRNA reagent kit (Evrogen, Moscow, Russia). The quality and concentration of the RNA were evaluated with the Qubit fluorimeter (Invitrogen, Carlsbad, CA, USA) and the NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). Total RNA (0.5 μg) was treated with DNAse I (Thermo Fisher Scientific, Waltham, MA, USA), and then cDNA was synthesized using the MMLV RT kit (Evrogen, Moscow, Russia) with random primers. As a negative control for each DNAse treated mRNA sample, the same reverse transcription protocol was performed, but without the MMLV enzyme. The quantitative PCR was performed with EvaGreen based RT-PCR kit (Syntol, Moscow, Russia) on Rotor-Gene 6000 (Corbett Research, Mortlake, NSW, Australia). A summary of all primer sequences is given in Supplementary Table S1. The primers annealing temperature (Tm) was 60 ◦C.

#### **5. Conclusions**

Assembly of CMS ANN2 mitochondrial genome (HA89(ANN2) line) revealed several rearrangements, insertions, and deletions, as well as seven new open reading frames: *orf324*, *orf327*, *orf345*, *orf558*, *orf891*, *orf933*, and *orf1197*. Transcripts were detected for all new open reading frames in CMS ANN2, but not in the fertile cytoplasm. Only *orf558*, *orf891*, *orf933*, and *orf1197* encoded proteins that contained membrane domains, making them the most likely CMS candidate genes for the ANN2 source. Notably, *orf1197* represents a chimeric *atp6* gene and presumably plays a major role in the CMS phenotype development of ANN2. However, CMS ANN2 may be caused by simultaneous action of several candidate genes. Hot spots for rearrangements (265 bp repeats) were identified, and we propose that they influence the maintenance and evolution of the mitochondrial genome in sunflower. *Plants* **2019**, *8*, 439

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2223-7747/8/11/439/s1, Suppl Table S1. The primers sets used for transcription activity analysis, Suppl Table S2. The protein coding genes annotated in HA89(ANN2) mitochondrial genome, Suppl Figure S1. Bioinformatic pipeline for mitochondrial genome analysis. Trimmed reads are available at https://doi.org/10.6084/m9.figshare.8945084.v1.

**Author Contributions:** Conceptualization, K.V.A. and V.A.G.; methodology, M.S.M. and I.V.K.; software, T.V.T. and M.D.L.; validation, M.S.M., K.V.A. and I.V.K.; formal analysis, M.S.M. and T.V.T.; investigation, M.S.M., K.V.A. and M.D.L.; resources, V.A.G.; data curation, M.S.M. and M.D.L.; writing—original draft preparation, M.S.M., T.V.T. and R.H.; writing—review and editing, M.S.M., T.V.T. and R.H.; visualization, R.H.; supervision, A.V.U. and R.H.; project administration, A.V.U. and V.A.G.; funding acquisition, A.V.U. and M.D.L.

**Funding:** The study was supported by the Ministry of Education and Science of Russia project no. 6.929.2017/4.6. The NGS sequencing was provided with the support of budgetary subsidy to IITP RAS (Laboratory of Plant Genomics 0053-2019-0005). Analytical work was carried out on the equipment of centers for collective use of Southern Federal University "High Technology".

**Acknowledgments:** We are grateful to the cooperators of Northern Crop Science Laboratory (Fargo, USA) for providing seeds of HA89 alloplasmic lines of sunflower to the genetic collection of the N. I. Vavilov Institute of Plant Genetic Resources. We thank reviewers and the editor for their suggestions and comments, which has helped to improve the manuscript, and Todd Lorenz (tlorenz@laverne.edu), Duane R. Chartier (drc@authentica.org) for editing assistance.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The First Plastid Genome of the Holoparasitic Genus** *Prosopanche* **(Hydnoraceae)**

#### **Matthias Jost 1, Julia Naumann 1, Nicolás Rocamundi 2, Andrea A. Cocucci <sup>2</sup> and Stefan Wanke 1,\***


Received: 11 January 2020; Accepted: 11 February 2020; Published: 1 March 2020

**Abstract:** Plastomes of parasitic and mycoheterotrophic plants show different degrees of reduction depending on the plants' level of heterotrophy and host dependence in comparison to photoautotrophic sister species, and the amount of time since heterotrophic dependence was established. In all but the most recent heterotrophic lineages, this reduction involves substantial decrease in genome size and gene content and sometimes alterations of genome structure. Here, we present the first plastid genome of the holoparasitic genus *Prosopanche,* which shows clear signs of functionality. The plastome of *Prosopanche americana* has a length of 28,191 bp and contains only 24 unique genes, i.e., 14 ribosomal protein genes, four ribosomal RNA genes, five genes coding for tRNAs and three genes with other or unknown function (*acc*D, *ycf* 1, *ycf* 2). The inverted repeat has been lost. Despite the split of *Prosopanche* and *Hydnora* about 54 MYA ago, the level of genome reduction is strikingly congruent between the two holoparasites although highly dissimilar nucleotide sequences are observed. Our results lead to two possible evolutionary scenarios that will be tested in the future with a larger sampling: 1) a Hydnoraceae plastome, similar to those of *Hydnora* and *Prosopanche* today, existed already in the most recent common ancestor and has not changed much with respect to gene content and structure, or 2) the genome similarities we observe today are the result of two independent evolutionary trajectories leading to almost the same endpoint. The first hypothesis would be most parsimonious whereas the second would point to taxon dependent essential gene sets for plants released from photosynthetic constraints.

**Keywords:** Piperales; Hydnoraceae; *Hydnora*; *Prosopanche*; parasitic plants; holoparasite; plastid genome

#### **1. Introduction**

In photoautotrophic plants, the plastid chromosome encodes essential genes for major photosynthesis related functions and is mostly considered highly conserved with respect to gene order and gene content [1] as well as its organization as quadripartite structure consisting of two single copy regions separated by two copies of an inverted repeat [2,3]. The plastomes of parasitic plants are a prime subject to study the possibilities and changes of otherwise highly conserved genomic structures that can occur when organellar genomes are released from selective constraints [4,5]. Plastid genomes of members from nearly all of the at least 11 independently evolved parasitic angiosperm lineages [6] have been studied to a more or less extensive degree [3,7–15]. Recent in-depth papers and reviews summarized similar evolutionary stages of plastome decay [5,16,17], eventually resulting in possible complete loss in Rafflesiaceae [14], depending on the respective level of heterotrophy. However, a vast variety of changes in plastid genome size and organization, gene content, nucleotide composition and mutational rates has been reported within individual lineages and the speed and extent of degradation

seem lineage-specific [3]. Despite those lineage-specific differences, an underlying pattern can be found that plants follow as they transition from autotrophic to heterotrophic lifestyle. With increasing heterotrophy, genes involved in photosynthesis show signatures of relaxed functional constraints and eventually become pseudogenized and lost, often starting with *ndh* genes [16,18]. Following the initial losses is a relaxation of purifying selection in genes related to photosynthesis and genes involved in the translation and transcription machinery. General housekeeping genes that carry out non-photosynthesis related functions can show less relaxation or even slightly higher purification [18]. There are many examples of the different stages of parasitism in numerous lineages with the Orobanchaceae being the prime subject to study the whole range from facultative to holoparasitism [19,20]. It is apparent that often gene order and plastome structure such as the quadripartite nature (large single copy region LSC, small single copy region SSC and inverted repeats IR) are retained in different parasitic plastomes. Yet, there are some exceptions such as the loss of the inverted repeat for example in the highly reduced plastomes of *Pilostyles* [12,21], *Balanophora* [22] and many other lineages of autotrophic plants [23–26]. Functionally the inverted repeats are hypothesized to stabilize the plastid genome [27]; therefore, their loss is shown to lead to more frequent rearrangements [28].

One of the 11 parasitic angiosperm lineages is the holoparasitic family Hydnoraceae (Piperales). In contrast to all other parasitic angiosperm lineages Hydnoraceae are the only lineage outside the monocot and eudicot radiation and among one of the oldest parasitic lineages with an estimated stem group age of ~91 MYA [29]. The family consists of the two genera *Hydnora* (7 species) [30] and *Prosopanche* (5/6 species) [31] with *Hydnora* occurring exclusively in the Old World and *Prosopanche* in the New World [32,33]. According to molecular dating analyses the two genera split from each other about 54 MYA [29]. The plastid genomes of Hydnoraceae are also among the least known with a single known plastome of *Hydnora visseri* [10], for which not only a drastic reduction in genome size but also in gene content has been shown, exceeding the loss of genes involved in the photosynthesis apparatus, yet maintaining the quadripartite structure of the plastome as well as showing nearly identical gene order and orientation compared to the closely related photoautotrophic genus *Piper* (Piperaceae, Piperales) [10]. Given that the two Hydnoraceae genera diverged in the early Eocene and the plastome of *Hydnora visseri* is among the most reduced holoparasitic angiosperms knowledge about its sister genus *Prosopanche* would be highly valuable for putting the sequenced plastomes in context to better understand the evolution of this specific lineage and to compare it to the other different parasitic lineages. We here report the plastid genome of *Prosopanche americana* and compare it to the published genome of *Hydnora visseri* as well as to the plastome of *Aristolochia contorta* [34] (Aristolochiaceae, the closest autotrophic relatives of Hydnoraceae).

#### **2. Results**

#### *2.1. The Plastome of Prosopanche Americana*

In total, 373 million reads were sequenced and de novo assembled using CLC Workbench [35], resulting in a total of 8,342,403 scaffolds of which 372 had BLAST hits for plastid features (Figure 1), using a query of 45 angiosperm plastomes from GenBank and setting the e-value to 1e-10. In depth analyses of these scaffolds based on quality of BLASTn hit in combination with scaffold coverage, resulted in the exclusion of all but scaffold 424, which received the highest number of BLAST hits for plastid genome features. Additionally, it is the longest scaffold with plastid BLAST hits (28,191 bp) and shows the highest depth of coverage (4678) and with a total of 952,705 reads mapped to it. The scaffold has identical ends in sequence, forming 47 bp of overlap, allowing for circularization. The circularization was verified by PCR and the Sanger sequenced product was aligned to the assembled scaffold, which also revealed and corrected a misassembly and an insertion at the respective contig ends, but introduces two unresolved characters in this region.

**Figure 1.** The plastid genome of *Prosopanche americana* is found on a single scaffold. **a**) Stoichiometry plot of the scaffold lengths and the respective mean coverage depth of the *Prosopanche americana* assembly. Scaffolds are visualized as blue circles. The green dots represent scaffolds with BLAST hits for annotated fractions of the plastome (derived from "gene features" in NCBI). The black arrow points to scaffold 424. **b**) Distribution of the reads mapped to the linear scaffold 424 with the scale displaying the depth of coverage in 1k increments. The red boxes represent the coverage of the *Prosopanche amercana* repeat regions.

With this, the *Prosopanche americana* plastome is 28,191 bp in length (Figure 2). Using DOGMA (% identity cutoff = 25, e-value = 1e-5) [36], 25 potential plastid genes were initially identified, consisting of 17 protein coding genes, 3 rRNAs and 7 tRNAs. Only 24 of the initial 25 genes could be verified and were used for further testing and analysis because the hit for *trn*P-GGG could not be confirmed and is a result of the low stringency search settings. With the methods described below, one additional rRNA could be identified, which was not detected by the DOGMA analysis. For comparison, we also ran an analysis of scaffold 424 using MFannot [37] which resulted in a similar result, but failed to identify *rrn*5, *ycf* 2 or *trn*fM-CAU. *trn*W identified by DOGMA is identified as *trn*C with MFannot. In total, the plastome contains 25 genes, of which 24 are unique. There are 14 ribosomal protein genes (*rpl*2, *rpl*14, *rpl*16, *rpl*36, *rps*2, *rps*3, *rps*4, *rps*7, *rps*8, *rps*11, *rps*12, *rps*14, *rps*18, *rps*19), two of which contain introns (*rpl*16, *rps*12), four ribosomal RNA genes (*rrn*4.5, *rrn*5, *rrn*16, *rrn*23), five genes coding for tRNAs (*trn*E-UUC, 2x *trn*I-CAU, *trn*fM-CAU, *trn*W-CCA, *trn*Y-GUA) and three genes with other or unknown functions (*acc*D, *ycf* 1, *ycf* 2). The *Prosopanche americana* plastome (GenBank accession number MT075717) lacks genes involved in photosynthesis, such as genes for the photosystems I and II, RuBisCO large subunit, cytochrome-related or ATP synthase, NADH dehydrogenase genes, as well as RNA polymerase subunits, *clp*P and *mat*K. All protein coding genes in the plastome have

feature continuous reading frames with the only exception of the *rps*19 gene, which does not possess an in-frame start codon in close proximity. Amino acid isotypes predicted by DOGMA [36] are confirmed by tRNAscan-SE [38,39] anticodon prediction (Figure S1) in all but the case of *trn*W-CCA. The underlying sequence is predicted to form a cloverleaf structure with three leaves but it has an ACA anticodon (as predicted by MFannot) instead of CCA. Alignments of the respective tRNAs from *Aristolochia* (*Aristolochia contorta* NC\_036152) [34] indicate that this tRNA is homologous to the *Aristolochia trn*W-CCA. A three-loop cloverleaf 2D structure is predicted for all identified tRNAs, except for *trn*Y-GUA. The latter shows an additional, variable arm in the tRNAscan-SE prediction, which is also present in the *trn*Y of *Aristolochia contorta*. The anticodon versus isotype model prediction (Table S1) by tRNAscan-SE is consistent for two tRNAs (*trn*fM-CAU and *trn*Y-GUA). The correctness of the anticodon prediction for all tRNAs was verified via alignments with respective genes of *Aristolochia contorta*. Given the abovementioned information, it is likely that *trn*W-CCA is a pseudogene.

**Figure 2.** The highly reduced plastome of *Prosopanche americana*. The circular plastome of holoparasitic *Prosopanche americana* contains 24 unique genes (14 ribosomal protein genes, 4 ribosomal RNA genes, five genes coding for transfer RNAs and three genes with other or unknown function), which are distributed over a total length of 28,191 bp. Gene distribution is displayed on the outer circle with color coded gene groups according to the legend (bottom left). GC content is visualized as inner, grey circle. \* indicates genes with introns.

The *Prosopanche* plastome contains 27 intergenic regions with an average length of 173 bp and the longest one being 879 bp in length. In total, 23 open reading frames were identified within the intergenic regions (ORF finder, implemented in Geneious 11.1.5 [40]), while setting the minimum size to 90 bp, which is the shortest plastid encoded protein coding gene in Piperales and allowing for the detection of smaller ORFs within a larger ORF. In-depth analyses of intergenic regions through BLAST or automatic annotation (as low as 50% sequence similarity) did not return meaningful hits for genes or pseudogenes. Structurally, the plastome of *Prosopanche americana* does not have any inverted repeats and therefore does not show the typical quadripartite structure. All genes are found as a single copy on the plastome, except the *trn*I-CAU gene, which is found in two copies both showing the same orientation on the circular plastome. Analysis of the flanking regions of both copies (Figure 3) show a stretch of 195 bp with only four substitutions between them. In addition to the fully duplicated tRNA, an alignment of the two copies also shows a duplication of the first 25 bp of the *rpl*2 gene. The gene is truncated downstream in one repeat (copy B) due to sequence divergence and the end of the duplication. The coverages of the duplication copies are neither significantly higher nor significantly lower than the average plastome coverage (4678). In copy A coverage ranges from 4285 to 4682 and from 4316 to 4868 for copy B.

**Figure 3.** *Prosopanche americana trn*I-CAU repeat region. Alignment of the two *trn*I-CAU repeat copies highlights high sequence identity (green identity bar) over 195 bp with only four nucleotide differences (arrows pointing to gaps in identity graph and nucleotide differences highlighted by color and nucleotide code in sequence alignment). The *trn*I-CAU gene is fully contained within the repeat whereas only 25 bp of the *rpl*2 gene start fall within repeat borders. Contrary to copy B, which stays with a truncated *rpl*2 pseudogene, copy A is preceded by the complete *rpl*2 ORF of *Prosopanche americana*.

The GC content of the whole plastome is 20.4% (Figure 2, Figure 4a). Protein coding regions, which make up 64.9% (18,303 bp) of the total length consist to 17.9% of GC nucleotides, rRNA genes make up 16.5% (4639 bp) and contain with 37.2% the highest percentage of GC in the *Prosopanche* plastome (Figures 2 and 4a). Noncoding regions (excluding introns) on the other hand show the least amount of GC content with 11.3%, while making up 16.3% (4600 bp) of the plastome.

**Figure 4.** *Prosopanche americana* has a low GC content. (**a**) Comparison of the GC content and (**b**) proportional changes of plastid genome compartments (protein coding, ribosomal RNA, intergenic region) among the photosynthetic *Aristolochia contorta* (Aristolochiaceae, NC\_036152) and the holoparasitic Hydnoraceae highlighting the extent of changes the plastomes of the Hydnoraceae have undergone with respect to nucleotide composition and conservation of specific genome regions.

#### *2.2. Comparison of the Plastomes of the Holoparasitic Hydnoraceae*

The plastome of *Prosopanche americana* (28,191 bp) is slightly larger than the plastome of *Hydnora visseri* (27,233 bp, NC\_029358) even though it lacks the inverted repeat, which contributes 1466 bp to the length of *Hydnora*. *Prosopanche*, however, contains a 195 bp repeat and two additional genes (Figure 5). The remaining set of genes is shared between the genomes of *Hydnora* and *Prosopanche*. All protein coding genes are putatively functional with the exception of the *Prosopanche rps*19 and the *Hydnora rps*7. Single gene alignment of the *acc*D gene shows a drastic size difference between the copies of the two Hydnoraceae genera. The closest ATG start codon in frame eligible as *acc*D gene start of *Prosopanche* leads to a 184 bp increase in length, which does not align to the *Hydnora acc*D gene. Apart from the protein coding genes, the set of ribosomal RNAs also is identical between the sister taxa, solely the number of transfer RNAs differs between the two plastomes. *Prosopanche* contains with *trn*W and *trn*Y two additional tRNAs, yet the former in the form of a pseudogene. The gene order between the two plastomes is mostly consistent with an exception of an inversion of the '*rps*7-*rps*12 exon 2 and 3- - region and the '*trn*I-*rpl*2-*rps*2- - region. Contrary to *Hydnora visseri*, the plastome of *Prosopanche americana* does not possess inverted repeats, though it contains a duplicated *trn*I-CAU-region while both copies share the same orientation. The same tRNA, *trn*I, is the only complete gene duplicated in the inverted repeat of *Hydnora*. The overall GC content in *Prosopanche* is lower than in its sister genus (Figure 4a) with the most drastic difference showing in the non-coding regions. The dotplot (Figure A1 in Appendix A) of the two plastomes shows a fairly straight diagonal line with many interruptions due to differences on sequence level and an inversion towards the end (Figure A1, red circle) illustrating the larger of the two inversions between the species.

**Figure 5.** High structural similarity between the plastomes of Hydnoraceae. Linear plastid genome comparison of *Prosopanche americana* and *Hydnora visseri* (NC\_029358). Genes are color coded; the dotted lines highlight inversions between the two genomes compared, genes containing introns are marked with \*.

#### *2.3. Plastome Comparison to Aristolochia*

A comparison of the plastomes of Hydnoraceae with the closely related photoautotrophic *Aristolochia contorta* (NC\_036152) [34] demonstrates large differences in genome size, gene and GC content. With a length of 160,576 bp, the plastome of *A. contorta* is about six times the size of the holoparasites and with a total of 131 genes it contains about six times the amount of that of *Prosopanche americana* and *Hydnora visseri*. *Aristolochia contorta* shows a characteristic plastid genome for photoautotrophic Piperales, with all necessary genes for the photosynthesis machinery [34]. The overall GC content (38.3%), as well as the GC content in different parts of the plastome, is higher than in the respective parts of the Hydnoraceae (Figure 4a) with the highest overall GC contents found in the rRNA genes across the three genera. The reduction of G and C nucleotides in *Hydnora* and *Prosopanche* occurs at a similar rate, with the latter having slightly lower percentages than *Hydnora*. The intergenic regions of *P. americana* however, show a much more drastic increase in AT content than observed in *Hydnora* and make up to 88.3% of the nucleotides (Figure 4). The comparison of changes within the three categories protein coding genes, rRNAs and intergenic regions (IGR) highlights the proportional increase in length in relation to the total plastome length for protein coding genes and the four rRNA genes (Figure 4b). The rRNA genes of the Hydnoraceae make up about three times as much of the total plastome length compared to the corresponding genes in *A. contorta*. A proportional decrease is shown for the intergenic regions in the parasitic plastomes compared to the photoautotrophic relative where IGRs contribute to about twice as much to the total plastome length.

An in-depth look at GC content and codon usage of the protein coding genes shared between the three plastomes using CodonW [41] shows a decrease in G and C nucleotides at the third codon position of these genes in the Hydnoraceae (Table S2). The two protein coding genes with the lowest GC proportion in the parasitic plastomes are *ycf* 2 and *rps*18, with *ycf* 2 being also one of the longest plastid-encoded genes. The lowest GC content in *A. contorta* is found in the *ycf* 1 gene with 31.9%, which is the highest percentage found in all of the *H. visseri* protein coding genes. The lowest overall G and C percentage is found in the *ycf* 2 gene of *P. americana* with only 13.3%.

#### *2.4. Phylogenetic Placement of Hydnoraceae*

The phylogenetic maximum likelihood (ML) tree (Figure 6) of the concatenated 82 gene alignment estimated with GTR+I+G substitution model and rapid bootstrapping (1000 replicates) recovers all major clades as monophyletic with high backbone support, except for the eudicot-monocot split which has a lower bootstrap value of 68. *Prosopanche* and *Hydnora* are supported as sister genera (bootstrap value = 100) and the monophyletic Hydnoraceae are placed within the Piperales as sister to a sister group of Aristolochioideae and Asaroideae with low support (bootstrap value = 13). Piperales as a monophyletic group receive only moderate support (bootstrap value = 76). The phylogram (Figure 6, left) highlights extremely long branches, not only leading to the most recent common ancestor (MRCA) of *Hydnora* and *Prosopanche* but also from the MRCA to the terminals, demonstrating the extremely high substitution rates in Hydnoraceae.

**Figure 6.** High mutational rate within Hydnoraceae relative to other Piperales and diverse photosynthetic angiosperms. Phylogenetic maximum likelihood tree reconstruction as cladogram (left) and phylogram (right) estimated using RAxML with the GTR + I + G model and conducting rapid bootstrapping (1000 replicates) recovers *Hydnora* and *Prosopanche* as sister genera and Hydnoraceae together as sister to a sister group of Aristolochioideae and Asaroideae (both Aristolochiaceae) within the Piperales. Bootstrap support values are displayed above the nodes. The scale bar shows the number of substitutions per site. Ingroup taxa are color coded, with eudicots being green, monocots red, Piperales dark blue and remaining Magnoliids and ANITA grade taxa light blue. Taxa refer to the dataset of Jansen et al. [42] if not otherwise indicated by GenBank number.

Phylogenetic tree reconstruction based on a gene set reduced to the genes that are present in Hydnoraceae, as well as on amino acid alignments of the complete and reduced gene set recovered relationships with generally lower support that are different from widely accepted topologies (Figures S2–S4).

#### **3. Discussion**

Phylogenetic tree reconstruction using maximum likelihood confidently verifies *Hydnora* and *Prosopanche* as sister genera (100 bootstrap support) and shows Hydnoraceae monophyletic, placed within Piperales. The placement of Hydnoraceae as sister to Aristolochiaceae in its widest circumscription (i.e., including Aristolochioideae and Asaroideae, Lactoridaceae not sampled) receives, however, no support (bootstrap value = 13) and might be a result of the long branches as well as missing data (few genes present in Hydnoraceae compared to other angiosperms). The inclusion of Hydnoraceae within Piperales is congruent with previous analyses. However, the positioning within the order differs between multiple different datasets and analyses. Naumann et al. [29] place Hydnoraceae as sister to *Aristolochia* and *Thottea* (Aristolochioideae) with *Lactoris* (Lactoridaceae) being sister to this group (Asaroideae not sampled). Nickrent et al. (six gene analysis) [33] reconstructed Hydnoraceae as sister to *Lactoris* (Lactoridaceae) and this group as sister to *Aristolochia*, but missing Asaroideae in the sampling. Massoni et al. [43] (12 gene analysis) recover Hydnoraceae as sister to *Lactoris* and Aristolochioideae with moderate support and this group sister to Asaroideae.

The plastome of the holoparasite *Prosopanche americana* was found on only one of the 372 scaffolds taken into consideration based on the BLAST for pt features with low stringency settings. The exclusion of the other scaffolds was due to a combination of coverage and quality of the BLAST hit. Careful analysis showed that many of the hits, often only a few bp, were for features that are similar between the different plant genomes (e.g., tRNA, rRNA), especially between the mitochondrial genome and the plastome. An additional BLAST with mitochondrial features (data not shown) revealed overlapping with the pt BLAST results. As a result, the plastome of *Prosopanche americana* solely consists of scaffold 424 and shows a high similarity to the plastome of the sister genus *Hydnora* with respect to genome size, gene content and order as well as levels of AT richness. Both Hydnoraceae plastomes contain the identical gene set with the exception of two additional tRNA genes in *Prosopanche* (*trn*W, *trn*Y). All ribosomal RNAs and protein coding genes are putatively functional (based on feature continuous reading frames), with *rps*7 being the exception for *Hydnora visseri* and *rps*19 the exception for *Prosopanche americana*. The plastome of *Aristolochia*, the closest available, photoautotrophic relative, is about six times the size of Hydnoraceae and contains over six times as many genes with the complete set of plastid-encoded photosynthetic genes. These findings make Hydnoraceae ideal candidates to visualize and compare the degree of reduction and gene loss in a parasitic lineage that is close to a potential endpoint, the potential complete plastome loss [14]. Orobanchaceae, a relatively young parasitic lineage [29], display the entire range of parasitic lifestyles [19] and allow for the visualization of most of the proposed stages of plastome breakdown [5,16,17]. However, plastomes of Orobanchaceae are not nearly as reduced in size or gene content as some mycoheterotrophes and a few parasitic plants, such as *Pylostyles* [12,21], *Balanophora* [22] or Hydnoraceae. The quadripartite plastome structure with two single copy regions and inverted repeats is common between most photoautotrophic plants and also retained in the majority of the parasitic lineages. *Hydnora* [10] still possesses an IR, though only containing one complete gene (*trn*I-CAU) and two pseudogenes. With a length of 1466 bp, it is much smaller than the one of *Aristolochia contorta* (25,459 bp) [34] containing 17 complete genes and one pseudogene. *Prosopanche americana* also contains a gene duplication (*trn*I-CAU), but as both copies appear in the same orientation, this is not to be considered an inverted repeat. The fact that the duplicated gene as well as the partial gene (*rpl*2) on the repeats can be found in the inverted repeat of *Aristolochia* as well as the extremely reduced IR of *Hydnora visseri* leads to the conclusion that these duplicated regions are remnants of an inverted repeat and potentially have changed orientation during a rearrangement process that occured along with genome reduction in *Prosopanche americana.* The coverage in both duplicated regions in *Prosopanche* is also not significantly higher than in the rest of the plastome, something usually common for inverted repeats. Additionally, the sequencing reads representing both copies are not assembled together as one sequence, which is the case for inverted repeats due to their identical sequence. The loss of one IR copy in *Prosopanche* could also be associated with the larger of the two inversions ('*trn*I-*rpl*2-*rps*2- ) when comparing the gene order of the two

Hydnoraceae genera. As these genes usually can be found in the inverted repeats, the loss of a different copy in *Prosopanche* compared to *Hydnora* could result in this apparent inverted gene order, as well as possible recombination events, for example, flip-flop recombination resulting in two plastome isoforms [44] before the IR loss in *Prosopanche*. The position of the second copy of the 195 bp duplication in the latter, nested in between the ribosomal RNA genes is noteworthy, as those are usually part of the inverted repeat. Therefore, the repeat position and the inverted region in comparison to *H. visseri* could potentially be a consequence of the IR loss, which is shown to lead to an increase in genome rearrangements. The occurrence of direct repeats instead of inverted repeats, although larger and containing more genes, is also known for species of the photoautotrophic genus *Selaginella* [45,46] and it is hypothesized that their recombination and gene conversion are not inhibited by the orientation of the direct repeat [47].

All protein coding genes present in the plastome of *Prosopanche americana* have a feature continuous reading frame, making them very likely functional (potential exception *rps*19) although evidence from transcriptome is currently unavailable. Functionality through structural analysis of the six tRNAs, as tested with tRNAscan-SE [38,39] is likely given for five of them. The sequence for *trn*W-CCA (Trp) is predicted by tRNAscan to have an ACA anticodon instead of a CCA anticodon. No tRNA with an ACA anticodon is known from Piperales plastomes but was found, for example, in Asterales (e.g., *Pogostemon cablin* MF287373 [48]). Sequence similarity strongly suggests that said Asterales tRNA with ACA anticodon is known as the intron-containing *trn*V-UAC in Piperales. Alignments of neither Asterales tRNA nor the *trn*V from Piperales did result in reasonable alignments.

Evaluation of nucleotide compositions of the shared protein coding genes between Hydnoraceae and *Aristolochia* visualizes the proportional increase in A and T nucleotides, also known from other plastomes of parasitic lineages [5,22]. *Hydnora* and *Prosopanche* showed a drastic decrease in G and C nucleotides across the whole plastome compared to the photoautotrophic *Aristolochia* with *Prosopanche* having slightly lower values than the sister genus *Hydnora*. Despite the drastically low proportion of G and C nucleotides in the Hydnoraceae plastomes compared to most photoautotrophic species, the record setting AT richness is known for holoparasitic *Balanophora* with most protein coding genes consisting to more than 90% of A and T [22]. For protein coding genes, mutations to the favored nucleotides predominantly occur on the third, variable codon position which has also been observed for other parasitic lineages such as Orobanchaceae [3]. Out of the three available stop codons (i.e., TAG, TGA and TAA) only two (TAG and TAA) are found in the protein coding genes of *Prosopanche*. The predominantly used stop codon found was TAA, highlighting another specific case of A and T nucleotide preference, whereas TAG was only found twice (*acc*D, *rpl*16) as stop codon. Similar cases of favoring certain stop codons or exclusively using a single one have been described, e.g., for the parasitic lineage *Cytinus* (Cytinaceae) [11] and *Balanophora* (Balanophoraceae) [22].

A comparison of the proportional amount that protein coding regions, ribosomal RNAs and IGRs contribute to the whole plastome highlights the many macro level changes Hydnoraceae plastomes experienced in comparison to the photoautotropic *Aristolochia*. The parasitic genomes have lost all plastid encoded genes related to the photosynthesis apparatus and their intergenic regions show a much more drastic decrease in length compared to *Aristolochia*, resulting in a proportional increase of space that the remaining protein coding genes and ribosomal RNAs take up in the plastome.

The comparisons of the two Hydnoraceae plastomes and the highlighted common set and order of genes between the genomes that have evolved independently since the split ~54 MYA lead to two possible scenarios. Either a similarly reduced plastome existed already in the MRCA of *Hydnora* and *Prosopanche* and did not experience many changes with respect to gene content and order since or the observed similarities are a result of two independent evolutionary pathways leading to the same endpoint. Although the first hypothesis would be more parsimonious, a larger sampling within Hydnoraceae would be needed to test these hypotheses. However, the second hypothesis would support the existence of taxon dependent essential gene sets for heterotrophic plants [3,49].

#### **4. Materials and Methods**

#### *4.1. Plant Material, DNA Extraction, Library Preparation and Sequencing*

Plant material of *Prosopanche americana* was collected near Chancaní (Córdoba, Argentina). A herbarium voucher with number AAC5681 was placed at Museo Botánico de Córdoba (CORD). Fresh tepal material was sliced and air dried for 24 hours and subsequently stored in silica gel. DNA extraction was done using the DNeasy Plant Maxi Kit (Qiagen, Venlo, Netherlands). Molecular weight of isolated DNA as well as quality was tested using gel electrophoresis, and NanoDrop 2000 (Thermo Scientific, Waltham, MA, U.S.). Illumina TruSeq DNA PCR-free library preparation was done using 2.2 μg genomic DNA with an insert size of 650 bp. The library was sequenced using a partial lane of an Illumina NextSeq High Output with 300 cycles creating 37,324 million reads. Both, the library preparation and library sequencing were performed at the DRESDEN concept Genome Center of TU Dresden (https://genomecenter.tu-dresden.de).

#### *4.2. Assembly and Identification of Sca*ff*olds with Plastid Information*

Using the de novo assembly function, raw data were assembled with CLC Workbench [35] allowing for automatic word and bubble size. The assembly resulted in a total of 8,342,403 scaffolds to which the reads were mapped back to obtain coverage information and to be able to evaluate the assembly in specific regions of interest. The assembly was blasted (BLASTn, e-value e1-10) against a database containing 45 angiosperm plastid reference genomes from GenBank. Contigs with hits to plastid genomes were extracted and used together with the readmapping information for the creation of a stoichiometry plot in RStudio [50] allowing for highlighting of contigs with hits to plastid gene features followed by a directed and more efficient search for contigs potentially belonging to the plastid genome. The latter contigs were selected for further investigation.

#### *4.3. Gene Annotation and Circularization*

To check for and identify genes the scaffolds were uploaded and analyzed using DOGMA [36] at low stringencies (percent identity cutoff for protein coding genes and RNAs = 25, E-value = 1e-5) and additionally using MFannot [37]. Additionally, alignment tools such as "Map to reference" and LASTZ version 7.0.2 [51] were used as a plug in within Geneious [52] and applied to the scaffolds identified in the initial BLASTn search and the stoichiometry plot. In addition, the published plastome of *Hydnora visseri* (NC\_029358) [10] was used to identify and annotate genes due to potentially higher sequence similarity between the sister genera. Gene boundaries were identified, then genes were annotated using Geneious [52] by aligning respective genes of *Hydnora visseri* and closely related photoautotrophic species of Piperales and in case of protein coding, genes feature continuous reading frames were checked manually. As a proxy of functionality of tRNA genes, in the absence of transcriptome data, the respective sequences were used as input for tRNAscan-SE [38,39] to predict their 2D structure. Criteria used were a cloverleaf structure with three leaves and that the predicted anticodon through tRNAscan found at the anticodon arm matches the anticodon predicted through sequence similarity. The full output is available as supplementary material (Table S1).

The *Prosopanche americana* plastome was finally circularized using the "Circular Sequence" tool in Geneious [52] and drawn using OGDRAW [53]. The correctness of this circular connection was verified using PCR followed by Sanger sequencing (forward and reverse) with a forward (ProAm-rps2F: AACTAAATTACAAGCCATTGATA) and a reverse primer (ProAm-rps14R: TCCTAGAGGTTATTATCGTTAT) derived from the available flanking regions.

#### *4.4. Plastome Comparisons*

The plastome of *Prosopanche americana* (GenBank accession number MT075717) was compared with the plastome of its sister genus *Hydnora* (*Hydnora visseri*, NC\_029358) [10] as well as a plastome of its close autotrophic relative *Aristolochia* (*Aristolochia contorta*, NC\_036152) [34] in terms of multiple parameters such as length, gene content and order, GC content and plastome structure with Geneious [40]. A dotplot (Score matrix: exact, window size: 100, threshold: 200), implemented in Geneious, was used to visualize differences between the *Prosopanche* and *Hydnora* plastome nucleotide sequences. For the protein coding genes shared between the three species, the rates of the four nucleotides were determined using CodonW (version 1.4.2) [41].

#### *4.5. Phylogenetic Placement of Prosopanche Americana*

Protein coding and rRNA genes of *Prosopanche americana* and *Hydnora visseri* were added and aligned to the 81 angiosperm-wide alignments published by Jansen et al. [42]. Additionally, an alignment for the *acc*D gene was created using data from GenBank. The sampling was then improved by adding all complete plastid genomes of Piperales accessions available in GenBank (https://www.ncbi.nlm.nih.gov/), increasing the data set to 83 taxa. After automatic alignment with MAFFT (version 7.450) [54,55], individual alignments were inspected and improved by eye using AliView (version 1.20) [56] and concatenated in Geneious [40]. Single gene alignments have been uploaded to TreeBASE (http: //purl.org/phylo/treebase/phylows/study/TB2:S25836). RAxML analysis [57] using the GTR+I+G model with 1000 bootstrap replications was carried out on the concatenated data set using the CIPRES Science Gateway [58] after calculating the optimal substitution model using jModelTest (version 2.1.7) [59]. Phylogenetic trees were rooted using the three Gymnosperms (*Cycas*, *Ginkgo*, *Pinus*) as outgroup. The output was visualized in TreeGraph 2 [60]. In addition to the 82 gene analysis, we used a gene set reduced to the ones that are present in the Hydnoraceae (Figure S2) and also did RAxML analysis based on amino acid alignments (translated with Geneious [40]), both on the complete (Figure S3) and reduced gene set (Figure S4).

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2223-7747/9/3/306/s1, Figure S1: Secondary structure of *Prosopanche americana* tRNAs. The secondary structure of the five unique *Prosopanche* tRNAs as predicted with tRNAscan-SE [38,39] shows the cloverleaf structure with three leaves for all but *trn*Y-GUA, which has an additional, variable arm. The predicted anticodon (boxes) matches the one predicted by DOGMA and through sequence similarity in all but the case of *trn*W-CCA, where tRNAscan predicts an ACA anticodon, Table S1: Anticodon prediction versus isotype model prediction visualizes tRNA sequence divergence in the *Prosopanche americana* plastome. Tabular tRNAscan-SE [38,39] output for the annotated *P. americana* tRNAs shows the anticodon predicted isotype, predicted anticodon and consistency of the anticodon versus isotype model predictions. Isotype scores are displayed with the top score being red, the 2nd highest score orange and the third highest score yellow, Table S2: Hydnoraceae protein coding genes show increase of A and T nucleotides at third codon position. The nucleotide usage as well as the GC content of the Hydnoraceae and *Aristolochia* protein coding genes as predicted by CodonW [41] highlights higher overall A and T content of the holoparasitic protein coding genes and shows the drastic differences at the third, variable codon position compared to photoautotrophic relative *Aristolochia contorta* [34], Figure S2: Angiosperm phylogeny based on a reduced plastid gene set. Phylogenetic maximum likelihood tree reconstruction as cladogram (left) and phylogram (right) based on a plastid marker set reduced to the genes present in the Hydnoraceae, estimated with RAxML using the GTR + I + G model and conducting rapid bootstrapping (1000 replicates) recovers *Hydnora* and *Prosopanche* as sister genera and Hydnoraceae together as sister to all other Piperales. Monocots are recovered as sister to eudicots and a group of Piperales together with *Drimys*, *Calycanthus* and *Liriodendron*. Bootstrap support values are displayed above the nodes. The scale bar shows the number of substitutions per site. Ingroup taxa are color coded, with eudicots being green, monocots red, Piperales dark blue and remaining Magnoliids and ANITA grade taxa light blue. Taxa refer to the dataset of Jansen et al. [42] if not otherwise indicated by GenBank number, Figure S3: Phylogenetic tree reconstruction based on amino acid alignments of plastid markers places Hydnoraceae as sister to all other angiosperms. Phylogenetic maximum likelihood tree reconstruction as cladogram (left) and phylogram (right) based on a translated and concatenated set of 78 plastid protein coding genes using RAxML and conducting rapid bootstrapping (1000 replicates) recovers *Hydnora* and *Prosopanche* as sister genera and Hydnoraceae together as sister to all other Angiosperms. Bootstrap support values are displayed above the nodes. The scale bar shows the number of substitutions per site. Ingroup taxa are color coded, with eudicots being green, monocots red, Piperales dark blue and remaining Magnoliids and ANITA grade taxa light blue. Taxa refer to the dataset of Jansen et al. [42] if not otherwise indicated by GenBank number, Figure S4 Phylogenetic tree reconstruction based on a reduced set of amino acid plastid markers places Hydnoraceae within eudicots. Phylogenetic maximum likelihood tree reconstruction as cladogram (left) and phylogram (right) based on a translated and concatenated set of plastid protein coding genes reduced to genes present in Hydnoraceae using RAxML and conducting rapid bootstrapping (1000 replicates) recovers *Hydnora* and *Prosopanche* as sister genera and places Hydnoraceae within Eudicots. *Illicium* is recovered as sister to all other Angiosperms. Bootstrap support values are displayed above the nodes. The scale bar shows the number of substitutions per site. Ingroup taxa are color coded, with eudicots

being green, monocots red, Piperales dark blue and remaining Magnoliids and ANITA grade taxa light blue. Taxa refer to the dataset of Jansen et al. [42] if not otherwise indicated by GenBank number.

**Author Contributions:** S.W. conceived the study; fieldwork was done by N.R., A.A.C., S.W.; data generation and analyses M.J., J.N.; writing of the first draft M.J., S.W; responsible for visualization of results M.J., J.N.; all authors reviewed and edited the draft and agreed to the published version of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding

**Acknowledgments:** We thank Christoph Neinhuis for constant support of our work on Hydnoraceae, Sebastian Müller for assistance with lab work, and Gesine Schäfer for her initial work on the plastid genome of *Prosopanche.* We also thank Eric K. Wafula and Claude W. dePamphilis for assistance with the assembly and feedback on the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

**Figure A1.** Dotplot comparison of the plastomes of *Prosopanche americana* and *Hydnora visseri* (NC\_029358) highlights differences on nucleotide level and inversion (red circle). The diagonal lines indicate sequence similarity between the plastomes of *Hydnora* and *Prosopanche*, the gaps visualize areas of drastic nucleotide sequence differences. The red circle highlights the larger of the two inversions between the Hydnoraceae plastomes. The dotplot was created with Geneious (v 11.1.5) by using an exact score matrix, a window size of 100 and a threshold of 200.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Plants* Editorial Office E-mail: plants@mdpi.com www.mdpi.com/journal/plants

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18