*2.5. Phylogenetic Analyses Reveal the Presence of a Few Clades Grouping the PAC Domain Proteins According to Their Associated Domains*

Based on all the criteria described above, 300 PAC domains were selected for building phylogenetic trees (S1, S2). They were chosen from plant families representative of the green lineage, from Bryophytes to Brassicales, based on a phylogenetic tree established using plastid gene sequences [32]. When several species were available for a given plant family, only one or a few of them were selected to represent it. For each plant family, the PAC domain sequences were compared by percentage of identity, and the most representative plant species was retained. When the sets of PAC domain sequences differed too much between plants of the same family, several species were retained. In addition, within a given plant species, only PAC domains sharing less than 85% identity were kept. As a first step, the sequences were aligned according to their predicted secondary structure. Such a strategy was used in previous studies where the conservation of the primary sequences of the proteins was not sufficient to ensure relevant alignments [33–35]. The PROMALS3D software was used, and the resulting alignment was imported into the MEGA7 software to build a maximum likelihood tree with 500 bootstrap replicates. Because of the low level of conservation between amino acid sequences, especially between the PAC domain sequences of the older lineages, we built two independent trees to avoid bias due to long-branch attraction: the first one (Tree I) including plants from Bryophytes to *A. trichopoda*, and the second one (Tree II) from *A. trichopoda* to Brassicales.
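The within-species redundancy criterion (less than 85% identity) can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: identity is computed over the columns of an existing alignment where neither sequence has a gap, and sequences are retained greedily in input order. The function names `percent_identity` and `filter_redundant` are hypothetical, and the tool actually used for this step is not specified.

```python
from typing import Dict, List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both strings come from the same alignment (equal length),
    and '-' is the gap character.
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_redundant(aligned: Dict[str, str], threshold: float = 85.0) -> List[str]:
    """Greedily keep sequences sharing less than `threshold` % identity
    with every sequence already kept (input order decides ties)."""
    kept: List[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not PAC domain data.
    toy = {"s1": "ACDEFGHIKL", "s2": "ACDEFGHIKV", "s3": "ACDAAAAAAA"}
    print(filter_redundant(toy))
```

A greedy filter depends on the order of the input, so a different ordering may retain a different representative of each group of near-identical sequences.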

Regarding Tree I, it was difficult to define clades grouping all the PAC domain sequences because most of the bootstrap values were low (S9). We only considered clades with bootstrap values higher than 30. We could define seven clades grouping 71% of the retrieved PAC domains, six of them containing one *A. trichopoda* PAC domain: clade A (AmTr.v1.0.061.7, mostly type 1-PAC domains); clade B (AmTr.v1.0.066.9, type 4-PAC domains); clade C (AmTr.v1.0.062.88, type 1-PAC domains, highly conserved sequences); clade D (AmTr.v1.0.041.161, type 2 W-W domains); clade E (AmTr.v1.000047, type 2-PAC domains); clade J (AmTr.v1.0.041.169, type 2-PAC domains); and clade K (*Equisetum* sp. PAC domains). The distribution of the PAC domain sequences of the other species was not clear. PAC domains of Bryophytes were represented in clades A, D, and E, whereas a *Tmesipteris parva* (Psilotales) PAC domain was found in clade B, and *Phylloglossum drummondii* (Lycopodiales) PAC domains in clades C and J. Of course, one cannot exclude that PAC domains of plants that diverged earlier than Amborellales are still missing, since only a limited number of fully sequenced genomes are available. Despite the presence of the key Cys residues and of a conserved 3D structure, the large evolutionary distance between Bryophytes and *A. trichopoda*, together with a relaxed selective pressure, could explain the low sequence identity observed between sequences of Tree I. Indeed, whereas terrestrialization is assumed to have occurred 450 MYA [36], the emergence of angiosperms was estimated at between 169 and 199 MYA [37]. Based on the putative interaction with cell wall polysaccharides and *O*-glycans, the PAC domain sequence variability could be correlated with the variability of the cell wall composition from Bryophytes to angiosperms [38].
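To make the bootstrap threshold concrete, the following Python sketch shows how a support value can be read as the percentage of bootstrap replicate trees in which a given group of sequences is recovered. It is a simplification under stated assumptions, not the MEGA7 procedure: each replicate tree is represented as a set of rooted clades (sets of tip names) rather than as bipartitions of an unrooted tree, and the names `bootstrap_support` and `supported_clades` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Clade = FrozenSet[str]


def bootstrap_support(clade: Iterable[str], replicates: List[Set[Clade]]) -> float:
    """Percentage of replicate trees that contain exactly this group of tips."""
    if not replicates:
        raise ValueError("at least one bootstrap replicate is required")
    target = frozenset(clade)
    hits = sum(1 for rep in replicates if target in rep)
    return 100.0 * hits / len(replicates)


def supported_clades(
    candidates: Iterable[Iterable[str]],
    replicates: List[Set[Clade]],
    min_support: float = 30.0,
) -> List[Tuple[Clade, float]]:
    """Keep candidate clades whose support is strictly higher than `min_support`."""
    result: List[Tuple[Clade, float]] = []
    for cand in candidates:
        clade = frozenset(cand)
        support = bootstrap_support(clade, replicates)
        if support > min_support:
            result.append((clade, support))
    return result


if __name__ == "__main__":
    # Four toy replicates with made-up tip names, for illustration only.
    reps = [
        {frozenset({"a", "b"}), frozenset({"c", "d"})},
        {frozenset({"a", "b"})},
        {frozenset({"a", "b"})},
        {frozenset({"a", "c"})},
    ]
    for clade, support in supported_clades([{"a", "b"}, {"c", "d"}], reps):
        print(sorted(clade), support)
```

With 500 replicates, as used here, a threshold of 30 corresponds to a group recovered in more than 150 replicate trees.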

In Tree II, the PAC domains were distributed into 10 clades with high-confidence bootstrap values (from 72 to 100), with the exception of clade H (28) (Figure 6, S9). An *A. trichopoda* PAC domain was found in each of them. Four clades were specific to higher plants, and they comprised the following *A. trichopoda* PAC domains, respectively: AmTr.v1.0.047.45 (clade F); AmTr.v1.0.068.122 (clade G); AmTr.v1.0.153.4 (clade H); and AmTr.v1.0.019.72 (clade I). Monocots and dicots were represented in all the clades, but clade J comprised a high number of grass PAC domains originating from gene duplication (see above). Interestingly, although the tree was built with PAC domains only, they grouped according to their association with other domains: type 1-PAC domains were found in clades A, C, F, H, and I; type 2-PAC domains were grouped in clades D, E, and J, with type 2 W-W domains in clade D; type 3-PAC domains were found in clade G, with the exception of three of them in clade H that carry short Proline-rich motifs at their N-terminus; and type 4-PAC domains were only found in clade B. Thus, it seems that there is a link between the amino acid composition of PAC domains, their secondary structure, and the associated domains. Finally, it seems that all the PAC domains of higher plants have a counterpart in *A. trichopoda*, meaning that the modern multi-domain structures of the PDPs found in the ten angiosperm clades preceded the emergence of angiosperms.

**Figure 6.** Phylogenetic Tree II. Tree II was built using 196 PAC domain sequences from *A. trichopoda* to *A. thaliana*. Ten clades (A to J) were defined according to significant bootstrap values (higher than 72, with the exception of clade H). The type of PDPs (e.g., Type 1 is 1, see Figure 2) found in each clade is indicated between brackets. The name of the *A. trichopoda* PDP found in each clade is indicated and highlighted with a red star.
