#### *2.2. The Number and the Diversity of PAC Domain Proteins Increase Along the Green Lineage*

The PDPs have been classified according to the domains associated with the PAC domain. Four types were distinguished (Figure 2). Type 1 corresponds to proteins containing only a PAC domain. The corresponding genes have either no intron or a single intron located between the sequences encoding Cys 1 and Cys 2 and those encoding Cys 3 to Cys 6. Type 2 includes proteins with an N-terminal PAC domain, which can be associated with (i) a Proline-rich domain or (ii) a well-conserved domain of unknown function, usually encoded by a specific exon and starting with the amino acid motif Tryptophan-X8-Tryptophan (W-W domain) (S7). As an example, At2g16630 (FOCL1) is a type 2-PAC domain protein with a W-W domain at the C-terminus. Type 3 encompasses proteins with a C-terminal PAC domain. In this case, the PAC domain can be associated with a Histidine stretch, a Proline-rich domain, and/or an AGP domain. For example, AtAGP30 and AtAGP31 are type 3-PAC domain proteins. Finally, type 4 corresponds to proteins containing a central PAC domain flanked by two extensin domains. Although a few proteins with Serine-(Proline)4 motifs typical of extensins at their C-terminus were found in Anthocerotophyta and Lycopodiales, the first bona fide type 4-PDP was found in Psilotales. There is no such PDP in *A. thaliana*.

**Figure 2.** Classification of PAC domain proteins according to the location of the PAC domain and associated domains. **Type 1.** Proteins containing the PAC domain alone. **Type 2.** Proteins containing the PAC domain at their N-terminus together with one or several other domains. **Type 3.** Proteins containing the PAC domain at their C-terminus together with one or several other domains. **Type 4.** Proteins containing the PAC domain in a central position flanked by two domains with extensin motifs (Serine-(Proline)<sub>n</sub>, n ≥ 2).
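The four-type classification can be expressed as a simple rule on the order of the annotated domains. The sketch below is only an illustration of that rule, not the procedure used in this study: the `Domain` record, the domain labels, and the coordinates in the usage example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    """Hypothetical domain annotation: a label and its position in the protein."""
    name: str   # e.g. "PAC", "Pro-rich", "W-W", "His", "AGP", "extensin"
    start: int
    end: int


def classify_pdp(domains: List[Domain]) -> str:
    """Assign a PDP type from the position of the PAC domain among the other domains."""
    ordered = sorted(domains, key=lambda d: d.start)
    names = [d.name for d in ordered]
    if "PAC" not in names:
        raise ValueError("no PAC domain annotated")
    i = names.index("PAC")
    before, after = names[:i], names[i + 1:]
    if not before and not after:
        return "type 1"          # PAC domain alone
    if before and after:
        # Type 4 requires extensin domains on both sides of the PAC domain.
        if "extensin" in before and "extensin" in after:
            return "type 4"
        return "unclassified"
    if not before:
        return "type 2"          # N-terminal PAC domain plus other domain(s)
    return "type 3"              # C-terminal PAC domain plus other domain(s)


# Usage with made-up coordinates: an N-terminal PAC domain followed by a W-W domain.
print(classify_pdp([Domain("PAC", 30, 120), Domain("W-W", 150, 260)]))
```

Proteins with a central PAC domain that is not flanked by extensin domains on both sides are reported as "unclassified" here; how such cases should be handled is a design choice left open by this sketch.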

In Bryophytes and Anthocerotophyta, only one to three PAC domain proteins were found per species (S1). The number of PDPs was higher in Psilotales and Equisetales, as well as in all the plant families that appeared later in the green lineage. Eleven PDPs are present in *Amborella trichopoda*, which is considered an ancestor common to angiosperms [25]. The highest numbers of PDPs, i.e., between 17 and 23, were found in Poales (*Brachypodium distachyon*, *Sorghum bicolor*, *Zea mays*, and *Oryza sativa*), as well as in *Linum usitatissimum*, *Populus trichocarpa*, and *Gossypium raimondii*. In Poales such as *B. distachyon* and *O. sativa*, the genes encoding PDPs can be found in tandem (Figure 3). The PAC domains encoded by these genes can show a high degree of identity (more than 85%), which is consistent with recent tandem duplication events [26]. In addition, PAC domains with various numbers of Cys residues were also found in Poales (S1). The functionality of those PAC domains has not yet been established.

**Figure 3.** Examples of PAC domain-containing protein (PDP) genes organized in tandem in the *B. distachyon* and *O. sativa* genomes. The orientation of the genes is indicated by arrows. The names of the genes are abbreviated, e.g., 20970 stands for *Bradi3g20970*, and 5750 for *O. sativa LOC\_Os10g5750*. Genes whose PAC domains share more than 85% identity at the amino acid level are represented with arrows of the same color.
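The color assignment used in Figure 3 amounts to grouping genes whose PAC domains exceed an identity threshold. A minimal sketch of such a grouping is given below. It assumes that grouping is transitive (single linkage), which is not stated in the text, and the gene names and identity values in the usage example are invented for illustration.

```python
from typing import Dict, List, Tuple


def group_by_identity(
    genes: List[str],
    identity: Dict[Tuple[str, str], float],
    threshold: float = 85.0,
) -> List[List[str]]:
    """Group genes linked by a pairwise PAC domain identity above the threshold.

    Assumption: groups are connected components (single linkage), computed
    with a small union-find structure.
    """
    parent = {g: g for g in genes}

    def find(g: str) -> str:
        # Follow parent links to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), pct in identity.items():
        if pct > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Usage with hypothetical genes and identity percentages.
pairs = {("geneA", "geneB"): 90.0, ("geneB", "geneC"): 88.0,
         ("geneC", "geneD"): 40.0, ("geneA", "geneD"): 30.0}
print(group_by_identity(["geneA", "geneB", "geneC", "geneD"], pairs))
```

With single linkage, two genes can end up in the same group without directly sharing more than 85% identity; a stricter all-against-all criterion would require a different grouping rule.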

The different types of PDPs are unevenly distributed among the plant species (Figure 4). Only type 1- and type 2-PDPs were found in all plant families. Among the type 1-PDPs, one sub-type should be distinguished. It corresponds to sequences that are highly conserved throughout the green lineage from Lycopodiales onwards, with an overall percentage of identity ranging from 60% to 88% and a percentage of similarity ranging from 69% to 92%. For comparison, the percentages of identity and similarity between two PAC domain sequences can be as low as 15.4% and 20.7%, respectively. Among the type 2-PDPs, those including a C-terminal W-W domain are present in nearly all plant families from Bryophytes to Brassicales. They could thus be ancestors of PDPs. Type 3- and type 4-PDPs seem to have appeared more recently in the evolution of the green lineage, since the most ancient type 3- and type 4-proteins were found in *A. trichopoda* and in Psilotales, respectively. However, one cannot exclude that some PDPs are missing from this collection, since only a few complete genomes are available for plants from Psilotales to Amborellales.
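To make the distinction between identity and similarity concrete, the sketch below computes both percentages from a pairwise alignment. It is not the method used to obtain the values reported here: the similarity groups, the choice of excluding gapped columns from the denominator, and the two short sequences in the usage example are assumptions made for illustration.

```python
from typing import Tuple

# Assumption: a simple grouping of residues with similar physico-chemical
# properties; alignment programs typically rely on a substitution matrix instead.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]


def identity_similarity(aligned_a: str, aligned_b: str) -> Tuple[float, float]:
    """Return (% identity, % similarity) over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns in which both sequences have a residue.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0, 0.0
    identical = sum(x == y for x, y in columns)
    similar = sum(
        x == y or any(x in group and y in group for group in SIMILAR_GROUPS)
        for x, y in columns
    )
    return 100 * identical / len(columns), 100 * similar / len(columns)


# Usage with two made-up aligned fragments ("-" marks a gap).
print(identity_similarity("CKLV-C", "CRIVAC"))  # (60.0, 100.0)
```

Because identical residues also count as similar, the similarity is always at least as high as the identity, as in the ranges quoted above.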

**Figure 4.** Distribution of the different types of PAC domains within the plant families. The different types of PAC domains are represented in Figure 2. Among type 1-PAC domains, those having a highly conserved amino acid sequence are distinguished (1'). Among type 2-PAC domains, those that are associated with a C-terminal W-W domain are highlighted (2').

#### *2.3. A Possible Origin for the PAC Domain*

We performed an extensive search for PAC domain sequences in the available databases dedicated to ancestors of the green lineage, using both the script described above and BLAST queries with several PAC domains, in case the spacing between Cys residues differed slightly. Mining was done in the following groups: Stramenopiles (*Synura petersenii*), Cryptophyta (*Chroomonas* sp.), Chlorophyta (*Asteromonas gracilis*, *Chlamydomonas reinhardtii*, *Nephroselmis olivacea*, *Volvox carteri*, *Scenedesmus dimorphus*, *Scherffelia dubia*), and Streptophyta (*Chara braunii*, *Coleochaete orbicularis*, *Klebsormidium flaccidum*, *Mesotaenium caldariorum*, *Penium margaritaceum*) (S4). In many cases, the proteins were incomplete, either at their N-termini, so that it was not possible to predict a signal peptide, or at their C-termini, so that they could not be classified. Whenever possible, the presence of predicted functional domains associated with the putative PAC domains was checked, and the proteins comprising functional domains associated with intracellular functions were not retained.
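A search based on the spacing between Cys residues can be sketched with a regular expression built from a list of allowed gap lengths. The code below is not the script used in this study, and the gap ranges in the usage example are placeholders rather than the actual PAC domain spacing; they would have to be replaced by values derived from known PAC domains.

```python
import re
from typing import List, Tuple


def build_cys_pattern(gaps: List[Tuple[int, int]]) -> "re.Pattern[str]":
    """Build a regex matching len(gaps) + 1 Cys residues separated by bounded gaps.

    Each (lo, hi) pair gives the minimum and maximum number of non-Cys
    residues allowed between two consecutive Cys residues.
    """
    parts = ["C"]
    for lo, hi in gaps:
        parts.append(f"[^C]{{{lo},{hi}}}C")
    return re.compile("".join(parts))


def find_cys_motifs(sequence: str, gaps: List[Tuple[int, int]]) -> List[Tuple[int, str]]:
    """Return (0-based start, matched segment) for each non-overlapping hit."""
    pattern = build_cys_pattern(gaps)
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Usage with a toy sequence and placeholder gap ranges (three Cys residues only).
print(find_cys_motifs("MACGGCACK", [(2, 3), (1, 2)]))  # [(2, 'CGGCAC')]
```

Widening the `(lo, hi)` ranges plays the same role as the complementary BLAST queries mentioned above: it tolerates candidates whose Cys spacing deviates slightly from the expected one. Note that this pattern excludes segments containing additional Cys residues between the expected ones.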

We could only find PAC domain-related sequences in Chlorophyta: 10 proteins were found in *C. reinhardtii* and one in *V. carteri*, which both belong to Chlamydomonadales. The Glycine residue located upstream of the first Cys residue was always missing, and the PAC domains were associated with Proline-rich motifs of two types, either Serine-(Proline)n or (Proline)n, up to three of which could be found in a given protein. However, the secondary structures of these domains were predicted to be α-helices. In *C. reinhardtii*, the GP1 and GP2 proteins, which both have Serine-(Proline)n motifs, were described as proteins rich in Hydroxyproline residues forming the insoluble glycoprotein framework of the cell wall [27,28]. Furthermore, in *C. orbicularis*, we could find another interesting PAC domain candidate, which was associated with Proline-rich motifs but contained seven Cys residues. For this candidate, the highest level of identity/similarity was found with two PAC domains of *Musa acuminata*: GSMUA\_Achr4T17330 (45%/51%) and GSMUA\_Achr7T01790.1 (39%/50%). The highest level of identity/similarity with a Marchantiophyta PAC domain was found with the *Conocephalum conicum* PAC domain ILBQ\_2004952 (30%/46%) and the *M. polymorpha* Mapoly0014s0128 PAC domain (33%/45%). Altogether, the sequence showing the highest level of identity to bona fide PAC domains was found in *C. orbicularis*. This is consistent with the assumption that the Coleochaetales could be one of the ancestors of the green lineage [29].

#### *2.4. Three-Dimensional-Modeling of PAC Domain Proteins*

Three-dimensional models were calculated for 41 bona fide and 9 putative PAC domains, based on the crystal structure of the *P. lanceolata* PAC domain [24]. The sequence identities between the template and the PAC domains varied between 9.6% and 30.4% (median 15.9%). A sequence identity of 30% is generally seen as a lower limit for reliable models predicted by homology modeling algorithms, but the assumption of disulfide bridges somewhat lowers this limit. However, the low sequence similarities were still an issue. In addition, for 6 out of the 50 PAC domains, the 3D-modeling software I-TASSER was not able to find conformations enabling the formation of the three disulfide bridges between the predefined Cys residues (S8). In all these cases, either the proteins were predicted to have α-helices, or they were missing the Glycine residue upstream of Cys 1.

For the bona fide PAC domains, it was possible to propose relevant 3D-models fitting the typical structure experimentally demonstrated for the *P. lanceolata* PAC domain [24]. Four selected PAC domains from different plants are shown in Figure 5: one from an Anthocerotophyta (*Anthoceros formosae*), chosen as an ancestral plant, one from *A. trichopoda*, chosen as the common ancestor of flowering plants, and two from higher plants, *Oropetium thomaeum* and *A. thaliana*. All four 3D-models show the expected parallel β-sheets forming a β-barrel and the three disulfide bridges. Like the *P. lanceolata* PAC domain, they also contain loop regions. The 3D-structure of bona fide PAC domains thus seems to have been conserved through the evolution of the green lineage. By contrast, the *C. orbicularis* protein, which was assumed to be an ancestor of the PDPs in the green lineage, only had three β-sheets, although the three disulfide bridges were at the predefined positions (S8).

The PAC domains that were set apart because of the prediction of α-helices showed completely different 3D-structures (S8). They exhibited fewer β-sheets or only α-helices and, as mentioned above, the three disulfide bridges were not at the expected positions. The 3D-modeling thus brought an additional criterion to confirm bona fide PAC domains. Interestingly, such a β-barrel structure has already been described for a family of mannose-binding lectins of red algae, the *Oscillatoria agardhii* agglutinin homolog (OAAH) family [30]. In this case, two β-barrels associate perpendicularly to build up the complete 3D-structure of the molecule, and the interaction with cell wall polymers occurs at two crevices symmetrically located at its two ends [31]. Such a role would be consistent with the finding that the PAC domain of AtAGP31 can interact with cell wall polysaccharides and *O*-glycans in vitro [11].

**Figure 5.** 3D-modeling of four PAC domains. (**A**) A representative PAC domain of Bryophytes: IQJU\_2004004\_Anthoceros\_formosae. (**B**) A PAC domain of *A. trichopoda*: AmTr\_v1.0\_068.122. (**C**) A representative PAC domain of the *O. thomaeum* monocot: Oropetium\_11363A. (**D**) A representative PAC domain of the *A. thaliana* dicot: At1g28290. The N-terminus (N-ter) and the C-terminus (C-ter) of the proteins are indicated in blue and red, respectively. Blue ribbons represent β-sheets. The three disulfide bridges are drawn in yellow, and the names of the Cys residues involved are indicated.

To test the role of the conserved Cys residues and, therefore, that of the disulfide bridges in the stability of the 3D-structure, in silico mutation experiments were performed. The possible 5 Cys-PAC domain variants were tested for the *P. lanceolata* PAC domain and for each of the eight *A. trichopoda* PAC domains, which were considered representative of the eight phylogenetic clades (see below). Each Cys residue was replaced by a Ser residue, and the change in stability was determined with MAESTRO (S11). In all cases, the ddG parameter, which reflects the change in unfolding free energy, was positive, indicating a destabilization of the 3D-structure. Altogether, it seems that the conserved Cys residues are critical for the stability of the β-barrel. This could indicate that domains lacking one Cys residue are impaired in their biological activity or more sensitive to changes in their physiological environment. The presence of a seventh or even an eighth Cys residue could have different consequences depending on the position(s) of the additional Cys residue(s), which may or may not be involved in different disulfide bridges. Only experimental work could reveal any change in the biological activity of the PAC domain.
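The set of 5 Cys variants tested for each PAC domain can be enumerated programmatically before being submitted to a stability predictor. The sketch below only generates the single Cys-to-Ser variants and applies the sign convention stated above (positive ddG taken as destabilizing). It does not call MAESTRO, and the short sequence in the usage example is invented.

```python
from typing import Dict


def cys_to_ser_variants(sequence: str) -> Dict[str, str]:
    """Return one variant per Cys residue, each with that single Cys replaced by Ser.

    Keys number the Cys residues in order of appearance ("Cys1", "Cys2", ...).
    """
    variants: Dict[str, str] = {}
    cys_number = 0
    for position, residue in enumerate(sequence):
        if residue == "C":
            cys_number += 1
            variants[f"Cys{cys_number}"] = sequence[:position] + "S" + sequence[position + 1:]
    return variants


def is_destabilizing(ddg: float) -> bool:
    """Sign convention used in the text: a positive ddG indicates destabilization."""
    return ddg > 0


# Usage with a made-up fragment containing three Cys residues.
for name, variant in cys_to_ser_variants("GCAACKC").items():
    print(name, variant)
```

For a domain with six conserved Cys residues, this yields six variants, each of which would then be scored separately by the stability predictor.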
