*2.6. Conserved Amino Acid Motifs Inside Clades*

A search for conserved amino acid motifs was performed on the PAC domains of each clade of Tree II. The most significant results were found for clades A, B, D, E, G, H, and I (Figure 7). In each clade, the most conserved motifs were detected at the N-terminus of the PAC domain. This was consistent with the definition of the pollen Ole e 1 motif in the Pfam and Prosite databases (PF01190 and PS00925, respectively). However, the consensus defined for PS00925 exactly fitted only that of the clade A PAC domains ([EQT]-G-x-V-Y-C-D-[TNP]-C-R). Furthermore, the most conserved PAC domains were found in clade C (Figure 8). Their degree of conservation in the green lineage from Lycopodiales to Brassicales is impressive. Finally, the C-terminal W-W domain present in all the proteins belonging to clade D was also very well conserved from the Bryophytes to the Brassicales, with common motifs mostly located in its N-terminal half (S10).
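As an illustration of how a consensus such as the one quoted above can be checked against a sequence, the following minimal sketch converts a PROSITE-style pattern into a regular expression. It is not the procedure used in this study (motifs were identified with MEME or WebLogo3); the function names `prosite_to_regex` and `find_motif` and the toy sequence are assumptions introduced for the example, and only the basic pattern elements (`x`, `[...]`, `{...}`, and repeat counts) are handled.

```python
import re

# One PROSITE element: x, a single residue, [allowed], or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{"):           # forbidden residues
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# Consensus quoted in the text for clade A PAC domains.
CLADE_A_CONSENSUS = "[EQT]-G-x-V-Y-C-D-[TNP]-C-R"
CLADE_A_REGEX = re.compile(prosite_to_regex(CLADE_A_CONSENSUS))


def find_motif(sequence):
    """Return (1-based start, matched segment) for every motif occurrence."""
    return [(m.start() + 1, m.group()) for m in CLADE_A_REGEX.finditer(sequence)]


# Toy sequence, invented for the example (not taken from the data set).
print(find_motif("AAQGQVYCDTCRAG"))
```

A sequence that returns an empty list does not match the consensus exactly, which is the situation described above for the clades other than clade A.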

**Figure 7.** The most conserved motifs of PAC domains inside clades A, B, D, E, G, H, and I in PDPs of *A. trichopoda* and of plant families that appeared subsequently. The number of PAC domains in each clade is indicated, as well as the score of the conserved motif according to the MEME software.

**Figure 8.** The most conserved PAC domains from clade C. The comparisons have been made between 18 PAC domain sequences from Lycopodiales to Brassicales.

Combining sequence conservation with the accessibility of conserved residues on the protein surface should hint at functionally important sites, whereas conserved residues located in the protein core are more likely important for maintaining the fold. Conserved residues in loop regions may also have a functional role, although they appear less accessible in the static 3D-structural model, because loops are often flexible and may move considerably. We therefore defined a representative 3D-model for each clade, obtained the solvent accessibility and secondary structure of each residue, and aligned this information with the sequence profiles (S12). Indeed, many of the conserved sites are inaccessible to the solvent and located within or close to the β-sheets and are thus expected to maintain the fold. Candidates for a functional role are, for example, a Phe-x-Thr pattern in clade A (profile positions 11–13); a cluster of basic residues at positions 18–22 in clade B; the conserved charged residues Lys and Asp at positions 9 and 10 in clade D; and the amino acids Lys and Arg at position 35 in clade H. The reliability of such assumptions depends on the quality of the structural models. We calculated a model quality score with MAESTRO and related the scores of the models to scores of experimentally determined structures (S13). The scores of the models are in the range of that of the modeling template structure (PDB code 4Z8W), indicating that none of the models should be largely wrong.
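The reasoning in this paragraph can be summarized as a simple decision rule. The sketch below is only illustrative: the cutoffs (`cons_cutoff`, `rsa_cutoff`), the `Position` record, and all numerical values are hypothetical and are not taken from S12.

```python
from dataclasses import dataclass


@dataclass
class Position:
    index: int            # profile position
    residue: str          # consensus residue
    conservation: float   # 0 (variable) to 1 (invariant); hypothetical scale
    rsa: float            # relative solvent accessibility, 0 to 1
    in_loop: bool         # True if outside regular secondary structure


def classify(pos, cons_cutoff=0.8, rsa_cutoff=0.25):
    """Label a profile position following the rule described in the text.

    Both cutoffs are assumptions chosen for illustration only.
    """
    if pos.conservation < cons_cutoff:
        return "variable"
    # Conserved and exposed, or conserved in a flexible loop whose static
    # accessibility may be underestimated: candidate functional site.
    if pos.rsa >= rsa_cutoff or pos.in_loop:
        return "functional candidate"
    # Conserved and buried within regular secondary structure.
    return "structural"


# Invented example values, not results from this study.
toy_profile = [
    Position(5, "C", 1.00, 0.02, False),
    Position(12, "F", 0.95, 0.40, False),
    Position(20, "K", 0.90, 0.10, True),
    Position(30, "S", 0.30, 0.60, False),
]
for pos in toy_profile:
    print(pos.index, pos.residue, classify(pos))
```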

The conservation of motifs in PAC domains suggests common biological activities. It is possible to infer that their interactions with cell wall polysaccharides or *O*-glycans assumed from in vitro studies have been conserved and that the distribution of PDPs in the different plant families reflects differences in cell wall polysaccharides. Regarding the W-W C-terminal domain of the clade D PAC domains, its role remains to be unraveled. It is encoded by a distinct exon and could originate from exon shuffling [39].

#### **3. Materials and Methods**

#### *3.1. Databases*

The sequences used in this study were retrieved from different databases, such as Orchidstra 2.0 ([40], http://orchidstra2.abrc.sinica.edu.tw/orchidstra2/orchid\_blast.php), genome annotation databases ([41], http://genome.microbedb.jp/blast/blast\_search/klebsormidium/genes), Phytozome ([42], https://phytozome.jgi.doe.gov/pz/portal.html), and OneKP ([43], https://db.cngb.org/onekp/) (see S1). When necessary, nucleotide sequences were translated into amino acid sequences using EMBOSS transeq ([44], https://www.ebi.ac.uk/Tools/st/emboss\_transeq/).
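For readers unfamiliar with the translation step, the following minimal sketch shows what a forward-frame translation with the standard genetic code amounts to. It is not the EMBOSS transeq implementation; the function name `translate` and the example sequence are assumptions made for the illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides, frame=1):
    """Translate a nucleotide sequence in forward frame 1, 2, or 3.

    Stop codons are written as '*'; codons containing ambiguous
    bases are written as 'X'. Trailing incomplete codons are ignored.
    """
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(frame - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Invented toy sequence.
print(translate("ATGGGTTGCTAA"))           # frame 1
print(translate("ATGGGTTGCTAA", frame=2))  # frame 2
```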

#### *3.2. Comparisons and Alignment of PAC Domains*

The BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi) program was used for sequence comparison. Similarities between PAC domain sequences were calculated using either Blast2seq (https://blast.ncbi.nlm.nih.gov/Blast.cgi) or needle (http://www.bioinformatics.nl/cgi-bin/emboss/needle). The sub-cellular localization of proteins was predicted with TargetP-2.0 ([45], http://www.cbs.dtu.dk/services/TargetP/) and the presence of β-sheets and/or α-helices using SABLE ([46], http://sable.cchmc.org/) and NetSurfP ([47], http://www.cbs.dtu.dk/services/NetSurfP/). The selected PAC domains, starting at the Gly amino acid located three amino acids upstream of Cys 1 and ending at Cys 6, were aligned using PROMALS3D ([48], http://prodata.swmed.edu/promals3d/promals3d.php) to take into account the prediction of β-sheets. The phylogeny was calculated using MEGA7 ([49], https://www.megasoftware.net/) with the maximum likelihood option and 500 bootstraps. The presence of the PROSITE (PS00925, [50], https://prosite.expasy.org/) and Pfam (PF01190, [51], http://pfam.xfam.org/) domains was checked in the retrieved sequences. Inside clades, conserved motifs were identified using MEME ([52], http://meme-suite.org/tools/meme) or WebLogo3 ([53], http://weblogo.threeplusone.com/).

#### *3.3. Three-Dimensional Modeling*

For a subset of PAC domains, models were generated using MODELLER [54] and I-Tasser [55]. Disulfide bridges were defined beforehand based on alignments with PDB entry 4Z8W, corresponding to the *P. lanceolata* PAC domain [24]. Subsequently, these models were scored with MAESTRO [56], DOPE [57], and ProSA 2003 [58]. The top-scoring models were then relaxed with Rosetta [59], and finally, the relaxed models were scored with the same three methods.

We consistently used PAC domains from *A. trichopoda* as representative models for each clade. The relative solvent accessibility of these models was calculated by an adaptation of the Geometry library algorithm [60]. The secondary structure assignment was obtained by DSSP [61,62].

Both MODELLER and I-Tasser depend on template structures. MODELLER is a homology-modeling tool, which assumes significant sequence similarity between target and template structures in order to create a reliable alignment between them. Loops and sidechains are modeled with respect to the target sequence. The overall fold, however, is largely determined by the template structure. I-Tasser is a fold-recognition approach, in which sequence similarity between target and template does not play a major role. Moreover, I-Tasser builds the overall fold from structural fragments rather than from complete protein (domain) folds, so the final model is not determined by a single template. As such, it should be more applicable to PAC domain sequences with low similarity to the Pla l 1 PAC domain.

#### **4. Conclusions**

This study has refined the definition of PDPs by combining amino acid sequence features, secondary structures, and 3D-modeling. This protein family appeared early during the evolution of the green lineage. However, it has not been possible to identify with certainty a PAC domain ancestor in the presumed precursor organisms of the green lineage, even if the *C. orbicularis* PAC domain appeared as a possible candidate. The association of the PAC domain with Pro-rich sequences seems to be an ancient event: the most ancient sequence carrying both a PAC domain and a Pro-rich domain was found in Bryophytes, and those carrying both a PAC domain and extensin domains in Psilotales. Despite a great amino acid variability between PAC domains, the tertiary β-barrel structure strengthened by three disulfide bridges has been conserved in all bona fide PAC domains. Finally, the subset of PAC domains belonging to clade C is intriguing. Their very high level of conservation at the amino acid sequence level suggests that they play critical roles in plant cell walls. Defining the specificity of interaction of the different PAC domains with other cell wall polymers will be one of the next challenges to fully unravel the roles of PDPs in the cell wall architecture.

**Supplementary Materials:** Supplementary materials can be found at http://www.mdpi.com/1422-0067/21/7/2488/s1. S1: Number of PAC domain and PAC domain-related proteins in different plants from Bryophytes to Brassicales; S2: Amino acid sequences of PAC domain proteins in the green lineage; S3: Some examples of PAC domain-related proteins containing predicted functional domains suggesting intracellular functions; S4: Amino acid sequences of PAC domain-related proteins in ancestors to the green lineage; S5: Amino acid sequences of putative PAC domains with only five Cys residues, more than six Cys residues, or no Gly residue upstream of Cys 1; S6: Amino acid sequences of putative PAC domains with six Cys residues, but predicted α-helices; S7: Amino acid sequences of the PAC and W-W domains of Type 2-PDPs including a C-terminal W-W domain; S8: Top-scoring 3D-models of PAC domains and the corresponding scores. Some PAC domain 3D-models; S9: Expanded phylogenetic trees of PAC domains from Bryophytes to *A. trichopoda* (Tree I) and from *A. trichopoda* to Brassicales (Tree II); S10: The conserved W-W domain from PAC domains belonging to clade D from Bryophytes to Brassicales PDPs; S11: *In silico* mutagenesis experiment to test the stability of the 3D-structure of a set of PAC domains mutagenized on one of the six conserved Cys residues; S12: Solvent accessibility and secondary structure for each residue and alignment of this information with the conserved sequence profiles of PAC domains; S13: MAESTRO scores for PAC domain models in relation to MAESTRO scores for experimentally-determined structures taken from the PDB database.

**Author Contributions:** Conceptualization, E.J. and C.D.; methodology, E.J., C.D., G.G., J.L., P.L., and H.S.C.; validation, J.L., G.G., P.L., and E.J.; investigation, H.N.-K.; data curation, H.N.-K., E.J., H.S.C.; writing—original draft preparation, E.J.; writing—review and editing, all authors.; visualization, G.G., J.L., P.L. and E.J.; supervision, E.J.; project administration, E.J.; funding acquisition, E.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors are thankful to Université Paul Sabatier-Toulouse III (France) and CNRS for supporting their research work. HNG-K received a grant from the Vietnamese Ministry of Education and Training for his PhD work. This work was also supported by the French Laboratory of Excellence project entitled "TULIP" (ANR-10-LABX-41; ANR-11-IDEX-0002-02). JL is supported by the Austrian Science Fund (FWF, grant P30042).

**Conflicts of Interest:** The authors declare no conflict of interest.
