**1. Introduction**

Plant cell walls are composite structures mainly made of polysaccharides and proteins. Cellulose microfibrils and hemicelluloses form intricate networks, which are embedded in a pectin matrix [1]. Although present in minor amounts, the cell wall proteins (CWPs) play critical roles in polysaccharide organization and remodeling during growth and in response to environmental stresses [2,3]. Cell wall proteomics has revealed the great diversity of CWPs and allowed the discovery of unexpected CWP families [4]. Combined genetic and biochemical approaches have demonstrated the roles of CWPs in polysaccharide metabolism, biosynthesis of lipid-rich cell wall layers, and lignin monomer polymerization, as well as in signaling and ROS homeostasis maintenance [5–8].

Among the newly described CWP families, the PAC (Proline-rich Arabinogalactan protein and Conserved Cysteines) domain-containing protein (PDP) family deserves emphasis because its members are present in many cell wall proteomes (see *WallProtDB*, www.polebio.lrsv.ups-tlse.fr/WallProtDB/, query with "Ole e1 allergen domain" as a keyword). The name of the PDP family was initially proposed by Baldwin et al. [9], who described these proteins as a sub-family of non-classical arabinogalactan proteins (AGPs) containing both an AGP domain and a C-terminal domain with six cysteine residues (named Cys 1 to Cys 6 herein). Later on, a domain partly describing the PAC domain was proposed in the Pfam database (PF01190, http://pfam.xfam.org/). The first described member of this family was a protein from *Nicotiana alata* named AGPNa3 [10]. Several proteins closely related to AGPNa3 were subsequently studied (for a review, see [11]). Examples include *Daucus carota* DcAGP1 [12]; *Arabidopsis thaliana* AtAGP30 [13] and AtAGP31 (At1g28290) [14]; *Capsicum annuum* CaPRP1 [15]; *Gossypium hirsutum* GhAGP31 [16]; and *Petunia hybrida* PhPRP1 [17]. More recently, it appeared that the PAC domain could also be found alone, located at the N-terminus of the mature protein, or associated with different types of domains, such as a Histidine-rich region, an *O*-glycosylated Proline/Hydroxyproline-rich domain, or an extensin domain [18,19].

Functional studies on several of the *A. thaliana* PDPs have shown their diverse roles during plant development. *PRPL1* (*Proline-Rich Protein-Like*, *At5g05500*) has a trichoblast-specific expression and plays a role in root hair elongation, as shown by the reduced length of root hairs in the *prpl1* mutant [20]. Plants lacking *FOCL1* (*Fused Outer Cuticular Ledge 1*, *At2g16630*) produce stomata without a cuticular ledge, and thus, *focl1* mutants display drought tolerance [21]. *AtAGP30* (*At2g33790*) is involved in root regeneration in vitro and in the timing of seed germination [13]. *AtAGP30* is expressed in root atrichoblasts under the control of ABA signaling [22]. *AtAGP31* is expressed in vascular tissues and repressed by methyl jasmonate at the transcriptional level [14]. AtAGP31 has also been shown to accumulate in actively growing etiolated hypocotyls [23]. In vitro interactions have been demonstrated between its PAC domain and galactans or the Gal-Ara-rich *O*-glycans of its Proline/Hydroxyproline-rich domain [11]. These studies have led to the assumption that AtAGP31 could be involved in non-covalent protein/polysaccharide networks in the cell wall that play roles during rapid cell elongation [11].

Recently, the crystal structure of the PAC domain of an allergenic protein from *Plantago lanceolata* containing an N-terminal PAC domain (Pla l 1, a member of the Ole e 1–like protein family, PDB code 4Z8W) has been determined, highlighting the importance of β-sheets in its secondary structure [24]. In particular, the structure revealed a seven-stranded β-barrel with four loop regions. Three intramolecular disulfide bonds were found between (i) the β1b and β6 strands (Cys 1-Cys 5), (ii) the β2 and β5 strands (Cys 3-Cys 4), and (iii) the C-terminus and the loop C-terminal to the β2 strand (Cys 2-Cys 6), thus forming a closed branched loop. A detailed characterization of allergens of the same protein family suggested that they share the same core structure, whereas loop regions can be heterogeneous.

In this article, we aim to give an evolutionary overview of the PDPs throughout the green lineage, from Bryophytes to late-diverging plants, such as monocots and dicots. We first define the PAC domain characteristics more precisely in order to retrieve PAC domain sequences from available genomic or RNA-seq databases using a tailor-made bioinformatic script. Since the conservation of the primary amino acid sequences of PAC domains was rather low, and since the presence of β-sheets seemed to be essential for domain folding, bona fide PAC domains were selected according to their secondary structure conservation, and protein alignment was done using software that takes secondary structures into account. Modeling of tertiary structures was based on the available crystal structure of the Pla l 1 PAC domain. Finally, we could build a phylogenetic tree and sort the PAC domains according to their association with other domains. We could also investigate the occurrence of PAC domains in ancestral organisms.

#### **2. Results and Discussion**

#### *2.1. Characteristics of the PAC Domain and Search for New PDP Candidates*

The overall strategy used for this study is summarized in Figure 1. As a first step, and in order to obtain a better definition of a PAC domain, homologous sequences were identified in the *A. thaliana* genome using the sequence of the AtAGP31 PAC domain as a query. Altogether, 14 candidate sequences were identified and manually checked for the presence of the six conserved Cys residues: At1g29140, At1g78040, At3g09925, At4g08685, At4g18596, At5g45880, At5g54855, AtAGP31, At5g05500 (PRPL1), At5g15790, At2g33790 (AtAGP30), At2g34700, At4g18596, and At2g16630 (FOCL1). These sequences were then used to identify additional PDPs by sequence similarity in eight other angiosperm genomes: *Amborella trichopoda*, *Brachypodium distachyon*, *Oryza sativa*, *Sorghum bicolor*, *Populus trichocarpa*, *Eucalyptus grandis*, *Linum usitatissimum*, and *Gossypium raimondii*. About 50 putative PDPs were collected and manually checked for the presence of the six conserved Cys residues. From this first data mining step, it appeared that the level of conservation of the amino acid sequences of the PAC domains could be low. In particular, except between the first two conserved Cys residues (Cys 1 and Cys 2), the spacing between Cys residues could be variable. Thus, the usual homology-based mining was not sufficient, and an alternative strategy was necessary to obtain exhaustive results for each plant. The alignment of angiosperm PAC domains was used to calculate the range of spacing between the conserved Cys residues. Then, a tailor-made script based on several points detailed in Table 1 was set up to search for additional PDPs in the same genomes or in other genomics or transcriptomics databases. However, a signal peptide for protein secretion could not be predicted systematically for the proteins translated from transcriptomics data because the sequences could be incomplete. Furthermore, when genomic sequences were available, the presence of an intron between the sequence encoding Cys 1 and Cys 2, on the one hand, and the sequence encoding Cys 3 to Cys 6, on the other hand, was searched for to support the PAC domain identification.



**<sup>1</sup>** The number of amino acids between two successive Cys residues is indicated between brackets.
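The Cys-spacing search described above can be illustrated with a minimal Python sketch. It is not the authors' script: the function names are hypothetical, the toy sequences are invented, and the spacing ranges are derived from those toy sequences rather than taken from Table 1. The sketch also assumes that no additional Cys occurs between two successive conserved Cys residues, which is stricter than the search described in the text.

```python
import re

def cys_spacings(domain):
    """Number of residues between successive Cys in a domain sequence."""
    positions = [i for i, aa in enumerate(domain) if aa == "C"]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def spacing_ranges(domains):
    """(min, max) spacing for each of the five Cys-Cys intervals.

    Assumes every reference domain holds exactly six Cys residues.
    """
    rows = [cys_spacings(d) for d in domains]
    if any(len(r) != 5 for r in rows):
        raise ValueError("each reference domain must contain exactly six Cys")
    return [(min(col), max(col)) for col in zip(*rows)]

def build_pac_pattern(ranges):
    """Regex for six Cys separated by non-Cys stretches of bounded length."""
    parts = ["C"]
    for lo, hi in ranges:
        # Assumption: intervening residues are anything except Cys.
        parts.append(f"[^C]{{{lo},{hi}}}C")
    return re.compile("".join(parts))

def find_pac_domains(protein, pattern):
    """Return (start, end, matched sequence) for each non-overlapping hit."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(protein)]

if __name__ == "__main__":
    # Invented toy domains, used only to exercise the functions.
    references = ["CAACAAAACAAACAAAACAAC", "CGGCGGGGGGCGGCGGGGGCGGGC"]
    pattern = build_pac_pattern(spacing_ranges(references))
    print(find_pac_domains("MKK" + references[0] + "LL", pattern))
```

A real search would additionally need the other points listed in Table 1, such as signal peptide prediction, which this sketch does not attempt.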

Using this script, sequences encoding PAC domains were searched for in 78 plant species belonging to the green lineage, from Bryophytes (*Bryophyta*, *Marchantiophyta* and *Anthocerotophyta*) to late-diverging plants. Altogether, about 450 putative PAC domain sequences were collected (S1–S4).

Three additional criteria were then used to select bona fide PAC domain proteins. The first one was the number of conserved Cys residues. Indeed, we found putative PAC domains showing the expected characteristics, but containing only five Cys residues, or containing more Cys residues, up to nine (S1,S5). Although some of them had sequences very similar to those of six Cys-containing PAC domains (S5), we decided to dismiss them because a lack or an excess of Cys residues would modify the folding of the domain by generating disulfide bridges different from the expected ones. The second exclusion criterion was the absence of predicted β-sheets. Indeed, the crystal structure of the Pla l 1 PAC domain has highlighted the importance of these β-sheets in its secondary structure [24]. Some proteins with large predicted α-helices and/or no predicted β-sheets were dismissed on the basis of this criterion, especially in Bryophytes, Equisetales, and Alismatales (S1,S6). The third criterion was the presence of associated predicted functional domains suggesting intracellular functions, such as an aldehyde dehydrogenase domain (PF00171, *Tetraphis pellucida* HVBQ\_2004216) or the JmjC and JmjN domains of transcription factors (PF02373 and PF02375, *Pallavicinia lyellii* YFGP\_2007785) (S3). In most of these latter cases, it was not possible to predict the sub-cellular localization of the proteins because they resulted from the translation of incomplete contigs obtained from RNA-seq data.
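The three exclusion criteria can be summarized as a small filter. The following Python sketch is only an illustration under stated assumptions: the `Candidate` record, the per-residue secondary structure string (`E` for strand, `H` for helix, `C` for coil), and the `MAX_HELIX_FRACTION` threshold are hypothetical and are not taken from the study. Only the three Pfam accessions come from the text.

```python
from dataclasses import dataclass

# Pfam accessions named in the text as suggesting intracellular functions.
INTRACELLULAR_PFAM = frozenset({"PF00171", "PF02373", "PF02375"})

# Hypothetical cutoff for a "large" predicted alpha-helix content.
MAX_HELIX_FRACTION = 0.5

@dataclass
class Candidate:
    name: str
    domain_seq: str                      # putative PAC domain sequence
    secondary_structure: str             # assumed format: one of E/H/C per residue
    pfam_hits: frozenset = frozenset()   # other Pfam domains found in the protein

def rejection_reason(c):
    """Return the first failed criterion, or None for a retained candidate."""
    # Criterion 1: exactly six Cys residues in the domain.
    if c.domain_seq.count("C") != 6:
        return "cys_count"
    # Criterion 2: predicted beta-strands must be present, and helices
    # must not dominate (the threshold is an assumption of this sketch).
    ss = c.secondary_structure
    if "E" not in ss:
        return "no_beta_strand"
    if ss.count("H") / len(ss) > MAX_HELIX_FRACTION:
        return "large_helix"
    # Criterion 3: no associated domain suggesting an intracellular function.
    if c.pfam_hits & INTRACELLULAR_PFAM:
        return "intracellular_domain"
    return None

def select_bona_fide(candidates):
    """Keep only the candidates that pass all three criteria."""
    return [c for c in candidates if rejection_reason(c) is None]
```

Returning a reason rather than a boolean makes it easy to tabulate how many candidates each criterion removes, for instance per taxonomic group.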

**Figure 1.** Pipeline for Proline-rich Arabinogalactan protein and Conserved Cys (PAC) domain protein identification and phylogeny. The names of the bioinformatics programs and resources used at each step are indicated in brackets.
