#### **1. Introduction**

Cellulose, a linear polysaccharide of glucose units joined by β-1,4-glycosidic linkages, is the most abundant biopolymer on Earth and is found in the cell walls of plants. It consists of long glucose chains packed tightly together by H-bonds and constitutes the chief load-bearing polysaccharide of the cell wall. It is embedded in a matrix of pectins and hemicelluloses, and is additionally impregnated by lignin in some instances [1]. Cellulases are grouped into endoglucanases (EC: 3.2.1.4), which randomly hydrolyse internal β-1,4-glycosidic bonds, and exoglucanases (cellobiohydrolases, EC: 3.2.1.91), which processively release mainly cellobiose from the reducing or non-reducing chain extremity [2]. Processive endoglucanases that possess the properties of both endo- and exocellulases have also been described [3,4].

Based on amino acid sequence similarity, cellulases are classified into different glycosyl hydrolase (GH) families [5,6]. For example, endocellulases span GH families 5–10, 12, 26, 44, 45, 48, 51, 61, 74, and 124, whereas exocellulase members are found in GH families 5, 6, and 9 (CAZy database, available online: http://www.cazy.org/Glycoside-Hydrolases.html). Most cellulases involved in the degradation of cellulose derived from plant lignocellulosic biomass are produced by bacteria, archaea, fungi, and protozoa [7]. Some bacteria, oomycetes, protozoa, sea squirts, the fungus *Microdochium nivale* [8] and especially plants synthesize cellulose for growth and development and, hence, require cellulases to degrade, modify, and remodel it [9]. Some microorganisms (bacteria, fungi, protozoa) that live in a symbiotic relationship within the guts of phytophagous organisms also produce cellulases [10]. It was later discovered that, apart from the cellulolytic enzymes of their symbionts, invertebrates also possess endogenous cellulases secreted by the salivary glands and the gut [11]. Until recently, cellulose catabolism was considered to be limited to heterotrophic organisms and higher plants (for remodeling cellulose). However, in 2012, it was experimentally established that the photosynthetic microalga *Chlamydomonas reinhardtii* (Cr) can utilize cellulose for growth when other carbon sources are absent or limiting by secreting endocellulases [12]. The alga combines features of both plants and animals (it is considered a "planimal" [12]), and has a genome characterized by an expansion of transporter gene families, indicative of an adaptation to life in soil environments [13].

In view of the biotechnological applications of novel cellulases in the degradation of lignocellulosic biomass to produce biofuel, we here bioinformatically analyze, for the first time, cellulases from three microalgal species whose complete genomes have been published and compared [14,15]. We choose cellulase homologs from microalgae of increasing multicellularity (the unicellular alga *C. reinhardtii* and the colonial algae *Gonium pectorale*, Gp, with 16 cells, and *Volvox carteri*, Vc, with 2000–6000 cells) and compare their sequences with cellulases from diverse taxonomic groups. We model all the microalgal cellulase homologs and analyze in detail their conserved motifs, phylogenetic relationships, domain arrangement, and active-site architecture, in addition to examining carbohydrate-binding modules (CBMs) and linker regions. We conclude this study by determining the expression levels of three cellulases in Gp in a control condition and after the addition of a crystalline cellulose substrate (filter paper) to the growth medium.

#### **2. Results and Discussion**

The present work is based on the discovery that the photosynthetic microalga *C. reinhardtii* can secrete cellulases into the medium under CO₂-limiting conditions, although cellulase secretion was not detected in the closely related *Chlorella kessleri* [12]. Interestingly, *Chlorella* has cellulose in its cell wall, whereas Cr, Gp, and Vc do not [16]. In the present paper, we discuss the sequence and structural analysis of cellulases from three members of Chlorophyceae (Cr, Gp, and Vc) with increasing cellular complexity (from single cells to colonies).

#### *2.1. Algal Cellulases Belong to Glycosyl Hydrolase Family 9*

The structurally and functionally important conserved residues show that all algal catalytic domain (CD) sequences belong to the inverting GH9 family of CAZymes (Carbohydrate-Active Enzymes), which has an (α/α)₆-barrel topology. Glucanases belonging to GH family 9 are considered the most conserved cellulases and are widely distributed among bacteria, fungi, amoebozoa, invertebrate metazoans, mosses, ferns, gymnosperms, and angiosperms [17]. Three conserved regions are identified in the CDs of algal cellulases (Figure 1 lower panels, Supplementary Figure S1a), consistent with the motifs/patterns of GH9 cellulases reported from diverse taxonomic groups [17]. The variation of amino acids at each position within each region is compared between microalgal cellulases (Figure 1, lower panels) and all other GH9 cellulases described (Figure 1, upper panels) [17]. Region I of microalgal cellulases contains the characteristic DAGD motif in which, in addition to the H-bonding of the acidic residues with water (Figure 1, lower panel and asterisks), the C-terminal D acts as the catalytic base that abstracts a proton from the nucleophilic water and the N-terminal D acts as an essential supporting residue [3,18–20]. The pattern corresponding to Region I, [LVS]-x-[GK]-G-[WFYLM]-[YHF]-D-[ACGS]-G-[DSN]-x(2)-[KMR]-[FAILY]-x-[FWYLQTV]-[APTNS]-[MLGAQS], has now been included in the PROSITE database (PS60032) [21]. Interestingly, among GH9 enzymes from other organisms, the catalytic-base D of Region I is replaced by N in the angiosperm *Medicago* and by G in a few sea-squirt isoenzymes [17]; however, the activities of these enzymes have not been determined. In Region I of GH9 enzymes from all other organisms, two G residues and one K residue are also conserved; however, their role in catalysis has not yet been elucidated (Figure 1, upper panel).

The comparison of Region II (PROSITE pattern, PS00592) reveals that, although the H and R residues (Figure 1, upper panels) are involved in substrate binding via H-bonding [18,19], neither residue is fully conserved. The H is replaced by V in Vc and by S in *Panesthia cribrata* (Metazoa), whereas R is replaced by K or S in microalgae and by A or G in many GH9 cellulases [17]. An interesting finding about Region II of algal cellulases is the presence of an extra four-residue sequence (PT[PTA][YSG]) (Figure 1, lower middle panel), which is missing in non-algal GH9 enzymes; among the algal sequences, it is absent only from two cellulase homologs (CrCel9D and Gp KXZ44756) (Supplementary Figure S1a). The PROSITE pattern, PS00592, has now been revised to [HLY]-[AILMV]-[FIL]-G-x-[NSTW]-x(2,4)-[SCTV]-[FY]-[LIVMFY]-[SITV]-G-x(1,5)-[GSY]-x(2)-[AFPSTY]-[FLPSV]-x(2)-[AILPQVM]-[HV]-[DHLS]-[KRS] [21].
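PROSITE-style patterns such as those given for Regions I and II can be scanned against a protein sequence by translating them into regular expressions. The following is a minimal sketch of that idea, not the procedure used in this study: the helper names (`prosite_to_regex`, `find_motif`) and the toy peptide are hypothetical, the peptide was constructed only so that it satisfies the Region I pattern, and only the subset of PROSITE syntax that occurs in these patterns (residue sets, `x` wildcards, and repeat counts) is handled.

```python
import re

# One pattern element: an allowed set [..], an excluded set {..}, or a single
# letter, optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    # Elements are separated by '-'; stray spaces are ignored.
    for element in pattern.replace(" ", "").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):
            core = "."  # x (or X) is treated as "any residue"
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} lists residues that are not allowed
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(sequence: str, regex: "re.Pattern[str]") -> list:
    """Return (1-based start position, matched segment) for each match."""
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Region I pattern as quoted in the text.
REGION_I = (
    "[LVS]-x-[GK]-G-[WFYLM]-[YHF]-D-[ACGS]-G-[DSN]-x(2)-[KMR]-[FAILY]-x-"
    "[FWYLQTV]-[APTNS]-[MLGAQS]"
)
REGION_I_REGEX = re.compile(prosite_to_regex(REGION_I))

# Hypothetical toy peptide (NOT a real cellulase sequence): three flanking
# residues, an 18-residue stretch built to satisfy Region I, two more residues.
TOY_SEQUENCE = "MSK" + "LAGGWYDAGDAAKFAFPM" + "TT"

if __name__ == "__main__":
    print(prosite_to_regex(REGION_I))
    print(find_motif(TOY_SEQUENCE, REGION_I_REGEX))
```

The revised Region II pattern can be passed through the same function, because its variable-length gaps, such as x(2,4) and x(1,5), map directly onto bounded regex quantifiers.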

Residues in Region III are involved in substrate binding and catalysis (Figure 1, asterisks), with the fully conserved E acting as an acid that protonates the leaving group [19,22] and stabilizes the positively charged oxocarbonium transition state [18,23]. The fully conserved nucleophilic D forms H-bonds with the residues of the active-site loop, comprising Regions I and II, to bring it into proper alignment [18].


**Figure 1.** Variation in amino acids at each position within three conserved motifs of microalgal cellulases compared to other consensus sequences of GH9 cellulases from across the taxonomic groups. The sequences from different taxonomic groups were chosen as mentioned in Figure 2. Upper panels, sequences from [17]; lower panels, microalgal sequences (this study). Gaps are denoted by dashes. \*, catalytic and binding residues. Blue residues, variations in algal sequences. The extra four residues in Region II are found in all algal cellulases, except CrCel9D and Gp KXZ44756. "X" refers to extra residues in Region II not shown in [17]. The pattern corresponding to Region I updates the PROSITE database.

#### *2.2. Algal Cellulases Are Closest to Invertebrate Metazoan GH9 Enzymes*

The percentage identity matrix (Supplementary Figure S1b) and the phylogenetic analysis (Figure 2) of the CD regions (such as the blue highlighted regions in Supplementary Figure S1a) of GH9 cellulases reveal that algal cellulases are closer to invertebrate metazoan enzymes than to plant enzymes. The identity matrix of GH9 family cellulases from selected groups shows that algal enzymes have the highest (33–42%) and lowest (17–36%) sequence identity with proteins from invertebrates and bacteria, respectively, whereas the identity with cellulases from plants is 27–33% (Supplementary Figure S1b).
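For readers unfamiliar with identity matrices, the sketch below shows one common way to compute pairwise percentage identity from sequences that have already been aligned. It is an illustration under stated assumptions rather than the calculation behind Supplementary Figure S1b: identity is defined here as identical residues divided by the number of alignment columns in which neither sequence has a gap (alignment programs differ in the denominator they use), and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage identity between two rows of the same alignment.

    Assumption: columns where either sequence has a gap are ignored, so the
    denominator is the number of columns in which both rows hold a residue.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def identity_matrix(alignment: dict) -> dict:
    """Pairwise identities for every unordered pair of sequence names."""
    return {
        (name_a, name_b): percent_identity(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(alignment, 2)
    }


# Hypothetical toy alignment (NOT real cellulase sequences).
TOY_ALIGNMENT = {
    "seq_a": "AC-DE",
    "seq_b": "ACWDF",
    "seq_c": "ACWDE",
}

if __name__ == "__main__":
    for pair, value in identity_matrix(TOY_ALIGNMENT).items():
        print(pair, f"{value:.1f}%")
```

Because the choice of denominator changes the reported values, identities produced this way are comparable only with matrices computed under the same convention.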

Full-length sequences of GH9 cellulases have other domains, such as CBM, immunoglobulin (Ig-like), fibronectin type III (Fn3-like), and dockerin, which can produce a bias during phylogenetic analysis [17]. To determine the similarity accurately, all cellulases were truncated to comprise only the CDs (Supplementary Figure S2) and the phylogenetic tree was then constructed. The phylogenetic analysis revealed that truncated GH9 cellulases cluster together within a taxonomic group and that algal cellulases are closer to invertebrate metazoan enzymes (Figure 2), as suggested by the identity matrix (Supplementary Figure S1b). The GH9 enzymes belonging to the representatives of a specific (Sub)Kingdom (notably, Eubacteria, Fungi, Metazoa, Plantae) cluster together, except for *Ciona savignyi*, which does not form a cluster with the other representatives of Chordata, *Branchiostoma floridae* and *Ciona intestinalis*, but clusters, instead, with arthropods. *C. intestinalis* forms a separate branch where the two cellulases analyzed are found together. The only representative of the Kingdom Protista (Phylum Amoebozoa), *Dictyostelium discoideum*, clusters in a sister clade to the one formed by bacteria. Interestingly, the cellulases from the green microalgae *G. pectorale*, *C. reinhardtii*, *V. carteri*, and *Chlorella zofingiensis* do not cluster with GH9 from Plantae, but instead form separate groups. This is consistent with the hypothesis that GH9 genes are related by vertical descent and not by horizontal gene transfer [17]. In particular, the *G. pectorale* GH9 KXZ44756 is sister to *C. reinhardtii* CrCel9D, which is the gene described as being strictly induced by crystalline cellulose [12].

**Figure 2.** Maximum likelihood phylogenetic tree of GH9 cellulases (built using the catalytic domains, CDs, of the protein sequences; see Supplementary Figure S2) from different species (100 bootstraps). The circles indicate bootstrap support (range 0.6–1; the size of the circles is proportional to the bootstrap values). The names of the species analyzed and their accession numbers are indicated in the tree. The *V. carteri* cellulases are indicated as VC2958622 and VC2952174. The different colors represent the different taxonomic groups, i.e., (Sub)Kingdoms, Phyla, Divisions, or Orders.
