**1. Introduction**

Proteoglycans (PGs) constitute a diverse family of proteins that occur in the extracellular matrix (ECM) and pericellular matrix (PCM) and on the surface of mammalian cells. They consist of a core protein and one or more covalently attached glycosaminoglycan (GAG) chains. PGs play critical roles in numerous biological processes, which are mediated by both their protein part and their GAG chains [1,2].

The term GAG covers six major polysaccharides in mammals: chondroitin sulfate (CS) [3], dermatan sulfate (DS), heparin (HP), heparan sulfate (HS) [4,5], hyaluronan (HA) [6], and keratan sulfate [7,8]. Their molecular mass ranges from a few kDa to several million Da for hyaluronan. Despite significant compositional differences, GAGs share common features. They are linear polysaccharides made of disaccharide repeats. The disaccharides are composed of a uronic acid and a hexosamine, alternately linked through 1-4 and 1-3 glycosidic bonds (Figure 1), except for keratan sulfate, which involves galactose (Gal*p*) and N-acetylglucosamine (Glc*p*NAc) [7]. In contrast to the five other GAGs, hyaluronan is not sulfated and does not bind covalently to proteins to form proteoglycans. Variations in the pattern of GAG sulfation at various positions create an impressive structural diversity. Two hundred and two unique disaccharides of mammalian GAGs have been identified so far, including 48 theoretical disaccharides in HS [9].
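The figure of 48 theoretical HS disaccharides can be illustrated with a short combinatorial enumeration. The sketch below is only one plausible way to arrive at that number: it assumes two uronic acids (GlcA or IdoA), optional 2-O-sulfation of the uronic acid, three states of the glucosamine nitrogen (N-acetylated, N-sulfated, or free amine), and optional 3-O- and 6-O-sulfation of the glucosamine. This decomposition, the label format, and the function name `enumerate_hs_disaccharides` are illustrative assumptions; they are not taken from [9].

```python
from itertools import product

# Assumed variable positions of a heparan sulfate disaccharide
# (illustrative decomposition, not quoted from the cited reference).
URONIC_ACIDS = ("GlcA", "IdoA")        # uronic acid epimer
URONIC_2O = ("", "2S")                 # optional 2-O-sulfation of the uronic acid
N_SUBSTITUENTS = ("NAc", "NS", "NH2")  # state of the glucosamine nitrogen
GLCN_3O = ("", "3S")                   # optional 3-O-sulfation of the glucosamine
GLCN_6O = ("", "6S")                   # optional 6-O-sulfation of the glucosamine


def enumerate_hs_disaccharides():
    """Return one text label per theoretical HS disaccharide."""
    labels = []
    for ua, s2, n, s3, s6 in product(
        URONIC_ACIDS, URONIC_2O, N_SUBSTITUENTS, GLCN_3O, GLCN_6O
    ):
        uronic = ua + (f"({s2})" if s2 else "")
        o_sulfates = [s for s in (s3, s6) if s]
        hexosamine = "Glc" + n + (f"({','.join(o_sulfates)})" if o_sulfates else "")
        labels.append(f"{uronic}-{hexosamine}")
    return labels


if __name__ == "__main__":
    disaccharides = enumerate_hs_disaccharides()
    print(len(disaccharides))  # 2 * 2 * 3 * 2 * 2 combinations
    print(disaccharides[:3])
```

Under these assumptions the count is the product of the number of options at each position, 2 × 2 × 3 × 2 × 2 = 48. Not all of these combinations have necessarily been observed in nature, which is why they are described as theoretical.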

In addition to their contribution to the physicochemical properties of PGs, GAGs play an essential role in the organization and assembly of the extracellular matrix. They also regulate numerous biological processes by interacting with proteins in the extracellular milieu and at the cell surface. The six mammalian GAGs were shown to interact with 827 proteins in the recently published GAG interactome [10].

Many of these GAG interactions have been investigated and characterized in health and disease. According to [10], they take place in various locations (intracellular, cell surface, secreted, and blood proteins), and the protein partners range from individual growth factors (e.g., fibroblast growth factor-2) to large multidomain extracellular proteins such as collagens I and V and fibronectin, with different affinities and half-lives [11,12]. These proteins are involved in a variety of biological processes such as extracellular matrix assembly, cell signaling, development, and angiogenesis [10,13,14]. In addition, glycosaminoglycans play a role in host-pathogen interactions by binding to bacterial, viral, and parasite proteins [15–20]. The importance of understanding and mastering the molecular features underlying the interaction of GAGs with proteins was masterfully demonstrated by the development of antithrombotic drugs, as reviewed in [21].

**Figure 1.** Main repeating units of glycosaminoglycans. The color-coding of the constituent monosaccharides complies with the SNFG nomenclature [22]. The abbreviations are as follows: Glc*p* for glucose, Ido*p* for idose, Gal*p* for galactose, N for amine, S for sulfate, A for acid, and NAc for N-acetyl.

The length, sequence, substitution pattern, charge, and shape of GAGs control both their physicochemical properties and their biological functions. Understanding the functions of GAGs first requires methods to accurately assess their molecular weight, composition, and sequence. This is made possible by ongoing progress in mass spectrometry, and heparan sulfate has been sequenced by liquid chromatography-tandem mass spectrometry (LC-MS/MS) [23–27]. Furthermore, the structural and conformational complexity of GAGs challenges the characterization of their three-dimensional features using either experimental or theoretical methods. In a sense, GAGs concentrate most of the difficulties faced in structural glycoscience: they combine the challenges associated with both glycans and polyelectrolytes. Several experimental techniques have been used to solve GAG structures, including fiber X-ray crystallography, nuclear magnetic resonance (NMR) [28,29], electron microscopy, small-angle X-ray scattering (SAXS) [30], and neutron scattering (elastic incoherent neutron scattering, EINS [31], and small-angle neutron scattering, SANS [32]). No single technique can cope with such complexity, but computational methods offer valuable tools to integrate the partial information collected experimentally. The experimental data, in turn, are useful to validate and improve simulation strategies. However, these approaches remain limited by the intrinsic properties of GAGs. Like other complex glycans, they are highly flexible and form many solvent-mediated interactions, and they additionally have a polyanionic character. Nevertheless, progress in this field is underway, as detailed in [33], which investigates structures from monosaccharides to polysaccharides.

GAG-protein complex structures available in the PDB have been compiled by Samsonov and coworkers [34]. They concluded that this dataset does not represent the diversity of natural GAG sequences. This implies that computational approaches will be critical in understanding GAG structural biology and the mechanisms of GAG interaction with protein partners [35–37]. Significant progress has been made in investigating GAG structures, both isolated and in complex with proteins, at all-atom and coarse-grained levels [33,38–41]. However, appropriate tools for data mining of GAG-protein interactions are still missing [12,14].

MatrixDB (http://matrixdb.univ-lyon1.fr/) is a biological database focused on molecular interactions between extracellular proteins and polysaccharides [42]. It offers a first step toward investigating the molecular mechanisms of GAG-protein interactions. In this resource, building and displaying three-dimensional structural models of GAGs was rationalized through an effort to standardize the format of GAG sequences and to group GAG disaccharides into a limited number of families [9]. However, the relative spatial orientations of key GAG chemical groups interacting with (potential) "hot spots" on the proteins were not characterized. The conformational features displayed by long-chain GAG polysaccharides were not considered either. To move forward, we collected further experimental data on GAGs and GAG-protein interactions from databases and the relevant literature.

Experimental details of the three-dimensional structures of proteins and protein complexes are comprehensively recorded in the Protein Data Bank (PDB) [43]. Although the PDB is an essential repository, the glycan-related data it stores are not easily accessible to non-glycoscientists. This difficulty was identified by the glycoscience community and gave rise to several initiatives. Tools were designed to correct inconsistencies in the data [44–46]. Data were organized in publicly available databases, such as Glyco3D [47], UniLectin3D [48], and MatrixDB [42] for GAG-extracellular protein complexes, which are cross-referenced and interoperable with glycomic and other omic databases to ease data access and analysis. We now report the development of GAG-DB, a database containing three-dimensional data on GAGs and GAG-protein complexes retrieved from the PDB. It includes protein sequences and the standard nomenclature of GAG composition, sequence, and topology. It provides a family-based classification of GAGs, cross-referenced with glyco-databases, with links to UniProtKB via accession numbers [49]. The 3D visualization of contacts between GAGs and their protein ligands is implemented via the protein-ligand interaction profiler (PLIP) [50], and the nature of the structure that GAG polysaccharides can adopt, either in the solid state or in solution, is also reported. Finally, the characterized quaternary structures of the complexes improve our understanding of whether and how GAGs participate in long-range, multivalent binding, with potential synergy when several chains are involved in the interactions.
