#### 2.20.1. Taxonomic Distribution

#### Ixodes

The transcriptome of *I. ricinus* contains approximately 70 unique transcripts varying in identity from 14 to 80%. A large group of single-domain peptides is present that are truncated at the C-terminal end of the N-terminal β-sheet structure (Figures S36A and S37A). These contain two disulfide bonds (four cysteines), with the first (DS1) linking the third strand from the larger β-sheet (β-sheet 1) with the second strand of the smaller sheet (β-sheet 2) and the second (DS2) linking the loop between the second and third β-strands of β-sheet 1 with some part of the C-terminal region of the protein. In this group of proteins, β-strand A of β-sheet 1 is absent in the models, and the N-terminus extends away from the protein and forms a short α-helical segment (Figure S37A). This group exhibits high overall sequence conservation, ranging between 60 and 85% amino acid identity.

Also present in *I. ricinus* is a larger group of peptides containing both the N- and C-terminal domains and having four disulfide bonds (eight cysteines). These are very diverse in amino acid sequence (17–70% identity) but are all similar in overall structure to the CirpT-type [93] complement inhibitors (Figures S36B and S37B). The N-terminal domain is much like the single-domain sequences discussed above. Many of these sequences are modeled by AlphaFold2 as lacking the first strand of β-sheet 1. In these cases, a segment of unstructured sequence is present at the N-terminus, suggesting that the full four-stranded β-sheet structure may be present (Figure S37B). The C-terminal domain of this group contains three disulfide bonds, most notably a cluster of two (DS2 and DS3) that is characteristic of all members of the 8.9 kDa family except those *Ixodes* variants that do not contain a C-terminal domain. This cluster contains two sequential cysteine residues, with the first linking to the loop between β-strands B and C of the N-terminal domain (DS2) and the second linking to the C-terminal end of β-strand D of the N-terminal domain (DS3). A fourth disulfide bond (DS4) links the extreme C-terminus with a second C-terminal domain cysteine residue, linking the strands that form a small two-stranded β-sheet (β-sheet 3) that is normally present in this domain (Figure S37B).

#### Amblyomma

Ticks in the genus *Amblyomma* produce a diverse set of peptides from the 8.9 kDa family. These include variations in the structure of the C-terminal domain as well as forms in which the entire two-domain structure is duplicated or partially duplicated to give proteins that could have multivalent binding properties. The *Amblyomma* forms contain 6, 8, 12 or 16 cysteine residues (3, 4, 6 or 8 disulfide bonds).

Six-cysteine forms from *Amblyomma* species are variable in sequence (20–80% amino acid sequence identity within the group) and contain the N-terminal domain seen in *Ixodes* proteins as well as a C-terminal domain that is truncated directly C-terminal to the "CC" double cysteine motif involved in DS2 and DS3 of the eight-cysteine forms described above (Figures S38A and S39A). Because of the shortened C-terminal domain in these forms, they do not contain DS4. Eight-cysteine variants are also present in *Amblyomma* which resemble those described in *Ixodes* and are quite diverse in sequence, showing about 20–80% amino acid identity within the group (Figures S38B and S39B). This group contains the C5 complement inhibitors and is similar in general structure to the eight-cysteine forms from *I. ricinus* (Figure S4B).

In addition to the six- and eight-cysteine forms of the 8.9 kDa family, *A. maculatum* (Koch, 1844) contains extended variants that have extra domains attached to the eight-cysteine structure. The simplest type has a three-domain structure in which a domain resembling the four-cysteine, single-domain forms described in *I. ricinus* is attached to the C-terminal end of the eight-cysteine, two-domain module (Figures S40 and S41A). The chain continues from the N-terminal module into the extra domain, forming a hairpin loop corresponding to β-sheet 2 of the *I. ricinus* single-domain protein, then into a three-stranded, antiparallel β-sheet corresponding to β-sheet 1 of *I. ricinus* four-cysteine proteins and the eight-cysteine proteins from all species (Figure S41A). This type contains twelve cysteine residues forming six disulfide bonds, which are conserved in position relative to those from the previously described proteins.

Also present in *A. maculatum* are variants with four domains made up of two complete eight-cysteine units fused end-to-end into a single polypeptide (Figures S40 and S41B). Like the three-domain forms described above, the second structural module of these proteins contains a modified N-terminal domain β-sheet structure, but also has a fully elaborated C-terminal domain containing a β-sheet 3 type structure (Figure S41B). These proteins contain sixteen conserved cysteine residues forming eight disulfide bonds.

#### Rhipicephalus

Like *Amblyomma*, *Rhipicephalus* (Koch, 1844) species contain 8.9 kDa family members having two domains with 6 or 8 cysteine residues as well as four-domain proteins containing 16 cysteine residues. An unusual group of 8-cysteine proteins with a shortened C-terminal domain is present. It contains an N-terminal domain with β-sheet 1 modeled as having three strands, and β-sheet 2 as having two or three strands (Figures S42A and S43A). The C-terminal domain is truncated relative to the 8-cysteine forms of *Ixodes* or *Amblyomma* but contains the "CC" double cysteine motif that makes up part of the two-disulfide bond cluster (Figure S43A). All members of this group are modeled by AlphaFold2 as having single free cysteines near the N-terminus and near the C-terminus of the protein, which are not in proximity to one another (Figure S43A). This suggests that they may form multimers or may adopt a structure, not captured by the models, in which these two unpaired cysteines are close enough to form a disulfide bond.

A group of "conventional" 8-cysteine forms is also found in *Rhipicephalus*, in which the domain structure and disulfide bonding pattern most closely resemble those of comparable proteins from *Amblyomma* or *Ixodes*. This is not to say they are highly similar at the sequence level or would be expected to be functionally homogeneous. Within this single species these proteins exhibit a range in amino acid identity of 17–52% (Figures S42B and S43B).

Finally, *R. appendiculatus* contains a set of extended variants like those of *Amblyomma* which contain four protein domains and sixteen cysteine residues. As in the *Amblyomma* forms, these are derived from the end-to-end fusion of two eight-cysteine forms into a single polypeptide. In this case, the N-terminus of the first eight-cysteine unit forms part of the C-terminal domain of the unit by forming one strand in β-sheet 3, resulting in it having three strands rather than the usual two. In a second model, the C-terminus of the protein integrates into β-sheet 1 of the C-terminal eight-cysteine unit, making it a four-stranded antiparallel sheet. The disulfide bonding pattern of this group is that expected from the fusion of two eight-cysteine units.

#### 2.20.2. C5-Binding Anticomplement Proteins

The only established function of the 8.9 kDa family is inhibition of complement by binding to component C5 and preventing its activation [93]. Orthologous 8-cysteine 8.9 kDa family members from *R. pulchellus* (Gerstäcker, 1873), *Dermacentor andersoni*, *R. sanguineus* and *A. americanum* have been found to function similarly. Using the crystal structure of CirpT1, the variant from *R. pulchellus*, complexed with the macroglobulin domain 7 of human complement C5 and the cryo-electron microscopy structure of CirpT1 complexed with C5 along with the inhibitors OmCI and RaCI1, we identified a block of eight residues in the interaction interface (Figure S44). Of the selected set of sequences from the TickSialofam database, only those eight-cysteine forms closely resembling the previously described inhibitors contained even a small number of amino acid identities within the selected sequence block. Few partial matches or weakly similar sequences were found in the set of *Ixodes* transcripts, suggesting that C5 inhibitors of this "type one" clade of the 8.9 kDa family are restricted to metastriate species. As anticipated, due to the high variability and low degree of sequence identity within the groups described here, only orthologs of CirpT1 appear to have a C5 binding function. The structural diversity revealed by AlphaFold2 modeling therefore suggests that the 8.9 kDa salivary protein family of the eight-cysteine type can be expected to perform multiple functions.

#### *2.21. The 8-kDa Family*

The 8-kDa family occurs in metastriate hard ticks and contains a cysteine knot structure characterized by a three- or four-stranded antiparallel β-sheet folded into an open-sided barrel stabilized by three disulfide bonds (Figures S45 and S46) [95]. The disulfides are clustered at the end of the β-sheet containing both the N- and C-termini. Two alternative disulfide bonding patterns are seen, one with a pattern (based on relative cysteine positions) of 1–6, 2–4, 3–5, and the second having a pattern of 1–4, 2–5, 3–6, which is also seen in the evasins. This family contains the RaCI complement inhibitors from *R. appendiculatus* which target complement factor C5 (Figures S45 and S46) [96]. These bind at a surface of C5 containing elements of the MG1, MG2 and C5d domains and prevent its cleavage by the C5 convertase. RaCI proteins from *R. appendiculatus*, *R. microplus* (Canestrini, 1888) and *D. andersoni* have been analyzed and found to act in a similar manner but have somewhat variable sequences in regions interacting with C5. This is explained by the large number of backbone interactions involved in C5 binding. RaCI peptides contain a lengthened loop between β-strands 1 and 2 that inserts into a pocket lying between the MG1 and MG2 domains of C5. This loop is not extended in other members of the family, suggesting that these binding interactions cannot occur and that these proteins do not bind C5.
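The "relative cysteine position" notation used for these bonding patterns can be illustrated with a short sketch. It ranks the cysteines of a domain from the N-terminus and reports each bond as a pair of ranks. The function name `connectivity_pattern` and the residue numbers are hypothetical and are not taken from any sequence in the database.

```python
def connectivity_pattern(cys_positions, bonds):
    """Express disulfide bonds by the rank order of the cysteines involved.

    cys_positions: residue numbers of all cysteines in the domain.
    bonds: pairs of residue numbers that are disulfide-bonded.
    """
    # Rank 1 is the most N-terminal cysteine, rank 2 the next, and so on.
    rank = {pos: i for i, pos in enumerate(sorted(cys_positions), start=1)}
    # Order each pair low-high, then order the pairs by their first rank.
    pairs = sorted(tuple(sorted((rank[a], rank[b]))) for a, b in bonds)
    return ", ".join(f"{i}\u2013{j}" for i, j in pairs)


# Hypothetical residue numbers, chosen only to illustrate the two patterns.
cysteines = [5, 12, 18, 30, 41, 47]
print(connectivity_pattern(cysteines, [(5, 47), (12, 30), (18, 41)]))
print(connectivity_pattern(cysteines, [(5, 30), (12, 41), (18, 47)]))
```

With these illustrative inputs, the first call yields the 1–6, 2–4, 3–5 pattern and the second the 1–4, 2–5, 3–6 pattern.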

#### *2.22. The 15-kDa Basic Family*

The 15-kDa basic family found in *Amblyomma* sp. is a cysteine knot variant similar in structure to the 8-kDa family [95]. It contains an antiparallel β-sheet domain with two or three strands stabilized by three disulfide bonds in the pattern (based on relative cysteine positions) 1–4, 2–5, 3–6. Some members contain two additional cysteine residues forming a potential fourth disulfide bond as part of a disordered N-terminal coil (Figures S47 and S48). C-terminal to the cysteine knot domain is a length of disordered coil, followed by one or two helical segments containing greater than five turns which are then followed by a second intrinsically disordered region.

#### *2.23. Complement-Binding Family*

Members of the complement-binding family are large proteins containing lectin or von Willebrand A (vWA) domains linked to strings of all-β-sheet, sushi-like domains, mostly stabilized by one or two disulfide bonds, that are reminiscent of complement control protein (CCP) domains. The lectin and vWA domains occur at the N-terminal end of the protein with the CCP domains extending out from them. One member of the group (JAR89651) from *I. ricinus* contains a fucose-binding lectin domain followed by a C-type lectin domain leading to ten repeated sushi domains. A second protein (JAR90946), also from *I. ricinus*, contains an N-terminal vWA domain followed by eight repeated modular domains (Figures S49–S51). The proteins contain large numbers of cysteine residues, and all were predicted by AlphaFold2 to participate in disulfide bonds. These structures suggest that in blood feeding, the N-terminal parts may bind to exposed carbohydrate or collagen patches and the repeated domains may function as modulators of complement function in the mode of factor H. Other members that can be categorized as belonging to this group include JAB71472 from *I. ricinus*, which contains only the repeated domains without the apparent lectin or vWA "anchors". Interestingly, these proteins (see alignments of JAR89651 and JAR90946, Figures S49–S51) show a high degree of similarity in their N-terminal parts, and complete conservation of cysteine residues, with large proteins from a wide variety of arthropods such as mites, crustaceans and horseshoe crabs that do not feed on vertebrate hosts. This suggests that they have functions in endogenous systems not involving vertebrate blood, such as immune surveillance. They are related to the sushi, von Willebrand factor type A, EGF, and pentraxin domain-containing (SVEP1) and CUB and sushi modular domain proteins (CSMD1) of vertebrates. CSMD1 is known to inhibit complement by a mechanism in which it serves as a cofactor to factor I in the degradation of C4b and C3b from the classical and alternative pathways of complement, respectively [97].

#### *2.24. Dae-2 Family*

The Dae-2 (domesticated amidase effector 2) proteins are a group of cysteine peptidases whose genes have been acquired by tick species by lateral transfer from bacterial genomes [98,99]. These serve an antimicrobial function by proteolytically cleaving cell wall peptidoglycan. The tick salivary Dae-2 proteins have been shown to have broader substrate specificity than microbial forms and are thought to act by controlling growth of skin microbes at the site of feeding. The catalytic cysteine and histidine residues (Cys 23 and His 73 in the bacterial Tae-2 numbering system) are conserved in all tick forms (Figure S52). Differences in surface structure, particularly along the substrate binding groove, are considered to be determinants of selectivity for peptidoglycans from different bacterial forms. Unlike the bacterial Tae-2 proteins, which contain no disulfide bonds, tick Dae-2 proteins contain a conserved disulfide linking Cys 58 and Cys 89 (in JAC30591 numbering, Figure S53). There is also a free cysteine at position 4 (JAC30591 numbering), and in most tick sequences this is paired with a cysteine at position 31 to form a second disulfide (compare JAC30591 and AEO34830 in Supplementary Figures S52 and S53).

## **3. Methods**

#### *3.1. TSFam Database*

From the TSFam database [5], 15,796 sequences obtained from tick salivary transcriptomes were selected based on the presence of a signal peptide indicative of secretion and the absence of transmembrane domains outside the signal peptide. The original database was clustered by blastp and paired-joining of the sequences to attain 25, 30, ..., 90, 95% identity over at least 70% of the longest sequence. Thus, for each degree of identity there are n clusters, sorted by decreasing abundance. Accordingly, each particular cluster can be identified by two numbers, the first being the identity threshold and the second the cluster number for that identity (Supplemental spreadsheet S1). The sequences from each cluster were used to construct a psiblast-generated PSSM, and these were combined, after proper annotation, within the database where tick sequences can be searched using the rpsblast tool (Supplementary TickSialoFam 2.0 database).
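The two-number cluster naming scheme can be sketched as follows. This is only an illustration under stated assumptions: "paired-joining" is read here as single-linkage joining of any two sequences whose pairwise identity meets the threshold, the 70% coverage filter is assumed to have been applied already when the identity table was built, and the names `cluster_at_threshold` and `label_clusters` are hypothetical rather than taken from the TSFam pipeline.

```python
from collections import defaultdict


def cluster_at_threshold(ids, pair_identity, threshold):
    """Single-linkage clustering (assumed): join pairs with identity >= threshold."""
    parent = {s: s for s in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pair_identity.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for s in ids:
        groups[find(s)].append(s)
    # Largest clusters first; ties broken by first member name for determinism.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g[0]))


def label_clusters(ids, pair_identity, thresholds=range(25, 100, 5)):
    """Map (identity threshold, cluster number) to the member sequences."""
    labels = {}
    for t in thresholds:
        clusters = cluster_at_threshold(ids, pair_identity, t)
        for n, members in enumerate(clusters, start=1):
            labels[(t, n)] = members
    return labels
```

Under this scheme a key such as `(90, 1)` would denote the most abundant cluster obtained at the 90% identity threshold.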

#### *3.2. AlphaFold2 Program*

The AlphaFold2 program [6] was run locally on the NIH Biowulf cluster in a Linux environment with 8 CPUs, one v100x GPU and 60 GB RAM, in either monomer or multimer mode. Disulfide bonds were predicted by calculating the distances between all cysteine sulfur atoms in the pdb files and assigning a disulfide bond to those pairs separated by less than 3 Å.
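The distance criterion described above can be sketched in a few lines. This is a minimal illustration, not the script used in the study: it assumes standard fixed-column PDB `ATOM` records, takes the cysteine sulfur to be the atom named `SG`, and uses the hypothetical function names `cysteine_sulfurs` and `predict_disulfides`.

```python
import math
from itertools import combinations


def cysteine_sulfurs(pdb_text):
    """Collect ((chain, residue number), (x, y, z)) for every cysteine SG atom."""
    atoms = []
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, residue name 18-20, chain 22,
        # residue number 23-26, coordinates 31-54 (1-based).
        if (line.startswith("ATOM")
                and line[12:16].strip() == "SG"
                and line[17:20] == "CYS"):
            residue = (line[21], int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((residue, xyz))
    return atoms


def predict_disulfides(pdb_text, cutoff=3.0):
    """Return (residue_a, residue_b, distance) for S-S distances below cutoff (Å)."""
    bonds = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(cysteine_sulfurs(pdb_text), 2):
        distance = math.dist(xyz_a, xyz_b)
        if distance < cutoff:
            bonds.append((res_a, res_b, distance))
    return bonds
```

Calling `predict_disulfides(open("model.pdb").read())` on a model file (the file name is a placeholder) would list every cysteine pair whose sulfur atoms lie within 3 Å of each other.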

#### *3.3. Dali*

The Dali program [7], available online: http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html (accessed on 1 July 2022), was used to compare the AlphaFold2 predictions to the structures available in the PDB database. The program was run locally on the NIH Biowulf cluster. The program generates statistical analyses of the comparisons. According to the manual, a Z-score above 20 indicates that two structures are homologous, a score between 8 and 20 that they are probably homologous, a score between 2 and 8 is a gray area, and a Z-score below 2 is not significant.
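The Z-score interpretation quoted from the manual can be written as a simple lookup. The sketch below is illustrative only; the function name `interpret_dali_z` is hypothetical, and because the text does not say how the boundary values 2, 8 and 20 are assigned, placing each boundary value in the band that starts at it (for 2 and 8) or ends at it (for 20) is an assumption.

```python
def interpret_dali_z(z_score):
    """Translate a Dali Z-score into the qualitative categories of the manual."""
    if z_score > 20:
        return "homologous"
    if z_score >= 8:   # boundary handling is an assumption
        return "probably homologous"
    if z_score >= 2:   # boundary handling is an assumption
        return "gray area"
    return "not significant"


# Illustrative values only, not results from the study.
for z in (25.3, 12.0, 4.5, 1.1):
    print(z, interpret_dali_z(z))
```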

#### *3.4. Disintegrin Searches*

We have previously scanned salivary proteins from blood-sucking arthropods for disintegrin motifs [19] after building prosite blocks (Supplementary File S1: Prosite disintegrin motifs). These blocks were used to search the tick salivary protein database (Supplemental spreadsheet S1) using the program ps\_scan.pl [100], available online: https://ftp.expasy.org/databases/prosite/.
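To show what a prosite-style motif search involves, the sketch below converts a pattern written in PROSITE syntax into a regular expression and reports matches in a set of sequences. It is not a substitute for ps\_scan.pl and handles only the basic syntax elements (`x`, `[...]`, `{...}`, repeat counts and the `<` and `>` anchors). The pattern and sequences shown are hypothetical examples built around the RGD tripeptide and are not taken from Supplementary File S1; the function names are likewise assumptions.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern (e.g. 'C-x(2)-R-G-D') to a compiled regex."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:                          # x(2) or x(2,4) style repetition
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except those listed
        else:
            core = element                   # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return re.compile("".join(parts))


def scan(sequences, pattern):
    """Return {name: [(1-based start, matched text), ...]} for each sequence."""
    regex = prosite_to_regex(pattern)
    # finditer reports non-overlapping matches only, a simplification.
    return {name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
            for name, seq in sequences.items()}


# Hypothetical pattern and sequences, for illustration only.
hits = scan({"seq1": "AACKKRGDACAA", "seq2": "AAAA"}, "C-x(2)-R-G-D-{P}-[CW]")
print(hits)
```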

## **4. Conclusions**

The addition of structural fold prediction algorithms to the classification of secretory salivary gland protein families adds a powerful dimension that allows confirmation and validation of various protein families, groups or folds. It also allows distantly related families or groups to be assigned to well-known families, and novel folds not yet determined by conventional structural biology methods to be predicted. The models provided by AlphaFold2 also allow identification of potential homodimers and give insights into the quaternary folds of proteins with multiple domains. In addition, the models provide insight into the potential functions and mechanisms of various families and provide a basis for assessment of structural integrity via disulfide bond predictions. AlphaFold2-based classification as utilized for the TSFam2.0 database has already improved the original database, while adding significant information that can be used in hypothesis-driven research on protein family function and evolution. The TSFam2.0 database is therefore a significant improvement on the original TSFam database and, with subsequent refinement and the addition of more tick protein families and structures, will result in a comprehensive classification of tick protein families.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms232415613/s1. (1) Supplemental Figures: PowerPoint file with manuscript figures and movies. (2) Supplemental File S1: Disintegrin motifs in prosite format used to scan tick salivary proteins using the program ps\_scan.pl, available online: https://github.com/ebi-pfteam/interproscan/blob/master/core/jms-implementation/support-mini-x86-32/bin/prosite/ps\_scan.pl. (3) Supplemental spreadsheet S1: Hyperlinked spreadsheet containing putative tick salivary proteins linked to comparisons to several databases and AlphaFold predicted structures. Clusterization of the proteins allowed for extraction of reverse position-specific motifs collected into the TSFam 2.0 database. The spreadsheet has links to pdb files, which need programs that are able to open them. We suggest the use of ChimeraX, available online: https://www.cgl.ucsf.edu/chimerax/download.html, or Swiss-PDBViewer, available online: https://spdbv.unil.ch/. (4) Supplementary TickSialoFam 2.0 database: Includes RPS models and a formatted database which should be used to query protein sequences by means of the rpsblast program from the NCBI Blast suite of programs, available online: https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

**Author Contributions:** B.J.M.: Study design; Data analysis; Manuscript draft; Manuscript revision. J.F.A.: Study design; Data analysis; Manuscript draft; Manuscript revision. J.M.C.R.: Study design; Software design; Data analysis; Manuscript draft; Manuscript revision. All authors have read and agreed to the published version of the manuscript.

**Funding:** J.M.C.R. and J.F.A. were supported by the Intramural Research Program of the National Institute of Allergy and Infectious Diseases (Vector-Borne Diseases: Biology of Vector Host Relationship, Z01 AI000810-21). B.J.M. was supported by the National Research Foundation of South Africa (Grant Number: 137966).

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found here: https://proj-bip-prod-publicread.s3.amazonaws.com/transcriptome/TickSialoFam/TSF2.0/SupSpreadsheet+1.xlsx.

**Acknowledgments:** This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**

