*2.1. Identification of Toxins from Transcriptomic and Proteomic Analysis*

In this study, we used a proteotranscriptomic approach to characterize the venom of *C. iheringi*, since no protein or gene sequences from this species were available in public databases. The venom was therefore submitted to proteome analysis, while mRNA extracted from the venom gland was submitted to transcriptome sequencing.

The *C. iheringi* venom gland mRNA was extracted and sequenced using Illumina HiSeq 1500 technology (Figures S1 and S2 of the supplementary material). A total of 15,904,398 paired-end reads were generated. After pre-processing quality control, filtering, and trimming, 14,964,551 (94.1%) high-quality reads remained. The assembly of the *C. iheringi* venom gland transcriptome yielded 88,774 transcripts with an average length of 766 bp and a transcript N50 of 1104 bp; 16,266 transcripts (18.3%) were longer than 1 kb (Table 1). We evaluated the completeness of the *C. iheringi* transcriptome assembly using BUSCO (Benchmarking Universal Single-Copy Orthologs), searching against the 954 metazoan ortholog groups, and identified 934 (97.8%) of the conserved groups; of these, 885 (92.7% of the total) were complete and 49 (5.1%) were fragmented.
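The assembly statistics reported above can be illustrated with a minimal sketch. It assumes the conventional definition of N50 (the length at which transcripts of that length or longer contain at least half of all assembled bases); the function names and the toy length list are hypothetical and are not taken from the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a list of transcript lengths (in bp).

    Assumed definition: the length L such that transcripts of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no transcript lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def percent_longer_than(lengths, threshold_bp=1000):
    """Percentage of transcripts strictly longer than threshold_bp."""
    return 100 * sum(1 for n in lengths if n > threshold_bp) / len(lengths)


# Hypothetical toy assembly, not data from this study.
toy_lengths = [200, 300, 500, 800, 1200, 2000]
toy_n50 = n50(toy_lengths)
toy_long = percent_longer_than(toy_lengths)

# The percentages quoted in the text follow directly from the reported counts.
retained_reads_pct = round(100 * 14_964_551 / 15_904_398, 1)
long_transcripts_pct = round(100 * 16_266 / 88_774, 1)

print(toy_n50, round(toy_long, 1), retained_reads_pct, long_transcripts_pct)
```

The last two values recompute the read-retention rate and the share of transcripts longer than 1 kb from the counts given in the paragraph.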

Alignment of the *C. iheringi* transcriptome assembly against the 106,197 transcripts from 10 species of the order Scolopendromorpha (Table 2) (*C. anomalans, H. marginata, S. alternans, S. cingulata, S. dehaani, S. morsitans, S. subspinipes, S. viridis, S. rubiginosus, S. sexspinosus*) yielded a total of 5328 (6%) *C. iheringi* hits, with *Cryptops anomalans* showing the highest identification rate, at 4272 hits (4.83%). In the sequence similarity searches by BLASTx alignment, 71.4% of the transcripts had no match and were considered unknown. Therefore, only 28.6% of all transcripts had at least one protein homolog in the UniProt and TSA databases.
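As an illustration of how the annotated and unknown fractions can be derived from a similarity search, the sketch below counts the transcripts that have at least one hit in BLAST tabular output (the 12-column `-outfmt 6` layout, whose first column is the query identifier). The transcript identifiers, subject accessions, and hit lines are hypothetical, and the filtering thresholds applied in this study are not assumed here.

```python
def annotated_fraction(blast_lines, all_transcript_ids):
    """Fraction of transcripts with at least one hit in tabular BLAST output."""
    all_ids = set(all_transcript_ids)
    with_hit = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query_id = line.split("\t")[0]  # first column is the query sequence id
        if query_id in all_ids:
            with_hit.add(query_id)
    return len(with_hit) / len(all_ids)


# Hypothetical toy data: five transcripts, two of which have hits.
toy_ids = ["t1", "t2", "t3", "t4", "t5"]
toy_blast = [
    "t1\tsubjA\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-50\t210",
    "t1\tsubjB\t70.2\t118\t35\t1\t4\t357\t3\t120\t2e-31\t150",
    "t3\tsubjC\t91.3\t80\t7\t0\t10\t249\t5\t84\t4e-40\t170",
]

annotated = annotated_fraction(toy_blast, toy_ids)
unknown = 1 - annotated
print(f"annotated: {annotated:.1%}, unknown: {unknown:.1%}")
```

Multiple hits for the same transcript are counted once, so the result is the share of transcripts with at least one homolog rather than the number of alignments.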

To further characterize the toxin sequences, the crude venom was analyzed by LC-MS/MS, and we then performed automatic peptide matching against the proteins predicted from the *C. iheringi* transcriptome. The sequences identified by this approach were labeled as putative known toxins if they were present in a public database; if not, they were referred to as putative unknown toxins.

**Table 1.** Description of the transcriptome sequencing and assembly of *Cryptops iheringi* and of the transcriptome completeness analysis by BUSCO.


**Table 2.** Number of transcripts from TSA/NCBI for each species of the order Scolopendromorpha and number of hits from the *C. iheringi* transcriptome assembly against each of them.


Furthermore, the proteins predicted from the *C. iheringi* transcriptome that were not identified by the approach above were labeled as non-toxins, aligned against the Gene Ontology (GO) database, and classified according to their main biological category, following the GO nomenclature. The remaining sequences, which had no match in the searched databases and were not identified through the association between the transcriptome and the total venom proteome, were called unknown transcripts and were not explored further.
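The labeling scheme described above can be summarized as a simple decision rule. The sketch below is one reading of that scheme, not the authors' implementation: the two boolean inputs (whether a predicted protein was matched by venom peptides, and whether it has a homolog in the searched databases) and the function name are assumptions introduced for illustration.

```python
def label_predicted_protein(matched_in_venom_proteome, has_database_homolog):
    """Assign one of the four labels used in the text (illustrative reading)."""
    if matched_in_venom_proteome:
        # Supported by LC-MS/MS peptide matching against the crude venom.
        if has_database_homolog:
            return "putative known toxin"
        return "putative unknown toxin"
    # Not detected in the venom proteome.
    if has_database_homolog:
        return "non-toxin"  # subsequently classified by GO category
    return "unknown transcript"  # not explored further


# Hypothetical examples covering the four combinations.
examples = {
    (True, True): label_predicted_protein(True, True),
    (True, False): label_predicted_protein(True, False),
    (False, True): label_predicted_protein(False, True),
    (False, False): label_predicted_protein(False, False),
}
for inputs, label in examples.items():
    print(inputs, "->", label)
```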

Among the non-toxin transcripts, around 6877.9 transcripts (27%) belong to the Biological Process category, 10,953.7 (43%) to Cellular Component, and 7642.12 (30%) to Molecular Function. The five most representative categories for each GO term are shown as percentages of transcripts in Figure 1.

The proteomic analysis of the crude venom revealed that 454 predicted proteins of the transcriptome could be classified as unknown venom components or as putative venom toxins, which were further classified into 24 different protein families (Figure 2). In terms of relative expression in TPM (transcripts per million), putative unknown toxins and putative known toxins together represented 24.97% of the transcriptome.
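For readers unfamiliar with the unit, the following sketch shows how TPM values and the expression share of a category can be computed. It assumes the standard TPM definition (length-normalized read counts rescaled to sum to one million); the read counts, transcript lengths, and labels are hypothetical, and the quantification software used in this study may use effective rather than full transcript lengths.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million from read counts and transcript lengths."""
    rates = [count / length for count, length in zip(read_counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]


def category_share(tpm_values, labels, category):
    """Percentage of total TPM carried by transcripts with the given label."""
    selected = sum(v for v, lab in zip(tpm_values, labels) if lab == category)
    return 100 * selected / sum(tpm_values)


# Hypothetical toy data, not values from this study.
toy_counts = [100, 300, 600]
toy_lengths = [1000, 1000, 2000]
toy_labels = ["toxin", "non-toxin", "toxin"]

toy_tpm = tpm(toy_counts, toy_lengths)
toy_share = category_share(toy_tpm, toy_labels, "toxin")
print([round(v) for v in toy_tpm], round(toy_share, 2))
```

Because TPM values sum to one million by construction, the share of a category is the sum of its TPM values divided by one million, which is how a figure such as 24.97% of the transcriptome can be read.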

**Figure 1.** Distribution of non-toxins among the five most representative ontology categories, relative to the total number of transcripts from the transcriptome analysis of the *C. iheringi* venom gland. Annotation was performed according to the Gene Ontology terms for the cellular component, biological process, and molecular function categories.

**Figure 2.** Distribution of the diversity of transcripts encoding putative known toxins identified by the proteotranscriptomic approach in the venom gland of *C. iheringi*. Percentages correspond to the relative expression in TPM of each category.
