*2.3. Filtering by Cellular Localization Scores*

Cellular localization data and phylogeny were used to narrow the list of candidate keratinases that could make a difference in keratin degradation for strain G11C. Proteases secreted into the extracellular medium are expected to play an important role in keratin degradation [2]. Thus, to identify sequences predicted to be extracellular, all previously mentioned datasets (three-strain set, functional keratinases, and putative non-keratinases) were submitted to three localization prediction tools: PSORTb, CELLO, and SignalP (Table S12). To add informative categories and thereby facilitate interpretation of the results in the following steps, we separated the three-strain set into: (i) keratinase-linked proteases (40 sequences), which are sequences linked to functional keratinases according to the network; (ii) "unassigned p-orthogroup" sequences (28 sequences), according to the protease orthogroup classification; and (iii) the remaining sequences, which were kept in the three-strain category (516 sequences). Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) were used to embed the sequences in a two-dimensional space based on the numerical scores retrieved from each tool (Figure 4). Two semi-defined clusters are visible in the PCA plot (Figure 4A), representing intracellular (lower left corner) and extracellular (lower right corner) proteases. As expected, most of the keratinase-linked proteases from the three-strain dataset group together with known keratinases and are represented as extracellular proteases. In contrast, most of the "unassigned p-orthogroup" sequences, including the peptidases unique to *Streptomyces* sp. G11C, are predicted to be cytoplasmic and membrane-associated proteases. Non-keratinases were scattered across the plot, and seven of these sequences can be considered putative extracellular proteases.
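
The embedding step described above can be illustrated with a minimal sketch of a two-component PCA on a matrix of localization scores. This is not the pipeline used in this study: the four feature columns, the toy values, and the function name `pca_2d` are hypothetical, and standardizing the features before the decomposition is an assumption, since the preprocessing is not specified here.

```python
import numpy as np


def pca_2d(scores):
    """Project the rows of `scores` onto the first two principal components.

    Returns the 2D coordinates, the fraction of variance explained by each
    of the two components, and the loadings (one row per component).
    """
    X = np.asarray(scores, dtype=float)
    X = X - X.mean(axis=0)            # center every feature
    std = X.std(axis=0)
    std[std == 0] = 1.0               # guard against constant features
    X = X / std                       # ASSUMPTION: features are standardized
    # SVD of the standardized matrix: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    coords = U[:, :2] * S[:2]         # sample coordinates on PC1 and PC2
    explained = S**2 / np.sum(S**2)   # variance fraction per component
    return coords, explained[:2], Vt[:2]


# Made-up scores for six hypothetical sequences. Columns (hypothetical):
# extracellular, cytoplasmic, membrane, Sec/SPI signal peptide probability.
toy_scores = np.array([
    [0.90, 0.05, 0.05, 0.80],   # extracellular-like
    [0.85, 0.10, 0.05, 0.90],
    [0.80, 0.10, 0.10, 0.70],
    [0.10, 0.80, 0.10, 0.05],   # intracellular-like
    [0.05, 0.90, 0.05, 0.10],
    [0.10, 0.60, 0.30, 0.05],
])

coords, ratio, loadings = pca_2d(toy_scores)
print(coords)
print(ratio)
```

In this toy matrix the two kinds of sequences fall on opposite sides of the first component, which is the type of separation read from the PCA plot. The explained-variance fractions play the same role as the axis percentages reported in the figure caption.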

Sequences were further clustered in the two-dimensional t-SNE space using the DBSCAN algorithm [49], revealing a clear separation of the sequences into groups (Figure 4B, Figure S3). This clustering also guides the comparison of the sequence space representation between the PCA and the t-SNE. In Figure 4B, three DBSCAN clusters with extracellular localization characteristics can be distinguished. These were named t-SNE groups 0, 1, and 2, comprising 68, 91, and 32 sequences, respectively, in which a considerable number of functional keratinases are present (Table 3; Table S13). These groups are therefore of great interest, as they may contain sequences that encode extracellular keratinases. In all the identified groups, more of the sequences predicted as extracellular belong to strains Vc74B-19 and CHD11 than to strain G11C. The differences in keratinolytic activity could possibly be due to specific characteristics at the amino acid sequence level. To deepen our analysis, we complemented this study with a phylogenetic analysis of the sequences identified in the t-SNE groups.
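
To make the density-based grouping concrete, the following is a minimal from-scratch sketch of the DBSCAN procedure applied to two-dimensional coordinates. It is only an illustration: the points are invented stand-ins for t-SNE coordinates, and the `eps` and `min_samples` values are hypothetical, not the parameters used in this study. In practice a library implementation would normally be preferred.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # eps-neighbourhood of every point (a point is its own neighbour).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n               # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE         # may later be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster   # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])   # core point: keep expanding
        cluster += 1
    return labels


# Invented 2D coordinates: three dense groups and one isolated point.
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
    (20.0, 20.0),
]

labels = dbscan(toy_points, eps=0.3, min_samples=3)
print(labels)
```

Because DBSCAN does not require the number of clusters in advance and leaves isolated points unassigned, it suits an embedding in which only some regions form dense groups.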


**Figure 4.** (**A**) Two-dimensional PCA using localization features for proteases. Protease sequences were retrieved from the genomes of strains G11C, Vc74B-19, and CHD11. The *x*- and *y*-axes explain 41% and 18% of the observed variance of the data, respectively. Functional keratinases are depicted in red, putative non-keratinases in yellow, sequences from our three streptomycete genomes (i.e., the three-strain category) in green, sequences without an assigned p-orthogroup in magenta, and keratinase-linked proteins, according to the network, in blue. Loadings represent the cellular localization features (wall, membrane, cytoplasmic, and extracellular) from CELLO and PSORTb, in addition to the putative signal peptides LIPO(Sec/SPII), SP(Sec/SPI), and TAT(Tat/SPI) predicted with SignalP. (**B**) Two-dimensional t-SNE using localization features for the same group of proteases. The defined t-SNE groups 0, 1, and 2 are enclosed by dashed polygons.
