*4.9. Cellular Localization and Dimension Reduction*

Subcellular localization of the proteases was predicted using CELLO v2.5 [92] and PSORTb v3.0.2 [93], and putative signal peptides were predicted with SignalP 5.0 [94]. All predictions are consolidated in Table S12. The numerical localization scores of each sequence were used as features for dimensionality reduction. These features comprise the cell wall, membrane, extracellular and intracellular scores from PSORTb (four features) and from CELLO (four features), together with six features from SignalP: the export pathway scores SP(Sec/SPI), TAT(Tat/SPI), LIPO(Sec/SPII) and OTHER, as well as the intracellular and signal peptide probabilities. The resulting matrix consisted of n sequences by 14 features, where the n sequences include the functional keratinase dataset, the putative non-keratinase dataset, and the three-strain dataset. The raw data matrix was standardized (zero mean and unit variance per feature), and dimensionality reduction was performed using Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE; perplexity 30, 1000 iterations, fixed seed) [95]. The output coordinates of both methods were rescaled by min-max normalization. For the PCA, the loading of each feature was calculated from the eigenvectors and depicted as an arrow. The whole procedure was carried out with the Scikit-learn v0.22.2 Python package [96].
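The workflow described above can be sketched as follows. This is a minimal illustration rather than the original analysis script: the function name `reduce_localization_features`, the random placeholder matrix, the seed value, the [0, 1] target range of the min-max rescaling, and the scaling of the eigenvectors by the square root of the eigenvalues to obtain loadings are all assumptions. The number of t-SNE iterations is left at the Scikit-learn default of 1000, because the name of that argument differs between library versions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def reduce_localization_features(X, seed=0):
    """Embed an (n, 14) localization score matrix in 2D with PCA and t-SNE.

    Hypothetical helper; `seed` stands in for the unspecified fixed seed.
    """
    # Standardize each feature to zero mean and unit variance.
    X_std = StandardScaler().fit_transform(X)

    # PCA: keep the first two principal components.
    pca = PCA(n_components=2)
    pca_xy = pca.fit_transform(X_std)

    # Loadings for the arrows. Assumption: eigenvectors scaled by the
    # square root of the eigenvalues (one common convention).
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

    # t-SNE with perplexity 30; the default iteration count is 1000.
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)
    tsne_xy = tsne.fit_transform(X_std)

    # Min-max rescaling of both embeddings (assumed target range [0, 1]).
    pca_xy = MinMaxScaler().fit_transform(pca_xy)
    tsne_xy = MinMaxScaler().fit_transform(tsne_xy)
    return pca_xy, tsne_xy, loadings


# Placeholder data only: 100 random "sequences" with 14 scores each.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 14))
pca_xy, tsne_xy, loadings = reduce_localization_features(X_demo)
print(pca_xy.shape, tsne_xy.shape, loadings.shape)
```

In practice, `X_demo` would be replaced by the 14 score columns compiled from the PSORTb, CELLO and SignalP outputs, with one row per sequence.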
