*4.10. Clustering and Phylogeny of t-SNE Groups*

t-SNE points, i.e., the coordinates assigned to each sequence, were clustered into groups, hereafter called "t-SNE groups", using the DBSCAN algorithm [49] as implemented in scikit-learn v0.22.2. A multiple sequence alignment (MSA) was built for each t-SNE group using MAFFT v7.455 [97] with the G-INS-i strategy (--globalpair --maxiterate 1000). Average occupancy (the average fraction of sequences with a residue, rather than a gap, at each alignment position) was calculated, and the MSAs were filtered by occupancy, using the ProDy Python package [50]. Positions with occupancy below 70% were removed from the original MSAs to generate more compact MSAs. A substitution model was fitted to each MSA using ProtTest 3 [98], considering all distributions together with the +I and +G models, and a maximum likelihood tree was then inferred using the multithreaded version of RAxML v8.2.12 [99] with the best-fit parameters selected by ProtTest and the rapid bootstrapping configuration (-f a option, 1000 bootstrap replicates). Because no outgroup sequence was provided, the trees were midpoint rooted. Ancestral state reconstruction on the trees [100] over the discrete categories keratinase, non-keratinase, three-strain, and keratinase-linked protein was performed using the phytools package in R [101]. The analysis was run for 1000 iterations under an "ER" model, i.e., with equal rates for all permitted transitions.
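The occupancy filtering step can be illustrated with a minimal pure-Python sketch. It is not the ProDy implementation used in this work. The function names, the gap characters, and the toy alignment are assumptions made for illustration only. Occupancy is taken here as the fraction of sequences carrying a residue in a given column, and columns below the 70% threshold are dropped.

```python
# Minimal sketch of occupancy-based column filtering for an MSA.
# Pure Python for illustration; this is NOT the ProDy code used in the study.

GAP_CHARS = {"-", "."}  # assumption: these characters denote gaps


def column_occupancy(sequences):
    """Return, for each alignment column, the fraction of sequences with a residue."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(sequences)
    return [
        sum(seq[i] not in GAP_CHARS for seq in sequences) / n_seqs
        for i in range(length)
    ]


def filter_by_occupancy(sequences, threshold=0.7):
    """Keep only the columns whose occupancy is at least `threshold`."""
    occupancy = column_occupancy(sequences)
    keep = [i for i, value in enumerate(occupancy) if value >= threshold]
    return ["".join(seq[i] for i in keep) for seq in sequences]


# Toy alignment (hypothetical data: four sequences, six columns).
toy_msa = [
    "MK-LV-",
    "MKALV-",
    "M--LVT",
    "MK-L--",
]

occupancy = column_occupancy(toy_msa)
average_occupancy = sum(occupancy) / len(occupancy)
compact_msa = filter_by_occupancy(toy_msa, threshold=0.7)

print(occupancy)    # per-column occupancy
print(compact_msa)  # alignment after removing low-occupancy columns
```

In this toy case, the third and sixth columns have an occupancy of 0.25 and are removed, whereas the remaining four columns (occupancy 0.75 or 1.0) are retained, so the filtered alignment is shorter but every sequence is preserved.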
