### *4.10. Genome Sequencing and Alignment*

Whole-genome sequencing was carried out by MicrobesNG (https://microbesng.com/ (accessed on 9 November 2020)) as follows. Genomic DNA libraries were prepared using the Nextera XT Library Prep Kit (Illumina, San Diego, CA, USA) following the manufacturer's protocol with two modifications: two nanograms of DNA instead of one were used as input, and the PCR elongation time was increased from 30 s to 1 min. DNA quantification and library preparation were carried out on a Hamilton Microlab STAR automated liquid handling system. Pooled libraries were quantified using the Kapa Biosystems Library Quantification Kit for Illumina on a Roche LightCycler 96 qPCR machine. Libraries were sequenced on the Illumina HiSeq using a 250 bp paired-end protocol. Reads were adapter trimmed using Trimmomatic 0.30 with a sliding window quality cut-off of Q15 [87]. The closest available reference genome was identified using Kraken [88], and the reads were mapped to it with BWA-MEM to assess the quality of the data. De novo assembly of the reads was carried out using SPAdes [89]. MeDuSa [90] was used for genome scaffolding, with reference strains of >95% similarity based on 16S rRNA sequencing data. The *Pseudonocardia* isolates were not analysed by MeDuSa as no reference strains were identified. The whole genome sequences for the polar strains have been deposited in GenBank with the following accession numbers: SAMN14679891-SAMN14679907 (Table S2).
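The sliding-window quality criterion used for read trimming can be illustrated with a short Python sketch. This is a simplified illustration of the idea and not Trimmomatic's implementation: the window size of 4 bases is an assumption (only the Q15 cut-off is stated above), and the function name `sliding_window_trim` is hypothetical.

```python
from statistics import mean


def sliding_window_trim(seq, quals, window=4, min_mean_q=15):
    """Cut a read at the start of the first window whose mean quality is too low.

    seq        -- base calls as a string
    quals      -- Phred quality scores, one integer per base
    window     -- window size in bases (assumed value, not given in the text)
    min_mean_q -- minimum acceptable mean quality per window (Q15 in the text)
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")

    keep = len(seq)  # keep the whole read unless a bad window is found
    for start in range(len(quals) - window + 1):
        if mean(quals[start:start + window]) < min_mean_q:
            keep = start  # discard this window and everything after it
            break
    return seq[:keep], quals[:keep]


if __name__ == "__main__":
    # Hypothetical read: quality drops sharply over the last four bases.
    trimmed_seq, trimmed_quals = sliding_window_trim(
        "ACGTACGTAC", [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
    )
    print(trimmed_seq, trimmed_quals)
```

In this toy read, the windows that overlap only part of the low-quality tail still average Q15 or higher, so the read is cut only where a full window falls below the threshold.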

### *4.11. Biosynthetic Gene Cluster Mining and Comparison*

The identification of BGCs was carried out using antiSMASH 5 beta [91]. The variety and number of BGCs in each Polar strain were visualised using a Circos diagram [92]. The detected BGCs were grouped into Gene Cluster Families (GCFs) using BiG-SCAPE 1.0 beta (Navarro-Munoz et al. 2019), with the underlying assumption that similar BGCs, i.e., BGCs that belong to the same GCF, produce similar metabolites. BiG-SCAPE was run using the Longest Common Subcluster alignment mode, and cluster analysis was carried out at the default cutoff of 0.3.
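To illustrate how a distance cutoff turns pairwise BGC distances into families, the following Python sketch links every pair of BGCs whose distance is at or below 0.3 and reports the resulting connected groups. It is a simplification under stated assumptions, not BiG-SCAPE's own clustering procedure: the BGC names, the distance values, the inclusive comparison and the function `group_into_families` are all hypothetical.

```python
def group_into_families(distances, cutoff=0.3):
    """Group BGCs into families by single linkage at a distance cutoff.

    distances -- dict mapping (bgc_a, bgc_b) to a pairwise distance in [0, 1]
    cutoff    -- pairs at or below this distance are placed in the same family
    """
    parent = {}

    def find(node):
        # Union-find lookup with path compression.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (bgc_a, bgc_b), distance in distances.items():
        root_a, root_b = find(bgc_a), find(bgc_b)  # registers both BGCs
        if distance <= cutoff and root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for node in list(parent):
        families.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    # Hypothetical pairwise distances between four BGCs.
    toy_distances = {
        ("bgc1", "bgc2"): 0.10,
        ("bgc2", "bgc3"): 0.25,
        ("bgc3", "bgc4"): 0.80,
        ("bgc1", "bgc4"): 0.90,
    }
    print(group_into_families(toy_distances))
```

With these toy values, `bgc1`, `bgc2` and `bgc3` end up in one family even though `bgc1` and `bgc3` are never compared directly, while `bgc4` forms a singleton.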

### *4.12. Computational Pattern Matching*

Computational prioritisation of links between BGCs and candidate products made use of two complementary approaches. Firstly, the standardised strain correlation score described in [42] was used to compute a score between each spectrum and each GCF. The original strain correlation score introduced in [41] is heavily influenced by the number of strains present in each spectrum or GCF, making the ranking of links between spectra and a particular GCF problematic. The standardised score overcomes this limitation, permitting a more balanced ranking of spectra for each GCF independent of their size. Significance values for each link were computed as described in [42]. Secondly, a novel approach named Rosetta (code available at https://github.com/sdrogers/nplinker/tree/master/prototype/rosetta\_data\_prep (accessed on 9 November 2020)), based upon a set of collated matches between the GNPS [23] library spectra and the MIBiG database of characterised BGCs, allows putative links between individual spectra and BGCs to be highlighted. The set consists of 2960 links covering 2069 unique spectra and 249 unique MIBiG IDs. To establish this set of collated links, the structural annotations available for both databases were used. A pair of objects from the two datasets was matched if the first blocks of the InChIKeys of the molecules in the GNPS library spectra and the MIBiG validated gene cluster products matched. Matching was restricted to the first block to avoid distinguishing between molecules based on chemical properties that would not show up in the MS/MS spectra (e.g., stereochemistry). With this set of collated links, observed spectra and BGCs were putatively matched as follows: spectral similarity between measured MS2 spectra and the relevant subset of the GNPS spectra was computed using the modified cosine score (equivalent to "Analog search" in the GNPS framework).
Results from antiSMASH were parsed to extract the known cluster blast results, and Rosetta links between spectra and BGCs were generated where the spectra showed similarity to the GNPS spectrum and the MIBiG entry was found in the known cluster blast record for the BGC. All analysis was performed with the NPLinker framework [42], in which potential links can be reported using either one of these two scoring methods, or both simultaneously, with user-defined thresholds.
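The Rosetta matching logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in the NPLinker repository: the function names, the input data structures, all identifiers and InChIKeys in the demo, and the `min_score` threshold of 0.5 are hypothetical, and the spectral similarity scores are taken as precomputed inputs and are not calculated here.

```python
from collections import defaultdict


def first_block(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def collate_links(gnps_library, mibig_products):
    """Pair GNPS library spectra with MIBiG entries via first-block matches.

    gnps_library   -- dict: GNPS spectrum ID -> InChIKey of the annotated molecule
    mibig_products -- dict: MIBiG ID -> list of InChIKeys of its products
    """
    by_block = defaultdict(set)
    for mibig_id, inchikeys in mibig_products.items():
        for inchikey in inchikeys:
            by_block[first_block(inchikey)].add(mibig_id)

    links = set()
    for gnps_id, inchikey in gnps_library.items():
        for mibig_id in by_block.get(first_block(inchikey), ()):
            links.add((gnps_id, mibig_id))
    return links


def rosetta_links(spectral_hits, known_cluster_hits, collated, min_score=0.5):
    """Link observed spectra to BGCs through the collated GNPS-MIBiG pairs.

    spectral_hits      -- iterable of (observed spectrum ID, GNPS ID, similarity)
    known_cluster_hits -- dict: BGC ID -> set of MIBiG IDs reported for that BGC
    collated           -- set of (GNPS ID, MIBiG ID) pairs from collate_links
    min_score          -- assumed similarity threshold (user-defined in practice)
    """
    links = set()
    for observed_id, gnps_id, score in spectral_hits:
        if score < min_score:
            continue
        for bgc_id, mibig_ids in known_cluster_hits.items():
            for mibig_id in mibig_ids:
                if (gnps_id, mibig_id) in collated:
                    links.add((observed_id, bgc_id, gnps_id, mibig_id))
    return links


if __name__ == "__main__":
    # Hypothetical entries; the two "A" keys differ only after the first block.
    gnps = {"LIB_A": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
            "LIB_B": "CCCCCCCCCCCCCC-UHFFFAOYSA-N"}
    mibig = {"MIBIG_X": ["AAAAAAAAAAAAAA-ZZZZZZZZSA-N"],
             "MIBIG_Y": ["DDDDDDDDDDDDDD-UHFFFAOYSA-N"]}
    collated = collate_links(gnps, mibig)
    hits = [("spec1", "LIB_A", 0.8), ("spec2", "LIB_A", 0.3)]
    print(rosetta_links(hits, {"bgc_1": {"MIBIG_X"}}, collated))
```

Indexing the MIBiG products by first block keeps the collation step linear in the size of the two databases, and comparing only that block means the two stereoisomer-like keys in the demo are treated as the same molecule.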

**Supplementary Materials:** The following are available online at https://www.mdpi.com/1660-3397/19/2/103/s1, Figure S1: Molecular network of 3107 parent ions produced by 25 Polar actinomycete strains. Nodes are colour coded based on genus: *Agrococcus*, *Dietzia*, *Halomonas*, *Kocuria*, *Microbacterium*, *Micrococcus*, *Pseudonocardia* and *Rhodococcus*. Grey nodes represent media components, whereas orange nodes represent parent ions that are produced in more than one medium. Figure S2: Pie chart showing the distribution of parent ions (%) between the 36 chemical class terms shown in the legend as annotated by MolNetEnhancer. The percentage of parent ions with no chemical class match (70.5%) is not shown in the pie chart. Each class has been colour coded to match the molecular network generated through MolNetEnhancer workflow analysis (Figure 5B). Table S1: Isolation and collection data of the 25 polar bacteria. Table S2: Genome quality of the Polar strains (*Pseudonocardia* strains were not analysed by MeDuSa as no reference strains were available). Table S3: BGCs identified using antiSMASH 5 after genome scaffolding using MeDuSa. Table S4: Bioactive bacterial extracts organised by genus, strain name (KRD) and growth medium (ISP3, A1M1, ISP2, and 10-fold diluted TSB). Antibiotic activity against the clinical pathogens *E. faecalis*, *S. aureus*, *K. pneumoniae*, *A. baumannii*, *P. aeruginosa* and *E. coli* is shown as zones of inhibition (cm) and colour coded by inhibition zone size. Table S5: Putatively identified metabolites using the Rosetta approach.

**Author Contributions:** Conceptualisation, K.R.D. and S.R.; methodology, S.S., G.H.E., S.R., and K.R.D.; formal analysis, S.S., G.H.E., S.R., and K.R.D.; investigation, S.S., G.H.E., A.H.H., S.R., and K.R.D.; writing—original draft preparation, S.S., G.H.E., S.R., and K.R.D.; writing—review and editing, S.S., G.H.E., A.R., J.J.J.v.d.H., A.H.H., S.R., and K.R.D.; supervision, K.R.D. and S.R.; project administration, K.R.D. and S.R.; funding acquisition, K.R.D. and S.R. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by a Carnegie Trust Collaborative Research Grant (KRD, SR, SS). AR, KRD and SR were supported by the Biotechnology and Biological Sciences Research Council (BB/R022054/1). Additionally, genome sequencing was provided by MicrobesNG (http://www.microbesng.uk (accessed on 21 January 2021)), which was supported by the Biotechnology and Biological Sciences Research Council (BB/L024209/1).

**Data Availability Statement:** The code for Rosetta is available at https://github.com/sdrogers/nplinker/tree/master/prototype/rosetta\_data\_prep (accessed on 21 January 2021). The genomes have been deposited in GenBank with the following accession numbers: SAMN14679891-SAMN14679907 (Table S2). The GenBank accession numbers for the 16S rRNA gene sequences are the following: MT135519 (KRD153), MT135569 (KRD128), MT135795 (KR077), MT135986 (KRD070), MT136106 (KRD026), MT136242 (KRD012), MT136243 (KRD022), and MT136510 (KRD096) (Figure 1). The LC–MS data are available in the MassIVE repository under accession number MSV000086584.

**Conflicts of Interest:** The authors declare no conflict of interest.
