*2.5. Computational Pattern Matching*

A recently introduced software framework, NPLinker, was used to suggest links between spectra of interest and their corresponding GCFs, and therefore BGCs [42]. Initially, analysis was carried out using the standardised strain correlation scoring method, which yielded potential MF-GCF links by correlating strain presence and absence. This approach greatly narrowed the space of links requiring investigation. Further analysis of the suggested links based on biosynthetic knowledge allowed identification of the BGCs most likely to be relevant to the metabolite of interest. Specifically, the GNPS infrastructure allows parent ion clustering into molecular families (MFs) and comparison of observed spectra with GNPS embedded libraries. Simultaneously, BGCs were clustered into GCFs via BiG-SCAPE. The generated MFs and GCFs were then uploaded to NPLinker, where potential MF-GCF links were ranked using two scoring functions: standardised strain correlation scoring and the novel approach introduced here, named Rosetta scoring (Figure 6A).

**Figure 4.** (**A**) Hinton diagram showing the number of parent ions (proportional to the size of the white box) produced by each strain (max. 1041, min. 543 parent ions) and shared by each pair of strains (max. 670, min. 104 parent ions). The number of parent ions specific to that strain (or pair) is proportional to the size of the inner purple box. (**B**) Hinton diagram showing the number of parent ions by strain in each medium (white box). Parent ions specific to only that strain-medium combination are also shown (purple box). The thickness of the black box outline corresponds to the number of ESKAPE pathogens against which bioactivity was observed (ranging from 1 to 6). For example, the bacterial metabolite extract for KRD185 in diluted TSB was found to be bioactive against all six pathogens, but no bioactivity was observed when the same strain was cultured in any of the other tested media.

**Figure 5.** (**A**) Molecular network of 3107 parent ions produced by 25 strains cultured in four media and those found in media and solvent blanks. Nodes are colour coded based on media: ISP2, ISP3, A1M1 and 10-fold diluted TSB. Grey nodes represent media components, whereas orange nodes represent parent ions that are found in more than one medium. (**B**) Nodes are colour coded based on 36 chemical class terms annotated using the MolNetEnhancer workflow. Orange nodes represent parent ions that had no matches with any chemical class.

**Figure 6.** (**A**) Molecular families (MFs) are created through molecular networking using the GNPS infrastructure and parent ions of interest are identified through dereplication using GNPS embedded libraries and the Rosetta tool. Genome mining data (antiSMASH) of the Polar strains were clustered in GCFs using BiG-SCAPE. The GNPS (MFs) and BiG-SCAPE (GCFs) outputs are then analysed with NPLinker to rank potential MF-GCF links using two scoring functions (standardised strain correlation linking and Rosetta scoring). (**B**) The ectoine metabolite produced by two *Rhodococcus* spp. (KRD175 and KRD197) when cultured in ISP3 and one *Halomonas* sp. (KRD171) cultured in A1M1 was linked with its corresponding ectoine BGC via NPLinker.

2.5.1. Computational Pattern Matching Using the Standardised Strain Correlation Scoring Method

The parent ion (*m*/*z* 547.3815) produced by *Microbacterium* sp. KRD174 cultured in A1M1 showed spectral similarity to the GNPS spectrum CCMSLIB00000569369, suggesting it was an antimycin-related metabolite. Through NPLinker, it was shown that this metabolite could potentially be linked with the NRPS-like, betalactone, t3PKS and terpene BGCs (KRD174). Although the standardised strain correlation score for all links was high (2.7–4, with 4 being the maximum value observed in the dataset), when this information was combined with the fact that antimycins are produced by an NRPS/PKS hybrid [49], it was hypothesised that the betalactone and terpene BGCs were less likely to be involved in the biosynthesis of the metabolite of interest. Further validation studies are required to confirm the responsible BGC. Similarly, another metabolite (*m*/*z* 521.3294) showed similarity with GNPS spectrum CCMSLIB00004710288 for conglobatin (MIBiG ID: BGC0001215), suggesting that it could be structurally related to the known macrolide conglobatin, originally isolated from the antibiotic-producing *Streptomyces conglobatus* [50]. The metabolite of interest was produced by two *Micrococcus* strains, KRR022 and KRD026 (diluted TSB medium), in addition to *Rhodococcus* sp. KRD175 (diluted TSB medium) and two *Pseudonocardia* strains, KRD184 and KRD291 (ISP2 medium). Using the standardised strain correlation scoring method, the spectrum was potentially linked with 14 GCFs: two from *Micrococcus* sp. (KRD026) and 12 from *Rhodococcus* sp. (KRD175). The highest standardised strain correlation linking score (2.1) was observed for the hybrid BGC arylpolyene-NRPS (KRD026) as well as for the NRPS, NRPS-like, arylpolyene and butyrolactone BGCs (KRD175). Considering that conglobatin biosynthesis is governed by an NRPS/PKS BGC [51], the arylpolyene-NRPS BGCs are the most likely to be involved in the biosynthesis, but further studies would be required to validate this. These examples demonstrate that using spectral library matches with the standardised scoring method included within NPLinker can narrow down possible MF-GCF links and thus enable a more focused downstream analysis.
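The idea behind a standardised strain correlation score can be illustrated with a minimal sketch. It assumes a raw score that weights the four strain presence/absence categories and is then expressed as a z-score against a hypergeometric null for the strain overlap. The weights, function names and toy strain labels are illustrative assumptions and are not taken from this study or from the NPLinker code base.

```python
from math import comb, sqrt

# Illustrative weights (assumption): reward co-occurrence, penalise a spectrum
# observed in a strain that lacks the GCF, small reward for joint absence.
W_BOTH, W_GCF_ONLY, W_SPEC_ONLY, W_NEITHER = 10, 0, -10, 1


def raw_score(overlap, n_gcf, n_spec, n_strains):
    """Weighted sum over the four presence/absence categories."""
    gcf_only = n_gcf - overlap
    spec_only = n_spec - overlap
    neither = n_strains - n_gcf - n_spec + overlap
    return (W_BOTH * overlap + W_GCF_ONLY * gcf_only
            + W_SPEC_ONLY * spec_only + W_NEITHER * neither)


def standardised_score(gcf_strains, spec_strains, all_strains):
    """Z-score of the observed raw score under a hypergeometric null."""
    strains = set(all_strains)
    gcf = set(gcf_strains) & strains
    spec = set(spec_strains) & strains
    n, n_gcf, n_spec = len(strains), len(gcf), len(spec)
    observed = raw_score(len(gcf & spec), n_gcf, n_spec, n)

    # Enumerate every overlap size that is possible by chance.
    lo, hi = max(0, n_gcf + n_spec - n), min(n_gcf, n_spec)
    total = comb(n, n_spec)
    null = [(comb(n_gcf, k) * comb(n - n_gcf, n_spec - k) / total,
             raw_score(k, n_gcf, n_spec, n)) for k in range(lo, hi + 1)]
    mean = sum(p * s for p, s in null)
    var = sum(p * (s - mean) ** 2 for p, s in null)
    # With a single possible overlap the score carries no information.
    return 0.0 if var == 0 else (observed - mean) / sqrt(var)


# Toy example with four hypothetical strains.
strains = ["S1", "S2", "S3", "S4"]
perfect = standardised_score({"S1", "S2"}, {"S1", "S2"}, strains)
disjoint = standardised_score({"S1", "S2"}, {"S3", "S4"}, strains)
```

Standardising in this way makes scores comparable between links that involve different numbers of strains, which is why a single cut-off can be applied across a whole dataset.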

2.5.2. Computational Pattern Matching Using Standardised Strain Correlation Scoring and the Rosetta Method

To further investigate the potential links between the genomics and metabolomics datasets of the Polar strains, an additional filter layer was added to NPLinker, which allowed the standardised strain correlation scoring method and the Rosetta hit list to be used simultaneously. This approach linked spectrum ID 219769 (*m*/*z* 185.1012), putatively identified via Rosetta as ectoine ([M + ACN + H]+ adduct), with the ectoine BGC in two *Rhodococcus* spp. (KRD175, KRD197) and one *Halomonas* sp. (KRD171). Interestingly, when only the standardised scoring was used, the same spectrum was linked to 40 GCFs, whereas applying the additional Rosetta scoring method narrowed this down to two GCFs (Figure 6B). Moreover, Rosetta identified that spectrum ID 111427 (*m*/*z* 380.2794) could be structurally related to the known antibiotic chloramphenicol, originally isolated from *Streptomyces venezuelae* [52]. The parent ion of interest was present in the metabolite extracts of *Rhodococcus* sp. KRD175 and *Micrococcus* sp. KRD128 and was linked with the NRPS BGC (KRD175), which showed homology to the chloramphenicol BGC. It is important to note that the Rosetta scoring approach is limited by the number of MIBiG BGCs for which experimental spectra are available. Due to the relatively low number of publicly available spectra of microbial metabolites [53], the combined filtering approach (standardised score and Rosetta) could only identify links for ectoine and chloramphenicol to their corresponding BGCs. It must be pointed out that the Rosetta hits were a result of matching single MS fragments to publicly available MS/MS datasets (Table S5); hence, the aforementioned metabolites could only be putatively identified. Nevertheless, this workflow clearly shows the promise of the implemented method for analysing large genomics and metabolomics datasets.
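The combined filter described above can be sketched as a simple intersection: a link is retained only if it passes a standardised score cut-off and is also supported by a Rosetta hit. In the sketch below, representing Rosetta hits as a set of (spectrum, GCF) pairs, the `Link` class, the GCF labels, the scores and the cut-off are all hypothetical choices for illustration; only spectrum ID 219769 comes from the text, and the NPLinker interface itself is not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    spectrum_id: int
    gcf_id: str    # hypothetical GCF label
    score: float   # standardised strain correlation score (invented values)


def combined_filter(links, rosetta_hits, min_score):
    """Keep links that pass the score cut-off AND appear in the Rosetta hits."""
    kept = [link for link in links
            if link.score >= min_score
            and (link.spectrum_id, link.gcf_id) in rosetta_hits]
    # Rank the surviving links from highest to lowest score.
    return sorted(kept, key=lambda link: link.score, reverse=True)


# Toy data: four candidate links for one spectrum.
links = [
    Link(219769, "GCF_ectoine_a", 3.1),
    Link(219769, "GCF_ectoine_b", 2.4),
    Link(219769, "GCF_terpene", 3.5),
    Link(219769, "GCF_nrps", 1.2),
]
rosetta_hits = {(219769, "GCF_ectoine_a"), (219769, "GCF_ectoine_b")}

score_only = [link for link in links if link.score >= 2.0]
shortlist = combined_filter(links, rosetta_hits, min_score=2.0)
```

In this toy case the score cut-off alone leaves three candidate GCFs, while the combined filter leaves two, mirroring in miniature the narrowing reported for the ectoine spectrum.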

#### **3. Discussion**

Over the years, it has been shown that Arctic and Antarctic marine environments host a vast variety of Actinobacteria with great potential for producing novel chemistry with a wide range of biological activities [11–13]. Bioprospecting for new specialised metabolites from Polar strains has been greatly improved by the advancement of publicly available tools for untargeted metabolomics [23] and genome mining [54], which are continuously under development to keep pace with the rapidly evolving field of microbial natural products discovery. One of the main challenges of genome mining is the quality of the genome assembly and annotation, which can affect the outcome of the analysis [55,56]. A large number of contigs in the genome assembly can cause BGCs, especially PKS-I and NRPS, to be split across contigs and missed by available software and tools. A good example of this issue was demonstrated by Baltz, who showed that draft genomes containing large NRPS/PKS-I genes were incorrectly assembled because they were highly fragmented, which resulted in overestimation of such BGCs by antiSMASH 3.0 [57]. Since then, updated versions of antiSMASH have been released in which gene clusters located close to a contig edge are flagged. Moreover, closed genomes are of paramount importance for accurate and reliable genome mining. However, long-read technologies are often required to achieve this, and these come with greater expense and their own drawbacks, such as high error frequencies and limited reliability [58]. A recent study of nine Actinobacterial species, including three *Pseudonocardia* strains, used short-read (Illumina MiSeq) and long-read (Oxford Nanopore MinION) sequencing technologies to analyse BGC fragmentation. The authors found that the MinION-based genome assemblies increased the sensitivity of BGC annotation and reduced the number of fragmented BGCs [56]. In the present study, we omitted the *Pseudonocardia* strains from the genomic analysis due to the lack of reference strains for genome scaffolding. Genome mining of the 17 non-*Pseudonocardia* strains revealed a wide diversity of BGCs, most of which had low homology to known BGCs, suggesting biosynthetic and chemical novelty. Terpene BGCs were present in almost every genome, which was not surprising as recent studies have revealed a wide distribution of terpene synthases in bacteria, which has led to the development of a new hidden Markov model for terpene synthase identification in bacterial genomes [58,59]. As expected, the number and variety of BGCs increased with genome size, as seen for the *Rhodococcus* strains. However, it was unexpected that smaller genomes such as those of *Micrococcus*, *Halomonas* and *Kocuria* lacked PKS and NRPS BGCs, as actinomycetes are known to produce metabolites encoded by those pathways [60,61]. A similar observation was made by Schorn et al. when studying rare marine actinomycetes [8]. Although small genomes might not look as promising from a natural products discovery perspective, this does not necessarily mean that they are not worth further investigation. A sponge-associated *Micrococcus* sp. was reported to produce a new antibacterial xanthone named microluside A [62], and marine *Halomonas* strains have yielded new antibacterial and cytotoxic metabolites named loihichelins A−F and aminophenoxazinones, respectively [63,64].

Further analysis of the BGCs observed in our Polar strains showed that the ectoine BGC was present in all genomes; this BGC is known to be ubiquitous, as the metabolite aids survival under extreme osmotic stress [65]. Moreover, a terpene BGC with high homology (66%) to a known carotenoid BGC was present in all *Micrococcus* strains and clustered in the same GCF. Carotenoids are terpenoids produced by all photosynthetic organisms and some non-phototrophic organisms, and have several applications as food colorants, feed supplements, nutraceuticals, and pharmaceuticals [66]. Terpene BGCs with homology (>37%) to the isorenieratene BGC were observed in the *Rhodococcus* strains and were clustered in the same GCF. Actinobacteria, and particularly *Streptomyces* spp., often bear isorenieratene BGCs in their genomes that are usually silent, and there have been only a few cases in which these BGCs have been activated [67,68]. Furthermore, the genomic data of the three strains belonging to the genus *Rhodococcus* suggest the presence of NRPS BGCs which could potentially encode cyclic lipopeptides. Such metabolites are of great importance in drug discovery; one example is daptomycin, originally isolated from the soil-derived *Streptomyces roseosporus* [69], which has been approved by the FDA as an antibacterial agent against Gram-positive pathogens [70].

For over 30% of the BGCs within our dataset, the most similar known cluster encoded an antibiotic. Of these, almost half showed low homology (<10%) with known BGCs. This is an exciting finding, suggesting that the rare actinomycete strains derived from Polar marine sediments can potentially be a fruitful source of novel chemistry. It is worth noting that extracting metabolites from culture broth in organic solvents was proven to be a more effective and reliable method to assess biological activity (disc diffusion assay) than an agar plug assay [71]. Although genome mining of the *Rhodococcus* spp. showed promising potential for producing metabolites, the bioassay data did not fully support this, as only strain KRD175 exhibited moderate but selective activity against *K. pneumoniae*. This could be because the BGCs encoding antibiotics remained silent or because the biologically active compounds were produced in amounts too low to inhibit the growth of the pathogens. Moreover, the bacterial metabolite extracts mostly inhibited the growth of *S. aureus*, whereas only a few showed inhibitory effects against *K. pneumoniae* and *A. baumannii*, which are two of the most drug-persistent pathogenic bacteria [72,73]. To the best of our knowledge, there are only a few published reports on the inhibitory effects of microbial specialised metabolites on *A. baumannii* [74–76] but none on *K. pneumoniae*; the Polar strains with such activity therefore show promise for combating these pathogens.

Linking genomic and metabolomics datasets of actinomycete strains for specialised metabolite discovery has been introduced only recently [41]. However, there is increasing interest in the scientific community in further exploring this niche research field by generating automated methods for correlating these complex datasets and ranking promising MF-GCF links for further investigation. Targeted linking and automated approaches for accelerating drug discovery have been reviewed [39,53]. Recently, metabolomic and genomic data of 72 isolates belonging to the rare actinomycete genus *Planomonospora* were analysed using publicly available tools to link specialised metabolites to their corresponding BGCs [77]. The authors were able to manually pair siomycin congeners to a RiPP BGC and a new salinichelin-like metabolite to the known BGC encoding erythrochelin. In the present study, the newly developed software, NPLinker, was used to link our experimental datasets and prioritise strains for further chemical and biosynthetic investigation. The filtering approaches that were implemented (standardised strain correlation score and Rosetta) established links for ectoine and chloramphenicol to their corresponding BGCs but were not yet sufficient to link the potentially new metabolites identified (antimycin-like and conglobatin-like compounds) to GCFs, as publicly available spectra of microbial metabolites are almost non-existent and remain mostly hidden in supplemental figures in the literature. Van Santen et al. [78], among others, discussed the need for data sharing within the scientific community, which will allow the field of natural products to catch up with the data-centric approaches used in other research fields and to flourish further. It is worth pointing out that the limited number of Rosetta hits obtained within this metabolomics dataset is indicative of the potential novel chemistry of the Polar strains, which is further supported by the large number of nodes that could not be annotated to specific chemical classes. However, our findings agree with a recent literature review which reported only 29 new metabolites isolated from Antarctic and Arctic bacteria, of which 13 were discovered from marine actinomycetes [13]. A future direction for NPLinker could be the integration of bioassay data along with metabolomics and genomics datasets, as previously suggested by others [79], which would give users the opportunity to explore possible MF-GCF links based on bioactivity and to target the BGCs, and therefore the metabolite(s), responsible for the biological effect.
