## *2.2. Non-Ribosomal Peptide Synthetase (NRPS) Gene Cluster of Nostocyclopeptides*

With the whole genome sequence of *N*. *edaphicum* CCNP1411 available, we analyzed in detail the non-ribosomal peptide synthetase (NRPS) cluster containing the genes that potentially code for enzymes involved in the synthesis of nostocyclopeptides. To establish the correct spans of the non-ribosomal peptide synthetases, 35 complete nucleotide sequences of gene clusters from the phylum *Cyanobacteria* were aligned, which resulted in hits scattered around positions 2,287,143–2,323,617 and 7,609,981–7,643,289 within the *N*. *edaphicum* CCNP1411 chromosome (7.7 Mbp) (Figure 2). This characterization showed the overall similarity of the selected spans to the micropeptin (cyanopeptolin) biosynthetic gene cluster [23] and the nostocyclopeptide biosynthetic gene cluster [17], respectively. To confirm these results, antiSMASH analysis was employed; it confirmed the previously defined NRPS spans and added two more regions, 1,213,069–1,258,319 and 5,735,625–5,780,238, which were similar to a small extent (12% and 30%, respectively) to the anabaenopeptin gene cluster [24]. For the purpose of this study, we focused on the putative nostocyclopeptide-producing non-ribosomal peptide synthetase. Annotation of the selected region revealed nine putative open reading frames (ORFs), transcribed in the reverse (7) and forward (2) directions. The identified cluster was arranged in a similar fashion to AY167420.1 (the nostocyclopeptide biosynthetic gene cluster from *Nostoc* sp. ATCC 53789), with the exception of two ORFs (>170 bp) intersecting the operon (*ncpFGCDE*) that putatively encodes proteins involved in MePro assembly, efflux and hydrolysis of the products of the second putative operon, *ncpAB* (Figure 3).

**Figure 3.** Schematic alignment of genes coding for putative non-ribosomal peptide synthetase from *N*. *edaphicum* CCNP1411 (red) and two related Ncp-producing synthetases AY167420.1 and CP026681.1 (white). The grey bar in the lower right corner shows the identity percentage associated with color of the bars connecting homologous regions. The analysis was conducted at the nucleotide level.

Two sequences, ORF1 (HUN01\_34350) (837 bp) and ORF2 (HUN01\_34355) (1107 bp), embedded at the 3′ end of the nostocyclopeptide gene cluster, resemble the *nosF* and *nosE* genes found in the nostopeptolide (*nos*) gene cluster [18], with 96% nucleotide sequence identity in both instances; they putatively encode a zinc-dependent long-chain dehydrogenase and a Δ1-pyrroline-5-carboxylic acid reductase. Further upstream, ORF3 (HUN01\_34360) (798 bp) shows 98% homology to an unknown gene from the AF204805.2 gene cluster, which was previously suggested to be involved in 4-methylproline biosynthesis [17,25] because of its close proximity to the downstream genes responsible for this reaction, although no experimental evidence was presented. Alignment of the sequence of this putative protein has shown some sequence homology to 4′-phosphopantetheinyl transferase, which is crucial for PCP aminoacyl substrate binding (Figure 4) [26]. Moreover, the partially present adenylate-forming domain within ORF4 (HUN01\_34365) (165 bp) belongs to the acyl- and aryl-CoA ligase family and may putatively engage the substrate for post-translational modification of the PCP domain. Facing the same direction, ORF5 (HUN01\_34370) (1605 bp), bearing a putative domain classified as a transpeptidase superfamily DD-carboxypeptidase, and ORF6 (HUN01\_34375) (2010 bp), homologous to an ABC transporter ATP-binding protein/permease, may be engaged in the transport of the *ncpAB* peptide product [27]. For ORF7 (HUN01\_34780), neither a Shine–Dalgarno (SD) sequence upstream of the translation start codon nor a TA-like signal ~12 nucleotides upstream could be identified.


**Figure 4.** Structure-based sequence alignment of 4′-phosphopantetheinyl transferase and partial ORF3. Amino acids highlighted in black indicate conserved residues, whereas those in grey indicate conservative mutations.

The main part of the Ncp biosynthetic gene cluster is located on the forward strand and comprises two large genes whose nucleotide sequences are over 80% homologous to the *ncpA* and *ncpB* subunits of the *ncp* cluster in *Nostoc* sp. ATCC53789 [17]. Both genes code for proteins consisting of repetitive modules, each of which incorporates a single residue into the elongating peptide. ORF8 (HUN01\_34785) (11,334 bp) encompasses three of these modules, whereas ORF9 (HUN01\_34380) (14,157 bp) encodes four modules. The core of one NRPS module consists of three successive domains: condensation (C), adenylation (A) and peptidyl carrier protein (PCP). Moreover, adjacent to the coding spans of the outermost modules, two tailoring domains were found within the ORF8 and ORF9 genes (Figure 5).

**Figure 5.** Schematic representation of conserved domains within the *ncpA* and *ncpB* coding nucleotide sequences. They are composed of repetitive modules of condensation (C), adenylation (A) and peptidyl carrier protein (PCP) domains, adjacent to the delineating docking, epimerization and reductase domains, and are aligned with two related synthetases, AY167420.1 and CP026681.1. The analysis was conducted at the nucleotide level.

Alignment of nucleotide sequences to the *ncpAB* operon revealed major differences in the consecutive NcpB3 and NcpB4 modules. Using the selected spans together with a conserved domain search allowed us to distinguish and compare the C, A and PCP amino acid sequences (Figure 6). The internal modules of the NRPS, with the exception of the NcpB3 adenylation domain sequence, were found to be more than 91% homologous, whereas the outermost modules showed the largest differences in composition, ranging from 13–15% to 24% in the NcpB4 adenylation domain (Figure 6).
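The percentages quoted for the domain comparisons depend on how identity is counted. The following is a minimal sketch of one common convention (identical columns divided by the aligned columns in which neither sequence has a gap); it is not necessarily the convention used to produce Figure 6, and the function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over gap-free columns of two pre-aligned sequences.

    Assumption: both strings come from the same pairwise alignment and
    therefore have equal length; gapped columns are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Toy example (hypothetical sequences, not taken from the ncp cluster).
print(percent_identity("ACGT-A", "ACCTTA"))
```

Other conventions (for example, dividing by the full alignment length including gaps) give lower values for the same alignment, so the denominator should be stated whenever such percentages are compared.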

**Figure 6.** Heatmap of the highest (light blue) and lowest (black) percentage of similarities between NcpA and NcpB domains in *N*. *edaphicum* CCNP1411 and ATCC53789; values scaled by rows. The analysis was conducted at the amino acid level.

The determination of the whole genome sequence of *N*. *edaphicum* CCNP1411 allowed us to analyze the genes coding for enzymes involved in the synthesis of nostocyclopeptides. The general analysis demonstrated homology of the NRPS/PKS clusters of *N*. *edaphicum* CCNP1411 to systems occurring in other cyanobacteria, although with some differences. The non-ribosomal consensus code [28,29] allowed us to recognize and predict the substrate specificities of the NRPS adenylation domains: tyrosine (NcpA1), glycine (NcpA2) and glutamine (NcpA3) for NcpA, and isoleucine/valine (NcpB1), serine (NcpB2), 4-methylproline/proline (NcpB3) and phenylalanine/leucine/tyrosine (NcpB4) for NcpB (Table 2). This prediction was found to be in line with the structures of the Ncps detected in *N*. *edaphicum* CCNP1411.
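The agreement between the predicted adenylation domain specificities and the detected peptides can be checked position by position. The sketch below is an illustration only: the residue sets are copied from the predictions listed above, the peptide strings follow the putative structures given in this paper, and the function name is hypothetical. Stereochemistry and the C-terminal aldehyde or imine form are deliberately ignored.

```python
# Predicted substrates per module, in assembly-line order (from the text).
PREDICTED = [
    ("NcpA1", {"Tyr"}),
    ("NcpA2", {"Gly"}),
    ("NcpA3", {"Gln"}),
    ("NcpB1", {"Ile", "Val"}),
    ("NcpB2", {"Ser"}),
    ("NcpB3", {"MePro", "Pro"}),
    ("NcpB4", {"Phe", "Leu", "Tyr"}),
]


def mismatches(peptide):
    """Return (position, module, residue) for residues outside the prediction.

    The peptide is a dash-separated residue string; shorter peptides are
    compared only against the modules they reach.
    """
    found = []
    for pos, (res, (module, allowed)) in enumerate(
            zip(peptide.split("-"), PREDICTED), start=1):
        if res not in allowed:
            found.append((pos, module, res))
    return found


peptides = {
    "Ncp-E1-L": "Tyr-Gly-Gln-Ile-Ser-Pro-Phe",
    "Ncp-E2": "Tyr-Gly-Gln-Ile-Ser-Pro-Leu",
    "Ncp-E4-L": "Tyr-Gly-Gln-Ile-Ser-MePro",
}
for name, sequence in peptides.items():
    print(name, mismatches(sequence) or "consistent with prediction")
```

An empty result only means that each residue falls within the predicted set; it does not distinguish isobaric residues such as Ile and Leu.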


**Table 2.** Characterization of the substrate-binding pocket amino acid residues of the adenylation domains of the NcpA and NcpB modules, based on the gramicidin S synthetase (GrsA) phenylalanine-activating domain. Residues in brackets mark inconsistencies with the AY167420.1 residues.

To transfer the elongating product onto the subsequent condensation domain, the studied synthetase utilizes PCP domains, the subunits responsible for thiolation of nascent peptide intermediates, in which post-translational modification of a conserved serine residue shifts the domain from the inactive *apo* state to the active *holo* state. This residue is modified by a PPTase, which covalently attaches the 4′-phosphopantetheine arm of CoA to the PCP active site, enabling peptide intermediates to bind as reactive thioesters. The serine residue whose hydroxyl group takes part in the nucleophilic attack was conserved in every module, within the PCP domain, at the predicted start of the second helix [30].

The stand-alone docking domain (D) (7,617,812–7,617,964 bp) found at the N-terminus of NcpA may be an essential component mediating interactions, recognition and specific association within NRPS subunits. The potential acceptor domain, based on sequence homology of conserved residues to C-terminal communication-mediating donor domains (COM), was found at the second helix of the NcpB4 PCP domain, encompassing the conserved serine residue within the potential binding sequence [31]. Moreover, this communication-mediating domain may putatively bind to the C-termini of the NcpB3 and NcpB4 condensation (C) domains, based on the conserved motif LL**E**G**I**V, found by sequence homology to the last five amino acids of C-terminal docking domains, which are key for their interactions [32]. Within the same β-hairpin, a group of charged residues (ExxxxxKxR) putatively determines the binding affinity of the N-terminal domain [33].

Two tailoring domains encoded at the 5′ ends of the *ncpA* and *ncpB* genes were classified as an epimerization (E) domain (7,627,742–7,629,043 bp) and a reductase (R) domain (7,642,183–7,643,238 bp), respectively. The epimerization domain catalyzes the conversion of L-amino acids to D-amino acids, a reaction consistent with the D-stereochemistry of the glutamine residue of the peptide, where His of the conserved HHxxxDG motif and Glu from the upstream EGHGRE motif raceB comprise the active site of the epimerization reaction [34]. A homologous conserved HHxxxDG motif is found in condensation (C) domains, where a similar reaction is catalyzed during peptide bond formation, putatively by the second His residue [35]. As in the *ncp* cluster [17], the NcpA1 module motif, HQIVGDL, is degenerate in two positions, with leucine instead of phenylalanine at the start of the helix. Site-directed mutagenesis of the second histidine has been shown to abolish enzymatic activity [36], which might suggest that the NcpA1 condensation domain is inactive.
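The comparison of an observed stretch with a consensus motif such as HHxxxDG can be made explicit. The following is a minimal sketch, assuming a literal reading of the consensus in which `x` stands for any residue; the helper names are hypothetical, and the longer sequence in the usage note is invented for illustration.

```python
import re


def motif_to_regex(motif):
    """Translate a consensus such as 'HHxxxDG' into a compiled regex."""
    return re.compile("".join("." if c == "x" else re.escape(c) for c in motif))


def degenerate_positions(motif, observed):
    """1-based positions where an equal-length stretch departs from the consensus."""
    if len(motif) != len(observed):
        raise ValueError("motif and observed stretch must have equal length")
    return [i for i, (m, o) in enumerate(zip(motif, observed), start=1)
            if m != "x" and m != o]


# NcpA1 condensation domain stretch quoted in the text.
print(degenerate_positions("HHxxxDG", "HQIVGDL"))            # [2, 7]
print(motif_to_regex("HHxxxDG").search("HQIVGDL") is None)   # True
```

Under this reading, the NcpA1 stretch departs from the consensus at the second histidine and at the final glycine position. The same helper can scan a longer translated sequence, for example `motif_to_regex("GxxGxxG").search(protein)` for the nucleotide-binding motif of a reductase domain, where `protein` is whatever amino acid string is being examined.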

The reductase domain (R) found at the C-terminus of the NRPS was classified as an oxidoreductase. Despite a 15% discrepancy in domain composition compared to NcpB, the positions of the core catalytic triad Thr-Tyr-Lys and of the Rossmann-fold NAD(P)H nucleotide-binding motif GxxGxxG were not affected. The chain release mechanism uses the NAD(P)H cofactor to reduce the final moiety of the nascent peptide to an aldehyde or alcohol [37,38].
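The chromosomal coordinates quoted in this section can be cross-checked with simple interval arithmetic. The sketch below assumes that all coordinates are 1-based and inclusive (an assumption; the convention is not stated in the text) and uses hypothetical helper names.

```python
# Coordinates copied from the text; 1-based inclusive ends are assumed.
NCP_REGION = (7_609_981, 7_643_289)
DOMAINS = {
    "docking (D)": (7_617_812, 7_617_964),
    "epimerization (E)": (7_627_742, 7_629_043),
    "reductase (R)": (7_642_183, 7_643_238),
}


def span_length(span):
    """Length in bp of an inclusive (start, end) interval."""
    start, end = span
    return end - start + 1


def contains(outer, inner):
    """True if the inner interval lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


for name, span in DOMAINS.items():
    length = span_length(span)
    codons = length // 3 if length % 3 == 0 else None
    print(f"{name}: {length} bp, {codons} codons, "
          f"inside ncp region: {contains(NCP_REGION, span)}")
```

With these assumptions the three domain spans are 153, 1302 and 1056 bp long, each a whole number of codons, and all three fall inside the 7,609,981–7,643,289 region assigned to the nostocyclopeptide cluster.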

## *2.3. Structure Characterization of Ncps Produced by N. edaphicum CCNP1411*

Thus far, only three Ncps, Ncp-A1, A2 and M1, and their linear aldehydes have been isolated as pure natural products of *Nostoc* strains [13,16]. Ncp-A3, with MePhe in the C-terminal position, was obtained through aberrant biosynthesis in the *Nostoc* sp. ATCC53789 culture supplemented with MePhe [13]. The linear aldehydes of Ncp-A1 and Ncp-A2, with Pro instead of MePro, were chemically synthesized and used to study the Ncp epimerization and macrocyclization equilibria [19,20]. In our work, ten Ncps, differing mainly in positions 4 and 7, were detected by LC-MS/MS in the *N*. *edaphicum* CCNP1411 cell extract (Table 3, Figure 1, Figure 7, Figure 8 and Figures S1–S7). These include five cyclic structures, four linear Ncp aldehydes, and one linear hexapeptide Ncp. The putative structures of the six peptides that were found to be naturally produced by *Nostoc* for the first time are marked in bold in Table 3 (Ncp-E1, Ncp-E1-L, Ncp-E2, Ncp-E2-L, Ncp-E3 and Ncp-E4-L).


**Table 3.** The putative structures of nostocyclopeptides (Ncps) detected in the crude extract of *N*. *edaphicum* CCNP1411 and the structure of Ncp-M1 identified in *Nostoc* sp. XSPORK 13 A [16]. The new analogues are marked in bold and the peptides detected in trace amounts are marked with [T]. The variable residues in positions 4 and 7 are marked in blue.

\* Ncp-E4-L is the only linear Ncp analogue with a carboxyl group at the C-terminus. \*\* Identified in *Nostoc* sp. XSPORK [16].

**Figure 7.** Postulated structure and enhanced product ion mass spectrum of the linear aldehyde nostocyclopeptide **Ncp-E1-L** [Tyr+Gly+Gln+Ile+Ser+Pro+Phe] characterized based on the following fragment ions: *m*/*z* 795 [M+H]; 777 [M+H−H2O]; 759 [M+H−2H2O]; 646 [M+H−Phe]; 614 [M+H−Tyr−H2O]; 575 [M+H−(Tyr+Gly)]; 549 [Tyr+Gly+Gln+Ile+Ser+H]; 531 [Tyr+Gly+Gln+Ile+Ser+H−H2O]; 462 [Tyr+Gly+Gln+Ile+H]; 349 [Tyr+Gly+Gln+H]; 334 [Ser+Pro+Phe+2H]; 247 [Phe+Pro+H]; 229 [Phe+Pro+H]; 201 [Phe+Pro+H−CO]; 148 [Tyr−NH2]; 136 Tyr immonium; 129, 101 (immonium), 84 Gln; 70 Pro immonium.

The process of *de novo* structure elucidation was performed manually, based mainly on a series of b and y fragment ions produced by cleavage of the peptide bonds (Figures 7–9, Figures S1–S7) and on the presence of immonium ions (e.g., *m*/*z* 70 for Pro, 84 for MePro, 136 for Tyr) in the product ion mass spectra of the peptides. The structure characterization was additionally supported by the previously published MS/MS spectra of Ncps [14]. The fragment ions derived from the two amino acids at the C-terminus usually belonged to the most intense ions in the spectra, and in this study they facilitated the structure characterization. For example, in the product ion mass spectra of Ncp-A1 (Figure S1) and Ncp-E3 (Figure S7), ions at *m*/*z* 209 [MePro+Leu+H] and *m*/*z* 181 [MePro+Leu+H−CO] were present, while in the spectrum of Ncp-E2 (Figure S5), with Pro (instead of MePro), the corresponding ions were observed at *m*/*z* values 14 units lower, i.e., 195 and 167. The spectra of the linear Ncps contained an intense Tyr immonium ion at *m*/*z* 136. Based on the previously determined structures of Ncp-A1 and Ncp-A2 [13], we assumed that in Ncp-E2 the amino acids in positions 4 and 7 are Ile and Leu, respectively (Table 3; Figure S5). These two amino acids are difficult to distinguish by MS, so NMR analyses are required to confirm the structures of the Ncps. The presence of Val in position 4, instead of Ile, distinguishes Ncp-E3 from the other Ncps produced by *N*. *edaphicum* CCNP1411. As previously reported [17], and also confirmed in this study, the predicted substrates of the NcpB1 protein encoded by *ncpB* and involved in the incorporation of the residue in position 4 are Ile/Leu and Val. However, the domain preferentially activates Ile, which explains why only traces of Val-containing Ncps were detected in *N*. *edaphicum* CCNP1411 (Table 3).
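The b-ion series used for the manual assignment can be reproduced from residue masses. The sketch below is an illustration, not the procedure used in this study: it uses standard monoisotopic residue masses (MePro taken as C6H9NO), assumes singly charged ions, and restricts itself to N-terminal fragments so that the form of the C-terminal residue (aldehyde, imine or acid) does not enter the calculation. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; MePro = C6H9NO).
RESIDUE = {
    "Gly": 57.02146, "Ser": 87.03203, "Pro": 97.05276,
    "MePro": 111.06841, "Ile": 113.08406, "Gln": 128.05858,
    "Tyr": 163.06333,
}
PROTON = 1.00728   # mass of H+
CO = 27.99491      # carbon monoxide, lost on going from a b ion to an a ion


def b_ions(residues):
    """m/z of singly charged b ions b1..bn for an N-terminal residue list."""
    total, ions = 0.0, []
    for name in residues:
        total += RESIDUE[name]
        ions.append(total + PROTON)
    return ions


def a_ions(residues):
    """m/z of the matching a ions (b ion minus CO)."""
    return [mz - CO for mz in b_ions(residues)]


prefix = ["Tyr", "Gly", "Gln", "Ile", "Ser"]
for i, (b, a) in enumerate(zip(b_ions(prefix), a_ions(prefix)), start=1):
    print(f"b{i} = {b:.2f}   a{i} = {a:.2f}")

# Mass shift between MePro- and Pro-containing fragments (one CH2 unit).
print(f"MePro - Pro = {RESIDUE['MePro'] - RESIDUE['Pro']:.3f}")
```

The nominal values obtained for b2 to b5 (221, 349, 462 and 549) and for a2, a4 and a5 (193, 434 and 521) agree with the assignments listed in the captions of Figures 7 and 8, and the MePro/Pro difference reproduces the 14-unit shift described above.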

**Figure 8.** Postulated structure and enhanced product ion mass spectrum of a linear nostocyclopeptide **Ncp-E4-L** [Tyr+Gly+Gln+Ile+Ser+MePro] characterized based on the following fragment ions: *m*/*z* 677 [M+H]; 659 [M+H−H2O]; 642 [M+H−H2O−NH3]; 549 [Tyr+Gly+Gln+Ile+Ser+H]; 531 [Tyr+Gly+Gln+Ile+Ser+H−H2O]; 521 [Tyr+Gly+Gln+Ile+Ser+H−CO]; 462 [Tyr+Gly+Gln+Ile+H]; 434 [Tyr+Gly+Gln+Ile+H−CO]; 349 [Tyr+Gly+Gln+H]; 329 [Gln+Ile+Ser+H]; 312 [Ile+Ser+MePro+H]; 221 [Tyr+Gly+H]; 193 [Tyr+Gly+H−CO]; 148 [Tyr−NH2]; 136 Tyr immonium; 86 Ile immonium; 84, 101 (immonium), 129 Gln; 84 MePro immonium.

Methylated Pro (MePro) in position 6 is quite conserved. MePro is a rare non-proteinogenic amino acid biosynthesized from Leu through the activity of the zinc-dependent long-chain dehydrogenases and the Δ1-pyrroline-5-carboxylic acid (P5C) reductase homologue encoded by the gene cassette *ncpCDE* [17,18,25]. The genes involved in the biosynthesis of MePro were found in 30 of the 116 tested cyanobacterial strains, the majority (80%) of which belonged to the genus *Nostoc* [39]. The new Ncp-E1 and Ncp-E2, detected in trace amounts, are the only Ncps produced by *N*. *edaphicum* CCNP1411 that contain Pro (Table 3). The presence of the *m*/*z* 84 ion in the fragmentation spectra of these two Ncps complicated the process of *de novo* structure elucidation. This ion corresponds to the immonium ion of MePro and could indicate the presence of this residue. However, the two ions at *m*/*z* 101 and 129, which together with the ion at *m*/*z* 84 are characteristic of Gln, suggested the presence of Gln in Ncp-E1 and Ncp-E2. The detailed characterization of the Ncp fragmentation pathways is presented in Figures 7–9 and in the Supplementary Materials (Figures S1–S7).
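The ambiguity of the *m*/*z* 84 signal can be illustrated numerically. The sketch below uses standard monoisotopic masses and assumes that the Gln-related ion at *m*/*z* 84 is the Gln immonium ion after loss of NH3; this assignment is a common interpretation and is stated here as an assumption, not as a result of this study. The function name is hypothetical.

```python
PROTON = 1.00728   # mass of H+
CO = 27.99491      # lost when a residue forms its immonium ion
NH3 = 17.02655     # ammonia

# Monoisotopic residue masses in Da (MePro = C6H9NO).
RESIDUE = {"Pro": 97.05276, "MePro": 111.06841,
           "Gln": 128.05858, "Tyr": 163.06333}


def immonium(name):
    """m/z of the singly charged immonium ion of a residue."""
    return RESIDUE[name] - CO + PROTON


mepro_immonium = immonium("MePro")
gln_immonium = immonium("Gln")
gln_minus_nh3 = gln_immonium - NH3   # assumed origin of the Gln ion at 84

for label, mz in [("Pro immonium", immonium("Pro")),
                  ("Tyr immonium", immonium("Tyr")),
                  ("MePro immonium", mepro_immonium),
                  ("Gln immonium", gln_immonium),
                  ("Gln immonium - NH3", gln_minus_nh3)]:
    print(f"{label:20s} {mz:8.4f}  nominal {round(mz)}")

print(f"difference at nominal m/z 84: {mepro_immonium - gln_minus_nh3:.4f} Da")
```

Both candidate ions have the nominal value 84 and differ by only a few hundredths of a dalton, so at unit mass resolution they cannot be separated and the accompanying Gln ions at *m*/*z* 101 and 129 are needed to decide the assignment.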

**Figure 9.** Enhanced product ion mass spectrum of the cyclic nostocyclopeptide **Ncp-E1** with putative structure cyclo[Tyr+Gly+Gln+Ile+Ser+Pro+Phe] characterized based on the following fragment ions: *m*/*z* 777 [M+H]; 759 [M+H−H2O]; 741 [M+H−2H2O]; 690 [M+H−Ser]; 672 [M+H−Ser−H2O]; 662 [M+H−Pro−H2O]; 646 [M+H−Phe]; 628 [M+H−Phe−H2O]; 575 [M+H−(Ser+Pro)−H2O]; 549 [Tyr+Gly+Gln+Ile+Ser+H]; 480 [Phe+Tyr+Gly+Gln+H]; 462 [Tyr+Gly+Gln+Ile+H]; 444 [Tyr+Gly+Gln+Ile+H−H2O]; 434 [Tyr+Gly+Gln+Ile+H−CO]; 392 [Pro+Phe+Tyr+H]; 352 [Phe+Tyr+Gly+H]; 335 [Phe+Tyr+Gly+H−H2O]; 316 [Ser+Pro+Phe+H]; 307 [Phe+Tyr+Gly+H−H2O−CO]; 298 [Ile+Ser+Pro+H]; 229 [Phe+Pro+H]; 201 [Phe+Pro+H−CO]; 158 [Gly+Gln+H−CO]; 132 Phe; 70 Pro immonium. Structure of the peptide is presented in Figure 1.

In addition to the heptapeptide Ncps, *N*. *edaphicum* CCNP1411 produces a small amount of the linear hexapeptide Ncp-E4-L, whose putative structure is Tyr+Gly+Gln+Ile+Ser+MePro (Table 3, Figure 8). This Ncp was detected only when a higher biomass of *Nostoc* was extracted. As the proposed amino acid sequence of this molecule is the same as the sequence of the first six residues in Ncp-A1 and Ncp-A2, the hexapeptide may be a precursor of the two Ncps. The other option is that the cell concentration of the Ncps is self-regulated and Ncp-E4-L is released through proteolytic cleavage of the final products. This hypothesis could be verified once the role of the Ncps for the producer is discovered. In the *ncp* gene cluster, the presence of *ncpG*, encoding the NcpG peptidase with high homology to enzymes hydrolyzing D-amino acid-containing peptides, was revealed by Becker et al. [17] and also confirmed in this study. Therefore, in-cell degradation of Ncps by the NcpG peptidase is possible, but it probably proceeds at D-Gln and gives products other than Ncp-E4-L.
