**Genome Mining and Synthetic Biology in Marine Natural Products Discovery**

Editor

**Maria Costantini**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Maria Costantini Stazione Zoologica Anton Dohrn Italy

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Marine Drugs* (ISSN 1660-3397) (available at: https://www.mdpi.com/journal/marinedrugs/ special issues/gnm mn syn bio).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-0254-0 (Hbk) ISBN 978-3-0365-0255-7 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editor**

**Maria Costantini** is a molecular biologist. Her research is focused on studying secondary metabolites from marine organisms as a source of stress as well as biologically and pharmacologically active compounds, using molecular biology approaches.

The aims of her current research in molecular ecology include:


## *Editorial* **Genome Mining and Synthetic Biology in Marine Natural Product Discovery**

#### **Maria Costantini**

Department of Marine Biotechnology, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Napoli, Italy; maria.costantini@szn.it; Tel.: +39-081-583-3285

Received: 17 November 2020; Accepted: 22 November 2020; Published: 3 December 2020

Oceans cover nearly 70% of the earth's surface and host a huge ecological, chemical, and biological diversity. The natural conditions of the sea favor, in marine organisms, the production of a large variety of novel molecules with great pharmaceutical potential. Marine organisms are unique in their structural and functional features compared to terrestrial ones [1,2]. Innovation in this field is very rapid, as revealed by the funding of several Seventh Framework Programme (FP7) and Horizon 2020 projects under the topic "Blue Growth". The major parts of these projects have the common final goal of meeting the urgent need to discover new drug entities to counteract the increased incidence of severe diseases (such as cancer) and the reduced efficacy of existing drugs [3].

For this reason, the application of molecular biology techniques in biotechnological field is very important. Driven by the rapidly-decreasing cost and increasing throughput of DNA-sequencing technologies, significant progress in genomics has renewed interest in the discovery of natural products. Rapidly-expanding genomic and metagenomic datasets reveal a vast number of biosynthetic gene clusters (BGCs) in nature, which are predicted to encode novel biomedically-relevant molecules. This 'top-down' discovery approach can only provide access to a small fraction of biosynthetic compounds, given that the majority of microorganisms cannot be isolated or cultured. Furthermore, even in culturable organisms, the encoded secondary metabolites of many biosynthetic gene clusters are unknown. This may be due to strong down-regulation of product biosynthesis at the transcriptional, translational, and/or post-translational levels in the absence of the right activating cues in the laboratory. Alternatively, secondary metabolites that are produced at very low yields may escape detection and characterization. Natural product discovery, which used to be predominantly an analytical chemistry problem, has become a challenge in genomics and molecular biology with respect to how to manipulate relevant genes and sequences to produce the desired encoded metabolites.

In recent years, marine genomics has grown rapidly, helped by the large amount of information that is are becoming available to the international scientific community. Taking into account the current excitement in the field of marine biotechnology, we have assembled this Special Issue entitled "Genome Mining and Synthetic Biology in Marine Natural Product Discovery". The aim of this Special Issue is to assess the impact of these molecular approaches on the discovery of bioactive compounds from marine organisms. The term "genome mining" is used to identify all bioinformatic investigation aimed at detecting the biosynthetic pathways of bioactive natural products and their possible functional and chemical interactions [4]. Genome mining includes the identification of BGCs of previously-uncharacterized natural products within the genomes of sequenced organisms, sequence analysis of the enzymes encoded by these gene clusters, and the experimental identification of the products of the gene clusters [5].

This special issue includes one review article and four original studies on the application of genome mining and synthetic biology in promoting marine natural products for biotechnological applications.

An up-to-date view on the importance of genome mining to identify new natural products is provided in the review by Albarano et al. [6] In the first part of this review, the significance of genome mining and other associated techniques is discussed, as well as their possible advantages and disadvantages. One of the strengths of using genome mining is the detection of a large amount of bioactive natural compounds. Genome mining has the advantages of being relatively cheap and easy to apply in the laboratory and not requiring particular skills and/or experience by the operators. The weaknesses of this bioinformatic approach are that: (1) only known biosynthetic gene clusters can be identified and (2) it is not possible to predict the biotechnological potential of the natural products under analysis, as well as to formulate their chemical structures. In addition, the authors include not only marine organisms but also terrestrial ones, considering that molecular approaches were only recently applied to the discovery of bioactive compounds from the sea.

A very important point arising from this review is that the available literature on genome mining mainly concerns bacteria, from which the majority of new natural drugs are isolated. In fact, Gao et al. [7] report on the cloning, recombinant expression, and characterization of the ulvan lyase (able to degrade ulvan to oligosaccharides) gene, *ALT3695*, from marine bacterium *Alteromonas* sp. A321. Its amino acid sequence exhibits low sequence identity with the polysaccharide lyase family 25 (PL25) ulvan lyases from *Pseudoalteromonas* sp. PLSV (64.14%), *Alteromonas* sp. LOR (62.68%), and *Nonlabens ulvanivorans* PLR (57.37%). Ulvan is a type of cell wall polysaccharide, extracted from marine green seaweed (belonging to the genera *Ulva* and *Enteromorpha*), with a great biotechnological potential thanks to its various pharmacological properties including antioxidant, anticoagulant, antitumor, antihyperlipidemic, and antiviral activities. Furthermore, ulvan-derived oligosaccharides has applications in the functional food and pharmaceutical industries. ALT3695 exhibits excellent stability and substrate affinity, showing that it could be used for the preparation of ulvan oligosaccharides.

Other important sources of natural drugs are the fungi. Examples of such drugs are penicillin, cephalosporin, ergotrate, and the statins. Guided by bioinformatic analysis of the genomic sequence of a marine-derived fungal strain of *Penicillium dipodomyis* YJ-11, Yu et al. [8] report for the first time on the application of global regulator LaeA in *P. dipodomyis* (PdLaeA), aiming to the increase the secondary metabolite. In fact, this regulator is able to influence fungi in many aspects of their life, such as increasing or reducing secondary metabolite production, activating cryptic gene clusters, asexual and sexual differentiation, sporulation, and pigmentation. The overexpression of the PdLaeA gene in *P. dipodomyis* YJ-11 lead to the discovery of two new compounds, 10,11-dihydrobislongiquinolide and 10,11,16,17-tetrahydrobislongiquinolide, together with four known sorbicillin analogues.

Microalgae are eukaryotic photosynthetic microorganisms, which can be cultivated in large quantities, and for this reason they are receiving increasing attention from biotechnologists. They are able to produce a wide diversity of species and natural products with pharmaceutical, nutraceutical, and cosmeceutical applications. Riccio et al. [9] investigated the presence of genes involved in monogalactosyldiacylglycerols (MGDGs) and sulfoglycolipid synthesis in microalgae, considering all the microalgal transcriptomes and metatranscriptomes available from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) and the recent Tara Oceans and Global Ocean sampling expedition. In particular, the presence of MGDG synthase (MGD), UDP-sulfoquinovose synthase (SQD1), and sulfoquinovosyltransferase (SQD2) were identified in different microalgal classes. Not all the sequences are expressed within each taxa analyzed, suggesting the absence of a specific gene. This study prompted the discovery of new bioactive species with new insights into enzyme discovery from microalgae for biotechnological applications.

Zhu et al. [10] performed research on the genome of *Thraustochytriidae* sp. SZU445, a single-celled eukaryotic marine protist belonging to the class Labyrinthulomycetes. Using a third-generation sequencing platform, fatty acid synthesis pathways, specifically for docosapentaenoic acid (DHA) docosahexaenoic acid (DPA) were identified, predicting and analyzing the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. This represents a very important step for the scientific community because these marine protists can be cultivated to yield high biomasses containing substantial quantities of lipids abundant in polyunsaturated fatty acids (PUFAs), and so also accumulating large amounts of DHA.

Even though these new molecular approaches need to be ecxpanded, this Special Issue of Marine Drugs provides an interesting panorama of current research on genome mining and other associated experimental techniques. As Guest Editor, I trust that this collection will help the scientific community to inspire the next generation of marine biotechnologies. In fact, molecular and bioinformatic approaches permit us to overcome the over-utilization of marine resources and the use of destructive collection practices, and to apply a more environmentally-friendly approach to drug discovery

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Review* **Genome Mining as New Challenge in Natural Products Discovery**

#### **Luisa Albarano 1,2,**†**, Roberta Esposito 1,2,**†**, Nadia Ruocco <sup>1</sup> and Maria Costantini 1,\***


Received: 4 March 2020; Accepted: 3 April 2020; Published: 9 April 2020

**Abstract:** Drug discovery is based on bioactivity screening of natural sources, traditionally represented by bacteria fungi and plants. Bioactive natural products and their secondary metabolites have represented the main source for new therapeutic agents, used as drug leads for new antibiotics and anticancer agents. After the discovery of the first biosynthetic genes in the last decades, the researchers had in their hands the tool to understand the biosynthetic logic and genetic basis leading to the production of these compounds. Furthermore, in the genomic era, in which the number of available genomes is increasing, genome mining joined to synthetic biology are offering a significant help in drug discovery. In the present review we discuss the importance of genome mining and synthetic biology approaches to identify new natural products, also underlining considering the possible advantages and disadvantages of this technique. Moreover, we debate the associated techniques that can be applied following to genome mining for validation of data. Finally, we review on the literature describing all novel natural drugs isolated from bacteria, fungi, and other living organisms, not only from the marine environment, by a genome-mining approach, focusing on the literature available in the last ten years.

**Keywords:** bacteria; fungi; genome mining; natural products; synthetic biology

#### **1. Introduction on Bioactive Natural Products Isolation**

Nature is an important source of bioactive products and their derivatives (secondary metabolites), which form part of many important drugs formation widely used in the clinic field [1]. In fact, as reported in Newman and Cragg [2], over the last 30 years the great majority of anticancer, anti-infective, and anti-bacterial drugs are represented by natural products and their derivatives, produced by all organisms (from bacteria to plants, invertebrate, and other animals) with different chemical structure and leading to several biological activities [3,4]. Furthermore, these secondary metabolites have influenced the development of several drugs, including antibacterial, anticancer, and anti-cholesterol agents [5]. Several of these bioactive products are derived from microorganisms, such as fungi and bacteria [6], which have represented an important source of antibiotics and many other medicines [7,8]. In particular many bacteria deriving from the marine environment, particularly those found in association with marine invertebrates (such as sponges), are able to produce secondary metabolites with potential anticancer and antifungal roles because of their cytotoxic properties [9,10]. Considering the great problem of the antimicrobial resistance increase and its high impact on human health, there is an important need of searching for new natural products that could therefore remedy this issue [11,12]. For these reasons, in the past decade, genomic science has been used to identify the possible drug targets and to find novel genes cluster for the biosynthesis of natural products [13]. The development

of the genome sequencing technologies to find novel metabolites has surely drown the attention of pharmaceutical industries, which had by now lost interest in natural products due to the advent of combinatorial chemistry [14]. The advent of based-genome sequencing techniques, especially with establishment of genome mining, has allowed to obtain new natural drugs in a faster and cheaper way.

#### *Genome Mining*

The term "genome mining" are associated to every bioinformatics investigation used to detect not only the biosynthetic pathway of bioactive natural products, but also their possible functional and chemical interactions [15]. Specifically, the genome mining involves the identification of previously uncharacterized natural product biosynthetic gene clusters within the genomes of sequenced organisms, sequence analysis of the enzymes encoded by these gene clusters, together with the experimental identification of the products of the gene clusters (Figure 1; [16]).

**Figure 1.** Associated techniques (categorized as molecular biology techniques, chemical analysis, cellular biology techniques, and bioinformatic analysis) to genome mining for validation of data, leading together to drug discovery.

Genome mining is entirely dependent on computing technology and bioinformatics tools. About this point, a huge amount of data, consisting of DNA sequences and their annotations, are now deposited in publicly accessible databases. The storage and handling of these resources relies on the continued development of computers and the networks. Once all the genes within a new genome are identified, they can be compared with those of known functions in the public databases. Both raw and annotated genomic data, as well as bioinformatics tools, for sequence comparisons are freely available through the different websites. It also important to keep in mind that it is now a mandatory publication prerequisite of most scientific journals that sequence data from research involving novel DNA sequences is deposited in a publicly accessible database.

In the case for which the sequences of many proteins, encoding for enzymes, involved in natural product biosynthesis are deposited in these databases, it is relatively easy to identify pathways in which they are involved by sequence comparisons. The availability of these synthesis enzymes and the pathways in which they operate, together with the sequence comparisons with genes from which they arise, can certainly be used to identify homologs, and potentially the pathways, in the new organism under analysis. However, it is important to consider that many enzymes are similar in sequences but follow chemical processes that are slightly different, leading to a different pathway or very different final end product.

Furthermore, genome mining has a strong support by synthetic biology, consisting in the design and the construction of new biological, as for examples enzymes, genetic circuits and/or the redesign of existing biological systems. These combined approaches are mainly used to detect novel natural products in bacteria and fungi probably because of operon organization of their synthesis genes [13], allowing the control of transcriptional levels and also the association of their potential metabolic function [17]. Moreover, the central role of genome mining consists in finding new biosynthetic gene clusters (BGCs). In fact, the BGCs encode for two class of enzymes, polyketide synthases (PKS) and non-ribosomal peptide synthases (NRPS), which are the two most important biosynthetic routes responsible for the formation of natural products [18]. This approach also provides the possibility to compare target gene clusters to known gene clusters useful for the prediction of their function and structure using different associated web database [5]. In fact, although the genome mining allowed to find and identify the gene clusters responsible for the production of natural product synthesis, in the last decade web tools and databases have been integrated to improve the performance of this approach [15]. This scientific progress has enabled the development of three important web tools: (i) "antibiotics and Secondary Metabolite Analysis SHell" (antiSMASH), its first version was issued in 2011 and it is a web server able to associate the gene clusters identification with a series of specific algorithms for compounds analysis [19]. Therefore, this approach performs the prediction of sequences and offers a more detailed analysis of identified gene clusters and consequently gives the predicted image of amino acid stereochemistry structure [5]. (ii) "PRediction Informatics for Secondary Metabolomes" (PRISM), open-web tool, consisting of a genomic prediction of secondary metabolomes. Using different algorithms that compare the new genetic information with 57 virtual enzymatic reactions (such as adenylation, acyltransferase, and acyl-adenylating), this approach provides the possibility of obtaining a correspondence between known natural drugs and possible new ones [20]. (iii) "Integrated Microbial Genomes Atlas of Biosynthetic gene Clusters" (IMG/ABC) [21], launched in 2015, is a large open web database of known predicted microbial BGCs able to associate BGCs with secondary metabolites (SMs) and analyze both BGCs and SMs. In this way, it offers the ability of finding similar function between BGCs present in database and BGCs to be identified [22].

Starting from these general considerations, in the present review we want to emphasize the significance of genome mining approach to identify new natural products, also underlining the possible advantages and disadvantages of this technique. Moreover, we debate the associated techniques that can be applied following to genome mining for validation of data. Finally, we review the literature describing all novel natural drugs found from bacteria, fungi and other living organisms by genome mining approach, focusing on the examples available in the literature of the last ten years.

#### **2. The Significance Genome Mining in Drug Discovery**

Approximately half of clinically approved drugs (including antibiotics) are represented by natural products and their derivatives. Recently, the development of new bioinformatics, genetics and analytic tools, has provided new strategies for the discovery of natural products of biotechnological interest known as "combinatorial biosynthesis approaches" [23,24]. These techniques, together with bioinformatic approaches, have shown that the ability of organisms (particularly microorganisms) to produce bioactive natural products has been underestimated [25]. These organisms have been deeper explored through the sequencing of their genome and the application of genome mining approaches [26]. Genome analysis has shed light the presence of numerous biosynthetic gene clusters that could be involved in the synthesis of other secondary metabolites defined cryptic or orphan for their unknown origin [25].

Genome mining aims at predicting the genes that encode for new natural compounds of biotechnological interest by using several bioinformatic approaches [21]. The importance of genome mining is based on the urgent need to discover new drug entities due to the increased incidence of severe diseases (such as cancer) and the reduced efficacy of existing drugs [27]. Furthermore, the biosynthetic gene clusters contain elements that can be used to increase the production of both natural and engineered products by promoting costs reduction and their commercial use [26].

#### *2.1. Strengths and Weaknesses of Genome Mining*

As in the case of all approaches, also genome mining has strengths and weaknesses, summarized in Table 1. One of the advantages of using genome mining is to foster the detection of a large amount of bioactive natural compounds [6]. In addition, genome mining approach is relatively cheap and easy to apply in laboratory, and it requires no particular skills and/or experience of the operators [28]. Combining genome mining with genetic engineering techniques will make it possible to achieve maximum diversity of natural products [29]. This bioinformatic approach allows to predict the chemical structure of bioactive natural products, but forecasts are often difficult to formulate [28,29].


**Table 1.** Strengths and weakness in the use of genome mining.

As reported in Wohlleben et al. [6], a great disadvantage of genome mining is that only known biosynthetic gene clusters can be identified [29]. Moreover, with this approach, it is not possible to predict the biological activities of the natural products identified [26]. However, genome mining is still an evolving technique [29], in fact, scientists are trying to improve this bioinformatic tool in order to reduce the limits of this approach.

#### *2.2. Synthetic Biology and Other Experimental Techniques Associated with Genome Mining*

Synthetic biology progresses have been possible thanks to the very recent advent of DNA sequencing and synthesis in molecular biology field. The distinguishable element of synthetic biology respect to the other traditional molecular biology approaches is represented by its focus on the design and construction of components which are core for example of enzymes and metabolic pathways [30]. These genomic assessments joined to microbial diversity provide the fundamental natural libraries for further engineering.

In this review we have not focused our attention on synthetic biology because a great number of reviews on the most obvious and popular applications of synthetic biology methodologies have been published from 2008 to today [30–45].

Natural product production using engineered microorganisms represent the more important application of synthetic biology in the biotechnological field for natural products. The most important of commercialized examples are represented by two compounds produced by fermentation in genetically modified yeast: The semisynthetic malaria drug artemisinin and the first consumer-market synthetic biology product, "natural" vanillin [46,47]. These successful application of synthetic biology opened new perspective in the exploration of microbes as sources of high-value compounds on an industrial scale.

Genome mining is followed by the identification of cryptic pathways using several strategies, known as "combinatorial biosynthesis", which that can be used in order to create novel genetic combinations of structural biosynthetic genes. These methods consist of gene activation/inactivation and mutasynthesis approaches. Gene inactivation involves the creation of a mutant organism, in which the biosynthetic gene cluster becomes inactive, thus eliminating the production of metabolites. The

comparison between mutant organisms can be made by high-performance liquid chromatography/mass spectrometry (HPLC-MS), revealing the natural product absent in the mutant organism [26]. Therefore, gene inactivation needs as evidence of cluster involvement in compound biosynthesis [24]. Secondary metabolites come from precursors of primary metabolism, and their over-production is related to an enhanced protein synthesis [48]. However, in some cases, there are genes that produce specific precursors not provided by the primary metabolism. These precursors are usually used as starting units for example to the production of polyketides synthases (PKS) or non-ribosomal peptide synthetases (NRPS), which in turn produce natural compounds. Inactivation of genes involved in the biosynthesis of these precursors leads to non-productive mutants that can be used for the biosynthesis of new compounds by mutasynthesis or mutational biosynthesis [23,24,26]. If some genes are silent, it would be impossible to produce and test the biological activity of the natural product. It is therefore necessary to apply the activation of silent pathways under the control of a constitutive promoter or inactivating repressors [28]. In the final stages of metabolites biosynthesis, several enzymes such as, transferase, oxygenase, oxidase, peroxidase, and reductase, play a key role for further modifications [26].

#### Examples of Other Experimental Techniques

A method to identify new natural products with biotechnological potential combines the research of coding genes for a specific compound with the detection of bacterial resistance. This approach, called target-directed genome mining, relies on the identification of gene clusters without knowing the molecules produced [49].

Another method to identify a natural product is the one strain/many compounds (OSMAC) approach. This method is based on the systematic alteration of culture media or cultivation parameters to force the expression of cryptic genes. In addition, any metabolism deregulation system can be used to improve the production of secondary metabolites, leading to the discovery of new bioactive compounds. Many of these approaches involve the treatment of known chemicals that modify the structure of chromatin or the use of small molecules that re-shape and regulate secondary metabolism by inhibiting the synthesis of fatty acids [50].

Another technique associated with genome mining is the in vitro reconstruction of biosynthetic pathways that produce natural products. This technique can be used to generate highly pure intermediates, limiting side reactions such as the formation of toxic compounds and reducing protein-protein interactions [51].

Taking into account this background, we review on the new natural drugs found from bacteria, fungi and other living organisms by genome mining approach. We analyzed organisms that derive not only from marine environment but also from the terrestrial ones, considering that the genome mining and other techniques associated with it are still at the beginning for the discovery of bioactive compounds from the sea.

#### *2.3. Bacteria*

The first point that must be underlined is that the most of medicinal products described above derive from bacteria [6] (see Table 2). In fact, the available literature on genome mining mainly concerns these microorganisms. Specifically, soil and marine bacteria, such as actinomycetes, produce the greatest part of natural drugs identified in the last thirty years [52]. The actinomycetes can be isolated from various habitats, such as soil, sea deposits, sponges, corals, mollusks, seagrasses, and mangroves [53].


**Table 2.** Genome mining approaches applied to microorganisms.

Hornung et al. [54] applied genome mining to identify strains capable of producing halogenase enzymes, where halogenations represent an important feature for the biological activity of a great number of different natural products. *Escherichia coli* DH5a and *E. coli* XL1 Blue were used to identify the complete halogenase gene sequence and to build primer-specific probes for these genes. Moreover, genomic DNA was isolated from 550 strains of actinomycetes available in strain collections. Using specific primer probes, it has been demonstrated that some actinomycetes are able to produce halogen enzymes.

Furthermore, nuclear magnetic resonance spectroscopy (NMR spectroscopy) was applied to understand the structure of these molecules, revealing that they were not exactly like those already known in literature. *Streptomyces*, a type of actinomycetes gram-positive bacteria, have also extensively been studied.

In fact, McAlpine et al. [55] used the genome mining approach to identify new antibiotic ECO-02301, with a potent antifungal activity, from *Streptomyces aizunensis* NRRL B-11277 bacteria. This compound was active against several strains of human pathogenic fungi (*Candida albicans* ATCC10231, *Candida glabrata* ATCC 90030, *Candida lusitaniae* ATCC 200953, *Candida tropicalis* ATCC 200955, *Candida krusei* LSPQ 0309, *Saccaromyces cerevisiae* ATCC 9763, *Aspergillus fumigatus* ATCC 204305, *Aspergillus flavus* ATCC 204304, *Cryptococcus neoformans* ATCC 32045). To obtain the expression of this gene after the grown of bacteria in flask, the proteins were extracted and analyzed by high performance liquid chromatography (HPLC) monitored by a diode array detector (DAD) that detects the absorption in UV region and positive/negative mass spectrometry (MS) ions.

The analysis of the genome of *S. aizunensis* NRRL B-11277 helped the prediction of the structure of this compound with sufficient accuracy so to represent a guide for its isolation.

Furthermore, an anti-infective agent, called arylomycin, and its BGCs by *Streptomyces roseosporus* strains, were described using imaging mass spectrometry (IMS) and MS guided by genome mining approaches [56]. Specifically, *S. roseosporus* was co-cultured with two pathogen strains, *Staphylococcus aureus* and *Staphylococcus epidermidis*, and its genome has been sequenced. It was so demonstrated that *S. roseosporus* produced daptomycin, an antibiotic molecule. Moreover, they spotted *S. roseosporus* in the center of *S. aureus* and *S. epidermidis* cultures and after 36 hours of incubation, using IMS and MS, aptomycin ions have been not observed, but a cluster corresponding to the potassium adduct was found. These results suggested that *S. roseosporus* was also able to produce three additional antibiotics. Furthermore, to identify the biosynthetic gene cluster of these molecules, a peptide-genomic mining approach was applied, which relied on the short sequence tag (SST) from tandem very spectrometric data. With this approach, in fact, they established that these three molecules were arylomycins.

In a similar study, Liu et al. [57] demonstrated that *S. roseosporus*, in addition to the non-ribosomal peptide synthetase-derived molecular families and their gene subnetworks, todaptomycin, arylomycin, and napsamycin, was also able to produce stenothricin. Firstly, after DNA extraction, to identify the molecular network they reduced the complexity of analysis to 837 genes using MS/MS spectra with parent ion masses within 0.3 Da and compared to related MS/MS spectral patterns. It was possible to observe the already known genes ofarylomycin, napsamycin, daptomycin and their variants. However, they identified four genes for stenothricin but combining the MS/MS spectra to the amino-acid blocks found by antiSMASH, 21 genes clusters were found. Furthermore, to understand their biological activities, a screening platform (named BioMAP) was used and then the cytological profiling, evaluating this activity against 15 bacterial strains. These approaches revealed that the stenothricinis was active on both Gram-negative and Gram-positive bacteria.

Seo et al. [58] used the DNA extraction to isolate the antibiotic pentalenolactone biosynthetic gene clusters from the known pentalenolactone producers *Streptomyces exfoliatus* UC5319 and *Streptomyces arenae* TU469. By building probes based on the previously cloned *S. exfoliates* pentalenene synthase gene, the sequence of the *S. exfoliatus* Pen biosynthetic gene cluster were analyzed, revealing that the furthest upstream gene, designated as PenR, encoded a 153 aa MarR-family transcriptional regulator. Moreover, PenI, PenH, and PenF were also found, which were expected to catalyze the oxidative conversion of pentalenene to 1-deoxy-11-oxopentalenic acid, as previously established for the othologous *Streptomyces avermitilis* proteins. Furthermore, the attention was pointed out on penE product, because it seems to be the key branch point that distinguished the pentalenolactone and neopentalenolactone biosynthetic pathways. PenE gene encoded a protein that is a homologue of the known Baeyer-Villiger monooxygenase from *S. avermitilis*, PtlE. The compounds PenD, PntD, and PtlD were characterized by mass spectrometry and H-NMR, also generating the deletion mutants with no production of pentalenolactone.

In another study, Tang et al. [49] analyzed, through bioinformatic approaches (BLASTP, Artemis Release 12.0), the genome of *Streptomycetes* sp. M10 discovering 20 biosynthetic gene clusters involved in the synthesis of natural products, such aspolyketides, NRPs, siderophores, lantibiotics, terpenoids. In addition, one of all gene clusters shared a partial similarity with candicidin/FR-008gene cluster, which in turn encoded for antifungal polyene assuming the potential role of this strain to produce this compound. Finally, to confirm this potential activity, the polyene was tested against the phytopathogen *Fulvia fulva* for its antifungal activity.

A high throughput genomic library expression analysis system (LEXAS) was applied for efficient, function-driven discovery of cryptic and new antibiotics from *Streptomyces*, known producers of several antibiotics [60]. Each BAC clone was transferred individually into an engineered antibiotic overproduction host, avoiding preference for smaller BACs. The LEXAS captured two known antibiotics, identified two novel lipopeptides and their BGC that was not produced/expressed in the native *Streptomyces rochei* strain, and revealed a cryptic BGC for unknown antibiotic. Specifically, in this research two new antibiotics streptothricins and borrelidin were found and for their validation these genes were expressed in the surrogate host *Streptomyces lividans* SBT5 by heterologous expression. Moreover, to analyze the antimicrobial activity, SBT5 products were tested against *Staphylococcus aureus* and *Bacillus mycoides*, showing an inhibition. In addition, they discovered two novel linear lipopeptides and their BGCs also adding the analysis of their structures by HPLC and liquid chromatography-mass spectrometry (LC-MS).

Thirty-eight secondary biosynthetic gene clusters of nataxazole (NAT) and its derivatives were identified from *Streptomyces* sp. Tü 6176, using in silico by genome mining and antiSMASH 2.0 [61]. In particular, the NAT entire BGC was described, consisting of 21 genes: 12 encoding for structural proteins, 4 for regulatory proteins, 4 probably involved in NAT secretion, and 1 with unknown function. Moreover, using the gene inactivation and heterologous expression of NAT cluster, it was established that secondary metabolite pathways were outside of NAT gene cluster (not a common in actinomycetes) despite they were involved in NAT biosynthesis. Furthermore, using antibiotic disc diffusion assay, an antibiotic activity was found only against *Staphylococcus albus* J1074, whereas the negative effect was absent in *Streptomyces lividans* JT46, *Micrococcus luteus* and *Escherichia coli*. Anticancer activity was tested against human tumor cell lines (HT29, A549, MDA-MB-231, AGS and A2780) including mouse cell line NIH/3T3 used as control. In this way, they demonstrated that these compounds have moderate activity against maleficent cells. In a similar study, Ye et al. (2017) used genome mining and antiSMASH 2.0 to identify the presence of 31 biosynthetic clusters in *Strepmomyces argillaceus* ATCC12956.

The most studied BGC between all found was the gene that encoded for argimycin P (renamed *arp* cluster). In addition, the pathway for the biosynthesis of *arp* was reconstructed by means of genetic engineering. Moreover, using in vitro tests on cells, no cytotoxic activity of this compound was found against 59 tumour cell lines. In another study, Paulo et al. [63] used in silico genome mining on strains of *Streptomyces* sp. CBMAI 2042, isolated from the branches of the plants *Citrus sinensis*. Moreover, this strain also prevented the proliferation of pathogens in citrus such as *Citrus xylella*, *Geotrichum candid* var. citri-Aurantii, and *Colletotrichum gloesporioides*. In particular, 35 biosynthetic gene clusters were found including the putative NRP biosynthetic gene cluster that encoded for valinomycin. In addition, by combining genome mining and molecular network, it was possible to reconstruct the origin of the biosynthetic pathway of cyclodepsipeptides, which have antibacterial, antiviral, and anticancer activity.

Furthermore, Purves et al. [64] applied the genome mining approach on bacteria extracted from two marine sediments (Antarctic and Scotland). They identified eight genera (*Bacillus*, *Streptomyces*, *Micromonospora*, *Paenibacillus*, *Kocuria*, *Verrucosispora*, *Staphylococcus*, and *Micrococcus*) and used 38 strains on which MS analysis was conducted. Thanks to this approach a great number of metabolites were identified, of which 1422 were Antarctic-specific, while 1501 were Scottish-specific secondary metabolites. Moreover, a molecular network was built up by Global Natural Products Social (GNPS) Molecular Networking, showing that only 8% of strains belonging to these locations displayed a similarity, implying a high degree of biogeographic influence upon secondary metabolite production. Organic extracts from these 38 selected strains were tested for cytotoxicity against epithelial colon adenocarcinoma cells (Caco-2) and human fibroblasts originating from foreskin (HS27). No effect on normal cell viability was observed, while seven extracts were bioactive against Caco-2 at 50 g/mL concentration. Direct observation revealed morphological changes, such as cell shrinkage and formation of apoptotic bodies. Moreover, Deng et al. [65] identified three new fluorinase enzymes from three bacterial strains, *Streptomyces* sp. MA37, *Nocardia brasiliensis*, and *Actinoplanes* sp. using the genome mining approach. These proteins were isolated and purified using overexpression of fluorinasegenes in *Escherichia coli*. Analyzing this product with in vitro activity assay, it revealed a high homology (about 85%) of its BGCs to the original one (called flA1) founded in *Streptomyces cattleya*. Finally, it was also assessed that *Streptomyces* sp. MA37produced some unidentified fluorometabolites.

As mentioned before, the actinomycetes are distributed in different marine habitats, being mainly associated to sponges. In fact, Jin et al. [53] have conducted genome mining experiments with *Streptomyces* sp. PKU-MA00045 isolated from sponges. Specifically, five new aromatic polyketides, fluostatins M–Q (1–5) were isolated using PCR-based genome mining method, and their chemical structures were clarified by 1H-NMR and 13C-NMR. The entire genome of *Streptomyces* sp. PKU-MA00045 was sequenced and compared to homologues in the published fluostatin gene clusters with BLAST, so identifying the BGCs of these new five compounds. In a similar experiment, Almeida et al. [10] used OSMAC approach to identify an octapeptidicsurugamide (Surugamide A) from *Streptomyces* sp. SM17, isolated from the marine sponge *Haliclona simulans.* The phylogenetic analysis with NCBI BLASTN demonstrated that this marine bacteria was phylogenetically linked to five strains of terrestrial *Streptomyces* bacteria: *Streptomyces albidoflavus* strain J1074, *Streptomyces albidoflavus* strain SM254, *Streptomyces sampsonii* strain KJ40, *Streptomyces* sp. FR-008 and *Streptomyces koyangensis* strain VK-A60T. Since *S. albidoflavus* strain J1074 was widely used as a model for various biotechnological studies, the secondary metabolites of the biosynthetic gene clusters were predicted by antiSMASH program, comparing the new BGCs with those already collected by *S. albidoflavus* strains. In this way, it was demonstrated that *Streptomyces* sp. SM17 produced different secondary metabolites. Moreover, using NMR technique it was possible to show that *Streptomyces* sp. SM17 was able to produce higher levels of Surugamide A than the *S. albidoflavus* strain J1074.

However, Anoop et al. [66] studied another bacterial strain *Pseudovibrio* sp. POLY-S9, isolated from intertidal marine sponge *Polymastia penicillus* sampled from the Atlantic coast of Portugal. In fact, after genome sequencing of this marine bacteria, new genes-related bioactive compounds were isolated, such as polyketide synthase, nonribosomal peptide synthetase and siderophore, using genome mining by antiSMASH. Moreover, several genes involved in symbiotic relationships, such as the ankyrin repeats, tetratrico peptide repeats and Sel1, were also identified. Another important finding of this study was represented by some genome plasticity elements of POLYS9, which allowed the survival of these bacteria and their adaptation to various habitats through the exchange of genetic material. Using MS/MS-based molecular networking analysis a bacterial strain was isolated from the Caribbean sponge *Tectitethya crypta*, able to produce spongosine, deoxyspongosine, spongothymidine, and spongouridine, generally referred as "spongonucleosides" [67].

Spongosine, a methoxyadenosine derivative, had several pharmacological applications, having anti-inflammatory activity (for their capability to inhibit the nitric oxide production in cells) and analgesic and vasodilation properties. After MLSA and BLAST analyses, this strain was identified as *Vibrio harveyi*, and thanks the genomic DNA sequencing and antiSMASH platform, six potential secondary metabolite pathways were described.

Planctomycetes are ubiquitous bacteria that were usually found in marine, freshwater and soil habitats, even if it is possible to find them as free living organisms, or attached to abiotic and biotic surfaces, as for example to algal cells. Some strains also live as symbionts of prawns, marine sponges or termites [72]. For instance, Jeske et al. [68] applied the genome mining methods to define the metabolic properties of *Planctomyces*. First, they found 102 genes or gene clusters involved in the production of secondary metabolites by analyzing 13 genomes on antiSMASH database. Moreover, the genome analysis showed a close correlation between the length of BGCs and the amino acid sequence of the predicted secondary metabolites. Moreover, since most BGCs were transcriptional silent, the Phenotype MicroArray technology was applied on compounds secreted by *Planctomyces limnophilus* (limnic strain) and *Rhodopirellula baltica* (marine strain), confirming that there was a strong relationship between *Planctomycetes* and algae or plants, which in turn secrete compounds that might serve as trigger to stimulate the secondary metabolite production in *Planctomycetes*. Thus, this study provides strong evidences for the use of these bacteria for drug development.

In a different study, Guérard-Hélaine et al. [69] identified new aldolase enzymes, belonging the aldolase/transaldolase family, from 313 different prokaryote species. Comparing the sequence of 1148 proteins extracted from these strains to already known aldolases and transaldolases, 700 genes were selected. The overexpression of these genes and the following LC-MS analysis allowed the selection of 19 proteins of interest. After cloning of the corresponding genes and using fast protein liquid chromatography (FPLC), 18 enzymes were purified, including two aldolases and sixteen transaldolase. Moreover, the activity of these 18 enzymes was evaluated by high-throughput screening (HTS), revealing that six of those annotated as transaldolase showed aldolase activity. Maansson et al. [8] extracted DNA from 13 closely related strains identified as *Pseudoalteromonas luteviolacea*, isolated from all over the Earth, and analysed their potential to produce secondary metabolites. Specifically, antiSMASH analysis demonstrated that only 10 biosynthetic pathways were preserved in all strains, including glycosylated lantipeptide (RiPP1) and two bacteriocins (RiPP2 and RiPP3). All strains have maintained essential pathways, such as that responsible for the production of siderophores, homoserine lactones and violacein. Furthermore, bacteria were grown in culture media to stimulate the synthesis of secondary metabolites and the chemical structures of these compounds were analyzed by LC-MS/MS. Particular attention was paid on violacein pathway, showing the presence of an insert in the *bmp1* gene in the thioesterase domain probably responsible of *Pseuoalteromonas* color. Moreover, the varieties *Pseudoalteromonas* S4047-1, S4054 and CPMOR-1 produced indolmicin antibiotic. However, the biosynthetic pathway coding for the antibiotic indolmicin has never been characterized.

#### Cyanobacteria

Cyanobacteria were also studied for their interesting bioactive secondary metabolites. For example, they produce mycosporine and mycosporine-like amino acids (MAAs), which are antioxidant molecules that eliminate toxic oxygen radicals protecting cells from saline, drying or thermal stress in some organisms and may act as an intracellular nitrogen reservoir. These compounds were also found in many other organisms such as yeasts, fungi, algae, corals and lichens [73]. Applying genome mining approach and BLAST analysis, Singh et al. [70] demonstrated that among four strains of cyanobacteria (*Anabaena variabilis* PCC 7937, *Anabaena* sp. PCC 7120, *Synechocystis* sp. PCC 6803 and *Synechococcus* sp. PCC 6301) exposed to 72 hours of UV radiation, only *Anabaena variabilis* PCC 7937 was able to produce MAAs. HPLC analysis of these four cyanobacteria revealed the presence of a unique combination of two genes, predicted *DHQ synthase* (YP\324358) and *O-methyltransferase*; (YP\324357) in *A. variabilis* PCC7937, which were missing in other non-MAAs-synthesizing cyanobacteria. Micallef et al. [71] identified the gene cluster responsible for hapalosine synthesis and hapalosine biosynthetic pathway from the genomes of three cyanobacteria (*Hapalosiphon welwitschii* UH strain IC-52-3, *Westiella intricata* UH strain HT-29-1 and *Fischerella* sp. CC 9431), by using genome mining combined with

Geneious version 6.1.7 and antiSMASH. Single cyanobactin cluster of biosynthetic genes was identified only in the genome of *W. intricate* UH strain HT-29-1, demonstrating that there is structural diversity of cyanobacteria inside cyanobacteria strains. Moreover, only *Fischerella* sp. PCC 9339 encoded a microviridine gene cluster and they identified the MAA (*mys*) gene cluster in the strains *W. intricate* varieties UH HT-29-1, *H. welwitschii* UH strain IC-52-3, *Mastigocoleus testarum* BC008, *Fischerella muscicola* SAG1427–1 and *Chloroglopsis* sp. PCC 9212. Finally, the presence of the cluster of scytonemin genes within the genome of *Mastigocladopsis repens* PCC 10,914 was discovered, suggesting that this organism was able to bio-sintetizes cytonemin in order to protect the cells against UVA-radiation. The geosmin gene cluster was identified in *W. intricata* variety UH HT-29-1, *H. welwitschii* UH strain IC-52-3, *Fischerella* sp. PCC 9431, and *F. muscicola* SAG 1427–1.

#### *2.4. Fungi*

As described above, the most important sources of natural drugs are not only bacteria but also fungi [6]. In fact, many different natural products, such as penicillin, cephalosporin, ergotrate and the statins represent well-known fungal secondary metabolites for pharmacological applications [74]. For these organisms the genome mining also proved to be a useful method to find BGCs (Table 3). In a study of Bergmann et al. [75] a silent metabolic pathway was detected, which might code for the biosynthesis of polyketides or polypeptides in *Aspergillus nidulans.* In particular, considering that the cryptic gene cluster provided a putative activator gene called *apdR*, it was amplified and cloned into expression vector pAL4, which coded for inducible alcohol dehydrogenase promoter *alcAp* of *A. nidulans* and the pyr-4 gene of *Neurospora crassa* as a selectable marker.


**Table 3.** Genome mining approaches applied to fungi.

Using Southern blot analysis, it was demonstrated that under inducing conditions the *apdA* gene encoded the PKS-NRPS hybrid synthetase. Moreover, HPLC analysis displayed that this induced strains were able to produce two main products, Aspyridones A and B, and two minor compounds, whose structures was elucidated by NMR and MS. In a similar study, Mao et al. [76] revealed a silent metabolic pathway involved in natural product biosynthesis. In fact, after genome sequencing, 68 BGCs were identified, being in contrast to the two predominant metabolites normally produced, the F1-ATPase inhibitors 1 and 2. Since these BGCs are localized within the heterochromatic regions, a mutant strain was built deleting *hdaA* (gene of the histone H3 lysine 14 (K14) deacetylase). In this way, using metabolite extraction and LC-MS analysis, it was demonstrated that the mutant produced more compounds compared to wild strain. Moreover, after overexpression of these genes, ten compounds were isolated, of which four contained new structures, including the cyclic peptides arbumycin and arbumelin, the diterpenoid arbuscullic acid A, and the meroterpenoid arbuscullic acid B. However, Ye et al. [78] applied the genome mining approach to conduct a phylogenetic analysis of

fifteen bifunctional terpene synthases found in five fungal genomes. Specifically, the terpene BGCs sequence were different and synthetized sesterterpenes with new carbon skeletons, suggesting that these microorganisms were separated in five different clades. Moreover, two clades, *Aspergillus oryzae* and *Neosartorya fischeri*, did not produce terpene, hypothesizing that BGCs were silent in standard conditions. For these reasons, heterologous expression was performed in *A. oryzae* using *E. coli* plasmids and the extract was analyzed with GC-MS, 1H- and 13C-NMR elucidating the structure of four compounds, one of which known as sesterfisherolsynthase (*NfSS*) and previously found in *N. fischeri*. Furthermore, bioinformatic analysis showed that *NfSS* gene was encoded downstream of a cytochrome P450 monooxygenase (NfP450) and it was transformed by NfP450 to sesterfisheric acid. Finally, to identify NfP450 gene, double transformant with NfSS and NfP450 genes was prepared and the extract was examined by LC-MS and HR-MS indicating that NfP450 conducted a NfSS modification.

Furthermore, Ding et al. [77] have identified the first BGCs of the stephacidin and notoamide, belong to family of prenylated alkaloids, from *Aspergillus* sp. MF297-2. Specifically, after sequencing of genome the entirenotoamide and stephacidin gene cluster was identified by BLAST comparing sequence to gene *ftmA*, which was previously mined from an *Aspergillus fumigatus*. By bioinformatics analysis, 19 genes involved in notoamide biosynthesis were found to constitute this cluster. To understand the function, this cluster was cloned using *E. coli* DH5R and overexpressed into *E. coli* BL21. The proteins were purified with a single Ni-NTA column and analyzed with HPLC, LC-MS, 1H- and 13C-NMR. Two central pathway enzymes, *NotF* and *NotC,* were identified suggesting a scheme for the biosynthesis of stephacidin and notoamide metabolites.

#### *2.5. Other Organisms*

Several other organisms, completely unrelated to the marine environment, have been used as subject of genome mining approach, such as terrestrial microorganisms, plants, and animals (Liu et al. 2018; Table 4).


**Table 4.** Genome mining approaches applied to ants, copepods and plants.

Gruber and Muttenthaler [4] applied genome mining to identify defense- and neuropeptides in the genomes of social ants of the subfamilies of Myrmicinae (*Atta cephalotes*), Formicinae (*Camponotus floridanus*) and Ponerina (*Harpegnathos saltator*); ants are difficult to manipulate for scientific purposes because of the size of their bodies and organs. Most interestingly, genes encoding for oxytocin/vasopressin-related peptides (inotocins) and their putative receptors were identified, using a publicly available matrix of tools, including the search for similarity with tBLASTn, prediction of gene structure using GeneWise algorithm and alignments of sequences by ClustalW.

Carotenoids cannot be synthesized de novo, but they must therefore be taken with food (such as algae) and get protective human health benefits as well. Free astaxanthin and its esterified forms are the main carotenoids present in crustaceans and in particular in copepods. Mojib et al. [79] aimed on understanding the metabolic and genetic basis of the blue phenotype between the blue pigmented organisms from the phylum Arthropoda, subclass Copepoda (*Acartia fossae*) and the phylum Chordata, class Appendicularia (*Oikopleura dioica*) in the Red Sea. Firstly, liquid chromatography-UV method was used to detect the carotenoids and mass spectrometry and HPLC were used to detect intermediate

metabolites, present at low concentrations. The chromatograms identified astaxanthin in all samples, while the fucoxanthin was not detected in any samples. In addition, other carotenoids, intermediate compounds for conversion from β-carotene to astaxanthin, were also identified. The metabolic pathway for each sample was reconstructed for the conversion from β-carotene to astaxanthin. The results showed that all the species followed the same metabolic pathways via almost the same intermediate metabolite formation. Echinenone, one of the intermediate metabolite was not detected in any of the samples but its hydroxylated form, the 3-idrossi chinenone, was detected in all samples, as well as lutein. Putative β-carotene hydroxylase of P450 family coding transcripts was identified in blue *A. fossae* by in silico transcriptome mining. Putative carotenoid-binding proteins after transcriptome/genome mining showing 100% homology to Apolipoprotein D (ApoD) and crustacyanin as predicted by HHpred database.

A customized version of the plantiSMASH genome mining algorithm was created to identify a sesterterpene synthase gene repertoire in some *Brassicaceae* plants, which synthesizes fungal-type sesterterpenes with diverse scaffolds, thus fueling the drug-discovery pipeline [80]. Sesterterpenoids are a rare terpene class with not well explored chemical structure and diversity, representing a potential new drug source. This study offered new insights on the origin of structural diversity for protein engineering, supporting the idea of convergent evolution for natural product biosynthesis.

#### **3. General Conclusions**

Many drugs used, for example, as anticancer, antibacterial, and anti-inflammatory agents in the clinical field are derived from natural products and their derivatives. In fact, these secondary metabolites are produced by all organisms (from bacteria to plants, invertebrates, and other animals) and show several biological activities useful in several biotechnological applications. However, the most important sources of natural drugs are microorganisms (mainly bacteria, also associated with marine organisms, such to sponges) and fungi. In the last decades, the great advances made in the field of molecular biology techniques, representing a good example the genome mining together with the synthetic biology, strongly push the identification of BGCs, encoding for enzymes involved in the biosynthesis of natural products. Taking together, these next-generation and highly sophisticated tools contribute to the emergence of a new generation of natural product research. These techniques are in their infancy for their application to marine environment, but there are in literature a lot of applications for the discovery of bioactive natural products for other environments. For this reason, we think that a review reporting all these examples could give strong support in pushing the applications of these new techniques in discovery bioactive compounds from the marine environment, also due to high level of biodiversity offered by the sea in comparison with the Earth. Genome mining, as well as synthetic biology and all the techniques to them associated, represent a new challenge in natural products discovery from the marine environment, without impact on the environment and with no use of destructive collection practices of marine organisms.

**Author Contributions:** M.C. conceived and designed the scheme of the review; L.A. and R.E. performed the bibliographic research and prepared the original draft; N.R. contributed to the preparation of the original draft; All authors have read and agree to the final version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** Luisa Albarano was supported by a PhD (PhD in Biology, University of Naples Federico II) fellowship co-funded by the Stazione Zoologica Anton Dohrn and University of Naples Federico II. Roberta Esposito was supported by a PhD (PhD in Biology, University of Naples Federico II) fellowship funded by the Photosynthesis 2.0 project of the Stazione Zoologica Anton Dohrn. Nadia Ruocco was supported by a research grant "Antitumor Drugs and Vaccines from the Sea (ADViSE)" project (PG/2018/0494374).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Cloning, Expression, and Characterization of a New PL25 Family Ulvan Lyase from Marine Bacterium** *Alteromonas* **sp. A321**

#### **Jian Gao, Chunying Du, Yongzhou Chi, Siqi Zuo, Han Ye and Peng Wang \***

College of Food Science and Engineering, Ocean University of China, Qingdao 266003, China **\*** Correspondence: pengwang@ouc.edu.cn; Tel.: +86-0532-82032290

Received: 25 August 2019; Accepted: 1 October 2019; Published: 8 October 2019

**Abstract:** Ulvan lyases can degrade ulvan to oligosaccharides with potent biological activity. A new ulvan lyase gene, *ALT3695*, was identified in *Alteromonas* sp. A321. Soluble expression of *ALT3695* was achieved in *Escherichia coli* BL21 (DE3). The 1314-bp gene encoded a protein with 437 amino acid residues. The amino acid sequence of ALT3695 exhibited low sequence identity with polysaccharide lyase family 25 (PL25) ulvan lyases from *Pseudoalteromonas* sp. PLSV (64.14% identity), *Alteromonas* sp. LOR (62.68% identity), and *Nonlabens ulvanivorans* PLR (57.37% identity). Recombinant ALT3695 was purified and the apparent molecular weight was about 53 kDa, which is different from that of other polysaccharide-degrading enzymes identified in *Alteromonas* sp. A321. ALT3695 exhibited maximal activity in 50 mM Tris-HCl buffer at pH 8.0 and 50 ◦C. ALT3695 was relatively thermostable, as 90% activity was observed after incubation at 40 ◦C for 3 h. The *Km* and *Vmax* values of ALT3695 towards ulvan were 0.43 mg·mL−<sup>1</sup> and 0.11 <sup>μ</sup>mol·min−1·mL<sup>−</sup>1, respectively. ESI-MS analysis showed that *enzymatic products* were mainly disaccharides and tetrasaccharides. This study reports a new PL25 family ulvan lyase, ALT3695, with properties that suggest its great potential for the preparation of ulvan oligosaccharides.

**Keywords:** ulvan-derived oligosaccharides; ulvan lyase; heterologous expression; polysaccharide lyase family 25

#### **1. Introduction**

Ulvan is a type of cell wall polysaccharide, extracted from marine green seaweed (genus *Ulva* and *Enteromorpha*) [1]. It is mainly composed of iduronic acid (IdoA), 3-sulfated rhamnose (Rha3S), xylose (Xyl), and glucuronic acid (GlcA) [1]. The repetitive disaccharide units in ulvan are Rha3S-GlcA, Rha3S-IdoA, and Rha3S-Xyl [1]. Ulvan has been shown to possess various pharmacological properties, such as antioxidant [2], anticoagulant [3], antitumor [4], antihyperlipidemic [5], and antiviral [6] activities. However, its biological activity is greatly affected by its molecular weight, and its application could be limited by its high viscosity [3,7,8]. Compared to high-molecular-weight ulvan, low-molecular-weight ulvan possesses higher solubility, lower viscosity, easier absorption, and greater exposure of reactive groups [3,5,9]. In addition, it has been reported that low-molecular-weight sulfated polysaccharides [7] and their iron complexes [10] have potent biological activity. Thus, ulvan-derived oligosaccharides have been attracting increased attention for their tremendous potential for applications in the functional food and pharmaceutical industries.

Several methods for depolymerizing polysaccharides have been studied. Qi et al. prepared different molecular weight ulvans by treatment with H2O2 and by changing the depolymerization conditions [7]. Yu et al. degraded ulvan by microwaving under different pressures [8]. Mild acid hydrolysis has also been used to degrade ulvan [1]. However, such chemical and physical methods have many shortcomings, including sulfate group loss, complex products, and low oligosaccharide yields [9]. In contrast, enzymatic depolymerization is a potentially energy-efficient and environmentally friendly method, with no requirement for harsh reagents or extreme conditions [11]. Moreover, specific enzymes can be very useful for studying polysaccharide structure [12,13]. Therefore, it is necessary to find new enzymes that are capable of degrading ulvan, such as ulvan lyases.

The first marine bacterium capable of degrading ulvan was isolated by Lahaye et al. [12]. However, no in-depth study of its ulvan-degrading enzymes has been reported. Recently, Collén et al. reported the first endolytic ulvan lyase genes [14], which were found in the ulvanolytic bacterium *Nonlabens ulvanivorans*. Subsequently, the genomes of three ulvanolytic bacteria were sequenced [15], and ulvan lyases belonging to different polysaccharide lyase families were found. Kopel et al. reported several ulvan lyases in these three genomes [15], and these ulvan lyases showed no homology to those found by Collén et al. [14], which belong to PL24 family. Foran et al. identified another novel ulvan lyase (LOR\_29) in the *Alteromonas* sp. LOR genome, which is the founding member of polysaccharide lyase family 25 (PL25) [16]. Thus far, three ulvan lyase families have been established (http://www.cazy.org), including PL24, PL25, and PL28. Structural characterizations of representative enzymes from these three families have also been reported [17–19]. As the primary ulvan-degrading enzyme [16], ulvan lyase catalyzes β-elimination at the internal bond between uronic acid and Rha3S, producing oligosaccharides with unsaturated uronic acid (ΔGlcA) [14,15]. Compared to other methods, the uniform enzymatic product is an advantage of using ulvan lyases to degrade ulvan, which is convenient for studying their pharmacological activity. In addition, sulfate groups are well retained during the degradation process, which is essential for the activity of ulvan oligosaccharides [9]. Ulvan lyases have also been used for epitope deletion studies [20]. However, only seven ulvan lyases have been characterized. To expand the repertoire of enzymes to efficiently produce ulvan-derived oligosaccharides, additional new ulvan lyases must be investigated.

Previous studies showed that *Alteromonas* sp. A321 was capable of degrading ulvan [9]. In this study, a new ulvan lyase gene, *ALT3695*, was identified in *Alteromonas* sp. A321 and soluble expression of ALT3695 was achieved in *Escherichia coli* BL21 (DE3). Recombinant ulvan lyases were purified and the molecular weight was investigated. ALT3695 differs from other enzymes previously found in *Alteromonas* sp. A321 [21]. Thus, this study reports a new enzyme for preparing ulvan-derived oligosaccharides and enriches the marine enzyme library.

#### **2. Results and Discussion**

#### *2.1. Sequence Analysis*

The *ALT3695* gene is 1314 bp in length and encodes a 437-amino acid protein. The ALT3695 amino acid sequence shares 64.14%, 62.68%, and 57.37% sequence identity with reported ulvan lyases from *Pseudoalteromonas* sp. PLSV (PLSV\_3936, GenBank accession no. WP\_033186995.1) [18], *Alteromonas* sp. LOR (LOR\_29, GenBank accession no. WP\_052010178.1) [16], and *Nonlabens ulvanivorans* PLR (NLR\_492, GenBank accession no. WP\_036580476.1) [16], respectively. As the representative enzyme of PL25 family, the structure and catalytic mechanism of PLSV\_3936 have been investigated [18]. In PLSV\_3936, His123, His143, Tyr 188, Arg204, and Tyr246 are conserved and have been proposed as active site residues, and Gln66, Tyr246, and Arg282 are highly conserved.

Several homologous enzymes from different organisms with less sequence identity were selected in the carbohydrate-active enzymes (CAZy) database. Amino acid sequence alignment showed that most residues are also conserved in ALT3695 and other PL25 family members, except Gln66 (Figure 1). Among these residues, His143 and Try246 could help Arg204 to neutralize the negative charge on glucuronic acid. His123 and Tyr188 were acid-base catalysis residues [18]. A phylogenetic tree of ALT3695 and other reported ulvan lyases was constructed by the neighbor-joining method, which suggested that ALT3695 is a PL25 family ulvan lyase (Figure 2).

**Figure 1.** Amino acid sequence alignment of ALT3695 with ulvan lyases from *Pseudoalteromonas* sp. PLSV (PLSV\_3936, GenBank accession no. WP\_033186995.1), *Alteromonas* sp. A321 (ALT3695, GenBank accession no. MN347032), *Nonlabens ulvanivorans* PLR (NLR\_492, GenBank accession no. WP\_036580476.1), *Paraglaciecola chathamensis* S18K6 (GCHA\_4617, GenBank accession no. GAC12534.1), *Siansivirga zeaxanthinifaciens* CC-SAMT-1 (AW14\_13480, GenBank accession no. AJR04515.1), and *Catenovulum* sp. CCB-QB4 (C2869\_03520, GenBank accession no. AWB65560.1). Active site residues are marked with filled circles (-). Highly conserved residues are marked with filled triangles (-).

#### *2.2. Expression and Purification of Recombinant ALT3695*

Soluble expression of His-tagged ALT3695 ulvan lyase was achieved in *E. coli* BL21 (DE3) by adding 1 mM isopropyl-β-d-thiogalactopyranoside (IPTG). The recombinant ALT3695 was purified, and SDS-PAGE showed only a single protein band. The purified ALT3695 showed a specific activity of 1.88 U/mg protein, with 52% recovery (Table 1). As expected, the apparent molecular weight was 53 kDa (Figure 3), which is similar to that previously reported for PL25 family ulvan lyases, such as LOR\_29 (52 kDa) [16] and NLR\_492 (55 kDa) [16]. The molecular weight of ulvan lyases from other families, such as IL45\_01510 (GenBank accession no. AEN28574.1, PL28) [14] and PsPL (GenBank accession no. AMA19992.1, PL24) [22], were about 46 kDa and 59 kDa, respectively.

**Figure 2.** Phylogenetic tree of ALT3695 (filled triangle) and other ulvan lyases generated using the neighbor-joining method. Numbers along the branch nodes represent bootstrap percentages based on 1000 resamplings. The scale bar indicates the average number (0.2) of amino acid substitutions per site. *Pseudoalteromonas* sp. PLSV (PLSV\_3936, GenBank accession no. WP\_033186995.1), *Alteromonas* sp. LOR (LOR\_29, GenBank accession no. WP\_052010178.1), *Nonlabens ulvanivorans* (NLR\_492, GenBank accession no. WP\_036580476.1), *Pseudoalteromonas* sp. PLSV (PLSV\_3925, GenBank accession no. WP\_033186955.1), *Alteromonas* sp. LOR (LOR\_107, GenBank accession no. AMA19991.1), *Pseudoalteromonas* sp. PLSV (PLSV\_3875, GenBank accession no. AMA19992.1), *Alteromonas* sp. LOR (LOR\_61, GenBank accession no. WP\_032096165.1), *Formosa agariphila* KMM 3901 (BN863\_22190, GenBank accession no. WP\_038530530.1), and *Nonlabens ulvanivorans* PLR (IL45\_01510, GenBank accession no. AEN28574.1).



**Figure 3.** SDS-PAGE of purified ALT3695. Lane 1: purified ALT3695; Lane M: molecular weight marker.

#### *2.3. Influence of Temperature on the Activity of Recombinant ALT3695*

The influence of temperature on enzyme stability and the optimum temperature for ALT3695 activity were investigated. Ulvan lyases from different families exhibit different optimum temperatures. The optimum temperature for ALT3695 activity was 50 ◦C (Figure 4a), which is higher than that of another ulvan lyase from the PL25 family (45 ◦C) [16]. Ulvan lyases from other families, such as FaPL28 (GenBank accession no. WP\_038530530.1, PL28) [23] and PsPL (GenBank accession no. AMA19992.1, PL24) [22], showed maximum activity at lower temperatures of 29.5 ◦C and 35 ◦C, respectively. Little has been reported concerning the thermal stability of PL25 family members. FaPL28 had poor thermal stability, and less than 10% activity was observed after incubation for 3 h at or above 30.9 ◦C [23]. In contrast, ALT3695 was relatively thermostable and 90% activity was observed even after incubation for 3 h at 40 ◦C (Figure 4b). However, the stability of ALT3695 decreased as the temperature was increased above 40 ◦C (Figure 4b), 49% residual activity was observed after incubation for 3 h at 45 ◦C, and no activity was observed after incubation at 55 ◦C for 1 h.

**Figure 4.** Biochemical characterization of ALT3695. (**a**) Activity of ALT3695 over a range of temperatures (35–60 ◦C) to determine the optimal temperature. (**b**) Thermal stability of ALT3695. (**c**) Activity of ALT3695 over a range of pH values to determine the optimal pH. (**d**) The pH stability of ALT3695.

#### *2.4. Influence of pH on the Activity of Recombinant ALT3695*

To investigate the effect of pH on ALT3695 activity, different buffers (pH 4–9.5) were used. The optimum pH of other ulvan lyases ranged from 7.5 to 9.0 [16,22,23], which corresponds to the pH of slightly alkaline seawater [24]. Similarly, ALT3695 showed maximal activity at pH 8.0 (Figure 4c). In addition, activity in Tris-HCl buffer was higher than that in Na2HPO4-citric acid buffer at pH 8.0, suggesting that the catalytic activity of ALT3695 was also affected by the buffer salt. ALT3695 retained 80% activity after pre-incubation at pH 4.0–9.0 for 2 h at 4◦C, indicating that ALT3695 has excellent pH stability (Figure 4d).

#### *2.5. Influences of Surfactants, Metal Ions, and Metal Chelators on the Activity of Recombinant ALT3695*

Table 2a shows the influence of various metal ions on ALT3695 activity. K<sup>+</sup> had no significant influence on ALT3695 activity. Additionally, Fe2<sup>+</sup> and Cu2<sup>+</sup> decreased enzyme activity, and this effect was also found in the ulvan lyases, PsPL [22] and FaPL28 [23]. Furthermore, ulvan lyases from different families exhibit different responses to Co2<sup>+</sup>, the activity of PsPL was increased by Co2<sup>+</sup> [22], while the activity of FaPL28 [23] and ALT3695 was decreased by Co2<sup>+</sup>. This may be attributed to structural differences among these enzyme families. Moreover, Cd2<sup>+</sup>, Hg2+, and Fe3<sup>+</sup> deactivated ALT3695 completely, which may be caused by the binding of these ions to the SH, CO, and NH moieties of amino acids, resulting in structural changes and inactivation [21]. It is worth mentioning that although ALT3695 has Zn2+-containing ligands, Zn2<sup>+</sup> strongly inhibited enzymatic activity. Thus, Zn2<sup>+</sup> may play a dual role, as inherent Zn2<sup>+</sup> is essential for maintaining enzyme structure, but exogenous Zn2<sup>+</sup> may bind to other amino acid residues and inhibit activity [25].


**Table 2.** (**a**) Effects of various metal ions on the activity of ALT3695 ulvan lyase. (**b**) Effects of surfactants and metal chelators on enzyme activity.

The enzyme activity was increased by Ca2<sup>+</sup>, Mg2+, and Ba2<sup>+</sup>. The addition of Ca2<sup>+</sup> would not only increase the activity of PL25 ulvan lyase, but also stimulate the activity of PsPL [22] and FaPL28 [23]. With more Ca2<sup>+</sup> added, the activity of ALT3695 was also increased. In addition, ALT3695 activity decreased by 68% after pre-incubation in 5 mM ethylenediaminetetraacetic acid (EDTA), similar phenomenon was also found in PsPL and FaPL28, this may due to the removal of divalent cations bound to enzymes. Additionally, structure analysis suggested that divalent metal ions play a structural role in all three ulvan lyase families.

The effect of surfactants and metal chelators on ALT3695 activity was also investigated (Table 2b). It has been reported that non-ionic surfactants could increase the activity of some enzymes [26,27]. However, surfactants had little effect on the activity of ALT3695.

#### *2.6. Kinetic Parameters of Recombinant ALT3695*

*Vmax* is the maximum initial rate of an enzymatic reaction. Each enzyme has a specific *Km* for each substrate, which is inversely related to the enzyme's affinity for the substrate [28]. The kinetic

ND: No activity was determined.

parameters of other ulvan lyases from the PL25 family have not yet been reported. The *Km* and *Vmax* of recombinant ALT3695 toward ulvan were 0.43 mg·mL−<sup>1</sup> and 0.11 <sup>μ</sup>mol·min−1·mL<sup>−</sup>1, respectively, as determined by the Lineweaver-Burk plot. ALT3695 showed a higher affinity for ulvan than PsPL and FaPL28, which had a *Km* of 2.10 mg·mL−<sup>1</sup> [22] and 0.75 mg·mL−<sup>1</sup> [23], respectively, suggesting the potential of ALT3695 for application.

#### *2.7. Analysis of Enzymatic Products*

The main repetitive disaccharide units in ulvan are Rha3S-GlcA, Rha3S-IduA, and Rha3S-Xyl. ALT3695 could cleave the bond on both Rha3S-GlcA and Rha3S-IdoA, while PL24 family ulvan lyase specifically cleaves the bond between Rha3S-GlcA. The products of ulvan degradation by ALT3695 were analyzed by negative electrospray ionisation mass spectrometry (ESI-MS). Mass spectroscopic analysis of products showed two major types of oligosaccharides. The peak at m/z 401 was equivalent to the disaccharide ΔGlcA-Rha3S, with a molecular weight of 402 Da. There was an unsaturated double bond between C4 and C5 [18]. The peak at m/z 379 was equivalent to double-charged molecular ions of the tetrasaccharide ΔGlcA-Rha3S-Xyl-Rha3S, with a molecular weight of 760 Da [14]. A low abundance of single-charged molecular ions of this tetrasaccharide was also identified as the peak at m/z 759. However, the products of ulvan degradation by PL24 ulvan lyase were different. The products included the two most abundant oligosaccharides which comprised of disaccharides (ΔGlcA-Rha3S) and tetrasaccharides (ΔGlcA-Rha3S-IdoA-Rha3S), with minor tetrasaccharides (ΔGlcA-Rha3S-Xyl-Rha3S) [15].

Moreover, the change in the full wavelength during enzymatic degradation was studied (Figure 5b). As the enzymatic reaction continued, the absorption peak at 235 nm gradually increased, which may be due to the generation of ΔGlcA-Rha3S and ΔGlcA-Rha3S-Xyl-Rha3S, as C-4=C-5 double bonds have a high absorbance at 235 nm [29].

**Figure 5.** Analysis of the enzymatic products of ulvan generated by ALT3695. (**a**) ESI-MS analysis of products. (**b**) Full wavelength scans conducted during enzymatic degradation.

#### **3. Materials and Methods**

#### *3.1. Strains, Plasmids, and Medium*

The ulvan-degrading strain *Alteromonas* sp. A321 was provided by Peng Wang (Ocean University of China, Qingdao, China). Genomic DNA was extracted, and the whole genome was sequenced. *E. coli* strains DH5α and BL21 (DE3) were purchased from TIANGEN Biotech Co. (Beijing, China). The pProEX-HTa vector was used to clone and express the ulvan lyase gene. The *E. coli* strains were cultured in Luria-Bertani medium.

#### *3.2. Sequence Analysis*

An ulvan lyase gene, named *ALT3695*, was identified in the genome, and its sequence was deposited into the NCBI database (GenBank accession no. MN347032). DNAMAN 9 (Lynnon Biosoft, San Ramon, CA, USA) software was used to analyze the sequence of *ALT3695*. The homology was analyzed using BLAST at the NCBI website. A phylogenetic tree based on the homology of various ulvan lyases was constructed using MEGA 5.0 (Koichiro, Tokyo, Japan).

#### *3.3. Construction of the Recombinant ALT3695-Expressing Strain*

The *ALT3695* gene was amplified by PCR using Phanta® Turbo Super-Fidelity DNA Polymerase (Vazyme Biotech, Nanjing, China). Two primers were designed to amplify the *ALT3695* gene, 5 -CCGGAATTCCGGCCCAATAATACTGTCGACTTCGCTAACA-3 and 5 -GGGGTACCCCTTACTTATTCTCGCTTCGTATTGGTCC-3 . Then, the amplicon was purified, digested, and ligated into the pProEX-HTa vector to generate pProEX-HTa-ALT3695. The recombinant plasmid pProEX-HTa-ALT3695 was transformed into *E. coli* DH5α for amplification, and then transformed into *E. coli* BL21(DE3) for expression. Positive strains were verified by sequencing.

#### *3.4. Expression of Recombinant ALT3695*

*E. coli* BL21 (DE3) cells containing pProEX-HTa-ALT3695 were cultivated in Luria-Bertani medium supplemented with ampicillin (100 μg/mL). The culture conditions were 37 ◦C and 200 rpm. When the strain entered logarithmic growth phase (OD600 ~ 0.7), IPTG was added at a final concentration of 1 mM to induce the expression of ALT3695. Then, the culture temperature was decreased to 28 ◦C, but 200 rpm was retained with further cultivation for 18 h. Finally, cells were collected by centrifugation under freezing conditions (4 ◦C) and resuspended in 20 mM Tris-HCl buffer (pH 7.5).

#### *3.5. Enzyme Purification*

After homogenization by sonication, the supernatant containing crude enzymes was obtained by centrifugation under freezing conditions. The His-tagged ulvan lyase was purified using BeaverBeads™ IDA-Nickel (Beaverbio, Suzhou, China) and eluted using 20 mM Tris-HCl buffer (pH 7.5) supplemented with 0.5 M NaCl and 300 mM imidazole. The concentration of soluble protein was determined by the Folin-Ciocalteu method [30]. Purified enzyme was then subjected to SDS-PAGE to analyze the purity and molecular mass [31].

#### *3.6. Enzyme Activity Assay*

Ulvan was extracted from *Ulva clathrate* (Fujian, China) according to the method of Qi et al. [32]. The ulvan substrate was added at 0.1% (w/v) in 50 mM Tris-HCl buffer (pH 8.0) and the reaction was conducted at 35 ◦C. The activity of ulvan lyase was examined by measuring the change in absorbance at 235 nm [14]. One unit (U) of ulvan lyase activity was defined as the amount of protein needed to generate 1 <sup>μ</sup>mol of unsaturated glucuronyl residue (extinction coefficient, 4800 M−1·cm−1) per minute [23]. All assays were repeated three times.

#### *3.7. Characterization of Recombinant ALT3695*

To examine the optimum temperature of recombinant ALT3695, the reaction was conducted at 35–60 ◦C under standard conditions. The change in enzyme stability at different temperatures was analyzed by conducting enzyme activity assays after pre-incubating the enzyme for various times (5, 15, 30, 60, 90, 120, and 180 min) at 35–60 ◦C.

The optimal pH of recombinant ALT3695 was investigated by examining its activity in 50 mM Na2HPO4-citric acid buffer (for pH 4.0–8.0) and 50 mM Tris-HCl buffer (for pH 7.5–9.5). To investigate the change of enzyme stability by pH, purified enzymes were stored for 2 h at 4 ◦C in buffers of different pH levels, and then assayed for residual enzyme activity.

To examine the influence of different metal ions on recombinant ALT3695 activity, the enzyme was incubated with various metal ions (at 10 mM and 20 mM) for 2 h at 4 ◦C. The following metal ions were tested: K+, Ca2+, Mg2+, Zn2+, Ba2+, Cu2+, Fe2+, Hg2+, Co2+, Cd2<sup>+</sup>, and Fe3<sup>+</sup>. The effect of the following surfactants and metal chelators on recombinant ALT3695 activity was also determined: Tween-20, Tween-80, Triton X-100, EDTA, and 1,10-phenanthroline. The enzyme solution was stored with each surfactant or potential inhibitor (at 5 mM and 10 mM) for 2 h at 4 ◦C. Then, enzyme activity was assayed and compared with the activity in the absence of additive.

#### *3.8. Kinetic Measurements*

To determine the kinetic parameters of recombinant ALT3695 using ulvan as a substrate, reactions were performed with various concentrations (0.03125–1.0 mg·mL−1) of substrate under standard conditions for 5 min. Then, the *Km* and *Vmax* values were calculated by constructing a Lineweaver-Burk double reciprocal plot.

#### *3.9. Analysis of Enzymatic Products*

To obtain enzymatic products for analysis, purified ALT3695 was incubated with ulvan (5 mg·mL<sup>−</sup>1) for 3 h at 35 ◦C. Then, product samples were collected, and the product molecular weights were measured by negative-ion ESI-MS (Agilent 1290 Infinity II-6460, Agilent Corp., Wilmington, DE, USA; Frag = 90.0 V, m/z 100–3000 amu). Full wavelength scans (190–600 nm) were obtained during enzymatic degradation.

#### **4. Conclusions**

ALT3695 is a new ulvan lyase identified from *Alteromonas* sp. A321. The gene was cloned, and the recombinant ALT3695 was expressed in soluble fraction. The enzymatic properties were also investigated. The enzyme exhibited excellent stability and substrate affinity. Maximum activity was observed at 50 ◦C, and 90% activity remained after incubation at 40 ◦C for 3 h. ALT3695 catalyzes β-elimination at the internal bonds between uronic acids and Rha3S, leading to increased absorbance at 235 nm. ESI-MS results showed that disaccharides and tetrasaccharides were the major enzymatic products. In conclusion, ALT3695 shows great potential for the preparation of ulvan oligosaccharides. Further research on this recombinant strain and a structural analysis of ulvan lyase ALT3695 are warranted based on results of this study.

**Author Contributions:** J.G., C.D., and P.W. conceived and designed the experiments; J.G., C.D., and Y.C. performed the experiments; S.Z. and H.Y. analyzed the data; J.G. and P.W. wrote the paper.

**Funding:** This research was funded by the Key Research and Development Program of Shandong 2018GHY115032, the National Key Research and Development Program 2018YFC0311203 and the Innovation and Development of Marine Economy Demonstration City (Weihai) Program.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Discovery of Two New Sorbicillinoids by Overexpression of the Global Regulator LaeA in a Marine-Derived Fungus** *Penicillium dipodomyis* **YJ-11**

**Jing Yu 1, Huan Han 1, Xianyan Zhang 1, Chuanteng Ma 1, Chunxiao Sun 1, Qian Che 1, Qianqun Gu 1, Tianjiao Zhu 1,2, Guojian Zhang 1,2 and Dehai Li 1,2,\***


Received: 28 May 2019; Accepted: 26 July 2019; Published: 28 July 2019

**Abstract:** Overexpression of the global regulator LaeA in a marine-derived fungal strain of *Penicillium dipodomyis* YJ-11 induced obvious morphological changes and metabolic variations. Further chemical investigation of the mutant strain afforded a series of sorbicillinoids including two new ones named 10,11-dihydrobislongiquinolide (**1**) and 10,11,16,17-tetrahydrobislongiquinolide (**2**), as well as four known analogues, bislongiquinolide (**3**), 16,17-dihydrobislongiquinolide (**4**), sohirnone A (**5**), and 2 ,3 -dihydrosorbicillin (**6**). The results support that the global regulator LaeA is a useful tool in activating silent gene clusters in *Penicillium* strains to obtain previously undiscovered compounds.

**Keywords:** genome mining; global regulator; LaeA; overexpression; *Penicillium dipodomyis*; sorbicillinoids

#### **1. Introduction**

Filamentous fungi have proven to be important sources of bioactive natural products for development of new drug leads. As the traditional approach (cultivation of microorganisms, chemical extraction of the produced metabolites, and final structure and bioactivity elucidations) has been continuously applied in discovery of new secondary metabolites, it has become a frequent issue that known structures are repeatedly discovered, while a big portion of biosynthetic genes are not expressed under current culturing technologies also termed as "silent" or "cryptic" genes. To increase the silent metabolic potential of the microbial producers, a variety of approaches such as heterologous expression, epigenetic regulation, transcriptional regulation and ribosome engineering, have been developed to affect the biosynthetic process in different gene regulation levels [1–3], among which, transcriptional factor regulation is often adopted because it is feasible and effective, and from which unexpected new secondary metabolites could be obtained by activation of the silent genes.

LaeA is an effective global regulator which was first discovered from *Aspergillus nidulans* and *A. fumigatus* by Jin Woo Bok and Nancy P. Keller in 2004 [4], and proved to be able to influence fungi in many aspects, such as increasing [5–7] or reducing secondary metabolite production [8], activating cryptic gene clusters [9], asexual and sexual differentiation as well as changes in phenotype including sporulation [5,9] and pigmentations [5]. Later on, LaeA gene analogues with similar functions were reported from other fungal species such as *Penicillium chrysogenum* [10], *Monascus ruber* [11], *Alternaria alternate* [12], and *Dothistroma septosporum* [8].

As a part of our ongoing work searching for diverse secondary metabolites from marine derived fungi, we recently isolated one filamentous fungi *Penicillium dipodomyis* YJ-11 from a marine sediment sample collected in Jiaozhou Bay of Qingdao. In order to activate the silent metabolic potential and obtain diversified secondary metabolites, we overexpressed the native global regulator of PdLaeA in *P. dipodomyis* YJ-11 and the mutant showed changes both in morphologies (sporulation and pigmentations) and metabolic profiles in contrast to the control (Figure 1). Further chemical studies on the mutant strain led to the isolation of two new compounds, 10,11-dihydrobislongiquinolide (**1**) and 10,11,16,17-tetrahydrobislongiquinolide (**2**), together with four known analogues, bislongiquinolide (**3**), also named as bisorbibutenolide, 16,17-dihydrobislongiquinolide (**4**), also named as dihydrotrichotetronine, sohirnone A (**5**), and 2 ,3 -dihydrosorbicillin (**6**).

**Figure 1.** (**a**) Morphologies of the control strain and OE::PdLaeA strain of *P. dipodomyis* YJ- 11 after incubating at 28 ◦C for 5 days for sporulation. (**b**) HPLC analysis of the extracts from the control strain and OE::PdLaeA strain of *P. dipodomyis* YJ-11.

#### **2. Results and Discussion**

The LaeA gene analogue PdLaeA was identified via Localblast, using a LaeA gene (*Aspergillus nidulans*, Q6TLK5.1) as a query. The PdLaeA gene was then cloned from genomic DNA of the strain *P. dipodomyis* YJ-11. The total size of the gene is 1283 bp and the predicted open reading frame (ORF) is 1053 bp, which may encode a polypeptide of 350 amino acids. BLAST analysis by NCBI indicated that the PdLaeA protein had 95% sequence identity to the protein of PcLaeA (*P. citrinum*, BAL61197.1), PrLaeA (*P. roqueforti FM164*, CDM34701.1), PdiLaeA (*P. digitatum PHI26*, EKV10385.1), and PdiLaeA (*P. digitatum Pd1*, XP\_014530787.1). Phylogenetic analysis revealed that PdLaeA is mostly related to PcLaeA (Figure S2 in Supplementary Materials). Sequence analysis via InterProScan showed that the PdLaeA protein was an S-adenosyl-l-methionine-dependent methyltransferase (IPR029063), which is consistent with the putative mechanism of LaeA genes [13–15].

The PdLaeA in *P. dipodomyis* YJ-11 was amplified by specific primers (Table S1) and ligated into the vector pZeo using restriction sites XhoI and XbaI. The recombinant vector was transformed to *P. dipodomyis* YJ-11 to generate the OE::PdLaeA mutants (the mutant of *P. dipodomyis* YJ-11 harboring vacant pZeo was also generated as a control). The mutants showed obvious changes on both spore morphology and pigment formation. Followed by fermentation in PDB media with shaking at 28 ◦C for 9 days, high performance liquid chromatography (HPLC) analysis also showed a series of new peaks presenting in the extract of the OE::PdLaeA mutant compared with that of the control strain (Figure 1), indicating changes in secondary metabolite production.

For exploring the structures for the activation products, the OE::PdLaeA mutant was cultured in larger scale (10 L). Guided by UPLC-MS data, the EtOAc extract (12 g) of the fermentation was fractionated by octadecyl silane chemically bonded silica (ODS) medium performance liquid chromatography (MPLC) and then HPLC to yield compounds 10,11-dihydrobislongiquinolide (**1**, 11 mg), 10,11,16,17-tetrahydrobislongiquinolide (**2**, 35 mg), bislongiquinolide which also had been named as bisorbibutenolide [16,17] (**3**, 18 mg), 16,17-dihydrobislongiquinolide (also named as dihydrotrichotetronine [18,19], **4**, 84 mg), sohirnone A (**5**, 6 mg), and 2 ,3 -dihydrosorbicillin (**6**, 5 mg) (Figure 2). The known compounds **3**–**6** were obtained and identified by the comparation of 1H NMR data and mass spectroscopy with the literature refs. [16–22]. Compounds **3**, **5,** and **6** which had been isolated in our laboratory before [20], were also confirmed by HPLC analysis using standard samples.

**Figure 2.** Structures of compounds **1**–**6** isolated from the strain OE::PdLaeA P*enicillium dipodomyis* YJ-11 and hydrogenation product **7**.

Compound **1** was obtained as a yellow powder with the molecular formula C28H34O8 supported by the high resolution electrospray ionization mass spectroscopy (HRESIMS) peak at *m*/*z* 497.2185 [M − H]<sup>−</sup> (calcd. 497.2170). The 1H NMR spectrum of **1** revealed six methyl proton signals, two methylene proton signals, nine methine signals including six olefinic ones at 5.0–7.5 ppm, and three aliphatic ones at 2.8–3.5 ppm. In the 13C NMR spectrum, in addition to the six methyls, two methylenes and nine methines which were assignable carbon signals, there were 11 quaternary carbons. The unsaturation degree further suggested the presence of three rings in the structure. The 1H and 13C NMR data (Table 1) were very similar to those of bislongiquinolide (**3**) [16,17] indicating that they had the same core structure. The major differences were the replacement of one double bond (in **3**) by a single bond (in **1**). Careful comparison of the 13C NMR data of C9-C14 (δ<sup>C</sup> 181.3, 32.5, 28.4, 129.1, 126.9, 17.8 respectively) with those of compound **3** indicated the Δ<sup>10</sup> double bond was reduced. The key homonuclear chemical shift correlation spectroscopy (COSY) correlations of H-10/H-11/H-12/H-13/H-14 together with 1H detected heteronuclear multiple bond correlation (HMBC) correlations from H-11 (δ<sup>H</sup> 2.42) to C-10 (δ<sup>C</sup> 32.5), from H-10 (δ<sup>H</sup> 2.51) to C-12 (δ<sup>C</sup> 129.1), C-9 (δ<sup>C</sup> 181.3), and C-3 (δ<sup>C</sup> 108.4) (Figure 3) confirmed the structure of **1** as the 10,11-hydrogeneted analogue of **3**, which was named as 10,11-dihydrobislongiquinolide.

**Figure 3.** COSY and key HMBC correlations of compounds **1**,**2** and **7**.


**Table 1.** 1H NMR data of experimental compounds **1** and **2** (500 MHz, CDCl3, TMS, δ ppm), literature shared compound **3** [16] (400 MHz, CDCl3, δ ppm), and the hydrogenation product **7** (600 MHz, CDCl3, TMS, δ ppm) (*J* in Hz).

<sup>a</sup> Overlapped by other signals.

Compound **2** was obtained as a yellow oil and it was analyzed by HRESIMS (*m*/*z* 499.2343 [M − H]−, calcd. 499.2326) for the molecular formula C28H36O8, which was one double bond equivalent (DBE) less than **1**. The 1H NMR spectrum of **2** revealed six methyl proton signals, four methylene proton signals, and seven methine protons with four olefinic ones (δ<sup>H</sup> 5.0–6.0) and three aliphatic ones (δ<sup>H</sup> 2.6–3.5). The 1H and 13C NMR data (Table 1) were very similar to those of **1** except that one double bond was hydrogenated in **2**. Careful comparison of the 13C-NMR data of C15-C20 with those of compound **1** indicated that the C16-C17 double bond was reduced, which was further confirmed by the COSY correlations (H-16/H-17/H-18, H-19/H-20) and HMBC correlations from H-18 (δ<sup>H</sup> 5.34) to C-20 (δ<sup>C</sup> 17.8), from H-17 (δ<sup>H</sup> 2.11 and δ<sup>H</sup> 2.19) to C-19 (δ<sup>C</sup> 126.8), from H-16 (δ<sup>H</sup> 2.45–2.60) to C-15 (δ<sup>C</sup> 213.2), C-17 (δ<sup>C</sup> 25.9), and C-18 (δ<sup>C</sup> 128.5) (Figure 3). Thus, the structure of **2** was established and named as 10,11,16,17-tetrahydrobislongiquinolide.

The relative and absolute configurations of compounds **1** and **2** were determined by nuclear overhauser effect spectroscopy (NOESY) correlations (Figure 4), coupling constants (Table 1), hydrogenation reaction (Figure 5), circular dichroism (CD) spectra (Figure 6), and biogenetic considerations. In compound **1**, the *E* geometries of the double bonds in the side chains were deduced by the large coupling constants (15.0 Hz). In compound **2**, the NOESY correlation (Figure 4) between 1-CH3 (δ<sup>H</sup> 1.13) and H-16 (δ<sup>H</sup> 2.45–2.60) suggested the same orientation of 1-CH3 and the chain from C-7 through C-20; the NOESY correlation between H-4 (δ<sup>H</sup> 3.42) and H- 10 (δ<sup>H</sup> 2.45–2.60) indicated a *Z* geometry of Δ3, 9 and located H-4 to the same side of the chain from C-9 through C-14, which was supported by the similar chemical shifts as reported for the known structure of

bislongiquinolide [17] and bisorbibutenolide [16]. The NOESY correlation of 5-CH3 (δ<sup>H</sup> 1.31)/H-10 suggested that 5-CH3 faced to C-3 in the bicyclo[2.2.2] ring. The 7*R\**, 8*S\** relative configuration was suggested by the coupling constant (3*J*H-7, H-8 = 6.6 Hz) [16]. The NOESY correlation of H-16/23-CH3 (δ<sup>H</sup> 1.71) indicated that the 23-CH3 pointed to the direction of the side chain from C- 15 to C-20, while the NOESY correlations of H-10/21-CH3 and H-4/21-CH3 oriented 21-CH3 to another side chain from C-9 to C-14 and H-4. Thus, the relative configuration of compound **2** should be 1*R\**, 4*S\**, 5*S\**, 7*R\**, 8*S\**, 21*S\**, which adopted the same configuration of the bicyclo[2.2.2]octanedione core as the reported analogues [16,17].

**Figure 4.** Key NOESY correlations of compounds **1** and **2**.

**Figure 5.** HPLC analysis of hydrogenation reaction products: (**a–d**) represent the reaction products of compounds **1–4** respectively and (**e**) is a co-injection of all the reaction products of compounds **1–4.**

**Figure 6.** Circular dichroism (CD) spectra of **1**–**4** and **7**.

To confirm the consistent configurations of the co-isolated compounds **1**–**4**, hydrogenation over palladium on carbon was performed respectively. HPLC analysis of the reaction products showed that compounds **1**–**4** gave the same hydrogenation product (Figure 5), which suggested identical configurations of compounds **1**–**4** core structures. The hydrogenated product, named octahydrobislongiquinolide (**7**), was isolated by HPLC.

Compound **7** was obtained as a white powder with the molecular formula C28H40O8 supported by the HRESIMS peak at *m*/*z* 527.2608 [M + Na]<sup>+</sup> (calcd. 527.2615). In addition to the signals expected for the bicyclo[2.2.2] core structure and the butyrolactone ring, the 1D NMR spectra of **7** shows signals for two methyls, eight methylenes, and no olefinic proton indicating that the double bonds in the sorbyl side chains were hydrogenated thoroughly. The key COSY correlations of H-10/H-11/H-12/H-13/H-14 together with HMBC correlations from H-14 (δ<sup>H</sup> 0.87) to C-13 (δ<sup>C</sup> 31.9) and C-12 (δ<sup>C</sup> 22.5), from H-10 (δ<sup>H</sup> 2.40–2.50) to C-11 (δ<sup>C</sup> 25.2) and C-9 (δ<sup>C</sup> 182.4) (Figure 3) confirmed the structure of one sidechain. Similarly, the key COSY correlations of H-16/H-17/H-18/H-19/H-20 together with HMBC correlations from H-16 (δ<sup>H</sup> 2.40–2.50) to C-15 (δ<sup>C</sup> 214.3), from H-19 (δ<sup>H</sup> 1.20–1.30) to C-18 (δ<sup>C</sup> 31.0), from H-20 (δ<sup>H</sup> 0.89) to C-18 (δ<sup>C</sup> 31.0) (Figure 3) confirm the structure of the other sidechain. Thus, the structure of **7** was established and named as octahydrobislongiquinolide.

Therefore, the absolute configurations of **1** and **2** were proposed to be the same as the co-isolated **3** and **4**, which were also supported by the optical rotation values (**1**, **2**, **3**, **4**: +35.6, +71.8, +105 [17], +350 [19]), the biogenetic consideration, and the similar trend in CD curves (Figure 6). Moreover, in the CD spectrum as shown in Figure 6, the curves of compounds **1**–**4** and **7** clustered into two groups (**1**, **2,** and **7** were in one group and **3** and **4** belonged to the other), which suggested that the double bond at C-10 on the 3-sorbyl substitution made a dominant contribution to the shape of the CD spectrum compared with the double bonds at other positions such as C-12, C-16 and C-18 in bicyclo[2.2.2]octanedione containing sorbicillinoid structures.

The cytotoxicities of compounds **1**–**4** were evaluated using HL-60, K562, BEL-7402, HCT-116, A549, Hela, L-02, MGC-803, SH-SY5Y, PC-3, H446, U87, MDA-MB-231, HO8910, ASPC-1 and MCF- 7 cell lines, but none of them presented a cytotoxic effect at 30 μM. The antimicrobial activity was also evaluated, with no activity detected under the concentration of 30 μM either. Compounds **1**–**4** exhibited identical weak siderophore activity with chrome azurol sulfonate (CAS) with an ED50 value of 400 μM (deferoxamine mesylate was used as a positive control with an ED50 value of 100 μM). Inspired by the radical scavenging activity of the known compound **3** [16], 1,1-diphenyl-2-picrylhydrazyl radical 2,2-diphenyl-1-(2,4,6-trinitrophenyl)hydrazyl (DPPH) radical scavenging assay was used to test the radical scavenging activity of compounds **2**–**4** (a paucity of compound **1** prevented analysis), and they showed similar effects with ED50 values of 166.7 μM, 183.3 μM, and 89.6 μM (the value of ascorbic acid was 27.3 μM as positive control).

The sorbicillinoids are a family of hexaketide metabolites, with the presence of the sorbyl sidechain as a unique structural feature. Monomeric, dimeric, trimeric, and nitrogen-containing metabolites constitute the sorbicillinoids, among which dimers are the most common, with potential bioactivities as drug leads [23,24]. The compounds **1**–**4** in this report are an expansion of the library of the bridged bicyclic bisorbicillinoids with the bicyclo[2.2.2]octanedione core structure. In contrast to the previous discovery strategies of conventional extraction and separation, biosynthesis or total synthesis, it is the first report of sorbicillinoid discovery by activating silent gene clusters. The above result shows that overexpression of the global regulator LaeA is a useful method to discover new natural products by activating silent biosynthetic gene clusters.

#### **3. Materials and Methods**

#### *3.1. General Experimental Procedures*

DNA restriction enzymes were used as recommended by the manufacturer (New England Biolabs, NEB, Beijing, China). Polymerase chain reaction (PCR) was performed using *TransStart*® *Fastpfu* DNA Polymerase (Transgen Biotech, Beijing, China). PCR products were confirmed by PCR analysis using <sup>2</sup>×*EasyTaq*® *PCR SuperMix* (Transgen Biotech, Beijing, China). Genomic DNA samples were prepared using the CTAB isolation buffer at pH 8.0 (20 g/L cetyltrimethylammonium bromide, 1.4 M sodium chloride, and 20 mM EDTA) [25]. The gene-specific primers are listed in Table S1. UV spectra were recorded on a Beckman DU 640 spectrophotometer (Beckman Coulter Inc., Brea, CA, USA). Specific optical rotations were obtained using a JASCO P-1020 digital polarimeter (JASCO Corporation, Tokyo, Japan). Electrospray ionization-mass spectrometry (ESIMS) and HRESIMS were obtained on a Thermo Scientific LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA) or using a Micromass Q-TOF ULTIMA GLOBAL GAA076 LC Mass spectrometer (Wasters Corporation, Milford, MA, USA). 1H NMR, 13C NMR and 2D NMR spectra were recorded on an Agilent 500 MHz DD2 spectrometer (Agilent Technologies Inc., Santa Clara, CA, USA). LC-MS was performed using an Acquity UPLC H-Class coupled to a SQ Detector 2 mass spectrometer using a BEH C18 column (1.7 μm, 2.1 × 50 mm, 1 mL/min) (Waters Corporation, Milford, MA, USA). Semi-preparative HPLC (YMC Co., Ltd., Kyoto, Japan) was performed on an ODS column (YMC-Pack ODS-A, 10 × 250 mm, 5 μm, 3 mL/min). Medium-pressure liquid chromatography (MPLC) was performed on a Bona-Agela CHEETAHTM HP100 (Beijing Agela Technologies Co., Ltd., Beijing, China) [26].

#### *3.2. Materials and Culture Conditions*

The fungal strain, authenticated as *P. dipodomyis* YJ-11, was collected from the marine sediment around the sewage outlet in Jiaozhou Bay of Qingdao. The strain was identified by internal transcribed spacer (ITS) sequence and the sequence data were submitted to GenBank (accession number: MK682303). Working stocks were prepared on Potato Dextrose agar slants stored at 4 ◦C in our laboratory.

The strain was incubated in media potato dextrose agar (PDA, 20% potato, 2% dextrose, and 2% agar) at 28 ◦C for 5 days for sporulation. Media potato sorbitol agar (PSA, 20% potato, 2% dextrose, 1.2 M D-sorbitol, and 2% agar) and PDA with 400 μg/mL zeocin (Sigma) were used to screen resistant transformants. The mutants were also cultured at 28 ◦C in an incubator. For compound production, the strains were cultured on potato dextrose (PDB, 20% potato and 2% dextrose) at 28 ◦C, 180 rpm for 9 days. *Trans*1-*T*1 Phage Resistant Chemically competent cell (Transgen Biotech, Beijing, China) was used for plasmid preservation and amplification, following standard recombinant DNA techniques. *E. coli* cultures were growing at 37 ◦C in another incubator.

#### *3.3. Sequence Analysis of the PdLaeA Gene*

The LaeA gene was analyzed by Localblast with the reported LaeA obtained in national center for biotechnology information (NCBI). For the multiple sequence alignment analysis, the amino acid sequences of PdLaeA and other LaeA homologues from different species retrieved from NCBI were aligned using the ClustalX software [27]. The phylogenetic analysis was conducted with the MEGA7 software [28]. The conserved domain of the PdLaeA protein was scanned by the InterProScan program [29].

#### *3.4. Construction of the PdLaeA Expression Vector*

The overexpression vector pZeo which mainly contains a constitutive promoter PgpdA, resistant ampicillin gene and resistant zeocin gene as selection markers was digested with restriction endonucleases XhoI and XbaI. The PdLaeA gene was PCR-amplified (from genomic DNA of the wild-type strain *P. dipodomyis* YJ-11) using specific primers containing XhoI and XbaI restriction sites (Table S1). The PCR products were digested with the same endonucleases and PdLaeA gene was introduced into pZeo vector to generate pZeo-PdLaeA (Figure S3 in Supplementary Materials). The recombinant vector was transformed into competent *E. coli* strain *Trans*1-*T*1 to extract plasmids for transformation.

#### *3.5. Fungal Protoplast Formation and Transformation*

The strain *P. dipodomyis* YJ-11 was first grown on PDA plates at 28 ◦C for 5 days. Fresh spores were collected into 50 mL PDB together with yeast extract media in 250 mL Erlenmeyer flasks and germinated at 28 ◦C and 180 rpm for about 7 h. Mycelia were gathered by centrifugation at 4000 rpm for 15 min, and washed by 25 mL osmotic buffer (1.2 M MgSO4, 10 mM sodium phosphate, pH 5.8). Subsequently, the mycelia were suspended into 10 mL of osmotic buffer containing 30 mg lysing enzymes from *Trichodema harzianum* (Sigma) and 20 mg Yatalase (TaKaRa), transferred into an empty sterile bottle, and cultured in a shaker of 28 ◦C at 80 rpm overnight to form protoplast.

After the whole night, the mixture was collected in a 50 mL centrifuge tube and covered gently with isopyknic protoplast trapping buffer (0.6 M sorbitol, 0.1 M pH 7.0 Tris-HCl). After centrifugation at 4000 rpm for 15 min at 4 ◦C, protoplasts were collected in the interface of the above two buffers. The protoplasts were then transferred to a sterile 50 mL centrifuge tube and washed by 20 mL STC buffer (1.2 M sorbitol, 10 mM CaCl2, 10 mM pH 7.5 Tris-HCl). The protoplasts were resuspended in 2 mL STC buffer for transformation. Then, the freeze-drying plasmids pZeo-PdLaeA and the plasmid pZeo (the desired mutants were regarded as a control in the following crude analysis) were dissolved in 50 μL STC buffer, followed by 100 μL protoplast suspension and the mixture incubated for 60 min on ice. Next, 600 μL of polyethylene glycol (PEG) solution (60% PEG, 50 mM calcium chloride and 50 mM pH 7.5 Tris-HCl) was added to the protoplast mixture, and the mixture was incubated at room temperature for an additional 25 min. The mixture was spread on the regeneration solid PSA medium (PDA medium with 1.2 M sorbitol and 400 μg/mL zeocin) and incubated at 28 ◦C for around 4 days [30].

#### *3.6. Transformants Screening*

After regeneration, the transformants were passaged to PDA plates with 400 μg/mL zeocin respectively. Zeocin resistant mutants were transferred onto new PDA media containing 400 μg/mL zeocin for the second screening. The strains that were able to grow were subjected to further PCR analysis validation. The putative OE::PdLaeA mutants and the wild-type strain were cultured on PDA media for 5 days at 28 ◦C in an incubator in order to extract genomic DNAs. PCR analysis to confirm the gene insertion was carried out using three pairs of primers to verify the zeocin resistant mutants as shown in Table S1 and Figure S4 (primers gpda-1 and gpda-2 to verify whole of the PdLaeA gene, primers gpda-1 and YZ-LaeA-F to verify the upstream of target gene, primers gpda-2 and YZ-LaeA-R to verify the downstream of target gene). Those transformants which went through all these verifications were recognized as the desired mutants.

#### *3.7. Fermentation and Extraction*

For small-scale analysis, the OE::PdLaeA mutant strains and the control strain were grown on PDA plates for 5 days at 28 ◦C. Shortly after sporing, they were inoculated into 150 mL of PDB medium and cultured at 28 ◦C, 180 rpm. Nine days later, the cultures were extracted with twice the volume of ethyl acetate. The organic phase was evaporated and the residue was dissolved in MeOH, which was analyzed by HPLC and indicated that one of the six mutants showed an apparent change in metabolite production (Figure 1).

For compound isolation, the selected strain was initially handled the same as above. Then a large-scale fermentation was performed in 500 mL Erlenmeyer flasks (total 10 L) for further incubation. The broth was extracted three times with ethyl acetate to give a total of 60 L of extract solution. The organic phase was evaporated under reduced pressure to afford a crude residue (12 g).

#### *3.8. Compound Isolation*

The crude extracts were separated by MPLC (60% to 100% MeOH in H2O for 60 min). The fractions containing the target compounds were combined and concentrated to give 7 fractions (Fr.1 to Fr.7). For further purification, semi-preparative HPLC were carried out. Fr.4 was then purified by HPLC

(55% MeCN in H2O, 0.5% THF) to obtain Fr.4-1 (**6**; 5 mg; tR 24.2 min). Fr.5 was purified by HPLC (42% MeCN in H2O, 0.5% THF) to obtain Fr.5-1 (**3**;18 mg; tR 30.8 min) and Fr.5-2 (**1**; 7 mg; tR 38.0 min). Fr.6 was purified by HPLC (60% MeOH in H2O, 0.5% THF) to give five subfractions (Fr.6- 1 to Fr.6-5), in which Fr.6-1 (**1**; 4 mg; tR 30.3 min), Fr.6-4 (**2**; 35 mg; tR 46.1 min), Fr.6-5 (**5**; 6 mg; tR 52.2 min) were proven to be pure. Fr.6-3-1 (**4**; 84 mg; tR 20.5 min) was purified by HPLC (50% MeCN in H2O, 0.5% THF) from Fr.6-3. The purity of each compound was checked by LC-MS and the structures were confirmed by NMR including 1H, 13C, and 2D NMR spectra.

10,11-Dihydrobislongiquinolide (**1**): [α] 20 <sup>D</sup> 35.6 (c 0.13, MeOH), UV (MeOH) λmax (log ε): 228 (3.67), 296 (2.97) nm, 1H-NMR (CDCl3, 500 MHz) and 13C-NMR (CDCl3, 125 MHz) data are shown in Tables 1 and 2, HRESIMS *m*/*z*: 497.2185 [M − H]<sup>−</sup> (Calcd. for C28H34O8: 497.2170).

10,11,16,17-Tetrahydrobislongiquinolide (**2**): [α] 20 <sup>D</sup> 71.8 (c 0.50, MeOH), UV (MeOH) λmax (log ε): 237 (1.22), 296 (3.24) nm, 1H-NMR (CDCl3, 500 MHz) and 13C-NMR (CDCl3, 125 MHz) data are shown in Tables 1 and 2, HRESIMS *m*/*z*: 499.2343 [M − H]<sup>−</sup> (Calcd. for C28H36O8: 499.2326).

**Table 2.** 13C NMR data of experimental compounds **1** and **2** (125 MHz, CDCl3, TMS, δ ppm), literature shared compound **3** [16] (100 MHz, CDCl3, δ ppm) and the hydrogenation product **7** (150 MHz, CDCl3, TMS, δ ppm).


<sup>b</sup> Detected by HMBC spectrum.

#### *3.9. Hydrogenation of Compounds 1–4*

Compounds **1**–**4** in methanol were hydrogenated over palladium on carbon at room temperature overnight, respectively. The reaction mixture was filtered, concentrated, and purified by semipreparative HPLC (65% MeOH in H2O, 0.1% THF) to give octahydrobislongiquinolide (**7**; 5 mg;

tR 15.6 min). The purity of compound **7** was checked by LC-MS and the structures were confirmed by NMR including 1H, 13C and 2D NMR spectra.

Octahydrobislongiquinolide (**7**): [α] 20 <sup>D</sup> 78.4 (c 0.10, MeOH), UV (MeOH) λmax (log ε): 227 (4.78), 297 (2.13) nm, 1H-NMR (CDCl3, 600 MHz) and 13C-NMR (CDCl3, 150 MHz) data are shown in Tables 1 and 2, HRESIMS *m*/*z*: 527.2608 [M + Na]<sup>+</sup> (calcd. 527.2615 for C28H40O8Na).

#### *3.10. Assay of Cytotoxicity, Antimicrobial and Antioxidation Activity*

These biological evaluations were carried out as previously reported in References [26,31].

#### *3.11. Assay of Siderophore Activity*

Siderophore activity was evaluated by the chrome azurol sulfonate (CAS) assay. Buffer A consist of 50 mL 5 M CAS solution and 10 mL Fe3<sup>+</sup> solution (1 mM FeCl3·6H2O, 10 mM HCl). 72.9 mg hexadecyl trimethyl ammonium bromide (HDTMA) was dissolved in 40 mL water after heated as buffer B. Buffer B was then slowly added to buffer A under stirring to afford 100 mL CAS assay solution. All tested compounds were dissolved in MeOH at stepwise concentrations (1–100 mM) then 99 μL of the compound solution and 1 μL CAS assay solution were dispensed into wells of a 96-well microtiter tray. The mixture was shaken and left to stand for 4 h. After that, absorbance was measured at 630 nm and the inhibition rate was calculated. Deferoxamine mesylate, which could make the color change of CAS from blue to orange, was used to determine standard curves relating the CAS reactivity to iron-binding ligands. The ED50 values denoted the concentration of sample required to remove 50% of the iron in CAS solution [32,33].

#### **4. Conclusions**

In summary, guided by bioinformatic analysis of the genomic sequence of *P. dipodomyis* YJ-11, we discovered one global regulator of PdLaeA. Further overexpression of the PdLaeA gene in *P. dipodomyis* YJ-11 led to the discovery of two new compounds, 10,11-dihydrobislongiquinolide (**1**), 10,11,16,17-tetrahydrobislongiquinolide (**2**), together with four known sorbicillin analogues (compounds **3**–**6**). This is the first report on application of global regulator LaeA in *P. dipodomyis* with the purpose of increasing the secondary metabolite producing potential. This result also indicated that the production of sorbicillinoids may be regulated by the global regulator LaeA.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1660-3397/17/8/446/s1, Figure S1: AntiSMASH analysis of the genome of the strain *Penicillium dipodomyis* YJ-11, Figure S2: Phylogenetic tree analysis of PdLaeA and its homologs from different species, Figure S3: Map of the vector pZeo and PdLaeA overexpression plasmid pZeo-PdLaeA, Figure S4: PCR analysis for confirming the gene insertion, Figures S5–S11: 1D, 2D NMR and HRESIMS spectra of 10,11-dihydrobislongiquinolide (**1**), Figures S12–S18: 1D, 2D NMR and HRESIMS spectra of 10,11,16,17-tetrahydrobislongiquinolide (**2**). Figures S19–S20: 1D NMR spectra of bislongiquinolide (**3**). Figures S21–S22: 1D NMR spectra of 16,17-dihydrobislongiquinolide (**4**), Figures S23–S28: 1D, 2D NMR and HRESIMS spectra of octahydrobislongiquinolide (**7**), Table S1: The primers used in this study.

**Author Contributions:** The contributions of the respective authors are as follows: J.Y. drafted the work, constructed the plasmids and performed the fermentation, extraction, as well as the isolation. X.Z., H.H., and C.M. were involved in the acquisition of mutant strains and bioinformatic analysis. J.Y. and C.S. elucidated the constituents and performed the biological evaluations. Q.C., Q.G., T.Z., and G.Z. contributed to checking and confirming all procedures of isolation and identification. D.L. designed the study, supervised the laboratory work, and contributed to the critical reading of the manuscript, and was also involved in structural determination and bioactivity elucidation. All the authors have read the final manuscript and approved the submission.

**Funding:** This work was financially supported by the Fundamental Research Funds for the Central Universities (201941001), NSFC-Shandong Joint Fund for Marine Science Research Centers (U1606403), Financially supported by the Marine S&T Fund of Shandong Province for Pilot National Laboratory for Marine Science and Technology (Qingdao) (2018SDKJ0401-2), Taishan Scholar Youth Expert Program in Shandong Province (tsqn201812021).

**Acknowledgments:** The authors recognize material contributions from Prof. Yi Tang (University of California, Los Angeles) for providing plasmids.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Genome Sequencing and Analysis of** *Thraustochytriidae* **sp. SZU445 Provides Novel Insights into the Polyunsaturated Fatty Acid Biosynthesis Pathway**

**Xingyu Zhu 1,**†**, Shuangfei Li 1,2,3,**†**, Liangxu Liu 1, Siting Li 1, Yanqing Luo 1, Chuhan Lv 1, Boyu Wang 1,2,3, Christopher H. K. Cheng 4, Huapu Chen <sup>5</sup> and Xuewei Yang 1,2,3,\***


Received: 23 December 2019; Accepted: 14 February 2020; Published: 18 February 2020

**Abstract:** *Thraustochytriidae* sp. have broadly gained attention as a prospective resource for the production of omega-3 fatty acids production in significant quantities. In this study, the whole genome of *Thraustochytriidae* sp. SZU445, which produces high levels of docosapentaenoic acid (DPA) and docosahexaenoic acid (DHA), was sequenced and subjected to protein annotation. The obtained clean reads (63.55 Mb in total) were assembled into 54 contigs and 25 scaffolds, with maximum and minimum lengths of 400 and 0.0054 Mb, respectively. A total of 3513 genes (24.84%) were identified, which could be classified into six pathways and 44 pathway groups, of which 68 genes (1.93%) were involved in lipid metabolism. In the Gene Ontology database, 22,436 genes were annotated as cellular component (8579 genes, 38.24%), molecular function (5236 genes, 23.34%), and biological process (8621 genes, 38.42%). Four enzymes corresponding to the classic fatty acid synthase (FAS) pathway and three enzymes corresponding to the classic polyketide synthase (PKS) pathway were identified in *Thraustochytriidae* sp. SZU445. Although PKS pathway-associated dehydratase and isomerase enzymes were not detected in *Thraustochytriidae* sp. SZU445, a putative DHA- and DPA-specific fatty acid pathway was identified.

**Keywords:** whole-genome sequencing; docosahexaenoic acid (DHA); polyunsaturated fatty acid; fatty acid synthesis pathway; polyketide synthase pathway

#### **1. Introduction**

Long-chain polyunsaturated fatty acids (LC-PUFAs), such as docosahexaenoic acid (DHA, 22:6n-3), eicosapentaenoic acid (EPA, 20:5n-3), and arachidonic acid (ARA, 20:4n-6), have gained increasing attention because of their potential health benefits to humans. These benefits have been demonstrated

in a number of previous studies, and include the proper development of infant brain and eyes [1,2]; cardiovascular disease prevention [3]; antioxidative, anti-inflammatory [4] as well as anti-cancer properties [5], and the prevention of Alzheimer's disease [6]. Currently, ocean fish are the primary PUFA production source. However, because of the drawbacks of a fishy odor, the accumulation of contaminants, and declining fish stock, it is necessary to identify other sustainable and relatively safer alternative sources of PUFA [7].

Thraustochytrids, a single-celled eukaryotic marine protist belonging to the class Labyrinthulomycetes, can accumulate large amounts of DHA, resulting in these organisms having attracted a great deal of scientific and industrial interest. Research has shown that some thraustochytrid strains can be cultivated to yield high biomasses containing substantial quantities of lipids abundant in PUFAs [8]. There is an abundance of research on the optimization of fermentation parameters in terms of salinity [9], pH, temperature [10], and cultivation medium [11] for high DHA production. Besides, metabolic engineering is also used as a promising approach to promote DHA productivity. Recent research has indicated that DHA is synthesized by two distinct pathways in thraustochytrid: the fatty acid synthase (FAS) pathway and the polyketide synthase (PKS) pathway [12]. The standard FAS pathway synthesizes fatty acids through a series of elongase- and desaturase-catalyzed reactions. Delta-4 desaturase, delta-5 desaturase, and delta-12 desaturase have been reported in *Thraustochytrium aureum* ATCC 34304 [13–15], and delta-5 elongase, delta-6 elongase, and delta-9 elongase have also been successfully identified in some thraustochytrid strains [16,17]. Fatty acids are synthesized through the PKS pathway via highly repetitive cycles of four reactions, including condensation by ketoacyl synthase (KS), ketoreduction (KR), dehydration, and enoyl reduction (ER). Meesapyodsuk and Qiu identified three large subunits of a type I PKS-like PUFA synthase in *Thraustochytrium* sp. 26185 [18]. Nevertheless, to date, the exact biosynthetic mechanism of DHA in *Thraustochytrid* species remains unknown.

De novo genome assembly enables functional genomics research in species whose genetic information is limited, especially in many "non-model" organisms. However, for the *Thraustochytrids* species, the available genomic information is limited. Ye et al. reconstructed a genome-scale metabolic map of *Schizochytrium limacinum* SR21 and found that the biosynthesis of DHA via the PKS pathway requires abundant acetyl-CoA and NADPH, and that 30 genes were predicted as potential targets for DHA overproduction [19]. Comparative genomic analysis revealed that both the FAS and PKS pathways of PUFA production were incomplete and that secreted carbohydrate-active enzymes were remarkably consumed in the genomes of two newly isolated *Thraustochytrids* strains, Mn4 and SW8 [20]. Genome sequence and analysis of *Thraustochytrium* sp. 26185 also showed that both aerobic and anaerobic pathways are present in this strain. However, an essential gene encoding stearate delta-9 desaturase was missing in the aerobic pathway.

*Thraustochytriidae* sp. SZU445 is a promising DHA-producing candidate with DHA production levels up to 45% of the dry cell weight. In this study, the high-quality genome assembly of *Thraustochytriidae* sp. SZU445 is reported using combined second- and third-generation sequencing strategies. The comprehensive genomic analyses provide new insight into PUFA biosynthesis and molecular machinery, laying a framework for future genetic studies and the biotechnological utility of oleaginous microorganisms.

#### **2. Results**

#### *2.1. Genome Sequencing and De Novo Assembly*

A total of 64.71 Mb raw data were generated by PacBio platform (BGI, Shenzhen, China) and MGI2000 platform (MGI, Shenzhen, China). After shortening the adaptor sequences and filtering out low-quality data, 63.55 Mb of clean data was obtained. The PacBio platform was then used to generate high-quality subreads. From 1,055,214 subreads, 11,245,510,279 bp were obtained with 25 scaffolds and 54 contigs generated after sequence assembly (Table 1). The N50 values of the scaffolds and contigs were 5,997,638 bp and 2,552,134 bp, respectively. The GC content of the scaffolds and contigs was 45.04%. The Poisson distributions of the GC content and GC depth analysis (Figure 1) suggested that there was no external DNA contamination or sequencing bias, demonstrating that the genome assembly was of high quality. In addition, according to the sequence of the sequenced marine fungus, the (G − C)/(G + C) calculation method was used for GC skew analysis, and based on the results of gene distribution, ncRNA distribution, and annotation, a gene circle was constructed that indicated the distribution of each component in the genome (Figure 2). The genome of *Thraustochytriidae* sp. SZU445 (Table 2) contains approximately 14,145 genes, 18,768 exons, 14,145 CDSs, and 4623 introns. The total length of genes is 26,947,341 bp, with an average length of 1905.08 bp. Among all genes, those with lengths of 200–499 nt and 500–999 nt were drastically more abundant than those of other lengths (Figure 3). Based on the assembly and annotation results, 1106 copies of the noncoding RNA were predicted, which included nearly 493 copies (0.06%) of tRNA, 235 copies (0.64%) of rRNA, and 77 copies (0.01%) of snRNA (Table 3).

**Figure 1.** Statistical analysis of the GC content and depth correlation analysis of *Thraustochytriidae* sp. SZU445. The abscissa is the GC content, and the ordinate is the average depth. The scatter plot shows a shape that approximates a Poisson distribution and shows that sequencing data have low GC bias.

#### *2.2. Genome Sequence Annotation*

All genes of *Thraustochytriidae* sp. SZU445 were aligned to sequences in 18 public databases, including the Carbohydrate-Active enZYmes Database (CAZY), the Transporter Classification Database (TCDB) database, Swiss-Prot, InterPro, GO, KOG/COG, and KEGG (Table 4). The number and percentage of genes annotated by each database are presented in Table 4. Notably, 30.35% of the genes were unidentifiable, the percentage of which was similar to that observed in other species [21–23].

**Figure 2.** Genomic circle diagram of *Thraustochytriidae* sp. SZU445. From the outer to the inner rings: Genome (sorted by length), gene density (gene number in 50,000 bp nonoverlapping windows), ncRNA density (ncRNA number in 100,000 bp nonoverlapping windows), GC (GC rate in 20,000 bp nonoverlapping windows), GC\_skew (GC skew in 20,000 bp nonoverlapping windows).

**Figure 3.** Gene length distribution of *Thraustochytriidae* sp. SZU445. The abscissa is the length of the gene, and the ordinate is the number of genes corresponding to the length of the gene.


**Table 1.**

Summary of clean data assembly for

*Thraustochytriidae*

 sp. SZU445.




Type: ncRNA type. Copy: The number of ncRNA type copies. Avg\_Len: The average length of ncRNA. Total\_Len: The total length of ncRNA. % in Genome: ncRNA types as a percentage of the genome.



Genes and Genomes. KOG: EuKaryotic Orthologous Groups. COG: Clusters of Orthologous Groups. P450: Fungal Cytochrome P450 Database. TF: Transcription Factor database. EKPD: Eukaryotic Protein Kinases and Protein Phosphatases. NOG: Evolutionary genealogy of genes: Non-supervised Orthologous Groups. CARD: The Comprehensive Antibiotic Resistance Database. CWDE: Cell Wall Degrading Enzyme. NR: Non-Redundant Protein Database. DBCAN: a web server and Database for automated Carbohydrate-active enzyme ANnotation. PHI: Pathogen–Host Interactions. PHOSPHATASE: a phosphatases database of EKPD.

The GO database was founded by the Gene Ontology Association in 1988 including three categories: (1) cellular component, used to describe locations, subcellular structures, and macromolecular complexes such as nucleoli, telomeres, and the recognition of the initial complex; (2) molecular function, used to describe the gene functions and gene products, such as binding to carbohydrates or ATP hydrolase activity; and (3) biological process, used to describe the ordered combination of molecular functions to achieve broader biological functions, such as mitosis [24]. Genes are assigned to one or more of these categories depending on the nature of the product. Through the GO database annotation, we speculated the possible functions of genes based on their annotations in different categories. In *Thraustochytrium* sp. SZU445, 22,446 genes were annotated based on the GO database (Figure 4). The majority genes in the biological process category were involved in cellular and metabolic processes, with 3084 and 2634 genes identified, respectively. Simultaneously, only one gene participated in cell proliferation and detoxification in the cellular component category. For the cellular component category, cell (1027 genes), cell part (1027 genes), and membrane (872 genes) were the top three terms. In the molecular function category, the number of genes participating in binding (4506 genes) and catalytic activity (3305 genes) occupied the dominant position. Only one gene related to metallochaperone activity was identified in the molecular function category. According to the literature, fatty acid synthesis-related genes are primarily classified in the metabolic process group of biological process [25,26]. Thus, fatty acid anabolic processes dominated the biological processes of *Thraustochytrium* sp. SZU445.

**Figure 4.** Distribution of GO database functional annotations. The ordinate is the annotation item, and the abscissa is the number of corresponding genes.

KEGG is a database of large-scale molecular data generated from molecular-level information—particularly genome sequencing and other high-throughput experimental techniques—to understand the advanced functions and utility of biological systems. The database divides biological pathways into eight categories, each of which has subdivisions. Each subdivision is annotated with an associated gene that is then graphically displayed. Through this database annotation, it is easy to find genes of all annotations related to a specific function. According to the results, 3513 genes were assigned into 6 KEGG classifications (cellular processes, environmental, genetic, human diseases, metabolism, and organismal systems) in *Thraustochytrium* sp. SZU445 (Figure 5). The percentages of genes involved in the 6 KEGG classifications (metabolism and organismal systems, environmental, human diseases, genetic, cellular processes) were 10.84%, 5.65%, 13.13%, 20.66%, 34.10%, and 15.74%, respectively. Among them, the dominant classification was genes involved in metabolism pathways (1198 genes). The metabolism classification can also be divided into 12 subclassifications (metabolism of amino acid, biosynthesis of other secondary, carbohydrate, energy, lipid, cofactors and vitamins, other amino acids, terpenoids and polyketides, nucleotide, global and overview maps, glycan biosynthesis and metabolism, xenobiotic biodegradation, and metabolism). Sixty-eight genes participated in lipid metabolism, which was fifth among the remaining subclassifications. This result provided the foundation for further gene analysis of fatty acid metabolism. In particular, the number of genes involved in global and overview maps (430 genes) was the largest among all subclassifications and classifications.

**Figure 5.** Distribution of KEGG database functional annotations. The ordinate is the annotation item, and the abscissa is the number of corresponding genes.

#### *2.3. Phylogenetic Analysis of Thraustochytriidae sp. SZU445*

Based on the 18S rRNA sequences of the sample Thraustochytriidae sp. SZU445 and the reference strains, we used the MEGA software to construct a phylogenetic tree. The neighbor-joining method was applied for the analysis. As shown in Figure 6, Thraustochytriidae sp. SZU445 was shown to share a common ancestor with all reference strains. The evolutionary position of SZU445 is lower compared with the reference strains, while it is higher than Aurantiochytrium sp. HS399. In terms of kinship, Thraustochytriidae sp. SZU445 showed a higher kinship than Aurantiochytrium sp. LY-2012.

**Figure 6.** The phylogenetic tree produced using the neighbor-joining method analysis. The evolutionary history was inferred using the neighbor-joining method. The optimal tree with the sum of branch length = 0.01309240 is shown. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches. The evolutionary distances were computed using the maximum composite likelihood method and are in the units of the number of base substitutions per site. The analysis involved six nucleotide sequences. Codon positions included were 1st + 2nd + 3rd + noncoding. All ambiguous positions were removed for each sequence pair. There were a total of 1739 positions in the final dataset. Evolutionary analyses were conducted in MEGA7.

#### *2.4. Analysis of Genes Involved in Long-Chain Fatty Acid (LCFA) Biosynthesis*

We selected the genes that participated in LCFA biosynthesis from the genome sequencing and protein annotation results in the KEGG database (Table 5). There are two main ways to synthesize LCFAs: The FAS pathway and the PKS pathway.

**Table 5.** Enzymes involved in fatty acid biosynthesis and metabolism identified by annotation of the *Thraustochytriidae* sp. SZU445 genome.



**Table 5.** *Cont.*

In *Thraustochytrium* sp. SZU445, FAS (KEGG, EC: 2.3.1.86) (total gene length: 12,443 bp; CDS length: 12,285 bp; number of amino acids: 4147; number of copies: 1) and acyl carrier protein (ACP) (COG, COG ID:COG0236) (total gene length: 429 bp; CDS length: 429 bp; number of amino acids: 143; number of copies: 1) were sequenced and annotated. In the FAS pathway, the acetic acid to long-chain saturated fatty acids (C16:0 and C18:0) step of primary fatty acid biosynthesis is catalyzed by a type I FAS, and long-chain unsaturated fatty acids are generated after a string of desaturation and elongation reactions through desaturases and elongases [27]. FAS is a multifunctional polypeptide of 4147 amino acids that includes four catalytic domains for biochemical reactions: condensation, ketoacyl reduction, hydroxyacyl dehydration, and enoyl reduction [28]. Additionally, it also possesses an ACP domain for binding of the phosphopantetheine prosthetic group to carry an acyl chain. Furthermore, a malonyl-CoA ACP transacylase (MAT) domain, that can catalyze the transformation of malonyl-CoA to malonyl-ACP, was also detected in the enzyme [29]. In the aerobic pathway, the biosynthesis of LCFAs involves a string of desaturation and elongation reactions. Therefore, desaturase and elongase are highly significant in the aerobic FAS pathway. The second half of the FAS pathway can be divided into two subpathways. One subpathway starts with C18:3 (9,12,15), the biosynthesis of which involves the delta-6 desaturation of C18:3 (9,12,15) to C18:4 (6,9,12,15), the elongation of C18:4 (6,9,12,15) to C20:4 (8,11,14,17), and the delta-5 desaturation of C20:4 (8,11,14,17) to C20:5 (5,8,11,14,17) followed by another elongation of C20:5 (5,8,11,14,17) to C22:5 (7,10,13,16,19) and a final delta-4 desaturation of C22:5 (7,10,13,16,19) to C22:6 (4,7,10,13,16,19). The same enzymes are involved in both the first and second pathways. In *Thraustochytrium* sp. SZU445, two desaturases and elongases were detected: delta-4 (KEGG, EC: 1.14.19.17 1.14.18.5) (total gene length: 1471 bp; CDS length: 1167 bp; number of

amino acids: 388; number of copies: 1), delta-9 (KEGG, EC: 1.14.19.18) (total gene length: 1601 bp; CDS length: 1601 bp, number of amino acids: 541; number of copies: 1) and elongase (KOG, KOG ID:KOG3072) (total gene length: 1088 bp; CDS length: 1088 bp; number of amino acids: 362; number of copies: 1). These protein products exhibited 100.00%, 95.54%, and 99.87% similarity to the sphingolipid delta-4 desaturase of *Hondaea fermentalgiana* (NCBI accession: GBG25593.1), the delta-9 desaturase of *Hondaea fermentalgiana* (NCBI accession: GBG34675.1), and the elongation of very-long-chain fatty acids protein 6 of *Hondaea fermentalgiana* (NCBI accession: GBG25297.1) [30]. The characterization of this desaturase gene and its function was reported by exogenous expression in yeast [31–33]. However, there were only two desaturases (delta-4 and delta-9 desaturase) annotated in *Thraustochytrium* sp. SZU445 that belonged to the classic FAS pathway. The results indicated that a nonclassic FAS pathway may exist in *Thraustochytrium* sp. SZU445 for LC-PUFA synthesis.

PKS has been identified in prokaryotes and eukaryotes [12]. Briefly, fatty acids and polyketides are constructed by repeated decarboxylative Claisen ester condensations of an acyl-CoA starter part with malonyl-CoA units catalyzed by a 3-ketoacyl-synthase (KS). This process often involves an ACP and a (malonyl) acyltransferase (MAT/AT). In fatty acid biosynthesis, beta-oxidation processing typically yields an entirely saturated acyl backbone by ketoreductase (KR), dehydratase (DH), and an enoyl reductase (ER) [34]. An isomerase changes the conformation of the fatty acid chain. KS (KEGG, EC: 2.3.1.179) (total gene length: 1335 bp; CDS length: 1335 bp; number of amino acids: 444; number of copies: 1), KR (KEGG, EC: 1.1.1.-) (total gene length: 1489 bp; CDS length: 1266 bp; number of amino acids: 422; number of copies: 1), and ER (KEGG, EC: 1.3.1.9) (total gene length: 1146 bp; CDS length: 1146 bp; number of amino acids: 382; number of copies: 1) were detected in *Thraustochytrium* sp. SZU445 by genome sequencing and protein annotation. The KS, KR, and ER were highly homologous to the beta-ketoacyl-acyl-carrier-protein synthase II of *Aphanomyces invadans* (NCBI accession: XP\_008875515.1), oxidoreductase of "*Candidatus Poribacteria*" bacterium (NCBI accession: RKU36622.1) [35] and tRNA-dihydrouridine(47) synthase NADP+-like of *Hondaea fermentalgiana* (NCBI accession: GBG27734.1) [30] with approximately 94.45%, 95.67%, and 99.00% identity at the amino acid level. These results demonstrate that *Thraustochytrium* sp. SZU445 may possess the PKS pathway but not the classical pathway. Nevertheless, neither the gene nor protein for the DH and isomerase were discovered in *Thraustochytrium* sp. SZU445. We found two enzymes in *Thraustochytrium* sp. SZU445 with functions that were highly similar to those of a DH and an isomerase. These two enzymes were dTDP-glucose 4,6-DH (KEGG, EC: 4.2.1.46) (total gene length: 1134 bp; CDS length: 1134 bp; number of amino acids: 377; number of copies: 1) and delta-3,5-delta-2,4-dienoyl-CoA isomerase (KEGG, EC: 5.3.3.21) (total gene length: 852 bp; CDS length: 852 bp; number of amino acids: 283; number of copies: 1). Both of these proteins were 94.00% and 99.00% homologous to the dTDP-glucose 4,6-DH of *Fistulifera solaris* (NCBI accession: GAX23278.1) [36] and delta-3,5-delta-2,4-dienoyl-CoA isomerase of *Hondaea fermentalgiana* (NCBI accession: GBG25986.1) [30]. Thus, we propose that these two enzymes are the isozymes of DH and isomerase and that they participate in the PKS pathway together with KS, ER, and ER. The specific functions of dTDP-glucose 4,6-DH and delta-3,5-delta-2,4-dienoyl-CoA isomerase are discussed in the next section.

#### **3. Discussion**

#### *3.1. Fatty Acid Synthesis by the FAS Pathway*

In the FAS pathway, the synthesis of PUFAs can be divided into two parts [37]. In the first part, palmitic acid (C16:0) is synthesized by the dehydration of acetyl-CoA and malonyl-COA by the FAS complex (NOG, NOG ID: 49.09) [38]. In the second part, palmitic acid (C16:0) is transformed into a very-long-chain (C22) fatty acid by various enzymes. The FAS complex consists of an ACP (COG, COG ID: COG0236) and six enzyme monomers [39]. The six enzyme monomers are acetyl-CoA-ACP transacylase, malonyl-CoA-ACP transacylase, beta-ketoacyl-ACP synthase, beta-ketoacyl-ACP reductase, beta-hydroxyacyl-ACP, DH, enoyl-ACP reductase, and palmitoyl-ACP

thioesterase [40]. Then, after a string of desaturation and elongation reactions by delta-9, delta-12, delta-15, delta-6, and delta-4 desaturases [31] and elongases, palmitic acid obtains double bonds, extending the carbon chain and finally generating PUFAs. In *Thraustochytrium* sp. SZU445, FAS (NOG, NOG ID: 49.09) and ACP (COG, COG ID: COG0236) were identified. Therefore, the saturated fatty acids before C18:0 can be synthesized. Interestingly, *Thraustochytrium* sp. SZU445 was not observed to contain delta-12, delta-15, or delta-6 desaturases by whole-genome sequencing and protein annotation; however, it could synthesize PUFAs. A similar phenomenon has also been reported in a number of other studies, where *Arthrospira platensis* was observed to only have 15-*cis*-phytoene desaturase and zeta-carotene desaturase, and *Strongylocentrotus nudus* was observed to only have delta-5, delta-6, and delta-9 desaturases [41–44]. This result may indicate that the FAS pathway is not the primary pathway responsible for PUFA synthesis.

#### *3.2. Fatty Acid Synthesis by the PKS Pathway*

Polyketides, the secondary metabolites that harbor multiple units of ketide groups (-CH2-CO-) [45], are synthesized by PKS, which is an enzyme similar to FAS in bacteria [12]. Similar to the FAS pathway, by using ACP as a covalent connection of PUFA synthesis [39,46] (COG, COG ID: COG0236), the PKS pathway proceeds with repeated cycles. The whole cycle includes condensation by using an acyl-ACP and a malonyl-ACP to form a ketoacyl-ACP with KS (KEGG, EC: 2.3.1.179), keto-reduction by converting ketoacyl-ACP to hydroxyacyl-ACP by KR (KEGG, EC: 1.1.1.-), and dehydration by removing a water molecule from hydroxyacyl-ACP. One whole cycle can produce an unsaturated enoyl-ACP, which is then reduced to a saturated acyl chain by ER (KEGG, EC: 1.3.1.9) [47]. However, unlike the FAS pathway, the PKS pathway often neglects steps of the whole cycle, e.g., dehydration and reduction. Therefore, the products of the PKS pathway vary greatly in structure and often contain keto and hydroxyl groups and double bonds. KS, KR, and ER were discovered in *Thraustochytrium* sp. SZU445, but an isomerase and a DH were not identified; however, *Thraustochytrium* sp. SZU445 can also generate PUFAs such as DHA (C22:5) [42–44].

#### *3.3. Putative Fatty Acid Synthesis Pathway in Thraustochytrium sp. SZU445*

Since *Thraustochytrium* sp. SZU445 lacked complete FAS and PKS pathways, we speculated that the synthesis of SZU445 LC-PUFAs involves a combination of the two pathways. According to the whole-genome sequencing and protein annotation results, no DH or isomerase genes were found in *Thraustochytrium* sp. SZU445. However, we identified two enzymes with functions that were highly similar to those of DH and isomerase. The two enzymes are dTDP-glucose 4,6-DH (KEGG, EC: 4.2.1.46) and delta-3,5-delta-2,4-dienoyl-CoA isomerase (KEGG, EC: 5.3.3.21). dTDP-glucose 4,6-DH can catalyze the transformation of dTDP-glucose into dTDP-4-keto-6-deoxyglucose, forming a ketone group [48]. The isomerization of 3-*cis*-octenoyl-CoA to 2-*trans*-octenoyl-CoA can be catalyzed by delta-3,5-delta-2,4-dienoyl-CoA isomerase with specific activity for altering the conformation. Moreover, delta-3,5-delta-2,4-dienoyl-CoA isomerase is indispensable for the beta-oxidation of PUFAs [49]. These two enzymes may be the isoenzymes of DH and isomerase and, hence, perform the same functions.

The protein annotation of *Thraustochytrium* sp. SZU445 revealed that the DH and isomerase of the classic PKS pathway are not present in *Thraustochytrium* sp. SZU445. In addition, only the delta-4 and delta-9 desaturases of the classic FAS pathway were found. Moreover, the fatty acid composition of *Thraustochytrium* sp. SZU445 was analyzed using gas chromatography and shown to primarily include methyl tetradecanoate (C14:0), methyl hexadecanoate (C16:0), DPA, and DHA C20:4 (7,10,13,16). Based on these results, we speculated upon the possible synthesis pathway of long-chain fatty acids in *Thraustochytrium* sp. SZU445.

We inferred that long-chain saturated fatty acids of *Thraustochytrium* sp. SZU445, such as methyl hexadecanoate, are produced by the FAS pathway and that long-chain unsaturated fatty acids, such as DHA, are produced by the PKS pathway, with one desaturase participating in this process instead of the desaturation and elongation steps in the FAS pathway (Figure 6). For the long-chain saturated fatty acids, acetyl-CoA and malonyl-CoA are used as substrates to yield methyl tetradecanoate (C14:0) and methyl hexadecanoate (C16:0) by cycling through the FAS pathway six and seven times, respectively. For the long-chain unsaturated fatty acid DPA, C22:4 (7,10,13,16) is generated by cycling through the PKS pathway ten times. C22:5 (4,7,10,13,16) is then converted from C22:4 (7,10,13,16) via delta-4 desaturase. DPA is eventually produced by the activity of delta-3,5-delta-2,4-dienoyl-CoA isomerase (KEGG, EC: 5.3.3.21). In the biosynthesis of DHA, C20:4 (7,10,13,16) is produced by circulating through the PKS pathway nine times. C20:5 (4,7,10,13,16) is formed through the addition of a double bond on the carbon at the fourth position by delta-4 desaturase. Finally, DHA biosynthesis is completed after one additional cycle of the PKS pathway (Figure 7).

**Figure 7.** Putative fatty acid synthase pathway of *Thraustochytriidae* sp. SZU445 with the classical fatty acid synthesis pathway and the polyketide synthase pathway. The enzymes colored blue are present in the classical fatty acid synthase (FAS) and polyketide synthase (PKS) pathways. The enzymes colored green are present in *Thraustochytriidae* sp. SZU445 and correspond to the classical FAS and PKS pathways. The enzymes colored red are isozymes of dehydrase and isomerase in the PKS pathway that exist in *Thraustochytriidae* sp. SZU445.

#### **4. Materials and Methods**

#### *4.1. Microbes and Cultivation*

*Thraustochytrium* sp. SZU445 was isolated from mangroves (22◦31 13.044" N, 113◦56 58.560" E) in the coastal waters of southern China. Because of the results of an initial assessment for DHA production, *Thraustochytrium* sp. SZU445 was selected for further research. Furthermore, *Thraustochytrium* sp. SZU445 was stored in the China General Microbiological Culture Center (CGMCC) under the accession no. 8095. First, the strain was inoculated into M4 liquid medium, which was prepared in 100% filtered natural seawater containing 0.15% peptone, 2% glucose, 0.025% KH2PO4, and 0.1% yeast extract. Seawater was gathered from Mirs Bay, Shenzhen (22◦31 32.632" N, 114◦28 40.185" E), with a salinity of approximately 30.75–33.19‰. Cultures were incubated for 48 h at a constant temperature of 27 ◦C and with shaking at 180 rpm. Subsequently, 4% of the culture was used to inoculate fresh M4 medium and incubated for three days under the same growth conditions.

#### *4.2. DNA Preparation and Sequencing*

The sequencing platform chosen for this subject was a combination of the PacBio (BGI, Shenzhen, China) and MGI2000 (MGI, Shenzhen, China) platforms. Genomic DNA was extracted from a 50 mL fresh cell suspension of *Thraustochytrium* sp. SZU445, which was centrifuged at 8000 rpm for 10 min. The resulting cell pellets were ground to a fine powder in liquid nitrogen, then washed once in 5.0 mL DNA extraction buffer. Subsequently, the pellets were resuspended in 5.0 mL phenol/chloroform/isoamyl alcohol (25:24:1) and centrifuged twice at 9000 rpm for 16 min. The supernatant was then washed in an equal volume of chloroform and centrifuged at 12,000 rpm for 16 min. Genomic DNA was precipitated by adding 2.5 volumes of 100% ethanol and collected by centrifugation at 12,000 rpm for 15 min at 4 ◦C. The genomic DNA of *Thraustochytrium* sp. SZU445 was sequenced on the PacBio sequencing platform based on third-generation sequencing technologies. Prior to library construction, the concentration, integrity, and purity of the genomic DNA was tested. The concentration was detected using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). The purity and integrity of the sample was assessed by agarose gel electrophoresis (agarose gel concentration: 1.2%; voltage: 120 V; electrophoresis time: 50 min). To construct the library for the PacBio platform, 10 μg of genomic DNA was randomly fragmented by g-TUBE (Covaris, Woburn, MA, USA). The fragmented genomic DNA was then processed using an Agencourt AMPure XP-Medium kit (Beckman Coulter, Brea, CA, USA) to obtain fragments with a size of 10 kb. The DNA fragments generated from exonuclease ExoVII (New England Biolabs, Ipswich, MA, USA) digestion and subject to end repair were connected to the hairpin structure at both ends to form a dumbbell structure called SMRTbell. The library for the PacBio sequencing platform was successfully constructed after the purification and sorting of fragments. Qualified library fragments were sequenced after the annealing and binding polymerase steps. The genome sequencing data of *Thraustochytriidae* sp. SZU445 were uploaded to the National Center for Biotechnology Information Search database (NCBI) (BioProject ID: PRJNA579065; BioSample accession: SAMN13135800; SRA accession: PRJNA579065). The method for constructing the library using the MGI2000 platform (MGI, Shenzhen, China) is essentially the same as that described for the PacBio sequencing platform (BGI, Shenzhen, China).

#### *4.3. Fragment Assembly and Gene Annotation*

Since the raw sequencing data contained low-quality sequences, such as adaptors as well as duplicated and low-quality reads, these reads were filtered to obtain reliable subreads for assembly and guarantee the reliability of the subsequent analysis results. Genome assembly can be divided into the following four parts. (1) Subreads correction: Use Proovread (Version: 2.12; Parameter setting: -t 4 –coverage 60 –mode sr; related website: https://github.com/BioInf-Wuerzburg/proovread) to perform mixed correction on Subread to get highly reliable corrected reads; (2) corrected reads assembly: based on the corrected reads, Celera (Version: 8.3; parameter setting: doTrim\_initialQualityBased = 1,

doTrim\_finalEvidenceBased = 1, doRemoveSpurReads = 1, doRemoveChimericReads = 1, - d properties -U; related website: http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.3/) and Falcon were each used to generate separate assemblies. Finally, the optimal assembly result was selected; (3) assembly result correction: single-base correction by GATK (Version: v1.6-13; parameter settings: -cluster 2 -window 5 -stand\_call\_conf 50 -stand\_emit\_conf 10.0 -dcov 200 MQ0 ≥ 4; related website: http://www.broadinstitute.org/gatk/) for assembly using second-generation small fragment data to obtain highly reliable assembly sequences; (4) scaffold and hole filling: a scaffold was constructed using SSPACE\_Basic\_v2.0 (Version: v2.0; related website: http://www.baseclear.com/ genomics/bioinformatics/basetools/SSPACE) on the assembly results based on the second-generation Illumina large-segment data, and pbjelly2 (Version: 15.8.24; related website: https://sourceforge.net/ projects/pb-jelly/) was used to fill holes in the scaffold. After completing the genome assembly, we first constructed the gene model for subsequent gene function annotation. We chose the de novo prediction to build the gene model. The de novo prediction uses the GeneMark-ES software (Version: 4.21, parameter settings: –genemarkes, Related website: http://exon.gatech.edu/). This software is based on Hidden Markov Model and ab initio algorithm to make predictions, and it does not require reference species and reference sequences. It is self-trained. In the assembled sequence, all possible open reading frames (ORF) were predicted as CDSs by the GeneMarkes software. After the prediction was finished, the predicted ORF was translated into the amino acid sequence. Then it was compared with 18 protein databases such as Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Version: 81), Gene Ontology (GO) (Version: releases\_2017-09-08), and Swiss-Prot database (Version: release-2017-07) by BLAST to obtain corresponding functional annotation information. Since each sequence alignment result exceeds one, in order to ensure its biological significance, we retain an optimal alignment result as the annotation of the gene. The Eukaryotic Cluster of Orthologous Groups (KOG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database pathway annotations were implemented using a separate BLAST search against both the KOG (http://www.ncbi.nlm.nih.gov/COG/) and KEGG (https://www.genome.jp/kegg/) databases. The Gene Ontology (GO) and Swiss-Prot annotations were also carried out using the Swiss-Prot database (https://web.expasy.org/docs/swiss-prot\_guideline.html) and the GO database (https://www.ebi.ac.uk/ols/ontologies/go), respectively. In addition, the prediction and annotation of kinase domains in the Eukaryotic Protein Kinases and Protein Phosphatases Database (EKPD) were performed by HMMER (v.3.0) using the default parameters and the kinase database (http://ekpd.biocuckoo.org/faq.php). Moreover, for non-coding RNA prediction, the rRNA was predicted by RNAmmer software (Version: 1.2, parameter settings: –s Species –m Type –gff\*.rRNA.gff –f\*.rRNA.fq, Related website: http://www.cbs.dtu.dk/services/RNAmmer/). The tRNA region and tRNA secondary structure predicted by tRNAscan-SE software (Version: 1.3.1, parameter settings: –Spec\_tag(BAOG) –o \*. tRNA –f\*.tRNA.structure, Related website: http://gtrnadb.ucsc.edu/). SnRNAs was predicted in comparison with Rfam database (Version: 9.1, parameter settings: –p blastn –W 7 –e 1 –v 10000 –b 10000 –m 8 –i subfile –o \*.blast.m8, Related website: http://rfam.sanger.ac.uk/) by using the "covariance model" (CMS) of the Infranal software (Version: 1.1.1 (July 2014), parameter settings: default parameter, Related website: http://rfam.sanger.ac.uk/). The default parameters of Infranal software is displayed in Supplementary Materials.

#### *4.4. Phylogenetic Analysis*

The 18S rRNA was selected as the alignment sequence, and the 18S rRNA sequences of the reference strains were obtained from National Center for Biotechnology Information (NCBI). The blast function of the NCBI was used to compare the similarity of *Thraustochytriidae* sp. SZU445 and those of the reference strains. The NCBI accession numbers of the reference strains are shown in Table 6.


**Table 6.** The National Center for Biotechnology Information (NCBI) accession numbers of the reference strains for the phylogenetic analysis.

The MEGA software (Version: 7.0.26; algorithm: neighbor-joining method, the maximum composite likelihood method for the evolutionary distances; related website: https://www. megasoftware.net/) was used to construct a phylogenetic tree for *Thraustochytriidae* sp. SZU445 and the reference strains.

#### **5. Conclusions**

In summary, this research revealed the essential genome features of *Thraustochytriidae* sp. SZU445 and proposed a putative fatty acid synthesis pathway (specifically for DHA and DPA). Our purpose was to identify metabolism-related genes involved in PUFA synthesis in *Thraustochytriidae* sp. SZU445. This study is the first to provide genome information relating to *Thraustochytriidae* sp. SZU445 using a third-generation sequencing platform. By assembling contigs and scaffolds, we predicted and analyzed the GO and KEGG terms for potential genes and proteins involved in the synthesis of polyunsaturated fatty acids. These data could yield novel insights into the molecular mechanisms of *Thraustochytriidae* sp. and serve in developing innovative strategies for promoting PUFA (including DHA) production.

**Supplementary Materials:** The default parameters of Infranal software are available online at http://www.mdpi. com/1660-3397/18/2/118/s1.

**Author Contributions:** X.Z., Y.L., and S.L. (Siting Li) wrote the paper; X.Z., S.L. (Shuangfei Li) designed the concept of the experiments and analyzed the data; L.L. completed the work of strain mutation and provided the strain; C.L., B.W., C.H.K.C., and H.C. checked and revised the detail of the manuscript; X.Y., as the corresponding author, is responsible for critical reading and finalization of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by [National Key Research and Development Project] grant number [2018YFA0902500], [the Natural Science Foundation of Guangdong Province] grant number [2018A030313139], [Joint R&D Project of Shenzhen-Hong Kong Innovation] grant number [SGLH20180622152010394], [Natural Science Foundation of Shenzhen] grant number [KQJSCX20180328093806045], [Shenzhen Grant Plan for Science & Technology] grant number [CKCY2016042710211071] and [Shenzhen University Start-up Research Funding] grant number [827-000192, 2016100].

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Monogalactosyldiacylglycerol and Sulfolipid Synthesis in Microalgae**

#### **Gennaro Riccio 1, Daniele De Luca <sup>2</sup> and Chiara Lauritano 1,\***


Received: 12 April 2020; Accepted: 27 April 2020; Published: 1 May 2020

**Abstract:** Microalgae, due to their huge taxonomic and metabolic diversity, have been shown to be a valuable and eco-friendly source of bioactive natural products. The increasing number of genomic and transcriptomic data will give a great boost for the study of metabolic pathways involved in the synthesis of bioactive compounds. In this study, we analyzed the presence of the enzymes involved in the synthesis of monogalactosyldiacylglycerols (MGDGs) and sulfoquinovosyldiacylglycerols (SQDG). Both compounds have important biological properties. MGDGs present both anti-inflammatory and anti-cancer activities while SQDGs present immunostimulatory activities and inhibit the enzyme glutaminyl cyclase, which is involved in Alzheimer's disease. The Ocean Global Atlas (OGA) database and the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) were used to search MGDG synthase (MGD), UDP-sulfoquinovose synthase (SQD1), and sulfoquinovosyltransferase (SQD2) sequences along microalgal taxa. *In silico* 3D prediction analyses for the three enzymes were performed by Phyre2 server, while binding site predictions were performed by the COACH server. The analyzed enzymes are distributed across different taxa, which confirms the importance for microalgae of these two pathways for thylakoid physiology. MGD genes have been found across almost all analyzed taxa and can be separated in two different groups, similarly to terrestrial plant MGD. SQD1 and SQD2 genes are widely distributed along the analyzed taxa in a similar way to MGD genes with some exceptions. For Pinguiophyceae, Raphidophyceae, and Synurophyceae, only sequences coding for MGDG were found. On the contrary, sequences assigned to Ciliophora and Eustigmatophyceae were exclusively corresponding to SQD1 and SQD2. This study reports, for the first time, the presence/absence of these enzymes in available microalgal transcriptomes, which gives new insights on microalgal physiology and possible biotechnological applications for the production of bioactive lipids.

**Keywords:** microalgae; monogalactosyldiacylglycerol synthase; UDP-sulfoquinovose synthase; sulfoquinovosyltransferase; monogalactosyldiacylglycerols; sulfoquinovosyldiacylglycerols; transcriptome analysis

#### **1. Introduction**

Microalgae are eukaryotic photosynthetic microorganisms that are adapted to live in ecologically different habitats, which results in a wide diversity of species and natural products [1,2] with pharmaceutical [3], nutraceutical [4], and cosmeceutical [5] interests. Microalgae can be cultivated in huge quantities and this advantage overcomes the bottleneck of drug discovery from marine macro-organisms and destructive collection practices. In addition, many studies have focused on optimizing the culturing conditions in order to obtain the metabolites of interest or produce them in large amount. Regarding microalgal molecular resources, various microalgal genomes have been

sequenced [6]. However, there are still few genomes available compared to the huge number of existing microalgae [6]. This is due to the fact that microalgae, especially dinoflagellates, can have very large genomes, up to 112 Gbp [7]. On the contrary, several transcriptomes are available for microalgae [6]. Many of these have been sequenced thanks to the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), the TARA ocean, and the Global Ocean Sampling expeditions, while others are on the way. This will also help discover the enzymes involved in the synthesis of bioactive metabolites and suggest which are the most promising species for genetic engineering manipulations in order to increase the production of specific metabolites.

At the moment, there are only a few studies on metabolic pathways involved in the synthesis of bioactive compounds from microalgae [7–11]. The aim of this paper was to investigate the presence in microalgae of enzymes involved in the synthesis of specific functional lipids with huge interest for pharmaceutical, nutraceutical, and cosmeceutical applications. In particular, microalgae have been found to contain high contents of monogalactosyldiacylglycerols (MGDGs) [12]. MGDGs are compounds of biotechnological interest because they have been investigated for the potential treatment of human pathologies such as cancer (e.g., human pancreatic cancer cell lines) [13] and inflammation [14]. A purified MGDG, extracted from spinach, is a strong DNA polymerase inhibitor and it enhances chemotherapeutic efficiency in human pancreatic cancer cell lines (BxPC-3, MIAPaCa2, and PANC-1) [15]. Regarding marine examples, two MGDGs from the diatom *Phaeodactylum tricornutum* showed *in vitro* pro-apoptotic activity on immortal mouse epithelial cell lines (W2 cells) [16].

Two different MGDGs from the green microalga (Chlorophyta) *Tetraselmis chuii* [17] and four MGDGs from *Nannochloropsis granulata* (Ochrophyta, Eustigmatophyceae) [18] showed anti-inflammatory properties. These MGDGs, extracted from both *T. chuii* and *N. granulata*, were able to reduce NO production and inducible nitric oxide synthase (iNOS) protein levels in lipopolysaccharide (LPS)-stimulated RAW264.7 macrophage cells. In addition, sulfoglycolipids, which constitute the anionic fraction of MGDGs [19], also present interesting biological properties, such as glutaminyl cyclase (QC) inhibitory activity [20] and immuno-stimulatory activity [21]. Sulfolipids extracted from the green microalgae *Tetradesmus lagerheimii* (formerly *Scenedesmus acuminatus*), *Scenedesmus producto-capitatus*, *Pectinodesmus pectinatus* (formerly *Scenedesmus pectinatus*), and *Tetradesmus wisconsinensis* are able to inhibit QC [20] (an enzyme involved in Alzheimer's disease progression [22]) and, thus, have potential as lead compounds against Alzheimer's disease. Furthermore, a synthetic sulfolipid derived from *Thalassiosira weissflogii* CCMP1336 (renamed *Conticribra weissflogii*, Bacillariophyta), named β-SQDG18, is a potent vaccine adjuvant [21]. β-SQDG18 is able to trigger a more effective immune response against cancer cells to improve dendritic cell (DC) maturation, to increase CD83-positive DC, and to stimulate the production of pro-inflammatory cytokines (IL-12 and INF-γ) [21].

The aim of the present work was to investigate the presence of genes involved in MGDG and sulfoglycolipid synthesis in microalgae (Bacillariophyta, Cercozoa, Chlorophyta, Chrysophyceae, Coccolithophyceae, Cryptophyta, Dictyochophyceae, Dinophyceae, Euglenophyceae, Eustigmatophyceae, Glaucophyceae, Pavlovophyceae, Pelagophyceae, Pinguiophyceae, Raphidophyceae, Rhodophyta, Synurophyceae, and Xanthophyceae) because little information is available. Our analyses were performed considering all the microalgal transcriptomes and metatranscriptomes available from the MMETSP project and the recent Tara Oceans and Global Ocean sampling expeditions, respectively.

Galactolipids represent up to 80% of the total lipids of the plastid membranes [23]. They can contain one or two galactose (Gal) molecules, which bond to the glycerol backbone at sn-3 position, and are called, respectively, MGDGs or digalactosyldiacylglycerols (DGDGs). The reaction required for MGDG biosynthesis is catalyzed by a MGDG synthase (MGD), which transfers the Gal moiety from UDP-Gal to diacylglycerol [24] (Figure 1a). MGDG is required in plastids biogenesis and integrity and in photosynthesis [24]. In terrestrial plants, the MGDG synthase is encoded by two types of genes, namely type-A (AtMGD1) and type-B (AtMGD2 and AtMGD3) isoforms, and enzymes are well characterized [25–30]. In microalgae, there is no information available yet.

Sulfolipids are constituent of the thylakoids in plant and algal chloroplasts [31]. Sulfoquinovose, which is the building block of sulfolipids, is also the major component of the biological sulfur cycle [32] and it is produced by photosynthetic organisms at a rate of 1010 tons per year [33]. The biosynthesis of these lipids proceeds in two reactions. The first reaction is catalysed by the UDP-sulfoquinovose synthase (SQD1), that assembles the UDP-sulfoquinovose from UDP-glucose and sulphite [34]. The second reaction is catalysed by sulfoquinovosyltransferase (SQD2), which transfers the sulfoquinovose to diacylglycerol [35]. This produces sulfoquinovosyldiacylglycerol (SQDG) (Figure 1b).

Different classes of microalgal-derived compounds have been identified and several have shown specific biological activities, such as anti-cancer [13], anti-inflammatory [14], anti-tuberculosis [36], anti-epilepsy [37], anti-microbial [38], immune-regulatory [39], anti-hypertensive [40], anti-atherosclerosis [40], and anti-osteoporosis [40] activity. The systematic studies of the genes involved in the synthesis of the bioactive compounds of interest are increasing [7,9,10] and can open new perspectives for gene-editing and boost the use of microalgae as a source of new marine natural products.

**Figure 1.** Enzymes responsible for (**a**) monogalactosyldiacylglycerol (i.e., MGDG synthase) and (**b**) sulfoquinovosyldiacylglycerol (i.e., UDP-sulfoquinovose synthase or SQD1 and sulfoquinovosyltransferase or SQD2) biosynthesis.

#### **2. Results and Discussion**

*2.1. Identification and Taxonomic Assignation of MGD, SQD1, and SQD2 Homologous Sequences*

The BLASTp search conducted in the OGA database returned 1,614 annotated eukaryotic sequences putatively attributable to MGD, 1,422 to SQD1 and 1,340 to SQD2. Of these sequences, the following were unambiguously identified as the genes of interest after Blast2GO analysis: 1154 to MGDG, 817 to SQD1, and 273 to SQD2 (Supplementary Data S1–S3, respectively).

The taxonomic pattern of these sequences, excluding the ones annotated as only as "Eukaryota" and "Stramenopiles" is illustrated in Figure 2. Of the 15 taxonomic divisions considered, 13 were found in the sequences assigned to MGD, 11 to SQD1, and 7 to SQD2 (Figure 2). In this database (OGA), we found homologs for all three genes investigated in the following microalgal taxa: Bacillariophyta, Chlorophyta, Dictyochophyceae, Coccolithophyceae, Pavlovophyceae, and Pelagophyceae. In particular, homolog sequences were particularly abundant in Bacillariophyta, Coccolithophyceae, and Pelagophyceae. For Pinguiophyceae, Raphidophyceae, and Synurophyceae, only sequences coding for MGDG were found. On the contrary, sequences assigned to Ciliophora and Eustigmatophyceae were exclusively corresponding to SQD1 and SQD2. Considering that the results are based on transcriptomic data, they can suggest the absence of a specific gene in certain taxa or simply the analyzed organisms were not expressing the specific gene at the time of sampling and fixing for the RNA-sequencing.

**Figure 2.** Taxonomic characterization and composition of the analyzed genes. MGD, SQD1, and SQD2 homologs were retrieved from Tara Oceans meta-transcriptomes. Graphical view of taxonomic composition for each gene is reported.

For MMETSP transcriptomes, we retrieved 313 sequences for MGD, 84 for SQD1, and 266 for SQD2 after performing a blastp search. After functional annotation (and removal of sequences considered not valid), 267 sequences were left for MGD, 80 for SQD1, and 121 for SQD2 (Supplementary Data S4–S6, respectively).

We reported in Table 1 the species and the strains in which MGD, SQD1, and SQD2 were found. The analyzed genes were found in 96 different species or strains across 14 different taxonomic categories. All the three genes were found in the reported taxa, with the exception of the two species of Chrysophyceae (*Dinobryon* sp. UTEXLB2267 and *Paraphysomonas imperforata* PA2, Supplementary Data S7). In addition, considering that the results are based on transcriptomic data, they can be interpreted as either the absence of that specific gene or lack of its expression in that particular condition.


**Table 1.** Species and strains from MMETSP transcriptomes in which MGD, SQD1, and SQD2 homologs were found. Colors refer to taxonomic ranks as in Figure 2. P = presence of validated gene homologs.


**Table 1.** *Cont*.

#### *2.2. Final Dataset and Phylogenetic Inferences*

The final alignments consisted of 649 sequences for MGD, 521 for SQD1, and 244 for SQD2 (Supplementary Data S8–S10, respectively), from both the OGA and MMETSP database. The length of each dataset after trimming of poorly aligned regions was as follows: 377 bp for MGD, 198 bp for SQD1, and 295 bp for SQD2.

The MGD phylogenetic tree showed that most of the taxa here investigated contained paralog copies (homolog copies resulting from duplication events) of monogalactosyldiacylglycerol synthase gene (Figure 3). These paralogs generally occurred in two copies, resulting in two distinct and highly supported clades for most taxa (red circles, Figure 3, Supplementary File S1). A few taxa (e.g., Coccolithophyceae, Cryptophyta, and Xantophyceae) presented only one MGD copy (and, therefore, a single group of sequences), while others (e.g., Dinophyceae) presented several MGD paralogs (Figure 3 or Supplementary File S1).

**Figure 3.** MGD unrooted phylogenetic tree. Colored circles at the base of each node refer to branch support after aLRT SH-like test. Colors of taxa refer to taxonomic groups as in Figure 2 and Table 1.

For the SQD1 gene, the phylogenetic tree showed that all the sequences belonging to the same taxon formed a highly supported monophyletic group in most species (Figure 4, Supplementary File S2). Among the exceptions, there are the diatoms (Bacillariphyta) where, beside a clade containing the most diatom sequences retrieved from OGA and MMETSP, there are a few others interspersed across the tree (Figure 4). One of these only contains sequences of *Thalassiosira rotula* (synonym of *T. gravida*) strain CCMP1093 from MMETSP. The others contain small groups of sequences from OGA (Figure 4, Supplementary File S2). Similarly, SQD1 dinoflagellates (Dinophyceae) homologs do not form a monophyletic group in the tree. This could be due to the high complexity of the plastid evolution in Dinophyceae. Dinophyceae acquired plastid with four different endosymbiotic events. [41]. In a secondary endosymbiotic event, Dinophyceae acquired plastid by serial endosymbiosis with green microalgae (Chlorophyta) [42]. In the tertiary endosymbiotic event, Dinophyceae acquired tertiary plastids by endosymbiosis with Cryptophyceae-Cryptomonads, Haptophyta, and Bacillariphyta [41].

**Figure 4.** SQD1 unrooted the phylogenetic tree. Colored circles at the base of each node refer to branch support after an aLRT SH-like test. Colors of taxa refer to taxonomic groups in Figure 2 and Table 1.

For the SQD2 gene, all the sequences from the same taxonomic group formed monophyletic groups except for the Dinophyceae (Figure 5, Supplementary File S3). All the dinoflagellate sequences from OGA and MMETSP were interspersed across the tree with high support, close to Bacillariophyta, Chlorophyta, Dictyocophyceae, and Coccolithophyceae.

**Figure 5.** SQD2 unrooted phylogenetic tree. Colored circles at the base of each node refer to branch support after the aLRT SH-like test. Colors of taxa refer to taxonomic groups in Figure 2 and Table 1.

#### *2.3. Structural Details of MDGs, SQD1, and SQD2 from Thalassiosira Weissflogii CCMP1336 (Conticriba weissflogii)*

We built *in silico* models for MGDs, SQD1, and SQD2 proteins from the diatom *T. weissflogii* CCMP1336 (synonym of *Conticribra weissflogii*). *T. weissflogii* (*C. weissflogii*) CCMP1336 has been selected because it is known to have immunostimulatory activity, and both galactolipids and sulfolipids have been found to play a key role in this bioactivity [39,43]. 3D *in silico* models (Figure 6) were generated using the amino acid sequences of MGD (CAMPEP\_0193073380, CAMPEP\_0193064960 and CAMPEP\_0193062160) (Figure 6a–c), SQD1 (CAMPEP\_0193062736) (Figure 6d), and SQD2 (CAMPEP\_0193058822) (Figure 6e), obtained from the MMETSP database. Phyre2 results were summarized in Table 2. The analyses pointed out the homology between *T. weissflogii* (*C. weissflogii*) and *Arabidopsis thaliana* proteins. Structural prediction of MGDs (CAMPEP\_0193073380, Supplementary Data S11, CAMPEP\_0193064960 Supplementary Data S12, and CAMPEP\_0193062160, Supplementary Data S13) found high similarity with the 3D structure of MGD1 from *A. thaliana*. The percentage of identity between *T. weissflogii* (*C. weissflogii*) and *A. thaliana* MGD1 is higher than 40%. Similar results were obtained for structural prediction of SQD1 from *T. weissflogii* (*C. weissflogii*) (Supplementary Data S14) with 45% identity with the 3D structure of SQD1 from *A. thaliana*.

**Figure 6.** *In silico* model generated by Phyre2 for: (**a**) MGD CAMPEP\_0193073380, (**b**) MGD CAMPEP\_0193064960, (**c**) MGD CAMPEP\_0193062160, (**d**) SQD1 CAMPEP\_0193062736, and (**e**) SQD2 CAMPEP\_0193058822.

**Table 2.** Report of Phyre2 analysis. We report the template (protein of known structure used for the prediction analysis) and its protein data bank (PDB) code, confidence (probability that the sequence and template are homologous), and percent id (percent of identity).


Structural prediction for SQD2 (Supplementary Data S15) found a similarity with the 3D structure of sucrose synthase-1 (SUS1) from *A. thaliana*. The percentage of identity between SQD2 and SUS1 is 18%, which is lower than the percent of identity of SQD1 and MDGs. SUS1 from *A. thaliana* catalyzed a reversible sucrose synthesis. It transfers glucose moiety from UDP to fructose and it has been found to form a complex with both UDP–glucose and UDP–fructose [44]. These findings may indicate the presence of a low specificity ligand-binding site in the SQD2 enzyme.

In order to evaluate whether predicted structures preserved the functional active sites of MGDs, SQD1, and SQD2, they were also analyzed by using the COACH server. MGD analysis was performed for all the structures obtained from Phyre2. The analysis pointed out the presence of a UDP binding pocket for the three MGDs. MGD CAMPEP\_0193073380 (Figure 7a) binding site prediction had a

C-score 0.32 and the presence of 17 residues in the binding site (i.e., H69, A71, R263, G312, G313, V350, G352, F379, V380, M383, Y396, G398, P399, G400, T401, E40, and E422). MGD CAMPEP\_0193064960 (Figure 7b) binding site prediction had a low C-score value (0.18) and 17 amino acid residues were involved in the binding site (i.e., H89, A91, R260, G305, G306, V335, G337, F528, V529, M532, K545, G547, P548, G549, T550, E553, and E571). MGD CAMPEP\_0193062160 (Figure 7c) binding site prediction had a low C-score (0.18) and 17 residues involved (i.e. H262, A264, R434, G477, G478, V507, G509, F586, V587, M590, K603, G605, P606, G607, T608, E611, and E629). COACH prediction analysis highlighted the same 17 amino acid residues involved in ligand-binding site of the three MDGs. Moreover, clustal omega analysis including MGD1 from *A. thaliana* indicated that these 17 residues were located in highly conserved domains (Figure 7d) and corresponded to the active site of MGD1 from *A. thaliana* [45] with the exception of the alanine residue.

**Figure 7.** Protein-ligand binding site prediction by the COACH server for MDGs from *Thalassiosira* (*Conticribra*) *weissflogi* CCMP1336. Prediction of binding sites for the three MGDs: (**a**) MGD CAMPEP\_0193073380–UDP complex, (**b**) MGD CAMPEP\_0193064960–UDP complex, and (**c**) MGD CAMPEP\_0193062160–UDP complex. (**d**) Among the Clustal Omega result of MDG amino acid sequences, the red boxes indicated the conserved amino acid residues involved in binding-pockets.

SQD1 analysis indicated the presence of a NAD-binding pocket (Figure 8a) with a C-score of 0.85. The following residues were reported to be involved in the NAD-binding pocket structure: G74, D76, G77, F78, C79, D98, N99, S101, R102, L140, D141, V142, F164, A165, E166, R168, A169, N186, L210, G211, T212, Y252, K256, Q279, G280, I281, V282, and Y306. The presence of a UDP–6–sulfoquinovose (USQ) binding pocket (Figure 8b) was predicted with a C-score of 0.27. The residues involved in USQ-binding pocket were R168, A170, T212, M213, G214, Y252, H253, Q279, G280, I281, T308, V309, R312, T324, Y326, Q331, R333, V371, R393, and E395. Most of them were involved in USQ binding (Figure 8c). The ligand binding pocket structure prediction, such as the predicted structure, was in line with the structure and function of SQD1 from *A. thaliana* [34,46].

**Figure 8.** Protein-ligand binding site prediction by COACH server of SQD1 from *Thalassiosira* (*Conticribra*) *weissflogii* CCMP1336. (**a**) Prediction of binging site of the complex SQD1–NAD. (**b**) Prediction of binding site of the complex SQD1–USQ. (**c**) Structure of USQ binding-pocket and the specific interaction between USQ and highlighted amino acid residues of SQD1.

SQD2 analysis indicated the presence of a UDP-biding pocket (Figure 9a) (C-score 0.28), which involved the residues S82, G83, N86, F256, V282, G283, R284, K289, E309, Q332, L333, L338, E355, G358, F359, V360, and E363. Moreover, COACH analyses pointed out the presence of a N-Acetylglucosamine (NAG)–binding pocket (Figure 9b) (C-score 0.33) that involved the residues G83, Y84, R87, H190, T191, K249, S354, E355, T356, L357, and G358. These data confirmed the presence of a low specificity ligand binding pocket, and suggest that SQD2 from diatoms could be involved in the synthesis of other diacylglycerols such as glucuronosyldiacylglycerol (GlcADG) in *Arabidopsis* [47] or flavonoid glycosylation in *Oryza sativa* [48,49].

**Figure 9.** Protein-ligand binding site prediction by COACH server of SQD2 from *Thalassiosira* (*Conticribra*) *weissflogii* CCMP1336. (**a**) Prediction of binding site of the complex SQD2–UDP. (**b**) prediction of binding site of the complex SQD2–NAG.

#### **3. Materials and Methods**

#### *3.1. Identification of MGD, SQD1, and SQD2 Homologous Sequences*

Since our primary interest was to ascertain the occurrence of MGD, SQD1, and SQD2 in microalgae, we used the sequences of MGD (accession number XP002181685), SQD1 (accession number XP002185968), and SQD2 (accession number XP002185276) from the diatom *Phaeodactylum tricornutum* as queries for a BLAST [50] search against the Ocean Global Atlas (OGA, [51]) database (http://tara-oceans.mio.osupytheas.fr/oceangene-atlas/), and several protist transcriptomes from the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP, [52]) available at https://zenodo.org/record/12125852585. The OGA database contains a collection of more than 116 million eukaryote and 40 million prokaryotic genes gathered during the *Tara* Oceans [53,54] and the Global Ocean Sampling [55] expeditions. Instead, MMETSP contains the transcriptomes of some of the most abundant and ecologically significant microbial eukaryotes in the oceans. To retrieve homologs, we used the blastp algorithm against the metagenome/metatranscriptome and transcriptome databases contained in OGA and the MMETSP, setting the expect threshold to 1E-10. For OGA, after ascertaining that the sequences obtained from metagenomes and metatranscriptomes were identical, we only used one dataset (metatranscriptomes) for further analyses. Since a sequence-based homology search could have recovered different genes with similar functions, we used Blast2GO [56] to obtain a functional annotation of the homologs retrieved. We used the default settings (i.e., blastx program, using the nr BLAST database and with a BLAST expectation value of 1.0E-3) for the analysis. We considered valid Blast2GO annotations containing the following names: 1,2-diacylglycerol 3-beta-galactosyltransferase and monogalactosyldiacylglycerol synthase for MGD, sulfolipid biosynthesis protein, sulfoquinovosyldiacylglycerol synthesis protein, UDP sulfoquinovose synthase and uridine 5 -diphosphate-sulfoquinovose synthase for SQD1, sulfoquinovosyl transferase SQD2, sulfoquinovosyldiacylglycerol 2, and UDP-sulfoquinovose: DAG sulfoquinovosyltransferase for SQD2. All the sequences identified as only "predicted or hypothetical protein" without specification were *a priori* discarded.

#### *3.2. Taxonomic Overview*

All the sequences from OGA that passed the Blast2GO analysis with the criterions specified above were annotated using the annotation file generated during homolog retrieval. We removed all the sequences without a taxonomic annotation, annotated only as "Eukaryota" and, whenever applicable, with a generic annotation that did not allow to discriminate such sequences from others of lower taxonomic rank (e.g., "Stramenopiles"). All the taxa in which the MGD, SQD1, and SQD2 genes were found and were organized into a table. The abundance of such sequences were plotted as histograms using the R [57] working packages *scales* [58] and *ggplot2* [59]. For the homologs retrieved from the MMETSP transcriptomes, we generated a table illustrating the species surveyed and the genes of interest that were found in each of them.

#### *3.3. Sequence Alignment and Phylogenetic Inference*

The sequences of each gene from OGA and MMETSP that passed Blast2GO annotation and with a length of at least 200 aa were aligned using COBALT [60] available at https://www.ncbi.nlm.nih.gov/ tools/cobalt/). Unlike other common software used for protein alignments (e.g., ClustalW, MAFFT, MUSCLE, ProbCons) that only use sequence information, COBALT also integrates the information of protein-motif regular expressions (PROSITE database) and of conserved protein domains (NCBI CDD database). In doing so, COBALT has a better chance of producing a biologically meaningful multiple alignment compared to tools that do not utilize this information 60]. Poorly aligned regions from each alignment were removed with trimAl v1.2 [61] in order to increase the quality of subsequent phylogenetic analyses. We used the *automated1* option to find the most appropriate mode to trim the alignments (use of gaps or similarity scores) depending on the alignment characteristics.

In order to infer reliable phylogenetic trees, all the sequences that, after trimming, were shorter than one quarter (around 60–70 aa) of the final length of the alignment were removed, unless longer sequences were not available for that particular taxon. Maximum likelihood phylogenetic trees were inferred using PhyML [62] using the LG substitution model [63], which turned out to be the best evolution model for the three genes investigated according to the Akaike Information Criterion implemented in SMS [64]. Support to nodes was calculated using the Shimodaira-Hasegawa-like (aLRT SH-like) procedure [65]. Trees were visualised and graphically edited in FigTree v1.4.3 (http://tree.bio.ed.ac.uk/software/figtree/).

#### *3.4. In Silico Protein Model*

The three-dimensional (3D) *in silico* models of MGD, SQD1, and SQD2 proteins from *Thalassiosira weissflogii* CCMP1336 (currently regarded as synonym of *Conticribra weissflogii*, Bacillariophyta) were generated using Phyre2 server (http://www.sbg.bio.ic.ac.uk/~{}phyre2/html/page.cgi?id=index) [66]. Protein ligand-binding site predictions were performed using COACH analyses from the Zhang Lab server. COACH generated ligand-binding site predictions and a confidence score (C-score) of the prediction. The C-score ranged from 0 to 1. A higher score indicates a reliable prediction (https://zhanglab.ccmb.med.umich.edu/COACH/) [67,68]. Pictures were obtained using CCP4MG, version 2.10.11 obtained by http://www.ccp4.ac.uk/MG/download/ [69].

#### **4. Conclusions**

This study analyzes the presence of MGD, SQD1, and SQD2 in microalgae and gives a broad overview of their presence among different microalgal classes. It has been shown that, within each taxa, some species do not express all the sequences, which suggests the absence of a specific gene. Despite the fact that several species have been found to express MGDs, SQD1, and SQD2 (Table 1), only a relatively small number of active microalgae have been studied and reported to possess anti-inflammatory/immunomodulatory activities, including the diatoms *Attheya longicornis*, *Cylindrotheca closterium*, *Trieres mobiliensis* (formerly *Odontella mobiliensis*), *Phaeodactylum tricornutum*, *Porosira glacialis*, *Pseudo-nitzschia pseudodelicatissima,* and *Thalassiosira* (*Conticribra*) *weissflogii*, and the flagellates *Amphidinium carterae* (Dinophyceae), *Edaphochlamys debaryana* (formerly *Chlamydomonas debaryana*), *Chloroidium saccharophilum* (formerly *Chlorella ovalis*), *Dunaliella salina* (formerly *Dunaliella bardawil*) (Chlorophyceae), *Nannochloropsis granulata*, *Nannochloropsis oculata* (Eustigmatophyceae), *Diacronema lutheri* (formerly *Pavlova lutheri*, Pavlovophyceae), *Tetraselmis chuii,* and *Tetraselmis suecica* (Chlorophyta, Chlorodendrophyceae) [39]. Our study can direct and accelerate the discovery of new bioactive species and give new insights on enzyme discovery from microalgae with biotechnological applications.

3D *in silico* prediction analyses indicated that the enzymes, maintaining conserved domains, could be effectively involved in the synthesis of compounds with known anticancer and immune-modulatory activities, such as MGDGs and SQDGs. This approach can give preliminary information for the selection of specific microalgal species for drug discovery and for genetic engineering approaches in order to produce huge amounts of bioactive compounds of pharmaceutical interest.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1660-3397/18/5/237/s1. **Data S1**: Blast2GO analysis of MGD genes. BLASTp search conducted in the OGA database highlighted 1614 annotated eukaryotic sequences putatively attributable to MGD. Of these sequences, 1154 were identified as the gene of interest. **Data S2**: Blast2GO analysis of SQD1 genes. BLASTp search conducted in the OGA database highlighted 1422 annotated eukaryotic sequences putatively attributable to SQD1. Of these sequences, 817 were identified as the gene as the gene of interest. **Data S3**: Blast2GO analysis of SQD2 genes. BLASTp search conducted in the OGA database highlighted 1340 annotated eukaryotic sequences putatively attributable to SQD2. Of these sequences, 273 were identified as the gene of interest. **Data S4**: MGD amino acid sequences from MMETSP transcriptomes. After functional annotation, 267 sequences were left for MGD. **Data S5**: SQD1 amino acid sequences from MMETSP transcriptomes. After functional annotation, 84 sequences were left for MGD. **Data S6**: SQD2 amino acid sequences from MMETSP transcriptomes. After functional annotation, 121 sequences were left for SQD2. **Data S7:** Original data for Table 1. All the species and the strains were analyzed from MMETSP. **Data S8**: Alignments for MGD amino acid sequences. The alignments were performed using 649 amino acid sequences for MGD. **Data S9**: Alignments for SQD1 amino acid sequences. The alignments were performed using 521 amino acid sequences for SQD1. **Data S10:** Alignments for SQD2 amino acid sequences. The alignments were performed using 295 amino acid sequences for SQD2. **Data S11**: PDB file of MGD CAMPEP\_0193073380 predicted structure. *In silico* model generated by Phyre2 for MGD CAMPEP\_0193073380. **Data S12**: PDB file of MGD CAMPEP\_0193064960 predicted structure. *In silico* model generated by Phyre2 for MGD CAMPEP\_0193064960. **Data S13:** PDB file of MGD CAMPEP\_0193062160 predicted structure. *In silico* model generated by Phyre2 for MGD CAMPEP\_0193062160. **Data S14:** PDB file of SQD1 CAMPEP\_0193062736 predicted structure. *In silico* model generated by Phyre2 for SQD1 CAMPEP\_0193062736. Additional file 15: Data S15: PDB File of SQD2 CAMPEP\_0193058822 predicted structure. *In silico* model generated by Phyre2 for SQD2 CAMPEP\_0193058822. **File S1**: PDF file of MGD unrooted phylogenetic tree. Colored circles at the base of each node refer to branch support after aLRT SH-like test. Colors of taxa refer to taxonomic groups as in Table 1. **File S2**: PDF file of SQD1 unrooted phylogenetic tree. Colored circles at the base of each node refer to branch support after aLRT SH-like test. Colors of taxa refer to taxonomic groups in Table 1. **File S3**: PDF file of SQD2 unrooted phylogenetic tree. Colored circles at the base of each node refer to branch support after the aLRT SH-like test. Colors of taxa refer to taxonomic groups as in Table 1.

**Author Contributions:** Conceptualization, G.R. and C.L. Methodology and formal analysis, G.R., D.D.L., and C.L. Supervision, C.L., G.R., D.D.L., and C.L. co-wrote the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** The "Antitumor Drugs and Vaccines from the Sea (ADViSE)" project (PG/2018/0494374) funded this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Marine Drugs* Editorial Office E-mail: marinedrugs@mdpi.com www.mdpi.com/journal/marinedrugs

MDPI St. Alban-Anlage 66 4052 Basel Switzerland

Tel: +41 61 683 77 34 Fax: +41 61 302 89 18

www.mdpi.com ISBN 978-3-0365-0255-7