**1. Introduction**

Intensive agriculture entails the extensive use of pesticides to increase yields. One commonly used insecticide has been γ-hexachlorocyclohexane (γ-HCH, also known as lindane or γ-BHC)—a potential carcinogen that is highly toxic to humans, causing neurological diseases, and endocrine disruption [1]. This hydrophobic substance was classified as a persistent organic pollutant [2] and banned in 2009 under the Stockholm Convention. Although the natural degradation of lindane takes a long time, there are known microorganisms that actively degrade it, and can significantly increase the rate of bioremediation [3,4]. Most of the γ-HCH degrading microorganisms belong to *Sphingomonadaceae* (Alphaproteobacteria phylum)—mainly *Sphingobium* sp., *Novosphingobium* sp. and *Sphingomonas* sp. Data showing the γ-HCH degradation activity for the representatives belonging to the *Sphingopyxis* genus has not been published yet, despite the fact that three strains, *Sphingopyxis flava* R11H, *Sphingopyxis indica* DS15 and *Sphingopyxis terrae* subsp. *ummariensis* DSM 24316, were isolated from the environment contaminated with this pesticide [5–8].

The main γ-HCH degradation pathway was described in detail for *Sphingobium japonicum* UT26S and *Sphingobium indicum* B90A [9–11]. These two microorganisms contain the *lin* genes encoding a set of enzymes responsible for lindane conversion to β-ketoadipate, which is further degraded and utilized in other metabolic pathways. The first and key enzyme from the catabolic pathway is dehydrochlorinase LinA [12]. It catalyzes the transformation of γ-HCH to 1,3,4,6-tetrachloro-1,4-cyclohexadiene (1,4-TCDN) via γ-pentachlorocyclohexane ( γ-PCCH) by sequential removal of two chloride atoms [13]. This protein is unique among other known dehydrochlorinases as does not require any cofactors [14,15]. LinA is 154 amino acids long and acts as a trimer, where each protomer forms a cone-shaped α + β barrel fold and D25, H73, and R129 residues were shown to be essential for its activity [16,17]. Subsequently, two chloride atoms are removed from 1,4-TCDN by LinB dehalogenase [18] that converts 1,4-TCDN into 2,5-dichloro-2,5-cyclohaxadiene-1,4-diol (2,5-DDOL). LinB requires only H2O as a cofactor, and unlike LinA acts on a broad range of halogenated substrates—also on a very stable β-HCH isomer [19]. The last enzyme from upper degradation pathway is LinC [20] that converts 2,5-DDOL to 2,5-dichlorohydroquinone (2,5-DCHQ). LinC belongs to the short-chain alcohol dehydrogenase family and requires NAD+ cofactor. Further reactions from downstream degradation pathway involve further removal of chloride atoms from 2,5-DCHQ to hydroquinone (HQ) by reductive dechlorinase LinD [21] and next ring cleavage by ring cleavage oxygenase LinE [22]. It is also known that 2,5-DCHQ can be transformed directly by LinE or LinEb (enzyme identical to LinE in 53% of amino acids) to maleylacetate (MA) through a ye<sup>t</sup> unknown unstable product [10,11]. Next, MA is converted by maleylacetate reductase LinF, to β-ketoadipate and later LinG; LinH and LinJ proteins (acyl-CoA transferase and thiolase) transform β-ketoadipate to succinyl-CoA and acetyl-CoA [11].

Here, we present a detailed genomic analysis of the γ-HCH degrader—*Sphingopyxis lindanitolerans* WS5A3p [23]. As described previously, WS5A3p was selectively enriched using the γ-HCH as a sole carbon source. The genome of this new microorganism was sequenced and analyzed, leading to the identification of 13 *lin* genes involved in the γ-HCH degradation pathway. Additionally, we provide insights into *Sphingopyxis* based on the analysis of 44 genomes, with a description of the core pangenome and the unique proteins for each representative.

#### **2. Materials and Methods**

#### *2.1. Genome Sequencing and Assembly*

For genomic DNA isolation, all steps were performed as described previously [23]. Two libraries were prepared for genome sequencing: (1) a paired-end library with an insert size of 500 bp, (2) a Nextera ® Mate Pair library (Illumina Inc., San Diego, CA, USA) with an average insert size of 8 kbp. Sequencing was performed on the Illumina MiSeq platform (Illumina Inc.) with a 300 bp read length resulting in 311,675 and 1,376,608 raw paired reads respectively. Next, a number of bioinformatics tools were used to analyze the raw reads. Adapters from the raw reads were trimmed using Cutadapt [24], and quality filtered with Sickle [25] (quality at least Q30). The mate-paired libraries were processed using NxTrim script v0.4.0 provided by the manufacturer. Assembly was performed using the SPAdes 3.9.1 software [26]. The assembly was manually checked by mapping sequencing reads to obtained contigs using Genious 6.1.6 [27]. This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession PHFW00000000. The version described in this paper is version PHFW01000000. The genome was annotated using the NCBI Prokaryotic Genome Annotation Pipeline [28]. Genes with signal peptides were identified with SignalP 4.1 [29]. Association of genes and proteins to COG categories was done using COGNIZER software [30] and MicroScope platform [31]. Analysis of secondary metabolite biosynthesis gene clusters was done with antiSMASH 5.0 (https://antismash.secondarymetabolites.org) [32]; prophage sequences in WS5A3p genome were identified with PHASTER (http://phaster.ca/) [33]; and CRISPRCasFinder software (http: //crispr.i2bc.paris-saclay.fr) [34] with default parameters was used to identify CRISPR-related sequences.

#### *2.2. Average Nucleotide Identity Based on BLAST (ANIb) and GGDC Analyses*

The similarity of the sequenced genome of *S. lindanitolerans* WS5A3p to other public genomes of closely related *Sphingopyxis* species [8,23,35–44] was determined based on the average nucleotide identity using both BLASTn (ANIb) and MUMer (ANIm) algorithms with the help of the pyani Python module [45] and average amino acid identity (AAI), which was calculated using CompareM (available at: https://github.com/dparks1134/CompareM).

## *2.3. Pangenomic Analyses*

For pangenomic analyses, nucleotide sequences of *Sphingopyxis* sp. genomes were downloaded from the NCBI database on December 2017. Due to low quality and completeness, genomes retrieved from metagenomes and single-cell sequencing were excluded from the analysis. To unify annotation, open reading frames (ORFs) were predicted for all genomes using Prokka [46].

Phylogenetic analysis of *Sphingopyxis* genomes was performed using Phylophlan [47], and ggtree R package [48] was used to display the results. Phylophlan software identifies 400 ubiquitous genes based on their amino acid content, selects the most discriminative positions in each sequence and concatenates them into a single long sequence which is used for ML tree construction by RAxML [49]. From the Phylophlan generated protein alignment we ran RAxML to generate 500 bootstrap replicates (ML search with PMB substitution matrix, as the best-scoring AA model).

For pangenomic analysis of amino acid sequences obtained from annotated ORFs, all genomes were merged into one set. Next, the clustering process based on sequence identity was performed using CD-HIT [50]. Parameters for CD-HIT were as follows: s parameter (length di fference cuto ff) = 0.9, and c parameter (sequence identity threshold) = 0.8. Using in-house Python and R scripts processing CD-HIT output (scripts collected into Pangenome Analysis Pipeline (PAPi) available at https://github.com/michkam89/PAPi), we identified clusters that had protein representatives present in all genomes (core pangenome) but also clusters specific for each genome included in the analysis (unique clusters). For the unique clusters of each particular genome, we performed additional blastp alignment against other *Sphingopyxis* genomes to exclude false-positive sequences.

*Sphingopyxis* amino acid sequences obtained from annotated ORFs of the analyzed dataset were also searched for the presence of proteins responsible for aromatic compound degradation. For this purpose, we performed a blastp search of all ORFs against in-house protein database developed by Kato and colleagues [51] filtering reads with parameters: e-value <1 × 10−<sup>5</sup> and alignment percent identity >50%. This database covers the metabolic paths of aerobic and anaerobic degradation routes of 3CB, phenanthrene, biphenyl, and carbazole (phenanthrene-phthalate-protocatechuate route, biphenyl-benzoate route, carbazole-anthranilate route, naphthalene-salicylate route, catechol 1,2- and 2,3-dioxygenation route, gentisate route, anaerobic polycyclic aromatic hydrocarbon (naphthalene) route, and anaerobic benzoate and toluene route) [51].

Alignment of protein sequences for analyzed Lin proteins was performed using MUSCLE [52], phylogenetic relationship and alignment visualization was done in Geneious [27].

#### **3. Results and Discussion**

The WS5 soil sample, from which the WS5A3p strain was isolated, had been contaminated with a high concentration of organochlorine pesticides including γ-HCH (lindane), dichlorodiphenyltrichloroethane (DDT) and methoxychlor. WS5A3p was isolated through an 8-week enrichment under constant γ-HCH pressure at 5 mg/<sup>L</sup> [23]. Its activity was confirmed by the ability to create clear zones on γ-HCH agar plates. Moreover, the decrease of γ-HCH was observed in a liquid medium assay where the presence of small amounts of 2,5-dichlorophenol was identified (Figure S1).

The WS5A3p strain was cultured in LB medium and genomic DNA was isolated and sequenced on the Illumina MiSeq platform (see Materials and Methods section) with genome coverage estimated at 112× [23]. As a result of genomic sequencing and assembly, three DNA sca ffolds were obtained. General WS5A3p genome statistics are presented in Table S1. The size of the assembled genome sequence is 4,372,786 bp, and contains 65.1% G + C and is 90.4% coding. The assembly consists of one circular sca ffold, representing a single chromosome of 4.14 Mbp with three unsequenced gaps of the total estimated length of 305 nucleotides. Two remaining sca ffolds of length 181,517 bp and

42,040 bp having separate partition and transfer modules were classified as plasmids. There are a total of 4184 predicted genes, from which 4,133 are protein-coding and 75.1% have assigned putative function. The WS5A3p genome has 51 RNA genes (three rRNA as one ribosomal operon, 45 tRNA and three ncRNA). Observed substitution of lysine 88 by arginine in S12 protein of the 30S ribosomal subunit (RpsL) suggests natural resistance of this strain to streptomycin, what is considered as a feature of *Sphingomonadaceae* and *Erythrobacteraceae* [43]. The functional categorization of genes into COGs (clusters of orthologous groups) for WS5A3p is presented in Table 1. Further analysis of WS5A3p genome in a context of phage-related sequences using PHASTER [33], revealed four prophage regions: one intact region (score = 110) according to the tool's completeness core, one questionable (score = 90) and two incomplete regions (score = 60 and 50). The intact region (16.1 kb in length) localized on chromosome sequence shows the highest similarity to genome fragment of *Sphingopyxis granuli* TFA [43]. No CRISPR-related sequences were identified in WS5A3p genome using CRISPRCasFinder application [34]. Like other *Sphingopyxis* genomes, this one contains ectoine synthesis genes *ectABCD*, Aspartate kinase Ask\_Ect and Ectoine/proline transporter ProP [43]. Secondary metabolite biosynthesis gene clusters analysis using antiSMASH 5.0 also showed the presence of regions potentially involved in the biosynthesis of carotenoids (one region with 75% similarity to astaxanthin dideoxyglycoside cluster) or natural antimicrobials such as β-lactones (one region), lenthipeptides (one region), lasso peptides (three regions), which may give advantage to *S. lindanitolerans* in an oligotrophic environment.



We compared the WS5A3p genome to 46 other *Sphingopyxis* genomes deposited in the NCBI database. Phylogenetic placement based on 400 highly conserved protein sequences showed that WS5A3p is placed on a distinct branch on the phylogenetic tree with 100% bootstrap support (Figure 1A). We also saw that *Sphingopyxis baekryungensis* DSM 16222 and LPB0140 isolate, previously assigned to *Sphingopyxis,* clustered with outgroup genomes analyzed in the dataset (containing representatives of *Sphingobium, Novosphingobium,* and *Sphingorhabdus*). To extend these findings we also performed genomic analysis of whole DNA sequences (average nucleotide identity based on Mummer—ANIm, average nucleotide identity based on BLAST—ANIb) and amino acid content of all protein sequences (average amino acid Identity—AAI). These tests showed that WS5A3p is unique among available genomes. We also identified that values for MWB1 isolate in all of the three tests were outlying the most from other species, therefore we decided to treat this isolate also as an outgroup. Average values (excluding outgroup and LPB0140, MWB1 and DSM 16222 genomes) for similarity among *Sphingopyxis* genomes to WS5A3p based on ANIb, ANIm and AAI were 82.6 ± 0.98%, 85.8 ± 0.63%, and 81.8 ± 1.35%, respectively. The ranges were 80.1–83.7% for ANIb, 84.5–86.8% for ANIm and 79.7–83.8% for AAI. Results of the tests mentioned above are illustrated in Figure 1B. All of the analyses distinguished two groups of genomes highly similar to each other. Group H1 consisting of isolates: H005, H012, H038, H053, H077, H080, H085, H093 and group H2 consisting of isolates: H057, H067, H071, H073, H081, H100, H107 that are genomic clones with almost identical sequences. This result provides additional evidence that exclusion of *S. baekryungensis* DSM 16222 from *Sphingopyxis* genus and reclassification of isolates LPB0140 and MWB1 should be considered in the near future.

**Figure 1.** Phylogenetic analyses of *Sphingopyxis* genomes. ( **A**) Phylogenetic relationship of *Sphingopyxis* genomes based on a maximum likelihood analysis of 400 conserved protein sequences. Branches are colored according to bootstrap values, which are presented on the nodes. The bar represents 0.8 substitutions per site. The WS5A3p branch is indicated with a blue diamond. The blue shaded box contains outgroup genomes included in the analysis. (**B**) Sequence similarity of analyzed *Sphingopyxis* genomes comparing to WS5A3p strain based on ANIb (blue triangle), ANIm (red diamond) and AAI (green circle).

To analyze the uniqueness of WS5A3p genome, we first performed the pangenomic comparison of protein sequences present in all *Sphingopyxis* representatives. From these tests, we excluded genomes of LB0140, MWB1 and DSM 16222, as based on the previous phylogenetic analysis they clustered closely to outgroup genomes (LB0140, DSM 16222) and had AAI values below 80% (LB0140, DSM 16222 and MWB1). Table S2 presents general information about all genomes included in the analysis. Representatives had been isolated from diverse water environments (drinking water, seawater, ground, lake, and cave water) and soils (mainly contaminated with organic pollutants) as well as anthropogenic

sites like mines or wastewater treatment plants. Isolates derived from geographically distant regions in Europe, Asia, North America and one from Australia. The median size of the *Sphingopyxis* genome is 4.3 Mbp and ranged from 3.4 Mbp for *Sphingopyxis alaskensis* RB2256 and 5.7 Mbp for *Sphingopyxis macrogoltabida* 203. The average *Sphingopyxis* genome contains 4236 genes with a minimal number of 3287 and maximal of 5572 for *S. alaskensis* RB2256 and *S. macrogoltabida* 203, respectively.

Pangenomic analyses usually focus on pathogenic bacteria at the species level [53]. Few studies have described generic bacterial pangenomes beyond *Streptococcus* [54] or *Bifidobacterium* [55]. The first attempt to describe the *Sphingopyxis* pangenome was performed by Garcia-Romero [33] and Parthasarathy [56]. They analyzed seven *Sphingopyxis* genomes and established the core pangenome size at 1371 gene families or 1515 single copy orthologues. In our pangenomic analysis, we grouped proteins into clusters based on their amino acid sequence, di fferentiating 734 core clusters present in all of the analyzed genomes, which represent less than 20% of the average *Sphingopyxis* genome. Compared to the previous reports, our dataset increased from seven to 44 genomes and allowed us to reduce the size of the core *Sphingopyxis* pangenome by almost half. It is only 1.5× larger than the minimal genome content of the smallest free-living microorganism (470 coding regions) [57]. We also calculated the size of the entire pangenome with the median size of 28,914 clusters (IQR = 2645). Accumulation plots of the pangenome and core pangenome are presented in Figure S2. While core genome seems to reach its saturation and addition of new genomes should not a ffect it significantly, the pangenome median curve started to flatten but it is not saturated ye<sup>t</sup> and the pangenome still should be considered as "open".

Next, we assigned representatives of core clusters to COGs. Nineteen percent and 20% of clusters were assigned to cellular processes and signaling (Figure 2A), and information storage and processing functions (Figure 2B), respectively. Almost 45% of all core clusters belonged to metabolic processes (Figure 2C). The enrichment in housekeeping functions like translation and ribosome biogenesis is clearly visible here, but amino acid and nucleotide metabolism were also very abundant processes.

We also identified and compared unique protein clusters for each genome included in the analysis (Table S2), and showed significant di fferences among the number of unique clusters; from four to 1667. The number of unique clusters positively correlate with phylogenetic distance to closest relatives (Spearman rank correlation *p*-value <2.2 × <sup>10</sup>−16, Figure S3). As expected, closely related genomes contained fewer unique clusters (previously mentioned groups H1 and H2, but also genomes of Root154 and Root214) than more distant relatives.

In the pangenomic analysis, we found that the WS5A3p strain has 1274 unique protein clusters (one of the highest number among species analyzed in our dataset—Table S2), representing almost 31% of all clusters identified in its genome. Representatives of 1087 of unique protein clusters were successfully assigned to COGs, with 286 (26%) of them assigned to groups of general function prediction or unknown function (Figure 2D). However, within the general function group, we identified numerous aromatic ring-cleaving dioxygenases, monooxygenases, hydrolases and reductases that may participate in xenobiotic decomposition as well as members of the major facilitator superfamily (MFS) known to be involved in membrane transport. Here, we also found enzymes responsible for lindane turnover. Among poorly characterized COGs, the most abundant clusters were assigned to metabolic processes (369 clusters) and among them, 86 clusters were involved with inorganic ion transport and metabolism (Figure 2C). These clusters represent various iron-related proteins in high amounts, with the largest fraction as TonB-dependent receptors associated with the uptake and transport of large substrates such as iron siderophore complexes and vitamin B12 [58]. We also identified 123 clusters involved in transcription processes as well as clusters from replication, recombination and repair group (65 clusters) (Figure 2B). These were mainly transposases, integrases, and methyltransferases, indicating a high potential for recombinational processes. Analyzing the former, we detected transcriptional regulators assigned to numerous transcriptional regulator families like TerR, HxlR, GntR, LysR, IclR, AcrR, MarR, LuxR, ArsR, and AsnC.

**Figure 2.** Genome-scale analysis of protein functions for *Sphingopyxis lindanitolerans* WS5A3p in context of *Sphingopyxis* pangenome. Bar chart illustrating protein clusters participation in different COGs functional categories for *Sphingopyxis* core pangenome (orange) and WS5A3p unique clusters (purple). (**A**) Cellular processes and signaling; (**B**) information storage and processing functions; (**C**) metabolism process; (**D**) unknown functions.

For many *Sphingomonadaceae* members, a high number of genes putatively involved in aromatic compound degradation pathways have been annotated. Thus we also searched *Sphingopyxis* proteomes for the presence of proteins potentially involved in aromatic compound degradation. To reduce the number of calculations we selected H005 and H057 as representatives of H1 and H2 groups of nearly identical genomes respectively. We analyzed 31 proteomes and obtained 314 hits in total. The median number of identified proteins per proteome was 12 (IQR = 7), whereas the highest number of hits were found for *S. bauzanensis* DSM2271 (*n* = 38), strain LC363 (*n* = 29), *S. lindanitolerans* WS5A3p (*n* = 27) and strain H050 (*n* = 26) (Figure S4A,B). DSM2271 and WS5A3p strains were isolated from soils that were contaminated with either aromatic hydrocarbons or pesticides and might have acquired more genes related to the metabolism of aromatic compounds than other microorganisms analyzed. On the other hand, LC363 and H050 strains were isolated from water environments—cave and drinking water respectively—that are rather oligotrophic. For *S. lindanitolerans* WS5A3p we identified more enzymes potentially involved in aromatic compound degradation than for other strains isolated also from a γ-HCH contaminated environment like *S. flava* R11H (*n* = 15), *S. indica* DS15 (*n* = 9) and *S. terrae* subsp. *ummariensis* UI2 (*n* = 9). Majority of blastp hits (62%) were unique per genome, but also multiple copies of some proteins were identified (duplicates 31%, triplicates 4%, quadruplicates and more 2%). The most frequently identified proteins (that occurred more than 30 times) were dienelactone hydrolase (clCinC), 2-keto-4-pentanoate hydratase (CexF), 3-oxoadipate CoA-transferase (CinE), cytochrome P450 (CYP) and 4-hydroxy-2-oxovalerate aldolase (CexG). The highest frequency of the aforementioned enzymes can be explained by the fact that they conduct more distant reactions in their metabolic pathways and those substrates can participate in other pathways. The most frequently identified enzymes belonged to catechol 1,2-dioxygenation and catechol 2,3-dioxygenation route. Only for *S. bauzanensis* DSM2271 all seven proteins from catechol 2,3-dioxygenation route and only for strain KK2 all four proteins from catechol 1,2-dioxygenation route (for chlorocatechol) were identified (Figure S4C). Those results support the thesis that *Sphingopyxis*, like other sphingomonads, are specialized in the degradation of a particular compound rather than degrading a wide range of substrates [59]. Since the catechol is the main intermediate in the monoaromatic hydrocarbons biodegradation pathways, WS5Ap3 could be involved in degradation of benzene, phenol, benzoic acid or derivatives thereof. Catechol is also an intermediate product in the naphthalene and biphenyl pathways, so these compounds could also support growth of this strain, which may also confirm the presence of trans-o-hydroxybenzylidenepyruvate hydrolase-aldolase (NE), salicylaldehyde dehydrogenase (NF) (naphthalene-salicylate route) and biphenyl-2,3-diol 1,2-dioxygenase (BpC), 2,6-dioxo-6-phenylhexa-3-enoate hydrolase (BpD), benzolate 1,2-dioxygenase (BpG), dihydroxycyclohexadiene carboxylate dehydrogenase (BpH) (biphenyl-benzoate route) coding sequences in the genome (Figure S4C). However, the e ffect of individual aromatic compounds on the growth of this strain requires further study.

Plasmids present in bacteria isolated from extreme environments often contain catabolic genes (e.g., for xenobiotic degradation) that are essential for survival in harsh conditions [60,61]. A thorough analysis of the WS5A3p genome led to the identification of two potential plasmid sequences. Their comparison to other known catabolic plasmid sequences derived from *Sphingomonadaceae* family is shown in Figure 3. Both of them have a mosaic structure of segments identified in other plasmids. The fact that the segments of high similarity are flanked by transposases suggests that they were acquired by fusion with ancestral plasmids [62]. The sequence of the larger plasmid—pSPMK1—has regions of high similarity to plasmids pMI1, pMI2 and pMI3 isolated from *Sphingobium* sp. MI1205—another γ-HCH degrader [59]. One region of pSPMK1 has exclusive similarity to plasmid pHSL3 from *Sphingomonas* sp. JJ-A5 (Genbank: CP018224.1), a chlorinated pesticide (alachlor) degrading bacterium. This region contains mostly conjugal transfer module, in particular, type IV secretion system genes. The shorter plasmid—pSPMK2—has the largest region similar to plasmids also derived from γ-HCH degrading microorganisms like pISP3 from *Sphingomonas* sp. MM-1 [63] and pSRL3 from *S. indicum* B90A [64,65]. Except for the first 8 kbp this region covers almost the entire sequence. Homologous regions include the stabilization and transfer module also present in pSA4 from non-γ-HCH-degrading strain *Novosphingobium resinovorum* [66]. The first 8 kbp of pSPMK2 is present only in the pMI1 plasmid (however in a di fferent part of pMI1 than the sequence present in contig 1), and a small part in pCHQ1 from *S. japonicum* UT26S.

Since WS5A3p created clear zones on LB plates covered with γ-HCH and showed its degradation ability in a liquid medium, the genome sequence was analyzed for the presence of genes coding proteins involved in xenobiotic degradation pathways. pSPMK1 contains fragments of high similarity to the genes encoding a major pathway for the aerobic degradation of HCH isomers - the Lin pathway: *linA*, *linB*, *linX*, *linEb*, *linF*, and *linGHIJ* genes, whereas pSPMK2 contains *linC* and *linDER* genes (Table S3). Likely a consequence of numerous acquisitions, loss and structural rearrangements of mobile elements among di fferent species, we can distinguish di fferent localizations of *lin* genes. The location of *lin* genes ranges from almost solely chromosomally encoded in *S. japonicum* UT26S, to highly dispersed on multiple plasmids as in *Sphingobium* sp. MI1205 or MM-1 strains [62,67]. Numerous lines of evidence indicate that the accumulation of these genes is an ongoing process, as there are microorganisms identified with partial sets of *lin* genes in their genomes [68]. In *S. lindanitolerans* WS5A3p, the genes comprising the entire γ-HCH degradation pathway were detected exclusively on the two plasmids. The evolution and horizontal transfer of *lin* genes is connected with transposons [69]. By searching the IS FINDER database [70] we identified multiple insertion sequences (IS) in the WS5A3p genome, with IS*6* (*n* = 12) and IS*21* (*n* = 8) as the major IS families. The IS*6100* involved in the distribution of *linA* to *linF* genes in sphingomonad strains [12,42], belonging to the family of IS*6* insertion sequences were detected solely on pSPMK1 and pSPMK2 plasmids, not the chromosome (Figure S5). On pSPMK1 we identified seven complete IS*6100* and two partial matches on opposite edges of sca ffold sequence, whereas on pSPMK2 only one complete IS*6100* and two partial matches. On pSPMK1 all of the *lin* genes are associated with at least one IS*6100* in close proximity. Arrangement of *linA* and *linX* genes resembles mostly the pattern detected in pMI2 where one of the flanking IS*6100* is localized distantly from *linA*. On the other hand, the *linB* gene is surrounded directly by IS*6100* sequences like

on *Sphingobium* sp. TKS chromosome. Differently, than in other sphingomonad's plasmids, *linC* gene is not flanked by two IS*6100* sequences [42]. In the case of pSPMK2, only partial IS*6100* sequence downstream *linC* gene was identified (Figure S5).

**Figure 3.** Comparative map of plasmid pSPMK1 and plasmid pSPMK2. Sequence similarity between pSPMK1 and pSPMK2 compared to selected plasmids derived from representative *Sphingomonadaceace*. The first ring represents GC content, rings 2–9 (for pSPMK1) and 2–7 (for pSPMK2) represents the similarity between chosen reference sequences. The last ring for both plasmids indicates open reading frames identified in the sequence, colored according to their function: black—transposases, integrases; purple—*lin* genes; orange—transfer, conjugation, partition genes; red—other catabolic genes; dark green—toxin/anti-toxin; light green—heavy metal resistance; blue—other genes, grey—hypothetical proteins.

Protein sequences of lindane upper degradation pathway proteins LinA and LinB are characterized by a high level of polymorphism and this alters their enantio- and stereoselectivity [71]. Sequences of Lin proteins identified in WS5A3p are almost identical to those from the archetypal strain UT26. The most differentiated in amino acid composition is an enzyme from the γ-HCH upper degradation pathway— LinA (named here LinA type-1a), which shows 98.7% identity with a LinA type 1 (also described as LinA2) [12] that was described to be more efficient in γ-HCH degradation than LinA1 from *S. indicum* B90A [72] to which it has a 90.3% identity (Figure 4). Also, three other variants of the LinA protein have been proposed in the literature [16,73–75]. To distinguish them, we use the nomenclature LinA3 (from *Sphingobium* sp. HDIPO4) and LinA type-2 and LinA type-3 (from the soil metagenomes). Their identities with WS5A3p LinA type-1a protein are 93.6% and 86.6%, respectively. LinA type-1a protein has conserved D25 and H73 residues necessary for its specific enzymatic activity. It differs from LinA type-1 and type-3 by A23G and I35V, which is unique for WS5A3p. While G23 is involved in protein thermostability [16], V35 have not been ye<sup>t</sup> identified as accountable for specific protein features and need to be further investigated. Other LinA type-1a residues are the same as in LinA type-1. Based on the presence of K20, L96, and A131 residues we can predict that LinA type-1a will exhibit preference to the (−) enantiomers [76].

**Figure 4.** Multiple sequence alignment of LinA protein variants. The amino acid sequence of LinA type-1a fromWS5A3p strain (1) compared to previously known variants of this protein (LinA type-1 from *Sphingobium japonicum* UT26 (2), LinA1 from *Sphingobium indicum* B90A (4), LinA3 from *Sphingobium sp.* HDIPO4 (6), LinA type-2 (5) and LinA type-3 from soil metagenomes (3)). Sequences were aligned and visualized using Geneious [27]. Dots correspond to the identical amino acids, dashes indicate the gaps. Phylogenetic relationships are indicated next to the sequence names and accession numbers.

The genes encoding putative ABC transporter (*linKLMN*) necessary for γ-HCH utilization, reported for *Sphingobium* sp. were not identified in the WS5A3p genome. The highest amino acid identity of one of the WS5A3p ABC-transporter proteins to products of those genes were: 61% to LinK, 68% to LinL, 56% to LinM and 50% to LinN. The fact that homologs of the mentioned WS5A3p chromosomally encoded, putative ABC-transporter system are widely distributed among *Sphingopyxis* sp. with low identity levels (e.g., 42–75% identities at aa level for linKLMN identified in UT26 and MM-1 strains [62]) seems consistent with the hypothesis that several variants of these proteins underwent convergen<sup>t</sup> evolution to facilitate lindane utilization [73] and more, this divergence roughly reflects the phylogenetic relationship between the strains [62].
