*2.6. Phylogenetic Analysis of Mined Genes and Functional Annotation of DUF819 (PF005684)*

The study was extended to dissect the evolutionary relationship among the uncharacterized genes because these are considered to be highly conserved and have an important role in biology. DUF819 is a family containing proteins (PF005684) found in 532 species with a total of 756 sequences. A total 1517 genes were found among *Gossypium arboreum* (258), *G. raimondii* (258), *G. hirsutum* (513), and *G. barbadense* (488) from the cotton functional genomic database (www.cottonfgd.org). These were the best-fit matched homologue genes having the highest similarity with the four cotton species. Out of 1517, only 116 genes differentially expressed in experiments and belonging to PF00400 and PF05684 were identified among *G. arboreum*, *G. raimondii*, *G. hirsutum,* and *G. barbadense.* A total of 24 uncharacterized genes identified to diagnose their evolutionary relationship with these cotton species. A phylogenetic tree consisting of 115 genes out of 116 expressed genes, including uncharacterized genes, in different experiments was constructed for the sorted PF00400 (105) and PF005684 (11) (Figure 7). The protein sequence of one gene remained unaligned during a ClustalW alignment. The PF00400 belonging to the superfamily WDR was used as a reference because 13 out of 24 uncharacterized genes were linked to this protein domain. The remaining 11 uncharacterized genes were named yjcL. The 11 yjcL (PF005684) genes were found to be more closely related to Cotton\_A04\_RF178 and then to GOBAR\_DD\_12SPA2, whose functions are known. Cotton\_A04\_RF178 is a well-known E3 ubiquitous protein having an important role in stress response in plants and animals [54]. It is predicted that, as these genes make a very close cluster with a well-known gene playing a crucial role in plant survival, they may have same function because they can be assigned to proteins by using the bioinformatics tools in comparative genomics [55]. The yjcL of A\_ and D\_, which are subgenomes of *G. hiursutum* and *G. barbadense*, make close groups with the yjcL of *G. arboreum* and *G. raimondii*, respectively. This indicates that yjcL genes may flow from *G. arboreum* and *G. raimondii* to *G. barbadense* and *G. hirsutum* in equal proportion. Moreover, the uncharacterized genes in *G. hirsutum*, *G. raimondii,* and *G. arboreum* indicated as all2124 are grouped close to Gh\_A03ago, which has the known function of a protein related to F-box/WD repeats. Similarly, these results can also be validated by predicting that Gh\_A01all2124, Gorai-all2124, Gorai-alr3466, and Cott\_A\_all2124 perform the same function as that determined for Gh\_A03ago. GOBAR\_AA12SPAC343 is separate from but close to the gene pak1ip1 with the known function of p21-activated protein kinase-interacting protein 1-like. The gene SPBC1711 located on the A\_ and D\_ genome of *G. hirsutum*, *G. arboreum*, and *G. raimondii* makes a close group with the gene named RUP2 in the A03 and A09 chromosomes of *G. hirsutum*. RUP2 is a gene composed of WD protein domains. Meanwhile, another gene, SPAC3H5, originating in *G. hirsutum* and *G. arboreum*

was found to be lying nearest to the DDB2 gene of known function in *G. hirsutum*. These genes were distributed throughout the 26 chromosomes. The maximum number of genes (11) was found on chromosome D11, and the minimum number (4) was found on chromosome 2 (Supplementary Table S6). The coding DNA sequences (CDS) were characterized and the GC content percentage ranged from 38.9 to 52.7 with its length ranging from 417 to 3221 (bp). The maximum number of exons was noted to be 24 in the Gh\_D11G0779 homologue of GOBAR\_DD21377, indicating the highest intron disruption (Supplementary Table S6). The majority of these mined genes, especially the uncharacterized ones, have a single protein domain, which means that these genes are highly conserved. We analyzed the features of these genes and the results showed several categories related to stress and fiber development in upland cotton. We further analyzed the genes through annotations and Gene Ontology (GO) terms that were associated with the mined genes, which describe the genes in relation to cellular components (CCs), molecular function (MF), and biological process (BP) [56]. In cellular components, functions such as microtubule organizing center (11%), microtubule-associated complex (10%), membrane coat (13%), coated membrane (13%), cytoskeleton part (10%), microtubule cytoskeleton (%), cytoskeleton (10%), and protein complex (23%) were observed. Similarly, 14 molecular functions and 5 biological processes were observed (Figure 8). Finally, we carried out RNA sequence expression to validate our results. The 65 genes with differential expression in *G. hirsutum* were selected to construct a heat map. The genes were both up and downregulated in cold, hot, polyethylene glycol (PEG), and salt treatments and different developmental stages of different tissue organs, such as calyx, leaf, petal, pistil, root, stamen, stem, and torus tissue (Figure 9). The genes were categorized into two main groups. Group 1 comprised 34 genes that were significantly expressed; i.e., with fragments per kilobase of transcript per million mapped reads (FPKM) value of more than 1. Among the 34 upregulated genes, SPA2 (protein SPA1-RELATED 2) with Gene ID Gh\_D12G2294 has five Go functions: protein kinase activity (GO:0004672 = MF), protein binding (GO:0005515 = MF), ATP binding (GO:0005524 = MF), protein phosphorylation (GO:0006468 = BP), and transferase activity transferring phosphorus-containing groups (GO:0016772 = MF). CDC40 (Pre-mRNA-processing factor 17) with Gene ID Gh\_A05G0018 depicts three GO functions: mRNA splicing via spliceosome (GO:0000398 = BP), protein binding (GO:0005515 = MF), and catalytic step 2 spliceosome (GO:0071013 = CC). Two Guanine nucleotide-binding protein subunit beta-2s with different Gene IDs were found to have two similar GO functions, namely protein binding (GO: 0005515 = MF) and signal transduction (GO:0007165 = BP). All remaining genes were found to be associated in molecular function with protein binding with GO:0005515. The yjcL-linked gene ID Gh\_A09G2500 showed significant expression against drought and salt stress and fell into group 1. The other two Gene IDs associated with SPAC3H5, an uncharacterized WD-repeat-containing protein (GO:0005515 = MF), also indicated significant expression. Group 2 has 31 genes that exhibited the differential expression of both up and downregulation (Supplementary Table S7). Among these, only the Gh\_D07G1711 gene showed three GO functions: mRNA splicing via spliceosome (GO:0000398 = BP), protein binding (GO:0005515 = MF), and catalytic step 2 spliceosome (GO:0071013 = CC). All others were associated with WDR25 (WD-repeat containing protein 25) with the GO function GO:0005515 = MF except for three genes with the IDs Gh\_D11G0109, Gh\_A11G2961, and Gh\_D09G0432, which are linked to the uncharacterized gene yjcL and have no GO functions. In the second group, four genes, namely Gh\_D07G2259, Gh\_A10G2180, Gh\_D09G0432, and Gh\_D02G1696, were relatively downregulated while all other genes showed differential expression. Gh\_D04G1713 showed upregulated expression in petal and stamen tissues but relative downregulation in other tissues and under stress treatments.

*Int. J. Mol. Sci.* **2018**, *19*, x FOR PEER REVIEW 10 of 21

*Int. J. Mol. Sci.* **2018**, *19*, x FOR PEER REVIEW 10 of 21

**Figure 7.** Evolutionary relationship of 115 genes belonging to protein domains of DUF819 (PF005684) and WDR (PF00400) in *Gossypium arboreum*, *G. raimondii, G. hirsutum,* and *G. barbadense.* The phylogenetic tree was constructed using MEGA software ver. 7.0 by the neighbor-joining method. The parameters were 1000 bootstraps and pairwise deletion. The 24 uncharacterized genes are indicated by red dots in four *Gossypium* species. **Figure 7.** Evolutionary relationship of 115 genes belonging to protein domains of DUF819 (PF005684) and WDR (PF00400) in *Gossypium arboreum*, *G. raimondii, G. hirsutum,* and *G. barbadense.* The phylogenetic tree was constructed using MEGA software ver. 7.0 by the neighbor-joining method. The parameters were 1000 bootstraps and pairwise deletion. The 24 uncharacterized genes are indicated by red dots in four *Gossypium* species. **Figure 7.** Evolutionary relationship of 115 genes belonging to protein domains of DUF819 (PF005684) and WDR (PF00400) in *Gossypium arboreum*, *G. raimondii, G. hirsutum,* and *G. barbadense.* The phylogenetic tree was constructed using MEGA software ver. 7.0 by the neighbor-joining method. The parameters were 1000 bootstraps and pairwise deletion. The 24 uncharacterized genes are indicated by red dots in four *Gossypium* species.

**Figure 8.** *Cont.*

*Int. J. Mol. Sci.* **2018**, *19*, x FOR PEER REVIEW 11 of 21

**Figure 8.** Genes were analyzed using the Agrigo v 2.0 software. Gene Ontology (GO) annotation results for *Gossypium hirsutum* mined genes. GO functional classification of genes mined with protein sequences predicted for their involvement in biological processes (BPs), molecular functions (MFs), and cellular component (CCs). **Figure 8.** Genes were analyzed using the Agrigo v 2.0 software. Gene Ontology (GO) annotation results for *Gossypium hirsutum* mined genes. GO functional classification of genes mined with protein sequences predicted for their involvement in biological processes (BPs), molecular functions (MFs), and cellular component (CCs).

*Int. J. Mol. Sci.* **2018**, *19*, x FOR PEER REVIEW 12 of 21

**Figure 9.** RNA sequence data analysis of 65 differentially expressed genes in eight different cotton tissues with reference to a control group (CG) under different cold (CT), heat (HT), PEG (PEGT), and salt treatments (ST) listed at the top of the figure. The names of genes are listed to the left of the figure. The heat map was generated from the log10 (FPKM) of the expression values by using R software. The *Y* axis represents the relative expression (2−∆∆*C*<sup>t</sup> ). **Figure 9.** RNA sequence data analysis of 65 differentially expressed genes in eight different cotton tissues with reference to a control group (CG) under different cold (CT), heat (HT), PEG (PEGT), and salt treatments (ST) listed at the top of the figure. The names of genes are listed to the left of the figure. The heat map was generated from the log10 (FPKM) of the expression values by using R software. The *Y* axis represents the relative expression (2−∆∆*C*<sup>t</sup> ).

#### In this study, 111 SSR primer pairs generated 382 polymorphic loci in the 132 tested accessions. **3. Discussion**

**3. Discussion** 

An average of 3.44 alleles amplified per marker was observed for all accessions, ranging from 2 to 8 alleles. This value is comparable with the findings of Dahab et al. [57] on allele number using 70 SSR markers on *Gossypium hirsutum*. Consistent results were found with 3.93 alleles per locus by Bardak and Bolek [58] for assessing genetic diversity in diploid and tetraploid cottons using Simple Sequence Repeat (SSR) and Inter Simple Sequence Repeat (ISSR) markers. Wendel and Percy [17] detected 3.47 enzymes per locus by studying 17 enzymes encoding 59 loci in a collection of 58 accessions of *darwinii* from six islands. However, other studies showed a variable allele number per locus. For example, 2.13, 2.20, 5.46, and 7.64 alleles per marker have been found in several genetic diversity assessments for the cotton germplasm [26,59–61]. The semi-wild accessions retaining a diverse germplasm showed high allele numbers in the majority of studies consistent with a recent study because these accessions have not yet been exposed to extensive human selection pressure for accumulating a particular type of alleles [26,62,63]. The number of alleles observed per marker is In this study, 111 SSR primer pairs generated 382 polymorphic loci in the 132 tested accessions. An average of 3.44 alleles amplified per marker was observed for all accessions, ranging from 2 to 8 alleles. This value is comparable with the findings of Dahab et al. [57] on allele number using 70 SSR markers on *Gossypium hirsutum*. Consistent results were found with 3.93 alleles per locus by Bardak and Bolek [58] for assessing genetic diversity in diploid and tetraploid cottons using Simple Sequence Repeat (SSR) and Inter Simple Sequence Repeat (ISSR) markers. Wendel and Percy [17] detected 3.47 enzymes per locus by studying 17 enzymes encoding 59 loci in a collection of 58 accessions of *darwinii* from six islands. However, other studies showed a variable allele number per locus. For example, 2.13, 2.20, 5.46, and 7.64 alleles per marker have been found in several genetic diversity assessments for the cotton germplasm [26,59–61]. The semi-wild accessions retaining a diverse germplasm showed high allele numbers in the majority of studies consistent with a recent study because these accessions have not yet been exposed to extensive human selection pressure for accumulating a particular type of alleles [26,62,63]. The number of alleles observed per marker is contingent on the selection of markers, the collection of germplasm to be genotyped, and the platform used for the resolution of amplified products [64].

Our study results determined an average PIC value of 0.555 with a range of 0.078 to 0.821, which completely corresponds to the literature-cited average PIC value for cotton SSRs, which ranges from 0.122 [65] to 0.71 [26]. Higher PIC values in cotton as shown in the current study suggest that these accessions can be useful for improving cotton [57]. The unique alleles identified in this study had percentages of 7.36, 5.83, 6.43, 17.36, 17.51, 3.2, and 14.44 in *G. barbadense*, *G. darwinii*, *G. tomentosum*, *G. hirsutum*, *G. ekmanianum*, *G. stephensii* and *G. klotzschianum*, respectively. These are higher than the percentages reported in the earlier report [65]. It is an interesting finding that all the unique alleles were found in the newly collected accessions along with a few from *G. hirsutum*. The unique alleles may be related to unique characteristics, such as extra-long fibers in *G. barbadense* and drought and salt tolerance in *G. tomentosum* and *G. darwinii*, and be similar to other wild accessions.

The common alleles were estimated to understand the gene flow mechanism of the *Gossypium* species during evolution. The results are supported by previous studies and the hypothesis that *G. hirsutum* has a single evolutionary lineage because all of the species in this study have common alleles with reference to the fixed alleles of *G. hirsutum*. *G. tomentosum* is considered to be a sister of *G. hirsutum*, while *G. barbadense* originates from a geographically overlapping region. *G. darwinii* has the same origin, the Galapagos Island, as *G. barbadense*. The *G. klotzschianum* diploid cotton is considered endemic to the New World and has an origin similar to that of *G. raimondii*. The two new species *G. ekmanianum* and *G. stephensii* are sister clades even though they make distinct groups with *G. hirsutum*, but *G. ekmanianum* and *G. stephensii* are monophyletic to *G. hirsutum* which strengthens the hypothesis of gene flow from different species to *G. hirsutum* having a single-lineage evolution [1,13,15]. The present analysis with a high level of natural introgression among the wild accessions shows consistency with the results of Yu et al. [51] and Hinze et al. [36] who described the distribution of introgression within the *G. hirsutum* and *G. barbadense* genomes using the chromosome positions as markers.

Populations from different islands are isolated distinctly, indicating general correspondence to Wedel and Percy's investigations [17]. Due to good agreement with Wendel and Percy [17], an exploration that occurred 30 years prior to the present study, this novel study will also be helpful in understanding the basis of the hybridization and domestication phenomenon in cotton evolution. Our findings also suggest that a wild germplasm has higher genetic diversity than that in cultivated cotton.

A phylogenetic tree constructed based on genotypic data completely validated the distinct clustering of the accessions detected. The results are quite congruent to prior taxonomic studies [2,3,66]. The average genetic distance (GD = 0.325) revealed the overall level of genetic diversity to be high among semi-wild and cultivated accessions; this finding is similar to earlier reports [26,67,68]. However, this estimate may be inflated since data from monomorphic SSR loci were excluded in the current study.

An evolutionary relationship among the genes was also developed. It has been estimated that the majority of genes are linked to responses towards biotic and abiotic stress conditions. For example, the damage-specific DNA binding protein 2 (DDB2), Autophagy-related protein 16 (ATG16), WD repeat-containing protein LWD1, Denticleless protein homolog (DTL), Protein FIZZY-RELATED FZR2, FZR3, Pre-mRNA-processing factor 17 (CDC40), Diphthine methyltransferase (DPH7), Chromatin assembly factor 1 subunit FAS2, DNA excision repair protein ERCC-8, Flowering time control protein FY, Guanine nucleotide-binding protein subunit beta-like protein (GB1), Myosin heavy chain kinase B (mhkB), WD-40 repeat-containing protein MSI1, MSI4, F-box/WD repeat-containing protein sel-10, U5 small nuclear ribonucleoprotein 40 kDa protein (SNRNP40), THO complex subunit 3 (THO3), WD repeat-containing protein VIP3, WDR25, WDR5A, WDR5B, and U3 snoRNP-associated protein-like YAO belong to the superfamily of WDR and are involved in repairing damaged DNA under various stress conditions [69,70]. Similarly, RUP2 has been found to play a very crucial role in vegetative development and flowering in Arabidopsis [49]. F-box/WD repeat-containing protein 7, named as "ago" and associated with Gh\_A03G1152, supports plants against disease and repairs

damaged DNA [71]. SEC31 homolog B transports proteins and is situated at Golgi-associated endoplasmic reticulum exit sites. CDC20-1 is a known component of the anaphase promoting complex/cyclosome (APC/C), a cell-cycle-regulated E3 ubiquitin–protein ligase complex that controls progression through mitosis and the G1 phase of the cell cycle. This protein is involved in the pathway protein ubiquitination, which is part of protein modification. The intron-containing CDC20 gene copies provide conserved and redundant functions for cell-cycle progression in plants and are required for meristem maintenance, plant growth, and male gametophyte formation [69]. These results also support our hypothesis that yjcL (PF005684) has the same functions as these WDR family genes.

The current study is highly associated with the pedigree information recently provided by Gallagher et al. [3] after molecular confirmation of newly designated species of *Gossypium*. Genetic diversity within the group was highest for the Isabella group and lowest for the Hawaiian group (Table 2 and Figure 1). The genetic differentiation between groups was further validated by AMOVA, with 49% of the variation among populations and 51% of the variation within populations (\*\* significant *p* > 0.001) being explained by the population structure of the wild cotton germplasm (Table 3). Such higher variation may be due to the complete study of seven different species of diploid and tetraploid cotton. This also indicates the presence of a great genetic difference among tetraploid and diploid cottons as well as a good level of genetic diversity within each group, which can be used in further hybridization breeding programs in cotton to broaden the narrow genetic base of *G. hirsutum*, which is becoming a serious threat due to limited allelic availability [72]. The FST values for the diploid and tetraploid cottons observed in this study (0.301–0.869) are very high, indicating high genetic distance and diversity. PCoA plots separated tetraploid cottons from diploid plants, supporting the AMOVA results. All these results are in good agreement with Noormohammadi et al.'s genetic diversity analysis [68] between diploid and tetraploid accessions.

Thus, our results could help breeders to determine the selection of appropriate parental combinations in germplasm enhancement programs and conserve genetic diversity and the evolutionary relationship among the genes of uncharacterized functions. The presence of profound population differentiation could pose a challenge to successful Genome-Wide Association Mapping (GWAS) studies in the Upland cotton germplasm for traits that are associated with population structures. The power of structure-based association studies to detect the effects of a single gene would be reduced if a large fraction of variation was explained by the population structures [22,73]. In such cases, alternative association mapping populations would be more useful.
