**1. Introduction**

Cotton is the world's leading natural fiber crop. It belongs to the genus *Gossypium*, which shows broad phenotypic diversity and includes more than 50 species [1–3]. Following the molecular confirmation and taxonomic designation of two new tetraploid species, *Gossypium ekmanianum* (AD6) and *Gossypium stephensii* (AD7), the genus now comprises 7 tetraploid and 46 diploid species [3–6]. Four of these species are cultivated throughout the world: two are diploids (2*n* = 2*x* = 26) and two are allotetraploids (2*n* = 4*x* = 52). Global cotton production comes mainly from the two allotetraploid species, *Gossypium hirsutum* and *Gossypium barbadense* [7–9].

Data on allotetraploid cotton evolution indicate that the seven tetraploid cottons arose about 1.5 million years ago through hybridization between the Old World cotton *Gossypium herbaceum* (A<sup>1</sup> genome) and the New World cotton *Gossypium raimondii* (D<sup>5</sup> genome), followed by diploidization and, later, domestication [3,4,8,10–12]. *Gossypium hirsutum*, also called "Upland cotton", accounts for 90% of global cotton fiber production [13], while *Gossypium barbadense* (also known as Pima or Egyptian cotton) is valued for its extra-long staple fiber, was domesticated in north-western South America, and contributes around 8% of total world lint [9]. Wild *Gossypium darwinii*, a close relative of *G. barbadense* native to the Galapagos Islands, also has good fiber fineness and is a rich source of resistance to *Fusarium* and *Verticillium* wilts [14]. The D-genome species *Gossypium klotzschianum*, which has glabrous seed coverings, arose through long-distance dispersal, is endemic to the Galapagos Islands, and is considered a New World D-genome diploid along with *G. raimondii*. *Gossypium tomentosum* is drought-tolerant, native to the Hawaiian Islands, and has a diffuse population structure, typically occurring as scattered individuals and small populations on several islands. *G. tomentosum* (AD3), *G. darwinii* (AD5), *G. ekmanianum* (AD6), and *G. stephensii* (AD7) are wild and are not grown commercially [1,2,15,16]. Wendel and Percy analyzed 58 *G. darwinii* accessions from six islands using 17 isozyme markers; they found a high level of genetic diversity within these accessions and characterized their relationships with the *G. barbadense* and *G. hirsutum* genomes. This classic study suggested that *G. darwinii* and *G. barbadense* are separate and that each has a distinct genome [17].

Genetic diversity is an essential foundation for crop production in agriculture, and cotton is no exception. Genetic variation in *Gossypium* species is widespread and spans large geographic ranges and diverse ecological niches. This diversity is conserved in situ in Mexico, a center of origin of cotton [18,19], and is preserved ex situ in worldwide cotton germplasm collections and in the materials of breeding programs. Cotton productivity and future improvement efforts depend to a large extent on elucidating the genetic diversity present in cotton genetic stocks and on using it effectively in cotton improvement programs [20].

The narrow genetic background of Upland cotton has become a major concern, because low genetic diversity leads to stagnation in yield and quality improvement. Elite breeding programs are unlikely to make robust progress without exploiting the standing genetic variation of archaic cultivars, which are typically associated with wild accessions [14,21,22]. Characterizing genetic diversity between and within groups makes it possible to find heterozygous groups, understand population structures, and isolate a core set of lines for genetic analysis in cotton. Numerous studies have used model-based structure analysis to investigate genetic diversity in cotton [22,23]. Genetic diversity estimates have been established using genotypic data and DNA-based molecular markers [24–28]. Molecular markers are considered more reliable because they directly reveal allelic diversity and give robust estimates of genetic distances.

The DNA-based markers used to determine genetic diversity in cotton include restriction fragment length polymorphisms (RFLPs) [29], random amplified polymorphic DNA (RAPD) [30–32], amplified fragment length polymorphisms (AFLPs) [33], simple sequence repeats (SSRs) [9,34–36], expressed sequence tags (ESTs) [37], inter-simple sequence repeats (ISSRs) [38,39], and single nucleotide polymorphisms (SNPs) [40]. Compared with other marker types, SSRs offer several advantages: they are highly reproducible, co-dominantly inherited, distributed throughout the genome, and highly transferable, informative, and reliable [41].

Although several studies have reported marker-based estimates of genetic diversity in cotton, most were limited by the number of accessions included or the number of markers used to describe genetic diversity [42]. Recently, Kirungu et al. [43] explored important genes linked to SSR markers by constructing a genetic linkage map between *Gossypium davidsonii* and *G. klotzschianum*. Similarly, studying gene diversity and functionality, and especially identifying uncharacterized protein domains when inferring evolutionary relationships among cotton accessions, should help to clarify cotton evolution. Of all protein domains with a unique structure and function, more than 20% are currently described as "domains of unknown function" (DUFs). They are often overlooked as irrelevant because many of them are found in only a few genomes. Approximately 2700 DUFs are found in bacteria, compared with only about 1500 in eukaryotes. More than 800 DUFs are common to bacteria and eukaryotes, and about 300 of these are also present in archaea. This evolutionary conservation, together with the fact that DUFs mostly represent single-domain proteins, suggests that many of them are biologically important [44].

The importance of prioritizing DUFs has been recognized in various experimental and computational characterization efforts [45–48]. We identified DUF819 (PF05684), which is not only highly conserved but also plays an important role in responses to biotic and abiotic stress, in four sequenced cotton species, using proteins of the WDR (PF00400) superfamily from the sequenced reference genomes as a reference. Genome-wide characterization of WD-repeats, also known as the tryptophan-aspartic acid or W-D superfamily, has so far been conducted only in *Arabidopsis* and cucumber [49,50]. Therefore, a comprehensive study combining a wide collection of germplasm, more efficient genotyping, and collective genomic platforms is required to measure the overall genetic diversity of diploid and allotetraploid cotton, which will help to address the future challenge of a disastrous loss of the gene pool.

The objectives of this study were to explore the genetic diversity and evolutionary relationships among the domains of uncharacterized proteins in natural diploid and allotetraploid cotton germplasm resources, and to analyze population structure in order to better characterize the cotton accessions maintained in a wild cotton nursery in China for their efficient utilization in cotton-breeding programs.
