1. Introduction
Due to its high protein and oil content, soybean (
Glycine max L.) is one of the world’s most important crops, accounting for the largest proportion of protein consumption, livestock feed, and oil seed production (
http://soystats.com, accessed on 1 November 2022). Soybean protein contains all of the essential amino acids including isoleucine (Ile), histidine (His), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), threonine (Thr), tryptophan (Trp), and valine (Val), making it a nutritionally valuable crop [
1]. In Korea, soybean is used in various products, including tofu, soymilk, soybean sprouts, and soybean paste. Therefore, the gene-based improvement of protein, oil, and amino acid content is a very important goal in soybean breeding. However, the domestication bottleneck and selective breeding have led to a significant reduction in the genetic diversity of modern soybean cultivars, which has hindered breeding progress [
2].
Wild soybean (
Glycine soja Sieb. and Zucc.), which is the ancestor of cultivated soybean (
G. max), is highly valuable as a breeding material for the improvement of soybean because of its high genetic diversity [
3,
4]. Genetic loci underlying the phenotypic diversity in protein, oil, and amino acid content observed in wild soybean can therefore be exploited in soybean breeding. The average protein and oil contents of cultivated soybean seeds are 40% and 20%, respectively [
5], compared to 48% and 10% for wild soybean [
6,
7,
8]. The amino acid composition of wild soybean is similar to that of cultivated soybean, with the highest content of glutamic acid (Glu) and aspartic acid (Asp) and the lowest content of cysteine (Cys) and Met [
9,
10].
To date, 255 and 322 quantitative trait loci (QTLs) associated with the protein and oil content of soybean, respectively, have been reported (
https://www.soybase.org/, accessed on 1 November 2022). Soybean protein and oil contents are widely documented to be negatively correlated [
11,
12]. Therefore, because a candidate gene for one of these traits may be associated with the other, they need to be studied together. Diers et al. (1992) identified major QTLs for protein and oil content on chromosomes 15 and 20 [
13]; since then, these QTL regions have been progressively narrowed by subsequent studies [
14,
15,
16,
17,
18]. These two major QTLs are found in the wild soybean accession PI 468916, with an increase in protein contents of 24 g/kg and 17 g/kg in the presence of homozygous alleles for the QTLs on chromosomes 20 and 15, respectively [
13]. Recently, the QTL located on chromosome 15 was fine-mapped to a 535 kb interval between simple sequence repeat (SSR) markers [
19], and fine-mapping and candidate gene selection have been completed for chromosome 20 as well [
20].
Amino acids are widely used in the animal feed industry [
1,
21], with the poultry and swine industries consuming over 400,000 mt of Lys [
1] and spending approximately USD 100 million annually to supplement feed with synthetic Met [
22]. However, far fewer genetic studies have targeted amino acid content than protein and oil contents in soybean. In a previous study using a population of 101 F6-derived recombinant inbred lines (RILs) from a cross between the high-protein line N87-984-16 and the high-yield line TN93-99, a total of 32 QTLs associated with 18 amino acids were identified across 17 soybean chromosomes [
23]. In particular, two QTLs associated with Cys were detected on molecular markers Satt235 (LG-G, chr. 18) and Satt252 (LG-F, chr. 13), and three QTLs associated with Met were detected on molecular markers Satt252 (LG-F, chr. 13), Satt564 (LG-G, chr. 18), and Satt590 (LG-M, chr. 7) [
1]. Warrington et al. (2015) also identified a total of 13 QTLs for the proportions of four amino acids (Lys, Thr, Met, and Cys) in crude protein using 140 RILs developed from a cross of Benning and Danbaekkong, and studied the relationship between the protein and amino acid contents [
18]. Recently, eight genomic regions associated with the contents of Cys, Met, Lys, and Thr have been identified in a genome-wide association study (GWAS) using 621
G. max accessions in maturity groups I–IV and 34,014 single-nucleotide polymorphism (SNP) markers [
21].
The objective of the present study was to identify candidate genes related to protein, oil, and amino acid content in a diverse set of 203 wild soybean accessions using a GWAS.
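To make the idea of a marker-trait association scan concrete, the following is a minimal sketch in Python, not the model used in this study. It assumes simulated genotypes coded as allele dosages (0 or 2, reflecting largely homozygous accessions), a single hypothetical causal marker, and a plain single-marker linear regression. GWAS analyses of diversity panels usually also correct for population structure and kinship and report p-values; both are omitted here. All function names and parameter values are illustrative assumptions.

```python
import random


def marker_effect(dosages, phenotypes):
    """Least-squares regression of a trait on allele dosage for one SNP.

    Returns (slope, r_squared): the estimated additive effect per allele
    copy and the share of phenotypic variance the marker explains.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic marker or constant trait: no signal
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def scan_markers(genotype_matrix, phenotypes):
    """Test every marker; return (index, slope, r_squared) sorted by r_squared."""
    results = []
    for index, dosages in enumerate(genotype_matrix):
        slope, r2 = marker_effect(dosages, phenotypes)
        results.append((index, slope, r2))
    return sorted(results, key=lambda row: row[2], reverse=True)


def simulate_panel(n_accessions=203, n_markers=50, causal=7, effect=1.5, seed=0):
    """Simulated data only: homozygous dosages (0 or 2) and one causal marker."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 2)) for _ in range(n_accessions)]
                 for _ in range(n_markers)]
    # Hypothetical protein content (%): baseline + additive effect + noise.
    phenotypes = [45.0 + effect * genotypes[causal][i] + rng.gauss(0.0, 1.0)
                  for i in range(n_accessions)]
    return genotypes, phenotypes


if __name__ == "__main__":
    genotypes, phenotypes = simulate_panel()
    top_index, top_slope, top_r2 = scan_markers(genotypes, phenotypes)[0]
    print(f"top marker {top_index}: effect {top_slope:.2f} per allele, R^2 {top_r2:.2f}")
```

In this simulation the causal marker should rank first because its simulated effect is large relative to the noise; real association signals are typically much weaker and require a formal significance threshold.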
3. Discussion
The wild soybean accessions used in this study were collected from Korea, China, Japan, and Russia and harbor broad genetic diversity [
24], which can be utilized for soybean improvement by applying a GWAS to identify useful alleles. Soybeans contain not only essential amino acids but also a large amount of unsaturated fatty acids; thus, they are widely consumed for health purposes. As a result, many studies have been conducted on QTLs involved in regulating protein and oil content. The present study analyzed 203 wild soybean accessions grown for two years for their contents of protein, oil, and 17 amino acids. The average protein and oil contents of wild soybean were 47.84% and 7.33%, respectively. This is consistent with several previous studies reporting that wild soybean has a higher protein content and a lower oil content than the 40% and 20%, respectively, widely reported for cultivated soybean [
6,
7,
8]. Globally, soybean accounts for the largest proportion of oilseed production at 61% (
http://soystats.com/, accessed on 1 November 2022). Taken together, these results suggest that oil content increased during the domestication of cultivated soybean from wild soybean, whereas protein content shifted lower. This shift is consistent with the negative correlation between protein and oil content [
11,
12,
21], which is also observed in
Figure 3. Soybean seeds contain both constituent amino acids, which make up the proteins, and free amino acids. In this study, the constituent amino acids were analyzed; Glu was the most abundant, and the correlations among the amino acids were significantly positive. Chotekajorn et al. (2021) [
25] analyzed the free amino acids from 316 wild soybean accessions and found that Arg was the most abundant and that most of the amino acids were positively correlated with each other, similar to the results of this study.
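The sign of the trait correlations discussed above can be illustrated with a short Python sketch of the Pearson correlation coefficient. The values below are hypothetical and are not data from this study; they are chosen only so that oil falls and Glu rises as protein rises, mirroring the pattern described in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Hypothetical seed composition (%) for five accessions, not measured data.
protein = [50.1, 48.3, 47.2, 46.0, 45.1]
oil = [6.0, 6.9, 7.4, 8.1, 8.8]
glu = [8.9, 8.6, 8.3, 8.1, 7.9]  # hypothetical Glu content tracking protein

if __name__ == "__main__":
    print(f"protein vs oil: r = {pearson_r(protein, oil):.2f}")
    print(f"protein vs Glu: r = {pearson_r(protein, glu):.2f}")
```

A negative coefficient for protein versus oil and a positive one for protein versus Glu correspond to the two patterns described in this paragraph; significance testing is left out of the sketch.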
In the GWAS, the detected SNP markers fell within five and six genes for protein and oil content, respectively. Many studies have reported that major candidate genes associated with protein content are located on chromosomes 15 and 20 [
13,
15,
17,
19,
20]. Kim et al. (2016) conducted fine-mapping of the protein and oil content using a backcross population with the high-protein line PI 407788A as the donor parent and Williams 82 as the recurrent parent and found that QTLs were located between BARCSOYSSR_15_0161 and BARCSOYSSR_15_0194 on chromosome 15 [
19]. In addition, Fliege et al. (2022) recently performed fine-mapping and RNAi transformation for protein content using a backcross population with the high-protein wild soybean line PI 468916 as the donor parent and A81-356022 as the recurrent parent in the initial stages of a large-scale QTL analysis for soybean protein and oil content [
20]. Their research revealed that the protein content is regulated by a CCT domain protein polymorphism in the
Glyma.20G85100 gene on chromosome 20 [
20]. In this study, AX-90368184 on chromosome 15 and AX-90513791 on chromosome 20, which were associated with protein and oil content, respectively, were detected at positions similar to the aforementioned major QTLs. The genes on the reference genome where the SNPs are located are
Glyma.15g055200 (F-box and associated interaction domain-containing protein) and
Glyma.20g087700 (protein of unknown function), which differ from the aforementioned genes. However, it is clear that the major QTLs are located on chromosomes 15 and 20.
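The gene identifiers cited here encode their chromosome in the two digits that follow "Glyma.", which gives a quick way to check candidate genes against the chromosomes named in the text. The sketch below is illustrative only: the regular expression is an assumed reading of the ID pattern (two-digit chromosome, "g" or "G", then a serial number), and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed ID pattern: "Glyma." + two-digit chromosome + "g"/"G" + serial number.
GLYMA_ID = re.compile(r"Glyma\.(\d{2})[gG](\d+)")


def chromosome_of(gene_id):
    """Return the chromosome number encoded in a Glyma gene identifier."""
    match = GLYMA_ID.fullmatch(gene_id.strip())
    if match is None:
        raise ValueError(f"unrecognized gene ID: {gene_id!r}")
    return int(match.group(1))


def group_by_chromosome(gene_ids):
    """Group gene identifiers by the chromosome encoded in each ID."""
    grouped = defaultdict(list)
    for gene_id in gene_ids:
        grouped[chromosome_of(gene_id)].append(gene_id)
    return dict(grouped)


if __name__ == "__main__":
    # Gene IDs quoted in the text above.
    candidates = ["Glyma.15g055200", "Glyma.20g087700", "Glyma.20G85100"]
    for chromosome, genes in sorted(group_by_chromosome(candidates).items()):
        print(chromosome, genes)
```

Grouping the identifiers this way shows that the genes detected here and the previously reported gene fall on chromosomes 15 and 20, even though the genes themselves differ.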
These differences can be ascribed to the analysis of different accessions and the fact that the protein and oil content is not regulated by a single gene. The DNA binding with one finger (DOF) family of plant-specific transcription factors (TFs) is known to regulate seed protein accumulation and mobilization [
26]. OBP3, an annotation of the
Glyma.11g015500 gene detected on chromosome 11, is a member of the DOF family. It has been reported that OBP3 regulates phytochrome and cryptochrome signaling in Arabidopsis thaliana [
27] and plays an important role in growth and development [
26]. However, the function of OBP3 in soybean is unknown. The involvement of the DOF family in protein accumulation suggests that the
Glyma.11g015500 gene may be a strong candidate for regulating protein content.
The
Glyma.20g050300 gene, annotated as a zinc-binding alcohol dehydrogenase family protein, was detected on chromosome 20 and was associated with oil content. Soybean alcohol dehydrogenase has been found to be active in anaerobic reactions and seed respiration, including in response to flooding stress [
28,
29]. In addition, alcohol dehydrogenase was among the fatty acid synthesis-related proteins identified by comparative proteomics of the high-oil soybean cultivar JY73 in a previous study [
30]. Therefore, the oil content may be indirectly affected by alcohol dehydrogenase depending on the condition of the seed; thus,
Glyma.20g050300 may be a candidate gene for the regulation of the oil content.
Interestingly, in the GWAS results for the content of the 17 amino acids, markers AX-90332294 and AX-90522787, located on chromosomes 1 and 3, respectively, were detected for nine amino acids. In particular, AX-90332294 exhibited the largest SNP variance for each amino acid (
Figure 6). For these amino acids,
Glyma.01g053200 and
Glyma.03g239700, which were annotated as a prefoldin chaperone subunit family protein and an aspartyl protease/7S seed globulin precursor, respectively, were identified. In eukaryotes, chaperones are known to assist the structural folding of protein complexes and to promote protein degradation by inducing proteases [
31,
32]. Prefoldins are a family of heterohexameric chaperone proteins composed of two α subunits and four β subunits [
31,
33]. Because protein complexes are ultimately composed of amino acids,
Glyma.01g053200 may be related to amino acid content. In addition, β-conglycinin (7S) and glycinin (11S) are known to account for more than 70% of the total soybean storage proteins [
23,
34]. The high abundance of the 7S and 11S proteins among soybean storage proteins may not be directly related to the presence of SNPs in their structural genes; rather, the expression and accumulation of these proteins are known to be controlled by regulatory elements, such as promoters and enhancers, that govern the transcription and translation of the corresponding genes. Nevertheless, genetic variation in the structural genes encoding the 7S and 11S proteins, including SNPs, can affect expression levels or protein function and, in turn, soybean protein composition and nutritional value. Thus, in a broader sense, genetic variation in these genes may have important implications for the quality and utilization of soybean proteins. In this study,
Glyma.03g239700,
Glyma.19g164800, and
Glyma.19g164900 are associated with precursors or subunits of 7S and 11S storage proteins. In addition, amino acid synthesis involves several complex processes [
35] and can be regulated by shikimate dehydrogenase (
Glyma.03g242400), chorismate mutase (
Glyma.06g061700), arogenate dehydratase (
Glyma.12g072500), asparagine (
Glyma.20g025400), and aminotransferase (
Glyma.15g012300). Gene expression patterns for the candidate genes, obtained from
https://www.soybase.org/soyseq/ (accessed on 1 November 2022), are shown in
Table S1. Expression of the candidate genes varied across tissues and stages of seed development; in particular,
Glyma.03g239700 was intensively expressed during the period of seed development. This information suggests that the candidate genes detected in the present study may directly or indirectly affect amino acid contents.