**4. Discussion**

The seed ratio (length-to-width ratio) trait screened in this study has been reported to have very high broad-sense heritability in recently published peanut research: Zhang et al. [72] reported a broad-sense heritability of 0.81 in peanuts. Among other legume crops, Hu et al. [73] reported very high broad-sense heritability, ranging from 92.46 to 96.25, for three traits related to seed shape in soybeans. If a phenotypic trait has high heritability, the influence of environmental factors is likely to be relatively small, and genes (or QTL) with relatively large effects on the trait may be identifiable even if the trait was not measured under the same conditions. Even in this case, however, the influence of environmental conditions cannot be overlooked.
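
For reference, broad-sense heritability is conventionally defined as the share of phenotypic variance attributable to genotypic variance. The decomposition below is the standard textbook form, not one taken from the cited studies, and it assumes no genotype-by-environment interaction term:

$$
H^2 = \frac{\sigma_G^2}{\sigma_P^2} = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2}
$$

Under this simplification, a value of 0.81 corresponds to roughly 19% of the phenotypic variance being attributed to environmental and residual effects.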

The genotyping data from the 58K SNP array chip could play an important role in understanding the evolutionary history of peanuts and the domestication of cultivated peanut [74]. Application of the array chip has also shown it to be a powerful and reliable tool for peanut germplasm background selection and evolutionary studies [75]. The present study is the first to conduct a GWAS using a large number of Korean peanut germplasms together with the USDA peanut core collection and high-density SNP chip data, which can be used toward increasing the genetic diversity of the US peanut germplasm collection.

The cultivated peanut species (*A. hypogaea*) is known to have originated in the region from southern Bolivia to northwestern Argentina, based on the occurrence of the two progenitor species, *A. duranensis* and *A. ipaensis*, and on archaeological evidence gathered in those regions [76–78]. Researchers have also suggested that the eastern slopes of the Cordillera may be a possible area of origin for *A. hypogaea* because of the favorable environment for peanut growth [78,79]. However, the present study showed that peanuts from South America, generally regarded as the center of origin of peanuts, differ significantly in genetic terms from peanuts of other regions, including South Korea.

The evaluation of evolutionary relationships among all 384 peanut germplasms indicated that most accessions from South Korea and South America separated into two distinct groups, which were also distinct from peanuts of other origins. This might indicate a great genetic difference between the peanut germplasms from South Korea and South America. Given the likely lack of exchange between South Korean peanut germplasms and others, an independent breeding history shaped by human selection and/or environmental influences over a long period may have caused these genetic differences.

In human genetic association studies with high-dimensional genomic data, regularization methods such as LASSO and elastic-net have been widely applied to identify outcome-related genetic sites and genes, as they have certain advantages over univariate analysis. First, regularization methods can easily handle highly correlated genomic measurements and covariate effects because they are based on a regression model. Second, the majority of regularization methods have been implemented in very efficient computational algorithms, such as the R packages 'glmnet' and 'gglasso'. These packages can detect outcome-related genetic sites and genes in less than a minute for genomic data with more than 100K dimensions. Lastly, there are various types of regularization methods that can be applied to different types of genomic data. For example, we applied LASSO and elastic-net to SNP data in the GWAS or QTL analysis, whereas sparse group LASSO [80] and network-based regularization [81] are better suited to group-structured genomic data, such as gene expression data and DNA methylation data. Despite these advantages, regularization methods have rarely been applied to detect QTLs or genes of interest in crops. In this study, LASSO was able to identify potentially outcome-related SNPs that were not identified by general GWAS methods, although further validation studies are required for these SNPs.
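
The way LASSO yields a sparse set of candidate SNPs can be illustrated with a minimal sketch. It assumes additively coded genotypes and uses a plain cyclic coordinate descent written for illustration; it is not the 'glmnet' routine and not the analysis pipeline of this study. The function names, the simulated panel, the causal SNP indices, and the penalty value are all hypothetical.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(genotypes, phenotype, penalty, n_sweeps=100):
    """Minimise (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent.

    genotypes: n rows, each a list of p allele dosages (assumed coding 0/1/2)
    phenotype: n trait values
    Returns p coefficients; non-zero entries mark the selected SNPs.
    """
    n, p = len(genotypes), len(genotypes[0])
    # Centre each SNP column and the phenotype so no intercept is needed.
    col_means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    x = [[row[j] - col_means[j] for j in range(p)] for row in genotypes]
    y_mean = sum(phenotype) / n
    residual = [value - y_mean for value in phenotype]
    col_sq = [sum(x[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # monomorphic SNP carries no information
            # Covariance of SNP j with the residual that excludes SNP j itself.
            rho = sum(x[i][j] * residual[i] for i in range(n)) / n
            rho += col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * x[i][j]
                beta[j] = new_beta
    return beta


def simulate_panel(n_samples=120, n_snps=30, causal=(3, 17), effect=1.0, seed=1):
    """Hypothetical inbred panel: homozygous calls only, two causal SNPs."""
    rng = random.Random(seed)
    genotypes = [[rng.choice((0, 2)) for _ in range(n_snps)]
                 for _ in range(n_samples)]
    phenotype = [sum(effect * row[j] for j in causal) + rng.gauss(0.0, 0.3)
                 for row in genotypes]
    return genotypes, phenotype


if __name__ == "__main__":
    geno, pheno = simulate_panel()
    coefficients = lasso_coordinate_descent(geno, pheno, penalty=0.3)
    print([j for j, b in enumerate(coefficients) if b != 0.0])
```

Because all SNPs enter one regression model, correlated markers compete for the same signal, which is one reason a penalized fit can retain SNPs that single-marker tests miss. In practice the penalty would be chosen by cross-validation rather than fixed by hand.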

Data filtering is a primary step in genome-wide association analysis, which involves huge amounts of data and requires strict quality control standards. Data filtering is divided into two parts, one for marker variables and another for individuals. The former considers the minor allele frequency (MAF), the degree of missing data, and heterozygosity, among others, whereas the latter mostly considers missing levels, population stratification, and independence among individuals [82]. The entire set of heterozygous SNPs is typically used in human GWAS analysis [83]. In peanuts, a high level of heterozygosity is not expected, as peanut is a self-pollinating crop with a low outcrossing rate ranging from 1.9% to 8% [84]. However, our array chip data showed a large number of heterozygous SNPs, which can affect the GWAS results. According to Figure 7, maximum heterozygosity rates of 5% to 20% gave the same GWAS results, with one significant marker at FDR 0.05, while maximum heterozygosity rates of 30% to 100% gave similar GWAS results, with three significant markers. Therefore, we filtered the genotype data with a maximum heterozygosity rate of 20%, so that fewer heterozygous SNPs were used in the analysis.
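
The marker-level filters described above can be sketched as follows. This is an illustration under stated assumptions, not the filtering code used in this study: genotype calls are assumed to be coded as alternate-allele counts (0, 1, 2, or `None` for missing), and the heterozygosity rate is assumed to be computed per SNP as the fraction of called individuals that are heterozygous. Only the 20% heterozygosity cutoff comes from the text; the MAF and missing-rate defaults and all function names are hypothetical.

```python
def snp_summary(calls):
    """Summarise one SNP from a list of calls (0, 1, 2, or None if missing)."""
    observed = [c for c in calls if c is not None]
    n_called = len(observed)
    if n_called == 0:
        return {"missing_rate": 1.0, "maf": 0.0, "het_rate": 0.0}
    alt_freq = sum(observed) / (2 * n_called)
    return {
        "missing_rate": 1.0 - n_called / len(calls),
        "maf": min(alt_freq, 1.0 - alt_freq),
        "het_rate": sum(1 for c in observed if c == 1) / n_called,
    }


def filter_snps(calls_by_snp, min_maf=0.05, max_missing=0.1, max_het=0.2):
    """Return the IDs of SNPs passing all three marker-level filters.

    min_maf and max_missing are placeholder thresholds (assumptions);
    max_het=0.2 mirrors the 20% cutoff described in the text.
    """
    kept = []
    for snp_id, calls in calls_by_snp.items():
        stats = snp_summary(calls)
        if (stats["maf"] >= min_maf
                and stats["missing_rate"] <= max_missing
                and stats["het_rate"] <= max_het):
            kept.append(snp_id)
    return kept


if __name__ == "__main__":
    # Toy data: ten individuals, five hypothetical SNPs.
    toy = {
        "snp_a": [0, 0, 2, 2, 0, 2, 0, 2, 2, 0],              # clean
        "snp_b": [1, 1, 1, 0, 2, 1, 0, 2, 1, 1],              # 60% heterozygous
        "snp_c": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],              # monomorphic
        "snp_d": [0, 2, None, None, None, 2, 0, 2, 0, 2],     # 30% missing
        "snp_e": [0, 2, 1, 2, 0, 2, 0, 2, 0, 1],              # 20% heterozygous
    }
    print(filter_snps(toy))
    print(filter_snps(toy, max_het=1.0))
```

Rerunning the same filter over a grid of `max_het` values and comparing the resulting significant markers is the kind of sensitivity check that Figure 7 summarizes.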

Carbon assimilated by photosynthesis is transported into seeds for multiple purposes, such as the biosynthesis of starch, oil, amino acids, and cellulose. The most important aspect of oil accumulation in developing seeds lies in the activation of metabolic pathways that drive incoming carbon into fatty acid biosynthesis at the expense of competing pathways. Within the ~300 kb genomic region associated with seed development, phosphoenolpyruvate (PEP) carboxylase (PEPC; Arahy.HT9EWH) was among eleven genes located within the LD of significant SNPs on chromosome Araip.B08 (Supplementary Table S6). PEPC catalyzes the conversion of PEP into oxaloacetate (OAA), a protein precursor [85]. OAA can be converted to malate and then to pyruvate (a precursor for oil). PEPC has been reported to regulate the metabolic network channeling glycolytic carbon into precursors for both oil and protein in soybean seed development [86]. The activation status of PEPC has been reported to play a key role in the partitioning of assimilates into the different storage products in barley (*Hordeum vulgare*), alfalfa (*Medicago sativa*), and fava bean (*Vicia faba*) [87–89]. In peanuts, researchers reported that the expression levels of PEPC genes were significantly associated with lipid accumulation [90]. In the present study, only fifteen annotated genes were identified within the genomic region as being highly associated with seed development through high-throughput GWAS analysis. Among them, the PEPC gene could have a strong causal effect within this region, as it is associated with diverse metabolic pathways, including protein and oil biosynthesis.
