1. Introduction
Soybean is cultivated worldwide in different latitudinal zones. In Brazil, the relative maturity group (RMG) classification was adapted for local conditions and is presently used by all South American public and private breeding companies [1]. Soybean is recognized as a species with a narrow genetic base, which can hinder the acquisition of genetic sources for economically important traits and of populations with significant genetic variability. While each cultivar has traditionally been associated with a relatively narrow latitudinal zone, the adaptability of soybean to diverse producing regions stems from the genetic variability present in crucial gene loci and quantitative trait loci (QTLs) responsible for regulating flowering and maturity. The primary genetic loci, E1–E11 and J, and several QTLs, such as Tof11/Gp11, Tof12/Gp1/qFT12-1, and qDTF-J, have been identified. In general, except for the E6, E9, E11, and J genes, the dominant allele of the E genes confers late flowering and maturity, whereas an increase in the number of recessive alleles leads to early flowering [2,3]. The genetic diversity of soybean has been reduced, mainly because of genetic bottlenecks related to domestication. In contrast, its wild relative Glycine soja, which grows under various environmental conditions, has retained significant genetic diversity [4]. Furthermore, soybean lineages have undergone distinct and individual selection based on geographical location, with numerous highly conserved regions among cultivated varieties because of domestication [5]. In countries with vast continental dimensions, such as Brazil, which is the world’s largest soybean producer, the segregation of soybean populations for genetic enhancement is often constrained by latitude, leading breeders to focus their efforts on similar genotypes. Consequently, crosses between cultivars with differing relative maturity groups can contribute to genetic diversity by exploring novel QTLs and loci of interest.
Genetic value prediction is paramount in genetic improvement programs as it exclusively encompasses the hereditary component of quantitative trait control that can be inherited by offspring. Consequently, acquiring insights into the genetic value of individuals constitutes a critical facet of breeding programs, ensuring the realization of genetic advancements [
6]. During the selection phase, breeders must discern the genetic potential of individual candidates and make decisions regarding the improvement of specific genotypes, all grounded in empirical data. It is imperative to possess accurate selection predictions to systematically evaluate individuals within segregated populations and pinpoint superior genotypes [
7].
Genetic variability is estimated from variance components obtained using the restricted maximum likelihood (REML) method, and the best linear unbiased prediction (BLUP) of genetic values is preferred because it maximizes selective accuracy compared to parametric statistical methodologies [8]. Using parametric statistical methods necessitates assumptions about the probability distribution of the variables and often assumes that the phenomenon under study is linear. However, this can result in inefficiencies in the analysis since these ideal conditions are not always met when collecting data within genetic improvement programs.
In contrast to both parametric and non-parametric analyses, artificial neural networks (ANNs) offer distinct advantages that render them better suited for particular scenarios. Consequently, they play a valuable role in the selection and development stages, as described in [9], and exhibit a high predictive capacity, as demonstrated in [10]. ANNs are machine-learning models inspired by the human brain and have shown promise in various areas, such as pattern recognition, natural language processing, and computer vision. They can learn from raw data without prior knowledge of the domain or specific problems and can handle incomplete or noisy data.
ANNs have been used in various areas of agriculture, such as the prediction of crop productivity [10,11,12], soil attributes, and image interpretation. ANNs have shown high efficiency in predicting genetic values compared to other methodologies in several studies [13,14,15]. In addition, methodologies using fractal analysis have been applied in plant breeding, reducing human error during the breeding process [16].
Therefore, the present study aims to evaluate the efficiency of ANNs in predicting genetic values of soybean genotypes derived from broad and narrow crosses for the trait of the relative maturity group (RMG). These results may provide critical information for developing new, more productive soybean varieties adapted to Brazilian conditions.
2. Materials and Methods
Three soybean populations with different territorial adaptabilities based on their relative maturity group (RMG) classification [1] were evaluated in the Soybean Breeding Program at the “Júlio de Mesquita Filho” State University in Jaboticabal, São Paulo.
The cross between the cultivars BMX Veloz (RMG 5.0) and BRS 278 RR (RMG 9.4) resulted in the “WRMG” population, characterized by wide adaptability and coverage of the critical photoperiod for latitudes between 23° S (Subtropical) and 0° S (Tropical). The “WRMG” population comprised 220 F4, 252 F5, and 252 F6 progenies in 2017, 2018, and 2019, respectively.
The “Subtropical” population was established from the cross between the cultivars BMX Energia (RMG 5.3) and BMX Potência (RMG 6.7), with a critical photoperiod for the southern region of Brazil, corresponding to latitudes between 23° S (Subtropical) and 20° S (Tropical). The “Subtropical” population comprised 120 F5, 168 F6, and 168 F7 progenies in 2017, 2018, and 2019, respectively.
The “Tropical” population was obtained by crossing the cultivars BRS 245 RR (RMG 7.3) and BRS 278 RR (RMG 9.4), with a critical photoperiod for the northern region of Brazil, corresponding to latitudes between 20° S (Tropical) and 0° S (Tropical). The “Tropical” population comprised 60 F5, 60 F6, and 104 F7 progenies in 2017, 2018, and 2019, respectively.
The evaluations of the three populations across the three agricultural years took place at the Teaching, Research, and Extension Farm (FEPE) situated within São Paulo State University (UNESP) on the Campus of Jaboticabal (FCAV), São Paulo, Brazil. The location is positioned at a latitude of 21°15′19″ south and a longitude of 48°19′21″ west, with an altitude of 615 m. This region offers ideal photoperiod conditions for soybean genotypes within the relative maturity group (RMG) 6–8 range. This suitability is attributed to the prolonged rainy season in the region, spanning from November (spring) to April (fall), which permits the cultivation of soybean genotypes with a maturity cycle ranging from 90 to 150 days.
The control varieties for each population were their parents, as well as TMG 7262 RR (RMG 6.2), TMG 1174 RR (RMG 7.4), and TMG 1179 RR (RMG 7.9), for all agricultural years. The experimental design used was the augmented block design described by [17], in which the progenies were arranged in plots consisting of one five-meter-long row with a spacing of half a meter between rows. Controls, parents of each population, and two other commercial cultivars were randomly allocated to each experimental block. The planting density was 15 seeds per meter, and all cultivation practices followed the technical recommendations for soybean culture [18]. Data were collected from five visually selected plants per plot.
The evaluated traits were total crop cycle (MATURITY), assessed by the number of days counted from germination until harvest at the R8 development stage [19]; grain yield (GY), obtained by the weight in grams of the grains from the five selected plants in each plot after harvest and processing; the number of days to flowering (NDF), counted from germination until full flowering in the field (stage R2 [19]); the height of first pod insertion (AIV), i.e., the height measured from the ground to the first productive pod of the plant, in cm; and plant height at maturity (APM), i.e., height from the ground to the last fertile pod of the plant, also expressed in cm.
Data were analyzed for each population using R version 4.0.2 [20] via the mixed model approach proposed by [21].
The variance components were estimated using the restricted maximum likelihood (REML) approach [22]. Fixed effects were checked for significance using the F-test, and the significance of variances associated with random effects was verified using the likelihood ratio test. Heritability, experimental coefficient of variation (CV), and selection accuracy (rgg) were calculated as described by [22].
The developed neural network is a multilayer perceptron (MLP) with an input layer, two hidden layers, and an output layer. The input layer comprised the five agronomic traits, the population identification (POP), and the three agricultural years, totaling nine input units. The categorical variable (year) had to be converted into a one-hot representation.
The input layer has nine neurons, the hidden layers have 64 and 128 neurons, and the output layer has two neurons, one corresponding to MATURITY and the other to grain yield (GY).
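As a minimal sketch of how such a nine-element input vector could be assembled, the snippet below one-hot encodes the agricultural year and concatenates it with five trait values and a population code. The function names, the trait order, and the numeric coding of POP are illustrative assumptions and are not taken from the study.

```python
# Hypothetical helper names; trait order and POP coding are assumptions.
YEARS = (2017, 2018, 2019)


def one_hot_year(year):
    """Return a 3-element indicator vector for the agricultural year."""
    if year not in YEARS:
        raise ValueError(f"unknown agricultural year: {year}")
    return [1.0 if year == y else 0.0 for y in YEARS]


def build_input(traits, pop_code, year):
    """Concatenate 5 trait values, 1 population code, and 3 year indicators."""
    if len(traits) != 5:
        raise ValueError("expected five agronomic trait values")
    return list(traits) + [float(pop_code)] + one_hot_year(year)


# Made-up trait values for a single plant evaluated in 2018.
x = build_input([121.0, 33.5, 45.0, 12.0, 80.0], pop_code=1, year=2018)
print(len(x), x[-3:])
```

With this layout, each year occupies its own input unit, so the network does not treat the years as an ordered numeric scale.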
The dataset consisted of 6158 examples. The MLP network was built in Python 3.6 using Keras as a front end, TensorFlow 2.3.0 as a back end, and Scikit-learn 0.22.2. The dataset was split into k = 10 partitions, where k − 1 partitions were used for training and the remaining partition was used to test the model (k-fold cross-validation). Thus, ten models were created, with the training and test data changing at each iteration [23].
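The partitioning scheme can be illustrated with a short standard-library sketch. The function name `k_fold_indices` and the shuffling seed are assumptions made for illustration; Scikit-learn offers its own splitters for this purpose, and the text does not state which one was used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten train/test splits over a dataset of 6158 examples.
folds = list(k_fold_indices(6158, k=10))
```

Every example appears in exactly one test partition, so each of the ten models is evaluated on data it did not see during training.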
The final evaluation of the models was based on the correlation between the observed and predicted values, using the R² and RMSE parameters [24]. The selected activation function was a logistic sigmoid function. The output layer used the softmax function [25].
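The two evaluation metrics can be sketched as follows, assuming that R² is taken as the squared Pearson correlation between observed and predicted values (one common reading of the description above) and that RMSE is the square root of the mean squared difference. The function names and the example values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error, in the same units as the trait."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Made-up days-to-maturity values, for illustration only.
observed = [118.0, 121.0, 125.0, 130.0]
predicted = [119.0, 120.0, 126.0, 129.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```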
Backpropagation is the standard algorithm for training this type of network. It is an efficient method for calculating the partial derivatives of the error with respect to the weights of each layer. The weights are then updated using gradient descent, which aims to minimize the error produced by the network. The algorithm first applies the data and passes them forward through the successive layers (the “forward pass”). Then, it calculates the error in the output layer and propagates it backwards (the “backward pass”). These steps are repeated until the error is as small as possible [26].
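The forward pass, backward pass, and weight update can be sketched for a deliberately small network. This is an illustrative simplification, not the network used in the study: it assumes one sigmoid hidden layer with four units, a linear output, a squared-error loss, no bias terms, and a single made-up training example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w_hidden, w_out, x, target, lr=0.1):
    """One forward pass, one backward pass, and one gradient-descent update."""
    # Forward pass: input -> sigmoid hidden layer -> linear output.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(w * h for w, h in zip(w_out, hidden))
    error = y - target  # derivative of 0.5 * error**2 with respect to y

    # Backward pass: propagate the output error to the weights of each layer.
    grad_out = [error * h for h in hidden]
    delta_hidden = [error * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]

    # Gradient-descent update of both weight sets.
    new_w_out = [w - lr * g for w, g in zip(w_out, grad_out)]
    new_w_hidden = [[w - lr * g for w, g in zip(row, g_row)]
                    for row, g_row in zip(w_hidden, grad_hidden)]
    return new_w_hidden, new_w_out, 0.5 * error ** 2


rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]
w_out = [rng.uniform(-0.5, 0.5) for _ in range(4)]
x, target = [0.2, 0.7, 1.0], 0.5  # made-up example

losses = []
for _ in range(200):
    w_hidden, w_out, loss = train_step(w_hidden, w_out, x, target)
    losses.append(loss)
print(losses[0], losses[-1])
```

The loss recorded at each step shrinks as the two passes are repeated, which is the behavior the paragraph above describes.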
Stochastic gradient descent (SGD) is an efficient method for performing gradient descent [27]. In practice, the optimizers used are variations of SGD. This study used the adaptive moment estimation (Adam) optimizer [28]. The number of training cycles was set to 600. Care was taken to limit the number of iterations such that it did not become excessive, which could lead to a loss of generalization power.
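For readers unfamiliar with Adam, the sketch below applies its standard update rule (exponential moving averages of the gradient and of its square, with bias correction) to a toy quadratic objective. The learning rate, the objective, and the function name `adam_minimize` are illustrative assumptions; only the count of 600 iterations echoes the text, and the hyperparameters actually used in the study are not reported here.

```python
import math


def adam_minimize(grad_fn, theta, steps=600, lr=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function with the Adam update rule, given its gradient."""
    m = [0.0] * len(theta)  # moving average of the gradient
    v = [0.0] * len(theta)  # moving average of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * gi
            v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def toy_gradient(th):
    """Gradient of the toy objective f(a, b) = (a - 3)^2 + (b + 1)^2."""
    return [2 * (th[0] - 3), 2 * (th[1] + 1)]


theta = adam_minimize(toy_gradient, [0.0, 0.0])
print(theta)
```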
The efficiency comparison in predicting genetic values between the mixed models REML/BLUP and artificial neural networks was performed using the coincidence index (%) of the 10 and 20% best genotypes for each trait, according to each methodology, and the gain with selection, considering a selection intensity of 20%. For the total cycle trait (MATURITY), genotypes with lower additive genetic values were selected to choose earlier genotypes, and for grain yield, genotypes with higher additive values were selected to increase the estimates.
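The comparison can be sketched as follows, under two assumptions that the text does not spell out: the coincidence index is the percentage of genotypes selected by one method that are also selected by the other, and the gain with selection is the mean predicted genetic effect of the selected genotypes expressed as a percentage of the overall trait mean. All function names and genotype values are hypothetical.

```python
def top_fraction(values, fraction, lowest=False):
    """Return the set of genotype IDs in the best `fraction` of the ranking."""
    n_selected = max(1, round(len(values) * fraction))
    ranked = sorted(values, key=values.get, reverse=not lowest)
    return set(ranked[:n_selected])


def coincidence_index(values_a, values_b, fraction, lowest=False):
    """Percentage of genotypes selected by method A that B also selects."""
    a = top_fraction(values_a, fraction, lowest)
    b = top_fraction(values_b, fraction, lowest)
    return 100.0 * len(a & b) / len(a)


def selection_gain_percent(values, fraction, overall_mean, lowest=False):
    """Mean genetic effect of the selected genotypes, in % of the trait mean."""
    selected = top_fraction(values, fraction, lowest)
    mean_selected = sum(values[g] for g in selected) / len(selected)
    return 100.0 * mean_selected / overall_mean


# Made-up predicted genetic effects (deviations from the mean) for 10 genotypes.
blup = {"G1": 2.1, "G2": 1.8, "G3": 0.4, "G4": -0.3, "G5": -0.9,
        "G6": 1.2, "G7": -1.5, "G8": 0.0, "G9": -2.0, "G10": -0.8}
ann = {"G1": 2.5, "G2": 0.9, "G3": 0.6, "G4": -0.2, "G5": -1.1,
       "G6": 1.9, "G7": -1.2, "G8": 0.1, "G9": -2.4, "G10": -1.1}

# Higher values are selected for GY; lower values (lowest=True) for MATURITY.
print(coincidence_index(blup, ann, 0.20))
print(coincidence_index(blup, ann, 0.20, lowest=True))
print(selection_gain_percent(blup, 0.20, overall_mean=33.56))
```

The `lowest` flag mirrors the selection direction described above: earlier genotypes for MATURITY and higher-yielding genotypes for GY.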
3. Results
The “WRMG” population exhibited significant genetic variance for GY and MATURITY across the three agricultural years, corresponding to the various generations (as shown in
Table 1). Five of the six estimated heritabilities fell within the medium-to-high range, spanning from 0.33 to 0.82. Regarding the MATURITY trait, the average ranged from 130 days in the F4 generation to 121 days in the F6 generation, following the selective process applied to the earliest plants. In the case of GY, there was a remarkable increase of over 100% by the conclusion of the selection process. In the F4 generation, the average GY per plant was 17 g, and this increased to 39 g per plant in the F6 generation.
The “Subtropical” population showed significant genetic variance for GY and MATURITY, except for GY in F6. The heritabilities of the six estimates were moderate to high, ranging from 0.31 to 0.68. The “Subtropical” population with earlier and less productive progenies in F5 was selected for the cycle of 115 days and 39 g per plant, which is ideal for the evaluated region.
The “Tropical” population presented significant genetic variance for GY only in the first generation and for MATURITY only in the last generation, starting from two non-significant generations. Only the heritability of GY and MATURITY in generation F7 and GY in F6 were medium to high. The “Tropical” population with lines with high maturities in F5 (131 days) and optimal productivity (39 g/plant) was selected for lower maturities, with 104 days in generation F7 and a productivity of 35 g/plant.
Experimental precision, verified through accuracy and environmental coefficient of variation (CV) estimators, varied among generations and evaluated traits. Accuracy was higher for MATURITY than for GY for the “WRMG” population. For the “Subtropical” and “Tropical” populations, the accuracy estimates for generations F7 and F6 were higher for GY than for MATURITY.
The coefficient of environmental variation was consistently higher for GY than for MATURITY in the same year, which was consistent with these traits when evaluated on a per plant basis and not on the plot mean. In addition, CVs are inherent to the traits themselves.
The correlation estimates (R²) obtained using the MLP algorithm of the ANN between the observed and predicted data exceeded 0.999, indicating a remarkably high predictive capacity for both GY and MATURITY, as detailed in Table 2. The models exhibited the lowest RMSE values for GY, with 0.077 during training and 0.076 during validation. For MATURITY, these values were 0.2407 during training and 2.6106 during validation. It is worth noting that these RMSE values share the same units as the variables under investigation. The overall mean GY of 33.56 g per plant corresponds to an error of merely 0.46%. In the case of maturity, where the mean was 121 days, the error in the validated models amounted to 4.2%.
The predictive capacity of the MLP-ANN was superior to that of RR-BLUP in predicting the genetic value of plants for both grain yield (GY) and MATURITY, as shown in Table 3.
The similarity in the classification of the best genotypes indicated by both methodologies was observed using the coincidence index with two percentages (
Table 4). For the MATURITY variable, the percentages ranged from 30.77% (F5 population “Subtropical”) to 100% (F7 population “Subtropical”), considering the selection intensity of 10%, and from 63.16% (F4 population “WRMG”) to 92% (F5 population “WRMG”), considering the selection intensity of 20%. This demonstrates that, for MATURITY, a lower selection intensity may allow for similar genetic gains. For GY, the lowest coincidence for the 10% intensity occurred for F5 in the “WRMG” population (68.18%). The highest for F4 in the “WRMG” population was 89.47%. Considering a selection intensity of 20%, the coincidence percentages were 68.18% (F7 population “Tropical”) and 87.50% (F5 population “Subtropical”). This demonstrates that genetic gains for GY can be similar when applying both methodologies.
It was possible to observe a similarity between the genotypes indicated as the best by both methodologies, although there was a significant divergence in the ranks they occupied. For the trait MATURITY in the “Tropical” population in 2018, there was no ordering of the best genotypes due to zero genetic variance by the analysis using mixed models. According to the ANN analysis, even with minor differences, genotypes were ordered.
The expected gains from the selection, considering an intensity of 20%, are listed in
Table 5. For GY, the highest gains were obtained by the progenies ranked according to the artificial neural network prediction, reaching 11.91% for the “Tropical” population, while for the BLUP prediction, the highest gain was 4.43% for the “WRMG” population. For MATURITY, in which the goal is to reduce the total crop cycle, differences were observed between the methodologies, with the highest reduction being −5.42 (ANN, “WRMG” population) and the lowest reduction being −1.49 (BLUP, “Subtropical” population). The expected gains with ANN-MLP compared to RR-BLUP were 30–110% higher for MATURITY and 90–500% higher for grain productivity.
4. Discussion
The results of the predictive capability of ANN-MLP reveal its aptitude for capturing common nonlinear interactions in quantitative genetic traits. This proficiency stems from its capacity to account for non-additive effects, particularly in the context of grain yield and MATURITY. Importantly, the predictive ability of ANN-MLP extends to soybean populations exhibiting both wide genetic variability and a narrow genetic base. Various authors have demonstrated the superiority of ANN over mixed models. For instance, a study on flowering traits in beans [
29] highlighted the effectiveness of ANN. Similarly, simulated data were used to showcase how ANN excels in capturing epistatic effects [
30].
The MLP neural network was used to estimate soybean yield through its production components, such as plant height (PH), number of branches per plant (B), number of pods per plant (P), number of seeds per pod (S), and weight of 1000 seeds (WTS) [
31]. In this supervised training MLP neural network, the correlation was 0.848 for the validation of grain productivity, with considerable accuracy, using the information on the agronomic traits of the plant, growth habits, and population density of soybean crops. According to [
10], MLP has proven to be more efficient when using a relatively small dataset and for generalist or unsupervised problems; furthermore, MLP is efficient with one or a few hidden layers, as is typical of shallow neural networks. The best ANN model tested was highly accurate and able to correctly classify all genotypes, replicating the selection made by the geneticist during the BLUP simulation [
32]. This indicates that ANN can be a valuable tool in plant breeding, assisting in the selection of genotypes with greater efficiency and accuracy.
Corn productivity was predicted using an artificial neural network and the construction of multilayer perceptron (MLP) models using public data and experimental networks of corn [
10]. The models with data imputation were more accurate than those without imputation, and the model with climatic data/SWB had the lowest RMSE of 71 kg ha⁻¹.
ANN has also been used to predict soybean and maize yields by comparing the prediction capacities of models at the state, regional, and local levels; it was concluded that the ANN models for maize had a correlation of 0.877 and an RMSE of 1036 kg/ha, and for soybean, the correlation was 0.64 and the RMSE was 1356 kg/ha [
33].
The partial similarity in indicating the best genotypes for MATURITY and GY between the two prediction methodologies indicates the high efficiency of ANN as the prediction by mixed models is based on assumptions and considers several genetic and environmental parameters [
15].
The R² values for ANN-MLP validation were 4 times higher than those for RR-BLUP for days to maturity and 128 times higher for yield per plant, showing the efficiency of ANN-MLP compared to RR-BLUP, according to [
29]. The same authors also identified the efficiency of ANN-MLP compared with RR-BLUP in predicting the capacity for flowering traits in black beans.
The efficiency of the predictive model created using the neural network was verified according to the R² parameter (correlation between observed and predicted values), which can range from 0 to 1, indicating a higher correlation the closer it is to 1, and the RMSE parameter (root-mean-squared error), which is non-negative, is expressed in the units of the trait, and indicates a lower error and higher efficiency the closer it is to 0. The high positive estimates of R² (above 0.998) and the low magnitude of RMSE for maturity (0.241) indicate good accuracy of the model, a low magnitude of error, and no tendency to over- or under-predict values.
The results obtained corroborate the results from other studies that neural networks, unlike traditional REML/BLUP models, allow the capture of nonlinear relationships from data information, and thus more effectively capture the non-additive effects associated with genetic control of productivity and maturity traits, as for other traits, such as flowering in bean cultivars [
29,
34]. Deep neural networks (DNNs), such as convolutional networks, are applied to huge datasets in order to fit networks with several hidden layers [11]. The popular BP, RBF, GN, GRNN, SVM, and SVR models, as well as MLP, traditionally use numerical data and one or two hidden layers, which are suitable for more specific situations.
Autogamous species, such as soybean, exhibit non-additive or epistatic effects, owing to their high level of homozygosity, which is observed in different species, such as common beans, rice, barley, and sorghum [
29]. Therefore, when parametric models are used, the prediction of genetic values for both MATURITY and grain productivity may have low accuracy.
The variation in genotype rankings can be attributed to the neural networks’ capacity to comprehend intricate data traits and rely on experiential knowledge for genetic value predictions. This unique feature of neural networks also clarifies the ordering of the F5 genotypes in the “Tropical” population, even in the absence of genetic variance, as determined by REML/BLUP analysis.
The lower coincidence percentages observed in the populations during the first year compared to those in the third year for the MATURITY trait can be attributed to greater variability within populations during early generations, in which segregation processes are still ongoing. This heightened variability results in divergent rankings when employing each methodology.
Based on the estimates of the expected gains with selection, it is possible to predict the success of selecting specific populations. The neural network was always superior to RR-BLUP for both traits and all populations.
MLP has been previously applied in different areas, such as weed science [
35] or drought tolerance [
36]. Soybean productivity has also been estimated using various machine-learning algorithms, such as multilayer perceptron, support vector machine (SVM), and random forest (RF), using spectral reflectance data [
37]. The authors concluded that the MLP is efficient for soybean breeding.
Developing increasingly productive and resistant cultivars depends directly on the genetic variability in selected populations [
22,
38]. The results of this study indicate the presence of variability among the progenies in most of the cases evaluated. However, low estimates of genetic variance may be explained by the narrow genetic base of soybean, as pointed out by several authors [
39]. This may imply low variability within the “Subtropical” and “Tropical” populations and the existence of relatedness among parents, especially in the case of populations derived from biparental crosses.
The non-significant genetic variances observed for MATURITY in the “Tropical” population can be attributed to the intricate interactions between the long-juvenile trait in the late parents and the E allelic series. These interactions result in reduced variability when evaluated under our specific conditions [
40]. It is worth noting that in individual analyses, genetic variance, heritability, and accuracy parameters may be either underestimated or overestimated if genotype × environment interactions are not considered, especially for quantitative traits. Similar genotype–environment interactions were also highlighted by [
1]. In the case of the augmented block design, the absence of genotype repetitions within and between years could potentially account for the variation in parameter estimates and their relatively lower magnitude [
41].
The heritability estimates indicate the possibility of selecting for MATURITY with the potential for more significant genetic gains than GY in the “WRMG” population while maintaining the same proportion of selected individuals. The “Subtropical” population allows for the selection of both MATURITY and GY with similar genetic gain possibilities, but this is lower than the “WRMG” population for MATURITY. However, for the “Tropical” population, the genetic gains for MATURITY will be very low, and for GY, they could be intermediate to high.
The “WRMG” population with genetic variability for maturity and GY stood out with an efficient selection for reducing maturity and increasing grain yield, demonstrating the potential for generating lines, combining earliness with high productivity.
The “Subtropical” population, originating from crosses of cultivars adapted to higher latitudes, presented genetic variability for maturity and GY, standing out with a low average for maturity and productivity in the F5 generation. However, after an efficient selection for increased maturity combined with increased productivity, it demonstrated the potential to generate earlier maturing lineages than the “WRMG” population but with similar productivity.
The “Tropical” population originated from crosses of cultivars adapted to lower latitudes and presented a longer average cycle length and high productivity. However, after intense selection for cycle reduction, super-early lines were generated within 104 days and a high grain productivity of approximately 35 g per plant. Nevertheless, the intensity of selection prioritized for MATURITY to adapt the lines to the region caused a drastic reduction in GY genetic variability, which was insignificant in F7.
The three populations produced highly productive soybean lines with variable and appropriate MATURITY for the region and strategic crop management for early and late sowing. Thus, the populations showed both genetic variability and high means of lines suitable for MATURITY and GY, proving that populations with broad and restricted latitude adaptations are ideal for extracting new lines in intermediate geographic regions of latitudes.
For GY, the gains obtained by the progenies ordered according to the neural network prediction were higher than those obtained using BLUP ordering by 83.4% in the “Tropical” population, 59.3% in the “Subtropical” population, and 47.7% in the “WRMG” population.
For MATURITY, gains based on the ordering performed by the neural network were higher than those based on the mixed models by 53.3% for the “WRMG” population, 36.2% for the “Tropical” population, and 25.1% for the “Subtropical” population. Although the percentage gain was higher for the population with wider crossings for the trait, it can be considered that in this situation, there was a more significant reduction in variability due to the selection directed to the evaluation site, whose ideal RMG is between six and eight. Although there may be an overestimation by the neural network in predicting genetic values, the estimates obtained for both MATURITY and GY agree with those obtained by [
42] for soybean crops by applying the same selection intensity.
Both prediction methodologies have demonstrated the viability of successful selection based on gain estimates. The selection of genotypes for Brazil, specifically between the northern and southern regions, was grounded in MATURITY, GY, or a combination of both traits. This choice was made due to the high and relatively similar genetic means and variances observed. Opting for selection within a single environment can effectively pre-screen soybean genotypes, enhancing their potential adaptability across a broader experimental network. Moreover, it offers the advantage of significantly reducing costs within breeding programs.