### *2.3. Yield Stability and Genotype × Environment Interaction*

The environment (climate and soil), agricultural operations (sowing, cultivation, and harvesting), and plant genotype are the factors that affect the yield and productivity of crops. The direct and indirect relationships and interactions between these factors create a complex situation that determines the potential yield of plants [39]. Environmental variation and the genotype × environment interaction (GEI) cause year-to-year variation in the yield and phenotypic traits of a specific genotype. The choice of a genotype for a target trait is a complex and difficult task because of the GEI, as different genotypes respond differently to varied environmental conditions. Estimating the relative performance of genotypes across environments through stability analysis is a well-established way to deal with these yearly variations [81]. Finlay and Wilkinson's regression analysis and regression coefficient [82], Eberhart and Russell's deviation from regression (S<sup>2</sup><sub>di</sub>) [83], Wricke's ecovalence (W<sub>i</sub>) [84], Shukla's stability variance [85], the coefficient of variation (CV) [86], and Lin and Binns' cultivar performance measure [87] are classical univariate approaches used for the assessment of the GEI. Linear regression analysis and variance components are the main elements of these methods [88]. Apart from the aforementioned statistical methods, the sustainable yield index (SYI) [89] is used to evaluate the effects of agricultural practices on crop yield sustainability [90]. All these methods are parametric, and therefore, the assumptions about the distribution of the data and the homogeneity of variance should be checked before they are applied [91]. There are also nonparametric univariate methods to evaluate the GEI, including the S<sub>i</sub><sup>(1)</sup>, S<sub>i</sub><sup>(2)</sup>, S<sub>i</sub><sup>(3)</sup>, and S<sub>i</sub><sup>(6)</sup> stability parameters [92,93], the Kang parameter [94], Ketrank and Ketyield plots [95], Fox-rank [96], and Star [91].
These nonparametric stability statistics are analytical clustering procedures that determine the stability of genotypes on the basis of ranks rather than raw data and are free from modeling assumptions. A genotype is considered stable if its ranking is relatively constant across environments [91].
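
As a rough illustration of the univariate statistics named above, the following sketch computes Wricke's ecovalence, the Finlay-Wilkinson regression slope, the coefficient of variation, and a simple rank variance for a balanced genotype × environment table. The function names and the yield values are hypothetical, and the rank variance is computed on uncorrected yields without tie handling, so it is only in the spirit of the S<sub>i</sub><sup>(2)</sup> statistic and is not a reference implementation of any cited method.

```python
from statistics import mean, stdev, variance


def stability_stats(y):
    """Parametric stability statistics for a balanced genotype x environment table.

    y[i][j] is the mean yield of genotype i in environment j (hypothetical data).
    """
    n_g, n_e = len(y), len(y[0])
    g_mean = [mean(row) for row in y]                                 # genotype means
    e_mean = [mean(y[i][j] for i in range(n_g)) for j in range(n_e)]  # environment means
    grand = mean(g_mean)                                              # valid for balanced data
    env_index = [m - grand for m in e_mean]                           # sums to zero
    ss_index = sum(v * v for v in env_index)

    stats = []
    for i, row in enumerate(y):
        # Wricke's ecovalence: squared GE interaction effects summed over environments
        ecovalence = sum(
            (row[j] - g_mean[i] - e_mean[j] + grand) ** 2 for j in range(n_e)
        )
        # Finlay-Wilkinson slope: regression of genotype yield on the environmental index
        fw_slope = sum(row[j] * env_index[j] for j in range(n_e)) / ss_index
        # Coefficient of variation across environments, in percent
        cv_percent = 100 * stdev(row) / g_mean[i]
        stats.append(
            {"ecovalence": ecovalence, "fw_slope": fw_slope, "cv_percent": cv_percent}
        )
    return stats


def rank_variance(y):
    """Variance of each genotype's within-environment rank (ties are not handled)."""
    n_g, n_e = len(y), len(y[0])
    ranks = [[0] * n_e for _ in range(n_g)]
    for j in range(n_e):
        order = sorted(range(n_g), key=lambda i: y[i][j])  # rank 1 = lowest yield
        for r, i in enumerate(order, start=1):
            ranks[i][j] = r
    return [variance(row) for row in ranks]


yields = [[4.0, 5.0, 6.0], [5.0, 6.0, 7.0], [3.0, 6.5, 9.0]]  # made-up values
for g, s in enumerate(stability_stats(yields)):
    print(g, {k: round(v, 3) for k, v in s.items()})
print(rank_variance(yields))
```

In this toy table the first two genotypes differ only by a constant, so they share the same ecovalence and slope, whereas the third genotype responds more strongly to the environmental index and changes rank across environments.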

Principal component analysis, cluster analysis, additive main effects and multiplicative interactions (AMMI), and genotype plus genotype × environment interaction biplot (GGE) are multivariate procedures enabling examination of multidirectional aspects of the GEI by imaging the response of a genotype in an *E*-dimensional space [91]. Multivariate stability analyses are more powerful and precise than univariate approaches. However, these are complex methods that do not provide a simple measure of yield stability for a reliable ranking of genotypes. Limited access to software is another bottleneck of these methods [91]. In a recent study, both linear and nonlinear regression models were applied to estimate the influence of climate variables (precipitation, sunshine duration, average relative humidity, maximum temperature, minimum temperature, and average temperature) on the growth and yield-related characteristics of cotton (the cotton height at the flowering stage, stalk weight, yield of cotton seeds, and lint percentage). The authors found that the interpretation of linear regression equations was generally lower than the interpretation of nonlinear equations [97]. There was a linear relationship or a relatively complex nonlinear relationship between the cotton growth indicators and climate variables in one site of their study, but they did not find the best equations for the cotton growth indices and the influence of climate variables on the cotton growth indices at several sites. In addition, the authors developed one regression model for each condition [97]. When several independent variables and several dependent variables are of interest, i.e., multiple-independent variables versus multiple-dependent variables, ANN can reduce the required analyses and result in higher accuracy [25]. 
An ANN model may be able to identify the best-fitting relationships in all studied environments faster and more precisely, particularly when additional factors such as soil and cotton properties are taken into account. Plant growth indices and climate variables could be entered into an ANN model as dependent and independent variables, respectively, so that both linear and nonlinear relationships between the variables can be captured. Regression approaches have well-recognized statistical and biological limitations. ANN modeling could enable breeders to evaluate the GEI and genetic stability of a large number of genotypes faster and more precisely. Coupling artificial intelligence (e.g., ANN) with deep phenotyping is a valuable tool for understanding plant–environment interactions [98].

### *2.4. Biotic and Abiotic Stress Assessment*

Plants are exposed to various biotic and abiotic stresses. Different approaches have been applied to assess the tolerance and resistance of plant genotypes to these stresses and to identify superior genotypes.

There have been numerous breeding attempts to combat drought stress. Plants' tolerance to drought has been studied through statistical indices such as tolerance (TOL), mean productivity (MP), stress susceptibility index (SSI), geometric mean productivity (GMP), harmonic mean (HARM), relative drought index (RDI), stress tolerance index (STI), yield index (YI), yield stability index (YSI), and modified stress tolerance indices (K1STI and K2STI) [99–106]. These classical approaches are based on morphological data, mainly yield measured under nonstress (Y<sub>p</sub>) and stress (Y<sub>s</sub>) conditions. However, apart from morphological attributes, many physiological and biochemical pathways are involved in plants' response to environmental stresses. Secondary metabolites, cellular antioxidants, plant growth regulators, compatible solutes, and polyamines are all involved in plants' response to biotic and abiotic stresses [107,108]. Combining phenomic data with metabolomic and genomic data is an efficient strategy to assess plants' responses to biotic and abiotic stresses [109]. Classical multivariate statistical methods are not efficient enough to manage such a large volume of data (multiple independent variables versus multiple dependent variables). Linear regression is the most common technique used to detect nutrient deficiencies from RGB images. However, features extracted from digital images that have a nonlinear relationship with nutrient content cannot be explained by a linear regression model [110]. Machine learning techniques, along with digital images, could be used to model and predict genotypes' responses to stressful conditions and to identify the genotypes that perform well under both stress and nonstress environments by analyzing all phenomic and omics (metabolomic and genomic) data. Big data (imaging and remote-sensing data) can be interpreted through machine learning for high-throughput stress phenotyping [111]. Ravari et al. [52] applied an MLP-ANN and the TOL, MP, GMP, HM, SSI, STI, YI, and YSI indices to predict the salinity tolerance of 41 Iranian wheat cultivars (*Triticum aestivum* L.). They found that the YSI, MP, GMP, and STI were the best predictors of salinity-tolerant cultivars. In Arabidopsis (*Arabidopsis thaliana*), miRNA expression levels were used as input features to predict plant responses to drought, salinity, cold, and heat stresses using decision tree (DT), SVM, least-squares support vector machine (LSSVM), and Naïve Bayes (NB) models. It was concluded that miRNA-169, miRNA-159, miRNA-396, and miRNA-393 contributed most to the plant response to abiotic stresses and that the SVM with a Gaussian kernel predicted the plant stress response better than the other machine learning methods (R<sup>2</sup> = 0.96) [8]. A deep CNN, along with a traditional machine learning method, was used for the identification and classification of maize drought stress using field data collected under optimum moisture, light drought, and moderate drought conditions. The authors reported an identification accuracy of 98.14%, which was higher than that of the Gradient Boosting Decision Tree (GBDT) method [19].
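
The yield-based tolerance indices mentioned above are simple functions of the yield under nonstress (Y<sub>p</sub>) and stress (Y<sub>s</sub>) conditions. The sketch below uses their commonly cited textbook definitions; the function name and the example yields are hypothetical, and the RDI and the modified indices (K1STI and K2STI) are left out because their definitions are not given in this section.

```python
from math import sqrt
from statistics import mean


def tolerance_indices(yp, ys):
    """Yield-based stress tolerance indices for each genotype.

    yp[i], ys[i]: yield of genotype i under nonstress and stress conditions
    (hypothetical data). Returns one dict of indices per genotype.
    """
    yp_bar, ys_bar = mean(yp), mean(ys)
    stress_intensity = 1 - ys_bar / yp_bar  # assumes mean yield drops under stress
    rows = []
    for p, s in zip(yp, ys):
        rows.append({
            "TOL": p - s,                              # tolerance
            "MP": (p + s) / 2,                         # mean productivity
            "GMP": sqrt(p * s),                        # geometric mean productivity
            "HM": 2 * p * s / (p + s),                 # harmonic mean
            "SSI": (1 - s / p) / stress_intensity,     # stress susceptibility index
            "STI": p * s / yp_bar ** 2,                # stress tolerance index
            "YI": s / ys_bar,                          # yield index
            "YSI": s / p,                              # yield stability index
        })
    return rows


yp = [4.0, 6.0]  # made-up nonstress yields
ys = [2.0, 3.0]  # made-up stress yields
for genotype, indices in enumerate(tolerance_indices(yp, ys)):
    print(genotype, {k: round(v, 3) for k, v in indices.items()})
```

All eight indices are derived from the same two yield values per genotype, which illustrates why such indices cannot capture the physiological and biochemical side of the stress response.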

Deep CNNs have been widely used to detect and classify various plant diseases (biotic stress) [112–115]. Image recognition and classification of maize leaf diseases, including northern corn leaf blight (*Exserohilum*), common rust (*Puccinia sorghi*), and gray leaf spot (*Cercospora*), have been conducted using a deep CNN with an accuracy of 93.35% [116]. In cucumber (*Cucumis sativus*), a CNN-based semantic segmentation model was developed to segment powdery mildew on leaf images at the pixel level, and the pixel accuracy of the CNN model (96.08%) was higher than that of the K-means, random forest, and GBDT segmentation methods [28]. In pearl millet (*Pennisetum glaucum*), DNNs have been applied to identify mildew disease, and an accuracy of 95.00% was reported for the developed model [37].

### *2.5. Classical Mating Designs and Hybrid Breeding Programs*

The integration of statistics into genetics led to classical mating designs such as generation mean analysis [117], diallel cross analysis [118–120], line × tester analysis [121], North Carolina designs [122], and the triple test cross [123,124]. These methods have been used for the genetic analysis of crops in order to determine the nature of the gene actions (additive, dominance, and epistasis) that control important morphological, phenological, and yield component characteristics, to calculate broad- and narrow-sense heritability, and to predict the outcomes of cross-breeding programs.

The prediction of parental combinations is critical to the choice of homozygous parental lines with superior combining ability in F1-hybrid breeding programs [125]. However, it is a challenging task because the number of cross combinations becomes very large when there are many inbred parental lines. Therefore, predicting the yield performance of cross combinations of parental lines may significantly reduce the time and budget required for F1-hybrid breeding programs [126]. ANN could be used to predict parental combinations and to estimate the general and specific combining abilities (GCA and SCA) in mating designs such as topcross, line × tester, and diallel cross. Khaki et al. [126] applied matrix factorization and a neural network to predict the yield performance of maize inbred × tester cross combinations that had not been sown, on the basis of historical yield data collected from crosses of other inbreds and testers. The proposed model performed significantly better than other models such as deep factorization machines (DeepFM), generalized matrix factorization (GMF), LASSO, RF, and neural networks.
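
For orientation, the classical fixed-effect estimates of GCA and SCA from a complete line × tester table of cross means can be sketched as follows. This is the textbook additive decomposition, not the matrix factorization model of Khaki et al. [126]; the function name and the cross means are hypothetical.

```python
from statistics import mean


def combining_ability(y):
    """Estimate GCA and SCA effects from a complete line x tester table.

    y[i][j] is the mean yield of the cross between line i and tester j
    (hypothetical data). Returns the grand mean, line GCA, tester GCA, and SCA.
    """
    n_l, n_t = len(y), len(y[0])
    grand = mean(v for row in y for v in row)
    line_mean = [mean(row) for row in y]
    tester_mean = [mean(y[i][j] for i in range(n_l)) for j in range(n_t)]

    # GCA: deviation of a parent's mean over all its crosses from the grand mean
    gca_line = [m - grand for m in line_mean]
    gca_tester = [m - grand for m in tester_mean]
    # SCA: what remains of a cross mean after removing both parental GCA effects
    sca = [
        [y[i][j] - line_mean[i] - tester_mean[j] + grand for j in range(n_t)]
        for i in range(n_l)
    ]
    return grand, gca_line, gca_tester, sca


crosses = [[10.0, 12.0], [14.0, 20.0]]  # made-up cross means
grand, gca_l, gca_t, sca = combining_ability(crosses)
# Additive (GCA-only) prediction for the cross of line 1 with tester 0
additive_prediction = grand + gca_l[1] + gca_t[0]
print(gca_l, gca_t, sca, additive_prediction)
```

A GCA-only prediction ignores the SCA term, which is exactly the part that cannot be estimated for a cross that has never been grown; this is one reason why learned models are of interest for unsown combinations.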

## **3. Applications of Machine Learning in In Vitro-Based Plant Biotechnology**

Biotechnology-based breeding methods (BBBMs) complement classical breeding methods in rapid plant improvement. In vitro regeneration, as the core of many in vitro-based breeding methods, has numerous plant breeding applications. In situ and ex situ conservation and micropropagation (proliferation) are direct applications of in vitro regeneration [127]. In endangered and rare plant species, such as medicinal plants, in vitro culture is an effective strategy for mass propagation, germplasm conservation, and production of bioactive compounds [128]. Several factors determine the fate of cultured cells in in vitro regeneration of plants. These are the plant genotype, plant growth regulators (PGRs), culture medium components, explant type, explant age, enhancer additives (elicitors), etc. [127]. These factors can be divided into three main categories: initial triggers of regeneration (environmental signal inputs and physical stimuli), epigenetic and transcriptional cellular responses to the initial triggers, and molecules that manage the formation and development of the new stem cell niche [129]. The combinations of and interactions between these factors make in vitro plant regeneration a multifactorial process. Basal culture medium components, plant genotype, PGRs, explant type, and explant age are all multilevel factors with different applicable combinations. The inclusion of other factors makes interpretation very complex. Plant cells and tissues have nondeterministic and nonlinear developmental patterns in a stressful in vitro environment [130]. Analysis of variance of factorial experiments and simple means comparison with classical methods such as LSD, Tukey's HSD, and Duncan's test are the main statistical methods used to interpret the effects of interactions between influential factors in most in vitro regeneration studies [128,131,132].

Murashige and Skoog (MS), modified MS (MMS), Gamborg's B5 medium, Woody Plant Medium (WPM), and Driver and Kuniyuki Walnut medium (DKW) are the most commonly used basal culture media in in vitro regeneration studies. Basal medium manipulation is a promising strategy that has been applied to increase the output of in vitro studies [133]. However, because the culture medium contains a large number of micro- and macroelements, it is difficult to manipulate their concentrations experimentally. In this situation, predicting the effect of culture media components on the target characteristics of in vitro regenerants is a practical alternative. Artificial neural networks have been applied in such experiments to predict the best culture media components for efficient propagation of different plant species [29,31,134].

Different combinations of auxin and cytokinin PGRs can determine the developmental fate of cultured cells and tissues toward organogenesis and/or somatic embryogenesis. The cytokinin/auxin ratio is also very important in in vitro studies [135]. Niazian et al. [131] found that 2,4-dichlorophenoxyacetic acid (2,4-D) combined with kinetin resulted in indirect somatic embryogenesis of cultured hypocotyl segments of ajowan medicinal plants, whereas a combination of 3-methoxy(-6-benzylamino-9-tetrahydropyran-2-yl) purine and naphthalene acetic acid led cultivated explants toward an indirect shoot regeneration pathway. Arab et al. [30] combined artificial neural networks and genetic algorithms to predict and optimize the effect of cytokinin–auxin plant hormone (BAP, KIN, TDZ, IBA, and NAA) combinations and concentrations on the number of microshoots per explant, the length of microshoots, developed callus weight, and the quality index of plantlets in in vitro proliferation of Garnem (G × N15) rootstock. The ANN model predicted the number and length of microshoots with high accuracy. The highest values of the variable sensitivity ratio for the proliferation rate were related to the BAP (19.3), KIN (9.64), and IBA (2.63) inputs. An MLP-ANN was developed to predict the physical properties of embryogenic callus and the number of somatic embryos in in vitro regeneration of ajowan under the effect of different combinations of the explant age, concentrations of 2,4-D, kinetin, and sucrose inputs [25]. The ANN model predicted the physical properties of embryogenic callus (area, perimeter, Feret diameter, roundness, and true density) and the number of somatic embryos better than the multiple linear regressions. Fifteen-day-old hypocotyl explants × 1.5 mg/L 2,4-D × 0.5 mg/L Kin × 2.5% (*w*/*v*) sucrose was the best combination of inputs with the highest measured and predicted number of somatic embryos [25].

Apart from culture medium components and PGR combinations, ANN has been applied to model the sterilization step of in vitro regeneration. Hesami et al. [27] applied an MLP-ANN along with a genetic algorithm to model and optimize the contamination frequency and explant viability under the influence of seven input variables, i.e., HgCl<sub>2</sub>, Ca(ClO)<sub>2</sub>, nanosilver, H<sub>2</sub>O<sub>2</sub>, NaOCl, AgNO<sub>3</sub>, and immersion time, in an in vitro culture of chrysanthemum. The lowest contamination frequency (0%) and the highest explant viability (99.98%) resulted from 1.62% NaOCl at an immersion time of 13.96 min. The sensitivity analysis of the ANN showed that the immersion time was the most important variable affecting the contamination frequency and explant viability [27]. ANNs are also used to simulate in vitro growth of plant tissue cultures, distinguish embryos from nonembryos, predict the formation of plantlets from embryos, estimate the biomass of cell cultures, simulate the distribution of temperature in a culture vessel, identify and estimate the in vitro induced shoot length, and cluster in vitro regenerated plantlets [130].

Other in vitro-based breeding methods, such as artificial polyploidy induction, doubled haploid production, plant gene transformation, and genome editing, are also multifactorial and require multivariate statistical methods to interpret their results. Different chemical enhancers can be used in in vitro doubled haploid production methods (induced parthenogenesis and androgenesis) to improve the haploid induction efficiency, e.g., PGRs, osmoprotectants, cellular antioxidants, reactive oxygen species scavengers, polyamines, stress hormones, chlormequat chloride, compatible solutes, DNA demethylating agents, histone deacetylase inhibitors, cell wall remodeling agents, ethylene inhibitors, and other applicable additives. They enhance tolerance to inductive stresses and improve the final efficiency of doubled haploid production [108]. ANN models may improve the efficiency of in vitro doubled haploid production and help to overcome the problem of recalcitrant species and genotypes by predicting the best combination(s) of these additives in interaction with other influencing factors, such as the plant genotype, the surrounding environment of donor plants, physical treatments (inductive stresses) of cultured gametophytic cells, the developmental stage of initial gametophytic cells, and culture medium components. In one study, an ANN predicted the callus induction percentage in androgenesis (anther culture) of tomato (*Lycopersicon esculentum* L.) under the influence of the plant genotype, the concentrations of the PGRs 2,4-D and kinetin, and the concentration of gum arabic better than a multiple linear regression (MLR) model did [50].

Plants' vigor and performance are commonly enhanced by mitotic-induced polyploidy, which relies on the in vivo and in vitro application of mitotic spindle poisons [136]. In vitro-induced polyploidy is a multifactorial procedure. Its efficiency may be affected not only by in vitro regeneration parameters (basal culture medium components, combination of PGRs, additives, etc.) but also by the plant genotype, the developmental stage of the initial explants, and the type, dosage, and duration (exposure time) of the antimitotic agent applied. Because of this genotype dependency, different genotypes of a plant species respond differently to the applied concentrations of the antimitotic agent [137]. This results in a significant interaction between the plant genotype and the antimitotic agent in artificial polyploidy induction. Although there have been no reports on the application of ANN to model and predict the results of in vitro-induced artificial polyploidy, it might increase the efficiency by predicting and finding the best combination and interaction of all influential factors.

*Agrobacterium*-mediated gene transformation is a well-known method of plant gene transformation and genetic engineering. However, various parameters must be optimized for efficient gene delivery, including the *Agrobacterium* strain cell density, the time of inoculation, the type and concentration of antibiotics to kill *Agrobacterium*, the type and concentration of selectable antibiotics, and the concentration of acetosyringone [138]. These influencing factors, along with in vitro regeneration factors, make *Agrobacterium*-mediated gene transformation a multivariable process [127]. Machine learning algorithms could therefore be used to predict and optimize *Agrobacterium*-mediated gene transformation, especially in important *Agrobacterium*-recalcitrant plant species.

## **4. Coupled Machine Learning-Image Processing for High-Throughput Phenotyping and Precision Agriculture**

Classical measurement of plants' physical features by visual assessment is a laborious, time-consuming, costly, and error-prone process in both conventional and in vitro-based plant breeding studies. This step can be accelerated and facilitated by the machine vision method, which is more accurate and precise than visual assessment. Nondestructive measurement of physical features, both outdoors and in vitro, is another important advantage of image processing [25]. Automated non-invasive fast scoring of several plant traits through high-throughput phenotyping platforms can
speed up and facilitate the phenotyping of plant populations and selection of superior varieties [139]. The integration of precise measured image-based characteristics with omics data could help to identify the key traits involved in the mechanisms of stress tolerance and acclimation [109]. On the other hand, the ability of deep learning in the identification of plants' features provides a great opportunity for further advances in image analysis [98]. Combined image processing (for feature extraction) and machine learning (for data analysis) is a powerful strategy required for faster and precise image-based plant phenotyping [140]. The use of deep learning techniques in computer vision can accelerate plant breeding programs such as plant phenotyping and classification of genotypes [141]. Coupled image processing-ANN has been used to measure phenotypic characteristics and assess genetic diversity and classify different plant species [38,65,142,143]. Deep learning, especially CNN, has become a powerful tool for image analysis in recent years [49]. Uzal et al. [48] applied a computer vision method for feature extraction along with developed convolutional neural networks to estimate the number of seeds in soybean pods and then to classify the obtained data. In most cases, the convolutional neural networks learnt to detect each seed in the pod, which indicates their high classification efficiency. There are other advanced imaging techniques, which are more efficient than simple visualization techniques and can be used to analyze in-field images instead of indoor methods. Recently, an R-based pipeline has been developed, which enables analysis of orthomosaic images from agricultural field trials and calculation of the number of plants per plot, canopy cover percentage, vegetation indices, and plant height [139]. A deep neural network model trained with such in-field images could very effectively classify and estimate desired characteristics from in-field images [48]. 
Coupled image processing-artificial neural network has been used in BBBMs for in vitro modeling of somatic embryogenesis in ajowan [25] and androgenesis-based haploid induction in tomato [50]. Plant phenotyping and precision agriculture can differ significantly in terms of their spatial and temporal resolutions, although both generate big data sets in image format. Both are information- and technology-based domains with specific demands and challenges. Precision agriculture is an agricultural management system based on spatial and temporal variability in crop and soil factors within a field (with environmental parameters). In phenotyping systems, however, the crop field parameters are homogeneous, and datasets at the molecular, cellular, and whole-plant levels are considered. Precision agriculture examines spatial heterogeneities within crop stands, whereas plant phenotyping examines the appearance and performance of a genotype under distinct environmental conditions [144,145]. High-throughput salt-stress phenotyping has been reported in okra (*Abelmoschus esculentus* L.) through a DNN trained on physiological and biochemical traits, such as fresh weight, SPAD, elemental contents, and photosynthesis-related parameters, measured in 13 genotypes under salt stress treatment [36]. The establishment of high-throughput phenotyping platforms (HTPPs) to phenotype physiomorphological traits under highly heterogeneous field environments in a precise, labor-saving, and cost-effective manner is essential to bridge the gap between genomics and phenomics [146]. Machine learning algorithms can be used for image-based plant stress phenotyping at a wide range of scales, from leaf and canopy to field level. The identification, classification, quantification, and prediction of big data obtained from higher-throughput phenotyping systems, such as unmanned aerial system (UAS) technology and ground robots, can be conducted through deep learning algorithms [147].
In carrot (*Daucus carota*), a precision agriculture approach was implemented by combining on-farm point sampling data for carrot with satellite imagery data using a random forest regression algorithm. The accuracy of the developed model in predicting carrot yield from a database composed of spectral bands was acceptable (R<sup>2</sup> = 0.82; RMSE = 2.64 Mg ha<sup>−1</sup>; MAE = 1.74 Mg ha<sup>−1</sup>) [26].
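
The three accuracy measures quoted for the carrot yield model can be computed from observed and predicted values as in the sketch below. It assumes that R<sup>2</sup> denotes the coefficient of determination (the cited study may instead report a squared correlation), and the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def regression_metrics(observed, predicted):
    """Return R2 (coefficient of determination), RMSE, and MAE.

    observed, predicted: equal-length sequences of yields (hypothetical data).
    RMSE and MAE carry the units of the yield, e.g. Mg per hectare.
    """
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    obs_mean = mean(observed)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
    }


obs = [1.0, 2.0, 3.0]   # made-up observed yields
pred = [1.0, 2.0, 4.0]  # made-up model predictions
print(regression_metrics(obs, pred))
```

RMSE penalizes large errors more strongly than MAE, so an RMSE that is noticeably larger than the MAE, as in the reported values, suggests that a few plots were predicted much less accurately than the rest.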

## **5. A Proposed Idea for Plant Ploidy Level Determination through Image Processing-Machine Learning**

In chromosome engineering studies (polyploidy and haploid induction), verifying the ploidy level is an important step. It can be confirmed through direct (chromosome counting) or indirect methods (morphological and anatomical indicators and flow cytometry). Although the direct method of chromosome counting is reliable and unambiguous [148], it is laborious, time consuming, and complicated and requires highly skilled operators [149]. Indirect verification of the ploidy level through classical markers, such as stomatal morphometric data (stomatal density per unit area, the number and size of stomata), the density of chloroplasts per stomatal guard cell, the size of guard cells, and pollen size, is rapid and simple [150] but not completely reliable. Flow cytometry is a reliable method based on the direct correlation between the nuclear DNA content and ploidy level. However, according to a recent study, comparing the DNA content of standardized leaf punch samples is not a reliable way to recognize putative doubled haploids, as there is a DNA content equivalence between haploid and diploid samples [151].

Machine learning algorithms can be used for ploidy level identification in plants. Recently, a deep learning-based object detection algorithm has been developed for evaluating the stomatal density and elucidating the variation in the stomatal density among various soybean accessions [49]. This DNN could also be useful for ploidy level prediction. There have been two reports on the use of other methods to identify the ploidy level in plants. Altuntaş et al. [33] used convolutional neural networks to distinguish haploid from diploid maize seeds using the *R1-nj* anthocyanin color marker in images of 1230 haploid and 1770 diploid maize seeds. The accuracy and sensitivity of the model were 94.22% and 94.58%, respectively [33]. Remote sensing has also been applied to determine the ploidy level of quaking aspen (*Populus tremuloides* Michx.) [152].
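
Accuracy and sensitivity, as quoted for the haploid/diploid seed classifier, are both derived from the confusion matrix of a binary classifier. The sketch below shows the calculation with haploid treated as the positive class; the function name and the labels are hypothetical and do not reproduce the data of the cited study.

```python
def classification_metrics(true_labels, predicted_labels, positive="haploid"):
    """Accuracy, sensitivity, and specificity for a two-class ploidy classifier.

    true_labels, predicted_labels: equal-length sequences of class names
    (hypothetical data). `positive` names the class counted as a positive.
    """
    tp = fn = fp = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                tp += 1  # haploid correctly recognized
            else:
                fn += 1  # haploid missed
        else:
            if guess == positive:
                fp += 1  # diploid wrongly called haploid
            else:
                tn += 1  # diploid correctly recognized
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


truth = ["haploid"] * 4 + ["diploid"] * 6
guess = ["haploid"] * 3 + ["diploid"] * 6 + ["haploid"]  # made-up predictions
print(classification_metrics(truth, guess))
```

When haploids are much rarer than diploids, as is typical in haploid induction, sensitivity and specificity are more informative than accuracy alone, because a classifier that labels every seed as diploid would still reach a high accuracy.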

Here, we propose another approach to identify the ploidy level of plants: a coupled image processing-supervised deep neural network that uses visual data on the cellular patterning of the epidermal layer. Haploids have smaller and more densely packed epidermal and mesophyll cells (more cells per unit area) than doubled haploids. This results in an equivalent DNA content per unit leaf area for haploids and their diploid counterparts [151]. Cellular patterning in the epidermis and mesophyll can be specific to each ploidy group. Therefore, epidermal cell patterning (size, shape, and number) could be used as a criterion for ploidy level recognition and classification [151]. Using imaging techniques for precise feature extraction from leaf punch samples (the cellular pattern, including the cell size and number) and then modeling the captured images (classification modeling) through deep learning approaches, particularly CNN, would yield an image-based model that can be used to estimate the ploidy level in chromosome engineering studies of different plant species (Figure 3). This could be a more precise, faster, and more cost-effective method of ploidy level distinction, which could also be used in other branches of plant science, e.g., in genetic diversity, evolutionary, and species invasiveness studies.

**Figure 3.** The proposed coupled image processing-supervised machine learning approach to determine the ploidy level in different plant species through cellular patterning in leaf epidermis and mesophyll layers.

## **6. Conclusions**

Most classical statistical methods use only simple statistics and a few influential factors to assess the biological features of plants. For example, Y<sub>p</sub> and Y<sub>s</sub> are the only variables used to identify drought-tolerant plant genotypes in yield-based drought tolerance assessment methods. However, other influential factors, such as cellular, physiological, and phytochemical pathways, are involved in plants' responses to environmental stress. The tolerance of different plant species to biotic and abiotic stresses, as complex biological processes, can be efficiently enhanced through large-scale analysis of phenomic, metabolomic, and genomic data. Machine learning models are capable of processing large amounts of data (imaging and remote-sensing data) for high-throughput stress phenotyping. The analysis of different omics and phenomic data may result in a more precise interpretation of the GEI and yield stability. Plants' qualitative and quantitative characteristics can be predicted more precisely by analyzing climate data (temperature, humidity, sunshine, precipitation, etc.), soil factors, agricultural operations data (harvest date, information on diseases, crop status, ground temperature, etc.), and topographic and meteorological data. Big data analysis enables more efficient classification of plants' phenotypes and genotypes. Machine learning techniques are able to manage large amounts of data in various areas of plant breeding, which can lead to more accurate results and better interpretation than classical statistical methods. Artificial neural networks can be used for pattern recognition, nonlinear regression, and classification purposes in plant tissue culture studies because they can handle binary, continuous, categorical, and fuzzy datasets. The present review gives plant breeders an overview of the applications of machine learning. It may help them to adopt the correct method of data analysis in future studies, which in turn can increase the output of studies.

**Author Contributions:** M.N. and G.N. contributed equally to this work. M.N. and G.N. have read and agreed to the published version of the manuscript.
