**Mohsen Niazian 1,\* and Gniewko Niedbała 2,\***


Received: 18 August 2020; Accepted: 25 September 2020; Published: 27 September 2020

**Abstract:** Classical univariate and multivariate statistics are the most common methods used for data analysis in plant breeding and biotechnology studies. Evaluation of genetic diversity, classification of plant genotypes, analysis of yield components, yield stability analysis, assessment of biotic and abiotic stresses, prediction of parental combinations in hybrid breeding programs, and analysis of in vitro-based biotechnological experiments are mainly performed by classical statistical methods. Despite successful applications, these classical statistical methods have low efficiency in analyzing data obtained from plant studies, as the genotype, environment, and their interaction (G × E) result in nondeterministic and nonlinear nature of plant characteristics. Large-scale data flow, including phenomics, metabolomics, genomics, and big data, must be analyzed for efficient interpretation of results affected by G × E. Nonlinear nonparametric machine learning techniques are more efficient than classical statistical models in handling large amounts of complex and nondeterministic information with "multiple-independent variables versus multiple-dependent variables" nature. Neural networks, partial least square regression, random forest, and support vector machines are some of the most fascinating machine learning models that have been widely applied to analyze nonlinear and complex data in both classical plant breeding and in vitro-based biotechnological studies. High interpretive power of machine learning algorithms has made them popular in the analysis of plant complex multifactorial characteristics. 
The classification of different plant genotypes with morphological and molecular markers, the modeling and prediction of important quantitative characteristics of plants, the interpretation of complex and nonlinear relationships among plant characteristics, and the prediction and optimization of in vitro breeding methods are examples of applications of machine learning in conventional plant breeding and in vitro-based biotechnological studies. Precision agriculture is possible through accurate measurement of plant characteristics using imaging techniques, followed by efficient analysis of the reliable extracted data using machine learning algorithms. Accurate interpretation of high-throughput phenotyping data is achievable by coupling machine learning with image processing. Some applied and potentially applicable capabilities of machine learning techniques in conventional and in vitro-based plant breeding studies are discussed in this overview. These discussions are of value for future studies and could inspire researchers to apply machine learning in new layers of plant breeding.

**Keywords:** artificial neural networks; big data; classification; high-throughput phenotyping; modeling; predicting

### **1. Introduction**

Due to climate change (global warming), increasing food requirements and depletion of resources in consequence of increasing global population, it is necessary to use modern technologies in agriculture and food sciences [1]. Plant breeding is a dynamic branch of agricultural science. It started with simple selection of impressive plants with superior characteristics. Later, genetics and statistics were involved in classical plant breeding, mainly after the discoveries of Gregor Mendel and Sir Ronald Aylmer Fisher. Next, modern plant breeding emerged with the advancements in genetic and biotechnology approaches. Classical plant breeding methods mainly included assessment and classification of genetic diversity, yield components analysis (indirect selection of superior genotypes with impressive economic characteristics), yield stability analysis (genotype × environment interaction), enhanced tolerance to biotic and abiotic stresses, and hybrid breeding programs. In vitro-based biotechnological breeding methods mainly included in vitro micropropagation, doubled haploid production, artificial polyploidy induction, and *Agrobacterium*-mediated gene transformation. In in vitro micropropagation studies, researchers want to investigate the effects of influential factors (inputs), such as combination of culture medium components, combination and concentrations of plant growth regulators (PGRs), and interactions of plant genotype × culture medium × PGRs × explant type × explant age × elicitor additives × type and concentration of carbohydrate source × etc., on regeneration efficiency (outputs) of their desired plants. Classical statistical techniques have been employed to analyze and interpret the results of both classical and in vitro-based plant breeding studies. These analytical techniques are mainly based on variance and linear regression models to assess the relationship of variables and predict the effect of independent variables on dependent variables. 
One regression model is required to assess the effect of a group of independent variables (X1, X2, X3, . . . , Xn) on one dependent variable (Y), according to the multiple linear relationships [2]. However, nonlinear and nondeterministic properties are inextricably linked with plant biological systems [3]. Therefore, despite successful applications, classical linear regression-based models are unable to interpret highly nonlinear and complex relationships between dependent and independent variables. Most of these plant breeding problems are of the "multiple-independent variables versus multiple-dependent variables" type. Under these conditions, one regression model is required for each output [4]. Powerful data mining tools are therefore employed in plant breeding studies to predict and explain complex data.
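The two limitations named above (one linear model per output, and no ability to capture nonlinear relationships) can be illustrated with a minimal sketch. The data, the trait names, and the helper functions `fit_line` and `r_squared` are hypothetical and chosen only for illustration; they do not come from the cited studies.

```python
def fit_line(x, y):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - hi) ** 2 for yi, hi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical input levels and two hypothetical outputs (dependent variables).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
outputs = {
    "linear_trait": [2.0 * xi + 1.0 for xi in x],  # truly linear in x
    "curved_trait": [xi ** 2 for xi in x],         # fully determined by x, but nonlinear
}

# A separate regression model has to be fitted for each output.
r2_by_output = {}
for name, y in outputs.items():
    slope, intercept = fit_line(x, y)
    y_hat = [slope * xi + intercept for xi in x]
    r2_by_output[name] = r_squared(y, y_hat)
    print(name, "slope:", slope, "R2:", r2_by_output[name])
```

In this toy case the straight line explains the linear trait completely but explains none of the variation in the curved trait, even though both traits depend only on the same input.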

Machine learning, the science of programming computers so they can learn from data, has been widely applied in both classical and in vitro-based plant breeding studies to interpret the flow of information about plants from the DNA sequence to the observed phenotypes. Machine learning methods can be classified in three ways: as supervised or unsupervised models, as linear or nonlinear algorithms, and as shallow or deep learning models (Figure 1). Artificial neural networks (ANNs), deep neural networks (DNNs), convolutional neural networks (CNNs), random forest (RF), and support vector machines (SVMs) are examples of nonlinear nonparametric machine learning algorithms applied for processing nonlinear data in plant studies [5]. These data-driven models are able to parse and interpret non-normal, nonlinear, and nondeterministic data sets by making full use of all spectral data while avoiding irrelevant spectral bands and multicollinearity [6,7]. Among different learning algorithms, including supervised, unsupervised, reinforcement, sparse dictionary, and rule-based learning, supervised learning is more suitable and efficient for life science problems [8]. Supervised learning can be used for classification (predicting non-numeric answers) and regression (predicting numeric answers) [9]. Unstructured datasets, such as data obtained by photo imaging or sequencing, can be interpreted through machine learning algorithms [10]. Genome sequencing data can be used in machine learning models for the identification and classification of transposable elements [11]. By using machine learning algorithms, breeders are able to predict multiple outputs (multiple-dependent variables) from different combinations of multiple inputs in one model and thus reduce the number of required analyses.

*Agriculture* **2020**, *10*, x FOR PEER REVIEW 3 of 24

**Figure 1.** Different categories of machine learning algorithms.

Artificial neural networks, which consist of an input layer, an output layer, and several hidden layers, are nonlinear nonparametric models that do not require a prior structure for the data or detailed information about the physical processes to be modeled, and they can tolerate data loss [12,13]. Because they have more hidden layers, DNNs have greater predictive power than ANNs. Convolutional neural networks, a state-of-the-art deep learning architecture, are inspired by the natural visual perception mechanism of living creatures and consist of convolutional layers, pooling layers, fully connected layers, and an output layer [14]. CNNs are suitable for classification studies because they extract features automatically [9]. Image classification, object detection, object tracking, pose estimation, text detection and recognition, visual saliency detection, action recognition, scene labeling, and speech and natural language processing are some of the typical applications of CNNs [14]. Neural networks have low interpretability, especially CNNs, in which the extracted features are hidden. SVMs, a more advanced machine learning technique that uses a supervised learning algorithm to find both linear and nonlinear relationships in data, can be used for clustering, classification, and regression analysis of data sets. In comparison with the multilayer perceptron (MLP) type of ANN, SVM uses a large number of hidden units and has a better formulation of the learning problem, which leads to a quadratic optimization task [15]. Random forest is a decision tree-based machine learning method that uses multiple trees for classification or regression and requires setting the number of trees, the number of random features, and the stop criteria for training. RF is more suitable for spectral data analysis, and overfitting can be controlled by combining different independent predictors [16,17]. In semantic segmentation tasks, such as automated phenotyping and plant disease detection, deep learning CNNs can be more effective than the shallow learning models of SVMs and RF, and the need for large sets of manually crafted features can be addressed by using image augmentation and a small manually annotated empirical dataset for fine-tuning a synthetically bootstrapped CNN [18]. By integrating image feature extraction with classification in a single pipeline, deep convolutional neural networks have become mainstream in biotic and abiotic stress diagnosis and classification [19]. A nine-layer deep CNN model was trained to identify plant leaf diseases using a data set with 39 different classes of plant leaves and background images, and a classification accuracy of 96.46% was reported, which is greater than that of the traditional machine learning approaches of SVM, decision tree, logistic regression, and K-NN [20]. CNNs are also applicable in remote sensing for object detection and pattern recognition. High accuracy (84%) has been reported for fine-grained mapping of vegetation species and communities using deep CNN-based segmentation trained on data directly derived from visual interpretation of unmanned aerial vehicle (UAV)-based high-resolution Red-Green-Blue (RGB) imagery [21].
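To make the layered structure of an ANN more concrete, the following minimal sketch runs the forward pass of a tiny multilayer perceptron with two inputs, one hidden layer of two sigmoid neurons, and one output neuron. The weights are hand-picked for illustration rather than trained, and the names `dense` and `mlp_forward` are hypothetical. The network reproduces the XOR pattern, a relationship that no single linear model can represent.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked (not trained) parameters, assumed only for this illustration.
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]  # neuron 1 acts like OR, neuron 2 like NAND
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]                  # output neuron acts like AND of the two
OUTPUT_B = [-30.0]


def mlp_forward(x):
    """Input layer -> hidden layer -> output layer."""
    hidden = dense(x, HIDDEN_W, HIDDEN_B)
    return dense(hidden, OUTPUT_W, OUTPUT_B)[0]


xor_outputs = [round(mlp_forward(x)) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(xor_outputs)
```

In practice the weights are learned from data, typically by back-propagation, which is where the training-set size discussed in the following paragraph becomes important.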

A large amount of training data is required in ANNs to optimize the sigmoid functions of the hidden-layer neurons, as overfitting and local minima may occur with a small number of training samples. Therefore, the optimization process cannot be carried out properly with back-propagation algorithms when the number of training samples is small [8]. A short review of studies that used SVM and ANN techniques to identify disease in plants concluded that ANN-based methods are better than SVM-based methods, as few samples and features are used in SVM-based methods to identify disease-affected plants [22]. Conversely, in modeling in vitro culture of *Chrysanthemum* (*Dendranthema* × *grandiflorum*), better performance accuracy of support vector regression (SVR; R<sup>2</sup> > 0.92) than MLP (R<sup>2</sup> > 0.82) has been reported [15]. Applying different algorithms and comparing their performance is an appropriate way to find the best algorithm for a particular data set. In tea plant (*Camellia sinensis* L.), partial least squares discriminant analysis (PLS-DA) and least squares-support vector machines (LS-SVM) were used to classify different nitrogen nutrition statuses under field conditions, and LS-SVM was reported to classify more samples correctly than PLS-DA [23]. The pros and cons of different nonlinear machine learning methods under similar scenarios are presented in Table 1.
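Comparing algorithms on the same data set, as recommended above, is commonly done with cross-validation. The sketch below is one possible way to do this, using leave-one-out validation on a small invented data set. The two classifiers (a nearest-neighbour rule and a majority-class baseline) are simple stand-ins chosen so that the example needs no external library; they are not the SVM or ANN models of the cited studies, and all names and values are hypothetical.

```python
from math import dist


def loo_accuracy(predict, X, y):
    """Leave-one-out accuracy: hold out each sample once and predict it from the rest."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        hits += int(predict(X_train, y_train, X[i]) == y[i])
    return hits / len(X)


def nearest_neighbour(X_train, y_train, x):
    """Return the label of the closest training sample (Euclidean distance)."""
    j = min(range(len(X_train)), key=lambda k: dist(X_train[k], x))
    return y_train[j]


def majority_class(X_train, y_train, x):
    """Baseline that ignores the features and returns the most frequent label."""
    return max(sorted(set(y_train)), key=y_train.count)


# Hypothetical leaf measurements (two features per plant) and disease labels.
X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
y = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]

scores = {
    "nearest_neighbour": loo_accuracy(nearest_neighbour, X, y),
    "majority_class": loo_accuracy(majority_class, X, y),
}
print(scores)
```

The same harness can score any function with the `predict(X_train, y_train, x)` signature, so further candidate models can be added to the comparison without changing the validation code.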


**Table 1.** Pros and cons of nonlinear machine learning algorithms applied in classical and in vitro-based plant breeding studies.

ANNs—artificial neural networks; CNN—convolutional neural networks; RF—random forest; SVMs—support vector machines.

Different application areas for nonlinear machine learning technologies in classical and in vitro-based plant breeding studies are shown in Figure 2. The following sections of the article provide a comprehensive review of the applications of these nonlinear machine learning techniques in classical and in vitro-based plant breeding studies.

**Figure 2.** Potential applications of machine learning techniques in classical and modern plant breeding.

Some recently applied nonlinear machine learning models in both classical and in vitro-based plant breeding studies are listed in Table 2.

**Table 2.** Examples of recently applied nonlinear machine learning models in classical and modern plant breeding studies.

| Plant Species | Type of Machine Learning | Techniques | Purpose(s) | Reference |
|---|---|---|---|---|
| Ajowan (*Trachyspermum ammi* L.) | ANN | MLR | Modeling and predicting of seed yield | [2] |
| Ajowan (*Trachyspermum ammi* L.) | ANN | MLR | Modeling and predicting of essential oil content | [24] |
| Ajowan (*Trachyspermum ammi* L.) | ANN | MLR, IP | Predicting physical properties of embryogenic callus and number of somatic embryos | [25] |
| *Arabidopsis thaliana* | DT, SVMs, NB | Gaussian kernel | Predict the plant abiotic stresses response through the miRNAs' concentration | [8] |
| Carrot (*Daucus carota*) | RF | - | Precision agriculture-yield mapping | [26] |
| Chrysanthemum | ANN | GA | Modeling and optimizing of in vitro sterilization | [27] |
| Chrysanthemum | ANFIS | GA | Modeling and optimizing of somatic embryogenesis | [3] |
| Chrysanthemum | ANN, SVMs | MLP | Modeling effect of plant growth regulators on somatic embryogenesis | [15] |
| Cucumber (*Cucumis sativus*) | | | powdery mildew disease | [28] |
| Garnem (G × N15) | ANN | GA | Modeling and optimizing of in vitro hormonal | [30] |
| *Prunus* rootstock | ANN | GA | Prediction and optimization of mineral salts of in vitro culture medium | [29] |

MLR: multiple linear regressions; MLP: multilayer perceptron; NB: Naïve Bayes; RBF: radial basis function; RF: random forest; SVMs: support vector machines; UPGMA: unweighted pair group method with arithmetic mean.

### **2. Traditional Plant Breeding**

### *2.1. Assessment and Classification of Genetic Diversity*

One of the most important prerequisites of plant breeding programs is genetic diversity, which enables selection of important accessions and their use in future breeding programs [61]. Morphological, biochemical, and physiological markers have been analyzed to investigate the genetic diversity of different plants. Morphological features are the simplest to measure and do not require special tools or techniques. The statistical analysis of these markers can prove the existence of genetic diversity among studied genotypes. Niazian et al. [62] used analysis of variance (ANOVA) and estimated the coefficient of variation (CV) of different agro-morphological traits (plant height, number of branches, number of umbels, number of umbellets in an inflorescence, biological yield, and single plant yield) of eight ecotypes of ajowan medicinal plant (*Carum copticum* L.) and observed significant genetic diversity.

Molecular/genetic markers are another group of markers that enable the assessment of genetic diversity and the discrimination of genotypes. Amplified fragment length polymorphism (AFLP), restriction fragment length polymorphism (RFLP), randomly amplified polymorphic DNA (RAPD), simple sequence repeat (SSR), inter-simple sequence repeat (ISSR), and single nucleotide polymorphism (SNP) are the most commonly used molecular markers for studying genetic diversity and identifying species in different target plants [63]. These genetic markers estimate phylogenetic relationships and identify varieties more reliably and effectively than morphological markers [64]. Although molecular markers are more effective than morphological markers in the assessment of genetic diversity and in the discrimination and identification of plant genotypes, they have some technical and/or economic limitations [65].

Classical multivariate analyses such as cluster analysis, discriminant function analysis, and principal component analysis (PCA) have been used for the classification and grouping of different genotypes in various plant species by means of morphological, biochemical, physiological, and molecular markers [61,64,66–68]. Object detection through deep learning algorithms could be used for efficient genetic diversity assessment and classification of plant genotypes. The use of CNN to classify morphological parameters is an appropriate alternative to conventional classification methods, such as k-nearest neighbor, probabilistic neural network, support vector machine, genetic algorithm, and PCA, all of which are time consuming and require feature extraction [65,69]. In soybeans (*Glycine max* (L.) Merr.), the genetic diversity of 90 accessions was detected through high-throughput evaluation of stomatal density [49]. In *Cinnamomum osmophloeum* Kanehira (Lauraceae), a deep CNN was applied to differentiate between morphologically similar species, and the accuracy of the CNN classifiers was better than that of the SVM classifiers (96.7% vs. 74.6%) [70]. Sant'Anna et al. [71] compared the performance of ANN with Fisher's classical multivariate statistical technique and Anderson's discriminant functions to assess the genetic diversity and classify 10 plant populations. They observed that ANN classified populations with high and low differentiation better than the classical methods, with fewer wrongly classified individuals. Linear discriminant analysis and nonlinear artificial neural network methods were applied to identify and discriminate 10 potato varieties with morphological data obtained through image processing. The ANN method classified the varieties with 100% accuracy [38].

As mentioned above, machine learning can also be used for classification based on molecular marker data. DNA/RNA sequences can be used to train CNNs for applications in plant molecular biology and for the classification of genotypes through molecular markers [9]. Different machine learning models were used to identify true single nucleotide polymorphisms (SNPs) in allopolyploid peanuts (*Arachis hypogaea* L.). The selection of true SNPs by means of real peanut RNA sequencing (RNA-seq) and whole-genome shotgun (WGS) resequencing data achieved above 80% accuracy [72]. Costa et al. [32] applied a neural network algorithm to infer the genetic diversity and group allelic frequencies obtained by RAPD and SSR molecular markers in grapevine rootstock varieties and found three genetically diversified clusters among the 64 grapevine rootstocks analyzed. Deep learning techniques enable prediction of plant phenotypes from their genome data [9].

Artificial neural networks have also been applied for genomic prediction and genomic selection in different plant species [73–75]. The phenotypes of 2000 Iranian bread wheat landrace accessions were predicted from a genomic dataset of 33,709 DArT markers using a deep convolutional neural network. The authors reported that the Pearson's correlation coefficients between observed and predicted phenotypic values (grain length, grain width, grain hardness, thousand-kernel weight, test weight, sodium dodecyl sulfate sedimentation, grain protein, and plant height) were higher for the deep CNN than for other genomic selection methods [57].
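The comparison criterion mentioned above, Pearson's correlation between observed and predicted phenotypic values, can be computed as in the following minimal sketch. The phenotype values and the two sets of predictions are invented for illustration and do not reproduce the results of the cited study; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical thousand-kernel weights (g) for five accessions.
observed = [38.2, 41.5, 35.9, 44.1, 40.3]
# Hypothetical predictions from two genomic prediction models.
model_a = [39.0, 40.8, 36.5, 43.2, 40.9]
model_b = [40.1, 39.5, 38.8, 41.0, 40.2]

r_a = pearson_r(observed, model_a)
r_b = pearson_r(observed, model_b)
print("model A:", r_a, "model B:", r_b)
```

A higher coefficient indicates that the predicted values rank and scale the accessions more like the observed values do, which is why this statistic is widely used as prediction accuracy in genomic selection.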

### *2.2. Yield Component Analysis and Indirect Selection (Prediction)*

An increase in the economic yield (seed yield, oil yield, sugar yield, essential oil yield, biomass yield, straw yield, lint percentage, etc.) is the target of most breeding programs. However, yield is a highly complex quantitative trait, which is usually controlled by several genes and strongly influenced by the environment. Therefore, yield traits have low heritability, and direct selection is often ineffective for improving such complex traits. Instead, plant breeders usually prefer to select for simpler traits that are highly correlated with yield and have a strong influence on it. The selected yield component(s) are then used as "selection criteria" in future studies, i.e., indirect selection [2,76]. Classical univariate and multivariate linear methods, such as correlation coefficient analysis, PCA, path analysis, and multiple regression analyses (stepwise, forward, and backward), have been used in classical plant breeding to interpret relationships between plant traits and improve important quantitative properties like yield and tolerance to biotic and abiotic stresses. Correlation coefficient analysis has been used to evaluate the simple relationship between two traits, and path analysis has been used to identify cause/effect relationships between correlated variables [24]. Regression-based methods are the most effective multivariate statistical methods for indirect selection purposes. They are based on a linear relationship of a dependent variable (Y) as a function of multiple independent variables. These multiple variables make interpretation complicated. However, some reduction techniques, such as PCA and factor analysis, are able to concentrate the original multiple variables into a few composite variables [77].

Stepwise, forward, and backward regression analyses have been used to determine the effects of yield components on different economic quantitative characteristics in various crops. Backward stepwise regression was used to find the relationship between changes in grain yield and yield components of rice (*Oryza sativa* L.) in terms of the relative response ratio to elevated CO<sub>2</sub> [78]. Stepwise regression was used to determine the components of sugar beet (*Beta vulgaris* L.) yield affecting the yield of sugar under water deficit regimes and foliar application of jasmonic acid [76]. Zou et al. [77] applied stepwise regression analysis to identify the yield components involved in drought resistance of cotton seedlings (*Gossypium hirsutum* L.). Despite all the advantages, regression-based models have one major drawback in classical plant breeding studies: they cannot analyze nonlinear relationships between dependent and independent variables [2,79]. The application of nonlinear machine learning algorithms in yield component analysis and indirect selection studies enables the interpretation of nonlinear relationships between dependent and independent variables, the assessment of the contribution of yield components to yield, and the prediction of economic quantitative characteristics. ANN was more efficient than multiple linear regression (MLR) in the prediction of seed yield [2] and essential oil content [24] of ajowan (*Trachyspermum ammi* L.). Emamgholizadeh et al. [79] found that ANN predicted the yield of sesame seeds (*Sesamum indicum* L.) better than MLR. The ANN model was characterized by a lower root mean square error (RMSE) and a higher determination coefficient (R<sup>2</sup>). The sensitivity analysis of the ANN model showed that the number of capsules per plant and the flowering time were, respectively, the most and the least significant variables for the yield of sesame seeds.
Artificial neural networks have successfully predicted the yield of apples, pears, chives, and onions from data on crop diseases, time until harvest (based on the date), current temperature, humidity and precipitation (amount of snowfall) in the area, amount of sunshine, ground temperature, atmospheric pressure, and moisture evaporation in the ground [80]. ANNs have also been used to predict the yield of winter rapeseed and winter wheat on the basis of meteorological data (air temperature and precipitation) and information on the use of mineral fertilizer [41,42,53,54]. The superiority of a DNN (long short-term memory) over the autoregressive integrated moving average (ARIMA) model in predicting wheat production has been reported [56]. Deep CNN classification has been applied for image-based ear counting of wheat with a high level of robustness, without considering variables such as growth stage and weather conditions [55].
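Several of the comparisons above rank models by their root mean square error (RMSE), usually alongside R<sup>2</sup>. As a minimal sketch of how such a ranking is obtained, the code below computes the RMSE of two sets of predictions against the same observations. The yield values and both prediction sets are invented for illustration and are labeled "MLR-like" and "ANN-like" only to mirror the kind of comparison reported; they are not results from the cited studies.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


# Hypothetical seed yields (t/ha) for five plots.
observed = [1.8, 2.4, 3.1, 2.9, 3.6]
# Hypothetical predictions from two competing models.
predictions = {
    "MLR-like": [2.2, 2.3, 2.7, 3.2, 3.1],
    "ANN-like": [1.9, 2.5, 3.0, 2.8, 3.5],
}

errors = {name: rmse(observed, pred) for name, pred in predictions.items()}
best_model = min(errors, key=errors.get)  # lower RMSE means better fit
print(errors, "best:", best_model)
```

Because RMSE is expressed in the units of the trait, it is straightforward to interpret, but it should be computed on data that were not used for training if the goal is to compare predictive ability.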

Neural networks have also been used to estimate and predict the qualitative characteristics of different plants. The ANN model predicted the oil content in sesame more accurately and efficiently than the MLR model [46]. Parsaeian et al. [47] applied a multilayer perceptron (MLP)-ANN to estimate the oil and protein content in sesame on the basis of 138 morphological features measured in 125 sesame seed genotypes. The morphological characteristics of seeds were measured through image processing. Evaluation of the estimated oil and protein content of sesame seeds by means of R<sup>2</sup> and RMSE statistics revealed the superiority of MLP over the radial basis function (RBF), extended RBF (ERBF), GRNN, M5-Rule, M5-Tree, support vector machine regression, and linear regression models [47]. Niedbała et al. [59] developed a multilayer perceptron ANN to assess the influence of the cultivar and weather conditions on the concentration of ferulic acid and to correlate its content with the concentration of deoxynivalenol and nivalenol in 23 winter wheat genotypes with different *Fusarium* resistance. The independent variables consisted of 14 features, including 12 quantitative and 2 qualitative variables. The sensitivity analysis of the neural networks showed that the experiment variant and winter wheat cultivar were the most important determinants of the concentration of ferulic acid, deoxynivalenol, and nivalenol in winter wheat seeds [59]. Ray et al. [60] applied an MLP-ANN model to assess the effects of topographic, soil, and environmental factors (18 input parameters, including soil nutrients and climate factors) on the content of the active constituent coronarin D in white ginger lily (*Hedychium coronarium*). The sensitivity analysis of the ANN showed that altitude, manganese, and zinc were the most important variables for predicting the coronarin D content.
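The text above does not state which sensitivity analysis procedure the cited studies used. As one common possibility, the following sketch ranks inputs by a one-at-a-time perturbation: each input is raised and lowered by 10% while the others are held fixed, and the mean absolute change in the model output is recorded. The function `toy_model` is a hypothetical stand-in for a trained network, and the sample values are invented.

```python
def toy_model(x):
    """Hypothetical stand-in for a trained model with three inputs.

    The third input has a zero coefficient, so it should rank last.
    """
    return 3.0 * x[0] + 0.5 * x[1] ** 2 + 0.0 * x[2]


def sensitivity(model, samples, delta=0.1):
    """Mean absolute output change when each input is varied by +/- delta (relative)."""
    n_inputs = len(samples[0])
    scores = []
    for j in range(n_inputs):
        total = 0.0
        for row in samples:
            up, down = list(row), list(row)
            up[j] *= 1.0 + delta
            down[j] *= 1.0 - delta
            total += abs(model(up) - model(down))
        scores.append(total / len(samples))
    return scores


# Hypothetical observations: three input variables per sample.
samples = [[1.0, 2.0, 5.0], [2.0, 1.0, 4.0], [1.5, 1.5, 6.0]]

scores = sensitivity(toy_model, samples)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("sensitivity per input:", scores, "ranking:", ranking)
```

This kind of ranking depends on the perturbation size and on the range of the sampled inputs, so it is best read as a relative ordering of variables rather than as an absolute measure of effect.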
