*2.7. Discriminant Model and Evaluation*

In order to obtain an accurate and reliable classifier, the original spectral data were used to establish classification models with SVM, Random Forest (RF), and K-nearest neighbors (KNN). The performance of the three classifiers was compared, and the best-performing one was selected for the subsequent classification of the processed data.

The basic idea of SVM was to find the separating hyperplane with the maximum classification margin among the training samples in the feature space. For nonlinear problems, a kernel function was introduced to map the data into a high-dimensional space, where the problem became linearly separable. SVM was often used for problems with small sample sets or linearly inseparable data [43]. The radial basis function (RBF) kernel had advantages in modeling the nonlinear relationship between feature information and categories [44]; hence, RBF was selected as the kernel function of the SVM in this study. The optimal loss parameter and kernel parameter were searched by the grid optimization method with cross-validation.
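The SVM step above can be sketched as follows, assuming a scikit-learn workflow; the parameter grid and the synthetic data are illustrative assumptions, since the paper does not report its actual search ranges or feature matrix.

```python
# Sketch: grid search of the SVM loss parameter C and RBF kernel parameter
# gamma with 5-fold cross-validation. Synthetic data stands in for the
# 240 spectral samples with 4 moldy levels (assumption, for illustration).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=240, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Hypothetical search grid; the paper does not give the ranges used.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the (C, gamma) pair found by the grid search
```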

RF was an ensemble learning method based on the bagging algorithm that could be used for both classification and regression problems. RF had the advantages of handling high-dimensional data, strong adaptability to different data sets, and fast training [45]. In this study, the number of decision trees in the RF classifier was set to 50, and the observation results of each tree were stored.
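A minimal sketch of the RF classifier with 50 trees, again assuming scikit-learn and synthetic placeholder data:

```python
# Sketch: RF classifier with 50 decision trees; the fitted trees are
# retained in the ensemble object. Synthetic data is an assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=240, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)
print(len(rf.estimators_))  # prints 50: each trained tree is stored
```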

KNN was a commonly used classification algorithm. Its core idea was to select the k nearest neighbor samples in the feature space; if the majority of these k samples belonged to a certain category, the test sample was assigned to that category [46]. In this study, the KNN parameters were optimized automatically during training to obtain the optimal number of nearest neighbors and distance metric.
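The automatic KNN parameter optimization can be sketched as a cross-validated grid search, assuming scikit-learn; the candidate values for k and the distance metrics are hypothetical, as the paper only states that these parameters were tuned automatically.

```python
# Sketch: tuning the number of neighbors k and the distance metric for
# KNN via cross-validated grid search. Grid values and synthetic data
# are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=240, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9],
              "metric": ["euclidean", "manhattan", "chebyshev"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # optimal k and distance metric on this grid
```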

The rationality of the data set division affects the prediction performance of the classification model. To avoid the influence of an arbitrary, manual split into calibration and prediction sets, all 240 samples were first divided into 4 moldy levels based on the determined CAT activity values, giving 60 samples each in the healthy, mild, moderate, and severe levels. The 60 samples of each level were then randomly divided into calibration and prediction sets at a ratio of 3:1. Hence, 180 samples were selected as the calibration set to build the calibration model, and the remaining 60 samples were selected as the prediction set for evaluating the performance of the established model.
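The per-level 3:1 split described above amounts to a stratified random split, which can be sketched as follows (scikit-learn and the placeholder feature matrix are assumptions; labels 0–3 stand for the healthy/mild/moderate/severe levels):

```python
# Sketch: 240 samples, 60 per moldy level, split 3:1 into calibration
# and prediction sets within each level (a stratified split).
import numpy as np
from sklearn.model_selection import train_test_split

y = np.repeat([0, 1, 2, 3], 60)                       # 4 levels x 60 samples
X = np.random.default_rng(0).normal(size=(240, 20))   # placeholder spectra

X_cal, X_pred, y_cal, y_pred = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(len(y_cal), len(y_pred))  # 180 calibration and 60 prediction samples
```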

The performance of the model was evaluated from three aspects: the classification accuracy of the calibration set, the classification accuracy of the prediction set, and the degree of overfitting. Generally, a good model should have a high classification accuracy and a small difference between the calibration and prediction accuracies. The main steps of this study are shown in Figure 3.
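The evaluation criteria can be sketched as computing the accuracy on both sets and their difference as an overfitting indicator; scikit-learn, the RF model, and the synthetic data are assumptions for illustration.

```python
# Sketch: calibration accuracy, prediction accuracy, and their gap as a
# simple overfitting indicator. All data and model choices are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=240, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_cal, X_pred, y_cal, y_pred = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_cal, y_cal)
acc_cal = accuracy_score(y_cal, model.predict(X_cal))
acc_pred = accuracy_score(y_pred, model.predict(X_pred))
gap = acc_cal - acc_pred  # a large gap suggests overfitting
print(acc_cal, acc_pred, gap)
```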

**Figure 3.** The experimental scheme of the data fusion model for identification of maize with different moldy levels.
