4.1. Sample Acquisition
The experimental samples were purchased at random from different vegetable markets in Nanjing and transported to the laboratory at Nanjing Agricultural University. To establish a robust and reliable prediction model, the inedible part of each asparagus spear was cut off and the remaining part with economic value was preserved, as shown in Figure 5.
A total of 15 asparagus spears were selected as the test set. Spears 20–25 cm long, 0–0.8 cm in diameter, and with more than three but no more than five bruises were classified as Grade 3; spears 25–30 cm long, 0.8–1.0 cm in diameter, and with more than one but no more than three bruises were classified as Grade 2; and spears longer than 30 cm, thicker than 1.0 cm, and with no more than one bruise were classified as Grade 1.
The attribute data of each group of samples were measured, and the proposed TOPSIS-based method was implemented in MATLAB (2016a). The experiment was repeated three times for each group to ensure accuracy. The scores of the verified asparagus samples are displayed in Table 7.
4.3. Model Training Results and Analysis
In machine learning, the recall ratio and precision ratio are typically used to indicate the percentage of correct predictions. The recall ratio R and precision ratio P are

R = TP/(TP + FN), P = TP/(TP + FP),

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
These two variables are generally negatively correlated, so neither alone fully characterizes a model. In practice, the F1 score, the harmonic mean of the two, is introduced:

F1 = 2PR/(P + R).

The closer the F1 score is to one, the better the prediction result.
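As a minimal sketch of these metrics, the following Python function computes precision, recall, and F1 from hypothetical counts of true positives, false positives, and false negatives (the counts below are made up for illustration, not taken from the paper's data):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=12, fp=2, fn=3)
print(round(p, 3), round(r, 3), round(f1, 3))
```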
The recall ratio R, that is, the true-positive rate, is taken as the vertical coordinate and the false-positive rate FPR as the horizontal coordinate; plotting one against the other yields the receiver operating characteristic (ROC) curve, where the false-positive rate is

FPR = FP/(FP + TN).
If the ROC curves of different models are plotted in one graph, the curve closest to the upper-left corner represents the best classification. In practice, when two curves cross, it is difficult to determine which model is better. Therefore, the area under the curve (AUC), that is, the area enclosed by the curve and the coordinate axis, is introduced: the larger the area, the better the model. In general, a model with an AUC greater than 0.85 performs well.
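The ROC/AUC construction can be sketched directly: sweep a decision threshold over classifier scores, record (FPR, TPR) points, and integrate with the trapezoid rule. The labels and scores below are invented for illustration; a perfectly separating score assignment yields AUC = 1.0:

```python
def roc_auc(labels, scores):
    """Build an ROC polyline by sweeping a threshold over the scores
    (highest first), then return the trapezoidal area under it."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))  # (FPR, TPR = R)
    # Trapezoidal area under the (FPR, TPR) polyline.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return auc

# Perfectly ranked scores give AUC = 1.0.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # → 1.0
```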
To judge the model's predictive ability on unknown data, called its generalization ability, the dataset U was randomly divided into k mutually exclusive subsets of similar size using the K-fold cross-validation method. Each time, the union of k − 1 subsets was used as the training set and the remaining subset as the test set. This yields k training/test splits, allowing the model to be trained and tested k times; the average of the k test results is returned.
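The partitioning step can be sketched in a few lines of Python (the sample size n = 100 and k = 5 mirror the paper's setup; the model-fitting step is left as a comment):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive
    subsets of (nearly) equal size, as in K-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(n=100, k=5)
for test_fold in folds:
    # Union of the other k - 1 folds is the training set.
    train = [j for f in folds if f is not test_fold for j in f]
    # fit on `train`, score on `test_fold`, then average the k scores
```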
The grading levels of the 100 samples obtained by TOPSIS were selected as the training set, and a scatterplot was made with length as the horizontal coordinate and diameter as the vertical coordinate, as shown in Figure 7.
The scatterplot shows the distribution and concentration of the asparagus. Approximately 85% of the spears are 20–30 cm long with diameters of 0.8–1.0 cm; 10% are less than 22 cm long with diameters under 0.8 cm; and the remaining 5% are longer than 30 cm with diameters over 1.0 cm. The samples are thus concentrated in the range of 20–30 cm in length and 0.8–1.0 cm in diameter.
Different models, including decision trees, discriminant analysis, SVM, and K-NN, were compared using the classification learning program in MATLAB (2016a). The parameters were set as follows: medium tree (Max Num Splits = 20), fine K-NN (Num Neighbors = 1), medium K-NN (Num Neighbors = 10), medium Gaussian SVM (Kernel Scale = 1.7, Box Constraint = 1), linear SVM (Box Constraint = 1), quadratic SVM (Polynomial Order = 2, Box Constraint = 1), and cubic SVM (Polynomial Order = 3, Box Constraint = 1). As Figure 8 shows, the SVM, linear discriminant, and fine K-NN achieved high accuracies of 96%, 94%, and 93%, respectively.
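A rough Python (scikit-learn) analogue of this MATLAB Classification Learner comparison is sketched below. The synthetic dataset merely stands in for the real length/diameter/bruises features and TOPSIS grade labels, and the parameter translations (e.g., MATLAB's Kernel Scale versus scikit-learn's gamma, Max Num Splits versus max_leaf_nodes) are approximate assumptions, not exact equivalences:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 100 graded samples (3 features, 3 grades).
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "medium tree": DecisionTreeClassifier(max_leaf_nodes=20),
    "fine K-NN": KNeighborsClassifier(n_neighbors=1),
    "medium K-NN": KNeighborsClassifier(n_neighbors=10),
    "medium Gaussian SVM": SVC(kernel="rbf", gamma=1 / 1.7**2, C=1),
    "linear SVM": SVC(kernel="linear", C=1),
    "quadratic SVM": SVC(kernel="poly", degree=2, C=1),
    "cubic SVM": SVC(kernel="poly", degree=3, C=1),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy {acc:.2f}")
```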
As shown in Figure 9, comparing the AUC values of the ROC curves reveals that the AUC of the fine K-NN is less than 0.85, whereas that of the medium Gaussian SVM is 1, indicating that the fine K-NN model performed poorly. As the prediction results in Table 8 show, the test accuracy of the linear discriminant was 80.00%, lower than that of the medium Gaussian SVM (86.67%); the generalization ability of the linear discriminant was therefore poor. Thus, the SVM method was selected.
To compare the four SVM models, the samples were trained with five-fold cross-validation. The training accuracy is displayed in Table 9.
As seen in Table 9, the training accuracy of the medium Gaussian SVM, at 96%, is higher than that of the other three methods. However, the model's predictive ability on unknown data must still be determined.
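The five-fold cross-validated comparison of the four SVM variants can be sketched with scikit-learn's `cross_val_score`; again, the synthetic data is only a placeholder for the 100 graded samples, so the printed accuracies will not match Table 9:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the graded training samples.
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

svms = {
    "linear SVM": SVC(kernel="linear", C=1),
    "quadratic SVM": SVC(kernel="poly", degree=2, C=1),
    "cubic SVM": SVC(kernel="poly", degree=3, C=1),
    "medium Gaussian SVM": SVC(kernel="rbf", C=1),
}
for name, model in svms.items():
    scores = cross_val_score(model, X, y, cv=5)  # five-fold CV
    print(f"{name}: mean CV accuracy {scores.mean():.2f}")
```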
4.4. Model Verification
After the training analysis results were obtained, 15 asparagus spears of known grade, obtained by the method above, were used as the test set. Different models were used to make predictions, and the predicted results were statistically analyzed, as listed in Table 10.
Comparing Table 9 and Table 10 shows that although the medium Gaussian SVM has a higher training accuracy than the quadratic and cubic SVM methods, its test accuracy was only 86.67%, whereas the quadratic and cubic SVMs achieved the highest test accuracy of 93.34%. The latter two therefore generalize well, which suggests that the medium Gaussian SVM may be overfitting.
To visualize the data, a parallel coordinate diagram is shown in Figure 10. It represents each variable of the high-dimensional data (length, diameter, and bruises) with a series of parallel axes; each variable's value corresponds to a position on its axis, which reflects the trends in, and relationships between, the variables. As shown in Figure 10, the classification of labels mainly depends on whether lines of the same color are concentrated. On the parallel coordinate diagrams of the quadratic SVM and cubic SVM, the values of the length attribute are concentrated between −1.5 std and 1.0 std, and the values of the diameter attribute between −1.5 std and 0.5 std. The bruises attribute has six discrete values, distributed relatively uniformly. For each attribute (length, diameter, and bruises), lines of the same color are relatively concentrated and different colors are separated by a certain distance, indicating that the three selected attributes are useful for predicting the label category. A few lines (such as the blue line in the cubic SVM chart) deviate strongly from their color group, which affects the prediction results.
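A parallel coordinate diagram of this kind can be produced with pandas; the small data frame below is illustrative only (standardized length/diameter values and bruise counts invented for the sketch, not the paper's measurements):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import pandas as pd
from pandas.plotting import parallel_coordinates

# Illustrative standardized attribute values with grade labels.
df = pd.DataFrame({
    "length":   [-1.2, -0.5, 0.3, 0.9, 1.4, -1.4],
    "diameter": [-1.0, -0.8, 0.1, 0.4, 1.2, -1.3],
    "bruises":  [4, 2, 3, 0, 1, 5],
    "grade":    ["Grade 3", "Grade 2", "Grade 2",
                 "Grade 1", "Grade 1", "Grade 3"],
})

# One polyline per sample, colored by grade.
ax = parallel_coordinates(df, class_column="grade")
ax.figure.savefig("parallel_coordinates.png")
```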
To obtain the accuracy more intuitively, the records in the dataset are summarized in matrix form according to two criteria: the true category and the category predicted by the classification model. The confusion matrix is shown in Figure 11.
The accuracy of the quadratic SVM and cubic SVM models can be read directly from the confusion matrix: the ratio of the sum of the diagonal entries to the total number of samples is the model's accuracy. In Figure 11, the accuracies of the two models are 94% and 92%, respectively.
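The diagonal-over-total computation can be sketched as follows; the 3×3 matrix below is a made-up example for the three grades, not the matrix in Figure 11:

```python
def accuracy_from_confusion(matrix):
    """Accuracy = sum of diagonal entries (correct predictions)
    divided by the total number of records."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows: true grade; columns: predicted grade (illustrative counts).
cm = [[33, 1, 0],
      [2, 31, 1],
      [0, 2, 30]]
print(accuracy_from_confusion(cm))  # → 0.94
```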
In practice, more attention is paid to the good asparagus, which affects economic value; therefore, Grade 3 is regarded as positive and Grades 1 and 2 as negative. The ROC curves are shown in Figure 12, in which the AUC values for the two models are both 1.00; that is, the models perform well.