1. Introduction
Pear (Pyrus spp.) is a widely grown and consumed fruit [1,2]. There are many distinct types of pears, with varied properties [3,4]. Among them, the “Dangshan” pear is one of the most popular varieties in China [5], prized for its thin skin and juicy flesh. However, woolliness response disease of the “Dangshan” pear occurs frequently in cultivation and has caused large losses to fruit farmers. Woolliness response disease is a physiological disease [6] connected to nutrient deficiency or reduced root uptake in “Dangshan” pears; the deficient nutrients are largely boron, calcium, and water. When these nutrients are lacking or root uptake is reduced, the fruit ripens more quickly and fruit hardness decreases, which leads to the development of woolliness response disease.
To prevent the occurrence of woolliness response disease in “Dangshan” pears, effective and accurate detection methods have been researched, with strategies explored mainly around the causes of the disease. The traditional detection of mineral nutrients is based largely on laboratory physicochemical analysis, including inductively coupled plasma–mass spectrometry (ICP–MS), atomic absorption spectrometry, and UV-VIS spectrophotometry [7,8,9]. Although these approaches are quite accurate, they require destructive sampling and are time-consuming, labor-intensive, and costly, which greatly restricts the study of mineral nutrition in pear fruit. Among the numerous non-destructive testing techniques, near-infrared reflectance spectroscopy (NIRS) and computer vision systems (CVS) are commonly utilized.
Several studies have employed NIRS and CVS techniques to diagnose diseases in many agricultural products [10]. For example, YanYu et al. [11] established a tool for monitoring fruit quality by combining NIRS with chemometric models and used it to explore a generic model that, among other things, predicts the soluble solids content (SSC) of thin-skinned fruits with similar physicochemical properties. Lei-Ming Yuan et al. [12] employed vis-NIRS technology paired with a bias-fusion modeling strategy for the noninvasive assessment of “Yunhe” pears, which compensated for the loss of spectral information in the optimized PLS model. Cavaco et al. [13] proposed a segmented partial least squares (PLS) prediction model for the hardness of “Rocha” pears, combined with a vis-NIRS segmentation model, to predict fruit hardness during ripening under shelf-life conditions. Pereira et al. [14] devised a color-image-processing method combining digital photography and random forest to predict the ripeness of papaya fruits on the basis of flesh firmness. Zhu et al. [15] combined process characteristics and image information to evaluate the quality of tea leaves. Shumian Chen et al. [16] developed a machine vision system for the detection of defective rice grains.
Near-infrared spectroscopy (NIRS) detection is nondestructive, convenient, environmentally friendly, and safe [17]. Its molecular absorption coverage spans combination-frequency and overtone absorption of hydrogen-containing groups and other chemical bonds in many organic compounds, mainly C-H, N-H, S-H, and O-H. The spectral profile produced using NIR spectroscopy can therefore reflect information on the organic matter of the hydrogen-containing groups in the sample under test, as well as on the composition of several other biochemical structures [18]. Machine vision technology is likewise nondestructive, rapid, efficient, and objective [19,20]. It uses optical systems and image-processing equipment to replicate human vision: information is extracted from the acquired target image and processed to obtain and analyze the information required for the object to be detected. The image information produced by machine vision technology can therefore accurately and objectively depict the appearance features of “Dangshan” pears, which plays a significant part in detecting their woolliness response disease.
These studies detected diseases based on a single aspect, either NIRS or CVS. However, NIRS and CVS can only separately acquire the main components and the appearance of the samples and cannot obtain complete information on the quality of “Dangshan” pears. Fusing NIRS and CVS features for disease diagnosis may therefore improve the accuracy of diagnosing woolliness response disease. Feature-level fusion procedures make it possible to study sample features fully, and several studies have validated their use [21,22]. Such research has applied several forms of data-fusion algorithms to merge information from multiple detection techniques, producing better sample characterization and enhanced identification. To effectively diagnose the woolliness response disease of “Dangshan” pears, researchers have therefore concentrated on multi-technology integration to address the drawbacks of any single technique, performing data fusion by merging data from numerous different sources. For example, Miao et al. [23] devised a hybrid method to distinguish nine species of ginseng, and Fun-Li Xu et al. [24] used CVS and HSI techniques to accomplish the quick, nondestructive detection of frostbite in frozen salmon fillets.
Currently, when spectroscopic and image techniques are integrated, the spectral measurements are generally obtained with hyperspectral techniques. Hyperspectral imaging (HSI) provides images in which each pixel contains spectral information reflecting the chemical properties of a particular region [25]. However, few studies have integrated NIRS and CVS for disease detection in fruit. Hyperspectral imaging also has disadvantages such as high cost and limited accessibility, which are a considerable impediment to the development of disease diagnostic approaches for “Dangshan” pears, whereas NIRS has the advantages of low cost and ease of field use [26,27]. This work therefore combines an NIRS system and a CVS with a feature-level fusion strategy to explore a method that can increase the accuracy of disease identification in “Dangshan” pears.
This study developed a nondestructive, objective, and accurate approach for the diagnosis of woolliness response disease in “Dangshan” pears. The method combines an NIRS system and a CVS. NIRS features were extracted with machine-learning models, and CVS features were extracted with convolutional neural networks. We then compared the classification results obtained using only NIRS features, only CVS features, and a fusion of NIRS and CVS features to determine the best method for identifying “Dangshan” pear woolliness response disease, and we explored and analyzed the effects of fusing different CNN layers in the CVS feature model. This study intends to provide a theoretical basis and innovative concepts for developing new technologies for the woolliness response disease of “Dangshan” pear.
The remainder of this study is organized into three sections. Section 2 describes the spectral and image-data-acquisition methods for “Dangshan” pears and the accompanying machine-learning, deep-learning, and feature-level fusion approaches. Section 3 covers the performance of the single-feature and feature-level fusion models and evaluates and analyzes the fusion effects of different layers of the convolutional neural network. Section 4 presents the conclusions of this investigation.
3. Results and Discussion
3.1. Division of the Training and Validation Sets
In this study, the initial data comprised two classes: healthy and diseased. Table 2 displays the number of samples in each class, with 480 samples in total. The sample set was divided at a 7:3 ratio: 70% (336 samples) was used to train the network, and the remaining 30% (144 samples) was used to validate the trained network. The experimental device has 32 GB of memory, and the experiments used the TensorFlow framework to build the deep-learning model structures, running on an NVIDIA RTX 3060 GPU in a Python environment.
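The 7:3 split described above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the 240/240 class balance and 100-dimensional placeholder features are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the 480 samples; the 240/240 class balance
# is assumed here for illustration -- the paper only reports 480 samples total.
X = np.zeros((480, 100))              # e.g., flattened NIRS/CVS feature vectors
y = np.array([0] * 240 + [1] * 240)   # 0 = healthy, 1 = diseased

# Stratified 7:3 split -> 336 training and 144 validation samples.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_train.shape[0], X_val.shape[0])  # 336 144
```

Stratification keeps the healthy/diseased ratio identical in both subsets, which matters when class-weighted metrics are compared across models.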
3.2. No-Fusion Separate Modeling Evaluation
The number of hidden-layer nodes in an MLP is a structurally sensitive parameter: too few nodes may lead to underfitting, while too many might lead to overfitting [32]. Therefore, we traversed ten settings of hidden-layer nodes, from 10 to 100 in steps of 10, and developed an MLP model for each. Each model is denoted MLP_X, where X specifies the number of hidden-layer nodes. The validation-set results of the classification models with varied numbers of hidden-layer nodes are given in Figure 6. Comparing the MLP classification models built with different numbers of hidden-layer nodes shows that MLP_90, built with 90 hidden-layer nodes, had the best fit.
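The node-count traversal can be sketched as below; the synthetic data generated with `make_classification` and the `max_iter` setting are stand-ins, since the paper's spectra and training configuration are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data standing in for the NIRS spectra (dimensions assumed).
X, y = make_classification(n_samples=480, n_features=100, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Traverse hidden-layer sizes 10, 20, ..., 100 and record validation accuracy.
scores = {}
for n_nodes in range(10, 101, 10):
    mlp = MLPClassifier(hidden_layer_sizes=(n_nodes,), max_iter=500, random_state=0)
    mlp.fit(X_train, y_train)
    scores[f"MLP_{n_nodes}"] = mlp.score(X_val, y_val)

best = max(scores, key=scores.get)  # on the paper's data this was MLP_90
print(best, scores[best])
```

The dictionary keys mirror the paper's MLP_X naming, so the traversal result maps directly onto the curves in Figure 6.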
To validate the feasibility of diagnosing woolliness response disease in “Dangshan” pears from spectral features, we also developed models with a support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), and other machine-learning methods. A grid-search approach was utilized to select the best hyperparameters for these machine-learning models. The final parameters of the machine-learning models are presented in Table 2.
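A grid search of the kind described can be sketched with scikit-learn's `GridSearchCV`; the SVM parameter grid below is illustrative only and is not the grid actually searched in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data for the spectral features.
X, y = make_classification(n_samples=480, n_features=100, random_state=0)

# Hypothetical search space; each combination is scored by 5-fold cross-validation.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01], "kernel": ["rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)  # hyperparameters with the best cross-validated accuracy
```

The same pattern applies to RF, XGBoost, and AdaBoost by swapping the estimator and grid; the winning combination per model is what a table of final parameters would report.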
The validation-set results for all machine-learning models are provided in Table 3. Comparing these results shows that the MLP had the best modeling performance. The best MLP model, MLP_90, achieved accuracy of 0.611, precision of 0.614, recall of 0.611, and F1 of 0.608 on the validation set, significantly outperforming the other machine-learning models. Among the classification models generated with varied numbers of hidden-layer nodes, MLP_90 was the best.
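Because accuracy and recall coincide in the reported figures, the precision/recall/F1 values appear to be class-weighted averages; that interpretation is an assumption, and the sketch below shows how such metrics would be computed on illustrative labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative predictions only; 0 = healthy, 1 = diseased.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 1, 0])

acc = accuracy_score(y_true, y_pred)
# average="weighted" averages per-class scores weighted by class support,
# which makes weighted recall equal to accuracy, as in the paper's tables.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```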
For image-feature extraction, this paper compares VGG16, VGG19, ResNet50, ResNet101, Xception, and DenseNet201 to explore and analyze the optimal image-feature classification model. Transfer-learning methods such as layer freezing and fine-tuning were employed to extract significant features; in this study, fine-tuning was used to enhance the efficacy of each CNN architecture, replacing the last layer of the pretrained model with a newly trained classification layer. The training accuracy curves for the six image-feature classification models are given in Figure 7a, with DenseNet201 converging fastest and VGG19 slowest. The training loss curves in Figure 7b reveal that DenseNet201 had the lowest training loss. The validation accuracy curves are given in Figure 7c, with ResNet101 converging fastest and VGG19 slowest. The validation loss curves in Figure 7d demonstrate that ResNet50 had the lowest validation loss.
The results of the six image-feature classification models on the validation set are reported in Table 4. The Xception model outperformed the other models on the validation set in terms of accuracy (0.840), precision (0.879), recall (0.840), and F1 (0.836), making it the best of the six convolutional neural network methods in this paper.
In summary, this section validates the feasibility of both spectral and image features for recognizing woolliness response disease in “Dangshan” pears. The classification results produced with the two feature types were compared and analyzed. Among the spectral-feature classification models, the MLP was the best, with the highest accuracy, precision, recall, and F1; of the MLP models built with varied numbers of hidden-layer nodes, MLP_90 had the highest validation accuracy.
The extraction of spectral features from the spectral data used in this study was hampered by light-scattering effects [44]. Although effective preprocessing can essentially eliminate light scattering, finding the best spectral preprocessing method for each model is a complex process. To sidestep this selection problem, the algorithm proposed in this paper uses the spectral data without preprocessing: spectral features are extracted directly from the raw spectra and then fused with the image features for modeling, enabling an end-to-end approach. Partly for this reason, all six image-feature classification models outperformed the spectral-feature classification model on the validation set. The training accuracy curves of the six models were roughly similar, with training accuracy exceeding 90%, and the validation accuracy of all six models was above 75%. A detailed examination of the results shows that Xception achieved the best validation accuracy while maintaining high training accuracy and low training loss. This implies that Xception is the best of the six image-feature classification models and has the greatest learning ability for identifying woolliness response disease in “Dangshan” pears.
3.3. Modeling of Spectral and Image Fusion Features
In this study, the image feature vector was first extracted using the CNN feature-extraction method, and the spectral feature vector was extracted using the NIRS feature-extraction method. The two feature vectors were then concatenated into a single multidimensional vector, which was used as the input to the prediction layer. The output of the prediction layer was treated as the score for the two pear classes, and the class with the greatest score was taken as the predicted class. The spectral-feature model used the MLP to extract NIR spectral features, and the image-feature models used DenseNet201, ResNet50, ResNet101, VGG16, VGG19, and Xception transfer-learning models to extract image features.
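The concatenation step can be sketched in a framework-agnostic way; the feature dimensions below are assumptions for illustration (e.g., a 2048-d CNN pooling output and a 64-d MLP hidden representation), not values reported in the paper.

```python
import numpy as np

# Per-sample feature vectors, assumed already extracted by the two branches.
rng = np.random.default_rng(0)
spec_feats = rng.random((480, 64))    # NIRS features from the MLP branch
img_feats = rng.random((480, 2048))   # CVS features from the CNN branch

# Feature-level fusion: concatenate along the feature axis to form the
# multidimensional vector fed to the prediction layer.
fused = np.concatenate([spec_feats, img_feats], axis=1)
print(fused.shape)  # (480, 2112)
```

The fused matrix then feeds a small prediction head that outputs one score per class, with the larger score deciding healthy versus diseased.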
This study compared and analyzed the model effects of combining different spectral and image models for fusion modeling. In this case, the spectral models were MLP models built with different numbers of hidden layer nodes, and the image models were different convolutional neural classification networks. In this paper, ten different models of MLP (MLP_10, MLP_20, MLP_30, MLP_40, MLP_50, MLP_60, MLP_70, MLP_80, MLP_90, and MLP_100) were selected to determine the most suitable MLP models for fusion.
The classical ResNet and VGG architectures were then used to initially identify the most suitable MLP model for fusion. Training accuracy is not necessarily positively correlated with the number of model layers, because as network depth increases, accuracy saturates and can then degrade. Therefore, VGG16 was preferred over VGG19, and ResNet50 over ResNet101. In this subsection, VGG16 and ResNet50 were each fused with the different MLP models, and the MLP in the best-performing combination was taken as the most suitable. The accuracy and F1 of the fusion models of the two convolutional networks with the MLPs are shown in Figure 8. The results show that MLP_30_VGG16 and MLP_30_ResNet50 had the best validation results, with the highest accuracy and F1. Therefore, MLP_30, with 30 nodes in the hidden layer, is the most suitable model for feature fusion.
After the ideal number of hidden-layer nodes for the MLP was determined to be 30, the optimal combination of NIR reflectance spectral features and image features for fusion modeling was further examined. The training accuracy curves of the six fusion models are given in Figure 9a, with MLP_30_VGG19 converging fastest and MLP_30_VGG16 converging slowest and reaching the worst value. The training loss curves in Figure 9b show that MLP_30_VGG19 had the smallest training loss. The validation accuracy curves are given in Figure 9c, with MLP_30_ResNet101 converging fastest and MLP_30_DenseNet201 slowest. The validation loss curves in Figure 9d reveal that MLP_30_Xception had the lowest validation loss.
The results of the six fusion models on the validation set are reported in Table 5. The best results were obtained by MLP_30_Xception, with the highest accuracy (0.972), precision (0.974), recall (0.972), and F1 (0.972). MLP_30_ResNet101 also performed well, with accuracy of 0.965, precision of 0.966, recall of 0.965, and F1 of 0.965. These results indicate that the combination of the MLP and ResNet101 is a strong combination and that the combination of the MLP and Xception is the best combination.
3.4. Optimization of Fusion Models for Different Depth Feature Layers of Visual Images
The MLP shows outstanding performance in fusion modeling with ResNet101 and Xception. However, image features extracted from different layers of a network yield fusion models with different performance. Therefore, this study used ResNet101 and Xception to further evaluate fusion modeling with features taken from different network depths. Five alternative sets of layers of ResNet101 (layer 1 through layer 5) were selected for feature-level fusion with MLP_30, and five models were built to find the best fusion layer of ResNet101; these models are named MLP_30_ResNet101_X, where X is the ResNet101 layer. Likewise, three alternative sets of layers of Xception (Entry flow, Middle flow, and Exit flow) were fused with MLP_30 at the feature level, and three models were built to find the best fusion layer of Xception; these models are named MLP_30_Xception_Y, where Y is the Xception flow.
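Tapping features at different depths amounts to truncating the forward pass at a chosen layer. The toy sketch below illustrates the idea with a stack of dense transforms standing in for the CNN stages; the layer shapes are invented for the example and do not correspond to ResNet101's actual dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy stand-in for a CNN backbone: five stacked transforms playing the role
# of ResNet101's layer 1..layer 5 (shapes are illustrative only).
dims = [(32, 64), (64, 128), (128, 256), (256, 512), (512, 1024)]
weights = [rng.standard_normal((i, o)) * 0.1 for i, o in dims]

def features_at_depth(x, depth):
    """Run the forward pass truncated after `depth` layers; the resulting
    activation is the feature vector fused with the spectral features."""
    for w in weights[:depth]:
        x = np.maximum(x @ w, 0.0)  # ReLU
    return x

x = rng.standard_normal((4, 32))      # a batch of 4 samples
f2 = features_at_depth(x, 2)          # shallow features ("layer 2")
f5 = features_at_depth(x, 5)          # deepest features ("layer 5")
print(f2.shape, f5.shape)             # (4, 128) (4, 1024)
```

In Keras, the equivalent operation is building a sub-model such as `Model(inputs=base.input, outputs=base.get_layer(name).output)`, which returns the named layer's activation for any input.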
The training accuracy curves of the five ResNet101 layer-fusion models are given in Figure 10a; all five models obtained training accuracies above 85%. The validation accuracy curves in Figure 10b show that MLP_30_ResNet101_layer5 had better recognition results than the other four models. The training accuracy curves for the three Xception flow-fusion models are presented in Figure 10c; all three models had training accuracies over 95%. The validation accuracy curves in Figure 10d show that MLP_30_Xception_Exitflow had better recognition results than the other two models.
The results of fusing features from different layers of the convolutional networks with the spectral features are displayed in Table 6. Comparing the feature-level fusion of MLP_30 with the five layers of ResNet101 and the three flows of Xception, MLP_30_ResNet101_layer5 had the highest accuracy (0.917), precision (0.920), recall (0.951), and F1 (0.917) among the ResNet101 models, and MLP_30_Xception_Exitflow had the highest accuracy (0.951), precision (0.956), recall (0.951), and F1 (0.951) among the Xception models. The results show that MLP_30_ResNet101_layer5 had better recognition than the other four ResNet101 models, while MLP_30_Xception_Exitflow had better recognition than the other two Xception models.
In summary, this section verifies that models created with different layers of ResNet101 and Xception perform differently. MLP_30_ResNet101_layer5 had the best performance, with the highest validation accuracy, among the five ResNet101 layer-fusion models, and MLP_30_Xception_Exitflow had the best performance, with the highest validation accuracy, among the three Xception flow-fusion models.
3.5. Optimal Model Analysis and Comparison
For the comparison in this section, we selected the optimal spectral- and image-feature classification models, MLP_90 and Xception, and the two best fusion models of the NIR spectral and image features, MLP_30_ResNet101_layer5 and MLP_30_Xception_Exitflow. The accuracy comparison of the four models on the validation set is provided in Table 7. Among these, the MLP_30_Xception_Exitflow model, which uses the feature-level fusion technique, exhibited the best performance, with the greatest accuracy (0.951), precision (0.956), recall (0.951), and F1 (0.951).
The classification confusion matrices for the four optimal models are presented in Figure 11. The results show that the MLP_30_Xception_Exitflow network model had the highest validation accuracy. In the MLP_90 model, 22 diseased “Dangshan” pear samples were incorrectly predicted as healthy, and 34 healthy samples were incorrectly predicted as diseased. In the Xception model, 23 healthy samples were incorrectly predicted as diseased. In the MLP_30_ResNet101_layer5 model, three diseased samples were incorrectly predicted as healthy, and nine healthy samples were incorrectly predicted as diseased. In the MLP_30_Xception_Exitflow model, seven samples were incorrectly predicted as healthy.
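The misclassification counts above are the off-diagonal entries of each model's confusion matrix. A minimal sketch with scikit-learn, using tiny illustrative labels that do not reproduce the counts in Figure 11:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only (0 = healthy, 1 = diseased).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Rows are true classes, columns are predicted classes; the off-diagonal
# entries are the misclassification counts discussed in the text.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]
```

Here one healthy sample is predicted as diseased (row 0, column 1) and one diseased sample as healthy (row 1, column 0), mirroring how the per-model error counts above are read off Figure 11.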
In summary, among the four best models, MLP_90, Xception, MLP_30_ResNet101_layer5, and MLP_30_Xception_Exitflow, the MLP_30_Xception_Exitflow model, employing the feature-level fusion technique, obtained the best classification results, with the highest accuracy (0.951), precision (0.956), recall (0.951), and F1 (0.951). The combination of the MLP classification model and the Xception convolutional neural classification network, fusing the separately extracted NIR spectral features and image features, was the best combination.
4. Conclusions
The fast and precise diagnosis of “Dangshan” pear woolliness response disease is vital, as this physiological disease has a substantial impact on the quality of “Dangshan” pears. This research indicates that it is feasible to apply near-infrared reflectance spectroscopy (NIRS) features and image features to diagnose woolliness response disease in “Dangshan” pears. The experiments first acquired information on the chemical composition and appearance of the samples via the NIRS and CVS techniques, respectively, and then used machine-learning and deep-learning methods for diagnostic classification. The findings imply that, compared to single-feature models, the feature-level fusion technique can exploit the complementary advantages of NIRS and CVS features to gain more comprehensive sample information and consequently improve the accuracy of recognizing “Dangshan” pear disease. To explore the effect of different depths of visual-image feature layers on fusion modeling, experiments were then performed fusing CVS features extracted from different layers of the convolutional neural networks with the NIRS features. The results show that fusing the deepest feature layer of the visual image yields the most accurate classification. In this study, the combination of the MLP classification model and the Xception convolutional neural classification network, fused with the separately extracted NIR spectral features and image features, was the best combination, with the highest accuracy (0.972), precision (0.974), recall (0.972), and F1 (0.972) among all models. In summary, the fusion of NIRS features and image features for identifying “Dangshan” pear woolliness response disease is a promising method for disease diagnosis, provides a broad perspective for data fusion in agricultural disease diagnosis, and can offer new ideas for fast, reliable, and nondestructive quality-control instruments for various agricultural products.