The generalization ability of the DSRN mainly depends on the data set size, the degree of noise interference in the data, the model structure, the number of module layers, and hyperparameters such as the number of training iterations (EPOCH), the learning rate (LR), and the batch size (Batch_size).
Since there are only 120 samples in this experiment, and the network already contains Batch Normalization (BN) layers while adding dropout layers did not improve performance, a shallow network structure is adopted in order to reduce the degree of overfitting. At the same time, in order to improve the feature-extraction ability of each layer, large convolution kernels are adopted for feature extraction. After many experiments, the results are shown in
Table 4. It can be found that when the convolution kernel sizes are 7, 31, and 7, respectively, Test_R² is the largest and Test_RMSE and Test_MAE are the smallest; that is, the prediction effect is the best. Therefore, the 1D-DSRN model in this paper uses basic convolutional modules each containing three convolutional layers, with kernel sizes of 7, 31, and 7, respectively, to extract features, and as the network deepens, the number of channels increases through 16, 32, and 64 in turn.
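A minimal single-channel numpy sketch of this stacked-kernel design illustrates the structural point above; the random kernels are hypothetical stand-ins for learned weights, and the actual model applies 16, 32, and 64 channels per stage rather than one:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, kernel):
    # 'same'-padded 1D convolution: output length equals input length
    return np.convolve(x, kernel, mode="same")

x = rng.standard_normal(120)            # one hypothetical sensor sequence
for ksize in (7, 31, 7):                # kernel sizes used in each basic module
    k = rng.standard_normal(ksize) / ksize
    x = np.maximum(conv1d_same(x, k), 0.0)   # convolution + ReLU
print(x.shape)  # -> (120,): 'same' padding preserves the sequence length
```

The large middle kernel (31) widens the receptive field of each module without deepening the network, consistent with the shallow-structure constraint above.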
In order to enable the network to extract correct features from complex noise, this paper constructs a denoising encoder from a CDAE as a first-stage filter that also reduces the dimensionality. Secondly, the channel attention mechanism used to set the threshold in the DRSN is improved: global maximum pooling is added alongside the original global average pooling, so that the network extracts not only the most salient feature in each sample but also features of the whole sample, that is, more general features. The two pooled feature vectors are then fused to form the weight from which the threshold is set; in this way, the final threshold removes more noise, which improves the accuracy.
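The fused-pooling thresholding can be sketched as follows. This is a minimal numpy illustration of the structure described above (GAP + GMP fusion, a sigmoid scale, soft thresholding); the learned fully connected layers of the attention branch are replaced by hypothetical fixed weights `w_avg` and `w_max`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(x, tau):
    # shrinkage: values inside [-tau, tau] become 0, the rest shrink toward 0
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def shrink_with_fused_pooling(feat, w_avg=1.0, w_max=1.0, b=0.0):
    """feat: (channels, length) feature map from one residual block."""
    a = np.abs(feat)
    gap = a.mean(axis=1)                   # global average pooling (whole sample)
    gmp = a.max(axis=1)                    # global maximum pooling (salient feature)
    alpha = sigmoid(w_avg * gap + w_max * gmp + b)  # per-channel scale in (0, 1)
    tau = alpha * gap                      # per-channel soft threshold
    return soft_threshold(feat, tau[:, None])

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 120))      # hypothetical 16-channel feature map
out = shrink_with_fused_pooling(feat)
```

Because the sigmoid keeps the scale in (0, 1), the threshold never exceeds the mean absolute activation of a channel, so useful features are shrunk rather than wiped out.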
In order to improve the fitting accuracy, and because manual parameter adjustment is complex, EPOCH, LR, and BATCH_SIZE are optimized by the whale optimization algorithm (WOA). To verify the effectiveness of the WOA, network parameters set from empirically hand-tuned values of EPOCH, LR, and BATCH_SIZE are compared with those optimized by the WOA, with the determination coefficient as the evaluation index. Additionally, EPOCH and BATCH_SIZE must be integers during network operation, and BATCH_SIZE must be greater than 1 for the BN layer, whereas the candidate values generated by the WOA are floating-point numbers; the two hyperparameters are therefore rounded up to ensure integer values greater than 1. The results are shown in
Table 6. The first six rows use the empirically hand-tuned parameters, and the last row uses the algorithm-optimized parameters. It can be found that the algorithm-optimized parameters indeed perform better, and their values are ones rarely reached by empirical manual tuning.
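The rounding constraint described above can be expressed as a small helper; the function name is illustrative, not taken from the paper's code, while the rounding-up rule and the BATCH_SIZE > 1 requirement follow the text:

```python
import math

def to_valid_hyperparams(epoch_f, lr, batch_f):
    """Map the WOA's floating-point candidates onto valid network hyperparameters.

    EPOCH and BATCH_SIZE must be integers, and BATCH_SIZE must exceed 1 so
    that the BN layer can compute batch statistics; LR stays continuous.
    """
    epoch = max(1, math.ceil(epoch_f))   # round up to an integer >= 1
    batch = max(2, math.ceil(batch_f))   # round up to an integer > 1
    return epoch, lr, batch

print(to_valid_hyperparams(49.2, 1e-3, 1.3))  # -> (50, 0.001, 2)
```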
3.2.2. Comparison of Models
In order to select the most effective prediction model and improve the versatility of the soil total nitrogen olfactory detection model, three typical traditional machine learning algorithms, PLSR, SVR, and RF, were selected and verified. To ensure that the training and test sets matched those randomly generated for the deep learning methods, the train_test_split method in sklearn was used to divide the data set, with the random seed set to 99. In addition, the BPNN, representing neural networks, and classical one-dimensional regression prediction models from deep learning, such as the CNN and DSRN, are used for comparison. The results are shown in
Figure 9.
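The fixed-seed split can be reproduced as below. The seed (99) follows the text, while the feature dimension and the test_size ratio are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 120-sample data set (8 features assumed)
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 8))
y = rng.standard_normal(120)

# The same fixed seed (99) gives every model identical train/test partitions
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=99)
print(len(X_tr), len(X_te))  # -> 90 30
```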
The main parameters of the RF algorithm are the number of decision trees and the number of leaves. If the number of decision trees is too large, the computation time increases; if it is too small, the regression prediction effect is reduced. Leaves are the terminal nodes of the decision trees, and too few leaves make the model more susceptible to noise in the data [30]. In this paper, several tests showed that the best effect is achieved when the number of decision trees is 3, the minimum number of leaves is 2, and the maximum number of leaves is not limited; R² is 0.858. This relatively low evaluation index is probably because RF is better suited to classification tasks than to the specific numerical prediction required by this task.
The main influencing parameters of the SVR algorithm are the penalty factor C and the kernel function parameter g. The larger C is, the more attention is paid to the total error in the optimization process, and the stricter the requirement on error reduction: as C tends to infinity, no erroneous sample is allowed, while as C approaches 0, only a large margin is required, no meaningful solution can be obtained, and the algorithm does not converge. The g value must be greater than 0, and as g increases, the complexity of the model rises, the generalization ability worsens, and the degree of overfitting grows [31]. In this paper, many experiments showed that the best effect, R² = 0.871, is obtained when kernel = ‘poly’, C = 1, and g = 0.48. Compared with the RF algorithm, the evaluation index is improved, although the model rating is unchanged; the other indicators are all better than those of RF, indicating that the SVR model performs better and is more stable in the specific numerical prediction of this task.
The main influencing parameter of the PLSR algorithm is the number of principal components. If the number of principal components is too large, the prediction effect improves but the model overfits; if it is too small, the model complexity is reduced but the prediction effect also declines [32]. In this paper, many experiments showed that the best effect is achieved when the number of principal components is 5, giving R² = 0.873. Compared with the SVR algorithm, the prediction effect and fitting degree are improved, though only slightly. The PLSR and SVR algorithms may thus be more suitable than the RF algorithm for the numerical prediction required by this task.
The summary of the model comparison is shown in
Table 7. To sum up, among the three traditional machine learning algorithms (PLSR, SVR, and RF), PLSR has the best predictive performance (the largest R² and the smallest RMSE and MAE), SVR has the second-best predictive effect, and RF has the lowest predictive performance.
In neural networks and deep learning models, the main factors affecting performance are the model's hyperparameters, such as EPOCH, LR, and BATCH_SIZE, and the number of hidden layers in the network.
For the BPNN, the optimal effect is achieved when EPOCH is 50, LR is 1E-2, and BATCH_SIZE is 70; the prediction result, with an R² of 0.877, is shown in Figure 10. It can be seen from the figure that although its evaluation index is higher than those of the traditional machine learning algorithms, the margin is small. This is presumably because the BPNN is not a deep learning algorithm: the hidden layers in its network structure must remain shallow to reduce the overfitting caused by model overcomplexity, so it cannot automatically learn enough features to fully fit a function curve that predicts total nitrogen content [33].
By comparing the DSRN with the CNN among the deep learning algorithms, it can be found that the R² of the DSRN (0.929) is better than that of the CNN (0.907), which is also in line with the experiments of Minghang et al. [29], who found that the CNN performed worse than the DSRN under noise interference. This may be due to the presence of certain noise in the data, so denoising processing was further added to the experiment.
By comparing the CDAE-DSRN with the DSRN, it was found that the denoising effect of the CDAE trained with added Gaussian white noise was better (R² = 0.945), which proved that the denoising treatment had indeed improved the prediction effect.
Finally, the WOA was combined with the CDAE-DSRN to optimize its parameters, and the best effect, R² = 0.968, was achieved when Max_iter = 5, dim = 3, and SearchAgents = 5.
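A minimal numpy sketch of the WOA under the settings reported above (5 search agents, 5 iterations, 3 dimensions) is given below. The objective here is a toy sphere function rather than the network's validation score, the bounds are hypothetical, and the spiral constant b is fixed at 1:

```python
import numpy as np

def woa(objective, dim=3, search_agents=5, max_iter=5, lb=-1.0, ub=1.0, seed=0):
    """Minimal Whale Optimization Algorithm sketch (minimization)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (search_agents, dim))     # whale positions
    fit = np.apply_along_axis(objective, 1, X)
    best, best_fit = X[np.argmin(fit)].copy(), fit.min()
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter                  # decreases linearly 2 -> 0
        for i in range(search_agents):
            r1, r2 = rng.random(dim), rng.random(dim)
            A, C = 2 * a * r1 - a, 2 * r2
            if rng.random() < 0.5:
                if np.all(np.abs(A) < 1):             # exploit: encircle best whale
                    X[i] = best - A * np.abs(C * best - X[i])
                else:                                 # explore: follow a random whale
                    rand = X[rng.integers(search_agents)]
                    X[i] = rand - A * np.abs(C * rand - X[i])
            else:                                     # spiral update toward the best
                l = rng.uniform(-1, 1)
                X[i] = np.abs(best - X[i]) * np.exp(l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lb, ub)
            f = objective(X[i])
            if f < best_fit:
                best, best_fit = X[i].copy(), f
    return best, best_fit

# Toy usage: minimize the sphere function with Max_iter = 5, dim = 3, SearchAgents = 5
best, best_fit = woa(lambda v: float(np.sum(v * v)), dim=3, search_agents=5, max_iter=5)
```

In the hyperparameter search itself, each candidate position would be mapped to (EPOCH, LR, BATCH_SIZE), with the two integer hyperparameters rounded up as described earlier.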
In summary, among the neural network and deep learning models, the 1D-CDAE-WOA-DSRN proposed in this paper has the best prediction effect, followed by the DSRN optimized by the whale optimization algorithm (the 1D-WOA-DSRN), and then the 1D-CDAE-DSRN with the convolutional denoising autoencoder added. The DSRN without the WOA performs the same as the CNN, and the BPNN performs worst; even so, its R² is greater than 0.87, indicating that all five models have good prediction ability and outperform the traditional machine learning methods.
In order to obtain the best model and verify the stability of the model, a more intuitive method is adopted to compare the fitting results of various models. The results are shown in
Figure 10. It can be seen from the figure that the fitting degree of the CDAE-WOA-DSRN proposed in this paper is the best: the training-set and test-set points all lie near the fitting line and intersect it.
Among the three traditional learning algorithms, the RF algorithm's test-set points lie farther from the fitting line than its training-set points, showing an overfitting trend. Although the PLSR and SVR algorithms did not show a trend of overfitting, many of their test-set points lie far from the fitting line compared with those of the CDAE-WOA-DSRN, and their fitting degree is far lower.
However, although the BPNN performs better than the other three traditional algorithms, the three traditional algorithms do not fit the test-set data point of acc = 3.2 accurately, whereas both the BPNN and the CDAE-WOA-DSRN fit this point accurately; it is speculated that the manually extracted features cannot describe this data point well. In addition, the training data within 3 g·kg−1 in the training sets of the three traditional algorithms all show obvious outliers, a situation that is improved in the BPNN, but a large gap remains compared with the CDAE-WOA-DSRN, in which essentially all data points intersect the fitting line.