4.1. Experiment Preparation
In this study, a total of 4000 standard tobacco leaf samples were collected from different regions of Guizhou Province and measured by the Guizhou Tobacco Science Research Institute of China. To determine the standard values of the tobacco chemical compositions, all samples were first dried in an oven at 60 °C under normal pressure for half an hour and then ground to a fixed granularity with a whirlwind grinding instrument. Next, the sample powders were sieved through a mesh. The sieved powders were then processed and analyzed by a San+Automated Wet Chemical Analyzer (Skalar, Holland), a continuous flow injection analytical instrument. The analyzer can accurately measure the values of routine chemical compositions, including nicotine, total sugar, reducing sugar, total nitrogen, potassium, chlorine and pH, using a range of different analytical methods [13].
The obtained values are used as the standard values for the experimental analysis of chemical compositions. Statistics of the seven tobacco compositions over the 4000 standard samples are shown in Table 3.
As shown in Table 3, the reference values of the seven compositions are normally distributed around their mean values (2.80, 30.42, 25.94, 2.35, 1.21, 0.36, 5.36) with the corresponding standard deviations (STD). The ranges of the seven compositions are 0.36–6.01, 5.29–51.05, 0.91–40.32, 1.00–4.45, 0.37–3.03, 0.01–1.55, and 4.53–6.01, respectively. The samples are therefore representative and cover a wide range of values. Each of these compositions bears on product quality. Nicotine has a strong effect on both the aroma and taste of tobacco products, and nicotine intake can have side effects on the human body [4]. The contents of reducing sugar and total sugar correlate with aftertaste, irritation and aroma quality, and the amount of total nitrogen correlates with smoke concentration and smoking strength [5,9]. The pH value of tobacco is the determining factor in acute toxicity and is also correlated with the total nitrogen, total alkaloids and total volatile alkali bases of tobacco [6]. The potassium content is positively related to flavor and the degree of wetness [8]. The chemical compositions of tobacco leaves jointly affect tobacco quality, and they are coupled and closely related to each other.
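The per-composition statistics summarized in Table 3 (mean, STD, and range) can be reproduced with a few lines of code. The following is a minimal sketch, assuming the reference values of one composition are available as a plain list; the function name `summarize` and the sample values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import math

def summarize(values):
    """Return mean, sample standard deviation, minimum and maximum of a list."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 in the denominator); this choice is an assumption.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical nicotine reference values, for illustration only.
nicotine = [2.1, 2.8, 3.4, 2.6, 3.1]
stats = summarize(nicotine)
print(stats)
```

Applying the same function to each of the seven compositions would give one row of summary statistics per composition.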
NIR spectra were collected with a Thermo Antaris 2 with multiple sensors (Thermo Fisher Scientific Inc., Waltham, MA, USA). The NIR chemical detector is shown in Figure 3. The spectra were collected at a resolution of 8 cm⁻¹ with 64 scans.
As shown in Figure 4, the NIR range is from 3800 cm⁻¹ to 10,000 cm⁻¹. There are significant fluctuations from 3800 cm⁻¹ to 6500 cm⁻¹. The 3800 cm⁻¹ to 4870 cm⁻¹ region mainly contains the combination bands of C-H plus C-H, N-H, and N-H plus O-H. The 3900 cm⁻¹ to 4010 cm⁻¹, 4110 cm⁻¹ to 4400 cm⁻¹, and 4400 cm⁻¹ to 4570 cm⁻¹ regions contain the combination bands of C-H, C-H plus C-H, and N-H plus O-H, respectively. The 4570 cm⁻¹ to 4870 cm⁻¹ region contains the combination bands of N-H and the first overtone of C=O plus O-H. The 5050 cm⁻¹ to 5250 cm⁻¹ region contains the combination bands of O-H and the second overtone of C=O. The 5725 cm⁻¹ to 6110 cm⁻¹ region contains the first overtones of C-H and S-H, and the 6110 cm⁻¹ to 7270 cm⁻¹ region contains the first overtones of N-H and C-H plus C-H. These NIR bands characterize the main tobacco compositions, such as nicotine, total sugar, reducing sugar and total nitrogen. In addition, potassium has a sensitive band in the spectra, and chlorine participates in photosynthesis [8]. The characteristics of potassium and chlorine are tied to the absorption of C-H, O-H and N-H, which provides a theoretical basis for determining potassium and chlorine from NIR spectra. There is thus a certain correlation between the NIR spectral data and the tobacco chemical constituents.
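The band assignments listed above can be used to pull a region of interest out of a spectrum. The following is a minimal sketch under stated assumptions: the dictionary `BANDS` simply restates the wavenumber ranges quoted in the text, while the helper `band_slice`, the 4 cm⁻¹ sampling grid, and the dummy absorbances are hypothetical and not taken from the instrument or the original study.

```python
# Band assignments quoted in the text (wavenumber ranges in cm^-1).
BANDS = {
    "C-H combination": (3900, 4010),
    "C-H plus C-H combination": (4110, 4400),
    "N-H plus O-H combination": (4400, 4570),
    "N-H combination, C=O first overtone plus O-H": (4570, 4870),
    "O-H combination, C=O second overtone": (5050, 5250),
    "C-H and S-H first overtone": (5725, 6110),
    "N-H and C-H plus C-H first overtone": (6110, 7270),
}

def band_slice(wavenumbers, absorbances, low, high):
    """Return the (wavenumber, absorbance) pairs with low <= wavenumber <= high."""
    return [(w, a) for w, a in zip(wavenumbers, absorbances) if low <= w <= high]

# Hypothetical spectrum: a 4 cm^-1 grid over 3800-10,000 cm^-1 with dummy absorbances.
wavenumbers = list(range(3800, 10001, 4))
absorbances = [0.0] * len(wavenumbers)

low, high = BANDS["O-H combination, C=O second overtone"]
region = band_slice(wavenumbers, absorbances, low, high)
print(len(region), region[0][0], region[-1][0])  # 50 5052 5248
```

Keeping the ranges in one table makes it easy to inspect or mask individual bands without hard-coding wavenumbers throughout an analysis script.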
The detailed division of the tobacco data is shown in Figure 5. The 4000 standard tobacco samples were randomly divided into training and testing sets at a 1:1 ratio (2000 samples each). During model training, the root-mean-squared error of 5-fold cross-validation is used to evaluate the network model. Accordingly, the spectra of the routine chemical constituents of tobacco were divided into calibration and validation sets at a 4:1 ratio.
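The data division described above can be sketched in code. This is a minimal illustration, not the authors' implementation: the helper names `split_half` and `k_fold` and the fixed seed are hypothetical, and applying the 4:1 calibration/validation split inside the 2000-sample training set is an assumption based on the description of 5-fold cross-validation.

```python
import random

def split_half(n_samples, seed=0):
    """Randomly split sample indices into training and testing halves (1:1)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]

def k_fold(indices, k=5):
    """Yield (calibration, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        calibration = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield calibration, validation

train_idx, test_idx = split_half(4000)
splits = list(k_fold(train_idx))
for calibration, validation in splits:
    print(len(calibration), len(validation))  # 1600 400 for each of the 5 folds
```

With 2000 training samples and five folds, each fold uses 1600 samples for calibration and 400 for validation, which matches the 4:1 ratio stated in the text.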
Training and testing on the tobacco data were performed using an NVIDIA GeForce RTX 2080 GPU and an Intel Core(TM) i7-8700 CPU with 24 GB of RAM. The neural network was built with the deep learning framework TensorFlow on the Windows 10 operating system, and the training and testing of the proposed TCCANN were run in Python 3.6.
4.3. Parameters of TCCANN
In this experiment, the total number of training rounds of TCCANN is set to 250,000. The loss value on the training set is shown in Figure 6. The ordinate represents the loss value, and the abscissa represents the number of training rounds. The red curve represents the trend of the loss value as the network training progresses.
As shown in Figure 6, the loss value of the network drops rapidly in the initial stage of training and then decreases slowly until it stabilizes. When the network training is 100,000 rounds, the training loss value is
. When the network training is 150,000 rounds, the training loss value is
. When the network training reaches 250,000 rounds, the loss value eventually decreases to
. The loss of the proposed TCCANN fluctuates considerably between 0 and 200,000 rounds of training. It finally stabilizes at a small value, which suggests that TCCANN fits the training data well.
During the training process, 5-fold cross-validation is used in the proposed model, and the training and validation samples are randomly split at a 4:1 ratio. To measure both the generalization ability and the prediction accuracy of the proposed model, RMSE and MAE are used as evaluation indexes on both the validation and test sets.
On the validation set, the mean RMSE of the chemical constituents is 0.03864 and the mean MAE is 0.02190. On the test set, the mean RMSE is 0.04134 and the mean MAE is 0.02501. Neither RMSE nor MAE differs significantly between the validation and test sets. The correlation coefficient of each of the seven chemical compositions is greater than 0.99 and close to 1 on both sets. The loss values of the seven chemical compositions on the validation and test sets show a good linear relationship. Thus, the proposed network shows no sign of overfitting or underfitting and has good generalization ability.
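For reference, the evaluation indexes used here can be written out explicitly. The following is a minimal sketch using the standard definitions of RMSE and MAE; treating the correlation coefficient as the Pearson coefficient is an assumption, since the text does not give its formula, and the measured and predicted values are hypothetical.

```python
import math

def rmse(measured, predicted):
    """Root-mean-squared error between measured and predicted values."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))

def mae(measured, predicted):
    """Mean absolute error between measured and predicted values."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)

def pearson_r(measured, predicted):
    """Pearson correlation coefficient (assumed definition of the correlation index)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_m * var_p)

# Hypothetical measured and predicted values for one composition.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(rmse(measured, predicted), mae(measured, predicted), pearson_r(measured, predicted))
```

Computing the same three quantities separately on the validation and test sets, and comparing them, is what supports the statement that the two sets behave similarly.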
To test the analytical efficiency of the proposed TCCANN, we recorded the training and testing times. With CUDA and a GPU, training on 1928 samples for 250,000 steps took 44.35 s on average, and testing 690 samples took 0.83 s on average, which corresponds to about 1.2 milliseconds per sample. The results show that, given suitable equipment and data, TCCANN is simpler to operate than the traditional chemical method, greatly improves the analysis speed, and still yields good analysis results.
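The per-sample testing cost follows directly from the two figures reported above. As a small worked check, using only the reported average testing time and sample count (the variable names are illustrative):

```python
total_seconds = 0.83   # average testing time reported in the text
n_samples = 690        # number of test samples reported in the text

# Convert the average total time into milliseconds per sample.
per_sample_ms = total_seconds / n_samples * 1000
print(f"{per_sample_ms:.2f} ms per sample")  # 1.20 ms per sample
```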
4.4. Comparison with Existing Methods
The proposed TCCANN is used to analyze complex NIR spectra and determine the chemical compositions of tobacco leaves, including nicotine, total sugar, reducing sugar, total nitrogen, potassium, chlorine, and pH value. To demonstrate its performance, TCCANN is compared with existing methods, including PLS regression [2], wavelet transformation support vector machine (WT-SVM) [11], LS-SVM [51] and variable adaptive boosting partial least squares (VAB-PLS) [18]. PLS, WT-SVM, LS-SVM, and VAB-PLS were implemented according to the original papers to carry out the comparative experiments. The four baseline models use the same settings, platform, and evaluation indicators as the proposed model. The specific data division is shown in Figure 5.
Figure 7 shows the correlations between the predicted values and the measured values of the 2000 tobacco samples in the testing set for the five models. Figure 7 has seven rows and five columns (35 scatter plots in total). Each scatter plot represents the correlation between the predicted value and the measured value of one chemical composition. The abscissa of each scatter plot represents the measured value of the chemical composition, and the ordinate represents the value predicted by the corresponding model. The first to fifth columns show the values of the seven chemical compositions obtained by PLS, WT-SVM, VAB-PLS, LS-SVM, and TCCANN, respectively. The first to seventh rows show the results for nicotine, total sugar, reducing sugar, total nitrogen, potassium, chlorine, and pH, plotted as red, light green, blue, sage, orange, purple, and dark green points, respectively.
As shown in Figure 7, the closer the points are to the diagonal, the better the fitting effect of the model. The predicted and measured values obtained by PLS (first column) and WT-SVM (second column) differ greatly, and most of the points are not evenly distributed around the diagonal. Some differences remain between the predicted and measured values of VAB-PLS (third column) and LS-SVM (fourth column), with some points deviating from the diagonal. The predicted and measured values of TCCANN (fifth column) differ only slightly, and most of the points are evenly and compactly distributed along the diagonal. There is a significant linear relationship between the predicted values and the measured values of the seven chemical compositions for the proposed TCCANN.
Figure 8 shows the results of the three evaluation indicators obtained by the five analysis models on the test set. The three sub-figures from left to right show the values of RMSE, MAE, and the correlation coefficient obtained by the five models, respectively. The abscissa of each line chart represents the seven chemical compositions, and the ordinate represents the value of the indicator. NIC, TS, RS, TN, PO, CL, PH represent nicotine, total sugar, reducing sugar, total nitrogen, potassium, chlorine, and pH value, respectively. The grey line represents the PLS model, the red line represents the WT-SVM model, the green line represents the VAB-PLS model, the blue line represents the LS-SVM model, and the orange line represents the TCCANN model.
According to the results, the RMSE and MAE values of the seven chemical compositions obtained by PLS, WT-SVM, VAB-PLS, and LS-SVM are greater than the corresponding values obtained by TCCANN. The loss values of the seven chemical compositions obtained by LS-SVM fluctuate considerably. The RMSE and MAE values obtained by TCCANN are more stable than those obtained by the other four models. According to the correlation coefficient sub-figure, the values of the seven chemical compositions obtained by TCCANN are greater than those obtained by the other four models. The TCCANN values are more stable and close to 1.
Table 4 shows the results obtained by the five analysis models on the test set. NIC, TS, RS, TN, PO, CL, PH represent nicotine, total sugar, reducing sugar, total nitrogen, potassium, chlorine, and pH value, respectively. CC represents chemical compositions. For PLS, the mean RMSE of the seven chemical compositions is 0.08961, and the minimum RMSE, obtained for total nitrogen, is 0.08201. The mean MAE is 0.07857, and the minimum MAE, obtained for chlorine, is 0.06393. The mean correlation coefficient of the seven chemical compositions is 0.72428. For WT-SVM, the mean RMSE is 0.08706, and the minimum RMSE, obtained for total nitrogen, is 0.07069. The mean MAE is 0.06764, and the minimum MAE, obtained for total nitrogen, is 0.05401. The mean correlation coefficient is 0.78283. For VAB-PLS, the mean RMSE is 0.07191, the mean MAE is 0.05107, and the mean correlation coefficient is 0.94358. For LS-SVM, the mean RMSE is 0.06507, the mean MAE is 0.05031, and the mean correlation coefficient is 0.97301. As shown in Table 4, the mean RMSE and the mean MAE of LS-SVM are lower than those of PLS, WT-SVM, and VAB-PLS, and the mean correlation coefficient of LS-SVM is higher than those of PLS, WT-SVM, and VAB-PLS. Therefore, the overall performance of the LS-SVM model is better than that of WT-SVM, PLS, and VAB-PLS.
For TCCANN, the mean values of RMSE and MAE are 0.04134 and 0.02501, respectively. The RMSE and MAE values of the seven chemical compositions are lower than the corresponding values obtained by the other four models. The correlation coefficients of the seven chemical compositions are all greater than 0.99, which is better than the other four models. These results indicate that TCCANN has good overall performance and high accuracy. As shown in Figure 8 and Table 4, TCCANN performs significantly better than the other four machine learning methods on the tobacco dataset. In the comparative experiments, the values of the evaluation indexes indicate that the generalization ability and prediction accuracy of TCCANN are superior to those of the other four methods. Thus, TCCANN is a powerful solution to the problem of determining the chemical composition of tobacco leaves.
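The comparison can be made concrete by ranking the models on the mean test-set errors quoted in the text. The following is a minimal sketch: the numbers are the mean RMSE and MAE values reported above, while the helper names `rank_by` and `relative_reduction` are hypothetical and only illustrate one way to summarize the gap between models.

```python
# Mean test-set errors over the seven compositions, as quoted in the text.
mean_metrics = {
    "PLS":     {"rmse": 0.08961, "mae": 0.07857},
    "WT-SVM":  {"rmse": 0.08706, "mae": 0.06764},
    "VAB-PLS": {"rmse": 0.07191, "mae": 0.05107},
    "LS-SVM":  {"rmse": 0.06507, "mae": 0.05031},
    "TCCANN":  {"rmse": 0.04134, "mae": 0.02501},
}

def rank_by(metric):
    """Return model names sorted from lowest to highest mean error."""
    return sorted(mean_metrics, key=lambda name: mean_metrics[name][metric])

def relative_reduction(metric, baseline, model="TCCANN"):
    """Fractional error reduction of `model` relative to `baseline`."""
    base = mean_metrics[baseline][metric]
    return (base - mean_metrics[model][metric]) / base

print(rank_by("rmse"))
print(round(relative_reduction("rmse", "LS-SVM"), 3))
```

Both rankings place TCCANN first and LS-SVM second, consistent with the ordering discussed above.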