2.3.3. Feature Selection for Spectral Data
Feature selection is typically required prior to modeling with spectral data [25]. The spectral data of the duck eggs contain information that can be exploited to identify the origin of the eggs, as well as a large amount of irrelevant data. Retaining the spectral information that can be used to identify the origin of duck eggs and deleting the unnecessary information can improve the model's detection speed and performance [26]. The successive projections algorithm (SPA) and the competitive adaptive reweighted sampling (CARS) algorithm were used in this study to extract spectral features that reflect duck egg origin information.
The SPA algorithm is a forward-loop-selection method that can minimize vector space collinearity [27]. The implementation of the SPA algorithm requires the specification of two parameters, the initial wavelength and the number of wavelengths to be extracted (a minimal sketch in Python follows this list):
(a) Before the initial iteration, let $x_j$ represent the $j$-th column of the spectrum of the training set, where $j = 1, 2, \ldots, J$, and $J$ represents the total number of wavelengths.
(b) The non-selected columns are marked as the set $S = \{ j : 1 \leq j \leq J,\ j \notin \{ k(0), \ldots, k(n-1) \} \}$.
(c) Calculate the projection of $x_j$ onto the subspace orthogonal to the most recently selected column $x_{k(n-1)}$:
$$P x_j = x_j - \left( x_j^{\mathrm{T}} x_{k(n-1)} \right) x_{k(n-1)} \left( x_{k(n-1)}^{\mathrm{T}} x_{k(n-1)} \right)^{-1}, \quad j \in S,$$
where $P$ is the projection operator and $x_j$ is the vector of remaining wavelength points after the initial wavelength points have been removed.
(d) The position of the largest projection vector from the previous step, $k(n) = \arg\max_{j \in S} \lVert P x_j \rVert$, is taken as the initial value for the next step, and the set with the smallest cross-validation mean squared error value is selected by constructing a multivariable correction model.
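The projection step above can be expressed in a few lines of NumPy. The following is a minimal sketch, not the authors' implementation; the function name `spa_select`, the initial wavelength index, and the matrix dimensions are assumptions chosen for illustration.

```python
import numpy as np

def spa_select(X, k0, n_select):
    """Minimal SPA sketch (hypothetical helper): greedily pick n_select
    columns of X, starting from column k0, by maximizing the norm of the
    part of each candidate column orthogonal to the last selected one."""
    n_rows, n_cols = X.shape
    selected = [k0]
    Xp = X.astype(float).copy()
    for _ in range(n_select - 1):
        xk = Xp[:, selected[-1]]
        norms = np.full(n_cols, -np.inf)
        for j in range(n_cols):
            if j in selected:
                continue
            xj = Xp[:, j]
            # Component of xj orthogonal to xk (the projection formula above)
            proj = xj - (xj @ xk) * xk / (xk @ xk)
            Xp[:, j] = proj          # carry the deflated column forward
            norms[j] = np.linalg.norm(proj)
        selected.append(int(np.argmax(norms)))
    return selected

# Example: pick 10 wavelengths from a 100-sample x 200-band matrix
X = np.random.rand(100, 200)
print(spa_select(X, k0=0, n_select=10))
```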
The CARS algorithm was adopted to select the wavelength points with relatively large absolute values of the regression coefficients in a partial least squares (PLS) model. Specifically, adaptive reweighted sampling was employed, the wavelength points with relatively small weights were eliminated, and cross-validation was used to select the subset with the lowest cross-validation mean squared deviation, which can effectively identify the optimal combination of variables [28]. The implementation procedure is as follows (a minimal sketch in Python follows this list):
(a) Monte Carlo sampling. To construct a partial least squares model, a certain percentage of samples are selected at random from the data set. The spectral matrix to be measured is $X \in \mathbb{R}^{n \times p}$, where $n$ is the number of samples, $p$ is the number of variables, and $T = XW$ is the component matrix. The equation of the partial least squares model is as follows:
$$y = Tc + e = XWc + e = Xb + e.$$
The regression coefficient, the bias, and the weight coefficient are denoted by $b = Wc$, $e$, and $w$, respectively. The weight $w_i = |b_i| / \sum_{i=1}^{p} |b_i|$ is defined to measure the importance of each wavelength point, and wavelength points having a weight coefficient of 0 are eliminated.
(b) After several rounds of Monte Carlo sampling, the retention rates of the wavelength variables are calculated, and the variables are then filtered by their weight coefficients to determine the optimal combination of variables, i.e., the subset for which the cross-validation mean squared deviation is minimal.
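As an illustration of the weighting step in (a), the sketch below fits a PLS model on one random Monte Carlo subset and derives the normalized wavelength weights. It is a simplified single-round outline under assumed names (`cars_round`, a sampling ratio of 0.8, and 5 PLS components), not the full CARS loop with its variable-elimination schedule.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def cars_round(X, y, sample_ratio=0.8, n_components=5):
    """One Monte Carlo round (hypothetical helper): fit PLS on a random
    subset, then weight each wavelength by its |regression coefficient|."""
    n = X.shape[0]
    idx = np.random.choice(n, size=int(sample_ratio * n), replace=False)
    pls = PLSRegression(n_components=n_components)
    pls.fit(X[idx], y[idx])
    b = np.abs(pls.coef_).ravel()   # |b_i| for each wavelength point
    w = b / b.sum()                 # normalized weight w_i
    return w                        # small-w points are candidates for removal

X = np.random.rand(120, 150)
y = np.random.rand(120)
weights = cars_round(X, y)
print(weights.argsort()[::-1][:10])  # 10 most important wavelength points
```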
2.3.4. Modeling Methods
Random forest (RF), support vector machines (SVM), and CNN were employed to establish the classification model. The spectral data of duck eggs from Sichuan, Jiangsu, and Henan were labeled as “0”, “1”, and “2”, respectively. The establishment of the classification model includes training and validation.
RF consists of several decision trees, each of which is independent of the others. Each decision tree classifies the samples, and a voting mechanism then completes the classification task. RF performs well in classification and can handle high-dimensional data effectively. The implementation process is as follows (a minimal sketch in Python follows this list):
(a) Let N be the number of samples in the training set and M the number of features. Repeat the sampling procedure with replacement N times to obtain the training set for each decision tree.
(b) Let m be the number of input features considered at each node, where m must be much smaller than M; these m features are used to establish the optimal splitting point for each decision tree node.
(c) Each decision tree is grown fully, no pruning operation is performed, and the training of the model is thereby completed.
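A random forest following steps (a)–(c) is available directly in scikit-learn. The snippet below is a minimal sketch: the placeholder data, the number of trees, and `max_features="sqrt"` (one common choice of m << M) are assumptions, not the study's tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder spectra: rows are duck-egg samples, columns are selected bands;
# labels 0/1/2 encode Sichuan, Jiangsu, and Henan as in the text.
X = np.random.rand(261, 30)
y = np.random.randint(0, 3, size=261)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-trained, unpruned trees
    max_features="sqrt",   # m features tried per split, m << M
    random_state=0,
)
rf.fit(X_tr, y_tr)
print("validation accuracy:", rf.score(X_te, y_te))
```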
SVM is proposed on the basis of statistical learning theory and structural risk minimization, and it can solve classification, regression, and distribution estimation problems. Its basic principle is to map a nonlinear problem into a high-dimensional feature space by selecting an appropriate kernel function and penalty factor and then constructing an ideal classification hyperplane. The decision function of the model is shown below [29]:
$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{n} \alpha_i^{*} y_i K(x_i, x) + b^{*} \right),$$
where $x$ denotes the feature vector, $\alpha^{*}$ denotes the optimal solution, and $b^{*}$ determines the optimal hyperplane.
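The decision function above is what a kernel SVM evaluates at prediction time. The sketch below uses scikit-learn's SVC; the RBF kernel and penalty factor C = 10 are illustrative assumptions rather than the values used in the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Placeholder data shaped like the selected duck-egg spectra (labels 0/1/2).
X = np.random.rand(261, 30)
y = np.random.randint(0, 3, size=261)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Kernel function and penalty factor control the feature-space mapping
# and the tolerance for misclassified training points.
svm = SVC(kernel="rbf", C=10.0, gamma="scale")
svm.fit(X_tr, y_tr)
print("validation accuracy:", svm.score(X_te, y_te))
```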
CNN introduces the mechanisms of local connectivity and weight sharing, which enable it to contain more hidden layers. Owing to this, CNN has a distinct advantage in handling classification problems [30]. However, CNN requires a large number of training and test samples; thus, this study reviewed the research on the application of CNN to spectroscopy. Yu et al. employed a one-dimensional convolutional neural network (1D-CNN) with 120 samples to predict pesticide residues in Hami melon [31]. Tian et al. used a 1D-CNN in combination with visible/near-infrared spectroscopy to detect freezing damage in oranges with 114 samples [32]. Bai et al. estimated soil organic carbon using a CNN and visible/near-infrared spectroscopy with 330 samples [33]. In our study, 261 duck eggs were used to construct the origin detection model; as long as the built CNN does not have a complex structure, this number of samples is sufficient.
After feature extraction by CARS or SPA, the spectral data of each duck egg are one-dimensional and not suitable for direct input to the CNN; therefore, the one-dimensional matrix must be converted into a two-dimensional matrix, and the conversion equation is as follows:
$$M = x^{\mathrm{T}} x,$$
where $x$ denotes the one-dimensional spectral data (a row vector) and $x^{\mathrm{T}}$ is the transpose of the one-dimensional spectral data. The two-dimensional spectral matrix $M$ contains the original information of the one-dimensional spectral data, reflecting the variance and covariance of the samples while adapting to the CNN's input structure.
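The conversion is simply an outer product of the spectral vector with itself. The one-liner below shows the shape change in NumPy; the band count of 30 is an assumed example, not the study's actual number of selected wavelengths.

```python
import numpy as np

x = np.random.rand(1, 30)   # one duck egg's selected spectrum (row vector)
M = x.T @ x                 # outer product: (30, 1) @ (1, 30) -> (30, 30)
print(M.shape)              # two-dimensional matrix fed to the CNN
```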
In this study, the two-dimensional spectral matrix of a duck egg is small, which makes it unsuitable for constructing a complex CNN. After numerous trials, a CNN with 3 convolutional layers, 3 batch normalization layers, 2 fully connected layers, and 1 pooling layer was constructed (Figure 2). The network's specific structure is as follows (a minimal sketch in Python follows this list):
(a) Input layer: it transforms the duck egg spectral data processed by the SPA or CARS algorithm into a two-dimensional matrix.
(b) Convolution layer 1: the data of the input layer are subjected to the two-dimensional convolution operation with a 3 × 3 convolution filter and 96 convolution kernels.
(c) Batch normalization layer 1: performing batch normalization on the data from convolution layer 1 can prevent model overfitting, after which the data are activated using the ReLU function.
(d) Max pooling layer 1: it can reduce the data dimensionality of batch normalization layer 1 and the CNN’s computational complexity. The kernel size is 2, and the stride is 2.
(e) Convolution layer 2: the data of max pooling layer 1 are subjected to the two-dimensional convolution operation with a 1 × 1 convolution filter and 192 convolution kernels.
(f) Batch normalization layer 2: batch normalization is performed on the data from convolution layer 2, after which the data are activated using the ReLU function.
(g) Convolution layer 3: the output of batch normalization layer 2 is subjected to a two-dimensional convolution with a 1 × 1 convolution filter and 384 convolution kernels.
(h) Batch normalization layer 3: batch normalization is performed on the data from convolution layer 3, after which the data are activated using the ReLU function.
(i) Fully connected layer 1: the number of nodes is 32, and all the data in batch normalization layer 3 are transformed into a one-dimensional format before being activated with the ReLU function.
(j) Fully connected layer 2: the number of nodes is three since the output of the CNN corresponds to three origins, i.e., Sichuan, Jiangsu, and Henan.
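The layer list above can be sketched in PyTorch as follows. This is a minimal illustration: the 30 × 30 input size is an assumed band count after SPA/CARS, and details such as padding are choices the text does not specify.

```python
import torch
import torch.nn as nn

class DuckEggCNN(nn.Module):
    """Sketch of the described CNN: 3 convolutional layers, 3 batch
    normalization layers, 1 max pooling layer, 2 fully connected layers."""
    def __init__(self, input_size=30):  # assumed band count after SPA/CARS
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=3, padding=1),  # conv 1: 3x3, 96 kernels
            nn.BatchNorm2d(96), nn.ReLU(),               # batch norm 1 + ReLU
            nn.MaxPool2d(kernel_size=2, stride=2),       # max pooling: size 2, stride 2
            nn.Conv2d(96, 192, kernel_size=1),           # conv 2: 1x1, 192 kernels
            nn.BatchNorm2d(192), nn.ReLU(),              # batch norm 2 + ReLU
            nn.Conv2d(192, 384, kernel_size=1),          # conv 3: 1x1, 384 kernels
            nn.BatchNorm2d(384), nn.ReLU(),              # batch norm 3 + ReLU
        )
        side = input_size // 2                           # spatial size after pooling
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # flatten to one dimension
            nn.Linear(384 * side * side, 32), nn.ReLU(), # fully connected 1: 32 nodes
            nn.Linear(32, 3),                            # fully connected 2: 3 origins
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = DuckEggCNN()
print(model(torch.randn(4, 1, 30, 30)).shape)  # -> torch.Size([4, 3])
```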
The parameters of the CNN must be determined through training, and a gradient descent method was employed to find the optimal network parameters. The cross-entropy loss was utilized to quantify the difference between predicted and actual results. The initial learning rate in this study was set to 0.0001, the mini-batch size was 4, and the maximum number of epochs was 300.
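This training configuration can be expressed as follows, reusing the `DuckEggCNN` class from the sketch above. The placeholder tensors are assumptions, and plain SGD stands in for "a gradient descent method", which the text does not further specify.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder training tensors shaped like the two-dimensional spectral matrices.
X = torch.randn(180, 1, 30, 30)
y = torch.randint(0, 3, (180,))
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)  # mini-batch size 4

model = DuckEggCNN()                                      # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)  # initial learning rate 0.0001
criterion = nn.CrossEntropyLoss()                         # cross-entropy loss

for epoch in range(300):                                  # maximum of 300 epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)                   # predicted vs. actual labels
        loss.backward()
        optimizer.step()
```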