2.2.2. Feature Variable Selection Method
Soil hyperspectral data are high-dimensional and contain invalid, redundant, and overlapping spectral information. This complexity makes inversion models of soil organic matter content built on the full band alone unstable and hard to improve in accuracy. Selecting feature variables that respond strongly to soil organic matter content from the redundant, high-dimensional wavelength set is therefore a pivotal step in spectral analysis and directly governs the performance of the prediction model. In practice, the soundness of a feature extraction scheme is judged from two perspectives: how well the selected variables, individually or in combination, explain the target variable, and how much redundancy remains among the independent variables. The goal is to balance model performance against variable redundancy, yielding a compact yet effective model. Because PCA, Lasso, and SCARS are fast and produce feature variables that are easy to interpret, this study applies these three methods to the full-band spectra to improve the robustness and precision of the model.
PCA is a widely used technique for reducing the dimensionality of sample data, with extensive application in data analysis and machine learning. Its aim is to replace the many variables of the original dataset with a smaller set that preserves most of the information in the samples, thereby reducing the feature dimension. Dimensionality reduction with PCA not only simplifies the problem but also reduces the volume of data to be processed, shortening computation times. At its core, PCA rotates the coordinate axes so that the variance of the sample points projected onto the new axes is maximized; the leading principal components are the coordinates of the sample points along the axes with the highest variance.
For a dataset of p variables and n samples, the covariance matrix $\Sigma$ is computed first. Its p eigenvalues, denoted $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and their corresponding unit eigenvectors $u_1, u_2, \dots, u_p$ are then extracted. The kth principal component can be expressed as $F_k = u_k^{T} x$, where $k = 1, 2, \dots, p$.
The number of principal components retained depends on their contribution rates and cumulative contribution rate. The contribution rate of the kth principal component, denoted $\eta_k$, is calculated as $\eta_k = \lambda_k / \sum_{i=1}^{p} \lambda_i$. Typically, a higher contribution rate signifies greater preservation of the original sample information. The cumulative contribution rate of the first m principal components is computed as $\sum_{k=1}^{m} \eta_k$. A cumulative contribution rate of 80% or more indicates that the selected principal components retain the original sample information well, and this serves as the criterion for component selection.
For the indoor spectral data of the soil samples, comprising 2151 bands (variables), the PCA method was used to select five principal components, with contribution rates of 86.99%, 5.91%, 4.88%, 1.21%, and 0.49%, respectively. Their cumulative contribution rate of 99.48% indicates excellent retention of the original sample information.
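The contribution-rate computation above can be sketched with scikit-learn (an assumed tooling choice; the random matrix below merely stands in for the 2151-band soil spectra, so its contribution rates will not match the study's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical spectral matrix: 100 soil samples x 2151 bands.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2151))

# Keep the first five principal components, as in the study.
pca = PCA(n_components=5)
scores = pca.fit_transform(X)  # sample coordinates on the new axes

# Per-component contribution rates and their cumulative sum.
contrib = pca.explained_variance_ratio_
cumulative = contrib.cumsum()

print(scores.shape)   # (100, 5)
print(contrib.shape)  # (5,)
```

The `explained_variance_ratio_` attribute gives exactly the contribution rates $\eta_k$; the 80% cumulative threshold can then be checked on `cumulative`.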
Lasso achieves feature selection by adding a penalty term, also referred to as regularization, to the loss function, so that the magnitudes of the regression coefficients enter the training and parameter-solving process. The penalty drives the coefficients of features with limited significance to exactly zero, ensuring that only the pivotal features are retained. The objective function of Lasso regression is given by Equation (1).
$$\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{T} \beta \right)^2 + \alpha \sum_{j=1}^{p} \left| \beta_j \right| \tag{1}$$
where n represents the number of samples, p the number of features, $y_i$ the target variable of the ith sample, $\beta_0$ and $\beta$ the regression coefficients, $x_i$ the feature vector of the ith sample, and $\alpha$ the regularization parameter.
In the practical execution of Lasso feature variable selection, a higher penalty coefficient yields fewer selected features. In this study, cross-validation was used to compute the root mean square error (RMSE) of the model, and the optimal penalty parameter was identified as the value at which the RMSE reaches its minimum. The candidate penalty parameters were α = [0.001, 0.002, 0.005, 0.01, 0.1, 1.0]. With the Lasso feature selection method, eleven feature wavelengths were identified from the full-band spectra at the optimal penalty parameter α = 0.001, meaning that approximately 0.5% of the total wavelengths were retained. The positions of these wavelengths, along with their regression coefficients, are presented in
Figure 3.
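The cross-validated search over the penalty grid can be sketched as follows (a minimal sketch with synthetic data; the band count and α grid match the text, but the spectra, target values, and fold count are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Hypothetical data: 60 soil spectra over 2151 bands; the organic matter
# target is synthetic, driven by a single band plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2151))
y = 0.8 * X[:, 500] + rng.normal(scale=0.1, size=60)

# Candidate penalty parameters from the study's cross-validation range.
alphas = [0.001, 0.002, 0.005, 0.01, 0.1, 1.0]
rmse = {}
for a in alphas:
    scores = cross_val_score(
        Lasso(alpha=a, max_iter=10_000), X, y,
        scoring="neg_root_mean_squared_error", cv=5)
    rmse[a] = -scores.mean()

# Retain the alpha with the smallest cross-validated RMSE, then read the
# selected wavelengths off the nonzero regression coefficients.
best_alpha = min(rmse, key=rmse.get)
model = Lasso(alpha=best_alpha, max_iter=10_000).fit(X, y)
selected = np.flatnonzero(model.coef_)  # indices of retained wavelengths
```

The number of nonzero entries in `model.coef_` corresponds to the count of selected feature wavelengths (eleven in the study).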
SCARS combines Monte Carlo sampling with the PLSR model and uses variable stability as the basis of its feature selection. Through a sequence of competitive adaptive reweighted sampling iterations, SCARS retains variables according to the absolute weights of their regression coefficients in the PLSR model. Each iteration generates a new subset via adaptive reweighted sampling (ARS), favoring variables with larger absolute regression coefficient weights, and rebuilds the PLSR model on the updated subset; the wavelength subset exhibiting the smallest RMSE is ultimately identified. Repeating these calculations yields the characteristic wavelengths. In summary, the SCARS methodology proceeds through the following steps.
Step 1: Use the Monte Carlo sampling technique to compute the stability $C_i$ of the ith wavelength variable over the M Monte Carlo samples, defined as follows:
$$C_i = \frac{\bar{\beta}_i}{s(\beta_i)}, \quad i = 1, 2, \dots, P$$
In this equation, $\bar{\beta}_i$ signifies the mean regression coefficient of the ith wavelength variable across all Monte Carlo samples, $s(\beta_i)$ denotes its standard deviation, and P represents the number of variables. Evidently, higher values of $\bar{\beta}_i$ and lower values of $s(\beta_i)$ contribute to greater stability of the ith wavelength variable.
Step 2: Use forced wavelength selection and ARS to extract a subset of wavelength variables with enhanced stability. Concurrently, use the exponentially decreasing function (EDF) to determine the ratio of retained wavelength variables to the total number of wavelengths.
During each sampling iteration, the ARS method was applied to sift wavelength variables from the subset retained in the previous iteration (i.e., from steps 1 and 2). Looping this process K times (K being the number of loops) yields a sequence of wavelength-variable subsets. A PLSR model was built on each subset and its root mean square error of cross-validation (RMSECV) computed; the subset yielding the minimum RMSECV was taken as the final set of feature variables.
The SCARS method selected 94 characteristic wavelengths from the full-band spectrum, approximately 4% of the total wavelength count. The wavelength selection process is illustrated in
Figure 4. As depicted in
Figure 4a, the count of retained wavelengths decreases as the number of SCARS iterations increases, first rapidly and then gradually. Meanwhile,
Figure 4b depicts the 10-fold RMSECV as the iteration number grows: it first declines, with minor oscillations, and then rises again. The minimum RMSECV is reached after 19 iterations, and the wavelength subset at that point is designated the selected feature wavelength set.
Because the number of features still exceeded the number of samples, risking overfitting, the PCA method was applied to the 94 feature bands initially screened with SCARS, extracting 5 principal components. Termed the SCARS-PCA features, these five components exhibit contribution rates of 83.26%, 7.88%, 6.00%, 1.57%, and 0.63%, respectively. Their cumulative contribution rate of 99.34% signifies robust retention of the original sample information.
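The SCARS-PCA chaining amounts to running PCA on the retained bands only; a minimal sketch (the band indices below are random stand-ins for the actual SCARS selection):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-ins: 60 soil spectra over 2151 bands, and 94 band
# indices playing the role of the SCARS-selected wavelengths.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2151))
selected_bands = rng.choice(2151, size=94, replace=False)

# SCARS-PCA: compress the 94 retained bands into 5 principal components,
# keeping the feature count well below the sample count.
pca = PCA(n_components=5)
scars_pca_features = pca.fit_transform(X[:, selected_bands])
print(scars_pca_features.shape)  # (60, 5)
```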