2.5.1. SPA

The SPA is a forward variable selection method that uses simple projection operations to minimize the collinearity of variables in vector space [40]. It selects the characteristic wavelengths with the least collinearity in three phases.

First, K chains of N\_max variables each are created by QR decomposition of the spectral matrix Spec<sub>Ncal × K</sub>. The value of N\_max should lie between the minimum value defined by the data processor and the smaller of Ncal and K. Here, Ncal and K represent the number of samples in Cs and the number of wavelengths, respectively.
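The chain construction in this first phase can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses explicit successive projections in plain Python rather than a library QR routine, and the function name `spa_chain` and its arguments are hypothetical.

```python
import math

def spa_chain(spec, start, n_max):
    """Build one chain of wavelength indices, starting at column `start`.

    spec  : list of rows (samples), each a list of K spectral values
    start : index of the first wavelength in the chain
    n_max : number of wavelengths in the finished chain
    """
    n_rows, n_cols = len(spec), len(spec[0])
    # Store each wavelength as its own column vector.
    cols = [[spec[r][c] for r in range(n_rows)] for c in range(n_cols)]
    chain = [start]
    for _ in range(n_max - 1):
        ref = cols[chain[-1]]              # last selected (already projected) column
        ref_sq = sum(v * v for v in ref)
        if ref_sq == 0.0:                  # nothing left to project against
            break
        best, best_norm = None, -1.0
        for c in range(n_cols):
            if c in chain:
                continue
            # Remove the component of column c that lies along the last pick.
            scale = sum(a * b for a, b in zip(cols[c], ref)) / ref_sq
            cols[c] = [a - scale * b for a, b in zip(cols[c], ref)]
            norm = math.sqrt(sum(v * v for v in cols[c]))
            if norm > best_norm:           # keep the least collinear column
                best, best_norm = c, norm
        chain.append(best)
    return chain
```

Calling `[spa_chain(spec, k, n_max) for k in range(K)]` would produce the K chains, one per starting wavelength.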

Second, the K × N\_max candidate sets of characteristic wavelengths were compared according to the root mean square error of the validation set Vs (RMSEV). For each candidate set, the regression coefficient vector B of the PLSR model was calculated according to Equation (2), and the RMSEV of that model was calculated according to Equation (3). The set of characteristic wavelengths with the minimum RMSEV was selected.

$$\mathbf{Spec}_{c} \times \mathbf{B} = \mathbf{Ref}_{c} \tag{2}$$

$$\text{RMSEV}(j) = \sqrt{\frac{1}{N_{\text{val}}} \sum_{i=1}^{N_{\text{val}}} \left( \text{Ref}_{v}(i) - \widehat{\text{Ref}}_{v}(i) \right)^{2}} \tag{3}$$

where $\mathbf{Spec}_{c}$ refers to the set of preprocessed spectral data, which has N rows (0 < N < N\_max) and S columns (0 < S < K); $\mathbf{Ref}_{c}$ refers to the measured values of SCC corresponding to the selected N samples in Cs; $N_{\text{val}}$ is the number of samples in Vs; $\text{Ref}_{v}(i)$ refers to the measured value of SCC of sample i in Vs; and $\widehat{\text{Ref}}_{v}(i)$ refers to the predicted SCC value of sample i calculated from the selected spectral data and B.
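The comparison in this second phase can be sketched as below. This is an illustration only: fitting the PLSR model is left out, and the predictions on Vs for each candidate set are assumed to be available already. The names `rmsev` and `select_min_rmsev` are hypothetical.

```python
import math

def rmsev(ref_v, pred_v):
    """Root mean square error on the validation set, as in Equation (3)."""
    n_val = len(ref_v)
    squared = sum((r - p) ** 2 for r, p in zip(ref_v, pred_v))
    return math.sqrt(squared / n_val)

def select_min_rmsev(ref_v, predictions):
    """Return the candidate wavelength set with the lowest RMSEV.

    predictions : dict mapping a tuple of wavelength indices to the values
                  predicted on Vs by the model built on that set (assumed
                  to come from a PLSR fit performed elsewhere)
    """
    scores = {subset: rmsev(ref_v, pred) for subset, pred in predictions.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

With K chains truncated at every length up to N\_max, `predictions` would hold up to K × N\_max entries.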

Third, uninformative wavelengths were further eliminated according to the F-test. A correlation index was defined for each wavelength selected at the end of phase 2. The index was the absolute value of the product of the regression coefficient and the standard deviation of the spectral values at that wavelength. The originally selected characteristic wavelengths were rearranged in descending order of their correlation indexes. Another set of PLSR models was then established from the spectral data of the first j wavelengths and SCC, and the corresponding RMSEVs were calculated. The critical value, $t_{\text{RMSEV}}$, was calculated by the inverse of the cumulative distribution function of the F distribution, as shown by Equation (4), for which the significance value α was 0.25 and the numerator and denominator degrees of freedom were equal. The wavelengths whose RMSEVs were less than $t_{\text{RMSEV}}$ were chosen as the final characteristic wavelengths.

$$t_{\text{RMSEV}} = \frac{\text{RMSEV}(j)}{\min(\text{RMSEV}(j))} \tag{4}$$
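The ranking step and the ratio in Equation (4) can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the sample standard deviation is an assumption (the population form gives the same ordering), and the F-distribution quantile is not computed here.

```python
import statistics

def rank_by_correlation_index(spec, coeffs):
    """Order wavelengths by |regression coefficient x standard deviation|.

    spec   : list of rows (samples); columns are the wavelengths kept after phase 2
    coeffs : one regression coefficient per column of spec
    Returns column indices in descending order of the correlation index.
    """
    n_cols = len(coeffs)
    index = []
    for c in range(n_cols):
        column = [row[c] for row in spec]
        index.append(abs(coeffs[c] * statistics.stdev(column)))
    return sorted(range(n_cols), key=lambda c: index[c], reverse=True)

def rmsev_ratios(rmsev_by_j):
    """Ratio of each RMSEV(j) to the minimum RMSEV, as in Equation (4).

    Assumes the minimum RMSEV is non-zero.
    """
    lowest = min(rmsev_by_j)
    return [value / lowest for value in rmsev_by_j]
```

Here `rmsev_by_j[j - 1]` would hold the RMSEV of the model built on the first j ranked wavelengths.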
