*2.3. Spectrum Acquisition and Pre-Processing*

An XDS Rapid Content liquid grating Vis-NIR spectrometer (with a transmission attachment and a 2 mm quartz cuvette) from FOSS, Denmark was used for this study. The spectra were collected in the range of 400–2500 nm, including most of the NIR region. The wavelength sampling interval was 2 nm. An appropriate amount of sample was placed in a quartz cuvette with an optical path of 2 mm to collect the spectra. Each sample was scanned thrice in the NIR spectrometer. Then, the average spectrum of the three scans was taken as the acquisition spectrum. At the end of each acquisition, the quartz cuvette was cleaned with distilled water and dried with filter paper. The spectrum of each sample was measured at room temperature (24 ± 1 ◦C) and humidity (46% ± 1% RH).

The measured spectrum was inevitably affected by instrument noise and the ambient environment. Therefore, four spectral pre-processing methods were used for the spectra of the water samples: first derivative (FD), second derivative (SD), multiplicative scatter correction (MSC), and standard normal variate (SNV) [19].

#### *2.4. Sample Set Partitioning and Model Evaluation Parameters*

The sample set partitioning based on joint X-Y distance (SPXY) [20,21] was used. The training and test sets were partitioned with a ratio of 3:1. The training set could identify different classes of spectral patterns upon fitting the classification model, whereas the test set was used to evaluate the performance of the model. The specific partitioning results with the surface water sample information are shown in Table 1.


**Table 1.** Statistics of surface water COD values and partitioning of sample set.

Notes: COD: chemical oxygen demand; Min: minimum; Max: maximum.

#### *2.5. Evaluation of the Model Performance*

The accuracy, sensitivity, and specificity were used to evaluate the overall performance of the classification models. The classification accuracy refers to the ratio of the number of samples correctly discriminated to the total number of samples in the classification model when testing the established model using the prediction set. Sensitivity and specificity are two key metrics for the classification model that indicate the percentage of positive and negative samples correctly classified, respectively. When the accuracy, specificity, and sensitivity are closer to 1, the classification model has better performance.

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \tag{1}$$

$$\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{2}$$

$$\text{Specificity} = \frac{\text{FP}}{\text{FP} + \text{TN}} \tag{3}$$

where *TP* denotes the number of positive samples in the predication set correctly classified by the mode; *FN* denotes the number of positive samples in the test set incorrectly classified by the model; *FP* denotes the number of negative samples in the test set incorrectly classified by the model; and *TN* denotes the number of negative samples in the test set correctly classified by the model.

#### *2.6. Synthetic Minority Oversampling Technique*

When modeling algorithms are applied directly to data with uneven and unbalanced sample distribution, samples in the categories with smaller quantities are prone to misclassification, which reduces the overall accuracy [22]. Therefore, improving the discrimination accuracy of minority samples in discriminant analysis is a key issue. The numbers of collected surface water samples with COD values below and above the threshold were quite different; the number of samples with COD values above the threshold was small. Therefore, the oversampling method of the SMOTE algorithm was used to improve the sample distribution. New surface water categories were generated in the training set such that below-threshold and above-threshold samples obtained balanced observations (equal number of samples per category in the training set). The SMOTE algorithm proposed by Chawla et al. [23] is an efficient oversampling technique that can be used to avoid the overfitting that arises from the direct replication of a small quantity of samples. The SMOTE technique runs the oversampling difference by introducing synthetic examples into the spectral space and adding K-nearest neighbors. The K value was set to five to control the newly generated examples. For the original dataset, the sample training set corresponding to each pre-processing method generated a SMOTE-processed training set.

#### *2.7. Competitive Adaptive Reweighted Sampling*

CARS is a wavelength selection method that adopts the Darwinian evolution theory of "survival of the fittest". The key wavelengths selected are those with relatively large absolute coefficients in the multiple linear regression model. This selection method conducts a wavelength selection based on the exponential decay function (EDF) and then selects the key wavelengths based on the competitive wavelength selection of adaptive reweighted sampling [24,25]. The algorithm implementation is divided into the following four steps:


EDF can realize the rapid elimination and selection of wavelengths. In each sampling process, the wavelength ratio to be retained is calculated by using EDF. The calculation formula of the wavelength ratio is as follows:

$$r\_l = ae^{ki} \tag{4}$$

Among them, *a* is related to two fast constants, which are related to the number of spectral wavelengths *p* and the number of sampling runs *N* in CARS.

$$a = \left(\frac{p}{2}\right)^{\frac{1}{N-1}}\tag{5}$$

$$k = \frac{\ln\left(\frac{p}{2}\right)}{N - 1} \tag{6}$$

After forced wavelength reduction by EDF, ARS is used to imitate the principle of survival of the fittest, and wavelengths are eliminated in a competitive manner. In ARS, variables will be randomly weighted and sampled, and variables with larger weights will be selected.

The aforementioned algorithms were run in MATLAB (R2018a, Math Works, Inc., Natick, MA, USA).

#### **3. Results and Discussion**

#### *3.1. Descriptive Statistics*

The statistics of the COD values for surface water samples measured in the laboratory and the partitioning of the sample set are shown in Table 1. The PH of these samples ranges from 5.63 to 8.92. The COD values of all surface water samples varied between 4 and 688 mg/L, with a mean value of 61.98 mg/L and a median value of 27 mg/L. It was also evident that the number of samples with a COD value below the threshold value far exceeded that of samples with a COD value above the threshold value. Samples with a COD value larger than the threshold value may be influenced by human activities and natural factors related to landscape changes, such as domestic wastewater discharge. The training and test sets were divided using the SPXY method with a ratio of 3:1. A total of 95 and 32 samples in the dataset were divided into the training and test sets, respectively. The results of the division are shown in Table 1.

Figure 1a shows the distribution of COD content of surface water samples. Most COD values were between 1 and 100 mg/L. Samples with COD values greater than 40 mg/L were designated as contaminated samples and those with COD values less than 40 mg/L were uncontaminated samples. A total of 25 samples in the training set were contaminated and 70 samples were uncontaminated, whereas 14 samples in the test set were contaminated and 18 samples were uncontaminated. This indicated a large gap between the number of uncontaminated and contaminated samples in the training set, which was likely to affect the modeling results. Therefore, in the subsequent analysis, the SMOTE algorithm was used to generate new surface water categories in the training set, so that the uncontaminated and contaminated samples achieved balanced observations. The impact of the excessive gap between the number of categories on the modeling was examined by comparing this with the modeling results without using the SMOTE algorithm. Additionally, the feasibility of the SMOTE algorithm was verified. With the application of the SMOTE algorithm, the uncontaminated samples (<40 mg/L) in the training set increased from 25 to 70, forming a new training set. The numbers of uncontaminated and contaminated samples in the training set are shown in Figure 1b.

ples in the training set are shown in Figure 1b.

**Figure 1.** (**a**) Histogram of surface water COD value distribution, (**b**) distribution of training set **Figure 1.** (**a**) Histogram of surface water COD value distribution, (**b**) distribution of training set samples before and after SMOTE expansion, (**c**) t−SNE visualization results plot.

samples before and after SMOTE expansion, (**c**) t−SNE visualization results plot.

and 70 samples were uncontaminated, whereas 14 samples in the test set were contaminated and 18 samples were uncontaminated. This indicated a large gap between the number of uncontaminated and contaminated samples in the training set, which was likely to affect the modeling results. Therefore, in the subsequent analysis, the SMOTE algorithm was used to generate new surface water categories in the training set, so that the uncontaminated and contaminated samples achieved balanced observations. The impact of the excessive gap between the number of categories on the modeling was examined by comparing this with the modeling results without using the SMOTE algorithm. Additionally, the feasibility of the SMOTE algorithm was verified. With the application of the SMOTE algorithm, the uncontaminated samples (<40 mg/L) in the training set increased from 25 to 70, forming a new training set. The numbers of uncontaminated and contaminated sam-

The Vis-NIR spectral data were mapped into two-dimensional space using the t-distributed stochastic neighbor embedding (t-SNE) visualization algorithm. Thus, the variability and inherent characteristics of the Vis-NIR spectral datasets of the uncontaminated and contaminated samples could be understood more intuitively. During visualization, the t-SNE method can preserve the nonlinear structure of the spectral dataset [26]. In contrast to principal component analysis, t-SNE searches for the data structure based on a random probability distribution over the domain graph [27]. The visualization results of the surface water sample dataset upon using the t-SNE algorithm are shown in Figure 1c. The dataset forms two distinct clusters, wherein each point represents a sample, and the axes represent the first two dimensions of the t-SNE features. These t-SNE visualization The Vis-NIR spectral data were mapped into two-dimensional space using the tdistributed stochastic neighbor embedding (t-SNE) visualization algorithm. Thus, the variability and inherent characteristics of the Vis-NIR spectral datasets of the uncontaminated and contaminated samples could be understood more intuitively. During visualization, the t-SNE method can preserve the nonlinear structure of the spectral dataset [26]. In contrast to principal component analysis, t-SNE searches for the data structure based on a random probability distribution over the domain graph [27]. The visualization results of the surface water sample dataset upon using the t-SNE algorithm are shown in Figure 1c. The dataset forms two distinct clusters, wherein each point represents a sample, and the axes represent the first two dimensions of the t-SNE features. These t-SNE visualization results further validate the feasibility of using the Vis-NIR spectral technique for discriminant analysis of surface water COD.

#### results further validate the feasibility of using the Vis-NIR spectral technique for discri-*3.2. Spectral Absorption Characteristics*

minant analysis of surface water COD. *3.2. Spectral Absorption Characteristics*  Figure 2a–d present the original spectra of surface water and those after SD pre-processing for 400–2500, 1200–1500, and 1800–2200 nm, respectively. In Figure 2a, the spectra of uncontaminated and contaminated samples show similar trends and shapes. However, after SD pre-processing, the spectra show multiple peaks and troughs. Since there are large peaks and troughs near 1800 nm, the spectra after SD pre-processing were locally Figure 2a–d present the original spectra of surface water and those after SD preprocessing for 400–2500, 1200–1500, and 1800–2200 nm, respectively. In Figure 2a, the spectra of uncontaminated and contaminated samples show similar trends and shapes. However, after SD pre-processing, the spectra show multiple peaks and troughs. Since there are large peaks and troughs near 1800 nm, the spectra after SD pre-processing were locally amplified to obtain Figure 2c,d. These figures show more pronounced absorptions at 1400, 1450, and 1980 nm, which may be caused by the stretching vibrations of the O-H, C-H, and N-H bonds, respectively [28–30]. They also show that the uncontaminated and contaminated samples exhibited large differences in these three bands.

#### amplified to obtain Figure 2c,d. These figures show more pronounced absorptions at 1400, *3.3. Correlation Analysis between Wave Bands*

1450, and 1980 nm, which may be caused by the stretching vibrations of the O-H, C-H, and N-H bonds, respectively [28–30]. They also show that the uncontaminated and contaminated samples exhibited large differences in these three bands. Figure 3 shows the correlation between wavelength points. Vis-NIR is an indirect technique for rapid measurement and discrimination that requires a small amount of prepared samples and does not use harmful chemicals. It can qualitatively discriminate COD contamination in surface water using spectral absorption characteristics [31]. However, interference of instrument noise and high coincidence of information bands of various components occur during measurement. Vis-NIR also has a wide wavelength range. Therefore, there is extensive irrelevant band information. Figure 3 shows that the correlation between the 1050 wavelength points is different, with some features showing a strong correlation and others showing a weak correlation. Therefore, it is necessary to choose the appropriate wavelength band for modeling and obtain a model with high performance by removing non-informative bands.

**Figure 2.** Visible near-infrared spectroscopy plots of surface water samples: (**a**) raw spectra; (**b**) second derivative (SD) (400–2500 nm); (**c**) SD (1200–1500 nm); (**d**) SD (1800–2200 nm).

The discrimination results of the PLS–DA, SMOTE–PLS–DA, and CARS–SMOTE– PLS–DA models with different pre-processing methods for surface water samples are summarized in Table 2. The sample sets were divided using the SPXY method and saved as the training and test sets. The raw spectra were pre-processed differently. In the PLS– DA model, the spectral pre-processing had different effects on surface water pollution discrimination. FD and SD had positive impacts on the accuracy of the model. The model achieved the best prediction results after SD pre-processing. The accuracy of the training set and the accuracy, sensitivity, and specificity of the test set of the PLS–DA model were 0.88, 0.88, 0.83, and 0.93, respectively. However, the MSC and SNV pre-processing methods had a negative impact on the accuracy of the model. With either pre-processing method, the accuracy of the modeling results decreased compared to that with the original spectra. The pre-processed training and test sets were saved separately. The training set was SMOTE-processed to obtain a new training set, which was then subjected to PLS–DA. The SMOTE–PLS–DA modeling results are shown in Table 2. Compared with those of the PLS–DA model, the SMOTE–PLS–DA model accuracy with the FD, SD and MSC pre-processing methods was improved. Among them, for the FD pre-processing method, the training and test set accuracies of the model improved by 7% and 7%. For the SD method, the training and test set accuracies of the model improved by 9% and 6%. For the MSC method, the training and test set accuracies of the model improved by 12% and 3%. However, the accuracy of the SMOTE–PLS–DA model of the SNV pre-processing method was

**Figure 3.** Correlation between 1050 wavelength points.

not improved, but the sensitivity of the model was greatly improved.

**Figure 3.** Correlation between 1050 wavelength points.

### *3.4. Comparison of Classification Results*

The discrimination results of the PLS–DA, SMOTE–PLS–DA, and CARS–SMOTE– PLS–DA models with different pre-processing methods for surface water samples are summarized in Table 2. The sample sets were divided using the SPXY method and saved as the training and test sets. The raw spectra were pre-processed differently. In the PLS– DA model, the spectral pre-processing had different effects on surface water pollution discrimination. FD and SD had positive impacts on the accuracy of the model. The model achieved the best prediction results after SD pre-processing. The accuracy of the training set and the accuracy, sensitivity, and specificity of the test set of the PLS–DA model were 0.88, 0.88, 0.83, and 0.93, respectively. However, the MSC and SNV pre-processing methods had a negative impact on the accuracy of the model. With either pre-processing method, the accuracy of the modeling results decreased compared to that with the original spectra. The pre-processed training and test sets were saved separately. The training set was SMOTE-processed to obtain a new training set, which was then subjected to PLS–DA. The SMOTE–PLS–DA modeling results are shown in Table 2. Compared with those of the PLS– DA model, the SMOTE–PLS–DA model accuracy with the FD, SD and MSC pre-processing methods was improved. Among them, for the FD pre-processing method, the training and test set accuracies of the model improved by 7% and 7%. For the SD method, the training and test set accuracies of the model improved by 9% and 6%. For the MSC method, the training and test set accuracies of the model improved by 12% and 3%. However, the accuracy of the SMOTE–PLS–DA model of the SNV pre-processing method was not improved, but the sensitivity of the model was greatly improved.

**Table 2.** Summary of discrimination results of partial least squares discriminant analysis (PLS–DA), synthetic minority oversampling technique (SMOTE)–PLS–DA, and competitive adaptive reweighted sampling (CARS)–SMOTE–PLS–DA models with different pre-processing methods for surface water.


Notes: \* Pre.p: Pre-processing; The boldfaced rows indicate the best pretreatment methods and their results.

To simplify the model and further improve its prediction performance, the raw spectra were pre-processed using four different methods and subjected to feature selection. Then, the training set was processed using the SMOTE algorithm to obtain the results of CARS– SMOTE–PLS–DA, as shown in Table 2. The accuracy of the model improved after CARS feature selection. After SD pre-processing, the training and test set accuracies of the model improved by 11% and 9%, respectively, compared to those of the PLS–DA model. The sensitivity and specificity were greatly enhanced. The simplicity of the model also improved, with 1050 wavelength points being reduced to 38. The CARS algorithm further improved the model performance and simplified the model, compared to the SMOTE– PLS–DA model. The training set accuracy of the model improved and the sensitivity and specificity increased to a greater extent.

To further investigate the performance of the three models, the receiver operating characteristic (ROC) curves of the four different pre-processing methods and surface water score map were plotted and analyzed, the ROC is a comprehensive evaluation index reflecting the continuous variables of the sensitivity and specificity in the classification problem [32], as shown in Figure 4. The points of each curve in Figure 4c are closer to the upper left corner than those in Figure 4a,b, indicating that the prediction accuracy corresponding to each pre-processing method improved with the application of CARS and SMOTE algorithms. However, for the PLS–DA model, the ROC curve is closer to the dashed line after pre-processing with MSC and SNV. In other words, the model performance was reduced. For the SMOTE–PLS–DA model, the ROC curve of the original spectra is closer to the dashed line, i.e., the model performance was poorer. For the CARS–SMOTE–PLS–DA model, compared with Figure 4a, all five curves are closer to the upper left corner, whereas the ROC curves of MSC and SNV are below that of the original spectra. In other words, the MSC and SNV pre-processing methods reduced the model performance. Moreover, the ROC curve of SD is closer to the upper left corner, i.e., the modeling effect with SD pre-processing was better. *Water* **2022**, *14*, x FOR PEER REVIEW 10 of 14

**Figure 4.** Receiver operating characteristic (ROC) curves and surface water score map: (**a**,**d**) partial least squares discriminant analysis (PLS–DA) model, (**b**,**e**) synthetic minority oversampling technique (SMOTE)–PLS–DA model, (**c**,**f**) competitive adaptive reweighted sampling (CARS)–SMOTE– PLS–DA model. FD: first derivative; SD: second derivative; MSC: multiple scattering correction; **Figure 4.** Receiver operating characteristic (ROC) curves and surface water score map: (**a**,**d**) partial least squares discriminant analysis (PLS–DA) model, (**b**,**e**) synthetic minority oversampling technique (SMOTE)–PLS–DA model, (**c**,**f**) competitive adaptive reweighted sampling (CARS)–SMOTE–PLS–DA model. FD: first derivative; SD: second derivative; MSC: multiple scattering correction; SNV: standard normal variate.

SNV: standard normal variate. In Figure 4d, it can be seen that a large number of samples with a label value of 1 have scores below 0 and a number of samples are misclassified; in Figure 4e, after SMOTE, it can be seen that for samples with the label value of 1, the score has improved significantly, but there are still a number of samples with scores below 0. In order to further improve the score, we used the CARS algorithm to improve the performance of the model. In Figure 4f, we can see that only two samples with the sample label of 1 have scores below In Figure 4d, it can be seen that a large number of samples with a label value of 1 have scores below 0 and a number of samples are misclassified; in Figure 4e, after SMOTE, it can be seen that for samples with the label value of 1, the score has improved significantly, but there are still a number of samples with scores below 0. In order to further improve the score, we used the CARS algorithm to improve the performance of the model. In Figure 4f, we can see that only two samples with the sample label of 1 have scores below 0, at the same time, the scores of the samples with the label −1 are all located below 0, and the model prediction was greatly improved.

0, at the same time, the scores of the samples with the label −1 are all located below 0, and

The results of the CARS feature selection of the SD pre-processed spectra are shown in Figure 5a. A total of 38 bands were selected as key variables from 1050 wavelength points, mainly located near 430–500, 550–600, 700–860, 1050–1080, 1900–2000, and 2350– 2400 nm. To verify whether the selected 38 bands could represent the variability between uncontaminated and contaminated surface water samples, the scores of the bands were plotted, as shown in Figure 5b. There was large variability in the scores of the 38 bands; this also proved the feasibility of these bands selected by CARS. The greatest variability in the scores was found near 498 nm; this may be caused by C-H bond vibrations of aro-

the model prediction was greatly improved.

matic hydrocarbons in the vicinity [28,33].

**4. Discussion** 
