#### **4. Discussion**

#### *4.1. Band Analysis by CARS Algorithm*

The results of CARS feature selection on the SD pre-processed spectra are shown in Figure 5a. A total of 38 bands were selected as key variables from 1050 wavelength points, mainly located near 430–500, 550–600, 700–860, 1050–1080, 1900–2000, and 2350–2400 nm. To verify whether the 38 selected bands could represent the variability between uncontaminated and contaminated surface water samples, the scores of the bands were plotted, as shown in Figure 5b. The scores of the 38 bands varied widely, which supports the suitability of the bands selected by CARS. The greatest variability in the scores was found near 498 nm; this may be caused by C-H bond vibrations of aromatic hydrocarbons in this region [28,33].
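As background on how CARS can narrow 1050 wavelength points down to a few dozen, the sketch below computes the exponentially decreasing function (EDF) that sets how many variables survive each sampling run in the standard CARS formulation. It is an illustration only: the number of sampling runs (`n_runs=50`) and the function name are assumptions, not values reported in this paper, and the adaptive reweighted sampling and cross-validation steps that choose the final subset are omitted.

```python
import math

def cars_retention_schedule(n_variables, n_runs):
    """Variables kept by the EDF step at each CARS sampling run.

    The retained ratio at run i (1-based) is r_i = a * exp(-k * i), with
    a and k chosen so that r_1 = 1 (all variables kept) and
    r_N = 2 / n_variables (two variables kept in the last run).
    """
    if n_variables < 2 or n_runs < 2:
        raise ValueError("need at least 2 variables and 2 sampling runs")
    k = math.log(n_variables / 2) / (n_runs - 1)
    a = (n_variables / 2) ** (1 / (n_runs - 1))
    return [
        max(2, round(n_variables * a * math.exp(-k * i)))
        for i in range(1, n_runs + 1)
    ]

# 1050 wavelength points as in the text; 50 runs is an assumed setting.
schedule = cars_retention_schedule(n_variables=1050, n_runs=50)
print(schedule[:5], "...", schedule[-5:])
```

In the full algorithm, the subset with the lowest cross-validation error among all runs is retained, so the final band count (38 here) need not coincide with any particular value in this schedule.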

**Figure 5.** (**a**) Feature bands selected by competitive adaptive reweighted sampling (CARS) after second derivative (SD) pre-processing, (**b**) score plot of the feature bands.
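To illustrate why second derivative pre-processing is useful before band selection, the following minimal sketch applies a central second difference to a hypothetical spectrum. It assumes evenly spaced wavelength points and uses made-up values; the paper does not state how the derivative was computed (a smoothed derivative filter such as Savitzky-Golay is common in practice), so this is not necessarily the authors' procedure.

```python
def second_derivative(spectrum):
    """Central second difference with unit spacing: y[i-1] - 2*y[i] + y[i+1]."""
    return [
        spectrum[i - 1] - 2 * spectrum[i] + spectrum[i + 1]
        for i in range(1, len(spectrum) - 1)
    ]

# Hypothetical data: a small absorption peak on a sloping linear baseline.
peak = [0.0, 0.0, 1.0, 4.0, 1.0, 0.0, 0.0]
baseline = [0.5 + 0.25 * i for i in range(len(peak))]
raw = [p + b for p, b in zip(peak, baseline)]

# The offset and linear slope of the baseline vanish in the second derivative,
# so the raw spectrum and the bare peak give the same result.
print(second_derivative(raw))
print(second_derivative(peak))
```

Because any constant offset or linear drift has a zero second difference, only the curvature of the peak remains, which is the property that makes the derivative spectra easier to compare across samples.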

The chemical bonds corresponding to the main bands of the Vis-NIR region screened by CARS and the possible corresponding contamination components are shown in Table 3. Most of the bands selected by CARS lay within 400–860 nm; this may arise from the vibration of C-H and N-H chemical bonds, such as those in aromatic hydrocarbons [28,34].

**Table 3.** Basic chemical bonds, absorption wavelengths, and possible associated water pollution components of the main spectral bands screened by competitive adaptive reweighted sampling (CARS) for the visible near-infrared region.

| Wavelength (nm) | Chemical bond | Possible component |
|---|---|---|
| 1500 | C-O | Organics (aromatics) |
| 1800 | C-H | Organics |
| 2100 | N-H | Organics (amine) |
| 2400 | C-O | Organics (carbohydrates) |

#### *4.2. Implication of Proposed Strategy*

The CARS–SMOTE–PLS–DA modeling approach proposed in this paper not only improves the discrimination accuracy of the PLS–DA model but also simplifies the model input variables. When full Vis-NIR spectra are used as the input for the PLS–DA model, many spectral variables may be redundant; on the other hand, too few spectral input variables may result in the loss of COD-related information. Spectral variable selection addresses both problems by identifying the number of input variables that balances the two. Model performance can also degrade because of the large difference between the numbers of contaminated and uncontaminated surface water samples collected. To address this, the feasibility of the SMOTE algorithm for handling the uneven sample distribution was explored. PLS–DA and SMOTE–PLS–DA were experimentally verified before CARS–SMOTE–PLS–DA was applied. The discrimination accuracy improved after SMOTE balanced the sample distribution. Finally, the Vis-NIR spectra of surface water were subjected to band selection after pre-processing with four different methods. Combining the CARS selection algorithm with the SMOTE algorithm not only improved the discrimination accuracy of the model but also reduced the number of inputs to the discrimination model.
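The balancing step described above can be sketched as follows. This is a minimal pure-Python illustration of the core SMOTE idea (interpolating between a minority-class sample and one of its nearest minority-class neighbours), not the authors' implementation; the sample values, the class sizes, the neighbour count `k`, and the function name `smote` are all hypothetical.

```python
import math
import random

def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating toward neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (itself excluded)
        others = [s for s in minority if s is not base]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical data: 4 minority-class spectra (3 selected bands each)
# against an assumed majority class of 10 samples.
minority_spectra = [
    [0.10, 0.40, 0.30],
    [0.12, 0.38, 0.33],
    [0.09, 0.45, 0.28],
    [0.14, 0.41, 0.35],
]
majority_count = 10
new_samples = smote(minority_spectra, majority_count - len(minority_spectra))
print(len(minority_spectra) + len(new_samples))  # classes are now the same size
```

Each synthetic spectrum lies on a line segment between two real minority-class spectra, so the classifier sees a denser minority class without any sample being duplicated exactly.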

In this study, surface water samples were collected over a total of 4 months, covering both the rainy and non-rainy seasons in Guangzhou. COD changes with the rainy season because runoff generated by rainfall carries pollutants from land sources into the water, increasing COD. By the principle of chemical COD determination, these pollutants are all oxygen-consuming substances. The oxygen-consuming substances in surface water follow a common pattern in the rainy and non-rainy seasons, and their composition does not change substantially with the season. We carried out Vis-NIR detection on a large number of samples and used a surface water model to capture, as far as possible, the quantitative relationship between all oxygen-consuming substances and COD values. We used the CARS–SMOTE–PLS–DA model to realize online monitoring of high COD values, which provides a new means of discrimination for the management of seriously polluted surface water.

#### **5. Conclusions**

This study employed a new approach combining CARS–SMOTE–PLS–DA with Vis-NIR to determine whether surface water meets the COD standard (40 mg/L) for agricultural and general landscape use. It demonstrated the feasibility and effectiveness of introducing the CARS band selection technique and the SMOTE algorithm into Vis-NIR analysis. The CARS–SMOTE–PLS–DA modeling approach not only achieved a higher overall accuracy but also produced a simpler model. The optimal pre-processing method for all three modeling methods was SD, with PLS–DA yielding an accuracy of 88% with an input of 1050 wavelength points. Compared to the PLS–DA model, the CARS–SMOTE–PLS–DA model exhibited an 11% improvement in accuracy and a 96% reduction in wavelength input. Compared to the SMOTE–PLS–DA model, it showed a 5% improvement in accuracy and a 96% reduction in wavelength input. Overall, the surface water COD discrimination method (CARS–SMOTE–PLS–DA model) proposed in this paper is novel, eco-friendly, and simple, and has broad prospects. It offers a means of real-time online surface water COD discrimination, which is conducive to the management and development of surface water resources.
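The 96% reduction in wavelength input is consistent with the band counts reported in Section 4.1 (38 bands selected from 1050 wavelength points), as the short check below shows. The variable names are illustrative.

```python
n_full = 1050      # wavelength points in the full spectrum
n_selected = 38    # bands retained by CARS

reduction = 1 - n_selected / n_full
print(f"Reduction in wavelength input: {reduction:.1%}")
```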

**Author Contributions:** Conceptualization, X.H., J.M. and W.J.; methodology, X.H., J.M., J.C. and Y.Y.; visualization, X.H., J.C., B.X. and W.Y.; sampling, X.C.; writing—original draft, X.H.; writing—review and editing, X.C., D.X. and F.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (61975069), the Guangzhou science and technology project (202103000095), the Key-Area Research and Development Program of Guangdong Province (2020B090922006), and the Free Exploration Project of Special Research Funds for the Central Public-Interest Scientific Institution (PM-zx703-202112-338).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding authors. The data are not publicly available due to the continuation of a follow-up study by the authors.

**Acknowledgments:** The authors acknowledge the support provided by South China Institute of Environmental Sciences, Ministry of Ecology and Environment, State Environment Protection Key Laboratory of Water Environmental Simulation and Pollution Control, Guangzhou 510655, China.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

