**1. Introduction**

Sustainable development of the ecological environment is a shared requirement for human survival, and the recycling of agricultural water resources is an urgent priority. Surface water is one of the main sources of water for agricultural irrigation and an important factor affecting crop quality [1,2]. With rapid industrialization, the indiscriminate discharge of industrial wastewater has become an increasingly serious environmental problem. Large amounts of domestic garbage and industrial chemical residues flow into surface water, resulting in the deposition of a variety of harmful chemicals. This poses a serious threat to the recycling of agricultural water resources [3,4]. Accurate assessment of surface water pollution is therefore one of the means of ensuring the quality of agricultural cultivation.

Surface water pollutants are mainly organic and are generally quantitatively indicated by the chemical oxygen demand (COD). Conventional methods to test COD include the dichromate method and the permanganate index method. These methods not only require chemical reagents but also have the shortcomings of complex chemical reactions and long time periods. Moreover, they are likely to cause secondary pollution if the chemical reagents are not handled properly [5]. Hence, to protect and recycle surface water resources, it is necessary to develop a rapid, effective, and eco-friendly detection technology to accurately evaluate the degree of surface water pollution [6].

**Citation:** Han, X.; Chen, X.; Ma, J.; Chen, J.; Xie, B.; Yin, W.; Yang, Y.; Jia, W.; Xie, D.; Huang, F. Discrimination of Chemical Oxygen Demand Pollution in Surface Water Based on Visible Near-Infrared Spectroscopy. *Water* **2022**, *14*, 3003. https://doi.org/10.3390/w14193003

Academic Editor: Karl-Erich Lindenschmidt

Received: 24 August 2022; Accepted: 20 September 2022; Published: 23 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Visible near-infrared (Vis-NIR) spectroscopy is a green, non-destructive, and rapid detection technique. Combined with statistical modeling and chemometric methods, it is widely used in fields such as the ecological environment and medicine [7]. It provides rich qualitative and quantitative information on substances and is easy to apply, and it has therefore been widely used to detect various water pollution indicators [8]. Analysis with this technique mainly involves establishing a calibration model from the spectra and the conventionally measured values of the target components. Linear discriminant models such as partial least squares discriminant analysis (PLS–DA) are commonly used in spectral modeling owing to their simple structure and ease of operation [9]. PLS–DA is a classification technique based on partial least squares. Its mathematical basis is principal component analysis: a regression model between the independent variables and the categorical variable of the training samples is established from the sample information during feature selection, so that the characteristic variables related to the classification are effectively extracted [10].
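To illustrate how PLS–DA turns a regression into a classifier, the following is a minimal, dependency-free sketch rather than the implementation used in this study. It fits a single-response PLS model with the NIPALS procedure to 0/1 class codes and assigns class 1 when the predicted value exceeds a cutoff. The function names, the 0.5 cutoff, the number of components, and the toy three-variable "spectra" are all assumptions made for the example.

```python
def pls_da_fit(X, y, n_components=2):
    """Fit PLS1 (NIPALS) to 0/1 class codes; X is a list of spectra (lists)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered spectra
    f = [yi - y_mean for yi in y]                              # centered class codes
    comps = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "comps": comps}


def pls_da_predict(model, X, cutoff=0.5):
    """Predict 0/1 labels by thresholding the PLS regression output."""
    labels = []
    for row in X:
        e = [v - m for v, m in zip(row, model["x_mean"])]
        y_hat = model["y_mean"]
        for w, load, q in model["comps"]:
            t = sum(a * b for a, b in zip(e, w))
            y_hat += q * t
            e = [a - t * b for a, b in zip(e, load)]
        labels.append(1 if y_hat > cutoff else 0)
    return labels


# Toy data (hypothetical): the first variable separates the two classes.
X_train = [[1.0, 0.2, 0.1], [1.1, 0.1, 0.2], [0.9, 0.2, 0.1],
           [2.0, 0.1, 0.2], [2.1, 0.2, 0.1], [1.9, 0.1, 0.2]]
y_train = [0, 0, 0, 1, 1, 1]
model = pls_da_fit(X_train, y_train, n_components=1)
predicted = pls_da_predict(model, X_train)
```

In practice the number of components and the cutoff would be chosen by cross-validation, and an established chemometrics library would normally replace this hand-written routine.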

Two factors limit the accuracy and stability of such models. The first is unrepresentative sample data and a skewed distribution of sample categories [11]. Uneven distributions arise easily during sample collection, so the number of samples in each category is a key factor in the performance of a classification model. Common machine learning algorithms assume a balanced training set in which all categories are represented equally [12]. When this assumption does not hold, errors arise in the prediction of categories with a large number of samples, and categories with a small number of samples are prone to misclassification [13]. The second factor is redundancy in the spectral data. A model trained on the entire Vis-NIR band is often too complex and may be inefficient [14]. Some spectral variables carry irrelevant information or even noise, which may distort the true relationship between the sample information and the Vis-NIR predictors. Spectral variable selection algorithms are applied to overcome these drawbacks. The competitive adaptive reweighted sampling (CARS) algorithm is one of the most commonly used band selection strategies [15]. It extracts an optimal subset of spectral variables by eliminating unimportant variables according to their regression coefficients. However, it has not yet been validated whether this algorithm can be used with Vis-NIR to effectively discriminate whether the COD of surface water exceeds the threshold.
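The elimination step of CARS can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study. It shows only two ingredients: the exponentially decreasing schedule that sets how many variables survive each sampling run (all variables in the first run, two in the last), and the ranking of variables by absolute regression coefficient. The Monte Carlo sampling of calibration subsets, the adaptive reweighted sampling step, and the cross-validated choice of the final subset are omitted. The function names and the toy coefficients are hypothetical.

```python
import math


def cars_keep_counts(n_vars, n_runs):
    """Number of variables retained in sampling runs 1..n_runs.

    Uses the ratio r_i = a * exp(-k * i), with a and k fixed so that
    r_1 = 1 (keep everything) and r_N = 2 / n_vars (keep two variables).
    Requires n_vars > 2 and n_runs >= 2.
    """
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    k = math.log(n_vars / 2) / (n_runs - 1)
    return [max(2, round(n_vars * a * math.exp(-k * i)))
            for i in range(1, n_runs + 1)]


def keep_largest(coefs, n_keep):
    """Indices of the n_keep variables with the largest |coefficient|."""
    ranked = sorted(range(len(coefs)), key=lambda j: abs(coefs[j]), reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical example: 200 wavelengths shrunk over 50 sampling runs.
counts = cars_keep_counts(n_vars=200, n_runs=50)

# Hypothetical regression coefficients for five wavelengths.
toy_coefs = [0.02, -0.90, 0.10, 0.45, -0.01]
kept = keep_largest(toy_coefs, n_keep=2)
```

In the full algorithm, the coefficients would come from a PLS model refitted on the surviving variables in every run, and the subset with the lowest cross-validation error would be kept.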

In our previous article, we achieved quantitative prediction of surface water COD, but the predictions were poor for COD values greater than 120 mg/L [16]. In this experiment, more seriously polluted samples with COD greater than 600 mg/L were added, and qualitative discrimination was applied to achieve high-accuracy online COD discrimination, which provides new ideas for surface water quality management.

The purpose of this study was to explore the best comprehensive modeling approach of Vis-NIR to diagnose whether the COD of surface water exceeds its management value. The following objectives were considered: (1) to understand the effect of spectral preprocessing methods on the discrimination results of surface water COD; (2) to improve the distribution of sample categories using the synthetic minority oversampling technique (SMOTE) algorithm; (3) to develop a CARS-SMOTE-PLS-DA model for rapid determination of COD in surface water using the CARS band selection algorithm and the SMOTE algorithm; and (4) to determine the important wavelengths for the discrimination of surface water COD and the relevant components of surface water pollution.
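The oversampling idea behind objective (2) can be sketched in a few lines. SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified variant written for illustration, not the implementation used in this study: base samples are drawn at random rather than visited in turn, and the function name, the neighbor count `k`, the seed, and the toy data are assumptions.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic samples from a list of minority-class samples.

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbors.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Hypothetical two-variable "spectra" for the under-represented class.
minority = [[0.10, 0.80], [0.12, 0.75], [0.15, 0.82], [0.11, 0.79]]
synthetic = smote(minority, n_new=4, k=2, seed=0)
```

Oversampling of this kind should be applied to the calibration set only, so that synthetic samples never leak into the validation data.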

**2. Materials and Methods**

### *2.1. Study Area and Sample Collection*

The samples for this study were provided by the South China Research and Monitoring Analysis Center, South China Institute of Environmental Sciences, Ministry of Ecology and Environment. These samples were from Guangzhou, Guangdong, China. Surface water was collected from an inland river in Guangzhou that was often used as the water source for agricultural irrigation and the daily life of residents. A total of 127 samples were collected from 15 July to 15 October 2021. They were placed in sealed test tubes and labeled in the sampling order, and then delivered to the laboratory at room temperature on 16 October 2021. The COD value of each sample was determined in all experiments using the conventional permanganate index method [17]. The measured COD values were used for the calibration and validation of spectral analysis.

### *2.2. Chemical Analysis and Contamination Assessment*

To determine the COD content of surface water, a known amount of potassium dichromate solution was added to each of the 127 surface water samples in a strong acid medium, with silver salts as the catalyst. After boiling and refluxing, the unreduced potassium dichromate in each sample was titrated with ferrous ammonium sulfate using ferroin as the indicator. The mass concentration of oxygen consumed, calculated from the amount of potassium dichromate consumed, was taken as the COD value.
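The titration result is commonly converted to a COD value with the standard back-titration formula, sketched below. The formula itself is not stated in the text; it assumes that a reagent blank is titrated alongside each batch, so that the difference between the blank and sample titrant volumes reflects the dichromate consumed. The function name, parameter names, and numerical values are hypothetical.

```python
def cod_mg_per_l(v_blank_ml, v_sample_ml, fas_mol_per_l, sample_volume_ml):
    """COD in mg/L from a dichromate back-titration.

    v_blank_ml       : ferrous ammonium sulfate (FAS) used for the blank, mL
    v_sample_ml      : FAS used for the sample, mL
    fas_mol_per_l    : FAS concentration, mol/L
    sample_volume_ml : volume of water sample taken, mL
    The factor 8 is the molar mass of 1/4 O2 in g/mol; 1000 converts g to mg.
    """
    consumed = v_blank_ml - v_sample_ml
    return consumed * fas_mol_per_l * 8.0 * 1000.0 / sample_volume_ml


# Hypothetical titration: 24.0 mL for the blank, 22.0 mL for the sample,
# 0.05 mol/L FAS, 20.0 mL of sample.
example_cod = cod_mg_per_l(24.0, 22.0, 0.05, 20.0)
```

With these illustrative numbers the result is 40 mg/L, which happens to coincide with the Class V threshold used for classification in this study.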

The collected surface water samples were divided into two categories according to the COD threshold value (40 mg/L) required for Class V in the Environmental Quality Standards for Surface Water (GB3838-2002), which applies to surface water for agricultural use and general landscape use. Each sample was then coded as binary 0 or 1 to indicate that its COD content was below or above the threshold, respectively [18].
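The binary coding can be expressed as a one-line rule, sketched here with hypothetical COD values. How a sample measuring exactly 40 mg/L is coded is not stated in the text; the sketch assumes it is placed in class 0. Counting the labels afterwards gives a quick view of how balanced the two categories are.

```python
from collections import Counter

COD_THRESHOLD_MG_L = 40.0  # Class V limit in GB3838-2002, as cited in the text


def code_sample(cod_mg_l, threshold=COD_THRESHOLD_MG_L):
    """Return 1 if the COD exceeds the threshold, otherwise 0.

    Assumption: a value exactly equal to the threshold is coded as 0.
    """
    return 1 if cod_mg_l > threshold else 0


# Hypothetical measured COD values in mg/L.
toy_cod = [12.5, 38.0, 41.2, 650.0, 25.3]
labels = [code_sample(c) for c in toy_cod]
class_counts = Counter(labels)  # number of samples per category
```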
