*4.2. Advantages and Innovation of RFs in the Construction of the WQImin Model*

Previous studies generally developed WQImin models by stepwise multiple linear regression [3,10], selecting the important water quality indicators on the basis of R<sup>2</sup>, MSE, and percentage error (PE) values. Compared with those studies, the WQI values in the present study were widely distributed, which made the model relatively difficult to construct. The WQImin obtained with stepwise regression did not perform well on the testing set, where PE exceeded 10% [10].
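As an illustration of the three evaluation criteria named above, the following is a minimal sketch rather than the code used in this study. It assumes that PE is the mean absolute deviation of the predicted WQI from the observed WQI, expressed as a percentage of the observed value; the function names and the example values are hypothetical.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_squared_error(observed, predicted):
    """Mean of the squared residuals."""
    return fmean([(o - p) ** 2 for o, p in zip(observed, predicted)])


def percentage_error(observed, predicted):
    """Mean absolute error relative to the observed value, in percent.

    Assumption: this definition of PE is illustrative and may differ
    from the one used in the cited studies.
    """
    return 100.0 * fmean([abs(p - o) / o for o, p in zip(observed, predicted)])


# Hypothetical WQI values, used only to exercise the functions.
observed_wqi = [70.0, 80.0, 60.0, 90.0]
predicted_wqi = [72.0, 78.0, 63.0, 87.0]

r2 = r_squared(observed_wqi, predicted_wqi)
mse = mean_squared_error(observed_wqi, predicted_wqi)
pe = percentage_error(observed_wqi, predicted_wqi)
```

Under this definition, a candidate WQImin model would be rejected on the PE criterion when `pe` is greater than 10.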

In recent years, ML has shown excellent performance in regression tasks and has attracted increasing attention in academia and industry. The RF-based WQIRFmin model in this study performed better and yielded more stable results than the traditional stepwise multiple linear regression method (Figure S1). Some recent research has focused on combining ML with individual water quality indicators. Chen et al. used ML methods to classify surface water quality with only a few water quality parameters [17]. However, the national standards for surface water quality evaluation in China still use a single-indicator evaluation method, and relatively few studies have combined ML with comprehensive water quality assessment. The use of RFs combined with the WQI method in this study is therefore a novel attempt to apply ML to water quality assessment. Given the rapid development of artificial intelligence and big data, ML and deep learning could in the future be combined with water quality assessment, water quality warning systems, and other related water quality research.
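The RF-based selection of key parameters can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it relies on scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_`, and it runs on synthetic data. The parameter names `p0` to `p9`, the coefficients, and the hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def select_key_parameters(X, y, names, k=5, seed=0):
    """Rank parameters by RF impurity-based importance and keep the top k."""
    forest = RandomForestRegressor(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1][:k]
    return [names[i] for i in order]


# Synthetic stand-in for a monitoring dataset (assumption): ten candidate
# parameters, of which only the first five drive the target index.
rng = np.random.default_rng(0)
names = [f"p{i}" for i in range(10)]
X = rng.uniform(0.0, 1.0, size=(300, 10))
weights = np.array([5.0, 4.0, 4.0, 3.0, 3.0])
y = X[:, :5] @ weights + rng.normal(0.0, 0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: pick the k most important parameters on the training set.
selected = select_key_parameters(X_train, y_train, names, k=5)

# Step 2: refit a reduced model on the selected parameters only and
# evaluate it on the held-out testing set.
cols = [names.index(name) for name in selected]
reduced = RandomForestRegressor(n_estimators=200, random_state=0)
reduced.fit(X_train[:, cols], y_train)
r2_reduced = r2_score(y_test, reduced.predict(X_test[:, cols]))
```

Ranking on the training set and scoring the reduced model on the testing set keeps the selection step from seeing the data used for evaluation.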

#### **5. Conclusions**

The main conclusions are as follows: (1) The main water quality parameters of the M River that exceeded the Class III standards were TN, *F. coli*, Fe, and Mn. The WQI results indicated that the water quality of the M River was 'good' overall, with an average WQI value of 72.11. The average WQI values of the four monitoring stations ranged from 68.31 to 77.16, and there was a clear trend of deterioration from upstream to downstream. (2) The feature importance of each water quality parameter in the WQIRF model was quantitatively assessed, and five parameters (Mn, Fe, *F. coli*, DO, and TN) were selected as key water quality parameters for establishing the WQIRFmin model, which had good accuracy (R<sup>2</sup> = 0.96). (3) The PMF method was applied to identify five pollution sources and to apportion their contributions to each water quality parameter. (4) Quantitative assessment of the impact of pollution sources on water quality showed that the pollution sources were ranked as follows: heavy metal pollution (53.18%) > microbial contamination (18.15%) > agricultural non-point source pollution (9.64%) > nutrient contamination (6.73%), while the unexplained variability accounted for 10.95% of the total.

The methods used in this study to analyze the water quality of the M River could reduce the measurement cost of water quality assessment and effectively improve measurement efficiency. In addition, the findings provide support for formulating water quality management strategies. The methods for selecting key water quality parameters and for assessing the quantitative contributions of pollution sources to the variation in WQI values could be applied in practice to other surface waters to improve our understanding of overall water quality conditions. Additional studies will be required to precisely assess the unidentified sources of pollution and the variation of further water quality parameters that were not analyzed in this study.

However, water pollution is a complex process, and many additional factors affect the migration and transformation of pollutants. Research methods and technical means should therefore continue to be improved, and methods and theories for tracing pollutants that exceed the standards should be explored at both qualitative and quantitative levels. It is also necessary to verify and analyze the existing results, optimize the sampling scheme, and establish a model of the relationship between environmental variables and water pollutants. This will be a major direction for future work.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijerph20010881/s1, Figure S1: Comparison of the WQI and WQILRmin values from the stepwise multiple linear regression based on the testing dataset; Table S1: Water quality characteristic; Table S2: Weights and normalization factors of the parameters used in the calculation of the water; Table S3: The parameter selection results of the WQILRmin models from the stepwise multiple linear regression.

**Author Contributions:** All authors contributed to the study conception and design. Conceptualization, methodology, software, formal analysis, and writing: Y.Z.; validation and writing: X.W.; methodology and software: S.Z.; validation: L.J.; supervision, project administration, funding acquisition: W.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 51979194) and Cross-regional Joint Prevention and Control Mechanism and Strategic Scientific Research Program for Water Quality Biosafety Risks in Upper Yangtze River (No. 2021-YB-CQ-3). We also thank the research on water quality stability characteristics and countermeasures of the Fuzhou Water Supply System (Project No. 20203000) from Fuzhou Water Group Co. Ltd., China and the Comparative study on corrosion characteristics of pipes and microbial safety of water quality in water supply system (Project No. kh0040020191200).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Nomenclature**

WQI: Water quality index; PMF: Positive matrix factorization; ML: Machine learning; RF: Random forests; APCS-MLR: Absolute principal components score combined with multivariate linear regression; WWP: West District Water Plant; FWP: Fei Fengshan Water Plant; SWP: Southeast District Water Plant; CWP: Chengmen Water Plant; MSE: Mean square error; MAE: Mean absolute error; MAPE: Mean absolute percentage error; CV: Coefficient of variation; WT: Water temperature; TN: Total nitrogen; TP: Total phosphorus; DO: Dissolved oxygen.

#### **References**


