1. Introduction
At electric power enterprises, depending on the technological processes involved, a substantial amount of wastewater is produced, containing substances such as oil and petroleum products, as well as heavy metals, which are hazardous to natural water bodies. In this regard, there are strict requirements for wastewater discharge. For the transition to smart energy solutions within the concept of Industry 5.0 [1,2], new approaches to environmental impact evaluation need to be implemented. Smart energy involves optimizing energy costs and boosting efficiency through innovative technologies to develop and manage a sustainable energy system. This is accomplished by incorporating artificial intelligence, machine learning, and data analytics into processes, with the help of IoT sensors.
Smart energy plays an active role in preserving the environment. This can be achieved by adopting digital technologies, including machine learning algorithms, in the area of wastewater emission control as part of smart energy (see Figure 1). Analyzing existing methods and providing new, more accurate ones benefits the planet, as it favors circular production models and supports energy technologies that use natural resources more efficiently.
Wastewater is any stream of water that is moved out from an electrical power plant’s cycle. The main types of wastewater are the following:
Wastewater from both circulating and open hydro-ash and slag-removal (HSR) systems of power plants that operate on solid fuels;
Blowdown water from the circulating water supply system of thermal power plants, which is discharged continuously;
Wastewater from water treatment plants, which can be discharged periodically or continuously, including reverse osmosis concentrates, wash water from mechanical filters, and eluates after the regeneration of ion exchange filters;
Blowdown water from steam boilers, evaporators, and steam converters that is discharged continuously;
Snow and rain runoff that contains suspended particles, such as various types of contamination and petroleum products, including fuel oil;
Oily, contaminated external condensate, suitable for feeding steam evaporators after cleaning;
Spent washing solutions, both acidic and alkaline, along with wash waters generated after chemical washing and preservation, used for steam boilers, condensers, heaters, and other equipment. This includes periodic runoff, typically formed in the summertime [3].
Such wastewater should be treated at a local treatment facility until it meets state standards. Releasing wastewater that exceeds the maximum permissible concentrations (MPCs) of harmful substances into a body of water is a severe offense, in accordance with the law, and is punishable by fines and suspension of the activities of enterprises, as it can cause significant negative impacts on the environment.
MPC is a legally approved sanitary and hygienic standard. MPC is “the maximum concentration of chemical elements and their compounds in the environment exposed every day over a long time that does not cause pathological changes or diseases in the human body established by modern research methods at any time in the life of the present and subsequent generations” [4].
Therefore, the solution to the significant problem of identifying wastewater contamination exceeding the MPC allows us to promptly detect the evidence behind such events and take all necessary measures to minimize adverse outcomes.
The wastewater analytical indicators must be constantly monitored. It is necessary to promptly identify instances of contamination exceeding the MPC at energy facilities. In order to ensure a quick response and high accuracy in emission detection, it is essential to develop technical solutions, such as smart energy monitoring systems with modern measuring devices, and also improve the methods for wastewater composition analysis.
Usually, approaches based on sampling, laboratory testing, and the formation of appropriate actions based on the test results are used.
Laboratory tests of water quality are carried out based on the following criteria:
Physical (smell, taste, turbidity, color);
Microbiological (total microbial count, content of coliform bacteria, spores of sulfite-reducing clostridia, and cysts of lamblia);
Chemical (hydrogen index of water pH, hardness and alkalinity, mineralization, anionic and cationic composition (inorganic substances), content of organic substances).
This approach is currently the most accurate, but it has two significant drawbacks. First, there are no clear regulations on the timing of water sampling. Due to the dynamics of liquid movement, samples are sometimes taken too late, resulting in missed instances of exceeding the MPC and in measured concentrations lower than those present at the time of the discharge. Second, laboratory studies require substantial time, leading to delays in responding to these instances. Additionally, laboratory testing is costly and requires the optimization of the sampling process.
The development of IoT and other technologies has enabled the installation of mobile laboratory posts equipped with direct measurement sensors.
Direct measurement analyzers measure the chemical composition in the laboratory by finding the value of the tested indicator (for example, Manganese) and comparing it with the standards for the maximum permissible concentration of Manganese in water.
A major limitation of this approach is the high maintenance cost. Since mobile posts function as laboratories, they require chemical component supplies for analysis, which becomes very expensive due to their remote locations.
Furthermore, it is important to note that conducting a complete chemical analysis of water samples daily is impractical due to the high costs. As a result, there may have been instances where contamination exceeded the MPC but went undocumented.
There is a need for low-cost systems to detect when the MPC is exceeded and trigger automatic water sampling. The sampled water can then be sent to the laboratory for analysis. Inexpensive devices include indirect measurement sensors, which detect changes in the water’s chemical composition without identifying the cause. This indirect measurement approach is one way to monitor wastewater quality and quickly spot anomalies that may indicate that the MPC standards have been exceeded. When working with real-life data, we need to detect both known anomalies and potential new ones in case some events were missed.
Smart energy monitoring systems use indirect measurement sensor channels without information on specific indicators of the concentration of harmful substances in the same water sample. Such sensors measure changes in the chemical composition of the water (the ratio of oxidized and reduced forms of all chemicals in solution) over time using electrochemical methods with pH/ORP (power of Hydrogen/Oxidation-Reduction Potential) electrodes, generating time series data with a timestamp index.
Changes in the water composition affect the readings of indirect measurement sensors, often showing atypical patterns that differ from those of clean water. These anomalies can indicate cases where wastewater contamination exceeds the maximum permissible concentrations. The diversity in chemical composition is reflected in sensor readings and calls for time series analysis methods that minimize human involvement. Automated techniques, such as stationarity checks, partial autocorrelation function (PACF) and autocorrelation function (ACF) calculations, or advanced methods, such as convolutional neural networks and Isolation Forest, can be used for efficient analysis.
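As an illustration of such automated checks, a minimal sketch using pandas and statsmodels is given below; the file name and the channel column name (ph_orp_1) are hypothetical placeholders, not the actual dataset schema.

```python
# Minimal sketch of automated time series checks (file and column names are assumed).
import pandas as pd
from statsmodels.tsa.stattools import adfuller, acf, pacf

# Hypothetical sensor channel loaded as a time-indexed series.
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"], index_col="timestamp")
series = df["ph_orp_1"].dropna()

# Augmented Dickey-Fuller test: a p-value < 0.05 suggests the series is stationary.
adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")

# ACF/PACF values help characterize the lag structure with minimal human involvement.
print("ACF (first 10 lags):", acf(series, nlags=10))
print("PACF (first 10 lags):", pacf(series, nlags=10))
```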
When using this approach to detect instances of exceeding the MPC, a key research question arises: how can such events be identified using indirect sensor readings when only a limited number of confirmed cases are available?
The limited number of confirmed MPC exceedances results from the fact that not every discharge detected by indirect measurement sensors is confirmed or refuted by laboratory tests. This leads to missed events, making the use of conventional supervised machine learning approaches questionable.
This study aims to identify anomalies in multivariate time series from indirect measurement sensors and compare them with confirmed instances of MPC exceedance.
The hypothesis of this research is that the methods for detecting complex anomalies in multidimensional time series can identify events where water contamination exceeds the MPC using indirect sensor measurements under conditions of infrequent laboratory testing.
The key scientific contribution is the developed method for detecting complex anomalies in multidimensional time series, named IFPC (Isolation Forest—Predicates Conjunction).
The structure of the paper is outlined as follows: The Literature Review section reviews current research approaches in the domain of water quality assessment, methods for detecting anomalies, and the evaluation of anomaly detection metrics. The Materials and Methods section outlines the problem statement, details the methodology and tools used, and describes the IFPC algorithm. The Results section provides the outcomes of applying the anomaly detection methods. The Discussion section describes, analyzes, and interprets the findings. The Conclusion section concludes the article.
2. Literature Review
The literature review covers sources on approaches in the field of water quality assessment.
Rashevskiy et al. studied ecology modeling for the sustainable development of the urban environment [5]. Daoping et al. studied the application of sensors for a wastewater treatment system [6]. Rehbach et al. proposed multi-objective machine learning for a water quality monitoring system and addressed the problem of dataset imbalance [7]. A survey of machine learning methods applied to anomaly detection on drinking water quality data by Dogo et al. [8] contains a comprehensive overview of drinking water quality monitoring problems. The provided statistics are as follows: 41% of the surveyed water utilities rely on manual water sample collection for analysis, and only 16% rely on automated collection. However, only 40% of them would like to have a real-time water quality monitoring system, and only 17% currently have one. The survey also provides an overview of traditional monitoring types (physical observation, laboratory analysis, portable detection devices) and the transition to wireless sensor networks, as well as a retrospective of research into machine learning methods for detecting anomalies in drinking water quality data. Dogo et al. consider class imbalance and the impact of missing data on the performance of learning algorithms in water quality anomaly detection problems. The paper [9] presents a retrospective study on this issue. Muharemi et al. demonstrate the weakness of RNN (recurrent neural network) and LSTM (long short-term memory) algorithms on data with an imbalance problem [10]. Ribeiro and Reynoso-Meza propose using the ensemble learning Synthetic Minority Oversampling Technique (SMOTEBoost) and Random UnderSampling (RUSBoost) for problems with extreme imbalance coefficients [11]. Zhang et al. consider solving the problem of noisy data coming from water composition monitoring sensors by wavelet transforms [12]. Wu et al. present a successful implementation of an adaptive learning rate backpropagation neural network (ALBP) and a 2-step Isolation and Random Forest (2sIRF) model in urban water supply systems in Norway [13]. Al-Gunaid et al. describe a system concept for collecting data by a hardware–software complex for the automated control of wastewater composition using pH/ORP (power of Hydrogen/Oxidation-Reduction Potential) sensors. The paper [14] proposes an approach for automatic wastewater discharge detection by binary classification.
References [15,16,17,18,19] present the basic terms of anomaly detection tasks.
In a general sense, an anomaly means a deviation from expected behavior. Anomaly detection is the task of detecting unusual patterns that do not correspond to expected behavior.
The following types of anomalies in time series are identified [15]:
- Point anomalies, when a single data instance is considered anomalous in relation to other data;
- Contextual anomalies, when a data instance is anomalous in a certain context;
- Collective anomalies, which occur as a sequence of time points when no normal data exist between the beginning and end of the anomaly.
Anomaly detection methods, according to [16], are shown in Figure 2.
Anomaly detection metrics are considered in [17,18]. An overview of metrics by anomaly type and anomaly detection tasks is shown in Figure 3.
The main problems in solving this class of tasks are the following:
- Noisy data and missing values associated with the specifics of data transmission from sensors;
- Unbalanced distribution of classes (imbalanced data), associated with the peculiarity of the event (rare and missed events);
- Unlabeled data, related to the specifics of monitoring wastewater contamination exceeding the MPC (unlabeled data do not indicate whether a value is normal or abnormal, due to the use of indirect rather than direct measurement sensors);
- Lack of metrics for solving tasks with unlabeled data.
In the context of our study, we hypothesize that wastewater contamination exceeding the MPC occurred if a rare-event pattern is present, namely, an anomaly detected by all sensor channels at the same time point.
An anomaly detected by all of a smart energy station’s sensor channels at the same time point will be considered a complex anomaly [19].
To evaluate the IFPC method, we will use three dates with laboratory-recorded instances of contamination exceeding the MPC. Algorithmic data labeling and the subsequent use of traditional metrics for evaluating supervised machine learning methods cannot be trusted in this case. Instead, it is proposed to identify and display the dates of detected complex anomalies and compare them with the known dates.
3. Materials and Methods
3.1. Methodology
The system workflow is shown in Figure 4 and includes the following:
Data collection from the software and hardware complex (SHC) with pH/ORP sensors [18] via indirect measurement sensor channels. At the command of the SHC controller, the self-priming pump ensures liquid renewal in the flow chamber, where it is analyzed by a set of signal sensors and measuring analyzers. The values from the sensor channels are processed by the controller and transmitted via the TCP/IP protocol to the application server for further processing. Data are received from the sensors at a frequency of 1 measurement per minute (1440 per day; 43,200 per month; 525,600 per year), giving a total of approximately 3,679,200 measurements used in the analysis.
Saving the obtained data to the database.
Loading time series data for analysis. The task is to detect time series anomalies.
Data analysis.
Making decisions about automatic water sample collection for detection of the instances of exceeding the MPC. The anomalies found are indicators for subsequent laboratory chemical water analysis by direct measurements.
Automatic water sample collection.
Analysis of water’s chemical composition. Direct measurement analyzers measure chemical composition in the laboratory. The value of the tested indicator (for example, Manganese) is found and compared with the standards of the maximum permissible concentration of Manganese in water. Exceeding the MPC of contamination is recorded in the report.
Saving the results of the completed analysis.
Using the results for subsequent data analysis.
Comparison of the results of indirect measurement analysis and the analysis of the water composition (direct measurement).
3.2. Problem Statement
Let X = {x_t}, t = 1, …, n, be a multidimensional time series, where each element x_t is an m-dimensional vector representing the readings of m sensor channels at time t.
It is necessary to assign to each moment of time t an anomaly label for the reading x_t,i from the i-th channel of the sensor:
A(x_t,i) ∈ {0, 1}, i = 1, …, m, t = 1, …, n, (1)
where A(x) is a predicate defining the properties of the argument.
A prerequisite is that there are no labeled data in the sample. Unlabeled data do not indicate whether the value is normal or abnormal, due to the use of indirect rather than direct measurement sensors. Labeled data are a set of data that have already been marked or classified by the researcher or algorithmically.
We will consider an anomaly at time t to be a complex anomaly if the conjunction of the predicates (1) over all m sensor channels is satisfied, i.e., A(x_t,1) ∧ A(x_t,2) ∧ … ∧ A(x_t,m) = 1.
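This definition can be expressed compactly with pandas; the sketch below assumes the per-channel anomaly labels A(x_t,i) are already available as 0/1 columns indexed by timestamp (the channel names and values are purely illustrative).

```python
import pandas as pd

# Per-channel anomaly labels A(x_t,i) ∈ {0, 1}, indexed by timestamp (illustrative values).
labels = pd.DataFrame(
    {"ch1": [0, 1, 1], "ch2": [0, 1, 1], "ch3": [0, 0, 1]},
    index=pd.to_datetime(["2022-06-07 10:00", "2022-06-07 10:01", "2022-06-07 10:02"]),
)

# Complex anomaly: conjunction of the predicates over all channels at the same time point.
complex_anomaly = labels.all(axis=1)
print(complex_anomaly)
```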
Generally, to prepare the data, the following preprocessing steps are required (a minimal code sketch is given after the list):
Removal of noise in the data (achieved by the moving average method);
Data normalization for the stable operation of the algorithms;
Aggregation of data when they are presented in unequal time intervals;
Taking the logarithm of the data (to achieve homoscedasticity);
Differencing the data (to remove the trend component and make the time series stationary).
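For reference, these steps could be expressed with pandas and numpy roughly as follows; the window size, resampling interval, and the order of normalization are illustrative assumptions, and the series is assumed to have a timestamp index.

```python
import numpy as np
import pandas as pd

def preprocess(series: pd.Series) -> pd.Series:
    """Illustrative preprocessing chain for one sensor channel (parameters are placeholders)."""
    s = series.rolling(window=15, min_periods=1).mean()  # noise removal (moving average)
    s = s.resample("1min").mean().interpolate()          # aggregation to equal time intervals
    s = np.log(s.clip(lower=1e-6))                       # logarithm (homoscedasticity)
    s = s.diff().dropna()                                # differencing (remove trend, approach stationarity)
    return (s - s.mean()) / s.std()                      # normalization
```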
One of the requirements for method selection is minimal data preprocessing or the ability to use raw, unprocessed data.
3.3. The Isolation Forest
The Isolation Forest (iForest) method was selected as the basic unsupervised machine-learning technique for anomaly detection. This approach demands minimal data preprocessing, is robust against missing data and varying data dimensions, is invariant to feature scaling, and does not require parameter tuning. It has linear time complexity and low memory usage, making it more efficient than most other algorithms, particularly for analyzing streaming data.
The iForest method is based on the construction of a set of several disjoint, undirected binary decision trees (isolation trees, iTrees), that is, a decision tree ensemble.
A decision tree here is a connected acyclic graph: there is exactly one path between any pair of vertices, and it contains no cycles (Figure 5).
The Isolation Forest algorithm [16] is shown in Algorithms 1 and 2 below.
Algorithm 1. iForest(X, d, ψ)
Inputs: X—dataset, d—number of trees, ψ—number of subsample instances
Output: set of d iTrees (Forest)
1: Initialize Forest
2: set tree height limit l = ⌈log₂ ψ⌉
3: for i = 1 to d do
4:   X′ ← sample(X, ψ)
5:   Forest ← Forest ∪ iTree(X′, 0, l)
6: end for
7: return Forest
Algorithm 2. iTree(X, e, l)
Inputs: X—dataset, e—height of current tree, l—height limit
Output: iTree tree
1: if e ≥ l or |X| ≤ 1 then
2:   return exNode{Size ← |X|}
3: else
4:   let Q be a list of attributes in X
5:   randomly select an attribute q ∈ Q
6:   randomly select a split point p between the max and min values of attribute q in X
7:   X_l ← filter(X, q < p)
8:   X_r ← filter(X, q ≥ p)
9:   return inNode{Left ← iTree(X_l, e + 1, l), Right ← iTree(X_r, e + 1, l), SplitAtt ← q, SplitValue ← p}
10: end if
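To make Algorithm 2 concrete, a minimal Python sketch of the recursive iTree construction is given below; data are represented as a list of numeric feature vectors, and this is an illustration of the pseudocode, not the implementation used in the study.

```python
import random

def build_itree(X, e, limit):
    """Recursive iTree construction following Algorithm 2 (illustrative sketch)."""
    if e >= limit or len(X) <= 1:
        return {"type": "exNode", "size": len(X)}            # external (leaf) node
    q = random.randrange(len(X[0]))                           # randomly select an attribute
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:                                              # attribute is constant: stop splitting
        return {"type": "exNode", "size": len(X)}
    p = random.uniform(lo, hi)                                # randomly select a split point
    left = [x for x in X if x[q] < p]
    right = [x for x in X if x[q] >= p]
    return {"type": "inNode", "att": q, "split": p,
            "left": build_itree(left, e + 1, limit),
            "right": build_itree(right, e + 1, limit)}
```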
The developers of the Isolation Forest algorithm [16,21,22] propose the following estimation of anomaly (2):
s(x, ψ) = 2^(−E(h(x))/c(ψ)), (2)
where ψ—number of instances in the subsample;
h(x)—length of the path to the observation x, defined by the number of edges starting from the root of the tree;
E(h(x))—mean of h(x) over the iTree ensemble;
c(ψ)—average path length of an unsuccessful search in an iTree.
The limiting cases of the anomaly estimation (2) are as follows:
If E(h(x)) → c(ψ), then s → 0.5;
If E(h(x)) → 0, then s → 1;
If E(h(x)) → ψ − 1, then s → 0.
The criteria for recognizing an instance as abnormal are as follows (a numerical sketch of the estimation is given after this list):
If s is close to 1, then the instance can be considered anomalous with certainty;
If s is much smaller than 0.5, then the instance can be considered normal;
If the anomaly estimation yields s ≈ 0.5 for all instances, then the sample has no anomalies.
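For illustration, the estimation (2) can be computed as in the following sketch; c(ψ) uses the harmonic-number approximation from the original Isolation Forest paper, and the path lengths are invented numbers.

```python
import numpy as np

def c(psi: int) -> float:
    """Average path length of an unsuccessful search in an iTree built on psi instances."""
    if psi <= 1:
        return 0.0
    harmonic = np.log(psi - 1) + np.euler_gamma  # approximation of H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(path_lengths, psi: int) -> float:
    """s(x, psi) = 2 ** (-E[h(x)] / c(psi)); values close to 1 indicate an anomaly."""
    return 2.0 ** (-np.mean(path_lengths) / c(psi))

# Illustrative examples with invented path lengths over a small tree ensemble.
print(anomaly_score([2, 3, 2, 4], psi=256))      # short paths -> score close to 1 (anomalous)
print(anomaly_score([12, 11, 13, 12], psi=256))  # longer paths -> score below 0.5 (normal)
```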
In other words, the normality measure of a class instance is the arithmetic mean of its depths over the trees. Anomalous instances are located in leaves at a shallow tree depth [20,23].
3.4. Method IFPC (Isolation Forest—Predicates’ Conjunction)
The IFPC method has been developed for identifying complex anomalies in multidimensional time series (Figure 6) and can be outlined in the following steps (a minimal code sketch of these steps is given after the list):
1. Data preparation:
   - Data download;
   - Data concatenation;
   - Preprocessing (normalization) of the data.
2. For all sensor channels:
   - Identifying anomalous sample instances by the basic method;
   - Output of the time series with marked anomalies;
   - Detection of complex anomalies by the logical operation of conjunction.
3. Extracting and displaying the dates where complex anomalies were detected.
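A minimal end-to-end sketch of these steps with scikit-learn is shown below; the file names, contamination rate, and other parameters are assumptions for illustration, not the exact settings used in this study.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

# 1. Data preparation (file names and column layout are assumed).
frames = [pd.read_csv(f"data_{m:02d}.csv", parse_dates=["timestamp"], index_col="timestamp")
          for m in range(1, 13)]
data = pd.concat(frames).sort_index()
scaled = pd.DataFrame(MinMaxScaler().fit_transform(data), index=data.index, columns=data.columns)

# 2. Per-channel anomaly labels by the basic method (Isolation Forest).
labels = pd.DataFrame(index=scaled.index)
for channel in scaled.columns:
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    labels[channel] = model.fit_predict(scaled[[channel]]) == -1   # True = anomalous reading

# 3. Complex anomalies: conjunction of per-channel predicates; extract the dates.
complex_anomaly = labels.all(axis=1)
dates = sorted(complex_anomaly[complex_anomaly].index.normalize().unique().date)
print(dates)
```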
To verify the IFPC method, source code was written in the Visual Studio Code integrated development environment using Python 3.12.4 with the following libraries:
- pandas (data processing);
- matplotlib (data visualization);
- numpy (linear algebra problems);
- scipy.stats (statistics);
- statsmodels (test for stationarity of the series);
- sklearn (machine learning methods);
- pmdarima (ARIMA model);
- keras (deep learning methods).
The program has several data sources: downloaded .csv and .json files and user-supplied parameters.
The 12 .csv files contain unlabeled data for 12 months; in other words, the dataset does not contain any markings that classify a particular indicator value as normal or abnormal. The JSON file contains unlabeled data for June–July 2022. The data structure is identical across the different file types and is shown in Figure 7.
For one time stamp, readings from 7 sensor channels are presented.
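Under the assumption that both file types share the structure shown in Figure 7 (a timestamp column plus seven channel columns; the file and column names below are placeholders), the two sources could be combined as follows:

```python
import pandas as pd

# Load the 12 monthly .csv files (placeholder names) into one time-indexed frame.
csv_part = pd.concat(
    pd.read_csv(f"2020_2021_{m:02d}.csv", parse_dates=["timestamp"], index_col="timestamp")
    for m in range(1, 13)
)

# Load the June–July 2022 JSON file with the same assumed structure.
json_part = pd.read_json("june_july_2022.json", convert_dates=["timestamp"]).set_index("timestamp")

# The data structure is identical across file types, so the frames can be concatenated directly.
assert list(csv_part.columns) == list(json_part.columns)
data = pd.concat([csv_part, json_part]).sort_index()
print(data.isna().sum())  # the provided files contain no NaN values
```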
In the case of a system disruption, the indicator values take the NaN (Not-a-Number) value. The files provided do not contain NaN values, so this article does not consider the case of system malfunctions (hardware failure).
The computational experiment aims to verify the hypothesis that wastewater contamination exceeding the MPC at energy facilities has occurred if anomalies are detected by all sensor channels at the same time.
Input data:
Unlabeled dataset for 2020–2021 consisting of 12 .csv files;
Unlabeled dataset for June–July 2022 in .json format;
Three dates with laboratory-recorded instances of contamination exceeding the MPC: 7, 8, and 9 June 2022.
Output data: time series plots with anomalies marked and a list of dates where the contamination MPC may have been exceeded.
4. Results
In this section, we will discuss the results obtained based on the experimental design. Firstly, we will present the results of the benchmark approach using the 3-sigma rule (3σPC method). Next, we will cover the results for the benchmark approach utilizing the clustering (k-meansPC method). Lastly, we will present the results of the IFPC (Isolation Forest—predicates’ conjunction) method.
4.1. 3σPC Method
The 3-sigma rule states that for many fairly symmetric unimodal distributions, almost the entire population lies within three standard deviations of the mean. The 3-sigma rule is simple and is widely used in practice as a benchmark model. Its disadvantages are discussed in the scientific literature; for instance, the choice of threshold is not a trivial problem. It depends on the physical aspects of the domain, and researchers use various heuristics based on their domain knowledge.
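As a reference point, the per-channel 3-sigma check can be written as the following sketch (the combination by conjunction mirrors the 3σPC step; the column layout is assumed):

```python
import pandas as pd

def three_sigma_flags(series: pd.Series) -> pd.Series:
    """Flag readings farther than three standard deviations from the channel mean."""
    mu, sigma = series.mean(), series.std()
    return (series - mu).abs() > 3 * sigma

# Per-channel flags can then be combined by conjunction (the 3σPC step), e.g.:
# flags = data.apply(three_sigma_flags)
# complex_anomaly = flags.all(axis=1)
```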
Figure 6 shows the result of searching for anomalies by the 3-sigma method on the full dataset for each sensor channel. Note that this approach operates as a filter, omitting values around the maximum and minimum. The complex anomalies plot in Figure 8 shows the results of the 3σPC (3 sigma—predicates’ conjunction) method.
Figure 9 shows another result of the 3σPC method: a list of dates.
Thus, detecting anomalies by the 3σPC method, we identified nine dates, but none of them correspond to the laboratory-recorded instances of contamination exceeding the MPC (7, 8, and 9 June 2022).
4.2. K-meansPC Method
K-means is a method for clustering problems in unsupervised learning. The algorithm seeks to minimize the total square deviation of cluster points from the centers of these clusters.
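Since the paper does not detail the exact anomaly rule used with k-means, the sketch below makes the common assumption that points far from their nearest cluster center (here, the top 1% of distances) are flagged as anomalous:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_flags(values: np.ndarray, n_clusters: int = 2, quantile: float = 0.99) -> np.ndarray:
    """Flag points whose distance to the nearest cluster center falls in the top (1 - quantile)."""
    x = values.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(x)
    distances = np.abs(x - km.cluster_centers_[km.labels_]).ravel()
    return distances > np.quantile(distances, quantile)
```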
Figure 10 shows the result of searching for anomalies by the k-means method on the full dataset for each sensor channel. The complex anomalies plot in Figure 8 shows the results of the k-meansPC (k-means—predicates’ conjunction) method.
The k-means method has two main disadvantages. The first is its sensitivity to the initialization of the method’s parameters. The second is that the number of clusters has to be specified in advance, a value that depends on expert knowledge.
Figure 11 shows another result of the k-meansPC method: a list of dates.
Thus, detecting anomalies by the k-meansPC method, we identified three dates, but none of them correspond to the laboratory-recorded instances of contamination exceeding the MPC.
4.3. IFPC Method
Figure 12 shows the result of searching for anomalies by the Isolation Forest method on the full dataset for each sensor channel. The complex anomalies plot in Figure 12 shows the results of the IFPC (Isolation Forest—predicates’ conjunction) method.
Figure 13 shows another result of the IFPC method: a list of dates.
Thus, using the IFPC method to detect anomalies, we identified 57 dates, including all three dates with laboratory-recorded instances of contamination exceeding the MPC (7, 8, and 9 June 2022).
Table 1 summarizes the results of the experiments.
Unlike the benchmark approaches, the IFPC method is not sensitive to data gaps, making it suitable for in situ systems designed to detect water contamination exceeding the MPC. This capability opens up new possibilities for implementing autonomous hardware and software systems with automatic sampling functions. Additionally, since the IFPC method does not rely on preliminary time series processing, incorporating additional factors (such as temperature, turbidity, or oxygen concentration) is a straightforward technical task that does not require changes to the analysis methodology.
The experiments confirmed the hypothesis. The IFPC method for complex anomaly detection in multidimensional time series identifies the dates of the laboratory-recorded instances of water contamination exceeding the MPC. This demonstrates that indirect measurement sensor data can be used to identify such events.
At the same time, combining predicates with other anomaly detection or forecasting methods seems promising; this approach could be further developed in future work.
5. Discussion
As a result of the experiments, the method based on detecting complex anomalies using Isolation Forest successfully identified laboratory-confirmed instances exceeding the MPC. In contrast, the benchmark approaches used in this study, despite their sensitivity, did not yield positive results. This is likely because exceeding the MPC in water contamination is indicated by significant changes across multiple data collection channels simultaneously. Therefore, the concept of detecting complex anomalies with this method is valid but requires careful analysis. Specifically, issues such as selecting the appropriate time window for sensor data and aligning time series with one another are areas that could be further developed in this study.
It is important to note the detected events exceeding the MPC that are not included in the initial data sample (for IFPC, 57 different days, which represents 16% of the total number of days in the study period). These events can be categorized into the following groups:
- An actual event of water contamination exceeding the MPC for which there is no laboratory test confirmation and which, therefore, was not included in the data sample;
- A false positive response of the anomaly detection method, i.e., there was no actual exceedance.
Without complete and reliable information, it is difficult to assess errors in the proposed approach using the same criteria as binary classification methods. Given iForest’s ability to detect similar patterns in time series data and the stringent rules for identifying complex anomalies through predicate conjunction, we can conclude the following: the identified contamination events exceeding the MPC are ‘similar’ to confirmed cases, with a confidence level defined by the ‘reliability’ indicator of the decision trees. This allows us to introduce a measure of how closely the identified results match confirmed cases based on these characteristics. To enhance this approach, it is important to classify contamination events according to specific substances (such as fluorine, iron, etc.). This will enable more accurate diagnostics when using the IFPC method for particular substances. Additionally, the conjunction of predicates can detect complex anomalies not only by analyzing events across different channels but also by examining events within the same time series across various observation intervals. In this context, a complex anomaly can be viewed as connections between sequences of events.