Towards Digitalization for Air Pollution Detection: Forecasting Information System of the Environmental Monitoring

Vadurin, Kyrylo; Perekrest, Andrii; Bakharev, Volodymyr; Shendryk, Vira; Parfenenko, Yuliia; Shendryk, Sergii

doi:10.3390/su17093760

by Kyrylo Vadurin ¹, Andrii Perekrest ¹, Volodymyr Bakharev ¹, Vira Shendryk ²,*, Yuliia Parfenenko ² and Sergii Shendryk ³

¹ Department of Computer Engineering and Electronics, Kremenchuk Mykhailo Ostrohradskyi National University, 20 Universytetska Str., 39600 Kremenchuk, Ukraine

² Department of Information Technologies, Sumy State University, 116 Kharkivska Str., 40007 Sumy, Ukraine

³ Department of Cybernetics and Informatics, Sumy National Agrarian University, 160 Herasyma Kondratieva Str., 40000 Sumy, Ukraine

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(9), 3760; https://doi.org/10.3390/su17093760

Submission received: 19 March 2025 / Revised: 17 April 2025 / Accepted: 18 April 2025 / Published: 22 April 2025

(This article belongs to the Special Issue Environmental Pollution and Impacts on Human Health)

Abstract:

This study addresses the urgent need for advanced digitalization tools in air pollution detection, particularly within resource-constrained municipal settings like those in Ukraine, aligning with directives such as the AAQD. The forecasting information system for integrating data processing, analysis, and visualization to improve environmental monitoring practices is described in this article. The system utilizes machine learning models (ARIMA and BATS) for time series forecasting, automatically selecting the optimal model based on accuracy metrics. Spatial analysis employing inverse distance weighting (IDW) provides insights into pollutant distribution, while correlation analysis identifies relationships between pollutants. The system was tested using retrospective data from the Kremenchuk agglomeration (2007–2024), demonstrating its ability to forecast air quality parameters and identify areas exceeding maximum permissible pollutant concentrations. Results indicate that BATS often outperforms ARIMA for several key pollutants, highlighting the importance of automated model selection. The developed system offers a cost-effective solution for local municipalities, enabling data-driven decision-making, optimized monitoring network placement, and improved alignment with European Union environmental standards.

Keywords:

sustainable development; environmental monitoring; air pollution; emissions; climate change mitigation; forecasting; machine learning; information system; spatial analysis

1. Introduction

The current state of the environment requires urgent measures to monitor and forecast changes in environmental parameters. The relevance of this task is evident in the context of climate change, war, and increasing environmental pollution. Furthermore, the recently adopted Ambient Air Quality Directive (AAQD), Directive (EU) 2024/2881, underscores the urgency of effective air pollution monitoring and forecasting across the European Union. This directive aims to improve air quality and mitigate the harmful effects of pollution on human health and the environment, which makes the development of advanced digitalization tools for air pollution detection even more necessary [1]. Such monitoring is impossible without effective tools that can automatically predict environmental changes based on large volumes of data from various sources [2]. Digitalization is now necessary to achieve sustainable development in all spheres of society, including environmental protection. One important area is the development of information systems that can forecast changes in environmental parameters, particularly air quality and water and soil pollution levels. Such systems should use intelligent analysis and modeling methods for accurate forecasting and identification of possible threats [3]. Information technologies allow collecting, storing, processing, and analyzing huge amounts of ecological data using machine learning and artificial intelligence methods [4]. In particular, the automatic selection of the optimal forecasting model based on these data is an important part of the effective functioning of environmental monitoring information systems [5,6,7]. The choice of the right forecasting model determines the accuracy of forecasts and the timeliness of the response to environmental disasters.

Developers of geographic information systems (GIS) and big data processing tools, such as ArcGIS and QGIS, as well as startups working on innovative technologies for environmental monitoring, are engaged in research on the digitalization of environmental monitoring. In particular, scientists and specialists use machine learning algorithms to automatically select a forecasting model, which significantly reduces labor costs and increases the accuracy of results. Despite significant achievements in environmental forecasting, numerous challenges remain, particularly in collecting and processing high-quality data in real time and in building models that adequately account for the complex relationships between environmental parameters. The main problems to be solved are the lack of sufficient accurate data for certain regions and the difficulty of applying complex mathematical models in daily monitoring.

The current limitations of local information systems within Ukrainian municipal enterprises pose a significant barrier to effectively integrating modern environmental monitoring tools and aligning with the standards established in both the European Union and Ukraine. While the need for advanced systems capable of analyzing complex environmental data is evident, particularly for accurate forecasting and timely intervention, the reality is that implementing European Union information systems, even in a limited capacity, is often financially prohibitive for these local entities. The high cost associated with such systems creates a situation where municipal enterprises are constrained by existing, less effective methods characterized by manual forecasting model selection, limited analytical capabilities for large datasets, and an inability to capture the intricate relationships within environmental processes fully. This results in less accurate forecasts and hinders the ability of local authorities to make informed decisions and implement effective measures to address environmental challenges, such as air pollution in urban areas. The absence of automated, data-driven model selection further exacerbates this issue, preventing the attainment of reliable results necessary for robust environmental risk management. Therefore, there is a critical need to develop or identify cost-effective solutions that enable Ukrainian municipal enterprises to adopt modern monitoring and analysis tools in accordance with both national and European Union standards, without imposing an unsustainable financial burden.

Modeling ecological processes is essential for understanding and forecasting environmental changes [8,9,10].

It is the process of building mathematical or statistical models that describe the interactions between various ecological factors, such as climate, water resources, pollution levels, and other aspects of natural systems. Such models can be of different types: deterministic, where changes in ecological parameters are predicted based on known patterns, or stochastic, where the randomness and uncertainty of ecological processes are considered [11]. Using modeling, it is possible to study how changes in one parameter affect others, predict the consequences of various environmental actions, and optimize resource conservation strategies.

The use of hydrodynamic and climate models has become a key element in understanding and forecasting environmental changes, including temperature, air, and water pollution levels, and climate change. Climate change models are an integral part of modern climate forecasting. They are usually based on complex mathematical models that describe atmospheric, oceanic, and other natural processes [12]. One of the most common approaches is using climate scenarios, such as representing future temperature changes based on different levels of greenhouse gas emissions [13]. These scenarios are used to calculate changes in global and regional temperatures and impacts on ecosystems. For example, one of the main tools used for this purpose is global climate models (GCMs) [14,15], which perform numerical calculations to forecast climate change under different emission conditions and other factors [14,15,16,17]. Analysis of such models allows for the detection of not only global trends but also regional changes, such as an increase in temperature in certain areas.

Hydrodynamic models are also crucial in forecasting environmental changes, particularly for studying water resources and water pollution levels [18,19]. Hydrodynamic models are used to forecast changes in water levels in rivers, lakes, and seas, as well as to analyze processes that can lead to climate change [19,20,21]. One such model is SWAT (Soil and Water Assessment Tool) [22], which is a tool for assessing the impact of various factors on water resources. Another common model is MIKE 21, which models water movement and pollutant transport in open waters such as rivers and seas [23]. Hydrodynamic models are often combined with climate forecasts to create more accurate models of environmental change that consider the interaction between climate change and the state of water resources [20,24].

Geographic information systems [25,26,27] and remote sensing [28,29,30] are important tools for monitoring and forecasting environmental changes. GIS capabilities extend to prediction by enabling the creation and application of spatial models that forecast future environmental conditions based on the analysis of geographic data and trends. Furthermore, GIS serves as a powerful platform for visualizing and presenting the results generated by various modeling tools, allowing for clear communication of complex spatial patterns and predictions through maps, charts, and reports. GIS technologies are a key tool for spatial data analysis and forecasting of environmental changes [31,32,33]. Studies demonstrate the effectiveness of using GIS to analyze land cover changes in the context of urbanization and climate change [34,35]. GIS allows for the integration of data from various sources, as well as the creation of models that forecast possible changes based on spatial analyses.

Remote sensing, particularly through satellite imagery, is another important tool for monitoring environmental changes [36]. With the help of remote sensing, it is possible to obtain data on the state of the environment in real time without the need for direct contact with the objects under study. Analysis of remote sensing data helps forecast changes in the atmosphere, particularly in studying the level of air pollution. The combination of GIS and remote sensing opens up new opportunities for comprehensive environmental data analysis, allowing for accurate forecasting of environmental changes [37].

Machine learning methods are used to automatically create forecasting models based on data [38,39,40]. They include various algorithms that are able to learn from historical data or other observations. Decision trees are another important method that allows the classification of forecasting models based on sets of rules that take the form of an “if-then” hierarchy. These machine learning methods are used in ecology to automatically detect dependencies between different environmental parameters, allowing us to predict how the environment will change under various factors, such as climate change or human activity. Neural networks allow us to process complex and large data sets, recognize patterns, and create forecasts based on them, even when the relationships between the data are not obvious.

Data-driven models are a cornerstone of modern forecasting and analysis, relying heavily on patterns and insights extracted directly from data rather than being solely based on theoretical assumptions [6]. These models learn from historical data or real-time observations to identify underlying relationships and make predictions about future outcomes. The more data available and the higher the quality of that data, the more accurate and reliable these models tend to become. This approach is particularly valuable in complex systems where the interactions between variables are not fully understood or are constantly evolving, such as in environmental science, finance, and healthcare. By letting the data speak for itself, data-driven models can uncover subtle trends and dependencies that might be missed by traditional, theory-based approaches.

Machine learning methods are used to automatically create forecasting models based on data, with neural networks being a powerful example of this approach [5,7]. Neural networks, inspired by the structure of the human brain, are particularly adept at processing vast amounts of data and identifying intricate patterns that may be invisible to simpler algorithms. By training on historical environmental data, for instance, a neural network can learn to recognize complex relationships between factors like temperature, humidity, pollution levels, and human activity [11]. Once trained, this network can then be used to forecast future environmental conditions under various scenarios, providing valuable insights for policymakers and researchers studying the impacts of climate change and other environmental pressures.

Regression techniques are widely recognized as common methods in time series forecasting and predictive modeling [39,40]. Specifically, logistic regression is highlighted as a popular type of regression for predicting complex connections and determining variable importance across diverse fields [39]. Linear regression is also noted as a foundational and frequently used model for forecasting continuous numerical variables due to its simplicity and effectiveness in modeling linear relationships [40]. However, in the specific domain of air quality prediction, the AutoRegressive Integrated Moving Average (ARIMA) method has been a popular and widely applied choice for analyzing and forecasting air pollutant concentrations over the past few decades [41].

For forecasting atmospheric air parameters, statistical time series forecasting methods such as ARIMA and BATS (Bayesian Additive Time Series) may be appropriate [42,43,44,45,46]. These methods allow for accurate forecasts by working with time series data, which is a major advantage when forecasting parameters such as temperature, humidity, air pollution levels, and other atmospheric indicators.

The use of ARIMA and BATS is relevant for local forecasts, as these methods allow for efficient forecasts without the need for complex and resource-intensive infrastructure, as is the case with GIS or hydrodynamic models. They allow working with limited data and using them for short-term forecasts, which is especially important for monitoring changes in environmental parameters in real time.

A number of works have been carried out focused on the development of information and analytical systems for monitoring and analyzing air quality. In the work [47], an information and analytical system is presented, designed for convenient management of the processes of collecting, processing, and analyzing air pollution data, aiming to track and understand the presence and levels of pollutants. The paper [48] describes developing an information system for collecting and storing data on the quality of atmospheric air from Vaisala stations at the municipal level. The system provides centralized data collection from several stations, stores the data in a database, and provides tools for further analysis and visualization. Article [49] is devoted to developing a method of automatic generation of reporting on the number of exceedances of the established standards of atmospheric air markers. In [50], a web-based technology for the intelligent analysis of environmental data of an industrial enterprise is presented.

The remainder of the paper is organized as follows. Section 2 describes the data processing stages of the developed information system, the methods for analyzing exceedances of maximum permissible concentrations, forecasting air pollutant concentrations, and analyzing correlations between environmental parameters, as well as the mathematical model of the spatial distribution of air pollution used to recommend locations for monitoring stations. Section 3 presents the Unified Modeling Language (UML) models of the environmental monitoring forecasting information system and the results of using the developed system for air pollution data analysis and forecasting. Section 4 compares the forecasting results of the BATS and ARIMA models and emphasizes the importance of automatically selecting the optimal forecasting model. Section 5 concludes the paper.

2. Materials and Methods

A forecasting information system for environmental monitoring was developed to process, analyze, and visualize environmental data, air pollution data in particular, in support of informed decision-making. The stages of data processing by the information system are described in detail in Section 2.1. The following subsections describe the mathematical methods and models, with their formulations, developed to implement individual stages of data processing or analysis. These are: methods for determining exceedances of maximum permissible concentrations (MPC) of pollutants over time periods, integrated with spatial analysis by inverse distance weighting (IDW); correlation analysis between air pollutants; time series analysis using the BATS and ARIMA models to forecast changes in air pollutant concentrations; and a mathematical model of the spatial distribution of air pollution based on correlation analysis and spatial interpolation.

2.1. Data Processing Stages for Air Pollution Detection Digitalization

The concept of the information system functioning for environmental monitoring and forecasting with automatic model selection is based on a multi-stage process of processing air quality data for the purpose of analysis, visualization, forecasting, and formulation of recommendations. These data processing stages are shown in Figure 1 and described below.

Stage 1. Input data loading and processing. Input data from various monitoring stations contain measurements of pollutant concentrations over time. They are loaded in tabular form, cleaned of empty values, converted into a format suitable for analysis, and structured by key characteristics: city, address, coordinates, and date.

Stage 2. Data analysis. It is carried out for key pollutants: dust, sulfur and nitrogen dioxides, and formaldehyde. Analysis of the MPC levels showed that these pollutants regularly exceed the established standards, so the system focuses on them in order to predict future exceedances and allow local authorities to react to the forecasts. Graphs show the change in pollution over time, grouped by cities and addresses. Spatial data analysis is integrated to create interactive maps that allow visualization of the geographical distribution of pollutants.

Stage 3. Analysis of exceedances of permissible standards. Regulatory daily maximum permissible concentrations are determined for each pollutant. The number of cases and sequences of exceedances of these values are counted with a graphic indication of such incidents.

Stage 4. Spatial forecasting. Interpolation of pollutant values is implemented using the inverse distance weighted (IDW) method to create forecasts of concentrations at different stations. This interpolation method is based on the principle that values closer to a station have a greater influence on the estimated value at that station. For the interpolation of pollutant concentrations, the specific pollutant being analyzed (selected as a parameter by the user), along with the measured values of that pollutant and their spatial distances, are considered. Visualization is interactive: the user can select a parameter, date, and other analysis criteria.

Stage 5. Time series forecasting. BATS and ARIMA models are used for forecasting. They allow the creation of forecasts of changes in concentrations for the future period (for example, 24 months). The optimal model is selected automatically based on the minimum forecast error.

Stage 6. Correlation analysis. Correlation matrices are used to identify dependencies between parameters. Based on the analysis, pairs of parameters with the highest positive and negative correlation are determined and visualized in graphs.

Stage 7. Recommendations for the placement of monitoring posts. Based on spatial and correlation analysis, background points are identified as the highest-priority locations for installing new monitoring stations. Recommendations are based on forecast data and analysis of MPC exceedances.

Thus, as a result of data processing, the information system allows us to get an overview of the environmental situation, identify problem areas and trends, create forecasts, and make informed decisions to improve air quality. Data analysis methods used to implement data processing stages are described below.

2.2. Method for Analyzing Exceedances of Maximum Permissible Concentrations

The input data set includes information on air pollutant concentrations at certain monitoring stations and their changes over time. The data are grouped by cities, addresses of monitoring posts, and also by time intervals. For each monitoring station, the concentrations are averaged over the years according to the following formula:

{\overline{C}}_{i j} (t) = \frac{1}{N_{t}} \sum_{k = 1}^{N_{t}} C_{i j} (t_{k}),

(1)

where ${\overline{C}}_{i j} (t)$ is the average concentration for monitoring post $i$ in town $j$ in year $t$, and $N_{t}$ is the number of measurements in that year.
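The yearly averaging in Equation (1) can be illustrated with a short sketch. It is not the authors' implementation: the record layout `(town, post, year, concentration)`, the function name `yearly_means`, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


def yearly_means(records):
    """Average concentrations per (town, post, year), as in Equation (1).

    `records` is an iterable of (town, post, year, concentration) tuples;
    this layout is an assumption made for the example.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for town, post, year, value in records:
        key = (town, post, year)
        totals[key] += value
        counts[key] += 1  # N_t: number of measurements in year t
    return {key: totals[key] / counts[key] for key in totals}


# Illustrative values only, not measurements from the study.
sample = [
    ("Kremenchuk", "post-1", 2023, 0.10),
    ("Kremenchuk", "post-1", 2023, 0.20),
    ("Kremenchuk", "post-1", 2024, 0.30),
]
means = yearly_means(sample)
```

Grouping by a composite key keeps the averages for different posts and years separate, which matches the indices $i$, $j$, and $t$ in the formula.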

The following formula determines the number of exceedances of permissible concentrations:

\text{Exceedance Count} = \sum_{k = 1}^{N} I (C_{k} > \text{MPC}),

(2)

where $I (\cdot)$ is the indicator function.

For each air pollutant, a time series that demonstrates the change in concentration is constructed as follows:

C_{i j} (t) = f (t) + ϵ (t),

(3)

where $f (t)$ is the trend and $ϵ (t)$ is the random component. The analysis is carried out using seasonal decomposition methods.

Integration of spatial data analysis for the construction of cartographic visualization is carried out using interpolation methods, for example, IDW [51]:

C (x, y) = \frac{\sum_{i = 1}^{N} \frac{C_{i}}{d {(x, y, x_{i}, y_{i})}^{p}}}{\sum_{i = 1}^{N} \frac{1}{d {(x, y, x_{i}, y_{i})}^{p}}},

(4)

where $d (x, y, x_{i}, y_{i})$ is the distance between the point $(x, y)$ and the station at $(x_{i}, y_{i})$, and $p$ is the weight function parameter.

Visualization of the results of the dependence of pollutants over time for each city and monitoring point is carried out by constructing interactive maps that display the geographical distribution of pollutants in different time periods.

The analysis of correlations between different air pollutants is carried out using the following formula [52]:

ρ (X, Y) = \frac{Cov (X, Y)}{σ_{X} σ_{Y}},

(5)

where $Cov (X, Y)$ is the covariance, and $σ_{X}$ and $σ_{Y}$ are the standard deviations.

A methodology covering the following stages is used to analyze pollutants exceeding the MPC in atmospheric air.

1. MPCs are set for each pollutant in accordance with current regulations. Let $G_{j}$ be the MPC value for the $j$-th pollutant. The input measurement data are presented in the form of a tabular array:

X = \{x_{i, j, t}\}, \quad i \in \{1, \dots, N\}, \; j \in \{1, \dots, M\}, \; t \in \{1, \dots, T\},

(6)

where $x_{i, j, t}$ is the concentration of the $j$-th substance at the $i$-th monitoring station at time $t$ (hour, day, month, year).

2. Counting the number of cases in which concentrations exceed the permissible level.

For each element $x_{i, j, t}$, the following formula checks whether the concentration exceeds the MPC:

P_{i, j, t} = \begin{cases} 1, & \text{if } x_{i, j, t} > G_{j}, \\ 0, & \text{otherwise}. \end{cases}

(7)

The total number of exceedances $C_{i, j}$ at the $i$-th station for the $j$-th substance is calculated by the following formula:

C_{i, j} = \sum_{t = 1}^{T} P_{i, j, t}

(8)

3. Calculation of sequences of MPC exceedances.

To analyze long periods of exceedance, sequences of values $P_{i, j, t}$ equal to 1 are determined. Let $L_{i, j}$ be the number of such sequences for the $i$-th station and the $j$-th substance. For this purpose, the following algorithm is used:

   1. Initialize the counters $L_{i, j} = 0$, $streak = 0$.
   2. Iterate over all values $P_{i, j, t}$, updating:

streak = \begin{cases} streak + 1, & \text{if } P_{i, j, t} = 1, \\ 0, & \text{if } P_{i, j, t} = 0. \end{cases}

(9)

   3. If $streak$ reaches the given threshold $K$, increase $L_{i, j}$ by one and reset $streak$.
4. Results visualization.

For each station, a graph of concentration $x_{i, j, t}$ against time $t$ is constructed. It marks MPC exceedances: a horizontal line shows $G_{j}$, and red areas indicate periods in which $G_{j}$ is exceeded. A set of colored markers is used for visualization, simplifying the identification of violations for various pollutants.

5. Data aggregation.

To assess the overall level of environmental risk, aggregated indicators are created, for example:

R_{i} = \sum_{j = 1}^{M} \frac{C_{i, j}}{T},

(10)

where $R_{i}$ is the average exceedance frequency for station $i$.
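The exceedance methodology in Equations (7)–(10) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the system's actual code: the function names, the MPC value of 0.05, the threshold $K = 2$, and the daily values are all hypothetical.

```python
def exceedance_flags(values, mpc):
    """Equation (7): 1 where the concentration exceeds the MPC, else 0."""
    return [1 if v > mpc else 0 for v in values]


def count_exceedances(flags):
    """Equation (8): total number of exceedances."""
    return sum(flags)


def count_streaks(flags, k):
    """Count runs of consecutive exceedances that reach length k.

    The streak counter grows on 1 and resets on 0 (Equation (9)); once it
    reaches k, one sequence is counted and the counter is reset.
    """
    sequences = 0
    streak = 0
    for flag in flags:
        streak = streak + 1 if flag == 1 else 0
        if streak == k:
            sequences += 1
            streak = 0
    return sequences


def exceedance_frequency(counts_per_pollutant, t):
    """Equation (10): sum over pollutants of C_ij / T for one station."""
    return sum(c / t for c in counts_per_pollutant)


# Illustrative daily values and an illustrative MPC, not real data.
daily = [0.03, 0.06, 0.07, 0.08, 0.02, 0.09, 0.10, 0.04]
flags = exceedance_flags(daily, mpc=0.05)
total = count_exceedances(flags)
streaks = count_streaks(flags, k=2)
risk = exceedance_frequency([total], t=len(daily))
```

Because the counter is reset each time it reaches $K$, a long run of exceedances is counted once per $K$ consecutive values; a run of four exceedances with $K = 2$ therefore contributes two sequences.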

The IDW method is used to perform spatial forecasting of pollutant concentrations at points missing from the original data. The basic idea of IDW is that the value at each predicted point is calculated as a weighted average of the values of neighboring measurement points, with the weights decreasing with increasing distance to the forecasting point.

The mathematical expression for IDW is given by the following formula [49]:

Z (x) = \frac{\sum_{i = 1}^{N} w_{i} (x) Z_{i}}{\sum_{i = 1}^{N} w_{i} (x)},

(11)

where:

$Z (x)$ is the forecasted value at the point $x$;
$Z_{i}$ is the value at the point $i$ from the original data set;
$w_{i} (x) = \frac{1}{d {(x, x_{i})}^{p}}$ is the weight of the point $i$, depending on the distance $d (x, x_{i})$ between points $x$ and $x_{i}$;
$p$ is the smoothing parameter that determines the influence of distant points (usually $p \geq 1$);
$N$ is the number of measurement points used for forecasting.
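A minimal sketch of the IDW estimate in Equation (11) follows. It assumes planar coordinates and Euclidean distance; station locations given as latitude and longitude would need a geodesic distance instead. The function name `idw`, the default $p = 2$, and the sample stations are assumptions made for the example.

```python
import math


def idw(target, points, values, p=2.0):
    """Inverse distance weighting, Equation (11).

    target: (x, y) of the point to estimate
    points: list of (x, y) measurement locations
    values: measured values Z_i at those locations
    p:      power parameter controlling how fast influence decays
    """
    numerator = 0.0
    denominator = 0.0
    for (px, py), z in zip(points, values):
        d = math.hypot(target[0] - px, target[1] - py)
        if d == 0.0:
            # The target coincides with a measurement point, where the
            # weight 1/d^p is undefined, so return the measured value.
            return z
        w = 1.0 / d ** p
        numerator += w * z
        denominator += w
    return numerator / denominator


# Two hypothetical stations on a line, with illustrative values.
stations = [(0.0, 0.0), (2.0, 0.0)]
measured = [10.0, 20.0]
midpoint = idw((1.0, 0.0), stations, measured)    # equal distances
near_first = idw((0.5, 0.0), stations, measured)  # closer to first station
```

At the midpoint both weights are equal, so the estimate is the plain average; moving the target toward the first station pulls the estimate toward that station's value.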

2.3. Methods for Forecasting Changes in Concentrations of Air Pollutants

Time series forecasting of future air pollutant concentration changes is based on the BATS and ARIMA models. These methods allow us to consider various data characteristics, including seasonality, nonlinearity, and noise, which are typical of environmental time series.

ARIMA is a statistical method for time series forecasting that allows us to consider trends and seasonal fluctuations. This makes it effective for short-term forecasts, in particular for local monitoring, where speed and accuracy are required with limited data. ARIMA does not require complex input data, as is the case with GIS or hydrodynamic models, which makes it convenient for use in real-world conditions. BATS, in turn, is a more flexible approach due to the use of Bayesian methods for assessing uncertainty in the data. This allows the model to adapt to different trends and seasonal variations, which is of great importance for accurate forecasting in unstable climate conditions. BATS also allows for the integration of various types of data and considers a significant number of factors affecting atmospheric air parameters.

The ARIMA model is one of the most common time series forecasting methods. It includes three key parameters [50,53]:

$p$ is the autoregression (AR) order, which determines the number of lags;
$d$ is the order of differencing (I) required to transform a time series into a stationary one;
$q$ is the order of the moving average (MA), which models the noise in the data.

The mathematical formula of the ARIMA model is given as follows [54]:

y_{t} = c + ϕ_{1} y_{t - 1} + \dots + ϕ_{p} y_{t - p} + θ_{1} ϵ_{t - 1} + \dots + θ_{q} ϵ_{t - q} + ϵ_{t},

(12)

where $y_{t}$ is the value of the time series at time point $t$, $c$ is a constant term, $ϕ_{i}$ are the autoregression coefficients, $θ_{i}$ are the moving average coefficients, and $ϵ_{t}$ is the noise (white noise).
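To make Equation (12) concrete, the sketch below computes a one-step-ahead forecast from given coefficients and shows the differencing step that the order $d$ refers to. It does not fit a model: the coefficients are illustrative and the function names are assumptions. Estimating $ϕ$ and $θ$ from data would normally be left to a statistics library.

```python
def arma_forecast_one_step(y, eps, c, phi, theta):
    """One-step-ahead forecast from Equation (12).

    y:     observed (already differenced) series, most recent value last
    eps:   past residuals, most recent value last
    phi:   [phi_1, ..., phi_p]
    theta: [theta_1, ..., theta_q]
    The future shock eps_t is unknown and is replaced by its mean, zero.
    """
    ar_part = sum(phi[i] * y[-1 - i] for i in range(len(phi)))
    ma_part = sum(theta[j] * eps[-1 - j] for j in range(len(theta)))
    return c + ar_part + ma_part


def difference(series, d=1):
    """Apply d rounds of first differencing (the I part of ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


# Illustrative coefficients, not fitted to any data.
forecast = arma_forecast_one_step(
    y=[1.0, 2.0, 3.0], eps=[0.0, 0.5], c=0.1, phi=[0.5, 0.2], theta=[0.4]
)
diffed = difference([1.0, 3.0, 6.0, 10.0], d=2)
```

Here the forecast is $0.1 + 0.5 \cdot 3.0 + 0.2 \cdot 2.0 + 0.4 \cdot 0.5 = 2.2$, and differencing the quadratic-like series twice leaves a constant series.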

The BATS model is designed to work with time series that have complex seasonal components and irregular data spacing. It includes harmonic components for modeling seasonality, value transformations such as logarithmization, and a Bayesian parameter estimation method that allows better uncertainty handling. The formula for the basic level of the BATS model is as follows:

{\hat{y}}_{t} = l_{t - 1} + \sum_{k = 1}^{K} s_{t - k},

(13)

where $l_{t - 1}$ is the time-series level at the moment $t - 1$, and $s_{t - k}$ is the seasonal component, taking into account harmonic frequencies.

To select the optimal forecasting model, the forecast errors MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) are evaluated. The information criterion AIC (Akaike Information Criterion) is determined to select a model with the optimal number of parameters.
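Selection by minimum forecast error can be sketched as below. The text does not state which metric decides the choice or how the hold-out period is formed, so the `metric` parameter, the function name `select_model`, and the numbers are assumptions made for illustration only.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def select_model(actual, forecasts, metric=rmse):
    """Return (name, score) of the forecast with the smallest error.

    forecasts: dict mapping a model name to its predictions for a
    hold-out period with known actual values.
    """
    scores = {name: metric(actual, pred) for name, pred in forecasts.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Illustrative numbers only, not results from the study.
holdout = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "ARIMA": [1.5, 2.5, 3.5, 4.5],  # small error at every step
    "BATS": [1.0, 2.0, 3.0, 5.2],   # exact except one large miss
}
best_by_mae, mae_score = select_model(holdout, candidates, metric=mae)
best_by_rmse, rmse_score = select_model(holdout, candidates, metric=rmse)
```

In this constructed example the two metrics disagree: MAE favors the forecast with one large miss, while RMSE penalizes that miss more heavily and favors the other. An automated selector therefore needs a fixed rule for which metric takes precedence.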

The forecasting results are presented in the form of the time series for each indicator under study. The obtained forecasts are the basis for determining possible scenarios for developing the situation and developing recommendations for ecological monitoring.

2.4. Method of Data Correlation Analysis

Correlation analysis is one of the methods for identifying relationships between parameters, which is based on calculating correlation coefficients for numerical variables in a data set [53]. In this study, it is used to establish relationships between air pollution indicators. The main stages of its implementation are the following. To calculate correlations, numerical sample parameters are used, which can be pre-filtered by features such as city or geographic location, specific address or coordinate, and time range.

1. Building a correlation matrix.

The correlation matrix $R$ is a square symmetric matrix where each element $r_{i j}$ is defined as the correlation coefficient between variables $x_{i}$ and $x_{j}$ according to the following formula [55]:

r_{i j} = \frac{Cov (x_{i}, x_{j})}{σ_{x_{i}} σ_{x_{j}}},

(14)

where:

$Cov (x_{i}, x_{j})$ is the covariance between variables $x_{i}$ and $x_{j}$;
$σ_{x_{i}}$, $σ_{x_{j}}$ are the standard deviations of variables $x_{i}$ and $x_{j}$.

2. Selecting the most significant correlations.

Based on the calculated matrix, pairs of variables with the highest positive correlation ($r_{i j} \to 1$) and the strongest negative correlation ($r_{i j} \to - 1$) are determined.

3. Visualization of dependencies. For each of the defined pairs of variables, graphs are built to evaluate their relationship. For a better interpretation of the data, the correlations are visualized as heat maps. The results of correlation analysis can be used to predict the behavior of one variable based on another, identify relationships between environmental parameters, such as pollutant concentrations, and optimize data collection methods, for example, by reducing the number of measurements for highly correlated variables.
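The first two steps above can be sketched as follows: Equation (14) is evaluated for every pair of variables, and the most positively and most negatively correlated pairs are picked. The function names and the sample series are assumptions made for the example. Plotting is omitted, and a constant series, whose standard deviation is zero, would need separate handling.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient, Equation (14)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # The common 1/n factors of covariance and deviations cancel out.
    return cov / (sd_x * sd_y)


def extreme_pairs(series):
    """Return the most positively and most negatively correlated pairs."""
    scored = [
        ((a, b), pearson(series[a], series[b]))
        for a, b in combinations(sorted(series), 2)
    ]
    strongest_positive = max(scored, key=lambda item: item[1])
    strongest_negative = min(scored, key=lambda item: item[1])
    return strongest_positive, strongest_negative


# Illustrative series, not measurements from the study.
data = {
    "dust": [1.0, 2.0, 3.0, 4.0],
    "NO2": [2.0, 3.0, 7.0, 8.0],
    "SO2": [4.0, 3.0, 2.0, 1.0],
}
pos, neg = extreme_pairs(data)
```

Only the upper triangle of the matrix is evaluated, since $r_{i j} = r_{j i}$ and the diagonal is always 1.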

2.5. Mathematical Model of the Spatial Distribution of Air Pollution

A mathematical model based on spatial and correlational data analysis was used in this study to determine the priority points for installing new monitoring stations. The spatial distribution of pollution is modeled using the inverse distance weighting (IDW) method [56]. The following equation is used for the spatial distribution of air pollution [57]:

z (x, y) = \frac{\sum_{i = 1}^{N} w_{i} (x, y) z_{i}}{\sum_{i = 1}^{N} w_{i} (x, y)},

(15)

where z (x, y) is the forecasted value at the point (x, y), z_{i} is the pollution value at point i, N is the number of measurement points, and w_{i} (x, y) is the weight function, given as follows:

w_{i} (x, y) = \frac{1}{d_{i}^{p}},

(16)

where d_{i} is the distance from the point (x, y) to point i, and p is the power parameter of the weight function.

For spatial interpolation, the concentration of pollutants is estimated at intermediate points that form a regular grid.
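Equations (15) and (16) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the system's implementation: the function names, the default power p = 2, the station coordinates, and the grid step are all hypothetical.

```python
from math import hypot


def idw(x, y, stations, p=2.0):
    """Estimate z(x, y) from (x_i, y_i, z_i) stations using Equations (15)-(16)."""
    num = den = 0.0
    for xi, yi, zi in stations:
        d = hypot(x - xi, y - yi)
        if d == 0.0:
            # The weight 1 / d^p is undefined at a station, so return its own value.
            return zi
        w = 1.0 / d ** p
        num += w * zi
        den += w
    return num / den


def interpolate_grid(stations, step):
    """Evaluate IDW on a regular grid spanning the bounding box of the stations."""
    xs = [s[0] for s in stations]
    ys = [s[1] for s in stations]
    x0, y0 = min(xs), min(ys)
    nx = int((max(xs) - x0) / step) + 1
    ny = int((max(ys) - y0) / step) + 1
    return [(x0 + i * step, y0 + j * step, idw(x0 + i * step, y0 + j * step, stations))
            for i in range(nx) for j in range(ny)]


# Hypothetical stations: metric coordinates (m) and a pollutant value in MPC units.
stations = [(0.0, 0.0, 1.0), (1000.0, 0.0, 2.0), (0.0, 1000.0, 3.0)]
grid = interpolate_grid(stations, step=500.0)
```

Because the weights are positive and normalized, every interpolated value lies between the smallest and largest station values; a larger p makes the nearest station dominate more strongly.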

After interpolation, the points with the lowest pollution values, which can be regarded as background, are selected. Let:

z_{\min} = \min_{x, y} z (x, y), \quad z_{\max} = \max_{x, y} z (x, y).

(17)

The background point is defined as one for which the following condition is fulfilled:

z (x, y) \leq z_{\min} + α \cdot (z_{\max} - z_{\min}),

(18)

where α \in [0, 0.1] is the cutoff parameter.
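The background condition in Equation (18) amounts to a simple threshold on the interpolated grid. The sketch below assumes the grid is a list of (x, y, z) tuples; the function name and the sample values are hypothetical.

```python
def background_points(grid, alpha=0.05):
    """Return grid points with z <= z_min + alpha * (z_max - z_min), Equation (18)."""
    if not 0.0 <= alpha <= 0.1:
        raise ValueError("alpha is expected to lie in [0, 0.1]")
    values = [z for _, _, z in grid]
    z_min, z_max = min(values), max(values)  # Equation (17)
    threshold = z_min + alpha * (z_max - z_min)
    return [(x, y, z) for x, y, z in grid if z <= threshold]


# Hypothetical interpolated values (illustrative only).
grid = [(0, 0, 0.20), (1, 0, 0.25), (0, 1, 0.90), (1, 1, 1.20)]
background = background_points(grid, alpha=0.1)
```

With these values the threshold is about 0.3, so only the first two points qualify as background.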

To determine the relationships between different pollutants, a correlation matrix is used [58]:

R = \begin{bmatrix} 1 & r_{12} & \dots & r_{1 n} \\ r_{21} & 1 & \dots & r_{2 n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n 1} & r_{n 2} & \dots & 1 \end{bmatrix},

where r_{i j} is the Pearson correlation coefficient between variables i and j.

The pairs of variables with the largest values of |r_{i j}| are identified for further analysis.

Forecasts of pollutant concentrations are built with either the ARIMA or the BATS model, whichever is more accurate [59]. The autoregressive moving-average form of the forecast model is given as follows:

y_{t} = ϕ_{1} y_{t - 1} + ϕ_{2} y_{t - 2} + \dots + ϕ_{p} y_{t - p} + ε_{t} - θ_{1} ε_{t - 1} - \dots - θ_{q} ε_{t - q},

(19)

where y_{t} is the predicted value at time t, ϕ_{1}, \dots, ϕ_{p} are the autoregression coefficients, θ_{1}, \dots, θ_{q} are the moving average coefficients, and ε_{t} is the random error.
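To make Equation (19) concrete, the following sketch evaluates its right-hand side for one step ahead, with the unknown future error ε_{t} replaced by its expected value of zero. The coefficients and values are hypothetical; in the system itself the coefficients would be estimated by the fitted model, and for ARIMA the equation would apply to the series after differencing.

```python
def arma_one_step(history, residuals, phi, theta):
    """One-step-ahead forecast from Equation (19) with the future error set to 0.

    history   -- observed values, oldest first (..., y_{t-2}, y_{t-1})
    residuals -- past errors, oldest first (..., eps_{t-2}, eps_{t-1})
    phi       -- autoregression coefficients [phi_1, ..., phi_p]
    theta     -- moving average coefficients [theta_1, ..., theta_q]
    """
    ar_part = sum(c * history[-k] for k, c in enumerate(phi, start=1))
    ma_part = sum(c * residuals[-k] for k, c in enumerate(theta, start=1))
    return ar_part - ma_part


# Hypothetical ARMA(2, 1) example: 0.5 * 2.0 + 0.2 * 1.0 - 0.3 * 0.1
forecast = arma_one_step(history=[1.0, 2.0], residuals=[0.1],
                         phi=[0.5, 0.2], theta=[0.3])
```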

Recommendations for the location of monitoring stations are formed based on the geographical distribution of points with the lowest concentrations of pollutants, correlation analysis to determine dependencies between pollutants, and forecasts of exceedances of maximum permissible concentrations.

For each determined point, logistical feasibility and distance to existing monitoring stations are additionally considered.

2.6. Description of the Retrospective Municipal Monitoring Data of the Kremenchuk Agglomeration Used in the Information System

The information system is based on retrospective data from municipal monitoring of the Kremenchuk agglomeration in the Poltava region, which was conducted monthly using a network of stationary laboratory posts. The network comprised five posts. Four were located in the city of Kremenchuk: Post No. 1, Molodizhna Street, 9; Post No. 2, Doctor O. Bohayevskyi Street; Post No. 4, Shevchenko Street, 22/30; and Post No. 5, I. Prykhodka Street, 89. One was located in the city of Horishni Plavni: Post No. 1, Dobrovolskoho Street, 6. The study area, the Kremenchuk agglomeration, is located in the central part of Ukraine and is a significant industrial hub within the Poltava region. It encompasses the city of Kremenchuk, a major regional center with a diverse industrial base, and the smaller city of Horishni Plavni, whose economy is primarily based on the mining and processing industries. The presence of large local enterprises, such as the oil refinery (Ukrtatnafta) and the mining and processing plant Ferrexpo, significantly impacts the air quality in the region. The source data, obtained from a network of stationary laboratory posts strategically located across the agglomeration to capture the spatial variability of pollutant concentrations, include a set of pollutants with monthly averaging in units of MPC and mg/m³: Dust, Sulfur Dioxide, Carbon Monoxide, Nitrogen Dioxide, Nitrogen Oxide, Formaldehyde, Ammonia, Hydrogen Fluoride, Hydrogen Chloride, Phenol, Soot, Sulfates, Hydrogen Sulfide. For testing the forecasting system, parameters were selected that retrospectively showed significant exceedances relative to Ukrainian standards for maximum permissible concentrations; these include dust, sulfur and nitrogen dioxides, and formaldehyde. This focus on pollutants with a history of high levels ensures that the forecasting system is tested and developed for the most critical air quality concerns in the region.
The determination of pollutant levels was carried out using a laboratory method with environmental monitoring equipment that is regularly calibrated and maintained by municipal authorities. However, detailed information regarding the uncertainty associated with these measurements was not available in the provided dataset. The selected episode for the initial development and testing of the forecasting system covered the period from 2007 to 2024. This timeframe was chosen due to the availability of complete and consistent data records, as well as the occurrence of several periods with significant air pollution events exceeding permissible limits, providing a robust basis for the evaluation of the proposed model.

The input data set of air pollution parameters, which is processed by the environmental monitoring information system, contains coordinates (x_i, y_i) and corresponding pollutant values z_i for each measurement point. The coordinates are transformed into the metric coordinate system (EPSG:3857) for further work. The developed information system for environmental monitoring performs data preprocessing, interpolation, forecasting, and interactive visualization. A regular grid of points is constructed in a specified area for data interpolation. The grid coordinates are determined by the boundaries of the geographical space of the source data and a given discretization step. The predicted value z(x, y) is calculated using the IDW formula for each point on the grid. The calculation is performed using specialized Python libraries, such as Startinpy. The interpolated values are presented in the form of a map with pollution contours, which visualizes the distribution of concentrations in space. Interpolation is performed for a given time period, enabling detailed analysis of that period.
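The transformation into EPSG:3857 can be illustrated with the standard spherical Web Mercator formulas. This sketch assumes that the source coordinates are longitude and latitude in degrees (WGS 84), which the text does not state explicitly; the function name and the approximate coordinates of Kremenchuk are used only for illustration. A geospatial library would normally perform this reprojection.

```python
from math import log, pi, radians, tan

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)


def to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to EPSG:3857 metric coordinates."""
    x = EARTH_RADIUS_M * radians(lon_deg)
    y = EARTH_RADIUS_M * log(tan(pi / 4 + radians(lat_deg) / 2))
    return x, y


# Approximate location of Kremenchuk (illustrative values only).
x, y = to_web_mercator(33.42, 49.07)
```

Distances computed in this projection are stretched away from the equator, which is worth keeping in mind when choosing the grid step and interpreting d_{i} in the weight function.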

An information system has been developed to implement data processing for monitoring and forecasting air pollution. The developed information system is built on a modular principle and consists of interconnected subsystems [59]. The information system user should be able to interact with the modeling results using an interactive interface. For visualization, a selection of a chemical pollutant from a given list is available. The pollution map is integrated with the OpenStreetMap platform, allowing the overlay of administrative units’ geographical boundaries for greater informativeness. Pollution intensity is displayed as heat maps with contour lines. Overlaid measurement points show the initial data from which the interpolation surface is formed. An animation is available that demonstrates changes in space and time.

3. Results

3.1. Modeling of the Environmental Monitoring Forecasting Information System

The information system for forecasting environmental parameters defines the main functions as environmental data loading and processing, data analysis, forecasting, and visualization of data, as well as checking for exceedances of regulatory values of pollutants and creating reports. It implements use cases for three actors: administrator, analyst, and user (Figure 2).

The administrator plays a key role in data loading and pre-processing. The administrator is responsible for the correct input of information into the system and its subsequent processing, which forms the basis for analytical processes. In addition, the administrator generates reports that can be used to inform stakeholders or make management decisions.

The analyst focuses on a deeper analysis of environmental data. The analyst's tasks include analyzing pollutant levels, predicting future pollution levels, creating visualizations, and checking for exceedances. These functions allow the analyst to identify patterns, assess risks, and make recommendations for environmental improvement.

The user has access to tools for data visualization and checking for MPC exceedances. This allows the user not only to receive information in an easy-to-understand format but also to respond to critical situations quickly. For example, an ordinary user can find out about excess concentrations of harmful substances in their region, which will help them adjust their actions or travel routes.

Coordination between actors of the information system and a clear delineation of their roles ensures the system’s effective functioning. This makes it a powerful tool for collecting, processing, and interpreting environmental information, creating a basis for environmental protection and public awareness.

The environmental monitoring forecasting information system is designed based on structured objects that reflect the complexity of the relationships between the components of the subject area. The main elements are observation posts, pollutants, cities, addresses, exceedances, and MPC regulatory values, which are implemented as corresponding classes. The UML class diagram for the subject area of the environmental monitoring forecasting information system is shown in Figure 3.

The ObservationPost class plays a central role in the information system, representing specific observation points for pollution levels. Each post is tied to a specific city and address, allowing for the localization of environmental problems. The observation post collects data on concentrations of various pollutants, such as dust, nitrogen dioxide, or sulfur dioxide. Interaction with the Pollutant class provides storage of information about each type of pollutant, including its name, units of measurement, and permissible standards.

The City class adds demographic and geographic context to observations, allowing population and region data to be associated with pollution levels. The location of observation posts is detailed using the Address class, which stores exact addresses, including streets and buildings.

A key aspect of the information system is recording cases of exceeding permissible concentration levels, which are processed through the Exceedance class. This class is associated with the observation post where the exceedance was recorded, the specific pollutant, and the date. An important parameter is the exceeded limit field, which allows you to identify whether the established MPCs are exceeded. Norms for pollutants are stored in the MPC class, which determines their validity range and allows changes in environmental standards to be considered over time.
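The relationships described above can be sketched with Python data classes. The class names follow the text, but every field name, the helper `check_exceedance`, and the validity date are assumptions made for illustration, since the attributes shown in Figure 3 are not reproduced here. The dust limit of 0.5 mg/m³ is the value quoted later in the article.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Pollutant:
    name: str
    unit: str


@dataclass(frozen=True)
class MPC:
    pollutant: Pollutant
    limit: float
    valid_from: date
    valid_to: Optional[date] = None  # None means the norm is still in force

    def applies_on(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


@dataclass(frozen=True)
class ObservationPost:
    city: str
    address: str


@dataclass(frozen=True)
class Exceedance:
    post: ObservationPost
    pollutant: Pollutant
    day: date
    exceeded_limit: bool


def check_exceedance(post, pollutant, day, value, norms):
    """Compare a measurement with the norm valid on that day (hypothetical helper)."""
    # Raises StopIteration if no norm covers the requested pollutant and day.
    norm = next(n for n in norms if n.pollutant == pollutant and n.applies_on(day))
    return Exceedance(post, pollutant, day, value > norm.limit)


dust = Pollutant("dust", "mg/m3")
post = ObservationPost("Kremenchuk", "Molodizhna Street, 9")
norms = [MPC(dust, 0.5, valid_from=date(2007, 1, 1))]  # start date is an assumption
record = check_exceedance(post, dust, date(2020, 6, 1), 1.0, norms)
```

Storing a validity range on each norm is what lets the same check remain correct when a standard changes: a measurement is always compared with the limit in force on its own date.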

In the forecasting information system of environmental monitoring, its modules interact for effective data analysis and visualization, training the forecasting model, and assessing risks of exceeding environmental standards. The main elements of the information system’s architecture are modules for data processing, analysis, forecasting, calculation of exceedances, and visualization (Figure 4), which are implemented as classes using an object-oriented approach.

The DataLoader class is responsible for loading and initial processing of data. It prepares the data for further analysis by converting raw records into a usable format. These data are passed to the DataAnalyzer class, which performs in-depth analysis of environmental parameters. Its tasks include analyzing air pollutants, creating correlation matrices, and generating statistical indicators that help understand the nature and scale of environmental problems.

The AirQualityModel class receives the processed data from the DataAnalyzer to train the air quality forecasting model. This model not only generates forecasts based on the input data but also evaluates its results by comparing them with the actual data. This process provides feedback to improve the forecast’s accuracy.

After forecasting, the data are passed to the ExceedanceCounter class, which identifies cases exceeding permissible pollution standards. It counts the number of such exceedances and creates reports that can serve as a basis for decision-making. Finally, the MapVisualizer class creates geospatial visualizations and displays the analysis results. This allows us not only to identify localized problems but also to understand their context on a wider geographical scale.

The coordination of these components ensures the system’s efficient operation, which is able not only to forecast environmental parameters but also to identify problem areas, helping to reduce the harmful impact of air pollutants on the environment.

Modeling of information system scenarios was carried out by constructing UML sequence diagrams that show how actors and the information system objects interact with each other, in what sequence, and what data are transferred to each other.

The limit parameters for pollutants that this system monitors are established by the legislation of Ukraine, including Resolution of the Cabinet of Ministers of Ukraine No. 827. This regulatory document, developed based on the European regulatory framework, defines the MPCs and other relevant thresholds for various harmful substances in the environment. These legally binding limits ensure that the system’s checks for exceedances are aligned with both national and international environmental protection standards.

In the “Data Loading” scenario shown in Figure 5, the administrator initiates a process that begins with the information system connecting to the data source. After receiving the data, the system stores it in the database, confirming successful completion. This allows for structured data for further processing and analysis.

According to the “Data Processing” scenario (Figure 6), the administrator initiates a request, after which the system receives raw data from the database, passes them to the processing algorithm, and the results are updated in the database. The process ends with a message about a successful update, which guarantees the relevance of the information.

The “Pollutant Analysis” process, shown in Figure 7, is activated by the analyst. The system extracts the necessary data from the database and passes them to the analysis algorithm, which returns the results. The analysis results become the basis for making environmental decisions or building forecasts.

According to the “Pollution Level Forecasting” scenario (Figure 8), the analyst initiates a process in which the system retrieves historical data and feeds them to a forecasting model. The forecasted data are returned to the analyst for further risk assessment.

Users interact with the system through the “Data Visualization” scenario (Figure 9). The system retrieves the required data from the database and displays them in a graphical interface. This provides users with visualized data, simplifying their perception and analysis.

The analyst initiates the “MPC Exceedance Check” scenario (Figure 10). The system passes data to the exceedance checking algorithm, which detects cases where pollutant concentrations exceed the established standards. The results obtained are provided to the analyst for evaluation and decision-making.

Finally, “Report Generation” (Figure 11) is the administrator’s task. The information system retrieves the necessary data from the database, generates a report, and provides the administrator with a finished document. This document can be used for reporting or strategic planning.

The development of the environmental monitoring forecasting information system with automatic selection of the best forecasting model is based on the use of specialized libraries that provide efficient data processing, modeling, and visualization of results. The use of each library has its own clear justification, as they cover key aspects of the system, including data analysis, geospatial processing and forecasting. Pandas and NumPy are tools for working with data, including data analysis. Pandas provides convenient work with data tables, allowing their filtering, transformation, and aggregation. For example, when processing environmental parameters over time, it is important to work with dates, which is carried out using special methods of this library. NumPy provides fast mathematical calculations on data arrays, significantly speeding up preparation for analysis. Data visualization is performed using Matplotlib 3.9.2 and Seaborn 0.13.2, which allow you to build graphs to analyze changes in pollutant concentrations. Seaborn simplifies the construction of complex statistical graphs, which is important for comparing data between different cities and time periods. This approach allows you to identify trends and anomalies in environmental indicators quickly.

The Scikit-learn and Sktime libraries were used to forecast changes in environmental parameters. The first library provides a wide range of machine learning tools, including model evaluation metrics such as the mean squared error. Sktime specializes in time series analysis, which allows for predicting future changes in pollutant concentrations. The integration of these libraries allows the automatic selection of the best forecasting model based on the calculated metrics.
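The idea of automatic model selection can be sketched independently of any particular library: each candidate is fitted on a training part of the series, scored on a held-out part, and the candidate with the lowest error is kept. In the sketch below, two deliberately simple stand-in forecasters (last value and seasonal repeat) take the place of ARIMA and BATS so that the example stays self-contained; they are not the models used in the study, and all names and data are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(train, horizon):
    """Stand-in model 1: repeat the last observed value."""
    return [train[-1]] * horizon


def seasonal_naive_forecast(train, horizon, period=12):
    """Stand-in model 2: repeat the value observed one season earlier."""
    return [train[-period + (h % period)] for h in range(horizon)]


def select_best(models, series, horizon):
    """Score every model on the last `horizon` points and return the best one."""
    train, test = series[:-horizon], series[-horizon:]
    scores = {name: mse(test, forecast(train, horizon))
              for name, forecast in models.items()}
    return min(scores, key=scores.get), scores


# Hypothetical monthly series with a repeating yearly pattern (MPC units).
year = [0.8, 0.9, 1.0, 1.1, 1.3, 1.4, 1.5, 1.4, 1.2, 1.0, 0.9, 0.8]
series = year * 3
models = {"naive": naive_forecast, "seasonal_naive": seasonal_naive_forecast}
best, scores = select_best(models, series, horizon=12)
```

Any forecaster with the same call shape, including wrappers around ARIMA or BATS, could be added to `models` without changing the selection logic.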

To work with geospatial data, Geopandas and Shapely were used, which allowed the processing of the coordinates of monitoring posts, building geometric shapes, and calculating distances between points. Combined with Startinpy, which provides triangulation and spatial interpolation, these libraries allow the creation of interactive pollution maps. This is important for assessing the distribution of pollutants in different regions and identifying background points for monitoring.

The Statsmodels package is used to build statistical models, such as ARIMA, which are well-suited for forecasting changes in environmental indicators. The models in this library provide high forecast accuracy and ease of integration with other tools.

For interactive work in the Jupyter Notebooks environment, IPython and ipywidgets are integrated. They provide the ability to create dynamic interface elements, such as sliders, for selecting parameters or date ranges. This allows system users to interact with data through the user interface.

These Python libraries were used to implement software modules that analyze the data, forecast environmental air parameters, and visualize the results. Each library performs its task efficiently, which makes it possible to automate the selection of the best forecasting model and to assess the environmental situation interactively. Together, the developed software modules constitute a single information system that performs data analysis and forecasting of atmospheric air pollution, MPC exceedance checks, recommendations for the location of monitoring stations, data visualization, and report generation. Table 1 presents the recommended Python libraries and tools for the environmental monitoring forecasting system.

3.2. Air Pollution Data Analysis Using an Information System

The analysis of air pollution data in the information system begins with preprocessing of the raw data. The data are loaded from a CSV file, and the DataAnalyzer module then cleans them, converting some columns into appropriate formats, in particular date and numeric values. An important step here is the conversion of coordinates into a format convenient for further spatial analysis.

The change in pollution over time is then analyzed using the plot_air_quality() function. It groups the data by city and address, building graphs for each pollutant (dust, sulfur dioxide, carbon monoxide, etc.). For this purpose, only those records are selected where the number of measurement points exceeds the minimum permissible level, after which graphs are displayed for each city showing the concentrations of pollutants over time. An example of visualization of the average values of dust concentration exceedances by year, determined at laboratory posts in Kremenchuk during the period 2007–2024, is shown in Figure 12.

The value in MPC units, in Figure 12 and in the system in general, is calculated by dividing the measured concentration (here, the determined dust value) by the maximum permissible concentration, which is 0.5 mg/m³ for dust. The result shows how many times the measured dust concentration exceeds the established standard. For example, if the measured dust value is 1.0 mg/m³, then the value in MPC units is 1.0/0.5 = 2, indicating that the measured concentration is twice the permissible limit.
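The conversion described above is a single division, shown here as a small sketch. The dictionary and function names are hypothetical; only the dust limit of 0.5 mg/m³ comes from the text.

```python
MPC_LIMITS_MG_M3 = {"dust": 0.5}  # limit for dust as stated in the text


def to_mpc_units(pollutant, concentration_mg_m3):
    """Express a measured concentration as a multiple of its permissible limit."""
    return concentration_mg_m3 / MPC_LIMITS_MG_M3[pollutant]


def exceeds_limit(pollutant, concentration_mg_m3):
    """A value above 1 in MPC units means the standard is exceeded."""
    return to_mpc_units(pollutant, concentration_mg_m3) > 1.0


ratio = to_mpc_units("dust", 1.0)  # 1.0 / 0.5
```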

Airborne dust, whose level in MPC units is shown in Figure 12, consists of various kinds of particulate matter, including the key size fractions known as PM10 and PM2.5, which differ in their characteristics and sources.

PM10 refers to particulate matter with an aerodynamic diameter of 10 μm or less. These are inhalable particles that can penetrate deep into the lungs. Sources of PM10 include crushing or grinding operations, dust stirred up by vehicles on roads, and industrial processes.

PM2.5 refers to fine particulate matter with an aerodynamic diameter of 2.5 μm or less. These are even smaller particles that can travel even further into the respiratory system and can even enter the bloodstream. PM2.5 is often formed from the chemical reactions of gases like sulfur dioxide and nitrogen oxides from power plants, industrial facilities, and automobiles.

Spatial analysis using interactive maps improves the analysis of pollution changes in space and time (Figure 13). Interactive maps visualize the spatial interpolation of pollutants obtained with the inverse distance weighting (IDW) method, which allows the distribution of pollutants in different regions to be studied. Figure 13 shows the spatial interpolation of dust based on data from stations located in Kremenchuk. The color scales next to the map distinguish low pollutant concentrations (from 0 to 0.5, shown in gradations of green), permissible concentrations (from 0.5 to 1, shown in orange), and dangerous concentrations (more than 1, shown in gradations from red to black). Because of the technical capabilities of the matplotlib library, the boundaries of the color zones were widened by 10% (each boundary is defined as the limit ±10%) so that the contours overlap in color and no gaps appear between the contours of different concentration levels.

In the developed information system, special attention is paid to calculating exceeding permissible concentrations. The system determines the number of consecutive exceedances for a selected period, which allows the identification of long-term and serious violations of pollution standards that require a quick response. An example of visualization of dust concentration consecutive exceedances is shown in Figure 14.

Figure 14 shows the level of pollution at various stationary posts in Kremenchuk city in the range from 1 January 2019 to 1 September 2024 with colored bars, and the maximum permissible concentration line runs through 1 to visually track the number of exceedances.

Correlation matrices are used to assess the relationships between different pollutants. The correlation_matrix() function allows the construction of such a matrix for numeric columns, demonstrating the dependence between different pollutants, which contributes to a deeper understanding of their interaction. The generated correlation graphs allow us to track the relationships between parameters, which is important for environmental research. The correlation matrix of MPC air pollutant concentrations, constructed using the developed information system, is shown in Figure 15.

Figure 15 shows the correlations between the time series of pollutant values in MPC units from 1 January 2019 to 1 August 2024 for laboratory Post No. 1, located at 9 Molodizhna St. in the city of Kremenchuk, Poltava region.

A moderate positive correlation (0.42) is observed between Carbon monoxide and Nitrogen dioxide. This suggests that these two pollutants may share common emission sources, likely related to combustion processes, such as vehicular traffic or industrial activities in the vicinity of the monitoring station. Similarly, a moderate positive correlation (0.45) exists between Carbon monoxide and Formaldehyde, which could also point toward shared combustion sources or secondary formation pathways in the atmosphere.

Formaldehyde and Phenol exhibit a moderate positive correlation (0.41). This might indicate similar industrial emissions or photochemical production mechanisms contributing to their concentrations. Furthermore, a somewhat weaker positive correlation (0.35) is found between Dust and Soot. A positive relationship is expected here, as both are particulate matter and can originate from similar sources such as the combustion of fossil fuels, industrial processes, and construction activities.

Additionally, several weak negative correlations were observed between various pollutants, such as Sulfur dioxide, Dust, Soot, and certain organic compounds, when compared with others, such as Carbon monoxide and Nitrogen dioxide. The low strength of these correlations generally indicates that the dominant sources or atmospheric processes influencing these different pollutant groups are largely distinct.

It is important to note that correlation does not imply causation. While a strong positive correlation between two pollutants might suggest a common source, it could also arise from complex atmospheric chemistry where the formation or dispersion of one pollutant influences the other. Similarly, a negative correlation could indicate different sources or competing removal processes.

3.3. Analysis of Air Pollution MPC Exceedance Data

For annual analysis of pollution changes, data are averaged over the years, allowing us to find general trends over a long period. The implemented modules of the information system allow us to visualize and analyze such trends effectively, allowing us to assess the impact of pollutants on the environment in different cities.

Analysis of MPC exceedances of air pollutants in the atmosphere requires a precise algorithmic approach, which includes processing the data, checking them for compliance with established standards, and visualizing the results. The main stages of the algorithm for analyzing exceedances of permissible standards of pollutants in the atmosphere are described below.

The first stage is data loading. Tabular data are used, which includes information about the concentration of pollutants at different places and times. For each type of pollutant, its maximum permissible concentrations are determined. These regulatory values are given in the form of a dictionary, where each parameter corresponds to its MPC. This provides convenient access to data for checking exceedances.

The second step is to calculate exceedances. A function that separately analyzes each data group (for example, addresses or cities) was implemented. For each pollutant, a check is made to see if its value exceeds the established norm. If such an exceedance is observed, it is recorded, and the number of consecutive exceedances is also counted. If this number exceeds a certain threshold, such an incident is considered significant and added to the statistics. The flexibility of the implementation is provided by the ability to filter data according to various criteria. For example, it is possible to limit the analysis to a certain time period, city, or specific address. This allows us to focus on critical observation points or specific time intervals.
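The first two stages can be sketched as follows: norms are kept in a dictionary, each data group is analyzed separately, and runs of consecutive exceedances are counted. The function names, the run-length threshold of 2, and the sample measurements are assumptions made for illustration; only the dust limit of 0.5 mg/m³ is taken from the article.

```python
from itertools import groupby

MPC_LIMITS = {"dust": 0.5}  # mg/m3; one entry per monitored pollutant


def exceedance_runs(values, limit):
    """Lengths of consecutive runs of values above the limit, in time order."""
    return [sum(1 for _ in run)
            for above, run in groupby(values, key=lambda v: v > limit)
            if above]


def analyse_groups(groups, limits, threshold=2):
    """Summarize exceedances per (group, pollutant).

    groups    -- {group name: {pollutant: chronological list of measurements}}
    threshold -- a run longer than this many measurements counts as significant
    """
    report = {}
    for group, pollutants in groups.items():
        for name, values in pollutants.items():
            runs = exceedance_runs(values, limits[name])
            report[(group, name)] = {
                "exceedances": sum(runs),
                "significant_runs": [n for n in runs if n > threshold],
            }
    return report


# Hypothetical monthly dust measurements for one address (mg/m3).
groups = {"Post No. 1": {"dust": [0.4, 0.6, 0.7, 0.8, 0.3, 0.9, 0.2]}}
report = analyse_groups(groups, MPC_LIMITS)
```

Filtering by period, city, or address then reduces to choosing which groups and which slices of each series are passed to `analyse_groups`.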

The next stage is the visualization of the results. For each pollutant, a graph is created, which displays changes in its concentration over time. For better visibility, exceedances are marked in red, allowing you to identify critical points quickly. Additionally, a horizontal line is placed on the graph corresponding to the MPC, which helps visually assess the scale of exceedances.

The final stage is the preparation of statistical reports containing information on the number and duration of exceedances for each pollutant in the analyzed groups. These reports, together with graphs, contribute to the assessment of the environmental situation. However, for a truly holistic picture, these reports should be considered alongside meteorological data, information about local industrial enterprises, and relevant events. This comprehensive approach enables qualified specialists from municipal environmental enterprises to form an overall picture of atmospheric air quality and assess its compliance with established standards, ultimately guiding necessary measures for improvement.

3.4. Spatial Forecasting of Air Pollutant Concentration Values Using the Interpolation Method

Spatial forecasting implemented with the inverse distance weighting (IDW) interpolation method is a tool for forecasting pollutant concentrations at different points on the map. The first stage is the processing and preparation of the data, which include the coordinates of the measurements and the pollutant values at a certain point in time. For this task, the data are collected in a table, where each measurement has coordinates and pollutant concentration values.

Interpolation using IDW allows for the forecasting of the values of pollutant concentrations at points where there are no direct measurements based on the values obtained from the nearest points. The principle of the IDW method is that each measurement is assigned a weight, which is calculated based on the distance to the forecasted point: the closer the point to the measured one, the greater the weight its value has in the calculation. This provides more accurate forecasting within the measurement area. To implement interactive visualization of interpolation in the information system, widgets are used to allow the user to select parameters such as date, city, address, and pollutant for analysis. This provides flexibility in data analysis, allowing the study of the change in pollutant concentrations according to various criteria. The interface allows dynamic graph updates depending on the user’s selected conditions, providing a convenient tool for studying the spatial distribution of pollutants in different time periods.

Thanks to the interactivity and flexibility of data selection, this approach allows the user to perform a detailed analysis of the spatial distribution of pollutants in specific cities or given areas of the territory, as well as make predictions about future changes in pollutant concentrations based on available data.

3.5. Forecasting Time Series of Air Pollutant Concentrations and Creating Spatial Interpolation

Time series such as air pollutant concentrations are forecast using the BATS and ARIMA models. These models estimate values for future periods, for example, the next 24 months. The two models take different approaches to time series analysis. ARIMA is a classic forecasting approach that combines autoregression, differencing, and moving-average terms and relies on the autocorrelation structure of the series. BATS, in turn, is an exponential smoothing state-space model that combines a Box-Cox transformation, ARMA errors, and trend and seasonal components, which allows seasonal changes and other complex patterns in the data to be taken into account. An example of the trend and seasonality for dust in atmospheric air is shown in Figure 16.

Figure 16 presents a comprehensive time series decomposition of monthly average dust concentration data, measured in multiples of the Maximum Permissible Concentration (MPC), at the stationary monitoring post No. 4, located at 22/30 Shevchenko St., Kremenchuk, Ukraine, from 2007 to 2024. This decomposition allows for the isolation and analysis of the underlying trend, seasonal patterns, and random fluctuations present within the dataset, providing valuable insights into the temporal dynamics of air quality in this urban environment.

Figure 16a displays the original monthly average dust concentration overlaid with the extracted trend component. Visually, the monthly average (blue line) exhibits considerable variability, suggesting the presence of both short-term fluctuations and longer-term patterns. The superimposed trend line (orange line) provides a smoothed representation of the data’s long-term direction. From a qualitative perspective, the trend appears to show an initial period of relatively high dust concentrations, followed by a gradual decline over the latter part of the observation period. This suggests potential improvements in air quality related to dust pollution in Kremenchuk between 2007 and 2024.

Figure 16b explicitly isolates the trend component. This panel confirms the observation from the top panel, illustrating a more pronounced view of the long-term evolution of dust concentration. The trend line starts at a relatively high value around 1.2 times the MPC in the initial years (2007–2009), experiences some fluctuations, and then generally decreases, reaching levels below 1.0 MPC in the later years (post-2019). This downward trend could be attributed to various factors, such as the implementation of stricter environmental regulations, changes in industrial activity within the region, or shifts in meteorological patterns impacting dust dispersion.

Figure 16c focuses on the seasonality component. This component reveals a recurring pattern within each year, indicating systematic variations in dust concentration linked to seasonal changes. The graph shows a consistent cyclical pattern with peaks and troughs occurring at roughly the same time each year. The seasonality plot suggests higher dust concentrations during certain periods of the year and lower concentrations during others. Based on the visual pattern, it appears that dust levels tend to peak around the summer months and reach their lowest points during the winter months. This could be related to factors such as increased construction or agricultural activities during warmer months, drier conditions leading to more dust suspension, or variations in wind patterns and precipitation that affect dust deposition. The amplitude of the seasonal fluctuations appears relatively consistent throughout the observation period, suggesting that the factors driving this seasonality have remained relatively stable.

Finally, Figure 16d illustrates the random fluctuations or residuals. These represent the variations in the dust concentration data that are not explained by the trend or the seasonal pattern. This component appears as a seemingly random series of positive and negative deviations around a mean of zero. The presence of these random fluctuations is expected in real-world environmental data, arising from unpredictable events, measurement errors, or other factors not captured by the trend and seasonality components. The magnitude of these fluctuations appears to be within a reasonable range, indicating that the time series decomposition has effectively captured the systematic variations in the data.
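A decomposition into trend, seasonality, and residuals of the kind shown in Figure 16 can be sketched as follows. The source does not state which decomposition method the system uses, so this is an assumed classical additive decomposition with a centered moving average for an even period of 12 months. The function name `decompose_additive` and the synthetic series are illustrative only and are not the Kremenchuk data.

```python
import math


def decompose_additive(series, period=12):
    """Classical additive decomposition for an even seasonal period."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    # Centered 2 x period moving average: half weights on the two end points.
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[t] = total / period
    # Average the detrended values for each position within the period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal_means = [sum(b) / len(b) for b in buckets]
    # Center the seasonal component so that it sums to zero over one period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[t % period] - offset for t in range(n)]
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual


# Synthetic monthly series: slow linear decline plus a yearly cycle.
series = [1.2 - 0.002 * t + 0.3 * math.sin(2 * math.pi * t / 12) for t in range(72)]
trend, seasonal, residual = decompose_additive(series)
```

The first and last six months have no trend value because the centered window does not fit there. On real measurements the residual component would not vanish as it does for this noise-free synthetic series.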

The choice between the ARIMA and BATS forecasting models in the information system is made automatically, based on the forecast error calculated on the test data. To achieve this, both models are fitted on the training data and used to forecast the held-out test period, and then the mean squared error (MSE) or other metrics are compared to determine the best model. If BATS gives the lower error, it is chosen for forecasting, and if ARIMA gives more accurate forecasts, it is used instead. The proposed information system’s automatic forecasting model selection contributes to high forecasting accuracy.
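The selection logic can be sketched independently of any particular forecasting library. In the sketch below, the two candidate forecasters are deliberately simple stand-ins (a mean forecaster and a seasonal naive forecaster), not ARIMA or BATS. The function names, the 12-month test window, and the synthetic series are assumptions made for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_forecaster(train, horizon):
    """Stand-in model: repeat the training mean for every future step."""
    level = sum(train) / len(train)
    return [level] * horizon


def seasonal_naive_forecaster(train, horizon, period=12):
    """Stand-in model: repeat the last observed seasonal cycle."""
    start = len(train) - period
    return [train[start + (h % period)] for h in range(horizon)]


def select_model(series, candidates, test_size):
    """Fit on the training part, score on the held-out part, keep the best."""
    train, test = series[:-test_size], series[-test_size:]
    scores = {
        name: mean_squared_error(test, forecast(train, len(test)))
        for name, forecast in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic monthly series with a strictly repeating yearly pattern.
pattern = [0.6, 0.7, 0.9, 1.1, 1.3, 1.5, 1.6, 1.5, 1.2, 1.0, 0.8, 0.7]
series = pattern * 6
candidates = {"mean": mean_forecaster, "seasonal_naive": seasonal_naive_forecaster}
best, scores = select_model(series, candidates, test_size=12)
```

In the system described here, the candidates would be the ARIMA and BATS models, but the comparison step stays the same: each candidate sees only the training part, and the one with the lowest error on the held-out part is kept.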

The forecasting in the information system was implemented using the Python library sktime, which supports both ARIMA and BATS models. This makes it possible to automatically select the model that gives the minimum forecasting error on the test data. An example visualization of the forecast formaldehyde data is shown in Figure 17.

Figure 17 shows the trends in formaldehyde pollution levels in Kremenchuk City, Poltava region, Ukraine, from 2007 to 2024, as measured at four monitoring stations. It further analyzes the forecasts generated by time series models, specifically ARIMA and BATS, extending to 2026. The graph provided illustrates the change in formaldehyde concentration, expressed in multiples of the MPC, over this period.

The historical data recorded by the monitoring system from 2007 to 2024 reveal significant fluctuations in formaldehyde concentrations across the four monitoring stations: Post No. 1 (9 Molodizhna Street), Post No. 2 (Doctor O. Bohayevskyi Street), Post No. 4 (22/30 Shevchenko Street), and Post No. 5 (89 I. Prykhodka Street). The ARIMA and BATS forecasting models provide projections extending to 2026 (forecasting horizon of 24 months). The MSE of these forecasts, calculated on the historical data, stands at 0.95 for Post No. 1, 0.62 for Post No. 2, 0.78 for Post No. 4, and 0.55 for Post No. 5. While the forecasts for Post Nos. 2, 4, and 5 exhibit the expected seasonal variations, the projection for Post No. 1 appears as a nearly straight line. This suggests a period of recent stability in formaldehyde levels at this location, although the relatively higher MSE of 0.95 indicates considerable uncertainty and potential for deviation from this stable prediction in the coming years, given the historical volatility observed at this station. In the historical data, all stations exhibit a degree of seasonality, with peaks and troughs occurring at roughly similar times of the year, suggesting common underlying factors influencing formaldehyde levels across the city. These factors could include seasonal variations in industrial activity, traffic patterns, meteorological conditions (such as temperature inversions trapping pollutants), and potentially photochemical reactions that contribute to formaldehyde formation.

While a general downward or upward trend is not immediately apparent across all stations for the entire period, certain observations can be made. For instance, some stations show periods of higher concentrations, particularly around 2012–2014 and again around 2021–2023, with notable spikes exceeding 10 times the MPC at certain points. These peaks could be attributed to specific industrial events, changes in emissions regulations, or other localized pollution sources near the monitoring stations. Conversely, there are periods where the concentrations are generally lower, indicating potential improvements in air quality or reduced emissions. The variability between the stations suggests localized influences on formaldehyde levels, highlighting the importance of having multiple monitoring points across the city.

The dotted lines on the graph represent the forecasts generated by ARIMA and BATS models for each of the four monitoring stations.

The forecasts generally attempt to capture the recent trends and seasonality observed in the actual data up to 2024. For some stations, the forecasts exhibit a continuation of the seasonal patterns, with predicted peaks and troughs in the formaldehyde concentrations. The amplitude of these predicted fluctuations and the overall level of formaldehyde vary between the stations, reflecting the different historical trends observed at each location.

One notable feature is the presence of a straight, almost horizontal dotted line among the forecasts. This line corresponds to the forecast for Post No. 1 (9 Molodizhna Street). The appearance of a straight line in a time series forecast typically indicates that the forecasting model has identified a stable level with minimal or no trend and seasonality in the recent historical data for that particular station.

Several reasons could explain why the ARIMA or BATS model produced a straight dotted line for Post No. 1:

- If the formaldehyde concentrations at Post No. 1 are relatively constant and without a clear trend or strong seasonality in the period leading up to 2024, the forecasting model would likely project this stability into the future.
- The specific parameters chosen for the ARIMA or BATS model for Post No. 1 might have led to a forecast that emphasizes the recent average level over any potential trend or seasonal components. For instance, if the autoregressive and moving average components are not significant, and the trend and seasonality parameters are estimated to be close to zero, the forecast would converge to a constant value.
- While less likely, it is possible that the data for Post No. 1 in the recent period are less variable or have missing values that could influence the model’s ability to detect trends and seasonality.

It is important to note that a straight-line forecast does not necessarily imply that the actual future concentrations will remain perfectly constant. Environmental factors and human activities are complex and can introduce unexpected changes. However, the straight dotted line suggests that, based on the historical data up to 2024, the best prediction for formaldehyde levels at Post No. 1 is a continuation of the recent stable average.

The information system can also visualize and forecast data on maps. The results of spatial interpolation of air pollutants within the city of Kremenchuk using the example of formaldehyde are shown in Figure 18. The figure shows the interpolation of formaldehyde over the city space based on IDW.

The interpolation was performed on the data set of pollution level forecasts for all four posts in the city, at monthly resolution, for 31 May 2025. Figure 19 shows the 10% of points from the overall spatial interpolation that have the lowest predicted pollution level, after averaging the forecasts for all points over two years. These points can be considered background when determining the sources and locations of predicted emissions. Accordingly, the system has functionality for dynamically setting the direction of filtering spatial points and specifying the number of points to be filtered. Filtering the least polluted and the most polluted points makes it possible to determine the difference between the background air quality of the urban agglomeration and polluted areas and to localize emission sources. The functionality works with both retrospective and forecasted data.
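The filtering step can be sketched as a simple ranking of interpolated points. The function name `filter_extreme_points`, the `(lat, lon, value)` tuple layout, and the sample grid are hypothetical and serve only to illustrate selecting a share of the lowest or highest values.

```python
def filter_extreme_points(points, fraction=0.10, lowest=True):
    """Return the given share of (lat, lon, value) points with extreme values."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    # Keep at least one point even for very small grids.
    count = max(1, round(len(points) * fraction))
    # Ascending order selects the lowest values, descending the highest.
    ranked = sorted(points, key=lambda point: point[2], reverse=not lowest)
    return ranked[:count]


# Hypothetical interpolated points; values are in multiples of the MPC.
grid = [(49.06 + 0.001 * i, 33.41, 0.1 * i) for i in range(20)]
background = filter_extreme_points(grid, fraction=0.10, lowest=True)
hotspots = filter_extreme_points(grid, fraction=0.10, lowest=False)
```

Comparing the `background` and `hotspots` selections corresponds to the contrast between background air quality and polluted areas described above.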

Table 2 presents the MSE values for air pollution forecasts generated using the BATS and ARIMA models. The analysis of these values provides insight into the performance of each model in predicting the concentration levels of various pollutants. The forecast accuracy metrics were compared using averaged data from four stationary posts in the city of Kremenchuk from 1 August 2008 to 1 August 2024, with a time step of one month and a forecasting horizon of five months.

MSE is a widely used metric for evaluating the accuracy of predictive models, with lower values indicating higher predictive accuracy. In this study, the BATS model generally outperforms ARIMA for most pollutants, as evidenced by its lower MSE values. For example, in predicting dust concentration, the BATS model achieves an MSE of 0.0242, significantly lower than the ARIMA model’s MSE of 0.0354. This suggests that the BATS model provides more precise forecasts for this pollutant, potentially due to its ability to account for seasonality and long-term trends more effectively than ARIMA.

Similarly, BATS demonstrates superior accuracy in forecasting pollutants such as carbon monoxide (MSE: 0.0062 vs. 0.0126 for ARIMA), nitrogen dioxide (MSE: 0.0068 vs. 0.0135), and formaldehyde (MSE: 1.4744 vs. 7.7984). These results highlight the effectiveness of the BATS model in capturing the temporal patterns of these pollutants, making it a more suitable choice for air quality forecasting. However, there are exceptions where ARIMA performs better, such as sulfur dioxide (ARIMA vs. BATS MSE: 0.00123 vs. 0.00157), phenol (0.0075 vs. 0.0136), and benzene (2.7338 × 10⁻⁵ vs. 6.2083 × 10⁻⁴). These cases suggest that ARIMA may be more effective for specific pollutants that exhibit different temporal behaviors, possibly due to the underlying statistical properties of the time series data.
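To put the reported differences on a common scale, the relative reduction in MSE can be computed from the values quoted in the text. The short sketch below does this for the four pollutants where BATS has the lower error. The dictionary layout and the function name `relative_reduction` are illustrative choices, and only MSE values stated above are used.

```python
# (BATS MSE, ARIMA MSE) as quoted in the text for each pollutant.
reported_mse = {
    "dust": (0.0242, 0.0354),
    "carbon monoxide": (0.0062, 0.0126),
    "nitrogen dioxide": (0.0068, 0.0135),
    "formaldehyde": (1.4744, 7.7984),
}


def relative_reduction(bats_mse, arima_mse):
    """Share by which the BATS error is lower than the ARIMA error."""
    return (arima_mse - bats_mse) / arima_mse


reductions = {
    pollutant: relative_reduction(bats, arima)
    for pollutant, (bats, arima) in reported_mse.items()
}

for pollutant, share in reductions.items():
    print(f"{pollutant}: BATS MSE is {share:.1%} lower than ARIMA MSE")
```

For these four pollutants the reduction ranges from roughly one third (dust) to roughly four fifths (formaldehyde). Because MSE depends on the scale of each series, the absolute values are not directly comparable between pollutants, whereas the relative reduction is.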

The exceptionally low MSE values for certain pollutants, such as toluene (MSE: 2.9956 × 10⁻⁴⁴ for BATS and 6.5848 × 10⁻³² for ARIMA) and ethylbenzene (MSE: 3.0845 × 10⁻¹⁷ for BATS and 1.3386 × 10⁻²¹ for ARIMA), taken at face value indicate that both models provide highly accurate forecasts for these substances. However, values this close to zero may instead reflect data sparsity or series with very little variability over time, which are inherently easy to predict.

4. Discussion

The imperative for enhanced environmental monitoring, particularly concerning air quality, is increasingly evident amidst global challenges like climate change and localized industrial impacts. Recent legislative drivers, such as the AAQD, further underscore the need for robust, data-driven systems capable of accurate detection and forecasting. This study addressed this need by developing and evaluating a comprehensive forecasting information system designed specifically for air pollution monitoring, aiming to bridge the gap between advanced analytical capabilities and practical implementation, especially within resource-constrained contexts like Ukrainian municipal enterprises.

Our work demonstrates the feasibility and utility of an integrated digital platform that automates key stages of air quality data management, from processing and analysis to forecasting and visualization. A core challenge identified in the introduction was the limitation of existing local systems, often hindered by manual processes, high costs of advanced solutions, and an inability to handle large datasets or complex environmental interactions effectively. The developed system directly tackles these issues by incorporating a multi-stage workflow, including automated data loading, cleaning, analysis of MPC exceedances, spatial interpolation using IDW, time series forecasting with automatic model selection, correlation analysis, and data-driven recommendations for monitoring network optimization. This structured approach not only enhances efficiency but also provides a holistic view of the air quality situation, facilitating informed decision-making.

The application of the system to retrospective data from the Kremenchuk agglomeration (2007–2024) provided a valuable testbed. Focusing on pollutants with historically significant exceedances (dust, sulfur dioxide, nitrogen dioxide, formaldehyde) allowed for a targeted evaluation of the system’s forecasting capabilities. The spatial analysis component, utilizing IDW interpolation, proved effective in visualizing the geographical distribution of pollutants and identifying potential hotspots, even in areas without direct monitoring. This spatial understanding is crucial for local authorities to assess exposure risks and target interventions. The integration with OpenStreetMap and interactive visualization features further enhances the system’s usability for diverse stakeholders, including administrators, analysts, and the public.

A key contribution of this work lies in the comparative evaluation and automatic selection of time series forecasting models, specifically ARIMA and BATS. While ARIMA has been a standard choice for air quality forecasting, our results often showed the BATS model yielding lower forecast errors (MSE) for several key pollutants in the Kremenchuk dataset. This superior performance can likely be attributed to BATS’s flexibility in handling complex seasonality and trends, along with its Box-Cox transformation and ARMA error modeling, which are advantageous for often noisy and variable environmental data. However, ARIMA remained the better-performing model for specific pollutants (e.g., sulfur dioxide), highlighting that no single model is universally optimal. This reinforces the critical importance of the system’s automated model selection feature, which uses criteria like AIC, MAE, and RMSE to choose the most suitable model for a given pollutant time series, thereby maximizing forecast accuracy without manual intervention. This automation is particularly beneficial for local entities lacking specialized expertise or resources for manual model tuning.

Furthermore, the integration of correlation analysis provides valuable insights into the relationships between different pollutants. Identifying strong positive or negative correlations can help understand common sources or atmospheric processes and potentially optimize monitoring strategies by reducing redundant measurements for highly correlated parameters. Combining spatial interpolation, correlation analysis, and forecasting of MPC exceedances allows the system to generate prioritized recommendations for locating new monitoring stations. By identifying background points (areas with consistently low pollution) and considering logistical factors, the system aids in the strategic expansion of monitoring networks to improve spatial coverage and data representativeness.
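The idea of flagging highly correlated parameters can be illustrated with a small sketch. The source does not specify the correlation measure or the threshold, so the Pearson coefficient, the threshold of 0.9, the function names, and the placeholder series below are all assumptions.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def strongly_correlated_pairs(series_by_pollutant, threshold=0.9):
    """List pollutant pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for first, second in combinations(sorted(series_by_pollutant), 2):
        r = pearson(series_by_pollutant[first], series_by_pollutant[second])
        if abs(r) >= threshold:
            pairs.append((first, second, r))
    return pairs


# Placeholder series; pollutant_b is exactly twice pollutant_a.
readings = {
    "pollutant_a": [0.5, 0.7, 0.9, 1.1, 1.3],
    "pollutant_b": [1.0, 1.4, 1.8, 2.2, 2.6],
    "pollutant_c": [0.9, 0.4, 1.0, 0.3, 0.8],
}
pairs = strongly_correlated_pairs(readings)
```

A pair flagged in this way is only a candidate for reduced measurement. Whether one parameter can stand in for another still depends on domain knowledge about sources and regulatory requirements.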

Despite the promising results, certain limitations must be acknowledged. The system was developed and tested using retrospective data from a single, albeit industrially significant, region. Validation in diverse geographical and meteorological contexts is necessary. The quality and resolution (monthly frequency) of the input data, while representative of standard municipal monitoring, may limit the precision of short-term forecasts. Moreover, the provided dataset lacked detailed measurement uncertainty information, which could further refine model performance assessment. While BATS and ARIMA were explored, incorporating other advanced machine learning techniques, such as the neural networks mentioned in the introduction, could potentially yield further improvements in forecasting complex, non-linear patterns. The method for recommending station locations, while data-driven, could be enhanced by incorporating more sophisticated spatial optimization algorithms and additional factors like population density or specific emission sources.

In conclusion, this study presents a significant step towards the digitalization of air pollution monitoring through a novel forecasting information system. By integrating automated data processing, advanced time series forecasting with model selection, spatial analysis, and user-friendly visualization, the system offers a powerful and potentially cost-effective tool for environmental agencies. It addresses key challenges faced by local monitoring authorities, enhances the accuracy and timeliness of air quality forecasts, and supports data-driven decision-making for environmental protection and public health, aligning with the goals of modern environmental directives. The developed system provides a robust foundation for future enhancements and wider implementation in the ongoing effort to mitigate the impacts of air pollution.

5. Conclusions

The main result of this research is the development of an information system that allows for air pollution detection using modern machine learning algorithms, spatial analysis, and visualization. The information system combines machine learning models, spatial interpolation, and time series analysis to assess changes in pollutant concentrations. The scientific novelty lies in the information system’s automatic selection of the optimal forecasting model based on the data and accuracy metrics. This approach reduces the influence of the human factor and increases the efficiency of environmental monitoring. The use of ARIMA and BATS models in combination with automated model selection allowed not only the prediction of the values of environmental parameters with high accuracy but also the analysis of seasonality and long-term trends.

Integrating spatial, correlation, and temporal analyses is also innovative, allowing not only forecasting but also giving recommendations for the placement of new monitoring posts. The novelty of the approach is also manifested in using interactive maps based on spatial modeling using the inverse distance weighted method, which allows the assessment of the spatial distribution of pollutants and their dynamics.

The practical significance of the information system lies in its ability to automate the process of monitoring and analyzing environmental data. Thanks to the interactive interface, users can quickly obtain the necessary information, conduct an analysis of exceeding the maximum permissible concentrations, and forecast future environmental indicators. This makes the system useful for environmental services seeking to optimize the process of monitoring pollution and for scientific researchers analyzing environmental trends.

Testing of the developed system confirmed its effectiveness in forecasting changes in environmental parameters. A comparison of ARIMA and BATS models showed that both provide sufficient accuracy. However, the model choice depends on the data’s specifics and forecast requirements. Interactive maps built on the inverse distance weighted (IDW) method allow users to study the geographical distribution of pollution and identify critical points. Thanks to the integration of correlation analysis, the system detects dependencies between parameters, contributing to a deeper understanding of environmental processes.

The implementation of the developed information system contributes to the achievement of the UN Sustainable Development Goals. In particular, it contributes to the sustainable development of cities and communities (Goal 11) through improved environmental monitoring. In addition, the information system helps mitigate the effects of climate change (Goal 13) by predicting pollutant concentrations and taking preventive measures. At the same time, it protects terrestrial ecosystems (Goal 15) by identifying critical ecological zones where intervention is needed. The developed information system is able to provide real-time air pollution monitoring and forecasting, help authorities respond to environmental challenges in a timely manner, and formulate strategies to improve the state of the environment. In addition, the developed information system can become a valuable tool for researchers working to improve the environmental situation.

Future research should focus on integrating real-time data streams and enhancing the system’s responsiveness for operational forecasting and alert systems. Incorporating diverse data sources, such as satellite remote sensing data, meteorological data, and traffic information, could significantly improve model accuracy and explanatory power. Exploring hybrid modeling approaches, combining statistical methods like BATS/ARIMA with machine learning models, may offer further advantages. Expanding the system’s capabilities to include source apportionment analysis would also be a valuable addition.

Author Contributions

Conceptualization, K.V. and A.P.; methodology, V.B.; software, K.V.; validation, K.V., A.P. and V.B.; formal analysis, V.S.; investigation, K.V.; resources, V.S.; data curation, Y.P.; writing—original draft preparation, S.S.; writing—review and editing, V.S.; visualization, K.V.; supervision, Y.P.; project administration, S.S.; funding acquisition, V.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to data privacy and security concerns related to specific municipal infrastructure and to ensure appropriate data use in accordance with agreements with local authorities.

Acknowledgments

We thank the anonymous reviewers for their valuable comments, which helped improve the paper’s content, quality, and organization. This study was supported by efforts to advance digitalization in air pollution detection for resource-constrained municipalities in Ukraine. The authors thank all contributors to the development and testing of the forecasting information system, particularly for their work with retrospective data from the Kremenchuk agglomeration. Special thanks to those advancing environmental monitoring practices in line with AAQD and EU standards.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Directive (EU) 2024/2881 of the European Parliament and of the Council of 11 April 2024 on Ambient Air Quality and Cleaner Air for Europe (Recast). EUR-Lex, 2025. Available online: https://eur-lex.europa.eu/eli/dir/2024/2881/oj (accessed on 19 March 2025).
Infinity Laboratories. Top 10 Reasons Environmental Monitoring Programs Go Wrong. Infinity Laboratories, 2024. Available online: https://infinitylaboratories.com/top-10-reasons-environmental-monitoring-programs-go-wrong/ (accessed on 19 March 2025).
Premier Science. Predicting Tomorrow: A Review of Machine Learning’s Role in Shaping Environmental Forecasts. premierscience.com, 2024. Available online: https://premierscience.com/wp-content/uploads/2024/10/pjs-24-261-3.pdf (accessed on 19 April 2025).
Saiwa. AI in Environmental Monitoring|Keeping Our Planet Healthy. saiwa.ai, 2025. Available online: https://saiwa.ai/sairone/blog/ai-in-environmental-monitoring/ (accessed on 19 March 2025).
Dai, Z.; Ravichandran, P.; Fazelnia, G.; Carterette, B.; Lalmas-Roelleke, M. Model Selection for Production System via Automated Online Experiments. Adv. Neural Inf. Process. Syst. 2020, 33, 1–12. [Google Scholar] [CrossRef]
Khurana, U.; Samulowitz, H. Autonomous Predictive Modeling via Reinforcement Learning. In Proceedings of the CIKM ‘20: The 29th ACM International Conference on Information and Knowledge Management, Online, 19–23 October 2020. [Google Scholar]
Laredo, D.; Qin, Y.; Schütze, O.; Sun, J.-Q. Automatic Model Selection for Neural Networks. arXiv 2019, arXiv:1905.06010. [Google Scholar] [CrossRef]
Clark, J.S.; Carpenter, S.R.; Barber, M.; Collins, S.; Dobson, A.; Foley, J.A.; Lodge, D.M.; Pascual, M.; Pielke, R.; Pizer, W.; et al. Ecological Forecasts: An Emerging Imperative. Science 2001, 293, 657–660. [Google Scholar] [CrossRef]
Grimm, V.; Railsback, S.F. Individual-Based Modeling and Ecology; Series: Princeton Series in Theoretical and Computational Biology; Princeton University Press: Princeton, NJ, USA, 2005; ISBN 978-069-109-666-7. [Google Scholar]
Urban, D.L.; Keitt, T.H. Landscape connectivity: A graph-theoretic perspective. Ecology 2001, 82, 1205–1218. [Google Scholar] [CrossRef]
Cosme, M.; Thomas, C.; Gaucherel, C. On the History of Ecosystem Dynamical Modeling: The Rise and Promises of Qualitative Models. Entropy 2023, 25, 1526. [Google Scholar] [CrossRef]
Bayatvarkeshi, M.; Zhang, B.; Fasihi, R.; Adnan, R.M.; Kisi, O.; Yuan, X. Investigation into the Effects of Climate Change on Reference Evapotranspiration Using the HadCM3 and LARS-WG. Water 2020, 12, 666. [Google Scholar] [CrossRef]
Provenzale, A. Climate models. Rend. Lincei 2013, 25, 49–58. [Google Scholar] [CrossRef]
Jia, K.; Ruan, Y.; Yang, Y.; Zhang, C. Assessing the Performance of CMIP5 Global Climate Models for Simulating Future Precipitation Change in the Tibetan Plateau. Water 2019, 11, 1771. [Google Scholar] [CrossRef]
Zhao, T.B.; Dai, A.G. CMIP6 Model−Projected Hydroclimatic and Drought Changes and Their Causes in the Twenty−First Century. J. Clim. 2022, 35, 897–921. [Google Scholar] [CrossRef]
Johns, T.C.; Gregory, J.M.; Ingram, W.J.; Johnson, C.E.; Jones, A.; Lowe, J.A.; Mitchell, J.F.B.; Roberts, D.L.; Sexton, D.M.H.; Stevenson, D.S.; et al. Anthropogenic climate change for 1860 to 2100 simulated with the HadCM3 model under updated emissions scenarios. Clim. Dyn. 2003, 20, 583–612. [Google Scholar] [CrossRef]
Turner, J.; Connolley, W.M.; Lachlan-Cope, T.A.; Marshall, G.J. The performance of the Hadley Centre Climate Model (HadCM3) in high southern latitudes. Int. J. Climatol. 2006, 26, 91–112. [Google Scholar] [CrossRef]
Rajar, R.; Četina, M. Hydrodynamic models as a basis for water quality modelling: A review. Trans. Ecol. Environ. 1995, 7, 199–211. [Google Scholar]
Iglesias, I.; Bio, A.; Melo, W.; Avilez-Valente, P.; Pinho, J.; Cruz, M.; Gomes, A.; Vieira, J.; Bastos, L.; Veloso-Gomes, F. Hydrodynamic Model Ensembles for Climate Change Projections in Estuarine Regions. Water 2022, 14, 1966. [Google Scholar] [CrossRef]
Nandalal, K.D.W. Use of a hydrodynamic model to forecast floods of Kalu River in Sri Lanka. J. Flood Risk Manag. 2009, 2, 151–158. [Google Scholar] [CrossRef]
Idzelytė, R.; Čerkasova, N.; Mėžinė, J.; Dabulevičienė, T.; Razinkovas-Baziukas, A.; Ertürk, A.S.; Umgiesser, G. Coupled hydrological and hydrodynamic modelling application for climate change impact assessment in the Nemunas river watershed–Curonian Lagoon–southeastern Baltic Sea continuum. Ocean Sci. 2023, 19, 1047–1066. [Google Scholar] [CrossRef]
Arnold, J.G.; Srinivasan, R.; Muttiah, R.S.; Williams, J.R. Large area hydrologic modeling and assessment part I: Model development. Am. J. Water Resour. 1998, 34, 73–89. [Google Scholar] [CrossRef]
MIKE 21 Flow Model: Hydrodynamic and Transport Module—Scientific Documentation. Available online: https://www.dhigroup.com/upload/dhisoftwarearchive/shortdescriptions/marine/HydrodynamicModuleHD.pdf (accessed on 15 March 2025).
Seidenfaden, I.; Sonnenborg, T.; Refsgaard, J.; Trolle, D.; Børgesen, C.; Olesen, J.; Jeppesen, E.; Jensen, K. Combined effects of climate models, hydrological model structures and land use scenarios on hydrological impacts of climate change. J. Hydrol. 2016, 535, 301–317.
Finizio, M.; Pontieri, F.; Bottaro, C.; Di Febbraro, M.; Innangi, M.; Sona, G.; Carranza, M.L. Remote Sensing for Urban Biodiversity: A Review and Meta-Analysis. Remote Sens. 2024, 16, 4483.
Rocchini, D.; Féret, J.-B.; Papuga, G. Coupling in situ and remote sensing data to assess α- and β-diversity over biogeographic gradients. arXiv 2024, arXiv:2404.18485.
González, A. GIS in Environmental Assessment: A Review of Current Issues and Future Needs. J. Environ. Assess. Policy Manag. 2012, 14, 250007.
Roudgarmi, P.; Monavari, S.; Feghhi, J.; Nouri, J.; Khorasani, N. Environmental impact prediction using remote sensing images. J. Zhejiang Univ. 2008, 9, 381–390.
Hernandez-Martinez, A.R. Remote sensing for environmental analysis: Basic concepts and setup. In Green Sustainable Process for Chemical and Environmental Engineering and Science; Inamuddin, R.B., Abdullah, M.A., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; pp. 209–224.
Madhavi, M.; Kolikipogu, R.; Prabakar, S. Experimental Evaluation of Remote Sensing–Based Climate Change Prediction Using Enhanced Deep Learning Strategy. Remote Sens. Earth Syst. Sci. 2024, 7, 642–656.
Zhang, H.; Lin, H.; Zhang, Y.; Weng, Q. Remote Sensing of Impervious Surfaces in Tropical and Subtropical Areas; CRC Press: Boca Raton, FL, USA, 2015; ISBN 978-042-915-933-6.
Weng, Q.; Fu, P.; Gao, F. Generating daily land surface temperature at Landsat resolution by fusing Landsat and MODIS data. Remote Sens. Environ. 2014, 145, 55–67.
Verburg, P.H.; Overmars, K.P. Combining top-down and bottom-up dynamics in land use modeling: Exploring the future of abandoned farmlands in Europe with the Dyna-CLUE model. Landsc. Ecol. 2009, 24, 1167–1181.
Calka, B.; Szostak, M. GIS-Based Environmental Monitoring and Analysis. Appl. Sci. 2025, 15, 3155.
Linh, N.H.K.; Pham, T.G.; Pham, T.H.; Tran, C.T.M.; Nguyen, T.Q.; Ha, N.T.; Ngoc, N.B. Land-Use and Land-Cover Changes and Urban Expansion in Central Vietnam: A Case Study in Hue City. Urban Sci. 2024, 8, 242.
Ennouri, K.; Smaoui, S.; Triki, M.A. Detection of Urban and Environmental Changes via Remote Sensing. Circ. Econ. Sustain. 2021, 1, 1423–1437.
Dutta, J.; Medhi, S.; Gogoi, M.; Borgohain, L.; Maboud, N.; Muhameed, H. Application of Remote Sensing and GIS in Environmental Monitoring and Management. In Remote Sensing and GIS Techniques in Hydrology; Batchi, M., Moumane, A., Eds.; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 1–34.
SAS Institute. Predictive Modelling, Analytics and Machine Learning. Available online: https://www.sas.com/en_gb/insights/articles/analytics/a-guide-to-predictive-analytics-and-machine-learning.html (accessed on 15 March 2025).
Forbytes. Top 6 Machine Learning Techniques for Predictive Modeling and Data Analysis Explained. Available online: https://forbytes.com/blog/main-machine-learning-techniques/ (accessed on 15 March 2025).
Stefanini. Machine Learning Models for Precise Predictive Analytics. Available online: https://stefanini.com/en/insights/news/machine-learning-models-for-precise-predictive-analytics (accessed on 15 March 2025).
Marinov, E.; Petrova-Antonova, D.; Malinov, S. Time Series Forecasting of Air Quality: A Case Study of Sofia City. Atmosphere 2022, 13, 788.
Gocheva-Ilieva, S.; Ivanov, A.; Voynikova, D.; Boyadzhiev, D. Time series analysis and forecasting for air pollution in small urban area: An SARIMA and factor analysis approach. Stoch. Environ. Res. Risk Assess. 2014, 28, 1045–1060.
Liu, H.; Yan, G.; Duan, Z.; Chen, C. Intelligent modeling strategies for forecasting air quality time series: A review. Appl. Soft Comput. 2021, 102, 1045–1060.
Kaur, J.; Parmar, K.S.; Singh, S. Autoregressive models in environmental forecasting time series: A theoretical and application review. Environ. Sci. Pollut. Res. 2023, 30, 19617–19641.
Ray, M.; Sahoo, K.; Abotaleb, M.; Ray, S.; Sahu, P.; Mishra, P.; Al Khatib, A.; Khatib, A.; Das, S.S.; Vikas, J.; et al. Modeling and forecasting meteorological factors using BATS and TBATS models for the Keonjhar district of Orissa. Mausam 2022, 73, 555–564.
Vira, S.; Yuliia, P.; Sergii, T.; Yevhen, K.; Yaroslava, B. Modeling techniques of electricity consumption forecasting. AIP Conf. Proc. 2022, 2570, 030004.
Zavalieiev, A.; Vadurin, K.; Perekrest, A.; Bakharev, V. Information and analytical system for collecting, processing and analyzing data on air pollution. Autom. Technol. Bus. Process. 2024, 16, 72–82.
Vadurin, K.; Perekrest, A.; Bakharev, V.; Deriyenko, A.; Ivashchenko, A.; Shkarupa, S. An information system for collecting and storing air quality data from municipal level Vaisala stations. Infocommun. Comput. Technol. 2023, 2, 38–49.
Vadurin, K.; Perekrest, A.; Bakharev, V. Development of a method of automatic reporting on the number of exceedances of the established standards of atmospheric air markers. Infocommun. Comput. Technol. 2023, 2, 50–59.
Perekrest, A.; Mamchur, D.; Zavaleev, A.; Vadurin, K.; Malolitko, V.; Bakharev, V. Web-based technology of intellectual analysis of environmental data of an industrial enterprise. In Proceedings of the 2023 IEEE 5th International Conference on Modern Electrical and Energy System (MEES), Kremenchuk, Ukraine, 27–30 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–7.
Moussa, H.; Abboud, M. The Methodology of Applying Inverse Distance Weighting Interpolation Method in Determining Normal Heights. Resourceedings 2024, 4, 1–6.
Chao, C.; Min, B.-W. Correlation Analysis of Atmospheric Pollutants and Meteorological Factors Based on Environmental Big Data. Int. J. Contents 2022, 18, 17–26.
Chen, F.-W.; Liu, C.-W. Estimation of the spatial rainfall distribution using inverse distance weighting (IDW) in the middle of Taiwan. Paddy Water Environ. 2012, 3, 209–222.
Chodakowska, E.; Nazarko, J.; Nazarko, Ł. ARIMA Models in Electrical Load Forecasting and Their Robustness to Noise. Energies 2021, 14, 7952.
Daugėla, I.; Sužiedelytė-Visockienė, J.; Skeivalas, J. Analysis of Air Pollution Parameters Using Covariance Function Theory. Ecol. Chem. Eng. 2020, 27, 555–565.
Choi, K.; Chong, K. Modified Inverse Distance Weighting Interpolation for Particulate Matter Estimation and Mapping. Atmosphere 2022, 13, 846.
de Mesnard, L. Pollution models and inverse distance weighting: Some critical remarks. Comput. Geosci. 2013, 52, 459–469.
Liu, Y.; Wen, L.; Lin, Z. Air quality historical correlation model based on time series. Sci. Rep. 2024, 14, 22791.
Shendryk, V.; Parfenenko, Y.; Nenja, V.; Vashchenko, S. Information System for Monitoring and Forecast of Building Heat Consumption. Commun. Comput. Inf. Sci. 2014, 465, 1–11.

Figure 1. The general structure of the information system.

Figure 2. Use case diagram.

Figure 3. UML class diagram of the environmental monitoring subject area. The numbers (‘1’) denote a multiplicity of ‘one’ and the asterisk (‘*’) denotes a multiplicity of ‘many’.

Figure 4. The diagram that specifies the interaction between the main system modules.

Figure 5. Sequence diagram of the «Data loading» scenario.

Figure 6. Sequence diagram of the «Data Processing» scenario.

Figure 7. Sequence diagram of the «Pollutant analysis» scenario.

Figure 8. Sequence diagram of the «Pollution level forecasting» scenario.

Figure 9. Sequence diagram of the «Data visualization» scenario.

Figure 10. Sequence diagram of the «MPC exceedance check» scenario.

Figure 11. Sequence diagram of the «Report generation» scenario.

Figure 12. Graphical representation of annual dust concentration in the air for the city of Kremenchuk during the period 2007–2024.

Figure 13. Spatial interpolation of dust concentration in multiples of MPC for 1 March in the city of Kremenchuk.

Figure 14. Visualization of consecutive exceedances of dust concentration.

Figure 15. Correlation matrix of MPC air pollutant concentrations.

Figure 16. Time series decomposition of monthly average dust concentrations in Kremenchuk (2007–2024), expressed in multiples of the MPC. (a) Original monthly average data (blue) overlaid with the extracted long-term trend component (orange). (b) Isolated long-term trend component. (c) Extracted seasonality component showing average yearly cycles. (d) Remainder component representing random fluctuations after accounting for trend and seasonality.

Figure 17. Historical (2007–2024, solid lines) and forecasted (2024–2026, dashed lines) monthly average formaldehyde concentrations at four monitoring stations in Kremenchuk, expressed in multiples of the MPC. Forecasts were generated using ARIMA/BATS models with a 24-month horizon. The stations shown are: 9 Molodizhna Street (Actual: Blue, Forecast: Dark Blue Dashed, MSE = 0.95); Doctor O. Bohayevskyi Street (Actual: Orange, Forecast: Purple Dashed, MSE = 0.62); 22/30 Shevchenko Street (Actual: Green, Forecast: Pink Dashed, MSE = 0.78); 89 I. Prykhodka Street (Actual: Red, Forecast: Brown Dashed, MSE = 0.55).

Figure 18. Visualization of the forecasted spatial distribution of formaldehyde as of 31 May 2025.

Figure 19. Colored dots represent the predicted spatial distribution of the 10% of interpolated virtual stations with the lowest pollution levels, using formaldehyde as an example as of 31 May 2025. Red dots indicate the actual locations of the stations whose data were used for prediction and spatial interpolation.
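
The spatial maps in Figures 13, 18 and 19 rely on inverse distance weighting (IDW). As a minimal sketch of that idea, and not the authors' implementation, the Python snippet below estimates concentrations at grid points ("virtual stations") from a handful of stations and then picks the 10% of grid points with the lowest estimates. The station coordinates and values, the grid, the power parameter of 2, and the names `idw_estimate`, `stations`, and `grid` are all illustrative assumptions.

```python
import math


def idw_estimate(stations, target, power=2.0):
    """Estimate the value at `target` from (x, y, value) station tuples.

    Each station is weighted by 1 / distance**power; power=2 is an assumed,
    commonly used default, not a value taken from the article.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations: planar coordinates in km, values in multiples of MPC.
stations = [(0.0, 0.0, 0.8), (4.0, 0.0, 1.6), (0.0, 3.0, 1.2), (4.0, 3.0, 0.4)]

# "Virtual stations" on a regular 5 x 4 grid covering the same area.
grid = [(float(gx), float(gy)) for gx in range(5) for gy in range(4)]
estimates = {point: idw_estimate(stations, point) for point in grid}

# The 10% of virtual stations with the lowest estimated pollution level.
ranked = sorted(estimates, key=estimates.get)
lowest = ranked[: max(1, len(ranked) // 10)]

# Grid points whose estimate exceeds the MPC (value above 1.0).
exceeding = [point for point, value in estimates.items() if value > 1.0]
```

Because concentrations are expressed in multiples of the MPC, any estimate above 1.0 marks a location where the interpolated concentration exceeds the permissible level.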

Table 1. Recommended Python libraries and tools for the environmental monitoring forecasting system.

Library/Tool	Version	Purpose/Functionality
Pandas	2.2.2	Working with data tables: filtering, transformation, aggregation, and handling time series data (dates).
NumPy	1.26.4	Performing fast mathematical calculations on data arrays, speeding up data preparation for analysis.
Matplotlib	3.9.2	Basic data visualization for analyzing changes in pollutant concentrations through graphs.
Seaborn	0.13.2	Advanced data visualization, simplifying the construction of complex statistical graphs for comparing data between different cities and time periods to identify trends and anomalies.
Scikit-learn	1.5.2	Providing a wide range of machine learning tools, including model evaluation metrics (e.g., mean squared error) for forecasting environmental parameters.
Sktime	0.35.0	Specializing in time series analysis, allowing for predicting future changes in pollutant concentrations and enabling automatic selection of the best forecasting model.
Geopandas	1.0.1	Processing geospatial data, including the coordinates of monitoring posts, building geometric shapes, and calculating distances between points.
Shapely	2.0.6	Working with geometric shapes for geospatial analysis.
Startinpy	0.11.0	Providing triangulation and spatial interpolation capabilities, used in conjunction with Geopandas and Shapely to create interactive pollution maps.
Statsmodels	0.14.4	Building statistical models, such as ARIMA, suitable for forecasting changes in environmental indicators with high accuracy and ease of integration.
IPython	8.26.0	Providing an interactive computing environment, particularly within Jupyter Notebooks.
ipywidgets	8.1.3	Enabling the creation of dynamic interface elements (e.g., sliders) in Jupyter Notebooks for user interaction with data, such as selecting parameters or date ranges.

Table 2. MSE values for air pollution forecasts in multiples of MPC.

Parameter	MSE_BATS	MSE_ARIMA
Dust	2.418199 × 10⁻²	3.543434 × 10⁻²
Sulfur dioxide	1.568705 × 10⁻³	1.229420 × 10⁻³
Carbon monoxide	6.157435 × 10⁻³	1.255275 × 10⁻²
Nitrogen dioxide	6.762210 × 10⁻³	1.347585 × 10⁻²
Nitric oxide	1.111873 × 10⁻²	1.108849 × 10⁻²
Formaldehyde	1.474393	7.798375
Ammonia	1.600029 × 10⁻³	1.691275 × 10⁻³
Phenol	1.357639 × 10⁻²	7.497768 × 10⁻³
Soot	6.854720 × 10⁻⁴	4.431859 × 10⁻⁴
Benzene	6.208337 × 10⁻⁴	2.733791 × 10⁻⁵
Toluene	2.995573 × 10⁻⁴⁴	6.584821 × 10⁻³²
Ethylbenzene	3.084536 × 10⁻¹⁷	1.338575 × 10⁻²¹
Sum of m,p-xylenes and o-xylene	1.190219 × 10⁻¹⁴	9.798120 × 10⁻¹⁵
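
Table 2 can be read as the input to the automatic model selection mentioned in the abstract: for each pollutant, the model with the lower MSE is kept. The following sketch assumes that the selection rule is simply "lowest MSE wins"; the names `mse_table` and `select_model` are hypothetical, and the numbers are copied from Table 2.

```python
# MSE values copied from Table 2 (forecast errors in multiples of MPC).
mse_table = {
    "Dust": {"BATS": 2.418199e-2, "ARIMA": 3.543434e-2},
    "Sulfur dioxide": {"BATS": 1.568705e-3, "ARIMA": 1.229420e-3},
    "Carbon monoxide": {"BATS": 6.157435e-3, "ARIMA": 1.255275e-2},
    "Nitrogen dioxide": {"BATS": 6.762210e-3, "ARIMA": 1.347585e-2},
    "Nitric oxide": {"BATS": 1.111873e-2, "ARIMA": 1.108849e-2},
    "Formaldehyde": {"BATS": 1.474393, "ARIMA": 7.798375},
    "Ammonia": {"BATS": 1.600029e-3, "ARIMA": 1.691275e-3},
    "Phenol": {"BATS": 1.357639e-2, "ARIMA": 7.497768e-3},
    "Soot": {"BATS": 6.854720e-4, "ARIMA": 4.431859e-4},
    "Benzene": {"BATS": 6.208337e-4, "ARIMA": 2.733791e-5},
    "Toluene": {"BATS": 2.995573e-44, "ARIMA": 6.584821e-32},
    "Ethylbenzene": {"BATS": 3.084536e-17, "ARIMA": 1.338575e-21},
    "Sum of m,p-xylenes and o-xylene": {"BATS": 1.190219e-14, "ARIMA": 9.798120e-15},
}


def select_model(mse_by_model):
    """Return the name of the model with the lowest MSE (assumed rule)."""
    return min(mse_by_model, key=mse_by_model.get)


# Pick one model per pollutant and list where each model is preferred.
selected = {pollutant: select_model(scores) for pollutant, scores in mse_table.items()}
bats_preferred = [p for p, model in selected.items() if model == "BATS"]
arima_preferred = [p for p, model in selected.items() if model == "ARIMA"]

for pollutant, model in selected.items():
    print(f"{pollutant}: {model}")
```

With these values, BATS is selected for 6 of the 13 parameters (dust, carbon monoxide, nitrogen dioxide, formaldehyde, ammonia, and toluene) and ARIMA for the remaining 7.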

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

